As I mentioned in my introductory blog entry, I have already set up a working nutch installation and crawled/indexed some documents.
Now I have a different question: how can I evolve a corpus over time? Basically I want to start with a group of seed URLs and do a nutch crawl. There are two methodologies I know of so far, an “intranet crawl” and a “whole web crawl”, and I’m not sure which one I want. The first uses the “nutch crawl” command:
Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]
The “whole web crawl” breaks that down to its constituent steps; here’s one I did:
nutch inject crawl/crawldb seed
nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
nutch fetch2 $s1
nutch updatedb crawl/crawldb $s1
nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
nutch fetch2 $s2
nutch updatedb crawl/crawldb $s2
nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
nutch fetch2 $s3
nutch updatedb crawl/crawldb $s3
nutch invertlinks crawl/linkdb -dir crawl/segments
nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
The essence of the above is:
- inject
- loop on these:
  - generate
  - fetch2
  - updatedb
- invertlinks
- index
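The generate/fetch2/updatedb rounds can be folded into a loop. This is only a sketch under my own assumptions: the `NUTCH` variable, the `crawl_rounds` function and the dry-run default (which prints the commands instead of running them) are names I made up for illustration, not part of the nutch distribution.

```shell
#!/bin/sh
# Dry run by default: prints each nutch command instead of executing it.
# Set NUTCH=nutch (or the path to bin/nutch) to run for real.
NUTCH=${NUTCH:-"echo nutch"}

# crawl_rounds ROUNDS TOPN [CRAWL_DIR]  (hypothetical helper)
crawl_rounds() {
  rounds=$1
  topn=$2
  crawl=${3:-crawl}
  i=1
  while [ "$i" -le "$rounds" ]; do
    $NUTCH generate "$crawl/crawldb" "$crawl/segments" -topN "$topn" || return 1
    # Segment directories are named by timestamp, so the last one listed is the newest.
    segment=$(ls -d "$crawl"/segments/2* 2>/dev/null | tail -1)
    # Placeholder path so a dry run without any segments still prints something readable.
    [ -n "$segment" ] || segment="$crawl/segments/NO_SEGMENT_FOUND"
    $NUTCH fetch2 "$segment" || return 1
    $NUTCH updatedb "$crawl/crawldb" "$segment" || return 1
    i=$((i + 1))
  done
}

crawl_rounds 2 1000
```

The inject step before the loop and the invertlinks and index steps after it stay as single commands, exactly as in the list.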
So far I’ve built a corpus of several thousand documents. How should I add to it?
To be clear, I am conflating two issues a bit: recrawling and merging are two separate operations. Recrawling goes through the existing pages and updates them. The wiki has a recrawl script, which unfortunately has not been updated for version 0.9; whether it still works with 0.9 isn’t clear. Merging, by contrast, combines two (usually? mostly?) disjoint sets of documents and their attendant indexes. The MergeCrawl script for this is detailed on the wiki, again only through version 0.8. Neither script is in the distribution. Why is that?
Neither “recrawl” nor “merge” is mentioned in the nutch tutorial.
I did a search for “merge” on nutch-user; I also did a search for “recrawl”. Then I followed a few of the threads:
“Nutch Crawl Vs. Merge Time Complexity” (Mar 2006) asks:
I’m using Nutch v0.7 and I’ve been running nutch on our company unix system and it was setup to crawl our intranet sites for updates daily, I’ve tried using the Merge, dedup, updatedb, and etc…I’d notice the time complexity and efficiency was less productive than doing a fresh new crawl. For example if I have two separate crawls from two different domains such as hotmail and yahoo, what would the time complexity for nutch to crawl this two domains and then do a merge compare to just doing a single full crawl of both domains? My guess would be that it will take nutch the same amount of times to do either one, if that is so is there a reason to use the Merge at all?
“Incremental indexing” (Jun 2007) asks:
As the size of my data keeps growing, and the indexing time grows even faster, I’m trying to switch from a “reindex all at every crawl” model to an incremental indexing one. I intend to keep the segments separate, but I want to index only the segment fetched during the last cycle, and then merge indexes and perhaps linkdb. I have a few questions:
1. In an incremental scenario, how do I remove from the indexes references to segments that have expired??
2. Looking at http://wiki.apache.org/nutch/MergeCrawl , it would appear that I can call “bin/nutch merge” with only two parameters: the original index directory as destination, and the directory to be merged in the former:
$nutch_dir/nutch merge $index_dir $new_indexes
But when I do that, the merged data are left in a subdirectory called $index_dir/merge_output . Shouldn’t I instead create a new empty destination directory, do the merge, and then replace the original with the newly merged directory:
merged_indexes=$crawl_dir/merged_indexes
rm -rf $merged_indexes # just in case it's already there
$nutch_dir/nutch merge $merged_indexes $index_dir $new_indexes
rm -rf $index_dir.old # just in case it's already there
mv $index_dir $index_dir.old
mv $merged_indexes $index_dir
rm -rf $index_dir.old
3. Regarding linkdb, does running “$nutch_dir/nutch invertlinks” on the latest segment only, and then merging the newly obtained linkdb with the current one with “$nutch_dir/nutch mergelinkdb”, make sense rather than recreating linkdb afresh from the whole set of segments every time? In other words, can invertlinks work incrementally, or does it need to have a view of all segments in order to work correctly?
“Recrawl URLS” (Aug 2006) has a discussion between two people:
Q: I was searching for the method to add new url to the crawling url list and how to recrawl all urls…
A: You could use the command bin/nutch inject $nutch-dir/db -urlfile urlfile.txt. To recrawl your WebDB you can use this script. http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html [That's the same as the recrawl script from the wiki. --Kai] Take a look to the adddays argument and to the configuration property db.default.fetch.interval. They influence to the result.
Q: I have another question, I done what you give me… But it inject the new urls and “recrawl” it, but against the first crawl It doesn’t download the web pages and really crawl them… perhaps I’m mistaking somewhere… Any idea ?
A: In the nutch conf/nutch-default.xml configuration file exist a property call db.default.fetch.interval. When you crawl a site, nutch schedules the next fetch to “today + db.default.fetch.interval” days. If you execute the recrawl command and the pages that you fetch don’t reach this date, they won’t be re-fetched. When you add new urls to the webdb, they will be ready to be fetch. So at this moment only this pages will be fetched by the recrawl script.
Q: But the websites just added hasn’t been yet crawled… And they’re not crawled during recrawl… Does “bin/nutch purge” will restart all ?
A: This command “bin/nutch purge” doesn’t exist. Well I can’t say you what is happening. Give me the output when you run the recrawl.
I found that a bit inconclusive. Points of interest:
$ nutch inject
Usage: Injector <crawldb> <url_dir>
The above is the usage printed for “nutch inject” on the command line. And now from nutch-default.xml:
<property>
<name>db.default.fetch.interval</name>
<value>30</value>
<description>(DEPRECATED) The default number of days between re-fetches of a page.
</description>
</property>
Ok, great, that’s deprecated. I really need some current documentation!
“Recrawling… Methodology?” (Jul 2006) asks:
I need some help clarifying if recrawling is doing exactly what I think it is. Here’s the current scenario of how I think a recrawl should work:
I crawl my intranet with a depth of 2. Later, I recrawl using the script found below: http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03 [the standard script –Kai]
In my recrawl, I also specify a depth of 2. It reindexes each of the pages before, and if they have changed update the pages content. If they have changed and new links exist, the links are followed to a maximum depth of 2.
This is how I think a typical recrawl should work. However, when I recrawl using the script linked to above, tons of new pages are indexed, whether they have changed or not. It seems as if I crawl the content with a depth of 2, and then come back and recrawl with a depth of 2, it really adds a couple of crawl depth levels and the outcome is that I have done a crawl with a depth of 4 (instead of crawl with a depth of 2 and then just a recrawl to catch any new pages).
The current steps of the recrawl are as follows:
for (how many depth levels specified)
$nutch_dir/nutch generate $webdb_dir $segments_dir -adddays $adddays
segment=`ls -d $segments_dir/* | tail -1`
$nutch_dir/nutch fetch $segment
$nutch_dir/nutch updatedb $webdb_dir $segment
- invertlinks
- index
- dedup
- merge
Basically what made me wonder is that it took me 2 minutes to do the crawl. It’s taken me over 3 hours and still going to do the recrawl (same depth levels specified). After I recrawl once, I believe it then speeds up.
I don’t know if that guy ever fixed his problem. He was doing the same thing as I am, except that he started with an “intranet crawl” and built on it (I deleted my initial “intranet crawl” and recrawled incrementally).
I’m not sure if repetition will help me, but here’s another description of how crawl works – “Re: How to recrawl urls” (Dec 2005):
The scheme of intranet crawling is like this: Firstly, you create a webdb using WebDBAdminTool. After that, you fetch a seed URL using WebDBInjector. The seed URL is inserted into your webdb, marked by current date and time. Then, you create a fetch list using FetchListTool. The FetchListTool read all URLs in the webdb which are due to crawl, and put them to the fetchlist. Next, the Fetcher crawls all URLs in the fetchlist. Finally, once crawling is finished, UpdateDatabaseTool extracts all outlinks and put them to webdb. Newly extracted outlinks are set date and time to current date and time, while all just-crawled URLs date and time are set to next 30 days (these things happen actually in FetchListTool). So all extracted links will be crawled for the next time, but not the just-crawled URLs. So on and soforth.
Therefore, once the crawler is still alive after 30 days (or the threshold that you set), all “just-crawled” urls will be taken out to recrawl. That’s why we need to maintain a live crawler at that time. This could be done using cron job, I think.
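The scheduling rule that keeps coming up in these threads (a page becomes due again once the fetch interval has passed since its last fetch, and -adddays makes generate behave as if the clock were that many days ahead) can be illustrated with plain day arithmetic. This is a sketch under my own assumptions: `is_due` is a hypothetical helper, days are plain integers, and nothing here reflects how nutch actually stores fetch times.

```shell
#!/bin/sh
# is_due LAST_FETCH_DAY INTERVAL_DAYS TODAY [ADDDAYS]  (hypothetical helper)
# Succeeds (exit status 0) if a page last fetched on LAST_FETCH_DAY is due again.
is_due() {
  last=$1
  interval=$2
  today=$3
  adddays=${4:-0}
  [ $((today + adddays)) -ge $((last + interval)) ]
}

# A page fetched on day 0 with a 30-day interval:
if is_due 0 30 10; then echo "day 10: due"; else echo "day 10: not due"; fi
if is_due 0 30 10 25; then echo "day 10, adddays 25: due"; else echo "day 10, adddays 25: not due"; fi
if is_due 0 30 30; then echo "day 30: due"; else echo "day 30: not due"; fi
```

This would explain the behaviour in the “Recrawl URLS” thread: a recrawl soon after the first crawl finds nothing due except newly injected URLs, unless -adddays pushes the effective date past the interval.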
Slightly further into the above thread, Stefan Groschupf suggests: “do the steps manually as described here: SimpleMapReduceTutorial“; that tutorial, written by Earl Cahill in Oct 2005, has these steps (plus explanation):
cd nutch/branches/mapred
mkdir urls
echo "http://lucene.apache.org/nutch/" > urls/urls
perl -pi -e 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|' conf/crawl-urlfilter.txt
./bin/nutch crawl urls
CRAWLDB=`find crawl-2* -name crawldb`
SEGMENTS_DIR=`find crawl-2* -maxdepth 1 -name segments`
./bin/nutch generate $CRAWLDB $SEGMENTS_DIR
SEGMENT=`find crawl-2*/segments/2* -maxdepth 0 | tail -1`
./bin/nutch fetch $SEGMENT
./bin/nutch updatedb $CRAWLDB $SEGMENT
LINKDB=`find crawl-2* -name linkdb -maxdepth 1`
SEGMENTS=`find crawl-2* -name segments -maxdepth 1`
./bin/nutch invertlinks $LINKDB $SEGMENTS
mkdir myindex
ls -alR myindex
Here’s a somewhat basic discussion on merging: “Problem with merge-output” (Jun 2007)
Q: After recrawl several times, I have problem with the directory: merge-output. I have digged into mail archive and found some clue: you should use a new dir name for the new merge, e.g., merge-output_new, then mv merge-output_new to merge-output.
A: This is something I usually do:-
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
rm -rf crawl/segments/*
mv crawl/MERGEDsegments/* crawl/segments
You might want to replace the second statement with a ‘mv’ statement to backup the segments.
Here’s another: “Simple question about the merge tool” (Jul 2005):
Q: I have a simple question about how to use the merge tool. I’ve done three small crawls resulting in three small segment directories. How can I merge these into one directory with one index? I notice the merge command options:
Usage: IndexMerger (-local | -ndfs <nameserver:port>) [-workingdir <workingdir>] outputIndex segments...
I don’t really understand what it’s doing with the outputIndex and the segments. Will this automatically delete segments after merging them into the output?
A: Use the bin/nutch mergesegs to merge many segments into one.
Mergesegs prints its usage when run as “nutch mergesegs” with no arguments, and I’m curious about the other merge commands as well. Here’s a console session showing the usage for each:
$ nutch | grep merg
mergedb merge crawldb-s, with optional filtering
mergesegs merge several segments, with optional filtering and slicing
mergelinkdb merge linkdb-s, with optional filtering
merge merge several segment indexes
$ nutch mergedb
Usage: CrawlDbMerger <output_crawldb> <crawldb1> [<crawldb2> <crawldb3> ...] [-normalize] [-filter]
output_crawldb output CrawlDb
crawldb1 ... input CrawlDb-s (single input CrawlDb is ok)
-normalize use URLNormalizer on urls in the crawldb(s) (usually not needed)
-filter use URLFilters on urls in the crawldb(s)
$ nutch mergesegs
SegmentMerger output_dir (-dir segments | seg1 seg2 ...) [-filter] [-slice NNNN]
output_dir name of the parent dir for output segment slice(s)
-dir segments parent dir containing several segments
seg1 seg2 ... list of segment dirs
-filter filter out URL-s prohibited by current URLFilters
-slice NNNN create many output segments, each containing NNNN URLs
$ nutch mergelinkdb
Usage: LinkDbMerger <output_linkdb> <linkdb1> [<linkdb2> <linkdb3> ...] [-normalize] [-filter]
output_linkdb output LinkDb
linkdb1 ... input LinkDb-s (single input LinkDb is ok)
-normalize use URLNormalizer on both fromUrls and toUrls in linkdb(s) (usually not needed)
-filter use URLFilters on both fromUrls and toUrls in linkdb(s)
$ nutch merge
Usage: IndexMerger [-workingdir <workingdir>] outputIndex indexesDir...
Ah: the nutch javadoc has some comments on each of the above classes:
CrawlDbMerger - “nutch mergedb” – see also mergedb wiki
org.apache.nutch.crawl
Class CrawlDbMerger
java.lang.Object
  org.apache.hadoop.util.ToolBase
    org.apache.nutch.crawl.CrawlDbMerger
All Implemented Interfaces: Configurable, Tool
public class CrawlDbMerger extends ToolBase
This tool merges several CrawlDb-s into one, optionally filtering URLs through the current URLFilters, to skip prohibited pages.
It’s possible to use this tool just for filtering – in that case only one CrawlDb should be specified in arguments.
If more than one CrawlDb contains information about the same URL, only the most recent version is retained, as determined by the value of CrawlDatum.getFetchTime(). However, all metadata information from all versions is accumulated, with newer values taking precedence over older values.
Author: Andrzej Bialecki
SegmentMerger - “nutch mergesegs” – see also mergesegs wiki
org.apache.nutch.segment
Class SegmentMerger
java.lang.Object
  org.apache.hadoop.conf.Configured
    org.apache.nutch.segment.SegmentMerger
All Implemented Interfaces: Configurable, Closeable, JobConfigurable, Mapper, Reducer
public class SegmentMerger extends Configured implements Mapper, Reducer
This tool takes several segments and merges their data together. Only the latest versions of data is retained.
Optionally, you can apply current URLFilters to remove prohibited URL-s.
Also, it’s possible to slice the resulting segment into chunks of fixed size.
Important Notes
Which parts are merged?
It doesn’t make sense to merge data from segments, which are at different stages of processing (e.g. one unfetched segment, one fetched but not parsed, and one fetched and parsed). Therefore, prior to merging, the tool will determine the lowest common set of input data, and only this data will be merged. This may have some unintended consequences: e.g. if majority of input segments are fetched and parsed, but one of them is unfetched, the tool will fall back to just merging fetchlists, and it will skip all other data from all segments.
Merging fetchlists
Merging segments, which contain just fetchlists (i.e. prior to fetching) is not recommended, because this tool (unlike the Generator) doesn’t ensure that fetchlist parts for each map task are disjoint.
Duplicate content
Merging segments removes older content whenever possible (see below). However, this is NOT the same as de-duplication, which in addition removes identical content found at different URL-s. In other words, running DeleteDuplicates is still necessary.
For some types of data (especially ParseText) it’s not possible to determine which version is really older. Therefore the tool always uses segment names as timestamps, for all types of input data. Segment names are compared in forward lexicographic order (0-9a-zA-Z), and data from segments with “higher” names will prevail. It follows then that it is extremely important that segments be named in an increasing lexicographic order as their creation time increases.
Merging and indexes
Merged segment gets a different name. Since Indexer embeds segment names in indexes, any indexes originally created for the input segments will NOT work with the merged segment. Newly created merged segment(s) need to be indexed afresh. This tool doesn’t use existing indexes in any way, so if you plan to merge segments you don’t have to index them prior to merging.
Author: Andrzej Bialecki
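The SegmentMerger note about lexicographic segment names is also why the scripts in this post can pick the newest segment with `ls ... | tail -1`: timestamp-style names sort in creation order. A small sketch of that idea follows; `newest_segment` is a hypothetical helper and the three segment names are invented examples, not real crawl output.

```shell
#!/bin/sh
# newest_segment: read segment names on stdin, print the one that sorts last.
# LC_ALL=C forces plain byte-wise ordering so the result does not depend on the locale.
newest_segment() {
  LC_ALL=C sort | tail -1
}

# Invented timestamp-style names (yyyyMMddHHmmss), deliberately out of order:
printf '%s\n' 20070712093015 20070630235959 20070713000001 | newest_segment
# prints 20070713000001
```

If segments were ever renamed by hand to something that does not sort by creation time, both this trick and the merger’s “higher name wins” rule would pick the wrong data.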
LinkDbMerger - “nutch mergelinkdb” – see also mergelinkdb wiki
org.apache.nutch.crawl
Class LinkDbMerger
java.lang.Object
  org.apache.hadoop.util.ToolBase
    org.apache.nutch.crawl.LinkDbMerger
All Implemented Interfaces: Configurable, Closeable, JobConfigurable, Reducer, Tool
This tool merges several LinkDb-s into one, optionally filtering URLs through the current URLFilters, to skip prohibited URLs and links.
It’s possible to use this tool just for filtering – in that case only one LinkDb should be specified in arguments.
If more than one LinkDb contains information about the same URL, all inlinks are accumulated, but only at most db.max.inlinks inlinks will ever be added.
If activated, URLFilters will be applied to both the target URLs and to any incoming link URL. If a target URL is prohibited, all inlinks to that target will be removed, including the target URL. If some of incoming links are prohibited, only they will be removed, and they won’t count when checking the above-mentioned maximum limit.
Author: Andrzej Bialecki
IndexMerger – “nutch merge” – see also merge wiki
org.apache.nutch.indexer
Class IndexMerger
java.lang.Object
  org.apache.hadoop.util.ToolBase
    org.apache.nutch.indexer.IndexMerger
All Implemented Interfaces: Configurable, Tool
public class IndexMerger extends ToolBase
IndexMerger creates an index for the output corresponding to a single fetcher run.
Author: Doug Cutting, Mike Cafarella
I wrote a post asking for clarification about the above four merge commands: “four nutch merge commands: mergedb, mergesegs, mergelinkdb, merge” (Jul 2007).
Q: Naively: why are there four merge commands? Are some subsets of the others? Are they used in conjunction? What are the usage scenarios of each?
A: Each is used in a different scenario
mergedb: as its name does not imply, it is used to merge crawldb. So consider this mergecrawldb
mergesegs: merges segments. It merges <segment>/{content,crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text} information from different segments.
merge: Merges lucene indexes. After a index job, you end up with a indexes directory with a bunch of part-<num> directories inside. Command merge takes such a directory and produces a single index. A single index has a better performance (I think). You can say that merge is poorly named, it should have been called mergeindexes or something.
mergelinkdb: Should be obvious, merges linkdb-s.
So none of them is a subset of another. They all have different purposes. It is kind of confusing to have a “merge” command that only merges indexes, so perhaps we can add a mergeindexes command, keep merge for some time (noting that it has been deprecated) then remove it.
Q: It seems most of the nutch-user discussions I’ve seen so far relate to the simple merge command. Are the first three “advanced commands”?
A: They serve different purpose – let’s assume that somehow you’ve got two crawldb-s, e.g. you ran two crawls with different seed lists and different filters. Now you want to take these collections of urls and create a one big crawl. Then you would use mergedb to merge crawldb-s, mergelinkdb to merge linkdb-s, and mergesegs to merge segments
And a simple “merge” merges indexes of multiple segments, which is a performance-related step in the regular Nutch work-cycle.
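To make the two-crawl scenario from that answer concrete, here is how I imagine the commands lining up, based only on the usage strings printed earlier. It is a sketch, not a tested procedure: the `NUTCH` variable, the `merge_crawls` function and the directory names are my own inventions, and by default it only prints the commands.

```shell
#!/bin/sh
# Dry run by default: prints each nutch command instead of executing it.
# Set NUTCH=nutch (or the path to bin/nutch) to run for real.
NUTCH=${NUTCH:-"echo nutch"}

# merge_crawls OUT CRAWL_A CRAWL_B  (hypothetical helper)
merge_crawls() {
  out=$1
  a=$2
  b=$3
  $NUTCH mergedb "$out/crawldb" "$a/crawldb" "$b/crawldb" || return 1
  $NUTCH mergelinkdb "$out/linkdb" "$a/linkdb" "$b/linkdb" || return 1
  # The globs are left unquoted on purpose: each expands to the list of segment directories.
  $NUTCH mergesegs "$out/segments" "$a"/segments/* "$b"/segments/* || return 1
  # Merged segments get new names, so existing indexes no longer match: index afresh.
  $NUTCH index "$out/indexes" "$out/crawldb" "$out/linkdb" "$out"/segments/* || return 1
}

merge_crawls crawl-merged crawl-a crawl-b
```

The final index step follows from the SegmentMerger javadoc, which says indexes built for the input segments will not work with the merged segment.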
“Incremental indexing” (June 2007) discusses the complex aspects of recrawling/merging rather clearly. It’s too bad nobody on nutch-user replied to it.
Here’s a very current and rather complex question, with replies, titled “incremental growing index” (Jul 2007):
Q: Our crawler generates and fetches segments continuously. We’d like to index and merge each new segment immediately (or with a small delay) such that our index grows incrementally. This is unlike the normal situation where one would create a linkdb and an index of all segments at once, after the crawl has finished. The problem we have is that Nutch currently needs the complete linkdb and crawldb each time we want to index a single segment.
A: The reason for wanting the linkdb is the anchor information. If you don’t need any anchor information, you can provide an empty linkdb.
The reason why crawldb is needed is to get the current page status information (which may have changed in the meantime due to subsequent crawldb updates from newer segments). If you don’t need this information, you can modify Indexer.reduce() (~line 212) method to allow for this, and then remove the line in Indexer.index() that adds crawldb to the list of input paths.
Q: The Indexer map task processes all keys (urls) from the input files (linkdb, crawldb and segment). This includes all data from the linkdb and crawldb that we actually don’t need since we are only interested in the data that corresponds to the keys (urls) in our segment (this is filtered out in the Indexer reduce task). Obviously, as the linkdb and crawldb grow, this becomes more and more of a problem.
A: Is this really a problem for you now? Unless your segments are tiny, the indexing process will be dominated by I/O from the processing of parseText / parseData and Lucene operations.
Q: Any ideas on how to tackle this issue? Is it feasible to lookup the corresponding linkdb and crawldb data for each key (url) in the segment before or during indexing?
A: It would be probably too slow, unless you made a copy of linkdb/crawldb on the local FS-es of each node. But at this point the benefit of this change would be doubtful, because of all the I/O you would need to do to prepare each task’s environment …
Q: Thanks Andrzej. Perhaps these numbers make our issue more clear:
- after a week of (internet) crawling, the crawldb contains about 22M documents.
- 6M documents are fetched, in 257 segments (topN = 25,000)
- size of the crawldb = 4,399 MB (22M docs, 0.2 kB/doc)
- size of the linkdb = 75,955 MB (22M docs, 3.5 kB/doc)
- size of a segment = somewhere between 100 and 500 MB (25K docs, 20 kB/doc (max))
As you can see: for a segment of 500 MB, more than 99% of the IO during indexing is due to the linkdb and crawldb. We could increase the size of our segments, but in the end this only delays the problem. We are now indexing without the linkdb. This reduces the time needed by a factor 10. But we would really like to have the link texts back in again in the future.
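The “more than 99%” figure can be checked from the sizes quoted in that message. A quick sketch, assuming the indexer reads the whole crawldb, the whole linkdb and one segment, and counting input size only; `share_pct` is a hypothetical helper name.

```shell
#!/bin/sh
# share_pct DB_MB SEGMENT_MB  (hypothetical helper)
# Percentage of the indexer's input that comes from the databases rather than the segment.
share_pct() {
  awk -v db="$1" -v seg="$2" 'BEGIN { printf "%.1f\n", 100 * db / (db + seg) }'
}

db_mb=$((4399 + 75955))   # crawldb + linkdb, in MB

share_pct "$db_mb" 500    # largest segment:  prints 99.4
share_pct "$db_mb" 100    # smallest segment: prints 99.9
```

So even the largest segment accounts for well under 1% of what the indexing job has to read.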
Here’s a thread I started a couple weeks back: “Interrupting a nutch crawl — or use topN?” (Jun 2007):
I am running a nutch crawl of 19 sites. I wish to let this crawl go for about two days then gracefully stop it (I don’t expect it to complete by then). Is there a way to do this? I want it to stop crawling then build the lucene index. Note that I used a simple nutch crawl command, rather than the “whole web” crawling methodology:
nutch crawl urls.txt -dir /usr/tmp/19sites -depth 10
Or is it better to use the -topN option?
Some documentation for topN:
“Re: How to terminate the crawl?”
“You can limit the number of pages by using the -topN parameter. This limits the number of pages fetched in each round. Pages are prioritized by how well-linked they are. The maximum number of pages that can be fetched is topN*depth.”
Or from the tutorial:
-topN N determines the maximum number of pages that will be retrieved at each level up to the depth.
For example, a typical call might be:
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
Typically one starts testing one’s configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, then an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can be from tens of thousands to millions, depending on your resources.
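The topN*depth bound quoted above is easy to work out for a given call. A trivial sketch; `max_pages` is a hypothetical helper, and the result is only an upper bound, since a round can fetch fewer than topN pages.

```shell
#!/bin/sh
# max_pages TOPN DEPTH  (hypothetical helper)
# Upper bound on pages fetched by "nutch crawl ... -topN TOPN -depth DEPTH".
max_pages() {
  echo $(( $1 * $2 ))
}

max_pages 50 3       # the tutorial's example call: prints 150
max_pages 1000 10    # prints 10000
```

For my 19-site crawl at depth 10, this suggests choosing topN by dividing the number of pages I can afford to fetch in two days by 10.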
Here was one response to my question:
I use a iterative approach using a script similar to what Sami blogs about here:
Online indexing – integrating Nutch with Solr
excerpt:
There might be times when you would like to integrate Apache Nutch crawling with a single Apache Solr index server – for example when your collection size is limited to amount of documents that can be served by single Solr instance, or you like to do your updates on “live” index. By using Solr as your indexing server might even ease up your maintenance burden quite a bit – you would get rid of manual index life cycle management in Nutch and let Solr handle your index.
I then issue a crawl of 10,000 URLs at a time, and just repeat the process for as long as the window available. because I use solr to store the crawl results. It makes the index available during the crawl window. But I’m a relative newbie as well, so look forward what the experts say.
I looked at Sami Siren’s script; it’s pretty much the same as what I did at the top of this post, except that his script “will execute one iteration of fetching and indexing.” The only real difference is that it uses ‘SolrIndexer’ (which you write yourself) rather than the normal Indexer class, org.apache.nutch.indexer.Indexer (here’s the Indexer javadoc). I believe Indexer is what runs when you do “nutch index” from the command line. Just to beat a dead horse a bit more, here’s an excerpt from Sami’s script:
bin/nutch inject $BASEDIR/crawldb urls
checkStatus
bin/nutch generate $BASEDIR/crawldb $BASEDIR/segments -topN $NUMDOCS
checkStatus
SEGMENT=`bin/hadoop dfs -ls $BASEDIR/segments|grep $BASEDIR|cut -f1|sort|tail -1`
echo processing segment $SEGMENT
bin/nutch fetch $SEGMENT -threads 20
checkStatus
bin/nutch updatedb $BASEDIR/crawldb $SEGMENT -filter
checkStatus
bin/nutch invertlinks $BASEDIR/linkdb $SEGMENT
checkStatus
bin/nutch org.apache.nutch.indexer.SolrIndexer $BASEDIR/crawldb $BASEDIR/linkdb $SEGMENT
checkStatus
checkStatus is just a short function in the script that looks to see if any errors were generated by whatever command ran last. I also note that Sami is using a hadoop command that I don’t understand; the NutchHadoopTutorial mentions ‘hadoop dfs’ … but I think I may be drifting off topic.
Here is the other response I got to my post:
In the past Andrzej put some stuff related to your issue in the Jira. Try to look it up there.
Found it
http://issues.apache.org/jira/browse/NUTCH-368
NUTCH-368: Message queueing system (Sep 2006)
This is an implementation of a filesystem-based message queueing system. The motivation for this functionality is explained in HADOOP-490
HADOOP-490: Add ability to send “signals” to jobs and tasks
In some cases it would be useful to be able to “signal” a job and its tasks about some external condition, or to broadcast a specific message to all tasks in a job. Currently we can only send a single pseudo-signal, that is to kill a job.
This patch uses the message queueing framework to implement the following functionality in Fetcher:
* ability to gracefully stop fetching the current segment. This is different from simply killing the job in that the partial results (partially fetched segment) are available and can be further processed. This is especially useful for fetching large segments with long “tails”, i.e. pages which are fetched very slowly, either because of politeness settings or the target site’s bandwidth limitations.
* ability to dynamically adjust the number of fetcher threads. For a long-running fetch job it makes sense to decrease the number of fetcher threads during the day, and increase it during the night. This can be done now with a cron script, using the MsgQueueTool command-line.
It’s worthwhile to note that the patch itself is trivial, and most of the work is done by the MQ framework.
After you apply this patch you can start a long-running fetcher job, check its <jobId>, and control the fetcher this way:
bin/nutch org.apache.nutch.util.msg.MsgQueueTool -createMsg <job_id> ctrl THREADS 50
This adjusts the number of threads to 50 (starting more threads or stopping some threads as necessary).
Then run:
bin/nutch org.apache.nutch.util.msg.MsgQueueTool -createMsg <job_id> ctrl HALT
This will gracefully shut down all threads after they finish fetching their current url, and finish the job, keeping the partial segment data intact.
Susam Pal has posted (Aug 2007) a new script to crawl with nutch 0.9:
#!/bin/bash
# Runs the Nutch bot to crawl or re-crawl
# Usage: bin/runbot [safe]
# If executed in 'safe' mode, it doesn't delete the temporary
# directories generated during crawl. This might be helpful for
# analysis and recovery in case a crawl fails.
#
# Author: Susam Pal
depth=2
threads=50
adddays=5
topN=2 # Comment this statement if you don't want to set topN value
# Parse arguments
if [ "$1" == "safe" ]
then
safe=yes
fi
if [ -z "$NUTCH_HOME" ]
then
NUTCH_HOME=.
echo runbot: $0 could not find environment variable NUTCH_HOME
echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi
if [ -z "$CATALINA_HOME" ]
then
CATALINA_HOME=/opt/apache-tomcat-6.0.10
echo runbot: $0 could not find environment variable CATALINA_HOME
echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
else
echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
fi
if [ -n "$topN" ]
then
topN="-topN $topN"
else
topN=""
fi
steps=8
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls
echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0; i < $depth; i++))
do
echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
$NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN \
  -adddays $adddays
if [ $? -ne 0 ]
then
echo "runbot: Stopping at depth `expr $i + 1`. No more URLs to fetch."
break
fi
segment=`ls -d crawl/segments/* | tail -1`
$NUTCH_HOME/bin/nutch fetch $segment -threads $threads
if [ $? -ne 0 ]
then
echo "runbot: fetch $segment at depth $depth failed. Deleting it."
rm -rf $segment
continue
fi
$NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done
echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
rm -rf crawl/segments/*
else
mkdir crawl/FETCHEDsegments
mv --verbose crawl/segments/* crawl/FETCHEDsegments
fi
mv --verbose crawl/MERGEDsegments/* crawl/segments
rmdir crawl/MERGEDsegments
echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*
echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*
echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes
echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes
if [ "$safe" != "yes" ]
then
rm -rf crawl/NEWindexes
fi
echo "----- Reloading index on the search site (Step 8 of $steps) -----"
if [ "$safe" != "yes" ]
then
touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
echo Done!
else
echo runbot: Can not reload index in safe mode.
echo runbot: Please reload it manually using the following command:
echo runbot: touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
fi
echo "runbot: FINISHED: Crawl completed!"
Susam comments as follows:
I have written this script to crawl with Nutch 0.9. Though, I have tried to take care that this should work for re-crawls as well, but I have never done any real world testing for re-crawls. I use this to crawl. You may try this out. We can make some changes if this is not found to be appropriate for re-crawls.


July 14, 2007 at 2:47 am
A great collection of comments and information.
I’ll add to this when I have more information.
I am interested in the incremental/merge scenario, where I may have a 5-10 GB index and then want to add another 100-200 MB of data.
July 15, 2007 at 6:37 am
I am also very interested in understanding how to recrawl a site. It is unfortunate that there is no clear documentation on how to “maintain” an index after the initial crawl. I would imagine almost 100% of real users want to keep a maintained, up-to-date index that takes into account changed, new and deleted documents. Although there is a lot of discussion, it would be really great if someone could publish simple step-by-step instructions on how to maintain a site, maybe on a daily or weekly basis.
July 19, 2007 at 2:44 pm
One of the most detailed technical blog posts I’ve ever seen.
August 15, 2007 at 1:06 pm
I’ve noticed that merging crawl/indexes into crawl/index doesn’t seem to work for me (but I’m running under Cygwin). It may be some odd permissions issue, but deleting crawl/index first helps the merge (and similarly deleting indexes before indexing). Might be worth trying.
September 5, 2007 at 9:59 am
This script seems a little awkward.
I used a url file of some 100 urls and just a depth of 1 to test it.
When it was done (1 min later) I ran it again, but instead of the behaviour I expected, a recrawl of the previously injected urls (hopefully deeming this crawl unnecessary), it started to crawl the links which were fetched from the top urls in round 1.
I have a use case which can’t be too far from the norm.
1. Crawl X urls the first time. The initial urls registered.
2. Crawl X plus Y urls the next time; call it Z (since some urls have registered in the network).
2.1 Check all already existing urls if they have updated content, crawl and index.
2.2 Fully crawl the new urls and index.
3. Some time later (days, weeks) – crawl Z minus A urls (since some urls left the network)
3.1 Check all already existing urls if they have updated content.
3.2 Delete all A urls from the crawl list and index.
Is this a strange setup? I would like to do it this way since I want to present a widget/page showing which keywords have the highest frequency in the network. Using scores from long-gone sites is then not too good.
I will do a sample search of the index each week and present a “Hottest keyword” list, store it in a db and compare it to the previous week. So basically I’m just interested in snapshotting the state of the crawled index each week, something like sampling statistics to be able to create a trend diagram. Something like what I’m doing here with stats:
http://www.tailsweep.com/partners/englas_showroom
Any ideas why it digs deeper than I want and skips the top?
BTW, Tailsweep is my company.
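The bookkeeping part of the use case above (which urls joined the network, which left) can be sketched in plain shell. This is a minimal sketch, assuming each week's seed list is saved as a text file with one url per line. The function names and file names are hypothetical, and nothing here touches Nutch itself; how the resulting lists are applied to the crawldb and index is left open.

```shell
#!/bin/sh
# Hypothetical helpers: compare two seed lists (one url per line).

# sorted_copy FILE: write a sorted, de-duplicated copy of FILE to a
# temporary file and print the temporary file's name.
sorted_copy() {
    copy=$(mktemp) || return 1
    LC_ALL=C sort -u "$1" > "$copy"
    echo "$copy"
}

# added_urls OLD NEW: print urls that appear only in NEW (the "Y" urls).
added_urls() {
    a=$(sorted_copy "$1") && b=$(sorted_copy "$2") || return 1
    LC_ALL=C comm -13 "$a" "$b"
    rm -f "$a" "$b"
}

# removed_urls OLD NEW: print urls that appear only in OLD (the "A" urls).
removed_urls() {
    a=$(sorted_copy "$1") && b=$(sorted_copy "$2") || return 1
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Example (file names are assumptions):
#   added_urls   seeds-last-week.txt seeds-this-week.txt > to-inject.txt
#   removed_urls seeds-last-week.txt seeds-this-week.txt > to-drop.txt
```

Keeping the comparison outside Nutch means the weekly snapshot only needs the two seed files; urls present in both lists are the ones to check for updated content.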
October 14, 2007 at 8:48 am
I’m starting to feel like an idiot but I’m trying a simple thing.
I’m struggling to get the runbot script to recrawl the same urls over and over, but cannot get it to work.
I crawl one url only for example: http://gadgets.fosfor.se/
Let’s say I have invented a trigger which can detect that the page has been updated. Then I would immediately send a signal to the crawler and let it recrawl only the pages which have been updated, nothing more, nothing less. This is probably a special use case since I will have many urls to inject and each url only a depth of 1.
Anyway, the sequence below does not work.
generate – 1st time great, but every subsequent time it generates a new list consisting of all the anchors which the previous fetch contained. I added them with adddays=0, which I thought would force it to do this again.
fetch
updatedb
Does anyone have any ideas how I can recrawl a single url many times in a row?
I can of course do a crawl to a new dir each time and merge the output…
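That last idea, crawling into a new directory each time, might look like the sketch below. It only reuses the inject, generate, fetch and updatedb calls from the runbot script above. The function name crawl_fresh, the NUTCH variable and the directory names are hypothetical, and merging the per-run output is not shown.

```shell
#!/bin/sh
# Sketch: run one inject/generate/fetch/updatedb round into a brand-new
# crawl directory, so earlier runs never influence what gets generated.
# NUTCH is an assumption; point it at your own bin/nutch.
NUTCH="${NUTCH:-bin/nutch}"

# crawl_fresh SEEDS OUT
#   SEEDS: directory holding the seed url file(s)
#   OUT:   crawl directory for this run (must not exist yet)
crawl_fresh() {
    seeds=$1
    out=$2
    if [ -e "$out" ]; then
        echo "crawl_fresh: $out already exists" >&2
        return 1
    fi
    $NUTCH inject "$out/crawldb" "$seeds" || return 1
    $NUTCH generate "$out/crawldb" "$out/segments" || return 1
    # Pick up the segment that generate just created.
    segment=$(ls -d "$out"/segments/* 2>/dev/null | tail -1)
    if [ -z "$segment" ]; then
        echo "crawl_fresh: no segment was generated" >&2
        return 1
    fi
    $NUTCH fetch "$segment" || return 1
    $NUTCH updatedb "$out/crawldb" "$segment" || return 1
}

# Example (directory names are assumptions):
#   crawl_fresh urls "crawl-$(date +%Y%m%d%H%M%S)"
```

Because every run starts from an empty crawldb, generate can only select the injected seeds, which is the depth-1 behaviour asked for; the cost is that each run's output has to be merged or swapped in afterwards.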