Recrawling and merging

As I mentioned in my introductory blog entry, I have already set up a working nutch installation and crawled/indexed some documents.

Now I have a different question: how can I evolve a corpus over time? Basically I want to start with a group of seed URLs and do a nutch crawl. There are two methodologies I know of so far, and I’m not sure whether I want an “intranet crawl” or a “whole web crawl”. The first uses the “nutch crawl” command:

Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]

The “whole web crawl” breaks that down to its constituent steps; here’s one I did:

nutch inject crawl/crawldb seed 
nutch generate crawl/crawldb crawl/segments 
s1=`ls -d crawl/segments/2* | tail -1` 
nutch fetch2 $s1 
nutch updatedb crawl/crawldb $s1 
nutch generate crawl/crawldb crawl/segments -topN 1000 
s2=`ls -d crawl/segments/2* | tail -1` 
nutch fetch2 $s2 
nutch updatedb crawl/crawldb $s2 
nutch generate crawl/crawldb crawl/segments -topN 1000 
s3=`ls -d crawl/segments/2* | tail -1` 
nutch fetch2 $s3 
nutch updatedb crawl/crawldb $s3 
nutch invertlinks crawl/linkdb -dir crawl/segments 
nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

The essence of the above is:

  1. inject
  2. loop on these:
    1. generate
    2. fetch2
    3. updatedb
  3. invertlinks
  4. index
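
The loop above can be sketched as a small shell function. This is a minimal sketch rather than a tested crawl script: the function name crawl_rounds, the NUTCH variable (so that a different launcher such as bin/nutch can be substituted) and the positional arguments are my own assumptions; the nutch subcommands and their arguments are the ones I ran above.

```sh
# Hypothetical helper: inject once, loop generate/fetch2/updatedb, then
# invert links and index. NUTCH may point at another launcher, e.g. bin/nutch.
NUTCH=${NUTCH:-nutch}

crawl_rounds() (
  crawl=$1    # crawl directory, e.g. crawl
  seeds=$2    # directory holding the seed URL file(s)
  rounds=$3   # number of generate/fetch2/updatedb rounds
  topn=$4     # value passed to generate's -topN

  $NUTCH inject "$crawl/crawldb" "$seeds" || exit 1
  i=0
  while [ "$i" -lt "$rounds" ]; do
    # Stop looping if generate fails, e.g. because nothing is due.
    $NUTCH generate "$crawl/crawldb" "$crawl/segments" -topN "$topn" || break
    # Segment names are timestamps, so the last one listed is the newest.
    seg=$(ls -d "$crawl"/segments/2* | tail -1)
    $NUTCH fetch2 "$seg" || exit 1
    $NUTCH updatedb "$crawl/crawldb" "$seg" || exit 1
    i=$((i + 1))
  done
  $NUTCH invertlinks "$crawl/linkdb" -dir "$crawl/segments" || exit 1
  $NUTCH index "$crawl/indexes" "$crawl/crawldb" "$crawl/linkdb" "$crawl"/segments/*
)
```

Calling crawl_rounds crawl seed 3 1000 would roughly correspond to the session above, except that my first generate there had no -topN.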

So far I’ve built a corpus of several thousand documents. How should I add to it?

To be clear, I am conflating two issues here: recrawling and merging are separate operations. Recrawling goes through the existing pages and updates them. The wiki has a recrawl script, but it has not been updated for version 0.9, and it isn’t clear whether it still works there. Merging, by contrast, combines two (usually? mostly?) disjoint sets of documents and their attendant indexes. The wiki details a MergeCrawl script for this, again only through version 0.8. Neither script is in the distribution; why is that?

Neither “recrawl” nor “merge” is mentioned in the nutch tutorial.

I did a search for “merge” on nutch-user; I also did a search for “recrawl”. Then I followed a few of the threads:

“Nutch Crawl Vs. Merge Time Complexity” (Mar 2006) asks:

I’m using Nutch v0.7 and I’ve been running nutch on our company unix system and it was setup to crawl our intranet sites for updates daily, I’ve tried using the Merge, dedup, updatedb, and etc…I’d notice the time complexity and efficiency was less productive than doing a fresh new crawl. For example if I have two separate crawls from two different domains such as hotmail and yahoo, what would the time complexity for nutch to crawl this two domains and then do a merge compare to just doing a single full crawl of both domains? My guess would be that it will take nutch the same amount of times to do either one, if that is so is there a reason to use the Merge at all?

“Incremental indexing” (Jun 2007) asks:

As the size of my data keeps growing, and the indexing time grows even
faster, I’m trying to switch from a “reindex all at every crawl” model to an
incremental indexing one. I intend to keep the segments separate, but I
want to index only the segment fetched during the last cycle, and then merge
indexes and perhaps linkdb. I have a few questions:

1. In an incremental scenario, how do I remove from the indexes references
to segments that have expired??

2. Looking at http://wiki.apache.org/nutch/MergeCrawl , it would appear that
I can call “bin/nutch merge” with only two parameters: the original index
directory as destination, and the directory to be merged in the former:

 $nutch_dir/nutch merge $index_dir $new_indexes

But when I do that, the merged data are left in a subdirectory called $index_dir/merge_output . Shouldn’t I instead create a new empty destination directory, do the merge, and then replace the original with the newly merged directory:

merged_indexes=$crawl_dir/merged_indexes 
rm -rf $merged_indexes # just in case it's already there 
$nutch_dir/nutch merge $merged_indexes $index_dir $new_indexes 
rm -rf $index_dir.old # just in case it's already there 
mv $index_dir $index_dir.old 
mv $merged_indexes $index_dir 
rm -rf $index_dir.old

3. Regarding linkdb, does running “$nutch_dir/nutch invertlinks” on the latest segment only, and then merging the newly obtained linkdb with the current one with “$nutch_dir/nutch mergelinkdb”, make sense rather than recreating linkdb afresh from the whole set of segments every time? In other words, can invertlinks work incrementally, or does it need to have a view of all segments in order to work correctly?
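
The directory swap proposed in question 2 can be wrapped in a function that refuses to touch the live index unless the merged one actually exists. This is a minimal sketch: swap_index is a hypothetical helper name of mine, it covers only the directory handling, and it does not answer whether “nutch merge” is the right tool for the job.

```sh
# Hypothetical helper: replace the live index dir with a freshly merged one,
# keeping the previous index as <live>.old until the swap has succeeded.
swap_index() (
  merged=$1   # directory produced by "nutch merge"
  live=$2     # index directory currently in use

  [ -d "$merged" ] || { echo "swap_index: $merged is missing" >&2; exit 1; }
  rm -rf "$live.old"                 # stale backup from an earlier run
  if [ -e "$live" ]; then
    mv "$live" "$live.old" || exit 1
  fi
  if mv "$merged" "$live"; then
    rm -rf "$live.old"               # swap succeeded, drop the backup
  else
    # Roll back so the old index stays in place.
    if [ -e "$live.old" ]; then mv "$live.old" "$live"; fi
    exit 1
  fi
)
```

Usage would be swap_index $merged_indexes $index_dir right after the merge command.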

“Recrawl URLS” (Aug 2006) has a discussion between two people:

Q: I was searching for the method to add new url to the crawling url list
and how to recrawl all urls…

A: You could use the command bin/nutch inject $nutch-dir/db -urlfile
urlfile.txt. To recrawl your WebDB you can use this script:
http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

[That’s the same as the recrawl script from the wiki. –Kai]

Take a look to the adddays argument and to the configuration property
db.default.fetch.interval. They influence to the result.

Q: I have another question, I done what you give me… But it inject the
new urls and “recrawl” it, but against the first crawl It doesn’t
download the web pages and really crawl them… perhaps I’m mistaking
somewhere…Any idea ?

A: In the nutch conf/nutch-default.xml configuration file exist a property call
db.default.fetch.interval. When you crawl a site, nutch schedules the next
fetch to “today + db.default.fetch.interval” days. If you execute the recrawl
command and the pages that you fetch don’t reach this date, they won’t be
re-fetched. When you add new urls to the webdb, they will be ready to be
fetch. So at this moment only this pages will be fetched by the recrawl
script.

Q: But the websites just added hasn’t been yet crawled… And they’re not
crawled during recrawl…
Does “bin/nutch purge” will restart all ?

A: This command “bin/nutch purge” doesn’t exist. Well I can’t say you what is
happening. Give me the output when you run the recrawl.

I found that a bit inconclusive. Points of interest:

$ nutch inject 
Usage: Injector <crawldb> <url_dir>

The above is the usage printed for “nutch inject” on the command line. And now from nutch-default.xml:

<property> 
    <name>db.default.fetch.interval</name> 
    <value>30</value> 
    <description>(DEPRECATED) The default number of days between re-fetches of a page. 
    </description> 
</property>

Ok, great, that’s deprecated. I really need some current documentation!
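
To make the scheduling rule from that thread concrete for myself, here is a minimal sketch of it as plain arithmetic. It is a model, not Nutch code: is_due is a hypothetical function, days are counted as whole numbers, and I am assuming (from my reading of the thread and the adddays option) that -adddays simply moves the generator’s idea of “today” forward.

```sh
# Hypothetical model of the scheduling rule, in whole days since some epoch.
# is_due LAST_FETCH_DAY INTERVAL_DAYS TODAY ADDDAYS
# Succeeds when the page's next fetch date has been reached.
is_due() {
  next_fetch=$(( $1 + $2 ))      # "today + db.default.fetch.interval"
  pretend_today=$(( $3 + $4 ))   # assumed effect of -adddays
  [ "$pretend_today" -ge "$next_fetch" ]
}
```

With the default interval of 30, a page fetched on day 0 is not due on day 10 (is_due 0 30 10 0 fails), but it becomes due with an adddays of 20 or more (is_due 0 30 10 20 succeeds). That would explain why a recrawl shortly after the first crawl fetches only newly injected URLs.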

“Recrawling… Methodology?” (Jul 2006) asks:

I need some help clarifying if recrawling is doing exactly what I think it is. Here’s the current scenario of how I think a recrawl should work:

I crawl my intranet with a depth of 2. Later, I recrawl using the script found below: http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03 [the standard script –Kai]

In my recrawl, I also specify a depth of 2. It reindexes each of the pages before, and if they have changed update the pages content. If they have changed and new links exist, the links are followed to a maximum depth of 2.

This is how I think a typical recrawl should work. However, when I recrawl using the script linked to above, tons of new pages are indexed, whether they have changed or not. It seems as if I crawl the content with a depth of 2, and then come back and recrawl with a depth of 2, it really adds a couple of crawl depth levels and the outcome is that I have done a crawl with a depth of 4 (instead of crawl with a depth of 2 and then just a recrawl to catch any new pages).

The current steps of the recrawl are as follows:
for (how many depth levels specified)

$nutch_dir/nutch generate $webdb_dir $segments_dir -adddays $adddays 
segment=`ls -d $segments_dir/* | tail -1` 
$nutch_dir/nutch fetch $segment 
$nutch_dir/nutch updatedb $webdb_dir $segment
  • invertlinks
  • index
  • dedup
  • merge

Basically what made me wonder is that it took me 2 minutes to do the crawl. It’s taken me over 3 hours and still going to do the recrawl (same depth levels specified). After I recrawl once, I believe it then speeds up.

I don’t know whether he ever fixed his problem. He was doing the same thing I am, except that he started with an “intranet crawl” and built on it (I deleted my initial “intranet crawl” and recrawled incrementally).

I’m not sure if repetition will help me, but here’s another description of how crawl works – “Re: How to recrawl urls” (Dec 2005):

The scheme of intranet crawling is like this: Firstly, you create a webdb using WebDBAdminTool. After that, you fetch a seed URL using WebDBInjector. The seed URL is inserted into your webdb, marked by current date and time. Then, you create a fetch list using FetchListTool. The FetchListTool read all URLs in the webdb which are due to crawl, and put them to the fetchlist. Next, the Fetcher crawls all URLs in the fetchlist. Finally, once crawling is finished, UpdateDatabaseTool extracts all outlinks and put them to webdb. Newly extracted outlinks are set date and time to current date and time, while all just-crawled URLs date and time are set to next 30 days (these things happen actually in FetchListTool). So all extracted links will be crawled for the next time, but not the just-crawled URLs. So on and so forth.

Therefore, once the crawler is still alive after 30 days (or the threshold that you set), all “just-crawled” urls will be taken out to recrawl. That’s why we need to maintain a live crawler at that time. This could be done using cron job, I think.
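
If the crawl is driven from cron as suggested, one practical concern is a run starting while the previous one is still fetching. Here is a minimal sketch of a lock wrapper for that case. Everything in it is my own assumption: run_locked is a hypothetical name, and the lock path and schedule in the comment are only illustrations.

```sh
# Hypothetical wrapper: run a command only if no other run holds the lock.
# mkdir is atomic, so a directory doubles as a simple lock.
# Example crontab entry (illustrative): 0 2 * * * /path/to/recrawl-wrapper
run_locked() (
  lock=$1
  shift
  if ! mkdir "$lock" 2>/dev/null; then
    echo "run_locked: $lock exists, another run seems active" >&2
    exit 2
  fi
  status=0
  "$@" || status=$?
  rmdir "$lock"        # note: the lock stays behind if the run is killed
  exit "$status"
)
```

A cron-driven recrawl could then be started as run_locked /tmp/recrawl.lock followed by whatever recrawl script is in use.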

Slightly further into the above thread, Stefan Groschupf suggests: “do the steps manually as described here: SimpleMapReduceTutorial“; that tutorial, written by Earl Cahill in Oct 2005, has these steps (plus explanation):

cd nutch/branches/mapred 
mkdir urls 
echo "http://lucene.apache.org/nutch/" > urls/urls 
perl -pi -e 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|' conf/crawl-urlfilter.txt 
./bin/nutch crawl urls 
CRAWLDB=`find crawl-2* -name crawldb` 
SEGMENTS_DIR=`find crawl-2* -maxdepth 1 -name segments` 
./bin/nutch generate $CRAWLDB $SEGMENTS_DIR 
SEGMENT=`find crawl-2*/segments/2* -maxdepth 0 | tail -1` 
./bin/nutch fetch $SEGMENT 
./bin/nutch updatedb $CRAWLDB $SEGMENT 
LINKDB=`find crawl-2* -name linkdb -maxdepth 1` 
SEGMENTS=`find crawl-2* -name segments -maxdepth 1` 
./bin/nutch invertlinks $LINKDB $SEGMENTS 
mkdir myindex 
ls -alR myindex

Here’s a somewhat basic discussion on merging: “Problem with merge-output” (Jun 2007)

Q: After recrawl several times, I have problem with the directory: merge-output. I have digged into mail archive and found some clue: you should use a new dir name for the new merge, e.g., merge-output_new, then mv merge-output_new to merge-output.

A: This is something I usually do:-

$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/* 
rm -rf crawl/segments/* 
mv crawl/MERGEDsegments/* crawl/segments

You might want to replace the second statement with a ‘mv’ statement to backup the segments.

Here’s another: “Simple question about the merge tool” (Jul 2005):

Q: I have a simple question about how to use the merge tool. I’ve done three small crawls resulting in three small segment directories. How can I merge these into one directory with one index? I notice the merge command options:

Usage: IndexMerger (-local | -ndfs <nameserver:port>) [-workingdir <workingdir>] outputIndex segments...

I don’t really understand what it’s doing with the outputIndex and the segments. Will this automatically delete segments after merging them into the output?

A: Use the bin/nutch mergesegs to merge many segments into one.

Mergesegs prints its usage when “nutch mergesegs” is run on the command line without arguments. I’m curious about the usage of the other merge commands as well, so here’s a console session covering all of them:

$ nutch | grep merg 
  mergedb           merge crawldb-s, with optional filtering 
  mergesegs         merge several segments, with optional filtering and slicing 
  mergelinkdb       merge linkdb-s, with optional filtering 
  merge             merge several segment indexes 
$ nutch mergedb 
Usage: CrawlDbMerger <output_crawldb> <crawldb1> [<crawldb2> <crawldb3> ...] [-normalize] [-filter] 
        output_crawldb  output CrawlDb 
        crawldb1 ...    input CrawlDb-s (single input CrawlDb is ok) 
        -normalize      use URLNormalizer on urls in the crawldb(s) (usually not needed) 
        -filter use URLFilters on urls in the crawldb(s) 
$ nutch mergesegs 
SegmentMerger output_dir (-dir segments | seg1 seg2 ...) [-filter] [-slice NNNN] 
        output_dir      name of the parent dir for output segment slice(s) 
        -dir segments   parent dir containing several segments 
        seg1 seg2 ...   list of segment dirs 
        -filter         filter out URL-s prohibited by current URLFilters 
        -slice NNNN     create many output segments, each containing NNNN URLs 
$ nutch mergelinkdb 
Usage: LinkDbMerger <output_linkdb> <linkdb1> [<linkdb2> <linkdb3> ...] [-normalize] [-filter] 
        output_linkdb   output LinkDb 
        linkdb1 ...     input LinkDb-s (single input LinkDb is ok) 
        -normalize      use URLNormalizer on both fromUrls and toUrls in linkdb(s) (usually not needed) 
        -filter use URLFilters on both fromUrls and toUrls in linkdb(s) 
$ nutch merge 
Usage: IndexMerger [-workingdir <workingdir>] outputIndex indexesDir...

Ah: the nutch javadoc has some comments on each of the above classes:

CrawlDbMerger – “nutch mergedb” – see also mergedb wiki

org.apache.nutch.crawl
Class CrawlDbMerger

java.lang.Object
  extended by org.apache.hadoop.util.ToolBase
      extended by org.apache.nutch.crawl.CrawlDbMerger
All Implemented Interfaces:
Configurable, Tool

public class CrawlDbMerger                     
extends ToolBase

This tool merges several CrawlDb-s into one, optionally filtering URLs through the current URLFilters, to skip prohibited pages.

It’s possible to use this tool just for filtering – in that case only one CrawlDb should be specified in arguments.

If more than one CrawlDb contains information about the same URL, only the most recent version is retained, as determined by the value of CrawlDatum.getFetchTime(). However, all metadata information from all versions is accumulated, with newer values taking precedence over older values.

Author:
Andrzej Bialecki

SegmentMerger – “nutch mergesegs” – see also mergesegs wiki

org.apache.nutch.segment
Class SegmentMerger

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.nutch.segment.SegmentMerger
All Implemented Interfaces:
Configurable, Closeable, JobConfigurable, Mapper, Reducer

public class SegmentMerger                     
extends Configured
implements Mapper, Reducer

This tool takes several segments and merges their data together. Only the latest versions of data is retained.

Optionally, you can apply current URLFilters to remove prohibited URL-s.

Also, it’s possible to slice the resulting segment into chunks of fixed size.

Important Notes

Which parts are merged?

It doesn’t make sense to merge data from segments, which are at different stages of processing (e.g. one unfetched segment, one fetched but not parsed, and one fetched and parsed). Therefore, prior to merging, the tool will determine the lowest common set of input data, and only this data will be merged. This may have some unintended consequences: e.g. if majority of input segments are fetched and parsed, but one of them is unfetched, the tool will fall back to just merging fetchlists, and it will skip all other data from all segments.

Merging fetchlists

Merging segments, which contain just fetchlists (i.e. prior to fetching) is not recommended, because this tool (unlike the Generator) doesn’t ensure that fetchlist parts for each map task are disjoint.

Duplicate content

Merging segments removes older content whenever possible (see below). However, this is NOT the same as de-duplication, which in addition removes identical content found at different URL-s. In other words, running DeleteDuplicates is still necessary.

For some types of data (especially ParseText) it’s not possible to determine which version is really older. Therefore the tool always uses segment names as timestamps, for all types of input data. Segment names are compared in forward lexicographic order (0-9a-zA-Z), and data from segments with “higher” names will prevail. It follows then that it is extremely important that segments be named in an increasing lexicographic order as their creation time increases.

Merging and indexes

Merged segment gets a different name. Since Indexer embeds segment names in indexes, any indexes originally created for the input segments will NOT work with the merged segment. Newly created merged segment(s) need to be indexed afresh. This tool doesn’t use existing indexes in any way, so if you plan to merge segments you don’t have to index them prior to merging.

Author:
Andrzej Bialecki
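
Since SegmentMerger treats segment names as timestamps, it may be worth checking before a merge which segment will “win”. Here is a minimal sketch, assuming the default timestamp-style names (all digits, as in the crawl/segments/2* listings earlier); newest_segment is a hypothetical helper of mine, not part of Nutch.

```sh
# Hypothetical helper: print the name of the segment that sorts last.
# For the default all-digit timestamp names, byte order equals creation order.
newest_segment() {
  for seg in "$1"/*/; do
    if [ -d "$seg" ]; then basename "$seg"; fi
  done | LC_ALL=C sort | tail -1
}
```

For hand-renamed segments the sort order may differ from what the javadoc describes, so I would only trust this with the generated names.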

LinkDbMerger – “nutch mergelinkdb” – see also mergelinkdb wiki

org.apache.nutch.crawl
Class LinkDbMerger

java.lang.Object
  extended by org.apache.hadoop.util.ToolBase
      extended by org.apache.nutch.crawl.LinkDbMerger
All Implemented Interfaces:
Configurable, Closeable, JobConfigurable, Reducer, Tool

public class LinkDbMerger                     
extends ToolBase
implements Reducer

This tool merges several LinkDb-s into one, optionally filtering URLs through the current URLFilters, to skip prohibited URLs and links.

It’s possible to use this tool just for filtering – in that case only one LinkDb should be specified in arguments.

If more than one LinkDb contains information about the same URL, all inlinks are accumulated, but only at most db.max.inlinks inlinks will ever be added.

If activated, URLFilters will be applied to both the target URLs and to any incoming link URL. If a target URL is prohibited, all inlinks to that target will be removed, including the target URL. If some of incoming links are prohibited, only they will be removed, and they won’t count when checking the above-mentioned maximum limit.

Author:
Andrzej Bialecki

IndexMerger – “nutch merge” – see also merge wiki

org.apache.nutch.indexer
Class IndexMerger

java.lang.Object
  extended by org.apache.hadoop.util.ToolBase
      extended by org.apache.nutch.indexer.IndexMerger
All Implemented Interfaces:
Configurable, Tool

public class IndexMerger                     
extends ToolBase

IndexMerger creates an index for the output corresponding to a single fetcher run.

Author:
Doug Cutting, Mike Cafarella

I wrote a post asking for clarification about the above four merge commands: “four nutch merge commands: mergedb, mergesegs, mergelinkdb, merge” (Jul 2007).

Q: Naively: why are there four merge commands? Are some subsets of the others? Are they used in conjunction? What are the usage scenarios of each?

A: Each is used in a different scenario

mergedb: as its name does not imply, it is used to merge crawldb. So consider this mergecrawldb

mergesegs: merges segments. It merges <segment>/{content,crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text} information from different segments.

merge: Merges lucene indexes. After a index job, you end up with a indexes directory with a bunch of part-<num> directories inside. Command merge takes such a directory and produces a single index. A single index has a better performance (I think). You can say that merge is poorly named, it should have been called mergeindexes or something.

mergelinkdb: Should be obvious, merges linkdb-s.

So none of them is a subset of another. They all have different purposes. It is kind of confusing to have a “merge” command that only merges indexes, so perhaps we can add a mergeindexes command, keep merge for some time (noting that it has been deprecated) then remove it.

Q: It seems most of the nutch-user discussions I’ve seen so far relate to the simple merge command. Are the first three “advanced commands”?

A: They serve different purpose – let’s assume that somehow you’ve got two crawldb-s, e.g. you ran two crawls with different seed lists and different filters. Now you want to take these collections of urls and create a one big crawl. Then you would use mergedb to merge crawldb-s, mergelinkdb to merge linkdb-s, and mergesegs to merge segments 😉

And a simple “merge” merges indexes of multiple segments, which is a performance-related step in the regular Nutch work-cycle.
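
Putting that answer together with the usage messages and javadoc above, merging two separate crawls might look like the following. This is a minimal sketch under my own assumptions: merge_crawls and the NUTCH variable are hypothetical, both crawls are assumed to use the crawldb/linkdb/segments layout from my session above, and I have not verified the result against a real index.

```sh
# Hypothetical helper: combine two crawl directories into a third.
# NUTCH may point at another launcher, e.g. bin/nutch.
NUTCH=${NUTCH:-nutch}

merge_crawls() (
  a=$1; b=$2; out=$3
  $NUTCH mergedb "$out/crawldb" "$a/crawldb" "$b/crawldb" || exit 1
  $NUTCH mergelinkdb "$out/linkdb" "$a/linkdb" "$b/linkdb" || exit 1
  $NUTCH mergesegs "$out/segments" "$a"/segments/* "$b"/segments/* || exit 1
  # Merged segments get new names, so the old indexes cannot be reused:
  # index the merged data afresh, dedup, then collapse into a single index.
  $NUTCH index "$out/indexes" "$out/crawldb" "$out/linkdb" "$out"/segments/* || exit 1
  $NUTCH dedup "$out/indexes" || exit 1
  $NUTCH merge "$out/index" "$out/indexes"
)
```

The reindexing step follows from the SegmentMerger note that indexes built for the input segments will not work with the merged segment.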

“Incremental indexing” (Jun 2007), quoted in full above, discusses the complex aspects of recrawling/merging rather clearly. It’s too bad nobody on nutch-user replied to it.

Here’s a very current and rather complex question, with replies, titled “incremental growing index” (Jul 2007):

Q: Our crawler generates and fetches segments continuously. We’d like to index and merge each new segment immediately (or with a small delay) such that our index grows incrementally. This is unlike the normal situation where one would create a linkdb and an index of all segments at once, after the crawl has finished. The problem we have is that Nutch currently needs the complete linkdb and crawldb each time we want to index a single segment.

A: The reason for wanting the linkdb is the anchor information. If you don’t need any anchor information, you can provide an empty linkdb.

The reason why crawldb is needed is to get the current page status information (which may have changed in the meantime due to subsequent crawldb updates from newer segments). If you don’t need this information, you can modify Indexer.reduce() (~line 212) method to allow for this, and then remove the line in Indexer.index() that adds crawldb to the list of input paths.

Q: The Indexer map task processes all keys (urls) from the input files (linkdb, crawldb and segment). This includes all data from the linkdb and crawldb that we actually don’t need since we are only interested in the data that corresponds to the keys (urls) in our segment (this is filtered out in the Indexer reduce task). Obviously, as the linkdb and crawldb grow, this becomes more and more of a problem.

A: Is this really a problem for you now? Unless your segments are tiny, the indexing process will be dominated by I/O from the processing of parseText / parseData and Lucene operations.

Q: Any ideas on how to tackle this issue? Is it feasible to lookup the corresponding linkdb and crawldb data for each key (url) in the segment before or during indexing?

A: It would be probably too slow, unless you made a copy of linkdb/crawldb on the local FS-es of each node. But at this point the benefit of this change would be doubtful, because of all the I/O you would need to do to prepare each task’s environment …

Q: Thanks Andrzej. Perhaps these numbers make our issue more clear:

- after a week of (internet) crawling, the crawldb contains about 22M documents. 
- 6M documents are fetched, in 257 segments (topN = 25,000) 
- size of the crawldb = 4,399 MB (22M docs, 0.2 kB/doc) 
- size of the linkdb = 75,955 MB (22M docs, 3.5 kB/doc) 
- size of a segment = somewhere between 100 and 500 MB (25K docs, 20 kB/doc (max))

As you can see: for a segment of 500 MB, more than 99% of the IO during indexing is due to the linkdb and crawldb. We could increase the size of our segments, but in the end this only delays the problem. We are now indexing without the linkdb. This reduces the time needed by a factor 10. But we would really like to have the link texts back in again in the future.
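
The “more than 99%” figure can be checked from the sizes quoted above. A small worked example, assuming (my assumption) that the indexer reads the crawldb, the linkdb and one 500 MB segment once each:

```sh
# Sizes in MB, taken from the figures quoted above.
crawldb=4399
linkdb=75955
segment=500

# Share of the indexing input that is crawldb + linkdb rather than segment data.
db_share=$(LC_ALL=C awk -v c="$crawldb" -v l="$linkdb" -v s="$segment" \
  'BEGIN { printf "%.1f", 100 * (c + l) / (c + l + s) }')
echo "crawldb + linkdb: ${db_share}% of input"
```

This prints 99.4% for the largest segment size mentioned; smaller segments push the share even higher.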

Here’s a thread I started a couple weeks back: “Interrupting a nutch crawl — or use topN?” (Jun 2007):

I am running a nutch crawl of 19 sites. I wish to let this crawl go for about two days then gracefully stop it (I don’t expect it to complete by then). Is there a way to do this? I want it to stop crawling then build the lucene
index. Note that I used a simple nutch crawl command, rather than the “whole web” crawling methodology:

nutch crawl urls.txt -dir /usr/tmp/19sites -depth 10

Or is it better to use the -topN option?

Some documentation for topN:

“Re: How to terminate the crawl?”

“You can limit the number of pages by using the -topN parameter. This limits the number of pages fetched in each round. Pages are prioritized by how well-linked they are. The maximum number of pages that can be
fetched is topN*depth.”

Or from the tutorial:

-topN N determines the maximum number of pages that will be retrieved at each level up to the depth.

For example, a typical call might be:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50

Typically one starts testing one’s configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, then an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can be from tens of thousands to millions, depending on your resources.
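
As a quick sanity check on the topN*depth bound from the mailing list reply, here it is applied to the tutorial’s example call (the variable names are mine):

```sh
# Upper bound on pages fetched by "nutch crawl": topN pages per round,
# one round per depth level. Values are from the tutorial's example call.
depth=3
topN=50
max_pages=$((topN * depth))
echo "at most $max_pages pages"
```

So the example call fetches at most 150 pages, which is why it makes a reasonable first test of a configuration.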

Here was one response to my question:

I use a iterative approach using a script similar to what Sami blogs about here:

Online indexing – integrating Nutch with Solr

excerpt:

There might be times when you would like to integrate Apache Nutch crawling with a single Apache Solr index server – for example when your collection size is limited to amount of documents that can be served by single Solr instance, or you like to do your updates on “live” index. By using Solr as your indexing server might even ease up your maintenance burden quite a bit – you would get rid of manual index life cycle management in Nutch and let Solr handle your index.

I then issue a crawl of 10,000 URLs at a time, and just repeat the process for as long as the window available. because I use solr to store the crawl results. It makes the index available during the crawl window. But I’m a relative newbie as well, so look forward what the experts say.

I looked at Sami Siren’s script; it’s pretty much the same as what I did at the top of this blog, except his script “will execute one iteration of fetching and indexing.”  The script’s only real difference is that it uses ‘SolrIndexer’ (that you write) rather than the normal Indexer class, org.apache.nutch.indexer.Indexer (here’s the Indexer javadoc).  I’m guessing that Indexer is what runs when you do “nutch index” from the command line.  Just to beat a dead horse a bit more, here’s an excerpt from Sami’s script:

bin/nutch inject $BASEDIR/crawldb urls 
checkStatus 
bin/nutch generate $BASEDIR/crawldb $BASEDIR/segments -topN $NUMDOCS 
checkStatus 
SEGMENT=`bin/hadoop dfs -ls $BASEDIR/segments|grep $BASEDIR|cut -f1|sort|tail -1` 
echo processing segment $SEGMENT 
bin/nutch fetch $SEGMENT -threads 20 
checkStatus 
bin/nutch updatedb $BASEDIR/crawldb $SEGMENT -filter 
checkStatus 
bin/nutch invertlinks $BASEDIR/linkdb $SEGMENT 
checkStatus 
bin/nutch org.apache.nutch.indexer.SolrIndexer $BASEDIR/crawldb $BASEDIR/linkdb $SEGMENT 
checkStatus

checkStatus is just a short function in the script that looks to see if any errors were generated by whatever command ran last.   I also note that Sami is using a hadoop command that I don’t understand; the NutchHadoopTutorial mentions ‘hadoop dfs’ … but I think I may be drifting off topic.
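
For reference, a helper in the spirit of checkStatus could look like this. It is my own hypothetical stand-in, not Sami’s code, and the name check_status and the message text are assumptions.

```sh
# Hypothetical stand-in for a checkStatus-style helper (not Sami's code):
# call it right after a command; it aborts the script if that command failed.
check_status() {
  status=$?   # must be read first, before any other command resets it
  if [ "$status" -ne 0 ]; then
    echo "last command failed with status $status, stopping" >&2
    exit "$status"
  fi
}
```

It only works when called immediately after the command being checked, since any intervening command overwrites the exit status.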

Here is the other response I got to my post:

In the past Andrzej put some stuff related to your issue in the Jira. Try to
look it up there.

Found it 🙂 http://issues.apache.org/jira/browse/NUTCH-368

NUTCH-368: Message queueing system (Sep 2006)

This is an implementation of a filesystem-based message queueing system. The motivation for this functionality is explained in HADOOP-490

HADOOP-490: Add ability to send “signals” to jobs and tasks

In some cases it would be useful to be able to “signal” a job and its tasks about some external condition, or to broadcast a specific message to all tasks in a job. Currently we can only send a single pseudo-signal, that is to kill a job.

This patch uses the message queueing framework to implement the following functionality in Fetcher:

* ability to gracefully stop fetching the current segment. This is different from simply killing the job in that the partial results (partially fetched segment) are available and can be further processed. This is especially useful for fetching large segments with long “tails”, i.e. pages which are fetched very slowly, either because of politeness settings or the target site’s bandwidth limitations.

* ability to dynamically adjust the number of fetcher threads. For a long-running fetch job it makes sense to decrease the number of fetcher threads during the day, and increase it during the night. This can be done now with a cron script, using the MsgQueueTool command-line.

It’s worthwhile to note that the patch itself is trivial, and most of the work is done by the MQ framework.

After you apply this patch you can start a long-running fetcher job, check its <jobId>, and control the fetcher this way:

bin/nutch org.apache.nutch.util.msg.MsgQueueTool -createMsg <job_id> ctrl THREADS 50

This adjusts the number of threads to 50 (starting more threads or stopping some threads as necessary).

Then run:

bin/nutch org.apache.nutch.util.msg.MsgQueueTool -createMsg <job_id> ctrl HALT

This will gracefully shut down all threads after they finish fetching their current url, and finish the job, keeping the partial segment data intact.
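
The day/night idea from the issue could be driven by a script like the one below. This is a minimal sketch under loud assumptions: it only makes sense with the NUTCH-368 patch applied, the function names are hypothetical, the hours and thread counts are made up, and it prints the command instead of running it.

```sh
# Hypothetical schedule: fewer fetcher threads during the day, more at night.
# The hours and thread counts are made up for illustration.
threads_for_hour() {
  hour=${1#0}            # strip one leading zero, so "08" becomes "8"
  if [ "${hour:-0}" -ge 8 ] && [ "${hour:-0}" -lt 20 ]; then
    echo 10
  else
    echo 50
  fi
}

# Print (rather than run) the MsgQueueTool call for a given job id and hour.
thread_command() {
  echo "bin/nutch org.apache.nutch.util.msg.MsgQueueTool -createMsg $1 ctrl THREADS $(threads_for_hour "$2")"
}
```

An hourly cron job could call thread_command with the job id and the output of date +%H, then execute the printed line.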

Susam Pal has posted (Aug 2007) a new script to crawl with nutch 0.9:

#!/bin/bash

# Runs the Nutch bot to crawl or re-crawl
# Usage: bin/runbot [safe]
#        If executed in 'safe' mode, it doesn't delete the temporary
#        directories generated during crawl. This might be helpful for
#        analysis and recovery in case a crawl fails.
#
# Author: Susam Pal

depth=2
threads=50
adddays=5
topN=2 # Comment this statement if you don't want to set topN value

# Parse arguments
if [ "$1" == "safe" ]
then
  safe=yes
fi

if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
  echo runbot: $0 could not find environment variable NUTCH_HOME
  echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
  echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi

if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
  echo runbot: $0 could not find environment variable CATALINA_HOME
  echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
else
  echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
fi

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi

steps=8
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN \
      -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth `expr $i + 1`. No more URLs to fetch."
    break
  fi
  segment=`ls -d crawl/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed. Deleting it."
    rm -rf $segment
    continue
  fi

  $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
  rm -rf crawl/segments/*
else
  mkdir crawl/FETCHEDsegments
  mv --verbose crawl/segments/* crawl/FETCHEDsegments
fi

mv --verbose crawl/MERGEDsegments/* crawl/segments
rmdir crawl/MERGEDsegments

echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb \
    crawl/linkdb crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes

if [ "$safe" != "yes" ]
then
  rm -rf crawl/NEWindexes
fi

echo "----- Reloading index on the search site (Step 8 of $steps) -----"
if [ "$safe" != "yes" ]
then
  touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
  echo Done!
else
  echo runbot: Can not reload index in safe mode.
  echo runbot: Please reload it manually using the following command:
  echo runbot: touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
fi

echo "runbot: FINISHED: Crawl completed!"

Susam comments as follows:

I have written this script to crawl with Nutch 0.9. Though, I have tried to take care that this should work for re-crawls as well, but I have never done any real world testing for re-crawls. I use this to crawl. You may try this out. We can make some changes if this is not found to be appropriate for re-crawls.
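
One detail of the script above worth spelling out is how it finds the segment that generate has just created. Nutch names segment directories by timestamp, so the last entry in a sorted listing is the newest one. A minimal sketch, using made-up segment names in a temporary directory (latest_segment is a hypothetical helper, not part of Nutch):

```shell
#!/bin/sh
# Print the newest segment under a segments directory. Segment names are
# timestamps, so the lexicographically last name is the most recent.
latest_segment() {
  ls -d "$1"/* | tail -1
}

# Demonstration with made-up segment names.
demo=$(mktemp -d)
mkdir "$demo/20070801120000" "$demo/20070815093000" "$demo/20070803180000"
newest=$(latest_segment "$demo")
echo "newest segment: $(basename "$newest")"   # prints: newest segment: 20070815093000
rm -rf "$demo"
```

This only works because the names sort in time order; a stray directory with a non-timestamp name under crawl/segments could be picked up instead.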


Introductory comments to this blog

From wikipedia:

Nutch is an effort to build an open source search engine based on Lucene Java for the search and index component.

I am writing this blog in order to publicly document my exploration of the nutch crawler and get feedback about what other folks have tried or discovered. I’ve already been using nutch for a few weeks so this blog doesn’t start completely at the beginning for me, but I’ll try to be explanatory in how I write here. Like many open source projects, nutch is poorly documented. This means that in order to find answers one has to make extensive use of google plus comb the nutch forums: nutch-user and nutch-dev. (Those links are hosted at http://www.mail-archive.com; they’re also hosted by http://www.nabble.com in a different format here: nutch-user and nutch-dev.) I’ve found that people are pretty responsive on nutch-user. The nutch to-do list, bugs, and enhancements are listed using JIRA software at issues.apache.org/jira/browse/Nutch.

Backdrop: I had latitude in making a choice of crawler/indexer, so in the beginning I read some general literature such as “Crawling the Web” by Gautam Pant, Padmini Srinivasan, and Filippo Menczer. On approaches to search, the entertaining “Psychosomatic addict insane” (2007) discusses latent semantic indexing and contextual network graphs. And let’s not forget spreading activation networks. Writing a crawler is not easy, so I looked at some Java-based open source crawlers and started examining Heritrix. After a conversation with Gordon Mohr of the Internet Archive I decided to go with nutch, as he said Heritrix was more focused on storing precise renditions of web pages and on storing multiple versions of the same page as it changes over time. On the other hand, nutch just stores text, and it directly creates and accesses Lucene indexes, whereas the Internet Archive also has to use NutchWax to interact with Lucene.

The current version of nutch is 0.9; but rather than the main release I’m using one of the nightly builds that fixes a bug I ran into (see the NUTCH-505 JIRA). The nightly build also has a more advanced RSS feed handler. But I’m getting ahead of myself.

The best overall introductory article to nutch I’ve found so far is the following two-parter written by Tom White in January of 2006.  It has a brief overall description of nutch’s architecture, then delves into the specifics of crawling a small example site; it tells how to set up nutch as well as tomcat, and what kind of sanity checks to do on the results you get back.

On the architecture:

Nutch divides naturally into two pieces: the crawler and the searcher. The crawler fetches pages and turns them into an inverted index, which the searcher uses to answer users’ search queries. The interface between the two pieces is the index, so apart from an agreement about the fields in the index, the two are highly decoupled. (Actually, it is a little more complicated than this, since the page content is not stored in the index, so the searcher needs access to the segments [a collection of pages fetched and indexed by the crawler in a single run] below in order to produce page summaries and to provide access to cached pages.)

The nutch site itself has a few items of note:

There is an article written by nutch author Doug Cutting as well as Rohit Khare, Kragen Sitaker, and Adam Rifkin. It has a clean description of nutch’s architecture and is entitled “Nutch: A Flexible and Scalable Open-Source Web Search Engine”.

Excerpt:

4.1 Crawling: An intranet or niche search engine might only take a single machine a few hours to crawl, while a whole-web crawl might take many machines several weeks or longer. A single crawling cycle consists of generating a fetchlist from the webdb, fetching those pages, parsing those for links, then updating the webdb. In the terminology of [4], Nutch’s crawler supports both a crawl-and-stop and crawl-and-stop-with-threshold (which requires feedback from scoring and specifying a floor). It also uses a uniform refresh policy; all pages are refetched at the same interval (30 days, by default) regardless of how frequently they change (there is no feedback loop yet, though the design of Page.java can set individual recrawl-deadlines on every page). The fetching process must also respect bandwidth and other limitations of the target website. However, any polite solution requires coordination before fetching; Nutch uses the most straightforward localization of references possible: namely, making all fetches from a particular host run on one machine.
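
To make the uniform refresh policy concrete, here is a simplified model of when a page becomes eligible for refetching, and of how the -adddays option used in Susam Pal’s script shifts that point. The is_due helper and the whole-day arithmetic are illustrative assumptions, not Nutch code; the 30-day interval is the default mentioned in the excerpt.

```shell
#!/bin/sh
# Simplified model (not Nutch code): a page is due when its last fetch plus
# the refetch interval is no later than "now" pushed forward by adddays.
# Arguments, all in whole days: days since last fetch, interval, adddays.
is_due() {
  days_since_fetch=$1
  interval=$2
  adddays=$3
  if [ $((days_since_fetch + adddays)) -ge "$interval" ]; then
    echo yes
  else
    echo no
  fi
}

is_due 10 30 0   # fetched 10 days ago, no adddays
is_due 26 30 0   # fetched 26 days ago, no adddays
is_due 26 30 5   # same page, generate run with -adddays 5
is_due 30 30 0   # fetched exactly one interval ago
```

Under this model, the script’s adddays=5 makes any page fetched 25 or more days ago a candidate for the next fetchlist, instead of waiting out the full 30 days.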

Another slide show (PDF) by Doug Cutting, “Nutch, Open-Source Web Search”, shows the architecture:

Nutch Architecture

Searcher: Given a query, it must quickly find a small relevant subset of a corpus of documents, then present them. Finding a large relevant subset is normally done with an inverted index of the corpus; ranking within that set to produce the most relevant documents, which then must be summarized for display.

Indexer: Creates the inverted index from which the searcher extracts results. It uses Lucene for storing indexes.

Web DB: Stores the document contents for indexing and later summarization by the searcher, along with information such as the link structure of the document space and the time each document was last fetched.

Fetcher: Requests web pages, parses them, and extracts links from them. Nutch’s robot has been written entirely from scratch.

There is a lengthy video presentation (71 minutes) with Doug Cutting, sponsored by IIIS in Helsinki, 2006. It has an associated PDF slide show entitled “Open Source Platforms for Search”. The presentation opens with a philosophical discourse on open source software, then gets down to a meaty technical discussion after about eight minutes. For instance, Doug explains that with a single person as administrator, nutch scales well up to about 100 million documents. Beyond that, billions of pages are “operationally onerous”.

One of the more widely linked articles by Doug Cutting and Mike Cafarella is “Building Nutch: Open Source Search” (printer friendly version). On page 3 they outline nutch’s operational costs; note that these dollar estimates were made in early 2004:

A typical back-end machine is a single-processor box with 1 gigabyte of RAM, a RAID controller, and eight hard drives. The filesystem is mirrored (RAID level 1) and provides 1 terabyte of reliable storage. Such a machine can be assembled for a cost of about $3,000…. A typical front-end machine is a single-processor box with 4 gigabytes of RAM and a single hard drive. Such a machine can be assembled for about $1,000…. Note that as traffic increases, front-end hardware quickly becomes the dominant hardware cost.

A 2007 paper from IBM Research entitled “Scalability of the Nutch Search Engine” explores some blade server configurations and uses mathematical models to conclude that nutch can scale well past the base cases they actually run.  Note that the paper is about the index/search aspect of nutch rather than the crawling.

Search workloads behave well in a scale-out environment. The highly parallel nature of this workload, combined with a fairly predictable behavior in terms of processor, network and storage scalability, makes search a perfect candidate for scale-out. Scalability to thousands of nodes is well within reach, based on our evaluation that combines measurement data and modeling.

Lucene is the searching/indexing component of nutch; one of the things that attracted me to nutch was that I would be able to have an end-to-end, customizable package to implement search. Either lucene or nutch can be used for the query processing; nutch just has a simpler query syntax: it is optimized for the most common web queries, so it doesn’t support OR queries, for instance. There are other crawlers, such as Heritrix, which is very robust and is used by the Internet Archive, and other indexers like Xapian, which is very performant. “Archiving ‘Katrina’ Lessons Learned” was a project that chose to use Heritrix and NutchWax. For now I’m happy with nutch+lucene. The one book I found that has much to say about Lucene (and even it has only minimal coverage of nutch) is Lucene in Action by Erik Hatcher and Otis Gospodnetic. I should also mention that the book has thorough coverage of Luke, a tool that is useful for playing with lucene indexes. The apache lucene mailing lists in searchable form are java-user and java-dev. The lucene FAQ is frequently updated.
