Introductory comments to this blog

From Wikipedia:

Nutch is an effort to build an open source search engine based on Lucene Java for the search and index component.

I am writing this blog to publicly document my exploration of the nutch crawler and to get feedback about what other folks have tried or discovered. I’ve already been using nutch for a few weeks, so this blog doesn’t start completely at the beginning for me, but I’ll try to be explanatory in how I write here. Like many open source projects, nutch is poorly documented. This means that to find answers one has to make extensive use of Google and comb the nutch forums: nutch-user and nutch-dev. (Those links are hosted at www.mail-archive.com; they’re also hosted by www.nabble.com in a different format here: nutch-user and nutch-dev.) I’ve found that people are pretty responsive on nutch-user. The nutch to-do list, bugs, and enhancements are tracked with JIRA at issues.apache.org/jira/browse/Nutch.

Backdrop: I had latitude in choosing a crawler/indexer, so in the beginning I read some general literature such as “Crawling the Web” by Gautam Pant, Padmini Srinivasan, and Filippo Menczer. On approaches to search, the entertaining “Psychosomatic addict insane” (2007) discusses latent semantic indexing and contextual network graphs. And let’s not forget spreading activation networks. Writing a crawler is not easy, so I looked at some Java-based open source crawlers and started examining Heritrix. After a conversation with Gordon Mohr of the Internet Archive I decided to go with nutch, since he said Heritrix was more focused on storing precise renditions of web pages and on storing multiple versions of the same page as it changes over time. Nutch, on the other hand, just stores text, and it directly creates and accesses Lucene indexes, whereas the Internet Archive also has to use NutchWax to interact with Lucene.

The current version of nutch is 0.9, but rather than the main release I’m using one of the nightly builds, which fixes a bug I ran into (see the NUTCH-505 JIRA). The nightly build also has a more advanced RSS feed handler. But I’m getting ahead of myself.

The best overall introductory article to nutch I’ve found so far is the following two-parter written by Tom White in January of 2006.  It has a brief overall description of nutch’s architecture, then delves into the specifics of crawling a small example site; it tells how to set up nutch as well as tomcat, and what kind of sanity checks to do on the results you get back.

On the architecture:

Nutch divides naturally into two pieces: the crawler and the searcher. The crawler fetches pages and turns them into an inverted index, which the searcher uses to answer users’ search queries. The interface between the two pieces is the index, so apart from an agreement about the fields in the index, the two are highly decoupled. (Actually, it is a little more complicated than this, since the page content is not stored in the index, so the searcher needs access to the segments [a collection of pages fetched and indexed by the crawler in a single run] below in order to produce page summaries and to provide access to cached pages.)
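To make the crawler/searcher split in that excerpt concrete, here is a minimal sketch in Python. It is not nutch or Lucene code: the `SEGMENT` dictionary, the example URLs, and every function name are hypothetical stand-ins I made up for illustration. It only shows the idea that the index maps terms to pages while the page text itself stays in the segment, which is why the searcher needs both.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for one crawl segment: URL -> fetched page text.
SEGMENT = {
    "http://example.com/a": "Nutch is an open source search engine",
    "http://example.com/b": "Lucene provides the search and index component",
    "http://example.com/c": "Heritrix is an archival crawler",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(segment):
    """Crawler side: map each term to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in segment.items():
        for term in tokenize(text):
            index[term].add(url)
    return index

def search(index, query):
    """Searcher side: return the URLs that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return hits

def summarize(segment, url, width=30):
    """The index holds no page text, so summaries are read from the segment."""
    return segment[url][:width]

index = build_index(SEGMENT)
for url in sorted(search(index, "search engine")):
    print(url, "->", summarize(SEGMENT, url))
```

A real index also stores term positions and scoring data, and nutch’s summaries are query-dependent; the fixed-width prefix here is just the simplest thing that shows where the text comes from.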

The nutch site itself has a few items of note:

There is an article written by nutch author Doug Cutting as well as Rohit Khare, Kragen Sitaker, and Adam Rifkin. It has a clean description of nutch’s architecture and is entitled “Nutch: A Flexible and Scalable Open-Source Web Search Engine”.

Excerpt:

4.1 Crawling: An intranet or niche search engine might only take a single machine a few hours to crawl, while a whole-web crawl might take many machines several weeks or longer. A single crawling cycle consists of generating a fetchlist from the webdb, fetching those pages, parsing those for links, then updating the webdb. In the terminology of [4], Nutch’s crawler supports both a crawl-and-stop and crawl-and-stop-with-threshold (which requires feedback from scoring and specifying a floor). It also uses a uniform refresh policy; all pages are refetched at the same interval (30 days, by default) regardless of how frequently they change (there is no feedback loop yet, though the design of Page.java can set individual recrawl-deadlines on every page). The fetching process must also respect bandwidth and other limitations of the target website. However, any polite solution requires coordination before fetching; Nutch uses the most straightforward localization of references possible: namely, making all fetches from a particular host run on one machine.
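The cycle described in that excerpt (generate a fetchlist, fetch, parse for links, update the webdb) can be sketched in a few lines of Python. This is my own toy model, not nutch’s implementation: the in-memory `FAKE_WEB`, the dictionary used as a webdb, and all function names are hypothetical, days are plain integers, and the CRC32-based `machine_for` is just one assumed way to pin a host to a machine. Only the 30-day interval and the one-host-one-machine rule come from the excerpt.

```python
import zlib
from urllib.parse import urlparse

REFETCH_INTERVAL_DAYS = 30  # uniform refresh interval named in the excerpt

# Hypothetical in-memory "web": URL -> list of outgoing links.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}

def generate(webdb, today):
    """Build a fetchlist of URLs never fetched or due for a refetch."""
    return sorted(
        url for url, last in webdb.items()
        if last is None or today - last >= REFETCH_INTERVAL_DAYS
    )

def fetch_and_parse(url):
    """Stand-in for fetching a page and extracting its links."""
    return FAKE_WEB.get(url, [])

def update(webdb, fetched, today):
    """Record fetch days and add newly discovered URLs as unfetched."""
    for url, links in fetched.items():
        webdb[url] = today
        for link in links:
            webdb.setdefault(link, None)

def crawl_cycle(webdb, today):
    """Run one generate/fetch/parse/update cycle and return its fetchlist."""
    fetchlist = generate(webdb, today)
    fetched = {url: fetch_and_parse(url) for url in fetchlist}
    update(webdb, fetched, today)
    return fetchlist

def machine_for(url, n_machines):
    """Send every URL of a host to the same machine (assumed hashing scheme)."""
    host = urlparse(url).netloc
    return zlib.crc32(host.encode("utf-8")) % n_machines

webdb = {"http://a.example/": None}  # seed URL, not yet fetched
for day in range(3):
    print(day, crawl_cycle(webdb, day))
```

Because every page shares the same interval, a page fetched on day 0 reappears in the fetchlist on day 30 whether or not it changed, which is exactly the missing feedback loop the excerpt mentions.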

Another slide show (PDF) by Doug Cutting, “Nutch, Open-Source Web Search”, shows the architecture:

Nutch Architecture

Searcher: Given a query, it must quickly find a small relevant subset of a corpus of documents, then present them. Finding a large relevant subset is normally done with an inverted index of the corpus; ranking within that set to produce the most relevant documents, which then must be summarized for display.

Indexer: Creates the inverted index from which the searcher extracts results. It uses Lucene storing indexes.

Web DB: Stores the document contents for indexing and later summarization by the searcher, along with information such as the link structure of the document space and the time each document was last fetched.

Fetcher: Requests web pages, parses them, and extracts links from them. Nutch’s robot has been written entirely from scratch.

There is a lengthy video presentation (71 minutes) with Doug Cutting, sponsored by IIIS in Helsinki, 2006. It has an associated PDF slide show entitled “Open Source Platforms for Search”. The introduction is a philosophical discourse on open source software; the meaty technical discussion starts after about eight minutes. For instance, Doug says that with a single person as administrator, nutch scales well up to about 100 million documents. Beyond that, billions of pages are “operationally onerous”.

One of the more widely linked articles by Doug Cutting and Mike Cafarella is “Building Nutch: Open Source Search” (printer friendly version). On page 3 they outline nutch’s operational costs; note that these dollar estimates were made in early 2004:

A typical back-end machine is a single-processor box with 1 gigabyte of RAM, a RAID controller, and eight hard drives. The filesystem is mirrored (RAID level 1) and provides 1 terabyte of reliable storage. Such a machine can be assembled for a cost of about $3,000…. A typical front-end machine is a single-processor box with 4 gigabytes of RAM and a single hard drive. Such a machine can be assembled for about $1,000…. Note that as traffic increases, front-end hardware quickly becomes the dominant hardware cost.

A 2007 paper from IBM Research entitled “Scalability of the Nutch Search Engine” explores some blade server configurations and uses mathematical models to conclude that nutch can scale well past the base cases they actually run.  Note that the paper is about the index/search aspect of nutch rather than the crawling.

Search workloads behave well in a scale-out environment. The highly parallel nature of this workload, combined with a fairly predictable behavior in terms of processor, network and storage scalability, makes search a perfect candidate for scale-out. Scalability to thousands of nodes is well within reach, based on our evaluation that combines measurement data and modeling.

Lucene is the searching/indexing component of nutch; one of the things that attracted me to nutch was that I would have an end-to-end, customizable package to implement search. Either lucene or nutch can be used for the query processing; nutch just has a simpler query syntax. It is optimized for the most common web queries, so it doesn’t support OR queries, for instance. There are other crawlers, such as Heritrix, which is very robust and is used by the Internet Archive, and other indexers, such as Xapian, which is very performant. ‘Archiving “Katrina” Lessons Learned’ was a project that chose to use Heritrix and NutchWax. For now I’m happy with nutch+lucene. The one book I found that has much to say about Lucene (and even it has only minimal coverage of nutch) is Lucene in Action by Erik Hatcher and Otis Gospodnetic. I should also mention that the book has thorough coverage of Luke, a tool that is useful for playing with lucene indexes. The apache lucene mailing lists in searchable form are java-user and java-dev. The lucene FAQ is frequently updated.
