This package implements a very simple multipurpose web crawler. It is not the fastest crawler available, but it is very easy to use. Without writing any additional code, you can run it from the command line with the following command:

java edu.stanford.pubcrawl.newcrawler.Crawler startURL

This command creates a web crawler with 10 threads and starts it crawling at the URL startURL. There is no termination condition for the crawl - the process must be killed by the invoking shell.

The programmatic interface gives the user much more control over the crawl. The following code snippet illustrates how easy it is to create and run a crawler.

// Configure the crawl: 10 worker threads and a User-Agent string identifying the crawler.
int numThreads = 10;
String userAgent = "nlp.stanford.edu";
// Factories that vend the LinkScorer and PageHandler objects used during the crawl.
LinkScorerFactory linkScorerFactory = new BasicLinkScorerFactory();
PageHandlerFactory pageHandlerFactory = new SavingPageHandlerFactory();
Crawler c = new Crawler(numThreads, userAgent, linkScorerFactory, pageHandlerFactory);
// Start crawling from the given start URL.
String startURL = "http://www.stanford.edu";
c.crawl(startURL);

The key things to notice about this invocation are the LinkScorerFactory and PageHandlerFactory interfaces. These factory interfaces vend LinkScorer and PageHandler objects, respectively. The LinkScorer is responsible for assigning a double score to each discovered link, which is then used to rank the links in a priority queue for exploration. Thus, to change the crawl order, it is only necessary to implement a new LinkScorer. The BasicLinkScorer returned by the BasicLinkScorerFactory assigns a decreasing score to each subsequent link discovered, which induces a breadth-first crawling order. By defining appropriate LinkScorers it is possible to do "focused" crawling, depth-first crawling, etc.
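
To illustrate, here is a minimal sketch of a LinkScorer that induces a depth-first crawling order by assigning an increasing score to each subsequent link, so that the most recently discovered links are explored first. The method names used here (newLinkScorer and scoreLink) are assumptions made for illustration; consult the LinkScorer and LinkScorerFactory interfaces for the actual signatures.

public class DepthFirstLinkScorerFactory implements LinkScorerFactory {
    // Hypothetical factory method name; see LinkScorerFactory for the real one.
    public LinkScorer newLinkScorer() {
        return new LinkScorer() {
            private double nextScore = 0.0;
            // Assign an increasing score to each subsequent link, so the most
            // recently discovered links rank highest in the priority queue,
            // which induces a depth-first crawling order.
            public double scoreLink(String fromURL, String toURL) {
                return nextScore++;
            }
        };
    }
}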

The PageHandler is responsible for handling a page once it has been downloaded. The SavingPageHandler returned by the SavingPageHandlerFactory merely saves each page downloaded (including non-HTML binary files) to a file in the current directory on disk. By defining new PageHandlers it is possible to save only relevant pages, or to process pages in memory and extract information without saving them at all.
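
As a sketch of the second approach, the handler below processes each page in memory and simply reports its size instead of writing it to disk. The method names (newPageHandler and handlePage) and the String content parameter are assumptions made for illustration only; see the PageHandler and PageHandlerFactory interfaces for the actual signatures.

public class CountingPageHandlerFactory implements PageHandlerFactory {
    // Hypothetical factory method name; see PageHandlerFactory for the real one.
    public PageHandler newPageHandler() {
        return new PageHandler() {
            // Process the downloaded page in memory rather than saving it:
            // here we just report the page's URL and length.
            public void handlePage(String url, String content) {
                System.out.println(url + " : " + content.length() + " characters");
            }
        };
    }
}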

In practice, 10-20 threads work well for most applications. Use more threads if you are limited only by bandwidth. If instead you are limited by available memory, use fewer threads. (This helps because each thread maintains its own buffer for processing downloaded files.) If the bottleneck is processing power, changing the number of threads will probably not help much.

Send all questions and comments to the author Teg Grenager.