Learning to Crawl: Topic Driven Web Crawlers
(Research Seminar, February 12th, 2004)

Gautam Pant
University of Iowa

Abstract
Topical or focused crawlers are programs that effectively fetch thousands to millions of Web pages on a topic or a set of topics. We model these smart Web crawlers as parallel best-first search algorithms and use alternative Web mining techniques to provide the guiding heuristics. Our multi-threaded Java implementation of the topical crawlers allows heuristics to be swapped flexibly, respects crawling ethics, and consumes reasonable bandwidth. We motivate the research through some applications. We then discuss experiments in which multiple versions of different classification schemes guide the topical crawlers. We suggest metrics to evaluate their performance and try to explain why some of the classification schemes perform better than others. In particular, our experiments show that a crawler based on a support vector machine (SVM) with a linear kernel provides superior performance with low training times.
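
To make the best-first view concrete, here is a minimal single-threaded sketch in Java. It is not the implementation described in the talk: the class name BestFirstCrawlerSketch, the crawl method, and the in-memory link map are all hypothetical, no pages are fetched over the network, and the heuristic parameter is only a stand-in for whatever score a trained classifier would assign to an unvisited URL. The multi-threading, crawling ethics, and bandwidth control mentioned in the abstract are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.function.ToDoubleFunction;

/** Hypothetical sketch: best-first crawl over an in-memory link graph (no network access). */
class BestFirstCrawlerSketch {

    /** A frontier entry: a URL plus the heuristic score it received when discovered. */
    static final class Candidate {
        final String url;
        final double score;

        Candidate(String url, double score) {
            this.url = url;
            this.score = score;
        }
    }

    /**
     * Visits up to maxPages URLs, always expanding the highest-scored frontier entry next.
     *
     * @param links     assumed stand-in for fetching: maps a URL to the URLs it links to
     * @param seeds     starting URLs
     * @param heuristic pluggable scoring function (assumption: higher means more on-topic)
     * @param maxPages  crawl budget
     * @return URLs in the order they were visited
     */
    static List<String> crawl(Map<String, List<String>> links,
                              List<String> seeds,
                              ToDoubleFunction<String> heuristic,
                              int maxPages) {
        // Max-priority frontier: the best-scored candidate is polled first.
        PriorityQueue<Candidate> frontier = new PriorityQueue<>(
                Comparator.comparingDouble((Candidate c) -> c.score).reversed());
        Set<String> seen = new HashSet<>();      // URLs already queued or visited
        List<String> visited = new ArrayList<>();

        for (String seed : seeds) {
            if (seen.add(seed)) {
                frontier.add(new Candidate(seed, heuristic.applyAsDouble(seed)));
            }
        }

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            Candidate best = frontier.poll();
            visited.add(best.url);               // "fetch" the page
            for (String out : links.getOrDefault(best.url, Collections.emptyList())) {
                if (seen.add(out)) {             // score and queue each newly found link once
                    frontier.add(new Candidate(out, heuristic.applyAsDouble(out)));
                }
            }
        }
        return visited;
    }
}
```

Because the heuristic is passed in as a function, swapping one scoring scheme for another only changes the argument to crawl, which is one simple way to get the kind of replaceable heuristics the abstract mentions.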