Learning to Crawl: Topic Driven Web Crawlers (Research Seminar, February 12th, 2004)
 
 Gautam Pant
 University of Iowa
 
 Abstract
Topical or focused crawlers are programs that effectively fetch thousands to 
millions of Web pages on a topic or a set of topics. We model these smart Web 
crawlers as parallel best-first search algorithms and use alternative 
Web mining techniques to provide the guiding heuristics. Our multi-threaded 
Java implementation of the topical crawlers provides for flexible replacement 
of heuristics, respects crawling ethics, and consumes reasonable bandwidth. We 
motivate the research through some applications. We then discuss experiments 
with multiple versions of different classification schemes to guide topical 
crawlers. We suggest metrics to evaluate their performance, and try to explain 
the better performance of some of the classification schemes. In particular, 
our experiments show that a crawler based on a support vector machine (SVM) 
with a linear kernel provides superior performance with low training times.
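
The best-first search model mentioned in the abstract can be illustrated with a minimal sketch. It is not the speaker's Java implementation: it is sequential rather than parallel, it walks a small in-memory link graph instead of fetching pages, and it uses a simple keyword-overlap score in place of a trained classifier such as an SVM. The names `score`, `best_first_crawl`, and `PAGES` are hypothetical and chosen only for illustration.

```python
import heapq

def score(text, topic_terms):
    """Fraction of topic terms found in the page text (a stand-in heuristic)."""
    words = set(text.lower().split())
    return sum(1 for t in topic_terms if t in words) / len(topic_terms)

def best_first_crawl(pages, seeds, topic_terms, max_pages):
    """pages maps a URL to (text, outlinks); this toy graph replaces real fetching."""
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)  # most promising unvisited URL
        if url not in pages:
            continue
        text, outlinks = pages[url]
        visited.append(url)
        # Assumption: each outlink inherits the score of the page it was found on.
        parent_score = score(text, topic_terms)
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-parent_score, link))
    return visited

# Hypothetical toy Web: "b" is on topic, "a" is not.
PAGES = {
    "seed": ("web crawler index", ["a", "b"]),
    "a": ("cooking recipes pasta", ["c"]),
    "b": ("focused crawler svm classifier", ["d"]),
    "c": ("more pasta", []),
    "d": ("svm kernel crawler", []),
}
order = best_first_crawl(PAGES, ["seed"], ["crawler", "svm"], max_pages=4)
print(order)
```

With a budget of four pages, the link found on the on-topic page "b" is fetched ahead of the link found on the off-topic page "a", so "c" is never visited. Replacing `score` with a different function is the only change needed to try another guiding heuristic.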
 
 