Hidden-Web Databases: Classification and Search


(Research Seminar, December 3rd, 2002)

Luis Gravano
Computer Science Department, Columbia University


Many valuable text databases on the web have non-crawlable contents that are "hidden" behind search interfaces. Hence, traditional search engines do not index this valuable information. One way to facilitate access to "hidden-web" databases is through commercial Yahoo!-like directories, which have started to organize these databases manually into categories that users can browse. In this talk, I will describe a technique to automate the classification of hidden-web databases. Our technique adaptively probes the databases with queries derived from document classifiers, without retrieving any documents. A large-scale experimental evaluation over 130 real web databases indicates that our technique produces highly accurate database classification results, using on average fewer than 200 queries of at most four words each to classify a database.
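The probing idea can be illustrated with a minimal sketch. It is not the algorithm presented in the talk: the category hierarchy, the probe queries, the thresholds `min_share` and `min_matches`, and the `match_count` callback (standing in for "submit a query and read the number of matches the database reports") are all hypothetical, chosen only to show how match counts alone, with no document retrieval, can push a database down a topic hierarchy.

```python
from typing import Callable, Dict, List

# Hypothetical topic hierarchy: each category maps to its subcategories.
HIERARCHY: Dict[str, List[str]] = {
    "Root": ["Health", "Sports"],
    "Health": ["Diseases", "Fitness"],
    "Sports": [],
    "Diseases": [],
    "Fitness": [],
}

# Hypothetical probes; in practice they would be derived from a document
# classifier trained for each category.
PROBES: Dict[str, List[str]] = {
    "Health": ["cancer treatment", "blood pressure"],
    "Sports": ["soccer league", "basketball playoffs"],
    "Diseases": ["diabetes insulin", "hepatitis virus"],
    "Fitness": ["aerobic exercise", "weight training"],
}


def classify(match_count: Callable[[str], int], node: str = "Root",
             min_share: float = 0.4, min_matches: int = 10) -> List[str]:
    """Return the most specific categories the database appears to cover.

    Only the subcategories that pass both (assumed) thresholds are probed
    further, which is what makes the probing adaptive.
    """
    children = HIERARCHY[node]
    # Total number of matches reported for each child's probes.
    counts = {c: sum(match_count(q) for q in PROBES[c]) for c in children}
    total = sum(counts.values())
    assigned: List[str] = []
    for child in children:
        enough = counts[child] >= min_matches
        focused = total > 0 and counts[child] / total >= min_share
        if enough and focused:
            assigned.extend(classify(match_count, child, min_share, min_matches))
    # If no subcategory qualifies, the database stays at the current node.
    return assigned or [node]


# A fake database that only reports match counts, as a search interface would.
FAKE_COUNTS = {
    "cancer treatment": 900, "blood pressure": 700,
    "soccer league": 5, "basketball playoffs": 3,
    "diabetes insulin": 800, "hepatitis virus": 600,
    "aerobic exercise": 20, "weight training": 30,
}
print(classify(lambda q: FAKE_COUNTS.get(q, 0)))  # ['Diseases']
```

In this sketch the Sports subtree is never probed beyond its top-level queries, which is how an adaptive strategy keeps the number of queries per database small.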

An alternative way to facilitate access to hidden-web databases is through "metasearchers," which provide a unified query interface to search many databases at once. For efficiency, a critical task for a metasearcher is selecting the most promising databases to search for a given query, a task that typically relies on statistical summaries of the database contents. In this talk, I will also describe a recent technique to derive content summaries from hidden-web databases. We exploit our probing-based classification algorithm to adaptively zoom in on and extract documents that are representative of the topic coverage of the databases. We can then build content summaries from these topically focused document samples. A large-scale experimental evaluation over a variety of databases indicates that our new content-summary construction technique is efficient and produces more accurate summaries than previously proposed strategies.
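As a rough illustration of how a metasearcher might use such summaries, the sketch below builds a content summary (per-word document frequencies) from a document sample and ranks databases for a query. It assumes the sample has already been extracted; the focused sampling step is not shown. The function names and the scoring rule (the product of the fractions of sampled documents containing each query word) are assumptions made for illustration, not the method evaluated in the talk.

```python
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_summary(sample_docs: List[str]) -> dict:
    """Content summary: sample size plus document frequency of each word."""
    df: Counter = Counter()
    for doc in sample_docs:
        df.update(set(tokenize(doc)))  # count each word once per document
    return {"num_docs": len(sample_docs), "df": df}


def score(summary: dict, query: str) -> float:
    """Assumed scoring rule: product of per-word document fractions."""
    n = summary["num_docs"]
    if n == 0:
        return 0.0
    result = 1.0
    for word in tokenize(query):
        result *= summary["df"][word] / n  # Counter yields 0 for unseen words
    return result


def rank_databases(summaries: Dict[str, dict], query: str) -> List[str]:
    """Order database names from most to least promising for the query."""
    return sorted(summaries, key=lambda name: score(summaries[name], query),
                  reverse=True)


# Hypothetical document samples for two databases.
summaries = {
    "medical": build_summary(["Cancer treatment options",
                              "Blood pressure and cancer risk"]),
    "sports": build_summary(["Soccer league results",
                             "Basketball playoff scores"]),
}
print(rank_databases(summaries, "cancer"))  # ['medical', 'sports']
```

The quality of such a ranking depends on how representative the sample is, which is why the sampling strategy matters.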