Hidden-Web Databases: Classification and Search
(Research Seminar, December 3rd, 2002)
Luis Gravano
Computer Science Department, Columbia University
Abstract:
Many valuable text databases on the web have
non-crawlable contents that are "hidden" behind search
interfaces. Hence traditional search engines do not index this
valuable information. One way to facilitate access to
"hidden-web" databases is through commercial Yahoo!-like
directories, which have started to organize these databases manually into
categories that users can browse. In this talk, I will describe a
technique to automate the classification of hidden-web databases. Our
technique adaptively probes the databases with queries derived from
document classifiers, without retrieving any documents. A large-scale
experimental evaluation over 130 real web databases indicates that our
technique produces highly accurate database classification results using,
on average, fewer than 200 queries of four words or fewer to classify a
database.
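
To make the probing idea concrete, here is a minimal sketch in Python. It is not the algorithm presented in the talk. It assumes the search interface reports a number of matches for each query, so no documents need to be retrieved, and it uses a single flat level of categories instead of adaptive probing. The probe queries, the `min_share` threshold, and the toy match counts are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical probe queries per category, each four words or fewer.
# In the talk's technique such queries are derived from document classifiers.
PROBES: Dict[str, List[str]] = {
    "Health": ["cancer treatment", "diabetes"],
    "Sports": ["soccer league", "baseball"],
}


def classify_by_probing(
    match_count: Callable[[str], int],
    probes: Dict[str, List[str]],
    min_share: float = 0.4,  # assumed threshold, not taken from the talk
) -> List[str]:
    """Return the categories whose probes draw at least min_share of all matches."""
    # Only match counts are used; no documents are downloaded.
    totals = {
        category: sum(match_count(query) for query in queries)
        for category, queries in probes.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return [
        category
        for category, count in totals.items()
        if count / grand_total >= min_share
    ]


def toy_match_count(query: str) -> int:
    """Toy stand-in for a real search interface (invented numbers)."""
    counts = {
        "cancer treatment": 900,
        "diabetes": 600,
        "soccer league": 20,
        "baseball": 30,
    }
    return counts.get(query, 0)


print(classify_by_probing(toy_match_count, PROBES))
```

An adaptive version would repeat the same step with the subcategory probes of each category that passes the threshold, which is one plausible way to keep the total number of queries small.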
An alternative way to facilitate access to hidden-web databases is through
"metasearchers," which provide a unified query interface to
search many databases at once. For efficiency, a critical task for a
metasearcher is the selection of the most promising databases to search for
a query, a task that typically relies on statistical summaries of the
database contents. In this talk, I will also describe a recent
technique to derive content summaries from hidden-web databases. We exploit
our probing-based classification algorithm to adaptively zoom in on and
extract documents that are representative of the topic coverage of the
databases. We can then build content summaries from these topically focused
document samples. A large-scale experimental evaluation over a variety of
databases indicates that our new content-summary construction technique is
efficient and produces more accurate summaries than those from previously
proposed strategies.
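
As a rough illustration of what a content summary and its use in database selection might look like, the sketch below builds a summary from a small document sample and scores a database for a query. It is a simplified assumption and not the construction technique described in the talk: the summary records only the fraction of sampled documents that contain each word, and the scoring function (a product of those fractions) is a hypothetical stand-in for a real database selection algorithm.

```python
from collections import Counter
from typing import Dict, Iterable


def build_summary(sample_docs: Iterable[str]) -> Dict[str, float]:
    """Map each word to the fraction of sampled documents that contain it."""
    docs = [set(doc.lower().split()) for doc in sample_docs]
    if not docs:
        return {}
    # Count each word once per document (document frequency).
    doc_freq = Counter(word for words in docs for word in words)
    return {word: count / len(docs) for word, count in doc_freq.items()}


def score_database(summary: Dict[str, float], query: str) -> float:
    """Hypothetical score: product of per-word document fractions."""
    score = 1.0
    for word in query.lower().split():
        score *= summary.get(word, 0.0)
    return score


# Invented two-document sample for a health-oriented database.
summary = build_summary(["heart disease", "heart surgery"])
print(score_database(summary, "heart disease"))
print(score_database(summary, "soccer"))
```

A metasearcher could compute such a score for every database summary and send the query only to the highest-scoring databases.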