Classifying and Searching Hidden Web Databases
(Research Seminar, February 19th, 2004)
Panos Ipeirotis
Columbia University
Abstract
Many valuable text databases on the web have non-crawlable contents that are
"hidden" behind search interfaces. Hence, traditional search engines do
not index this valuable information. One way to facilitate access to
"hidden-web" databases is through Yahoo!-like directories, which organize
these databases manually into categories that users can browse. An
alternative way is through "metasearchers," which provide a unified query
interface to search many databases at once. As part of my thesis, I have
developed QProber, a system to automatically categorize and search
autonomous, hidden-web databases. To categorize a database, QProber uses
just the number of matches generated by a small number of query probes
derived using state-of-the-art machine learning techniques. To search over
"uncooperative" hidden-web databases, QProber exploits the database
categorization to extract a small, topically focused document sample from
each database, from which a statistical summary of the database contents is
produced. The content summaries can then be used during metasearching to
select the most appropriate databases for a given query, a critical task for
search scalability and effectiveness. Specifically, QProber identifies the
most relevant databases for a query by exploiting both the database
classification information and the extracted summaries. QProber produces
high-quality database selection decisions, which in turn help return highly
relevant search results.
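
To make the categorization idea concrete, the following is a minimal sketch of classifying a database from nothing but the match counts returned for a few query probes. It is an illustration only, not QProber's actual algorithm: the function name classify_by_probes, the min_share threshold, the hand-written probes, and the toy match counts are all assumptions introduced here. In the system described above, the probes are derived with machine learning techniques rather than written by hand.

```python
from typing import Callable, Dict, List


def classify_by_probes(
    count_matches: Callable[[str], int],
    probes_by_category: Dict[str, List[str]],
    min_share: float = 0.4,
) -> List[str]:
    """Assign categories to a database using only probe match counts.

    count_matches is a hypothetical stand-in for sending one query to the
    database's search interface and reading back the number of matches.
    min_share is an illustrative threshold, not a value from the abstract.
    """
    # Total number of matches reported for each category's probes.
    totals = {
        category: sum(count_matches(query) for query in queries)
        for category, queries in probes_by_category.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        # No probe matched anything, so there is no evidence to classify on.
        return []
    # Keep every category that accounts for a large enough share of matches.
    return sorted(
        category
        for category, matches in totals.items()
        if matches / grand_total >= min_share
    )


# Toy database: made-up match counts for four single-word probes.
toy_counts = {"cancer": 120, "diabetes": 80, "soccer": 3, "baseball": 2}
probes = {"Health": ["cancer", "diabetes"], "Sports": ["soccer", "baseball"]}

categories = classify_by_probes(lambda query: toy_counts.get(query, 0), probes)
print(categories)
```

The sketch never retrieves a document: it relies only on the match counts that a search interface reports, which is what makes this style of probing inexpensive.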