Classifying and Searching Hidden Web Databases
(Research Seminar, February 19th, 2004)

Panos Ipeirotis
Columbia University

Abstract
Many valuable text databases on the web have non-crawlable contents that are "hidden" behind search interfaces. Hence, traditional search engines do not index this valuable information. One way to facilitate access to "hidden-web" databases is through Yahoo!-like directories, which organize these databases manually into categories that users can browse. An alternative is through "metasearchers," which provide a unified query interface to search many databases at once. As part of my thesis, I have developed QProber, a system to automatically categorize and search autonomous, hidden-web databases. To categorize a database, QProber uses just the number of matches generated by a small number of query probes derived using state-of-the-art machine learning techniques. To search over "uncooperative" hidden-web databases, QProber exploits the database categorization to extract a small, topically focused document sample from each database, from which a statistical summary of the database contents is produced. The content summaries can then be used during metasearching to select the most appropriate databases for a given query, a critical task for search scalability and effectiveness. Specifically, QProber identifies the most relevant databases for a query by exploiting both the database classification information and the extracted summaries. QProber produces high-quality database selection decisions, which in turn help return highly relevant search results.
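
The idea of categorizing a database from probe match counts alone can be illustrated with a small sketch. This is not QProber's actual algorithm: the function name classify_by_probes, the probe lists, and the toy match counts are all hypothetical, and the probes here are hand-picked rather than derived by machine learning. The sketch only shows how per-category match counts could be turned into a category decision without retrieving any documents.

```python
from typing import Callable, Dict, List


def classify_by_probes(
    count_matches: Callable[[str], int],
    probes_by_category: Dict[str, List[str]],
) -> Dict[str, float]:
    """Return each category's share of the total probe match count.

    count_matches is assumed to send one query to the database's search
    interface and return only the number of matches it reports.
    """
    totals = {
        category: sum(count_matches(query) for query in queries)
        for category, queries in probes_by_category.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        # No probe matched anything, so no category can be preferred.
        return {category: 0.0 for category in totals}
    return {category: n / grand_total for category, n in totals.items()}


# Toy stand-in for a hidden-web database (made-up numbers).
FAKE_MATCH_COUNTS = {"cancer": 900, "diabetes": 600, "nba": 10, "soccer": 40}


def fake_count_matches(query: str) -> int:
    return FAKE_MATCH_COUNTS.get(query, 0)


# Hypothetical hand-written probes, one short list per category.
PROBES = {"Health": ["cancer", "diabetes"], "Sports": ["nba", "soccer"]}

shares = classify_by_probes(fake_count_matches, PROBES)
best_category = max(shares, key=shares.get)
print(best_category, shares)
```

In this toy setting the database would be filed under the category with the largest share of matches. A real system would also need thresholds for deciding when a database belongs to several categories or to none.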