Abstract: Classifying and Searching Hidden Web Databases

(Research Seminar, February 19th, 2004)

Panos Ipeirotis
Columbia University

Abstract: Many valuable text databases on the web have non-crawlable contents that are "hidden" behind search interfaces. Hence, traditional search engines do not index this valuable information. One way to facilitate access to "hidden-web" databases is through Yahoo!-like directories, which organize these databases manually into categories that users can browse. An alternative way is through "metasearchers," which provide a unified query interface to search many databases at once. As part of my thesis, I have developed QProber, a system to automatically categorize and search autonomous, hidden-web databases. To categorize a database, QProber uses just the number of matches generated by a small number of query probes derived using state-of-the-art machine learning techniques. To search over "uncooperative" hidden-web databases, QProber exploits the database categorization to extract a small, topically focused document sample from each database, from which a statistical summary of the database contents is produced. The content summaries can then be used during metasearching to select the most appropriate databases for a given query, a critical task for search scalability and effectiveness. Specifically, QProber identifies the most relevant databases for a query by exploiting both the database classification information and the extracted summaries. QProber produces high-quality database selection decisions, which in turn help return highly relevant search results.
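
To make the idea of categorizing a database from match counts alone more concrete, here is a minimal sketch in Python. It is not QProber's actual algorithm: the function name classify_by_probes, the min_share threshold, the probe words, and the toy counts are all assumptions made for illustration. The abstract states that the real probes are derived with machine learning techniques, whereas this sketch hard-codes them. It sums the number of matches reported for each category's probes and keeps the categories that account for a large enough share of all matches.

```python
from typing import Callable, Dict, List


def classify_by_probes(
    count_matches: Callable[[str], int],
    probes_by_category: Dict[str, List[str]],
    min_share: float = 0.4,  # hypothetical threshold, not taken from the talk
) -> List[str]:
    """Assign categories to a database using only probe match counts.

    count_matches stands in for the database's search interface: it takes
    a query string and returns the number of matches the database reports.
    """
    # Total number of matches reported for each category's probes.
    totals = {
        category: sum(count_matches(query) for query in queries)
        for category, queries in probes_by_category.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []  # no probe matched anything, so no evidence for any category
    # Keep categories whose share of all matches reaches the threshold.
    return sorted(
        category
        for category, total in totals.items()
        if total / grand_total >= min_share
    )


# Toy stand-in for a hidden-web database (made-up counts, illustration only).
FAKE_COUNTS = {"hypertension": 1200, "cardiology": 900, "touchdown": 3, "playoffs": 5}


def fake_count_matches(query: str) -> int:
    return FAKE_COUNTS.get(query, 0)


# Hand-written probes; a real system would learn these from training data.
PROBES = {
    "Health": ["hypertension", "cardiology"],
    "Sports": ["touchdown", "playoffs"],
}

labels = classify_by_probes(fake_count_matches, PROBES)
print(labels)
```

With these toy counts, nearly all matches come from the "Health" probes, so that is the only category assigned. The point of the sketch is that no documents are retrieved: the classifier sees only the match counts that the search interface reports.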