Query Processing over Continuous Data Streams

 

(Research Seminar, January 30th, 2003)

Minos Garofalakis
Bell Laboratories

Abstract


For several emerging application domains (e.g., network monitoring, fraud/anomaly detection), large volumes of data arrive and need to be processed in real time on a continuous basis, without the benefit of several passes over a static, persistent data image.  As a result, there is increasing interest in the design of data-processing algorithms that work over such continuous data streams, i.e., algorithms that work with limited memory to answer user queries while looking at the relevant data items only once and in the fixed order of arrival.

In this talk, I will discuss our work on providing approximate, guaranteed-quality results to general SQL queries (with, possibly, multiple join operations) over continuous data streams with limited memory.  Our method relies on randomizing techniques that compute small "sketch" synopses of the streams that can then be used to provide approximate answers with provable probabilistic guarantees on the approximation error. We also demonstrate how existing statistical information on the base data (e.g., histograms) can be used in the proposed framework to improve the quality of the approximation provided by our algorithms.  The key idea is to intelligently partition the domain of the underlying attribute(s) and, thus, decompose the sketching problem in a way that provably tightens our guarantees.  Finally, I will discuss some of our ongoing work that aims to extend our techniques to deal with multiple standing SQL queries as well as richer types of queries and streaming data.