Query Processing
over Continuous Data Streams
(Research Seminar, January 30th,
2003)
Minos Garofalakis
Bell Laboratories
Abstract
For several emerging application domains (e.g., network monitoring, fraud/anomaly
detection), large volumes of data arrive and need to be processed in real
time on a continuous basis, without the benefit of several passes over a
static, persistent data image. As a result, there is increasing
interest in the design of data-processing algorithms that work over such
continuous data streams, i.e., algorithms that work with limited memory to
answer user queries while looking at the relevant data items only once and
in the fixed order of arrival.
In this talk, I will discuss our work on providing approximate, guaranteed-quality
results to general SQL queries (with, possibly, multiple join operations)
over continuous data streams with limited memory. Our method relies
on randomizing techniques that compute small "sketch" synopses of
the streams that can then be used to provide approximate answers with
provable probabilistic guarantees on the approximation error. We also demonstrate
how existing statistical information on the base data (e.g., histograms)
can be used in the proposed framework to improve the quality of the
approximation provided by our algorithms. The key idea is to
intelligently partition the domain of the underlying attribute(s) and,
thus, decompose the sketching problem in a way that provably tightens our
guarantees. Finally, I will discuss some of our ongoing work that aims
to extend our techniques to deal with multiple standing SQL queries as well
as richer types of queries and streaming data.
|