Project QUASI-CUBES: Exploiting Approximations in Multidimensional Data Sets

MOTIVATION
A data cube is a popular abstraction for multidimensional data. A cube is simply a multidimensional structure that contains at each point an aggregate value, i.e., the result of applying an aggregate function to an underlying relation. Data cubes are used to implement On-Line Analytical Processing (OLAP) in a variety of fields (business, government and scientific). Even though vendors have been selling products to support data cubes for a while, it is accepted that OLAP products do not scale to large datasets or high dimensions. Two obstacles make scaling difficult. First, there is the issue of database explosion: even though the multidimensional cube is usually sparse, materializing every cell is very often prohibitive. Materializing every cell would nevertheless be the ideal solution, since it would greatly improve query performance. (If a cell is not materialized, its value needs to be computed from the underlying data.) Second, the demands on query performance are strict: analysts need answers quickly so that they can decide which question to ask the system next. This is aggravated by the fact that queries are ad hoc, which makes it difficult to optimize all of them. So, even if all cells are materialized, a large variety of queries still needs to be supported efficiently.
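To make the cube abstraction and the explosion problem concrete, here is a minimal sketch that builds every aggregate cell from a tiny relation and counts the cells of a fully materialized cube. The `ALL` marker, the function names, the sample rows, and the choice of SUM as the aggregate are assumptions made for illustration only; they are not taken from the project.

```python
from itertools import product
from math import prod

ALL = "*"  # hypothetical marker meaning "aggregated over this dimension"

def build_cube(rows, n_dims):
    """Sum the measure into every cell a row touches, rolled-up cells included."""
    cube = {}
    for *dims, measure in rows:
        # Each row contributes to 2**n_dims cells: keep or roll up each dimension.
        for mask in product((False, True), repeat=n_dims):
            cell = tuple(ALL if rolled else d for d, rolled in zip(dims, mask))
            cube[cell] = cube.get(cell, 0) + measure
    return cube

def full_cube_size(cardinalities):
    """Cells in a fully materialized cube: each dimension gains one ALL value."""
    return prod(c + 1 for c in cardinalities)

# Illustrative relation: (region, product, sales).
rows = [
    ("east", "tv", 10),
    ("east", "radio", 5),
    ("west", "tv", 7),
]
cube = build_cube(rows, n_dims=2)
print(cube[("east", ALL)], cube[(ALL, ALL)])
print(len(cube), "non-empty cells out of", full_cube_size([2, 2]))
print(full_cube_size([100] * 6), "cells for six dimensions of 100 values each")
```

Even in this toy case one cell, (west, radio), is empty. With six dimensions of 100 values each, the full cube has 101 to the sixth power cells, which is why materializing everything quickly becomes prohibitive.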
PROJECT GOALS
Develop, implement and evaluate techniques for efficiently reducing the storage needs of a multidimensional cube. The techniques will be based on modeling regions of the cube, thereby providing succinct descriptions of the data, while storing the cell values that do not fit the models well (outliers) to reduce the errors caused by the estimation. Several models need to be evaluated with respect to issues such as the time to build the models, the storage savings achieved, and the errors incurred in the estimation. We will implement the techniques and evaluate them experimentally.

Develop, implement and evaluate techniques to support multiresolution in data cubes. These techniques will allow users to obtain results with a requested error level attached to the answer. The larger the tolerated error, the faster the answer is obtained. Users can ``zoom in'' as desired, reducing the level of error incrementally. Alternatively, the system can do this automatically, providing rough estimates at the beginning and then incrementally refining the answers in front of the users. This procedure eliminates lengthy waits for answers to materialize (latency), which tend to irritate most users. Moreover, it puts the users in control of the system, allowing them to stop the computation when the quality of the answers satisfies their needs. We will implement these techniques, which will also be based on modeling regions of the cube and classifying the cells according to their error level. We will evaluate the techniques experimentally and study the tradeoffs involved.

Develop, implement and evaluate techniques to do data mining on multidimensional data. The information obtained in the process of modeling regions of the cube is a valuable starting point for mining patterns in the data. Moreover, the information about outliers is valuable in identifying cells that have an ``abnormal'' behavior (with respect to other cells in the region). We hope to develop a set of tools to effectively mine data in the cube.

Implement a prototype Database Management System for OLAP, using the concepts and tradeoffs studied in the previous objectives. This prototype will be able to manage large multidimensional datasets and provide the tools to do mining and analysis of the data.
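The idea of modeling a region and keeping only the badly fitting cells as outliers can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes a two-dimensional region and uses an independence model (each cell is estimated from its row total and column total), which is the simplest kind of loglinear model. The function names, the `tolerance` parameter, and the sample data are all hypothetical.

```python
def compress_region(region, tolerance):
    """Describe a 2-D region by its marginal totals plus the cells the model misses."""
    row_sums = [sum(row) for row in region]
    col_sums = [sum(col) for col in zip(*region)]
    total = sum(row_sums)
    outliers = {}
    for i, row in enumerate(region):
        for j, actual in enumerate(row):
            # Independence model: estimate = row total * column total / grand total.
            estimate = row_sums[i] * col_sums[j] / total if total else 0.0
            if abs(actual - estimate) > tolerance:
                outliers[(i, j)] = actual  # stored exactly, so its error is zero
    return row_sums, col_sums, total, outliers

def estimate_cell(model, i, j):
    """Answer a cell query from the compressed description."""
    row_sums, col_sums, total, outliers = model
    if (i, j) in outliers:
        return outliers[(i, j)]
    return row_sums[i] * col_sums[j] / total if total else 0.0

# Illustrative region: multiplicative structure with one inflated cell at (2, 2).
region = [
    [10, 20, 30],
    [20, 40, 60],
    [30, 60, 150],
]
for tolerance in (10, 5, 0):
    model = compress_region(region, tolerance)
    print(tolerance, len(model[3]), "outliers stored")
```

Loosening the tolerance stores fewer outliers, which is the storage versus error tradeoff the goals describe. A region with n rows and m columns is described by n + m + 1 totals plus its outliers, so the savings only appear on regions much larger than this 3 by 3 example. Every answer returned by `estimate_cell` is within `tolerance` of the true value, so the same tolerance could serve as the error level attached to a multiresolution answer.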
PUBLICATIONS

Using loglinear models to compress datacubes by Daniel Barbará and Xintao Wu. (PostScript)

Using Approximations to Scale Exploratory Data Analysis in Datacubes by Daniel Barbará and Xintao Wu. (Published in ACM KDD99) (PostScript)

Quasi-Cubes: A space-efficient way to support approximate multidimensional databases (PostScript)

The New Jersey Data Reduction Report (PostScript)

PEOPLE

Daniel Barbará (Principal Investigator)

Xintao Wu (Graduate Student)

Oscar Ousinegad (Graduate Student)
