Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Totals
Gray et al.
Presented by: Priya Rajan
Introduction
The increasing popularity of data warehouses has increased the demand for querying, reporting and OLAP tools.
We need to extract maximum information from data and be able to analyze and aggregate it across multiple dimensions.
Cross tabs over various dimensions are particularly useful because aggregates convey a lot of information when the amount of data is large.
Traditional Representation of N-Dimensional Data
N-dimensional data is represented as a relation with N attribute columns. Example: here Sales Quantity is a function of Supplier, Product and Date.

Sales Table:

Date_ID  Product  Supplier  Loc_ID  Sales Quantity
3        P2       S2        1       11
3        P2       S2        3       39
2        P3       S1        2       18
1        P1       S1        4       10

For example, the first row records a sales quantity of 11 for product P2 from supplier S2 on Date_ID 3 at Loc_ID 1.
Aggregation Queries
Standard SQL provides COUNT(), SUM(), MIN(), MAX() and AVG().
Example: SELECT AVG(Sales_Quantity) FROM Sales
SQL also provides a GROUP BY operator, which produces 0- and 1-dimensional aggregates.
Example:

SELECT Supplier, Product, AVG(Sales_Quantity) AS A
FROM Sales
GROUP BY Supplier, Product;

Result:

Supplier  Product  A
S1        P1       10
S1        P3       18
S2        P2       25
Inadequacy of GROUP BY
Cannot perform aggregation over computed categories (e.g., histograms) directly.
Example: suppose suppliers and locations together define a 'factory', and a Factory() function maps (supplier, Loc_ID) pairs to a factory. It would be nice to write the query to find maximum daily sales for each factory as:

SELECT Date, Factory(Supplier, Loc_ID) AS factory, MAX(Sales)
FROM Sales
GROUP BY Date, Factory(Supplier, Loc_ID);

Instead we have to compute the table indirectly and then perform the aggregation (a nested query):

SELECT Date, factory, MAX(Sales)
FROM (SELECT Date, Factory(Supplier, Loc_ID) AS factory, Sales
      FROM Sales) AS Foo
GROUP BY Date, factory;
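To make the workaround concrete, here is a small sketch using SQLite from Python. The table layout and the Factory() mapping are illustrative assumptions, not the paper's actual data; the subquery mirrors the nested form shown above (note that many modern SQL dialects do allow grouping by an expression directly).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date_id INT, supplier TEXT, loc_id INT, quantity INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (1, "S1", 1, 10),
    (1, "S1", 2, 18),
    (1, "S2", 3, 39),
    (2, "S2", 3, 11),
])

# Hypothetical Factory() mapping from (supplier, loc_id) to a factory name.
conn.create_function("factory", 2, lambda s, loc: f"{s}-F{loc % 2}")

# The computed category is materialized in a subquery first, then grouped on,
# as the slide describes.
rows = conn.execute("""
    SELECT date_id, f, MAX(quantity)
    FROM (SELECT date_id, factory(supplier, loc_id) AS f, quantity FROM sales)
    GROUP BY date_id, f
    ORDER BY date_id, f
""").fetchall()
print(rows)
```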
Inadequacy of GROUP BY
In analyzing data we often need to "roll up" (e.g., go from sales/city to sales/state) or "drill down" (vice versa).
Problem with GROUP BY for roll-ups and drill-downs: the subtotal must be stored at each level of the aggregation.
Rollup Representations
Need 2N columns to represent a rollup of N elements!
Alternative: introduce an 'ALL' value. Problem: need to union many GROUP BYs.
Rollup Representations
SELECT Model, ALL, ALL, SUM(Sales)FROM Sales WHERE Model = 'Chevy'
GROUP BY ModelUNION SELECT Model, Year, ALL, SUM(Sales)
FROM SalesWHERE Model = 'Chevy'GROUP BY Model, Year
UNIONSELECT Model, Year, Color, SUM(Sales)
FROM SalesWHERE Model = 'Chevy'GROUP BY Model, Year, Color;
Note that aggregation by year is not included
To add that in we would have to Union in SELECT Model, ALL, Color, SUM(Sales)FROM SalesWHERE Model = 'Chevy'GROUP BY Model, Color;
Query is really complex-for a cross tab in 6-dimensions need 26=64unions of GROUP BY statements!
This is the motivation for a new operator!
To arrive at the previous table we would have to do the following complex query:
The CUBE Operator
The CUBE operator builds a table with all possible aggregated values constructed by grouping; it is the generalization of a simple aggregate function to N dimensions.
The 0-D cube is the aggregate of all values. Example: total sales over all models, years and colors.
The 1-D cube is a GROUP BY on one dimension. Example: GROUP BY Color.
The 2-D cube is a cross tab.
The 3-D cube is an intersection of three 2-D cubes.
This can be generalized to N dimensions.
Intuition behind the CUBE

[Figure: the data cube and its sub-space aggregates for dimensions Make (Chevy, Ford), Year, and Color (red, white, blue). The 0-D cube is the single grand-total Sum; the 1-D cubes are the group-bys By Make, By Year and By Color (a group by with total); the 2-D cubes are the cross tabs By Make & Color, By Make & Year and By Color & Year; the 3-D cube is the full data cube.]
Creating a CUBE
The syntax for a cube is:

SELECT <select-list>
FROM <relation>
GROUP BY CUBE <select-list>

To create the global cube we need to generate the power set of the aggregation columns:

{Model, Year, Color}
{Model, Year}  {Model, Color}  {Year, Color}
{Model}  {Year}  {Color}
{ }
Creating a CUBE
The CUBE operator first aggregates over the <select-list> just as GROUP BY would. It then unions in the 2^N - 1 super-aggregates shown in the previous picture.
'ALL' is substituted for the aggregated column, so every domain now has an extra 'ALL' value. The cube has (C1+1)*(C2+1)*...*(CN+1) values, where Ci is the cardinality of each domain.
The idea is that since Ci is usually large, the CUBE will be only a little bigger than the corresponding GROUP BY but provides a lot more information (not always true; see Conclusion).
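As a sketch (not the paper's implementation), the cube can be simulated by unioning one GROUP BY per subset of the dimensions, substituting "ALL" for the aggregated-out columns. The tiny sales table below is an illustrative assumption.

```python
from itertools import combinations
from collections import defaultdict

DIMS = ("model", "year", "color")
rows = [
    ("Chevy", 1994, "red", 5),
    ("Chevy", 1994, "blue", 3),
    ("Chevy", 1995, "red", 4),
    ("Ford", 1994, "red", 7),
]

cube = defaultdict(int)
for r in range(len(DIMS) + 1):
    for kept in combinations(range(len(DIMS)), r):   # power set of dimensions
        for *dims, sales in rows:
            # 'ALL' stands in for every dimension not in this grouping.
            key = tuple(dims[i] if i in kept else "ALL" for i in range(len(DIMS)))
            cube[key] += sales

print(cube[("ALL", "ALL", "ALL")])    # grand total (0-D cube)
print(cube[("Chevy", "ALL", "ALL")])  # Chevy across all years and colors
```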
CUBE Creation: EXAMPLE
ROLLUP Operator
If the user only needs a roll-up report or a drill-down report, computing the entire cube is wasteful. The ROLLUP operator is used to compute only the super-aggregates. In the previous example, if ROLLUP were used in place of CUBE, only the following would be returned (marked in the figure in red):
Regular aggregation rows that would be produced by GROUP BY without using ROLLUP
First-level subtotals aggregating across Color for each combination of Model and Year
Second-level subtotals aggregating across Year and Color for each Model value
A grand total row
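The four row kinds above can be sketched in a few lines: ROLLUP aggregates only along one prefix chain of the grouping columns, (Model, Year, Color) -> (Model, Year) -> (Model) -> (), rather than the full 2^N subsets the CUBE needs. The toy data and column order are assumptions.

```python
from collections import defaultdict

rows = [
    ("Chevy", 1994, "red", 5),
    ("Chevy", 1994, "blue", 3),
    ("Chevy", 1995, "red", 4),
]

rollup = defaultdict(int)
for model, year, color, sales in rows:
    full = (model, year, color)
    for depth in range(len(full), -1, -1):   # prefixes, longest first
        key = full[:depth] + ("ALL",) * (len(full) - depth)
        rollup[key] += sales

print(rollup[("Chevy", 1994, "ALL")])   # first-level subtotal across Color
print(rollup[("Chevy", "ALL", "ALL")])  # second-level subtotal across Year, Color
print(rollup[("ALL", "ALL", "ALL")])    # grand total
```

Note that a key like ("ALL", 1994, "ALL") is never produced: that aggregate belongs to the CUBE but not to the ROLLUP.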
Integrating CUBE into SQL
GROUP BY <select-list>
ROLLUP <select-list>
CUBE <select-list>
Could support histograms by extending GROUP BY to allow aggregation over computed categories.
Use of 'ALL' to indicate 'aggregate' could cause problems: ALL is set-valued while other values are not. Could use NULL instead of ALL, with a GROUPING() function to distinguish an aggregation NULL from a real NULL.
Computing Cubes and Rollups
Aggregation computation techniques:
Compute aggregates at the lowest possible system level to minimize processing cost.
Use hashing to organize aggregation columns in memory.
If aggregates don't fit in memory, use hybrid hashing to organize data by value.
If data is spread over several disks, exploit parallelism: aggregate each partition and then combine the results.
Computing Cubes and Rollups
Simplest algorithm: the 2^N algorithm.
Allocate a handle for every cell in the cube. When a tuple is passed in, a function Iter is called on the handle and the tuple value; Iter is called 2^N times per tuple.
After all tuples have been consumed, a function Final is called to complete the aggregation.
If the base table has T tuples, Iter is called T x 2^N times. We can improve on this!
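A minimal sketch of this naive algorithm, with SUM standing in for the aggregate: every incoming tuple updates the handle of each of the 2^N cells it falls into, so Iter runs exactly T x 2^N times. The Iter/Final names follow the slide; the data is assumed.

```python
from itertools import combinations
from collections import defaultdict

N = 2
tuples = [("Chevy", "red", 5), ("Chevy", "blue", 3), ("Ford", "red", 7)]

handles = defaultdict(int)   # one running SUM per cube cell
iter_calls = 0

def Iter(key, value):
    global iter_calls
    handles[key] += value
    iter_calls += 1

for *dims, measure in tuples:
    # Each tuple touches every subset of the N dimensions: 2^N Iter calls.
    for kept in (s for r in range(N + 1) for s in combinations(range(N), r)):
        Iter(tuple(dims[i] if i in kept else "ALL" for i in range(N)), measure)

def Final(key):              # SUM needs no extra finalization step
    return handles[key]

print(iter_calls)            # T x 2^N = 3 x 4 = 12
print(Final(("ALL", "ALL"))) # grand total: 15
```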
Types of Aggregate Functions
Distributive functions: there is a function G such that F({Xi,j}) = G({F({Xi,j | i = 1,...,I}) | j = 1,...,J}). Examples: MIN(), COUNT(), SUM(), MAX().
Algorithm idea: the (N-1)-dimensional cube can be computed by aggregating the N-dimensional results.
Example: a cross tab (2-D cube) where SUM() is the distributive function. The 1st dimension is the sum of the 2-D rows, and the 0th dimension is the sum of the 1st dimension.
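The distributive idea can be sketched with SUM: the 1-D totals are computed from the 2-D cross tab's entries, and the 0-D total from the 1-D totals, without ever revisiting the base data. The toy cross tab values are assumed.

```python
cross_tab = {                       # 2-D cube: (model, color) -> SUM(sales)
    ("Chevy", "red"): 9, ("Chevy", "blue"): 3,
    ("Ford", "red"): 7, ("Ford", "blue"): 1,
}

# 1-D cube: aggregate the 2-D results along the color dimension.
by_model = {}
for (model, _color), total in cross_tab.items():
    by_model[model] = by_model.get(model, 0) + total

# 0-D cube: aggregate the 1-D results.
grand_total = sum(by_model.values())

print(by_model)      # {'Chevy': 12, 'Ford': 8}
print(grand_total)   # 20
```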
Types of Aggregation Functions
Algebraic functions: F(X) = H(G(X)), where G produces a fixed-size intermediate. Examples: AVG(), standard deviation. For AVG, G would return the sum and the count, and H would compute sum/count.
Algorithm idea: a handle is maintained for each cube cell and is passed on to the (N-1)-dimensional super-aggregates. A function Iter(&handle, &handle) is used to combine the aggregates. At the end, Final() is called.
Example: a cross tab where AVG() is the algebraic function.
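A sketch of the algebraic case for AVG: the handle G(X) is the pair (sum, count), handles combine upward into super-aggregates, and H = sum/count is applied only by Final(). The Iter/Final names follow the slide; the two-handle combining step is named Merge here for clarity, and the values are assumed.

```python
def Iter(handle, value):            # fold one tuple into a cell's handle
    s, c = handle
    return (s + value, c + 1)

def Merge(h1, h2):                  # combine two cell handles upward
    return (h1[0] + h2[0], h1[1] + h2[1])

def Final(handle):                  # H(G(X)) = sum / count
    s, c = handle
    return s / c

cell_a = cell_b = (0, 0)
for v in (10, 18):
    cell_a = Iter(cell_a, v)
for v in (11, 39):
    cell_b = Iter(cell_b, v)

super_cell = Merge(cell_a, cell_b)  # (N-1)-dimensional super-aggregate
print(Final(super_cell))            # AVG over all four values: 19.5
```

Note this would not work if the cells only stored their averages: (14.0 + 25.0) / 2 happens to equal 19.5 here only because the cells hold equal counts; the (sum, count) handle is what makes the combination correct in general.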
Aggregation Functions
Holistic functions: functions whose sub-aggregates cannot be described by a fixed storage bound. Examples: MEDIAN(), RANK().
No known algorithm is better than the 2^N algorithm for holistic functions.
Comments & Conclusion
The CUBE operator is useful in multi-dimensional data analysis and is popular in OLAP tools.
We can now answer queries that were previously either impossible or very complicated in SQL (using GROUP BY).
The claim that the CUBE is only slightly larger than the corresponding GROUP BY for large domains may not always be valid: often a large number of data values are null, so the CUBE is much larger than the GROUP BY.
The next logical issue is maintaining cubes. If the cube is materialized, updates must be considered: some functions need only a small incremental scan, but others need re-computation. Example: MAX() is cheap to maintain under INSERT but may require re-computation under DELETE.
Implementing Data Cubes Efficiently
Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman
Introduction
Decision Support Systems (DSS) rely a great deal on aggregations made over data in data warehouses. These queries can be very complex and can take a substantial amount of time, yet DSS needs them computed very quickly.
Idea: materialize (pre-compute) certain queries.
Pre-compute frequently asked queries.
Pre-compute infrequently asked queries so that the time saved can be used to compute other queries.
Materializing the Cube
Data warehouses represent data as a multi-dimensional cube to answer aggregation queries and enable analysis in multiple dimensions.
Computing every cell on request increases the response time, making it unacceptable for DSS. We need to materialize the cube. Options:
Materialize the entire cube: although pre-computing every cell gives excellent response times, the space consumption makes it infeasible.
Materialize parts of the cube: this is the problem explored in this paper, namely picking the optimal (or close to optimal) cells to materialize.
Dependencies in the Cube
Cells in the data cube that have the value "ALL" can be computed from other cells in the cube. Those that don't are independent; computing them requires querying the raw data.
We can materialize dependent cells ('views'). Interesting questions:
How many views should we materialize for good performance?
Which views should we materialize to minimize the average query cost?
Example

{Part, Supplier, Customer}
{Part, Customer}  {Part, Supplier}  {Supplier, Customer}
{Part}  {Supplier}  {Customer}
{ }

We must materialize {Part, Supplier, Customer} because we cannot derive it from any other view. Say its cost is 6M rows.
Suppose we want to answer a query on supplier-customer. If the cost of that view is also 6M rows, there is no point in materializing it, since we can derive it from the {Part, Supplier, Customer} view at the same cost. On the other hand, say the part view costs 0.2M rows and we need to query parts: then there is good reason to materialize it, because answering from the {Part, Supplier, Customer} view would mean processing 6M rows instead of 0.2M.
Lattice Structure & Hierarchies
We say Query1 "is dependent on" Query2 if the result set of Query2 can be used to answer Query1. For example, (part) is dependent on (part, customer). We can form a lattice diagram from these dependencies.
In addition, each dimension of the data cube can have a set of associated values, forming a hierarchy. Example for the time dimension: Day rolls up to Week and to Month, Month rolls up to Year, and Week and Year roll up to none (no grouping). Week is "dependent" on Day.
Combining Dimension Hierarchies with Query Dependencies
We need to combine these two concepts when deriving query dependencies, because the result of a query can contain any element of the hierarchy for a dimension.
If we have two tuples T1 and T2, then T1 is dependent on T2 iff every element of tuple T1 can be derived from (sits at or above, in its dimension hierarchy) the corresponding element of T2.
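A small sketch of this dependence test. Each dimension carries a partial order of granularities ("a can be computed from b"); T1 depends on T2 iff this holds attribute-by-attribute. The derivability edges below are illustrative, following the time hierarchy discussed earlier; note that week and month are deliberately incomparable, since weeks cross month boundaries.

```python
# Maps each granularity to the granularities it can be computed from.
DERIVABLE = {
    "none": {"none", "year", "month", "week", "day"},
    "year": {"year", "month", "day"},
    "month": {"month", "day"},
    "week": {"week", "day"},
    "day": {"day"},
}

def depends_on(t1, t2):
    # T1 depends on T2 iff every attribute of T1 is derivable from
    # the corresponding attribute of T2.
    return all(b in DERIVABLE[a] for a, b in zip(t1, t2))

print(depends_on(("week",), ("day",)))    # True: week rolls up from day
print(depends_on(("month",), ("week",)))  # False: weeks cross month edges
```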
Cost Model
If view V has been materialized and V can be used to answer a query V', then the cost of answering V' is assumed to be the number of rows in V.
In reality, more or less than one entire scan of V may be required: V may be clustered on a particular attribute or may have an index on an attribute, eliminating the need to scan it fully.
This uniform cost model is adopted to simplify the analysis. The authors experimentally showed a linear relationship between query size and running time.
Optimizing Data Cube Lattices
There is a space-time tradeoff that needs to be optimized.
Problem: minimize the time taken to evaluate queries, subject to materializing k views (this can easily be converted to a space constraint, and the algorithm is the same).
This problem is NP-complete, so a greedy algorithm is used to approximate the optimal solution.
Algorithm
Let C(v) be the cost of a view v. If S is the set of views chosen so far, the benefit of adding a view v to S is B(v, S), defined as follows.
For each view w that is dependent on v, let u be the least-cost view in S on which w depends. If C(v) < C(u) then Bw = C(u) - C(v); otherwise Bw = 0.
B(v, S) is then the sum of Bw over all w dependent on v.
Intuitively, what is being measured is the total saving other views will gain if v is included in S.
Algorithm
S = {top view}
Repeat k times: select the view v that maximizes B(v, S) and add it to S.
S is then the set of 'optimal' views.
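The greedy selection can be sketched on the part/supplier/customer lattice from the earlier slide. The 6M and 0.2M row counts come from that slide; the remaining view sizes (in millions of rows) are illustrative assumptions.

```python
ANCESTORS = {  # view -> views whose results can answer it (incl. itself)
    "psc": {"psc"},
    "pc": {"pc", "psc"}, "ps": {"ps", "psc"}, "sc": {"sc", "psc"},
    "p": {"p", "pc", "ps", "psc"}, "s": {"s", "ps", "sc", "psc"},
    "c": {"c", "pc", "sc", "psc"},
    "none": {"none", "p", "s", "c", "pc", "ps", "sc", "psc"},
}
COST = {"psc": 6.0, "pc": 6.0, "ps": 0.8, "sc": 6.0,
        "p": 0.2, "s": 0.01, "c": 0.1, "none": 1e-6}

def cheapest(view, S):
    # Least-cost materialized view that can answer `view`.
    return min(COST[u] for u in ANCESTORS[view] if u in S)

def benefit(v, S):
    # Sum of savings over every view w that v can answer.
    return sum(max(cheapest(w, S) - COST[v], 0)
               for w in ANCESTORS if v in ANCESTORS[w])

S = {"psc"}                    # the top view must always be materialized
for _ in range(2):             # pick k = 2 additional views greedily
    v = max((v for v in COST if v not in S), key=lambda v: benefit(v, S))
    S.add(v)
print(sorted(S))
```

With these sizes the first pick is {part, supplier} (it cheapens four views at once) and the second is {customer}, illustrating that greedy favors views whose savings spread over many dependents.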
Example
Greedy not Always Good
Performance of Greedy Algorithm
Greedy guarantees at least a fraction (e-1)/e (about 63%) of the benefit of the optimal choice of views.
The greedy algorithm is exactly optimal when the benefit of the first picked view is much greater than that of the subsequent views.
If the benefits of all the views picked are equal, then the greedy algorithm is optimal.
Performance of Greedy Algorithm
Performance of Greedy Algorithm & Extensions to the Algorithm
The views in the lattice should have probabilities associated with them so that the algorithm can use these probabilities as weights.
We may have to restrict space rather than the number of views; then we need to consider benefit per unit space. The problem is that a small view may have a high benefit/space ratio compared to a large view, so the large view may be excluded for lack of space, which affects the views picked later. The authors treat this as a boundary case and ignore it.
Hypercube Lattice
In a hypercube, typically, grouping occurs on n attributes and aggregation on the (n+1)th attribute.
Example:

SELECT Model, Year, Color, SUM(Sales) AS Sales
FROM Sales
GROUP BY CUBE Model, Year, Color

We can exploit this regularity in selecting views. Suppose the size of each domain is r and only m cells of the top-most view are filled. If we group on i attributes, then: if r^i >= m, use m as the size estimate for that view; if r^i < m, use r^i.
Time and Space Optimality
Space is minimized when only the top view is materialized; then each query takes time m, and the total cost is m x 2^N.
Time is minimized if we materialize every view.
Time and Space Optimality
Conclusion
Materialization of parts of the cube is very important in reducing response times for DSS.
Pick the 'optimal' views that strike a balance between the space lost to materialization and the increase in response time from computing on the fly.
The greedy algorithm gives a close approximation of the optimal solution.
Future work: dynamic materialization strategies.