Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Totals
Gray et al.
Presented by: Priya Rajan
Introduction
The increasing popularity of data warehouses has increased the demand for querying, reporting and OLAP tools.
We need to extract maximum information from data and be able to analyze and aggregate it across multiple dimensions.
Cross tabs over various dimensions are particularly useful because aggregates convey a lot of information when the amount of data is large.
Traditional Representation of N-Dimensional Data
N-dimensional data is represented as a relation with N attribute columns. Example: here Sales Quantity is a function of Supplier, Product and Date.

Sales Table:

Date_ID  Product  Supplier  Loc_ID  Sales Quantity
3        P2       S2        1       11
3        P2       S2        3       39
2        P3       S1        2       18
1        P1       S1        4       10

For example, the first row records a sales quantity of 11 for product P2 from supplier S2 on Date_ID 3 at Loc_ID 1.
Aggregation Queries
Standard SQL provides COUNT(), SUM(), MIN(), MAX() and AVG().
Example: SELECT AVG(Sales_Quantity) FROM Sales
SQL also provides a GROUP BY operator, which produces 0- and 1-dimensional aggregates.
Example:

SELECT Supplier, Product, AVG(Sales_Quantity) AS A
FROM Sales
GROUP BY Supplier, Product;

Result:

Supplier  Product  A
S1        P1       10
S1        P3       18
S2        P2       25
Inadequacy of GROUP BY
Cannot perform aggregation over computed categories (e.g., histograms) directly.
Example: suppose suppliers and locations together define a 'factory', and a Factory() function maps (supplier, Loc_ID) pairs to a factory. It would be nice to write the query to find maximum daily sales for each factory as:

SELECT Date, Factory(Supplier, Loc_ID) AS factory, MAX(Sales)
FROM Sales
GROUP BY Date, Factory(Supplier, Loc_ID);

Instead we have to compute the table indirectly and then perform the aggregation (a nested query):

SELECT Date, factory, MAX(Sales)
FROM (SELECT Date, Factory(Supplier, Loc_ID) AS factory, Sales
      FROM Sales) AS Foo
GROUP BY Date, factory;
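To make the workaround concrete, here is a small sketch using SQLite from Python. The table layout and the Factory() mapping are illustrative assumptions, not the paper's actual data; the subquery mirrors the nested form shown above (note that many modern SQL dialects do allow grouping by an expression directly).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date_id INT, supplier TEXT, loc_id INT, quantity INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (1, "S1", 1, 10),
    (1, "S1", 2, 18),
    (1, "S2", 3, 39),
    (2, "S2", 3, 11),
])

# Hypothetical Factory() mapping from (supplier, loc_id) to a factory name.
conn.create_function("factory", 2, lambda s, loc: f"{s}-F{loc % 2}")

# The computed category is materialized in a subquery first, then grouped on,
# as the slide describes.
rows = conn.execute("""
    SELECT date_id, f, MAX(quantity)
    FROM (SELECT date_id, factory(supplier, loc_id) AS f, quantity FROM sales)
    GROUP BY date_id, f
    ORDER BY date_id, f
""").fetchall()
print(rows)
```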
Inadequacy of GROUP BY
In analyzing data we often need to "roll up" (e.g., go from sales/city to sales/state) or "drill down" (vice versa).
Problem with GROUP BY for roll-ups and drill-downs: the subtotal must be stored at each level of the aggregation.
Rollup Representations
Need 2N columns to represent a rollup of N elements!
Alternative: introduce an 'ALL' value. Problem: need to union many GROUP BYs.
Rollup Representations
SELECT Model, ALL, ALL, SUM(Sales)FROM Sales WHERE Model = 'Chevy'
GROUP BY ModelUNION SELECT Model, Year, ALL, SUM(Sales)
FROM SalesWHERE Model = 'Chevy'GROUP BY Model, Year
UNIONSELECT Model, Year, Color, SUM(Sales)
FROM SalesWHERE Model = 'Chevy'GROUP BY Model, Year, Color;
Note that aggregation by year is not included
To add that in we would have to Union in SELECT Model, ALL, Color, SUM(Sales)FROM SalesWHERE Model = 'Chevy'GROUP BY Model, Color;
Query is really complex-for a cross tab in 6-dimensions need 26=64unions of GROUP BY statements!
This is the motivation for a new operator!
To arrive at the previous table we would have to do the following complex query:
The CUBE Operator
The CUBE operator builds a table with all possible aggregated values constructed by grouping; it is the generalization of a simple aggregate function to N dimensions.
The 0-D cube is the aggregate of all values. Example: total sales over all models, years and colors.
The 1-D cube is a GROUP BY on one dimension. Example: GROUP BY Color.
The 2-D cube is a cross tab.
The 3-D cube is an intersection of three 2-D cubes.
This can be generalized to N dimensions.
Intuition behind the CUBE

[Figure: the data cube and its sub-space aggregates for dimensions Make (Chevy, Ford), Year, and Color (red, white, blue). The 0-D cube is the single grand-total Sum; the 1-D cubes are the group-bys By Make, By Year and By Color (a group by with total); the 2-D cubes are the cross tabs By Make & Color, By Make & Year and By Color & Year; the 3-D cube is the full data cube.]
Creating a CUBE
The syntax for a cube is:

SELECT <select-list>
FROM <relation>
GROUP BY CUBE <select-list>

To create the global cube we need to generate the power set of the aggregation columns:

{Model, Year, Color}
{Model, Year}  {Model, Color}  {Year, Color}
{Model}  {Year}  {Color}
{ }
Creating a CUBE
The CUBE operator first aggregates over the <select-list> just as GROUP BY would. It then unions in the 2^N - 1 super-aggregates shown in the previous picture.
'ALL' is substituted for the aggregated column, so every domain now has an extra 'ALL' value. The cube has (C1+1)*(C2+1)*...*(CN+1) values, where Ci is the cardinality of each domain.
The idea is that since Ci is usually large, the CUBE will be only a little bigger than the corresponding GROUP BY but provides a lot more information (not always true; see Conclusion).
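As a sketch (not the paper's implementation), the cube can be simulated by unioning one GROUP BY per subset of the dimensions, substituting "ALL" for the aggregated-out columns. The tiny sales table below is an illustrative assumption.

```python
from itertools import combinations
from collections import defaultdict

DIMS = ("model", "year", "color")
rows = [
    ("Chevy", 1994, "red", 5),
    ("Chevy", 1994, "blue", 3),
    ("Chevy", 1995, "red", 4),
    ("Ford", 1994, "red", 7),
]

cube = defaultdict(int)
for r in range(len(DIMS) + 1):
    for kept in combinations(range(len(DIMS)), r):   # power set of dimensions
        for *dims, sales in rows:
            # 'ALL' stands in for every dimension not in this grouping.
            key = tuple(dims[i] if i in kept else "ALL" for i in range(len(DIMS)))
            cube[key] += sales

print(cube[("ALL", "ALL", "ALL")])    # grand total (0-D cube)
print(cube[("Chevy", "ALL", "ALL")])  # Chevy across all years and colors
```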
CUBE Creation: EXAMPLE
ROLLUP Operator
If the user only needs a roll-up report or a drill-down report, computing the entire cube is wasteful. The ROLLUP operator is used to compute only the super-aggregates. In the previous example, if ROLLUP were used in place of CUBE, only the following would be returned (marked in the figure in red):
Regular aggregation rows that would be produced by GROUP BY without using ROLLUP
First-level subtotals aggregating across Color for each combination of Model and Year
Second-level subtotals aggregating across Year and Color for each Model value
A grand total row
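The four row kinds above can be sketched in a few lines: ROLLUP aggregates only along one prefix chain of the grouping columns, (Model, Year, Color) -> (Model, Year) -> (Model) -> (), rather than the full 2^N subsets the CUBE needs. The toy data and column order are assumptions.

```python
from collections import defaultdict

rows = [
    ("Chevy", 1994, "red", 5),
    ("Chevy", 1994, "blue", 3),
    ("Chevy", 1995, "red", 4),
]

rollup = defaultdict(int)
for model, year, color, sales in rows:
    full = (model, year, color)
    for depth in range(len(full), -1, -1):   # prefixes, longest first
        key = full[:depth] + ("ALL",) * (len(full) - depth)
        rollup[key] += sales

print(rollup[("Chevy", 1994, "ALL")])   # first-level subtotal across Color
print(rollup[("Chevy", "ALL", "ALL")])  # second-level subtotal across Year, Color
print(rollup[("ALL", "ALL", "ALL")])    # grand total
```

Note that a key like ("ALL", 1994, "ALL") is never produced: that aggregate belongs to the CUBE but not to the ROLLUP.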
Integrating CUBE into SQL
GROUP BY <select-list>
ROLLUP <select-list>
CUBE <select-list>
Could support histograms by extending GROUP BY to allow aggregation over computed categories.
Use of 'ALL' to indicate 'aggregate' could cause problems: ALL is set-valued while other values are not. Could use NULL instead of ALL, with a GROUPING() function to distinguish an aggregation NULL from a real NULL.
Computing Cubes and Rollups
Aggregation computation techniques:
Compute aggregates at the lowest possible system level to minimize processing cost.
Use hashing to organize aggregation columns in memory.
If aggregates don't fit in memory, use hybrid hashing to organize data by value.
If data is spread over several disks, exploit parallelism: aggregate each partition and then combine the results.
Computing Cubes and Rollups
Simplest algorithm: the 2^N algorithm.
Allocate a handle for every cell in the cube. When a tuple is passed in, a function Iter is called on the handle and the tuple value; Iter is called 2^N times per tuple.
After all tuples have been consumed, a function Final is called to complete the aggregation.
If the base table has T tuples, Iter is called T x 2^N times. We can improve on this!
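A minimal sketch of this naive algorithm, with SUM standing in for the aggregate: every incoming tuple updates the handle of each of the 2^N cells it falls into, so Iter runs exactly T x 2^N times. The Iter/Final names follow the slide; the data is assumed.

```python
from itertools import combinations
from collections import defaultdict

N = 2
tuples = [("Chevy", "red", 5), ("Chevy", "blue", 3), ("Ford", "red", 7)]

handles = defaultdict(int)   # one running SUM per cube cell
iter_calls = 0

def Iter(key, value):
    global iter_calls
    handles[key] += value
    iter_calls += 1

for *dims, measure in tuples:
    # Each tuple touches every subset of the N dimensions: 2^N Iter calls.
    for kept in (s for r in range(N + 1) for s in combinations(range(N), r)):
        Iter(tuple(dims[i] if i in kept else "ALL" for i in range(N)), measure)

def Final(key):              # SUM needs no extra finalization step
    return handles[key]

print(iter_calls)            # T x 2^N = 3 x 4 = 12
print(Final(("ALL", "ALL"))) # grand total: 15
```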
Types of Aggregate Functions
Distributive functions: there is a function G such that F({Xi,j}) = G({F({Xi,j | i = 1,...,I}) | j = 1,...,J}). Examples: MIN(), COUNT(), SUM(), MAX().
Algorithm idea: the (N-1)-dimensional cube can be computed by aggregating the N-dimensional results.
Example: a cross tab (2-D cube) where SUM() is the distributive function. The 1st dimension is the sum of the 2-D rows, and the 0th dimension is the sum of the 1st dimension.
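The distributive idea can be sketched with SUM: the 1-D totals are computed from the 2-D cross tab's entries, and the 0-D total from the 1-D totals, without ever revisiting the base data. The toy cross tab values are assumed.

```python
cross_tab = {                       # 2-D cube: (model, color) -> SUM(sales)
    ("Chevy", "red"): 9, ("Chevy", "blue"): 3,
    ("Ford", "red"): 7, ("Ford", "blue"): 1,
}

# 1-D cube: aggregate the 2-D results along the color dimension.
by_model = {}
for (model, _color), total in cross_tab.items():
    by_model[model] = by_model.get(model, 0) + total

# 0-D cube: aggregate the 1-D results.
grand_total = sum(by_model.values())

print(by_model)      # {'Chevy': 12, 'Ford': 8}
print(grand_total)   # 20
```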
Types of Aggregation Functions
Algebraic functions: F(X) = H(G(X)), where G produces a fixed-size intermediate. Examples: AVG(), standard deviation. For AVG, G would return the sum and the count, and H would compute sum/count.
Algorithm idea: a handle is maintained for each cube cell and is passed on to the (N-1)-dimensional super-aggregates. A function Iter(&handle, &handle) is used to combine the aggregates. At the end, Final() is called.
Example: a cross tab where AVG() is the algebraic function.
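A sketch of the algebraic case for AVG: the handle G(X) is the pair (sum, count), handles combine upward into super-aggregates, and H = sum/count is applied only by Final(). The Iter/Final names follow the slide; the two-handle combining step is named Merge here for clarity, and the values are assumed.

```python
def Iter(handle, value):            # fold one tuple into a cell's handle
    s, c = handle
    return (s + value, c + 1)

def Merge(h1, h2):                  # combine two cell handles upward
    return (h1[0] + h2[0], h1[1] + h2[1])

def Final(handle):                  # H(G(X)) = sum / count
    s, c = handle
    return s / c

cell_a = cell_b = (0, 0)
for v in (10, 18):
    cell_a = Iter(cell_a, v)
for v in (11, 39):
    cell_b = Iter(cell_b, v)

super_cell = Merge(cell_a, cell_b)  # (N-1)-dimensional super-aggregate
print(Final(super_cell))            # AVG over all four values: 19.5
```

Note this would not work if the cells only stored their averages: (14.0 + 25.0) / 2 happens to equal 19.5 here only because the cells hold equal counts; the (sum, count) handle is what makes the combination correct in general.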
Aggregation Functions
Holistic functions: functions whose sub-aggregates cannot be described by a fixed storage bound. Examples: MEDIAN(), RANK().
No known algorithm is better than the 2^N algorithm for holistic functions.
Comments & Conclusion
The CUBE operator is useful in multi-dimensional data analysis and is popular in OLAP tools.
We can now answer queries that were previously either impossible or very complicated in SQL (using GROUP BY).
The claim that the CUBE is only slightly larger than the corresponding GROUP BY for large domains may not always be valid: often a large number of data values are null, so the CUBE is much larger than the GROUP BY.
The next logical issue is maintaining cubes. If the cube is materialized, updates must be considered: some functions need only a small incremental scan, but others need re-computation. Example: MAX() is cheap to maintain under INSERT but may require re-computation under DELETE.
Implementing Data Cubes Efficiently
Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman
Introduction
Decision Support Systems (DSS) rely a great deal on aggregations made over data in data warehouses. These queries can be very complex and can take a substantial amount of time, yet DSS needs them computed very quickly.
Idea: materialize (pre-compute) certain queries.
Pre-compute frequently asked queries.
Pre-compute infrequently asked queries so that the time saved can be used to compute other queries.
Materializing the Cube
Data warehouses represent data as a multi-dimensional cube to answer aggregation queries and enable analysis in multiple dimensions.
Computing every cell on request increases the response time, making it unacceptable for DSS. We need to materialize the cube. Options:
Materialize the entire cube: although pre-computing every cell gives excellent response times, the space consumption makes it infeasible.
Materialize parts of the cube: this is the problem explored in this paper, namely picking the optimal (or close to optimal) cells to materialize.
Dependencies in the Cube
Cells in the data cube that have the value "ALL" can be computed from other cells in the cube. Those that don't are independent; computing them requires querying the raw data.
We can materialize dependent cells ('views'). Interesting questions:
How many views should we materialize for good performance?
Which views should we materialize to minimize the average query cost?
Example

{Part, Supplier, Customer}
{Part, Customer}  {Part, Supplier}  {Supplier, Customer}
{Part}  {Supplier}  {Customer}
{ }

We must materialize {Part, Supplier, Customer} because we cannot derive it from any other view. Say its cost is 6M rows.
Suppose we want to answer a query on supplier-customer. If the cost of that view is also 6M rows, there is no point in materializing it, since we can derive it from the {Part, Supplier, Customer} view at the same cost. On the other hand, say the part view costs 0.2M rows and we need to query parts: then there is good reason to materialize it, because answering from the {Part, Supplier, Customer} view would mean processing 6M rows instead of 0.2M.
Lattice Structure & Hierarchies
We say Query1 "is dependent on" Query2 if the result set of Query2 can be used to answer Query1. For example, (part) is dependent on (part, customer). We can form a lattice diagram from these dependencies.
In addition, each dimension of the data cube can have a set of associated values, forming a hierarchy. Example for the time dimension: Day rolls up to Week and to Month, Month rolls up to Year, and Week and Year roll up to none (no grouping). Week is "dependent" on Day.
Combining Dimension Hierarchies with Query Dependencies
We need to combine these two concepts when deriving query dependencies, because the result of a query can contain any element of the hierarchy for a dimension.
If we have two tuples T1 and T2, then T1 is dependent on T2 iff every element of tuple T1 can be derived from (sits at or above, in its dimension hierarchy) the corresponding element of T2.
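A small sketch of this dependence test. Each dimension carries a partial order of granularities ("a can be computed from b"); T1 depends on T2 iff this holds attribute-by-attribute. The derivability edges below are illustrative, following the time hierarchy discussed earlier; note that week and month are deliberately incomparable, since weeks cross month boundaries.

```python
# Maps each granularity to the granularities it can be computed from.
DERIVABLE = {
    "none": {"none", "year", "month", "week", "day"},
    "year": {"year", "month", "day"},
    "month": {"month", "day"},
    "week": {"week", "day"},
    "day": {"day"},
}

def depends_on(t1, t2):
    # T1 depends on T2 iff every attribute of T1 is derivable from
    # the corresponding attribute of T2.
    return all(b in DERIVABLE[a] for a, b in zip(t1, t2))

print(depends_on(("week",), ("day",)))    # True: week rolls up from day
print(depends_on(("month",), ("week",)))  # False: weeks cross month edges
```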
Cost Model
If view V has been materialized and V can be used to answer a query V', then the cost of answering V' is assumed to be the number of rows in V.
In reality, more or less than one entire scan of V may be required: V may be clustered on a particular attribute or may have an index on an attribute, eliminating the need to scan it fully.
This uniform cost model is adopted to simplify the analysis. The authors experimentally showed a linear relationship between query size and running time.
Optimizing Data Cube Lattices
There is a space-time tradeoff that needs to be optimized.
Problem: minimize the time taken to evaluate queries, subject to materializing k views (this can easily be converted to a space constraint, and the algorithm is the same).
This problem is NP-complete, so a greedy algorithm is used to approximate the optimal solution.
Algorithm
Let C(v) be the cost of a view v. If S is the set of views chosen so far, the benefit of adding a view v to S is B(v, S), defined as follows.
For each view w that is dependent on v, let u be the least-cost view in S on which w depends. If C(v) < C(u) then Bw = C(u) - C(v); otherwise Bw = 0.
B(v, S) is then the sum of Bw over all w dependent on v.
Intuitively, what is being measured is the total saving other views will gain if v is included in S.
Algorithm
S = {top view}
Repeat k times: select the view v that maximizes B(v, S) and add it to S.
S is then the set of 'optimal' views.
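The greedy selection can be sketched on the part/supplier/customer lattice from the earlier slide. The 6M and 0.2M row counts come from that slide; the remaining view sizes (in millions of rows) are illustrative assumptions.

```python
ANCESTORS = {  # view -> views whose results can answer it (incl. itself)
    "psc": {"psc"},
    "pc": {"pc", "psc"}, "ps": {"ps", "psc"}, "sc": {"sc", "psc"},
    "p": {"p", "pc", "ps", "psc"}, "s": {"s", "ps", "sc", "psc"},
    "c": {"c", "pc", "sc", "psc"},
    "none": {"none", "p", "s", "c", "pc", "ps", "sc", "psc"},
}
COST = {"psc": 6.0, "pc": 6.0, "ps": 0.8, "sc": 6.0,
        "p": 0.2, "s": 0.01, "c": 0.1, "none": 1e-6}

def cheapest(view, S):
    # Least-cost materialized view that can answer `view`.
    return min(COST[u] for u in ANCESTORS[view] if u in S)

def benefit(v, S):
    # Sum of savings over every view w that v can answer.
    return sum(max(cheapest(w, S) - COST[v], 0)
               for w in ANCESTORS if v in ANCESTORS[w])

S = {"psc"}                    # the top view must always be materialized
for _ in range(2):             # pick k = 2 additional views greedily
    v = max((v for v in COST if v not in S), key=lambda v: benefit(v, S))
    S.add(v)
print(sorted(S))
```

With these sizes the first pick is {part, supplier} (it cheapens four views at once) and the second is {customer}, illustrating that greedy favors views whose savings spread over many dependents.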
Example
Greedy not Always Good
Performance of Greedy Algorithm
Greedy guarantees at least a fraction (e-1)/e (about 63%) of the benefit of the optimal choice of views.
The greedy algorithm is exactly optimal when the benefit of the first picked view is much greater than that of the subsequent views.
If the benefits of all the views picked are equal, then the greedy algorithm is optimal.
Performance of Greedy Algorithm
Performance of Greedy Algorithm & Extensions to the Algorithm
The views in the lattice should have probabilities associated with them so that the algorithm can use these probabilities as weights.
We may have to restrict space rather than the number of views; then we need to consider benefit per unit space. The problem is that a small view may have a high benefit/space ratio compared to a large view, so the large view may be excluded for lack of space, which affects the views picked later. The authors treat this as a boundary case and ignore it.
Hypercube Lattice
In a hypercube, typically, grouping occurs on n attributes and aggregation on the (n+1)th attribute.
Example:

SELECT Model, Year, Color, SUM(Sales) AS Sales
FROM Sales
GROUP BY CUBE Model, Year, Color

We can exploit this regularity in selecting views. Suppose the size of each domain is r and only m cells of the top-most view are filled. If we group on i attributes, then: if r^i >= m, use m as the size estimate for that view; if r^i < m, use r^i.
Time and Space Optimality
Space is minimized when only the top view is materialized; then each query takes time m, and the total cost is m x 2^N.
Time is minimized if we materialize every view.
Time and Space Optimality
Conclusion
Materialization of parts of the cube is very important in reducing response times for DSS.
Pick the 'optimal' views that strike a balance between the space lost to materialization and the increase in response time from computing on the fly.
The greedy algorithm gives a close approximation of the optimal solution.
Future work: dynamic materialization strategies.