Aggregation in Main Memory

March 30 2001 DGRC FedStats Visit


Kenneth A. Ross

Columbia University


Research Experience

• Complex query processing
• Data Warehousing
• Main memory databases

Students: Kazi Zaman, Junyan Ding


Scenario A

[Diagram labels: User, Query, Unified Results, Mediator, Main-Memory DBMS, Traditional DBMS, ...]


Scenario B

[Diagram labels: User, Web, Data Request, Mediator, Traditional DBMS, ..., Main Memory DB, Sequence of Interactive Queries]


Scenario C

[Diagram labels: User, Web, Data Request, Mediator, Traditional DBMS, ..., Main Memory DB, Graphical User Interface, Dynamic Query]


Outline

• Introduction to Datacubes
• Frameworks for querying cubes
• The Main Memory based framework
• Experimental Results
• Conclusions and Plan


The CUBE BY Operator

Input relation:

State  Year  Grade    Sales
CA     1997  Regular  90
NY     1997  Premium  70
CA     1998  Premium  65
NY     1998  Premium  95

CUBE BY (sum Sales) produces:

State  Year  Grade    Sales
CA     1997  Regular  90
CA     1997  ALL      90
ALL    1997  Regular  90
CA     ALL   Regular  90
ALL    1997  ALL      160
ALL    ALL   Regular  90
CA     ALL   ALL      155
…….
ALL    ALL   ALL      320

(Rows containing ALL are the additional records.)

Large increase in total size, especially with many dimensions.
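As a rough illustration of what CUBE BY computes, the sketch below sums Sales over every combination of kept and rolled-up dimensions for the four input records above. The function name `cube_by_sum` and the in-memory dictionary layout are assumptions made for this example, not part of the talk.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"

# Base relation from the slide: (State, Year, Grade, Sales).
rows = [
    ("CA", 1997, "Regular", 90),
    ("NY", 1997, "Premium", 70),
    ("CA", 1998, "Premium", 65),
    ("NY", 1998, "Premium", 95),
]

def cube_by_sum(rows, num_dims=3):
    """Sum the measure for every combination of kept / rolled-up dimensions.

    Hypothetical helper: a naive in-memory cube, for illustration only.
    """
    cube = defaultdict(int)
    for row in rows:
        dims, measure = row[:num_dims], row[num_dims]
        # Each mask says which dimensions keep their value (True) or become ALL.
        for mask in product([True, False], repeat=num_dims):
            key = tuple(d if keep else ALL for d, keep in zip(dims, mask))
            cube[key] += measure
    return dict(cube)

cube = cube_by_sum(rows)
print(cube[(ALL, 1997, ALL)])
print(cube[(ALL, ALL, ALL)])
print(len(rows), "base records ->", len(cube), "cube records")
```

Even on this tiny input the cube holds several times as many records as the base relation, which is the size increase the slide points out.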


Lattice Representation

State, Year, Grade

State, Year State, Grade Year, Grade

State Year Grade
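The lattice can be enumerated mechanically: each node is a subset of the dimensions, and a coarser node can be computed from any node whose dimensions include its own. The sketch below is an illustration under that reading; the helper names are hypothetical, and it also lists the empty group-by (the grand total), which the slide does not draw.

```python
from itertools import combinations

dimensions = ("State", "Year", "Grade")

def lattice_levels(dims):
    """Return the group-by sets of the cube lattice, finest level first."""
    return [list(combinations(dims, k)) for k in range(len(dims), -1, -1)]

def can_answer(finer, coarser):
    """True if the coarser cuboid can be aggregated from the finer one."""
    return set(coarser) <= set(finer)

levels = lattice_levels(dimensions)
for level in levels:
    # One line per lattice level; the empty group-by is shown as "(none)".
    print("   ".join(", ".join(c) if c else "(none)" for c in level))
```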


Modeling Queries

Slice Queries ask for a single aggregate record

SELECT State, year, sum(sales)
FROM BLS-12345
GROUP BY State, year
HAVING State = 'NY' AND year = '1998'
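For comparison, a slice query can always be answered by scanning the base records and summing the matches. The sketch below does this over the four sample records from the CUBE BY example (not the BLS-12345 data set named in the query); `slice_query` is a hypothetical name used only here.

```python
# Sample records from the CUBE BY example: (State, Year, Grade, Sales).
rows = [
    ("CA", 1997, "Regular", 90),
    ("NY", 1997, "Premium", 70),
    ("CA", 1998, "Premium", 65),
    ("NY", 1998, "Premium", 95),
]

def slice_query(rows, state, year):
    """Return the single aggregate record (state, year, sum of sales).

    Scans every base record; returns None if no record matches.
    """
    total, found = 0, False
    for s, y, _grade, sales in rows:
        if s == state and y == year:
            total += sales
            found = True
    return (state, year, total) if found else None

print(slice_query(rows, "NY", 1998))
```

A full scan costs time proportional to the number of base records, which is the baseline the rest of the talk tries to beat.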


Existing Frameworks

State, Year, Grade

State, Year State, Grade Year, Grade

State Year Grade

Choose a subset of the cube to materialize based on the workload. Materialize on disk.

The appropriate record is recovered or computed for each incoming slice query.

Drawbacks: Ignores clustering of the relation on disk. Smallest unit of materialization is too big.


Our approach

State, Year, Grade

State, Year State, Grade Year, Grade

State Year Grade

The full cube is often larger than available memory, but ...

The finest granularity aggregate may fit.

Any record can be computed without having to go to disk.

How should the finest granularity be organized?


Framework

[Diagram: Query q is answered from two stores]

Level-1 Store: selected coarse records in a hash table

Level-2 Store: finest granularity cuboid, reached through a slot directory, with records in linked lists


The Level-1 Store

Records are <Key,Value> pairs stored in a hash table.

Records can contain ALL’s

Given query Q, form a composite key and check the level-1 store (constant time).

If not found, use level-2 store

Key  Value
a1   55
b2   34
c2   12
…    ...
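A minimal sketch of the level-1 lookup, assuming a Python dict stands in for the hash table and the composite key is a tuple of dimension values (possibly containing ALL). The class and function names are hypothetical, and the stored values are taken from the CUBE BY example rather than from the table above.

```python
ALL = "ALL"

class Level1Store:
    """Hash table from composite key to a precomputed aggregate value."""

    def __init__(self):
        self._table = {}

    def put(self, key, value):
        self._table[tuple(key)] = value

    def lookup(self, query):
        """Return the stored aggregate, or None on a miss."""
        return self._table.get(tuple(query))

def answer(query, level1, level2_fallback):
    """Check level 1 first; on a miss, delegate to the level-2 store."""
    hit = level1.lookup(query)
    return hit if hit is not None else level2_fallback(query)

level1 = Level1Store()
level1.put(("CA", ALL, ALL), 155)   # selected coarse records
level1.put((ALL, 1997, ALL), 160)

# The fallback here is a placeholder that only marks a level-1 miss.
print(answer(("CA", ALL, ALL), level1, lambda q: "miss"))
print(answer(("NY", ALL, ALL), level1, lambda q: "miss"))
```

The dict lookup is constant time on average, which matches the slide's claim for the level-1 check.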


The Level-2 Store

[Diagram labels: Level-2 Store, finest granularity cuboid, slot directory, records in linked lists]

Slot directory is organized as a multidimensional array: level2[sz1][sz2][sz3][sz4]

Each slot points to a linked list of elements.

Records are placed according to a set of mapping functions H.
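The structure described above might be sketched as follows. Several things here are assumptions for illustration: the mapping functions are taken to be a CRC32 of the value modulo the slot count (the talk does not say how H is chosen), a dict keyed by slot coordinates stands in for the multidimensional array, and Python lists stand in for linked lists.

```python
import zlib
from collections import defaultdict

def make_mapping(size):
    """Hypothetical mapping function for one dimension: value -> slot in [0, size)."""
    return lambda value: zlib.crc32(str(value).encode("utf-8")) % size

class Level2Store:
    """Finest-granularity records, bucketed by one slot index per dimension."""

    def __init__(self, sizes):
        self.sizes = tuple(sizes)                     # (sz1, sz2, ...)
        self.H = [make_mapping(sz) for sz in self.sizes]
        self.slots = defaultdict(list)                # slot coordinates -> records

    def slot_for(self, key):
        """Apply each dimension's mapping function to get the slot coordinates."""
        return tuple(h(v) for h, v in zip(self.H, key))

    def insert(self, key, measure):
        self.slots[self.slot_for(key)].append((tuple(key), measure))

store = Level2Store(sizes=(2, 2, 2))
for state, year, grade, sales in [
    ("CA", 1997, "Regular", 90),
    ("NY", 1997, "Premium", 70),
    ("CA", 1998, "Premium", 65),
    ("NY", 1998, "Premium", 95),
]:
    store.insert((state, year, grade), sales)

print(store.slot_for(("CA", 1997, "Regular")))
```

Several distinct keys can share a slot, so a lookup still has to compare keys within the list it reaches.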


Using the Level-2 store

Query Q without ALL’s: (a3, b4, c2, d5)

Mapped to: Slot 4, Slot 3, Slot 7, Slot 1

Access the list denoted by level2[4][3][7][1]; aggregate those records matching (a3,b4,c2,d5).


Using the Level-2 store

Query Q with ALL’s: (a3, ALL, c2, ALL)

Mapped to: Slot 4, List of Slots, Slot 7, List of Slots

Access the lists matching level2[4][*][7][*]; aggregate those records matching (a3,*,c2,*).
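Both query cases can be handled by one routine: a concrete value fixes a single slot index in its dimension, while ALL ranges over every slot in that dimension. The sketch below is one possible reading of the slides, using three dimensions from the CUBE BY example, a hypothetical CRC32-based mapping function, and a dict keyed by slot coordinates in place of the array.

```python
import zlib
from collections import defaultdict
from itertools import product

ALL = "ALL"
SIZES = (2, 2, 2)  # hypothetical slot counts per dimension

def slot(dim, value):
    """Hypothetical mapping function for dimension `dim`."""
    return zlib.crc32(str(value).encode("utf-8")) % SIZES[dim]

def build(rows):
    """Place each finest-granularity record in the list for its slot coordinates."""
    level2 = defaultdict(list)
    for *key, measure in rows:
        coords = tuple(slot(d, v) for d, v in enumerate(key))
        level2[coords].append((tuple(key), measure))
    return level2

def query(level2, q):
    """Sum the measure over all records matching q, where ALL is a wildcard."""
    # A concrete value selects one slot; ALL selects every slot in that dimension.
    ranges = [range(SIZES[d]) if v == ALL else [slot(d, v)]
              for d, v in enumerate(q)]
    total = 0
    for coords in product(*ranges):
        for key, measure in level2.get(coords, ()):
            # Slots can hold colliding keys, so compare the actual values.
            if all(qv == ALL or qv == kv for qv, kv in zip(q, key)):
                total += measure
    return total

level2 = build([
    ("CA", 1997, "Regular", 90),
    ("NY", 1997, "Premium", 70),
    ("CA", 1998, "Premium", 65),
    ("NY", 1998, "Premium", 95),
])
print(query(level2, ("CA", 1997, "Regular")))
print(query(level2, (ALL, 1997, ALL)))
```

The more ALL's a query contains, the more lists it visits, which is why frequently requested coarse records are worth keeping in the level-1 store.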


Demo

Shows multidimensional dataset (subset of columns of 5% Census sample for NY in 1990).

User asks queries: fast answers.

Future: User Interface asks many queries, with display changing interactively.


Experimental Results

Query Processing Time vs Additional Memory Used (real dataset, 10^6 records, 8 dimensions)

[Plot: average time per query in milliseconds (y-axis, 0 to 15) against additional memory used in MB (x-axis, 0 to 80); series: Query Cost]

Scanning all records takes 194 ms.


Importance of Work

• Aggregation is fundamental to analysis.

• Make analysis interactive, even for many dimensions.

• Make a variety of aggregate granularities available, where possible.


Contributions

A Main Memory based framework for answering datacube queries efficiently.

Query performance in the 2-4 ms range, which is faster than going to disk.


Plan

Integrate with user interface to generate dynamic queries.

Self-tuning capability.

Multiple data sets.

Work with agencies to generate value
– For intra-agency analysis
– For enhanced data dissemination
