Aggregation in Main Memory

March 30 2001 DGRC FedStats Visit Aggregation in Main Memory Kenneth A. Ross Columbia University

Page 1: Aggregation in Main Memory

March 30 2001 DGRC FedStats Visit

Aggregation in Main Memory

Kenneth A. Ross

Columbia University

Page 2: Aggregation in Main Memory

Research Experience

• Complex query processing
• Data warehousing
• Main memory databases

Students: Kazi Zaman, Junyan Ding

Page 3: Aggregation in Main Memory

Scenario A

[Diagram labels: User, Query, Mediator, Unified Results, Main-Memory DBMS, Traditional DBMS, ...]

Page 4: Aggregation in Main Memory

Scenario B

[Diagram labels: User, Web, Data Request, Mediator, Unified Results, Traditional DBMS, ..., Main Memory DB, Sequence of Interactive Queries]

Page 5: Aggregation in Main Memory

Scenario C

[Diagram labels: User, Web, Data Request, Mediator, Unified Results, Traditional DBMS, ..., Main Memory DB, Graphical User Interface, Dynamic Query]

Page 6: Aggregation in Main Memory

Outline

• Introduction to Datacubes
• Frameworks for querying cubes
• The Main Memory based framework
• Experimental Results
• Conclusions and Plan

Page 7: Aggregation in Main Memory

The CUBE BY Operator

Input relation:

State  Year  Grade    Sales
CA     1997  Regular  90
NY     1997  Premium  70
CA     1998  Premium  65
NY     1998  Premium  95

After CUBE BY (sum Sales):

State  Year  Grade    Sales
CA     1997  Regular  90
CA     1997  ALL      90
ALL    1997  Regular  90
CA     ALL   Regular  90
ALL    1997  ALL      160
ALL    ALL   Regular  90
CA     ALL   ALL      155
ALL    ALL   ALL      320
…      (additional records)

Large increase in total size, especially with many dimensions.
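The CUBE BY example above can be reproduced with a short Python sketch. This is an illustration only, not the implementation behind the talk: the function name `cube_by_sum` and the use of the string "ALL" as a roll-up marker are assumptions made for the example. Each input row contributes to every combination in which a dimension is either kept or replaced by ALL.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"  # assumed marker for a rolled-up dimension

# The four input rows from the slide: (State, Year, Grade, Sales).
rows = [
    ("CA", 1997, "Regular", 90),
    ("NY", 1997, "Premium", 70),
    ("CA", 1998, "Premium", 65),
    ("NY", 1998, "Premium", 95),
]

def cube_by_sum(rows, num_dims):
    """Return {group key: summed measure} over every subset of dimensions."""
    cube = defaultdict(int)
    for row in rows:
        dims, measure = row[:num_dims], row[num_dims]
        # Each dimension is either kept or rolled up: 2**num_dims keys per row.
        for mask in product((False, True), repeat=num_dims):
            key = tuple(ALL if rolled else value
                        for value, rolled in zip(dims, mask))
            cube[key] += measure
    return dict(cube)

cube = cube_by_sum(rows, num_dims=3)
print(cube[(ALL, 1997, ALL)])   # 160
print(cube[("CA", ALL, ALL)])   # 155
print(cube[(ALL, ALL, ALL)])    # 320
print(len(rows), "input rows ->", len(cube), "cube records")
```

Even on this tiny input the cube holds several times as many records as the relation, which is the growth the slide warns about.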

Page 8: Aggregation in Main Memory

Lattice Representation

State, Year, Grade

State, Year State, Grade Year, Grade

State Year Grade
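As a rough sketch, the lattice can be enumerated by listing every subset of the dimensions, finest grouping first. The helper name `lattice_levels` is hypothetical. The last level it prints, the empty grouping, corresponds to the (ALL, ALL, ALL) record.

```python
from itertools import combinations

dimensions = ("State", "Year", "Grade")

def lattice_levels(dims):
    """Yield the group-by sets level by level, finest granularity first."""
    for size in range(len(dims), -1, -1):
        yield list(combinations(dims, size))

for level in lattice_levels(dimensions):
    # An empty tuple means no grouping column remains (the grand total).
    print("   ".join(", ".join(group) if group else "(no grouping)"
                     for group in level))
```

With d dimensions there are 2^d group-by sets, so the lattice doubles in size with every added dimension.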

Page 9: Aggregation in Main Memory

Modeling Queries

Slice Queries ask for a single aggregate record

SELECT State, year, sum(sales)
FROM BLS-12345
GROUP BY State, year
HAVING State = 'NY' AND year = '1998'

Page 10: Aggregation in Main Memory

Existing Frameworks

State, Year, Grade

State, Year State,Grade Year,Grade

State Year Grade

Choose a subset of the cube to materialize based on workload. Materialize on disk.

Appropriate record recovered or computed for incoming slice query.

Drawbacks: Ignores clustering of the relation on disk. Smallest unit of materialization is too big.

Page 11: Aggregation in Main Memory

Our approach

State, Year, Grade

State, Year State,Grade Year,Grade

State Year Grade

The full cube is often larger than available memory, but ...

The finest granularity aggregate may fit.

Any record can be computed without having to go to disk.

How should the finest granularity be organized?

Page 12: Aggregation in Main Memory

Framework

Level-1 Store: selected coarse records in hash table

Level-2 Store: finest granularity cuboid; slot directory; records in linked lists

Query q

Page 13: Aggregation in Main Memory

The Level-1 Store

Records are <Key,Value> pairs stored in a hash table.

Records can contain ALL’s

Given query Q, form composite key and check level-1 store (constant time).

If not found, use level-2 store

Key  Value
a1   55
b2   34
c2   12
…    ...
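A minimal sketch of this lookup path, assuming a Python dict as the hash table and a plain scan of the finest-granularity records as a stand-in for the level-2 store. The names `level1`, `finest`, `scan_finest`, and `answer`, and the choice of which coarse records are cached, are all hypothetical.

```python
ALL = "ALL"  # assumed marker for a rolled-up dimension

# Finest-granularity records: (State, Year, Grade) -> Sales.
finest = {
    ("CA", 1997, "Regular"): 90,
    ("NY", 1997, "Premium"): 70,
    ("CA", 1998, "Premium"): 65,
    ("NY", 1998, "Premium"): 95,
}

# Hypothetical level-1 contents: a few selected coarse records.
level1 = {
    ("CA", ALL, ALL): 155,
    (ALL, 1997, ALL): 160,
}

def scan_finest(key):
    """Stand-in for the level-2 store: aggregate all matching fine records."""
    return sum(value for record, value in finest.items()
               if all(q == ALL or q == r for q, r in zip(key, record)))

def answer(key):
    """Check the level-1 hash table first; fall back to level 2 on a miss."""
    if key in level1:
        return level1[key], "level-1"
    return scan_finest(key), "level-2"

print(answer(("CA", ALL, ALL)))       # found in the hash table
print(answer((ALL, ALL, "Premium")))  # computed from the finest records
```

The composite key is simply the tuple of query values, so the level-1 check is one hash probe regardless of how many dimensions are rolled up.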

Page 14: Aggregation in Main Memory

The Level-2 Store

Finest granularity cuboid: slot directory, with records in linked lists.

Slot directory is organized as a multidimensional array: level2[sz1][sz2][sz3][sz4]

Each slot points to a linked list of elements.

Records placed according to set of mapping functions H.
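The layout described above might be sketched as follows for three dimensions. The slot counts in `SIZES`, the CRC32-based mapping function `h`, and the use of Python lists in place of linked lists are assumptions for illustration; the slides do not say how the mapping functions H are chosen.

```python
import zlib

SIZES = (4, 4, 4)  # assumed slot counts sz1, sz2, sz3 (illustrative only)

def h(dim, value):
    """Mapping function for one dimension (assumption: CRC32 mod slot count)."""
    return zlib.crc32(str(value).encode()) % SIZES[dim]

def build_level2(records):
    """Place each finest-granularity record into the list for its slot."""
    sz1, sz2, sz3 = SIZES
    # Slot directory as a multidimensional array of (initially empty) lists.
    level2 = [[[[] for _ in range(sz3)] for _ in range(sz2)]
              for _ in range(sz1)]
    for key, value in records:
        i, j, k = (h(dim, v) for dim, v in enumerate(key))
        level2[i][j][k].append((key, value))
    return level2

records = [
    (("CA", 1997, "Regular"), 90),
    (("NY", 1997, "Premium"), 70),
    (("CA", 1998, "Premium"), 65),
    (("NY", 1998, "Premium"), 95),
]
level2 = build_level2(records)

# Locate the list that holds one particular record.
i, j, k = (h(dim, v) for dim, v in enumerate(("NY", 1998, "Premium")))
print(level2[i][j][k])
```

Because several values can map to the same slot, a list may hold records with different keys, which is why lookups still compare keys while aggregating.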

Page 15: Aggregation in Main Memory

Using the Level-2 store

Query Q without ALL’s

Query value  a3      b4      c2      d5
Slot         Slot 4  Slot 3  Slot 7  Slot 1

Access list denoted by level2[4][3][7][1]; aggregate those matching (a3,b4,c2,d5).

Page 16: Aggregation in Main Memory

Using the Level-2 store

Query Q with ALL’s

Query value  a3      ALL            c2      ALL
Slot         Slot 4  List of Slots  Slot 7  List of Slots

Access lists matching level2[4][*][7][*]; aggregate those matching (a3,*,c2,*).
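Both query cases can be sketched in one routine: a fixed value selects a single slot along its dimension, while ALL selects every slot along it. As before, `SIZES`, the CRC32-based `h`, and Python lists standing in for linked lists are assumptions, and the sample data is the three-dimensional sales relation rather than the four-dimensional example on the slide.

```python
import zlib
from itertools import product

ALL = "ALL"        # assumed marker for a rolled-up dimension
SIZES = (4, 4, 4)  # assumed slot counts per dimension

def h(dim, value):
    """Mapping function for one dimension (assumption: CRC32 mod slot count)."""
    return zlib.crc32(str(value).encode()) % SIZES[dim]

def build_level2(records):
    """Slot directory as a 3-d array of lists holding (key, value) records."""
    level2 = [[[[] for _ in range(SIZES[2])] for _ in range(SIZES[1])]
              for _ in range(SIZES[0])]
    for key, value in records:
        i, j, k = (h(dim, v) for dim, v in enumerate(key))
        level2[i][j][k].append((key, value))
    return level2

def query(level2, q):
    """Sum the measure over all records matching q, where ALL is a wildcard."""
    # One slot for a fixed value, every slot along the dimension for ALL.
    slot_ranges = [range(SIZES[dim]) if v == ALL else (h(dim, v),)
                   for dim, v in enumerate(q)]
    total = 0
    for i, j, k in product(*slot_ranges):
        for key, value in level2[i][j][k]:
            # Slots can hold colliding keys, so re-check each record.
            if all(qv == ALL or qv == kv for qv, kv in zip(q, key)):
                total += value
    return total

level2 = build_level2([
    (("CA", 1997, "Regular"), 90),
    (("NY", 1997, "Premium"), 70),
    (("CA", 1998, "Premium"), 65),
    (("NY", 1998, "Premium"), 95),
])
print(query(level2, ("NY", 1998, "Premium")))  # no ALL: one list visited
print(query(level2, ("CA", ALL, ALL)))         # two ALLs: 4 * 4 lists visited
```

The number of lists visited is the product of the slot counts of the ALL dimensions, so queries with few ALL's stay cheap while very coarse queries approach a full scan, which is where cached level-1 records help.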

Page 17: Aggregation in Main Memory

Demo

Shows multidimensional dataset (subset of columns of 5% Census sample for NY in 1990).

User asks queries: fast answers.

Future: User interface asks many queries, with display changing interactively.

Page 18: Aggregation in Main Memory

Experimental Results

Query Processing Time vs Additional Memory Used (real dataset, 10^6 records, 8 dimensions)

[Plot: Query Cost. x-axis: Additional Memory Used in MB (0 to 80); y-axis: Average time per query in milliseconds (0 to 15)]

Scanning all records takes 194 ms.

Page 19: Aggregation in Main Memory

Importance of Work

• Aggregation is fundamental to analysis.

• Make analysis interactive, even for many dimensions.

• Make a variety of aggregate granularities available, where possible.

Page 20: Aggregation in Main Memory

Contributions

A Main Memory based framework for answering datacube queries efficiently.

Query performance in the 2-4 ms range, which is faster than going to disk.

Page 21: Aggregation in Main Memory

Plan

• Integrate with user interface to generate dynamic queries.
• Self-tuning capability.
• Multiple data sets.
• Work with agencies to generate value
  – For intra-agency analysis
  – For enhanced data dissemination