Introduction to Data Warehousing CPS 196.03 Notes 6.

Page 1: Introduction to Data Warehousing CPS 196.03 Notes 6.

Introduction to Data Warehousing

CPS 196.03 Notes 6

Page 2: Introduction to Data Warehousing CPS 196.03 Notes 6.

Warehousing

• Growing industry: $30+ billion
• Range from desktop to huge:
    - Walmart: 900-CPU, 2,700-disk, 23 TB Teradata system (numbers from the earlier part of this decade)
• Lots of buzzwords, hype:
    - slice & dice, rollup, MOLAP, pivot, ...

Page 3: Introduction to Data Warehousing CPS 196.03 Notes 6.

Outline

• What is a data warehouse?
• Why a warehouse?
• Models & operations
• Implementing a warehouse

Page 4: Introduction to Data Warehousing CPS 196.03 Notes 6.

What is a Warehouse?

• Collection of diverse data
    - subject oriented
    - aimed at executive, decision maker
    - often a copy of operational data
    - with value-added data (e.g., summaries, history)
    - more: integrated, time-varying, non-volatile

Page 5: Introduction to Data Warehousing CPS 196.03 Notes 6.

What is a Warehouse?

• Collection of tools
    - gathering data
    - cleansing, integrating, ...
    - querying, reporting, analysis
    - data mining
    - monitoring, administering warehouse

Page 6: Introduction to Data Warehousing CPS 196.03 Notes 6.

Warehouse Architecture

[Figure: warehouse architecture. Labels: Client (x2), Query & Analysis, Warehouse, Metadata, Integration, Source (x3).]

Page 7: Introduction to Data Warehousing CPS 196.03 Notes 6.

Motivating Examples

• Forecasting
• Comparing performance of units
• Monitoring, detecting fraud
• Visualization

Page 8: Introduction to Data Warehousing CPS 196.03 Notes 6.

Why a Warehouse?

Two approaches:
• Query-Driven (Lazy)
• Warehouse (Eager)

[Figure: two Sources and a "?".]

Page 9: Introduction to Data Warehousing CPS 196.03 Notes 6.

Query-Driven Approach

[Figure: query-driven architecture. Labels: Client (x2), Mediator, Wrapper (x3), Source (x3).]

Page 10: Introduction to Data Warehousing CPS 196.03 Notes 6.

Advantages of Warehousing

• High query performance
• Queries not visible outside warehouse
• Local processing at sources unaffected
• Can operate when sources unavailable
• Can query data not stored in a DBMS
• Extra information at warehouse
    - Modify, summarize (store aggregates)
    - Add historical information

Page 11: Introduction to Data Warehousing CPS 196.03 Notes 6.

Advantages of Query-Driven

• No need to copy data
    - less storage
    - no need to purchase data
• More up-to-date data
• Query needs can be unknown
• Only query interface needed at sources
• May be less draining on sources

Page 12: Introduction to Data Warehousing CPS 196.03 Notes 6.

OLTP vs. OLAP

• OLTP: On-Line Transaction Processing
    - Describes processing at operational sites
• OLAP: On-Line Analytical Processing
    - Describes processing at warehouse

Page 13: Introduction to Data Warehousing CPS 196.03 Notes 6.

OLTP vs. OLAP

OLTP                                  OLAP
------------------------------------  ------------------------------------
Mostly updates                        Mostly reads
Many small transactions               Queries long, complex
MB-GB of data                         TB-PB of data
Raw data                              Summarized, consolidated data
Clerical users                        Decision-makers, analysts as users
Up-to-date data
Consistency, recoverability critical

Page 14: Introduction to Data Warehousing CPS 196.03 Notes 6.

Data Marts

• Smaller warehouses
• Spans part of organization
    - e.g., marketing (customers, products, sales)
• Do not require enterprise-wide consensus
    - but long-term integration problems?

Page 15: Introduction to Data Warehousing CPS 196.03 Notes 6.

Warehouse Models & Operators

• Data Models
    - relations
    - stars & snowflakes
    - cubes
• Operators
    - slice & dice
    - roll-up, drill down
    - pivoting
    - other

Page 16: Introduction to Data Warehousing CPS 196.03 Notes 6.

Warehouse Models

• Modeling data warehouses: dimensions, measures
• Star schema: a fact table in the middle connected to a set of dimension tables
• Snowflake schema: a refinement of the star schema in which some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
• Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore also called a galaxy schema

Page 17: Introduction to Data Warehousing CPS 196.03 Notes 6.

Star

customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la

product
prodId  name  price
p1      bolt  10
p2      nut   5

store
storeId  city
c1       nyc
c2       sfo
c3       la

sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Measures: qty, amt

Page 18: Introduction to Data Warehousing CPS 196.03 Notes 6.

Star Schema

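sale(orderId, date, custId, prodId, storeId, qty, amt)
customer(custId, name, address, city)
product(prodId, name, price)
store(storeId, city)

[Figure: sale sits in the center; its custId, prodId, and storeId link to customer, product, and store.]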
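
A minimal sketch of how this star schema might be declared and queried, assuming SQLite through Python's standard `sqlite3` module. The column types, the key constraints, and the example join query are assumptions for illustration; the table names, column names, and rows come from the slides.

```python
import sqlite3

# In-memory database; table and column names follow the slides,
# column types and constraints are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE sale (
    orderId TEXT PRIMARY KEY,
    date    TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty     INTEGER,   -- measure
    amt     INTEGER    -- measure
);
""")
conn.executemany("INSERT INTO customer VALUES (?,?,?,?)", [
    (53, "joe", "10 main", "sfo"),
    (81, "fred", "12 main", "sfo"),
    (111, "sally", "80 willow", "la"),
])
conn.executemany("INSERT INTO product VALUES (?,?,?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO store VALUES (?,?)",
                 [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
conn.executemany("INSERT INTO sale VALUES (?,?,?,?,?,?,?)", [
    ("o100", "1/7/97", 53, "p1", "c1", 1, 12),
    ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
    ("105", "3/8/97", 111, "p1", "c3", 5, 50),  # orderId as printed on the slide
])

# Star join: the fact table joins to each dimension it needs,
# then the measure amt is aggregated by dimension attributes.
rows = conn.execute("""
    SELECT store.city, product.name, SUM(sale.amt)
    FROM sale
    JOIN store   ON sale.storeId = store.storeId
    JOIN product ON sale.prodId  = product.prodId
    GROUP BY store.city, product.name
    ORDER BY store.city, product.name
""").fetchall()
print(rows)
```

The fact table carries only keys and measures, so descriptive attributes such as the store's city are reached through a join to the dimension table.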

Page 19: Introduction to Data Warehousing CPS 196.03 Notes 6.

Another Example of Star Schema

Sales Fact Table
    keys:     time_key, item_key, branch_key, location_key
    measures: units_sold, dollars_sold, avg_sales

Dimension tables
    time(time_key, day, day_of_the_week, month, quarter, year)
    item(item_key, item_name, brand, type, supplier_type)
    branch(branch_key, branch_name, branch_type)
    location(location_key, street, city, state_or_province, country)

Page 20: Introduction to Data Warehousing CPS 196.03 Notes 6.

Terms

• Fact table: sale
• Dimension tables: customer, product, store
• Measures: qty, amt

sale(orderId, date, custId, prodId, storeId, qty, amt)
customer(custId, name, address, city)
product(prodId, name, price)
store(storeId, city)

Page 21: Introduction to Data Warehousing CPS 196.03 Notes 6.

Dimension Hierarchies

store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

city
cityId  pop  regId
sfo     1M   north
la      5M   south

region
regId  name
north  cold region
south  warm region

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

Hierarchies: store -> sType; store -> city -> region

Related terms: snowflake schema, constellations

Page 22: Introduction to Data Warehousing CPS 196.03 Notes 6.

Example of Snowflake Schema

Sales Fact Table
    keys:     time_key, item_key, branch_key, location_key
    measures: units_sold, dollars_sold, avg_sales

Dimension tables
    time(time_key, day, day_of_the_week, month, quarter, year)
    item(item_key, item_name, brand, type, supplier_key)
        supplier(supplier_key, supplier_type)
    branch(branch_key, branch_name, branch_type)
    location(location_key, street, city_key)
        city(city_key, city, state_or_province, country)

Page 23: Introduction to Data Warehousing CPS 196.03 Notes 6.

Example of Fact Constellation

Sales Fact Table
    keys:     time_key, item_key, branch_key, location_key
    measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table
    keys:     time_key, item_key, shipper_key, from_location, to_location
    measures: dollars_cost, units_shipped

Dimension tables
    time(time_key, day, day_of_the_week, month, quarter, year)
    item(item_key, item_name, brand, type, supplier_type)
    branch(branch_key, branch_name, branch_type)
    location(location_key, street, city, province_or_state, country)
    shipper(shipper_key, shipper_name, location_key, shipper_type)

Page 24: Introduction to Data Warehousing CPS 196.03 Notes 6.

Cube

Fact table view:

sale
prodId  storeId  amt
p1      c1       12
p2      c1       11
p1      c3       50
p2      c2       8

Multi-dimensional cube (dimensions = 2):

      c1   c2   c3
p1    12        50
p2    11   8

Recall counters in Apriori
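
The two views hold the same data, which a short Python sketch can make concrete. The nested-dictionary representation and the "-" placeholder for empty cells are assumptions chosen for illustration, not something the slides prescribe.

```python
from collections import defaultdict

# Fact table rows from the slide: (prodId, storeId, amt)
sale = [("p1", "c1", 12), ("p2", "c1", 11), ("p1", "c3", 50), ("p2", "c2", 8)]

# cube[prodId][storeId] holds amt; a missing key is an empty cell.
cube = defaultdict(dict)
for prod, store, amt in sale:
    cube[prod][store] = cube[prod].get(store, 0) + amt

# Print the cube as a grid, one row per product and one column per store.
stores = sorted({store for _, store, _ in sale})
print("  ", *stores)
for prod in sorted(cube):
    print(prod, *[cube[prod].get(store, "-") for store in stores])
```

Each fact row becomes one cell addressed by its dimension values, much like the per-itemset counters the slide alludes to.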

Page 25: Introduction to Data Warehousing CPS 196.03 Notes 6.

3-D Cube

Fact table view:

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Multi-dimensional cube (dimensions = 3):

day 1:
      c1   c2   c3
p1    12        50
p2    11   8

day 2:
      c1   c2   c3
p1    44   4
p2

Page 26: Introduction to Data Warehousing CPS 196.03 Notes 6.

Aggregates

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE WHERE date = 1

Result: 81

Page 27: Introduction to Data Warehousing CPS 196.03 Notes 6.

Aggregates

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

ans
date  sum
1     81
2     48

Page 28: Introduction to Data Warehousing CPS 196.03 Notes 6.

Another Example

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM SALE GROUP BY date, prodId

Result:
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48

rollup: from this (prodId, date) result to the per-date sums; drill-down: the reverse.
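
The three aggregate queries from these slides can be checked with a small sketch, assuming SQLite via Python's standard `sqlite3` module. The column types and the added ORDER BY clauses (used only to make the output order predictable) are assumptions; the third query selects prodId as well as date so that each output row identifies its group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?,?,?,?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])

# Add up amounts for day 1
total_day1 = conn.execute(
    "SELECT sum(amt) FROM sale WHERE date = 1").fetchone()[0]

# Add up amounts by day (the rolled-up view)
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date").fetchall()

# Add up amounts by day and product (the drilled-down view)
by_day_product = conn.execute(
    "SELECT prodId, date, sum(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId").fetchall()

print(total_day1)      # 81
print(by_day)          # [(1, 81), (2, 48)]
print(by_day_product)  # [('p1', 1, 62), ('p2', 1, 19), ('p1', 2, 48)]
```

Dropping prodId from the GROUP BY list rolls the third result up into the second; adding it back drills down again.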

Page 29: Introduction to Data Warehousing CPS 196.03 Notes 6.

Aggregates

• Operators: sum, count, max, min, median, avg
• “Having” clause
• Using dimension hierarchy
    - average by region (within store)
    - maximum by month (within date)

Page 30: Introduction to Data Warehousing CPS 196.03 Notes 6.

Types of Measures in Data Cubes

• Distributive: the result derived by applying the function to n aggregate values is the same as that derived by applying the function to all the data without partitioning
    - E.g., count(), sum(), min(), max()
• Algebraic: can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
    - E.g., avg(), min_N(), standard_deviation()
• Holistic: there is no constant bound on the storage size needed to describe a subaggregate
    - E.g., median(), mode(), rank()
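
A small Python sketch can illustrate the three categories. The choice to partition the data by day, using the amt values from the earlier sale table, is an assumption made only for this example.

```python
from statistics import median

# Two partitions of the amt values from the sale table (day 1 and day 2)
day1 = [12, 11, 50, 8]
day2 = [44, 4]
all_data = day1 + day2

# Distributive: summing the per-partition sums gives the overall sum.
sum_of_sums = sum([sum(day1), sum(day2)])

# Algebraic: avg needs a bounded number of distributive sub-aggregates
# per partition, here M = 2 values: (sum, count).
parts = [(sum(p), len(p)) for p in (day1, day2)]
avg = sum(s for s, _ in parts) / sum(n for _, n in parts)

# Holistic: the per-partition medians are not enough to recover the
# overall median; in general the full data is needed.
median_of_medians = median([median(day1), median(day2)])
true_median = median(all_data)

print(sum_of_sums, sum(all_data))      # 129 129
print(avg)                             # 21.5
print(median_of_medians, true_median)  # 17.75 11.5
```

The distinction matters when precomputing cube cells: distributive and algebraic measures can be rolled up from stored sub-aggregates, while holistic ones generally cannot.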

Page 31: Introduction to Data Warehousing CPS 196.03 Notes 6.

Cube Aggregation

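Example: computing sums

day 1:
      c1   c2   c3
p1    12        50
p2    11   8

day 2:
      c1   c2   c3
p1    44   4
p2

Sum over days:
      c1   c2   c3
p1    56   4    50
p2    11   8

Sum over days and products:
      c1   c2   c3
sum   67   12   50

Sum over days and stores:
      sum
p1    110
p2    19

Sum over everything: 129

rollup moves toward the coarser sums; drill-down moves back toward the detailed cells.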

Page 32: Introduction to Data Warehousing CPS 196.03 Notes 6.

Cube Operators

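day 1:
      c1   c2   c3
p1    12        50
p2    11   8

day 2:
      c1   c2   c3
p1    44   4
p2

Sum over days:
      c1   c2   c3
p1    56   4    50
p2    11   8

Sum over days and products:
      c1   c2   c3
sum   67   12   50

Sum over days and stores:
      sum
p1    110
p2    19

Sum over everything: 129

Cells addressed with the * (all) notation:
sale(c1,*,*) = 67
sale(c2,p2,*) = 8
sale(*,*,*) = 129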
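
The `sale(...)` notation can be read as a lookup in which `*` stands for "all values of this dimension". A minimal Python sketch follows, assuming the argument order (store, product, date) suggested by `sale(c2,p2,*)`; the function body is an illustration, not an operator defined by the slides.

```python
# Fact rows reordered as (storeId, prodId, date, amt) to match the
# assumed argument order of sale(store, product, date).
SALE = [
    ("c1", "p1", 1, 12), ("c1", "p2", 1, 11), ("c3", "p1", 1, 50),
    ("c2", "p2", 1, 8), ("c1", "p1", 2, 44), ("c2", "p1", 2, 4),
]

def sale(store="*", prod="*", date="*"):
    """Sum amt over rows matching each argument; '*' matches any value."""
    pattern = (store, prod, date)
    return sum(
        row[3]
        for row in SALE
        # zip stops after the three dimension columns, so amt is not compared
        if all(p == "*" or p == v for p, v in zip(pattern, row))
    )

print(sale("c1", "*", "*"))   # 67
print(sale("c2", "p2", "*"))  # 8
print(sale("*", "*", "*"))    # 129
```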

Page 33: Introduction to Data Warehousing CPS 196.03 Notes 6.

Extended Cube

day 1:
      c1   c2   c3   *
p1    12        50   62
p2    11   8         19
*     23   8    50   81

day 2:
      c1   c2   c3   *
p1    44   4         48
p2
*     44   4         48

* (all days):
      c1   c2   c3   *
p1    56   4    50   110
p2    11   8         19
*     67   12   50   129

Example: sale(*,p2,*) = 19

Page 34: Introduction to Data Warehousing CPS 196.03 Notes 6.

Cube Aggregates Lattice

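Lattice of group-bys:

city, product, date
city, product    city, date    product, date
city    product    date
all

Example aggregates at different lattice levels:

city, product, date:

day 1:
      c1   c2   c3
p1    12        50
p2    11   8

day 2:
      c1   c2   c3
p1    44   4
p2

city, product:
      c1   c2   c3
p1    56   4    50
p2    11   8

city:
      c1   c2   c3
sum   67   12   50

all: 129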
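
The lattice can be enumerated mechanically: every subset of the dimensions is one group-by. The sketch below assumes an in-memory fact table and a helper named `group_by`, both introduced here only for illustration.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("city", "product", "date")
# Fact rows as (city, product, date, amt), taken from the slides' example
SALE = [
    ("c1", "p1", 1, 12), ("c1", "p2", 1, 11), ("c3", "p1", 1, 50),
    ("c2", "p2", 1, 8), ("c1", "p1", 2, 44), ("c2", "p1", 2, 4),
]

def group_by(dims):
    """Sum amt grouped by the given subset of DIMS."""
    idx = [DIMS.index(d) for d in dims]
    out = defaultdict(int)
    for row in SALE:
        out[tuple(row[i] for i in idx)] += row[3]
    return dict(out)

# One entry per lattice node, from the finest group-by down to "all" (empty tuple).
lattice = {
    dims: group_by(dims)
    for k in range(len(DIMS), -1, -1)
    for dims in combinations(DIMS, k)
}

for dims, cells in lattice.items():
    print(dims or "all", cells)
```

With n dimensions the lattice has 2^n nodes (8 here), which is why implementations typically choose which group-bys to materialize rather than storing all of them.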

Page 35: Introduction to Data Warehousing CPS 196.03 Notes 6.

Dimension Hierarchies

Hierarchy: city -> state -> all

cities
city  state
c1    CA
c2    NY

Page 36: Introduction to Data Warehousing CPS 196.03 Notes 6.

Dimension Hierarchies

Lattice nodes (not all arcs shown...):

city, product, date    state, product, date
city, product    city, date    product, date    state, product    state, date
city    product    date    state
all

Page 37: Introduction to Data Warehousing CPS 196.03 Notes 6.

Interesting Hierarchy

[Figure: hierarchy with nodes all, years, quarters, months, weeks, days.]

time (conceptual dimension table)
day  week  month  quarter  year
1    1     1      1        2000
2    1     1      1        2000
3    1     1      1        2000
4    1     1      1        2000
5    1     1      1        2000
6    1     1      1        2000
7    1     1      1        2000
8    2     1      1        2000

Page 38: Introduction to Data Warehousing CPS 196.03 Notes 6.

Aggregation Using Hierarchies

Hierarchy: customer -> region -> country

day 1:
      c1   c2   c3
p1    12        50
p2    11   8

day 2:
      c1   c2   c3
p1    44   4
p2

Rolled up from customer to region (customer c1 in Region A; customers c2, c3 in Region B):

      region A   region B
p1    56         54
p2    11         8
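
Rolling up along a hierarchy amounts to replacing each dimension value with its parent before summing. A minimal sketch, assuming the customer-to-region mapping is held in a plain dictionary (the names `REGION` and `by_region` are illustrative):

```python
from collections import defaultdict

# Customer -> region mapping stated on the slide
REGION = {"c1": "A", "c2": "B", "c3": "B"}

# Fact rows as (prodId, customer, date, amt)
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

# Sum amt per (product, region): each customer is replaced by its region
# and the date dimension is aggregated away.
by_region = defaultdict(int)
for prod, cust, _date, amt in SALE:
    by_region[(prod, REGION[cust])] += amt

print(dict(by_region))
```

In a relational warehouse the same step would typically be a join between the fact table and the dimension table holding the hierarchy, followed by GROUP BY on the parent attribute.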

Page 39: Introduction to Data Warehousing CPS 196.03 Notes 6.

Multidimensional Data

Sales volume as a function of product, month, and region

[Figure: cube with axes Product, Region, Month.]

Dimensions: Product, Location, Time

Hierarchical summarization paths:

Product     Location    Time
Industry    Region      Year
Category    Country     Quarter
Product     City        Month, Week
            Office      Day

Page 40: Introduction to Data Warehousing CPS 196.03 Notes 6.

Typical OLAP Operations

[Figure: data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), and Country (U.S.A, Canada, Mexico, sum); one cell is annotated "Total annual sales of TV in U.S.A."]

Page 41: Introduction to Data Warehousing CPS 196.03 Notes 6.

Typical OLAP Operations

• Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
    - from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
• Slice and dice: project and select
• Pivot (rotate):
    - reorient the cube, visualization, 3D to series of 2D planes
• Other operations
    - drill across: involving (across) more than one fact table
    - drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
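
As a rough illustration of "slice and dice: project and select", the sketch below applies selections and a projection to the sale rows used on earlier slides. The particular predicates and the dictionary row format are assumptions chosen for the example.

```python
# Fact rows from the earlier sale table
SALE = [
    {"prodId": "p1", "storeId": "c1", "date": 1, "amt": 12},
    {"prodId": "p2", "storeId": "c1", "date": 1, "amt": 11},
    {"prodId": "p1", "storeId": "c3", "date": 1, "amt": 50},
    {"prodId": "p2", "storeId": "c2", "date": 1, "amt": 8},
    {"prodId": "p1", "storeId": "c1", "date": 2, "amt": 44},
    {"prodId": "p1", "storeId": "c2", "date": 2, "amt": 4},
]

# Slice: select on a single dimension value (date = 1)
slice_day1 = [r for r in SALE if r["date"] == 1]

# Dice: select on value sets for several dimensions at once
dice = [r for r in SALE
        if r["prodId"] in {"p1"} and r["storeId"] in {"c1", "c2"}]

# Project: keep only some columns of the slice
projected = sorted({(r["prodId"], r["storeId"]) for r in slice_day1})

print(sum(r["amt"] for r in slice_day1))  # 81
print([r["amt"] for r in dice])           # [12, 44, 4]
print(projected)
```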

Page 42: Introduction to Data Warehousing CPS 196.03 Notes 6.

Fig. 3.10 Typical OLAP Operations

Page 43: Introduction to Data Warehousing CPS 196.03 Notes 6.

Pivoting

Fact table view:

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Multi-dimensional cube:

day 1:
      c1   c2   c3
p1    12        50
p2    11   8

day 2:
      c1   c2   c3
p1    44   4
p2

Pivoted on storeId (amounts summed over days):

      c1   c2   c3
p1    56   4    50
p2    11   8

Pivot turns unique values from one column into unique columns in the output.
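
A minimal Python sketch of this pivot, assuming amounts are summed when several rows fall into the same output cell and that empty cells are represented as `None` (both are choices made for the example, and `pivot` is an illustrative helper name):

```python
# Fact rows as (prodId, storeId, date, amt)
SALE = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

def pivot(rows):
    """Turn distinct storeId values into columns, summing amt per (prodId, storeId)."""
    stores = sorted({store for _, store, _, _ in rows})
    cells = {}
    for prod, store, _date, amt in rows:
        cells[(prod, store)] = cells.get((prod, store), 0) + amt
    header = ["prodId"] + stores
    body = [
        [prod] + [cells.get((prod, store)) for store in stores]
        for prod in sorted({prod for prod, _, _, _ in rows})
    ]
    return header, body

header, body = pivot(SALE)
print(header)
for line in body:
    print(line)
```

The output columns are not known until the data is read, which is what distinguishes a pivot from an ordinary GROUP BY.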