
Dr. Panagiotis Symeonidis
Data Engineering Laboratory

http://delab.csd.auth.gr/~symeon

Data Warehouse implementation: Part B

Cuboids Materialization as an Optimization Problem

Minimize: the average time taken to evaluate a view

Constraint: materialize a fixed number k of views

Greedy algorithm
- At each step, the best choice is made based on what has been selected before
- It does not guarantee the optimal solution

Example of lattice of views diagram

          psc

    pc    ps    sc

    p     s     c

p: part   s: supp   c: cust

The lattice of views framework

If view V2 can be answered using the results of view V1, then
- V2 is a descendant of V1
- V1 is an ancestor of V2
(denoted V2 ≼ V1)

E.g. (part) ≼ (part, cust)

Some Definitions

k is the number of views to be materialized

C(v) is the cost of view v

Given
- v is a view
- S is a set of views which are already selected to be materialized

The benefit of selecting v for materialization is

B(v, S) = C(S) - C(S ∪ {v})

where C(S) denotes the total cost of evaluating all views when only the views in S are materialized

Greedy Algorithm

S ← {top view};
For i = 1 to k do
    Select the view v not in S such that B(v, S) is maximized;
    S ← S ∪ {v};
Return S
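The greedy loop above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own implementation: it assumes the part/supp/cust lattice and the view costs used in the worked example of this deck, encodes each view as a string of dimension letters, and treats w ≼ v as a subset test on those letters. The helper names (answers, view_cost, benefit, greedy) are made up for this sketch.

```python
# Views are strings of dimension letters; costs (in M) are those of the
# part/supp/cust example lattice in this deck.
COST = {
    "psc": 6.0, "pc": 6.0, "ps": 0.8, "sc": 6.0,
    "p": 0.2, "s": 0.01, "c": 0.1,
}
TOP = "psc"


def answers(v, w):
    """True if view w can be answered from view v (w is a descendant of v)."""
    return set(w) <= set(v)


def view_cost(w, selected):
    """Cost of answering w from its cheapest materialized ancestor."""
    return min(COST[v] for v in selected if answers(v, w))


def benefit(v, selected):
    """B(v, S): total saving over all descendants of v if v is added to S."""
    return sum(
        max(view_cost(w, selected) - COST[v], 0.0)
        for w in COST
        if answers(v, w)
    )


def greedy(k):
    """Return the k greedy picks as (view, benefit) pairs, in pick order."""
    selected = [TOP]  # the top view is always materialized
    picks = []
    for _ in range(k):
        candidates = [v for v in COST if v not in selected]
        best = max(candidates, key=lambda v: benefit(v, selected))
        picks.append((best, benefit(best, selected)))
        selected.append(best)
    return picks


for view, gain in greedy(2):
    print(view, round(gain, 2))
```

With k = 2 the sketch picks ps first and c second, which matches the benefit table of the worked example.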

1.1 Data Cube

Greedy selection with k = 2 on the lattice of views (view costs in M):

            psc 6M

  pc 6M     ps 0.8M    sc 6M

  p 0.2M    s 0.01M    c 0.1M

1st choice (only psc is materialized):

Benefit from pc = 6M - 6M = 0
Benefit from ps = 6M - 0.8M = 5.2M
Benefit from sc = 6M - 6M = 0
Benefit from p = 6M - 0.2M = 5.8M
Benefit from s = 6M - 0.01M = 5.99M
Benefit from c = 6M - 0.1M = 5.9M

2nd choice (psc and ps are materialized):

Benefit from pc = 6M - 6M = 0
Benefit from sc = 6M - 6M = 0
Benefit from p = 0.8M - 0.2M = 0.6M
Benefit from s = 0.8M - 0.01M = 0.79M
Benefit from c = 6M - 0.1M = 5.9M

Benefit (saving per view x number of views improved):

View   1st Choice (M)       2nd Choice (M)
pc     0 x 3 = 0            0 x 2 = 0
ps     5.2 x 3 = 15.6       (already selected)
sc     0 x 3 = 0            0 x 2 = 0
p      5.8 x 1 = 5.8        0.6 x 1 = 0.6
s      5.99 x 1 = 5.99      0.79 x 1 = 0.79
c      5.9 x 1 = 5.9        5.9 x 1 = 5.9

Two views to be materialized are
1. ps
2. c

V = {ps, c}
Gain(V ∪ {top view}, {top view}) = 15.6 + 5.9 = 21.5

2nd Example of greedy algorithm

Initially, S = {a}; k = 4 (select 3 more)

Lattice of views, level by level, with costs:

  a 100
  b 50, c 75
  d 20, e 30, f 40
  g 1, h 10

First choice
  b: 50 x 5 = 250
  c: 25 x 5 = 125
  d: 80 x 2 = 160
  e: 70 x 3 = 210
  f: 60 x 2 = 120
  g: 99 x 1 = 99
  h: 90 x 1 = 90
  b has the largest benefit and is selected.

Second choice
  c: 25 x 2 = 50
  d: 30 x 2 = 60
  e: 20 x 3 = 60
  f: (100-40) x 1 + (50-40) x 1 = 60 + 10 = 70
  g: 49 x 1 = 49
  h: 40 x 1 = 40
  f has the largest benefit and is selected.

Third choice
  c: 25 x 1 = 25
  d: 30 x 2 = 60
  e: (50-30) x 2 + (40-30) x 1 = 20 x 2 + 10 x 1 = 50
  g: 49 x 1 = 49
  h: 30 x 1 = 30
  d has the largest benefit and is selected.

If we materialize only a, then the total cost would be 8 x 100 = 800

Now, with S = {a, b, f, d}, the total cost is 800 - 250 - 70 - 60 = 420
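The two totals of this example (800 with only a materialized, 420 after the three greedy picks) can be re-derived directly by charging every view the cost of its cheapest materialized ancestor. The sketch below does that. Note the assumption: the lattice edges in CHILDREN are not stated in the text of the slides; they are reconstructed here so that they reproduce the benefit counts (for example, b improving 5 views). The function names are illustrative only.

```python
COST = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}

# ASSUMPTION: edges reconstructed from the benefit counts of the example,
# not taken verbatim from the slides.
CHILDREN = {
    "a": ["b", "c"], "b": ["d", "e"], "c": ["e", "f"],
    "d": ["g"], "e": ["g", "h"], "f": ["h"], "g": [], "h": [],
}


def descendants(v):
    """All views answerable from v, including v itself."""
    seen, stack = {v}, [v]
    while stack:
        for child in CHILDREN[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def total_cost(selected):
    """Sum, over all views, of the cost of the cheapest materialized ancestor."""
    return sum(
        min(COST[v] for v in selected if w in descendants(v))
        for w in COST
    )


print(total_cost({"a"}))                 # 800: every view is answered from a
print(total_cost({"a", "b", "f", "d"}))  # 420: after the three greedy picks
```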

Performance Study

How badly can the Greedy Algorithm perform?

1.1 Data Cube

Lattice of views with costs (k = 2):

  a 200
  b 100, c 99, d 100
  p1 ... p20, q1 ... q20, r1 ... r20, s1 ... s20: 97 each (four groups of 20 nodes)

Each of b, c and d improves itself and 40 of the bottom nodes (41 views in total); c shares 20 of its nodes with b and the other 20 with d.

Benefit from b = 200 - 100 = 100
Benefit from c = 200 - 99 = 101

Benefit (number of views improved x saving per view):

View   1st Choice          2nd Choice (after c is selected)
b      41 x 100 = 4100     21 x 100 = 2100
c      41 x 101 = 4141     (already selected)
d      41 x 100 = 4100     21 x 100 = 2100

Greedy: V = {b, c}
Gain(V ∪ {top view}, {top view}) = 4141 + 2100 = 6241

If b is selected first instead, the second choice is:
  c: 21 x 101 + 20 x 1 = 2141
  d: 41 x 100 = 4100

Optimal: V = {b, d}
Gain(V ∪ {top view}, {top view}) = 4100 + 4100 = 8200

Greedy / Optimal = 6241 / 8200 = 0.7611

If this ratio = 1, Greedy gives an optimal solution.
If this ratio is close to 0, Greedy may give a “bad” solution.

Does this ratio have a lower bound?

It is proved that this ratio is at least 0.63 (more precisely, 1 - 1/e ≈ 0.63).
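The greedy and optimal gains of this example can be checked by brute force. The sketch below is only an illustration under stated assumptions: the slides give the node counts but not which bottom nodes sit under which view, so the COVERS mapping (b over the p and q groups, c over q and r, d over r and s) is an assumption chosen to reproduce the counts 41, 21 and 20. The function names are hypothetical.

```python
from itertools import combinations

COST = {"a": 200, "b": 100, "c": 99, "d": 100}
# ASSUMPTION: which groups of 20 bottom nodes each middle view covers.
COVERS = {"b": "pq", "c": "qr", "d": "rs"}
for group in "pqrs":
    for i in range(1, 21):
        COST[f"{group}{i}"] = 97


def descendants(v):
    """Views answerable from v, including v itself."""
    if v == "a":
        return set(COST)
    if v in COVERS:
        return {v} | {f"{g}{i}" for g in COVERS[v] for i in range(1, 21)}
    return {v}


DESC = {v: descendants(v) for v in COST}  # precomputed once


def total_cost(selected):
    """Each view is charged the cost of its cheapest materialized ancestor."""
    return sum(min(COST[v] for v in selected if w in DESC[v]) for w in COST)


BASE = total_cost({"a"})


def gain(views):
    """Total saving of materializing `views` in addition to the top view."""
    return BASE - total_cost({"a"} | set(views))


def greedy(k):
    chosen = []
    for _ in range(k):
        rest = [v for v in COST if v != "a" and v not in chosen]
        chosen.append(max(rest, key=lambda v: gain(chosen + [v])))
    return chosen


def optimal(k):
    """Exhaustive search over all k-subsets; only feasible for tiny lattices."""
    candidates = [v for v in COST if v != "a"]
    return max(combinations(candidates, k), key=gain)


g, o = greedy(2), optimal(2)
print(g, gain(g))
print(o, gain(o))
print(round(gain(g) / gain(o), 4))
```

The exhaustive search grows combinatorially with k and the lattice size, which is why a greedy heuristic with a guaranteed ratio is attractive in practice.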

Indexing OLAP Data: Bitmap Index

Relation table

Cust   Region    Type
C1     Asia      Retail
C2     Europe    Dealer
C3     Asia      Dealer
C4     America   Retail
C5     Europe    Dealer

Index on Region

RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type

RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
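To make the idea concrete, here is a small Python sketch that builds the two bitmap indexes for the five-row relation above and answers a conjunctive selection with a single bitwise AND. It is an illustration only: each bitmap is stored as a Python integer, bit i stands for RecID i + 1, and the function names are made up for this example.

```python
rows = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i has it."""
    index = {}
    for position, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << position)
    return index


region_index = build_bitmap_index(r[1] for r in rows)
type_index = build_bitmap_index(r[2] for r in rows)


def matching_customers(bitmap):
    """Decode a result bitmap back into customer ids."""
    return [rows[i][0] for i in range(len(rows)) if (bitmap >> i) & 1]


# Region = 'Europe' AND Type = 'Dealer' becomes one bitwise AND.
hits = region_index["Europe"] & type_index["Dealer"]
print(matching_customers(hits))  # ['C2', 'C5']
```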

Determining which materialized cuboid(s) should be selected for OLAP operations

Query: Find the total sales grouped by {product-category, province} with the condition “year = 2004”.

Which one of the 4 following materialized cuboids should be selected to process the query?

1) {year, product, city}

2) {year, product-category, country}

3) {year, product-category, province}

4) {product, province} where year = 2004

Solution:

1) {year, product, city}
- It can be used. However, it costs the most because product and city are at a lower level.

2) {year, product-category, country}
- It cannot be used because country is a more general concept than province.

3) {year, product-category, province}
- It can be used. It could cost less than cuboid 4 if there are not many year values and there are many products for each product-category.

4) {product, province} where year = 2004
- It can be used.
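The usability test applied to the four cuboids can be sketched as a level comparison along each concept hierarchy: a cuboid can answer the query only if, for every attribute the query needs, the cuboid stores that dimension at the same or a more detailed level. The code below is a hedged illustration; the hierarchies (product below product-category; city below province below country) and the function can_answer are assumptions made for this example, and a pre-filtered cuboid is modelled simply by naming the attributes it was already filtered on.

```python
# ASSUMPTION: concept hierarchies, listed from most detailed to most general.
HIERARCHIES = {
    "product": ["product", "product-category"],
    "location": ["city", "province", "country"],
    "time": ["year"],
}
LEVEL = {
    name: (dim, depth)
    for dim, levels in HIERARCHIES.items()
    for depth, name in enumerate(levels)
}


def can_answer(cuboid, group_by, filters, prefiltered=()):
    """True if the query can be computed by rolling up the cuboid.

    prefiltered names attributes on which the cuboid was already restricted
    with the same condition as the query (e.g. year = 2004).
    """
    available = {LEVEL[a][0]: LEVEL[a][1] for a in cuboid}
    needed = list(group_by) + [f for f in filters if f not in prefiltered]
    for attr in needed:
        dim, depth = LEVEL[attr]
        # The cuboid must hold this dimension at the same or a finer level.
        if dim not in available or available[dim] > depth:
            return False
    return True


query = dict(group_by=["product-category", "province"], filters=["year"])
print(can_answer(["year", "product", "city"], **query))               # True
print(can_answer(["year", "product-category", "country"], **query))   # False
print(can_answer(["year", "product-category", "province"], **query))  # True
print(can_answer(["product", "province"], prefiltered=["year"], **query))  # True
```

The check only decides whether a cuboid is usable; choosing the cheapest usable one still depends on cuboid sizes, as the solution notes for cuboids 3 and 4.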

Iceberg queries

Assume we want to find pairs of customers and items such that the customer has purchased the item more than 5 times

select P.custid, P.item, sum(P.qty)
from Purchases P
group by P.custid, P.item
having sum(P.qty) > 5

Execution plan for the query?

The number of groups is very large, but the answer to the query (the tip of the iceberg) is usually very small

Iceberg queries

Original query:

select P.custid, P.item, sum(P.qty)
from Purchases P
group by P.custid, P.item
having sum(P.qty) > 5

Q1:

select P.custid
from Purchases P
group by P.custid
having sum(P.qty) > 5

Q2:

select P.item
from Purchases P
group by P.item
having sum(P.qty) > 5

Generate (custid, item) pairs only for custid from Q1 and item from Q2
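The pruning idea can be tried out with Python's built-in sqlite3 module. The sketch below is an illustration under assumptions: the sample rows are invented, and Q1 and Q2 are folded into the main query as subqueries rather than run as separate steps. The pruning is safe as long as qty is never negative, because a pair whose total exceeds 5 then implies that both its customer and its item exceed 5 overall.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Purchases (custid text, item text, qty integer)")
conn.executemany(
    "insert into Purchases values (?, ?, ?)",
    [  # made-up sample rows, for illustration only
        ("c1", "pen", 4), ("c1", "pen", 3), ("c1", "ink", 1),
        ("c2", "pen", 2), ("c2", "ink", 2), ("c3", "cap", 6),
        ("c4", "ink", 1),
    ],
)

# The iceberg query as written on the slide.
direct = conn.execute(
    """select P.custid, P.item, sum(P.qty) from Purchases P
       group by P.custid, P.item having sum(P.qty) > 5
       order by P.custid, P.item"""
).fetchall()

# Same query, but groups are formed only for customers passing Q1
# and items passing Q2.
pruned = conn.execute(
    """select P.custid, P.item, sum(P.qty) from Purchases P
       where P.custid in (select custid from Purchases
                          group by custid having sum(qty) > 5)
         and P.item in (select item from Purchases
                        group by item having sum(qty) > 5)
       group by P.custid, P.item having sum(P.qty) > 5
       order by P.custid, P.item"""
).fetchall()

print(direct)
print(pruned)
```

Both statements return the same pairs on this data; the pruned form simply considers fewer candidate groups.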

From On-Line Analytical Processing (OLAP) to On-Line Analytical Mining (OLAM)

Why online analytical mining?

High quality of data in data warehouses

OLAP-based exploratory data analysis

Easy selection of data mining functions

April 21, 2023

An OLAM System Architecture

Layer 4: User Interface
  Mining query (in), Mining result (out)
User GUI API
Layer 3: OLAP/OLAM
  OLAM Engine, OLAP Engine
Data Cube API
Layer 2: MDDB
  MDDB, Meta Data
Database API (Filtering & Integration)
Layer 1: Data Repository
  Databases, Data Warehouse
  Data cleaning, Data integration, Filtering
