CS5226 2002 Data Warehouse & Performance Tuning Xiaofang Zhou School of Computing, NUS Office:...

CS5226 2002

Data Warehouse & Performance Tuning

Xiaofang ZhouSchool of Computing, NUS

Office: S16-08-20Email: [email protected]: www.itee.uq.edu.au/~zxf

2

Outline

Part 1: A review of data warehousing

Part 2: Data warehouse tuning

3

OLAP vs OLTP

OLAP: On-line analytical processing Very large amount of historical data Read only Long and complicated sessions For a few trained users

OLTP: On-line transaction processing Small amount of current data Frequent updates, deletions and insertions A very large number of fixed, short-duration

transactions For many ad hoc users

4

DW Characteristics A DW provides access to date for complex

analysis, knowledge discovery and decision support

Subject oriented (vs application oriented) The data is organized around subjects (such as

Sales), rather than operations applications (such as order processing).

Non-volatile Not usually subject to change

Integrated Data is consistent

Time variant Historical data is recorded

(Inmon 1992)

5

Data Warehouse Types Enterprise-wide data warehouses

Large projects with massive investment of time and resources

Virtual data warehouses Provide views of operational DBs that are

materialised for efficiency Data marts

Targeted to a subset of the organisation Also called department-level data warehouse Low-risk, low-cost, but hard to evolve

… a data warehouse = a collection of data marts?

6

Process of Data Warehousing

Databases

Cleaning Reformatting

OLAP

DSS

Data Mining

Back flushing

Other data inputs

Data Warehouse

Data Metadata

Updates/New Data

7

Data Cube

Sales data with three dimensions: region, product and fiscal period

Product

Region

Fiscal period

Sales

8

OLAP Data Models ROLAP: relational OLAP

Use of relational technology, suitably adapted and extended

The data is stored using tables But the analysis operations are carried out

efficiently using special data structures MOLAP: multidimensional OLAP

A more radical approach Storing data directly in a multi-dimensional form

HOLAP: hybrid OLAP Data defined as in ROLAP but stored as in

MLOAP

9

Multidimensional model is usually not 3NF But it is as simple as a spreadsheet

two-dimensional matrix

And its query performance is much better then in the relational model

Multidimensional or Relational

Region

Fiscal period

Sales

10

Tables Types in MD Model A multidimensional model consists a fact table

and a number of dimension tables Fact table - contains measured variables, and

identifies them with pointers to dimension tables Dimension table - tuples of attributes of dimension

ProdID

RegID

OrderTime

Quantity

ItemCost

SalesProduct(Prod No, Name, Price,…)

Region(Area, Description)

Time(Date, PeriodNum, QuaterNum, Year)

dimension tablesfact table

11

Multidimensional Schema Star schema

One fact table One table for each dimension Typically not normalised

Snowflake schema A variation on the star schema Dimensional tables are organised into a hierarchy Can be in 3NF

12

Star Schema Example

Sales

Product

Region

Time

13

Snowflake Schema Example

Sales

Product

Region

Time

Class

…if there are queries about classes of products, after

normalisation, it becomes a snowflake schema…

14

Complex Snowflake Schema

Fact table

Fiscal qtrs FQ dates

Sales revenue

ProductProd. name

P. line

Dimension tablesDimension tables

15

Typical Functionality of DW Pivoting

Rotate data cube to show a different orientation of axes

Roll-up Move up concept hierarchy, grouping into larger

units along a dimension Drill-down

Disaggregate to a finer-grained view Slice and dice

Perform project operations on the dimensions Other operations, such as sorting, selection

16

Data Warehousing Technology Aggregate targeting

Aggregates flow up from a wide selection of data, and then Targeted decisions flow down

Workload type Broad: aggregate queries over ranges of values

E.g., find the total sales by region and quarter. Deep: queries that require precise individualized information

E.g., which frequent flyers have been delayed several times in the last month?

Dynamic (vs. Static): queries that require up-to-date information E.g.. which nodes have the highest traffic now?

Main approaches Multidimensional arrays Special indexes – bitmaps and multidimensional indexes Materialised views Optimised foreign key joins Approximation by sampling

17

Tuning Knobs

Indexes Materialised views Approximation

18

Bitmap Indexes Suitable for deep and broad, static,

many query attributes One bitmap per attribute value

Bitmap size = the number of tuples Bitmap value: a[i]=1tuple i has the value

Compact, and easy to intersect several bitmaps (for multi-attribute queries) B-tree for selective attributes, bitmaps for

unselective attributes… multiple B-tree indexes = multidimensional indexes?

19

Bitmaps - Data

Settings:lineitem ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY, L_LINENUMBER,

L_QUANTITY, L_EXTENDEDPRICE , L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS , L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT , L_SHIPMODE , L_COMMENT );

create bitmap index b_lin_2 on lineitem(l_returnflag);

create bitmap index b_lin_3 on lineitem(l_linestatus);

create bitmap index b_lin_4 on lineitem(l_linenumber); 100000 rows ; cold buffer Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb

drives (10000RPM), Windows 2000.

20

Bitmaps - Queries

Queries: 1 attribute

select count(*) from lineitem where l_returnflag = 'N';

2 attributesselect count(*) from lineitem where l_returnflag = 'N' and l_linenumber > 3;

3 attributesselect count(*) from lineitem where l_returnflag =

'N' and l_linenumber > 3 and l_linestatus = 'F';

21

Bitmaps Order of magnitude

improvement compared to scan.

Bitmaps are best suited for multiple conditions on several attributes, each having a low selectivity.

ANR

l_returnflagOF

l_linestatus

0

2

4

6

8

10

12

1 2 3

Th

rou

gh

pu

t (Q

uer

ies/

sec)

linear scan

bitmap

22

Multidimensional Indexes Suitable for deep or broad, static, few

query attributes with range queries Examples: quadtrees and R-trees

The space is conceptually organised in grids Variable sizes and hierarchical

Quadtree: space-centric, regular, recursive decomposition

R-tree: data-centric, hierarchical bounding rectangles

The problem of “dimensional curse”

23

Multidimensional Indexes - Data

Settings:create table spatial_facts( a1 int, a2 int, a3 int, a4 int, a5 int, a6 int, a7 int, a8 int, a9 int, a10 int, geom_a3_a7 mdsys.sdo_geometry );create index r_spatialfacts on spatial_facts(geom_a3_a7) indextype is mdsys.spatial_index;create bitmap index b2_spatialfacts on

spatial_facts(a3,a7); 500000 rows ; cold buffer Dual Pentium II (450MHz, 512Kb), 512 Mb RAM,

3x18Gb drives (10000RPM), Windows 2000.

24

Multidimensional Indexes - QueriesQueries:

Point Queriesselect count(*) from fact where a3 = 694014 and a7 = 928878;

select count(*) from spatial_facts where SDO_RELATE(geom_a3_a7, MDSYS.SDO_GEOMETRY(2001, NULL, MDSYS.SDO_POINT_TYPE(694014,928878, NULL), NULL, NULL), 'mask=equal querytype=WINDOW') = 'TRUE';

Range Queriesselect count(*) from spatial_facts where

SDO_RELATE(geom_a3_a7, mdsys.sdo_geometry(2003,NULL,NULL, mdsys.sdo_elem_info_array(1,1003,3),mdsys.sdo_ordinate_array(10,800000,1000000,1000000)), 'mask=inside querytype=WINDOW') = 'TRUE';

select count(*) from spatial_facts where a3 > 10 and a3 < 1000000 and a7 > 800000 and a7 < 1000000;

25

Multidimensional Indexes Oracle 8i on

Windows 2000 Spatial Extension:

2-dimensional data Spatial functions used

in the query R-tree does not

perform well because of the overhead of spatial extension.

0

2

4

6

8

10

12

14

point query range query

Res

po

nse

Tim

e (s

ec) bitmap - two

attributes

r-tree

26

Materialised Views

Suitable for broad, static applications Also dynamic but at a cost

Frequent queries are materialised View maintenance cost

Incremental maintenance Self-maintainable data warehouses

27

Materialised Views - Data

Settings:orders( ordernum, itemnum, quantity, purchaser, vendor );

create clustered index i_order on orders(itemnum);

store( vendor, name );

item(itemnum, price);

create clustered index i_item on item(itemnum);

1000000 orders, 10000 stores, 400000 items; Cold buffer Oracle 9i Pentium III (1 GHz, 256 Kb), 1Gb RAM, Adapter 39160

with 2 channels, 3x18Gb drives (10000RPM), Linux Debian 2.4.

28

Materialised Views - Data

Settings:create materialized view vendorOutstandingbuild immediaterefresh completeenable query rewriteasselect orders.vendor, sum(orders.quantity*item.price)from orders,itemwhere orders.itemnum = item.itemnumgroup by orders.vendor;

29

Materialised Views -Transactions

Concurrent Transactions: Insertions

insert into orders values (1000350,7825,562,'xxxxxx6944','vendor4');

Queriesselect orders.vendor,

sum(orders.quantity*item.price)from orders,itemwhere orders.itemnum = item.itemnumgroup by orders.vendor;

select * from vendorOutstanding;

30

Materialised Views Graph:

Oracle9i on Linux Total sale by vendor is

materialized Trade-off between query

speed-up and view maintenance:

The impact of incremental maintenance on performance is significant.

Rebuild maintenance achieves a good throughput.

A static data warehouse offers a good trade-off.

0

400

800

1200

1600

MaterializedView (fast on

commit)

MaterializedView (complete

on demand)

NoMaterialized

ViewThro

ughp

ut (s

tate

men

ts/s

ec)

0

2

4

6

8

10

12

14

Materialized View No Materialized View

Res

pons

e Ti

me

(sec

)

31

Materialised View Maintenance Problem when large

number of views to maintain.

The order in which views are maintained is important:

A view can be computed from an existing view instead of being recomputed from the base relations (total per region can be computed from total per nation).

Let the views and base tables be nodes v_i

Let there be an edge from v_1 to v_2 if it possible to compute the view v_2 from v_1. Associate the cost of computing v_2 from v_1 to this edge.

Compute all pairs shortest path where the start nodes are the set of base tables.

The result is an acyclic graph A. Take a topological sort of A and let that be the order of view construction.

32

Result Approximation

Suitable for broad, dynamic, large data

Examples When computing the average, one can

increase sample sizes until no significant change of the value

Increased accuracy with time Approximated results within given error

bounds

33

Approximations - Data

Settings: TPC-H schema Approximations

insert into approxlineitem

select top 6000 *from lineitemwhere l_linenumber = 4;

insert into approxordersselect O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERDATE, O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT

from orders, approxlineitemwhere o_orderkey = l_orderkey;

34

Approximations - Queriesinsert into approxsupplierselect distinct S_SUPPKEY,

S_NAME ,S_ADDRESS,S_NATIONKEY,S_PHONE,S_ACCTBAL,S_COMMENT

from approxlineitem, supplierwhere s_suppkey = l_suppkey;

insert into approxpartselect distinct P_PARTKEY,

P_NAME , P_MFGR , P_BRAND , P_TYPE , P_SIZE ,P_CONTAINER , P_RETAILPRICE , P_COMMENT

from approxlineitem, partwhere p_partkey = l_partkey;

insert into approxpartsuppselect distinct PS_PARTKEY,

PS_SUPPKEY,PS_AVAILQTY,PS_SUPPLYCOST,PS_COMMENT

from partsupp, approxpart, approxsupplierwhere ps_partkey = p_partkey and

ps_suppkey = s_suppkey;

insert into approxcustomerselect distinct C_CUSTKEY,

C_NAME ,C_ADDRESS,C_NATIONKEY,C_PHONE ,C_ACCTBAL,C_MKTSEGMENT,C_COMMENT

from customer, approxorderswhere o_custkey = c_custkey;insert into approxregion select * from

region;insert into approxnation select * from

nation;

35

Approximations - More queries

Queries: Single table query on lineitemselect l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,

sum(l_extendedprice) as sum_base_price,

sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,

sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,

avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order

from lineitem

where datediff(day, l_shipdate, '1998-12-01') <= '120'

group by l_returnflag, l_linestatus

order by l_returnflag, l_linestatus;

36

Approximations - Still MoreQueries:

6-way joinselect n_name, avg(l_extendedprice * (1 - l_discount)) as

revenuefrom customer, orders, lineitem, supplier, nation, regionwhere c_custkey = o_custkey

and l_orderkey = o_orderkeyand l_suppkey = s_suppkeyand c_nationkey = s_nationkeyand s_nationkey = n_nationkeyand n_regionkey = r_regionkeyand r_name = 'AFRICA'and o_orderdate >= '1993-01-01'and datediff(year, o_orderdate,'1993-01-01') < 1

group by n_nameorder by revenue desc;

37

Approximation Accuracy

Good approximation for query Q1 on lineitem

The aggregated values obtained on a query with a 6-way join are significantly different from the actual values -- for some applications may still be good enough.

TPC-H Query Q5: 6 way join

-40

-20

0

20

40

60

1 2 3 4 5

Groups returned in the query

Ap

pro

xim

ated

ag

gre

gat

ed v

alu

es

(% o

f B

ase

Ag

gre

gat

e V

alu

e)

1% sample

10% sample

TPC-H Query Q1: lineitem

-40

-20

0

20

40

60

1 2 3 4 5 6 7 8

Aggregate values

Ap

pro

xim

ated

A

gg

reg

ate

Val

ues

(%

of

Bas

e A

gg

reg

ate

Val

ue)

1% sample

10% sample

38

Approximation Speedup Aqua approximation

on the TPC-H schema 1% and 10% lineitem

sample propagated. The query speed-up

obtained with approximated relations is significant.

00.5

11.5

22.5

33.5

4

Q1 Q5

TPC-H Queries

Res

po

nse

Tim

e (s

ec)

base relations

1 % sample

10% sample

39

Summary

In this module, we have covered: Basic concepts of data warehousing Data warehousing technologies How to optimise data warehouse

performance

CS5226 2002 Data Warehouse & Performance Tuning Xiaofang Zhou School of Computing, NUS Office:...

Documents

Transcript of CS5226 2002 Data Warehouse & Performance Tuning Xiaofang Zhou School of Computing, NUS Office:...