CS5226 2002 Data Warehouse & Performance Tuning Xiaofang Zhou School of Computing, NUS Office:...
-
Upload
jessie-woods -
Category
Documents
-
view
214 -
download
0
Transcript of CS5226 2002 Data Warehouse & Performance Tuning Xiaofang Zhou School of Computing, NUS Office:...
CS5226 2002
Data Warehouse & Performance Tuning
Xiaofang ZhouSchool of Computing, NUS
Office: S16-08-20Email: [email protected]: www.itee.uq.edu.au/~zxf
2
Outline
Part 1: A review of data warehousing
Part 2: Data warehouse tuning
3
OLAP vs OLTP
OLAP: On-line analytical processing Very large amount of historical data Read only Long and complicated sessions For a few trained users
OLTP: On-line transaction processing Small amount of current data Frequent updates, deletions and insertions A very large number of fixed, short-duration
transactions For many ad hoc users
4
DW Characteristics A DW provides access to date for complex
analysis, knowledge discovery and decision support
Subject oriented (vs application oriented) The data is organized around subjects (such as
Sales), rather than operations applications (such as order processing).
Non-volatile Not usually subject to change
Integrated Data is consistent
Time variant Historical data is recorded
(Inmon 1992)
5
Data Warehouse Types Enterprise-wide data warehouses
Large projects with massive investment of time and resources
Virtual data warehouses Provide views of operational DBs that are
materialised for efficiency Data marts
Targeted to a subset of the organisation Also called department-level data warehouse Low-risk, low-cost, but hard to evolve
… a data warehouse = a collection of data marts?
6
Process of Data Warehousing
Databases
Cleaning Reformatting
OLAP
DSS
Data Mining
Back flushing
Other data inputs
Data Warehouse
Data Metadata
Updates/New Data
7
Data Cube
Sales data with three dimensions: region, product and fiscal period
Product
Region
Fiscal period
Sales
8
OLAP Data Models ROLAP: relational OLAP
Use of relational technology, suitably adapted and extended
The data is stored using tables But the analysis operations are carried out
efficiently using special data structures MOLAP: multidimensional OLAP
A more radical approach Storing data directly in a multi-dimensional form
HOLAP: hybrid OLAP Data defined as in ROLAP but stored as in
MLOAP
9
Multidimensional model is usually not 3NF But it is as simple as a spreadsheet
two-dimensional matrix
And its query performance is much better then in the relational model
Multidimensional or Relational
Region
Fiscal period
Sales
10
Tables Types in MD Model A multidimensional model consists a fact table
and a number of dimension tables Fact table - contains measured variables, and
identifies them with pointers to dimension tables Dimension table - tuples of attributes of dimension
ProdID
RegID
OrderTime
Quantity
ItemCost
SalesProduct(Prod No, Name, Price,…)
Region(Area, Description)
Time(Date, PeriodNum, QuaterNum, Year)
dimension tablesfact table
11
Multidimensional Schema Star schema
One fact table One table for each dimension Typically not normalised
Snowflake schema A variation on the star schema Dimensional tables are organised into a hierarchy Can be in 3NF
12
Star Schema Example
Sales
Product
Region
Time
13
Snowflake Schema Example
Sales
Product
Region
Time
Class
…if there are queries about classes of products, after
normalisation, it becomes a snowflake schema…
14
Complex Snowflake Schema
Fact table
Fiscal qtrs FQ dates
Sales revenue
ProductProd. name
P. line
Dimension tablesDimension tables
15
Typical Functionality of DW Pivoting
Rotate data cube to show a different orientation of axes
Roll-up Move up concept hierarchy, grouping into larger
units along a dimension Drill-down
Disaggregate to a finer-grained view Slice and dice
Perform project operations on the dimensions Other operations, such as sorting, selection
16
Data Warehousing Technology Aggregate targeting
Aggregates flow up from a wide selection of data, and then Targeted decisions flow down
Workload type Broad: aggregate queries over ranges of values
E.g., find the total sales by region and quarter. Deep: queries that require precise individualized information
E.g., which frequent flyers have been delayed several times in the last month?
Dynamic (vs. Static): queries that require up-to-date information E.g.. which nodes have the highest traffic now?
Main approaches Multidimensional arrays Special indexes – bitmaps and multidimensional indexes Materialised views Optimised foreign key joins Approximation by sampling
17
Tuning Knobs
Indexes Materialised views Approximation
18
Bitmap Indexes Suitable for deep and broad, static,
many query attributes One bitmap per attribute value
Bitmap size = the number of tuples Bitmap value: a[i]=1tuple i has the value
Compact, and easy to intersect several bitmaps (for multi-attribute queries) B-tree for selective attributes, bitmaps for
unselective attributes… multiple B-tree indexes = multidimensional indexes?
19
Bitmaps - Data
Settings:lineitem ( L_ORDERKEY, L_PARTKEY , L_SUPPKEY, L_LINENUMBER,
L_QUANTITY, L_EXTENDEDPRICE , L_DISCOUNT, L_TAX , L_RETURNFLAG, L_LINESTATUS , L_SHIPDATE, L_COMMITDATE, L_RECEIPTDATE, L_SHIPINSTRUCT , L_SHIPMODE , L_COMMENT );
create bitmap index b_lin_2 on lineitem(l_returnflag);
create bitmap index b_lin_3 on lineitem(l_linestatus);
create bitmap index b_lin_4 on lineitem(l_linenumber); 100000 rows ; cold buffer Dual Pentium II (450MHz, 512Kb), 512 Mb RAM, 3x18Gb
drives (10000RPM), Windows 2000.
20
Bitmaps - Queries
Queries: 1 attribute
select count(*) from lineitem where l_returnflag = 'N';
2 attributesselect count(*) from lineitem where l_returnflag = 'N' and l_linenumber > 3;
3 attributesselect count(*) from lineitem where l_returnflag =
'N' and l_linenumber > 3 and l_linestatus = 'F';
21
Bitmaps Order of magnitude
improvement compared to scan.
Bitmaps are best suited for multiple conditions on several attributes, each having a low selectivity.
ANR
l_returnflagOF
l_linestatus
0
2
4
6
8
10
12
1 2 3
Th
rou
gh
pu
t (Q
uer
ies/
sec)
linear scan
bitmap
22
Multidimensional Indexes Suitable for deep or broad, static, few
query attributes with range queries Examples: quadtrees and R-trees
The space is conceptually organised in grids Variable sizes and hierarchical
Quadtree: space-centric, regular, recursive decomposition
R-tree: data-centric, hierarchical bounding rectangles
The problem of “dimensional curse”
23
Multidimensional Indexes - Data
Settings:create table spatial_facts( a1 int, a2 int, a3 int, a4 int, a5 int, a6 int, a7 int, a8 int, a9 int, a10 int, geom_a3_a7 mdsys.sdo_geometry );create index r_spatialfacts on spatial_facts(geom_a3_a7) indextype is mdsys.spatial_index;create bitmap index b2_spatialfacts on
spatial_facts(a3,a7); 500000 rows ; cold buffer Dual Pentium II (450MHz, 512Kb), 512 Mb RAM,
3x18Gb drives (10000RPM), Windows 2000.
24
Multidimensional Indexes - QueriesQueries:
Point Queriesselect count(*) from fact where a3 = 694014 and a7 = 928878;
select count(*) from spatial_facts where SDO_RELATE(geom_a3_a7, MDSYS.SDO_GEOMETRY(2001, NULL, MDSYS.SDO_POINT_TYPE(694014,928878, NULL), NULL, NULL), 'mask=equal querytype=WINDOW') = 'TRUE';
Range Queriesselect count(*) from spatial_facts where
SDO_RELATE(geom_a3_a7, mdsys.sdo_geometry(2003,NULL,NULL, mdsys.sdo_elem_info_array(1,1003,3),mdsys.sdo_ordinate_array(10,800000,1000000,1000000)), 'mask=inside querytype=WINDOW') = 'TRUE';
select count(*) from spatial_facts where a3 > 10 and a3 < 1000000 and a7 > 800000 and a7 < 1000000;
25
Multidimensional Indexes Oracle 8i on
Windows 2000 Spatial Extension:
2-dimensional data Spatial functions used
in the query R-tree does not
perform well because of the overhead of spatial extension.
0
2
4
6
8
10
12
14
point query range query
Res
po
nse
Tim
e (s
ec) bitmap - two
attributes
r-tree
26
Materialised Views
Suitable for broad, static applications Also dynamic but at a cost
Frequent queries are materialised View maintenance cost
Incremental maintenance Self-maintainable data warehouses
27
Materialised Views - Data
Settings:orders( ordernum, itemnum, quantity, purchaser, vendor );
create clustered index i_order on orders(itemnum);
store( vendor, name );
item(itemnum, price);
create clustered index i_item on item(itemnum);
1000000 orders, 10000 stores, 400000 items; Cold buffer Oracle 9i Pentium III (1 GHz, 256 Kb), 1Gb RAM, Adapter 39160
with 2 channels, 3x18Gb drives (10000RPM), Linux Debian 2.4.
28
Materialised Views - Data
Settings:create materialized view vendorOutstandingbuild immediaterefresh completeenable query rewriteasselect orders.vendor, sum(orders.quantity*item.price)from orders,itemwhere orders.itemnum = item.itemnumgroup by orders.vendor;
29
Materialised Views -Transactions
Concurrent Transactions: Insertions
insert into orders values (1000350,7825,562,'xxxxxx6944','vendor4');
Queriesselect orders.vendor,
sum(orders.quantity*item.price)from orders,itemwhere orders.itemnum = item.itemnumgroup by orders.vendor;
select * from vendorOutstanding;
30
Materialised Views Graph:
Oracle9i on Linux Total sale by vendor is
materialized Trade-off between query
speed-up and view maintenance:
The impact of incremental maintenance on performance is significant.
Rebuild maintenance achieves a good throughput.
A static data warehouse offers a good trade-off.
0
400
800
1200
1600
MaterializedView (fast on
commit)
MaterializedView (complete
on demand)
NoMaterialized
ViewThro
ughp
ut (s
tate
men
ts/s
ec)
0
2
4
6
8
10
12
14
Materialized View No Materialized View
Res
pons
e Ti
me
(sec
)
31
Materialised View Maintenance Problem when large
number of views to maintain.
The order in which views are maintained is important:
A view can be computed from an existing view instead of being recomputed from the base relations (total per region can be computed from total per nation).
Let the views and base tables be nodes v_i
Let there be an edge from v_1 to v_2 if it possible to compute the view v_2 from v_1. Associate the cost of computing v_2 from v_1 to this edge.
Compute all pairs shortest path where the start nodes are the set of base tables.
The result is an acyclic graph A. Take a topological sort of A and let that be the order of view construction.
32
Result Approximation
Suitable for broad, dynamic, large data
Examples When computing the average, one can
increase sample sizes until no significant change of the value
Increased accuracy with time Approximated results within given error
bounds
33
Approximations - Data
Settings: TPC-H schema Approximations
insert into approxlineitem
select top 6000 *from lineitemwhere l_linenumber = 4;
insert into approxordersselect O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERDATE, O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT
from orders, approxlineitemwhere o_orderkey = l_orderkey;
34
Approximations - Queriesinsert into approxsupplierselect distinct S_SUPPKEY,
S_NAME ,S_ADDRESS,S_NATIONKEY,S_PHONE,S_ACCTBAL,S_COMMENT
from approxlineitem, supplierwhere s_suppkey = l_suppkey;
insert into approxpartselect distinct P_PARTKEY,
P_NAME , P_MFGR , P_BRAND , P_TYPE , P_SIZE ,P_CONTAINER , P_RETAILPRICE , P_COMMENT
from approxlineitem, partwhere p_partkey = l_partkey;
insert into approxpartsuppselect distinct PS_PARTKEY,
PS_SUPPKEY,PS_AVAILQTY,PS_SUPPLYCOST,PS_COMMENT
from partsupp, approxpart, approxsupplierwhere ps_partkey = p_partkey and
ps_suppkey = s_suppkey;
insert into approxcustomerselect distinct C_CUSTKEY,
C_NAME ,C_ADDRESS,C_NATIONKEY,C_PHONE ,C_ACCTBAL,C_MKTSEGMENT,C_COMMENT
from customer, approxorderswhere o_custkey = c_custkey;insert into approxregion select * from
region;insert into approxnation select * from
nation;
35
Approximations - More queries
Queries: Single table query on lineitemselect l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order
from lineitem
where datediff(day, l_shipdate, '1998-12-01') <= '120'
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
36
Approximations - Still MoreQueries:
6-way joinselect n_name, avg(l_extendedprice * (1 - l_discount)) as
revenuefrom customer, orders, lineitem, supplier, nation, regionwhere c_custkey = o_custkey
and l_orderkey = o_orderkeyand l_suppkey = s_suppkeyand c_nationkey = s_nationkeyand s_nationkey = n_nationkeyand n_regionkey = r_regionkeyand r_name = 'AFRICA'and o_orderdate >= '1993-01-01'and datediff(year, o_orderdate,'1993-01-01') < 1
group by n_nameorder by revenue desc;
37
Approximation Accuracy
Good approximation for query Q1 on lineitem
The aggregated values obtained on a query with a 6-way join are significantly different from the actual values -- for some applications may still be good enough.
TPC-H Query Q5: 6 way join
-40
-20
0
20
40
60
1 2 3 4 5
Groups returned in the query
Ap
pro
xim
ated
ag
gre
gat
ed v
alu
es
(% o
f B
ase
Ag
gre
gat
e V
alu
e)
1% sample
10% sample
TPC-H Query Q1: lineitem
-40
-20
0
20
40
60
1 2 3 4 5 6 7 8
Aggregate values
Ap
pro
xim
ated
A
gg
reg
ate
Val
ues
(%
of
Bas
e A
gg
reg
ate
Val
ue)
1% sample
10% sample
38
Approximation Speedup Aqua approximation
on the TPC-H schema 1% and 10% lineitem
sample propagated. The query speed-up
obtained with approximated relations is significant.
00.5
11.5
22.5
33.5
4
Q1 Q5
TPC-H Queries
Res
po
nse
Tim
e (s
ec)
base relations
1 % sample
10% sample
39
Summary
In this module, we have covered: Basic concepts of data warehousing Data warehousing technologies How to optimise data warehouse
performance