Design and Maintenance of Data Warehouses
ABIS 2002 – Timos Sellis 1
Timos Sellis
National Technical University of Athens
KDBS Laboratory
http://www.dbnet.ece.ntua.gr/
Many thanks to P. Vassiliadis and A. Tsois
EDBT Summer School - Cargese 2002 2
Outline
What’s and Why’s for DW’s
DW architecture
DW Schema
Back End of the DW
Front End of the DW
DW Servers
Metadata Repository
Conclusions
OLTP
On-line transaction processing (OLTP) is the traditional way of using a database
  Legacy systems: relational, hierarchical, network databases / COBOL applications / …
  Short transactions (read/update few records) with ACID properties
  Normally, only the last version of data is stored in the database
DSS & OLAP
Decision support systems - help the executive, manager, analyst make faster and better decisions.
  What were the sales volumes by region and product category for the last year?
  Will a 10% discount increase sales volumes sufficiently?
On-line analytical processing (OLAP) is an element of decision support systems (DSS)
OLTP vs. OLAP
            OLTP                      OLAP
User        Clerk                     Manager
Function    Day-to-day operations     Decision support
Access      Read/write                Mostly read
Data        detailed, up-to-date,     summarised, historical,
            flat relational           multidimensional
DB size     100 MB - 1 GB             100 GB - 1 TB

Source: Chaudhuri & Dayal, VLDB’96
Data Warehouse
A decision support database that is maintained separately from the organization’s operational database.
• S. Chaudhuri, U. Dayal, VLDB’96 tutorial
A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.
• W.H. Inmon, Building the Data Warehouse, 1992
Reasons for Building Data Warehouses
Semantic Reconciliation
  Data sources are dispersed within the same organization
  Different encodings of the same entities
  The DW encompasses the full volume of these data under a single, reconciled schema
  It keeps the history of these data, too
Reasons for Building Data Warehouses
Performance
  OLAP applications need a different organization of data
  Complex OLAP queries would degrade OLTP performance
Availability
  Separation increases availability
  Possibly the only way to query the dispersed data sources
Reasons for Building Data Warehouses
Data Quality
  The validity of source data is not guaranteed (data can be missing, inconsistent, out of date, violating business and database rules …)
  Errors in data reach a minimum of 10% in most data stores
  This can lead to a waste of resources of 25-40%
  The DW acts as a data cleaning buffer
… and the market is there!
The Market
Estimated sales in millions of dollars [ShTy98] (* estimates are from [Pend00]).

                                                1998    1999    2000    2001    2002   CAGR (%)
RDBMS sales for DW                             900.0  1110.0  1390.0  1750.0  2200.0     25.0
Data Marts                                      92.4   125.0   172.0   243.0   355.0     40.0
ETL tools                                      101.0   125.0   150.0   180.0   210.0     20.1
Data Quality                                    48.0    55.0    64.5    76.0    90.0     17.0
Metadata Management                             35.0    40.0    46.0    53.0    60.0     14.4
OLAP (including implementation services)*       2000    2500    3000    3600    4000     18.9
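The CAGR column can be checked against the first and last year of each row. The sketch below assumes the usual compound-annual-growth-rate definition over the four yearly steps from 1998 to 2002; the slide itself does not state the formula, so this is a plausibility check rather than the authors' calculation.

```python
def cagr_percent(first: float, last: float, periods: int) -> float:
    """Compound annual growth rate in percent over `periods` yearly steps."""
    return ((last / first) ** (1.0 / periods) - 1.0) * 100.0


# (1998 value, 2002 value) in millions of dollars, copied from the table.
rows = {
    "RDBMS sales for DW": (900.0, 2200.0),
    "Data Marts": (92.4, 355.0),
    "ETL tools": (101.0, 210.0),
}

for name, (v1998, v2002) in rows.items():
    print(f"{name}: {cagr_percent(v1998, v2002, 4):.1f}%")
```

Under that assumption the three rows reproduce the CAGR values listed in the table.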
Data Warehouse Architecture
A Simple View
[Figure: simple view of the architecture. Labels: Source (×3), Integration, Warehouse, Metadata, Query & Analysis, Client (×2).]
Data Warehouse Architecture
[Figure: data warehouse architecture. Components: Sources, DSA, DW, Data Marts, Metadata Repository, Reporting / OLAP tools. Roles: Administrator (×2), Designer, End User. "Quality Issues" is marked at four points of the architecture.]
Two / Three Tier Architecture
Warehouse database server
  almost always relational (RDBMS)
Data Marts / OLAP server
  Relational OLAP (ROLAP)
  Multidimensional OLAP (MOLAP)
Clients
  Query and reporting tools
  Analysis tools / Data mining tools
Data Warehouse Architecture
Enterprise warehouse: collects all information about subjects
  requires extensive business modeling
  may take years to design and build
Data Marts: departmental subsets that focus on selected subjects
Virtual warehouse: views over operational DBs
How to build the DW
Top-down
  Single integrated enterprise model
  Reduce all sources (and clients, if necessary) to the central model
  − Time consuming; labor intensive; slow to produce results
  − Increases the risk of the DW project due to late delivery of results
  + Provides a consistent, global view of the enterprise data
How to build the DW
Bottom-up
  Build smaller data marts first
  Progressively combine pairwise
  − Fails to provide a global view of the enterprise data
  − Possibly increases the risk, since a complete integration might prove impossible late in the project
  + Early delivery of results
  + Less time consuming, less labor intensive
Data Warehouse Back-End
[Figure: the data warehouse architecture diagram (Sources, DSA, DW, Data Marts, Metadata Repository, Reporting / OLAP tools), repeated here for the back end.]
Design: Global-As-View Integration
  Preintegration. What schemata to integrate and in which order
  Schema Comparison. To determine the correlations among concepts of different schemata and to detect possible naming, semantic, structural, … conflicts
  Schema Conforming. Conflict resolution for heterogeneous schemata
  Schema Merging and Restructuring. Production of a single conformed schema
Design: Local-As-View Integration
  Works the other way around.
  Main deliverable is a central conceptual model, produced by interactively examining user needs and existing schemata
  All source and client schemata are expressed in terms of the central data warehouse schema and not the other way around.
DW = Materialized Views?
[Figure: Simple View of a DW. On the source side, S1_PARTSUPP and S2_PARTSUPP are unioned (U) into DW.PARTSUPP. On the DW side, Aggregate1 (PKEY, DAY, MIN(COST)) and Aggregate2 (PKEY, MONTH, AVG(COST)) derive the views V1 and V2, with a TIME table joined on DW.PARTSUPP.DATE, DAY.]
DW ≠ Materialized Views!
[Figure: the same scenario with the full back-end workflow, spanning Sources, DSA and DW. Labels: FTP1 and FTP2 (transfer of S1_PARTSUPP and S2_PARTSUPP); DS.PS_NEW1, DS.PS_OLD1 and DIFF1 (on DS.PS_NEW1.PKEY, DS.PS_OLD1.PKEY); DS.PS_NEW2, DS.PS_OLD2 and DIFF2 (on DS.PS_NEW2.PKEY, DS.PS_OLD2.PKEY); DS.PS1, DS.PS2; Add_SPK1 (SUPPKEY=1), Add_SPK2 (SUPPKEY=2); SK1 (DS.PS1.PKEY, LOOKUP_PS.SKEY, SUPPKEY), SK2 (DS.PS2.PKEY, LOOKUP_PS.SKEY, SUPPKEY); $2€ (COST); A2EDate (DATE); AddDate (DATE=SYSDATE); NotNULL; CheckQTY (QTY>0); rejected rows are written to a Log at each check; U (union); DW.PARTSUPP; Aggregate1 (PKEY, DAY, MIN(COST)); Aggregate2 (PKEY, MONTH, AVG(COST)); V1, V2; TIME (DW.PARTSUPP.DATE, DAY).]
Operational Processes
Data extraction, transformation & load
  Originally treated as the ‘refreshment’ problem
  Requires transforming, cleaning and integrating data from different sources
Build/refresh derived data and views
Service queries
Monitor the warehouse
The Refreshment Problem
Propagate updates on source data to the warehouse
Issues:
  when to refresh
    on every update
    periodically
    refresh policy set by administrator
  how to refresh
Refreshment Techniques
Full extract from base tables
Incremental techniques
  detect changes on base tables
    snapshots
    transaction shipping
    active rules
  logical correctness
  transactional correctness
Currently, in practice we use ETL tools/scripts (see next)…
Data Extraction
  Can take a snapshot or differentials (new/deleted/updated) of source data
  Transfer, encryption, compression are also involved
  Time window and source system overhead involved
  In general, faced with the requirement of minimal changes to the existing configuration of sources
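A differential can be derived by comparing the previous snapshot of a source table with the current one. The following is a minimal sketch of that idea; the table contents and the `snapshot_diff` helper are hypothetical and only illustrate the new/deleted/updated split named above, not any particular ETL tool.

```python
def snapshot_diff(old, new):
    """Compare two snapshots keyed by primary key.

    Returns (inserted, deleted, updated) as dicts keyed by primary key.
    """
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return inserted, deleted, updated


# Hypothetical PARTSUPP snapshots: PKEY -> (QTY, COST)
ps_old = {10: (5, 20.0), 11: (3, 7.5), 12: (9, 1.0)}
ps_new = {10: (5, 20.0), 11: (4, 7.5), 13: (2, 3.0)}

inserted, deleted, updated = snapshot_diff(ps_old, ps_new)
print("new:", inserted)
print("deleted:", deleted)
print("updated:", updated)
```

Only the differential has to be shipped to the staging area, which is one way to keep the source-system overhead and the time window small.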
Data Transformation
  Schema Reconciliation: conflicts at the schema level (different attributes for the same information)
  Value Identification & Reconciliation: different (same) id’s for same (different) objects (use surrogate keys)
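Surrogate keys decouple warehouse identity from the ids used in each source. The sketch below assumes a simple in-memory lookup; the class name, the source labels and the `same_as` parameter are hypothetical. Deciding that two source records denote the same object is the hard part of reconciliation and is simply supplied as input here.

```python
class SurrogateKeyLookup:
    """Maps (source, source-local id) pairs to warehouse-wide surrogate keys."""

    def __init__(self):
        self._by_source_key = {}
        self._next_key = 1

    def assign(self, source, local_id, same_as=None):
        """Return the surrogate key for (source, local_id).

        `same_as` names an already registered (source, local_id) pair that
        denotes the same real-world object.
        """
        pair = (source, local_id)
        if pair in self._by_source_key:
            return self._by_source_key[pair]
        if same_as is not None:
            key = self._by_source_key[same_as]
        else:
            key = self._next_key
            self._next_key += 1
        self._by_source_key[pair] = key
        return key


lookup = SurrogateKeyLookup()
# Same id in two sources, different objects: two surrogate keys.
k1 = lookup.assign("S1", 7)
k2 = lookup.assign("S2", 7)
# Different ids, same object: one shared surrogate key.
k3 = lookup.assign("S2", 42, same_as=("S1", 7))
print(k1, k2, k3)
```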
Data Cleaning
  Offending Data: duplicates, integrity/business rules/format violations …
  Incompleteness: missing data
  Renicing: esp. addresses
Data Loading
This final stage may still require additional preprocessing:
  sorting, summarizing, performing computations
Issues:
  huge volumes of data to be loaded
  small time window
  when to build indexes and summary tables
  restart after failure with no loss of data integrity
Loading Techniques
Cannot use the SQL language interface to update or append data.
  record at a time
  too slow since it uses random disk I/O
  can make the rollback segment or log file burst
Use a batch load utility
  sort input records on a clustering key
  sequential I/O 100 times faster than random I/O
  build index at the same time
  use parallelism to accelerate load operations
Incremental Loading
Use incremental loads during refresh to reduce data volume (e.g. Redbrick)
  insert only updated tuples
  incremental load conflicts with queries
  break into a sequence of shorter transactions
  coordinate this sequence of transactions: must ensure consistency between base and derived tables and indices.
Data Warehouse Front-End
[Figure: the data warehouse architecture diagram (Sources, DSA, DW, Data Marts, Metadata Repository, Reporting / OLAP tools), repeated here for the front end.]
Front End Tools
Ad hoc query and reporting
  Example: MS Excel, ProReports
OLAP: ‘Multidimensional spreadsheet’
  pivot tables, drill down, roll up, slice, dice
Data Mining
Basic ideas for OLAP
Several numeric measures that are analyzed
  sales, budget, revenue, inventory
Dimensions
  contexts in which a measure appears
  Example: store, product, date information associated with a sale.
  each context is a dimension and the measure is a point in a multi-dimensional world
Basic ideas for OLAP
Nature of Analysis
  aggregation (total sales, percent-to-total)
  comparison (budget vs. expense)
  ranking (top 10)
  access to detailed and aggregate data
  complex criteria specification
  visualization
Basic ideas for OLAP
Attributes
  information associated with a dimension
  example: owner of store, county in which the store is located
Attribute Hierarchies
  Attributes of a dimension are often related in a hierarchical way
  example: street → city → country
Multidimensional Data
Dimensions: Product, Region, Date
Hierarchical summarization paths:
  Product: Industry → Category → Product
  Region: Country → Region → City → Office
  Date: Year → Quarter → Month → Day, with Week → Day as an alternative path
[Figure: cube of Sales volume with axes Product, Region and Month.]
Operations
Roll up: summarize data
Drill down: go from higher level summary to lower level summary or detailed data
Slice and dice: select and project
Pivot: re-orient cube
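The operations can be sketched on a tiny cube held as a dictionary from coordinates to a measure. This is an illustrative sketch only: the figures are a subset of the sales volumes used in the roll-up example of these slides, with the decimal commas read as decimal points, and the function names are assumptions rather than the API of any OLAP product.

```python
from collections import defaultdict

# Cells of a small cube: (product, quarter, store) -> sales volume.
cube = {
    ("Electronics", "Q1", "Store1"): 5.2, ("Electronics", "Q2", "Store1"): 8.9,
    ("Toys", "Q1", "Store1"): 1.9, ("Toys", "Q2", "Store1"): 0.75,
    ("Electronics", "Q1", "Store2"): 5.6, ("Electronics", "Q2", "Store2"): 7.2,
    ("Toys", "Q1", "Store2"): 1.4, ("Toys", "Q2", "Store2"): 0.4,
}
DIMS = ("product", "quarter", "store")


def roll_up(cells, drop):
    """Sum away one dimension (e.g. quarter -> whole year)."""
    i = DIMS.index(drop)
    out = defaultdict(float)
    for key, value in cells.items():
        out[key[:i] + key[i + 1:]] += value
    return {k: round(v, 2) for k, v in out.items()}


def slice_(cells, dim, member):
    """Keep only the cells whose coordinate on `dim` equals `member`."""
    i = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[i] == member}


def pivot(cells, order):
    """Re-orient the cube by permuting the coordinate order."""
    idx = [DIMS.index(d) for d in order]
    return {tuple(k[j] for j in idx): v for k, v in cells.items()}


yearly = roll_up(cube, "quarter")
store1 = slice_(cube, "store", "Store1")
by_store_first = pivot(cube, ("store", "product", "quarter"))
print(yearly)
```

Drill down is the inverse navigation: going from `yearly` back to `cube` requires that the detailed cells are still stored, since they cannot be recomputed from the sums.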
Roll up
Sales volume by product, quarter and store:

                      Store1    Store2
Q1   Electronics      $5,2      $5,6
     Toys             $1,9      $1,4
     Clothing         $2,3      $2,6
     Cosmetics        $1,1      $1,1
Q2   Electronics      $8,9      $7,2
     Toys             $0,75     $0,4
     Clothing         $4,6      $4,6
     Cosmetics        $1,5      $0,5

Rolled up from quarters to the year:

                        Store1    Store2
1996   Electronics      $14,1     $12,8
       Toys             $2,65     $1,8
       Clothing         $6,9      $7,2
       Cosmetics        $2,6      $1,6
Drill down
Starting cube: the sales volume by product, quarter and store shown under Roll up.

Drilled down from the Electronics category to its products:

                      Store1    Store2
Q1   VCR              $1,4      $1,4
     Camcorder        $0,6      $0,6
     TV               $2,0      $2,4
     CD player        $1,2      $1,2
Q2   VCR              $2,4      $2,4
     Camcorder        $3,3      $1,3
     TV               $2,2      $2,5
     CD player        $1,0      $1,0
Pivot
Starting cube: the sales volume by product, quarter and store shown under Roll up.

Pivoted so that stores become rows and quarters become columns:

                          Q1        Q2
Store 1   Electronics     $5,2      $8,9
          Toys            $1,9      $0,75
          Clothing        $2,3      $4,6
          Cosmetics       $1,1      $1,5
Store 2   Electronics     $5,6      $7,2
          Toys            $1,4      $0,4
          Clothing        $2,6      $4,6
          Cosmetics       $1,1      $0,5
Slice and Dice
Starting cube: the sales volume by product, quarter and store shown under Roll up.

Sliced to Store1 and diced to the products Electronics and Toys:

                      Store1
Q1   Electronics      $5,2
     Toys             $1,9
Q2   Electronics      $8,9
     Toys             $0,75
Data Warehouse Server
[Figure: the data warehouse architecture diagram (Sources, DSA, DW, Data Marts, Metadata Repository, Reporting / OLAP tools), repeated here for the server.]
Data Warehouse Servers - Outline
Server Technology: ROLAP & MOLAP
Indexing Techniques
Query Processing and Optimization
Database Servers
Relational and Specialized Relational DBMS
Relational OLAP (ROLAP) DBMS
Multidimensional OLAP (MOLAP) DBMS
Relational DBMS
Features that support DSS
  Specialized Indexing techniques
  Specialized Join and Scan Methods
  Data Partitioning and use of Parallelism
  Complex Query Processing
  Intelligent Processing of Aggregates
  Extensions to SQL and their processing
ROLAP Servers
Exploits services of a relational engine effectively
Key functionality
  needs aggregation navigation logic
  ability to generate multi-statement SQL
  optimize for each individual database backend
Additional services
  cost-based query governor
  design tool for DSS schema
  performance analysis tool
Database Schemata for DW & ROLAP
Star Schema
Snowflake Schema
Fact Constellation
Aggregated data
Star Schema
A star schema consists of one central fact table and several denormalized dimension tables.
The measures of interest for OLAP are stored in the fact table (e.g. Dollar Amount, Units in the table SALES).
For each dimension of the multidimensional model there exists a dimension table (e.g. Geography, Product, Time, Account) with all the levels of aggregation and the extra properties of these levels.
Star Schema
SALES (fact table): Geography Code, Time Code, Account Code, Product Code, Dollar Amount, Units
Geography: Geography Code, Region Code, Region Manager, State Code, City Code, .....
Product: Product Code, Product Name, Brand Code, Brand Name, Prod. Line Code, Prod. Line Name
Time: Time Code, Quarter Code, Quarter Name, Month Code, Month Name, Date
Account: Account Code, KeyAccount Code, KeyAccountName, Account Name, Account Type, Account Market
Stanford Technology Group, Inc., 1996
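A star schema and a typical query over it can be tried out with an in-memory SQLite database. The sketch below keeps only two of the dimensions and a few of the columns listed above; the column spellings and all sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE geography (geography_code INTEGER PRIMARY KEY, region_code TEXT, city_code TEXT);
CREATE TABLE product   (product_code INTEGER PRIMARY KEY, product_name TEXT, brand_name TEXT);
-- Fact table: one foreign key per dimension plus the measures.
CREATE TABLE sales (
    geography_code INTEGER REFERENCES geography(geography_code),
    product_code   INTEGER REFERENCES product(product_code),
    dollar_amount  REAL,
    units          INTEGER
);
INSERT INTO geography VALUES (1, 'North', 'A'), (2, 'North', 'B'), (3, 'South', 'C');
INSERT INTO product VALUES (10, 'TV', 'BrandX'), (11, 'VCR', 'BrandX');
INSERT INTO sales VALUES (1, 10, 500.0, 1), (2, 10, 450.0, 1), (2, 11, 200.0, 2), (3, 11, 100.0, 1);
""")

# Join the fact table with its dimensions and aggregate at the
# (region, product name) level.
rows = conn.execute("""
    SELECT g.region_code, p.product_name, SUM(s.dollar_amount), SUM(s.units)
    FROM sales s
    JOIN geography g ON s.geography_code = g.geography_code
    JOIN product p   ON s.product_code = p.product_code
    GROUP BY g.region_code, p.product_name
    ORDER BY g.region_code, p.product_name
""").fetchall()
for row in rows:
    print(row)
```

Because the dimension tables are denormalized, each dimension costs exactly one join, regardless of the hierarchy level used for grouping.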
Snowflake Schema
The normalized version of the star schema
Explicit treatment of dimension hierarchies (each level has its own table)
Easier to maintain, slower in query answering
Snowflake Schema
SALES (fact table): Postal Code, Time Code, Account Code, Product Code, Dollar Amount, Units
Time: Time Code, Quarter Code, Month Code
  Quarter: Quarter Code, QuarterName
  Month: Month Code, Month Name
Account: Account Code, KeyAccountCode
  Account attributes: Account Code, AccountName
  KeyAccount: KeyAcc Code, KeyAcc Name
Geography: Postal Code, Region Code, State Code, City Code
  Region: Region Code, Region Mgr
  State: State Code, State Name
  City: City Code, City Name
Product: Product Code, Prod Line Code, Brand Code
  Product: Product Code, ProductName
  Brand: Brand Code, Brand Name
  ProdLine: ProdLineCode, ProdLineName
Stanford Technology Group, Inc., 1996
Fact Constellation
Multiple fact tables that share many dimension tables
Example: projected expense and the actual expense may share dimensional tables
Aggregated Tables
In addition to base fact and dimension tables, the data warehouse keeps aggregated (summary) data for efficiency.
Two approaches
  store as separate summary fact and dimension tables
  add to the existing base tables
Aggregated Tables
• Separate sum-table

Sales table
RID  City    Amount
1    Athens  $100
2    N.Y.    $300
3    Rome    $120
4    Athens  $250
5    Rome    $180
6    Rome    $65
7    N.Y.    $450

City-dimension sum table
City    Amount
Athens  $350
N.Y.    $750
Rome    $365

• Extend existing base tables

Extended Sales table
RID  City    Amount  Level
1    Athens  $100    NULL
2    N.Y.    $300    NULL
3    Rome    $120    NULL
4    Athens  $250    NULL
5    Rome    $180    NULL
6    Rome    $65     NULL
7    N.Y.    $450    NULL
8    Athens  $350    City
9    N.Y.    $750    City
10   Rome    $365    City
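Both approaches can be reproduced from the Sales table above. A minimal sketch, assuming plain Python lists stand in for the tables:

```python
from collections import defaultdict

# Base Sales table from the slide: (RID, City, Amount).
sales = [
    (1, "Athens", 100), (2, "N.Y.", 300), (3, "Rome", 120), (4, "Athens", 250),
    (5, "Rome", 180), (6, "Rome", 65), (7, "N.Y.", 450),
]

# Approach 1: a separate summary table keyed by city.
city_sum = defaultdict(int)
for _rid, city, amount in sales:
    city_sum[city] += amount

# Approach 2: extend the base table with a Level column; base rows carry
# None (NULL) and summary rows carry the level they aggregate to.
extended = [(rid, city, amount, None) for rid, city, amount in sales]
next_rid = len(sales) + 1
for city in sorted(city_sum):
    extended.append((next_rid, city, city_sum[city], "City"))
    next_rid += 1

# Queries on the extended table must filter on Level, otherwise base and
# summary rows are counted twice.
grand_total = sum(amount for _, _, amount, level in extended if level is None)
print(dict(city_sum), grand_total)
```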
MOLAP Servers
The storage model is an n-dimensional array
Very fast in computations and OLAP operations
Normally they require pre-computation of the available cubes
Compression of data to save storage space
Currently: 98% of the market for client tools

SISYPHUS: A Chunk-Based Storage Manager for OLAP Cubes
PhD work of Nikos Karayannidis
National Technical University of Athens (NTUA)
ERATOSTHENES project
ERATOSTHENES is a specialized database management system for OLAP cubes, which is under development.
In the context of ERATOSTHENES, a prototype storage manager for OLAP cubes, called SISYPHUS, has been developed.
[Figure: ERATOSTHENES components: Storage Engine (SISYPHUS), Processing Engine, Presentation Engine.]
Why does OLAP pose new requirements for storage management?
  Small response time: good physical clustering + efficient access paths
  Multidimensionality: md-storage structures, address by location
  Hierarchies: access paths, clustering
  Sparseness: not random but according to hierarchies.
Architecture: levels of abstraction in SISYPHUS
[Figure: layered architecture.
  SSM: record-oriented storage management, with Logging/Recovery (record-oriented access)
  File Manager: bucket-oriented file management (bucket-oriented access)
  Buffer Manager: buffer management
  Access Manager: chunk-oriented file management (chunk-oriented access)
  Cube Access Methods / OLAP Processing (cell-oriented access)]
Dimension data encoding
[Figure: the LOCATION dimension with levels Country, Region, City. Each member has an order-code within its level: CountryA = 0; RegionA = 0, RegionB = 1; CityA = 0, CityB = 1, CityC = 2, CityD = 3. A member-code concatenates the order-codes along the hierarchy path, e.g. 0.1.2.]
A chunk-oriented file system: the hierarchically chunked cube
Use the bucket file system.
Chunking Method: partition the data space by forming a hierarchy of chunks that is based on the dimension hierarchies.
[Figure: dimension hierarchies and member ranges per level. LOCATION: continent [0..2], country [0..4], region [0..10], city [0..18]. PRODUCT: category [0..1], type [0..2], pseudo level [0..2], item [0..5].]
[Figures: the chunking of the LOCATION × PRODUCT space at successive depths.
  D = 0: a single root chunk (0,0) covering LOCATION [0..18] × PRODUCT [0..5].
  D = 1: LOCATION split into [0..5], [6..10], [11..18]; PRODUCT split into [0..3], [4..5].
  D = 2: LOCATION split into [0..2], [3..5], [6..10], [11..14], [15..18]; PRODUCT split into [0..1], [2..3], [4..5].
  D = 3 (Max Depth): LOCATION split into [0], [1..2], [3], [4..5], [6..7], [8..9], [10], [11], [12..14], [15..16], [17..18]; PRODUCT split into [0..1], [2..3], [4..5].]
Chunk Identifiers (chunk-ids)
  Chunk addressing.
  Unique identifier of a chunk within the cube; it also depicts the hierarchy path of the chunk.
  Interleave the member-codes of the pivot-level members that define a chunk (at any depth).
  e.g. D = 2, LOCATION: 2.3, PRODUCT: 1.2 gives the chunk-id 2|1.3|2
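The interleaving can be written down in a few lines. This is a sketch under one assumption about notation: within one depth the order-codes of the dimensions are joined with "|", and successive depths are joined with ".", so that LOCATION 2.3 and PRODUCT 1.2 give 2|1.3|2. The function names are illustrative, not part of SISYPHUS.

```python
def chunk_id(*member_codes):
    """Interleave per-dimension member-codes into one chunk-id.

    Each member-code is a dot-separated path such as "2.3" (one order-code
    per hierarchy level down to the chunk's depth).
    """
    paths = [code.split(".") for code in member_codes]
    depth = len(paths[0])
    if any(len(p) != depth for p in paths):
        raise ValueError("all member-codes must have the same depth")
    # zip(*paths) regroups the order-codes depth by depth.
    return ".".join("|".join(level) for level in zip(*paths))


def split_chunk_id(cid):
    """Inverse of chunk_id: recover the per-dimension member-codes."""
    levels = [part.split("|") for part in cid.split(".")]
    return [".".join(dim) for dim in zip(*levels)]


cid = chunk_id("2.3", "1.2")      # LOCATION, PRODUCT
print(cid, split_chunk_id(cid))
```

Because the id encodes the whole path, the ids of all descendants of a chunk share that chunk's id as a prefix.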
Accessing the chunks of a cube
  Need some chunk directory.
  Idea: use intermediate depth chunks as directory chunks that will guide us to the data chunks (Dmax + 1)
  Create a chunk-tree.
[Figure: the chunk-tree over LOCATION and PRODUCT. The root chunk points to the directory chunk 0|0 at D = 1, which points to the directory chunks 0|0.0|0, 0|0.1|0, 0|0.0|1 and 0|0.1|1 at D = 2. These point to the D = 3 (Max Depth) directory chunks 0|0.0|0.0|P, 0|0.0|0.1|P, 0|0.1|0.2|P, 0|0.1|0.3|P, 0|0.0|1.0|P, 0|0.0|1.1|P, 0|0.1|1.2|P and 0|0.1|1.3|P (P marks the pseudo level), which in turn point to the data chunks at the grain level.]
Bucket Organization
  3 parts: bucket header, directory chunk vector, data chunk vector.
  Main idea: try to store in the same bucket whole families (i.e. sub-trees of chunks)!
    A) A single sub-tree
    B) Many sub-trees that form a bucket region
    C) A single tree of directory chunks (root bucket)
    D) A single data chunk
Chunk organization
Implementation data structure: multidimensional arrays:
  Offer data addressing by location, native to cubes.
  Enable chunk-id exploitation.
  We don’t have to store the chunk-ids.
  Are FAST!
Compression schemes:
  Data chunks: allocate only non-empty cells, maintain bitmap.
  Directory chunks: full cell allocation but no allocation for empty sub-trees.
Summary
  Storage management in OLAP
  SISYPHUS storage manager for OLAP
  Chunk-oriented file system:
    Natively multidimensional and supports hierarchies.
    Clusters data hierarchically.
    It is space conservative.
    Adopts a location-based rather than content-based data addressing scheme.
  Also: the data-access interface can be used for defining access paths and OLAP operations.
Future Work
  Experimental tests.
  Design/Implementation of algorithms for typical OLAP operations.
  Other research issues:
    Finding optimal bucket regions
    Updating interface for common OLAP updating operations.
    Efficient file organization for dimension data
Data Warehouse Servers - Outline
Server Technology: ROLAP & MOLAP
Indexing Techniques
Query Processing and Optimization
Why specialized indexing
Join-intensive queries
  Almost all queries demand joins of the fact table with some dimensions
Very large tables
  traditional indexes become too large to be efficient
Complex queries
  selections based on complex criteria
Read-intensive workload
BitMap Indexes
An alternative representation of a RID-list
Advantageous for low-cardinality domains
Represent each row of a table by a bit and the table as a bit vector
There is a distinct bit vector Bv for each value v of the domain.
The j-th bit in the vector Bv is set if the j-th row of the table has the value v for the column
BitMap Indexes
Example: The attribute sex has values M and F.
A table of 100 million people needs 2 lists of 100 million bits
Comparison, join and aggregation operations are reduced to bit arithmetic with dramatic improvement in processing time
Significant reduction in space and I/O (30:1)
BitMap Indexes
Base Table
Cust  Region  Rating
C1    N       H
C2    S       M
C3    W       L
C4    W       H
C5    S       L
C6    W       L
C7    N       H

Region Index
RID  N  S  E  W
1    1  0  0  0
2    0  1  0  0
3    0  0  0  1
4    0  0  0  1
5    0  1  0  0
6    0  0  0  1
7    1  0  0  0

Rating Index
RID  H  M  L
1    1  0  0
2    0  1  0
3    0  0  1
4    1  0  0
5    0  0  1
6    0  0  1
7    1  0  0
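The example can be replayed with Python integers as bit vectors. This is an illustrative sketch, not how any of the products mentioned in these slides stores its bitmaps; row j of the table corresponds to bit j-1 of each vector.

```python
def build_bitmap_index(rows, column):
    """Return {value: bit vector}; bit j-1 is set if row j has that value."""
    index = {}
    for j, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << j)
    return index


def rids(bits):
    """Decode a bit vector into the list of 1-based row ids."""
    return [j + 1 for j in range(bits.bit_length()) if (bits >> j) & 1]


# Base table from the slide: (Cust, Region, Rating).
base = [
    ("C1", "N", "H"), ("C2", "S", "M"), ("C3", "W", "L"), ("C4", "W", "H"),
    ("C5", "S", "L"), ("C6", "W", "L"), ("C7", "N", "H"),
]
region_idx = build_bitmap_index(base, 1)
rating_idx = build_bitmap_index(base, 2)

# Region = 'W' AND Rating = 'L' becomes a single bitwise AND.
west_low = region_idx["W"] & rating_idx["L"]
# Region = 'N' OR Region = 'S' becomes a bitwise OR.
north_or_south = region_idx["N"] | region_idx["S"]
print(rids(west_low), rids(north_or_south))
```

Only values that actually occur get a vector here, so the all-zero E column of the Region Index has no entry in `region_idx`.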
BitMap Indexes
Works poorly for high cardinality domains since the number of vectors increases
However, often good performance via compression since sparsity also increases
Products that support bitmaps: Model 204, TargetIndex (Redbrick), IQ (Sybase), Oracle
Join Indexes
Traditional indexes map the value in a column to a list of rows with that value
Join indexes maintain relationships between the primary key and the foreign keys
Thus, join indexes relate the values of the dimensions of a star schema to rows in the fact table.
Join indexes may span multiple dimensions
Join Indexes
Join index for a single dimension:
  Consider a schema with a Sales fact table and two dimensions, city and product
  If there is a join index on city, then for each distinct city the index maintains a list of RIDs of the tuples recording a sale in that city
  Example: The node Athens in the index points to the list of RIDs in the fact table corresponding to transactions (sales) in Athens.
Join indexes can span multiple dimensions
  the node (Athens, oranges) points to transactions that took place in Athens and correspond to purchases of oranges
Join Indexes
Sales table
RID  City    Amount
1    Athens  $100
2    N.Y.    $300
3    Rome    $120
4    Athens  $250
5    Rome    $180
6    Rome    $65
7    N.Y.    $450

City table
City    Country  Population
Athens  Greece   3.507.000
Rome    Italy    3.033.000
N.Y.    USA      17.953.000

Index on City-Sales
City    RIDs
Athens  1, 4
Rome    3, 5, 6
N.Y.    2, 7
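The City-Sales index can be built in one pass over the fact table and then used to answer a dimension restriction without scanning it. A minimal sketch with in-memory tables; the helper names are illustrative.

```python
from collections import defaultdict

# Sales fact table from the slide: (RID, City, Amount).
sales = [
    (1, "Athens", 100), (2, "N.Y.", 300), (3, "Rome", 120), (4, "Athens", 250),
    (5, "Rome", 180), (6, "Rome", 65), (7, "N.Y.", 450),
]
# City dimension table: City -> (Country, Population).
city = {
    "Athens": ("Greece", 3507000),
    "Rome": ("Italy", 3033000),
    "N.Y.": ("USA", 17953000),
}


def build_join_index(fact_rows, fk_position):
    """Map each dimension key to the RIDs of the fact rows that reference it."""
    index = defaultdict(list)
    for row in fact_rows:
        index[row[fk_position]].append(row[0])
    return dict(index)


city_sales = build_join_index(sales, 1)
by_rid = {row[0]: row for row in sales}


def total_for_country(country):
    """Restrict on the dimension, then fetch only the matching fact rows."""
    cities = [name for name, (c, _pop) in city.items() if c == country]
    return sum(by_rid[rid][2] for name in cities for rid in city_sales.get(name, []))


print(city_sales, total_for_country("Italy"))
```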
Data Warehouse Servers - Outline
Server Technology: ROLAP & MOLAP
Indexing Techniques
Query Processing and Optimization
Specialized Join Methods
Traditional systems limit themselves to binary joins
  results in many intermediate tables
For a query over many dimensions, the optimization time can be substantial
Specialized Join Methods
StarJoin Algorithm (Redbrick)
  use join indexes to identify regions of the cartesian product that are of interest
Intelligent Scan (Redbrick)
  take advantage of the “read-only” environment
Parallel Join Methods
Complex Query Processing
Extensible optimization frameworks (e.g. Starburst [IBM Almaden])
Estimation of Statistics (histograms, sampling)
Some of the ideas useful for DSS:
  interleaving GroupBy and Join
  Merging Views
  Propagating selection through views
  Optimizing nested subqueries
Example of Optimizing Nested Subqueries
Find all employees younger than 35 who earn more than the average of their department
Alternatives:
  Iterate over each employee: (1) find the department of the employee, (2) compute the average salary in the department, (3) check if the employee’s salary is above the average
  Compute the average salary of each department. For each employee, check if his/her salary is above the corresponding average salary
  Find the set of all departments where at least one of the employees is younger than 35. Compute the average salary of only those departments. Repeat the previous step.
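The three alternatives return the same answer but do very different amounts of work. The sketch below spells them out over a small in-memory table; the employee data and function names are hypothetical, and a real optimizer would of course rewrite the SQL rather than Python loops.

```python
from collections import defaultdict

# Hypothetical employees: (name, age, department, salary).
employees = [
    ("Ann", 30, "R&D", 60), ("Bob", 50, "R&D", 40), ("Cy", 33, "R&D", 45),
    ("Dee", 28, "Sales", 30), ("Eve", 45, "Sales", 50),
    ("Fay", 60, "HR", 70), ("Gus", 55, "HR", 20),
]


def per_employee(emps):
    """First alternative: recompute the department average per young employee."""
    result = []
    for name, age, dept, salary in emps:
        if age < 35:
            peers = [s for _, _, d, s in emps if d == dept]
            if salary > sum(peers) / len(peers):
                result.append(name)
    return result


def dept_averages(emps, depts=None):
    """Average salary per department, optionally for a subset of departments."""
    totals = defaultdict(lambda: [0, 0])
    for _, _, dept, salary in emps:
        if depts is None or dept in depts:
            totals[dept][0] += salary
            totals[dept][1] += 1
    return {d: total / count for d, (total, count) in totals.items()}


def precomputed(emps):
    """Second alternative: one average per department, one lookup per employee."""
    avg = dept_averages(emps)
    return [n for n, a, d, s in emps if a < 35 and s > avg[d]]


def restricted(emps):
    """Third alternative: averages only for departments with a young employee."""
    young_depts = {d for _, a, d, _ in emps if a < 35}
    avg = dept_averages(emps, young_depts)
    return [n for n, a, d, s in emps if a < 35 and s > avg[d]]


print(per_employee(employees), precomputed(employees), restricted(employees))
```

In this data the HR department has no employee under 35, so the third alternative never computes its average.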
Rollup and Cube operators
[Gray et al. 1996]
Rollup operator for nested aggregations
  rollup product, store, city
    group by product, store, city
    group by store, city
    group by city
Cube operator for all possible combinations
  group by product, store, city cube
    group by each subset of {product, store, city}, independently of the order of columns in the statement
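The sets of groupings produced by the two operators can be enumerated directly. This sketch follows the listing on the slide for rollup (dropping the leading column at each step); note, as a hedge, that the ROLLUP of standard SQL drops trailing columns instead and also emits the grand total, so the column order matters when translating to a concrete system.

```python
from itertools import combinations


def cube_groupings(columns):
    """All 2**n subsets of the grouping columns, largest first."""
    n = len(columns)
    return [combo for size in range(n, -1, -1) for combo in combinations(columns, size)]


def rollup_groupings(columns):
    """Nested groupings as listed on the slide: drop the leading column each step."""
    return [tuple(columns[i:]) for i in range(len(columns))]


cols = ("product", "store", "city")
cube = cube_groupings(cols)
rollup = rollup_groupings(cols)
print(len(cube), rollup)
```

The empty grouping `()` in the cube stands for the grand total over all rows.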
The CUBE operator
Jim Gray, Adam Bosworth, Andrew Layman (Microsoft), Hamid Pirahesh (IBM)
[Figure: The Data Cube and The Sub-Space Aggregates. Four panels of increasing dimensionality: Aggregate (Sum); Group By with total (By Color: RED, WHITE, BLUE; Sum); Cross Tab (Chevy, Ford × RED, WHITE, BLUE; By Make, By Color, Sum); and the data cube over Make (CHEVY, FORD), Year (1990-1993) and Color (RED, WHITE, BLUE) with the sub-aggregates By Make, By Year, By Color, By Make & Year, By Make & Color, By Color & Year, and Sum.]
Processing Star Queries on Hierarchically-Clustered Fact Tables
Nikos Karayannidis (1), Aris Tsois (1), Timos Sellis (1), Roland Pieringer (2), Volker Markl (4), Frank Ramsak (3), Robert Fenk (3), Klaus Elhardt (2), Rudolf Bayer (5)
(1) I.C.C.S. - N.T.U. Athens, (2) TransAction Software GmbH, (3) FORWISS, (4) IBM Almaden Research Center, (5) T.U. München
Key Points
  Star queries are ubiquitous in DW and OLAP
  New trend: hierarchically clustered star schemata
  New processing framework
  New optimization challenges
  Implemented in TransBase HyperCube
  Tested with a real-world application (up to 40× speed-up)
EDITH
EDITH - the European Development on Indexing Techniques for Databases with Multidimensional Hierarchies
Information Society Technologies Programme (IST) - grant No. IST-1999-20722.
http://edith.in.tum.de
Motivation – Problem statement
Not just reports! What about ad hoc queries?
OLAP requires efficient processing of ad hoc star queries
Major bottleneck: processing of the star-join
  Cartesian product, bitmap indexes, …
  NOT enough: efficiency requires good physical clustering of data
Hierarchical Clustering
A new trend:
  hierarchical clustering of fact table data through path-based surrogate keys
  Exploitation of multidimensional indexes
  Star join transforms to a multidimensional range query
The overall processing framework of star queries changes radically
Contributions
  Present a novel processing framework for star queries over hierarchically clustered data
  Discuss optimizations
  Realization of our technology in a real system
  Evaluation on a real-world application has shown significant speed-ups.
Hierarchical Surrogate Keys
• Apply hierarchical encoding on each dimension table
• System-assigned h-surrogate key:
  • e.g., oc1(“Greece”)/oc2(“Athens”)/oc3(“Store5”)
• Implementation based on underlying physical data structure
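A minimal sketch of how a path-based h-surrogate could be built. It assumes that each hierarchy level assigns an ordinal code to a member among its siblings (the role of oc1, oc2, oc3 in the example above) and that the codes are concatenated into one integer using fixed bit widths per level. The bit widths, the helper names and the sample stores are hypothetical; the slides only say that the real implementation depends on the underlying physical data structure.

```python
# Hypothetical bit widths for the levels country / city / store.
LEVEL_BITS = (4, 6, 8)

def assign_ordinals(paths):
    """Give every member an ordinal code that is unique among its siblings."""
    codes = {}      # path prefix (tuple) -> ordinal code within its parent
    next_code = {}  # parent prefix -> next free ordinal
    for path in paths:
        for depth in range(1, len(path) + 1):
            prefix = tuple(path[:depth])
            if prefix not in codes:
                parent = prefix[:-1]
                codes[prefix] = next_code.get(parent, 0)
                next_code[parent] = codes[prefix] + 1
    return codes

def h_surrogate(path, codes):
    """Concatenate the ordinal codes along the path into one integer key."""
    key = 0
    for depth, bits in enumerate(LEVEL_BITS, start=1):
        code = codes[tuple(path[:depth])]
        assert code < (1 << bits), "ordinal does not fit the level's bit width"
        key = (key << bits) | code
    return key

def prefix_range(prefix, codes):
    """Return the closed interval of h-surrogates below a hierarchy prefix."""
    low = 0
    for depth, bits in enumerate(LEVEL_BITS, start=1):
        code = codes[tuple(prefix[:depth])] if depth <= len(prefix) else 0
        low = (low << bits) | code
    free_bits = sum(LEVEL_BITS[len(prefix):])
    return low, low | ((1 << free_bits) - 1)

stores = [
    ("Greece", "Athens", "Store5"),
    ("Greece", "Athens", "Store7"),
    ("Greece", "Patras", "Store1"),
    ("France", "Paris", "Store2"),
]
codes = assign_ordinals(stores)
keys = {path: h_surrogate(path, codes) for path in stores}
```

Under this encoding, all members below a common hierarchy prefix share a contiguous key interval, so a restriction such as country = Greece becomes a single range on the h-surrogate. That is the property that lets a star join be evaluated as a range query.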
Database Schema
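[Figure: star schema. Fact table FT(d1, d2, …, dN, m1, m2, hsk1, hsk2, …, hskN). Dimension tables D1(h1, h2, h3, f1, f2, hsk1), D2(h1, h2, h3, h4, hsk2), …, DN(h1, h2, f1, f2, f3, hskN). In each dimension table h1 is the key, the hi are hierarchy attributes, the fi are feature attributes, and hski is the h-surrogate shared with FT.]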
Star Queries
SELECT {Di.hj}, {Di.fj}, {aggr(…) AS AMj}
FROM FT, D1, …, DN
WHERE FT.d1 = D1.h1 AND …      -- star-join conditions
  LOCPRED({D1}) AND …          -- dimension restrictions
  MPRED({FT.mi})               -- measure restrictions
GROUP BY {Di.hj}, {Di.fj}, {FT.mj}
HAVING <having clause>
ORDER BY <ordering fields>
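To make the template concrete, here is one possible instance run against an in-memory SQLite database. The table names (sales, location, date_dim), columns and rows are invented for illustration and do not come from the slides. The query only shows where the star-join conditions, a dimension restriction and a measure restriction sit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE location (store TEXT PRIMARY KEY, city TEXT, country TEXT);
    CREATE TABLE date_dim (day TEXT PRIMARY KEY, month TEXT, year INTEGER);
    CREATE TABLE sales (store TEXT, day TEXT, amount REAL);
    INSERT INTO location VALUES ('Store5', 'Athens', 'Greece'),
                                ('Store7', 'Athens', 'Greece'),
                                ('Store2', 'Paris', 'France');
    INSERT INTO date_dim VALUES ('2002-01-15', '2002-01', 2002),
                                ('2002-02-03', '2002-02', 2002);
    INSERT INTO sales VALUES ('Store5', '2002-01-15', 100.0),
                             ('Store7', '2002-01-15', 50.0),
                             ('Store5', '2002-02-03', 70.0),
                             ('Store2', '2002-01-15', 999.0);
""")

star_query = """
    SELECT l.city, d.month, SUM(s.amount) AS total
    FROM sales s, location l, date_dim d
    WHERE s.store = l.store AND s.day = d.day   -- star-join conditions
      AND l.country = 'Greece'                   -- dimension restriction
      AND s.amount > 10                          -- measure restriction
    GROUP BY l.city, d.month
    ORDER BY d.month
"""
rows = conn.execute(star_query).fetchall()
```

Here `rows` holds one total per (city, month) pair for the Greek stores only. The French store is filtered out by the dimension restriction before aggregation.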
The Abstract Processing Plan
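[Figure: abstract processing plan in two phases. H-surrogate processing: Create_Range operators over the dimension tables (D1, …, Di) produce h-surrogate ranges. Main execution phase: MD Range Access on FT, followed by Residual Joins with dimension tables (Dj, …, Dn), then Group-Select and Order_By.]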
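The plan above can be sketched in a few lines of Python over toy data. This is an illustration under simplifying assumptions, not the TransBase implementation. The dictionaries stand in for dimension tables, the list for the fact table, and a linear filter stands in for the multidimensional index scan. All names and values are made up, and the range builder only merges consecutive keys, where a real system would derive ranges from hierarchy prefixes.

```python
from collections import defaultdict

# Toy dimension tables: h-surrogate -> attributes (all values are made up).
location = {0: ("Athens", "Greece"), 1: ("Athens", "Greece"),
            256: ("Patras", "Greece"), 16384: ("Paris", "France")}
date = {0: "2002-01", 1: "2002-01", 32: "2002-02"}

# Toy fact table: (hsk_location, hsk_date, amount).
fact = [(0, 0, 100.0), (1, 1, 50.0), (256, 32, 70.0), (16384, 0, 999.0)]

def create_range(dimension, predicate):
    """h-surrogate processing: turn a dimension restriction into key intervals."""
    keys = sorted(k for k, attrs in dimension.items() if predicate(attrs))
    ranges = []
    for k in keys:
        if ranges and k == ranges[-1][1] + 1:
            ranges[-1][1] = k          # extend a run of consecutive keys
        else:
            ranges.append([k, k])
    return [tuple(r) for r in ranges]

def md_range_access(rows, loc_ranges, date_ranges):
    """Stand-in for the multidimensional index scan over the fact table."""
    def inside(key, ranges):
        return any(lo <= key <= hi for lo, hi in ranges)
    return [r for r in rows if inside(r[0], loc_ranges) and inside(r[1], date_ranges)]

def run_plan():
    loc_ranges = create_range(location, lambda a: a[1] == "Greece")
    date_ranges = create_range(date, lambda month: True)
    totals = defaultdict(float)
    for hsk_loc, hsk_date, amount in md_range_access(fact, loc_ranges, date_ranges):
        city = location[hsk_loc][0]      # residual join with Location
        month = date[hsk_date]           # residual join with Date
        totals[(month, city)] += amount  # group-select
    return sorted(totals.items())        # order by

result = run_plan()
```

The point of the sketch is the division of labour. Dimension restrictions are resolved to key ranges before the fact table is touched, and the dimension tables are consulted again only to fetch the attributes needed for grouping and output.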
Optimization Issues
• Optimizing h-surrogate processing
  • Single tuple retrieval for hierarchical prefix path restrictions
  • Exploit composite index on (hm, hm-1, …, h1, hski)
• Pre-grouping transformation
  • Reduces tuples for residual join and speeds up grouping
  • Heuristic algorithm based on query syntax
Pre-grouping Transformation
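[Figure: two plans over fact table F and dimensions Date and Location. Original plan: MD Range Access on F, Residual Joins with Date and Location, then Group Select by month, store. Transformed plan: MD Range Access on F, Group Select by hsk1, hsk2, then Residual Joins with Date and Location, then Group Select by month, store.]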
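A small sketch of why pre-grouping helps, using invented data and hypothetical function names. Both plans compute the same totals by month and store, but the pre-grouped plan aggregates on the h-surrogates first, so fewer tuples reach the residual joins. The sketch assumes a distributive aggregate such as SUM, for which partial results can be combined safely.

```python
from collections import defaultdict

# Toy data (made up): h-surrogate -> grouping attribute of each dimension.
store_of = {0: "Store5", 1: "Store7"}
month_of = {10: "2002-01", 11: "2002-01", 20: "2002-02"}

# Fact tuples after the MD range access: (hsk_location, hsk_date, amount).
fact = [(0, 10, 5.0), (0, 10, 7.0), (0, 11, 1.0), (1, 20, 2.0), (1, 20, 3.0)]

def plain_plan(rows):
    """Join every fact tuple with the dimensions, then group."""
    totals, lookups = defaultdict(float), 0
    for hsk_loc, hsk_date, amount in rows:
        lookups += 1
        totals[(month_of[hsk_date], store_of[hsk_loc])] += amount
    return dict(totals), lookups

def pregrouped_plan(rows):
    """Group on the h-surrogates first, then join the smaller intermediate result."""
    partial = defaultdict(float)
    for hsk_loc, hsk_date, amount in rows:
        partial[(hsk_loc, hsk_date)] += amount
    totals, lookups = defaultdict(float), 0
    for (hsk_loc, hsk_date), amount in partial.items():
        lookups += 1
        totals[(month_of[hsk_date], store_of[hsk_loc])] += amount
    return dict(totals), lookups

plain_totals, plain_lookups = plain_plan(fact)
pre_totals, pre_lookups = pregrouped_plan(fact)
```

With this data the plain plan performs one dimension lookup per fact tuple, while the pre-grouped plan performs one per distinct (hsk1, hsk2) pair. The saving grows with the number of fact tuples that share the same h-surrogate combination.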
Performance Evaluation
• Greek electronic retailer data:
  • 3 dimensions (1.4M, 27K, 2.5K tuples)
  • Fact table: 15.5M tuples (1.5GB)
  • 220 ad hoc star queries from a real application
• Compare 3 plans: STAR, AEP and OPT
• FT selectivity range: 0.0% to 5.0% of FT
• Result:
  • AEP vs. STAR: average speed-up of 20
  • OPT vs. STAR: average speed-up of 40
Summary
• Efficient star query processing is a must in DW and OLAP
• New trend: hierarchically clustered star schemata
• Presented a novel processing framework for star queries over hierarchically clustered data
• Discussed optimization issues
• Fully implemented our technology in TransBase
• Evaluation with a real-world application has shown significant speed-ups
Future Work
• Extensive experimental evaluation
• Investigate applicability of our processing framework to other areas
• Further optimization issues: reducing the number of produced h-surrogate ranges
Metadata Repository
[Figure: data warehouse architecture with Sources, DSA, DW, Data Marts and Reporting / OLAP tools, together with the Metadata Repository. Roles shown: Administrator (twice), Designer, End User. The label "Quality Issues" is attached at four places in the architecture.]
The Lack of Conceptual Support
[Figure labels: Information Source, Wrapper/Loader, Data Warehouse, Aggregation/Customization, Multidim. Data Mart; Operational Department (OLTP), Enterprise, Analyst (OLAP); Observation; Source Quality, DW Quality, Mart Quality; links numbered (1) to (5).]
Conceptual-Logical-Physical
[Figure: three perspectives. Conceptual Perspective: Operational Department Model (OLTP), Enterprise Model, Client Model (OLAP). Logical Perspective: Source Schema, DW Schema, Client Schema. Physical Perspective: Source Data Store, DW Data Store, Client Data Store. Also shown: Wrapper, Aggregation/Customization, two Transportation Agents, Observation.]
The DWQ Approach
[Figure: a grid of three levels (Client Level, DW Level, Source Level) by three perspectives (Conceptual, Logical, Physical), defined at the Meta Model Level, the Models / Meta Data Level and the Real World, connected by "in" links. Alongside: Process Meta Model, Process Model and Processes (with a "uses" link), and Quality Metamodel, Quality Model and Quality Measurements.]
DWQ Repository
• The DWQ approach for managing data warehouse quality is organized around an extended, semantically rich metadata repository (prototypically implemented using ConceptBase), which controls all relevant metadata
• We have developed meta models for DW architecture, quality, processes and evolution
• Metadata can be provided and queried by external tools; via active rules, external tools could even be activated
[Jarke et al., CAiSE98]
DWQ Metadata Framework
[Figure: metadata framework organized by meta level (Meta Model), conceptual level, logical level and physical level. Conceptual level: Enterprise Model, Client Model_1 … Client Model_m, Source Model_1 … Source Model_n. Logical level: DW Schema, Client Schema_1 … Client Schema_m, Source Schema_1 … Source Schema_n. Physical level: DW Data Store, Client Data Store_1 … Data Store_n, Source Data Stores, Sources. Also shown: Mediators, Wrappers, Interface, Schema Store. Legend: conceptual/logical mapping, physical/logical mapping, conceptual link, logical link, data flow.]
Quality Model: An Adapted GQM Approach
[Figure: DW Designers, Decision Maker and DW Administrator establish a Quality Goal. Further elements: Quality Query, Quality Factor, Measurement Processes, DW Objects, Processes and Data, and Metadata for DW Architecture, Quality and Processes, connected by the relations "evaluated by", "evidence for" and "defined on".]
[Jarke et al., IS99]
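One way to read the goal / query / factor structure is as a small object model. The sketch below is an assumption-laden illustration, not the DWQ metamodel itself. The class names mirror the figure labels, but the fields, the acceptable-range check and the sample goal are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class QualityFactor:
    name: str
    expected_range: tuple   # acceptable (low, high) values, set by a stakeholder
    measured: float         # value produced by a measurement process

    def satisfied(self):
        low, high = self.expected_range
        return low <= self.measured <= high

@dataclass
class QualityQuery:
    question: str
    factors: list = field(default_factory=list)  # factors used as evidence

    def evaluate(self):
        return all(f.satisfied() for f in self.factors)

@dataclass
class QualityGoal:
    purpose: str
    stakeholder: str
    queries: list = field(default_factory=list)  # queries that evaluate the goal

    def achieved(self):
        return all(q.evaluate() for q in self.queries)

goal = QualityGoal(
    purpose="improve timeliness of the sales data mart",
    stakeholder="DW Administrator",
    queries=[QualityQuery(
        question="Is the stored data fresh enough?",
        factors=[QualityFactor("hours since last refresh", (0, 24), 30.0)],
    )],
)
```

In this reading a goal is met only if every quality query attached to it finds its factors inside the acceptable range, so `goal.achieved()` is false here until a refresh brings the measured value back under 24 hours.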
Quality Factors by Perspective
Conceptual Perspective
• Completeness
• Redundancy
• Consistency
• Correctness
• Traceability
of Concepts and Models
Logical Perspective
• Usefulness of schemas
• Correctness of mappings
• Interpretability of schemas
Physical Perspective
• Efficiency
• Interpretability of schemas
• Timeliness of stored data
• Maintainability / Usability of software components
Towards Quality-Oriented DW Design
[Figure: a cycle driven by a Quality Goal, with four phases: 1. Design, 2. Evaluation, 3. Analysis & Improvement, 4. Re-evaluation & evolution. Steps shown: Define Quality Factor Types; Define Object Types; Define Object Instances & Properties; Define Metrics & Agents; Decompose complex objects and iterate; Empirically derive "functions"; Analytically derive "functions"; Produce a scenario for a goal; Produce expected/acceptable values; Compute!; Acquire values for quality factors (current status); Feed values to quality scenario and play!; Discover/Refine new/old "functions"; Negotiate!; Take actions!]
[Vassiliadis et al., IS00]
DWQ Methodology: Summary
[Figure: six numbered steps around the Metadata Repository: 1. Conceptual Enterprise Model, 2. Conceptual Source Models, 3. Conceptual Client Modeling, 4. Translate aggregates into OLAP operations, 5. Design Optimization, 6. Data Reconciliation. Other elements: Enterprise Model, Materialized Views, sources S1 … Sn and clients C1 … Cm (each with relations R1, R2, R3), Conj. Queries, User queries, OLTP updates, Rewriting of Aggregate Queries, Refreshment.]
Key Formal Results on Quality Impacts
• conceptual: description logic theory and tools for complete reasoning about the relationships between source, enterprise, and client models
• conceptual/logical: containment, satisfiability, and rewriting of queries over views with and without aggregates
• logical/physical: incremental cost-based optimization of view materializations
• physical: detailed impact analysis of replication and refreshment policies
ConceptBase User Interface
DW Quality Example
Metadata Standards
• Metadata Coalition
  • MetaData Interchange Specification (MDIS)
  • Open Information Model (OIM)
• OMG (latest development)
  • Common Warehouse Metamodel (CWM)
• Microsoft Repository
Summary
• OLAP - Multidimensional data
• Drill down, Roll Up, Pivot, Slice and Dice
• Data warehouse architecture
• Warehouse operational process
  • Loading - Cleaning - Serving (ROLAP/MOLAP)
  • Refreshing
• Warehouse server requirements
• Star-Snowflake schemes
• Specialized indexes: BitMap - Join Indexes
Research issues
• Data cleaning
  • focus on schema inconsistencies
• Data warehouse design
  • summary tables, indexing
• Query processing
  • use summary data, statistics mgt, dynamic optimization
• Warehouse management
  • resource management, runaway queries
  • incremental refresh techniques
References
• W. H. Inmon: Building the Data Warehouse (2nd Edition), John Wiley, 1996.
• R. Kimball: The Data Warehouse Toolkit, John Wiley, 1996.
• H. Garcia-Molina: Data Warehousing Overview, class notes, Stanford University.
• S. Chaudhuri & U. Dayal: Data Warehousing and OLAP for Decision Support, VLDB’96 tutorial.
• Oracle, IBM, Redbrick, Sybase, Informix, Tandem, Teradata, HP, … web sites.
• The DWQ project: http://www.dbnet.ece.ntua.gr/~dwq/