Advanced Database Topics
Copyright © Ellis Cohen 2002-2005
Data Warehousing
These slides are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License.
For more information on how you may use them, please see http://www.openlineconsult.com/db

Topics
• Overview
• Star Schema: Fact & Dimension Tables
• The Star Schema & Denormalization
• Viewing The Data Cube
• Drill Down & Rollup
• Cross Tabulations
• Data Visualization
• Trend & Rank Analysis
• ETL: Extraction, Transformation & Loading
• Materialized Views & Query Rewriting
• Indexing for Data Warehouses

Operational vs Analytical DBs
Operational Database
• Data needed and updated constantly to directly support business operations
• Focus on OLTP (on-line transaction processing): transactional access & modification of a relatively small # of data points at a time
Analytical Database: Data Warehouse & Data Mart
• Copious amounts of relatively static data, culled & integrated across the enterprise, cleansed & summarized, maintained historically, used for decision support and business intelligence (BI)
• Focus on OLAP (on-line analytical processing): querying large amounts of data, scheduled modifications

Operational vs Analytical DBs
                 Operational                     Warehouse
  Usage          Transactional (OLTP)            Analytical (OLAP)
  Organized for  Modifications                   Queries
  Modifications  Continual                       Periodic
  Queries        Narrow-scope, low-complexity    Broad-scope, high-complexity
  Database       Relational                      Relational / Dimensional
  Data           Normalized                      Denormalized, Aggregated & Derived
Central Data Warehouse
(from Oracle 9i Data Warehousing Guide)
Warehouse Questions
How many red Bally shoes did we sell by region in the third quarter of each of the last 5 years?
What are the top 25 selling products by category and region for this past quarter?
What percent of the market do we own for each product we make?
Which of our customers' zip codes were responsible for the top 10% of total sales over the last year?

Star Schema: Fact & Dimension Tables

Star Schema
Data Warehouses are organized using Star Schema models
A Star Schema has a central fact table, with a composite primary key, which references multiple Dimension tables
DailySales (Fact): storid, prodid, date, price, units
• storid, prodid: foreign keys referencing the dimension tables
• price, units: Measures (what each fact measures)
Stores (Dimension): storid, …
Products (Dimension): prodid, …

Subjects (Facts) & Dimensions
Instead of thinking about entities & relationships, design a data warehouse by thinking about
Subjects (represented by fact tables)
• Sales, Distribution, Purchases
Dimensions (represented by dimension tables)
• How to uniquely identify the facts about each subject
  – Sales: Products, Stores, Dates (maybe also Employees, Customers: depends on what you want to analyze)
  – Distribution: Warehouses, Products, Stores, Dates (maybe Employees & Trucks)
  – Purchases: Products, Vendors, Dates (maybe also Employees)

Fact & Dimension Tables
Fact Tables
• Composite primary key
  – identifies the dimensions
  – uniquely identifies each fact (or measurement)
• Additional attributes: measures
  – what is measured about each fact
Dimension Tables
• Primary key
  – Surrogate key uniquely identifies each dimension value
• Additional attributes
  – Properties of each dimension value

Dimensions & Granularity
Dimensions have different levels of granularity (finest level first)
• Stores -> Districts -> Regions
• Products -> ProductTypes -> SubCategories -> Categories
• Products -> ProductTypes -> Manufacturers

Snowflake Schema (with Normalized Dimensions)
DailySales (Fact): storid, prodid, date, price, units
Stores (Dimension): storid, stornam, city, state, distid
Districts: distid, distnam, distarea, regid
Regions: regid, regnam
Products (Dimension): prodid, color, size, prodtyp
ProductTypes: prodtyp, prodnam, prodescr, subcatid, manfid
SubCategories: subcatid, subnam, subdescr, catid
Categories: catid, catnam, catdescr
Manufacturers: manfid, manfnam

Typical Warehouse Query
How many red Bally shoes did we sell in each region in 2002?
  SELECT r.regnam AS region, sum(f.units) AS sumunits
  FROM DailySales f
    NATURAL JOIN Stores
    NATURAL JOIN Districts
    NATURAL JOIN Regions r
    NATURAL JOIN Products p
    NATURAL JOIN ProductTypes
    NATURAL JOIN SubCategories s
    NATURAL JOIN Manufacturers m
  WHERE to_char(f.date,'YYYY') = '2002'
    AND p.color = 'red' AND m.manfnam = 'Bally' AND s.subnam = 'Shoe'
  GROUP BY r.regnam

Aggregate Functions
• AVG: Average
• COUNT: Count
• MIN: Minimum Value
• MAX: Maximum Value
• STDDEV: Standard Deviation (and STDDEV_POP & STDDEV_SAMP)
• SUM: Sum
• VARIANCE: Variance (and VAR_POP & VAR_SAMP)
The Star Schema & Denormalization

Snowflake Schema is Normalized
Snowflake Schema has normalized dimension tables
• Each dimension is represented by multiple sub-dimension tables at different levels of granularity (Product, ProductType, Category, etc.)
• Each sub-dimension table has attributes appropriate to the level of granularity
  – Product: color, size
  – ProductType: prodnam, prodescr
  – etc.

Denormalization
Normalized tables:
  Products (Dimension): prodid, color, size, prodtyp
  ProductTypes: prodtyp, prodnam, prodescr, subcatid, manfid
  SubCategories: subcatid, subnam, subdescr, catid
  Categories: catid, catnam, catdescr
  Manufacturers: manfid, manfnam
Denormalized into a single table:
  Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
Why is there redundancy here?

Star Schema is Denormalized
The Star Schema has denormalized dimension tables
• Each dimension is formed by joining together its sub-dimension tables into a single dimension table
• The dimension table has attributes at different levels of granularity
• The dimension tables contain lots of redundancy, but queries use far fewer joins
• Does not dramatically impact space: dimension tables usually < 1% size of fact table (but some descriptions may need to be stored separately)

Star Schema (Fully Denormalized Dimensions)
DailySales (Fact): storid, prodid, date, price, units
Stores (Dimension): storid, stornam, city, state, distid, distnam, distarea, regid, regnam
Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
• Maybe catdescr is not included here if it is a GIF or a 4000-byte description
• Why should date (in DailySales) be replaced by a dateid?

Schema Types
• Snowflake Schema: fact table with fully normalized dimension tables
• Star Schema: fact table with fully de-normalized dimension tables
• Starflake Schema: fact table with fully de-normalized dimension and (as needed) sub-dimension tables
• Constellation Schema: multiple fact tables with shared dimension tables

Query with Denormalized Schema
How many red Bally shoes did we sell in each region in 2002?
  SELECT s.regnam AS region, sum(f.units) AS sumunits
  FROM DailySales f
    NATURAL JOIN Stores s
    NATURAL JOIN Products p
  WHERE to_char(f.date,'YYYY') = '2002'
    AND p.color = 'red' AND p.manfnam = 'Bally' AND p.subnam = 'Shoe'
  GROUP BY s.regnam
Costly

Typical Date Dimension Attributes
It is common and almost always more efficient to treat Dates as a dimension with a number of attributes
  Field        Example Value
  Year         2005
  Month        Feb
  Quarter      1
  DayOfMonth   12
  DayOfYear    43
  WeekOfYear   7
  DayOfWeek    Sat
• Requires Month + Year to identify a month within a year. Might want to add a single MonthYr field to represent the pair
• Note: Quarter is less granular than Month. Also, DayOfYear, WeekOfYear & DayOfWeek can be derived from the other fields

Extended Date Dimension Hierarchy
• Date (e.g. Feb 12, 2005)
• DayOfWeek (e.g. Sat)
• WeekYr (e.g. 2005Wk7)
• MonthYr (e.g. Feb2005)
• QuarterYr (e.g. 2005Q1)
• Year (e.g. 2005)
• Quarter (e.g. 1)
• Month (e.g. Feb)
• WeekOfYear (e.g. 7)
• DayOfYear (e.g. 43)
• DayOfMonth (e.g. 12)

Star Schema with Date Dimension
In general, represent dates by a Dates dimension table
DailySales (Fact): storid, prodid, dateid, price, units
Stores (Dimension): storid, stornam, city, state, distid, distnam, distarea, regid, regnam
Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
Dates (Dimension): dateid, date, dayofweek, dayofmonth, dayofyear, weekyr, weekofyear, monthyr, month, quarteryr, quarter, year

Query using Dates Dimension
How many red Bally shoes did we sell in each region in 2002?
  SELECT s.regnam AS region, sum(f.units) AS sumunits
  FROM DailySales f
    NATURAL JOIN Stores s
    NATURAL JOIN Products p
    NATURAL JOIN Dates d
  WHERE d.year = 2002
    AND p.color = 'red' AND p.manfnam = 'Bally' AND p.subnam = 'Shoe'
  GROUP BY s.regnam
Needs an extra join, but the query is simpler and executes faster if Dates is indexed by year
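
As a rough, runnable illustration of the star-schema query, the sketch below builds a tiny version of the schema in SQLite through Python's sqlite3 module. SQLite is only a stand-in for the warehouse DBMS, the rows are made-up sample data, and explicit JOIN ... ON is used in place of NATURAL JOIN purely to make the join columns visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Stores   (storid INTEGER PRIMARY KEY, regnam TEXT);
CREATE TABLE Products (prodid INTEGER PRIMARY KEY, color TEXT, manfnam TEXT, subnam TEXT);
CREATE TABLE Dates    (dateid INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE DailySales (
    storid INTEGER, prodid INTEGER, dateid INTEGER, price REAL, units INTEGER,
    PRIMARY KEY (storid, prodid, dateid));

-- Hypothetical sample rows (not from the slides)
INSERT INTO Stores   VALUES (1, 'East'), (2, 'West');
INSERT INTO Products VALUES (10, 'red', 'Bally', 'Shoe'), (11, 'blue', 'Bally', 'Shoe');
INSERT INTO Dates    VALUES (100, 2002, 1), (101, 2002, 3), (102, 2003, 1);
INSERT INTO DailySales VALUES
    (1, 10, 100, 90.0, 3),
    (1, 10, 101, 90.0, 2),
    (2, 10, 100, 95.0, 4),
    (2, 11, 100, 95.0, 7),   -- blue shoe: filtered out
    (1, 10, 102, 90.0, 5);   -- sold in 2003: filtered out
""")

# Red Bally shoes sold per region in 2002, filtering on the Dates dimension
rows = conn.execute("""
    SELECT s.regnam AS region, SUM(f.units) AS sumunits
    FROM DailySales f
      JOIN Stores   s ON s.storid = f.storid
      JOIN Products p ON p.prodid = f.prodid
      JOIN Dates    d ON d.dateid = f.dateid
    WHERE d.year = 2002
      AND p.color = 'red' AND p.manfnam = 'Bally' AND p.subnam = 'Shoe'
    GROUP BY s.regnam
    ORDER BY s.regnam
""").fetchall()
print(rows)
```

The year test is a plain comparison on a dimension column, which is what lets an index on Dates(year) help.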

More Complex Query
How many red Bally shoes did we sell by region in the third quarter of each of the last 5 years?
  SELECT s.regnam AS region, d.quarteryr, sum(f.units) AS sumunits
  FROM DailySales f
    NATURAL JOIN Stores s
    NATURAL JOIN Products p
    NATURAL JOIN Dates d
  WHERE p.color = 'red' AND p.manfnam = 'Bally' AND p.subnam = 'Shoe'
  GROUP BY s.regnam, d.quarteryr, d.quarter, d.year
  HAVING d.quarter = 3 AND d.year BETWEEN 1998 AND 2002

The M:N Mapping Problem
DailySales (Fact): storid, prodid, dateid, price, units
Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
Hierarchy: Products -> ProductTypes -> SubCategories -> Categories; ProductTypes -> Manufacturers
Suppose a product type may have multiple associated subcategories. What do we do?

M:N Mappings
DailySales (Fact): storid, prodid, dateid, price, units
Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam
ProdCatMap: prodtyp, subcatid
SubCategories: subcatid, subnam, subdescr, catid, catnam, catdescr
• A product type can have more than one subcategory
• ProdCatMap.prodtyp can't be a foreign key constraint, since prodtyp is not unique in Products
• OK to keep the M:N bridge table

Non-1NF Denormalization
DailySales (Fact): storid, prodid, dateid, price, units
Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, { subcatid }
SubCategories: subcatid, subnam, subdescr, catid, catnam, catdescr
Represent a list of subcategories by
• A (non-standard) list datatype
• A delimited string, e.g. |314|209|812|
Another reasonable approach (esp. if the DB supports lists)

Full Denormalization
DailySales (Fact): storid, prodid, dateid, price, units
Product (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
  SELECT prodtyp, sum(units)
  FROM DailySales NATURAL JOIN Product
  GROUP BY prodtyp
is not correct because of duplication (a product whose type has several subcategories has several rows in Product). Must write
  WITH JustProduct AS (
    SELECT DISTINCT prodid, prodtyp FROM Product)
  SELECT prodtyp, sum(units)
  FROM DailySales NATURAL JOIN JustProduct
  GROUP BY prodtyp
UGLY! BAD! Don't do this!
Complicates joins
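
The double counting can be reproduced on a toy example. The sketch below assumes a fully denormalized Product table in which product 1 (type 'boot') maps to two subcategories and therefore appears twice. SQLite via Python's sqlite3 is used only as a stand-in DBMS, all values are hypothetical, and explicit joins replace NATURAL JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- prodid is no longer unique: one row per (product, subcategory) pair
CREATE TABLE Product (prodid INTEGER, prodtyp TEXT, subcatid INTEGER);
CREATE TABLE DailySales (storid INTEGER, prodid INTEGER, dateid INTEGER, units INTEGER);

INSERT INTO Product VALUES (1, 'boot', 314), (1, 'boot', 209), (2, 'sandal', 812);
INSERT INTO DailySales VALUES (1, 1, 100, 10), (1, 2, 100, 5);
""")

# Naive join: each sale of product 1 matches two Product rows and is counted twice
naive = conn.execute("""
    SELECT p.prodtyp, SUM(f.units)
    FROM DailySales f JOIN Product p ON p.prodid = f.prodid
    GROUP BY p.prodtyp ORDER BY p.prodtyp
""").fetchall()

# De-duplicate the dimension first, as on the slide
fixed = conn.execute("""
    WITH JustProduct AS (SELECT DISTINCT prodid, prodtyp FROM Product)
    SELECT jp.prodtyp, SUM(f.units)
    FROM DailySales f JOIN JustProduct jp ON jp.prodid = f.prodid
    GROUP BY jp.prodtyp ORDER BY jp.prodtyp
""").fetchall()

print(naive)  # boots are overstated
print(fixed)  # boots are correct
```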

Limits to Denormalization
DailySales (Fact): storid, prodid, dateid, price, units
Stores (Dimension): storid, …
Products (Dimension): prodid, …
Dates (Dimension): dateid, …
StorePromos: storid, startdateid, enddateid, promonam, discount
Can't denormalize StorePromos, unless you replace Stores & Dates with a single StoreDate dimension with storid & dateid as its primary key: way too big

Viewing The Data Cube

Data Cube Representation
[Figure: Sales Cube with three axes: Products dimension, Stores dimension (e.g. Pgh, NYC), Dates dimension. Highlighted: the cell for sales of Beanie Babies in the Pittsburgh store today, the cell for sales of Beanie Babies in the Pittsburgh store yesterday, and all sales (of all products over time) in the NYC store.]
Data Cube Characteristics
Each axis represents a dimension
– Elements along axis are at lowest granularity for that dimension
Measures are the data within the cells at intersections of the cube
– Information about the topic of the cube
– e.g. units & price for each sales fact (i.e. sales in a store of a product on a date)

Data Cube Views
Slice
• View data relative to a point in one or more dimensions
  – View sales today (for each store & each product category)
  – View Bally shoe sales at the NYC store (for each date)
Dice
• View data relative to (sets of) ranges in one or more dimensions
  – View sales for the last 4 days (for each store & each product category)
  – View sales for each type of shoes at all the NY and NJ stores for each of the last 10 quarters

MDDB: MultiDimensional DataBase
• Knows about Fact & Dimension Tables
• Uses direct (n-dimensional) hypercube representation to provide fast access to fact elements in query
• Supports sparse representations
  – The Pittsburgh store doesn't sell lingerie
  – The Cape Cod store is not open in the winter
  – Baked Beanie Babies are only sold in the NE region
• Uses specialized query language
  – e.g. MDX (used by Microsoft OLAP Server)
  – with basic data types: cube, slice, dice

Choosing a View
[Figure: view chooser listing the attributes of each dimension. Store: Country, State, City, StoreType. Products: MajorCategory, MajorSubCategory, MinorCategory, MinorSubCategory, Brand. Time attributes: MonthYr, QuarterYr, Year. Customers: Country, State, City, EducLevel. Callouts: Detailed Dicing Dimension; Slicing Dimensions, with selected values CA, 1997Q1, Drink.]

Slicing & Dicing
[Screenshot with callouts: Detailed Dicing Dimension, Slicing Attributes, Base Measures, Derived Measures]
Examples use dynasight, www.arcplan.com

Slice, Dice & Chart
[Screenshot with callouts: Different Dicing Dimension, Measures, Charted Measures, Slicing Attributes]
Drill Down & Roll Up

Slicing & Dicing
[Screenshot with callouts: Detailed Dicing Dimension, Slicing Attributes, Measures, Drill Down]

Drill Down
[Screenshot with callouts: Drill Down, Re-Slice]

Uniform Drill Down & Rollup
Uniform Drill Down
• Uniformly drill down to a certain level
Uniform Rollup
• Compute aggregate values at that level and all higher levels
• Can be computed with a single SELECT statement using the ROLLUP grouping function
Non-uniform rollups (previous slide) require UNIONs

Ordinary Group Aggregation
  SELECT c.country, c.state, c.city,
    sum(f.sale) AS StoreSales,
    sum(f.cost) AS StoreCost,
    sum(f.sale) - sum(f.cost) AS StoreNet,
    100 * sum(f.cost) / sum(f.sale) AS PctCost
  FROM Facts f
    NATURAL JOIN Customers c
    NATURAL JOIN Stores s
    NATURAL JOIN Dates d
    NATURAL JOIN Products p
  WHERE p.MajorCategory = 'Drink' AND d.QuarterYr = '1997Q1'
    AND s.country = 'USA' AND s.State = 'CA'
  GROUP BY c.country, c.state, c.city
Note: Constrain Store, not Customer
(StoreNet and PctCost repeat the sum() expressions because a column alias cannot be referenced within the same SELECT list.)

Aggregate Query Results
  Country  State  City       StoreSales  …
  USA      CA     Altadena   96          …
  USA      CA     Arcadia    64          …
  USA      CA     …          …           …
  USA      CA     Woodland   …           …
  USA      OR     Beaverton  12          …
  USA      OR     Corvallis  21          …
  USA      OR     …          …           …
  USA      OR     Woodburn   4           …
  …        …      …          …           …
  CANADA   BC     Victoria   …           …
The USA/CA rows are the per-city rollups in CA; the USA/OR rows are the per-city rollups in OR.
That's fine, but it does NOT give us
• Aggregate store sales for CA, OR, BC, etc.
• Aggregate store sales for USA, CANADA
• Aggregate store sales overall

Rollup Query Results
  Country  State  City       StoreSales  …
  NULL     NULL   NULL       6310        …   (Rollup ALL)
  USA      NULL   NULL       4310        …   (USA Rollup)
  USA      CA     NULL       3310        …   (CA Rollup)
  USA      CA     Altadena   96          …   (per-city rollups in CA)
  USA      CA     Arcadia    64          …
  USA      CA     …          …           …
  USA      CA     Woodland   …           …
  USA      OR     NULL       1000        …   (OR Rollup)
  USA      OR     Beaverton  12          …   (per-city rollups in OR)
  USA      OR     Corvallis  21          …
  USA      OR     …          …           …
  USA      OR     Woodburn   4           …
  …        …      …          …           …
  CANADA   NULL   NULL       …           …   (Canada Rollup)
  CANADA   BC     NULL       …           …   (BC Rollup)
  CANADA   BC     Victoria   …           …
  …

Rollup using Union
  SELECT c.country, c.state, c.city, …
  GROUP BY c.country, c.state, c.city
  UNION
  SELECT c.country, c.state, NULL AS city, …
  GROUP BY c.country, c.state
  UNION
  SELECT c.country, NULL AS state, NULL AS city, …
  GROUP BY c.country
  UNION
  SELECT NULL AS country, NULL AS state, NULL AS city, …
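
The union-based rollup can be tried out on a small table. The sketch below uses SQLite through Python's sqlite3 as a stand-in DBMS (SQLite has no GROUP BY ROLLUP, which makes the union form the natural choice there). The table name CitySales is hypothetical, only four city figures are borrowed from the results slide, and the Victoria value is invented, so the subtotals differ from the slide's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CitySales (country TEXT, state TEXT, city TEXT, sale INTEGER)")
conn.executemany("INSERT INTO CitySales VALUES (?, ?, ?, ?)", [
    ("USA", "CA", "Altadena", 96),
    ("USA", "CA", "Arcadia", 64),
    ("USA", "OR", "Beaverton", 12),
    ("USA", "OR", "Corvallis", 21),
    ("CANADA", "BC", "Victoria", 30),   # hypothetical value
])

# One SELECT per rollup level; NULL marks the rolled-up columns
rollup_sql = """
SELECT country, state, city, SUM(sale) FROM CitySales GROUP BY country, state, city
UNION ALL
SELECT country, state, NULL, SUM(sale) FROM CitySales GROUP BY country, state
UNION ALL
SELECT country, NULL, NULL, SUM(sale) FROM CitySales GROUP BY country
UNION ALL
SELECT NULL, NULL, NULL, SUM(sale) FROM CitySales
"""
# Key each result row by its (country, state, city) grouping
totals = {(co, st, ci): total for co, st, ci, total in conn.execute(rollup_sql)}

print(totals[("USA", "CA", None)])   # CA rollup
print(totals[("USA", None, None)])   # USA rollup
print(totals[(None, None, None)])    # rollup ALL
```

UNION ALL is used here because the four levels can never produce identical rows, so no duplicate elimination is needed.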

GROUP BY ROLLUP
  SELECT c.country, c.state, c.city,
    sum(f.sale) AS StoreSales,
    sum(f.cost) AS StoreCost,
    sum(f.sale) - sum(f.cost) AS StoreNet,
    100 * sum(f.cost) / sum(f.sale) AS PctCost
  FROM Facts f
    NATURAL JOIN Customers c
    NATURAL JOIN Stores s
    NATURAL JOIN Dates d
    NATURAL JOIN Products p
  WHERE p.MajorCategory = 'Drink' AND d.QuarterYr = '1997Q1'
    AND s.country = 'USA' AND s.State = 'CA'
  GROUP BY ROLLUP( c.country, c.state, c.city )
Note: Constrain Store, not Customer

Cross Dimension Rollups
  SELECT c.state, d.QuarterYr, sum(f.cost) AS StoreCost
  FROM Fact f
    NATURAL JOIN Customers c
    NATURAL JOIN Stores s
    NATURAL JOIN Dates d
    NATURAL JOIN Products p
  WHERE p.MajorCategory = 'Drink' AND d.year = 1997
    AND s.country = 'USA' AND s.State = 'CA'
  GROUP BY ROLLUP( c.state, d.QuarterYr )

Cross Dimension Rollup Results
  State  Quarter  StoreSales
  NULL   NULL     225,627     (Rollup ALL)
  CA     NULL     63,530      (CA Rollup)
  CA     1997Q1   14,431      (per-qtr rollups in CA)
  CA     1997Q2   15,332
  CA     1997Q3   15,673
  CA     1997Q4   18,094
  OR     NULL     56,773      (OR Rollup)
  OR     1997Q1   16,081      (per-qtr rollups in OR)
  OR     1997Q2   12,679
  OR     1997Q3   14,274
  OR     1997Q4   13,739
  WA     NULL     105,324     (WA Rollup)
  WA     1997Q1   25,240
  …
Cross Tabulations

Cross Tab View of Rollup
          1997      1997Q1   1997Q2   1997Q3   1997Q4
  (all)   225,627   ?
  CA       63,530   14,431   15,332   15,673   18,094
  OR       56,773   16,081   12,679   14,274   13,739
  WA      105,324   25,240   24,953   25,958   29,173
Cross Tab with Sums
Sums

GROUP BY CUBE
  SELECT c.state, d.QuarterYr, sum(f.cost) AS StoreCost
  FROM Fact f
    NATURAL JOIN Customers c
    NATURAL JOIN Stores s
    NATURAL JOIN Dates d
    NATURAL JOIN Products p
  WHERE p.MajorCategory = 'Drink' AND d.Year = 1997
    AND s.country = 'USA' AND s.State = 'CA'
  GROUP BY CUBE( c.state, d.QuarterYr )

Cube Results
  State  Quarter  StoreSales
  NULL   NULL     225,627     (Rollup ALL)
  NULL   1997Q1   55,752      (Qtr Rollups)
  NULL   1997Q2   52,964
  NULL   1997Q3   55,905
  NULL   1997Q4   61,006
  CA     NULL     63,530      (CA Rollup)
  CA     1997Q1   14,431      (per-qtr rollups in CA)
  CA     1997Q2   15,332
  CA     1997Q3   15,673
  CA     1997Q4   18,094
  OR     NULL     56,773      (OR Rollup)
  OR     1997Q1   16,081      (per-qtr rollups in OR)
  OR     1997Q2   12,679
  OR     1997Q3   14,274
  OR     1997Q4   13,739
  WA     NULL     105,324     (WA Rollup)
  WA     1997Q1   25,240
  …
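
The cube totals can be checked from the twelve state/quarter cells shown in the cross tab. The sketch below loads those cells into a hypothetical QtrCost table in SQLite (via Python's sqlite3, used only as a stand-in) and emulates CUBE(state, quarteryr) as the union of all four groupings, since SQLite has no GROUP BY CUBE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE QtrCost (state TEXT, quarteryr TEXT, cost INTEGER)")
conn.executemany("INSERT INTO QtrCost VALUES (?, ?, ?)", [
    ("CA", "1997Q1", 14431), ("CA", "1997Q2", 15332), ("CA", "1997Q3", 15673), ("CA", "1997Q4", 18094),
    ("OR", "1997Q1", 16081), ("OR", "1997Q2", 12679), ("OR", "1997Q3", 14274), ("OR", "1997Q4", 13739),
    ("WA", "1997Q1", 25240), ("WA", "1997Q2", 24953), ("WA", "1997Q3", 25958), ("WA", "1997Q4", 29173),
])

# CUBE over two columns = every subset of the grouping columns
cube_sql = """
SELECT state, quarteryr, SUM(cost) FROM QtrCost GROUP BY state, quarteryr
UNION ALL
SELECT state, NULL, SUM(cost) FROM QtrCost GROUP BY state
UNION ALL
SELECT NULL, quarteryr, SUM(cost) FROM QtrCost GROUP BY quarteryr
UNION ALL
SELECT NULL, NULL, SUM(cost) FROM QtrCost
"""
cube = {(state, qtr): total for state, qtr, total in conn.execute(cube_sql)}

print(cube[(None, "1997Q1")])  # per-quarter rollup, the row ROLLUP alone does not give
print(cube[("CA", None)])      # per-state rollup
print(cube[(None, None)])      # rollup ALL
```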
Detailed Cross Tab
Data Visualization
Charting Visualizations
All: 1 dimension, 1 measure
Volume Visualization
Clustered data: 3 dimensions, 1 measure shown using color
Colored Sphere Visualization
Sparse data: 3 dimensions, 2 measures: pt size & color
White: colored measure unknown
Vector Glyph Visualization
2 dimensions, 4 measures: <x,y,z> & color

Dimensional Stacking
4 (2+2) dimensions, 1 binary measure
• could use color for continuous measure
• could chart: 3 (2+1) dimensions, 1 measure

Visualization Issues
Dimensions
• How many dimension attributes
• How dimension attributes are represented
Measures
• How many simultaneous measures
• How measures are represented: Spatial, Color (hue/brightness/…), Texture, Audio, other sensory
Transformations
• Measures & Dimension attributes
• 1 variable: sqr, sqrt, log, exp, 1/x
• N variables: linear combinations
Drill Up, Drill Down, Pivot
Interactivity & Immersiveness
Trend & Rank Analysis

Trend Example
In addition to calculating the total # of units sold by month, we want to smooth that over the preceding 3 months
  Month  Year  Total  Smoothed Total
  Jan    1994  200    200
  Feb    1994  344    272
  Mar    1994  401    315
  Apr    1994  443    347
  May    1994  360    387
  Jun    1994  404    402
  Jul    1994  389    399
  Aug    1994  451    401
Window: each smoothed total averages the current month and the (up to) 3 preceding months

Trend Analysis
When you build a result set, you may want to define a field that depends on a group of related rows in the same result set (the window). This is particularly useful for analyzing trends.
  SELECT d.month, d.year, sum(f.units) AS totunits,
    {moving average of totunits over 3 months preceding} AS movavg
  FROM DailySales f NATURAL JOIN Dates d
  GROUP BY d.year, d.month
The {…} placeholder is the field computed over the window

Trends in Oracle SQL
  WITH MonthlyUnits AS (
    SELECT d.month, d.year, sum(f.units) AS totunits
    FROM DailySales f NATURAL JOIN Dates d
    GROUP BY d.year, d.month )
  SELECT month, year, totunits,
    avg(totunits) OVER ( ORDER BY year, month ROWS 3 PRECEDING ) AS movavg
  FROM MonthlyUnits
Window: ORDER BY year, month ROWS 3 PRECEDING
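
The smoothed totals in the trend example can be reproduced with a window function. The sketch below runs the moving average in SQLite through Python's sqlite3, as a stand-in for Oracle; it assumes SQLite 3.25 or later for window function support, a pre-aggregated MonthlyUnits table, and a numeric month column so that ORDER BY sorts chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MonthlyUnits (year INTEGER, month INTEGER, totunits INTEGER)")
# Monthly totals from the trend example (Jan..Aug 1994)
totals = [200, 344, 401, 443, 360, 404, 389, 451]
conn.executemany("INSERT INTO MonthlyUnits VALUES (1994, ?, ?)",
                 list(enumerate(totals, start=1)))

# Window = current row plus the 3 preceding rows in (year, month) order
rows = conn.execute("""
    SELECT month, totunits,
           AVG(totunits) OVER (ORDER BY year, month
                               ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS movavg
    FROM MonthlyUnits
    ORDER BY year, month
""").fetchall()

smoothed = [round(movavg) for _, _, movavg in rows]
print(smoothed)
```

The first three months average over fewer than four rows because the window is simply cut off at the start of the data.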

Trends in SQL 99
  WITH MonthlyUnits AS (
    SELECT d.month, d.year, sum(f.units) AS totunits
    FROM DailySales f NATURAL JOIN Dates d
    GROUP BY d.year, d.month )
  SELECT month, year, totunits,
    avg(totunits) OVER w AS movavg
  FROM MonthlyUnits
  WINDOW w AS ( ORDER BY year, month ROWS BETWEEN 3 PRECEDING AND CURRENT ROW )

Trends using Subqueries
  WITH MonthlyUnits AS (
    SELECT d.month, d.year, 12*d.year + d.month AS mknt, sum(f.units) AS totunits
    FROM DailySales f NATURAL JOIN Dates d
    GROUP BY d.year, d.month )
  SELECT m.month, m.year, m.totunits,
    (SELECT avg(mm.totunits) FROM MonthlyUnits mm
     WHERE mm.mknt BETWEEN m.mknt - 3 AND m.mknt) AS movavg
  FROM MonthlyUnits m
  ORDER BY m.year, m.month
(The mknt month counter assumes d.month is numeric.)

Rank Example
In addition to calculating the total # of units sold by month, we want to rank it with respect to all the months
  Month  Year  Total  Rank
  Jan    1994  200    8
  Feb    1994  344    7
  Mar    1994  401    4
  Apr    1994  443    2
  May    1994  360    6
  Jun    1994  404    3
  Jul    1994  389    5
  Aug    1994  451    1
Window: all the months

Ranking in Oracle SQL
  WITH MonthlyUnits AS (
    SELECT d.month, d.year, sum(f.units) AS totunits
    FROM DailySales f NATURAL JOIN Dates d
    GROUP BY d.year, d.month )
  SELECT month, year, totunits,
    rank() OVER ( ORDER BY totunits DESC
      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS ranktotal
  FROM MonthlyUnits
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: The ugly idiom for 'ALL ROWS'
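
The ranks in the rank example can be reproduced the same way. This sketch again uses SQLite via Python's sqlite3 as a stand-in (SQLite 3.25 or later assumed), with a pre-aggregated MonthlyUnits table and numeric months. It leaves out the explicit frame clause, on the assumption that rank() depends only on the ORDER BY inside OVER.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MonthlyUnits (year INTEGER, month INTEGER, totunits INTEGER)")
# Monthly totals from the rank example (Jan..Aug 1994)
totals = [200, 344, 401, 443, 360, 404, 389, 451]
conn.executemany("INSERT INTO MonthlyUnits VALUES (1994, ?, ?)",
                 list(enumerate(totals, start=1)))

# Rank each month by its total, highest total first
rows = conn.execute("""
    SELECT month, totunits,
           RANK() OVER (ORDER BY totunits DESC) AS ranktotal
    FROM MonthlyUnits
    ORDER BY year, month
""").fetchall()

ranks = [ranktotal for _, _, ranktotal in rows]
print(ranks)
```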

Analytical Functions
Ranking Functions
• RANK, DENSE_RANK, CUME_DIST, PERCENT_RANK, NTILE, ROW_NUMBER
• Rank current row within the window
Inverse Percentile Functions
• PERCENTILE_CONT (continuous)
• PERCENTILE_DISC (discrete)
Histogram Support
• WIDTH_BUCKET
Lag/Lead Functions
• LAG, LEAD
• Return data from a row at a specified offset from the current row within the window

ETL: Extraction, Transformation & Loading

ETL: Extraction, Transformation & Loading
80% of total cost of building warehouse
Extraction -> Transformation -> Loading

Extraction
Sources
• Multiple DBs
• Flat Files
• External Data Sources
  – e.g. Census, Geographic, Weather, Financial, Unemployment Data
  – Standard DB/Spreadsheet format or semi-structured data from the web
Frequency
• Periodic (hourly, daily, weekly, …)
• Triggered
  – Single event
  – #, sequence, pattern of events
Mechanisms
• Snapshots / Materialized Views / Replication
• Database Triggers
• Process Logs
• Query Sources (full vs incremental)

Transformation
Cleaning
• Scrubbing
• Filtering
• Conformance
Integration
• Renaming
• Fusion & Merging
• Determine Surrogate Keys
• Timestamping
• Summarization
Schema Organization
• Dimension Tables
• Pre-Aggregation via Materialized Views
• Derivation

(Transformation) Cleaning
Scrubbing
• Use domain-specific knowledge, e.g. SS#, phone-number, zipcode
Filtering
• Check for inconsistent data
• Use data validation rules
Conformance
• Map similarly typed data to standard representation
• Convert
  – units (inch => cm, $ => euro)
  – scale (mm => cm)
  – formats (string => integer, string with/wo $)

(Transformation) Integration
Renaming
• Resolve name conflicts
Fusion
• e.g. merge properties in city db with properties in developer lists
Determine Surrogate Keys
• Do not use keys from operational data as primary key in warehouse data
Timestamping
• Add timestamps to fact data where missing to enable historical queries
Reorganization & Evolution
• Support Data Reorganization & Schema Evolution
Summarization
• Summarize original operational data and combine into less detailed tables

Integration (Data Reorganization)
What do we do when attributes change?
Suppose districts are reorganized and a store is now part of a different district
Consistently changing mapping of store to district
– Allows new and old data to be compared reasonably by district
– But causes incorrect comparisons by district among older data alone
Solutions
1. Keep fields for both old and new mapping (in fact, potentially a separate field for each reorganization)
2. Add effective date to store dimension. Have multiple rows for same store, each with a different effective date

(Integration) Summarization
Operational tables:
  CustomerTransaction: transid, custid, empid, posid, time
  ItemPurchase: transid, lineno, prodid, price, units
  PointOfSaleTerminals: posid, postyp, storid, loc
Summarized into:
  DailySales (Fact): storid, prodid, date, price, units
Might build different fact tables for different purposes, e.g.
• ones involving Customers
• ones involving Store Locations
Tradeoff: Smaller Fact Tables vs. Missed Relationships

Loading
Alternatives
– Incremental vs Full Refresh: most data is incrementally added to the warehouse
– Off-line vs on-line
– Frequency
  • Nightly
  • Weekly
  • Monthly
– All-at-once vs Staged
What indices to create or drop?
What statistics to collect (& use)?

Constellation Schema
Data warehouses often are designed as constellations
• Multiple fact tables
• Shared/related dimension tables
Examples
– Sales: store, product, date
– Distribution: distributor, store, product, carrier, period
– Advertising: store, medium, product, period
Query across same or related dimensions
– Compare advertising and sales by store within various periods
Data Marts
Store different fact tables (or different groups of fact tables) in separate data marts

Data Mart Architectures
Subset of Data Warehouse
Meets needs of subgroup of users
• Top-down:
  – Extracted from Data Warehouse
  – Problem: early availability
• Bottom-up:
  – Built directly from staging area
  – Can be combined to form warehouse
  – Problem: Conformance. ETL tool must provide metadata
• Hybrid:
  – Some data marts built directly from staging area
  – Others extracted from Data Warehouse

Metadata Management
Identify & define each attribute
– Source(s)
– Transformation(s) applied
– How aggregated
– Description of what it represents
– Relationships to other attributes
– History
Materialized Views & Query Rewriting

What is an Ordinary View
A view is not a table
A view does not hold data
A view is just a description used in expanding queries which refer to the view!

View Expansion
Suppose we define
CREATE VIEW HiEmps AS SELECT * FROM Emps WHERE sal > 1500
and then execute the query
SELECT ename, job FROM HiEmps
The database engine automatically expands this into
SELECT ename, job FROM Emps WHERE sal > 1500
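
A quick way to see that the view and its expansion are interchangeable is to run both. The sketch below does this in SQLite through Python's sqlite3 (a stand-in DBMS); the three employee rows are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Emps (ename TEXT, job TEXT, sal INTEGER);
-- Hypothetical sample rows
INSERT INTO Emps VALUES ('SMITH', 'CLERK', 800),
                        ('ALLEN', 'SALESMAN', 1600),
                        ('KING', 'PRESIDENT', 5000);

-- The view stores only its definition, not the rows
CREATE VIEW HiEmps AS SELECT * FROM Emps WHERE sal > 1500;
""")

# Query written against the view
via_view = conn.execute(
    "SELECT ename, job FROM HiEmps ORDER BY ename").fetchall()

# The same query with the view definition expanded by hand
expanded = conn.execute(
    "SELECT ename, job FROM Emps WHERE sal > 1500 ORDER BY ename").fetchall()

print(via_view)
print(expanded)
```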

Motivating Materialized Views
Suppose a view is
• Used frequently in an application
• Somewhat expensive to compute
• Based on tables that change infrequently
It would be useful to
• Store the contents of the view in a table
• Use the table for queries
• Arrange to update the table (automatically) when the base tables change [or, perhaps less frequently, if the view does not need to be perfectly up-to-date]

Example Star Schema
DailySales (Fact): storid, prodid, dateid, price, units
Stores (Dimension): storid, stornam, city, state, distid, distnam, distarea, regid, regnam
Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
Dates (Dimension): dateid, date, dayofweek, dayofmonth, dayofyear, weekyr, weekofyear, monthyr, month, quarteryr, quarter, year

Materialized Views
Materialized views actually hold data
  CREATE MATERIALIZED VIEW ProdDistYrSum AS
  SELECT p.prodtyp, s.distid, d.year, sum(f.units) AS totunits
  FROM DailySales f
    NATURAL JOIN Stores s
    NATURAL JOIN Products p
    NATURAL JOIN Dates d
  GROUP BY p.prodtyp, s.distid, d.year
A materialized view is
• Like a table, in that it actually stores the result of the query
• Like a view, in that it is possible to arrange for it to be automatically updated when the underlying base data changes

Updating Materialized Views
During the loading phase, new data is incrementally added to data warehouse tables
Materialized Views (which are defined as part of architecting the data warehouse) are either
– Recalculated from scratch based on the new base table contents
– Incrementally updated based on incremental changes to the base tables
How is ProdDistYrSum incrementally updated when a new day's worth of data is added?

Using Materialized Views
Instead of writing,
  SELECT p.prodtyp, s.distid, sum(f.units) AS totunits
  FROM DailySales f
    NATURAL JOIN Stores s
    NATURAL JOIN Products p
    NATURAL JOIN Dates d
  WHERE d.year = 2002
  GROUP BY p.prodtyp, s.distid
just write
  SELECT prodtyp, distid, totunits
  FROM ProdDistYrSum
  WHERE year = 2002
Because ProdDistYrSum is a materialized view, the database engine does NOT expand it, but just uses its materialized data

Aggregating Materialized Views
Instead of writing,
  SELECT p.prodtyp, s.distid, sum(f.units) AS totunits
  FROM DailySales f
    NATURAL JOIN Stores s
    NATURAL JOIN Products p
  GROUP BY p.prodtyp, s.distid
just write
  SELECT prodtyp, distid, sum(totunits) AS totunits_overalltime
  FROM ProdDistYrSum
  GROUP BY prodtyp, distid
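
The equivalence of the two queries can be checked on toy data. SQLite has no CREATE MATERIALIZED VIEW, so the sketch below (Python's sqlite3, as a stand-in) emulates ProdDistYrSum with a summary table built once by CREATE TABLE ... AS SELECT; that substitution and all sample rows are assumptions made for illustration. The emulation is never refreshed, unlike a real materialized view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Stores   (storid INTEGER PRIMARY KEY, distid TEXT);
CREATE TABLE Products (prodid INTEGER PRIMARY KEY, prodtyp TEXT);
CREATE TABLE Dates    (dateid INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE DailySales (storid INTEGER, prodid INTEGER, dateid INTEGER, units INTEGER);

-- Hypothetical sample rows
INSERT INTO Stores   VALUES (1, 'D1'), (2, 'D1'), (3, 'D2');
INSERT INTO Products VALUES (10, 'boot'), (11, 'boot'), (12, 'sandal');
INSERT INTO Dates    VALUES (100, 2001), (101, 2002);
INSERT INTO DailySales VALUES
    (1, 10, 100, 5), (2, 11, 100, 3), (1, 10, 101, 2), (3, 12, 101, 7), (3, 10, 100, 1);

-- Stand-in for the materialized view: a precomputed summary table
CREATE TABLE ProdDistYrSum AS
SELECT p.prodtyp, s.distid, d.year, SUM(f.units) AS totunits
FROM DailySales f
  JOIN Stores s   ON s.storid = f.storid
  JOIN Products p ON p.prodid = f.prodid
  JOIN Dates d    ON d.dateid = f.dateid
GROUP BY p.prodtyp, s.distid, d.year;
""")

# Aggregate directly from the fact table
direct = conn.execute("""
    SELECT p.prodtyp, s.distid, SUM(f.units)
    FROM DailySales f
      JOIN Stores s   ON s.storid = f.storid
      JOIN Products p ON p.prodid = f.prodid
    GROUP BY p.prodtyp, s.distid
    ORDER BY p.prodtyp, s.distid
""").fetchall()

# Re-aggregate the (much smaller) per-year summary instead
from_summary = conn.execute("""
    SELECT prodtyp, distid, SUM(totunits)
    FROM ProdDistYrSum
    GROUP BY prodtyp, distid
    ORDER BY prodtyp, distid
""").fetchall()

print(direct)
print(from_summary)
```

Summing the per-year sums works because SUM is distributive; the same shortcut would not be valid for, say, an average without also storing counts.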

Architecting Materialized Views
Which views should be materialized?
Many possible combinations:
– district, region
– district/week, district/month, …
– region/week, region/month, …
– district/category, district/manufacturer, …
– category/week, category/month, …
– category/district/week, …
Design balances
– Cost of precalculating & storing view
– Cost of calculating on the fly
A heuristic optimization problem
– Uses statistics of queries
– Uses size of each combination
– e.g. Benefit Per Unit Space (BPUS)
Materialized View Evolution Problem
As the data warehouse evolves, the set of materialized views needs to change.
But, if the DW design already includes 1000 analysis queries, they would need to be rewritten to use the new set of materialized views.
This is expensive!
Query Rewriting
Systems (like Oracle) that support query rewriting
• Can automatically rewrite queries to use available materialized views (this can be complicated!)
• Allow a subset of materialized views to be marked for use in query rewriting
Query rewriting is the opposite of view expansion!
If the data warehouse does not support query rewriting, the ETL tool could do it instead!

Stores x SubCategories
  CREATE MATERIALIZED VIEW StoreSubcatSum AS
  SELECT s.storid, p.subcatid, sum(f.units) AS totunits
  FROM DailySales f
    NATURAL JOIN Stores s
    NATURAL JOIN Products p
  GROUP BY s.storid, p.subcatid
Hierarchies: Stores -> Districts -> Regions; Products -> ProductTypes -> SubCategories -> Categories

Districts x ProductTypes
  CREATE MATERIALIZED VIEW DistProdtypSum AS
  SELECT p.prodtyp, s.distid, sum(f.units) AS totunits
  FROM DailySales f
    NATURAL JOIN Stores s
    NATURAL JOIN Products p
  GROUP BY p.prodtyp, s.distid
Hierarchies: Stores -> Districts -> Regions; Products -> ProductTypes -> SubCategories -> Categories

Multiple Materialized View Alternatives
  SELECT p.catid, s.regid, sum(f.units) AS totunits
  FROM DailySales f
    NATURAL JOIN Stores s
    NATURAL JOIN Products p
  GROUP BY p.catid, s.regid
Hierarchies: Stores -> Districts -> Regions; Products -> ProductTypes -> SubCategories -> Categories
Should the optimizer rewrite this query in terms of StoreSubcatSum or DistProdtypSum?
In general, how good is the optimizer? Can it discover unions, etc.?
© Ellis Cohen, 2002-2005 101
Automatic Result Caching
Database can (potentially)
– cache the results of any query automatically as a materialized view
– use the query history to automatically define new materialized views
Then, based on their size & usage statistics, the DB can automatically determine
– whether to discard any of these views after a while
– whether to discard or update any of these views when their underlying base tables are updated
Materialized View References
DailySales (Fact): storid, prodid, dateid, price, units
Stores (Dimension): storid, stornam, city, state, distid, distnam, distarea, regid, regnam
Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
DistProdtypSum: prodtyp, distid, totunits, …
Not Foreign Key Constraints
When dimensions are denormalized, the key columns of a materialized view (here prodtyp and distid) do not refer to unique dimension attributes.
This requires using DISTINCT queries.
Queries Requiring DISTINCT
SELECT p.prodtyp, s.distnam, sum(f.units) as totunits
FROM DailySales f NATURAL JOIN Stores s NATURAL JOIN Products p
GROUP BY p.prodtyp, s.distid, s.distnam
rewritten as
SELECT DISTINCT prodtyp, distnam, totunits
FROM DistProdtypSum NATURAL JOIN Stores
DISTINCT is only needed because distid is not unique in Stores.
Fix by adding a denormalized subdimension table for Districts.
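A short sketch of why DISTINCT is needed and how a Districts subdimension removes the need. It uses SQLite with invented rows; the Districts table here is derived from Stores purely for the demonstration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Stores (storid INTEGER PRIMARY KEY, distid INT, distnam TEXT);
CREATE TABLE DistProdtypSum (prodtyp INT, distid INT, totunits INT);
INSERT INTO Stores VALUES (1, 10, 'North'), (2, 10, 'North'), (3, 20, 'South');
INSERT INTO DistProdtypSum VALUES (7, 10, 8), (7, 20, 6);
-- Subdimension with one row per district, so distid is unique here.
CREATE TABLE Districts AS SELECT DISTINCT distid, distnam FROM Stores;
""")

# Joining to Stores repeats each summary row once per store in the district.
plain = con.execute(
    "SELECT prodtyp, distnam, totunits "
    "FROM DistProdtypSum NATURAL JOIN Stores ORDER BY distnam"
).fetchall()

# DISTINCT removes the repeats.
with_distinct = con.execute(
    "SELECT DISTINCT prodtyp, distnam, totunits "
    "FROM DistProdtypSum NATURAL JOIN Stores ORDER BY distnam"
).fetchall()

# Joining to the subdimension needs no DISTINCT at all.
via_districts = con.execute(
    "SELECT prodtyp, distnam, totunits "
    "FROM DistProdtypSum NATURAL JOIN Districts ORDER BY distnam"
).fetchall()

print(plain, with_distinct, via_districts)
```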
Starflake Schema (Fully Denormalized Dimensions and SubDimensions as needed)
DailySales (Fact): storid, prodid, dateid, price, units
Stores (Dimension): storid, stornam, city, state, distid, distnam, distarea, regid, regnam
Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
ProductTypes (SubDimension): prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
Districts (SubDimension): distid, distnam, distarea, regid, regnam
DistProdtypSum: prodtyp, distid, totunits, …
Keep Denormalized
Indexing for Data Warehouses
Implementing Warehouse Queries
SELECT sum(f.units) as totunits
FROM DailySales f NATURAL JOIN Products p
WHERE p.catid = 5
DailySales (Fact), Index by prodid: storid, prodid, dateid, price, units
Products (Dimension), Index by catid: prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
Assume that
• Products is indexed by catid
• DailySales is indexed by prodid
For a specific value of catid
– Get all rowids in Products with that catid
– Extract the prodid's
– Get all rowids in DailySales with those prodid's
– Extract the units from the rows & sum
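The two-index plan above can be sketched with in-memory dictionaries standing in for the indexes. The tables, rowids and values are invented, and `build_index` is a hypothetical helper, not a database API.

```python
from collections import defaultdict

# rowid -> row, for two tiny invented tables.
products = {
    0: {"prodid": 603942, "catid": 5},
    1: {"prodid": 603947, "catid": 2},
    2: {"prodid": 603968, "catid": 5},
}
daily_sales = {
    0: {"prodid": 603942, "units": 4},
    1: {"prodid": 603947, "units": 9},
    2: {"prodid": 603968, "units": 1},
    3: {"prodid": 603942, "units": 2},
}


def build_index(table, key):
    """Map each value of `key` to the rowids that hold it."""
    index = defaultdict(list)
    for rowid, row in table.items():
        index[row[key]].append(rowid)
    return index


products_by_catid = build_index(products, "catid")
sales_by_prodid = build_index(daily_sales, "prodid")


def total_units(catid):
    # Step 1: rowids in Products with that catid, then their prodids.
    prodids = [products[r]["prodid"] for r in products_by_catid.get(catid, [])]
    # Step 2: rowids in DailySales with those prodids.
    rowids = [r for p in prodids for r in sales_by_prodid.get(p, [])]
    # Step 3: fetch the rows and sum the units.
    return sum(daily_sales[r]["units"] for r in rowids)


print(total_units(5))
```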
Using Join Indexing
SELECT sum(f.units) as totunits
FROM DailySales f NATURAL JOIN Products p
WHERE p.catid = 5
DailySales (Fact), Index by Products.catid: storid, prodid, dateid, price, units
Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
Assume that
• DailySales is indexed by Products.catid
For a specific value of catid
– Get all rowids in DailySales with that catid
– Extract the units from the rows & sum
A join index is an index on one table based (in part) on values of fields of other tables
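A minimal sketch of a join index, assuming invented data: the index is built on DailySales rows but keyed by the catid found in Products, so the query never touches Products at lookup time.

```python
from collections import defaultdict

# prodid -> catid (the part of Products the index needs).
products = {603942: 5, 603947: 2, 603968: 5}
# DailySales rows as (prodid, units); the list position plays the role of rowid.
daily_sales = [(603942, 4), (603947, 9), (603968, 1), (603942, 2)]

# Build the join index once: Products.catid -> rowids in DailySales.
sales_by_catid = defaultdict(list)
for rowid, (prodid, _units) in enumerate(daily_sales):
    sales_by_catid[products[prodid]].append(rowid)


def total_units(catid):
    # One lookup, then fetch the fact rows and sum.
    return sum(daily_sales[r][1] for r in sales_by_catid.get(catid, []))


print(total_units(5))
```

The join work moves to index-build (and index-maintenance) time, which suits data that changes only on a schedule.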
Multi-Table Joins
SELECT sum(f.units) as totunits
FROM DailySales f NATURAL JOIN Products p NATURAL JOIN Dates d
WHERE p.catid = 5 AND d.year = 2002
DailySales (Fact), Index by Products.catid, Index by Dates.year: storid, prodid, dateid, price, units
Assume that
• DailySales is indexed by Products.catid
• DailySales is indexed by Dates.year
For a specific value of catid
– Get all rowids in DailySales with that catid
For a specific year
– Get all rowids in DailySales with that year
Intersect the two lists of rowids (lots!)
Extract the units from the rows & sum
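A sketch of the rowid-intersection step, with invented data. Each fact row is shown with the catid and year that the joins to Products and Dates would resolve, so that the two join indexes can be built in a few lines.

```python
from collections import defaultdict

# DailySales rows as (catid of the product, year of the date, units); position = rowid.
sales = [(5, 2002, 4), (5, 2001, 2), (2, 2002, 9), (5, 2002, 1)]

by_catid = defaultdict(set)  # join index on Products.catid
by_year = defaultdict(set)   # join index on Dates.year
for rowid, (catid, year, _units) in enumerate(sales):
    by_catid[catid].add(rowid)
    by_year[year].add(rowid)


def total_units(catid, year):
    # Intersect the two rowid sets, then fetch only the surviving rows.
    rowids = by_catid.get(catid, set()) & by_year.get(year, set())
    return sum(sales[r][2] for r in rowids)


print(total_units(5, 2002))
```

On a real fact table each rowid set can be very large, which is the cost the "(lots!)" remark points at.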
Multi-Table Join Indexes
SELECT sum(f.units) as totunits
FROM DailySales f NATURAL JOIN Products p NATURAL JOIN Dates d
WHERE p.catid = 5 AND d.year = 2002
DailySales (Fact), Index by (Products.catid, Dates.year): storid, prodid, dateid, price, units
Assume that
• DailySales is indexed by (Products.catid, Dates.year)
For a distinct pair of (catid, year)
– Get all rowids in DailySales with that catid & year
– Extract the units from the rows & sum
Same issue as for aggregates: there are lots of possible combinations of multi-table join indices. Which ones are worth building?
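A sketch of a composite join index keyed by the (catid, year) pair, plus a count of how many attribute combinations could be indexed. The data and the three candidate attributes are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

# DailySales rows as (catid of the product, year of the date, units); position = rowid.
sales = [(5, 2002, 4), (5, 2001, 2), (2, 2002, 9), (5, 2002, 1)]

# One index keyed by the pair, so no rowid intersection is needed.
by_cat_year = defaultdict(list)
for rowid, (catid, year, _units) in enumerate(sales):
    by_cat_year[(catid, year)].append(rowid)


def total_units(catid, year):
    return sum(sales[r][2] for r in by_cat_year.get((catid, year), []))


# Every non-empty subset of the indexable attributes is a candidate index.
attrs = ["Product.catid", "Dates.year", "Store.city"]
candidates = [c for n in range(1, len(attrs) + 1) for c in combinations(attrs, n)]

print(total_units(5, 2002), len(candidates))
```

The candidate count grows as 2^n - 1 with the number of attributes, which is why the choice of indexes is a selection problem like the choice of materialized views.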
Bitmap Indexing
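Product table with ProductBitmapIndex on catid (one bit vector per catid value; a bit is 1 where the row has that catid)
prodid  …  catid  …  |  1  2  3  4  5
603942  …  5      …  |  0  0  0  0  1
603947  …  2      …  |  0  1  0  0  0
603950  …  2      …  |  0  1  0  0  0
603951  …  2      …  |  0  1  0  0  0
603964  …  3      …  |  0  0  1  0  0
603968  …  5      …  |  0  0  0  0  1
…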
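A minimal sketch of a bitmap index over the six Product rows shown on this slide, using one Python integer per catid value as the bit vector (bit i corresponds to row i). Real systems compress these vectors; that is left out here.

```python
# catid column of the six Product rows, in row order.
catids = [5, 2, 2, 2, 3, 5]


def build_bitmap_index(column):
    """Return {value: bit vector}, with bit i set when row i holds that value."""
    bitmaps = {}
    for rowid, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << rowid)
    return bitmaps


def rowids(bitmap):
    """List the row positions whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


index = build_bitmap_index(catids)
print(rowids(index[5]), rowids(index[2]), rowids(index[3]))
```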
Using Bitmap Indices
SELECT min(size), max(size)
FROM Product
WHERE catid = 5
Product (Dimension), Bitmap Index by category: prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
Assume that
• Product has bitmap index on catid (faster than doing full scan or using B+ tree)
Implement Query By
– Scan all tuples in Product with the bit set for catid: 5
– Extract the size from each tuple and compute min(size) and max(size)
Bitmap Intersection
SELECT min(size), max(size)
FROM Product
WHERE catid = 5 AND color = 'fuchsia'
Product (Dimension), Bitmap Indices by category & by color: prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
Assume that
• Product has bitmap indices on catid and on color
Implement Query By
– Construct the bit vector which has bits set for both catid: 5 and color: fuchsia (very fast!)
– Scan all tuples in Product with the bit set in the resulting bit vector
– Extract the size from each tuple and compute min(size) and max(size)
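A sketch of bitmap intersection, assuming invented Product rows: the two bit vectors are combined with a single bitwise AND before any row is fetched.

```python
# Product rows as (catid, color, size); the list position plays the role of rowid.
rows = [(5, "fuchsia", 10), (5, "teal", 12), (2, "fuchsia", 7), (5, "fuchsia", 14)]


def bitmap_for(column, value):
    """Bit vector with bit i set when row i has `value` in the given column."""
    bits = 0
    for rowid, row in enumerate(rows):
        if row[column] == value:
            bits |= 1 << rowid
    return bits


cat5 = bitmap_for(0, 5)             # rows 0, 1, 3
fuchsia = bitmap_for(1, "fuchsia")  # rows 0, 2, 3
both = cat5 & fuchsia               # rows 0, 3

# Fetch only the rows whose bit survived the AND.
sizes = [rows[i][2] for i in range(both.bit_length()) if (both >> i) & 1]
result = (min(sizes), max(sizes))
print(result)
```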
Bitmap Join Indexing
SELECT sum(f.units) as totunits
FROM DailySales f NATURAL JOIN Product p
WHERE p.catid = 5
DailySales (Fact), Bitmap Index by Product.catid: storid, prodid, dateid, price, units
Assume that
• DailySales has bitmap index on Product.catid
For a specific value of catid
– Get all rowids in DailySales with that catid
– Extract the units from the rows & sum
Multi-Table Joins with Bitmap Indices
SELECT sum(f.units) as totunits
FROM DailySales f NATURAL JOIN Product p NATURAL JOIN Store s
WHERE p.catid = 5 AND s.city = 'Boston'
DailySales (Fact), Bitmap Index by Product.catid, Bitmap Index by Store.city: storid, prodid, dateid, price, units
Assume that
• DailySales has bitmap index on Product.catid
• DailySales has bitmap index on Store.city
Implement Query By
– Construct the bit vector which has bits set for both catid: 5 and city: Boston (very fast!)
– Get the rowids of all rows with bit set in the resulting bit vector
– Extract the units from the rows & sum
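A sketch of two bitmap join indexes on the fact table, with invented data: each bit vector is over DailySales rows but keyed by a dimension attribute, and the query combines them with one AND.

```python
# Dimension lookups (invented): prodid -> catid, storid -> city.
product_catid = {1: 5, 2: 2}
store_city = {1: "Boston", 2: "Albany"}
# DailySales rows as (storid, prodid, units); the list position plays the role of rowid.
daily_sales = [(1, 1, 4), (2, 1, 3), (1, 2, 9), (1, 1, 2)]

# Build both bitmap join indexes in one pass over the fact table.
by_catid, by_city = {}, {}
for rowid, (storid, prodid, _units) in enumerate(daily_sales):
    catid, city = product_catid[prodid], store_city[storid]
    by_catid[catid] = by_catid.get(catid, 0) | (1 << rowid)
    by_city[city] = by_city.get(city, 0) | (1 << rowid)


def total_units(catid, city):
    # AND the two bit vectors, then fetch only the matching fact rows.
    bits = by_catid.get(catid, 0) & by_city.get(city, 0)
    return sum(daily_sales[i][2] for i in range(bits.bit_length()) if (bits >> i) & 1)


print(total_units(5, "Boston"))
```

Unlike the rowid-list intersection used with ordinary join indexes, combining bit vectors is a cheap word-at-a-time operation.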
Indexing vs Materialization
Indexing
– Less space
– Usable with different kinds of aggregation and analysis operations
– More opportunities for combining
Materialized Views (esp. Aggregates)
– Avoid recomputation, esp. recalculation of aggregates