CS 345: Topics in Data Warehousing Thursday, September 30, 2004.
-
Upload
brendan-nichols -
Category
Documents
-
view
214 -
download
0
Transcript of CS 345: Topics in Data Warehousing Thursday, September 30, 2004.
![Page 1: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/1.jpg)
CS 345:Topics in Data Warehousing
Thursday, September 30, 2004
![Page 2: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/2.jpg)
Review of Tuesday’s Class
• Roles of Information Technology– 1. Automate clerical work – 2. Decision support
• OLAP vs. OLTP– Different query characteristics– Different performance requirements– Different data modeling requirements– OLAP combines data from many sources
• High-level course outline– Logical Database Design– Query Processing– Physical Database Design– Data Mining
![Page 3: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/3.jpg)
Outline of Today’s Class
• Data integration
• Basic OLAP queries– Data cubes– Slice and dice, drill down, roll up– MOLAP vs. ROLAP– SQL OLAP Extensions: ROLLUP, CUBE
• Star Schemas– Facts and Dimensions
![Page 4: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/4.jpg)
Loading the Data Warehouse
Source Systems Data Staging Area Data Warehouse
(OLTP)
Data is periodically extracted
Data is cleansed and transformed
Users query the data warehouse
![Page 5: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/5.jpg)
Terminology: ETL
• ETL = Extraction, Transformation, & Load• Extraction: Get the data out of the source
systems• Transformation: Convert the data into a useful
format for analysis• Load: Get the data into the data warehouse
(…and build indexes, materialized views, etc.)
• We will return to this topic in a couple weeks.
![Page 6: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/6.jpg)
Data Integration is Hard
• Data warehouses combine data from multiple sources• Data must be translated into a consistent format• Data integration represents ~80% of effort for a typical
data warehouse project!• Some reasons why it’s hard:
– Metadata is often poor or non-existent– Data quality is often bad
• Missing or default values• Multiple spellings of the same thing
(Cal vs. UC Berkeley vs. University of California)
– Inconsistent semantics• What is an airline passenger?
![Page 7: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/7.jpg)
Federated Databases
• An alternative to data warehouses• Data warehouse
– Create a copy of all the data – Execute queries against the copy
• Federated database – Pull data from source systems as needed to answer queries
• “lazy” vs. “eager” data integration
Data Warehouse Federated Database
Query
Answer
QueryExtraction
Rewritten Queries
Answer
SourceSystems
SourceSystems
WarehouseMediator
![Page 8: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/8.jpg)
Warehouses vs. Federation
• Advantages of federated databases:– No redundant copying of data– Queries see “real-time” view of evolving data– More flexible security policy
• Disadvantages of federated databases:– Analysis queries place extra load on transactional systems– Query optimization is hard to do well– Historical data may not be available– Complex “wrappers” needed to mediate between analysis server
and source systems• Data warehouses are much more common in practice
– Better performance– Lower complexity– Slightly out-of-date data is acceptable
![Page 9: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/9.jpg)
Two Approaches to Data Warehousing
• Data mart: like a data warehouse, but smaller and more focused
• Top-down approach– First build single unified data warehouse with all enterprise data– Then create data marts containing specialized subsets of the
data from the warehouse
• Bottom-up approach– First build a data mart to solve the most pressing problem– Then build another data mart, then another– Data warehouse = union of all data marts
• In practice, not much difference between the two• Our book advocates the bottom-up approach
![Page 10: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/10.jpg)
Data Cube
• Axes of the cube represent attributes of the data records– Generally discrete-valued /
categorical– e.g. color, month, state– Called dimensions
• Cells hold aggregated measurements – e.g. total $ sales, number
of autos sold– Called facts
• Real data cubes have >> 3 dimensions Jul Aug Sep
CAORWA
Red
Blue
Gray
Auto Sales
![Page 11: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/11.jpg)
Slicing and Dicing
Jul Aug SepCAORWA
Red
Blue
Gray
Red
Blue
Gray
Jul Aug SepCAORWA
Blue
Jul Aug SepCAORWA
Blue
Jul Aug SepTotal
![Page 12: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/12.jpg)
Querying the Data Cube
• Cross-tabulation– “Cross-tab” for short– Report data grouped by 2
dimensions– Aggregate across other
dimensions– Include subtotals
• Operations on a cross-tab– Roll up (further aggregation)– Drill down (less aggregation)
CA OR WA Total
Jul 45 33 30 108
Aug 50 36 42 128
Sep 38 31 40 109
Total 133 100 112 345
Number of Autos Sold
![Page 13: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/13.jpg)
Roll Up and Drill Down
CA OR WA Total
Jul 45 33 30 108
Aug 50 36 42 128
Sep 38 31 40 109
Total 133 100 112 345
Number of Autos Sold
CA OR WA Total
133 100 112 345
Number of Autos Sold
CA OR WA Total
Red 40 29 40 109
Blue 45 31 37 113
Gray 48 40 35 123
Total 133 100 112 345
Roll upby Month
Number of Autos Sold
Drill downby Color
![Page 14: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/14.jpg)
“Standard” Data Cube Query
• Measurements– Which fact(s) should be reported?
• Filters– What slice(s) of the cube should be used?
• Grouping attributes– How finely should the cube be diced?– Each dimension is either:
• (a) A grouping attribute• (b) Aggregated over (“Rolled up” into a single total)
– n dimensions → 2n sets of grouping attributes– Aggregation = projection to a lower-dimensional
subspace
![Page 15: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/15.jpg)
Full Data Cube with Subtotals
• Pre-computation of aggregates → fast answers to OLAP queries
• Ideally, pre-compute all 2n types of subtotals• Otherwise, perform aggregation as needed• Coarser-grained totals can be computed from
finer-grained totals– But not the other way around
![Page 16: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/16.jpg)
Data Cube Lattice
Total
State Month Color
State, Month
State,Color
Month,Color
State, Month, Color
DrillDown
RollUp
![Page 17: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/17.jpg)
MOLAP vs. ROLAP
• MOLAP = Multidimensional OLAP
• Store data cube as multidimensional array
• (Usually) pre-compute all aggregates
• Advantages:– Very efficient data access → fast answers
• Disadvantages:– Doesn’t scale to large numbers of dimensions– Requires special-purpose data store
![Page 18: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/18.jpg)
Sparsity
• Imagine a data warehouse for Safeway.• Suppose dimensions are: Customer, Product, Store, Day• If there are 100,000 customers, 10,000 products, 1,000
stores, and 1,000 days…• …data cube has 1,000,000,000,000,000 cells!• Fortunately, most cells are empty.• A given store doesn’t sell every product on every day.• A given customer has never visited most of the stores.• A given customer has never purchased most products.• Multi-dimensional arrays are not an efficient way to store
sparse data.
![Page 19: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/19.jpg)
MOLAP vs. ROLAP
• ROLAP = Relational OLAP• Store data cube in relational database• Express queries in SQL• Advantages:
– Scales well to high dimensionality– Scales well to large data sets– Sparsity is not a problem– Uses well-known, mature technology
• Disadvantages:– Query performance is slower than MOLAP– Need to construct explicit indexes
![Page 20: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/20.jpg)
Creating a Cross-tab with SQL
SELECT state, month, SUM(quantity)FROM salesGROUP BY state, monthWHERE color = 'Red'
Grouping Attributes
Measurements
Filters
![Page 21: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/21.jpg)
What about the totals?
• SQL aggregation query with GROUP BY does not produce subtotals, totals
• Our cross-tab report is incomplete.
CA OR WA Total
Jul 45 33 30 ?
Aug 50 36 42 ?
Sep 38 31 40 ?
Total ? ? ? ?
Number of Autos Sold
State Month SUMCA Jul 45CA Aug 50CA Sep 38OR Jul 33OR Aug 36OR Sep 31WA Jul 30WA Aug 42WA Sep 40
![Page 22: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/22.jpg)
One solution: a big UNION ALL
SELECT state, month, SUM(quantity)FROM salesGROUP BY state, monthWHERE color = 'Red‘UNION ALLSELECT state, "ALL", SUM(quantity)FROM salesGROUP BY stateWHERE color = 'Red'UNION ALLSELECT "ALL", month, SUM(quantity)FROM salesGROUP BY monthWHERE color = 'Red‘UNION ALLSELECT "ALL", "ALL", SUM(quantity)FROM salesWHERE color = 'Red'
OriginalQuery
StateSubtotals
MonthSubtotals
OverallTotal
![Page 23: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/23.jpg)
A better solution
• “UNION ALL” solution gets cumbersome with more than 2 grouping attributes
• n grouping attributes → 2n parts in the union• OLAP extensions added to SQL 99 are more
convenient– CUBE, ROLLUP
SELECT state, month, SUM(quantity)FROM salesGROUP BY CUBE(state, month)WHERE color = 'Red'
![Page 24: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/24.jpg)
Results of the CUBE queryState MonthSUM(quantity)CA Jul 45CA Aug 50CA Sep 38CA NULL 133OR Jul 33OR Aug 36OR Sep 31OR NULL 100WA Jul 30WA Aug 42WA Sep 40WA NULL 112NULL Jul 108NULL Aug 128NULL Sep 109NULL NULL 345
Notice the use of NULL for totals
Subtotals at all levels
![Page 25: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/25.jpg)
ROLLUP vs. CUBE
• CUBE computes entire lattice
• ROLLUP computes one path through lattice– Order of GROUP BY list matters– Groups by all prefixes of the GROUP BY list
GROUP BY ROLLUP(A,B,C)•A,B,C•(A,B) subtotals•(A) subtotals•Total
GROUP BY CUBE(A,B,C)•A,B,C•Subtotals for the following:(A,B), (A,C), (B,C), (A), (B), (C)•Total
![Page 26: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/26.jpg)
ROLLUP example
Total
State Month Color
State, Month
State,Color
Month,Color
State, Month, Color
SELECT color, month, state, SUM(quantity)FROM salesGROUP BY ROLLUP(color,month,state)
![Page 27: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/27.jpg)
Logical Database Design
• Logical database design is the topic for the next two weeks.
• Logical design vs. physical design:– Logical design = conceptual organization for the
database• Create an abstraction for a real-world process
– Physical design = how is the data stored• Select data structures (tables, indexes, materialized views)• Organize data structures on disk
• Three main goals for logical design:– Simplicity– Expressiveness– Performance
![Page 28: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/28.jpg)
Goals for Logical Design
• Simplicity– Users should understand the design– Data model should match users’ conceptual model– Queries should be easy and intuitive to write
• Expressiveness– Include enough information to answer all important
queries– Include all relevant data (without irrelevant data)
• Performance– An efficient physical design should be possible
![Page 29: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/29.jpg)
Another perspective…
• From a presentation at SIGMOD*:
• Three design rules:– Simplicity– Simplicity– Simplicity
*by Jeff Byard and Donovan Schneider, Red Brick Systems
![Page 30: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/30.jpg)
Star Schema
Sales
Date
Product Store
Promotion
Fact table
Dimension tables
![Page 31: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/31.jpg)
Dimension Tables
• Each one corresponds to a real-world object or concept.– Examples: Customer, Product, Date, Employee, Region, Store,
Promotion, Vendor, Partner, Account, Department
• Properties of dimension tables:– Contain many descriptive columns
• Dimension tables are wide (dozens of columns)
– Generally don’t have too many rows• At least in comparison to the fact tables• Usually < 1 million rows
– Contents are relatively static• Almost like a lookup table
• Uses of dimension tables– Filters are based on dimension attributes– Grouping columns are dimension attributes– Fact tables are referenced through dimensions
![Page 32: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/32.jpg)
Fact Tables
• Each fact table contains measurements about a process of interest.
• Each fact row contains two things:– Numerical measure columns– Foreign keys to dimension tables– That’s all!
• Properties of fact tables:– Very big
• Often millions or billions of rows– Narrow
• Small number of columns– Changes often
• New events in the world → new rows in the fact table• Typically append-only
• Uses of fact tables:– Measurements are aggregated fact columns.
![Page 33: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/33.jpg)
Comparing Facts and Dimensions
• Narrow• Big (many rows)• Numeric• Growing over time
• Wide• Small (few rows)• Descriptive• Static
Facts Dimensions
Facts contain numbers, dimensions contain labels
![Page 34: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/34.jpg)
Four steps in dimensional modeling
1. Identify the process being modeled.
2. Determine the grain at which facts will be stored.
3. Choose the dimensions.
4. Identify the numeric measures for the facts.
![Page 35: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/35.jpg)
Grain of a Fact Table
• Grain of a fact table = the meaning of one fact table row• Determines the maximum level of detail of the
warehouse• Example grain statements: (one fact row represents a…)
– Line item from a cash register receipt– Boarding pass to get on a flight– Daily snapshot of inventory level for a product in a warehouse– Sensor reading per minute for a sensor– Student enrolled in a course
• Finer-grained fact tables:– are more expressive– have more rows
• Trade-off between performance and expressiveness– Rule of thumb: Err in favor of expressiveness– Pre-computed aggregates can solve performance problems
![Page 36: CS 345: Topics in Data Warehousing Thursday, September 30, 2004.](https://reader036.fdocuments.net/reader036/viewer/2022062719/56649ef05503460f94c00e67/html5/thumbnails/36.jpg)
Choosing Dimensions
• Determine a candidate key based on the grain statement.– Example: a student enrolled in a course– (Course, Student, Term) is a candidate key
• Add other relevant dimensions that are functionally determined by the candidate key.– For example, Instructor and Classroom
• Assuming each course has a single instructor!