An Analysis of the Publication "An Overview of Data Warehousing and OLAP Technology" by Surajit Chaudhuri and Umeshwar Dayal
Michael Goshey
University of Minnesota, Fall 2006
CSci 8701: Overview of Database Research
Michael Goshey: 9/19/2006 2
Outline
1. Introduction
2. Problem Addressed
3. Major Contributions
4. Key Concepts
5. Validation Methodology
6. Assumptions
7. 2006 Rewrite
Introduction
Selected paper
  S. Chaudhuri and U. Dayal, An Overview of Data Warehousing and OLAP Technology, SIGMOD Record 26(1): 65-74 (1997).
Motivation
  Personal interest
Problem Addressed
Problem Statement
  Survey: organizing the data warehousing space
  Differing requirements between OLTP and OLAP
Significance
  Growth area
  Reference work establishing consensus on terms, architectures and issues
Major Contributions
Bridging the gulf between industry and academia
OLTP vs. OLAP: clarifying the differences
Concise survey of relevant issues, architectures and tools
Concrete list of data warehouse design and build steps
Key Concepts
Data warehouses and data marts
OLTP, OLAP (ROLAP vs. MOLAP)
Relational and dimensional data models
Bitmap index
ETL
Metadata
Managed query vs. ad hoc environments
Materialized views
SQL extensions (cube, rollup, rank, percentile, etc.)
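To make the cube and rollup extensions in the list above concrete, here is a minimal Python sketch of the grouping sets each one produces. The sales rows, the column names, and the helper functions are hypothetical and chosen only for illustration; a database engine would evaluate these operators inside SQL, not like this.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales facts: (region, product, amount).
SALES = [
    ("East", "Widget", 10),
    ("East", "Gadget", 5),
    ("West", "Widget", 7),
]
DIMENSIONS = ("region", "product")


def aggregate(rows, grouping_sets):
    """Sum amounts for each grouping set; None marks a rolled-up ('ALL') column."""
    totals = defaultdict(int)
    for *dims, amount in rows:
        for kept in grouping_sets:
            key = tuple(v if i in kept else None for i, v in enumerate(dims))
            totals[key] += amount
    return dict(totals)


def rollup_sets(n):
    """ROLLUP: prefixes of the column list, from all n columns down to none."""
    return [tuple(range(k)) for k in range(n, -1, -1)]


def cube_sets(n):
    """CUBE: every subset of the column list."""
    return [c for k in range(n, -1, -1) for c in combinations(range(n), k)]


rollup = aggregate(SALES, rollup_sets(len(DIMENSIONS)))
cube = aggregate(SALES, cube_sets(len(DIMENSIONS)))
```

In this sketch the rollup result holds per-region subtotals and a grand total, while the cube result also holds per-product subtotals such as `(None, "Widget")`.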
Data Warehouse, Data Mart
[Figure: Typical Data Warehouse Architecture. Components shown: Source Systems, Data Staging Area, ETL Services, Metadata Catalog, Dimensional Data Marts including atomic data, Dimensional Data Marts with only aggregated data, Ad Hoc Query and Analysis Tools, Reporting Tools, Other uses]
Relational or Dimensional?
[Figure: relational schema diagram. Key and index markers from the diagram are shown in parentheses after each column.]
Categories: CategoryID (PK), CategoryName (U1), Description, Picture
Shippers: ShipperID (PK), CompanyName, Phone
Order Details: OrderID (PK, FK1, I2, I1), ProductID (PK, FK2, I4, I3), UnitPrice, Quantity, Discount
Customers: CustomerID (PK), CompanyName (I2), ContactName, ContactTitle, Address, City (I1), Region (I4), PostalCode (I3), Country, Phone, Fax
Suppliers: SupplierID (PK), CompanyName (I1), ContactName, ContactTitle, Address, City, Region, PostalCode (I2), Country, Phone, Fax, HomePage
Orders: OrderID (PK), CustomerID (FK1, I2, I1), EmployeeID (FK2, I3, I4), OrderDate (I5), RequiredDate, ShippedDate (I6), ShipVia (FK3, I7), Freight, ShipName, ShipAddress, ShipCity, ShipRegion, ShipPostalCode (I8), ShipCountry
Employees: EmployeeID (PK), LastName (I1), FirstName, Title, TitleOfCourtesy, BirthDate, HireDate, Address, City, Region, PostalCode (I2), Country, HomePhone, Extension, Photo, Notes, ReportsTo
Products: ProductID (PK), ProductName (I3), SupplierID (FK2, I5, I4), CategoryID (FK1, I1, I2), QuantityPerUnit, UnitPrice, UnitsInStock, UnitsOnOrder, ReorderLevel, Discontinued
Relational or Dimensional?
(image from http://www.laynetworks.com)
Bitmap Indices
customer  age 0-10  age 11-20  age 21-30  age 31-40
Mary      1         0          0          0
John      0         1          0          0
Steve     0         0          1          0
Tom       0         0          0          1
Lisa      0         0          1          0
Cardinality: unique values / total rows
B-Tree vs. bitmap: 1% rule, uniqueness
Boolean algebra directly on indices
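The table above can be turned into a small working bitmap index. The following is a minimal Python sketch, assuming each bitmap is stored as a plain integer with bit i standing for row i; the function names are hypothetical, and production systems typically use compressed bitmaps instead.

```python
from collections import defaultdict

# Rows from the slide's example table: (customer, age band).
ROWS = [
    ("Mary", "0-10"),
    ("John", "11-20"),
    ("Steve", "21-30"),
    ("Tom", "31-40"),
    ("Lisa", "21-30"),
]


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i has that value."""
    index = defaultdict(int)
    for row_number, value in enumerate(values):
        index[value] |= 1 << row_number
    return dict(index)


def rows_matching(bitmap, rows):
    """Return the rows whose bit is set in the bitmap."""
    return [row for i, row in enumerate(rows) if (bitmap >> i) & 1]


age_index = build_bitmap_index(band for _, band in ROWS)

# Boolean algebra directly on the index: age 11-20 OR age 21-30.
selected = age_index["11-20"] | age_index["21-30"]
names = [name for name, _ in rows_matching(selected, ROWS)]

# Cardinality as defined on the slide: unique values / total rows.
cardinality = len(age_index) / len(ROWS)
```

The OR of two bitmaps answers the predicate without touching the base rows until the final lookup. With only five rows the cardinality ratio is 0.8, far above the 1% rule of thumb, so the toy table illustrates the mechanics and not a realistic case for choosing a bitmap over a B-Tree.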
Validation Methodology
Survey paper goals
Academic and industry citations
Referencing tools, vendors
Case studies
Assumptions
Read-only environments
Shortcomings
  (occasional) transactional commitments
  the data revision problem
2006 Rewrite
Changes in terminology, tools, vendors
  Fact constellations -> conformed dimensions
  Decision support -> BI
  Vendors and tools in BI, ETL, OLAP
Multiple user constituencies
Data history difficulties
  petabyte databases -> very large warehouses common
  data expiry challenges
  slowly changing dimensions
Slowly Changing Dimensions
Before
CustomerID  Name          Status
001         Mary Johnson  Gold

After: Type 1
CustomerID  Name          Status
001         Mary Johnson  Platinum

After: Type 2
CustomerID  Name          Status
001         Mary Johnson  Gold
001         Mary Johnson  Platinum

After: Type 3
CustomerID  Name          Original Status  Current Status  Effective Date
001         Mary Johnson  Gold             Platinum        10/1/2006
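The three update strategies shown on this slide can be sketched in a few lines of Python. This is only an illustration under simple assumptions: rows are plain dictionaries, the function names are hypothetical, and the column names follow the slide's tables.

```python
from copy import deepcopy

# The slide's "Before" table.
BEFORE = [{"CustomerID": "001", "Name": "Mary Johnson", "Status": "Gold"}]


def scd_type1(rows, customer_id, new_status):
    """Type 1: overwrite the attribute in place; history is lost."""
    rows = deepcopy(rows)
    for row in rows:
        if row["CustomerID"] == customer_id:
            row["Status"] = new_status
    return rows


def scd_type2(rows, customer_id, new_status):
    """Type 2: keep the old row and append a new row with the new value."""
    rows = deepcopy(rows)
    current = [r for r in rows if r["CustomerID"] == customer_id][-1]
    rows.append({**current, "Status": new_status})
    return rows


def scd_type3(rows, customer_id, new_status, effective_date):
    """Type 3: keep one row, with original and current values in separate columns."""
    result = []
    for row in rows:
        if row["CustomerID"] == customer_id:
            result.append({
                "CustomerID": row["CustomerID"],
                "Name": row["Name"],
                "Original Status": row["Status"],
                "Current Status": new_status,
                "Effective Date": effective_date,
            })
        else:
            result.append(dict(row))
    return result


type1 = scd_type1(BEFORE, "001", "Platinum")
type2 = scd_type2(BEFORE, "001", "Platinum")
type3 = scd_type3(BEFORE, "001", "Platinum", "10/1/2006")
```

The slide's Type 2 table shows no surrogate key or date columns, so the sketch adds none; as a result the two Type 2 rows share the same CustomerID and differ only in Status.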
Questions?