7/30/2019 data mning by jaiwei han chapter 2
1/90
December 19, 2012 Data Mining: Concepts and Techniques 1
Data Mining:Concepts and Techniques
Slides for Textbook Chapter 2
Jiawei Han and Micheline Kamber
Department of Computer ScienceUniversity of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
7/30/2019 data mning by jaiwei han chapter 2
2/90
December 19, 2012 Data Mining: Concepts and Techniques 2
Chapter 2: Data Warehousing andOLAP Technology for Data Mining
What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
Further development of data cube technology
From data warehousing to data mining
7/30/2019 data mning by jaiwei han chapter 2
3/90
December 19, 2012 Data Mining: Concepts and Techniques 3
What is Data Warehouse?
Defined in many different ways, but not rigorously.
A decision support database that is maintained separately from
the organization s operational database
Support information processing by providing a solid platform of
consolidated, historical data for analysis.
A data warehouse is a subject-oriented , integrated , time-variant ,
and nonvolatile collection of data in support of management s
decision- making process. W. H. InmonData warehousing:
The process of constructing and using data warehouses
7/30/2019 data mning by jaiwei han chapter 2
4/90
December 19, 2012 Data Mining: Concepts and Techniques 4
Data Warehouse Subject-Oriented
Organized around major subjects, such as customer,product, sales .
Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing.
Provide a simple and concise view around particular subjectissues by excluding data that are not useful in the decision
support process .
7/30/2019 data mning by jaiwei han chapter 2
5/90
December 19, 2012 Data Mining: Concepts and Techniques 5
Data Warehouse Integrated
Constructed by integrating multiple, heterogeneous datasources
relational databases, flat files, on-line transactionrecords
Data cleaning and data integration techniques areapplied.
Ensure consistency in naming conventions, encodingstructures, attribute measures, etc. among differentdata sources
E.g., Hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it isconverted.
7/30/2019 data mning by jaiwei han chapter 2
6/90
December 19, 2012 Data Mining: Concepts and Techniques 6
Data Warehouse Time Variant
The time horizon for the data warehouse is significantlylonger than that of operational systems.
Operational database: current value data.
Data warehouse data: provide information from ahistorical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
Contains an element of time, explicitly or implicitly
But the key of operational data may or may not contain time element.
7/30/2019 data mning by jaiwei han chapter 2
7/90December 19, 2012 Data Mining: Concepts and Techniques 7
Data Warehouse Non-Volatile
A physically separate store of data transformed from theoperational environment.
Operational update of data does not occur in the data
warehouse environment.Does not require transaction processing, recovery, andconcurrency control mechanisms
Requires only two operations in data accessing:initial loading of data and access of data .
7/30/2019 data mning by jaiwei han chapter 2
8/90December 19, 2012 Data Mining: Concepts and Techniques 8
Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration:Build wrappers/mediators on top of heterogeneous databases
Query driven approach
When a query is posed to a client site, a meta-dictionary isused to translate the query into queries appropriate forindividual heterogeneous sites involved, and the results areintegrated into a global answer set
Complex information filtering, compete for resources
Data warehouse: update-driven , high performanceInformation from heterogeneous sources is integrated in advanceand stored in warehouses for direct query and analysis
7/30/2019 data mning by jaiwei han chapter 2
9/90December 19, 2012 Data Mining: Concepts and Techniques 9
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking,manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)Major task of data warehouse system
Data analysis and decision makingDistinct features (OLTP vs. OLAP):
User and system orientation: customer vs. marketData contents: current, detailed vs. historical, consolidatedDatabase design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated Access patterns: update vs. read-only but complex queries
7/30/2019 data mning by jaiwei han chapter 2
10/90December 19, 2012 Data Mining: Concepts and Techniques 10
OLTP vs. OLAP
OLTP OLAPusers clerk, IT professional knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date
detailed, flat relationalisolated
historical,
summarized, multidimensionalintegrated, consolidated
usage repetitive ad-hoc
access read/writeindex/hash on prim. key
lots of scans
unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response
7/30/2019 data mning by jaiwei han chapter 2
11/90December 19, 2012 Data Mining: Concepts and Techniques 11
Why Separate Data Warehouse?
High performance for both systems
DBMS tuned for OLTP: access methods, indexing, concurrencycontrol, recovery
Warehouse tuned for OLAP: complex OLAP queries,multidimensional view, consolidation.
Different functions and different data:
missing data : Decision support requires historical data whichoperational DBs do not typically maintain
data consolidation : DS requires consolidation (aggregation,summarization) of data from heterogeneous sources
data quality : different sources typically use inconsistent datarepresentations, codes and formats which have to be reconciled
7/30/2019 data mning by jaiwei han chapter 2
12/90December 19, 2012 Data Mining: Concepts and Techniques 12
Chapter 2: Data Warehousing and OLAPTechnology for Data Mining
What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
Further development of data cube technology
From data warehousing to data mining
7/30/2019 data mning by jaiwei han chapter 2
13/90December 19, 2012 Data Mining: Concepts and Techniques 13
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data model whichviews data in the form of a data cube
A data cube, such as sales , allows data to be modeled and viewed inmultiple dimensions
Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year)
Fact table contains measures (such as dollars_sold ) and keys toeach of the related dimension tables
In data warehousing literature, an n-D base cube is called a basecuboid . The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid . The lattice of cuboidsforms a data cube.
7/30/2019 data mning by jaiwei han chapter 2
14/90December 19, 2012 Data Mining: Concepts and Techniques 14
Cube: A Lattice of Cuboids
all
time item location supplier
time,item time,location
time,supplier
item,location
item,supplier
location,supplier
time,item,location
time,item,supplier
time,location,supplier
item,location,supplier
time, item, location, supplier
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D(base) cuboid
7/30/2019 data mning by jaiwei han chapter 2
15/90December 19, 2012 Data Mining: Concepts and Techniques 15
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
Star schema : A fact table in the middle connected to aset of dimension tables
Snowflake schema : A refinement of star schemawhere some dimensional hierarchy is normalized into aset of smaller dimension tables , forming a shapesimilar to snowflake
Fact constellations : Multiple fact tables sharedimension tables , viewed as a collection of stars,
therefore called galaxy schema or fact constellation
7/30/2019 data mning by jaiwei han chapter 2
16/90
7/30/2019 data mning by jaiwei han chapter 2
17/90December 19, 2012 Data Mining: Concepts and Techniques 17
Example of Snowflake Schema
time_keydayday_of_the_week monthquarteryear
time
location_keystreetcity_key
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_solddollars_sold
avg_sales
Measures
item_keyitem_namebrandtypesupplier_key
item
branch_key
branch_namebranch_type
branch
supplier_keysupplier_type
supplier
city_keycitystate_or_provincecountry
city
7/30/2019 data mning by jaiwei han chapter 2
18/90December 19, 2012 Data Mining: Concepts and Techniques 18
Example of Fact Constellation
time_keydayday_of_the_week monthquarteryear
time
location_keystreetcityprovince_or_statecountry
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_keyitem_namebrandtypesupplier_type
item
branch_key
branch_namebranch_type
branch
Shipping Fact Table
time_key
item_key
shipper_keyfrom_location
to_location
dollars_cost
units_shipped
shipper_keyshipper_name
location_keyshipper_type
shipper
7/30/2019 data mning by jaiwei han chapter 2
19/90December 19, 2012 Data Mining: Concepts and Techniques 19
A Data Mining Query Language: DMQL
Cube Definition (Fact Table)define cube []:
Dimension Definition ( Dimension Table ) define dimension as ()
Special Case (Shared Dimension Tables)
First time as cube definition define dimension as in cube
7/30/2019 data mning by jaiwei han chapter 2
20/90December 19, 2012 Data Mining: Concepts and Techniques 20
Defining a Star Schema in DMQL
define cube sales_star [time, item, branch, location]:dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week,month, quarter, year)define dimension item as (item_key, item_name, brand,
type, supplier_type)
define dimension branch as (branch_key, branch_name,branch_type)define dimension location as (location_key, street, city,
province_or_state, country)
7/30/2019 data mning by jaiwei han chapter 2
21/90December 19, 2012 Data Mining: Concepts and Techniques 21
Defining a Snowflake Schema in DMQL
define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales =avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter,year)
define dimension item as (item_key, item_name, brand, type,supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key,province_or_state, country))
7/30/2019 data mning by jaiwei han chapter 2
22/90December 19, 2012 Data Mining: Concepts and Techniques 22
Defining a Fact Constellation in DMQL
define cube sales [time, item, branch, location]:dollars_sold = sum(sales_in_dollars), avg_sales =
avg(sales_in_dollars), units_sold = count(*)define dimension time as (time_key, day, day_of_week, month, quarter, year)define dimension item as (item_key, item_name, brand, type, supplier_type)define dimension branch as (branch_key, branch_name, branch_type)define dimension location as (location_key, street, city, province_or_state,
country)define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)define dimension time as time in cube sales
define dimension item as item in cube salesdefine dimension shipper as (shipper_key, shipper_name, location as location
in cube sales, shipper_type)define dimension from_location as location in cube salesdefine dimension to_location as location in cube sales
7/30/2019 data mning by jaiwei han chapter 2
23/90
7/30/2019 data mning by jaiwei han chapter 2
24/90
December 19, 2012 Data Mining: Concepts and Techniques 24
A Concept Hierarchy: Dimension (location)
all
Europe North_America
MexicoCanadaSpainGermany
Vancouver
M. WindL. Chan
...
......
... ...
...
all
region
office
country
TorontoFrankfurtcity
7/30/2019 data mning by jaiwei han chapter 2
25/90
December 19, 2012 Data Mining: Concepts and Techniques 25
View of Warehouses and Hierarchies
Specification of hierarchiesSchema hierarchyday < {month = minsup
MotivationOnly a small portion of cube cells may be
above the water in a sparse cube Only calculate interesting data data
above certain thresholdSuppose 100 dimensions, only 1 base cell.How many aggregate (non-base) cells if count >= 1? What about count >= 2?
7/30/2019 data mning by jaiwei han chapter 2
55/90
7/30/2019 data mning by jaiwei han chapter 2
56/90
7/30/2019 data mning by jaiwei han chapter 2
57/90
December 19, 2012 Data Mining: Concepts and Techniques 57
Drawbacks of BUC
Requires a significant amount of memoryOn par with most other CUBE algorithms though
Does not obtain good performance with dense CUBEs
Overly skewed data or a bad choice of dimensionordering reduces performanceCannot compute iceberg cubes with complex measures
CREATE CUBE Sales_Iceberg AS
SELECT month, city, cust_grp, AVG(price), COUNT(*)
FROM Sales_InforCUBEBY month, city, cust_grpHAVING AVG(price) >= 800 AND
COUNT(*) >= 50
7/30/2019 data mning by jaiwei han chapter 2
58/90
December 19, 2012 Data Mining: Concepts and Techniques 58
Non-Anti-Monotonic Measures
The cubing query with avg is non-anti-monotonic!
(Mar, *, *, 600, 1800) fails the HAVING clause
(Mar, *, Bus, 1300, 360) passes the clause
CREATE CUBE Sales_Iceberg ASSELECT month, city, cust_grp,
AVG(price), COUNT(*)
FROM Sales_InforCUBEBY month, city, cust_grpHAVING AVG(price) >= 800 AND
COUNT(*) >= 50
Month City Cust_grp Prod Cost Price
Jan Tor Edu Printer 500 485
Jan Tor Hld TV 800 1200
Jan Tor Edu Camera 1160 1280
Feb Mon Bus Laptop 1500 2500
Mar Van Edu HD 540 520
7/30/2019 data mning by jaiwei han chapter 2
59/90
December 19, 2012 Data Mining: Concepts and Techniques 59
Top-k Average
Let (*, Van, *) cover 1,000 records Avg(price) is the average price of those 1000 sales Avg 50(price) is the average price of the top-50 sales
(top-50 according to the sales priceTop-k average is anti-monotonic
The top 50 sales in Van. is with avg(price)
7/30/2019 data mning by jaiwei han chapter 2
60/90
December 19, 2012 Data Mining: Concepts and Techniques 60
Binning for Top-k Average
Computing top-k avg is costly with large k Binning idea
Avg 50(c) >= 800Large value collapsing: use a sum and a countto summarize records with measure >= 800
If count>=800, no need to check small records
Small value binning: a group of binsOne bin covers a range, e.g., 600~800, 400~600,etc.Register a sum and a count for each bin
7/30/2019 data mning by jaiwei han chapter 2
61/90
December 19, 2012 Data Mining: Concepts and Techniques 61
Approximate top-k average
Range Sum Count
Over 800 28000 20600~800 10600 15
400~600 15200 30
Top 50
Approximate avg 50()=
(28000+10600+600*15)/50=952
Suppose for (*, Van, *), we have
Month City Cust_grp Prod Cost Price
The cell may pass the HAVING clause
Quant-info for Top-k Average
7/30/2019 data mning by jaiwei han chapter 2
62/90
December 19, 2012 Data Mining: Concepts and Techniques 62
Quant-info for Top-k AverageBinning
Accumulate quant-info for cells to computeaverage iceberg cubes efficiently
Three pieces: sum, count, top-k bins
Use top-k bins to estimate/prune descendantsUse sum and count to consolidate current cell
Approximate avg 50 ()
Anti-monotonic, canbe computed
efficiently
real avg 50 ()
Anti-monotonic, butcomputationally
costly
avg()
Not anti-monotonic
strongestweakest
An Efficient Iceberg Cubing Method:
7/30/2019 data mning by jaiwei han chapter 2
63/90
December 19, 2012 Data Mining: Concepts and Techniques 63
An Efficient Iceberg Cubing Method:Top-k H-Cubing
One can revise Apriori or BUC to compute a top-k avg
iceberg cube. This leads to top-k-Apriori and top-k BUC.
Can we compute iceberg cube more efficiently?
Top-k H-cubing: an efficient method to compute iceberg
cubes with average measure
H-tree: a hyper-tree structure
H-cubing: computing iceberg cubes using H-tree
7/30/2019 data mning by jaiwei han chapter 2
64/90
December 19, 2012 Data Mining: Concepts and Techniques 64
H-tree: A Prefix Hyper-tree
Month City Cust_grp Prod Cost Price
Jan Tor Edu Printer 500 485
Jan Tor Hhd TV 800 1200
Jan Tor Edu Camera 1160 1280
Feb Mon Bus Laptop 1500 2500
Mar Van Edu HD 540 520
root
edu hhd bus
Jan Mar Jan Feb
Tor Van Tor Mon
Q.I.Q.I. Q.I.Quant-InfoSum: 1765Cnt: 2
bins
Attr. Val. Quant-Info Side-link Edu Sum:2285 Hhd Bus
Jan Feb
Tor Van Mon
Headertable
7/30/2019 data mning by jaiwei han chapter 2
65/90
December 19, 2012 Data Mining: Concepts and Techniques 65
Properties of H-tree
Construction cost: a single database scan
Completeness: It contains the complete
information needed for computing the icebergcube
Compactness: # of nodes n*m+1
n: # of tuples in the table
m: # of attributes
7/30/2019 data mning by jaiwei han chapter 2
66/90
Computing Cells Involving Month
7/30/2019 data mning by jaiwei han chapter 2
67/90
December 19, 2012 Data Mining: Concepts and Techniques 67
Computing Cells Involving MonthBut No City
root
Edu. Hhd. Bus.
Jan. Mar. Jan. Feb.
Tor. Van. Tor. Mont.
Q.I.Q.I. Q.I.
Attr. Val. Quant-Info Side-link Edu. Sum:2285 Hhd. Bus.
Jan. Feb. Mar.
Tor.
Van. Mont.
1. Roll up quant-info2. Compute cells involving
month but no city
Q.I.
Top-k OK mark: if Q.I. in a child passestop-k avg threshold, so does its parents.No binning is needed!
Computing Cells Involving Only
7/30/2019 data mning by jaiwei han chapter 2
68/90
December 19, 2012 Data Mining: Concepts and Techniques 68
p g g yCust_grp
root
edu hhd bus
Jan Mar Jan Feb
Tor Van Tor Mon
Q.I.Q.I. Q.I.
Attr. Val. Quant-Info Side-link Edu Sum:2285 Hhd Bus
Jan Feb Mar
Tor Van Mon
Check header table directly
Q.I.
7/30/2019 data mning by jaiwei han chapter 2
69/90
Scalability w r t Count Threshold
7/30/2019 data mning by jaiwei han chapter 2
70/90
December 19, 2012 Data Mining: Concepts and Techniques 70
Scalability w.r.t. Count Threshold(No min_avg Setting)
0
50
100
150
200
250
300
0.00% 0.05% 0.10%Count threshold
R u n
t i m e
( s e c o n
d ) top-k H-Cubing
top-k BUC
Computing Iceberg Cubes with Other
7/30/2019 data mning by jaiwei han chapter 2
71/90
December 19, 2012 Data Mining: Concepts and Techniques 71
Computing Iceberg Cubes with OtherComplex Measures
Computing other complex measures
Key point: find a function which is weaker but ensurescertain anti-monotonicity
Examples
Avg() v: avg k (c) v (bottom-k avg)
Avg() v only (no count): max(price) v
Sum(profit) (profit can be negative):p_sum(c) v if p_count(c) k; or otherwise, sum k (c) v
Others: conjunctions of multiple conditions
7/30/2019 data mning by jaiwei han chapter 2
72/90
7/30/2019 data mning by jaiwei han chapter 2
73/90
December 19, 2012 Data Mining: Concepts and Techniques 73
Condensed Cube
W. Wang, H. Lu, J. Feng, J. X. Yu, Condensed Cube: An Effective Approach to Reducing Data Cube Size. ICDE 02.
Icerberg cube cannot solve all the problems
Suppose 100 dimensions, only 1 base cell with count = 10.
How many aggregate (non-base) cells if count >= 10?Condensed cube
Only need to store one cell (a 1, a 2, , a 100 , 10), whichrepresents all the corresponding aggregate cells
Adv.Fully precomputed cube without compression
Efficient computation of the minimal condensed cube
Chapter 2: Data Warehousing and OLAP
7/30/2019 data mning by jaiwei han chapter 2
74/90
December 19, 2012 Data Mining: Concepts and Techniques 74
Chapter 2: Data Warehousing and OLAPTechnology for Data Mining
What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
Further development of data cube technology
From data warehousing to data mining
7/30/2019 data mning by jaiwei han chapter 2
75/90
December 19, 2012 Data Mining: Concepts and Techniques 75
Data Warehouse Usage
Three kinds of data warehouse applicationsInformation processing
supports querying, basic statistical analysis, and reportingusing crosstabs, tables, charts and graphs
Analytical processing
multidimensional analysis of data warehouse datasupports basic OLAP operations, slice-dice, drilling, pivoting
Data miningknowledge discovery from hidden patterns
supports associations, constructing analytical models,performing classification and prediction, and presenting themining results using visualization tools.
Differences among the three tasks
From On-Line Analytical Processing
7/30/2019 data mning by jaiwei han chapter 2
76/90
December 19, 2012 Data Mining: Concepts and Techniques 76
From On Line Analytical Processingto On Line Analytical Mining (OLAM)
Why online analytical mining?High quality of data in data warehouses
DW contains integrated, consistent, cleaned data Available information processing structure surrounding datawarehouses
ODBC, OLEDB, Web accessing, service facilities, reporting andOLAP tools
OLAP-based exploratory data analysismining with drilling, dicing, pivoting, etc.
On-line selection of data mining functionsintegration and swapping of multiple mining functions,algorithms, and tasks.
Architecture of OLAM
An OLAM Architecture
7/30/2019 data mning by jaiwei han chapter 2
77/90
December 19, 2012 Data Mining: Concepts and Techniques 77
An OLAM Architecture
DataWarehouse
Meta Data
MDDB
OLAMEngine
OLAPEngine
User GUI API
Data Cube API
Database API
Data cleaning
Data integration
Layer3
OLAP/OLAM
Layer2
MDDB
Layer1
Data
Repository
Layer4
User Interface
Filtering&Integration Filtering
Databases
Mining query Mining result
Discovery-Driven Exploration of Data
7/30/2019 data mning by jaiwei han chapter 2
78/90
December 19, 2012 Data Mining: Concepts and Techniques 78
y pCubes
Hypothesis-drivenexploration by user, huge search space
Discovery- driven (Sarawagi, et al. 98)
Effective navigation of large OLAP data cubespre-compute measures indicating exceptions, guideuser in the data analysis, at all levels of aggregation
Exception: significantly different from the valueanticipated, based on a statistical model
Visual cues such as background color are used toreflect the degree of exception of each cell
Ki d f E ti d th i C t ti
7/30/2019 data mning by jaiwei han chapter 2
79/90
December 19, 2012 Data Mining: Concepts and Techniques 79
Kinds of Exceptions and their Computation
ParametersSelfExp: surprise of cell relative to other cells at samelevel of aggregationInExp: surprise beneath the cellPathExp: surprise beneath cell for each drill-downpath
Computation of exception indicator (modeling fitting andcomputing SelfExp, InExp, and PathExp values) can beoverlapped with cube constructionException themselves can be stored, indexed andretrieved like precomputed aggregates
E l Di D i D t C b
7/30/2019 data mning by jaiwei han chapter 2
80/90
December 19, 2012 Data Mining: Concepts and Techniques 80
Examples: Discovery-Driven Data Cubes
Complex Aggregation at Multiple
7/30/2019 data mning by jaiwei han chapter 2
81/90
December 19, 2012 Data Mining: Concepts and Techniques 81
Complex Aggregation at MultipleGranularities: Multi-Feature Cubes
Multi-feature cubes (Ross, et al. 1998): Compute complex queriesinvolving multiple dependent aggregates at multiple granularitiesEx. Grouping by all subsets of {item, region, month}, find themaximum price in 1997 for each group, and the total sales among allmaximum price tuples
select item, region, month, max(price), sum(R.sales)from purchases
where year = 1997cube by item, region, month: R
such that R.price = max(price)Continuing the last example, among the max price tuples, find themin and max shelf live, and find the fraction of the total sales due totuple that have min shelf life within the set of all max price tuples
7/30/2019 data mning by jaiwei han chapter 2
82/90
From Cubegrade to Multi-dimensionalC i d G di i C b
7/30/2019 data mning by jaiwei han chapter 2
83/90
December 19, 2012 Data Mining: Concepts and Techniques 83
Constrained Gradients in Data Cubes
Significantly more expressive than association rulesCapture trends in user-specified measures
Serious challenges
Many trivial cells in a cube significance constraint to prune trivial cells
Numerate pairs of cells probe constraint to selecta subset of cells to examine
Only interesting changes wanted gradientconstraint to capture significant changes
MD C t i d G di t Mi i
7/30/2019 data mning by jaiwei han chapter 2
84/90
December 19, 2012 Data Mining: Concepts and Techniques 84
MD Constrained Gradient Mining
Significance constraint C sig: (cnt 100)Probe constraint C prb: (city=Van, cust_grp=busi,prod_grp=*) Gradient constraint C grad (cg, c p):
(avg_price(c g)/avg_price(c p) 1.3)
Dimensions Measurescid Yr City Cst_grp Prd_grp Cnt Avg_price
c1 00 Van Busi PC 300 2100
c2 * Van Busi PC 2800 1800
c3 * Tor Busi PC 7900 2350
c4 * * busi PC 58600 2250
Base cell
Aggregated cell
Siblings
Ancestor
Probe cell: satisfied C prb (c4, c2) satisfies C grad !
A Li S t D i Al ith
7/30/2019 data mning by jaiwei han chapter 2
85/90
December 19, 2012 Data Mining: Concepts and Techniques 85
A LiveSet-Driven Algorithm
Compute probe cells using C sig and C prb The set of probe cells P is often very small
Use probe P and constraints to find gradientsPushing selection deeplySet-oriented processing for probe cellsIceberg growing from low to high dimensionalities
Dynamic pruning probe cells during growthIncorporating efficient iceberg cubing method
7/30/2019 data mning by jaiwei han chapter 2
86/90
R f (I)
7/30/2019 data mning by jaiwei han chapter 2
87/90
December 19, 2012 Data Mining: Concepts and Techniques 87
References (I)S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan,and S. Sarawagi. On the computation of multidimensional aggregates. VLDB 96
D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in datawarehouses. SIGMOD 97.
R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE 97
K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs..SIGMOD 99.
S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65-74, 1997.
OLAP council. MDAPI specification version 2.0. Inhttp://www.olapcouncil.org/research/apily.htm, 1998.
G. Dong, J. Han, J. Lam, J. Pei, K. Wang. Mining Multi-dimensional ConstrainedGradients in Data Cubes. VLDB 2001
J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow,and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by,cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
R f (II)
7/30/2019 data mning by jaiwei han chapter 2
88/90
December 19, 2012 Data Mining: Concepts and Techniques 88
References (II)
J. Han , J. Pei, G. Dong, K. Wang. Efficient Computation of Iceberg Cubes With ComplexMeasures. SIGMOD 01
V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently.SIGMOD 96 Microsoft. OLEDB for OLAP programmer's reference version 1.0. Inhttp://www.microsoft.com/data/oledb/olap, 1998.
K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB 97. K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiplegranularities. EDBT'98.S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP datacubes. EDBT'98.E. Thomsen. OLAP Solutions: Building Multidimensional Information Systems. JohnWiley & Sons, 1997.W. Wang, H. Lu, J. Feng, J. X. Yu, Condensed Cube: An Effective Approach toReducing Data Cube Size. ICDE 02.
Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm forsimultaneous multidimensional aggregates. SIGMOD 97 .
www cs uiuc edu/~hanj
http://www.cs.uiuc.edu/~hanjhttp://www.cs.uiuc.edu/~hanj7/30/2019 data mning by jaiwei han chapter 2
89/90
December 19, 2012 Data Mining: Concepts and Techniques 89
www.cs.uiuc.edu/~hanj
Thank you !!!
Work to be done
http://www.cs.uiuc.edu/~hanjhttp://www.cs.uiuc.edu/~hanj7/30/2019 data mning by jaiwei han chapter 2
90/90
Work to be done
Add MS OLAP snapshots! A tutorial on MS/OLAPReorganize cube computation materialsInto cube computation and cube exploration
Top Related