Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling...
![Page 1: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/1.jpg)
Data Analytics Beyond OLAP
Prof. Yanlei Diao
![Page 2: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/2.jpg)
OPERATIONAL DBs
EXTRACT TRANSFORM
LOAD (ETL)
DATA WAREHOUSE
METADATA STORE
DB 1 DB 2 DB 3
SUPPORTS
OLAP DATA
MINING
INTERACTIVE DATA EXPLORATION
![Page 3: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/3.jpg)
Overview of Topics
• Data Mining in Databases
  – Association rule mining
  – Interesting visualizations
• Approximate Query Processing
  – Online aggregation: group by aggregation, wander join
  – Interactive SQL
• Interactive Data Exploration
  – Faceted search
  – Semantic windows
  – Explore by example
![Page 4: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/4.jpg)
1. Association Rule Mining
Fast Algorithms for Mining Association Rules
Rakesh Agrawal and Ramakrishnan Srikant VLDB '94
![Page 5: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/5.jpg)
Motivation
• Example rules:
  – 98% of customers who purchase tires get automotive services done
  – Customers who buy mustard and ketchup also buy burgers
• Goal: find these rules from transactional data
• Rules help with decision making
  – E.g., store layout, buying patterns, add-on sales
![Page 6: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/6.jpg)
DB of "Basket Data"

| TID | Items |
|-----|-------|
| 100 | 1 3 4 |
| 200 | 2 3 5 |
| 300 | 1 2 3 5 |
| 400 | 2 5 |

Association mining produces association rules such as {1} => {3}, {2,3} => {5}, {2,5} => {3}, ...
![Page 7: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/7.jpg)
Association Rules

Association rule: X => Y
• X and Y are disjoint itemsets, called the antecedent (LHS) and the consequent (RHS)
  – Confidence: c% of the transactions that contain X also contain Y (rule-specific)
  – Support: s% of all transactions contain both X and Y (relative to all data)
• Goal: find all rules that satisfy the confidence and support thresholds.
![Page 8: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/8.jpg)
Support Example
TID Cereal Beer Bread Bananas Milk1 X X X2 X X X X3 X X4 X X5 X X6 X X7 X X8 X
Support(Cereal) = 4/8 = 0.5
Support(Cereal => Milk) = 3/8 = 0.375
![Page 9: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/9.jpg)
Confidence Example
TID Cereal Beer Bread Bananas Milk1 X X X2 X X X X3 X X4 X X5 X X6 X X7 X X8 X
Confidence(Cereal => Milk) = 3/4 = 0.75
Confidence(Bananas => Bread) = 1/3 ≈ 0.333
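The two definitions can be checked mechanically. The following is a minimal sketch that uses the four-transaction basket data from the earlier slide (TIDs 100 to 400); the function names `support` and `confidence` are chosen here for illustration and do not come from the lecture.

```python
# Basket data from the slides: TID -> set of items.
BASKETS = {
    100: {1, 3, 4},
    200: {2, 3, 5},
    300: {1, 2, 3, 5},
    400: {2, 5},
}

def support(itemset, baskets=BASKETS):
    """Fraction of all transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in baskets.values() if itemset <= items)
    return hits / len(baskets)

def confidence(lhs, rhs, baskets=BASKETS):
    """Fraction of the transactions containing `lhs` that also contain `rhs`."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)

print(support({1, 3}))          # support of the rule {1} => {3}
print(confidence({1}, {3}))     # confidence of {1} => {3}
print(confidence({2, 5}, {3}))  # confidence of {2,5} => {3}
```

Note that the support of a rule X => Y is the support of the itemset X ∪ Y, which is why `confidence` reuses `support` for both the numerator and the denominator.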
![Page 10: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/10.jpg)
Apriori Algorithm and Notation
• Let I = {i1, i2, …, im} be the set of literals, known as items
• { Tj } is the set of transactions (the database), where each transaction Tj is a set of items s.t. Tj ⊆ I
  – Each transaction has a unique identifier TID
  – The size of an itemset is the number of items in it
  – An itemset of size k is a k-itemset
• Assume that the items in an itemset are sorted in lexicographical order
![Page 11: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/11.jpg)
General Strategy

• Step I: Find all itemsets with minimum support (min_sup s)
• Step II: Generate rules from the itemsets that meet min_sup

| TID | Items |
|-----|-------|
| 100 | 1 3 4 |
| 200 | 2 3 5 |
| 300 | 1 2 3 5 |
| 400 | 2 5 |

| Support | Itemsets |
|---------|----------|
| 0.25 | {4}, {1,2}, {1,4}, {1,5}, {3,4}, {1,3,4}, {1,2,3}, {1,2,5}, {1,3,5}, {1,2,3,5} |
| 0.5 | {1}, {1,3}, {2,3}, {3,5}, {2,3,5} |
| 0.75 | {2}, {3}, {5}, {2,5} |

| Support | Confidence | Rules |
|---------|------------|-------|
| 0.5 | 66% | {3}=>{1}, {3}=>{2}, {2}=>{3}, {3}=>{5}, {5}=>{3}, {5}=>{2,3}, {3}=>{2,5}, {2}=>{3,5}, {5,2}=>{3} |
| 0.5 | 100% | {1}=>{3}, {5,3}=>{2}, {2,3}=>{5} |
| 0.75 | 100% | {5}=>{2}, {2}=>{5} |
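The two steps can be sketched by brute force on the four-transaction basket data. This is only an illustration of the strategy, not the Apriori algorithm: Step I here tests every subset of the items, which is exactly the cost that the pruning discussed next avoids. The function names are hypothetical.

```python
from itertools import combinations

BASKETS = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def support(itemset):
    return sum(1 for t in BASKETS if itemset <= t) / len(BASKETS)

def frequent_itemsets(min_sup):
    """Step I by brute force: test every non-empty subset of the items."""
    items = sorted(set().union(*BASKETS))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = support(frozenset(combo))
            if s >= min_sup:
                result[frozenset(combo)] = s
    return result

def rules(freq, min_conf):
    """Step II: split each frequent itemset into antecedent => consequent."""
    out = []
    for itemset, sup in freq.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is frequent, so freq[lhs] exists.
                conf = sup / freq[lhs]
                if conf >= min_conf:
                    out.append((set(lhs), set(itemset - lhs), sup, conf))
    return out

freq = frequent_itemsets(min_sup=0.5)
for lhs, rhs, sup, conf in rules(freq, min_conf=1.0):
    print(lhs, "=>", rhs, f"support={sup:.2f}", f"confidence={conf:.0%}")
```

With min_sup = 0.5 and a confidence threshold of 100%, the sketch yields the five rules listed in the 100% rows of the table.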
![Page 12: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/12.jpg)
Step I: Finding Minsup Itemsets
• What is the complexity of finding all subsets of items that satisfy min_sup s?
  – The search space is the power set of the n literals, i.e., 2^n candidate itemsets!
• A new algorithmic framework based on anti-monotonicity:
  – For a frequent itemset, all of its subsets are also frequent
  – For an infrequent itemset, all of its supersets must be infrequent
  – Can be used to design efficient pruning of the search space.
![Page 13: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/13.jpg)
Anti-monotonicity
• Anti-monotonicity
  – Adding items to an itemset never increases its support
  – For a frequent itemset, all of its subsets are also frequent
  – For an infrequent itemset, all of its supersets must be infrequent

[Figure: frequent itemsets Lk-1, frequent itemsets Lk, and candidate super-itemsets Ck]
![Page 16: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/16.jpg)
Step I: Finding Minsup Itemsets
• Anti-monotonicity: adding items to an itemset never increases its support
• Apriori Algorithm: proceed inductively on itemset size
  1) Base case: begin with all minsup itemsets of size 1 (L1)
  2) Without peeking at the DB, generate candidate itemsets of size k (Ck) from Lk-1
  3) Remove candidate itemsets that contain unsupported subsets
  4) Further refine Ck using the database to produce Lk
  Repeat steps 2) to 4) for increasing k.
![Page 17: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/17.jpg)
Task 2) Guess Itemsets
• Naïve way: – Extend all itemsets with all possible items
• Apriori: 2) Join Lk-1 with itself, adding only a single, final item
• e.g.: {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2, 3, 4} produces {1 2 3 4} and {1 3 4 5}
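The join in step 2) can be sketched as follows, assuming each itemset is kept as a sorted tuple: two (k-1)-itemsets are merged only when they agree on everything except their last item. The function name `join_step` is an assumption made for this example.

```python
def join_step(prev_frequent):
    """Join L(k-1) with itself: merge two itemsets that agree on all but their last item."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    candidates = []
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] == b[:-1]:  # same first k-2 items, different last item
                candidates.append(a + (b[-1],))
    return candidates

L3 = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}]
print(join_step(L3))  # [(1, 2, 3, 4), (1, 3, 4, 5)]
```

Because the itemsets are sorted, each candidate is generated exactly once, and the database is not touched at this stage.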
![Page 18: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/18.jpg)
Task 3) Filter Itemsets
• Apriori: 2) Join Lk-1 with itself, adding only a single, final item
• e.g.: {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2, 3, 4} produces {1 2 3 4} and {1 3 4 5}
3) Remove itemsets with an unsupported subset
• e.g.: {1 3 4 5} has an unsupported subset: {1 4 5} if minsup = 50%
![Page 19: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/19.jpg)
Task 4) Finalize k-Itemsets
• Apriori:
  2) Join Lk-1 with itself, adding only a single, final item
     • E.g., {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} produces {1 2 3 4} and {1 3 4 5}
  3) Remove itemsets with an unsupported subset
     • E.g., {1 3 4 5} has an unsupported subset: {1 4 5} if minsup = 50%
  4) Use the database to further refine Ck
     • Count precisely the occurrences of each itemset in the dataset, to see if its support is indeed at least min_sup
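Putting steps 1) to 4) together gives a level-wise loop. The sketch below is one possible reading of the steps on these slides, written for clarity rather than speed; it is not the pseudocode from the Agrawal and Srikant paper, and the name `apriori` and the tuple representation of itemsets are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset (sorted tuple): support} for all itemsets with support >= min_sup."""
    n = len(transactions)

    def count_pass(candidates):
        # Step 4: one scan of the database counts every surviving candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_sup}

    # Step 1: frequent 1-itemsets (L1).
    items = sorted(set().union(*transactions))
    level = count_pass([(i,) for i in items])
    frequent = dict(level)

    while level:
        prev = sorted(level)
        # Step 2: join L(k-1) with itself on the first k-2 items.
        joined = [a + (b[-1],) for i, a in enumerate(prev)
                  for b in prev[i + 1:] if a[:-1] == b[:-1]]
        # Step 3: drop candidates that have an infrequent (k-1)-subset.
        pruned = [c for c in joined
                  if all(s in level for s in combinations(c, len(c) - 1))]
        level = count_pass(pruned)
        frequent.update(level)
    return frequent

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset, sup in sorted(apriori(db, 0.5).items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, sup)
```

On the basket data with min_sup = 0.5, the loop stops after the 3-itemset level because no 4-itemset candidate can be formed from a single frequent 3-itemset.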
![Page 20: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/20.jpg)
Full Algorithm
![Page 21: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/21.jpg)
![Page 22: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/22.jpg)
2. Interesting Visualizations
SeeDB: efficient data-driven visualization recommendations to support visual analytics
Manasi Vartak, Sajjadur Rahman, Samuel Madden, Aditya Parameswaran, Neoklis Polyzotis VLDB ’14
![Page 23: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/23.jpg)
Visualization Recommendation
Given a dataset and a task, automatically produce a set of visualizations that are the most “interesting” given the task
![Page 24: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/24.jpg)
Space of Visualizations

For simplicity, assume a single table (star schema)

Visualizations = agg. + grp. by queries

Vi = SELECT d, f(m) FROM table WHERE ___ GROUP BY d

(d, m, f): dimension, measure, aggregate

[Figure: two example visualizations: a line chart over the years 2000 to 2012 (y-axis 1.5 to 4.5) and a bar chart over MA, CA, IL, NY with values 50, 10, 10, 30]
![Page 25: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/25.jpg)
Space of Visualizations
Vi = SELECT d, f(m) FROM table WHERE ___ GROUP BY d
(d, m, f): dimension, measure, aggregate
{d} : race, work-type, sex, etc.
{m} : capital-gain, capital-loss, hours-per-week
{f} : COUNT, SUM, AVG
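The size of this space is the product of the three sets. A minimal sketch of the enumeration follows; the attribute names come from the slide (with hyphens replaced by underscores so they can serve as SQL identifiers), while the table name `census` and the predicate are hypothetical placeholders.

```python
from itertools import product

# Attribute names from the slide; table name and predicate below are hypothetical.
DIMENSIONS = ["race", "work_type", "sex"]
MEASURES = ["capital_gain", "capital_loss", "hours_per_week"]
AGGREGATES = ["COUNT", "SUM", "AVG"]

def view_query(d, m, f, table, predicate):
    """Build the aggregate + group-by query for one (d, m, f) triple."""
    return f"SELECT {d}, {f}({m}) FROM {table} WHERE {predicate} GROUP BY {d}"

def enumerate_views(table, predicate):
    """One query per combination of dimension, measure, and aggregate."""
    return [view_query(d, m, f, table, predicate)
            for d, m, f in product(DIMENSIONS, MEASURES, AGGREGATES)]

views = enumerate_views("census", "marital_status = 'unmarried'")
print(len(views))  # 27 candidate visualizations for 3 x 3 x 3 choices
print(views[0])
```

Even this toy schema gives 27 candidate views, which suggests why the number of views grows quickly on wide tables.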
![Page 26: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/26.jpg)
Key Questions

I. Interestingness: How do we determine if a visualization is interesting?
   – Utility Metric
II. Scale: How to make recommendations efficiently and interactively?
   – Optimizations
![Page 27: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/27.jpg)
Deviation-based Utility Metric

A visualization is interesting if it displays a large deviation from some reference

Task: compare unmarried adults with all adults

V1 (target) = SELECT d, f(m) FROM table WHERE target GROUP BY d
V2 (reference) = SELECT d, f(m) FROM table WHERE reference GROUP BY d

[Figure: two bar charts over MA, CA, IL, NY: target V1 with values 50, 10, 10, 30 and reference V2 with values 30, 20, 10, 40]

Compare induced probability distributions!
![Page 28: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/28.jpg)
Deviation-based Utility Metric

A visualization is interesting if it displays a large deviation from some reference

Many metrics for computing distance between distributions: D[P(V1), P(V2)]
• Earth mover’s distance
• L1, L2 distance
• K-L divergence

Any distance metric between distributions is OK!
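As a rough illustration of the deviation-based utility, the sketch below normalizes two query results into probability distributions and compares them with three of the listed distances. The bar values are read off the example charts and assigned to states in the order shown, which is an assumption; the function names are also chosen here for illustration.

```python
import math

def normalize(view):
    """Turn group -> aggregate value into group -> probability."""
    total = sum(view.values())
    return {g: v / total for g, v in view.items()}

def l1(p, q):
    return sum(abs(p[g] - q[g]) for g in p)

def l2(p, q):
    return math.sqrt(sum((p[g] - q[g]) ** 2 for g in p))

def kl(p, q):
    # Assumes q[g] > 0 wherever p[g] > 0; groups with p[g] == 0 contribute nothing.
    return sum(p[g] * math.log(p[g] / q[g]) for g in p if p[g] > 0)

# Assumed mapping of the chart values to states (target V1, reference V2).
target = {"MA": 50, "CA": 10, "IL": 10, "NY": 30}
reference = {"MA": 30, "CA": 20, "IL": 10, "NY": 40}

p, q = normalize(target), normalize(reference)
print("L1:", l1(p, q))
print("L2:", l2(p, q))
print("KL:", kl(p, q))
```

The sketch assumes both views contain the same groups; a view whose groups differ between target and reference would need missing groups filled with zero before normalizing.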
![Page 29: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/29.jpg)
Overview of Topics
• Data Mining over Databases
  – Association rule mining
  – Interesting visualizations
• Approximate Query Processing
  – Online aggregation: group by aggregation, wander join
  – Interactive SQL
• Interactive Data Exploration
  – Faceted search
  – Semantic windows
  – Explore by example
![Page 30: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/30.jpg)
3. Interactive Data Exploration
Explore-by-example: an automatic query steering framework for interactive data exploration
Kyriaki Dimitriadou, Olga Papaemmanouil, Yanlei Diao SIGMOD ’14
![Page 31: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/31.jpg)
Web Applications: e.g., House Hunting
2+ bedrooms, ≤ 0.5 million
2+ bedrooms, price depending on the size
2+ bedrooms, price depending on the size, location
2+ bedrooms, price depending on the size, location, quietness (away from street, higher floor)
2+ bedrooms, price depending on the size, location, quietness, school district, building structure…
![Page 32: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/32.jpg)
An Explore-by-Example Framework

[Figure: framework diagram with the components Initial d attributes, Sample Acquisition, Sample, Relevance Feedback (relevant examples, irrelevant examples), Classification Model (User Model), Space Exploration, Exploration Queries, Query Formulation, and Final Data Extraction Query]
![Page 33: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/33.jpg)
(1) Initial Sampling
Analyze sampling methods in this new problem setting:
• Random sampling
• Equi-width stratified sampling
• Equi-depth stratified sampling
• Progressive sampling [Dimitriadou 2014]

How to produce initial examples that fall in the user interest area?
![Page 34: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/34.jpg)
Equi-Width Stratified Sampling
Given a database of N data points in D-dim space and a true user interest in a d*-dim subspace:
• Start with a user-chosen d-dim subspace, d* ≤ d << D
• Consider generating k samples
• Want to create c buckets per dimension, c^d ≥ k
• Generate 1 sample per cell, [ l1 , h1 ] x [ l2 , h2 ] x … x [ ld , hd ]

Equi-width stratified sampling
• c buckets are of equal width

[Figure 2: equi-width stratified sampling]

… bucket according to feature F2, and divide each bucket into c sub-buckets, so the number of points within each sub-bucket is almost the same. We do this for d dimensions, and we know that after this process, each grid will have nearly the same number of data points.

Algorithm 3 Equi-depth Sampling
  procedure SELECT_EQUIDEPTH(k)
    c ← k^(1/d)
    Bucket_Set ← {S}
    for i = 1 to d do
      New_Bucket_Set ← {}
      for each bucket b in Bucket_Set do
        sort data points in b according to feature Fi
        divide b into c sub-buckets b1, b2, ..., bc according to Fi, so that |b1| = |b2| = ... = |bc|
        New_Bucket_Set.append(b1, b2, ..., bc)
      Bucket_Set ← New_Bucket_Set
    j ← 1
    for each bucket b in Bucket_Set do
      samples[j] ← one random sample from b
      j ← j + 1
    return samples[1...k]

We illustrate the algorithm of equi-depth stratified sampling through an example in Figure 3. In the example, we divide the data space into 16 grids, and each grid has nearly the same number of data points. Then we draw one random sample from each grid.

![Page 35: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/35.jpg)
Equi-Depth Stratified Sampling

Given a database of N data points in D-dim space and a true user interest in a d*-dim subspace:
• Start with a user-chosen d-dim subspace, d* ≤ d << D
• Consider generating k samples
• Want to create c buckets per dimension, c^d ≥ k
• Generate 1 sample per cell, [ l1 , h1 ] x [ l2 , h2 ] x … x [ ld , hd ]

Equi-depth stratified sampling
• c buckets have (roughly) equal numbers of data points

[Figure 3: equi-depth stratified sampling]

2.4 progressive sampling

Progressive sampling (Algorithm 4) performs equi-width or equi-depth stratified sampling level by level. In the first level, we divide each dimension into 2 buckets, so we have 2^d grids, and we select a random sample from each grid. When we go to the second level, we divide each dimension into 2^2 buckets, so we have 2^(2d) grids. Then we select one random sample from each of these smaller grids. We continue this sampling process until we get one sample from the user interest area.

Algorithm 4 Progressive Sampling
  procedure PROGRESSIVE_SAMPLING
    sample_set = {}
    level ← 1
    while no point in sample_set is within user interest area do
      k ← 2^(level·d)
      samples ← SELECT_EQUIWIDTH(k) or SELECT_EQUIDEPTH(k)
      sample_set.append(samples)
      level ← level + 1
    return sample_set
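The two stratified schemes can be contrasted in a short sketch. It assumes points are tuples of d numeric features held in memory and takes c as the smallest integer with c^d ≥ k; the function names mirror SELECT_EQUIWIDTH and SELECT_EQUIDEPTH, but the implementation details (for example skipping empty cells and returning one sample per non-empty cell instead of truncating to k) are choices made for this example.

```python
import random

def buckets_per_dim(k, d):
    """Smallest integer c with c**d >= k."""
    c = 1
    while c ** d < k:
        c += 1
    return c

def select_equiwidth(points, k, rng=random):
    """One random sample from each non-empty cell of an equal-width grid."""
    d = len(points[0])
    c = buckets_per_dim(k, d)
    lo = [min(p[i] for p in points) for i in range(d)]
    hi = [max(p[i] for p in points) for i in range(d)]
    cells = {}
    for p in points:
        # Bucket index per dimension; the maximum value falls into the last bucket.
        key = tuple(
            min(c - 1, int((p[i] - lo[i]) / (hi[i] - lo[i]) * c)) if hi[i] > lo[i] else 0
            for i in range(d)
        )
        cells.setdefault(key, []).append(p)
    return [rng.choice(cell) for cell in cells.values()]

def select_equidepth(points, k, rng=random):
    """Split on each feature in turn into c (nearly) equally populated sub-buckets."""
    d = len(points[0])
    c = buckets_per_dim(k, d)
    buckets = [list(points)]
    for i in range(d):
        new_buckets = []
        for b in buckets:
            b = sorted(b, key=lambda p: p[i])
            for j in range(c):
                # Slice j of c; slice sizes differ by at most one point.
                part = b[j * len(b) // c:(j + 1) * len(b) // c]
                if part:
                    new_buckets.append(part)
        buckets = new_buckets
    return [rng.choice(b) for b in buckets]

rng = random.Random(0)
data = [(rng.random(), rng.random() ** 3) for _ in range(1000)]  # second feature is skewed
print(len(select_equiwidth(data, 16, rng)), len(select_equidepth(data, 16, rng)))
```

On skewed data the equal-width grid can leave cells empty, so it may return fewer than k samples, whereas the equi-depth buckets adapt to the data density.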
![Page 36: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/36.jpg)
Progressive Sampling: dynamic # examples

[Figure: samples plotted over Green Wavelength and Red wavelength, each marked √ or x]
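Progressive sampling can be sketched as a loop that doubles the number of buckets per dimension at each level. In the sketch below, `grid_sample` is a small equi-width sampler written for this example, `is_relevant` is a hypothetical stand-in for the user's feedback, and `max_level` is a safeguard added here that is not part of the algorithm on the slides.

```python
import random

def grid_sample(points, c, rng):
    """One random point from each non-empty cell of a c-per-dimension equi-width grid."""
    d = len(points[0])
    lo = [min(p[i] for p in points) for i in range(d)]
    hi = [max(p[i] for p in points) for i in range(d)]
    cells = {}
    for p in points:
        key = tuple(min(c - 1, int((p[i] - lo[i]) / ((hi[i] - lo[i]) or 1.0) * c))
                    for i in range(d))
        cells.setdefault(key, []).append(p)
    return [rng.choice(cell) for cell in cells.values()]

def progressive_sampling(points, is_relevant, rng, max_level=10):
    """Refine the grid level by level until some sample is labeled relevant."""
    sample_set = []
    level = 1
    while level <= max_level and not any(is_relevant(p) for p in sample_set):
        c = 2 ** level  # 2^level buckets per dimension, so up to 2^(level*d) cells
        sample_set.extend(grid_sample(points, c, rng))
        level += 1
    return sample_set, level - 1

rng = random.Random(1)
data = [(rng.uniform(10, 20), rng.uniform(10, 20)) for _ in range(5000)]

def relevant(p):
    # Hypothetical user interest: a small box in (red, green) space.
    return 13.55 <= p[0] <= 14.82 and p[1] <= 13.74

samples, levels = progressive_sampling(data, relevant, rng)
print(len(samples), "samples shown to the user over", levels, "level(s)")
```

The number of examples shown to the user is therefore dynamic: a large interest area is usually hit at a coarse level, while a small one requires finer grids and more samples.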
![Page 37: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/37.jpg)
Classification based on Decision Trees
| Sample | Red | Green | UserLabel |
|---------|-------|-------|-----------|
| ObjectA | 13.67 | 12.34 | Yes |
| ObjectB | 15.32 | 14.50 | No |
| ... | ... | ... | ... |
| ObjectX | 14.21 | 13.57 | Yes |

Decision Tree Classifier

red
├─ red > 14.82: Irrelevant
└─ red <= 14.82: red
   ├─ red < 13.55: Irrelevant
   └─ red >= 13.55: green
      ├─ green <= 13.74: Relevant
      └─ green > 13.74: Irrelevant

SELECT * FROM galaxy WHERE red <= 14.82 AND red >= 13.55 AND green <= 13.74
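The step from classifier to query can be sketched by walking the tree and collecting the predicates on every path that ends in a Relevant leaf. The nested-tuple encoding of the tree below is an assumption made for this example (a trained classifier would expose its own structure); the thresholds are the ones shown in the tree on the slide.

```python
# Internal node: (left_condition, left_subtree, right_condition, right_subtree).
# Leaf: the string "Relevant" or "Irrelevant".
TREE = (
    "red <= 14.82",
    ("red < 13.55", "Irrelevant",
     "red >= 13.55", ("green <= 13.74", "Relevant",
                      "green > 13.74", "Irrelevant")),
    "red > 14.82", "Irrelevant",
)

def relevant_paths(node, path=()):
    """Yield the predicates along every root-to-leaf path that ends in 'Relevant'."""
    if isinstance(node, str):
        if node == "Relevant":
            yield list(path)
        return
    left_cond, left, right_cond, right = node
    yield from relevant_paths(left, path + (left_cond,))
    yield from relevant_paths(right, path + (right_cond,))

def to_query(tree, table):
    """Each relevant path is a conjunction; the query is the disjunction of all paths."""
    disjuncts = ["(" + " AND ".join(p) + ")" for p in relevant_paths(tree)]
    if not disjuncts:
        disjuncts = ["1 = 0"]  # no relevant leaf: the query selects nothing
    return f"SELECT * FROM {table} WHERE " + " OR ".join(disjuncts)

print(to_query(TREE, "galaxy"))
# SELECT * FROM galaxy WHERE (red <= 14.82 AND red >= 13.55 AND green <= 13.74)
```

A tree with several Relevant leaves yields one parenthesized conjunction per leaf, joined by OR, which corresponds to a union of rectangular areas in the data space.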
![Page 38: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/38.jpg)
(2) Recover from Misclassified Samples

[Figure: samples marked √ or x over Green Wavelength and Red wavelength, with the labels Predicted Area, False positive, and False negative]
![Page 39: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/39.jpg)
Handling Misclassified Samples

[Figure: samples marked √ or x over Green Wavelength and Red wavelength, with Sampling Areas drawn around the misclassified samples and additional samples collected inside them]
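The slide conveys the idea only as a picture, so the following is a loose sketch under stated assumptions: each sampling area is taken to be an axis-aligned box of half-width `radius` around a misclassified sample, and a few extra database points are drawn from inside it for the user to label. The box shape, `radius`, and the number of extra samples `n` are all assumptions of this example.

```python
import random

def sampling_area(point, radius, bounds):
    """Axis-aligned box of half-width `radius` around `point`, clipped to the data bounds."""
    return [(max(lo, x - radius), min(hi, x + radius))
            for x, (lo, hi) in zip(point, bounds)]

def sample_in_area(points, area, n, rng):
    """Pick up to n database points that fall inside the box."""
    inside = [p for p in points
              if all(lo <= x <= hi for x, (lo, hi) in zip(p, area))]
    return rng.sample(inside, min(n, len(inside)))

rng = random.Random(0)
# Toy database: a regular grid of (red, green) values.
database = [(x / 10, y / 10) for x in range(100) for y in range(100)]
false_negative = (5.0, 5.0)  # hypothetical sample the user marked relevant but the model rejected
area = sampling_area(false_negative, 0.5, [(0.0, 9.9), (0.0, 9.9)])
print(area, sample_in_area(database, area, 5, rng))
```

The newly labeled points around each misclassified sample then join the training set, so the next classifier can extend or shrink the predicted area near that sample.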
![Page 40: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/40.jpg)
Clustering-based Sampling

[Figure: samples marked √ or x over Green Wavelength and Red wavelength, with sampling areas drawn around clusters of samples (Clusters-Sampling Areas)]
![Page 41: Lec9 Beyond OLAPavid.cs.umass.edu/courses/645/s2018/lectures/... · Equi-Width Stratified Sampling Figure 2: equi-width stratified sampling bucket according to feature F2, and divide](https://reader033.fdocuments.net/reader033/viewer/2022042200/5e9fe6a7a1a9950f1a3d25c9/html5/thumbnails/41.jpg)
(3) Refine Boundaries of Relevant Areas

[Figure: Sampling Areas over Green Wavelength and Red wavelength]