Harikrishnan Karunakaran, Sulabha Balan
ICICLES: Self-tuning Samples for Approximate Query Answering
By Venkatesh Ganti, Mong Li Lee, and Raghu Ramakrishnan
CSE 6339
Introduction
Icicles
Icicle Maintenance
Icicle-Based Estimators
Quality & Performance
Conclusion
Outline
Analysis of data in data warehouses is useful in decision support
Users of decision support systems want interactive systems
OLAP: Online Analytical Processing
Approximate Query Answering (AQUA) systems were developed to reduce response time to desirable levels
Decision support applications are often tolerant of approximate results
Introduction
Various Approaches
Sampling-based
Histogram-based
Clustering
Probabilistic
Wavelet-based
Approximate Querying
Uniform Random Sampling
Sales (full relation)
Branch  State  Sales
1       CA     80K
2       TX     42K
3       CA     40K
4       CA     42K
5       TX     75K
6       CA     48K
7       TX     55K
8       TX     38K
9       CA     40K
10      CA     41K
S_sales (50% sample of Sales)
Branch  State  Sales
2       TX     42K
4       CA     42K
6       CA     48K
8       TX     38K
10      CA     41K
SELECT SUM(sales) * 2 AS cnt
FROM s_sales
WHERE state = 'TX'
Scale factor: 2 (for the 50% sample)
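The scaling step on this slide can be checked with a few lines of Python. This is a minimal sketch: the rows are copied from the Sales and S_sales tables, while the names SALES, S_SALES and scaled_sum are illustrative and do not come from the slides.

```python
# Sales figures are in thousands (K), copied from the slide's tables.
SALES = [
    (1, "CA", 80), (2, "TX", 42), (3, "CA", 40), (4, "CA", 42), (5, "TX", 75),
    (6, "CA", 48), (7, "TX", 55), (8, "TX", 38), (9, "CA", 40), (10, "CA", 41),
]

# The slide's 50% sample keeps branches 2, 4, 6, 8 and 10.
S_SALES = [row for row in SALES if row[0] in {2, 4, 6, 8, 10}]


def scaled_sum(rows, state, scale_factor):
    """Sum the sales of one state and multiply by the scale factor."""
    return scale_factor * sum(sales for _, st, sales in rows if st == state)


# Scale factor = 1 / sampling fraction = 1 / 0.5 = 2.
estimate = scaled_sum(S_SALES, "TX", 2)
exact = scaled_sum(SALES, "TX", 1)
print(estimate, exact)
```

On this data the sample-based estimate for Texas is 160K, while the exact total is 210K. The gap shows how a small uniform sample can miss the tuples a query cares about.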
Biased Sampling
Sample relation for an aggregation query workload regarding Texas branches
Sales (full relation)
Branch  State  Sales
1       CA     80K
2       TX     42K
3       CA     40K
4       CA     42K
5       TX     75K
6       CA     48K
7       TX     55K
8       TX     38K
9       CA     40K
10      CA     41K
S_sales (sample biased towards Texas branches)
Branch  State  Sales
2       TX     42K
4       CA     42K
5       TX     75K
7       TX     55K
8       TX     38K
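The slides do not say how answers from the biased sample are scaled. One plausible approach, shown here purely as an assumption, is to scale each state separately by the ratio of its row count in Sales to its row count in the sample. The helper names per_state_scale and estimate_sum are hypothetical.

```python
from collections import Counter

# Sales figures are in thousands (K), copied from the slide's tables.
SALES = [
    (1, "CA", 80), (2, "TX", 42), (3, "CA", 40), (4, "CA", 42), (5, "TX", 75),
    (6, "CA", 48), (7, "TX", 55), (8, "TX", 38), (9, "CA", 40), (10, "CA", 41),
]

# The slide's biased sample keeps branches 2, 4, 5, 7 and 8.
BIASED = [row for row in SALES if row[0] in {2, 4, 5, 7, 8}]


def per_state_scale(full, sample):
    """Assumed scaling: rows in the full relation / rows in the sample, per state."""
    full_counts = Counter(st for _, st, _ in full)
    sample_counts = Counter(st for _, st, _ in sample)
    return {st: full_counts[st] / sample_counts[st] for st in sample_counts}


def estimate_sum(sample, state, scales):
    """Scale the sampled sum of one state by that state's factor."""
    return scales[state] * sum(s for _, st, s in sample if st == state)


scales = per_state_scale(SALES, BIASED)
tx_estimate = estimate_sum(BIASED, "TX", scales)
ca_estimate = estimate_sum(BIASED, "CA", scales)
print(scales, tx_estimate, ca_estimate)
```

Under this assumption the Texas estimate is exact (210K) because the sample holds every Texas row, while the California estimate (252K against an exact 291K) rests on a single sampled row. That is the trade-off a workload-biased sample makes.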
All tuples in a Uniform Random Sample are treated as equally important for answering queries
The sample needs to be tuned to contain the tuples that are most relevant to answering the queries in a workload
Need for a dynamic algorithm that changes the sample to suit the queries being executed in the workload
Methodology
Join of a Uniform Random Sample of a Fact Table with a set of accompanying Dimension Tables
Join Synopsis
SELECT COUNT(*), AVG(LI_Extendedprice), SUM(LI_Extendedprice)
FROM LI, C, O, S, N, R
WHERE C_Custkey = O_Custkey AND O_Orderkey = LI_Orderkey AND LI_Suppkey = S_Suppkey AND
C_Nationkey = N_Nationkey AND N_Regionkey = R_Regionkey AND
R_Name = 'North America' AND O_Orderdate >= 01-01-1998 AND O_Orderdate <= 12-31-1998;
Any aggregate query on the fact table can be answered approximately using exactly one of a small number of synopses
Uniform Random Sample of Relation wastes memory
OLAP queries exhibit locality in their data access
Need for Icicles
A class of samples that captures the data locality of aggregate queries over foreign-key joins
Identify the focus of a query workload and sample accordingly
An icicle is a uniform random sample of a multiset of tuples L, which is the union of R and all sets of tuples that were required to answer queries in the workload (an extension of R)
With respect to the original relation R, an icicle is a non-uniform sample
Icicles
[Figure labels: Icicle, L]
Icicle Maintenance Algorithm
The algorithm is efficient because:
A uniform random sample of L ensures that a tuple’s selection in the icicle is proportional to its frequency
Incremental maintenance of the icicle requires only the segment of R that satisfies the new query from the workload
Reservoir Sampling Algorithm
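The points above can be illustrated with classic reservoir sampling, which keeps a fixed-size uniform sample of a stream without knowing its length in advance. The sketch below is not the paper's pseudocode. It assumes that tuples satisfying a new query are streamed into the same reservoir as if appended to L, and the Reservoir class, the relation R and its month column are hypothetical.

```python
import random


class Reservoir:
    """Fixed-size uniform sample of a stream (classic reservoir sampling)."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # number of stream elements offered so far
        self._rng = random.Random(seed)

    def offer(self, item):
        """Consider one more element of the stream for the sample."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
            return
        # Keep the new element with probability capacity / seen,
        # overwriting a uniformly chosen current member.
        slot = self._rng.randrange(self.seen)
        if slot < self.capacity:
            self.items[slot] = item

    def extend(self, stream):
        for item in stream:
            self.offer(item)


# Hypothetical relation R of (id, month) pairs.
R = [(i, "April" if i % 4 == 0 else "May") for i in range(1, 101)]

icicle = Reservoir(capacity=10, seed=0)
icicle.extend(R)  # start from a uniform sample of R

# A new query arrives: only the segment of R that satisfies it is
# streamed into the reservoir, as if those tuples were appended to L.
april_rows = [t for t in R if t[1] == "April"]
icicle.extend(april_rows)
print(icicle.seen, icicle.items)
```

Because the reservoir only ever reads the tuples offered to it, each maintenance step touches just the rows that satisfy the new query, and tuples that are offered more often are more likely to be in the sample.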
Icicle Maintenance Algorithm
Icicle Maintenance Example
SELECT average(*)
FROM widget-tuners
WHERE date.month = 'April'
• Although uniform sampling (of L) is used, the result is a biased sample of R
• Frequency Relation maintained over all tuples in relation
• Different Estimation mechanisms for Average, Count and Sum
Icicle-Based Estimators
Average: the average taken over the set of distinct sample tuples that satisfy the query predicate is a good estimate of the average
Count: the sum of the expected contributions of all tuples in the sample that satisfy the given query
Sum: the estimate is given by the product of the average and count estimates
Estimators
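The three estimators can be sketched in Python. The slides do not define "expected contribution", so this sketch assumes that each sampled copy of a tuple with frequency f contributes |L| / (n · f), where n is the icicle size. That choice makes the count estimate unbiased when the icicle is a uniform sample of L, but it is an assumption, not a formula taken from the slides. The tuple layout (id, state, sales, freq) and the value 12 for |L| are hypothetical.

```python
def estimate_average(icicle, predicate, value):
    """Average over the distinct sample tuples that satisfy the predicate."""
    distinct = {t["id"]: t for t in icicle if predicate(t)}
    if not distinct:
        return None
    return sum(value(t) for t in distinct.values()) / len(distinct)


def estimate_count(icicle, predicate, size_of_L):
    """Sum of the assumed expected contributions |L| / (n * freq)."""
    n = len(icicle)
    return sum(size_of_L / (n * t["freq"]) for t in icicle if predicate(t))


def estimate_sum(icicle, predicate, value, size_of_L):
    """Product of the average and count estimates."""
    avg = estimate_average(icicle, predicate, value)
    if avg is None:
        return 0.0
    return avg * estimate_count(icicle, predicate, size_of_L)


# Hypothetical icicle: a sample of the multiset L, so duplicates can occur.
icicle = [
    {"id": 2, "state": "TX", "sales": 42, "freq": 2},
    {"id": 2, "state": "TX", "sales": 42, "freq": 2},
    {"id": 8, "state": "TX", "sales": 38, "freq": 1},
    {"id": 4, "state": "CA", "sales": 42, "freq": 1},
]


def is_tx(t):
    return t["state"] == "TX"


def sales(t):
    return t["sales"]


avg_tx = estimate_average(icicle, is_tx, sales)
count_tx = estimate_count(icicle, is_tx, size_of_L=12)
sum_tx = estimate_sum(icicle, is_tx, sales, size_of_L=12)
print(avg_tx, count_tx, sum_tx)
```

Duplicates are dropped for the average so that frequently used tuples do not dominate it, while the count divides by the frequency so that a tuple sampled more often does not count for more.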
Frequency Attribute added to the Relation
Starting Frequency set to 1 for all tuples
Incremented each time tuple is used to answer a query
Frequencies of relevant tuples are updated only when the icicle is updated with a new query
Maintaining Frequency Relation
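The bookkeeping described on this slide can be sketched as a dictionary from tuple id to frequency. This is an illustration only: the function names, the relation contents and the in-memory dictionary are assumptions, and a real system would store the frequency as an attribute of the relation, as the slide states.

```python
def new_frequency_relation(tuple_ids):
    """Every tuple starts with frequency 1."""
    return {tid: 1 for tid in tuple_ids}


def record_query(freq, relation, predicate):
    """Increment the frequency of each tuple used to answer the query."""
    used = [tid for tid, row in relation.items() if predicate(row)]
    for tid in used:
        freq[tid] += 1
    return used


# Hypothetical relation keyed by branch id.
relation = {
    1: {"state": "CA"},
    2: {"state": "TX"},
    3: {"state": "CA"},
    5: {"state": "TX"},
}

freq = new_frequency_relation(relation)
used = record_query(freq, relation, lambda row: row["state"] == "TX")
print(freq, used)
```

With a starting frequency of 1 and one increment per use, the frequencies sum to the size of the multiset L, which is 6 in this small example.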
When queries exhibit data locality, the icicle contains more tuples from the frequently accessed subsets of the relation
The accuracy of an estimate improves as the number of sample tuples used to compute it increases
Queries that are ‘focused’ with respect to the workload therefore obtain more accurate approximate answers from the icicle
Quality Guarantees
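The link between accuracy and the number of tuples can be illustrated with a textbook result rather than the paper's own guarantee, which the slides do not reproduce. Assuming an average computed from n independent uniform draws of a value with standard deviation σ, the standard error of the estimate is:

```latex
\[
\operatorname{SE}(\bar{x}_n) = \frac{\sigma}{\sqrt{n}}
\]
```

Under that assumption, a focused query that finds four times as many relevant tuples in the icicle sees its standard error halved.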
Quality Guarantees contd...
Performance Evaluation
Qworkload: template for generating workloads
SELECT COUNT(*), AVG(LI_Extendedprice), SUM(LI_Extendedprice)
FROM LI, C, O, S, N, R
WHERE C_Custkey = O_Custkey AND O_Orderkey = LI_Orderkey AND LI_Suppkey = S_Suppkey AND
C_Nationkey = N_Nationkey AND N_Regionkey = R_Regionkey AND
R_Name = [region] AND O_Orderdate >= Date[startdate] AND O_Orderdate <= 12-31-1998
Template for obtaining approximate answers
SELECT COUNT(*), AVG(LI_Extendedprice), SUM(LI_Extendedprice)
FROM LICOS-icicle, N, R
WHERE C_Nationkey = N_Nationkey AND N_Regionkey = R_Regionkey AND
R_Name = [region] AND O_Orderdate >= Date[startdate] AND O_Orderdate <= 12-31-1998
Performance Evaluation contd...
The error plots compare three samples:
Static uniform random sample on the join synopsis
Icicle, as it evolves with the workload
Icicle-Complete, which is formed after the entire workload has been executed once
Performance Evaluation contd...
Focused Queries
Performance Evaluation contd...
Performance Evaluation contd...
Mixed Workload
Rapid decrease in relative error of query answers from icicles with queries focused on a set of core tuples
Icicle plot shows a convergence to the Icicle-Complete plot
Quick convergence of the Icicle plot towards the Icicle-Complete plot means the icicle adapts quickly to the workload
Observations (focused)
The improvement due to the use of icicles is not significant
It can be concluded that icicles are at worst as good as the static samples
Observations (mixed)
Icicles provide a class of samples that adapt to the characteristics of the workload
They are never worse than static sampling
They focus on relatively small subsets of the relation
Conclusion
There are no significant gains in the case of a uniform workload
There is a trade-off between accuracy and cost
The benefit is restricted to scenarios where the queries in the workload are increasingly focused on particular subsets of the relation
Inferences
V. Ganti, M. L. Lee, and R. Ramakrishnan. ICICLES: Self-tuning Samples for Approximate Query Answering. VLDB Conference, 2000.
S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join Synopses for Approximate Query Answering. ACM SIGMOD Record, 1999.
References
Thank You
Questions?