Harikrishnan Karunakaran Sulabha Balan CSE 6339. Introduction Icicles Icicle Maintenance ...

Harikrishnan Karunakaran Sulabha Balan

ICICLES: Self-tuning Samples for Approximate Query Answering

By Venkatesh Ganti, Mong Li Lee, and Raghu Ramakrishnan

CSE 6339

Introduction

Icicles

Icicle Maintenance

Icicle-Based Estimators

Quality & Performance

Conclusion

Outline

Analysis of data in data warehouses useful in decision support

Users of decision support systems want interactive systems

OLAP: Online Analytical Processing

Approximate Query Answering (AQUA) systems developed to reduce response time to desirable levels

Tolerant of approximate results

Introduction

Various Approaches

Sampling-based

Histogram-based

Clustering

Probabilistic

Wavelet-based

Approximate Querying

Uniform Random Sampling

Sales (full relation)

Branch  State  Sales
1       CA     80K
2       TX     42K
3       CA     40K
4       CA     42K
5       TX     75K
6       CA     48K
7       TX     55K
8       TX     38K
9       CA     40K
10      CA     41K

S_sales (50% sample of Sales)

Branch  State  Sales
2       TX     42K
4       CA     42K
6       CA     48K
8       TX     38K
10      CA     41K

Query against the sample, where 2 is the scale factor (the inverse of the 50% sampling rate):

SELECT SUM(sales) * 2 AS cnt
FROM s_sales
WHERE state = 'TX'
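As a small illustration of the scaled query on this slide, the sketch below recomputes the Texas total from the full relation and from the 50% sample. The function name `scaled_sum` and the choice to store sales in thousands are assumptions made for this example only.

```python
# Sales relation from the slide: (branch, state, sales in thousands).
SALES = [
    (1, "CA", 80), (2, "TX", 42), (3, "CA", 40), (4, "CA", 42), (5, "TX", 75),
    (6, "CA", 48), (7, "TX", 55), (8, "TX", 38), (9, "CA", 40), (10, "CA", 41),
]

# The 50% sample shown on the slide (branches 2, 4, 6, 8, 10).
S_SALES = [row for row in SALES if row[0] % 2 == 0]


def scaled_sum(rows, state, scale_factor):
    """SUM(sales) over rows of one state, multiplied by the scale factor."""
    return scale_factor * sum(sales for _, st, sales in rows if st == state)


# Exact answer: scan the full relation with scale factor 1.
exact_tx = scaled_sum(SALES, "TX", 1)

# Approximate answer: scan the sample and scale by |Sales| / |S_sales| = 2.
approx_tx = scaled_sum(S_SALES, "TX", len(SALES) / len(S_SALES))

print(exact_tx, approx_tx)
```

On this data the exact Texas total is 210K while the sample-based estimate is 160K, because only two of the four Texas branches landed in the sample.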

Biased Sampling

Sample relation for aggregation query workload regarding Texas branches

Sales (full relation)

Branch  State  Sales
1       CA     80K
2       TX     42K
3       CA     40K
4       CA     42K
5       TX     75K
6       CA     48K
7       TX     55K
8       TX     38K
9       CA     40K
10      CA     41K

S_sales (biased sample)

Branch  State  Sales
2       TX     42K
4       CA     42K
5       TX     75K
7       TX     55K
8       TX     38K
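The slide does not say how answers are scaled when the sample is biased. One possible approach, shown here purely as an assumption, is to use a separate scale factor per state (the number of tuples in the relation divided by the number kept in the sample). The helper names `per_state_scale` and `estimate_sum` are hypothetical.

```python
from collections import Counter

# Sales relation and biased sample from the slide: (branch, state, sales in thousands).
SALES = [
    (1, "CA", 80), (2, "TX", 42), (3, "CA", 40), (4, "CA", 42), (5, "TX", 75),
    (6, "CA", 48), (7, "TX", 55), (8, "TX", 38), (9, "CA", 40), (10, "CA", 41),
]
BIASED = [row for row in SALES if row[0] in {2, 4, 5, 7, 8}]


def per_state_scale(relation, sample):
    """Assumed weighting: tuples in the relation / tuples in the sample, per state."""
    total = Counter(state for _, state, _ in relation)
    kept = Counter(state for _, state, _ in sample)
    return {state: total[state] / kept[state] for state in kept}


def estimate_sum(sample, state, scale):
    """Scaled SUM(sales) for one state, using that state's scale factor."""
    return sum(sales * scale[st] for _, st, sales in sample if st == state)


scale = per_state_scale(SALES, BIASED)
tx_estimate = estimate_sum(BIASED, "TX", scale)
ca_estimate = estimate_sum(BIASED, "CA", scale)
print(tx_estimate, ca_estimate)
```

Under this assumed weighting the Texas estimate is exact (210K), since every Texas branch is in the sample, while the California estimate rests on a single tuple (252K against a true 291K). This is the trade the biased sample makes in favor of the Texas workload.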

All tuples in a Uniform Random Sample are treated as equally important for answering queries

Sample needs to be tuned to contain tuples which are more relevant to answer queries in a workload

Need for a dynamic algorithm that changes the sample to suit the queries being executed in the workload

Methodology

Join of a Uniform Random Sample of a Fact Table with a set of accompanying Dimension Tables

Join Synopsis

SELECT COUNT(*), AVG(LI_Extendedprice), SUM(LI_Extendedprice)
FROM LI, C, O, S, N, R
WHERE C_Custkey = O_Custkey AND O_Orderkey = LI_Orderkey AND LI_Suppkey = S_Suppkey
  AND C_Nationkey = N_Nationkey AND N_Regionkey = R_Regionkey
  AND R_Name = 'North America'
  AND O_Orderdate >= '01-01-1998' AND O_Orderdate <= '12-31-1998';

Any aggregate query on the fact table can be answered approximately using exactly one of a small number of synopses

Uniform Random Sample of Relation wastes memory

OLAP queries exhibit locality in their data access

Need for Icicles

Class of samples to capture data locality of aggregate queries of foreign key joins

Identify focus of a query workload and sample accordingly

Is a uniform random sample of a multiset of tuples L, which is the union of R and all sets of tuples that were required to answer queries in the workload (an extension of R)

Is a non-uniform sample of the original relation R

Icicles

Icicle L
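The definition above can be sketched as follows: build the multiset L from R plus the tuples each workload query needed, then draw a uniform random sample of L. This is a minimal, non-incremental illustration; the function names and the use of integer tuple ids are assumptions for the example.

```python
import random
from collections import Counter


def build_extended_multiset(relation, query_results):
    """L = R plus every set of tuples that was required to answer a workload query."""
    multiset = Counter(relation)      # every tuple of R appears once
    for result in query_results:
        multiset.update(result)       # one extra copy per query that used the tuple
    return multiset


def draw_icicle(multiset, n, rng):
    """Uniform random sample of n elements of the multiset L."""
    expanded = list(multiset.elements())   # each tuple repeated by its multiplicity
    return rng.sample(expanded, n)


# Ten tuples identified by branch number; two queries both touched branches 2, 5, 7, 8.
relation = range(1, 11)
workload = [[2, 5, 7, 8], [2, 5, 7, 8]]

extended = build_extended_multiset(relation, workload)
icicle = draw_icicle(extended, 6, random.Random(1))
print(sum(extended.values()), icicle)
```

Because branches 2, 5, 7 and 8 each appear three times in L, they are three times as likely to be drawn as the other branches. The sample is uniform over L but non-uniform over R.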

Icicle Maintenance Algorithm

Algorithm is efficient due to

Uniform Random Sample of L ensures that a tuple's selection in its icicle is proportional to its frequency

Incremental maintenance of icicle requires only the segment of R that satisfies the new query from the workload

Reservoir Sampling Algorithm

Icicle Maintenance Algorithm
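The slides name reservoir sampling but the algorithm figure is not reproduced here. The sketch below uses the classic reservoir scheme (each arriving element replaces a random slot with probability capacity / elements seen), treating the tuples that satisfy each new query as a continuation of the stream that started with R. The class name `Icicle` and its methods are hypothetical, and the details may differ from the paper's algorithm.

```python
import random


class Icicle:
    """Fixed-size uniform sample of a growing multiset L, kept by reservoir sampling."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.sample = []
        self.seen = 0                      # current size of L
        self.rng = rng or random.Random()

    def offer(self, tup):
        """Present one more element of L to the reservoir."""
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(tup)
        else:
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:       # happens with probability capacity / seen
                self.sample[slot] = tup

    def extend(self, tuples):
        """Feed a batch: all of R at start-up, or the segment of R a new query used."""
        for tup in tuples:
            self.offer(tup)


icicle = Icicle(capacity=5, rng=random.Random(0))
icicle.extend(range(1, 11))        # initial pass over R (branches 1..10)
icicle.extend([2, 5, 7, 8])        # tuples satisfying the first query
icicle.extend([2, 5, 7, 8])        # the same region is queried again
print(icicle.seen, icicle.sample)
```

Only the tuples that satisfy the new query are touched on each update, which matches the efficiency argument on the slide.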

Icicle Maintenance Example

SELECT average(*)

FROM widget-tuners

WHERE date.month = 'April'

• Although uniform sampling is used (over L), the result is a biased sample of the relation

• Frequency Relation maintained over all tuples in relation

• Different Estimation mechanisms for Average, Count and Sum

Icicle-Based Estimators

Average: the average taken over the set of distinct sample tuples that satisfy the query predicate is a good estimate of the average

Count: the sum of the expected contributions of all tuples in the sample that satisfy the given query

Sum: the estimate is given by the product of the average and the count estimates

Estimators
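The three estimators can be sketched as below. The slides do not define "expected contribution", so this example assumes one plausible reading: a sampled occurrence of tuple t stands for |L| / (n * f(t)) tuples of R, where n is the icicle size and f(t) is the tuple's frequency. All function and parameter names are hypothetical.

```python
def estimate_count(sample, frequency, size_of_l, predicate):
    """Sum of assumed contributions |L| / (n * f(t)) over matching sample occurrences."""
    n = len(sample)
    return sum(size_of_l / (n * frequency[t]) for t in sample if predicate(t))


def estimate_average(sample, value, predicate):
    """Average over the distinct sample tuples that satisfy the predicate."""
    distinct = {t for t in sample if predicate(t)}
    if not distinct:
        return None
    return sum(value(t) for t in distinct) / len(distinct)


def estimate_sum(sample, frequency, size_of_l, value, predicate):
    """Product of the average and count estimates."""
    average = estimate_average(sample, value, predicate)
    if average is None:
        return 0.0
    return average * estimate_count(sample, frequency, size_of_l, predicate)


# Tuples are (branch, state, sales in thousands); branch 5 was sampled twice.
tx2, tx5, tx7, tx8 = (2, "TX", 42), (5, "TX", 75), (7, "TX", 55), (8, "TX", 38)
ca4 = (4, "CA", 42)
sample = [tx2, tx5, tx5, tx7, tx8, ca4]
frequency = {tx2: 3, tx5: 3, tx7: 3, tx8: 3, ca4: 1}   # TX tuples used by two queries
size_of_l = 18                                         # 4 * 3 + 6 * 1


def is_tx(t):
    return t[1] == "TX"


def sales(t):
    return t[2]


count_tx = estimate_count(sample, frequency, size_of_l, is_tx)
avg_tx = estimate_average(sample, sales, is_tx)
sum_tx = estimate_sum(sample, frequency, size_of_l, sales, is_tx)
print(count_tx, avg_tx, sum_tx)
```

Dividing by the frequency keeps heavily queried tuples from being over-counted, while the average ignores duplicates altogether by working on distinct tuples.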

Frequency Attribute added to the Relation

Starting Frequency set to 1 for all tuples

Incremented each time tuple is used to answer a query

Frequencies of relevant tuples updated only when icicle updated with new query

Maintaining Frequency Relation
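A minimal sketch of the frequency relation described above, assuming tuples are identified by branch number and that the frequencies live in an in-memory counter (the slides do not say how the relation is stored). The function names are hypothetical.

```python
from collections import Counter


def init_frequencies(relation):
    """Starting frequency is 1 for every tuple in the relation."""
    return Counter({t: 1 for t in relation})


def record_query(frequency, segment):
    """Increment the frequency of each tuple used to answer the new query."""
    for t in segment:
        frequency[t] += 1


frequency = init_frequencies(range(1, 11))   # branches 1..10
record_query(frequency, [2, 5, 7, 8])        # a query over the Texas branches
print(frequency[5], frequency[1], sum(frequency.values()))
```

The sum of all frequencies equals the size of the extended multiset L, so the same structure can supply |L| when it is needed for estimation.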

When queries exhibit data locality then icicle is constituted of more tuples from frequently accessed subsets of the relation

Accuracy of an approximate answer improves as the number of sample tuples used to compute it increases

Class consisting of queries ‘focused’ with respect to workload will obtain more accurate approximate answers from the icicle

Quality Guarantees
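As a rough illustration of why more matching tuples help, the sketch below uses the textbook standard error of a sample mean, sigma / sqrt(k), for k independent draws. This is a general statistical fact used here as an assumption; it is not the specific guarantee stated in the paper.

```python
import math


def standard_error(sigma, k):
    """Standard error of a mean over k independent draws with standard deviation sigma."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sigma / math.sqrt(k)


# Quadrupling the number of matching sample tuples halves the standard error.
few = standard_error(10.0, 25)
many = standard_error(10.0, 100)
print(few, many)
```

A focused workload pulls more of the relevant tuples into the icicle, which raises k for exactly the queries that are asked most often.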

Quality Guarantees contd...

Performance Evaluation

Qworkload: template for generating workloads

SELECT COUNT(*), AVG(LI_Extendedprice), SUM(LI_Extendedprice)
FROM LI, C, O, S, N, R
WHERE C_Custkey = O_Custkey AND O_Orderkey = LI_Orderkey AND LI_Suppkey = S_Suppkey AND
C_Nationkey = N_Nationkey AND N_Regionkey = R_Regionkey AND
R_Name = [region] AND O_Orderdate >= Date[startdate] AND O_Orderdate <= 12-31-1998

Template for obtaining approximate answers

SELECT COUNT(*), AVG(LI_Extendedprice), SUM(LI_Extendedprice)
FROM LICOS-icicle, N, R
WHERE C_Nationkey = N_Nationkey AND N_Regionkey = R_Regionkey AND
R_Name = [region] AND O_Orderdate >= Date[startdate] AND O_Orderdate <= 12-31-1998

Performance Evaluation contd...

The Error Plots for Comparison

Static uniform random sample on Join Synopsis

Icicle as it evolves with the workload

Icicle-Complete which is formed after entire workload has been executed once


Focused Queries



Mixed Workload

Rapid decrease in relative error of query answers from icicles with queries focused on a set of core tuples

Icicle plot shows a convergence to the Icicle-Complete plot

Quick Convergence of Icicle plot towards Icicle-Complete means Icicle adapts fast

Observations (focused)

Improvement due to usage of icicles is not significant

Can be concluded that icicles are at worst as good as the static samples

Observations (mixed)

Icicles provide class of samples that adapt according to the characteristics of the workload

Icicles are never worse than static sampling

Icicles focus on relatively small subsets of the relation

Conclusion

There are no significant gains in the case of a uniform workload

There is a trade-off between accuracy and cost

Benefits are restricted to scenarios where the queries in the workload become increasingly focused on a subset of the relation.

Inferences

V. Ganti, M. Lee, and R. Ramakrishnan. ICICLES: Self-tuning Samples for Approximate Query Answering. VLDB Conference 2000.

S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join Synopses for Approximate Query Answering. ACM SIGMOD Record, 1999.

References

Thank You

Questions?
