DAQ: A New Paradigm for Approximate Query Processing Navneet Potti Jignesh Patel VLDB 2015.
DAQ: A New Paradigm for Approximate Query Processing
Navneet Potti, Jignesh Patel
VLDB 2015
Outline
• Approximate Query Processing
• SAQ
• DAQ
• Bitwise DAQ Scheme
• Evaluation
• Conclusion
Approximate Query Processing
Data volume is growing exponentially
Queries are interactive to support real-time decisions
Decisions are resilient to small errors
Quick Approximate Answer is better than Slow Exact Answer
Exploratory analysis demands responsiveness
e.g., an Average Revenue estimate of $12M is about as good as $12,345,678
SAQ: Sampling-based Approximate Querying
• Run query on a small random subset of data
• Error in estimate presented as confidence interval: Avg revenue = $12.3 ± 0.1 million with 95% confidence
• Can be “online”: error eventually shrinks to zero, giving the exact answer
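The SAQ idea above can be sketched in a few lines. This is a minimal illustration, not the method of any particular SAQ system: it assumes simple random sampling without replacement and a normal-approximation interval, and the function name `sample_mean_ci` and the synthetic `revenues` data are hypothetical.

```python
import math
import random
import statistics

def sample_mean_ci(values, sample_size, z=1.96, seed=0):
    """Estimate the mean from a simple random sample.

    Returns (estimate, half_width) where half_width is the half-width of a
    normal-approximation confidence interval (z=1.96 for roughly 95%).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(sample_size)
    return mean, z * stderr

# Synthetic data (an assumption for illustration): revenues in $ millions.
data_rng = random.Random(42)
revenues = [data_rng.uniform(0, 25) for _ in range(100_000)]

small_est, small_hw = sample_mean_ci(revenues, 1_000)
large_est, large_hw = sample_mean_ci(revenues, 10_000)
print(f"1,000 rows:  Avg revenue = {small_est:.2f} ± {small_hw:.2f}")
print(f"10,000 rows: Avg revenue = {large_est:.2f} ± {large_hw:.2f}")
```

The interval narrows roughly with the square root of the sample size, which is what makes an "online" variant possible: keep sampling and the reported error keeps shrinking.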
Confidence Intervals
• Avg revenue = $12.3 ± 0.1 million with 95% confidence
What does this mean?
[Figure: probability distribution of average revenue, centered at 12.3; 95% of the probability lies between 12.2 and 12.4, and 5% lies in the tails]
With 95% probability, true average revenue lies in 12.3 ± 0.1 million
With 5% probability, true average revenue lies outside 12.3 ± 0.1 million
Confidence Intervals
• eg: Avg revenue = $12.3 ± 0.1 million with 95% confidence
How should we interpret the tails?
[Figure: probability distribution of average revenue, centered at 12.3; 95% of the probability lies between 12.2 and 12.4, and 5% lies in the tails]
Tails occur 5% of the time. In Avg Regional Revenue for 100 states, about 5 estimates are expected to fall in the tails.
There is no bound on error in the tail. The Avg Revenue could be as small as $1M or as large as $100M.
SAQ: Shortcomings
• Semantics of the tails of confidence intervals are hard to interpret. Confidence interval bounds are unintuitive.
• Intervals are very broad for outlier aggregates like MAX or Top 100. Need to see more of the data to find outliers. Slow convergence.
• Intervals are hard to manipulate. No closed algebra. Is 100 ± 10 “greater than” 90 ± 20? How do we add these intervals?
DAQ: Deterministic Approximate Querying
Pop quiz! Estimate the sum.
  5,321,656
+ 3,151,709
+ 1,362,296
___________
= ?
a. Approximately 2.4 million?
b. Approximately 9.7 million?
c. Approximately 13.8 million?
d. Approximately 17.0 million?
= 9,???,???
= 9,7??,???
= 9,83?,???
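The pop quiz can be turned into guaranteed bounds. The sketch below is an illustration, not code from the talk: it assumes non-negative integers of a fixed decimal width and treats every unseen low digit as 0 for the lower bound and 9 for the upper bound. The name `digitwise_sum_bounds` is hypothetical.

```python
def digitwise_sum_bounds(numbers, digits_seen, width=7):
    """Bound the sum of non-negative integers from their leading digits only.

    Each number is read as a `width`-digit decimal. Unseen low digits are
    taken as all 0s for the lower bound and all 9s for the upper bound.
    """
    unit = 10 ** (width - digits_seen)           # place value of the last digit seen
    truncated = [(x // unit) * unit for x in numbers]
    lower = sum(truncated)
    upper = lower + len(numbers) * (unit - 1)    # each number can hide at most unit - 1
    return lower, upper

numbers = [5_321_656, 3_151_709, 1_362_296]
for k in range(1, 8):
    lo, hi = digitwise_sum_bounds(numbers, k)
    print(f"{k} digit(s) seen: sum is in [{lo:,}, {hi:,}]")
```

With a single leading digit the sum is already pinned to [9,000,000, 11,999,997], which rules out every quiz option except (b); each further digit shrinks the interval by about a factor of ten until it collapses to the exact sum, 9,835,661.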
DAQ: Deterministic Approximate Querying
• Use deterministic intervals instead of probabilistic (confidence) intervals
• Guaranteed upper and lower bounds: Avg revenue = $12.3 ± 0.2 million
• Can be “online”: error interval eventually becomes degenerate, giving the exact answer
[Figure: DAQ UI mockup]
SAQ vs DAQ (at a glance)

| SAQ | DAQ |
| --- | --- |
| Complex semantics using confidence intervals due to the “tail”. | Simple semantics using deterministic intervals as there is no “tail”. |
| Slow for outlier aggregates like MAX or Top 100 and heavy-tailed data. | Fast for outlier aggregates like MAX or Top 100 and heavy-tailed data. |
| No closed algebra. No clear semantics for predicates and arithmetic operations on estimates. | Closed relational algebra. Clear semantics for predicates and arithmetic ops using interval algebra. |

Conceptual DAQ Scheme
• Hierarchically partition the attribute’s domain
• Estimates are represented as intervals [a,b]
• e.g., Count B-Tree
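A rough sketch of how per-partition counts yield interval estimates. The slides do not spell out the Count B-Tree structure, so this stand-in simply recomputes fixed-width buckets at each "level" (coarse to fine); `level_buckets` and `avg_bounds` are hypothetical names, and the data is made up.

```python
from collections import Counter

def level_buckets(values, width):
    """Partition integer values into buckets [k*width, k*width + width - 1].

    Returns a list of (low, high, count), one entry per non-empty bucket.
    """
    counts = Counter(v // width for v in values)
    return [(k * width, k * width + width - 1, c) for k, c in sorted(counts.items())]

def avg_bounds(buckets):
    """Bound AVG using only bucket boundaries and counts."""
    total = sum(count for _, _, count in buckets)
    lower = sum(low * count for low, _, count in buckets) / total
    upper = sum(high * count for _, high, count in buckets) / total
    return lower, upper

values = [3, 8, 15, 22, 22, 37, 41, 58, 63, 99]
for width in (100, 10, 1):   # descending the hierarchy: coarse to fine
    print(f"bucket width {width:>3}: AVG in {avg_bounds(level_buckets(values, width))}")
```

Each finer level of the partition tightens the interval, and at the finest level the interval is degenerate and equals the exact average (36.8 here).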
Interval Algebra
• Predicate evaluation
• Interval representation for relations

| City | Est. Population |
| --- | --- |
| Shire | [110,120] |
| Rivendell | [70,90] |
| Gondor | [80,120] |

Which cities have population > 100?
Certainly > 100: Shire
Potentially > 100: Shire, Gondor
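The city example can be reproduced with a small interval-algebra sketch. The three-valued result labels and the helper names `greater_than` and `interval_add` are assumptions for illustration; the slides only state that predicates and arithmetic have clear semantics over intervals.

```python
def greater_than(interval, threshold):
    """Evaluate `x > threshold` when x is only known to lie in [low, high]."""
    low, high = interval
    if low > threshold:
        return "certain"      # every value in the interval satisfies the predicate
    if high > threshold:
        return "potential"    # some values satisfy it, some do not
    return "no"               # no value in the interval satisfies it

def interval_add(a, b):
    """Sum of two interval estimates: add lower and upper bounds separately."""
    return (a[0] + b[0], a[1] + b[1])

population = {"Shire": (110, 120), "Rivendell": (70, 90), "Gondor": (80, 120)}

certain = [city for city, iv in population.items() if greater_than(iv, 100) == "certain"]
potential = [city for city, iv in population.items() if greater_than(iv, 100) != "no"]
print("Certainly > 100:  ", certain)
print("Potentially > 100:", potential)
print("Shire + Rivendell:", interval_add(population["Shire"], population["Rivendell"]))
```

A query result is thus itself a pair of relations, the rows that certainly qualify and the rows that potentially qualify, which is one way to read "interval representation for relations".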
Formal Definition
• Interval estimates for attributes and relations
• Operators consume and produce intervals
• Closed relational algebra
• Online DAQ scheme
  – monotonically shrinking intervals
  – converges to exact estimate
Bitwise DAQ Scheme
• Similar to the decimal digit-wise sum example
• Uses Bitsliced Index representation
Bitwise DAQ Scheme
• Use most significant m bits for evaluation
• Remaining n-m bits set to all-0 and all-1 for bounds
• Error bound decreases exponentially: 2^(n-m)
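A minimal sketch of the bounds described above, assuming unsigned n-bit integers; the function name `bit_bounds` is hypothetical.

```python
def bit_bounds(value, m, n):
    """Bounds on an n-bit unsigned value when only its top m bits are known."""
    shift = n - m
    lower = (value >> shift) << shift        # unseen low bits set to all-0
    upper = lower | ((1 << shift) - 1)       # unseen low bits set to all-1
    return lower, upper

n = 8
value = 0b10110110   # 182
for m in range(n + 1):
    lo, hi = bit_bounds(value, m, n)
    print(f"m={m}: [{lo}, {hi}]  ({hi - lo + 1} candidate values)")
```

Every extra bit halves the number of candidate values, 2^(n-m), so the interval shrinks monotonically and is exact once m = n.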
Bitwise DAQ Scheme: Algorithms
Average aggregation up to m bits
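The algorithm shown on this slide did not survive extraction, so the following is only a sketch of how an average could be bounded from the top m bit slices. It assumes unsigned n-bit values and models each bit slice as a Python integer used as a row bitvector; `build_bitslices` and `bitsliced_avg_bounds` are hypothetical names, not the authors' code.

```python
def build_bitslices(values, n):
    """Return n row bitvectors (Python ints), most significant slice first.

    Bit i of slice j is set when row i has bit (n - 1 - j) set in its value.
    """
    slices = []
    for j in range(n):
        bit = n - 1 - j
        vec = 0
        for row, v in enumerate(values):
            if (v >> bit) & 1:
                vec |= 1 << row
        slices.append(vec)
    return slices

def bitsliced_avg_bounds(slices, num_rows, m):
    """Bound AVG over all rows after reading only the top m slices."""
    n = len(slices)
    partial_sum = 0
    for j in range(m):
        weight = 1 << (n - 1 - j)                      # place value of this slice
        partial_sum += weight * bin(slices[j]).count("1")
    lower = partial_sum / num_rows                     # unseen bits all 0
    upper = lower + ((1 << (n - m)) - 1)               # unseen bits all 1
    return lower, upper

values = [182, 37, 200, 5]
slices = build_bitslices(values, 8)
for m in range(9):
    print(f"m={m}: AVG in {bitsliced_avg_bounds(slices, len(values), m)}")
```

Each slice contributes its population count times its place value, so the work per step is one pass over a single bitvector, and the gap between the bounds is 2^(n-m) - 1 regardless of the data.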
Bitwise DAQ vs. Baseline
Exponentially decreasing error bounds in estimating Avg
Bitwise DAQ Scheme: Algorithms
Less Than predicate up to m bits
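The slide's algorithm is not present in the transcript. The sketch below uses the standard bit-sliced comparison technique (scan slices from the most significant bit, tracking rows already known to be smaller and rows still tied with the constant) and stops after m slices. Modelling bitvectors as Python integers and the names `build_bitslices` and `less_than` are assumptions for illustration.

```python
def build_bitslices(values, n):
    """Return n row bitvectors (Python ints), most significant slice first."""
    return [
        sum(1 << row for row, v in enumerate(values) if (v >> (n - 1 - j)) & 1)
        for j in range(n)
    ]

def less_than(slices, num_rows, constant, m):
    """Evaluate `value < constant` using only the top m bit slices.

    Returns (certain, potential) as row bitvectors: rows certainly less than
    `constant`, and rows that are either certain or still undecided.
    """
    n = len(slices)
    lt = 0                         # rows already known to be smaller
    eq = (1 << num_rows) - 1       # rows whose seen bits all match the constant
    for j in range(m):
        s = slices[j]
        if (constant >> (n - 1 - j)) & 1:
            lt |= eq & ~s          # row bit 0 where constant bit is 1: smaller
            eq &= s
        else:
            eq &= ~s               # row bit 1 where constant bit is 0: larger
    undecided = eq if m < n else 0  # after all n bits, ties are equal, not less
    return lt, lt | undecided

def rows(vec, num_rows):
    """Row indices whose bit is set in a row bitvector."""
    return [i for i in range(num_rows) if (vec >> i) & 1]

values = [182, 37, 200, 5]
slices = build_bitslices(values, 8)
for m in (1, 2, 8):
    certain, potential = less_than(slices, len(values), 180, m)
    print(f"m={m}: certainly < 180: {rows(certain, 4)}, potentially: {rows(potential, 4)}")
```

The certain set only grows and the potential set only shrinks as m increases, so the answer can be reported early and refined; treating all tied rows as undecided is a conservative choice that keeps the bounds valid.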
Bitwise DAQ vs. Baseline
Predicate evaluation: 6x speedup using 8 bits for < 1% error
Bitwise DAQ vs. Baseline
Top 100: 3.5x speedup for < 1% error on Uniform, Zipf data
Bitwise DAQ vs. SAQ
Top 100: DAQ performs better for heavy-tailed data (Zipf)
Bitwise DAQ: Summary
Excels at
• Predicates
• Outlier aggregates
• Heavy-tailed data
Suffers for
• Simpler aggregates (sum, avg)
• Uniform data
• Compression improves performance further
  – Can operate directly on compressed bitvectors
  – > 8x speedup for Top 100 on Zipf data
• Embarrassingly parallelizable
  – NUMA-aware parallelization
Future Work
• B-tree like indices
• Extension to other operators (Group By, Join)
• Extension to other data types
• Query optimization
• Combining SAQ and DAQ