Post on 23-Dec-2015
A Dip in the Reservoir: Maintaining Sample Synopses
of Evolving Datasets
Rainer Gemulla (University of Technology Dresden)
Wolfgang Lehner (University of Technology Dresden)
Peter J. Haas (IBM Almaden Research Center)
Faculty of Computer Science, Institute System Architecture, Database Technology Group
Rainer Gemulla, Wolfgang Lehner, Peter J. Haas
A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets
Slide 2(VLDB 2006)
Outline
1. Introduction
2. Deletions
3. Resizing
4. Experiments
5. Summary
Random Sampling
• Database applications
  – huge data sets
  – complex algorithms (space & time)
• Requirements
  – performance, performance, performance
• Random sampling
  – approximate query answering
  – data mining
  – data stream processing
  – query optimization
  – data integration
Turnover in Europe (TPC-H)
Sample | Estimate | Error | Time
1% | 8.46 Mil. | 0.15 Mil. | 4s
10% | 8.51 Mil. | 0.05 Mil. | 52s
100% | 8.54 Mil. | | 200s
The Problem Space
• Setting
  – arbitrary data sets
  – samples of the data
  – evolving data
• Scope of this talk
  – maintenance of random samples
Can we minimize or even avoid access to base data?
[Figure: Data and Sample; Compute (from data to sample); Apply D (to the data and to the sample)]
Types of Data Sets
• Data sets
  – variation of data set size
  – influence on sampling
• Stable data set
  – goal: stable sample
• Growing data set
  – goal: controlled growing sample
• Shrinking data set
  – uninteresting
Uniform Sampling
• Uniform sampling
  – all samples of the same size are equally likely
  – many statistical procedures assume uniformity
  – flexibility
• Example
  – a data set (also called population): {1, 2, 3, 4}
  – possible samples of size 2: {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}
  – each sample has probability 1/6 (≈ 16.7%)
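The uniformity example can be checked by enumeration. The following is a minimal sketch; the function name `uniform_sample_distribution` is a hypothetical helper for illustration and does not come from the talk.

```python
from fractions import Fraction
from itertools import combinations


def uniform_sample_distribution(population, size):
    """Map every possible sample of the given size to its probability
    under uniform sampling (all samples equally likely)."""
    samples = list(combinations(sorted(population), size))
    return {s: Fraction(1, len(samples)) for s in samples}


# The population from the slide: four elements, samples of size 2.
dist = uniform_sample_distribution({1, 2, 3, 4}, 2)
for sample, prob in dist.items():
    print(sample, prob)
```

Each of the six samples is listed with probability 1/6, which the slide rounds down to 16%.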
Reservoir Sampling
• Reservoir sampling
  – computes a uniform sample of M elements
  – building block for many sophisticated sampling schemes
  – single-scan algorithm
    • add the first M elements
    • afterwards, flip a coin
      a) ignore the element (reject)
      b) replace a random element in the sample (accept)
  – accept probability of the i-th element
    $P(t_i \text{ is accepted}) = \frac{M}{i} = \frac{\text{sample size}}{\text{population size}}$
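The single-scan procedure can be sketched in a few lines. This is a minimal illustration under the assumption that the data arrives as a Python iterable; the function name `reservoir_sample` and the optional `rng` parameter are choices made here, not part of the talk.

```python
import random


def reservoir_sample(stream, M, rng=random):
    """Return a uniform sample of at most M elements from a single scan."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= M:
            # The first M elements are always added.
            sample.append(item)
        elif rng.random() < M / i:
            # Accept the i-th element with probability M / i and
            # let it replace a random element of the sample.
            sample[rng.randrange(M)] = item
        # Otherwise the element is ignored (rejected).
    return sample


print(reservoir_sample(range(1, 101), 5))
```

The sample never exceeds M elements and no element is read twice, which is why the scheme works as a building block for stream and database sampling.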
Reservoir Sampling (Example)
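• Example
  – sample size M = 2
  – +t1, +t2: the sample is {1,2} (100%)
  – +t3: resulting samples {1,2}, {3,2}, {1,3}, each with probability 1/3 (33%)
  – +t4: each sample is kept with probability 2/4; t4 replaces either element with probability 1/4 each
    • {1,2} → {1,2} (2/4), {4,2} (1/4), {1,4} (1/4)
    • {3,2} → {3,2} (2/4), {4,2} (1/4), {3,4} (1/4)
    • {1,3} → {1,3} (2/4), {4,3} (1/4), {1,4} (1/4)
  – path probabilities: 16% for each kept sample, 8% for each replacement; every sample of size 2 ends up with probability 1/6 (≈ 16.7%)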
Problems with Reservoir Sampling
• Problems with reservoir sampling
  – lacks support for deletions (stable data sets)
  – cannot efficiently enlarge sample (growing data sets)
Naïve/Prior Approaches
Algorithm | Technique | Comments
RS with deletions | conduct deletions, continue with smaller sample | unstable
Naïve | use insertions to immediately refill the sample | not uniform
Backing sample | let sample size decrease, but occasionally recompute | expensive, unstable
CAR(WOR) | immediately sample from base data to refill the sample | stable but expensive
Bernoulli sampling with purging | “coin flip” sampling with deletions, purge if too large | inexpensive but unstable
Passive sampling | developed for data streams (sliding windows only) | special case of our RP algorithm
Distinct-value sampling | tailored for multiset populations | expensive, low space efficiency in our setting
Random Pairing
• Random pairing
  – compensates deletions with arriving insertions
  – corrects inclusion probabilities
• General idea (insertion)
  – no uncompensated deletions → reservoir sampling
  – otherwise,
    • randomly select an uncompensated deletion (partner)
    • compensate it: Was it in the sample?
      – yes → add arriving element to sample
      – no → ignore arriving element
Random Pairing
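• Example
  – +t1, +t2, +t3: samples {1,2}, {3,2}, {1,3}, each with probability 1/3
  – -t2: {1,2} → {1}, {3,2} → {3}, {1,3} → {1,3} (probability 1 each)
  – -t3: {1} → {1}, {3} → {}, {1,3} → {1} (probability 1 each)
  – +t4: two uncompensated deletions (t2, t3); the partner is t2 or t3 with probability 1/2 each
    • {1} (from {1,2}) → {1,4} if the partner is t2, {1} if the partner is t3
    • {} (from {3,2}) → {4} for either partner
    • {1} (from {1,3}) → {1} if the partner is t2, {1,4} if the partner is t3
  – +t5: one uncompensated deletion remains, so the outcome is determined (probability 1)
    • {1,4} → {1,4}, {1} → {1,5}, {4} → {4,5}
  – six paths with probability 1/6 (≈ 16.7%) each; {1,4}, {1,5} and {4,5} each have probability 1/3, so the sample is uniform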
Random Pairing
• Details of the algorithm
  – keeping history of deleted items is expensive, but:
    $P(t_i \text{ is added}) = P(\text{random partner was in sample}) = \frac{c_1}{d} = \frac{\#\,\text{uncompensated deletions in sample}}{\#\,\text{uncompensated deletions}}$
  – maintenance of two counters suffices
  – correctness proof is in the paper
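The two-counter idea can be sketched as follows. This is an illustrative reading of the slides, not the authors' code: the class name `RandomPairingSample`, the counter name `c2` (uncompensated deletions that were not in the sample, so that d = c1 + c2) and the list-based sample are assumptions made for this sketch.

```python
import random


class RandomPairingSample:
    """Bounded-size uniform sample under insertions and deletions (sketch)."""

    def __init__(self, M, rng=None):
        self.M = M                      # upper bound on the sample size
        self.rng = rng or random.Random()
        self.sample = []
        self.size = 0                   # current size of the data set
        self.c1 = 0                     # uncompensated deletions that were in the sample
        self.c2 = 0                     # uncompensated deletions that were not (assumed name)

    def insert(self, item):
        self.size += 1
        d = self.c1 + self.c2
        if d == 0:
            # No uncompensated deletions: ordinary reservoir step.
            if len(self.sample) < self.M:
                self.sample.append(item)
            elif self.rng.random() < self.M / self.size:
                self.sample[self.rng.randrange(self.M)] = item
        elif self.rng.random() < self.c1 / d:
            # The random partner was in the sample: add the new element.
            self.sample.append(item)
            self.c1 -= 1
        else:
            # The random partner was not in the sample: ignore the element.
            self.c2 -= 1

    def delete(self, item):
        self.size -= 1
        if item in self.sample:         # linear scan, fine for a sketch
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1


rp = RandomPairingSample(2, random.Random(0))
for op, t in [("+", 1), ("+", 2), ("+", 3), ("-", 2), ("-", 3), ("+", 4), ("+", 5)]:
    rp.insert(t) if op == "+" else rp.delete(t)
print(sorted(rp.sample))
```

The update sequence replays the slide example. Repeating it many times should yield {1,4}, {1,5} and {4,5} about equally often, without ever reading the base data.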
Growing Data Sets
• The problem
  – growing data set
[Figure: two panels. Data set: growing data set. Random pairing: stable sample, sampling fraction decreases]
A Negative Result
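• Negative result
  – There is no resizing algorithm which can enlarge a bounded-size sample without ever accessing base data.
• Example
  – data set: {1, 2, 3, 4}
  – samples of size 2: {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}, each with probability 1/6 (≈ 16.7%)
  – new data set: {1, 2, 3, 4, 5, 6, ...}
  – samples of size 3: {1,2,3} has probability 0%, {1,2,5} has probability > 0%
  – without base data access, only newly inserted elements can enter the sample, so at most two of the elements 1 to 4 are ever sampled
  – Not uniform!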
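The counting argument behind the negative result can be made concrete with a small enumeration. This sketch assumes a single new element 5 and models "no base data access" as "the size-2 sample may only be extended by newly arrived elements"; the variable names are illustrative.

```python
from itertools import combinations

old_items = {1, 2, 3, 4}
new_items = {5}

# Every size-3 sample a uniform scheme over {1, ..., 5} must be able to produce.
required = set(combinations(sorted(old_items | new_items), 3))

# Without base data access, a size-2 sample of the old data can only be
# enlarged with newly arrived elements.
reachable = {
    tuple(sorted(s + (n,)))
    for s in combinations(sorted(old_items), 2)
    for n in new_items
}

unreachable = sorted(required - reachable)
print(len(required), len(reachable), unreachable)
```

Four of the ten required samples, namely those consisting of old elements only, can never be produced, so the enlarged sample cannot be uniform.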
Resizing
• Goal
  – efficiently increase sample size
  – stay within an upper bound at all times
• General idea
  1. convert sample to Bernoulli sample
  2. continue Bernoulli sampling until new sample size is reached
  3. convert back to reservoir sample
• Optimally balance cost
  – cost of base data accesses (in step 1)
  – time to reach new sample size (in step 2)
Resizing
• Bernoulli sampling
  – uniform sampling scheme
  – each tuple is added to the sample with probability q
  – sample size follows binomial distribution → no effective upper bound
• Phase 1: Conversion to a Bernoulli sample
  – given q, randomly determine sample size
  – reuse reservoir sample to create Bernoulli sample
    • subsample
    • sample additional tuples (base data access)
  – choice of q
    • small → fewer base data accesses
    • large → more base data accesses
Resizing
• Phase 2: Run Bernoulli sampling
  – accept new tuples with probability q
  – conduct deletions
  – stop as soon as new sample size is reached
• Phase 3: Revert to reservoir sampling
  – switchover is trivial
• Choosing q
  – determines cost of Phase 1 and Phase 2
  – goal: minimize total cost
    • base data access expensive → small q
    • base data access cheap → large q
  – details in paper
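The phases can be sketched roughly as follows. This is a simplified illustration and not the algorithm from the paper: the helper names (`binomial`, `to_bernoulli`, `resize`), the in-memory `dataset` list standing in for base data, and capping the Phase 1 size at the new bound are all assumptions made here. The choice of q is taken as given.

```python
import random


def binomial(n, q, rng):
    """Draw from Binomial(n, q) by n coin flips (slow, but dependency-free)."""
    return sum(rng.random() < q for _ in range(n))


def to_bernoulli(sample, dataset, q, new_M, rng):
    """Phase 1: turn a uniform sample into a Bernoulli(q) sample."""
    # Assumption: cap the randomly determined size at the new bound.
    target = min(binomial(len(dataset), q, rng), new_M)
    if target <= len(sample):
        return rng.sample(sample, target)           # subsample, no base data access
    in_sample = set(sample)
    rest = [t for t in dataset if t not in in_sample]
    extra = rng.sample(rest, target - len(sample))  # base data access
    return sample + extra


def resize(sample, dataset, new_M, q, updates, rng):
    """Phases 1 and 2; Phase 3 simply resumes bounded-size sampling with new_M."""
    sample = to_bernoulli(sample, dataset, q, new_M, rng)
    for op, item in updates:
        if len(sample) >= new_M:
            break                                   # new sample size reached
        if op == "+":
            if rng.random() < q:                    # accept with probability q
                sample.append(item)
        elif item in sample:                        # conduct deletions
            sample.remove(item)
    return sample


rng = random.Random(1)
dataset = list(range(1000))
sample = rng.sample(dataset, 100)
updates = [("+", 1000 + i) for i in range(10000)]
print(len(resize(sample, dataset, 200, 0.15, updates, rng)))
```

A larger q makes Phase 1 fetch more tuples from `dataset` but shortens Phase 2; a smaller q does the opposite, which is the trade-off the slide describes.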
Resizing
• Example
  – resize by 30% if sampling fraction drops below 9%
  – dependent on costs of accessing base data
    • low costs: immediate resizing
    • moderate costs: combined solution
    • high costs: degenerates to Bernoulli sampling
Total Cost
• Total cost
  – stable dataset, 10M operations
  – sample size 100k, data access 10 times more expensive than sample access
[Figure: total cost; algorithms grouped into "Base data access" and "No base data access"]
Sample size
• Sample size
  – stable dataset, size 1M
  – sample size 100k
[Figure: sample size; algorithms grouped into "Base data access" and "No base data access"]
Summary
• Reservoir Sampling
  – lacks support for deletions
  – complete recomputation to enlarge the sample
• Random Pairing
  – uses arriving insertions to compensate for deletions
• Resizing
  – base data access cannot be avoided
  – minimizes total cost
• Future work
  – better q for resizing
  – combine with existing techniques [4,8,17] to enhance flexibility, scalability
Thank you!
Questions?
Backup: Bounded-Size Sampling
• Why sampling?
  – performance, performance, performance
• How much to sample?
  – influencing factors
    1. storage consumption
    2. response time
    3. accuracy
  – choosing the sample size / sampling fraction
    1. largest sample that meets storage requirements
    2. largest sample that meets response time requirements
    3. smallest sample that meets accuracy requirements
Backup: Bounded-Size Sampling
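• Example
  – random pairing vs. Bernoulli sampling
  – average estimation
[Figure: three panels. Data set. Sample size: BS violates 1, 2. Standard error: BS violates 3]
  $\text{standard error} = \sqrt{\frac{\mathrm{Var}}{n}\left(1 - \frac{n}{N}\right)}$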
Backup: Distinct-Value Sampling
• Distinct-value sampling (optimistic setting for DV)
  – DV-scheme knows avg. dataset size in advance
  – assume no storage for counters & hash functions
[Figure: two panels. Sample size: RP has better memory utilization. Execution time: RP is significantly faster]
Backup: RS With Deletions
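• Reservoir sampling with deletions
  – conduct deletions, continue with smaller sample size
• Example
  – +t1, +t2, +t3: samples {1,2}, {3,2}, {1,3}, each with probability 1/3
  – -t2: {1}, {3}, {1,3} (probability 1 each)
  – -t3: {1}, {}, {1} (probability 1 each)
  – +t4: {1} → {1} (1/2), {4} (1/2); {} → {} (1)
  – +t5: {1} → {1} (2/3), {5} (1/3); {4} → {4} (2/3), {5} (1/3); {} → {} (1)
  – path probabilities: 11% (four paths), 5.5% (four paths), 33% (empty sample)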
Backup: Backing Sample
• Evaluation
  – data set consists of 1 million elements (on average)
  – 100k sample, clustered insertions/deletions
[Figure: three panels. Data set: stable. Reservoir sampling: sample is empty eventually. Backing sample: expensive, unstable]
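Backup: An Incorrect Approach
• Idea
  – use arriving insertions to refill the sample
• Example
  – +t1, +t2, +t3: samples {1,2}, {3,2}, {1,3}, each with probability 1/3
  – -t2: {1}, {3}, {1,3} (probability 1 each)
  – -t3: {1}, {}, {1} (probability 1 each)
  – +t4 (refill, probability 1): {1,4}, {4}, {1,4}
  – +t5: {4} → {4,5} (1); {1,4} → {1,4} (1/3), {5,4} (1/3), {1,5} (1/3)
  – path probabilities: 33% for {4} → {4,5}, 11% for each of the six other paths
  – Not uniform!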
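The bias of the refill approach can be reproduced with a short simulation. The sketch below replays the update sequence of the example with sample size 2; the function name `naive_refill_run` and the encoding of updates as `("+", t)` / `("-", t)` pairs are assumptions made for illustration.

```python
import random


def naive_refill_run(ops, M, rng):
    """Reservoir sampling that refills free slots with arriving insertions."""
    sample, size = [], 0
    for op, t in ops:
        if op == "+":
            size += 1
            if len(sample) < M:
                sample.append(t)                  # refill immediately (the flaw)
            elif rng.random() < M / size:
                sample[rng.randrange(M)] = t      # ordinary reservoir step
        else:
            size -= 1
            if t in sample:
                sample.remove(t)
    return tuple(sorted(sample))


ops = [("+", 1), ("+", 2), ("+", 3), ("-", 2), ("-", 3), ("+", 4), ("+", 5)]
rng = random.Random(0)
counts = {}
for _ in range(9000):
    key = naive_refill_run(ops, 2, rng)
    counts[key] = counts.get(key, 0) + 1
print(counts)
```

Summing the path probabilities gives 5/9 for {4,5} and 2/9 each for {1,4} and {1,5}, so the simulated counts should be far from the uniform 1/3 each.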
Backup: Random Pairing
• Evaluation
  – data set consists of 1 million elements (on average)
  – 100k sample, clustered insertions/deletions
[Figure: three panels. Data set: stable. Reservoir sampling: sample gets empty eventually. Random pairing: no base data access!]
Backup: Average Sample Size
• Average sample size
  – stable dataset, 10M operations
  – sample size 100k
Backup: Average Sample Size With Clustered Insertions/Deletions
• Average sample size with clustered insertions/deletions
  – stable dataset, size 10M, ~8M operations
  – sample size 100k
Backup: Cost
• Cost
  – stable dataset, 10M operations
  – sample size 100k
Backup: Cost With Clustered Insertions/Deletions
• Cost with clustered insertions/deletions
  – stable dataset, size 10M, ~8M operations
  – sample size 100k
Backup: Resizing (Value of q)
• Resizing
  – enlarge sample from 100k to 200k
  – base data access 10ms, arrival rate 1ms