A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets

Rainer Gemulla (University of Technology Dresden)
Wolfgang Lehner (University of Technology Dresden)
Peter J. Haas (IBM Almaden Research Center)

Faculty of Computer Science, Institute System Architecture, Database Technology Group


(VLDB 2006)

Outline

1. Introduction

2. Deletions

3. Resizing

4. Experiments

5. Summary


Random Sampling

• Database applications
  – huge data sets
  – complex algorithms (space & time)
• Requirements
  – performance, performance, performance
• Random sampling
  – approximate query answering
  – data mining
  – data stream processing
  – query optimization
  – data integration

Turnover in Europe (TPC-H)

Sample   Turnover    Error       Time
1%       8.46 Mil.   0.15 Mil.   4s
10%      8.51 Mil.   0.05 Mil.   52s
100%     8.54 Mil.               200s


The Problem Space

• Setting
  – arbitrary data sets
  – samples of the data
  – evolving data
• Scope of this talk
  – maintenance of random samples

Can we minimize or even avoid access to base data?

[Figure: diagram with the labels "Data", "Sample" and "Compute"]


Types of Data Sets

• Data sets
  – variation of data set size
  – influence on sampling
• Stable
  – goal: stable sample
• Growing
  – goal: controlled growing sample
• Shrinking
  – uninteresting


Uniform Sampling

• Uniform sampling
  – all samples of the same size are equally likely
  – many statistical procedures assume uniformity
  – flexibility
• Example
  – a data set (also called population): {1, 2, 3, 4}
  – possible samples of size 2: {1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4}
  – each sample has probability 1/6 (≈ 16.7%)
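The example can be reproduced with a few lines of Python. This is only an illustrative sketch; the variable names are not taken from the talk.

```python
from fractions import Fraction
from itertools import combinations

population = [1, 2, 3, 4]
sample_size = 2

# All subsets of the population with exactly sample_size elements.
samples = list(combinations(population, sample_size))

# Under a uniform scheme, every subset of the same size is equally likely.
probability = Fraction(1, len(samples))

for s in samples:
    print(set(s), probability)
```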


Reservoir Sampling

• Reservoir sampling
  – computes a uniform sample of M elements
  – building block for many sophisticated sampling schemes
  – single-scan algorithm
    • add the first M elements
    • afterwards, flip a coin
      a) ignore the element (reject)
      b) replace a random element in the sample (accept)
  – accept probability of the ith element

$$\Pr(i\text{th element is accepted}) = \frac{\text{sample size}}{\text{population size}} = \frac{M}{i}$$
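The single-scan algorithm can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for the example.

```python
import random


def reservoir_sample(stream, m, rng=None):
    """Return a uniform sample of at most m elements from an iterable."""
    if rng is None:
        rng = random.Random()
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= m:
            # The first m elements are always added.
            sample.append(item)
        elif rng.random() < m / i:
            # Accept with probability m / i: replace a random element.
            sample[rng.randrange(m)] = item
        # Otherwise the element is rejected and the sample is unchanged.
    return sample


print(reservoir_sample(range(1, 101), 5, random.Random(0)))
```

Passing a seeded `random.Random` instance makes a run repeatable, which is convenient when comparing sampling schemes.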


Reservoir Sampling (Example)

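• Example
  – sample size M = 2
  – +t1, +t2: sample {1, 2} (100%)
  – +t3: {1, 2} is kept with probability 1/3; {3, 2} and {1, 3} arise with probability 1/3 each (33% per sample)
  – +t4: each sample is kept with probability 2/4; each of its two elements is replaced with probability 1/4
    • from {1, 2}: {1, 2} (2/4), {4, 2} (1/4), {1, 4} (1/4)
    • from {3, 2}: {3, 2} (2/4), {4, 2} (1/4), {3, 4} (1/4)
    • from {1, 3}: {1, 3} (2/4), {4, 3} (1/4), {1, 4} (1/4)
  – leaves of the tree: 1/6 (≈ 16.7%) for each kept sample, 1/12 (≈ 8.3%) for each replacement; summed per sample, all six samples of size 2 have probability 1/6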
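The probabilities in this example can be checked exactly by propagating the distribution over samples through each insertion. The sketch below uses exact fractions; the function name `reservoir_distribution` is an assumption made for illustration.

```python
from collections import defaultdict
from fractions import Fraction


def reservoir_distribution(n, m):
    """Exact distribution over samples after inserting elements 1..n."""
    # The first m elements are added with certainty.
    dist = {frozenset(range(1, min(n, m) + 1)): Fraction(1)}
    for i in range(m + 1, n + 1):
        accept = Fraction(m, i)
        nxt = defaultdict(Fraction)
        for sample, p in dist.items():
            # Reject: the sample stays as it is.
            nxt[sample] += p * (1 - accept)
            # Accept: each of the m elements is the victim with probability 1/m.
            for victim in sample:
                nxt[(sample - {victim}) | {i}] += p * accept / m
        dist = dict(nxt)
    return dist


result = reservoir_distribution(4, 2)
for sample, p in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(sample), p)
```

For n = 4 and M = 2 every one of the six samples ends up with probability 1/6, which matches the uniformity claim.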


Problems with Reservoir Sampling

• Problems with reservoir sampling
  – lacks support for deletions (stable data sets)
  – cannot efficiently enlarge sample (growing data sets)


Naïve/Prior Approaches

Algorithm                        | Technique                                                | Comments
Naïve                            | use insertions to immediately refill the sample          | not uniform
Backing sample                   | let sample size decrease, but occasionally recompute     | expensive, unstable
CAR(WOR)                         | immediately sample from base data to refill the sample   | stable but expensive
Bernoulli sampling with purging  | “coin flip” sampling with deletions, purge if too large  | inexpensive but unstable
Passive sampling                 | developed for data streams (sliding windows only)        | special case of our RP algorithm
Distinct-value sampling          | tailored for multiset populations                        | expensive, low space efficiency in our setting
RS with deletions                | conduct deletions, continue with smaller sample          | unstable


Random Pairing

• Random pairing
  – compensates deletions with arriving insertions
  – corrects inclusion probabilities
• General idea (insertion)
  – no uncompensated deletions → reservoir sampling
  – otherwise,
    • randomly select an uncompensated deletion (partner)
    • compensate it: Was it in the sample?
      – yes → add arriving element to sample
      – no → ignore arriving element


Random Pairing

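• Example
  – +t1, +t2, +t3: samples {1, 2}, {3, 2}, {1, 3} (1/3 each)
  – -t2: samples {1}, {3}, {1, 3}
  – -t3: samples {1}, {}, {1}
  – +t4, +t5: each arriving insertion is paired with one of the uncompensated deletions; the two branches of a pairing have probability 1/2 each
  – all six leaves have probability 1/6 (≈ 16.7%)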


Random Pairing

• Details of the algorithm
  – keeping history of deleted items is expensive, but:

$$\Pr(\text{is added}) = \Pr(\text{random partner was in sample}) = \frac{\#\,\text{uncompensated deletions in sample}}{\#\,\text{uncompensated deletions}}$$

  – maintenance of two counters suffices
  – correctness proof is in the paper
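A minimal sketch of random pairing with two counters follows. It assumes distinct items and a sample small enough that a list is acceptable. The class name `RandomPairingSample` and the attribute names `c1` and `c2` are illustrative choices, not names from the slides.

```python
import random


class RandomPairingSample:
    """Bounded-size sample maintained under insertions and deletions."""

    def __init__(self, max_size, rng=None):
        self.max_size = max_size  # upper bound M on the sample size
        self.rng = rng if rng is not None else random.Random()
        self.sample = []
        self.dataset_size = 0
        self.c1 = 0  # uncompensated deletions that were in the sample
        self.c2 = 0  # uncompensated deletions that were not in the sample

    def insert(self, item):
        self.dataset_size += 1
        if self.c1 + self.c2 == 0:
            # No uncompensated deletions: ordinary reservoir sampling step.
            if len(self.sample) < self.max_size:
                self.sample.append(item)
            elif self.rng.random() < self.max_size / self.dataset_size:
                self.sample[self.rng.randrange(self.max_size)] = item
        elif self.rng.random() < self.c1 / (self.c1 + self.c2):
            # The random partner was in the sample: add the new item.
            self.c1 -= 1
            self.sample.append(item)
        else:
            # The random partner was not in the sample: ignore the new item.
            self.c2 -= 1

    def delete(self, item):
        self.dataset_size -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1


rp = RandomPairingSample(2, random.Random(0))
for t in (1, 2, 3):
    rp.insert(t)
rp.delete(2)
rp.delete(3)
rp.insert(4)
rp.insert(5)
print(sorted(rp.sample))
```

Once both deletions are compensated, the sample is back at its full size of two, and no base data was read at any point.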


Growing Data Sets

• The problem
  – growing data set

[Figure: panel "Data set": growing data set; panel "Random pairing": stable sample, sampling fraction decreases]


A Negative Result

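• Negative result
  – There is no resizing algorithm which can enlarge a bounded-size sample without ever accessing base data.
• Example
  – data set: {1, 2, 3, 4}
  – samples of size 2: {1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4} (probability 1/6 ≈ 16.7% each)
  – new data set: {1, 2, 3, 4, 5, 6, ...}
  – samples of size 3: {1, 2, 3} has probability 0%, because the old sample holds only two of the original elements and a third cannot be recovered without base data; {1, 2, 5} has probability > 0%
  – Not uniform!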


Resizing

• Goal
  – efficiently increase sample size
  – stay within an upper bound at all times
• General idea
  1. convert sample to Bernoulli sample
  2. continue Bernoulli sampling until new sample size is reached
  3. convert back to reservoir sample
• Optimally balance cost
  – cost of base data accesses (in step 1)
  – time to reach new sample size (in step 2)


Resizing

• Bernoulli sampling
  – uniform sampling scheme
  – each tuple is added to the sample with probability q
  – sample size follows binomial distribution → no effective upper bound
• Phase 1: Conversion to a Bernoulli sample
  – given q, randomly determine sample size
  – reuse reservoir sample to create Bernoulli sample
    • subsample
    • sample additional tuples (base data access)
  – choice of q
    • small → fewer base data accesses
    • large → more base data accesses
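Phase 1 might look like the sketch below. It is a simplified illustration under stated assumptions, not the procedure from the paper: the function name `to_bernoulli_sample` and the callback `fetch_additional(k, exclude)` are hypothetical, and the callback is assumed to return k tuples drawn uniformly from the base data outside `exclude`. The binomial size is drawn by summing coin flips to stay within the standard library.

```python
import random


def to_bernoulli_sample(reservoir, dataset_size, q, fetch_additional, rng=None):
    """Convert a uniform fixed-size sample into a Bernoulli(q) sample."""
    if rng is None:
        rng = random.Random()
    reservoir = list(reservoir)
    # The size of a Bernoulli(q) sample over N tuples is Binomial(N, q).
    target = sum(rng.random() < q for _ in range(dataset_size))
    if target <= len(reservoir):
        # Subsample: a uniform subset of a uniform sample is again uniform.
        return rng.sample(reservoir, target)
    # Too few tuples in the reservoir: the rest requires base data access.
    extra = fetch_additional(target - len(reservoir), set(reservoir))
    return reservoir + list(extra)


base = list(range(100))


def fetch_from_base(k, exclude):
    # Hypothetical stand-in for base data access.
    pool = [x for x in base if x not in exclude]
    return random.Random(7).sample(pool, k)


print(len(to_bernoulli_sample(base[:10], len(base), 0.05, fetch_from_base, random.Random(3))))
```

A small q makes it likely that subsampling alone suffices, while a large q pushes the drawn size above the reservoir size and triggers base data access.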


Resizing

• Phase 2: Run Bernoulli sampling
  – accept new tuples with probability q
  – conduct deletions
  – stop as soon as new sample size is reached
• Phase 3: Revert to reservoir sampling
  – switchover is trivial
• Choosing q
  – determines cost of Phase 1 and Phase 2
  – goal: minimize total cost
    • base data access expensive → small q
    • base data access cheap → large q
  – details in paper
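Phase 2 can be sketched as a loop over the incoming operations. The function name `run_bernoulli_phase` and the encoding of operations as `("insert", item)` or `("delete", item)` pairs are assumptions made for this illustration.

```python
import random


def run_bernoulli_phase(sample, operations, q, target_size, rng=None):
    """Apply Bernoulli sampling until the sample reaches target_size."""
    if rng is None:
        rng = random.Random()
    sample = set(sample)
    for op, item in operations:
        if len(sample) >= target_size:
            # New sample size reached: hand over to reservoir sampling.
            break
        if op == "insert":
            # Accept the new tuple with probability q.
            if rng.random() < q:
                sample.add(item)
        elif op == "delete":
            # Deletions are simply carried out on the sample.
            sample.discard(item)
    return sample


ops = [("insert", 3), ("delete", 1), ("insert", 4), ("insert", 5), ("insert", 6)]
print(sorted(run_bernoulli_phase({1, 2}, ops, 0.5, 4, random.Random(0))))
```

The larger q is, the fewer arrivals are needed before the target size is reached, which is the time cost that has to be balanced against the base data accesses of Phase 1.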


Resizing

• Example
  – resize by 30% if sampling fraction drops below 9%
  – dependent on costs of accessing base data
    • low costs: immediate resizing
    • moderate costs: combined solution
    • high costs: degenerates to Bernoulli sampling


Total Cost

• Total cost
  – stable dataset, 10M operations
  – sample size 100k, data access 10 times more expensive than sample access

[Figure: total cost, grouped into "Base data access" and "No base data access"]


Sample size

• Sample size
  – stable dataset, size 1M
  – sample size 100k

[Figure: sample size, grouped into "Base data access" and "No base data access"]


Summary

• Reservoir Sampling
  – lacks support for deletions
  – complete recomputation to enlarge the sample
• Random Pairing
  – uses arriving insertions to compensate for deletions
• Resizing
  – base data access cannot be avoided
  – minimizes total cost
• Future work
  – better q for resizing
  – combine with existing techniques [4,8,17] to enhance flexibility, scalability


Thank you!

Questions?


Backup: Bounded-Size Sampling

• Why sampling?
  – performance, performance, performance
• How much to sample?
  – influencing factors
    1. storage consumption
    2. response time
    3. accuracy
  – choosing the sample size / sampling fraction
    1. largest sample that meets storage requirements
    2. largest sample that meets response time requirements
    3. smallest sample that meets accuracy requirements


Backup: Bounded-Size Sampling

• Example
  – random pairing vs. Bernoulli sampling (BS)
  – average estimation

[Figure: panels "Data set", "Sample size" (BS violates 1, 2) and "Standard error" (BS violates 3)]


Backup: Distinct-Value Sampling

• Distinct-value sampling (optimistic setting for DV)
  – DV-scheme knows avg. dataset size in advance
  – assume no storage for counters & hash functions

[Figure: panels "Sample size" (RP has better memory utilization) and "Execution time" (RP is significantly faster)]


Backup: RS With Deletions

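• Reservoir sampling with deletions
  – conduct deletions, continue with smaller sample size
• Example
  – +t1, +t2, +t3: samples {1, 2}, {3, 2}, {1, 3} (1/3 each)
  – -t2: samples {1}, {3}, {1, 3}
  – -t3: samples {1}, {}, {1}
  – +t4, +t5 from a sample {1}: t4 is accepted with probability 1/2, giving {1} or {4}; t5 is then accepted with probability 1/3, giving the leaves {1} (2/3), {5} (1/3), {4} (2/3), {5} (1/3)
  – +t4, +t5 from the empty sample: the sample stays empty (probability 1)
  – leaf probabilities: 11%, 5.5%, 11%, 5.5% below each of the two {1} branches and 33% for the empty sample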


Backup: Backing Sample

• Evaluation
  – data set consists of 1 million elements (on average)
  – 100k sample, clustered insertions/deletions

[Figure: panels "Data set" (stable), "Reservoir sampling" (sample is empty eventually) and "Backing sample" (expensive, unstable)]


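Backup: An Incorrect Approach

• Idea
  – use arriving insertions to refill the sample
• Example
  – +t1, +t2, +t3: samples {1, 2}, {3, 2}, {1, 3} (1/3 each)
  – -t2: samples {1}, {3}, {1, 3}
  – -t3: samples {1}, {}, {1}
  – +t4, +t5 from a sample {1}: t4 refills the sample to {1, 4}; t5 then yields {1, 4}, {5, 4}, {1, 5} (1/3 each)
  – +t4, +t5 from the empty sample: both insertions refill the sample, giving {4, 5}
  – leaf probabilities: 11% for each of the six leaves below the {1} branches and 33% for {4, 5}
• Not uniform!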
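The bias of this approach can be verified exactly for the operation sequence of the example. The sketch below propagates the distribution over samples with exact fractions; the function name `naive_refill_distribution` and the `("+", item)` / `("-", item)` encoding of operations are assumptions made for illustration.

```python
from collections import defaultdict
from fractions import Fraction


def naive_refill_distribution(operations, m):
    """Exact sample distribution when insertions refill a shrunken sample."""
    dist = {frozenset(): Fraction(1)}
    n = 0  # current data set size
    for op, item in operations:
        nxt = defaultdict(Fraction)
        if op == "+":
            n += 1
            for sample, p in dist.items():
                if len(sample) < m:
                    # Sample is below its bound: refill immediately.
                    nxt[sample | {item}] += p
                else:
                    # Sample is full: ordinary reservoir sampling step.
                    accept = Fraction(m, n)
                    nxt[sample] += p * (1 - accept)
                    for victim in sample:
                        nxt[(sample - {victim}) | {item}] += p * accept / m
        else:
            n -= 1
            for sample, p in dist.items():
                nxt[sample - {item}] += p
        dist = dict(nxt)
    return dist


ops = [("+", 1), ("+", 2), ("+", 3), ("-", 2), ("-", 3), ("+", 4), ("+", 5)]
result = naive_refill_distribution(ops, 2)
for sample, p in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(sample), p)
```

For this sequence the sample {4, 5} has probability 5/9 while {1, 4} and {1, 5} have 2/9 each, whereas a uniform scheme would give each of the three samples 1/3.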


Backup: Random Pairing

• Evaluation
  – data set consists of 1 million elements (on average)
  – 100k sample, clustered insertions/deletions

[Figure: panels "Data set" (stable), "Reservoir sampling" (sample becomes empty eventually) and "Random pairing" (no base data access!)]


Backup: Average Sample Size

• Average sample size
  – stable dataset, 10M operations
  – sample size 100k


Backup: Average Sample Size With Clustered Insertions/Deletions

• Average sample size with clustered insertions/deletions
  – stable dataset, size 10M, ~8M operations
  – sample size 100k


Backup: Cost

• Cost
  – stable dataset, 10M operations
  – sample size 100k


Backup: Cost With Clustered Insertions/Deletions

• Cost with clustered insertions/deletions
  – stable dataset, size 10M, ~8M operations
  – sample size 100k


Backup: Resizing (Value of q)

• Resizing
  – enlarge sample from 100k to 200k
  – base data access 10ms, arrival rate 1ms
