Deferred Maintenance of Disk-Based Random Samples

Rainer Gemulla (University of Technology Dresden)
Wolfgang Lehner (University of Technology Dresden)

Faculty of Computer Science, Institute System Architecture, Database Technology Group

Rainer Gemulla, Wolfgang Lehner Deferred Maintenance of Disk-Based Random Samples Slide 2

Outline

1. Introduction

2. Logging Schemes

3. Refresh Algorithms

4. Performance

5. Summary & Outlook

Slide 3

Random Sampling

• Analytical databases
  – huge data sets
  – complex algorithms

• Requirements
  – Performance, performance, performance!

• Random sampling
  – approximate query answering
  – data mining
  – data stream processing
  – query optimization
  – data integration

Turnover in Europe (TPCH)

Sample   Turnover    Error       Time
1%       8.46 Mil.   0.15 Mil.   4s
10%      8.51 Mil.   0.05 Mil.   52s
100%     8.54 Mil.               200s

Slide 4

Offline Sampling

• Precomputed samples
  – pros
    • avoid access to base data
    • used multiple times
    • arbitrary base data
    • versatile
  – cons
    • maintenance!!!

• Disk-based samples
  – many, large samples stored on disk
  – crash safe
  – typically space-restricted
  – challenges
    • sequential access is faster
    • blocking of data

Slide 5

Basics: Reservoir Sampling

• Sampling with space constraints
  – maintain a sample (reservoir) of M tuples
    • add the first M tuples
    • afterwards, roll a die for each new tuple
      a) ignore the tuple (reject)
      b) replace a random tuple in the sample (accept)
  – accept probability controls sampling scheme
  – building block for many sophisticated sampling schemes

• Example
  – dataset with 50 tuples (M=5)
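The reservoir scheme above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the authors' implementation: the function name `reservoir_sample` is hypothetical, and the accept probability M/n (the classic choice that yields a uniform sample) is an assumption, since the slide only says that the accept probability controls the sampling scheme.

```python
import random

def reservoir_sample(stream, M, rng):
    """Maintain a uniform random sample of at most M tuples from a stream."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= M:
            reservoir.append(item)              # the first M tuples are always added
        elif rng.random() < M / n:              # assumed accept probability M/n
            reservoir[rng.randrange(M)] = item  # accept: replace a random tuple
        # otherwise reject: the sample is left unchanged
    return reservoir

# The slide's setting: a dataset with 50 tuples and M = 5.
sample = reservoir_sample(range(1, 51), M=5, rng=random.Random(42))
print(sample)
```

Every accepted tuple lands at a random position of the reservoir, which is exactly the access pattern that becomes random I/O once the sample lives on disk.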

Slide 6

Evolution of the Sample

Random I/O!!!

Slide 7

Outline

1. Introduction

2. Logging Schemes

3. Refresh Algorithms

4. Performance

5. Summary & Outlook

Slide 8

Full Logging

• Full log
  – track all changes
  – log is written sequentially
  – log contains more information than needed

Slide 9

Candidate Logging

• Candidate log
  – track only changes which affect the sample
  – log is written sequentially
  – smaller logs
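The difference between the two logging schemes can be illustrated with a short sketch. It assumes an insert-only workload, a sample that is already full (`n_seen >= M`), and the uniform accept probability M/n from reservoir sampling; the function `append_to_logs` and its parameters are hypothetical names, and both logs are plain Python lists standing in for sequentially written files.

```python
import random

def append_to_logs(new_tuples, M, n_seen, rng):
    """Append insertions to a full log and a candidate log (lists stand in for files)."""
    full_log, candidate_log = [], []
    for t in new_tuples:
        n_seen += 1
        full_log.append(t)                # full log: every change is tracked
        if rng.random() < M / n_seen:     # reservoir accept test, done at logging time
            candidate_log.append(t)       # candidate log: accepted tuples only
    return full_log, candidate_log, n_seen

full_log, candidate_log, n_seen = append_to_logs(
    range(100, 200), M=10, n_seen=100, rng=random.Random(3))
print(len(full_log), len(candidate_log))
```

Both logs are append-only, so they are written sequentially; the candidate log is simply the subsequence of insertions that passed the accept test.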

How to implement Candidate Refresh?

Slide 10

Outline

1. Introduction

2. Logging Schemes

3. Refresh Algorithms

4. Performance

5. Summary & Outlook

Slide 11

Naive Refresh

• Naive refresh
  – scan log file sequentially
  – write each element of the log to a random position in the sample
  – No improvement at all!
    • random access to sample
    • some elements are written more than once
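A minimal sketch of the naive refresh makes both drawbacks visible. The name `naive_refresh` is hypothetical, and in-memory lists stand in for the sample and the candidate log on disk.

```python
import random

def naive_refresh(sample, candidate_log, rng):
    """Apply every logged candidate to a random sample position; return the write count."""
    writes = 0
    for t in candidate_log:                        # log is scanned sequentially
        sample[rng.randrange(len(sample))] = t     # random access to the sample
        writes += 1
    return writes

sample = ["old"] * 5
log = list(range(1, 12))                           # 11 candidates
writes = naive_refresh(sample, log, random.Random(1))
survivors = [t for t in sample if t != "old"]
print(writes, len(survivors))
```

With 11 candidates and only 5 positions, the refresh performs 11 writes although at most 5 of them can survive, so some positions are necessarily written more than once.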

Slide 12

Avoiding Multiple Writes

• Observation
  – each candidate can be overwritten by subsequent candidates only
  – last candidate is never overwritten

• Approach
  – scan log in reverse order
  – write a tuple only if its position in the sample has not been written before
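The observation can be checked with a small sketch that fixes the random positions in advance and applies them in both directions. The function names are hypothetical, and the `written` set is bookkeeping used only for this illustration.

```python
import random

def naive_refresh(sample, log, positions):
    """Forward scan: every candidate is written, later ones overwrite earlier ones."""
    for t, pos in zip(log, positions):
        sample[pos] = t
    return len(log)                                # number of writes

def reverse_refresh(sample, log, positions):
    """Reverse scan: write a candidate only if its position is still untouched."""
    written = set()
    for t, pos in zip(reversed(log), reversed(positions)):
        if pos not in written:
            sample[pos] = t
            written.add(pos)
        if len(written) == len(sample):            # every position is final already
            break
    return len(written)                            # number of writes

rng = random.Random(5)
log = list(range(1, 12))
positions = [rng.randrange(5) for _ in log]        # one random position per candidate
a, b = ["old"] * 5, ["old"] * 5
print(naive_refresh(a, log, positions), reverse_refresh(b, log, positions), a == b)
```

Both scans produce the same sample, but the reverse scan writes each position at most once.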

Slide 13

Avoiding Multiple Writes

• Probability of overwrites

• In general
  – k tuples written to sample (k=0…5)
  – probability of overwrite: $p_k = (M-k)/M$
  – number of skipped tuples: $P(X_k = x) = (1-p_k)^x \, p_k$ (k>0)
  – $X_5 = \infty$
  – here: $X_1 = 0$, $X_2 = 1$, $X_3 = 1$, $X_4 = 6$
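The skip distribution above is geometric, so skips can be drawn directly instead of testing every candidate. The sketch below draws $X_k$ by inverse transform, computes the mean $E[X_k] = (1-p_k)/p_k = k/(M-k)$, and replays the slide's example values. All function names are hypothetical, and the log length of 11 candidates is inferred from the example positions rather than stated on the slide.

```python
import math
import random

def geometric_skip(rng, p):
    """Draw X with P(X = x) = (1 - p)**x * p for x = 0, 1, 2, ..."""
    if p >= 1.0:
        return 0
    u = 1.0 - rng.random()                      # uniform on (0, 1]
    return int(math.log(u) / math.log1p(-p))    # inverse transform

def expected_skip(M, k):
    """E[X_k] = (1 - p_k) / p_k = k / (M - k), valid for 0 <= k < M."""
    return k / (M - k)

def written_log_positions(num_candidates, skips):
    """1-based log positions written by the reverse scan; skips[i] is X_{i+1}."""
    pos = num_candidates                        # the last candidate is always written
    positions = [pos]
    for x in skips:
        pos -= x + 1                            # skip x candidates, write the next one
        if pos < 1:                             # ran past the start of the log
            break
        positions.append(pos)
    return positions[::-1]

print([expected_skip(5, k) for k in range(1, 5)])
print(written_log_positions(11, [0, 1, 1, 6]))  # [6, 8, 10, 11]
```

With $X_4 = 6$ the scan runs past the start of the log, so only four tuples are written in this example.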

Slide 14

Nomem Refresh

• Nomem Refresh (Phase 1)

– dry run: generate X4,…,X1 in advance

– reset pseudo-random number generator and generate same sequence again

– start at: |C|-X indexes of log file are generated

Slide 15

Nomem Refresh

• Naive update of sample
  – read the log tuples at the generated indexes
  – write each of them to a random (free) position in the sample
  – drawbacks
    • free positions have to be maintained
    • random access to the sample

Slide 16

Nomem Refresh

• Nomem Refresh (Phase 2)
  – general idea: order of the tuples in sample is unimportant
  – algorithm
    • (re-)generate next position in the log (6, 8, 10, 11)
    • generate next position in the sample (1, 2, 3, 5)
    • read from log, write to sample
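Both phases can be put together in one hedged sketch. It is one possible reading of the slides, not the authors' code: the dry run sums the skips $X_{M-1},\dots,X_1$, the generator is reset by re-seeding, and leading skips are dropped during the replay until the start position falls inside the log. Two details are assumptions of this sketch: ascending sample positions are produced by selection sampling, and a second generator (`pos_rng`) is used for them. The names `nomem_refresh` and `geometric_skip` are hypothetical.

```python
import math
import random

def geometric_skip(rng, p):
    """Draw X with P(X = x) = (1 - p)**x * p for x = 0, 1, 2, ..."""
    if p >= 1.0:
        return 0
    u = 1.0 - rng.random()                       # uniform on (0, 1]
    return int(math.log(u) / math.log1p(-p))

def nomem_refresh(sample, log, seed, pos_rng):
    """Refresh `sample` in place from `log`; return the 1-based log positions read."""
    M, C = len(sample), len(log)
    if C == 0:
        return []

    # Phase 1, dry run: distance covered by the skips X_{M-1}, ..., X_1.
    rng = random.Random(seed)
    total = 0
    for k in range(M - 1, 0, -1):
        total += geometric_skip(rng, (M - k) / M) + 1

    # Reset the generator, replay the same sequence, and drop leading skips
    # until the start position C - total lies inside the log.
    rng = random.Random(seed)
    k = M - 1
    while k > 0 and C - total < 1:
        total -= geometric_skip(rng, (M - k) / M) + 1
        k -= 1

    # Phase 2: ascending log positions, ascending sample positions.
    log_pos = C - total                          # first log position to read
    needed = k + 1                               # number of tuples to write
    slot = 0                                     # next sample slot to consider
    read = []
    while needed > 0:
        # Selection sampling: take this slot with probability needed / (M - slot).
        while pos_rng.random() * (M - slot) >= needed:
            slot += 1
        sample[slot] = log[log_pos - 1]          # read from log, write to sample
        read.append(log_pos)
        slot += 1
        needed -= 1
        if k > 0:                                # regenerate X_k, move forward
            log_pos += geometric_skip(rng, (M - k) / M) + 1
            k -= 1
    return read

sample = ["old"] * 5
log = list(range(1, 12))
print(nomem_refresh(sample, log, seed=7, pos_rng=random.Random(1)), sample)
```

Apart from a handful of counters, the sketch keeps no state: the skips are never stored, only regenerated from the seed, and both the log and the sample are visited in ascending order.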

Slide 17

Nomem Refresh

• Properties
  – log file is read sequentially
  – sample is written sequentially
  – no overwrites
  – no memory consumption
  – works on full logs as well (DBMS!)

Slide 18

Outline

1. Introduction

2. Logging Schemes

3. Refresh Algorithms

4. Performance

5. Summary & Outlook

Slide 19

Experiments

• Number of operations & execution time
  – sample size: 1 million tuples
  – refresh period: 1 million operations

Slide 20

Experiments

• Refresh period & execution time
  – sample size: 1 million tuples
  – number of operations: 100 million

Slide 21

Outline

1. Introduction

2. Logging Schemes

3. Refresh Algorithms

4. Performance

5. Summary & Outlook

Slide 22

Summary & Outlook

• Logging schemes
  – full logs: often found in database systems
  – candidate logs: reduce log file size

• Nomem Refresh
  – fast incremental refresh
  – sequential disk access only
  – no memory consumption
  – works with full and candidate logs

• Future work
  – more detailed discussion of updates & deletions

Slide 23

Thank you!

Questions?

Slide 24

Extensions

• Extensions
  – nomem refresh for full logs (DBMS!)
    • dry run: compute candidates, count their number
    • reset random number generator
    • add skips of Nomem Refresh and Reservoir Sampling
  – deletions and updates
    • store deletions and updates separately
    • process delete and update log first
    • run Nomem Refresh on the insert log
    • requires disjoint logs

Slide 25

Experiments

• Comparison with the Geometric File
  – sample size: 1 million tuples
  – number of operations: 100 million

Slide 26

Experiments

• Computational overhead
  – sample size: 1 million tuples