Deferred Maintenance of Disk-Based Random Samples
Rainer Gemulla (University of Technology Dresden)
Wolfgang Lehner (University of Technology Dresden)
Faculty of Computer Science, Institute System Architecture, Database Technology Group
Rainer Gemulla, Wolfgang Lehner Deferred Maintenance of Disk-Based Random Samples Slide 2
Outline
1. Introduction
2. Logging Schemes
3. Refresh Algorithms
4. Performance
5. Summary & Outlook
Random Sampling
• Analytical databases
  – huge data sets
  – complex algorithms
• Requirements
  – Performance, performance, performance!
• Random sampling
  – approximate query answering
  – data mining
  – data stream processing
  – query optimization
  – data integration
Turnover in Europe (TPC-H)
Sample size   Estimate    Error       Time
1%            8.46 Mil.   0.15 Mil.   4s
10%           8.51 Mil.   0.05 Mil.   52s
100%          8.54 Mil.               200s
Offline Sampling
• Precomputed samples
  – pros
    • avoid access to base data
    • used multiple times
    • arbitrary base data
    • versatile
  – cons
    • maintenance!!!
• Disk-based samples
  – many, large samples stored on disk
  – crash safe
  – typically space-restricted
  – challenges
    • sequential access is faster
    • blocking of data
Basics: Reservoir Sampling
• Sampling with space constraints
  – maintain a sample (reservoir) of M tuples
    • add the first M tuples
    • afterwards, roll a die
      a) ignore the tuple (reject)
      b) replace a random tuple in the sample (accept)
  – accept probability controls sampling scheme
  – building block for many sophisticated sampling schemes
• Example
  – dataset with 50 tuples (M=5)
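The reservoir scheme above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the disk-based variant the talk is about. It assumes the classic accept probability M/n for the n-th tuple, which yields a uniform sample; the function name `reservoir_sample` and the seed are illustrative.

```python
import random

def reservoir_sample(stream, M, rng):
    """Return a uniform random sample of up to M items from an iterable."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= M:
            reservoir.append(item)              # the first M tuples are always added
        elif rng.random() < M / n:              # accept the n-th tuple with probability M/n (assumed)
            reservoir[rng.randrange(M)] = item  # replace a random tuple in the sample
        # otherwise the tuple is rejected
    return reservoir

# The slide's example: a dataset with 50 tuples and M = 5.
sample = reservoir_sample(range(1, 51), M=5, rng=random.Random(42))
print(sample)
```

Every accepted tuple lands at a random position of the reservoir, which is what turns into random I/O once the sample lives on disk.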
Evolution of the Sample
Random I/O!!!
Full Logging
• Full Log
  – track all changes
  – log is written sequentially
  – log contains more information than needed
Candidate Logging
• Candidate log
  – track only changes which affect the sample
  – log is written sequentially
  – smaller logs
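A rough sketch of candidate logging, under the assumption that the accept/reject decision of reservoir sampling is taken at insertion time while the choice of the sample position is deferred to the refresh. The accept probability M/n, the function name `log_candidates`, and the parameter `seen` (number of tuples processed before this batch, assumed to be at least M) are illustrative and not taken from the slides.

```python
import random

def log_candidates(stream, M, seen, rng):
    """Return (candidate log, new tuple count) for a batch of insertions."""
    candidates = []
    n = seen
    for item in stream:
        n += 1
        if rng.random() < M / n:        # same test as reservoir sampling (assumed M/n)
            candidates.append(item)     # accepted: append to the sequential log
        # rejected tuples never reach the log
    return candidates, n

full_log = list(range(1000, 2000))      # 1000 insertions after 1000 earlier tuples
cand_log, n = log_candidates(full_log, M=100, seen=1000, rng=random.Random(1))
print(len(full_log), len(cand_log))
```

A full log would hold all 1000 insertions; the candidate log holds only the accepted ones, in insertion order.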
How to implement Candidate Refresh?
Naive Refresh
• Naive refresh
  – scan log file sequentially
  – write each element of the log to a random position in the sample
  – No improvement at all!
    • random access to sample
    • some elements are written more than once
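As a sketch, the naive refresh amounts to the loop below. The sample and the log are plain Python lists standing in for disk files, and the names `naive_refresh`, `s1`..`s5` and `c1`..`c11` are illustrative.

```python
import random

def naive_refresh(sample, log, rng):
    """Apply a candidate log in order; return the number of writes."""
    writes = 0
    for candidate in log:                   # sequential scan of the log
        pos = rng.randrange(len(sample))    # random position: random access to the sample
        sample[pos] = candidate
        writes += 1
    return writes

sample = ["s1", "s2", "s3", "s4", "s5"]
log = [f"c{i}" for i in range(1, 12)]       # 11 candidates
writes = naive_refresh(sample, log, random.Random(7))
print(writes, sample)
```

With 11 candidates and only 5 positions, at least 6 of the 11 writes are overwritten again later in the same refresh.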
Avoiding Multiple Writes
• Observation
  – each candidate can be overwritten by subsequent candidates only
  – last candidate is never overwritten
• Approach
  – scan log in reverse order
  – write a tuple only if its position in the sample has not been written before
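The reverse scan can be sketched as follows. This version tracks the already written positions in a set, so it still needs memory and still accesses the sample randomly; it only illustrates how multiple writes are avoided. The name `reverse_refresh` and the toy data are illustrative.

```python
import random

def reverse_refresh(sample, log, rng):
    """Scan the log backwards; write each sample position at most once."""
    written = set()
    for candidate in reversed(log):
        if len(written) == len(sample):
            break                           # every position already holds its final value
        pos = rng.randrange(len(sample))
        if pos not in written:              # a later candidate has not claimed this position
            sample[pos] = candidate
            written.add(pos)
        # otherwise the candidate would have been overwritten anyway: skip it
    return len(written)

sample = ["s1", "s2", "s3", "s4", "s5"]
log = [f"c{i}" for i in range(1, 12)]
writes = reverse_refresh(sample, log, random.Random(7))
print(writes, sample)
```

The last candidate is visited first and is always written, matching the observation above.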
Avoiding Multiple Writes
• Probability of overwrites
• In general
  – k tuples written to sample (k=0…5)
  – probability of overwrite: $p_k = (M-k)/M$
  – number of skipped tuples: $P(X_k = x) = (1-p_k)^x \, p_k$ (k>0)
  – $X_5 = \infty$ (since $p_5 = 0$)
  – here: $X_1 = 0$, $X_2 = 1$, $X_3 = 1$, $X_4 = 6$
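For the running example with M = 5, the formulas above can be evaluated directly. The expected value used here is the standard mean of a geometric distribution counting failures before the first success; it is not stated on the slide.

```latex
\begin{align*}
p_1 &= \tfrac{4}{5}, \quad p_2 = \tfrac{3}{5}, \quad p_3 = \tfrac{2}{5}, \quad p_4 = \tfrac{1}{5}, \quad p_5 = 0 \\
E[X_k] &= \frac{1 - p_k}{p_k} = \frac{k}{M - k} \\
E[X_1] &= \tfrac{1}{4}, \quad E[X_2] = \tfrac{2}{3}, \quad E[X_3] = \tfrac{3}{2}, \quad E[X_4] = 4
\end{align*}
```

The more positions are already written, the more candidates are skipped on average before the next write.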
Nomem Refresh
• Nomem Refresh (Phase 1)
– dry run: generate X4,…,X1 in advance
– reset pseudo-random number generator and generate same sequence again
– start at: |C|-X indexes of log file are generated
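The dry run and the reset of the pseudo-random number generator can be illustrated with a small sketch. It assumes the skip counts follow the geometric distribution given earlier and draws them by inverse transform using Python's `random.Random` with a hypothetical seed; the names `skip`, `log_indexes` and `dry_run` are illustrative. The indexes come out last-to-first here; how the original algorithm orders them for the forward scan is not spelled out in the slides.

```python
import math
import random

def skip(k, M, rng):
    """Draw X_k: candidates skipped before the next write, once k positions are written."""
    p = (M - k) / M
    if p >= 1.0:
        return 0                  # k = 0: the last candidate is always written
    u = 1.0 - rng.random()        # uniform in (0, 1]
    return int(math.log(u) / math.log(1.0 - p))   # inverse transform, geometric distribution

def log_indexes(log_len, M, rng):
    """Yield the 1-based log indexes of candidates that get written, last index first."""
    idx = log_len
    for k in range(M):
        idx -= skip(k, M, rng)
        if idx < 1:
            return                # the skip runs past the start of the log
        yield idx
        idx -= 1

def dry_run(log_len, M, seed):
    """First pass: keep only the number of writes and the smallest index."""
    count, start = 0, None
    for idx in log_indexes(log_len, M, random.Random(seed)):
        count, start = count + 1, idx
    return count, start

SEED = 2006                       # hypothetical seed
count, start = dry_run(11, 5, SEED)
# Resetting the generator to the same seed reproduces the same indexes without storing them.
again = list(log_indexes(11, 5, random.Random(SEED)))
print(count, start, again)
```

With the skip values from the slides (X1=0, X2=1, X3=1, X4=6) and a log of 11 candidates, `log_indexes` would yield 11, 10, 8, 6.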
Nomem Refresh
• Naive update of sample
  – read the log at the generated indexes
  – write each tuple to a random (free) position in the sample
  – drawbacks
    • free positions have to be maintained
    • random access to the sample
Nomem Refresh
• Nomem Refresh (Phase 2)
  – general idea: order of the tuples in sample is unimportant
  – algorithm
    • (re-)generate next position in the log (6, 8, 10, 11)
    • generate next position in the sample (1, 2, 3, 5)
    • read from log, write to sample
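Phase 2 can be sketched as below, taking the log positions as given (the slide's 6, 8, 10, 11). Because the order of tuples in the sample does not matter, the overwritten positions only need to form a uniformly random subset, which can be generated in ascending order. The sketch uses selection sampling for this, which is an assumption about one possible way to do it and not necessarily the authors' method; `sample_positions` and `phase2` are illustrative names, and sample positions are 0-based here while the slide counts from 1.

```python
import random

def sample_positions(k, M, rng):
    """Yield k of the positions 0..M-1 in ascending order, each k-subset equally likely."""
    needed = k
    for pos in range(M):
        if needed == 0:
            return
        # select pos with probability needed / (positions remaining)
        if rng.random() * (M - pos) < needed:
            yield pos
            needed -= 1

def phase2(sample, log, log_positions, rng):
    """Copy log entries at ascending 1-based log_positions into ascending sample positions."""
    slots = sample_positions(len(log_positions), len(sample), rng)
    for log_pos, slot in zip(log_positions, slots):
        sample[slot] = log[log_pos - 1]     # sequential read, sequential write

sample = ["s1", "s2", "s3", "s4", "s5"]
log = [f"c{i}" for i in range(1, 12)]
phase2(sample, log, [6, 8, 10, 11], random.Random(3))
print(sample)
```

Both the log and the sample are visited strictly front to back, and no set of free positions has to be kept.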
Nomem Refresh
• Properties
  – log file is read sequentially
  – sample is written sequentially
  – no overwrites
  – no memory consumption
  – works on full logs as well (DBMS!)
Experiments
• Number of operations & execution time
  – sample size: 1 million tuples
  – refresh period: 1 million operations
Experiments
• Refresh period & execution time
  – sample size: 1 million tuples
  – number of operations: 100 million
Summary & Outlook
• Logging schemes
  – full logs: often found in database systems
  – candidate logs: reduce log file size
• Nomem Refresh
  – fast incremental refresh
  – sequential disk access only
  – no memory consumption
  – works with full and candidate logs
• Future work
  – more detailed discussion of updates & deletions
Thank you!
Questions?
Extensions
• Extensions
  – nomem refresh for full logs (DBMS!)
    • dry run: compute candidates, count their number
    • reset random number generator
    • add skips of Nomem Refresh and Reservoir Sampling
  – deletions and updates
    • store deletions and updates separately
    • process delete and update log first
    • run Nomem Refresh on the insert log
    • requires disjoint logs
Experiments
• Comparison with the Geometric File
  – sample size: 1 million tuples
  – number of operations: 100 million