SAMPLING-BASED RANDOMIZATION TECHNIQUES FOR APPROXIMATE QUERY PROCESSING
By
SHANTANU JOSHI
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2007
© 2007 SHANTANU JOSHI
To my parents, Dr Sharad Joshi and Dr Hemangi Joshi
ACKNOWLEDGMENTS
Firstly, my sincerest gratitude goes to my advisor, Professor Chris Jermaine for his
invaluable guidance and support throughout my PhD research work. During the initial
several months of my graduate work, Chris was extremely patient and always steered me
in the right direction whenever I wavered. His acute insight into the research
problems we worked on set an excellent example and provided me immense motivation to
work on them. He has always emphasized the importance of high-quality technical writing
and has spent several painstaking hours reading and correcting my technical manuscripts.
He has been the best mentor I could have hoped for and I shall always remain indebted to
him for shaping my career and more importantly, my thinking.
I am also very thankful to Professor Alin Dobra for his guidance during my graduate
study. His enthusiasm and constant willingness to help have always amazed me.
Thanks are also due to Professor Joachim Hammer for his support during the very
early days of my graduate study. I take this opportunity to thank Professors Tamer
Kahveci and Gary Koehler for taking the time to serve on my committee and for their
helpful suggestions.
It was a pleasure working with Subi Arumugam and Abhijit Pol on various collaborative
research projects. Several interesting technical discussions with Mingxi Wu, Fei Xu, Florin
Rusu, Laukik Chitnis and Seema Degwekar provided a stimulating work environment in
the Database Center.
This work would not have been possible without the constant encouragement and
support of my family. My parents, Dr Sharad Joshi and Dr Hemangi Joshi always
encouraged me to focus on my goals and pursue them against all odds. My brother,
Dr Abhijit Joshi has always placed trust in my abilities and has been an ideal example to
follow since my childhood. My loving sister-in-law, Dr Hetal Joshi has been supportive
since the time I decided to pursue computer science.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
   1.1 Approximate Query Processing (AQP) - A Different Paradigm
   1.2 Building an AQP System Afresh
       1.2.1 Sampling Vs Precomputed Synopses
       1.2.2 Architectural Changes
   1.3 Contributions in This Thesis

2 RELATED WORK
   2.1 Sampling-based Estimation
   2.2 Estimation Using Non-sampling Precomputed Synopses
   2.3 Analytic Query Processing Using Non-standard Data Models

3 MATERIALIZED SAMPLE VIEWS FOR DATABASE APPROXIMATION
   3.1 Introduction
   3.2 Existing Sampling Techniques
       3.2.1 Randomly Permuted Files
       3.2.2 Sampling from Indices
       3.2.3 Block-based Random Sampling
   3.3 Overview of Our Approach
       3.3.1 ACE Tree Leaf Nodes
       3.3.2 ACE Tree Structure
       3.3.3 Example Query Execution in ACE Tree
       3.3.4 Choice of Binary Versus k-Ary Tree
   3.4 Properties of the ACE Tree
       3.4.1 Combinability
       3.4.2 Appendability
       3.4.3 Exponentiality
   3.5 Construction of the ACE Tree
       3.5.1 Design Goals
       3.5.2 Construction
       3.5.3 Construction Phase 1
       3.5.4 Construction Phase 2
       3.5.5 Combinability/Appendability Revisited
       3.5.6 Page Alignment
   3.6 Query Algorithm
       3.6.1 Goals
       3.6.2 Algorithm Overview
       3.6.3 Data Structures
       3.6.4 Actual Algorithm
       3.6.5 Algorithm Analysis
   3.7 Multi-Dimensional ACE Trees
   3.8 Benchmarking
       3.8.1 Overview
       3.8.2 Discussion of Experimental Results
   3.9 Conclusion and Discussion

4 SAMPLING-BASED ESTIMATORS FOR SUBSET-BASED QUERIES
   4.1 Introduction
   4.2 The Concurrent Estimator
   4.3 Unbiased Estimator
       4.3.1 High-Level Description
       4.3.2 The Unbiased Estimator In Depth
       4.3.3 Why Is the Estimator Unbiased?
       4.3.4 Computing the Variance of the Estimator
       4.3.5 Is This Good?
   4.4 Developing a Biased Estimator
   4.5 Details of Our Approach
       4.5.1 Choice of Model and Model Parameters
       4.5.2 Estimation of Model Parameters
       4.5.3 Generating Populations From the Model
       4.5.4 Constructing the Estimator
   4.6 Experiments
       4.6.1 Experimental Setup
           4.6.1.1 Synthetic data sets
           4.6.1.2 Real-life data sets
       4.6.2 Results
       4.6.3 Discussion
   4.7 Related Work
   4.8 Conclusion

5 SAMPLING-BASED ESTIMATION OF LOW SELECTIVITY QUERIES
   5.1 Introduction
   5.2 Background
       5.2.1 Stratification
       5.2.2 “Optimal” Allocation and Why It’s Not
   5.3 Overview of Our Solution
   5.4 Defining XΣ
       5.4.1 Overview
       5.4.2 Defining Xcnt
       5.4.3 Defining XΣ′
       5.4.4 Combining The Two
       5.4.5 Limiting the Number of Domain Values
   5.5 Updating Priors Using The Pilot
   5.6 Putting It All Together
       5.6.1 Minimizing the Variance
       5.6.2 Computing the Final Sampling Allocation
   5.7 Experiments
       5.7.1 Goals
       5.7.2 Experimental Setup
       5.7.3 Results
       5.7.4 Discussion
   5.8 Related Work
   5.9 Conclusion

6 CONCLUSION

APPENDIX
EM ALGORITHM DERIVATION

REFERENCES

BIOGRAPHICAL SKETCH
LIST OF TABLES

4-1 Observed standard error as a percentage of SUM(e.SAL) over all e ∈ EMP for 24 synthetically generated data sets. The table shows errors for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B - Model-based biased estimator.

4-2 Observed standard error as a percentage of SUM(e.SAL) over all e ∈ EMP for 24 synthetically generated data sets. The table shows errors for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B - Model-based biased estimator.

4-3 Observed standard error as a percentage of SUM(e.SAL) over all e ∈ EMP for 18 synthetically generated data sets. The table shows errors for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B - Model-based biased estimator.

4-4 Observed standard error as a percentage of the total aggregate value of all records in the database for 8 queries over 3 real-life data sets. The table shows errors for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B - Model-based biased estimator.

5-1 Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage (for 1000 query runs) for a Simple Random Sampling estimator for the KDD Cup data set. Results are shown for varying sample sizes and for three different query selectivities: 0.01%, 0.1% and 1%.

5-2 Average running time of Neyman and Bayes-Neyman estimators over three real-world datasets.

5-3 Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage (for 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator for the three data sets. Results are shown for 20 strata and for varying numbers of records in the pilot sample per stratum (PS) and sample sizes (SS), for three different query selectivities: 0.01%, 0.1% and 1%.

5-4 Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage (for 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator for the three data sets. Results are shown for 200 strata with varying numbers of records in the pilot sample per stratum (PS) and sample sizes (SS), for three different query selectivities: 0.01%, 0.1% and 1%.
LIST OF FIGURES

1-1 Simplified architecture of a DBMS.

3-1 Structure of a leaf node of the ACE tree.

3-2 Structure of the ACE tree.

3-3 Random samples from section 1 of L3.

3-4 Combining samples from L3 and L5.

3-5 Combining two sections of leaf nodes of the ACE tree.

3-6 Appending two sections of leaf nodes of the ACE tree.

3-7 Choosing keys for internal nodes.

3-8 Exponentiality property of ACE tree.

3-9 Phase 2 of tree construction.

3-10 Execution runs of query answering algorithm with (a) 1 contributing section, (b) 6 contributing sections, (c) 7 contributing sections and (d) 16 contributing sections.

3-11 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 0.25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time, plotted as a percentage of the time required to scan the relation.

3-12 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 2.5% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time, plotted as a percentage of the time required to scan the relation.

3-13 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time, plotted as a percentage of the time required to scan the relation.

3-14 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 2.5% of the database records. The graph is an extension of Figure 3-12 and shows results till all three sampling techniques return all the records matching the query predicate.

3-15 Number of records needed to be buffered by the ACE Tree for queries with (a) 0.25% and (b) 2.5% selectivity. The graphs show the number of records buffered as a fraction of the total database records versus time, plotted as a percentage of the time required to scan the relation.

3-16 Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly permuted file, with a spatial selection predicate accepting 0.25% of the database tuples.

3-17 Sampling rate of an ACE tree vs. rate for an R-tree and scan of a randomly permuted file, with a spatial selection predicate accepting 2.5% of the database tuples.

3-18 Sampling rate of an ACE tree vs. rate for an R-tree and scan of a randomly permuted file, with a spatial selection predicate accepting 25% of the database tuples.

4-1 Sampling from a superpopulation.

4-2 Six distributions used to generate, for each e in EMP, the number of records s in SALE for which f3(e, s) evaluates to true.

5-1 Beta distribution with parameters α = β = 0.5.
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

SAMPLING-BASED RANDOMIZATION TECHNIQUES FOR APPROXIMATE QUERY PROCESSING

By

SHANTANU JOSHI

August 2007

Chair: Christopher Jermaine
Major: Computer Engineering
The past couple of decades have seen a significant amount of research directed
towards data warehousing and efficient processing of analytic queries. This is a daunting
task due to the massive sizes of data warehouses and the nature of complex, analytical
queries. This is evident from standard published benchmarking results such as TPC-H,
which show that many typical queries can require several minutes to execute despite
sophisticated hardware. Such costs can seem especially expensive for ad-hoc, exploratory
data analysis. One way to speed up the execution of such exploratory queries is to rely on
approximate results. This approach can be especially promising if approximate answers
and their error bounds are computed in a small fraction of the time required to execute
the query to completion. Random samples can be used effectively to perform such
estimation. Two important problems must be addressed before random samples can be
used for estimation. The first is that retrieval of random samples from a database is
generally very expensive, so index structures must be designed that permit efficient
random sampling under arbitrary selection predicates. The second is that approximate
computation of arbitrary queries generally requires complex statistical machinery, so
reliable sampling-based estimators must be developed for different types of analytic
queries. My research addresses these two problems by making the following contributions:
(a) a novel file organization and index structure called the ACE Tree, which permits
efficient random sampling from an arbitrary range query; (b) sampling-based estimators
for aggregate queries which have a correlated subquery where the inner and outer queries
are related by the SQL EXISTS, NOT EXISTS, IN or NOT IN clause; and (c) a stratified
sampling technique for estimating the result of aggregate queries having highly selective
predicates.
CHAPTER 1
INTRODUCTION
The last couple of decades have seen an explosive growth of electronic data. It is not
unusual for data management systems to support several terabytes or even petabytes of
data. Such massive volumes of data have led to the evolution of “data warehouses”, which
are systems capable of supporting storage and efficient retrieval of large amounts of data.
Data warehouses are typically used for applications such as online analytical processing.
Such applications process queries and expect results in a manner different from
traditional transaction processing. For example, a typical query by a sales manager on a
sales data warehouse might be:
“Return the average salary of all employees at locations whose sales have increased by
at least 10% over the past 3 years.”
The result of such a query could be used to make high-level decisions such as whether
or not to hire more employees at the locations of interest. Such queries are typical in
a data warehousing environment in that their evaluation requires complex analytical
processing over huge amounts of data. Traditional transactional processing methods may
be unacceptably slow to answer such complex queries.
1.1 Approximate Query Processing (AQP) - A Different Paradigm
The nature of analytical queries and their associated applications makes it acceptable
to return results which may not be exact. Since computation of exact results may require
an unreasonable amount of time due to massive volumes of data, approximation is
attractive if the approximate results can be computed in a fraction of the time it would
take to compute the exact results. Moreover, providing approximate results can be useful
for quickly exploring the whole data set at a high level. This technique of providing fast
but approximate results has been termed “Approximate Query Processing” in the
literature.
In addition to computing an approximate answer, it is also important to provide
metrics about the accuracy of the answer. One way to express the accuracy is in terms
of error bounds with certain probabilistic guarantees of the form, “The estimated answer
is 2.45 × 105, and with 95% confidence the true answer lies within ±1.18 × 103 of the
estimate”. Here, the error bounds are expressed as an interval and the accuracy guarantee
is provided at 95% confidence.
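A guarantee of this form is usually derived from the normal approximation to the sampling distribution of the estimator. The following sketch (illustrative only; the estimator, data, and function names are invented for this example, not taken from the thesis) shows how a sample is turned into an "estimate ± bound at 95% confidence" statement:

```python
import math
import random

def estimate_sum(sample, population_size, z=1.96):
    """Estimate a population SUM from a simple random sample, with a
    normal-approximation error bound at roughly 95% confidence."""
    n = len(sample)
    mean = sum(sample) / n
    # sample variance with Bessel's correction
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    estimate = population_size * mean
    # standard error of the scaled-up SUM estimator
    std_err = population_size * math.sqrt(var / n)
    return estimate, z * std_err

random.seed(7)
population = [random.gauss(100.0, 15.0) for _ in range(100_000)]
sample = random.sample(population, 1_000)
est, bound = estimate_sum(sample, len(population))
print(f"Estimated answer: {est:.3e}; with 95% confidence the true "
      f"answer lies within ±{bound:.3e} of the estimate")
```

The width of the bound shrinks as roughly 1/√n, which is why taking more samples steadily tightens the guarantee.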
A promising approach for aggregation queries in Approximate Query Processing
(AQP), called online aggregation (OLA), has been proposed by Haas and Hellerstein [63].
They propose an interactive interface for data exploration and analysis where records are
retrieved in a random order. Using these random samples, running estimates and error
bounds are computed and immediately displayed to the user. As time progresses, the
size of the random sample keeps growing, so the estimate is continuously refined. At
predetermined time intervals, the refined estimate, along with its improved accuracy, is
displayed to the user. If at any point during the execution the user is satisfied with
the accuracy of the answer, she can terminate further execution. The system also gives an
overall progress indicator based on the fraction of records that have been sampled thus
far. Thus, OLA provides an interface where the user is given a rough estimate of the result
very quickly.
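The OLA behavior described above can be sketched as a simple loop over records retrieved in random order, reporting a running estimate, error bound, and progress indicator at regular intervals. This is a toy illustration, not the actual system of Haas and Hellerstein; all names and the reporting interval are invented:

```python
import math
import random

def online_aggregation(records, report_every=1000, z=1.96):
    """Sketch of online aggregation for AVG: scan records in random order,
    maintaining a running estimate with a ~95% error bound and a progress
    indicator. Note: shuffles the input list in place."""
    random.shuffle(records)           # records arrive in random order
    n, total, total_sq = 0, 0.0, 0.0
    reports = []
    for x in records:
        n += 1
        total += x
        total_sq += x * x
        if n % report_every == 0:
            mean = total / n
            var = max(total_sq / n - mean * mean, 0.0)
            bound = z * math.sqrt(var / n)   # shrinks as n grows
            progress = n / len(records)
            reports.append((mean, bound, progress))
    return reports

random.seed(42)
data = [random.expovariate(1 / 50.0) for _ in range(20_000)]
for mean, bound, progress in online_aggregation(data)[:3]:
    print(f"estimate {mean:8.2f} +/- {bound:6.2f}  ({progress:4.0%} scanned)")
```

Each successive report uses a strictly larger sample, so the error bound tightens over time; a real system would stream these reports to the UI and let the user terminate the scan early.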
1.2 Building an AQP System Afresh
The OLA system described above presents an intuitive interface for approximate
answering of aggregate queries. However, to support the functionality proposed by
the system, fundamental changes need to be incorporated in several components of a
traditional database management system. In this section, we first examine why sampling
is a good approach for AQP, and then present an overview of the changes needed in the
architecture of a database management system to support sampling-based AQP.
1.2.1 Sampling Vs Precomputed Synopses
We now discuss two techniques that can be used to support fast but approximate
answering of queries. One intuitive technique is to answer queries using compact summary
information about the records of the database. Such information is typically called a
database statistic; commonly used database statistics are wavelets, histograms and
sketches. These statistics, also known as synopses, are orders of magnitude smaller than
the actual data, so accessing a synopsis is much faster and more efficient than reading the
entire data set. However, synopses are precomputed and static. If a query is issued that
requires synopses that are not already available, they would have to be computed by
scanning the data set, possibly multiple times, before the query can be answered.
A second approach to AQP is using samples of database records to answer queries.
Query execution is extremely fast since the number of records in the sample is a small
fraction of the total number of records in the database. The answer is then extrapolated
or “scaled-up” to the size of the entire database. Since the answer is computed by
processing very few records of the database, it is an approximation of the true answer.
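The "scaling up" step can be illustrated concretely: under simple random sampling, each sampled record stands in for N/n records of the full table. The sketch below is a fabricated example (the salary table and all names are invented), not code from the thesis:

```python
import random

def scaled_sum(sample, n_sampled, n_total):
    """Extrapolate a SUM computed over a simple random sample to the
    whole table: each sampled record represents n_total / n_sampled
    records of the full database."""
    return (n_total / n_sampled) * sum(sample)

random.seed(3)
table = [random.randint(20_000, 120_000) for _ in range(200_000)]  # e.g. salaries
sample = random.sample(table, 2_000)                               # a 1% sample

approx = scaled_sum(sample, len(sample), len(table))
exact = sum(table)
print(f"approximate SUM: {approx:.4e}")
print(f"exact SUM:       {exact:.4e}")
print(f"relative error:  {abs(approx - exact) / exact:.2%}")
```

Only 1% of the records are touched, yet the scaled-up answer is typically within about a percent of the exact SUM here; the deviation that remains is exactly what the error bounds of the previous section quantify.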
For the work in this thesis, we propose to use sampling [25, 109] in order to support
AQP. We make this choice due to the following important advantages of sampling over
precomputed synopses. The accuracy of an estimate computed by using samples can be
easily improved by obtaining more samples to answer the query. On the other hand, if the
estimate computed by using synopses is not sufficiently accurate, a new synopsis providing
greater accuracy would have to be built. Since this would require scanning the dataset it is
impractical. Secondly, sampling scales very well. Even for extremely large
datasets on the order of hundreds of gigabytes, it is generally possible to accommodate a
small sample in main memory and use efficient in-memory algorithms to process it. If this
is not possible, disk-based samples and algorithms have also been proposed [76] and are
as effective as their in-memory counterparts. This is an important benefit of sampling
as compared to histograms, which become unwieldy as the number of attributes of the
records in the dataset increases.
Thirdly, since real records (although very few) are used in a sample, it is possible
to answer any statistical query, including those with arbitrarily complex functions in
relational selection and join predicates. This is a very important advantage of sampling
over synopses such as sketches, which are not suitable for answering arbitrary queries.
Finally, unlike precomputed synopses, on-the-fly sampling requires no maintenance or
updating as the data change.
1.2.2 Architectural Changes
In order to support sampling-based AQP in a database management system, major
changes need to be incorporated in the architecture of the system. The reason for this is
that traditional database management systems were not designed to work with random
samples or to support computation of approximate results. In this section, we briefly
describe some of the most critical changes that are required in the architecture of a DBMS
to support sampling-based AQP.
Figure 1-1 depicts the various components from a simplified architecture of a DBMS.
The four components that require major changes in order to support sampling-based AQP
are as follows:
• Index/file/record manager - The use of traditional index structures like B+-Trees is not appropriate to obtain random samples. This is because such index structures order records based on record search key values, which is actually the opposite of obtaining records in a random order. Hence, for AQP it is important to provide physical structures or file organizations which support efficient retrieval of random samples.
• Execution engine - The execution engine needs to be revamped completely so that it can use the random samples returned by the lower level to execute the query on them. Further, the result of the query needs to be scaled up appropriately for the size of the entire database. This component would also need to be able to compute accuracy guarantees for the approximate answer.
Figure 1-1. Simplified architecture of a DBMS. Queries/updates enter at the User Interface; the Query Compiler produces a query plan for the Execution Engine; index, file and record requests go to the Index/File/Record Manager; page commands go to the Buffer Manager; and read/write page requests go through the Storage Manager to Storage.
• Query compiler - The query compiler has to be modified so that it can chalk out a different strategy of execution for various types of queries like relational joins, subset-based queries or queries with a GROUP-BY clause. Moreover, optimization of queries needs to be done very differently from traditional query optimizers, which create the most efficient query plan to run a query to completion. For AQP, queries should be optimized so that the first few result tuples are output as quickly as possible.
• User interface - There is tremendous scope for providing an intuitive user interface for an online AQP system. In addition to the UI being able to provide accuracy guarantees for the estimate, it would be very intuitive to provide a visualization of the intermediate results as and when they become available, so that the user can continue to explore the query or decide to modify or terminate it. Current database management systems provide user interfaces with very limited functionality.
1.3 Contributions in This Thesis
These tasks involve significant research and implementation challenges, many of
which have never been tackled in the literature.
For the scope of my research, I choose to address the following three problems. The
motivation and our solutions for each of these research problems are described separately
in the following chapters of this thesis.
• We present a primary index structure which can support efficient retrieval of random samples from an arbitrary range query. This requires a specialized file organization and an efficient algorithm to actually retrieve the desired random samples from the index. This work falls in the scope of the Index/file/record manager component described earlier.
• We present our solution to support execution of queries which have a nested sub-query where the inner query is correlated to the outer query, in an approximate query processing framework. This work falls in the purview of the execution engine of the system.
• Finally, we also present a technique to support efficient execution of queries which have predicates with low selectivities, such as GROUP BY queries with many different groups. This work also falls in the scope of the query execution engine.
CHAPTER 2
RELATED WORK
This chapter presents previous work in the data management and statistics literature
related to estimation using sampling, as well as non-sampling precomputed synopsis
structures. Finally, it describes work related to OLAP query processing using
non-relational data models such as data cubes.
2.1 Sampling-based Estimation
Sampling has a long history in the data management literature. Some of the
pioneering work in this field has been done by Olken and Rotem [96, 98–101] and
Antoshenkov [9], though the idea of using a survey sample for estimation in statistics
literature goes back much earlier than these works. Most of the work by Olken and Rotem
describes how to perform simple random sampling from databases. Estimation for several
types of database tasks has been attempted with random samples. The rest of this section
presents important works on sampling-based estimation of major database tasks.
Some of the initial work on estimating selectivity of join queries is due to Hou et al.
[67, 68]. They present unbiased and consistent estimators for estimating the join size and
also provide an algorithm for cluster sampling. In [64] they propose unbiased estimators
for COUNT aggregate queries over arbitrary relational algebra expressions. However,
computation of variance of their estimators is very complex [67]. They also do not provide
any bounds on the number of random samples required for estimation.
Adaptive sampling has been used for estimation of selectivity of predicates in
relational selection and join operations [83, 84, 86] and for approximating the size of a
relational projection operation [94]. Adaptive sampling has also been used in [85], to
estimate transitive closures of database relations. The authors point out the benefits and
generality of using sampling for selectivity estimation over parametric methods which
make assumptions about an underlying probability distribution for the data as well as
over non-parametric methods which require storing and maintaining synopses about the
underlying data. The algorithms consider the query result as a collection of results from
several disjoint subqueries. Subqueries are sampled randomly and their result sizes are
computed. The estimate of the actual query result size is then obtained from the results
of the various subqueries. The sampling of subqueries is continued until either the sum
of the subquery sizes is sufficiently large or the number of samples taken is sufficiently
large. The method requires that the maximum size of a subquery be known. Since this
is generally not available, the authors use an upper bound for the maximum subquery
size in their method. Haas and Swami [59] observe that using a loose upper bound for
the maximum subquery size can lead to sampling more subqueries than necessary, and
potentially increasing the cost of sampling significantly.
Double sampling or two-phase sampling has been used in [66] for estimating the
result of a COUNT query with a guaranteed error bound at a certain confidence level.
The error bound is guaranteed by performing sampling in two steps. In the first step a
small pilot sample is used to obtain preliminary information about the input relation. This
information is then used to compute the size of the sample for the second step such that
the estimator is guaranteed to produce an estimate with the desired error bound.
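The two-phase scheme can be sketched as follows. This is an illustrative Python sketch with names of our own choosing; the sample-size calculation uses a standard normal-approximation bound rather than the exact formula of [66].

```python
import math
import random

def two_phase_count_estimate(relation, predicate, pilot_size, target_error, z=1.96):
    """Estimate a COUNT via double sampling.

    Phase 1 draws a small pilot sample to get a preliminary estimate of the
    predicate's selectivity p; phase 2 sizes the main sample so that a
    z-level confidence interval for p has half-width <= target_error.
    Returns the count estimate and the phase-2 sample size used.
    """
    N = len(relation)
    # Phase 1: small pilot sample (with replacement, for simplicity).
    pilot = [random.choice(relation) for _ in range(pilot_size)]
    p_hat = sum(1 for r in pilot if predicate(r)) / pilot_size
    var = p_hat * (1.0 - p_hat)
    # Phase 2: choose n so that z * sqrt(var / n) <= target_error.
    n = max(pilot_size, math.ceil(var * (z / target_error) ** 2))
    main = [random.choice(relation) for _ in range(n)]
    p_final = sum(1 for r in main if predicate(r)) / n
    return N * p_final, n
```

Note that the phase-2 sample size depends entirely on the pilot estimate of the variance, which is exactly the weakness discussed next.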
As Haas and Swami [59] point out, the drawback of using double sampling is that
there is no theoretical guidance for choosing the size of the pilot sample. This could
lead to an unpredictably imprecise estimate if the pilot sample size is too small or an
unnecessarily high sampling cost if the pilot sample size is too large. In their work [59],
Haas and Swami present sequential sampling techniques which provide an estimate of the
result size and also bound the error in estimation with a prespecified probability. They
present two algorithms in the paper to estimate the size of a query result. Although both
algorithms have been proven to be asymptotically correct and efficient, the first algorithm
suffers from the problem of undercoverage. This means that in practice the probability
with which it estimates the query result within the computed error bound is less than
the specified confidence level of the algorithm. This problem is addressed by the second
algorithm which organizes groups of equal-sized result sets into a single stratum and then
performs stratified sampling over the different strata. However, their algorithms do not
perform very well when estimating the size of joins between a skewed and a non-skewed
relation.
Ling and Sun [82] point out that general sampling-based estimation methods have a
high cost of execution since they make an overly restrictive assumption of no knowledge
about the overall characteristics of the data. In particular, they note that estimation of
the overall mean and variance of the data not only incurs cost but also introduces error in
estimation. The authors rather suggest an alternative approach of actually keeping track
of these characteristics in the database at a minimal overhead.
A detailed study about the cost of sampling-based methods to estimate join query
sizes appears in [58]. The paper systematically analyses the factors which influence the
cost of a sampling-based method to estimate join selectivities. Based on their analysis,
their findings can be summarized as follows: (a) When the measure of precision of the
estimate is absolute, the cost of sampling increases with the number of relations involved
in the join as well as the sizes of the relations themselves. (b) When the measure of
precision of the estimate is relative, the cost of using sampling increases with the sizes
of the relations, but decreases as the number of input relations increases. (c) When the
distribution of the join attribute values is uniform or highly skewed for all input relations,
the cost of sampling tends to be low, while it is high when only some of the input relations
have a skewed join attribute value distribution. (d) The presence of tuples in a relation
which do not join with any other tuples from other relations always increases the cost of
sampling.
Haas et al. [56, 57] study and compare the performance of new as well as previous
sampling-based procedures for estimating the selectivity of queries with joins. In particular
they identify estimators which have a minimum variance after a fixed number of sampling
steps have been performed. They note that use of indexes on input relations can further
reduce variance of the selectivity estimate. The authors also show how their estimation
methods can be used to estimate the cost of implementing a given join query plan without
making any assumptions about the underlying data or requiring storage and maintenance
of summary statistics about the data.
Ganguly et al. [35] describe how to estimate the size of a join in the presence of skew
in the data by using a technique called bifocal sampling. This technique classifies tuples
of each input relation into two groups, sparse and dense, based on the number of tuples
with the same value for the join attribute. Every combination of these groups is then
subject to different estimation procedures. Each of these estimation procedures requires a
sample size larger than a certain value (in terms of the total number of tuples in the input
relation) to provide an estimate within a small constant factor of the true join size. In
order to guarantee estimates with the specified accuracy, bifocal sampling also requires the
total join size and the join sizes from sparse-sparse subjoins to be greater than a certain
threshold.
Gibbons and Matias [40] introduce two sampling-based summary statistics called concise samples and counting samples and present techniques for their fast and incremental
maintenance. Although the paper describes summary statistics rather than on-the-fly
sampling techniques, the summary statistics are created from random samples of the
underlying data and are actually defined to describe characteristics of a random sample
of the data. Since summary statistics of a random sample require much less memory
than the sample itself, the paper describes how information from a much larger
sample can be stored in a given amount of memory by storing sample statistics instead
of using the memory to store actual random samples. Thus, the authors claim that since
information from a larger sample can be stored by their summary statistics the accuracy of
approximate answers can be boosted.
Chaudhuri, Motwani and Narasayya [22, 24] present a detailed study of the problem
of efficiently sampling the output of a join operation without actually computing the
entire join. They prove a negative result that it is not possible to generate a sample of
the join result of two relations by merely joining samples of the relations involved in
the join. Based on this result, they propose a biased sampling strategy which samples
tuples from one relation in proportion to the frequency with which their matching tuples appear in
the other relation. The intuition behind this approach is that the resulting biased sample
is more likely to reflect the structure of the actual join result between the two relations.
Information about the frequency of the various join attribute values is assumed to be
available in the form of some synopsis structures like histograms.
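The intuition behind this weighted strategy can be illustrated with a small sketch. All names below are hypothetical, and the join-partner frequencies are computed exactly here rather than read from a histogram-like synopsis as [22, 24] assume.

```python
import random
from collections import defaultdict

def sample_join(r1, r2, k, key1, key2):
    """Draw k random tuples from the join of r1 and r2 without computing it.

    A tuple of r1 is sampled with probability proportional to its number of
    join partners in r2; then one matching r2 tuple is picked uniformly.
    Each draw is a uniform random tuple of the join result, which is what
    naively joining uniform samples of r1 and r2 fails to produce.
    """
    # Group r2 tuples by join value (the "frequency synopsis").
    partners = defaultdict(list)
    for t2 in r2:
        partners[key2(t2)].append(t2)
    # Weight every r1 tuple by its number of join partners.
    weights = [len(partners[key1(t1)]) for t1 in r1]
    if sum(weights) == 0:
        return []  # empty join result
    sample = []
    for _ in range(k):
        t1 = random.choices(r1, weights=weights, k=1)[0]
        t2 = random.choice(partners[key1(t1)])
        sample.append((t1, t2))
    return sample
```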
There has also been work to estimate the actual result of an aggregate query which
involves a relational join operation on its input relations. In fact, Haas, Hellerstein and
Wang [63] propose a system called Online Aggregation (OLA) that can support online
execution of analytic-style aggregation queries. They propose the system to have a visual
interface which displays the current estimate of the aggregate query along with error
bounds at a certain confidence level. Then, as time progresses, the system continually
refines the estimate and at the same time shrinks the width of the error bounds. The user
who is presented with such a visual interface, has at all times, an option to terminate
further execution of the query in case the error bound width is satisfactory for the given
confidence level. The authors propose the use of random sampling from input relations to
provide estimates in OLA. Further, they describe some of the key changes that would
be required in a DBMS to support OLA. In [51], Haas describes statistical techniques
for computing error bounds in OLA. The work on OLA eventually grew into the UC
Berkeley CONTROL project. In their article [62], Hellerstein et al. describe various issues
in providing interactive data analysis and possible approaches to address those issues.
Haas and Hellerstein [53, 54] propose a family of join algorithms called ripple joins
to perform relational joins in an OLA framework. Ripple joins were designed to minimize
the time until an acceptably precise estimate of the query result is made available, as
opposed to minimizing the time to completion of the query as in a traditional DBMS. For
a two-table join, the algorithm retrieves a certain number of random tuples from both
relations at each sampling step; these new tuples are joined with previously seen tuples
and with each other. The running result of the aggregate query is updated with these
newly retrieved tuples. The paper also describes how a statistically meaningful confidence
interval of the estimated result can be computed based on the Central Limit Theorem
(CLT).
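A minimal sketch of the running estimator for a SUM over a two-table join follows (our naming; the CLT-based confidence-interval machinery of [53, 54] is omitted). Here f returns the aggregate contribution of a joining pair and 0 for a non-joining pair.

```python
import random

def ripple_join_sum(r1, r2, f, steps):
    """Running SUM estimate over a two-table join via a square ripple join.

    At each step one new random tuple is drawn from each relation and
    joined with the other new tuple and with all previously seen tuples;
    the aggregate over sampled pairs is then scaled up to the full cross
    product. Returns the sequence of running estimates, one per step.
    """
    n1, n2 = len(r1), len(r2)
    perm1 = random.sample(r1, len(r1))  # random retrieval orders
    perm2 = random.sample(r2, len(r2))
    seen1, seen2 = [], []
    total = 0.0
    estimates = []
    for n in range(steps):
        t1, t2 = perm1[n], perm2[n]
        # Join the new tuples with each other and with all prior tuples.
        total += f(t1, t2)
        total += sum(f(t1, s2) for s2 in seen2)
        total += sum(f(s1, t2) for s1 in seen1)
        seen1.append(t1)
        seen2.append(t2)
        # Scale the sampled cross product up to the full cross product.
        scale = (n1 * n2) / ((n + 1) * (n + 1))
        estimates.append(scale * total)
    return estimates
```

When steps reaches the size of the relations, every pair has been examined exactly once and the estimate equals the exact answer.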
Luo et al. [87] present an online parallel hash ripple join algorithm to speed up
the execution of the ripple join especially when the join selectivity is low and also when
the user wishes to continue execution until completion. The algorithm is assumed to be
executed at a fixed set of processor nodes. At each node, a hash table is maintained for
every relation. Moreover every bucket in each hash table could have some tuples stored
in memory and some others stored on disk. The join algorithm proceeds in two phases; in
the first phase tuples from both relations are retrieved in a random order and distributed
to the processor nodes so that each node would perform roughly the same amount of work
for executing the join. By using multiple threads at each node, production of join tuples
from the in-memory hash table buckets begins even as tuples are being distributed to
the various processors. The second phase begins after redistribution from the first phase
is complete. In this phase, a new in-memory hash table is created which uses a hashing
function different from the function used in phase 1. The tuples in the disk-resident
buckets of the hash table of phase 1 are then hashed according to the hashing function
of phase 2 and joined. The algorithm provides a considerable speed-up factor over the
one-node ripple join, provided its memory requirements are met.
Jermaine et al. [73, 74] point out that the drawback of both the ripple join algorithms
described above is that the statistical guarantees provided by the estimator are valid
only as long as the output of the join can be accommodated in main memory. In order
to counteract this problem, they propose the Sort-Merge-Shrink join algorithm as a
generalization of the ripple join which can provide error guarantees throughout execution,
even if it operates from disk. The algorithm proceeds in three phases. In the sort phase,
the two input relations are read in parallel and sorted into runs. Each pair of runs is
subject to an in-memory hash ripple join and provides a corresponding estimate of the
join result. The merge and shrink phases execute concurrently where in the merge phase,
tuples are retrieved from the various sorted runs of both relations and joined with each
other. Since the sorted runs “lose” tuples which are pulled by the merge phase, the
shrinking phase takes these tuples into account and updates the estimator accordingly.
The authors provide a detailed statistical analysis of the estimator as well as computation of
error bounds.
Estimation using sampling of the number of distinct values in a column has been
studied by Haas et al. [48]. They provide an overview of the estimators used in the
database and statistics literature and also develop several new sampling-based estimators
for the distinct value estimation problem. They propose a new hybrid sampling estimator
which explicitly adapts to different levels of data skew. Their hybrid estimator performs
a Chi-square test to detect skew in the distribution of the attribute value. If the data
appears to be skewed, then Shlosser’s estimator is used while if the test does not detect
skew, a smoothed-jackknife estimator (which is a modification of the conventional
jackknife estimator) is used. The authors attribute the dearth of work on sampling-based
estimation of the number of distinct values to the inherent difficulty of the problem, while
noting that it is a much harder problem than estimating the selectivity of a join.
Haas and Stokes [50] present a detailed study of the problem of estimating the
number of classes in a finite population. This is equivalent to the database problem of
estimating the number of distinct values in a relation. The authors make recommendations
about which statistical estimator is appropriate subject to constraints and finally claim
from empirical results that a hybrid estimator which adapts according to data skew is the
best overall estimator.
There has also been work by Charikar et al. [16] which establishes a negative result
stating that no sampling-based estimator for estimating the number of distinct values
can guarantee small error across all input distributions unless it examines a large fraction
of the input data. They also present a Guaranteed Error Estimator (GEE) whose error
is provably no worse than their negative result. Since the GEE is a general estimator
providing optimal error over all distributions, the authors note that its accuracy may be
lower than some previous estimators on specific distributions. Hence, they propose an
estimator called the Adaptive Estimator (AE) which is similar in spirit to Haas et al.’s
hybrid estimator [50], but unlike the latter, is not composed of two distinct estimators.
Rather, the AE considers the contribution of data items having high and low frequencies in
a single unified estimator.
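For concreteness, the GEE can be sketched in a few lines, assuming the form described in [16]: the "singleton" values (seen exactly once in a sample of r out of n rows) are scaled up by sqrt(n/r), and every value seen more than once is counted exactly once.

```python
import math
from collections import Counter

def gee_estimate(sample, n):
    """Guaranteed Error Estimator (GEE) for the number of distinct values.

    D_hat = sqrt(n/r) * f1 + sum_{j >= 2} f_j, where r is the sample size
    and f_j is the number of distinct values occurring exactly j times in
    the sample. (Sketch after Charikar et al. [16].)
    """
    r = len(sample)
    value_counts = Counter(sample)                  # value -> multiplicity
    freq_of_freq = Counter(value_counts.values())   # j -> f_j
    f1 = freq_of_freq.get(1, 0)
    rest = sum(fj for j, fj in freq_of_freq.items() if j >= 2)
    return math.sqrt(n / r) * f1 + rest
```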
In the AQUA system [41] for approximate answering of queries, Acharya et al.
[6] propose using synopses for estimating the result of relational join queries involving
foreign-key joins rather than using random samples from the base relations. These
synopses are actually precomputed samples from a small set of distinguished joins and
are called join synopses in the paper. The idea of join synopses is that by precomputing
samples from a small set of distinguished joins, these samples can be used for estimating
the result of many other joins. The concept is applicable in a k-way join where each join
involves a primary and foreign key of the participating relations. The paper describes
that if workload information is available, it can be used to design an optimal allocation
for the join synopses that minimizes the overall error in the approximate answers over the
workload.
Acharya et al. [5] propose using a mix of uniform and biased samples for approximately
answering queries with a GROUP-BY clause. Their sampling technique, called congressional sampling, relies on using precomputed samples which are a hybrid union of uniform
and biased samples. They assume that the selectivity of the query predicate is not so low
that their precomputed sample completely misses one or more groups from the result of
the GROUP-BY query. Based on this assumption, they devise a sampling plan for the
different groups such that the expected minimum number of tuples satisfying the query
predicate in any group, is maximized. The authors also present one-pass algorithms [4] for
constructing the congressional samples.
Ganti et al. [37] describe a biased sampling approach which they call ICICLES to
obtain random samples which are tuned to a particular workload. Thus, if a tuple is
chosen by many queries in a workload, it has a higher probability of being selected in the
self-tuning sample as compared to tuples which are chosen by fewer queries. Since this is
a non-uniform sample, traditional sampling-based estimators must be adapted for these
samples. The paper describes modified estimators for the common aggregation operations.
It also describes how the self-tuning samples are tuned in the presence of a dynamically
changing workload.
Chaudhuri et al. [18] note that uniform random sampling to estimate aggregate
queries is ineffective when the distribution of the aggregate attribute is skewed or when
the query predicate has a low selectivity. They propose using a combination of two
methods to address this problem. Their first approach is to index separately those
attribute values which contribute significantly to the query result. This method is called
Outlier Indexing in the paper. The second approach proposed in the paper is to exploit
workload information to perform weighted sampling. According to this technique, records
which satisfied many queries in the workload are sampled more often than records that
satisfied fewer queries.
Chaudhuri, Das and Narasayya [19, 20] describe how workload information can
be used to precompute a sample that minimizes the error for the given workload. The
problem of selection of the sample is framed as an optimization problem so that the error
in estimation of the workload queries using the resulting sample is minimized. When the
actual incoming queries are identical to queries in the workload, this approach gives a
solution with minimal error across all queries. The paper also describes how the choice of
the sample can be tuned to achieve effective estimates when the actual queries are similar
but not identical to the workload.
Babcock, Chaudhuri and Das [10] note that a uniformly random sample can lead
to inaccurate answers for many queries. They observe that for such queries, estimation
using an appropriately biased sample can lead to more accurate answers as compared
to estimation using uniformly random samples. Based on this idea, the paper describes
a technique called small group sampling which is designed to approximately answer
aggregation queries having a GROUP-BY clause. The distinctive feature of this technique
as compared to previous biased sampling techniques like congressional sampling is that
a new biased sample is chosen for every GROUP-BY query, such that it maximizes
the accuracy of estimating the query rather than trying to devise a biased sample that
maximizes the accuracy over an entire workload of queries. According to this technique,
larger groups from the output of the GROUP-BY queries are sampled uniformly while the
small groups are sampled at a higher rate to ensure that they are adequately represented.
The group samples are obtained on a per-query basis from an overall sample which is
computed in a pre-processing phase.
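The idea can be sketched as follows. This is an illustrative simplification with hypothetical names: the actual technique of [10] derives its per-query samples from an overall sample built in a pre-processing phase, whereas this sketch works directly on the rows.

```python
import random
from collections import defaultdict

def small_group_sample(rows, group_of, base_rate, small_threshold):
    """Biased sample in the spirit of small group sampling.

    Rows belonging to large groups are sampled uniformly at base_rate,
    while every row of a group smaller than small_threshold is kept, so
    that rare groups are not lost from the GROUP-BY result. Returns
    (row, inclusion_probability) pairs; an unbiased estimator must weight
    each row by the inverse of its inclusion probability.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[group_of(row)].append(row)
    sample = []
    for g, members in groups.items():
        if len(members) < small_threshold:
            # Small group: keep every row with probability 1.
            sample.extend((row, 1.0) for row in members)
        else:
            # Large group: uniform sampling at the base rate.
            for row in members:
                if random.random() < base_rate:
                    sample.append((row, base_rate))
    return sample
```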
In fact, database sampling has been recognized as an important enough problem
that ISO has been working to develop a standard interface for sampling from relational
database systems [55], and significant research efforts are directed at providing sampling
from database systems by vendors such as IBM [52].
2.2 Estimation Using Non-sampling Precomputed Synopses
Estimation in databases using a non-sampling technique was first proposed by Rowe
[106, 107]. The technique proposed is called antisampling and involves creation of a special
auxiliary structure called a database abstract. The abstract considers the distribution of
several attributes and groups of attributes. Correlations between different attributes can
also be characterized as statistics. This technique was found to be faster than random
sampling, but required domain knowledge about the various attributes.
Classic work on histogram based estimation of predicate selectivity is by Selinger et
al. [110] and Piatetsky-Shapiro and Connell [102]. Selectivity estimation of queries with
multidimensional predicates using histograms was presented by Muralikrishna and DeWitt
[92]. They show that the maximum error in estimation can be controlled more effectively
by choosing equi-depth histograms as opposed to equi-width histograms.
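A minimal sketch of equi-depth bucket construction and range-selectivity estimation follows (our naming). Each bucket holds roughly the same number of values, and a range predicate is answered by assuming values are uniformly spread within each bucket.

```python
def equi_depth_histogram(values, num_buckets):
    """Build an equi-depth histogram as a list of (low, high, count) triples.

    Bucket boundaries are chosen so every bucket covers roughly the same
    number of values, which bounds the error any single bucket can
    contribute, unlike a fixed-width bucket that may hide a large spike.
    """
    ordered = sorted(values)
    n = len(ordered)
    buckets = []
    for b in range(num_buckets):
        lo_idx = b * n // num_buckets
        hi_idx = (b + 1) * n // num_buckets
        if lo_idx == hi_idx:
            continue  # more buckets requested than values available
        chunk = ordered[lo_idx:hi_idx]
        buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets

def estimate_range_count(buckets, lo, hi):
    """Estimate |{v : lo <= v <= hi}| assuming uniformity within buckets."""
    total = 0.0
    for b_lo, b_hi, count in buckets:
        width = max(b_hi - b_lo, 1e-12)
        overlap = max(0.0, min(hi, b_hi) - max(lo, b_lo))
        total += count * min(1.0, overlap / width)
    return total
```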
Ioannidis [70] describes how serial histograms are optimal for aggregate queries
involving arbitrary join trees with equality predicates. Ioannidis and Poosala [71] have also
studied how histograms can be used to approximately answer non-aggregate queries which
have a set based result.
Several histogram construction schemes [42, 45, 72] have been proposed in the
literature. Jagadish et al. [72] describe techniques for constructing histograms which can
minimize a given error metric where the error is introduced because of approximation
of values in a bucket by a single value associated with the bucket. They also describe
techniques for augmenting histograms with additional information so that they can be
used to provide accuracy guarantees of the estimated results.
Construction of approximate histograms by considering only a random sample of
the data set was investigated by Chaudhuri et al. [23]. Their technique uses an adaptive
sampling approach to determine the sample size that would be sufficient to generate
approximate histograms which can guarantee pre-specified error bounds in estimation.
They also extend their work to consider duplicate values in the domain of the attribute for
which a histogram is to be constructed.
The problem of estimation of the number of distinct value combinations of a set of
attributes has been studied by Yu et al. [121]. Due to the inherent difficulty of developing
a good, sampling-based estimation solution to the problem, they propose using additional
information about the data in the form of histograms, indexes or data cubes.
In a recent paper [28], Dobra presents a study of when histograms are best suited for
approximation. The paper considers the long-standing assumption that histograms are
most effective only when all elements in a bucket have the same frequency and actually
extends it to a less restrictive assumption that histograms are well-suited when elements
within a bucket are randomly arranged even though they might have different frequencies.
Wavelets have a long history as mathematical tools for hierarchical decomposition
of functions in signal and image processing. Vitter and his collaborators have also
studied how wavelets can be applied to selectivity estimation of queries [89] and also
for computing aggregates over data cubes [118, 119]. Chakrabarti et al. [15] present
techniques for approximate computation of results for aggregate as well as non-aggregate
queries using Haar wavelets.
One more summary structure that has been proposed for approximating the size of
joins is sketches. Sketches are small-space summaries of data suited for data streams. A
sketch generally consists of multiple counters corresponding to random variables which
enable them to provide approximate answers with error guarantees for a priori decided
queries. Some of the earliest work on sketches was presented by Alon, Gibbons, Matias
and Szegedy [7, 8]. Sketching techniques with improved error guarantees and faster update
times have been proposed as Fast-Count sketches [117]. A statistical analysis of various
sketching techniques along with recommendations on their use for estimating join sizes
appears in [108].
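The basic AGMS idea can be sketched as follows. This is a deliberately simplified version: real implementations use 4-wise independent hash families and a median-of-means combination, not the explicit per-value sign table and plain mean used here.

```python
import random
import statistics

def agms_join_size(r_values, s_values, num_counters=64, seed=42):
    """Estimate |R join S| on a single attribute with basic AGMS sketches.

    Each counter is X = sum over tuples of xi(v), where xi maps each join
    value independently to +1 or -1; E[X_R * X_S] equals the join size,
    and averaging several independent counters reduces the variance.
    """
    rng = random.Random(seed)
    domain = set(r_values) | set(s_values)
    products = []
    for _ in range(num_counters):
        xi = {v: rng.choice((-1, 1)) for v in domain}  # random sign per value
        x_r = sum(xi[v] for v in r_values)
        x_s = sum(xi[v] for v in s_values)
        products.append(x_r * x_s)
    return statistics.mean(products)
```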
2.3 Analytic Query Processing Using Non-standard Data Models
A data model for OLAP applications called data cube was proposed by Gray et al.
[44] for processing of analytic style aggregation queries over data warehouses. The paper
describes a generalization of the SQL GROUP BY operator to multiple dimensions by
introducing the data cube operator. This operator treats each of the possible aggregation
attributes as a dimension of a high dimensional space. The aggregate of a particular
set of attribute values is considered as a point in this space. Since the cube holds
precomputed aggregate values over all dimensions, it can be used to quickly compute
results to GROUP-BY queries over multiple dimensions. The data cube is precomputed
and can require a significant amount of space for storage of the precomputed aggregates
along the different dimensions. A more serious drawback of the data cube approach is
that it can be used to efficiently answer only such queries which have a grouping hierarchy
that conforms to the hierarchy on which the data cube is built. Moreover, complex queries
which have been addressed in this thesis such as queries having correlated subqueries are
not amenable to efficient processing with the data cube model.
Due to potentially large sizes of data cubes for high dimensions, researchers have
studied techniques to discover semantic relationships in a data cube. This approach
reduces the number of precomputed aggregates grouped by different attributes if
their aggregate values are identical. The quotient cube [79] and quotient cube tree [80]
structures are such compressed representations of the data cube which preserve semantic
relationships while also allowing processing of point and range queries.
Another approach that has been employed in shrinking the data cube while at the
same time preserving all the information in it is the Dwarf [113, 114] structure. Dwarf
identifies and eliminates redundancies in prefixes and suffixes of the values along different
dimensions of a data cube. The paper shows that by eliminating prefix as well as suffix
redundancies, both dense as well as sparse data cubes can be compressed effectively.
The paper also shows improved cube construction time, query response time as well as
update time as compared to cube trees [105]. Although the Dwarf structure improves the
performance of the data cube model, it still suffers from the inherent drawback of the data
cube model – it is not suitable to efficiently answer arbitrarily complex queries such as
queries with correlated subqueries.
Recently, a new column-oriented architecture for database systems called C-store was
proposed by Stonebraker et al. [115]. The system has been designed for an environment
that has a much higher number of database reads than writes, such as a data
warehousing environment. C-store logically splits attributes of a relational table into
projections which are collections of attributes, and stores them on disk such that all values
of any attribute are stored adjacent to each other. The paper presents experimental results
which show that C-store executes several select-project-join and group-by queries over the
TPC-H benchmark much faster than commercial row-oriented or column-oriented systems.
At the time of the paper [115], the system was still under development.
CHAPTER 3
MATERIALIZED SAMPLE VIEWS FOR DATABASE APPROXIMATION
3.1 Introduction
With ever-increasing database sizes, randomization and randomized algorithms [91]
have become vital data management tools. In particular, random sampling is one of the
most important sources of randomness for such algorithms. Scores of algorithms that are
useful over large data repositories either require a randomized input ordering for data (i.e.,
an online random sample), or else they operate over samples of the data to increase the
speed of the algorithm.
Although applications requiring randomization abound in the data management
literature, we specifically consider online aggregation [54, 62, 63] in this thesis. In online
aggregation, database records are processed one-at-a-time, and used to keep the user
informed of the current “best guess” as to the eventual answer to the query. If the records
are input into the online aggregation algorithm in a randomized order, then it becomes
possible to give probabilistic guarantees on the relationship of the current guess to the
eventual answer to the query.
Despite the obvious importance of random sampling in a database environment and
dozens of recent papers on the subject (approximately 20 papers from recent SIGMOD
and VLDB conferences are concerned with database sampling), there has been relatively
little work towards actually supporting random sampling with physical database file
organizations. The classic work in this area (by Olken and his co-authors [98, 99, 101])
suffers from a key drawback: each record sampled from a database file requires a random
disk I/O. At a current rate of around 100 random disk I/Os per second per disk, this
means that it is possible to retrieve only 6,000 samples per minute. If the goal is fast
approximate query processing or speeding up a data mining algorithm, this is clearly
unacceptable.
The Materialized Sample View
In this chapter, we propose to use the materialized sample view 1 as a convenient
abstraction for allowing efficient random sampling from a database. For example, consider
the following database schema:
SALE (DAY, CUST, PART, SUPP)
Imagine that we want to support fast, random sampling from this table, and most of
our queries include a temporal range predicate on the DAY attribute. This is exactly the
interface provided by a materialized sample view. A materialized sample view can be
specified with the following SQL-like query:
CREATE MATERIALIZED SAMPLE VIEW MySam
AS SELECT * FROM SALE
INDEX ON DAY
In general, the range attribute or attributes referenced in the INDEX ON clause can be
spatial, temporal, or otherwise, depending on the requirements of the application.
While the materialized sample view is a straightforward concept, efficient implementation
is difficult. The primary technical contribution of this thesis is a novel index structure
called the ACE Tree (Appendability, Combinability, Exponentiality; see Section 3.4) which
can be used to efficiently implement a materialized sample view. Such a view, stored as an
ACE Tree, has the following characteristics:
• It is possible to efficiently sample (without replacement) from any arbitrary range query over the indexed attribute, at a rate that is far faster than is possible using techniques proposed by Olken [96] or by scanning a randomly permuted file. In general, the view can produce samples from a predicate involving any attribute having a natural ordering, and a straightforward extension of the ACE Tree can be used for sampling from multi-dimensional predicates.
• The resulting sample is online, which means that new samples are returned continuously as time progresses, and in a manner such that at all times, the set of samples returned is a true random sample of all of the records in the view that match the range query. This is vital for important applications like online aggregation and data mining.
• Finally, the sample view is created efficiently, requiring only two external sorts of the records in the view, and with only a very small space overhead beyond the storage required for the data records.

1 This term was originally used in Olken's PhD thesis [96] in a slightly different context, where the goal was to maintain a fixed-size sample of a database; in contrast, as we describe subsequently, our materialized sample view is a structure allowing online sampling.
We note that while the materialized sample view is a logical concept, the actual file
organization used to implement such a view can be referred to as a sample index since it is
a primary index structure to efficiently retrieve random samples.
3.2 Existing Sampling Techniques
In this section, we discuss three simple techniques that can be used to create
materialized sample views to support random sampling from a relational selection
predicate.
3.2.1 Randomly Permuted Files
One option for creating a materialized sample view is to randomly shuffle or permute
the records in the view. To sample from a relational selection predicate over the view,
we scan it sequentially from beginning to end and accept those records that satisfy the
predicate while rejecting the rest. This method has the advantage that it is very simple,
and using a fast external sorting algorithm, permuting the records can be very efficient.
Furthermore, since the process of scanning the file can make use of the fast, sequential I/O
provided by modern hard disks, a materialized view organized as a randomly permuted file
can be very useful for answering queries that are not very selective.
However, the major problem with such a materialized view is that the fraction of
useful samples retrieved by it is directly proportional to the selectivity of the selection
predicate. For example, if the selectivity of the query is 10%, then on average only 10% of
the random samples obtained by such a view can be used to answer the query. Hence for
moderate to low selectivity queries, most of the random samples retrieved by such a view
will not be useful for answering queries. Thus, the performance of such a view quickly
degrades as selectivity of the selection predicates decreases.
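To make the scan-and-filter scheme concrete, here is a minimal Python sketch (our own illustration, not code from the thesis; the in-memory list stands in for the permuted file on disk, and all names are hypothetical):

```python
import random

def permuted_scan_sample(records, predicate, n):
    """Scan a randomly permuted file sequentially, keeping records that
    satisfy the predicate and rejecting the rest.  Because the file is in
    random order, the kept records form a true random sample (without
    replacement) of the matching records, at every point in the scan."""
    sample = []
    for rec in records:                  # fast, sequential I/O in practice
        if predicate(rec):
            sample.append(rec)
            if len(sample) == n:
                break
    return sample

# One-time setup: permute the records (one external sort, in practice).
random.seed(7)
records = list(range(1000))
random.shuffle(records)

# A 10%-selectivity predicate: on average, 90% of the scanned records are
# rejected, which is exactly the inefficiency described above.
sample = permuted_scan_sample(records, lambda r: r < 100, 20)
```

With a highly selective predicate the loop must scan roughly ten records for every one it keeps, illustrating why performance degrades as selectivity decreases.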
3.2.2 Sampling from Indices
The second approach to creating a materialized sample view is to use one of the
standard indexing structures like a hashing scheme or a tree-based index structure
to organize the records in the view. In order to produce random samples from such
a materialized view, we can employ iterative or batch sampling techniques [9, 96,
99–101] that sample directly from a relational selection predicate, thus avoiding the
aforementioned problem of obtaining too few relevant records in the sample. Olken [96]
presents a comprehensive analysis and comparison of many such techniques. In this
Section we discuss the technique of sampling from a materialized view organized as
a ranked B+-Tree, since it has been proven to be the most efficient existing iterative
sampling technique in terms of number of disk accesses. A ranked B+-Tree is a regular
B+-Tree whose internal nodes have been augmented with information which permits one
to find the ith record in the file.
Let us assume that the relation SALE presented in the Introduction is stored as a
ranked B+-Tree file indexed on the attribute DAY and we want to retrieve a random
sample of records whose DAY attribute value falls between 11-28-2004 and 03-02-2005.
This translates to the following SQL query:
SELECT * FROM SALE
WHERE SALE.DAY BETWEEN ’11-28-2004’ AND ’03-02-2005’
Algorithm 1 can then be used to obtain a random sample of relevant records
from the ranked B+-Tree file.
The drawback of this algorithm is that whenever a leaf page is accessed, the
algorithm retrieves only that record whose rank matches with the rank being searched for.
Hence for every record which resides on a page that is not currently buffered, the retrieval
time is the same as the time required for a random disk I/O. Thus, as long as there are
unbuffered leaf pages containing candidate records, the rate of record retrieval is very slow.
Algorithm 1: Sampling from a Ranked B+-Tree

Algorithm SampleRankedB+Tree (Value v1, Value v2)
1. Find the rank r1 of the record which has the smallest DAY value greater than v1.
2. Find the rank r2 of the record which has the largest DAY value smaller than v2.
3. While sample size < desired sample size:
   3.a Generate a uniformly distributed random number i between r1 and r2.
   3.b If i has been generated previously, discard it and generate the next random number.
   3.c Using the rank information in the internal nodes, retrieve the record whose rank is i.
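Algorithm 1 can be sketched in Python as follows, with a sorted in-memory array standing in for the ranked B+-Tree; the `bisect` lookups play the role of the rank information stored in the internal nodes (the function name and toy data are our own, not the thesis's code):

```python
import bisect
import random

def sample_ranked(sorted_keys, v1, v2, n):
    """Sketch of Algorithm 1: sample n distinct records whose key lies
    strictly between v1 and v2, by drawing uniform ranks in [r1, r2]."""
    r1 = bisect.bisect_right(sorted_keys, v1)      # smallest key > v1
    r2 = bisect.bisect_left(sorted_keys, v2) - 1   # largest key < v2
    n = min(n, r2 - r1 + 1)
    seen, sample = set(), []
    while len(sample) < n:
        i = random.randint(r1, r2)     # step 3.a: uniform random rank
        if i in seen:                  # step 3.b: discard repeats
            continue
        seen.add(i)
        sample.append(sorted_keys[i])  # step 3.c: one random I/O per record
    return sample

random.seed(3)
keys = sorted(range(0, 200, 2))        # stand-in for the indexed DAY values
s = sample_ranked(keys, 50, 120, 10)
```

The last line of the loop is the expensive part on disk: each accepted rank costs one random I/O, which is the drawback discussed above.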
3.2.3 Block-based Random Sampling
While the classic algorithms of Olken and Antoshenkov sample records one-at-a-time,
it is possible to sample from an indexing structure such as a B+-Tree, and make use of
entire blocks of records [21, 55]. The number of records per block is typically on the order
of 100 to 1000, leading to a speedup of two or three orders of magnitude in the number of
records retrieved over time if all of the records in each block are consumed, rather than a
single record.
However, there are two problems with this approach. First, if the structure is used to
estimate the answer to some aggregate query, then the confidence bounds associated with
any estimate provided after N samples have been retrieved from a range predicate using a
B+-Tree (or some other index structure) may be much wider than the confidence bounds
that would have been obtained had all N samples been independent. In the extreme case
where the values on each block of records are closely correlated with one another, all of
the N samples may be no better than a single sample. Second, any algorithm which makes
use of such a sample must be aware of the block-based method used to sample the index,
and adjust its estimates accordingly, thus adding complexity to the query result estimating
process. For algorithms such as Bradley’s K-means algorithm [11], it is not clear whether
or not such samples are even appropriate.
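The effect of intra-block correlation on estimate quality can be illustrated with a small, self-contained simulation (an illustration of the general point, not an experiment from the thesis; all parameters are made up):

```python
import random
import statistics

random.seed(1)
B, NBLOCKS = 100, 200     # records per block, blocks in the file

# A file whose blocks are internally correlated: every record in block b
# is (roughly) the value b, so a whole block adds little new information.
blocks = [[b + random.gauss(0, 0.1) for _ in range(B)] for b in range(NBLOCKS)]
flat = [r for blk in blocks for r in blk]

def block_sample_mean(k):
    """Estimate the file's mean from k whole sampled blocks (k*B records)."""
    return statistics.fmean(r for blk in random.sample(blocks, k) for r in blk)

def independent_sample_mean(n):
    """Estimate the file's mean from n independently sampled records."""
    return statistics.fmean(random.sample(flat, n))

# Same number of records (5 blocks = 500 records), very different spread:
block_ests = [block_sample_mean(5) for _ in range(300)]
indep_ests = [independent_sample_mean(500) for _ in range(300)]
```

Under this extreme correlation, the 500 block-sampled records carry roughly the information of 5 independent ones, so the block-based estimator's standard deviation is many times larger than that of the independent estimator, matching the widened confidence bounds described above.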
3.3 Overview of Our Approach
We propose an entirely different strategy for implementing a materialized sample
view. Our strategy uses a new data structure called the ACE Tree to index the records
in the sample view. At the highest level, the ACE Tree partitions a data set into a
large number of different random samples such that each is a random sample without
replacement from one particular range query. When an application asks to sample from
some arbitrary range query, the ACE Tree and its associated algorithms filter and combine
these samples so that very quickly, a large and random subset of the records satisfying the
range query is returned. The sampling algorithm of the ACE Tree is an online algorithm,
which means that as time progresses, a larger and larger sample is produced by the
structure. At all times, the set of records retrieved is a true random sample of all the
database records matching the range selection predicate.
3.3.1 ACE Tree Leaf Nodes
The ACE Tree stores records in a large set of leaf nodes on disk. Every leaf node has
two components:
1. A set of h ranges, where a range is a pair of key values in the domain of the key attribute and h is the height of the ACE Tree. Unlike a B+-Tree, each leaf node in the ACE Tree stores records falling in several different ranges. The ith range associated with leaf node L is denoted by L.Ri. The h different ranges associated with a leaf node are hierarchical; that is, L.R1 ⊃ L.R2 ⊃ · · · ⊃ L.Rh. The first range in any leaf node, L.R1, always contains a uniform random sample of all records of the database, thus corresponding to the range (−∞, ∞). The hth range in any leaf node is the smallest of all the ranges in that leaf node.
2. A set of h associated sections. The ith section of leaf node L is denoted by L.Si. The section L.Si contains a random subset of all the database records with key values in the range L.Ri.
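Assuming a simple in-memory representation (the class name, helper, and example record values are ours, not the thesis's), a leaf node might be sketched as:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LeafNode:
    """One ACE Tree leaf: h nested ranges R1 ⊃ R2 ⊃ ... ⊃ Rh, and for
    each range Ri a section Si holding a random subset of the database
    records whose keys fall in Ri."""
    ranges: List[Tuple[int, int]]     # ranges[i] is L.R(i+1)
    sections: List[List[int]]         # sections[i] is L.S(i+1)

    def check_hierarchical(self) -> bool:
        """Each range must contain the next one."""
        return all(lo1 <= lo2 and hi2 <= hi1
                   for (lo1, hi1), (lo2, hi2)
                   in zip(self.ranges, self.ranges[1:]))

# A leaf shaped like the one in Figure 3-1 (h = 4; record values illustrative).
leaf = LeafNode(
    ranges=[(0, 100), (0, 50), (0, 25), (0, 12)],
    sections=[[75, 36, 41, 47], [22, 18, 10, 25], [3, 1], [11, 7]],
)
```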
Figure 3-1 depicts an example leaf node in the ACE Tree with attribute range values
written above each section and section numbers marked below. Records within each
section are shown as circles.
Figure 3-1. Structure of a leaf node of the ACE tree.
3.3.2 ACE Tree Structure
Logically, the ACE Tree is a disk-based binary tree data structure with internal nodes
used to index leaf nodes, and leaf nodes used to store the actual data. Since the internal
nodes in a binary tree are much smaller than disk pages, they are packed and stored
together in disk-page-sized units [27]. Each internal node has the following components:
1. A range R of key values associated with the node.
2. A key value k that splits R and partitions the data on the left and right of the node.
3. Pointers ptrl and ptrr, that point to the left and right children of the node.
4. Counts cntl and cntr, that give the number of database records falling in the ranges associated with the left and right child nodes. These values can be used, for example, during evaluation of online aggregation queries, which require the size of the population from which we are sampling [54].
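A matching sketch of an internal node, with illustrative counts (the field names and the values 16/16 are assumptions for the example, not the thesis's code):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class InternalNode:
    """Internal ACE Tree node: a range R, a key k that splits R, child
    pointers, and per-child record counts (usable, e.g., as population
    sizes for online aggregation)."""
    R: Tuple[int, int]
    k: int
    ptr_l: Optional["InternalNode"] = None
    ptr_r: Optional["InternalNode"] = None
    cnt_l: int = 0
    cnt_r: int = 0

# Root of a tree like Figure 3-2: R = [0, 100], split at key 50; the
# counts (16 records under each child) are made-up example values.
root = InternalNode(R=(0, 100), k=50, cnt_l=16, cnt_r=16)
population = root.cnt_l + root.cnt_r   # records under the root
```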
Figure 3-2 shows the logical structure of the ACE Tree. Ii,j refers to the jth internal
node at level i. The root node is labeled with a range I1,1.R = [0-100], signifying that
all records in the data set have key values within this range. The key of the root node
partitions I1,1.R into I2,1.R = [0-50] and I2,2.R = [51-100]. Similarly each internal node
divides the range of its descendents with its own key.
The ranges associated with each section of a leaf node are determined by the ranges
associated with each internal node on the path from the root node to the leaf. For
example, if we consider the path from the root node down to leaf node L4, the ranges that
we encounter along the path are 0-100, 0-50, 26-50 and 38-50. Thus for L4, L4.S1 has a
random sample of records in the range 0-100, L4.S2 has a random sample in the range
0-50, L4.S3 has a random sample in the range 26-50, while L4.S4 has a random sample in the range 38-50.

Figure 3-2. Structure of the ACE tree.
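The rule just illustrated — the ranges encountered on the root-to-leaf path become the leaf's section ranges — can be sketched as follows (a hypothetical helper, using the split keys shown for Figure 3-2):

```python
def section_ranges(lo, hi, path):
    """Ranges of sections S1..Sh for a leaf reached from the root range
    [lo, hi] via `path`, a list of (key, direction) pairs: the left child
    covers [lo, key] and the right child covers [key + 1, hi]."""
    ranges = [(lo, hi)]
    for key, step in path:
        if step == "L":
            hi = key
        else:
            lo = key + 1
        ranges.append((lo, hi))
    return ranges

# Leaf L4 of Figure 3-2: keys 50, 25, 37 with directions left, right,
# right give the ranges 0-100, 0-50, 26-50, 38-50 for sections S1..S4.
ranges = section_ranges(0, 100, [(50, "L"), (25, "R"), (37, "R")])
```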
3.3.3 Example Query Execution in ACE Tree
In the following discussion, we demonstrate how the ACE Tree efficiently retrieves a
large random sample of records for any given range query. The query algorithm is formally
described in Section 3.6.
Let Q = [30-65] be our example query postulated over the ACE Tree depicted in
Figure 3-2. The query algorithm starts at I1,1, the root node. Since I2,1.R overlaps Q, the
algorithm decides to explore the left child node labeled I2,1 in Figure 3-2. At this point
the two range values associated with the left and right children of I2,1 are 0-25 and 26-50.
Since the left child range has no overlap with the query range, the algorithm chooses to
explore the right child next. At this child node (I3,2), the algorithm picks leaf node L3 to
be the first leaf node retrieved by the index. Records from section 1 of L3 (which totally
encompasses Q) are filtered for Q and returned immediately to the consumer of the sample
as a random sample from the range [30-65], while records from sections 2, 3 and 4 are
stored in memory. Figure 3-3 shows the random sample from section 1 of L3, which can be used directly for answering query Q.
Figure 3-3. Random samples from section 1 of L3.
Next, the algorithm again starts at the root node and now chooses to explore the
right child node I2,2. After performing range comparisons, it explores the left child of I2,2
which is I3,3 since I3,4.R has no overlap with Q. The algorithm chooses to visit the left
child node of I3,3 next, which is leaf node L5. This is the second leaf node to be retrieved.
As depicted in Figure 3-4, since L5.R1 encompasses Q, the records of L5.S1 are filtered and returned immediately to the user as two additional samples from Q. Furthermore, section 2 records are combined with section 2 records of L3 to obtain a random sample of records in the range 0-100. These are again filtered and returned, giving four more samples from Q. Section 3 records are also combined with section 3 records of L3 to obtain a sample of records in the range 26-75. Since this range also encompasses Q, the records are again filtered and returned, adding four more records to our sample. Finally, section 4 records are
stored in memory for later use.
Note that after retrieving just two leaf nodes in our small example, the algorithm
obtains eleven randomly selected records from the query range. However, in a real index,
this number would be many times greater. Thus, the ACE Tree supports “fast first”
sampling from a range predicate: a large number of samples are returned very quickly. We
contrast this with a sample taken from a B+-Tree having a similar structure to the ACE
Tree depicted in Figure 3-2. The B+-Tree sampling algorithm would need to pre-select
which nodes to explore. Since four leaf nodes in the tree are needed to span the query
range, there is a reasonably high likelihood that the first four samples taken would need
to access all four leaf nodes. As the ACE Tree Query Algorithm progresses, it goes on to
retrieve the rest of the leaf nodes in the order L4, L6, L1, L7, L2, L8.
Figure 3-4. Combining samples from L3 and L5.
3.3.4 Choice of Binary Versus k-Ary Tree
The ACE Tree as described above can also be implemented as a k-ary tree instead of
a binary tree. For example, for a ternary tree, each internal node can have two (instead
of one) keys and three (instead of two) children. If the height of the tree was h, every
leaf node would still have h ranges and h sections associated with them. Like a standard
complete k-ary tree, the number of leaf nodes will be k^(h−1). However, the big difference
would be the manner in which a query is executed using a k-ary ACE Tree as opposed to
a binary ACE Tree. The query algorithm will always start at the root node and traverse
down to a leaf. However, at every internal node it will alternate between the k children in
a round-robin fashion. Moreover, since the data space would be divided into k equal parts
at each level, the query algorithm might have to make k traversals and hence access k leaf
nodes before it can combine sections that can be used to answer the query. This would
mean that the query algorithm will have to wait longer (than a binary ACE Tree) before
it can combine leaf node sections and thus return useful random samples. Since the goal
of the ACE Tree is to support “fast first” sampling, use of a binary tree instead of a k-ary
tree seems to be a better choice to implement the ACE Tree.
3.4 Properties of the ACE Tree
In this Section we describe the three important properties of the ACE Tree which
facilitate the efficient retrieval of random samples from any range query, and will be
instrumental in ensuring the performance of the algorithm described in Section 3.6.
3.4.1 Combinability
Figure 3-5. Combining two sections of leaf nodes of the ACE tree.
The various samples produced from processing a set of leaf nodes are combinable.
For example, consider the two leaf nodes L1 and L3, and the query “Compute a random
sample of the records in the query range Ql = [3 to 47]”. As depicted in Figure 3-5, first
we read leaf node L1 and filter the second section in order to produce a random sample
of size n1 from Ql which is returned to the user. Next we read leaf node L3, and filter its
second section L3.S2 to produce a random sample of size n2 from Ql which is also returned
to the user. At this point, the two sets returned to the user constitute a single random
sample from Ql of size n1 + n2. This means that as more and more nodes are read from
disk, the records contained in them can be combined to obtain an ever-increasing random
sample from any range query.
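A minimal sketch of combinability (the section contents are illustrative values, not records from the thesis's figures):

```python
def filter_range(section, lo, hi):
    """Keep only the records of a section that fall in the query range."""
    return [r for r in section if lo <= r <= hi]

# L1.S2 and L3.S2 each hold an independent random subset of the records
# in the range 0-50 (contents illustrative).  Filtering each for
# Ql = [3, 47] and concatenating yields one random sample of size n1 + n2.
L1_S2 = [22, 18, 10, 25, 49]
L3_S2 = [34, 29, 45, 33, 2]
s1 = filter_range(L1_S2, 3, 47)   # returned to the user first (n1 records)
s2 = filter_range(L3_S2, 3, 47)   # returned next (n2 records)
combined = s1 + s2                # a single random sample from Ql
```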
3.4.2 Appendability
The ith sections from two leaf nodes are appendable. That is, given two leaf nodes Lj and Lk, Lj.Si ∪ Lk.Si is always a true random sample of all records of the database with key values within the range Lj.Ri ∪ Lk.Ri. For example, reconsider the query, “Compute a random sample of the records in the query range Ql = [3 to 47]”. As depicted in Figure 3-6, we can append the third section from node L3 to the third section from node L1 and filter the result to produce yet another random sample from Ql. This means that sections are never wasted.
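A minimal sketch of appendability, in the same illustrative style (contents and variable names are ours):

```python
# L1.S3 holds a random subset of range 0-25, and L3.S3 of range 26-50
# (contents illustrative).  Their union is a true random sample of the
# union of the two ranges, i.e. of 0-50 -- so no section is wasted.
L1_S3, L1_R3 = [3, 1, 11], (0, 25)
L3_S3, L3_R3 = [29, 45, 33], (26, 50)

appended = L1_S3 + L3_S3
appended_range = (min(L1_R3[0], L3_R3[0]), max(L1_R3[1], L3_R3[1]))

# The appended set can then be filtered for Ql = [3, 47] as before.
from_Ql = [r for r in appended if 3 <= r <= 47]
```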
3.4.3 Exponentiality
The ranges in a leaf node are exponential. The number of database records that fall in L.Ri is twice the number of records that fall in L.Ri+1. This allows the ACE Tree to maintain the invariant that for any query Q′ over a relation R such that at least hµ database records fall in Q′, and with |R|/2^(k+1) <= |σQ′(R)| <= |R|/2^k for some k <= h − 1, there exists a pair of leaf nodes Li and Lj, where at least one-half of the database records falling in Li.Rk+2 ∪ Lj.Rk+2 are also in Q′. Here, µ is the average number of records in each section, and h is the height of the tree or, equivalently, the total number of sections in any leaf node.
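Under this invariant, the section index to append for a query of known selectivity can be computed as follows (a hypothetical helper, not part of the thesis; the cap at h − 2 simply ensures that section k + 2 exists in a leaf with h sections):

```python
import math

def covering_section(db_size, query_count, h):
    """Find the k with db_size/2**(k+1) <= query_count <= db_size/2**k;
    by the exponentiality property, appending section k + 2 of a suitable
    pair of leaves then covers the query range."""
    k = int(math.floor(math.log2(db_size / query_count)))
    k = min(k, h - 2)          # cap so that section k + 2 exists
    return k + 2

# A query matching between 1/4 and 1/2 of a 32-record database (k = 1)
# is covered by appending section 3 of two leaves, as in the example.
section = covering_section(db_size=32, query_count=10, h=4)
```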
Figure 3-6. Appending two sections of leaf nodes of the ACE tree.
While the formal statement of the exponentiality property is a bit complicated, the net result is simple: there is always a pair of leaf nodes whose sections can be appended to form a set which can be filtered to quickly obtain a sample from any range query Q′.
As an illustration, consider query Q over the ACE Tree of Figure 3-2. Note that the
number of database records falling in Q is greater than one-fourth, but less than half the
database size. The exponentiality property assures us that Q can be totally covered by
appending sections of two different leaf nodes. In our example, this means that Q can be
covered by appending section 3 of nodes L4 and L6. If RC = L4.R3 ∪ L6.R3, then by the invariant given above we can claim that |σQ(R)| >= (1/2) × |σRC(R)|.

3.5 Construction of the ACE Tree
In this Section, we present an I/O efficient, bulk construction algorithm for the ACE
Tree.
3.5.1 Design Goals
The algorithm for building an ACE Tree index is designed with the following goals in
mind:
1. Since the ACE Tree may index enormous amounts of data, construction of the tree should rely on efficient, external memory algorithms, requiring as few passes through the data set as possible.

2. In the resulting data structure, the data which are placed in each leaf node section must constitute a true random sample (without replacement) of all database records lying within the range associated with that section.

3. Finally, the tree must be constructed in such a way as to have the exponentiality, combinability, and appendability properties necessary for supporting the ACE Tree query algorithms.
3.5.2 Construction
The construction of the ACE Tree proceeds in two distinct phases. Each phase comprises two read/write passes through the data set (that is, constructing an ACE Tree from scratch requires two external sorts of a large database table). The two phases are as follows:
1. During Phase 1, the data set is sorted based on the record key values. This sorted order of records is used to provide the split points associated with each internal node in the tree.

2. During Phase 2, the data are organized into leaf nodes based on those key values. Disk blocks corresponding to groups of internal nodes can easily be constructed at the same time as the final pass through the data writes the leaf nodes to disk.
3.5.3 Construction Phase 1
The primary task of Phase 1 is to assign split points to each internal node of the tree.
To achieve this, the construction algorithm first sorts the data set based upon keys of the
records, as depicted in Figure 3-7.
After the dataset is sorted, the median record for the entire data set is determined
(this value is 50 in our example). This record’s key will be used as the key associated with
the root of the ACE Tree, and will determine L.R2 for every leaf node in the tree. We
denote this key value by I1,1.k, since the value serves as the key of the first internal node
in level 1 of the tree.
After determining the key value associated with the root node, the medians of each of
the two halves of the data set partitioned by I1,1.k are chosen as keys for the two internal
nodes at the next level: I2,1.k and I2,2.k, respectively. In the example of Figure 3-7, these
Figure 3-7. Choosing keys for internal nodes.
values are 25 and 75. I2,1.k and I2,2.k, along with I1,1.k, will determine L.R3 for every
leaf node in the tree. The process is then repeated recursively until enough medians2
have been obtained to provide every internal node with a key value. Note that at the same time that these various key values are determined, the values cntl and cntr can also be determined.
This simple strategy for choosing the various key values in the tree ensures that
the exponentiality property will hold. If the data space between Ii+1,2j−1.k and Ii,j.k
corresponds to some leaf node range L.Rn, then the data space between Ii+1,2j−1.k and
Ii+2,4j−2.k will correspond to some range L.Rn+1. Since Ii+2,4j−2.k is the midpoint of
2 We choose a value for the height of the tree in such a manner that the expected size of a leaf node (see Section 3.5.6) does not exceed one logical disk block. Choosing a node size that corresponds to the block size is done for the same reason it is done in most traditional indexing structures: typically, the system disk block size has already been carefully chosen by a DBA to balance speed of sequential access (which demands a larger block size) with the cost of accessing more data than is needed (which demands a smaller block size).
the data space between Ii+1,2j−1.k and Ii,j.k, we know that two times as many database
records should fall in L.Rn, compared with L.Rn+1.
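Phase 1's recursive choice of medians can be sketched as follows (our own illustration; the real algorithm works on the externally sorted file rather than an in-memory list):

```python
def assign_keys(sorted_keys):
    """Phase-1 sketch: take the median of the sorted keys as the root's
    key, then recurse on the two halves, level by level, until every
    internal node has a key."""
    keys_by_level = []
    segments = [sorted_keys]
    while segments and len(segments[0]) > 1:
        level, next_segments = [], []
        for seg in segments:
            m = len(seg) // 2
            level.append(seg[m])                   # median -> node key
            next_segments += [seg[:m], seg[m + 1:]]
        keys_by_level.append(level)
        segments = next_segments
    return keys_by_level

# For keys 1..31 the root key is 16, the next level holds 8 and 24, etc.
levels = assign_keys(list(range(1, 32)))
```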
The following example also shows how the invariant described in Section 3.4.3 is guaranteed by adopting the aforementioned strategy of assigning key values to internal
nodes. Consider the ACE Tree of Figure 3-2. Figure 3-8 shows the keys of the internal
nodes as medians of the dataset R. We also consider two example queries, Q1 and Q2 such
that the number of database records falling in Q2 is greater than one-fourth but less than
one-half of the database size, while the number of database records falling in Q1 is more
than half the database size.
Figure 3-8. Exponentiality property of ACE tree.
Q1 can be answered by appending section 2 of (for example) L4 and L8 (refer to Figure 3-2). Let RC1 = L4.R2 ∪ L8.R2. Then all the database records fall in RC1. Moreover, since |σQ1(R)| >= |R|/2, we have |σQ1(R)| >= (1/2) × |σRC1(R)|. Similarly, Q2 can be answered by appending section 3 of (for example) L4 and L6. If RC2 = L4.R3 ∪ L6.R3, then half the database records fall in RC2. Also, since |σQ2(R)| >= |R|/4, we have |σQ2(R)| >= (1/2) × |σRC2(R)|. This can be generalized to obtain the invariant stated in Section 3.4.3.
3.5.4 Construction Phase 2
The objective of Phase 2 is to construct leaf nodes with appropriate sections and
populate them with records. This can be achieved by the following three steps:
Figure 3-9. Phase 2 of tree construction: (a) records assigned section numbers; (b) records assigned leaf numbers; (c) records organized into leaf nodes.
1. Assign a uniformly generated random number between 1 and h to each record as its section number.

2. Associate an additional random number with the record that will be used to identify the leaf node to which the record will be assigned.

3. Finally, re-organize the file by performing an external sort to group records in a given leaf node and a given section together.
Figure 3-9(a) depicts our example data set after we have assigned each record a randomly generated section number, assuming four sections in each leaf node.

In Step 2, the algorithm assigns one more randomly generated number to each record, which will identify the leaf node to which the record will be assigned. We assume for our example that the number of leaf nodes is 2^(h−1) = 2^3 = 8. The number to identify the leaf node is assigned as follows.
1. First, the section number of the record is checked. We denote this value as s.
2. We then start at the root of the tree and traverse down by comparing the record key with s − 1 key values. After the comparisons, if we arrive at an internal node, Ii,j, then we assign the record to one of the leaves in the subtree rooted at Ii,j.
From the example of Figure 3-9(a), the first record having key value 3 has been
assigned to section 1. Since this record can be randomly assigned to any leaf from 1
through 8, we assign it to leaf 7.
The next record of Figure 3-9(a) has been assigned to section number 2. Referring
back to Figure 3-7, we see that the key of the root node is 50. Since the key of the record
is 7 which is less than 50, the record will be assigned to a leaf node in the left subtree of
the root. Hence we assign a leaf node between 1 and 4 to this record. In our example, we
randomly choose the leaf node 3.
For the next record having key value 10, we see that the section number assigned is 3.
To assign a leaf node to this record, we initially compare its key with the key of the root
node. Referring to Figure 3-7, we see that 10 is smaller than 50; hence we then compare it
with 25 which is the key of the left child node of the root. Since the record key is smaller
than 25, we assign the record to some leaf node in the left subtree of the node with key 25
by assigning to it a random number between 1 and 2.
The section number and leaf node identifiers for each record are written in a small
amount of temporary disk space associated with each record. Once all records have been
assigned to leaf nodes and sections, the dataset is re-organized into leaf nodes using a
two-pass external sorting algorithm as follows:
• Records are sorted in ascending order of their leaf node number.
• Records with the same leaf node number are arranged in ascending order of their section number.
The re-organized data set is depicted in Figure 3-9(c).
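The three steps of Phase 2 can be sketched end-to-end as follows (a simplified, in-memory illustration with our own names; the real algorithm tags records in temporary disk space and uses a two-pass external sort):

```python
import random

def phase2_assign(records, keys_by_level, h):
    """Phase-2 sketch: each record draws a random section number s in
    1..h, is routed down s - 1 levels of split keys, and then draws a
    random leaf among the leaves of the subtree it reached.  Sorting by
    (leaf, section) groups the file the way the final pass writes it."""
    n_leaves = 2 ** (h - 1)
    tagged = []
    for key in records:
        s = random.randint(1, h)           # step 1: section number
        lo, hi, j = 1, n_leaves, 0         # feasible leaves; node index
        for level in range(s - 1):         # step 2: s - 1 comparisons
            mid = (lo + hi) // 2
            if key <= keys_by_level[level][j]:
                hi, j = mid, 2 * j
            else:
                lo, j = mid + 1, 2 * j + 1
        leaf = random.randint(lo, hi)      # random feasible leaf
        tagged.append((leaf, s, key))
    tagged.sort()                          # step 3: external sort in practice
    return tagged

random.seed(5)
split_keys = [[50], [25, 75], [12, 37, 62, 88]]   # keys as in Figure 3-7
out = phase2_assign(list(range(100)), split_keys, h=4)
```

For instance, a record with key 10 and section number 3 is compared with 50 and then 25, so it can only land in leaf 1 or 2, matching the worked example above.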
3.5.5 Combinability/Appendability Revisited
In Phase 2 of the tree construction, we observe that all records belonging to some
section k are segregated based upon the result of the comparison of their key with the
appropriate medians, and are then randomly assigned a leaf node number from the feasible
ones. Thus, if records from section s of all leaf nodes are merged together, we will obtain
all of the section s records. This ensures the appendability property of the ACE Tree.
Also note that the probability of assignment of one record to a section is unaffected
by the probability of assignment of some other record to that section. Since this results
in each section having a random subset of the database records, it is possible to merge
a sample of the records from one section that match a range query with a sample of
records from a different section that match the same query. This will produce a larger
random sample of records falling in the range of the query, thus ensuring the combinability
property.
3.5.6 Page Alignment
In Phase 2 of the construction algorithm, section numbers and leaf node numbers
are randomly generated. Hence we can only predict on expectation the number of records
that will fall in each section of each leaf node. As a result, section sizes within each leaf
node can differ, and the size of a leaf node itself is variable and will generally not be equal
to the size of a disk page. Thus when the leaf nodes are written out to disk, a single leaf
node may span across multiple disk pages or may be contained within a single disk page.
This situation could be avoided if we fix the size of each section a priori. However,
this poses a serious problem. Consider two leaf node sections Li.Sj and Li+1.Sj. We
can force these two sections to contain the same number of records by ensuring that the
set of records assigned to section j in Phase 2 of the construction algorithm has equal
representation from Li.Rj and Li+1.Rj. However, this means that the set of records
assigned to section j is no longer random. If we fix the section size and force a set number
of records to fall in each section, we invalidate the appendability and combinability
properties of the structure. Thus, we are forced to accept a variable section size.

In order to implement variable section size, we can adopt one of the following two schemes:
1. Enforce fixed-sized leaf nodes and allow variable-sized sections within the leaf nodes.
2. Allow variable-sized leaf nodes along with variable-sized sections.
If we choose the fixed-sized leaf node, variable-sized section scheme, leaf node size is
fixed in advance. However, section size is allowed to vary. This allows full sections to grow
further by claiming any available space within the leaf node. The leaf node size chosen
should be large enough to prevent any leaf node from becoming completely filled up, which
prevents the partitioning of any leaf node across two disk pages. The major drawback of
this scheme is that the average leaf node space utilization will be very low. Assuming a
reasonable set of ACE Tree parameters, a quick calculation shows that if we want to be
99% sure that no leaf node gets filled up, the average leaf node space utilization will be
less than 15%.
The variable-sized leaf node, variable-sized section scheme does not impose a size
limit on either the leaf node or the section. It allows leaf nodes to grow beyond disk page
boundaries, if space is required. The important advantage of this scheme is that it is
space-efficient. The main drawback of this approach is that leaf nodes may span multiple
disk pages, and hence all such pages must be accessed in order to retrieve such a leaf node.
Given that most of the cost associated with reading an arbitrary leaf page is associated
with the disk head movement needed to move the disk arm to the appropriate cylinder,
this does not pose too much of a problem. Hence we use this scheme for the construction
of leaf nodes of the ACE Tree.
3.6 Query Algorithm
In this Section, we describe in detail the algorithm used to answer range queries using
the ACE Tree.
3.6.1 Goals
The algorithm has been designed to meet the primary goal of achieving “fast-first”
sampling from the index structure, meaning that it greedily maximizes the number of
records relevant to the query that are retrieved in the early stages of execution. To meet
this goal, the query answering algorithm identifies the leaf nodes that contain the
maximum number of sections relevant to the query. A section $L_{i_1}.S_j$ is relevant for a
range query $Q$ if $L_{i_1}.R_j \cap Q \neq \emptyset$ and
$L_{i_1}.R_j \cup L_{i_2}.R_j \cup \cdots \cup L_{i_n}.R_j \supseteq Q$, where
$L_{i_1}, \ldots, L_{i_n}$ are some leaf nodes in the tree. The query algorithm prioritizes
retrieval of leaf nodes so as to:

• Facilitate the combination of sections so as to maximize $n$ in the above formulation, and

• Maximize the number of relevant sections in each retrieved leaf node $L$ such that
$L.S_j \cap Q \neq \emptyset$ for $j = (c+1), \ldots, h$, where $L.R_c$ is the smallest range in $L$ that
encompasses Q.
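Concretely, if key values are normalized to [0, 1), the level-j section ranges are the 2^{j-1} dyadic intervals, so the first condition reduces to a simple interval-overlap test. The helper below is an illustrative sketch under that normalization assumption (all names are our own):

```python
def overlapping_sections(q_lo, q_hi, h):
    """For each section level j = 1..h, count the dyadic intervals of width
    2^-(j-1) over [0, 1) that intersect the query range [q_lo, q_hi).
    At each level, the union of these intervals covers Q."""
    counts = {}
    for j in range(1, h + 1):
        width = 1.0 / (2 ** (j - 1))
        n = 0
        for k in range(2 ** (j - 1)):
            lo, hi = k * width, (k + 1) * width
            if lo < q_hi and q_lo < hi:  # non-empty intersection with Q
                n += 1
        counts[j] = n
    return counts

print(overlapping_sections(0.30, 0.60, 4))  # {1: 1, 2: 2, 3: 2, 4: 3}
```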
3.6.2 Algorithm Overview
At a high level, the query answering algorithm retrieves the leaf nodes relevant to
answering a query via a series of stabs or traversals, accessing one leaf node per stab.
Each stab begins at the root node and traverses down to a leaf. The distinctive feature of
the algorithm is that at each internal node that is traversed during a stab, the algorithm
chooses to access the child node that was not chosen the last time the node was traversed.
For example, imagine that for a given internal node I, the algorithm chooses to traverse
to the left child of I during a stab. The next time that I is accessed during a stab, the
algorithm will choose to traverse to the right child node. This can be seen in Figure
3-10, when we compare the paths taken by Stab 1 and Stab 2. The algorithm chooses to
traverse to the left child of the root node during the first stab, while during the second
stab it chooses to traverse to the right child of the root node.
The advantage of retrieving leaf nodes in this back and forth sequence is that it allows
us to quickly retrieve a set of leaf nodes with the most disparate sections possible in a
Figure 3-10. Execution runs of the query answering algorithm with (a) 1 contributing section, (b) 6 contributing sections, (c) 7 contributing sections, and (d) 16 contributing sections.
given number of stabs. The reason that we want a non-homogeneous set of nodes is that
nodes from very distant portions of a query range will tend to have sections covering large
ranges that do not overlap. This allows us to append sections of newly retrieved leaf nodes
with the corresponding sections of previously retrieved leaf nodes. The samples obtained
can then be filtered and immediately returned.
This order of retrieval is implemented by associating a bit with each internal node
that indicates whether the next child node to be retrieved should be the left node or the
right node. The value of this bit is toggled every time the node is accessed. Figure 3-10
illustrates the choices made by the algorithm at each internal node during four separate
stabs. Note that when the algorithm reaches an internal node where the range associated
with one of the child nodes has no overlap with the query range, the algorithm always
picks the child node that has overlap with the query, irrespective of the value of the
indicator bit. The only exception to this is when all leaf nodes of the subtree rooted at
an internal node which overlaps the query range have been accessed. In such a case, the
internal node which overlaps the query range is not chosen and is never accessed again.
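Ignoring query-range pruning and the done flags, the toggle-bit discipline alone determines the leaf visit order, which turns out to be the bit-reversal permutation of the leaves. The sketch below is our own simplified prototype, not the dissertation's implementation:

```python
def stab_order(height):
    """Visit every leaf of a complete binary tree of the given height via
    repeated root-to-leaf stabs, flipping each internal node's direction
    bit after use (0 = left, 1 = right)."""
    next_child = {}               # internal node id -> direction bit
    n_leaves = 2 ** (height - 1)
    order = []
    for _ in range(n_leaves):
        node, leaf = 1, 0         # heap-style node ids; leaf index built bitwise
        for _ in range(height - 1):
            bit = next_child.get(node, 0)
            next_child[node] = 1 - bit      # toggle for the next stab
            leaf = 2 * leaf + bit
            node = 2 * node + bit
        order.append(leaf)
    return order

print(stab_order(4))  # bit-reversal order: [0, 4, 2, 6, 1, 5, 3, 7]
```

Note how each stab lands as far as possible from the previous one, which is exactly the "disparate sections" property the text describes.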
3.6.3 Data Structures
In addition to the structure of the internal and leaf nodes of the ACE Tree, the queryalgorithm uses and updates the following two memory resident data structures:
1. A lookup table T , to store internal node information in the form of a pair of values(next = [LEFT] | [RIGHT], done = [TRUE] | [FALSE]). The first value indicates whetherthe next node to be retrieved should be the left child or right child. The second valueis TRUE if all leaf nodes in the subtree rooted at the current node have already beenaccessed, else it is FALSE.
2. An array buckets[h] to hold sections of all the leaf nodes which have been accessed sofar and whose records could not be used to answer the query. h is the height of theACE Tree.
3.6.4 Actual Algorithm
We now present the algorithms used for answering queries using the ACE Tree.
Algorithm 2 simply calls Algorithm 3 which is the main tree traversal algorithm, called
Shuttle(). Each traversal or stab begins at the root node and proceeds down to a leaf
node. In each invocation of Shuttle(), a recursive call is made to either its left or right
child with the recursion ending when it reaches a leaf node. At this point, the sections in
the leaf node are combined with previously retrieved sections so that they can be used to
answer the query. The algorithm for combining sections is described in Algorithm 4. This
Algorithm 2: Query Answering Algorithm

Algorithm Answer(Query Q)
  Let root be the root of the ACE Tree
  While (!T.lookup(root).done)
    T.lookup(root).done = Shuttle(Q, root)
Algorithm 3: ACE Tree traversal algorithm

Algorithm Shuttle(Query Q, Node curr_node)
  If (curr_node is an internal node)
    left_node = curr_node→get_left_node()
    right_node = curr_node→get_right_node()
    If (left_node is done AND right_node is done)
      Mark curr_node as done
    Else if (left_node is done)          // only the right child remains
      Shuttle(Q, right_node)
    Else if (right_node is done)         // only the left child remains
      Shuttle(Q, left_node)
    Else                                 // neither child is done
      If (Q overlaps only with left_node.R)
        Shuttle(Q, left_node)
      Else if (Q overlaps only with right_node.R)
        Shuttle(Q, right_node)
      Else                               // Q overlaps both sides or none
        If (next_node is LEFT)
          Shuttle(Q, left_node)
          Set next_node to RIGHT
        Else                             // next_node is RIGHT
          Shuttle(Q, right_node)
          Set next_node to LEFT
  Else                                   // curr_node is a leaf node
    Combine_Tuples(Q, curr_node)
    Mark curr_node as done
algorithm determines the sections that are required to be combined with every new section
s that is retrieved and then searches for them in the array buckets[]. If all sections are
found, it combines them with s and removes them from buckets[]. If it does not find all
the required sections in buckets[], it stores s in buckets[].
Algorithm 4: Algorithm for combining sections

Algorithm Combine_Tuples(Query Q, LeafNode node)
  For each section s in node do
    Store the section numbers required to be
      combined with s to span Q in a list, list
    flag = true
    For each section number i in list do
      If buckets[] does not have section i
        flag = false
    If (flag == true)
      Combine all sections from list with s
        and use the records to answer Q
    Else
      Store s in the appropriate bucket
3.6.5 Algorithm Analysis
We now present a lower bound on the expected performance of the ACE Tree index
for sampling from a relational selection predicate. For simplicity, our analysis assumes that
the number of leaf nodes in the tree is a power of 2.

Lemma 1. Efficiency of the ACE Tree for query evaluation.

• Let $n$ be the total number of leaf nodes in an ACE Tree used to sample from some arbitrary range query $Q$.

• Let $p$ be the largest power of 2 no greater than $n$.

• Let $\mu$ be the mean section size in the tree.

• Let $\alpha$ be the fraction of database records falling in $Q$.

• Let $N$ be the size of the sample from $Q$ that has been obtained after $m$ ACE Tree leaf nodes have been retrieved from disk.

If $m$ is not too large (that is, if $m \le 2\alpha n + 2$), then:

$$E[N] \ge \frac{\mu}{2}\, p \log_2 p$$

where $E[N]$ denotes the expected value of $N$ (the mean value of $N$ after an infinite number
of trials).
Proof. Let $I_{i,j}$ and $I_{i,j+1}$ be the two internal nodes in the ACE Tree where
$R = I_{i,j}.R \cup I_{i,j+1}.R$ covers $Q$ and $i$ is maximized. As long as the Shuttle algorithm has not
retrieved all the children of $I_{i,j}$ and $I_{i,j+1}$ (this is the case as long as $m \le 2\alpha n + 2$), when
the $m$th leaf node has been processed, the expected number of new samples obtained is:

$$N_m = \sum_{k=1}^{\lfloor \log_2 m \rfloor} \sum_{l=1}^{2^{k-1}} w_{kl}\, \mu$$

where the outer summation is over each of the $h - i$ contributing sections of the leaf
nodes, starting with section number $i$ up to section number $h$, while $\sum_l w_{kl}$ represents the
fraction of records of the $2^{k-1}$ combined sections that satisfy $Q$. By the exponentiality
property, $\sum_l w_{kl} \ge 1/2$ for every $k$, so:

$$N_m \ge \frac{\mu}{2} \log_2 m$$

Thus, after $m$ leaf nodes have been obtained, the total expected number of samples is given
by:

$$E[N] \ge \sum_{k=1}^{m} N_k \ge \sum_{k=1}^{m} \frac{\mu}{2} \log_2 k \ge \frac{\mu}{2}\, m \log_2 m$$

If $m$ is a power of 2, the result is proven.
Lemma 2. The expected number of records $\mu$ in any leaf node section is given by:

$$E[\mu] = \frac{|R|}{h\, 2^{h-1}}$$

where $|R|$ is the total number of database records, $h$ is the height of the ACE Tree, and $2^{h-1}$
is the number of leaf nodes in the ACE Tree.
Proof. The probability of assigning a record to any section $i$, $i \le h$, is $1/h$. Given that
the record is assigned to section $i$, it can be assigned to only one of $2^{i-1}$ leaf node groups
after comparing with the appropriate medians. Since each group has $2^{h-1}/2^{i-1}$
candidate leaf nodes, the probability structure implies that for any leaf node $L_j$:

$$E[\mu_{i,j}] = \sum_{t \in R} \frac{1}{h} \left( \sum_{k=1}^{2^{i-1}-1} 0 \cdot \frac{1}{2^{i-1}} \;+\; \frac{1}{2^{i-1}} \cdot \frac{2^{i-1}}{2^{h-1}} \right) = \frac{|R|}{h\, 2^{h-1}}$$

This completes the proof of the lemma.
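The assignment procedure underlying this proof is easy to simulate. The sketch below assumes keys uniform on [0, 1) (so the $2^{i-1}$ key groups are equal-width intervals) and checks that the average section size equals $|R|/(h \cdot 2^{h-1})$; it is an illustration, not the dissertation's build code:

```python
import random

def assign(num_records=6000, h=4, seed=1):
    """Assign each record a uniform section number 1..h, then a uniform
    candidate leaf within the key's group at that level; return the per-
    (leaf, section) counts and the predicted mean section size."""
    random.seed(seed)
    n_leaves = 2 ** (h - 1)
    counts = {}
    for _ in range(num_records):
        key = random.random()
        i = random.randint(1, h)              # section number, uniform on 1..h
        group = int(key * 2 ** (i - 1))       # which of the 2^(i-1) key groups
        per_group = n_leaves // 2 ** (i - 1)  # candidate leaves in that group
        leaf = group * per_group + random.randrange(per_group)
        counts[(leaf, i)] = counts.get((leaf, i), 0) + 1
    avg = num_records / (h * n_leaves)        # = |R| / (h * 2^(h-1))
    return counts, avg

counts, avg = assign()
print(f"expected mean section size: {avg:.1f}")  # 6000 / (4 * 8) = 187.5
```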
3.7 Multi-Dimensional ACE Trees
The ACE Tree can be easily extended to support queries that include multi-dimensional
predicates. The change needed to incorporate this extension is to use a k-d binary tree
instead of the regular binary tree for the ACE Tree. Let a1, . . . , ak be the k key attributes
for the k-d ACE Tree. To construct such a tree, the root node would be the median of all
the a1 values in the database. Thus the root partitions the dataset based on a1. At the
next step, we need to assign values for level 2 internal nodes of the tree. For each of the
resulting partitions of the dataset, we calculate the median of all the a2 values. These two
medians are assigned to the two internal nodes at level 2 respectively, and we recursively
partition the two halves based on a2. This process is continued until we finish level k.
At level k + 1, we again consider a1 for choosing the medians. We would then assign a
randomly generated section number to every record. The strategy for assigning a leaf node
number to the records would also be similar to the one described in Section 3.5.4 except
that the appropriate key attribute is used while performing comparisons with the internal
nodes. Finally, the dataset is sorted into leaf nodes as in Figure 3-9(c).
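The level-by-level median computation can be sketched as follows, assuming the data fit in memory (a real build would use external sorting); the function illustrates the construction order and is our own code, not the dissertation's:

```python
import statistics

def kd_levels(points, depth, k=2):
    """Return, for each of `depth` levels, the list of split medians.
    Level d splits on attribute d mod k, cycling as in a k-d tree."""
    levels, partitions = [], [list(points)]
    for d in range(depth):
        attr = d % k
        meds, nxt = [], []
        for part in partitions:
            m = statistics.median(p[attr] for p in part)
            meds.append(m)
            nxt.append([p for p in part if p[attr] <= m])
            nxt.append([p for p in part if p[attr] > m])
        levels.append(meds)
        partitions = nxt
    return levels

pts = [(1, 9), (2, 4), (3, 7), (4, 1), (5, 8), (6, 2), (7, 6), (8, 3)]
print(kd_levels(pts, 3))  # [[4.5], [5.5, 4.5], [3.0, 2.0, 7.0, 6.0]]
```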
Query answering with the k-d ACE Tree can use the Shuttle algorithm described
earlier with a few minor modifications. Whenever a section is retrieved by the algorithm,
only records which satisfy all predicates in the query should be returned. Also, the mth
Figure 3-11. Sampling rate of an ACE Tree vs. rate for a B+-Tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 0.25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time plotted as a percentage of the time required to scan the relation.
sections of two leaf nodes can be combined only if they match in all m dimensions. The
nth sections of two leaf nodes can be appended only if they match in the first n − 1
dimensions and form a contiguous interval over the nth dimension.
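This append rule can be written as a small predicate. The encoding below, with each section range given as a list of per-dimension intervals, is our own illustrative choice:

```python
def can_append(r1, r2, n):
    """The nth sections of two leaf nodes may be appended iff their ranges
    agree on the first n-1 dimensions and are contiguous along dimension n
    (1-based). Each range is a list of (lo, hi) intervals, one per dimension."""
    if r1[:n - 1] != r2[:n - 1]:
        return False
    (lo1, hi1), (lo2, hi2) = r1[n - 1], r2[n - 1]
    return hi1 == lo2 or hi2 == lo1

a = [(0, 50), (0, 25)]
b = [(0, 50), (25, 50)]
c = [(50, 100), (25, 50)]
print(can_append(a, b, 2), can_append(a, c, 2))  # True False
```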
3.8 Benchmarking
In this Section, we describe a set of experiments designed to test the ability of the
ACE Tree to quickly provide an online random sample from a relational selection predicate
as well as to demonstrate that the memory requirement of the ACE Tree is reasonable.
We performed two sets of experiments. The first set is designed to test the utility of the
ACE Tree for use with one-dimensional data, where the ACE Tree is compared with
a simple sequential file scan as well as Antoshenkov’s algorithm for sampling from a
ranked B+-Tree. In the second set, we compare a multi-dimensional ACE Tree with the
Figure 3-12. Sampling rate of an ACE Tree vs. rate for a B+-Tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 2.5% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time plotted as a percentage of the time required to scan the relation.
sequential file scan as well as with the obvious extension of Antoshenkov’s algorithm to a
two-dimensional R-Tree.
3.8.1 Overview
All experiments were performed on a Linux workstation with 1GB of RAM, a 2.4GHz
CPU, and two 80GB, 15,000 RPM Seagate SCSI disks. 64KB data pages were used.
Experiment 1. For the first set of experiments, we consider the problem of sampling
from a range query of the form:
SELECT * FROM SALE
WHERE SALE.DAY >= i AND SALE.DAY <= j
We implemented and tested the following three random-order record retrieval algorithms for sampling the range query:
Figure 3-13. Sampling rate of an ACE Tree vs. rate for a B+-Tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time plotted as a percentage of the time required to scan the relation.
1. ACE Tree query algorithm: The ACE Tree was implemented exactly as described in this thesis. In order to use the ACE Tree to aid in sampling from the SALE relation, a materialized sample view for the relation was created, using SALE.DAY as the indexed attribute.
2. Random sampling from a B+-Tree: Antoshenkov's algorithm for sampling from a ranked B+-Tree was implemented as described in Algorithm 1. The B+-Tree used in the experiment was a primary index on the SALE relation (that is, the underlying data were actually stored within the tree), and was constructed using the standard B+-Tree bulk construction algorithm.
3. Sampling from a randomly permuted file: We implemented this random sampling technique as described in Section 3.2.1 of this chapter. This is the standard sampling technique used in previous work on online aggregation. The SALE relation was randomly permuted by assigning a random key value k to each record. All of the records from SALE were then sorted in ascending order of the k values using a two-phase, multi-way merge sort (TPMMS) (see Garcia-Molina et al. [38]). As the sorted records are written back to disk in the final pass of the TPMMS, k is removed from the file. To sample from a range predicate using a randomly permuted file, the
Figure 3-14. Sampling rate of an ACE Tree vs. rate for a B+-Tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 2.5% of the database records. The graph is an extension of Figure 3-12 and shows results until all three sampling techniques return all the records matching the query predicate.
file is scanned from front to back and all records matching the range predicate are immediately returned.
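The permute-then-scan scheme is simple to sketch. The code below uses an in-memory sort as a stand-in for the external TPMMS, and the names are ours:

```python
import random

def build_permuted_file(records, seed=42):
    """Pair each record with a random key, sort on it, then drop the key --
    an in-memory analogue of the two-phase multiway merge sort step."""
    rng = random.Random(seed)
    keyed = [(rng.random(), r) for r in records]
    keyed.sort(key=lambda kr: kr[0])
    return [r for _, r in keyed]

def sample_range(permuted, lo, hi):
    """Front-to-back scan; records matching the predicate arrive in random
    order, so every prefix is a uniform random sample of the range."""
    for r in permuted:
        if lo <= r <= hi:
            yield r

f = build_permuted_file(range(100))
print(list(sample_range(f, 10, 19)))  # the ten matching records, shuffled
```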
For the first set of experiments, we synthetically generated the SALE relation to be
20GB in size with 100B records, resulting in around 200 million records in the relation.
We began the first set of experiments by sampling from 10 different range selection
predicates over SALE using the three sampling techniques described above. 0.25% of the
records from SALE satisfied each range selection predicate. For each of the three random
sampling algorithms, we recorded the total number of random samples retrieved by the
algorithm at each time instant. The average number of random samples obtained for each
of the ten queries was then calculated. This average is plotted as a percentage of the total
number of records in SALE along the Y-axis in Figure 3-11. On the X-axis, we have plotted
the elapsed time as a percentage of the time required to scan the entire relation. We chose
Figure 3-15. Number of records needed to be buffered by the ACE Tree for queries with (a) 0.25% and (b) 2.5% selectivity. The graphs show the number of records buffered as a fraction of the total database records versus time plotted as a percentage of the time required to scan the relation.
this metric because the linear scan serves as the baseline record retrieval method. The test
was then repeated with two more sets of selection predicates that are satisfied by 2.5% and
25% of SALE's records, respectively. The results are plotted in Figure 3-12 and Figure
3-13. For all three figures, results are shown for the first 15 seconds of execution,
corresponding to approximately 4% of the time required to scan the relation. We show an
additional graph in Figure 3-14 for the 2.5% selectivity case, where we plot results until all
three record retrieval algorithms return all the records matching the query predicate.
Finally, we provide experimental results indicating the number of records that
must be buffered by the ACE Tree query algorithm for two different query
selectivities. Figure 3-15(a) shows the minimum, maximum and the average number of
records stored for ten different queries having a selectivity of 0.25% while Figure 3-15(b)
shows similar results for queries having selectivity 2.5%.
Experiment 2. For the second set of experiments, we add an additional attribute
AMOUNT to the SALE relation and test the following two-dimensional range query:
SELECT * FROM SALE
WHERE SALE.DAY >= d1 AND SALE.DAY <= d2
AND SALE.AMOUNT >= a1 AND SALE.AMOUNT <= a2
To generate the SALE relation, each (DAY, AMOUNT) pair in each record is generated by
sampling from a bivariate uniform distribution.
In this experiment, we again test the three random sampling options given above:
1. ACE Tree query algorithm: The ACE Tree for multi-dimensional data (a k-d ACE Tree) was implemented exactly as described in Section 3.7. It was used to create a materialized sample view over the DAY and AMOUNT attributes.
2. Random sampling from an R-Tree: Antoshenkov's algorithm for sampling from a ranked B+-Tree was extended in the obvious fashion for sampling from an R-Tree [46]. Just as in the case of the B+-Tree, the R-Tree is created as a primary index, and the data from the SALE relation are actually stored in the leaf nodes of the tree. The R-Tree was constructed in bulk using the well-known Sort-Tile-Recursive [81] bulk construction algorithm.
3. Sampling from a randomly permuted file: We implemented this random sampling technique in the same manner as in Experiment 1.
In this experiment, the SALE relation was generated so as to be about 16 GB in
size. Each record in the relation was 100B in size, resulting in approximately 160 million
records.
Just as in the first experiment, we began by sampling from 10 different range selection
predicates over SALE using the three sampling techniques described above. 0.25% of
the records from SALE satisfied each range selection predicate. For all the three random
sampling algorithms, we recorded the total number of random samples retrieved by the
algorithm at each time instant. The average number of random samples obtained for each
of the ten queries is then computed. This average is plotted as a percentage of the total
number of records in SALE along the Y-axis in Figure 3-16. On the X-axis, we have plotted
the elapsed time as a percentage of the time required to scan the entire relation. The
test was then repeated with two more sets of selection predicates that are satisfied by 2.5%
and 25% of the SALE relation's records. The results are plotted in Figure 3-17 and
Figure 3-18, respectively.
3.8.2 Discussion of Experimental Results
There are several important observations that can be made from the experimental
results. Irrespective of the selectivity of the query, we observed that the ACE Tree clearly
provides a much faster sampling rate during the first few seconds of query execution
compared with the other approaches. This advantage tends to degrade over time, but since
sampling is often performed only as long as more samples are needed to achieve a desired
accuracy, the fact that the ACE Tree can provide a large, online random
sample almost immediately indicates its practical utility.
Another observation indicating the utility of the ACE Tree is that while it was the
top performer over the three query selectivities tested, the best alternative to the ACE
Tree generally changed depending on the query selectivity. For highly selective queries,
Figure 3-16. Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly permuted file, with a spatial selection predicate accepting 0.25% of the database tuples.
the randomly-permuted file is almost useless due to the fact that the chance that any
given record is accepted by the relational selection predicate is very low. On the other
hand, the B+-Tree (and the R-Tree over multi-dimensional data) performs relatively
well for highly selective queries. The reason for this is that during the sampling, if the
query range is small, then all the leaf pages of the B+-Tree (or R-Tree) containing records
that match the query predicate are retrieved very quickly. Once all of the relevant pages
are in the buffer, the sampling algorithm does not have to access the disk to satisfy
subsequent sample requests and the rate of record retrieval increases rapidly. However,
for less selective queries, the randomly-permuted file works well since it can make use of
an efficient, sequential disk scan to retrieve records. As long as a relatively large fraction
of the records retrieved match the selection predicate, the amount of waste incurred by
scanning unwanted records as well is small compared to the additional efficiency gained by
the sequential scan. On the other hand, when the range associated with a query having
Figure 3-17. Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly permuted file, with a spatial selection predicate accepting 2.5% of the database tuples.
high selectivity is very large, the time required to load all of the relevant B+-Tree (or
R-Tree) pages into memory using random disk I/Os is prohibitive. Even if the query is run
long enough that all of the relevant pages are touched, for a query with high selectivity,
the buffer manager cannot be expected to buffer all the B+-Tree (or R-Tree) pages that
contain records matching the query predicate. This is the reason that the curve for the
B+-Tree in Figure 3-13 or for the R-Tree in Figure 3-18, never leaves the y-axis for the
time range plotted.
The net result of this is that if an ACE Tree were not used, it would probably
be necessary to use both a B+-Tree and a randomly-permuted file in order to ensure
satisfactory performance in the general case. Again, this is a point which seems to strongly
favor use of the ACE Tree.
An observation we make from Figure 3-14 is that if all the three record retrieval
algorithms are allowed to run to completion, we find that the ACE Tree is not the first to
Figure 3-18. Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly permuted file, with a spatial selection predicate accepting 25% of the database tuples.
complete execution. Thus, there is generally a crossover point beyond which the sampling
rate of an alternative random sampling technique is higher than the sampling rate of the
ACE Tree. However, the important point is that such a transition always occurs very late
in the query execution by which time the ACE Tree has already retrieved almost 90% of
the possible random samples. We found this trend for all the different query selectivities
we tested with single dimensional as well as multi-dimensional ACE Trees. Thus, we
emphasize that the existence of such a crossover point in no way belittles the utility of the
ACE Tree since in practical applications where random samples are used, the number of
random samples required is very small. Since the ACE Tree provides the desired number
of random samples (and many more) much faster than the other two methods, it still
emerges as the top performer among the three methods for obtaining random samples.
Finally, Figure 3-15 shows the memory requirement of the ACE Tree to store records
that match the query predicate but cannot yet be used to answer the query. The
fluctuations in the number of records buffered by the query algorithm at different times
during the query execution are as expected.
required by the query algorithm can vary as newly retrieved leaf node sections are either
buffered (thus requiring more buffer space) or can be appended with already buffered
sections (thus releasing buffer space). We also note from Figure 3-15 that the ACE Tree
has a reasonable memory requirement since a very small fraction of the total number of
records is buffered by it.
3.9 Conclusion and Discussion
In this chapter we have presented the idea of a “sample view” which is an indexed,
materialized view of an underlying database relation. The sample view facilitates efficient
random sampling of records satisfying a relational range predicate. This chapter also
describes the ACE Tree, a new indexing structure that we use to index the sample
view. We have shown experimentally that with the ACE Tree index, the sample view can
be used to provide an online random sample with much greater efficiency than the obvious
alternatives. For applications like online aggregation or data mining that require a random
ordering of input records, this makes the ACE Tree a natural choice for random sampling.
This is not to say that the ACE Tree is without any drawbacks. One obvious concern
is that the ACE Tree is a primary file organization as well as an index, and hence it
requires that the data be stored within the ACE Tree structure. This means that if the
data are stored within an ACE Tree, then without replication of the data elsewhere
it is not possible to cluster the data in another way at the same time. This may be a
drawback for some applications. For example, it might be desirable to organize the data
as a B+-Tree if non-sampling-based range queries are asked frequently as well, and this is
precluded by the ACE Tree. This is certainly a valid concern. However, we still feel that
the ACE Tree will be one important weapon in a data analyst’s arsenal. Applications like
online aggregation (where the database is used primarily or exclusively for sampling-based
analysis) already require that the data be clustered in a randomized fashion; in such a
situation, it is not possible to apply traditional structures like a B+-Tree anyway, and
so there is no additional cost associated with the use of an ACE Tree as the primary
file organization. Even if the primary purpose of the database is a more traditional or
widespread application such as OLAP, we note that it is becoming increasingly common
for analysts to subsample the database and apply various analytic techniques (such as
data mining) to the subsample; if such a sample were to be materialized anyway, then
organizing the subsample itself as an ACE Tree in order to facilitate efficient online
analysis would be a natural choice.
Another potential drawback of the ACE Tree, as it has been described in this chapter,
is that it is not an incrementally updateable structure. The ACE Tree is relatively
efficient to construct in bulk: it requires two external sorts of the underlying data to
build from scratch. The difficulty is that as new data are added, there is not an easy
way to update the structure without rebuilding it from scratch. Thus, one potential
area for future work is to add the ability to handle incremental inserts to the sample
view (assuming that the ACE Tree is most useful in a data warehousing environment,
where deletes are far less common). However, we note that even without the ability to
incrementally update an ACE-Tree, it is still easily usable in a dynamic environment if a
standard method such as a differential file [111] is applied. Specifically, one could maintain
the differential file as a randomly permuted file or even a second ACE Tree, and when a
relational selection query is posed, in order to draw a random sample from the query one
selects the next sample from either the primary ACE Tree or the differential file with an
appropriate hypergeometric probability (for an idea of how this could be done, see the
recent paper of Brown and Haas [12] for a discussion of how to draw a single sample from
multiple data set partitions). Thus, we argue that the lack of an algorithm to update the
ACE tree incrementally may not be a tremendous drawback.
Finally, we close the chapter by asserting that the importance of having indexing
methods that can handle insertions incrementally is often overstated in the research
literature. In practice, most incrementally-updateable structures such as B+-Trees
cannot be updated incrementally in a data warehousing environment due to performance
considerations anyway [93]. Such structures still require on the order of one random
I/O per update, rendering it impossible to efficiently process bulk updates consisting of
millions of records without simply rebuilding the structure from scratch. Thus, we feel
that the drawbacks associated with the ACE Tree do not prevent its utility in many
real-world situations.
CHAPTER 4
SAMPLING-BASED ESTIMATORS FOR SUBSET-BASED QUERIES
4.1 Introduction
Sampling is well-established as a method for dealing with very large volumes of
data, when it is simply not practical or desirable to perform the computation over the
entire data set. Sampling has several advantages compared to other widely-studied
approximation methodologies from the data management literature such as wavelets [88],
histograms [92] and sketches [29]. Not the least of those is generality: it is very easy to
efficiently draw a sample from a large data set in a single pass using reservoir techniques
[34]. Then, once the sample has been drawn it is possible to guess, with greater or lesser
accuracy, the answer to virtually any statistical query over the data set. Samples can easily
handle many different database queries, including complex functions in relational selection
and join predicates. The same cannot be said of the other approximation methods, which
generally require more knowledge of the query during synopsis construction, such as the
attribute that will appear in the SELECT clause of the SQL query corresponding to the
desired statistical calculation.
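The reservoir technique cited above can be sketched in a few lines; the Python below is a minimal version of Algorithm R (the function name and toy usage are ours, for illustration only):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R: keep a uniform without-replacement sample of k items
    from a stream of unknown length, in a single pass."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)           # fill the reservoir first
        else:
            j = rng.randrange(i + 1)      # item i survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample
```

Because each record is touched exactly once, the sample can be drawn while the data set is being scanned for other purposes.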
However, one class of aggregate queries that remains difficult or impossible to answer with samples is the so-called “subset” queries, which can generally be written in SQL in
the form:
SELECT SUM (f1(r))
FROM R as r
WHERE f2(r) AND NOT EXISTS
(SELECT * FROM S AS s
WHERE f3(r, s))
Note that the function f2 can be incorporated into f1 if we have f1 evaluate to zero if
f2 is not true; thus, in the remainder of the chapter we will ignore f2. An example of such
a query is: “Find the total salary of all employees who have not made a sale in the past
year”:
SELECT SUM (e.SAL)
FROM EMP AS e
WHERE NOT EXISTS
(SELECT * FROM SALE AS s
WHERE s.EID = e.EID)
A general solution to this problem would greatly extend the class of database-style
queries that are amenable to being answered via random sampling. For example, there is
a very close relationship between such queries and those obtained by removing the NOT
in the subquery. Using the terminology introduced later in this chapter, all records from
EMP with i matching records in SALE are called “class i” records. The only difference between NOT
EXISTS and EXISTS is that the former query computes a sum over all class 0 records,
whereas the latter query computes a sum over all class i > 0 records. Since any reasonable
estimator for NOT EXISTS will likely have to compute an estimated sum over each class, a
solution for NOT EXISTS should immediately suggest a solution for EXISTS. Also, nested
queries having an IN (or NOT IN) clause can be easily rewritten as a nested query having
the EXISTS (or NOT EXISTS) clause. For example, the query “Find the total salary of all
employees who have not made a sale in the past year” given above can also be written as:
SELECT SUM (e.SAL)
FROM EMP as e
WHERE e.EID NOT IN
(SELECT s.EID FROM SALE AS s)
Furthermore, a solution to the problem of sampling for subset queries would allow
sampling-based aggregates over SQL DISTINCT queries, which can easily be re-written as
subset queries. For example:
SELECT SUM (DISTINCT e.SAL)
FROM EMP AS e
is equivalent to:
SELECT SUM (e.SAL)
FROM EMP AS e
WHERE NOT EXISTS
(SELECT * FROM EMP AS e2
WHERE id(e) < id(e2)
AND e.SAL = e2.SAL)
In this query, id is a function that returns the row identifier for the record in question.
Some work has considered the problem of sampling for counts of distinct attribute values [17, 49], but computing aggregates over DISTINCT queries remains an open problem. Similarly, an aggregate query where records with identical values may appear more than once in the data, but should be considered no more than once by the aggregate function, can also be written as a subset-based SQL query. For example:
SELECT SUM (e.SAL)
FROM EMP AS e
WHERE NOT EXISTS
(SELECT * FROM EMP AS e2
WHERE id(e) < id(e2)
AND identical(e, e2))
In this query, the function identical returns true if the two records contain identical
values for all of their attributes. This would be very useful in computations where the
same data object may be seen at many sites in a distributed environment (packets in an
IP network, for example). Previous work has considered how to perform sampling in such
a distributed system [12, 77], but not how to deal with the duplicate data problem.
Unfortunately, it turns out that handling subset queries using sampling is exceedingly
difficult, due to the fact that the subquery in a subset query is not asking for a mean or a
sum – tasks for which sampling is particularly well-suited. Rather, the subquery is asking
whether we will ever see a match for each tuple from the outer relation. By looking at an
individual tuple, this is very hard to guess: either we have already seen a match in our sample (in which case we are assured that the inner relation has a match), or we have not, in which case we may have almost no way to guess whether we will ever see a match. For
example, imagine that employee Joe does not have a sale in a 10% sample of the SALE
relation. How can we guess whether or not he has a sale in the remaining 90%?
There is little relevant work in the statistical literature to suggest how to tackle
subset queries, because such queries ask a simultaneous question linking two populations
(database tables EMP and SALE in our example), which is an uncommon question in
traditional applications of finite population sampling. Outside of the work on sampling for the number of distinct values [17, 49, 50] and one method that requires an index on the inner relation [75], there is also little relevant work in the data management literature. We presume this is due to the difficulty of the problem; researchers have considered the difficulty of the more limited problem of sampling for distinct values in some detail [17].
Our Contributions
In this chapter, we consider the problem of developing sampling-based statistical
estimators for such queries. In the remainder of this chapter, we assume without-replacement
sampling, though our methods could easily be extended to other sampling plans. Given
the difficulty of the problem, it is perhaps not surprising that significant statistical and
mathematical machinery is required for a satisfactory solution.
Our first contribution is to develop an unbiased estimator, which is the traditional
first step when searching for a good statistical estimator. An unbiased estimator is one
that is correct on expectation; that is, if an unbiased estimator is run an infinite number
of times, then the average over all of the trials would be exactly the same as the correct
answer to the query. The reason that an unbiased estimator is the natural first choice is
that if the estimator has low variance¹, then the fact that it is correct on average implies
that it will always be very close to the correct answer.
Unfortunately, it turns out that the unbiased estimator we develop often has high
variance, which we prove analytically and demonstrate experimentally. Since it is easy to
argue that our unbiased estimator is the only unbiased estimator for a certain subclass of
subset-based queries (see the Related Work section of this chapter), it is perhaps doubtful
that a better unbiased estimator exists.
Thus, we also propose a novel, biased estimator that makes use of a statistical
technique called “superpopulation modeling”. Superpopulation modeling is an example
of a so-called Bayesian statistical technique [39]. Bayesian methods generally make use of
mild and reasonable distributional assumptions about the data in order to greatly increase
estimation accuracy, and have become very popular in statistics in the last few decades.
Using this method in the context of answering subset-based queries presents a number of
significant technical challenges whose solutions are detailed in this chapter, including:
• The definition of an appropriate generative statistical model for the problem of sampling for subset-based queries.

• The derivation of a unique Expectation-Maximization algorithm [26] to learn the model from the database samples.

• The development of algorithms for efficiently generating many new random data sets from the model, without actually having to materialize them.
Through an extensive set of experiments, we show that the resulting biased Bayesian
estimator has excellent accuracy on a wide variety of data. The biased estimator also has
the desirable property that it provides something closely related to classical confidence bounds, which can be used to give the user an idea of the accuracy of the associated estimate.
1 Variance is the statistical measure of the random variability of an estimator.
4.2 The Concurrent Estimator
With a little effort, it is not hard to imagine several possible sampling-based
estimators for subset queries. In this section, we discuss one very simple (and sometimes
unusable) sample-based estimator. This estimator has previously been studied in detail
[75], but we present it here because it forms the basis for the unbiased estimator described
in the next section.
We begin our description with an even simpler estimation problem. Given a
one-attribute relation R(A) consisting of nR records, imagine that our goal is to estimate
the sum over attribute A of all the records in R. A simple, sample-based estimator would
be as follows. We obtain a random sample R′ of size nR′ of all the records of R, compute
total = Σ_{r∈R′} r.A, and then scale up total to output total × nR/nR′ as the estimate for
the final sum. Not only is this estimator extremely simple to understand, but it is also
unbiased, consistent, and its variance reduces monotonically with increasing sample size.
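As a quick sketch (Python; the toy relation and the sample size below are ours), the estimator is just a sum and a scale-up:

```python
import random

def estimate_sum(R, n_sample, rng=random):
    """Estimate the sum over attribute A of relation R from a
    without-replacement sample: total the sampled values, then
    scale up by nR / nR'."""
    R_prime = rng.sample(R, n_sample)
    return sum(R_prime) * len(R) / n_sample

R = list(range(1, 1001))   # toy one-attribute relation R(A); the true sum is 500500
est = estimate_sum(R, 100, random.Random(42))
```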
We can extend this simple idea to define an estimator for the NOT EXISTS query
considered in the introduction. We start by obtaining random samples EMP′ and SALE′ of
sizes nEMP′ and nSALE′, respectively, from the relations EMP and SALE. We then evaluate the
NOT EXISTS query over the samples of the two relations. We compare every record in EMP′
with every record in SALE′, and if we do not find a matching record (that is, one for which
f3 evaluates to true), then we add its f1 value to the estimated total. Lastly, we scale up
the estimated total by a factor of nEMP/nEMP′ to obtain the final estimate, which we term
M :
M = (nEMP/nEMP′) × Σ_{e∈EMP′} f1(e) × (1 − min(1, cnt(e, SALE′)))

In this expression, cnt(e, SALE′) = Σ_{s∈SALE′} I(f3(e, s)), where I is the standard indicator function, returning 1 if the boolean argument evaluates to true, and 0 otherwise.
The algorithm can be slightly modified to accommodate growing samples of the
relations, and has been described in detail in [75], where it is called the “concurrent
estimator” since it samples both relations concurrently.
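The concurrent estimator is a direct transcription into code; the sketch below (Python; the toy records, sizes, and predicates are ours, with f3 the foreign-key equality check from the running example) computes M from the two samples:

```python
def concurrent_estimate(emp_sample, sale_sample, n_emp, f1, f3):
    """M: sum f1 over sampled EMP records with no match in SALE',
    then scale up by nEMP / nEMP'."""
    total = sum(f1(e) for e in emp_sample
                if not any(f3(e, s) for s in sale_sample))
    return total * n_emp / len(emp_sample)

# Toy data: EMP records are (eid, sal) pairs, SALE records are (eid,) tuples.
emp_sample = [(1, 100), (2, 200)]
sale_sample = [(1,)]
M = concurrent_estimate(emp_sample, sale_sample, n_emp=4,
                        f1=lambda e: e[1], f3=lambda e, s: e[0] == s[0])
```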
Unfortunately, on expectation, the estimator is often severely biased, meaning that
it is, on average, incorrect. The reason for this bias is fairly intuitive. The algorithm
compares a record from EMP with all records from SALE′, and if it does not find a matching
record in SALE′, it classifies the record as having no match in the entire SALE relation.
Clearly, this classification may be incorrect for certain records in EMP, since although they
might have no matching record in SALE′, it is possible that they may match with some
record from the part of SALE that was not included in the sample. As a result, M typically
overestimates the answer to the NOT EXISTS query. In fact, the bias of M is:
Bias(M) = Σ_{e∈EMP} f1(e) × (ϕ(nSALE, nSALE′, cnt(e, SALE)) − (1 − min(1, cnt(e, SALE))))
In this expression, ϕ denotes the hypergeometric probability² that a sample of size nSALE′ will contain none of the cnt(e, SALE) matching records of e.
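The no-match probability ϕ(nSALE, nSALE′, c) has a closed form in binomial coefficients; a minimal Python sketch (the function and argument names are ours):

```python
from math import comb

def phi(n_sale, n_sample, c):
    """Probability that a without-replacement sample of n_sample records,
    drawn from n_sale records of which c match, contains no matching record."""
    if n_sample > n_sale - c:
        return 0.0   # too few non-matching records for the sample to avoid a match
    return comb(n_sale - c, n_sample) / comb(n_sale, n_sample)
```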
The solution that was employed previously to counteract this bias requires an index
such as a B+-Tree on the entire SALE relation, in order to estimate and correct for
Bias(M). Unfortunately, the requirement for an index severely limits the applicability
of the method. If an index on the “join” attribute in the inner relation is not available,
the method cannot be used. In a streaming environment where it is not feasible to store
SALE in its entirety, an index is not practical. The requirement of an index also precludes
use of the concurrent estimator for a non-equality predicate in the inner subquery or
² The hypergeometric probability distribution models the distribution of the number of red balls that will be obtained in a sample without replacement of n′ balls from an urn containing r red balls and n − r non-red balls.
for non-database environments where sampling might be useful, such as in a distributed
system.
In the remainder of this chapter, we consider the development of sampling-based
estimators for this problem that require nothing but samples from the relations themselves.
Our first estimator makes use of a provably unbiased estimator B̂ias(M) for Bias(M). Taken together, M − B̂ias(M) is then an unbiased estimator for the final query answer. The second estimator we consider is quite different in character, making use of Bayesian statistical techniques.
4.3 Unbiased Estimator
4.3.1 High-Level Description
In order to develop an unbiased estimator for Bias(M), it is useful to first re-write
the formula for Bias(M) in a slightly different fashion. We subsequently refer to the set
of records in EMP that have i matches in SALE as “class i records”. Denote the sum of the
aggregate function over all records of class i by ti, so ti = Σ_{e∈EMP} f1(e) × I(cnt(e, SALE) = i)
(note that the final answer to the NOT EXISTS query is the quantity t0). Given that the
probability that a record with i matches in SALE happens to have no matches in SALE′ is
ϕ(nSALE, nSALE′ , i), we can re-write the expression for the bias of M as:
Bias(M) = Σ_{i=1}^{m} ϕ(nSALE, nSALE′, i) × ti    (4–1)
The above equation computes the bias of M since it computes the expected sum
over the aggregate attribute of all records of EMP which are incorrectly classified as class 0
records by M .
Let m be the maximum number of matching records in SALE for any record of EMP.
Equation 4–1 suggests an unbiased estimator for Bias(M) because it turns out that
it is easy to generate an unbiased estimate for tm: since no records other than those
with m matches in SALE can have m matches in SALE′, we can simply count the sum
of the aggregate function f1 over all such records in our sample, and scale up the total
accordingly. The scale-up would also be done to account for the fact that we use SALE′ and
not SALE to count matches. Once we have an estimate for tm, it is possible to estimate
tm−1. How? Note that a record with m − 1 matches in SALE′ must be a member of either class m or class m − 1. Using our unbiased estimate for tm, it is possible to guess the
total aggregate sum for those records with m − 1 matches in SALE′ that in reality have m
matches in SALE. By subtracting this from the sum for those records with m − 1 matches
in SALE′ and scaling up accordingly, we can obtain an unbiased estimate for tm−1. In a
similar fashion, each unbiased estimate for ti leads to an unbiased estimate for ti−1. By
using this recursive relationship, it is possible to guess in an unbiased fashion the value
for each ti in the expression for Bias(M). This leads to an unbiased estimator for the
Bias(M) quantity, which can be subtracted from M to provide an unbiased guess for the
query result.
4.3.2 The Unbiased Estimator In Depth

We now formalize the above ideas to develop an unbiased estimator for each tk that can be used in conjunction with Equation 4–1 to develop an unbiased estimator for Bias(M). We use the following additional notation for this section and the remainder of this chapter:
• ∆k,i is a 0/1 (non-random) variable which evaluates to 1 if the ith tuple of EMP has k matches in SALE and evaluates to 0 otherwise.

• sk is the sum of f1 over all records of EMP′ having k matching records in SALE′: sk = Σ_{i=1}^{nEMP′} I(cnt(ei, SALE′) = k) × f1(ei).

• α0 is nEMP′/nEMP, the sampling fraction of EMP.

• Yi is a random variable which governs whether or not the ith record of EMP appears in EMP′.

• h(k; nSALE, nSALE′, i) is the hypergeometric probability that out of the i interesting records in a population of size nSALE, exactly k will appear in a random sample of size nSALE′. For compactness of representation we will refer to this probability as h(k; i) in the remainder of the thesis, since our sampling fraction never changes.
We begin by noting that if we consider only those records from EMP which appear in
the sample EMP′, an unbiased estimator for tk over EMP′ can be expressed as follows:
t̂k = (1/α0) × Σ_{i=1}^{nEMP} Yi × ∆k,i × f1(ei)    (4–2)
Unfortunately, this estimator relies on being able to evaluate ∆k,i for an arbitrary record,
which is impossible without scanning the inner relation in its entirety. However, with
a little cleverness, it is possible to remove this requirement. We have seen earlier that
a record e can have k matches in the sample SALE′ provided it has i ≥ k matches in
SALE. This implies that records from all classes i where i ≥ k can contribute to sk.
The contribution of a class i record towards the expected value of sk is obtained by
simply multiplying the probability that it will have k matches in SALE′ with its aggregate
attribute value. Thus a generic expression to compute the contribution of any arbitrary
record from EMP′ towards the expected value of sk can be written as Σ_{i=k}^{m} ∆i,j × h(k; i) × f1(ej). Then, the following random variable has an expected value that is equivalent to the expected value of sk:

ŝk = Σ_{j=1}^{nEMP} Σ_{i=k}^{m} Yj × ∆i,j × h(k; i) × f1(ej)    (4–3)
The fact that E[ŝk] = E[sk] (proven in Section 4.3.3) is significant, because there is a simple algebraic relationship between the various s variables and the various t variables. Thus, we can express one set in terms of the other, and then replace each ŝk with the observable sk in order to derive an unbiased estimator for each t. The benefit of doing this is that since sk is defined as the sum of f1 over all records of EMP′ having k matching records in SALE′, it can be directly evaluated from the samples EMP′ and SALE′.
To derive the relationship between s and t, we start with an expression for ŝm−r using Equation 4–3:

ŝm−r = Σ_{j=1}^{nEMP} Σ_{i=m−r}^{m} Yj × ∆i,j × h(m−r; i) × f1(ej)

     = Σ_{i=m−r}^{m} h(m−r; i) × Σ_{j=1}^{nEMP} Yj × ∆i,j × f1(ej)

     = Σ_{i=0}^{r} h(m−r; m−r+i) × Σ_{j=1}^{nEMP} Yj × ∆m−r+i,j × f1(ej)

     = Σ_{i=0}^{r} h(m−r; m−r+i) × α0 × t̂m−r+i    (4–4)
By re-arranging the terms we get the following important recursive relationship:

t̂m−r = (ŝm−r − α0 Σ_{i=1}^{r} h(m−r; m−r+i) × t̂m−r+i) / (α0 × h(m−r; m−r))    (4–5)

For the base case we obtain:

t̂m = ŝm / (α0 × h(m; m)) = am × ŝm    (4–6)

where am = 1/(α0 × h(m; m)).
By replacing ŝm−r in the above equations with sm−r, which is readily observable from the data and has the same expected value, we can obtain a simple recursive algorithm for computing an unbiased estimator for any ti. Before presenting the recursive algorithm, we note that we can re-write Equation 4–5 for t̂i by replacing ŝ with s, changing the summation variable from i to k, and substituting i for m − r:

t̂i = (si − α0 Σ_{k=1}^{m−i} h(i; i+k) × t̂i+k) / (α0 × h(i; i))
The following pseudo-code then gives the algorithm³ for computing an unbiased estimator for any ti.

———————————————————————————-
Function GetEstTi(int i) {
1   if (i == m)
2     return sm/(α0h(m; m))
3   else {
4     returnval = si
5     for (int k = 1; k <= m − i; k++)
6       returnval -= α0h(i; i + k) × GetEstTi(i+k)
7     returnval /= α0h(i; i)
8     return returnval
9   }
}
———————————————————————————-
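The recursion can be transcribed directly into Python, with h computed from binomial coefficients; this is a sketch with entirely hypothetical sizes and per-class sums (all concrete numbers below are ours, not from any experiment):

```python
from math import comb

# Hypothetical setup: SALE has n = 100 records, SALE' has nprime = 40,
# the largest match count is m = 3, and alpha0 is the EMP sampling fraction.
n, nprime, m, alpha0 = 100, 40, 3, 0.25

def h(k, i):
    """Hypergeometric probability that k of the i matching records land in
    a without-replacement sample of size nprime (h(k; i) in the text)."""
    if k > i or nprime - k > n - i:
        return 0.0
    return comb(i, k) * comb(n - i, nprime - k) / comb(n, nprime)

# Made-up observed sums s_1..s_m of f1 over EMP' records
# having k matches in SALE'.
s = {1: 500.0, 2: 300.0, 3: 200.0}

def get_est_ti(i):
    """Recursive unbiased estimator for t_i (GetEstTi in the text)."""
    if i == m:
        return s[m] / (alpha0 * h(m, m))
    val = s[i]
    for k in range(1, m - i + 1):
        val -= alpha0 * h(i, i + k) * get_est_ti(i + k)
    return val / (alpha0 * h(i, i))

# Equation 4-7: the bias estimate, using phi(nSALE, nSALE', i) = h(0, i).
bias_hat = sum(h(0, i) * get_est_ti(i) for i in range(1, m + 1))
```

Because the recursion simply inverts Equation 4–5, plugging the estimates back in recovers each observed si exactly, which is a convenient sanity check.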
Recall from Equation 4–1 that the bias of M was expressed as a linear combination of various ti terms. Using GetEstTi to estimate each of the ti terms, we can write an estimator for the bias of M as:

B̂ias(M) = Σ_{i=1}^{m} ϕ(nSALE, nSALE′, i) × GetEstTi(i)    (4–7)
In the following two subsections, we present a formal analysis of the statistical
properties of our estimator.
³ Note the h(m; m) probability in line 2 of the GetEstTi function. If the sample size from SALE is not at least as large as m, then h(m; m) = 0 and GetEstTi is undefined. This means that our estimator is undefined if the sample is not at least as large as the largest number of matches for any record from EMP in SALE. The fact that the estimator is undefined in this case is not surprising, since it means that our estimator does not conflict with known results regarding the existence of an unbiased estimator for the distinct value problem. See the Related Work section for more details.
4.3.3 Why Is the Estimator Unbiased?
According to Equation 4–7, the estimator for the bias of M is composed of a sum of
m different estimators. Hence by the linearity of expectation, the expected value of the
estimator can be written as:
E[B̂ias(M)] = Σ_{i=1}^{m} ϕ(nSALE, nSALE′, i) × E[GetEstTi(i)]    (4–8)
The above relation suggests that in order to prove that the sample-based estimator
of Equation 4–7 is unbiased, it would suffice to prove that each of the individual GetEstTi
estimators is unbiased. We use mathematical induction to prove the correctness of the
various estimators on expectation.
As a preliminary step for the proof of unbiasedness, we first derive the expected value of the sk estimators used by GetEstTi. To do this, we introduce a zero/one random variable Hj,k that evaluates to 1 if ej has k matches in SALE′ and 0 otherwise. The expected value of this variable is simply the probability that it evaluates to 1, giving us E[Hj,k] = h(k; cnt(ej, SALE)). With this:
E[sk] = E[Σ_{j=1}^{nEMP} Σ_{i=k}^{m} Yj × ∆i,j × Hj,k × f1(ej)]

      = Σ_{j=1}^{nEMP} Σ_{i=k}^{m} E[Yj] × ∆i,j × E[Hj,k] × f1(ej)

      = α0 × Σ_{j=1}^{nEMP} Σ_{i=k}^{m} ∆i,j × h(k; i) × f1(ej)    (4–9)
We are now ready to present a formal proof of unbiasedness of GetEstTi.

Theorem 1. The expected value of GetEstTi(i) is Σ_{j=1}^{nEMP} ∆i,j × f1(ej).
Proof. Using Equation 4–5, the recursive GetEstTi estimator can be re-written as:

GetEstTi(i) = (si − α0 Σ_{k=1}^{m−i} h(i; i+k) × GetEstTi(i+k)) / (α0 × h(i; i))    (4–10)
We first prove unbiasedness for the base case, GetEstTi(m). Setting i = m in the above relation and taking the expectation:

E[GetEstTi(m)] = E[sm] / (α0 × h(m; m))

Replacing E[sm] using Equation 4–9:

E[GetEstTi(m)] = (α0 / (α0 × h(m; m))) × Σ_{j=1}^{nEMP} ∆m,j × h(m; m) × f1(ej)

               = Σ_{j=1}^{nEMP} ∆m,j × f1(ej)

which is exactly the value of tm.
By induction, we can now assume that all estimators GetEstTi(i+k) for 1 ≤ k ≤ m− i
are unbiased and we use this to prove that the estimator GetEstTi(i) is unbiased. Taking
the expectation on both sides of the above equation:
E[GetEstTi(i)] = E[(si − α0 Σ_{k=1}^{m−i} h(i; i+k) × GetEstTi(i+k)) / (α0 × h(i; i))]

By the linearity of expectation:

= (E[si] − α0 Σ_{k=1}^{m−i} h(i; i+k) × E[GetEstTi(i+k)]) / (α0 × h(i; i))

Replacing the values of E[GetEstTi(i+k)] and E[si]:

= (1/(α0 × h(i; i))) × (α0 Σ_{j=1}^{nEMP} Σ_{k=i}^{m} ∆k,j × h(i; k) × f1(ej) − α0 Σ_{k=1}^{m−i} h(i; i+k) × Σ_{j=1}^{nEMP} ∆i+k,j × f1(ej))

= (1/h(i; i)) × (Σ_{j=1}^{nEMP} Σ_{k=i}^{m} ∆k,j × h(i; k) × f1(ej) − Σ_{j=1}^{nEMP} Σ_{k=1}^{m−i} ∆i+k,j × h(i; i+k) × f1(ej))

For the second term in the parentheses, replacing i + k by p and changing the limits of summation for the inner sum accordingly:

= (1/h(i; i)) × (Σ_{j=1}^{nEMP} Σ_{k=i}^{m} ∆k,j × h(i; k) × f1(ej) − Σ_{j=1}^{nEMP} Σ_{p=i+1}^{m} ∆p,j × h(i; p) × f1(ej))    (4–11)

We notice that the limits of summation of the inner sum of the first term run from i to m. Splitting this term into two terms such that one term has limits of summation from i to i while the other has limits from i + 1 to m:

= (1/h(i; i)) × (Σ_{j=1}^{nEMP} ∆i,j × h(i; i) × f1(ej) + Σ_{j=1}^{nEMP} Σ_{k=i+1}^{m} ∆k,j × h(i; k) × f1(ej) − Σ_{j=1}^{nEMP} Σ_{p=i+1}^{m} ∆p,j × h(i; p) × f1(ej))

The last two sums cancel, leaving:

= (1/h(i; i)) × Σ_{j=1}^{nEMP} ∆i,j × h(i; i) × f1(ej)

= Σ_{j=1}^{nEMP} ∆i,j × f1(ej)    (4–12)
4.3.4 Computing the Variance of the Estimator
The unbiasedness of B̂ias(M) means that it may be useful. However, the accuracy of any estimator depends on its variance as well as its bias. We now investigate the variance of our unbiased estimator.

We have seen that B̂ias(M) is a linear combination of various GetEstTi results with ϕ(nSALE, nSALE′, i) as the coefficient of GetEstTi(i). In order to derive an expression for the variance of the estimator and gain insight about the potential values it can take, we first express the estimator as a linear combination of si terms:

B̂ias(M) = Σ_{i=1}^{m} bi × si    (4–13)
The next step in deriving the variance is being able to compute the various bi values.
Intuitively, the bi terms can be thought of as coming from the linear relationship between
the ti and si terms. The following algorithm shows how we can actually compute the bi
values.
——————————————————————————————————-
Function ComputeBis(m) {
1   // Let table[m][m] be a 2-dimensional array with all elements initialized to zero
2   for (int row = 0; row < m; row++) {
3     for (int term = 1; term <= row; term++) {
4       factor = −h(m − row; m − row + term)/h(m − row; m − row)
5       prow = row − term
6       for (int pcol = 0; pcol <= prow; pcol++)
7         table[row][pcol] += factor ∗ table[prow][pcol]
8     }
9     table[row][row] = 1/h(m − row; m − row)
10  }
11  for (int row = 0; row < m; row++)
12    for (int col = 0; col <= row; col++)
13      bm−col += α0 ∗ h(0; m − row) ∗ table[row][col]
——————————————————————————————————-
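The table-filling idea can be transcribed and checked numerically. The sketch below reuses the hypothetical sizes and sums from our earlier GetEstTi sketch (all numbers are ours); note that in this transcription the final loop divides by α0, which is the scaling under which the identity Σ bi × si = Σ ϕ(·, i) × GetEstTi(i) holds for GetEstTi as defined in Equation 4–10, and the test verifies that identity:

```python
from math import comb

# Same hypothetical setup as the GetEstTi sketch; these numbers are ours.
n, nprime, m, alpha0 = 100, 40, 3, 0.25

def h(k, i):
    """h(k; i): probability that k of i matching records land in SALE'."""
    if k > i or nprime - k > n - i:
        return 0.0
    return comb(i, k) * comb(n - i, nprime - k) / comb(n, nprime)

s = {1: 500.0, 2: 300.0, 3: 200.0}   # made-up observed per-class sums

def get_est_ti(i):
    """Recursive unbiased estimator for t_i (GetEstTi in the text)."""
    if i == m:
        return s[m] / (alpha0 * h(m, m))
    val = s[i]
    for k in range(1, m - i + 1):
        val -= alpha0 * h(i, i + k) * get_est_ti(i + k)
    return val / (alpha0 * h(i, i))

def compute_bis():
    """Coefficients b_i with bias-hat = sum_i b_i * s_i.  table[row][col]
    holds the coefficient of s_{m-col} in alpha0 * GetEstTi(m-row)."""
    table = [[0.0] * m for _ in range(m)]
    for row in range(m):
        for term in range(1, row + 1):
            factor = -h(m - row, m - row + term) / h(m - row, m - row)
            prow = row - term
            for pcol in range(prow + 1):
                table[row][pcol] += factor * table[prow][pcol]
        table[row][row] = 1.0 / h(m - row, m - row)
    b = {i: 0.0 for i in range(1, m + 1)}
    for row in range(m):
        for col in range(row + 1):
            # phi(nSALE, nSALE', m-row) = h(0, m-row); dividing by alpha0
            # undoes the alpha0 factored into the table rows.
            b[m - col] += h(0, m - row) * table[row][col] / alpha0
    return b

b = compute_bis()
bias_hat_linear = sum(b[i] * s[i] for i in range(1, m + 1))
```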
With this, the variance of this estimator can then be written as:

Var(B̂ias(M)) = Var(Σ_{i=1}^{m} bi × si)    (4–14)
Note that the si values are not independent random variables, since if an EMP′ record has i matches in SALE′, then it cannot simultaneously have j ≠ i matches in SALE′. Hence we have:

Var(B̂ias(M)) = Σ_{i=1}^{m} bi² × Var(si) + 2 Σ_{i=1}^{m} Σ_{i<j≤m} bi × bj × Cov(si, sj)    (4–15)
The Var and Cov terms can be computed by using the standard formulas:

Var(si) = E[si²] − E²[si];  Cov(si, sj) = E[si × sj] − E[si] × E[sj]    (4–16)
To evaluate Var(si) and Cov(si, sj), the quantities E[si²] and E[si × sj] can be computed as follows:

E[si × sj] = E[(Σ_{k=1}^{nEMP} Yk × f1(ek) × Hk,i) × (Σ_{r=1}^{nEMP} Yr × f1(er) × Hr,j)]

           = E[Σ_{k=1}^{nEMP} Σ_{r=1}^{nEMP} Yk × Yr × Hk,i × Hr,j × f1(ek) × f1(er)]

           = Σ_{k=1}^{nEMP} Σ_{r=1}^{nEMP} E[Yk × Yr] × E[Hk,i × Hr,j] × f1(ek) × f1(er)    (4–17)
The above expression can be evaluated using the following rules:

• if k ≠ r (that is, ek and er are two different tuples), then E[Hk,i × Hr,j] ≈ h(i; cnt(ek, SALE)) × h(j; cnt(er, SALE)), if we assume that no record s exists in SALE where f3(ek, s) = f3(er, s) = true

• if i = j (that is, we are computing E[si²]) and k = r, then E[Hk,i × Hr,j] = h(i; cnt(ek, SALE))

• if i ≠ j (that is, we are computing E[si × sj]) and k = r, then E[Hk,i × Hr,j] = 0, since a record cannot have two different numbers of matches in a sample

• if k = r, then E[Yk × Yr] = α0

• if k ≠ r, then E[Yk × Yr] ≈ α0²
4.3.5 Is This Good?
At this point, we now have a simple, unbiased estimator for the answer to a subset-based query, as well as a formal analysis of the statistical properties of the estimator. However, there are two problems related to the variance that may limit the utility of the estimator.

Figure 4-1. Sampling from a superpopulation
First, in order to evaluate the hypergeometric probabilities needed to compute or
estimate the variance, we need the value of cnt(e, SALE) for an arbitrary record e of
EMP. This information is generally unavailable during sampling, and it seems difficult
or impossible to obtain a good estimate for the appropriate probability without having
this information. This means that in practice, it will be difficult or impossible to tell
a user how accurate the resulting estimate is likely to be. We have experimented with
general-purpose methods such as the bootstrap [31] to estimate this variance, but have
found that these methods often do an extremely poor job in practice.
Second, the variance of the estimator itself may be huge. The bi coefficients are composed of sums, products and ratios of hypergeometric probabilities, which can result in huge values. Particularly worrisome is the h(i; i) value in the denominator used by GetEstTi. Such probabilities can be tiny; including such a small value in the denominator of an expression results in a very large value that may “pump up” the variance accordingly.
4.4 Developing a Biased Estimator
In light of these problems, in this section we describe a biased estimator that is
often far more accurate than the unbiased one, and also provides the user with an idea
of the estimation accuracy. Just like the unbiased estimator M − B̂ias(M) from the previous section, our biased estimator will be nothing more than a weighted sum over the observed sk values. However, the weights will be chosen so as to minimize the expected, or mean-squared, error of the resulting estimator.
To develop our biased estimator, we make use of the “superpopulation modeling”
approach from statistics [78]. One simple way to think of a superpopulation is that it
is an infinitely large set of records from which the original data set has been obtained
by random sampling. Because the superpopulation is infinite, it is specified using a
parametric distribution, which is usually referred to as the prior distribution.
Using a superpopulation method, we imagine the following two-step process is used to produce our sample:

1. Draw a large sample of size N from an imaginary infinite superpopulation, where N is the data set size.

2. Draw a sample of size n < N without replacement from the large sample of size N obtained in Step 1, where n is the desired sample size.
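The two-step process can be sketched directly (Python; the Gaussian prior and the sizes N and n below are placeholders for whatever F(Θ) is actually postulated):

```python
import random

def two_step_sample(draw_record, N, n, rng=random):
    """Superpopulation two-step process: materialize an N-record data set
    by sampling the prior, then draw an n-record without-replacement
    sample from it."""
    population = [draw_record() for _ in range(N)]   # step 1: data set ~ F(Theta)
    sample = rng.sample(population, n)               # step 2: sample of size n < N
    return population, sample

rng = random.Random(0)
pop, samp = two_step_sample(lambda: rng.gauss(0.0, 1.0), N=1000, n=50, rng=rng)
```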
By characterizing the superpopulation, it is possible to design an estimator that tends
to perform well on any data set and sample obtained using the process above.The following steps outline a road-map of our superpopulation-based approach for
obtaining a high-quality biased estimator for a subset-based query. We describe each stepin detail in the next section.
1. Postulate a superpopulation model F for our data set (F is the prior distribution;we use the notation pF to denote the probability density function (PDF) of F ). Ingeneral, F is parameterized on a parameter set Θ.
2. Infer the most likely values of the parameter set Θ from EMP′ and SALE′. Since wedo not have the complete data, but rather a random sample of the data, this is adifficult problem. We make use of an Expectation-Maximization (EM) algorithm tolearn the model parameters.
91
3. Use F (Θ) to generate d different populations P1, ..., Pd, where each Pi = (EMPi, SALEi).Note that if the data set in question is large, this may be very expensive. We showthat for our problem it is not necessary to generate the actual populations – it isenough to obtain certain sufficient statistics for each of them, which can be doneefficiently.
4. Sample from each Pi to obtain d sample pairs of the form Si = (EMP′i, SALE′i). Again,
this can be done without actually materializing the samples.
5. Let q(Pi) be the query answer over the ith data set. Construct a weighted estimator W that minimizes ∑_{i=1}^{d} (q(Pi) − W(Si))^2.
6. Use W on the original samples EMP′ and SALE′ to obtain the final estimate to the NOT EXISTS query. The MSE of this estimate can generally be assumed to be the MSE over all of the populations generated: (1/d) ∑_{i=1}^{d} (q(Pi) − W(Si))^2.
4.5 Details of Our Approach
In this section, we discuss in detail each step of our approach, outlined above, for obtaining an optimal weighted estimator for the NOT EXISTS query.
4.5.1 Choice of Model and Model Parameters
The first task is to define a generative model and an associated probability density
function for the two relations EMP and SALE respectively. While this may seem like a
daunting task (and a potentially impossible one, given all of the intricacies of modeling
real-life data), it is made easy by the fact that we only need to define a model that can
realistically reproduce those characteristics of EMP and SALE that may affect the bias or
variance of an estimator for a subset-based query. From the material in Section 4.3 of the
thesis, for a given record e from EMP, we know that these three characteristics are:
1. f1(e)
2. cnt(e, SALE), which is the number of SALE records s for which f3(e, s) is true
3. cnt(e, e′, SALE) where e′ ≠ e, which is the number of SALE records s for which f3(e, s) ∧ f3(e′, s) is true
To simplify our task, we will actually ignore the third characteristic and define a
model such that this count is always zero for any given record pair. While this may
introduce some inaccuracy into our method, it still captures a large number of real-life
situations. For example, if f3 consists of an equality check on a foreign key from SALE
into EMP (which is arguably the most common example of such a subset-based query) then
two records from EMP can never match with the same record from SALE and this count is
always zero.

Given that our model needs to be able to generate instances of EMP and SALE that realistically model the first two aspects given above, we choose the parameter set Θ = {p, µ, σ2} where:

• p is a vector of probabilities, where pi represents the probability that any arbitrary record of EMP belongs to class i

• µ is a vector of means, where µi represents the mean aggregate value of all records belonging to class i.

• σ2 is the variance of f1(e) over all records e ∈ EMP.
Then given these parameters, EMP and SALE are generated using our model as follows:
————————————————————————————–
Procedure GenData
1 For rec = 1 to nEMP do
2   Randomly generate k between 0 and m such that for any i, 0 ≤ i ≤ m, Pr[k = i] = pi
3   Generate a value for f1(e) by sampling from N(µk, σ)
4   Add the resulting e to EMP
5   For j = 1 to k do
6     Generate a record s where f3(e, s) is true
7     Add s to SALE
————————————————————————————–
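A runnable sketch of GenData (hypothetical Python; here f1 is the aggregate attribute, and each SALE record is represented simply by the identifier of the EMP record it matches, consistent with our assumption that matches are never shared):

```python
import random

def gen_data(n_emp, p, mu, sigma, seed=0):
    """Generate EMP and SALE per the model: each EMP record gets a class k
    with probability p[k], its f1 value is drawn from N(mu[k], sigma^2),
    and exactly k matching SALE records are created for it."""
    rng = random.Random(seed)
    classes = list(range(len(p)))       # classes 0..m
    emp, sale = [], []
    for eid in range(n_emp):
        k = rng.choices(classes, weights=p)[0]   # step 2
        f1 = rng.gauss(mu[k], sigma)             # step 3
        emp.append((eid, f1))                    # step 4
        for _ in range(k):                       # steps 5-7
            sale.append(eid)  # a SALE record matching only this e
    return emp, sale

emp, sale = gen_data(n_emp=1000, p=[0.5, 0.3, 0.2],
                     mu=[100.0, 105.0, 110.0], sigma=10.0)
```

The parameter values above are illustrative, not taken from the thesis.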
In step (3), N is a normally distributed random variable with the specified mean
and variance. We use a normal random variable because we are interested in sums over
classes in EMP; due to the central limit theorem (CLT), these sums will be normally
distributed for a large database. Thus, using a normal random variable does not result
in loss of generality. Also, note that in step (6), according to our earlier assumption
f3(e′, s) = false, ∀ e′ ≠ e.
In our actual model, the various µi values are not assumed to be independent; rather, we
assume a linear relationship between them to limit the degrees of freedom of the model
and thus avoid over-fitting (see "Dealing with Over-fitting" in Section 4.5.2). In
our model the various µi values are related as µi = s × i + µ0, where s and µ0 are the
only two parameters that need to be learned to determine all the µi. Also in order to
avoid overfitting, we assume that σ2 is the variance of f1(e) over all records, rather than
modeling and learning variance values of all the individual classes separately.
We now define the density function for the superpopulation model corresponding to
the GenData algorithm. For a given EMP record e, if f1(e) = v and cnt(e, SALE) = k the
probability density for e given a parameter set Θ is given by:
p(e|Θ) = p(v, k|Θ) = pk · pN(µk, σ, v) (4–18)
Where it is convenient, we will use the notation p(v, k|Θ) for values v and k and
p(e|Θ) for record e interchangeably. In this expression, pN is the PDF for the normal
distribution evaluated at v and is given by:
pN(µk, σ, v) = e^(−(v−µk)^2/(2σ^2)) / (σ√(2π)) (4–19)
Then if we consider a given data set {EMP, SALE}, the probability density of the data
set is simply the product of the densities of all the individual records:
pF(EMP, SALE) = ∏_{e∈EMP} p(e|Θ) (4–20)
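Equations 4–18 through 4–20 translate almost directly into code. The following hypothetical Python sketch (function names are ours) evaluates the per-record and data-set densities; records are given as (f1(e), cnt(e, SALE)) pairs:

```python
import math

def normal_pdf(mu, sigma, v):
    """p_N(mu, sigma, v): the normal density at v (Equation 4-19)."""
    return (math.exp(-(v - mu) ** 2 / (2 * sigma ** 2))
            / (sigma * math.sqrt(2 * math.pi)))

def record_density(v, k, p, mu, sigma):
    """p(v, k | Theta) = p_k * p_N(mu_k, sigma, v) (Equation 4-18)."""
    return p[k] * normal_pdf(mu[k], sigma, v)

def dataset_density(records, p, mu, sigma):
    """p_F(EMP, SALE): product of per-record densities (Equation 4-20).
    records is a list of (f1(e), cnt(e, SALE)) pairs, one per EMP record."""
    density = 1.0
    for v, k in records:
        density *= record_density(v, k, p, mu, sigma)
    return density
```

In practice one would work with log-densities to avoid underflow over a large EMP; the product form is kept here to mirror Equation 4–20.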
A Note on The Generality of the Model. As described, our model is extremely
general, making almost no assumptions about the data other than the fact that f1(e)
values are normally distributed. This is actually an inconsequential assumption anyway,
since we are interested in sums over f1(e) values which will be normally distributed
whatever the distribution of f1(e) due to the CLT.
On one hand, this generality can be seen as a benefit of the approach: it makes use of
very few assumptions about the data. Most significant is the lack of any sort of restriction
on the probability vector p. The result is that the number of records from SALE matching
a certain record from EMP is multinomially distributed. On the other hand, a Bayesian
argument [39] can be made that such extreme freedom is actually a poor choice, and that
in "real life", an analyst will have some sort of idea what the various pi values look like,
and a more restrictive distribution providing fewer degrees of freedom should be used.
For example, a negative binomial distribution has been assumed for the distinct value
estimation problem [90]. Such background knowledge could certainly improve the accuracy
of the method.
Though we eschew any such restrictions in the remainder of the thesis (except for an
assumption of a linear relationship among the µi values; see “Dealing with Over-fitting”
in the next section), we note that it would be very easy to incorporate such knowledge
into our method. The only change needed is that the EM algorithm described in the next
section would need to be modified to incorporate any constraints induced on the various
parameters by additional distributional assumptions.
4.5.2 Estimation of Model Parameters
Now that we have defined our superpopulation model, we need access to the
parameter set Θ that was used to create our particular instances of EMP and SALE in
order to develop an estimator that performs well for the resulting superpopulation.
However, we have several difficulties. First, we do not know Θ; since EMP and SALE are in
reality not sampled from any parametric distribution, Θ does not even exist. We could
compute a maximum-likelihood estimate (MLE) to choose a Θ that optimally fits EMP
and SALE, but then we have an even bigger problem: we do not even have access to EMP
and SALE; we only have access to samples from them. Thus, we need a way to infer Θ by
looking only at the samples EMP′ and SALE′.
It turns out that we can still make use of an MLE. Since EMP′ may be treated as a
set of independent, identically distributed samples from F , if we simply replace EMP with
EMP′ as an argument to pF, then by choosing Θ so as to maximize pF, we will still produce,
in expectation, exactly the same estimate for Θ that we would have if EMP were used
instead. Thus, we can essentially ignore the distinction between EMP and EMP′. However,
the same argument does not hold for SALE because without access to all of SALE, we
cannot compute k = cnt(e, SALE) for arbitrary e in order to apply an MLE.
To handle this, we will modify our PDF slightly to also take into account the
sampling from SALE. This can easily be done by modifying the function p(v, k|Θ). To
simplify the modification, we ignore the fact that the number of such records s from SALE′
where f3(e1, s) is true may be correlated with the number of records from SALE′ where
f3(e2, s) is true for arbitrary records e1 and e2 from EMP; that is, we assume that we are
looking for matches of a record e in its own “private” sample from SALE and that all
of these samplings are independent. With this, if f1(e) = v and cnt(e, SALE) = k and
cnt(e, SALE′) = k′ then:
p(v, k, k′|Θ) = p(v, k|Θ)h(k′; k) (4–21)
In this expression, h is the hypergeometric probability of seeing k′ matches for e in
SALE′, given that there were k matches in SALE.
(An MLE is a standard statistical estimator for unknown model parameters when a sample is available; the MLE simply chooses Θ so as to maximize the value of the PDF of the sample.)
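For reference, the hypergeometric probability h(k′; k) appearing in Equation 4–21 can be computed exactly from the sample and relation sizes. A hypothetical Python sketch (names are ours):

```python
import math

def hyper_prob(k_prime, k, n_sale, n_sale_sampled):
    """h(k'; k): hypergeometric probability of observing exactly k' of the
    k matching SALE records when n_sale_sampled of the n_sale SALE records
    are sampled without replacement."""
    if (k_prime < 0 or k_prime > k or k_prime > n_sale_sampled
            or n_sale_sampled - k_prime > n_sale - k):
        return 0.0
    return (math.comb(k, k_prime)
            * math.comb(n_sale - k, n_sale_sampled - k_prime)
            / math.comb(n_sale, n_sale_sampled))
```

For example, with 3 matches among 10 SALE records and a sample of size 4, the chance of seeing exactly 2 of the matches is C(3,2)·C(7,2)/C(10,4) = 0.3.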
Since the portion of SALE that is not in SALE′ is hidden to us due to the sampling, we
do not know k and we have a classic example of an MLE problem with hidden or missing
data. There are several methods from the literature for solving such a problem; the one
that we employ is the Expectation Maximization (EM) algorithm.
The EM algorithm [26] is a general method of finding the maximum-likelihood
estimate of the parameters of an underlying distribution from a given data set when the
data is incomplete or has missing values. EM starts out with an initial assignment of
values for the unknown parameters and at each step, recomputes new values for each of
the parameters via a set of update rules. EM continues this process until the likelihood
stops increasing any further. Since cnt(e, SALE) is unknown, the likelihood function is:

L(Θ | {EMP′, SALE′}) = ∏_{e∈EMP′} ∑_{k=1}^{m} p(f1(e), k, cnt(e, SALE′) | Θ)
We present the derivation of our EM implementation in the Appendix, while here we
give only the algorithm. In this algorithm, p(i|Θ, e) denotes the posterior probability for
record e belonging to class i. This is the probability that given the current set of values for
Θ, record e belongs to class i.
———————————————————————————
Procedure EM(Θ)
1 Initialize all parameters of Θ; Lprev = −9999
2 while (true) {
3   Compute L(Θ) from the sample and assign it to Lcurr
4   if ((Lcurr − Lprev)/Lprev < 0.01) break
5   Compute posterior probabilities for each e ∈ EMP′ and each k
6   Recompute all parameters of Θ by using the following update rules:
7     µi = [ ∑_{e∈EMP′} p(i|Θ′, e) × f1(e) ] / [ ∑_{e∈EMP′} p(i|Θ′, e) ]
8     σ^2 = [ ∑_{e∈EMP′} ∑_{j=0}^{m} p(j|Θ′, e) (f1(e) − µj)^2 ] / [ ∑_{e∈EMP′} ∑_{j=0}^{m} p(j|Θ′, e) ]
9     pi = [ ∑_{e∈EMP′} p(i|Θ′, e) ] / [ ∑_{e∈EMP′} ∑_{j=0}^{m} p(j|Θ′, e) ]
10  Lprev = Lcurr
11 }
12 Return values in Θ as the final parameters of the model
————————————————————————————
Every iteration of the EM algorithm performs an expectation (E) step and a
maximization (M) step. In our algorithm, the E-step is contained in step (5), where
for each record e of EMP′, a set of probability values p(i|Θ, e), 0 ≤ i ≤ m, is computed
under the current model parameters, Θ. The posterior probability p(i|Θ, e) is computed
as described in the Appendix. Intuitively, the posterior probability for record e and class
i is a ratio of two quantities: (1) the probability that e belongs to class i according to the
density function of the model, and (2) the sum of probabilities that it belongs to each of
the classes 0 through m, also according to the model density function.
The M-step (which corresponds to steps (6)–(9) of our algorithm) updates the
parameters of our model in such a way that the expected value of the likelihood function
associated with the model is maximized with respect to the posterior probabilities. Details
of how we obtain the various update rules are explained in the Appendix.
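To make the E- and M-steps concrete, here is a hypothetical, deliberately simplified Python sketch that fits only the mixture part of the model (class probabilities pi, class means µi, and a single shared variance σ2) to observed f1 values; the hypergeometric treatment of the hidden counts cnt(e, SALE) and the linear constraint µi = s × i + µ0 are omitted for brevity:

```python
import math

def em_mixture(values, m, iters=200, tol=1e-3):
    """Simplified EM matching update rules (7)-(9): fit p_0..p_m,
    mu_0..mu_m, and one shared variance to the observed f1 values.
    (The hidden-count machinery of the full method is not modeled.)"""
    n = len(values)
    lo, hi = min(values), max(values)
    mu = [lo + (hi - lo) * i / m for i in range(m + 1)] if m > 0 else [(lo + hi) / 2]
    p = [1.0 / (m + 1)] * (m + 1)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n or 1.0
    prev_ll = None
    for _ in range(iters):
        # E-step (step 5): posterior p(i | Theta, e) for each record.
        resp, ll = [], 0.0
        for v in values:
            dens = [p[i] * math.exp(-(v - mu[i]) ** 2 / (2 * var))
                    / math.sqrt(2 * math.pi * var) for i in range(m + 1)]
            tot = sum(dens)
            ll += math.log(tot)
            resp.append([d / tot for d in dens])
        if prev_ll is not None and abs(ll - prev_ll) < tol:
            break  # likelihood has stopped increasing
        prev_ll = ll
        # M-step (steps 6-9): re-estimate mu_i, p_i, and shared sigma^2.
        for i in range(m + 1):
            w = sum(r[i] for r in resp) or 1e-12
            mu[i] = sum(r[i] * v for r, v in zip(resp, values)) / w
            p[i] = w / n
        var = sum(r[i] * (v - mu[i]) ** 2
                  for r, v in zip(resp, values) for i in range(m + 1)) / n
    return p, mu, var
```

This is a sketch under stated assumptions, not the thesis implementation; it is only meant to show how rules (7)–(9) alternate with the posterior computation.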
The observant reader may note that the EM algorithm assumes that the parameter
m is known before the process has begun. This is potentially a problem, since m will
typically be unknown. Fortunately, knowing the exact value for m is not vital, particularly
if m is overestimated (in which case the class probabilities associated with the class i
records for large i will end up being zero, if the EM algorithm functions correctly). As a
rough estimate for m, we take the record from EMP′ with the largest number of matches in
SALE′ and scale up the number of matches by nSALE/nSALE′ . Particularly if several records
with m matches in SALE are expected to appear in EMP′, this estimate for m will be quite
conservative.
Dealing with Over-fitting. The superpopulation model has a total of 2(m + 1) + 1
parameters within Θ. Since the number of degrees of freedom of the model is so large, the
model has a tremendous leeway when choosing parameter values. This potentially leads to
a well-known drawback of learned models – over-fitting the training data, where the model is tailored to be excessively well-suited to the training data at the cost of generality.

Several techniques have been proposed to address the over-fitting problem [30]. We use the following two methods in our approach:
• Limiting the number of degrees of freedom of the model.
• Using multiple models and combining them to develop our final estimator.
To use the first technique, we restrict our generative model so that the mean
aggregate value of all records of any class i is not independent of the mean value of
other classes. Rather, we use a simple linear regression model µi = s × i + µ0, where s and µ0 are
the two parameters of the linear regression model and can be learned easily. This means
that once we have learned the two parameters s and µ0, the µi values for all other classes
can be determined directly by the above relation and will not be learned separately. As
mentioned previously, it would also be possible to place distributional constraints upon the
vector p in order to reduce the degrees of freedom even more, though we choose not to do
this in our implementation.
Our second strategy to tackle the over-fitting problem is to learn multiple models
rather than working with a single model. These models differ from each other only in that
they are learned using our EM algorithm with different initial random settings for their
parameters. When generating populations from the models learned via EM (as described
in the next subsection), we then rotate through the various models in round-robin fashion.
Are we not done yet? Once the model has been learned, a simple estimator is
immediately available to us: we could return p0 × µ0 × nEMP, since this will be the expected
query result over an arbitrary database sampled from the model. This is equivalent to
first determining a class of databases that the database in question has been randomly
selected from, and then returning the average query result over all of those databases. If
multiple models are learned in order to alleviate the over-fitting problem, then we can use
the average of this expression over all of those models.
While this estimator is certainly reasonable, the concern is twofold. First, if there is
high variability in the possible populations that could be produced by the model or models
(corresponding to uncertainty in the correctness of the model), then simply taking the
average over all of these populations can be expected to result in an answer with high variance.
A related concern is that this is not very robust to errors in the model-learning process –
an error in the model will lead directly to an error in the estimate.
Thus, in the next few subsections we detail a process that attempts to simultaneously
perform well on any and all of the databases that could be sampled from the model,
rather than simply returning the mean answer over all potential databases. The method
samples a large number of ((EMPi, SALEi), (EMP′i, SALE′i)) combinations from the model,
and then attempts to construct an estimator that can accurately infer the query answer
over precisely the (EMPi, SALEi) that has been sampled by looking at (EMP′i, SALE′i).
4.5.3 Generating Populations From the Model
Once we know the parameter set Θ, the next task is to generate many instances
of Pi = (EMPi, SALEi) and Si = (EMP′i, SALE′i) in order to optimize our biased estimator
over these population-sample pairs. The difficulty is that in practice, EMP and SALE can
have billions of records in them. Hence, it would not be feasible to actually materialize
each (Pi, Si) pair. The good news is that for our problem it is not necessary to actually
generate the populations if we can generate statistics associated with the pair that are
sufficient to optimize our biased estimator.
Computing sufficient statistics for EMP and SALE. For each Pi, we must generate the following statistics:
• The number of records of EMP belonging to each class (we use ni to denote this).
• The mean over f1 for all records belonging to each class.
The first set of statistics is easy to generate if we notice that the number of records
belonging to each class follows a multinomial distribution with nEMP trials, where each
multinomial bucket probability is given by the vector p. A single, vector-valued sample
from an appropriately-distributed multinomial distribution can then give us each ni.
The next set of statistics can be computed by relying on the CLT. According to
the generative model, the aggregate attribute value of a superpopulation record belonging
to class i has mean µi and variance σ2. Since the population is an i.i.d.
random sample from the superpopulation, the mean aggregate value of records belonging
to class i follows a normal distribution with mean µi and variance σ2/ni.
Thus, ti, the sum over the aggregate attribute of all records of class i, can be
obtained by drawing a trial from the normal distribution N(µi, σ2/ni) and multiplying it
by ni.
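A hypothetical Python sketch of these two population statistics (names and parameter values are ours; random.choices supplies the multinomial draw, and a single normal trial supplies each class sum, so no records are materialized):

```python
import math
import random

def population_stats(n_emp, p, mu, sigma, seed=0):
    """Sufficient statistics for one generated population P_i:
    counts[i] - number of EMP records in class i (one multinomial draw),
    sums[i]   - total f1 over class i, drawn as counts[i] * N(mu[i],
                sigma^2 / counts[i]) via the CLT approximation."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for c in rng.choices(range(len(p)), weights=p, k=n_emp):
        counts[c] += 1
    sums = [n_i * rng.gauss(mu[i], sigma / math.sqrt(n_i)) if n_i > 0 else 0.0
            for i, n_i in enumerate(counts)]
    return counts, sums

counts, sums = population_stats(10_000, [0.5, 0.3, 0.2],
                                [100.0, 105.0, 110.0], 10.0)
```

A dedicated multinomial sampler would be faster than classifying nEMP pseudo-records one at a time; the loop is kept for clarity.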
Computing sufficient statistics for EMP′ and SALE′. For each Si, we must generate the
following statistics:
• The number of sampled records from each class of EMP; this is denoted by n′i.
• The number of sampled records from the ith class of EMP that have j matches in SALE′, for each i and j. We denote this by n′i,j.
• The mean over f1 corresponding to each n′i,j.
The first set of statistics can be produced by repeatedly sampling from a hypergeometric
distribution. To compute n′0, we sample from a hypergeometric distribution with
parameters nEMP, nEMP′ , and n0 (these parameters are the population size, the sample
size, and the size of the subpopulation of interest, respectively). To compute n′1, we sample
from a hypergeometric distribution with parameters nEMP − n0, n′EMP − n′0, and n1. n′2 is a
sample from a hypergeometric distribution with parameters nEMP−(n0+n1), n′EMP−(n′0+n′1),
and n2. This process is repeated for each n′i.
Once each n′i is generated, each n′i,j is generated. In order to speed the process of
generating each n′i,j, we can assume that the expected value of each n′i,j is small compared
to nSALE, so that there is little difference between sampling with and without replacement.
Thus, we can assume that each n′i,j; j ≤ i is binomially distributed which in turn means
that all n′i,j are multinomially distributed, where the probability that any class i record
will have j matches in the sample SALE′ is a hypergeometric probability denoted by h(j; i).
A single trial over a multinomial random variable having probabilities of h(j; i) for j from
0 to i will then give us each n′i,j for a given i.
Finally, again using a CLT-based argument, the mean over f1 for all of the records
corresponding to each n′i,j is generated by a single trial over a normal random variable
N(µi, σ2/n′i,j).
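A sketch of the sequential hypergeometric scheme for the n′i (hypothetical Python; the multinomial step for the n′i,j would follow the same pattern, and a library hypergeometric sampler could replace the one-draw-at-a-time simulation used here):

```python
import random

def hypergeom_draw(rng, pop_size, sample_size, n_success):
    """One draw from Hypergeometric(pop_size, sample_size, n_success),
    simulated by sampling without replacement one record at a time."""
    hits, remaining_success, remaining = 0, n_success, pop_size
    for _ in range(sample_size):
        if rng.random() * remaining < remaining_success:
            hits += 1
            remaining_success -= 1
        remaining -= 1
    return hits

def sample_class_counts(counts, n_emp_sample, seed=0):
    """Generate n'_i: how many of the n_emp_sample sampled EMP records
    fall in each class. Class 0 is drawn first from the full population,
    class 1 from what remains, and so on, as described in the text."""
    rng = random.Random(seed)
    remaining_pop, remaining_sample = sum(counts), n_emp_sample
    out = []
    for n_i in counts:
        n_prime = hypergeom_draw(rng, remaining_pop, remaining_sample, n_i)
        out.append(n_prime)
        remaining_pop -= n_i
        remaining_sample -= n_prime
    return out

n_prime = sample_class_counts([5000, 3000, 2000], 1000)
```

By construction the last class absorbs whatever sample budget remains, so the n′i always total the sample size.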
4.5.4 Constructing the Estimator
We have seen in the previous subsection that once a model has been learned, it can be used to generate statistics for any number of population/sample pairs.
Recall from Section 4.4 that the jth population generated and the sample from that
population are Pj = (EMPj, SALEj) and Sj = (EMP′j, SALE′j), respectively. Let sij be the
value of si computed over Sj; that is, it is the sum for f1 over all tuples in EMP′j that have i
matches in SALE′j. Our goal in all of this is to construct a weighted estimator:
W(Sj) = ∑_{i=0}^{m} wi sij (4–22)
that minimizes:
SSE = ∑_j (W(Sj) − q(Pj))^2 = ∑_j ( ∑_{i=0}^{m} wi sij − q(Pj) )^2 (4–23)
where q(Pj) is the answer to the NOT EXISTS query over the jth population.
W should be optimized by choosing each wi so as to minimize the SSE (sum-squared-error)
given above. In order to compute these weights we evaluate the partial derivative of the
SSE w.r.t each of the unknown weights. For example, by taking the partial derivative of
the SSE w.r.t w0, we obtain:
∂SSE/∂w0 = ∑_j 2 ( ∑_{i=0}^{m} wi sij − q(Pj) ) s0j
If we differentiate with respect to each wi and set the resulting m + 1 expressions to
zero, we obtain m + 1 linear equations in the m + 1 unknown weights. These equations can
be represented in the following matrix form:
⎡ ∑_j s0j^2      ∑_j s0j s1j   · · ·   ∑_j s0j smj ⎤ ⎡ w0 ⎤   ⎡ ∑_j s0j q(Pj) ⎤
⎢ ∑_j s0j s1j    ∑_j s1j^2     · · ·   ∑_j s1j smj ⎥ ⎢ w1 ⎥   ⎢ ∑_j s1j q(Pj) ⎥
⎢      ⋮              ⋮                     ⋮      ⎥ ⎢ ⋮  ⎥ = ⎢       ⋮       ⎥
⎣ ∑_j s0j smj    ∑_j s1j smj   · · ·   ∑_j smj^2   ⎦ ⎣ wm ⎦   ⎣ ∑_j smj q(Pj) ⎦
The optimal weights can then be easily obtained by using a linear equation solver to
solve the above system of equations.
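As an illustration (not the thesis implementation), the normal equations can be assembled and solved in a few lines of Python, using plain Gaussian elimination in place of a library linear-equation solver; S[j][i] holds s_ij and q[j] holds q(P_j):

```python
def optimal_weights(S, q):
    """Solve the (m+1)x(m+1) normal equations for the weights w_i that
    minimize SSE = sum_j (sum_i w_i * s_ij - q(P_j))^2."""
    m1 = len(S[0])
    # Build A = S^T S and b = S^T q.
    A = [[sum(row[a] * row[b] for row in S) for b in range(m1)]
         for a in range(m1)]
    b = [sum(row[a] * qj for row, qj in zip(S, q)) for a in range(m1)]
    # Gaussian elimination with partial pivoting.
    for col in range(m1):
        piv = max(range(col, m1), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m1):
            f = A[r][col] / A[col][col]
            for c in range(col, m1):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back-substitution.
    w = [0.0] * m1
    for r in range(m1 - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, m1))) / A[r][r]
    return w
```

Since m is small relative to the number of generated populations d, the cost of this solve is negligible next to generating the sufficient statistics.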
Once W has been derived, it is then applied to the original samples EMP′ and SALE′
in order to estimate the answer to the query. By dividing the SSE obtained via the
minimization problem described above by the number of data sets generated, we can also
obtain a reasonable estimate of the mean-squared error of W .
4.6 Experiments
In this section we describe results of the experiments we performed to test our
estimators. Our experiments are designed to test the accuracy of our estimators and the
running time of the biased estimator, over a wide variety of data sets.
4.6.1 Experimental Setup
In this subsection, we describe the properties of the various data sets we use to test
our estimators. We generate 66 synthetic data sets and use three real-life data sets for
conducting our experiments. All our experiments were performed on a Linux workstation
having 1 GB of RAM and a 2.4 GHz clock speed and all software was implemented using
the C++ programming language.
4.6.1.1 Synthetic data sets
In each data set, we have two relations, EMP(EID, AGE, SAL) and SALE(SALEID,
EID, AMOUNT) of size 10 million and 50 million records, respectively. We evaluate the
following SQL query over each data set:
SELECT SUM (e.SAL)
FROM EMP as e
WHERE NOT EXISTS
(SELECT * FROM SALE AS s
WHERE s.EID = e.EID)
Two important data set properties that affect the query result are:
1. The distribution of the number of matching records in SALE for each record of EMP
2. The distribution of e.SAL values of all records of EMP
Based on these two important properties, we synthetically generated data sets so
that the distribution of the number of matching records for all EMP records follows a
discretized Gamma distribution. The Gamma distribution was chosen because it produces
positive numbers and is very flexible, allowing a long tail to the right. This means that it
is possible to create data sets for which most records in EMP have very few matches, but
some have a large number. We chose values of 1, 2 and 5 for the Gamma distribution’s
shift parameter and values of 0.5 and 1 for the scale parameter. Based on these different
values for the shift and scale parameters, we obtained six possible data sets: 1: (shift =
1, scale = 0.5); 2: (shift = 2, scale = 0.5); 3: (shift = 5, scale = 0.5); 4: (shift = 1, scale
= 1); 5: (shift = 2, scale = 1); and 6: (shift = 5, scale = 1). For these six data sets, the
fraction of EMP records having no matches in SALE (and thus contributing to the query
answer) were .86, .59, .052, .63, .27, and .0037, respectively. A plot of the probability that
an arbitrary tuple from EMP has m matches in SALE for each of the six data sets is given as
Figure 4-2. This shows the wide variety of data set characteristics we tested.
Figure 4-2. Six distributions used to generate for each e in EMP the number of records s in SALE for which f3(e, s) evaluates to true.
We also varied the distribution of the e.SAL values such that the distribution can be
one of the following:
• a. Normally distributed with a mean of 100 and standard deviation of 10
• b. Normally distributed with a mean of 100 and standard deviation of 200, with only
the absolute values considered
• c. Zipfian distributed with a skew parameter of 0.5
• d. Zipfian distributed with a skew parameter of 1.0
We doubled the number of data sets by further providing a linear positive correlation
or no correlation between the e.SAL value of a record and the number of matching
records it has in SALE. We thus obtained 48 different data sets considering all possible
combinations of the distribution of matching records and the distribution of e.SAL values.
We also tested our estimator on 18 additional synthetic data sets that were
deliberately designed to have properties that violate the assumptions of the superpopulation
model of our biased estimator, so as to see how robust this estimator is to inaccuracies in
the parametric model. From Section 4.5.1, the three specific assumptions we made for our
superpopulation model were:
1. cnt(e, e′, SALE) = 0 when e′ ≠ e. Thus, the number of SALE records s for which
f3(e, s) ∧ f3(e′, s) is true is zero. In other words, different records from EMP do not
“share” matching records in SALE.
2. There exists a linear relationship between the mean aggregate values of the different
classes of EMP records, given by µi = s × i + µ0, where s is the slope of the straight line
connecting the various µi values.
3. The variance of the aggregate attribute values of records of any class is approximately
equal to the single model parameter σ2.
For each of these three cases, we generate six different data sets using the six different
sets of gamma parameters described earlier. Thus we obtain 18 more data sets where the
first six sets violate assumption 1, the next six sets violate assumption 2 and the last six
sets violate assumption 3. For each of these 18 data sets, the aggregate attribute value is
normally distributed with a mean of 100 and standard deviation of 200 except for the last
six sets where different values of standard deviation are chosen for records from different
classes.
In order to violate assumption 1, we no longer assume a primary key-foreign key
relationship between EMP and SALE. To generate a data set violating this assumption, a set
s1 of records of size 100 from EMP is selected. Let max be the largest number of matches
in SALE for any record from s1. Then an associated set s2 of max records is added to SALE
such that all records in s1 have their matching records in s2. Assumption 2 was violated
using µi = s × j + µ0, where j ≠ i (in fact, the j value for a given i is randomly selected
from 1...m). Assumption 3 was violated by assuming different values for the variance
of records from different classes. We randomly chose these values from the range (100,
15000).
4.6.1.2 Real-life data sets
The three real-life data sets we use in our experiments are from the Internet Movie
Database (IMDB) [1], the Synoptic Cloud Reports [3] obtained from the Oak Ridge
National Laboratory, and the network connections data sets from the 1999 KDDCup
event.
The IMDB database contains several relations with information about movies, actors
and production studios. For our experiments, we use the two relations MovieBusiness
and MovieGoofs. MovieBusiness contains information about box-office revenues of movies
while MovieGoofs contains records that describe unintended mistakes or “goofs” in various
movies. The following schema shows the relevant attributes of the two relations for the
queries we tested in our experiments.
MovieBusiness (MovieName, NumAdmissions)
MovieGoofs (GoofId, MovieName)
MovieName is the primary key of MovieBusiness and a foreign key of MovieGoofs.
We tested the following three SQL queries on the two relations of the IMDB dataset.
Q1: SELECT SUM (b.NumAdmissions)
FROM MovieBusiness as b
WHERE NOT EXISTS
(SELECT * FROM MovieGoofs AS g
WHERE g.MovieName = b.MovieName)
Q2: SELECT SUM (b.NumAdmissions)
FROM MovieBusiness as b
WHERE NOT EXISTS
(SELECT * FROM MovieBusiness AS b2
WHERE id(b) < id(b2)
AND b.NumAdmissions = b2.NumAdmissions)
Q3: SELECT COUNT (*)
FROM MovieBusiness as b
WHERE NOT EXISTS
(SELECT * FROM MovieGoofs AS g
WHERE g.MovieName = b.MovieName)
The second real-life data set we use is the Synoptic Cloud Report (SCR) data set. It
contains weather reports for a 10-year period obtained from measuring stations on land
as well as water. We use weather reports for the months of December 1981 and November
1991 from measuring stations on land. Specifically, the two relations and their relevant
schema used in our experiments are:
DEC81 (Id, Latitude, CloudAmount)
NOV91 (Id, Latitude, CloudAmount)
Here, Id is the key in both the relations. We tested the following two SQL queries on
the relations DEC81 and NOV91.
Q4: SELECT SUM (D81.CloudAmount)
FROM DEC81 as D81
WHERE NOT EXISTS
(SELECT * FROM NOV91 AS N91
WHERE N91.Latitude = D81.Latitude)
Q5: SELECT COUNT (*)
FROM DEC81 as D81
WHERE NOT EXISTS
(SELECT * FROM NOV91 AS N91
WHERE N91.Latitude = D81.Latitude)
The KDDCup data set contains information about various network connections that
can potentially be used for intrusion detection. This data set has 42 integer, real-valued,
and categorical attributes. We tested our estimator on this data set by estimating the
total number of source bytes of connections that were “significantly different” from
the rest of the network connections. That is, we summed the total number of source
bytes created by outlier connections. Our definition of “significantly different” records is
those records whose distance from all other records in the data set is greater than some
predefined threshold. For our experiments, we use a simple distance function that uses
Euclidean distance for numerical attributes and a 0/1 distance for categorical attributes.
We execute the following query on the KDDCup data set for our experiments.
SELECT SUM (kc1.SourceBytes)
FROM KDDCup as kc1
WHERE NOT EXISTS
(SELECT * FROM KDDCup AS kc2
WHERE id(kc1) <> id(kc2) AND d(kc1, kc2) < dthreshold)
By choosing different values for dthreshold, we can control the selectivity of the above
query. For our experiments, we define Q6, Q7 and Q8 as three variants of the above query
with different values of dthreshold so that Q6 has a selectivity of around 24%, Q7 has a
selectivity of 1.75% while Q8 has a selectivity of 0.4%.
4.6.2 Results
We ran our experiments on 1%, 5% and 10% random samples of the data sets (both
relations in each data set were sampled independently without replacement at the same
rate). Both the biased estimator and the unbiased estimator were run ten times on each
of the test cases. For comparison we also analytically compute the standard error for the
concurrent estimator described in Section 4.2. Results from the first 48 synthetic data sets
are given in Tables 4-1 and 4-2, while results from the next 18 synthetic data sets (which
specifically violate the model assumptions) are presented in Table 4-3. Real-life data set
results are shown in Table 4-4. For each of the test cases, we give the square root of the
observed mean-squared error (that is, the standard error) for the biased, unbiased, and
concurrent estimators. Because having an absolute value for the standard error lacks any
sort of scale and thus would not be informative, we give the standard error as a percentage
of the total aggregate value of all records in the database. For example, for the synthetic
data sets, we give the standard error as a percentage of the answer to the query:
SELECT SUM (e.SAL)
FROM EMP as e
Thus, if the estimation method simply returned zero every time, its error would vary
between 0% and 100%, depending on the selectivity of the subquery. If the method is
also able to estimate with high accuracy which of the constituent records should not be
counted in the aggregate total, then the error can be reduced to an arbitrarily small level.
Although our error metric is different from the relative error (which takes the ratio
of the absolute error to the true query answer), the relative error can be readily computed
from the reported error value by dividing it by the ratio of the query answer to the total
aggregate value of all records in the outer relation. For all the eight cases of
data set 1, the query answer is approximately 86% of the total answer. Hence, the relative
error is about 1.1 times the error reported in Table 4-1. Similarly for the rest of the data
sets, the factors are: data set 2: 1.7; data set 3: 19; data set 4: 1.5; data set 5: 3.7 and
data set 6: 270. For the IMDB and SCR data sets, the factors are between 1 and 5.5 while
for the KDDCup the factors range from 2 (for the high selectivity query) to 40 (for the
very low selectivity query).
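The conversion described above is simple arithmetic; as a quick sanity check, here is a tiny sketch (the function name and values are our own illustration, not from the experimental code):

```python
# Convert a reported error (expressed as a percentage of the total aggregate
# value of all records) into a relative error (a percentage of the true
# query answer). The conversion factor is simply total / answer.

def relative_error(reported_error_pct, answer_fraction):
    """answer_fraction: the query answer as a fraction of the total aggregate value."""
    return reported_error_pct / answer_fraction

# Data set 1: the answer is ~86% of the total, so a reported error of 7.39%
# corresponds to a relative error of 7.39 / 0.86, roughly 8.6%.
print(round(relative_error(7.39, 0.86), 2))  # 8.59
```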
When we tested the queries, we also recorded the number of times (out of ten) that
the answer given by the biased estimator was within ±2 estimated standard errors of the
real answer to the query, and found that for almost all the test cases this number was ten,
while for a couple of test cases it was nine out of ten.
Finally, we measured the computation time required by the biased estimator to
initially learn the generative model, then compute weights for the various components of
the estimator, and to finally provide an estimate of the query result. We observed that
for the synthetic data sets (which consist of 10 million and 50 million records in the two
relations) the maximum observed running time of the biased estimator was between 3 and 4
seconds for a 10% sample from each. The vast majority of this time is spent in the EM
learning algorithm, which requires O(m × |EMP′| × i) time, where m is the maximum
possible number of matches for a record in EMP with records in SALE, and i is the number
of iterations required for EM convergence. We speed up our implementation by sub-sampling
EMP′ and using the subsample in the EM algorithm rather than using EMP′ directly. The
justification for this is that the EM can be quite expensive with a large EMP′, and the
accuracy of the modeling step is much more closely related to the size of SALE′. We use a
subsample of size 500 in our experiments.
In comparison, computation for the unbiased estimator is almost instantaneous,
requiring a small fraction of a second. In our test data, the most costly operation for
the unbiased estimator is running the “join” between EMP′ and SALE′; that is, searching
for matches for each record from EMP′ in SALE′. Given summary statistics describing this
matching, the core GetEstTi routine itself can be implemented as a dynamic programming
algorithm that takes time O(m′2), where m′ is the maximum number of matches for any
record from EMP′ in SALE′.
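The hypergeometric probabilities that GetEstTi relies on can be computed exactly with integer arithmetic; the following minimal sketch shows only the probability computation (it is our own illustration, not the thesis's GetEstTi routine):

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """Probability of seeing exactly k 'special' items in a without-replacement
    sample of n items from a population of N items, K of which are special."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# e.g., 3 draws from 10 items of which 4 are special:
print(hypergeom_pmf(1, N=10, K=4, n=3))  # C(4,1)*C(6,2)/C(10,3) = 0.5
```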
4.6.3 Discussion
One of the most obvious results from Table 4-1 is that the unbiased estimator has
uniformly small error only on those eight tests performed using synthetic data set 1,
where the number of matches for each record e ∈ EMP is generated using a Gamma
distribution with parameters (shift = 1, scale = 0.5). In this particular data set, only a
very small number of the records are excluded by the NOT EXISTS clause since 86% of the
records in EMP do not have a match in SALE. Furthermore, only a very small number of the
records have a large number of matches. Both of these characteristics tend to stabilize the
variance of the unbiased estimator, making it a fine choice.
For all the other data sets, the unbiased estimator does very poorly for most of the
cases. For synthetic data, the estimator’s worst performance is for data set 6, in which
less than one percent of the records are accepted by the NOT EXISTS clause and several
records from EMP have more than 15 matching records in SALE. In this case, the unbiased
estimator is unusable, and the results were particularly poor when there was correlation
between the number of matches and the aggregate value that is summed. For example, in the
Data set type                      1% Sample error          5% Sample error          10% Sample error
Gamma  Correlated?  Val. Dist.   U(%)    C(%)    B(%)     U(%)    C(%)    B(%)     U(%)    C(%)    B(%)
1      No           a.           7.39   13.32   38.30     2.39   12.62    3.88     1.09   11.89    1.46
1      No           b.           6.69   13.45   37.87     3.04   12.63    5.92     1.08   11.93    1.38
1      No           c.           6.89   12.92   22.59     5.23   12.04    8.18     3.79   11.23    7.09
1      No           d.          16.65    6.32   68.37    15.94    6.19   29.34     9.56    5.94   19.72
1      Yes          a.          11.90   20.90   34.50     4.59   19.94    2.26     3.15   18.68    1.42
1      Yes          b.          13.50   17.80   36.30     4.07   16.37    5.12     1.75   15.50    2.18
1      Yes          c.           7.70   15.06   21.14     5.69   14.06    7.84     3.98   13.13    6.21
1      Yes          d.          18.05    1.04   66.94    16.26    0.52   25.35    12.98    0.41   15.33
2      No           a.          11.79   40.12    6.09     8.10   37.98    3.55     2.43   35.44    3.37
2      No           b.          13.65   39.48    5.00     6.82   37.86    4.83     2.54   35.51    4.03
2      No           c.         179.87   39.20   14.75     6.35   37.00    8.34     4.54   34.44    7.12
2      No           d.          31.60   20.45   43.43    10.24   19.26   12.88     9.99   17.08    6.25
2      Yes          a.          24.70   65.60   21.39    19.83   62.00   18.45     4.78   57.51   13.70
2      Yes          b.          19.34   54.27   12.99    12.61   51.19   12.28     3.46   47.72    7.48
2      Yes          c.         220.14   46.60   23.01    12.19   44.01   12.01     5.10   40.88    5.10
2      Yes          d.          52.61   39.08   39.45    19.62   36.75    5.32     9.20   33.19    2.25
3      No           a.         234.60   92.75   18.61    59.67   84.91   12.22    33.00   76.00    6.28
3      No           b.         315.97   93.29   19.42    70.32   84.68   11.68    34.78   76.05    5.84
3      No           c.         188.17   91.50   20.53    46.14   84.01   18.50    24.92   75.07   15.80
3      No           d.         139.27   72.67   14.24    63.56   67.36   12.18     6.79   59.83    5.33
3      Yes          a.         753.73  189.70   42.19   220.00  172.10   28.99   115.25  151.85   17.02
3      Yes          b.         421.00  146.70   30.93   151.00  133.50   21.05    74.50  118.40   11.99
3      Yes          c.         240.20  119.80   28.28    74.66  109.50   25.99    42.57   97.22   21.86
3      Yes          d.          47.95  144.61   33.85    18.52  130.93   28.69     3.63  114.00   18.63

Table 4-1. Observed standard error as a percentage of SUM (e.SAL) over all e ∈ EMP for 24
synthetically generated data sets. The table shows errors for three different sampling
fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the
three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B -
Model-based biased estimator.
Data set type                      1% Sample error          5% Sample error          10% Sample error
Gamma  Correlated?  Val. Dist.   U(%)    C(%)    B(%)     U(%)    C(%)    B(%)     U(%)    C(%)    B(%)
4      No           a.         153.70   36.20   14.52    37.17   33.90    4.73    24.47   31.20    0.89
4      No           b.         226.00   37.00   18.56    50.32   33.95    5.27    42.87   31.11    1.33
4      No           c.         242.70   35.20   11.10    19.40   32.85    3.62    17.03   30.04    3.59
4      No           d.         146.37   16.56   45.16    23.60   14.85   21.26     8.85   12.62   16.61
4      Yes          a.         418.70   64.50   10.85   116.55   59.94    2.71    27.55   54.52    1.64
4      Yes          b.         327.02   52.06    8.62    75.95   48.42    3.92    45.62   44.12    2.83
4      Yes          c.         359.60   43.40   13.90    30.19   40.39    7.17    27.21   36.80    5.16
4      Yes          d.          1.1e3   37.53   40.29    54.33   33.99   10.66    18.94   29.32    5.68
5      No           a.         236.00   72.04   13.19    46.18   66.08   12.07    38.30   59.60    6.15
5      No           b.         395.00   72.30   11.78    55.78   66.09   11.73    42.73   59.55    5.37
5      No           c.         167.70   71.10    7.70   120.81   65.20    1.99    62.70   58.50    1.15
5      No           d.         135.65   51.87   13.58    77.12   48.29    4.30    24.14   42.21    4.16
5      Yes          a.         862.00   71.79   31.25   203.81   64.90    7.21    57.22   57.00    2.93
5      Yes          b.         650.80   56.60   28.64   129.75   51.46    6.75    74.16   43.90    1.86
5      Yes          c.         298.70   92.30   11.47   189.70   84.22    4.06    69.63   74.80    2.53
5      Yes          d.         283.26  105.24   10.84   178.61   95.07    9.38   145.78   81.86    3.04
6      No           a.          7.1e3   95.13   19.30    6.2e3   79.49    9.82    4.1e3   63.33    6.09
6      No           b.          1.9e4   95.20   18.40    2.1e3   79.58    9.47    6.6e2   63.40    5.74
6      No           c.          1.9e4   94.32   13.03    1.2e3   78.60    5.96    9.6e2   62.74    1.71
6      No           d.          4.7e4   76.71    7.54    2.0e2   66.87    8.42    68.87   54.96    3.97
6      Yes          a.          5.4e4   307.0   62.00    3.0e4  249.30   30.90    5.7e3  119.00   18.78
6      Yes          b.          4.2e4   214.0   42.70    1.9e4  174.25   21.12    7.0e3  135.00   12.88
6      Yes          c.          3.2e4   156.3   22.70    2.0e3  128.10   10.87    8.7e2  100.12    3.05
6      Yes          d.          1.3e5   234.4   29.78    2.9e3  192.46   28.25    2.4e3  148.28   12.79

Table 4-2. Observed standard error as a percentage of SUM (e.SAL) over all e ∈ EMP for 24
synthetically generated data sets. The table shows errors for three different sampling
fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the
three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B -
Model-based biased estimator.
Data set type             1% Sample error           5% Sample error           10% Sample error
Gamma  Violates        U(%)     C(%)    B(%)     U(%)     C(%)    B(%)     U(%)     C(%)    B(%)
1      (1)             8.83    13.37   62.60     3.12    12.47   15.24     1.19    11.75    4.62
2      (1)            24.66    39.33   34.39     8.14    37.89    2.74     3.41    35.60    2.48
3      (1)            94.11    92.31   21.14    72.94    84.82   16.76    20.27    75.78   13.05
4      (1)            22.30    36.67   37.99    12.72    34.07    7.96     6.34    31.12    2.95
5      (1)           231.50    72.60    6.76   123.30    66.14    6.37    85.68    59.48    4.35
6      (1)          1366.80    95.96    9.99    1.2e3    78.64    5.85    700.0    62.62    1.88
1      (2)            14.18    21.70  100.70     4.42    21.09   26.34     2.69    20.20   12.44
2      (2)            21.62    72.24   59.94    14.25    67.50    7.56     6.25    62.90    4.47
3      (2)            886.2   220.20   45.73    136.0   201.90   31.73    79.75   180.10   25.76
4      (2)            462.0    95.80  106.80   269.19    88.74   22.18    81.03    82.43   11.52
5      (2)           247.60    205.0   18.84    233.0   187.00   17.69    88.55   168.30    9.78
6      (2)          6891.00    369.0   42.30   5988.0   310.00   40.90  1924.00   246.57   19.77
1      (3)            14.70    21.14   61.86     6.24    20.20   10.15     1.13    19.13    2.67
2      (3)            26.15    66.73   29.10    22.49    62.25   20.25     5.38    57.69   17.35
3      (3)           920.10   185.30   41.86   147.60   167.20   30.12    65.63   146.88   27.20
4      (3)            2.3e5    64.42   35.96   714.00    60.54   16.87   150.80    54.77    9.24
5      (3)          1350.30   143.00   33.59   856.00   127.76   29.58   306.70   113.14   10.08
6      (3)            2.2e5   264.02   38.37  4519.10   212.80   34.92  2530.00   162.70   21.96

Table 4-3. Observed standard error as a percentage of SUM (e.SAL) over all e ∈ EMP for 18
synthetically generated data sets. The table shows errors for three different sampling
fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the
three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B -
Model-based biased estimator.
correlated case with a 1% sample, most of the relative standard errors were more than
40000%. Such very poor results are found sporadically throughout most of the data
sets, though the results were somewhat erratic. The reason that the observed errors
associated with the unbiased estimator are highly variable is the very long tail of the
error distribution. Under many circumstances, most of the answers computed using
the unbiased estimator are very good, but there is still a small (though non-negligible)
probability of getting a ridiculous estimate whose error is hundreds of times the sum of
the aggregate values over the entire EMP relation. Indeed, it is worth noting
                          1% Sample error           5% Sample error           10% Sample error
Data Set   Query       U(%)     C(%)    B(%)     U(%)     C(%)    B(%)     U(%)     C(%)    B(%)
IMDB       Q1         9.6e3    27.67   70.88    3.3e3    17.51   33.44    4.1e2    13.71   14.14
IMDB       Q2         1.2e2    75.12   65.10    91.26    62.86   31.97    49.82    52.69    9.31
IMDB       Q3          1.e4    25.21   18.47    3.5e3    16.58   14.38    4.7e2    12.71    1.92
SCR        Q4         1.4e4    65.22   10.31    5.0e3    44.97    6.84    8.2e2    23.27    4.41
SCR        Q5         1.2e4    59.06    9.42    4.6e3    41.62    7.51    7.8e2    24.07    3.95
KDDCup     Q6        1.1e10    60.47   12.39    7.4e4    54.92   10.96    7.6e3    42.08    2.10
KDDCup     Q7       6.5e147    41.30   11.24   5.8e83    26.54    4.32   9.3e36    17.04    3.28
KDDCup     Q8       7.3e210    15.24    8.46  3.6e172    10.80    1.56  2.3e120     6.35    0.98

Table 4-4. Observed standard error as a percentage of the total aggregate value of all
records in the database for 8 queries over 3 real-life data sets. The table shows errors
for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions,
it shows the error for the three estimators: U - Unbiased estimator, C - Concurrent
sampling estimator and B - Model-based biased estimator.
that the unbiased estimator’s worst performance overall was observed on Q8 over the
KDDCup data, where the error was astronomically high: larger than 10100.
In comparison, the biased estimator generally did a very good job predicting the final
query result, and in most cases with a 5% or 10% sampling fraction the observed standard
error was less than 10% of the total aggregate value found in EMP. In other words, if the
total value of SUM (e.SAL) with no NOT EXISTS clause is x, then for just about any query
tested, the standard error was less than x/10, and it was frequently much smaller. This is
actually quite impressive when one considers the difficulty of the problem. The primary
drawback associated with the biased estimator is its complexity: it requires non-trivial,
statistically-oriented computations and a significant amount of computation time, most
of it associated with running the EM algorithm to
completion. By comparison, the unbiased estimate can be calculated via an almost trivial
recursive routine that relies on the calculation of simple hypergeometric probabilities.
One case where the biased estimator had questionable qualitative performance was
with the 16 tests associated with data sets 3 and 6. The problem in this case was that
the EM algorithm tended to overestimate p0 in Θ, which is actually very small in these
two data sets (.052 and .0037, respectively). This results in an error that hovers at 10%+
of the total aggregate value of e.SAL (even for a 5%+ sample) when the real answer is
only 5% of this total for data set 3 or less than 1% of this total for data set 6. We stress
that guessing that only a few percent of the tuples in EMP have no matches in SALE from
a small sample with limited information is an extremely difficult estimation problem,
and we conjecture that without additional information (such as prior knowledge that the
distribution represented by p is a discretized gamma distribution) it will be very difficult
to achieve better results.
Results from the synthetic data sets which specifically violate the assumptions of the
superpopulation model are shown in Table 4-3. The first six rows in the table show results
for data sets in which more than one EMP record can match with a given record from SALE.
The results show that violating this assumption of the model in the actual data set did
not affect the accuracy of the biased estimator significantly. The next set of six rows in the
table show results for data sets in which there is no linear relationship between the mean
aggregate values of the different classes of EMP records. The results show that the biased
estimator is about twice as inaccurate over these data sets as compared to corresponding
data sets which do not have a strict violation of the assumption. The last six rows in
the table show results over data sets in which the variances of the aggregate values of
records from different classes are significantly different. Results show that these data sets
affect the accuracy of the biased estimator as much as the data sets which violate the
“linear relationship of mean values” assumption. However, the results are certainly not
poor when these assumptions are violated, and the method still seems to have qualitative
performance that may be acceptable for many applications, particularly with a larger
sample size.
The results from the eight queries over the three real-life data sets are depicted in
Table 4-4. The key difference in the characteristics of the real-life data sets compared
to the synthetically-generated data sets is the number of matching records in the inner
relation for a given record from the outer relation of the NOT EXISTS query. For the
KDDCup data set, the maximum number of matching records in the inner relation is as
high as 2500, while for the IMDB and SCR data sets this number is about 200 and 90
respectively. Due to this, none of the cases which are favorable for the use of the unbiased
estimator (as described above) are observed in the real-life data sets. On the other hand,
it can be seen from Table 4-4 that the accuracy of the biased estimator is generally quite
good over the real data.
We also note that the standard error of the biased estimator over the learned
superpopulation seems to be a reasonable surrogate for the standard error of the biased
estimator in practice. For most biased estimators, it is reasonable to use the standard
error of the biased estimator in the same way that one would use the standard deviation
of an unbiased estimator when constructing confidence bounds (see Sarndal et al. [109],
Section 5.2). According to the Vysochanskii-Petunin inequality [120], any unbiased
uni-modal estimator will be within three standard deviations of the correct answer 95% of
the time, and according to the more aggressive central limit theorem, an estimator will be
within two standard deviations of the correct answer 95% of the time. We observed that in
almost all of the tests, ten out of ten of the errors for the biased estimator were actually
within two predicted standard errors of zero. This seems to be strong evidence for the
utility of the bounds computed using the predicted standard error of the biased estimator.
We finally remark on the time required for the execution of the biased estimator. The
biased estimator performs several computations including learning the model parameters,
generating sufficient statistics for several population-sample pairs and then solving a
system of equations to compute weights for the various components of the estimator. As
discussed previously, this took no longer than four seconds for the largest samples tested.
If this is not fast enough, we point out that it may be possible to speed this up even more,
though this is beyond the scope of the thesis. While we used the traditional EM algorithm
in our implementation, we note that EM can be made faster by using incremental variants
[69, 95, 116] of the EM algorithm. These variants of the EM algorithm typically achieve
faster convergence by only partially implementing the Expectation and/or the Maximization
step of the EM algorithm.
4.7 Related Work
Estimation via sampling has a long history in databases. One of the oldest and best
known works is Frank Olken’s PhD thesis [97]. Other classic efforts at sampling-based
estimation over database data are the adaptive sampling of Lipton and Naughton [83, 84]
for join query selectivity estimation, and the sampling techniques of Hou et al. [64, 65] for
aggregate queries. More recent well-known work on sampling is that on online aggregation
by Haas, Hellerstein, and their colleagues [47, 60, 61].
The sampling-based database estimation problem that is closest to the one studied
in this chapter is that of sampling for the number of distinct values in a database. As
discussed in the introduction to this chapter, a solution to the problem of estimation over
subset-based queries is a solution to the problem of estimating the number of distinct
values in a database since the latter problem can be written as a NOT EXISTS query. The
classic paper in distinct value estimation is due to Haas et al. [49]. For a survey of the
state-of-the-art work on this problem in databases through the year 2000, we refer the
reader to the Introduction of the paper by Charikar et al. on the topic [17]. The paper
of Bunge and Fitzpatrick [13] provides a survey of work in the statistics area, current
through the early 1990’s. Work in statistics continues on this problem to this day. In
fact, a recent paper from statistics by Mingoti [90] on the distinct value problem provided
inspiration for our use of superpopulation techniques.
Though the problems of distinct value estimation and subset-based aggregate
estimation are related, we note that the problem of estimating the number of distinct
values is a very restricted version of the problem we study in this thesis, and it is not
immediately clear how arbitrary solutions to the distinct value problem can be generalized
to handle subset-based queries. The most obvious difficulty in extending such methods
to subset-based queries is the fact that a NOT EXISTS or related clause results in a
complicated statistic summarizing two populations (the two tables that are queried over).
Nonetheless, links between the problems do exist. For example, though our own unbiased
estimator was not directly inspired by Goodman’s estimator [43]5 and it takes a very
different form, it is easy to argue that our unbiased estimator must be a generalization of
Goodman’s estimator. The reasoning is straightforward: Goodman’s estimator is proven to
be the only unbiased estimator for distinct value queries, and our own unbiased estimator
is unbiased for distinct value queries. Therefore, they must be equivalent when used on
this particular problem.
4.8 Conclusion
This chapter has presented two sampling-based estimators for the answer to a
subset-based query, where the answer to a SUM aggregate query (and by trivial extension,
AVERAGE and COUNT) is restricted to consider only those tuples that satisfy a NOT EXISTS
or related clause. The first estimator is provably unbiased, while the second makes use of
superpopulation methods and was found to be much more accurate.
As discussed in Section 4.5.1 of the thesis, one of the most controversial decisions
made in the development of the latter estimator was our choice of a very general prior
distribution. To a statistician from the so-called “Bayesian” school [39], this may be
seen as a poor choice, and a Bayesian statistician may argue that a more descriptive prior
distribution, if appropriate, would increase the accuracy of the method. This is certainly
true, if the selected distribution were a good match for the actual data distribution. In
our work, however, we have consciously chosen generality and its associated drawbacks in
place of specificity. Our experimental results seem to argue that for a variety of different
5 Goodman’s estimator is one of the earliest statistical estimators for distinct value queries.
data distributions, the resulting estimator still has high accuracy. Still, this represents
an intriguing question for future work: can a different prior distribution be chosen that
is appropriate for use in real-world data sets, and which results in a more accurate
estimator?
Finally, we note that the model-based method outlined in the latter half of this
chapter was designed specifically to address the problem of estimating the answer to a
nested SQL query with a single table in the inner query and a single table in the outer
query linked by a NOT EXISTS predicate. As is, our model is not directly applicable to
arbitrarily complex nested queries. For example, nested queries may include multiple
relations in the outer as well as the inner query. One could imagine sampling all of the
input relations, and then using any result tuples that are discovered as part of the inner
or outer subqueries as input into an estimator such as the one studied in this chapter.
However, this may be dangerous, and our superpopulation model is not directly applicable.
The problem is that if there is a join in the inner (or outer) query, then the tuples
produced via joining samples from the input relations are not i.i.d. samples from the join
[47]. This means that the join itself must be modeled, which is a problem for future work.
Another problem for future work is arbitrary levels of nesting. An inner query may itself
be linked with another inner query via a NOT EXISTS or similar clause.
CHAPTER 5
SAMPLING-BASED ESTIMATION OF LOW SELECTIVITY QUERIES
5.1 Introduction
The specific problem that we consider in this chapter is sampling-based approximation
of the answer to highly selective aggregate queries – those having a relational selection
predicate that accepts only a very small percentage of the data set. Again, we consider
sampling because it is the most versatile of the approximation methods: a single sample
can be used to handle virtually any relational selection predicate or any join condition.
Samples generally do not require prior knowledge of what queries will be asked, unlike
other methods such as sketches [8]. We consider very selective queries because they are the
one class of queries that are hardest to handle approximately without workload knowledge:
if a query references only a few tuples from the data set, then it is very hard to make sure
that a synopsis structure (such as a sample) will contain the information needed to answer
the query.
The most natural method for handling highly selective queries using sampling is to
make use of stratification [25]. In order to answer an aggregate query over a relation,
one could first (offline) partition the relation’s tuples into various subsets so that similar
tuples are grouped together – the assumption being that the relational selection predicate
associated with a given query will tend to favor certain strata. Even if a given query is
very selective, at least one or two of the strata will have a relatively heavy concentration
of tuples that will contribute to the query answer. When the query is processed, those
“important” strata can be sampled first and more heavily than the others. This is
illustrated with the following example:
Example 1: The relation MOVIE(MovieYear, Sales) is partitioned into two strata
as follows:

    R1 : MovieYear ≤ 1975        R2 : MovieYear > 1975
    r1 : 〈1961, 30〉              r3 : 〈1983, 60〉
    r2 : 〈1972, 50〉              r4 : 〈1977, 40〉
                                 r5 : 〈1997, 25〉
                                 r6 : 〈1992, 100〉
                                 r7 : 〈2004, 100〉

The query Q is then issued:

SELECT SUM (Sales)
FROM MOVIE
WHERE MovieYear < 1980
Since all movies in R1 were released in or before 1975, all the records in the stratum R1
match Q. Hence, we decide to obtain a biased sample that includes as many records from
R1 as the sample size permits and we sample from R2 only if the desired sample size is not
met. For a sample size of 4, this results in an estimate whose variance (or error) is 2400.
Drawing a sample from the population as a whole results in an estimate whose variance is
2575.
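The two variance figures in Example 1 can be reproduced with the standard without-replacement variance formula for a stratified estimator of a sum, Var = Σ_i N_i(N_i − n_i) S_i²/n_i, where S_i² is the stratum variance computed with the N_i − 1 divisor. The following sketch is our own illustration of that computation, not code from the thesis:

```python
# Verify the variances quoted in Example 1. Records are (MovieYear, Sales);
# the query predicate keeps records with MovieYear < 1980.

def var_nminus1(vals):
    # variance with the (N - 1) divisor
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / (len(vals) - 1)

def stratified_variance(strata, alloc):
    # Var = sum_i N_i * (N_i - n_i) * S_i^2 / n_i  (SRS without replacement per stratum)
    return sum(len(vals) * (len(vals) - n_i) * var_nminus1(vals) / n_i
               for vals, n_i in zip(strata, alloc))

R1 = [(1961, 30), (1972, 50)]
R2 = [(1983, 60), (1977, 40), (1997, 25), (1992, 100), (2004, 100)]
f = lambda r: r[1] if r[0] < 1980 else 0

strata = [[f(r) for r in R1], [f(r) for r in R2]]
print(stratified_variance(strata, [2, 2]))                # 2400.0 (biased plan)
print(stratified_variance([strata[0] + strata[1]], [4]))  # ~2575  (plain SRS)
```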
While stratification may be very useful, it is not a new idea. It has been studied in
statistics for decades, and it has been suggested previously as a way to make approximate
aggregate query processing more accurate [18–20]. However, in the context of databases,
researchers have previously considered only half of the problem: how to divide the
database into strata. This may actually be the easy and less important half of the
problem, since even the relatively naive partitioning strategy we use in our experiments
can give excellent results. The equally fundamental problem we consider in this paper is:
how to allocate samples to strata when actually answering the query. More specifically,
given a budget of n samples, how does one choose how to “spend” those samples on the
various strata in order to achieve the greatest accuracy?
The classic allocation method from statistics is the Neyman allocation, and it is the
one advocated previously in the database literature [19]. The key difficulty with applying
the Neyman Allocation in practice is that it requires extensive knowledge of certain
statistical characteristics of each stratum with respect to the incoming query. In practice
this knowledge can only be guessed at by taking a pilot sample. As we show in this paper,
if the guess is poor, then the resulting sampling plan can be disastrous. This results in a
classic chicken-and-egg problem: we want to sample in order to avoid scanning all of the
data, but in order to sample properly, we have to collect statistics that require scanning
all of the data! The result is that the classic Neyman allocation is unusable in many
situations, as we will demonstrate experimentally in the paper.
Our Contributions
In this thesis, we develop an alternative to the classic Neyman allocation that we
call the Bayes-Neyman allocation. While this is a very general method and its utility
is not limited to the context of database management, the Bayes-Neyman allocation is
particularly relevant to database sampling because it is designed to be robust when only a
few of the data records in the data set are relevant to estimating a quantity over the data
– as is the case when a query has a restrictive relational selection predicate. The specific
contributions of our work are as follows:
• The Bayes-Neyman allocation explicitly takes into account the error that might be
incurred when developing the sampling plan to maximize the expected accuracy of the
resulting estimate.

• The Bayes-Neyman allocation makes use of novel, Bayesian techniques from statistics
[14] that allow us to take into account any prior expectation (such as the expected
efficacy of the stratification) in a principled fashion.

• We carefully evaluate our methods experimentally, and show that if one is very careful
in developing a sampling plan, even a naive partitioning of samples to strata that uses
no workload information can show dramatic accuracy for very selective queries.

• Our methods are very general. They can be used with any partitioning (such as those
proposed by Chaudhuri et al. [18–20]), or even in cases where the partitioning is not
user-defined and is imposed by the problem domain (for example, when the various
“strata” are different data sources in a distributed environment). Our methods can
also be extended to more complicated relational operations such as joins, though this
problem is beyond the scope of the paper.
5.2 Background
This section presents some preliminaries and background about stratified sampling,
and discusses the problems associated with using stratified sampling in a database setting
to estimate results of arbitrary queries.
5.2.1 Stratification
A general example of a SUM aggregate query over a single relation can be written as
follows:
SELECT SUM (f1(r))
FROM R As r
WHERE f2(r)
Note that if we define a function f() where,
    f(r) = { f1(r)   if f2(r) is true
           { 0       if f2(r) is false
the above query can be simply re-written as,
SELECT SUM (f(r))
FROM R As r
If the relational selection predicate f2(r) selects a very small fraction of records from
the relation R, then the query is said to be a low selectivity query.
Assume that relation R is partitioned into L disjoint strata such that Ri represents
the ith stratum. Then, we have R = R1 ∪ R2 ∪ · · · ∪ RL. We denote the size of the ith
stratum by Ni and thus we have |Ri| = Ni. Let R′i where |R′i| = ni be the survey sample
(without replacement) from the ith stratum. The sizes of all the strata are known from
strata construction time, while the sizes of the survey samples from each of the strata
(the ni values) can be determined by using some sampling allocation scheme subject to
the constraint Σ_i n_i = n, where n is the pre-determined total sample size from R. The
problem of determining an optimal sample allocation is the central focus of this paper.
If we execute the above query on each of the R′i, the result of the query over the
sample of stratum i can be written as,
    y_i = Σ_{r ∈ R′_i} f(r)
The unbiased stratified sampling estimator for the query result expressed in terms of
the yi values is,
    Ŷ = Σ_{i=1}^{L} (N_i / n_i) y_i        (5–1)
The true variance of the records in stratum i can be computed as,
    σ_i^2 = (1/N_i) Σ_{r ∈ R_i} f^2(r) − ( (1/N_i) Σ_{r ∈ R_i} f(r) )^2
Thus the true variance (or error) of the estimator Ŷ is given by,

    σ^2 = Σ_{i=1}^{L} [ N_i (N_i − n_i) / n_i ] σ_i^2        (5–2)
In practice, it is not feasible to know the true stratum variances for an arbitrary
query. Hence, a sample-based estimate for the variance of stratum i can be computed as,
    σ̂_i^2 = ( 1/(n_i − 1) ) Σ_{r ∈ R′_i} ( f(r) − y_i/n_i )^2        (5–3)
Then, an unbiased estimator for the variance of Ŷ can be obtained from Equation 5–2
by simply replacing all the σ_i^2 terms with their corresponding unbiased estimators
σ̂_i^2.
Central-Limit-Theorem-based confidence bounds [112] for Ŷ can then be computed
as Ŷ ± z_p σ̂, where z_p is the z-score for the desired confidence level. If desired, more
conservative confidence bounds from the literature (such as Chebyshev-based [112]) can
also be used.
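Equations 5–1 through 5–3 translate almost directly into code. The sketch below (our own illustration, with made-up sample values) returns the estimate Ŷ together with the half-width of the CLT-based bound:

```python
import math

def stratified_estimate(samples, N_sizes, z_p=1.96):
    """samples: per-stratum lists of f(r) values over the survey samples R'_i.
    N_sizes: the stratum sizes N_i.
    Returns (Y_hat, half-width of the z_p confidence bound)."""
    Y_hat = var_hat = 0.0
    for vals, N_i in zip(samples, N_sizes):
        n_i, y_i = len(vals), sum(vals)
        Y_hat += (N_i / n_i) * y_i                                  # Equation 5-1
        s2_i = sum((v - y_i / n_i) ** 2 for v in vals) / (n_i - 1)  # Equation 5-3
        var_hat += N_i * (N_i - n_i) / n_i * s2_i                   # Equation 5-2
    return Y_hat, z_p * math.sqrt(var_hat)

# Illustrative samples of f(r) values from two strata of sizes 2 and 5:
est, half = stratified_estimate([[30, 50], [40, 0]], N_sizes=[2, 5])
print(est)  # 180.0
```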
Finally, we note that aggregate queries like COUNT and AVG can also be handled by
stratified sampling estimators like the one described above by using ratios of two different
estimates. Aggregate queries with a GROUP BY clause can also be answered by using
stratification. A GROUP BY query can be considered as executing several simple queries in
parallel – one for each group. Joins can also be handled using methods similar to those
proposed by Haas and Hellerstein [54], though that is beyond the scope of the paper.
5.2.2 “Optimal” Allocation and Why It’s Not
The problem of determining the ni values for all the strata for a predetermined
sample size n is the sample allocation problem. The key constraint on the values of the
sample sizes is that their sum should equal the total sample size. Besides this constraint,
there is freedom in the choice of the ni values, and hence a natural choice is to minimize
the error of Y of Equation 5–1. Since Y is unbiased, minimizing its error is equivalent
to minimizing its variance. An optimization problem can be formulated for the choice
of ni values so that the variance σ2 is minimized – solving the problem leads to the
well-known Neyman allocation [25] from statistics. Specifically, the Neyman allocation
states that the variance of a stratified sampling estimator is minimized when the sample
size ni is proportional to the size of the stratum, Ni, and to the variance of the f() values
in the stratum, σ²i. That is:

    ni = ( n / ∑_j Nj σ²j ) × Ni σ²i        (5–4)
The problem we face in a database setting is that the stratum variance values σ²i are not
known for an arbitrary query. The stratum variance σ²i depends on: (a) the function to be
aggregated, f1(), and (b) the relational selection predicate, f2(). Since these functions can
vary from one query to another, it is not feasible to compute beforehand exact values of
the various σ²i terms for an arbitrary query. This means that the optimal ni values cannot
be computed in the absence of exact σ²i values.
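For reference, the allocation of Equation 5–4 is trivial to compute once the per-stratum variances are known; a minimal sketch (the function name is ours, for illustration):

```python
def neyman_allocation(n, strata_sizes, variances):
    """Allocate a total sample size n across strata according to
    Equation 5-4: n_i proportional to N_i * sigma_i^2."""
    weights = [N * v for N, v in zip(strata_sizes, variances)]
    total = sum(weights)
    return [n * w / total for w in weights]
```

The difficulty discussed above is precisely that the `variances` argument is unknown at query time.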
It is possible to obtain rough estimates for the strata variances by doing a pilot run of
the query on very small pilot samples from each stratum, which is the standard method.
However, as the following example shows, a major drawback of this approach is that
the variance estimates calculated from such pilot sampling can be arbitrarily erroneous,
leading to an extremely poor allocation scheme and even more severe problems.
Example 2: Imagine that we have a relation R partitioned into two strata R1 and R2
such that |R1| = 10000 and |R2| = 10000. Let Q be a query identical to the query presented
in Section 5.2.1. The number of records from R1 accepted by f2() is 10 while the number of
records from R2 accepted by f2() is 1000. Further, let f1(r) ∼ N(1000, 100) ∀r ∈ R1 and
f1(r) ∼ N(10, 100) ∀r ∈ R2, where N(µ, σ) denotes a normal distribution with mean µ
and variance σ2.
We use a pilot sample of 100 records to estimate the variance of the f() values in
each stratum. These estimates are σ̂²1 and σ̂²2. If the desired sample size is n = 1000, the
estimated variances can be used with Equation 5–4 to obtain an estimate for the optimal
sampling allocation as follows (the strata sizes cancel because |R1| = |R2|):

    n1 = 1000 σ̂²1 / (σ̂²1 + σ̂²2)        n2 = 1000 σ̂²2 / (σ̂²1 + σ̂²2)
We then ask the question: how accurate will the resulting sampling plan be? To
answer this question, we perform a simple experiment in which we repeat the above
process 1000 times. For each iteration, we record the squared error of the estimate
produced by the computed sampling plan. The average of all these squared errors gives us
an approximation of the mean-squared error (MSE) of the estimator. For each iteration,
we also compute the estimated variance of the result (using Equation 5–2) since this
variance would be used to report confidence bounds to the user. We then compute the
average estimated variance across the 1000 iterations. Finally, we use the true variances of
both strata to obtain an optimal sample allocation, and repeat the above experiment using
the optimal allocation. We summarize the results in the following table.
    True query result       20150
    Avg. observed bias      10200
    Avg. estimated MSE      0.76 million
    Avg. observed MSE       100 million
    MSE of true optimal     58.6 million
Overall, the results using the pilot sampling are disastrous. Specifically:
• The pilot-sampling-based allocation provides an average estimated error to the user
that is more than 2 orders of magnitude smaller than the true error – 0.76 million versus
100 million. Since the estimated error is typically used to compute confidence bounds, the
resulting confidence bounds will be much narrower than what they should be in reality.
Hence, the user would be provided with a dangerously optimistic picture of the error of
the estimator.

• Second, the non-optimal allocation leads to an estimate that has a heavy bias. This is
due to the fact that the allocation often directs the stratified sampling to ignore the first
stratum. For approximately 90% of the 1000 iterations, the pilot sample fails to discover
any matching records in R1. Hence, the pilot sample-based variance is naively guessed to
be zero. When this value is used with the Neyman allocation, no samples are allocated
to R1, while all 1000 samples are allocated to R2. The outcome is that the query result is
usually underestimated, because R1 actually contains records accepted by f2().

• Finally, by using a truly optimal sampling allocation to estimate the query result,
it is possible to achieve an error that is around half the error obtained by a non-optimal
allocation. The additional error incurred due to the poor allocation represents a wasted
opportunity to provide a much more accurate estimate.
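The root cause of the bias can be checked with a tiny simulation, assuming (as in the example) that only 10 of the 10000 records in R1 are accepted by f2(); the code and names below are ours, for illustration:

```python
import random

def pilot_misses_all(matching, stratum_size, pilot_size,
                     trials=5000, seed=0):
    """Estimate how often a pilot sample (drawn without replacement)
    contains no record accepted by f2(), when only `matching` of the
    stratum's `stratum_size` records match."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        picks = rng.sample(range(stratum_size), pilot_size)
        # treat records 0 .. matching-1 as the ones accepted by f2()
        if min(picks) >= matching:
            misses += 1
    return misses / trials
```

With 10 matching records out of 10000 and a pilot of size 100, the miss probability is roughly 0.999^100 ≈ 0.90, matching the 90% figure above; every such miss yields an estimated stratum variance of zero.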
5.3 Overview of Our Solution
The fundamental problem we face is that the natural estimator for σ²i serves us
extremely poorly when we are trying to figure out how to allocate samples to strata.
Human intuition tells us that it is foolish to simply assume that σ²i is zero in this case,
even though our estimate σ̂²i will be zero. This is because, as human beings, we know that
there will often be a number of records matching the given f2() in a stratum, and that we
were simply unlucky enough to miss them in our pilot sample.
To remedy these problems, we propose a novel Bayesian approach [14] called the
Bayes-Neyman allocation that can incorporate such intuition into the process in a
principled fashion. In general, Bayesian methods formally model such prior intuition
or belief as a probability distribution. Such methods then refine the distribution by
incorporating additional information – in our case information from the pilot sample – to
obtain an overall improved probability distribution.
At the highest level, the proposed Bayes-Neyman allocation works as follows:
1. First, in Bayesian fashion, we represent our belief in the possible variances over
the f() values in each stratum as a prior probability distribution. Let the vector
Σ = 〈σ²1, σ²2, · · · , σ²L〉 be one possible set of strata variances. We define a probability
distribution over all of the possible Σ values to represent this prior belief. Let XΣ be
a random variable with exactly this probability distribution. Thus, sampling from XΣ
(that is, performing a random trial over XΣ) gives us one possible value for the vector
Σ, where those variance vectors that we feel are more “correct” are more likely to be
sampled.

2. Second, we take a pilot sample from the database and use the result of the pilot
sample to update the distribution of XΣ in order to make it more accurate.

3. Third, we sample a large number of possible Σ values from the resulting XΣ in
Monte-Carlo fashion. This gives us a large number of possible alternative values for
Σ.

4. Finally, we construct a sampling plan for estimating the answer to our query whose
average error (variance) is minimized over all of the Σ values that were sampled
from XΣ. This gives us a sampling plan whose expected error over the possible set of
databases described by the distribution of XΣ is minimized. This plan is then used to
perform the actual stratified sampling.
The three key technical questions that must be addressed when adopting this
approach are:
1. First, how is the random variable XΣ defined?

2. Second, how can the distribution of XΣ be updated to take into account any
information that is gathered via the pilot sample?

3. Third, how can a set of samples from the updated XΣ be used to produce an optimal
sampling plan?
The next three sections outline our answer to these three questions.
5.4 Defining XΣ
In this section, we consider the nature of XΣ itself, and how to sample from it.
5.4.1 Overview
At the highest level, the process of producing a single sample Σ from XΣ will be
further subdivided into three steps:
1. First, we sample from a random variable Xcnt to obtain a vector 〈cnt1, cnt2, · · · , cntL〉,
where this vector tells us how many tuples from each stratum are accepted by the
relational selection predicate f2().

2. Second, we sample from a random variable XΣ′ that gives us the vector Σ′ =
〈(µ1, µ2)1, (µ1, µ2)2, ..., (µ1, µ2)L〉. The ith pair (µ1, µ2)i is the mean (that is, µ1) and
second moment (that is, µ2)¹ over all of the f1() values in stratum i for those cnti
tuples that are accepted by f2().

3. Third, once these two samples have been obtained, it is then a simple mathematical
task to use the outputs of Xcnt and XΣ′ to compute the output of XΣ.
We now consider each of these three steps in detail.
5.4.2 Defining Xcnt
Using terminology common in Bayesian statistics, each entry in Xcnt is generated by
sampling from a binomial distribution with a Beta prior distribution [33]. This means
that we view the probability pi that an arbitrary tuple from stratum i will be accepted
by the relational selection predicate f2() as being the result of a random sample from
the Beta distribution, which produces a result from 0 to 1. Since we view each tuple as a
separate and independent application of f2(), the number of tuples from stratum i that are
accepted by f2() is then binomially distributed², with the binomial distribution taking the
value pi as input, along with the stratum size Ni.
The Beta distribution is chosen as the prior distribution because it is the canonical
“conjugate prior” for the binomial distribution: after observing binomially distributed
data, the posterior distribution over pi is again a Beta, which makes the Bayesian update
rules simple. Conveniently, the Beta’s domain is also precisely the parameter space for the
binomial distribution; in this case, the range 0 to 1, which is the valid range for pi.
1 Recall that the second moment of a random variable X is the expected value of X²:
µ2 = E[X²].
2 The binomial distribution models the case where n balls are thrown at a bucket and
each ball has a probability p of falling in the bucket. A binomially distributed sample
returns the number of balls that happened to land in the bucket.
Figure 5-1. Beta distribution with parameters α = β = 0.5. [The plot shows the
U-shaped density over query selectivity (x-axis, 0 to 1) against the fraction of queries
(y-axis).]
Given this setup, the first task is to choose the set of Beta parameters that control
the distribution of each pi so as to match the reality of what a typical value of pi will be
for each stratum. The Beta distribution is a parametric distribution and requires two input
parameters, α and β. Depending on the parameters that are selected, the Beta can take
a large variety of shapes and skews. Choosing α and β for the ith stratum is equivalent
to supplying our “intuition” to the method, stating what our initial belief is regarding the
probability that an arbitrary record will be accepted by f2().
There are two possibilities for setting those initial parameters. The first possibility
is to use workload information. We could monitor all previously-observed queries over
each stratum, where we observe that for query i and stratum j the probability
samples from our generative Beta prior, we simply estimate α and β from this set using
any standard method. An estimate for the Beta parameters based upon the principle of
Maximum Likelihood Estimation can easily be derived [112].
A second method is to simply assume that the stratification we choose usually works
well. In this case, most strata will either have a very low or a very high percentage of its
records accepted by f2(). Choosing α = β = .5 results in a U-shaped distribution that
matches this intuition exactly, and is a common choice for a Beta prior. The resulting
Beta is illustrated in Figure 5-1. In practice we find that this produces excellent results.
We stress that though the initial choice of α and β for each stratum is important,
it is only important to the extent that it informs us what is going on in the case that we
have very little information available in the pilot sample (such as when the pilot is very
small). If the pilot sample contains a great deal of information, the update step described
in Section 5.5 will update α and β as needed to take into account the information present
in the pilot sample.
Producing the Vector of Counts
Given the above setup, the GetCounts algorithm can be used to produce the vector of
counts 〈cnt1, cnt2, · · · , cntL〉:
————————————————————————————
Algorithm GetCounts(α, β, N)
// Let α = 〈α1, α2, · · · , αL〉 be the parameters of the beta
//     distributions of all strata
// Let β = 〈β1, β2, · · · , βL〉 be the parameters of the beta
//     distributions of all strata
// Let N = 〈N1, N2, · · · , NL〉 be a vector of all strata sizes
// Let cnt = 〈cnt1, · · · , cntL〉 be a vector of counts for all strata
for (int i = 1; i <= L; i++) {
    pi ← Beta(αi, βi)
    cnti ← Binomial(Ni, pi)
}
return cnt
————————————————————————————
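A direct Python rendering of GetCounts, using only the standard library, might look as follows (names ours; since `random` has no built-in binomial sampler in older Python versions, the binomial draw is simulated as a sum of Bernoulli trials):

```python
import random

def get_counts(alpha, beta, strata_sizes, rng=random):
    """For each stratum i, draw a selectivity p_i from
    Beta(alpha_i, beta_i), then a matching-record count from
    Binomial(N_i, p_i)."""
    counts = []
    for a, b, N_i in zip(alpha, beta, strata_sizes):
        p = rng.betavariate(a, b)
        # binomial draw as N_i independent Bernoulli(p) trials
        cnt = sum(1 for _ in range(N_i) if rng.random() < p)
        counts.append(cnt)
    return counts
```

For example, `get_counts([0.5] * L, [0.5] * L, N)` draws one cnt vector under the U-shaped prior of Figure 5-1.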
5.4.3 Defining XΣ′
In the previous subsection, we described how to obtain counts for the number of
records that satisfy the selection predicate f2(). However, in order to obtain a sample
from XΣ, it is not enough to merely know these counts. We actually need to know the f1()
values of all the records that satisfy f2() in stratum i, since these values are needed to be
able to compute (µ1, µ2)i as is required to sample from XΣ′ .
To do this, we use the following method. For the ith stratum, let D be the vector
of all possible distinct values from the range of the function f1(). We then associate a
probability pj with the jth distinct value; pj indicates the likelihood of the jth distinct
value from the stratum (that is, D[j]) being assigned to an arbitrary tuple that has been
accepted by f2(). Then, let V denote a vector of |D| counts, where V[j] = k means
that the jth distinct value from the stratum has been assigned to k tuples that were
accepted by f2(). Thus, ∑_j V[j] = cnti. Since we assume that each application
of f1() is independent on a per-tuple basis, V can be obtained by sampling from a
multinomial distribution³ with two arguments: the probability vector consisting of all the
pj values, and the number of trials given by cnti (that is, the number of tuples accepted by
f2()). Then, the resulting vector V along with the distinct value vector D can be used to
compute the pair (µ1, µ2)i.

This technique poses two important questions that need to be answered:
• Is it always feasible to consider all the values in the range of f1()? That is, can we
always materialize D?

• How do we assign probabilities to all of the values in D? That is, how do we decide
the value of each pj?
The answer to the first question is simple: It is certainly not feasible to always
consider all possible values in the range of f1(), for obvious computational and storage
reasons. However, for the moment, we assume that it is feasible and consider the more
general case in Section 5.4.5.
3 The multinomial distribution models the case where cnt balls are thrown at d
buckets so that the probability of an arbitrary ball falling in bucket j is pj; a sample from
the multinomial assigns bj balls to bucket j such that ∑_j bj = cnt.
In answering the second question, we develop a methodology analogous to the way we
choose the pi parameter when dealing with the ith strata for Xcnt. As described above, the
number of times that each distinct f1() value is selected follows a multinomial distribution.
We know from Bayesian statistics that the standard conjugate prior for a multinomial
distribution is the Dirichlet distribution [33] – just as the Beta distribution is the standard
conjugate prior for a binomial distribution. The Dirichlet is the multi-dimensional
generalization of the Beta. A k-dimensional Dirichlet distribution makes use of the
parameter vector Θ = {Θ1, Θ2, · · · , Θk}. Just as in the case of the Beta prior used by
Xcnt, the Dirichlet prior requires an initial set of parameters that represent our initial
belief. Since we typically have no knowledge about how likely it is that a given f1() value
will be selected by f2(), the simplest initial assumption to make is that all values are
equally likely. In the case of the Dirichlet distribution, using Θi = 1 for all i is the typical
zero-knowledge prior [33]. Given Θ, it is then a simple matter to sample from XΣ′ , as we
describe formally in the next subsection. We note that although this initial parameter
choice may be inaccurate, in Bayesian fashion the parameters will be made more accurate
based upon the information present in the pilot sample. Section 5.5 provides details of
how the update is accomplished.
Producing the Vector Σ′
We now present an algorithm GetMoments to obtain the vector Σ′. We assume that
we have all the Θi values corresponding to the parameters of the Dirichlet, along with
counts of the number of records that are accepted by f2() in each stratum. These counts
are the values in the vector cnt obtained according to Algorithm GetCounts in
Section 5.4.2.
————————————————————————————
Algorithm GetMoments(Θ1, · · · , ΘL, D)
// Let Θi denote the Dirichlet parameters for stratum i
// Let D be an array of all distinct values from the range of f1()
// Let Σ′ = 〈Σ′1, · · · , Σ′L〉 be a vector of moments of all strata
for (int i = 1; i <= L; i++) {
    p ← Dirichlet(Θi)
    µ1 = µ2 = 0
    // Let V be an array of counts for each domain value
    V ← Multinomial(cnti, p)
    for (int j = 1; j <= |D|; j++) {
        µ1 += V[j] ∗ D[j]
        µ2 += V[j] ∗ (D[j])²
    }
    µ1 /= cnti
    µ2 /= cnti
    Σ′i = (µ1, µ2)
}
return Σ′
————————————————————————————
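A Python sketch of GetMoments (names ours; the Dirichlet draw is produced by normalizing independent Gamma variates, a standard construction, since the standard library has no Dirichlet sampler):

```python
import random

def get_moments(thetas, D, counts, rng=random):
    """thetas[i]: Dirichlet parameter vector for stratum i
    D:          the distinct f1() values
    counts[i]:  number of accepted tuples in stratum i (from GetCounts)
    Returns one (mu1, mu2) pair per stratum."""
    moments = []
    for theta, cnt in zip(thetas, counts):
        if cnt == 0:
            # no accepted tuples: both moments are trivially zero
            moments.append((0.0, 0.0))
            continue
        # Dirichlet sample = normalized vector of Gamma(theta_j, 1) draws
        g = [rng.gammavariate(t, 1.0) for t in theta]
        total = sum(g)
        p = [x / total for x in g]
        # multinomial trial: assign each of the cnt tuples a value from D
        values = rng.choices(D, weights=p, k=cnt)
        mu1 = sum(values) / cnt
        mu2 = sum(v * v for v in values) / cnt
        moments.append((mu1, mu2))
    return moments
```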
5.4.4 Combining The Two
Once a sample from Xcnt and from XΣ′ have been obtained, it is then a simple matter
to put them together to obtain a sample from XΣ. Recall that the variance of a random
variable X is defined as follows:

    σ²[X] = E[X²] − E²[X]

where E[·] denotes the expected value of the random variable. For the ith stratum, after
sampling from Xcnt and XΣ′ we know three things:

1. The size of the stratum, Ni.

2. The number of records accepted by f2(), which is cnti.

3. The first and second moments (µ1, µ2)i of f1() applied to those tuples that were
accepted by f2().
Thus, the variance σ²i of f() applied to all tuples in the ith stratum can be computed as:

    σ²i = (cnti/Ni) × µ2 + ((Ni − cnti)/Ni) × 0 − ( (cnti/Ni) × µ1 + ((Ni − cnti)/Ni) × 0 )²

        = (cnti/Ni) × µ2 − ( (cnti/Ni) × µ1 )²        (5–5)

The two zeros in the above derivation come from the fact that both the first moment (or
mean) and the second moment of f() over every tuple not accepted by f2() are zero.
This computation is repeated for each possible i in order to obtain the desired sample
from XΣ.
The algorithm GetSigma describes how the variances can be computed using the
above technique.
————————————————————————————
Algorithm GetSigma(cnt, Σ′, N)
// Let cnt = 〈cnt1, · · · , cntL〉 be a vector of counts of records
//     accepted by f2() for all strata
// Let Σ′ = 〈Σ′1, · · · , Σ′L〉 be a vector of moments of all strata
// Let N = 〈N1, N2, · · · , NL〉 be a vector of all strata sizes
// Let Σ be a vector of variances for all strata
for (int i = 1; i <= L; i++) {
    σ²i = (cnti ∗ Σ′i.µ2)/N[i] − (cnti ∗ Σ′i.µ1/N[i])²
    Σ[i] = σ²i
}
return Σ
————————————————————————————
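Equation 5–5 can be sanity-checked against the definition of variance directly. In the sketch below (names ours), a stratum of 4 records in which 2 match with f() values 1 and 3 (so µ1 = 2 and µ2 = 5) and 2 contribute zero has E[f²] − E²[f] = 10/4 − 1² = 1.5, which is exactly what the formula yields:

```python
def get_sigma(counts, moments, strata_sizes):
    """Equation 5-5: the variance of f() over a whole stratum, when
    every record rejected by f2() contributes zero."""
    sigmas = []
    for cnt, (mu1, mu2), N_i in zip(counts, moments, strata_sizes):
        sigmas.append(cnt / N_i * mu2 - (cnt / N_i * mu1) ** 2)
    return sigmas
```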
5.4.5 Limiting the Number of Domain Values
As mentioned in Section 5.4.3, the one remaining problem regarding how to sample
from XΣ′ is the problem of having a very large (or even unknown) range for the function
f1(). In this case, dealing with the vectors D and V may be impossible, for both storage
and computational reasons.
The simple solution to this problem is to break the range of f1() into a number of
buckets and make use of a histogram over the range, rather than using the range itself. In
this case, D is generalized to be an array of histogram buckets, where each entry in D has
summary information for a group of distinct f1() values. Each entry in D has the following
four specific pieces of information:
1. low and high, which are the lower and upper bounds for the f1() values that are
found in this particular bucket.

2. µ1, which is the mean of the f1() values that are found in this particular bucket.
That is, if A is the set of distinct values from low to high, then µ1 = (∑_{a∈A} a) / |A|.

3. µ2, which is the second moment of the f1() values that are found in this particular
bucket. That is, µ2 = (∑_{a∈A} a²) / |A|.
Given |D|, there are two possible ways to construct the histogram. In the case where
the queries that will be asked request a simple sum over one of the attributes from the
underlying relation R (that is, f1() does not encode any function other than a simple
relational projection), then it is possible to construct D offline by using any histogram
construction scheme [42, 45, 72] over the attribute that is to be queried. In the case that
multiple attributes might be queried, one histogram can be constructed for each attribute.
This is the method that we test experimentally.
Another appropriate method is to construct D on-the-fly by making use of the pilot
sample that is used to compute the sampling plan. This has the advantage that any
arbitrary f1() can be handled at run time. Again, any appropriate histogram construction
scheme can be used, but rather than constructing D offline using the entire relation R, f1()
is applied to each r ∈ ⋃_i R^pilot_i (whether or not r is accepted by f2()) and the histogram is
constructed over the resulting set of distinct values.
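As an illustration of the offline option, a simple equi-width histogram recording low, high, and the two per-bucket moments might be built as follows. This is an assumed sketch with names of our choosing, not one of the histogram construction schemes of [42, 45, 72]:

```python
def build_histogram(values, num_buckets):
    """Equi-width histogram over the distinct f1() values; each bucket
    stores low, high, and the first two moments of its distinct values."""
    distinct = sorted(set(values))
    lo, hi = distinct[0], distinct[-1]
    width = (hi - lo) / num_buckets or 1  # guard against zero width
    buckets = [[] for _ in range(num_buckets)]
    for v in distinct:
        j = min(int((v - lo) / width), num_buckets - 1)
        buckets[j].append(v)
    out = []
    for j, A in enumerate(buckets):
        if A:  # drop empty buckets
            out.append({"low": lo + j * width,
                        "high": lo + (j + 1) * width,
                        "mu1": sum(A) / len(A),
                        "mu2": sum(a * a for a in A) / len(A)})
    return out
```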
Whatever method is used to construct D, the function GetMoments from Section
5.4.3 must be modified so as to handle the modified D. The following is an appropriately
modified GetMoments - we call it GetMomentsFromHist.
————————————————————————————
Algorithm GetMomentsFromHist(Θ1, · · · , ΘL, D)
// Let Θi denote the vector of Dirichlet parameters for stratum i
// Let D be an array of histogram buckets
// Let Σ′ = 〈Σ′1, · · · , Σ′L〉 be a vector of moments of all strata
for (int i = 1; i <= L; i++) {
    p ← Dirichlet(Θi)
    µ1 = µ2 = 0
    // Let V be an array of counts for each bucket
    V ← Multinomial(cnti, p)
    for (int j = 1; j <= |D|; j++) {
        µ1 += V[j] ∗ D[j].µ1
        µ2 += V[j] ∗ D[j].µ2
    }
    µ1 /= cnti
    µ2 /= cnti
    Σ′i = (µ1, µ2)
}
return Σ′
————————————————————————————
5.5 Updating Priors Using The Pilot
In Section 5.4, we described how we assign initial values to the parameters of the two
prior distributions – the Beta and the Dirichlet distributions. In this section, we explain
how these initial values can be refined by using information from a pilot sample to obtain
corresponding posterior distributions. Updating these priors using the pilot sample in the
proposed Bayes-Neyman approach is analogous to using the pilot sample to estimate the
stratum variances in the classic Neyman allocation. The update rules described in this
section are fairly straightforward applications of the standard Bayesian update rules [14].
The Beta distribution has two parameters α and β. Let Rpilot denote the pilot sample
and let s denote the number of records that are accepted by the predicate f2(). Thus,
|Rpilot| − s will be the number of records that fail to be accepted by f2().
Then, the following update rules can be used to directly update the α and β
parameters of the Beta distribution:
α = α + s
β = β + (|Rpilot| − s)
The Dirichlet distribution is updated similarly. Recall that this distribution uses a
vector of parameters, Θ = {Θ1, Θ2, · · · , Θk}, where k is the number of dimensions.
To update the parameter vector Θ, we can use the same pilot sample that was used
to update the beta as follows. We initialize to zero all elements of an array count of size k.
These elements denote counts of the number of times that the different values from the
range of f1() appear in the pilot sample and are accepted by f2().
The following update rule can be used to update all the different parameters of the
Dirichlet distribution:
Θi = Θi + counti
Algorithm UpdatePriors describes exactly how pilot sampling is used to update the
parameters of the prior Beta and Dirichlet distributions for the ith stratum.
———————————————————————————————-
Algorithm UpdatePriors(α, β, Θ, D, Rpilot)
// Let α, β be the parameters of the beta distribution for the
//     stratum to be updated
// Let Θ = 〈Θ1, · · · , Θ|D|〉 be the parameters of the Dirichlet
//     distribution for the stratum
// Let D be an array of histogram buckets for the stratum
// Let Rpilot be a pilot sample from the stratum
// Let count be an array of counts for each histogram bucket
for (int j = 1; j <= |D|; j++)
    count[j] = 0
s = 0
for (int r = 1; r <= |Rpilot|; r++) {
    rec = Rpilot[r]
    if (f2(rec)) {
        s++
        val = f1(rec)
        pos = FindPositionInArray(D, val)
        count[pos]++
    }
}
α = α + s
β = β + (|Rpilot| − s)
for (int j = 1; j <= |D|; j++)
    Θj = Θj + count[j]
———————————————————————————————-
——————————————————————————–
Algorithm FindPositionInArray(D, val)
// Let D be an array of histogram buckets
// Let val be a scalar value
for (int j = 1; j <= |D|; j++)
    if (D[j].low ≤ val && val ≤ D[j].high)
        return j
——————————————————————————–
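The two update rules, together with the bucket search, amount to only a few lines; a Python sketch (the names and the (low, high) bucket representation are ours):

```python
def update_priors(alpha, beta, theta, D, pilot, f1, f2):
    """Update the Beta parameters with the pilot's accept/reject counts,
    and the Dirichlet parameters with the per-bucket hit counts.
    D is a list of (low, high) bucket bounds; theta has one entry per
    bucket.  Returns the updated (alpha, beta, theta)."""
    s = 0
    count = [0] * len(D)
    for rec in pilot:
        if f2(rec):
            s += 1
            val = f1(rec)
            for j, (low, high) in enumerate(D):
                if low <= val <= high:   # FindPositionInArray
                    count[j] += 1
                    break
    alpha += s
    beta += len(pilot) - s
    theta = [t + c for t, c in zip(theta, count)]
    return alpha, beta, theta
```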
5.6 Putting It All Together
In this section, we consider how the random variable XΣ can be used to produce
an alternative allocation to the classical Neyman, and give the complete algorithm for
computing our allocation.
5.6.1 Minimizing the Variance
In general, the goal of any sampling plan should be to minimize the variance σ2 of
the resulting stratified sampling estimator. The formula for σ2 in the classic allocation
problem is given as Equation 5–2 of the thesis. Our situation differs from the classic setup
only in that (in Bayesian fashion) we now use XΣ to implicitly define a distribution over
the per-stratum variance values 〈σ²1, σ²2, · · · , σ²L〉. Thus, we cannot minimize σ² directly
because under the Bayesian regime, σ² is now a random variable.
Instead, it makes sense to minimize the expected value or average of σ², which (using
Equation 5–2) can be computed as:

    E[σ²] = E[ ∑_{i=1}^{L} ( Ni(Ni − ni) / ni ) σ²i ]

Using the linearity of expectation, we have:

    E[σ²] = ∑_{i=1}^{L} ( Ni(Ni − ni) / ni ) E[σ²i]
All of the machinery from the last two sections allows us to sample possible
variance vectors from XΣ. Assume that we sample v of these vectors, where v is a suitably
large number, and the samples are denoted by Σ1, Σ2, ..., Σv. Then ∑_{j=1}^{v} (1/v) Σj.σ²i is an
unbiased estimate of E[σ²i]. Plugging this estimate into the previous equation, we have:

    E[σ²] ≈ ∑_{i=1}^{L} ( Ni(Ni − ni) / ni ) ∑_{j=1}^{v} (1/v) Σj.σ²i
We now wish to minimize this value subject to the constraint that ∑ ni = n. Notice
that the resulting optimization problem has exactly the same structure as the optimization
problem solved by the Neyman allocation, with the exception that σ²i has been replaced by
∑_{j=1}^{v} (1/v) Σj.σ²i. Thus the resulting optimal solution is nearly identical, with σ²i being
replaced as appropriate:

    ni = ( n / ∑_{k=1}^{L} ∑_{j=1}^{v} (Nk/v) Σj.σ²k ) × ∑_{j=1}^{v} (Ni/v) Σj.σ²i
5.6.2 Computing the Final Sampling Allocation
Algorithm GetBayesNeymanAllocation describes exactly how an optimal sampling
allocation can be obtained using our technique.
————————————————————————————
Algorithm GetBayesNeymanAllocation(α, β, Θ, D, Rpilot, N, n, v)
// Let α = 〈α1, α2, · · · , αL〉 be the parameters of the beta
//     distributions of all strata
// Let β = 〈β1, β2, · · · , βL〉 be the parameters of the beta
//     distributions of all strata
// Let Θ = 〈Θ1, · · · , ΘL〉 be the set of parameters of the
//     Dirichlet distributions of all strata
// Let D = 〈D1, D2, · · · , DL〉 be the arrays of histogram
//     buckets for all strata
// Let Rpilot = 〈Rpilot1, Rpilot2, · · · , RpilotL〉 be the pilot samples
//     from all strata
// Let v be the total number of iterations of re-sampling
for (int j = 1; j <= L; j++)
    UpdatePriors(αj, βj, Θj, Dj, Rpilotj)
// Let cnt = 〈cnt1, cnt2, · · · , cntL〉 be a vector of counts for
//     all strata
// Let Σ′ = 〈Σ′1, Σ′2, · · · , Σ′L〉 be a vector of moments for all strata
// Let Σ and Σtemp be vectors of variances of size L, with Σ
//     initialized to all zeros
for (int i = 1; i <= v; i++) {
    cnt = GetCounts(α, β, N)
    Σ′ = GetMomentsFromHist(Θ1, · · · , ΘL, D)
    Σtemp = GetSigma(cnt, Σ′, N)
    for (int j = 1; j <= L; j++)
        Σ[j] += Σtemp[j]
}
denom = 0
for (int j = 1; j <= L; j++) {
    Σ[j] /= v
    denom += N[j] ∗ Σ[j]
}
for (int j = 1; j <= L; j++)
    nj = (n ∗ Nj ∗ Σ[j])/denom
————————————————————————————
5.7 Experiments
5.7.1 Goals
The specific goals of our experimental evaluation are as follows:
• To compare the width of the confidence bounds produced using both the classic
Neyman allocation and the proposed Bayes-Neyman allocation in realistic scenarios, in
order to see which can produce tighter bounds.

• To test the reliability of the confidence bounds produced by the two methods. That
is, we wish to ask: if bounds are reported to the user as p% bounds, is the chance
that they contain the answer actually p%?

• Third, we wish to compare both methods against simple random sampling as a sanity
check to see if there is a significant improvement in bound width.

• Finally, we wish to compare the computation time required for the two estimators.
5.7.2 Experimental Setup
Data Sets Used. We use three different data sets in our experimental evaluation:
• The first is a synthetic data set called the GMM data set, and is produced using a
Gaussian (normal) mixture model. The GMM data set has three numerical and three
categorical attributes. Since the underlying normal variables only produce numerical
data, the three categorical attributes (having seven possible values each) are
produced by mapping the ranges of three of the dimensions to discrete values. This
data set has 5 million records.

• The second is the Person data set. This is a 13-attribute, real-life data set obtained
from the 1990 Census that contains family and income information. This data set is
publicly available [2] and has a single relation with over 9.5 million records. The data
has twelve numerical attributes and one categorical attribute with 29 categories.

• The third is the KDD data set, which is the data set from the 1999 KDD Cup event.
This data set has 42 attributes with status information regarding various network
connections for intrusion detection. This data set consists of around 5 million records
with integer, real-valued, as well as categorical attributes.
Queries Tested. For each data set, we test queries of the form:
SELECT SUM (f1(r))
FROM R AS r
WHERE f2(r)
f1() and f2() vary depending upon the data set. For the GMM data set, f1() projects
one of the three different numerical attributes (each query projects a random attribute).
For the Person data set, either the TotalIncome attribute or the WageIncome attribute are
projected by each query. For the KDD data set, either the src bytes or the dst bytes
attributes are projected.
For each of the data sets, three different classes of selection predicates encoded by f2()
are used. Each class has a different selectivity. The three selectivity classes for f2() have
selectivities of (0.01%± 0.001%), (0.1%± 0.01%), and (1.0%± 0.1%), respectively.
For the GMM data set, f2() is constructed by rolling a three-faced die to decide how
many attributes will be included in the conjunction computed by f2(). The appropriate
number of attributes are then randomly selected from among the six GMM attributes. If
a categorical attribute is chosen as one of the attributes in f2(), then the attribute will be
checked with either an equality or inequality condition over a randomly-selected domain
value. If a numerical attribute is chosen, then a range predicate is constructed. For a given
numerical attribute, assume that low and high are the known minimum and maximum
attribute values. The range is constructed using qlow = low + v1 × (high − low) and
qhigh = qlow + v2 × (high − qlow) where v1 and v2 are randomly chosen real values from
the range [0, 1]. For each selectivity class, 50 different queries are generated by repeating
the query-generation process until enough queries falling within the appropriate selectivity
range have been generated.
The f2() functions for the other two data sets are constructed similarly.
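For illustration, the range-predicate construction described above might be sketched as follows. This is a minimal sketch, not code from the dissertation; the function names are ours, and the acceptance test simply checks whether a measured selectivity falls within a target class such as (0.1% ± 0.01%).

```python
import random

def make_range_predicate(low, high):
    """Construct a random range [qlow, qhigh] over a numerical attribute
    with known minimum `low` and maximum `high`, using
    qlow = low + v1 * (high - low) and qhigh = qlow + v2 * (high - qlow),
    where v1 and v2 are drawn uniformly from [0, 1]."""
    v1, v2 = random.random(), random.random()
    qlow = low + v1 * (high - low)
    qhigh = qlow + v2 * (high - qlow)
    return qlow, qhigh

def accept_query(selectivity, target, tol):
    """Keep a generated query only if its measured selectivity falls
    within the target selectivity class (e.g., 0.001 +/- 0.0001)."""
    return abs(selectivity - target) <= tol
```

By construction, low <= qlow <= qhigh <= high always holds, so every generated range is a valid sub-interval of the attribute's domain; queries are generated repeatedly and only those passing the acceptance test are kept.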
Stratification Tested. For each of the various data sets, a simple nearest-neighbor
classification algorithm is used to perform the stratification. In order to partition a data
set into L strata, L records are first chosen randomly from the data to serve as “seeds” for
the strata, and every other record is added to the stratum whose seed is closest
to the data point. For numerical attributes, the L2 norm is used as the distance function.
For categorical attributes, we compute the distance using the support from the database
for the attribute values [36]. Since each data set has both numerical and categorical data,
the actual distance function used is the sum of the two “sub” distance functions. Note
that it would be possible to use a much more sophisticated stratification, but actually
Sample   Sel    Bandwidth                  Coverage
Size     (%)    GMM / Person / KDD         GMM / Person / KDD
50K      0.01   3.277 / 2.289 / 2.140      918 / 892 / 921
         0.1    1.776 / 0.514 / 1.520      926 / 912 / 988
         1      0.587 / 0.184 / 0.210      947 / 944 / 942
100K     0.01   2.626 / 2.108 / 1.48       922 / 941 / 937
         0.1    1.273 / 0.351 / 0.910      939 / 948 / 940
         1      0.415 / 0.128 / 0.120      948 / 952 / 946
500K     0.01   2.192 / 1.740 / 0.820      923 / 943 / 940
         0.1    0.551 / 0.132 / 0.630      946 / 947 / 942
         1      0.178 / 0.087 / 0.070      946 / 947 / 948
Table 5-1. Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage (for 1000 query runs) for a Simple Random Sampling estimator for the three data sets. Results are shown for varying sample sizes and for three different query selectivities: 0.01%, 0.1% and 1%.
performing the stratification is not the point of this thesis – our goal is to study how to
best use the stratification.
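The seed-based partitioning just described can be sketched as follows. This toy version handles only the numerical (L2) part of the distance; the support-based categorical distance of [36] would be added as a second term when categorical attributes are present.

```python
import math
import random

def stratify(records, L, seed=0):
    """Partition purely numerical records into L strata: pick L random
    records as seeds, then assign every record to the stratum whose
    seed is nearest under the L2 norm."""
    rng = random.Random(seed)
    seeds = rng.sample(records, L)
    strata = [[] for _ in range(L)]
    for r in records:
        dists = [math.dist(r, s) for s in seeds]
        # each record joins the stratum of its closest seed
        strata[dists.index(min(dists))].append(r)
    return strata
```

Every record lands in exactly one stratum, so the strata form a true partition of the data; a more careful implementation would also iterate or re-seed, but as the text notes, the quality of the stratification itself is not the focus here.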
In our experiments, we test L = 1, L = 20, and L = 200. Note that if L = 1
then there is actually no stratification performed, and so this case is equivalent to simple
random sampling without replacement and will serve as a sanity check in our experiments.
Tests Run. For the Neyman allocation and our Bayes-Neyman allocation, our test suite consists of 54 different test cases for each data set, plus nine more tests using L = 1. These test cases are obtained by assigning three different values to the following four parameters:
• Number of strata – We use L = 1, L = 20, L = 200; as described above, L = 1 is also equivalent to simple random sampling without replacement.
• Pilot sample size – This is the number of records we obtain from each stratum in order to perform the allocation. We choose values of 5, 20 and 100 records.
• Sample Size – This is the total sample size that has to be allocated. We use 50,000, 100,000 and 500,000 samples in our tests.
• Query Selectivity – As described above, we test query selectivities of 0.01%, 0.1% and 1%.
                    Average Running Time (sec.)
Data Set            Neyman    Bayes-Neyman
Gaussian Mixture    1.5       2.4
Person              2.3       3.1
KDD Cup             2.1       2.8
Table 5-2. Average running time of Neyman and Bayes-Neyman estimators over threereal-world datasets.
Each of the 50 queries for each (data set, selectivity) combination is re-run 20
times using 20 different (pilot sample, sample) combinations. Thus, for each (data set,
selectivity) combination we obtain results for 1000 query runs in all.
5.7.3 Results
Table 5-1 shows the results for the nine cases where L = 1; that is, where no
stratification is performed. We report two numbers: the bandwidth and the coverage. The
bandwidth is the ratio of the width of the 95% confidence bounds computed as the result
of using the allocation to the true query answer. The coverage is the number of times out
of the 1000 trials that the true answer is actually contained in the 95% confidence bounds
reported by the estimator. Naturally, one would expect this number to be close to 950 if
the bounds are in fact reliable. Tables 5-3 and 5-4 show the results for the 54 different test
cases where a stratification is actually performed. For each of the 54 test cases and both
of the sampling plans used (the Neyman allocation and the Bayes-Neyman allocation) we
again report the bandwidth and the coverage.
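Concretely, the two reported quantities can be computed from the per-trial confidence bounds as in the following sketch, written under the assumption that the reported bandwidth averages the interval width over the 1000 trials:

```python
def bandwidth_and_coverage(bounds, true_answer):
    """Given a list of (lo, hi) 95% confidence intervals from repeated
    query runs, return the bandwidth (mean interval width divided by
    the true query answer) and the coverage (number of trials whose
    interval contains the true answer)."""
    mean_width = sum(hi - lo for lo, hi in bounds) / len(bounds)
    bandwidth = mean_width / true_answer
    coverage = sum(1 for lo, hi in bounds if lo <= true_answer <= hi)
    return bandwidth, coverage
```

For reliable 95% bounds over 1000 trials, the coverage should come out close to 950, which is exactly the yardstick used in the discussion below.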
Finally, Table 5-2 shows the average running times for the two stratified
sampling estimators on all the three data sets. There is generally around a 50% hit in
terms of running time when using the Bayes-Neyman allocation compared to the Neyman
allocation.
5.7.4 Discussion
There are quite a large number of results presented, and discussing all of the
intricacies present in all of our findings is beyond the scope of the thesis. However,
taken as a whole, our experiments clearly show two things. First, for the type of selective
                         Bandwidth (GMM / Person / KDD)               Coverage (GMM / Person / KDD)
NS   PS   SS     Sel     Neyman               Bayes-Neyman            Neyman              Bayes-Neyman
20   5    50K    0.01    0.00 / 0.00 / 0.00   2.90 / 0.19 / 1.12      0 / 0 / 0           935 / 882 / 927
                 0.1     0.03 / 0.01 / 0.02   1.27 / 0.02 / 0.80      3 / 49 / 23         929 / 939 / 938
                 1       0.05 / 0.02 / 0.14   0.39 / 0.01 / 0.09      11 / 247 / 155      940 / 950 / 945
          100K   0.01    0.00 / 0.00 / 0.00   2.77 / 0.16 / 1.08      0 / 0 / 0           936 / 961 / 930
                 0.1     0.02 / 0.01 / 0.01   0.90 / 0.02 / 0.73      3 / 53 / 28         941 / 941 / 938
                 1       0.05 / 0.01 / 0.03   0.28 / 0.01 / 0.08      24 / 306 / 170      941 / 947 / 947
          500K   0.01    0.01 / 0.00 / 0.00   2.05 / 0.06 / 0.87      3 / 0 / 4           938 / 948 / 932
                 0.1     0.01 / 0.00 / 0.01   0.37 / 0.01 / 0.55      10 / 62 / 51        954 / 954 / 941
                 1       0.03 / 0.01 / 0.02   0.12 / 0.00 / 0.04      38 / 316 / 184      957 / 955 / 945
     20   50K    0.01    0.06 / 0.00 / 0.04   2.72 / 0.22 / 1.06      14 / 0 / 5          942 / 941 / 938
                 0.1     0.17 / 0.03 / 0.09   1.21 / 0.03 / 0.81      106 / 61 / 88       908 / 938 / 944
                 1       0.21 / 0.05 / 0.27   0.34 / 0.01 / 0.09      404 / 692 / 561     948 / 948 / 947
          100K   0.01    0.01 / 0.00 / 0.01   2.58 / 0.16 / 0.91      23 / 0 / 6          941 / 937 / 941
                 0.1     0.11 / 0.02 / 0.06   0.85 / 0.02 / 0.74      165 / 66 / 107      934 / 954 / 939
                 1       0.14 / 0.03 / 0.09   0.25 / 0.01 / 0.06      431 / 728 / 612     954 / 962 / 953
          500K   0.01    0.01 / 0.00 / 0.01   1.93 / 0.07 / 0.62      30 / 0 / 21         946 / 943 / 944
                 0.1     0.01 / 0.01 / 0.01   0.34 / 0.01 / 0.51      230 / 145 / 245     942 / 952 / 945
                 1       0.04 / 0.01 / 0.03   0.09 / 0.00 / 0.02      447 / 751 / 746     943 / 961 / 950
     100  50K    0.01    0.15 / 0.04 / 0.08   2.33 / 0.19 / 0.82      24 / 58 / 20        938 / 922 / 938
                 0.1     0.26 / 0.10 / 0.16   1.09 / 0.02 / 0.58      436 / 204 / 172     929 / 949 / 942
                 1       0.47 / 0.18 / 0.34   0.32 / 0.01 / 0.05      870 / 891 / 866     932 / 962 / 951
          100K   0.01    0.12 / 0.03 / 0.06   2.26 / 0.16 / 0.57      29 / 59 / 41        935 / 945 / 940
                 0.1     0.18 / 0.05 / 0.11   0.81 / 0.02 / 0.40      435 / 249 / 355     927 / 957 / 942
                 1       0.31 / 0.08 / 0.02   0.22 / 0.01 / 0.04      895 / 928 / 914     948 / 968 / 943
          500K   0.01    0.01 / 0.01 / 0.01   1.72 / 0.07 / 0.33      45 / 66 / 50        939 / 952 / 947
                 0.1     0.06 / 0.02 / 0.04   0.31 / 0.01 / 0.28      474 / 297 / 412     954 / 954 / 952
                 1       0.06 / 0.02 / 0.06   0.08 / 0.00 / 0.02      926 / 935 / 942     950 / 970 / 949
Table 5-3. Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage (for 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator for the three data sets. Results are shown for 20 strata (NS) with varying numbers of records in the pilot sample per stratum (PS) and sample sizes (SS), for three different query selectivities: 0.01%, 0.1% and 1%.
                         Bandwidth (GMM / Person / KDD)               Coverage (GMM / Person / KDD)
NS   PS   SS     Sel     Neyman               Bayes-Neyman            Neyman              Bayes-Neyman
200  5    50K    0.01    0.00 / 0.00 / 0.00   1.73 / 0.18 / 0.91      0 / 0 / 0           933 / 931 / 924
                 0.1     0.00 / 0.02 / 0.01   0.97 / 0.02 / 0.76      0 / 56 / 27         933 / 953 / 936
                 1       0.05 / 0.02 / 0.03   0.26 / 0.01 / 0.09      19 / 162 / 149      940 / 960 / 940
          100K   0.01    0.00 / 0.01 / 0.01   1.57 / 0.13 / 0.75      0 / 43 / 28         936 / 916 / 930
                 0.1     0.01 / 0.01 / 0.01   0.72 / 0.02 / 0.64      7 / 60 / 41         938 / 958 / 936
                 1       0.03 / 0.01 / 0.01   0.19 / 0.00 / 0.08      34 / 365 / 212      945 / 955 / 947
          500K   0.01    0.01 / 0.00 / 0.00   1.20 / 0.08 / 0.52      5 / 45 / 34         940 / 939 / 938
                 0.1     0.02 / 0.01 / 0.00   0.28 / 0.01 / 0.44      22 / 89 / 76        946 / 946 / 944
                 1       0.02 / 0.01 / 0.01   0.07 / 0.00 / 0.06      45 / 372 / 336      954 / 954 / 951
     20   50K    0.01    0.05 / 0.03 / 0.04   1.59 / 0.18 / 0.85      19 / 51 / 21        943 / 931 / 934
                 0.1     0.11 / 0.03 / 0.07   0.75 / 0.02 / 0.72      91 / 70 / 94        943 / 953 / 939
                 1       0.09 / 0.04 / 0.09   0.18 / 0.01 / 0.07      345 / 627 / 580     958 / 962 / 945
          100K   0.01    0.01 / 0.01 / 0.03   1.35 / 0.14 / 0.67      22 / 66 / 45        948 / 948 / 941
                 0.1     0.02 / 0.02 / 0.04   0.54 / 0.01 / 0.54      131 / 135 / 128     935 / 955 / 949
                 1       0.05 / 0.02 / 0.05   0.12 / 0.00 / 0.06      488 / 702 / 643     945 / 955 / 952
          500K   0.01    0.01 / 0.00 / 0.01   1.04 / 0.06 / 0.42      49 / 83 / 72        941 / 954 / 947
                 0.1     0.01 / 0.00 / 0.02   0.20 / 0.00 / 0.35      210 / 209 / 282     955 / 945 / 950
                 1       0.04 / 0.01 / 0.01   0.03 / 0.00 / 0.03      617 / 830 / 869     948 / 958 / 953
     100  50K    0.01    0.08 / 0.03 / 0.06   1.35 / 0.14 / 0.54      28 / 56 / 39        939 / 938 / 939
                 0.1     0.20 / 0.05 / 0.09   0.56 / 0.02 / 0.40      313 / 357 / 243     949 / 949 / 942
                 1       0.10 / 0.01 / 0.15   0.14 / 0.01 / 0.03      543 / 823 / 874     948 / 948 / 951
          100K   0.01    0.07 / 0.02 / 0.04   1.11 / 0.12 / 0.39      47 / 77 / 53        938 / 935 / 947
                 0.1     0.08 / 0.03 / 0.06   0.40 / 0.01 / 0.28      533 / 456 / 427     948 / 948 / 951
                 1       0.06 / 0.06 / 0.08   0.09 / 0.01 / 0.02      918 / 912 / 930     959 / 956 / 952
          500K   0.01    0.01 / 0.00 / 0.02   0.89 / 0.05 / 0.21      63 / 91 / 104       946 / 936 / 937
                 0.1     0.02 / 0.01 / 0.02   0.10 / 0.00 / 0.13      580 / 540 / 607     945 / 945 / 948
                 1       0.04 / 0.03 / 0.05   0.01 / 0.00 / 0.01      936 / 920 / 941     960 / 953 / 950
Table 5-4. Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage (for 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator for the three data sets. Results are shown for 200 strata (NS) with varying numbers of records in the pilot sample per stratum (PS) and sample sizes (SS), for three different query selectivities: 0.01%, 0.1% and 1%.
queries we concentrate on in our work, the classic Neyman allocation is generally useless.
As expected, the allocation tends to ignore strata with relevant records, resulting in “95%
confidence bounds” that are generally accurate nowhere close to 95% of the time. Out of
162 different tests over the three data sets, the Neyman allocation produced confidence
bounds that had greater than 90% coverage only eleven times, even though 95% bounds
were specified. In 15 out of the 162 tests, the “95% confidence bounds” actually contained
the answer 0 out of 1000 times!
Second, the allocation produced by the proposed Bayes-Neyman tends to be
remarkably useful – that is, the bounds produced are both accurate and tight. In
only 7 of the 162 tests, the coverage of the bounds produced by the Bayes-Neyman
allocation was found to be less than 93%, and coverage was often remarkably close to 95%.
Furthermore, in the few cases where the classic Neyman bounds were actually worthwhile,
the Bayes-Neyman bounds were far superior in terms of having a tighter bandwidth.
Even if one looks only at the cases where the Neyman bounds were not ridiculous (where
“ridiculous” bounds are arbitrarily defined to be those that had a coverage of less than
20%), the Bayes-Neyman bounds were actually tighter than the Neyman bounds 35
out of 70 times. In other words, there were many cases where the Neyman allocation
produced bounds that had coverage rates of only around 20%, whereas the Bayes-Neyman
allocations produced bounds that were actually tighter, and still had coverage rates very
close to the user-specified 95%.
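To make the failure mode concrete, the classic Neyman allocation samples each stratum in proportion to N_h × S_h, where N_h is the stratum size and S_h is the stratum standard deviation estimated from the pilot sample. The following sketch (ours, not the dissertation's implementation) shows why a selective predicate breaks it: if the pilot sample from a stratum happens to contain no accepted records, the estimated S_h is zero and the stratum is starved of samples.

```python
def neyman_allocation(stratum_sizes, stratum_stddevs, n):
    """Classic Neyman allocation: allocate a total sample of size n
    across strata in proportion to N_h * S_h. A pilot-estimated
    stddev of zero (common under very selective predicates) drives a
    stratum's allocation to zero, even when that stratum actually
    holds the relevant records."""
    weights = [N * S for N, S in zip(stratum_sizes, stratum_stddevs)]
    total = sum(weights)
    return [round(n * w / total) for w in weights]
```

For example, two equally sized strata whose pilot-estimated standard deviations are 0.0 and 2.0 receive allocations of 0 and n respectively; the first stratum is ignored entirely, which is precisely the behavior that destroys coverage in Tables 5-3 and 5-4.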
There are a few other interesting findings. Not surprisingly, increasing the number
of strata generally gives tighter error bounds for fixed pilot and sample sizes because it
tends to increase the homogeneity of the records in each stratum. However, in practice
there is a cost associated with increasing the number of strata and so this cannot
be done arbitrarily. Specifically, more strata may translate to more I/Os required to
actually perform the sampling. One might typically store the records within a stratum in
randomized order on disk. Thus, to sample from a given stratum requires only a sequential
scan, but each additional stratum requires a random disk I/O. In addition, it is more
difficult and more costly to maintain a large number of strata.
We also find that by using a larger pilot sample, estimation accuracy generally
increases. This is intuitive since a larger pilot sample contains more information about
the stratum, thus helping to make a better sampling allocation plan and providing a more
accurate estimate. However, a large pilot sample incurs a greater cost to actually perform
the pilot sampling. Explicitly studying this trade-off is an interesting avenue for future
work.
Finally, we point out that even the rudimentary stratification that we tested in these
experiments is remarkably successful – if the correct sampling allocation is used. Consider
the case of a 500K record sample. For a query selectivity of 0.01%, only around 50 records
in the sample will be accepted by the selection predicate encoded in f2(). This is why the
bandwidth for the simple random sample estimator with no stratification (L = 1) is so
great: for the Person data set it is 1.74 and for the KDD data set it is 0.82. The bounds
are so wide that they are essentially useless. However, if the Bayes-Neyman allocation is
used over 200 strata and a pilot sample of size 100, the bandwidths shrink to 0.053 and
0.21, respectively. These are far tighter. In the case of the Person data set the bandwidth
shrinks by nearly two orders of magnitude. For the KDD data set the reduction is more
modest (a factor of four) due to the high dimensionality of the data, which tends to render
the stratification less effective. Still, this suggests that perhaps the real issue to consider
when stratifying in a database environment is not how to perform the stratification, but
how to use the stratification in an effective manner.
5.8 Related Work
Broadly speaking, it is possible to divide the related prior research into two categories
– those works from the statistics literature, and those from the data management
literature.
The idea of applying Bayesian and/or superpopulation (model-based) methods to
the allocation problem has a long history in statistics, and seems to have been studied
with particular intensity in the 1970s. Given the number of papers on this topic, it is
not feasible to reference all of them, though a small number are listed in the References
section of the thesis [32, 103, 104]. At a high level, the biggest difference between this
work and that prior work is the specificity of our work with respect to database queries.
Sampling from a database is very unique in that the distribution of values that are
aggregated is typically ill-suited to traditional parametric models. Due to the inclusion
of the selection predicate encoded by f2(), the distribution of the f1() values that are
aggregated tends to have a large “stovepipe” located at zero corresponding to those
records that are not accepted by f2(), with a more well-behaved distribution of values
located elsewhere corresponding to those f1() values for records that were accepted by
f2(). The Bayes-Neyman allocation scheme proposed in this thesis explicitly allows for
such a situation via its use of a two-stage model where first a certain number of records
are accepted by f2() (modeled via the random variable Xcnt) and then the f1() values for
those accepted records are produced (modeled by XΣ′). This is quite different from the
general-purpose methods described in the statistics literature, which typically attach a
well-behaved, standard distribution to the mean and/or variance of each stratum [32, 104].
Sampling for the answer to database queries has also been studied extensively
[63, 67, 96]. In particular, Chaudhuri and his co-authors have explicitly studied the
idea of stratification for approximating database queries [18–20]. However, there is a
key difference between that work and our own: these existing papers focus on how to
break the data into strata, and not on how to sample the strata in a robust fashion. In
that sense, our work is completely orthogonal to Chaudhuri et al.’s prior work and our
sampling plans could easily be used in conjunction with the workload-based stratifications
that their methods can construct.
5.9 Conclusion
In this chapter, we have considered the problem of stratification for developing robust
estimates for the answer to very selective aggregate queries. While the obvious problem
to consider when stratifying is how to break the data into subsets, the more significant
challenge may lie in developing a sampling plan at run time that actually uses the strata
in a robust fashion. We have shown that the traditional Neyman sampling allocation can
give disastrous results when it is used in conjunction with mildly to very selective queries.
We have developed a unique Bayesian method for developing robust sampling plans. Our
plans explicitly minimize the expected variance of the final estimator over the space of
possible strata variances. We have shown that even when the resulting allocation is used
with a very naive nearest-neighbor stratification, the increase in accuracy compared to
simple random sampling is considerable. Even more significant is the fact that for highly
selective queries, our sampling plans give results that are “safe” in the sense that the
associated confidence bounds have near perfect coverage.
CHAPTER 6
CONCLUSION
In this research work, we have studied and described the problem of efficient
answering of complex queries on large data warehouses. Our approach for addressing
the problem relies on approximation. We present sampling-based techniques which can
be used to compute very quickly, approximate answers along with error guarantees for
long-running queries. The first part of this study addresses the problem of efficiently
obtaining random samples of records satisfying arbitrary range selection predicates. The
second part of the study develops statistical, sampling-based estimators for the specific
class of queries that have a nested, correlated subquery. The problem addressed in this
work is actually a generalization of the important problem of estimating the number of
distinct values in a database. The third and final part of this study addresses the problem
of estimating the result to queries having highly selective predicates. Since a uniform
random sample is not likely to contain any records satisfying the selection predicate,
our approach uses stratified sampling and develops stratified sampling plans to correctly
identify high-density strata for arbitrary queries.
APPENDIX
EM ALGORITHM DERIVATION
Let Ye be the information about record e ∈ EMP that can be observed, i.e., v = f1(e)
and k′ = cnt(e, SALE′). Let Xe be the information about record e that includes Ye as well
as the relevant data that cannot be observed, i.e., k = cnt(e, SALE).
Then let:

    f(X_e = \langle Y_e, k \rangle \mid \Theta) = p_k \times \frac{e^{-(v - \mu_k)^2 / 2\sigma^2}}{\sigma\sqrt{2\pi}} \times h(k'; k)

Also, let:

    g(Y_e \mid \Theta) = \sum_{i=0}^{m} p_i \times \frac{e^{-(v - \mu_i)^2 / 2\sigma^2}}{\sigma\sqrt{2\pi}} \times h(k'; i)

We then compute the posterior probability that e belongs to class i as:

    p(i \mid \Theta, e) = \frac{f(X_e = \langle Y_e, i \rangle \mid \Theta)}{g(Y_e \mid \Theta)}
Then the logarithm of the expected probability that we would observe EMP′ and SALE′
is:
    E = \sum_{e \in EMP} E\left[ \log f(X_e = \langle Y_e, i \rangle \mid \Theta) \;\middle|\; p(i \mid \Theta', e) \right]

      = \sum_{e \in EMP} \sum_{i=0}^{m} p(i \mid \Theta', e) \times \log\left( p_i \times \frac{e^{-(f_1(e) - \mu_i)^2 / 2\sigma^2}}{\sigma\sqrt{2\pi}} \times h(k'; i) \right)

      = \sum_{e \in EMP} \sum_{i=0}^{m} p(i \mid \Theta', e) \times \left( \log(p_i) - \log(\sigma) - \log(\sqrt{2\pi}) - \frac{(f_1(e) - \mu_i)^2}{2\sigma^2} + \log(h(k'; i)) \right)
To find the unknown parameters μi, σ, and pi, we maximize E for the given set of
posterior probabilities at that step. We do this by taking partial derivatives of E with
respect to each of these parameters and setting the result to zero:
    \frac{\partial E}{\partial \mu_1} = \sum_{e \in EMP} p(1 \mid \Theta', e) \times \frac{f_1(e) - \mu_1}{\sigma^2}
Setting this expression to zero gives:
    \mu_1 = \frac{\sum_{e \in EMP} p(1 \mid \Theta', e) \times f_1(e)}{\sum_{e \in EMP} p(1 \mid \Theta', e)}
We can obtain μ2, …, μm in a similar manner.
By taking the partial derivative of E with respect to σ2 and setting it to zero, we get:

    \sigma^2 = \frac{\sum_{e \in EMP} \sum_{j=0}^{m} p(j \mid \Theta', e) (f_1(e) - \mu_j)^2}{\sum_{e \in EMP} \sum_{j=0}^{m} p(j \mid \Theta', e)}
Finally, to evaluate the pi's, we also consider the additional constraint that
\sum_{i=0}^{m} p_i = 1. We can find the values of the pi's that maximize E subject to this
constraint by using the method of Lagrange multipliers, obtaining:

    p_j = \frac{\sum_{e \in EMP} p(j \mid \Theta', e)}{\sum_{e \in EMP} \sum_{i=0}^{m} p(i \mid \Theta', e)}
This completes the derivation of the update rules given in Section 4.5.2.
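A compact implementation of the resulting update rules might look like the following sketch. This is our illustration, not code from the dissertation: h(k′, i) is passed in as a user-supplied function (its form comes from Chapter 4 and is not re-derived here), and the variable names mirror the derivation above.

```python
import math

def em_step(values, kprimes, mu, sigma2, p, h):
    """One EM iteration. values[e] holds f1(e), kprimes[e] holds
    cnt(e, SALE'); mu, sigma2, p are the current estimates Theta';
    h(kp, i) is the (assumed, user-supplied) likelihood of observing
    kp sampled sales for a record of class i. Classes are 0..m."""
    m1 = len(p)  # number of classes, m + 1
    # E-step: posterior p(i | Theta', e) = f(X_e | Theta') / g(Y_e | Theta')
    post = []
    for v, kp in zip(values, kprimes):
        f = [p[i] * math.exp(-(v - mu[i]) ** 2 / (2.0 * sigma2)) * h(kp, i)
             for i in range(m1)]
        g = sum(f)
        post.append([fi / g for fi in f])
    # M-step: update rules from the partial derivatives of E
    mu_new = [sum(pe[i] * v for pe, v in zip(post, values)) /
              sum(pe[i] for pe in post) for i in range(m1)]
    denom = sum(sum(pe) for pe in post)  # equals |EMP|: posteriors sum to 1
    sigma2_new = (sum(pe[i] * (v - mu_new[i]) ** 2
                      for pe, v in zip(post, values) for i in range(m1)) /
                  denom)
    p_new = [sum(pe[i] for pe in post) / len(values) for i in range(m1)]
    return mu_new, sigma2_new, p_new
```

Iterating em_step until the parameters stabilize yields the maximum-likelihood estimates; note that the denominator in the σ² update simplifies to |EMP| because each record's posteriors sum to one.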
REFERENCES
1. IMDB dataset. http://www.imdb.com
2. Person data set. http://usa.ipums.org/usa
3. Synoptic cloud report dataset.http://cdiac.ornl.gov/epubs/ndp/ndp026b/ndp026b.htm
4. Acharya, S., Gibbons, P.B., Poosala, V.: Congressional samples for approximateanswering of group-by queries. In: Tech. Report, Bell Laboratories, Murray Hill, NewJersey (1999)
5. Acharya, S., Gibbons, P.B., Poosala, V.: Congressional samples for approximateanswering of group-by queries. In: SIGMOD, pp. 487–498 (2000)
6. Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: Join synopses forapproximate query answering. In: SIGMOD, pp. 275–286 (1999)
7. Alon, N., Gibbons, P.B., Matias, Y., Szegedy, M.: Tracking join and self-join sizes inlimited storage. In: PODS, pp. 10–20 (1999)
8. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating thefrequency moments. In: STOC, pp. 20–29 (1996)
9. Antoshenkov, G.: Random sampling from pseudo-ranked b+ trees. In: VLDB, pp.375–382 (1992)
10. Babcock, B., Chaudhuri, S., Das, G.: Dynamic sample selection for approximatequery processing. In: SIGMOD, pp. 539–550 (2003)
11. Bradley, P.S., Fayyad, U.M., Reina, C.: Scaling clustering algorithms to largedatabases. In: KDD, pp. 9 – 15 (1998)
12. Brown, P.G., Haas, P.J.: Techniques for warehousing of sample data. In: ICDE, p. 6(2006)
13. Bunge, J., Fitzpatrick, M.: Estimating the number of species: A review. Journal ofthe American Statistical Association 88, 364–373 (1993)
14. Carlin, B., Louis, T.: Bayes and Empirical Bayes Methods for Data Analysis.Chapman and Hall (1996)
15. Chakrabarti, K., Garofalakis, M., Rastogi, R., Shim, K.: Approximate queryprocessing using wavelets. The VLDB Journal 10(2-3), 199–223 (2001)
16. Charikar, M., Chaudhuri, S., Motwani, R., Narasayya, V.: Towards estimation errorguarantees for distinct values. In: PODS, pp. 268–279 (2000)
157
158
17. Charikar, M., Chaudhuri, S., Motwani, R., Narasayya, V.: Towards estimation errorguarantees for distinct values. In: PODS, pp. 268–279 (2000)
18. Chaudhuri, S., Das, G., Datar, M., Motwani, R., Narasayya, V.R.: Overcominglimitations of sampling for aggregation queries. In: ICDE, pp. 534–542 (2001)
19. Chaudhuri, S., Das, G., Narasayya, V.: A robust, optimization-based approach forapproximate answering of aggregate queries. In: SIGMOD, pp. 295–306 (2001)
20. Chaudhuri, S., Das, G., Narasayya, V.: Optimized stratified sampling forapproximate query processing. ACM TODS, To Appear (2007)
21. Chaudhuri, S., Das, G., Srivastava, U.: Effective use of block-level sampling instatistics estimation. In: SIGMOD, pp. 287 – 298 (2004)
22. Chaudhuri, S., Motwani, R.: On sampling and relational operators. IEEE Data Eng.Bull. 22(4), 41–46 (1999)
23. Chaudhuri, S., Motwani, R., Narasayya, V.: Random sampling for histogramconstruction: how much is enough? SIGMOD Rec. 27(2), 436–447 (1998)
24. Chaudhuri, S., Motwani, R., Narasayya, V.: On random sampling over joins. In:SIGMOD, pp. 263–274 (1999)
25. Cochran, W.: Sampling Techniques. Wiley and Sons (1977)
26. Dempster, A., Laird, N., Rubin, D.: Maximum-likelihood from incomplete data viathe EM algorithm. J. Royal Statist. Soc. Ser. B. 39 (1977)
27. Diwan, A.A., Rane, S., Seshadri, S., Sudarshan, S.: Clustering techniques forminimizing external path length. In: VLDB, pp. 342–353 (1996)
28. Dobra, A.: Histograms revisited: when are histograms the best approximationmethod for aggregates over joins? In: PODS, pp. 228–237 (2005)
29. Dobra, A., Garofalakis, M., Gehrke, J., Rastogi, R.: Processing complex aggregatequeries over data streams. In: SIGMOD Conference, pp. 61–72 (2002)
30. Domingos, P.: Bayesian averaging of classifiers and the overfitting problem. In: 17thInternational Conf. on Machine Learning (2000)
31. Efron, B., Tibshirani, R.: An Introduction to the Bootstrap. Chapman & Hall/CRC(1998)
32. Ericson, W.A.: Optimum stratified sampling using prior information. JASA 60(311),750–771 (1965)
33. Evans, M., Hastings, N., Peacock, B.: Statistical Distributions. Wiley and Sons(2000)
159
34. Fan, C., Muller, M., , Rezucha, I.: Development of sampling plans by using sequential(item by item) selection techniques and digital computers. Journal of the AmericanStatistical Association 57, 387–402 (1962)
35. Ganguly, S., Gibbons, P., Matias, Y., Silberschatz, A.: Bifocal sampling forskew-resistant join size estimation. In: SIGMOD, pp. 271–281 (1996)
36. Ganti, V., Gehrke, J., Ramakrishnan, R.: Cactus: clustering categorical data usingsummaries. In: KDD, pp. 73–83 (1999)
37. Ganti, V., Lee, M.L., Ramakrishnan, R.: ICICLES:self-tuning samples forapproximate query answering. In: VLDB, pp. 176–187 (2000)
38. Garcia-Molina, H., Widom, J., Ullman, J.D.: Database System Implementation.Prentice-Hall, Inc. (1999)
39. Gelman, A., Carlin, J., Stern, H., Rubin, D.: Bayesian Data Analysis, SecondEdition. Chapman & Hall/CRC (2003)
40. Gibbons, P.B., Matias, Y.: New sampling-based summary statistics for improvingapproximate query answers. In: SIGMOD, pp. 331–342 (1998)
41. Gibbons, P.B., Matias, Y., Poosala, V.: Aqua project white paper. In: TechnicalReport, Bell Laboratories, Murray Hill, New Jersey, pp. 275–286 (1999)
42. Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.: Optimal and approximatecomputation of summary statistics for range aggregates. In: PODS (2001)
43. Goodman, L.: On the estimation of the number of classes in a population. Annals ofMathematical Statistics 20, 272–579 (1949)
44. Gray, J., Bosworth, A., Layman, A., Pirahesh, H.: Data cube: A relationalaggregation operator generalizing group-by, cross-tab, and sub-total. In: ICDE,pp. 152–159 (1996)
45. Guha, S., Koudas, N., Srivastava, D.: Fast algorithms for hierarchical rangehistogram construction. In: PODS, pp. 180–187 (2002)
46. Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: SIGMODConference, pp. 47–57 (1984)
47. Haas, P., Hellerstein, J.: Ripple joins for online aggregation. In: SIGMODConference, pp. 287–298 (1999)
48. Haas, P., Naughton, J., Seshadri, S., Stokes, L.: Sampling-based estimation of thenumber of distinct values of an attribute. In: 21st International Conference on VeryLarge Databases, pp. 311–322 (1995)
49. Haas, P., Naughton, J., Seshadri, S., Stokes, L.: Sampling-based estimation of thenumber of distinct values of an attribute. In: VLDB, pp. 311–322 (1995)
160
50. Haas, P., Stokes, L.: Estimating the number of classes in a finite population. Journalof the American Statistical Association 93, 1475–1487 (1998)
51. Haas, P.J.: Large-sample and deterministic confidence intervals for onlineaggregation. In: Statistical and Scientific Database Management, pp. 51–63 (1997)
52. Haas, P.J.: The need for speed: Speeding up DB2 using sampling. IDUG SolutionsJournal 10, 32–34 (2003)
53. Haas, P.J., Hellerstein, J.: Join algorithms for online aggregation. IBM ResearchReport RJ 10126 (1998)
54. Haas, P.J., Hellerstein, J.M.: Ripple joins for online aggregation. In: SIGMOD, pp.287 – 298 (1999)
55. Haas, P.J., Koenig, C.: A bi-level bernoulli scheme for database sampling. In:SIGMOD, pp. 275 – 286 (2004)
56. Haas, P.J., Naughton, J.F., Seshadri, S., Swami, A.N.: Fixed-precision estimation ofjoin selectivity. In: PODS, pp. 190–201 (1993)
57. Haas, P.J., Naughton, J.F., Seshadri, S., Swami, A.N.: Selectivity and cost estimationfor joins based on random sampling. J. Comput. Syst. Sci. 52(3), 550–569 (1996)
58. Haas, P.J., Naughton, J.F., Swami, A.N.: On the relative cost of sampling for joinselectivity estimation. In: PODS, pp. 14–24 (1994)
59. Haas, P.J., Swami, A.N.: Sequential sampling procedures for query size estimation.In: SIGMOD, pp. 341–350 (1992)
60. Hellerstein, J., Avnur, R., Chou, A., Hidber, C., Olston, C., Raman, V., Roth, T.,Haas, P.: Interactive data analysis: The cONTROL project. IEEE Computer 32(8),51–59 (1999)
61. Hellerstein, J., Haas, P., Wang, H.: Online aggregation. In: SIGMOD Conference, pp.171–182 (1997)
62. Hellerstein, J.M., Avnur, R., Chou, A., Hidber, C., Olston, C., Raman, V., Roth, T.,Haas, P.J.: Interactive data analysis: The control project. In: IEEE Computer 32(8),pp. 51 – 59 (1999)
63. Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: SIGMOD, pp.171–182 (1997)
64. Hou, W.C., Ozsoyoglu, G.: Statistical estimators for aggregate relational algebraqueries. ACM Trans. Database Syst. 16(4), 600–654 (1991)
65. Hou, W.C., Ozsoyoglu, G.: Processing time-constrained aggregate queries in case-db.ACM Trans. Database Syst. 18(2), 224–261 (1993)
161
66. Hou, W.C., Ozsoyoglu, G., Dogdu, E.: Error-constrained COUNT query evaluation inrelational databases. SIGMOD Rec. 20(2), 278–287 (1991)
67. Hou, W.C., Ozsoyoglu, G., Taneja, B.K.: Statistical estimators for relational algebraexpressions. In: PODS, pp. 276–287 (1988)
68. Hou, W.C., Ozsoyoglu, G., Taneja, B.K.: Processing aggregate relational queries withhard time constraints. In: SIGMOD, pp. 68–77 (1989)
69. Huang, H., Bi, L., Song, H., Lu, Y.: A variational em algorithm for large databases.In: International Conference on Machine Learning and Cybernetics, pp. 3048–3052(2005)
70. Ioannidis, Y.E.: Universality of serial histograms. In: VLDB, pp. 256–267 (1993)
71. Ioannidis, Y.E., Poosala, V.: Histogram-based approximation of set-valuedquery-answers. In: VLDB (1999)
72. Jagadish, H.V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K.C., Suel, T.:Optimal histograms with quality guarantees. In: VLDB, pp. 275–286 (1998)
73. Jermaine, C., Dobra, A., Arumugam, S., Joshi, S., Pol, A.: A disk-based join withprobabilistic guarantees. In: SIGMOD, pp. 563–574 (2005)
74. Jermaine, C., Dobra, A., Arumugam, S., Joshi, S., Pol, A.: The sort-merge-shrinkjoin. ACM Trans. Database Syst. 31(4), 1382–1416 (2006)
75. Jermaine, C., Dobra, A., Pol, A., Joshi, S.: Online estimation for subset-based SQLqueries. In: 31st International conference on Very large data bases, pp. 745–756(2005)
76. Jermaine, C., Pol, A., Arumugam, S.: Online maintenance of very large randomsamples. In: SIGMOD, pp. 299–310. ACM Press, New York, NY, USA (2004)
77. Kempe, D., Dobra, A., Gehrke, J.: Gossip-based computation of aggregateinformation. In: FOCS, pp. 482–491 (2003)
78. Krewski, D., Platek, R., Rao, J.: Current Topics in Survey Sampling. Academic Press(1981)
79. Lakshmanan, L.V.S., Pei, J., Han, J.: Quotient cube: How to summarize thesemantics of a data cube. In: VLDB, pp. 778–789 (2002)
80. Lakshmanan, L.V.S., Pei, J., Zhao, Y.: Qc-trees: An efficient summary structure forsemantic olap. In: SIGMOD, pp. 64–75 (2003)
81. Leutenegger, S.T., Edgington, J.M., Lopez, M.A.: STR: A simple and efficientalgorithm for r-tree packing. In: ICDE, pp. 497–506 (1997)
162
82. Ling, Y., Sun, W.: A supplement to sampling-based methods for query size estimation in a database system. SIGMOD Rec. 21(4), 12–15 (1992)

83. Lipton, R., Naughton, J.: Query size estimation by adaptive sampling. In: PODS, pp. 40–46 (1990)

84. Lipton, R., Naughton, J., Schneider, D.: Practical selectivity estimation through adaptive sampling. In: SIGMOD Conference, pp. 1–11 (1990)

85. Lipton, R.J., Naughton, J.F.: Estimating the size of generalized transitive closures. In: VLDB, pp. 165–171 (1989)

86. Lipton, R.J., Naughton, J.F.: Query size estimation by adaptive sampling. J. Comput. Syst. Sci. 51(1), 18–25 (1995)

87. Luo, G., Ellmann, C.J., Haas, P.J., Naughton, J.F.: A scalable hash ripple join algorithm. In: SIGMOD, pp. 252–262 (2002)

88. Matias, Y., Vitter, J., Wang, M.: Wavelet-based histograms for selectivity estimation. In: SIGMOD Conference, pp. 448–459 (1998)

89. Matias, Y., Vitter, J.S., Wang, M.: Wavelet-based histograms for selectivity estimation. SIGMOD Record 27(2), 448–459 (1998)

90. Mingoti, S.: Bayesian estimator for the total number of distinct species when quadrat sampling is used. Journal of Applied Statistics 26(4), 469–483 (1999)

91. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, New York (1995)

92. Muralikrishna, M., DeWitt, D.: Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In: SIGMOD Conference, pp. 28–36 (1988)

93. Muth, P., O'Neil, P.E., Pick, A., Weikum, G.: Design, implementation, and performance of the LHAM log-structured history data access method. In: VLDB, pp. 452–463 (1998)

94. Naughton, J.F., Seshadri, S.: On estimating the size of projections. In: ICDT: Proceedings of the Third International Conference on Database Theory, pp. 499–513 (1990)

95. Neal, R., Hinton, G.: A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Learning in Graphical Models (1998)

96. Olken, F.: Random sampling from databases. Ph.D. dissertation (1993)

97. Olken, F.: Random sampling from databases. Tech. Rep. LBL-32883, Lawrence Berkeley National Laboratory (1993)
98. Olken, F., Rotem, D.: Simple random sampling from relational databases. In: VLDB, pp. 160–169 (1986)

99. Olken, F., Rotem, D.: Random sampling from B+ trees. In: VLDB, pp. 269–277 (1989)

100. Olken, F., Rotem, D.: Sampling from spatial databases. In: ICDE, pp. 199–208 (1993)

101. Olken, F., Rotem, D., Xu, P.: Random sampling from hash files. In: SIGMOD, pp. 375–386 (1990)

102. Piatetsky-Shapiro, G., Connell, C.: Accurate estimation of the number of tuples satisfying a condition. In: SIGMOD, pp. 256–276 (1984)

103. Rao, T.J.: On the allocation of sample size in stratified sampling. Annals of the Institute of Statistical Mathematics 20, 159–166 (1968)

104. Rao, T.J.: Optimum allocation of sample size and prior distributions: A review. International Statistical Review 45(2), 173–179 (1977)

105. Roussopoulos, N., Kotidis, Y., Roussopoulos, M.: Cubetree: Organization of and bulk incremental updates on the data cube. In: SIGMOD, pp. 89–99 (1997)

106. Rowe, N.C.: Top-down statistical estimation on a database. SIGMOD Record 13(4), 135–145 (1983)

107. Rowe, N.C.: Antisampling for estimation: An overview. IEEE Trans. Softw. Eng. 11(10), 1081–1091 (1985)

108. Rusu, F., Dobra, A.: Statistical analysis of sketch estimators. In: SIGMOD, to appear (2007)

109. Sarndal, C., Swensson, B., Wretman, J.: Model Assisted Survey Sampling. Springer, New York (1992)

110. Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selection in a relational database management system. In: SIGMOD, pp. 23–34 (1979)

111. Severance, D.G., Lohman, G.M.: Differential files: Their application to the maintenance of large databases. ACM Trans. Database Syst. 1(3), 256–267 (1976)

112. Shao, J.: Mathematical Statistics. Springer-Verlag (1999)

113. Sismanis, Y., Deligiannakis, A., Roussopoulos, N., Kotidis, Y.: Dwarf: Shrinking the PetaCube. In: SIGMOD, pp. 464–475 (2002)

114. Sismanis, Y., Roussopoulos, N.: The polynomial complexity of fully materialized coalesced cubes. In: VLDB, pp. 540–551 (2004)
115. Stonebraker, M., Abadi, D.J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O'Neil, E., O'Neil, P., Rasin, A., Tran, N., Zdonik, S.: C-Store: A column-oriented DBMS. In: VLDB, pp. 553–564 (2005)

116. Thiesson, B., Meek, C., Heckerman, D.: Accelerating EM for large databases. Mach. Learn. 45(3), 279–299 (2001)

117. Thorup, M., Zhang, Y.: Tabulation based 4-universal hashing with applications to second moment estimation. In: SODA, pp. 615–624 (2004)

118. Vitter, J.S., Wang, M.: Approximate computation of multidimensional aggregates of sparse data using wavelets. SIGMOD Rec. 28(2), 193–204 (1999)

119. Vitter, J.S., Wang, M., Iyer, B.: Data cube approximation and histograms via wavelets. In: CIKM, pp. 96–104 (1998)

120. Vysochanskii, D., Petunin, Y.: Justification of the 3-sigma rule for unimodal distributions. Theory of Probability and Mathematical Statistics 21, 25–36 (1980)

121. Yu, X., Zuzarte, C., Sevcik, K.C.: Towards estimating the number of distinct value combinations for a set of attributes. In: CIKM, pp. 656–663 (2005)
BIOGRAPHICAL SKETCH
Shantanu Joshi received his Bachelor of Engineering in Computer Science from the
University of Mumbai, India in 2000. After a brief stint of one year at Patni Computer
Systems in Mumbai, he joined the graduate school at the University of Florida in Fall
2001, where he received his Master of Science (M.S.) in 2003 from the Department of
Computer and Information Science and Engineering.
In the summer of 2006, he was a research intern at the Data Management, Exploration
and Mining Group at Microsoft Research, where he worked with Nicolas Bruno and
Surajit Chaudhuri.
Shantanu will receive a Ph.D. in Computer Science in August 2007 from the
University of Florida and will then join the Database Server Manageability group at
Oracle Corporation as a member of technical staff.