SAMPLING-BASED RANDOMIZATION TECHNIQUES FOR APPROXIMATE QUERY PROCESSING
By
SHANTANU JOSHI
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2007
© 2007 SHANTANU JOSHI
To my parents, Dr Sharad Joshi and Dr Hemangi Joshi
ACKNOWLEDGMENTS
Firstly, my sincerest gratitude goes to my advisor, Professor Chris Jermaine for his
invaluable guidance and support throughout my PhD research work. During the initial
several months of my graduate work, Chris was extremely patient and always steered me
in the right direction whenever I wavered. His acute insight into the research
problems we worked on set an excellent example and provided me immense motivation to
work on them. He has always emphasized the importance of high-quality technical writing
and has spent several painstaking hours reading and correcting my technical manuscripts.
He has been the best mentor I could have hoped for and I shall always remain indebted to
him for shaping my career and more importantly, my thinking.
I am also very thankful to Professor Alin Dobra for his guidance during my graduate
study. His enthusiasm and constant willingness to help have always amazed me.
Thanks are also due to Professor Joachim Hammer for his support during the very
early days of my graduate study. I take this opportunity to thank Professors Tamer
Kahveci and Gary Koehler for taking the time to serve on my committee and for their
helpful suggestions.
It was a pleasure working with Subi Arumugam and Abhijit Pol on various collaborative
research projects. Several interesting technical discussions with Mingxi Wu, Fei Xu, Florin
Rusu, Laukik Chitnis and Seema Degwekar provided a stimulating work environment in
the Database Center.
This work would not have been possible without the constant encouragement and
support of my family. My parents, Dr Sharad Joshi and Dr Hemangi Joshi always
encouraged me to focus on my goals and pursue them against all odds. My brother,
Dr Abhijit Joshi has always placed trust in my abilities and has been an ideal example to
follow since my childhood. My loving sister-in-law, Dr Hetal Joshi has been supportive
since the time I decided to pursue computer science.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
   1.1 Approximate Query Processing (AQP) - A Different Paradigm
   1.2 Building an AQP System Afresh
       1.2.1 Sampling Vs Precomputed Synopses
       1.2.2 Architectural Changes
   1.3 Contributions in This Thesis

2 RELATED WORK
   2.1 Sampling-based Estimation
   2.2 Estimation Using Non-sampling Precomputed Synopses
   2.3 Analytic Query Processing Using Non-standard Data Models

3 MATERIALIZED SAMPLE VIEWS FOR DATABASE APPROXIMATION
   3.1 Introduction
   3.2 Existing Sampling Techniques
       3.2.1 Randomly Permuted Files
       3.2.2 Sampling from Indices
       3.2.3 Block-based Random Sampling
   3.3 Overview of Our Approach
       3.3.1 ACE Tree Leaf Nodes
       3.3.2 ACE Tree Structure
       3.3.3 Example Query Execution in ACE Tree
       3.3.4 Choice of Binary Versus k-Ary Tree
   3.4 Properties of the ACE Tree
       3.4.1 Combinability
       3.4.2 Appendability
       3.4.3 Exponentiality
   3.5 Construction of the ACE Tree
       3.5.1 Design Goals
       3.5.2 Construction
       3.5.3 Construction Phase 1
       3.5.4 Construction Phase 2
       3.5.5 Combinability/Appendability Revisited
       3.5.6 Page Alignment
   3.6 Query Algorithm
       3.6.1 Goals
       3.6.2 Algorithm Overview
       3.6.3 Data Structures
       3.6.4 Actual Algorithm
       3.6.5 Algorithm Analysis
   3.7 Multi-Dimensional ACE Trees
   3.8 Benchmarking
       3.8.1 Overview
       3.8.2 Discussion of Experimental Results
   3.9 Conclusion and Discussion

4 SAMPLING-BASED ESTIMATORS FOR SUBSET-BASED QUERIES
   4.1 Introduction
   4.2 The Concurrent Estimator
   4.3 Unbiased Estimator
       4.3.1 High-Level Description
       4.3.2 The Unbiased Estimator In Depth
       4.3.3 Why Is the Estimator Unbiased?
       4.3.4 Computing the Variance of the Estimator
       4.3.5 Is This Good?
   4.4 Developing a Biased Estimator
   4.5 Details of Our Approach
       4.5.1 Choice of Model and Model Parameters
       4.5.2 Estimation of Model Parameters
       4.5.3 Generating Populations From the Model
       4.5.4 Constructing the Estimator
   4.6 Experiments
       4.6.1 Experimental Setup
           4.6.1.1 Synthetic data sets
           4.6.1.2 Real-life data sets
       4.6.2 Results
       4.6.3 Discussion
   4.7 Related Work
   4.8 Conclusion

5 SAMPLING-BASED ESTIMATION OF LOW SELECTIVITY QUERIES
   5.1 Introduction
   5.2 Background
       5.2.1 Stratification
       5.2.2 “Optimal” Allocation and Why It’s Not
   5.3 Overview of Our Solution
   5.4 Defining XΣ
       5.4.1 Overview
       5.4.2 Defining Xcnt
       5.4.3 Defining XΣ′
       5.4.4 Combining The Two
       5.4.5 Limiting the Number of Domain Values
   5.5 Updating Priors Using The Pilot
   5.6 Putting It All Together
       5.6.1 Minimizing the Variance
       5.6.2 Computing the Final Sampling Allocation
   5.7 Experiments
       5.7.1 Goals
       5.7.2 Experimental Setup
       5.7.3 Results
       5.7.4 Discussion
   5.8 Related Work
   5.9 Conclusion

6 CONCLUSION

APPENDIX
EM ALGORITHM DERIVATION

REFERENCES

BIOGRAPHICAL SKETCH
LIST OF TABLES

4-1 Observed standard error as a percentage of SUM(e.SAL) over all e ∈ EMP for 24 synthetically generated data sets. The table shows errors for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B - Model-based biased estimator.

4-2 Observed standard error as a percentage of SUM(e.SAL) over all e ∈ EMP for 24 synthetically generated data sets. The table shows errors for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B - Model-based biased estimator.

4-3 Observed standard error as a percentage of SUM(e.SAL) over all e ∈ EMP for 18 synthetically generated data sets. The table shows errors for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B - Model-based biased estimator.

4-4 Observed standard error as a percentage of the total aggregate value of all records in the database for 8 queries over 3 real-life data sets. The table shows errors for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B - Model-based biased estimator.

5-1 Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage (for 1000 query runs) for a Simple Random Sampling estimator for the KDD Cup data set. Results are shown for varying sample sizes and for three different query selectivities: 0.01%, 0.1% and 1%.

5-2 Average running time of Neyman and Bayes-Neyman estimators over three real-world datasets.

5-3 Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage (for 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator for the three data sets. Results are shown for 20 strata and for varying numbers of records in the pilot sample per stratum (PS) and sample sizes (SS), for three different query selectivities: 0.01%, 0.1% and 1%.

5-4 Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage (for 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator for the three data sets. Results are shown for 200 strata with varying numbers of records in the pilot sample per stratum (PS) and sample sizes (SS), for three different query selectivities: 0.01%, 0.1% and 1%.
LIST OF FIGURES

1-1 Simplified architecture of a DBMS.

3-1 Structure of a leaf node of the ACE tree.

3-2 Structure of the ACE tree.

3-3 Random samples from section 1 of L3.

3-4 Combining samples from L3 and L5.

3-5 Combining two sections of leaf nodes of the ACE tree.

3-6 Appending two sections of leaf nodes of the ACE tree.

3-7 Choosing keys for internal nodes.

3-8 Exponentiality property of ACE tree.

3-9 Phase 2 of tree construction.

3-10 Execution runs of query answering algorithm with (a) 1 contributing section, (b) 6 contributing sections, (c) 7 contributing sections and (d) 16 contributing sections.

3-11 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 0.25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time, plotted as a percentage of the time required to scan the relation.

3-12 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 2.5% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time, plotted as a percentage of the time required to scan the relation.

3-13 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time, plotted as a percentage of the time required to scan the relation.

3-14 Sampling rate of an ACE tree vs. rate for a B+ tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 2.5% of the database records. The graph is an extension of Figure 3-12 and shows results till all three sampling techniques return all the records matching the query predicate.

3-15 Number of records needed to be buffered by the ACE Tree for queries with (a) 0.25% and (b) 2.5% selectivity. The graphs show the number of records buffered as a fraction of the total database records versus time, plotted as a percentage of the time required to scan the relation.

3-16 Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly permuted file, with a spatial selection predicate accepting 0.25% of the database tuples.

3-17 Sampling rate of an ACE tree vs. rate for an R-tree and scan of a randomly permuted file, with a spatial selection predicate accepting 2.5% of the database tuples.

3-18 Sampling rate of an ACE tree vs. rate for an R-tree and scan of a randomly permuted file, with a spatial selection predicate accepting 25% of the database tuples.

4-1 Sampling from a superpopulation.

4-2 Six distributions used to generate, for each e in EMP, the number of records s in SALE for which f3(e, s) evaluates to true.

5-1 Beta distribution with parameters α = β = 0.5.
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

SAMPLING-BASED RANDOMIZATION TECHNIQUES FOR APPROXIMATE QUERY PROCESSING

By

SHANTANU JOSHI

August 2007

Chair: Christopher Jermaine
Major: Computer Engineering
The past couple of decades have seen a significant amount of research directed
towards data warehousing and efficient processing of analytic queries. This is a daunting
task due to the massive sizes of data warehouses and the nature of complex, analytical
queries. This is evident from standard published benchmarking results such as TPC-H,
which show that many typical queries can require several minutes to execute despite
sophisticated hardware. Such costs can seem especially expensive for ad-hoc, exploratory
data analysis. One way to speed up the execution of such exploratory queries is to rely on
approximate results. This approach can be especially promising if approximate answers
and their error bounds are computed in a small fraction of the time required to execute
the query to completion. Random samples can be used effectively to perform such
estimation. Two important problems must be addressed before random samples can be
used for estimation. The first is that retrieval of random samples from a database is
generally very expensive, so index structures must be designed that permit efficient
random sampling under arbitrary selection predicates. The second is that approximate
computation of arbitrary queries generally requires complex statistical machinery, so
reliable sampling-based estimators must be developed for different types of analytic
queries. My research addresses these two problems by making the following contributions:
(a) a novel file organization and index structure called the ACE Tree, which permits
efficient random sampling from an arbitrary range query; (b) sampling-based estimators
for aggregate queries which have a correlated subquery where the inner and outer queries
are related by the SQL EXISTS, NOT EXISTS, IN or NOT IN clause; and (c) a stratified
sampling technique for estimating the result of aggregate queries having highly selective
predicates.
CHAPTER 1
INTRODUCTION
The last couple of decades have seen an explosive growth of electronic data. It is not
unusual for data management systems to support several terabytes or even petabytes of
data. Such massive volumes of data have led to the evolution of “data warehouses”, which
are systems capable of supporting storage and efficient retrieval of large amounts of data.
Data warehouses are typically used for applications such as online analytical processing.
Such applications process queries and expect results in a manner different from
traditional transaction processing. For example, a typical query by a sales manager on a
sales data warehouse might be:
“Return the average salary of all employees at locations whose sales have increased by
at least 10% over the past 3 years.”
The result of such a query could be used to make high-level decisions such as whether
or not to hire more employees at the locations of interest. Such queries are typical in
a data warehousing environment in that their evaluation requires complex analytical
processing over huge amounts of data. Traditional transactional processing methods may
be unacceptably slow to answer such complex queries.
1.1 Approximate Query Processing (AQP) - A Different Paradigm
The nature of analytical queries and their associated applications makes it acceptable
to return results which may not be exact. Since computation of exact results may require
an unreasonable amount of time due to massive volumes of data, approximation is
attractive if the approximate results can be computed in a fraction of the time it would
take to compute the exact results. Moreover, providing approximate results can be useful
for quickly exploring the whole data set at a high level. This technique of providing fast
but approximate results has been termed “Approximate Query Processing” in the
literature.
In addition to computing an approximate answer, it is also important to provide
metrics about the accuracy of the answer. One way to express the accuracy is in terms
of error bounds with certain probabilistic guarantees of the form, “The estimated answer
is 2.45 × 105, and with 95% confidence the true answer lies within ±1.18 × 103 of the
estimate”. Here, the error bounds are expressed as an interval and the accuracy guarantee
is provided at 95% confidence.
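A guarantee of this form is usually derived from the normal approximation to the sampling distribution of the estimator. The following sketch (illustrative only; the estimator, data, and function names are invented for this example, not taken from the thesis) shows how a sample is turned into an "estimate ± bound at 95% confidence" statement:

```python
import math
import random

def estimate_sum(sample, population_size, z=1.96):
    """Estimate a population SUM from a simple random sample, with a
    normal-approximation error bound at roughly 95% confidence."""
    n = len(sample)
    mean = sum(sample) / n
    # sample variance with Bessel's correction
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    estimate = population_size * mean
    # standard error of the scaled-up SUM estimator
    std_err = population_size * math.sqrt(var / n)
    return estimate, z * std_err

random.seed(7)
population = [random.gauss(100.0, 15.0) for _ in range(100_000)]
sample = random.sample(population, 1_000)
est, bound = estimate_sum(sample, len(population))
print(f"Estimated answer: {est:.3e}; with 95% confidence the true "
      f"answer lies within ±{bound:.3e} of the estimate")
```

The width of the bound shrinks as roughly 1/√n, which is why taking more samples steadily tightens the guarantee.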
A promising approach for aggregation queries in Approximate Query Processing
(AQP), called online aggregation (OLA), has been proposed by Haas and Hellerstein [63].
They propose an interactive interface for data exploration and analysis where records are
retrieved in a random order. Using these random samples, running estimates and error
bounds are computed and immediately displayed to the user. As time progresses, the
size of the random sample keeps growing, so the estimate is continuously refined. At
predetermined time intervals, the refined estimate, along with its improved accuracy, is
displayed to the user. If at any point during the execution the user is satisfied with
the accuracy of the answer, she can terminate further execution. The system also gives an
overall progress indicator based on the fraction of records that have been sampled thus
far. Thus, OLA provides an interface where the user is given a rough estimate of the result
very quickly.
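The OLA behavior described above can be sketched as a simple loop over records retrieved in random order, reporting a running estimate, error bound, and progress indicator at regular intervals. This is a toy illustration, not the actual system of Haas and Hellerstein; all names and the reporting interval are invented:

```python
import math
import random

def online_aggregation(records, report_every=1000, z=1.96):
    """Sketch of online aggregation for AVG: scan records in random order,
    maintaining a running estimate with a ~95% error bound and a progress
    indicator. Note: shuffles the input list in place."""
    random.shuffle(records)           # records arrive in random order
    n, total, total_sq = 0, 0.0, 0.0
    reports = []
    for x in records:
        n += 1
        total += x
        total_sq += x * x
        if n % report_every == 0:
            mean = total / n
            var = max(total_sq / n - mean * mean, 0.0)
            bound = z * math.sqrt(var / n)   # shrinks as n grows
            progress = n / len(records)
            reports.append((mean, bound, progress))
    return reports

random.seed(42)
data = [random.expovariate(1 / 50.0) for _ in range(20_000)]
for mean, bound, progress in online_aggregation(data)[:3]:
    print(f"estimate {mean:8.2f} +/- {bound:6.2f}  ({progress:4.0%} scanned)")
```

Each successive report uses a strictly larger sample, so the error bound tightens over time; a real system would stream these reports to the UI and let the user terminate the scan early.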
1.2 Building an AQP System Afresh
The OLA system described above presents an intuitive interface for approximate
answering of aggregate queries. However, to support the functionality proposed by
the system, fundamental changes need to be incorporated in several components of a
traditional database management system. In this section, we first examine why sampling
is a good approach for AQP, and then present an overview of the changes needed in the
architecture of a database management system to support sampling-based AQP.
1.2.1 Sampling Vs Precomputed Synopses
We now discuss two techniques that can be used to support fast but approximate
answering of queries. One intuitive technique is to answer queries using compact summary
information about the records of the database. Such information is typically called a
database statistic; commonly used database statistics are wavelets, histograms and
sketches. These statistics, also known as synopses, are orders of magnitude smaller than
the actual data, so accessing a synopsis is much faster and more efficient than reading the
entire data set. However, synopses are precomputed and static. If a query is issued that
requires synopses that are not already available, they would have to be computed by
scanning the data set, possibly multiple times, before the query can be answered.
A second approach to AQP is using samples of database records to answer queries.
Query execution is extremely fast since the number of records in the sample is a small
fraction of the total number of records in the database. The answer is then extrapolated
or “scaled-up” to the size of the entire database. Since the answer is computed by
processing very few records of the database, it is an approximation of the true answer.
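The "scaling up" step can be illustrated concretely: under simple random sampling, each sampled record stands in for N/n records of the full table. The sketch below is a fabricated example (the salary table and all names are invented), not code from the thesis:

```python
import random

def scaled_sum(sample, n_sampled, n_total):
    """Extrapolate a SUM computed over a simple random sample to the
    whole table: each sampled record represents n_total / n_sampled
    records of the full database."""
    return (n_total / n_sampled) * sum(sample)

random.seed(3)
table = [random.randint(20_000, 120_000) for _ in range(200_000)]  # e.g. salaries
sample = random.sample(table, 2_000)                               # a 1% sample

approx = scaled_sum(sample, len(sample), len(table))
exact = sum(table)
print(f"approximate SUM: {approx:.4e}")
print(f"exact SUM:       {exact:.4e}")
print(f"relative error:  {abs(approx - exact) / exact:.2%}")
```

Only 1% of the records are touched, yet the scaled-up answer is typically within about a percent of the exact SUM here; the deviation that remains is exactly what the error bounds of the previous section quantify.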
For the work in this thesis, we propose to use sampling [25, 109] in order to support
AQP. We make this choice due to the following important advantages of sampling over
precomputed synopses. The accuracy of an estimate computed by using samples can be
easily improved by obtaining more samples to answer the query. On the other hand, if the
estimate computed by using synopses is not sufficiently accurate, a new synopsis providing
greater accuracy would have to be built. Since this would require scanning the dataset it is
impractical. Secondly, sampling scales very well. Even for extremely large
datasets on the order of hundreds of gigabytes, it is generally possible to accommodate a
small sample in main memory and use efficient in-memory algorithms to process it. If this
is not possible, disk-based samples and algorithms have also been proposed [76] and are
as effective as their in-memory counterparts. This is an important benefit of sampling
as compared to histograms, which become unwieldy as the number of attributes of the
records in the dataset increases.
Thirdly, since real records (although very few) are used in a sample, it is possible
to answer any statistical query, including those with arbitrarily complex functions in
relational selection and join predicates. This is a very important advantage of sampling
over synopses such as sketches, which are not suitable for answering arbitrary queries.
Finally, unlike precomputed synopses, on-the-fly sampling requires no maintenance or
updating as the data change.
1.2.2 Architectural Changes
In order to support sampling-based AQP in a database management system, major
changes need to be incorporated in the architecture of the system. The reason for this is
that traditional database management systems were not designed to work with random
samples or to support computation of approximate results. In this section, we briefly
describe some of the most critical changes that are required in the architecture of a DBMS
to support sampling-based AQP.
Figure 1-1 depicts the various components from a simplified architecture of a DBMS.
The four components that require major changes in order to support sampling-based AQP
are as follows:
• Index/file/record manager - The use of traditional index structures like B+-Trees is not appropriate to obtain random samples. This is because such index structures order records based on record search key values, which is actually the opposite of obtaining records in a random order. Hence, for AQP it is important to provide physical structures or file organizations which support efficient retrieval of random samples.
• Execution engine - The execution engine needs to be revamped completely so that it can use the random samples returned by the lower level to execute the query on them. Further, the result of the query needs to be scaled up appropriately for the size of the entire database. This component would also need to be able to compute accuracy guarantees for the approximate answer.
Figure 1-1. Simplified architecture of a DBMS. Queries/updates enter at the User Interface; the Query Compiler produces a query plan for the Execution Engine; index, file and record requests go to the Index/File/Record Manager; page commands go to the Buffer Manager; and read/write page requests go through the Storage Manager to Storage.
• Query compiler - The query compiler has to be modified so that it can chalk out a different strategy of execution for various types of queries like relational joins, subset-based queries or queries with a GROUP-BY clause. Moreover, optimization of queries needs to be done very differently from traditional query optimizers, which create the most efficient query plan to run a query to completion. For AQP, queries should be optimized so that the first few result tuples are output as quickly as possible.
• User interface - There is tremendous scope for providing an intuitive user interface for an online AQP system. In addition to the UI being able to provide accuracy guarantees for the estimate, it would be very intuitive to provide a visualization of the intermediate results as and when they become available, so that the user can continue to explore the query or decide to modify or terminate it. Current database management systems provide user interfaces with very limited functionality.
1.3 Contributions in This Thesis
These tasks involve significant research and implementation challenges, many of
which have never been tackled in the literature.
For the scope of my research, I choose to address the following three problems. The
motivation and our solutions for each of these research problems are described separately
in the following chapters of this thesis.
• We present a primary index structure which can support efficient retrieval of random samples from an arbitrary range query. This requires a specialized file organization and an efficient algorithm to actually retrieve the desired random samples from the index. This work falls in the scope of the Index/file/record manager component described earlier.
• We present our solution to support execution of queries which have a nested sub-query where the inner query is correlated to the outer query, in an approximate query processing framework. This work falls in the purview of the execution engine of the system.
• Finally, we also present a technique to support efficient execution of queries which have predicates with low selectivities, such as GROUP BY queries with many different groups. This work also falls in the scope of the query execution engine.
CHAPTER 2
RELATED WORK
This chapter presents previous work in the data management and statistics literature
related to estimation using sampling, as well as non-sampling precomputed synopsis
structures. Finally, it describes work related to OLAP query processing using
non-relational data models such as data cubes.
2.1 Sampling-based Estimation
Sampling has a long history in the data management literature. Some of the
pioneering work in this field has been done by Olken and Rotem [96, 98–101] and
Antoshenkov [9], though the idea of using a survey sample for estimation in statistics
literature goes back much earlier than these works. Most of the work by Olken and Rotem
describes how to perform simple random sampling from databases. Estimation for several
types of database tasks has been attempted with random samples. The rest of this section
presents important works on sampling-based estimation of major database tasks.
Some of the initial work on estimating selectivity of join queries is due to Hou et al.
[67, 68]. They present unbiased and consistent estimators for estimating the join size and
also provide an algorithm for cluster sampling. In [64] they propose unbiased estimators
for COUNT aggregate queries over arbitrary relational algebra expressions. However,
computation of variance of their estimators is very complex [67]. They also do not provide
any bounds on the number of random samples required for estimation.
Adaptive sampling has been used for estimation of selectivity of predicates in
relational selection and join operations [83, 84, 86] and for approximating the size of a
relational projection operation [94]. Adaptive sampling has also been used in [85], to
estimate transitive closures of database relations. The authors point out the benefits and
generality of using sampling for selectivity estimation over parametric methods which
make assumptions about an underlying probability distribution for the data as well as
over non-parametric methods which require storing and maintaining synopses about the
underlying data. The algorithms consider the query result as a collection of results from
several disjoint subqueries. Subqueries are sampled randomly and their result sizes are
computed. The estimate of the actual query result size is then obtained from the results
of the various subqueries. The sampling of subqueries is continued until either the sum
of the subquery sizes is sufficiently large or the number of samples taken is sufficiently
large. The method requires that the maximum size of a subquery be known. Since this
is generally not available, the authors use an upper bound for the maximum subquery
size in their method. Haas and Swami [59] observe that using a loose upper bound for
the maximum subquery size can lead to sampling more subqueries than necessary, and
potentially increasing the cost of sampling significantly.
Double sampling or two-phase sampling has been used in [66] for estimating the
result of a COUNT query with a guaranteed error bound at a certain confidence level.
The error bound is guaranteed by performing sampling in two steps. In the first step a
small pilot sample is used to obtain preliminary information about the input relation. This
information is then used to compute the size of the sample for the second step such that
the estimator is guaranteed to produce an estimate with the desired error bound.
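The two-phase scheme can be sketched as follows. This is an illustrative Python sketch with names of our own choosing; the sample-size calculation uses a standard normal-approximation bound rather than the exact formula of [66].

```python
import math
import random

def two_phase_count_estimate(relation, predicate, pilot_size, target_error, z=1.96):
    """Estimate a COUNT via double sampling.

    Phase 1 draws a small pilot sample to get a preliminary estimate of the
    predicate's selectivity p; phase 2 sizes the main sample so that a
    z-level confidence interval for p has half-width <= target_error.
    Returns the count estimate and the phase-2 sample size used.
    """
    N = len(relation)
    # Phase 1: small pilot sample (with replacement, for simplicity).
    pilot = [random.choice(relation) for _ in range(pilot_size)]
    p_hat = sum(1 for r in pilot if predicate(r)) / pilot_size
    var = p_hat * (1.0 - p_hat)
    # Phase 2: choose n so that z * sqrt(var / n) <= target_error.
    n = max(pilot_size, math.ceil(var * (z / target_error) ** 2))
    main = [random.choice(relation) for _ in range(n)]
    p_final = sum(1 for r in main if predicate(r)) / n
    return N * p_final, n
```

Note that the phase-2 sample size depends entirely on the pilot estimate of the variance, which is exactly the weakness discussed next.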
As Haas and Swami [59] point out, the drawback of using double sampling is that
there is no theoretical guidance for choosing the size of the pilot sample. This could
lead to an unpredictably imprecise estimate if the pilot sample size is too small or an
unnecessarily high sampling cost if the pilot sample size is too large. In their work [59],
Haas and Swami present sequential sampling techniques which provide an estimate of the
result size and also bound the error in estimation with a prespecified probability. They
present two algorithms in the paper to estimate the size of a query result. Although both
algorithms have been proven to be asymptotically correct and efficient, the first algorithm
suffers from the problem of undercoverage. This means that in practice the probability
with which it estimates the query result within the computed error bound is less than
the specified confidence level of the algorithm. This problem is addressed by the second
algorithm which organizes groups of equal-sized result sets into a single stratum and then
performs stratified sampling over the different strata. However, their algorithms do not
perform very well when estimating the size of joins between a skewed and a non-skewed
relation.
Ling and Sun [82] point out that general sampling-based estimation methods have a
high cost of execution since they make an overly restrictive assumption of no knowledge
about the overall characteristics of the data. In particular, they note that estimation of
the overall mean and variance of the data not only incurs cost but also introduces error in
estimation. The authors rather suggest an alternative approach of actually keeping track
of these characteristics in the database at a minimal overhead.
A detailed study about the cost of sampling-based methods to estimate join query
sizes appears in [58]. The paper systematically analyses the factors which influence the
cost of a sampling-based method to estimate join selectivities. Based on their analysis,
their findings can be summarized as follows: (a) When the measure of precision of the
estimate is absolute, the cost of sampling increases with the number of relations involved
in the join as well as the sizes of the relations themselves. (b) When the measure of
precision of the estimate is relative, the cost of using sampling increases with the sizes
of the relations, but decreases as the number of input relations increases. (c) When the
distribution of the join attribute values is uniform or highly skewed for all input relations,
the cost of sampling tends to be low, while it is high when only some of the input relations
have a skewed join attribute value distribution. (d) The presence of tuples in a relation
which do not join with any other tuples from other relations always increases the cost of
sampling.
Haas et al. [56, 57] study and compare the performance of new as well as previous
sampling-based procedures for estimating the selectivity of queries with joins. In particular
they identify estimators which have a minimum variance after a fixed number of sampling
steps have been performed. They note that use of indexes on input relations can further
reduce variance of the selectivity estimate. The authors also show how their estimation
methods can be used to estimate the cost of implementing a given join query plan without
making any assumptions about the underlying data or requiring storage and maintenance
of summary statistics about the data.
Ganguly et al. [35] describe how to estimate the size of a join in the presence of skew
in the data by using a technique called bifocal sampling. This technique classifies tuples
of each input relation into two groups, sparse and dense, based on the number of tuples
with the same value for the join attribute. Every combination of these groups is then
subject to different estimation procedures. Each of these estimation procedures requires a
sample size larger than a certain value (in terms of the total number of tuples in the input
relation) to provide an estimate within a small constant factor of the true join size. In
order to guarantee estimates with the specified accuracy, bifocal sampling also requires the
total join size and the join sizes from sparse-sparse subjoins to be greater than a certain
threshold.
Gibbons and Matias [40] introduce two sampling-based summary statistics called concise samples and counting samples and present techniques for their fast and incremental
maintenance. Although the paper describes summary statistics rather than on-the-fly
sampling techniques, the summary statistics are created from random samples of the
underlying data and are actually defined to describe characteristics of a random sample
of the data. Since summary statistics of a random sample require much less memory
than the sample itself, the paper describes how information from a much larger
sample can be stored in a given amount of memory by storing sample statistics instead
of using the memory to store actual random samples. Thus, the authors claim that since
information from a larger sample can be stored by their summary statistics the accuracy of
approximate answers can be boosted.
Chaudhuri, Motwani and Narasayya [22, 24] present a detailed study of the problem
of efficiently sampling the output of a join operation without actually computing the
entire join. They prove a negative result that it is not possible to generate a sample of
the join result of two relations by merely joining samples of the relations involved in
the join. Based on this result, they propose a biased sampling strategy which samples
tuples from one relation in proportion to the frequency with which their matching tuples appear in
the other relation. The intuition behind this approach is that the resulting biased sample
is more likely to reflect the structure of the actual join result between the two relations.
Information about the frequency of the various join attribute values is assumed to be
available in the form of some synopsis structures like histograms.
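The intuition behind this weighted strategy can be illustrated with a small sketch. All names below are hypothetical, and the join-partner frequencies are computed exactly here rather than read from a histogram-like synopsis as [22, 24] assume.

```python
import random
from collections import defaultdict

def sample_join(r1, r2, k, key1, key2):
    """Draw k random tuples from the join of r1 and r2 without computing it.

    A tuple of r1 is sampled with probability proportional to its number of
    join partners in r2; then one matching r2 tuple is picked uniformly.
    Each draw is a uniform random tuple of the join result, which is what
    naively joining uniform samples of r1 and r2 fails to produce.
    """
    # Group r2 tuples by join value (the "frequency synopsis").
    partners = defaultdict(list)
    for t2 in r2:
        partners[key2(t2)].append(t2)
    # Weight every r1 tuple by its number of join partners.
    weights = [len(partners[key1(t1)]) for t1 in r1]
    if sum(weights) == 0:
        return []  # empty join result
    sample = []
    for _ in range(k):
        t1 = random.choices(r1, weights=weights, k=1)[0]
        t2 = random.choice(partners[key1(t1)])
        sample.append((t1, t2))
    return sample
```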
There has also been work to estimate the actual result of an aggregate query which
involves a relational join operation on its input relations. In fact, Haas, Hellerstein and
Wang [63] propose a system called Online Aggregation (OLA) that can support online
execution of analytic-style aggregation queries. They propose the system to have a visual
interface which displays the current estimate of the aggregate query along with error
bounds at a certain confidence level. Then, as time progresses, the system continually
refines the estimate and at the same time shrinks the width of the error bounds. The user
who is presented with such a visual interface, has at all times, an option to terminate
further execution of the query in case the error bound width is satisfactory for the given
confidence level. The authors propose the use of random sampling from input relations to
provide estimates in OLA. Further, they describe some of the key changes that would
be required in a DBMS to support OLA. In [51], Haas describes statistical techniques
for computing error bounds in OLA. The work on OLA eventually grew into the UC
Berkeley CONTROL project. In their article [62], Hellerstein et al. describe various issues
in providing interactive data analysis and possible approaches to address those issues.
Haas and Hellerstein [53, 54] propose a family of join algorithms called ripple joins
to perform relational joins in an OLA framework. Ripple joins were designed to minimize
the time until an acceptably precise estimate of the query result is made available, as
opposed to minimizing the time to completion of the query as in a traditional DBMS. For
a two-table join, the algorithm retrieves a certain number of random tuples from both
relations at each sampling step; these new tuples are joined with previously seen tuples
and with each other. The running result of the aggregate query is updated with these
newly retrieved tuples. The paper also describes how a statistically meaningful confidence
interval of the estimated result can be computed based on the Central Limit Theorem
(CLT).
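A minimal sketch of the running estimator for a SUM over a two-table join follows (our naming; the CLT-based confidence-interval machinery of [53, 54] is omitted). Here f returns the aggregate contribution of a joining pair and 0 for a non-joining pair.

```python
import random

def ripple_join_sum(r1, r2, f, steps):
    """Running SUM estimate over a two-table join via a square ripple join.

    At each step one new random tuple is drawn from each relation and
    joined with the other new tuple and with all previously seen tuples;
    the aggregate over sampled pairs is then scaled up to the full cross
    product. Returns the sequence of running estimates, one per step.
    """
    n1, n2 = len(r1), len(r2)
    perm1 = random.sample(r1, len(r1))  # random retrieval orders
    perm2 = random.sample(r2, len(r2))
    seen1, seen2 = [], []
    total = 0.0
    estimates = []
    for n in range(steps):
        t1, t2 = perm1[n], perm2[n]
        # Join the new tuples with each other and with all prior tuples.
        total += f(t1, t2)
        total += sum(f(t1, s2) for s2 in seen2)
        total += sum(f(s1, t2) for s1 in seen1)
        seen1.append(t1)
        seen2.append(t2)
        # Scale the sampled cross product up to the full cross product.
        scale = (n1 * n2) / ((n + 1) * (n + 1))
        estimates.append(scale * total)
    return estimates
```

When steps reaches the size of the relations, every pair has been examined exactly once and the estimate equals the exact answer.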
Luo et al. [87] present an online parallel hash ripple join algorithm to speed up
the execution of the ripple join especially when the join selectivity is low and also when
the user wishes to continue execution until completion. The algorithm is assumed to be
executed at a fixed set of processor nodes. At each node, a hash table is maintained for
every relation. Moreover every bucket in each hash table could have some tuples stored
in memory and some others stored on disk. The join algorithm proceeds in two phases; in
the first phase tuples from both relations are retrieved in a random order and distributed
to the processor nodes so that each node would perform roughly the same amount of work
for executing the join. By using multiple threads at each node, production of join tuples
from the in-memory hash table buckets begins even as tuples are being distributed to
the various processors. The second phase begins after redistribution from the first phase
is complete. In this phase, a new in-memory hash table is created which uses a hashing
function different from the function used in phase 1. The tuples in the disk-resident
buckets of the hash table of phase 1 are then hashed according to the hashing function
of phase 2 and joined. The algorithm provides a considerable speed-up factor over the
one-node ripple join, provided its memory requirements are met.
Jermaine et al. [73, 74] point out that the drawback of both the ripple join algorithms
described above is that the statistical guarantees provided by the estimator are valid
only as long as the output of the join can be accommodated in main memory. In order
to counteract this problem, they propose the Sort-Merge-Shrink join algorithm as a
generalization of the ripple join which can provide error guarantees throughout execution,
even if it operates from disk. The algorithm proceeds in three phases. In the sort phase,
the two input relations are read in parallel and sorted into runs. Each pair of runs is
subject to an in-memory hash ripple join and provides a corresponding estimate of the
join result. The merge and shrink phases execute concurrently where in the merge phase,
tuples are retrieved from the various sorted runs of both relations and joined with each
other. Since the sorted runs “lose” tuples which are pulled by the merge phase, the
shrinking phase takes these tuples into account and updates the estimator accordingly.
The authors provide a detailed statistical analysis of the estimator as well as computation of
error bounds.
Estimation using sampling of the number of distinct values in a column has been
studied by Haas et al. [48]. They provide an overview of the estimators used in the
database and statistics literature and also develop several new sampling-based estimators
for the distinct value estimation problem. They propose a new hybrid sampling estimator
which explicitly adapts to different levels of data skew. Their hybrid estimator performs
a Chi-square test to detect skew in the distribution of the attribute value. If the data
appears to be skewed, then Shlosser’s estimator is used while if the test does not detect
skew, a smoothed-jackknife estimator (which is a modification of the conventional
jackknife estimator) is used. The authors attribute the dearth of work on sampling-based
estimation of the number of distinct values to the inherent difficulty of the problem, while
noting that it is a much harder problem than estimating the selectivity of a join.
Haas and Stokes [50] present a detailed study of the problem of estimating the
number of classes in a finite population. This is equivalent to the database problem of
estimating the number of distinct values in a relation. The authors make recommendations
about which statistical estimator is appropriate subject to constraints and finally claim
from empirical results that a hybrid estimator which adapts according to data skew is the
best overall estimator.
There has also been work by Charikar et al. [16] which establishes a negative result
stating that no sampling-based estimator for estimating the number of distinct values
can guarantee small error across all input distributions unless it examines a large fraction
of the input data. They also present a Guaranteed Error Estimator (GEE) whose error
is provably no worse than their negative result. Since the GEE is a general estimator
providing optimal error over all distributions, the authors note that its accuracy may be
lower than some previous estimators on specific distributions. Hence, they propose an
estimator called the Adaptive Estimator (AE) which is similar in spirit to Haas et al.’s
hybrid estimator [50], but unlike the latter, is not composed of two distinct estimators.
Rather, the AE considers the contribution of data items having high and low frequencies in
a single unified estimator.
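For concreteness, the GEE can be sketched in a few lines, assuming the form described in [16]: the "singleton" values (seen exactly once in a sample of r out of n rows) are scaled up by sqrt(n/r), and every value seen more than once is counted exactly once.

```python
import math
from collections import Counter

def gee_estimate(sample, n):
    """Guaranteed Error Estimator (GEE) for the number of distinct values.

    D_hat = sqrt(n/r) * f1 + sum_{j >= 2} f_j, where r is the sample size
    and f_j is the number of distinct values occurring exactly j times in
    the sample. (Sketch after Charikar et al. [16].)
    """
    r = len(sample)
    value_counts = Counter(sample)                  # value -> multiplicity
    freq_of_freq = Counter(value_counts.values())   # j -> f_j
    f1 = freq_of_freq.get(1, 0)
    rest = sum(fj for j, fj in freq_of_freq.items() if j >= 2)
    return math.sqrt(n / r) * f1 + rest
```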
In the AQUA system [41] for approximate answering of queries, Acharya et al.
[6] propose using synopses for estimating the result of relational join queries involving
foreign-key joins rather than using random samples from the base relations. These
synopses are actually precomputed samples from a small set of distinguished joins and
are called join synopses in the paper. The idea of join synopses is that by precomputing
samples from a small set of distinguished joins, these samples can be used for estimating
the result of many other joins. The concept is applicable in a k-way join where each join
involves a primary and foreign key of the participating relations. The paper describes
that if workload information is available, it can be used to design an optimal allocation
for the join synopses that minimizes the overall error in the approximate answers over the
workload.
Acharya et al. [5] propose using a mix of uniform and biased samples for approximately
answering queries with a GROUP-BY clause. Their sampling technique, called congressional sampling, relies on using precomputed samples which are a hybrid union of uniform
and biased samples. They assume that the selectivity of the query predicate is not so low
that their precomputed sample completely misses one or more groups from the result of
the GROUP-BY query. Based on this assumption, they devise a sampling plan for the
different groups such that the expected minimum number of tuples satisfying the query
predicate in any group, is maximized. The authors also present one-pass algorithms [4] for
constructing the congressional samples.
Ganti et al. [37] describe a biased sampling approach which they call ICICLES to
obtain random samples which are tuned to a particular workload. Thus, if a tuple is
chosen by many queries in a workload, it has a higher probability of being selected in the
self-tuning sample as compared to tuples which are chosen by fewer queries. Since this is
a non-uniform sample, traditional sampling-based estimators must be adapted for these
samples. The paper describes modified estimators for the common aggregation operations.
It also describes how the self-tuning samples are tuned in the presence of a dynamically
changing workload.
Chaudhuri et al. [18] note that uniform random sampling to estimate aggregate
queries is ineffective when the distribution of the aggregate attribute is skewed or when
the query predicate has a low selectivity. They propose using a combination of two
methods to address this problem. Their first approach is to index separately those
attribute values which contribute significantly to the query result. This method is called
Outlier Indexing in the paper. The second approach proposed in the paper is to exploit
workload information to perform weighted sampling. According to this technique, records
which satisfied many queries in the workload are sampled more often than records that
satisfied fewer queries.
Chaudhuri, Das and Narasayya [19, 20] describe how workload information can
be used to precompute a sample that minimizes the error for the given workload. The
problem of selection of the sample is framed as an optimization problem so that the error
in estimation of the workload queries using the resulting sample is minimized. When the
actual incoming queries are identical to queries in the workload, this approach gives a
solution with minimal error across all queries. The paper also describes how the choice of
the sample can be tuned to achieve effective estimates when the actual queries are similar
but not identical to the workload.
Babcock, Chaudhuri and Das [10] note that a uniformly random sample can lead
to inaccurate answers for many queries. They observe that for such queries, estimation
using an appropriately biased sample can lead to more accurate answers as compared
to estimation using uniformly random samples. Based on this idea, the paper describes
a technique called small group sampling which is designed to approximately answer
aggregation queries having a GROUP-BY clause. The distinctive feature of this technique
as compared to previous biased sampling techniques like congressional sampling is that
a new biased sample is chosen for every GROUP-BY query, such that it maximizes
the accuracy of estimating the query rather than trying to devise a biased sample that
maximizes the accuracy over an entire workload of queries. According to this technique,
larger groups from the output of the GROUP-BY queries are sampled uniformly while the
small groups are sampled at a higher rate to ensure that they are adequately represented.
The group samples are obtained on a per-query basis from an overall sample which is
computed in a pre-processing phase.
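The idea can be sketched as follows. This is an illustrative simplification with hypothetical names: the actual technique of [10] derives its per-query samples from an overall sample built in a pre-processing phase, whereas this sketch works directly on the rows.

```python
import random
from collections import defaultdict

def small_group_sample(rows, group_of, base_rate, small_threshold):
    """Biased sample in the spirit of small group sampling.

    Rows belonging to large groups are sampled uniformly at base_rate,
    while every row of a group smaller than small_threshold is kept, so
    that rare groups are not lost from the GROUP-BY result. Returns
    (row, inclusion_probability) pairs; an unbiased estimator must weight
    each row by the inverse of its inclusion probability.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[group_of(row)].append(row)
    sample = []
    for g, members in groups.items():
        if len(members) < small_threshold:
            # Small group: keep every row with probability 1.
            sample.extend((row, 1.0) for row in members)
        else:
            # Large group: uniform sampling at the base rate.
            for row in members:
                if random.random() < base_rate:
                    sample.append((row, base_rate))
    return sample
```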
In fact, database sampling has been recognized as an important enough problem
that ISO has been working to develop a standard interface for sampling from relational
database systems [55], and significant research efforts are directed at providing sampling
from database systems by vendors such as IBM [52].
2.2 Estimation Using Non-sampling Precomputed Synopses
Estimation in databases using a non-sampling technique was first proposed by Rowe
[106, 107]. The technique proposed is called antisampling and involves creation of a special
auxiliary structure called a database abstract. The abstract considers the distribution of
several attributes and groups of attributes. Correlations between different attributes can
also be characterized as statistics. This technique was found to be faster than random
sampling, but required domain knowledge about the various attributes.
Classic work on histogram based estimation of predicate selectivity is by Selinger et
al. [110] and Piatetsky-Shapiro and Connell [102]. Selectivity estimation of queries with
multidimensional predicates using histograms was presented by Muralikrishna and DeWitt
[92]. They show that the maximum error in estimation can be controlled more effectively
by choosing equi-depth histograms as opposed to equi-width histograms.
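A minimal sketch of equi-depth bucket construction and range-selectivity estimation follows (our naming). Each bucket holds roughly the same number of values, and a range predicate is answered by assuming values are uniformly spread within each bucket.

```python
def equi_depth_histogram(values, num_buckets):
    """Build an equi-depth histogram as a list of (low, high, count) triples.

    Bucket boundaries are chosen so every bucket covers roughly the same
    number of values, which bounds the error any single bucket can
    contribute, unlike a fixed-width bucket that may hide a large spike.
    """
    ordered = sorted(values)
    n = len(ordered)
    buckets = []
    for b in range(num_buckets):
        lo_idx = b * n // num_buckets
        hi_idx = (b + 1) * n // num_buckets
        if lo_idx == hi_idx:
            continue  # more buckets requested than values available
        chunk = ordered[lo_idx:hi_idx]
        buckets.append((chunk[0], chunk[-1], len(chunk)))
    return buckets

def estimate_range_count(buckets, lo, hi):
    """Estimate |{v : lo <= v <= hi}| assuming uniformity within buckets."""
    total = 0.0
    for b_lo, b_hi, count in buckets:
        width = max(b_hi - b_lo, 1e-12)
        overlap = max(0.0, min(hi, b_hi) - max(lo, b_lo))
        total += count * min(1.0, overlap / width)
    return total
```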
Ioannidis [70] describes how serial histograms are optimal for aggregate queries
involving arbitrary join trees with equality predicates. Ioannidis and Poosala [71] have also
studied how histograms can be used to approximately answer non-aggregate queries which
have a set based result.
Several histogram construction schemes [42, 45, 72] have been proposed in the
literature. Jagadish et al. [72] describe techniques for constructing histograms which can
minimize a given error metric where the error is introduced because of approximation
of values in a bucket by a single value associated with the bucket. They also describe
techniques for augmenting histograms with additional information so that they can be
used to provide accuracy guarantees of the estimated results.
Construction of approximate histograms by considering only a random sample of
the data set was investigated by Chaudhuri et al. [23]. Their technique uses an adaptive
sampling approach to determine the sample size that would be sufficient to generate
approximate histograms which can guarantee pre-specified error bounds in estimation.
They also extend their work to consider duplicate values in the domain of the attribute for
which a histogram is to be constructed.
The problem of estimation of the number of distinct value combinations of a set of
attributes has been studied by Yu et al. [121]. Due to the inherent difficulty of developing
a good, sampling-based estimation solution to the problem, they propose using additional
information about the data in the form of histograms, indexes or data cubes.
In a recent paper [28], Dobra presents a study of when histograms are best suited for
approximation. The paper considers the long-standing assumption that histograms are
most effective only when all elements in a bucket have the same frequency and actually
extends it to a less restrictive assumption that histograms are well-suited when elements
within a bucket are randomly arranged even though they might have different frequencies.
Wavelets have a long history as mathematical tools for hierarchical decomposition
of functions in signal and image processing. Vitter and his collaborators have also
studied how wavelets can be applied to selectivity estimation of queries [89] and also
for computing aggregates over data cubes [118, 119]. Chakrabarti et al. [15] present
techniques for approximate computation of results for aggregate as well as non-aggregate
queries using Haar wavelets.
One more summary structure that has been proposed for approximating the size of
joins is sketches. Sketches are small-space summaries of data suited for data streams. A
sketch generally consists of multiple counters corresponding to random variables which
enable them to provide approximate answers with error guarantees for a priori decided
queries. Some of the earliest work on sketches was presented by Alon, Gibbons, Matias
and Szegedy [7, 8]. Sketching techniques with improved error guarantees and faster update
times have been proposed as Fast-Count sketches [117]. A statistical analysis of various
sketching techniques along with recommendations on their use for estimating join sizes
appears in [108].
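The basic AGMS idea can be sketched as follows. This is a deliberately simplified version: real implementations use 4-wise independent hash families and a median-of-means combination, not the explicit per-value sign table and plain mean used here.

```python
import random
import statistics

def agms_join_size(r_values, s_values, num_counters=64, seed=42):
    """Estimate |R join S| on a single attribute with basic AGMS sketches.

    Each counter is X = sum over tuples of xi(v), where xi maps each join
    value independently to +1 or -1; E[X_R * X_S] equals the join size,
    and averaging several independent counters reduces the variance.
    """
    rng = random.Random(seed)
    domain = set(r_values) | set(s_values)
    products = []
    for _ in range(num_counters):
        xi = {v: rng.choice((-1, 1)) for v in domain}  # random sign per value
        x_r = sum(xi[v] for v in r_values)
        x_s = sum(xi[v] for v in s_values)
        products.append(x_r * x_s)
    return statistics.mean(products)
```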
2.3 Analytic Query Processing Using Non-standard Data Models
A data model for OLAP applications called data cube was proposed by Gray et al.
[44] for processing of analytic style aggregation queries over data warehouses. The paper
describes a generalization of the SQL GROUP BY operator to multiple dimensions by
introducing the data cube operator. This operator treats each of the possible aggregation
attributes as a dimension of a high dimensional space. The aggregate of a particular
set of attribute values is considered as a point in this space. Since the cube holds
precomputed aggregate values over all dimensions, it can be used to quickly compute
results to GROUP-BY queries over multiple dimensions. The data cube is precomputed
and can require a significant amount of space for storage of the precomputed aggregates
along the different dimensions. A more serious drawback of the data cube approach is
that it can be used to efficiently answer only such queries which have a grouping hierarchy
that conforms to the hierarchy on which the data cube is built. Moreover, complex queries
which have been addressed in this thesis such as queries having correlated subqueries are
not amenable to efficient processing with the data cube model.
Due to potentially large sizes of data cubes for high dimensions, researchers have
studied techniques to discover semantic relationships in a data cube. This approach
reduces the number of precomputed aggregates grouped by different attributes if
their aggregate values are identical. The quotient cube [79] and quotient cube tree [80]
structures are such compressed representations of the data cube which preserve semantic
relationships while also allowing processing of point and range queries.
Another approach that has been employed in shrinking the data cube while at the
same time preserving all the information in it is the Dwarf [113, 114] structure. Dwarf
identifies and eliminates redundancies in prefixes and suffixes of the values along different
dimensions of a data cube. The paper shows that by eliminating prefix as well as suffix
redundancies, both dense as well as sparse data cubes can be compressed effectively.
The paper also shows improved cube construction time, query response time as well as
update time as compared to cube trees [105]. Although the Dwarf structure improves the
performance of the data cube model, it still suffers from the inherent drawback of the data
cube model – it is not suitable to efficiently answer arbitrarily complex queries such as
queries with correlated subqueries.
Recently, a new column-oriented architecture for database systems called C-store was
proposed by Stonebraker et al. [115]. The system has been designed for an environment
that has a much higher number of database reads than writes, such as a data
warehousing environment. C-store logically splits attributes of a relational table into
projections which are collections of attributes, and stores them on disk such that all values
of any attribute are stored adjacent to each other. The paper presents experimental results
which show that C-store executes several select-project-join and group-by queries over the
TPC-H benchmark much faster than commercial row-oriented or column-oriented systems.
At the time of the paper [115], the system was still under development.
CHAPTER 3
MATERIALIZED SAMPLE VIEWS FOR DATABASE APPROXIMATION
3.1 Introduction
With ever-increasing database sizes, randomization and randomized algorithms [91]
have become vital data management tools. In particular, random sampling is one of the
most important sources of randomness for such algorithms. Scores of algorithms that are
useful over large data repositories either require a randomized input ordering for data (i.e.,
an online random sample), or else they operate over samples of the data to increase the
speed of the algorithm.
Although applications requiring randomization abound in the data management
literature, we specifically consider online aggregation [54, 62, 63] in this thesis. In online
aggregation, database records are processed one-at-a-time, and used to keep the user
informed of the current “best guess” as to the eventual answer to the query. If the records
are input into the online aggregation algorithm in a randomized order, then it becomes
possible to give probabilistic guarantees on the relationship of the current guess to the
eventual answer to the query.
Despite the obvious importance of random sampling in a database environment and
dozens of recent papers on the subject (approximately 20 papers from recent SIGMOD
and VLDB conferences are concerned with database sampling), there has been relatively
little work towards actually supporting random sampling with physical database file
organizations. The classic work in this area (by Olken and his co-authors [98, 99, 101])
suffers from a key drawback: each record sampled from a database file requires a random
disk I/O. At a current rate of around 100 random disk I/Os per second per disk, this
means that it is possible to retrieve only 6,000 samples per minute. If the goal is fast
approximate query processing or speeding up a data mining algorithm, this is clearly
unacceptable.
The Materialized Sample View
In this chapter, we propose to use the materialized sample view 1 as a convenient
abstraction for allowing efficient random sampling from a database. For example, consider
the following database schema:
SALE (DAY, CUST, PART, SUPP)
Imagine that we want to support fast, random sampling from this table, and most of
our queries include a temporal range predicate on the DAY attribute. This is exactly the
interface provided by a materialized sample view. A materialized sample view can be
specified with the following SQL-like query:
CREATE MATERIALIZED SAMPLE VIEW MySam
AS SELECT * FROM SALE
INDEX ON DAY
In general, the range attribute or attributes referenced in the INDEX ON clause can be
spatial, temporal, or otherwise, depending on the requirements of the application.
While the materialized sample view is a straightforward concept, efficient implementation
is difficult. The primary technical contribution of this thesis is a novel index structure
called the ACE Tree (Appendability, Combinability, Exponentiality; see Section 3.4) which
can be used to efficiently implement a materialized sample view. Such a view, stored as an
ACE Tree, has the following characteristics:
• It is possible to efficiently sample (without replacement) from any arbitrary range query over the indexed attribute, at a rate that is far faster than is possible using techniques proposed by Olken [96] or by scanning a randomly permuted file. In general, the view can produce samples from a predicate involving any attribute having a natural ordering, and a straightforward extension of the ACE Tree can be used for sampling from multi-dimensional predicates.
• The resulting sample is online, which means that new samples are returned continuously as time progresses, and in a manner such that at all times, the set of samples returned is a true random sample of all of the records in the view that match the range query. This is vital for important applications like online aggregation and data mining.
• Finally, the sample view is created efficiently, requiring only two external sorts of the records in the view, and with only a very small space overhead beyond the storage required for the data records.

1 This term was originally used in Olken's PhD thesis [96] in a slightly different context, where the goal was to maintain a fixed-size sample of a database; in contrast, as we describe subsequently, our materialized sample view is a structure allowing online sampling.
We note that while the materialized sample view is a logical concept, the actual file
organization used to implement such a view can be referred to as a sample index since it is
a primary index structure to efficiently retrieve random samples.
3.2 Existing Sampling Techniques
In this section, we discuss three simple techniques that can be used to create
materialized sample views to support random sampling from a relational selection
predicate.
3.2.1 Randomly Permuted Files
One option for creating a materialized sample view is to randomly shuffle or permute
the records in the view. To sample from a relational selection predicate over the view,
we scan it sequentially from beginning to end and accept those records that satisfy the
predicate while rejecting the rest. This method has the advantage that it is very simple,
and using a fast external sorting algorithm, permuting the records can be very efficient.
Furthermore, since the process of scanning the file can make use of the fast, sequential I/O
provided by modern hard disks, a materialized view organized as a randomly permuted file
can be very useful for answering queries that are not very selective.
However, the major problem with such a materialized view is that the fraction of
useful samples retrieved by it is directly proportional to the selectivity of the selection
predicate. For example, if the selectivity of the query is 10%, then on average only 10% of
the random samples obtained by such a view can be used to answer the query. Hence for
moderate to low selectivity queries, most of the random samples retrieved by such a view
will not be useful for answering queries. Thus, the performance of such a view quickly
degrades as selectivity of the selection predicates decreases.
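To make the scan-and-filter scheme concrete, here is a minimal Python sketch (our own illustration, not code from the thesis; the in-memory list stands in for the permuted file on disk, and all names are hypothetical):

```python
import random

def permuted_scan_sample(records, predicate, n):
    """Scan a randomly permuted file sequentially, keeping records that
    satisfy the predicate and rejecting the rest.  Because the file is in
    random order, the kept records form a true random sample (without
    replacement) of the matching records, at every point in the scan."""
    sample = []
    for rec in records:                  # fast, sequential I/O in practice
        if predicate(rec):
            sample.append(rec)
            if len(sample) == n:
                break
    return sample

# One-time setup: permute the records (one external sort, in practice).
random.seed(7)
records = list(range(1000))
random.shuffle(records)

# A 10%-selectivity predicate: on average, 90% of the scanned records are
# rejected, which is exactly the inefficiency described above.
sample = permuted_scan_sample(records, lambda r: r < 100, 20)
```

With a highly selective predicate the loop must scan roughly ten records for every one it keeps, illustrating why performance degrades as selectivity decreases.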
3.2.2 Sampling from Indices
The second approach to creating a materialized sample view is to use one of the
standard indexing structures like a hashing scheme or a tree-based index structure
to organize the records in the view. In order to produce random samples from such
a materialized view, we can employ iterative or batch sampling techniques [9, 96,
99–101] that sample directly from a relational selection predicate, thus avoiding the
aforementioned problem of obtaining too few relevant records in the sample. Olken [96]
presents a comprehensive analysis and comparison of many such techniques. In this
Section we discuss the technique of sampling from a materialized view organized as
a ranked B+-Tree, since it has been proven to be the most efficient existing iterative
sampling technique in terms of number of disk accesses. A ranked B+-Tree is a regular
B+-Tree whose internal nodes have been augmented with information which permits one
to find the ith record in the file.
Let us assume that the relation SALE presented in the Introduction is stored as a
ranked B+-Tree file indexed on the attribute DAY and we want to retrieve a random
sample of records whose DAY attribute value falls between 11-28-2004 and 03-02-2005.
This translates to the following SQL query:
SELECT * FROM SALE
WHERE SALE.DAY BETWEEN ’11-28-2004’ AND ’03-02-2005’
Algorithm 1 can then be used to obtain a random sample of relevant records
from the ranked B+-Tree file.
The drawback of this algorithm is that whenever a leaf page is accessed, the
algorithm retrieves only that record whose rank matches with the rank being searched for.
Hence for every record which resides on a page that is not currently buffered, the retrieval
time is the same as the time required for a random disk I/O. Thus, as long as there are
unbuffered leaf pages containing candidate records, the rate of record retrieval is very slow.
Algorithm 1: Sampling from a Ranked B+-Tree

Algorithm SampleRankedB+Tree (Value v1, Value v2)
1. Find the rank r1 of the record which has the smallest DAY value greater than v1.
2. Find the rank r2 of the record which has the largest DAY value smaller than v2.
3. While sample size < desired sample size:
   3.a Generate a uniformly distributed random number i between r1 and r2.
   3.b If i has been generated previously, discard it and generate the next random number.
   3.c Using the rank information in the internal nodes, retrieve the record whose rank is i.
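Algorithm 1 can be sketched in Python as follows, with a sorted in-memory array standing in for the ranked B+-Tree; the `bisect` lookups play the role of the rank information stored in the internal nodes (the function name and toy data are our own, not the thesis's code):

```python
import bisect
import random

def sample_ranked(sorted_keys, v1, v2, n):
    """Sketch of Algorithm 1: sample n distinct records whose key lies
    strictly between v1 and v2, by drawing uniform ranks in [r1, r2]."""
    r1 = bisect.bisect_right(sorted_keys, v1)      # smallest key > v1
    r2 = bisect.bisect_left(sorted_keys, v2) - 1   # largest key < v2
    n = min(n, r2 - r1 + 1)
    seen, sample = set(), []
    while len(sample) < n:
        i = random.randint(r1, r2)     # step 3.a: uniform random rank
        if i in seen:                  # step 3.b: discard repeats
            continue
        seen.add(i)
        sample.append(sorted_keys[i])  # step 3.c: one random I/O per record
    return sample

random.seed(3)
keys = sorted(range(0, 200, 2))        # stand-in for the indexed DAY values
s = sample_ranked(keys, 50, 120, 10)
```

The last line of the loop is the expensive part on disk: each accepted rank costs one random I/O, which is the drawback discussed above.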
3.2.3 Block-based Random Sampling
While the classic algorithms of Olken and Antoshenkov sample records one-at-a-time,
it is possible to sample from an indexing structure such as a B+-Tree, and make use of
entire blocks of records [21, 55]. The number of records per block is typically on the order
of 100 to 1000, leading to a speedup of two or three orders of magnitude in the number of
records retrieved over time if all of the records in each block are consumed, rather than a
single record.
However, there are two problems with this approach. First, if the structure is used to
estimate the answer to some aggregate query, then the confidence bounds associated with
any estimate provided after N samples have been retrieved from a range predicate using a
B+-Tree (or some other index structure) may be much wider than the confidence bounds
that would have been obtained had all N samples been independent. In the extreme case
where the values on each block of records are closely correlated with one another, all of
the N samples may be no better than a single sample. Second, any algorithm which makes
use of such a sample must be aware of the block-based method used to sample the index,
and adjust its estimates accordingly, thus adding complexity to the query result estimating
process. For algorithms such as Bradley’s K-means algorithm [11], it is not clear whether
or not such samples are even appropriate.
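The effect of intra-block correlation on estimate quality can be illustrated with a small, self-contained simulation (an illustration of the general point, not an experiment from the thesis; all parameters are made up):

```python
import random
import statistics

random.seed(1)
B, NBLOCKS = 100, 200     # records per block, blocks in the file

# A file whose blocks are internally correlated: every record in block b
# is (roughly) the value b, so a whole block adds little new information.
blocks = [[b + random.gauss(0, 0.1) for _ in range(B)] for b in range(NBLOCKS)]
flat = [r for blk in blocks for r in blk]

def block_sample_mean(k):
    """Estimate the file's mean from k whole sampled blocks (k*B records)."""
    return statistics.fmean(r for blk in random.sample(blocks, k) for r in blk)

def independent_sample_mean(n):
    """Estimate the file's mean from n independently sampled records."""
    return statistics.fmean(random.sample(flat, n))

# Same number of records (5 blocks = 500 records), very different spread:
block_ests = [block_sample_mean(5) for _ in range(300)]
indep_ests = [independent_sample_mean(500) for _ in range(300)]
```

Under this extreme correlation, the 500 block-sampled records carry roughly the information of 5 independent ones, so the block-based estimator's standard deviation is many times larger than that of the independent estimator, matching the widened confidence bounds described above.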
3.3 Overview of Our Approach
We propose an entirely different strategy for implementing a materialized sample
view. Our strategy uses a new data structure called the ACE Tree to index the records
in the sample view. At the highest level, the ACE Tree partitions a data set into a
large number of different random samples such that each is a random sample without
replacement from one particular range query. When an application asks to sample from
some arbitrary range query, the ACE Tree and its associated algorithms filter and combine
these samples so that very quickly, a large and random subset of the records satisfying the
range query is returned. The sampling algorithm of the ACE Tree is an online algorithm,
which means that as time progresses, a larger and larger sample is produced by the
structure. At all times, the set of records retrieved is a true random sample of all the
database records matching the range selection predicate.
3.3.1 ACE Tree Leaf Nodes
The ACE Tree stores records in a large set of leaf nodes on disk. Every leaf node has
two components:
1. A set of h ranges, where a range is a pair of key values in the domain of the key attribute and h is the height of the ACE Tree. Unlike a B+-Tree, each leaf node in the ACE Tree stores records falling in several different ranges. The ith range associated with leaf node L is denoted by L.Ri. The h different ranges associated with a leaf node are hierarchical; that is, L.R1 ⊃ L.R2 ⊃ · · · ⊃ L.Rh. The first range in any leaf node, L.R1, always contains a uniform random sample of all records of the database, thus corresponding to the range (−∞, ∞). The hth range in any leaf node is the smallest of all the ranges in that leaf node.
2. A set of h associated sections. The ith section of leaf node L is denoted by L.Si. The section L.Si contains a random subset of all the database records with key values in the range L.Ri.
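Assuming a simple in-memory representation (the class name, helper, and example record values are ours, not the thesis's), a leaf node might be sketched as:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LeafNode:
    """One ACE Tree leaf: h nested ranges R1 ⊃ R2 ⊃ ... ⊃ Rh, and for
    each range Ri a section Si holding a random subset of the database
    records whose keys fall in Ri."""
    ranges: List[Tuple[int, int]]     # ranges[i] is L.R(i+1)
    sections: List[List[int]]         # sections[i] is L.S(i+1)

    def check_hierarchical(self) -> bool:
        """Each range must contain the next one."""
        return all(lo1 <= lo2 and hi2 <= hi1
                   for (lo1, hi1), (lo2, hi2)
                   in zip(self.ranges, self.ranges[1:]))

# A leaf shaped like the one in Figure 3-1 (h = 4; record values illustrative).
leaf = LeafNode(
    ranges=[(0, 100), (0, 50), (0, 25), (0, 12)],
    sections=[[75, 36, 41, 47], [22, 18, 10, 25], [3, 1], [11, 7]],
)
```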
Figure 3-1 depicts an example leaf node in the ACE Tree with attribute range values
written above each section and section numbers marked below. Records within each
section are shown as circles.
Figure 3-1. Structure of a leaf node of the ACE tree.
3.3.2 ACE Tree Structure
Logically, the ACE Tree is a disk-based binary tree data structure with internal nodes
used to index leaf nodes, and leaf nodes used to store the actual data. Since the internal
nodes in a binary tree are much smaller than disk pages, they are packed and stored
together in disk-page-sized units [27]. Each internal node has the following components:
1. A range R of key values associated with the node.
2. A key value k that splits R and partitions the data on the left and right of the node.
3. Pointers ptrl and ptrr, that point to the left and right children of the node.
4. Counts cntl and cntr, that give the number of database records falling in the ranges associated with the left and right child nodes. These values can be used, for example, during evaluation of online aggregation queries, which require the size of the population from which we are sampling [54].
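A matching sketch of an internal node, with illustrative counts (the field names and the values 16/16 are assumptions for the example, not the thesis's code):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class InternalNode:
    """Internal ACE Tree node: a range R, a key k that splits R, child
    pointers, and per-child record counts (usable, e.g., as population
    sizes for online aggregation)."""
    R: Tuple[int, int]
    k: int
    ptr_l: Optional["InternalNode"] = None
    ptr_r: Optional["InternalNode"] = None
    cnt_l: int = 0
    cnt_r: int = 0

# Root of a tree like Figure 3-2: R = [0, 100], split at key 50; the
# counts (16 records under each child) are made-up example values.
root = InternalNode(R=(0, 100), k=50, cnt_l=16, cnt_r=16)
population = root.cnt_l + root.cnt_r   # records under the root
```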
Figure 3-2 shows the logical structure of the ACE Tree. Ii,j refers to the jth internal
node at level i. The root node is labeled with a range I1,1.R = [0-100], signifying that
all records in the data set have key values within this range. The key of the root node
partitions I1,1.R into I2,1.R = [0-50] and I2,2.R = [51-100]. Similarly each internal node
divides the range of its descendents with its own key.
The ranges associated with each section of a leaf node are determined by the ranges
associated with each internal node on the path from the root node to the leaf. For
example, if we consider the path from the root node down to leaf node L4, the ranges that
we encounter along the path are 0-100, 0-50, 26-50 and 38-50. Thus for L4, L4.S1 has a
random sample of records in the range 0-100, L4.S2 has a random sample in the range
0-50, L4.S3 has a random sample in the range 26-50, while L4.S4 has a random sample in the range 38-50.

Figure 3-2. Structure of the ACE tree.
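The rule just illustrated — the ranges encountered on the root-to-leaf path become the leaf's section ranges — can be sketched as follows (a hypothetical helper, using the split keys shown for Figure 3-2):

```python
def section_ranges(lo, hi, path):
    """Ranges of sections S1..Sh for a leaf reached from the root range
    [lo, hi] via `path`, a list of (key, direction) pairs: the left child
    covers [lo, key] and the right child covers [key + 1, hi]."""
    ranges = [(lo, hi)]
    for key, step in path:
        if step == "L":
            hi = key
        else:
            lo = key + 1
        ranges.append((lo, hi))
    return ranges

# Leaf L4 of Figure 3-2: keys 50, 25, 37 with directions left, right,
# right give the ranges 0-100, 0-50, 26-50, 38-50 for sections S1..S4.
ranges = section_ranges(0, 100, [(50, "L"), (25, "R"), (37, "R")])
```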
3.3.3 Example Query Execution in ACE Tree
In the following discussion, we demonstrate how the ACE Tree efficiently retrieves a
large random sample of records for any given range query. The query algorithm is formally
described in Section 3.6.
Let Q = [30-65] be our example query postulated over the ACE Tree depicted in
Figure 3-2. The query algorithm starts at I1,1, the root node. Since I2,1.R overlaps Q, the
algorithm decides to explore the left child node labeled I2,1 in Figure 3-2. At this point
the two range values associated with the left and right children of I2,1 are 0-25 and 26-50.
Since the left child range has no overlap with the query range, the algorithm chooses to
explore the right child next. At this child node (I3,2), the algorithm picks leaf node L3 to
be the first leaf node retrieved by the index. Records from section 1 of L3 (which totally
encompasses Q) are filtered for Q and returned immediately to the consumer of the sample
as a random sample from the range [30-65], while records from sections 2, 3 and 4 are
stored in memory. Figure 3-3 shows the random sample from section 1 of L3, which can be used directly for answering query Q.
Figure 3-3. Random samples from section 1 of L3.
Next, the algorithm again starts at the root node and now chooses to explore the
right child node I2,2. After performing range comparisons, it explores the left child of I2,2
which is I3,3 since I3,4.R has no overlap with Q. The algorithm chooses to visit the left
child node of I3,3 next, which is leaf node L5. This is the second leaf node to be retrieved.
As depicted in Figure 3-4, since L5.R1 encompasses Q, the records of L5.S1 are filtered and returned immediately to the user as two additional samples from Q. Furthermore, section 2 records are combined with section 2 records of L3 to obtain a random sample of records in the range 0-100. These are again filtered and returned, giving four more samples from Q. Section 3 records are also combined with section 3 records of L3 to obtain a sample of records in the range 26-75. Since this range also encompasses Q, the records are again filtered and returned, adding four more records to our sample. Finally, section 4 records are
stored in memory for later use.
Note that after retrieving just two leaf nodes in our small example, the algorithm
obtains eleven randomly selected records from the query range. However, in a real index,
this number would be many times greater. Thus, the ACE Tree supports “fast first”
sampling from a range predicate: a large number of samples are returned very quickly. We
contrast this with a sample taken from a B+-Tree having a similar structure to the ACE
Tree depicted in Figure 3-2. The B+-Tree sampling algorithm would need to pre-select
which nodes to explore. Since four leaf nodes in the tree are needed to span the query
range, there is a reasonably high likelihood that the first four samples taken would need
to access all four leaf nodes. As the ACE Tree Query Algorithm progresses, it goes on to
retrieve the rest of the leaf nodes in the order L4, L6, L1, L7, L2, L8.
Figure 3-4. Combining samples from L3 and L5.
3.3.4 Choice of Binary Versus k-Ary Tree
The ACE Tree as described above can also be implemented as a k-ary tree instead of
a binary tree. For example, for a ternary tree, each internal node can have two (instead
of one) keys and three (instead of two) children. If the height of the tree was h, every
leaf node would still have h ranges and h sections associated with them. Like a standard
complete k-ary tree, the number of leaf nodes will be k^(h−1). However, the big difference
would be the manner in which a query is executed using a k-ary ACE Tree as opposed to
a binary ACE Tree. The query algorithm will always start at the root node and traverse
down to a leaf. However, at every internal node it will alternate between the k children in
a round-robin fashion. Moreover, since the data space would be divided into k equal parts
at each level, the query algorithm might have to make k traversals and hence access k leaf
nodes before it can combine sections that can be used to answer the query. This would
mean that the query algorithm will have to wait longer (than a binary ACE Tree) before
it can combine leaf node sections and thus return useful random samples. Since the goal
of the ACE Tree is to support “fast first” sampling, use of a binary tree instead of a k-ary
tree seems to be a better choice to implement the ACE Tree.
3.4 Properties of the ACE Tree
In this Section we describe the three important properties of the ACE Tree which
facilitate the efficient retrieval of random samples from any range query, and will be
instrumental in ensuring the performance of the algorithm described in Section 3.6.
3.4.1 Combinability
Figure 3-5. Combining two sections of leaf nodes of the ACE tree.
The various samples produced from processing a set of leaf nodes are combinable.
For example, consider the two leaf nodes L1 and L3, and the query “Compute a random
sample of the records in the query range Ql = [3 to 47]”. As depicted in Figure 3-5, first
we read leaf node L1 and filter the second section in order to produce a random sample
of size n1 from Ql which is returned to the user. Next we read leaf node L3, and filter its
second section L3.S2 to produce a random sample of size n2 from Ql which is also returned
to the user. At this point, the two sets returned to the user constitute a single random
sample from Ql of size n1 + n2. This means that as more and more nodes are read from
disk, the records contained in them can be combined to obtain an ever-increasing random
sample from any range query.
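A minimal sketch of combinability (the section contents are illustrative values, not records from the thesis's figures):

```python
def filter_range(section, lo, hi):
    """Keep only the records of a section that fall in the query range."""
    return [r for r in section if lo <= r <= hi]

# L1.S2 and L3.S2 each hold an independent random subset of the records
# in the range 0-50 (contents illustrative).  Filtering each for
# Ql = [3, 47] and concatenating yields one random sample of size n1 + n2.
L1_S2 = [22, 18, 10, 25, 49]
L3_S2 = [34, 29, 45, 33, 2]
s1 = filter_range(L1_S2, 3, 47)   # returned to the user first (n1 records)
s2 = filter_range(L3_S2, 3, 47)   # returned next (n2 records)
combined = s1 + s2                # a single random sample from Ql
```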
3.4.2 Appendability
The ith sections from two leaf nodes are appendable. That is, given two leaf nodes Lj and Lk, Lj.Si ∪ Lk.Si is always a true random sample of all records of the database with key values within the range Lj.Ri ∪ Lk.Ri. For example, reconsider the query, “Compute a random sample of the records in the query range Ql = [3 to 47]”. As depicted in Figure 3-6, we can append the third section from node L3 to the third section from node L1 and filter the result to produce yet another random sample from Ql. This means that sections are never wasted.
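A minimal sketch of appendability, in the same illustrative style (contents and variable names are ours):

```python
# L1.S3 holds a random subset of range 0-25, and L3.S3 of range 26-50
# (contents illustrative).  Their union is a true random sample of the
# union of the two ranges, i.e. of 0-50 -- so no section is wasted.
L1_S3, L1_R3 = [3, 1, 11], (0, 25)
L3_S3, L3_R3 = [29, 45, 33], (26, 50)

appended = L1_S3 + L3_S3
appended_range = (min(L1_R3[0], L3_R3[0]), max(L1_R3[1], L3_R3[1]))

# The appended set can then be filtered for Ql = [3, 47] as before.
from_Ql = [r for r in appended if 3 <= r <= 47]
```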
3.4.3 Exponentiality
The ranges in a leaf node are exponential. The number of database records that fall in L.Ri is twice the number of records that fall in L.Ri+1. This allows the ACE Tree to maintain the invariant that for any query Q′ over a relation R such that at least hµ database records fall in Q′, and with |R|/2^(k+1) <= |σQ′(R)| <= |R|/2^k for some k <= h − 1, there exists a pair of leaf nodes Li and Lj, where at least one-half of the database records falling in Li.Rk+2 ∪ Lj.Rk+2 are also in Q′. Here, µ is the average number of records in each section, and h is the height of the tree or, equivalently, the total number of sections in any leaf node.
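Under this invariant, the section index to append for a query of known selectivity can be computed as follows (a hypothetical helper, not part of the thesis; the cap at h − 2 simply ensures that section k + 2 exists in a leaf with h sections):

```python
import math

def covering_section(db_size, query_count, h):
    """Find the k with db_size/2**(k+1) <= query_count <= db_size/2**k;
    by the exponentiality property, appending section k + 2 of a suitable
    pair of leaves then covers the query range."""
    k = int(math.floor(math.log2(db_size / query_count)))
    k = min(k, h - 2)          # cap so that section k + 2 exists
    return k + 2

# A query matching between 1/4 and 1/2 of a 32-record database (k = 1)
# is covered by appending section 3 of two leaves, as in the example.
section = covering_section(db_size=32, query_count=10, h=4)
```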
Figure 3-6. Appending two sections of leaf nodes of the ACE tree.
While the formal statement of the exponentiality property is a bit complicated, the net result is simple: there is always a pair of leaf nodes whose sections can be appended to form a set which can be filtered to quickly obtain a sample from any range query Q′.
As an illustration, consider query Q over the ACE Tree of Figure 3-2. Note that the
number of database records falling in Q is greater than one-fourth, but less than half the
database size. The exponentiality property assures us that Q can be totally covered by
appending sections of two different leaf nodes. In our example, this means that Q can be
covered by appending section 3 of nodes L4 and L6. If RC = L4.R3 ∪ L6.R3, then by the invariant given above we can claim that |σQ(R)| >= (1/2) × |σRC(R)|.

3.5 Construction of the ACE Tree
In this Section, we present an I/O efficient, bulk construction algorithm for the ACE
Tree.
3.5.1 Design Goals
The algorithm for building an ACE Tree index is designed with the following goals in
mind:
1. Since the ACE Tree may index enormous amounts of data, construction of the tree should rely on efficient, external memory algorithms, requiring as few passes through the data set as possible.

2. In the resulting data structure, the data which are placed in each leaf node section must constitute a true random sample (without replacement) of all database records lying within the range associated with that section.

3. Finally, the tree must be constructed in such a way as to have the exponentiality, combinability, and appendability properties necessary for supporting the ACE Tree query algorithms.
3.5.2 Construction
The construction of the ACE Tree proceeds in two distinct phases. Each phase comprises two read/write passes through the data set (that is, constructing an ACE Tree from scratch requires two external sorts of a large database table). The two phases are as follows:
1. During Phase 1, the data set is sorted based on the record key values. This sorted order of records is used to provide the split points associated with each internal node in the tree.

2. During Phase 2, the data are organized into leaf nodes based on those key values. Disk blocks corresponding to groups of internal nodes can easily be constructed at the same time as the final pass through the data writes the leaf nodes to disk.
3.5.3 Construction Phase 1
The primary task of Phase 1 is to assign split points to each internal node of the tree.
To achieve this, the construction algorithm first sorts the data set based upon keys of the
records, as depicted in Figure 3-7.
After the dataset is sorted, the median record for the entire data set is determined
(this value is 50 in our example). This record’s key will be used as the key associated with
the root of the ACE Tree, and will determine L.R2 for every leaf node in the tree. We
denote this key value by I1,1.k, since the value serves as the key of the first internal node
in level 1 of the tree.
After determining the key value associated with the root node, the medians of each of
the two halves of the data set partitioned by I1,1.k are chosen as keys for the two internal
nodes at the next level: I2,1.k and I2,2.k, respectively. In the example of Figure 3-7, these
Figure 3-7. Choosing keys for internal nodes.
values are 25 and 75. I2,1.k and I2,2.k, along with I1,1.k, will determine L.R3 for every
leaf node in the tree. The process is then repeated recursively until enough medians2
have been obtained to provide every internal node with a key value. Note that at the same time that these various key values are determined, the values cntl and cntr can also be determined.
This simple strategy for choosing the various key values in the tree ensures that
the exponentiality property will hold. If the data space between Ii+1,2j−1.k and Ii,j.k
corresponds to some leaf node range L.Rn, then the data space between Ii+1,2j−1.k and
Ii+2,4j−2.k will correspond to some range L.Rn+1. Since Ii+2,4j−2.k is the midpoint of
2 We choose a value for the height of the tree in such a manner that the expected size of a leaf node (see Section 3.5.6) does not exceed one logical disk block. Choosing a node size that corresponds to the block size is done for the same reason it is done in most traditional indexing structures: typically, the system disk block size has already been carefully chosen by a DBA to balance speed of sequential access (which demands a larger block size) with the cost of accessing more data than is needed (which demands a smaller block size).
the data space between Ii+1,2j−1.k and Ii,j.k, we know that two times as many database
records should fall in L.Rn, compared with L.Rn+1.
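Phase 1's recursive choice of medians can be sketched as follows (our own illustration; the real algorithm works on the externally sorted file rather than an in-memory list):

```python
def assign_keys(sorted_keys):
    """Phase-1 sketch: take the median of the sorted keys as the root's
    key, then recurse on the two halves, level by level, until every
    internal node has a key."""
    keys_by_level = []
    segments = [sorted_keys]
    while segments and len(segments[0]) > 1:
        level, next_segments = [], []
        for seg in segments:
            m = len(seg) // 2
            level.append(seg[m])                   # median -> node key
            next_segments += [seg[:m], seg[m + 1:]]
        keys_by_level.append(level)
        segments = next_segments
    return keys_by_level

# For keys 1..31 the root key is 16, the next level holds 8 and 24, etc.
levels = assign_keys(list(range(1, 32)))
```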
The following example also shows how the invariant described in Section 3.4.3 is guaranteed by adopting the aforementioned strategy of assigning key values to internal
nodes. Consider the ACE Tree of Figure 3-2. Figure 3-8 shows the keys of the internal
nodes as medians of the dataset R. We also consider two example queries, Q1 and Q2 such
that the number of database records falling in Q2 is greater than one-fourth but less than
one-half of the database size, while the number of database records falling in Q1 is more
than half the database size.
Figure 3-8. Exponentiality property of ACE tree.
Q1 can be answered by appending section 2 of (for example) L4 and L8 (refer to Figure 3-2). Let RC1 = L4.R2 ∪ L8.R2. Then all the database records fall in RC1. Moreover, since |σQ1(R)| >= |R|/2, we have |σQ1(R)| >= (1/2) × |σRC1(R)|. Similarly, Q2 can be answered by appending section 3 of (for example) L4 and L6. If RC2 = L4.R3 ∪ L6.R3, then half the database records fall in RC2. Also, since |σQ2(R)| >= |R|/4, we have |σQ2(R)| >= (1/2) × |σRC2(R)|. This can be generalized to obtain the invariant stated in Section 3.4.3.
3.5.4 Construction Phase 2
The objective of Phase 2 is to construct leaf nodes with appropriate sections and
populate them with records. This can be achieved by the following three steps:
Figure 3-9. Phase 2 of tree construction: (a) records assigned section numbers; (b) records assigned leaf numbers; (c) records organized into leaf nodes.
1. Assign a uniformly generated random number between 1 and h to each record as its section number.

2. Associate an additional random number with the record that will be used to identify the leaf node to which the record will be assigned.

3. Finally, re-organize the file by performing an external sort to group records in a given leaf node and a given section together.
Figure 3-9(a) depicts our example data set after we have assigned each record a randomly generated section number, assuming four sections in each leaf node.

In Step 2, the algorithm assigns one more randomly generated number to each record, which will identify the leaf node to which the record will be assigned. We assume for our example that the number of leaf nodes is 2^(h−1) = 2^3 = 8. The number to identify the leaf node is assigned as follows.
1. First, the section number of the record is checked. We denote this value as s.
2. We then start at the root of the tree and traverse down by comparing the record key with s − 1 key values. After the comparisons, if we arrive at an internal node, Ii,j, then we assign the record to one of the leaves in the subtree rooted at Ii,j.
From the example of Figure 3-9(a), the first record having key value 3 has been
assigned to section 1. Since this record can be randomly assigned to any leaf from 1
through 8, we assign it to leaf 7.
The next record of Figure 3-9(a) has been assigned to section number 2. Referring
back to Figure 3-7, we see that the key of the root node is 50. Since the key of the record
is 7 which is less than 50, the record will be assigned to a leaf node in the left subtree of
the root. Hence we assign a leaf node between 1 and 4 to this record. In our example, we
randomly choose the leaf node 3.
For the next record having key value 10, we see that the section number assigned is 3.
To assign a leaf node to this record, we initially compare its key with the key of the root
node. Referring to Figure 3-7, we see that 10 is smaller than 50; hence we then compare it
with 25 which is the key of the left child node of the root. Since the record key is smaller
than 25, we assign the record to some leaf node in the left subtree of the node with key 25
by assigning to it a random number between 1 and 2.
The section number and leaf node identifiers for each record are written in a small
amount of temporary disk space associated with each record. Once all records have been
assigned to leaf nodes and sections, the dataset is re-organized into leaf nodes using a
two-pass external sorting algorithm as follows:
• Records are sorted in ascending order of their leaf node number.
• Records with the same leaf node number are arranged in ascending order of their section number.
The re-organized data set is depicted in Figure 3-9(c).
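The three steps of Phase 2 can be sketched end-to-end as follows (a simplified, in-memory illustration with our own names; the real algorithm tags records in temporary disk space and uses a two-pass external sort):

```python
import random

def phase2_assign(records, keys_by_level, h):
    """Phase-2 sketch: each record draws a random section number s in
    1..h, is routed down s - 1 levels of split keys, and then draws a
    random leaf among the leaves of the subtree it reached.  Sorting by
    (leaf, section) groups the file the way the final pass writes it."""
    n_leaves = 2 ** (h - 1)
    tagged = []
    for key in records:
        s = random.randint(1, h)           # step 1: section number
        lo, hi, j = 1, n_leaves, 0         # feasible leaves; node index
        for level in range(s - 1):         # step 2: s - 1 comparisons
            mid = (lo + hi) // 2
            if key <= keys_by_level[level][j]:
                hi, j = mid, 2 * j
            else:
                lo, j = mid + 1, 2 * j + 1
        leaf = random.randint(lo, hi)      # random feasible leaf
        tagged.append((leaf, s, key))
    tagged.sort()                          # step 3: external sort in practice
    return tagged

random.seed(5)
split_keys = [[50], [25, 75], [12, 37, 62, 88]]   # keys as in Figure 3-7
out = phase2_assign(list(range(100)), split_keys, h=4)
```

For instance, a record with key 10 and section number 3 is compared with 50 and then 25, so it can only land in leaf 1 or 2, matching the worked example above.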
3.5.5 Combinability/Appendability Revisited
In Phase 2 of the tree construction, we observe that all records belonging to some
section k are segregated based upon the result of the comparison of their key with the
appropriate medians, and are then randomly assigned a leaf node number from the feasible
ones. Thus, if records from section s of all leaf nodes are merged together, we will obtain
all of the section s records. This ensures the appendability property of the ACE Tree.
Also note that the probability of assignment of one record to a section is unaffected
by the probability of assignment of some other record to that section. Since this results
in each section having a random subset of the database records, it is possible to merge
a sample of the records from one section that match a range query with a sample of
records from a different section that match the same query. This will produce a larger
random sample of records falling in the range of the query, thus ensuring the combinability
property.
3.5.6 Page Alignment
In Phase 2 of the construction algorithm, section numbers and leaf node numbers
are randomly generated. Hence we can only predict on expectation the number of records
that will fall in each section of each leaf node. As a result, section sizes within each leaf
node can differ, and the size of a leaf node itself is variable and will generally not be equal
to the size of a disk page. Thus when the leaf nodes are written out to disk, a single leaf
node may span across multiple disk pages or may be contained within a single disk page.
This situation could be avoided if we fix the size of each section a priori. However,
this poses a serious problem. Consider two leaf node sections Li.Sj and Li+1.Sj. We
can force these two sections to contain the same number of records by ensuring that the
set of records assigned to section j in Phase 2 of the construction algorithm has equal
representation from Li.Rj and Li+1.Rj. However, this means that the set of records
assigned to section j is no longer random. If we fix the section size and force a set number
of records to fall in each section, we invalidate the appendability and combinability
properties of the structure. Thus, we are forced to accept a variable section size.

In order to implement variable section size, we can adopt one of the following two schemes:
1. Enforce fixed-sized leaf nodes and allow variable-sized sections within the leaf nodes.
2. Allow variable-sized leaf nodes along with variable-sized sections.
If we choose the fixed-sized leaf node, variable-sized section scheme, leaf node size is
fixed in advance. However, section size is allowed to vary. This allows full sections to grow
further by claiming any available space within the leaf node. The leaf node size chosen
should be large enough to prevent any leaf node from becoming completely filled up, which
prevents the partitioning of any leaf node across two disk pages. The major drawback of
this scheme is that the average leaf node space utilization will be very low. Assuming a
reasonable set of ACE Tree parameters, a quick calculation shows that if we want to be
99% sure that no leaf node gets filled up, the average leaf node space utilization will be
less than 15%.
The variable-sized leaf node, variable-sized section scheme does not impose a size
limit on either the leaf node or the section. It allows leaf nodes to grow beyond disk page
boundaries, if space is required. The important advantage of this scheme is that it is
space-efficient. The main drawback of this approach is that leaf nodes may span multiple
disk pages, and hence all such pages must be accessed in order to retrieve such a leaf node.
Given that most of the cost associated with reading an arbitrary leaf page is associated
with the disk head movement needed to move the disk arm to the appropriate cylinder,
this does not pose too much of a problem. Hence we use this scheme for the construction
of leaf nodes of the ACE Tree.
3.6 Query Algorithm
In this Section, we describe in detail the algorithm used to answer range queries using
the ACE Tree.
3.6.1 Goals
The algorithm has been designed to meet the primary goal of achieving “fast-first”
sampling from the index structure, meaning that it greedily maximizes the number of
records relevant to the query that are retrieved in the early stages of execution. To meet
this goal, the query answering algorithm identifies the leaf nodes that contain the
maximum number of sections relevant to the query. A section $L_{i_1}.S_j$ is relevant for a
range query $Q$ if $L_{i_1}.R_j \cap Q \neq \emptyset$ and
$L_{i_1}.R_j \cup L_{i_2}.R_j \cup \cdots \cup L_{i_n}.R_j \supseteq Q$, where
$L_{i_1}, \ldots, L_{i_n}$ are some leaf nodes in the tree. The query algorithm prioritizes
retrieval of leaf nodes so as to:

• Facilitate the combination of sections so as to maximize $n$ in the above formulation, and

• Maximize the number of relevant sections in each retrieved leaf node $L$ such that
$L.S_j \cap Q \neq \emptyset$ for $j = (c+1), \ldots, h$, where $L.R_c$ is the smallest range in $L$ that
encompasses Q.
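Concretely, if key values are normalized to [0, 1), the level-j section ranges are the 2^{j-1} dyadic intervals, so the first condition reduces to a simple interval-overlap test. The helper below is an illustrative sketch under that normalization assumption (all names are our own):

```python
def overlapping_sections(q_lo, q_hi, h):
    """For each section level j = 1..h, count the dyadic intervals of width
    2^-(j-1) over [0, 1) that intersect the query range [q_lo, q_hi).
    At each level, the union of these intervals covers Q."""
    counts = {}
    for j in range(1, h + 1):
        width = 1.0 / (2 ** (j - 1))
        n = 0
        for k in range(2 ** (j - 1)):
            lo, hi = k * width, (k + 1) * width
            if lo < q_hi and q_lo < hi:  # non-empty intersection with Q
                n += 1
        counts[j] = n
    return counts

print(overlapping_sections(0.30, 0.60, 4))  # {1: 1, 2: 2, 3: 2, 4: 3}
```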
3.6.2 Algorithm Overview
At a high level, the query answering algorithm retrieves the leaf nodes relevant to
answering a query via a series of stabs or traversals, accessing one leaf node per stab.
Each stab begins at the root node and traverses down to a leaf. The distinctive feature of
the algorithm is that at each internal node that is traversed during a stab, the algorithm
chooses to access the child node that was not chosen the last time the node was traversed.
For example, imagine that for a given internal node I, the algorithm chooses to traverse
to the left child of I during a stab. The next time that I is accessed during a stab, the
algorithm will choose to traverse to the right child node. This can be seen in Figure
3-10, when we compare the paths taken by Stab 1 and Stab 2. The algorithm chooses to
traverse to the left child of the root node during the first stab, while during the second
stab it chooses to traverse to the right child of the root node.
The advantage of retrieving leaf nodes in this back and forth sequence is that it allows
us to quickly retrieve a set of leaf nodes with the most disparate sections possible in a
Figure 3-10. Execution runs of the query answering algorithm with (a) 1 contributing section, (b) 6 contributing sections, (c) 7 contributing sections, and (d) 16 contributing sections.
given number of stabs. The reason that we want a non-homogeneous set of nodes is that
nodes from very distant portions of a query range will tend to have sections covering large
ranges that do not overlap. This allows us to append sections of newly retrieved leaf nodes
with the corresponding sections of previously retrieved leaf nodes. The samples obtained
can then be filtered and immediately returned.
This order of retrieval is implemented by associating a bit with each internal node
that indicates whether the next child node to be retrieved should be the left node or the
right node. The value of this bit is toggled every time the node is accessed. Figure 3-10
illustrates the choices made by the algorithm at each internal node during four separate
stabs. Note that when the algorithm reaches an internal node where the range associated
with one of the child nodes has no overlap with the query range, the algorithm always
picks the child node that has overlap with the query, irrespective of the value of the
indicator bit. The only exception to this is when all leaf nodes of the subtree rooted at
an internal node which overlaps the query range have been accessed. In such a case, the
internal node which overlaps the query range is not chosen and is never accessed again.
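Ignoring query-range pruning and the done flags, the toggle-bit discipline alone determines the leaf visit order, which turns out to be the bit-reversal permutation of the leaves. The sketch below is our own simplified prototype, not the dissertation's implementation:

```python
def stab_order(height):
    """Visit every leaf of a complete binary tree of the given height via
    repeated root-to-leaf stabs, flipping each internal node's direction
    bit after use (0 = left, 1 = right)."""
    next_child = {}               # internal node id -> direction bit
    n_leaves = 2 ** (height - 1)
    order = []
    for _ in range(n_leaves):
        node, leaf = 1, 0         # heap-style node ids; leaf index built bitwise
        for _ in range(height - 1):
            bit = next_child.get(node, 0)
            next_child[node] = 1 - bit      # toggle for the next stab
            leaf = 2 * leaf + bit
            node = 2 * node + bit
        order.append(leaf)
    return order

print(stab_order(4))  # bit-reversal order: [0, 4, 2, 6, 1, 5, 3, 7]
```

Note how each stab lands as far as possible from the previous one, which is exactly the "disparate sections" property the text describes.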
3.6.3 Data Structures
In addition to the structure of the internal and leaf nodes of the ACE Tree, the queryalgorithm uses and updates the following two memory resident data structures:
1. A lookup table T , to store internal node information in the form of a pair of values(next = [LEFT] | [RIGHT], done = [TRUE] | [FALSE]). The first value indicates whetherthe next node to be retrieved should be the left child or right child. The second valueis TRUE if all leaf nodes in the subtree rooted at the current node have already beenaccessed, else it is FALSE.
2. An array buckets[h] to hold sections of all the leaf nodes which have been accessed sofar and whose records could not be used to answer the query. h is the height of theACE Tree.
3.6.4 Actual Algorithm
We now present the algorithms used for answering queries using the ACE Tree.
Algorithm 2 simply calls Algorithm 3 which is the main tree traversal algorithm, called
Shuttle(). Each traversal or stab begins at the root node and proceeds down to a leaf
node. In each invocation of Shuttle(), a recursive call is made to either its left or right
child with the recursion ending when it reaches a leaf node. At this point, the sections in
the leaf node are combined with previously retrieved sections so that they can be used to
answer the query. The algorithm for combining sections is described in Algorithm 4. This
Algorithm 2: Query Answering Algorithm

Algorithm Answer(Query Q)
  Let root be the root of the ACE Tree
  While (!T.lookup(root).done)
    T.lookup(root).done = Shuttle(Q, root)
Algorithm 3: ACE Tree traversal algorithm

Algorithm Shuttle(Query Q, Node curr_node)
  If (curr_node is an internal node)
    left_node = curr_node→get_left_node()
    right_node = curr_node→get_right_node()
    If (left_node is done AND right_node is done)
      Mark curr_node as done
    Else if (left_node is done)          // only the right child remains
      Shuttle(Q, right_node)
    Else if (right_node is done)         // only the left child remains
      Shuttle(Q, left_node)
    Else                                 // neither child is done
      If (Q overlaps only with left_node.R)
        Shuttle(Q, left_node)
      Else if (Q overlaps only with right_node.R)
        Shuttle(Q, right_node)
      Else                               // Q overlaps both sides or none
        If (next_node is LEFT)
          Shuttle(Q, left_node)
          Set next_node to RIGHT
        Else                             // next_node is RIGHT
          Shuttle(Q, right_node)
          Set next_node to LEFT
  Else                                   // curr_node is a leaf node
    Combine_Tuples(Q, curr_node)
    Mark curr_node as done
algorithm determines the sections that are required to be combined with every new section
s that is retrieved and then searches for them in the array buckets[]. If all sections are
found, it combines them with s and removes them from buckets[]. If it does not find all
the required sections in buckets[], it stores s in buckets[].
Algorithm 4: Algorithm for combining sections

Algorithm Combine_Tuples(Query Q, LeafNode node)
  For each section s in node do
    Store the section numbers required to be
      combined with s to span Q in a list, list
    flag = true
    For each section number i in list do
      If buckets[] does not have section i
        flag = false
    If (flag == true)
      Combine all sections from list with s
        and use the records to answer Q
    Else
      Store s in the appropriate bucket
3.6.5 Algorithm Analysis
We now present a lower bound on the expected performance of the ACE Tree index
for sampling from a relational selection predicate. For simplicity, our analysis assumes that
the number of leaf nodes in the tree is a power of 2.

Lemma 1. Efficiency of the ACE Tree for query evaluation.

• Let $n$ be the total number of leaf nodes in an ACE Tree used to sample from some arbitrary range query $Q$.

• Let $p$ be the largest power of 2 no greater than $n$.

• Let $\mu$ be the mean section size in the tree.

• Let $\alpha$ be the fraction of database records falling in $Q$.

• Let $N$ be the size of the sample from $Q$ that has been obtained after $m$ ACE Tree leaf nodes have been retrieved from disk.

If $m$ is not too large (that is, if $m \le 2\alpha n + 2$), then:

$$E[N] \ge \frac{\mu}{2}\, p \log_2 p$$

where $E[N]$ denotes the expected value of $N$ (the mean value of $N$ after an infinite number
of trials).
Proof. Let $I_{i,j}$ and $I_{i,j+1}$ be the two internal nodes in the ACE Tree where
$R = I_{i,j}.R \cup I_{i,j+1}.R$ covers $Q$ and $i$ is maximized. As long as the Shuttle algorithm has not
retrieved all the children of $I_{i,j}$ and $I_{i,j+1}$ (this is the case as long as $m \le 2\alpha n + 2$), when
the $m$th leaf node has been processed, the expected number of new samples obtained is:

$$N_m = \sum_{k=1}^{\lfloor \log_2 m \rfloor} \sum_{l=1}^{2^{k-1}} w_{kl}\, \mu$$

where the outer summation is over each of the $h - i$ contributing sections of the leaf
nodes, starting with section number $i$ up to section number $h$, while $\sum_l w_{kl}$ represents the
fraction of records of the $2^{k-1}$ combined sections that satisfy $Q$. By the exponentiality
property, $\sum_l w_{kl} \ge 1/2$ for every $k$, so:

$$N_m \ge \frac{\mu}{2} \log_2 m$$

Thus, after $m$ leaf nodes have been obtained, the total expected number of samples is given
by:

$$E[N] \ge \sum_{k=1}^{m} N_k \ge \sum_{k=1}^{m} \frac{\mu}{2} \log_2 k \ge \frac{\mu}{2}\, m \log_2 m$$

If $m$ is a power of 2, the result is proven.
Lemma 2. The expected number of records $\mu$ in any leaf node section is given by:

$$E[\mu] = \frac{|R|}{h\, 2^{h-1}}$$

where $|R|$ is the total number of database records, $h$ is the height of the ACE Tree, and $2^{h-1}$
is the number of leaf nodes in the ACE Tree.
Proof. The probability of assigning a record to any section $i$, $i \le h$, is $1/h$. Given that
the record is assigned to section $i$, it can be assigned to only one of $2^{i-1}$ leaf node groups
after comparing with the appropriate medians. Since each group has $2^{h-1}/2^{i-1}$
candidate leaf nodes, the probability structure implies that for any leaf node $L_j$:

$$E[\mu_{i,j}] = \sum_{t \in R} \frac{1}{h} \left( \sum_{k=1}^{2^{i-1}-1} 0 \cdot \frac{1}{2^{i-1}} \;+\; \frac{1}{2^{i-1}} \cdot \frac{2^{i-1}}{2^{h-1}} \right) = \frac{|R|}{h\, 2^{h-1}}$$

This completes the proof of the lemma.
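The assignment procedure underlying this proof is easy to simulate. The sketch below assumes keys uniform on [0, 1) (so the $2^{i-1}$ key groups are equal-width intervals) and checks that the average section size equals $|R|/(h \cdot 2^{h-1})$; it is an illustration, not the dissertation's build code:

```python
import random

def assign(num_records=6000, h=4, seed=1):
    """Assign each record a uniform section number 1..h, then a uniform
    candidate leaf within the key's group at that level; return the per-
    (leaf, section) counts and the predicted mean section size."""
    random.seed(seed)
    n_leaves = 2 ** (h - 1)
    counts = {}
    for _ in range(num_records):
        key = random.random()
        i = random.randint(1, h)              # section number, uniform on 1..h
        group = int(key * 2 ** (i - 1))       # which of the 2^(i-1) key groups
        per_group = n_leaves // 2 ** (i - 1)  # candidate leaves in that group
        leaf = group * per_group + random.randrange(per_group)
        counts[(leaf, i)] = counts.get((leaf, i), 0) + 1
    avg = num_records / (h * n_leaves)        # = |R| / (h * 2^(h-1))
    return counts, avg

counts, avg = assign()
print(f"expected mean section size: {avg:.1f}")  # 6000 / (4 * 8) = 187.5
```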
3.7 Multi-Dimensional ACE Trees
The ACE Tree can be easily extended to support queries that include multi-dimensional
predicates. The change needed to incorporate this extension is to use a k-d binary tree
instead of the regular binary tree for the ACE Tree. Let a1, . . . , ak be the k key attributes
for the k-d ACE Tree. To construct such a tree, the root node would be the median of all
the a1 values in the database. Thus the root partitions the dataset based on a1. At the
next step, we need to assign values for level 2 internal nodes of the tree. For each of the
resulting partitions of the dataset, we calculate the median of all the a2 values. These two
medians are assigned to the two internal nodes at level 2 respectively, and we recursively
partition the two halves based on a2. This process is continued until we finish level k.
At level k + 1, we again consider a1 for choosing the medians. We would then assign a
randomly generated section number to every record. The strategy for assigning a leaf node
number to the records would also be similar to the one described in Section 3.5.4 except
that the appropriate key attribute is used while performing comparisons with the internal
nodes. Finally, the dataset is sorted into leaf nodes as in Figure 3-9(c).
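The level-by-level median computation can be sketched as follows, assuming the data fit in memory (a real build would use external sorting); the function illustrates the construction order and is our own code, not the dissertation's:

```python
import statistics

def kd_levels(points, depth, k=2):
    """Return, for each of `depth` levels, the list of split medians.
    Level d splits on attribute d mod k, cycling as in a k-d tree."""
    levels, partitions = [], [list(points)]
    for d in range(depth):
        attr = d % k
        meds, nxt = [], []
        for part in partitions:
            m = statistics.median(p[attr] for p in part)
            meds.append(m)
            nxt.append([p for p in part if p[attr] <= m])
            nxt.append([p for p in part if p[attr] > m])
        levels.append(meds)
        partitions = nxt
    return levels

pts = [(1, 9), (2, 4), (3, 7), (4, 1), (5, 8), (6, 2), (7, 6), (8, 3)]
print(kd_levels(pts, 3))  # [[4.5], [5.5, 4.5], [3.0, 2.0, 7.0, 6.0]]
```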
Query answering with the k-d ACE Tree can use the Shuttle algorithm described
earlier with a few minor modifications. Whenever a section is retrieved by the algorithm,
only records which satisfy all predicates in the query should be returned. Also, the mth
Figure 3-11. Sampling rate of an ACE Tree vs. rate for a B+-Tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 0.25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time plotted as a percentage of the time required to scan the relation.
sections of two leaf nodes can be combined only if they match in all m dimensions. The
nth sections of two leaf nodes can be appended only if they match in the first n − 1
dimensions and form a contiguous interval over the nth dimension.
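This append rule can be written as a small predicate. The encoding below, with each section range given as a list of per-dimension intervals, is our own illustrative choice:

```python
def can_append(r1, r2, n):
    """The nth sections of two leaf nodes may be appended iff their ranges
    agree on the first n-1 dimensions and are contiguous along dimension n
    (1-based). Each range is a list of (lo, hi) intervals, one per dimension."""
    if r1[:n - 1] != r2[:n - 1]:
        return False
    (lo1, hi1), (lo2, hi2) = r1[n - 1], r2[n - 1]
    return hi1 == lo2 or hi2 == lo1

a = [(0, 50), (0, 25)]
b = [(0, 50), (25, 50)]
c = [(50, 100), (25, 50)]
print(can_append(a, b, 2), can_append(a, c, 2))  # True False
```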
3.8 Benchmarking
In this Section, we describe a set of experiments designed to test the ability of the
ACE Tree to quickly provide an online random sample from a relational selection predicate
as well as to demonstrate that the memory requirement of the ACE Tree is reasonable.
We performed two sets of experiments. The first set is designed to test the utility of the
ACE Tree for use with one-dimensional data, where the ACE Tree is compared with
a simple sequential file scan as well as Antoshenkov’s algorithm for sampling from a
ranked B+-Tree. In the second set, we compare a multi-dimensional ACE Tree with the
Figure 3-12. Sampling rate of an ACE Tree vs. rate for a B+-Tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 2.5% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time plotted as a percentage of the time required to scan the relation.
sequential file scan as well as with the obvious extension of Antoshenkov’s algorithm to a
two-dimensional R-Tree.
3.8.1 Overview
All experiments were performed on a Linux workstation with 1GB of RAM, a 2.4GHz
CPU, and two 80GB, 15,000 RPM Seagate SCSI disks. 64KB data pages were used.
Experiment 1. For the first set of experiments, we consider the problem of sampling
from a range query of the form:
SELECT * FROM SALE
WHERE SALE.DAY >= i AND SALE.DAY <= j
We implemented and tested the following three random-order record retrieval algorithms for sampling the range query:
Figure 3-13. Sampling rate of an ACE Tree vs. rate for a B+-Tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 25% of the database records. The graph shows the percentage of database records retrieved by all three sampling techniques versus time plotted as a percentage of the time required to scan the relation.
1. ACE Tree query algorithm: The ACE Tree was implemented exactly as described in this thesis. In order to use the ACE Tree to aid in sampling from the SALE relation, a materialized sample view for the relation was created, using SALE.DAY as the indexed attribute.
2. Random sampling from a B+-Tree: Antoshenkov's algorithm for sampling from a ranked B+-Tree was implemented as described in Algorithm 1. The B+-Tree used in the experiment was a primary index on the SALE relation (that is, the underlying data were actually stored within the tree), and was constructed using the standard B+-Tree bulk construction algorithm.
3. Sampling from a randomly permuted file: We implemented this random sampling technique as described in Section 3.2.1 of this chapter. This is the standard sampling technique used in previous work on online aggregation. The SALE relation was randomly permuted by assigning a random key value k to each record. All of the records from SALE were then sorted in ascending order of the k values using a two-phase, multi-way merge sort (TPMMS) (see Garcia-Molina et al. [38]). As the sorted records are written back to disk in the final pass of the TPMMS, k is removed from the file. To sample from a range predicate using a randomly permuted file, the
Figure 3-14. Sampling rate of an ACE Tree vs. rate for a B+-Tree and scan of a randomly permuted file, with a one-dimensional selection predicate accepting 2.5% of the database records. The graph is an extension of Figure 3-12 and shows results until all three sampling techniques return all the records matching the query predicate.
file is scanned from front to back and all records matching the range predicate are immediately returned.
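The permute-then-scan scheme is simple to sketch. The code below uses an in-memory sort as a stand-in for the external TPMMS, and the names are ours:

```python
import random

def build_permuted_file(records, seed=42):
    """Pair each record with a random key, sort on it, then drop the key --
    an in-memory analogue of the two-phase multiway merge sort step."""
    rng = random.Random(seed)
    keyed = [(rng.random(), r) for r in records]
    keyed.sort(key=lambda kr: kr[0])
    return [r for _, r in keyed]

def sample_range(permuted, lo, hi):
    """Front-to-back scan; records matching the predicate arrive in random
    order, so every prefix is a uniform random sample of the range."""
    for r in permuted:
        if lo <= r <= hi:
            yield r

f = build_permuted_file(range(100))
print(list(sample_range(f, 10, 19)))  # the ten matching records, shuffled
```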
For the first set of experiments, we synthetically generated the SALE relation to be
20GB in size with 100B records, resulting in around 200 million records in the relation.
We began the first set of experiments by sampling from 10 different range selection
predicates over SALE using the three sampling techniques described above. 0.25% of the
records from SALE satisfied each range selection predicate. For each of the three random
sampling algorithms, we recorded the total number of random samples retrieved by the
algorithm at each time instant. The average number of random samples obtained for each
of the ten queries was then calculated. This average is plotted as a percentage of the total
number of records in SALE along the Y-axis in Figure 3-11. On the X-axis, we have plotted
the elapsed time as a percentage of the time required to scan the entire relation. We chose
Figure 3-15. Number of records needed to be buffered by the ACE Tree for queries with (a) 0.25% and (b) 2.5% selectivity. The graphs show the number of records buffered as a fraction of the total database records versus time plotted as a percentage of the time required to scan the relation.
this metric because the linear scan serves as the baseline record retrieval method. The test
was then repeated with two more sets of selection predicates that are satisfied by 2.5% and
25% of SALE's records, respectively. The results are plotted in Figure 3-12 and Figure
3-13. For all three figures, results are shown for the first 15 seconds of execution,
corresponding to approximately 4% of the time required to scan the relation. We show an
additional graph in Figure 3-14 for the 2.5% selectivity case, where we plot results until all
three record retrieval algorithms return all the records matching the query predicate.
Finally, we provide experimental results indicating the number of records that
must be buffered by the ACE Tree query algorithm for two different query
selectivities. Figure 3-15(a) shows the minimum, maximum and the average number of
records stored for ten different queries having a selectivity of 0.25% while Figure 3-15(b)
shows similar results for queries having selectivity 2.5%.
Experiment 2. For the second set of experiments, we add an additional attribute
AMOUNT to the SALE relation and test the following two-dimensional range query:
SELECT * FROM SALE
WHERE SALE.DAY >= d1 AND SALE.DAY <= d2
AND SALE.AMOUNT >= a1 AND SALE.AMOUNT <= a2
To generate the SALE relation, each (DAY, AMOUNT) pair in each record is generated by
sampling from a bivariate uniform distribution.
In this experiment, we again test the three random sampling options given above:
1. ACE Tree query algorithm: The ACE Tree for multi-dimensional data (a k-d ACE Tree) was implemented exactly as described in Section 3.7. It was used to create a materialized sample view over the DAY and AMOUNT attributes.
2. Random sampling from an R-Tree: Antoshenkov's algorithm for sampling from a ranked B+-Tree was extended in the obvious fashion for sampling from an R-Tree [46]. Just as in the case of the B+-Tree, the R-Tree is created as a primary index, and the data from the SALE relation are actually stored in the leaf nodes of the tree. The R-Tree was constructed in bulk using the well-known Sort-Tile-Recursive [81] bulk construction algorithm.
3. Sampling from a randomly permuted file: We implemented this random sampling technique in the same manner as in Experiment 1.
In this experiment, the SALE relation was generated so as to be about 16 GB in
size. Each record in the relation was 100B in size, resulting in approximately 160 million
records.
Just as in the first experiment, we began by sampling from 10 different range selection
predicates over SALE using the three sampling techniques described above. 0.25% of
the records from SALE satisfied each range selection predicate. For all the three random
sampling algorithms, we recorded the total number of random samples retrieved by the
algorithm at each time instant. The average number of random samples obtained for each
of the ten queries is then computed. This average is plotted as a percentage of the total
number of records in SALE along the Y-axis in Figure 3-16. On the X-axis, we have plotted
the elapsed time as a percentage of the time required to scan the entire relation. The
test was then repeated with two more sets of selection predicates that are satisfied by 2.5%
and 25% of the SALE relation's records. The results are plotted in Figure 3-17 and
Figure 3-18, respectively.
3.8.2 Discussion of Experimental Results
There are several important observations that can be made from the experimental
results. Irrespective of the selectivity of the query, we observed that the ACE Tree clearly
provides a much faster sampling rate during the first few seconds of query execution
compared with the other approaches. This advantage tends to degrade over time, but since
sampling is often performed only as long as more samples are needed to achieve a desired
accuracy, the fact that the ACE Tree can provide a large, online random
sample almost immediately indicates its practical utility.
Another observation indicating the utility of the ACE Tree is that while it was the
top performer over the three query selectivities tested, the best alternative to the ACE
Tree generally changed depending on the query selectivity. For highly selective queries,
Figure 3-16. Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly permuted file, with a spatial selection predicate accepting 0.25% of the database tuples.
the randomly-permuted file is almost useless due to the fact that the chance that any
given record is accepted by the relational selection predicate is very low. On the other
hand, the B+-Tree (and the R-Tree over multi-dimensional data) performs relatively
well for highly selective queries. The reason for this is that during the sampling, if the
query range is small, then all the leaf pages of the B+-Tree (or R-Tree) containing records
that match the query predicate are retrieved very quickly. Once all of the relevant pages
are in the buffer, the sampling algorithm does not have to access the disk to satisfy
subsequent sample requests and the rate of record retrieval increases rapidly. However,
for less selective queries, the randomly-permuted file works well since it can make use of
an efficient, sequential disk scan to retrieve records. As long as a relatively large fraction
of the records retrieved match the selection predicate, the amount of waste incurred by
scanning unwanted records as well is small compared to the additional efficiency gained by
the sequential scan. On the other hand, when the range associated with a query having
Figure 3-17. Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly permuted file, with a spatial selection predicate accepting 2.5% of the database tuples.
high selectivity is very large, the time required to load all of the relevant B+-Tree (or
R-Tree) pages into memory using random disk I/Os is prohibitive. Even if the query is run
long enough that all of the relevant pages are touched, for a query with high selectivity,
the buffer manager cannot be expected to buffer all the B+-Tree (or R-Tree) pages that
contain records matching the query predicate. This is the reason that the curve for the
B+-Tree in Figure 3-13 or for the R-Tree in Figure 3-18, never leaves the y-axis for the
time range plotted.
The net result of this is that if an ACE Tree were not used, it would probably
be necessary to use both a B+-Tree and a randomly-permuted file in order to ensure
satisfactory performance in the general case. Again, this is a point which seems to strongly
favor use of the ACE Tree.
An observation we make from Figure 3-14 is that if all the three record retrieval
algorithms are allowed to run to completion, we find that the ACE Tree is not the first to
Figure 3-18. Sampling rate of an ACE Tree vs. rate for an R-Tree and scan of a randomly permuted file, with a spatial selection predicate accepting 25% of the database tuples.
complete execution. Thus, there is generally a crossover point beyond which the sampling
rate of an alternative random sampling technique is higher than the sampling rate of the
ACE Tree. However, the important point is that such a transition always occurs very late
in the query execution by which time the ACE Tree has already retrieved almost 90% of
the possible random samples. We found this trend for all the different query selectivities
we tested with single dimensional as well as multi-dimensional ACE Trees. Thus, we
emphasize that the existence of such a crossover point in no way belittles the utility of the
ACE Tree since in practical applications where random samples are used, the number of
random samples required is very small. Since the ACE Tree provides the desired number
of random samples (and many more) much faster than the other two methods, it still
emerges as the top performer among the three methods for obtaining random samples.
Finally, Figure 3-15 shows the memory requirement of the ACE Tree to store records
that match the query predicate but cannot yet be used to answer the query. The
fluctuations in the number of records buffered by the query algorithm at different times
during the query execution are as expected.
required by the query algorithm can vary as newly retrieved leaf node sections are either
buffered (thus requiring more buffer space) or can be appended with already buffered
sections (thus releasing buffer space). We also note from Figure 3-15 that the ACE Tree
has a reasonable memory requirement since a very small fraction of the total number of
records is buffered by it.
3.9 Conclusion and Discussion
In this chapter we have presented the idea of a “sample view” which is an indexed,
materialized view of an underlying database relation. The sample view facilitates efficient
random sampling of records satisfying a relational range predicate. This chapter also
describes the ACE Tree, a new indexing structure that we use to index the sample
view. We have shown experimentally that with the ACE Tree index, the sample view can
be used to provide an online random sample with much greater efficiency than the obvious
alternatives. For applications like online aggregation or data mining that require a random
ordering of input records, this makes the ACE Tree a natural choice for random sampling.
This is not to say that the ACE Tree is without any drawbacks. One obvious concern
is that the ACE Tree is a primary file organization as well as an index, and hence it
requires that the data be stored within the ACE Tree structure. This means that if the
data are stored within an ACE Tree, then without replication of the data elsewhere
it is not possible to cluster the data in another way at the same time. This may be a
drawback for some applications. For example, it might be desirable to organize the data
as a B+-Tree if non-sampling-based range queries are asked frequently as well, and this is
precluded by the ACE Tree. This is certainly a valid concern. However, we still feel that
the ACE Tree will be one important weapon in a data analyst’s arsenal. Applications like
online aggregation (where the database is used primarily or exclusively for sampling-based
analysis) already require that the data be clustered in a randomized fashion; in such a
situation, it is not possible to apply traditional structures like a B+-Tree anyway, and
so there is no additional cost associated with the use of an ACE Tree as the primary
file organization. Even if the primary purpose of the database is a more traditional or
widespread application such as OLAP, we note that it is becoming increasingly common
for analysts to subsample the database and apply various analytic techniques (such as
data mining) to the subsample; if such a sample were to be materialized anyway, then
organizing the subsample itself as an ACE Tree in order to facilitate efficient online
analysis would be a natural choice.
Another potential drawback of the ACE Tree, as it has been described in this chapter,
is that it is not an incrementally updateable structure. The ACE Tree is relatively
efficient to construct in bulk: it requires two external sorts of the underlying data to
build from scratch. The difficulty is that as new data are added, there is not an easy
way to update the structure without rebuilding it from scratch. Thus, one potential
area for future work is to add the ability to handle incremental inserts to the sample
view (assuming that the ACE Tree is most useful in a data warehousing environment,
where deletes are far less common). However, we note that even without the ability to
incrementally update an ACE-Tree, it is still easily usable in a dynamic environment if a
standard method such as a differential file [111] is applied. Specifically, one could maintain
the differential file as a randomly permuted file or even a second ACE Tree, and when a
relational selection query is posed, in order to draw a random sample from the query one
selects the next sample from either the primary ACE Tree or the differential file with an
appropriate hypergeometric probability (for an idea of how this could be done, see the
recent paper of Brown and Haas [12] for a discussion of how to draw a single sample from
multiple data set partitions). Thus, we argue that the lack of an algorithm to update the
ACE tree incrementally may not be a tremendous drawback.
Finally, we close the chapter by asserting that the importance of having indexing
methods that can handle insertions incrementally is often overstated in the research
literature. In practice, most incrementally-updateable structures such as B+-Trees
cannot be updated incrementally in a data warehousing environment due to performance
considerations anyway [93]. Such structures still require on the order of one random
I/O per update, rendering it impossible to efficiently process bulk updates consisting of
millions of records without simply rebuilding the structure from scratch. Thus, we feel
that the drawbacks associated with the ACE Tree do not prevent its utility in many
real-world situations.
CHAPTER 4
SAMPLING-BASED ESTIMATORS FOR SUBSET-BASED QUERIES
4.1 Introduction
Sampling is well-established as a method for dealing with very large volumes of
data, when it is simply not practical or desirable to perform the computation over the
entire data set. Sampling has several advantages compared to other widely-studied
approximation methodologies from the data management literature such as wavelets [88],
histograms [92] and sketches [29]. Not the least of those is generality: it is very easy to
efficiently draw a sample from a large data set in a single pass using reservoir techniques
[34]. Then, once the sample has been drawn it is possible to guess, with greater or lesser
accuracy, the answer to virtually any statistical query over the data set. Samples can easily
handle many different database queries, including complex functions in relational selection
and join predicates. The same cannot be said of the other approximation methods, which
generally require more knowledge of the query during synopsis construction, such as the
attribute that will appear in the SELECT clause of the SQL query corresponding to the
desired statistical calculation.
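The reservoir technique cited above can be sketched in a few lines; the Python below is a minimal version of Algorithm R (the function name and toy usage are ours, for illustration only):

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R: keep a uniform without-replacement sample of k items
    from a stream of unknown length, in a single pass."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)           # fill the reservoir first
        else:
            j = rng.randrange(i + 1)      # item i survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample
```

Because each record is touched exactly once, the sample can be drawn while the data set is being scanned for other purposes.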
However, one class of aggregate queries that remains difficult or impossible to answer with samples is the so-called “subset” queries, which can generally be written in SQL in
the form:
SELECT SUM (f1(r))
FROM R as r
WHERE f2(r) AND NOT EXISTS
(SELECT * FROM S AS s
WHERE f3(r, s))
Note that the function f2 can be incorporated into f1 if we have f1 evaluate to zero if
f2 is not true; thus, in the remainder of the chapter we will ignore f2. An example of such
a query is: “Find the total salary of all employees who have not made a sale in the past
year”:
SELECT SUM (e.SAL)
FROM EMP AS e
WHERE NOT EXISTS
(SELECT * FROM SALE AS s
WHERE s.EID = e.EID)
A general solution to this problem would greatly extend the class of database-style
queries that are amenable to being answered via random sampling. For example, there is
a very close relationship between such queries and those obtained by removing the NOT
in the subquery. Using the terminology introduced later in this chapter, all records from
EMP with i matching records in SALE are called “class i” records. The only difference between NOT
EXISTS and EXISTS is that the former query computes a sum over all class 0 records,
whereas the latter query computes a sum over all class i > 0 records. Since any reasonable
estimator for NOT EXISTS will likely have to compute an estimated sum over each class, a
solution for NOT EXISTS should immediately suggest a solution for EXISTS. Also, nested
queries having an IN (or NOT IN) clause can be easily rewritten as a nested query having
the EXISTS (or NOT EXISTS) clause. For example, the query “Find the total salary of all
employees who have not made a sale in the past year” given above can also be written as:
SELECT SUM (e.SAL)
FROM EMP as e
WHERE e.EID NOT IN
(SELECT s.EID FROM SALE AS s)
Furthermore, a solution to the problem of sampling for subset queries would allow
sampling-based aggregates over SQL DISTINCT queries, which can easily be re-written as
subset queries. For example:
SELECT SUM (DISTINCT e.SAL)
FROM EMP AS e
is equivalent to:
SELECT SUM (e.SAL)
FROM EMP AS e
WHERE NOT EXISTS
(SELECT * FROM EMP AS e2
WHERE id(e) < id(e2)
AND e.SAL = e2.SAL)
In this query, id is a function that returns the row identifier for the record in question.
Some work has considered the problem of sampling for counts of distinct attribute values [17, 49], but computing aggregates over DISTINCT queries remains an open problem. Similarly, an aggregate query where records with identical values may appear more than once in the data, but should be considered no more than once by the aggregate function, can also be written as a subset-based SQL query. For example:
SELECT SUM (e.SAL)
FROM EMP AS e
WHERE NOT EXISTS
(SELECT * FROM EMP AS e2
WHERE id(e) < id(e2)
AND identical(e, e2))
In this query, the function identical returns true if the two records contain identical
values for all of their attributes. This would be very useful in computations where the
same data object may be seen at many sites in a distributed environment (packets in an
IP network, for example). Previous work has considered how to perform sampling in such
a distributed system [12, 77], but not how to deal with the duplicate data problem.
Unfortunately, it turns out that handling subset queries using sampling is exceedingly
difficult, due to the fact that the subquery in a subset query is not asking for a mean or a
sum – tasks for which sampling is particularly well-suited. Rather, the subquery is asking
whether we will ever see a match for each tuple from the outer relation. By looking at an
individual tuple, this is very hard to guess: either we have already seen a match in our sample (in which case we are assured that the inner relation has a match), or we have not, in which case we may have almost no way to guess whether we will ever see a match. For
example, imagine that employee Joe does not have a sale in a 10% sample of the SALE
relation. How can we guess whether or not he has a sale in the remaining 90%?
There is little relevant work in the statistical literature to suggest how to tackle
subset queries, because such queries ask a simultaneous question linking two populations
(database tables EMP and SALE in our example), which is an uncommon question in
traditional applications of finite population sampling. Outside of the work on sampling for the number of distinct values [17, 49, 50] and one method that requires an index on the inner relation [75], there is also little relevant work in the data management literature. We presume this is due to the difficulty of the problem; researchers have considered the difficulty of the more limited problem of sampling for distinct values in some detail [17].
Our Contributions
In this chapter, we consider the problem of developing sampling-based statistical
estimators for such queries. In the remainder of this chapter, we assume without-replacement
sampling, though our methods could easily be extended to other sampling plans. Given
the difficulty of the problem, it is perhaps not surprising that significant statistical and
mathematical machinery is required for a satisfactory solution.
Our first contribution is to develop an unbiased estimator, which is the traditional
first step when searching for a good statistical estimator. An unbiased estimator is one
that is correct on expectation; that is, if an unbiased estimator is run an infinite number
of times, then the average over all of the trials would be exactly the same as the correct
answer to the query. The reason that an unbiased estimator is the natural first choice is
that if the estimator has low variance¹, then the fact that it is correct on average implies
that it will always be very close to the correct answer.
Unfortunately, it turns out that the unbiased estimator we develop often has high
variance, which we prove analytically and demonstrate experimentally. Since it is easy to
argue that our unbiased estimator is the only unbiased estimator for a certain subclass of
subset-based queries (see the Related Work section of this chapter), it is perhaps doubtful
that a better unbiased estimator exists.
Thus, we also propose a novel, biased estimator that makes use of a statistical
technique called “superpopulation modeling”. Superpopulation modeling is an example
of a so-called Bayesian statistical technique [39]. Bayesian methods generally make use of
mild and reasonable distributional assumptions about the data in order to greatly increase
estimation accuracy, and have become very popular in statistics in the last few decades.
Using this method in the context of answering subset-based queries presents a number of
significant technical challenges whose solutions are detailed in this chapter, including:
• The definition of an appropriate generative statistical model for the problem of sampling for subset-based queries.

• The derivation of a unique Expectation-Maximization algorithm [26] to learn the model from the database samples.

• The development of algorithms for efficiently generating many new random data sets from the model, without actually having to materialize them.
Through an extensive set of experiments, we show that the resulting biased Bayesian
estimator has excellent accuracy on a wide variety of data. The biased estimator also has
the desirable property that it provides something closely related to classical confidence bounds, which can be used to give the user an idea of the accuracy of the associated estimate.
1 Variance is the statistical measure of the random variability of an estimator.
4.2 The Concurrent Estimator
With a little effort, it is not hard to imagine several possible sampling-based
estimators for subset queries. In this section, we discuss one very simple (and sometimes
unusable) sample-based estimator. This estimator has previously been studied in detail
[75], but we present it here because it forms the basis for the unbiased estimator described
in the next section.
We begin our description with an even simpler estimation problem. Given a
one-attribute relation R(A) consisting of nR records, imagine that our goal is to estimate
the sum over attribute A of all the records in R. A simple, sample-based estimator would
be as follows. We obtain a random sample R′ of size nR′ of all the records of R, compute
total = Σ_{r∈R′} r.A, and then scale up total to output total × nR/nR′ as the estimate for
the final sum. Not only is this estimator extremely simple to understand, but it is also
unbiased, consistent, and its variance reduces monotonically with increasing sample size.
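As a quick sketch (Python; the toy relation and the sample size below are ours), the estimator is just a sum and a scale-up:

```python
import random

def estimate_sum(R, n_sample, rng=random):
    """Estimate the sum over attribute A of relation R from a
    without-replacement sample: total the sampled values, then
    scale up by nR / nR'."""
    R_prime = rng.sample(R, n_sample)
    return sum(R_prime) * len(R) / n_sample

R = list(range(1, 1001))   # toy one-attribute relation R(A); the true sum is 500500
est = estimate_sum(R, 100, random.Random(42))
```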
We can extend this simple idea to define an estimator for the NOT EXISTS query
considered in the introduction. We start by obtaining random samples EMP′ and SALE′ of
sizes nEMP′ and nSALE′, respectively, from the relations EMP and SALE. We then evaluate the
NOT EXISTS query over the samples of the two relations. We compare every record in EMP′
with every record in SALE′, and if we do not find a matching record (that is, one for which
f3 evaluates to true), then we add its f1 value to the estimated total. Lastly, we scale up
the estimated total by a factor of nEMP/nEMP′ to obtain the final estimate, which we term
M :
M = (nEMP/nEMP′) × Σ_{e∈EMP′} f1(e) × (1 − min(1, cnt(e, SALE′)))

In this expression, cnt(e, SALE′) = Σ_{s∈SALE′} I(f3(e, s)), where I is the standard indicator function, returning 1 if the boolean argument evaluates to true, and 0 otherwise.
The algorithm can be slightly modified to accommodate growing samples of the
relations, and has been described in detail in [75], where it is called the “concurrent
estimator” since it samples both relations concurrently.
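The concurrent estimator is a direct transcription into code; the sketch below (Python; the toy records, sizes, and predicates are ours, with f3 the foreign-key equality check from the running example) computes M from the two samples:

```python
def concurrent_estimate(emp_sample, sale_sample, n_emp, f1, f3):
    """M: sum f1 over sampled EMP records with no match in SALE',
    then scale up by nEMP / nEMP'."""
    total = sum(f1(e) for e in emp_sample
                if not any(f3(e, s) for s in sale_sample))
    return total * n_emp / len(emp_sample)

# Toy data: EMP records are (eid, sal) pairs, SALE records are (eid,) tuples.
emp_sample = [(1, 100), (2, 200)]
sale_sample = [(1,)]
M = concurrent_estimate(emp_sample, sale_sample, n_emp=4,
                        f1=lambda e: e[1], f3=lambda e, s: e[0] == s[0])
```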
Unfortunately, on expectation, the estimator is often severely biased, meaning that
it is, on average, incorrect. The reason for this bias is fairly intuitive. The algorithm
compares a record from EMP with all records from SALE′, and if it does not find a matching
record in SALE′, it classifies the record as having no match in the entire SALE relation.
Clearly, this classification may be incorrect for certain records in EMP, since although they
might have no matching record in SALE′, it is possible that they may match with some
record from the part of SALE that was not included in the sample. As a result, M typically
overestimates the answer to the NOT EXISTS query. In fact, the bias of M is:
Bias(M) = Σ_{e∈EMP} f1(e) × (ϕ(nSALE, nSALE′, cnt(e, SALE)) − (1 − min(1, cnt(e, SALE))))
In this expression, ϕ denotes the hypergeometric probability² that a sample of size nSALE′ will contain none of the cnt(e, SALE) matching records of e.
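The no-match probability ϕ(nSALE, nSALE′, c) has a closed form in binomial coefficients; a minimal Python sketch (the function and argument names are ours):

```python
from math import comb

def phi(n_sale, n_sample, c):
    """Probability that a without-replacement sample of n_sample records,
    drawn from n_sale records of which c match, contains no matching record."""
    if n_sample > n_sale - c:
        return 0.0   # too few non-matching records for the sample to avoid a match
    return comb(n_sale - c, n_sample) / comb(n_sale, n_sample)
```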
The solution that was employed previously to counteract this bias requires an index
such as a B+-Tree on the entire SALE relation, in order to estimate and correct for
Bias(M). Unfortunately, the requirement for an index severely limits the applicability
of the method. If an index on the “join” attribute in the inner relation is not available,
the method cannot be used. In a streaming environment where it is not feasible to store
SALE in its entirety, an index is not practical. The requirement of an index also precludes
use of the concurrent estimator for a non-equality predicate in the inner subquery or
² The hypergeometric probability distribution models the distribution of the number of red balls that will be obtained in a sample without replacement of n′ balls from an urn containing r red balls and n − r non-red balls.
for non-database environments where sampling might be useful, such as in a distributed
system.
In the remainder of this chapter, we consider the development of sampling-based
estimators for this problem that require nothing but samples from the relations themselves.
Our first estimator makes use of a provably unbiased estimator B̂ias(M) for Bias(M). Taken together, M − B̂ias(M) is then an unbiased estimator for the final query answer. The second estimator we consider is quite different in character, making use of Bayesian statistical techniques.
4.3 Unbiased Estimator
4.3.1 High-Level Description
In order to develop an unbiased estimator for Bias(M), it is useful to first re-write
the formula for Bias(M) in a slightly different fashion. We subsequently refer to the set
of records in EMP that have i matches in SALE as “class i records”. Denote the sum of the
aggregate function over all records of class i by ti, so ti = Σ_{e∈EMP} f1(e) × I(cnt(e, SALE) = i)
(note that the final answer to the NOT EXISTS query is the quantity t0). Given that the
probability that a record with i matches in SALE happens to have no matches in SALE′ is
ϕ(nSALE, nSALE′ , i), we can re-write the expression for the bias of M as:
Bias(M) = Σ_{i=1}^{m} ϕ(nSALE, nSALE′, i) × ti    (4–1)
The above equation computes the bias of M since it computes the expected sum
over the aggregate attribute of all records of EMP which are incorrectly classified as class 0
records by M .
Let m be the maximum number of matching records in SALE for any record of EMP.
Equation 4–1 suggests an unbiased estimator for Bias(M) because it turns out that
it is easy to generate an unbiased estimate for tm: since no records other than those
with m matches in SALE can have m matches in SALE′, we can simply count the sum
of the aggregate function f1 over all such records in our sample, and scale up the total
accordingly. The scale-up would also be done to account for the fact that we use SALE′ and
not SALE to count matches. Once we have an estimate for tm, it is possible to estimate
tm−1. How? Note that a record with m − 1 matches in SALE′ must be a member of either class m or class m − 1. Using our unbiased estimate for tm, it is possible to guess the
total aggregate sum for those records with m − 1 matches in SALE′ that in reality have m
matches in SALE. By subtracting this from the sum for those records with m − 1 matches
in SALE′ and scaling up accordingly, we can obtain an unbiased estimate for tm−1. In a
similar fashion, each unbiased estimate for ti leads to an unbiased estimate for ti−1. By
using this recursive relationship, it is possible to guess in an unbiased fashion the value
for each ti in the expression for Bias(M). This leads to an unbiased estimator for the
Bias(M) quantity, which can be subtracted from M to provide an unbiased guess for the
query result.
4.3.2 The Unbiased Estimator In Depth

We now formalize the above ideas to develop an unbiased estimator for each tk that can be used in conjunction with Equation 4–1 to develop an unbiased estimator for Bias(M). We use the following additional notation for this section and the remainder of this chapter:
• ∆k,i is a 0/1 (non-random) variable which evaluates to 1 if the ith tuple of EMP has k matches in SALE and evaluates to 0 otherwise.

• sk is the sum of f1 over all records of EMP′ having k matching records in SALE′: sk = Σ_{i=1}^{nEMP′} I(cnt(ei, SALE′) = k) × f1(ei).

• α0 is nEMP′/nEMP, the sampling fraction of EMP.

• Yi is a random variable which governs whether or not the ith record of EMP appears in EMP′.

• h(k; nSALE, nSALE′, i) is the hypergeometric probability that out of the i interesting records in a population of size nSALE, exactly k will appear in a random sample of size nSALE′. For compactness of representation we will refer to this probability as h(k; i) in the remainder of the thesis, since our sampling fraction never changes.
We begin by noting that if we consider only those records from EMP which appear in
the sample EMP′, an unbiased estimator for tk over EMP′ can be expressed as follows:
t̂k = (1/α0) × Σ_{i=1}^{nEMP} Yi × ∆k,i × f1(ei)    (4–2)
Unfortunately, this estimator relies on being able to evaluate ∆k,i for an arbitrary record,
which is impossible without scanning the inner relation in its entirety. However, with
a little cleverness, it is possible to remove this requirement. We have seen earlier that
a record e can have k matches in the sample SALE′ provided it has i ≥ k matches in
SALE. This implies that records from all classes i where i ≥ k can contribute to sk.
The contribution of a class i record towards the expected value of sk is obtained by
simply multiplying the probability that it will have k matches in SALE′ with its aggregate
attribute value. Thus a generic expression to compute the contribution of any arbitrary
record from EMP′ towards the expected value of sk can be written as Σ_{i=k}^{m} ∆i,j × h(k; i) × f1(ej). Then, the following random variable has an expected value that is equivalent to the expected value of sk:

ŝk = Σ_{j=1}^{nEMP} Σ_{i=k}^{m} Yj × ∆i,j × h(k; i) × f1(ej)    (4–3)
The fact that E[ŝk] = E[sk] (proven in Section 4.3.3) is significant, because there is a simple algebraic relationship between the various s variables and the various t variables. Thus, we can express one set in terms of the other, and then replace each ŝk with the observable sk in order to derive an unbiased estimator for each t. The benefit of doing this is that since sk is defined as the sum of f1 over all records of EMP′ having k matching records in SALE′, it can be directly evaluated from the samples EMP′ and SALE′.
To derive the relationship between s and t, we start with an expression for ŝm−r using Equation 4–3:

ŝm−r = Σ_{j=1}^{nEMP} Σ_{i=m−r}^{m} Yj × ∆i,j × h(m−r; i) × f1(ej)

     = Σ_{i=m−r}^{m} h(m−r; i) × Σ_{j=1}^{nEMP} Yj × ∆i,j × f1(ej)

     = Σ_{i=0}^{r} h(m−r; m−r+i) × Σ_{j=1}^{nEMP} Yj × ∆m−r+i,j × f1(ej)

     = Σ_{i=0}^{r} h(m−r; m−r+i) × α0 × t̂m−r+i    (4–4)
By re-arranging the terms we get the following important recursive relationship:

t̂m−r = (ŝm−r − α0 Σ_{i=1}^{r} h(m−r; m−r+i) × t̂m−r+i) / (α0 × h(m−r; m−r))    (4–5)

For the base case we obtain:

t̂m = ŝm / (α0 × h(m; m)) = am × ŝm    (4–6)

where am = 1/(α0 × h(m; m)).
By replacing ŝm−r in the above equations with sm−r, which is readily observable from the data and has the same expected value, we can obtain a simple recursive algorithm for computing an unbiased estimator for any ti. Before presenting the recursive algorithm, we note that we can re-write Equation 4–5 for t̂i by replacing ŝ with s, changing the summation variable from i to k, and substituting i for m − r:

t̂i = (si − α0 Σ_{k=1}^{m−i} h(i; i+k) × t̂i+k) / (α0 × h(i; i))
The following pseudo-code then gives the algorithm³ for computing an unbiased estimator for any ti.

———————————————————————————-
Function GetEstTi(int i) {
1   if (i == m)
2     return sm/(α0h(m; m))
3   else {
4     returnval = si
5     for (int k = 1; k <= m − i; k++)
6       returnval -= α0h(i; i + k) × GetEstTi(i+k)
7     returnval /= α0h(i; i)
8     return returnval
9   }
}
———————————————————————————-
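The recursion can be transcribed directly into Python, with h computed from binomial coefficients; this is a sketch with entirely hypothetical sizes and per-class sums (all concrete numbers below are ours, not from any experiment):

```python
from math import comb

# Hypothetical setup: SALE has n = 100 records, SALE' has nprime = 40,
# the largest match count is m = 3, and alpha0 is the EMP sampling fraction.
n, nprime, m, alpha0 = 100, 40, 3, 0.25

def h(k, i):
    """Hypergeometric probability that k of the i matching records land in
    a without-replacement sample of size nprime (h(k; i) in the text)."""
    if k > i or nprime - k > n - i:
        return 0.0
    return comb(i, k) * comb(n - i, nprime - k) / comb(n, nprime)

# Made-up observed sums s_1..s_m of f1 over EMP' records
# having k matches in SALE'.
s = {1: 500.0, 2: 300.0, 3: 200.0}

def get_est_ti(i):
    """Recursive unbiased estimator for t_i (GetEstTi in the text)."""
    if i == m:
        return s[m] / (alpha0 * h(m, m))
    val = s[i]
    for k in range(1, m - i + 1):
        val -= alpha0 * h(i, i + k) * get_est_ti(i + k)
    return val / (alpha0 * h(i, i))

# Equation 4-7: the bias estimate, using phi(nSALE, nSALE', i) = h(0, i).
bias_hat = sum(h(0, i) * get_est_ti(i) for i in range(1, m + 1))
```

Because the recursion simply inverts Equation 4–5, plugging the estimates back in recovers each observed si exactly, which is a convenient sanity check.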
Recall from Equation 4–1 that the bias of M was expressed as a linear combination of various ti terms. Using GetEstTi to estimate each of the ti terms, we can write an estimator for the bias of M as:

B̂ias(M) = Σ_{i=1}^{m} ϕ(nSALE, nSALE′, i) × GetEstTi(i)    (4–7)
In the following two subsections, we present a formal analysis of the statistical
properties of our estimator.
³ Note the h(m; m) probability in line 2 of the GetEstTi function. If the sample size from SALE is not at least as large as m, then h(m; m) = 0 and GetEstTi is undefined. This means that our estimator is undefined if the sample is not at least as large as the largest number of matches for any record from EMP in SALE. The fact that the estimator is undefined in this case is not surprising, since it means that our estimator does not conflict with known results regarding the existence of an unbiased estimator for the distinct value problem. See the Related Work section for more details.
4.3.3 Why Is the Estimator Unbiased?
According to Equation 4–7, the estimator for the bias of M is composed of a sum of
m different estimators. Hence by the linearity of expectation, the expected value of the
estimator can be written as:
E[B̂ias(M)] = Σ_{i=1}^{m} ϕ(nSALE, nSALE′, i) × E[GetEstTi(i)]    (4–8)
The above relation suggests that in order to prove that the sample-based estimator
of Equation 4–7 is unbiased, it would suffice to prove that each of the individual GetEstTi
estimators is unbiased. We use mathematical induction to prove the correctness of the
various estimators on expectation.
As a preliminary step for the proof of unbiasedness, we first derive the expected value of the sk estimators used by GetEstTi. To do this, we introduce a zero/one random variable Hj,k that evaluates to 1 if ej has k matches in SALE′ and 0 otherwise. The expected value of this variable is simply the probability that it evaluates to 1, giving us E[Hj,k] = h(k; cnt(ej, SALE)). With this:
E[sk] = E[Σ_{j=1}^{nEMP} Σ_{i=k}^{m} Yj × ∆i,j × Hj,k × f1(ej)]

      = Σ_{j=1}^{nEMP} Σ_{i=k}^{m} E[Yj] × ∆i,j × E[Hj,k] × f1(ej)

      = α0 × Σ_{j=1}^{nEMP} Σ_{i=k}^{m} ∆i,j × h(k; i) × f1(ej)    (4–9)
We are now ready to present a formal proof of unbiasedness of GetEstTi.

Theorem 1. The expected value of GetEstTi(i) is Σ_{j=1}^{nEMP} ∆i,j × f1(ej).
Proof. Using Equation 4–5, the recursive GetEstTi estimator can be re-written as:

GetEstTi(i) = (si − α0 Σ_{k=1}^{m−i} h(i; i+k) × GetEstTi(i+k)) / (α0 × h(i; i))    (4–10)
We first prove unbiasedness for the base case, GetEstTi(m). Setting i = m in the above relation and taking the expectation:

E[GetEstTi(m)] = E[sm] / (α0 × h(m; m))

Replacing E[sm] using Equation 4–9:

E[GetEstTi(m)] = (α0 / (α0 × h(m; m))) × Σ_{j=1}^{nEMP} ∆m,j × h(m; m) × f1(ej)

               = Σ_{j=1}^{nEMP} ∆m,j × f1(ej)

which is exactly the value of tm.
By induction, we can now assume that all estimators GetEstTi(i+k) for 1 ≤ k ≤ m− i
are unbiased and we use this to prove that the estimator GetEstTi(i) is unbiased. Taking
the expectation on both sides of the above equation:
E[GetEstTi(i)] = E[(si − α0 Σ_{k=1}^{m−i} h(i; i+k) × GetEstTi(i+k)) / (α0 × h(i; i))]

By the linearity of expectation:

= (E[si] − α0 Σ_{k=1}^{m−i} h(i; i+k) × E[GetEstTi(i+k)]) / (α0 × h(i; i))

Replacing the values of E[GetEstTi(i+k)] and E[si]:

= (1/(α0 × h(i; i))) × (α0 Σ_{j=1}^{nEMP} Σ_{k=i}^{m} ∆k,j × h(i; k) × f1(ej) − α0 Σ_{k=1}^{m−i} h(i; i+k) × Σ_{j=1}^{nEMP} ∆i+k,j × f1(ej))

= (1/h(i; i)) × (Σ_{j=1}^{nEMP} Σ_{k=i}^{m} ∆k,j × h(i; k) × f1(ej) − Σ_{j=1}^{nEMP} Σ_{k=1}^{m−i} ∆i+k,j × h(i; i+k) × f1(ej))

For the second term in the parentheses, replacing i + k by p and changing the limits of summation for the inner sum accordingly:

= (1/h(i; i)) × (Σ_{j=1}^{nEMP} Σ_{k=i}^{m} ∆k,j × h(i; k) × f1(ej) − Σ_{j=1}^{nEMP} Σ_{p=i+1}^{m} ∆p,j × h(i; p) × f1(ej))    (4–11)

We notice that the limits of summation of the inner sum of the first term run from i to m. Splitting this term into two terms such that one term has limits of summation from i to i while the other has limits from i + 1 to m:

= (1/h(i; i)) × (Σ_{j=1}^{nEMP} ∆i,j × h(i; i) × f1(ej) + Σ_{j=1}^{nEMP} Σ_{k=i+1}^{m} ∆k,j × h(i; k) × f1(ej) − Σ_{j=1}^{nEMP} Σ_{p=i+1}^{m} ∆p,j × h(i; p) × f1(ej))

The last two sums cancel, leaving:

= (1/h(i; i)) × Σ_{j=1}^{nEMP} ∆i,j × h(i; i) × f1(ej)

= Σ_{j=1}^{nEMP} ∆i,j × f1(ej)    (4–12)
4.3.4 Computing the Variance of the Estimator
The unbiasedness of B̂ias(M) means that it may be useful. However, the accuracy of any estimator depends on its variance as well as its bias. We now investigate the variance of our unbiased estimator.

We have seen that B̂ias(M) is a linear combination of various GetEstTi results with ϕ(nSALE, nSALE′, i) as the coefficient of GetEstTi(i). In order to derive an expression for the variance of the estimator and gain insight about the potential values it can take, we first express the estimator as a linear combination of si terms:

B̂ias(M) = Σ_{i=1}^{m} bi × si    (4–13)
The next step in deriving the variance is being able to compute the various bi values.
Intuitively, the bi terms can be thought of as coming from the linear relationship between
the ti and si terms. The following algorithm shows how we can actually compute the bi
values.
——————————————————————————————————-
Function ComputeBis(m) {
1   // Let table[m][m] be a 2-dimensional array with all elements initialized to zero
2   for (int row = 0; row < m; row++) {
3     for (int term = 1; term <= row; term++) {
4       factor = −h(m − row; m − row + term)/h(m − row; m − row)
5       prow = row − term
6       for (int pcol = 0; pcol <= prow; pcol++)
7         table[row][pcol] += factor ∗ table[prow][pcol]
8     }
9     table[row][row] = 1/h(m − row; m − row)
10  }
11  for (int row = 0; row < m; row++)
12    for (int col = 0; col <= row; col++)
13      bm−col += α0 ∗ h(0; m − row) ∗ table[row][col]
——————————————————————————————————-
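The table-filling idea can be transcribed and checked numerically. The sketch below reuses the hypothetical sizes and sums from our earlier GetEstTi sketch (all numbers are ours); note that in this transcription the final loop divides by α0, which is the scaling under which the identity Σ bi × si = Σ ϕ(·, i) × GetEstTi(i) holds for GetEstTi as defined in Equation 4–10, and the test verifies that identity:

```python
from math import comb

# Same hypothetical setup as the GetEstTi sketch; these numbers are ours.
n, nprime, m, alpha0 = 100, 40, 3, 0.25

def h(k, i):
    """h(k; i): probability that k of i matching records land in SALE'."""
    if k > i or nprime - k > n - i:
        return 0.0
    return comb(i, k) * comb(n - i, nprime - k) / comb(n, nprime)

s = {1: 500.0, 2: 300.0, 3: 200.0}   # made-up observed per-class sums

def get_est_ti(i):
    """Recursive unbiased estimator for t_i (GetEstTi in the text)."""
    if i == m:
        return s[m] / (alpha0 * h(m, m))
    val = s[i]
    for k in range(1, m - i + 1):
        val -= alpha0 * h(i, i + k) * get_est_ti(i + k)
    return val / (alpha0 * h(i, i))

def compute_bis():
    """Coefficients b_i with bias-hat = sum_i b_i * s_i.  table[row][col]
    holds the coefficient of s_{m-col} in alpha0 * GetEstTi(m-row)."""
    table = [[0.0] * m for _ in range(m)]
    for row in range(m):
        for term in range(1, row + 1):
            factor = -h(m - row, m - row + term) / h(m - row, m - row)
            prow = row - term
            for pcol in range(prow + 1):
                table[row][pcol] += factor * table[prow][pcol]
        table[row][row] = 1.0 / h(m - row, m - row)
    b = {i: 0.0 for i in range(1, m + 1)}
    for row in range(m):
        for col in range(row + 1):
            # phi(nSALE, nSALE', m-row) = h(0, m-row); dividing by alpha0
            # undoes the alpha0 factored into the table rows.
            b[m - col] += h(0, m - row) * table[row][col] / alpha0
    return b

b = compute_bis()
bias_hat_linear = sum(b[i] * s[i] for i in range(1, m + 1))
```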
With this, the variance of this estimator can then be written as:

Var(B̂ias(M)) = Var(Σ_{i=1}^{m} bi × si)    (4–14)
Note that the si values are not independent random variables, since if an EMP′ record has i matches in SALE′, then it cannot simultaneously have j ≠ i matches in SALE′. Hence we have:

Var(B̂ias(M)) = Σ_{i=1}^{m} bi² × Var(si) + 2 Σ_{i=1}^{m} Σ_{i<j≤m} bi × bj × Cov(si, sj)    (4–15)
The Var and Cov terms can be computed by using the standard formulas:

Var(si) = E[si²] − E²[si];  Cov(si, sj) = E[si × sj] − E[si] × E[sj]    (4–16)
To evaluate Var(si) and Cov(si, sj), the quantities E[si²] and E[si × sj] can be computed as follows:

E[si × sj] = E[(Σ_{k=1}^{nEMP} Yk × f1(ek) × Hk,i) × (Σ_{r=1}^{nEMP} Yr × f1(er) × Hr,j)]

           = E[Σ_{k=1}^{nEMP} Σ_{r=1}^{nEMP} Yk × Yr × Hk,i × Hr,j × f1(ek) × f1(er)]

           = Σ_{k=1}^{nEMP} Σ_{r=1}^{nEMP} E[Yk × Yr] × E[Hk,i × Hr,j] × f1(ek) × f1(er)    (4–17)
The above expression can be evaluated using the following rules:

• if k ≠ r (that is, ek and er are two different tuples), then E[Hk,i × Hr,j] ≈ h(i; cnt(ek, SALE)) × h(j; cnt(er, SALE)), if we assume that no record s exists in SALE where f3(ek, s) = f3(er, s) = true

• if i = j (that is, we are computing E[si²]) and k = r, then E[Hk,i × Hr,j] = h(i; cnt(ek, SALE))

• if i ≠ j (that is, we are computing E[si × sj]) and k = r, then E[Hk,i × Hr,j] = 0, since a record cannot have two different numbers of matches in a sample

• if k = r, then E[Yk × Yr] = α0

• if k ≠ r, then E[Yk × Yr] ≈ α0²
4.3.5 Is This Good?
At this point, we now have a simple, unbiased estimator for the answer to a subset-based query, as well as a formal analysis of the statistical properties of the estimator. However, there are two problems related to the variance that may limit the utility of the estimator.

Figure 4-1. Sampling from a superpopulation
First, in order to evaluate the hypergeometric probabilities needed to compute or
estimate the variance, we need the value of cnt(e, SALE) for an arbitrary record e of
EMP. This information is generally unavailable during sampling, and it seems difficult
or impossible to obtain a good estimate for the appropriate probability without having
this information. This means that in practice, it will be difficult or impossible to tell
a user how accurate the resulting estimate is likely to be. We have experimented with
general-purpose methods such as the bootstrap [31] to estimate this variance, but have
found that these methods often do an extremely poor job in practice.
Second, the variance of the estimator itself may be huge. The bi coefficients are composed of sums, products and ratios of hypergeometric probabilities, which can result in huge values. Particularly worrisome is the h(i; i) value in the denominator used by GetEstTi. Such probabilities can be tiny; including such a small value in the denominator of an expression results in a very large value that may “pump up” the variance accordingly.
4.4 Developing a Biased Estimator
In light of these problems, in this section we describe a biased estimator that is
often far more accurate than the unbiased one, and also provides the user with an idea
of the estimation accuracy. Just like the unbiased estimator M − B̂ias(M) from the previous section, our biased estimator will be nothing more than a weighted sum over the observed sk values. However, the weights will be chosen so as to minimize the expected, or mean-squared, error of the resulting estimator.
To develop our biased estimator, we make use of the “superpopulation modeling”
approach from statistics [78]. One simple way to think of a superpopulation is that it
is an infinitely large set of records from which the original data set has been obtained
by random sampling. Because the superpopulation is infinite, it is specified using a
parametric distribution, which is usually referred to as the prior distribution.
Using a superpopulation method, we imagine the following two-step process is used to produce our sample:

1. Draw a large sample of size N from an imaginary infinite superpopulation, where N is the data set size.

2. Draw a sample of size n < N without replacement from the large sample of size N obtained in Step 1, where n is the desired sample size.
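The two-step process can be sketched directly (Python; the Gaussian prior and the sizes N and n below are placeholders for whatever F(Θ) is actually postulated):

```python
import random

def two_step_sample(draw_record, N, n, rng=random):
    """Superpopulation two-step process: materialize an N-record data set
    by sampling the prior, then draw an n-record without-replacement
    sample from it."""
    population = [draw_record() for _ in range(N)]   # step 1: data set ~ F(Theta)
    sample = rng.sample(population, n)               # step 2: sample of size n < N
    return population, sample

rng = random.Random(0)
pop, samp = two_step_sample(lambda: rng.gauss(0.0, 1.0), N=1000, n=50, rng=rng)
```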
By characterizing the superpopulation, it is possible to design an estimator that tends
to perform well on any data set and sample obtained using the process above.The following steps outline a road-map of our superpopulation-based approach for
obtaining a high-quality biased estimator for a subset-based query. We describe each stepin detail in the next section.
1. Postulate a superpopulation model F for our data set (F is the prior distribution;we use the notation pF to denote the probability density function (PDF) of F ). Ingeneral, F is parameterized on a parameter set Θ.
2. Infer the most likely values of the parameter set Θ from EMP′ and SALE′. Since wedo not have the complete data, but rather a random sample of the data, this is adifficult problem. We make use of an Expectation-Maximization (EM) algorithm tolearn the model parameters.
91
3. Use F (Θ) to generate d different populations P1, ..., Pd, where each Pi = (EMPi, SALEi).Note that if the data set in question is large, this may be very expensive. We showthat for our problem it is not necessary to generate the actual populations – it isenough to obtain certain sufficient statistics for each of them, which can be doneefficiently.
4. Sample from each Pi to obtain d sample pairs of the form Si = (EMP′i, SALE′i). Again,
this can be done without actually materializing the samples.
5. Let q(Pi) be the query answer over the ith data set. Construct a weighted estimator W that minimizes ∑_{i=1}^{d} (q(Pi) − W(Si))^2.
6. Use W on the original samples EMP′ and SALE′ to obtain the final estimate to the NOT EXISTS query. The MSE of this estimate can generally be assumed to be the MSE over all of the populations generated: (1/d) ∑_{i=1}^{d} (q(Pi) − W(Si))^2.
4.5 Details of Our Approach
In this section, we discuss in detail each step of our approach, outlined above, for obtaining an optimal weighted estimator for the NOT EXISTS query.
4.5.1 Choice of Model and Model Parameters
The first task is to define a generative model and an associated probability density
function for the two relations EMP and SALE respectively. While this may seem like a
daunting task (and a potentially impossible one, given all of the intricacies of modeling
real-life data), it is made easy by the fact that we only need to define a model that can
realistically reproduce those characteristics of EMP and SALE that may affect the bias or
variance of an estimator for a subset-based query. From the material in Section 4.3 of the
thesis, for a given record e from EMP, we know that these three characteristics are:
1. f1(e)
2. cnt(e, SALE), which is the number of SALE records s for which f3(e, s) is true
3. cnt(e, e′, SALE) where e′ ≠ e, which is the number of SALE records s for which f3(e, s) ∧ f3(e′, s) is true
To simplify our task, we will actually ignore the third characteristic and define a
model such that this count is always zero for any given record pair. While this may
introduce some inaccuracy into our method, it still captures a large number of real-life
situations. For example, if f3 consists of an equality check on a foreign key from SALE
into EMP (which is arguably the most common example of such a subset-based query) then
two records from EMP can never match with the same record from SALE and this count is
always zero.

Given that our model needs to be able to generate instances of EMP and SALE that realistically model the first two aspects given above, we choose the parameter set Θ = {p, µ, σ2} where:

• p is a vector of probabilities, where pi represents the probability that any arbitrary record of EMP belongs to class i

• µ is a vector of means, where µi represents the mean aggregate value of all records belonging to class i.

• σ2 is the variance of f1(e) over all records e ∈ EMP.
Then given these parameters, EMP and SALE are generated using our model as follows:
————————————————————————————–
Procedure GenData
1 For rec = 1 to nEMP do
2   Randomly generate k between 0 and m such that for any i, 0 ≤ i ≤ m, Pr[k = i] = pi
3   Generate a value for f1(e) by sampling from N(µk, σ)
4   Add the resulting e to EMP
5   For j = 1 to k do
6     Generate a record s where f3(e, s) is true
7     Add s to SALE
————————————————————————————–
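A runnable sketch of GenData (hypothetical Python; here f1 is the aggregate attribute, and each SALE record is represented simply by the identifier of the EMP record it matches, consistent with our assumption that matches are never shared):

```python
import random

def gen_data(n_emp, p, mu, sigma, seed=0):
    """Generate EMP and SALE per the model: each EMP record gets a class k
    with probability p[k], its f1 value is drawn from N(mu[k], sigma^2),
    and exactly k matching SALE records are created for it."""
    rng = random.Random(seed)
    classes = list(range(len(p)))       # classes 0..m
    emp, sale = [], []
    for eid in range(n_emp):
        k = rng.choices(classes, weights=p)[0]   # step 2
        f1 = rng.gauss(mu[k], sigma)             # step 3
        emp.append((eid, f1))                    # step 4
        for _ in range(k):                       # steps 5-7
            sale.append(eid)  # a SALE record matching only this e
    return emp, sale

emp, sale = gen_data(n_emp=1000, p=[0.5, 0.3, 0.2],
                     mu=[100.0, 105.0, 110.0], sigma=10.0)
```

The parameter values above are illustrative, not taken from the thesis.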
In step (3), N is a normally distributed random variable with the specified mean
and variance. We use a normal random variable because we are interested in sums over
classes in EMP; due to the central limit theorem (CLT), these sums will be normally
distributed for a large database. Thus, using a normal random variable does not result
in loss of generality. Also, note that in step (6), according to our earlier assumption
f3(e′, s) = false, ∀ e′ ≠ e.
In our actual model, the various µi values are not assumed to be independent; rather, we
assume a linear relationship between them to limit the degrees of freedom of the model
and thus avoid over-fitting (see "Dealing with Over-fitting" in Section 4.5.2). In
our model the various µi values are related as µi = s × i + µ0, where s and µ0 are the
only two parameters that need to be learned to determine all the µi. Also in order to
avoid overfitting, we assume that σ2 is the variance of f1(e) over all records, rather than
modeling and learning variance values of all the individual classes separately.
We now define the density function for the superpopulation model corresponding to
the GenData algorithm. For a given EMP record e, if f1(e) = v and cnt(e, SALE) = k the
probability density for e given a parameter set Θ is given by:
p(e|Θ) = p(v, k|Θ) = pk · pN(µk, σ, v) (4–18)
Where it is convenient, we will use the notation p(v, k|Θ) for values v and k and
p(e|Θ) for record e interchangeably. In this expression, pN is the PDF for the normal
distribution evaluated at v and is given by:
pN(µk, σ, v) = e^(−(v−µk)^2/(2σ^2)) / (σ√(2π)) (4–19)
Then if we consider a given data set {EMP, SALE}, the probability density of the data
set is simply the product of the densities of all the individual records:
pF(EMP, SALE) = ∏_{e∈EMP} p(e|Θ) (4–20)
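Equations 4–18 through 4–20 translate almost directly into code. The following hypothetical Python sketch (function names are ours) evaluates the per-record and data-set densities; records are given as (f1(e), cnt(e, SALE)) pairs:

```python
import math

def normal_pdf(mu, sigma, v):
    """p_N(mu, sigma, v): the normal density at v (Equation 4-19)."""
    return (math.exp(-(v - mu) ** 2 / (2 * sigma ** 2))
            / (sigma * math.sqrt(2 * math.pi)))

def record_density(v, k, p, mu, sigma):
    """p(v, k | Theta) = p_k * p_N(mu_k, sigma, v) (Equation 4-18)."""
    return p[k] * normal_pdf(mu[k], sigma, v)

def dataset_density(records, p, mu, sigma):
    """p_F(EMP, SALE): product of per-record densities (Equation 4-20).
    records is a list of (f1(e), cnt(e, SALE)) pairs, one per EMP record."""
    density = 1.0
    for v, k in records:
        density *= record_density(v, k, p, mu, sigma)
    return density
```

In practice one would work with log-densities to avoid underflow over a large EMP; the product form is kept here to mirror Equation 4–20.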
A Note on The Generality of the Model. As described, our model is extremely
general, making almost no assumptions about the data other than the fact that f1(e)
values are normally distributed. This is actually an inconsequential assumption anyway,
since we are interested in sums over f1(e) values which will be normally distributed
whatever the distribution of f1(e) due to the CLT.
On one hand, this generality can be seen as a benefit of the approach: it makes use of
very few assumptions about the data. Most significant is the lack of any sort of restriction
on the probability vector p. The result is that the number of records from SALE matching
a certain record from EMP is multinomially distributed. On the other hand, a Bayesian
argument [39] can be made that such extreme freedom is actually a poor choice, and that
in "real life", an analyst will have some sort of idea what the various pi values look like,
and a more restrictive distribution providing fewer degrees of freedom should be used.
For example, a negative binomial distribution has been assumed for the distinct value
estimation problem [90]. Such background knowledge could certainly improve the accuracy
of the method.
Though we eschew any such restrictions in the remainder of the thesis (except for an
assumption of a linear relationship among the µi values; see “Dealing with Over-fitting”
in the next section), we note that it would be very easy to incorporate such knowledge
into our method. The only change needed is that the EM algorithm described in the next
section would need to be modified to incorporate any constraints induced on the various
parameters by additional distributional assumptions.
4.5.2 Estimation of Model Parameters
Now that we have defined our superpopulation model, we need access to the
parameter set Θ that was used to create our particular instances of EMP and SALE in
order to develop an estimator that performs well for the resulting superpopulation.
However, we have several difficulties. First, we do not know Θ; since EMP and SALE are in
reality not sampled from any parametric distribution, Θ does not even exist. We could
compute a maximum-likelihood estimate (MLE) to choose a Θ that optimally fits EMP
and SALE, but then we have an even bigger problem: we do not even have access to EMP
and SALE; we only have access to samples from them. Thus, we need a way to infer Θ by
looking only at the samples EMP′ and SALE′.
It turns out that we can still make use of an MLE. Since EMP′ may be treated as a
set of independent, identically distributed samples from F , if we simply replace EMP with
EMP′ as an argument to pF, then by choosing Θ so as to maximize pF, we will still produce,
in expectation, exactly the same estimate for Θ that we would have if EMP were used
instead. Thus, we can essentially ignore the distinction between EMP and EMP′. However,
the same argument does not hold for SALE because without access to all of SALE, we
cannot compute k = cnt(e, SALE) for arbitrary e in order to apply an MLE.
To handle this, we will modify our PDF slightly to also take into account the
sampling from SALE. This can easily be done by modifying the function p(v, k|Θ). To
simplify the modification, we ignore the fact that the number of such records s from SALE′
where f3(e1, s) is true may be correlated with the number of records from SALE′ where
f3(e2, s) is true for arbitrary records e1 and e2 from EMP; that is, we assume that we are
looking for matches of a record e in its own “private” sample from SALE and that all
of these samplings are independent. With this, if f1(e) = v and cnt(e, SALE) = k and
cnt(e, SALE′) = k′ then:
p(v, k, k′|Θ) = p(v, k|Θ)h(k′; k) (4–21)
In this expression, h is the hypergeometric probability of seeing k′ matches for e in
SALE′, given that there were k matches in SALE.
(An MLE is a standard statistical estimator for unknown model parameters when a sample is available; the MLE simply chooses Θ so as to maximize the value of the PDF of the sample.)
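For reference, the hypergeometric probability h(k′; k) appearing in Equation 4–21 can be computed exactly from the sample and relation sizes. A hypothetical Python sketch (names are ours):

```python
import math

def hyper_prob(k_prime, k, n_sale, n_sale_sampled):
    """h(k'; k): hypergeometric probability of observing exactly k' of the
    k matching SALE records when n_sale_sampled of the n_sale SALE records
    are sampled without replacement."""
    if (k_prime < 0 or k_prime > k or k_prime > n_sale_sampled
            or n_sale_sampled - k_prime > n_sale - k):
        return 0.0
    return (math.comb(k, k_prime)
            * math.comb(n_sale - k, n_sale_sampled - k_prime)
            / math.comb(n_sale, n_sale_sampled))
```

For example, with 3 matches among 10 SALE records and a sample of size 4, the chance of seeing exactly 2 of the matches is C(3,2)·C(7,2)/C(10,4) = 0.3.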
Since the portion of SALE that is not in SALE′ is hidden to us due to the sampling, we
do not know k and we have a classic example of an MLE problem with hidden or missing
data. There are several methods from the literature for solving such a problem; the one
that we employ is the Expectation Maximization (EM) algorithm.
The EM algorithm [26] is a general method of finding the maximum-likelihood
estimate of the parameters of an underlying distribution from a given data set when the
data is incomplete or has missing values. EM starts out with an initial assignment of
values for the unknown parameters and at each step, recomputes new values for each of
the parameters via a set of update rules. EM continues this process until the likelihood
stops increasing any further. Since cnt(e, SALE) is unknown, the likelihood function is:

L(Θ | {EMP′, SALE′}) = ∏_{e∈EMP′} ∑_{k=1}^{m} p(f1(e), k, cnt(e, SALE′) | Θ)
We present the derivation of our EM implementation in the Appendix, while here we
give only the algorithm. In this algorithm, p(i|Θ, e) denotes the posterior probability for
record e belonging to class i. This is the probability that given the current set of values for
Θ, record e belongs to class i.
———————————————————————————
Procedure EM(Θ)
1 Initialize all parameters of Θ; Lprev = −9999
2 while (true) {
3   Compute L(Θ) from the sample and assign it to Lcurr
4   if ((Lcurr − Lprev)/Lprev < 0.01) break
5   Compute posterior probabilities for each e ∈ EMP′ and each k
6   Recompute all parameters of Θ by using the following update rules:
7     µi = [ ∑_{e∈EMP′} p(i|Θ′, e) × f1(e) ] / [ ∑_{e∈EMP′} p(i|Θ′, e) ]
8     σ^2 = [ ∑_{e∈EMP′} ∑_{j=0}^{m} p(j|Θ′, e) (f1(e) − µj)^2 ] / [ ∑_{e∈EMP′} ∑_{j=0}^{m} p(j|Θ′, e) ]
9     pi = [ ∑_{e∈EMP′} p(i|Θ′, e) ] / [ ∑_{e∈EMP′} ∑_{j=0}^{m} p(j|Θ′, e) ]
10  Lprev = Lcurr
11 }
12 Return values in Θ as the final parameters of the model
————————————————————————————
Every iteration of the EM algorithm performs an expectation (E) step and a
maximization (M) step. In our algorithm, the E-step is contained in step (5), where
for each record e of EMP′, a set of probability values p(i|Θ, e), 0 ≤ i ≤ m, is computed
under the current model parameters, Θ. The posterior probability p(i|Θ, e) is computed
as described in the Appendix. Intuitively, the posterior probability for record e and class
i is a ratio of two quantities: (1) the probability that e belongs to class i according to the
density function of the model, and (2) the sum of probabilities that it belongs to each of
the classes 0 through m, also according to the model density function.
The M-step (which corresponds to steps (6)–(9) of our algorithm) updates the
parameters of our model in such a way that the expected value of the likelihood function
associated with the model is maximized with respect to the posterior probabilities. Details
of how we obtain the various update rules are explained in the Appendix.
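To make the E- and M-steps concrete, here is a hypothetical, deliberately simplified Python sketch that fits only the mixture part of the model (class probabilities pi, class means µi, and a single shared variance σ2) to observed f1 values; the hypergeometric treatment of the hidden counts cnt(e, SALE) and the linear constraint µi = s × i + µ0 are omitted for brevity:

```python
import math

def em_mixture(values, m, iters=200, tol=1e-3):
    """Simplified EM matching update rules (7)-(9): fit p_0..p_m,
    mu_0..mu_m, and one shared variance to the observed f1 values.
    (The hidden-count machinery of the full method is not modeled.)"""
    n = len(values)
    lo, hi = min(values), max(values)
    mu = [lo + (hi - lo) * i / m for i in range(m + 1)] if m > 0 else [(lo + hi) / 2]
    p = [1.0 / (m + 1)] * (m + 1)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n or 1.0
    prev_ll = None
    for _ in range(iters):
        # E-step (step 5): posterior p(i | Theta, e) for each record.
        resp, ll = [], 0.0
        for v in values:
            dens = [p[i] * math.exp(-(v - mu[i]) ** 2 / (2 * var))
                    / math.sqrt(2 * math.pi * var) for i in range(m + 1)]
            tot = sum(dens)
            ll += math.log(tot)
            resp.append([d / tot for d in dens])
        if prev_ll is not None and abs(ll - prev_ll) < tol:
            break  # likelihood has stopped increasing
        prev_ll = ll
        # M-step (steps 6-9): re-estimate mu_i, p_i, and shared sigma^2.
        for i in range(m + 1):
            w = sum(r[i] for r in resp) or 1e-12
            mu[i] = sum(r[i] * v for r, v in zip(resp, values)) / w
            p[i] = w / n
        var = sum(r[i] * (v - mu[i]) ** 2
                  for r, v in zip(resp, values) for i in range(m + 1)) / n
    return p, mu, var
```

This is a sketch under stated assumptions, not the thesis implementation; it is only meant to show how rules (7)–(9) alternate with the posterior computation.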
The observant reader may note that the EM algorithm assumes that the parameter
m is known before the process has begun. This is potentially a problem, since m will
typically be unknown. Fortunately, knowing the exact value for m is not vital, particularly
if m is overestimated (in which case the class probabilities associated with the class i
records for large i will end up being zero, if the EM algorithm functions correctly). As a
rough estimate for m, we take the record from EMP′ with the largest number of matches in
SALE′ and scale up the number of matches by nSALE/nSALE′ . Particularly if several records
with m matches in SALE are expected to appear in EMP′, this estimate for m will be quite
conservative.
Dealing with Over-fitting. The superpopulation model has a total of 2(m + 1) + 1
parameters within Θ. Since the number of degrees of freedom of the model is so large, the
model has a tremendous leeway when choosing parameter values. This potentially leads to
a well-known drawback of learned models – over-fitting the training data, where the model is tailored to be excessively well-suited to the training data at the cost of generality.

Several techniques have been proposed to address the over-fitting problem [30]. We use the following two methods in our approach:
• Limiting the number of degrees of freedom of the model.
• Using multiple models and combining them to develop our final estimator.
To use the first technique, we restrict our generative model so that the mean
aggregate value of all records of any class i is not independent of the mean value of
other classes. Rather, we use a simple linear regression model µi = s × i + µ0, where s and µ0 are
the two parameters of the linear regression model and can be learned easily. This means
that once we have learned the two parameters s and µ0, the µi values for all other classes
can be determined directly by the above relation and will not be learned separately. As
mentioned previously, it would also be possible to place distributional constraints upon the
vector p in order to reduce the degrees of freedom even more, though we choose not to do
this in our implementation.
Our second strategy to tackle the over-fitting problem is to learn multiple models
rather than working with a single model. These models differ from each other only in that
they are learned using our EM algorithm with different initial random settings for their
parameters. When generating populations from the models learned via EM (as described
in the next subsection), we then rotate through the various models in round-robin fashion.
Are we not done yet? Once the model has been learned, a simple estimator is
immediately available to us: we could return p0 × µ0 × nEMP, since this will be the expected
query result over an arbitrary database sampled from the model. This is equivalent to
first determining a class of databases that the database in question has been randomly
selected from, and then returning the average query result over all of those databases. If
multiple models are learned in order to alleviate the over-fitting problem, then we can use
the average of this expression over all of those models.
While this estimator is certainly reasonable, the concern is twofold. First, if there is
high variability in the possible populations that could be produced by the model or models
(corresponding to uncertainty in the correctness of the model), then simply taking the
average over all of these populations can be expected to result in an answer with high variance.
A related concern is that this is not very robust to errors in the model-learning process –
an error in the model will lead directly to an error in the estimate.
Thus, in the next few subsections we detail a process that attempts to simultaneously
perform well on any and all of the databases that could be sampled from the model,
rather than simply returning the mean answer over all potential databases. The method
samples a large number of ((EMPi, SALEi), (EMP′i, SALE′i)) combinations from the model,
and then attempts to construct an estimator that can accurately infer the query answer
over precisely the (EMPi, SALEi) that has been sampled by looking at (EMP′i, SALE′i).
4.5.3 Generating Populations From the Model
Once we know the parameter set Θ, the next task is to generate many instances
of Pi = (EMPi, SALEi) and Si = (EMP′i, SALE′i) in order to optimize our biased estimator
over these population-sample pairs. The difficulty is that in practice, EMP and SALE can
have billions of records in them. Hence, it would not be feasible to actually materialize
each (Pi, Si) pair. The good news is that for our problem it is not necessary to actually
generate the populations if we can generate statistics associated with the pair that are
sufficient to optimize our biased estimator.
Computing sufficient statistics for EMP and SALE. For each Pi, we must generate the following statistics:
• The number of records of EMP belonging to each class (we use ni to denote this).
• The mean over f1 for all records belonging to each class.
The first set of statistics is easy to generate if we notice that the number of records
belonging to each class follows a multinomial distribution with nEMP trials, where each
multinomial bucket probability is given by the vector p. A single, vector-valued sample
from an appropriately-distributed multinomial distribution can then give us each ni.
The next set of statistics can be computed by relying on the CLT. According to
the generative model, the aggregate attribute value of a superpopulation record belonging
to class i has mean µi and variance σ2. Since the population is an i.i.d.
random sample from the superpopulation, the mean aggregate value of records belonging
to class i follows a normal distribution with mean µi and variance σ2/ni.
Thus, ti, the sum over the aggregate attribute of all records of class i, can be
obtained by drawing a trial from the normal distribution N(µi, σ2/ni) and multiplying it
by ni.
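A hypothetical Python sketch of these two population statistics (names and parameter values are ours; random.choices supplies the multinomial draw, and a single normal trial supplies each class sum, so no records are materialized):

```python
import math
import random

def population_stats(n_emp, p, mu, sigma, seed=0):
    """Sufficient statistics for one generated population P_i:
    counts[i] - number of EMP records in class i (one multinomial draw),
    sums[i]   - total f1 over class i, drawn as counts[i] * N(mu[i],
                sigma^2 / counts[i]) via the CLT approximation."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for c in rng.choices(range(len(p)), weights=p, k=n_emp):
        counts[c] += 1
    sums = [n_i * rng.gauss(mu[i], sigma / math.sqrt(n_i)) if n_i > 0 else 0.0
            for i, n_i in enumerate(counts)]
    return counts, sums

counts, sums = population_stats(10_000, [0.5, 0.3, 0.2],
                                [100.0, 105.0, 110.0], 10.0)
```

A dedicated multinomial sampler would be faster than classifying nEMP pseudo-records one at a time; the loop is kept for clarity.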
Computing sufficient statistics for EMP′ and SALE′. For each Si, we must generate the
following statistics:
• The number of sampled records from each class of EMP; this is denoted by n′i.
• The number of sampled records from the ith class of EMP that have j matches in SALE′, for each i and j. We denote this by n′i,j.
• The mean over f1 corresponding to each n′i,j.
The first set of statistics can be produced by repeatedly sampling from a hypergeometric
distribution. To compute n′0, we sample from a hypergeometric distribution with
parameters nEMP, nEMP′ , and n0 (these parameters are the population size, the sample
size, and the size of the subpopulation of interest, respectively). To compute n′1, we sample
from a hypergeometric distribution with parameters nEMP − n0, n′EMP − n′0, and n1. n′2 is a
sample from a hypergeometric distribution with parameters nEMP−(n0+n1), n′EMP−(n′0+n′1),
and n2. This process is repeated for each n′i.
Once each n′i is generated, each n′i,j is generated. In order to speed the process of
generating each n′i,j, we can assume that the expected value of each n′i,j is small compared
to nSALE, so that there is little difference between sampling with and without replacement.
Thus, we can assume that each n′i,j; j ≤ i is binomially distributed which in turn means
that all n′i,j are multinomially distributed, where the probability that any class i record
will have j matches in the sample SALE′ is a hypergeometric probability denoted by h(j; i).
A single trial over a multinomial random variable having probabilities of h(j; i) for j from
0 to i will then give us each n′i,j for a given i.
Finally, again using a CLT-based argument, the mean over f1 for all of the records
corresponding to each n′i,j is generated by a single trial over a normal random variable
N(µi, σ2/n′i,j).
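A sketch of the sequential hypergeometric scheme for the n′i (hypothetical Python; the multinomial step for the n′i,j would follow the same pattern, and a library hypergeometric sampler could replace the one-draw-at-a-time simulation used here):

```python
import random

def hypergeom_draw(rng, pop_size, sample_size, n_success):
    """One draw from Hypergeometric(pop_size, sample_size, n_success),
    simulated by sampling without replacement one record at a time."""
    hits, remaining_success, remaining = 0, n_success, pop_size
    for _ in range(sample_size):
        if rng.random() * remaining < remaining_success:
            hits += 1
            remaining_success -= 1
        remaining -= 1
    return hits

def sample_class_counts(counts, n_emp_sample, seed=0):
    """Generate n'_i: how many of the n_emp_sample sampled EMP records
    fall in each class. Class 0 is drawn first from the full population,
    class 1 from what remains, and so on, as described in the text."""
    rng = random.Random(seed)
    remaining_pop, remaining_sample = sum(counts), n_emp_sample
    out = []
    for n_i in counts:
        n_prime = hypergeom_draw(rng, remaining_pop, remaining_sample, n_i)
        out.append(n_prime)
        remaining_pop -= n_i
        remaining_sample -= n_prime
    return out

n_prime = sample_class_counts([5000, 3000, 2000], 1000)
```

By construction the last class absorbs whatever sample budget remains, so the n′i always total the sample size.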
4.5.4 Constructing the Estimator
We have seen in the previous subsection that once a model has been learned, it can be used to generate statistics for any number of population/sample pairs.
Recall from Section 4.4 that the jth population generated and the sample from that
population are Pj = (EMPj, SALEj) and Sj = (EMP′j, SALE′j), respectively. Let sij be the
value of si computed over Sj; that is, it is the sum for f1 over all tuples in EMP′j that have i
matches in SALE′j. Our goal in all of this is to construct a weighted estimator:
W(Sj) = ∑_{i=0}^{m} wi sij (4–22)
that minimizes:
SSE = ∑_j (W(Sj) − q(Pj))^2 = ∑_j ( ∑_{i=0}^{m} wi sij − q(Pj) )^2 (4–23)
where q(Pj) is the answer to the NOT EXISTS query over the jth population.
W should be optimized by choosing each wi so as to minimize the SSE (sum-squared-error)
given above. In order to compute these weights we evaluate the partial derivative of the
SSE w.r.t each of the unknown weights. For example, by taking the partial derivative of
the SSE w.r.t w0, we obtain:
∂SSE/∂w0 = ∑_j 2 ( ∑_{i=0}^{m} wi sij − q(Pj) ) s0j
If we differentiate with respect to each wi and set the resulting m + 1 expressions to
zero, we obtain m + 1 linear equations in the m + 1 unknown weights. These equations can
be represented in the following matrix form:
⎡ ∑_j s0j^2      ∑_j s0j s1j   · · ·   ∑_j s0j smj ⎤ ⎡ w0 ⎤   ⎡ ∑_j s0j q(Pj) ⎤
⎢ ∑_j s0j s1j    ∑_j s1j^2     · · ·   ∑_j s1j smj ⎥ ⎢ w1 ⎥   ⎢ ∑_j s1j q(Pj) ⎥
⎢      ⋮              ⋮                     ⋮      ⎥ ⎢ ⋮  ⎥ = ⎢       ⋮       ⎥
⎣ ∑_j s0j smj    ∑_j s1j smj   · · ·   ∑_j smj^2   ⎦ ⎣ wm ⎦   ⎣ ∑_j smj q(Pj) ⎦
The optimal weights can then be easily obtained by using a linear equation solver to
solve the above system of equations.
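As an illustration (not the thesis implementation), the normal equations can be assembled and solved in a few lines of Python, using plain Gaussian elimination in place of a library linear-equation solver; S[j][i] holds s_ij and q[j] holds q(P_j):

```python
def optimal_weights(S, q):
    """Solve the (m+1)x(m+1) normal equations for the weights w_i that
    minimize SSE = sum_j (sum_i w_i * s_ij - q(P_j))^2."""
    m1 = len(S[0])
    # Build A = S^T S and b = S^T q.
    A = [[sum(row[a] * row[b] for row in S) for b in range(m1)]
         for a in range(m1)]
    b = [sum(row[a] * qj for row, qj in zip(S, q)) for a in range(m1)]
    # Gaussian elimination with partial pivoting.
    for col in range(m1):
        piv = max(range(col, m1), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m1):
            f = A[r][col] / A[col][col]
            for c in range(col, m1):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back-substitution.
    w = [0.0] * m1
    for r in range(m1 - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, m1))) / A[r][r]
    return w
```

Since m is small relative to the number of generated populations d, the cost of this solve is negligible next to generating the sufficient statistics.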
Once W has been derived, it is then applied to the original samples EMP′ and SALE′
in order to estimate the answer to the query. By dividing the SSE obtained via the
minimization problem described above by the number of data sets generated, we can also
obtain a reasonable estimate of the mean-squared error of W .
4.6 Experiments
In this section we describe results of the experiments we performed to test our
estimators. Our experiments are designed to test the accuracy of our estimators and the
running time of the biased estimator, over a wide variety of data sets.
4.6.1 Experimental Setup
In this subsection, we describe the properties of the various data sets we use to test
our estimators. We generate 66 synthetic data sets and use three real-life data sets for
conducting our experiments. All our experiments were performed on a Linux workstation
having 1 GB of RAM and a 2.4 GHz clock speed and all software was implemented using
the C++ programming language.
4.6.1.1 Synthetic data sets
In each data set, we have two relations, EMP(EID, AGE, SAL) and SALE(SALEID,
EID, AMOUNT) of size 10 million and 50 million records, respectively. We evaluate the
following SQL query over each data set:
SELECT SUM (e.SAL)
FROM EMP as e
WHERE NOT EXISTS
(SELECT * FROM SALE AS s
WHERE s.EID = e.EID)
Two important data set properties that affect the query result are:
1. The distribution of the number of matching records in SALE for each record of EMP
2. The distribution of e.SAL values of all records of EMP
Based on these two important properties, we synthetically generated data sets so
that the distribution of the number of matching records for all EMP records follows a
discretized Gamma distribution. The Gamma distribution was chosen because it produces
positive numbers and is very flexible, allowing a long tail to the right. This means that it
is possible to create data sets for which most records in EMP have very few matches, but
some have a large number. We chose values of 1, 2 and 5 for the Gamma distribution’s
shift parameter and values of 0.5 and 1 for the scale parameter. Based on these different
values for the shift and scale parameters, we obtained six possible data sets: 1: (shift =
1, scale = 0.5); 2: (shift = 2, scale = 0.5); 3: (shift = 5, scale = 0.5); 4: (shift = 1, scale
= 1); 5: (shift = 2, scale = 1); and 6: (shift = 5, scale = 1). For these six data sets, the
fraction of EMP records having no matches in SALE (and thus contributing to the query
answer) were .86, .59, .052, .63, .27, and .0037, respectively. A plot of the probability that
an arbitrary tuple from EMP has m matches in SALE for each of the six data sets is given as
Figure 4-2. This shows the wide variety of data set characteristics we tested.
Figure 4-2. Six distributions used to generate for each e in EMP the number of records s in SALE for which f3(e, s) evaluates to true.
We also varied the distribution of the e.SAL values such that the distribution can be
one of the following:
• a. Normally distributed with a mean of 100 and standard deviation of 10
• b. Normally distributed with a mean of 100 and standard deviation of 200, with only
the absolute values considered
• c. Zipfian distributed with a skew parameter of 0.5
• d. Zipfian distributed with a skew parameter of 1.0
We doubled the number of data sets by further providing a linear positive correlation
or no correlation between the e.SAL value of a record and the number of matching
records it has in SALE. We thus obtained 48 different data sets considering all possible
combinations of the distribution of matching records and the distribution of e.SAL values.
We also tested our estimator on 18 additional synthetic data sets that were
deliberately designed to have properties that violate the assumptions of the superpopulation
model of our biased estimator, so as to see how robust this estimator is to inaccuracies in
the parametric model. From Section 4.5.1, the three specific assumptions we made for our
superpopulation model were:
1. cnt(e, e′, SALE) = 0 when e′ ≠ e. Thus, the number of SALE records s for which
f3(e, s) ∧ f3(e′, s) is true is zero. In other words, different records from EMP do not
“share” matching records in SALE.
2. There exists a linear relationship between the mean aggregate values of the different
classes of EMP records, given by µi = s × i + µ0, where s is the slope of the straight line
connecting the various µi values.
3. The variance of the aggregate attribute values of records of any class is approximately
equal to the single model parameter σ2.
For each of these three cases, we generate six different data sets using the six different
sets of gamma parameters described earlier. Thus we obtain 18 more data sets where the
first six sets violate assumption 1, the next six sets violate assumption 2 and the last six
sets violate assumption 3. For each of these 18 data sets, the aggregate attribute value is
normally distributed with a mean of 100 and standard deviation of 200 except for the last
six sets where different values of standard deviation are chosen for records from different
classes.
In order to violate assumption 1, we no longer assume a primary key-foreign key
relationship between EMP and SALE. To generate a data set violating this assumption, a set
s1 of records of size 100 from EMP is selected. Let max be the largest number of matches
in SALE for any record from s1. Then an associated set s2 of max records is added to SALE
such that all records in s1 have their matching records in s2. Assumption 2 was violated
using µi = s × j + µ0, where j ≠ i (in fact, the j value for a given i is randomly selected
from 1...m). Assumption 3 was violated by assuming different values for the variance
of records from different classes. We randomly chose these values from the range (100,
15000).
4.6.1.2 Real-life data sets
The three real-life data sets we use in our experiments are from the Internet Movie
Database (IMDB) [1], the Synoptic Cloud Reports [3] obtained from the Oak Ridge
National Laboratory, and the network connections data sets from the 1999 KDDCup
event.
The IMDB database contains several relations with information about movies, actors
and production studios. For our experiments, we use the two relations MovieBusiness
and MovieGoofs. MovieBusiness contains information about box-office revenues of movies
while MovieGoofs contains records that describe unintended mistakes or “goofs” in various
movies. The following schema shows the relevant attributes of the two relations for the
queries we tested in our experiments.
MovieBusiness (MovieName, NumAdmissions)
MovieGoofs (GoofId, MovieName)
MovieName is the primary key of MovieBusiness and a foreign key of MovieGoofs.
We tested the following three SQL queries on the two relations of the IMDB dataset.
Q1: SELECT SUM (b.NumAdmissions)
FROM MovieBusiness as b
WHERE NOT EXISTS
(SELECT * FROM MovieGoofs AS g
WHERE g.MovieName = b.MovieName)
Q2: SELECT SUM (b.NumAdmissions)
FROM MovieBusiness as b
WHERE NOT EXISTS
(SELECT * FROM MovieBusiness AS b2
WHERE id(b) < id(b2)
AND b.NumAdmissions = b2.NumAdmissions)
Q3: SELECT COUNT (*)
FROM MovieBusiness as b
WHERE NOT EXISTS
(SELECT * FROM MovieGoofs AS g
WHERE g.MovieName = b.MovieName)
The second real-life data set we use is the Synoptic Cloud Report (SCR) data set. It
contains weather reports for a 10-year period obtained from measuring stations on land
as well as water. We use weather reports for the months of December 1981 and November
1991 from measuring stations on land. Specifically, the two relations and their relevant
schema used in our experiments are:
DEC81 (Id, Latitude, CloudAmount)
NOV91 (Id, Latitude, CloudAmount)
Here, Id is the key in both the relations. We tested the following two SQL queries on
the relations DEC81 and NOV91.
Q4: SELECT SUM (D81.CloudAmount)
FROM DEC81 as D81
WHERE NOT EXISTS
(SELECT * FROM NOV91 AS N91
WHERE N91.Latitude = D81.Latitude)
Q5: SELECT COUNT (*)
FROM DEC81 as D81
WHERE NOT EXISTS
(SELECT * FROM NOV91 AS N91
WHERE N91.Latitude = D81.Latitude)
The KDDCup data set contains information about various network connections that
can potentially be used for intrusion detection. This data set has 42 integer, real-valued,
and categorical attributes. We tested our estimator on this data set by estimating the
total number of source bytes of connections that were “significantly different” from
the rest of the network connections. That is, we summed the total number of source
bytes created by outlier connections. Our definition of “significantly different” records is
those records whose distance from all other records in the data set is greater than some
predefined threshold. For our experiments, we use a simple distance function that uses
Euclidean distance for numerical attributes and a 0/1 distance for categorical attributes.
We execute the following query on the KDDCup data set for our experiments.
SELECT SUM (kc1.SourceBytes)
FROM KDDCup as kc1
WHERE NOT EXISTS
(SELECT * FROM KDDCup AS kc2
WHERE id(kc1) <> id(kc2) AND d(kc1, kc2) < dthreshold)
By choosing different values for dthreshold, we can control the selectivity of the above
query. For our experiments, we define Q6, Q7 and Q8 as three variants of the above query
with different values of dthreshold so that Q6 has a selectivity of around 24%, Q7 has a
selectivity of 1.75% while Q8 has a selectivity of 0.4%.
4.6.2 Results
We ran our experiments on 1%, 5% and 10% random samples of the data sets (both
relations in each data set were sampled independently without replacement at the same
rate). Both the biased estimator and the unbiased estimator were run ten times on each
of the test cases. For comparison we also analytically compute the standard error for the
concurrent estimator described in Section 4.2. Results from the first 48 synthetic data sets
are given in Tables 4-1 and 4-2, while results from the next 18 synthetic data sets (which
specifically violate the model assumptions) are presented in Table 4-3. Real-life data set
results are shown in Table 4-4. For each of the test cases, we give the square root of the
observed mean-squared error (that is, the standard error) for the biased, unbiased, and
concurrent estimators. Because having an absolute value for the standard error lacks any
sort of scale and thus would not be informative, we give the standard error as a percentage
of the total aggregate value of all records in the database. For example, for the synthetic
data sets, we give the standard error as a percentage of the answer to the query:
SELECT SUM (e.SAL)
FROM EMP as e
Thus, if the estimation method simply returned zero every time, its error would vary
between 0% and 100%, depending on the selectivity of the subquery. If the method is
also able to estimate with high accuracy which of the constituent records should not be
counted in the aggregate total, then the error can be reduced to an arbitrarily small level.
Although our error metric is different from the relative error (which takes the ratio
of the absolute error to the true query answer), the relative error can be readily computed
from the reported error value by dividing it by the ratio of the query answer to the total
aggregate value of all records in the outer relation. For all the eight cases of
data set 1, the query answer is approximately 86% of the total answer. Hence, the relative
error is about 1.1 times the error reported in Table 4-1. Similarly for the rest of the data
sets, the factors are: data set 2: 1.7; data set 3: 19; data set 4: 1.5; data set 5: 3.7 and
data set 6: 270. For the IMDB and SCR data sets, the factors are between 1 and 5.5 while
for the KDDCup the factors range from 2 (for the high selectivity query) to 40 (for the
very low selectivity query).
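The conversion described above is simple arithmetic; as a quick sanity check, here is a tiny sketch (the function name and values are our own illustration, not from the experimental code):

```python
# Convert a reported error (expressed as a percentage of the total aggregate
# value of all records) into a relative error (a percentage of the true
# query answer). The conversion factor is simply total / answer.

def relative_error(reported_error_pct, answer_fraction):
    """answer_fraction: the query answer as a fraction of the total aggregate value."""
    return reported_error_pct / answer_fraction

# Data set 1: the answer is ~86% of the total, so a reported error of 7.39%
# corresponds to a relative error of 7.39 / 0.86, roughly 8.6%.
print(round(relative_error(7.39, 0.86), 2))  # 8.59
```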
When we tested the queries, we also recorded the number of times (out of ten) that
the answer given by the biased estimator was within ±2 estimated standard errors of the
real answer to the query, and found that for almost all the test cases this number was ten,
while for a couple of test cases it was nine out of ten.
Finally, we measured the computation time required by the biased estimator to
initially learn the generative model, then compute weights for the various components of
the estimator, and to finally provide an estimate of the query result. We observed that
for the synthetic data sets (which consist of 10 million and 50 million records in the two
relations) the maximum observed running time of the biased estimator was between 3 and 4
seconds for a 10% sample from each. The vast majority of this time is spent in the EM
learning algorithm, which requires O(m × |EMP′| × i) time, where m is the maximum
possible number of matches for a record in EMP with records in SALE, and i is the number
of iterations required for EM convergence. We speed up our implementation by sub-sampling
EMP′ and using the subsample in the EM algorithm rather than using EMP′ directly. The
justification for this is that the EM can be quite expensive with a large EMP′, and the
accuracy of the modeling step is much more closely related to the size of SALE′. We use a
subsample of size 500 in our experiments.
In comparison, computation for the unbiased estimator is almost instantaneous,
requiring a small fraction of a second. In our test data, the most costly operation for
the unbiased estimator is running the “join” between EMP′ and SALE′; that is, searching
for matches for each record from EMP′ in SALE′. Given summary statistics describing this
matching, the core GetEstTi routine itself can be implemented as a dynamic programming
algorithm that takes time O(m′2), where m′ is the maximum number of matches for any
record from EMP′ in SALE′.
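The hypergeometric probabilities that GetEstTi relies on can be computed exactly with integer arithmetic; the following minimal sketch shows only the probability computation (it is our own illustration, not the thesis's GetEstTi routine):

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """Probability of seeing exactly k 'special' items in a without-replacement
    sample of n items from a population of N items, K of which are special."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# e.g., 3 draws from 10 items of which 4 are special:
print(hypergeom_pmf(1, N=10, K=4, n=3))  # C(4,1)*C(6,2)/C(10,3) = 0.5
```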
4.6.3 Discussion
One of the most obvious results from Table 4-1 is that the unbiased estimator has
uniformly small error only on those eight tests performed using synthetic data set 1,
where the number of matches for each record e ∈ EMP is generated using a Gamma
distribution with parameters (shift = 1, scale = 0.5). In this particular data set, only a
very small number of the records are excluded by the NOT EXISTS clause since 86% of the
records in EMP do not have a match in SALE. Furthermore, only a very small number of the
records have a large number of matches. Both of these characteristics tend to stabilize the
variance of the unbiased estimator, making it a fine choice.
For all the other data sets, the unbiased estimator does very poorly for most of the
cases. For synthetic data, the estimator’s worst performance is for data set 6, in which
less than one percent of the records are accepted by the NOT EXISTS clause and several
records from EMP have more than 15 matching records in SALE. In this case, the unbiased
estimator is unusable, and the results were particularly poor when there was correlation
between the number of matches and the aggregate value that is summed. For example, in the
Data set type                      1% Sample error          5% Sample error          10% Sample error
Gamma  Correlated?  Val. Dist.   U(%)    C(%)    B(%)     U(%)    C(%)    B(%)     U(%)    C(%)    B(%)
1      No           a.           7.39   13.32   38.30     2.39   12.62    3.88     1.09   11.89    1.46
1      No           b.           6.69   13.45   37.87     3.04   12.63    5.92     1.08   11.93    1.38
1      No           c.           6.89   12.92   22.59     5.23   12.04    8.18     3.79   11.23    7.09
1      No           d.          16.65    6.32   68.37    15.94    6.19   29.34     9.56    5.94   19.72
1      Yes          a.          11.90   20.90   34.50     4.59   19.94    2.26     3.15   18.68    1.42
1      Yes          b.          13.50   17.80   36.30     4.07   16.37    5.12     1.75   15.50    2.18
1      Yes          c.           7.70   15.06   21.14     5.69   14.06    7.84     3.98   13.13    6.21
1      Yes          d.          18.05    1.04   66.94    16.26    0.52   25.35    12.98    0.41   15.33
2      No           a.          11.79   40.12    6.09     8.10   37.98    3.55     2.43   35.44    3.37
2      No           b.          13.65   39.48    5.00     6.82   37.86    4.83     2.54   35.51    4.03
2      No           c.         179.87   39.20   14.75     6.35   37.00    8.34     4.54   34.44    7.12
2      No           d.          31.60   20.45   43.43    10.24   19.26   12.88     9.99   17.08    6.25
2      Yes          a.          24.70   65.60   21.39    19.83   62.00   18.45     4.78   57.51   13.70
2      Yes          b.          19.34   54.27   12.99    12.61   51.19   12.28     3.46   47.72    7.48
2      Yes          c.         220.14   46.60   23.01    12.19   44.01   12.01     5.10   40.88    5.10
2      Yes          d.          52.61   39.08   39.45    19.62   36.75    5.32     9.20   33.19    2.25
3      No           a.         234.60   92.75   18.61    59.67   84.91   12.22    33.00   76.00    6.28
3      No           b.         315.97   93.29   19.42    70.32   84.68   11.68    34.78   76.05    5.84
3      No           c.         188.17   91.50   20.53    46.14   84.01   18.50    24.92   75.07   15.80
3      No           d.         139.27   72.67   14.24    63.56   67.36   12.18     6.79   59.83    5.33
3      Yes          a.         753.73  189.70   42.19   220.00  172.10   28.99   115.25  151.85   17.02
3      Yes          b.         421.00  146.70   30.93   151.00  133.50   21.05    74.50  118.40   11.99
3      Yes          c.         240.20  119.80   28.28    74.66  109.50   25.99    42.57   97.22   21.86
3      Yes          d.          47.95  144.61   33.85    18.52  130.93   28.69     3.63  114.00   18.63

Table 4-1. Observed standard error as a percentage of SUM (e.SAL) over all e ∈ EMP for 24
synthetically generated data sets. The table shows errors for three different sampling
fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the
three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B -
Model-based biased estimator.
Data set type                      1% Sample error          5% Sample error          10% Sample error
Gamma  Correlated?  Val. Dist.   U(%)    C(%)    B(%)     U(%)    C(%)    B(%)     U(%)    C(%)    B(%)
4      No           a.         153.70   36.20   14.52    37.17   33.90    4.73    24.47   31.20    0.89
4      No           b.         226.00   37.00   18.56    50.32   33.95    5.27    42.87   31.11    1.33
4      No           c.         242.70   35.20   11.10    19.40   32.85    3.62    17.03   30.04    3.59
4      No           d.         146.37   16.56   45.16    23.60   14.85   21.26     8.85   12.62   16.61
4      Yes          a.         418.70   64.50   10.85   116.55   59.94    2.71    27.55   54.52    1.64
4      Yes          b.         327.02   52.06    8.62    75.95   48.42    3.92    45.62   44.12    2.83
4      Yes          c.         359.60   43.40   13.90    30.19   40.39    7.17    27.21   36.80    5.16
4      Yes          d.          1.1e3   37.53   40.29    54.33   33.99   10.66    18.94   29.32    5.68
5      No           a.         236.00   72.04   13.19    46.18   66.08   12.07    38.30   59.60    6.15
5      No           b.         395.00   72.30   11.78    55.78   66.09   11.73    42.73   59.55    5.37
5      No           c.         167.70   71.10    7.70   120.81   65.20    1.99    62.70   58.50    1.15
5      No           d.         135.65   51.87   13.58    77.12   48.29    4.30    24.14   42.21    4.16
5      Yes          a.         862.00   71.79   31.25   203.81   64.90    7.21    57.22   57.00    2.93
5      Yes          b.         650.80   56.60   28.64   129.75   51.46    6.75    74.16   43.90    1.86
5      Yes          c.         298.70   92.30   11.47   189.70   84.22    4.06    69.63   74.80    2.53
5      Yes          d.         283.26  105.24   10.84   178.61   95.07    9.38   145.78   81.86    3.04
6      No           a.          7.1e3   95.13   19.30    6.2e3   79.49    9.82    4.1e3   63.33    6.09
6      No           b.          1.9e4   95.20   18.40    2.1e3   79.58    9.47    6.6e2   63.40    5.74
6      No           c.          1.9e4   94.32   13.03    1.2e3   78.60    5.96    9.6e2   62.74    1.71
6      No           d.          4.7e4   76.71    7.54    2.0e2   66.87    8.42    68.87   54.96    3.97
6      Yes          a.          5.4e4   307.0   62.00    3.0e4  249.30   30.90    5.7e3  119.00   18.78
6      Yes          b.          4.2e4   214.0   42.70    1.9e4  174.25   21.12    7.0e3  135.00   12.88
6      Yes          c.          3.2e4   156.3   22.70    2.0e3  128.10   10.87    8.7e2  100.12    3.05
6      Yes          d.          1.3e5   234.4   29.78    2.9e3  192.46   28.25    2.4e3  148.28   12.79

Table 4-2. Observed standard error as a percentage of SUM (e.SAL) over all e ∈ EMP for 24
synthetically generated data sets. The table shows errors for three different sampling
fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the
three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B -
Model-based biased estimator.
Data set type             1% Sample error           5% Sample error           10% Sample error
Gamma  Violates        U(%)     C(%)    B(%)     U(%)     C(%)    B(%)     U(%)     C(%)    B(%)
1      (1)             8.83    13.37   62.60     3.12    12.47   15.24     1.19    11.75    4.62
2      (1)            24.66    39.33   34.39     8.14    37.89    2.74     3.41    35.60    2.48
3      (1)            94.11    92.31   21.14    72.94    84.82   16.76    20.27    75.78   13.05
4      (1)            22.30    36.67   37.99    12.72    34.07    7.96     6.34    31.12    2.95
5      (1)           231.50    72.60    6.76   123.30    66.14    6.37    85.68    59.48    4.35
6      (1)          1366.80    95.96    9.99    1.2e3    78.64    5.85    700.0    62.62    1.88
1      (2)            14.18    21.70  100.70     4.42    21.09   26.34     2.69    20.20   12.44
2      (2)            21.62    72.24   59.94    14.25    67.50    7.56     6.25    62.90    4.47
3      (2)            886.2   220.20   45.73    136.0   201.90   31.73    79.75   180.10   25.76
4      (2)            462.0    95.80  106.80   269.19    88.74   22.18    81.03    82.43   11.52
5      (2)           247.60    205.0   18.84    233.0   187.00   17.69    88.55   168.30    9.78
6      (2)          6891.00    369.0   42.30   5988.0   310.00   40.90  1924.00   246.57   19.77
1      (3)            14.70    21.14   61.86     6.24    20.20   10.15     1.13    19.13    2.67
2      (3)            26.15    66.73   29.10    22.49    62.25   20.25     5.38    57.69   17.35
3      (3)           920.10   185.30   41.86   147.60   167.20   30.12    65.63   146.88   27.20
4      (3)            2.3e5    64.42   35.96   714.00    60.54   16.87   150.80    54.77    9.24
5      (3)          1350.30   143.00   33.59   856.00   127.76   29.58   306.70   113.14   10.08
6      (3)            2.2e5   264.02   38.37  4519.10   212.80   34.92  2530.00   162.70   21.96

Table 4-3. Observed standard error as a percentage of SUM (e.SAL) over all e ∈ EMP for 18
synthetically generated data sets. The table shows errors for three different sampling
fractions: 1%, 5% and 10%, and for each of these fractions, it shows the error for the
three estimators: U - Unbiased estimator, C - Concurrent sampling estimator and B -
Model-based biased estimator.
correlated case with a 1% sample, most of the relative standard errors were more than
40000%. Such very poor results are found sporadically throughout most of the data
sets, though the results were somewhat erratic. The reason that the observed errors
associated with the unbiased estimator are highly variable is the very long tail of the
error distribution. Under many circumstances, most of the answers computed using
the unbiased estimator are very good, but there is still a small (though non-negligible)
probability of getting a ridiculous estimate whose error is hundreds of times the sum of
the aggregate values over the entire EMP relation. Indeed, it is worth noting
                          1% Sample error           5% Sample error           10% Sample error
Data Set   Query       U(%)     C(%)    B(%)     U(%)     C(%)    B(%)     U(%)     C(%)    B(%)
IMDB       Q1         9.6e3    27.67   70.88    3.3e3    17.51   33.44    4.1e2    13.71   14.14
IMDB       Q2         1.2e2    75.12   65.10    91.26    62.86   31.97    49.82    52.69    9.31
IMDB       Q3          1.e4    25.21   18.47    3.5e3    16.58   14.38    4.7e2    12.71    1.92
SCR        Q4         1.4e4    65.22   10.31    5.0e3    44.97    6.84    8.2e2    23.27    4.41
SCR        Q5         1.2e4    59.06    9.42    4.6e3    41.62    7.51    7.8e2    24.07    3.95
KDDCup     Q6        1.1e10    60.47   12.39    7.4e4    54.92   10.96    7.6e3    42.08    2.10
KDDCup     Q7       6.5e147    41.30   11.24   5.8e83    26.54    4.32   9.3e36    17.04    3.28
KDDCup     Q8       7.3e210    15.24    8.46  3.6e172    10.80    1.56  2.3e120     6.35    0.98

Table 4-4. Observed standard error as a percentage of the total aggregate value of all
records in the database for 8 queries over 3 real-life data sets. The table shows errors
for three different sampling fractions: 1%, 5% and 10%, and for each of these fractions,
it shows the error for the three estimators: U - Unbiased estimator, C - Concurrent
sampling estimator and B - Model-based biased estimator.
that the unbiased estimator’s worst performance overall was observed on Q8 over the
KDDCup data, where the error was astronomically high: larger than 10100.
In comparison, the biased estimator generally did a very good job predicting the final
query result, and in most cases with a 5% or 10% sampling fraction the observed standard
error was less than 10% of the total aggregate value found in EMP. In other words, if the
total value of SUM (e.SAL) with no NOT EXISTS clause is x, then for just about any query
tested, the standard error was less than x/10, and it was frequently much smaller. This is
actually quite impressive when one considers the difficulty of the problem. The primary
drawback associated with the biased estimator is its complexity: it requires non-trivial,
statistically-oriented computations and a significant amount of computation time, most
of it associated with running the EM algorithm to
completion. By comparison, the unbiased estimate can be calculated via an almost trivial
recursive routine that relies on the calculation of simple hypergeometric probabilities.
One case where the biased estimator had questionable qualitative performance was
with the 16 tests associated with data sets 3 and 6. The problem in this case was that
the EM algorithm tended to overestimate p0 in Θ, which is actually very small in these
two data sets (.052 and .0037, respectively). This results in an error that hovers at 10%+
of the total aggregate value of e.SAL (even for a 5%+ sample) when the real answer is
only 5% of this total for data set 3 or less than 1% of this total for data set 6. We stress
that guessing that only a few percent of the tuples in EMP have no matches in SALE from
a small sample with limited information is an extremely difficult estimation problem,
and we conjecture that without additional information (such as prior knowledge that the
distribution represented by p is a discretized gamma distribution) it will be very difficult
to achieve better results.
Results from the synthetic data sets which specifically violate the assumptions of the
superpopulation model are shown in Table 4-3. The first six rows in the table show results
for data sets in which more than one EMP record can match with a given record from SALE.
The results show that violating this assumption of the model in the actual data set did
not affect the accuracy of the biased estimator significantly. The next set of six rows in the
table show results for data sets in which there is no linear relationship between the mean
aggregate values of the different classes of EMP records. The results show that the biased
estimator is about twice as inaccurate over these data sets as compared to corresponding
data sets which do not have a strict violation of the assumption. The last six rows in
the table show results over data sets in which the variances of the aggregate values of
records from different classes are significantly different. Results show that these data sets
affect the accuracy of the biased estimator as much as the data sets which violate the
“linear relationship of mean values” assumption. However, the results are certainly not
poor when these assumptions are violated, and the method still seems to have qualitative
performance that may be acceptable for many applications, particularly with a larger
sample size.
The results from the eight queries over the three real-life data sets are depicted in
Table 4-4. The key difference in the characteristics of the real-life data sets compared
to the synthetically-generated data sets is the number of matching records in the inner
relation for a given record from the outer relation of the NOT EXISTS query. For the
KDDCup data set, the maximum number of matching records in the inner relation is as
high as 2500, while for the IMDB and SCR data sets this number is about 200 and 90
respectively. Due to this, none of the cases which are favorable for the use of the unbiased
estimator (as described above) are observed in the real-life data sets. On the other hand,
it can be seen from Table 4-4 that the accuracy of the biased estimator is generally quite
good over the real data.
We also note that the standard error of the biased estimator over the learned
superpopulation seems to be a reasonable surrogate for the standard error of the biased
estimator in practice. For most biased estimators, it is reasonable to use the standard
error of the biased estimator in the same way that one would use the standard deviation
of an unbiased estimator when constructing confidence bounds (see Sarndal et al. [109],
Section 5.2). According to the Vysochanskii-Petunin inequality [120], any unbiased
uni-modal estimator will be within three standard deviations of the correct answer 95% of
the time, and according to the more aggressive central limit theorem, an estimator will be
within two standard deviations of the correct answer 95% of the time. We observed that in
almost all of the tests, ten out of ten of the errors for the biased estimator were actually
within two predicted standard errors of zero. This seems to be strong evidence for the
utility of the bounds computed using the predicted standard error of the biased estimator.
We finally remark on the time required for the execution of the biased estimator. The
biased estimator performs several computations including learning the model parameters,
generating sufficient statistics for several population-sample pairs and then solving a
system of equations to compute weights for the various components of the estimator. As
discussed previously, this took no longer than four seconds for the largest samples tested.
If this is not fast enough, we point out that it may be possible to speed this up even more,
though this is beyond the scope of the thesis. While we used the traditional EM algorithm
in our implementation, we note that EM can be made faster by using incremental variants
[69, 95, 116] of the EM algorithm. These variants of the EM algorithm typically achieve
faster convergence by only partially implementing the Expectation and/or the Maximization
step of the EM algorithm.
4.7 Related Work
Estimation via sampling has a long history in databases. One of the oldest and best
known works is Frank Olken’s PhD thesis [97]. Other classic efforts at sampling-based
estimation over database data are the adaptive sampling of Lipton and Naughton [83, 84]
for join query selectivity estimation, and the sampling techniques of Hou et al. [64, 65] for
aggregate queries. More recent well-known work on sampling is that on online aggregation
by Haas, Hellerstein, and their colleagues [47, 60, 61].
The sampling-based database estimation problem that is closest to the one studied
in this chapter is that of sampling for the number of distinct values in a database. As
discussed in the introduction to this chapter, a solution to the problem of estimation over
subset-based queries is a solution to the problem of estimating the number of distinct
values in a database since the latter problem can be written as a NOT EXISTS query. The
classic paper in distinct value estimation is due to Haas et al. [49]. For a survey of the
state-of-the-art work on this problem in databases through the year 2000, we refer the
reader to the Introduction of the paper by Charikar et al. on the topic [17]. The paper
of Bunge and Fitzpatrick [13] provides a survey of work in the statistics area, current
through the early 1990’s. Work in statistics continues on this problem to this day. In
fact, a recent paper from statistics by Mingoti [90] on the distinct value problem provided
inspiration for our use of superpopulation techniques.
Though the problems of distinct value estimation and subset-based aggregate
estimation are related, we note that the problem of estimating the number of distinct
values is a very restricted version of the problem we study in this thesis, and it is not
immediately clear how arbitrary solutions to the distinct value problem can be generalized
to handle subset-based queries. The most obvious difficulty in extending such methods
to subset-based queries is the fact that a NOT EXISTS or related clause results in a
complicated statistic summarizing two populations (the two tables that are queried over).
Nonetheless, links between the problems do exist. For example, though our own unbiased
estimator was not directly inspired by Goodman’s estimator [43]5 and it takes a very
different form, it is easy to argue that our unbiased estimator must be a generalization of
Goodman’s estimator. The reasoning is straightforward: Goodman’s estimator is proven to
be the only unbiased estimator for distinct value queries, and our own unbiased estimator
is unbiased for distinct value queries. Therefore, they must be equivalent when used on
this particular problem.
4.8 Conclusion
This chapter has presented two sampling-based estimators for the answer to a
subset-based query, where the answer to a SUM aggregate query (and by trivial extension,
AVERAGE and COUNT) is restricted to consider only those tuples that satisfy a NOT EXISTS
or related clause. The first estimator is provably unbiased, while the second makes use of
superpopulation methods and was found to be much more accurate.
As discussed in Section 4.5.1 of the thesis, one of the most controversial decisions
made in the development of the latter estimator was our choice of a very general prior
distribution. To a statistician from the so-called “Bayesian” school [39], this may be
seen as a poor choice, and a Bayesian statistician may argue that a more descriptive prior
distribution, if appropriate, would increase the accuracy of the method. This is certainly
true, if the selected distribution were a good match for the actual data distribution. In
our work, however, we have consciously chosen generality and its associated drawbacks in
place of specificity. Our experimental results seem to argue that for a variety of different
5 Goodman’s estimator is one of the earliest statistical estimators for distinct value queries.
data distributions, the resulting estimator still has high accuracy. Still, this represents
an intriguing question for future work: can a different prior distribution be chosen that
is appropriate for use in real-world data sets, and which results in a more accurate
estimator?
Finally, we note that the model-based method outlined in the latter half of this
chapter was designed specifically to address the problem of estimating the answer to a
nested SQL query with a single table in the inner query and a single table in the outer
query linked by a NOT EXISTS predicate. As is, our model is not directly applicable to
arbitrarily complex nested queries. For example, nested queries may include multiple
relations in the outer as well as the inner query. One could imagine sampling all of the
input relations, and then using any result tuples that are discovered as part of the inner
or outer subqueries as input into an estimator such as the one studied in this chapter.
However, this may be dangerous, and our superpopulation model is not directly applicable.
The problem is that if there is a join in the inner (or outer) query, then the tuples
produced via joining samples from the input relations are not i.i.d. samples from the join
[47]. This means that the join itself must be modeled, which is a problem for future work.
Another problem for future work is arbitrary levels of nesting. An inner query may itself
be linked with another inner query via a NOT EXISTS or similar clause.
CHAPTER 5
SAMPLING-BASED ESTIMATION OF LOW SELECTIVITY QUERIES
5.1 Introduction
The specific problem that we consider in this chapter is sampling-based approximation
of the answer to highly selective aggregate queries – those having a relational selection
predicate that accepts only a very small percentage of the data set. Again, we consider
sampling because it is the most versatile of the approximation methods: a single sample
can be used to handle virtually any relational selection predicate or any join condition.
Samples generally do not require prior knowledge of what queries will be asked, unlike
other methods such as sketches [8]. We consider very selective queries because they are the
one class of queries that are hardest to handle approximately without workload knowledge:
if a query references only a few tuples from the data set, then it is very hard to make sure
that a synopsis structure (such as a sample) will contain the information needed to answer
the query.
The most natural method for handling highly selective queries using sampling is to
make use of stratification [25]. In order to answer an aggregate query over a relation,
one could first (offline) partition the relation’s tuples into various subsets so that similar
tuples are grouped together – the assumption being that the relational selection predicate
associated with a given query will tend to favor certain strata. Even if a given query is
very selective, at least one or two of the strata will have a relatively heavy concentration
of tuples that will contribute to the query answer. When the query is processed, those
“important” strata can be sampled first and more heavily than the others. This is
illustrated with the following example:
Example 1: The relation MOVIE(MovieYear, Sales) is partitioned into two strata
as follows:

    R1 : MovieYear ≤ 1975        R2 : MovieYear > 1975
    r1 : 〈1961, 30〉              r3 : 〈1983, 60〉
    r2 : 〈1972, 50〉              r4 : 〈1977, 40〉
                                 r5 : 〈1997, 25〉
                                 r6 : 〈1992, 100〉
                                 r7 : 〈2004, 100〉

The query Q is then issued:

SELECT SUM (Sales)
FROM MOVIE
WHERE MovieYear < 1980
Since all movies in R1 were released in or before 1975, all the records in the stratum R1
match Q. Hence, we decide to obtain a biased sample that includes as many records from
R1 as the sample size permits and we sample from R2 only if the desired sample size is not
met. For a sample size of 4, this results in an estimate whose variance (or error) is 2400.
Drawing a sample from the population as a whole results in an estimate whose variance is
2575.
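The two variance figures in Example 1 can be reproduced with the standard without-replacement variance formula for a stratified estimator of a sum, Var = Σ_i N_i(N_i − n_i) S_i²/n_i, where S_i² is the stratum variance computed with the N_i − 1 divisor. The following sketch is our own illustration of that computation, not code from the thesis:

```python
# Verify the variances quoted in Example 1. Records are (MovieYear, Sales);
# the query predicate keeps records with MovieYear < 1980.

def var_nminus1(vals):
    # variance with the (N - 1) divisor
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / (len(vals) - 1)

def stratified_variance(strata, alloc):
    # Var = sum_i N_i * (N_i - n_i) * S_i^2 / n_i  (SRS without replacement per stratum)
    return sum(len(vals) * (len(vals) - n_i) * var_nminus1(vals) / n_i
               for vals, n_i in zip(strata, alloc))

R1 = [(1961, 30), (1972, 50)]
R2 = [(1983, 60), (1977, 40), (1997, 25), (1992, 100), (2004, 100)]
f = lambda r: r[1] if r[0] < 1980 else 0

strata = [[f(r) for r in R1], [f(r) for r in R2]]
print(stratified_variance(strata, [2, 2]))                # 2400.0 (biased plan)
print(stratified_variance([strata[0] + strata[1]], [4]))  # ~2575  (plain SRS)
```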
While stratification may be very useful, it is not a new idea. It has been studied in
statistics for decades, and it has been suggested previously as a way to make approximate
aggregate query processing more accurate [18–20]. However, in the context of databases,
researchers have previously considered only half of the problem: how to divide the
database into strata. This may actually be the easy and less important half of the
problem, since even the relatively naive partitioning strategy we use in our experiments
can give excellent results. The equally fundamental problem we consider in this paper is:
how to allocate samples to strata when actually answering the query. More specifically,
given a budget of n samples, how does one choose how to “spend” those samples on the
various strata in order to achieve the greatest accuracy?
The classic allocation method from statistics is the Neyman allocation, and it is the
one advocated previously in the database literature [19]. The key difficulty with applying
the Neyman Allocation in practice is that it requires extensive knowledge of certain
statistical characteristics of each stratum with respect to the incoming query. In practice
this knowledge can only be guessed at by taking a pilot sample. As we show in this paper,
if the guess is poor, then the resulting sampling plan can be disastrous. This results in a
classic chicken-and-egg problem: we want to sample in order to avoid scanning all of the
data, but in order to sample properly, we have to collect statistics that require scanning
all of the data! The result is that the classic Neyman allocation is unusable in many
situations, as we will demonstrate experimentally in the paper.
Our Contributions
In this thesis, we develop an alternative to the classic Neyman allocation that we
call the Bayes-Neyman allocation. While this is a very general method and its utility
is not limited to the context of database management, the Bayes-Neyman allocation is
particularly relevant to database sampling because it is designed to be robust when only a
few of the data records in the data set are relevant to estimating a quantity over the data
– as is the case when a query has a restrictive relational selection predicate. The specific
contributions of our work are as follows:
• The Bayes-Neyman allocation explicitly takes into account the error that might be
incurred when developing the sampling plan to maximize the expected accuracy of the
resulting estimate.

• The Bayes-Neyman allocation makes use of novel, Bayesian techniques from statistics
[14] that allow us to take into account any prior expectation (such as the expected
efficacy of the stratification) in a principled fashion.

• We carefully evaluate our methods experimentally, and show that if one is very careful
in developing a sampling plan, even a naive partitioning of samples to strata that uses
no workload information can show dramatic accuracy for very selective queries.

• Our methods are very general. They can be used with any partitioning (such as those
proposed by Chaudhuri et al. [18–20]), or even in cases where the partitioning is not
user-defined and is imposed by the problem domain (for example, when the various
“strata” are different data sources in a distributed environment). Our methods can
also be extended to more complicated relational operations such as joins, though this
problem is beyond the scope of the paper.
5.2 Background
This section presents some preliminaries and background about stratified sampling,
and discusses the problems associated with using stratified sampling in a database setting
to estimate results of arbitrary queries.
5.2.1 Stratification
A general example of a SUM aggregate query over a single relation can be written as
follows:
SELECT SUM (f1(r))
FROM R As r
WHERE f2(r)
Note that if we define a function f() where,
    f(r) = { f1(r)   if f2(r) is true
           { 0       if f2(r) is false
the above query can be simply re-written as,
SELECT SUM (f(r))
FROM R As r
If the relational selection predicate f2(r) selects a very small fraction of records from
the relation R, then the query is said to be a low selectivity query.
Assume that relation R is partitioned into L disjoint strata such that Ri represents
the ith stratum. Then, we have R = R1 ∪ R2 ∪ · · · ∪ RL. We denote the size of the ith
stratum by Ni and thus we have |Ri| = Ni. Let R′i where |R′i| = ni be the survey sample
(without replacement) from the ith stratum. The sizes of all the strata are known from
strata construction time, while the sizes of the survey samples from each of the strata
(the ni values) can be determined by using some sampling allocation scheme subject to
the constraint Σ_i n_i = n, where n is the pre-determined total sample size from R. The
problem of determining an optimal sample allocation is the central focus of this paper.
If we execute the above query on each of the R′i, the result of the query over the
sample of stratum i can be written as,
    y_i = Σ_{r ∈ R′_i} f(r)
The unbiased stratified sampling estimator for the query result expressed in terms of
the yi values is,
    Ŷ = Σ_{i=1}^{L} (N_i / n_i) y_i        (5–1)
The true variance of the records in stratum i can be computed as,
    σ_i^2 = (1/N_i) Σ_{r ∈ R_i} f^2(r) − ( (1/N_i) Σ_{r ∈ R_i} f(r) )^2
Thus the true variance (or error) of the estimator Ŷ is given by,

    σ^2 = Σ_{i=1}^{L} [ N_i (N_i − n_i) / n_i ] σ_i^2        (5–2)
In practice, it is not feasible to know the true stratum variances for an arbitrary
query. Hence, a sample-based estimate for the variance of stratum i can be computed as,
    σ̂_i^2 = ( 1/(n_i − 1) ) Σ_{r ∈ R′_i} ( f(r) − y_i/n_i )^2        (5–3)
Then, an unbiased estimator for the variance of Ŷ can be obtained from Equation 5–2
by simply replacing all the σ_i^2 terms with their corresponding unbiased estimators
σ̂_i^2.
Central-Limit-Theorem-based confidence bounds [112] for Ŷ can then be computed
as Ŷ ± z_p σ̂, where z_p is the z-score for the desired confidence level. If desired, more
conservative confidence bounds from the literature (such as Chebyshev-based [112]) can
also be used.
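Equations 5–1 through 5–3 translate almost directly into code. The sketch below (our own illustration, with made-up sample values) returns the estimate Ŷ together with the half-width of the CLT-based bound:

```python
import math

def stratified_estimate(samples, N_sizes, z_p=1.96):
    """samples: per-stratum lists of f(r) values over the survey samples R'_i.
    N_sizes: the stratum sizes N_i.
    Returns (Y_hat, half-width of the z_p confidence bound)."""
    Y_hat = var_hat = 0.0
    for vals, N_i in zip(samples, N_sizes):
        n_i, y_i = len(vals), sum(vals)
        Y_hat += (N_i / n_i) * y_i                                  # Equation 5-1
        s2_i = sum((v - y_i / n_i) ** 2 for v in vals) / (n_i - 1)  # Equation 5-3
        var_hat += N_i * (N_i - n_i) / n_i * s2_i                   # Equation 5-2
    return Y_hat, z_p * math.sqrt(var_hat)

# Illustrative samples of f(r) values from two strata of sizes 2 and 5:
est, half = stratified_estimate([[30, 50], [40, 0]], N_sizes=[2, 5])
print(est)  # 180.0
```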
Finally, we note that aggregate queries like COUNT and AVG can also be handled by
stratified sampling estimators like the one described above by using ratios of two different
estimates. Aggregate queries with a GROUP BY clause can also be answered by using
stratification. A GROUP BY query can be considered as executing several simple queries in
parallel – one for each group. Joins can also be handled using methods similar to those
proposed by Haas and Hellerstein [54], though that is beyond the scope of the paper.
5.2.2 “Optimal” Allocation and Why It’s Not
The problem of determining the ni values for all the strata for a predetermined
sample size n is the sample allocation problem. The key constraint on the values of the
sample sizes is that their sum should equal the total sample size. Besides this constraint,
there is freedom in the choice of the ni values, and hence a natural choice is to minimize
the error of Y of Equation 5–1. Since Y is unbiased, minimizing its error is equivalent
to minimizing its variance. An optimization problem can be formulated for the choice
of ni values so that the variance σ2 is minimized – solving the problem leads to the
well-known Neyman allocation [25] from statistics. Specifically, the Neyman allocation
states that the variance of a stratified sampling estimator is minimized when the sample
size ni is proportional to the size of the stratum, Ni, and to the variance of the f() values
in the stratum, σ²i. That is:

    ni = ( n / ∑_j Nj σ²j ) × Ni σ²i        (5–4)
The problem we face in a database setting is that the stratum variance values σ²i are not
known for an arbitrary query. The stratum variance σ²i depends on: (a) the function to be
aggregated, f1(), and (b) the relational selection predicate, f2(). Since these functions can
vary from one query to another, it is not feasible to compute beforehand exact values of
the various σ²i terms for an arbitrary query. This means that the optimal ni values cannot
be computed in the absence of exact σ²i values.
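For reference, the allocation of Equation 5–4 is trivial to compute once the per-stratum variances are known; a minimal sketch (the function name is ours, for illustration):

```python
def neyman_allocation(n, strata_sizes, variances):
    """Allocate a total sample size n across strata according to
    Equation 5-4: n_i proportional to N_i * sigma_i^2."""
    weights = [N * v for N, v in zip(strata_sizes, variances)]
    total = sum(weights)
    return [n * w / total for w in weights]
```

The difficulty discussed above is precisely that the `variances` argument is unknown at query time.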
It is possible to obtain rough estimates for the strata variances by doing a pilot run of
the query on very small pilot samples from each stratum, which is the standard method.
However, as the following example shows, a major drawback of this approach is that
the variance estimates calculated from such pilot sampling can be arbitrarily erroneous,
leading to an extremely poor allocation scheme and even more severe problems.
Example 2: Imagine that we have a relation R partitioned into two strata R1 and R2
such that |R1| = 10000 and |R2| = 10000. Let Q be a query identical to the query presented
in Section 5.2.1. The number of records from R1 accepted by f2() is 10 while the number of
records from R2 accepted by f2() is 1000. Further, let f1(r) ∼ N(1000, 100) ∀r ∈ R1 and
f1(r) ∼ N(10, 100) ∀r ∈ R2, where N(µ, σ) denotes a normal distribution with mean µ
and variance σ2.
We use a pilot sample of 100 records to estimate the variance of the f() values in
each stratum. These estimates are σ̂²1 and σ̂²2. If the desired sample size is n = 1000, the
estimated variances can be used with Equation 5–4 to obtain an estimate for the optimal
sampling allocation as follows (the strata sizes cancel because |R1| = |R2|):

    n1 = 1000 σ̂²1 / (σ̂²1 + σ̂²2)        n2 = 1000 σ̂²2 / (σ̂²1 + σ̂²2)
We then ask the question: how accurate will the resulting sampling plan be? To
answer this question, we perform a simple experiment in which we repeat the above
process 1000 times. For each iteration, we record the squared error of the estimate
produced by the computed sampling plan. The average of all these squared errors gives us
an approximation of the mean-squared error (MSE) of the estimator. For each iteration,
we also compute the estimated variance of the result (using Equation 5–2) since this
variance would be used to report confidence bounds to the user. We then compute the
average estimated variance across the 1000 iterations. Finally, we use the true variances of
both strata to obtain an optimal sample allocation, and repeat the above experiment using
the optimal allocation. We summarize the results in the following table.
    True query result       20150
    Avg. observed bias      10200
    Avg. estimated MSE      0.76 million
    Avg. observed MSE       100 million
    MSE of true optimal     58.6 million
Overall, the results using the pilot sampling are disastrous. Specifically:
• The pilot-sampling-based allocation provides an average estimated error to the user
that is more than 2 orders of magnitude smaller than the true error – 0.76 million versus
100 million. Since the estimated error is typically used to compute confidence bounds, the
resulting confidence bounds will be much narrower than what they should be in reality.
Hence, the user would be provided with a dangerously optimistic picture of the error of
the estimator.

• Second, the non-optimal allocation leads to an estimate that has a heavy bias. This is
due to the fact that the allocation often directs the stratified sampling to ignore the first
stratum. For approximately 90% of the 1000 iterations, the pilot sample fails to discover
any matching records in R1. Hence, the pilot sample-based variance is naively guessed to
be zero. When this value is used with the Neyman allocation, no samples are allocated
to R1, while all 1000 samples are allocated to R2. The outcome is that the query result is
usually underestimated, because R1 actually contains records accepted by f2().

• Finally, by using a truly optimal sampling allocation to estimate the query result,
it is possible to achieve an error that is around half the error obtained by a non-optimal
allocation. The additional error incurred due to the poor allocation represents a wasted
opportunity to provide a much more accurate estimate.
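The root cause of the bias can be checked with a tiny simulation, assuming (as in the example) that only 10 of the 10000 records in R1 are accepted by f2(); the code and names below are ours, for illustration:

```python
import random

def pilot_misses_all(matching, stratum_size, pilot_size,
                     trials=5000, seed=0):
    """Estimate how often a pilot sample (drawn without replacement)
    contains no record accepted by f2(), when only `matching` of the
    stratum's `stratum_size` records match."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        picks = rng.sample(range(stratum_size), pilot_size)
        # treat records 0 .. matching-1 as the ones accepted by f2()
        if min(picks) >= matching:
            misses += 1
    return misses / trials
```

With 10 matching records out of 10000 and a pilot of size 100, the miss probability is roughly 0.999^100 ≈ 0.90, matching the 90% figure above; every such miss yields an estimated stratum variance of zero.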
5.3 Overview of Our Solution
The fundamental problem we face is that the natural estimator for σ²i serves us
extremely poorly when we are trying to figure out how to allocate samples to strata.
Human intuition tells us that it is foolish to simply assume that σ²i is zero in this case,
even though our estimate σ̂²i will be zero. This is because, as human beings, we know that
there will often be a number of records matching the given f2() in a stratum, and that we
were simply unlucky enough to miss them in our pilot sample.
To remedy these problems, we propose a novel Bayesian approach [14] called the
Bayes-Neyman allocation that can incorporate such intuition into the process in a
principled fashion. In general, Bayesian methods formally model such prior intuition
or belief as a probability distribution. Such methods then refine the distribution by
incorporating additional information – in our case information from the pilot sample – to
obtain an overall improved probability distribution.
At the highest level, the proposed Bayes-Neyman allocation works as follows:
1. First, in Bayesian fashion, we represent our belief in the possible variances over
the f() values in each stratum as a prior probability distribution. Let the vector
Σ = 〈σ²1, σ²2, · · · , σ²L〉 be one possible set of strata variances. We define a probability
distribution over all of the possible Σ values to represent this prior belief. Let XΣ be
a random variable with exactly this probability distribution. Thus, sampling from XΣ
(that is, performing a random trial over XΣ) gives us one possible value for the vector
Σ, where those variance vectors that we feel are more “correct” are more likely to be
sampled.

2. Second, we take a pilot sample from the database and use the result of the pilot
sample to update the distribution of XΣ in order to make it more accurate.

3. Third, we sample a large number of possible Σ values from the resulting XΣ in
Monte-Carlo fashion. This gives us a large number of possible alternative values for
Σ.

4. Finally, we construct a sampling plan for estimating the answer to our query whose
average error (variance) is minimized over all of the Σ values that were sampled
from XΣ. This gives us a sampling plan whose expected error over the possible set of
databases described by the distribution of XΣ is minimized. This plan is then used to
perform the actual stratified sampling.
The three key technical questions that must be addressed when adopting this
approach are:
1. First, how is the random variable XΣ defined?

2. Second, how can the distribution of XΣ be updated to take into account any
information that is gathered via the pilot sample?

3. Third, how can a set of samples from the updated XΣ be used to produce an optimal
sampling plan?
The next three sections outline our answer to these three questions.
5.4 Defining XΣ
In this section, we consider the nature of XΣ itself, and how to sample from it.
5.4.1 Overview
At the highest level, the process of producing a single sample Σ from XΣ will be
further subdivided into three steps:
1. First, we sample from a random variable Xcnt to obtain a vector 〈cnt1, cnt2, · · · , cntL〉,
where this vector tells us how many tuples from each stratum are accepted by the
relational selection predicate f2().

2. Second, we sample from a random variable XΣ′ that gives us the vector Σ′ =
〈(µ1, µ2)1, (µ1, µ2)2, ..., (µ1, µ2)L〉. The ith pair (µ1, µ2)i is the mean (that is, µ1) and
second moment (that is, µ2)¹ over all of the f1() values in stratum i for those cnti
tuples that are accepted by f2().

3. Third, once these two samples have been obtained, it is then a simple mathematical
task to use the outputs of Xcnt and XΣ′ to compute the output of XΣ.
We now consider each of these three steps in detail.
5.4.2 Defining Xcnt
Using terminology common in Bayesian statistics, each entry in Xcnt is generated by
sampling from a binomial distribution with a Beta prior distribution [33]. This means
that we view the probability pi that an arbitrary tuple from stratum i will be accepted
by the relational selection predicate f2() as being the result of a random sample from
the Beta distribution, which produces a result from 0 to 1. Since we view each tuple as a
separate and independent application of f2(), the number of tuples from stratum i that are
accepted by f2() is then binomially distributed², with the binomial distribution taking the
value pi as input, along with the stratum size Ni.
The Beta distribution is chosen as the prior distribution because it is the canonical
“conjugate prior” for the binomial distribution: after observing binomially distributed
data, the posterior distribution over pi is again a Beta, which makes the Bayesian update
rules simple. Conveniently, the Beta’s domain is also precisely the parameter space for the
binomial distribution; in this case, the range 0 to 1, which is the valid range for pi.
1 Recall that the second moment of a random variable X is the expected value of X²:
µ2 = E[X²].
2 The binomial distribution models the case where n balls are thrown at a bucket and
each ball has a probability p of falling in the bucket. A binomially distributed sample
returns the number of balls that happened to land in the bucket.
Figure 5-1. Beta distribution with parameters α = β = 0.5. [The plot shows the
U-shaped density over query selectivity (x-axis, 0 to 1) against the fraction of queries
(y-axis).]
Given this setup, the first task is to choose the set of Beta parameters that control
the distribution of each pi so as to match the reality of what a typical value of pi will be
for each stratum. The Beta distribution is a parametric distribution and requires two input
parameters, α and β. Depending on the parameters that are selected, the Beta can take
a large variety of shapes and skews. Choosing α and β for the ith stratum is equivalent
to supplying our “intuition” to the method, stating what our initial belief is regarding the
probability that an arbitrary record will be accepted by f2().
There are two possibilities for setting those initial parameters. The first possibility
is to use workload information. We could monitor all previously-observed queries over
each stratum, where we observe that for query i and stratum j the probability
samples from our generative Beta prior, we simply estimate α and β from this set using
any standard method. An estimate for the Beta parameters based upon the principle of
Maximum Likelihood Estimation can easily be derived [112].
A second method is to simply assume that the stratification we choose usually works
well. In this case, most strata will either have a very low or a very high percentage of its
records accepted by f2(). Choosing α = β = .5 results in a U-shaped distribution that
matches this intuition exactly, and is a common choice for a Beta prior. The resulting
Beta is illustrated in Figure 5-1. In practice we find that this produces excellent results.
We stress that though the initial choice of α and β for each stratum is important,
it is only important to the extent that it informs us what is going on in the case that we
have very little information available in the pilot sample (such as when the pilot is very
small). If the pilot sample contains a great deal of information, the update step described
in Section 5.5 will update α and β as needed to take into account the information present
in the pilot sample.
Producing the Vector of Counts
Given the above setup, the GetCounts algorithm can be used to produce the vector of
counts 〈cnt1, cnt2, · · · , cntL〉:
————————————————————————————
Algorithm GetCounts(α, β, N)
// Let α = 〈α1, α2, · · · , αL〉 be the parameters of the beta
//     distributions of all strata
// Let β = 〈β1, β2, · · · , βL〉 be the parameters of the beta
//     distributions of all strata
// Let N = 〈N1, N2, · · · , NL〉 be a vector of all strata sizes
// Let cnt = 〈cnt1, · · · , cntL〉 be a vector of counts for all strata
for (int i = 1; i <= L; i++) {
    pi ← Beta(αi, βi)
    cnti ← Binomial(Ni, pi)
}
return cnt
————————————————————————————
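A direct Python rendering of GetCounts, using only the standard library, might look as follows (names ours; since `random` has no built-in binomial sampler in older Python versions, the binomial draw is simulated as a sum of Bernoulli trials):

```python
import random

def get_counts(alpha, beta, strata_sizes, rng=random):
    """For each stratum i, draw a selectivity p_i from
    Beta(alpha_i, beta_i), then a matching-record count from
    Binomial(N_i, p_i)."""
    counts = []
    for a, b, N_i in zip(alpha, beta, strata_sizes):
        p = rng.betavariate(a, b)
        # binomial draw as N_i independent Bernoulli(p) trials
        cnt = sum(1 for _ in range(N_i) if rng.random() < p)
        counts.append(cnt)
    return counts
```

For example, `get_counts([0.5] * L, [0.5] * L, N)` draws one cnt vector under the U-shaped prior of Figure 5-1.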
5.4.3 Defining XΣ′
In the previous subsection, we described how to obtain counts for the number of
records that satisfy the selection predicate f2(). However, in order to obtain a sample
from XΣ, it is not enough to merely know these counts. We actually need to know the f1()
values of all the records that satisfy f2() in stratum i, since these values are needed to be
able to compute (µ1, µ2)i as is required to sample from XΣ′ .
To do this, we use the following method. For the ith stratum, let D be the vector
of all possible distinct values from the range of the function f1(). We then associate a
probability pj with the jth distinct value; pj indicates the likelihood of the jth distinct
value from the stratum (that is, D[j]) being assigned to an arbitrary tuple that has been
accepted by f2(). Then, let V denote a vector of |D| counts, where V[j] = k means
that the jth distinct value from the stratum has been assigned to k tuples that were
accepted by f2(). Thus, ∑_j V[j] = cnti. Since we assume that each application
of f1() is independent on a per-tuple basis, V can be obtained by sampling from a
multinomial distribution³ with two arguments: the probability vector consisting of all the
pj values, and the number of trials given by cnti (that is, the number of tuples accepted by
f2()). Then, the resulting vector V along with the distinct value vector D can be used to
compute the pair (µ1, µ2)i.

This technique poses two important questions that need to be answered:
• Is it always feasible to consider all the values in the range of f1()? That is, can we
always materialize D?

• How do we assign probabilities to all of the values in D? That is, how do we decide
the value of each pj?
The answer to the first question is simple: It is certainly not feasible to always
consider all possible values in the range of f1(), for obvious computational and storage
reasons. However, for the moment, we assume that it is feasible and consider the more
general case in Section 5.4.5.
3 The multinomial distribution models the case where cnt balls are thrown at d
buckets so that the probability of an arbitrary ball falling in bucket j is pj; a sample from
the multinomial assigns bj balls to bucket j such that ∑_j bj = cnt.
In answering the second question, we develop a methodology analogous to the way we
choose the pi parameter when dealing with the ith strata for Xcnt. As described above, the
number of times that each distinct f1() value is selected follows a multinomial distribution.
We know from Bayesian statistics that the standard conjugate prior for a multinomial
distribution is the Dirichlet distribution [33] – just as the Beta distribution is the standard
conjugate prior for a binomial distribution. The Dirichlet is the multi-dimensional
generalization of the Beta. A k-dimensional Dirichlet distribution makes use of the
parameter vector Θ = {Θ1, Θ2, · · · , Θk}. Just as in the case of the Beta prior used by
Xcnt, the Dirichlet prior requires an initial set of parameters that represent our initial
belief. Since we typically have no knowledge about how likely it is that a given f1() value
will be selected by f2(), the simplest initial assumption to make is that all values are
equally likely. In the case of the Dirichlet distribution, using Θi = 1 for all i is the typical
zero-knowledge prior [33]. Given Θ, it is then a simple matter to sample from XΣ′ , as we
describe formally in the next subsection. We note that although this initial parameter
choice may be inaccurate, in Bayesian fashion the parameters will be made more accurate
based upon the information present in the pilot sample. Section 5.5 provides details of
how the update is accomplished.
Producing the Vector Σ′
We now present an algorithm GetMoments to obtain the vector Σ′. We assume that
we have all the Θi values corresponding to the parameters of the Dirichlet, along with
counts of the number of records that are accepted by f2() in each stratum. These counts
are the values in the vector cnt obtained according to Algorithm GetCounts in
Section 5.4.2.
————————————————————————————
Algorithm GetMoments(Θ1, · · · , ΘL, D)
// Let Θi denote the Dirichlet parameters for stratum i
// Let D be an array of all distinct values from the range of f1()
// Let Σ′ = 〈Σ′1, · · · , Σ′L〉 be a vector of moments of all strata
for (int i = 1; i <= L; i++) {
    p ← Dirichlet(Θi)
    µ1 = µ2 = 0
    // Let V be an array of counts for each domain value
    V ← Multinomial(cnti, p)
    for (int j = 1; j <= |D|; j++) {
        µ1 += V[j] ∗ D[j]
        µ2 += V[j] ∗ (D[j])²
    }
    µ1 /= cnti
    µ2 /= cnti
    Σ′i = (µ1, µ2)
}
return Σ′
————————————————————————————
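A Python sketch of GetMoments (names ours; the Dirichlet draw is produced by normalizing independent Gamma variates, a standard construction, since the standard library has no Dirichlet sampler):

```python
import random

def get_moments(thetas, D, counts, rng=random):
    """thetas[i]: Dirichlet parameter vector for stratum i
    D:          the distinct f1() values
    counts[i]:  number of accepted tuples in stratum i (from GetCounts)
    Returns one (mu1, mu2) pair per stratum."""
    moments = []
    for theta, cnt in zip(thetas, counts):
        if cnt == 0:
            # no accepted tuples: both moments are trivially zero
            moments.append((0.0, 0.0))
            continue
        # Dirichlet sample = normalized vector of Gamma(theta_j, 1) draws
        g = [rng.gammavariate(t, 1.0) for t in theta]
        total = sum(g)
        p = [x / total for x in g]
        # multinomial trial: assign each of the cnt tuples a value from D
        values = rng.choices(D, weights=p, k=cnt)
        mu1 = sum(values) / cnt
        mu2 = sum(v * v for v in values) / cnt
        moments.append((mu1, mu2))
    return moments
```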
5.4.4 Combining The Two
Once a sample from Xcnt and from XΣ′ have been obtained, it is then a simple matter
to put them together to obtain a sample from XΣ. Recall that the variance of a random
variable X is defined as follows:

    σ²[X] = E[X²] − E²[X]

where E[·] denotes the expected value of the random variable. For the ith stratum, after
sampling from Xcnt and XΣ′ we know three things:

1. The size of the stratum, Ni.

2. The number of records accepted by f2(), which is cnti.

3. The first and second moments (µ1, µ2)i of f1() applied to those tuples that were
accepted by f2().
Thus, the variance σ²i of f() applied to all tuples in the ith stratum can be computed as:

    σ²i = (cnti/Ni) × µ2 + ((Ni − cnti)/Ni) × 0 − ( (cnti/Ni) × µ1 + ((Ni − cnti)/Ni) × 0 )²

        = (cnti/Ni) × µ2 − ( (cnti/Ni) × µ1 )²        (5–5)

The two zeros in the above derivation come from the fact that both the first moment (or
mean) and the second moment of f() over every tuple not accepted by f2() are zero.
This computation is repeated for each possible i in order to obtain the desired sample
from XΣ.
The algorithm GetSigma describes how the variances can be computed using the
above technique.
————————————————————————————
Algorithm GetSigma(cnt, Σ′, N)
// Let cnt = 〈cnt1, · · · , cntL〉 be a vector of counts of records
//     accepted by f2() for all strata
// Let Σ′ = 〈Σ′1, · · · , Σ′L〉 be a vector of moments of all strata
// Let N = 〈N1, N2, · · · , NL〉 be a vector of all strata sizes
// Let Σ be a vector of variances for all strata
for (int i = 1; i <= L; i++) {
    σ²i = (cnti ∗ Σ′i.µ2)/N[i] − (cnti ∗ Σ′i.µ1/N[i])²
    Σ[i] = σ²i
}
return Σ
————————————————————————————
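Equation 5–5 can be sanity-checked against the definition of variance directly. In the sketch below (names ours), a stratum of 4 records in which 2 match with f() values 1 and 3 (so µ1 = 2 and µ2 = 5) and 2 contribute zero has E[f²] − E²[f] = 10/4 − 1² = 1.5, which is exactly what the formula yields:

```python
def get_sigma(counts, moments, strata_sizes):
    """Equation 5-5: the variance of f() over a whole stratum, when
    every record rejected by f2() contributes zero."""
    sigmas = []
    for cnt, (mu1, mu2), N_i in zip(counts, moments, strata_sizes):
        sigmas.append(cnt / N_i * mu2 - (cnt / N_i * mu1) ** 2)
    return sigmas
```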
5.4.5 Limiting the Number of Domain Values
As mentioned in Section 5.4.3, the one remaining problem regarding how to sample
from XΣ′ is the problem of having a very large (or even unknown) range for the function
f1(). In this case, dealing with the vectors D and V may be impossible, for both storage
and computational reasons.
The simple solution to this problem is to break the range of f1() into a number of
buckets and make use of a histogram over the range, rather than using the range itself. In
this case, D is generalized to be an array of histogram buckets, where each entry in D has
summary information for a group of distinct f1() values. Each entry in D has the following
four specific pieces of information:
1. low and high, which are the lower and upper bounds for the f1() values that are
found in this particular bucket.

2. µ1, which is the mean of the f1() values that are found in this particular bucket.
That is, if A is the set of distinct values from low to high, then µ1 = (∑_{a∈A} a) / |A|.

3. µ2, which is the second moment of the f1() values that are found in this particular
bucket. That is, µ2 = (∑_{a∈A} a²) / |A|.
Given |D|, there are two possible ways to construct the histogram. In the case where
the queries that will be asked request a simple sum over one of the attributes from the
underlying relation R (that is, f1() does not encode any function other than a simple
relational projection), then it is possible to construct D offline by using any histogram
construction scheme [42, 45, 72] over the attribute that is to be queried. In the case that
multiple attributes might be queried, one histogram can be constructed for each attribute.
This is the method that we test experimentally.
Another appropriate method is to construct D on-the-fly by making use of the pilot
sample that is used to compute the sampling plan. This has the advantage that any
arbitrary f1() can be handled at run time. Again, any appropriate histogram construction
scheme can be used, but rather than constructing D offline using the entire relation R, f1()
is applied to each r ∈ ⋃_i R^pilot_i (whether or not r is accepted by f2()) and the histogram is
constructed over the resulting set of distinct values.
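As an illustration of the offline option, a simple equi-width histogram recording low, high, and the two per-bucket moments might be built as follows. This is an assumed sketch with names of our choosing, not one of the histogram construction schemes of [42, 45, 72]:

```python
def build_histogram(values, num_buckets):
    """Equi-width histogram over the distinct f1() values; each bucket
    stores low, high, and the first two moments of its distinct values."""
    distinct = sorted(set(values))
    lo, hi = distinct[0], distinct[-1]
    width = (hi - lo) / num_buckets or 1  # guard against zero width
    buckets = [[] for _ in range(num_buckets)]
    for v in distinct:
        j = min(int((v - lo) / width), num_buckets - 1)
        buckets[j].append(v)
    out = []
    for j, A in enumerate(buckets):
        if A:  # drop empty buckets
            out.append({"low": lo + j * width,
                        "high": lo + (j + 1) * width,
                        "mu1": sum(A) / len(A),
                        "mu2": sum(a * a for a in A) / len(A)})
    return out
```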
Whatever method is used to construct D, the function GetMoments from Section
5.4.3 must be modified so as to handle the modified D. The following is an appropriately
modified GetMoments - we call it GetMomentsFromHist.
————————————————————————————
Algorithm GetMomentsFromHist(Θ1, · · · , ΘL, D)
// Let Θi denote the vector of Dirichlet parameters for stratum i
// Let D be an array of histogram buckets
// Let Σ′ = 〈Σ′1, · · · , Σ′L〉 be a vector of moments of all strata
for (int i = 1; i <= L; i++) {
    p ← Dirichlet(Θi)
    µ1 = µ2 = 0
    // Let V be an array of counts for each bucket
    V ← Multinomial(cnti, p)
    for (int j = 1; j <= |D|; j++) {
        µ1 += V[j] ∗ D[j].µ1
        µ2 += V[j] ∗ D[j].µ2
    }
    µ1 /= cnti
    µ2 /= cnti
    Σ′i = (µ1, µ2)
}
return Σ′
————————————————————————————
5.5 Updating Priors Using The Pilot
In Section 5.4, we described how we assign initial values to the parameters of the two
prior distributions – the Beta and the Dirichlet distributions. In this section, we explain
how these initial values can be refined by using information from a pilot sample to obtain
corresponding posterior distributions. Updating these priors using the pilot sample in the
proposed Bayes-Neyman approach is analogous to using the pilot sample to estimate the
stratum variances in the classic Neyman allocation. The update rules described in this
section are fairly straightforward applications of the standard Bayesian update rules [14].
The Beta distribution has two parameters α and β. Let Rpilot denote the pilot sample
and let s denote the number of records that are accepted by the predicate f2(). Thus,
|Rpilot| − s will be the number of records that fail to be accepted by f2().
Then, the following update rules can be used to directly update the α and β
parameters of the Beta distribution:
α = α + s
β = β + (|Rpilot| − s)
The Dirichlet distribution is updated similarly. Recall that this distribution uses a
vector of parameters, Θ = {Θ1, Θ2, · · · , Θk}, where k is the number of dimensions.
To update the parameter vector Θ, we can use the same pilot sample that was used
to update the beta as follows. We initialize to zero all elements of an array count of size k.
These elements denote counts of the number of times that the different values from the
range of f1() appear in the pilot sample and are accepted by f2().
The following update rule can be used to update all the different parameters of the
Dirichlet distribution:
Θi = Θi + counti
Algorithm UpdatePriors describes exactly how pilot sampling is used to update the
parameters of the prior Beta and Dirichlet distributions for the ith stratum.
———————————————————————————————-
Algorithm UpdatePriors(α, β, Θ, D, Rpilot)
// Let α, β be the parameters of the beta distribution for the
//     stratum to be updated
// Let Θ = 〈Θ1, · · · , Θ|D|〉 be the parameters of the Dirichlet
//     distribution for the stratum
// Let D be an array of histogram buckets for the stratum
// Let Rpilot be a pilot sample from the stratum
// Let count be an array of counts for each histogram bucket
for (int j = 1; j <= |D|; j++)
    count[j] = 0
s = 0
for (int r = 1; r <= |Rpilot|; r++) {
    rec = Rpilot[r]
    if (f2(rec)) {
        s++
        val = f1(rec)
        pos = FindPositionInArray(D, val)
        count[pos]++
    }
}
α = α + s
β = β + (|Rpilot| − s)
for (int j = 1; j <= |D|; j++)
    Θj = Θj + count[j]
———————————————————————————————-
——————————————————————————–
Algorithm FindPositionInArray(D, val)
// Let D be an array of histogram buckets
// Let val be a scalar value
for (int j = 1; j <= |D|; j++)
    if (D[j].low ≤ val && val ≤ D[j].high)
        return j
——————————————————————————–
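The two update rules, together with the bucket search, amount to only a few lines; a Python sketch (the names and the (low, high) bucket representation are ours):

```python
def update_priors(alpha, beta, theta, D, pilot, f1, f2):
    """Update the Beta parameters with the pilot's accept/reject counts,
    and the Dirichlet parameters with the per-bucket hit counts.
    D is a list of (low, high) bucket bounds; theta has one entry per
    bucket.  Returns the updated (alpha, beta, theta)."""
    s = 0
    count = [0] * len(D)
    for rec in pilot:
        if f2(rec):
            s += 1
            val = f1(rec)
            for j, (low, high) in enumerate(D):
                if low <= val <= high:   # FindPositionInArray
                    count[j] += 1
                    break
    alpha += s
    beta += len(pilot) - s
    theta = [t + c for t, c in zip(theta, count)]
    return alpha, beta, theta
```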
5.6 Putting It All Together
In this section, we consider how the random variable XΣ can be used to produce
an alternative allocation to the classical Neyman, and give the complete algorithm for
computing our allocation.
5.6.1 Minimizing the Variance
In general, the goal of any sampling plan should be to minimize the variance σ2 of
the resulting stratified sampling estimator. The formula for σ2 in the classic allocation
problem is given as Equation 5–2 of the thesis. Our situation differs from the classic setup
only in that (in Bayesian fashion) we now use XΣ to implicitly define a distribution over
the per-stratum variance values 〈σ²1, σ²2, · · · , σ²L〉. Thus, we cannot minimize σ² directly
because under the Bayesian regime, σ² is now a random variable.
Instead, it makes sense to minimize the expected value or average of σ², which (using
Equation 5–2) can be computed as:

    E[σ²] = E[ ∑_{i=1}^{L} ( Ni(Ni − ni) / ni ) σ²i ]

Using the linearity of expectation, we have:

    E[σ²] = ∑_{i=1}^{L} ( Ni(Ni − ni) / ni ) E[σ²i]
All of the machinery from the last two sections allows us to sample possible
variance vectors from XΣ. Assume that we sample v of these vectors, where v is a suitably
large number, and the samples are denoted by Σ1, Σ2, ..., Σv. Then ∑_{j=1}^{v} (1/v) Σj.σ²i is an
unbiased estimate of E[σ²i]. Plugging this estimate into the previous equation, we have:

    E[σ²] ≈ ∑_{i=1}^{L} ( Ni(Ni − ni) / ni ) ∑_{j=1}^{v} (1/v) Σj.σ²i
We now wish to minimize this value subject to the constraint that ∑ ni = n. Notice
that the resulting optimization problem has exactly the same structure as the optimization
problem solved by the Neyman allocation, with the exception that σ²i has been replaced by
∑_{j=1}^{v} (1/v) Σj.σ²i. Thus the resulting optimal solution is nearly identical, with σ²i being
replaced as appropriate:

    ni = ( n / ∑_{k=1}^{L} ∑_{j=1}^{v} (Nk/v) Σj.σ²k ) × ∑_{j=1}^{v} (Ni/v) Σj.σ²i
5.6.2 Computing the Final Sampling Allocation
Algorithm GetBayesNeymanAllocation describes exactly how an optimal sampling
allocation can be obtained using our technique.
————————————————————————————
Algorithm GetBayesNeymanAllocation(α, β, Θ, D, Rpilot, N, n, v)
// Let α = 〈α1, α2, · · · , αL〉 be the parameters of the beta
//     distributions of all strata
// Let β = 〈β1, β2, · · · , βL〉 be the parameters of the beta
//     distributions of all strata
// Let Θ = 〈Θ1, · · · , ΘL〉 be the set of parameters of the
//     Dirichlet distributions of all strata
// Let D = 〈D1, D2, · · · , DL〉 be the arrays of histogram
//     buckets for all strata
// Let Rpilot = 〈Rpilot1, Rpilot2, · · · , RpilotL〉 be the pilot samples
//     from all strata
// Let v be the total number of iterations of re-sampling
for (int j = 1; j <= L; j++)
    UpdatePriors(αj, βj, Θj, Dj, Rpilotj)
// Let cnt = 〈cnt1, cnt2, · · · , cntL〉 be a vector of counts for
//     all strata
// Let Σ′ = 〈Σ′1, Σ′2, · · · , Σ′L〉 be a vector of moments for all strata
// Let Σ and Σtemp be vectors of variances of size L, with Σ
//     initialized to all zeros
for (int i = 1; i <= v; i++) {
    cnt = GetCounts(α, β, N)
    Σ′ = GetMomentsFromHist(Θ1, · · · , ΘL, D)
    Σtemp = GetSigma(cnt, Σ′, N)
    for (int j = 1; j <= L; j++)
        Σ[j] += Σtemp[j]
}
denom = 0
for (int j = 1; j <= L; j++) {
    Σ[j] /= v
    denom += N[j] ∗ Σ[j]
}
for (int j = 1; j <= L; j++)
    nj = (n ∗ Nj ∗ Σ[j])/denom
————————————————————————————
5.7 Experiments
5.7.1 Goals
The specific goals of our experimental evaluation are as follows:
• To compare the width of the confidence bounds produced using both the classic
Neyman allocation and the proposed Bayes-Neyman allocation in realistic scenarios, in
order to see which can produce tighter bounds.

• To test the reliability of the confidence bounds produced by the two methods. That
is, we wish to ask: if bounds are reported to the user as p% bounds, is the chance
that they contain the answer actually p%?

• Third, we wish to compare both methods against simple random sampling as a sanity
check to see if there is a significant improvement in bound width.

• Finally, we wish to compare the computation time required for the two estimators.
5.7.2 Experimental Setup
Data Sets Used. We use three different data sets in our experimental evaluation:
• The first is a synthetic data set called the GMM data set, and is produced using a
Gaussian (normal) mixture model. The GMM data set has three numerical and three
categorical attributes. Since the underlying normal variables only produce numerical
data, the three categorical attributes (having seven possible values each) are
produced by mapping the ranges of three of the dimensions to discrete values. This
data set has 5 million records.

• The second is the Person data set. This is a 13-attribute, real-life data set obtained
from the 1990 Census that contains family and income information. This data set is
publicly available [2] and has a single relation with over 9.5 million records. The data
has twelve numerical attributes and one categorical attribute with 29 categories.

• The third is the KDD data set, which is the data set from the 1999 KDD Cup event.
This data set has 42 attributes with status information regarding various network
connections for intrusion detection. This data set consists of around 5 million records
with integer, real-valued, as well as categorical attributes.
Queries Tested. For each data set, we test queries of the form:
SELECT SUM (f1(r))
FROM R AS r
WHERE f2(r)
f1() and f2() vary depending upon the data set. For the GMM data set, f1() projects
one of the three different numerical attributes (each query projects a random attribute).
For the Person data set, either the TotalIncome attribute or the WageIncome attribute are
projected by each query. For the KDD data set, either the src bytes or the dst bytes
attributes are projected.
For each of the data sets, three different classes of selection predicates encoded by f2()
are used. Each class has a different selectivity. The three selectivity classes for f2() have
selectivities of (0.01%± 0.001%), (0.1%± 0.01%), and (1.0%± 0.1%), respectively.
For the GMM data set, f2() is constructed by rolling a three-faced die to decide how
many attributes will be included in the conjunction computed by f2(). The appropriate
number of attributes are then randomly selected from among the six GMM attributes. If
a categorical attribute is chosen as one of the attributes in f2(), then the attribute will be
checked with either an equality or inequality condition over a randomly-selected domain
value. If a numerical attribute is chosen, then a range predicate is constructed. For a given
numerical attribute, assume that low and high are the known minimum and maximum
attribute values. The range is constructed using qlow = low + v1 × (high − low) and
qhigh = qlow + v2 × (high − qlow) where v1 and v2 are randomly chosen real values from
the range [0, 1]. For each selectivity class, 50 different queries are generated by repeating
the query-generation process until enough queries falling within the appropriate selectivity
range have been generated.
The f2() functions for the other two data sets are constructed similarly.
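For illustration, the range-predicate construction described above might be sketched as follows. This is a minimal sketch, not code from the dissertation; the function names are ours, and the acceptance test simply checks whether a measured selectivity falls within a target class such as (0.1% ± 0.01%).

```python
import random

def make_range_predicate(low, high):
    """Construct a random range [qlow, qhigh] over a numerical attribute
    with known minimum `low` and maximum `high`, using
    qlow = low + v1 * (high - low) and qhigh = qlow + v2 * (high - qlow),
    where v1 and v2 are drawn uniformly from [0, 1]."""
    v1, v2 = random.random(), random.random()
    qlow = low + v1 * (high - low)
    qhigh = qlow + v2 * (high - qlow)
    return qlow, qhigh

def accept_query(selectivity, target, tol):
    """Keep a generated query only if its measured selectivity falls
    within the target selectivity class (e.g., 0.001 +/- 0.0001)."""
    return abs(selectivity - target) <= tol
```

By construction, low <= qlow <= qhigh <= high always holds, so every generated range is a valid sub-interval of the attribute's domain; queries are generated repeatedly and only those passing the acceptance test are kept.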
Stratification Tested. For each of the various data sets, a simple nearest-neighbor
classification algorithm is used to perform the stratification. In order to partition a data
set into L strata, L records are first chosen randomly from the data to serve as “seeds” for
the strata, and every other record is added to the stratum whose seed is closest
to the data point. For numerical attributes, the L2 norm is used as the distance function.
For categorical attributes, we compute the distance using the support from the database
for the attribute values [36]. Since each data set has both numerical and categorical data,
the actual distance function used is the sum of the two “sub” distance functions. Note
that it would be possible to use a much more sophisticated stratification, but actually
Sample   Sel    Bandwidth                  Coverage
Size     (%)    GMM / Person / KDD         GMM / Person / KDD
50K      0.01   3.277 / 2.289 / 2.140      918 / 892 / 921
         0.1    1.776 / 0.514 / 1.520      926 / 912 / 988
         1      0.587 / 0.184 / 0.210      947 / 944 / 942
100K     0.01   2.626 / 2.108 / 1.48       922 / 941 / 937
         0.1    1.273 / 0.351 / 0.910      939 / 948 / 940
         1      0.415 / 0.128 / 0.120      948 / 952 / 946
500K     0.01   2.192 / 1.740 / 0.820      923 / 943 / 940
         0.1    0.551 / 0.132 / 0.630      946 / 947 / 942
         1      0.178 / 0.087 / 0.070      946 / 947 / 948
Table 5-1. Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage (for 1000 query runs) for a Simple Random Sampling estimator for the three data sets. Results are shown for varying sample sizes and for three different query selectivities: 0.01%, 0.1% and 1%.
performing the stratification is not the point of this thesis – our goal is to study how to
best use the stratification.
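The seed-based partitioning just described can be sketched as follows. This toy version handles only the numerical (L2) part of the distance; the support-based categorical distance of [36] would be added as a second term when categorical attributes are present.

```python
import math
import random

def stratify(records, L, seed=0):
    """Partition purely numerical records into L strata: pick L random
    records as seeds, then assign every record to the stratum whose
    seed is nearest under the L2 norm."""
    rng = random.Random(seed)
    seeds = rng.sample(records, L)
    strata = [[] for _ in range(L)]
    for r in records:
        dists = [math.dist(r, s) for s in seeds]
        # each record joins the stratum of its closest seed
        strata[dists.index(min(dists))].append(r)
    return strata
```

Every record lands in exactly one stratum, so the strata form a true partition of the data; a more careful implementation would also iterate or re-seed, but as the text notes, the quality of the stratification itself is not the focus here.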
In our experiments, we test L = 1, L = 20, and L = 200. Note that if L = 1
then there is actually no stratification performed, and so this case is equivalent to simple
random sampling without replacement and will serve as a sanity check in our experiments.
Tests Run. For the Neyman allocation and our Bayes-Neyman allocation, our test suite consists of 54 different test cases for each data set, plus nine more tests using L = 1. These test cases are obtained by assigning three different values to the following four parameters:
• Number of strata – We use L = 1, L = 20, L = 200; as described above, L = 1 is also equivalent to simple random sampling without replacement.
• Pilot sample size – This is the number of records we obtain from each stratum in order to perform the allocation. We choose values of 5, 20 and 100 records.
• Sample Size – This is the total sample size that has to be allocated. We use 50,000, 100,000 and 500,000 samples in our tests.
• Query Selectivity – As described above, we test query selectivities of 0.01%, 0.1% and 1%.
                    Average Running Time (sec.)
Data Set            Neyman    Bayes-Neyman
Gaussian Mixture    1.5       2.4
Person              2.3       3.1
KDD Cup             2.1       2.8
Table 5-2. Average running time of Neyman and Bayes-Neyman estimators over threereal-world datasets.
Each of the 50 queries for each (data set, selectivity) combination is re-run 20
times using 20 different (pilot sample, sample) combinations. Thus, for each (data set,
selectivity) combination we obtain results for 1000 query runs in all.
5.7.3 Results
Table 5-1 shows the results for the nine cases where L = 1; that is, where no
stratification is performed. We report two numbers: the bandwidth and the coverage. The
bandwidth is the ratio of the width of the 95% confidence bounds computed as the result
of using the allocation to the true query answer. The coverage is the number of times out
of the 1000 trials that the true answer is actually contained in the 95% confidence bounds
reported by the estimator. Naturally, one would expect this number to be close to 950 if
the bounds are in fact reliable. Tables 5-3 and 5-4 show the results for the 54 different test
cases where a stratification is actually performed. For each of the 54 test cases and both
of the sampling plans used (the Neyman allocation and the Bayes-Neyman allocation) we
again report the bandwidth and the coverage.
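Concretely, the two reported quantities can be computed from the per-trial confidence bounds as in the following sketch, written under the assumption that the reported bandwidth averages the interval width over the 1000 trials:

```python
def bandwidth_and_coverage(bounds, true_answer):
    """Given a list of (lo, hi) 95% confidence intervals from repeated
    query runs, return the bandwidth (mean interval width divided by
    the true query answer) and the coverage (number of trials whose
    interval contains the true answer)."""
    mean_width = sum(hi - lo for lo, hi in bounds) / len(bounds)
    bandwidth = mean_width / true_answer
    coverage = sum(1 for lo, hi in bounds if lo <= true_answer <= hi)
    return bandwidth, coverage
```

For reliable 95% bounds over 1000 trials, the coverage should come out close to 950, which is exactly the yardstick used in the discussion below.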
Finally, Table 5-2 shows the average running times for the two stratified
sampling estimators on all the three data sets. There is generally around a 50% hit in
terms of running time when using the Bayes-Neyman allocation compared to the Neyman
allocation.
5.7.4 Discussion
There are quite a large number of results presented, and discussing all of the
intricacies present in all of our findings is beyond the scope of the thesis. However,
taken as a whole, our experiments clearly show two things. First, for the type of selective
                         Bandwidth (GMM / Person / KDD)               Coverage (GMM / Person / KDD)
NS   PS   SS     Sel     Neyman               Bayes-Neyman            Neyman              Bayes-Neyman
20   5    50K    0.01    0.00 / 0.00 / 0.00   2.90 / 0.19 / 1.12      0 / 0 / 0           935 / 882 / 927
                 0.1     0.03 / 0.01 / 0.02   1.27 / 0.02 / 0.80      3 / 49 / 23         929 / 939 / 938
                 1       0.05 / 0.02 / 0.14   0.39 / 0.01 / 0.09      11 / 247 / 155      940 / 950 / 945
          100K   0.01    0.00 / 0.00 / 0.00   2.77 / 0.16 / 1.08      0 / 0 / 0           936 / 961 / 930
                 0.1     0.02 / 0.01 / 0.01   0.90 / 0.02 / 0.73      3 / 53 / 28         941 / 941 / 938
                 1       0.05 / 0.01 / 0.03   0.28 / 0.01 / 0.08      24 / 306 / 170      941 / 947 / 947
          500K   0.01    0.01 / 0.00 / 0.00   2.05 / 0.06 / 0.87      3 / 0 / 4           938 / 948 / 932
                 0.1     0.01 / 0.00 / 0.01   0.37 / 0.01 / 0.55      10 / 62 / 51        954 / 954 / 941
                 1       0.03 / 0.01 / 0.02   0.12 / 0.00 / 0.04      38 / 316 / 184      957 / 955 / 945
     20   50K    0.01    0.06 / 0.00 / 0.04   2.72 / 0.22 / 1.06      14 / 0 / 5          942 / 941 / 938
                 0.1     0.17 / 0.03 / 0.09   1.21 / 0.03 / 0.81      106 / 61 / 88       908 / 938 / 944
                 1       0.21 / 0.05 / 0.27   0.34 / 0.01 / 0.09      404 / 692 / 561     948 / 948 / 947
          100K   0.01    0.01 / 0.00 / 0.01   2.58 / 0.16 / 0.91      23 / 0 / 6          941 / 937 / 941
                 0.1     0.11 / 0.02 / 0.06   0.85 / 0.02 / 0.74      165 / 66 / 107      934 / 954 / 939
                 1       0.14 / 0.03 / 0.09   0.25 / 0.01 / 0.06      431 / 728 / 612     954 / 962 / 953
          500K   0.01    0.01 / 0.00 / 0.01   1.93 / 0.07 / 0.62      30 / 0 / 21         946 / 943 / 944
                 0.1     0.01 / 0.01 / 0.01   0.34 / 0.01 / 0.51      230 / 145 / 245     942 / 952 / 945
                 1       0.04 / 0.01 / 0.03   0.09 / 0.00 / 0.02      447 / 751 / 746     943 / 961 / 950
     100  50K    0.01    0.15 / 0.04 / 0.08   2.33 / 0.19 / 0.82      24 / 58 / 20        938 / 922 / 938
                 0.1     0.26 / 0.10 / 0.16   1.09 / 0.02 / 0.58      436 / 204 / 172     929 / 949 / 942
                 1       0.47 / 0.18 / 0.34   0.32 / 0.01 / 0.05      870 / 891 / 866     932 / 962 / 951
          100K   0.01    0.12 / 0.03 / 0.06   2.26 / 0.16 / 0.57      29 / 59 / 41        935 / 945 / 940
                 0.1     0.18 / 0.05 / 0.11   0.81 / 0.02 / 0.40      435 / 249 / 355     927 / 957 / 942
                 1       0.31 / 0.08 / 0.02   0.22 / 0.01 / 0.04      895 / 928 / 914     948 / 968 / 943
          500K   0.01    0.01 / 0.01 / 0.01   1.72 / 0.07 / 0.33      45 / 66 / 50        939 / 952 / 947
                 0.1     0.06 / 0.02 / 0.04   0.31 / 0.01 / 0.28      474 / 297 / 412     954 / 954 / 952
                 1       0.06 / 0.02 / 0.06   0.08 / 0.00 / 0.02      926 / 935 / 942     950 / 970 / 949
Table 5-3. Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage (for 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator for the three data sets. Results are shown for 20 strata (NS) with varying numbers of records in the pilot sample per stratum (PS) and sample sizes (SS), for three different query selectivities: 0.01%, 0.1% and 1%.
                         Bandwidth (GMM / Person / KDD)               Coverage (GMM / Person / KDD)
NS   PS   SS     Sel     Neyman               Bayes-Neyman            Neyman              Bayes-Neyman
200  5    50K    0.01    0.00 / 0.00 / 0.00   1.73 / 0.18 / 0.91      0 / 0 / 0           933 / 931 / 924
                 0.1     0.00 / 0.02 / 0.01   0.97 / 0.02 / 0.76      0 / 56 / 27         933 / 953 / 936
                 1       0.05 / 0.02 / 0.03   0.26 / 0.01 / 0.09      19 / 162 / 149      940 / 960 / 940
          100K   0.01    0.00 / 0.01 / 0.01   1.57 / 0.13 / 0.75      0 / 43 / 28         936 / 916 / 930
                 0.1     0.01 / 0.01 / 0.01   0.72 / 0.02 / 0.64      7 / 60 / 41         938 / 958 / 936
                 1       0.03 / 0.01 / 0.01   0.19 / 0.00 / 0.08      34 / 365 / 212      945 / 955 / 947
          500K   0.01    0.01 / 0.00 / 0.00   1.20 / 0.08 / 0.52      5 / 45 / 34         940 / 939 / 938
                 0.1     0.02 / 0.01 / 0.00   0.28 / 0.01 / 0.44      22 / 89 / 76        946 / 946 / 944
                 1       0.02 / 0.01 / 0.01   0.07 / 0.00 / 0.06      45 / 372 / 336      954 / 954 / 951
     20   50K    0.01    0.05 / 0.03 / 0.04   1.59 / 0.18 / 0.85      19 / 51 / 21        943 / 931 / 934
                 0.1     0.11 / 0.03 / 0.07   0.75 / 0.02 / 0.72      91 / 70 / 94        943 / 953 / 939
                 1       0.09 / 0.04 / 0.09   0.18 / 0.01 / 0.07      345 / 627 / 580     958 / 962 / 945
          100K   0.01    0.01 / 0.01 / 0.03   1.35 / 0.14 / 0.67      22 / 66 / 45        948 / 948 / 941
                 0.1     0.02 / 0.02 / 0.04   0.54 / 0.01 / 0.54      131 / 135 / 128     935 / 955 / 949
                 1       0.05 / 0.02 / 0.05   0.12 / 0.00 / 0.06      488 / 702 / 643     945 / 955 / 952
          500K   0.01    0.01 / 0.00 / 0.01   1.04 / 0.06 / 0.42      49 / 83 / 72        941 / 954 / 947
                 0.1     0.01 / 0.00 / 0.02   0.20 / 0.00 / 0.35      210 / 209 / 282     955 / 945 / 950
                 1       0.04 / 0.01 / 0.01   0.03 / 0.00 / 0.03      617 / 830 / 869     948 / 958 / 953
     100  50K    0.01    0.08 / 0.03 / 0.06   1.35 / 0.14 / 0.54      28 / 56 / 39        939 / 938 / 939
                 0.1     0.20 / 0.05 / 0.09   0.56 / 0.02 / 0.40      313 / 357 / 243     949 / 949 / 942
                 1       0.10 / 0.01 / 0.15   0.14 / 0.01 / 0.03      543 / 823 / 874     948 / 948 / 951
          100K   0.01    0.07 / 0.02 / 0.04   1.11 / 0.12 / 0.39      47 / 77 / 53        938 / 935 / 947
                 0.1     0.08 / 0.03 / 0.06   0.40 / 0.01 / 0.28      533 / 456 / 427     948 / 948 / 951
                 1       0.06 / 0.06 / 0.08   0.09 / 0.01 / 0.02      918 / 912 / 930     959 / 956 / 952
          500K   0.01    0.01 / 0.00 / 0.02   0.89 / 0.05 / 0.21      63 / 91 / 104       946 / 936 / 937
                 0.1     0.02 / 0.01 / 0.02   0.10 / 0.00 / 0.13      580 / 540 / 607     945 / 945 / 948
                 1       0.04 / 0.03 / 0.05   0.01 / 0.00 / 0.01      936 / 920 / 941     960 / 953 / 950
Table 5-4. Bandwidth (as a ratio of error bounds width to the true query answer) and Coverage (for 1000 query runs) for the Neyman estimator and the Bayes-Neyman estimator for the three data sets. Results are shown for 200 strata (NS) with varying numbers of records in the pilot sample per stratum (PS) and sample sizes (SS), for three different query selectivities: 0.01%, 0.1% and 1%.
queries we concentrate on in our work, the classic Neyman allocation is generally useless.
As expected, the allocation tends to ignore strata with relevant records, resulting in “95%
confidence bounds” that are generally accurate nowhere close to 95% of the time. Out of
162 different tests over the three data sets, the Neyman allocation produced confidence
bounds that had greater than 90% coverage only eleven times, even though 95% bounds
were specified. In 15 out of the 162 tests, the “95% confidence bounds” actually contained
the answer 0 out of 1000 times!
Second, the allocation produced by the proposed Bayes-Neyman tends to be
remarkably useful – that is, the bounds produced are both accurate and tight. In
only 7 of the 162 tests, the coverage of the bounds produced by the Bayes-Neyman
allocation was found to be less than 93%, and coverage was often remarkably close to 95%.
Furthermore, in the few cases where the classic Neyman bounds were actually worthwhile,
the Bayes-Neyman bounds were far superior in terms of having a tighter bandwidth.
Even if one looks only at the cases where the Neyman bounds were not ridiculous (where
“ridiculous” bounds are arbitrarily defined to be those that had a coverage of less than
20%), the Bayes-Neyman bounds were actually tighter than the Neyman bounds 35
out of 70 times. In other words, there were many cases where the Neyman allocation
produced bounds that had coverage rates of only around 20%, whereas the Bayes-Neyman
allocations produced bounds that were actually tighter, and still had coverage rates very
close to the user-specified 95%.
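To make the failure mode concrete, the classic Neyman allocation samples each stratum in proportion to N_h × S_h, where N_h is the stratum size and S_h is the stratum standard deviation estimated from the pilot sample. The following sketch (ours, not the dissertation's implementation) shows why a selective predicate breaks it: if the pilot sample from a stratum happens to contain no accepted records, the estimated S_h is zero and the stratum is starved of samples.

```python
def neyman_allocation(stratum_sizes, stratum_stddevs, n):
    """Classic Neyman allocation: allocate a total sample of size n
    across strata in proportion to N_h * S_h. A pilot-estimated
    stddev of zero (common under very selective predicates) drives a
    stratum's allocation to zero, even when that stratum actually
    holds the relevant records."""
    weights = [N * S for N, S in zip(stratum_sizes, stratum_stddevs)]
    total = sum(weights)
    return [round(n * w / total) for w in weights]
```

For example, two equally sized strata whose pilot-estimated standard deviations are 0.0 and 2.0 receive allocations of 0 and n respectively; the first stratum is ignored entirely, which is precisely the behavior that destroys coverage in Tables 5-3 and 5-4.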
There are a few other interesting findings. Not surprisingly, increasing the number
of strata generally gives tighter error bounds for fixed pilot and sample sizes because it
tends to increase the homogeneity of the records in each stratum. However, in practice
there is a cost associated with increasing the number of strata and so this cannot
be done arbitrarily. Specifically, more strata may translate to more I/Os required to
actually perform the sampling. One might typically store the records within a stratum in
randomized order on disk. Thus, to sample from a given stratum requires only a sequential
scan, but each additional stratum requires a random disk I/O. In addition, it is more
difficult and more costly to maintain a large number of strata.
We also find that by using a larger pilot sample, estimation accuracy generally
increases. This is intuitive since a larger pilot sample contains more information about
the stratum, thus helping to make a better sampling allocation plan and providing a more
accurate estimate. However, a large pilot sample incurs a greater cost to actually perform
the pilot sampling. Explicitly studying this trade-off is an interesting avenue for future
work.
Finally, we point out that even the rudimentary stratification that we tested in these
experiments is remarkably successful – if the correct sampling allocation is used. Consider
the case of a 500K record sample. For a query selectivity of 0.01%, only around 50 records
in the sample will be accepted by the selection predicate encoded in f2(). This is why the
bandwidth for the simple random sample estimator with no stratification (L = 1) is so
great: for the Person data set it is 1.74 and for the KDD data set it is 0.82. The bounds
are so wide that they are essentially useless. However, if the Bayes-Neyman allocation is
used over 200 strata and a pilot sample of size 100, the bandwidths shrink to 0.053 and
0.21, respectively. These are far tighter. In the case of the Person data set the bandwidth
shrinks by nearly two orders of magnitude. For the KDD data set the reduction is more
modest (a factor of four) due to the high dimensionality of the data, which tends to render
the stratification less effective. Still, this suggests that perhaps the real issue to consider
when stratifying in a database environment is not how to perform the stratification, but
how to use the stratification in an effective manner.
5.8 Related Work
Broadly speaking, it is possible to divide the related prior research into two categories
– those works from the statistics literature, and those from the data management
literature.
The idea of applying Bayesian and/or superpopulation (model-based) methods to
the allocation problem has a long history in statistics, and seems to have been studied
with particular intensity in the 1970s. Given the number of papers on this topic, it is
not feasible to reference all of them, though a small number are listed in the References
section of the thesis [32, 103, 104]. At a high level, the biggest difference between this
work and that prior work is the specificity of our work with respect to database queries.
Sampling from a database is very unique in that the distribution of values that are
aggregated is typically ill-suited to traditional parametric models. Due to the inclusion
of the selection predicate encoded by f2(), the distribution of the f1() values that are
aggregated tends to have a large “stovepipe” located at zero corresponding to those
records that are not accepted by f2(), with a more well-behaved distribution of values
located elsewhere corresponding to those f1() values for records that were accepted by
f2(). The Bayes-Neyman allocation scheme proposed in this thesis explicitly allows for
such a situation via its use of a two-stage model where first a certain number of records
are accepted by f2() (modeled via the random variable Xcnt) and then the f1() values for
those accepted records are produced (modeled by XΣ′). This is quite different from the
general-purpose methods described in the statistics literature, which typically attach a
well-behaved, standard distribution to the mean and/or variance of each stratum [32, 104].
Sampling for the answer to database queries has also been studied extensively
[63, 67, 96]. In particular, Chaudhuri and his co-authors have explicitly studied the
idea of stratification for approximating database queries [18–20]. However, there is a
key difference between that work and our own: these existing papers focus on how to
break the data into strata, and not on how to sample the strata in a robust fashion. In
that sense, our work is completely orthogonal to Chaudhuri et al.’s prior work and our
sampling plans could easily be used in conjunction with the workload-based stratifications
that their methods can construct.
5.9 Conclusion
In this chapter, we have considered the problem of stratification for developing robust
estimates for the answer to very selective aggregate queries. While the obvious problem
to consider when stratifying is how to break the data into subsets, the more significant
challenge may lie in developing a sampling plan at run time that actually uses the strata
in a robust fashion. We have shown that the traditional Neyman sampling allocation can
give disastrous results when it is used in conjunction with mildly to very selective queries.
We have developed a unique Bayesian method for developing robust sampling plans. Our
plans explicitly minimize the expected variance of the final estimator over the space of
possible strata variances. We have shown that even when the resulting allocation is used
with a very naive nearest-neighbor stratification, the increase in accuracy compared to
simple random sampling is considerable. Even more significant is the fact that for highly
selective queries, our sampling plans give results that are “safe” in the sense that the
associated confidence bounds have near perfect coverage.
CHAPTER 6
CONCLUSION
In this research work, we have studied and described the problem of efficient
answering of complex queries on large data warehouses. Our approach for addressing
the problem relies on approximation. We present sampling-based techniques which can
be used to compute very quickly, approximate answers along with error guarantees for
long-running queries. The first part of this study addresses the problem of efficiently
obtaining random samples of records satisfying arbitrary range selection predicates. The
second part of the study develops statistical, sampling-based estimators for the specific
class of queries that have a nested, correlated subquery. The problem addressed in this
work is actually a generalization of the important problem of estimating the number of
distinct values in a database. The third and final part of this study addresses the problem
of estimating the result to queries having highly selective predicates. Since a uniform
random sample is not likely to contain any records satisfying the selection predicate,
our approach uses stratified sampling and develops stratified sampling plans to correctly
identify high-density strata for arbitrary queries.
APPENDIX
EM ALGORITHM DERIVATION
Let Ye be the information about record e ∈ EMP that can be observed, i.e., v = f1(e)
and k′ = cnt(e, SALE′). Let Xe be the information about record e that includes Ye as well
as the relevant data that cannot be observed, i.e., k = cnt(e, SALE).
Then let:

    f(X_e = \langle Y_e, k \rangle \mid \Theta) = p_k \times \frac{e^{-(v - \mu_k)^2 / 2\sigma^2}}{\sigma\sqrt{2\pi}} \times h(k'; k)

Also, let:

    g(Y_e \mid \Theta) = \sum_{i=0}^{m} p_i \times \frac{e^{-(v - \mu_i)^2 / 2\sigma^2}}{\sigma\sqrt{2\pi}} \times h(k'; i)

We then compute the posterior probability that e belongs to class i as:

    p(i \mid \Theta, e) = \frac{f(X_e = \langle Y_e, i \rangle \mid \Theta)}{g(Y_e \mid \Theta)}
Then the logarithm of the expected probability that we would observe EMP′ and SALE′
is:
    E = \sum_{e \in EMP} E\left[ \log f(X_e = \langle Y_e, i \rangle \mid \Theta) \;\middle|\; p(i \mid \Theta', e) \right]

      = \sum_{e \in EMP} \sum_{i=0}^{m} p(i \mid \Theta', e) \times \log\left( p_i \times \frac{e^{-(f_1(e) - \mu_i)^2 / 2\sigma^2}}{\sigma\sqrt{2\pi}} \times h(k'; i) \right)

      = \sum_{e \in EMP} \sum_{i=0}^{m} p(i \mid \Theta', e) \times \left( \log(p_i) - \log(\sigma) - \log(\sqrt{2\pi}) - \frac{(f_1(e) - \mu_i)^2}{2\sigma^2} + \log(h(k'; i)) \right)
To find the unknown parameters μi, σ, and pi, we maximize E for the given set of
posterior probabilities at that step. We do this by taking partial derivatives of E with
respect to each of these parameters and setting the result to zero:
    \frac{\partial E}{\partial \mu_1} = \sum_{e \in EMP} p(1 \mid \Theta', e) \times \frac{f_1(e) - \mu_1}{\sigma^2}
Setting this expression to zero gives:
    \mu_1 = \frac{\sum_{e \in EMP} p(1 \mid \Theta', e) \times f_1(e)}{\sum_{e \in EMP} p(1 \mid \Theta', e)}
We can obtain μ2, …, μm in a similar manner.
By taking the partial derivative of E with respect to σ2 and setting it to zero, we get:

    \sigma^2 = \frac{\sum_{e \in EMP} \sum_{j=0}^{m} p(j \mid \Theta', e) (f_1(e) - \mu_j)^2}{\sum_{e \in EMP} \sum_{j=0}^{m} p(j \mid \Theta', e)}
Finally, to evaluate the pi's, we also consider the additional constraint that
\sum_{i=0}^{m} p_i = 1. We can find the values of the pi's that maximize E subject to this
constraint by using the method of Lagrange multipliers, obtaining:

    p_j = \frac{\sum_{e \in EMP} p(j \mid \Theta', e)}{\sum_{e \in EMP} \sum_{i=0}^{m} p(i \mid \Theta', e)}
This completes the derivation of the update rules given in Section 4.5.2.
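A compact implementation of the resulting update rules might look like the following sketch. This is our illustration, not code from the dissertation: h(k′, i) is passed in as a user-supplied function (its form comes from Chapter 4 and is not re-derived here), and the variable names mirror the derivation above.

```python
import math

def em_step(values, kprimes, mu, sigma2, p, h):
    """One EM iteration. values[e] holds f1(e), kprimes[e] holds
    cnt(e, SALE'); mu, sigma2, p are the current estimates Theta';
    h(kp, i) is the (assumed, user-supplied) likelihood of observing
    kp sampled sales for a record of class i. Classes are 0..m."""
    m1 = len(p)  # number of classes, m + 1
    # E-step: posterior p(i | Theta', e) = f(X_e | Theta') / g(Y_e | Theta')
    post = []
    for v, kp in zip(values, kprimes):
        f = [p[i] * math.exp(-(v - mu[i]) ** 2 / (2.0 * sigma2)) * h(kp, i)
             for i in range(m1)]
        g = sum(f)
        post.append([fi / g for fi in f])
    # M-step: update rules from the partial derivatives of E
    mu_new = [sum(pe[i] * v for pe, v in zip(post, values)) /
              sum(pe[i] for pe in post) for i in range(m1)]
    denom = sum(sum(pe) for pe in post)  # equals |EMP|: posteriors sum to 1
    sigma2_new = (sum(pe[i] * (v - mu_new[i]) ** 2
                      for pe, v in zip(post, values) for i in range(m1)) /
                  denom)
    p_new = [sum(pe[i] for pe in post) / len(values) for i in range(m1)]
    return mu_new, sigma2_new, p_new
```

Iterating em_step until the parameters stabilize yields the maximum-likelihood estimates; note that the denominator in the σ² update simplifies to |EMP| because each record's posteriors sum to one.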
REFERENCES
1. IMDB dataset. http://www.imdb.com
2. Person data set. http://usa.ipums.org/usa
3. Synoptic cloud report dataset.http://cdiac.ornl.gov/epubs/ndp/ndp026b/ndp026b.htm
4. Acharya, S., Gibbons, P.B., Poosala, V.: Congressional samples for approximateanswering of group-by queries. In: Tech. Report, Bell Laboratories, Murray Hill, NewJersey (1999)
5. Acharya, S., Gibbons, P.B., Poosala, V.: Congressional samples for approximateanswering of group-by queries. In: SIGMOD, pp. 487–498 (2000)
6. Acharya, S., Gibbons, P.B., Poosala, V., Ramaswamy, S.: Join synopses forapproximate query answering. In: SIGMOD, pp. 275–286 (1999)
7. Alon, N., Gibbons, P.B., Matias, Y., Szegedy, M.: Tracking join and self-join sizes inlimited storage. In: PODS, pp. 10–20 (1999)
8. Alon, N., Matias, Y., Szegedy, M.: The space complexity of approximating thefrequency moments. In: STOC, pp. 20–29 (1996)
9. Antoshenkov, G.: Random sampling from pseudo-ranked b+ trees. In: VLDB, pp.375–382 (1992)
10. Babcock, B., Chaudhuri, S., Das, G.: Dynamic sample selection for approximatequery processing. In: SIGMOD, pp. 539–550 (2003)
11. Bradley, P.S., Fayyad, U.M., Reina, C.: Scaling clustering algorithms to largedatabases. In: KDD, pp. 9 – 15 (1998)
12. Brown, P.G., Haas, P.J.: Techniques for warehousing of sample data. In: ICDE, p. 6(2006)
13. Bunge, J., Fitzpatrick, M.: Estimating the number of species: A review. Journal ofthe American Statistical Association 88, 364–373 (1993)
14. Carlin, B., Louis, T.: Bayes and Empirical Bayes Methods for Data Analysis.Chapman and Hall (1996)
15. Chakrabarti, K., Garofalakis, M., Rastogi, R., Shim, K.: Approximate queryprocessing using wavelets. The VLDB Journal 10(2-3), 199–223 (2001)
16. Charikar, M., Chaudhuri, S., Motwani, R., Narasayya, V.: Towards estimation errorguarantees for distinct values. In: PODS, pp. 268–279 (2000)
157
158
17. Charikar, M., Chaudhuri, S., Motwani, R., Narasayya, V.: Towards estimation errorguarantees for distinct values. In: PODS, pp. 268–279 (2000)
18. Chaudhuri, S., Das, G., Datar, M., Motwani, R., Narasayya, V.R.: Overcominglimitations of sampling for aggregation queries. In: ICDE, pp. 534–542 (2001)
19. Chaudhuri, S., Das, G., Narasayya, V.: A robust, optimization-based approach forapproximate answering of aggregate queries. In: SIGMOD, pp. 295–306 (2001)
20. Chaudhuri, S., Das, G., Narasayya, V.: Optimized stratified sampling forapproximate query processing. ACM TODS, To Appear (2007)
21. Chaudhuri, S., Das, G., Srivastava, U.: Effective use of block-level sampling instatistics estimation. In: SIGMOD, pp. 287 – 298 (2004)
22. Chaudhuri, S., Motwani, R.: On sampling and relational operators. IEEE Data Eng.Bull. 22(4), 41–46 (1999)
23. Chaudhuri, S., Motwani, R., Narasayya, V.: Random sampling for histogramconstruction: how much is enough? SIGMOD Rec. 27(2), 436–447 (1998)
24. Chaudhuri, S., Motwani, R., Narasayya, V.: On random sampling over joins. In:SIGMOD, pp. 263–274 (1999)
25. Cochran, W.: Sampling Techniques. Wiley and Sons (1977)
26. Dempster, A., Laird, N., Rubin, D.: Maximum-likelihood from incomplete data viathe EM algorithm. J. Royal Statist. Soc. Ser. B. 39 (1977)
27. Diwan, A.A., Rane, S., Seshadri, S., Sudarshan, S.: Clustering techniques forminimizing external path length. In: VLDB, pp. 342–353 (1996)
28. Dobra, A.: Histograms revisited: when are histograms the best approximationmethod for aggregates over joins? In: PODS, pp. 228–237 (2005)
29. Dobra, A., Garofalakis, M., Gehrke, J., Rastogi, R.: Processing complex aggregatequeries over data streams. In: SIGMOD Conference, pp. 61–72 (2002)
30. Domingos, P.: Bayesian averaging of classifiers and the overfitting problem. In: 17thInternational Conf. on Machine Learning (2000)
31. Efron, B., Tibshirani, R.: An Introduction to the Bootstrap. Chapman & Hall/CRC(1998)
32. Ericson, W.A.: Optimum stratified sampling using prior information. JASA 60(311),750–771 (1965)
33. Evans, M., Hastings, N., Peacock, B.: Statistical Distributions. Wiley and Sons(2000)
159
34. Fan, C., Muller, M., , Rezucha, I.: Development of sampling plans by using sequential(item by item) selection techniques and digital computers. Journal of the AmericanStatistical Association 57, 387–402 (1962)
35. Ganguly, S., Gibbons, P., Matias, Y., Silberschatz, A.: Bifocal sampling forskew-resistant join size estimation. In: SIGMOD, pp. 271–281 (1996)
36. Ganti, V., Gehrke, J., Ramakrishnan, R.: Cactus: clustering categorical data usingsummaries. In: KDD, pp. 73–83 (1999)
37. Ganti, V., Lee, M.L., Ramakrishnan, R.: ICICLES:self-tuning samples forapproximate query answering. In: VLDB, pp. 176–187 (2000)
38. Garcia-Molina, H., Widom, J., Ullman, J.D.: Database System Implementation.Prentice-Hall, Inc. (1999)
39. Gelman, A., Carlin, J., Stern, H., Rubin, D.: Bayesian Data Analysis, SecondEdition. Chapman & Hall/CRC (2003)
40. Gibbons, P.B., Matias, Y.: New sampling-based summary statistics for improvingapproximate query answers. In: SIGMOD, pp. 331–342 (1998)
41. Gibbons, P.B., Matias, Y., Poosala, V.: Aqua project white paper. In: TechnicalReport, Bell Laboratories, Murray Hill, New Jersey, pp. 275–286 (1999)
42. Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.: Optimal and approximatecomputation of summary statistics for range aggregates. In: PODS (2001)
43. Goodman, L.: On the estimation of the number of classes in a population. Annals ofMathematical Statistics 20, 272–579 (1949)
44. Gray, J., Bosworth, A., Layman, A., Pirahesh, H.: Data cube: A relationalaggregation operator generalizing group-by, cross-tab, and sub-total. In: ICDE,pp. 152–159 (1996)
45. Guha, S., Koudas, N., Srivastava, D.: Fast algorithms for hierarchical rangehistogram construction. In: PODS, pp. 180–187 (2002)
46. Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: SIGMODConference, pp. 47–57 (1984)
47. Haas, P., Hellerstein, J.: Ripple joins for online aggregation. In: SIGMODConference, pp. 287–298 (1999)
48. Haas, P., Naughton, J., Seshadri, S., Stokes, L.: Sampling-based estimation of thenumber of distinct values of an attribute. In: 21st International Conference on VeryLarge Databases, pp. 311–322 (1995)
49. Haas, P., Naughton, J., Seshadri, S., Stokes, L.: Sampling-based estimation of thenumber of distinct values of an attribute. In: VLDB, pp. 311–322 (1995)
160
50. Haas, P., Stokes, L.: Estimating the number of classes in a finite population. Journalof the American Statistical Association 93, 1475–1487 (1998)
51. Haas, P.J.: Large-sample and deterministic confidence intervals for onlineaggregation. In: Statistical and Scientific Database Management, pp. 51–63 (1997)
52. Haas, P.J.: The need for speed: Speeding up DB2 using sampling. IDUG SolutionsJournal 10, 32–34 (2003)
53. Haas, P.J., Hellerstein, J.: Join algorithms for online aggregation. IBM ResearchReport RJ 10126 (1998)
54. Haas, P.J., Hellerstein, J.M.: Ripple joins for online aggregation. In: SIGMOD, pp.287 – 298 (1999)
55. Haas, P.J., Koenig, C.: A bi-level bernoulli scheme for database sampling. In:SIGMOD, pp. 275 – 286 (2004)
56. Haas, P.J., Naughton, J.F., Seshadri, S., Swami, A.N.: Fixed-precision estimation ofjoin selectivity. In: PODS, pp. 190–201 (1993)
57. Haas, P.J., Naughton, J.F., Seshadri, S., Swami, A.N.: Selectivity and cost estimationfor joins based on random sampling. J. Comput. Syst. Sci. 52(3), 550–569 (1996)
58. Haas, P.J., Naughton, J.F., Swami, A.N.: On the relative cost of sampling for joinselectivity estimation. In: PODS, pp. 14–24 (1994)
59. Haas, P.J., Swami, A.N.: Sequential sampling procedures for query size estimation.In: SIGMOD, pp. 341–350 (1992)
60. Hellerstein, J., Avnur, R., Chou, A., Hidber, C., Olston, C., Raman, V., Roth, T.,Haas, P.: Interactive data analysis: The cONTROL project. IEEE Computer 32(8),51–59 (1999)
61. Hellerstein, J., Haas, P., Wang, H.: Online aggregation. In: SIGMOD Conference, pp.171–182 (1997)
62. Hellerstein, J.M., Avnur, R., Chou, A., Hidber, C., Olston, C., Raman, V., Roth, T.,Haas, P.J.: Interactive data analysis: The control project. In: IEEE Computer 32(8),pp. 51 – 59 (1999)
63. Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: SIGMOD, pp.171–182 (1997)
64. Hou, W.C., Ozsoyoglu, G.: Statistical estimators for aggregate relational algebraqueries. ACM Trans. Database Syst. 16(4), 600–654 (1991)
65. Hou, W.C., Ozsoyoglu, G.: Processing time-constrained aggregate queries in case-db.ACM Trans. Database Syst. 18(2), 224–261 (1993)
161
66. Hou, W.C., Ozsoyoglu, G., Dogdu, E.: Error-constrained COUNT query evaluation inrelational databases. SIGMOD Rec. 20(2), 278–287 (1991)
67. Hou, W.C., Ozsoyoglu, G., Taneja, B.K.: Statistical estimators for relational algebraexpressions. In: PODS, pp. 276–287 (1988)
68. Hou, W.C., Ozsoyoglu, G., Taneja, B.K.: Processing aggregate relational queries withhard time constraints. In: SIGMOD, pp. 68–77 (1989)
69. Huang, H., Bi, L., Song, H., Lu, Y.: A variational em algorithm for large databases.In: International Conference on Machine Learning and Cybernetics, pp. 3048–3052(2005)
70. Ioannidis, Y.E.: Universality of serial histograms. In: VLDB, pp. 256–267 (1993)
71. Ioannidis, Y.E., Poosala, V.: Histogram-based approximation of set-valuedquery-answers. In: VLDB (1999)
72. Jagadish, H.V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K.C., Suel, T.:Optimal histograms with quality guarantees. In: VLDB, pp. 275–286 (1998)
73. Jermaine, C., Dobra, A., Arumugam, S., Joshi, S., Pol, A.: A disk-based join withprobabilistic guarantees. In: SIGMOD, pp. 563–574 (2005)
74. Jermaine, C., Dobra, A., Arumugam, S., Joshi, S., Pol, A.: The sort-merge-shrinkjoin. ACM Trans. Database Syst. 31(4), 1382–1416 (2006)
75. Jermaine, C., Dobra, A., Pol, A., Joshi, S.: Online estimation for subset-based SQLqueries. In: 31st International conference on Very large data bases, pp. 745–756(2005)
76. Jermaine, C., Pol, A., Arumugam, S.: Online maintenance of very large randomsamples. In: SIGMOD, pp. 299–310. ACM Press, New York, NY, USA (2004)
77. Kempe, D., Dobra, A., Gehrke, J.: Gossip-based computation of aggregateinformation. In: FOCS, pp. 482–491 (2003)
78. Krewski, D., Platek, R., Rao, J.: Current Topics in Survey Sampling. Academic Press(1981)
79. Lakshmanan, L.V.S., Pei, J., Han, J.: Quotient cube: How to summarize thesemantics of a data cube. In: VLDB, pp. 778–789 (2002)
80. Lakshmanan, L.V.S., Pei, J., Zhao, Y.: Qc-trees: An efficient summary structure forsemantic olap. In: SIGMOD, pp. 64–75 (2003)
81. Leutenegger, S.T., Edgington, J.M., Lopez, M.A.: STR: A simple and efficientalgorithm for r-tree packing. In: ICDE, pp. 497–506 (1997)
162
82. Ling, Y., Sun, W.: A supplement to sampling-based methods for query size estimation in a database system. SIGMOD Rec. 21(4), 12–15 (1992)

83. Lipton, R., Naughton, J.: Query size estimation by adaptive sampling. In: PODS, pp. 40–46 (1990)

84. Lipton, R., Naughton, J., Schneider, D.: Practical selectivity estimation through adaptive sampling. In: SIGMOD Conference, pp. 1–11 (1990)

85. Lipton, R.J., Naughton, J.F.: Estimating the size of generalized transitive closures. In: VLDB, pp. 165–171 (1989)

86. Lipton, R.J., Naughton, J.F.: Query size estimation by adaptive sampling. J. Comput. Syst. Sci. 51(1), 18–25 (1995)

87. Luo, G., Ellmann, C.J., Haas, P.J., Naughton, J.F.: A scalable hash ripple join algorithm. In: SIGMOD, pp. 252–262 (2002)

88. Matias, Y., Vitter, J., Wang, M.: Wavelet-based histograms for selectivity estimation. In: SIGMOD Conference, pp. 448–459 (1998)

89. Matias, Y., Vitter, J.S., Wang, M.: Wavelet-based histograms for selectivity estimation. SIGMOD Record 27(2), 448–459 (1998)

90. Mingoti, S.: Bayesian estimator for the total number of distinct species when quadrat sampling is used. Journal of Applied Statistics 26(4), 469–483 (1999)

91. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, New York (1995)

92. Muralikrishna, M., DeWitt, D.: Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In: SIGMOD Conference, pp. 28–36 (1988)

93. Muth, P., O'Neil, P.E., Pick, A., Weikum, G.: Design, implementation, and performance of the LHAM log-structured history data access method. In: VLDB, pp. 452–463 (1998)

94. Naughton, J.F., Seshadri, S.: On estimating the size of projections. In: ICDT: Proceedings of the Third International Conference on Database Theory, pp. 499–513 (1990)

95. Neal, R., Hinton, G.: A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Learning in Graphical Models (1998)

96. Olken, F.: Random sampling from databases. Ph.D. dissertation (1993)

97. Olken, F.: Random sampling from databases. Tech. Rep. LBL-32883, Lawrence Berkeley National Laboratory (1993)
98. Olken, F., Rotem, D.: Simple random sampling from relational databases. In: VLDB, pp. 160–169 (1986)

99. Olken, F., Rotem, D.: Random sampling from B+ trees. In: VLDB, pp. 269–277 (1989)

100. Olken, F., Rotem, D.: Sampling from spatial databases. In: ICDE, pp. 199–208 (1993)

101. Olken, F., Rotem, D., Xu, P.: Random sampling from hash files. In: SIGMOD, pp. 375–386 (1990)

102. Piatetsky-Shapiro, G., Connell, C.: Accurate estimation of the number of tuples satisfying a condition. In: SIGMOD, pp. 256–276 (1984)

103. Rao, T.J.: On the allocation of sample size in stratified sampling. Annals of the Institute of Statistical Mathematics 20, 159–166 (1968)

104. Rao, T.J.: Optimum allocation of sample size and prior distributions: A review. International Statistical Review 45(2), 173–179 (1977)

105. Roussopoulos, N., Kotidis, Y., Roussopoulos, M.: Cubetree: Organization of and bulk incremental updates on the data cube. In: SIGMOD, pp. 89–99 (1997)

106. Rowe, N.C.: Top-down statistical estimation on a database. SIGMOD Record 13(4), 135–145 (1983)

107. Rowe, N.C.: Antisampling for estimation: An overview. IEEE Trans. Softw. Eng. 11(10), 1081–1091 (1985)

108. Rusu, F., Dobra, A.: Statistical analysis of sketch estimators. In: SIGMOD, to appear (2007)

109. Sarndal, C., Swensson, B., Wretman, J.: Model Assisted Survey Sampling. Springer, New York (1992)

110. Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selection in a relational database management system. In: SIGMOD, pp. 23–34 (1979)

111. Severance, D.G., Lohman, G.M.: Differential files: Their application to the maintenance of large databases. ACM Trans. Database Syst. 1(3), 256–267 (1976)

112. Shao, J.: Mathematical Statistics. Springer-Verlag (1999)

113. Sismanis, Y., Deligiannakis, A., Roussopoulos, N., Kotidis, Y.: Dwarf: Shrinking the PetaCube. In: SIGMOD, pp. 464–475 (2002)

114. Sismanis, Y., Roussopoulos, N.: The polynomial complexity of fully materialized coalesced cubes. In: VLDB, pp. 540–551 (2004)
115. Stonebraker, M., Abadi, D.J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O'Neil, E., O'Neil, P., Rasin, A., Tran, N., Zdonik, S.: C-Store: A column-oriented DBMS. In: VLDB, pp. 553–564 (2005)

116. Thiesson, B., Meek, C., Heckerman, D.: Accelerating EM for large databases. Mach. Learn. 45(3), 279–299 (2001)

117. Thorup, M., Zhang, Y.: Tabulation based 4-universal hashing with applications to second moment estimation. In: SODA, pp. 615–624 (2004)

118. Vitter, J.S., Wang, M.: Approximate computation of multidimensional aggregates of sparse data using wavelets. SIGMOD Rec. 28(2), 193–204 (1999)

119. Vitter, J.S., Wang, M., Iyer, B.: Data cube approximation and histograms via wavelets. In: CIKM, pp. 96–104 (1998)

120. Vysochanskii, D., Petunin, Y.: Justification of the 3-sigma rule for unimodal distributions. Theory of Probability and Mathematical Statistics 21, 25–36 (1980)

121. Yu, X., Zuzarte, C., Sevcik, K.C.: Towards estimating the number of distinct value combinations for a set of attributes. In: CIKM, pp. 656–663 (2005)
BIOGRAPHICAL SKETCH
Shantanu Joshi received his Bachelor of Engineering in Computer Science from the
University of Mumbai, India in 2000. After a brief stint of one year at Patni Computer
Systems in Mumbai, he joined the graduate school at the University of Florida in Fall
2001, where he received his Master of Science (M.S.) in 2003 from the Department of
Computer and Information Science and Engineering.
In the summer of 2006, he was a research intern at the Data Management, Exploration
and Mining Group at Microsoft Research, where he worked with Nicolas Bruno and
Surajit Chaudhuri.
Shantanu will receive a Ph.D. in Computer Science in August 2007 from the
University of Florida and will then join the Database Server Manageability group at
Oracle Corporation as a member of technical staff.