A SAMPLING ALGEBRA FOR SCALABLE APPROXIMATE QUERY PROCESSING
By
SUPRIYA NIRKHIWALE
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2018
© 2018 Supriya Nirkhiwale
To Vedaant and Kshitij
ACKNOWLEDGMENTS
Thanks to my advisor, Alin Dobra, for introducing me to an interesting set of problems. I
have learnt from him how the right abstraction makes problems simple and tractable. Thanks
to my husband Kshitij, my son Vedaant, my family, and my friends, without whom none of this
would have been possible.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 RELATED WORK
   2.1 Analytical Bounds
   2.2 Bootstrapping
   2.3 Other Areas

3 TECHNICAL PRELIMINARIES
   3.1 Generalized Uniform Sampling
   3.2 Multivariate Delta Method

4 ANALYSIS OF SAMPLING QUERY PLANS
   4.1 SOA-Equivalence
   4.2 GUS Quasi-Operators
   4.3 Interaction between GUS and Relational Operators
   4.4 Interactions between GUS Operators

5 EFFICIENT VARIANCE ESTIMATION: IMPLEMENTATION & EXPERIMENTS
   5.1 Estimating y_S Terms
   5.2 Sub-Sampling
   5.3 Experiments
      5.3.1 Experimental Setup
      5.3.2 Correctness
      5.3.3 Running Time
      5.3.4 Sub-Sample Size

6 COVARIANCE BETWEEN MULTIPLE ESTIMATES: CHALLENGES
   6.1 Base Lemma for Covariance
   6.2 Covariance Parameters
   6.3 Notion of Equivalence
   6.4 Support for Optimization
   6.5 Support for Sub-Sampling
7 COVARIANCE BETWEEN MULTIPLE ESTIMATES: SOLUTION
   7.1 SOA-COV Equivalence
   7.2 Base Lemma and Overall Sampling
   7.3 SOA-COV Algebra

8 EFFICIENT COVARIANCE ANALYSIS: IMPLEMENTATION & EXPERIMENTS
   8.1 Leveraging Foreign Keys
   8.2 Sub-Sampling
   8.3 Experiments
      8.3.1 Experimental Setup
      8.3.2 Correctness
      8.3.3 Running Time
      8.3.4 Foreign Key Optimization

9 A SAMPLING ALGEBRA FOR GENERAL MOMENT MATCHING
   9.1 k-Generalized Uniform Sampling
   9.2 kMA Equivalence
   9.3 Interaction between kMA and Relational Operators

REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

4-1 GUS parameters for known sampling methods on a single relation
4-2 Notation used
7-1 SOA-COV rules
7-2 SOA-COV composition
LIST OF FIGURES

1-1 Queries on samples of TPC-H relations
4-1 Transformation of the query plan to allow analysis
5-1 Transformation of the query plan to allow analysis
5-2 Plot of percentage of times the true value lies in the estimated confidence intervals vs desired confidence level
5-3 Plots of running time vs selection parameter with and without sub-sampling
5-4 Plot of fluctuation of confidence interval widths obtained with sub-sampling, wrt the true confidence interval
6-1 Covariance in a single query plan
6-2 Applying SOA-Equivalence for covariance computation
7-1 Setting for the base lemma
7-2 SOA-COV for Joins
7-3 SOA-COV for GUS/Selection
7-4 SOA-COV for Join/GUS interaction
7-5 SOA composition
7-6 Deriving covariance-analyzable plans
7-7 Applying SOA-COV Equivalence for covariance computation in Example 12
7-8 Applying SOA-COV Equivalence for covariance computation in Example 13
7-9 Applying SOA-COV Equivalence for covariance computation in Example 14
7-10 Applying SOA-COV Equivalence for covariance computation in Example 15
8-1 Plot of percentage of times the true value lies in the estimated confidence intervals vs desired confidence level
8-2 Plot of runtime
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
A SAMPLING ALGEBRA FOR SCALABLE APPROXIMATE QUERY PROCESSING
By
Supriya Nirkhiwale
May 2018
Chair: Alin V. Dobra
Major: Computer Engineering
As of 2005, sampling has been incorporated in all major databases. While efficient
sampling techniques are easily realizable, determining the accuracy of an estimate obtained
from the sample is still an unresolved problem. In the first part of this dissertation, we present
a theoretical framework that allows an elegant treatment of the problem. We base our work
on generalized uniform sampling (GUS), a class of sampling methods that subsumes a wide
variety of sampling techniques. We introduce a key notion of equivalence that allows GUS
sampling operators to commute with selection and join, and allows the derivation of confidence
intervals for SUM-like aggregates obtained by a very general class of queries.
The use of sampling for approximate query processing in large data warehousing
environments has another significant limitation: sampling large tables is expensive. For
applications like approximate exploration, it can be wasteful to compute a single sample to
estimate only one value. Resources are better utilized if a single sample can be used to obtain
multiple estimates from the data. So far, it has not been possible to achieve this because
multiple estimates from the same sample are correlated and the theory to compute these
correlations was missing. In the second part of this dissertation, we provide a theoretical
framework for a lightweight, add-on tool to any database for computing covariance between
two estimates that arise from a common set of base relation samples. This theory also makes
it possible to compute a covariance matrix between groups of data in a GROUPBY query or
variances of estimates like AVG that are functions of SUM-like aggregates.
We illustrate the theory through extensive examples and give indications on how to use it
to provide meaningful estimation in database systems.
CHAPTER 1
INTRODUCTION
Sampling has long been used by database practitioners to speed up query evaluation,
especially over very large data sets. For many years it was common to see SQL code of the
form “WHERE RAND() > 0.99”. Widespread use of this sort of code led to the inclusion of
the TABLESAMPLE clause in the SQL-2003 standard [1]. Since then, all major databases have
incorporated native support for sampling over relations. One such query, using the TPC-H
schema, is:
SELECT SUM(l_discount*(1.0-l_tax))
FROM lineitem TABLESAMPLE (10 PERCENT),
orders TABLESAMPLE (1000 ROWS)
WHERE l_orderkey = o_orderkey AND
l_extendedprice > 100.0;
The result of this query is obtained by taking a Bernoulli sample with p = .1 over
lineitem and joining it with a sample of size 1000 obtained without replacement (WOR),
from orders and evaluating the SUM aggregate.
In practice, there are two main reasons practitioners write such code. One is that sampling
is useful for debugging expensive queries. The query can be quickly evaluated over a sample as
a sanity check, before it is unleashed upon the full database.
The second reason is that the practitioner is interested in obtaining an idea as to what the
actual answer to the query would be, in less time than would be required to run the query over
the entire database. This might be useful as a prelude to running the query “for real”—the
user might want to see if the result is potentially interesting—or else the estimate might be
used in place of the actual answer. Often, this situation arises when the query in question
performs an aggregation, since it is fairly intuitive to most users that sampling can be used to
obtain a number that is a reasonable approximation of the actual answer.
The problem we consider here comes from the desire to use sampling as an approximation
methodology. In this case, the user is not actually interested in computing an aggregate such
as “SUM(l_discount*(1.0-l_tax))” over a sample of the database. Rather, s/he is interested
in estimating the answer to such a query over the entire database using the sample. This
presents two obvious problems:
• First, what SQL code should the practitioner write in order to compute an estimate for a particular aggregate?
• Second, how does the practitioner have any idea how accurate that estimate is?
Ideally, a database system would have built-in mechanisms that automatically provide
estimators for user supplied aggregate queries, and that automatically provide users with
accuracy guarantees. Along those lines, in this dissertation, we study how to automatically
SQL of the form:
SELECT C.I. (lo, hi, 95, SUM(l_discount*(1.0-l_tax)))
FROM lineitem TABLESAMPLE (10 PERCENT),
orders TABLESAMPLE(1000 ROWS)
WHERE l_orderkey = o_orderkey AND
l_extendedprice > 100.0;
Presented with such a query, the database engine will use the user-specified sampling to
automatically compute two values lo and hi that can be used as a [0.05, 0.95] confidence
bound on the true answer to the query. That is, the user has asked the system to compute
values lo and hi such that there is a 5% chance that the true answer is less than lo, and
there is a 95% chance that the true answer is less than hi. In the general case, the user
should be able to specify any aggregate over any number of sampled base tables using any
sampling scheme, and the system would automatically figure out how to compute an estimate
of the desired confidence level. A database practitioner need have no idea how to compute
an estimate for the answer, nor does s/he need to have any idea how to compute confidence
bounds; the user only specifies the desired level of confidence, and the system does the rest.
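To make the contract of such a C.I.(lo, hi, 95, ...) clause concrete, the sketch below shows how (lo, hi) could be derived from a point estimate and its variance under a normal approximation. This is only an illustrative sketch, not the system described here; the function name and the sample numbers are hypothetical, and the estimate and variance are assumed to have already been computed.

from statistics import NormalDist

def confidence_interval(estimate, variance, level=0.95):
    # lo is the 5th percentile and hi the 95th percentile of the
    # (approximately normal) sampling distribution of the estimate,
    # matching the [0.05, 0.95] semantics described above.
    std = variance ** 0.5
    lo = estimate + std * NormalDist().inv_cdf(1.0 - level)
    hi = estimate + std * NormalDist().inv_cdf(level)
    return lo, hi

print(confidence_interval(1.0e6, 2.5e9))  # hypothetical estimate and variance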
Existing Work on Database Sampling. While there has been a lot of research on
implementing efficient sampling algorithms [72, 77], providing confidence intervals for the
sample estimate is understood only for a few restricted cases. The simplest is when only a
single relation is sampled. A slightly more complicated case was handled by the AQUA system
developed at Bell Labs [5–7, 41]. AQUA considered correlated sampling where a fact table
in a star schema is sampled. These cases are relatively simple because when a single table is
sampled, classical sampling theory applies with a few easy modifications. Simultaneous work
on ripple joins and online aggregation [47–50, 52] extended the class of queries amenable to
analysis to include those queries where multiple tables are sampled with replacement and then
joined. See Chapter 2 for an extensive review of related work.
Unfortunately, the extension to other types of sampling is not straightforward, and to
date new formulas have been derived every time a new sampling scheme is considered (for example,
two-table without-replacement sampling [60]). Our goal is to provide a simple theory that
makes it possible to handle very general types of queries over virtually any uniform sampling
scheme: with replacement sampling, fixed-size without replacement sampling, Bernoulli
sampling, or whatever other sampling scheme is used. The ability to easily handle arbitrary
types of sampling is especially important given that the current SQL standard allows for a
somewhat mysterious SYSTEM sampling specification, whose exact implementation (and hence
its statistical properties) are left up to the database designers. Ideally, it should be easy for a
database designer to apply our theory to an arbitrary SYSTEM sampling implementation.
Generalized Uniform Sampling. One major reason that new theory and derivations
were previously required for each new type of sampling is that the usual analysis is tuple-based,
where the inclusion probability of each tuple in the output set is used as the basic building
block; computing expected values and variances requires intricate algebraic manipulations
of complicated summations. We use the notion of Generalized Uniform Sampling (GUS)
(see Definition 1) that subsumes many different sampling schemes (including all of the
aforementioned ones, as well as block-based variants thereof).
Our Contributions for SUM-like Aggregates. In Part II of this dissertation, we
develop an algebra over many common relational operators, as well as the GUS operator. This
makes it possible to take any query plan that contains one or more GUS operators and the
supported relational operators, and perform a statistical analysis of the accuracy of the result in
an algebraic fashion, working from the leaves up to the top of the plan.
No complicated algebraic manipulations over nested summations are required. This
algebra can form the basis for a lightweight tool for providing estimates and error bounds,
that should be easily integrable into any database system. The database need only feed the
tool the user-specified confidence levels, the set of tuples returned by the query, some simple
lineage information over those result tuples, and the query plan, and the tool can automatically
compute the desired error bounds.
The specific contributions we make are:

• We define the notion of Second Order Analytical equivalence (SOA equivalence), a key equivalence relationship between query plans that is strong enough to allow variance analysis but weak enough to ensure commutativity of sampling and relational operators.

• We define the GUS operator that emulates a wide class of sampling methods. This operator commutes with most relational operators under SOA-equivalence.

• We develop an algebra over GUS and relational operators that allows derivation of SOA-equivalent plans. These plans easily allow moment calculations that can be used to estimate error bounds.

• We describe how our theory can be used to add estimation capabilities to existing databases so that the required changes to the query optimizer and execution engine are minimal. Alternatively, the estimator can be implemented as an external tool.
Our work provides a straightforward analysis for the SUM aggregate. It can be easily
extended to COUNT by replacing the aggregated attribute with 1 and applying the analysis
for SUM on this attribute. Though the analysis for AVERAGE presents a slightly non-linear
case, the analyses for SUM and COUNT lay a foundation for it. The confidence intervals can
be derived using a method for approximating a probability distribution/variance such as the delta
method. The analyses for MIN, MAX and DISTINCT are extremely hard problems to solve due
to their non-linearity. For example, DISTINCT requires an estimate of all the distinct values in
the data and the number of such values. They are thus beyond the scope of this dissertation.
While selections and joins are the highlight of our work, we show that SOA-equivalence
allows analysis for other database operators like cross-product (compaction), intersection
(concatenation) and union.
Multiple Correlated Aggregates Sharing the Same Sample. Practical AQP
applications often require support for a wide variety of estimates. In the first part of our
work, we focused on computing the variance of SUM-like estimates from query plans with
multiple joins. A natural question to ask would be: how can we generalize this framework to
accommodate:
• covariance between estimates
• non-linear estimates that can be constructed from multiple SUM-like aggregates, e.g., AVG, VAR.
• the GROUPBY clause. It is often more interesting to derive estimates for various groups in the data, and compute their covariances for further analysis.
In a practical approximate exploration setting for large data, we almost never ask a single
question or compute only a single estimate per sample. The warehoused data can be petabyte-sized,
and a reasonable sample may itself be terabyte-sized. The cost of obtaining
a sample of such a size can itself be significant. In some cases the underlying data itself is
a sample. It is prudent, or rather necessary, to reuse the obtained sample for computing the
required multiple estimates. These estimates are correlated since they are derived from the
same set of base relation samples. The user typically plugs in multiple estimates to generate
approximate answers to queries of interest. These approximate answers are incomplete without
Figure 1-1. Queries on samples of TPC-H relations. [The figure shows query plans built over a Bernoulli(0.1) sample of lineitem, a without-replacement sample of 1000 tuples (SWOR(1000)) from orders, a Bernoulli(0.1) sample of customer, and the full nation table, combined through selections (σ1, σ2) and joins to produce the estimates SUMloc, COUNTloc, COUNTlocn, COUNTocn, and GROUPBYlocn.]
an error bound which quantifies the quality of the approximation. Computing covariances
between the multiple estimates is crucial for obtaining the required error bound. Ignoring these
correlations often leads to misleading results.
Example 1. Figure 1-1 shows an example using the TPC-H schema [4], where multiple estimates
are derived from a shared set of samples: a Bernoulli(0.1) sample from lineitem,
a sample of 1000 tuples without replacement from the 150,000 tuples of orders, and a
Bernoulli(0.1) sample from customer. These 3 base relation samples are used to derive the
following five Horvitz-Thompson ([24, 53]) estimators:
• SUMloc as 15000 ∗ SUM(l_discount*(1.0-l_tax))

• COUNTloc as 15000 ∗ COUNT(*) from the join of the samples from lineitem, orders and customer

• COUNTocn as 1500 ∗ COUNT(orderkey) from the join of the samples from orders, customer and nation

• COUNTlocn as 15000 ∗ COUNT(*) and GROUPBYlocn as 15000 ∗ SUM(o_totalprice) GROUPBY n_name from the join of the samples from lineitem, orders and customer and the nation table.
Note that the sample aggregates used in the above estimators have been scaled by appropriate
factors to obtain unbiased estimates of the corresponding population aggregates.
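The scale factors above are simply the inverse of the joint inclusion probability of a result tuple. A minimal sketch of this arithmetic (the helper name is hypothetical; relations that are not sampled, like nation, contribute a factor of 1):

def ht_scale(*inclusion_probs):
    # Horvitz-Thompson scale-up: inverse of the probability that a
    # result tuple survives all of the base relation samplers.
    a = 1.0
    for p in inclusion_probs:
        a *= p
    return 1.0 / a

p_l = 0.1             # Bernoulli(0.1) on lineitem
p_o = 1000 / 150000   # 1000 tuples WOR out of 150,000 in orders
p_c = 0.1             # Bernoulli(0.1) on customer

print(ht_scale(p_l, p_o, p_c))  # 15000.0 -> SUMloc, COUNTloc, COUNTlocn
print(ht_scale(p_o, p_c))       # 1500.0  -> COUNTocn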
It is common practice for users to submit work units consisting of requests for multiple
estimates, where each estimate uses a subset of base relation samples. These estimates are
typically used in further analysis. For example, the user may want to approximate the average
price per lineitem by a function
$$f_1(\mathrm{SUM}_{loc}, \mathrm{COUNT}_{loc}) = \frac{\mathrm{SUM}_{loc}}{\mathrm{COUNT}_{loc}}.$$
Approximations for other desired quantities can often involve much more complex functions.
For example, the average price of an order in a given nation can be approximated by
$$f_2(\mathrm{SUM}_{loc}, \mathrm{COUNT}_{loc}, \mathrm{COUNT}_{locn}, \mathrm{COUNT}_{ocn}) = \frac{\mathrm{SUM}_{loc}}{\mathrm{COUNT}_{loc}} \cdot \frac{\mathrm{COUNT}_{locn}}{\mathrm{COUNT}_{ocn}}.$$
The variance of the approximation f1 is given by
$$V(f_1(\mathrm{SUM}_{loc}, \mathrm{COUNT}_{loc})) \approx \frac{\mathrm{SUM}_{loc}^2}{\mathrm{COUNT}_{loc}^2}\left(\frac{V(\mathrm{SUM}_{loc})}{\mathrm{SUM}_{loc}^2} + \frac{V(\mathrm{COUNT}_{loc})}{\mathrm{COUNT}_{loc}^2} - \frac{2\,\mathrm{Cov}(\mathrm{SUM}_{loc}, \mathrm{COUNT}_{loc})}{\mathrm{SUM}_{loc}\,\mathrm{COUNT}_{loc}}\right), \qquad (1\text{–}1)$$
and the variance of the approximation f2 (refer to Eq. (3–3)) is a function of 6 covariance terms,
one for every pair of estimates from SUMloc, COUNTloc, COUNTlocn, COUNTocn, and 4 variance
terms.
The only way to avoid the covariance terms (by making them equal to zero) would
be to compute independent samples of lineitem, orders and customer per estimate
(12 independent base relation samples). This is clearly inefficient and wasteful. Another
example with multiple estimates is GROUPBYlocn, a set of 25 estimates (one for each nation).
By design, these estimates come from the same set of base relation samples and have
non-trivial covariances. The quality of an approximation can only be meaningfully and correctly
understood by knowing the pairwise covariances of various estimates used in the approximation.
As demonstrated by the above example, in modern AQP settings with multiple correlated
estimates, it is therefore crucial to compute pairwise covariance estimates, whether the
user needs to construct sophisticated functions of multiple estimates or to get accurate
simultaneous error bounds for the individual estimates (once covariances are computed, a
recipe for computing simultaneous confidence intervals is available in [95]). Previous work on
covariance computation focuses on specific individual settings, where this computation can be
a tedious, arduous process. There is a need for a clean and general algebraic approach (similar
to the approach that we develop for variance) to make covariance computation tenable in
practical settings.
Our Contributions for Covariance Computation. We address the covariance
computation problem in Part III of the dissertation. We provide a list of our specific
contributions below.
• We define the notion of Second Order Analytical Covariance equivalence (SOA-COV equivalence) between two pairs of query plans, a broad and non-trivial generalization of the notion of SOA-equivalence developed in the first part of the dissertation. This equivalence is strong enough to allow covariance analysis, and subsumes the framework in [76], as variance can be thought of as a self-covariance.

• We develop an algebra over GUS and relational operators that allows derivation of SOA-COV equivalent plans. These plans easily allow moment calculations to estimate covariance.
These analytical results form the basis for the development of a practical lightweight add-on
tool that
• computes the pairwise covariances between multiple estimates from a common sample.

• empowers the practitioner with the ability to compute error bounds for functions of SUM-like aggregates and covariance matrices for GROUPBY queries, and to get simultaneous confidence intervals for multiple estimates.

• is based on theory which is independent of the number of joins involved, platform or schema, and uses only a single sample.
Structure of the dissertation. The rest of the document is organized as follows. In
Chapter 2, we provide a detailed overview of related work in approximate query processing.
In Chapter 3, we review concepts related to Generalized Uniform Sampling (GUS) methods
in detail. In Chapter 4, we introduce the notion of SOA-equivalence between query plans and
prove that GUS operators commute with a variety of relational operators in the SOA sense.
We also investigate interactions between GUS operators when applied to the same data. In
Chapter 5, we provide insights on how our theory can be used to implement a separate add-on
tool and how the performance of the variance estimation can be enhanced. Furthermore, we
test our implementation thoroughly, and provide accuracy and runtime analysis. In Chapter 6,
we propose to extend this theory to accommodate multiple estimates, e.g., as in GROUPBY
queries. We explore the general difficulties associated with this problem and outline the major
technical challenges. In Chapter 7, we provide a solution to this problem by introducing the
notion of SOA-COV equivalence between pairs of query plans. We develop an algebra which
allows us to transform a given pair of query plans to an analyzable pair of query plans, thereby
giving us the ability to compute the covariance between any pair of aggregates resulting from
these plans. In Chapter 8, we provide a thorough experimental testing of the SOA-COV based
theory and discuss issues relevant to implementation. In Chapter 9, we consider the problem
of estimating higher order moments of aggregate estimators, and develop the notions of
k-Generalized Uniform Sampling methods and kMA equivalence to provide a solution.
CHAPTER 2
RELATED WORK
The idea of using sampling in databases for deriving estimates for a single relation was
first studied by Shapiro et al. [79]. Since then, much research has focused on implementing
efficient sampling algorithms in databases [72, 77]. Providing confidence intervals on estimates
for SQL aggregate queries is a difficult problem with limited progress so far. The previous
literature can be roughly classified into the following areas.
2.1 Analytical Bounds
The problem of providing closed form analytical bounds for approximate database queries
has a roughly three decade long history. There has been a large body of research on using
sampling to provide quick answers to database queries, on database systems [8, 19, 52, 59, 61,
78], and data stream systems [13, 74]. Olken [77] studied the problem for specific sampling
methods for a single relation. This line of work ended abruptly when Chaudhuri et al. [20, 21]
proved that extracting IID samples from a join of two relations is infeasible.
Another line of research was the extension to the correlated sampling pioneered by the
AQUA system [6, 7, 41]. AQUA is applicable to a star schema, where the goal is sampling
from the fact table, and including all tuples in dimension tables that match selected fact table
tuples. The AQUA type of sampling has been incorporated in DB2 [43].
The reason confidence intervals can be provided for AQUA type sampling is the fact
that independent identically distributed (IID) samples are obtained from the set over which
the aggregate is computed. A straightforward use of the central limit theorem readily allows
computation of good estimates and confidence intervals. Indeed, it is widely believed [6, 7, 20,
21, 41, 77] that IID samples at the top of the query plan are required to provide any confidence
interval. This idea leads to the search for a sampling operator that commutes with database
operators. This endeavor proved to be very difficult from the beginning [20] when joins are
involved. To see why this is the case, consider a tuple t ∈ orders and two tuples u1, u2 in
lineitem that join with t (i.e. they have the same value for orderkey). Random selection
of tuples t, u1, u2 in the sample does not guarantee random selection of result tuples (t, u1)
and (t, u2). If t is not selected, neither result tuple can exist, and thus the sampling is correlated. A lot
of effort [20, 21] has been spent in finding practical ways to de-correlate the result tuples with
only limited success.
Substantial research has been devoted to deriving samples from input relations in advance
and using them to approximate answers to ad-hoc queries [8]. These methods may provide
significant benefit when queries, predicates or query columns are predictable/known in
advance. However, they offer limited support for joins. Any join has to be with small
dimension tables on foreign keys. Multiple joins over large tables are not supported.
Progress has been made using a different line of thought by Hellerstein and Haas [52] and
the generalization in [51] for the special case of sampling with replacement. The problem of
producing IID result samples is avoided by developing central limit theorem-like results for the
combination of relation level sampling with replacement. The theory was generalized first to
sampling without replacement for single join queries [60], then further generalized to arbitrary
uniform sampling over base relations and arbitrary SELECT-FROM-WHERE queries without
duplicate elimination in DBO [59], and finally to allow sampling across multiple relations
in Turbo-DBO [30]. Even though some simplification occurred through these theoretical
developments, they are mathematically heavy and hard to understand/interpret. Moreover,
the theory, especially DBO and Turbo-DBO, is tightly coupled with the systems developed to
exploit it.
Technically, one major problem in all the mathematics used to analyze sampling schemes
is the fact that the analyses use functions and summations over tuple domains, and not the
operators and algebras that the database community is used to. This makes the theory hard to
comprehend and apply. The fact that no database system picked up these ideas to provide a
confidence interval facility is a direct testament to these difficulties.
While recent progress has been made on generalized variance computation, the much
more daunting issue of computing pairwise covariances between multiple estimates has been
barely explored. The need for covariance computation arises in a wide variety of situations.
In the context of sampling from databases, covariance computation is crucial for obtaining
efficient simultaneous confidence intervals for multiple GROUPBY estimates. Kandula et al.
efficient simultaneous confidence intervals for multiple GROUPBY estimates. Kandula et al.
[62] extend the notion of SOA-equivalence between plans introduced in [76] to the notion of
Sampling Dominance between plans. They argue that the variance of a transformed plan need
not be exactly equal to the variance of the original plan, because obtaining an upper bound
for the variance (and hence the error) of the sampling based estimator may be good enough
for obtaining an error bound. Relaxing the notion of SOA-equivalence allows them to consider
query plans with non-GUS samplers. They construct non-GUS extensions of the Bernoulli
sampler called the Distinct sampler and the Universe sampler, and develop a framework for
transforming any plan with these samplers to a plan with sampling only on top (just before
aggregation) such that the error for the transformed plan is greater than or equal to that of
the original plan. As we demonstrate in Example 1, to compute the variance of a function of
multiple aggregate estimators (like AVG), or to compute a corresponding joint confidence region
for these estimators, all the pairwise covariances need to be computed (see (3–3)). This issue
is acknowledged in [62, Appendix C] in the context of AVG, but a framework to estimate the
covariance is not developed. In [95], the authors derive closed form estimators for the pairwise
covariances between aggregates from a GROUPBY query, for the specific case of sampling
without replacement. These covariances are then used to construct simultaneous confidence
intervals. Pansare et al. [78] develop a very sophisticated Bayesian framework to infer the
confidence bounds of approximate aggregate answers. However, this approach is limited to
simple group-by aggregate queries and does not provide a systematic way of quantifying
approximation quality.
2.2 Bootstrapping
Bootstrap [32, 34] is a popular resampling based statistical technique for obtaining
confidence/error bounds for a wide variety of estimates. This method is particularly useful in
settings where closed form variance estimates are not available (MIN, MAX, nested queries,
etc). The basic idea behind bootstrap is simple: obtain a large number of resamples from the
existing samples, compute the required estimate for each of the resamples, and use them to
approximate the sampling distribution of the original estimate, providing error bounds as a
byproduct. Though conceptually simple and powerful, this method requires repeated estimator
computation on resamples having size comparable to the original dataset. This poses obvious
challenges of computational efficiency and provides the basis for multiple lines of work.
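As a generic illustration of the resampling loop just described (our own sketch, not code from any of the systems cited here), a percentile bootstrap for a simple estimator looks as follows; note that the inner comprehension rebuilds a full-size resample on every trial, which is exactly the cost the following lines of work try to reduce.

import random

def bootstrap_ci(sample, estimator, level=0.95, trials=1000):
    # Percentile bootstrap: resample with replacement, re-estimate,
    # and read the interval off the empirical distribution.
    stats = sorted(
        estimator([random.choice(sample) for _ in range(len(sample))])
        for _ in range(trials)
    )
    lo = stats[int((1.0 - level) / 2 * trials)]
    hi = stats[int((1.0 + level) / 2 * trials) - 1]
    return lo, hi

data = [random.gauss(100.0, 15.0) for _ in range(500)]
print(bootstrap_ci(data, lambda s: sum(s) / len(s)))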
Research in statistical methodology has focused on reducing the number of Monte Carlo
resamples required [33, 34] or reducing the size of resamples [14–16, 81].
The techniques for reducing the number of resamples introduce additional implementation
complexity and still need repeated estimator computations on resamples having size
comparable to that of the original dataset. There is some computational interest in lowering
the size of the resamples with bootstrap variants, such as m-out-of-n sampling, where the size
of the resamples is smaller than the original sample. These techniques are sensitive to the
choice of parameters (size of resamples) [69], and the analytical correction requires prior
knowledge of the convergence rate of the estimator, making them infeasible to automate.
The requirement of prior theoretical knowledge can be avoided by averaging the distributions
of the smaller resamples [69], but the automatic selection of resampling parameters in a
computationally efficient manner remains a challenge.
In the last decade, various approaches for using Bootstrap in AQP have been developed
in the literature. These approaches focus mainly on reducing the computational overhead
associated with repeated resampling. Pol et al. [80] developed a resampling tree data structure
for every base relation that holds indicator random variables for inclusion of a tuple in the
sample. Laptev et al. [70] target MapReduce platforms and study how to overlap computation
across different bootstrap trials or bootstrap samples. The Analytic Bootstrap [96] provides a
probabilistic relational model for symbolically executing bootstrap and develops new relational
operators that combine random variables. Since bootstrap is a simulation based technique,
recent work [9, 68] demonstrates the need for diagnostic methods to identify when bootstrap
based techniques are unreliable.
While a vast amount of bootstrap literature focuses on the computational issues, there are
a few issues that arise with its application.
Joins. The asymptotic theory for bootstrap is valid only under the assumption that the
elements of the sample are independent and identically distributed (IID). Pol and Jermaine [80]
use a resampling tree per base relation to simulate multiple instances of sampling with replacement
and estimate the accuracy of aggregates over joins. Even if the base relation samples are
IID, it is well known that joining two IID samples does not lead to an IID sample from the
relevant cross product space [21]. It is not clear if the asymptotic results for IID bootstrap
are applicable in the presence of joins, and there are no known theoretical guarantees for
consistency. While empirical results suggest that these bootstrap based methods work
efficiently, it is important to point out that the theoretical guarantees of bootstrap are lost.
Sampling generality. The IID assumption also restricts traditional bootstrap applications
to sampling with replacement. This assumption does not hold for GUS methods, which
subsume a wider class of generic sampling methods (e.g., Bernoulli). Some theoretical results
about the validity of the bootstrap when elements of the sample are only independent (but not
necessarily identically distributed) are available, but they hold under specific regularity assumptions
[73]. The assumption of independence itself does not hold for GUS methods such as sampling
without replacement. Another direct consequence is that bootstrap samples have to be computed a priori,
whereas GUS methods work for both a priori samples and samples computed inline.
To summarize, the strength of the Bootstrap lies in its wide applicability, but in the
presence of sampling induced correlations, it is theoretically and computationally preferable to
use (if available) a closed form non-simulation based estimator with rigorous guarantees for
accuracy. This is the approach that we pursue in this dissertation.
2.3 Other Areas
Probabilistic databases. Much existing work in this area [11, 27, 83, 87, 91] uses
possible world semantics to model uncertain data and its query evaluation. Tuples in a
probabilistic database have binary uncertainty, i.e., they either exist or not with a certain
probability. Specifically, [27, 83] use semirings for modeling and querying probabilistic
databases, focusing on conjunctive queries with HAVING clauses. Many probabilistic databases
assume IID tuples [11, 27, 83, 87] or propose new query evaluation methods to handle
particular correlations [88, 89].
Sketches. Sketching is another common technique that is used to provide approximate
answers for aggregate queries over data streams. Sketching methods use randomized
algorithms that combine random seeds with data to produce random variables whose
distribution depends on the true aggregate value. One class of sketching techniques focuses on
accurately estimating the individual frequencies in a data stream [10, 25, 75]. These frequency
estimates can be used to compute join sizes, quantiles, heavy hitters, etc. Another class of
sketching techniques focuses on COUNT DISTINCT queries [35, 36, 94]. Dobra et al. [29]
develop sketching methods based on partitioning for join size estimation with multiple join
conditions. Dobra and Rusu [84, 86] provide a rigorous statistical analysis of various sketch
algorithms for join size estimation and perform extensive empirical evaluations. In [85], the
authors study a method that combines sampling and sketching and investigate the dependence
of the variance of the resulting estimators on these two components.
Wavelets and Histograms. A tool which has a rich history in signal processing and
statistics, but has recently generated a lot of interest in AQP applications, is wavelets.
Wavelets provide an effective way of representing relational data in terms of appropriate
wavelet coefficients using linear transformations. In big data settings, one can obtain a
compressed/approximate representation of the data by keeping only a certain number of
wavelet coefficients, and setting the rest of them to zero. Developing meaningful methods
to choose the best wavelet coefficients to keep in the approximate representation has been
the main focus of current research in this area. See, for example, [18, 26, 37–40, 42, 44,
45, 64, 65, 74, 90]. Histograms also provide a well-established and well-studied approach
for summarizing data, in both databases and statistics. In the last three decades, several
methods for using histograms for AQP have been developed in the database community. See
[7, 17, 31, 46, 54, 56–58, 63, 67, 79, 82, 92, 93] to name just a few. For a rigorous analysis of
the theoretical properties of histograms, see [28, 55, 66].
See [24] for a detailed review and extensive references for sketches, histograms,
wavelets, and sampling based methods. As discussed in [24, Chapter 6], each of these methods
is useful and has comparative advantages and disadvantages. In particular, sampling provides
a flexible approach that works for general-purpose queries and adapts much more easily to
high-dimensional data as compared to the other methods, thereby occupying a unique and
invaluable place in the toolbox for modern AQP.
CHAPTER 3
TECHNICAL PRELIMINARIES
The aim of this chapter is to review Generalized Uniform Sampling (GUS) methods, which
are a key ingredient in our theory. We start with a quick overview of other sampling methods
for databases. Then, we define GUS methods, and provide details on how to get estimates and
confidence intervals using these methods. We also review the multivariate delta method, which
allows us to estimate the variance of a function of several SUM-like aggregates in terms of the
individual variances and pairwise covariances.
3.1 Generalized Uniform Sampling
Definition 1 (GUS Sampling [30]). A randomized selection process $\mathcal{G}_{(a,\mathbf{b})}$ which gives a sample $\mathcal{R}$ from $R = R_1 \times R_2 \times \cdots \times R_n$ is called a Generalized Uniform Sampling (GUS) method if, for any given tuples $t = (t_1, \ldots, t_n)$, $t' = (t'_1, \ldots, t'_n) \in R$, $P(t \in \mathcal{R})$ is independent of $t$, and $P(t, t' \in \mathcal{R})$ depends only on $\{i : t_i = t'_i\}$. In such a case, the GUS parameters $a$, $\mathbf{b} = \{b_T \mid T \subseteq \{1:n\}\}$ are defined as:

$$a = P[t \in \mathcal{R}]$$
$$b_T = P[t \in \mathcal{R} \land t' \in \mathcal{R} \mid \forall i \in T,\ t_i = t'_i;\ \forall j \in T^C,\ t_j \neq t'_j].$$
This definition requires GUS sampling to behave like a randomized filter. In particular,
any GUS operator can be viewed as a selection process from the underlying data, a process
that can introduce correlations. The uniformity of GUS requires that the randomized filtering
is performed on lineage of tuples and not on the content. As simple as the idea is, expressing
any sampling process in the form of GUS is a non-trivial task. Example 2 shows the calculation
of GUS parameters for a simple case.
Example 2. In this example, we show how the GUS definition above can be used to
characterize the estimation necessary for the query from Chapter 1. We denote by $l_s$ the
Bernoulli sample with p = 0.1 from lineitem and by $o_s$ the WOR sample of size 1000 from
orders. We assume that the cardinality of orders is 150000. Henceforth, for ease of exposition,
we will denote all base relations involved by their first letters. For example, lineitem will be
denoted by l.
Applying the definition above and the independence between the sampling processes, we
can derive the parameters for this GUS as follows. For any tuple $t \in$ lineitem and tuple
$u \in$ orders:

$$a = P[(t \in l_s) \land (u \in o_s)] = 0.1 \times \frac{1000}{150000} = 6.667 \times 10^{-4},$$

since the base relations are sampled independently from each other. For any tuples $t, t' \in$ lineitem and $u, u' \in$ orders:

$$b_\emptyset = P[(t, t' \in l_s) \land (u, u' \in o_s)] = 0.1 \times 0.1 \times \frac{1000}{150000} \times \frac{999}{149999} = 4.44 \times 10^{-7},$$

and

$$b_o = P[t \in l_s] \times P[t' \in l_s \mid t \in l_s] \times P[u \in o_s] = 0.1 \times 0.1 \times \frac{1000}{150000} = 6.667 \times 10^{-5}.$$

Similarly,

$$b_l = P[(t \in l_s) \land (u, u' \in o_s)] = P[t \in l_s] \times P[u \in o_s] \times P[u' \in o_s \mid u \in o_s] = 0.1 \times \frac{1000}{150000} \times \frac{999}{149999} = 4.44 \times 10^{-6}.$$

The last term is

$$b_{l,o} = P[(t \in l_s) \land (u \in o_s)] = 0.1 \times \frac{1000}{150000} = 6.667 \times 10^{-4}.$$
Notice that the GUS captures the entire estimation process, not only the two individual
sampling methods. The above analysis dealt with a simple join consisting of two base relations.
For more complex query plans, the derivation of GUS parameters would involve consideration
of all possible interactions between participating tuples. This will make the analysis highly
complex.
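Since the two samplers act independently, every parameter in Example 2 is a product of per-relation first-order and second-order inclusion probabilities. The sketch below reproduces the numbers above; the helper names are ours for illustration, not part of the dissertation's tool.

def bernoulli(p):
    # first- and second-order inclusion probabilities for Bernoulli(p)
    return {"one": p, "two_same": p, "two_distinct": p * p}

def wor(n, N):
    # the same probabilities for sampling n of N tuples without replacement
    return {"one": n / N, "two_same": n / N,
            "two_distinct": (n / N) * ((n - 1) / (N - 1))}

l, o = bernoulli(0.1), wor(1000, 150000)

a    = l["one"] * o["one"]                    # 6.667e-4
b_e  = l["two_distinct"] * o["two_distinct"]  # 4.44e-7  (b_empty: t != t', u != u')
b_o  = l["two_distinct"] * o["two_same"]      # 6.667e-5 (u = u')
b_l  = l["two_same"] * o["two_distinct"]      # 4.44e-6  (t = t')
b_lo = l["two_same"] * o["two_same"]          # 6.667e-4 (t = t', u = u')
print(a, b_e, b_o, b_l, b_lo)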
The analysis of any GUS sampling method for a SUM-like aggregate is given as follows.
Theorem 1 ([30]). Let $f(t)$ be a function/property of $t \in R$, and $\mathcal{R}$ be the sample obtained
by a GUS method $\mathcal{G}_{(a,\mathbf{b})}$. Then, the aggregate $A = \sum_{t \in R} f(t)$ and the sampling estimate
$X = \frac{1}{a} \sum_{t \in \mathcal{R}} f(t)$ have the property:

$$E[X] = A$$
$$\sigma^2(X) = \sum_{S \subseteq \{1:n\}} \frac{c_S}{a^2}\, y_S - y_\emptyset \qquad (3\text{–}1)$$

with

$$y_S = \sum_{t_i \in R_i,\, i \in S} \left( \sum_{t_j \in R_j,\, j \in S^C} f(t_i, t_j) \right)^2$$
$$c_S = \sum_{T \in \mathcal{P}(n)} (-1)^{|T|+|S|}\, b_T.$$
The above theorem indicates that the GUS estimates of SUM-like aggregates are unbiased
and that the variance is simply a linear combination of properties of the data (the terms $y_S$) and
properties of the GUS sampling method (the coefficients $c_S$). Moreover, $y_S$ can be estimated from samples of
any GUS (see [30]). This result is not asymptotic; it gives the exact analysis even for very
small samples. Once the estimate and the variance are computed, confidence intervals can be
readily provided using either the normality assumption or the more conservative Chebychev
bound (see [30]).
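The unbiasedness claim E[X] = A is easy to check empirically. The simulation below does so for the simplest GUS method, Bernoulli(p) sampling of a single relation; it is a sanity-check sketch of our own and does not exercise the variance formula (3–1).

import random

def gus_estimate(values, p):
    # One Bernoulli(p) sample; X = (1/a) * sum over the sample, with a = p.
    return sum(v for v in values if random.random() < p) / p

values = list(range(1, 1001))   # f(t) for each tuple t
A = sum(values)                 # true aggregate: 500500
runs = 20000
mean_X = sum(gus_estimate(values, 0.1) for _ in range(runs)) / runs
print(A, round(mean_X))         # the mean of X should be close to A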
3.2 Multivariate Delta Method
Let $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$ be a vector of unknown parameters with corresponding unbiased
estimators $A = (A_1, A_2, \ldots, A_k)$. Then, under appropriate assumptions (such as the
existence of a multivariate central limit theorem for $A$), the delta method shows that for any
continuously differentiable function $g$, $E[g(A)] \approx g(\theta)$, and

$$\mathrm{Var}(g(A)) \approx \sum_{i=1}^{k} (\nabla_i g(\theta))^2\, \mathrm{Var}(A_i) + \sum_{i \neq j} \nabla_i g(\theta)\, \nabla_j g(\theta)\, \mathrm{Cov}(A_i, A_j). \qquad (3\text{–}2)$$
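Operationally, (3–2) is just the quadratic form gradient' * Cov * gradient evaluated at the plugged-in estimates. A small sketch, shown on f1 = SUMloc/COUNTloc from Example 1, whose expansion is exactly (1–1); all the numbers are hypothetical placeholders:

def delta_variance(grad, cov):
    # quadratic form of (3-2): sum_i sum_j grad_i * grad_j * Cov_ij
    k = len(grad)
    return sum(grad[i] * grad[j] * cov[i][j]
               for i in range(k) for j in range(k))

S, C = 1.0e6, 2.0e4                 # plugged-in SUMloc and COUNTloc
grad = [1.0 / C, -S / (C * C)]      # partials of f1(S, C) = S / C
cov = [[4.0e8, 1.0e5],              # [Var(S)     Cov(S, C)]
       [1.0e5, 9.0e2]]              # [Cov(S, C)  Var(C)  ]
print(delta_variance(grad, cov))    # approximates V(f1) as in (1-1)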
Using the multivariate delta method, the variance of the approximation f2 in Example 1 is
given by
$$\begin{aligned}
V(f_2(\mathrm{SUM}_{loc}, \mathrm{COUNT}_{loc}, \mathrm{COUNT}_{locn}, \mathrm{COUNT}_{ocn})) \approx\;
& \frac{\mathrm{COUNT}_{locn}^2}{\mathrm{COUNT}_{loc}^2\,\mathrm{COUNT}_{ocn}^2}\,\mathrm{Var}(\mathrm{SUM}_{loc})
+ \frac{\mathrm{SUM}_{loc}^2}{\mathrm{COUNT}_{loc}^2\,\mathrm{COUNT}_{ocn}^2}\,\mathrm{Var}(\mathrm{COUNT}_{locn}) \\
&+ \frac{\mathrm{SUM}_{loc}^2\,\mathrm{COUNT}_{locn}^2}{\mathrm{COUNT}_{loc}^4\,\mathrm{COUNT}_{ocn}^2}\,\mathrm{Var}(\mathrm{COUNT}_{loc})
+ \frac{\mathrm{SUM}_{loc}^2\,\mathrm{COUNT}_{locn}^2}{\mathrm{COUNT}_{loc}^2\,\mathrm{COUNT}_{ocn}^4}\,\mathrm{Var}(\mathrm{COUNT}_{ocn}) \\
&+ \frac{2\,\mathrm{SUM}_{loc}\,\mathrm{COUNT}_{locn}}{\mathrm{COUNT}_{loc}^2\,\mathrm{COUNT}_{ocn}^2}\,\mathrm{Cov}(\mathrm{SUM}_{loc}, \mathrm{COUNT}_{locn})
- \frac{2\,\mathrm{SUM}_{loc}\,\mathrm{COUNT}_{locn}^2}{\mathrm{COUNT}_{loc}^3\,\mathrm{COUNT}_{ocn}^2}\,\mathrm{Cov}(\mathrm{SUM}_{loc}, \mathrm{COUNT}_{loc}) \\
&- \frac{2\,\mathrm{SUM}_{loc}\,\mathrm{COUNT}_{locn}^2}{\mathrm{COUNT}_{loc}^2\,\mathrm{COUNT}_{ocn}^3}\,\mathrm{Cov}(\mathrm{SUM}_{loc}, \mathrm{COUNT}_{ocn})
- \frac{2\,\mathrm{SUM}_{loc}^2\,\mathrm{COUNT}_{locn}}{\mathrm{COUNT}_{loc}^3\,\mathrm{COUNT}_{ocn}^2}\,\mathrm{Cov}(\mathrm{COUNT}_{locn}, \mathrm{COUNT}_{loc}) \\
&- \frac{2\,\mathrm{SUM}_{loc}^2\,\mathrm{COUNT}_{locn}}{\mathrm{COUNT}_{loc}^2\,\mathrm{COUNT}_{ocn}^3}\,\mathrm{Cov}(\mathrm{COUNT}_{locn}, \mathrm{COUNT}_{ocn})
+ \frac{2\,\mathrm{SUM}_{loc}^2\,\mathrm{COUNT}_{locn}^2}{\mathrm{COUNT}_{loc}^3\,\mathrm{COUNT}_{ocn}^3}\,\mathrm{Cov}(\mathrm{COUNT}_{loc}, \mathrm{COUNT}_{ocn}).
\end{aligned} \qquad (3\text{–}3)$$
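Rather than hand-expanding the ten terms above, one can feed the analytic gradient of f2 into the same quadratic form; (3–3) then falls out mechanically. A sketch under the same caveats (hypothetical covariance numbers; variable order S, Cl, Cn, Co):

def delta_variance(grad, cov):
    k = len(grad)
    return sum(grad[i] * grad[j] * cov[i][j]
               for i in range(k) for j in range(k))

def grad_f2(S, Cl, Cn, Co):
    # partials of f2 = (S * Cn) / (Cl * Co) w.r.t. S, Cl, Cn, Co
    return [Cn / (Cl * Co),
            -S * Cn / (Cl**2 * Co),
            S / (Cl * Co),
            -S * Cn / (Cl * Co**2)]

g = grad_f2(1.0e6, 2.0e4, 5.0e3, 8.0e3)
cov = [[4.0e8, 1.0e5, 5.0e4, 2.0e4],   # hypothetical symmetric covariance
       [1.0e5, 9.0e2, 3.0e2, 1.0e2],   # matrix of (SUMloc, COUNTloc,
       [5.0e4, 3.0e2, 6.0e2, 2.0e2],   #  COUNTlocn, COUNTocn)
       [2.0e4, 1.0e2, 2.0e2, 4.0e2]]
print(delta_variance(g, cov))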
CHAPTER 4
ANALYSIS OF SAMPLING QUERY PLANS
The high-level goal of our research is to introduce a tool that computes the confidence
bounds of estimates based on sampling. Given a query plan with sampling operators
interspersed at various points, our tool transforms it to an analytically equivalent query
plan that has a particular structure: all relational operators except the final aggregate form
a subtree that is the input to a single GUS sampling operator. The GUS operator feeds the
aggregate operator that produces the final result. Note that this transformation is done solely
for the purpose of computing the confidence bounds of the result; it does not provide a better
alternative to the execution plan used as input. Once this transformation is accomplished,
Theorem 1 readily gives the desired analysis – the equivalence ensures that the analysis for the
special plan coincides with the analysis for the original plan.
A natural and convenient strategy to obtain the desired structure is to perform multiple
local transformations on the original query plan. These local transformations are based on a
notion of analytical equivalence, that we call Second Order Analytical (SOA) equivalence. They
allow both commutativity of relational and GUS operators, and consolidation of GUS operators.
Effectively, these local transformations allow a plan to be put in the special form in which there
is a single GUS operator just before the aggregate.
In this chapter, we first define the SOA-equivalence and then use it to provide equivalence
relationships that allow the plan transformations mentioned above. A more elaborate
example showcases the theory in the latter part of the chapter.
4.1 SOA-Equivalence
The main reason the previous attempts to design a sampling operator were not fully
successful is the requirement to ensure IID samples at the top of the plan. Having IID samples
makes the analysis easy since the Central Limit Theorem readily provides confidence intervals.
However, it is too restrictive to allow plans with multiple joins to be dealt with. It is important
to notice that the difficulty is not in executing query plans containing sampling but in analyzing
such query plans.
The fundamental question we ask in this section is: What is the least restrictive
requirement we can have and still produce useful estimates? Our main interest is in how
the requirement can be transformed into a notion of equivalence. This will enable us to talk
about equivalent plans, initially, but more usefully about equivalent expressions. The
key insight comes from the observation that it is enough to compute the expected value and
variance for any query plan. Then either the conservative Chebychev bounds or the optimistic¹
normal-distribution based bounds can be used to produce confidence intervals. Note that
confidence intervals are the end goal, and preserving expected value and variance is enough to
guarantee the same confidence interval using both CLT and Chebychev methods.

¹ While the CLT does not apply due to the lack of IID samples, the distribution of most complex random variables made out of many loosely interacting parts tends to be normal.
Thus, for our purposes, two query plans are equivalent if their result has the same
expected value and variance. This equivalence relation between plans already allows significant
progress. It is an extension of the classic plan equivalence based on obtaining the same answer
to randomized plans. From an operational sense, though, the plan equivalence is not sufficient
to provide interesting characterizations. The main problem is the fact that the equivalence
exists only between complete plans that compute aggregates. It is not clear what can be said
about intermediate results–the equivalent of non-aggregate relational algebra expressions.
The key to extend the equivalence of plans to equivalence of expressions is to first
design such an extension for the classic relational algebra. To this end, assume that we can
only use equality on numbers that are results of SUM-like aggregates but we cannot directly
compare sets. To ensure that two expressions are equivalent, we could require that they
produce the same answer using any SUM-aggregate. Indeed, if the expressions produce the
same relation/set, they must agree on any aggregate computation using these sets since
aggregates are deterministic and, more importantly, do not depend on the order in which the
computation is performed. The SUM-aggregates are crucial for this definition since they form
a vector space. Aggregates $A_t$ that sum the function $f_t(u) = \delta_{tu}$ form a basis of this vector space;
agreement on these aggregates ensures set agreement. Extending these ideas to randomized
estimation, we obtain the following.
Definition 2 (SOA-equivalence). Given (possibly randomized) expressions $\mathcal{E}(R)$ and $\mathcal{F}(R)$, we say

$$\mathcal{E}(R) \overset{SOA}{\Longleftrightarrow} \mathcal{F}(R)$$

if for any arbitrary SUM-aggregate $A_f(S) = \sum_{t \in S} f(t)$,

$$E[A_f(\mathcal{E}(R))] = E[A_f(\mathcal{F}(R))]$$
$$\mathrm{Var}[A_f(\mathcal{E}(R))] = \mathrm{Var}[A_f(\mathcal{F}(R))].$$
From the above discussion, it immediately follows that SOA-equivalence is a generalization
and implies set equivalence for non-randomized expressions, as stated in the following
proposition.
Proposition 4.1. Given two relational algebra expressions $E(R)$ and $F(R)$ we have:

$$E(R) = F(R) \Leftrightarrow E(R) \overset{SOA}{\Longleftrightarrow} F(R).$$
The next proposition establishes that SOA-equivalence is indeed an equivalence relation
and can be manipulated like relational equivalence.
Proposition 4.2. SOA-equivalence is an equivalence relation, i.e., for any expressions $\mathcal{E}, \mathcal{F}, \mathcal{H}$ and relation $R$:

$$\mathcal{E}(R) \overset{SOA}{\Longleftrightarrow} \mathcal{E}(R)$$
$$\mathcal{E}(R) \overset{SOA}{\Longleftrightarrow} \mathcal{F}(R) \Rightarrow \mathcal{F}(R) \overset{SOA}{\Longleftrightarrow} \mathcal{E}(R)$$
$$\mathcal{E}(R) \overset{SOA}{\Longleftrightarrow} \mathcal{F}(R) \land \mathcal{F}(R) \overset{SOA}{\Longleftrightarrow} \mathcal{H}(R) \Rightarrow \mathcal{E}(R) \overset{SOA}{\Longleftrightarrow} \mathcal{H}(R).$$
SOA-equivalence subsumes relational algebra equivalence. The strength of SOA-equivalence
is the fact that it does not depend on a notion of randomized set equivalence, an equivalence
that would be hard to define especially if it has to preserve aggregates.
Proposition 4.3. Given two relational algebra expressions $\mathcal{E}(R)$ and $\mathcal{F}(R)$ we have:

$$\mathcal{E}(R) \overset{SOA}{\Longleftrightarrow} \mathcal{F}(R)$$

if and only if

$$\forall t \in R,\ P[t \in \mathcal{E}(R)] = P[t \in \mathcal{F}(R)] \quad \text{and} \quad \forall t, u \in R,\ P[t, u \in \mathcal{E}(R)] = P[t, u \in \mathcal{F}(R)].$$
Proof. Suppose $\mathcal{E}(R) \overset{SOA}{\Longleftrightarrow} \mathcal{F}(R)$. For every $t \in R$, define the function $f_t$ as $f_t(s) = 1_{\{s=t\}}$.
Hence, $A_{f_t}(S) = 1_{\{t \in S\}}$. It follows that

$$P(t \in \mathcal{E}(R)) = E[A_{f_t}(\mathcal{E}(R))] = E[A_{f_t}(\mathcal{F}(R))] = P(t \in \mathcal{F}(R)). \qquad (4\text{–}1)$$

Now, for every $t, t' \in R$, define the function $f_{t,t'}$ as $f_{t,t'}(s) = 1_{\{s=t\}} + 1_{\{s=t'\}}$. It follows that

$$E\left[A_{f_{t,t'}}(\mathcal{E}(R))^2\right] = E\left[\left(1_{\{t \in \mathcal{E}(R)\}} + 1_{\{t' \in \mathcal{E}(R)\}}\right)^2\right] = E\left[1_{\{t \in \mathcal{E}(R)\}} + 1_{\{t' \in \mathcal{E}(R)\}} + 2 \cdot 1_{\{t,t' \in \mathcal{E}(R)\}}\right]$$
$$= P(t \in \mathcal{E}(R)) + P(t' \in \mathcal{E}(R)) + 2\, P(t, t' \in \mathcal{E}(R)).$$

Similarly,

$$E\left[A_{f_{t,t'}}(\mathcal{F}(R))^2\right] = P(t \in \mathcal{F}(R)) + P(t' \in \mathcal{F}(R)) + 2\, P(t, t' \in \mathcal{F}(R)).$$

Note that

$$E\left[A_{f_{t,t'}}(\mathcal{E}(R))^2\right] = E\left[A_{f_{t,t'}}(\mathcal{F}(R))^2\right].$$

It follows by (4–1) that

$$P(t, t' \in \mathcal{E}(R)) = P(t, t' \in \mathcal{F}(R)).$$

Hence, one direction of the equivalence is proved. Let us now assume that

$$P(t \in \mathcal{E}(R)) = P(t \in \mathcal{F}(R)) \quad \forall t \in R,$$

and

$$P(t, t' \in \mathcal{E}(R)) = P(t, t' \in \mathcal{F}(R)) \quad \forall t, t' \in R.$$

The SOA-equivalence of $\mathcal{E}(R)$ and $\mathcal{F}(R)$ immediately follows by noting that for an arbitrary
function $f$ on $R$, and an arbitrary (possibly randomized) expression $\mathcal{S}(R)$,

$$E[A_f(\mathcal{S}(R))] = \sum_{t \in R} P(t \in \mathcal{S}(R))\, f(t),$$

and

$$E\left[A_f(\mathcal{S}(R))^2\right] = \sum_{t, t' \in R} P(t, t' \in \mathcal{S}(R))\, f(t)\, f(t').$$
Table 4-1. GUS parameters for known sampling methods on a single relation

Sampling method | GUS parameters
Bernoulli(p)    | a = p,   b_∅ = p²,   b_R = p
WOR(n, N)       | a = n/N, b_∅ = n(n−1)/(N(N−1)), b_R = n/N
Proposition 4.3 provides a powerful alternative characterization of SOA-equivalence. This
equivalence is in terms of first and second order probabilities, and we refer to it as SOA-set
equivalence. Another way to interpret the result above is that SOA-set equivalence is the
same as agreement on all SUM-like aggregates. More importantly for this work, SOA-set
equivalence provides an alternative proof technique to show SOA-equivalence. Often, proofs
based on SOA-set equivalence are simpler and more compact.
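Proposition 4.3 also suggests a brute-force way to test a conjectured SOA-equivalence empirically: run both randomized expressions many times and compare the estimated first- and second-order inclusion probabilities. The sketch below is a testing aid we add for illustration, not part of the dissertation's tool.

import random
from itertools import combinations

def inclusion_probs(expr, ids, runs=20000):
    # Estimate P[t in expr(R)] and P[t, t' in expr(R)] by simulation.
    first = {t: 0 for t in ids}
    second = {pair: 0 for pair in combinations(ids, 2)}
    for _ in range(runs):
        out = expr(ids)
        for t in out:
            first[t] += 1
        for pair in combinations(sorted(out), 2):
            second[pair] += 1
    return ({t: c / runs for t, c in first.items()},
            {p: c / runs for p, c in second.items()})

# Two expressions are SOA-equivalent iff both returned maps (approximately) agree.
bern = lambda ids: {t for t in ids if random.random() < 0.5}
firsts, seconds = inclusion_probs(bern, range(4))
print(firsts)   # each ~0.5
print(seconds)  # each pair ~0.25 (inclusions are independent)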
Chapter 5 contains a recipe for expected value and variance computation for a specific
situation, when there is a single overall GUS sampling on top. Starting with the given query
plan that contains both sampling and relational operators, if we find a SOA-equivalent plan
that has no sampling except a single GUS at the top, we readily have a way to
compute the expected value and variance of the original plan. In the rest of this chapter we
pursue this idea further and show how SOA-equivalent plans with the desired structure can be
obtained from a general query plan.
4.2 GUS Quasi-Operators
Except under restrictive circumstances, sampling operators will not commute with
relational operators. This, as we mentioned, is the main reason previous work made limited
progress on the issue. As we will see later in this chapter, GUS sampling does commute in a
SOA-equivalence sense with most relational operators. The reason we can commute GUS (but
not specific sampling methods) is that, due to its generality, it can capture the correlations
induced by the relational operators. The first step in our analysis has to be a translation from
specific sampling to GUS-sampling.
Before we talk about the translation from sampling to GUS operators, we need to clarify
and refine Definition 1 of GUS sampling. As part of the definition, terms of the form
$t_i = t'_i$ or $t_j = t'_j$ are used. Intuitively, they capture the idea that tuples (or parts) are the same
or different. Since in this work we will have multiple GUS operators involved, it is important to
make the meaning of such terms very clear. We do this through a notion that proved useful in
probabilistic databases (among other uses): lineage (see [22]). Lineage allows dissociation of
the ID of a tuple from the content of the tuple for base relations, and tracking the composition
of derived tuples. With this, $t_i = t'_i$ means that the two tuples are the same – have the same
ID/lineage – not that they have the same content.
Representing and manipulating lineage is a complex subject. In this work, since we only
accommodate selection and joins, the issue is significantly simpler. The selection leaves lineage
unchanged, the lineage of the result of the join is the union of the lineage of the matching
tuples. Thus, lineage can be represented in relational form with one attribute for each base
relation participating in the expression. We can thus talk about lineage schema L(R), a
synonym of the set of base relations participating in the expression of R . The lineage of a
specific tuple t ∈ R will have values for the lineage of all base relations constituting R. A
particularly useful notation related to lineage is: T (t, t ′) = {Rk |tk = tk ′, k ∈ L (R)}, the
common part of the lineage of tuples t and t ′, i.e. the base relations on which the lineage of t
and t ′ agree.
Example 3. The query at the beginning of Chapter 1 uses two sampling methods: Bernoulli sampling with p = 0.1 on lineitem and sampling 1K tuples without replacement from orders (150K tuples). These methods can be expressed in terms of GUS as G(aB,bB) and G(aW,bW) as follows. For G(aB,bB): aB = 0.1 and bB = {bB,∅ = 0.01, bB,l = 0.1}. For G(aW,bW): aW = 6.667 × 10⁻³ and bW = {bW,∅ = 4.44 × 10⁻⁵, bW,o = 6.667 × 10⁻³}.
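To make the translation mechanical, the following sketch (Python, purely illustrative; the helper names are ours and not part of our implementation, which uses SWI-Prolog and C++) derives the GUS parameters of Table 4-1 and reproduces the coefficients above.

def gus_bernoulli(p, rel):
    # Bernoulli(p) on relation rel (Table 4-1): a = p, b_empty = p^2, b_rel = p
    return {"a": p, "b": {frozenset(): p * p, frozenset({rel}): p}}

def gus_wor(n, N, rel):
    # Sampling n tuples without replacement out of N (Table 4-1):
    # a = n/N, b_empty = n(n-1)/(N(N-1)), b_rel = n/N
    return {"a": n / N,
            "b": {frozenset(): n * (n - 1) / (N * (N - 1)),
                  frozenset({rel}): n / N}}

# Example 3: Bernoulli(0.1) on lineitem, WOR(1000, 150000) on orders.
g_b = gus_bernoulli(0.1, "l")     # a_B = 0.1, b_B,empty = 0.01, b_B,l = 0.1
g_w = gus_wor(1000, 150000, "o")  # a_W ~ 6.667e-3, b_W,empty ~ 4.44e-5, b_W,o ~ 6.667e-3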
It is important to note that the GUS is not an operator but a quasi-operator. While it
corresponds to a real operator when the translation from specific sampling to GUS happens, it
Table 4-2. Notation used

Notation        Meaning
R (the sample)  Random subset of R
a               P[t ∈ R]
L(R)            Lineage schema of R
L(t)            Lineage of tuple t
T               Subset of L(R)
T(t, t′)        {Rk | tk = t′k, k ∈ L(R)}
bT              P[t, t′ ∈ R | T = T(t, t′)]
b               {bT | T ∈ P(n)}
G(a,b)          GUS method with parameters a and b
G(a,b)(R)       G(a,b) applied to relation R
l               lineitem
o               orders
c               customer
p               parts
will not correspond to an operator after transformations. There is no need to provide or even
to consider an implementation of a general GUS operator since GUS will only be used for the
purpose of analysis.
In the following analysis, we will assume that all specific sampling operators have been replaced by GUS quasi-operators, and thus will not be encountered by the re-writing algorithm. We designate by G(a,b)(R) a GUS method applied to a relation R, and the resulting sample by R. When multiple GUS methods are used, the i-th GUS method and its resulting sample will be denoted by G(ai,bi) and Ri respectively.
4.3 Interaction between GUS and Relational Operators
As we stated in Section 4.1, SOA-equivalence is the key to deriving an analyzable plan that is equivalent to the one provided by the user. The results in this section provide equivalences that allow transformations leading to a single GUS operator at the top of the plan. The results in this section make use of the notation in Table 4-2.
Proposition 4.4 (Identity GUS). The quasi-operator G(1,1), i.e. a GUS operator with a = 1,
bT = 1, can be inserted at any point in a query plan without changing the result.
Proof. Since a = 1, all input tuples are allowed with probability 1, i.e., no filtering happens.
Proposition 4.5 (Selection-GUS Commutativity). For any R, selection σC and GUS G(a,b),

σC(G(a,b)(R)) ⇐⇒SOA G(a,b)(σC(R)).

Proof. Let R′ = σC(R). On computing R ∩ R′ we see that

∀(t ∈ R′),  P[t ∈ R ∩ R′] = P[t ∈ R] · 1{t∈R′} = a,

∀(t, t′ ∈ R′),  P[t, t′ ∈ R ∩ R′ | T = T(t, t′)] = P[t, t′ ∈ R | T = T(t, t′)] = bT.
The above results are somewhat expected and have been covered for particular cases in
previous literature. The following result, though, overcomes the difficulties in [21].
Proposition 4.6 (Join-GUS Commutativity). For any R1, R2, join ⋈θ and GUS methods G(a1,b1), G(a2,b2), if L(R1) ∩ L(R2) = ∅, then

G(a1,b1)(R1) ⋈θ G(a2,b2)(R2) ⇐⇒SOA G(a,b)(R1 ⋈θ R2),

where a = a1a2 and bT = b1,T1 b2,T2, with T1 = T ∩ L(R1) and T2 = T ∩ L(R2).
Proof. We proved in Proposition 4.5 that a GUS method commutes with selection. Thus, it is enough to prove commutativity of a GUS method with the cross product. Let R = R1 × R2 and t = (t1, t2), t′ = (t′1, t′2) ∈ R. Thus, L(R) = L(R1) ∪ L(R2). We have:

a = P[t ∈ R] = P[t1 ∈ R1 ∧ t2 ∈ R2] = P[t1 ∈ R1] P[t2 ∈ R2] = a1a2.

Since L(R1) ∩ L(R2) = ∅, for an arbitrary T ⊆ L(R), T1 = T ∩ L(R1) and T2 = T ∩ L(R2), we have T1 ∩ T2 = ∅ (disjoint lineage). With this, we first get:

{T(t, t′) = T} = E1 ∩ E2,

where E1 = {T(t1, t′1) = T1} and E2 = {T(t2, t′2) = T2}. Using the above and the independence of the GUS methods,

bT = P[t, t′ ∈ R | T(t, t′) = T]
   = P[t1, t′1 ∈ R1 ∧ t2, t′2 ∈ R2 | E1, E2]
   = P[t1, t′1 ∈ R1 | E1] × P[t2, t′2 ∈ R2 | E2]
   = b1,T1 b2,T2.
Example 4. Applying the above results to the GUS coefficients obtained in Example 3, we can derive the following coefficients for the overall G(a,b) on the join of lineitem and orders:
a = a1a2 = 0.1× 6.667× 10−3 = 6.667× 10−4.
b∅ = b1,∅b2,∅ = 0.01× 4.44× 10−5 = 4.44× 10−7.
bo = b1,∅b2,o = 0.01× 6.667× 10−3 = 6.667× 10−5.
bl = b1,lb2,∅ = 0.1× 4.44× 10−5 = 4.44× 10−6.
blo = b1,lb2,o = 0.1× 6.667× 10−3 = 6.667× 10−4.
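The computation in Example 4 is mechanical; a sketch in the same illustrative Python style (reusing gus_bernoulli and gus_wor from the sketch after Example 3) applies Proposition 4.6 and reproduces the coefficients above.

from itertools import chain, combinations

def powerset(s):
    # all subsets of s, as frozensets
    s = list(s)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

def gus_join(g1, g2, lin1, lin2):
    # Proposition 4.6: a = a1 * a2 and b_T = b_{1,T ∩ L(R1)} * b_{2,T ∩ L(R2)},
    # assuming the lineage schemas lin1 and lin2 are disjoint
    assert not (set(lin1) & set(lin2))
    b = {T: g1["b"][T & frozenset(lin1)] * g2["b"][T & frozenset(lin2)]
         for T in powerset(set(lin1) | set(lin2))}
    return {"a": g1["a"] * g2["a"], "b": b}

g_lo = gus_join(gus_bernoulli(0.1, "l"), gus_wor(1000, 150000, "o"), ["l"], ["o"])
# a = 6.667e-4, b_empty = 4.44e-7, b_o = 6.667e-5, b_l = 4.44e-6, b_lo = 6.667e-4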
Example 5. In this example we provide a complete walk-through for a larger query plan. The input is the query plan in Figure 4-1(a), which contains 3 sampling operators and 3 joins, and refers to relations lineitem, orders, customer and part. To analyze such a query, the first step is to re-write the sampling operators as GUS quasi-operators G(a1,b1), G(a2,b2), G(a3,b3), as in Figure 4-1(b). The second step, shown in Figure 4-1(c), is to apply Proposition 4.6 to commute G(a1,b1) and G(a2,b2) with the join, resulting in G(a12,b12). This step also shows the application of Proposition 4.4 above customer. The next step, in Figure 4-1(d), again uses Proposition 4.6 to commute G(a12,b12) and G(1,1), resulting in G(a121,b121). Figure 4-1(e) shows the final transformation that uses the same proposition to get an overall GUS method G(a123,b123) just below the aggregate and on top of the rest of the plan. Theorem 1 can now be used to obtain the expected value and variance of the estimate. Using this and either the normal approximation or the Chebyshev bounds, we obtain confidence intervals for the estimate. The computed coefficients for the GUS methods involved are given in Figure 4-1.
4.4 Interactions between GUS Operators
In the previous section we explored the interaction between GUS operators and relational
algebra operators. In this section, we investigate interactions between GUS operators when
applied to the same data. Intuitively, this will open up avenues for design of sampling
operators, since it will indicate how to compute GUS quasi-operators that correspond to
complex sampling schemes.
Proposition 4.7 (GUS Union). For any expression R and GUS methods G(a1,b1), G(a2,b2),

G(a1,b1)(R) ∪ G(a2,b2)(R) ⇐⇒SOA G(a,b)(R),

where a = a1 + a2 − a1a2 and bT = 2a − 1 + (1 − 2a1 + b1,T)(1 − 2a2 + b2,T).
Union of GUS methods can be very useful when samples are expensive to acquire,
thus there is value in reusing them. If two separate samples from relation R are available,
Proposition 4.7 provides a way to combine them.
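A sketch of Proposition 4.7 in the same illustrative Python style; it assumes both parameter sets are defined over the same lineage schema (i.e., two samples of the same relation).

def gus_union(g1, g2):
    # Proposition 4.7: a = a1 + a2 - a1*a2,
    # b_T = 2a - 1 + (1 - 2*a1 + b_{1,T}) * (1 - 2*a2 + b_{2,T})
    a1, a2 = g1["a"], g2["a"]
    a = a1 + a2 - a1 * a2
    b = {T: 2 * a - 1 + (1 - 2 * a1 + g1["b"][T]) * (1 - 2 * a2 + g2["b"][T])
         for T in g1["b"]}
    return {"a": a, "b": b}

# Combining two independent Bernoulli(0.1) samples of lineitem:
g = gus_union(gus_bernoulli(0.1, "l"), gus_bernoulli(0.1, "l"))
# a = 0.19, b_l = 0.19, b_empty = 0.0361 -- i.e., exactly Bernoulli(0.19)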
[Figure 4-1 depicts the transformation: (a) the original plan with sampling operators B0.1(l), W1000(o) and B0.5(p) below the SUM aggregate; (b) the sampling operators rewritten as GUS quasi-operators G(a1,b1), G(a2,b2), G(a3,b3); (c) G(a12,b12) obtained by commuting through the join of l and o, with G(1,1) inserted above c; (d) G(a121,b121); (e) the overall G(a123,b123) just below the aggregate.]

GUS method    Parameters
G(a1,b1)      a1 = 0.1, b1,∅ = 0.01, b1,l = 0.1
G(a2,b2)      a2 = 6.667 × 10⁻³, b2,∅ = 4.44 × 10⁻⁵, b2,o = 6.667 × 10⁻³
G(a3,b3)      a3 = 0.5, b3,∅ = 0.25, b3,p = 0.5
G(a12,b12)    a12 = 6.667 × 10⁻⁴, b12,∅ = 4.44 × 10⁻⁷, b12,o = 6.667 × 10⁻⁵, b12,l = 4.44 × 10⁻⁶, b12,lo = 6.667 × 10⁻⁴
G(a121,b121)  a121 = 6.667 × 10⁻⁴, b121,∅ = 4.44 × 10⁻⁷, b121,c = 4.44 × 10⁻⁷, b121,o = 6.667 × 10⁻⁵, b121,oc = 6.667 × 10⁻⁵, b121,l = 4.44 × 10⁻⁶, b121,lc = 4.44 × 10⁻⁶, b121,lo = 6.667 × 10⁻⁴, b121,loc = 6.667 × 10⁻⁴
G(a123,b123)  a123 = 3.334 × 10⁻⁴, b123,∅ = 1.11 × 10⁻⁷, b123,p = 2.22 × 10⁻⁷, b123,c = 1.11 × 10⁻⁷, b123,cp = 2.22 × 10⁻⁷, b123,o = 1.667 × 10⁻⁵, b123,op = 3.335 × 10⁻⁵, b123,oc = 1.667 × 10⁻⁵, b123,ocp = 3.335 × 10⁻⁵, b123,l = 1.11 × 10⁻⁶, b123,lp = 2.22 × 10⁻⁶, b123,lc = 1.11 × 10⁻⁶, b123,lcp = 2.22 × 10⁻⁶, b123,lo = 1.667 × 10⁻⁴, b123,lop = 3.334 × 10⁻⁴, b123,loc = 1.667 × 10⁻⁴, b123,locp = 3.334 × 10⁻⁴

Figure 4-1. Transformation of the query plan to allow analysis
Proposition 4.8 (GUS Compaction). For any expression R and GUS methods G(a1,b1), G(a2,b2),

G(a1,b1)(G(a2,b2)(R)) ⇐⇒SOA G(a,b)(R),

where a = a1a2 and bT = b1,T b2,T.

Compaction can also be viewed as intersection. It allows sampling methods to be stacked on top of each other to obtain smaller samples. We will make use of this in the next chapter.
Interestingly, union behaves like + with the identity element G(0,0) (the sampling method that blocks everything), while compaction/intersection behaves like ∗ with the identity element G(1,1) (the sampling method that allows everything). Overall, the algebraic structure formed is that of a semi-ring, as stated in the following.

Theorem 2. The GUS operators over any expression R form a semi-ring structure with respect to the union and compaction operations, with G(0,0) and G(1,1) as the respective identity elements.
The semi-ring structure of GUS methods can be exploited to design sampling operators
from ingredients.
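For instance, the following sketch (same illustrative Python style, reusing gus_bernoulli and gus_union from the earlier sketches) implements compaction and checks the identity elements of Theorem 2; since both methods act on the same lineage schema, the compaction rule reads a = a1a2, bT = b1,T b2,T.

def gus_compact(g1, g2):
    # Proposition 4.8: stacking G1 on top of G2 gives a = a1*a2, b_T = b_{1,T}*b_{2,T}
    return {"a": g1["a"] * g2["a"],
            "b": {T: g1["b"][T] * g2["b"][T] for T in g1["b"]}}

def g_one(rel):   # G(1,1): the sampling method that allows everything
    return {"a": 1.0, "b": {frozenset(): 1.0, frozenset({rel}): 1.0}}

def g_zero(rel):  # G(0,0): the sampling method that blocks everything
    return {"a": 0.0, "b": {frozenset(): 0.0, frozenset({rel}): 0.0}}

g = gus_bernoulli(0.3, "l")
print(gus_compact(g_one("l"), g))   # identical to g: G(1,1) is the compaction identity
print(gus_union(g_zero("l"), g))    # equal to g up to rounding: G(0,0) is the union identity
print(gus_compact(g_zero("l"), g))  # all zeros: G(0,0) absorbs under compaction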
Proposition 4.9 (GUS Composition). For any expressions R1, R2 and GUS methods G(a1,b1), G(a2,b2),

G(a1,b1)(R1) ◦ G(a2,b2)(R2) ⇐⇒SOA G(a,b)(R1 × R2),

where a = a1a2 and bT = b1,T b2,T.

GUS composition is very useful for the design of multidimensional sampling operators. We use it here to design a bi-dimensional Bernoulli.
Example 6. Suppose that we design a bi-dimensional sampling operator B0.2,0.3(l, o) that combines the Bernoulli sampling operators B0.2(l) and B0.3(o). Using the above result, the GUS operator G(a,b) corresponding to the bi-dimensional Bernoulli is G(a1,b1)(l) ◦ G(a2,b2)(o), where G(a1,b1) is the GUS of B0.2(l) and G(a2,b2) is the GUS of B0.3(o). Working out the coefficients using Proposition 4.9 (the process is similar to that in Example 4), we get: a3 = 0.06, b3,∅ = 0.0036, b3,o = 0.012, b3,l = 0.018, b3,lo = 0.06.
CHAPTER 5
EFFICIENT VARIANCE ESTIMATION: IMPLEMENTATION & EXPERIMENTS
In this chapter, we first present the major challenges in implementing an efficient variance
estimator and show how the theoretical ideas in the previous chapter can be used to tackle
those challenges. Furthermore, we demonstrate the effectiveness of our approach by performing
a detailed empirical study using the TPC-H [4] data.
5.1 Estimating yS Terms
The computation of the variance of the sampling estimator in Theorem 1 uses the coefficients yS defined as:

yS = ∑_{ti∈Ri | i∈S} ( ∑_{tj∈Rj | j∈S^C} f(ti, tj) )².

The terms yS essentially require a GROUP BY on lineage followed by a specific computation. This is better understood through an example – Query 1 – and equivalent expressions in SQL:
CREATE TABLE unagg AS
SELECT l_orderkey*10+l_linenumber as f_l,
o_orderkey as f_o, l_discount*(1.0-l_tax) as f
FROM lineitem TABLESAMPLE (10 PERCENT),
orders TABLESAMPLE(1000 ROWS)
WHERE l_orderkey = o_orderkey AND
l_extendedprice > 100.0;
SELECT sum(f)^2 as y_empty FROM unagg;

SELECT sum(F*F) as y_l FROM
  (SELECT sum(f) as F FROM unagg GROUP BY f_l) t;

SELECT sum(F*F) as y_o FROM
  (SELECT sum(f) as F FROM unagg GROUP BY f_o) t;

SELECT sum(f*f) as y_lo FROM unagg;
The computation of the yS terms using the above code is harder than the evaluation of the exact query, thus resulting in an impractical solution. We can instead use the sample to estimate these terms, essentially replacing the entire base relations in the above queries by their samples. The resulting sample-based values yS can be used to obtain unbiased estimates YS of the terms yS using the formula [59]:

YS = (1 / cS,∅) ( yS − ∑_{T⊆S^C, T≠∅} cS,T Y_{S∪T} ),

where

cS,T = ∑_{U⊆T} (−1)^{|U|+|S|} b_{S∪U}.
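The correction is easy to mechanize. The sketch below (same illustrative Python style) is a direct transcription of the two formulas above: since YS depends only on YS∪T for strictly larger index sets, the estimates are computed in decreasing order of |S|. The coefficients b and the naive sample-based values yS are assumed given, keyed by frozensets of base relation names.

from itertools import combinations

def nonempty_subsets(s):
    s = list(s)
    return [frozenset(c) for r in range(1, len(s) + 1) for c in combinations(s, r)]

def c_coeff(S, T, b):
    # c_{S,T} = sum over U subset of T of (-1)^(|U|+|S|) * b_{S union U},
    # transcribed verbatim from the formula above
    return sum((-1) ** (len(U) + len(S)) * b[S | U]
               for U in [frozenset()] + nonempty_subsets(T))

def unbias(y, b, relations):
    # Y_S = (y_S - sum over nonempty T subset of S^C of c_{S,T} * Y_{S union T}) / c_{S,empty}
    rels = frozenset(relations)
    Y = {}
    for S in sorted([frozenset()] + nonempty_subsets(rels), key=len, reverse=True):
        correction = sum(c_coeff(S, T, b) * Y[S | T]
                         for T in nonempty_subsets(rels - S))
        Y[S] = (y[S] - correction) / c_coeff(S, frozenset(), b)
    return Y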
Yet again, the major effort is in evaluating the YS terms over the sample. The number of terms to be evaluated is 2ⁿ, where n is the number of base relations, and each term consists
of a GROUP BY query that is possibly expensive. To this end, we observe that the computation
of the variance of the sampling estimator depends, in orthogonal ways, on properties of the
data through terms yS and on properties of the sampling through cS . The base theory does
not require any particular way to compute/estimate terms yS . Using the available sample
for estimating yS terms is only one of the possibilities. While many ways to estimate terms
yS can be explored, a particularly interesting one in this context is to use another sampling
method for the purpose. More specifically, we could use a sub-sample of the available sample
for estimation of the terms yS and the full sample for the estimation of the true value.
To understand what benefits we can get from this idea, we observe that we do not need
very precise estimates of the terms yS . Should we make a mistake, it will only affect the
confidence interval by a small constant factor but will still allow the shrinking of the confidence
interval with the increase of the sample. Based on the experience in DBO and TurboDBO,
using 100K result tuples for the estimation of the yS terms suffices. Moreover, the system needs to provide lineage information¹ only for these 100K tuples; samples used for evaluation of the expected value need no lineage.
5.2 Sub-Sampling
There are two alternatives when it comes to reducing the number of samples used for
estimation of terms yS : select a more restrictive sampling method, or further sample from
the provided sample. The latter approach can be applied when needed, in case the size of
the sample is overwhelming for the computation of terms yS . Specifically, we can use a
multi-dimensional Bernoulli GUS on top of the existing query plan for result tuples. This can
be obtained by applying Proposition 4.9 until the desired size is reached. The extra results
in Section 4.4 together with the core results in Section 4.3 provide the means to analyze this
modified sampling process. Example 7 and the accompanying Figure 5-1 provide such an analysis for Query 1 and exemplify how the extra Bernoulli sampling can be dealt with.
Example 7. This example shows how the query plan from the introduction can be sampled
further to efficiently obtain yS terms. Figure 5-1.a shows the original query plan. Figure 5-1.b
shows the sampling in terms of a GUS quasi-operator. Figure 5-1.c shows the placement
of a bi-dimensional Bernoulli sampling method. Figures 5-1.d, 5-1.e, 5-1.f make use of
propositions in Section 4.3 to obtain a SOA-equivalent plan, suitable for analysis.
5.3 Experiments
We have two main goals for this empirical study. Our first goal is to provide some
experimental confirmation of our theory. Our second goal is to evaluate the efficiency of the
estimation process and study how sub-sampling affects the running times and the variance
estimates. More specifically:
• How does the running time depend on the selectivity of the query?

• How useful is the sub-sampling process? How does the running time depend on the sub-sampling?

• Does sub-sampling significantly reduce the precision of the confidence interval estimates?

¹ By lineage, we mean the list of base relations that have participated in forming a given tuple.

[Figure 5-1 depicts the transformation: (a) the original plan with B0.1(l) and WOR1000(o); (b) the sampling expressed as a single GUS quasi-operator G(aBW,bBW); (c) the placement of the bi-dimensional Bernoulli B0.2,0.3 on top; (d) all sampling rewritten as GUS quasi-operators G(a1,b1), G(a2,b2), G(a3,b3); (e) G(a12,b12) after commuting through the join; (f) the overall G(a123,b123).]

GUS method    Parameters
G(a1,b1)      a1 = 0.1, b1,∅ = 0.01, b1,l = 0.1
G(a2,b2)      a2 = 6.667 × 10⁻³, b2,∅ = 4.44 × 10⁻⁵, b2,o = 6.667 × 10⁻³
G(a3,b3)      a3 = 0.06, b3,∅ = 0.0036, b3,o = 0.012, b3,l = 0.018, b3,lo = 0.06
G(a12,b12)    a12 = 6.667 × 10⁻⁴, b12,∅ = 4.44 × 10⁻⁷, b12,o = 6.667 × 10⁻⁵, b12,l = 4.44 × 10⁻⁶, b12,lo = 6.667 × 10⁻⁴
G(a123,b123)  a123 = 4 × 10⁻⁵, b123,∅ = 1.598 × 10⁻⁹, b123,o = 8 × 10⁻⁷, b123,l = 7.992 × 10⁻⁸, b123,lo = 4 × 10⁻⁵

Figure 5-1. Transformation of the query plan to allow analysis
As we will see in this section, the experiments validate the theory and the sub-sampling
is invaluable for efficient estimation. In particular, sub-samples of size 100K to 400K tuples
provide correct and meaningful confidence intervals and require less than 2% of the overall
running time.
5.3.1 Experimental Setup
Data Sets. For our experiments, we use two versions of TPC-H [4] data—a small data set (scale factor 0.1, 100MB) for verification of confidence intervals, and a large one (scale factor 1000, 1TB) to benchmark the efficiency of the estimation process. We use the following parameterized query:
SELECT SUM(l_discount*(1.0-l_tax))
FROM lineitem TABLESAMPLE (x PERCENT),
orders TABLESAMPLE(y ROWS),
part TABLESAMPLE(z PERCENT)
WHERE l_orderkey = o_orderkey AND
l_partkey = p_partkey AND o_totalprice < q AND
p_retailprice < r;
where the parameters x, y and z govern the inclusiveness of the sampling methods and
parameters q and r govern the selectivity.
Code. The analysis of the sampling query plan, as described in Section 4.3, is coded using
SWI-Prolog [3]. When presented with the specific query plan that contains both the relational
and sampling operators, the code derives the a, bT coefficients. The sub-sampling operator and
the confidence interval prediction based on the coefficients a, bT from the symbolic analysis
are implemented in C++. For the sub-sampling operator, we can specify a lower (L) and an
upper (U) bound on the number of tuples to maintain. The implementation uses an adaptive
algorithm that changes the probability of the Bernoulli sampling to meet these goals. In all the
experiments we specify the range used to configure the sub-sampling – the actual number of
tuples in the sub-sample is specified in some of the experiments.
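The adaptive algorithm itself is not spelled out here; the sketch below (illustrative Python, not the actual C++ operator) shows one plausible realization: whenever the retained count exceeds the upper bound, the inclusion probability is halved and the retained tuples are re-thinned, which keeps the retained set a Bernoulli sample of the tuples seen so far.

import random

class AdaptiveBernoulli:
    # One plausible realization of the adaptive sub-sampling operator: the
    # retained set is always a Bernoulli(p) sample of the tuples seen so far.
    def __init__(self, lower, upper):
        self.lower, self.upper = lower, upper  # configured range, e.g. 100K-400K
        self.p = 1.0
        self.kept = []

    def offer(self, tup):
        if random.random() < self.p:
            self.kept.append(tup)
        while len(self.kept) > self.upper:
            # Halving p and keeping each retained tuple with probability 1/2
            # preserves the Bernoulli property; with upper >= 2 * lower the
            # count stays (with high probability) above the lower bound.
            self.p /= 2.0
            self.kept = [t for t in self.kept if random.random() < 0.5]

sub = AdaptiveBernoulli(100_000, 400_000)
for t in range(1_000_000):   # stand-in for the stream of result tuples
    sub.offer(t)
# sub.p is the `a` coefficient of the extra Bernoulli GUS used in the analysis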
DBMSs. Two DBMSs were used for experiments in this section. For the correctness
experiments, we used Oracle 11g and made use of the SAMPLE operator. For the large-scale experiments we used the DataPath [12] parallel database system and coded the sub-sampling and analysis as user-defined aggregates.
Hardware. We run our experiments on a 48 core AMD Magny-cours 1.9GHz with 256GB
of RAM and 76 disks. The machine is capable of sustained I/O rates of 3.0GB/s through 3
Averatec RAID controllers. Oracle 11g used only one disk and did not make use of all the cores
(Oracle was used only for the correctness experiments). DataPath usually makes use of all the
resources and for most queries sustained data processing speeds of 2-3GB/s.
We now describe the experiments undertaken to answer the questions listed above. The
detailed setup of each experiment is specified in the respective subsections.
5.3.2 Correctness
In this experiment, we provide a sanity check for our theory and measure how accurate the
derived confidence intervals are.
Setup. For this study, we use the 100MB TPC-H data (lineitem has 600,000 tuples
and orders has 150,000 tuples). We selected the small size to allow us to perform many
experiments to empirically evaluate the variance through Monte-Carlo simulation. We used the
Oracle 11g database since it supports the SAMPLE operator from SQL and thus would support
our claim that the theory in this work can readily be used to analyze sampling queries issued against existing commercial DBMSs.
Monte Carlo Validation. We validate the theoretically computed expected value and
variance of the sampling estimate using a Monte Carlo simulation. We use the query as shown
above with parameters q and r fixed at 106 and 4 × 105 respectively. This query is run
for different values of (x,y,z) as shown in Figure 5-2. For each value of (x,y,z), we run the
query 5000 times and obtain 5000 confidence intervals each for ten different confidence levels,
ranging from 5% to 95%. We then compute the achieved confidence levels, i.e. for each of
Figure 5-2. Plot of the percentage of times the true value lies in the estimated confidence intervals vs. the desired confidence level, for four sampling strategies: l(10) o(10) p(50); l(20) o(15) p(55); l(30) o(20) p(60); l(40) o(25) p(65).
the ten desired confidence levels, we compute the percentage of times the true value falls
within the corresponding confidence intervals. In Figure 5-2, we show a comparison between
the desired and achieved confidence levels for 4 different sampling strategies. The achieved
confidence levels are very close to the desired values, across the different sampling strategies
and confidence levels. This provides strong empirical evidence that the confidence intervals
obtained by using the theory in Section 4.3 are accurate and tight.
5.3.3 Running Time
The next goal is to evaluate the efficiency of the estimation process. We are especially
interested in the cost of evaluating the variance of the estimators. This study, performed with our research prototype, should give the practitioner some idea of the expected overhead.
Setup. Intuitively, the analysis overhead will depend on the sample size. To ensure that
we stress the analysis with large samples, we use the 1TB TPC-H instance and treat the
database as a sample of a 1PB database. More specifically we assume that the 6 billion tuples
in lineitem are a Bernoulli sample from the 6 trillion tuples in the same relation at 1PB scale
(0.001 sampling fraction). Similarly, the 1.5 billion tuples in orders are a sample without
replacement from the 1.5 trillion tuples of the 1PB database and the 200 million tuples in part
are a Bernoulli sample (0.001 sampling fraction) from 200 billion tuples at 1PB scale. This
ensures that the sample sizes the analysis has to deal with can be in the billions – a very harsh
scenario for analysis indeed.
Since the database is the sample, there is no sampling needed in the execution of the
query – the tuples that the analysis has to make use of are the tuples that are aggregated by
the non-sampling version of the query. As described in Section 5.1, maintaining the estimator
is as easy as performing the aggregation in the non-sampling query but computing the variance
is much more involved. The technique we proposed is to sub-sample from the sample to limit
the number of tuples used to estimate the variance. In our experiments we studied the impact
of the query characteristics (various selection predicates) and sub-sampling size on the running
time. For each experiment, we measured three running times. First, the running time of the
non-sampling query (no statistical estimation, just the aggregate). Second, the running time
of the query processing and sub-sampling process. Sub-sampling is interleaved with the rest
of the processing and the two running times cannot be separated. Third, the time to perform
the analysis. The analysis is single-threaded and starts only once the sub-sample is completely
formed.
Impact of Selectivity with Fixed Sub-Sampling. In our first experiment, we will vary
the selection predicate (thus indirectly the selectivity of the query) and set the range for the
sub-sampling at 100K-400K tuples (i.e. sub-sampling obtains 100K-400K tuples that are a
Bernoulli sample from the data provided for analysis). Results are depicted in Figure 5-3A.
We make three key observations. First, for this sub-sampling target, the analysis adds an
insignificant amount of extra effort (about 2% of the overall running time). Second, selectivity
of the query has no significant effect on the running time for either the non-sampling or for the
sampling query. Last, the running time of the sampling version of the query seems to be more
Figure 5-3. Plots of running time (sec) and number of tuples (million) vs. the selection parameter: (A) with sub-sampling range 100K–400K; (B) with no sub-sampling. The curves show the query processing + sub-sampling time, the analysis time, the sub-sample size, and the query without analysis.
stable than the running time of the non-sampling version.² It seems that, when the size of the sub-sample is below 500,000, the extra effort to perform the sampling analysis is insignificant. We show later that such sub-samples are good enough to produce stable variance estimates.
Impact of Selectivity with No Sub-Sampling. An unresolved question from the
previous experiment is what happens when no sub-sampling is performed, i.e. all the data
is used for analysis. The selectivity of the query will now control the number of tuples used
² A similar behavior was noticed in [12]: the execution is more stable and somewhat faster at higher CPU loads.
for analysis and give an indication of the effort as a function of the size. Figure 5-3B depicts
result of such an experiment in which the selection predicate was varied. Results reveal that,
once the size of the sample exceeds 1M the analysis cost becomes unmanageable and starts
to dominate. At the end of the spectrum (31M tuples) the analysis was 5 times slower than
the the rest of the execution – this is clearly not acceptable in practice. As we hinted above,
targets of 100K-400K produce good enough estimates of the variance; there is no need to base
the variance analysis on millions of tuples, thus running time of analysis can be kept under
control. Sub-sampling is thus a crucial technique for applicability of sampling estimation to
large data.
5.3.4 Sub-Sample Size
This experiment sheds light on the influence of the sub-sample size on the estimate for the variance, and thus on the quality of the confidence intervals.
Setup. Since we would like to get samples from all over the data source, we use the 1TB
TPC-H instance as the data source and repeatedly derive samples from it. Remember that,
according to Section 5.1, any estimates of the terms yS can be used to analyze any of the
sampling methods. Sub-sampling is used to estimate the terms yS , but the entire sample is
used to compute the estimate. Sub-sampling leads to a substantial reduction in computation, but also gives rise to wider confidence intervals. In many situations, the estimated aggregate is several orders of magnitude larger than its estimated standard deviation (based on the whole sample). In such cases, it is clear that any increase in the width of the confidence interval due to sub-sampling will be extremely minor compared to the estimated aggregate. Thus a much smaller sub-sample has only a secondary influence on the confidence interval, while leading to a substantial reduction in computation. The plot in Figure 5-4 shows this fact. In
this experiment we run around 250 instances of Query 1 for sub-sampling ranges of 10K-40K,
100K-400K and 1M-4M tuples each. In all cases, we calculate the fluctuation of the resultant
confidence interval widths with respect to the confidence interval width obtained from an
analysis without sub-sampling. In particular, we define the error as the ratio of the difference between
Figure 5-4. Plot of the fluctuation of confidence interval widths obtained with sub-sampling (targets 10K–40K, 100K–400K and 1M–4M) with respect to the confidence interval width obtained without sub-sampling.
the 5th and the 95th percentile values to the width of the confidence interval obtained without
sub-sampling. The plot in Figure 5-4 shows that this error is only 1% when 100K-400K tuples
are used.
Note on number of relations. As we have seen in this section, for sampling over
3 relations, good confidence intervals can be obtained with a mere 2% extra effort since
sub-samples of 100K tuples suffice. Since the analysis requires the computation of 2ⁿ terms
if n relations are sampled, the influence of the number of relations on the running time of the
analysis is of concern. In practice, these concerns can be easily addressed as follows: (a) the
computation of the yS terms from the sub-samples can be parallelized – on our system this
would result in a speedup of at least 32 (on 48 cores), (b) we noticed that foreign key joins
result in repeated values for certain terms – about half the values are repeated, (c) we see
no need to sample from more than 8 relations since there is no need to sample from small or
medium size relations. Notice that the parallelization alone would allow us to scale from 3 to
3 + 5 = 8 relations since 2⁵ = 32, the expected speedup.
CHAPTER 6
COVARIANCE BETWEEN MULTIPLE ESTIMATES: CHALLENGES
Having addressed the challenge of estimating variance for SUM-like aggregates, we
now focus on the more difficult problem of computing pairwise covariances between multiple
estimates. As mentioned in the introduction, previous work on covariance computation
(such as [59, 95]) has focused on deriving closed form expressions for covariances in special
settings, with specific assumptions on the type of schema, number of relations, and type of
estimates. The computation has to be done by hand on a case-by-case basis. Our goal is to
develop theory for a generalized solution, which is independent of platform, schema and number of relations, is applicable to a wide class of sampling methods, and is also capable of being automated. In the first part of this dissertation, we developed precisely such a theory for the variance computation problem. Given the connection between covariance and variance, it is natural to assume that the covariance computation problem can be solved through a mild and straightforward extension of the previous theory. However, a closer look at the problem reveals that this is not the case, and brings out key underlying challenges. The goal of this section is to carefully and methodically lay out these challenges. We will first recall the major steps in the variance computation strategy, and then examine the adequacy of this strategy for covariance computation, starting with the simplest case and then moving on to more general and complex settings.
6.1 Base Lemma for Covariance
In previous chapters, we dealt with a general single query plan which has sampling at
the bottom, and the estimate at the top. The strategy for computing the variance of such an
estimate was as follows.
• The notion of SOA-Equivalence allowed us to transform this plan into an analyzable planfor variance analysis. This analyzable plan had an equivalent sampling operator at thetop of the plan, i.e., at the level of the estimate of interest.
• The transformation of a general query plan into an analyzable query plan was achieved by applying a series of algebraic rules based on SOA-Equivalence.
• The overall sampling method at the top of this analyzable plan was expressed as a GUS method, whose 2ⁿ parameters represented the sampling-based correlations (one term for every subset of the base relations) involved in the estimate.

• We plugged the parameters of this overall GUS method into the base lemma for variance (Theorem 1 in Chapter 3) to get the required variance.
The conceptually simplest covariance computation problem corresponds to two estimates derived from a single query plan. Using the above strategy, we can get a SOA-Equivalent plan with an overall GUS method on top and plug its parameters into the base lemma for variance. However, the base lemma only yields variances; even in the simplest multiple-estimate case, where we have 2 estimates derived from the same set of base relation samples, it provides no way to compute their covariance. Challenge #1: Generalize the base lemma for covariance.
Example 8. To compute the error bound for the estimate f1 using (1–1), we need to compute V(SUMloc), V(COUNTloc) and Cov(SUMloc, COUNTloc). Let slineitem, sorders and scustomer be Bernoulli(0.1), SWOR(1000) and Bernoulli(0.1) samples drawn from lineitem, orders & customer, denoted by l, o & c respectively in Fig 6-1. The corresponding GUS sampling methods are denoted by Gl, Go & Gc respectively. The variance terms V(SUMloc) & V(COUNTloc) can be computed using the theory in [76] as follows.
Starting from the initial representation in Fig 6-1(a), where sampling is at the base, we
can use Proposition 4.6 on the joins to get a SOA-Equivalent plan with a single GUS on top.
This overall GUS, Gloc in Figure 6-1(c), can be used to compute the two variance terms by
applying Theorem 1. However, the existing results do not provide any mechanism to compute
the term Cov(SUMloc, COUNTloc). □
6.2 Covariance Parameters
The case of two estimates from a single query plan is still not general enough for the
covariance computation problem. Sampling based correlations will be present in any pair of
estimates that share samples of base relations. These estimates may come from different
queries, and the queries may be based on different sets of base relations. A more generalized
[Figure 6-1 shows (a) the plan with Gl, Go, Gc at the base and the two aggregates SUMloc and COUNTloc on top; (b) an SOA-equivalent plan with Glo above the join of l and o; (c) an SOA-equivalent plan with a single overall GUS Gloc on top.]

GUS        Parameters
Gx: x=l,c  ax = 0.1, bx,∅ = 0.01, bx,x = 0.1
Go         ao = 6.67 × 10⁻³, bo,∅ = 4.44 × 10⁻⁵, bo,o = 6.67 × 10⁻³
Glo        alo = 6.67 × 10⁻⁴, blo,∅ = 4.44 × 10⁻⁷, blo,l = 4.44 × 10⁻⁶, blo,o = 6.67 × 10⁻⁵, blo,lo = 6.67 × 10⁻⁴
Gloc       aloc = 6.67 × 10⁻⁵, bloc,∅ = 4.44 × 10⁻⁹, bloc,l = bloc,c = 4.44 × 10⁻⁸, bloc,o = 6.67 × 10⁻⁷, bloc,lo = bloc,oc = 6.67 × 10⁻⁶, bloc,lc = 4.44 × 10⁻⁷, bloc,loc = 6.67 × 10⁻⁵

Figure 6-1. Covariance in a single query plan
setting for the covariance problem is a pair of estimates coming from two simultaneous query plans that share a subset of base relation samples.
Example 9. To compute the error bound for the estimate f2 in (3–3), we need to compute
Cov(SUMloc, COUNTocn), among other terms. Note that the estimates SUMloc and COUNTocn are
derived from different queries which only share the samples o and c (drawn from orders and
customer respectively). The sample l (from lineitem) is exclusive to SUMloc, and the sample
n (from nation) is exclusive to COUNTocn. This case, when the pair of estimates arises from a
pair of distinct queries which only partially share some of the base relation samples, is much
more challenging than the single query case illustrated in Example 8. □
Previous theory for variance [76] depended crucially on the ability to find an analyzable representation for a query plan, in which the sampling operators at the bottom of the plan were permuted to the top into an overall sampling characterization for the estimate. The base lemma for variance could only be applied using the overall sampling in this analyzable representation. It is hard to characterize the covariance of estimates at the top of an arbitrary pair of query plans when the sampling operators are at the bottom. If we permute the sampling methods to the top of the respective query plans using the previous theory, we lose the correlation information between the estimates. Challenge #2: Identify an analyzable representation and parameters that characterize covariance.
Example 10. The parameters of the overall GUS method obtained from the previous theory characterize the sampling-based dependencies involved in the corresponding estimate. As illustrated in Example 9, two estimates can be based on different cross product spaces. For example, SUMloc is based on lineitem × orders × customer while COUNTocn is based on orders × customer × nation. In Figure 6-2, the overall GUS method corresponding to SUMloc is represented by Gloc and the overall GUS method corresponding to COUNTocn is represented by Gocn. However, Gloc and Gocn only capture the intra-query dependencies (enough for variance computation), but do not capture the inter-query dependencies needed for computing Cov(SUMloc, COUNTocn).
6.3 Notion of Equivalence
If we find an analyzable representation for a pair of query plans that allows us to encode the correlations between the estimates, we need a notion of equivalence to transform an arbitrary pair of query plans to this representation. This notion of equivalence is certainly not relational equivalence, since the sampling operators would remain at the bottom. While SOA-Equivalence allows sampling operators to permute upwards, it only applies to a single query plan. It is not clear what the notion of equivalence means for a simultaneous pair of query plans. The notion of SOA-Equivalence was extended to the notion of Sampling Dominance in [62]. This generality comes at the cost of decreased accuracy in error estimation, as we are only able to get upper bounds for the true variance. Note that the sign of any covariance term in the variance expression of a function of the multiple estimators (see (3–3)) is not necessarily positive. Even if one were to develop a notion of sampling dominance for covariances, it would not be useful in achieving the desired goal. The best path would be to generalize/extend the notion of SOA-equivalence (with exact
[Figure 6-2 shows a pair of query plans over lineitem, orders, customer and nation with selections σ1, σ2 and the aggregates SUMloc and COUNTocn, and the SOA-equivalent pair with the overall GUS methods Gloc and Gocn permuted to the top of the respective plans.]

Figure 6-2. Applying SOA-Equivalence for covariance computation
equality) for covariances. Challenge #3: Develop a notion of Equivalence between two
pairs of query plans.
From the above discussion it is clear that we need an overall characterization for
inter-estimate correlations and a notion of equivalence. The SOA-Algebra developed in
[76] was based on SOA-Equivalence for a single query. This algebra does not apply to the
generalized covariance problem. A new algebra corresponding to the new notion of equivalence
will have to be developed from scratch. Challenge #4: Develop an algebra for computing
the covariance parameters.
6.4 Support for Optimization
Another relevant challenge is efficient support of GROUP BY queries. If Ng is the number of groups in the result, then the number of covariances that need to be computed is (Ng choose 2). For example, the GROUP BY query in Fig 1-1 produces a group of 25 estimates (one corresponding to each nation), which results in (25 choose 2) = 300 covariance terms. This large number of covariance terms can create a significant computation overhead. For Ng groups in the sample, the number of covariance computations is O(Ng²). Each such computation requires us to compute 2ⁿ terms, where n is the number of base relations. There are often properties of the underlying data or other factors which can simplify some computations. The theory we aim to develop needs to have the ability to integrate these simplification techniques. Challenge #5: Develop support for optimizations.
6.5 Support for Sub-Sampling
Quite often, the base relation samples can be too large to fit in main memory, making the analysis unnecessarily expensive. Depending upon the memory constraints, these samples can be further sub-sampled so that the analysis can be done in memory, without much loss in the accuracy of the estimates. This approach was used in [76] in the context of variance computation. Such support is highly desirable for covariance analysis as well. It is not immediately clear whether and how sub-sampling can be used in covariance computation. Challenge #6: Develop support for computational efficiency.
CHAPTER 7
COVARIANCE BETWEEN MULTIPLE ESTIMATES: SOLUTION
In this chapter, we present the theory for a comprehensive solution for covariance computation that addresses the key challenges outlined in Chapter 6. Before proceeding, it is imperative that we consider all possible ways in which two query plans can interact. This is where we need to differentiate between a sampling method and a sample. For our purpose, a GUS sampling method is characterized by its a and b coefficients. There can be many ways to perform the sampling operation such that the a and b parameters are the same. Thus, if we apply two sampling processes to a set of relations and the GUS coefficients are the same in both cases, then both these processes correspond to the same sampling method. However, since these two processes will be conducted independently (in time or space) of one another, they provide different samples. If the same process is performed twice, independently of one another, there will still be two different samples. (The theory in [76] did not need to make such a distinction because it was designed to deal with a single query plan.) We now start by defining a novel notion of equivalence between two pairs of query plans.
7.1 SOA-COV Equivalence
The notion of equivalence for covariance has to be able to capture the interaction between
pairs of plans. This equivalence must be defined between two different pairs of plans as
opposed to individual query plans. We call this notion SOA-COV equivalence, and define it as
follows.
Definition 3 (SOA-COV Equivalence). Given (possibly randomized) expressions C(R1) and D(R2), such that they share a set of samples from S = R1 ∩ R2 (≠ ∅),

{C(R1), D(R2)} ⇐⇒SOA-COV {E(R1), F(R2)}

if for arbitrary SUM-like aggregates

Af(C(R1)) = ∑_{t1∈C(R1)} f(t1)  &  Ag(D(R2)) = ∑_{t2∈D(R2)} g(t2)

we have

E[Af(C(R1)) Ag(D(R2))] = E[Af(E(R1)) Ag(F(R2))].
Superficially, this definition looks significantly more restrictive than the SOA-equivalence in Definition 2, since it requires the covariance of all pairs of aggregates to agree on the pairwise equivalent plans. The following proposition shows that it is enough to agree on the interaction of individual tuples to obtain SOA-COV equivalence.
Proposition 7.1. Given the relational algebra expressions C(R1), D(R2), E(R1) & F(R2), we have:

{C(R1), D(R2)} ⇐⇒SOA-COV {E(R1), F(R2)}
⇔
P[t1 ∈ C(R1) & t2 ∈ D(R2)] = P[t1 ∈ E(R1) & t2 ∈ F(R2)]  ∀t1 ∈ R1, t2 ∈ R2.

Proof. The proof follows immediately by noting two facts. First, for any t̄1 ∈ R1, t̄2 ∈ R2,

P[t̄1 ∈ C(R1) & t̄2 ∈ D(R2)] = E[Af(C(R1)) Ag(D(R2))],

where f(t1) = 1{t1 = t̄1} and g(t2) = 1{t2 = t̄2}. Second, we have that

E[Af(C(R1)) Ag(D(R2))] = ∑_{t1∈R1, t2∈R2} P[t1 ∈ C(R1) & t2 ∈ D(R2)] f(t1) g(t2)

for any f and g.
This result allows us to prove that SOA-COV equivalence subsumes SOA-equivalence.

Proposition 7.2. {C(R), C(R)} ⇐⇒SOA-COV {D(R), D(R)} if and only if C(R) ⇐⇒SOA D(R).
[Figure 7-1 shows the setting for the base lemma: a shared GUS method Gcov(V) on V1 × V2 × · · · × Vn, joined on one side with U1 × U2 × · · · × Up to compute Af and on the other side with W1 × W2 × · · · × Wq to compute Ag.]

Figure 7-1. Setting for the base lemma
Proof. By Proposition 7.1, the left side is equivalent to:

P[t1 ∈ C(R) & t2 ∈ C(R)] = P[t1 ∈ D(R) & t2 ∈ D(R)]  ∀t1, t2 ∈ R.

By Proposition 4.3, C(R) ⇐⇒SOA D(R) if and only if the same condition holds. The result follows immediately from both observations.
This indicates that SOA-COV is sufficient to analyze both single queries and pairs of queries.
7.2 Base Lemma and Overall Sampling
In order to extend the base lemma for covariance analysis, our key insight is that the pair
of query plans should be represented in the configuration shown in Figure 7-1. We will call this
configuration Covariance-analyzable. The two query plans in this configuration use the same sample from a set of common relations, V1 × V2 × · · · × Vn. Each plan computes a SUM-like aggregate on its own cross product space, where each cross product space is composed of (a) the set of common relations and (b) the set of relations exclusive to that query plan, U1 × U2 × · · · × Up & W1 × W2 × · · · × Wq respectively. The lemma below provides a closed form expression
for the covariance between aggregates in such a case. We use the following notation.
Notation:
1. A GUS method (see Definition 1) on a set of relations V can be expressed in short as
GV .
2. A bar over a GUS expression GV(V) denotes that the sample derived from that GUS is shared, and V is the common set of relations. This notation will play a crucial role in identifying inter-query correlations.

3. A query plan that uses a sample from GV(V) and possibly a set of other non-shared relations R is denoted by C(GV(V), R).
Theorem 3 (GUS Covariance). Let V be a sample obtained from a GUS method GV(V) on a set of relations V = V1 × V2 × · · · × Vn. Suppose there are two query plans: one on the U × V space (with U = U1 × U2 × · · · × Up) and the other on the V × W space (with W = W1 × W2 × · · · × Wq), such that no sampling is done on U and W, and U, W are mutually exclusive. Furthermore, SUM-like aggregates Af & Ag are computed using these query plans as follows:

Af = (1/aV) ∑_{tu∈U, tv∈V} f(tu, tv),   Ag = (1/aV) ∑_{t′v∈V, tw∈W} g(t′v, tw).

Then Af, Ag are unbiased estimates of the population level aggregates ∑_{tu∈U, tv∈V} f(tu, tv) and ∑_{tv∈V, tw∈W} g(tv, tw), and

Cov(Af, Ag) = ∑_{S∈P(V)} (cS / aV²) yS − y∅,   (7–1)

with cS = ∑_{T∈P(S)} (−1)^{|T|+|S|} bT,V,

yS = ∑_{ti∈Vi | i∈S} ( ∑_{tj∈Vj | j∈S^c} F(ti, tj) ) ( ∑_{tj∈Vj | j∈S^c} G(ti, tj) ),

F(tv) = ∑_{tu∈U} f(tu, tv),   G(tv) = ∑_{tw∈W} g(tv, tw).
Proof. Since U × GV(V) is SOA-equivalent to a GUS on U × V with a = 1 × aV = aV (Proposition 4.6), it follows that Af is unbiased for ∑_{tu∈U, tv∈V} f(tu, tv). The fact that Ag is unbiased for ∑_{tv∈V, tw∈W} g(tv, tw) follows similarly. Let t1 be (tu, tv) & t2 be (t′v, tw). Then

aV² E[Af Ag] = ∑_{tu} ∑_{tv} ∑_{t′v} ∑_{tw} Pr(tv, t′v ∈ V) f(t1) g(t2)
  = ∑_{T∈P(V)} bT,V ∑_{tv,t′v : tv∩t′v=T} ( ∑_{tu} f(tu, tv) ) ( ∑_{tw} g(t′v, tw) )
  = ∑_{T∈P(V)} bT,V ∑_{tv,t′v : tv∩t′v=T} F(tv) G(t′v).   (7–2)

A straightforward application of the inclusion-exclusion principle gives

∑_{S∈P(V)} cS yS = ∑_{S∈P(V)} yS ∑_{T∈P(S)} (−1)^{|T|+|S|} bT,V
  = ∑_{S∈P(V)} ∑_{tv,t′v : S⊆tv∩t′v} F(tv) G(t′v) ∑_{T∈P(S)} (−1)^{|T|+|S|} bT,V
  = ∑_{T∈P(V)} bT,V ∑_{S⊇T} (−1)^{|T|+|S|} ∑_{tv,t′v : S⊆tv∩t′v} F(tv) G(t′v)
  = ∑_{T∈P(V)} bT,V ∑_{tv,t′v : tv∩t′v=T} F(tv) G(t′v).

The result follows by (7–2) and the definition of y∅.
Note that the form of the covariance in (7–1) is structurally the same as the variance
formula in Theorem 1 and inherits all the nice computational advantages that are associated
with it.
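To make the computational structure explicit, here is a sketch (same illustrative Python style) that evaluates (7–1): the cS coefficients are obtained from the bT,V parameters of GV by inclusion-exclusion, and are then combined with the yS terms. The dictionaries b and y are assumed given, keyed by frozensets of the shared relation names.

from itertools import chain, combinations

def all_subsets(s):
    s = list(s)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

def gus_covariance(a_V, b, y, shared):
    # Theorem 3: Cov(Af, Ag) = sum over S in P(V) of (c_S / a_V^2) * y_S - y_empty,
    # with c_S = sum over T in P(S) of (-1)^(|T|+|S|) * b_{T,V}
    cov = 0.0
    for S in all_subsets(shared):
        c_S = sum((-1) ** (len(T) + len(S)) * b[T] for T in all_subsets(S))
        cov += c_S * y[S] / (a_V * a_V)
    return cov - y[frozenset()]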
The case of a single query plan with multiple estimates is a boundary case of the above setting, where the sets of relations U & W are absent and samples are shared from all the participating base relations V. By Proposition 7.2, it follows that GV(V) for this case is the same as the overall GUS for variance computation.
Example 11. Since the estimates SUMloc and COUNTloc in Example 8 share the same query plan, it follows that the overall GUS for covariance is the same as the overall GUS for variance, Gloc (from Figure 6-1(c)). Cov(SUMloc, COUNTloc) can be estimated using Theorem 3. □
7.3 SOA-COV Algebra
For a general pair of query plans, if we can show that these plans are SOA-COV Equivalent to a pair of plans as in Theorem 3, then the same result will immediately provide the desired covariance between aggregates. The key is to understand how to establish such a SOA-COV equivalence. In this section, we develop a SOA-COV algebra that enables the transformation of any general pair of query plans to a SOA-COV Equivalent pair of the form in Theorem 3. It is assumed throughout this section that sampling methods on two mutually exclusive sets of relations are performed independently.
[Figure 7-2 shows the pair {GU(U) ⋈ GV(V), GV(V) ⋈ GW(W)} on the left and the SOA-COV equivalent pair {U ⋈ Gcov(V), Gcov(V) ⋈ W} on the right, with the aggregates Af and Ag on top.]

Figure 7-2. SOA-COV for Joins
Proposition 7.3 (SOA-COV for Joins). Given expressions GU(U), GW(W) and GV(V),

{GU(U) ⋈ GV(V), GV(V) ⋈ GW(W)} ⇐⇒SOA-COV {U ⋈ Gcov(V), Gcov(V) ⋈ W}

where acov = aU aW aV and bT,cov = aU aW bT,V.
Proof. Let U, V, W denote random samples obtained from GU(U), GV(V), GW(W) respectively (see Figure 7-2). For arbitrary (tu, tv) ∈ U ⋈ V and (t′v, tw) ∈ V ⋈ W, it follows that

P((tu, tv) ∈ U × V, (t′v, tw) ∈ V × W) = P(tu ∈ U; tv, t′v ∈ V; tw ∈ W)
  = P(tu ∈ U) × P(tv, t′v ∈ V) × P(tw ∈ W)
  = aU aW b_{tv∩t′v,V}.

The result now follows immediately from Proposition 7.1.
[Figure 7-3 shows the pair {σ1/G1(GV(V)), σ2/G2(GV(V))} on the left and the SOA-COV equivalent pair {Gcov(V), Gcov(V)} on the right.]

Figure 7-3. SOA-COV for GUS/Selection
Proposition 7.4 (SOA-COV for Selections). Given expressions σi(GV(V)), Gi(GV(V)), i = 1, 2,

{σ1/G1(GV(V)), σ2/G2(GV(V))} ⇐⇒SOA-COV {Gcov(V), Gcov(V)}

where bT,cov = A1A2 bT,V and acov = A1A2 aV, with Ai = 1 if σi and Ai = ai if Gi.
Proof. We will provide a proof for the case with G1 and G2. Let V denote the sample obtained by using GV(V). Let V1, V2 be the samples obtained after applying G1, G2 to V respectively. We assume here that these samples are obtained independently conditional on V. For arbitrary tv, t′v ∈ V:

P(tv ∈ V1, t′v ∈ V2) = P(tv ∈ V1, t′v ∈ V2, tv ∈ V, t′v ∈ V)
  = P(tv ∈ V1, t′v ∈ V2 | tv, t′v ∈ V) × P(tv, t′v ∈ V)
  = P(tv ∈ V1 | tv, t′v ∈ V) × P(t′v ∈ V2 | tv, t′v ∈ V) × b_{tv∩t′v,V}
  = a1 a2 b_{tv∩t′v,V}.

The proof for the other cases follows similarly.
[Figure 7-4 shows the pair {σ1/G1(GV(V)), GU(U) ⋈ GV(V)} on the left and the SOA-COV equivalent pair {Gcov(V), U ⋈ Gcov(V)} on the right.]

Figure 7-4. SOA-COV for Join/GUS interaction
Proposition 7.5 (SOA-COV for Join/GUS Interaction). Given expressions GU(U), GV(V), σ1(GV(V)), G1(GV(V)),

{σ1/G1(GV(V)), GU(U) ⋈ GV(V)} ⇐⇒SOA-COV {Gcov(V), U ⋈ Gcov(V)}

where bT,cov = A1 aU bT,V and acov = A1 aU aV, with A1 = 1 if σ1 and A1 = a1 if G1.
Proof. We will provide a proof for the case with G1. Let U, V denote the samples obtained by using GU(U), GV(V) respectively. Let V1 be the sample obtained after applying G1 to V. For arbitrary (tu, tv) ∈ U ⋈ V and t′v ∈ V, it follows that

P((tu, tv) ∈ U ⋈ V, t′v ∈ V1) = P(tu ∈ U, tv ∈ V, t′v ∈ V1, t′v ∈ V)
  = P(tu ∈ U) × P(t′v ∈ V1 | tv ∈ V, t′v ∈ V) × P(tv ∈ V, t′v ∈ V)
  = a1 aU b_{tv∩t′v,V}.

The proof for the σ1 case follows similarly.
Propositions 7.4 and 7.5 together imply the following result.
Corollary 1. Assuming all base relation samples are obtained using GUS methods, any
selection operator in a pair of query plans does not change the parameters of the overall GUS
method Gcov .
[Figure 7-5 shows the pair {C(U) ⋈ D(V), D(V) ⋈ E(W)} on the left and the SOA-COV equivalent pair {C∗(U) ⋈ D∗(V), D∗(V) ⋈ E∗(W)} on the right.]

Figure 7-5. SOA composition
Proposition 7.6 (SOA Composition). Given randomized expressions C(U), C∗(U), D(V), D∗(V), E(W) and E∗(W), such that C(U) ⇐⇒SOA C∗(U), D(V) ⇐⇒SOA D∗(V) and E(W) ⇐⇒SOA E∗(W), then

{C(U) ⋈ D(V), D(V) ⋈ E(W)} ⇐⇒SOA-COV {C∗(U) ⋈ D∗(V), D∗(V) ⋈ E∗(W)}.
Proof. For any (tu, tv) ∈ U ⋈ V and (t′v, tw) ∈ V ⋈ W, it follows by the SOA-Equivalence of C, C∗, of D, D∗, and of E, E∗ (see Figure 7-5) and Proposition 4.3 that

P((tu, tv) ∈ C(U) ⋈ D(V), (t′v, tw) ∈ D(V) ⋈ E(W))
  = P(tu ∈ C(U)) × P(tw ∈ E(W)) × P(tv, t′v ∈ D(V))
  = P(tu ∈ C∗(U)) × P(tw ∈ E∗(W)) × P(tv, t′v ∈ D∗(V))
  = P((tu, tv) ∈ C∗(U) ⋈ D∗(V), (t′v, tw) ∈ D∗(V) ⋈ E∗(W)).

The result now follows from Proposition 7.1.
Proposition 7.7 (SOA-COV Composition). Let P and Q be two sets of relations (with possibly some common relations). Let R and S be two sets of relations (with possibly some common relations) such that P ∪ Q and R ∪ S are mutually exclusive. Given expressions C(P), D(Q), C∗(P), D∗(Q), E(R), F(S), E∗(R), F∗(S) such that

{C(P), D(Q)} ⇐⇒SOA-COV {C∗(P), D∗(Q)}
{E(R), F(S)} ⇐⇒SOA-COV {E∗(R), F∗(S)}

we have

{C(P) ⋈ E(R), D(Q) ⋈ F(S)} ⇐⇒SOA-COV {C∗(P) ⋈ E∗(R), D∗(Q) ⋈ F∗(S)}.
Proof. For arbitrary (tp, tr) ∈ P ⋈ R and (tq, ts) ∈ Q ⋈ S, it follows that

P((tp, tr) ∈ C(P) ⋈ E(R); (tq, ts) ∈ D(Q) ⋈ F(S))
  = P(tp ∈ C(P), tq ∈ D(Q)) × P(tr ∈ E(R), ts ∈ F(S))
  = P(tp ∈ C∗(P), tq ∈ D∗(Q)) × P(tr ∈ E∗(R), ts ∈ F∗(S))   (a)
  = P((tp, tr) ∈ C∗(P) ⋈ E∗(R); (tq, ts) ∈ D∗(Q) ⋈ F∗(S)).

Here (a) follows by the two SOA-COV equivalences assumed in the statement of this proposition. The result follows immediately by Proposition 7.1.
Proposition 7.8 (SOA-COV for relational algebra equivalence). Let P, Q and R be mutually exclusive sets of relations, and let the randomized expressions C(P × Q), C∗(P × Q), D(Q × R), D∗(Q × R) be obtained as follows. First, independent samples are drawn from all the base relations in P, Q & R. Then, randomized expressions C(P × Q) and C∗(P × Q) are obtained by joining the base relation samples for P & Q in two ways which are relational algebra equivalent. Similarly, randomized expressions D(Q × R) and D∗(Q × R) are obtained by joining the base relation samples for Q & R in two ways which are relational algebra equivalent. Then

{C(P × Q), D(Q × R)} ⇐⇒SOA-COV {C∗(P × Q), D∗(Q × R)}.

Proof. By construction, note that

t1 ∈ C(P × Q), t2 ∈ D(Q × R) ⇔ t1 ∈ C∗(P × Q), t2 ∈ D∗(Q × R).

The result follows immediately using Proposition 7.1.
Applying SOA-COV Algebra to analyze covariance. The results in this chapter (summarized in Tables 7-1 and 7-2) constitute the SOA-COV Algebra based on SOA-COV Equivalence. The goal of these algebraic rules is to transform a given pair of query plans to a Covariance-analyzable pair of plans as in Fig. 7-1.
Table 7-1. SOA-COV rules

Name (reference)                  Transformation                                                          acov      bT,cov
SOA-COV for Joins                 {GU(U) ⋈ GV(V), GV(V) ⋈ GW(W)} ⇐⇒SOA-COV {U ⋈ Gcov(V), Gcov(V) ⋈ W}    aU aW aV  aU aW bT,V
  (Prop. 7.3, Fig. 7-2)
SOA-COV for Selections            {σ1/G1(GV(V)), σ2/G2(GV(V))} ⇐⇒SOA-COV {Gcov(V), Gcov(V)}              A1A2 aV   A1A2 bT,V
  (Prop. 7.4, Fig. 7-3)                                                                                   (Ai = 1 if σi, Ai = ai if Gi)
SOA-COV for Join-GUS interaction  {σ1/G1(GV(V)), GU(U) ⋈ GV(V)} ⇐⇒SOA-COV {Gcov(V), U ⋈ Gcov(V)}         A1 aU aV  A1 aU bT,V
  (Prop. 7.5, Fig. 7-4)                                                                                   (A1 = 1 if σ1, A1 = a1 if G1)

Table 7-2. SOA-COV composition

Name (reference)                  If                                                                      Then
SOA composition (Prop. 7.6)       C(U) ⇐⇒SOA C∗(U), D(V) ⇐⇒SOA D∗(V), E(W) ⇐⇒SOA E∗(W)                   {C(U) ⋈ D(V), D(V) ⋈ E(W)} ⇐⇒SOA-COV {C∗(U) ⋈ D∗(V), D∗(V) ⋈ E∗(W)}
SOA-COV composition (Prop. 7.7)   {C(P), D(Q)} ⇐⇒SOA-COV {C∗(P), D∗(Q)},                                  {C(P) ⋈ E(R), D(Q) ⋈ F(S)} ⇐⇒SOA-COV {C∗(P) ⋈ E∗(R), D∗(Q) ⋈ F∗(S)}
                                  {E(R), F(S)} ⇐⇒SOA-COV {E∗(R), F∗(S)}
For a single query plan, the Covariance-analyzable configuration will have an overall GUS method permuted to the top of the query plan (Prop. 7.2). In the general setting, where the estimates come from different query plans that share a subset of base relation samples, the Covariance-analyzable configuration can be derived by using the SOA-Algebra rules in [76] to permute the sampling operators in the exclusive cross product spaces U & W and the shared cross product space V to the top of their respective spaces (Prop. 7.6), and then applying the relevant SOA-COV rules in Table 7-1 to characterize the overall GUS for covariance Gcov (see Fig. 7-6). We can then apply the base lemma for covariance (Theorem 3) to Gcov to compute the covariance between the relevant estimates. Note that these transformations are only for the purpose of analysis and do not interfere with the actual query execution.
[Figure 7-6 shows the general derivation: the pair with GU, GV, GW at the base is transformed into the pair sharing a single Gcov(V) over the common relations V1 × V2 × · · · × Vn.]

Figure 7-6. Deriving Covariance-analyzable plans

We will now illustrate how the SOA-COV Algebra can be used to derive Covariance-analyzable plans and the overall GUS Gcov. In Example 12, we analyze Cov(SUMloc, COUNTocn) for f2 in Example 1, where the queries for SUMloc and COUNTocn apply
different selection criteria on the joins of shared base relation samples. Example 13 analyzes a
case with multiple joins and an intermediate sampling in one of the queries. Example 15 shows
an application of SOA-COV composition (Prop. 7.7).
Example 12. The pair of query plans in Example 1 (see Figure 7-7(a)) are:

Planloc = Gl(l) ⋈ Go(o) ⋈θ1 Gc(c) = Gl(l) ⋈ σ1(Go(o) ⋈ Gc(c))
Planocn = Go(o) ⋈θ2 (Gc(c) ⋈ n) = σ2(Go(o) ⋈ Gc(c)) ⋈ n

It follows that

{Gl(l) ⋈ σ1(Go(o) ⋈ Gc(c)), σ2(Go(o) ⋈ Gc(c)) ⋈ n}
⇐⇒SOA-COV {Gl(l) ⋈ σ1(Goc(oc)), σ2(Goc(oc)) ⋈ n}   [Prop. 4.6]
⇐⇒SOA-COV {Gl(l) ⋈ G′oc(oc), G′oc(oc) ⋈ n}   [Prop. 7.4]
⇐⇒SOA-COV {l ⋈ Gcov(oc), Gcov(oc) ⋈ n}   [Prop. 7.3].
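The chain of rewrites above is mechanical enough to automate. In the same illustrative Python style (reusing gus_bernoulli, gus_wor and gus_join from the Chapter 4 sketches, since Prop. 4.6 and the Table 7-1 rules share the same arithmetic), the Gcov parameters of Figure 7-7 can be obtained as follows.

def gus_cov_scale(g_shared, factor):
    # Props. 7.3-7.5: selections contribute a factor of 1, exclusive samples and
    # GUS methods contribute their a; in all cases the Table 7-1 rules give
    # a_cov = factor * a_V and b_{T,cov} = factor * b_{T,V}
    return {"a": factor * g_shared["a"],
            "b": {T: factor * bT for T, bT in g_shared["b"].items()}}

g_oc = gus_join(gus_wor(1000, 150000, "o"), gus_bernoulli(0.1, "c"),
                ["o"], ["c"])           # Prop. 4.6: overall GUS on o x c
g_cov = gus_cov_scale(g_oc, 0.1 * 1.0)  # Prop. 7.3: a_U = a_l = 0.1, a_W = 1 (nation)
# a_cov = 6.67e-5, b_cov,empty = 4.44e-8, b_cov,c = 4.44e-7,
# b_cov,o = 6.67e-6, b_cov,oc = 6.67e-5  -- matching the table in Figure 7-7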
[Figure 7-7 shows the transformation: (a) the original pair of plans over l, o, c, n with selections σ1, σ2; (b) Goc above the join of o and c; (c) σ1, σ2 absorbed into G′oc; (d) the overall Gcov shared by both plans.]

GUS method  Parameters
Gl          al = 0.1, bl,∅ = 0.01, bl,l = 0.1
Go          ao = 6.67 × 10⁻³, bo,∅ = 4.44 × 10⁻⁵, bo,o = 6.67 × 10⁻³
Gc          ac = 0.1, bc,∅ = 0.01, bc,c = 0.1
Goc         aoc = 6.67 × 10⁻⁴, boc,∅ = 4.44 × 10⁻⁷, boc,c = 4.44 × 10⁻⁶, boc,o = 6.67 × 10⁻⁵, boc,oc = 6.67 × 10⁻⁴
G′oc        same parameters as Goc
Gcov        acov = 6.67 × 10⁻⁵, bcov,∅ = 4.44 × 10⁻⁸, bcov,c = 4.44 × 10⁻⁷, bcov,o = 6.67 × 10⁻⁶, bcov,oc = 6.67 × 10⁻⁵

Figure 7-7. Applying SOA-COV Equivalence for covariance computation in Example 12
Example 13. Starting with the pair of query plans in Figure 7-8(a), we can rewrite the query plans so that the exclusive and shared subtrees emerge, as shown in Figure 7-8(b).
[Figure 7-8 shows the transformation: (a) the original pair of plans over p, ps, l, o, c, n, with the intermediate sampling G1 in the second plan; (b) Gpps, Glo and Gcn above the pairwise joins; (c) Gppslocn over the shared l ⋈ o; (d) the overall Gcov shared by both plans.]

GUS method       Parameters
Gx: x=p,ps,c,n   ax = 0.1, bx,∅ = 0.01, bx,x = 0.1
Gx: x=l,o        ax = 0.1, bx,∅ = 0.01, bx,x = 0.1
G1               a1 = 0.5, b1,∅ = 0.25, b1,1 = 0.5
Gpps             apps = 0.01, bpps,∅ = 10⁻⁴, bpps,p = bpps,ps = 10⁻³, bpps,pps = 10⁻²
Gcn              acn = 0.01, bcn,∅ = 10⁻⁴, bcn,c = bcn,n = 10⁻³, bcn,cn = 10⁻²
Glo              alo = 0.01, blo,∅ = 10⁻⁴, blo,l = blo,o = 10⁻³, blo,lo = 10⁻²
Gppslocn         appslocn = 10⁻⁶, bppslocn,∅ = 10⁻⁸, bppslocn,l = bppslocn,o = 10⁻⁷, bppslocn,lo = 10⁻⁶
Gcov             acov = 0.5 × 10⁻⁶, bcov,∅ = 0.5 × 10⁻⁸, bcov,l = bcov,o = 10⁻⁷, bcov,lo = 10⁻⁶

Figure 7-8. Applying SOA-COV Equivalence for covariance computation in Example 13
It follows that

{Gp(p) ⋈ Gps(ps) ⋈ Gl(l) ⋈ Go(o), G1(Gl(l) ⋈ Go(o) ⋈ Gc(c) ⋈ n)}
⇐⇒SOA-COV {Gpps(p ⋈ ps) ⋈ Glo(l ⋈ o), G1(Glo(l ⋈ o) ⋈ Gcn(c ⋈ n))}   [Prop. 4.6]
⇐⇒SOA-COV {p ⋈ ps ⋈ Gppslocn(l ⋈ o), G1(Gppslocn(l ⋈ o) ⋈ c ⋈ n)}   [Prop. 7.4]
⇐⇒SOA-COV {p ⋈ ps ⋈ Gcov(l ⋈ o), Gcov(l ⋈ o) ⋈ c ⋈ n}   [Prop. 7.3].
Example 14. Figure 7-9(a) shows a pair query plans on shared Bernoulli(0.1) samples
of l & o. Each plan joins the samples with selection conditions σ1 & σ2, and derives
SUM-like estimates Af & Ag respectively. Note that Ag is derived from another sample
of the intermediate results of the join in the second plan. This pair of query plans can be
expressed as follows (see Figure 7-9(b)).
\[
\mathrm{Plan}_1 = \mathcal{G}_l(l) \bowtie_{\sigma_1} \mathcal{G}_o(o) = \sigma_1(\mathcal{G}_{lo}(lo)), \qquad
\mathrm{Plan}_2 = \mathcal{G}_1(\mathcal{G}_l(l) \bowtie_{\sigma_2} \mathcal{G}_o(o)) = \mathcal{G}^*_1(\mathcal{G}_{lo}(lo)).
\]
Here $\mathcal{G}_{lo}(lo) = \mathcal{G}_l(l) \times \mathcal{G}_o(o)$ and $\mathcal{G}^*_1 = \mathcal{G}_1 \circ \sigma_2$ is a GUS method. By Prop. 7.4, this pair is SOA-COV equivalent to a pair driven by a single GUS method $\mathcal{G}_{cov}(lo)$, where
\[
a_{cov} = a^*_1 a_l a_o, \quad b_{cov,\emptyset} = a^*_1 b_{l,\emptyset} b_{o,\emptyset}, \quad b_{cov,l} = a^*_1 a_l b_{o,\emptyset}, \quad b_{cov,o} = a^*_1 b_{l,\emptyset} a_o, \quad b_{cov,lo} = a^*_1 a_l a_o
\]
(see Figure 7-9(c)). All the GUS parameters are given in the table of Figure 7-9. We can now plug in the parameters for $\mathcal{G}_{cov}$ in Theorem 3 to obtain the covariance of $A_f$ & $A_g$.
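To make this parameter arithmetic concrete, the following minimal Python sketch (illustrative only; the dissertation's analysis code is written in R, and the function name and dictionary layout here are assumptions of the sketch) evaluates the Prop. 7.4 formulas above for this example.

# Sketch: G_cov parameters for Example 14 via the formulas above.
def gcov_params(a1_star, gl, go):
    """gl, go: dicts with keys 'a' and 'b_empty' for G_l and G_o."""
    return {
        "a_cov":   a1_star * gl["a"] * go["a"],
        "b_empty": a1_star * gl["b_empty"] * go["b_empty"],
        "b_l":     a1_star * gl["a"] * go["b_empty"],
        "b_o":     a1_star * gl["b_empty"] * go["a"],
        "b_lo":    a1_star * gl["a"] * go["a"],
    }

# Bernoulli(0.1) samples of l and o, and a_1* = 0.04 as in Figure 7-9:
bern = {"a": 0.1, "b_empty": 0.01}
print(gcov_params(0.04, bern, bern))
# a_cov = 4e-4, b_empty = 4e-6, b_l = b_o = 4e-5, b_lo = 4e-4, matching the table.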
Example 15. Figure 7-10(a) shows a pair of query plans, $\mathrm{Plan}_1$ & $\mathrm{Plan}_2$, that share samples of lineitem (Bernoulli 0.1) & orders (SWOR 1000 out of 150,000 tuples). $\mathrm{Plan}_1$ joins these samples to Bernoulli 0.1 samples of partsupp & part to compute $A_f$. $\mathrm{Plan}_2$ joins these samples to Bernoulli 0.1 samples of customer & nation to compute $A_g$. Hence
\[
\mathrm{Plan}_1 = (\mathcal{G}_p(p) \bowtie \mathcal{G}_l(l)) \bowtie (\mathcal{G}_{ps}(ps) \bowtie \mathcal{G}_o(o)), \qquad
\mathrm{Plan}_2 = (\mathcal{G}_l(l) \bowtie \mathcal{G}_c(c)) \bowtie (\mathcal{G}_o(o) \bowtie \mathcal{G}_n(n)).
\]
[Figure 7-9(a)-(c): plan-tree diagrams for Example 14; (a) the original pair with per-join selections $\sigma_1$, $\sigma_2$ and the intermediate sampling $\mathcal{G}_1$, (b) the rewritten pair with the selections pulled above a single join of the samples, and (c) the pair driven by a single GUS method over $lo$.]

GUS method                 Parameters
$\mathcal{G}_x$, $x = l, o$       $a_x = 0.1$, $b_{x,\emptyset} = 0.01$, $b_{x,x} = 0.1$
$\mathcal{G}_{(a_1,b_1)}(lo)$     $a_1 = 0.04$, $b_{1,\emptyset} = 16\times10^{-4}$, $b_{1,l} = b_{1,o} = 8\times10^{-3}$, $b_{1,lo} = 0.04$
$\mathcal{G}_{cov}(lo)$           $a_{cov} = 4\times10^{-4}$, $b_{cov,\emptyset} = 4\times10^{-6}$, $b_{cov,l} = b_{cov,o} = 4\times10^{-5}$, $b_{cov,lo} = 4\times10^{-4}$

Figure 7-9. Applying SOA-COV equivalence for covariance computation in Example 14
Note by Prop. 7.3 that
\[
\{\mathcal{G}_p(p) \bowtie \mathcal{G}_l(l),\; \mathcal{G}_l(l) \bowtie \mathcal{G}_c(c)\} \overset{soa}{\underset{cov}{\Longleftrightarrow}} \{p \bowtie \mathcal{G}_{cov1}(l),\; \mathcal{G}_{cov1}(l) \bowtie c\}
\]
and
\[
\{\mathcal{G}_{ps}(ps) \bowtie \mathcal{G}_o(o),\; \mathcal{G}_o(o) \bowtie \mathcal{G}_n(n)\} \overset{soa}{\underset{cov}{\Longleftrightarrow}} \{ps \bowtie \mathcal{G}_{cov2}(o),\; \mathcal{G}_{cov2}(o) \bowtie n\},
\]
where the parameters of $\mathcal{G}_{cov1}(l)$ and $\mathcal{G}_{cov2}(o)$ are as specified in Prop. 7.3. It follows by Prop. 7.7 that
\[
\{\mathrm{Plan}_1, \mathrm{Plan}_2\} \overset{soa}{\underset{cov}{\Longleftrightarrow}} \{\mathrm{Plan}_3, \mathrm{Plan}_4\} \qquad (7\text{--}3)
\]
[Figure 7-10(a)-(d): plan-tree diagrams for Example 15; (a) the original pair over samples of p, ps, l, o, c, n, (b) after Prop. 7.3 with $\mathcal{G}_{cov1}$ on l and $\mathcal{G}_{cov2}$ on o, (c) after reassociating the joins, and (d) with a single combined $\mathcal{G}_{cov}$.]

GUS method                         Parameters
$\mathcal{G}_x$, $x = p, ps, l, c, n$    $a_x = 0.1$, $b_{x,\emptyset} = 0.01$, $b_{x,x} = 0.1$
$\mathcal{G}_{cov1}$                    $a_{cov1} = 10^{-3}$, $b_{cov1,\emptyset} = 10^{-4}$, $b_{cov1,l} = 10^{-3}$
$\mathcal{G}_{cov2}$                    $a_{cov2} = 6.67\times10^{-5}$, $b_{cov2,\emptyset} = 4.44\times10^{-7}$, $b_{cov2,o} = 6.67\times10^{-5}$
$\mathcal{G}_{cov}$                     $a_{cov} = 6.67\times10^{-8}$, $b_{cov,\emptyset} = 4.44\times10^{-11}$, $b_{cov,l} = 4.44\times10^{-10}$, $b_{cov,o} = 6.67\times10^{-9}$, $b_{cov,lo} = 6.67\times10^{-8}$

Figure 7-10. Applying SOA-COV equivalence for covariance computation in Example 15
where
\[
\mathrm{Plan}_3 = (p \bowtie \mathcal{G}_{cov1}(l)) \bowtie (ps \bowtie \mathcal{G}_{cov2}(o)), \qquad
\mathrm{Plan}_4 = (\mathcal{G}_{cov1}(l) \bowtie c) \bowtie (\mathcal{G}_{cov2}(o) \bowtie n).
\]
By associativity of joins and (7--3) it follows that
\[
\{\mathrm{Plan}_1, \mathrm{Plan}_2\} \overset{soa}{\underset{cov}{\Longleftrightarrow}} \{\mathrm{Plan}_5, \mathrm{Plan}_6\},
\]
where
\[
\mathrm{Plan}_5 = (p \bowtie ps) \bowtie (\mathcal{G}_{cov1}(l) \bowtie \mathcal{G}_{cov2}(o)), \qquad
\mathrm{Plan}_6 = (\mathcal{G}_{cov1}(l) \bowtie \mathcal{G}_{cov2}(o)) \bowtie (c \bowtie n)
\]
are in the form required by Theorem 3. Hence Theorem 3 can be used to obtain the covariance of $A_f$ and $A_g$.
CHAPTER 8
EFFICIENT COVARIANCE ANALYSIS: IMPLEMENTATION & EXPERIMENTS
The base lemma for covariance computation (Lemma 3) comprises two kinds of terms: terms that characterize the effects of sampling ($c_S$) and terms that characterize the properties of the data ($y_S$). The $c_S$ terms are derived from the $\mathcal{G}_{cov}$ parameters obtained by applying SOA-COV transformations as shown in Section 7.3. Recall that the $y_S$ terms are defined as
\[
y_S = \sum_{t_i \in V_i \,|\, i \in S} \left( \sum_{t_j \in V_j \,|\, j \in S^c} F(t_i, t_j) \right) \left( \sum_{t_j \in V_j \,|\, j \in S^c} G(t_i, t_j) \right),
\]
where
\[
F(t_v) = \sum_{t_u \in U} f(t_u, t_v), \qquad G(t_v) = \sum_{t_w \in W} g(t_v, t_w).
\]
Computing these $y_S$ terms essentially requires a group-by over the lineage of the shared base relations. For the pair of estimates in Example 1, the equivalent SQL expressions for the $y_S$ terms are:
CREATE TABLE unagg_loc AS
SELECT o_orderkey AS f_o, c_custkey AS f_c, l_discount*(1.0-l_tax) AS f
FROM lineitem TABLESAMPLE(10 PERCENT),
     orders TABLESAMPLE(10 PERCENT),
     customer TABLESAMPLE(20 PERCENT)
WHERE l_orderkey = o_orderkey AND
      l_extendedprice > 100.0;

CREATE TABLE unagg_ocn AS
SELECT o_orderkey AS g_o, c_custkey AS g_c, SUM(1) AS g
FROM orders TABLESAMPLE(10 PERCENT),
     customer TABLESAMPLE(20 PERCENT),
     nation
GROUP BY o_orderkey, c_custkey;

CREATE TABLE unagg AS
SELECT unagg_loc.f_o, unagg_loc.f_c, unagg_loc.f, unagg_ocn.g
FROM unagg_loc JOIN unagg_ocn
ON unagg_loc.f_o = unagg_ocn.g_o AND unagg_loc.f_c = unagg_ocn.g_c;

SELECT SUM(f)*SUM(g) AS y_empty FROM unagg;

SELECT SUM(F*G) AS y_o FROM ( SELECT SUM(f) AS F, SUM(g) AS G
                              FROM unagg GROUP BY f_o ) AS t;

SELECT SUM(F*G) AS y_c FROM ( SELECT SUM(f) AS F, SUM(g) AS G
                              FROM unagg GROUP BY f_c ) AS t;

SELECT SUM(f*g) AS y_oc FROM unagg;
Computing the $y_S$ terms over the base relations is often more expensive than computing the exact desired aggregate over the data. In [76] a computationally feasible solution is discussed, where unbiased estimates $Y_S$ of $y_S$ are constructed as
\[
Y_S = \frac{1}{c_{S,\emptyset}} \left( y_S - \sum_{T \subseteq S^c,\ T \neq \emptyset} c_{S,T}\, Y_{S \cup T} \right),
\quad \text{where} \quad
c_{S,T} = \sum_{U \subseteq T} (-1)^{|U| + |S^c|}\, b_{S \cup U}.
\]
Here the $y_S$ on the right-hand side is obtained by replacing the entire base relations by the base relation samples in the formula for computing $y_S$. Sometimes, however, the size of the sample can be overwhelming for the computation of all the $y_S$ terms, and one can use a further sub-sample of the provided sample for computing these terms (see [76, Section 6.2]). The discussion in [76] is in the context of computing the variance of a single estimator. When two estimates are derived from query plans sharing a common subset of base relation samples, implementing sub-sampling for covariance computation presents additional challenges. We address these challenges in Section 8.2. Also, in the GROUP-BY setting where we have multiple estimates, the number of $y_S$ terms can be exponential. In Section 8.1, we show how foreign keys can be used to simplify this computation significantly.
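For concreteness, the recursion above can be sketched in Python as follows. This is an illustrative fragment, not the dissertation's implementation: the sample-based terms $\hat{y}_S$ and the coefficients $c_{S,T}$ (the latter computed per the displayed formula) are assumed to be supplied by the caller.

from itertools import combinations

def nonempty_subsets(xs):
    xs = list(xs)
    return [frozenset(c) for r in range(1, len(xs) + 1)
            for c in combinations(xs, r)]

def Y(S, V, yhat, c, memo=None):
    """Unbiased estimate of y_S per the recursion above.

    S, V: frozensets of relation indices; yhat: dict mapping each subset
    to its sample-based y value; c(S, T): the coefficient c_{S,T}.
    The recursion bottoms out at S = V, where Y_V = yhat_V / c_{V,empty}."""
    memo = {} if memo is None else memo
    S = frozenset(S)
    if S not in memo:
        correction = sum(c(S, T) * Y(S | T, V, yhat, c, memo)
                         for T in nonempty_subsets(V - S))
        memo[S] = (yhat[S] - correction) / c(S, frozenset())
    return memo[S]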
8.1 Leveraging Foreign Keys
The number of $y_S$ terms to be computed in Theorems 1 & 3 is exponential. If the number of common base relations in the data is $n_R$ & the number of groups is $n_g$, then there will be $\binom{n_g}{2}$ covariance computation problems, each requiring the computation of $2^{n_R}$ $y_S$ terms. Each $y_S$ computation requires maintaining a separate hash table. This is indeed a harsh scenario for large samples. The proposition below shows that the presence of foreign keys implies that many of the $y_S$ terms are identical, thereby significantly reducing the computational burden.
Proposition 8.1. Let $V_1, V_2, \dots, V_n$ be the set of relations as in the statement of Proposition 3. If there exist $1 \le i_1 < i_2 \le n$ such that $V_{i_1}$ has a foreign key from $V_{i_2}$, then for any $S \supseteq \{i_1\}$, we have $y_S = y_{S \cup \{i_2\}}$.
Proof. Let $S \supseteq \{i_1\}$, $T = S \cup \{i_2\}$ and $S' = S \setminus \{i_1\}$. If $\{i_2\} \subseteq S$, or equivalently $S = T$, then the result is trivial. Suppose instead that $S \subset T$. Note that
\begin{align*}
y_T &= \sum_{t_i \in V_i \,|\, i \in T} \left( \sum_{t_j \in V_j \,|\, j \in T^c} F(t_i, t_j) \right) \left( \sum_{t_j \in V_j \,|\, j \in T^c} G(t_i, t_j) \right) \\
&= \sum_{t_i \in V_i \,|\, i \in S'} \ \sum_{(t_{i_1}, t_{i_2}) \in V_{i_1} \times V_{i_2}} \left( \sum_{t_j \in V_j \,|\, j \in T^c} F(t_i, t_{i_1}, t_{i_2}, t_j) \right) \times \left( \sum_{t_j \in V_j \,|\, j \in T^c} G(t_i, t_{i_1}, t_{i_2}, t_j) \right).
\end{align*}
Since $V_{i_1}$ has a foreign key from $V_{i_2}$, for every $t_{i_1} \in V_{i_1}$ there exists a unique $h(t_{i_1}) \in V_{i_2}$ that joins with $t_{i_1}$; the pairs with $t_{i_2} \neq h(t_{i_1})$ contribute zero to $F$ and $G$. It follows that
\begin{align*}
y_T &= \sum_{t_i \in V_i \,|\, i \in S'} \ \sum_{t_{i_1} \in V_{i_1}} \left( \sum_{t_j \in V_j \,|\, j \in T^c} F(t_i, t_{i_1}, h(t_{i_1}), t_j) \right) \times \left( \sum_{t_j \in V_j \,|\, j \in T^c} G(t_i, t_{i_1}, h(t_{i_1}), t_j) \right) \\
&= \sum_{t_i \in V_i \,|\, i \in S'} \ \sum_{t_{i_1} \in V_{i_1}} \left( \sum_{t_j \in V_j \,|\, j \in T^c} \sum_{t_{i_2} \in V_{i_2}} F(t_i, t_{i_1}, t_{i_2}, t_j) \right) \times \left( \sum_{t_j \in V_j \,|\, j \in T^c} \sum_{t_{i_2} \in V_{i_2}} G(t_i, t_{i_1}, t_{i_2}, t_j) \right) \\
&= \sum_{t_i \in V_i \,|\, i \in S} \left( \sum_{t_j \in V_j \,|\, j \in S^c} F(t_i, t_j) \right) \left( \sum_{t_j \in V_j \,|\, j \in S^c} G(t_i, t_j) \right) = y_S. \qquad \square
\end{align*}
Example 16. Consider the GROUP BY query

SELECT SUM(l_extendedprice * (1 - l_discount))
FROM lineitem, orders, customer, nation
WHERE o_custkey = c_custkey AND
      l_orderkey = o_orderkey AND
      c_nationkey = n_nationkey
GROUP BY n_nationkey;
Here the number of groups is $n_g = 25$ and the number of common relations is $n_R = 3$, which means there are $\binom{25}{2} = 300$ covariance terms, and for each, $2^3 = 8$ $y_S$ terms need to be computed. Hence the total number of computations is $300 \times 8 = 2400$. In the TPC-H schema, lineitem has a foreign key from orders, and orders has a foreign key from customer. By Proposition 8.1, it follows that $y_l = y_{lo} = y_{loc} = y_{lc}$ and $y_o = y_{oc}$. Hence, only 4 distinct $y_S$ terms ($y_\emptyset$, $y_l$, $y_o$, $y_c$) need to be computed for each covariance. This brings down the total number of computations to $300 \times 4 = 1200$.
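The collapsing of $y_S$ terms implied by Proposition 8.1 amounts to a fixed-point closure over the foreign-key edges. The short Python sketch below (illustrative names, not the system's code) reproduces the count of four distinct terms for this example.

from itertools import combinations

def canonical(S, fks):
    """Close S under Prop. 8.1: if i1 is in S and V_{i1} has a foreign key
    from V_{i2}, then y_S = y_{S + {i2}}, so add i2 until a fixed point."""
    S = set(S)
    changed = True
    while changed:
        changed = False
        for i1, i2 in fks:
            if i1 in S and i2 not in S:
                S.add(i2)
                changed = True
    return frozenset(S)

relations = ["l", "o", "c"]
fks = [("l", "o"), ("o", "c")]  # lineitem -> orders, orders -> customer
classes = {}
for r in range(len(relations) + 1):
    for S in combinations(relations, r):
        classes.setdefault(canonical(S, fks), []).append(frozenset(S))
print(len(classes))  # 4: the classes of y_empty, y_c, y_o (= y_oc), y_l (= y_lo = y_lc = y_loc)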
8.2 Sub-Sampling
As discussed in Chapter 5, sub-sampling is used in [76] to allow faster, in-memory analysis when samples are too large to fit in main memory. In the case of a single query plan (as in Example 8 for AVG) this amounts to drawing a sub-sample at the top of the query plan, just before the aggregate(s). Sub-sampling in the case of two query plans sharing samples from a subset of base relations presents an additional challenge. Drawing independent sub-samples at the top of each of the query plans will not produce accurate covariance estimates (for Figure 7-8(d), this would amount to drawing different sub-samples just before computing $A_f$ and $A_g$).

To get meaningful estimates, the same sub-sample of the common base relations should be used in the covariance computation. For example, in Figure 7-8 this would amount to drawing a single sub-sample just after the $\mathcal{G}_{cov}$ step. Note that Figure 7-8(d) is an equivalent conceptualization of the original query plan for the purposes of covariance computation and may be different from the actual query execution plan (Figure 7-8(a)). It is imperative that the sub-sampling does not interfere with the actual query execution plan. To achieve both goals, we draw dependent sub-samples from the top of the query plans using pseudo-RNGs with the same starting seed. In our experiments, the sub-sampling was implemented through a GIST [71], which uses multi-dimensional Bernoulli GUS on top of the query plans to obtain the sub-samples.
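One way to realize such dependent sub-samples (a sketch under assumed names; the experiments use the GIST-based multi-dimensional Bernoulli GUS instead) is to derive each tuple's Bernoulli decision from a hash of its shared lineage key together with a common seed, so both plans make identical keep/drop decisions on the shared part:

import hashlib

def keep(lineage_key, seed, p):
    """Deterministic Bernoulli(p) decision derived from (seed, lineage_key)."""
    h = hashlib.sha256(f"{seed}:{lineage_key}".encode()).digest()
    u = int.from_bytes(h[:8], "big") / 2**64  # pseudo-uniform in [0, 1)
    return u < p

SEED = 42  # the same starting seed is used at the top of both query plans
rows1 = [(1, 7), (2, 9), (3, 4)]  # (orderkey, custkey) lineage, plan 1
rows2 = [(2, 9), (3, 4), (5, 1)]  # plan 2 shares some lineage keys
sub1 = [t for t in rows1 if keep(t, SEED, 0.5)]
sub2 = [t for t in rows2 if keep(t, SEED, 0.5)]
# Shared keys (2, 9) and (3, 4) are kept or dropped identically in both plans.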
8.3 Experiments
We have two major goals for our experimental study. The first goal is to provide experimental
confirmation of our theory. The second goal is to evaluate the efficiency of covariance
computation. More specifically:
• How does the runtime depend on the size of base relation samples?
• How does sub-sampling affect the runtime and accuracy of covariance estimates?
• How does the foreign key optimization affect runtime?
As we demonstrate in this section, both the sub-sampling and foreign-key optimizations
significantly reduce the runtime while providing correct and meaningful estimates.
8.3.1 Experimental Setup
Data sets. We use two versions of TPC-H data. For verification of confidence intervals,
we use a small dataset (scale factor 0.01, 10MB). For runtime experiments, we use a large
dataset (scale factor 1000, 1TB).
Code & Platform. The SOA-COV algebra described in Section 7.3 is implemented in R. The query processing and sub-sampling are implemented in Grokit [2] (written in C++), which combines DataPath [12], GLADE [23] & GIST [71]. Grokit uses a streaming strategy, reading data in chunks from column stores, to achieve high performance (up to 3 GB/s).

Hardware. The Grokit machine has 64 AMD Opteron processors @ 2.3 GHz, 512 GB of DDR3 RAM, and twenty 4 TB hard drives.
We now describe the experiments undertaken to answer the questions listed above. The
detailed setup of each experiment is specified in the respective subsections.
8.3.2 Correctness
In this section, we provide a sanity check for our theory and evaluate the accuracy of the
proposed estimates.
Setup. We use the 10 MB TPC-H data (lineitem has 60,000 tuples, orders has 15,000, and customer has 1,500). The small size allows us to perform multiple experiments to empirically evaluate the estimates through Monte Carlo simulation. We use the following query.

SELECT SUM(l_extendedprice*(1-l_discount)) as e1,
       COUNT(1) as e2
FROM lineitem TABLESAMPLE(x PERCENT),
     orders TABLESAMPLE(y PERCENT),
     customer TABLESAMPLE(z PERCENT)
WHERE l_orderkey = o_orderkey AND
      o_custkey = c_custkey AND
      c_mktsegment = "BUILDING";

Figure 8-1. Percentage of times the true value lies in the estimated confidence intervals vs. the desired confidence level, for four sampling strategies: l(10) o(10) p(50), l(20) o(15) p(55), l(30) o(20) p(60), and l(40) o(25) p(65); the y = x line marks perfect calibration.
The objective is to estimate the average revenue using e1/e2 from the sample obtained from
this query. As described at the end of Chapter 3, the covariance between e1 and e2 is one of
the key terms needed to compute the confidence interval for the average expected revenue.
Monte Carlo validation. We validate the theoretically computed expected value and variance of the AVG-like estimate (e1/e2) using a Monte Carlo simulation. We use the query shown above for different values of (x, y, z); see Figure 8-1. For each value of (x, y, z), we run the query 2500 times and obtain 2500 confidence intervals each for ten different confidence levels, ranging from 5% to 95%. We then compute the achieved confidence levels, i.e., for each of the ten desired confidence levels, we compute the percentage of times the true value falls within the corresponding confidence intervals. In Figure 8-1, we show a comparison between the desired and achieved confidence levels for four different sampling strategies. The achieved confidence levels are very close to the desired values across the different sampling strategies and confidence levels. This provides strong empirical evidence that the confidence intervals obtained by using our theory are accurate and tight.
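The achieved-confidence computation itself is a simple counting loop; a Python sketch with illustrative names (not the experiment harness) is:

def achieved_levels(intervals, true_value):
    """intervals: {desired_level: [(lo, hi), ...]} over the 2500 runs.
    Returns the percentage of runs whose interval covers the true value."""
    return {level: 100.0 * sum(lo <= true_value <= hi for lo, hi in ivs) / len(ivs)
            for level, ivs in intervals.items()}

# e.g. achieved = achieved_levels({5: runs_5, 15: runs_15, ..., 95: runs_95}, true_avg)
# A strategy is well calibrated when achieved is close to desired (the y = x line).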
8.3.3 Running Time
In this section, we evaluate the efficiency of the estimation process.
Setup. For efficiency evaluation, it is important to evaluate the theory in large sample settings. We use the 1 TB TPC-H instance as a sample of a 1 PB database. That means that the six billion tuples in lineitem are a Bernoulli sample from six trillion lineitem tuples in a 1 PB database (0.001 sampling fraction), the 1.5 billion tuples in orders are a sample without replacement from 1.5 trillion tuples in the 1 PB database, and the 150 million tuples in customer are a Bernoulli sample from 150 billion customer tuples in the 1 PB database. The analysis overhead will depend on the sample size. Note that for this section, the database is the sample; no sampling is needed in the execution of the query. We use the pair of queries below and vary the selectivity using the parameter FINALDATE. The selectivity of the query controls the number of tuples used for analysis and indicates the overhead as a function of size.
SELECT SUM(l_quantity)
FROM lineitem, orders, customer
WHERE o_custkey = c_custkey AND
      l_orderkey = o_orderkey AND
      "1992-06-01" <= o_orderdate AND
      o_orderdate < FINALDATE;

SELECT SUM(1)
FROM nation, orders, customer
WHERE o_custkey = c_custkey AND
      c_nationkey = n_nationkey AND
      "1992-06-01" <= o_orderdate AND
      o_orderdate < FINALDATE;
Figure 8-2. Plot of runtime.
Figure 8-2 demonstrates the running time of the simultaneous queries, with increasing overhead specified by four values of FINALDATE. For each value, we recorded runtimes under four conditions: with/without sub-sampling & with/without foreign-key optimization. In each case, the runtime is the sum of the time taken for query processing with sub-sampling & the time taken for analysis. When there is no optimization, the total runtime increases with larger loads due to the increase in the time taken for analysis. Since there are $n = 2$ common relations, the analysis needs $2^n = 4$ $y_S$ terms. Applying the foreign-key optimization precludes the separate computation of two of these terms. In this single-covariance example, using foreign keys for optimization (without sub-sampling) reduces the analysis time, but the rate of increase with selectivity is roughly the same. Sub-sampling (target sub-sample size between 100K and 400K tuples) vastly reduces the analysis time and makes it uniform across all four values of the selectivity parameter, resulting in a stable total runtime.
8.3.4 Foreign Key Optimization
The effect of the foreign-key optimization is more pronounced in cases where a covariance matrix needs to be computed, e.g., in GROUP-BY queries, where the number of $y_S$ computations is large (Section 8.1). We use the query in Example 16 and note the running times with and without the foreign-key optimization. The query processing and sub-sampling time is the same in both cases, but the analysis time dropped from 7 seconds to 0.33 seconds when the foreign-key optimization is used.
CHAPTER 9
A SAMPLING ALGEBRA FOR GENERAL MOMENT MATCHING
In previous chapters, we showed that a GUS sampling operator commutes with relational operators such as selection and cross product in a SOA-equivalent sense. In particular, aggregates derived from two SOA-equivalent sampling plans have the same first two moments. Such an equivalence is sufficient if the aim is to compute the variance of these aggregates, and to construct confidence intervals based on the variance. However, it is not sufficient if one wishes to know deeper distributional properties of the aggregates, such as skewness and kurtosis. For such endeavors, we need a class of sampling methods which commute with relational operators in the sense that aggregates for all "equivalent" plans have the same moments up to a given order. In this chapter, we develop a class of sampling methods to achieve this purpose. In particular, it will be shown that

• common sampling methods such as SWR, SWOR and Bernoulli sampling (and their combinations) are included in this class,

• these sampling methods commute with relational operators while preserving all moments up to a given order k,

• the moments for these methods can be expressed in a form which makes computation easier.
9.1 k-Generalized Uniform Sampling
Consider a database $\mathbf{R} = R_1 \times R_2 \times \cdots \times R_n$. Let $k \ge 2$ be an arbitrarily chosen positive integer. To define our class of sampling methods, we first introduce the required notation.

• We say that $\mathbf{S} = (S_1, S_2, \dots, S_{k-1})$ is an ordered $k$-partition of $V = \{1, 2, \dots, n\}$ if $S_1, S_2, \dots, S_{k-1}$ are pairwise disjoint subsets of $V$. In this setting, we denote $S_k = V \setminus \left( \cup_{i=1}^{k-1} S_i \right)$.

• The collection of all ordered $k$-partitions of $V$ is denoted by $\mathcal{P}_k(V)$.

• Let $t^1, t^2, \dots, t^k \in \mathbf{R}$ be arbitrarily chosen. Then $\mathbf{S}\left(\{t^i\}_{i=1}^k\right)$ is the ordered $k$-partition of $V$ such that
\[
S_m\left(\{t^i\}_{i=1}^k\right) = \left\{ j \in V : \text{there are exactly } m \text{ distinct values in } t^1_j, t^2_j, \dots, t^k_j \right\}
\]
for $1 \le m \le k-1$.
With the above notation in hand, we define our desired class of sampling methods.

Definition 4. A randomized selection process which gives a sample $\mathcal{R}$ from $\mathbf{R}$ is called a k-Generalized Uniform Sampling (k-GUS) method if for any $t^1, t^2, \dots, t^k \in \mathbf{R}$, $P(t^1, t^2, \dots, t^k \in \mathcal{R})$ is a function of $\mathbf{S}\left(\{t^i\}_{i=1}^k\right)$. In such a case, the k-GUS parameters $b = \{b_{\mathbf{T}} \mid \mathbf{T} \in \mathcal{P}_k(V)\}$ are defined as
\[
b_{\mathbf{T}} = P\left(t^1, t^2, \dots, t^k \in \mathcal{R} \;\middle|\; \mathbf{S}\left(\{t^i\}_{i=1}^k\right) = \mathbf{T}\right), \qquad (9\text{--}1)
\]
and the sampling method is denoted by $\mathcal{G}_{k,b}$.
Now, we establish some properties of the class of k-GUS methods. We first show that as $k$ increases, the class of k-GUS sampling methods becomes smaller.

Lemma 1. A $(k+1)$-GUS method is also a $k$-GUS method.

Proof. Consider a sampling method which gives a sample $\mathcal{R}$ from $\mathbf{R}$. Suppose this sampling method is a $(k+1)$-GUS method. Let $t^1, t^2, \dots, t^k \in \mathbf{R}$. Define $t^{k+1} = t^1$. By the definition of $(k+1)$-GUS and $t^{k+1}$, it follows that
\[
P(t^1, t^2, \dots, t^k \in \mathcal{R}) = P(t^1, t^2, \dots, t^k, t^{k+1} \in \mathcal{R})
\]
is a function of the sets
\[
\{ j \in V : \text{there are exactly } m \text{ distinct values in } t^1_j, t^2_j, \dots, t^{k+1}_j \},
\]
for $1 \le m \le k$. Since the number of distinct values in $\{t^i_j\}_{i=1}^{k+1}$ is the same as the number of distinct values in $\{t^i_j\}_{i=1}^k$, and this number is always bounded above by $k$, the result follows. $\square$
The following result shows that the sampling methods introduced in [76] are equivalent to
2-GUS methods as per the above definition.
Lemma 2. The class of GUS methods defined in [76] is identical to the class of 2-GUS
methods.
The proof of the above lemma follows immediately from the definition of GUS methods in [76].
Our next result shows that in the context of sampling from a single relation, three
standard sampling methods: simple random sampling without replacement (SWOR), simple
random sampling with replacement (SWR) and Bernoulli sampling are k-GUS sampling
methods for any positive integer k ≥ 2.
Lemma 3. Consider a sampling method which draws n tuples at random with replacement or
without replacement from a single relation with N tuples, or chooses each tuple independently
with probability p. Such a sampling method is k-GUS for any k ≥ 2.
Proof. For any $t^1, t^2, \dots, t^k \in R$, it follows by standard calculations using the inclusion-exclusion principle that
\[
P(t^1, t^2, \dots, t^k \in \mathcal{R}) =
\begin{cases}
1 - \sum_{i=1}^m (-1)^{i-1} \binom{m}{i} \left( \frac{N-i}{N} \right)^n & \text{for WR sampling,} \\[4pt]
\binom{N-m}{n-m} \big/ \binom{N}{n} & \text{for WOR sampling,} \\[4pt]
p^m & \text{for Bernoulli sampling,}
\end{cases}
\]
where $m$ is the number of distinct elements in the set $\{t^1, t^2, \dots, t^k\}$. In particular, $P(t^1, t^2, \dots, t^k \in \mathcal{R})$ is a function of the number of distinct elements in $\{t^1, t^2, \dots, t^k\}$. The result follows immediately from the definition of a k-GUS method. $\square$
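These probabilities are straightforward to evaluate; the Python sketch below (illustrative, matching the three cases of the display above) computes them as a function of m, the number of distinct tuples among $t^1, \dots, t^k$:

from math import comb

def p_wr(m, N, n):
    """P(all m distinct tuples appear in n draws with replacement from N)."""
    return 1 - sum((-1) ** (i - 1) * comb(m, i) * ((N - i) / N) ** n
                   for i in range(1, m + 1))

def p_wor(m, N, n):
    """P(all m distinct tuples appear in a size-n SWOR sample from N)."""
    return comb(N - m, n - m) / comb(N, n) if n >= m else 0.0

def p_bernoulli(m, p):
    """P(all m distinct tuples survive independent Bernoulli(p) sampling)."""
    return p ** m

# Only m matters, not which tuples were chosen -- exactly the k-GUS property.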
Using Lemma 3, we now show that any combination of SWR, SWOR and Bernoulli sampling over $\mathbf{R} = R_1 \times R_2 \times \cdots \times R_n$ is a k-GUS method for any positive integer $k \ge 2$.

Lemma 4. Consider a sampling method where SWOR, SWR or Bernoulli sampling is done individually for each relation independently, and a cross product of these samples is taken to obtain a sample from $\mathbf{R}$. Such a sampling method is a k-GUS method.

Proof. Using independence and Lemma 3 it follows that for any $t^1, t^2, \dots, t^k \in \mathbf{R}$,
\[
P(t^1, t^2, \dots, t^k \in \mathcal{R}) = \prod_{j=1}^n P(t^1_j, t^2_j, \dots, t^k_j \in \mathcal{R}_j)
= \prod_{m=1}^k \prod_{j \in S_m(\{t^i\}_{i=1}^k)} p_{j,m},
\]
where
\[
p_{j,m} =
\begin{cases}
1 - \sum_{i=1}^m (-1)^{i-1} \binom{m}{i} \left( \frac{N_j - i}{N_j} \right)^{n_j} & \text{for WR sampling,} \\[4pt]
\binom{N_j - m}{n_j - m} \big/ \binom{N_j}{n_j} & \text{for WOR sampling,} \\[4pt]
p_j^m & \text{for Bernoulli sampling.}
\end{cases}
\]
The result now follows from the definition of k-GUS sampling. $\square$
We now derive an expression for the $k$th moment of SUM-like aggregates arising from a k-GUS method. To achieve this, we will need to define a partial order on the space $\mathcal{P}_k(V)$.

Definition 5. For $\mathbf{S}, \mathbf{T} \in \mathcal{P}_k(V)$, we say $\mathbf{S} \supseteq_k \mathbf{T}$ if $S_j \supseteq T_j$ for every $1 \le j \le k-1$.

Using the above definition, we now state and prove the following result.
Theorem 4. Consider a k-GUS method $\mathcal{G}_{k,b}$ which gives a sample $\mathcal{R}$ from $\mathbf{R}$. Then, for any aggregate $A = \sum_{t \in \mathcal{R}} f(t)$ we have
\[
E\left[A^k\right] = \sum_{\mathbf{S} \in \mathcal{P}_k(V)} c_{\mathbf{S}}\, y_{\mathbf{S}}.
\]
Here
\[
c_{\mathbf{S}} = \sum_{\mathbf{T} \subseteq_k \mathbf{S}} (-1)^{\sum_{i=1}^{k-1} |S_i \setminus T_i|}\, b_{\mathbf{T}}, \qquad
F(\mathbf{S}) = \sum_{t^1, t^2, \dots, t^k \in \mathbf{R} \,:\, \mathbf{S}(\{t^i\}_{i=1}^k) = \mathbf{S}} \ \prod_{i=1}^k f(t^i),
\]
and
\[
y_{\mathbf{S}} = \sum_{\mathbf{T} \supseteq_k \mathbf{S}} F(\mathbf{T}).
\]
Proof. It follows by the definition of a k-GUS method that
\begin{align}
E\left[A^k\right] &= E\left( \sum_{t \in \mathbf{R}} 1_{\{t \in \mathcal{R}\}} f(t) \right)^{\!k}
= \sum_{t^1, t^2, \dots, t^k \in \mathbf{R}} P(t^1, t^2, \dots, t^k \in \mathcal{R}) \prod_{i=1}^k f(t^i) \nonumber \\
&= \sum_{\mathbf{S} \in \mathcal{P}_k(V)} b_{\mathbf{S}} \sum_{t^1, t^2, \dots, t^k \in \mathbf{R} \,:\, \mathbf{S}(\{t^i\}_{i=1}^k) = \mathbf{S}} \ \prod_{i=1}^k f(t^i) \tag{9--2} \\
&= \sum_{\mathbf{S} \in \mathcal{P}_k(V)} b_{\mathbf{S}}\, F(\mathbf{S}). \tag{9--3}
\end{align}
Now, it follows by the definition of $c_{\mathbf{S}}$ and $y_{\mathbf{S}}$ that
\begin{align}
\sum_{\mathbf{S} \in \mathcal{P}_k(V)} c_{\mathbf{S}}\, y_{\mathbf{S}}
&= \sum_{\mathbf{S} \in \mathcal{P}_k(V)} \left( \sum_{\mathbf{T} \subseteq_k \mathbf{S}} (-1)^{\sum_{i=1}^{k-1} |S_i \setminus T_i|}\, b_{\mathbf{T}} \right) y_{\mathbf{S}} \nonumber \\
&= \sum_{\mathbf{T} \in \mathcal{P}_k(V)} b_{\mathbf{T}} \left( \sum_{\mathbf{S} \supseteq_k \mathbf{T}} (-1)^{\sum_{i=1}^{k-1} |S_i \setminus T_i|}\, y_{\mathbf{S}} \right). \tag{9--4}
\end{align}
It is clear from the definition of $y_{\mathbf{S}}$ that $\sum_{\mathbf{S} \supseteq_k \mathbf{T}} (-1)^{\sum_{i=1}^{k-1} |S_i \setminus T_i|}\, y_{\mathbf{S}}$ is a linear function of $\{F(\mathbf{S})\}_{\mathbf{S} \supseteq_k \mathbf{T}}$. Clearly, the coefficient of $F(\mathbf{T})$ in this linear function is 1, as $F(\mathbf{T})$ only appears in $y_{\mathbf{T}}$, and does not appear in any $y_{\mathbf{S}}$ for $\mathbf{S} \supset_k \mathbf{T}$. Now, for any $\mathbf{S} \supset_k \mathbf{T}$, note that there are $2^{\sum_{i=1}^{k-1} |S_i \setminus T_i|}$ ordered $k$-partitions which dominate $\mathbf{T}$ ($\supseteq_k \mathbf{T}$) and are dominated by $\mathbf{S}$ ($\subseteq_k \mathbf{S}$). The term $F(\mathbf{S})$ appears in all of the corresponding $y$ expressions. However, for half of these $k$-partitions $(-1)^{\sum_{i=1}^{k-1} |S_i \setminus T_i|} = 1$ and for the other half $(-1)^{\sum_{i=1}^{k-1} |S_i \setminus T_i|} = -1$. Hence, the coefficient of $F(\mathbf{S})$ in $\sum_{\mathbf{S} \supseteq_k \mathbf{T}} (-1)^{\sum_{i=1}^{k-1} |S_i \setminus T_i|}\, y_{\mathbf{S}}$ is zero. It follows that $\sum_{\mathbf{S} \supseteq_k \mathbf{T}} (-1)^{\sum_{i=1}^{k-1} |S_i \setminus T_i|}\, y_{\mathbf{S}} = F(\mathbf{T})$. It follows by (9--3) and (9--4) that
\[
E\left[A^k\right] = \sum_{\mathbf{S} \in \mathcal{P}_k(V)} c_{\mathbf{S}}\, y_{\mathbf{S}}. \qquad \square
\]
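For concreteness, a Python sketch of the coefficient computation (an ordered k-partition represented as a tuple of the frozensets $S_1, \dots, S_{k-1}$, with the parameters b supplied as a dictionary keyed by such tuples; names illustrative):

from itertools import combinations, product

def subsets(xs):
    xs = list(xs)
    return [frozenset(c) for r in range(len(xs) + 1)
            for c in combinations(xs, r)]

def c_coeff(S, b):
    """c_S = sum over T <=_k S of (-1)^(sum_i |S_i minus T_i|) * b_T."""
    total = 0.0
    for T in product(*(subsets(Si) for Si in S)):  # all T with T_i a subset of S_i
        sign = (-1) ** sum(len(Si - Ti) for Si, Ti in zip(S, T))
        total += sign * b[T]
    return total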
We now express $y_{\mathbf{S}}$ when $k = 3$ in a form which can be easily obtained by using GROUP BY queries.
Proposition 9.1. For every $\mathbf{S} \in \mathcal{P}_3(V)$, we have
\[
y_{\mathbf{S}} = 3 \sum_{t_i : i \in S_1} \left[ \sum_{t_j : j \in S_2} \left( \sum_{t_k : k \in S_3} f(t_i, t_j, t_k) \right)^{\!2} \right] \left[ \sum_{t_j : j \in S_2} \sum_{t_k : k \in S_3} f(t_i, t_j, t_k) \right]
- 3 \sum_{t_i : i \in S_1} \sum_{t_j : j \in S_2} \left( \sum_{t_k : k \in S_3} f(t_i, t_j, t_k) \right)^{\!3}.
\]
Proof. Recall that for $t, t', t'' \in \mathbf{R}$, $S_1(t, t', t'')$ is the set of all indices on which $t, t', t''$ agree, and $S_2(t, t', t'')$ is the set of all indices on which $t, t', t''$ take exactly two distinct values. If $j \in S_2(t, t', t'')$, then there are three possibilities: $t_j = t'_j \neq t''_j$, $t'_j = t''_j \neq t_j$, and $t_j = t''_j \neq t'_j$. Hence, we get
\begin{align*}
y_{\mathbf{S}} &= \sum_{t, t', t'' \,:\, \mathbf{S}(t, t', t'') \supseteq_3 \mathbf{S}} f(t) f(t') f(t'') \\
&= \sum_{t_i : i \in S_1} \sum_{t_j \neq t''_j : j \in S_2} \ \sum_{t_k, t'_k, t''_k : k \in S_3} f(t_i, t_j, t_k) f(t_i, t_j, t'_k) f(t_i, t''_j, t''_k) \\
&\quad + \sum_{t_i : i \in S_1} \sum_{t_j \neq t'_j : j \in S_2} \ \sum_{t_k, t'_k, t''_k : k \in S_3} f(t_i, t_j, t_k) f(t_i, t'_j, t'_k) f(t_i, t'_j, t''_k) \\
&\quad + \sum_{t_i : i \in S_1} \sum_{t_j \neq t'_j : j \in S_2} \ \sum_{t_k, t'_k, t''_k : k \in S_3} f(t_i, t_j, t_k) f(t_i, t'_j, t'_k) f(t_i, t_j, t''_k) \\
&= 3 \sum_{t_i : i \in S_1} \sum_{t_j \neq t'_j : j \in S_2} \ \sum_{t_k, t'_k, t''_k : k \in S_3} f(t_i, t_j, t_k) f(t_i, t_j, t'_k) f(t_i, t'_j, t''_k),
\end{align*}
where the last step follows by symmetry. The result follows by noting that
\begin{align*}
&\sum_{t_i : i \in S_1} \sum_{t_j \neq t'_j : j \in S_2} \ \sum_{t_k, t'_k, t''_k : k \in S_3} f(t_i, t_j, t_k) f(t_i, t_j, t'_k) f(t_i, t'_j, t''_k) \\
&\quad = \sum_{t_i : i \in S_1} \sum_{t_j, t'_j : j \in S_2} \ \sum_{t_k, t'_k, t''_k : k \in S_3} f(t_i, t_j, t_k) f(t_i, t_j, t'_k) f(t_i, t'_j, t''_k) \\
&\qquad - \sum_{t_i : i \in S_1} \sum_{t_j : j \in S_2} \ \sum_{t_k, t'_k, t''_k : k \in S_3} f(t_i, t_j, t_k) f(t_i, t_j, t'_k) f(t_i, t_j, t''_k) \\
&\quad = \sum_{t_i : i \in S_1} \left[ \sum_{t_j : j \in S_2} \left( \sum_{t_k : k \in S_3} f(t_i, t_j, t_k) \right)^{\!2} \right] \left[ \sum_{t_j : j \in S_2} \sum_{t_k : k \in S_3} f(t_i, t_j, t_k) \right]
- \sum_{t_i : i \in S_1} \sum_{t_j : j \in S_2} \left( \sum_{t_k : k \in S_3} f(t_i, t_j, t_k) \right)^{\!3}. \qquad \square
\end{align*}
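A Python sketch of this evaluation (assuming the lineage rows are available as (S_1-key, S_2-key, f) triples, e.g. produced by a GROUP BY query; names illustrative):

from collections import defaultdict

def y_S_k3(rows):
    """Evaluate y_S for k = 3 per Proposition 9.1.

    rows: iterable of (s1_key, s2_key, f); A[s1][s2] accumulates the
    innermost sum of f over the S_3 attributes."""
    A = defaultdict(lambda: defaultdict(float))
    for s1, s2, f in rows:
        A[s1][s2] += f
    total = 0.0
    for by_s2 in A.values():
        a = list(by_s2.values())
        # (sum A^2)(sum A) - sum A^3, per group of S_1 keys
        total += sum(x * x for x in a) * sum(a) - sum(x ** 3 for x in a)
    return 3.0 * total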
For $k \ge 4$, the expressions for $y_{\mathbf{S}}$ can be derived along similar lines, although they become more complex as $k$ increases.
9.2 kMA Equivalence
In this section, we define an equivalence between randomized expressions which preserves all moments up to a given order $k$, thereby generalizing the notion of SOA-equivalence in [76].

Definition 6. Given (possibly randomized) expressions $E(R)$ and $F(R)$ we say
\[
E(R) \overset{kMA}{\Longleftrightarrow} F(R)
\]
if for any arbitrary SUM-aggregate $A_f(S) = \sum_{t \in S} f(t)$,
\[
E\left[A_f(E(R))^i\right] = E\left[A_f(F(R))^i\right] \quad \text{for every } 1 \le i \le k.
\]
For non-randomized expressions, it can be easily shown by appropriate choices of the function $f$ that kMA equivalence is the same as set equivalence. The next result shows that kMA equivalence is an equivalence relation and can be manipulated like a relational equivalence.

Proposition 9.2. kMA equivalence is an equivalence relation, i.e., for any expressions $E, F, H$ and relation $R$, we have
\begin{align*}
&E(R) \overset{kMA}{\Longleftrightarrow} E(R), \\
&E(R) \overset{kMA}{\Longleftrightarrow} F(R) \;\Rightarrow\; F(R) \overset{kMA}{\Longleftrightarrow} E(R), \\
&E(R) \overset{kMA}{\Longleftrightarrow} F(R) \;\&\; F(R) \overset{kMA}{\Longleftrightarrow} H(R) \;\Rightarrow\; E(R) \overset{kMA}{\Longleftrightarrow} H(R).
\end{align*}
The proof immediately follows from the definition of kMA equivalence. The next result provides an alternative characterization of kMA equivalence in terms of joint probabilities.
Proposition 9.3. Given two relational algebra expressions $E(R)$ and $F(R)$, we have
\[
E(R) \overset{kMA}{\Longleftrightarrow} F(R)
\;\Leftrightarrow\;
P(t^1, t^2, \dots, t^k \in E(R)) = P(t^1, t^2, \dots, t^k \in F(R)) \quad \forall\, t^1, t^2, \dots, t^k \in R.
\]
9.3 Interaction between kMA and Relational Operators
In this section we show that k-GUS sampling operators commute with selection and cross product in terms of kMA equivalence. This helps convert sampling plans provided by the user into kMA-equivalent analyzable sampling plans.
Proposition 9.4. For a relation $R$, selection $\sigma_C$ and k-GUS method $\mathcal{G}_{k,b}$,
\[
\sigma_C(\mathcal{G}_{k,b}(R)) \overset{kMA}{\Longleftrightarrow} \mathcal{G}_{k,b}(\sigma_C(R)).
\]

Proof. Let $R' = \sigma_C(R)$. Let $\mathcal{R}$ denote the random sample corresponding to $\mathcal{G}_{k,b}$. For $t^1, t^2, \dots, t^k \in R'$, it follows that
\[
P(t^1, t^2, \dots, t^k \in \sigma_C(\mathcal{R}) \mid \mathbf{S}(\{t^i\}_{i=1}^k) = \mathbf{S}) = P(t^1, t^2, \dots, t^k \in \mathcal{R} \mid \mathbf{S}(\{t^i\}_{i=1}^k) = \mathbf{S}) = b_{\mathbf{S}}. \qquad \square
\]
Proposition 9.5. For k-GUS methods $\mathcal{G}_{k,b_1}(R)$ and $\mathcal{G}_{k,b_2}(S)$ such that $L(R) \cap L(S) = \emptyset$,
\[
\mathcal{G}_{k,b_1}(R) \times \mathcal{G}_{k,b_2}(S) \overset{kMA}{\Longleftrightarrow} \mathcal{G}_{k,b}(R \times S).
\]
Here $b_{\mathbf{T}} = b_{1,\mathbf{T}_1}\, b_{2,\mathbf{T}_2}$ with $\mathbf{T}_1 = \mathbf{T} \cap L(R)$ and $\mathbf{T}_2 = \mathbf{T} \cap L(S)$, and $L(R)$, $L(S)$ refer to the lineage schemas of $R$ and $S$ respectively.

Proof. Let $\mathcal{R}$ and $\mathcal{S}$ denote the independent random samples corresponding to $\mathcal{G}_{k,b_1}(R)$ and $\mathcal{G}_{k,b_2}(S)$ respectively. Let $t^1, t^2, \dots, t^k \in R \times S$, where $t^i = (t^i_R, t^i_S)$. Since $L(R) \cap L(S) = \emptyset$, it follows that
\[
\mathbf{S}(\{t^i\}_{i=1}^k) = \mathbf{T} \;\Leftrightarrow\; \mathbf{S}(\{t^i_R\}_{i=1}^k) = \mathbf{T}_1 \ \text{and}\ \mathbf{S}(\{t^i_S\}_{i=1}^k) = \mathbf{T}_2.
\]
Hence, we get
\begin{align*}
b_{\mathbf{T}} &= P(t^1, t^2, \dots, t^k \in \mathcal{R} \times \mathcal{S} \mid \mathbf{S}(\{t^i\}_{i=1}^k) = \mathbf{T}) \\
&= P(t^1_R, \dots, t^k_R \in \mathcal{R} \mid \mathbf{S}(\{t^i_R\}_{i=1}^k) = \mathbf{T}_1)\, P(t^1_S, \dots, t^k_S \in \mathcal{S} \mid \mathbf{S}(\{t^i_S\}_{i=1}^k) = \mathbf{T}_2) \\
&= b_{1,\mathbf{T}_1}\, b_{2,\mathbf{T}_2}. \qquad \square
\end{align*}
We now investigate interactions between k-GUS operators when applied to the same data.

Proposition 9.6. For any expression $R$ and k-GUS methods $\mathcal{G}_{k,b_1}$ and $\mathcal{G}_{k,b_2}$ which are applied independently,
\[
\mathcal{G}_{k,b_1}(\mathcal{G}_{k,b_2}(R)) \overset{kMA}{\Longleftrightarrow} \mathcal{G}_{k,b}(R),
\]
where $b_{\mathbf{T}} = b_{1,\mathbf{T}}\, b_{2,\mathbf{T}}$.

The proof follows immediately from the independence of the two k-GUS methods.
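The parameter arithmetic of Propositions 9.5 and 9.6 is easy to mechanize; a Python sketch (partitions as tuples of frozensets, as in the earlier sketch; names illustrative):

def compose_same_data(b1, b2):
    """Prop. 9.6: two independent k-GUS methods applied in sequence,
    b_T = b1_T * b2_T for every ordered k-partition T."""
    return {T: b1[T] * b2[T] for T in b1}

def restrict(T, lineage):
    """Project an ordered k-partition of L(R) union L(S) onto one lineage."""
    return tuple(Ti & lineage for Ti in T)

def compose_cross_product(b1, b2, LR, LS):
    """Prop. 9.5: b_T = b1_{T restricted to L(R)} * b2_{T restricted to L(S)}."""
    return lambda T: b1[restrict(T, LR)] * b2[restrict(T, LS)]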
Using Propositions 9.4, 9.5 and 9.6, we can transform a wide variety of query plans into a kMA-equivalent query plan with sampling at the top. This allows us to construct estimates of quantities such as skewness (for $k = 3$) and kurtosis (for $k = 4$) for SUM-like aggregates. In future work, we will explore extensions of the notion of kMA equivalence to pairs of query plans (analogous to extending SOA equivalence to SOA-COV equivalence). Such an extension will allow us to estimate the skewness, kurtosis or other quantities involving higher order moments for functions of multiple SUM-like aggregates. Another future line of research is to find compact and computation-friendly expressions for $y_{\mathbf{S}}$ for general $k$, i.e., to extend Proposition 9.1 beyond the case $k = 3$.
REFERENCES
[1] “SQL-2003 Standard.” 2003.
[2] “Grokit.” 2018.
URL https://github.com/tera-insights/grokit
[3] “SWI-Prolog.” 2018.
URL http://www.swi-prolog.org
[4] “TPC-H Benchmark.” 2018.
URL http://www.tpc.org/tpch
[5] Acharya, S., Gibbons, P. B., Poosala, V., and Ramaswamy, S. “Join synopses forapproximate query answering.” Proceedings of the ACM SIGMOD International Confer-ence on Management of Data. ACM, 1999, 275–286.
[6] Acharya, Swarup, Gibbons, Phillip B., and Poosala, Viswanath. “Aqua: A fast decisionsupport system using approximate query answers.” In Proc. of 25th Intl. Conf. on VeryLarge Data Bases. 1999, 754–755.
[7] Acharya, Swarup, Gibbons, Phillip B., Poosala, Viswanath, and Ramaswamy, Sridhar.“The Aqua approximate query answering system.” Proceedings of the ACM SIGMODInternational Conference on Management of Data. New York, USA: ACM, 1999, 574–576.
URL http://doi.acm.org/10.1145/304182.304581
[8] Agarwal, S., Mozafari, B., Panda, A., Milner, H., Madden, S., and Stoica, I. “BlinkDB:Queries with Bounded Errors and Bounded Response Times on Very Large Data.”EuroSys (2013).
[9] Agarwal, Sameer, Milner, Henry, Kleiner, Ariel, Talwalkar, Ameet, Jordan, Michael,Madden, Samuel, Mozafari, Barzan, and Stoica, Ion. “Knowing When You’re Wrong:Building Fast and Reliable Approximate Query Processing Systems.” Proceedingsof the ACM SIGMOD International Conference on Management of Data. ACM, 2014.
[10] Alon, N., Matias, Y., and Szegedy, M. “The space complexity of approximating thefrequency moments.” ACM Symposium on Theory of Computing. 1996.
[11] Antova, Lyublena, Jansen, Thomas, Koch, Christoph, and Olteanu, Dan. “Fast and simplerelational processing of uncertain data.” 2008 IEEE 24th International Conference on DataEngineering. IEEE, 2008, 983–992.
[12] Arumugam, Subi, Dobra, Alin, Jermaine, Christopher M., Pansare, Niketan, and Perez,Luis. “The DataPath System: A Data-centric Analytic Processing Engine for Large Data Warehouses.” Proceedings of the ACM SIGMOD International Conference onManagement of Data. ACM, 2010, 519–530.
[13] Babcock, Brian, Datar, Mayur, and Motwani, Rajeev. “Load Shedding in Data StreamSystems.” Data Streams - Models and Algorithms. 2007. 127–147.
[14] Bickel, Peter J, Götze, Friedrich, and van Zwet, Willem R. “Resampling fewer than n observations: gains, losses, and remedies for losses.” Statistica Sinica (1997): 1–31.
[15] Bickel, Peter J and Sakov, Anat. “Extrapolation and the bootstrap.” Sankhya: The IndianJournal of Statistics, Series A (2002): 640–652.
[16] Bickel, Peter J and Yahav, Joseph A. “Richardson extrapolation and the bootstrap.”Journal of the American Statistical Association 83 (1988).402: 387–393.
[17] Buccafurri, F., Lax, G., Saccà, D., Pontieri, L., and Rosaci, D. “Enhancing histograms bytree-like bucket indices.” The VLDB Journal 17 (2008): 1041–1061.
[18] Chakrabarti, K., Garofalakis, M. N., Rastogi, R., and Shim, K. “Approximate queryprocessing using wavelets.” The VLDB Journal 10 (2001).2-3: 199–223.
[19] Chaudhuri, S., Das, G., and Narasayya, V. “Optimized stratified sampling for approximatequery processing.” TODS (2007).
[20] Chaudhuri, Surajit and Motwani, Rajeev. “On Sampling and Relational Operators.” IEEEData Eng. Bull. 22 (1999).4: 41–46.
[21] Chaudhuri, Surajit, Motwani, Rajeev, and Narasayya, Vivek R. “On Random Samplingover Joins.” SIGMOD Conference. 1999, 263–274.
[22] Cheney, James, Chiticariu, Laura, and Tan, Wang-Chiew. “Provenance in Databases:Why, How, and Where.” Foundations and Trends in Databases 1 (2007).4: 379–474.
[23] Cheng, Yu, Qin, Chengjie, and Rusu, Florin. “GLADE: Big Data Analytics Made Easy.”Proceedings of the 2012 ACM SIGMOD International Conference on Management ofData. ACM, 2012, 697–700.
[24] Cormode, G., Garofalakis, M., Haas, P.J., and Jermaine, C. “Synopses for Massive Data:Samples, Histograms, Wavelets, Sketches.” Foundations and Trends in Databases 4(2012).1-3: 1–294.
[25] Cormode, G. and Muthukrishnan, S. “An improved data stream summary: The count-minsketch and its applications.” Journal of Algorithms 55 (2005).1: 58–75.
[26] Pang, C., Zhang, Q., Hansen, D., and Maeder, A. “Unrestricted wavelet synopses undermaximum error bound.” Proceedings of the International Conference on ExtendingDatabase Technology. 2009.
[27] Dalvi, Nilesh and Suciu, Dan. “Efficient query evaluation on probabilistic databases.” TheVLDB Journal 16 (2007): 523–544.
[28] Dobra, A. “Histograms revisited: When are histograms the best approximation method foraggregates over joins?” Proceedings of ACM Principles of Database Systems. 2005.
[29] Dobra, A., Garofalakis, M., Gehrke, J. E., and Rastogi, R. “Processing complex aggregatequeries over data streams.” Proceedings of the ACM SIGMOD International Conferenceon Management of Data. ACM, 2002.
[30] Dobra, Alin, Jermaine, Chris, Rusu, Florin, and Xu, Fei. “Turbo-Charging EstimateConvergence in DBO.” PVLDB 2 (2009).1: 419–430.
[31] Donjerkovic, D. and Ramakrishnan, R. “Probabilistic optimization of top N queries.”Proceedings of the International Conference on Very Large Data Bases. 1999.
[32] Efron, B. “Bootstrap methods: another look at the jackknife.” Annals of Statistics 7(1979): 1–26.
[33] Efron, Bradley. “More Efficient Bootstrap Computations.” Journal of the American StatisticalAssociation 85 (1990): 79–89.
[34] Efron, Bradley and Tibshirani, Robert J. An introduction to the bootstrap. CRC press,1994.
[35] Flajolet, P. and Martin, G. N. “Probabilistic counting algorithms for databaseapplications.” Journal of Computer and System Sciences 31 (1985): 182–209.
[36] Ganguly, S. “Counting distinct items over update streams.” Theoretical Computer Science378 (2007): 211–222.
[37] Garofalakis, M. and Gibbons, P. B. “Wavelet synopses with error guarantees.” Pro-ceedings of the ACM SIGMOD International Conference on Management of Data. ACM,2002.
[38] ———. “Probabilistic wavelet synopses.” ACM Transactions on Database Systems 29(2004).
[39] Garofalakis, M. and Kumar, A. “Deterministic wavelet thresholding for maximum-errormetrics.” Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principlesof Database Systems. ACM, 2004, 166–176.
[40] ———. “Wavelet synopses for general error metrics.” ACM Transactions on DatabaseSystems 30 (2005).
[41] Gibbons, Phillip B., Poosala, Viswanath, Acharya, Swarup, Bartal, Yair, Matias, Yossi,Muthukrishnan, S., Ramaswamy, Sridhar, and Suel, Torsten. “AQUA: System andtechniques for approximate query answering.” Tech. rep., 1998.
[42] Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. J. “One-pass waveletdecomposition of data streams.” IEEE Transactions on Knowledge and Data Engineering15 (2003).
[43] Gryz, Jarek, Guo, Junjie, Liu, Linqi, and Zuzarte, Calisto. “Query sampling in DB2Universal Database.” Proceedings of the ACM SIGMOD International Conference onManagement of Data. ACM, 2004, 839–843.
[44] Guha, S. and Harb, B. “Wavelet synopsis for data streams: Minimizing non-euclideanerror.” Proceedings of the ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining. 2005.
[45] ———. “Approximation algorithms for wavelet transform coding of data streams.”Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms. 2006.
[46] Guha, S., Koudas, N., and Shim, K. “Approximation and streaming algorithms forhistogram construction problems.” ACM Transactions on Database Systems 31 (2006):396–438.
[47] Haas, Peter J. “Large-Sample and Deterministic Confidence Intervals for OnlineAggregation.” SSDBM. IEEE Computer Society Press, 1996, 51–63.
[48] Haas, Peter J. and Hellerstein, Joseph M. “Ripple joins for online aggregation.” Proceed-ings of the ACM SIGMOD International Conference on Management of Data. ACM, 1999,287–298.
[49] ———. “Online Query Processing.” Proceedings of the ACM SIGMOD InternationalConference on Management of Data. ACM, 2001, 623.
[50] Haas, Peter J., Naughton, Jeffrey F., Seshadri, S., and Swami, Arun N. “Selectivity andCost Estimation for Joins Based on Random Sampling.” Journal of Computer and SystemSciences 52 (1996): 550 – 569.
URL http://www.sciencedirect.com/science/article/pii/S0022000096900410
[51] ———. “Selectivity and cost estimation for joins based on random sampling.” J. Comput.Syst. Sci. 52 (1996): 550–569.
[52] Hellerstein, Joseph M., Haas, Peter J., and Wang, Helen J. “Online aggregation.” ACM,1997, 171–182.
URL http://doi.acm.org/10.1145/253262.253291
[53] Horvitz, D. G. and Thompson, D. J. “A generalization of sampling without replacementfrom a finite universe.” Journal of the American Statistical Association 47 (1952):663–685.
[54] Ioannidis, Y. E. “Approximations in database systems.” Proceedings of the InternationalConference on Database Theory. 2003.
[55] Ioannidis, Y. E. and Christodoulakis, S. “On the propagation of errors in the size of joinresults.” Proceedings of the ACM SIGMOD International Conference on Management ofData. ACM, 1991.
[56] ———. “Optimal histograms for limiting worst-case error propagation in the size of joinresults.” ACM Transactions on Database Systems 18 (1993).
[57] Ioannidis, Y. E. and Poosala, V. “Balancing histogram optimality and practicality forquery result size estimation.” Proceedings of the ACM SIGMOD International Conferenceon Management of Data. ACM, 1995.
[58] Ioannidis, Y.E. and Poosala, V. “Histogram-based approximation of set-valuedquery-answers.” Proceedings of the International Conference on Very Large DataBases. 1999.
[59] Jermaine, Chris, Arumugam, Subramanian, Pol, Abhijit, and Dobra, Alin. “Scalableapproximate query processing with the DBO engine.” ACM Trans. Database Syst. 33(2008).
[60] Jermaine, Chris, Dobra, Alin, Arumugam, Subramanian, Joshi, Shantanu, and Pol, Abhijit.“The Sort-Merge-Shrink join.” ACM Trans. Database Syst. 31 (2006): 1382–1416.
[61] Joshi, S. and Jermaine, C. “Sampling-Based Estimators for Subset-Based Queries.”PVLDB 1 (2009): 181–202.
[62] Kandula, Srikanth, Shanbhag, Anil, Vitorovic, Aleksandar, Olma, Matthaios, Grandl,Robert, Chaudhuri, Surajit, and Ding, Bolin. “Quickr: Lazily Approximating ComplexAd-Hoc Queries in Big Data Clusters.” Proceedings of the ACM SIGMOD InternationalConference on Management of Data. ACM, 2016.
[63] Kanne, C.C. and Moerkotte, G. “Histograms reloaded: The merits of bucket diversity.”Proceedings of the ACM SIGMOD International Conference on Management of Data.ACM, 2010.
[64] Karras, P. and Mamoulis, N. “One-pass wavelet synopses for maximum-error metrics.”PVLDB. ACM, 2005.

[65] Karras, P., Sacharidis, D., and Mamoulis, N. “Exploiting duality in summarization withdeterministic guarantees.” Proceedings of the ACM SIGKDD International Conference onKnowledge Discovery and Data Mining. 2007.
[66] Kaushik, R., Naughton, J. F., Ramakrishnan, R., and Chakaravarthy, V. T. “Synopses forquery optimization: A space-complexity perspective.” ACM Transactions on DatabaseSystems 30 (2005): 1102–1127.
[67] Kempe, D., Dobra, A., and Gehrke, J. “Gossip-based computation of aggregateinformation.” Proceedings of the IEEE Conference on Foundations of Computer Sci-ence. 2003.
[68] Kleiner, A., Talwalkar, A., Agarwal, S., Stoica, I., and Jordan, M. I. “A general bootstrapperformance diagnostic.” KDD (2013).
[69] Kleiner, Ariel, Talwalkar, Ameet, Sarkar, Purnamrita, and Jordan, Michael. “The big databootstrap.” arXiv preprint arXiv:1206.6415 (2012).
[70] Laptev, N., Zeng, K., and Zaniolo, C. “Early Accurate Results for Advanced Analytics onMapReduce.” PVLDB 5 (2012).
[71] Li, Kun, Wang, Daisy Zhe, Dobra, Alin, and Dudley, Christopher. “UDA-GIST: AnIn-database Framework to Unify Data-parallel and State-parallel Analytics.” PVLDB(2015): 557–568.
[72] Lipton, Richard J., Naughton, Jeffrey F., Schneider, Donovan A., and Seshadri, S.“Efficient sampling strategies for relational database operations.” Theoretical ComputerScience 116 (1993).1: 195 – 226.
URL http://www.sciencedirect.com/science/article/pii/030439759390224H
[73] Liu, R.Y. and Singh, K. “Using i.i.d. bootstrap inference for general non-i.i.d. models.”Journal of Statistical Planning and Inference 43 (1999): 67–75.
[74] Matias, Y. and Urieli, D. “Optimal workload-based weighted wavelet synopses.” Theoreti-cal Computer Science 371 (2007): 227–246.
[75] Charikar, M., Chen, K., and Farach-Colton, M. “Finding frequent items in data streams.”International Colloquium on Automata, Languages and Programming. 2002.
[76] Nirkhiwale, Supriya, Dobra, Alin, and Jermaine, Chris. “A Sampling Algebra for AggregateEstimation.” PVLDB 6 (2013).14: 1798–1809.
[77] Olken, Frank. “Random Sampling from Databases.” 1993.
[78] Pansare, Niketan, Borkar, Vinayak, Jermaine, Chris, and Condie, Tyson. “OnlineAggregation for Large MapReduce Jobs.” PVLDB 4 (2011): 1135–1145.
[79] Piatetsky-Shapiro, Gregory and Connell, Charles. “Accurate estimation of the numberof tuples satisfying a condition.” Proceedings of the 1984 ACM SIGMOD internationalconference on Management of data. ACM, 1984, 256–276.
[80] Pol, A. and Jermaine, C. “Relational confidence bounds are easy with the bootstrap.”Proceedings of the ACM SIGMOD International Conference on Management of Data.ACM. 2005.
[81] Politis, D.N., Romano, J.P., and Wolf, M. Subsampling. Springer, New York, 1999.
[82] Poosala, V., Ioannidis, Y. E., Haas, P. J., and Shekita, E. J. “Improved histogramsfor selectivity estimation of range predicates.” Proceedings of the ACM SIGMODInternational Conference on Management of Data. ACM, 1996.
[83] Re, Christopher and Suciu, Dan. “The trichotomy of HAVING queries on a probabilisticdatabase.” PVLDB 18 (2009): 1091–1116.
[84] Rusu, F. and Dobra, A. “Statistical Analysis of Sketch Estimators.” Proceedings ofthe ACM SIGMOD International Conference on Management of Data. ACM, 2007.
[85] Rusu, F. and Dobra, A. “Sketches for size of join estimation.” ACM Transactions onDatabase Systems 33 (2008).3.
[86] Rusu, Florin and Dobra, Alin. “Sketching Sampled Data Streams.” Proceedings of IEEEICDE. 2009.
[87] Sen, Prithviraj, Deshpande, Amol, and Getoor, Lise. “Read-once functions and queryevaluation in probabilistic databases.” Proceedings of the VLDB Endowment 3(2010).1-2: 1068–1079.
[88] Tran, Thanh T, Peng, Liping, Diao, Yanlei, McGregor, Andrew, and Liu, Anna. “CLARO:modeling and processing uncertain data streams.” The VLDB Journal 21 (2012).5: 651–676.
[89] Tran, Thanh TL, Diao, Yanlei, Sutton, Charles, and Liu, Anna. “Supporting user-definedfunctions on uncertain data.” Proceedings of the VLDB Endowment 6 (2013).6: 469–480.
[90] Vitter, J. S. and Wang, M. “Approximate computation of multidimensional aggregates ofsparse data using wavelets.” Proceedings of the ACM SIGMOD International Conferenceon Management of Data. ACM, 1999.
[91] Wang, Daisy Zhe, Michelakis, Eirinaios, Garofalakis, Minos, and Hellerstein, Joseph M.“BayesStore: managing large, uncertain data repositories with probabilistic graphicalmodels.” Proceedings of the VLDB Endowment 1 (2008).1: 340–351.
[92] Wang, H. and Sevcik, K. C. “Utilizing histogram information.” Proceedings of CASCON.2001.
[93] ———. “Histograms based on the minimum description length principle.” VLDB Journal17 (2008).
[94] Whang, K. Y., Vander-Zanden, B. T., and Taylor, H. M. “A linear-time probabilisticcounting algorithm for database applications.” ACM Transactions on Database Systems15 (1990): 208.
[95] Xu, Fei, Jermaine, Christopher M., and Dobra, Alin. “Confidence bounds forsampling-based group by estimates.” ACM Trans. Database Syst. 33 (2008).
[96] Zeng, Kai, Gao, Shi, Mozafari, Barzan, and Zaniolo, Carlo. “The Analytical Bootstrap: ANew Method for Fast Error Estimation in Approximate Query Processing.” Proceedingsof the ACM SIGMOD International Conference on Management of Data. ACM, 2014,277–288.
BIOGRAPHICAL SKETCH
Supriya Nirkhiwale received her B.E. degree in electronics and telecommunications from
the Sri Govindram Sekseria Institute of Technology, Indore, India in 2006. She received her
master’s degree in electrical engineering in 2009 from Kansas State University and Ph.D. in
computer science in 2018 from the University of Florida. Her primary research is focused on
building theory and scalable frameworks for Approximate Query Processing in large databases. She
has been working as a data scientist for LexisNexis since 2014.