
Mimir: Bringing CTables into Practice

by

Arindam Nandi

May 2016

A thesis submitted to the Faculty of the Graduate School of the

University at Buffalo, the State University of New York

in partial fulfillment for the requirements for the degree of

Master of Science

Computer Science and Engineering


“All that is gold does not glitter,

Not all those who wander are lost;

The old that is strong does not wither,

Deep roots are not reached by the frost.”

- J.R.R. Tolkien


Abstract

Traditional data analytics requires upfront data sanitization to get reliable query results.

There is an implicit expectation on database engines to return absolutely correct answers

to all queries. Relaxing this expectation allows the automation of data curation tasks,

resulting in a spectrum of trade-offs between the time invested in curating data and the achieved data quality.

Mimir is a system based on a recently introduced class of operators called lenses to clean

data with minimal user effort. Lenses provide a general, composable framework to apply

various data cleaning tasks to messy data, while also preserving the provenance of all

sources of uncertainty in the clean data. Mimir can sit atop any database with a mature

JDBC driver and extend it to support lenses and probabilistic queries over them.

The chief limitation of the work that introduced lenses was that join queries over uncertain
predicates decomposed into cross products. This thesis presents several approaches

to making query processing using lenses scalable. Experimental evidence of the viability

of using Mimir to process queries on large datasets is presented, using SPJ queries based

on the TPC-H benchmark. This thesis also describes Mimir's GUI, which annotates

each uncertain data element in query results with quality metrics as well as its provenance.
A third contribution of this work is Mimir's ability to ingest raw CSV data and

provide structure to it with a type inference lens.


Acknowledgements

I would like to express my sincere gratitude to the people who have helped me complete

this thesis.

Foremost among them is my adviser, Dr. Oliver Kennedy. His constant faith and
encouragement have been my principal motivation in pursuing research. He has been a
true mentor and my friend. I will remain eternally grateful to him.

To Dr. Lukasz Ziarek for agreeing to be on my defense committee and for piquing

my interest in compilers. To Dr. Geoffrey Challen, Dr. Steve Ko, Dr. Murat

Demirbas and other members of the faculty in the Computer Science and Engineering

Department for equipping me with the knowledge and experience to take on challenging

projects. I have learned an enormous amount from them.

To Ying Yang, whose research enabled mine, and to other members of the Mimir

group, Niccolo Meneghetti and Vinayak Karuppasamy. Meetings with them were

always rich in ideas and enthusiasm. I’m also thankful to the members of the extended

Database lab for accepting me as one of them. They showed me much about how to enjoy

being a researcher.

To Prof. Partha Basuchowdhuri and Prof. Arindam Chatterjee for their influence
and support during my undergraduate years.

And finally, I would like to thank my family, for their support and love. Their list of

contributions to my life far exceeds anything I could ever put down on paper, and for that

and more, I will always be grateful.


Contents

Abstract
Acknowledgements
List of Figures

1 Introduction

2 Background
  2.1 Representing Incomplete Information
    2.1.1 Codd-Tables
    2.1.2 C-Tables
    2.1.3 PC-Tables
    2.1.4 VG-Relational Algebra
  2.2 Lenses
  2.3 Virtual C-Tables
  2.4 Lineage

3 The Mimir System
  3.1 Architecture
    3.1.1 Lens implementation
    3.1.2 Iterators
  3.2 Graphical User Interface
    3.2.1 Displaying uncertainty
  3.3 Data ingestion
    3.3.1 Type inference lens
    3.3.2 Combining lenses
    3.3.3 Conveying lineage information
  3.4 Quality Metrics

4 Optimizations
  4.1 Partitioning
    4.1.1 Partitioning mechanism
    4.1.2 Performance boost
    4.1.3 Lineage computation
  4.2 Inlining
    4.2.1 Recovering provenance
    4.2.2 Computing statistical metrics
  4.3 Hybrid

5 Experiments
  5.1 Experimental setup
  5.2 Discussion

6 State of the Art
  6.1 Prototype probabilistic databases
  6.2 Model databases
  6.3 Provenance
  6.4 Data curation

7 Conclusion

Bibliography


List of Figures

1.1 Incomplete error-filled example relations, including an implicit unique identifier attribute ROWID.
1.2 One possible version of SaneProduct returned by Mimir.
2.1 Truth tables for three-valued logic.
2.2 A sample Codd-Table.
2.3 A sample C-Table.
2.4 Probability distribution of y.
2.5 Grammars for boolean expressions φ and numerical expressions e including VG-Functions Var(. . .).
2.6 A sample Var Term.
2.7 Selection on C-Tables with VG-RA.
2.8 Reduction to VG-RA Normal Form.
2.9 F - The non-deterministic fragment of the lens query.
3.1 Mimir's architecture.
3.2 Mimir's GUI.
3.3 An explanation pop-up box.
3.4 Ratings1.csv
3.5 The TypedRatings1 lens.
3.6 The final cleaned Ratings table.
4.1 Example relations R and S.
4.2 A non-deterministic join query plan pre-normalization.
4.3 A naive normalized non-deterministic join query plan.
4.4 A simplified view of the Mimir data flow.
4.5 A simple partitioned query.
4.6 Optimized query plan.
4.7 Optimized partitioned query plan for two uncertain attributes.
4.8 Inline lineage rewrites.
5.1 Testbench specifications.
5.2 Q1
5.3 Q3
5.4 Q5
5.5 Q9
5.6 TPC-H S.F. 0.1 SQLite.
5.7 TPC-H S.F. 1 SQLite.
5.8 TPC-H S.F. 0.1 DBX.
5.9 TPC-H S.F. 1 DBX.


Dedicated to my parents, Subimal and Ratna...


Chapter 1

Introduction

Traditional data analytics is dependent on the quality of the underlying data. The

reliability of results is a direct function of the reliability of the data itself. Consequently,

before sitting down to do any kind of analysis, however exploratory in nature, an analyst

has to sift through vast swathes of data, fixing problems such as entity resolution,
schema matching, and domain constraint violations. If the data exists outside

of the database, it has to be migrated in from its original source, applying necessary

transformations to preserve or augment the structure of the data. Such transformations

are commonly referred to as an Extract-Transform-Load or ETL pipeline.

The need to curate data manually before analysis can begin is a very real problem

consuming significant man-hours in enterprise settings. Various works have looked into

amortizing the high upfront cost of data curation over a pay-as-you-go approach[18, 37].

The design philosophy of these systems is to identify and rank curation tasks according

to their impact on the final data quality. Mechanisms are provided to automate the

cleaning process with the caveat that the result of such cleaning efforts is imperfect.

Results are annotated with confidence metrics. Based on such feedback, the analyst

can decide whether the results are within acceptable error bounds. Only the minimum

necessary effort is invested to clean the data to a point where the accuracy of the results

is satisfactory.

This thesis describes Mimir, a tool that augments relational databases to support
on-demand data cleaning. Mimir can be used with any database that has a mature JDBC

driver and is currently compatible with SQLite and a commercial database management

system. Mimir is a shim layer sitting on top of the database. The base data resides

within the underlying database. Mimir requires no changes to the base data to support

its probabilistic features, relying instead on non-materialized views. This allows Mimir


to seamlessly integrate with existing work-flows without the need to migrate data to

another database.

Data curation in Mimir is done with lenses [37]. Lenses provide a general framework to
perform a variety of data cleaning tasks. At its simplest, a lens has two components.

1. A query rewrite that transforms the original query into a normal form where all

non-determinism exists at the root operator(s) and the sub-tree is wholly deterministic.
Variables are introduced in the top-level operator(s) to represent uncertainty.
So in essence, lenses behave like database views with functions modeling

uncertainty.

2. A probabilistic model that provides a distribution for each variable introduced by

the lens.

The advantages of lenses are manifold.

1. Lenses require very little configuration. The focus is on ease-of-use. Consequently,

the user-facing aspects of Mimir are designed to be minimal rather than overwhelming
the analyst with uncertainty metrics. However, all of these metrics are still available
to the user on request.

2. Lenses cleanly separate the query rewriting component from the underlying data

cleaning models. This means that the actual algorithm used is decoupled from the

lens itself. Newer, better, or proprietary algorithms can be plugged into the lens

over a uniform interface.

3. Lenses are closed with respect to relational algebra. The output of a lens is a

relation. Therefore a series of lenses providing different data cleaning solutions

can be applied in any desired sequence.

4. Lenses automatically provide a way to track lineage of uncertainty in query results.

Each uncertain data element can be easily annotated with lineage information.

The following example illustrates a simple use-case for lenses.

Example 1.1. Alice is an analyst at a retail store and is developing a promotional

strategy based on public opinion ratings gathered by two data collection companies. A

thorough analysis of the data requires substantial data curation effort from Alice: As

shown in Figure 1.1, the rating companies' schemas are incompatible, and the store's own


Product
id  | name            | brand   | cat    | ROWID
1   | iPhone 6S White | Apple   | phone  | R1
2   | iPhone 5S Black | ?       | phone  | R2
3   | Galaxy Note 2   | Samsung | phone  | R3
... | ...             | ...     | ...    | ...
68  | Xperia Z        | ?       | ?      | R68
... | ...             | ...     | ...    | ...
147 | Inspiron 15     | Dell    | laptop | R147
148 | Envybook        | HP      | laptop | R148
... | ...             | ...     | ...    | ...

Ratings1
pid | ... | rating | review ct | ROWID
1   | ... | 4.5    | 50        | R1
2   | ... | A3     | 245       | R2
3   | ... | 4      | 100       | R3
... | ... | ...    | ...       | ...

Ratings2
pid | ... | evaluation | num ratings | ROWID
... | ... | ...        | ...         | ...
68  | ... | 3          | 121         | R68
... | ... | ...        | ...         | ...
147 | ... | 5          | 5           | R147
148 | ... | 4.5        | 4           | R148
... | ... | ...        | ...         | ...

Figure 1.1: Incomplete error-filled example relations, including an implicit unique identifier attribute ROWID.

product data is incomplete. However, Alice’s preliminary analysis is purely exploratory,

and she is hesitant to invest the effort required to fully curate this data. She creates a

lens to fix missing values in the Product table:

CREATE LENS SaneProduct AS SELECT * FROM Product

USING DOMAIN_REPAIR(cat string NOT NULL,
                    brand string NOT NULL);

A possible version of SaneProduct is shown in Figure 1.2. From Alice’s perspective, the

lens SaneProduct behaves as a standard database view. However, the content of the lens

is guaranteed to satisfy the domain constraints on category and brand. NULL values in

these columns are replaced according to a classifier built over the Product table. These are

marked in red. Note that these values are not guaranteed to be correct. Under the hood,

the Mimir system maintains a probabilistic version of the view as a so-called Virtual C-

Table (VC-Table). A VC-Table cleanly separates the existence of uncertainty (e.g., the


category value of a tuple is unknown), the explanation for how the uncertainty affects a

query result (this is a specific type of provenance), and the model for this uncertainty as

a probability distribution (e.g., a classifier for category values that is built when the lens

is created).

SaneProduct
id  | name            | brand   | cat    | ROWID
1   | iPhone 6S White | Apple   | phone  | R1
2   | iPhone 5S Black | Apple   | phone  | R2
3   | Galaxy Note 2   | Samsung | phone  | R3
... | ...             | ...     | ...    | ...
68  | Xperia Z        | Apple   | phone  | R68
... | ...             | ...     | ...    | ...
147 | Inspiron 15     | Dell    | laptop | R147
148 | Envybook        | HP      | laptop | R148
... | ...             | ...     | ...    | ...

Figure 1.2: One possible version of SaneProduct returned by Mimir.

Mimir currently has three lenses: a domain-constraint repair lens, a schema-matching
lens, and a type inference lens. The type inference lens is used to infer types for generic

data. Mimir has functionality to import comma-separated values or CSV files. This

format has no standard definition and header information may often be incomplete.

The type inference lens can quickly attach structure to such imperfect raw data.

Mimir is designed so that users can quickly deploy it on top of their existing database

and try out lenses. However, Mimir’s shim architecture presents some challenges to

query processing. The uncertainty models that power lenses are internal to Mimir.
Deterministic queries, as well as the deterministic fragments of non-deterministic queries,
can be processed by the underlying database. However, uncertain data is plugged in by

Mimir by invoking the appropriate models as a post-processing step. When information

that could accelerate queries, for example selection predicates, is non-deterministic, the

backend database loses information that could optimize the query plan. If join predi-

cates are non-deterministic, the backend query degenerates into a cross-product, killing

scalability.

Existing systems with a similar architecture to Mimir solve this problem by modifying

base data [2, 6]. In this thesis, approaches are presented to solve this problem under the

constraint of not touching the original data, while also preserving provenance of queries.

To further make it easier for analysts to sift through large, messy data-sets, Mimir also

provides a graphical user-interface (GUI). Through Mimir’s GUI, analysts can explore

the capabilities of each lens quickly and intuitively. Information overload is avoided by

layering data through responsive contextual pop-up boxes.


To summarize, the major contributions of this author to the Mimir project are enumerated below:

1. VC-Tables, as proposed in [37], have quadratic run-time when evaluating join

queries with non-deterministic predicates. This work addresses that shortcoming

with several approaches to making query computation scalable.

2. The GUI of Mimir.

3. Support for importing CSV data and the type inference lens.

The rest of this work is structured as follows.

1. Chapter 2 establishes some of the core concepts behind lenses.

2. Chapter 3 contains a description of how Mimir interacts with the user and its

internal architecture.

3. Chapter 4 discusses optimizations in Mimir required to make it scale well.

4. Chapter 5 presents an experimental evaluation of the Mimir system using selected

TPC-H queries.

5. Chapter 6 surveys the state of the art in probabilistic databases and contrasts
some prototype systems with Mimir.

6. Chapter 7 concludes this thesis.


Chapter 2

Background

Lenses are a general tool for performing data cleaning tasks. Lenses extend a deterministic

table to a set of possible worlds. In possible worlds semantics, every uncertain data

element can assume different values. Each permutation of the database with all of

its uncertain data elements substituted with values from the attribute’s domain is a

possible world. Each possible world D has a probability p(D). The set of all possible worlds
is denoted as 𝔻. A query computed over the set of possible worlds is itself the set of
all possible results obtainable from running that query in each possible world, that is
Q(𝔻) = {Q(D) | D ∈ 𝔻}.

A probabilistic database is a database that can be represented as a set of possible worlds

𝔻 annotated with a probability distribution p such that

P[Q(𝔻) = R] = Σ_{D ∈ 𝔻 : Q(D) = R} p(D)    (2.1)

Probabilistic databases have a number of different challenges. Efficient encoding of

incomplete information requires compression techniques to store the large number of

possibilities for unknown values. Defining consistent query semantics over a set of
relational operators agnostic to the data representation is non-trivial [1]. Efficient query
computation over probability distributions is often intractable, leading to approximate
solutions [10]. Choosing appropriate sampling techniques is non-trivial, especially in the

case of highly selective queries. Mimir addresses these challenges by drawing on a rich

body of research.


AND   T   ω   F
T     T   ω   F
ω     ω   ω   F
F     F   F   F

OR    F   ω   T
F     F   ω   T
ω     ω   ω   T
T     T   T   T

NOT(T) = F    NOT(ω) = ω    NOT(F) = T

Figure 2.1: Truth tables for three-valued logic.

2.1 Representing Incomplete Information

Capturing uncertainty in databases is a well-researched topic. Representation systems

have to be compact and at the same time encode information about missing values and

the relationships between different missing values. A representation system establishes

a mapping between a database and the set of possible worlds it represents and also a

set of relational operators Ω that can be applied to the database while preserving their

original semantics.

2.1.1 Codd-Tables

The simplest way to encode incomplete information as proposed in Codd-Tables[8] is

to use nulls. This approach is decades old but is still prevalent among all mainstream

database engines today. Codd proposed a three-valued logic, with the literals true,

false, and ω, or “unknown”. Any comparison expression involving nulls has value

ω. The truth tables of boolean operators are also extended to include ω and are shown in

Figure 2.1. This three-valued logic leads to inconsistencies with several valid relational

algebra expressions[28]. For example, in the table shown in Figure 2.2, consider the

query given below.

SELECT Name FROM Employee WHERE Age <= 30 OR Age > 30;

Though the selection condition is a tautology, this query returns no tuples in the result.

This is because both the expressions null <= 30 and null > 30 evaluate to ω, and the

only tuples allowed in the result set are tuples whose selection predicates evaluate to

true.

Employee

Name       | Age
John Smith | null

Figure 2.2: A sample Codd-Table.
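As an aside on standard SQL semantics (not specific to Mimir or [28]), the usual workaround for this inconsistency is an explicit null test, since IS NULL evaluates to true rather than ω for missing values:

-- Returns John Smith: the IS NULL test bypasses three-valued comparison.
SELECT Name FROM Employee WHERE Age <= 30 OR Age > 30 OR Age IS NULL;

The cost is that the analyst must anticipate and handle the null explicitly.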


2.1.2 C-Tables

The problem with simple nulls is ambiguity. A null can indicate that the attribute
is inapplicable to the current tuple or that there exists a valid, but as of yet unknown,
value. Nulls also fail to capture relationships between uncertain values.

In [16], the concept of labelled-nulls is introduced. Uncertain data elements are represented
as variables. Relationships between uncertain attributes are encoded by using a

set of variables. Each table additionally has a condition column containing a Boolean

predicate. This meta column is called the local condition. Each tuple in the relation is

present only if its condition column evaluates to true. Such tables are called C-Tables.

C-Tables are closed on projection, selection, union and join queries. Therefore the result

of any query on a C-Table consisting of these operators can be represented by another

C-Table.

Labelled-nulls can be used to efficiently represent attribute-level uncertainty whilst
preserving relationships between elements, whereas the condition column provides a way
to represent tuple-level uncertainty. For example, in Figure 2.3, the variable x encodes
the information that while the brand of all three products is unknown, we know that
all three have the same value, and the variable y encodes that the set of possible worlds is
{t1, t2}, {t1, t3} and {t1}, where ti represents the i-th tuple.

Product

id | name          | brand | category | φ
1  | Galaxy Note 2 | x     | phone    | ⊤
2  | Galaxy S3     | x     | phone    | y = “phone”
2  | Galaxy S3     | x     | laptop   | y = “laptop”

Figure 2.3: A sample C-Table.

2.1.3 PC-Tables

PC-Tables[13, 20] annotate each variable introduced in a C-Table with a probability

distribution. For example, in the table in Figure 2.3 we could attach a probability

distribution function to y.

y      | P(y)
phone  | 0.8
laptop | 0.2

Figure 2.4: Probability distribution of y.

In this case, the set of possible worlds reduces to {t1, t2} and {t1, t3}, with probabilities

0.8 and 0.2 respectively.
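To make this concrete, applying Equation (2.1) to this PC-Table for a query Q that simply returns the table gives

P[Q(𝔻) = {t1, t2}] = Σ_{D : Q(D) = {t1, t2}} p(D) = P(y = “phone”) = 0.8,

and likewise P[Q(𝔻) = {t1, t3}] = P(y = “laptop”) = 0.2.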


2.1.4 VG-Relational Algebra

Probability distributions for a particular variable introduced in a C-Table can either

be discrete or continuous. Storing this distribution in the form given in Figure 2.4 is

not feasible for even moderately complex data. Consider a p.d.f. with hundreds or thousands
of probability values, or even a continuous p.d.f., multiplied across a million rows of data. An approach

envisaged by the MCDB[17] system is to use Variable-Generating or VG functions.

In MCDB, VG-Functions act like user-defined parametrized functions that generate

samples of tuples containing uncertain values, with new values conforming to a specified
probability distribution.

PIP [23] introduced a form of bag-relational algebra with extended projection that supports,
in addition to the usual grammar for expressions, a simplified form of VG-Functions.
This grammar is given in Figure 2.5. This form is used by Mimir to generate
unique nulls for uncertain data attributes and also to evaluate expressions over them.
VG-Functions are parametrized and can generate unique variables that encode attribute-level
uncertainty. Internally, a VG-Function in Mimir is represented as a term Var(arg1, arg2, ...)
and is implemented as an interface backed by probabilistic models. This is a convenient
way to encode the probability distribution of labelled-nulls in a loss-less, highly
compressed form. It has been shown that generalized C-Tables are closed w.r.t. VG-RA [16, 23].

During query evaluation, projections containing expressions with Var terms are lazily
evaluated, with only the deterministic sub-expressions simplified before binding them to

tuple-dependent values. The final evaluation is done when deterministic data is available

from the underlying database and the non-deterministic data has been plugged in by

the Var term's model.

Selection is interesting when using VG-RA and C-Tables. Whenever the selection condition
can be deterministically computed to be ⊤, the local condition φ assumes the value
⊤. If the condition can be deterministically computed to be ⊥, the tuple is discarded.

e := R | Column | if φ then e else e
   | e {+, −, ×, ÷} e | Var(id[, e[, e[, . . .]]])

φ := e {=, ≠, <, ≤, >, ≥} e | φ {∧, ∨} φ | ⊤ | ⊥ | e is null | ¬φ

Figure 2.5: Grammars for boolean expressions φ and numerical expressions e including VG-Functions Var(. . .).


Product

id | name          | brand        | category | ROWID
1  | Galaxy Note 2 | Var('X', R1) | phone    | R1
2  | Galaxy S3     | Samsung      | phone    | R2

Figure 2.6: A sample Var Term.

σ_{brand = 'Apple'}(Product)

name          | φ
Galaxy Note 2 | Var('X', R1) = 'Apple'

Figure 2.7: Selection on C-Tables with VG-RA.

When the condition cannot be deterministically evaluated, the non-deterministic predicate
is stored in the φ column. For example, consider the query given below. This query

produces the C-Table given in Figure 2.7.

SELECT name FROM product WHERE brand = ‘Apple’;

2.2 Lenses

Lenses use VG-RA queries to define new C-Tables as views [37]. A lens defines an uncertain
view relation through a VG-RA query F_lens(Q(D)), where F and Q represent

the non-deterministic and deterministic parts of the query respectively. F also defines

a joint probability distribution over each variable introduced by it, either as standard

distributions like in MCDB[17], or with more interesting models that generate samples

backed by a data cleaning algorithm. Lenses are composable because of the closure

property of VG-RA.

Example 2.1. Recall the lens definition from Example 1.1. This lens defines a new

C-Table using the VG-RA query:

π_{id←id, name←name, brand←f(brand), cat←f(cat)}(Product)

In this expression f denotes a check for domain compliance, and a replacement with a

non-deterministic value if the check fails, as follows:

f(x) ≡ if x is null then Var(x, ROWID) else x

The models for Var('brand', ROWID) and Var('cat', ROWID) are defined by classifiers

trained on the contents of Product.


π_{a′_j ← e′_j}(F(⟨a_i ← e_i⟩, φ)(Q(D))) ≡ F(⟨a′_j ← [[e′_j(a_i ← e_i)]]_lazy⟩, φ)(Q(D))    (2.2)

σ_ψ(F(⟨a_i ← e_i⟩, φ)(Q(D))) ≡ F(⟨a_i ← e_i⟩, φ ∧ ψ_var)(σ_{ψ_det}(Q(D)))    (2.3)

F(⟨a_i ← e_i⟩, φ)(Q(D)) × F(⟨a′_j ← e′_j⟩, φ′)(Q′(D)) ≡ F(⟨a_i ← e_i, a′_j ← e′_j⟩, φ ∧ φ′)(Q(D) × Q′(D))    (2.4)

F(⟨a_i ← e_i⟩, φ)(Q(D)) ⊎ F(⟨a_i ← e′_i⟩, φ′)(Q′(D)) ≡
    F(⟨a_i ← [[if src = 1 then e_i else e′_i]]_lazy⟩, [[if src = 1 then φ else φ′]]_lazy)
        (π_{*, src←1}(Q(D)) ⊎ π_{*, src←2}(Q′(D)))    (2.5)

Figure 2.8: Reduction to VG-RA Normal Form.

2.3 Virtual C-Tables

All uncertainty in Mimir is handled through the use of lenses. A single lens over deterministic
data isolates all sources of uncertainty in the top-level operators of the operator

tree. The closure property of VG-RA over C-Tables allows the composition of lenses.

A set of normalization rules defined over selection, projection, union and cross-products

allows Mimir to rewrite the combination of two lenses back into a form where all
non-determinism is again only present at the top-level operators.

F′(F(Q(D))) ≡ F′′(Q(D))    (2.6)

These rewrites are enumerated in Figure 2.8. The advantage of such an approach is

two-fold:

1. Only the non-deterministic fragments of the query need to be computed by Mimir.

This means query processing in Mimir is essentially a set of query rewrites, plus the
equivalent of user-defined functions. This can be achieved

by treating Mimir as a normal application backed by a traditional database with an

adapter layer in between, like JDBC. Mimir can take advantage of sophisticated

query processing techniques without having to modify the kernel of an existing

database.

2. Since all of the sources of uncertainty are present only at the top level operators,

it is easy to infer the lineage of a particular uncertain data element by inspecting


SELECT
    id,
    name,
    IF brand IS NULL THEN Var('brand', ROWID) ELSE brand END,
    IF cat IS NULL THEN Var('cat', ROWID) ELSE cat END
FROM
    Q(D);

Figure 2.9: F - The non-deterministic fragment of the lens query.

the query’s operator tree. This makes it easy to present this information to a

user on demand through Mimir's user interface, of which it forms an integral part.

For example, consider the lens defined in Example 1.1. SaneProduct performs domain

constraint repair on the brand and cat attributes of the Product table. The CREATE

LENS statement defines a view over the Product table which can be expressed as the

SQL statement in Figure 2.9.

Q(D) is the deterministic query SELECT id, name, brand, cat FROM Product and can

be evaluated by the backend database. The resulting cursor goes through a post-processing
step in Mimir, where two things are done. When either brand or cat is

NULL, Mimir plugs in a value computed by a trained classifier that serves as the probabilistic
model for the missing-value lens. When the result is displayed to the user,

Mimir also provides a cue to the user that this data element is a ‘best-guess’ value and

is therefore uncertain in nature.

The conditional statements in Figure 2.9 generate the variables which transform the

Codd-Table to a C-Table. The Var terms produce labelled nulls in a way that preserves
relationships. In this case, since a unique variable must be introduced for

each row, Var takes ROWID as a parameter, generating a unique variable for each row.

Finally, the Var’s internal model provides a joint probability distribution for all variables

introduced, making this a PC-Table.

2.4 Lineage

Computing the lineage of uncertainty in database systems can be a non-trivial problem.

The lineage problem in general involves tracking the evolution of atomic parts of the

data as the database as a whole is modified. In the context of probabilistic

databases, lineage can be used to refer to the mechanisms used to track the sources of

uncertainty in the data. In Mimir, lineage is used as a test to determine whether a data


element is deterministic and, if it is not, what the sources of non-determinism are.

To be specific, since uncertainty is modeled entirely through lenses, lineage in Mimir

consists of tracking which lenses affected the results of a particular non-deterministic

query. The Var terms make this simple. We could answer queries about lineage for a

particular cell by inspecting the top level query and the current ground truth data.

Example 2.2. Consider the SaneProduct lens. This is an example of the domain-

constraint repair lens. Some of the data for the attributes brand and cat are going to be

plugged in by the internal machine learning model, and as a result, these elements are

uncertain.

φ = brand IS NULL

The condition φ is a sufficient condition to determine whether a particular cell of the

brand column is deterministic or not. Additionally, we could inspect the internal model

of the Var to determine that this was an instance of a domain-constraint lens, and this

is the only source of uncertainty.

The above example illustrates a simple case. We can see more interesting examples of

tracking lineage when a series of lenses is applied to the base data. An example

of this is given later, in Example 3.3.
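In such a simple case, the lineage test itself can even be phrased as an ordinary query against the base data. The following sketch for Example 2.2 is purely illustrative; Mimir evaluates these conditions internally rather than exposing them as user queries:

-- True in brand_is_uncertain marks rows where SaneProduct.brand is a classifier guess.
SELECT id, (brand IS NULL) AS brand_is_uncertain FROM Product;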


Chapter 3

The Mimir System

Mimir is an application that can plug into an existing database in a minimally intrusive

way and provide composable data-cleaning solutions through lenses. In this chapter,

an overview of Mimir's general architecture is given. Specific components of the Mimir

system which were the author’s direct contributions are discussed. These are namely

the graphical user interface and the data ingestion features. Finally, an end-to-end

comprehensive use case for the system is presented, which shows the power of lenses as

a general framework for data sanitization.

3.1 Architecture

As has been stated before, much of the query processing in Mimir is handled by a traditional
database connected through JDBC, which we call the backend. Mimir is a shim

layer around this backend, as opposed to a full-blown database engine or a modification

to the kernel of a mature open source database like Postgres. This architecture is one of

the biggest selling points of Mimir, because analysts can easily configure it with their
existing database with minimal effort and start running probabilistic queries

over their data. Apart from creating a few meta-tables, Mimir makes no changes to the

underlying database.

Mimir models a query as a tree of relational operators, commonly referred to as Abstract

Syntax Trees (ASTs). Mimir has a translator for SQL which has two major components.

SqlToRA translates SQL to Mimir’s relational algebra representation. RAToSql does the

reverse. The user issues queries to Mimir's command-line interface (CLI) or graphical user
interface (GUI). The query goes through SqlToRA, which produces the AST representation

of the query. This AST goes through a set of optimization rewrites O, which includes


[Architecture diagram showing: User, Query, RewriteEngine, LensManager, Lens, Model, JDBC Cursor, Annotated Cursor, and the Backend Database.]

Figure 3.1: Mimir’s architecture.

the normalization rules listed in Figure 2.8. The normalized tree is checked for
non-determinism. If the query is wholly deterministic, it can be evaluated directly by the

backend.

3.1.1 Lens implementation

Mimir is more interesting when users introduce uncertainty in the database by creating

lenses over tables. Once a user creates a lens using the CREATE LENS syntax

illustrated in Example 1.1, the lens manager creates an instance of the corresponding

lens class with the user supplied parameters. The lens manager keeps a cache of lenses

in memory as long as the application is running. It also writes metadata for the lens

into a special table in the backend. If a user quits the application, and then tries to

access previously created lenses, the lens is reconstructed using the metadata retrieved

from the backend.
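A minimal sketch of what such a metadata table might look like is given below. The table and column names here are hypothetical; the actual schema of Mimir's internal meta-tables is not specified in this chapter.

-- Hypothetical layout for the lens metadata table Mimir keeps in the backend.
CREATE TABLE MIMIR_LENSES (
    name       varchar,   -- the name given in CREATE LENS
    lens_type  varchar,   -- e.g., DOMAIN_REPAIR, SCHEMA_MATCHING, TYPE_INFERENCE
    query      text,      -- the query the lens is defined over
    parameters text       -- the user-supplied lens parameters
);

On restart, the lens manager can rebuild a lens by reading back its row and re-instantiating its model.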

Each lens has two components: a query rewrite that introduces Var terms where necessary
in the query, and probabilistic models to plug in values for said Var terms. When

Mimir encounters queries involving lenses, or equivalently, non-deterministic queries, it

goes through a series of rewrites. It first looks up the lenses involved in the query from

the lens manager. Substituting the lens’ query rewrite into the user’s query produces

the actual query to be evaluated by Mimir. This raw query is recursively simplified and

normalized. The final query is in the normal form F(Q(D)), where as may be recalled,

F is the non-deterministic fragment of the query. The deterministic fragment Q(D) is


dispatched to the backend after being translated into standard SQL by RAToSql. The cursor

returned by JDBC forms the input to one of Mimir’s iterators.

3.1.2 Iterators

Mimir returns results to the user through an iterator interface. This interface provides

methods for inspecting row and column determinism, lineage conditions and a human-

readable description of the sources of uncertainty in the results. Mimir has several

iterators. For deterministic queries, a thin wrapper around the usual ResultSet of JDBC

is provided, the ResultSetIterator. In the ResultSetIterator, the probabilistic parts

of the interface return trivial values.

Mimir’s normalization rewrites transform a tree into a union of projections, where all

Var terms exist in the top-level projections of each sub-tree of the unions. Therefore

the only operators that could exist in F are unions and projections. Mimir provides

iterators for both. Query processing in Mimir follows the traditional volcano style with

the ability to delegate entire sub-trees to a different query engine.

The projection iterator is especially interesting. This is where the final evaluation of

expressions containing Var terms takes place, with each Var's underlying model plugging

in values and the rest of the data available from the input iterator. It also evaluates

the local condition of each tuple to determine whether its part of the final result set. If

the tuple is non-deterministic and deemed to be not probable enough to be in the final

result, the iterator also sets a flag to indicate that the final result may have missing

tuples.

3.2 Graphical User Interface

One of the key design principles of Mimir is to keep user interaction simple. In many

systems which deal with probabilistic data, there is an emphasis on quantifying the

degree of confidence the system has in its results. An alternative school of thought[22]

is to answer probabilistic queries with best guesses that the system can make with the

information available. Whenever an answer is uncertain, this can be communicated to

the user with the help of simple cues. The user can then inspect the guesses calculated by

the system, and can choose to either accept them or override it by providing alternative

answers. Such a system avoids overwhelming analysts with statistical error bounds and

confidence intervals, and hides complexity until the user requests it.


Figure 3.2: Mimir’s GUI.

The GUI of Mimir is illustrated in Figure 3.2. It is aimed at making the data sanitization
workflow easy and intuitive. Standard SQL queries can be typed into the textbox

marked (a) and lenses can be created using the CREATE LENS syntax. Lenses can also

be applied on top of the current relation with a single click using the lens toolbox in the

panel marked (b). Lenses require little to no configuration. A meaningful name is also

automatically generated based on the current query and the type of lens being applied.

Query results appear in the region marked (c). The lineage of queries is visualized through

the interactive graph in (d). Notifications in (e) convey additional information about

uncertainty in the result to the user.

3.2.1 Displaying uncertainty

Uncertainty in Mimir can be due to Var terms in a projection's input expressions, resulting
in non-deterministic cells. Alternatively, it could be because the local condition of a


Figure 3.3: An explanation pop-up box.

tuple contains a boolean predicate involving a Var term, resulting in non-deterministic
rows. Non-deterministic cells are highlighted in red, as in (c) in Figure 3.2.
Non-deterministic rows may or may not be part of the result set. Deterministic rows are
marked green on the extreme right of (c), whereas non-deterministic rows are marked
red. If some non-deterministic rows did not make it into the best-guess result, this is

indicated to the user through the notification box.

Clicking a non-deterministic cell opens a pop-up containing statistical error bounds and a

human-readable description of what Mimir did to produce its guess, as shown in Figure

3.3. The user may then choose to either accept Mimir’s guess or fix it by providing

another value in its place.

3.3 Data ingestion

Mimir defines a current working database as a logical separation between collections of

relations. The user has the ability to change the working database through the GUI

from the top toolbar or by restarting the CLI with appropriate flags and arguments.

Changing the working database involves reconfiguring the JDBC connection. Once the

user is connected to the appropriate database, they can take advantage of the usual DDL


commands to create, modify and delete tables. Mimir makes no modifications to DDL

queries and these queries behave much the same way as deterministic DQL queries.

In addition to this, Mimir also has support for quickly ingesting raw comma-separated

values or CSV data into the current database through the GUI. Data can be uploaded

to the backend database directly through Mimir, using the Upload feature. CSV files

are dragged and dropped into the work area and are processed through JDBC’s batch

insertion method for fast table creation. The file name becomes the table name. The

header supplies the attribute names. If the header contains only attribute names but

no types, it defaults to string.

3.3.1 Type inference lens

The type inference lens can be used to automatically annotate attributes with a more

specific type than string. The type inference lens has an underlying model that determines
the most likely type of an attribute. This model works by sampling the table's

data. Each data element is compared against each type supported by the lens using

regular expressions. Each match counts as a vote for that type. The lens takes an additional
threshold parameter, T. The type of an attribute is determined to be the type
with the maximum votes if the winning vote share is greater than T. Ties are broken with
a preference for more restrictive types. In case the winning vote share is not large enough, the

type defaults to string.
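As an illustration, the vote count for a single attribute could be phrased as SQL in the style below. This is a hypothetical sketch: the regular expressions are simplified, and it assumes the backend exposes a REGEXP predicate (e.g., SQLite with a registered regexp() function); Mimir performs this sampling and matching internally rather than through user-visible queries.

-- Hypothetical per-type voting over Ratings1.rating; each matching row
-- casts one vote for the corresponding type.
SELECT
    SUM(CASE WHEN rating REGEXP '^-?[0-9]+$'          THEN 1 ELSE 0 END) AS int_votes,
    SUM(CASE WHEN rating REGEXP '^-?[0-9]*\.?[0-9]+$' THEN 1 ELSE 0 END) AS float_votes,
    COUNT(*)                                                             AS total_rows
FROM Ratings1;

The inferred type is the one whose vote share (votes / total_rows) is maximal and exceeds T, with ties broken toward the more restrictive type (here, int over float).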

π_{a_i ← CAST(a_i, Var_i())}(R)

The query rewrite component of the type inference lens simply introduces an additional

projection that uses the CAST function to cast the data to the inferred type.

Example 3.1. Recall the Ratings tables from Example 1.1. Suppose this data is presented
to Mimir as the raw CSV file given in Figure 3.4. Once this data is uploaded

through Mimir, this will be the table Ratings1. The types of all attributes of Ratings1 will

be strings. The actual types of all the attributes in this example, however, are numeric. To

pid,rating,review ct

1,4.5,50

2,A3,245

3,4,100

Figure 3.4: Ratings1.csv


conform to this, a type inference lens on Ratings1 is created. This can be done through

the GUI or through the following SQL query:

CREATE LENS TypedRatings1 AS

SELECT * FROM Ratings1

WITH TYPE_INFERENCE(0.5);

This creates the view TypedRatings1 as shown in Figure 3.5. All three attributes now

have a numeric type. Note that the original data had a typo for the second tuple for the

rating column, which had the value ‘A3’. Mimir defaults non-conforming data elements

to NULL. Here, 0.5 is the value of the threshold parameter T. Changing this parameter to

0.7 will affect the inferred type of rating, since only 67% of the data conforms to a

numeric type. Also note that all data elements are now marked in red. This is because

uncertainty involving attribute types affects all tuples.

Figure 3.5: The TypedRatings1 lens.

3.3.2 Combining lenses

One of the desirable properties of lenses is their composability. This allows an analyst to

fix several distinct problems with raw data by applying specific lenses in series. Consider

the use case given below:

Example 3.2. Refer to the Ratings tables in Example 1.1. There are two distinct

tables, Ratings1 and Ratings2. Let’s imagine the analyst, Alice, had outsourced the

task of collecting ratings information about the company’s products to two separate data

collection agencies. Due to a lack of communication between the two sources, the schemas

used to track the ratings data do not match, but essentially store the same information.

Now Alice wants to run analytics on products based on their popularity amongst users.

To do this effectively, she has to combine the ratings data. There are three distinct problems
that have to be solved in order to achieve that. First, she has to load the data into
the database. Since the data is provided as raw CSV with no type information,

we can follow the steps in Example 3.1 to obtain type-annotated tables, Ratings1Typed

and Ratings2Typed.

As was noted in that example, applying the type inference lens on Ratings1 results in a

NULL value for the non-conforming data value ‘A3’. We can run domain constraint repair

on this table to have Mimir ‘guess’ a value instead, resulting in Ratings1Interpolated.

Finally, we want to unify the two separate tables into one complete and clean table. To

do this, we first match the schemas of the two tables. We apply a schema-matching lens on

Ratings2Typed to transform its schema to align with Ratings1Typed's schema, obtaining
Ratings2Matched. Alice can now union these two tables to get one final Ratings
table, shown in Figure 3.6.

Figure 3.6: The final cleaned Ratings table.
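The whole pipeline can be written as a chain of CREATE LENS statements. The sketch below is illustrative: the TYPE_INFERENCE and DOMAIN_REPAIR syntax follows Examples 3.1 and 1.1, while the SCHEMA_MATCHING parameters shown here are an assumption, since that lens's exact syntax is not given in this chapter.

CREATE LENS Ratings1Typed AS
    SELECT * FROM Ratings1 WITH TYPE_INFERENCE(0.5);

CREATE LENS Ratings2Typed AS
    SELECT * FROM Ratings2 WITH TYPE_INFERENCE(0.5);

-- Replace the NULL produced for the non-conforming value 'A3' with a guess.
CREATE LENS Ratings1Interpolated AS
    SELECT * FROM Ratings1Typed
    USING DOMAIN_REPAIR(rating float NOT NULL);

-- Align Ratings2's schema with Ratings1's (assumed parameter syntax).
CREATE LENS Ratings2Matched AS
    SELECT * FROM Ratings2Typed
    USING SCHEMA_MATCHING(pid int, rating float, review_ct int);

SELECT * FROM Ratings1Interpolated UNION SELECT * FROM Ratings2Matched;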

3.3.3 Conveying lineage information

Mimir relies on machine learning algorithms to do the cleaning operations required by

lenses. No such algorithm can be correct 100% of the time. So, Mimir can and will

make mistakes. For example, in the SaneProduct table shown in Figure 1.2, savvy

observers may notice that Mimir’s inference of the brand of the product Xperia Z as

Apple is incorrect. An analyst must be able to track the source of this error. This


particular error occurred due to incorrect interpolation by the domain-constraint repair

lens. Errors in data may also arise if the type-inference lens infers a less restrictive type

for an attribute, or if the schema-matching lens aligns the schemas incorrectly.

To help users identify the exact source of error, Mimir can provide lineage of each data

element. This is done at two levels.

1. At the table level, Mimir provides a lineage graph. An example of this can be seen

in Figure 3.6. The graph shows each lens and relational operator that was applied

to get the current result. It is constructed using the raw, unoptimized query plan.

2. For non-deterministic data elements, details of what each lens did can be more

closely inspected through its explanation box. A list of human-readable reasons

for the uncertainty of the particular data is provided, and hovering over a reason

highlights the particular lens that introduced the uncertainty, in the lineage graph.

This list of reasons is internally obtained from the iterator interface, which returns

all sources of uncertainty by checking for Var terms in the query plan.

Example 3.3. Consider the Ratings1Interpolated table. This is a composition of

the type inference and domain constraint repair lenses on top of the deterministic data of

the Ratings1 table. The internal normalized query for this table looks like this:

PROJECT[
    PID <= CAST(PID, VAR_TI_0()),
    RATING <= IF CAST(RATING, VAR_TI_1()) IS NULL
              THEN VAR_DCR_1(ROWID)
              ELSE CAST(RATING, VAR_TI_1())
              END,
    REVIEW_CT <= CAST(REVIEW_CT, VAR_TI_2())
](
    RATINGS1
)

Here VAR_TI represents a type inference lens VG-function and VAR_DCR represents a

domain-constraint repair VG-function. The additional integer is used to index into the

correct attribute, since all Var terms define a joint distribution over all attributes of the


relation. The top-level non-deterministic PROJECT will be evaluated by the uncertainty-aware
ProjectionResultIterator. In Ratings1Interpolated, all data is affected by

the type inference lens, so it is present in the lineage of all cells. Only the data element

originally ‘A3’ in the base relation is affected by the domain-constraint repair lens. So,

the domain-constraint repair lens is indicated as a source of uncertainty for only this

cell.

3.4 Quality Metrics

In addition to lineage, Mimir also has support for quantifying the certainty of results.

The error-bounds and confidence-intervals of non-deterministic data are calculated upon

demand, through sampling. Each sample is one possible world of the database and is

generated based on the probability distribution of inferred data. These quality metrics

provide a summary of the degree of confidence Mimir has in its results. An analyst can

look at these measures and decide whether they lie within acceptable margins.

In case the analyst desires results with lower margins of error, effort must be spent on

manually curating parts of the data. The original lens paper[37] gives a way to rank

curation tasks in order of their payoff in improving the quality of results, called the

Cost of Perfect Information (CPI). CPI is based on entropy, which is a measure of how

uncertain the data is. The analyst can use the CPI to determine the next best curation

task to perform. After such curation is complete, the query is re-run, now yielding lower margins of error. This process can be repeated until the analyst is satisfied with the achieved data quality.
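To illustrate the sampling mechanism, the following Scala sketch estimates an empirical confidence interval for a single numeric cell; the normal distribution here is a hypothetical stand-in for a lens model's actual VG-function:

import scala.util.Random

// Draw one possible value of the cell; a real lens model would supply this.
def sampleCell(rng: Random, mean: Double, stddev: Double): Double =
  mean + rng.nextGaussian() * stddev

// Empirical 90% interval from n sampled possible worlds.
def confidenceBounds(mean: Double, stddev: Double, n: Int = 1000): (Double, Double) = {
  val rng = new Random()
  val samples = Seq.fill(n)(sampleCell(rng, mean, stddev)).sorted
  (samples((0.05 * n).toInt), samples((0.95 * n).toInt))
}

// e.g. confidenceBounds(3.5, 1.2) returns roughly (1.5, 5.5)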


Chapter 4

Optimizations

In this chapter, we discuss strategies for making query computation in Mimir efficient.

At present, Mimir supports selection, projection, and join queries on non-deterministic

data. Selection and projection are naturally amenable to the VC-Table representation

used by Mimir. Naive evaluation of join queries is a quadratic problem. An unoptimized

join query is equivalent to a cross product with a selection predicate filtering out tuples

that don’t satisfy the join predicate. Join queries on deterministic predicates can be

handed down to the backend to exploit sophisticated hash-based and indexed-based join

algorithms. Join queries on non-deterministic predicates however, are a problem, due to

two reasons.

1. The first, more theoretical, challenge is dealing with the exponential complexity of evaluating a comparison operator (=, <>, <, >, <=, >=, LIKE) over random variables with continuous or discrete distributions. It should be apparent that the

result of such an expression would itself be probabilistic. [1] gives a good summary

of the various approaches that can be used to evaluate join predicates over both

continuous and discrete distributions. Under possible world semantics, with W

being the set of possible worlds and w being one possible world, the probability

that an attribute a with a p.d.f. would be equal to another uncertain attribute b

is given by:

P(a = b) = ∑_{w ∈ W | w.a = w.b} P(w)

where w.a represents the valuation of a in w. This involves materializing every

possible world, which is exponential in the number of uncertain attributes in the

database.


R:                 S:
 a   b   c          b   d
 1   2   1          2   6
 5   ?   1          1   5
 3   3   4          7   2
 4   1   3          ?   1

Figure 4.1: Example relations R and S.

Mimir avoids this issue by making a maximum-likelihood assumption for the random variables. That is, Mimir only computes results on the possible world obtained by replacing all random variables with their most likely value (a minimal sketch of this evaluation mode appears after this list).

2. The second problem arises as a result of the design of Mimir itself. While the shim architecture makes life very easy for an analyst reluctant to migrate all data to a new database, it complicates efficient query processing. The creation of lenses introduces uncertainty into the database. Since the values of uncertain data are plugged in by Mimir as a post-processing step and do not actually exist in any form in the backend database, there is no way for the backend to know this information in advance or compute it on its own. This means that information the query optimizer could potentially have used to improve query performance is never conveyed to the backend.

For example, consider the normalization rewrite 2.3:

σ_ψ(F(⟨a_i ← e_i⟩, φ)(Q(D))) ≡ F(⟨a_i ← e_i⟩, φ ∧ ψ_var)(σ_{ψ_det}(Q(D)))

The selection predicate ψ is split into deterministic and non-deterministic fragments, ψ_det and ψ_var. The deterministic fragment is pushed into the deterministic query sub-tree, which means it makes it into the query sent to the backend. Non-deterministic predicates are not pushed down and are evaluated directly by Mimir after plugging in Var values. This is a problem if the predicate in question is a join predicate: without it, the backend database only sees a cross-product and returns a quadratic number of tuples. This is the single biggest source of bottlenecks in the Mimir system.
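To make the maximum-likelihood evaluation mode from the first point concrete, here is a minimal Scala sketch; the value classes and the bestGuess map are illustrative, not Mimir's actual interfaces:

// Best-guess evaluation: replace each random variable by its single most
// likely value, then evaluate the expression deterministically.
sealed trait Value
case class IntV(v: Int) extends Value
case class VarRef(name: String) extends Value

def mostLikely(v: Value, bestGuess: Map[String, Int]): Int = v match {
  case IntV(x)      => x
  case VarRef(name) => bestGuess(name) // plug in the most likely value
}

// Evaluating a join predicate like Var_R_b = S.b under this assumption:
val guess   = Map("Var_R_b" -> 2)
val matched = mostLikely(VarRef("Var_R_b"), guess) == mostLikely(IntV(2), guess) // true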

Example 4.1. Consider two relations R(a, b, c) and S(b, d) shown in Figure 4.1.

These relations have missing values for attribute b. A domain-constraint repair lens can be applied to infer the missing values of b. Call the repaired relations Sane_R and Sane_S.

Consider the query:

SELECT a, d FROM Sane_R, Sane_S

WHERE Sane_R.b = Sane_S.b;


π[a←a, d←d]
  σ[left.b = right.b]
    × ( π[a←a, b←DCR(b), c←c](R),  π[b←DCR(b), d←d](S) )

Figure 4.2: A non-deterministic join query plan, pre-normalization.

This results in the pre-normalized query plan shown in Figure 4.2. A traditional query

optimizer will recognize the selection over cross-product as a join operation and replace

it with an efficient join operator. However, since the missing values of R and S cannot

be known by the backend, Mimir has to evaluate the top project and select operator in

Mimir-land. The normalized query plan is shown in Figure 4.3. The selection actually

becomes part of the C-Table condition column φ, and the resulting operator is a composite

projection and selection that is evaluated by the ProjectionResultIterator. Only

tuples satisfying the local condition φ, as computed by substituting attribute values for

the current tuple, are part of the result-set. The only part of the query the backend sees

is R× S. This quickly becomes intractable for data-sets of non-trivial sizes.

A possible way to solve this problem is to have an uncertainty-aware join iterator within Mimir, similar to the Projection and Union iterators. This iterator would enable Mimir to process joins in Mimir-land. There are several roadblocks to this approach. In contrast to projection and union, a join operator is much more complex. Picking efficient join strategies is a non-trivial task that depends on many external factors, such as the physical data layout and the presence of indexes. Picking a join order involves gathering statistics and performing cost-based optimization to find effective join orderings. There are

various algorithms to choose from to do the actual join. Block nested-loop joins can be cheap if the relations are small enough. Some form of hash-based join is the norm when there are no appropriate indexes, because it avoids computing the Cartesian product, with the caveat of needing to spill to disk if the data is large enough. Sort-merge joins can be exploited if the data is sorted. Index nested-loop joins can be applied if there is an index on one of the join keys.

π[a←R.a, d←S.d] // φ ← Var_R_b = Var_S_b
  × ( R,  S )

Figure 4.3: A naive normalized non-deterministic join query plan.

[Data-flow diagram: Mimir's rewrite engine sends a deterministic query over JDBC to the backend's deterministic data, and combines the returned deterministic results with probabilistic models to produce a probabilistic result.]

Figure 4.4: A simplified view of the Mimir data flow.

All of these factors make the design of an efficient join operator a complex engineering

task. The cost seems especially high considering that Mimir sits right on top of a commercial-grade database that can handle this complexity, based on the significant

efforts of domain experts. The design task then becomes that of leveraging the power

of the backend whilst still supporting the uncertainty aware capabilities elucidated in

Chapter 3.

The primary constraint in this regard stems from the unidirectional flow of data from

the backend to Mimir, as illustrated in Figure 4.4. Mimir receives a query from the user and rewrites it, injecting Var terms (essentially equivalent to user-defined functions) to enable uncertainty. The rewritten query is normalized, resulting in F(Q(D)), where all uncertainty is isolated in F. The deterministic fragment Q is passed to the backend for evaluation.

Since the Var terms live in Mimir-land, they are black boxes to the database. An argument can be made for rewriting lenses as user-defined functions, but Mimir is meant to be backend-agnostic, using only universally supported SQL constructs. User-defined functions vary in scope, support, and syntax between database vendors, making them a poor choice for this architecture.

In the following sections, two orthogonal approaches are presented for making joins in Mimir scalable. One is based on partitioning the query into a union of queries that can be evaluated separately by the backend, possibly in parallel. Each independent query only


touches a clearly marked portion of the data. The lineage of this data is common across

all tuples in that particular partition, and is determined by the partitioning scheme.

The second approach involves altering the unidirectional flow of data. The results of the black-box functions that Mimir would eventually plug into the result downstream are made available to the backend in the form of special auxiliary tables. The query is rewritten in a form that leverages this additional information now available to the backend. While this adds complexity in terms of provenance tracking, it allows Mimir to offload virtually all query processing to the backend. We call this the in-lining strategy.

Finally, we notice that the two approaches are independent of each other and can be

combined into a hybrid strategy.

4.1 Partitioning

The partitioning approach relies on the realization that messy data is most often a minority of the data. For example, consider the case of domain-constraint repair. In real-life databases, data elements with missing values are typically a tiny percentage of the overall data. Consequently, Mimir does not have to do any interpolation for the vast majority of tuples. These tuples can be said to be deterministic.

If we modify the query tree upfront to split it into one query that returns only the tuples that need no input from the domain-constraint repair lens, and another that returns only the tuples that do, Mimir can push much more of the processing into the backend and deal only with data that is guaranteed to need input from its black-box functions. It also makes lineage computation faster, by simplifying the lineage formula to be specific to each partition of the data.

4.1.1 Partitioning mechanism

The partitioning approach works by splitting the original normalized query F(Q(D)) into a union of sub-queries, each associated with a partition identified by a Boolean formula ψ_i. Ψ represents the set of all partitions. The set of partitions is complete (⋁_i ψ_i ≡ ⊤) and disjoint (∀ i ≠ j · ψ_i → ¬ψ_j).

Partitions are obtained by applying Algorithm 1[31] to candidate clauses. Candidate clauses have the form of an if-then-else construct whose condition can be evaluated deterministically, splitting the processing into a deterministic case, where no interpolation is required, and a non-deterministic case, where one of Mimir's black-box functions steps in. An expression in Mimir can be checked for non-determinism by recursively traversing the expression grammar and checking for the presence of Var terms, as specified in Algorithm 3.

For example, the clause

if R.b IS NULL then Var_R_b else R.b

is a candidate clause. The general rewritten query is of the form:

is a candidate clause. The general rewritten query is of the form:

(F(〈 ai ← ei 〉 , φ)(Q(D)))

7→ F(〈 ai ← ei 〉 , φvar,1)(σψ1∧φdet,1(Q(D)))

∪ · · · ∪ F(〈 ai ← ei 〉 , φvar,N )(σψN∧φdet,N (Q(D)))

where φvar,i and φdet,i are respectively the non-deterministic and deterministic fragments

of φi (i.e., φi = φvar,i∧φdet,i) and φi is the value of φ when all of its candidate clauses are

substituted with the values in the corresponding partition, ψi. This is notated as φi ≡φ[ψi]. For example, (if R.b IS NULL then V ar R b else R.b)[R.b IS NULL] ≡ V ar R b.

Notice that each partition manifests itself both in the condition column φ and also as

a deterministic selection predicate to multiplex into the subset of data affected by that

particular partition. The selection predicate is obtained by decomposing φi and pushing

its deterministic fragment down into a newly created selection operator.

The partitioning algorithm given in Algorithm 1 targets only conditionals; conjunctions are handled as a by-product of the recursive traversal of Boolean predicates. But the expression grammar also contains disjunctions, so there is an opportunity to optimize the partitions further. To rewrite the query in a form that can be efficiently evaluated by the backend, we distinguish two special cases among the set of partitions: one in which the condition becomes deterministically false, and one in which it becomes deterministically true.

Algorithm 2[31] describes a recursive algorithm that walks through φ and returns ψ_⊤, the partition where the condition is deterministically true, ψ_⊥, the partition where the condition is deterministically false, and the remaining set of partitions Ψ. It uses naivePartition to obtain a set of partitions for the atomic terms of the expression. It also handles the case where a non-deterministic condition and a deterministic condition are combined by a conjunction or a disjunction, and the resulting condition is deterministically true (in the case of a disjunction) or deterministically false (in the case of a conjunction).


Algorithm 1 naivePartition(φ)
Require: φ: A non-deterministic Boolean expression
Ensure: Ψ: A set of partition conditions ψ_i
  conditions ← ∅; Ψ ← ∅
  /* Check ifs in φ for candidate partition conditions */
  for (if condition then α else β) ∈ subexps(φ) do
    if isDet(condition) ∧ (isDet(α) ≠ isDet(β)) then
      conditions ← conditions ∪ {condition}
  /* Loop over the power-set of conditions */
  for partition ∈ 2^conditions do
    ψ_i ← ⊤
    /* Conditions in the partition are true, others are false */
    for clause ∈ conditions do
      if clause ∈ partition then ψ_i ← ψ_i ∧ clause
      else ψ_i ← ψ_i ∧ ¬clause
    Ψ ← Ψ ∪ {ψ_i}
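For concreteness, the following is a minimal Scala sketch of Algorithm 1 over a toy expression grammar; the case classes are simplified stand-ins for Mimir's expression types, and isDet is reduced here to a Boolean check rather than the expression-producing Algorithm 3:

sealed trait Expr
case object True extends Expr
case class Var(name: String) extends Expr
case class Col(name: String) extends Expr
case class Not(e: Expr) extends Expr
case class And(l: Expr, r: Expr) extends Expr
case class Conditional(c: Expr, t: Expr, f: Expr) extends Expr

def isDet(e: Expr): Boolean = e match {
  case Var(_)               => false
  case True | Col(_)        => true
  case Not(x)               => isDet(x)
  case And(l, r)            => isDet(l) && isDet(r)
  case Conditional(c, t, f) => isDet(c) && isDet(t) && isDet(f)
}

def subexps(e: Expr): Seq[Expr] = e +: (e match {
  case Not(x)               => subexps(x)
  case And(l, r)            => subexps(l) ++ subexps(r)
  case Conditional(c, t, f) => subexps(c) ++ subexps(t) ++ subexps(f)
  case _                    => Seq.empty
})

def naivePartition(phi: Expr): Set[Expr] = {
  // Candidate conditions: deterministic if-conditions whose branches
  // differ in determinism (Algorithm 1's first loop).
  val conditions = subexps(phi).collect {
    case Conditional(c, t, f) if isDet(c) && (isDet(t) != isDet(f)) => c
  }.distinct
  // One partition per truth assignment to the conditions (the power-set loop).
  conditions.toSet.subsets().map { positive =>
    conditions.foldLeft(True: Expr) { (psi, clause) =>
      And(psi, if (positive(clause)) clause else Not(clause))
    }
  }.toSet
}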

This further simplifies the queries to be run on the backend. With queries involving

multiple tables, some partitions could be directly computed by the backend as a join

instead of a cross-product, and even when a cross product is necessary, it is only done

on the non-deterministic fragment of the data.

Example 4.2. Consider the query:

SELECT a, d FROM Sane_R, S WHERE Sane_R.b = S.b;

Mimir’s normalized plan for this query is:

π[a←R.a, d←S.d] // φ ← (if R.b IS NULL then Var_R_b else R.b) = S.b
  ( R × S )

In this example, there is only one uncertain attribute, R.b. S.b has missing values too, but since we do not use a domain-constraint repair lens on S (no Sane_S), it is treated as a normal deterministic column. The only clause in φ is the if-then-else construct usual for a lens rewrite. In the case of domain-constraint repair, the condition of the if statement is always a deterministic IS NULL check, splitting into a non-deterministic case, where the value is NULL and a value has to be filled in, and a deterministic case, where the value is the original value. Observe that we can use this condition to explicitly delineate the deterministic and non-deterministic portions of the data, and run separate queries on each. After partitioning, the query looks like Figure 4.5.

Notice that after partitioning, one of the decomposed φs becomes entirely deterministic and can be pushed down as a selection predicate. This entirely deterministic query can then be forwarded to the backend.


Algorithm 2 generalPartition(φ)
Require: φ: A non-deterministic Boolean expression.
Ensure: ψ_⊤: The partition where φ is deterministically true.
Ensure: ψ_⊥: The partition where φ is deterministically false.
Ensure: Ψ_var: The set of non-deterministic partitions.
  if φ is φ_1 ∨ φ_2 then
    ⟨ψ_⊤,1, ψ_⊥,1, Ψ_var,1⟩ ← generalPartition(φ_1)
    ⟨ψ_⊤,2, ψ_⊥,2, Ψ_var,2⟩ ← generalPartition(φ_2)
    ψ_⊤ ← ψ_⊤,1 ∨ ψ_⊤,2
    ψ_⊥ ← ψ_⊥,1 ∧ ψ_⊥,2
    for all ψ_var,1 ∈ Ψ_var,1, ψ_var,2 ∈ Ψ_var,2 do Ψ ← Ψ ∪ {ψ_var,1 ∧ ψ_var,2}
    for all ψ_var,1 ∈ Ψ_var,1 do Ψ ← Ψ ∪ {ψ_var,1 ∧ ψ_⊥,2}
    for all ψ_var,2 ∈ Ψ_var,2 do Ψ ← Ψ ∪ {ψ_var,2 ∧ ψ_⊥,1}
  else if φ is φ_1 ∧ φ_2 then
    /* Symmetric with disjunction */
  else if φ is ¬φ_1 then
    ⟨ψ_⊤,1, ψ_⊥,1, Ψ_var,1⟩ ← generalPartition(φ_1)
    ⟨ψ_⊥, ψ_⊤, Ψ_var⟩ ← ⟨ψ_⊤,1, ψ_⊥,1, Ψ_var,1⟩
  else
    Ψ ← naivePartition(φ)
    Ψ_det ← ∅; Ψ_var ← ∅
    for all ψ ∈ Ψ do
      if isDet(φ[ψ]) then Ψ_det ← Ψ_det ∪ {ψ}
      else Ψ_var ← Ψ_var ∪ {ψ}
    ψ_⊤ ← (⋁ Ψ_det) ∧ φ[⋁ Ψ_det]
    ψ_⊥ ← (⋁ Ψ_det) ∧ ¬φ[⋁ Ψ_det]

π[a←R.a, d←S.d] // φ ← R.b = S.b
  × ( σ[R.b NOT NULL](R),  S )
∪
π[a←R.a, d←S.d] // φ ← Var_R_b = S.b
  × ( σ[R.b IS NULL](R),  S )

Figure 4.5: A simple partitioned query.

The final query is computed as shown in Figure 4.6. The most significant improvement is that the deterministic sub-tree no longer has a cross-product: it has been replaced by a join operator.

Example 4.3. Let us consider a more interesting query.

SELECT a, d FROM Sane_R, Sane_S WHERE Sane_R.b = Sane_S.b;


π[a←R.a, d←S.d]
  ⋈[R.b = S.b] ( σ[R.b NOT NULL](R),  S )
∪
π[a←R.a, d←S.d] // φ ← Var_R_b = S.b
  × ( σ[R.b IS NULL](R),  S )

Figure 4.6: Optimized query plan.

Mimir’s normalized plan for this query is similar to the last one, but phi now takes the

form:

φ← (if R.b IS NULL then V ar R b else R.b = if S.b IS NULL then V ar S b else S.b)

This condition has two candidate clauses, resulting in the partition set:

Ψ = { R.b IS NULL ∧ S.b IS NULL,  ¬(R.b IS NULL) ∧ S.b IS NULL,
      R.b IS NULL ∧ ¬(S.b IS NULL),  ¬(R.b IS NULL) ∧ ¬(S.b IS NULL) }

This results in the optimized query plan shown in Figure 4.7, where R_NN signifies the subset of R where R.b NOT NULL, R_NULL signifies the subset of R where R.b IS NULL, and likewise for S.

π[φ ← ⊤] ( R_NN ⋈[R.b = S.b] S_NN )
∪ π[φ ← Var_R_b = S.b] ( R_NULL × S_NN )
∪ π[φ ← R.b = Var_S_b] ( R_NN × S_NULL )
∪ π[φ ← Var_R_b = Var_S_b] ( R_NULL × S_NULL )

Figure 4.7: Optimized partitioned query plan for two uncertain attributes.

Example 4.4. Consider the query:

SELECT a, d FROM Sane_R, S

WHERE Sane_R.b = S.b AND (Sane_R.b = 1 OR Sane_R.b = 2);


For this query, the condition φ before partitioning is:

φ ≡ (if R.b IS NULL then Var_R_b else R.b) = S.b
    ∧ ( (if R.b IS NULL then Var_R_b else R.b) = 1
      ∨ (if R.b IS NULL then Var_R_b else R.b) = 2 )

Though there are three different candidate clauses, there is only one distinct condition, R.b IS NULL, and two corresponding partitions:

ψ_1 = ¬(R.b IS NULL)
ψ_2 = R.b IS NULL

As is evident from the examples above, the number of partitions is exponential in the

number of distinct conditions in candidate clauses.

4.1.2 Performance boost

Figure 4.6 gives a hint of how partitioning can help speed up join queries considerably.

The original non-deterministic query is transformed into a union of two queries, one of

which is wholly deterministic and can be evaluated directly by the backend database,

with efficient join strategies. The portion of the data that needs to undergo a Cartesian

product is also significantly reduced. Let’s see an example of this.

Example 4.5. In Example 4.2, notice that the deterministic part of the query can now

be joined by the backend. This is as fast as we can achieve for that part of the data. We

still need to compute the cross-product of the subset of R where R.b IS NULL with S. Say R and S both have 1000 rows, and assume that 0.1% of R has missing values for b.

|R × S| = 1000 × 1000 = 1,000,000

The naive query would have to process n² rows. If we partition, however, the number of rows that come out of the cross-product is:

|R_{b IS NULL} × S| = (0.001 × 1000) × 1000 = 1000

Generally, let’s say we are trying to join two relations R and S with cardinality nR and

nS on an attribute which has xR and xS missing values respectively. Naive evaluation

of such a query would require a Cartesian product with cardinality:


|R × S| = n_R · n_S    (4.1)

If we partition the query, the deterministic component is:

|(R × S)_DET| = (n_R − x_R) · (n_S − x_S)    (4.2)

This portion of the query is eliminated from the product and is joined instead. The

rest of the product still has to be enumerated. The cardinality of the non-deterministic

fragment is the difference between Equation 4.1 and 4.2:

|(R × S)_VAR| = n_R · n_S − (n_R − x_R) · (n_S − x_S)
             = n_R · x_S + n_S · x_R − x_R · x_S    (4.3)

This is equal to the combined cardinality of the non-deterministic sub-trees.

|(R × S)_VAR| = |R_NULL × S_NN| + |R_NN × S_NULL| + |R_NULL × S_NULL|
             = x_R · (n_S − x_S) + (n_R − x_R) · x_S + x_R · x_S
             = (n_S · x_R − x_R · x_S) + (n_R · x_S − x_R · x_S) + x_R · x_S
             = n_R · x_S + n_S · x_R − x_R · x_S    (4.4)
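A quick sanity check of this algebra, as a Scala snippet using the numbers assumed in Example 4.5:

// n_R = n_S = 1000 rows; x_R = 1 (0.1% of R.b missing); x_S = 0.
val (nR, nS, xR, xS) = (1000, 1000, 1, 0)
val naive = nR * nS                                   // Eq. 4.1: 1000000
val det   = (nR - xR) * (nS - xS)                     // Eq. 4.2: 999000, joined not crossed
val byEq  = nR * xS + nS * xR - xR * xS               // Eq. 4.3: 1000
val bySub = xR * (nS - xS) + (nR - xR) * xS + xR * xS // Eq. 4.4's decomposition
assert(naive - det == byEq && byEq == bySub)          // all agree: 1000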

4.1.3 Lineage computation

There is an additional advantage to partitioning queries: lineage computation is faster, because it is no longer data-dependent within a partition. Partitions ensure that the exact source of non-determinism for the query is known from the Boolean predicate which defines the partition.

For example, in the query of Example 4.3, the leftmost sub-tree is known to be deterministic, because the local condition reduces to ⊤. The only source of uncertainty of the R_NULL × S_NN sub-tree is missing values in R. The sources of uncertainty of the R_NULL × S_NULL sub-tree are missing values in both R and S. Without partitions, the projection iterator would have to inspect each tuple to get the sources of uncertainty applicable to that particular tuple. With partitioning, that information is no longer tuple-dependent.


4.2 Inlining

The partitioning approach is useful but limited. From Equation 4.4, we see that the amount of work still to be done after partitioning depends on the number of errors, x, in the base data. This follows from intuition. If x is a function of n, for example when a fixed percentage of the data contains errors, then x = O(n). Consequently, although the partitioning scheme can somewhat reduce the size of the cross-products to be computed, it does not ultimately improve the asymptotic complexity of the query.

As was stated earlier, much of the difficulty in processing join queries in the Mimir

architecture stems from the unidirectional flow of data from the backend to the Mimir

layer. The backend has no information about the non-deterministic data. The inlining

approach is intuitively very simple. Mimir pre-computes the non-deterministic fragments

of the data during lens creation. It materializes these pre-computed values into special

tables in the backend database. Since queries in Mimir follow the maximum-likelihood assumption, the only information that needs to be stored is the set of best guesses for the missing data.

Materializing best guesses in the backend means that the lens query can easily be rewritten so that almost all query processing is done entirely in the backend. The best guesses are stored indexed by the parameters of the Var terms that engender them. The tables they are stored in are named as a function of the Var term, ensuring unique names corresponding to each lens.

Example 4.6. Consider the query:

SELECT a, d FROM Sane_R, S WHERE Sane_R.b = S.b;

With the inlining approach, the guesses for the missing values of R have already been

pre-computed and materialized in a backend table, Sane R b backend. Consequently, the

original query can be rewritten as:

SELECT a, d FROM R, S WHERE

(CASE WHEN R.b IS NULL

THEN

(SELECT data FROM SANE_R_b_backend

WHERE params = ROWID)

ELSE

R.b

END = S.b);


Though this query is still probabilistic from a semantic viewpoint and has a lineage component, from the perspective of query evaluation it consists entirely of data available to the backend, and after de-correlation the query is a true join instead of a cross-product with a non-deterministic filtering predicate.

The inlining approach trades additional space overhead on the order of O(x), the size of the non-deterministic fragment of a lens, for reduced join-query complexity: from quadratic down to log-linear or even linear, depending on the join predicate and the base data. The non-deterministic data is computed and stored during lens creation. Automatic triggers that update these meta-tables when tuples are added to or removed from the base data could conceivably be added, but this functionality does not yet exist in Mimir.
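As an illustration, a hypothetical Scala/JDBC sketch of this materialization step follows; the table and column names (SANE_R_b_backend, params, data) follow Example 4.6, and the bestGuesses map stands in for the lens model's inference:

import java.sql.Connection

def materializeBestGuesses(backend: Connection,
                           bestGuesses: Map[String, Int]): Unit = {
  backend.createStatement().execute(
    "CREATE TABLE SANE_R_b_backend (params VARCHAR PRIMARY KEY, data INT)")
  val insert = backend.prepareStatement(
    "INSERT INTO SANE_R_b_backend (params, data) VALUES (?, ?)")
  for ((rowid, guess) <- bestGuesses) {
    insert.setString(1, rowid) // indexed by the Var term's parameter (ROWID)
    insert.setInt(2, guess)    // the single most likely value
    insert.addBatch()
  }
  insert.executeBatch()
}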

4.2.1 Recovering provenance

The normalization rewrites in Mimir consolidate all query non-determinism in the top-level operator F. Lineage computation is then simple: the presence of Var terms indicates non-determinism, and evaluating F for the current tuple gives both cell- and row-level determinism.

With inlining, the shim query F is rewritten and therefore lost. An alternative way must be found to preserve the lineage information, to support Mimir's central feature of providing provenance and statistical metrics for uncertain data.

Extending the motivation to use the backend to perform all query processing, the computation that determines row and cell uncertainty can also be pushed into the backend as part of the deterministic query. To do this, we extend the schema of the query Q with n + 1 additional projections, n being the number of original projections in Q. Each of these additional columns stores a Boolean formula that a special iterator can use to determine which rows and attributes are non-deterministic.

Formally, a query Q with schema sch(Q) = {a_i} is extended to a query [[Q]]_det with schema sch([[Q]]_det) = {a_i, D_i, φ}. Attribute determinism is computed by evaluating the corresponding D_i column, i.e., isDeterministic(a_i) ↦ evaluate(D_i), and row determinism is computed by evaluating φ.

D_i and φ are populated using a set of rewrites similar to the normalization rewrites in Figure 2.8. D_i is populated through projection. Algorithm 3[31] is used to rewrite columns according to the determinism of the input, with a slight modification that replaces occurrences of an attribute a_i with its corresponding metadata column D_i. These form the inputs to the determinism metadata columns D_i.


Algorithm 3 isDet(E)
Require: E: An expression in either grammar from Figure 2.5.
Ensure: An expression that is true when E is deterministic.
 1: if E ∈ {R, ⊤, ⊥} then
 2:   return ⊤
 3: else if E is Var then
 4:   return ⊥
 5: else if E is Column_i then
 6:   return ⊤
 7: else if E is ¬E_1 then
 8:   return isDet(E_1)
 9: else if E is E_1 ∨ E_2 then
10:   return (E_1 ∧ isDet(E_1)) ∨ (E_2 ∧ isDet(E_2))
11:          ∨ (isDet(E_1) ∧ isDet(E_2))
12: else if E is E_1 ∧ E_2 then
13:   return (¬E_1 ∧ isDet(E_1)) ∨ (¬E_2 ∧ isDet(E_2))
14:          ∨ (isDet(E_1) ∧ isDet(E_2))
15: else if E is E_1 {+, −, ×, ÷, =, ≠, >, ≥, <, ≤} E_2 then
16:   return isDet(E_1) ∧ isDet(E_2)
17: else if E is if E_1 then E_2 else E_3 then
18:   return isDet(E_1) ∧ ((E_1 ∧ isDet(E_2))
19:          ∨ (¬E_1 ∧ isDet(E_3)))

Attribute determinism metadata is computed using the expression returned by isDet, while row determinism metadata is simply the condition column φ from the original formulation of C-Tables; the φ rewrites are almost identical to the corresponding cases in normalization. The complete set of rewrites is shown in Figure 4.8[37]. In essence, the inline approach materializes both non-deterministic data and lineage computation as part of the deterministic result set. A specialized iterator can inspect the values of the additional columns to perform uncertainty analysis similar to ProjectionResultIterator, and strip off the metadata columns before returning results to the frontend.

Example 4.7. In the query given in Example 4.6, the attributes a and d are entirely deterministic. Therefore D_1 ← ⊤ and D_2 ← ⊤, and φ remains unchanged. But consider what the rewrite is when there are non-deterministic projections, as in the query:

SELECT b FROM Sane_R;

In this case φ is simply ⊤. Attribute b's corresponding determinism metadata column is D_b ← isDet(if b IS NULL then Var_b else b). The isDet algorithm reduces this to b NOT NULL, which is the lineage formula. Evaluating this predicate for a particular row marks whether that data element is deterministic: it is deterministic if the predicate is true, and non-deterministic if it is false.


[[π_{a_i←e_i}(Q)]]_det ↦ π_{a_i←e_i, D_i←isDet(e_i), φ←φ}([[Q]]_det)            (4.5) Projection

[[σ_ψ(Q)]]_det ↦ π_{a_i←a_i, D_i←D_i, φ←φ∧isDet(ψ)}(σ_ψ([[Q]]_det))             (4.6) Selection

[[Q_1 × Q_2]]_det ↦ π_{a_i←a_i, D_i←D_i, φ←φ_1∧φ_2}([[Q_1]]_det × [[Q_2]]_det)  (4.7) Cross-product

[[Q_1 ∪ Q_2]]_det ↦ [[Q_1]]_det ∪ [[Q_2]]_det                                    (4.8) Bag union

[[R]]_det ↦ π_{a_i←a_i, D_i←⊤, φ←⊤}(R)                                           (4.9) Relation

Figure 4.8: Inline lineage rewrites.

There is an opportunity to optimize D_i since, in many scenarios, it is data-independent and has the same value for all rows, leading to a lot of duplication. For example, when a query involves entirely deterministic data, or for a missing-value lens, D_i ← ⊤ for all deterministic attributes and D_i ← a_i NOT NULL for all non-deterministic attributes. Combined with partitioning, where non-determinism can be inferred from the query plan itself and is not data-dependent, the metadata columns become redundant and can be optimized out. Such optimizations, though, have not yet been implemented.

4.2.2 Computing statistical metrics

Mimir needs to know which data elements and rows are deterministic in order to highlight uncertain data to the analyst running queries. The next aspect of the work-flow involves computing statistical metrics and human-readable provenance information to provide context for such non-determinism, should the analyst require it. To do this, Mimir needs the shim query F and the original base data. Getting the shim query is easy, because it is generated during normalization and a copy can be retained before the inlining rewrites occur. To get the original deterministic tuple, Mimir uses provenance markers in the form of ROWIDs, which are supported by many backends.

ROWIDs are ideal for retrieving a row in post-processing phases. They are permanently attached to the original record and make random retrieval of rows fast. Mimir employs a few tricks to bring the ROWIDs of all the tuples from all the source relations into Mimir-land in the form of a single data element, MIMIR_ROWID. This data element can then be deconstructed to track the original sources from which the tuple was derived. To do this, the ROWIDs of all relations are projected through as a metadata column. A ROWID passes unchanged through selections. Through joins, ROWIDs are concatenated with a delimiter in between. Through unions, each ROWID is marked according to which side of the union it came from. These changes can easily be reversed to find the original sources of the data, and the ground-truth tuple can be reconstructed.
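As a Scala sketch of this encoding (the delimiter and union tags here are illustrative choices, not Mimir's exact format):

// Composing provenance markers as tuples flow through the plan:
def throughJoin(left: String, right: String): String = left + "|" + right
def throughUnion(rowid: String, fromLeft: Boolean): String =
  (if (fromLeft) "left." else "right.") + rowid

// Reversing the encoding to locate the source tuples of a join result:
def sourcesOfJoin(mimirRowid: String): Seq[String] =
  mimirRowid.split('|').toSeq

// e.g. sourcesOfJoin(throughJoin("R7", "S3")) == Seq("R7", "S3")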

4.3 Hybrid

The partitioning approach and the inlining approach are independent optimizations.

Partitions break the query up into simpler fragments that target a specific cross-section of the base data, while naturally retaining, and even simplifying, the lineage formula for the data. Inlining provides massive boosts to the performance of queries involving joins over non-deterministic predicates, eliminating the need for quadratic cross-products, but at the cost of making lineage tracking more complex. Combining both provides the best of both worlds: queries are first partitioned, and then the partitions involving non-deterministic data are inlined to let the backend database join tables.


Chapter 5

Experiments

In this chapter, the results of the experimental analysis of the two optimizations elucidated in Chapter 4 are summarized. Virtual C-Tables are tested with the classical normalization-based execution model presented in the original lens paper[37], referred to as Mimir classic, and with the partition-, inline-, and hybrid-optimized execution models. All experiments were conducted on two separate backend databases: SQLite and a commercial database, codenamed DBX due to licensing agreements. Most of Mimir is written in Scala, with a few modules of Java code, mostly to interface with Java-based libraries. The specifications of the hosting server are listed in Figure 5.1. Mimir and all database backends were hosted on the same machine to avoid including network latencies in measurements. The experiments demonstrate that:

1. Virtual C-Tables scale well.

2. Virtual C-Tables impose minimal overhead compared to deterministic evaluation.

3. Hybrid evaluation is typically optimal.

OS       RedHat Enterprise Linux 6.5
CPU      16-core 2.6 GHz Intel Xeon
RAM      32 GB
Storage  900 GB 4-disk RAID5

Figure 5.1: Testbench specifications.


5.1 Experimental setup

Datasets were constructed using TPC-H[9]’s dbgen with scaling factors 1 (1 GB) and 0.1

(100 MB). To simulate incomplete data that affects join predicates, 0.1% of foreign key

references in the dataset were replaced with NULL values. Domain constraint repair

lenses over the damaged relations were created to repair these NULL values as non-

materialized views. As a query workload, TPC-H Queries 1, 3, 5, and 9 were used, but

modified in two ways:

1. First, all relations used by the query were replaced by references to the corre-

sponding domain constraint repair lens. These lenses were named by prepend-

ing the original relation name with an x. Therefore the fixed lineitem table is

xlineitem and the fixed orders table is xorders and so on.

2. Second, Mimir does not yet include support for aggregation. Instead, the cost of enumerating the set of results to be aggregated was measured, by stripping out all aggregate functions and computing their parameters.

The modified queries are shown in Figures 5.2, 5.3, 5.4 and 5.5. Execution times were

capped at 30 minutes. Two different backend databases were used for experimentation:

SQLite and a major commercial database DBX. Four different evaluation strategies were

tried: Classic is the naive, normalization-based evaluation strategy, while Partition,

Inline, and Hybrid denote the optimized approaches presented in Sections 4.1, 4.2, and

4.3 respectively. Deterministic denotes the four test queries run directly on the backend

databases with un-damaged data, and serves as a lower bound for how fast each query

can run.

5.2 Discussion

Figures 5.6, 5.7, 5.8 and 5.9[31] show the performance of Mimir running over SQLite

and DBX at scale factors 0.1 and 1. The graphs show Mimir’s overhead relative to the

equivalent deterministic query. Some interesting observations from the results follow.

1. Table scans are unaffected by Mimir. Query 1 is a single-table scan. In all config-

urations, Mimir’s overhead is virtually nonexistent.

2. Partitioning on its own cannot solve the scalability issue in this particular set of

experiments. As noted previously, the performance gains derived from partitioning


depend on the size of the non-deterministic fragment of the data. In this case, since

that scales directly as a function of the data size, partitioning makes no noticeable

difference in query performance.

3. The Inline approach can vastly improve query performance in most cases.

4. The Inline approach can be slow in some cases. The lookups into auxiliary tables

should ideally be decorrelated by the query optimizer resulting in joins. However,

for Query 9, the final queries that are passed to the backend are quite complex.

The query optimizer fails to find viable plans for such queries. For instance, DBX

resolves one of the joins on the Part table as a nested-loop join on the LIKE predicate. Modifying the query to exclude this clause led to the query being evaluated

much faster. Using optimizer tuning on DBX led to a significant increase in per-

formance for Query 9. Another factor in the slow-down in the Inlined queries is an

increased volume of data passing through JDBC to transport provenance markers,

such as ROWIDs from the backend to Mimir. For large joins, and several million

tuples, this additional data, as well as the concatenation operation, can be quite a big overhead, especially if the provenance markers themselves are large.

SELECT returnflag, linestatus,

quantity AS sum_qty,

extendedprice AS sum_base_price,

extendedprice * (1-discount) AS sum_disc_price,

extendedprice * (1-discount)*(1+tax) AS sum_charge,

quantity AS avg_qty,

extendedprice AS avg_price,

discount AS avg_disc

FROM xlineitem

WHERE shipdate <= DATE(’1997-09-01’);

Figure 5.2: Q1

SELECT xorders.orderkey,

xorders.orderdate,

xorders.shippriority,

extendedprice * (1 - discount) AS query3

FROM xcustomer, xorders, xlineitem

WHERE xcustomer.mktsegment = ’BUILDING’

AND xorders.custkey = xcustomer.custkey

AND xlineitem.orderkey = xorders.orderkey

AND xorders.orderdate < DATE(’1995-03-15’)

AND xlineitem.shipdate > DATE(’1995-03-15’);

Figure 5.3: Q3


SELECT n.name, l.extendedprice * (1 - l.discount) AS revenue

FROM

xcustomer c,

xorders o,

xlineitem l,

xsupplier s,

xnation n,

region r

WHERE c.custkey = o.custkey

AND l.orderkey = o.orderkey

AND l.suppkey = s.suppkey

AND c.nationkey = s.nationkey

AND s.nationkey = n.nationkey

AND n.regionkey = r.regionkey

AND r.name = ’ASIA’

AND o.orderdate >= DATE(’1994-01-01’)

AND o.orderdate < DATE(’1995-01-01’);

Figure 5.4: Q5

SELECT

n.name AS nation,

o.orderdate AS o_year,

(

(l.extendedprice * (1 - l.discount))

- (ps.supplycost * l.quantity)

) AS amount

FROM

part p,

xsupplier s,

xlineitem l,

xpartsupp ps,

xorders o,

xnation n

WHERE s.suppkey = l.suppkey

AND ps.suppkey = l.suppkey

AND ps.partkey = l.partkey

AND p.partkey = l.partkey

AND o.orderkey = l.orderkey

AND s.nationkey = n.nationkey

AND (p.name LIKE ’%green%’);

Figure 5.5: Q9


[Bar chart: query time as a percentage of deterministic time (0-300%) for Q1, Q3, Q5, and Q9 under the Classic, Partition, Inline, and Hybrid strategies.]

Figure 5.6: TPC-H S.F. 0.1 SQLite.

[Bar chart: query time as a percentage of deterministic time (0-300%) for Q1, Q3, Q5, and Q9 under the Classic, Partition, Inline, and Hybrid strategies.]

Figure 5.7: TPC-H S.F. 1 SQLite.


[Bar chart: query time as a percentage of deterministic time (0-400%) for Q1, Q3, Q5, and Q9 under the Classic, Partition, Inline, and Hybrid strategies.]

Figure 5.8: TPC-H S.F. 0.1 DBX.

[Bar chart: query time as a percentage of deterministic time (0-1400%) for Q1, Q3, Q5, and Q9 under the Classic, Partition, Inline, and Hybrid strategies.]

Figure 5.9: TPC-H S.F. 1 DBX.


However, the Inline approach is still viable on its own. In contrast to Partitioning,

Query 9 eventually completes with the Inline approach.

5. Combining Partition with Inline can lead to greater performance. Query 9 is a 6-

way join with a cycle in its join graph. Both PartSupp and LineItem have foreign

key references that must be joined together. Consequently, Inlining creates messy

join conditions that neither backend database evaluates efficiently. Partitioning

results in substantially simpler nested queries that both databases accept far more

gracefully.

6. Combining Partition with Inline can also be harmful. Query 5 is a 6-way foreign-

key look-up join where Inline performs better than Hybrid. Each foreign-key is

dereferenced in exactly one condition in the query, allowing Inline to create a query

with a plan that can be efficiently evaluated using hash-joins. The partitioning approach struggles because there are eight candidate clauses that result in 2⁶ = 64 distinct partitions, and therefore a far more complex query.


Chapter 6

State of the Art

Research in probabilistic databases is quite extensive. In Chapter 2, some of the issues

underlying the design of a complete probabilistic database are enumerated. Defining

consistent query semantics, defining an appropriate query language, representing and

modelling uncertainty, tracking the provenance of uncertain data and efficient query

evaluation are significant challenges in this domain.

6.1 Prototype probabilistic databases

In Chapter 2, some of the ways of representing uncertainty in databases were elucidated.

The simplest primitive is Codd’s NULLs[8] and they are still in widespread use today.

Imielinski[16] introduced labelled-nulls and the Conditional-Table or C-Table model to

efficiently encode relationships between unknown data elements. PC-Tables[13] linked

null values to probabilistic random variables.

A number of prototype database systems use C-Tables or variants of them to model uncertainty. A discussion of these systems follows, along with comparisons to Mimir where applicable.

1. Mystiq[6] - The Mystiq project aimed at modelling uncertainty in databases as

probabilistic events. A subset of a relation’s attributes are keys. Row level uncer-

tainty is expressed by associating each tuple with a probability of appearing in the

result set. Tuples with different keys are independent. Tuples with the same keys

are mutually exclusive, representing attribute level uncertainty. Query results are

ranked according to the probability of the tuples appearing in the result set. This

probability computation is exact. The project also presents an interesting result

in computation of exact probabilities in such a database. It can be shown that in


general such computation is #P-hard, but a polynomial time algorithm exists for

certain safe query plans.

Mystiq is similar to Mimir in that it exists as a layer on top of a traditional database connected through JDBC. However, Mystiq aims more at testing the limits of exact inference; the default query execution in Mimir works off best guesses instead. In Mystiq, the user has to specify a probabilistic model through configuration files; in Mimir, lenses automate this process. Mystiq has to fall back on expensive algorithms for unsafe plans, whereas the applicability of the optimizations presented in this thesis is query-agnostic.

2. Trio[2] - The Trio project uses Uncertainty-Lineage Databases (ULDBs). ULDBs

make lineage a first class citizen in the base database. To do this, it uses x-tuples.

x-tuples extend the base data with additional meta columns to store confidences,

lineage information and unique ids to make query processing faster. It also intro-

duces TrioQL, an extension to SQL to expose its uncertainty aware features to the

user.

Trio also exists as an application using Postgres as a backend database. However, the use of x-tables, in addition to stored procedures, means that it is not as portable as Mimir.

3. Orion 2.0[33] - Orion 2.0 supports uncertainty via new datatypes available to

the user that could express both continuous and discrete distributions. It makes

modifications to its backend database to support this new functionality, including

making changes to the catalog manager.

4. MCDB[17] - This system adopts Monte Carlo sampling as the core execution strategy. It includes numerous optimizations to avoid repeated query processing over multiple samples. Samples are encoded in a compressed form called tuple bundles.

Standard relational operators are modified to work with this representation and

new operators are introduced so that the query plan is executed only once.

MCDB gave the original formulation of VG-functions, which Mimir adopts. Lenses are an abstraction that uses VG-functions in the C-Table model to apply data-cleaning algorithms in a composable framework.

5. MayBMS[15] - The MayBMS system is based around the concept of U-Relations,

which are C-Tables where attribute level uncertainty is represented as row-level

uncertainty. Tuples are decomposed into a set of U-relations, each of which enu-

merate the possible assignments to a discrete random variable, and a world table,

W , enumerates the finite probability distributions for each random variable. U-

relations are closed over relational operators and are as expressive as C-Tables. In


a way, this is a form of PC-Tables which can be stored as deterministic tables.

On the upside, this model lets MayBMS use a non-specialized database engine

for probabilistic query processing, but U-Relations can only deal with discrete

probability distributions.

6. Pip[23] - Pip was the first system to extend C-Tables to support continuous prob-

ability distributions, by encoding such information symbolically as VG-functions

adapted from MCDB. Mimir makes use of this formulation as the supporting infras-

tructure for lenses. Pip adopts sophisticated sampling techniques as the primary

mode of query evaluation.

7. Sprout[12] - Sprout builds on the work of Mystiq within the MayBMS ecosystem.

It presents a lazy evaluation strategy to better optimize the safe plans produced

by the Mystiq project to compute exact confidences of uncertain tuples. Sprout is

embedded in Postgres.

8. Jigsaw[24] - Jigsaw is a system primarily meant to enable business analysts to

make decisions based on running what-if scenarios on a probabilistic database

similar to MCDB. To optimize for a given goal, a large number of simulations have

to be run over every parameter of every VG-function, since no correlation between

the input and output of these black boxes can be assumed. Jigsaw comes up with a

dynamic programming strategy to speed this process up by using a technique called

fingerprinting to detect correlations between different VG-functions and reusing

results of correlated simulations.

There are several takeaways from the above discussion. Most of these systems depend on the user explicitly specifying a probability distribution to introduce uncertainty into the database, and apart from Pip, the distributions are always discrete and finite. Most of these systems require modifications to an existing database kernel, heavy changes to the base data to encode uncertainty information, or configuration files and new query languages to leverage the uncertainty and lineage features. This demands significant commitment from new users.

Mimir is unique because it can slot into an existing database with minimal impedance. The optimizations in this thesis enable Mimir to perform uncertainty-aware query processing and lineage tracking efficiently, without any modifications to the base data.


6.2 Model databases

The systems in the previous section provide the underlying infrastructure to work with

probabilistic data, and serve as sandboxes for testing representation schemes, efficient

query evaluation strategies and ranking probabilistic information. Another class of

databases are aimed at providing semantic abstractions that make it easier for users

to apply such systems to solve problems in various domains.

MauveDB[11] provides meaningful insight based on messy and verbose sensor data by constructing model-based views over raw deterministic table data, similar to Mimir. BayesStore[34] gives a formulation for using Bayesian networks to model sensor data and defines the behavior of standard relational operators over such a model. Probabilistic graphical models allow for more flexible inference techniques. Currently in Mimir, one lens is entirely independent of another; but if one lens could use the information of another lens, it could make much better guesses about the data. For example, a type inference lens replaces a typo like 'A3' with NULL when it expects a number. A domain-constraint repair lens has no idea about the original source of that NULL, so there is some information loss. If the domain-constraint repair lens knew the original value was 'A3', it could infer that there is a high probability that the value is 3. Work on such graphical models in Mimir is being pursued.

Plato[21] is a system that generates a model from sensor data that can be used to

extrapolate the ground truth. Using the model instead of the stream allows for efficient

query processing. This is effectively equivalent to a lens in Mimir and work on exactly

such a lens has been pursued in Mimir too.

Lahar[27] is a data warehousing system for Markovian streams. Markovian streams are

annotated with probability values at each time instance, thus representing one of several

possible streams. Query processing over such streams is computationally expensive due

to the volume of data and the number of possible worlds. Caldera[26] is a storage

manager that indexes such streams to accelerate this processing. SimSQL[7]

extends MCDB to work with Markovian streams.

6.3 Provenance

C-Tables use Boolean predicates as an implicit mechanism for tracking the provenance of data. Instead of using Boolean formulas to answer whether a tuple exists in the result set, as in C-Tables, a multiset encoding can be used, where each tuple is associated with an integer representing the number of instances of that tuple in the relation, 0 being the case that the tuple is not in the relation. Boolean formulas correspond to set semantics, whereas the multiset formulation models bag semantics. [13] unified these approaches as K-Tables, in which each tuple is annotated with an element of a semiring K; the two-element Boolean semiring recovers the original Boolean predicates.

In Trio[2], provenance is a first class citizen and is materialized as part of the base data.

This allows for sophisticated tracking of provenance with different granularity.

Orchestra[14] is a system for collaborative data sharing. It is a distributed application with a peer-to-peer architecture, and each node has a local database. Orchestra attempts to keep the local data of each node consistent in the face of updates at other nodes, based on trust policies for sources of information and on accepting user overwrites. Orchestra preserves provenance information by preserving the history at each local node, which can be used during data exchange to resolve conflicts.

GProM[3, 4] is an application specifically geared towards computing provenance across

updates and concurrent transactions on any database that supports logging and querying

previous versions of the database. GProM constructs the provenance of a data element

by essentially rewriting the transaction that caused the change in the data.

6.4 Data curation

Mimir, like Jigsaw[24], only provides the infrastructure to users who want to use probabilistic databases to solve a particular problem. The design of specific algorithms or models is left to domain experts, and these solutions can easily be plugged in as a new type of lens, or as an alternative model for an existing lens. For example, any of the schema-matching algorithms[5, 25, 30, 32] can be used in the schema-matching lens without any changes to the query rewriting components. Similarly, any existing interpolation technique[11, 29] can be used for the domain-constraint repair lens.

There are other systems that aim to automate data curation. Some of these systems are

discussed below:

1. Pay-as-you-go[18], Crowd Entity Resolution[35], Guided Data Repair[36]

- All of these systems are aimed at combining automation with human expertise

to efficiently curate large datasets with a high degree of reliability. These systems

make an initial pass over the data identifying possible problems and solutions to

curation tasks like entity resolution and schema-matching. Some cost-estimation

metric is used to rank each potential solution and its impact on the final data

quality. User feedback is garnered to confirm the automated steps in order of


rank. Mimir generalizes this automate-feedback-confirm-fix model to any number

of data-cleaning or data-integration tasks utilizing probabilistic database theory

in a vendor-agnostic middle-ware application.

2. Wrangler[19] - Wrangler provides a suite of transformations on raw data, and

a visualization of the history and effects of such transformations. It infers most

likely transformations to make the analyst’s work-flow smoother. Mimir’s lenses

are analogous to Wrangler’s transformations. Mimir’s GUI enables the simple

visual work-flow of Wrangler, although at present Mimir makes no inferences on

what lenses are appropriate at a particular stage.


Chapter 7

Conclusion

This thesis extends the work first introduced by the lenses paper[37]. It describes Mimir, a practical implementation of lenses. Mimir can slot into an existing database and perform on-demand data curation. Lenses allow users to apply data cleaning solutions to messy data with very little effort. Queries over such data are naturally non-deterministic and are modeled as Virtual C-Tables. Mimir supports selection, projection, and join queries over such uncertain data.

The user interface of Mimir is designed to be simple and to avoid information overload. Best guesses are made for uncertain data elements, and such uncertainty is clearly marked for the user. Complex models abstracting powerful inference techniques run under the hood and can provide contextual information whenever the user requests it. The addition of a data ingestion pipeline augmented with a type inference lens enables Mimir to quickly assimilate and curate data sourced from different origins.
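
As a rough illustration of what a type inference lens does, the sketch below votes each CSV column to the most specific type that enough of its values match. The type set and threshold are assumptions for exposition; the real lens's heuristics differ.

    // Candidate types, tested by regular expression.
    val typePatterns = Seq(
      "INT"  -> "^-?[0-9]+$".r,
      "REAL" -> "^-?[0-9]*\\.[0-9]+$".r,
      "DATE" -> "^[0-9]{4}-[0-9]{2}-[0-9]{2}$".r
    )

    def inferType(values: Seq[String], threshold: Double = 0.5): String = {
      val nonNull = values.filter(_.nonEmpty)
      val votes = typePatterns.map { case (name, pattern) =>
        name -> nonNull.count(v => pattern.findFirstIn(v).isDefined)
      }
      // Pick the best-supported type; fall back to VARCHAR when no
      // candidate clears the vote threshold.
      votes.filter { case (_, c) => c > 0 && c >= threshold * nonNull.size }
           .sortBy { case (_, c) => -c }
           .headOption.map(_._1).getOrElse("VARCHAR")
    }

    // inferType(Seq("1", "2", "x")) == "INT"; all-text columns stay VARCHAR.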

Join query processing in the original VC-Table model was quadratic, so its performance dropped quickly as the data size grew. Two approaches to making query processing scalable are proposed and implemented. Both aim to push more computation into the backend database, avoiding expensive cross products and replacing them with efficient joins. The partitioning approach delineates deterministic and non-deterministic fragments of the data, so that only non-deterministic data is explicitly processed in Mimir. Inlining pre-materializes non-deterministic data inside the backend database and rewrites queries to look up auxiliary tables, instead of pushing enormous amounts of data upstream into Mimir for post-processing. This all but removes the need for computation in Mimir itself once the initial lenses have been created. Both rewrites are sketched below.
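
The following is an illustrative rendering of the two rewrites for a lens over a table R whose column B may be non-deterministic. The table names, the VGTerm placeholder, and the exact SQL shown are assumptions for exposition; the SQL Mimir actually emits differs.

    // Partitioning: the deterministic fragment stays entirely in the
    // backend; only rows whose B is actually uncertain reach Mimir.
    val partitioned = """
      SELECT A, B         FROM R WHERE B IS NOT NULL   -- deterministic
      UNION ALL
      SELECT A, VGTerm(B) FROM R WHERE B IS NULL       -- handled by Mimir
    """

    // Inlining: best guesses for the lens variables are pre-materialized
    // in an auxiliary table, so the query runs as an ordinary join in the
    // backend and nothing is shipped upstream for post-processing.
    val inlined = """
      SELECT R.A, COALESCE(R.B, G.BEST_GUESS) AS B
      FROM R LEFT JOIN LENS_R_BESTGUESS G ON G.ROW_ID = R.ROW_ID
    """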


Both approaches, and a hybrid combination of the two, are compared to Mimir's classic query processing mode using the TPC-H benchmark. The experimental results indicate that the inlining and hybrid modes are effective: they successfully evaluate queries over uncertain data under the VC-Table model with acceptable performance overhead.


Bibliography

[1] Charu C. Aggarwal. Managing and Mining Uncertain Data. Springer Publishing Company, Incorporated, 2009. ISBN 0387096892, 9780387096896.
[2] Parag Agrawal, Omar Benjelloun, Anish Das Sarma, Chris Hayworth, Shubha U. Nabar, Tomoe Sugihara, and Jennifer Widom. Trio: A system for data, uncertainty, and lineage. In VLDB, 2006. URL http://www.vldb.org/conf/2006/p1151-agrawal.pdf.
[3] B. Arab, Dieter Gawlick, Venkatesh Radhakrishnan, Hao Guo, and Boris Glavic. A generic provenance middleware for database queries, updates, and transactions. TaPP, 2014.
[4] Bahareh Arab, Dieter Gawlick, Vasudha Krishnaswamy, Venkatesh Radhakrishnan, and Boris Glavic. Reenacting transactions to compute their provenance. Technical report, Illinois Institute of Technology, 2014.
[5] Philip A. Bernstein, Jayant Madhavan, and Erhard Rahm. Generic schema matching, ten years later. PVLDB, 2011.
[6] Jihad Boulos, Nilesh N. Dalvi, Bhushan Mandhani, Shobhit Mathur, Christopher Re, and Dan Suciu. MYSTIQ: a system for finding more answers by using probabilities. In SIGMOD, 2005. doi: 10.1145/1066157.1066277. URL http://doi.acm.org/10.1145/1066157.1066277.
[7] Zhuhua Cai, Zografoula Vagena, Luis Perez, Subramanian Arumugam, Peter J. Haas, and Christopher Jermaine. Simulation of database-valued Markov chains using SimSQL. In SIGMOD, 2013. ISBN 978-1-4503-2037-5. doi: 10.1145/2463676.2465283. URL http://doi.acm.org/10.1145/2463676.2465283.
[8] E. F. Codd. Extending the database relational model to capture more meaning. ACM Trans. Database Syst., 4(4):397–434, December 1979. ISSN 0362-5915. doi: 10.1145/320107.320109. URL http://doi.acm.org/10.1145/320107.320109.
[9] Transaction Processing Performance Council. TPC-H specification. http://www.tpc.org/tpch/, 2015.
[10] Nilesh Dalvi and Dan Suciu. Efficient query evaluation on probabilistic databases. The VLDB Journal, 16(4):523–544, October 2007. ISSN 1066-8888. doi: 10.1007/s00778-006-0004-3. URL http://dx.doi.org/10.1007/s00778-006-0004-3.
[11] Amol Deshpande and Samuel Madden. MauveDB: supporting model-based user views in database systems. In SIGMOD, 2006. doi: 10.1145/1142473.1142483. URL http://doi.acm.org/10.1145/1142473.1142483.
[12] Robert Fink, Andrew Hogue, Dan Olteanu, and Swaroop Rath. SPROUT²: a squared query engine for uncertain web data. In SIGMOD, 2011. doi: 10.1145/1989323.1989481. URL http://doi.acm.org/10.1145/1989323.1989481.
[13] Todd J. Green and Val Tannen. Models for incomplete and probabilistic information. In Proceedings of the 2006 International Conference on Current Trends in Database Technology, EDBT '06, pages 278–296, Berlin, Heidelberg, 2006. Springer-Verlag. ISBN 3-540-46788-2, 978-3-540-46788-5. doi: 10.1007/11896548_24. URL http://dx.doi.org/10.1007/11896548_24.
[14] Todd J. Green, Grigoris Karvounarakis, Zachary G. Ives, and Val Tannen. Provenance in ORCHESTRA. DEBU, 33(3):9–16, 2010. URL http://sites.computer.org/debull/A10sept/green.pdf.
[15] Jiewen Huang, Lyublena Antova, Christoph Koch, and Dan Olteanu. MayBMS: a probabilistic database management system. In SIGMOD, pages 1071–1074. ACM, 2009. doi: 10.1145/1559845.1559984. URL http://doi.acm.org/10.1145/1559845.1559984.
[16] Tomasz Imielinski and Witold Lipski, Jr. On representing incomplete information in a relational data base. In Proceedings of the Seventh International Conference on Very Large Data Bases - Volume 7, VLDB '81, pages 388–397. VLDB Endowment, 1981. URL http://dl.acm.org/citation.cfm?id=1286831.1286869.
[17] Ravi Jampani, Fei Xu, Mingxi Wu, Luis Leopoldo Perez, Christopher Jermaine, and Peter J. Haas. MCDB: A Monte Carlo approach to managing uncertain data. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 687–700, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-102-6. doi: 10.1145/1376616.1376686. URL http://doi.acm.org/10.1145/1376616.1376686.
[18] Shawn R. Jeffery, Michael J. Franklin, and Alon Y. Halevy. Pay-as-you-go user feedback for dataspace systems. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 847–860, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-102-6. doi: 10.1145/1376616.1376701. URL http://doi.acm.org/10.1145/1376616.1376701.
[19] Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. Wrangler: Interactive visual specification of data transformation scripts. In SIGCHI, 2011. ISBN 978-1-4503-0228-9. doi: 10.1145/1978942.1979444. URL http://doi.acm.org/10.1145/1978942.1979444.
[20] Grigoris Karvounarakis and Todd J. Green. Semiring-annotated data: Queries and provenance? SIGMOD Rec., 41(3):5–14, October 2012. ISSN 0163-5808. doi: 10.1145/2380776.2380778. URL http://doi.acm.org/10.1145/2380776.2380778.
[21] Yannis Katsis, Yoav Freund, and Yannis Papakonstantinou. Combining databases and signal processing in Plato. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings, 2015. URL http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper26.pdf.
[22] Oliver Kennedy. What if databases could answer incorrectly? http://odin.cse.buffalo.edu/rants/2015-08-13-incorrect-dbs.html, 2015.
[23] Oliver Kennedy and Christoph Koch. PIP: A database system for great and small expectations. In ICDE, pages 157–168, 2010. doi: 10.1109/ICDE.2010.5447879.
[24] Oliver A. Kennedy and Suman Nath. Jigsaw: Efficient optimization over uncertain enterprise data. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, pages 829–840, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0661-4. doi: 10.1145/1989323.1989410. URL http://doi.acm.org/10.1145/1989323.1989410.
[25] Yoonkyong Lee, Mayssam Sayyadian, AnHai Doan, and Arnon S. Rosenthal. eTuner: tuning schema matching software using synthetic scenarios. VLDB J., 16(1):97–122, 2007.
[26] J. Letchner, C. Re, M. Balazinska, and M. Philipose. Access methods for Markovian streams. In ICDE, March 2009. doi: 10.1109/ICDE.2009.21.
[27] Julie Letchner, Christopher Re, Magdalena Balazinska, and Matthai Philipose. Lahar demonstration: Warehousing Markovian streams. Proc. VLDB Endow., 2(2):1610–1613, August 2009. ISSN 2150-8097. doi: 10.14778/1687553.1687605. URL http://dx.doi.org/10.14778/1687553.1687605.
[28] Leonid Libkin. SQL's three-valued logic and certain answers. ACM Trans. Database Syst., 41(1):1:1–1:28, March 2016. ISSN 0362-5915. doi: 10.1145/2877206. URL http://doi.acm.org/10.1145/2877206.
[29] Chris Mayfield, Jennifer Neville, and Sunil Prabhakar. ERACER: A database approach for statistical inference and data cleaning. In SIGMOD, 2010. ISBN 978-1-4503-0032-2. doi: 10.1145/1807167.1807178. URL http://doi.acm.org/10.1145/1807167.1807178.
[30] Robert McCann, Warren Shen, and AnHai Doan. Matching schemas in online communities: A Web 2.0 approach. In ICDE, 2008. doi: 10.1109/ICDE.2008.4497419. URL http://dx.doi.org/10.1109/ICDE.2008.4497419.
[31] Arindam Nandi, Ying Yang, Oliver Kennedy, Boris Glavic, Ronny Fehling, Zhen Hua Liu, and Dieter Gawlick. Mimir: Bringing CTables into practice. CoRR, abs/1601.00073, 2016. URL http://arxiv.org/abs/1601.00073.
[32] Erhard Rahm and Philip A. Bernstein. A survey of approaches to automatic schema matching. VLDB J., 10(4):334–350, 2001.
[33] Sarvjeet Singh, Chris Mayfield, Sagar Mittal, Sunil Prabhakar, Susanne Hambrusch, and Rahul Shah. Orion 2.0: Native support for uncertain data. In SIGMOD, pages 1239–1242. ACM, 2008. ISBN 978-1-60558-102-6. doi: 10.1145/1376616.1376744. URL http://doi.acm.org/10.1145/1376616.1376744.
[34] Daisy Zhe Wang, Eirinaios Michelakis, Minos Garofalakis, and Joseph M. Hellerstein. BayesStore: Managing large, uncertain data repositories with probabilistic graphical models. PVLDB, 1(1):340–351, 2008. ISSN 2150-8097. doi: 10.14778/1453856.1453896. URL http://dx.doi.org/10.14778/1453856.1453896.
[35] Jiannan Wang, Tim Kraska, Michael J. Franklin, and Jianhua Feng. CrowdER: Crowdsourcing entity resolution. PVLDB, 5(11):1483–1494, 2012. ISSN 2150-8097. doi: 10.14778/2350229.2350263. URL http://dx.doi.org/10.14778/2350229.2350263.
[36] Mohamed Yakout, Ahmed K. Elmagarmid, Jennifer Neville, Mourad Ouzzani, and Ihab F. Ilyas. Guided data repair. PVLDB, 4(5):279–289, 2011. ISSN 2150-8097. doi: 10.14778/1952376.1952378. URL http://dx.doi.org/10.14778/1952376.1952378.
[37] Ying Yang, Niccolo Meneghetti, Ronny Fehling, Zhen Hua Liu, and Oliver Kennedy. Lenses: An on-demand approach to ETL. Proc. VLDB Endow., 8(12):1578–1589, August 2015. ISSN 2150-8097. doi: 10.14778/2824032.2824055. URL http://dx.doi.org/10.14778/2824032.2824055.