Mazda R27 1

37
Uncertainty Lineage Data Bases Very Large Data Bases 1975 2006

description

Experience Mazda Zoom Zoom Lifestyle and Culture by Visiting and joining the Official Mazda Community at http://www.MazdaCommunity.org for additional insight into the Zoom Zoom Lifestyle and special offers for Mazda Community Members. If you live in Arizona, check out CardinaleWay Mazda's eCommerce website at http://www.Cardinale-Way-Mazda.com

Transcript of Mazda R27 1

Page 1: Mazda R27 1

UncertaintyLineage

DataBases

VeryLargeDataBases

19752006

Page 2: Mazda R27 1

ULDBs: Databases with ULDBs: Databases with Uncertainty and LineageUncertainty and Lineage

Omar Benjelloun, Anish Das Sarma,

Alon Halevy, Jennifer WidomStanford InfoLab

Page 3: Mazda R27 1

33

MotMotivivationation

• Many applications involve data that is uncertain (approximate, probabilistic, inexact, incomplete,

imprecise, fuzzy, inaccurate, ...)

• Many of the same applications need to track the lineage of their data

Neither uncertainty nor lineage are supported by conventional DBMSs

Coincidence or Fate?Coincidence or Fate?

Page 4: Mazda R27 1

44

Sample Applications Needing Sample Applications Needing Uncertainty and LineageUncertainty and Lineage

•Scientific databases

•Sensor databases

•Data cleaning

•Data integration

• Information extraction

Page 5: Mazda R27 1

55

Trio ProjectTrio Project

Building a new kind of DBMS in which:1. Data2. Uncertainty3. Lineage

are all first-class interrelated concepts

Page 6: Mazda R27 1

66

Lineage and UncertaintyLineage and Uncertainty

•Lots of independent work in lineage and uncertainty (related work at end of talk)

•Turns out: The connection between uncertainty and lineage goes deeper than just a shared need by several applications

Coincidence or Fate?Coincidence or Fate?

Page 7: Mazda R27 1

77

Lineage and UncertaintyLineage and Uncertainty

• Lineage...1) Enables simple and consistent

representation of uncertain data2) Correlates uncertainty in query results

with uncertainty in the input data3) Can make computation over uncertain

data more efficient

Page 8: Mazda R27 1

88

Outline of the TalkOutline of the Talk

•The ULDB data model

•Querying ULDBs

•ULDB properties

•Membership and extraction operations

•Confidences

•Current, related, and future work

Page 9: Mazda R27 1

99

Running Example: Crime Running Example: Crime SolverSolver

Saw(witness,car) Drives(person,car)

Suspects(person) = πperson(Saw ⋈ Drives)

Page 10: Mazda R27 1

1010

UncertaintyUncertainty

•An uncertain database represents a set of possible instances. Examples:– Amy saw either a Honda or a Toyota

– Jimmy drives a Toyota, a Mazda, or both

– Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3

– Hank is a suspect with confidence 0.7

Page 11: Mazda R27 1

1111

Uncertainty in a ULDBUncertainty in a ULDB

1. Alternatives2. ‘?’ (Maybe) Annotations3. Confidences

Page 12: Mazda R27 1

1212

Uncertainty in a ULDBUncertainty in a ULDB

1. Alternatives: uncertainty about value

2. ‘?’ (Maybe) Annotations3. Confidences

Saw (witness,car)

(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)

Three possibleinstances

witness carAmy { Honda, Toyota,

Mazda }

=

Page 13: Mazda R27 1

1313

Six possibleinstances

Uncertainty in a ULDBUncertainty in a ULDB

1. Alternatives2. ‘?’ (Maybe): uncertainty about

presence3. Confidences

Saw (witness,car)

(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)

(Betty, Acura)?

Page 14: Mazda R27 1

1414

Uncertainty in a ULDBUncertainty in a ULDB

1. Alternatives2. ‘?’ (Maybe) Annotations3. Confidences: weighted

uncertaintySaw (witness,car)

(Amy, Honda): 0.5 ∥ (Amy,Toyota): 0.3 ∥ (Amy, Mazda): 0.2

(Betty, Acura): 0.6?

Six possible instances, each with a probability

Page 15: Mazda R27 1

1515

Data Models for Data Models for UncertaintyUncertainty•Our model (so far) is not especially new•We spent some time exploring the

space of models for uncertainty [ICDE 2006]

•Tension between understandability and expressiveness– Our model is understandable– But it is not complete, or even closed under

common operations

Page 16: Mazda R27 1

1616

Closure and Closure and CompletenessCompleteness

•Completeness Can represent all sets of possible

instances

•Closure Can represent results of operations

•Note: Completeness Closure

Page 17: Mazda R27 1

1717

Model (so far) Not ClosedModel (so far) Not Closed

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Drives (person,car)

(Jimmy, Toyota) ∥ (Jimmy, Mazda)

(Billy, Honda) ∥ (Frank, Honda)

(Hank, Honda)

Suspects

Jimmy

Billy ∥ Frank

Hank

Suspects = πperson(Saw ⋈ Drives)

?

??

Does not correctlycapture possibleinstances in theresult

CANNOT

Page 18: Mazda R27 1

1818

LineageLineage to the Rescue to the Rescue

•Lineage: “where data came from”– Internal lineage– External lineage (not covered in this

talk)

• In ULDBs: A function λ from alternatives to sets of alternatives (or external sources)

Page 19: Mazda R27 1

1919

Example with LineageExample with Lineage

ID Saw (witness,car)

11

(Cathy, Honda) ∥ (Cathy, Mazda)

ID Drives (person,car)

21

(Jimmy, Toyota) ∥ (Jimmy, Mazda)

22

(Billy, Honda) ∥ (Frank, Honda)

23

(Hank, Honda)

ID Suspects

31

Jimmy

32

Billy ∥ Frank

33

Hank

?

?

?

Suspects = πperson(Saw ⋈ Drives) λ(31) = (11,2),(21,2)

λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2)

λ(33) = (11,1), 23

Correctly captures possible instances inthe result

Page 20: Mazda R27 1

2020

ULDBsULDBs

1. Alternatives2. ‘?’ (Maybe) Annotations3. Confidences4. Lineage

ULDBs are Closed and Complete

Page 21: Mazda R27 1

2121

Outline of the TalkOutline of the Talk

•The ULDB data model

•Querying ULDBs

•ULDB properties

•Membership and extraction operations

•Confidences

•Current, related, and future work

Page 22: Mazda R27 1

2222

Querying ULDBsQuerying ULDBs

•Query Q on ULDB D

DD

D1, D2, …, DnD1, D2, …, Dn

possibleinstances

Q on eachinstance

representationof instances

Q(D1), Q(D2), …, Q(Dn)Q(D1), Q(D2), …, Q(Dn)

D’D’implementation of Q

D + ResultD + Result

Page 23: Mazda R27 1

2323

Well-Behaved ULDBsWell-Behaved ULDBs

• If we start with a well-behaved ULDB and perform standard queries, it remains well-behaved

• Intuitively (details in paper):– Acyclic: No cycles in the lineage– Deterministic: Non-empty lineages of distinct

alternatives are distinct– Uniform: Alternatives of same tuple are derived

from the same set of tuples

Page 24: Mazda R27 1

2424

ULDB MinimalityULDB Minimality

• Data-minimality• Does every alternative appear in some

possible instance? (no extraneous alternatives)

• Does every maybe-tuple in R not appear in some possible instance? (no extraneous ‘?’s)

•Lineage-minimality

Page 25: Mazda R27 1

2525

Data-Minimality ExamplesData-Minimality ExamplesExtraneous ‘?’

. . .

10 (Billy, Honda) ∥ (Frank, Honda)

. . .

. . .

20

Billy ∥ Frank

. . .

?λ(20,1)=(10,1); λ(20,2)=(10,2)

extraneous

Page 26: Mazda R27 1

2626

Data-Minimality ExamplesData-Minimality ExamplesExtraneous alternative

(Diane, Mazda) ∥ (Diane, Acura)

Dianeextraneous

(Diane, Mazda)

(Diane, Acura)

?

? ?

Page 27: Mazda R27 1

2727

Data-MinimizationData-Minimization

• Extraneous alternative theorem:– An alternative is extraneous iff it is (possibly

transitively) derived from multiple alternatives of the same tuple.

• Extraneous “?” theorem– A “?” on tuple t is extraneous iff

1. it is derived from base tuples without “?”2. t has as many alternatives as the product of the

number in its base tuples

• Minimization algorithm based on the theorems

(see paper)

Page 28: Mazda R27 1

2828

Data-minimalLineage-minimal

Lineage-minimal

Data-minimal

Data-minimize

Queries

Lineage-minimize

ULDB Properties and ULDB Properties and OperationsOperations

Mem

bers

hip

Extraction

Page 29: Mazda R27 1

2929

Membership QuestionsMembership Questions

• Does a given tuple t appear in some (all) possible instance(s) of R ?– Polynomial algorithms based on Data-

minimization

• Is a given table T one of (all of) the possible instances of R ?– NP-Hard

t?, T?

RR

I1, I2, …, InI1, I2, …, In

possibleinstances

Page 30: Mazda R27 1

3030

ExtractionExtraction

•Extraction algorithm in paper

Suspects Eats

Saw Drives

Page 31: Mazda R27 1

3131

Outline of the TalkOutline of the Talk

•The ULDB data model

•Querying ULDBs

•ULDB properties

•Membership and extraction operations

•Confidences

•Current, related, and future work

Page 32: Mazda R27 1

3232

ConfidencesConfidences

• Confidences supplied with base data

• Trio computes confidences on query results– Default probabilistic interpretation– Can choose to plug in different arithmetic

Saw (witness,car)

(Cathy, Honda): 0.6 ∥ (Cathy, Mazda): 0.4

Drives (person,car)

(Jimmy, Mazda): 0.3 ∥ (Bill, Mazda): 0.6

(Hank, Honda): 1.0

Suspects

Jimmy: 0.12 ∥ Bill: 0.24

Hank: 0.6

?

?

?

0.3 0.4

0.6

ProbabilisticProbabilistic Min

Page 33: Mazda R27 1

3333

Query Processing with Query Processing with ConfidencesConfidences• Previous approach (probabilistic

databases)– Each operator computes confidences during

query execution– Only certain query plans allowed

• In ULDBs– Confidence of alternative A is function of

confidences in its transitive lineage

• Our approach: Decouple data and confidence computation– Use any query plan for data computation– Compute confidences on-demand using

lineage– Can give arbitrarily large improvements

Page 34: Mazda R27 1

3434

Current Work: AlgorithmsCurrent Work: Algorithms

•Algorithms: confidence computation, extraneous data, membership questions– Minimize lineage traversal– Memoization– Batch computations

Page 35: Mazda R27 1

3535

The Trio TrioThe Trio Trio

• Data Model– ULDBs (Coming: incomplete relations;

continuous uncertainty; correlation uncertainty)

• Query Language– Simple extension to SQL– Query uncertainty, confidences, and lineage

• System– Did you see our demo? – Version 1: Entirely on top of conventional DBMS– Surprisingly easy and complete, reasonably

efficient

TriQL

Page 36: Mazda R27 1

3636

Brief Related WorkBrief Related Work

•Uncertainty– Modeling

•C-tables [IL84], Probabilistic Databases [CP87], using Nested Relations [F90]

– Systems•ProbView [LLRS97], MYSTIQ [BDM+05],

ORION [CSP05], Trio [BDHW05]

•Lineage– DBNotes [CTV05], Data Warehouses

[CW03]

Page 37: Mazda R27 1

but don’t forgetthe lineage…

Thank YouThank You

Search “stanford trio”

(or, http://i.stanford.edu/trio)