Paper presentation @ SEBD '09


Page 1: Paper presentation @ SEBD '09

Time-completeness trade-offs in record linkage using Adaptive Query Processing

Paolo Missier, Alvaro A. A. Fernandes
School of Computer Science, University of Manchester

Roald Lengu, Giovanna Guerrini
DISI, Università di Genova, Italy

Marco Mesiti
DiCo, Università di Milano, Italy

SEBD 2009, Camogli, Italy
June 2009


P. Missier, SEBD, June 2009, Camogli, Italy

Page 3: Paper presentation @ SEBD '09

Context: On-the-fly data integration

• "Situational Applications"
• Data mashups and personal dataspaces

A case of record linkage: $R \bowtie_{A=B} S$

[Figure: example tables R (attribute A) and S (attribute B), in which the value "Manchester" also appears misspelled as "Madchester"]

• slight mismatches in records lead to incomplete integration
– due to different encodings, conventions
– due to errors in data


Page 5: Paper presentation @ SEBD '09

Offline vs online linkage

• Offline record linkage:
– performed once, before queries involving joins
– 1. reconcile R and S on joining attributes R.A, S.B using your favourite record linkage technique: ⟨R, S⟩ → ⟨R', S'⟩
– 2. perform regular equijoin on the transformed tables: $R' \bowtie S'$
➡ ok for tables that can be analysed ahead of the join
➡ ok when multiple queries issued on integrated tables

• Online linkage:
– performed while answering a query
– exact join ⇒ similarity (or approximate) join

Page 6: Paper presentation @ SEBD '09

Record linkage and similarity joins

[Figure: historical timeline, from: N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. Tutorial in SIGMOD '06.]


Page 8: Paper presentation @ SEBD '09

Measuring string similarity using q-grams

• q-grams map string s to a set q(s) of substrings of length q:

Ex.: 3-grams:

q("Manchester") = {'Man', 'anc', 'nch', 'che', 'hes', 'est', 'ste', 'ter'}
q("Madchester") = {'Mad', 'adc', 'dch', 'che', 'hes', 'est', 'ste', 'ter'}

$sim(s_1, s_2) = \frac{|q(s_1) \cap q(s_2)|}{|q(s_1) \cup q(s_2)|}$ (Jaccard coefficient)

$sim(r_1, r_2) < \theta_1 \Rightarrow$ not match
$\theta_2 < sim(r_1, r_2) \Rightarrow$ match
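The q-gram similarity above can be checked with a few lines of Python. This is only a minimal sketch: the function names `qgrams` and `jaccard` are illustrative, no padding characters are added at the string boundaries (matching the slide's example), and the handling of strings shorter than q is an assumption of this sketch.

```python
def qgrams(s, q=3):
    """Return the set of all substrings of length q of s (no boundary padding)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(s1, s2, q=3):
    """Jaccard coefficient of the q-gram sets of s1 and s2."""
    a, b = qgrams(s1, q), qgrams(s2, q)
    if not a and not b:
        # Assumption of this sketch: strings too short to yield any q-gram
        # are compared by plain equality.
        return 1.0 if s1 == s2 else 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # The two strings share 5 of 11 distinct 3-grams.
    print(jaccard("Manchester", "Madchester"))
```

For the slide's example the two sets share 'che', 'hes', 'est', 'ste' and 'ter', so the similarity is 5/11, roughly 0.45.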

Page 9: Paper presentation @ SEBD '09

Similarity symmetric hash join

[Figure: symmetric hash join of R and S: each incoming tuple is inserted into the hash table of its own input and probes the hash table of the other input]

• Main sources of complexity:
– overhead for storing and indexing q-grams
– cost of computing set intersection

• Efficient relational representation: new primitive join operator [CGK06]
• set intersection computed using a variation of the symmetric hash join

[CGK06] S. Chaudhuri, V. Ganti and R. Kaushik, "A primitive operator for similarity joins in data cleaning" (ICDE'06)
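For readers unfamiliar with the underlying operator, here is a minimal sketch of a plain (exact) symmetric hash join in Python. It is not the similarity operator of [CGK06], which works on q-gram representations; the function name, the key-extractor parameters and the strict alternation between inputs are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import zip_longest

_END = object()  # sentinel marking an exhausted input


def symmetric_hash_join(r_tuples, s_tuples, r_key, s_key):
    """Pipelined exact join: yields (r, s) pairs as soon as both sides have arrived."""
    r_table = defaultdict(list)  # hash table over tuples seen so far from R
    s_table = defaultdict(list)  # hash table over tuples seen so far from S
    # Read the two inputs alternately, one tuple from each per step.
    for r, s in zip_longest(r_tuples, s_tuples, fillvalue=_END):
        if r is not _END:
            k = r_key(r)
            r_table[k].append(r)              # insert into own hash table
            for match in s_table.get(k, []):  # probe the other hash table
                yield (r, match)
        if s is not _END:
            k = s_key(s)
            s_table[k].append(s)
            for match in r_table.get(k, []):
                yield (match, s)


if __name__ == "__main__":
    R = [(1, "Manchester"), (2, "London")]
    S = [("Madchester", "x"), ("London", "y"), ("Manchester", "z")]
    for pair in symmetric_hash_join(R, S, lambda r: r[1], lambda s: s[0]):
        print(pair)
```

Each pair is emitted exactly once, by whichever of its two tuples arrives later. Note that the exact version misses the ("Manchester", "Madchester") pair, which is the case the similarity join is meant to recover.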

Page 10: Paper presentation @ SEBD '09

Is full similarity join always necessary?

• Pessimistic: always pays full complexity cost
• Typical mismatch rate in real datasets around 5%

Research Goal: explore optimistic approach
• detect mismatches and react as you go
• requires estimates of incremental join result size
• statistical + reactive ⇒
• expect to sacrifice join result completeness for faster execution

Approach:
- combine exact and approximate join operators using Adaptive Query Processing techniques

Page 11: Paper presentation @ SEBD '09

Autonomic computing framework

[KC03] J. O. Kephart and D. M. Chess. The vision of autonomic computing. IEEE Computer, 36(1):41–50, 2003.


Page 15: Paper presentation @ SEBD '09

Autonomic computing framework

Control loop: monitor → assess → respond
• monitor: monitor incremental join result size
• assess: estimate join result size; compute divergence from expected size
• respond: swap join operators

• when using exact join:
– if observed/estimated sizes diverge "too much", then switch to approximate join
• when using approximate join:
– if observed/estimated converge, then switch to exact join
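The switching rule above can be sketched as a small decision function. This is a deliberate simplification: the fixed relative `tolerance` stands in for "too much" and is an assumption of this sketch (the talk uses a statistical outlier test instead), as are the function name and the string labels for the two operators.

```python
def choose_operator(current, observed, expected, tolerance=0.2):
    """Pick the join operator for the next step from observed vs. estimated result size.

    current   -- "exact" or "approximate" (labels are hypothetical)
    observed  -- join result size observed so far
    expected  -- estimated join result size at the same point
    tolerance -- hypothetical threshold on the relative shortfall
    """
    if expected <= 0:
        return current  # nothing to compare against yet
    # Relative shortfall of the observed result size w.r.t. the estimate.
    divergence = (expected - observed) / expected
    if current == "exact" and divergence > tolerance:
        return "approximate"  # sizes diverge "too much"
    if current == "approximate" and divergence <= tolerance:
        return "exact"        # sizes have converged again
    return current


if __name__ == "__main__":
    print(choose_operator("exact", observed=50, expected=100.0))
    print(choose_operator("approximate", observed=98, expected=100.0))
```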

Page 16: Paper presentation @ SEBD '09

Technical approach and challenges

• Assess:
– Need additional assumption for result size estimation
– Estimating result size at specific points during join execution

• Respond:
– When and how can physical join operators be switched?
– Can we avoid loss of work?
• operator replacement in pipelined query plans [EFP06]

[EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT Workshops 2006, LNCS 4254

Page 17: Paper presentation @ SEBD '09

Assessment

• Expectation of matching records
• Used to derive simple result size estimation model

[Figure: symmetric hash join, again, with R (parent) and S (child)]

• When there are no mismatches: after scanning n < |S| tuples on S:
P(a=x in S has been matched) = P(tuple c=x is in top n of R) = n/|R|

Thus, join result size $O_n$ after n tuples is a binomial random variable:

$O_n \sim \mathrm{bin}\left(n, \frac{n}{|R|}\right)$
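As a worked illustration of the binomial model, the expected result size and its variance follow from the standard binomial formulas. The numbers |R| = 1000 and n = 100 are made up for the example, and p(n) is used here as shorthand for the per-tuple match probability n/|R|.

```latex
\begin{align*}
O_n &\sim \mathrm{bin}\!\left(n, p(n)\right), \qquad p(n) = \frac{n}{|R|} \\
\mathbb{E}[O_n] &= n \, p(n) = \frac{n^2}{|R|} \\
\mathrm{Var}[O_n] &= n \, p(n) \, \bigl(1 - p(n)\bigr) \\
\text{e.g. } |R| = 1000,\ n = 100: \quad p(n) &= 0.1, \quad \mathbb{E}[O_n] = 10, \quad \mathrm{Var}[O_n] = 9
\end{align*}
```

So, under the no-mismatch assumption, about 10 result tuples would be expected after the first 100 tuples of S in this example.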


Page 20: Paper presentation @ SEBD '09

Detecting divergent observed result size

Observation is outlier wrt expected result size $O_n$ ⇒ divergence:

$P_{n,p(n)}(O_n \le O) \le \theta_{out}$

where $P_{n,p(n)}(\cdot)$ is the binomial cdf with parameters n, p(n), and O is the observed result size

[Figure: outlier detection on experimental datasets with various mismatch patterns]

Note: when the data does not follow our referential constraint hypothesis, the model leads to a pure similarity join
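The outlier test can be sketched with the standard library alone. This sketch assumes p(n) = n/|R| as in the assessment model, reads the test as "the lower-tail probability of the observed size is at most a threshold", and uses hypothetical names (`binom_cdf`, `is_divergent`) and a hypothetical default threshold of 0.05; none of these values come from the talk.

```python
from math import exp, lgamma, log


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ bin(n, p), summing pmf terms computed in log space."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0  # X = n with certainty, and k < n here
    total = 0.0
    for i in range(k + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1.0 - p))
        total += exp(log_pmf)
    return min(total, 1.0)


def is_divergent(observed, n, size_r, theta_out=0.05):
    """True if the observed result size is a low outlier under bin(n, n/|R|)."""
    p = n / size_r  # assumed match probability after n tuples
    return binom_cdf(observed, n, p) <= theta_out


if __name__ == "__main__":
    # With |R| = 200 and n = 100, about 50 results are expected.
    print(is_divergent(50, n=100, size_r=200))
    print(is_divergent(30, n=100, size_r=200))
```

Observing about 50 results is unremarkable under the model, whereas 30 falls far into the lower tail and would be flagged as divergence.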

Page 21: Paper presentation @ SEBD '09

Condition for operator replacement

• Goal of AQP:
– switch operators without loss of intermediate work
• sufficient condition: switch when the operator reaches a quiescent state [EFP06]

[EFP06] K. Eurviriyanukul, A. A. A. Fernandes, and N. W. Paton. A foundation for the replacement of pipelined physical join operators in adaptive query processing. In EDBT Workshops 2006, LNCS 4254

Page 22: Paper presentation @ SEBD '09

Adaptivity: responder state machine

Each state corresponds to a combination of physical join operators:

• left: exact, right: exact
• left: approximate, right: approximate
• left: approximate, right: exact
• left: exact, right: approximate

Page 23: Paper presentation @ SEBD '09

Rationale for state transitions

States (l = left, r = right; ex = exact, ap = approximate): lex/rex, lap/rap, lex/rap, lap/rex

Transitions are triggered by:
• evidence that left and/or right input perturbed
• evidence that left and/or right input no longer perturbed

this is formalized in terms of predicates on the observable variables
- outlier detection and more (omitted)
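The four-state responder can be sketched as a small Python state machine. The actual transition predicates are omitted in the talk, so this sketch replaces them with a hypothetical three-valued evidence flag per input (True = perturbed, False = no longer perturbed, None = no new evidence), and it assumes that any state can be reached from any other.

```python
from enum import Enum


class JoinState(Enum):
    """One state per combination of (left operator, right operator)."""
    LEX_REX = ("exact", "exact")
    LAP_REX = ("approximate", "exact")
    LEX_RAP = ("exact", "approximate")
    LAP_RAP = ("approximate", "approximate")


def next_state(state, left_evidence=None, right_evidence=None):
    """Return the next state given (hypothetical) evidence flags for each input."""
    left, right = state.value
    if left_evidence is not None:
        left = "approximate" if left_evidence else "exact"
    if right_evidence is not None:
        right = "approximate" if right_evidence else "exact"
    return JoinState((left, right))  # look the member up by its value


if __name__ == "__main__":
    s = JoinState.LEX_REX
    s = next_state(s, left_evidence=True)    # left input looks perturbed
    print(s)
    s = next_state(s, left_evidence=False)   # left input no longer perturbed
    print(s)
```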


Page 25: Paper presentation @ SEBD '09

Experimental evaluation

• Marginal gain of hybrid algorithm:
– level of completeness

$g_{rel} = \frac{r_{abs} - r}{R - r}$

• R: result size for approx join only
• r: result size for exact only
• $r_{abs}$: result size actually observed

• Marginal cost:
– baseline: exact join throughout

• model marginal cost of hybrid algorithm:
– unit cost of executing one step in one state (experimental)
– number of steps in each state
– unit state transition cost (experimental)
– number of state transitions over entire join
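The two evaluation measures can be sketched as follows. The gain formula is the one given above. The cost function simply adds up the four listed components, which is an assumption of this sketch, since the exact form of the cost model is not shown; all function and parameter names are illustrative.

```python
def relative_gain(r_abs, r_exact, r_approx):
    """g_rel = (r_abs - r) / (R - r): fraction of the extra matches actually recovered."""
    if r_approx == r_exact:
        raise ValueError("approximate and exact joins return the same size; gain undefined")
    return (r_abs - r_exact) / (r_approx - r_exact)


def hybrid_cost(steps_per_state, unit_step_cost, n_transitions, unit_transition_cost):
    """Assumed additive model: per-state step costs plus state transition costs."""
    step_cost = sum(steps_per_state[s] * unit_step_cost[s] for s in steps_per_state)
    return step_cost + n_transitions * unit_transition_cost


if __name__ == "__main__":
    # Made-up numbers: exact join finds 100 pairs, approximate 200, hybrid 150.
    print(relative_gain(r_abs=150, r_exact=100, r_approx=200))
    print(hybrid_cost({"lex/rex": 10, "lap/rap": 5},
                      {"lex/rex": 1.0, "lap/rap": 4.0},
                      n_transitions=2, unit_transition_cost=3.0))
```

A gain of 0 means the hybrid did no better than the exact join, and a gain of 1 means it recovered everything the pure approximate join would have found.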

Page 26: Paper presentation @ SEBD '09

Test datasets

Datasets chosen as representative of 4 distinct patterns

we expect our results to vary:
• uniform perturbation: evidence grows slowly => slow reaction
• bursty perturbation: strong evidence => timely reaction

Page 27: Paper presentation @ SEBD '09

Results


Page 29: Paper presentation @ SEBD '09

Conclusions

• Optimistic approach to online record linkage
– Based on implicit referential integrity assumption
– When assumption not true, goes back to pessimistic

• Technical approach based on autonomic computing
– Adaptive query processing
– Mix of exact / approximate physical join operators

• Applications: on-the-fly integration scenarios (mashups, personal dataspaces, sensor data streams)

• Need to relax the referential integrity assumption -- use other sources for result size estimation