Distance functions and IE -2 William W. Cohen CALD.

Page 1: Distance functions and IE -2 William W. Cohen CALD.

Distance functions and IE -2

William W. Cohen

CALD

Page 2: Distance functions and IE -2 William W. Cohen CALD.

Announcements

• March 25 (Thurs): talk from Carlos Guestrin (Assistant Prof in CALD as of fall 2004) on max-margin Markov nets
– 9:30 am in NSH 1507
– open to the public; tell your friends!

• Datasets:
– some public extraction data is (I hope) readable at /afs/cs/project/extract-learn/repository

• Writeups:
– nothing today
– “distance metrics for text” (three papers): due next Monday, 3/22

Page 3: Distance functions and IE -2 William W. Cohen CALD.

Record linkage: definition

• Record linkage: determine if pairs of data records describe the same entity
– i.e., find record pairs that are co-referent
– Entities: usually people (or organizations or…)
– Data records: names, addresses, job titles, birth dates, …

• Main applications:
– Joining two heterogeneous relations
– Removing duplicates from a single relation

Page 4: Distance functions and IE -2 William W. Cohen CALD.

The data integration problem

• Control flow (modulo details about querying):

– Extract (author, department) pairs from DB1

– Extract (department, wwwServer) pairs from DB2

– Execute the two-step plan to get paper:

• author -> department -> wwwServer

– two steps means matching (linking, integrating, deduping, ....) department names in DB1/DB2

– issues are completely different if user is executing a one-step plan:

• one-step plan is retrieval
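The two-step plan can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy relations, the `example.edu` URLs, the 0.5 threshold, and the use of `difflib.SequenceMatcher` as a stand-in similarity are all hypothetical choices, not anything taken from the lecture.

```python
from difflib import SequenceMatcher

# Hypothetical toy relations; names and URLs are made up for illustration.
DB1 = [("W. Cohen", "Computer Science Dept.")]
DB2 = [
    ("Dept. of Computer Science", "www.cs.example.edu"),
    ("Department of Statistics", "www.stat.example.edu"),
]


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (difflib ratio on lowercased text)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_join(db1, db2):
    """author -> department -> wwwServer using exact string equality."""
    return [(a, w) for a, d1 in db1 for d2, w in db2 if d1 == d2]


def two_step_plan(db1, db2, threshold=0.5):
    """Same plan, but departments are linked by their best soft match."""
    results = []
    for author, dept1 in db1:
        # Pick the DB2 department most similar to the DB1 department name.
        best_dept, best_server = max(db2, key=lambda row: similarity(dept1, row[0]))
        if similarity(dept1, best_dept) >= threshold:
            results.append((author, best_server))
    return results


print(exact_join(DB1, DB2))      # the exact join finds nothing
print(two_step_plan(DB1, DB2))   # the soft join links the two spellings
```

The exact join returns nothing because the department names are spelled differently in the two sources, which is why the second step needs a string distance.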

Page 5: Distance functions and IE -2 William W. Cohen CALD.

String distance metrics: Levenshtein

• Edit-distance metrics
– Distance is the cost of the cheapest sequence of edit commands that transforms s to t.
– Simplest set of operations:

• Copy character from s over to t (cost 0)

• Delete a character in s (cost 1)

• Insert a character in t (cost 1)

• Substitute one character for another (cost 1)

– This is “Levenshtein distance”

Page 6: Distance functions and IE -2 William W. Cohen CALD.

Computing Levenshtein distance – 4

$$
D(i,j) = \min \begin{cases}
D(i-1,j-1) + d(s_i,t_j) & \text{// subst/copy} \\
D(i-1,j) + 1 & \text{// insert} \\
D(i,j-1) + 1 & \text{// delete}
\end{cases}
$$

     C  O  H  E  N
M    1  2  3  4  5
C    1  2  3  4  5
C    2  2  3  4  5
O    3  2  3  4  5
H    4  3  2  3  4
N    5  4  3  3  3

A trace indicates where the min value came from, and can be used to find edit operations and/or a best alignment (may be more than 1)
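A minimal Python sketch of the dynamic program and the trace follows. The function names are my own, and the operation labels follow the operation list on the Levenshtein slide (delete consumes a character of s, insert consumes a character of t); when several alignments are optimal the trace returns just one of them.

```python
def levenshtein_table(s: str, t: str) -> list[list[int]]:
    """Fill D, where D[i][j] is the edit distance between s[:i] and t[:j]."""
    rows, cols = len(s) + 1, len(t) + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        D[i][0] = i          # delete all of s[:i]
    for j in range(cols):
        D[0][j] = j          # insert all of t[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            d = 0 if s[i - 1] == t[j - 1] else 1   # copy costs 0, subst costs 1
            D[i][j] = min(D[i - 1][j - 1] + d, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D


def trace(D, s: str, t: str):
    """Walk back from the last cell and return one optimal edit sequence."""
    i, j = len(s), len(t)
    ops = []
    while i > 0 or j > 0:
        d = 0 if (i > 0 and j > 0 and s[i - 1] == t[j - 1]) else 1
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + d:
            ops.append(("copy" if d == 0 else "subst", s[i - 1], t[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append(("delete", s[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", t[j - 1]))
            j -= 1
    ops.reverse()
    return ops


D = levenshtein_table("MCCOHN", "COHEN")
print(D[-1][-1])                      # distance between the two strings
print(trace(D, "MCCOHN", "COHEN"))    # one cheapest edit sequence
```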

Page 7: Distance functions and IE -2 William W. Cohen CALD.

Smith-Waterman distance - 2

$$
D(i,j) = \max \begin{cases}
0 & \text{// start over} \\
D(i-1,j-1) - d(s_i,t_j) & \text{// subst/copy} \\
D(i-1,j) - G & \text{// insert} \\
D(i,j-1) - G & \text{// delete}
\end{cases}
$$

G = 1, d(c,c) = -2, d(c,d) = +1

     C   O   H   E   N
M   -1  -2  -3  -4  -5
C    0   0  -1  -2  -3
C   +1   0  -1  -2  -3
O   -1  +2  +1   0  -1
H   -2  +1  +4  +3  +2
N   -3   0  +3  +3  +5

Page 8: Distance functions and IE -2 William W. Cohen CALD.

Smith-Waterman distance - 3

$$
D(i,j) = \max \begin{cases}
0 & \text{// start over} \\
D(i-1,j-1) - d(s_i,t_j) & \text{// subst/copy} \\
D(i-1,j) - G & \text{// insert} \\
D(i,j-1) - G & \text{// delete}
\end{cases}
$$

G = 1, d(c,c) = -2, d(c,d) = +1

     C   O   H   E   N
M    0   0   0   0   0
C    0   0   0   0   0
C   +1   0   0   0   0
O    0  +2  +1   0   0
H    0  +1  +4  +3  +2
N    0   0  +3  +3  +5
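The Smith-Waterman recurrence can be sketched in Python as below. It is a sketch, not a reference implementation: the function name and keyword arguments are my own, the defaults follow the slide's parameters (G = 1, d(c,c) = -2, d(c,d) = +1, with d subtracted), and it is run on a short input pair of its own.

```python
def smith_waterman(s: str, t: str, gap: int = 1, match: int = -2, mismatch: int = 1) -> int:
    """Best local-alignment score; `match` and `mismatch` play the role of d(c,c) and d(c,d)."""
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    best = 0
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d = match if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = max(
                0,                       # start over
                D[i - 1][j - 1] - d,     # subst/copy
                D[i - 1][j] - gap,       # skip a character of s
                D[i][j - 1] - gap,       # skip a character of t
            )
            best = max(best, D[i][j])    # the score is the largest cell, not the last one
    return best


print(smith_waterman("xxabcxx", "yyabcyy"))  # only the shared "abc" contributes
print(smith_waterman("abc", "xyz"))          # nothing in common
```

Because every cell can "start over" at 0, unrelated prefixes and suffixes do not pull the score down, which is the property that distinguishes this from the Levenshtein recurrence.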

Page 9: Distance functions and IE -2 William W. Cohen CALD.

Smith-Waterman distance - 5

     c  o  h  e  n  d  o  r  f
m    0  0  0  0  0  0  0  0  0
c    1  0  0  0  0  0  0  0  0
c    0  0  0  0  0  0  0  0  0
o    0  2  1  0  0  0  2  1  0
h    0  1  4  3  2  1  1  1  0
n    0  0  3  3  5  4  3  2  1
s    0  0  2  2  4  4  3  2  1
k    0  0  1  1  3  3  3  2  1
i    0  0  0  0  2  2  2  2  1

dist=5

Page 10: Distance functions and IE -2 William W. Cohen CALD.

Smith-Waterman distance in Monge & Elkan’s WEBFIND (1996)

• String s=A1 A2 ... AK, string t=B1 B2 ... BL

• sim’ is editDistance scaled to [0,1]

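• Monge-Elkan’s “recursive matching scheme” is the average maximal similarity of Ai to Bj:

$$
\mathrm{sim}(s,t) = \frac{1}{K} \sum_{i=1}^{K} \max_{j=1,\dots,L} \mathrm{sim}'(A_i, B_j)
$$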
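A small Python sketch of the recursive matching scheme is given below. Two things are assumptions on my part: substrings are taken to be whitespace-separated tokens, and sim’ is scaled as 1 minus the Levenshtein distance divided by the longer token's length (the slide only says that sim’ is an edit distance scaled to [0,1]).

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def scaled_sim(a: str, b: str) -> float:
    """Assumed form of sim': edit distance mapped into [0, 1], where 1 means identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def monge_elkan(s: str, t: str) -> float:
    """Average, over tokens A_i of s, of the best sim' against any token B_j of t."""
    A, B = s.split(), t.split()
    if not A or not B:
        return 0.0
    return sum(max(scaled_sim(a, b) for b in B) for a in A) / len(A)


print(monge_elkan("william cohen", "cohen william"))  # token order does not matter
print(monge_elkan("william cohen", "wiliam kohen"))   # small typos lower the score a little
```

Note that the score is not symmetric in general, since the average runs over the tokens of s only.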

Page 11: Distance functions and IE -2 William W. Cohen CALD.

Results: S-W from Monge & Elkan

Page 12: Distance functions and IE -2 William W. Cohen CALD.

Affine gap distances

• Smith-Waterman fails on some pairs that seem quite similar:

William W. Cohen

William W. ‘Don’t call me Dubya’ Cohen

Intuitively, a single long insertion is “cheaper” than a lot of short insertions


Page 13: Distance functions and IE -2 William W. Cohen CALD.

Affine gap distances - 2

• Idea:
– Current cost of a “gap” of n characters: nG
– Make this cost: A + (n-1)B, where A is the cost of “opening” a gap, and B is the cost of “continuing” a gap.

Page 14: Distance functions and IE -2 William W. Cohen CALD.

Affine gap distances - 3

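Linear gap cost (one matrix):

$$
D(i,j) = \max \begin{cases}
D(i-1,j-1) - d(s_i,t_j) & \text{// subst/copy} \\
D(i-1,j) - 1 & \text{// insert} \\
D(i,j-1) - 1 & \text{// delete}
\end{cases}
$$

Affine gap cost (three matrices, with d as on the Smith-Waterman slides):

$$
D(i,j) = \max \begin{cases}
D(i-1,j-1) - d(s_i,t_j) \\
IS(i-1,j-1) - d(s_i,t_j) \\
IT(i-1,j-1) - d(s_i,t_j)
\end{cases}
$$

$$
IS(i,j) = \max \begin{cases}
D(i-1,j) - A \\
IS(i-1,j) - B
\end{cases}
\qquad \text{best score in which } s_i \text{ is aligned with a ‘gap’}
$$

$$
IT(i,j) = \max \begin{cases}
D(i,j-1) - A \\
IT(i,j-1) - B
\end{cases}
\qquad \text{best score in which } t_j \text{ is aligned with a ‘gap’}
$$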
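The three-matrix scheme can be sketched in Python as follows. Several choices here are assumptions rather than part of the slides: the boundary conditions (a global alignment starting at D(0,0) = 0), the defaults A = 3 and B = 1, and the cost convention d(c,c) = -2, d(c,d) = +1 borrowed from the Smith-Waterman slides, with d subtracted from the score.

```python
NEG_INF = float("-inf")


def affine_gap_score(s: str, t: str, A: int = 3, B: int = 1,
                     match: int = -2, mismatch: int = 1):
    """Global alignment score where a gap of n characters costs A + (n-1)*B."""
    n, m = len(s), len(t)
    D = [[NEG_INF] * (m + 1) for _ in range(n + 1)]    # s_i aligned with t_j
    IS = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # s_i aligned with a gap
    IT = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # t_j aligned with a gap
    D[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:
                # open a gap after a match/substitution (cost A) or continue one (cost B)
                IS[i][j] = max(D[i - 1][j] - A, IS[i - 1][j] - B)
            if j > 0:
                IT[i][j] = max(D[i][j - 1] - A, IT[i][j - 1] - B)
            if i > 0 and j > 0:
                d = match if s[i - 1] == t[j - 1] else mismatch
                D[i][j] = max(D[i - 1][j - 1], IS[i - 1][j - 1], IT[i - 1][j - 1]) - d
    return max(D[n][m], IS[n][m], IT[n][m])


# One insertion of three characters: 4 matches (+8) minus A + 2B.
print(affine_gap_score("abcd", "abXXXcd", A=3, B=1))
# Setting B = A recovers a linear gap cost of A per character.
print(affine_gap_score("abcd", "abXXXcd", A=3, B=3))
```

With B smaller than A, one long insertion is penalized less than the same number of inserted characters would be under a linear cost, which is the behaviour the "Dubya" example calls for.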

Page 15: Distance functions and IE -2 William W. Cohen CALD.

Affine gap distances - 4

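[Figure: state diagram with three states D, IS and IT. Transitions into D (from D, IS or IT) are labeled -d(si,tj); transitions from D into IS and from D into IT are labeled -A; the self-loops on IS and IT are labeled -B.]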

Page 16: Distance functions and IE -2 William W. Cohen CALD.

Affine gap distances – experiments (from McCallum, Nigam, Ungar, KDD 2000)

• Goal is to match data like this:

Page 17: Distance functions and IE -2 William W. Cohen CALD.

Affine gap distances – experiments (from McCallum, Nigam, Ungar, KDD 2000)

• Hand-tuned edit distance

• Lower costs for affine gaps

• Even lower cost for affine gaps near a “.”

• HMM-based normalization to group title, author, booktitle, etc. into fields (as in Borkar et al.)

Page 18: Distance functions and IE -2 William W. Cohen CALD.

Affine gap distances – experiments

TFIDF Edit Distance

Cora 0.751 0.839

0.721  

OrgName1 0.925 0.633

0.366 0.950

Orgname2 0.958 0.571

0.778 0.912

Restaurant 0.981 0.827

0.967 0.867

Parks 0.976 0.967

0.967 0.967

Page 19: Distance functions and IE -2 William W. Cohen CALD.

TFIDF distance for data integration

Experiments with WHIRL

Page 21: Distance functions and IE -2 William W. Cohen CALD.

Three ways to deal with output of IE systems

• Method 1.
– Do the best you can at mapping the output into a conventional database (or KR system) with a natural schema (info about people, events, etc.)
– Answer any questions with the existing DB

• Method 2.
– Given a query, try and see how much the answer can be constrained by information derived from IE (somehow or other)
– Probably requires some sort of uncertain reasoning.

Page 33: Distance functions and IE -2 William W. Cohen CALD.

• Birds: r(birdName,soundDescription) and 5 short descriptions of sounds (“an owl hooting”)

• Movies: r(movieName,review) and 5 long, 5 short plot descriptions (“sci-fi comedy”, “serious Czech movie”, ...)

Page 35: Distance functions and IE -2 William W. Cohen CALD.

Soft joins with “incompatible schemas”

Page 37: Distance functions and IE -2 William W. Cohen CALD.

WHIRL as a classification-learner

Page 40: Distance functions and IE -2 William W. Cohen CALD.

Classification with unlabeled “Background” instances

Example: instances are paper titles, background instances are paper abstracts

Page 42: Distance functions and IE -2 William W. Cohen CALD.

Very very short examples

Very short examples

Classifying short newswire headlines

Page 43: Distance functions and IE -2 William W. Cohen CALD.

Inference in WHIRL

• “Best-first” search: pick state s that is “best” according to f(s)

• Suppose graph is a tree, and for all s, s’, if s’ is reachable from s then f(s)>=f(s’). Then A* outputs the globally best goal state s* first, and then next best, ...
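The claim about best-first search can be illustrated with a short Python sketch. The search problem here is a made-up toy, not WHIRL's actual state space: a state is a tuple of similarity scores chosen so far (one per hypothetical list), and f is their product, which can only shrink as a state is extended, so the condition f(s) >= f(s’) holds.

```python
import heapq
from itertools import count
from math import prod


def best_first_goals(root, children, is_goal, f, k):
    """Return up to k goal states in order of decreasing f, expanding the best state first."""
    tie = count()                                   # tie-breaker so states are never compared
    frontier = [(-f(root), next(tie), root)]        # max-heap via negated scores
    found = []
    while frontier and len(found) < k:
        _, _, state = heapq.heappop(frontier)
        if is_goal(state):
            found.append(state)                     # no unexplored state can beat this goal
            continue
        for child in children(state):
            heapq.heappush(frontier, (-f(child), next(tie), child))
    return found


# Toy problem (hypothetical): choose one score from each list.
CHOICES = [[0.9, 0.5], [0.8, 0.4]]


def children(state):
    return [state + (c,) for c in CHOICES[len(state)]]


def is_goal(state):
    return len(state) == len(CHOICES)


def f(state):
    return prod(state)      # the product of scores in [0, 1] never grows when extended


print(best_first_goals((), children, is_goal, f, k=3))
```

The goals come out best first, and the search stops after k of them without ever expanding the remaining states.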

Page 44: Distance functions and IE -2 William W. Cohen CALD.

Inference in WHIRL

• Explode p(X1,X2,X3): find all DB tuples <p,a1,a2,a3> for p and bind Xi to ai.

• Constrain X~Y: if X is bound to a and Y is unbound,
– find DB column C to which Y should be bound
– pick a term t in X, find proper inverted index for t in C, and bind Y to something in that index

• Keep track of t’s used previously, and don’t allow Y to contain one.
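The constrain step can be sketched with a plain inverted index in Python. This is an illustration under assumptions: the function names, the whitespace tokenization, and the toy column are mine, and terms are tried in document order here, whereas a real system would presumably pick them by weight.

```python
from collections import defaultdict


def build_inverted_index(column):
    """Map each term to the set of row ids in the column whose text contains it."""
    index = defaultdict(set)
    for row_id, text in enumerate(column):
        for term in set(text.lower().split()):
            index[term].add(row_id)
    return index


def constrain(x_text, column, index):
    """Yield (term, row_id) candidate bindings for Y, one term of X at a time.

    Rows that contain a previously used term are skipped, so each row is
    proposed at most once.
    """
    used = set()
    for term in x_text.lower().split():     # assumption: terms tried in order of appearance
        if term in used:
            continue
        for row_id in sorted(index.get(term, ())):
            if set(column[row_id].lower().split()) & used:
                continue                    # already reachable through an earlier term
            yield term, row_id
        used.add(term)


# Hypothetical column C that Y ranges over.
COLUMN = ["cohen william", "william shakespeare", "leonard cohen"]
INDEX = build_inverted_index(COLUMN)
print(list(constrain("William Cohen", COLUMN, INDEX)))
```

Tracking the used terms is what keeps the search from generating the same binding for Y twice through different terms of X.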

Page 45: Distance functions and IE -2 William W. Cohen CALD.

Inference in WHIRL

Page 46: Distance functions and IE -2 William W. Cohen CALD.

Summary

• WHIRL finds the top k answers to a query

• Queries tend to be easy because they’re either

– unconstrained (e.g. 2-way similarity join) => easy to find 100 or so “good” answers

– highly constrained (e.g. restricted sim join, multi-way join, classification query, ....) => easy to present all the “reasonable” answers to a user

• Data integration usually considers matching two lists of entity descriptions in the abstract

– unconstrained, sometimes underconstrained (what is a match to the end user?), i.e., we don’t know what the final query, and hence final constraints, will turn out to be.

– this is evaluated a lot in experiments, but in an ideal world it would not be: it is the “wrong” problem