Efficient Approximation of Edit Distance

Efficient Approximation of Edit Distance Robert Krauthgamer, Weizmann Institute of Science SPIRE 2013


Transcript of Efficient Approximation of Edit Distance

Page 1: Efficient Approximation  of Edit Distance

Efficient Approximation of Edit Distance

Robert Krauthgamer, Weizmann Institute of Science

SPIRE 2013

Page 2: Efficient Approximation  of Edit Distance

Edit Distance (Levenshtein distance)

Given two strings x ∈ Σ^n, y ∈ Σ^m:

ed(x,y) = minimum number of character operations (insertion/deletion/substitution) that transform x to y.

Applications:

• Computational Biology

• Text processing

• Web search

Examples:

ed(banana, ananas) = 2

ed(00000, 1111) = 5

For simplicity: m = n.

Page 3: Efficient Approximation  of Edit Distance

Dynamic Programming Algorithm

Compute ed(x,y) for input x, y ∈ Σ^n

• O(n^2) time by dynamic programming [WF’74]
• O(n^2/log^2 n) time when |Σ| ≤ O(1) [MP’80]

[Figure: dynamic-programming table D for the strings banana and ananas]

D(i,j) = ed( x[1:i], y[1:j] ) satisfies

D(i,j) = min {
    D(i-1, j-1)        if x[i] = y[j],
    D(i-1, j-1) + 1    if x[i] ≠ y[j] (substitution),
    D(i, j-1) + 1,
    D(i-1, j) + 1
}

Faster algorithms?
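
The dynamic program on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than code from the talk: the function name `edit_distance` is chosen here, and the substitution case is included because ed(x,y) as defined earlier allows substitutions. Only two rows of the table are kept, so memory is linear while time stays quadratic.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance via the O(|x|*|y|) dynamic program."""
    n, m = len(x), len(y)
    # prev[j] holds D(i-1, j); only two rows of the table are kept.
    prev = list(range(m + 1))          # D(0, j) = j insertions
    for i in range(1, n + 1):
        curr = [i] + [0] * m           # D(i, 0) = i deletions
        for j in range(1, m + 1):
            # Diagonal move: free on a match, cost 1 for a substitution.
            diag = prev[j - 1] + (x[i - 1] != y[j - 1])
            curr[j] = min(diag, curr[j - 1] + 1, prev[j] + 1)
        prev = curr
    return prev[m]


print(edit_distance("banana", "ananas"))  # 2
print(edit_distance("00000", "1111"))     # 5
```

Both printed values agree with the examples given for ed on the earlier slide.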

Page 4: Efficient Approximation  of Edit Distance

Focus of This Talk

Approximating edit distance
• Multiplicatively: ed(x,y) ≤ output ≤ A·ed(x,y)
• Decision version: ed(x,y) ≤ r or ed(x,y) > A·r

Different computational models
• RAM, Sampling and query complexity, Sketching, (Streaming)
• Interactions (is it surprising?), Techniques

Variants of the problem

Page 5: Efficient Approximation  of Edit Distance

RAM Model: Sampling

Idea 1: quickly estimate ed(x,y) by sampling a few positions

Intuition:
• If ed(x,y) is small, then “many” large blocks should “match”
• “Test” this by reading few (randomly chosen) blocks
• Apply this idea recursively (inside blocks)

Theorem [Batu-Ergun-Kilian-Magen-Raskhodnikova-Rubinfeld-Sami’03]: Factor n^c “weak” approximation in sublinear time.

Obstacles:
• “Block match” means both “similar pattern” and “similar location”
• Argue that if and only if ed(x,y) is small then …
• Can only distinguish ed(x,y) ≤ n/(8A) from ed(x,y) > n/8.

Best approximation in (near) linear time?

Page 6: Efficient Approximation  of Edit Distance

Learn from Past Success

Suppose x,y are permutations
• Every symbol of Σ appears exactly once
• Consider transpositions = block moves (“block edit distance”)
• No insert/delete (unreasonable), no substitution (not needed)
• Example: bed(0123456789, 0457689123) = 2

An easy estimate (based on breakpoints)
• Compute S_x = {all length-2 substrings of x} = {x[i:i+1] | i=1,…,n-1}
• Lemma: bed(x,y) ≤ ½ |S_x Δ S_y| ≤ 3 bed(x,y)
• Proof idea: Fix x (wlog identity), let y be any permutation
  • Each block move “creates” at most 3 new breakpoints
  • Break y at breakpoints, and move (rearrange) the blocks to get x
• Can compute |S_x Δ S_y| in linear time!!
• Best approximation known in poly-time is 1.375 [Elias-Hartman’06]

[Figure: blocks A B C D]

Open: better approximation in linear-time?
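
The breakpoint estimate from the lemma is easy to try on the slide's own example. The sketch below is illustrative only: the helper names `adjacent_pairs` and `breakpoint_estimate` are chosen here, and permutations are represented as Python strings.

```python
def adjacent_pairs(perm):
    """S_x: the set of length-2 substrings (adjacent pairs) of a permutation."""
    return {(perm[i], perm[i + 1]) for i in range(len(perm) - 1)}


def breakpoint_estimate(x, y):
    """Return (1/2) * |S_x symmetric-difference S_y|, using hash sets."""
    return len(adjacent_pairs(x) ^ adjacent_pairs(y)) / 2


x, y = "0123456789", "0457689123"
print(breakpoint_estimate(x, y))  # 5.0
```

With bed(x,y) = 2 for this pair, the value 5 sits inside the range [bed, 3·bed] = [2, 6] promised by the lemma.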

Page 7: Efficient Approximation  of Edit Distance

Reduction to Hamming Distance

|S_x Δ S_y| = Hamming distance between their characteristic vectors
• In fact, each vector has |Σ|^2 = n^2 coordinates, but only n-1 are non-zero

We thus obtain f : Permutations → {0,1}^{n^2} such that ∀x,y, bed(x,y) ≤ ½ ||f(x)-f(y)||_1 ≤ 3 bed(x,y).

Such a reduction from one metric space (BED on permutations) to another (L1) is called an embedding.
• This one has distortion D=3.
• Known lower bound: distortion into L1 must be ≥ 4/3 [Polak-K.’12]

More benefits of “good” embeddings?

A sweet spot of fruitful interaction between Math/Geometry (“comparing” metric spaces using embeddings) and CS/Algorithms (solving new problems by “reducing” to old ones)
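
To make the embedding concrete, the following sketch materializes the characteristic vector explicitly. The function names and the row-major indexing of coordinate (a, b) are assumptions made for illustration; a practical implementation would keep only the n-1 non-zero coordinates.

```python
def characteristic_vector(perm, alphabet):
    """f(perm) in {0,1}^(n^2): coordinate (a, b) is 1 iff 'ab' is a substring."""
    index = {c: k for k, c in enumerate(alphabet)}
    n = len(alphabet)
    vec = [0] * (n * n)
    for a, b in zip(perm, perm[1:]):
        vec[index[a] * n + index[b]] = 1   # hypothetical row-major layout
    return vec


def l1_distance(u, v):
    return sum(abs(p - q) for p, q in zip(u, v))


alphabet = "0123456789"
fx = characteristic_vector("0123456789", alphabet)
fy = characteristic_vector("0457689123", alphabet)
print(len(fx), sum(fx))          # 100 9
print(l1_distance(fx, fy) / 2)   # 5.0
```

The vector has n^2 = 100 coordinates of which n-1 = 9 are non-zero, and half the L1 distance reproduces the breakpoint count for the pair used as the bed example.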

Page 8: Efficient Approximation  of Edit Distance

Sketching Model

Idea: “summarize” each string separately, then estimate ed(x,y) only from the short sketches s(x), s(y).

Possible at all??

YES for Hamming distance, and even L1/L2 [Indyk-Motwani’98, Kushilevitz-Ostrovsky-Rabani’00]
• Approximation factor A=1+ε using sketch size O(ε^-2) bits
• It’s essentially a “dimension reduction” [Johnson-Lindenstrauss’86]
• Achieved by projection on (inner product with) random direction in space

Consequently, YES also for block edit distance on permutations:

BED on perm. → (f, distortion D=3) → Hamming → (s, approx. 1+ε) → O(ε^-2)-bit sketch

Applies whenever there is an embedding into L1 !!
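
One way to see why short sketches suffice for the decision version of Hamming distance is a random-parity sketch. The code below is a simplified illustration in that spirit and not necessarily the construction of the cited papers; the function names, the inclusion probability 1/(2r), and the number of sketch bits are all assumptions chosen for the demo. If x and y differ in d coordinates, each sketch bit differs with probability (1 - (1 - 1/r)^d)/2, which grows with d, so comparing the observed fraction to its value at d = r separates near pairs from far pairs.

```python
import random


def make_masks(n, r, num_bits, seed=0):
    """Shared randomness: num_bits random subsets of {0..n-1}; each coordinate
    is kept independently with probability 1/(2r)."""
    rng = random.Random(seed)
    p = 1.0 / (2 * r)
    return [[i for i in range(n) if rng.random() < p] for _ in range(num_bits)]


def sketch(bits, masks):
    """One parity bit per mask; computed from a single string only."""
    return [sum(bits[i] for i in mask) % 2 for mask in masks]


def fraction_differing(s1, s2):
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)


def threshold(r):
    """Expected fraction of differing sketch bits at Hamming distance exactly r."""
    return (1 - (1 - 1 / r) ** r) / 2


n, r = 200, 20
masks = make_masks(n, r, num_bits=2000)          # illustrative sketch size
rng = random.Random(1)
x = [rng.randrange(2) for _ in range(n)]
near = [b ^ (i < 5) for i, b in enumerate(x)]    # Hamming distance 5
far = [b ^ (i < 100) for i, b in enumerate(x)]   # Hamming distance 100

sx, s_near, s_far = sketch(x, masks), sketch(near, masks), sketch(far, masks)
print(fraction_differing(sx, s_near) < threshold(r))  # near pair: below threshold
print(fraction_differing(sx, s_far) > threshold(r))   # far pair: above threshold
```

Each sketch depends on one string and the shared masks only, which is what allows the sketches to be computed separately and compared later.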

Page 9: Efficient Approximation  of Edit Distance

Applications of Sketching

Input: large database M, with |M| strings of length n each.

Output: all pairwise distances or closest pair (BED on perm)
• Naively: in time O(|M|^2 n)
• Sketching [3+ε approx., decision version]: sketch each string, then estimate all pairs in time O(|M|n + |M|^2/ε^2)

Practical viewpoint: filtration, i.e., fast pruning of “bad” pairs

Works similarly for Nearest Neighbor Search (NNS):
• Reduce NNS for permutations under BED, to NNS for Hamming (L1)

Q1. More embeddings?
Q2. Sketching directly?
Q3. Lower bounds?

Page 10: Efficient Approximation  of Edit Distance

Embedding ED on Permutations

Theorem [Charikar-K.’06]: Edit distance on permutations of length n embeds into L1 with distortion O(log n).

Proof. Define f(P) = (f_{a,b}(P)), with one coordinate per pair of symbols {a,b} [formula for f_{a,b} missing]

Lemma 1: ||f(P)-f(Q)||_1 ≤ O(log n)·ed(P,Q)
• Suppose Q is obtained from P by moving one symbol, say ‘s’
• General case then follows by applying the triangle inequality on P,P’,P’’,…,Q
• Total contribution of
  • coordinates s∈{a,b} is 2 Σ_k (1/k) ≤ O(log n)
  • other coordinates is Σ_k k(1/k – 1/(k+1)) ≤ O(log n)

Intuition:
• sign(f_{a,b}(P)) is an indicator for “a appears before b” in P
• Thus, |f_{a,b}(P)-f_{a,b}(Q)| “measures” if {a,b} is an inversion in P vs. Q
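
The following sketch explores the embedding numerically under an explicit assumption: it takes f_{a,b}(P) = 1/(P^-1[b] - P^-1[a]) for every pair a < b. That formula is not stated in the text above; it is chosen here because its sign indicates whether a appears before b, matching the stated intuition. The names `embed` and `l1` are likewise illustrative.

```python
from itertools import combinations


def embed(perm):
    """Assumed embedding: one coordinate per pair a < b, equal to
    1 / (position of b - position of a)."""
    pos = {c: i for i, c in enumerate(perm)}
    return {(a, b): 1.0 / (pos[b] - pos[a])
            for a, b in combinations(sorted(perm), 2)}


def l1(fp, fq):
    return sum(abs(fp[k] - fq[k]) for k in fp)


P, Q = "0123", "1230"    # Q moves the symbol '0' to the end, so ed(P, Q) = 2
d = l1(embed(P), embed(Q))
print(round(d, 4))       # 3.6667
```

Only the three coordinates involving the moved symbol '0' change, contributing 4/3 + 1 + 4/3 = 11/3, which is at least ½·ed(P,Q) = 1 as Lemma 2 would require.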

Page 11: Efficient Approximation  of Edit Distance

Embedding ED on Permutations (2)

Recall f(P) = (f_{a,b}(P)) [formula missing]

Lemma 1: ||f(P)-f(Q)||_1 ≤ O(log n)·ed(P,Q)

Lemma 2: ||f(P)-f(Q)||_1 ≥ ½ ed(P,Q)
• Assume wlog that P=identity
• Edit Q into an increasing sequence (thus into P) using quicksort:
  • Choose a random pivot
  • Delete all characters inverted with respect to the pivot
  • Repeat recursively on left and right portions
• Surviving subsequence is increasing
• ed(P,Q) ≤ 2·#deletions
• For every inversion (a,b) in Q: Pr[a deleted “by” pivot b] ≤ 1/|Q^-1[a]-Q^-1[b]+1| ≤ 2 |f_{a,b}(P) – f_{a,b}(Q)|
• Now argue ||f(P)-f(Q)||_1 ≥ E[ #quicksort deletions ] ≥ ½ ed(P,Q)

QED
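
The quicksort-style editing procedure used in the proof of Lemma 2 can be simulated directly. This is a minimal sketch with names chosen for illustration; it takes P to be the identity, so Q is a list of distinct integers and "inverted with respect to the pivot" means larger but to its left, or smaller but to its right.

```python
import random


def quicksort_deletions(seq, rng):
    """Return the subsequence that survives the randomized pivot procedure."""
    if len(seq) <= 1:
        return list(seq)
    k = rng.randrange(len(seq))
    pivot = seq[k]
    # Elements inverted with respect to the pivot are deleted (dropped here).
    left = [v for v in seq[:k] if v < pivot]
    right = [v for v in seq[k + 1:] if v > pivot]
    return quicksort_deletions(left, rng) + [pivot] + quicksort_deletions(right, rng)


rng = random.Random(0)
Q = [3, 0, 4, 1, 5, 2, 6, 7]
survivors = quicksort_deletions(Q, rng)
deletions = len(Q) - len(survivors)
print(survivors, deletions)
```

Whatever pivots are drawn, the survivors form an increasing subsequence of Q; reinserting each deleted symbol at its sorted position then yields the identity, which is where the bound ed(P,Q) ≤ 2·#deletions comes from.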

Page 12: Efficient Approximation  of Edit Distance

Embedding Edit Distance

Theorem [Ostrovsky-Rabani’05]: Edit distance on all strings (not only permutations) embeds into L1 with distortion 2^{O(√log n)}.

• Previously, distortion n^c was known [BarYossef-Jayram-K.-Kumar’04, Batu-Ergun-Sahinalp’06]
• Clever recursive method to match blocks much more accurately
• Penalizes both pattern and location errors
• Not very fast (quadratic time), but influenced later work on near-linear time algorithms [Andoni-Onak’09, Andoni-Onak-K.’10]

Immediate consequences:
• NNS algorithms for edit distance
• Sketching

Page 13: Efficient Approximation  of Edit Distance

Lower Bounds

Theorem [Khot-Naor’05, K.-Rabani’06]: Embedding edit distance into L1 requires distortion Ω(log n)
• Main technique: Fourier analysis [Kahn-Kalai-Linial’88]
• L1 embedding ↔ sparsest-cuts ↔ Boolean functions f:{0,1}^n → {0,1}

Stronger assertion: O(1)-size sketches for edit distance require Ω(log n) approximation, even only for permutations [Andoni-K.’06]
• Actually tradeoff between approximation and sketch-size
• Techniques: communication complexity and Fourier analysis reduce the problem to sketches that are linear functions (of their input x)

Q2’. Sketching vs embedding?

Page 14: Efficient Approximation  of Edit Distance

RAM Model: Asymmetric Sampling

Idea 1’: Read all of y, and sampled positions of x

Motivations:
• Better chances to “obtain” information
• Which y’s are easier/harder?

Sampling issues:
• Focus on query complexity bounds (tight?)
• Adaptive vs non-adaptive queries
• Queries depend on y?

Use dynamic programming in time O(n^{1+ε})?

[Figure: strings x and y]

Page 15: Efficient Approximation  of Edit Distance

Asymmetric Sampling Results [Andoni-Onak-K.’10]

Problem: Decide ed(x,y) ≥ n/10 vs ed(x,y) ≤ n/(10A)

Complexity = #queries into x (unlimited access to y)

[Figure: #queries (Θ(log n), Θ(log^2 n), Θ(log^3 n), …, Θ(log^t n)) plotted against the approximation factor A (n^{1-ε}, n^{1/2-ε}, n^{1/2}, n^{1/3}, n^{1/4}, …, n^{1/t-ε}, n^{1/(t+1)})]

Approximation A              # Queries
(log n)^{O(1/ε)}             O(n^ε), Ω(n^{ε/loglog n})
[n^{1/(t+1)}, n^{1/t-ε}]     O(log^t n), Ω(log^t n)

Page 16: Efficient Approximation  of Edit Distance

Overview of Upper Bound

Theorem 1: Can distinguish ed(x,y) ≥ n/10 vs ed(x,y) ≤ n/(10A) for A=(log n)^{O(1/ε)} approximation with n^ε queries into x (for any ε>0).

Proof structure:
1. Characterize edit distance by “tree-distance” T_xy
   • Parameter b≥2 (degree)
   • T_xy ≈ ed(x,y) up to 6b·log_b n factor
2. Prune the tree to subsample x

[Figure: tree of degree b over x_1, x_2, …, x_n, with sampled positions in x]

Page 17: Efficient Approximation  of Edit Distance

Step 1: Tree Distance

Partition x into b blocks, recursively, for h = log_b n levels

[Figure: x[1:n] is split into x[1:⅓n], x[⅓n:⅔n], x[⅔n:n], and so on down to single characters x[1], x[2], x[3]; a block x[s:s+⅓n] is compared against a block y[u:u+⅓n] of y[1:n]]

T_i(s,u) = tree-distance between x[s:s+ℓ_i] and y[u:u+ℓ_i], where ℓ_i is the block-length at level i

Page 18: Efficient Approximation  of Edit Distance

Tree Distance: Recursive Definition

Recall T_i(s,u) = tree-distance between x[s:s+ℓ_i] and y[u:u+ℓ_i]

• Base case: T_h(s,u) = Hamming(x[s], y[u])
• Output: T_xy = T_0(s=1, u=1)

[Figure: block x[s:s+ℓ_i] of x and block y[u:u+ℓ_i] of y, labeled r_0]
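
Only the base case and the output of the recursion are stated above, so the sketch below has to assume a recursive step. It takes T_i(s,u) to be the sum, over the b child blocks j, of the minimum over integer shifts r of T_{i+1}(s + j·ℓ_{i+1}, u + j·ℓ_{i+1} + r) + |r|. That form, the 0-based indexing, and the rule that a position outside y counts as a mismatch are all assumptions made for illustration, and the code is a plain memoized evaluation with no attempt at efficiency.

```python
from functools import lru_cache


def tree_distance(x: str, y: str, b: int = 2) -> int:
    """T_0 at the root, under the assumed recursive step described above.

    Assumes len(x) == len(y) == b**h for some integer h.
    """
    n = len(x)
    h = 0
    while b ** h < n:
        h += 1
    assert b ** h == n == len(y), "sketch assumes |x| = |y| = b^h"

    @lru_cache(maxsize=None)
    def T(i, s, u):
        if i == h:  # base case: Hamming distance of single characters
            return 0 if 0 <= u < n and x[s] == y[u] else 1
        child_len = b ** (h - i - 1)
        total = 0
        for j in range(b):
            cs, cu = s + j * child_len, u + j * child_len
            # Shifts with |r| > child_len never help: r = 0 costs at most child_len.
            total += min(T(i + 1, cs, cu + r) + abs(r)
                         for r in range(-child_len, child_len + 1))
        return total

    return T(0, 0, 0)


print(tree_distance("abcdefgh", "abcdefgh"))  # 0
print(tree_distance("abcdefgh", "bcdefgha"))
```

Under these assumptions the value is 0 exactly when the two strings are equal, and it never exceeds n because choosing r = 0 everywhere reduces to the Hamming distance.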

Page 19: Efficient Approximation  of Edit Distance

Tree Approximates Edit Distance

Lemma: T_xy ≈ ed(x,y) up to 6b·log_b n factor.

Hierarchical decomposition inspired by earlier approaches [BEKMRRS’03, OR’05]
• All had approximation recurrence of the type A(n) = c·A(n/b) + b for c≥2
• Solves to A(n) ≥ 2^{√log n} factor for every choice of b

Our characterization has no multiplicative loss (c=1): A(n) = A(n/b) + b

Analysis inspired by algorithms for smoothed instances [Andoni-K.’08]
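
The two recurrences can be unrolled to see where the 2^{√log n} barrier comes from and why c = 1 avoids it. The derivation below is a worked sketch that assumes A(1) ≥ 1, logarithms base 2, and recursion depth h = log_b n.

```latex
\begin{align*}
A(n) &= c\,A(n/b) + b
      = c^{h} A(1) + b \sum_{k=0}^{h-1} c^{k},
      \qquad h = \log_b n, \\
A(n) &\ge \max\bigl\{\, 2^{\log n / \log b},\; b \,\bigr\}
      = \max\bigl\{\, 2^{(\log n)/t},\; 2^{t} \,\bigr\}
      \ge 2^{\sqrt{\log n}},
      \qquad t = \log b,\ c \ge 2, \\
\intertext{since the two exponents $(\log n)/t$ and $t$ have product $\log n$, whereas for $c = 1$}
A(n) &= A(n/b) + b = A(1) + b \log_b n = O(b \log_b n).
\end{align*}
```

The last line matches the b·log_b n shape of the factor in the lemma.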

Page 20: Efficient Approximation  of Edit Distance

Step 2: Compute the Tree Distance

For b=2, tree-distance gives O(log n) approximation!
• BUT know only how to compute T-distance in O(n^2) time

Instead, for b=(log n)^{1/ε}, can prune the tree to n^{O(ε)} nodes, and approximate T-distance within factor 1+ε

Pruning: subsample (log n)^{O(1)} children out of each node
• Works only when ed(x,y) ≥ Ω(n)
• Generally, must subsample the tree non-uniformly, using the Precision Sampling Lemma

[Figure: tree of degree b with sampled positions in x]

Page 21: Efficient Approximation  of Edit Distance

Key tool: non-uniform sampling

Goal:
• For unknown a_1, a_2, …, a_n ∈ [0,1]
• Estimate their sum, up to an additive constant error
• Using only “weak” estimates ã_1, ã_2, …, ã_n

Game between a Sum Estimator and an Adversary:
0. (Sum Estimator) fix distribution U
1. (Adversary) fix a_1, a_2, …, a_n (unknown)
2. (Sum Estimator) pick “precisions” u_i (our algorithm: u_i ~ U[0,1] i.i.d.)
3. (Adversary) provide ã_1, ã_2, …, ã_n s.t. |a_i - ã_i| < 1/u_i
4. (Sum Estimator) report S = S(ã_1,…,u_1,…) with |S – ∑a_i| < 1.

Page 22: Efficient Approximation  of Edit Distance

Precision Sampling

Goal: estimate ∑a_i from {ã_i} s.t. |a_i - ã_i| < 1/u_i.

Precision Sampling Lemma: Can achieve WHP additive error 1 and multiplicative error 1.5 with expected precision E_{u_i~U}[u_i] = O(log n).

• Inspired by a technique from [IW’05] for streaming (F_k moments)
• In fact, PSL gives simple & improved algorithms for F_k moments, cascaded (mixed) norms, ℓ_p-sampling problems [AKO’11]
• Also distant relative of Priority Sampling [DLT’07]

Page 23: Efficient Approximation  of Edit Distance

Precision Sampling for Edit Distance

• Apply Precision Sampling to the tree from the characterization, recursively at each node
• If a node has very weak precision, can trim the entire sub-tree

Page 24: Efficient Approximation  of Edit Distance

Fast Approximation Algorithm

Theorem [Andoni-Onak-K.’10]: Can approximate ed(x,y) within factor (log n)^{O(1/ε)} using n^ε queries to x and in time n^{1+ε} (for any ε>0).

• Exponential improvement over previous factor 2^{O(√log n)} [Andoni-Onak’09]
• Asymmetric sampling approach, implemented faster by data structure tricks
• Sampling is non-adaptive, independent of y

Page 25: Efficient Approximation  of Edit Distance

Smoothed Instances

Smooth Instance (x,y) constructed by:
• Start with arbitrary x*, y* ∈ {0,1}^n and their optimal alignment A*
• Replace each position w/probability p by random bit, but respect A*

Theorem [Andoni-K.’08]: Can approximate ed(x,y) within constant factor, in smoothed runtime that is (whp) near-linear n^{1+ε}.
• Some extensions to sublinear time

Techniques: Match blocks of length L=O(1/p · log n) that have edit distance ≤ εL.
• A known heuristic technique (e.g. PatternHunter)
• To find block matches quickly, we use naive NNS algorithm
• Because of smoothing, blocks are likely to be distinct (and even far), so modulo overlaps between blocks, we “effectively” have permutations

Open: Better time n·polylog(n)? Approximation independent of p?
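
The construction of a smoothed instance can be sketched as a small generator. Several details are assumptions made for illustration: the alignment A* is represented as a list of matched index pairs (i, j), "respecting A*" is read as giving both aligned positions the same random decision and the same random bit, and the name `smooth_instance` is hypothetical.

```python
import random


def smooth_instance(x_star, y_star, alignment, p, rng):
    """Perturb (x*, y*) into a smoothed pair (x, y) of bit lists."""
    x, y = list(x_star), list(y_star)
    aligned_x = {i for i, _ in alignment}
    aligned_y = {j for _, j in alignment}
    # Aligned positions are perturbed together, so matches under A* survive.
    for i, j in alignment:
        if rng.random() < p:
            x[i] = y[j] = rng.randrange(2)
    # Unaligned positions are perturbed independently.
    for i in range(len(x)):
        if i not in aligned_x and rng.random() < p:
            x[i] = rng.randrange(2)
    for j in range(len(y)):
        if j not in aligned_y and rng.random() < p:
            y[j] = rng.randrange(2)
    return x, y


x_star = [0, 1, 1, 0, 1, 0, 0, 1]
y_star = [1, 1, 0, 1, 0, 0, 1, 0]                 # x* shifted left by one position
alignment = [(i, i - 1) for i in range(1, 8)]     # x*[i] is matched to y*[i-1]
x, y = smooth_instance(x_star, y_star, alignment, p=0.5, rng=random.Random(0))
print(all(x[i] == y[j] for i, j in alignment))    # True
```

Because every aligned pair stays equal, the alignment A* remains valid for (x, y), so the perturbation cannot increase the number of edits that A* uses.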

Page 26: Efficient Approximation  of Edit Distance

Variants of Edit Distance

Edit distance with block operations
• Admits O(log n·log* n) approximation in near-linear time, via embedding into L1 [Muthukrishnan-Sahinalp’00, Cormode-Muthukrishnan’02]
• Open: Distortion lower bounds? Better approximation in polytime?

Edit distance between trees (generalizes strings)
• Basic operations: insert/delete/relabel vertex
• Can be computed in O(n^3) time [Demaine-Mozes-Rossman-Weimann’07]
• Open: Embedding?

Edit distance with “rich” alphabet
• Can model shape matching [Klein-Tirthapura-Sharvit-Kimia’00]
• Challenge: Cost of basic operation varies with symbols

Page 27: Efficient Approximation  of Edit Distance

Conclusion

Having multiple computational models is fruitful
• New ideas, techniques, viewpoints, applications can come full circle
• Lower bounds (in certain models) highlight limitations of methods
• Explore which instances are easy/hard

“Asymmetric algorithms” can work well for symmetric problems

Connections to other fields (sampling, embeddings, communication complexity, Fourier analysis) and computational problems (NNS)

Had much progress, but still many gaps, and much more to go

Thank You!