On Embedding Edit Distance into L_1


Transcript of On Embedding Edit Distance into L 1

Page 1: On Embedding Edit Distance into  L 1


On Embedding Edit Distance into L1

Robert Krauthgamer (Weizmann Institute and IBM Almaden)

Based on joint work

(i) with Moses Charikar,

(ii) with Yuval Rabani,

(iii) with Parikshit Gopalan and T.S. Jayram.

(iv) with Alex Andoni

Page 2: On Embedding Edit Distance into  L 1


x ∈ Σ^n, y ∈ Σ^m

ED(x,y) = Minimum number of character insertions, deletions and substitutions that transform x to y. [aka Levenshtein distance]

Examples:

ED(00000, 1111) = 5

ED(01010, 10101) = 2
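These values can be checked with the standard Levenshtein dynamic program; a minimal sketch (not part of the talk):

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance via the classic O(nm) dynamic program,
    using a single rolling row of the DP table."""
    n, m = len(x), len(y)
    dp = list(range(m + 1))       # dp[j] = ED(x[:i], y[:j]) for current row i
    for i in range(1, n + 1):
        prev = dp[0]              # holds dp[i-1][j-1]
        dp[0] = i
        for j in range(1, m + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # delete x[i-1]
                        dp[j - 1] + 1,                      # insert y[j-1]
                        prev + (x[i - 1] != y[j - 1]))      # substitute
            prev = cur
    return dp[m]

assert edit_distance("00000", "1111") == 5
assert edit_distance("01010", "10101") == 2
```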

Applications:

• Genomics

• Text processing

• Web search

For simplicity, assume m = n.

Edit Distance


Page 3: On Embedding Edit Distance into  L 1


Embedding into L1

An embedding of (X,d) into l1 is a map f : X → l1. It has distortion K ≥ 1 if

d(x,y) ≤ ||f(x)-f(y)||_1 ≤ K·d(x,y)   for all x,y ∈ X

Very powerful concept (when distortion is small)

Goal: Embed edit distance into l1 with small distortion.

Motivation:

• Reduce algorithmic problems to l1, e.g., Nearest-Neighbor Search

• Study a simple metric space without a norm, e.g., the Hamming cube with cyclic shifts
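As a concrete illustration of the definition (helper names are mine, not from the talk), the distortion of a finite map into l1 is the worst expansion divided by the worst contraction:

```python
from itertools import combinations

def l1(u, v):
    """l1 distance between two coordinate tuples."""
    return sum(abs(a - b) for a, b in zip(u, v))

def distortion(points, d, f):
    """Smallest K such that, after rescaling f to be non-contracting,
    d(x,y) <= ||f(x)-f(y)||_1 <= K*d(x,y) for every pair of points."""
    ratios = [l1(f(x), f(y)) / d(x, y) for x, y in combinations(points, 2)]
    return max(ratios) / min(ratios)

# The identity map of the path 0-1-2 onto the line is isometric: distortion 1.
assert distortion([0, 1, 2], lambda x, y: abs(x - y), lambda x: (x,)) == 1.0
# The uniform metric on 3 points mapped to 0,1,2 on the line: distortion 2.
assert distortion([0, 1, 2], lambda x, y: 1, lambda x: (x,)) == 2.0
```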

Page 4: On Embedding Edit Distance into  L 1


Large gap … despite significant effort!

Known Results for Edit Distance: embed ({0,1}^n, ED) into L1

Upper bound:
• O(n^{2/3}) [Bar-Yossef-Jayram-K.-Kumar'04] (previous bound)
• 2^{O(√log n)} [Ostrovsky-Rabani'05]

Lower bound:
• (log n)^{1/2-o(1)} [Khot-Naor'05] and 3/2 [Andoni-Deza-Gupta-Indyk-Raskhodnikova'03] (previous bounds)
• Ω(log n) [K.-Rabani'06]

Page 5: On Embedding Edit Distance into  L 1


Submetrics (Restricted Strings)

Why focus on submetrics of edit distance?
• May admit smaller distortion
• Partial progress towards the general case
• A framework for analyzing non-worst-case instances

Example (à la computational biology): handle only "typical" strings.

Class 1: A string is k-non-repetitive if all its k-substrings are distinct.
• A random 0-1 string is WHP (2 log n)-non-repetitive
• Yields a submetric containing a 1-o(1) fraction of the strings

Class 2: Ulam metric = edit distance on all permutations (here Σ = {1,…,n}).
• Every permutation is 1-non-repetitive
• Note: k-non-repetitive strings embed into Ulam with distortion k.
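The k-non-repetitive condition is easy to check directly; a minimal sketch (function name illustrative):

```python
def is_k_nonrepetitive(s: str, k: int) -> bool:
    """A string is k-non-repetitive if all of its length-k substrings
    are pairwise distinct."""
    subs = [s[i:i + k] for i in range(len(s) - k + 1)]
    return len(subs) == len(set(subs))

assert is_k_nonrepetitive("abcd", 1)       # all symbols distinct: 1-non-repetitive
assert not is_k_nonrepetitive("0101", 2)   # the substring "01" occurs twice
assert is_k_nonrepetitive("0101", 3)       # "010" and "101" are distinct
```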

Theory of Computation Seminar, Computer Science Department


Page 6: On Embedding Edit Distance into  L 1


Near-tight for Ulam … large gap remains for 0-1 strings!

Known Results for Ulam Metric

Embed Ulam metric into L1:
• Upper bound: O(log n) [Charikar-K.'06] (new proof by [Gopalan-Jayram-K.])
• Lower bound: Ω(log n / log log n) [Andoni-K.'07] (actually qualitatively stronger)

Embed ({0,1}^n, ED) into L1:
• Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani'05]
• Lower bound: Ω(log n) [K.-Rabani'06]

Page 7: On Embedding Edit Distance into  L 1


Embedding of permutations

Theorem [Charikar-K.'06]: The Ulam metric of dimension n embeds into l1 with distortion O(log n).

Proof. Define f(P) = (f_{a,b}(P)) with one coordinate per pair of symbols a ≠ b, where f_{a,b}(P) = 1/(P^{-1}(b) - P^{-1}(a)).

Claim 1: ||f(P)-f(Q)||_1 ≤ O(log n)·ED(P,Q).
Suppose Q is obtained from P by moving one symbol, say 's' (the general case follows by applying the triangle inequality on P, P', P'', …, Q). The total contribution of:
• coordinates with s ∈ {a,b} is 2·Σ_k (1/k) ≤ O(log n)
• the other coordinates is Σ_k k·(1/k - 1/(k+1)) ≤ O(log n)

Intuition: sign(f_{a,b}(P)) is an indicator for "a appears before b" in P; thus |f_{a,b}(P)-f_{a,b}(Q)| "measures" whether {a,b} is an inversion in P vs. Q.
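The coordinate map above can be sketched in code; the exact normalization is my reconstruction from the contribution bounds in Claim 1 (one coordinate per symbol pair, magnitude inverse to the positional gap, sign recording which symbol comes first):

```python
from itertools import combinations

def f(P):
    """Embed a permutation P (tuple of distinct symbols): one coordinate per
    pair a < b, with value 1/(pos(b) - pos(a)), so the sign records whether
    a appears before b in P."""
    pos = {s: i for i, s in enumerate(P)}
    return {(a, b): 1.0 / (pos[b] - pos[a]) for a, b in combinations(sorted(P), 2)}

def l1(u, v):
    return sum(abs(u[c] - v[c]) for c in u)

P = (0, 1, 2, 3)
Q = (1, 0, 2, 3)                          # one transposition: small edit distance
fP, fQ = f(P), f(Q)
assert all(v > 0 for v in fP.values())    # in the identity, a < b appears first
assert fQ[(0, 1)] < 0                     # the pair {0,1} is inverted in Q
assert 3.3 < l1(fP, fQ) < 3.4             # small edit => small l1 change
```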

Page 8: On Embedding Edit Distance into  L 1


Embedding of permutations (cont.)

Theorem [Charikar-K.'06]: The Ulam metric of dimension n embeds into l1 with distortion O(log n).

Claim 1: ||f(P)-f(Q)||_1 ≤ O(log n)·ED(P,Q).

Claim 2: ||f(P)-f(Q)||_1 ≥ ½·ED(P,Q).
Assume w.l.o.g. that P = identity. Edit Q into an increasing sequence (and thus into P) using quicksort:
• Choose a random pivot and delete all characters inverted w.r.t. the pivot.
• Repeat recursively on the left and right portions.

Now argue that ||f(P)-f(Q)||_1 ≥ E[#quicksort deletions] ≥ ½·ED(P,Q):
• The surviving subsequence is increasing, hence ED(P,Q) ≤ 2·#deletions.
• For every inversion (a,b) in Q: Pr[a deleted "by" pivot b] ≤ 1/(|Q^{-1}[a] - Q^{-1}[b]| + 1) ≤ 2·|f_{a,b}(P) - f_{a,b}(Q)|.
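The randomized deletion process in Claim 2 can be simulated directly; a minimal sketch (names and the test permutation are illustrative):

```python
import random

def quicksort_deletions(Q):
    """Run the Claim 2 process on a sequence of distinct numbers: pick a random
    pivot, delete every element inverted w.r.t. it, recurse on both sides;
    return the total number of deletions."""
    if len(Q) <= 1:
        return 0
    pivot = random.choice(Q)
    i = Q.index(pivot)
    left = [a for a in Q[:i] if a < pivot]        # survivors before the pivot
    right = [a for a in Q[i + 1:] if a > pivot]   # survivors after the pivot
    deleted = len(Q) - 1 - len(left) - len(right)
    return deleted + quicksort_deletions(left) + quicksort_deletions(right)

# Survivors always form an increasing subsequence, so deletions >= n - LIS(Q);
# here LIS((2,0,1,4,3)) = 3 and n = 5, and the pivot always survives.
random.seed(0)
d = quicksort_deletions([2, 0, 1, 4, 3])
assert 2 <= d <= 4
assert quicksort_deletions([0, 1, 2, 3]) == 0   # already increasing
```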

Page 9: On Embedding Edit Distance into  L 1


Lower bound for 0-1 strings

Theorem [K.-Rabani'06]: Embedding ({0,1}^n, ED) into L1 requires distortion Ω(log n).

Proof sketch: Suppose the metric embeds with distortion D ≥ 1, and let V = {0,1}^n. By the cut-cone characterization of L1, the embedding f into L1 can be written as a nonnegative combination of cut metrics:

||f(x)-f(y)||_1 = Σ_{A⊆V} λ_A · δ_A(x,y),   where λ_A ≥ 0 and δ_A(x,y) = |1_A(x) - 1_A(y)|.

Hence, for every two symmetric probability distributions μ and ν over V×V,

(*)   min_{A⊆V}  E_{(x,y)~ν}[δ_A(x,y)] / E_{(x,y)~μ}[δ_A(x,y)]   ≤   D · E_ν[ED(x,y)] / E_μ[ED(x,y)].

Page 10: On Embedding Edit Distance into  L 1


Lower bound for 0-1 strings (cont.)

Theorem [K.-Rabani'06]: Embedding ({0,1}^n, ED) into L1 requires distortion Ω(log n).

Proof sketch (cont.): We choose:
• μ = uniform over V×V
• ν = ½(ν_H + ν_S), where
  - ν_H = random point + random bit flip (uniform over E_H = {(x,y): ||x-y||_1 = 1})
  - ν_S = random point + a cyclic shift (uniform over E_S = {(x, S(x))})

The RHS of (*) evaluates to O(D/n) by a counting argument.
Main Lemma: For all A ⊆ V, the LHS of (*) is Ω(log n)/n.
Combining the two, Ω(log n)/n ≤ O(D/n), hence D ≥ Ω(log n).

Proved via analysis of Boolean functions on the hypercube.
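A small exact computation (n = 8; all names mine) illustrates why the "close" distribution mixes bit flips with cyclic shifts: flips rarely cross a dictator cut but always cross the parity cut, while a cyclic shift never crosses parity but often crosses a dictator cut, so neither ingredient alone would satisfy the Main Lemma for every cut:

```python
from itertools import product

n = 8
V = list(product([0, 1], repeat=n))

def shift(x):                      # cyclic shift by one position
    return x[1:] + x[:1]

def flip(x, j):                    # flip bit j
    return x[:j] + (1 - x[j],) + x[j + 1:]

def edge_rate(A, edges):           # E[ |1_A(x) - 1_A(y)| ] over the given pairs
    return sum(abs((x in A) - (y in A)) for x, y in edges) / len(edges)

dictator = {x for x in V if x[0] == 1}
parity = {x for x in V if sum(x) % 2 == 1}

flip_edges = [(x, flip(x, j)) for x in V for j in range(n)]
shift_edges = [(x, shift(x)) for x in V]

assert edge_rate(dictator, flip_edges) == 1 / n    # flips rarely cross a dictator
assert edge_rate(dictator, shift_edges) == 0.5     # shifts cross it half the time
assert edge_rate(parity, flip_edges) == 1.0        # every flip changes parity
assert edge_rate(parity, shift_edges) == 0.0       # shifts preserve parity
```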

Page 11: On Embedding Edit Distance into  L 1


Lower bound for 0-1 strings (cont.)

Recall ν = ½(ν_H + ν_S), where
• ν_H = random point + random bit flip
• ν_S = random point + a cyclic shift

Main Lemma: For all A ⊆ V, the LHS of (*) is Ω(log n)/n.

Proof sketch: Assume toward contradiction that it is not, and define f = 1_A.

Page 12: On Embedding Edit Distance into  L 1


Lower bound for 0-1 strings (cont.)

Claim: I_j ≥ 1/n^{1/8}  ⇒  I_{j+1} ≥ 1/(2n^{1/8}).

Proof idea: The bit flip and the cyclic shift commute, up to shifting the flipped coordinate. Starting from x, flipping bit j and then applying the cyclic shift S yields S(x + e_j) = S(x) + e_{j+1}, which is exactly what one gets by shifting first and then flipping bit j+1. Hence influence at coordinate j transfers to coordinate j+1, losing at most a factor of 2.

Page 13: On Embedding Edit Distance into  L 1


Communication Complexity Approach

Alice holds x ∈ Σ^n, Bob holds y ∈ Σ^n; they share randomness and exchange CC_A bits.

Distance Estimation Problem: decide whether d(x,y) ≥ R or d(x,y) ≤ R/A.

Communication complexity model:
• Two-party protocol
• Shared randomness
• Promise (gap) version; A = approximation factor
• CC_A = minimum number of bits needed to decide w.h.p.

Previous communication lower bounds:
• l∞ [Saks-Sun'02, Bar-Yossef-Jayram-Kumar-Sivakumar'04]
• l1 [Woodruff'04]
• Earthmover [Andoni-Indyk-K.'07]

Page 14: On Embedding Edit Distance into  L 1


Communication Bounds for Edit Distance

A tradeoff between approximation and communication.

Theorem [Andoni-K.'07]: Any protocol estimating edit distance within approximation A requires communication CC_A that grows as A shrinks. In particular:

Corollary 1: Approximation A = O(1) requires CC_A ≥ Ω(log log n).

Corollary 2: Communication CC_A = O(1) requires A ≥ Ω̃(log n).

For Hamming distance: CC_{1+ε} = Θ(1/ε^2) [Kushilevitz-Ostrovsky-Rabani'98], [Woodruff'04].
This is the first computational model where edit distance is provably harder than Hamming!

Implications for embeddings: Embedding ED into L1 (or squared-L2) requires distortion Ω̃(log n). Furthermore, this holds for both 0-1 strings and permutations (Ulam).
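For contrast, the Hamming upper bound has a simple sampling flavor. A toy sketch with shared randomness (additive rather than multiplicative error, k = O(1/ε^2) samples; an illustration of the idea, not the actual [Kushilevitz-Ostrovsky-Rabani'98] protocol):

```python
import random

def hamming_estimate(x, y, k, seed):
    """Estimate the normalized Hamming distance by comparing k coordinates
    chosen with shared randomness (the seed); Alice sends her k sampled bits,
    so the communication is k = O(1/eps^2) bits for additive error eps."""
    rng = random.Random(seed)          # both parties derive the same indices
    idx = [rng.randrange(len(x)) for _ in range(k)]
    return sum(x[i] != y[i] for i in idx) / k

x = "0101" * 250
y = "1010" * 250                       # every coordinate differs
assert hamming_estimate(x, y, 400, seed=7) == 1.0
assert hamming_estimate(x, x, 400, seed=7) == 0.0
```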

Page 15: On Embedding Edit Distance into  L 1


Proof Outline

Step 1 [Yao's minimax theorem]: Reduce to distributional complexity.
If CC_A ≤ k, then for every two distributions μ_far, μ_close there is a k-bit deterministic protocol with success probability ≥ 2/3.

Step 2 [Andoni-Indyk-K.'07]: Reduce to 1-bit protocols.
Further to the above, there are Boolean functions s_A, s_B : Σ^n → {0,1} with advantage

Pr_{(x,y)~μ_far}[s_A(x) ≠ s_B(y)] − Pr_{(x,y)~μ_close}[s_A(x) ≠ s_B(y)] ≥ Ω(2^{-k})

Step 3 [Fourier expansion]: Reduce to one Fourier level.
Furthermore, s_A, s_B depend only on fixed positions j_1, …, j_ℓ.

Step 4 [Choose distributions]: Analyze (x,y) projected on these positions.
• Let μ_close, μ_far include ε-noise → handles a high Fourier level.
• Let μ_close, μ_far include (few/more) block rotations → handles a low Fourier level.

Step 5: Reduce Ulam to {0,1}^n. A random mapping Σ → {0,1} works.

Key property: the distribution of (x_{j_1},…,x_{j_ℓ}, y_{j_1},…,y_{j_ℓ}) is "statistically close" under μ_far vs. under μ_close.

Compare this additive analysis to our previous analysis.

Page 16: On Embedding Edit Distance into  L 1


Summary of Known Results

Embed Ulam metric into L1:
• Upper bound: O(log n) [Charikar-K.'06] (new proof by [Gopalan-Jayram-K.])
• Lower bound: Ω(log n / log log n) [Andoni-K.'07] (qualitatively much stronger)

Embed ({0,1}^n, ED) into L1:
• Upper bound: 2^{O(√log n)} [Ostrovsky-Rabani'05]
• Lower bound: Ω(log n) [K.-Rabani'06]

Page 17: On Embedding Edit Distance into  L 1


Concluding Remarks

The computational lens: study distance-estimation problems rather than embeddings.

Open problems:
• Still a large gap for 0-1 strings.
• Variants of edit distance (e.g., edit distance with block moves).
• Rule out other algorithms (e.g., a "CC model" capturing Indyk's NNS for l∞).

Recent progress:
• Bypass L1-embedding by devising new techniques, e.g., using max (l∞) products for NNS under the Ulam metric [Andoni-Indyk-K.].
• Analyze/design "good" heuristics, e.g., smoothed analysis [Andoni-K.].