Overcoming the L1 Non-Embeddability Barrier

Robert Krauthgamer (Weizmann Institute)

Joint work with Alexandr Andoni and Piotr Indyk (MIT)

Algorithms on Metric Spaces

Fix a metric M. Fix a computational problem. Solve the problem under M.

Metrics:

Hamming distance

Ulam metric

Edit distance: ED(x,y) = minimum number of edit operations that transform x into y. Edit operation = insert/delete/substitute a character.

ED(0101010, 1010101) = 2

Earthmover distance

…

Problems:

Nearest Neighbor Search: Preprocess n strings, so that given a query string, can find the closest string to it.

Compute distance between x,y

…
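The definition of ED above can be checked with the textbook dynamic program. This is a minimal sketch, assuming unit cost for each insert, delete, and substitute; the function name `edit_distance` is illustrative and not from the talk.

```python
def edit_distance(x: str, y: str) -> int:
    """Edit distance with unit-cost insert, delete and substitute."""
    # prev[j] holds the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            cur.append(min(
                prev[j] + 1,               # delete cx
                cur[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


print(edit_distance("0101010", "1010101"))  # the example from the slide
```

The table costs O(|x|·|y|) time, which is the kind of cost the approximate methods in this talk try to avoid.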


Motivation for Nearest Neighbor

Many applications:

Image search (Euclidean dist, Earth-mover dist)

Processing of genetic information, text processing (edit dist.)

many others…

[Figure: generic search engine]


A General Tool: Embeddings

An embedding of M into a host metric (H,d_H) is a map f : M→H that preserves distances approximately.

It has distortion A ≥ 1 if for all x,y: d_M(x,y) ≤ d_H(f(x),f(y)) ≤ A·d_M(x,y)

Why? If H is “easy” (= can efficiently solve computational problems like NNS), then we get good algorithms for the original space M!

[Figure: the map f]
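The distortion of a map on a finite point set can be measured directly. This is a minimal sketch, assuming the points are distinct and f is injective (so no distance is zero), and allowing f to be rescaled, so the distortion is the worst expansion times the worst contraction. The names `distortion`, `d_M`, `d_H` and the two toy examples are illustrative.

```python
from itertools import combinations


def distortion(points, d_M, d_H, f):
    """Smallest A with d_M <= c * d_H(f(.), f(.)) <= A * d_M for some scale c."""
    expansion = 0.0    # max of d_H / d_M over all pairs
    contraction = 0.0  # max of d_M / d_H over all pairs
    for x, y in combinations(points, 2):
        dm, dh = d_M(x, y), d_H(f(x), f(y))
        expansion = max(expansion, dh / dm)
        contraction = max(contraction, dm / dh)
    return expansion * contraction


# Toy example 1: Hamming distance on bit strings maps isometrically into l1.
hamming = lambda x, y: sum(a != b for a, b in zip(x, y))
l1 = lambda u, v: sum(abs(a - b) for a, b in zip(u, v))
to_vector = lambda s: [int(c) for c in s]
print(distortion(["000", "011", "101", "110"], hamming, l1, to_vector))

# Toy example 2: the 4-cycle (shortest-path metric) laid out on a line.
cycle = lambda i, j: min(abs(i - j), 4 - abs(i - j))
line = lambda a, b: abs(a - b)
print(distortion([0, 1, 2, 3], cycle, line, lambda i: i))
```

In the second example the cycle edge between 0 and 3 is stretched to length 3 on the line, which is what drives the distortion up.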


Host space?

Popular target metric: ℓ1

Have efficient algorithms:

Distance estimation: O(d) for d-dimensional space (often less)

NNS: c-approx with O(n^{1/c}) query time and O(n^{1+1/c}) space [IM98]

Powerful enough for some things…

Metric | References (upper; lower) | Upper bound | Lower bound

Edit distance over {0,1}^d | [OR05]; [KN05, KR06, AK07] | 2^{O(√log d)} | Ω(log d)

Ulam (= edit distance over permutations) | [CK06]; [AK07] | O(log d) | Ω̃(log d)

Block edit distance over {0,1}^d | [MS00, CM07]; [Cor03] | O(log d) | 4/3

Earthmover distance in ℝ^2 (sets of size s) | [Cha02, IT03]; [NS07] | O(log s) | Ω(log^{1/2} s)

Earthmover distance in {0,1}^d (sets of size s) | [AIK08]; [KN05] | O(log s · log d) | Ω(log s)

ℓ1 = real space with d_1(x,y) = ∑_i |x_i − y_i|


Below logarithmic?

Cannot work with ℓ1

Other possibilities?

(ℓ2)^p is bigger and algorithmically tractable, but not rich enough (often same lower bounds)

ℓ∞ is rich (includes all metrics), but usually not computationally efficient (high dimension)

And that’s roughly it… (at least for efficient NNS)

(ℓ2)^p = real space with dist_{2^p}(x,y) = ||x−y||_2^p

ℓ∞ = real space with dist_∞(x,y) = max_i |x_i − y_i|


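Meet our new host

Iterated product space:

Ρ_{2^2,∞,1} = ⊕^γ_{(ℓ2)^2} ⊕^β_{ℓ∞} ℓ1^α

ℓ1^α: x = (x_1, …, x_α) ∈ ℝ^α, with d_1(x,y) = ∑_{i=1}^α |x_i − y_i|

⊕^β_{ℓ∞} ℓ1^α: x = (x_1, …, x_β) ∈ ℓ1^α × ℓ1^α × … × ℓ1^α, with d_{∞,1}(x,y) = max_{i=1}^β d_1(x_i, y_i)

⊕^γ_{(ℓ2)^2} ⊕^β_{ℓ∞} ℓ1^α: x = (x_1, …, x_γ) ∈ ⊕^β_{ℓ∞} ℓ1^α × … × ⊕^β_{ℓ∞} ℓ1^α, with d_{2^2,∞,1}(x,y) = ∑_{i=1}^γ (d_{∞,1}(x_i, y_i))^2

[Figure: a point consists of γ blocks, each made of β vectors of dimension α; d_1 acts within a vector, d_{∞,1} across the β vectors of a block, d_{2^2,∞,1} across the γ blocks]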
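The three-level distance (sum over γ blocks of the squared maximum, over β vectors, of ℓ1 distances) can be sketched as follows. This is a minimal illustration, assuming a point is stored as a nested list of shape γ × β × α; the function names are hypothetical.

```python
def d1(x, y):
    """l1 distance between two vectors of dimension alpha."""
    return sum(abs(a - b) for a, b in zip(x, y))


def d_inf_1(x, y):
    """l_infinity product: max over the beta vectors of their l1 distances."""
    return max(d1(xi, yi) for xi, yi in zip(x, y))


def d_22_inf_1(x, y):
    """(l2)^2 product: sum over the gamma blocks of squared d_inf_1."""
    return sum(d_inf_1(xi, yi) ** 2 for xi, yi in zip(x, y))


# gamma = 2 blocks, beta = 2 vectors per block, alpha = 3 coordinates each.
x = [[[0, 0, 0], [1, 1, 1]], [[2, 0, 0], [0, 0, 0]]]
y = [[[1, 0, 0], [1, 1, 4]], [[0, 0, 0], [0, 1, 0]]]
print(d_22_inf_1(x, y))  # block maxima are 3 and 2, so 3^2 + 2^2
```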


Why Ρ_{2^2,∞,1}?

Because we can…

Theorem 1. Ulam embeds into Ρ_{2^2,∞,1} with O(1) distortion. Dimensions (γ,β,α) = (d, log d, d). (Rich)

Theorem 2. Ρ_{2^2,∞,1} admits NNS on n points with O(log log n) approximation, O(n^ε) query time and O(n^{1+ε}) space. (Algorithmically tractable)

In fact, there is more for Ulam…


Our Algorithms for Ulam

Ulam = edit distance on strings where each symbol appears at most once

A classical distance between rankings

Exhibits hardness of misalignments (as in general edit)

All lower bounds same as for general edit (up to Θ())

Distortion of embedding into ℓ1 (and (ℓ2)^p, etc.): Θ(log d)

ED(1234567, 7123456) = 2

Our approach implies new algorithms for Ulam:

1. NNS with O(log log n) approx, O(n^ε) query time

Can improve to O(log log d) approx

2. Sketching with O(1)-approx in log^{O(1)} d space

3. Distance estimation with O(1)-approx in time

[BEKMRRS03]: when ED ≈ d, approx d^ε in O(d^{1−2ε}) time

If we ever hope for approximation << log d for NNS under general edit, first we have to get it under Ulam!


Theorem 1

Theorem 1. Can embed Ulam into Ρ_{2^2,∞,1} with O(1) distortion. Dimensions (γ,β,α) = (d, log d, d)

Proof: “Geometrization” of Ulam characterizations

Previously studied in the context of testing monotonicity (sortedness):

Sublinear algorithms [EKKRV98, ACCL04]

Data-stream algorithms [GJKK07, GG07, EH08]


Thm 1: Characterizing Ulam

Consider permutations x,y over [d]

Assume for now: x = identity permutation

Idea: Count # chars in y to delete to obtain an increasing sequence (≈ Ulam(x,y)). Call them faulty characters

Issues: Ambiguity… How do we count them?

Example 1: x = 123456789, y = 234657891

Example 2: x = 123456789, y = 341256789
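The count in the idea above (fewest characters to delete from y so that the rest is increasing) equals d minus the length of a longest increasing subsequence. This is a minimal sketch using the standard patience-sorting method with binary search; the function name is illustrative, and this exact O(d log d) computation is not the embedding developed in the talk.

```python
from bisect import bisect_left


def min_deletions_to_increasing(y):
    """Fewest characters to delete from y to leave an increasing sequence."""
    # tails[i] = smallest possible last element of an increasing
    # subsequence of length i + 1 seen so far.
    tails = []
    for c in y:
        i = bisect_left(tails, c)
        if i == len(tails):
            tails.append(c)
        else:
            tails[i] = c
    return len(y) - len(tails)


for y in ("234657891", "341256789"):
    print(y, min_deletions_to_increasing([int(c) for c in y]))
```

The number is well defined, but the set of deleted characters is not: in the second example either {3,4} or {1,2} can be removed, which is the ambiguity the slide points at.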


Thm 1: Characterization – inversions

Definition: chars a<b form an inversion if b precedes a in y

How to identify a faulty char?

Has an inversion? Doesn’t work: all chars might have an inversion

Has many inversions? Still can miss “faulty” chars

Has many inversions locally? Same problem

Check if either is true!

Examples (x = 123456789 in each):

y = 234567891

y = 213456798

y = 567981234


Thm 1: Characterization – faulty chars

Definition 1: a is faulty if there exists K>0 s.t. a is inverted w.r.t. a majority of the K symbols preceding a in y (ok to consider only K = 2^k)

Lemma [ACCL04, GJKK07]: # faulty chars = Θ(Ulam(x,y)).

Example: x = 123456789, y = 234567891. Take the 4 characters preceding 1 in y: all of them form inversions with 1
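Definition 1 can be evaluated by brute force. This is a minimal sketch, assuming x is the identity permutation, that “majority” means a strict majority, and that every window length K from 1 up to the position of a is tried; the function name `faulty_chars` is illustrative.

```python
def faulty_chars(y):
    """Characters of y that are faulty with respect to the identity (Def. 1)."""
    faulty = set()
    for pos, a in enumerate(y):
        inverted = 0
        # Grow the window of K symbols preceding a, one symbol at a time.
        for K in range(1, pos + 1):
            if y[pos - K] > a:        # a larger symbol precedes a: inversion
                inverted += 1
            if 2 * inverted > K:      # strict majority of the window
                faulty.add(a)
                break
    return faulty


for y in ("234567891", "213456798", "567981234"):
    print(y, sorted(faulty_chars([int(c) for c in y])))
```

On the three example permutations from the inversions slide, the number of faulty characters (1, 2 and 5) coincides with the number of characters one must delete to reach an increasing sequence, in line with the Θ(·) statement of the lemma.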


Thm 1: Characterization → Embedding

To get an embedding, need:

1. Symmetrization (neither string is identity)

2. Deal with “exists”, “majority”…?

To resolve (1), use instead X[a;K] …

Definition 2: a is faulty if there exists K = 2^k such that |X[a;2^k] Δ Y[a;2^k]| > 2^k (symmetric difference)

Equivalently: || 1_{X[a;2^k]} − 1_{Y[a;2^k]} ||_1 > 2^k

Example: x = 123456789, y = 123467895, with windows X[5;4] and Y[5;4]. E.g., 1_{X[5;2^2]} = (1,1,1,1,0,0,0,0,0)
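Definition 2 can be tested directly with sets. This is a minimal sketch under two assumptions that the slide leaves implicit: X[a;K] is read as the set of K symbols immediately preceding a in x (consistent with the indicator vector shown for X[5;4]), and windows that run past the start of the string are simply truncated. All function names are illustrative.

```python
def window(s, a, K):
    """Set of (up to) K symbols immediately preceding a in s."""
    pos = s.index(a)
    return set(s[max(0, pos - K):pos])


def indicator(S, d):
    """0/1 vector of length d marking which of the symbols 1..d lie in S."""
    return [1 if c in S else 0 for c in range(1, d + 1)]


def is_faulty(x, y, a):
    """Definition 2: some K = 2^k has |X[a;K] symmetric-difference Y[a;K]| > K."""
    d = len(x)
    levels = (d - 1).bit_length()      # ceil(log2 d) window scales
    for k in range(1, levels + 1):
        K = 2 ** k
        if len(window(x, a, K) ^ window(y, a, K)) > K:
            return True
    return False


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 6, 7, 8, 9, 5]
print(indicator(window(x, 5, 4), 9))           # indicator of X[5;4]
print([a for a in x if is_faulty(x, y, a)])    # faulty symbols
```

In this example only the symbol 5, the one that was moved, is flagged.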


Thm 1: Embedding – final step

We have:

Ulam(x,y) ≈ ∑_{a=1}^d max_{k=1…log d} χ[ || 1_{X[a;2^k]} − 1_{Y[a;2^k]} ||_1 > 2^k ]

(χ[·] equals 1 iff the condition is true)

Replace by weight?

Ulam(x,y) ≈ ∑_{a=1}^d max_{k=1…log d} ( || 1_{X[a;2^k]} − 1_{Y[a;2^k]} ||_1 / (2·2^k) )^2

Final embedding:

f(x) = ( ( 1/(2·2^k) · 1_{X[a;2^k]} )_{k=1…log d} )_{a=1…d} ∈ ⊕^d_{(ℓ2)^2} ⊕^{log d}_{ℓ∞} ℓ1^d

Example: x = 123456789, y = 123467895, with windows X[5;2^2] and Y[5;2^2]
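The final embedding and the resulting product distance can be sketched as below. The assumptions are mine: X[a;K] is the set of K symbols immediately preceding a, windows near the start of the string are truncated, and log d is rounded up. The names `embed` and `product_distance` are illustrative, and this is a sketch of the formula on the slide rather than the authors' implementation.

```python
def embed(x):
    """f(x): for each symbol a and scale k, the scaled indicator of X[a;2^k]."""
    d = len(x)
    levels = (d - 1).bit_length()          # ceil(log2 d)
    pos = {c: i for i, c in enumerate(x)}
    image = []
    for a in range(1, d + 1):              # outer (l2)^2 product over symbols
        block = []
        for k in range(1, levels + 1):     # l_infinity product over scales
            K = 2 ** k
            win = set(x[max(0, pos[a] - K):pos[a]])
            scale = 1.0 / (2 * K)
            block.append([scale if c in win else 0.0 for c in range(1, d + 1)])
        image.append(block)
    return image


def product_distance(u, v):
    """Sum over symbols of (max over scales of the l1 distance) squared."""
    total = 0.0
    for block_u, block_v in zip(u, v):
        worst = max(
            sum(abs(p - q) for p, q in zip(row_u, row_v))
            for row_u, row_v in zip(block_u, block_v)
        )
        total += worst ** 2
    return total


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 6, 7, 8, 9, 5]
print(product_distance(embed(x), embed(y)))
```

For these two permutations, which differ by moving a single symbol, the embedded distance stays a small constant, as Theorem 1 would suggest.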


Theorem 2

Theorem 2. Ρ_{2^2,∞,1} admits NNS on n points with O(log log n) approximation, O(n^ε) query time and O(n^{1+ε}) space, for any small ε (ignoring (αβγ)^{O(1)} factors)

A rather general approach: “LSH” on ℓ1-products of general metric spaces

Of course, cannot do, but can reduce to ℓ∞-products


Thm 2: Proof

Let’s start from basics: ℓ1^α

[IM98]: c-approx with O(n^{1/c}) query time and O(n^{1+1/c}) space (ignoring α^{O(1)})

Ok, what about ⊕^β_{ℓ∞} ℓ1^α?

[I02]: Suppose there is NNS for M with
• c_M-approx
• Q_M query time
• S_M space.

Then there is NNS for ⊕^β_{ℓ∞} M with
• O(c_M · log log n)-approx
• O(Q_M) query time
• O(S_M · n^{1+ε}) space.


Thm 2: What about (ℓ2)^2-product?

Enough to consider ⊕^γ_{ℓ1} M (for us, M is the ℓ∞-product ⊕^β_{ℓ∞} ℓ1^α)

Off-the-shelf? [I04]: gives space ~n or >log n approximation

We reduce to multiple NNS queries under ⊕^γ_{ℓ∞} M

Instructive to first look at NNS for standard ℓ1 …


Thm 2: Review of NNS for ℓ1

LSH family: collection H of hash functions such that, for random h ∈ H (parameter w > 0):

Pr[h(q)=h(p)] ≈ 1 − ||q−p||_1 / w

Query just uses the primitive: “return all points p such that h(q)=h(p)”

Can obtain H by imposing a randomly-shifted grid of side-length w

Then for h defined by r_i ∈ [0, w] at random, the primitive becomes: “return all p s.t. |q_i − p_i| < r_i for all i ∈ [d]”

[Figure: points p and q in a randomly-shifted grid]
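The randomly-shifted grid can be sketched as follows. This is a minimal illustration, assuming a cell side-length called `w` here and an independent uniform shift per coordinate; `make_grid_hash` and `collision_rate` are hypothetical names. For such a grid the exact collision probability is the product over coordinates of (1 − |p_i − q_i|/w), which is close to 1 − ||q−p||_1/w when the points are near each other.

```python
import math
import random


def make_grid_hash(dim, w, rng):
    """Hash a point to the cell of a randomly shifted grid with side-length w."""
    shifts = [rng.uniform(0.0, w) for _ in range(dim)]

    def h(p):
        return tuple(math.floor((pi + si) / w) for pi, si in zip(p, shifts))

    return h


def collision_rate(p, q, w, trials=20000, seed=0):
    """Empirical Pr[h(p) == h(q)] over independently drawn grids."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_grid_hash(len(p), w, rng)
        hits += h(p) == h(q)
    return hits / trials


p, q, w = (0.0, 0.0), (0.5, 0.5), 10.0
l1_dist = sum(abs(a - b) for a, b in zip(p, q))
print(collision_rate(p, q, w), 1 - l1_dist / w)
```

Two points that differ by at least w in some coordinate always land in different cells, so far points are never returned by a single hash lookup.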


Thm 2: LSH for ℓ1-product

Intuition: abstract LSH!

Recall we had, for ℓ1: for r_i random from [0, w], where w is the grid side-length, point p is returned if for all i: |q_i − p_i| < r_i

“return all p s.t. |q_i − p_i| < r_i for all i ∈ [d]”

Equivalently: max_i (1/r_i) |q_i − p_i| < 1

This is an ℓ∞ product of ℝ!

For ⊕^γ_{ℓ1} M: “return all points p such that max_i d_M(q_i,p_i)/r_i < 1”


Thm 2: Final

Thus, for ⊕^γ_{ℓ1} M it is sufficient to solve the primitive:

“return all points p such that max_i d_M(q_i,p_i)/r_i < 1” (in fact, for k independent choices of (r_1,…,r_d), i.e. over ⊕^{γk}_{ℓ∞} M)

We reduced NNS over ⊕^γ_{ℓ1} M to several instances of NNS over ⊕^γ_{ℓ∞} M (with appropriately scaled coordinates)

Approximation is O(1)·O(log log n)

Done!
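The primitive itself is just a range query in an ℓ∞-product with rescaled coordinates. This is a minimal brute-force sketch to make its semantics concrete, assuming M is ℓ1 on short vectors and that the thresholds r_i are given; a real data structure would answer it with NNS over the ℓ∞-product rather than by a linear scan. The names `primitive` and `l1` are illustrative.

```python
def l1(u, v):
    """Base metric d_M for this example: l1 distance on short vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def primitive(points, q, r, d_M):
    """Return all points p with max_i d_M(q_i, p_i) / r_i < 1."""
    return [
        p for p in points
        if max(d_M(qi, pi) / ri for qi, pi, ri in zip(q, p, r)) < 1
    ]


# Each point has two coordinates, and each coordinate is itself a point of M.
q = ((0, 0), (1, 1))
r = (2.0, 1.0)                      # one threshold per coordinate
near = ((1, 0), (1, 1.5))           # scaled distances 0.5 and 0.5
far = ((3, 0), (1, 1))              # scaled distance 1.5 in coordinate 1
print(primitive([near, far], q, r, l1))
```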


Take-home message:

Can embed combinatorial metrics into iterated product spaces

Works for Ulam (= edit on non-repetitive strings)

Approach bypasses non-embeddability results into usual-suspect spaces like ℓ1, (ℓ2)^2 …

Open:

Embeddings for edit over {0,1}^d, EMD, other metrics?

Understanding product spaces? [Jayram-Woodruff]: sketching
