Approximate schemas Michel de Rougemont, LRI , University Paris II Joint work with E. Fischer, Technion, F. Magniez, LRI


Approximate schemas

Michel de Rougemont, LRI , University Paris II

Joint work with E. Fischer, Technion, F. Magniez, LRI

Page 2:

1. Distance between words (structures): edit distance with moves

2. Distance between a word (structure) and a class of words (structures)

3. Distance between two languages (classes)

4. Applications: regular languages, DTDs

Distances between languages

$\mathrm{dist}(w, L) = \min_{w' \in L} \mathrm{dist}(w, w')$

$L_1 \subseteq_\varepsilon L_2$ if every $v \in L_1$ (except finitely many) is $\varepsilon$-close to $L_2$

$L_1 \equiv_\varepsilon L_2$ if $L_1 \subseteq_\varepsilon L_2$ and $L_2 \subseteq_\varepsilon L_1$

Page 3: Results

1. Tester for equality, constant time

2. Tester for w in L, constant time

3. Tester for approximate equivalence of regular languages, polynomial time

Equivalence tester

If $L_1 = L_2$ then A accepts

If $\neg(L_1 \equiv_\varepsilon L_2)$ then A rejects with probability 2/3

Page 4: 1. Approximate Satisfiability and Equivalence

1. Satisfiability: Tree $\models F$

2. Approximate satisfiability: Tree $\models_\varepsilon F$

3. Approximate equivalence: $F \equiv_\varepsilon G$

[Figure: image of F and G on a class K of trees, with the region ε-far from F]

Page 5: Testers on a class K

Let F be a property on a class K of structures U

An ε-tester for F is a probabilistic algorithm A such that:

• If U |= F, A accepts

• If U is ε-far from F, A rejects with high probability

• Time(A) is independent of n.

(Goldreich, Goldwasser, Ron 1996; Rubinfeld, Sudan 1994)

A tester usually implies a linear-time corrector.

Page 6: History of Testers

Self-testers and correctors for Linear Algebra, Blum & Kannan 1989

Robust characterizations of polynomials, R. Rubinfeld, M. Sudan, 1994

Testers for graph properties: k-colorability, Goldreich et al. 1996

$\Sigma_2$ graph properties have testers, Alon et al. 1999

Regular languages have testers, Alon et al. 2000

Testers for regular tree languages, MdR and Magniez, ICALP 2004

Page 7: 2. Equality tester

1. Classical Edit Distance:

Insertions, Deletions, Modifications

2. Edit Distance with moves

0111000011110011001

0111011110000011001

3. Edit Distance with Moves generalizes to Trees
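The two binary words on this slide differ by a single block move. A minimal Python sketch of both notions follows, assuming unit costs; the helper names `levenshtein` and `move_block` are illustrative and not from the talk.

```python
def levenshtein(a: str, b: str) -> int:
    """Classical edit distance: insertions, deletions, modifications (unit cost)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # modification
        prev = cur
    return prev[-1]


def move_block(w: str, start: int, length: int, dest: int) -> str:
    """Cut w[start:start+length] and reinsert it at index dest of the remainder."""
    block = w[start:start + length]
    rest = w[:start] + w[start + length:]
    return rest[:dest] + block + rest[dest:]


w1 = "0111000011110011001"
w2 = "0111011110000011001"

# Moving the block "1111" three positions to the left turns w1 into w2.
moved = move_block(w1, start=8, length=4, dest=5)
print(moved == w2)
print(levenshtein(w1, w2))
```

With moves the two words are at distance 1, while the classical distance has to pay for the displaced letters one by one.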

Page 8: Statistics on words

Block statistics: b.stat

Uniform statistics: u.stat

Block Uniform statistics: bu.stat

$\varepsilon = 1/k$

[Figure: a word cut into subwords of length k; labels K, t and k-t]

Page 9: Block statistics

W=001010101110…… length n, subwords of length k, n/k blocks

$\mathrm{b.stat}(W) = \frac{1}{n/k} \cdot (n_1, n_2, \ldots, n_{2^k})^T$

$n_1$ = # number of "00...0", $n_2$ = # number of "00...1", ..., $n_{2^k}$ = # number of "11...1"

For k=2, n/k=6:

$\mathrm{b.stat}(W) = \frac{1}{6} \cdot (1, 0, 4, 1)^T$
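A small Python sketch of the block statistics, assuming the components are ordered lexicographically ("00", "01", "10", "11" for k=2) and that only complete blocks are counted; the function name `b_stat` is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def b_stat(w: str, k: int, alphabet: str = "01") -> list:
    """Frequencies of the n/k consecutive, non-overlapping blocks of length k."""
    blocks = [w[i:i + k] for i in range(0, len(w) - k + 1, k)]
    counts = Counter(blocks)
    total = len(blocks)
    # One component per word of length k, in lexicographic order.
    return [Fraction(counts["".join(u)], total)
            for u in product(alphabet, repeat=k)]


W = "001010101110"
print(b_stat(W, 2))  # blocks 00 10 10 10 11 10
```

Exact fractions are used so that the vector can be compared with the hand-computed one without rounding issues.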

Page 10: Uniform statistics

W=001010101110, length n, subwords of length k, n-k+1 blocks

$\mathrm{u.stat}(W) = \frac{1}{n-k+1} \cdot (n_1, n_2, \ldots, n_{2^k})^T$

$n_1$ = # number of "00...0", $n_2$ = # number of "00...1", ..., $n_{2^k}$ = # number of "11...1"

For k=2:

$\mathrm{u.stat}(W) = \frac{1}{11} \cdot (1, 4, 4, 2)^T$
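The uniform statistics count every window of length k, overlapping or not. A Python sketch under the same assumptions as for the block statistics (lexicographic order of components, illustrative name `u_stat`):

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def u_stat(w: str, k: int, alphabet: str = "01") -> list:
    """Frequencies of the n-k+1 overlapping subwords of length k."""
    windows = [w[i:i + k] for i in range(len(w) - k + 1)]
    counts = Counter(windows)
    total = len(windows)
    return [Fraction(counts["".join(u)], total)
            for u in product(alphabet, repeat=k)]


W = "001010101110"
print(u_stat(W, 2))
```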

Page 11: Statistics and distance

W=001010101110 (n=12, k=2)

W’=110100001111

W1=001101001110 W2=110100001110 W3=110100001111

dist(W,W’)=3, dist(W,W’)/12=0.25

$\mathrm{b.stat}(W) = \frac{1}{6} \cdot (1, 0, 4, 1)^T$, $\mathrm{b.stat}(W') = \frac{1}{6} \cdot (2, 1, 0, 3)^T$

$\mathrm{u.stat}(W) = \frac{1}{11} \cdot (1, 4, 4, 2)^T$, $\mathrm{u.stat}(W') = \frac{1}{11} \cdot (3, 2, 2, 4)^T$

$d_0 = |\mathrm{b.stat}(W) - \mathrm{b.stat}(W')| = 8/6 = 1.33$

$d_1 = |\mathrm{u.stat}(W) - \mathrm{u.stat}(W')| = 8/11 = 0.73$
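The values 8/6 and 8/11 are the L1 distances between the statistics vectors of W and W’. A Python sketch that recomputes them, assuming that reading of d0 and d1; `stat` and `l1` are illustrative helper names.

```python
from collections import Counter
from fractions import Fraction


def stat(w: str, k: int, step: int) -> dict:
    """Subword frequencies: step=k gives block statistics, step=1 uniform statistics."""
    windows = [w[i:i + k] for i in range(0, len(w) - k + 1, step)]
    total = len(windows)
    return {u: Fraction(c, total) for u, c in Counter(windows).items()}


def l1(p: dict, q: dict) -> Fraction:
    """L1 distance between two sparse frequency vectors."""
    return sum((abs(p.get(u, 0) - q.get(u, 0)) for u in set(p) | set(q)), Fraction(0))


W, W2 = "001010101110", "110100001111"
d0 = l1(stat(W, 2, step=2), stat(W2, 2, step=2))   # block statistics
d1 = l1(stat(W, 2, step=1), stat(W2, 2, step=1))   # uniform statistics
print(d0, d1)
```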

Page 12: Goal: d1 approximates the distance

Let $\varepsilon = 1/k$. For $n > n_0$: dist − ε.n < d1 < dist + ε.n

Practical application: $\varepsilon = 10^{-2}$, hence k=100, stat dimension $2^{100}$

Words of length $n = 10^9$: d1 is approximated from N samples, and the approximation is good after $N = O(1/\varepsilon^3)$ trials.

Remarks:

1. Distance with Moves.

W = 000….0001111…111, W’ = 1111…111000….000

$\mathrm{b.stat}(W) = \mathrm{b.stat}(W') = (0.5, 0, \ldots, 0, 0.5)^T$

2. Robustness to noise. If W, W’ are noisy inputs (but ε-close), the method still works.

3. Random words are close with the moves, far without.
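Remark 1 can be checked on a small instance. The sketch below assumes n=24 and k=4 (so that k divides n/2); these values are chosen only for illustration.

```python
from collections import Counter


def block_counts(w: str, k: int) -> Counter:
    """Counts of the consecutive, non-overlapping blocks of length k."""
    return Counter(w[i:i + k] for i in range(0, len(w) - k + 1, k))


n, k = 24, 4
w = "0" * (n // 2) + "1" * (n // 2)
w_prime = "1" * (n // 2) + "0" * (n // 2)

# Every position differs, yet a single move of the first half maps w to w_prime.
hamming = sum(a != b for a, b in zip(w, w_prime))
same_stats = block_counts(w, k) == block_counts(w_prime, k)
print(hamming, same_stats)
```

The block statistics cannot tell the two words apart, which is consistent with their distance with moves being 1.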

Page 13: Classical complexity

Edit distances:

1. P problem on words without the moves.

• Approximation?

• Sublinear algorithm?

2. NP-complete problem on words with the moves.

• O(1)-approximable

3. P problem on ordered trees without the moves.

4. NP-complete problem on unordered trees and trees with the moves.

Page 14: Basic tool: Chernoff bound

Random variables:

Markov bound

Chebyshev bound

Chernoff bound: sum of independent variables $X_i$, whose average is μ

Hoeffding bound

$X_i = (0, \ldots, 0, 1, 0)$

$Y = \frac{1}{N} \sum_{i=1,\ldots,N} X_i$

$\Pr[\,|X - Y| \ge a\,] \le e^{-8 . N . a^2}$

[Figure: plot of Prob[X=k] against k, with μ and a.μ marked]
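As an illustration of the concentration of an average of N independent samples, here is a Python sketch using the standard two-sided Hoeffding inequality for variables in [0,1], $\Pr[|Y - \mu| \ge a] \le 2e^{-2Na^2}$. Its constants are those of the textbook statement and are not claimed to match the slide; the values p, N, a and the fixed seed are assumptions for the example.

```python
import math
import random


def hoeffding_bound(N: int, a: float) -> float:
    """Textbook two-sided Hoeffding bound for the mean of N variables in [0, 1]."""
    return 2.0 * math.exp(-2.0 * N * a * a)


def empirical_mean(p: float, N: int, rng: random.Random) -> float:
    """Average of N independent Bernoulli(p) samples."""
    return sum(rng.random() < p for _ in range(N)) / N


rng = random.Random(0)          # fixed seed, for reproducibility only
p, N, a = 0.3, 10_000, 0.05
y = empirical_mean(p, N, rng)
print(abs(y - p) < a, hoeffding_bound(N, a))
```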

Page 15: Tester for equality of strings

Edit distance with moves. NP-complete problem, but O(1)-approximable.

Uniform statistics ($\varepsilon = 1/k$): W=001010101110

$\mathrm{u.stat}(W) = \frac{1}{11} \cdot (1, 4, 4, 2)^T$

Theorem 1. |u.stat(w) − u.stat(w’)| approximates dist(w,w’)/n.

Sample N subwords of length k, compute Y(w) and Y(w’):

$X_i = (0, \ldots, 0, 1, 0)$

$Y(w) = \frac{1}{N} \sum_{i=1,\ldots,N} X_i$, $Y(w') = \frac{1}{N} \sum_{i=1,\ldots,N} X_i$

Theorem 2. Y(w) approximates u.stat(w).

Corollary. |Y(w) − Y(w’)| approximates dist(w,w’)/n.

Tester: If |Y(w) − Y(w’)| < ε, accept, else reject.
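A sketch of the sampling tester in Python, assuming k is taken as round(1/ε), that |.| is the L1 norm, and that N is supplied by the caller rather than derived from ε; the names `sample_stat` and `equality_tester` are illustrative.

```python
import random
from collections import Counter


def sample_stat(w: str, k: int, N: int, rng: random.Random) -> dict:
    """Y(w): empirical frequencies of N uniformly sampled subwords of length k."""
    counts = Counter()
    for _ in range(N):
        i = rng.randrange(len(w) - k + 1)
        counts[w[i:i + k]] += 1
    return {u: c / N for u, c in counts.items()}


def l1(p: dict, q: dict) -> float:
    return sum(abs(p.get(u, 0.0) - q.get(u, 0.0)) for u in set(p) | set(q))


def equality_tester(w1: str, w2: str, eps: float, N: int, seed: int = 0) -> bool:
    """Accept iff |Y(w1) - Y(w2)| < eps; reads only N samples of each word."""
    k = max(1, round(1 / eps))
    rng = random.Random(seed)
    return l1(sample_stat(w1, k, N, rng), sample_stat(w2, k, N, rng)) < eps


print(equality_tester("01" * 500, "01" * 500, eps=0.25, N=4000))   # identical words
print(equality_tester("0" * 1000, "1" * 1000, eps=0.25, N=4000))   # far words
```

The running time depends on N and k only, not on the length of the words.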

Page 16: 2a. Regular words

L is a regular language and A an automaton for L. Test w in L.

$\mathrm{dist}(w, L) = \min_{w' \in L} \mathrm{dist}(w, w')$

[Figure: automaton with connected components $C_0, C_1, C_2, C_3, C_4$, from init to accept]

Admissible $Z = (C_0, C_2, C_3, C_4)$

Definition: A word W is Z-feasible if there are two states $q \in C_i$ and $q' \in C_j$ such that $q \xrightarrow{W} q'$ and $C_i, \ldots, C_j$ are on Z.

Page 17: Tester for regular words

Tester. Input: W, A, ε

For $i = 1, \ldots, \log(m/\varepsilon)$:

Choose $N_i = (2^i . m^3/\varepsilon)$ random subwords $w_{ij}$ of W of size $2^{i+1}$

For every admissible path Z:

If all $w_{ij}$ of W are Z-feasible, ACCEPT.

else REJECT.

Theorem: Tester(W, A, ε) is an ε-tester for L(A).
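The overall shape of the tester (sample subwords at geometrically growing sizes, reject as soon as one is infeasible) can be sketched in Python. This is a deliberately simplified illustration: it ignores connected components and admissible paths, treats a subword as feasible when some state of a partial deterministic automaton can read it, uses a fixed number of samples per scale, and takes the logarithm in base 2. All names are hypothetical.

```python
import math
import random


def feasible(sub: str, delta: dict, states: list) -> bool:
    """True if some state can read `sub` without getting stuck (partial DFA)."""
    for q in states:
        cur = q
        for c in sub:
            cur = delta.get((cur, c))
            if cur is None:
                break
        else:
            return True
    return False


def sketch_tester(w: str, delta: dict, states: list, eps: float,
                  samples_per_scale: int = 50, seed: int = 0) -> bool:
    rng = random.Random(seed)
    m = len(states)
    scales = max(1, math.ceil(math.log2(m / eps)))
    for i in range(1, scales + 1):
        size = min(len(w), 2 ** (i + 1))          # subwords of size 2^(i+1)
        for _ in range(samples_per_scale):
            start = rng.randrange(len(w) - size + 1)
            if not feasible(w[start:start + size], delta, states):
                return False                       # REJECT
    return True                                    # ACCEPT


# Automaton for (ab)*: state 0 reads 'a', state 1 reads 'b'.
states = [0, 1]
delta = {(0, "a"): 1, (1, "b"): 0}
print(sketch_tester("ab" * 100, delta, states, eps=0.1))
print(sketch_tester("a" * 200, delta, states, eps=0.1))
```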

Page 18: Proof schema of the Tester

Theorem: Regular words are testable.

Robustness lemma: If W is ε-far from L, then for every admissible path Z, there exists $i \le \log(\ldots m \ldots)$ such that the number of Z-infeasible subwords of size $2^{i+1}$ is at least $\ldots n/(2^{i+1} \ldots m)$.

Splitting lemma: If W is far from L, there are many disjoint infeasible subwords.

Amplifying lemma: If there are many infeasible words, there are many short ones.

Page 19: Merging trees

Merging lemma: Let Z be an admissible path, and let F be a Z-feasible cut of size h’. Then $\mathrm{Dist}(F, L) \le m^2 . h'$.

Take each word $w_i \in F$ and split it along its connected components, removing single letters. Rearrange all the words of the same component in its Z-order. Add gluing words to obtain W’ in L:

$W' = w_0 . g_1 . w_1 . g_2 . w_2 \ldots$

[Figure: words split along connected components C]

Page 20: Splitting

Splitting lemma: If Z is an admissible path and W a word s.t. dist(W,L) > h, then W has more than $h/m^2$ Z-infeasible disjoint subwords ($h = \varepsilon . n$).

Proof by contraposition:

W has less than $h' = h/m^2$ minimal Z-infeasible and disjoint subwords.

Removing the last letters provides a feasible cut F.

$\mathrm{Dist}(W, F) \le h'$. By the merging lemma, $\mathrm{Dist}(F, L) \le m^2 . h'$.

Hence $\mathrm{Dist}(W, L) \le h' + m^2 . h'$

And $\mathrm{Dist}(W, L) \le h$

Page 21: 2b. Regular trees

Tree Edit distance with moves:

Edge Deletion

Label and Node Insertion

1 move

[Figure: example trees with nodes labelled a, b, c, d, e, f, before and after each operation]

Distance Problem is NP-complete, non-approximable.

Page 22: Tree-Edit-Distance on binary trees

Binary trees: Distance with moves allows permutations

Distance(T1,T2) = 4, m-Distance(T1,T2) = 2

Page 23: Tree automata

$A = (Q, q_0, \delta, q_1)$

• (q0, q0) → q1

• (q0, q1) → q1

• (q1, q1) → q2

• (q1, q0) → q2

• (q2, −) → q2

• (−, q2) → q2

[Figure: two binary trees whose nodes are labelled with the states q0, q1, q2]
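The transitions of this slide can be run bottom-up on a binary tree. The Python sketch below rests on assumptions not stated in the transcript: leaves are assigned q0, q1 is the accepting state, nodes carry no labels, and a tree is encoded as `None` (leaf) or a pair `(left, right)`.

```python
DELTA = {
    ("q0", "q0"): "q1",
    ("q0", "q1"): "q1",
    ("q1", "q1"): "q2",
    ("q1", "q0"): "q2",
}


def run(tree) -> str:
    """State reached at the root of `tree` by a bottom-up run."""
    if tree is None:                 # leaf
        return "q0"
    left, right = tree
    l, r = run(left), run(right)
    if "q2" in (l, r):               # (q2, -) -> q2 and (-, q2) -> q2
        return "q2"
    return DELTA[(l, r)]


def accepts(tree) -> bool:
    return run(tree) == "q1"


right_branch = (None, (None, (None, None)))   # leaf on the left at every level
left_branch = ((None, None), None)
print(accepts(right_branch), accepts(left_branch))
```

Under these assumptions the automaton accepts exactly the trees whose left child is a leaf at every internal node, that is, right-branch trees.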

Page 24: Infeasible subtrees

Fact. If $\mathrm{Distance}(T, L) > \varepsilon . n$ then the number of infeasible subtrees of constant size is O(n).

Page 25: Tester for regular Trees

Tester. Input: T, A, ε

For $i = 1, \ldots$:

Choose $N = (\ldots)$ random nodes and subtrees $t_{ij}$ of size i

If all $t_{ij}$ of T are Z-feasible, ACCEPT.

Theorem: Tester(T, A, ε) is an ε-tester for L(A).

Page 26: Proof schema of the Tester

Theorem: Regular trees are testable.

Robustness lemma: If T is ε-far from L, then for every admissible path Z, there exists $i \le \ldots$ such that the number of Z-infeasible i-subtrees is at least $\ldots n$.

Splitting lemma: If T is far from L, there are many disjoint infeasible subtrees.

Amplifying lemma: If there are many infeasible subtrees, there are many small ones.

Page 27: Splitting and Merging

Splitting and Merging on words:

Splitting and Merging on trees:

[Figure: words and trees split along connected components C]

Page 28: Splitting and Merging trees

Connected Components

Corrected tree

[Figure: a tree split into connected components C, D, E, and the corrected tree]

Page 29: Correction in practice: right branch tree

http://www.lri.fr/~mdr/xml/

2 moves, dist=2

Page 30: 3. Equivalence testing of Regular Languages

1. Inclusion

$L_1 \subseteq_\varepsilon L_2$ if every $v \in L_1$ (except finitely many) is $\varepsilon$-close to $L_2$

2. Equivalence

$L_1 \equiv_\varepsilon L_2$ if $L_1 \subseteq_\varepsilon L_2$ and $L_2 \subseteq_\varepsilon L_1$

Equivalence tester

If $L_1 = L_2$ then A accepts

If $\neg(L_1 \equiv_\varepsilon L_2)$ then A rejects with probability 2/3

Page 31: Statistics on words

Block statistics: b.stat

Uniform statistics: u.stat

$\varepsilon = 1/k$

[Figure: a word cut into subwords of length k]

The construction of the tester for regular languages is exponential in the size of the automaton.

We need a construction polynomial in the size of the automaton.

For equivalence testing, we use b.stat

Page 32: Automata for Regular languages

Regular languages and automata

Non-deterministic automaton A; let $A^k$ be the automaton accepting words of length k, reading v in $\Sigma^k$

Definition: v in $\Sigma^k$ is an $A^k$-loop if there are u, w such that

• Word u.v.w is accepted by $A^k$

• State after u identical to the state after u.v

A finite set of loops is $A^k$-compatible if all loops can occur in an accepting word.

Definition: $H = \bigcup_{v_1, \ldots, v_t \ A^k\text{-loops}} \text{Convex-Hull}(\mathrm{b.stat}(v_1), \ldots, \mathrm{b.stat}(v_t))$

Convex-hull: $\{ \sum_{i=1,\ldots,t} \lambda_i . \mathrm{b.stat}(v_i) \ \text{s.t.} \ \sum_i \lambda_i = 1 \}$

Page 33: Automata for Regular languages

Basic property: $H = \bigcup_{v_1, \ldots, v_t \ A^k\text{-loops},\ t \le |\Sigma|^k + 1,\ |v_i| \le m} \text{Convex-Hull}(\mathrm{b.stat}(v_1), \ldots, \mathrm{b.stat}(v_t))$

Caratheodory’s theorem: in dimension d, the convex hull of N points can be decomposed into the union of convex hulls of d+1 points.

Large loops can be decomposed. Small loops (less than m=|A|) suffice.

Proposition: $w \in L \Rightarrow w$ is ε-close to $v . u_1 \ldots u_l$ where $|u_i| \le m$ and $u_1 \ldots u_l$ is a multi-set of $A^k$-compatible loops.

Page 34: Approximate Parikh mapping

Lemma: $\forall w \in L$, $\exists w'$ with $0 \le |w| - |w'| \le m$, $\mathrm{dist}(w, w') \le m$, $\mathrm{b.stat}(w') \in H$ and $|\mathrm{b.stat}(w) - \mathrm{b.stat}(w')| \le 2 . m/|w|$

Find w’’ ε-close to w: $w'' = v . u_1 \ldots u_l$. Remove v, i.e. at most m block letters.

Lemma: For every X in H, for every n, there exists w in L s.t. |w| is close to n and $|X - \mathrm{b.stat}(w)| \le c/n$

Page 35: Approximate Parikh mapping

Lemma: For every X in H, there exists w in L s.t. $|X - \mathrm{b.stat}(w)| \le \ldots$

[Figure: a point X of H and b-stat(w) for a word w]

$\mathrm{dist}(w, L) \le (\ldots) . n$

H is a fair representation of L

Page 36: Construction of H

Enumerate all loops: $|\Sigma|^{k.m}$

Number of b-stat is less: $m^{|\Sigma|^k}$

Some loops have same b-stat: ABBA and BBAA

# partitions of a word of length m with « big blocks »

Construct H by matrix iteration:

$P_t(i, j)$: b-stat of the words of t blocks from state i to state j

$P_{t+1} = P_t \circ P_1$
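The matrix iteration can be sketched in Python as follows. The sketch assumes that entry (i, j) of $P_t$ is the set of block-count vectors of the t-block words leading from state i to state j, and that composition adds count vectors over every intermediate state. It ignores compatibility and acceptance, and the two-state automaton over the blocks "aa", "ab", "ba" is hypothetical.

```python
from fractions import Fraction


def compose(P: dict, Q: dict, states: list) -> dict:
    """(P o Q)[i, j]: sums of a vector in P[i, l] and a vector in Q[l, j]."""
    R = {(i, j): set() for i in states for j in states}
    for i in states:
        for l in states:
            for j in states:
                for x in P[(i, l)]:
                    for y in Q[(l, j)]:
                        R[(i, j)].add(tuple(a + b for a, b in zip(x, y)))
    return R


def loop_b_stats(states: list, blocks: list, step: dict, max_t: int) -> set:
    """b-stat vectors of the loops of at most max_t blocks (diagonal entries)."""
    index = {b: n for n, b in enumerate(blocks)}
    P1 = {(i, j): set() for i in states for j in states}
    for (i, b), j in step.items():
        P1[(i, j)].add(tuple(int(n == index[b]) for n in range(len(blocks))))
    stats, Pt = set(), P1
    for t in range(1, max_t + 1):
        for q in states:
            for counts in Pt[(q, q)]:
                stats.add(tuple(Fraction(c, t) for c in counts))
        Pt = compose(Pt, P1, states)
    return stats


# Hypothetical automaton on blocks: 1 -aa-> 1, 1 -ab-> 2, 2 -ba-> 1.
states = [1, 2]
blocks = ["aa", "ab", "ba"]
step = {(1, "aa"): 1, (1, "ab"): 2, (2, "ba"): 1}
print(loop_b_stats(states, blocks, step, max_t=2))
```

Storing count vectors rather than words is what keeps the number of entries small: distinct loops with the same b-stat collapse into one vector.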

Page 37: Construction of H

Lemma: compute a set I of at most $|\Sigma|^k + 1$ compatible loops,

$H = \bigcup_{S \subseteq I} \text{Convex-Hull}(S)$

Compute $P_t$ for t=1,…,m

In the diagonals, find the b-stat of small loops, at most $m^{|\Sigma|^k}$.

Consider subsets of at most $|\Sigma|^k + 1$ elements which are compatible.

Time: $O(m^{\ldots})$

Page 38: Example

Automaton A:

[Figure: automaton with states 1, 2, 3, 4 and transitions labelled a, b, c, d]

Blocks, k=2, m=4, |Σ|=4, $|\Sigma|^k + 1 = 17$:

Loops: {(aa,ca:1),(bb:2),(cc,ac:3),(dd:4)}

[Figure: $H_A$, with the points aa, ca, ac, cc, bb, dd]

Page 39: Equivalence tester

Tester for w in L (regular): Compute b-stat(w) and H. Decide if dist(w,L) > ε.n. Time is polynomial in m=|L|.

Previous tester was exponential in m.

Tester of $A \equiv B$:

1. Compute $H_A$ and $H_B$

2. Reject if $H_A$ and $H_B$ are different.

Time polynomial in m=|A,B|

Page 40: Generalizations

Buchi Automata. Distance on infinite words: Two words are ε-close if $\limsup_{n} \mathrm{dist}(w(n), w'(n))/n \le \varepsilon$

A word w is ε-close to a language L if there exists w’ in L s.t. w and w’ are ε-close.

Statistics: set of accumulation points of $\mathrm{b.stat}(w(n))$

H: compatible loops of connected components of accepting states

Tester for Buchi Automata: Compute $H_A$ and $H_B$

Reject if $H_A$ and $H_B$ are different.

Equivalence of CF grammars is undecidable; approximate equivalence is decidable in exponential time.

Page 41: Conclusion

1. Testers and Correctors

2. Constant-time algorithm for Edit Distance with moves

3a. Testers and Correctors for regular words

3b. Tester for regular trees and corrector for regular trees

4. Equivalence tester for automata

Polynomial time algorithm

Generalization to Buchi automata and Context-Free Tree regular languages

Page 42: Soundness and Robustness

Let F be a property on a class K of structures U

F is Equality

Soundness: close structures have close statistics ($\mathrm{dist}(w, w') \le \varepsilon . n$)

Robustness: far structures have far statistics ($\mathrm{dist}(w, w') > \varepsilon . n$)

Page 43: Robustness of b.stat

Robustness of b-stat: $\mathrm{dist}(w, w') \le (\frac{1}{2} . |\mathrm{b.stat}(w) - \mathrm{b.stat}(w')| + \varepsilon) . n$

if $\mathrm{b.stat}(w) = \mathrm{b.stat}(w')$ then $\mathrm{dist}(w, w') \le \varepsilon . n$

if $\mathrm{b.stat}(w) \ne \mathrm{b.stat}(w')$ then construct w’’ s.t. $\mathrm{b.stat}(w'') = \mathrm{b.stat}(w')$, after at most $\frac{|\mathrm{b.stat}(w) - \mathrm{b.stat}(w')|}{2} . n$ substitutions on w.

Example:

$\mathrm{b.stat}(W) = \frac{1}{6} \cdot (1, 0, 4, 1)^T$, $\mathrm{b.stat}(W') = \frac{1}{6} \cdot (2, 0, 3, 1)^T$

#"00" is 1 in W and 2 in W’, but #"10" is 4 in W and 3 in W’

W’’: take one block "10" in W and change it into "00"

Page 44: Soundness of u.stat

Soundness of u-stat: $\mathrm{dist}(w, w') \le \varepsilon^2 . n \Rightarrow |\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \le 6 . \varepsilon$

Simple edit: $|\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \le \frac{2k}{n-k+1} \approx \frac{2}{\varepsilon . n}$

Move w=A.B.C.D, w’=A.C.B.D: $|\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \le \frac{2 . 3(k-1)}{n-k+1} \le \frac{6}{\varepsilon . n}$

Hence, for $\varepsilon^2 . n$ operations, $|\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \le 6 . \varepsilon$

Problem: robustness of u.stat? Harder! You need an auxiliary distribution and two key lemmas.
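The bound for a move, at most 3(k−1) windows lost and 3(k−1) gained, can be checked numerically. A Python sketch, assuming |.| is the L1 norm; the four factors A, B, C, D are an arbitrary example.

```python
from collections import Counter
from fractions import Fraction


def u_counts(w: str, k: int) -> Counter:
    """Counts of the n-k+1 overlapping subwords of length k."""
    return Counter(w[i:i + k] for i in range(len(w) - k + 1))


def u_stat_l1(w1: str, w2: str, k: int) -> Fraction:
    """L1 distance between u.stat(w1) and u.stat(w2), for words of equal length."""
    assert len(w1) == len(w2)
    c1, c2 = u_counts(w1, k), u_counts(w2, k)
    total = len(w1) - k + 1
    return Fraction(sum(abs(c1[u] - c2[u]) for u in set(c1) | set(c2)), total)


def move_bound(n: int, k: int) -> Fraction:
    """2 . 3(k-1) / (n-k+1): only windows across the three cut points can change."""
    return Fraction(6 * (k - 1), n - k + 1)


A, B, C, D = "01110", "000", "1111", "0011001"
w, w_moved = A + B + C + D, A + C + B + D
for k in (2, 3, 4):
    print(k, u_stat_l1(w, w_moved, k), move_bound(len(w), k))
```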

Page 45: Block Uniform Statistics

$\mathrm{bu.stat}(w) = \frac{K}{n} \sum_{i=1,\ldots,n/K} E_{t_i}(\mathrm{b.stat}(v_i)) = E_v(\mathrm{b.stat}(v))$

$X_i = \mathrm{b.stat}(v_i)$, $X_i[u] = \mathrm{b.stat}(v_i)[u]$, $0 \le X_i[u] \le 1$

Each $X_i[u]$ is independent. Average is $\mathrm{bu.stat}(w)[u]$.

Chernoff Bound: $\Pr_t[\,|\mathrm{b.stat}(v)[u] - \mathrm{bu.stat}(w)[u]| \ge \ldots . \mathrm{bu.stat}(w)[u]\,] \le e^{-8 \ldots n/K}$

Union Bound: $\Pr_t[\,|\mathrm{b.stat}(v) - \mathrm{bu.stat}(w)| \ge \ldots . \mathrm{bu.stat}(w)\,] \le \ldots . e^{-8 \ldots n/K}$

$\Pr_t[\,|\mathrm{b.stat}(v) - \mathrm{bu.stat}(w)| \ge \varepsilon/2\,] \to 0$

Lemma 1: $\forall w \ \exists v \ |\mathrm{bu.stat}(w) - \mathrm{b.stat}(v)| \le \varepsilon/2$

Page 46: Uniform Statistics

# subwords of length k missed by bu: $(k-1) . (n/K - 1)$

Lemma: Let $A \subseteq B$, and $\mu_A$, $\mu_B$ two uniform distributions on A and B. Then $|\mu_A - \mu_B| = 2 . \frac{|B - A|}{|B|}$.

Apply the previous lemma with $|B| = n-k+1$ and $K = \ldots$:

$|\mathrm{u.stat}(w) - \mathrm{bu.stat}(w)| = O(\ldots)$

Lemma 2: $\forall w \ |\mathrm{bu.stat}(w) - \mathrm{u.stat}(w)| \le \ldots$

Page 47: Robustness of the uniform Statistics

Robustness of u-stat: $\mathrm{dist}(w, w') \ge 5 . \varepsilon . n \Rightarrow |\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \ge 6.5 . \varepsilon$

By Lemma 1: $\forall w \ \exists v \ |\mathrm{bu.stat}(w) - \mathrm{b.stat}(v)| \le \varepsilon/2$

By Lemma 2: $\forall w \ |\mathrm{bu.stat}(w) - \mathrm{u.stat}(w)| \le \ldots$

Get v, v’ from w, w’.

Robustness of b-stat implies robustness of u-stat.

Page 48: Tester for the distance with moves

NP-complete problem, but O(1)-approximable.

Approximate u.stat: Sample N subwords of length k, compute Y:

$X_i = (0, \ldots, 0, 1, 0)$

$Y = \frac{1}{N} \sum_{i=1,\ldots,N} X_i$

Y is a good approximation of u.stat (Chernoff):

$\Pr[\,|\mathrm{u.stat}(W) - Y| \ge a\,] \le e^{-8 . N . a^2}$

Uniform statistics is a good approximation of the distance by soundness and robustness.

Tester: If Y<ε.n accept, else reject.