Approximate schemas
Michel de Rougemont, LRI , University Paris II
Joint work with E. Fischer, Technion, F. Magniez, LRI
1. Distance between words (structures): edit distance with moves
2. Distance between a word (structure) and a class of words (structures)
3. Distance between two languages (classes)
4. Applications: regular languages, DTDs
Distances between languages
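$L_1 \subseteq_\varepsilon L_2$ if every $v \in L_1$ (except finitely many) is $\varepsilon$-close to $L_2$
$L_1 \equiv_\varepsilon L_2$ if $L_1 \subseteq_\varepsilon L_2$ and $L_2 \subseteq_\varepsilon L_1$
$\mathrm{dist}(w, L) = \min_{w' \in L} \mathrm{dist}(w, w')$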
1. Tester for equality, constant time
2. Tester for w in L, constant time
3. Tester for approximate equivalence of regular languages, polynomial
Equivalence tester
Results
If $L_1 \equiv L_2$ then A accepts
If $L_1 \not\equiv_\varepsilon L_2$ then A rejects with probability 2/3
1. Satisfiability : Tree |= F
2. Approximate satisfiability: Tree $\models_\varepsilon$ F
3. Approximate equivalence
[Figure: image on a class K of trees, showing F and the region ε-far from F]
1. Approximate Satisfiability and Equivalence
[Figure: properties F and G]
Let F be a property on a class K of structures U.
An ε-tester for F is a probabilistic algorithm A such that:
• If U |= F, A accepts
• If U is ε-far from F, A rejects with high probability
• Time(A) is independent of n.
(Goldreich, Goldwasser, Ron 1996; Rubinfeld, Sudan 1994)
A tester usually implies a linear-time corrector.
Testers on a class K
History of Testers
Self-testers and correctors for Linear Algebra, Blum & Kannan 1989
Robust characterizations of polynomials, R. Rubinfeld, M. Sudan, 1994
Testers for graph properties: k-colorability, Goldreich et al. 1996
Graph properties have testers, Alon et al. 1999
Regular languages have testers, Alon et al. 2000
Testers for Regular tree languages, MdR and Magniez, ICALP 2004
1. Classical Edit Distance:
Insertions, Deletions, Modifications
2. Edit Distance with moves
0111000011110011001
0111011110000011001
3. Edit Distance with Moves generalizes to Trees
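As a point of comparison for item 1, here is a minimal sketch of the classical edit distance (insertions, deletions, modifications) by dynamic programming. It does not handle moves, and the name `edit_distance` is ours, not from the slides. On the two example words above, one move is enough, while the classical distance has to pay for each displaced letter.

```python
def edit_distance(s: str, t: str) -> int:
    """Classical edit distance: insertions, deletions, modifications (no moves)."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        cur = [i]  # distance from s[:i] to the empty prefix of t
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,               # delete a
                           cur[j - 1] + 1,            # insert b
                           prev[j - 1] + (a != b)))   # modify a into b, or keep it
        prev = cur
    return prev[-1]

w1 = "0111000011110011001"
w2 = "0111011110000011001"
print(edit_distance(w1, w2))
```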
2. Equality tester
Statistics on words
[Figure: a word divided into subwords of length k]
Block statistics: b.stat
Uniform statistics: u.stat
Block Uniform statistics: bu.stat
$k = 1/\varepsilon$
Block statistics
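W=001010101110, length n, subwords of length k, n/k blocks
$\mathrm{b.stat}(W) = \frac{k}{n}\,(n_1, n_2, \ldots, n_{2^k})^T$
where $n_1$ = number of blocks "00...0", $n_2$ = number of blocks "00...1", ..., $n_{2^k}$ = number of blocks "11...1"
For k=2, n/k=6: $\mathrm{b.stat}(W) = \frac{1}{6}\,(1, 0, 4, 1)^T$ (blocks ordered 00, 01, 10, 11)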
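The block statistics of the example can be checked with a short sketch. The function name `b_stat` and the requirement that k divides n are our assumptions; exact fractions are used so the result can be compared with the vector on the slide.

```python
from fractions import Fraction
from itertools import product

def b_stat(w: str, k: int, alphabet: str = "01") -> list:
    """Block statistics: frequencies of the n/k consecutive blocks of length k."""
    if len(w) % k != 0:
        raise ValueError("this sketch assumes that k divides len(w)")
    blocks = [w[i:i + k] for i in range(0, len(w), k)]
    # Blocks in the order 00...0, 00...1, ..., 11...1
    keys = ["".join(p) for p in product(alphabet, repeat=k)]
    return [Fraction(blocks.count(u), len(blocks)) for u in keys]

print(b_stat("001010101110", 2))
```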
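Uniform statistics
W=001010101110, length n, subwords of length k, n-k+1 subwords
$\mathrm{u.stat}(W) = \frac{1}{n-k+1}\,(n_1, n_2, \ldots, n_{2^k})^T$
where $n_1$ = number of subwords "00...0", $n_2$ = number of subwords "00...1", ..., $n_{2^k}$ = number of subwords "11...1"
For k=2, n-k+1=11: $\mathrm{u.stat}(W) = \frac{1}{11}\,(1, 4, 4, 2)^T$ (subwords ordered 00, 01, 10, 11)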
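A sliding-window sketch of the uniform statistics, returning a dictionary from subword to frequency. The name `u_stat` is ours; subwords that never occur are simply absent from the result.

```python
from collections import Counter
from fractions import Fraction

def u_stat(w: str, k: int) -> dict:
    """Uniform statistics: frequencies of the n-k+1 subwords of length k."""
    total = len(w) - k + 1
    counts = Counter(w[i:i + k] for i in range(total))
    return {u: Fraction(c, total) for u, c in counts.items()}

print(u_stat("001010101110", 2))
```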
Statistics and distance
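W=001010101110 (n=12, k=2)
W1=001101001110 W2=110100001110 W3=110100001111
W'=W3=110100001111
dist(W,W') = 3, dist(W,W')/12 = 0.25
$\mathrm{b.stat}(W) = \frac{1}{6}(1, 0, 4, 1)^T$, $\mathrm{b.stat}(W') = \frac{1}{6}(2, 1, 0, 3)^T$
$d_0 = |\mathrm{b.stat}(W) - \mathrm{b.stat}(W')| = 8/6 = 1.33$
$\mathrm{u.stat}(W) = \frac{1}{11}(1, 4, 4, 2)^T$, $\mathrm{u.stat}(W') = \frac{1}{11}(3, 2, 2, 4)^T$
$d_1 = |\mathrm{u.stat}(W) - \mathrm{u.stat}(W')| = 8/11 = 0.73$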
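The two distances d0 and d1 of this example can be recomputed as follows. This is a sketch under the assumption that |.| is the L1 norm of the difference of the two statistics vectors, which is what reproduces the values 8/6 and 8/11; the helper names `stat` and `l1` are ours.

```python
from collections import Counter
from fractions import Fraction

def stat(w: str, k: int, step: int) -> dict:
    """Frequencies of length-k subwords taken every `step` positions
    (step=k gives b.stat, step=1 gives u.stat)."""
    subs = [w[i:i + k] for i in range(0, len(w) - k + 1, step)]
    return {u: Fraction(c, len(subs)) for u, c in Counter(subs).items()}

def l1(p: dict, q: dict) -> Fraction:
    """L1 distance between two frequency vectors stored as dictionaries."""
    return sum((abs(p.get(u, 0) - q.get(u, 0)) for u in set(p) | set(q)), Fraction(0))

W, W2 = "001010101110", "110100001111"
d0 = l1(stat(W, 2, 2), stat(W2, 2, 2))   # block statistics
d1 = l1(stat(W, 2, 1), stat(W2, 2, 1))   # uniform statistics
print(d0, d1)  # 4/3 8/11
```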
Goal: d1 approximates the distance.
Let ε = 1/k. For n > n0: dist - ε.n < d1 < dist + ε.n
Practical application: $\varepsilon = 10^{-2}$, hence k = 100 and stat dimension $2^{100}$.
For words of length $n = 10^9$, d1 is approximated from N samples, and the approximation is good after $N = O(1/\varepsilon^3)$ trials.
Remarks:
1. Distance with moves.
W = 000…0001111…111, W' = 1111…111000…000
$\mathrm{b.stat}(W) = \mathrm{b.stat}(W') = (0.5, 0, \ldots, 0, 0.5)^T$ (when k divides n/2)
2. Robustness to noise: if W, W' are noisy inputs (but ε-close), the method still works.
3. Random words are close with the moves, far without.
Classical complexity
Edit distances:
1. P problem on words without the moves.
• Approximation?
• Sublinear algorithm?
2. NP-complete problem on words with the moves.
• O(1)-approximable
3. P problem on ordered trees without the moves
4. NP-complete problem on unordered trees and trees with the moves.
Basic tool: Chernoff bound
Random variables:
Markov bound
Chebyshev bound
Chernoff bound: sum of independent variables Xi, whose average is μ
Hoeffding bound
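[Figure: distribution Prob[X=k] as a function of k, with a deviation a.μ around the average μ]
$Y = \frac{1}{N} \sum_{i=1,\ldots,N} X_i$, where each $X_i = (0, \ldots, 0, 1, 0)^T$ is an indicator vector
$\Pr[\,|Y - \mu| \ge a\,] \le e^{-c.N.a^2}$ for a constant c (the constant on the slide, which involves 8, is garbled in this transcript)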
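To get a feeling for how many samples such a bound requires, here is a sketch based on the standard two-sided Hoeffding inequality for independent variables in [0, 1], $\Pr[|Y - \mu| \ge a] \le 2e^{-2Na^2}$. This is the textbook form, not necessarily the constants used on the slide; the function name and the failure probability `delta` are our assumptions.

```python
import math

def hoeffding_samples(a: float, delta: float) -> int:
    """Smallest N such that 2 * exp(-2 * N * a**2) <= delta,
    for independent samples X_i with values in [0, 1]."""
    return math.ceil(math.log(2 / delta) / (2 * a * a))

# Samples needed to estimate one coordinate within a = 0.1,
# with failure probability at most 5%.
N = hoeffding_samples(0.1, 0.05)
print(N)
```

The sample count depends on a and delta only, not on the length n of the word, which is the point of the sampling approach.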
Tester for equality of strings
Edit distance with moves. NP-complete problem, but O(1)-approximable.
Uniform statistics ($k = 1/\varepsilon$): W=001010101110, $\mathrm{u.stat}(W) = \frac{1}{11}(1, 4, 4, 2)^T$
Theorem 1. |u.stat(w) - u.stat(w')| approximates dist(w,w')/n.
Sample N subwords of length k, compute Y(w) and Y(w'):
$Y(w) = \frac{1}{N} \sum_{i=1,\ldots,N} X_i$, $Y(w') = \frac{1}{N} \sum_{i=1,\ldots,N} X'_i$, where each $X_i = (0, \ldots, 0, 1, 0)^T$ is an indicator vector
Theorem 2. Y(w) approximates u.stat(w).
Corollary. |Y(w) - Y(w')| approximates dist(w,w')/n.
Tester: If |Y(w) - Y(w')| < ε, accept, else reject.
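The sampling tester can be sketched as follows. This is an illustration under assumptions that are ours: |.| is taken as the L1 norm, N is passed in by the caller rather than derived from ε, and a fixed seed makes the run reproducible.

```python
import random
from collections import Counter

def sample_u_stat(w: str, k: int, N: int, rng: random.Random) -> dict:
    """Y(w): empirical frequencies of N subwords of length k at uniform positions."""
    counts = Counter()
    for _ in range(N):
        i = rng.randrange(len(w) - k + 1)
        counts[w[i:i + k]] += 1
    return {u: c / N for u, c in counts.items()}

def equality_tester(w: str, w2: str, k: int, N: int, eps: float, seed: int = 0) -> bool:
    """Accept (True) if |Y(w) - Y(w')| < eps in L1 norm, else reject (False)."""
    rng = random.Random(seed)
    y, y2 = sample_u_stat(w, k, N, rng), sample_u_stat(w2, k, N, rng)
    l1 = sum(abs(y.get(u, 0.0) - y2.get(u, 0.0)) for u in set(y) | set(y2))
    return l1 < eps

print(equality_tester("0" * 1000, "0" * 1000, k=10, N=50, eps=0.1))  # True
print(equality_tester("0" * 1000, "1" * 1000, k=10, N=50, eps=0.1))  # False
```

The tester reads only N.k letters of each word, whatever their length.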
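2a. Regular words
Definition: L is a regular language and A an automaton for L. Test w in L.
[Figure: automaton A with connected components C0, C1, C2, C3, C4, from init to accept]
Admissible path: $Z = C_0 \to C_2 \to C_3 \to C_4$
A word W is Z-feasible if there are two states $q \in C_i$ and $q' \in C_j$, with $C_i, C_j$ in Z, such that $q \xrightarrow{W} q'$.
$\mathrm{dist}(w, L) = \min_{w' \in L} \mathrm{dist}(w, w')$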
Tester for regular words
Tester. Input: W, A, ε
For $i = 1, \ldots, \log(m/\varepsilon)$:
  Choose $N_i$ random subwords $w_{ij}$ of W of size $2^{i+1}$ ($N_i$ depends on i, $m^3$ and ε; its exact expression is garbled in this transcript)
For every admissible path Z:
  If all $w_{ij}$ of W are Z-feasible, ACCEPT.
else REJECT.
Theorem: Tester(W, A, ε) is an ε-tester for L(A).
Proof schema of the Tester
Theorem: Regular words are testable.
Robustness lemma: If W is ε-far from L, then for every admissible path Z, there exists i such that the number of Z-infeasible subwords of size $2^{i+1}$ is at least a fixed fraction of n (the bound on i, of the form log(m/ε) up to constants, and the exact count are garbled in this transcript).
Splitting lemma: if W is far from L there are many disjoint infeasible subwords.
Amplifying lemma: If there are many infeasible words, there are many short ones.
Merging trees
Merging lemma: Let Z be an admissible path, and let F be a Z-feasible cut of size h'. Then $\mathrm{Dist}(F, L) \le m^2 h'$.
[Figure: words grouped by connected component C]
Take each word $w_i \in F$ and split it along its connected components, removing single letters. Rearrange all the words of the same component in its Z-order. Add gluing words $g_i$ to obtain W' in L:
$W' = g_0.w_1.g_1.w_2.g_2 \ldots$
Splitting
Splitting lemma: If Z is an admissible path and W a word s.t. dist(W,L) > h, then W has more than $h/m^2$ Z-infeasible disjoint subwords ($h = \varepsilon.n$).
Proof by contraposition:
Suppose W has less than $h' = h/m^2$ minimal Z-infeasible and disjoint subwords.
Removing the last letters provides a feasible cut F, with $\mathrm{Dist}(W, F) \le h'$.
By the merging lemma, $\mathrm{Dist}(F, L) \le m^2 h'$.
Hence $\mathrm{Dist}(W, L) \le h' + m^2 h'$, which is of order h.
2b. Regular trees
[Figure: trees with nodes labelled a, b, c, d, e, f, illustrating Edge Deletion, and Node and Label Insertion]
Tree Edit distance with moves:
[Figure: two trees with nodes labelled a, b, c, d, e that differ by 1 move]
Distance Problem is NP-complete, non-approximable.
Binary trees : Distance with moves allows permutations
Tree-Edit-Distance on binary trees
Distance(T1,T2) =4 m-Distance (T1,T2) =2
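Tree automata
$A = (Q, q_0, \delta, q_1)$, with transitions:
• (q0, q0) → q1
• (q0, q1) → q1
• (q1, q1) → q2
• (q1, q0) → q2
• (q2, -) → q2
• (-, q2) → q2
[Figure: runs of the automaton on binary trees, with nodes labelled by the states q0, q1, q2]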
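A bottom-up run with the transitions listed on this slide can be sketched as follows. Two points are our assumptions: leaves are taken to be in state q0, and "-" is read as "any state". Binary trees are encoded as `None` for a leaf and a pair `(left, right)` for an inner node.

```python
def step(left: str, right: str) -> str:
    """One transition of the tree automaton on the states of the two children."""
    if "q2" in (left, right):
        return "q2"  # (q2, -) -> q2 and (-, q2) -> q2
    table = {("q0", "q0"): "q1", ("q0", "q1"): "q1",
             ("q1", "q1"): "q2", ("q1", "q0"): "q2"}
    return table[(left, right)]

def run(tree) -> str:
    """State reached at the root; a tree is None (leaf) or a pair (left, right)."""
    if tree is None:
        return "q0"  # assumption: leaves are in state q0
    left, right = tree
    return step(run(left), run(right))

print(run((None, (None, None))))   # q1
print(run(((None, None), None)))   # q2
```

The two calls differ only by swapping the children of the root, and they reach different states because (q0, q1) and (q1, q0) have different targets.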
Infeasible subtrees
Fact. If $\mathrm{Distance}(T, L) \ge \varepsilon.n$ then the number of infeasible subtrees of constant size is O(n).
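Tester for regular trees
Tester. Input: T, A, ε
For each size i up to a bound depending on r and m:
  Choose N random nodes and subtrees $t_{ij}$ of size i (the bound on i and the value of N, which involve r, m and i, are garbled in this transcript)
If all $t_{ij}$ of T are Z-feasible, ACCEPT.
Theorem: Tester(T, A, ε) is an ε-tester for L(A).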
Proof schema of the Tester
Theorem: Regular trees are testable.
Robustness lemma: If T is ε-far from L, then for every admissible path Z, there exists i such that the number of Z-infeasible i-subtrees is at least a fixed fraction of n (the bound on i and the exact count, which involve r and m, are garbled in this transcript).
Splitting lemma: if T is far from L there are many disjoint infeasible subtrees.
Amplifying lemma: If there are many infeasible subtrees, there are many small ones.
Splitting and Merging
Splitting and Merging on words: [Figure: words grouped by connected component C]
Splitting and Merging on trees: [Figure: tree split along its connected components C, D, E]
Connected Components
Corrected tree
Correction in practice: right branch tree, http://www.lri.fr/~mdr/xml/
2 moves, dist=2
1. Inclusion
2. Equivalence
Equivalence tester
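3. Equivalence testing of Regular Languages
$L_1 \subseteq_\varepsilon L_2$ if every $v \in L_1$ (except finitely many) is $\varepsilon$-close to $L_2$
$L_1 \equiv_\varepsilon L_2$ if $L_1 \subseteq_\varepsilon L_2$ and $L_2 \subseteq_\varepsilon L_1$
If $L_1 \equiv L_2$ then A accepts
If $L_1 \not\equiv_\varepsilon L_2$ then A rejects with probability 2/3
Statistics on words
Block statistics: b.stat
Uniform statistics: u.stat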
The construction of the tester for regular languages is exponential in the size of the automaton.
We need a construction polynomial in the size of the automaton.
For equivalence testing, we use b.stat
Automata for Regular languages
Regular languages and automata
Non-deterministic automaton A; let $A^k$ be the automaton accepting words of length k, reading v in $\Sigma^k$.
Definition: v in $\Sigma^k$ is an $A^k$-loop if there are u, w such that
• the word u.v.w is accepted by $A^k$
• the state after u is identical to the state after u.v
A finite set of loops is $A^k$-compatible if all loops can occur in an accepting word.
Definition (Convex hull): for $A^k$-loops $v_1, \ldots, v_t$,
$\text{Convex-Hull}(\mathrm{b.stat}(v_1), \ldots, \mathrm{b.stat}(v_t)) = \{\sum_{i=1,\ldots,t} \lambda_i.\mathrm{b.stat}(v_i) \text{ s.t. } \lambda_i \ge 0,\ \sum_i \lambda_i = 1\}$
Automata for Regular languages
Basic property: if $w \in L$, then w is close to $v.u_1.\ldots.u_l$, where $|u_i| \le m$ and $u_1, \ldots, u_l$ is a multi-set of $A^k$-compatible loops.
Carathéodory's theorem: in dimension d, the convex hull of N points can be decomposed into a union of convex hulls of d+1 points.
Large loops can be decomposed. Small loops (of size less than m=|A|) suffice.
Proposition: $H = \bigcup \text{Convex-Hull}(\mathrm{b.stat}(v_1), \ldots, \mathrm{b.stat}(v_t))$, the union ranging over the sets of compatible $A^k$-loops $v_1, \ldots, v_t$ with $|v_i| \le m$.
Approximate Parikh mapping
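Lemma: for every $w \in L$ there exists w' such that $\mathrm{dist}(w, w') \le m$, $\mathrm{b.stat}(w') \in H$ and $|\mathrm{b.stat}(w) - \mathrm{b.stat}(w')| \le 2m/|w|$.
Find $w'' = v.u_1.\ldots.u_l$, ε-close to w. Remove v, i.e. at most m block letters.
Lemma: For every X in H, for every n, there exists w in L s.t. |w| is close to n and $|X - \mathrm{b.stat}(w)| \le c/n$.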
Approximate Parikh mapping
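Lemma: For every X in H, there exists w in L s.t. b.stat(w) is close to X. Conversely, a word w whose b-stat(w) is close to H satisfies dist(w, L) ≤ c.ε.n (the exact bounds are garbled in this transcript; the last one involves a factor 2).
[Figure: the point X in H and b-stat(w) for a word w]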
H is a fair representation of L
Construction of H
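Enumerate all loops: $|\Sigma|^{k.m}$ of them.
The number of b-stat is smaller, since some loops have the same b-stat: ABBA and BBAA. It is the number of partitions of a word of length m with « big blocks », at most $m^{|\Sigma|^k}$.
Construct H by matrix iteration: $P_t(i,j)$ is the set of b.stat of the paths of t block letters from state i to state j, and $P_{t+1} = P_t \circ P_1$.
Construction of H
Lemma: compute a set I of at most $m^{|\Sigma|^k}$ b-stat of loops, and from it sets of at most $|\Sigma|^k + 1$ compatible loops.
Compute $P_t$ for t=1,…,m.
In the diagonals, find the b-stat of small loops, at most $m^{|\Sigma|^k}$.
Consider subsets S of I of at most $|\Sigma|^k + 1$ elements which are compatible.
$H = \bigcup_{S \subseteq I} \text{Convex-Hull}(S)$
Time: $O(m^{c})$ for an exponent c depending on $|\Sigma|^k$ (garbled in this transcript).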
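The matrix iteration can be sketched on block-count vectors, which give the b-stat of a loop after dividing by its length. The representation is our assumption: `edges` maps a pair of states (i, j) to the set of block letters labelling transitions from i to j, and a count vector is stored as a sorted tuple of (block, count) pairs so that loops with the same statistics, like ABBA and BBAA, collapse to one entry. Compatibility of loops and the convex hulls are not computed here.

```python
from collections import Counter

def loop_stats(edges: dict, states: list, m: int) -> set:
    """Block-count vectors of all loops of at most m block letters."""
    def add(vec: tuple, block: str) -> tuple:
        c = Counter(dict(vec))
        c[block] += 1
        return tuple(sorted(c.items()))

    # P[(i, j)]: count vectors of the paths of exactly t blocks from i to j (t = 1 here)
    P = {(i, j): {((b, 1),) for b in edges.get((i, j), ())}
         for i in states for j in states}
    found = set()
    for _ in range(m):
        for i in states:
            found |= P[(i, i)]  # diagonal entries are loops
        # P_{t+1} = P_t o P_1: extend every path by one more block letter
        P = {(i, j): {add(v, b)
                      for l in states for v in P[(i, l)]
                      for b in edges.get((l, j), ())}
             for i in states for j in states}
    return found

# Hypothetical 2-state block automaton with k = 2
edges = {(1, 1): {"aa"}, (1, 2): {"ab"}, (2, 1): {"ba"}}
for vec in sorted(loop_stats(edges, [1, 2], m=2)):
    print(vec)
```

In this example the loops ab.ba (from state 1) and ba.ab (from state 2) yield the same count vector, so only three vectors are found.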
Example
Automaton A:
Blocks, k=2, m=4, $|\Sigma| = 4$, $|\Sigma|^k + 1 = 17$:
Loops: {(aa,ca:1),(bb,2),(cc,ac:3),(dd:4)}
[Figure: automaton A with states 1, 2, 3, 4 and transitions labelled a, b, c, d; polytope $H_A$ with vertices aa, ca, ac, cc, bb, dd]
Equivalence tester
Tester for w in L (regular): compute b-stat(w) and H, and decide if dist(w,L) > ε.n. Time is polynomial in m=|L|.
Previous tester was exponential in m.
Tester of $A \equiv_\varepsilon B$:
1. Compute $H_A$ and $H_B$
2. Reject if $H_A$ and $H_B$ are different.
Time polynomial in m=|A,B|
Generalizations
Buchi Automata. Distance on infinite words: two words w, w' are ε-close if $\limsup_{n \to \infty} \mathrm{dist}(w(n), w'(n))/n \le \varepsilon$.
A word w is ε-close to a language L if there exists w' in L s.t. w and w' are ε-close.
Statistics: set of accumulation points of $\mathrm{b.stat}(w(n))$.
H: compatible loops of connected components of accepting states.
Tester for Buchi Automata: compute $H_A$ and $H_B$; reject if $H_A$ and $H_B$ are different.
Equivalence of CF grammars is undecidable; approximate equivalence is decidable in exponential time.
Conclusion
1. Testers and Correctors
2. Constant-time algorithm for Edit Distance with moves
3a. Testers and Correctors for regular words
3b. Tester for regular trees and corrector for regular trees
4. Equivalence tester for automata
Polynomial time algorithm
Generalization to Buchi automata and Context-Free Tree regular languages
Let F be a property on a class K of structures U
F is Equality
Soundness and Robustness
Soundness: close structures have close statistics (from an upper bound on dist(w,w') relative to n)
Robustness: far structures have far statistics (from a lower bound on dist(w,w') relative to n)
Robustness of b.stat
Robustness of b-stat: $\mathrm{dist}(w,w') \le (\tfrac{1}{2}.|\mathrm{b.stat}(w) - \mathrm{b.stat}(w')| + \varepsilon).n$
If b.stat(w) = b.stat(w'), then dist(w,w') ≤ ε.n.
If b.stat(w) ≠ b.stat(w'), then construct w'' s.t. b.stat(w'') = b.stat(w'), after at most $|\mathrm{b.stat}(w) - \mathrm{b.stat}(w')|.n/2$ substitutions on w.
Example: $\mathrm{b.stat}(W) = \frac{1}{6}(1, 0, 4, 1)^T$ and $\mathrm{b.stat}(W') = \frac{1}{6}(2, 0, 3, 1)^T$: # "00" is 1 in W and 2 in W', but # "10" is 4 in W and 3 in W'.
W'': take one block "10" in W and change it into "00".
Soundness of u.stat
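Soundness of u-stat: $\mathrm{dist}(w,w') \le \varepsilon^2.n \Rightarrow |\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \le 6.\varepsilon$
Simple edit: $|\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \le \frac{2k}{n-k+1}$
Move w=A.B.C.D, w'=A.C.B.D: $|\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \le \frac{2.3(k-1)}{n-k+1} \le \frac{6k}{n}$
Hence, for ε².n operations, $|\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \le 6.\varepsilon$ (with k = 1/ε).
Problem: robustness of u.stat? Harder! You need an auxiliary distribution and two key lemmas.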
Block Uniform Statistics
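$\mathrm{bu.stat}(w) = E_v(\mathrm{b.stat}(v))$, the expectation of b.stat(v) over the words v obtained from w with blocks of length K at a random offset t (the index ranges are garbled in this transcript).
$X_i = \mathrm{b.stat}(v_i)$, $X_i[u] = \mathrm{b.stat}(v_i)[u]$, $0 \le X_i[u] \le 1$
Each $X_i[u]$ is independent. The average is $\mathrm{bu.stat}(w)[u]$.
Chernoff bound: $\Pr_t[\,|\mathrm{b.stat}(v)[u] - \mathrm{bu.stat}(w)[u]| \ge t.\mathrm{bu.stat}(w)[u]\,]$ is exponentially small in n/K (the exponent, which involves 8 and $t^2$, is garbled).
Union bound: the same holds for the whole vector, $\Pr_t[\,|\mathrm{b.stat}(v) - \mathrm{bu.stat}(w)| \ge t.\mathrm{bu.stat}(w)\,]$, up to a factor counting the coordinates.
Hence $\Pr_t[\,|\mathrm{b.stat}(v) - \mathrm{bu.stat}(w)| \le \varepsilon/2\,] > 0$.
Lemma 1: there exists $v_w$ such that $|\mathrm{bu.stat}(w) - \mathrm{b.stat}(v_w)| \le \varepsilon/2$.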
Uniform Statistics
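Number of subwords of length k missed by bu: (k-1).(n/K - 1)
Lemma: Let $A \subseteq B$ and $\mu_A$, $\mu_B$ the two uniform distributions on A and B. Then $|\mu_A - \mu_B| = 2.|B \setminus A| / |B|$.
Apply the previous lemma with $|B| = n-k+1$ and a block length K depending on n and log n (the exact choice is garbled in this transcript).
Lemma 2: $|\mathrm{bu.stat}(w) - \mathrm{u.stat}(w)|$ is small for large n (the bound, which involves log n, is garbled in this transcript).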
Robustness of the uniform Statistics
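Robustness of u-stat: $\mathrm{dist}(w,w') \ge 5.\varepsilon.n \Rightarrow |\mathrm{u.stat}(w) - \mathrm{u.stat}(w')| \ge 6{,}5.\varepsilon$
By Lemma 1: there exists $v_w$ such that $|\mathrm{bu.stat}(w) - \mathrm{b.stat}(v_w)| \le \varepsilon/2$.
By Lemma 3: $|\mathrm{bu.stat}(w) - \mathrm{u.stat}(w)|$ is small for large n (bound garbled in this transcript).
Get v, v' from w, w'.
Robustness of b-stat implies robustness of u-stat.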
Tester for the distance with moves
NP-complete problem, but O(1)-approximable.
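Approximate u.stat: sample N subwords of length k, compute Y:
$Y = \frac{1}{N} \sum_{i=1,\ldots,N} X_i$, where each $X_i = (0, \ldots, 0, 1, 0)^T$ is an indicator vector
Y is a good approximation of u.stat (Chernoff): $\Pr[\,|Y - \mathrm{u.stat}(W)| \ge a\,] \le e^{-c.N.a^2}$ for a constant c (the constant on the slide, which involves 8, is garbled in this transcript).
Uniform statistics is a good approximation of the distance by soundness and robustness.
Tester: If Y<ε.n accept, else reject.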