Analytic Combinatorics (of P. Flajolet)
Profile of Digital Trees
Wojciech Szpankowski∗
Department of Computer Science
Purdue University, W. Lafayette, IN
U.S.A.
August 14, 2012
Polish Combinatorial Conference, Bedlewo, 2012
Dedicated to PHILIPPE FLAJOLET
∗Joint work with H-K. Hwang, M. Drmota, P. Flajolet, P. Nicodeme, G. Park, and M. Ward.
Outline of the Presentation
1. Tries and Suffix Trees
2. Profiles of Tries
• Trie Parameters
• Main Results
• Sketch of Proofs
• Consequences (height, fill-up, shortest path)
3. Profile of Digital Search Trees
4. Applications: Error Resilient Lempel-Ziv’77 (Suffix Trees)
5. Analysis of Algorithms and Analytic Information Theory
Tries and Suffix Trees
[Figure: a trie built from the keys 00, 0100, 01010, 01011, 011, 1010, and 1011, with edges labeled 0 (left) and 1 (right).]
A trie, or prefix tree, is an ordered tree data structure that stores keys, usually represented by strings.

Tries were introduced by de la Briandais (1959) and Fredkin (1960), who coined the name: "trie" derives from retrieval.

A suffix tree is a trie built from the suffixes of one string.

Other digital trees include PATRICIA tries and digital search trees (DST).
Typical Tries: In this talk we mostly discuss random tries (and DST)
built from n (independent) sequences generated by a binary memoryless
source with p denoting the probability of generating a “0” (q = 1−p ≤ p).
Digital Trees
Digital Trees:
Tries, PATRICIA Tries, Digital Search Trees:
Figure 1: A trie, a PATRICIA trie, and a digital search tree built from:
x1 = 11100 . . . , x2 = 10111 . . . , x3 = 00110 . . . , and x4 = 00001 . . ..
Usefulness of Tries
Tries, suffix trees, and DST are widely used in diverse applications:
• automatically correcting words in texts; Kukich (1992);
• taxonomies of regular language; Watson (1995);
• event history in datarace detection for multi-threaded object-oriented
programs; Choi et al. (2002);
• internet IP addresses lookup; Nilsson and Tikkanen (2002);
• data compression, Lempel-Ziv, . . . ; W.S. (2001);
• distributed hash tables, Malkhi et al. (2002) and Adler et al. (2003).
Fundamental, prototype data structures:
• have a large number of variations and extensions (Patricia, DST, bucket
digital search trees, k-d tries, quadtries, LC-tries, multiple-tries, etc.);
• closely connected to several splitting procedures using coin-flipping:
collision resolution in multi-access (or broadcast) communication
models, loser selection or leader election, etc.
• have direct combinatorial interpretations in terms of words, urn models,
etc.
Outline Update
1. Tries and Suffix Trees
2. Usefulness of Tries
3. Profiles of Tries
• Trie Parameters
• Main Results
• Sketch of Proofs
• Consequences
4. Applications
G. Park, H.-K. Hwang, P. Nicodeme, and W. Szpankowski,
"Profiles of Tries,"
SIAM J. Computing, 38(5), 1821-1880, 2009.
External and Internal Profiles
[Figure: the trie on the keys 00, 0100, 01010, 01011, 011, 1010, 1011, with its profiles:]

B^0_n = 0, I^0_n = 1;
B^1_n = 0, I^1_n = 2;
B^2_n = 1, I^2_n = 2;
B^3_n = 1, I^3_n = 2;
B^4_n = 3, I^4_n = 1;
B^5_n = 2, I^5_n = 0.
External profile and internal profile:
B^k_n = number of external nodes at distance k from the root;
I^k_n = number of internal nodes at distance k from the root.
Why Study Profiles?
• Fine, informative shape characteristic;
• Related to path length, depth, height, shortest path, width, etc.;
• Breadth-first search;
• Compression algorithms.
• Mathematically challenging, phenomenally interesting!
Example: Parameters such as the height H_n, the shortest path s_n, the fill-up level F_n, and the depth D_n can be studied through the profiles, since:

H_n = max{k : B^k_n > 0},
s_n = min{k : B^k_n > 0},
F_n = max{k : I^k_n = 2^k},
Pr(D_n = k) = E[B^k_n]/n.

[Figure: a trie on x_1, ..., x_5 with the levels F_n, s_n, D_5, and H_n marked.]
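As a concrete cross-check, the profiles and the parameters derived from them can be computed directly from a key set. The sketch below (our own illustration; `trie_profiles` is a hypothetical helper, and a prefix-free key set is assumed) reproduces the profile values of the example trie:

```python
from collections import Counter

def trie_profiles(keys):
    """External profile B_k and internal profile I_k of the trie built
    from a prefix-free set of binary strings."""
    # count, for every prefix w, how many keys start with w
    pref = Counter(k[:d] for k in keys for d in range(len(k) + 1))
    # a key becomes an external node at its shortest unshared prefix
    depth = {k: min(d for d in range(len(k) + 1) if pref[k[:d]] == 1)
             for k in keys}
    max_d = max(depth.values())
    B = [0] * (max_d + 1)
    I = [0] * (max_d + 1)
    for d in depth.values():
        B[d] += 1
    for w, c in pref.items():
        if c >= 2 and len(w) <= max_d:  # internal node: prefix shared by >= 2 keys
            I[len(w)] += 1
    return B, I

# the example trie from the slides
keys = ["00", "0100", "01010", "01011", "011", "1010", "1011"]
B, I = trie_profiles(keys)                         # B = [0,0,1,1,3,2], I = [1,2,2,2,1,0]
H = max(k for k in range(len(B)) if B[k] > 0)      # height H_n = 5
s = min(k for k in range(len(B)) if B[k] > 0)      # shortest path s_n = 2
F = max(k for k in range(len(I)) if I[k] == 2**k)  # fill-up level F_n = 1
```

The derived parameters fall straight out of the profile vectors, exactly as in the formulas above.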
Recurrence for the Profiles
External Profile B^k_n:

Define the probability generating function

B^k_n(u) = E[u^{B^k_n}] = ∑_{ℓ≥0} P(B^k_n = ℓ) u^ℓ.

Then

B^k_n(u) = ∑_{i=0}^{n} C(n,i) p^i q^{n−i} B^{k−1}_i(u) B^{k−1}_{n−i}(u)

with B^0_n(u) = 1 for n ≠ 1 and B^0_1(u) = u.

The internal-profile probability generating function I^k_n(u) = E[u^{I^k_n}] satisfies the same recurrence with I^0_n(u) = u for n > 1 and I^0_0(u) = I^0_1(u) = 1.
Average External Profile:

E[B^k_n] = ∑_{i=0}^{n} C(n,i) p^i q^{n−i} ( E[B^{k−1}_i] + E[B^{k−1}_{n−i}] ),  n ≥ 2, k ≥ 1,

under suitable initial conditions (e.g., E[B^k_0] = 0 for all k).
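The recurrence can be sketched numerically. The code below (our own illustration; the base case E[B^0_n] = [n = 1] is taken from the generating-function initial conditions) computes the expected external profile by dynamic programming; since each of the n strings ends at exactly one external node, the profile must sum to n:

```python
from math import comb

def expected_external_profile(n_max, k_max, p):
    """E[B^k_n] for a memoryless(p) trie via the recurrence
       E[B^k_n] = sum_i C(n,i) p^i q^(n-i) (E[B^{k-1}_i] + E[B^{k-1}_{n-i}]),
       with base case E[B^0_n] = [n == 1]."""
    q = 1.0 - p
    E = [[0.0] * (k_max + 1) for _ in range(n_max + 1)]
    E[1][0] = 1.0  # a single string is one external node at the root
    for k in range(1, k_max + 1):
        for n in range(2, n_max + 1):
            E[n][k] = sum(comb(n, i) * p**i * q**(n - i)
                          * (E[i][k - 1] + E[n - i][k - 1])
                          for i in range(n + 1))
    return E

E = expected_external_profile(8, 40, 0.6)
total = sum(E[8])   # should be (numerically) 8
```

For a quick sanity check, E[B^1_2] = 4pq: two strings both end at level 1 exactly when they differ in the first symbol, which happens with probability 2pq.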
Main Results
Notation: r = p/q = p/(1 − p) > 1, and α := α_{n,k} = k/log n.

[Figure: α_1, α_2, α_3 plotted as functions of p ∈ (0.5, 1).]

α_1 := 1/log(1/q),
α_2 := (p^2 + q^2)/(p^2 log(1/p) + q^2 log(1/q)),
α_3 := 2/log(1/(p^2 + q^2)).
1: Exponential Growth (0 < α < α_1):

Let 1 ≤ k ≤ (1/log(1/q)) (log n − log log log n + log(r − 1) − ε). Then

E[B^k_n] = n q^k (1 − q^k)^{n−1} (1 + O((log n)^{−δ})) = O(2^{−nν})

for some δ, ν > 0.

2: Logarithmic Growth (0 < α < α_1):

Let 1 ≤ k ≤ (1/log(1/q)) (log n − log log log n + m log(r − 1) − ε). Then

E[B^k_n] = O(log log n · (log n)^{m−β}),

where m and β are constants.
Phase Transitions
3: Polynomial Growth (α_1 log n < k < α_2 log n, i.e., α_1 < α < α_2):

E[B^k_n] ∼ G_1(log n) · (p^ρ q^ρ (p^{−ρ} + q^{−ρ}) / √(2πα log(p/q))) · n^{υ_1}/√(log n),

where G_1(x) is a periodic function and

υ_1 = −ρ + α log(p^{−ρ} + q^{−ρ}),  ρ = −(1/log(p/q)) log( (α log(1/q) − 1)/(1 − α log(1/p)) ).
[Figure 2: The fluctuating part of the periodic function G_1(x) for p = 0.55, 0.65, ..., 0.95; its amplitude is tiny (e.g., about 6 × 10^{−23} for p = 0.55, 3 × 10^{−7} for p = 0.65, and 6 × 10^{−3} for p = 0.75).]
4: Polynomial Growth/Decay (α_2 log n < k, i.e., α_2 < α):

E[B^k_n] = (2pq/(p^2 + q^2)) n^{ν_2} + O(n^{ν_3}),

where ν_2 = 2 + α log(p^2 + q^2), for some ν_3 < ν_2.
External Shapes
[Figure: external shapes of random tries for p = 0.5 (where α_1 = α_2 = 1/log 2) and p = 0.75, marking the levels log_2 n + O(1) and 2 log_2 n + O(1) in the symmetric case, and the fill-up/shortest-path level near (1/log(1/q))(log n − log log log n), the typical depth log n/(p log(1/p) + q log(1/q)), and the height 2 log n/log(1/(p^2 + q^2)) + O(1) in the asymmetric case.]
Average Internal Profile
1: Almost Full Tree (k < α_1 log n):

E[I^k_n] = 2^k − E[B^k_n](1 + o(1)).

2: Phase Transition I (α_1 log n < k < α_0 log n, where α_0 = 2/(log(1/p) + log(1/q))):

E[I^k_n] = 2^k − G_2(log n) E[B^k_n](1 + o(1)),

where G_2(x) is a periodic function.

3: Phase Transition II (α_0 log n < k < α_2 log n):

E[I^k_n] = G_2(log n) E[B^k_n](1 + o(1)),

where G_2(x) is a periodic function.

4: Polynomial Growth/Decay (α_2 log n < k):

E[I^k_n] = (1/2) n^{ν_2}(1 + o(1)),

where ν_2 = 2 + α log(p^2 + q^2) as before.
Variance and Limiting Distributions of the External Profile
Variance:

1: k < α_1 log n: V[B^k_n] ∼ E[B^k_n].
2: α_1 log n < k < α_2 log n: V[B^k_n] ∼ G_3(log n) E[B^k_n], where G_3(x) is a periodic function.
3: α_2 log n < k: V[B^k_n] ∼ 2 E[B^k_n].

Limiting Distributions:

Central Limit Theorem: For α_1 log n < k < α_3 log n:

(B^k_n − E[B^k_n]) / √(V[B^k_n]) → N(0, 1),

where N(0, 1) is the standard normal distribution.

Poisson Distribution: For α_3 log n < k:

P(B^k_n = 2m) = (λ_0^m/m!) e^{−λ_0} + o(1), and P(B^k_n = 2m + 1) = o(1),

where λ_0 := pq n^2 (p^2 + q^2)^{k−1}.
Consequences
Height: For large n (cf. Flajolet, 1980; Pittel, 1985; W.S., 1988; Devroye, 1992):

H_n = (2/log(1/(p^2 + q^2))) log n = α_3 log n =: k_H (whp).

Upper Bound: with k = ⌈(1 + ε)k_H⌉ (first moment method),

P(H_n > (1 + ε)k_H) ≤ P(B^k_n ≥ 1) ≤ E[B^k_n] → 0.

Lower Bound: with k = ⌈(1 − ε)k_H⌉ (second moment method),

P(H_n < (1 − ε)k_H) ≤ P(B^k_n = 0) ≤ V[B^k_n]/(E[B^k_n])^2 = O(1/E[B^k_n]) → 0.
Define: k_S := ⌊(1/log(1/q)) (log n − log log log n + log(e log r))⌋.

Shortest Path: For large n (cf. Knessl and W.S., 2005):

P(s_n = k_S or s_n = k_S + 1) → 1.

Fill-up: For large n (cf. Pittel, 1986; Devroye, 1992; Knessl and W.S., 2005):

P(F_n = k_S − 1 or F_n = k_S) → 1.
Sketch of the Proof

1. Recurrence:

E[B^k_n] = ∑_{i=0}^{n} C(n,i) p^i q^{n−i} ( E[B^{k−1}_i] + E[B^{k−1}_{n−i}] ),  n ≥ 2, k ≥ 1.

2. Poisson Transform: E_k(z) = ∑_{n≥0} E[B^k_n] (z^n/n!) e^{−z} satisfies

E_k(z) = E_{k−1}(zp) + E_{k−1}(zq),  k ≥ 2.

3. Mellin Transform: E^*_k(s) := ∫_0^∞ z^{s−1} E_k(z) dz = (p^{−s} + q^{−s}) E^*_{k−1}(s), hence

E^*_k(s) = (p^{−s} + q^{−s})^{k−1} · s · (p^{−s} + q^{−s} − 1) Γ(s)

for ℜ(s) ∈ (−2, ∞), where Γ(s) is the Euler gamma function.

4. Inverse Mellin Transform: E_k(z) = (1/2πi) ∫_{c−i∞}^{c+i∞} z^{−s} E^*_k(s) ds, i.e.,

E_k(z) = (1/2πi) ∫_{c−i∞}^{c+i∞} s (p^{−s} + q^{−s} − 1) Γ(s) z^{−s} (p^{−s} + q^{−s})^{k−1} ds,

evaluated by the saddle point method (see below).

5. Depoissonization: from the Poisson transform E_k(z) back to E[B^k_n].
Saddle Point Method: Phase Transitions

By depoissonization we have E[B^k_n] ∼ E_k(n), where

E_k(n) = (1/2πi) ∫_{c−i∞}^{c+i∞} g(s) Γ(s + 1) n^{−s} (p^{−s} + q^{−s})^k ds
       = (1/2πi) ∫_{c−i∞}^{c+i∞} g(s) Γ(s + 1) exp(h(s) log n) ds,  k = α log n,

where g(s) = p^{−s} + q^{−s} − 1 (note g(−1) = 0, which cancels the pole of Γ(s + 1) at s = −1). For k = α log n, the factor n^{−s}(p^{−s} + q^{−s})^k = exp(h(s) log n) is large.
Saddle Point Method:

Let F(z) be analytic with nonnegative coefficients and of "fast growth". Then

f_n = [z^n] F(z) = (1/2πi) ∮_C F(z) dz/z^{n+1} = (1/2πi) ∮_C e^{H(z)} dz,

where H(z) := log F(z) − (n + 1) log z.

Define the saddle point z_0 by H′(z_0) = 0, so that

H(z) = H(z_0) + (1/2)(z − z_0)^2 H″(z_0) + O(H‴(z_0)(z − z_0)^3).

Then

[z^n] F(z) = (exp(H(z_0)) / √(2π |H″(z_0)|)) · (1 + O(H‴(z_0)/(H″(z_0))^{3/2})).
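A minimal numerical sketch of the method, applied to F(z) = e^z, where the exact coefficient [z^n]F(z) = 1/n! is known (our own example, not from the slides):

```python
import math

def saddle_estimate(n):
    """Saddle point estimate of f_n = [z^n] e^z (exact value: 1/n!).
       H(z) = log F(z) - (n+1) log z = z - (n+1) log z, so H'(z0) = 0
       at z0 = n + 1, and H''(z0) = (n+1)/z0^2."""
    z0 = n + 1.0
    H0 = z0 - (n + 1) * math.log(z0)   # H(z0)
    H2 = (n + 1) / z0**2               # H''(z0)
    return math.exp(H0) / math.sqrt(2 * math.pi * H2)

n = 20
est, exact = saddle_estimate(n), 1.0 / math.factorial(n)
# the relative error is O(1/n), consistent with the correction term above
```

The estimate reproduces Stirling's formula for 1/n!, which is exactly what the saddle point expansion predicts.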
Infinity of Saddle Points
The saddle point equation is h′(s) = 0, where

h(s) = α log(p^{−s} + q^{−s}) − s.

It has a unique real root

ρ = −(1/log r) log( (α log(1/q) − 1)/(1 − α log(1/p)) ),  for 1/log(1/q) < α < 1/log(1/p).

Note that |p^{−ρ−it_j} + q^{−ρ−it_j}| = p^{−ρ} + q^{−ρ} for t_j = 2πj/log r, j ∈ ℤ, so all points ρ + it_j are saddle points.

[Figure: the infinite family of saddle points ρ ± 2πij/log r, j = 1, 2, ..., on the vertical line ℜ(s) = ρ.]
Phase Transitions:

1. There are infinitely many saddle points ρ + it_j.
2. ρ → ∞ as α ↓ 1/log(1/q) = α_1.
3. ρ → −∞ as α ↑ 1/log(1/p).
4. The saddle points coalesce with the poles of Γ(s + 1) at s = −2, −3, ...; the pole at s = −2 leads to α_2.
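The formula for ρ and the lattice of equal-modulus saddle points can be checked numerically; a small sketch (our own, with natural logarithms assumed throughout):

```python
import math

def rho(alpha, p):
    """Real root of h'(s) = 0 for h(s) = alpha*log(p^-s + q^-s) - s,
       valid for 1/log(1/q) < alpha < 1/log(1/p), with q = 1 - p <= p."""
    q = 1.0 - p
    return -(1.0 / math.log(p / q)) * math.log(
        (alpha * math.log(1 / q) - 1.0) / (1.0 - alpha * math.log(1 / p)))

def h_prime(s, alpha, p):
    """h'(s) = alpha * d/ds log(p^-s + q^-s) - 1."""
    q = 1.0 - p
    return (alpha * (-math.log(p) * p**(-s) - math.log(q) * q**(-s))
            / (p**(-s) + q**(-s)) - 1.0)

p, alpha = 0.7, 1.5
s0 = rho(alpha, p)
# along Re(s) = s0, |p^-s + q^-s| is the same at every t_j = 2*pi*j/log(p/q)
t1 = 2 * math.pi / math.log(p / (1 - p))
base = p**(-s0) + (1 - p)**(-s0)
mods = [abs(p**(-complex(s0, j * t1)) + (1 - p)**(-complex(s0, j * t1)))
        for j in (1, 2, 3)]
```

The equality of the moduli is exactly why infinitely many saddle points contribute, producing the periodic factors G_i.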
Analytic Depoissonization
Theorem 1 (Jacquet and W.S., 1998). Let G(z) = ∑_{n≥0} g_n (z^n/n!) e^{−z} be the Poisson transform of g_n. In a linear cone S_θ assume two conditions:

(I) For z ∈ S_θ and some reals B, R > 0 and ν:

|z| > R ⇒ |G(z)| ≤ B |z|^ν Ψ(|z|),

where Ψ(x) is a slowly varying function.

(O) For z ∉ S_θ and A > 0, α < 1:

|z| > R ⇒ |G(z) e^z| ≤ A exp(α|z|).

Then

g_n = G(n) + O(n^{ν−1} Ψ(n)).

Using the depoissonization theorem, we get E[B^k_n] = E_k(n) + O(n^{ν−1} √(log n)).
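Steps 2 and 5 can be illustrated numerically: unrolling E_k(z) = E_{k−1}(pz) + E_{k−1}(qz) down to E_1(z) = pz(e^{−pz} − e^{−z}) + qz(e^{−qz} − e^{−z}) (this closed form for E_1 is our own derivation from the stated initial conditions, so treat it as an assumption) gives the Poisson transform in closed form, which should match the exact E[B^k_n] up to the depoissonization error:

```python
import math
from math import comb

def E1(z, p):
    # E_1(z) = pz(e^{-pz} - e^{-z}) + qz(e^{-qz} - e^{-z})  (assumed form)
    q = 1 - p
    return (p * z * (math.exp(-p * z) - math.exp(-z))
            + q * z * (math.exp(-q * z) - math.exp(-z)))

def poisson_profile(z, k, p):
    """E_k(z) by unrolling E_k(z) = E_{k-1}(pz) + E_{k-1}(qz), k >= 2."""
    q = 1 - p
    return sum(comb(k - 1, j) * E1(p**j * q**(k - 1 - j) * z, p)
               for j in range(k))

def exact_profile(n, k, p):
    """E[B^k_n] from the binomial recurrence, base case E[B^0_n] = [n == 1]."""
    q = 1 - p
    E = [[0.0] * (k + 1) for _ in range(n + 1)]
    E[1][0] = 1.0
    for kk in range(1, k + 1):
        for nn in range(2, n + 1):
            E[nn][kk] = sum(comb(nn, i) * p**i * q**(nn - i)
                            * (E[i][kk - 1] + E[nn - i][kk - 1])
                            for i in range(nn + 1))
    return E[n][k]

n, k, p = 200, 7, 0.55
approx, exact = poisson_profile(n, k, p), exact_profile(n, k, p)
# approx and exact agree up to the (small) depoissonization error
```

This is only a numerical sanity check of E[B^k_n] ≈ E_k(n); the theorem quantifies the error uniformly.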
Outline Update
1. Tries and Suffix Trees
2. Profiles of Tries
3. Profile of Digital Search Trees
4. Applications: Error Resilient Lempel-Ziv’77
M. Drmota and W. Szpankowski,
“The Expected Profile of Digital Search Trees”,
J. Combin. Theory, Ser. A, 118, 1939-1965, 2011.
Digital Search Trees
Figure 3: A digital search tree built on eight strings s1, . . . , s8 (i.e.,
s1 = 0 . . ., s2 = 1 . . ., s3 = 01 . . ., s4 = 11 . . ., etc.) with internal and
external (squares) nodes, and its profiles.
Profile of Digital Search Trees

1. Recurrence:

E[B^{k+1}_{n+1}] = ∑_{i=0}^{n} C(n,i) p^i q^{n−i} ( E[B^k_i] + E[B^k_{n−i}] ),  n ≥ 2, k ≥ 0.

2. Poisson Transform: E_k(z) = ∑_{n≥0} E[B^k_n] (z^n/n!) e^{−z} satisfies

E′_{k+1}(z) + E_{k+1}(z) = E_k(zp) + E_k(zq),  k ≥ 2.

3. Mellin Transform: E^*_k(s) := ∫_0^∞ z^{s−1} E_k(z) dz = −Γ(s) F^*_k(s), where

F^*_{k+1}(s) − F^*_{k+1}(s − 1) = (p^{−s} + q^{−s}) · F^*_k(s)

for ℜ(s) ∈ (−k − 1, 0), and F^*_0(s) = 1.

4. The Power Series: f(x, s) = ∑_{k≥0} F^*_k(s) x^k becomes

f(x, s) = g(x, s)/g(x, 0),  g(x, s) = h(x, s)/(1 − x(p^{−s} + q^{−s})),

where g(x, s) = 1 + x ∑_{j≥0} g(x, s − j)(p^{−s+j} + q^{−s+j}), and asymptotically

F^*_k(s) ∼ A(s)(p^{−s} + q^{−s})^k,

where A(s) is analytic with A(−r) = 0 for r = 0, 1, 2, ….

5. Inverse Mellin Transform: by the saddle point method and depoissonization.

6. Asymptotically, the DST profile behaves similarly to the profile of tries.
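A quick numerical sketch of the DST recurrence (the base case E[B^0_0] = 1, the empty tree being a single external node at the root, is our assumption; it is consistent with the fact that a DST on n strings has n internal and therefore n + 1 external nodes):

```python
from math import comb

def dst_expected_profile(n_max, k_max, p):
    """E[B^k_n] for digital search trees from the recurrence
       E[B^{k+1}_{n+1}] = sum_i C(n,i) p^i q^(n-i) (E[B^k_i] + E[B^k_{n-i}]).
       Base case E[B^0_0] = 1 is assumed (empty tree = one external node)."""
    q = 1.0 - p
    E = [[0.0] * (k_max + 1) for _ in range(n_max + 1)]
    E[0][0] = 1.0
    for n1 in range(1, n_max + 1):          # n1 = n + 1 strings
        n = n1 - 1
        for k1 in range(1, k_max + 1):      # k1 = k + 1
            E[n1][k1] = sum(comb(n, i) * p**i * q**(n - i)
                            * (E[i][k1 - 1] + E[n - i][k1 - 1])
                            for i in range(n + 1))
    return E

E = dst_expected_profile(8, 30, 0.6)
# a DST on n strings has n internal nodes, hence n + 1 external nodes:
sums = [sum(E[n]) for n in range(9)]   # 1, 2, ..., 9
```

The invariant ∑_k E[B^k_n] = n + 1 follows from the recurrence by induction, which makes it a good correctness check.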
Main Results
Theorem 2. Let E[I^k_n] denote the expected internal profile in (asymmetric) digital search trees, 0 < p < q = 1 − p < 1. The following assertions hold:

1. If α_1 + ε ≤ k/log n ≤ α_0 − ε, where α_1 = 1/log(1/p), then

E[I^k_n] = 2^k − G_3(log_p(p^k n)) · ((p^{−ρ_{n,k}} + q^{−ρ_{n,k}})^k n^{−ρ_{n,k}} / √(2π β(ρ_{n,k}) k)) · (1 + O(k^{−1/2})),

where G_3(x) is a non-zero periodic function with period 1.

2. If k = α_0 (log n + ξ √(α_0 β(0) log n)), where ξ = o((log n)^{1/6}), then

E[I^k_n] = 2^k Φ(−ξ) (1 + O((1 + |ξ|^3)/√(log n))),

where Φ is the standard normal distribution function.

3. If α_0 + ε ≤ k/log n ≤ α_4 − ε, where α_4 = 1/log(1/q), then

E[I^k_n] = G_3(log_p(p^k n)) · ((p^{−ρ_{n,k}} + q^{−ρ_{n,k}})^k n^{−ρ_{n,k}} / √(2π β(ρ_{n,k}) k)) · (1 + O(k^{−1/2})),

where ρ_{n,k} = ρ(k/log n).
Thank You!
Outline Update
1. Tries and Suffix Trees
2. Usefulness of Tries
3. Profiles of Tries
4. Applications: Error Resilient Lempel-Ziv’77 (Suffix Trees)
S. Lonardi, M. Ward, and W. Szpankowski,
"Error Resilient LZ'77 Compression: Algorithms, Analysis, and Experiments,"
IEEE Trans. Information Theory, 53, 1799-1813, 2007.
Error Resilient LZ’77 Scheme
1. The Lempel-Ziv'77 scheme works on-line: it compresses by replacing the longest prefix of the still-uncompressed text with a (pointer, length) reference to its earlier copy.

2. Castelli and Lastras proved in 2004 that a single error in LZ'77 corrupts O(n^{2/3}) phrases, thus about O(n^{2/3} log n) symbols, where n is the size of the database.

3. There are multiple copies of the longest prefix; we denote their number by M_n for a database of length n.

4. By a judicious choice of pointers in the LZ'77 scheme, we can recover ⌊log_2 M_n⌋ bits without losing a bit in compression. Parity bits recovered from the multiple copies are used for Reed-Solomon channel coding.

[Figure: the history and the current position in the LZRS'77 encoding.]
Analysis of Mn Via Suffix Trees
Why does LZRS’77 work so well?
Performance of LZRS’77 depends on Mn. How does Mn typically behave?
Build a suffix tree from the first n suffixes of the database X (i.e., S_1 = X_1^∞, S_2 = X_2^∞, ..., S_n = X_n^∞). Then insert the (n+1)st suffix, S_{n+1} = X_{n+1}^∞.

The depth of insertion of S_{n+1} is the (n+1)st phrase length, and M_n is the size of the subtree rooted at the insertion point of the (n+1)st suffix.
Figure 6: M_4 (= 2) is the size of the subtree at the insertion point of S_5.
Analysis of Mn for Independent Tries
1. Consider digital tries built over n independent strings.
Average E[MIn] and probability generating function
satisfying the following recurrences
(p = 1 − q is the probability of generating a “1”) MkI
MnI
Mn k–I
q = 1-pp
E[MIn] = pn(qn + pE[MI
n]) + qn(pn + qE[MIn)] +
n−1∑
k=1
(nk
)pkqn−k(pE[MI
k ] + qE[MIn−k])
E[uMIn] = pn(qun + pE[uMI
n]) + qn(pun + qE[uMIn]) +
n−1∑
k=1
(nk
)pkqn−k(pE[uMI
k ] + qE[uMI
n−k]).
• (Analytic) Poissonization (W(z) = ∑_{n≥0} E[M^I_n] (z^n/n!) e^{−z}):

W(z) = qpz e^{−qz} + pqz e^{−pz} + p W(pz) + q W(qz).

• Mellin Transform (f^*(s) = ∫_0^∞ f(x) x^{s−1} dx):

W^*(s) = Γ(s + 1)(p q^{−s} + q p^{−s}) / (1 − p^{1−s} − q^{1−s}).

• Inverse Mellin Transform: W(z) = 1/h + fluctuations.

• (Analytic) Depoissonization: E[M^I_n] = 1/h + fluctuations.
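Solving the recurrence for E[M^I_n], which appears on both sides, gives a direct dynamic program; the sketch below (our own) checks convergence to 1/h:

```python
from math import comb, log

def expected_M(n_max, p):
    """E[M^I_n] from the recurrence above, solved for E[M^I_n]:
       E_n (1 - p^{n+1} - q^{n+1}) = n (p^n q + q^n p)
            + sum_{k=1}^{n-1} C(n,k) p^k q^(n-k) (p E_k + q E_{n-k})."""
    q = 1.0 - p
    E = [0.0] * (n_max + 1)
    for n in range(1, n_max + 1):
        rhs = n * (p**n * q + q**n * p)
        rhs += sum(comb(n, k) * p**k * q**(n - k) * (p * E[k] + q * E[n - k])
                   for k in range(1, n))
        E[n] = rhs / (1.0 - p**(n + 1) - q**(n + 1))
    return E

p = 0.5
E = expected_M(400, p)
h = -(p * log(p) + (1 - p) * log(1 - p))   # entropy rate (nats)
# E[M^I_n] -> 1/h (about 1.4427 for p = 0.5), plus tiny fluctuations
```

Note that E[M^I_1] = 1 exactly (a single string is its own unique copy), which serves as a base-case check.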
Analysis of Mn for Dependent Strings
2. Suffix Trees: Using analytic combinatorics on words we prove that

M(z, u) = ∑_{n=1}^{∞} ∑_{k=1}^{∞} P(M_n = k) u^k z^n = ∑_{w∈A^*} ∑_{α∈A} (u P(α) P(w)/D_w(z)) · (D_{wα}(z) − (1 − z)) / (D_w(z) − u(D_{wα}(z) − (1 − z))),

where D_w(z) = (1 − z) S_w(z) + z^m P(w), m = |w|, and S_w(z) is the autocorrelation polynomial

S_w(z) = ∑_{k∈𝒫(w)} P(w^m_{k+1}) z^{m−k},

with 𝒫(w) the set of positions k of w satisfying w_1 ... w_k = w_{m−k+1} ... w_m.

For any ε > 0 there exists β > 1 such that (all the hard analytic work is here!)

|Pr(M_n = k) − Pr(M^I_n = k)| = O(n^{−ε} β^{−k}).

Random suffix trees resemble random independent tries (cf. P. Jacquet and W.S., 1994; Lonardi, W.S., and Ward, 2005; IEEE Trans. Inf. Th., 2007).
Main Results
Theorem 4 (Ward, W.S., 2005). Let z_k = 2krπi/ln p for all k ∈ ℤ, where ln p/ln q = r/s for some relatively prime integers r, s (i.e., ln p/ln q is rational). The jth factorial moment E[(M_n)_j] = E[M_n(M_n − 1) ⋯ (M_n − j + 1)] is

E[(M_n)_j] = Γ(j) (q(p/q)^j + p(q/p)^j)/h + δ_j(log_{1/p} n) + O(n^{−η}),

where h = −p log p − q log q is the entropy rate, η > 0, Γ is the Euler gamma function, and

δ_j(t) = ∑_{k≠0} −e^{2krπit} Γ(z_k + j) (p^j q^{−z_k−j+1} + q^j p^{−z_k−j+1}) / (p^{−z_k+1} ln p + q^{−z_k+1} ln q).

δ_j is a periodic function of small magnitude that exhibits fluctuations when ln p/ln q is rational.
Note: On average there are E[M_n] ∼ 1/h additional pointers.

j  | (1/ln 2) ∑_{k≠0} |Γ(j − 2kπi/ln 2)|
1  | 1.4260 × 10^{−5}
3  | 1.2072 × 10^{−3}
5  | 1.1421 × 10^{−1}
6  | 1.1823 × 10^{0}
8  | 1.4721 × 10^{2}
9  | 1.7798 × 10^{3}
10 | 2.2737 × 10^{4}
Distribution of Mn
Theorem 5 (Ward, W.S., 2005). Let z_k = 2krπi/ln p for all k ∈ ℤ, where ln p/ln q = r/s. Then

P(M_n = j) = (p^j q + q^j p)/(jh) + ∑_{k≠0} −e^{2krπi log_{1/p} n} Γ(z_k)(p^j q + q^j p)(z_k)_j / (j!(p^{−z_k+1} ln p + q^{−z_k+1} ln q)) + O(n^{−η}),

where η > 0 and Γ is the Euler gamma function.

Therefore, M_n follows the logarithmic series distribution with mean 1/h (plus small fluctuations).

The logarithmic series distribution, P(M_n = j) ≈ (p^j q + q^j p)/(jh), is well concentrated around its mean E[M_n] ≈ 1/h.
[Figure: the logarithmic series distribution P(M_n = j), plotted for j = 2, ..., 8.]
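The leading term of Theorem 5 is indeed a probability distribution with mean 1/h; a small sketch verifying both facts numerically (our own check, with the fluctuation terms dropped):

```python
import math

def log_series_pmf(j, p):
    """Leading term of Theorem 5: P(M_n = j) ~ (p^j q + q^j p)/(j h),
       with h = -p ln p - q ln q (fluctuations dropped)."""
    q = 1.0 - p
    h = -(p * math.log(p) + q * math.log(q))
    return (p**j * q + q**j * p) / (j * h)

p = 0.6
h = -(p * math.log(p) + (1 - p) * math.log(1 - p))
total = sum(log_series_pmf(j, p) for j in range(1, 200))     # -> 1
mean = sum(j * log_series_pmf(j, p) for j in range(1, 200))  # -> 1/h
```

Both identities follow from ∑_{j≥1} p^j/j = ln(1/q): the probabilities sum to (p ln(1/p) + q ln(1/q))/h = 1, and the mean telescopes to (p + q)/h = 1/h.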
Analytic Information Theory and Analysis of Algorithms
• In the 1997 Shannon Lecture Jacob Ziv presented compelling
arguments for “backing off” from first-order asymptotics in order to
predict the behavior of real systems with finite length description.
• To overcome these difficulties we propose replacing first-order analyses
by full asymptotic expansions and more accurate analyses (e.g., large
deviations, central limit laws).
• Following Hadamard’s precept1, we study information theory problems
using techniques of complex analysis such as generating functions,
combinatorial calculus, Rice’s formula, Mellin transform, Fourier series,
sequences distributed modulo 1, saddle point methods, analytic
poissonization and depoissonization, and singularity analysis.
• This program, which applies complex-analytic tools of analysis of
algorithms to information theory, constitutes analytic information
theory.
1The shortest path between two truths on the real line passes through the complex plane.
Thank You!