Csr2011 june15 09_30_shen


. . . . . .

CSR-2011

Kolmogorov complexity as a language

Alexander Shen
LIF CNRS, Marseille; on leave from IITP RAS (ИППИ РАН), Moscow

. . . . . .

Kolmogorov complexity

- A powerful tool
- Just a way to reformulate arguments
- Three languages: combinatorial / algorithmic / probabilistic

. . . . . .

Kolmogorov complexity

I A powerful tool

I Just a way to reformulate argumentsI three languages: combinatorial/algorithmic/probabilistic

. . . . . .

Kolmogorov complexity

I A powerful toolI Just a way to reformulate arguments

I three languages: combinatorial/algorithmic/probabilistic

. . . . . .

Kolmogorov complexity

I A powerful toolI Just a way to reformulate argumentsI three languages: combinatorial/algorithmic/probabilistic

. . . . . .

Kolmogorov complexity

I A powerful toolI Just a way to reformulate argumentsI three languages: combinatorial/algorithmic/probabilistic

. . . . . .

Reminder and notation

- K(x) = minimal length of a program that produces x
- K_D(x) = min{ |p| : D(p) = x }
- depends on the interpreter D
- an optimal D makes it minimal up to an O(1) additive term (a toy illustration of the definition follows this slide)
- Variations:

                               p = string              p = prefix of a sequence
  x = string                   plain:    K(x), C(x)    prefix:   KP(x), K(x)
  x = prefix of a sequence     decision: KR(x), KD(x)  monotone: KM(x), Km(x)

- Conditional complexity C(x|y): minimal length of a program p : y ↦ x.
- There is also a priori probability (in two versions: discrete, on strings; continuous, on prefixes)
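A toy illustration (my own sketch, not from the talk; the decompressor D and the example strings are made up) of the definition K_D(x) = min{ |p| : D(p) = x } for one concrete, far-from-optimal decompressor:

    from itertools import product

    def D(p: str):
        """Toy decompressor: '0' + w decodes to w (a literal);
        '1' + w decodes to w repeated twice (a trivial kind of compression)."""
        if not p:
            return None
        return p[1:] if p[0] == '0' else p[1:] * 2

    def K_D(x: str, max_len: int = 12):
        """Brute-force min{ |p| : D(p) = x } over all programs of at most max_len bits."""
        for n in range(max_len + 1):
            for bits in product('01', repeat=n):
                if D(''.join(bits)) == x:
                    return n
        return None

    print(K_D('0101'))        # 3: the program '101' decodes to '01' * 2
    print(K_D('01010101'))    # 5: the program '10101' decodes to '0101' * 2
    print(K_D('0110'))        # 5: no repetition helps, so the literal program '00110'

For a fixed total decompressor like this one, K_D is computable by such a search; the optimal (universal) D of the slide gives K(x), which is defined only up to an O(1) additive term and is no longer computable.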

. . . . . .


Foundations of probability theory

- Random object or random process?
- “well shuffled deck of cards”: any meaning?
- [xkcd cartoon]
- randomness = incompressibility (maximal complexity); a counting remark follows this slide
- ω = ω1ω2… is random iff KM(ω1…ωn) ≥ n − O(1)
- Classical probability theory: a random sequence satisfies the Strong Law of Large Numbers with probability 1
- Algorithmic version: every (algorithmically) random sequence satisfies the SLLN
- algorithmic ⇒ classical: Martin-Löf random sequences form a set of measure 1.
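A standard counting remark behind “randomness = incompressibility” (my addition, not on the slide): strings whose complexity is noticeably below their length are rare, because there are few short programs.

    % There are fewer than 2^{n-c} programs of length < n-c, and each describes
    % at most one string, so
    \[
      \#\{\, x \in \{0,1\}^{n} : C(x) < n - c \,\} \;<\; 2^{\,n-c},
    \]
    % i.e. at least a (1 - 2^{-c}) fraction of all n-bit strings has C(x) \ge n - c.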

. . . . . .

Foundations of probability theory

I Random object or random process?

I “well shuffled deck of cards”: any meaning?

[xkcd cartoon]

I randomness = incompressibility (maximal complexity)I ω = ω1ω2 . . . is random iff KM(ω1 . . . ωn) ≥ n−O(1)

I Classical probability theory: random sequence satisfies theStrong Law of Large Numbers with probability 1

I Algorithmic version: every (algorithmically) randomsequence satisfies SLLN

I algorithmic ⇒ classical: Martin-Löf random sequencesform a set of measure 1.

. . . . . .

Foundations of probability theory

I Random object or random process?I “well shuffled deck of cards”: any meaning?

[xkcd cartoon]

I randomness = incompressibility (maximal complexity)I ω = ω1ω2 . . . is random iff KM(ω1 . . . ωn) ≥ n−O(1)

I Classical probability theory: random sequence satisfies theStrong Law of Large Numbers with probability 1

I Algorithmic version: every (algorithmically) randomsequence satisfies SLLN

I algorithmic ⇒ classical: Martin-Löf random sequencesform a set of measure 1.

. . . . . .

Foundations of probability theory

I Random object or random process?I “well shuffled deck of cards”: any meaning?

[xkcd cartoon]

I randomness = incompressibility (maximal complexity)I ω = ω1ω2 . . . is random iff KM(ω1 . . . ωn) ≥ n−O(1)

I Classical probability theory: random sequence satisfies theStrong Law of Large Numbers with probability 1

I Algorithmic version: every (algorithmically) randomsequence satisfies SLLN

I algorithmic ⇒ classical: Martin-Löf random sequencesform a set of measure 1.

. . . . . .

Foundations of probability theory

I Random object or random process?I “well shuffled deck of cards”: any meaning?

[xkcd cartoon]

I randomness = incompressibility (maximal complexity)

I ω = ω1ω2 . . . is random iff KM(ω1 . . . ωn) ≥ n−O(1)

I Classical probability theory: random sequence satisfies theStrong Law of Large Numbers with probability 1

I Algorithmic version: every (algorithmically) randomsequence satisfies SLLN

I algorithmic ⇒ classical: Martin-Löf random sequencesform a set of measure 1.

. . . . . .

Foundations of probability theory

I Random object or random process?I “well shuffled deck of cards”: any meaning?

[xkcd cartoon]

I randomness = incompressibility (maximal complexity)I ω = ω1ω2 . . . is random iff KM(ω1 . . . ωn) ≥ n−O(1)

I Classical probability theory: random sequence satisfies theStrong Law of Large Numbers with probability 1

I Algorithmic version: every (algorithmically) randomsequence satisfies SLLN

I algorithmic ⇒ classical: Martin-Löf random sequencesform a set of measure 1.

. . . . . .

Foundations of probability theory

I Random object or random process?I “well shuffled deck of cards”: any meaning?

[xkcd cartoon]

I randomness = incompressibility (maximal complexity)I ω = ω1ω2 . . . is random iff KM(ω1 . . . ωn) ≥ n−O(1)

I Classical probability theory: random sequence satisfies theStrong Law of Large Numbers with probability 1

I Algorithmic version: every (algorithmically) randomsequence satisfies SLLN

I algorithmic ⇒ classical: Martin-Löf random sequencesform a set of measure 1.

. . . . . .

Foundations of probability theory

I Random object or random process?I “well shuffled deck of cards”: any meaning?

[xkcd cartoon]

I randomness = incompressibility (maximal complexity)I ω = ω1ω2 . . . is random iff KM(ω1 . . . ωn) ≥ n−O(1)

I Classical probability theory: random sequence satisfies theStrong Law of Large Numbers with probability 1

I Algorithmic version: every (algorithmically) randomsequence satisfies SLLN

I algorithmic ⇒ classical: Martin-Löf random sequencesform a set of measure 1.

. . . . . .

Foundations of probability theory

I Random object or random process?I “well shuffled deck of cards”: any meaning?

[xkcd cartoon]

I randomness = incompressibility (maximal complexity)I ω = ω1ω2 . . . is random iff KM(ω1 . . . ωn) ≥ n−O(1)

I Classical probability theory: random sequence satisfies theStrong Law of Large Numbers with probability 1

I Algorithmic version: every (algorithmically) randomsequence satisfies SLLN

I algorithmic ⇒ classical: Martin-Löf random sequencesform a set of measure 1.

. . . . . .

Sampling random strings (S. Aaronson)

- A device that (when switched on) produces an N-bit string and stops
- “The device produces a random string”: what does it mean?
- classical: the output distribution is close to the uniform one
- effective: with high probability the output string is incompressible
- not equivalent if no assumptions are made about the device
- but they are related under some assumptions

. . . . . .

Sampling random strings (S.Aaronson)

I A device that (being switched on) produces N-bit string andstops

I “The device produces a random string”: what does itmean?

I classical: the output distribution is close to the uniform oneI effective: with high probability the output string isincompressible

I not equivalent if no assumptions about the deviceI but are related under some assumptions

. . . . . .

Sampling random strings (S.Aaronson)

I A device that (being switched on) produces N-bit string andstops

I “The device produces a random string”: what does itmean?

I classical: the output distribution is close to the uniform oneI effective: with high probability the output string isincompressible

I not equivalent if no assumptions about the deviceI but are related under some assumptions

. . . . . .

Sampling random strings (S.Aaronson)

I A device that (being switched on) produces N-bit string andstops

I “The device produces a random string”: what does itmean?

I classical: the output distribution is close to the uniform one

I effective: with high probability the output string isincompressible

I not equivalent if no assumptions about the deviceI but are related under some assumptions

. . . . . .

Sampling random strings (S.Aaronson)

I A device that (being switched on) produces N-bit string andstops

I “The device produces a random string”: what does itmean?

I classical: the output distribution is close to the uniform oneI effective: with high probability the output string isincompressible

I not equivalent if no assumptions about the deviceI but are related under some assumptions

. . . . . .

Sampling random strings (S.Aaronson)

I A device that (being switched on) produces N-bit string andstops

I “The device produces a random string”: what does itmean?

I classical: the output distribution is close to the uniform oneI effective: with high probability the output string isincompressible

I not equivalent if no assumptions about the device

I but are related under some assumptions

. . . . . .

Sampling random strings (S.Aaronson)

I A device that (being switched on) produces N-bit string andstops

I “The device produces a random string”: what does itmean?

I classical: the output distribution is close to the uniform oneI effective: with high probability the output string isincompressible

I not equivalent if no assumptions about the deviceI but are related under some assumptions

. . . . . .

Example: matrices without uniform minors

- k × k minor of an n × n Boolean matrix: select k rows and k columns
- a minor is uniform if it is all-0 or all-1
- claim: there is an n × n bit matrix without k × k uniform minors for k = 3 log n

. . . . . .

Example: matrices without uniform minors

I k× k minor of n× n Boolean matrix: select k rows and kcolumns

I minor is uniform if it is all-0 or all-1.I claim: there is a n× n bit matrix without k× k uniformminors for k = 3 log n.

. . . . . .

Example: matrices without uniform minors

I k× k minor of n× n Boolean matrix: select k rows and kcolumns

I minor is uniform if it is all-0 or all-1.I claim: there is a n× n bit matrix without k× k uniformminors for k = 3 log n.

. . . . . .

Counting argument and complexity reformulation

- ≤ n^k × n^k positions of the minor [k = 3 log n]
- 2 types of uniform minors (0/1)
- 2^(n^2 − k^2) possibilities for the rest of the matrix
- n^(2k) × 2 × 2^(n^2 − k^2) = 2^(6 log^2 n + 1 + (n^2 − 9 log^2 n)) < 2^(n^2)
- log n bits to specify a column or row: 2k log n bits in total
- one additional bit to specify the type of minor (0/1)
- n^2 − k^2 bits to specify the rest of the matrix
- 2k log n + 1 + (n^2 − k^2) = 6 log^2 n + 1 + (n^2 − 9 log^2 n) < n^2  (a numeric check follows this slide)
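A quick numeric check of the arithmetic above (my own sketch; it assumes n is a power of two so that log n = log_2 n is an integer, and uses the same over-counting n^k for the choice of k rows or columns):

    from math import log2

    def bad_matrix_bound_is_small(n: int) -> bool:
        """Compare the upper bound on the number of n x n 0/1 matrices that contain
        a uniform k x k minor (k = 3 log2 n) with the total number 2^(n^2)."""
        k = int(3 * log2(n))
        bad = n ** (2 * k) * 2 * 2 ** (n * n - k * k)
        return bad < 2 ** (n * n)

    print([(n, bad_matrix_bound_is_small(n)) for n in (16, 32, 64, 128)])
    # expected output: [(16, True), (32, True), (64, True), (128, True)]

Since the bound stays strictly below 2^(n^2), some matrix has no uniform k × k minor; equivalently, a matrix of complexity close to n^2 cannot have one, because the bit counts listed above would describe it in fewer than n^2 bits.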

. . . . . .

Counting argument and complexity reformulation

I ≤ nk × nk positions of the minor [k = 3 logn]

I 2 types of uniform minors (0/1)I 2n

2−k2 possibilities for the restI n2k × 2× 2n

2−k2 = 2log n×2×3 log n+1+(n2−9 log2 n) < 2n2 .

I log n bits to specify a column or row: 2k log n bits in totalI one additional bit to specify the type of minor (0/1)I n2 − k2 bits to specify the rest of the matrixI 2k log n+ 1+ (n2 − k2) = 6 log2 n+ 1+ (n2 − 9 log2 n) < n2.

. . . . . .

Counting argument and complexity reformulation

I ≤ nk × nk positions of the minor [k = 3 logn]I 2 types of uniform minors (0/1)

I 2n2−k2 possibilities for the rest

I n2k × 2× 2n2−k2 = 2log n×2×3 log n+1+(n2−9 log2 n) < 2n

2 .

I log n bits to specify a column or row: 2k log n bits in totalI one additional bit to specify the type of minor (0/1)I n2 − k2 bits to specify the rest of the matrixI 2k log n+ 1+ (n2 − k2) = 6 log2 n+ 1+ (n2 − 9 log2 n) < n2.

. . . . . .

Counting argument and complexity reformulation

I ≤ nk × nk positions of the minor [k = 3 logn]I 2 types of uniform minors (0/1)I 2n

2−k2 possibilities for the rest

I n2k × 2× 2n2−k2 = 2log n×2×3 log n+1+(n2−9 log2 n) < 2n

2 .

I log n bits to specify a column or row: 2k log n bits in totalI one additional bit to specify the type of minor (0/1)I n2 − k2 bits to specify the rest of the matrixI 2k log n+ 1+ (n2 − k2) = 6 log2 n+ 1+ (n2 − 9 log2 n) < n2.

. . . . . .

Counting argument and complexity reformulation

I ≤ nk × nk positions of the minor [k = 3 logn]I 2 types of uniform minors (0/1)I 2n

2−k2 possibilities for the restI n2k × 2× 2n

2−k2 =

2log n×2×3 log n+1+(n2−9 log2 n) < 2n2 .

I log n bits to specify a column or row: 2k log n bits in totalI one additional bit to specify the type of minor (0/1)I n2 − k2 bits to specify the rest of the matrixI 2k log n+ 1+ (n2 − k2) = 6 log2 n+ 1+ (n2 − 9 log2 n) < n2.

. . . . . .

Counting argument and complexity reformulation

I ≤ nk × nk positions of the minor [k = 3 logn]I 2 types of uniform minors (0/1)I 2n

2−k2 possibilities for the restI n2k × 2× 2n

2−k2 = 2log n×2×3 log n+1+(n2−9 log2 n) < 2n2 .

I log n bits to specify a column or row: 2k log n bits in totalI one additional bit to specify the type of minor (0/1)I n2 − k2 bits to specify the rest of the matrixI 2k log n+ 1+ (n2 − k2) = 6 log2 n+ 1+ (n2 − 9 log2 n) < n2.

. . . . . .

Counting argument and complexity reformulation

I ≤ nk × nk positions of the minor [k = 3 logn]I 2 types of uniform minors (0/1)I 2n

2−k2 possibilities for the restI n2k × 2× 2n

2−k2 = 2log n×2×3 log n+1+(n2−9 log2 n) < 2n2 .

I log n bits to specify a column or row: 2k log n bits in total

I one additional bit to specify the type of minor (0/1)I n2 − k2 bits to specify the rest of the matrixI 2k log n+ 1+ (n2 − k2) = 6 log2 n+ 1+ (n2 − 9 log2 n) < n2.

. . . . . .

Counting argument and complexity reformulation

I ≤ nk × nk positions of the minor [k = 3 logn]I 2 types of uniform minors (0/1)I 2n

2−k2 possibilities for the restI n2k × 2× 2n

2−k2 = 2log n×2×3 log n+1+(n2−9 log2 n) < 2n2 .

I log n bits to specify a column or row: 2k log n bits in totalI one additional bit to specify the type of minor (0/1)

I n2 − k2 bits to specify the rest of the matrixI 2k log n+ 1+ (n2 − k2) = 6 log2 n+ 1+ (n2 − 9 log2 n) < n2.

. . . . . .

Counting argument and complexity reformulation

I ≤ nk × nk positions of the minor [k = 3 logn]I 2 types of uniform minors (0/1)I 2n

2−k2 possibilities for the restI n2k × 2× 2n

2−k2 = 2log n×2×3 log n+1+(n2−9 log2 n) < 2n2 .

I log n bits to specify a column or row: 2k log n bits in totalI one additional bit to specify the type of minor (0/1)I n2 − k2 bits to specify the rest of the matrix

I 2k log n+ 1+ (n2 − k2) = 6 log2 n+ 1+ (n2 − 9 log2 n) < n2.

. . . . . .

Counting argument and complexity reformulation

I ≤ nk × nk positions of the minor [k = 3 logn]I 2 types of uniform minors (0/1)I 2n

2−k2 possibilities for the restI n2k × 2× 2n

2−k2 = 2log n×2×3 log n+1+(n2−9 log2 n) < 2n2 .

I log n bits to specify a column or row: 2k log n bits in totalI one additional bit to specify the type of minor (0/1)I n2 − k2 bits to specify the rest of the matrixI 2k log n+ 1+ (n2 − k2) = 6 log2 n+ 1+ (n2 − 9 log2 n) < n2.

. . . . . .

One-tape Turing machines

- copying an n-bit string on a 1-tape TM requires Ω(n^2) time
- complexity version: if the tape was initially empty to the right of the border, then after n steps the complexity of the zone d cells away from the border is O(n/d):  K(u(t)) ≤ O(n/d), where u(t) is the contents of that zone at step t
- proof idea: border guards in each cell of the border security zone write down the contents of the TM head as it passes; each of these records is enough to reconstruct u(t), so its length must be Ω(K(u(t))), while the sum of the lengths of all records does not exceed the running time (a quantitative version follows this slide)
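One way to make the “border guards” bookkeeping quantitative (my reconstruction via crossing sequences; the talk's own accounting may differ):

    % Let c_i be the number of times the head crosses the cell at distance i from
    % the border, i = 1, ..., d. Every crossing takes at least one step, so
    \[
      c_1 + c_2 + \dots + c_d \;\le\; n ,
    \]
    % hence some i <= d has c_i <= n/d. The record kept at that cell (the machine's
    % state at each crossing, O(1) bits per crossing) determines the computation to
    % the right of the cell, in particular u(t); therefore
    \[
      K(u(t)) \;\le\; O(c_i) + O(\log n) \;\le\; O(n/d) + O(\log n).
    \]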

. . . . . .


Everywhere complex sequences

- A random sequence has n-bit prefixes of complexity n
- but some factors (substrings) have small complexity
- Levin: there exist everywhere complex sequences: every n-bit substring has complexity at least 0.99n − O(1)
- Combinatorial equivalent: let F be a set of strings that contains at most 2^(0.99n) strings of length n. Then there is a sequence ω such that all sufficiently long substrings of ω are not in F.
- the combinatorial and complexity proofs are not just translations of each other (Lovász lemma; Rumyantsev, Miller, Muchnik)

. . . . . .

Everywhere complex sequences

I Random sequence has n-bit prefix of complexity n

I but some factors (substrings) have small complexityI Levin: there exist everywhere complex sequences: everyn-bit substring has complexity 0.99n−O(1)

I Combinatorial equivalent: Let F be a set of strings that hasat most 20.99n strings of length n. Then there is a sequenceω s.t. all sufficiently long substrings of ω are not in F.

I combinatorial and complexity proofs not just translations ofeach other (Lovasz lemma, Rumyantsev, Miller, Muchnik)

. . . . . .

Everywhere complex sequences

I Random sequence has n-bit prefix of complexity nI but some factors (substrings) have small complexity

I Levin: there exist everywhere complex sequences: everyn-bit substring has complexity 0.99n−O(1)

I Combinatorial equivalent: Let F be a set of strings that hasat most 20.99n strings of length n. Then there is a sequenceω s.t. all sufficiently long substrings of ω are not in F.

I combinatorial and complexity proofs not just translations ofeach other (Lovasz lemma, Rumyantsev, Miller, Muchnik)

. . . . . .

Everywhere complex sequences

I Random sequence has n-bit prefix of complexity nI but some factors (substrings) have small complexityI Levin: there exist everywhere complex sequences: everyn-bit substring has complexity 0.99n−O(1)

I Combinatorial equivalent: Let F be a set of strings that hasat most 20.99n strings of length n. Then there is a sequenceω s.t. all sufficiently long substrings of ω are not in F.

I combinatorial and complexity proofs not just translations ofeach other (Lovasz lemma, Rumyantsev, Miller, Muchnik)

. . . . . .

Everywhere complex sequences

I Random sequence has n-bit prefix of complexity nI but some factors (substrings) have small complexityI Levin: there exist everywhere complex sequences: everyn-bit substring has complexity 0.99n−O(1)

I Combinatorial equivalent: Let F be a set of strings that hasat most 20.99n strings of length n. Then there is a sequenceω s.t. all sufficiently long substrings of ω are not in F.

I combinatorial and complexity proofs not just translations ofeach other (Lovasz lemma, Rumyantsev, Miller, Muchnik)

. . . . . .

Everywhere complex sequences

I Random sequence has n-bit prefix of complexity nI but some factors (substrings) have small complexityI Levin: there exist everywhere complex sequences: everyn-bit substring has complexity 0.99n−O(1)

I Combinatorial equivalent: Let F be a set of strings that hasat most 20.99n strings of length n. Then there is a sequenceω s.t. all sufficiently long substrings of ω are not in F.

I combinatorial and complexity proofs not just translations ofeach other (Lovasz lemma, Rumyantsev, Miller, Muchnik)

. . . . . .

Gilbert-Varshamov complexity bound

- coding theory: how many n-bit strings x1, …, xk can one find if the Hamming distance between every two of them is at least d?
- lower bound (Gilbert–Varshamov); the classical statement is recalled after this slide
- then < d/2 changed bits are harmless
- but bit insertions or deletions could be harmful
- general requirement: C(xi | xj) ≥ d
- generalization of the GV bound: a d-separated family of size Ω(2^(n − d))
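For reference, the classical bound mentioned in the second item (standard textbook form, not spelled out on the slide): greedily picking a string and discarding everything within Hamming distance d − 1 of it shows that there is a family of pairwise d-separated n-bit strings of size

    \[
      M \;\ge\; \frac{2^{\,n}}{\sum_{i=0}^{d-1}\binom{n}{i}} ,
    \]

since each greedy step discards at most \sum_{i<d} \binom{n}{i} strings and the process can stop only after all 2^n strings are gone.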

. . . . . .

Gilbert-Varshamov complexity bound

I coding theory: how many n-bit strings x1, . . . , xk one canfind if Hamming distance between every two is at least d

I lower bound (Gilbert–Varshamov)I then < d/2 changed bits are harmlessI but bit insertion or deletions could beI general requirement: C(xi|xj) ≥ dI generalization of GV bound: d-separated family of size

Ω(2n−d)

. . . . . .

Gilbert-Varshamov complexity bound

I coding theory: how many n-bit strings x1, . . . , xk one canfind if Hamming distance between every two is at least d

I lower bound (Gilbert–Varshamov)

I then < d/2 changed bits are harmlessI but bit insertion or deletions could beI general requirement: C(xi|xj) ≥ dI generalization of GV bound: d-separated family of size

Ω(2n−d)

. . . . . .

Gilbert-Varshamov complexity bound

I coding theory: how many n-bit strings x1, . . . , xk one canfind if Hamming distance between every two is at least d

I lower bound (Gilbert–Varshamov)I then < d/2 changed bits are harmless

I but bit insertion or deletions could beI general requirement: C(xi|xj) ≥ dI generalization of GV bound: d-separated family of size

Ω(2n−d)

. . . . . .

Gilbert-Varshamov complexity bound

I coding theory: how many n-bit strings x1, . . . , xk one canfind if Hamming distance between every two is at least d

I lower bound (Gilbert–Varshamov)I then < d/2 changed bits are harmlessI but bit insertion or deletions could be

I general requirement: C(xi|xj) ≥ dI generalization of GV bound: d-separated family of size

Ω(2n−d)

. . . . . .

Gilbert-Varshamov complexity bound

I coding theory: how many n-bit strings x1, . . . , xk one canfind if Hamming distance between every two is at least d

I lower bound (Gilbert–Varshamov)I then < d/2 changed bits are harmlessI but bit insertion or deletions could beI general requirement: C(xi|xj) ≥ d

I generalization of GV bound: d-separated family of sizeΩ(2n−d)

. . . . . .

Gilbert-Varshamov complexity bound

I coding theory: how many n-bit strings x1, . . . , xk one canfind if Hamming distance between every two is at least d

I lower bound (Gilbert–Varshamov)I then < d/2 changed bits are harmlessI but bit insertion or deletions could beI general requirement: C(xi|xj) ≥ dI generalization of GV bound: d-separated family of size

Ω(2n−d)

. . . . . .

Inequalities for complexities and combinatorial interpretation

- C(x, y) ≤ C(x) + C(y|x) + O(log)
- C(x, y) ≥ C(x) + C(y|x) + O(log)
- C(x, y) < k + l ⇒ C(x) < k + O(log) or C(y|x) < l + O(log)
- combinatorial counterpart: every set A of size < 2^(k + l) can be split into two parts A = A1 ∪ A2 such that w(A1) ≤ 2^k and h(A2) ≤ 2^l (a sketch of the splitting follows this slide)
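A sketch of the splitting (my reading of the notation: w(A1) is the number of different first coordinates occurring in A1, and h(A2) is the maximal size of a section of A2 over a fixed first coordinate):

    % Put into A_1 the pairs whose x-section is large and let A_2 be the rest:
    \[
      A_x = \{\, y : (x,y) \in A \,\}, \qquad
      A_1 = \{\, (x,y) \in A : |A_x| > 2^{l} \,\}, \qquad
      A_2 = A \setminus A_1 .
    \]
    % Every section of A_2 has at most 2^l elements by construction, and the number
    % of x with |A_x| > 2^l is less than |A| / 2^l < 2^{k+l} / 2^l = 2^k, so A_1
    % involves fewer than 2^k different first coordinates.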

. . . . . .


One more inequality

- 2C(x, y, z) ≤ C(x, y) + C(y, z) + C(x, z)
- V^2 ≤ S1 × S2 × S3 (see the note after this slide)
- Also holds for Shannon entropies; a special case of the Shearer lemma
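Reading V as the volume of a body and S1, S2, S3 as the areas of its three coordinate projections (my interpretation of the second item), the discrete form of this geometric fact is the Loomis–Whitney inequality, the Shearer special case mentioned above:

    % For a finite set A \subseteq X \times Y \times Z with projections
    % A_{XY}, A_{YZ}, A_{XZ} onto the coordinate planes,
    \[
      |A|^{2} \;\le\; |A_{XY}| \cdot |A_{YZ}| \cdot |A_{XZ}| ;
    \]
    % taking logarithms gives the same shape as 2C(x,y,z) \le C(x,y) + C(y,z) + C(x,z).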

. . . . . .


Common information and graph minors

- mutual information: I(a : b) = C(a) + C(b) − C(a, b)
- common information:
- combinatorial version: graph minors
- can the graph be covered by 2^δ minors of size 2^(α − δ) × 2^(β − δ)?

. . . . . .


Almost uniform sets

- nonuniformity = (maximal section) / (average section)
- Theorem: every set of N elements can be represented as a union of polylog(N) sets whose nonuniformity is polylog(N).
- multidimensional version
- how to construct the parts using Kolmogorov complexity: take strings with given complexity bounds
- so simple that it is not clear what the combinatorial translation is
- but a combinatorial argument exists (and gives an even stronger result)

. . . . . .


Shannon coding theorem

- ξ is a random variable; k values, probabilities p1, …, pk
- ξ^N: N independent trials of ξ
- Shannon’s informal question: how many bits are needed to encode a “typical” value of ξ^N?
- Shannon’s answer: NH(ξ), where H(ξ) = p1 log(1/p1) + … + pk log(1/pk)
- the formal statement is a bit complicated
- Complexity version: with high probability the value of ξ^N has complexity close to NH(ξ) (a small numeric example follows this slide)
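A small numeric illustration of the NH(ξ) answer (my own example, not from the talk): a biased coin with probabilities 0.9 and 0.1.

    from math import log2

    def H(probs):
        """Shannon entropy in bits: sum of p * log2(1/p) over the outcomes."""
        return sum(p * log2(1 / p) for p in probs if p > 0)

    p = [0.9, 0.1]      # a biased coin
    N = 1000            # number of independent trials
    print(H(p))         # about 0.469 bits per trial
    print(N * H(p))     # about 469 bits for a typical 1000-trial outcome,
                        # much less than the 1000 bits of a literal encoding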

. . . . . .


Complexity, entropy and group size

- 2C(x, y, z) ≤ C(x, y) + C(y, z) + C(x, z) + O(log)
- The same for entropy: 2H(ξ, η, τ) ≤ H(ξ, η) + H(ξ, τ) + H(η, τ)
- …and even for the sizes of subgroups U, V, W of some finite group G:
  2 log(|G| / |U ∩ V ∩ W|) ≤ log(|G| / |U ∩ V|) + log(|G| / |U ∩ W|) + log(|G| / |V ∩ W|)
- in all three settings the valid inequalities are the same (Romashchenko, Chan, Yeung)
- some of them are quite strange:
  I(a : b) ≤ I(a : b|c) + I(a : b|d) + I(c : d) + I(a : b|e) + I(a : e|b) + I(b : e|a)
- Related to Romashchenko’s theorem: if the last three terms are zero, one can extract the common information of a, b, e.

. . . . . .


Muchnik and Slepian–Wolf

- a, b: two strings
- we look for a program p that maps a to b
- by definition C(p) is at least C(b|a), but it could be higher
- there exist p : a ↦ b that are simple relative to b, e.g., “map everything to b”
- Muchnik’s theorem: the two conditions can be combined: there exists p : a ↦ b such that C(p) ≈ C(b|a) and C(p|b) ≈ 0
- information theory analog: Slepian–Wolf
- a similar technique was developed by Fortnow and Laplante (randomness extractors)
- (Romashchenko, Musatov): how to use explicit extractors and derandomization to get space-bounded versions

. . . . . .


Computability theory: simple sets

- Simple set: an enumerable set with infinite complement such that no algorithm can generate infinitely many elements of the complement
- Construction using Kolmogorov complexity: call a string x “simple” if C(x) ≤ |x|/2; these strings form an enumerable set
- Most strings are not simple ⇒ the complement is infinite
- Let x1, x2, … be a computable sequence of distinct non-simple strings
- We may assume wlog that |xi| > i and therefore C(xi) > i/2
- but to specify xi we need only O(log i) bits, a contradiction
- “The minimal integer that cannot be described in ten English words” (Berry)

. . . . . .


Lower semicomputable random reals

- Σ ai: a computable converging series with rational terms
- is α = Σ ai computable (ε ↦ ε-approximation)?
- not necessarily (Specker’s example; a standard construction is recalled after this slide)
- lower semicomputable reals
- Solovay classification: α ≼ β if an ε-approximation to β can be effectively converted into an O(ε)-approximation to α
- There are maximal elements = random lower semicomputable reals
- = slowly converging series
- modulus of convergence: ε ↦ N(ε) = how many terms are needed for ε-precision
- maximal elements: N(2^(−n)) > BP(n − O(1)), where BP(k) is the maximal integer whose prefix complexity is at most k
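A standard construction behind “not necessarily (Specker example)”, added here for completeness: take a computably enumerable but undecidable set W ⊆ ℕ, enumerated without repetitions as w0, w1, w2, …, and put

    \[
      \alpha \;=\; \sum_{i \in W} 2^{-i} \;=\; \sum_{n \ge 0} 2^{-w_n} .
    \]

The partial sums form a computable increasing sequence of rationals converging to α, so α is lower semicomputable. But α is not computable: if it were, then to decide whether i ∈ W one could compute a rational q with |α − q| < 2^(−(i+2)) and enumerate W until the partial sum exceeds q − 2^(−(i+1)); if i has not appeared by then, it never will, since its later appearance would add 2^(−i) and push the sum above α.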

. . . . . .


Lovász local lemma: constructive proof

I CNF: (a ∨ ¬b ∨ . . .) ∧ (¬c ∨ e ∨ . . .) ∧ . . .

I neighbors: clauses having common variables
I several clauses with k literals in each
I each clause has o(2^k) neighbors
I ⇒ the CNF is satisfiable
I Non-constructive proof: a lower bound for the probability, the Lovász local lemma

I Naïve algorithm: just resample a false clause, while false clauses exist (see the sketch below)

I Recent breakthrough (Moser): with high probability this algorithm terminates quickly

I Explanation: if not, the sequence of resampled clauses would encode the random bits used in resampling, making them compressible
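
Below is a minimal Python sketch of this resampling algorithm in the Moser–Tardos style (not the talk's code; the clause encoding and names are illustrative). Whether it terminates quickly when every k-literal clause has o(2^k) neighbors is exactly the content of Moser's theorem; the log of resampled clauses is what the compressibility argument analyzes.

    import random

    # A literal is a pair (variable index, sign); sign True means the variable
    # itself, sign False means its negation.  A clause is a list of literals.

    def satisfied(clause, assignment):
        return any(assignment[v] == sign for v, sign in clause)

    def resample_until_satisfied(cnf, num_vars, rng=random.Random(0)):
        # start from a uniformly random assignment
        assignment = [rng.random() < 0.5 for _ in range(num_vars)]
        resampled = []                      # log of resampled clauses
        while True:
            bad = [c for c in cnf if not satisfied(c, assignment)]
            if not bad:
                return assignment, resampled
            clause = bad[0]                 # pick some violated clause
            for v, _ in clause:             # resample all of its variables
                assignment[v] = rng.random() < 0.5
            resampled.append(clause)

    # toy instance: (x0 ∨ ¬x1) ∧ (x1 ∨ x2) ∧ (¬x0 ∨ ¬x2)
    cnf = [[(0, True), (1, False)], [(1, True), (2, True)], [(0, False), (2, False)]]
    print(resample_until_satisfied(cnf, 3))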

. . . . . .


Berry, Gödel, Chaitin, Raz

I There are only finitely many strings of complexity < n
I for all strings x (except finitely many) the statement K(x) > n is true

I Can all true statements of this form be provable?
I No; otherwise we could effectively generate a string of complexity > n by enumerating all proofs

I and get Berry’s paradox: for a given n, the first provable statement C(x) > n gives some x of complexity > n that can be described by O(log n) bits

I (Gödel’s theorem in Chaitin’s form): there are only finitely many n such that C(x) > n is provable for some x. (Note that ∃x C(x) > n is always provable!)

I (Gödel’s second theorem, Kritchman–Raz proof): the “unexpected test paradox” (a test will be given next week, but its day won’t be known before the day of the test)

. . . . . .


Hilbert’s 13th problem

I Function of ≥ 3 variables: e.g., a solution of a polynomial of degree 7 as a function of its coefficients

I Is it possible to represent this function as a composition of functions of at most 2 variables?

I Yes, with weird functions (Cantor)
I Yes, even with continuous functions (Kolmogorov, Arnold)
I Circuit version: an explicit function B^n × B^n × B^n → B^n (of polynomial circuit size?) that cannot be represented as a composition of O(1) functions B^n × B^n → B^n. Not known.

I Kolmogorov complexity version: we have three strings a, b, c on a blackboard. We may write (add) a new string if it is simple relative to two of the strings already on the board. Which strings can we obtain in O(1) steps? Only strings of small complexity relative to a, b, c, but not all of them (for random a, b, c)

. . . . . .


Secret sharing

I a secret s and three people; any two of them, working together, can reconstruct the secret, but each one in isolation has no information about it

I assume s ∈ F (a field); take a random a and give the three people a, a + s and a + 2s respectively (assuming 2 ≠ 0 in F); a toy implementation is sketched below

I other secret sharing schemes; how long the shares must be is not well understood

I Kolmogorov complexity setting: for a given s find a, b, c such that

C(s|a), C(s|b), C(s|c) ≈ C(s);   C(s|a,b), C(s|a,c), C(s|b,c) ≈ 0

I Some relations between the Kolmogorov and the traditional settings (Romashchenko, Kaced)
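
A minimal Python sketch of the three-person scheme above (not from the talk; the prime P and the names are illustrative). It works over GF(P) for an odd prime P, so 2 is invertible and any two shares determine s, while any single share is a uniformly random field element and so carries no information about s.

    import random

    P = 2**31 - 1              # an odd prime, so 2 is invertible in the field

    def share(s, rng=random.Random()):
        # shares a, a + s, a + 2s (all modulo P), one per person
        a = rng.randrange(P)
        return [(i, (a + i * s) % P) for i in (0, 1, 2)]

    def reconstruct(share_i, share_j):
        # from a + i*s and a + j*s with i != j, recover s = (y - x) / (j - i)
        (i, x), (j, y) = share_i, share_j
        return ((y - x) * pow(j - i, -1, P)) % P

    s = 123456789
    s0, s1, s2 = share(s)
    assert reconstruct(s0, s1) == s
    assert reconstruct(s0, s2) == s
    assert reconstruct(s1, s2) == s
    print("any two shares recover s:", reconstruct(s1, s2))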

. . . . . .


Quasi-cryptography

I Alice has some information a
I Bob wants to let her know some b
I by sending some message f
I in such a way that Eve gets minimal information about b
I Formally: for given a, b find f such that C(b|a,f) ≈ 0 and C(b|f) → max (a toy special case is sketched below)

I Theorem (Muchnik): it is always possible to have C(b|f) ≈ min(C(b), C(a))

I Full version: Eve knows some c, and we want to send a message of the minimal possible length C(b|a)
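
A toy illustration of one special case (not from the talk; it assumes a is a uniformly random string of the same length as b): the one-time-pad message f = a XOR b. Alice, who knows a, recovers b from f, while f by itself is a uniformly random string independent of b, which matches the flavour of Muchnik's bound.

    import secrets

    def xor_bytes(x, y):
        return bytes(u ^ v for u, v in zip(x, y))

    b = b"the string Bob wants Alice to learn"
    a = secrets.token_bytes(len(b))   # Alice's information: a random string of the same length

    f = xor_bytes(a, b)               # the message Bob sends

    assert xor_bytes(a, f) == b       # Alice, knowing a, recovers b from f
    # f alone is uniformly random and independent of b, so an eavesdropper
    # who sees only f learns nothing about b.
    print(f.hex())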

. . . . . .


Andrej Muchnik (1958-2007)