29 May 2013
What is the information content of an algorithm?
Joachim M. Buhmann
Computer Science Department, ETH Zurich
29 May 2013 Joachim M. Buhmann AofA 2013, Menorca 2
Big Data in Neuroscience
Connectivity-based Cortex Parcellation
Connectivity-based cortex parcellation is a compact description of the data:
Connectivity-based Cortex Parcellation
Pipeline: seed voxels at the cortical boundary → fiber tracking → clustering → cortex parcellation
Biological Question: What is the relationship between the morphology of a cortical region of interest (ROI) and its differences in connectivity?
Tractograms of distinct cortical areas show a variety of connectivity patterns
Roadmap
§ Big data in neuroscience: cortex parcellation
§ Learning and robust optimization
§ Measuring the information content of algorithms
§ Model validation by information theory
§ Graph cut for gene expression analysis
§ Robust sorting
§ Outlook & vision: Low power computing, resilient programming
What is learning?
§ Given: data and a hypothesis class
§ Learning is identifying a set of preferable hypotheses in the hypothesis class!
$C$: hypothesis class; $X \sim P(X)$
[Figure: three clusterings of a 12-node example data set, each marking a different set of preferable hypotheses]
How can we find good models? Informativeness vs. robustness
§ Learning game: the learner receives two instances and is tested on a third instance.
§ Idea: Convert an optimization problem into a coding problem! A set of approximate solutions represents a code word.
§ Accuracy is bounded by generalization capacity => Information-theoretic model validation: find an identifiable generalization set of minimal size!
(JB, ISIT 2010)
How to design optimal generalization sets?
§ Fundamental problem: data often live in much higher-dimensional spaces than solutions!
§ Graphs vs. cuts:
§ How does noise perturb solutions?
=> Measure sensitivity of solution sets to data noise!
Number of graphs on n vertices: $2^{\binom{n}{2}}$; number of cuts: $|C| = 2^n$
[Figure: example 12-node graphs with different cuts]
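The size gap is easy to make concrete (a minimal sketch; n = 12 matches the example graphs on the slides):

```python
from math import comb

n = 12  # number of vertices in the example graphs

num_graphs = 2 ** comb(n, 2)  # one bit per potential edge
num_cuts = 2 ** n             # one bit per vertex: which side of the cut

print(f"graphs: 2^{comb(n, 2)} = {num_graphs}")
print(f"cuts:   2^{n} = {num_cuts}")
```

Already for 12 vertices the data space ($2^{66}$ graphs) dwarfs the solution space ($2^{12}$ cuts).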
Information Theory & Machine Learning
§ IT components:
§ Code book ({strings} = hypothesis class)
§ Noisy channel: 1100110 -> 0100110
§ Decoder: minimize Hamming distance, |0100110 - 1100110| < |0100110 - 1010101|
§ Machine learning elements:
§ Set of generalization sets
§ Noisy optimization problem
§ Decoding by approximate optimization of test instance
0000000 0100101 1000011 1100110 0001111 0101010 1001100 1101001 0010110 0110011 1010101 1110000 0011001 0111100 1011010 1111111
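The minimum-Hamming-distance decoder can be sketched directly on this 16-word code book (a minimal sketch; the function names are mine):

```python
# The 16-word code book from the slide.
CODEBOOK = """0000000 0100101 1000011 1100110 0001111 0101010 1001100 1101001
0010110 0110011 1010101 1110000 0011001 0111100 1011010 1111111""".split()

def hamming(a: str, b: str) -> int:
    """Number of positions where two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(received: str) -> str:
    """Map a (possibly corrupted) word to the nearest codeword."""
    return min(CODEBOOK, key=lambda c: hamming(received, c))

# The slide's example: 1100110 is sent, the channel flips the first bit.
print(decode("0100110"))  # → 1100110
```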
[Figure: a set of generalization sets, shown as clusterings of the 12-node example]
Approximate Optimization
§ Define weights for hypotheses with low costs
§ Total weight of low cost hypotheses (partition fct.)
§ Special case: Approximation set
$c^\star \in \arg\min_{c \in C} R(c, X)$
$w_\beta(c, X) = \exp(-\beta R(c, X)), \quad c \in C(O)$
$Z(X) = \sum_{c \in C(O)} \exp(-\beta R(c, X))$
Approximation set (special case):
$w_\beta(c, X) = \begin{cases} 1, & R(c, X) \le R(c^\star, X) + 1/\beta \\ 0, & \text{otherwise} \end{cases}$
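Both weighting schemes can be sketched for a toy hypothesis class (a minimal sketch; the cost vector and function names are mine):

```python
import numpy as np

def gibbs_weights(costs: np.ndarray, beta: float) -> np.ndarray:
    """w_beta(c, X) = exp(-beta * R(c, X)) for every hypothesis c."""
    return np.exp(-beta * costs)

def partition_function(costs: np.ndarray, beta: float) -> float:
    """Z(X) = sum_c exp(-beta * R(c, X)): total weight of low-cost hypotheses."""
    return float(gibbs_weights(costs, beta).sum())

def approximation_set(costs: np.ndarray, beta: float) -> np.ndarray:
    """Special case: indicator weights for R(c, X) <= R(c*, X) + 1/beta."""
    return (costs <= costs.min() + 1.0 / beta).astype(float)

costs = np.array([0.0, 0.2, 0.9, 1.5])     # hypothetical costs R(c, X)
print(partition_function(costs, beta=2.0))
print(approximation_set(costs, beta=2.0))  # keeps hypotheses within 1/beta = 0.5 of the minimum
```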
Coarsening of hypothesis classes and the two-instance test
§ Quantize hypothesis class by generalization sets
§ Size: partition function
§ Weight overlap models joint approximations
§ Generalization set with minimizer
$X' \sim P(X), \quad X'' \sim P(X)$
$\Delta Z(X', X'') = \sum_{c \in C(O')} w_\beta(c, X')\, w_\beta(c, X'')$
$Z(X') = \sum_{c \in C(O')} w_\beta(c, X')$
Asymptotic equipartition property
§ Condition for cost function R(c,X): Convergence of log partition functions towards free energies
$\mathcal{F}' := \lim_{n\to\infty} \frac{-\log Z(X'^{(n)})}{\log |C(O'^{(n)})|}, \quad \mathcal{F}'' := \lim_{n\to\infty} \frac{-\log Z(X''^{(n)})}{\log |C(O''^{(n)})|}, \quad \Delta\mathcal{F} := \lim_{n\to\infty} \frac{-\log \Delta Z(X'^{(n)}, X''^{(n)})}{\log |C(O''^{(n)})|}$
Jointly typical instances
$A^{(n)}_\epsilon := \Big\{ (X'^{(n)}, X''^{(n)}) \in \mathcal{X}^{(n)} \times \mathcal{X}^{(n)} : \Big|\frac{-\log Z(X'^{(n)})}{\log |C(O'^{(n)})|} - \mathcal{F}'\Big| < \epsilon, \ \Big|\frac{-\log Z(X''^{(n)})}{\log |C(O''^{(n)})|} - \mathcal{F}''\Big| < \epsilon, \ \Big|\frac{-\log \Delta Z(X'^{(n)}, X''^{(n)})}{\log |C(O''^{(n)})|} - \Delta\mathcal{F}\Big| < \epsilon \Big\}$
Generalization sets as symbols for coding
§ Set of generalization sets
§ How to generate sets of generalization sets? Random transformations
§ Cover hypothesis class densely, but identifiably!
Transformation set: $\mathbb{T} = \{\tau : R(c, X) = R(\tau \circ c, \tau \circ X)\}$ acting on the hypothesis class $C(O')$
[Figure: a generalization set and its image under a random transformation of the 12-node example]
Validating spectral clusterings
§ Graph representation: vertices denote objects, edges express (dis)similarities
§ Hypothesis class: all 3-cuts of a graph
[Figure: adjacency matrices and two example 3-cuts of the 12-node graph]
Code problem generation for Graph Cut
[Figure: M graph cut code problems generated by random transformations of the instance matrix, and a graph cut test problem decoded against them]
Coding with Graph Cut approximation sets
[Figure: sender, problem generator (PG), and receiver; the PG defines a set of code problems $\{\tau_1 \circ X', \ldots, \tau_M \circ X'\}$, and sender and receiver both evaluate the costs $R(\cdot, X')$]
Communication by approximate Graph Cuts
[Figure: sender, problem generator, and receiver; the sender picks $\tau_s$, and the task is to estimate the coding error]
1. Sender sends a permutation index τs to problem generator.
2. Problem generator sends a new problem with permuted indices to receiver without revealing τs.
3. Receiver identifies the permutation by comparing generalization sets.
$X', X'' \sim P(X)$; the problem generator sends $\tilde X = \tau_s \circ X''$, and the receiver approximately minimizes $R(\cdot, \tau_s \circ X'')$, returning the minimizer $c^\star(\tilde X)$ from which it estimates $\hat\tau$.
[Figure: the original instance and the permuted instance $\tilde X^{(2)} = \tau_s \circ X^{(2)}$ with minimizer $c^\star(\tilde X^{(2)})$]
Communication Process
§ Receiver compares sets of hypothesis weights
§ Decoding by maximizing weight overlap: $\hat\tau := \arg\max_\tau \sum_{c} w_\beta(c, \tau \circ X')\, w_\beta(c, \tau_s \circ X'')$
§ Error event: $\hat\tau \neq \tau_s$
§ Calculate when $\lim_{n\to\infty} P(\hat\tau \neq \tau_s \mid \tau_s) = 0$
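With approximation-set (0/1) weights, the decoding rule reduces to counting overlapping hypotheses. A minimal sketch (the six-hypothesis indicator vectors and names are mine, not from the talk):

```python
import numpy as np

def overlap(w1: np.ndarray, w2: np.ndarray) -> float:
    """Weight overlap: sum over c of w_beta(c, tau_j . X') * w_beta(c, X~)."""
    return float((w1 * w2).sum())

def decode(sender_sets: list, receiver_set: np.ndarray) -> int:
    """tau_hat := argmax over j of the overlap with the j-th transformed sender set."""
    overlaps = [overlap(w, receiver_set) for w in sender_sets]
    return int(np.argmax(overlaps))

# Hypothetical 0/1 approximation sets over 6 hypotheses, one per transformation.
sender_sets = [np.array([1, 1, 0, 0, 0, 0]),
               np.array([0, 0, 1, 1, 0, 0]),
               np.array([0, 0, 0, 0, 1, 1])]
receiver_set = np.array([0, 0, 1, 1, 1, 0])  # a noisy version of set 1

print(decode(sender_sets, receiver_set))  # → 1
```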
Error Probability
§ Estimate the error given random transformations $\tau \in \mathbb{T}$
$\Delta Z_j := \sum_{c \in C(X'')} w_\beta(c, \tau_j \circ X')\, w_\beta(c, \tau_s \circ X'')$
$P(\hat\tau \neq \tau_s \mid \tau_s) = P\big(\max_{j \neq s} \Delta Z_j > \Delta Z_s \,\big|\, \tau_s\big)$
$\le \sum_{j \neq s} P\big(\Delta Z_j > \Delta Z_s \,\big|\, \tau_s\big)$ (union bound)
$\le M\, P\big(\Delta Z_{\neq s} > \Delta Z_s \,\big|\, \tau_s\big)$
$\le M\, \mathbb{E}_{X', X''}\, \mathbb{E}_{\tau \neq s}\, \frac{\Delta Z_{\neq s}}{\Delta Z_s}$ (Markov ineq.)
Decoupling of generalization sets
§ Intersecting an approximation set with a randomly transformed one yields decoupling!
$\mathbb{E}_\tau \Delta Z_\tau = \mathbb{E}_\tau \sum_{c \in C(X'')} w_\beta(c, \tau \circ X')\, w_\beta(c, \tau_s \circ X'')$
$= \sum_{c \in C(X'')} w_\beta(c, \tau_s \circ X'')\, \frac{1}{|\mathbb{T}|} \sum_{\tau \in \mathbb{T}} w_\beta(\tau^{-1} \circ c, X')$
$= \frac{1}{|\mathbb{T}|}\, Z(X'')\, Z(X')$
Generalization capacity from typicality
§ Asymptotic error free communication is possible for
§ Interpretation: “mutual information” > rate log M
$\lim_{n\to\infty} P(\hat\tau \neq \tau_s \mid \tau_s) = 0$
$P(\hat\tau \neq \tau_s \mid \tau_s) \le M\, \mathbb{E}_{X', X''}\, \frac{Z(X')\, Z(X'')}{|\mathbb{T}|\, \Delta Z_s(X', X'')}$
$\frac{M\, Z(X')\, Z(X'')}{|\mathbb{T}|\, \Delta Z_s(X', X'')} = \exp\Big(-\big(\log\tfrac{|\mathbb{T}|}{Z'} + \log\tfrac{|C''|}{Z''} - \log\tfrac{|C''|}{\Delta Z_s} - \log M\big)\Big) \approx \frac{M}{|\mathbb{T}|} \exp\big(-\mathcal{F}' \log|C'| - (\mathcal{F}'' - \Delta\mathcal{F}) \log|C''|\big)$
Algorithm selection by maximizing generalization capacity
sender
problem generator
receiver
Communication channel
§ Optimize the communication channel w.r.t. precision β, cost function R(.,.) and algorithm
§ Maximize “Mutual Information”
[Figure: communication channel; the sender transmits $\tau_s$, the receiver approximately optimizes $R(\cdot, \tau_s \circ X^{(2)})$ with $X^{(1)}, X^{(2)} \sim P(X)$ and outputs $\hat\tau$]
ASC for the binary channel consistent with Shannon information theory
Channel capacity of the BSC
§ Hypothesis class: set of binary strings $\xi', \xi'' \in \{-1, +1\}^n$
§ Costs of string $s$: Hamming distance $R(s, \xi') = \sum_{i=1}^{n} \mathbb{I}\{s_i \neq \xi'_i\}$
§ Communication: $\xi' \to \xi''$ with empirical flip rate $\hat\epsilon = \frac{1}{n}\big|\{i : \xi'_i \neq \xi''_i\}\big|$
§ Mutual information: $I_\beta = \ln 2 + (1 - \hat\epsilon)\ln\cosh\beta - \ln(\cosh\beta + 1)$
At the maximizing $\beta^*$, i.e. for $\frac{dI_\beta}{d\beta} = 0$:
$I_{\beta^*} = \ln 2 + \hat\epsilon \ln\hat\epsilon + (1 - \hat\epsilon)\ln(1 - \hat\epsilon)$
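This consistency claim can be checked numerically: maximizing $I_\beta$ over $\beta$ reproduces the Shannon capacity of the BSC. A minimal sketch (the grid search over $\beta$ is mine):

```python
import math

def I_beta(beta: float, eps: float) -> float:
    """Mutual information of the ASC scheme at inverse temperature beta."""
    return math.log(2) + (1 - eps) * math.log(math.cosh(beta)) \
           - math.log(math.cosh(beta) + 1)

def shannon_capacity(eps: float) -> float:
    """BSC capacity in nats: ln 2 + eps*ln(eps) + (1-eps)*ln(1-eps)."""
    return math.log(2) + eps * math.log(eps) + (1 - eps) * math.log(1 - eps)

eps = 0.1
best = max(I_beta(b / 1000, eps) for b in range(1, 10000))  # grid over beta in (0, 10]
print(best, shannon_capacity(eps))
```

For $\hat\epsilon = 0.1$ both expressions evaluate to ≈ 0.368 nats.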
Clustering of gene expression data
§ Spectral clustering of correlations in yeast cell cycle data
§ Pairwise clustering (PC), normalized cut (NCut), correlation clustering (CC), adaptive ratio cut (ARC), dominant set clustering (DS), for different distance measures
Yeast gene expression data: Eisen et al. (1998) PNAS 95,14863
Validating relational clustering costs
§ Cost function for pairwise clustering
§ Normalized cut
$R^{\mathrm{pc}}(c, X) = -\sum_{k=1}^{K} |G_k| \sum_{(i,j) \in E_{kk}} \frac{X_{ij}}{|E_{kk}|}$
Defs.: $G_k = \{i : c(i) = k\}$, $E_{kl} = \{(i,j) : c(i) = k \wedge c(j) = l\}$
$R^{\mathrm{nc}}(c; W) = \sum_{v \le k} \frac{\mathrm{cut}(G_v, V \setminus G_v)}{\mathrm{assoc}(G_v, V)}$
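The normalized-cut cost can be computed directly from a weight matrix; a minimal sketch (the 4-node example matrix is mine, with cut and assoc taken in their standard meanings):

```python
import numpy as np

def normalized_cut(W: np.ndarray, labels: np.ndarray) -> float:
    """R_nc(c; W) = sum over clusters of cut(G_v, V \\ G_v) / assoc(G_v, V)."""
    total = 0.0
    for v in np.unique(labels):
        inside = labels == v
        cut = W[inside][:, ~inside].sum()  # weight leaving the cluster
        assoc = W[inside].sum()            # weight incident to the cluster
        total += cut / assoc
    return total

# Two dense 2-node blocks, weakly linked: the "natural" 2-cut scores low.
W = np.array([[0, 5, 1, 0],
              [5, 0, 0, 1],
              [1, 0, 0, 5],
              [0, 1, 5, 0]], dtype=float)
good = normalized_cut(W, np.array([0, 0, 1, 1]))
bad = normalized_cut(W, np.array([0, 1, 0, 1]))
print(good, bad)  # the natural cut has the smaller cost
```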
§ Shifted correlation clustering: the shift s balances cluster sizes and should be adapted
§ Adaptive ratio cut: the exponent interpolates between arithmetic and harmonic normalization to balance clusters
$R(c, X, K, s) = \frac{1}{2} \sum_{1 \le u \le K} \sum_{(i,j) \in E_{uu}} \big(|X_{ij} + s| - X_{ij} - s\big) + \frac{1}{2} \sum_{1 \le u \le K} \sum_{1 \le v < u} \sum_{(i,j) \in E_{uv}} \big(|X_{ij} + s| + X_{ij} + s\big)$
$R(c, X) = \sum_{\alpha=1}^{K} \sum_{\beta=\alpha+1}^{K} \mathrm{cut}(V_\alpha, V_\beta) \Big(|V_\alpha|^{\frac{1}{p}} + |V_\beta|^{\frac{1}{p}}\Big)^{-p}, \quad p \in [-1, 1]$
Validating Dominant Set clustering
§ Dominant set clustering employs a replicator dynamics from evolutionary game theory to derive a partition
§ The replicator dynamics is applied iteratively to identify one cluster after another (Pavan & Pelillo, PAMI 2007)
$x_i(t+1) = x_i(t)\, \frac{(A x(t))_i}{x(t)^T A x(t)}, \quad x \in \text{simplex of } \mathbb{R}^n$
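The replicator update can be sketched in a few lines (a minimal sketch; the affinity matrix, starting point, and fixed iteration count are mine):

```python
import numpy as np

def replicator(A: np.ndarray, steps: int = 200) -> np.ndarray:
    """Iterate x_i(t+1) = x_i(t) * (A x)_i / (x^T A x) on the simplex."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)  # start at the simplex barycenter
    for _ in range(steps):
        Ax = A @ x
        x = x * Ax / (x @ Ax)
    return x

# Hypothetical affinities: objects 0-2 form a tight group, object 3 is an outlier.
A = np.array([[0.0, 1.0, 1.0, 0.1],
              [1.0, 0.0, 1.0, 0.1],
              [1.0, 1.0, 0.0, 0.1],
              [0.1, 0.1, 0.1, 0.0]])
x = replicator(A)
print(np.round(x, 3))
```

Iterating from the barycenter, the mass concentrates on the tightly coupled objects {0, 1, 2}, which form the dominant set.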
Approximation Capacity of Yeast Gene Expression Data
§ Gene expression data of Eisen et al. (1998) PNAS 95, 14863
§ Pairwise clustering (PC) vs. dominant set clustering (DS) for different distance measures
Influence of metric on AC
§ Path-based distance
§ Correlation dissimilarity
§ Euclidean distance
Consistency of clusterings
CC vs. PC (correlation distance):
§ PC & CC cluster objects (i, j) together
§ PC & CC cluster objects (i, j) separately
§ PC clusters objects (i, j) together, CC clusters them separately
§ PC clusters objects (i, j) separately, CC clusters them together

CC vs. PC (path distance):
§ CC & PC cluster objects (i, j) together
§ CC & PC cluster objects (i, j) separately
§ CC clusters objects (i, j) together, PC clusters them separately
§ CC clusters objects (i, j) separately, PC clusters them together
Approximate sorting
§ Among all permutations π, find the one that orders the items best!
(This cost function counts the number of disagreements between the provided pairwise comparisons and the pairwise comparisons induced by a proposal ranking from the hypothesis class.)
[Figure: matrix of observed pairwise ">" comparisons]
$X = (x_{ij})$
$R^{\mathrm{Sorting}}(\pi \mid X) = \sum_{i,j} x_{ij}\, \mathbb{I}\{\pi_i > \pi_j\}$
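The cost can be computed directly; a minimal sketch assuming the convention that $x_{ij} = 1$ records an observed comparison "i before j", so the cost counts violated comparisons (the convention and example matrix are mine):

```python
import numpy as np

def sorting_cost(pi: np.ndarray, X: np.ndarray) -> int:
    """R_Sorting(pi | X) = sum_ij x_ij * I{pi_i > pi_j}: number of observed
    comparisons that the proposal ranking pi disagrees with."""
    disagree = pi[:, None] > pi[None, :]  # I{pi_i > pi_j} as a boolean matrix
    return int((X * disagree).sum())

# Hypothetical comparisons for 3 items: x_ij = 1 means "i observed before j".
X = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])  # consistent with the order 0 < 1 < 2
print(sorting_cost(np.array([0, 1, 2]), X))  # → 0 (agrees with all comparisons)
print(sorting_cost(np.array([2, 1, 0]), X))  # → 3 (the reversal violates all three)
```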
Approximate Sorting
§ Approximately minimize the cost function $R^{\mathrm{Sorting}}(\pi \mid X) = \sum_{i,j} x_{ij}\, \mathbb{I}\{\pi_i > \pi_j\}$
§ Approximation capacity:
$\max_\beta I_\beta = \frac{1}{n}\log|\mathbb{T}| + \max_\beta \left\{ \frac{1}{n}\sum_{i=1}^{n} \log \sum_{k=1}^{n} \exp\big(-\beta\,(E^{(1)}_{ik} + E^{(2)}_{ik})\big) - \frac{1}{n}\sum_{i=1}^{n} \log\left(\sum_{k=1}^{n} \exp\big(-\beta E^{(1)}_{ik}\big) \sum_{k'=1}^{n} \exp\big(-\beta E^{(2)}_{ik'}\big)\right) \right\}$
§ Approximate sorting algorithm: Gibbs sampler with optimal β* or deterministic annealing up to β*
Prediction Performance of ASC
§ Experiments on synthetic data: preferences are derived from noise-contaminated rank values (n = 50)
Robust chess rankings
§ Tournament data from the German chess league, years 2009-2011
Prediction error of chess ranks:

Method                 | Absolute error | Relative error
Joint global minimizer | 0.29           | 14%
ASC                    | 0.14           | 6.8%
Dynamics of sorting algorithms (joint work with L.M. Busse & M.H. Chehreghani)
§ Information concentration of sorting algorithms: run-time parameterizes the size of the approximation sets => coupling of computational to statistical complexity
§ Cardinality of the approximation set is a function of the noise level in the data
Pareto-optimality of sorting algorithms
Performance vs. robustness
§ Experimental setting: array size n = 50, noise ~ Ber(0.1)
Approximation set of selection sort
[Figure: SelectionSort algorithm and its shrinking approximation sets $C^{\mathrm{SelSort}}_{t}$ at t = 0, 1, 2, 3]
$|\Delta C^{\mathrm{SelSort}}_{t}| = (n - t)!$
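The count above can be tabulated directly: once selection sort has fixed the first t positions, the (n − t)! orderings of the remaining items are still consistent with its output so far. A minimal sketch:

```python
from math import factorial

def selection_sort_set_size(n: int, t: int) -> int:
    """|Delta C_t^SelSort| = (n - t)!: permutations consistent with t completed steps."""
    return factorial(n - t)

n = 5
for t in range(n + 1):
    print(t, selection_sort_set_size(n, t))
# The set shrinks from n! = 120 down to a single permutation.
```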
Information capacity of MergeSort vs. BubbleSort (Ludwig Busse 2012, Diss. ETH #20600)
§ Experimental setting: sort n=500 items
Conclusion
§ Quantization: noise quantizes mathematical structures (hypothesis classes) => symbols
§ These symbols can be used for coding!
§ An optimal error-free coding scheme determines the approximation capacity of a model class.
⇒ Bounds for robust optimization.
⇒ Quantization of the hypothesis class measures structure-specific information in the data.
Future Work
§ Generalization: replace approximation sets based on cost functions by smoothed outputs of algorithms (“smoothed generalization”)
§ Model reduction in dynamical systems: quantize sets of ODEs or PDEs (systems biology)
§ Relate statistical complexity, i.e. the approximation capacity, to algorithmic or computational complexity.
Low-Energy Architecture Trends
§ Novel low-power architectures operate near the transistor threshold voltage (NTV), e.g., Intel Claremont: 1.5 mW @ 10 MHz (x86)
§ NTV promises 10x more energy efficiency at 10x more parallelism!
§ 10^5 times more soft errors (bits flip stochastically)
§ Hard to correct in hardware → expose to the programmer?
source: Intel
@ Thorsten Höffler
Resilient Programming
§ Distorted computations: some parts of a program still work when distorted, e.g., Conjugate Gradient while computing the new direction
§ Others cannot tolerate distortion, e.g., the target address for a jump-table
§ NTV is very intriguing for large-scale computing
§ Interesting model for developing algorithms, programming systems, and architectures
Source: MIT, Christine Daniloff
@ Thorsten Höffler