29 May 2013
What is the information content of an algorithm?
Joachim M. Buhmann
Computer Science Department, ETH Zurich
29 May 2013 Joachim M. Buhmann AofA 2013, Menorca 2
Big Data in Neuroscience
Connectivity-based Cortex Parcellation
Connectivity-based cortex parcellation is a compact description of the data:
Connectivity-based Cortex Parcellation
Pipeline: seed voxels at the cortical boundary → fiber tracking → clustering → cortex parcellation
Biological Question: What is the relationship between the morphology of a cortical region of interest (ROI) and its differences in connectivity?
Tractograms of distinct cortical areas show a variety of connectivity patterns
Roadmap
§ Big data in neuroscience: cortex parcellation
§ Learning and robust optimization
§ Measuring the information content of algorithms
§ Model validation by information theory
§ Graph cut for gene expression analysis
§ Robust sorting
§ Outlook & vision: Low power computing, resilient programming
What is learning?
§ Given: data and a hypothesis class
§ Learning is identifying a set of preferable hypotheses in the hypothesis class!
$C$: hypothesis class; $X \sim P(X)$
[Figure: three clusterings of a 12-node example data set, each marking a different set of preferable hypotheses]
How can we find good models? Informativeness vs. robustness
§ Learning game: the learner receives two instances and is tested on a third instance.
§ Idea: Convert an optimization problem into a coding problem! A set of approximate solutions represents a code word.
§ Accuracy is bounded by generalization capacity => Information-theoretic model validation: find an identifiable generalization set of minimal size!
(JB, ISIT 2010)
How to design optimal generalization sets?
§ Fundamental problem: data often live in much higher-dimensional spaces than solutions!
§ Graphs vs. cuts:
§ How does noise perturb solutions?
=> Measure sensitivity of solution sets to data noise!
Number of graphs on n vertices: $2^{\binom{n}{2}}$; number of cuts: $|C| = 2^n$
[Figure: example 12-node graphs with different cuts]
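The size gap is easy to make concrete (a minimal sketch; n = 12 matches the example graphs on the slides):

```python
from math import comb

n = 12  # number of vertices in the example graphs

num_graphs = 2 ** comb(n, 2)  # one bit per potential edge
num_cuts = 2 ** n             # one bit per vertex: which side of the cut

print(f"graphs: 2^{comb(n, 2)} = {num_graphs}")
print(f"cuts:   2^{n} = {num_cuts}")
```

Already for 12 vertices the data space ($2^{66}$ graphs) dwarfs the solution space ($2^{12}$ cuts).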
Information Theory & Machine Learning
§ IT components:
§ Code book ({strings} = hypothesis class)
§ Noisy channel: 1100110 -> 0100110
§ Decoder: minimize Hamming distance, |0100110 - 1100110| < |0100110 - 1010101|
§ Machine learning elements:
§ Set of generalization sets
§ Noisy optimization problem
§ Decoding by approximate optimization of test instance
0000000 0100101 1000011 1100110 0001111 0101010 1001100 1101001 0010110 0110011 1010101 1110000 0011001 0111100 1011010 1111111
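The minimum-Hamming-distance decoder can be sketched directly on this 16-word code book (a minimal sketch; the function names are mine):

```python
# The 16-word code book from the slide.
CODEBOOK = """0000000 0100101 1000011 1100110 0001111 0101010 1001100 1101001
0010110 0110011 1010101 1110000 0011001 0111100 1011010 1111111""".split()

def hamming(a: str, b: str) -> int:
    """Number of positions where two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(received: str) -> str:
    """Map a (possibly corrupted) word to the nearest codeword."""
    return min(CODEBOOK, key=lambda c: hamming(received, c))

# The slide's example: 1100110 is sent, the channel flips the first bit.
print(decode("0100110"))  # → 1100110
```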
[Figure: a set of generalization sets, shown as clusterings of the 12-node example]
Approximate Optimization
§ Define weights for hypotheses with low costs
§ Total weight of low cost hypotheses (partition fct.)
§ Special case: Approximation set
$c^\star \in \arg\min_{c \in C} R(c, X)$
$w_\beta(c, X) = \exp(-\beta R(c, X)), \quad c \in C(O)$
$Z(X) = \sum_{c \in C(O)} \exp(-\beta R(c, X))$
Approximation set (special case):
$w_\beta(c, X) = \begin{cases} 1, & R(c, X) \le R(c^\star, X) + 1/\beta \\ 0, & \text{otherwise} \end{cases}$
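Both weighting schemes can be sketched for a toy hypothesis class (a minimal sketch; the cost vector and function names are mine):

```python
import numpy as np

def gibbs_weights(costs: np.ndarray, beta: float) -> np.ndarray:
    """w_beta(c, X) = exp(-beta * R(c, X)) for every hypothesis c."""
    return np.exp(-beta * costs)

def partition_function(costs: np.ndarray, beta: float) -> float:
    """Z(X) = sum_c exp(-beta * R(c, X)): total weight of low-cost hypotheses."""
    return float(gibbs_weights(costs, beta).sum())

def approximation_set(costs: np.ndarray, beta: float) -> np.ndarray:
    """Special case: indicator weights for R(c, X) <= R(c*, X) + 1/beta."""
    return (costs <= costs.min() + 1.0 / beta).astype(float)

costs = np.array([0.0, 0.2, 0.9, 1.5])     # hypothetical costs R(c, X)
print(partition_function(costs, beta=2.0))
print(approximation_set(costs, beta=2.0))  # keeps hypotheses within 1/beta = 0.5 of the minimum
```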
Coarsening of hypothesis classes and the two-instance test
§ Quantize hypothesis class by generalization sets
§ Size: partition function
§ Weight overlap models joint approximations
§ Generalization set with minimizer
$X' \sim P(X), \quad X'' \sim P(X)$
$\Delta Z(X', X'') = \sum_{c \in C(O')} w_\beta(c, X')\, w_\beta(c, X'')$
$Z(X') = \sum_{c \in C(O')} w_\beta(c, X')$
Asymptotic equipartition property
§ Condition for cost function R(c,X): Convergence of log partition functions towards free energies
$\mathcal{F}' := \lim_{n\to\infty} \frac{-\log Z(X'^{(n)})}{\log |C(O'^{(n)})|}, \quad \mathcal{F}'' := \lim_{n\to\infty} \frac{-\log Z(X''^{(n)})}{\log |C(O''^{(n)})|}, \quad \Delta\mathcal{F} := \lim_{n\to\infty} \frac{-\log \Delta Z(X'^{(n)}, X''^{(n)})}{\log |C(O''^{(n)})|}$
Jointly typical instances
$A^{(n)}_\epsilon := \Big\{ (X'^{(n)}, X''^{(n)}) \in \mathcal{X}^{(n)} \times \mathcal{X}^{(n)} : \Big|\frac{-\log Z(X'^{(n)})}{\log |C(O'^{(n)})|} - \mathcal{F}'\Big| < \epsilon, \ \Big|\frac{-\log Z(X''^{(n)})}{\log |C(O''^{(n)})|} - \mathcal{F}''\Big| < \epsilon, \ \Big|\frac{-\log \Delta Z(X'^{(n)}, X''^{(n)})}{\log |C(O''^{(n)})|} - \Delta\mathcal{F}\Big| < \epsilon \Big\}$
Generalization sets as symbols for coding
§ Set of generalization sets
§ How to generate sets of generalization sets? Random transformations
§ Cover hypothesis class densely, but identifiably!
Transformation set: $\mathbb{T} = \{\tau : R(c, X) = R(\tau \circ c, \tau \circ X)\}$ acting on the hypothesis class $C(O')$
[Figure: a generalization set and its image under a random transformation of the 12-node example]
Validating spectral clusterings
§ Graph representation: vertices denote objects, edges express (dis)similarities
§ Hypothesis class: all 3-cuts of a graph
[Figure: adjacency matrices and two example 3-cuts of the 12-node graph]
Code problem generation for Graph Cut
[Figure: M graph cut code problems generated by random transformations of the instance matrix, and a graph cut test problem decoded against them]
Coding with Graph Cut approximation sets
[Figure: sender, problem generator (PG), and receiver; the PG defines a set of code problems $\{\tau_1 \circ X', \ldots, \tau_M \circ X'\}$, and sender and receiver both evaluate the costs $R(\cdot, X')$]
Communication by approximate Graph Cuts
[Figure: sender, problem generator, and receiver; the sender picks $\tau_s$, and the task is to estimate the coding error]
1. Sender sends a permutation index τs to problem generator.
2. Problem generator sends a new problem with permuted indices to receiver without revealing τs.
3. Receiver identifies the permutation by comparing generalization sets.
$X', X'' \sim P(X)$; the problem generator sends $\tilde X = \tau_s \circ X''$, and the receiver approximately minimizes $R(\cdot, \tau_s \circ X'')$, returning the minimizer $c^\star(\tilde X)$ from which it estimates $\hat\tau$.
[Figure: the original instance and the permuted instance $\tilde X^{(2)} = \tau_s \circ X^{(2)}$ with minimizer $c^\star(\tilde X^{(2)})$]
Communication Process
§ Receiver compares sets of hypothesis weights
§ Decoding by maximizing weight overlap: $\hat\tau := \arg\max_\tau \sum_{c} w_\beta(c, \tau \circ X')\, w_\beta(c, \tau_s \circ X'')$
§ Error event: $\hat\tau \neq \tau_s$
§ Calculate when $\lim_{n\to\infty} P(\hat\tau \neq \tau_s \mid \tau_s) = 0$
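With approximation-set (0/1) weights, the decoding rule reduces to counting overlapping hypotheses. A minimal sketch (the six-hypothesis indicator vectors and names are mine, not from the talk):

```python
import numpy as np

def overlap(w1: np.ndarray, w2: np.ndarray) -> float:
    """Weight overlap: sum over c of w_beta(c, tau_j . X') * w_beta(c, X~)."""
    return float((w1 * w2).sum())

def decode(sender_sets: list, receiver_set: np.ndarray) -> int:
    """tau_hat := argmax over j of the overlap with the j-th transformed sender set."""
    overlaps = [overlap(w, receiver_set) for w in sender_sets]
    return int(np.argmax(overlaps))

# Hypothetical 0/1 approximation sets over 6 hypotheses, one per transformation.
sender_sets = [np.array([1, 1, 0, 0, 0, 0]),
               np.array([0, 0, 1, 1, 0, 0]),
               np.array([0, 0, 0, 0, 1, 1])]
receiver_set = np.array([0, 0, 1, 1, 1, 0])  # a noisy version of set 1

print(decode(sender_sets, receiver_set))  # → 1
```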
Error Probability
§ Estimate the error given random transformations $\tau \in \mathbb{T}$
$\Delta Z_j := \sum_{c \in C(X'')} w_\beta(c, \tau_j \circ X')\, w_\beta(c, \tau_s \circ X'')$
$P(\hat\tau \neq \tau_s \mid \tau_s) = P\big(\max_{j \neq s} \Delta Z_j > \Delta Z_s \,\big|\, \tau_s\big)$
$\le \sum_{j \neq s} P\big(\Delta Z_j > \Delta Z_s \,\big|\, \tau_s\big)$ (union bound)
$\le M\, P\big(\Delta Z_{\neq s} > \Delta Z_s \,\big|\, \tau_s\big)$
$\le M\, \mathbb{E}_{X', X''}\, \mathbb{E}_{\tau \neq s}\, \frac{\Delta Z_{\neq s}}{\Delta Z_s}$ (Markov ineq.)
Decoupling of generalization sets
§ Intersecting an approximation set with a randomly transformed one yields decoupling!
$\mathbb{E}_\tau \Delta Z_\tau = \mathbb{E}_\tau \sum_{c \in C(X'')} w_\beta(c, \tau \circ X')\, w_\beta(c, \tau_s \circ X'')$
$= \sum_{c \in C(X'')} w_\beta(c, \tau_s \circ X'')\, \frac{1}{|\mathbb{T}|} \sum_{\tau \in \mathbb{T}} w_\beta(\tau^{-1} \circ c, X')$
$= \frac{1}{|\mathbb{T}|}\, Z(X'')\, Z(X')$
Generalization capacity from typicality
§ Asymptotic error free communication is possible for
§ Interpretation: “mutual information” > rate log M
$\lim_{n\to\infty} P(\hat\tau \neq \tau_s \mid \tau_s) = 0$
$P(\hat\tau \neq \tau_s \mid \tau_s) \le M\, \mathbb{E}_{X', X''}\, \frac{Z(X')\, Z(X'')}{|\mathbb{T}|\, \Delta Z_s(X', X'')}$
$\frac{M\, Z(X')\, Z(X'')}{|\mathbb{T}|\, \Delta Z_s(X', X'')} = \exp\Big(-\big(\log\tfrac{|\mathbb{T}|}{Z'} + \log\tfrac{|C''|}{Z''} - \log\tfrac{|C''|}{\Delta Z_s} - \log M\big)\Big) \approx \frac{M}{|\mathbb{T}|} \exp\big(-\mathcal{F}' \log|C'| - (\mathcal{F}'' - \Delta\mathcal{F}) \log|C''|\big)$
Algorithm selection by maximizing generalization capacity
sender
problem generator
receiver
Communication channel
§ Optimize the communication channel w.r.t. precision β, cost function R(.,.) and algorithm
§ Maximize “Mutual Information”
[Figure: communication channel; the sender transmits $\tau_s$, the receiver approximately optimizes $R(\cdot, \tau_s \circ X^{(2)})$ with $X^{(1)}, X^{(2)} \sim P(X)$ and outputs $\hat\tau$]
ASC for the binary channel consistent with Shannon information theory
Channel capacity of the BSC
§ Hypothesis class: set of binary strings $\xi', \xi'' \in \{-1, +1\}^n$
§ Costs of string $s$: Hamming distance $R(s, \xi') = \sum_{i=1}^{n} \mathbb{I}\{s_i \neq \xi'_i\}$
§ Communication: $\xi' \to \xi''$ with empirical flip rate $\hat\epsilon = \frac{1}{n}\big|\{i : \xi'_i \neq \xi''_i\}\big|$
§ Mutual information: $I_\beta = \ln 2 + (1 - \hat\epsilon)\ln\cosh\beta - \ln(\cosh\beta + 1)$
At the maximizing $\beta^*$, i.e. for $\frac{dI_\beta}{d\beta} = 0$:
$I_{\beta^*} = \ln 2 + \hat\epsilon \ln\hat\epsilon + (1 - \hat\epsilon)\ln(1 - \hat\epsilon)$
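This consistency claim can be checked numerically: maximizing $I_\beta$ over $\beta$ reproduces the Shannon capacity of the BSC. A minimal sketch (the grid search over $\beta$ is mine):

```python
import math

def I_beta(beta: float, eps: float) -> float:
    """Mutual information of the ASC scheme at inverse temperature beta."""
    return math.log(2) + (1 - eps) * math.log(math.cosh(beta)) \
           - math.log(math.cosh(beta) + 1)

def shannon_capacity(eps: float) -> float:
    """BSC capacity in nats: ln 2 + eps*ln(eps) + (1-eps)*ln(1-eps)."""
    return math.log(2) + eps * math.log(eps) + (1 - eps) * math.log(1 - eps)

eps = 0.1
best = max(I_beta(b / 1000, eps) for b in range(1, 10000))  # grid over beta in (0, 10]
print(best, shannon_capacity(eps))
```

For $\hat\epsilon = 0.1$ both expressions evaluate to ≈ 0.368 nats.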
Clustering of gene expression data
§ Spectral clustering of correlations in yeast cell cycle data
§ Pairwise clustering (PC), normalized cut (NCut), correlation clustering (CC), adaptive ratio cut (ARC), dominant set clustering (DS), for different distance measures
Yeast gene expression data: Eisen et al. (1998) PNAS 95,14863
Validating relational clustering costs
§ Cost function for pairwise clustering
§ Normalized cut
$R^{\mathrm{pc}}(c, X) = -\sum_{k=1}^{K} |G_k| \sum_{(i,j) \in E_{kk}} \frac{X_{ij}}{|E_{kk}|}$
Defs.: $G_k = \{i : c(i) = k\}$, $E_{kl} = \{(i,j) : c(i) = k \wedge c(j) = l\}$
$R^{\mathrm{nc}}(c; W) = \sum_{v \le k} \frac{\mathrm{cut}(G_v, V \setminus G_v)}{\mathrm{assoc}(G_v, V)}$
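The normalized-cut cost can be computed directly from a weight matrix; a minimal sketch (the 4-node example matrix is mine, with cut and assoc taken in their standard meanings):

```python
import numpy as np

def normalized_cut(W: np.ndarray, labels: np.ndarray) -> float:
    """R_nc(c; W) = sum over clusters of cut(G_v, V \\ G_v) / assoc(G_v, V)."""
    total = 0.0
    for v in np.unique(labels):
        inside = labels == v
        cut = W[inside][:, ~inside].sum()  # weight leaving the cluster
        assoc = W[inside].sum()            # weight incident to the cluster
        total += cut / assoc
    return total

# Two dense 2-node blocks, weakly linked: the "natural" 2-cut scores low.
W = np.array([[0, 5, 1, 0],
              [5, 0, 0, 1],
              [1, 0, 0, 5],
              [0, 1, 5, 0]], dtype=float)
good = normalized_cut(W, np.array([0, 0, 1, 1]))
bad = normalized_cut(W, np.array([0, 1, 0, 1]))
print(good, bad)  # the natural cut has the smaller cost
```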
§ Shifted correlation clustering: the shift s balances cluster sizes and should be adapted
§ Adaptive ratio cut: the exponent interpolates between arithmetic and harmonic normalization to balance clusters
$R(c, X, K, s) = \frac{1}{2} \sum_{1 \le u \le K} \sum_{(i,j) \in E_{uu}} \big(|X_{ij} + s| - X_{ij} - s\big) + \frac{1}{2} \sum_{1 \le u \le K} \sum_{1 \le v < u} \sum_{(i,j) \in E_{uv}} \big(|X_{ij} + s| + X_{ij} + s\big)$
$R(c, X) = \sum_{\alpha=1}^{K} \sum_{\beta=\alpha+1}^{K} \mathrm{cut}(V_\alpha, V_\beta) \Big(|V_\alpha|^{\frac{1}{p}} + |V_\beta|^{\frac{1}{p}}\Big)^{-p}, \quad p \in [-1, 1]$
Validating Dominant Set clustering
§ Dominant set clustering employs a replicator dynamics from evolutionary game theory to derive a partition
§ The replicator dynamics is applied iteratively to identify one cluster after another (Pavan & Pelillo, PAMI 2007)
$x_i(t+1) = x_i(t)\, \frac{(A x(t))_i}{x(t)^T A x(t)}, \quad x \in \text{simplex of } \mathbb{R}^n$
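The replicator update can be sketched in a few lines (a minimal sketch; the affinity matrix, starting point, and fixed iteration count are mine):

```python
import numpy as np

def replicator(A: np.ndarray, steps: int = 200) -> np.ndarray:
    """Iterate x_i(t+1) = x_i(t) * (A x)_i / (x^T A x) on the simplex."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)  # start at the simplex barycenter
    for _ in range(steps):
        Ax = A @ x
        x = x * Ax / (x @ Ax)
    return x

# Hypothetical affinities: objects 0-2 form a tight group, object 3 is an outlier.
A = np.array([[0.0, 1.0, 1.0, 0.1],
              [1.0, 0.0, 1.0, 0.1],
              [1.0, 1.0, 0.0, 0.1],
              [0.1, 0.1, 0.1, 0.0]])
x = replicator(A)
print(np.round(x, 3))
```

Iterating from the barycenter, the mass concentrates on the tightly coupled objects {0, 1, 2}, which form the dominant set.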
Approximation Capacity of Yeast Gene Expression Data
§ Gene expression data of Eisen et al. (1998) PNAS 95, 14863
§ Pairwise clustering (PC) vs. dominant set clustering (DS) for different distance measures
Influence of metric on AC
§ Path-based distance
§ Correlation dissimilarity
§ Euclidean distance
Consistency of clusterings
CC vs. PC (correlation distance):
§ PC & CC cluster objects (i, j) together
§ PC & CC cluster objects (i, j) separately
§ PC clusters objects (i, j) together, CC clusters them separately
§ PC clusters objects (i, j) separately, CC clusters them together

CC vs. PC (path distance):
§ CC & PC cluster objects (i, j) together
§ CC & PC cluster objects (i, j) separately
§ CC clusters objects (i, j) together, PC clusters them separately
§ CC clusters objects (i, j) separately, PC clusters them together
Approximate sorting
§ Among all permutations π, find the one that orders the items best!
(This cost function counts the number of disagreements between the provided pairwise comparisons and the pairwise comparisons induced by a proposal ranking from the hypothesis class.)
[Figure: matrix of observed pairwise ">" comparisons]
$X = (x_{ij})$
$R^{\mathrm{Sorting}}(\pi \mid X) = \sum_{i,j} x_{ij}\, \mathbb{I}\{\pi_i > \pi_j\}$
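The cost can be computed directly; a minimal sketch assuming the convention that $x_{ij} = 1$ records an observed comparison "i before j", so the cost counts violated comparisons (the convention and example matrix are mine):

```python
import numpy as np

def sorting_cost(pi: np.ndarray, X: np.ndarray) -> int:
    """R_Sorting(pi | X) = sum_ij x_ij * I{pi_i > pi_j}: number of observed
    comparisons that the proposal ranking pi disagrees with."""
    disagree = pi[:, None] > pi[None, :]  # I{pi_i > pi_j} as a boolean matrix
    return int((X * disagree).sum())

# Hypothetical comparisons for 3 items: x_ij = 1 means "i observed before j".
X = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])  # consistent with the order 0 < 1 < 2
print(sorting_cost(np.array([0, 1, 2]), X))  # → 0 (agrees with all comparisons)
print(sorting_cost(np.array([2, 1, 0]), X))  # → 3 (the reversal violates all three)
```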
Approximate Sorting
§ Approximately minimize the cost function $R^{\mathrm{Sorting}}(\pi \mid X) = \sum_{i,j} x_{ij}\, \mathbb{I}\{\pi_i > \pi_j\}$
§ Approximation capacity:
$\max_\beta I_\beta = \frac{1}{n}\log|\mathbb{T}| + \max_\beta \left\{ \frac{1}{n}\sum_{i=1}^{n} \log \sum_{k=1}^{n} \exp\big(-\beta\,(E^{(1)}_{ik} + E^{(2)}_{ik})\big) - \frac{1}{n}\sum_{i=1}^{n} \log\left(\sum_{k=1}^{n} \exp\big(-\beta E^{(1)}_{ik}\big) \sum_{k'=1}^{n} \exp\big(-\beta E^{(2)}_{ik'}\big)\right) \right\}$
§ Approximate sorting algorithm: Gibbs sampler with optimal β* or deterministic annealing up to β*
Prediction Performance of ASC
§ Experiments on synthetic data: preferences are derived from noise-contaminated rank values (n = 50)
Robust chess rankings
§ Tournament data from the German chess league, years 2009-2011
Prediction error of chess ranks:

Method                 | Absolute error | Relative error
Joint global minimizer | 0.29           | 14%
ASC                    | 0.14           | 6.8%
Dynamics of sorting algorithms (joint work with L.M. Busse & M.H. Chehreghani)
§ Information concentration of sorting algorithms: run-time parameterizes the size of the approximation sets => coupling of computational to statistical complexity
§ Cardinality of the approximation set is a function of the noise level in the data
Pareto-optimality of sorting algorithms
Performance vs. robustness
§ Experimental setting: array size n = 50, noise ~ Ber(0.1)
Approximation set of selection sort
[Figure: SelectionSort algorithm and its shrinking approximation sets $C^{\mathrm{SelSort}}_{t}$ at t = 0, 1, 2, 3]
$|\Delta C^{\mathrm{SelSort}}_{t}| = (n - t)!$
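The count above can be tabulated directly: once selection sort has fixed the first t positions, the (n − t)! orderings of the remaining items are still consistent with its output so far. A minimal sketch:

```python
from math import factorial

def selection_sort_set_size(n: int, t: int) -> int:
    """|Delta C_t^SelSort| = (n - t)!: permutations consistent with t completed steps."""
    return factorial(n - t)

n = 5
for t in range(n + 1):
    print(t, selection_sort_set_size(n, t))
# The set shrinks from n! = 120 down to a single permutation.
```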
Information capacity of MergeSort vs. BubbleSort (Ludwig Busse 2012, Diss. ETH #20600)
§ Experimental setting: sort n=500 items
Conclusion
§ Quantization: noise quantizes mathematical structures (hypothesis classes) => symbols
§ These symbols can be used for coding!
§ An optimal error-free coding scheme determines the approximation capacity of a model class.
⇒ Bounds for robust optimization.
⇒ Quantization of the hypothesis class measures structure-specific information in the data.
Future Work
§ Generalization: replace approximation sets based on cost functions by smoothed outputs of algorithms (“smoothed generalization”)
§ Model reduction in dynamical systems: quantize sets of ODEs or PDEs (systems biology)
§ Relate statistical complexity, i.e. the approximation capacity, to algorithmic or computational complexity.
Low-Energy Architecture Trends
§ Novel low-power architectures operate near the transistor threshold voltage (NTV), e.g., Intel Claremont: 1.5 mW @ 10 MHz (x86)
§ NTV promises 10x more energy efficiency at 10x more parallelism!
§ 10^5 times more soft errors (bits flip stochastically)
§ Hard to correct in hardware → expose to the programmer?
source: Intel
@ Thorsten Höffler
Resilient Programming
§ Distorted computations: some parts of a program still work when distorted, e.g., Conjugate Gradient while computing the new direction
§ Others cannot tolerate distortion, e.g., the target address for a jump-table
§ NTV is very intriguing for large-scale computing
§ Interesting model for developing algorithms, programming systems, and architectures
Source: MIT, Christine Daniloff
@ Thorsten Höffler