
  • Tutorial Part I: Information theory meets machine learning

    Emmanuel Abbe (Princeton University) and Martin Wainwright (UC Berkeley)

    June 2015

  • Introduction

    Era of massive data sets

    Fascinating problems at the interfaces between information theory and statistical machine learning.

    1 Fundamental issues
      ◮ Concentration of measure: high-dimensional problems are remarkably predictable
      ◮ Curse of dimensionality: without structure, many problems are hopeless
      ◮ Low-dimensional structure is essential

    2 Machine learning brings in algorithmic components
      ◮ Computational constraints are central
      ◮ Memory constraints: need for distributed and decentralized procedures
      ◮ Increasing importance of privacy

  • Historical perspective: info. theory and statistics

    [Photos: Claude Shannon and Andrey Kolmogorov]

    A rich intersection between information theory and statistics

    1 Hypothesis testing, large deviations
    2 Fisher information, Kullback-Leibler divergence
    3 Metric entropy and Fano's inequality

  • Statistical estimation as channel coding

    Codebook: indexed family of probability distributions {Qθ | θ ∈ Θ}

    Codeword: nature chooses some θ∗ ∈ Θ

    [Diagram: nature picks θ∗ ∈ Θ; the channel Q_θ∗ emits samples X_1^n; the decoder outputs θ̂]

    Channel: user observes n i.i.d. draws X_i ∼ Q_θ∗

    Decoding: estimator X_1^n ↦ θ̂ such that θ̂ ≈ θ∗


    Perspective dating back to Kolmogorov (1950s) with many variations:

    codebooks/codewords: graphs, vectors, matrices, functions, densities, ...

    channels: random graphs, regression models, elementwise probes of vectors/matrices, random projections

    closeness θ̂ ≈ θ∗: exact/partial graph recovery, Hamming or ℓ_p distances, L²(Q) distances, sup-norm, etc.

  • Machine learning: algorithmic issues to forefront!

    1 Efficient algorithms are essential
      ◮ only (low-order) polynomial-time methods can ever be implemented
      ◮ trade-offs between computational complexity and performance?
      ◮ fundamental barriers due to computational complexity?

    2 Distributed procedures are often needed
      ◮ many modern data sets: too large to be stored on a single machine
      ◮ need methods that operate separately on pieces of data
      ◮ trade-offs between decentralization and performance?
      ◮ fundamental barriers due to communication complexity?

    3 Privacy and access issues
      ◮ conflicts between individual privacy and the benefits of aggregation?
      ◮ principled information-theoretic formulations of such trade-offs?


  • Part I of tutorial: Three vignettes

    1 Graphical model selection

    2 Sparse principal component analysis

    3 Structured non-parametric regression and minimax theory


  • Vignette A: Graphical model selection

    Simple motivating example: Epidemic modeling

    Disease status of person s:  x_s = +1 if individual s is infected, and x_s = −1 if individual s is healthy

    (1) Independent infection:

        Q(x_1, ..., x_5) ∝ ∏_{s=1}^{5} exp(θ_s x_s)

    (2) Cycle-based infection:

        Q(x_1, ..., x_5) ∝ ∏_{s=1}^{5} exp(θ_s x_s) ∏_{(s,t)∈C} exp(θ_st x_s x_t)

    (3) Full clique infection:

        Q(x_1, ..., x_5) ∝ ∏_{s=1}^{5} exp(θ_s x_s) ∏_{s≠t} exp(θ_st x_s x_t)
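    To make the three factorizations concrete, here is a minimal Python sketch (not from the slides): it enumerates all 2^5 configurations and normalizes each model exactly. The parameter values θ_s = 0.1 and θ_st = 0.5, and the specific 5-cycle and clique edge sets, are illustrative assumptions.

    ```python
    import itertools
    import numpy as np

    p = 5
    theta = 0.1 * np.ones(p)           # node parameters θ_s (illustrative values)
    theta_edge = 0.5                   # common edge weight θ_st (illustrative value)
    cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]              # 5-cycle C
    clique = [(s, t) for s in range(p) for t in range(s + 1, p)]  # all pairs s < t

    def unnormalized_prob(x, edges):
        """exp( Σ_s θ_s x_s + Σ_(s,t) θ_st x_s x_t ) for x ∈ {−1,+1}^5."""
        energy = np.dot(theta, x) + theta_edge * sum(x[s] * x[t] for s, t in edges)
        return np.exp(energy)

    for name, edges in [("independent", []), ("cycle", cycle), ("full clique", clique)]:
        weights = np.array([unnormalized_prob(np.array(x), edges)
                            for x in itertools.product([-1, +1], repeat=p)])
        Q = weights / weights.sum()    # exact normalization (feasible only for tiny p)
        print(name, "Q(all infected) =", round(Q[-1], 4))
    ```

    Running it shows how positive couplings shift mass toward aligned configurations such as "all infected".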

  • Possible epidemic patterns (on a square)

  • From epidemic patterns to graphs

  • Underlying graphs

  • Model selection for graphs

    draw n i.i.d. samples from

        Q(x_1, ..., x_p; Θ) ∝ exp{ Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t }

    the graph G and the matrix [Θ]_st = θ_st of edge weights are unknown

    data matrix X_1^n ∈ {−1,+1}^{n×p}

    want an estimator X_1^n ↦ Ĝ that minimizes the error probability

        Q^n[ Ĝ(X_1^n) ≠ G ]    (probability that the estimated graph differs from the truth)

    Channel decoding:

    Think of graphs as codewords, and the graph family as a codebook.

  • Past/on-going work on graph selection

    exact polynomial-time solution on trees (Chow & Liu, 1967)

    testing for local conditional independence relationships (e.g., Spirtes et al., 2000; Kalisch & Buhlmann, 2008)

    pseudolikelihood and BIC criterion (Csiszar & Talata, 2006)

    pseudolikelihood and ℓ1-regularized neighborhood regression
      ◮ Gaussian case (Meinshausen & Buhlmann, 2006)
      ◮ Binary case (Ravikumar, W. & Lafferty, 2006)

    various greedy and related methods (Bresler et al., 2008; Bresler, 2015; Netrapalli et al., 2010; Anandkumar et al., 2013)

    lower bounds and inachievability results
      ◮ information-theoretic bounds (Santhanam & W., 2012)
      ◮ computational lower bounds (Dagum & Luby, 1993; Bresler et al., 2014)
      ◮ phase transitions and performance of neighborhood regression (Bento & Montanari, 2009)

  • US Senate network (2004–2006 voting)

  • Experiments for sequences of star graphs

    [Figures: star graphs with p = 9, d = 3 and p = 18, d = 6]

  • Empirical behavior: Unrescaled plots

    [Figure: success probability vs. number of samples for star graphs with a linear fraction of neighbors; curves for p = 64, 100, 225]

    Plots of success probability versus raw sample size n.

  • Empirical behavior: Appropriately rescaled

    [Figure: success probability vs. control parameter for star graphs with a linear fraction of neighbors; curves for p = 64, 100, 225]

    Plots of success probability versus rescaled sample size n/(d² log p).

  • Some theory: Scaling law for graph selection

    graphs G_{p,d} with p nodes and maximum degree d

    minimum absolute weight θ_min on the edges

    how many samples n are needed to recover the unknown graph?

    Theorem (Ravikumar, W. & Lafferty, 2010; Santhanam & W., 2012)

    Achievable result: Under regularity conditions, for the graph estimate Ĝ produced by ℓ1-regularized logistic regression (sketched below),

        n > c_u (d² + 1/θ_min²) log p    (lower bound on sample size)
            =⇒  Q[Ĝ ≠ G] → 0    (vanishing probability of error)

    Necessary condition: For a graph estimate G̃ produced by any algorithm,

        n < c_ℓ (d² + 1/θ_min²) log p    (upper bound on sample size)
            =⇒  Q[G̃ ≠ G] ≥ 1/2    (constant probability of error)
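    The achievable result is based on ℓ1-regularized logistic regression of each node on all the others (neighborhood regression). Below is a minimal sketch of that idea using scikit-learn; the regularization level, the coefficient threshold, and the OR-rule for symmetrizing neighborhoods are illustrative choices, not the exact construction analyzed in the paper.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def estimate_graph(X, lam=0.1):
        """Neighborhood-selection sketch: for each node s, run l1-penalized logistic
        regression of X[:, s] on the remaining columns, and declare an edge wherever
        the estimated coefficient is (essentially) nonzero."""
        n, p = X.shape
        A = np.zeros((p, p), dtype=bool)                 # adjacency estimate
        for s in range(p):
            others = [t for t in range(p) if t != s]
            clf = LogisticRegression(penalty="l1", C=1.0 / (lam * n),
                                     solver="liblinear", fit_intercept=True)
            clf.fit(X[:, others], X[:, s])               # data entries in {−1, +1}
            nbrs = np.array(others)[np.abs(clf.coef_.ravel()) > 1e-6]
            A[s, nbrs] = True
        return np.logical_or(A, A.T)                     # symmetrize with the OR rule
    ```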

  • Information-theoretic analysis of graph selection

    Question:

    How to prove lower bounds on graph selection methods?

    Answer:

    Graph selection is an unorthodox channel coding problem.

    codewords/codebook: a graph G in some graph class G

    channel use: draw a sample X_{i•} = (X_{i1}, ..., X_{ip}) ∈ {−1,+1}^p from the graph-structured distribution Q_G

    decoding problem: use the n samples {X_{1•}, ..., X_{n•}} to correctly distinguish the "codeword"

    [Diagram: G → channel Q(X | G) → observations X_{1•}, X_{2•}, ..., X_{n•}]

  • Proof sketch: Main ideas for necessary conditions

    based on assessing the difficulty of graph selection over various sub-ensembles G ⊆ G_{p,d}

    choose G ∈ G u.a.r., and consider the multi-way hypothesis testing problem based on the data X_1^n = {X_{1•}, ..., X_{n•}}

    for any graph estimator ψ : X^n → G, Fano's inequality implies that

        Q[ψ(X_1^n) ≠ G] ≥ 1 − I(X_1^n; G) / log|G| − o(1)

    where I(X_1^n; G) is the mutual information between the observations X_1^n and the randomly chosen graph G

    remaining steps:

    1 Construct "difficult" sub-ensembles G ⊆ G_{p,d}.

    2 Compute or lower bound the log cardinality log|G|.

    3 Upper bound the mutual information I(X_1^n; G) (see the bound sketched below).
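    One standard way to carry out step 3 (this derivation is generic, not taken verbatim from the slides) is to bound the mutual information by average pairwise KL divergences, which tensorize over the n i.i.d. samples:

    ```latex
    % G uniform over a sub-ensemble {G_1, ..., G_M}; convexity of KL in its
    % second argument, then tensorization over the n i.i.d. samples:
    \begin{align*}
    I(X_1^n; G)
      &= \frac{1}{M}\sum_{j=1}^{M}
         D\!\left(Q^n(\cdot\,; G_j)\,\Big\|\,\frac{1}{M}\sum_{k=1}^{M} Q^n(\cdot\,; G_k)\right) \\
      &\le \frac{1}{M^2}\sum_{j,k=1}^{M} D\!\left(Q^n(\cdot\,; G_j)\,\|\,Q^n(\cdot\,; G_k)\right)
       \;=\; \frac{n}{M^2}\sum_{j,k=1}^{M} D\!\left(Q(\cdot\,; G_j)\,\|\,Q(\cdot\,; G_k)\right).
    \end{align*}
    ```

    Combined with Fano's bound above, pairwise KL divergences that decay exponentially in d (as in the d-clique ensemble on the next slide) force n to be correspondingly large before a small error probability becomes possible.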

  • A “hard” d-clique ensemble

    [Figures: base graph G; graph G_{uv}; graph G_{st}]

    1 Divide the vertex set V into ⌊p/(d+1)⌋ groups of size d+1, and form the base graph G by making a (d+1)-clique C within each group.

    2 Form the graph G_{uv} by deleting edge (u, v) from G.

    3 Consider the testing problem over the family of graph-structured distributions {Q(· ; G_{st}), (s, t) ∈ C}.

    Why is this ensemble "hard"?

    The Kullback-Leibler divergence between pairs decays exponentially in d, unless the minimum edge weight decays as 1/d.

  • Vignette B: Sparse principal component analysis

    Principal component analysis:

    widely used method for dimensionality reduction, data compression, etc.: extracts the top eigenvectors of the sample covariance matrix Σ̂

    classical PCA in p dimensions: inconsistent unless p/n → 0

    low-dimensional structure: many applications lead to sparse eigenvectors

    Population model: rank-one spiked covariance matrix

        Σ = ν θ∗(θ∗)^T + I_p,    where ν is the SNR parameter and θ∗(θ∗)^T is the rank-one spike

    Sampling model: draw n i.i.d. zero-mean vectors with cov(X_i) = Σ, and form the sample covariance matrix

        Σ̂ := (1/n) Σ_{i=1}^{n} X_i X_i^T    (a p-dimensional matrix of rank min{n, p})
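    The spiked model is easy to simulate directly: each sample X_i = √ν z_i θ∗ + w_i with z_i ∼ N(0,1) and w_i ∼ N(0, I_p) has covariance exactly ν θ∗(θ∗)^T + I_p. A minimal sketch follows; the sizes n, p, k and the SNR ν are illustrative assumptions.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, p, k, nu = 200, 500, 10, 2.0               # illustrative sizes, sparsity, SNR

    theta_star = np.zeros(p)                      # k-sparse, unit-norm spike θ*
    theta_star[:k] = 1.0 / np.sqrt(k)

    # X_i = sqrt(ν) z_i θ* + w_i  has covariance  ν θ*(θ*)^T + I_p
    Z = rng.standard_normal((n, 1))
    W = rng.standard_normal((n, p))
    X = np.sqrt(nu) * Z * theta_star + W

    Sigma_hat = X.T @ X / n                       # sample covariance, rank ≤ min{n, p}
    ```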

  • Example: PCA for face compression/recognition

    First 25 standard principal components (estimated from data)

  • Example: PCA for face compression/recognition

    First 25 sparse principal components (estimated from data)

  • Perhaps the simplest estimator: Diagonal thresholding (Johnstone & Lu, 2008)

    [Figure: diagonal value Σ̂_jj versus index j for the diagonal thresholding method]

    Given n i.i.d. samples X_i with zero mean and spiked covariance Σ = ν θ∗(θ∗)^T + I:

    1 Compute the diagonal entries of the sample covariance: Σ̂_jj = (1/n) Σ_{i=1}^{n} X_ij².

    2 Apply a threshold to the vector {Σ̂_jj, j = 1, ..., p}.
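    A minimal sketch of these two steps, reusing the sampling code above; keeping the k largest diagonal entries stands in for the thresholding rule, which is a simplification of the actual Johnstone-Lu procedure.

    ```python
    import numpy as np

    def diagonal_thresholding(X, k):
        """Sketch: keep the k coordinates with the largest sample variances Σ̂_jj,
        then take the top eigenvector of the sample covariance restricted to them."""
        n, p = X.shape
        diag = (X ** 2).mean(axis=0)              # Σ̂_jj = (1/n) Σ_i X_ij^2
        support = np.argsort(diag)[-k:]           # coordinates surviving the threshold
        Sigma_sub = X[:, support].T @ X[:, support] / n
        _, eigvecs = np.linalg.eigh(Sigma_sub)
        theta_hat = np.zeros(p)
        theta_hat[support] = eigvecs[:, -1]       # leading eigenvector on the support
        return theta_hat
    ```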

  • Diagonal thresholding: unrescaled plots

    [Figure: success probability vs. number of observations for diagonal thresholding with k = O(log p); curves for p = 100, 200, 300, 600, 1200]

  • Diagonal thresholding: rescaled plots

    [Figure: success probability vs. control parameter for diagonal thresholding with k = O(log p); curves for p = 100, 200, 300, 600, 1200]

    Scaling is quadratic in sparsity: n/(k² log p)

  • Diagonal thresholding and fundamental limit

    Consider the spiked covariance matrix

        Σ = ν θ∗(θ∗)^T + I_{p×p},    where θ∗ ∈ B_0(k) ∩ B_2(1) is k-sparse with unit norm, and ν is the signal-to-noise ratio

    Theorem (Amini & W., 2009)

    (a) There are thresholds τ_ℓ^DT ≤ τ_u^DT such that

        n/(k² log p) ≤ τ_ℓ^DT  =⇒  DT fails w.h.p.,    and    n/(k² log p) ≥ τ_u^DT  =⇒  DT succeeds w.h.p.

    (b) For the optimal method (exhaustive search), there is a threshold τ^ES such that

        n/(k log p) < τ^ES  =⇒  it fails w.h.p.,    and    n/(k log p) > τ^ES  =⇒  it succeeds w.h.p.

  • One polynomial-time SDP relaxation

    Recall the Courant-Fischer variational formulation:

        θ∗ = arg max_{‖z‖₂=1} { z^T ( ν θ∗(θ∗)^T + I_{p×p} ) z },    where ν θ∗(θ∗)^T + I_{p×p} is the population covariance Σ.

    Equivalently, lifting to matrix variables Z = zz^T:

        θ∗(θ∗)^T = arg max_{Z∈R^{p×p}, Z=Z^T, Z⪰0} { trace(Z^T Σ) }    s.t. trace(Z) = 1 and rank(Z) = 1.

    Dropping the rank constraint yields a standard SDP relaxation:

        Ẑ = arg max_{Z∈R^{p×p}, Z=Z^T, Z⪰0} { trace(Z^T Σ) }    s.t. trace(Z) = 1.

    In practice (d'Aspremont et al., 2008):

    apply this relaxation using the sample covariance matrix Σ̂

    add the ℓ1-constraint Σ_{i,j=1}^{p} |Z_ij| ≤ k
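    A minimal cvxpy sketch of this relaxation; the solver choice is left to cvxpy's defaults, and the objective trace(Σ̂ Z) is written elementwise, which is valid because both matrices are symmetric.

    ```python
    import cvxpy as cp
    import numpy as np

    def sparse_pca_sdp(Sigma_hat, k):
        """Sketch of the SDP relaxation: maximize trace(Σ̂ Z) over symmetric PSD Z
        with trace(Z) = 1 and the l1-constraint Σ_ij |Z_ij| ≤ k, then return the
        leading eigenvector of the optimal Z."""
        p = Sigma_hat.shape[0]
        Z = cp.Variable((p, p), symmetric=True)
        constraints = [Z >> 0, cp.trace(Z) == 1, cp.sum(cp.abs(Z)) <= k]
        # trace(Σ̂ Z) = Σ_ij Σ̂_ij Z_ij for symmetric matrices
        objective = cp.Maximize(cp.sum(cp.multiply(Sigma_hat, Z)))
        cp.Problem(objective, constraints).solve()
        _, eigvecs = np.linalg.eigh(Z.value)
        return eigvecs[:, -1]
    ```

    Applied to the Σ̂ from the earlier sampling sketch, the leading eigenvector of Ẑ serves as the sparse-PCA estimate of θ∗.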

  • Phase transition for SDP: logarithmic sparsity

    [Figure: success probability vs. control parameter for the SDP relaxation with k = O(log p); curves for p = 100, 200, 300]

    Scaling is linear in sparsity: n/(k log p)

  • A natural question

    Question:

    Can the logarithmic-sparsity or rank-one conditions be removed?

  • Computational lower bound for sparse PCA

    Consider the spiked covariance matrix

        Σ = ν (θ∗)(θ∗)^T + I_{p×p},    where θ∗ ∈ B_0(k) ∩ B_2(1) is k-sparse with unit norm, and ν is the signal-to-noise ratio

    Sparse PCA detection problem:

        H0 (no signal):      X_i ∼ D(0, I_{p×p})
        H1 (spiked signal):  X_i ∼ D(0, Σ)

    where D is a distribution with sub-exponential tail behavior.

    Theorem (Berthet & Rigollet, 2013)

    Under average-case hardness of planted clique, the polynomial-minimax level of detection ν^POLY is given by

        ν^POLY ≍ (k² log p)/n    for all sparsity levels log p ≪ k ≪ √p.

    The classical minimax level is ν^OPT ≍ (k log p)/n.

  • Planted k-clique problem

    [Figures: an Erdős–Rényi graph and a graph with a planted k-clique]

    Binary hypothesis test based on observing a random graph G on p vertices:

        H0: Erdős–Rényi, each edge present independently with probability 1/2
        H1: planted k-clique, remaining edges random

  • Planted k-clique problem

    [Figures: a random binary matrix and a matrix with a planted k × k sub-matrix]

    Binary hypothesis test based on observing a random binary matrix:

        H0: random {0, 1} matrix with Ber(1/2) entries off the diagonal
        H1: planted k × k sub-matrix, remaining entries random
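    For intuition, here is an illustrative generator for the matrix version of the testing problem; the sizes are arbitrary and the handling of diagonal entries is simplified.

    ```python
    import numpy as np

    def sample_matrix(p, k, planted, rng):
        """H0: entries i.i.d. Ber(1/2); H1: additionally hide an all-ones k x k
        sub-matrix on a random choice of k rows and k columns."""
        M = rng.integers(0, 2, size=(p, p))
        if planted:
            rows = rng.choice(p, size=k, replace=False)
            cols = rng.choice(p, size=k, replace=False)
            M[np.ix_(rows, cols)] = 1             # planted k x k block of ones
        return M

    rng = np.random.default_rng(0)
    M0 = sample_matrix(200, 20, planted=False, rng=rng)
    M1 = sample_matrix(200, 20, planted=True, rng=rng)
    ```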

  • Vignette C: (Structured) non-parametric regression

    Goal: How to predict output from covariates?

    given covariates (x_1, x_2, ..., x_p) and an output variable y, we want to predict y based on (x_1, ..., x_p)

    Examples: medical imaging, geostatistics, astronomy, computational biology, ...

    [Figures: fitted function versus design value x — (a) second-order polynomial fit, (b) Lipschitz (first-order Sobolev) function fit]

  • High dimensions and sample complexity

    Possible models:

    ordinary linear regression:  y = Σ_{j=1}^{p} θ∗_j x_j + w = ⟨θ∗, x⟩ + w

    general non-parametric model:  y = f∗(x_1, x_2, ..., x_p) + w

    Estimation accuracy: How well can f∗ be estimated using n samples?

    linear models
      ◮ without any structure: error δ² ≍ p/n    (linear in p)
      ◮ with sparsity k ≪ p: error δ² ≍ (k log(ep/k))/n    (logarithmic in p)

    non-parametric models: p-dimensional, smoothness α
      ◮ Curse of dimensionality: error δ² ≍ (1/n)^{2α/(2α+p)}    (exponential slow-down)

  • Minimax risk and sample size

    Consider a function class F, and n i.i.d. samples from the model

        y_i = f∗(x_i) + w_i,    where f∗ is some member of F.

    For a given estimator {(x_i, y_i)}_{i=1}^{n} ↦ f̂ ∈ F, the worst-case risk in a metric ρ is

        R_n^worst(f̂; F) = sup_{f∗∈F} E_n[ρ²(f̂, f∗)].

    Minimax risk

    For a given sample size n, the minimax risk is

        inf_{f̂} R_n^worst(f̂; F) = inf_{f̂} sup_{f∗∈F} E_n[ρ²(f̂, f∗)],

    where the infimum is taken over all estimators.


  • How to measure “size” of function classes?

    A 2δ-packing of F in the metric ρ is a collection {f¹, ..., f^M} ⊂ F such that

        ρ(f^j, f^k) > 2δ    for all j ≠ k.

    The packing number M(2δ) is the cardinality of the largest such set.

    Packing/covering entropy: emerged from the Russian school in the 1940s/1950s (Kolmogorov and collaborators)

    Central object in proving minimax lower bounds for nonparametric problems (e.g., Hasminskii & Ibragimov, 1978; Birge, 1983; Yu, 1997; Yang & Barron, 1999)

  • Packing and covering numbers

    [Figure: a 2δ-packing f¹, f², f³, f⁴, ..., f^M of the class F]


  • Example: Sup-norm packing for Lipschitz functions

    [Figure: piecewise-linear packing functions with slopes ±L]

    δ-packing set: functions {f¹, f², ..., f^M} such that ‖f^j − f^k‖_∞ > δ for all j ≠ k

    for L-Lipschitz functions in one dimension:

        M(δ) ≍ 2^{L/δ}.
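    One standard construction achieving this scaling (a sketch, which may differ in details from the family pictured on the slide) uses ±L triangular bumps:

    ```latex
    % Partition [0,1] into m = floor(L/(2*delta)) intervals of width 1/m, and let
    % phi_j be the triangular bump on the j-th interval with slopes +/- L and peak
    % height L/(2m) >= delta.
    \begin{gather*}
    f_\varepsilon = \sum_{j=1}^{m} \varepsilon_j \phi_j,
      \qquad \varepsilon \in \{-1,+1\}^m
      \qquad (\text{each } f_\varepsilon \text{ is } L\text{-Lipschitz}), \\
    \varepsilon \neq \varepsilon' \;\Longrightarrow\;
      \|f_\varepsilon - f_{\varepsilon'}\|_\infty \ge 2\cdot\frac{L}{2m} \ge 2\delta > \delta,
      \qquad\text{so}\qquad
      M(\delta) \;\ge\; 2^{m} \;=\; 2^{\lfloor L/(2\delta)\rfloor}.
    \end{gather*}
    ```

    A matching upper bound of the same order follows from a covering argument, giving log M(δ) ≍ L/δ.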

  • Standard reduction: from estimation to testing

    [Figure: a 2δ-packing f¹, f², f³, f⁴, ..., f^M of the class F]

    goal: characterize the minimax risk for ρ-estimation over F

    construct a 2δ-packing of F: a collection {f¹, ..., f^M} such that ρ(f^j, f^k) > 2δ

    now form an M-component mixture distribution as follows:
      ◮ draw a packing index V ∈ {1, ..., M} uniformly at random
      ◮ conditioned on V = j, draw n i.i.d. samples (X_i, Y_i) ∼ Q_{f^j}

    1 Claim: Any estimator f̂ such that ρ(f̂, f^V) ≤ δ w.h.p. can be used to solve the M-ary testing problem.

    2 Use standard techniques (Assouad, Le Cam, Fano) to lower bound the probability of error in the testing problem.
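    Writing out how the claim converts into a risk lower bound (a standard chain, spelled out here for completeness rather than copied from the slides): Markov's inequality plus the 2δ-separation reduce estimation to testing, and Fano's inequality bounds the testing error:

    ```latex
    \begin{align*}
    \inf_{\hat f}\,\sup_{f^* \in \mathcal{F}} \mathbb{E}\big[\rho^2(\hat f, f^*)\big]
      \;\ge\; \delta^2 \,\inf_{\psi}\, \mathbb{P}\big[\psi(X_1^n, Y_1^n) \neq V\big]
      \;\ge\; \delta^2 \left(1 - \frac{I\big(V; (X_1^n, Y_1^n)\big) + \log 2}{\log M}\right),
    \end{align*}
    % where V is the uniformly distributed packing index and M = M(2δ) is the
    % packing number; choosing δ so that the bracket stays bounded away from zero
    % yields the minimax lower bound.
    ```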

  • Minimax rate via metric entropy matching

    observe (x_i, y_i) pairs from the model y_i = f∗(x_i) + w_i

    two possible norms:

        ‖f̂ − f∗‖_n² := (1/n) Σ_{i=1}^{n} (f̂(x_i) − f∗(x_i))²,    or    ‖f̂ − f∗‖_2² := E[(f̂(X̃) − f∗(X̃))²].

    Metric entropy master equation

    For many regression problems, the minimax rate δ_n > 0 is determined by solving the master equation

        log M(2δ; F, ‖·‖) ≍ n δ².

    basic idea (with the Hellinger metric) dates back to Le Cam (1973)

    elegant general version due to Yang and Barron (1999)


  • Example 1: Sparse linear regression

    Observations y_i = ⟨x_i, θ∗⟩ + w_i, where

        θ∗ ∈ Θ(k, p) := {θ ∈ R^p | ‖θ‖_0 ≤ k and ‖θ‖_2 ≤ 1}.

    Gilbert-Varshamov: can construct a 2δ-separated set with log M(2δ) ≍ k log(ep/k) elements

    Master equation and minimax rate

        log M(2δ) ≍ n δ²   ⟺   δ² ≍ (k log(ep/k))/n.

    Polynomial-time achievability:

    by ℓ1-relaxations under restricted eigenvalue (RE) conditions (Candes & Tao, 2007; Bickel et al., 2009; Buhlmann & van de Geer, 2011)

    achieve minimax-optimal rates for ℓ2-error (Raskutti, W., & Yu, 2011)

    ℓ1-methods can be sub-optimal for the prediction error ‖X(θ̂ − θ∗)‖_2/√n (Zhang, W. & Jordan, 2014)
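    As a quick numerical sanity check of the k log(ep/k)/n scaling (a sketch, not from the slides: the Gaussian design, noise level, and the standard λ ≍ σ √(log p / n) regularization choice are illustrative assumptions):

    ```python
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)
    n, p, k, sigma = 400, 1000, 10, 0.5          # illustrative problem sizes

    theta_star = np.zeros(p)
    theta_star[:k] = 1.0 / np.sqrt(k)            # k-sparse, unit-norm target
    X = rng.standard_normal((n, p))
    y = X @ theta_star + sigma * rng.standard_normal(n)

    # Theory-style choice lambda ~ sigma * sqrt(log(p)/n), up to constants;
    # sklearn's Lasso minimizes (1/2n)||y - X theta||_2^2 + alpha ||theta||_1.
    lam = 2 * sigma * np.sqrt(np.log(p) / n)
    theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

    err2 = np.sum((theta_hat - theta_star) ** 2)
    rate = k * np.log(np.e * p / k) / n          # minimax scale from the master equation
    print(f"squared l2 error {err2:.4f}  vs.  k log(ep/k)/n = {rate:.4f}")
    ```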

  • Example 2: α-smooth non-parametric regression

    Observations y_i = f∗(x_i) + w_i, where

        f∗ ∈ F(α) = {f : [0, 1] → R | f is α-times differentiable, with Σ_{j=0}^{α} ‖f^(j)‖_∞ ≤ C}.

    Classical results in approximation theory (e.g., Kolmogorov & Tikhomirov, 1959):

        log M(2δ; F) ≍ (1/δ)^{1/α}

    Master equation and minimax rate

        log M(2δ) ≍ n δ²   ⟺   δ² ≍ (1/n)^{2α/(2α+1)}.

  • Example 3: Sparse additive and α-smooth

    Observations y_i = f∗(x_i) + w_i, where f∗ satisfies the constraints:

    Additively decomposable: f∗(x_1, ..., x_p) = Σ_{j=1}^{p} g∗_j(x_j)

    Sparse: at most k of (g∗_1, ..., g∗_p) are non-zero

    Smooth: each g∗_j belongs to the α-smooth family F(α)

    Combining the previous results yields a 2δ-packing with

        log M(2δ; F) ≍ k (1/δ)^{1/α} + k log(ep/k)

    Master equation and minimax rate

    Solving log M(2δ) ≍ n δ² yields

        δ² ≍ k (1/n)^{2α/(2α+1)}    (k-component estimation)    +    (k log(ep/k))/n    (search complexity)

  • Summary

    To be provided during the tutorial...