
  • Tutorial Part I: Information theory meets machine learning

    Emmanuel Abbe (Princeton University) and Martin Wainwright (UC Berkeley)

    June 2015

  • Introduction

    Era of massive data sets

    Fascinating problems at the interfaces between information theory and statistical machine learning.

    1 Fundamental issues
      ◮ Concentration of measure: high-dimensional problems are remarkably predictable
      ◮ Curse of dimensionality: without structure, many problems are hopeless
      ◮ Low-dimensional structure is essential

    2 Machine learning brings in algorithmic components
      ◮ Computational constraints are central
      ◮ Memory constraints: need for distributed and decentralized procedures
      ◮ Increasing importance of privacy

  • Historical perspective: info. theory and statistics

    [Photos: Claude Shannon and Andrey Kolmogorov]

    A rich intersection between information theory and statistics

    1 Hypothesis testing, large deviations
    2 Fisher information, Kullback-Leibler divergence
    3 Metric entropy and Fano's inequality

  • Statistical estimation as channel coding

    Codebook: indexed family of probability distributions {Qθ | θ ∈ Θ}

    Codeword: nature chooses some θ∗ ∈ Θ

    [Diagram: nature picks θ∗ ∈ Θ; the channel Q_θ∗ emits samples X_1^n; the decoder outputs θ̂]

    Channel: user observes n i.i.d. draws X_i ∼ Q_θ∗

    Decoding: estimator X_1^n ↦ θ̂ such that θ̂ ≈ θ∗


    Perspective dating back to Kolmogorov (1950s) with many variations:

    codebooks/codewords: graphs, vectors, matrices, functions, densities, ...

    channels: random graphs, regression models, elementwise probes of vectors/matrices, random projections

    closeness θ̂ ≈ θ∗: exact/partial graph recovery, Hamming or ℓ_p distances, L²(Q) distances, sup-norm, etc.

  • Machine learning: algorithmic issues to forefront!

    1 Efficient algorithms are essential
      ◮ only (low-order) polynomial-time methods can ever be implemented
      ◮ trade-offs between computational complexity and performance?
      ◮ fundamental barriers due to computational complexity?

    2 Distributed procedures are often needed
      ◮ many modern data sets: too large to be stored on a single machine
      ◮ need methods that operate separately on pieces of data
      ◮ trade-offs between decentralization and performance?
      ◮ fundamental barriers due to communication complexity?

    3 Privacy and access issues
      ◮ conflicts between individual privacy and the benefits of aggregation?
      ◮ principled information-theoretic formulations of such trade-offs?


  • Part I of tutorial: Three vignettes

    1 Graphical model selection

    2 Sparse principal component analysis

    3 Structured non-parametric regression and minimax theory


  • Vignette A: Graphical model selection

    Simple motivating example: Epidemic modeling

    Disease status of person s:  x_s = +1 if individual s is infected, and x_s = −1 if individual s is healthy

    (1) Independent infection:

        Q(x_1, ..., x_5) ∝ ∏_{s=1}^{5} exp(θ_s x_s)

    (2) Cycle-based infection:

        Q(x_1, ..., x_5) ∝ ∏_{s=1}^{5} exp(θ_s x_s) ∏_{(s,t)∈C} exp(θ_st x_s x_t)

    (3) Full clique infection:

        Q(x_1, ..., x_5) ∝ ∏_{s=1}^{5} exp(θ_s x_s) ∏_{s≠t} exp(θ_st x_s x_t)
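    To make the three factorizations concrete, here is a minimal Python sketch (not from the slides): it enumerates all 2^5 configurations and normalizes each model exactly. The parameter values θ_s = 0.1 and θ_st = 0.5, and the specific 5-cycle and clique edge sets, are illustrative assumptions.

    ```python
    import itertools
    import numpy as np

    p = 5
    theta = 0.1 * np.ones(p)           # node parameters θ_s (illustrative values)
    theta_edge = 0.5                   # common edge weight θ_st (illustrative value)
    cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]              # 5-cycle C
    clique = [(s, t) for s in range(p) for t in range(s + 1, p)]  # all pairs s < t

    def unnormalized_prob(x, edges):
        """exp( Σ_s θ_s x_s + Σ_(s,t) θ_st x_s x_t ) for x ∈ {−1,+1}^5."""
        energy = np.dot(theta, x) + theta_edge * sum(x[s] * x[t] for s, t in edges)
        return np.exp(energy)

    for name, edges in [("independent", []), ("cycle", cycle), ("full clique", clique)]:
        weights = np.array([unnormalized_prob(np.array(x), edges)
                            for x in itertools.product([-1, +1], repeat=p)])
        Q = weights / weights.sum()    # exact normalization (feasible only for tiny p)
        print(name, "Q(all infected) =", round(Q[-1], 4))
    ```

    Running it shows how positive couplings shift mass toward aligned configurations such as "all infected".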

  • Possible epidemic patterns (on a square)

  • From epidemic patterns to graphs

  • Underlying graphs

  • Model selection for graphs

    draw n i.i.d. samples from

        Q(x_1, ..., x_p; Θ) ∝ exp{ Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_st x_s x_t }

    the graph G and the matrix [Θ]_st = θ_st of edge weights are unknown

    data matrix X_1^n ∈ {−1,+1}^{n×p}

    want an estimator X_1^n ↦ Ĝ that minimizes the error probability

        Q^n[ Ĝ(X_1^n) ≠ G ]    (probability that the estimated graph differs from the truth)

    Channel decoding:

    Think of graphs as codewords, and the graph family as a codebook.

  • Past/on-going work on graph selection

    exact polynomial-time solution on trees (Chow & Liu, 1967)

    testing for local conditional independence relationships (e.g., Spirtes et al., 2000; Kalisch & Buhlmann, 2008)

    pseudolikelihood and BIC criterion (Csiszar & Talata, 2006)

    pseudolikelihood and ℓ1-regularized neighborhood regression
      ◮ Gaussian case (Meinshausen & Buhlmann, 2006)
      ◮ Binary case (Ravikumar, W. & Lafferty, 2006)

    various greedy and related methods (Bresler et al., 2008; Bresler, 2015; Netrapalli et al., 2010; Anandkumar et al., 2013)

    lower bounds and inachievability results
      ◮ information-theoretic bounds (Santhanam & W., 2012)
      ◮ computational lower bounds (Dagum & Luby, 1993; Bresler et al., 2014)
      ◮ phase transitions and performance of neighborhood regression (Bento & Montanari, 2009)

  • US Senate network (2004–2006 voting)

  • Experiments for sequences of star graphs

    [Figures: star graphs with p = 9, d = 3 and p = 18, d = 6]

  • Empirical behavior: Unrescaled plots

    [Figure: success probability vs. number of samples for star graphs with a linear fraction of neighbors; curves for p = 64, 100, 225]

    Plots of success probability versus raw sample size n.

  • Empirical behavior: Appropriately rescaled

    [Figure: success probability vs. control parameter for star graphs with a linear fraction of neighbors; curves for p = 64, 100, 225]

    Plots of success probability versus rescaled sample size n/(d² log p).

  • Some theory: Scaling law for graph selection

    graphs G_{p,d} with p nodes and maximum degree d

    minimum absolute weight θ_min on the edges

    how many samples n are needed to recover the unknown graph?

    Theorem (Ravikumar, W. & Lafferty, 2010; Santhanam & W., 2012)

    Achievable result: Under regularity conditions, for the graph estimate Ĝ produced by ℓ1-regularized logistic regression (sketched below),

        n > c_u (d² + 1/θ_min²) log p    (lower bound on sample size)
            =⇒  Q[Ĝ ≠ G] → 0    (vanishing probability of error)

    Necessary condition: For a graph estimate G̃ produced by any algorithm,

        n < c_ℓ (d² + 1/θ_min²) log p    (upper bound on sample size)
            =⇒  Q[G̃ ≠ G] ≥ 1/2    (constant probability of error)
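    The achievable result is based on ℓ1-regularized logistic regression of each node on all the others (neighborhood regression). Below is a minimal sketch of that idea using scikit-learn; the regularization level, the coefficient threshold, and the OR-rule for symmetrizing neighborhoods are illustrative choices, not the exact construction analyzed in the paper.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def estimate_graph(X, lam=0.1):
        """Neighborhood-selection sketch: for each node s, run l1-penalized logistic
        regression of X[:, s] on the remaining columns, and declare an edge wherever
        the estimated coefficient is (essentially) nonzero."""
        n, p = X.shape
        A = np.zeros((p, p), dtype=bool)                 # adjacency estimate
        for s in range(p):
            others = [t for t in range(p) if t != s]
            clf = LogisticRegression(penalty="l1", C=1.0 / (lam * n),
                                     solver="liblinear", fit_intercept=True)
            clf.fit(X[:, others], X[:, s])               # data entries in {−1, +1}
            nbrs = np.array(others)[np.abs(clf.coef_.ravel()) > 1e-6]
            A[s, nbrs] = True
        return np.logical_or(A, A.T)                     # symmetrize with the OR rule
    ```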

  • Information-theoretic analysis of graph selection

    Question:

    How to prove lower bounds on graph selection methods?

    Answer:

    Graph selection is an unorthodox channel coding problem.

    codewords/codebook: a graph G in some graph class G

    channel use: draw a sample X_{i•} = (X_{i1}, ..., X_{ip}) ∈ {−1,+1}^p from the graph-structured distribution Q_G

    decoding problem: use the n samples {X_{1•}, ..., X_{n•}} to correctly distinguish the "codeword"

    [Diagram: G → channel Q(X | G) → observations X_{1•}, X_{2•}, ..., X_{n•}]

  • Proof sketch: Main ideas for necessary conditions

    based on assessing the difficulty of graph selection over various sub-ensembles G ⊆ G_{p,d}

    choose G ∈ G u.a.r., and consider the multi-way hypothesis testing problem based on the data X_1^n = {X_{1•}, ..., X_{n•}}

    for any graph estimator ψ : X^n → G, Fano's inequality implies that

        Q[ψ(X_1^n) ≠ G] ≥ 1 − I(X_1^n; G) / log|G| − o(1)

    where I(X_1^n; G) is the mutual information between the observations X_1^n and the randomly chosen graph G

    remaining steps:

    1 Construct "difficult" sub-ensembles G ⊆ G_{p,d}.

    2 Compute or lower bound the log cardinality log|G|.

    3 Upper bound the mutual information I(X_1^n; G) (see the bound sketched below).
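    One standard way to carry out step 3 (this derivation is generic, not taken verbatim from the slides) is to bound the mutual information by average pairwise KL divergences, which tensorize over the n i.i.d. samples:

    ```latex
    % G uniform over a sub-ensemble {G_1, ..., G_M}; convexity of KL in its
    % second argument, then tensorization over the n i.i.d. samples:
    \begin{align*}
    I(X_1^n; G)
      &= \frac{1}{M}\sum_{j=1}^{M}
         D\!\left(Q^n(\cdot\,; G_j)\,\Big\|\,\frac{1}{M}\sum_{k=1}^{M} Q^n(\cdot\,; G_k)\right) \\
      &\le \frac{1}{M^2}\sum_{j,k=1}^{M} D\!\left(Q^n(\cdot\,; G_j)\,\|\,Q^n(\cdot\,; G_k)\right)
       \;=\; \frac{n}{M^2}\sum_{j,k=1}^{M} D\!\left(Q(\cdot\,; G_j)\,\|\,Q(\cdot\,; G_k)\right).
    \end{align*}
    ```

    Combined with Fano's bound above, pairwise KL divergences that decay exponentially in d (as in the d-clique ensemble on the next slide) force n to be correspondingly large before a small error probability becomes possible.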

  • A “hard” d-clique ensemble

    [Figures: base graph G; graph G_{uv}; graph G_{st}]

    1 Divide the vertex set V into ⌊p/(d+1)⌋ groups of size d+1, and form the base graph G by making a (d+1)-clique C within each group.

    2 Form the graph G_{uv} by deleting edge (u, v) from G.

    3 Consider the testing problem over the family of graph-structured distributions {Q(· ; G_{st}), (s, t) ∈ C}.

    Why is this ensemble "hard"?

    The Kullback-Leibler divergence between pairs decays exponentially in d, unless the minimum edge weight decays as 1/d.

  • Vignette B: Sparse principal component analysis

    Principal component analysis:

    widely used method for dimensionality reduction, data compression, etc.: extracts the top eigenvectors of the sample covariance matrix Σ̂

    classical PCA in p dimensions: inconsistent unless p/n → 0

    low-dimensional structure: many applications lead to sparse eigenvectors

    Population model: rank-one spiked covariance matrix

        Σ = ν θ∗(θ∗)^T + I_p,    where ν is the SNR parameter and θ∗(θ∗)^T is the rank-one spike

    Sampling model: draw n i.i.d. zero-mean vectors with cov(X_i) = Σ, and form the sample covariance matrix

        Σ̂ := (1/n) Σ_{i=1}^{n} X_i X_i^T    (a p-dimensional matrix of rank min{n, p})
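    The spiked model is easy to simulate directly: each sample X_i = √ν z_i θ∗ + w_i with z_i ∼ N(0,1) and w_i ∼ N(0, I_p) has covariance exactly ν θ∗(θ∗)^T + I_p. A minimal sketch follows; the sizes n, p, k and the SNR ν are illustrative assumptions.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, p, k, nu = 200, 500, 10, 2.0               # illustrative sizes, sparsity, SNR

    theta_star = np.zeros(p)                      # k-sparse, unit-norm spike θ*
    theta_star[:k] = 1.0 / np.sqrt(k)

    # X_i = sqrt(ν) z_i θ* + w_i  has covariance  ν θ*(θ*)^T + I_p
    Z = rng.standard_normal((n, 1))
    W = rng.standard_normal((n, p))
    X = np.sqrt(nu) * Z * theta_star + W

    Sigma_hat = X.T @ X / n                       # sample covariance, rank ≤ min{n, p}
    ```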

  • Example: PCA for face compression/recognition

    First 25 standard principal components (estimated from data)

  • Example: PCA for face compression/recognition

    First 25 sparse principal components (estimated from data)

  • Perhaps the simplest estimator: Diagonal thresholding (Johnstone & Lu, 2008)

    [Figure: diagonal value Σ̂_jj versus index j for the diagonal thresholding method]

    Given n i.i.d. samples X_i with zero mean and spiked covariance Σ = ν θ∗(θ∗)^T + I:

    1 Compute the diagonal entries of the sample covariance: Σ̂_jj = (1/n) Σ_{i=1}^{n} X_ij².

    2 Apply a threshold to the vector {Σ̂_jj, j = 1, ..., p}.
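    A minimal sketch of these two steps, reusing the sampling code above; keeping the k largest diagonal entries stands in for the thresholding rule, which is a simplification of the actual Johnstone-Lu procedure.

    ```python
    import numpy as np

    def diagonal_thresholding(X, k):
        """Sketch: keep the k coordinates with the largest sample variances Σ̂_jj,
        then take the top eigenvector of the sample covariance restricted to them."""
        n, p = X.shape
        diag = (X ** 2).mean(axis=0)              # Σ̂_jj = (1/n) Σ_i X_ij^2
        support = np.argsort(diag)[-k:]           # coordinates surviving the threshold
        Sigma_sub = X[:, support].T @ X[:, support] / n
        _, eigvecs = np.linalg.eigh(Sigma_sub)
        theta_hat = np.zeros(p)
        theta_hat[support] = eigvecs[:, -1]       # leading eigenvector on the support
        return theta_hat
    ```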

  • Diagonal thresholding: unrescaled plots

    [Figure: success probability vs. number of observations for diagonal thresholding with k = O(log p); curves for p = 100, 200, 300, 600, 1200]

  • Diagonal thresholding: rescaled plots

    [Figure: success probability vs. control parameter for diagonal thresholding with k = O(log p); curves for p = 100, 200, 300, 600, 1200]

    Scaling is quadratic in sparsity: n/(k² log p)

  • Diagonal thresholding and fundamental limit

    Consider the spiked covariance matrix

        Σ = ν θ∗(θ∗)^T + I_{p×p},    where θ∗ ∈ B_0(k) ∩ B_2(1) is k-sparse with unit norm, and ν is the signal-to-noise ratio

    Theorem (Amini & W., 2009)

    (a) There are thresholds τ_ℓ^DT ≤ τ_u^DT such that

        n/(k² log p) ≤ τ_ℓ^DT  =⇒  DT fails w.h.p.,    and    n/(k² log p) ≥ τ_u^DT  =⇒  DT succeeds w.h.p.

    (b) For the optimal method (exhaustive search), there is a threshold τ^ES such that

        n/(k log p) < τ^ES  =⇒  it fails w.h.p.,    and    n/(k log p) > τ^ES  =⇒  it succeeds w.h.p.

  • One polynomial-time SDP relaxation

    Recall the Courant-Fischer variational formulation:

        θ∗ = arg max_{‖z‖₂=1} { z^T ( ν θ∗(θ∗)^T + I_{p×p} ) z },    where ν θ∗(θ∗)^T + I_{p×p} is the population covariance Σ.

    Equivalently, lifting to matrix variables Z = zz^T:

        θ∗(θ∗)^T = arg max_{Z∈R^{p×p}, Z=Z^T, Z⪰0} { trace(Z^T Σ) }    s.t. trace(Z) = 1 and rank(Z) = 1.

    Dropping the rank constraint yields a standard SDP relaxation:

        Ẑ = arg max_{Z∈R^{p×p}, Z=Z^T, Z⪰0} { trace(Z^T Σ) }    s.t. trace(Z) = 1.

    In practice (d'Aspremont et al., 2008):

    apply this relaxation using the sample covariance matrix Σ̂

    add the ℓ1-constraint Σ_{i,j=1}^{p} |Z_ij| ≤ k
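    A minimal cvxpy sketch of this relaxation; the solver choice is left to cvxpy's defaults, and the objective trace(Σ̂ Z) is written elementwise, which is valid because both matrices are symmetric.

    ```python
    import cvxpy as cp
    import numpy as np

    def sparse_pca_sdp(Sigma_hat, k):
        """Sketch of the SDP relaxation: maximize trace(Σ̂ Z) over symmetric PSD Z
        with trace(Z) = 1 and the l1-constraint Σ_ij |Z_ij| ≤ k, then return the
        leading eigenvector of the optimal Z."""
        p = Sigma_hat.shape[0]
        Z = cp.Variable((p, p), symmetric=True)
        constraints = [Z >> 0, cp.trace(Z) == 1, cp.sum(cp.abs(Z)) <= k]
        # trace(Σ̂ Z) = Σ_ij Σ̂_ij Z_ij for symmetric matrices
        objective = cp.Maximize(cp.sum(cp.multiply(Sigma_hat, Z)))
        cp.Problem(objective, constraints).solve()
        _, eigvecs = np.linalg.eigh(Z.value)
        return eigvecs[:, -1]
    ```

    Applied to the Σ̂ from the earlier sampling sketch, the leading eigenvector of Ẑ serves as the sparse-PCA estimate of θ∗.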

  • Phase transition for SDP: logarithmic sparsity

    [Figure: success probability vs. control parameter for the SDP relaxation with k = O(log p); curves for p = 100, 200, 300]

    Scaling is linear in sparsity: n/(k log p)

  • A natural question

    Question:

    Can the logarithmic-sparsity or rank-one conditions be removed?

  • Computational lower bound for sparse PCA

    Consider the spiked covariance matrix

        Σ = ν (θ∗)(θ∗)^T + I_{p×p},    where θ∗ ∈ B_0(k) ∩ B_2(1) is k-sparse with unit norm, and ν is the signal-to-noise ratio

    Sparse PCA detection problem:

        H0 (no signal):      X_i ∼ D(0, I_{p×p})
        H1 (spiked signal):  X_i ∼ D(0, Σ)

    where D is a distribution with sub-exponential tail behavior.

    Theorem (Berthet & Rigollet, 2013)

    Under average-case hardness of planted clique, the polynomial-minimax level of detection ν^POLY is given by

        ν^POLY ≍ (k² log p)/n    for all sparsity levels log p ≪ k ≪ √p.

    The classical minimax level is ν^OPT ≍ (k log p)/n.

  • Planted k-clique problem

    [Figures: an Erdős–Rényi graph and a graph with a planted k-clique]

    Binary hypothesis test based on observing a random graph G on p vertices:

        H0: Erdős–Rényi, each edge present independently with probability 1/2
        H1: planted k-clique, remaining edges random

  • Planted k-clique problem

    [Figures: a random binary matrix and a matrix with a planted k × k sub-matrix]

    Binary hypothesis test based on observing a random binary matrix:

        H0: random {0, 1} matrix with Ber(1/2) entries off the diagonal
        H1: planted k × k sub-matrix, remaining entries random
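    For intuition, here is an illustrative generator for the matrix version of the testing problem; the sizes are arbitrary and the handling of diagonal entries is simplified.

    ```python
    import numpy as np

    def sample_matrix(p, k, planted, rng):
        """H0: entries i.i.d. Ber(1/2); H1: additionally hide an all-ones k x k
        sub-matrix on a random choice of k rows and k columns."""
        M = rng.integers(0, 2, size=(p, p))
        if planted:
            rows = rng.choice(p, size=k, replace=False)
            cols = rng.choice(p, size=k, replace=False)
            M[np.ix_(rows, cols)] = 1             # planted k x k block of ones
        return M

    rng = np.random.default_rng(0)
    M0 = sample_matrix(200, 20, planted=False, rng=rng)
    M1 = sample_matrix(200, 20, planted=True, rng=rng)
    ```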

  • Vignette C: (Structured) non-parametric regression

    Goal: How to predict output from covariates?

    given covariates (x_1, x_2, ..., x_p) and an output variable y, we want to predict y based on (x_1, ..., x_p)

    Examples: medical imaging, geostatistics, astronomy, computational biology, ...

    [Figures: fitted function versus design value x — (a) second-order polynomial fit, (b) Lipschitz (first-order Sobolev) function fit]

  • High dimensions and sample complexity

    Possible models:

    ordinary linear regression:  y = Σ_{j=1}^{p} θ∗_j x_j + w = ⟨θ∗, x⟩ + w

    general non-parametric model:  y = f∗(x_1, x_2, ..., x_p) + w

    Estimation accuracy: How well can f∗ be estimated using n samples?

    linear models
      ◮ without any structure: error δ² ≍ p/n    (linear in p)
      ◮ with sparsity k ≪ p: error δ² ≍ (k log(ep/k))/n    (logarithmic in p)

    non-parametric models: p-dimensional, smoothness α
      ◮ Curse of dimensionality: error δ² ≍ (1/n)^{2α/(2α+p)}    (exponential slow-down)

  • Minimax risk and sample size

    Consider a function class F, and n i.i.d. samples from the model

        y_i = f∗(x_i) + w_i,    where f∗ is some member of F.

    For a given estimator {(x_i, y_i)}_{i=1}^{n} ↦ f̂ ∈ F, the worst-case risk in a metric ρ is

        R_n^worst(f̂; F) = sup_{f∗∈F} E_n[ρ²(f̂, f∗)].

    Minimax risk

    For a given sample size n, the minimax risk is

        inf_{f̂} R_n^worst(f̂; F) = inf_{f̂} sup_{f∗∈F} E_n[ρ²(f̂, f∗)],

    where the infimum is taken over all estimators.


  • How to measure “size” of function classes?

    A 2δ-packing of F in the metric ρ is a collection {f¹, ..., f^M} ⊂ F such that

        ρ(f^j, f^k) > 2δ    for all j ≠ k.

    The packing number M(2δ) is the cardinality of the largest such set.

    Packing/covering entropy: emerged from the Russian school in the 1940s/1950s (Kolmogorov and collaborators)

    Central object in proving minimax lower bounds for nonparametric problems (e.g., Hasminskii & Ibragimov, 1978; Birge, 1983; Yu, 1997; Yang & Barron, 1999)

  • Packing and covering numbers

    [Figure: a 2δ-packing f¹, f², f³, f⁴, ..., f^M of the class F]


  • Example: Sup-norm packing for Lipschitz functions

    [Figure: piecewise-linear packing functions with slopes ±L]

    δ-packing set: functions {f¹, f², ..., f^M} such that ‖f^j − f^k‖_∞ > δ for all j ≠ k

    for L-Lipschitz functions in one dimension:

        M(δ) ≍ 2^{L/δ}.
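    One standard construction achieving this scaling (a sketch, which may differ in details from the family pictured on the slide) uses ±L triangular bumps:

    ```latex
    % Partition [0,1] into m = floor(L/(2*delta)) intervals of width 1/m, and let
    % phi_j be the triangular bump on the j-th interval with slopes +/- L and peak
    % height L/(2m) >= delta.
    \begin{gather*}
    f_\varepsilon = \sum_{j=1}^{m} \varepsilon_j \phi_j,
      \qquad \varepsilon \in \{-1,+1\}^m
      \qquad (\text{each } f_\varepsilon \text{ is } L\text{-Lipschitz}), \\
    \varepsilon \neq \varepsilon' \;\Longrightarrow\;
      \|f_\varepsilon - f_{\varepsilon'}\|_\infty \ge 2\cdot\frac{L}{2m} \ge 2\delta > \delta,
      \qquad\text{so}\qquad
      M(\delta) \;\ge\; 2^{m} \;=\; 2^{\lfloor L/(2\delta)\rfloor}.
    \end{gather*}
    ```

    A matching upper bound of the same order follows from a covering argument, giving log M(δ) ≍ L/δ.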

  • Standard reduction: from estimation to testing

    [Figure: a 2δ-packing f¹, f², f³, f⁴, ..., f^M of the class F]

    goal: characterize the minimax risk for ρ-estimation over F

    construct a 2δ-packing of F: a collection {f¹, ..., f^M} such that ρ(f^j, f^k) > 2δ

    now form an M-component mixture distribution as follows:
      ◮ draw a packing index V ∈ {1, ..., M} uniformly at random
      ◮ conditioned on V = j, draw n i.i.d. samples (X_i, Y_i) ∼ Q_{f^j}

    1 Claim: Any estimator f̂ such that ρ(f̂, f^V) ≤ δ w.h.p. can be used to solve the M-ary testing problem.

    2 Use standard techniques (Assouad, Le Cam, Fano) to lower bound the probability of error in the testing problem.
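    Writing out how the claim converts into a risk lower bound (a standard chain, spelled out here for completeness rather than copied from the slides): Markov's inequality plus the 2δ-separation reduce estimation to testing, and Fano's inequality bounds the testing error:

    ```latex
    \begin{align*}
    \inf_{\hat f}\,\sup_{f^* \in \mathcal{F}} \mathbb{E}\big[\rho^2(\hat f, f^*)\big]
      \;\ge\; \delta^2 \,\inf_{\psi}\, \mathbb{P}\big[\psi(X_1^n, Y_1^n) \neq V\big]
      \;\ge\; \delta^2 \left(1 - \frac{I\big(V; (X_1^n, Y_1^n)\big) + \log 2}{\log M}\right),
    \end{align*}
    % where V is the uniformly distributed packing index and M = M(2δ) is the
    % packing number; choosing δ so that the bracket stays bounded away from zero
    % yields the minimax lower bound.
    ```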

  • Minimax rate via metric entropy matching

    observe (x_i, y_i) pairs from the model y_i = f∗(x_i) + w_i

    two possible norms:

        ‖f̂ − f∗‖_n² := (1/n) Σ_{i=1}^{n} (f̂(x_i) − f∗(x_i))²,    or    ‖f̂ − f∗‖_2² := E[(f̂(X̃) − f∗(X̃))²].

    Metric entropy master equation

    For many regression problems, the minimax rate δ_n > 0 is determined by solving the master equation

        log M(2δ; F, ‖·‖) ≍ n δ².

    basic idea (with the Hellinger metric) dates back to Le Cam (1973)

    elegant general version due to Yang and Barron (1999)


  • Example 1: Sparse linear regression

    Observations y_i = ⟨x_i, θ∗⟩ + w_i, where

        θ∗ ∈ Θ(k, p) := {θ ∈ R^p | ‖θ‖_0 ≤ k and ‖θ‖_2 ≤ 1}.

    Gilbert-Varshamov: can construct a 2δ-separated set with log M(2δ) ≍ k log(ep/k) elements

    Master equation and minimax rate

        log M(2δ) ≍ n δ²   ⟺   δ² ≍ (k log(ep/k))/n.

    Polynomial-time achievability:

    by ℓ1-relaxations under restricted eigenvalue (RE) conditions (Candes & Tao, 2007; Bickel et al., 2009; Buhlmann & van de Geer, 2011)

    achieve minimax-optimal rates for ℓ2-error (Raskutti, W., & Yu, 2011)

    ℓ1-methods can be sub-optimal for the prediction error ‖X(θ̂ − θ∗)‖_2/√n (Zhang, W. & Jordan, 2014)
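    As a quick numerical sanity check of the k log(ep/k)/n scaling (a sketch, not from the slides: the Gaussian design, noise level, and the standard λ ≍ σ √(log p / n) regularization choice are illustrative assumptions):

    ```python
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)
    n, p, k, sigma = 400, 1000, 10, 0.5          # illustrative problem sizes

    theta_star = np.zeros(p)
    theta_star[:k] = 1.0 / np.sqrt(k)            # k-sparse, unit-norm target
    X = rng.standard_normal((n, p))
    y = X @ theta_star + sigma * rng.standard_normal(n)

    # Theory-style choice lambda ~ sigma * sqrt(log(p)/n), up to constants;
    # sklearn's Lasso minimizes (1/2n)||y - X theta||_2^2 + alpha ||theta||_1.
    lam = 2 * sigma * np.sqrt(np.log(p) / n)
    theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

    err2 = np.sum((theta_hat - theta_star) ** 2)
    rate = k * np.log(np.e * p / k) / n          # minimax scale from the master equation
    print(f"squared l2 error {err2:.4f}  vs.  k log(ep/k)/n = {rate:.4f}")
    ```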

  • Example 2: α-smooth non-parametric regression

    Observations y_i = f∗(x_i) + w_i, where

        f∗ ∈ F(α) = {f : [0, 1] → R | f is α-times differentiable, with Σ_{j=0}^{α} ‖f^(j)‖_∞ ≤ C}.

    Classical results in approximation theory (e.g., Kolmogorov & Tikhomirov, 1959):

        log M(2δ; F) ≍ (1/δ)^{1/α}

    Master equation and minimax rate

        log M(2δ) ≍ n δ²   ⟺   δ² ≍ (1/n)^{2α/(2α+1)}.

  • Example 3: Sparse additive and α-smooth

    Observations y_i = f∗(x_i) + w_i, where f∗ satisfies the constraints:

    Additively decomposable: f∗(x_1, ..., x_p) = Σ_{j=1}^{p} g∗_j(x_j)

    Sparse: at most k of (g∗_1, ..., g∗_p) are non-zero

    Smooth: each g∗_j belongs to the α-smooth family F(α)

    Combining the previous results yields a 2δ-packing with

        log M(2δ; F) ≍ k (1/δ)^{1/α} + k log(ep/k)

    Master equation and minimax rate

    Solving log M(2δ) ≍ n δ² yields

        δ² ≍ k (1/n)^{2α/(2α+1)}    (k-component estimation)    +    (k log(ep/k))/n    (search complexity)

  • Summary

    To be provided during the tutorial...