
Information-Theoretic Limits for Inference, Learning, and Optimization

Part 1: Fano’s Inequality for Statistical Estimation

Jonathan Scarlett

Croucher Summer Course in Information Theory (CSCIT) [July 2019]


Information Theory

• How do we quantify “information” in data?

• Information theory [Shannon, 1948]:

I Fundamental limits of data communication

[Block diagram: source (U1, . . . , Up) → encoder → codeword (X1, . . . , Xn) → channel → output (Y1, . . . , Yn) → decoder → reconstruction (Û1, . . . , Ûp)]

I Information of source: Entropy

I Information learned at channel output: Mutual information

Principles:

I First fundamental limits without complexity constraints, then practical methods

I First asymptotic analyses, then convergence rates, finite-length, etc.

I Mathematically tractable probabilistic models

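To make these two quantities concrete, here is a minimal sketch (my own toy example with an assumed crossover probability, not taken from the slides) that evaluates the entropy of a uniform binary source and the mutual information across a binary symmetric channel:

```python
import numpy as np

def binary_entropy(p):
    """Binary entropy H2(p) in bits."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

rho = 0.1                                   # assumed crossover probability
H_source = binary_entropy(0.5)              # entropy of a uniform binary source: 1 bit
I_channel = binary_entropy(0.5) - binary_entropy(rho)  # I(X;Y) = H(Y) - H(Y|X) for a uniform input
print(f"H(U) = {H_source:.3f} bits, I(X;Y) = {I_channel:.3f} bits")
```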

Information Theory and Data

• Conventional view:

Information theory is a theory of communication

[Diagram: Information Theory encircles Storage & Transmission only; Data Generation, Inference & Learning, and Optimization sit outside]

• Emerging view:

Information theory is a theory of data

[Diagram: Information Theory now also encircles Data Generation, Inference & Learning, and Optimization, alongside Storage & Transmission]

• Extracting information from channel output vs. Extracting information from data


Examples

• Information theory in machine learning and statistics:

I Statistical estimation [Le Cam, 1973]

I Group testing [Malyutov, 1978]

I Multi-armed bandits [Lai and Robbins, 1985]

I Phylogeny [Mossel, 2004]

I Sparse recovery [Wainwright, 2009]

I Graphical model selection [Santhanam and Wainwright, 2012]

I Convex optimization [Agarwal et al., 2012]

I DNA sequencing [Motahari et al., 2012]

I Sparse PCA [Birnbaum et al., 2013]

I Community detection [Abbe, 2014]

I Matrix completion [Riegler et al., 2015]

I Ranking [Shah and Wainwright, 2015]

I Adaptive data analysis [Russo and Zou, 2015]

I Supervised learning [Nokleby, 2016]

I Crowdsourcing [Lahouti and Hassibi, 2016]

I Distributed computation [Lee et al., 2018]

I Bayesian optimization [Scarlett, 2018]

• Note: More than just using entropy / mutual information...


Analogies

Same concepts, different terminology:

Communication Problems        | Data Problems
------------------------------+--------------------------------------
Channels with feedback        | Active learning / adaptivity
Rate-distortion theory        | Approximate recovery
Joint source-channel coding   | Non-uniform prior
Error probability             | Error probability
Random coding                 | Random sampling
Side information              | Side information
Channels with memory          | Statistically dependent measurements
Mismatched decoding           | Model mismatch
. . .                         | . . .


Cautionary Notes

Some cautionary notes on the information-theoretic viewpoint:

I The simple models we can analyze may be over-simplified (more so than in communication)

I Compared to communication, we often can’t get matching achievability/converse (often we settle for the correct scaling laws)

I Information-theoretic limits are not (yet) considered much in practice (to my knowledge) ... but they do guide algorithm design

I We often encounter gaps between information-theoretic limits and computational limits

I Often information theory simply isn’t the right tool for the job


Terminology: Achievability and Converse

Achievability result (example): Given n(ε) data samples, there exists an algorithm achieving an “error” of at most ε

I Estimation error: ‖θ̂ − θtrue‖ ≤ ε
I Optimization error: f(xselected) ≤ minx f(x) + ε

Converse result (example): In order to achieve an “error” of at most ε, any algorithm requires at least n(ε) data samples


Converse Bounds for Statistical Estimation via Fano’s Inequality

(Based on survey chapter https://arxiv.org/abs/1901.00555)


Statistical Estimation

[Diagram: unknown parameter θ → samples Y (with inputs X) → algorithm → estimate θ̂]

• General statistical estimation setup:

I Unknown parameter θ ∈ Θ
I Samples Y = (Y1, . . . , Yn) drawn from Pθ(y)

I More generally, from Pθ,X with inputs X = (X1, . . . , Xn)

I Given Y (and possibly X), construct estimate θ̂

• Goal. Minimize some loss ℓ(θ̂, θ)

I 0-1 loss: ℓ(θ̂, θ) = 1{θ̂ ≠ θ}
I Squared ℓ2 loss: ℓ(θ̂, θ) = ‖θ̂ − θ‖²

• Typical example. Linear regression

I Estimate θ ∈ Rp from Y = Xθ + Z

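As a quick illustration of this example (an assumed toy instance, not taken from the slides), the sketch below draws Y = Xθ + Z with Gaussian noise and reports the squared ℓ2 loss of the least-squares estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5                        # n samples, p-dimensional parameter (assumed sizes)
theta_true = rng.normal(size=p)
X = rng.normal(size=(n, p))          # non-adaptive design
Z = rng.normal(scale=0.5, size=n)    # noise
Y = X @ theta_true + Z

theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)   # least-squares estimate
loss = np.sum((theta_hat - theta_true) ** 2)        # squared l2 loss
print(f"squared l2 loss: {loss:.4f}")
```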

Defining Features

• There are many properties that impact the analysis:

I Discrete θ (e.g., graph learning, sparsity pattern recovery)

I Continuous θ (e.g., regression, density estimation)

I Bayesian θ (average-case performance)

I Minimax bounds over Θ (worst-case performance)

I Non-adaptive inputs (all X1, . . . ,Xn chosen in advance)

I Adaptive inputs (Xi can be chosen based on Y1, . . . ,Yi−1)

• This talk. Minimax bounds, mostly non-adaptive, first discrete and then continuous


High-Level Steps

Steps in attaining a minimax lower bound (converse):

1. Reduce estimation problem to multiple hypothesis testing

2. Apply a form of Fano’s inequality

3. Bound the resulting mutual information term

(Multiple hypothesis testing: Given samples Y1, . . . , Yn, determine which distribution among P1(y), . . . , PM(y) generated them. M = 2 gives binary hypothesis testing.)


Step I: Reduction to Multiple Hypothesis Testing

• Lower bound worst-case error by average over hard subset θ1, . . . , θM :

[Diagram: index V → select parameter θV → algorithm observes samples (X, Y) → estimate θ̂ → inferred index V̂]

Idea:

I Show “successful” algorithm θ̂ =⇒ correct estimation of V (When is this true?)

I Equivalent statement: If V can’t be estimated reliably, then θ̂ can’t be successful.


Step I: Example

• Example: Suppose the algorithm is claimed to return θ̂ such that ‖θ̂ − θ‖2 ≤ ε

• If θ1, . . . , θM are separated by 2ε, then we can identify the correct V ∈ {1, . . . , M}

• Note: Tension between the number of hypotheses, the difficulty in distinguishing them, and sufficient separation. Choosing a suitable set θ1, . . . , θM can be very challenging.

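The following small numeric check (a hypothetical instance with assumed values, not from the slides) illustrates the reduction: with hypotheses separated by 2ε, any estimate within ε of the true θv identifies V via the nearest-hypothesis rule.

```python
import numpy as np

eps = 0.5
thetas = np.array([[0.0, 0.0], [2 * eps, 0.0], [0.0, 2 * eps]])  # pairwise distances >= 2*eps
v_true = 1
theta_hat = thetas[v_true] + np.array([0.3, -0.2])   # an estimate within eps of theta_{v_true}

assert np.linalg.norm(theta_hat - thetas[v_true]) <= eps
v_hat = int(np.argmin(np.linalg.norm(thetas - theta_hat, axis=1)))  # nearest-hypothesis rule
print(v_hat == v_true)   # True: accurate estimation implies correct identification of V
```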

Step II: Application of Fano’s Inequality

• Standard form of Fano’s inequality for M-ary hypothesis testing and uniform V :

P[V̂ ≠ V] ≥ 1 − (I(V; V̂) + log 2) / log M.

I Intuition: Need the learned information I(V; V̂) to be close to the prior uncertainty log M

• Variations:

I Non-uniform V

I Approximate recovery

I Conditional version

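A direct way to see the intuition is to plug assumed numbers into the bound; the short sketch below (not part of the slides) evaluates the right-hand side for a given mutual information and number of hypotheses, using natural logarithms.

```python
import numpy as np

def fano_lower_bound(mutual_info_nats, M):
    """Lower bound on P[V_hat != V] for uniform V over M hypotheses (natural logs)."""
    return max(0.0, 1.0 - (mutual_info_nats + np.log(2)) / np.log(M))

# Assumed example: M = 1024 hypotheses, I(V; V_hat) = 3 nats of learned information
print(fano_lower_bound(3.0, 1024))   # ~0.467: error probability at least ~47%
```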

Step III: Bounding the Mutual Information

• The key quantity remaining after applying Fano’s inequality is I(V; V̂)

• Data processing inequality: (Based on V → Y → V̂ or similar)

I No inputs: I(V; V̂) ≤ I(V; Y)

I Non-adaptive inputs: I(V; V̂ | X) ≤ I(V; Y | X)

I Adaptive inputs: I(V; V̂) ≤ I(V; X, Y)

• Tensorization: (Based on conditional independence of the samples)

I No inputs: I(V; Y) ≤ Σ_{i=1}^n I(V; Yi)

I Non-adaptive inputs: I(V; Y | X) ≤ Σ_{i=1}^n I(V; Yi | Xi)

I Adaptive inputs: I(V; X, Y) ≤ Σ_{i=1}^n I(V; Yi | Xi)

• KL Divergence Bounds:

I I(V; Y) ≤ max_{v,v′} D(P_{Y|V}(·|v) ‖ P_{Y|V}(·|v′))

I I(V; Y) ≤ max_v D(P_{Y|V}(·|v) ‖ Q_Y) for any Q_Y

I If each P_{Y|V}(·|v) is ε-close to the closest of Q1(y), . . . , QN(y) in KL divergence, then I(V; Y) ≤ log N + ε

I (Similarly with conditioning on X)

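As a sanity check on the tensorization step, the sketch below (an assumed two-hypothesis example with samples that are i.i.d. Bernoulli given V, not from the slides) computes I(V; Y) exactly by enumeration and verifies that it is at most Σ_i I(V; Yi):

```python
import numpy as np
from itertools import product

q = {0: 0.3, 1: 0.7}   # assumed model: P[Y_i = 1 | V = v], samples i.i.d. given V
n = 3

def p_y_given_v(y, v):
    return np.prod([q[v] if yi == 1 else 1 - q[v] for yi in y])

# Exact I(V; Y) in nats for V uniform on {0, 1}, by enumerating all y in {0,1}^n
I_joint = 0.0
for y in product([0, 1], repeat=n):
    p_y = 0.5 * p_y_given_v(y, 0) + 0.5 * p_y_given_v(y, 1)
    for v in (0, 1):
        I_joint += 0.5 * p_y_given_v(y, v) * np.log(p_y_given_v(y, v) / p_y)

# Tensorized upper bound: sum_i I(V; Y_i) = n * I(V; Y_1)
I_single = 0.0
for y1 in (0, 1):
    p_y1 = 0.5 * (q[0] if y1 else 1 - q[0]) + 0.5 * (q[1] if y1 else 1 - q[1])
    for v in (0, 1):
        p_v = q[v] if y1 else 1 - q[v]
        I_single += 0.5 * p_v * np.log(p_v / p_y1)

print(I_joint, n * I_single, bool(I_joint <= n * I_single + 1e-12))
```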

Discrete Example 1

Group Testing


Group Testing

[Diagram: items (contaminated or clean) are pooled into tests; each test outcome indicates whether any contaminated item was included]

I Goal:

Given test matrix X and outcomes Y, recover item vector β

I Sample complexity: Required number of tests n


Information Theory and Group Testing

• Information-theoretic viewpoint:

[Diagram: message S (defective set) → encoder → codeword XS (columns of the test matrix indexed by S) → channel Y ∼ P^n_{Y|XS} → decoder → estimate Ŝ]


Information Theory and Group Testing

• Example formulation of general result:

Sample complexity:   n ≈ H(S) / I(P_{Y|XS})

I H(S): entropy of the defective set (model uncertainty)
I I(P_{Y|XS}): mutual information (information learned from measurements)


Converse via Fano’s Inequality

• Reduction to multiple hypothesis testing: Trivial! Set V = S .

• Application of Fano’s Inequality:

P[Ŝ ≠ S] ≥ 1 − (I(S; Ŝ | X) + log 2) / log (p choose k)

• Bounding the mutual information:

I Data processing inequality: I(S; Ŝ | X) ≤ I(U; Y) where U are the pre-noise outputs

I Tensorization: I(U; Y) ≤ Σ_{i=1}^n I(Ui; Yi)

I Capacity bound: I(Ui; Yi) ≤ C if the outcome is passed through a channel of capacity C

• Final result:

n ≤ (log (p choose k) / C) (1 − ε)   =⇒   P[Ŝ ≠ S] ↛ 0

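As an assumed numeric example (values not from the slides): in the noiseless model each test outcome is binary, so the per-test capacity satisfies C ≤ log 2 nats, and the converse can be evaluated directly.

```python
from math import lgamma, log

def log_binom(p, k):
    """log of (p choose k) via log-gamma."""
    return lgamma(p + 1) - lgamma(k + 1) - lgamma(p - k + 1)

p, k = 10_000, 20                  # assumed problem size
C = log(2)                         # per-test capacity bound for binary outcomes (nats)
n_required = log_binom(p, k) / C   # with fewer tests than this, P[S_hat != S] does not vanish
print(f"converse: need more than about {n_required:.0f} tests")
```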

Noiseless Group Testing (Exact Recovery)

[Plot: rate vs. sparsity parameter for noiseless group testing; curves labelled Information-Theoretic, Counting Bound, DD, COMP, Bernoulli, and Near-Const.]

• Other Implications:

I Adaptivity and approximate recovery:
    I No gain at low sparsity levels
    I Significant gain at high sparsity levels

I Efficient algorithm optimal (w.r.t. random test design) at high sparsity levels

I Analogous results in noisy settings


Detour: An Information-Theoretic Achievability Example


A Practical Approach: Separate Decoding of Items

• Searching over (p choose k) subsets is infeasible

• Practical alternative: Separate decoding of items

[Diagram: each item’s column of the test matrix is decoded separately from the outcomes]

• Channel coding interpretation:

[Diagram: p parallel chains, one per item j = 1, . . . , p: Encoder j → Xj → Channel j → Y → Decoder j]

• Claim: Asymptotic optimality to within a factor of log 2 ≈ 0.7 (sometimes)


Example Results (Symmetric Noise)

• Summary of results in the symmetric noise setting: [Scarlett/Cevher, 2018]

I Joint: Intractable joint decoding rule (otherwise separate)

I False pos/neg: Small #false positives/negatives allowed in reconstruction

[Plot: asymptotic ratio of k log2(p/k) to n vs. the value θ such that k = Θ(p^θ); curves for Joint false pos/neg, Joint exact, False pos/neg, False neg, False pos, Exact, and Best previous practical]

Note: Within a factor log 2 ≈ 0.7 of the information-theoretic optimum at low sparsity


SDI Achievability Proof 1 (Notation)

• Define the defectivity indicator random variable

βj = 1{j ∈ S}

• Separate decoding of items: Decode βj based only on Xj and Y

I Xj : j-th column of X

• Binary hypothesis testing based rule:

β̂j(Xj, Y) = 1{ Σ_{i=1}^n log [ P_{Y|Xj,βj}(Y^(i) | Xj^(i), 1) / P_Y(Y^(i)) ] > γ }

I X^(i) : i-th row of X

I Y^(i) : i-th test outcome

I γ : threshold to be set later

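The rule above can be simulated directly. The sketch below is my own toy instance (not from the slides): noiseless OR outcomes, a Bernoulli test design with an assumed column weight, and an assumed threshold corresponding to δ1 = 1; each item is decoded separately by thresholding its accumulated information density.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, n = 500, 5, 400
q = np.log(2) / k                                    # assumed Bernoulli(q) test design

S = set(rng.choice(p, size=k, replace=False))        # defective set
X = rng.random((n, p)) < q                           # test matrix (True = item included)
Y = X[:, sorted(S)].any(axis=1)                      # noiseless OR outcomes

p_y1 = 1 - (1 - q) ** k                              # P[Y = 1]
p_y1_def = {1: 1.0, 0: 1 - (1 - q) ** (k - 1)}       # P[Y = 1 | X_j = x, j defective]

def info_density(x, y):
    """log P(y | X_j = x, beta_j = 1) - log P(y); log 0 handled as -inf."""
    num = p_y1_def[x] if y else 1 - p_y1_def[x]
    den = p_y1 if y else 1 - p_y1
    return -np.inf if num == 0 else np.log(num) - np.log(den)

gamma = np.log(p - k)                                # assumed threshold (delta_1 = 1)
scores = [sum(info_density(int(X[i, j]), bool(Y[i])) for i in range(n)) for j in range(p)]
S_hat = {j for j in range(p) if scores[j] > gamma}
print(sorted(S), sorted(S_hat))                      # typically identical sets
```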

SDI Achievability Proof 2 (Non-Asymptotic Bound)

Non-asymptotic bound: Under Bernoulli testing, we have

Pe ≤ k · P[ı₁ⁿ(X1, Y) ≤ γ] + (p − k) e^{−γ}

where

ı₁ⁿ(Xj, Y) := Σ_{i=1}^n ı₁(Xj^(i), Y^(i)),    ı₁(x1, y) := log [ P_{Y|X1}(y | x1) / P_Y(y) ]

is the information density

Proof idea:

I First term: Union bound over k false negative events

I Second term: Union bound over p − k false positive events & change of measure


SDI Achievability Proof 3 (Asymptotic Analysis)

Simplification: If we have concentration of the form

P[ı₁ⁿ(X1, Y) ≤ n I(X1; Y)(1 − δ2)] ≤ ψn(δ2)

then the following conditions suffice for Pe → 0:

n ≥ log((p − k)/δ1) / [I(X1; Y)(1 − δ2)]

k · ψn(δ2) → 0

Notes

I Proved by setting γ = log((p − k)/δ1) in the initial result

I With false positives, can instead use γ = log((p − k)/(k δ1))

I With false negatives, it suffices that ψn(δ2) → 0


SDI Achievability Proof 4 (Concentration Bounds)

Bernstein’s inequality:

P[ı₁ⁿ(X1, Y) ≤ n I1 (1 − δ2)] ≤ exp( − (1/2) · (n/k) · cmean² δ2² / (cvar + (1/3) cmean cmax δ2) )

for suitable constants cmean, cvar, cmax

Refined concentration for noiseless model:

P[ı₁ⁿ(X1, Y) ≤ n I1 (1 − δ2)] ≤ exp( − (n log 2 / k) ((1 − δ2) log(1 − δ2) + δ2) (1 + o(1)) )


SDI Achievability Proof 5 (Final Conditions)

Final noiseless result: We have Pe → 0 if

n ≥ min_{δ2>0} max{ k log p / ((log 2)² (1 − δ2)),  k log k / ((log 2)((1 − δ2) log(1 − δ2) + δ2)) } (1 + η)

Final noisy result: We have Pe → 0 if

n ≥ min_{δ2>0} max{ k log p / ((log 2)(log 2 − H2(ρ))(1 − δ2)),  (k log k)(cvar + (1/3) cmean cmax δ2) / ((1/2) cmean² δ2²) } (1 + η)

Approximate recovery: We have Pe^(approx) → 0 if

n ≥ (log(p/k) / I(X1; Y)) (1 + η)

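The noiseless condition can be evaluated numerically by a grid search over δ2; the sketch below (assumed problem sizes, η = 0, and constant factors exactly as in the displayed condition; not from the slides) does this.

```python
import numpy as np

def sdi_noiseless_tests(p, k, eta=0.0):
    """Minimise the max of the two displayed terms over delta_2 on a grid."""
    d2 = np.linspace(1e-3, 0.999, 2000)
    term1 = k * np.log(p) / ((np.log(2) ** 2) * (1 - d2))
    term2 = k * np.log(k) / (np.log(2) * ((1 - d2) * np.log(1 - d2) + d2))
    return (1 + eta) * np.min(np.maximum(term1, term2))

print(f"{sdi_noiseless_tests(p=10_000, k=20):.0f} tests suffice (asymptotically, up to 1 + o(1))")
```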

Discrete Example 2

Graphical Model Selection


Graphical Model Representations of Joint Distributions

Motivating example:

I In a population of p people, let

Yi = 1 if person i is infected,  −1 if person i is healthy,   for i = 1, . . . , p

I Example models:

[Abbe and Wainwright, ISIT Tutorial, 2015]

I Joint distribution for a given graph G = (V ,E):

P[(Y1, . . . , Yp) = (y1, . . . , yp)] = (1/Z) exp( Σ_{(i,j)∈E} λij yi yj )

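To make the definition concrete, here is a minimal sketch (an assumed 4-node cycle with uniform edge strength, not an example from the slides) that enumerates this joint distribution, including the normalising constant Z:

```python
import numpy as np
from itertools import product

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # assumed graph: 4-node cycle
lam = 0.4                                   # assumed uniform edge strength lambda_ij
num_nodes = 4

def weight(y):
    """Unnormalised weight exp(sum over edges of lambda * y_i * y_j)."""
    return np.exp(sum(lam * y[i] * y[j] for (i, j) in edges))

configs = list(product([-1, 1], repeat=num_nodes))
Z = sum(weight(y) for y in configs)                    # partition function
probs = {y: weight(y) / Z for y in configs}            # P[(Y_1, ..., Y_p) = y]

# With lambda > 0, neighbouring variables agree more often than not
print(sum(pr for y, pr in probs.items() if y[0] == y[1]))   # > 0.5
```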

Graphical Model Selection: Illustration

I A larger example from [Abbe and Wainwright, ISIT Tutorial, 2015]:

I Example graphs:

I Sample images (Ising model):

I Goal: Identify graph given n independent samples


Graphical Model Selection: Definition

General problem statement.

I Given n i.i.d. samples of (Y1, . . . , Yp) ∼ PG, recover the underlying graph G

I Applications: Statistical physics, social and biological networks

I Error probability:  Pe = max_{G∈𝒢} P[Ĝ ≠ G | G].

Assumptions.

I Distribution class:

    I Ising model

        PG(x1, . . . , xp) = (1/Z) exp( Σ_{(i,j)∈E} λij xi xj )

    I Gaussian model

        (X1, . . . , Xp) ∼ N(µ, Σ)

        where (Σ⁻¹)ij ≠ 0 ⇐⇒ (i, j) ∈ E  [Hammersley-Clifford theorem]

I Graph class:

    I Bounded-edge (at most k edges total)
    I Bounded-degree (at most d edges out of each node)


Information-Theoretic Viewpoint

• Information-theoretic viewpoint:

[Diagram: graph G → channel P_{Y|G} → samples Y → decoder → estimate Ĝ]


Converse via Fano’s Inequality

• Reduction to multiple hypothesis testing: Let G be uniform on a hard subset 𝒢0 ⊆ 𝒢

I Ideally many graphs (lots of graphs to distinguish)

I Ideally close together (harder to distinguish)

• Application of Fano’s Inequality:

P[Ĝ ≠ G] ≥ 1 − (I(G; Ĝ) + log 2) / log |𝒢0|

• Bounding the mutual information:

I Data processing inequality: I(G; Ĝ) ≤ I(G; Y)

I Tensorization: I(G; Y) ≤ Σ_{i=1}^n I(G; Yi)

I KL divergence bound: Bound I(G; Yi) ≤ max_G D(P_{Y|G}(·|G) ‖ Q_Y) case-by-case


Graph Ensembles

I Graphs that are difficult to distinguish from the empty graph:

I Reveals the necessary condition n = Ω( (1/λ²) log p ) with “edge strength” λ and p nodes

I Graphs that are difficult to distinguish from the complete (sub-)graph:

I Reveals the necessary condition n = Ω( e^{λd} ) with “edge strength” λ and degree d


Upper vs. Lower Bounds

• Example results with maximal degree d , edge strength λ (slightly informal):

I (Converse) n = Ω( max{ 1/λ², e^{λd} } log p )  [Santhanam and Wainwright, 2012]

I (Achievability) n = O( max{ 1/λ², e^{λd} d } log p )  [Santhanam and Wainwright, 2012]

I (Early Practical) n = O(d² log p), but with extra assumptions that are hard to verify [Ravikumar/Wainwright/Lafferty, 2010]

I (Recent Practical) n = O( (d² e^{λd} / λ²) log p )  [Klivans/Meka, 2017]

I ...and more [Lokhov et al., 2018] [Wu/Sanghavi/Dimakis, 2018]


Graphical Model Selection with Twists

• Approximate recovery: [S., Cevher, IEEE TSIPN, 2016]

Pe(dmax) = max_{G∈𝒢} P[dH(Ê, E) > dmax | G]

• Active learning / adaptive sampling [S., Cevher, AISTATS, 2017]

[Diagram: adaptively choose nodes X^(i), sample Y^(i) from the graph G, decoder outputs Ĝ]

• Analogies

Communication Problems   | Data Problems
-------------------------+------------------------------
Feedback                 | Active learning / adaptivity
Rate-distortion theory   | Approximate recovery


What About Continuous-Valued Estimation?


Minimax Risk

• Since the samples are random, so is θ̂ and hence ℓ(θ̂, θ)

• So we seek to minimize the average loss Eθ[ℓ(θ̂, θ)].

I Note: Eθ and Pθ denote averages w.r.t. Y when the true parameter is θ.

• Minimax risk:

Mn(Θ, ℓ) = inf_{θ̂} sup_{θ∈Θ} Eθ[ℓ(θ̂, θ)],

i.e., the worst-case average loss over all θ ∈ Θ

• Approach: Lower bound worst-case error by average over hard subset θ1, . . . , θM :

[Diagram: index V → select parameter θV → algorithm observes samples (X, Y) → estimate θ̂ → inferred index V̂]


General Lower Bound via Fano’s Inequality

• To get a meaningful result, need a sufficiently “well-behaved” loss function. Subsequently, focus on loss functions of the form

ℓ(θ̂, θ) = Φ(ρ(θ̂, θ))

where ρ(θ, θ′) is some metric, and Φ(·) is some non-negative and increasing function (e.g., ℓ(θ̂, θ) = ‖θ̂ − θ‖²)

• Claim. Fix ε > 0, and let θ1, . . . , θM be a finite subset of Θ such that

ρ(θv, θv′) ≥ ε,   ∀ v, v′ ∈ {1, . . . , M}, v ≠ v′.

Then, we have

Mn(Θ, ℓ) ≥ Φ(ε/2) (1 − (I(V; Y) + log 2) / log M),

where V is uniform on {1, . . . , M}, and I(V; Y) is with respect to V → θV → Y.

Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 36/ 53
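The claim is easy to evaluate once a packing and a mutual-information bound are available; the sketch below (assumed numbers and squared-error loss Φ(t) = t², not from the slides) just plugs them in.

```python
import numpy as np

def minimax_lower_bound(eps, M, mutual_info_nats, Phi=lambda t: t ** 2):
    """Phi(eps/2) * (1 - (I(V;Y) + log 2) / log M), clipped at zero (natural logs)."""
    return Phi(eps / 2) * max(0.0, 1.0 - (mutual_info_nats + np.log(2)) / np.log(M))

# Assumed: M = 2^10 eps-separated parameters, I(V; Y) = 2 nats, eps = 0.2
print(minimax_lower_bound(eps=0.2, M=2 ** 10, mutual_info_nats=2.0))
```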

Page 82: Information-Theoretic Limits for Inference, Learning, and ...scarlett/it_data_slides/01-FanoEst.pdf · I First fundamental limits without complexity constraints, then practical methods

General Lower Bound via Fano’s Inequality

• To get a meaningful result, need a sufficiently “well-behaved” loss function.Subsequently, focus on loss functions of the form

`(θ, θ) = Φ(ρ(θ, θ)

)where ρ(θ, θ′) is some metric, and Φ(·) is some non-negative and increasing function

(e.g., `(θ, θ) = ‖θ − θ‖2)

• Claim. Fix ε > 0, and let θ1, . . . , θM be a finite subset of Θ such that

ρ(θv , θv′ ) ≥ ε, ∀v , v ′ ∈ 1, . . . ,M, v 6= v ′.

Then, we have

Mn(Θ, `) ≥ Φ( ε

2

)(1−

I (V ; Y) + log 2

log M

),

where V is uniform on 1, . . . ,M, and I (V ; Y) is with respect to V → θV → Y.

Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 36/ 53

Page 83: Information-Theoretic Limits for Inference, Learning, and Optimization

Proof of General Lower Bound

• Using Markov’s inequality:

supθ∈Θ Eθ[ℓ(θ̂, θ)] ≥ supθ∈Θ Φ(ε0) Pθ[ℓ(θ̂, θ) ≥ Φ(ε0)] = Φ(ε0) supθ∈Θ Pθ[ρ(θ̂, θ) ≥ ε0]

• Suppose that V̂ = arg minj=1,...,M ρ(θj, θ̂). Then by the triangle inequality and ρ(θv, θv′) ≥ ε, if ρ(θv, θ̂) < ε/2 then we must have V̂ = v:

Pθv[ρ(θv, θ̂) ≥ ε/2] ≥ Pθv[V̂ ≠ v].

• Hence,

supθ∈Θ Pθ[ρ(θ̂, θ) ≥ ε/2] ≥ maxv=1,...,M Pθv[ρ(θv, θ̂) ≥ ε/2]
                          ≥ maxv=1,...,M Pθv[V̂ ≠ v]
                          ≥ (1/M) ∑v=1,...,M Pθv[V̂ ≠ v]
                          ≥ 1 − (I(V; Y) + log 2) / log M,

where the final step uses Fano’s inequality.

Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 37/ 53


Page 86: Information-Theoretic Limits for Inference, Learning, and Optimization

Local Approach

• General bound: If ρ(θv, θv′) ≥ ε for all v ≠ v′, then

Mn(Θ, ℓ) ≥ Φ(ε/2) · ( 1 − (I(V; Y) + log 2) / log M )

• Local approach: Carefully-chosen “local” hard subset:

[Figure: an ε/2-separated “local” subset of parameters θv, with the induced distributions Pv contained in a small KL-divergence ball around a common output distribution PY]

Resulting bound:

Mn(Θ, ℓ) ≥ Φ(ε/2) · ( 1 − (maxv=1,...,M D(Pⁿθv ‖ QⁿY) + log 2) / log M ).

Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 38/ 53
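As an illustration of the local approach, the sketch below evaluates this bound for a toy 1-D Gaussian location model with M equispaced means (the model, function names, and parameter values are illustrative assumptions, not from the slides); the product-distribution KL divergences are available in closed form here:

import math

def gaussian_kl_n(theta_v, theta_0, sigma2, n):
    # D(P^n_{theta_v} || Q^n) for n i.i.d. N(theta, sigma2) observations.
    return n * (theta_v - theta_0) ** 2 / (2 * sigma2)

def local_fano_bound(thetas, theta_0, sigma2, n, eps, Phi=lambda t: t):
    # Phi(eps/2) * (1 - (max_v D(P^n_v || Q^n) + log 2) / log M) for an
    # eps-separated "local" subset thetas around the reference theta_0.
    M = len(thetas)
    worst_kl = max(gaussian_kl_n(t, theta_0, sigma2, n) for t in thetas)
    return Phi(eps / 2) * max(0.0, 1.0 - (worst_kl + math.log(2)) / math.log(M))

# Toy usage: M = 16 means spaced eps apart around theta_0 = 0, squared-error loss.
eps, sigma2, n = 0.01, 1.0, 200
thetas = [eps * (v - 7.5) for v in range(16)]
print(local_fano_bound(thetas, 0.0, sigma2, n, eps, Phi=lambda t: t ** 2))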


Page 89: Information-Theoretic Limits for Inference, Learning, and Optimization

Global Approach

• General bound: If ρ(θv, θv′) ≥ ε for all v ≠ v′, then

Mn(Θ, ℓ) ≥ Φ(ε/2) · ( 1 − (I(V; Y) + log 2) / log M )

• Global approach: Pack as many ε-separated points as possible:

[Figure: as many ε-separated parameters θv as possible packed into Θ (global packing), with the corresponding distributions covered by KL-divergence balls around a small set of centres]

I Typically suited to infinite-dimensional problems (e.g., non-parametric regression)

• Resulting bound:

Mn(Θ, ℓ) ≥ Φ(εp/2) · ( 1 − (log N*KL,n(Θ, εc,n) + εc,n + log 2) / log M*ρ(Θ, εp) ).

I M*ρ(Θ, εp): no. of εp-separated parameters θ that can be packed into Θ (packing number)
I N*KL,n(Θ, εc,n): no. of εc,n-radius KL divergence balls needed to cover PY (covering number)

Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 39/ 53
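The packing number is easy to lower-bound numerically by a greedy construction. A minimal sketch (illustrative only: numpy is assumed, and a random point cloud stands in for a concrete Θ) that keeps a point whenever it is at least ε away from every point kept so far:

import numpy as np

def greedy_packing_number(points, eps):
    # Greedily keep points that are at least eps away (Euclidean distance) from
    # every previously kept point; the count lower-bounds the packing number.
    kept = []
    for x in points:
        if all(np.linalg.norm(x - y) >= eps for y in kept):
            kept.append(x)
    return len(kept)

# Toy usage: a random cloud in the unit square with separation eps = 0.2.
rng = np.random.default_rng(0)
cloud = rng.uniform(size=(2000, 2))
print(greedy_packing_number(cloud, eps=0.2))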


Page 92: Information-Theoretic Limits for Inference, Learning, and Optimization

Continuous Example 1

Sparse Linear Regression

Page 93: Information-Theoretic Limits for Inference, Learning, and Optimization

Sparse Linear Regression

• Linear regression model Y = Xθ + Z, where Y ∈ R^n (samples), X ∈ R^(n×p) (feature matrix), θ ∈ R^p (coefficients), and Z ∈ R^n (noise)

I Feature matrix X is given, noise is i.i.d. N(0, σ²)

I Coefficients are sparse – at most k non-zeros

Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 40/ 53
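A minimal sketch of this observation model (the Gaussian design and ±1 signal values below are illustrative choices; the model itself allows any given X):

import numpy as np

def sparse_linear_data(n, p, k, sigma, rng):
    # Draw X, a k-sparse theta, and Y = X @ theta + Z with Z ~ N(0, sigma^2 I).
    X = rng.standard_normal((n, p))                    # feature matrix (given)
    theta = np.zeros(p)
    support = rng.choice(p, size=k, replace=False)     # k non-zero coefficients
    theta[support] = rng.choice([-1.0, 1.0], size=k)
    Y = X @ theta + sigma * rng.standard_normal(n)
    return X, Y, theta

rng = np.random.default_rng(0)
X, Y, theta = sparse_linear_data(n=100, p=500, k=10, sigma=0.5, rng=rng)
print(X.shape, Y.shape, int(np.count_nonzero(theta)))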

Page 94: Information-Theoretic Limits for Inference, Learning, and Optimization

Converse via Fano’s Inequality

• Reduction to hyp. testing: Fix ε > 0 and restrict to sparse vectors of the form

θ = (0, 0, 0,±ε, 0,±ε, 0, 0, 0, 0, 0,±ε, 0)

I Total number of such sequences = 2^k · (p choose k) ≈ exp(k log(p/k)) (if k ≪ p)

I Choose a “well separated” subset of size exp((k/4) log(p/k)) (Gilbert-Varshamov)

I Well-separated: non-zero entries differ in at least k/8 indices

• Application of Fano’s inequality:

I Using the general bound given previously:

Mn(Θ, ℓ) ≥ (kε²/32) · ( 1 − (I(V; Y|X) + log 2) / ((k/4) log(p/k)) )

• Bounding the mutual information:

I By a direct calculation, I(V; Y|X) ≤ (ε²/(2σ²)) · (k/p) · ‖X‖²F (Gaussian RVs)

I Substitute and choose ε to optimize the bound: ε² = σ²p log(p/k) / (2‖X‖²F)

• Final result: If ‖X‖²F ≤ npΓ, then achieving E[‖θ̂ − θ‖₂²] ≤ δ requires n ≥ (cσ²/(Γδ)) · k log(p/k)

Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 41/ 53
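The final sample-complexity expression is straightforward to evaluate. The sketch below does so for illustrative parameter values; since the universal constant c is left unspecified above, c = 1 here is only a placeholder:

import math

def sparse_regression_sample_lower_bound(k, p, sigma2, Gamma, delta, c=1.0):
    # n >= c * sigma^2 / (Gamma * delta) * k * log(p / k); c is the unspecified
    # universal constant from the slide, so c = 1.0 is only a placeholder.
    return c * sigma2 / (Gamma * delta) * k * math.log(p / k)

# Toy usage: p = 10000 features, k = 20 non-zeros, unit noise and power, target MSE 0.1.
print(sparse_regression_sample_lower_bound(k=20, p=10_000, sigma2=1.0, Gamma=1.0, delta=0.1))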


Page 98: Information-Theoretic Limits for Inference, Learning, and Optimization

Upper vs. Lower Bounds

• Recap of model: Y = Xθ + Z, where θ is k-sparse

• Lower bound: If ‖X‖²F ≤ npΓ, achieving E[‖θ̂ − θ‖₂²] ≤ δ requires n ≥ (cσ²/(Γδ)) · k log(p/k)

• Upper bound: If X is a zero-mean random Gaussian matrix with power Γ per entry, then we can achieve E[‖θ̂ − θ‖₂²] ≤ δ provided that n ≥ (c′σ²/(Γδ)) · k log(p/k) samples are used

I Maximum-likelihood estimation suffices

• Tighter lower bounds could potentially be obtained under additional restrictions on X

Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 42/ 53

Page 99: Information-Theoretic Limits for Inference, Learning, and Optimization

Continuous Example 2

Convex Optimization

Page 100: Information-Theoretic Limits for Inference, Learning, and Optimization

Stochastic Convex Optimization

• A basic optimization problem

x⋆ = arg minx∈D f(x)

For simplicity, we focus on the 1D case D ⊆ R (extensions to R^d are possible)

• Model:

I Noisy samples: When we query x , we get a noisy value and noisy gradient:

Y = f (x) + Z , Y ′ = f ′(x) + Z ′

where Z ∼ N(0, σ²) and Z′ ∼ N(0, σ²)

I Adaptive sampling: Chosen Xi may depend on Y1, . . . ,Yi−1

• Function classes: Convex, strongly convex, Lipschitz, self-concordant, etc.

I We will focus on the class of strongly convex functions

I Strong convexity: f(x) − (c/2)x² is a convex function for some c > 0 (we set c = 1)

Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 43/ 53
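A minimal sketch of this sampling model (class and parameter names are illustrative; numpy is assumed), returning a noisy value and a noisy gradient for each query:

import numpy as np

class NoisyOracle:
    # Noisy zeroth- and first-order oracle: query(x) returns Y = f(x) + Z and
    # Y' = f'(x) + Z' with independent N(0, sigma^2) noise, as in the model above.

    def __init__(self, f, f_prime, sigma, seed=0):
        self.f, self.f_prime, self.sigma = f, f_prime, sigma
        self.rng = np.random.default_rng(seed)

    def query(self, x):
        y = self.f(x) + self.sigma * self.rng.standard_normal()
        y_grad = self.f_prime(x) + self.sigma * self.rng.standard_normal()
        return y, y_grad

# Toy usage: the 1-strongly-convex quadratic f(x) = 0.5 * (x - 0.3)^2.
oracle = NoisyOracle(lambda x: 0.5 * (x - 0.3) ** 2, lambda x: x - 0.3, sigma=0.1)
print(oracle.query(0.5))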


Page 103: Information-Theoretic Limits for Inference, Learning, and Optimization

Performance Measure and Minimax Risk

• After sampling n points, the algorithm returns a final point x̂

• The loss incurred is ℓf(x̂) = f(x̂) − minx∈X f(x), i.e., the gap to the optimum

• For a given class of functions F , the minimax risk is given by

Mn(F) = infX̂ supf∈F Ef[ℓf(X̂)]

Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 44/ 53

Page 104: Information-Theoretic Limits for Inference, Learning, and Optimization

Reduction to Multiple Hypothesis Testing

• The picture remains the same:

[Diagram: Index V → Select → Function fV → Algorithm queries x, observes y → returns x̂ → Infer → Estimate V̂]

I Successful optimization =⇒ Successful identification of V

Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 45/ 53

Page 105: Information-Theoretic Limits for Inference, Learning, and Optimization

General Minimax Lower Bound

• Claim 1. Fix ε > 0, and let {f1, . . . , fM} ⊆ F be a subset of F such that for each x ∈ X, we have ℓfv(x) ≤ ε for at most one value of v ∈ {1, . . . , M}. Then we have

Mn(F) ≥ ε · ( 1 − (I(V; X, Y) + log 2) / log M ),   (1)

where V is uniform on {1, . . . , M}, and I(V; X, Y) is w.r.t. V → fV → (X, Y).

• Claim 2. In the special case M = 2, we have

Mn(F) ≥ ε · H2⁻¹( log 2 − I(V; X, Y) ),   (2)

where H2⁻¹(·) ∈ [0, 0.5] is the inverse binary entropy function.

• The proof is similar to the estimation case, starting with Markov’s inequality:

supf∈F Ef[ℓf(X̂)] ≥ supf∈F ε · Pf[ℓf(X̂) ≥ ε].

• Proof for M = 2 uses a (somewhat less well-known) form of Fano’s inequality for binary hypothesis testing

Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 46/ 53
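The inverse binary entropy function H2⁻¹ has no closed form, but since H2 is increasing on [0, 0.5] it can be inverted by bisection. A minimal sketch (working in nats; function names are illustrative):

import math

def binary_entropy(q):
    # H2(q) in nats, with the convention 0 * log 0 = 0.
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log(q) - (1 - q) * math.log(1 - q)

def inv_binary_entropy(alpha, tol=1e-12):
    # H2^{-1}(alpha) in [0, 0.5]: the smaller root of H2(q) = alpha (alpha in nats).
    alpha = max(0.0, min(alpha, math.log(2)))   # clamp to the valid range [0, log 2]
    lo, hi = 0.0, 0.5
    while hi - lo > tol:                        # H2 is increasing on [0, 0.5]
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if binary_entropy(mid) < alpha else (lo, mid)
    return 0.5 * (lo + hi)

# The analysis that follows uses H2^{-1}(log(2)/2) >= 1/10; numerically it is about 0.110.
print(inv_binary_entropy(math.log(2) / 2))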


Page 108: Information-Theoretic Limits for Inference, Learning, and Optimization

Strongly Convex Class: Choice of Hard Subset

• Reduction to hyp. testing. In 1D, it suffices to choose just two similar functions!

I (Becomes 2^(constant×d) in d dimensions)

[Figure: two nearly identical strongly convex quadratics plotted over input x ∈ [0, 1]; vertical axis: function value]

• The precise functions:

fv(x) = (1/2)(x − x∗v)²,   v = 1, 2,   where x∗1 = 1/2 − √(2ε′) and x∗2 = 1/2 + √(2ε′)

Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 47/ 53
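A small sketch of this construction (the grid check and the choice ε′ = 2ε below are illustrative, made so that the ε-optimal regions of the two functions are disjoint, as Claim 1 requires):

import numpy as np

def hard_pair(eps_prime):
    # Two quadratics f_v(x) = 0.5 * (x - x*_v)^2 with x*_v = 1/2 -/+ sqrt(2 * eps_prime).
    x1, x2 = 0.5 - np.sqrt(2 * eps_prime), 0.5 + np.sqrt(2 * eps_prime)
    return (lambda x: 0.5 * (x - x1) ** 2), (lambda x: 0.5 * (x - x2) ** 2)

# Separation condition behind Claim 1: with eps_prime somewhat larger than eps,
# no x in [0, 1] is eps-optimal for both functions (each f_v has optimal value 0).
eps = 0.005
f1, f2 = hard_pair(eps_prime=2 * eps)      # eps_prime = 2 * eps is a toy choice
xs = np.linspace(0.0, 1.0, 10_001)
both_good = (f1(xs) <= eps) & (f2(xs) <= eps)
print(bool(both_good.any()))               # expected: False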


Page 110: Information-Theoretic Limits for Inference, Learning, and Optimization

Analysis

• Application of Fano’s Inequality. As above,

Mn(F) ≥ ε · H2⁻¹( log 2 − I(V; X, Y) )

I Approach: H2⁻¹(α) ≥ 1/10 if α ≥ (log 2)/2

I How few samples ensure I(V; X, Y) ≤ (log 2)/2?

• Bounding the Mutual Information. Let PY, PY′ be the observation distributions (function and gradient), and QY, QY′ the same but with f0(x) = (1/2)x². Then:

D(PY × PY′ ‖ QY × QY′) = (f1(x) − f0(x))²/(2σ²) + (f′1(x) − f′0(x))²/(2σ²)

I Simplifications: (f1(x) − f0(x))² ≤ (ε + √(ε/2))² ≤ 2ε and (f′1(x) − f′0(x))² = 2ε

I With some manipulation, I(V; X, Y) ≤ (log 2)/2 when ε = σ² log 2 / (4n)

• Final result: Mn(F) ≥ σ² log 2 / (40n)

Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 48/ 53
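A small numerical sanity check of these constants (the crude bound I(V; X, Y) ≤ 2nε/σ² below comes from summing the two per-query KL terms above; the toy values of σ² and n are illustrative):

import math

def optimization_mi_bound(eps, sigma2, n):
    # Crude bound I(V; X, Y) <= n * (2 * eps) / sigma2, obtained by summing the
    # per-query KL bound (2*eps)/(2*sigma2) + (2*eps)/(2*sigma2) over n queries.
    return n * 2.0 * eps / sigma2

sigma2, n = 1.0, 500
eps = sigma2 * math.log(2) / (4 * n)       # the choice of eps made above
assert optimization_mi_bound(eps, sigma2, n) <= math.log(2) / 2 + 1e-12
print("lower bound on M_n(F):", sigma2 * math.log(2) / (40 * n))   # = eps * (1/10)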


Page 113: Information-Theoretic Limits for Inference, Learning, and Optimization

Upper vs. Lower Bounds

• Lower bound for 1D strongly convex functions: Mn(F) ≥ c σ²/n

• Upper bound for 1D strongly convex functions: Mn(F) ≤ c′ σ²/n

I Achieved by stochastic gradient descent

• Analogous results (and proof techniques) are known for d-dimensional functions, additional Lipschitz assumptions, etc. [Raginsky and Rakhlin, 2011]

Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 49/ 53
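To illustrate the matching upper-bound scaling, here is a minimal SGD sketch on a single quadratic from the hard class (the step size, starting point, and averaging over seeds are illustrative choices; this simulates the rate rather than proving it):

import numpy as np

def sgd_on_quadratic(x_star, sigma, n, domain=(0.0, 1.0), seed=0):
    # SGD with step size 1/i on f(x) = 0.5 * (x - x_star)^2 using noisy gradients;
    # returns the optimality gap f(x_n) - f(x_star) of the final iterate.
    rng = np.random.default_rng(seed)
    x = 0.5 * (domain[0] + domain[1])
    for i in range(1, n + 1):
        grad = (x - x_star) + sigma * rng.standard_normal()   # noisy gradient query
        x = np.clip(x - grad / i, *domain)                    # step size 1/(c*i) with c = 1
    return 0.5 * (x - x_star) ** 2

n = 1000
gaps = [sgd_on_quadratic(x_star=0.3, sigma=1.0, n=n, seed=s) for s in range(200)]
print(np.mean(gaps), 1.0 / (2 * n))   # average gap is on the order of sigma^2 / n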

Page 114: Information-Theoretic Limits for Inference, Learning, and Optimization

Limitations and Generalizations

• Limitations of Fano’s Inequality.

I Non-asymptotic weakness

I Typically hard to tightly bound mutual information in adaptive settings

I Restriction to KL divergence

• Generalizations of Fano’s Inequality.

I Non-uniform V [Han/Verdu, 1994]

I More general f -divergences [Guntuboyina, 2011]

I Continuous V [Duchi/Wainwright, 2013]

(This list is certainly incomplete!)

Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 50/ 53

Page 115: Information-Theoretic Limits for Inference, Learning, and Optimization

Example: Difficulties in Adaptive Settings

• A simple search problem: Find the (only) biased coin using few flips

[Figure: M coins; every coin has P[heads] = 1/2 except one, which has P[heads] = 1/2 + ε]

I Biased coin V ∈ {1, . . . , M} uniformly at random

I Selected coin at time i = 1, . . . , n is Xi, observation is Yi ∈ {0, 1} (1 for heads)

• Non-adaptive setting:

I Since Xi and V are independent, can show I(V; Yi |Xi) ≲ ε²/M

I Substituting into Fano’s inequality gives the requirement n ≳ (M log M)/ε²

• Adaptive setting:

I It is a nuisance to characterize I(V; Yi |Xi), as Xi now depends on V due to adaptivity!

I Worst-case bounding only gives n ≳ (log M)/ε²

I Next lecture: An alternative tool that gives n ≳ M/ε²

Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 51/ 53
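A quick simulation of the non-adaptive setting (round-robin flipping and a most-heads guess are simple illustrative choices, not an optimal scheme) shows the accuracy improving once n reaches the order of (M log M)/ε², which is roughly 6650 for the toy values below:

import numpy as np

def nonadaptive_coin_search(M, eps, n, seed=0):
    # Flip coins round-robin (non-adaptively) and guess the coin with the most heads.
    # Coin V has P[heads] = 1/2 + eps, all others 1/2; returns True if the guess is correct.
    rng = np.random.default_rng(seed)
    v = rng.integers(M)
    probs = np.full(M, 0.5)
    probs[v] += eps
    flips_per_coin = n // M                      # round-robin allocation of the n flips
    heads = rng.binomial(flips_per_coin, probs)
    return int(np.argmax(heads)) == v

M, eps = 64, 0.2
for n in [500, 2_000, 8_000, 32_000]:
    acc = np.mean([nonadaptive_coin_search(M, eps, n, seed=s) for s in range(200)])
    print(n, acc)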


Page 118: Information-Theoretic Limits for Inference, Learning, and Optimization

Conclusion

• Information theory as a theory of data:

[Diagram: Information Theory linking Data Generation, Storage & Transmission, Inference & Learning, and Optimization]

• Approach highlighted in this talk:

I Reduction to multiple hypothesis testing

I Application of Fano’s inequality

I Bounding the mutual information

• Examples:

I Group testing

I Graphical model selection

I Sparse regression

I Convex optimization

I ...hopefully many more to come!

Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 52/ 53


Page 121: Information-Theoretic Limits for Inference, Learning, and Optimization

Tutorial Chapter

• Tutorial Chapter: “An Introductory Guide to Fano’s Inequality with Applications in Statistical Estimation” [S. and Cevher, 2019]

https://arxiv.org/abs/1901.00555

(To appear in the book Information-Theoretic Methods in Data Science, Cambridge University Press)

Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 53/ 53