Information-Theoretic Limits for Inference, Learning, and Optimization
Part 1: Fano’s Inequality for Statistical Estimation
Jonathan Scarlett
Croucher Summer Course in Information Theory (CSCIT), July 2019
Information Theory
• How do we quantify “information” in data?
• Information theory [Shannon, 1948]:
  - Fundamental limits of data communication

  Source (U1, ..., Up) → Encoder → Codeword (X1, ..., Xn) → Channel → Output (Y1, ..., Yn) → Decoder → Reconstruction (Û1, ..., Ûp)

  - Information of source: Entropy
  - Information learned at channel output: Mutual information
Principles:
  - First fundamental limits without complexity constraints, then practical methods
  - First asymptotic analyses, then convergence rates, finite-length analysis, etc.
  - Mathematically tractable probabilistic models
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 1/ 53
Information Theory and Data
• Conventional view: Information theory is a theory of communication

  [Diagram: circles for Data Generation, Inference & Learning, Optimization, and Storage & Transmission, with Information Theory sitting inside Storage & Transmission]

• Emerging view: Information theory is a theory of data

  [Diagram: the same circles, with Information Theory now overlapping Storage & Transmission, Data Generation, Inference & Learning, and Optimization]

• Extracting information from channel output vs. extracting information from data
Examples
• Information theory in machine learning and statistics:
  - Statistical estimation [Le Cam, 1973]
  - Group testing [Malyutov, 1978]
  - Multi-armed bandits [Lai and Robbins, 1985]
  - Phylogeny [Mossel, 2004]
  - Sparse recovery [Wainwright, 2009]
  - Graphical model selection [Santhanam and Wainwright, 2012]
  - Convex optimization [Agarwal et al., 2012]
  - DNA sequencing [Motahari et al., 2012]
  - Sparse PCA [Birnbaum et al., 2013]
  - Community detection [Abbe, 2014]
  - Matrix completion [Riegler et al., 2015]
  - Ranking [Shah and Wainwright, 2015]
  - Adaptive data analysis [Russo and Zou, 2015]
  - Supervised learning [Nokleby, 2016]
  - Crowdsourcing [Lahouti and Hassibi, 2016]
  - Distributed computation [Lee et al., 2018]
  - Bayesian optimization [Scarlett, 2018]
• Note: More than just using entropy / mutual information...
Analogies
Same concepts, different terminology:

| Communication Problems      | Data Problems                        |
|-----------------------------|--------------------------------------|
| Channels with feedback      | Active learning / adaptivity         |
| Rate-distortion theory      | Approximate recovery                 |
| Joint source-channel coding | Non-uniform prior                    |
| Error probability           | Error probability                    |
| Random coding               | Random sampling                      |
| Side information            | Side information                     |
| Channels with memory        | Statistically dependent measurements |
| Mismatched decoding         | Model mismatch                       |
| ...                         | ...                                  |
Cautionary Notes
Some cautionary notes on the information-theoretic viewpoint:
  - The simple models we can analyze may be over-simplified (more so than in communication)
  - Compared to communication, we often can't get matching achievability/converse results (we often settle for the correct scaling laws)
  - Information-theoretic limits are not (yet) considered much in practice (to my knowledge)... but they do guide algorithm design
  - We often encounter gaps between information-theoretic limits and computational limits
  - Often, information theory simply isn't the right tool for the job
Terminology: Achievability and Converse
Achievability result (example): Given n(ε) data samples, there exists an algorithm achieving an "error" of at most ε
  - Estimation error: ‖θ̂ − θtrue‖ ≤ ε
  - Optimization error: f(x_selected) ≤ min_x f(x) + ε
Converse result (example): In order to achieve an "error" of at most ε, any algorithm requires at least n(ε) data samples
Converse Bounds for Statistical Estimation via Fano's Inequality
(Based on survey chapter https://arxiv.org/abs/1901.00555)
Statistical Estimation

  [Diagram: parameter θ → samples Y (with inputs X) → Algorithm → estimate θ̂]

• General statistical estimation setup:
  - Unknown parameter θ ∈ Θ
  - Samples Y = (Y1, ..., Yn) drawn from Pθ(y)
  - More generally, drawn from Pθ,X with inputs X = (X1, ..., Xn)
  - Given Y (and possibly X), construct an estimate θ̂
• Goal. Minimize some loss ℓ(θ, θ̂)
  - 0-1 loss: ℓ(θ, θ̂) = 1{θ̂ ≠ θ}
  - Squared ℓ2 loss: ℓ(θ, θ̂) = ‖θ̂ − θ‖₂²
• Typical example. Linear regression
  - Estimate θ ∈ R^p from Y = Xθ + Z
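The linear regression example can be sketched numerically. The dimensions, noise level, and least-squares estimator below are illustrative choices, not prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5  # illustrative sample size and dimension

# Unknown parameter theta and samples Y = X theta + Z.
theta = rng.normal(size=p)
X = rng.normal(size=(n, p))
Y = X @ theta + 0.1 * rng.normal(size=n)

# One natural estimator: least squares, theta_hat = argmin_t ||Y - X t||^2.
theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Squared l2 loss of the estimate.
loss = float(np.sum((theta_hat - theta) ** 2))
print(loss)
```

With n much larger than p and small noise, the squared loss is tiny; the converse machinery below asks how small it can possibly be for *any* estimator.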
Defining Features
• There are many properties that impact the analysis:
  - Discrete θ (e.g., graph learning, sparsity pattern recovery)
  - Continuous θ (e.g., regression, density estimation)
  - Bayesian θ (average-case performance)
  - Minimax bounds over Θ (worst-case performance)
  - Non-adaptive inputs (all X1, ..., Xn chosen in advance)
  - Adaptive inputs (Xi can be chosen based on Y1, ..., Yi−1)
• This talk. Minimax bounds, mostly non-adaptive, first discrete and then continuous
High-Level Steps
Steps in attaining a minimax lower bound (converse):
1. Reduce the estimation problem to multiple hypothesis testing
2. Apply a form of Fano's inequality
3. Bound the resulting mutual information term
(Multiple hypothesis testing: Given samples Y1, ..., Yn, determine which distribution among P1(y), ..., PM(y) generated them. M = 2 gives binary hypothesis testing.)
Step I: Reduction to Multiple Hypothesis Testing
• Lower bound the worst-case error by the average over a hard subset θ1, ..., θM:

  [Diagram: select index V → parameter θV → samples Y (inputs X) → Algorithm → estimate θ̂ → inferred index V̂]

• Idea:
  - Show that a "successful" algorithm output θ̂ implies correct estimation of V (When is this true?)
  - Equivalent statement: If V can't be estimated reliably, then θ̂ can't be successful.
Step I: Example
• Example: Suppose an algorithm is claimed to return θ̂ such that ‖θ̂ − θ‖₂ ≤ ε
• If θ1, ..., θM are pairwise separated by more than 2ε, then we can identify the correct V ∈ {1, ..., M} from θ̂
• Note: There is a tension between the number of hypotheses, the difficulty of distinguishing them, and sufficient separation. Choosing a suitable set θ1, ..., θM can be very challenging.
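The separation argument can be checked concretely: if the θv are pairwise more than 2ε apart and the estimate is within ε of the truth, then by the triangle inequality the hypothesis nearest to θ̂ recovers V. The specific points below are hypothetical:

```python
import numpy as np

eps = 0.5
# Hypothetical hard subset: scaled standard basis vectors in R^3,
# pairwise distance 2.1 * eps * sqrt(2) > 2 * eps.
thetas = 2.1 * eps * np.eye(3)
M = len(thetas)
for a in range(M):
    for b in range(a + 1, M):
        assert np.linalg.norm(thetas[a] - thetas[b]) > 2 * eps

V = 2  # true index
# An estimate within eps of theta_V (error exactly eps here).
theta_hat = thetas[V] + np.array([eps, 0.0, 0.0])

# Nearest-hypothesis decoding: every other theta_v is at distance
# > 2*eps - eps = eps from theta_hat, so the argmin identifies V.
V_hat = int(np.argmin(np.linalg.norm(thetas - theta_hat, axis=1)))
print(V_hat)  # 2
```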
Step II: Application of Fano's Inequality
• Standard form of Fano's inequality for M-ary hypothesis testing and uniform V:

      P[V̂ ≠ V] ≥ 1 − (I(V; V̂) + log 2) / log M.

  - Intuition: The learned information I(V; V̂) needs to be close to the prior uncertainty log M
• Variations:
  - Non-uniform V
  - Approximate recovery
  - Conditional version
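As a sanity check, the bound above is easy to evaluate numerically (natural logarithms throughout; the numbers are illustrative):

```python
import math

def fano_lower_bound(mutual_info, M):
    """Lower bound on P[V_hat != V] for V uniform on M hypotheses,
    given (an upper bound on) I(V; V_hat) in nats."""
    return 1 - (mutual_info + math.log(2)) / math.log(M)

M = 1024
# If only half of the prior uncertainty log M has been learned,
# the error probability is at least 1/2 - log(2)/log(M).
bound = fano_lower_bound(0.5 * math.log(M), M)
print(round(bound, 3))  # 0.4
```

Note the bound is vacuous (≤ 0) unless I(V; V̂) comes within log 2 of log M, matching the intuition stated on the slide.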
Step III: Bounding the Mutual Information
• The key quantity remaining after applying Fano's inequality is I(V; V̂)
• Data processing inequality: (based on the Markov chain V → Y → V̂ or similar)
  - No inputs: I(V; V̂) ≤ I(V; Y)
  - Non-adaptive inputs: I(V; V̂ | X) ≤ I(V; Y | X)
  - Adaptive inputs: I(V; V̂) ≤ I(V; X, Y)
• Tensorization: (based on conditional independence of the samples)
  - No inputs: I(V; Y) ≤ ∑_{i=1}^n I(V; Yi)
  - Non-adaptive inputs: I(V; Y | X) ≤ ∑_{i=1}^n I(V; Yi | Xi)
  - Adaptive inputs: I(V; X, Y) ≤ ∑_{i=1}^n I(V; Yi | Xi)
• KL divergence bounds:
  - I(V; Y) ≤ max_{v,v′} D(P_{Y|V}(·|v) ‖ P_{Y|V}(·|v′))
  - I(V; Y) ≤ max_v D(P_{Y|V}(·|v) ‖ Q_Y) for any Q_Y
  - If each P_{Y|V}(·|v) is ε-close to the closest of Q1(y), ..., QN(y) in KL divergence, then I(V; Y) ≤ log N + ε
  - (Similarly with conditioning on X)
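These bounds can be verified numerically on a toy model. The model below (V uniform on {0, 1}, with samples i.i.d. Bernoulli given V) is an illustrative choice; it checks the chain exact I(V; Y) ≤ tensorization bound ≤ KL bound:

```python
import itertools, math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

# Toy model: V uniform on {0, 1}; given V = v, the n samples are
# i.i.d. Bernoulli(theta[v]).
theta = {0: 0.2, 1: 0.6}
n = 4

def p_y_given_v(y, v):
    return math.prod(theta[v] if yi else 1 - theta[v] for yi in y)

ys = list(itertools.product([0, 1], repeat=n))

# Exact I(V; Y) = H(Y) - H(Y|V) by enumeration over all 2^n outcomes.
p_y = [0.5 * p_y_given_v(y, 0) + 0.5 * p_y_given_v(y, 1) for y in ys]
I_exact = entropy(p_y) - 0.5 * entropy([p_y_given_v(y, 0) for y in ys]) \
                       - 0.5 * entropy([p_y_given_v(y, 1) for y in ys])

# Tensorization: I(V; Y) <= sum_i I(V; Y_i) = n * I(V; Y_1) here (i.i.d.).
p1 = [0.5 * ((1 - theta[0]) + (1 - theta[1])), 0.5 * (theta[0] + theta[1])]
I_single = entropy(p1) - 0.5 * entropy([1 - theta[0], theta[0]]) \
                       - 0.5 * entropy([1 - theta[1], theta[1]])
I_tensor = n * I_single

# KL bound: I(V; Y) <= max_{v,v'} D(P_{Y|V=v} || P_{Y|V=v'}) = n * max KL.
def bern_kl(a, b):
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

I_kl = n * max(bern_kl(theta[0], theta[1]), bern_kl(theta[1], theta[0]))

print(I_exact <= I_tensor <= I_kl)  # True
```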
Discrete Example 1
Group Testing
Group Testing

  [Figure: items (columns, some contaminated and some clean), pooled tests (rows of the test matrix), and binary test outcomes]

  - Goal: Given the test matrix X and outcomes Y, recover the item vector β
  - Sample complexity: Required number of tests n
Information Theory and Group Testing

  [Figure: items, pooled tests, and outcomes]

• Information-theoretic viewpoint:

  Message S (defective set) → Encoder → Codeword X_S (columns of X indexed by S) → Channel Y ~ P^n_{Y|X_S} → Decoder → Estimate Ŝ
Information Theory and Group Testing
• Example formulation of a general result:

      Sample complexity: n ≳ H(S) / I(P_{Y|X_S})

  - H(S): entropy of the defective set (model uncertainty)
  - I(P_{Y|X_S}): mutual information (information learned from measurements)
Converse via Fano's Inequality
• Reduction to multiple hypothesis testing: Trivial! Set V = S.
• Application of Fano's inequality:

      P[Ŝ ≠ S] ≥ 1 − (I(S; Ŝ | X) + log 2) / log (p choose k)

• Bounding the mutual information:
  - Data processing inequality: I(S; Ŝ | X) ≤ I(U; Y), where U are the pre-noise outputs
  - Tensorization: I(U; Y) ≤ ∑_{i=1}^n I(Ui; Yi)
  - Capacity bound: I(Ui; Yi) ≤ C if each outcome is passed through a channel of capacity C
• Final result:

      n ≤ (1 − ε) · log (p choose k) / C  ⟹  P[Ŝ ≠ S] ↛ 0
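In the noiseless case each binary outcome carries at most C = log 2 nats (1 bit), so the converse specializes to the counting bound n ≥ log₂ (p choose k). A minimal sketch (the problem sizes are arbitrary illustrative choices):

```python
import math

def counting_bound(p, k):
    """Minimum number of noiseless tests implied by the Fano converse:
    n >= log2 C(p, k), since each binary outcome reveals at most 1 bit."""
    return math.log2(math.comb(p, k))

# Arbitrary sizes; compare with the familiar k * log2(p / k) scaling.
p, k = 1000, 10
print(counting_bound(p, k), k * math.log2(p / k))
```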
Noiseless Group Testing (Exact Recovery)

  [Figure: achievable rate vs. sparsity parameter, comparing the information-theoretic counting bound against the DD, COMP, Bernoulli-design, and near-constant-weight-design curves]

• Other implications:
  - Adaptivity and approximate recovery:
    - No gain at low sparsity levels
    - Significant gain at high sparsity levels
  - An efficient algorithm is optimal (w.r.t. random test designs) at high sparsity levels
  - Analogous results in noisy settings
Detour: An Information-Theoretic Achievability Example
A Practical Approach: Separate Decoding of Items
• Searching over (p choose k) subsets is infeasible
• Practical alternative: Separate decoding of items

  [Figure: the test matrix is split column-wise; each item is decoded from its own column together with the outcomes]

• Channel coding interpretation:

  [Diagram: p parallel chains — Encoder j → X_j → Channel j → Y → Decoder j, for j = 1, ..., p]

• Claim: Asymptotic optimality to within a factor of log 2 ≈ 0.7 (sometimes)
Example Results (Symmetric Noise)
• Summary of results in the symmetric noise setting: [Scarlett/Cevher, 2018]
I Joint: Intractable joint decoding rule (otherwise separate)
I False pos/neg: Small #false positives/negatives allowed in reconstruction
[Figure: asymptotic ratio of k log2(p/k) to n, plotted against the value θ such that k = Θ(p^θ); curves shown for Joint false pos/neg, Joint exact, False pos/neg, False neg, False pos, Exact, and Best previous practical]
Note: Within a factor log 2 ≈ 0.7 of the information-theoretic optimum at low sparsity
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 19/ 53
SDI Achievability Proof 1 (Notation)
• Define the defectivity indicator random variable βj = 1{j ∈ S}

• Separate decoding of items: Decode βj based only on Xj and Y
I Xj : j-th column of X

• Binary-hypothesis-testing based rule:

    β̂j(Xj, Y) = 1{ ∑_{i=1}^n log [ PY|Xj,βj(Y(i) | Xj(i), 1) / PY(Y(i)) ] > γ }

I X(i) : i-th row of X
I Y(i) : i-th test outcome
I γ : Threshold to be set later
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 20/ 53
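A minimal sketch of this decision statistic for the noiseless OR model under a Bernoulli design (not from the slides; sizes, the design probability, and the threshold γ = log(p − k) are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

p, k, n = 200, 4, 150
q = np.log(2) / k                       # Bernoulli design probability (assumed)
S = rng.choice(p, size=k, replace=False)
X = (rng.random((n, p)) < q).astype(int)
y = (X[:, S].sum(axis=1) > 0).astype(int)

# Output distributions for the noiseless OR channel:
#   P_Y(1) = 1 - (1-q)^k;  P(Y=1 | Xj=x, beta_j=1) = 1 if x=1, else 1-(1-q)^(k-1)
PY1 = 1 - (1 - q) ** k
PYX1 = np.array([1 - (1 - q) ** (k - 1), 1.0])   # indexed by x in {0, 1}

def info_density_sum(xj, y):
    """Sum over tests of log P(Y | Xj, defective) / P(Y)."""
    py_given_x = np.where(y == 1, PYX1[xj], 1 - PYX1[xj])
    py = np.where(y == 1, PY1, 1 - PY1)
    # A non-defective item in a negative test gives log(0) = -inf (auto-reject)
    with np.errstate(divide="ignore"):
        return np.sum(np.log(py_given_x) - np.log(py))

gamma = np.log(p - k)                    # illustrative threshold choice
S_hat = [j for j in range(p) if info_density_sum(X[:, j], y) > gamma]
print(sorted(S), sorted(S_hat))
```

For a defective item the statistic concentrates around n·I(X1; Y), which comfortably exceeds γ at these sizes.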
SDI Achievability Proof 2 (Non-Asymptotic Bound)
Non-asymptotic bound: Under Bernoulli testing, we have

    Pe ≤ k · P[ı1^n(X1, Y) ≤ γ] + (p − k) e^{−γ}

where

    ı1^n(Xj, Y) := ∑_{i=1}^n ı1(Xj(i), Y(i)),    ı1(x1, y) := log [ PY|X1(y | x1) / PY(y) ]

is the information density
Proof idea:
I First term: Union bound over k false negative events
I Second term: Union bound over p − k false positive events & change of measure
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 21/ 53
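The two terms of this bound can be evaluated directly — the first by Monte Carlo, the second in closed form. The sketch below does this for the noiseless OR model with γ = log((p − k)/δ1), so the false-positive term equals δ1 exactly; all sizes and δ1 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

p, k, n = 1000, 5, 400
q = np.log(2) / k                        # Bernoulli design probability (assumed)
PY1 = 1 - (1 - q) ** k

def sample_info_density(trials=20000):
    """Monte Carlo draws of i_1^n(X1, Y) for a defective item (noiseless OR)."""
    x = rng.random((trials, n)) < q
    # Given beta_1 = 1: y = 1 whenever x = 1; else y = 1 w.p. 1 - (1-q)^(k-1)
    y = x | (rng.random((trials, n)) < 1 - (1 - q) ** (k - 1))
    i1 = np.where(x & y, -np.log(PY1),                       # log(1 / P_Y(1))
         np.where(~x & y, np.log((1 - (1 - q) ** (k - 1)) / PY1),
                          np.log((1 - q) ** (k - 1) / (1 - PY1))))
    return i1.sum(axis=1)

delta1 = 0.1
gamma = np.log((p - k) / delta1)
first_term = k * np.mean(sample_info_density() <= gamma)     # false negatives
second_term = (p - k) * np.exp(-gamma)                       # equals delta1
print("Pe upper bound ≈", first_term + second_term)
```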
SDI Achievability Proof 3 (Asymptotic Analysis)
Simplification: If we have concentration of the form

    P[ı1^n(X1, Y) ≤ n · I(X1; Y)(1 − δ2)] ≤ ψn(δ2)

then the following conditions suffice for Pe → 0:

    n ≥ log((p − k)/δ1) / [I(X1; Y)(1 − δ2)],    k · ψn(δ2) → 0

Notes
I Proved by setting γ = log((p − k)/δ1) in the initial result
I With false positives, can instead use γ = log((p − k)/(kδ1))
I With false negatives, it suffices that ψn(δ2) → 0
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 22/ 53
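To make the first sufficient condition concrete, the sketch below computes I(X1; Y) for the noiseless OR model under a Bernoulli design and evaluates the resulting sample-size threshold; the parameter choices are illustrative assumptions, not values from the slides.

```python
import numpy as np

def h2(x):
    """Binary entropy in nats."""
    return -x * np.log(x) - (1 - x) * np.log(1 - x)

p, k = 10_000, 10
delta1, delta2 = 0.1, 0.1
q = np.log(2) / k                        # Bernoulli design probability (assumed)
PY1 = 1 - (1 - q) ** k                   # ≈ 1/2 under this design
P1_given_0 = 1 - (1 - q) ** (k - 1)

# I(X1; Y) = H(Y) - H(Y | X1); note H(Y | X1 = 1) = 0 in the noiseless model
I1 = h2(PY1) - (1 - q) * h2(P1_given_0)

n_sufficient = np.log((p - k) / delta1) / (I1 * (1 - delta2))
print(f"I(X1;Y) ≈ {I1:.4f} nats; sufficient n ≈ {n_sufficient:.0f}")
```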
SDI Achievability Proof 4 (Concentration Bounds)
Bernstein’s inequality:

    P[ı1^n(X1, Y) ≤ n · I1(1 − δ2)] ≤ exp( −(1/2) · (n/k) · c_mean² δ2² / (c_var + (1/3) c_mean c_max δ2) )

for suitable constants c_mean, c_var, c_max

Refined concentration for noiseless model:

    P[ı1^n(X1, Y) ≤ n · I1(1 − δ2)] ≤ exp( −(n log 2 / k) · ((1 − δ2) log(1 − δ2) + δ2) · (1 + o(1)) )
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 23/ 53
SDI Achievability Proof 5 (Final Conditions)
Final noiseless result: We have Pe → 0 if

    n ≥ min_{δ2>0} max{ k log p / [(log 2)²(1 − δ2)],  k log k / [(log 2)((1 − δ2) log(1 − δ2) + δ2)] } · (1 + η)

Final noisy result: We have Pe → 0 if

    n ≥ min_{δ2>0} max{ k log p / [(log 2)(log 2 − H2(ρ))(1 − δ2)],  (k log k)(c_var + (1/3) c_mean c_max δ2) / ((1/2) c_mean² δ2²) } · (1 + η)

Approximate recovery: We have Pe^(approx) → 0 if

    n ≥ [log(p/k) / I(X1; Y)] · (1 + η)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 24/ 53
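The noiseless condition above can be evaluated numerically by a grid search over δ2. The sketch below does this for two sparsity regimes k = p^θ (the grid and problem sizes are illustrative assumptions; η is set to 0):

```python
import numpy as np

def n_noiseless(p, k, eta=0.0, grid=None):
    """Grid-search evaluation of the final noiseless sufficient condition."""
    if grid is None:
        grid = np.linspace(1e-3, 1 - 1e-3, 999)   # candidate delta2 values
    ln2 = np.log(2)
    term1 = k * np.log(p) / (ln2 ** 2 * (1 - grid))
    # (1 - d)*log(1 - d) + d > 0 for d in (0, 1), so no division by zero
    term2 = k * np.log(k) / (ln2 * ((1 - grid) * np.log(1 - grid) + grid))
    return (1 + eta) * np.min(np.maximum(term1, term2))

p = 10_000
for theta in (0.1, 0.5):                  # sparsity regime k = p**theta
    k = max(2, int(round(p ** theta)))
    print(f"theta = {theta}, k = {k}, sufficient n ≈ {n_noiseless(p, k):.0f}")
```

The first term dominates at low sparsity (small θ), matching the regime where the claimed log 2 factor of optimality applies.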
Example Results (Symmetric Noise)
• Summary of results in the symmetric noise setting: [Scarlett/Cevher, 2018]
I Joint: Intractable joint decoding rule (otherwise separate)
I False pos/neg: Small #false positives/negatives allowed in reconstruction
[Figure: asymptotic ratio of k log2(p/k) to n, plotted against the value θ such that k = Θ(p^θ); curves shown for Joint false pos/neg, Joint exact, False pos/neg, False neg, False pos, Exact, and Best previous practical]
Note: Within a factor log 2 ≈ 0.7 of the information-theoretic optimum at low sparsity
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 25/ 53
Discrete Example 2
Graphical Model Selection
Graphical Model Representations of Joint Distributions
Motivating example:
I In a population of p people, let

    Yi = { +1  person i is infected
           −1  person i is healthy },   i = 1, . . . , p

I Example models: [Abbe and Wainwright, ISIT Tutorial, 2015]

I Joint distribution for a given graph G = (V, E):

    P[(Y1, . . . , Yp) = (y1, . . . , yp)] = (1/Z) exp( ∑_{(i,j)∈E} λij yi yj )
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 26/ 53
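For small p this joint distribution can be computed exactly by enumerating all 2^p states, which also makes exact sampling easy. The sketch below does this for a hypothetical 4-cycle with a common edge strength (the graph and λ are illustrative assumptions):

```python
import itertools
import numpy as np

# Exact computation and sampling for a small Ising model by full enumeration
p = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # hypothetical example graph: a 4-cycle
lam = 0.8                                   # common edge strength lambda_ij = lam

states = np.array(list(itertools.product([-1, 1], repeat=p)))
energies = np.array([sum(lam * s[i] * s[j] for i, j in edges) for s in states])
weights = np.exp(energies)
Z = weights.sum()                           # partition function
probs = weights / Z

rng = np.random.default_rng(0)
samples = states[rng.choice(len(states), size=1000, p=probs)]

# Positive lambda encourages neighbouring spins to agree
agreement = np.mean([samples[:, i] * samples[:, j] for i, j in edges])
print(f"Z = {Z:.2f}, mean neighbour agreement = {agreement:.2f}")
```

Enumeration is only feasible for tiny p; for larger graphs one would use Gibbs sampling instead.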
Graphical Model Selection: Illustration
I A larger example from [Abbe and Wainwright, ISIT Tutorial, 2015]:
I Example graphs:
I Sample images (Ising model):
I Goal: Identify graph given n independent samples
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 27/ 53
Graphical Model Selection: Definition
General problem statement.
I Given n i.i.d. samples of (Y1, . . . , Yp) ∼ PG , recover the underlying graph G
I Applications: Statistical physics, social and biological networks
I Error probability:

    Pe = max_{G∈G} P[Ĝ ≠ G | G]

Assumptions.
I Distribution class:
  I Ising model:

      PG (x1, . . . , xp) = (1/Z) exp( ∑_{(i,j)∈E} λij xi xj )

  I Gaussian model:

      (X1, . . . , Xp) ∼ N (µ, Σ),  where (Σ−1)ij ≠ 0 ⇐⇒ (i, j) ∈ E  [Hammersley–Clifford theorem]

I Graph class:
  I Bounded-edge (at most k edges total)
  I Bounded-degree (at most d edges out of each node)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 28/ 53
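The Gaussian case can be checked empirically: build a precision matrix whose off-diagonal support matches a chosen graph, sample from the model, and invert the empirical covariance. A minimal sketch (the graph, edge weights, threshold, and sample size are all illustrative assumptions):

```python
import numpy as np

# Verify (Sigma^{-1})_ij != 0  <=>  (i, j) in E  on a hypothetical small graph
p = 5
edges = [(0, 1), (1, 2), (3, 4)]
Theta = np.eye(p)                     # precision matrix Sigma^{-1}
for i, j in edges:
    Theta[i, j] = Theta[j, i] = 0.3   # off-diagonals only on graph edges
# Diagonally dominant, hence positive definite

Sigma = np.linalg.inv(Theta)
rng = np.random.default_rng(0)
Y = rng.multivariate_normal(np.zeros(p), Sigma, size=200_000)

Theta_hat = np.linalg.inv(np.cov(Y, rowvar=False))
support = np.abs(Theta_hat) > 0.1     # crude threshold on the estimate
recovered = {(i, j) for i in range(p) for j in range(i + 1, p) if support[i, j]}
print(recovered)
```

With this many samples the thresholded estimate recovers exactly the planted edge set; real graph-selection algorithms must cope with far fewer samples, which is the regime the converse bounds address.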
Information-Theoretic Viewpoint
• Information-theoretic viewpoint:
[Figure: G → Channel PY|G → Y → Decoder → Ĝ]
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 29/ 53
Converse via Fano’s Inequality
• Reduction to multiple hypothesis testing: Let G be uniform on a hard subset G0 ⊆ G
I Ideally many graphs (lots of graphs to distinguish)
I Ideally close together (harder to distinguish)

• Application of Fano’s inequality:

    P[Ĝ ≠ G] ≥ 1 − (I(G; Ĝ) + log 2) / log |G0|

• Bounding the mutual information:
I Data processing inequality: I(G; Ĝ) ≤ I(G; Y)
I Tensorization: I(G; Y) ≤ ∑_{i=1}^n I(G; Yi)
I KL divergence bound: Bound I(G; Yi) ≤ max_G D(PY|G(·|G) ‖ QY) case-by-case
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 30/ 53
Graph Ensembles
I Graphs that are difficult to distinguish from the empty graph:
  I Reveals the necessary condition n = Ω((1/λ²) log p) with “edge strength” λ and p nodes

I Graphs that are difficult to distinguish from the complete (sub-)graph:
  I Reveals the necessary condition n = Ω(e^{λd}) with “edge strength” λ and degree d
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 31/ 53
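The first ensemble's scaling can be reproduced by plugging the Fano reduction into code: with M hypotheses and a per-sample mutual-information bound D, requiring error probability at most 1/2 forces a minimum n. This is a sketch under assumed quantities (the single-edge ensemble and the crude bound D = λ² are illustrative, not the slides' exact constants):

```python
from math import comb, log

# Fano-based necessary condition: if G is uniform over M graphs and each sample
# satisfies I(G; Y_i) <= D, then P_err >= 1 - (n*D + log 2)/log M, so
# P_err <= target forces n >= ((1 - target)*log M - log 2)/D.
def fano_necessary_n(M, D, target_perr=0.5):
    return ((1 - target_perr) * log(M) - log(2)) / D

# Hypothetical ensemble: single-edge graphs on p nodes vs. the empty graph,
# with per-sample divergence D = O(lambda^2) for weak edge strength lambda
p, lam = 100, 0.1
M = comb(p, 2)                  # one hypothesis per possible edge
D = lam ** 2                    # crude per-sample information bound (assumed)
print(f"n must be at least ≈ {fano_necessary_n(M, D):.0f}")
```

Since log M = Θ(log p), this recovers the n = Ω((1/λ²) log p) scaling stated above.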
![Page 69: Information-Theoretic Limits for Inference, Learning, and ...scarlett/it_data_slides/01-FanoEst.pdf · I First fundamental limits without complexity constraints, then practical methods](https://reader033.fdocuments.net/reader033/viewer/2022052023/6038e167446b0870df77ca01/html5/thumbnails/69.jpg)
Graph Ensembles
I Graphs that are difficult to distinguish from the empty graph:
I Reveals n = Ω(
1λ2 log p
)necessary condition with “edge strength” λ and p nodes
I Graphs that are difficult to distinguish from the complete (sub-)graph:
I Reveals n = Ω(eλd)
necessary condition with “edge strength” λ and degree d
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 31/ 53
![Page 70: Information-Theoretic Limits for Inference, Learning, and ...scarlett/it_data_slides/01-FanoEst.pdf · I First fundamental limits without complexity constraints, then practical methods](https://reader033.fdocuments.net/reader033/viewer/2022052023/6038e167446b0870df77ca01/html5/thumbnails/70.jpg)
Upper vs. Lower Bounds
• Example results with maximal degree d , edge strength λ (slightly informal):
I (Converse) n = Ω( max{ 1/λ², e^{λd} } log p )  [Santhanam and Wainwright, 2012]

I (Achievability) n = O( max{ 1/λ², e^{λd} d } log p )  [Santhanam and Wainwright, 2012]

I (Early Practical) n = O(d² log p), but with extra assumptions that are hard to verify  [Ravikumar/Wainwright/Lafferty, 2010]

I (Recent Practical) n = O( (d² e^{λd} / λ²) log p )  [Klivans/Meka, 2017]

I ...and more [Lokhov et al., 2018] [Wu/Sanghavi/Dimakis, 2018]
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 32/ 53
![Page 71: Information-Theoretic Limits for Inference, Learning, and ...scarlett/it_data_slides/01-FanoEst.pdf · I First fundamental limits without complexity constraints, then practical methods](https://reader033.fdocuments.net/reader033/viewer/2022052023/6038e167446b0870df77ca01/html5/thumbnails/71.jpg)
Graphical Model Selection with Twists
• Approximate recovery: [S., Cevher, IEEE TSIPN, 2016]

    Pe(dmax) = max_{G∈G} P[dH(E, Ê) > dmax | G]

• Active learning / adaptive sampling: [S., Cevher, AISTATS, 2017]

[Figure: adaptive sampling loop — choose nodes X(i), sample Y(i) from G, decoder outputs Ĝ]

• Analogies:

    Communication Problems    | Data Problems
    --------------------------|------------------------------
    Feedback                  | Active learning / adaptivity
    Rate-distortion theory    | Approximate recovery
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 33/ 53
![Page 72: Information-Theoretic Limits for Inference, Learning, and ...scarlett/it_data_slides/01-FanoEst.pdf · I First fundamental limits without complexity constraints, then practical methods](https://reader033.fdocuments.net/reader033/viewer/2022052023/6038e167446b0870df77ca01/html5/thumbnails/72.jpg)
Graphical Model Selection with Twists
• Approximate recovery: [S., Cevher, IEEE TSIPN, 2016]
Pe(dmax) = maxG∈G
P[dH(E , E) > dmax |G ]
• Active learning / adaptive sampling [S., Cevher, AISTATS, 2017]
Decoder
Choose
Nodes
X(i)
GGSample
Y (i)
• Analogies
Communication Problems Data Problems
Feedback Active learning / adaptivity
Rate-distortion theory Approximate recovery
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 33/ 53
![Page 73: Information-Theoretic Limits for Inference, Learning, and ...scarlett/it_data_slides/01-FanoEst.pdf · I First fundamental limits without complexity constraints, then practical methods](https://reader033.fdocuments.net/reader033/viewer/2022052023/6038e167446b0870df77ca01/html5/thumbnails/73.jpg)
Graphical Model Selection with Twists
• Approximate recovery: [S., Cevher, IEEE TSIPN, 2016]
Pe(dmax) = maxG∈G
P[dH(E , E) > dmax |G ]
• Active learning / adaptive sampling [S., Cevher, AISTATS, 2017]
Decoder
Choose
Nodes
X(i)
GGSample
Y (i)
• Analogies
Communication Problems Data Problems
Feedback Active learning / adaptivity
Rate-distortion theory Approximate recovery
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 33/ 53
![Page 74: Information-Theoretic Limits for Inference, Learning, and ...scarlett/it_data_slides/01-FanoEst.pdf · I First fundamental limits without complexity constraints, then practical methods](https://reader033.fdocuments.net/reader033/viewer/2022052023/6038e167446b0870df77ca01/html5/thumbnails/74.jpg)
What About Continuous-Valued Estimation?
Statistical Estimation
[Figure: estimation pipeline — select a parameter (index V ↦ parameter θV), draw samples Y (possibly with inputs X), and infer the output estimate (θ̂ or V̂)]
• General statistical estimation setup:
I Unknown parameter θ ∈ Θ
I Samples Y = (Y1, . . . , Yn) drawn from Pθ(y)
I More generally, from Pθ,X with inputs X = (X1, . . . , Xn)
I Given Y (and possibly X), construct estimate θ̂

• Goal. Minimize some loss ℓ(θ̂, θ)
I 0-1 loss: ℓ(θ̂, θ) = 1{θ̂ ≠ θ}
I Squared ℓ2 loss: ℓ(θ̂, θ) = ‖θ̂ − θ‖²

• Typical example. Linear regression
I Estimate θ ∈ Rp from Y = Xθ + Z
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 34/ 53
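The linear regression example above can be sketched in a few lines: generate Y = Xθ + Z with Gaussian noise and evaluate the squared ℓ2 loss of the least-squares estimate (dimensions and noise level are illustrative assumptions):

```python
import numpy as np

# Estimate theta in R^p from Y = X theta + Z (Gaussian design and noise)
rng = np.random.default_rng(0)
n, p, sigma = 200, 10, 0.5
X = rng.standard_normal((n, p))
theta = rng.standard_normal(p)
Y = X @ theta + sigma * rng.standard_normal(n)

theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)   # least-squares estimate
sq_loss = np.sum((theta_hat - theta) ** 2)
print(f"squared l2 loss = {sq_loss:.4f}")
```

The expected loss here is roughly σ²·p/n, which is the kind of rate the minimax framework below makes precise.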
Minimax Risk
• Since the samples are random, so is θ̂, and hence so is ℓ(θ̂, θ)

• So we seek to minimize the average loss Eθ[ℓ(θ̂, θ)]
I Note: Eθ and Pθ denote averages w.r.t. Y when the true parameter is θ

• Minimax risk:

    Mn(Θ, ℓ) = inf_θ̂ sup_{θ∈Θ} Eθ[ℓ(θ̂, θ)],

i.e., the worst-case average loss over all θ ∈ Θ

• Approach: Lower bound the worst-case error by the average over a hard subset {θ1, . . . , θM}:

[Figure: reduction to hypothesis testing — select index V, form parameter θV, draw samples Y (possibly with inputs X), infer estimate V̂]
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 35/ 53
General Lower Bound via Fano’s Inequality
• To get a meaningful result, need a sufficiently “well-behaved” loss function. Subsequently, focus on loss functions of the form

    ℓ(θ̂, θ) = Φ(ρ(θ̂, θ)),

where ρ(θ, θ′) is some metric, and Φ(·) is some non-negative and increasing function (e.g., ℓ(θ̂, θ) = ‖θ̂ − θ‖²)

• Claim. Fix ε > 0, and let {θ1, . . . , θM} be a finite subset of Θ such that

    ρ(θv, θv′) ≥ ε,  ∀v, v′ ∈ {1, . . . , M}, v ≠ v′.

Then, we have

    Mn(Θ, ℓ) ≥ Φ(ε/2) · (1 − (I(V; Y) + log 2)/log M),

where V is uniform on {1, . . . , M}, and I(V; Y) is with respect to V → θV → Y.
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 36/ 53
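The claim is easy to evaluate numerically once a packing is fixed. The sketch below plugs in a hypothetical packing of M = 2^d points at pairwise distance ε under squared ℓ2 loss, for a few values of the mutual information (all parameter values are illustrative assumptions):

```python
import numpy as np

def fano_minimax_lb(M, eps, I_VY, Phi=lambda t: t ** 2):
    """Evaluate Phi(eps/2) * (1 - (I(V;Y) + log 2)/log M) from the claim above."""
    return Phi(eps / 2) * max(0.0, 1 - (I_VY + np.log(2)) / np.log(M))

# Hypothetical packing: M = 2^d points at pairwise distance eps, squared-l2 loss
d, eps = 50, 0.2
for I_VY in (1.0, 10.0, 30.0):
    print(f"I(V;Y) = {I_VY}: lower bound = {fano_minimax_lb(2 ** d, eps, I_VY):.5f}")
```

The bound is nontrivial only while I(V; Y) + log 2 < log M, i.e., while the samples do not carry enough information to distinguish the M hypotheses.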
Proof of General Lower Bound

• Using Markov’s inequality, for any ε₀ > 0,

  sup_{θ∈Θ} E_θ[ℓ(θ, θ̂)] ≥ sup_{θ∈Θ} Φ(ε₀) P_θ[ℓ(θ, θ̂) ≥ Φ(ε₀)] ≥ Φ(ε₀) sup_{θ∈Θ} P_θ[ρ(θ, θ̂) ≥ ε₀],

where the last step uses the fact that Φ is increasing. We apply this with ε₀ = ε/2.

• Let V̂ = argmin_{j=1,...,M} ρ(θ_j, θ̂). Then, by the triangle inequality and ρ(θ_v, θ_{v′}) ≥ ε, if ρ(θ_v, θ̂) < ε/2 then we must have V̂ = v:

  P_{θ_v}[ρ(θ_v, θ̂) ≥ ε/2] ≥ P_{θ_v}[V̂ ≠ v].

• Hence,

  sup_{θ∈Θ} P_θ[ρ(θ, θ̂) ≥ ε/2] ≥ max_{v=1,...,M} P_{θ_v}[ρ(θ_v, θ̂) ≥ ε/2]
                              ≥ max_{v=1,...,M} P_{θ_v}[V̂ ≠ v]
                              ≥ (1/M) Σ_{v=1,...,M} P_{θ_v}[V̂ ≠ v]
                              ≥ 1 − (I(V; Y) + log 2) / (log M),

where the final step uses Fano’s inequality.
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 37/ 53
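The key reduction step — a good estimate forces correct identification under the minimum-distance test V̂ = argmin_j ρ(θ_j, θ̂) — can be checked by simulation. A toy sketch (assumed 1D Gaussian model and sample-mean estimator, not from the slides):

```python
import random

# eps-separated hard subset on the real line, rho(a, b) = |a - b|, eps = 1.
thetas = [0.0, 1.0, 2.0, 3.0]
eps, n, sigma = 1.0, 4, 1.0
rng = random.Random(1)

def trial(v):
    """One experiment under theta_v: estimate, then min-distance decode."""
    theta_hat = sum(rng.gauss(thetas[v], sigma) for _ in range(n)) / n
    v_hat = min(range(len(thetas)), key=lambda j: abs(thetas[j] - theta_hat))
    return abs(theta_hat - thetas[v]) >= eps / 2, v_hat != v

results = []
for v in range(len(thetas)):
    far = err = 0
    for _ in range(5000):
        f, e = trial(v)
        far += f
        err += e
    results.append((far / 5000, err / 5000))

# Triangle-inequality step: {rho(theta_v, theta_hat) >= eps/2} contains
# {V_hat != v}, so the first frequency dominates the second for every v.
print(results)
```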
Local Approach

• General bound: If ρ(θ_v, θ_{v′}) ≥ ε for all v ≠ v′, then

  M_n(Θ, ℓ) ≥ Φ(ε/2) (1 − (I(V; Y) + log 2) / (log M)).

• Local approach: Choose a carefully-constructed “local” hard subset:

[Figure: the distributions P_v induced by the hard subset all lie within a small KL divergence ball around a common auxiliary distribution Q_Y.]

• Resulting bound:

  M_n(Θ, ℓ) ≥ Φ(ε/2) (1 − (max_{v=1,...,M} D(P^n_{θ_v} ‖ Q^n_Y) + log 2) / (log M)).
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 38/ 53
Global Approach

• General bound: If ρ(θ_v, θ_{v′}) ≥ ε for all v ≠ v′, then

  M_n(Θ, ℓ) ≥ Φ(ε/2) (1 − (I(V; Y) + log 2) / (log M)).

• Global approach: Pack as many ε-separated points into Θ as possible:

[Figure: a maximal ε-packing of Θ, with the induced distributions covered by a collection of KL divergence balls.]

I Typically suited to infinite-dimensional problems (e.g., non-parametric regression)

• Resulting bound:

  M_n(Θ, ℓ) ≥ Φ(ε_p/2) (1 − (log N*_{KL,n}(Θ, ε_{c,n}) + ε_{c,n} + log 2) / (log M*_ρ(Θ, ε_p))).

I M*_ρ(Θ, ε_p): number of ε_p-separated points θ we can pack into Θ (packing number)
I N*_{KL,n}(Θ, ε_{c,n}): number of KL divergence balls of size ε_{c,n} needed to cover the possible distributions P_Y (covering number)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 39/ 53
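The packing number M*_ρ can be lower-bounded constructively. A greedy sketch (illustrative helper names, not from the slides), in the spirit of taking any maximal ε-separated subset:

```python
def greedy_packing(points, eps, rho):
    """Greedily select an eps-separated subset (a packing); its size
    lower-bounds the packing number M*_rho(Theta, eps) restricted to
    the given candidate points."""
    packed = []
    for p in points:
        if all(rho(p, q) >= eps for q in packed):
            packed.append(p)
    return packed

# Theta = {0, 0.1, ..., 1.0} with the absolute-difference metric.
theta_grid = [i / 10 for i in range(11)]
pack = greedy_packing(theta_grid, 0.25, lambda a, b: abs(a - b))
print(pack)
```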
Continuous Example 1
Sparse Linear Regression
Sparse Linear Regression

• Linear regression model Y = Xθ + Z:

  Y ∈ ℝⁿ (samples),  X ∈ ℝ^{n×p} (feature matrix),  θ ∈ ℝᵖ (coefficients),  Z ∈ ℝⁿ (noise)

I Feature matrix X is given, and the noise is i.i.d. N(0, σ²)
I The coefficients are sparse – at most k non-zeros
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 40/ 53
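For concreteness, a small generator for this observation model (a sketch; the slides treat X as given, so the i.i.d. Gaussian X here is an illustrative assumption, and the ±ε non-zeros anticipate the hard subset used next):

```python
import random

def sparse_linear_data(n, p, k, eps, sigma, rng):
    """Draw Y = X theta + Z with a k-sparse theta whose non-zero entries
    are +/- eps, Gaussian X (illustrative choice), and N(0, sigma^2) noise."""
    support = rng.sample(range(p), k)
    theta = [0.0] * p
    for j in support:
        theta[j] = eps if rng.random() < 0.5 else -eps
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    Y = [sum(X[i][j] * theta[j] for j in range(p)) + rng.gauss(0, sigma)
         for i in range(n)]
    return X, Y, theta

rng = random.Random(0)
X, Y, theta = sparse_linear_data(n=20, p=50, k=3, eps=1.0, sigma=0.1, rng=rng)
print(len(Y), sum(1 for t in theta if t != 0.0))
```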
Converse via Fano’s Inequality

• Reduction to hypothesis testing: Fix ε > 0 and restrict to sparse vectors of the form

  θ = (0, 0, 0, ±ε, 0, ±ε, 0, 0, 0, 0, 0, ±ε, 0)

I Total number of such sequences = 2^k (p choose k) ≈ exp(k log(p/k)) (if k ≪ p)
I Choose a “well-separated” subset of size exp((k/4) log(p/k)) (Gilbert-Varshamov)
I Well-separated: the non-zero entries of any two vectors differ in at least k/8 indices

• Application of Fano’s inequality:

I Using the general bound given previously,

  M_n(Θ, ℓ) ≥ (kε²/32) (1 − (I(V; Y|X) + log 2) / ((k/4) log(p/k)))

• Bounding the mutual information:

I By a direct calculation, I(V; Y|X) ≤ (ε²/(2σ²)) · (k/p) · ‖X‖²_F (Gaussian RVs)
I Substitute and choose ε to optimize the bound: ε² = σ²p log(p/k) / (2‖X‖²_F)

• Final result: If ‖X‖²_F ≤ npΓ, then E[‖θ̂ − θ‖₂²] ≤ δ requires n ≥ c (σ²/(Γδ)) · k log(p/k)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 41/ 53
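To see the tension in the choice of ε, the following sketch (illustrative numbers of our own; it uses the mutual-information bound stated above) evaluates the Fano bracket for a few values:

```python
import math

def sparse_fano_bracket(eps, sigma, k, p, fro_sq):
    """Bracket 1 - (I_bar + log 2) / ((k/4) log(p/k)), where I_bar is the
    stated bound I(V; Y|X) <= eps^2/(2 sigma^2) * (k/p) * ||X||_F^2."""
    I_bar = eps ** 2 / (2 * sigma ** 2) * (k / p) * fro_sq
    return 1 - (I_bar + math.log(2)) / ((k / 4) * math.log(p / k))

# ||X||_F^2 = n*p corresponds to unit average power per entry (Gamma = 1).
k, p, sigma, n = 8, 1024, 1.0, 200
brackets = {eps: sparse_fano_bracket(eps, sigma, k, p, n * p)
            for eps in (0.1, 0.05, 0.01)}
print(brackets)
```

Shrinking ε drives the bracket toward 1, but the prefactor kε²/32 in the minimax bound shrinks too — hence the optimization over ε in the derivation above.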
Upper vs. Lower Bounds

• Recap of model: Y = Xθ + Z, where θ is k-sparse

• Lower bound: If ‖X‖²_F ≤ npΓ, achieving E[‖θ̂ − θ‖₂²] ≤ δ requires n ≥ c (σ²/(Γδ)) · k log(p/k)

• Upper bound: If X is a zero-mean random Gaussian matrix with power Γ per entry, then E[‖θ̂ − θ‖₂²] ≤ δ can be achieved with n = c′ (σ²/(Γδ)) · k log(p/k) samples

I Maximum-likelihood estimation suffices

• Tighter lower bounds could potentially be obtained under additional restrictions on X
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 42/ 53
Continuous Example 2
Convex Optimization
Stochastic Convex Optimization

• A basic optimization problem:

  x* = argmin_{x∈D} f(x)

For simplicity, we focus on the 1D case D ⊆ ℝ (extensions to ℝᵈ are possible)

• Model:

I Noisy samples: When we query x, we get a noisy value and a noisy gradient:

  Y = f(x) + Z,  Y′ = f′(x) + Z′,

where Z ~ N(0, σ²) and Z′ ~ N(0, σ²)

I Adaptive sampling: The chosen X_i may depend on Y_1, . . . , Y_{i−1}

• Function classes: Convex, strongly convex, Lipschitz, self-concordant, etc.

I We will focus on the class of strongly convex functions
I Strong convexity: f(x) − (c/2)x² is a convex function for some c > 0 (we set c = 1)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 43/ 53
Performance Measure and Minimax Risk

• After sampling n points, the algorithm returns a final point X̂

• The loss incurred is ℓ_f(X̂) = f(X̂) − min_{x∈D} f(x), i.e., the gap to the optimum

• For a given class of functions F, the minimax risk is given by

  M_n(F) = inf_{X̂} sup_{f∈F} E_f[ℓ_f(X̂)]
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 44/ 53
![Page 104: Information-Theoretic Limits for Inference, Learning, and ...scarlett/it_data_slides/01-FanoEst.pdf · I First fundamental limits without complexity constraints, then practical methods](https://reader033.fdocuments.net/reader033/viewer/2022052023/6038e167446b0870df77ca01/html5/thumbnails/104.jpg)
Reduction to Multiple Hypothesis Testing

• The picture remains the same:

[Diagram: Index V → Select → Function f_V → Samples (X, Y) → Algorithm → Infer → Estimate V̂]

I Successful optimization =⇒ successful identification of V
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 45/ 53
![Page 105: Information-Theoretic Limits for Inference, Learning, and ...scarlett/it_data_slides/01-FanoEst.pdf · I First fundamental limits without complexity constraints, then practical methods](https://reader033.fdocuments.net/reader033/viewer/2022052023/6038e167446b0870df77ca01/html5/thumbnails/105.jpg)
General Minimax Lower Bound

• Claim 1. Fix ε > 0, and let {f_1, . . . , f_M} ⊆ F be a subset of F such that for each x ∈ D, we have ℓ_{f_v}(x) ≤ ε for at most one value of v ∈ {1, . . . , M}. Then we have

  M_n(F) ≥ ε · (1 − (I(V; X, Y) + log 2) / (log M)),   (1)

where V is uniform on {1, . . . , M}, and I(V; X, Y) is w.r.t. V → f_V → (X, Y).

• Claim 2. In the special case M = 2, we have

  M_n(F) ≥ ε · H₂⁻¹(log 2 − I(V; X, Y)),   (2)

where H₂⁻¹(·) ∈ [0, 0.5] is the inverse binary entropy function.

• The proof is as with estimation, starting with Markov’s inequality:

  sup_{f∈F} E_f[ℓ_f(X̂)] ≥ sup_{f∈F} ε · P_f[ℓ_f(X̂) ≥ ε].

• The proof for M = 2 uses a (somewhat less well-known) form of Fano’s inequality for binary hypothesis testing
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 46/ 53
Strongly Convex Class: Choice of Hard Subset

• Reduction to hypothesis testing: In 1D, it suffices to choose just two similar functions!

I (Becomes 2^{constant×d} functions in d dimensions)

[Figure: two nearby quadratics on the input domain [0, 1], with minima slightly to the left and right of x = 1/2.]

• The precise functions:

  f_v(x) = (1/2)(x − x*_v)²,  v = 1, 2,

  x*_1 = 1/2 − √(2ε′),  x*_2 = 1/2 + √(2ε′)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 47/ 53
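A quick numerical check (a sketch with an assumed value of ε′) that these two functions form a valid hard subset for Claim 2 — no query point is near-optimal for both:

```python
import math

eps = 0.01                        # the eps' in the construction above
x1 = 0.5 - math.sqrt(2 * eps)
x2 = 0.5 + math.sqrt(2 * eps)
f1 = lambda x: 0.5 * (x - x1) ** 2
f2 = lambda x: 0.5 * (x - x2) ** 2

# Optimality gaps ell_{f_v}(x) = f_v(x) - min f_v = f_v(x). The eps-optimal
# regions of f1 and f2 touch only at x = 1/2, so we check the stricter
# eps/2 level on a grid to stay clear of that single boundary point.
both = [i / 1000 for i in range(1001)
        if f1(i / 1000) <= eps / 2 and f2(i / 1000) <= eps / 2]
print(both)
```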
Analysis

• Application of Fano’s Inequality. As above,

  M_n(F) ≥ ε · H₂⁻¹(log 2 − I(V; X, Y))

I Approach: H₂⁻¹(α) ≥ 1/10 if α ≥ (log 2)/2
I How few samples ensure I(V; X, Y) ≤ (log 2)/2?

• Bounding the Mutual Information. Let P_Y, P_{Y′} be the observation distributions (function value and gradient), and Q_Y, Q_{Y′} the same but with the midpoint function f_0(x) = (1/2)(x − 1/2)², taking ε′ = ε in the construction above. Then:

  D(P_Y × P_{Y′} ‖ Q_Y × Q_{Y′}) = (f_1(x) − f_0(x))²/(2σ²) + (f′_1(x) − f′_0(x))²/(2σ²)

I Simplifications: (f_1(x) − f_0(x))² ≤ (ε + √(ε/2))² ≤ 2ε (for small ε), and (f′_1(x) − f′_0(x))² = 2ε
I With some manipulation, I(V; X, Y) ≤ (log 2)/2 when ε = σ² log 2/(4n)

• Final result: M_n(F) ≥ σ² log 2/(40n)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 48/ 53
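The mutual-information budget can be checked numerically. The sketch below (our own variable names; f₀ taken as the midpoint quadratic, as above) confirms that n queries with ε = σ² log 2/(4n) keep n times the worst per-query KL divergence, and hence the mutual information, below (log 2)/2:

```python
import math

sigma, n = 1.0, 100
eps = sigma ** 2 * math.log(2) / (4 * n)   # the slide's choice of eps
x1 = 0.5 - math.sqrt(2 * eps)
f0 = lambda x: 0.5 * (x - 0.5) ** 2        # midpoint reference function
f1 = lambda x: 0.5 * (x - x1) ** 2
g0 = lambda x: x - 0.5                     # gradients
g1 = lambda x: x - x1

def kl_per_query(x):
    """KL between the (value, gradient) Gaussian observations at x."""
    return ((f1(x) - f0(x)) ** 2 + (g1(x) - g0(x)) ** 2) / (2 * sigma ** 2)

# The gradient term is constant in x: (g1 - g0)^2 = 2 * eps exactly.
worst = max(kl_per_query(i / 100) for i in range(101))
print(n * worst, "vs (log 2)/2 =", math.log(2) / 2)
```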
Upper vs. Lower Bounds

• Lower bound for 1D strongly convex functions: M_n(F) ≥ c σ²/n

• Upper bound for 1D strongly convex functions: M_n(F) ≤ c′ σ²/n

I Achieved by stochastic gradient descent

• Analogous results (and proof techniques) are known for d-dimensional functions, additional Lipschitz assumptions, etc. [Raginsky and Rakhlin, 2011]
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 49/ 53
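A minimal projected-SGD sketch (assumed step size 1/i and iterate averaging; an illustration of the O(σ²/n) behaviour, not the exact scheme behind the constant c′) on a 1-strongly-convex quadratic:

```python
import random

def sgd_gap(xstar, n, sigma, rng):
    """Projected SGD on f(x) = (x - xstar)^2 / 2 over [0, 1] with
    N(0, sigma^2) gradient noise; returns the optimality gap of the
    averaged iterate."""
    x, avg = 0.5, 0.0
    for i in range(1, n + 1):
        g = (x - xstar) + rng.gauss(0, sigma)   # noisy gradient query
        x = min(max(x - g / i, 0.0), 1.0)       # step 1/i (1-strong convexity)
        avg += (x - avg) / i                    # running average of iterates
    return 0.5 * (avg - xstar) ** 2

rng = random.Random(0)
n, sigma, xstar = 400, 0.5, 0.3
mean_gap = sum(sgd_gap(xstar, n, sigma, rng) for _ in range(200)) / 200
print(mean_gap, "vs sigma^2/n =", sigma ** 2 / n)
```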
Limitations and Generalizations
• Limitations of Fano’s Inequality.
I Non-asymptotic weakness
I Typically hard to tightly bound mutual information in adaptive settings
I Restriction to KL divergence
• Generalizations of Fano’s Inequality.
I Non-uniform V [Han/Verdu, 1994]
I More general f -divergences [Guntuboyina, 2011]
I Continuous V [Duchi/Wainwright, 2013]
(This list is certainly incomplete!)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 50/ 53
Example: Difficulties in Adaptive Settings

• A simple search problem: Find the (only) biased coin using few flips

[Figure: M coins, each with P[heads] = 1/2, except one biased coin with P[heads] = 1/2 + ε.]

I Biased coin V ∈ {1, . . . , M} uniformly at random
I Selected coin at time i = 1, . . . , n is X_i, observation is Y_i ∈ {0, 1} (1 for heads)

• Non-adaptive setting:

I Since X_i and V are independent, one can show I(V; Y_i|X_i) ≲ ε²/M
I Substituting into Fano’s inequality gives the requirement n ≳ (M log M)/ε²

• Adaptive setting:

I It is a nuisance to characterize I(V; Y_i|X_i), as X_i depends on V due to adaptivity!
I Worst-case bounding only gives n ≳ (log M)/ε²
I Next lecture: An alternative tool that gives n ≳ M/ε²
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 51/ 53
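In the non-adaptive setting the per-flip mutual information can be computed exactly. The sketch below (our own helper names) verifies I(V; Y_i | X_i) ≤ 2ε²/M for a uniformly random X_i independent of V:

```python
import math

def H2(q):
    """Binary entropy in nats."""
    return 0.0 if q in (0.0, 1.0) else -q * math.log(q) - (1 - q) * math.log(1 - q)

def mi_per_flip(M, eps):
    """Exact I(V; Y | X) for one flip of a uniformly chosen coin X that is
    independent of the biased-coin index V (the non-adaptive setting).
    Given X = x: Y ~ Bern(1/2 + eps) with prob. 1/M (V = x), else Bern(1/2)."""
    h_mix = H2(0.5 + eps / M)                          # H(Y | X = x)
    h_cond = H2(0.5 + eps) / M + (M - 1) / M * H2(0.5) # H(Y | V, X = x)
    return h_mix - h_cond

M, eps = 100, 0.1
I = mi_per_flip(M, eps)
print(I, "vs 2*eps^2/M =", 2 * eps ** 2 / M)
```

A second-order expansion of H₂ around 1/2 gives I ≈ 2ε²(M − 1)/M², matching the ε²/M scaling quoted above.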
Conclusion
• Information theory as a theory of data:
[Diagram: Information Theory at the center, linking Data Generation, Storage & Transmission, Inference & Learning, and Optimization]
• Approach highlighted in this talk:
I Reduction to multiple hypothesis testing
I Application of Fano’s inequality
I Bounding the mutual information
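The three steps above can be packaged into one small calculation (a sketch under the usual assumptions: V uniform over M hypotheses, information measured in bits, and I(V; Y^n) bounded by n times a per-sample quantity):

```python
import math

def min_samples(mi_per_sample_bits, M, delta):
    """Smallest n for which Fano's inequality still permits P(error) <= delta.
    Fano (weakened form): P(error) >= 1 - (I(V; Y^n) + 1) / log2(M) for V
    uniform on M hypotheses; with I(V; Y^n) <= n * mi_per_sample_bits, this
    requires n >= ((1 - delta) * log2(M) - 1) / mi_per_sample_bits."""
    return math.ceil(((1 - delta) * math.log2(M) - 1) / mi_per_sample_bits)

# Biased-coin example: per-flip information is at most ~ 2*eps^2 / (M ln 2)
M, eps = 100, 0.1
n_required = min_samples(2 * eps**2 / (M * math.log(2)), M, delta=0.1)
print(n_required)
```

For these values the answer is on the order of (M log M)/ε², matching the non-adaptive requirement from the coin-search example.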
• Examples:
I Group testing
I Graphical model selection
I Sparse regression
I Convex optimization
I ...hopefully many more to come!
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 52/ 53
![Page 121: Information-Theoretic Limits for Inference, Learning, and ...scarlett/it_data_slides/01-FanoEst.pdf · I First fundamental limits without complexity constraints, then practical methods](https://reader033.fdocuments.net/reader033/viewer/2022052023/6038e167446b0870df77ca01/html5/thumbnails/121.jpg)
Tutorial Chapter
• Tutorial Chapter: “An Introductory Guide to Fano’s Inequality with Applications in Statistical Estimation” [S. and Cevher, 2019]
https://arxiv.org/abs/1901.00555
(To appear in the book Information-Theoretic Methods in Data Science, Cambridge University Press)
Information-Theoretic Limits for Stats/ML — Jonathan Scarlett Slide 53/ 53