
arXiv:1704.00820v1 [cs.IT] 3 Apr 2017

Principal Inertia Components and Applications

Flavio P. Calmon, Ali Makhdoumi, Muriel Médard, Mayank Varia, Mark Christiansen, Ken R. Duffy∗

Abstract

We explore properties and applications of the Principal Inertia Components (PICs) between two discrete random variables $X$ and $Y$. The PICs lie in the intersection of information and estimation theory, and provide a fine-grained decomposition of the dependence between $X$ and $Y$. Moreover, the PICs describe which functions of $X$ can or cannot be reliably inferred (in terms of MMSE) given an observation of $Y$. We demonstrate that the PICs play an important role in information theory, and they can be used to characterize information-theoretic limits of certain estimation problems. In privacy settings, we prove that the PICs are related to fundamental limits of perfect privacy.

Contents

1 Introduction
1.1 Principal Inertia Components
1.2 Organization of the Paper
1.3 Notation
1.4 Related Work

2 Principal Inertia Components
2.1 A Geometric Interpretation of the PICs
2.2 Definition and Characterizations of the PICs
2.3 k-correlation

3 Applications of the PICs to Information Theory
3.1 Conforming distributions
3.2 One-bit Functions and Channel Transformations
3.3 Example: Binary Additive Noise Channels
3.4 The Information of a Boolean Function of the Input of a Channel
3.5 On the “Most Informative Bit”
3.6 One-bit Estimators

4 Application to Estimation: Bounds on Error Probability
4.1 A Convex Program for Bounds on Estimation
4.2 A Lower Bound for Error Probability Based on the PICs
4.3 Bounding the Estimation Error of Functions of a Hidden Random Variable

5 Applications of the PICs to Security and Privacy
5.1 The Privacy Funnel
5.2 The Optimal Privacy-Utility Coefficient and the Smallest PIC
5.3 Information Disclosure with Perfect Privacy

6 Final Remarks

Appendix A Proofs from Section 2

Appendix B Proofs from Section 3

Appendix C Proofs from Section 4

Appendix D Proofs from Section 5

∗F. P. Calmon is with the IBM T.J. Watson Research Center, Yorktown Heights, NY. Email: [email protected]. A. Makhdoumi and M. Médard are with the Massachusetts Institute of Technology. Email: {makhdoum, medard}@mit.edu. M. Varia is with Boston University. Email: [email protected]. M. Christiansen is with the Automobile Association, Ireland. Email: [email protected]. K. R. Duffy is with the Hamilton Institute at Maynooth University. Email: [email protected]. This paper was presented in part at the 51st Annual Allerton Conference on Communication, Control, and Computing (2013), the 2014 IEEE Info. Theory Workshop, and the 2015 International Symposium on Info. Theory.


[Figure 1 diagram. Labels in the privacy view: private info. ($S$), data ($X$), privacy-assuring mapping, “sanitized” data ($Y$), estimate of private info. Labels in the estimation view: function of hidden var. ($S$), hidden variable ($X$), nature, observed variable ($Y$), estimate.]

Figure 1: Problem central to both estimation and privacy.

1 Introduction

There is a fundamental limit to how much we can learn from data. The problem of determining which functions of a hidden variable can or cannot be estimated from a noisy observation is at the heart of estimation, statistical learning theory [1], and numerous other applications of interest. For example, one of the main goals of prediction is to determine a function of a hidden variable that can be reliably inferred from the output of a system.

Privacy and security applications are concerned with the inverse problem: guaranteeing that a certain set of functions of a hidden variable cannot be reliably estimated given the output of a system. Examples of such functions are the identity of an individual whose information is contained in a supposedly anonymous dataset [2], sensitive information of a user who joined a database [3,4], the political preference of a set of users who disclosed their movie ratings [5–7], among others. On the one hand, estimation methods attempt to extract as much information as possible from data. On the other hand, privacy-assuring systems seek to minimize the information about a secret variable that can be reliably estimated from disclosed data. The relationship between privacy and estimation is similar to the one noted by Shannon between cryptography and communication [8]: they are connected fields, but with different goals. As illustrated in Fig. 1, estimation and privacy are concerned with the same fundamental problem, and can be simultaneously studied through an information-theoretic lens.

In this paper, we discuss information-theoretic tools to address challenges in privacy, security and estimation. By studying fundamental models that are common to these fields, we derive information-theoretic metrics and associated results that simultaneously (i) delineate the fundamental limits of estimation and (ii) characterize the security properties of privacy-assuring systems.

We focus on the question that is central to privacy and estimation (illustrated in Fig. 1): How well can a random variable $S$ that is correlated with a hidden variable $X$ be estimated given an observation of $Y$? The information-theoretic metrics presented here seek to quantify properties of the random mapping from $X$ to $Y$ that can be translated into bounds on the error of estimating $S$ given an observation of $Y$. These bounds, which are often at the heart of information-theoretic converse proofs [9], provide universal, algorithm-independent guarantees on what can (or cannot) be learned from $Y$. With a characterization of these bounds in hand, we study properties of random mappings that seek to achieve privacy in terms of how well an adversary can estimate a secret $S$ given the output of the mapping $Y$.

The results in this paper are situated at the intersection of estimation, privacy and security. We derive a set of general sharp bounds on how well certain classes of functions of a hidden variable can(not) be estimated from a noisy observation. The bounds are expressed in terms of different information metrics of the joint distribution of the hidden and observed variables, and provide converse (negative) results: If an information metric is small, then not only can the hidden variable not be reliably estimated, but also no non-trivial function of the hidden variable can be inferred with probability of error or mean-squared error smaller than a certain threshold.

These results are applicable to both estimation and privacy. For estimation and statistical learning theory, they shed light on the fundamental limits of learning from noisy data, and can help guide the design of practical learning algorithms. In particular, the converse bounds can be used to derive minimax lower bounds (the same way Fano-style inequalities are used [10]). Furthermore, as illustrated in this paper, the proposed bounds are useful for creating security and privacy metrics, for characterizing the inherent trade-off between privacy and utility in statistical data disclosure problems and for studying the fundamental limits of perfect privacy. The tools used to derive the converse bounds are based on a set of statistics known as the Principal Inertia Components (PICs).


1.1 Principal Inertia Components

The PICs provide a fine-grained decomposition of the dependence between two random variables. Well-studied statistical methods for estimating the PICs [11,12] can lead to results on the (im)possibility of estimating large classes of functions by using bounds based on the PICs and standard statistical tests. We show how PICs can be used to characterize the information-theoretic limits of certain estimation problems. The PICs generalize other measures that are used in information theory, such as maximal correlation [13] and $\chi^2$-dependence [14]. The largest and smallest PICs play an important role in estimation and privacy (discussed in Sections 4 and 5). We also study properties of the sum of the $k$ largest principal inertia components. Below we list a few key properties of the PICs studied in this paper.

1. Overview of the PICs: We present an overview of the PICs and their different interpretations, summarized in Theorem 1. For two discrete random variables $X$ and $Y$, we denote the $k$ largest PICs by $\lambda_1(X;Y), \lambda_2(X;Y), \dots, \lambda_k(X;Y)$.

2. Sum of the PICs: We propose a measure of dependence termed $k$-correlation, which is defined as the sum of the $k$ largest PICs, i.e., $J_k(X;Y) \triangleq \sum_{i=1}^{k} \lambda_i(X;Y)$. This metric satisfies two key properties: (i) convexity in $p_{Y|X}$ (Theorem 2); (ii) the Data Processing Inequality (Theorem 3). The latter is also satisfied by $\lambda_1(X;Y), \dots, \lambda_d(X;Y)$ individually, where $d = \min\{|\mathcal{X}|, |\mathcal{Y}|\} - 1$. Both maximal correlation and the $\chi^2$-dependence between $X$ and $Y$ are special cases of $k$-correlation, with $J_1(X;Y) = \rho_m(X;Y)^2$ and $J_d(X;Y) = \chi^2(X;Y)$ (cf. notation in Section 1.3).

3. Largest PIC: The largest PIC satisfies $\lambda_1(X;Y) = \rho_m(X;Y)^2$, where $\rho_m(X;Y)$ is the maximal correlation between $X$ and $Y$, defined as [15]

$$\rho_m(X;Y) \triangleq \max_{\substack{\mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0 \\ \mathbb{E}[f(X)^2] = \mathbb{E}[g(Y)^2] = 1}} \mathbb{E}[f(X)g(Y)]. \tag{1}$$

We show that both the probability of error and the minimum mean-squared error (MMSE) of estimating any function of a hidden variable $X$ given an observation $Y$ are closely related to the largest PIC.

By making use of the fact that the PICs satisfy the Data Processing Inequality (DPI), we are able to derive a family of bounds for $P_e(X|Y)$, the smallest average error of estimating $X$ having observed $Y$ (cf. (5) and notation in Section 1.3), in terms of the marginal distribution of $X$, $p_X$, and $\lambda_1(X;Y), \dots, \lambda_d(X;Y)$, described in Theorem 6. This result sheds light on the relationship of $P_e(X|Y)$ with the PICs.

One immediate consequence of Theorem 6 is a useful scaling law for $P_e(X|Y)$ in terms of the largest PIC (the squared maximal correlation). Let $X = 1$ be the most likely outcome for $X$. Corollary 4 proves that the advantage an adversary (who has access to $Y$) has of guessing $X$ over guessing the most likely outcome $X = 1$ satisfies

$$\mathrm{Adv}(X|Y) \triangleq |1 - p_X(1) - P_e(X|Y)| \le O\left(\sqrt{\lambda_1(X;Y)}\right).$$

4. Smallest PIC: We show that the smallest PIC determines when perfect privacy, defined in Section 5, can be achieved with non-trivial utility in the model depicted in Fig. 1. More specifically, perfect privacy can be achieved with non-trivial utility if and only if the smallest PIC is 0 (Theorem 10).
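The properties listed above can be checked numerically on a small example. The sketch below assumes the standard correspondence-analysis characterization of the PICs, which is not reproduced in this section: they are the squared singular values of $\mathbf{D}_X^{-1/2}\mathbf{P}\mathbf{D}_Y^{-1/2}$ after the trivial singular value 1 is dropped. The joint pmf `P`, the channel `W` and the function names are illustrative choices, not taken from the paper.

```python
import numpy as np

def principal_inertia_components(P):
    """PICs of a joint pmf matrix P (rows: x, columns: y), largest first.

    Assumes both marginals are strictly positive.
    """
    P = np.asarray(P, dtype=float)
    p_x = P.sum(axis=1)
    p_y = P.sum(axis=0)
    # Q = D_X^{-1/2} P D_Y^{-1/2}; its largest singular value is always 1.
    Q = P / np.sqrt(np.outer(p_x, p_y))
    sigma = np.linalg.svd(Q, compute_uv=False)  # sorted in descending order
    return sigma[1:] ** 2

def k_correlation(P, k):
    """J_k(X;Y): sum of the k largest PICs."""
    return float(principal_inertia_components(P)[:k].sum())

# Illustrative joint pmf on a 3 x 3 alphabet.
P = np.array([[0.30, 0.10, 0.05],
              [0.05, 0.20, 0.05],
              [0.05, 0.05, 0.15]])

pics = principal_inertia_components(P)
d = min(P.shape) - 1
rho_m_squared = k_correlation(P, 1)  # J_1 = squared maximal correlation
chi2 = float((P**2 / np.outer(P.sum(axis=1), P.sum(axis=0))).sum() - 1.0)

# Data processing: pass Y through a further channel W to obtain Z (X -> Y -> Z).
W = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
pics_xz = principal_inertia_components(P @ W)
```

On this example `k_correlation(P, d)` agrees with `chi2`, and every entry of `pics_xz` is no larger than the corresponding entry of `pics`, as the Data Processing Inequality requires.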

1.2 Organization of the Paper

This paper is organized as follows. The rest of this section introduces notation and discusses related work. In Section 2, we present the PICs and their multiple characterizations (Theorem 1). We also introduce the definition of $k$-correlation, and demonstrate several properties of both $k$-correlation and, more broadly, the PICs, including convexity and the DPI. In Section 3, we apply the PICs to problems in information theory. In Section 4, we derive bounds on error probability and other estimation-theoretic results based on the PICs. Finally, in Section 5, we demonstrate how the PICs play an important role in privacy and can be used for determining privacy-assuring mappings. We first summarize the main results obtained by applying the PICs to information theory, estimation theory and privacy.

Applications to Information Theory

We present several distinct applications of the PICs to information theory. In Section 3.2, we demonstrate that the PICs correspond to the singular values of certain channel transformation matrices, and their effect on input distributions to the channel bears an interpretation similar to that of filter coefficients in a linear filter [16]. This is illustrated through an example in binary additive noise channels, where we argue that the binary symmetric channel is akin to a low-pass filter. We show how the PICs, and particularly the largest PIC, can be used to derive bounds on information metrics between one-bit functions of a hidden variable $X$ and a correlated observation $Y$. We apply these results to the “one-bit function conjecture” [17]. We do not solve this conjecture here. Nevertheless, we present further evidence for its validity, and introduce another conjecture based on our results.
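The low-pass analogy can be made concrete with a small numerical sketch. It relies on a standard Fourier-analytic fact rather than on the paper's own derivation: for $n$ independent uses of a binary symmetric channel with crossover probability $\delta$ and a uniform input, there is one PIC equal to $(1-2\delta)^{2|S|}$ for every non-empty subset $S \subseteq [n]$, so components indexed by larger subsets are attenuated more strongly. The values of `n` and `delta` below are arbitrary.

```python
import itertools
import numpy as np

n, delta = 3, 0.1
bsc = np.array([[1 - delta, delta],
                [delta, 1 - delta]])

# Channel matrix of n independent uses of the BSC (Kronecker power).
channel = np.array([[1.0]])
for _ in range(n):
    channel = np.kron(channel, bsc)

# Joint pmf under a uniform input on {0,1}^n.
P = channel / 2**n

p_x, p_y = P.sum(axis=1), P.sum(axis=0)
Q = P / np.sqrt(np.outer(p_x, p_y))
# Drop the trivial singular value 1, then square to get the PICs.
pics = np.sort(np.linalg.svd(Q, compute_uv=False)[1:] ** 2)[::-1]

# One value (1 - 2*delta)^(2|S|) per non-empty subset S of the n coordinates.
expected = sorted(
    ((1 - 2 * delta) ** (2 * len(S))
     for r in range(1, n + 1)
     for S in itertools.combinations(range(n), r)),
    reverse=True,
)
```

Here `pics` and `expected` coincide up to floating-point error: three PICs at $(1-2\delta)^2$, three at $(1-2\delta)^4$ and one at $(1-2\delta)^6$.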

The new conjecture (cf. Conjecture 1) generalizes the “one-bit function conjecture”. It states that, given a symmetric distribution $p_{X,Y}$, if we generate a new distribution $q_{X,Y}$ by making all the PICs of $p_{X,Y}$ equal to the largest one, then the new distribution is more informative about bits of $X$. By more informative, we mean that, for any 1-bit function $b$, $I(b(X);Y)$ is larger under $q_{X,Y}$ than under $p_{X,Y}$. Indeed, from an estimation-theoretic perspective, increasing the PICs implies that any function of $X$ can be estimated with smaller MMSE when considering $q_{X,Y}$ than $p_{X,Y}$. Furthermore, in this case, we show that $q_{X,Y}$ is a $q$-ary symmetric channel. This conjecture, if proven, would imply as a corollary the original one-bit function conjecture.

We do show that our results on the PICs can be used to resolve the one-bit function conjecture in a specific setting in Section 3.6. Instead of considering the mutual information between $b(X)$ and $Y$, we study the mutual information between $b(X)$ and a one-bit estimator $\hat{b}(Y)$. We show in Theorem 5 that, when $\hat{b}(Y)$ is an unbiased estimator, the information that $\hat{b}(Y)$ carries about $b(X)$ can be upper-bounded for a range of dependence metrics (e.g. mutual information). This result also leads to bounds on estimation error probability.

Applications to Estimation Theory

In Section 4, we derive converse bounds on estimation error based on the PICs. In particular, we provide lower bounds on (i) the probability of error in guessing a hidden variable $X$ given an observation $Y$ and (ii) the MMSE of estimating $X$ given $Y$. These results are stated in terms of the PICs between $X$ and $Y$, and provide algorithm-independent bounds on estimation. We also extend these bounds to the functional setting, and show that the advantage over a random guess of correctly estimating a function of $X$ given an observation of $Y$ is upper-bounded by the largest PIC between $X$ and $Y$. More specifically, we propose a family of lower bounds for the error probability of estimating $X$ given $Y$ based on the PICs of $p_{X,Y}$ and the marginal distribution of $X$ in Theorems 6 and 9. We also extend these bounds to the probability of correctly estimating a function of the hidden variable $X$ given an observation of $Y$.

These results are based on a more general framework for deriving bounds on error probability, discussed in Section 4.1. At the heart of this framework are rate-distortion (test-channel) formulations that allow bounds on information metrics to be translated into bounds on estimation. These formulations, in turn, are based on convex programs that minimize the average estimation error over all possible distributions that satisfy a bound on a given information metric. The solutions of such convex programs are called error-rate functions. We study extremal properties of the error-rate function and, by revisiting a result by Ahlswede [18], we show how to extend the error-rate function to quantify not only the smallest average error of estimating a hidden variable, but also of estimating any function of a hidden variable.

Applications to Privacy

When referring to privacy in this paper, we consider the setting studied by Calmon and Fawaz in [19]. Using Fig. 1 as reference, we study the problem of disclosing data $X$ to a third party in order to derive some utility based on $X$. At the same time, some information correlated with $X$, denoted by $S$, is supposed to remain private. The engineering goal is to create a random mapping, called the privacy-assuring mapping, that transforms $X$ into new data $Y$ that achieves a certain target utility, while minimizing the information revealed about $S$. For example, $X$ can represent movie ratings that a user intends to disclose to a third party in order to receive movie recommendations [5–7,20]. At the same time, the user may want to keep her political preference $S$ secret. We allow the user to distort movie ratings in her data $X$ in order to generate new data $Y$. The goal would then be to find privacy-assuring mappings that minimize the number of distorted entries in $Y$ given a privacy constraint (e.g. the third party cannot guess $S$ with significant advantage over a random guess). In general, $X$ is not restricted to be the data of an individual user, and can also represent multidimensional data derived from different sources.

We present necessary and sufficient conditions for achieving perfect privacy while disclosing a non-trivial amount of useful information when both $S$ and $X$ have finite support $\mathcal{S}$ and $\mathcal{X}$, respectively. We prove that the smallest PIC of $p_{S,X}$ plays a central role in achieving perfect privacy (i.e. $I(S;Y) = 0$): If $|\mathcal{X}| \le |\mathcal{S}|$, then perfect privacy is achievable with $I(X;Y) > 0$ if and only if the smallest PIC of $p_{S,X}$ is 0. Since $I(S;Y) = 0$ if and only if $S \perp\!\!\!\perp Y$, this fundamental result holds for any privacy metric where statistical independence implies perfect privacy. We also provide an explicit lower bound for the amount of useful information that can be released while guaranteeing perfect privacy, and demonstrate how to construct $p_{Y|X}$ in order to achieve this bound.

In addition, we derive general bounds for the minimum amount of disclosed private information $I(S;Y)$ given that, on average, at least $t$ bits of useful information are revealed, i.e. $I(X;Y) \ge t$. These bounds are sharp, and delimit the achievable privacy-utility region for the considered setting. Adopting an analysis related to the information bottleneck [21] and to the characterization of linear contraction coefficients in strong DPIs [22,23], we determine the smallest achievable ratio between disclosed private and useful information, i.e. $\inf_{p_{Y|X}} I(S;Y)/I(X;Y)$. We prove that this value is upper-bounded by the smallest PIC, and is zero if and only if the smallest PIC is zero. In this case, we present an explicit construction of a privacy-assuring mapping that discloses a non-trivial amount of useful information while guaranteeing perfect privacy. We also show that when the data is composed of multiple i.i.d. samples $(S^n, X^n)$, the smallest PIC decreases exponentially in $n$. Consequently, as the number of samples $n$ increases, we can achieve a more favorable trade-off between disclosing useful and private information. Finally, we motivate potential future applications of the PICs as a design driver for privacy-assuring mappings in our final remarks in Section 6.
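The link between a zero smallest PIC and perfect privacy can be illustrated numerically. The sketch below is one elementary way to build such a mapping and is not claimed to be the construction used in the paper: if a non-zero function $f$ satisfies $\mathbb{E}[f(X)|S=s]=0$ for every $s$, then a binary $Y$ with $p_{Y|X}(1|x) = 1/2 + \epsilon f(x)$ is independent of $S$ yet depends on $X$. The distribution `p_s`, the matrix `P_X_given_S` (chosen to be rank deficient) and `eps` are hypothetical values.

```python
import numpy as np

def mutual_information(P):
    """Mutual information in bits of a joint pmf matrix P."""
    P = np.asarray(P, dtype=float)
    prod = np.outer(P.sum(axis=1), P.sum(axis=0))
    mask = P > 0
    return float((P[mask] * np.log2(P[mask] / prod[mask])).sum())

# Hypothetical example with |S| = |X| = 3. Row s holds p_{X|S}(. | s);
# the third row is the average of the first two, so the matrix has rank 2.
p_s = np.array([0.5, 0.3, 0.2])
P_X_given_S = np.array([[0.50, 0.30, 0.20],
                        [0.20, 0.30, 0.50],
                        [0.35, 0.30, 0.35]])
P_SX = p_s[:, None] * P_X_given_S
p_x = P_SX.sum(axis=0)

# Smallest PIC of p_{S,X}: smallest squared singular value of D_S^{-1/2} P D_X^{-1/2}.
Q = P_SX / np.sqrt(np.outer(p_s, p_x))
smallest_pic = np.linalg.svd(Q, compute_uv=False)[-1] ** 2

# f spans the null space of P_X_given_S, i.e. E[f(X) | S = s] = 0 for all s.
_, _, Vt = np.linalg.svd(P_X_given_S)
f = Vt[-1] / np.abs(Vt[-1]).max()  # scaled so that |f(x)| <= 1

# Binary privacy-assuring mapping p_{Y|X}; columns are y = 0 and y = 1.
eps = 0.4
P_Y_given_X = np.column_stack([0.5 - eps * f, 0.5 + eps * f])

leakage = mutual_information(P_SX @ P_Y_given_X)          # I(S;Y)
utility = mutual_information(p_x[:, None] * P_Y_given_X)  # I(X;Y)
```

In this example `smallest_pic` and `leakage` vanish up to floating-point error while `utility` is strictly positive.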

1.3 Notation

Capital letters (e.g. $X$ and $Y$) are used to denote random variables, and calligraphic letters (e.g. $\mathcal{X}$ and $\mathcal{Y}$) denote sets. The exceptions are (i) $I$, which will be used in Section 4 to denote a non-specified measure of dependence, and (ii) $T$, which will denote the conditional expectation operator (defined below). The support sets of random variables $X$ and $Y$ are denoted by $\mathcal{X}$ and $\mathcal{Y}$, respectively. If $X$ and $Y$ have finite support sets $|\mathcal{X}| < \infty$ and $|\mathcal{Y}| < \infty$, then we denote the joint probability mass function (pmf) of $X$ and $Y$ as $p_{X,Y}$, the conditional pmf of $Y$ given $X$ as $p_{Y|X}$, and the marginal distributions of $X$ and $Y$ as $p_X$ and $p_Y$, respectively. We denote the fact that $X$ is distributed according to $p_X$ by $X \sim p_X$. When $p_{X,Y,Z}(x,y,z) = p_X(x)p_{Y|X}(y|x)p_{Z|Y}(z|y)$ (i.e. $X, Y, Z$ form a Markov chain), we write $X \to Y \to Z$. We denote independence of two random variables $X$ and $Y$ by $X \perp\!\!\!\perp Y$.

For positive integers $j, k, n$, $j \le k$, we define $[n] \triangleq \{1, \dots, n\}$ and $[j,k] \triangleq \{j, j+1, \dots, k\}$. For any $x \in \mathbb{R}$, $[x]^+$ is defined as $x$ if $x \ge 0$ and 0 otherwise. Matrices are denoted in bold capital letters (e.g. $\mathbf{X}$) and vectors in bold lower-case letters (e.g. $\mathbf{x}$). The $(i,j)$-th entry of a matrix $\mathbf{X}$ is given by $[\mathbf{X}]_{i,j}$. Furthermore, for $\mathbf{x} \in \mathbb{R}^n$, we let $\mathbf{x} = (x_1, \dots, x_n)$. We denote by $\mathbf{1}$ the vector with all entries equal to 1, and the dimension of $\mathbf{1}$ will be clear from the context. The singular values of a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ are denoted by $\sigma_1(\mathbf{X}), \dots, \sigma_m(\mathbf{X})$. For a matrix $\mathbf{X}$, we denote its $k$-th Ky Fan norm [24, Eq. (7.4.8.1)] by $\|\mathbf{X}\|_k \triangleq \sum_{i=1}^{k} \sigma_i(\mathbf{X})$.
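As a quick check of the Ky Fan norm definition, the sketch below computes it from the singular values returned by NumPy. The helper name `ky_fan_norm` and the test matrix are illustrative. For $k = 1$ the norm reduces to the spectral norm, and when $k$ equals the number of singular values it reduces to the nuclear norm.

```python
import numpy as np

def ky_fan_norm(X, k):
    """Sum of the k largest singular values of X."""
    sigma = np.linalg.svd(np.asarray(X, dtype=float), compute_uv=False)
    return float(sigma[:k].sum())  # singular values come sorted, largest first

X = np.diag([3.0, 2.0, 1.0])
spectral = ky_fan_norm(X, 1)  # equals the spectral norm
nuclear = ky_fan_norm(X, 3)   # equals the nuclear norm
```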

For a random variable $X$ with discrete support and $X \sim p_X$, the entropy of $X$ is given by

$$H(X) \triangleq -\mathbb{E}\left[\log\left(p_X(X)\right)\right].$$

If $Y$ has a discrete support set and $X, Y \sim p_{X,Y}$, the mutual information between $X$ and $Y$ is

$$I(X;Y) \triangleq \mathbb{E}\left[\log\left(\frac{p_{X,Y}(X,Y)}{p_X(X)p_Y(Y)}\right)\right].$$

The base of the logarithm will be clear from the context. The $\chi^2$-information between $X$ and $Y$ is

$$\chi^2(X;Y) \triangleq \mathbb{E}\left[\frac{p_{X,Y}(X,Y)}{p_X(X)p_Y(Y)}\right] - 1.$$
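The three quantities above translate directly into sums over a joint pmf matrix. The sketch below uses base-2 logarithms and an illustrative binary example (uniform input through a binary symmetric channel with crossover 0.1); the function names are hypothetical.

```python
import numpy as np

def entropy(p):
    """H(X) in bits, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(P):
    """I(X;Y) in bits for a joint pmf matrix P (rows: x, columns: y)."""
    P = np.asarray(P, dtype=float)
    prod = np.outer(P.sum(axis=1), P.sum(axis=0))
    mask = P > 0
    return float((P[mask] * np.log2(P[mask] / prod[mask])).sum())

def chi2_information(P):
    """chi^2(X;Y) = E[p_{X,Y} / (p_X p_Y)] - 1, expectation taken under p_{X,Y}."""
    P = np.asarray(P, dtype=float)
    prod = np.outer(P.sum(axis=1), P.sum(axis=0))
    mask = prod > 0
    return float((P[mask] ** 2 / prod[mask]).sum() - 1.0)

# Uniform binary input through a binary symmetric channel with crossover 0.1.
P = np.array([[0.45, 0.05],
              [0.05, 0.45]])
h_x = entropy(P.sum(axis=1))
mi = mutual_information(P)
chi2 = chi2_information(P)
```

For this example the input entropy is 1 bit, the mutual information equals 1 minus the binary entropy of 0.1, and the $\chi^2$-information equals $(1 - 2 \cdot 0.1)^2 = 0.64$.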

We denote the binary entropy function $h_b : [0,1] \to \mathbb{R}$ as

$$h_b(x) \triangleq -x\log x - (1-x)\log(1-x), \tag{2}$$

where, as usual, $0 \log 0 \triangleq 0$.

Let $X$ and $Y$ be discrete random variables with finite support sets $\mathcal{X} = [m]$ and $\mathcal{Y} = [n]$, respectively. Then we define the joint distribution matrix $\mathbf{P}$ as an $m \times n$ matrix with $[\mathbf{P}]_{i,j} \triangleq p_{X,Y}(i,j)$. We denote by $\mathbf{p}_X$ (respectively, $\mathbf{p}_Y$) the vector with $i$-th entry equal to $p_X(i)$ (resp. $p_Y(i)$). $\mathbf{D}_X = \mathrm{diag}(\mathbf{p}_X)$ and $\mathbf{D}_Y = \mathrm{diag}(\mathbf{p}_Y)$ are matrices with diagonal entries equal to $\mathbf{p}_X$ and $\mathbf{p}_Y$, respectively, and all other entries equal to 0. The matrix $\mathbf{P}_{Y|X} \in \mathbb{R}^{m \times n}$ is defined as $[\mathbf{P}_{Y|X}]_{i,j} \triangleq p_{Y|X}(j|i)$. Note that $\mathbf{P} = \mathbf{D}_X\mathbf{P}_{Y|X}$.
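The matrix notation can be sketched in NumPy as follows. The joint pmf is an arbitrary illustrative choice with $m = 2$ and $n = 3$, and the variable names simply mirror the symbols above.

```python
import numpy as np

# Illustrative joint distribution matrix P with m = 2 rows (x) and n = 3 columns (y).
P = np.array([[0.30, 0.10, 0.05],
              [0.05, 0.20, 0.30]])

p_X = P.sum(axis=1)            # marginal of X: P times the all-ones vector
p_Y = P.sum(axis=0)            # marginal of Y: all-ones vector times P
D_X = np.diag(p_X)
D_Y = np.diag(p_Y)
P_Y_given_X = P / p_X[:, None]  # row i holds p_{Y|X}(. | i)

# Reassemble the joint from the marginal and the channel: P = D_X P_{Y|X}.
reconstructed = D_X @ P_Y_given_X
```

Each row of `P_Y_given_X` sums to 1, and `reconstructed` matches `P`.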


For any real-valued random variable $X$, we denote the $L_p$-norm of $X$ as

$$\|X\|_p \triangleq \left(\mathbb{E}\left[|X|^p\right]\right)^{1/p}.$$

The set of all functions that, when composed with a random variable $X$ with distribution $p_X$, result in an $L_2$-norm of at most 1 is given by

$$\mathcal{L}_2(p_X) \triangleq \{f : \mathcal{X} \to \mathbb{R} \mid \|f(X)\|_2 \le 1\}. \tag{3}$$

The operators $T_X : \mathcal{L}_2(p_Y) \to \mathcal{L}_2(p_X)$ and $T_Y : \mathcal{L}_2(p_X) \to \mathcal{L}_2(p_Y)$ denote conditional expectation, where

$$(T_X g)(x) = \mathbb{E}[g(Y)|X = x] \quad \text{and} \quad (T_Y f)(y) = \mathbb{E}[f(X)|Y = y], \tag{4}$$

respectively. Observe that $T_X$ and $T_Y$ are adjoint operators.

For $X$ and $Y$ with discrete support sets, we denote by $P_e(X|Y)$ the smallest average probability of error of estimating $X$ given an observation of $Y$, defined as

$$P_e(X|Y) = \min_{X \to Y \to \hat{X}} \Pr\{X \neq \hat{X}\}, \tag{5}$$

where the minimum is taken over all distributions $p_{\hat{X}|Y}$ such that $X \to Y \to \hat{X}$. The advantage of correctly estimating $X$ given an observation of $Y$ over a random guess is defined as:

$$\mathrm{Adv}(X|Y) = 1 - P_e(X|Y) - \max_{x \in \mathcal{X}} p_X(x). \tag{6}$$
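Definitions (5) and (6) can be evaluated directly from the joint distribution matrix. The sketch below relies on the standard fact that the maximum a posteriori rule, which picks the $x$ maximizing $p_{X,Y}(x,y)$ for each $y$, attains the minimum in (5). The function names and the joint pmf are illustrative.

```python
import numpy as np

def error_probability(P):
    """P_e(X|Y) for a joint pmf matrix P (rows: x, columns: y), via the MAP rule."""
    P = np.asarray(P, dtype=float)
    # For each observed y the MAP guess is correct with probability max_x p(x, y).
    return float(1.0 - P.max(axis=0).sum())

def advantage(P):
    """Adv(X|Y) = 1 - P_e(X|Y) - max_x p_X(x), as in (6)."""
    P = np.asarray(P, dtype=float)
    return float(1.0 - error_probability(P) - P.sum(axis=1).max())

P = np.array([[0.30, 0.10, 0.05],
              [0.05, 0.20, 0.05],
              [0.05, 0.05, 0.15]])
p_e = error_probability(P)
adv = advantage(P)
```

For this matrix the column maxima sum to 0.65, so `p_e` is 0.35, and since the most likely value of $X$ has probability 0.45 the advantage is 0.20.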

The MMSE of estimating $X$ from an observation of $Y$ is given by

$$\mathrm{mmse}(X|Y) \triangleq \min_{X \to Y \to \hat{X}} \mathbb{E}\left[(X - \hat{X})^2\right],$$

where the minimum is taken over all distributions $p_{\hat{X}|Y}$ such that $X \to Y \to \hat{X}$. Note that, from Jensen’s inequality, it is sufficient to consider $\hat{X}$ a deterministic mapping of $Y$. For any $X \to Y \to g(Y)$ with $\|g(Y)\|_2 = \alpha$ and $\|X\|_2 = \sigma$,

$$\mathbb{E}\left[(X - g(Y))^2\right] \ge \sigma^2 + \alpha^2 - 2\alpha\|\mathbb{E}[X|Y]\|_2,$$

with equality if and only if $g(Y)$ is proportional to $\mathbb{E}[X|Y]$. Minimizing the right-hand side over all $\alpha$, we find that the MMSE estimator of $X$ from $Y$ is $g(y) = \mathbb{E}[X|Y = y]$, and

$$\mathrm{mmse}(X|Y) = \|X\|_2^2 - \|\mathbb{E}[X|Y]\|_2^2. \tag{7}$$
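Identity (7) is easy to confirm numerically. In the sketch below the real values `x_vals` taken by $X$ and the joint pmf are arbitrary illustrative choices; the code compares the mean-squared error of the conditional-mean estimator with the right-hand side of (7), and with the error of a deliberately perturbed estimator.

```python
import numpy as np

x_vals = np.array([-1.0, 0.0, 2.0])  # values taken by X (illustrative)
P = np.array([[0.30, 0.10, 0.05],    # joint pmf, rows: x, columns: y
              [0.05, 0.20, 0.05],
              [0.05, 0.05, 0.15]])
p_x, p_y = P.sum(axis=1), P.sum(axis=0)

# Conditional mean E[X | Y = y] for each y.
cond_mean = (x_vals @ P) / p_y

def mse(g):
    """Mean-squared error E[(X - g(Y))^2] for an estimator given as a vector over y."""
    return float((P * (x_vals[:, None] - g[None, :]) ** 2).sum())

mmse_direct = mse(cond_mean)
# Right-hand side of (7): ||X||_2^2 - ||E[X|Y]||_2^2.
mmse_formula = float((p_x * x_vals**2).sum() - (p_y * cond_mean**2).sum())
mse_perturbed = mse(cond_mean + 0.1)
```

The two MMSE values agree, and the perturbed estimator incurs a strictly larger error.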

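For a given joint distribution $p_{X,Y}$ and corresponding joint distribution matrix $\mathbf{P}$, the set of all vectors contained in the unit cube in $\mathbb{R}^n$ that satisfy $\|\mathbf{P}\mathbf{x}\|_1 = a$ is given by

$$\mathcal{C}^n(a, \mathbf{P}) \triangleq \{\mathbf{x} \in \mathbb{R}^n \mid 0 \le x_i \le 1, \|\mathbf{P}\mathbf{x}\|_1 = a\}. \tag{8}$$

We represent the set of all $m \times n$ probability distribution matrices by $\mathcal{P}_{m,n}$.

For $x^n \in \{-1, 1\}^n$ and $S \subseteq [n]$,

$$\chi_S(x^n) \triangleq \prod_{i \in S} x_i \tag{9}$$

(we consider $\chi_\emptyset(x) = 1$). For $y^n \in \{-1, 1\}^n$, $a^n = x^n \oplus y^n$ is the vector resulting from the entrywise product of $x^n$ and $y^n$, i.e. $a_i = x_i y_i$, $i \in [n]$.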
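A short sketch of the parity functions in (9) follows. It uses 0-based indices for the subset $S$, unlike the 1-based set $[n]$ in the text, and the helper name `chi` is illustrative. It also evaluates two standard properties of these functions that are not stated in this section: $\chi_S(x^n \oplus y^n) = \chi_S(x^n)\chi_S(y^n)$, and orthonormality under the uniform distribution on $\{-1,1\}^n$.

```python
import itertools
import math

def chi(S, x):
    """Parity function: product of x[i] over i in S (0-based); empty product is 1."""
    return math.prod(x[i] for i in S)

n = 3
cube = list(itertools.product([-1, 1], repeat=n))
subsets = [S for r in range(n + 1) for S in itertools.combinations(range(n), r)]

# Multiplicativity under the entrywise product a_i = x_i * y_i.
multiplicative = all(
    chi(S, tuple(xi * yi for xi, yi in zip(x, y))) == chi(S, x) * chi(S, y)
    for S in subsets for x in cube for y in cube
)

def inner(S, T):
    """Average of chi_S(x) * chi_T(x) over the uniform distribution on the cube."""
    return sum(chi(S, x) * chi(T, x) for x in cube) / len(cube)
```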

Given two probability distributions $p_X$ and $q_X$ and $f(t)$ a smooth convex function defined for $t > 0$ with $f(1) = 0$, the $f$-divergence is defined as [25]

$$D_f(p_X \| q_X) \triangleq \sum_{x} q_X(x) f\left(\frac{p_X(x)}{q_X(x)}\right). \tag{10}$$

The $f$-information is given by

$$I_f(X;Y) \triangleq D_f(p_{X,Y} \| p_X p_Y). \tag{11}$$

When $f(x) = x\log(x)$, then $I_f(X;Y) = I(X;Y)$. A study of information metrics related to $f$-information was given in [26] in the context of channel coding converses.
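Definitions (10) and (11) can be sketched generically, with the convex function passed as a callable. The helper names are illustrative and the implementation assumes that $q$ is strictly positive wherever $p$ is. Besides $f(x) = x\log x$, the example evaluates $f(t) = (t-1)^2$, which also satisfies $f(1) = 0$ and recovers the $\chi^2$-information of Section 1.3.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) f(p(x) / q(x)), restricted to the support of q."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    mask = q > 0
    return float((q[mask] * f(p[mask] / q[mask])).sum())

def f_information(P, f):
    """I_f(X;Y) = D_f(p_{X,Y} || p_X p_Y) for a joint pmf matrix P."""
    P = np.asarray(P, dtype=float)
    return f_divergence(P, np.outer(P.sum(axis=1), P.sum(axis=0)), f)

def x_log_x(t):
    """t * log2(t) with the convention 0 log 0 = 0."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    pos = t > 0
    out[pos] = t[pos] * np.log2(t[pos])
    return out

# Uniform binary input through a binary symmetric channel with crossover 0.1.
P = np.array([[0.45, 0.05],
              [0.05, 0.45]])
mi = f_information(P, x_log_x)                    # mutual information in bits
chi2 = f_information(P, lambda t: (t - 1.0) ** 2)  # chi^2-information
```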


1.4 Related Work

The joint distribution matrix $\mathbf{P}$ can be viewed as a contingency table and decomposed using standard techniques from correspondence analysis [11,27]. For an overview of correspondence analysis, we refer the reader to [28]. The term “principal inertia components”, used here, is borrowed from the correspondence analysis literature [11]. However, the study of the PICs of the joint distribution of two random variables or, equivalently, the spectrum of the conditional expectation operator, predates correspondence analysis, and goes back to the work of Hirschfeld [29], Gebelein [30], Sarmanov [31] and Rényi [15], having also appeared in the work of Witsenhausen [32] and Ahlswede and Gács [22]. The PICs are also related to strong DPIs and contraction coefficients, being recently investigated by Anantharam et al. [23], Polyanskiy [33], Raginsky [34], Calmon et al. [35], Makur and Zheng [36], among others. Recently, Liu et al. [37] provided a unified perspective on several functional inequalities used in the study of strong DPIs and hypercontractivity. The PICs also play a role in Euclidean Information Theory [38], since they are related to $\chi^2$-divergence and, consequently, to local approximations of mutual information and related measures.

The largest principal inertia component is equal to $\rho_m(X;Y)^2$, where $\rho_m(X;Y)$ is the maximal correlation between $X$ and $Y$. Maximal correlation has been widely studied in the information theory and statistics literature (e.g. [15,31]). Ahlswede and Gács studied maximal correlation in the context of contraction coefficients in strong data processing inequalities [22], and more recently Anantharam et al. presented in [23] an overview of different characterizations of maximal correlation, as well as its application in information theory. Estimating the maximal correlation is also the goal of the Alternating Conditional Expectation (ACE) algorithm introduced by Breiman and Friedman [12], further analyzed by Buja [39], and recently investigated in [40].

The DPI for the PICs was shown by Kang and Ulukus in [41, Theorem 2] in a different setting thanthe one considered here. Kang and Ulukus made use of the decomposition of the joint distribution matrixto derive outer bounds for the rate-distortion region achievable in certain distributed source and channelcoding problems.

Lower bounds on the average estimation error can be found using Fano-style inequalities. Recently,Guntuboyina et al. ([42,43]) presented a family of sharp bounds for the minmax risk in estimationproblems involving general f -divergences. These bounds generalize Fano’s inequality and, under certainassumptions, can be extended in order to lower bound Pe(X|Y ).

Most information-theoretic approaches for estimating or communicating functions of a random variable are concerned with properties of specific functions given i.i.d. samples of the hidden variable $X$, such as in the functional compression literature [44,45]. These results are rate-based and asymptotic, and do not immediately extend to the case where the function $f(X)$ can be an arbitrary member of a class of functions, and only a single observation is available.

More recently, Kumar and Courtade [17] investigated Boolean functions in an information-theoretic context. In particular, they analyzed which is the most informative (in terms of mutual information) 1-bit function for the case where $X$ is composed of $n$ i.i.d. Bernoulli(1/2) random variables, and $Y$ is the result of passing $X$ through a discrete memoryless binary symmetric channel. Even in this simple case, determining the most informative function seems to be non-trivial. Further investigations of this problem were carried out in [46–49]. In particular, [46] studies a related problem in a continuous setting by considering that $X$ and $Y$ are Gaussian random vectors. Recently, Samorodnitsky [50] presented a proof of the conjecture in the high noise regime.

Information-theoretic formulations for privacy have appeared in [51–55]. For an overview, we refer the reader to [19,53] and the references therein. The privacy against statistical inference framework considered here was further studied in [5,6,56]. The results presented in this paper are closely connected to the study of hypercontractivity coefficients and strong data processing results, such as in [22,23,33,34,57]. PIC-based analyses were used in the context of security in [58,59]. Extremal properties of privacy were also investigated in [60,61], and in particular [62] builds upon some of the results introduced here. For more details on designing privacy-assuring mappings and applications with real-world data, we refer the reader to [5–7,19,20,63].

We note that the privacy against statistical inference setting is related to differential privacy [3,4]. In the classic differential privacy setting, the output of a statistical query over a database is masked against small perturbations of the data contained in the database. Assuming this centralized statistical database setting, the private variable $S$ can represent an individual user's entry to the database, and the variable $X$ the output of a query over the database. Unlike in differential privacy, here we consider an additional distortion constraint, which can be chosen according to the application at hand. In the privacy funnel setting [63], the distortion constraint is given in terms of the mutual information between $X$ and the perturbed query output $Y$. Connections between differential privacy and the privacy setting depicted in Fig. 1 as well as connections between differential privacy and PICs are studied in [19,64].


Figure 2: Geometric interpretation of the PICs for $\mathcal{X} = \{x_1, x_2, x_3\}$ and $X$ uniformly distributed. In (a), each point on the simplex corresponds to a posterior distribution $p_{X|Y}(\cdot|y)$ induced on $\mathcal{X}$ by an observation of $Y = y_k$. If all the posterior distribution points are close together (b), then $X$ and $Y$ are approximately independent. If these points are far apart (c), then there may exist a function of $X$ that can be approximately reliably estimated given an observation of $Y$ (in this case, a binary function). The PICs can be intuitively understood as a measure of inertia of the posterior distribution vectors on the simplex.

2 Principal Inertia Components

We introduce in this section the Principal Inertia Components (PICs) of the joint distribution of two random variables $X$ and $Y$. The PICs provide a fine-grained decomposition of the statistical dependence between $X$ and $Y$, and are dependence measures that lie in the intersection of information and estimation theory. The PICs possess several desirable information-theoretic properties (e.g. they satisfy the DPI, convexity, tensorization, etc.), and describe which functions of $X$ can or cannot be reliably inferred (in terms of MMSE) given an observation of $Y$. The latter interpretation is discussed in more detail in Section 4.

2.1 A Geometric Interpretation of the PICs

We give an intuitive geometric interpretation of the PICs before presenting their formal definition in the next section. Let $X$ and $Y$ be related through a conditional distribution (channel), denoted by $p_{Y|X}$. For each $y \in \mathcal{Y}$, $p_{X|Y}(\cdot|y)$ will be a vector on the $|\mathcal{X}|$-dimensional simplex, and the position of these vectors on the simplex will determine the nature of the relationship between $X$ and $Y$ (Fig. 2). If $p_{X|Y}$ is fixed, what can be learned about $X$ given an observation of $Y$, or the degree of accuracy of what can be inferred about $X$ a posteriori, will then depend on the marginal distribution $p_Y$. The value $p_Y(y)$, in turn, weights the corresponding vector $p_{X|Y}(\cdot|y)$ akin to a mass. As a simple example, if $|\mathcal{X}| = |\mathcal{Y}|$ and the vectors $p_{X|Y}(\cdot|y)$ are located on distinct corners of the simplex, then $X$ can be perfectly learned from $Y$. As another example, assume that the vectors $p_{X|Y}(\cdot|y)$ can be grouped into two clusters located near opposite corners of the simplex. If the sum of the masses induced by $p_Y$ for each cluster is approximately 1/2, then one may expect to reliably infer on the order of 1 unbiased bit of $X$ from an observation of $Y$.

The above discussion naturally leads to considering the use of techniques borrowed from classical mechanics. For a given inertial frame of reference, the mechanical properties of a collection of distributed point masses can be characterized by the moments of inertia of the system. The moments of inertia measure how the weight of the point masses is distributed around the center of mass. An analogous metric exists for the distribution of the vectors $p_{X|Y}$ and masses $p_Y$ in the simplex, and it is the subject of study of a branch of applied statistics called correspondence analysis [11,28]. In correspondence analysis, the joint distribution $p_{X,Y}$ is decomposed in terms of the PICs, which, in some sense, are analogous to the moments of inertia of a collection of point masses. For more related literature, we refer the reader back to Section 1.4.


2.2 Definition and Characterizations of the PICs

We start with the definition of principal inertia components. In this paper we focus on the discrete case, since two of our main goals are (i) to derive lower bounds on average estimation error probability and (ii) to apply these results to privacy, where private data is often categorical. In addition, tools from correspondence analysis [27] can be used for estimating the PICs in the discrete setting. Nevertheless, the definition below is not limited to discrete random variables, and can be directly extended to general probability measures under compactness of the operator $T_X T_Y$ (cf. [32, Section 3]).

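Definition 1. Let $X$ and $Y$ be random variables with support sets $\mathcal{X}$ and $\mathcal{Y}$, respectively, and joint distribution $p_{X,Y}$. In addition, let $f_0 : \mathcal{X} \to \mathbb{R}$ and $g_0 : \mathcal{Y} \to \mathbb{R}$ be the constant functions $f_0(x) = 1$ and $g_0(y) = 1$. For $k \in \mathbb{Z}_+$, we (recursively) define

$$\lambda_k(X;Y) = \max\left\{ \mathbb{E}[f(X)g(Y)]^2 \,\middle|\, f \in \mathcal{L}_2(p_X),\ g \in \mathcal{L}_2(p_Y),\ \mathbb{E}[f(X)f_j(X)] = 0,\ \mathbb{E}[g(Y)g_j(Y)] = 0,\ j \in \{0,\dots,k-1\} \right\}, \tag{12}$$

where

$$(f_k, g_k) \triangleq \operatorname*{argmax}\left\{ \mathbb{E}[f(X)g(Y)]^2 \,\middle|\, f \in \mathcal{L}_2(p_X),\ g \in \mathcal{L}_2(p_Y),\ \mathbb{E}[f(X)f_j(X)] = 0,\ \mathbb{E}[g(Y)g_j(Y)] = 0,\ j \in \{0,\dots,k-1\} \right\}. \tag{13}$$

The values $\lambda_k(X;Y)$ are called the principal inertia components (PICs) of $p_{X,Y}$. The functions $f_k$ and $g_k$ are called the principal functions of $X$ and $Y$.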

Observe that the PICs satisfy $\lambda_k(X;Y) \le 1$, since $f_k \in \mathcal{L}_2(p_X)$, $g_k \in \mathcal{L}_2(p_Y)$ and

$$\mathbb{E}[f(X)g(Y)] \le \|f(X)\|_2 \|g(Y)\|_2 \le 1.$$

Thus, from Definition 1, $\lambda_{k+1}(X;Y) \le \lambda_k(X;Y) \le 1$. When both random variables $X$ and $Y$ have a finite support set, we have the following definition.

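Definition 2. For $\mathcal{X} = [m]$ and $\mathcal{Y} = [n]$, let $\mathbf{P} \in \mathbb{R}^{m \times n}$ be a matrix with entries $[\mathbf{P}]_{i,j} = p_{X,Y}(i,j)$, and $\mathbf{D}_X \in \mathbb{R}^{m \times m}$ and $\mathbf{D}_Y \in \mathbb{R}^{n \times n}$ be diagonal matrices with diagonal entries $[\mathbf{D}_X]_{i,i} = p_X(i)$ and $[\mathbf{D}_Y]_{j,j} = p_Y(j)$, respectively, where $i \in [m]$ and $j \in [n]$. We define

$$\mathbf{Q} \triangleq \mathbf{D}_X^{-1/2} \mathbf{P} \mathbf{D}_Y^{-1/2}. \tag{14}$$

We denote the singular value decomposition of $\mathbf{Q}$ by $\mathbf{Q} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$.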
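As a rough numerical illustration of Definition 2, the sketch below builds $\mathbf{Q}$ from a joint probability matrix and reads the PICs off its singular values, relying on the finite-support characterization stated in Theorem 1. The function name `pics` and the toy matrix `P_example` are illustrative assumptions rather than anything defined in the paper, and the marginals are assumed to be strictly positive.

```python
import numpy as np

def pics(P):
    """PICs of a joint pmf matrix P (hypothetical helper, assumes positive marginals)."""
    P = np.asarray(P, dtype=float)
    p_x = P.sum(axis=1)                      # marginal of X (row sums)
    p_y = P.sum(axis=0)                      # marginal of Y (column sums)
    Q = P / np.sqrt(np.outer(p_x, p_y))      # Q = D_X^{-1/2} P D_Y^{-1/2}
    s = np.linalg.svd(Q, compute_uv=False)   # singular values, descending
    return s[1:] ** 2                        # drop the trivial singular value 1

# Toy example (assumption): binary symmetric channel, crossover 0.1, uniform input.
P_example = 0.5 * np.array([[0.9, 0.1],
                            [0.1, 0.9]])
lam = pics(P_example)
```

For this toy channel $\mathbf{Q}$ coincides with the channel matrix, whose singular values are 1 and 0.8, so the single PIC is 0.64.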

The next theorem provides four equivalent characterizations of the PICs.

Theorem 1. The following characterizations of the PICs are equivalent:

(1) The characterization given in Definition 1 where, for $f_k$ and $g_k$ given in (13), $g_k(Y) = \frac{\mathbb{E}[f_k(X)|Y]}{\|\mathbb{E}[f_k(X)|Y]\|_2}$ and $f_k(X) = \frac{\mathbb{E}[g_k(Y)|X]}{\|\mathbb{E}[g_k(Y)|X]\|_2}$.

(2) [32, Section 3] Consider the conditional expectation operator $T_Y : \mathcal{L}_2(p_X) \to \mathcal{L}_2(p_Y)$, defined in (4). Then $\left(1, \sqrt{\lambda_1(X;Y)}, \sqrt{\lambda_2(X;Y)}, \dots\right)$ are the singular values of $T_Y$.

(3) For any $k \in \mathbb{Z}_+$,

$$1 - \lambda_k(X;Y) = \min\left\{ \mathsf{mmse}(f(X)|Y) \,\middle|\, f \in \mathcal{L}_2(p_X),\ \|f(X)\|_2 = 1,\ \mathbb{E}[f(X)h_j(X)] = 0,\ j \in \{0,\dots,k-1\} \right\}, \tag{15}$$

where

$$h_k \triangleq \operatorname*{argmin}\left\{ \mathsf{mmse}(f(X)|Y) \,\middle|\, f \in \mathcal{L}_2(p_X),\ \|f(X)\|_2 = 1,\ \mathbb{E}[f(X)h_j(X)] = 0,\ j \in \{0,\dots,k-1\} \right\}. \tag{16}$$

If $\lambda_k(X;Y)$ is unique, then $h_k = f_k$ given in (13).

Finally, if both X and Y are defined over finite supports, the following characterization is also equivalent.


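(4) $\sqrt{\lambda_k(X;Y)}$ is the $(k+1)$-st largest singular value of $\mathbf{Q}$. The principal functions $f_k$ and $g_k$ in (13) correspond to the columns of the matrices $\mathbf{D}_X^{-1/2}\mathbf{U}$ and $\mathbf{D}_Y^{-1/2}\mathbf{V}$, respectively, where $\mathbf{Q} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$.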

Proof. We will prove that (1) ⇐⇒ (2), (1) ⇐⇒ (3), and finally (1) ⇐⇒ (4).

• (1) ⇐⇒ (2). First observe that for $f \in \mathcal{L}_2(p_X)$ and $g \in \mathcal{L}_2(p_Y)$

$$\mathbb{E}[f(X)g(Y)] = \mathbb{E}\left[g(Y)\,\mathbb{E}[f(X)|Y]\right] \le \|g(Y)\|_2 \|\mathbb{E}[f(X)|Y]\|_2 \le \|\mathbb{E}[f(X)|Y]\|_2,$$

where the first inequality follows from the Cauchy-Schwarz inequality, with equality if and only if $g(Y) = \frac{\mathbb{E}[f(X)|Y]}{\|\mathbb{E}[f(X)|Y]\|_2}$. The equivalence then follows by noting that

$$\begin{aligned}
\sqrt{\lambda_1(X;Y)} &= \max_{\substack{\mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0 \\ \|f(X)\|_2 = \|g(Y)\|_2 = 1}} \mathbb{E}[f(X)g(Y)] \\
&= \max_{\substack{\mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0 \\ \|f(X)\|_2 = \|g(Y)\|_2 = 1}} \mathbb{E}\left[\mathbb{E}[g(Y)f(X)|Y]\right] \\
&= \max_{\substack{\mathbb{E}[f(X)] = 0 \\ \|f(X)\|_2 = 1}} \|\mathbb{E}[f(X)|Y]\|_2,
\end{aligned} \tag{17}$$

where the last equality follows by setting $g(Y) = \frac{\mathbb{E}[f(X)|Y]}{\|\mathbb{E}[f(X)|Y]\|_2}$. Inverting the roles of $f$ and $g$, we find $f(X) = \frac{\mathbb{E}[g(Y)|X]}{\|\mathbb{E}[g(Y)|X]\|_2}$. Since the last expression in (17) is the second largest singular value of the conditional expectation operator $T_Y$ (the largest being 1), the result follows for $\lambda_1(X;Y)$. The equivalent result for the other PICs follows by adding orthogonality constraints and the min-max properties of singular values (cf. Rayleigh-Ritz Theorem [24, Theorem 4.2.2]).

• (1) ⇐⇒ (3). The result follows from $\lambda_k(X;Y) = \|\mathbb{E}[f_k(X)|Y]\|_2^2$ in (17) and by noting that the MMSE can be written as (7). Consequently, maximizing $\|\mathbb{E}[f(X)|Y]\|_2$ is equivalent to minimizing the MMSE in (15).

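• (1) ⇐⇒ (4). Let $f \in \mathcal{L}_2(p_X)$ and $g \in \mathcal{L}_2(p_Y)$. Define the column-vectors $\mathbf{f} \triangleq (f(1), \dots, f(m))^T$ and $\mathbf{g} \triangleq (g(1), \dots, g(n))^T$. Then

$$\mathbb{E}[f(X)g(Y)] = \mathbf{f}^T \mathbf{P} \mathbf{g}$$

and

$$\mathbf{f}^T \mathbf{D}_X \mathbf{f} = \mathbf{g}^T \mathbf{D}_Y \mathbf{g} = 1.$$

For $\mathbf{Q} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ given in Definition 2, put $\mathbf{u} \triangleq \mathbf{U}^T \mathbf{D}_X^{1/2} \mathbf{f}$ and $\mathbf{v} \triangleq \mathbf{V}^T \mathbf{D}_Y^{1/2} \mathbf{g}$. Then $\|\mathbf{u}\|_2 = \|\mathbf{v}\|_2 = 1$, and

$$\mathbb{E}[f(X)g(Y)] = \mathbf{u}^T \boldsymbol{\Sigma} \mathbf{v}.$$

The result then follows directly from the variational characterization of singular values [24, Theorem 7.3.8].

Assuming unique PICs, note that the column-vectors $(\mathbf{f}_0, \mathbf{f}_1, \dots, \mathbf{f}_d)$ corresponding to the functions $(f_0, f_1, \dots, f_d)$ are the first $d+1$ columns of $\mathbf{D}_X^{-1/2}\mathbf{U}$, and the column-vectors $(\mathbf{g}_0, \mathbf{g}_1, \dots, \mathbf{g}_d)$ corresponding to the functions $(g_0, g_1, \dots, g_d)$ are the first $d+1$ columns of $\mathbf{D}_Y^{-1/2}\mathbf{V}$. In addition, let $\mathbf{z}_k \in \mathbb{R}^n$ be the column vector with entries $\mathbb{E}[f_k(X)|Y = j]$. Then

$$\mathbf{z}_k^T = \mathbf{f}_k^T \mathbf{P} \mathbf{D}_Y^{-1} = \mathbf{f}_k^T \mathbf{D}_X^{1/2} \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T \mathbf{D}_Y^{-1/2} = \sqrt{\lambda_k(X;Y)}\, \mathbf{g}_k^T,$$

so $\lambda_k(X;Y) = \|\mathbb{E}[f_k(X)|Y]\|_2^2$ and once again we find $g_k(Y) = \frac{\mathbb{E}[f_k(X)|Y]}{\|\mathbb{E}[f_k(X)|Y]\|_2}$.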

The previous theorem provides different operational characterizations of the PICs. Characterization (1), presented in Definition 1, implies that the principal functions of $X$ and $Y$ are the solution to the following problem: Consider two parties, namely Alice and Bob, where Alice has access to an observation of $X$ and Bob has access to an observation of $Y$. Alice and Bob's goal is to produce zero-mean, unit-variance functions $f(X)$ and $g(Y)$, respectively, that maximize the correlation $\mathbb{E}[f(X)g(Y)]$ without any additional information beyond their respective observations of $X$ and $Y$. The optimal choice of functions is $f_1$ and $g_1$, given in the theorem. Moreover,

$$\lambda_1(X;Y) = \rho_m(X;Y)^2.$$
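The Alice and Bob interpretation can be checked numerically on a small example. The sketch below, which uses a randomly generated joint distribution as a stand-in (an assumption, not data from the paper), extracts $f_1$ and $g_1$ from the SVD of $\mathbf{Q}$ as in characterization (4), and evaluates their mean, variance, correlation, and the MMSE of estimating $f_1(X)$ from $Y$. The MMSE is computed as $\mathbb{E}[f_1(X)^2] - \|\mathbb{E}[f_1(X)|Y]\|_2^2$, which is the standard identity assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((4, 5))
P /= P.sum()                                 # random joint pmf (illustrative)
p_x, p_y = P.sum(axis=1), P.sum(axis=0)

Q = P / np.sqrt(np.outer(p_x, p_y))          # D_X^{-1/2} P D_Y^{-1/2}
U, s, Vt = np.linalg.svd(Q)

f1 = U[:, 1] / np.sqrt(p_x)                  # second column of D_X^{-1/2} U
g1 = Vt[1, :] / np.sqrt(p_y)                 # second column of D_Y^{-1/2} V
lam1 = s[1] ** 2                             # largest PIC

mean_f1 = p_x @ f1                           # E[f1(X)]
var_f1 = p_x @ f1 ** 2                       # E[f1(X)^2]
corr = f1 @ P @ g1                           # E[f1(X) g1(Y)]
cond_mean = (f1 @ P) / p_y                   # E[f1(X) | Y = j]
mmse_f1 = var_f1 - p_y @ cond_mean ** 2      # E[f1^2] - ||E[f1|Y]||_2^2
```

Under these assumptions `corr` matches $\sqrt{\lambda_1(X;Y)}$ and `mmse_f1` matches $1 - \lambda_1(X;Y)$ up to floating-point error.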


Characterization (3) above proves that the PICs are the solution to another related question: Given a noisy observation $Y$ of a hidden variable $X$, what is the unit-variance, zero-mean function of $X$ that can be estimated with the smallest mean-squared error? It follows directly from (15) that the function is $f_1(X)$, and the minimum MMSE is $1 - \lambda_1(X;Y)$. Indeed, since they are orthonormal, the principal functions form a basis for the zero-mean functions in $\mathcal{L}_2(p_X)$ (we revisit this point in Section 6). Characterization (4) lends itself to the geometric interpretation discussed in Section 2.1.

The next result states the well-known tensorization property of the PICs between sequences of independent random variables (e.g. [23,32,65]). We present a proof of the discrete case here for the sake of completeness.

Lemma 1. Let $(X_1, Y_1) \perp\!\!\!\perp (X_2, Y_2)$, $d_1 = \min\{|\mathcal{X}_1|, |\mathcal{Y}_1|\} - 1 < \infty$ and $d_2 = \min\{|\mathcal{X}_2|, |\mathcal{Y}_2|\} - 1 < \infty$. Then the PICs of $p_{(X_1,X_2),(Y_1,Y_2)}$ are $\lambda_i(X_1;Y_1)\lambda_j(X_2;Y_2)$ for $(i,j) \in [0,d_1] \times [0,d_2]$, where $\lambda_0(X_1;Y_1) = \lambda_0(X_2;Y_2) = 1$. Furthermore, denoting the principal functions of $(X_1, Y_1)$ by $f_i$ and those of $(X_2, Y_2)$ by $\tilde{f}_j$, the principal functions of $p_{(X_1,X_2),(Y_1,Y_2)}$ are of the form $(x_1, x_2) \mapsto f_i(x_1)\tilde{f}_j(x_2)$. In particular,

$$\lambda_1((X_1,X_2);(Y_1,Y_2)) = \max\{\lambda_1(X_1;Y_1), \lambda_1(X_2;Y_2)\}.$$

Proof. Let $[\mathbf{Q}_1]_{i,j} = \frac{p_{X_1,Y_1}(i,j)}{\sqrt{p_{X_1}(i)\,p_{Y_1}(j)}}$ and $[\mathbf{Q}_2]_{i,j} = \frac{p_{X_2,Y_2}(i,j)}{\sqrt{p_{X_2}(i)\,p_{Y_2}(j)}}$. Denoting by $\mathbf{Q}$ the matrix given in Definition 2 for $p_{(X_1,X_2),(Y_1,Y_2)}$ then, from the independence assumption, $\mathbf{Q} = \mathbf{Q}_1 \otimes \mathbf{Q}_2$, where $\otimes$ is the Kronecker product. The result follows directly from the fact that the singular values of the Kronecker product of two matrices are the pairwise products of their singular values (and equivalently for the singular vectors) [66, Theorem 4.2.15].
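The Kronecker-product argument in the proof of Lemma 1 can be illustrated with a short numerical sketch. The two joint distributions below are random placeholders (assumptions for illustration only), and `q_matrix` is a hypothetical helper implementing (14). The joint distribution of the independent pairs is formed as a Kronecker product, and its singular values are compared against all pairwise products.

```python
import numpy as np

def q_matrix(P):
    """Q = D_X^{-1/2} P D_Y^{-1/2} for a joint pmf matrix P (hypothetical helper)."""
    p_x, p_y = P.sum(axis=1), P.sum(axis=0)
    return P / np.sqrt(np.outer(p_x, p_y))

rng = np.random.default_rng(1)
P1 = rng.random((3, 3)); P1 /= P1.sum()      # joint pmf of (X1, Y1)
P2 = rng.random((2, 4)); P2 /= P2.sum()      # joint pmf of (X2, Y2)

# Independence of the pairs: joint pmf of ((X1,X2),(Y1,Y2)) is a Kronecker product.
P12 = np.kron(P1, P2)

s1 = np.linalg.svd(q_matrix(P1), compute_uv=False)
s2 = np.linalg.svd(q_matrix(P2), compute_uv=False)
s12 = np.linalg.svd(q_matrix(P12), compute_uv=False)

# All pairwise products of singular values, sorted in decreasing order.
products = np.sort(np.outer(s1, s2).ravel())[::-1]
```

Squaring the second entry of `s12` gives the largest PIC of the pair, which in this sketch equals the larger of the two individual largest PICs.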

2.3 k-correlation

In this section we introduce the $k$-correlation $J_k(X;Y)$ between two random variables, which is equivalent to the sum of the $k$ largest PICs. We prove that $k$-correlation is convex in $p_{Y|X}$ and satisfies the DPI.

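Definition 3. We define the $k$-correlation between $X$ and $Y$ as

$$J_k(X;Y) \triangleq \sum_{i=1}^{k} \lambda_i(X;Y). \tag{18}$$

For finite $\mathcal{X}$ and $\mathcal{Y}$, the $k$-correlation is given by

$$J_k(X;Y) \triangleq \|\mathbf{Q}\mathbf{Q}^T\|_k - 1. \tag{19}$$

Note that

$$J_1(X;Y) = \rho_m(X;Y)^2,$$

and for finite $\mathcal{X}$ and $\mathcal{Y}$, $d = \min\{|\mathcal{X}|, |\mathcal{Y}|\} - 1$,

$$J_d(X;Y) = \mathbb{E}\left[\frac{p_{X,Y}(X,Y)}{p_X(X)\,p_Y(Y)}\right] - 1 = \chi^2(X;Y).$$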
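The identity $J_d(X;Y) = \chi^2(X;Y)$ is easy to sanity-check numerically. The sketch below computes the $k$-correlation directly from (18), with the PICs obtained as squared singular values of $\mathbf{Q}$, and compares $J_d$ with the $\chi^2$ expression above. The helper name `k_correlation` and the random test matrix are assumptions made for illustration.

```python
import numpy as np

def k_correlation(P, k):
    """Sum of the k largest PICs of the joint pmf matrix P (hypothetical helper)."""
    p_x, p_y = P.sum(axis=1), P.sum(axis=0)
    Q = P / np.sqrt(np.outer(p_x, p_y))
    lam = np.linalg.svd(Q, compute_uv=False)[1:] ** 2   # PICs, descending
    return lam[:k].sum()

rng = np.random.default_rng(2)
P = rng.random((4, 6))
P /= P.sum()                                  # random joint pmf (illustrative)
p_x, p_y = P.sum(axis=1), P.sum(axis=0)

d = min(P.shape) - 1
# chi^2(X;Y) = E[p(X,Y) / (p(X) p(Y))] - 1 = sum_{i,j} p(i,j)^2 / (p(i) p(j)) - 1
chi2 = (P ** 2 / np.outer(p_x, p_y)).sum() - 1.0
J_d = k_correlation(P, d)
```

The agreement follows from the fact that the squared Frobenius norm of $\mathbf{Q}$ equals the sum of its squared singular values, one of which is the trivial value 1.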

We demonstrate next that $k$-correlation and, consequently, maximal correlation, is convex in $p_{Y|X}$ for a fixed $p_X$ and satisfies a form of the DPI, i.e. if $X \to Y \to Z$, then $J_k(X;Z) \le J_k(X;Y)$. These results hold for both discrete and continuous random variables (under appropriate compactness assumptions).

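Theorem 2. For a fixed $p_X$, $J_k(X;Y)$ is convex in $p_{Y|X}$.

Proof. First note that $\|\mathbb{E}[f(X)|Y]\|_2^2$ is convex in $p_{X,Y}$, since for any $U \to (X,Y)$

$$\begin{aligned}
\mathbb{E}_Y\left[\left(\mathbb{E}_{X|Y}[f(X)|Y]\right)^2\right] &= \mathbb{E}_Y\left[\left(\mathbb{E}_{U|Y}\left[\mathbb{E}_{X|Y,U}[f(X)|Y,U]\right]\right)^2\right] \\
&\le \mathbb{E}_Y\left[\mathbb{E}_{U|Y}\left[\left(\mathbb{E}_{X|Y,U}[f(X)|Y,U]\right)^2\right]\right] \\
&= \mathbb{E}_U\left[\mathbb{E}_{Y|U}\left[\left(\mathbb{E}_{X|Y,U}[f(X)|Y,U]\right)^2\right]\right],
\end{aligned}$$

where the inequality follows from Jensen's inequality. Consequently, for any $\{f_1, \dots, f_k\} \subseteq \mathcal{L}_2(p_X)$, $\sum_{i=1}^{k} \|\mathbb{E}[f_i(X)|Y]\|_2^2$ is convex in $p_{X,Y}$ and thus, for a fixed $p_X$, convex in $p_{Y|X}$. From Theorem 1 and the Poincaré separation theorem [24, Corollary 4.3.16]

$$\sum_{i=1}^{k} \lambda_i(X;Y) = \max_{\substack{\{f_i\}_{i=1}^{k} \subseteq \mathcal{L}_2(p_X) \\ f_i \perp f_j,\ i \ne j \\ \mathbb{E}[f_i] = 0}} \ \sum_{i=1}^{k} \|\mathbb{E}[f_i(X)|Y]\|_2^2.$$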


Since the pointwise supremum of convex functions is convex [67, Sec 3.2.3], it follows that, for fixed $p_X$, $J_k(X;Y)$ is convex in $p_{Y|X}$.

The following lemma will be used to prove that the PICs satisfy the DPI.

Lemma 2 (DPI for MMSE). For $X \to Y \to Z$ and any $f \in \mathcal{L}_2(p_X)$ with $\mathbb{E}[f(X)] = 0$,

$$\|\mathbb{E}[f(X)|Z]\|_2^2 \le \lambda_1(Y;Z)\, \|\mathbb{E}[f(X)|Y]\|_2^2. \tag{20}$$

Consequently, $\mathsf{mmse}(f(X)|Y) \le \mathsf{mmse}(f(X)|Z)$.

Proof. The proof is in Appendix A.

Lemma 2 leads to the following theorem.

Theorem 3 (DPI for the PICs). Assume that $X \to Y \to Z$. Then $\lambda_k(X;Z) \le \lambda_1(Y;Z)\lambda_k(X;Y)$ for all $k$.

Proof. A direct consequence of Theorem 1 is that for any two random variables $X$, $Y$

$$\lambda_k(X;Y) = \min_{\{f_i\}_{i=1}^{k} \subseteq \mathcal{L}_2(p_X)} \ \max_{\substack{f \in \mathcal{L}_2(p_X) \\ \mathbb{E}[f(X)f_i(X)] = 0}} \|\mathbb{E}[f(X)|Y]\|_2^2,$$

and equivalently for $\lambda_k(X;Z)$. The result then follows directly from (20).

The next corollary is a direct consequence of the previous theorem.

Corollary 1. For $X \to Y \to Z$ forming a Markov chain, $J_k(X;Z) \le \lambda_1(Y;Z) J_k(X;Y)$.

Remark 1. The data processing result in Theorem 3 and the previous corollary was proved by Kang and Ulukus in [41, Theorem 2] and applied to problems in distributed source and channel coding, even though they do not make the explicit connection with maximal correlation and PICs. A weaker form of Theorem 3 can be derived using a clustering result presented in [11, Sec. 7.5.4] and originally due to Deniau et al. [68]. We use a different proof technique from the one in [11, Sec. 7.5.4] and [41, Theorem 2] to show the result stated in the theorem, and present the proof here for completeness. Finally, a related data processing result was stated in [33].
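A quick numerical check of Theorem 3 and Corollary 1 can be sketched as follows. The joint distribution of $(X,Y)$ and the channel from $Y$ to $Z$ are drawn at random (assumptions for illustration), the Markov chain $X \to Y \to Z$ is enforced by construction, and the PICs of each pair are computed from the SVD of the corresponding matrix $\mathbf{Q}$. The helper name `pic_values` is hypothetical.

```python
import numpy as np

def pic_values(P):
    """PICs (descending) of a joint pmf matrix P (hypothetical helper)."""
    p_a, p_b = P.sum(axis=1), P.sum(axis=0)
    Q = P / np.sqrt(np.outer(p_a, p_b))
    return np.linalg.svd(Q, compute_uv=False)[1:] ** 2

rng = np.random.default_rng(3)
P_xy = rng.random((4, 4)); P_xy /= P_xy.sum()          # joint pmf of (X, Y)
W = rng.random((4, 4)); W /= W.sum(axis=1, keepdims=True)  # p_{Z|Y}, rows sum to 1

p_y = P_xy.sum(axis=0)
P_xz = P_xy @ W                     # joint pmf of (X, Z) under X -> Y -> Z
P_yz = p_y[:, None] * W             # joint pmf of (Y, Z)

lam_xy = pic_values(P_xy)
lam_xz = pic_values(P_xz)
lam_yz = pic_values(P_yz)

# Theorem 3: lambda_k(X;Z) <= lambda_1(Y;Z) * lambda_k(X;Y) for every k.
dpi_holds = bool(np.all(lam_xz <= lam_yz[0] * lam_xy + 1e-12))
# Corollary 1: the same bound for the partial sums J_k.
J_holds = bool(np.all(np.cumsum(lam_xz) <= lam_yz[0] * np.cumsum(lam_xy) + 1e-12))
```

A single random instance is of course not a proof; the sketch only shows how the quantities in the statements can be evaluated.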

In the next three sections of the paper, we demonstrate the fundamental role of PICs in problems in information theory, estimation theory, and privacy.

3 Applications of the Principal Inertia Components to Information Theory

In this section, we present results that connect the PICs with other information-theoretic metrics. As seen in Section 2, the distribution of the vectors $p_{Y|X}$ in the simplex or, equivalently, the PICs of the joint distribution of $X$ and $Y$, are inherently connected to how an observation of $Y$ is statistically related to $X$. In this section, we explore this connection within an information-theoretic framework. We show that, under certain assumptions, the PICs play an important part in estimating a one-bit function of $X$, namely $b(X)$ where $b : \mathcal{X} \to \{0,1\}$, given an observation of $Y$: they can be understood as the singular values (or filter coefficients) in the linear transformation of $p_{b(X)|X}$ into $p_{b(X)|Y}$ determined by the channel transition matrix. Alternatively, the PICs can bear an interpretation as the transform of the distribution of the noise in certain additive-noise channels, in particular when $X$ and $Y$ are binary strings. We also show that maximizing the PICs is equivalent to maximizing the first-order term of the Taylor series expansion of certain convex dependence measures between $b(X)$ and $Y$. We conjecture that, for symmetric distributions of $X$ and $Y$ and a given upper bound on the value of the largest PIC, $I(b(X);Y)$ is maximized when all the principal inertia components have the same value as the largest principal inertia component. For uniformly distributed $X$ and $Y$, this is equivalent to $Y$ being the result of passing $X$ through a $q$-ary symmetric channel. This conjecture, if proven, would imply the conjecture made by Kumar and Courtade in [17].

Finally, we study the Markov chain $B \to X \to Y \to \hat{B}$, where $B$ and $\hat{B}$ are binary random variables, and the role of the principal inertia components in characterizing the relation between $B$ and $\hat{B}$. We show that this relation is linked to solving a non-linear maximization problem, which, in turn, can be solved when $\hat{B}$ is an unbiased estimate of $B$ (i.e. $\mathbb{E}[\hat{B}] = \mathbb{E}[B]$), the joint distribution of $X$ and $Y$ is symmetric and $\Pr\{B = \hat{B} = 0\} \ge \mathbb{E}[B]^2$. We illustrate this result for the setting where $X$ is a binary string and $Y$ is the result of sending $X$ through a memoryless binary symmetric channel. We note that this is a similar setting to the one considered by Anantharam et al. in [47].

The rest of the section is organized as follows. Section 3.1 introduces the notion of conforming distributions and ancillary results. Section 3.2 presents results concerning the role of the PICs in inferring one-bit functions of $X$ from an observation of $Y$ and in the transformation of $p_X$ into $p_Y$ in certain symmetric settings. We argue that, in such settings, the PICs can be viewed as singular values (filter coefficients) in a linear transformation. In particular, results for binary channels with additive noise are derived using techniques inspired by Fourier analysis of Boolean functions. Furthermore, Section 3.2 also introduces a conjecture that encompasses the one made by Kumar and Courtade in [17]. Finally, Section 3.6 provides further evidence for this conjecture by investigating the Markov chain $B \to X \to Y \to \hat{B}$ where $B$ and $\hat{B}$ are binary random variables. Throughout this section we assume $X$ and $Y$ are discrete random variables defined over a finite support set.

3.1 Conforming distributions

In this section we shall focus on probability distributions that meet the following definition.

Definition 4. A joint distribution $p_{X,Y}$ is said to be conforming if the corresponding matrix $\mathbf{P}$ satisfies $\mathbf{P} = \mathbf{P}^T$ and $\mathbf{P}$ is positive-semidefinite.

Conforming distributions are particularly interesting since they are closely related to symmetric channels (footnote 1). In addition, if a joint distribution is conforming, then its eigenvalues are equal to (the square root of) its PICs when its marginal distributions are identical. We shall illustrate this relation in the following two lemmas and in Section 3.2.

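Remark 2. If $X$ and $Y$ have a conforming joint distribution, then they have the same marginal distribution. Consequently, $\mathbf{D} \triangleq \mathbf{D}_X = \mathbf{D}_Y$, and $\mathbf{P} = \mathbf{D}^{1/2}\mathbf{U}\boldsymbol{\Sigma}\mathbf{U}^T\mathbf{D}^{1/2}$ (cf. Definition 2 for notation).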

Lemma 3. If $\mathbf{P}$ is conforming, then the corresponding conditional distribution matrix $\mathbf{P}_{Y|X}$ is positive semi-definite. Furthermore, for any symmetric channel $\mathbf{P}_{Y|X} = \mathbf{P}_{Y|X}^T$, there is an input distribution $p_X$ (namely, the uniform distribution) such that the PICs of $\mathbf{P} = \mathbf{D}_X \mathbf{P}_{Y|X}$ correspond to the square of the eigenvalues of $\mathbf{P}_{Y|X}$. In this case, if $\mathbf{P}_{Y|X}$ is also positive-semidefinite, then the resulting $\mathbf{P}$ is conforming.

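Proof. Let $\mathbf{P}$ be conforming and $\mathcal{X} = \mathcal{Y} = [m]$. Then

$$\mathbf{P}_{Y|X} = \mathbf{D}^{-1/2}\mathbf{U}\boldsymbol{\Sigma}\mathbf{U}^T\mathbf{D}^{1/2} = \left(\mathbf{D}^{-1/2}\mathbf{U}\right)\boldsymbol{\Sigma}\left(\mathbf{D}^{-1/2}\mathbf{U}\right)^{-1}.$$

It follows that $\mathrm{diag}(\boldsymbol{\Sigma})$ are the eigenvalues of $\mathbf{P}_{Y|X}$, and, consequently, $\mathbf{P}_{Y|X}$ is positive semi-definite. Now let $\mathbf{P}_{Y|X} = \mathbf{P}_{Y|X}^T = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$. The entries of $\boldsymbol{\Lambda}$ here are the eigenvalues of $\mathbf{P}_{Y|X}$ and not necessarily positive. Since $\mathbf{P}_{Y|X}$ is symmetric, it is also doubly stochastic, and for $X$ uniformly distributed $Y$ is also uniformly distributed. Thus, the resulting joint distribution matrix $\mathbf{P}$ is symmetric, and $\mathbf{P} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T/m$. It follows directly that the principal inertia components of $\mathbf{P}$ are the diagonal entries of $\boldsymbol{\Lambda}^2$, and if $\mathbf{P}_{Y|X}$ is positive-semidefinite then $\mathbf{P}$ is conforming.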
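The second part of Lemma 3 can be illustrated on a small symmetric channel. In the sketch below the channel matrix `W` is an arbitrary symmetric, row-stochastic example chosen for illustration (an assumption, not taken from the paper). With a uniform input, the PICs obtained from the SVD of $\mathbf{Q}$ are compared with the squared non-trivial eigenvalues of the channel matrix.

```python
import numpy as np

# Symmetric channel (hypothetical example): W = W^T and every row sums to 1.
W = np.array([[0.1, 0.6, 0.3],
              [0.6, 0.2, 0.2],
              [0.3, 0.2, 0.5]])
m = W.shape[0]

P = W / m                                         # joint pmf for uniform X
Q = P / np.sqrt(np.outer(P.sum(axis=1), P.sum(axis=0)))
pics_from_svd = np.sort(np.linalg.svd(Q, compute_uv=False)[1:] ** 2)

eig = np.linalg.eigvalsh(W)                       # eigenvalues in ascending order
pics_from_eig = np.sort(eig[:-1] ** 2)            # drop the eigenvalue 1, then square

# The joint distribution is conforming only if no eigenvalue is negative.
is_conforming = bool(eig.min() >= 0)
```

This particular `W` has a negative eigenvalue, so the resulting joint distribution is symmetric but not conforming, while the PICs still agree with the squared eigenvalues.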

The $q$-ary symmetric channel, defined below, is of particular interest to some of the results derived in the following subsections.

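Definition 5. The $q$-ary symmetric channel with crossover probability $\epsilon \le 1 - q^{-1}$, also denoted as $(\epsilon, q)$-SC, is defined as the channel with input $X$ and output $Y$ where $\mathcal{X} = \mathcal{Y} = [q]$ and

$$p_{Y|X}(y|x) = \begin{cases} 1 - \epsilon & \text{if } x = y, \\ \dfrac{\epsilon}{q-1} & \text{if } x \ne y. \end{cases}$$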

In the rest of this section, we assume that $X$ and $Y$ have a conforming joint distribution matrix with $\mathcal{X} = \mathcal{Y} = [q]$ and PICs $\lambda_k(X;Y) = \sigma_k^2$ for $k \in [d-1]$. The following lemma shows that a conforming $\mathbf{P}$ with uniform marginals can be transformed into the joint distribution of a $q$-ary symmetric channel with input distribution $p_X$ by setting $\sigma_1^2 = \sigma_2^2 = \dots = \sigma_{q-1}^2$, i.e. making all principal inertia components equal to the largest one.

1 We say that a channel is symmetric if $\mathbf{P}_{Y|X} = \mathbf{P}_{Y|X}^T$.


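Lemma 4. Let $\mathbf{P}$ be a conforming joint distribution matrix of $X$ and $Y$, with $\mathcal{X} = \mathcal{Y} = [q]$, $\mathbf{P} = \mathbf{D}^{1/2}\mathbf{U}\boldsymbol{\Sigma}\mathbf{U}^T\mathbf{D}^{1/2}$, where $\mathbf{D} = \mathbf{D}_X$ and $\boldsymbol{\Sigma} = \mathrm{diag}(1, \sigma_1, \dots, \sigma_d)$. For $\tilde{\boldsymbol{\Sigma}} = \mathrm{diag}(1, \sigma_1, \dots, \sigma_1)$, let $X$ and $\tilde{Y}$ have joint distribution $\tilde{\mathbf{P}} = \mathbf{D}^{1/2}\mathbf{U}\tilde{\boldsymbol{\Sigma}}\mathbf{U}^T\mathbf{D}^{1/2}$. Then, $\tilde{Y}$ is the output of a channel with input $X$ and probability transition matrix

$$\mathbf{P}_{\tilde{Y}|X} = \sigma_1 \mathbf{I} + (1 - \sigma_1)\mathbf{1}\mathbf{p}_X^T. \tag{21}$$

In particular, if $X$ is uniform, $\tilde{Y}$ is the output of an $(\epsilon, q)$-SC with input $X$, where

$$\epsilon = \frac{(q-1)(1 - \rho_m(X;Y))}{q}. \tag{22}$$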

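Proof. The first column of $\mathbf{U}$ is $\mathbf{p}_X^{1/2}$. Therefore

$$\tilde{\mathbf{P}} = \mathbf{D}^{1/2}\mathbf{U}\tilde{\boldsymbol{\Sigma}}\mathbf{U}^T\mathbf{D}^{1/2} = \sigma_1 \mathbf{D} + (1 - \sigma_1)\mathbf{p}_X\mathbf{p}_X^T. \tag{23}$$

By left multiplying $\tilde{\mathbf{P}}$ by $\mathbf{D}^{-1}$, we obtain the channel transition matrix given in (21).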
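The construction in Lemma 4 can be reproduced numerically. The sketch below picks a random input distribution and a value of $\sigma_1$ (both illustrative assumptions), builds an orthogonal matrix whose first column is $\mathbf{p}_X^{1/2}$ via a QR factorization (one convenient choice, not prescribed by the paper), and compares the resulting channel with (21). For the uniform case it also forms the channel matrix implied by (21) so that its entries can be compared with the $(\epsilon, q)$-SC of Definition 5 with $\epsilon$ from (22), taking $\rho_m = \sigma_1$.

```python
import numpy as np

rng = np.random.default_rng(4)
q, sigma1 = 4, 0.6                                 # illustrative values
p_x = rng.random(q)
p_x /= p_x.sum()

# Orthogonal U whose first column is +/- sqrt(p_x); the sign is irrelevant below.
A = np.column_stack([np.sqrt(p_x), rng.standard_normal((q, q - 1))])
U, _ = np.linalg.qr(A)

Sigma_t = np.diag([1.0] + [sigma1] * (q - 1))      # diag(1, sigma_1, ..., sigma_1)
D_half = np.diag(np.sqrt(p_x))
P_t = D_half @ U @ Sigma_t @ U.T @ D_half          # modified joint distribution
W_t = P_t / p_x[:, None]                           # D^{-1} P_t, rows are p(y|x)

# Right-hand side of (21): sigma_1 I + (1 - sigma_1) 1 p_X^T
W_formula = sigma1 * np.eye(q) + (1 - sigma1) * np.outer(np.ones(q), p_x)

# Uniform input: (21) should reduce to an (eps, q)-SC with eps from (22).
eps = (q - 1) * (1 - sigma1) / q
W_uniform = sigma1 * np.eye(q) + (1 - sigma1) / q
```

In the uniform case the diagonal entries of `W_uniform` equal $1 - \epsilon$ and the off-diagonal entries equal $\epsilon/(q-1)$.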

Remark 3. For $X$, $Y$ and $\tilde{Y}$ given in the previous lemma, a natural question that arises is whether $Y$ is a degraded version of $\tilde{Y}$, i.e. $X \to \tilde{Y} \to Y$. Unfortunately, this is not true in general, since the matrix $\mathbf{U}\tilde{\boldsymbol{\Sigma}}^{-1}\boldsymbol{\Sigma}\mathbf{U}^T$ does not necessarily contain only positive entries, although it is doubly-stochastic. However, since the PICs of $X$ and $\tilde{Y}$ upper bound the PICs of $X$ and $Y$, it is natural to expect that, at least in some sense, $\tilde{Y}$ is more informative about $X$ than $Y$. This intuition is indeed correct for certain estimation problems where a one-bit function of $X$ is to be inferred from a single observation $Y$ or $\tilde{Y}$, and will be investigated in the next subsection. In addition, using the characterization of the PICs in Theorem 1, it follows that any function of $X$ can be inferred with smaller MMSE from $\tilde{Y}$ than from $Y$. Consequently, even if, for example, $I(X;\tilde{Y}) \le I(X;Y)$, any function of $X$ can be estimated with smaller MMSE from $\tilde{Y}$ than from $Y$.

3.2 One-bit Functions and Channel Transformations

Let $B \to X \to Y$, where $B$ is a binary random variable. When $X$ and $Y$ have a conforming probability distribution, the PICs of $X$ and $Y$ have a particularly interesting interpretation: they can be understood as the filter coefficients in a linear transformation from $p_{B|X}$ into $p_{B|Y}$, as we explain next. Consider the joint distribution of $B$ and $Y$, denoted here by $\mathbf{B}$, given by

$$\mathbf{B} \triangleq [\mathbf{x} \ \ \mathbf{1} - \mathbf{x}]^T \mathbf{P} = [\mathbf{x} \ \ \mathbf{1} - \mathbf{x}]^T \mathbf{P}_{X|Y}\mathbf{D}_Y = [\mathbf{y} \ \ \mathbf{1} - \mathbf{y}]^T \mathbf{D}_Y, \tag{24}$$

where $\mathbf{x} \in \mathbb{R}^m$ and $\mathbf{y} \in \mathbb{R}^n$ are column-vectors with entries $x_i = p_{B|X}(0|i)$ and $y_j = p_{B|Y}(0|j)$. In particular, if $B$ is a deterministic function of $X$, $\mathbf{x} \in \{0,1\}^m$.

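If $\mathbf{P}$ is conforming and $\mathcal{X} = \mathcal{Y} = [m]$, then $\mathbf{P} = \mathbf{D}^{1/2}\mathbf{U}\boldsymbol{\Sigma}\mathbf{U}^T\mathbf{D}^{1/2}$, where $\mathbf{D} = \mathbf{D}_X = \mathbf{D}_Y$. Assuming $\mathbf{D}$ fixed, the joint distribution $\mathbf{B}$ is entirely specified by the linear transformation of $\mathbf{x}$ into $\mathbf{y}$. Denoting $\mathbf{T} \triangleq \mathbf{U}^T\mathbf{D}^{1/2}$, this transformation is done in three steps:

1. (Linear transform) $\hat{\mathbf{x}} \triangleq \mathbf{T}\mathbf{x}$,

2. (Filter) $\hat{\mathbf{y}} \triangleq \boldsymbol{\Sigma}\hat{\mathbf{x}}$, where the diagonal of $\boldsymbol{\Sigma}^2$ are the PICs of $X$ and $Y$,

3. (Inverse transform) $\mathbf{y} = \mathbf{T}^{-1}\hat{\mathbf{y}}$.

Note that $\hat{x}_1 = \hat{y}_1 = 1 - \mathbb{E}[B]$ and $\hat{\mathbf{y}} = \mathbf{T}\mathbf{y}$. Consequently, the PICs of $X$ and $Y$ correspond to the singular values (or filter coefficients) of the linear transformation of $p_{B|X}(0|\cdot)$ into $p_{B|Y}(0|\cdot)$.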
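The three steps above can be traced on a concrete conforming distribution. In the sketch below the symmetric, positive-semidefinite channel `W`, the uniform input, and the binary function $B = \mathbb{1}\{X = 3\}$ are all illustrative assumptions. The vector $\mathbf{y}$ obtained through the transform, filter and inverse-transform route is compared with the direct computation of $p_{B|Y}(0|\cdot)$ from the joint distribution.

```python
import numpy as np

# Conforming example (assumption): symmetric PSD channel with uniform input.
W = np.array([[0.6, 0.3, 0.1],
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])
m = W.shape[0]
P = W / m
d = P.sum(axis=1)                           # common marginal of X and Y

Q = P / np.sqrt(np.outer(d, d))
evals, U = np.linalg.eigh(Q)                # Q is symmetric PSD, so eigh gives its SVD
order = np.argsort(evals)[::-1]             # sort in decreasing order
Sigma, U = np.diag(evals[order]), U[:, order]

T = U.T @ np.diag(np.sqrt(d))               # T = U^T D^{1/2}
x = np.array([1.0, 1.0, 0.0])               # x_i = p_{B|X}(0|i) for B = 1{X = 3}

x_hat = T @ x                               # 1. linear transform
y_hat = Sigma @ x_hat                       # 2. filter by the singular values
y = np.linalg.solve(T, y_hat)               # 3. inverse transform, y = T^{-1} y_hat

y_direct = (P.T @ x) / d                    # p_{B|Y}(0|j) computed from the joint pmf
```

Up to the sign ambiguity of the leading eigenvector, the first entry of `x_hat` equals $\Pr\{B = 0\} = 1 - \mathbb{E}[B]$, which is 2/3 for this choice of $B$.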

A similar interpretation can be made for symmetric channels, where $\mathbf{P}_{Y|X} = \mathbf{P}_{Y|X}^T = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T$ and $\mathbf{P}_{Y|X}$ acts as the matrix of the linear transformation of $\mathbf{p}_X$ into $\mathbf{p}_Y$. Note that $\mathbf{p}_Y = \mathbf{P}_{Y|X}\mathbf{p}_X$, and, consequently, $\mathbf{p}_X$ is transformed into $\mathbf{p}_Y$ in the same three steps as before:

1. (Linear transform) $\hat{\mathbf{p}}_X = \mathbf{U}^T\mathbf{p}_X$,

2. (Filter) $\hat{\mathbf{p}}_Y \triangleq \boldsymbol{\Lambda}\hat{\mathbf{p}}_X$, where the diagonal of $\boldsymbol{\Lambda}^2$ is the PICs of $X$ and $Y$ in the particular case when $X$ is uniformly distributed (Lemma 3),

3. (Inverse transform) $\mathbf{p}_Y = \mathbf{U}\hat{\mathbf{p}}_Y$.

From this perspective, the vector $\mathbf{z} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{1}m^{-1/2}$ can be understood as a proxy for the noise effect of the channel. Note that $\sum_i z_i = 1$. However, the entries of $\mathbf{z}$ are not necessarily positive, and $\mathbf{z}$ might not be a probability distribution.

We now illustrate these ideas by investigating binary channels with additive noise in the next section, where $\mathbf{T}$ will correspond to the well-known Walsh-Hadamard transform matrix.


3.3 Example: Binary Additive Noise Channels

In this example, let $\mathcal{X}^n, \mathcal{Y}^n \subseteq \{-1,1\}^n$ be the support sets of $X^n$ and $Y^n$, respectively. We define two sets of channels that map $X^n$ to $Y^n$. In each set definition, we assume the conditions for $p_{Y^n|X^n}$ to be a valid probability distribution (i.e. non-negativity and unit sum).

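Definition 6. The set of parity-changing channels of block-length $n$, denoted by $\mathcal{A}_n$, is defined as:

$$\mathcal{A}_n \triangleq \left\{ p_{Y^n|X^n} \,\middle|\, \forall S \subseteq [n],\ \exists c_S \in [-1,1] \text{ s.t. } \mathbb{E}[\chi_S(Y^n)|X^n] = c_S\chi_S(X^n) \right\}, \tag{25}$$

where $\chi_S(\cdot)$ is defined in (9). The set of all binary additive noise channels is given by

$$\mathcal{B}_n \triangleq \left\{ p_{Y^n|X^n} \,\middle|\, \exists Z^n \text{ s.t. } Y^n = X^n \oplus Z^n,\ \mathrm{supp}(Z^n) \subseteq \{-1,1\}^n,\ Z^n \perp\!\!\!\perp X^n \right\}. \tag{26}$$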

The definition of parity-changing channels is inspired by results from the literature on Fourier analysis of Boolean functions. For an overview of the topic we refer the reader to the survey [69]. The set of binary additive noise channels, in turn, is widely used in the information theory literature. The following lemma shows that both characterizations are equivalent.

Lemma 5. For $\mathcal{A}_n$ and $\mathcal{B}_n$ given in (25) and (26), respectively, $\mathcal{A}_n = \mathcal{B}_n$.

Proof. The proof is in Appendix B.

The previous lemma suggests that there is a correspondence between the coefficients $c_S$ in (25) and the distribution of the additive noise $Z^n$ in the definition of $\mathcal{B}_n$. The next result shows that this is indeed the case and, when $X^n$ is uniformly distributed, the coefficients $c_S^2$ correspond to the PICs of $X^n$ and $Y^n$.

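Theorem 4. Let $p_{Y^n|X^n} \in \mathcal{B}_n$, and $X^n \sim p_{X^n}$. Then $\mathbf{P}_{X^n,Y^n} = \mathbf{D}_{X^n}\mathbf{H}_{2^n}\boldsymbol{\Lambda}\mathbf{H}_{2^n}$, where $\mathbf{H}_l$ is the $l \times l$ normalized Hadamard matrix (footnote 2) (hence $\mathbf{H}_l^2 = \mathbf{I}$). Furthermore, for $Z^n \sim p_{Z^n}$, $\mathrm{diag}(\boldsymbol{\Lambda}) = 2^{n/2}\mathbf{H}_{2^n}\mathbf{p}_{Z^n}$, and the diagonal entries of $\boldsymbol{\Lambda}$ are equal to $c_S$ in (25). Finally, if $X^n$ is uniformly distributed, then $c_S^2$ are the principal inertia components of $X^n$ and $Y^n$.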

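Proof. Let $p_{Y^n|X^n} \in \mathcal{A}_n$ be given. From Lemma 5 and the definition of $\mathcal{A}_n$, it follows that $\chi_S(Y^n)$ is a right eigenvector of $p_{Y^n|X^n}$ with corresponding eigenvalue $c_S$. Since $\chi_S(Y^n)2^{-n/2}$ corresponds to a row of $\mathbf{H}_{2^n}$ for each $S$ (due to the Kronecker product construction of the Hadamard matrix) and $\mathbf{H}_{2^n}^2 = \mathbf{I}$, then $\mathbf{P}_{X^n,Y^n} = \mathbf{D}_{X^n}\mathbf{H}_{2^n}\boldsymbol{\Lambda}\mathbf{H}_{2^n}$. Finally, note that $\mathbf{p}_Z^T = 2^{-n/2}\mathbf{1}^T\boldsymbol{\Lambda}\mathbf{H}_{2^n}$. From Lemma 3, it follows that $c_S^2$ are the PICs of $X^n$ and $Y^n$ if $X^n$ is uniformly distributed.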
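Theorem 4 can be checked on a memoryless binary symmetric channel, which is one member of $\mathcal{B}_n$. In the sketch below the block length `n`, the crossover probability `delta`, and the encoding of $\{-1,1\}^n$ strings as integers (a bit equal to 1 marks a coordinate equal to $-1$, so that $\oplus$ becomes bitwise XOR) are assumptions made for illustration. The coefficients $c_S$ are obtained from the Hadamard transform of the noise distribution and compared with the PICs computed from the SVD of $\mathbf{Q}$ for uniform $X^n$.

```python
import numpy as np

n, delta = 3, 0.1                                   # illustrative parameters
N = 2 ** n

# Noise pmf: i.i.d. flips with probability delta; index z encodes the flip pattern.
popcount = np.array([bin(z).count("1") for z in range(N)])
p_z = delta ** popcount * (1 - delta) ** (n - popcount)

# Normalized Hadamard matrix H_{2^n} built by repeated Kronecker products.
H2 = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
H = np.array([[1.0]])
for _ in range(n):
    H = np.kron(H2, H)

c = np.sqrt(N) * (H @ p_z)                          # diag(Lambda) = 2^{n/2} H p_Z

idx = np.arange(N)
W = p_z[idx[:, None] ^ idx[None, :]]                # p_{Y|X}(y|x) = p_Z(x xor y)

P = W / N                                           # joint pmf for uniform X^n
Q = P / np.sqrt(np.outer(P.sum(axis=1), P.sum(axis=0)))
pics_svd = np.sort(np.linalg.svd(Q, compute_uv=False)[1:] ** 2)
pics_c = np.sort(np.delete(c, 0) ** 2)              # drop c for the empty set S
```

For this i.i.d. noise the coefficient attached to a set $S$ is $(1 - 2\delta)^{|S|}$, so the PICs take the values $0.8^2$, $0.8^4$ and $0.8^6$ with multiplicities 3, 3 and 1.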

Remark 4. Theorem 4 suggests that one possible method for estimating the distribution of the additive binary noise $Z^n$ is to estimate its effect on the parity bits of $X^n$ and $Y^n$. In this case, we are estimating the coefficients $a_S$ of the Walsh-Hadamard transform of $p_{Z^n}$. This approach was studied by Raginsky et al. in [70] and in other learning literature (see [71] and the references therein).

Theorem 4 illustrates the filtering role of the principal inertia components (discussed in Section 3.2) in binary additive noise channels. If $X^n$ is uniform, then the vector of conditional probabilities $\mathbf{p}_X$ is transformed into the vector of a posteriori probabilities $\mathbf{p}_Y$ by: (i) taking the Hadamard transform of $\mathbf{p}_X$, (ii) filtering the transformed vector according to the coefficients $c_S$ (these coefficients have a one-to-one mapping to the entries of the vector resulting from the Hadamard transform of $\mathbf{p}_Z$), and (iii) taking the inverse Hadamard transform to recover $\mathbf{p}_Y$.

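2 We define the normalized Hadamard matrix $\mathbf{H}_{2^k}$ as $\mathbf{H}_1 \triangleq [1]$,

$$\mathbf{H}_2 \triangleq \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix},$$

and $\mathbf{H}_{2^k} \triangleq \mathbf{H}_2 \otimes \mathbf{H}_{2^{k-1}}$.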


3.4 Quantifying the Information of a Boolean Function of the Input of a Noisy Channel

We now investigate the connection between the PICs and $f$-information (cf. Eq. (11)) in the context of one-bit functions of $X$. Recall from the discussion in the beginning of this section and, in particular, equation (24), that for a binary $B$ and $B \to X \to Y$, the distribution of $B$ and $Y$ is entirely specified by the transformation of $\mathbf{x}$ into $\mathbf{y}$, where $\mathbf{x}$ and $\mathbf{y}$ are vectors with entries equal to $p_{B|X}(0|\cdot)$ and $p_{B|Y}(0|\cdot)$, respectively.

For $E[B] = 1-a$, the $f$-information between $B$ and $Y$ is given by (cf. (11))

$$I_f(B;Y) = E\left[ a f\left(\frac{p_{B|Y}(0|Y)}{a}\right) + (1-a) f\left(\frac{1-p_{B|Y}(0|Y)}{1-a}\right) \right].$$

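For $0 \le r, s \le 1$ with $s > 0$, and since $f$ is smooth with $f(1) = 0$, we can expand $f(r/s)$ around 1 as

$$f\left(\frac{r}{s}\right) = \sum_{k=1}^{\infty} \frac{f^{(k)}(1)}{k!} \left(\frac{r-s}{s}\right)^k.$$

Denoting

$$c_k(a) \triangleq \frac{1}{a^{k-1}} + \frac{(-1)^k}{(1-a)^{k-1}},$$

and noting that the $k=1$ term vanishes because $E[p_{B|Y}(0|Y)] = a$, the $f$-information can then be expressed as

$$I_f(B;Y) = \sum_{k=2}^{\infty} \frac{f^{(k)}(1)\,c_k(a)}{k!}\, E\left[\left(p_{B|Y}(0|Y) - a\right)^k\right]. \tag{27}$$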

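Similarly to [25, Chapter 4], for a fixed $E[B] = 1-a$, maximizing the PICs of $X$ and $Y$ will always maximize the first term in the expansion (27). To see why this is the case, observe that

$$E\left[\left(p_{B|Y}(0|Y) - a\right)^2\right] = (\mathbf{y} - a\mathbf{1})^T \mathbf{D}_Y (\mathbf{y} - a\mathbf{1}) = \mathbf{y}^T \mathbf{D}_Y \mathbf{y} - a^2 = \mathbf{x}^T \mathbf{D}_X^{1/2} \mathbf{U} \boldsymbol{\Sigma}^2 \mathbf{U}^T \mathbf{D}_X^{1/2} \mathbf{x} - a^2. \tag{28}$$

For a fixed $a$ and any $\mathbf{x}$ such that $\mathbf{x}^T\mathbf{1} = a$, (28) is non-decreasing in the diagonal entries of $\boldsymbol{\Sigma}^2$ which, in turn, are exactly the PICs of $X$ and $Y$. Equivalently, (28) is non-decreasing in the $\chi^2$-divergence between $p_{X,Y}$ and $p_X p_Y$.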

However, we do note that increasing the PICs does not increase the $f$-information between $B$ and $Y$ in general. Indeed, for a fixed $\mathbf{U}$, $\mathbf{V}$ and marginal distributions of $X$ and $Y$, increasing the PICs might not even lead to a valid probability distribution matrix $\mathbf{P}$.

Nevertheless, if $\mathbf{P}$ is conforming and $X$ and $Y$ are uniformly distributed over $[q]$, as shown in Lemma 4, by increasing the PICs we can define a new random variable $\tilde{Y}$ that results from sending $X$ through a $(\epsilon, q)$-SC, where $\epsilon$ is given in (22). In this case, the $f$-information between $B$ and $\tilde{Y}$ has a simple expression when $B$ is a function of $X$.

Lemma 6. Let $B \to X \to \tilde{Y}$, where $B = b(X)$ for some $b : [q] \to \{0,1\}$, $E[B] = 1-a$ where $aq$ is an integer, $X$ is uniformly distributed in $[q]$ and $\tilde{Y}$ is the result of passing $X$ through a $(\epsilon, q)$-SC with $\epsilon \le (q-1)/q$. Then

$$I_f(B;\tilde{Y}) = a^2 f(1 + \sigma_1 c) + 2a(1-a) f(1 - \sigma_1) + (1-a)^2 f(1 + \sigma_1 c^{-1}), \tag{29}$$

where $\sigma_1 = \rho_m(X;\tilde{Y}) = 1 - \epsilon q (q-1)^{-1}$ and $c \triangleq (1-a)a^{-1}$. In particular, for $f(x) = x \log x$, then $I_f(X;\tilde{Y}) = I(X;\tilde{Y})$, and for $\sigma_1 = 1 - 2\delta$

$$I(B;\tilde{Y}) = h_b(a) - a\, h_b(2\delta(1-a)) - (1-a)\, h_b(2\delta a) \tag{30}$$

$$\le 1 - h_b(\delta), \tag{31}$$

where $h_b(\cdot)$ is the binary entropy function, defined in (2).


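Proof. Since $B$ is a deterministic function of $X$ and $aq$ is an integer, $\mathbf{x}$ is a vector with $aq$ entries equal to 1 and $(1-a)q$ entries equal to 0. It follows from (23) that

$$I_f(B;\tilde{Y}) = \frac{1}{q} \sum_{i=1}^{q} \left[ a f\left(\frac{(1-\sigma_1)a + x_i \sigma_1}{a}\right) + (1-a) f\left(\frac{1 - (1-\sigma_1)a - x_i \sigma_1}{1-a}\right) \right]$$

$$= a^2 f\left(1 + \sigma_1 \frac{1-a}{a}\right) + 2a(1-a) f(1 - \sigma_1) + (1-a)^2 f\left(1 + \sigma_1 \frac{a}{1-a}\right).$$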

Letting $f(x) = x \log x$, (30) follows immediately. Since (30) is concave in $a$ and symmetric around $a = 1/2$, it is maximized at $a = 1/2$, resulting in (31).
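As a sanity check of (30) and (31), the following sketch computes the mutual information directly from the joint pmf and compares it with the closed form. It assumes that a $(\epsilon, q)$-SC maps each input to itself with probability $1-\epsilon$ and to each other symbol with probability $\epsilon/(q-1)$; the parameter values and helper names are illustrative only, and logarithms are taken in base 2.

```python
from math import log2


def hb(p):
    """Binary entropy in bits, with hb(0) = hb(1) = 0."""
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * log2(p) - (1 - p) * log2(1 - p)


def mutual_information_bits(joint):
    """I(U;V) in bits for a joint pmf given as a list of rows."""
    pu = [sum(row) for row in joint]
    pv = [sum(col) for col in zip(*joint)]
    return sum(
        p * log2(p / (pu[i] * pv[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )


q, k, delta = 8, 3, 0.1            # k = a*q symbols are mapped to B = 0
a = k / q
sigma1 = 1 - 2 * delta
eps = (1 - sigma1) * (q - 1) / q   # inverts sigma1 = 1 - eps*q/(q-1)


def channel(y, x):
    """Assumed (eps, q)-SC transition probability Pr{Y~ = y | X = x}."""
    return 1 - eps if y == x else eps / (q - 1)


# Joint pmf of (B, Y~) with X uniform: row 0 is B = 0 (x < k), row 1 is B = 1.
joint = [[0.0] * q for _ in range(2)]
for x in range(q):
    for y in range(q):
        joint[0 if x < k else 1][y] += channel(y, x) / q

mi_direct = mutual_information_bits(joint)
mi_formula = hb(a) - a * hb(2 * delta * (1 - a)) - (1 - a) * hb(2 * delta * a)  # (30)
upper = 1 - hb(delta)                                                          # (31)
```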

3.5 On the “Most Informative Bit”

We now return to channels with additive binary noise, analyzed in Section 3.3. Let $X^n$ be a uniformly distributed binary string of length $n$ ($\mathcal{X} = \{-1, 1\}$) and $Y^n$ be the result of passing $X^n$ through a memoryless binary symmetric channel with crossover probability $\delta \le 1/2$. Kumar and Courtade conjectured [17] that for all binary $B$ and $B \to X^n \to Y^n$ we have

$$I(B;Y^n) \le 1 - h_b(\delta). \quad \text{(conjecture)} \tag{32}$$

It is sufficient to consider $B$ a function of $X^n$, denoted by $B = b(X^n)$, $b : \{-1,1\}^n \to \{0,1\}$, and we make this assumption henceforth.

From the discussion in Section 3.3, for the memoryless binary symmetric channel $Y^n = X^n \oplus Z^n$, where $Z^n$ is an i.i.d. string with $\Pr\{Z_i = 1\} = 1 - \delta$, and any $S \subseteq [n]$,

$$E[\chi_S(Y^n) \mid X^n] = \chi_S(X^n)\left(\Pr\{\chi_S(Z^n) = 1\} - \Pr\{\chi_S(Z^n) = -1\}\right)$$

$$= \chi_S(X^n)\left(2\Pr\{\chi_S(Z^n) = 1\} - 1\right)$$

$$= \chi_S(X^n)(1 - 2\delta)^{|S|}.$$

It follows directly that $c_S = (1 - 2\delta)^{|S|}$ for all $S \subseteq [n]$. Consequently, from Theorem 4, the principal inertia components of $X^n$ and $Y^n$ are of the form $(1 - 2\delta)^{2|S|}$ for some $S \subseteq [n]$. Observe that the principal inertia components act, broadly speaking, as a low-pass filter on the vector of conditional probabilities $\mathbf{x}$ given in (24), since they attenuate the high-order interaction terms in the Walsh-Hadamard transform of $\mathbf{x}$.

Can the noise distribution be modified so that the principal inertia components act as an all-pass filter? More specifically, what happens when $\tilde{Y}^n = X^n \oplus W^n$, where $W^n$ is such that the principal inertia components between $X^n$ and $\tilde{Y}^n$ satisfy $\sigma_i = 1 - 2\delta$? Then, from Lemma 4, $\tilde{Y}^n$ is the result of sending $X^n$ through a $(\epsilon, 2^n)$-SC with $\epsilon = 2\delta(1 - 2^{-n})$. Therefore, from (31),

$$I(B;\tilde{Y}^n) \le 1 - h_b(\delta).$$

For any function $b : \{-1,1\}^n \to \{0,1\}$ such that $B = b(X^n)$, from standard results in Fourier analysis of Boolean functions [69, Prop. 1.1], $b(X^n)$ can be expanded as

$$b(X^n) = \sum_{S \subseteq [n]} \beta_S \chi_S(X^n).$$

The value of $B$ is uniquely determined by the action of $b$ on $\chi_S(X^n)$. Consequently, for a fixed function $b$, one could expect that $\tilde{Y}^n$ should be more informative about $B$ than $Y^n$, since the parity bits $\chi_S(X^n)$ are more reliably estimated from $\tilde{Y}^n$ than from $Y^n$. Indeed, the memoryless binary symmetric channel attenuates $\chi_S(X^n)$ exponentially in $|S|$, acting (as argued previously) as a low-pass filter. In addition, if one could prove that for any fixed $b$ the inequality $I(B;Y^n) \le I(B;\tilde{Y}^n)$ holds, then (32) would be proven true. This motivates the following conjecture.

Conjecture 1. For all $b : \{-1,1\}^n \to \{0,1\}$ and $B = b(X^n)$

$$I(B;Y^n) \le I(B;\tilde{Y}^n).$$

We note that Conjecture 1 is false if $B$ is not a deterministic function of $X^n$. In the next section, we provide further evidence for this conjecture by investigating information metrics between $B$ and an estimate $\hat{B}$ derived from $Y^n$.
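For very small $n$, the quantities in (32) and Conjecture 1 can be computed by brute force. The sketch below is an illustration only and proves nothing: it enumerates all Boolean functions for $n = 3$, uses a {0, 1} bit encoding with XOR noise, and evaluates $I(B;\tilde{Y}^n)$ through the closed form (30). The names `info_bsc`, `info_all_pass` and `conjecture1_gap` are hypothetical.

```python
from math import log2


def hb(p):
    """Binary entropy in bits, with hb(0) = hb(1) = 0."""
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * log2(p) - (1 - p) * log2(1 - p)


def popcount(k):
    return bin(k).count("1")


n, delta = 3, 0.1
N = 2 ** n

# Memoryless BSC: W[x][y] = Pr{Y^n = y | X^n = x}.
W = [[delta ** popcount(x ^ y) * (1 - delta) ** (n - popcount(x ^ y))
      for y in range(N)] for x in range(N)]


def info_bsc(table):
    """I(B;Y^n) in bits for B = table[X^n], with X^n uniform."""
    a = table.count(0) / N
    total = hb(a)
    for y in range(N):
        p_y = sum(W[x][y] for x in range(N)) / N
        p0_and_y = sum(W[x][y] for x in range(N) if table[x] == 0) / N
        total -= p_y * hb(p0_and_y / p_y)
    return total


def info_all_pass(table):
    """I(B;Y~^n) from (30), where Y~^n is the output of the all-pass channel."""
    a = table.count(0) / N
    return hb(a) - a * hb(2 * delta * (1 - a)) - (1 - a) * hb(2 * delta * a)


tables = [[(t >> x) & 1 for x in range(N)] for t in range(2 ** N)]
best = max(info_bsc(tb) for tb in tables)
dictator = [x & 1 for x in range(N)]   # B equals the first input bit
bound = 1 - hb(delta)

# Smallest value of I(B;Y~^n) - I(B;Y^n) over all functions; Conjecture 1
# asserts that this quantity is non-negative.
conjecture1_gap = min(info_all_pass(tb) - info_bsc(tb) for tb in tables)
```

The dictator function attains the right-hand side of (32) exactly, so any numerical comparison against the bound should allow for floating-point tolerance.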


3.6 One-bit Estimators

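Let $B \to X \to Y \to \hat{B}$, where $B$ and $\hat{B}$ are binary random variables with $E[B] = 1-a$ and $E[\hat{B}] = 1-b$. Again, we let $\mathbf{x} \in \mathbb{R}^m$ and $\mathbf{y} \in \mathbb{R}^n$ be the column vectors with entries $x_i = p_{B|X}(0|i)$ and $y_j = p_{\hat{B}|Y}(0|j)$. The joint distribution matrix of $B$ and $\hat{B}$ is given by

$$\mathbf{P}_{B,\hat{B}} = \begin{pmatrix} z & a - z \\ b - z & 1 - a - b + z \end{pmatrix}, \tag{33}$$

where $z = \mathbf{x}^T \mathbf{P} \mathbf{y} = \Pr\{B = \hat{B} = 0\}$. For fixed values of $a$ and $b$, the joint distribution of $B$ and $\hat{B}$ only depends on $z$.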

Let $f : \mathcal{P}_{2 \times 2} \to \mathbb{R}$, and, with a slight abuse of notation, we also denote $f$ as a function of the entries of the $2 \times 2$ matrix as $f(a, b, z)$. If $f$ is convex in $z$ for a fixed $a$ and $b$, then $f$ is maximized at one of the extreme values of $z$. Examples of such functions $f$ include mutual information and expected error probability. Therefore, characterizing the maximum and minimum values of $z$ is equivalent to characterizing the maximum value of $f$ over all possible mappings $X \to B$ and $Y \to \hat{B}$. This leads to the following definition.

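Definition 7. For a fixed $\mathbf{P}$ and given $E[B] = 1-a$ and $E[\hat{B}] = 1-b$, the minimum and maximum values of $z$ over all possible mappings $X \to B$ and $Y \to \hat{B}$ are defined as

$$z_l^*(a,b,\mathbf{P}) \triangleq \min_{\substack{\mathbf{x} \in \mathcal{C}^m(a,\mathbf{P}^T) \\ \mathbf{y} \in \mathcal{C}^n(b,\mathbf{P})}} \mathbf{x}^T \mathbf{P} \mathbf{y} \quad \text{and} \quad z_u^*(a,b,\mathbf{P}) \triangleq \max_{\substack{\mathbf{x} \in \mathcal{C}^m(a,\mathbf{P}^T) \\ \mathbf{y} \in \mathcal{C}^n(b,\mathbf{P})}} \mathbf{x}^T \mathbf{P} \mathbf{y},$$

respectively, where $\mathcal{C}^n(a,\mathbf{P})$ is defined in (8).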

The next lemma provides a simple upper bound for $z_u^*(a,b,\mathbf{P})$ in terms of the largest principal inertia component or, equivalently, the maximal correlation between $X$ and $Y$.

Lemma 7. $z_u^*(a,b,\mathbf{P}) \le ab + \rho_m(X;Y)\sqrt{a(1-a)b(1-b)}$.

Proof. The proof is in Appendix B.

Remark 5. An analogous result was derived by Witsenhausen [32, Thm. 2] for bounding the probability of agreement of a common bit derived from two correlated sources.

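We will focus in the rest of this section on functions and corresponding estimators that are (i) unbiased ($a = b$) and (ii) satisfy $z = \Pr\{B = \hat{B} = 0\} \ge a^2$. The set of all such mappings is given by

$$\mathcal{H}(a,\mathbf{P}) \triangleq \left\{ (\mathbf{x},\mathbf{y}) \mid \mathbf{x} \in \mathcal{C}^m(a,\mathbf{P}^T),\ \mathbf{y} \in \mathcal{C}^n(a,\mathbf{P}),\ \mathbf{x}^T \mathbf{P} \mathbf{y} \ge a^2 \right\}.$$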

The next results provide upper and lower bounds on z for the mappings in H(a,P).

Lemma 8. Let $0 \le a \le 1/2$ and $\mathbf{P}$ be fixed. For any $(\mathbf{x},\mathbf{y}) \in \mathcal{H}(a,\mathbf{P})$

$$a^2 \le z \le a^2 + \rho_m(X;Y)\,a(1-a), \tag{34}$$

where $z = \mathbf{x}^T \mathbf{P} \mathbf{y}$.

Proof. The lower bound for $z$ follows directly from the definition of $\mathcal{H}(a,\mathbf{P})$, and the upper bound follows from Lemma 7 with $b = a$.

The previous lemma allows us to provide an upper bound over the mappings in $\mathcal{H}(a,\mathbf{P})$ for the $f$-information between $B$ and $\hat{B}$ when $I_f$ is non-negative.

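Theorem 5. For any non-negative $I_f$ and fixed $a$ and $\mathbf{P}$,

$$\sup_{(\mathbf{x},\mathbf{y}) \in \mathcal{H}(a,\mathbf{P})} I_f(B;\hat{B}) \le a^2 f(1 + \sigma_1 c) + 2a(1-a) f(1 - \sigma_1) + (1-a)^2 f(1 + \sigma_1 c^{-1}), \tag{35}$$

where here $\sigma_1 = \rho_m(X;Y)$ and $c \triangleq (1-a)a^{-1}$. In particular, for $a = 1/2$,

$$\sup_{(\mathbf{x},\mathbf{y}) \in \mathcal{H}(1/2,\mathbf{P})} I_f(B;\hat{B}) \le \frac{1}{2}\left( f(1 - \sigma_1) + f(1 + \sigma_1) \right). \tag{36}$$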


Proof. Using the matrix form of the joint distribution between $B$ and $\hat{B}$ given in (33), for $E[B] = E[\hat{B}] = 1-a$, the $f$-information is given by

$$I_f(B;\hat{B}) = a^2 f\left(\frac{z}{a^2}\right) + 2a(1-a) f\left(\frac{a - z}{a(1-a)}\right) + (1-a)^2 f\left(\frac{1 - 2a + z}{(1-a)^2}\right). \tag{37}$$

Since $f$ is convex, (37) is convex in $z$. For $(\mathbf{x},\mathbf{y}) \in \mathcal{H}(a,\mathbf{P})$, it follows from Lemma 8 that $z$ is restricted to the interval in (34). Since $I_f(B;\hat{B})$ is non-negative by assumption, $I_f(B;\hat{B}) = 0$ for $z = a^2$ and (37) is convex in $z$, then $I_f(B;\hat{B})$ is non-decreasing in $z$ for $z$ in (34). Substituting $z = a^2 + \rho_m(X;Y)\,a(1-a)$ in (37), inequality (35) follows.
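The substitution at the end of the proof can be checked numerically. The sketch below specializes to $f(t) = t \log_2 t$ (mutual information in bits) and picks arbitrary illustrative values for $a$ and $\delta$; `f_info` and `bound_35` are hypothetical helper names for the right-hand sides of (37) and (35).

```python
from math import log2


def f(t):
    """f(t) = t*log2(t), extended by continuity with f(0) = 0."""
    return 0.0 if t <= 0 else t * log2(t)


def hb(p):
    """Binary entropy in bits."""
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * log2(p) - (1 - p) * log2(1 - p)


def f_info(a, z):
    """Right-hand side of (37)."""
    return (a * a * f(z / (a * a))
            + 2 * a * (1 - a) * f((a - z) / (a * (1 - a)))
            + (1 - a) ** 2 * f((1 - 2 * a + z) / (1 - a) ** 2))


def bound_35(a, sigma1):
    """Right-hand side of (35)."""
    c = (1 - a) / a
    return (a * a * f(1 + sigma1 * c)
            + 2 * a * (1 - a) * f(1 - sigma1)
            + (1 - a) ** 2 * f(1 + sigma1 / c))


a, delta = 0.3, 0.1
sigma1 = 1 - 2 * delta
z_lo, z_hi = a * a, a * a + sigma1 * a * (1 - a)   # the interval in (34)
grid = [z_lo + (z_hi - z_lo) * i / 1000 for i in range(1001)]
values = [f_info(a, z) for z in grid]
```

Evaluating `bound_35(0.5, sigma1)` gives $1 - h_b(\delta)$, which is how (36) specializes to the mutual-information bound for this choice of $f$.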

Remark 6. Note that the right-hand side of (35) matches the right-hand side of (29), and provides further evidence for Conjecture 1 by demonstrating that the conjecture holds for the specific case when $B \to X \to Y \to \hat{B}$ and $E[B] = E[\hat{B}]$. Moreover, this result indicates that, for conforming probability distributions, the information between a binary function and its corresponding unbiased estimate is maximized when all the PICs have the same value.

Following the same approach from Lemma 6, we find the next bound for the mutual information between $B$ and $\hat{B}$.

Corollary 2. For $0 \le a \le 1$ and $\rho_m(X;Y) = 1 - 2\delta$

$$\sup_{(p_{B|X},\, p_{\hat{B}|Y}) \in \mathcal{H}(a,\mathbf{P})} I(B;\hat{B}) \le 1 - h_b(\delta).$$

We provide next a few application examples for the results derived in this section.

Example 1 (Memoryless Binary Symmetric Channels with Uniform Inputs). We turn our attention back to the setting considered in Section 3.3. Let $Y^n$ be the result of passing $X^n$ through a memoryless binary symmetric channel with crossover probability $\delta$, $X^n$ uniformly distributed, and $B \to X^n \to Y^n \to \hat{B}$. Then $\rho_m(X^n;Y^n) = 1 - 2\delta$ and, from (40), when $E[B] = 1/2$,

$$\Pr\{B \ne \hat{B}\} \ge \delta.$$

Consequently, inferring any unbiased one-bit function of the input of a binary symmetric channel is at least as hard (in terms of error probability) as inferring a single output from a single input.

Using the result from Corollary 2, it follows that when $E[B] = E[\hat{B}] = a$ and $\Pr\{B = \hat{B} = 0\} \ge a^2$, then

$$I(B;\hat{B}) \le 1 - h_b(\delta). \tag{38}$$

Remark 7. Anantharam et al. presented in [47] a computer-aided proof that the upper bound (38) holds for any $B \to X^n \to Y^n \to \hat{B}$. Nevertheless, we highlight that the methods introduced here allowed an analytical derivation of (38) for unbiased estimators.

Example 2 (Lower Bounding the Estimation Error Probability). For $z$ given in (33), the average estimation error probability is given by $\Pr\{B \ne \hat{B}\} = a + b - 2z$, which is a convex (linear) function of $z$. If $a$ and $b$ are fixed, then the error probability is minimized when $z$ is maximized. Therefore

$$\Pr\{B \ne \hat{B}\} \ge a + b - 2 z_u^*(a,b,\mathbf{P}).$$

Using the bound from Lemma 7, it follows that

$$\Pr\{B \ne \hat{B}\} \ge a + b - 2ab - 2\rho_m(X;Y)\sqrt{a(1-a)b(1-b)}. \tag{39}$$

The bound (39) is exactly the bound derived by Witsenhausen in [32, Thm. 2]. Furthermore, minimizing the right-hand side of (39) over $0 \le b \le 1/2$, we arrive at

$$\Pr\{B \ne \hat{B}\} \ge \frac{1}{2}\left( 1 - \sqrt{1 - 4a(1-a)\left(1 - \rho_m(X;Y)^2\right)} \right). \tag{40}$$

This result suggests that the PICs are particularly useful for deriving bounds on error probability. We explore this fact in the next section, and show that (40) is a particular form of a more general bound derived in Theorem 6.
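The step from (39) to (40) can be checked numerically by minimizing the right-hand side of (39) over a fine grid of $b \in [0, 1/2]$. This is only a sketch: the values of $a$ and $\rho_m$ are arbitrary, the grid search stands in for the analytical minimization, and the function names are hypothetical.

```python
from math import sqrt


def witsenhausen_bound(a, b, rho):
    """Right-hand side of (39)."""
    return a + b - 2 * a * b - 2 * rho * sqrt(a * (1 - a) * b * (1 - b))


def closed_form(a, rho):
    """Right-hand side of (40)."""
    return 0.5 * (1 - sqrt(1 - 4 * a * (1 - a) * (1 - rho ** 2)))


a, rho = 0.3, 0.5
steps = 200_000
grid_min = min(witsenhausen_bound(a, 0.5 * i / steps, rho) for i in range(steps + 1))

# Example 1: a = 1/2 and rho = 1 - 2*delta give the bound delta.
delta = 0.1
bsc_bound = closed_form(0.5, 1 - 2 * delta)
```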


4 Application to Estimation: Bounds on Error Probability

In this section we derive lower bounds on error probability based on the PICs (cf. Section 2). Before presenting these bounds, we discuss the general approach used for deriving lower bounds, which can be extended to other measures of dependence. This approach is particularly useful for proving information-theoretic security and privacy guarantees.

Recall the central estimation-theoretic problem: Given an observation of a random variable $Y$, what can we learn about a correlated, hidden variable $X$? Such questions are relevant for different application areas. For example, in a symmetric-key encryption setup, $X$ can be the plaintext message, and $Y$ the ciphertext and any additional side information available to an adversary. If there is an encryption mechanism in place that guarantees that the mutual information between an individual symbol of the plaintext $X$ and a ciphertext $Y$ is at most 0.01 bits [72], how well can an adversary guess individual symbols of $X$? How does this result depend on the distribution of the plaintext source? Are there other information measures besides mutual information for deriving such bounds on estimation?

If the joint distribution between $X$ and $Y$ is known, the probability of error of estimating $X$ given an observation of $Y$ can be calculated exactly. However, in most practical settings, this joint distribution is unknown. Nevertheless, it may be possible to reliably estimate a certain correlation (dependence) measure of $X$ and $Y$, such as maximal correlation, $\chi^2$ or mutual information. In general, we will denote this measure as $\mathcal{I}(X;Y)$.

Given an upper bound $\theta$ on a certain dependence measure $\mathcal{I}$, i.e. $\mathcal{I}(X;Y) \le \theta$, is it possible to determine a lower bound for the average error of estimating $X$ from $Y$ over all possible estimators? We answer this question in the affirmative. In particular, the problem of computing such a bound for a given distribution $p_X$ and $\theta$ is equivalent to computing a distortion-rate function, presented in Definition 9. When the estimation metric is error probability, we call the corresponding distortion-rate function the error-rate function, denoted by $e_{\mathcal{I}}(p_X, \theta)$ and given in Definition 10. In the context of security and privacy, this bound characterizes the best estimation of the plaintext that a (computationally unbounded) adversary can make given an observation of the output of the system in terms of statistics of the distribution of the input and output. This allows, for example, guarantees on correlation measures frequently used in security and privacy settings to be translated into bounds on the estimation error.

Recall that $X$ and $Y$ are discrete random variables with support $\mathcal{X} = [m]$ and $\mathcal{Y} = [n]$, and, consequently, the joint pmf $p_{X,Y}$ can be displayed as the entries of a matrix $\mathbf{P} \in \mathbb{R}^{m \times n}$, where $[\mathbf{P}]_{i,j} = p_{X,Y}(i,j)$. The problem of determining the estimator $\hat{X}$ of $X$ given an observation of $Y$ then reduces to finding a row-stochastic matrix $\mathbf{P}_{\hat{X}|Y} \in \mathbb{R}^{n \times m}$ that is the solution of

$$P_e(X|Y) = \min_{\mathbf{P}_{\hat{X}|Y}} 1 - \mathrm{tr}\left(\mathbf{P}\,\mathbf{P}_{\hat{X}|Y}\right). \tag{41}$$

Note that the previous minimization is a linear program, and by taking its dual the reader can verify that the optimal $\mathbf{P}_{\hat{X}|Y}$ is the maximum a posteriori (MAP) estimator, as expected.
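A small numerical illustration of (41), under the assumption of a made-up $3 \times 2$ joint pmf: because the program is linear, its optimum is attained at a deterministic estimator, so a brute-force search over deterministic rules can be compared with the MAP rule. The matrix `P` and the variable names are illustrative only.

```python
from itertools import product

# Made-up joint pmf with P[i][j] = p_{X,Y}(i, j), m = 3 and n = 2.
P = [[0.30, 0.05],
     [0.10, 0.25],
     [0.10, 0.20]]
m, n = len(P), len(P[0])

# MAP rule: for each observation y, pick the x maximizing p_{X,Y}(x, y).
map_rule = [max(range(m), key=lambda i, j=j: P[i][j]) for j in range(n)]
pe_map = 1 - sum(P[map_rule[j]][j] for j in range(n))

# Brute force over all deterministic estimators y -> x_hat, i.e. the extreme
# points of the set of row-stochastic matrices P_{X_hat|Y}; for such a rule
# tr(P P_{X_hat|Y}) equals the sum of the selected entries of P.
pe_brute = min(
    1 - sum(P[rule[j]][j] for j in range(n))
    for rule in product(range(m), repeat=n)
)
```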

We highlight again that in applications the joint distribution matrix $\mathbf{P}$ may not be known exactly; only a given dependence measure $\mathcal{I}(p_{X,Y})$ may be known. Equation (41) hints that dependence measures that depend on the spectrum of $\mathbf{P}$ may lead to sharp lower bounds for error probability. Indeed, the trace of the product of two matrices is closely related to their spectra (cf. von Neumann's trace inequality [24, Thm. 7.4.1.1]). This motivates the following question: Are there information measures that capture the spectrum of a joint distribution matrix $\mathbf{P}$? This naturally leads to the consideration of measures of dependence and lower bounds on estimation error based on the PICs. These bounds are derived in Section 4.2, but we first provide an overview of our approach in Section 4.1.

Owing to the nature of the joint distribution, it may be infeasible to estimate $X$ from $Y$ with small estimation error. It is, however, possible that a non-trivial function $f(X)$ exists that is of interest to a learner and can be estimated reliably from $Y$. If $f$ is the identity function, this reduces to the standard problem of estimating $X$ from $Y$. Determining if such a function exists is relevant to several applications in learning, privacy, security and information theory. In particular, this setting is related to the information bottleneck method [73] and functional compression [44], where the goal is to compress $X$ into $Y$ such that $Y$ still preserves information about $f(X)$.

For most security applications, minimizing the average error of estimating a hidden variable $X$ from an observation of $Y$ is insufficient. As argued in [59], cryptographic definitions of security, and in particular semantic security [74], require that an adversary has negligible advantage in guessing any function of the input given an observation of the output. In light of this, we present bounds for the best possible average error achievable for estimating functions of $X$ given an observation of $Y$.


Assuming that $p_{X,Y}$ is unknown, $p_X$ is given and a bound $\mathcal{I}(X;Y) \le \theta$ is known (where $\mathcal{I}$ is not restricted to being mutual information), we present in Theorem 8 a method for adapting bounds for error probability into bounds for the average estimation error of functions of $X$ given $Y$. This method depends on a few technical assumptions on the dependence measure (stated in Definition 8 and in Theorem 8), foremost of which is the existence of a lower bound for the error-rate function that is Schur-concave$^{3}$ in $p_X$ for a fixed $\theta$. Theorem 8 then states that, under these assumptions, for any deterministic, surjective function $f : \mathcal{X} \to \{1, \dots, M\}$, we can obtain a lower bound for the average estimation error of $f$ by computing $e_{\mathcal{I}}(p_U, \theta)$, where $U$ is a random variable that is a function of $X$.

Note that Schur-concavity is crucial for this result. In Theorem 7, we show that this condition is always satisfied when $\mathcal{I}(X;Y)$ is concave in $p_X$ for a fixed $p_{Y|X}$, convex in $p_{Y|X}$ for a fixed $p_X$, and satisfies the DPI. This generalizes a result by Ahlswede [18] on the extremal properties of rate-distortion functions. Consequently, Fano's inequality can be adapted in order to bound the average estimation error of functions, as shown in Corollary 5. By observing that a particular form of the bound stated in Theorem 6 is Schur-concave, we present in the next section a bound for the error probability of estimating functions in terms of the maximal correlation, stated in Corollary 6.

4.1 A Convex Program for Mapping Information Guarantees to Bounds on Estimation

Throughout the rest of the paper, we let $X$ and $Y$ be two random variables drawn from finite sets $\mathcal{X}$ and $\mathcal{Y}$. We have the following definition.

Definition 8. We say that a function $\mathcal{I}$ that maps any joint probability mass function (pmf) to a non-negative real number is a dependence measure (equivalently, measure of dependence) if for any discrete random variables $W$, $X$, $Y$ and $Z$ (i) $\mathcal{I}(p_{X,Y})$ is convex in $p_{Y|X}$ for a fixed $p_X$, (ii) $\mathcal{I}$ satisfies the DPI, i.e. if $X \to Y \to Z$ then $\mathcal{I}(p_{X,Z}) \le \mathcal{I}(p_{X,Y})$, and (iii) if $W$ is a one-to-one mapping of $Y$ and $Z$ is a one-to-one mapping of $X$, then $\mathcal{I}(p_{W,Z}) = \mathcal{I}(p_{X,Y})$ (invariance property). We overload the notation of $\mathcal{I}$ and let $\mathcal{I}(p_{X,Y}) = \mathcal{I}(p_X, p_{Y|X})$ in order to make the dependence on the marginal distribution and the channel (transition probability) clear. Furthermore, we also denote $\mathcal{I}(p_{X,Y}) = \mathcal{I}(X;Y)$ when the distribution is clear from the context. Examples of dependence measures include maximal correlation, defined in (1), and mutual information.

Now consider the standard estimation setup where a hidden variable $X$ should be estimated from an observed random variable $Y$. We assume that the joint distribution $p_{X,Y}$ is not known, but the marginal distribution $p_X$ is known, and that $\mathcal{I}(p_{X,Y}) \le \theta$ (e.g. a security constraint) for a given dependence measure $\mathcal{I}$. Since $\mathcal{I}$ satisfies the DPI, for any estimate $\hat{X}$ of $X$ such that $X \to Y \to \hat{X}$ we have $\mathcal{I}(X;\hat{X}) \le \mathcal{I}(X;Y) \le \theta$. The problem of translating a bound on $\mathcal{I}$ into a constraint on how well a hidden variable $X$ can (on average) be estimated from $Y$ given an error function $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ can be approximated by solving the optimization problem

$$\inf_{p_{\hat{X}|X}} E\left[d(X,\hat{X})\right] \tag{42}$$

$$\text{s.t. } \mathcal{I}(X;\hat{X}) \le \theta. \tag{43}$$

This motivates the following definition.

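Definition 9. We denote the smallest (average) estimation error $D_{\mathcal{I},d}$ for a given dependence measure $\mathcal{I}$ and estimation cost function $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ as

$$D_{\mathcal{I},d}(p_X, \theta) \triangleq \inf_{p_{\hat{X}|X}} \left\{ E\left[d(X,\hat{X})\right] \,\middle|\, \mathcal{I}(p_X, p_{\hat{X}|X}) \le \theta \right\}, \tag{44}$$

where the infimum is over all conditional distributions $p_{\hat{X}|X}$.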

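Observe that for any $p_{Y|X}$ that satisfies $\mathcal{I}(p_X, p_{Y|X}) \le \theta$

$$D_{\mathcal{I},d}(p_X, \theta) \le \inf_{p_{\hat{X}|Y}} \left\{ E\left[d(X,\hat{X})\right] \,\middle|\, X \to Y \to \hat{X} \right\},$$

since, by the assumption that $\mathcal{I}$ satisfies the DPI, $\mathcal{I}(X;\hat{X}) \le \mathcal{I}(X;Y) \le \theta$. When $\mathcal{I}(X;Y) = I(X;Y)$ (mutual information), $D_{\mathcal{I},d}(p_X, \theta)$ is the distortion-rate function [9, pg. 306]. When the distortion function $d$ is the Hamming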

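$^{3}$ A function $f : \mathbb{R}^n \to \mathbb{R}$ is said to be Schur-concave if for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ where $\mathbf{x}$ is majorized by $\mathbf{y}$, then $f(\mathbf{x}) \ge f(\mathbf{y})$.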


distortion, $D_{\mathcal{I},d}(p_X, \theta)$ gives the smallest probability of error for estimating $X$ given an observation $Y$ that satisfies $\mathcal{I}(X;Y) \le \theta$. This case will be of particular interest in this section, motivating the next definition.

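Definition 10. Denoting the Hamming distortion metric as

$$d_H(x,y) \triangleq \begin{cases} 0, & x = y, \\ 1, & \text{otherwise}, \end{cases}$$

we define the error-rate function$^{4}$ for the dependence measure $\mathcal{I}$ as

$$e_{\mathcal{I}}(p_X, \theta) \triangleq D_{\mathcal{I},d_H}(p_X, \theta).$$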

The definition of error-rate function directly leads to the following simple lemma.

Lemma 9. For a given dependence measure $\mathcal{I}$ and any fixed $p_{X,Y}$ such that $\mathcal{I}(p_{X,Y}) \le \theta$

$$P_e(X|Y) \ge e_{\mathcal{I}}(p_X, \theta).$$

Proof. Observe that $P_e(X|Y) = \min_{X \to Y \to \hat{X}} E\left[d_H(X,\hat{X})\right]$, where the minimum is over all distributions $p_{\hat{X}|X}$ that satisfy the Markov constraint $X \to Y \to \hat{X}$. Since $\mathcal{I}$ satisfies the DPI, then $\mathcal{I}(X;\hat{X}) \le \mathcal{I}(X;Y) \le \theta$, and the result follows from Definition 9.

The previous lemma shows that the characterization of $e_{\mathcal{I}}(p_X, \theta)$ for different measures of information $\mathcal{I}$ is particularly relevant for applications in privacy and security, where $X$ is a variable that should remain hidden (e.g. plaintext) and $Y$ is an adversary's observation (e.g. ciphertext). Knowing $e_{\mathcal{I}}$ allows us to translate an upper bound $\mathcal{I}(X;Y) \le \theta$ into an estimation guarantee: regardless of an adversary's computational resources, given only access to $Y$ he will not be able to estimate $X$ with an average error probability $P_e(X|Y)$ smaller than $e_{\mathcal{I}}(p_X, \theta)$. Therefore, by simply estimating $\theta$ and calculating $e_{\mathcal{I}}(p_X, \theta)$ we are able to evaluate the security threat incurred by an adversary that has access to $Y$.

Example 3 (Error-rate function for mutual information). Using the expression for the rate-distortion function under Hamming distortion for mutual information ([75, (9.5.8)]), for $\mathcal{I}(X;Y) = I(X;Y)$ and $\mathcal{X} = [m]$, the error-rate function is given by $e_I(p_X, \theta) = d^*$, where $d^*$ is the solution of

$$h_b(d^*) + d^* \log(m-1) = H(X) - \theta, \tag{45}$$

and $h_b(x) \triangleq -x \log x - (1-x)\log(1-x)$. Denoting $X \to Y \to \hat{X}$ and $p_e \triangleq P_e(X|Y)$, note that (45) implies Fano's inequality [9, 2.140]:

$$h_b(p_e) + p_e \log(m-1) \ge H(X) - I(X;Y) = H(X|Y). \tag{46}$$

4.2 A Lower Bound for Error Probability Based on the PICs

Throughout the rest of the section, we assume without loss of generality that $p_X$ is sorted in decreasing order, i.e. $p_X(1) \ge p_X(2) \ge \dots \ge p_X(m)$.

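Definition 11. Let $\boldsymbol{\Lambda}(p_{X,Y})$ denote the vector of PICs of a joint distribution $p_{X,Y}$ sorted in decreasing order, i.e. $\boldsymbol{\Lambda}(p_{X,Y}) = (\lambda_1(X;Y), \dots, \lambda_d(X;Y))$. We denote $\boldsymbol{\Lambda}(p_{X,Y}) \le \boldsymbol{\lambda} \triangleq (\lambda_1, \dots, \lambda_d)$ if $\lambda_k(X;Y) \le \lambda_k$ for $k \in [d]$, and define

$$\mathcal{R}(q, \boldsymbol{\lambda}) \triangleq \left\{ p_{X,Y} \mid p_X = q \text{ and } \boldsymbol{\Lambda}(p_{X,Y}) \le \boldsymbol{\lambda} \right\}. \tag{47}$$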

In the next theorem we present a Fano-style bound for the estimation error probability of $X$ that depends on the marginal distribution $p_X$ and on the principal inertias.

Theorem 6. For $\boldsymbol{\lambda} = (\lambda_1, \dots, \lambda_d)$ and fixed $p_X$, let

$$k^* \triangleq \max\left\{ k \in [m] \,\middle|\, p_X(k) \ge \sum_{i \in [m]} p_X(i)^2 \right\}. \tag{48}$$

$^{4}$ The term error-rate function is used in the same sense as distortion-rate function in rate-distortion theory [9, Chap. 10]. We adopt "error" instead of distortion here since we only consider Hamming distance as the distortion metric.


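In addition, let $\mathbf{p} = (p_X(1), \dots, p_X(m))$ and $\boldsymbol{\lambda}_{k^*} = (\lambda_1, \dots, \lambda_{k^*}, \lambda_{k^*}, \lambda_{k^*+1}, \dots, \lambda_{m-1})$ (where $\lambda_m \triangleq 0$ and $\boldsymbol{\lambda}_m = (\lambda_1, \dots, \lambda_m)$). Defining

$$u(p_X, \boldsymbol{\lambda}) \triangleq \min_{0 \le \beta \le p_X(2)} \beta + \sqrt{ \mathbf{p}^T \boldsymbol{\lambda}_{k^*} - \lambda_{k^*} \|\mathbf{p}\|_2^2 + \left\| [\mathbf{p} - \beta]^+ \right\|_2^2 },$$

then for any $(X,Y) \sim q_{X,Y} \in \mathcal{R}(p_X, \boldsymbol{\lambda})$,

$$P_e(X|Y) \ge 1 - u(p_X, \boldsymbol{\lambda}). \tag{49}$$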

Proof. The proof of the theorem is presented in Appendix C.

Remark 8. If $\lambda_i = 1$ for all $1 \le i \le d$, (49) reduces to $P_e(X|Y) \ge 0$. Furthermore, if $\lambda_i = 0$ for all $1 \le i \le d$, (49) simplifies to $P_e(X|Y) \ge 1 - p_X(1)$.

We now present a few direct but, as we shall show in the next section, useful corollaries of the result in Theorem 6. We note that a bound with the same square-root order dependence on $\chi^2$-divergence as Eq. (50) below has appeared in the context of bounding the minmax decision risk in [76, Eq. (3.4)]. However, the proof technique used in [76] does not seem to lead to the general bound presented in Theorem 6.

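Corollary 3. If $X$ is uniformly distributed in $[m]$, then

$$P_e(X|Y) \ge 1 - \frac{1}{m} - \frac{\sqrt{(m-1)\chi^2(X;Y)}}{m}. \tag{50}$$

Furthermore, for $\rho_m(X;Y) = \sqrt{\lambda_1}$

$$P_e(X|Y) \ge 1 - \frac{1}{m} - \sqrt{\lambda_1}\left(1 - \frac{1}{m}\right) = 1 - \frac{1}{m} - \rho_m(X;Y)\left(1 - \frac{1}{m}\right).$$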
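The bound (49) can be evaluated numerically by a one-dimensional search over $\beta$. The following sketch makes several assumptions: the PIC vector has $d = m - 1$ entries, the minimization is done on a uniform grid rather than analytically, and $\chi^2(X;Y)$ is taken to be the sum of the PICs when comparing with (50). The function name and the example values are hypothetical.

```python
from math import sqrt


def fano_style_bound(p, lam, steps=100_000):
    """Evaluate 1 - u(p_X, lambda) from (49).

    p:   pmf of X sorted in decreasing order (length m).
    lam: upper bounds on the PICs sorted in decreasing order (length m - 1).
    """
    m = len(p)
    sq_norm = sum(v * v for v in p)
    # k* from (48), 1-indexed; a small tolerance guards against rounding.
    k_star = max(k for k in range(1, m + 1) if p[k - 1] >= sq_norm - 1e-12)
    lam_ext = list(lam) + [0.0]               # lambda_m := 0
    lam_k = lam_ext[k_star - 1]               # scalar lambda_{k*}
    # Vector (lam_1, ..., lam_k*, lam_k*, lam_{k*+1}, ..., lam_{m-1}) of length m;
    # for k* = m the slice keeps (lam_1, ..., lam_m).
    lam_vec = (lam_ext[:k_star] + [lam_k] + lam_ext[k_star:m - 1])[:m]
    base = sum(pi * li for pi, li in zip(p, lam_vec)) - lam_k * sq_norm
    u = min(
        beta + sqrt(max(base + sum(max(pi - beta, 0.0) ** 2 for pi in p), 0.0))
        for beta in (p[1] * i / steps for i in range(steps + 1))
    )
    return 1 - u


# Uniform X over m = 4 symbols with illustrative PIC bounds.
m, lam = 4, [0.5, 0.3, 0.1]
bound_uniform = fano_style_bound([1 / m] * m, lam)
corollary3 = 1 - 1 / m - sqrt((m - 1) * sum(lam)) / m    # right-hand side of (50)

# The two extreme cases of Remark 8 for a non-uniform pmf.
p = [0.5, 0.3, 0.2]
bound_independent = fano_style_bound(p, [0.0, 0.0])   # expected 1 - p_X(1)
bound_trivial = fano_style_bound(p, [1.0, 1.0])       # expected 0
```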

Corollary 4. For any pair of variables $(X,Y)$ with marginal distribution in $X$ equal to $p_X$ and maximal correlation (largest principal inertia) $\rho_m(X;Y)^2 = \lambda_1$, we have for all $\beta \ge 0$

$$P_e(X|Y) \ge 1 - \beta - \sqrt{ \lambda_1 \left(1 - \sum_{i=1}^{m} p_X(i)^2\right) + \sum_{i=1}^{m} \left([p_X(i) - \beta]^+\right)^2 }. \tag{51}$$

In particular, setting $\beta = p_X(2)$,

$$P_e(X|Y) \ge 1 - p_X(2) - \sqrt{ \lambda_1 \left(1 - \sum_{i=1}^{m} p_X(i)^2\right) + (p_X(1) - p_X(2))^2 } \tag{52}$$

$$\ge 1 - p_X(1) - \rho_m(X;Y)\sqrt{ 1 - \sum_{i=1}^{m} p_X(i)^2 }, \tag{53}$$

where (53) follows from (52) being decreasing in $p_X(2)$.

Remark 9. The bounds (51) and (53) are particularly helpful for showing how the error probability scales with the input distribution and the maximal correlation. For a given $p_{X,Y}$, recall that

$$\mathrm{Adv}(X|Y) \triangleq 1 - p_X(1) - P_e(X|Y),$$

defined in (6), is the advantage of correctly estimating $X$ from an observation of $Y$ over a random guess of $X$ when $Y$ is unknown. Then, from equation (53)

$$\mathrm{Adv}(X|Y) \le \rho_m(X;Y)\sqrt{ 1 - \sum_{i=1}^{m} p_X(i)^2 } \le \rho_m(X;Y) = \sqrt{\lambda_1(X;Y)}.$$

Therefore, the advantage of estimating $X$ from $Y$ decreases at least linearly with the maximal correlation between $X$ and $Y$.

We present next results on the extremal properties of the error-rate function. This analysis will be particularly useful for determining how to bound the probability of error of estimating functions of a random variable.


4.3 Extremal Properties of the Error-Rate Function and Bounding the Estimation Error of Functions of a Hidden Random Variable

Owing to convexity of $\mathcal{I}(p_X, p_{\hat{X}|X})$ in $p_{\hat{X}|X}$, it follows directly that $e_{\mathcal{I}}(p_X, \theta)$ is convex in $\theta$ for a fixed $p_X$. We will now prove that, for a fixed $\theta$, $e_{\mathcal{I}}(p_X, \theta)$ is Schur-concave in $p_X$ if $\mathcal{I}(p_X, p_{\hat{X}|X})$ is concave in $p_X$ for a fixed $p_{\hat{X}|X}$. Ahlswede [18, Theorem 2] proved this result for the particular case where $\mathcal{I}(X;Y) = I(X;Y)$ by investigating the properties of the explicit characterization of the rate-distortion function under Hamming distortion. The proof presented here is simpler and more general, and is based on a proof technique used by Ahlswede in [18, Theorem 1].

Theorem 7. If $\mathcal{I}(p_X, p_{\hat{X}|X})$ is concave in $p_X$ for a fixed $p_{\hat{X}|X}$, then $e_{\mathcal{I}}(p_X, \theta)$ is Schur-concave in $p_X$ for a fixed $\theta$.

Proof. The proof is presented in Appendix C.

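For a given integer $1 \le M \le |\mathcal{X}|$, we define

$$\mathcal{F}_M \triangleq \{ f : \mathcal{X} \to \mathcal{U} \mid f \text{ is surjective and } |\mathcal{U}| \ge M \} \tag{54}$$

and

$$P_{e,M}(X|Y) \triangleq \min_{f \in \mathcal{F}_M} P_e(f(X)|Y). \tag{55}$$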

$P_{e,|\mathcal{X}|}(X|Y)$ is simply the error probability of estimating $X$ from $Y$, i.e. $P_{e,|\mathcal{X}|}(X|Y) = P_e(X|Y)$. The surjectivity condition in the definition of $\mathcal{F}_M$ is mostly technical, and was added to (i) avoid the constant function being in $\mathcal{F}_M$ (which would render the estimation error trivial) and (ii) enable the use of Schur-concavity results to derive bounds on estimation error. Note that, in the discrete setting considered here, by varying $M$ we span the set of all functions of $X$, so there is no loss of generality. Nevertheless, there are practical settings where this condition naturally arises. In classification problems, for example, the surjectivity condition would correspond to the number of classes used to classify $X$.

The next theorem shows that a lower bound for $P_{e,M}$ can be derived for any dependence measure $\mathcal{I}$ as long as $e_{\mathcal{I}}(p_X, \theta)$ or a lower bound for $e_{\mathcal{I}}(p_X, \theta)$ is Schur-concave in $p_X$.

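Theorem 8. For a given $M \in [m]$ and $p_X$ with $\mathcal{X} = [m]$ and $p_X(1) \ge p_X(2) \ge \dots \ge p_X(m)$, let $U = g_M(X)$, where $g_M : \{1, \dots, m\} \to \{1, \dots, M\}$ is defined as

$$g_M(x) \triangleq \begin{cases} 1, & 1 \le x \le m - M + 1, \\ x - m + M, & m - M + 2 \le x \le m. \end{cases}$$

Let $p_U$ be the marginal distribution$^{5}$ of $U$. Assume that, for a given dependence measure $\mathcal{I}$, there exists a function $L_{\mathcal{I}}(\cdot,\cdot)$ such that for all distributions $q_X$ and any $\theta$, $e_{\mathcal{I}}(q_X, \theta) \ge L_{\mathcal{I}}(q_X, \theta)$. If $L_{\mathcal{I}}(p_X, \theta)$ is Schur-concave in $p_X$, then for $X \sim p_X$ and $\mathcal{I}(X;Y) \le \theta$,

$$P_{e,M}(X|Y) \ge L_{\mathcal{I}}(p_U, \theta). \tag{56}$$

In addition$^{6}$, for any $S \to X \to Y$ such that $p_U$ majorizes $p_S$,

$$P_e(S|Y) \ge L_{\mathcal{I}}(p_U, \theta). \tag{57}$$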

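Proof. The result follows from the following chain of inequalities:

$$P_{e,M}(X|Y) \overset{(a)}{\ge} \min_{f \in \mathcal{F}_M,\, \tilde{\theta}} \left\{ e_{\mathcal{I}}(p_{f(X)}, \tilde{\theta}) : \tilde{\theta} \le \theta \right\} \overset{(b)}{\ge} \min_{f \in \mathcal{F}_M} \left\{ e_{\mathcal{I}}(p_{f(X)}, \theta) \right\} \overset{(c)}{\ge} \min_{f \in \mathcal{F}_M} \left\{ L_{\mathcal{I}}(p_{f(X)}, \theta) \right\} \overset{(d)}{\ge} L_{\mathcal{I}}(p_U, \theta),$$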

$^{5}$ The pmf of $U$ is $p_U(1) = \sum_{i=1}^{m-M+1} p_X(i)$ and $p_U(k) = p_X(m - M + k)$ for $k = 2, \dots, M$.

$^{6}$ We thank Dr. Nadia Fawaz ([email protected]) for pointing out this extension.


where (a) follows from the DPI, (b) follows from $e_{\mathcal{I}}(q_X, \theta)$ being non-increasing in $\theta$, (c) follows from $e_{\mathcal{I}}(q_X, \theta) \ge L_{\mathcal{I}}(q_X, \theta)$ for all $q_X$ and $\theta$, and (d) follows from the Schur-concavity of the lower bound and by observing that $p_U$ majorizes $p_{f(X)}$ for every $f \in \mathcal{F}_M$. In the case of $P_e(S|Y)$, the same inequalities hold with $S$ playing the role of $f(X)$ in (a) and (b), and the last inequality also following from Schur-concavity of $L_{\mathcal{I}}(p_S, \theta)$ in $p_S$.

Remark 10. The function $g_M(X) = U$ in Theorem 8 is formed by merging the $m - M + 1$ most likely symbols of $X$ into a single symbol, and, consequently, $p_U$ majorizes any other distribution $p_{f(X)}$ for $f \in \mathcal{F}_M$. The function $g_M$ can thus be regarded as the "least uncertain" function of $X$ in $\mathcal{F}_M$ in the following sense: since Rényi entropy$^{7}$ is Schur-concave, $H_\alpha(g_M(X)) \le H_\alpha(f(X))$ for all $f \in \mathcal{F}_M$ and $\alpha \ge 0$.
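The majorization claim can be illustrated by enumeration on a small example. The sketch below assumes a made-up pmf with $m = 4$ and $M = 2$, and, for simplicity, only enumerates surjective functions onto exactly $M$ symbols; `merge_most_likely` and `majorizes` are hypothetical helpers.

```python
from itertools import product


def merge_most_likely(p, M):
    """pmf of U = g_M(X) for a pmf p sorted in decreasing order (footnote 5)."""
    m = len(p)
    return [sum(p[: m - M + 1])] + list(p[m - M + 1:])


def majorizes(u, v, tol=1e-12):
    """True if u majorizes v; the shorter vector is padded with zeros."""
    size = max(len(u), len(v))
    us = sorted(u, reverse=True) + [0.0] * (size - len(u))
    vs = sorted(v, reverse=True) + [0.0] * (size - len(v))
    cum_u = cum_v = 0.0
    for a, b in zip(us, vs):
        cum_u += a
        cum_v += b
        if cum_u < cum_v - tol:
            return False
    return True


p_x = [0.4, 0.3, 0.2, 0.1]   # made-up pmf, sorted in decreasing order
M = 2
p_u = merge_most_likely(p_x, M)

# Check every surjective labelling f : [m] -> [M].
all_majorized = True
for labels in product(range(M), repeat=len(p_x)):
    if len(set(labels)) < M:
        continue                      # skip functions that are not surjective
    p_f = [sum(p for p, l in zip(p_x, labels) if l == k) for k in range(M)]
    all_majorized = all_majorized and majorizes(p_u, p_f)
```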

The following results illustrate how Theorem 8 can be used for mutual information and maximal correlation.

Corollary 5. Let $I(X;Y) \le \theta$. Then

$$P_{e,M}(X|Y) \ge d^*,$$

where $d^*$ is the solution of

$$h_b(d^*) + d^* \log(m-1) = \max\{H(U) - \theta, 0\},$$

and $h_b(\cdot)$ is the binary entropy function.

Proof. Let RI(pX , δ) , minpX|X

{I(X; X)|E[dH(X, X)] ≤ δ} be the well known rate-distortion function

under Hamming distortion. ThenRI(pX , δ) satisfies ([75, (9.5.8)]) RI(pX , δ) ≥ H(X)−hb(d∗)−d∗ log(m−

1). The result follows from Theorem 7, since mutual information is concave in pX .

Corollary 6. Let $\mathcal{J}_1(X;Y) = \rho_m(X;Y) \le \theta$. Then

$$P_{e,M}(X|Y) \ge 1 - p_U(1) - \theta \sqrt{1 - \sum_{i=1}^{M} p_U(i)^2},$$

where $P_{e,M}(X|Y)$ is defined in (55) and $U$ is defined as in Theorem 8.

Proof. The proof follows directly from Theorems 2, 3 and Corollary 4, by noting that (53) is Schur-concave in $p_X$.
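As a quick numerical illustration of Corollary 6, the following sketch evaluates the right-hand side of the bound for an illustrative $p_U$ and a few values of $\theta$. The function name and the numbers are assumptions made for this example; negative values are clipped to zero since an error probability is non-negative.

```python
import math


def corollary6_bound(p_U, theta):
    """Right-hand side of Corollary 6: 1 - p_U(1) - theta * sqrt(1 - sum_i p_U(i)^2)."""
    collision = sum(p * p for p in p_U)
    return 1.0 - p_U[0] - theta * math.sqrt(1.0 - collision)


p_U = [0.7, 0.2, 0.1]  # illustrative pmf of U = g_M(X); p_U[0] is the merged mass
for theta in (0.0, 0.1, 0.3):
    print(theta, max(corollary6_bound(p_U, theta), 0.0))
```

At $\theta = 0$ the bound reduces to $1 - p_U(1)$, the error of always guessing the most likely value of $U$, and it decreases linearly as the allowed maximal correlation grows.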

The previous result leads to the next theorem, which states that the advantage over a random guess in estimating any function of a hidden variable $X$ from an observation $Y$ is upper bounded by the maximal correlation of $X$ and $Y$.

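Theorem 9. Let $p_X$ be fixed, $|\mathcal{X}| < \infty$ and $\mathcal{F}_M$ be given in (54). Define (cf. (6))

$$\mathrm{Adv}_M(X|Y) \triangleq \max\left\{ 1 - \max_{k \in [M]} p_{f(X)}(k) - P_e(f(X)|Y) \;\middle|\; f \in \mathcal{F}_M \right\}.$$

Then

$$\mathrm{Adv}_M(X|Y) \le \rho_m(X;Y)\sqrt{1 - \frac{1}{M}} \le \rho_m(X;Y). \tag{58}$$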

Proof. For $f \in \mathcal{F}_M$,

$$\begin{aligned}
\mathrm{Adv}(f(X)|Y) &\le \rho_m(f(X);Y)\sqrt{1 - \sum_{i \in [M]} p_{f(X)}(i)^2} \\
&\le \rho_m(X;Y)\sqrt{1 - \sum_{i \in [M]} p_{f(X)}(i)^2} \\
&\le \rho_m(X;Y)\sqrt{1 - \frac{1}{M}},
\end{aligned}$$

where the first inequality follows from (53) and the definition (6), the second inequality follows by combining Theorem 3 (DPI for the PICs) and the fact that $\lambda_1(f(X);X) \le 1$, which leads to $\rho_m(f(X);Y) \le \rho_m(X;Y)$, and the last inequality follows from the fact that $\sum_{i \in [M]} p_{f(X)}(i)^2$ is minimized when $p_{f(X)}$ is uniform. The result follows by maximizing over all $f \in \mathcal{F}_M$.
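The bound (58) can be sanity-checked on a binary example. The sketch below assumes a made-up joint pmf and hypothetical helper names. It relies on the standard fact that the PICs sum to the chi-squared divergence between $p_{X,Y}$ and $p_X p_Y$, so when $X$ is binary there is a single PIC and $\rho_m^2$ equals that sum. With $M = 2$ and $X$ binary, the only functions with a two-symbol range are relabelings of $X$, so the advantage is computed for $X$ itself.

```python
import math


def marginals(P):
    """Row and column sums of a joint pmf given as a list of rows P[x][y]."""
    p_x = [sum(row) for row in P]
    p_y = [sum(col) for col in zip(*P)]
    return p_x, p_y


def rho_m_binary(P):
    """Maximal correlation when min(|X|, |Y|) = 2 (single PIC = chi-squared sum)."""
    p_x, p_y = marginals(P)
    assert min(len(p_x), len(p_y)) == 2
    chi2 = sum(
        P[i][j] ** 2 / (p_x[i] * p_y[j])
        for i in range(len(p_x))
        for j in range(len(p_y))
    ) - 1.0
    return math.sqrt(max(chi2, 0.0))


def advantage(P):
    """1 - max_x p_X(x) - P_e(X|Y), with P_e the MAP error probability."""
    p_x, _ = marginals(P)
    p_correct = sum(max(col) for col in zip(*P))
    return p_correct - max(p_x)


P = [[0.35, 0.15],  # illustrative joint pmf, P[x][y]
     [0.10, 0.40]]
M = 2
adv = advantage(P)
bound = rho_m_binary(P) * math.sqrt(1.0 - 1.0 / M)
print(adv, bound)
```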

⁷ The Rényi entropy of a discrete random variable $X$ is given by $H_\alpha(X) \triangleq \frac{1}{1-\alpha} \log\left( \sum_{x \in \mathcal{X}} p_X(x)^\alpha \right)$.


The results presented in this section demonstrate that the PICs are a useful information measure that can shed light on fundamental limits of estimation. In particular, Theorem 9 connects the largest PIC, namely the maximal correlation, with the probability of correctly guessing any function of a hidden, discrete random variable. The PICs also provide a characterization of the functions of a hidden variable that can (or cannot) be estimated with small mean-squared error (Theorem 1). In the next section, we explore applications of the PICs to privacy and security.

5 Applications of the PICs to Security and Privacy

In this section, we present a few applications of the principal inertia components to problems in security and privacy. We adopt the privacy against statistical inference framework presented in [19]. This setup, called the Privacy Funnel, was introduced in [63]. Consider two communicating parties, namely Alice and Bob. Alice's goal is to disclose to Bob information about a set of measurement points, represented by the random variable $X$. Alice discloses this information in order to receive some utility from Bob. Simultaneously, Alice wishes to limit the amount of information revealed about a private random variable $S$ that is dependent on $X$. For example, $X$ may represent Alice's movie ratings, released to Bob in order to receive movie recommendations, whereas $S$ may represent Alice's political preference or yearly income. Bob will try to extract the maximum amount of information about $S$ from the data disclosed by Alice.

Instead of revealing $X$ directly to Bob, Alice releases a new random variable, denoted by $Y$. This random variable is produced from $X$ through a random mapping $p_{Y|X}$, called the privacy-assuring mapping. We assume that $p_{S,X}$ is fixed and known by both Alice and Bob, and $S \to X \to Y$. Alice's goal is to find a mapping $p_{Y|X}$ that minimizes $I(S;Y)$, while guaranteeing that the information disclosed about $X$ is above a certain threshold $t$, i.e. $I(X;Y) \ge t$. We refer to the quantity $I(S;Y)$ as the disclosed private information, and $I(X;Y)$ as the disclosed useful information. As discussed in Section 1.1, when $I(S;Y) = 0$, we say that perfect privacy is achieved, i.e. $Y$ does not reveal any information about $S$. We consider here the non-interactive, one-shot regime, where Alice discloses information once, and no additional information is released. We also assume that Bob knows the privacy-assuring mapping $p_{Y|X}$ chosen by Alice, and no side information is available to Bob about $S$ besides $Y$.

5.1 The Privacy Funnel

We define next the privacy funnel function, which captures the smallest amount of disclosed private information for a given threshold on the amount of disclosed useful information. We then characterize properties of the privacy funnel function in the rest of this section.

Definition 12. For $0 \le t \le H(X)$ and a joint distribution $p_{S,X}$ over $\mathcal{S} \times \mathcal{X}$, we define the privacy funnel function $G_I(t, p_{S,X})$ as

$$G_I(t, p_{S,X}) \triangleq \inf \left\{ I(S;Y) \mid I(X;Y) \ge t,\ S \to X \to Y \right\}, \tag{59}$$

where the infimum is over all mappings $p_{Y|X}$ such that $\mathcal{Y}$ is finite. For a fixed $p_{S,X}$ and $t \ge 0$, the set of pairs $\{(t, G_I(t, p_{S,X}))\}$ is called the privacy region of $p_{S,X}$.

We now enunciate a few useful properties of GI(t, pS,X) and the privacy region.

Lemma 10.

$$G_I(t, p_{S,X}) = \min_{p_{Y|X}} \left\{ I(S;Y) \mid I(X;Y) \ge t,\ S \to X \to Y,\ |\mathcal{Y}| \le |\mathcal{X}|+1 \right\}. \tag{60}$$

In addition, for a fixed $p_{S,X}$, the mapping $t \mapsto \frac{G_I(t, p_{S,X})}{t}$ is non-decreasing, and $G_I(t, p_{S,X})$ is convex in $t$.

Proof. The proof is in Appendix D.

Lemma 11. For $0 \le t \le H(X)$,

$$\max\{t - H(X|S),\, 0\} \le G_I(t, p_{S,X}) \le \frac{t\, I(X;S)}{H(X)}. \tag{61}$$


Figure 3 (horizontal axis: $t$, with $H(X|S)$ and $H(X)$ marked; vertical axis: $G_I(t, p_{S,X})$, with $I(S;X)$ marked): For a fixed $p_{S,X}$, the privacy region is contained within the shaded area. The red and the blue lines correspond, respectively, to the upper and lower bounds presented in Lemma 11.

Proof. Observe that $G_I(H(X), p_{S,X}) = I(X;S)$, since $I(X;Y) = H(X)$ implies that $p_{Y|X}$ is a one-to-one mapping of $X$. The upper bound then follows directly from (99).

Clearly $G_I(t, p_{S,X}) \ge 0$. In addition, for any $p_{Y|X}$,

$$\begin{aligned}
I(S;Y) &= I(X;Y) - I(X;Y|S) \\
&\ge I(X;Y) - H(X|S) \\
&\ge t - H(X|S),
\end{aligned}$$

proving the lower bound.
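The two bounds in (61) are easy to evaluate for a concrete joint distribution. The following is a minimal sketch with a made-up $p_{S,X}$ and hypothetical helper names; entropies are in bits.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a pmf given as a flat list."""
    return -sum(q * math.log2(q) for q in p if q > 0)


# Illustrative joint pmf: P_SX[s][x] = P(S = s, X = x)
P_SX = [[0.3, 0.1, 0.1],
        [0.1, 0.1, 0.3]]

p_S = [sum(row) for row in P_SX]
p_X = [sum(col) for col in zip(*P_SX)]
H_S, H_X = entropy(p_S), entropy(p_X)
H_SX = entropy([v for row in P_SX for v in row])
H_X_given_S = H_SX - H_S
I_SX = H_S + H_X - H_SX


def lower(t):
    """Lower bound in (61)."""
    return max(t - H_X_given_S, 0.0)


def upper(t):
    """Upper bound in (61)."""
    return t * I_SX / H_X


for k in range(0, 5):
    t = H_X * k / 4
    print(round(t, 4), round(lower(t), 4), round(upper(t), 4))
```

Both bounds meet at $t = H(X)$, where they equal $I(X;S)$, matching the first observation in the proof.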

Figure 3 illustrates the bounds from Lemma 11. The privacy region is contained within the shaded area. The next two examples illustrate that both the upper bound (red line) and the lower bound (blue line) of the privacy region can be achieved for particular instances of $p_{S,X}$.

Example 4. Let $X = (S,W)$, where $W \perp\!\!\!\perp S$. Then by setting $Y = W$, we have $I(S;Y) = 0$ and $I(X;Y) = H(W) = H(X|S)$. Consequently, from Lemmas 10 and 11, $G_I(t, p_{S,X}) = 0$ for $t \in [0, H(X|S)]$. By letting $Y = W$ w.p. $\lambda$ and $Y = (S,W)$ w.p. $1-\lambda$ for $\lambda \in [0,1]$, the lower bound $G_I(t, p_{S,X}) = t - H(X|S)$ can be achieved for $H(X|S) = H(W) \le t \le H(X)$. Consequently, the lower bound in (61) is sharp.

Example 5. Now let X = f(S). Then I(X;S) = H(X) and

I(S;Y ) = I(X;Y )− I(X;Y |S) = I(X;Y ).

Consequently, GI(t, pS,X) = t, and the upper bound in (61) is sharp.

5.2 The Optimal Privacy-Utility Coefficient and the Smallest PIC

We now study the smallest possible ratio between disclosed private and useful information, defined next.

Definition 13. The optimal privacy-utility coefficient for a given distribution $p_{S,X}$ is given by

$$v^*(p_{S,X}) \triangleq \inf_{p_{Y|X}} \frac{I(S;Y)}{I(X;Y)}. \tag{62}$$

It follows directly from Lemma 10 that

$$v^*(p_{S,X}) = \lim_{t \to 0} \frac{G_I(t, p_{S,X})}{t}. \tag{63}$$

We show in Section 5.3 that the value of $v^*(p_{S,X})$ is related to the smallest PIC of $p_{S,X}$ (i.e. the smallest eigenvalue of the spectrum of the conditional expectation operator, defined below). We also prove that $v^*(p_{S,X}) = 0$ is a necessary and sufficient condition for achieving perfect privacy while disclosing a non-trivial amount of useful information. Before introducing these results, we present an alternative characterization of $v^*(p_{S,X})$ (Lemma 12), and introduce a measure based on the smallest PIC (Definition 14) and an auxiliary result (Lemma 13).


Remark 11. The proofs of Lemma 12 and Lemma 14 in this section are closely related to [23]. We acknowledge that their proof techniques inspired some of the results presented here.

The next result provides a characterization of the optimal privacy-utility coefficient.

Lemma 12. Let $q_S$ denote the distribution of $S$ when $p_{S|X}$ is fixed and $X \sim q_X$. Then

$$v^*(p_{S,X}) = \inf_{q_X \ne p_X} \frac{D(q_S \| p_S)}{D(q_X \| p_X)}. \tag{64}$$

Proof. The proof is in Appendix D.
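Since (64) is an infimum over $q_X$, evaluating the divergence ratio at any particular $q_X \ne p_X$ gives an upper bound on $v^*(p_{S,X})$. The sketch below does this for a made-up $2 \times 3$ joint pmf; the helper names and the two trial distributions are assumptions of this example. The second trial perturbs $p_X$ along a direction $f$ with $\sum_x p_{S,X}(s,x) f(x) = 0$ for every $s$, which leaves $q_S = p_S$ and drives the ratio to zero.

```python
import math


def kl(q, p):
    """KL divergence D(q || p) in bits; assumes p(x) > 0 wherever q(x) > 0."""
    return sum(a * math.log2(a / b) for a, b in zip(q, p) if a > 0)


# Illustrative joint pmf: P_SX[s][x] = P(S = s, X = x)
P_SX = [[0.2, 0.1, 0.1],
        [0.1, 0.2, 0.3]]
p_S = [sum(row) for row in P_SX]
p_X = [sum(col) for col in zip(*P_SX)]


def ratio(q_X):
    """D(q_S || p_S) / D(q_X || p_X) with p_{S|X} held fixed."""
    q_S = [
        sum(P_SX[s][x] / p_X[x] * q_X[x] for x in range(len(p_X)))
        for s in range(len(p_S))
    ]
    return kl(q_S, p_S) / kl(q_X, p_X)


generic = [0.5, 0.3, 0.2]
# f = (1, -5, 3) is orthogonal to both rows of P_SX (hand-picked for this example).
f = [1.0, -5.0, 3.0]
hidden = [p_X[x] * (1.0 + 0.1 * f[x]) for x in range(3)]

print(ratio(generic), ratio(hidden))
```

The first ratio lies strictly between 0 and 1, while the second is zero up to rounding, which is consistent with $v^*(p_{S,X}) = 0$ whenever $|\mathcal{X}| > |\mathcal{S}|$.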

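The smallest PIC is of particular interest for privacy, and upper bounds the value of $v^*(p_{S,X})$. In particular, we will be interested in the coefficient $\delta(p_{S,X})$, defined below.

Definition 14. Let $d \triangleq \min\{|\mathcal{S}|, |\mathcal{X}|\} - 1$, and $\lambda_d(S;X)$ the smallest PIC of $p_{S,X}$. We define

$$\delta(p_{S,X}) \triangleq \begin{cases} \lambda_d(S;X) & \text{if } |\mathcal{X}| \le |\mathcal{S}|, \\ 0 & \text{otherwise.} \end{cases} \tag{65}$$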

The following lemma provides a useful characterization of $\delta(p_{S,X})$, related to the interpretation of the PICs as the spectrum of the conditional expectation operator given in Theorem 1. This result is a direct consequence of Theorem 1, and we present a self-contained proof in Appendix D.

Lemma 13. For a given $p_{S,X}$,

$$\delta(p_{S,X}) = \min\left\{ \|E[f(X)|S]\|_2^2 \;\middle|\; f : \mathcal{X} \to \mathbb{R},\ E[f(X)] = 0,\ \|f(X)\|_2 = 1 \right\}. \tag{66}$$

5.3 Information Disclosure with Perfect Privacy

If $v^*(p_{S,X}) = 0$, then it may be possible to disclose some information about $X$ without revealing any information about $S$. However, since $G_I(0, p_{X,S}) = 0$, it is not immediately clear that $v^*(p_{S,X}) = 0$ implies that there exists $t$ strictly bounded away from 0 such that $G_I(t, p_{X,S}) = 0$. This would represent the ideal privacy setting, since, from Lemma 10, there would exist a privacy-assuring mapping that allows the disclosure of some non-negligible amount of useful information while guaranteeing $I(S;Y) = 0$. This, in turn, would mean that perfect privacy is achievable with non-negligible utility regardless of the specific privacy metric used, since $S$ and $Y$ would be independent.

In this section, we prove that if the optimal privacy-utility coefficient is 0, then there indeed exists a privacy-assuring mapping that allows the disclosure of a non-trivial amount of useful information while guaranteeing perfect privacy. We also show that the value of $\delta(p_{S,X})$ is closely related to $v^*(p_{S,X})$. This relationship is analogous to the one between the hypercontractivity coefficient $s^*$, defined in [22] and [77], and the maximal correlation $\rho_m$. In particular, as shown in the next two lemmas, $v^*(p_{S,X}) \le \delta(p_{S,X})$ and $v^*(p_{S,X}) = 0 \iff \delta(p_{S,X}) = 0$.

Lemma 14. For any $p_{S,X}$ with finite support $\mathcal{S} \times \mathcal{X}$,

$$v^*(p_{S,X}) \le \delta(p_{S,X}), \tag{67}$$

and

$$\inf_{p_X} v^*(p_{S,X}) = \inf_{p_X} \delta(p_{S,X}). \tag{68}$$

Proof. The proof is in Appendix D.

The next lemma proves that $\delta(p_{S,X})$ can serve as a proxy for perfect privacy, since the optimal privacy-utility coefficient is 0 if and only if $\delta(p_{S,X})$ is also 0.

Lemma 15. Let $p_{S,X}$ be such that $H(X) > 0$ and $\mathcal{S}$ and $\mathcal{X}$ are finite. Then⁸

$$v^*(p_{S,X}) = 0 \iff \delta(p_{S,X}) = 0. \tag{69}$$

Proof. The proof can be found in Appendix D.

⁸ If $S$ is binary, then (69) implies that perfect privacy is achievable iff $S$ and $X$ are independent (since $\delta(p_{S,X}) = \rho_m(S;X)^2$), recovering [61, Thm. 2].


We are now ready to prove that a non-trivial amount of useful information can be disclosed without revealing any private information if and only if $v^*(p_{S,X}) = 0$ (or equivalently, $\delta(p_{S,X}) = 0$). This result follows naturally from Lemma 15, since $v^*(p_{S,X}) = 0$ implies that $\delta(p_{S,X}) = 0$, which means that the matrix $Q$ and, consequently, $P_{S|X}$, is either not full rank or has more columns than rows (i.e. $|\mathcal{X}| > |\mathcal{S}|$). This, in turn, can be exploited in order to find a mapping $p_{Y|X}$ such that $Y$ reveals some information about $X$, but no information about $S$. This argument is made precise in the next theorem.

Remark 12. When $P_{S|X}$ is not full rank or has more columns than rows, then $S$ and $X$ are weakly independent. As shown in [78, Thm. 4] and [61], this implies that a privacy-assuring mapping that achieves perfect privacy while disclosing a non-trivial amount of useful information can be found. Theorem 10 recovers this result in terms of the smallest PIC, and Corollary 8 provides an estimate of the amount of useful information that can be revealed with perfect privacy.

Theorem 10. For a given $p_{S,X}$, there exists a privacy-assuring mapping $p_{Y|X}$ such that $S \to X \to Y$, $I(X;Y) > 0$ and $I(S;Y) = 0$ if and only if $\delta(p_{S,X}) = 0$ (equivalently $v^*(p_{S,X}) = 0$). In particular,

$$\exists\, t_0 > 0 : G_I(t_0, p_{S,X}) = 0 \iff \delta(p_{S,X}) = 0. \tag{70}$$

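Proof. The direct part of the theorem follows directly from the definition of $v^*(p_{S,X})$ and Lemma 15. For the converse, assume that $\delta(p_{S,X}) = 0$. Then, from Lemma 13, there exists $f : \mathcal{X} \to \mathbb{R}$ such that $\|f(X)\|_2 = 1$, $E[f(X)] = 0$, and $\|E[f(X)|S]\|_2 = 0$. Consequently, $E[f(X)|S = s] = 0$ for all $s \in \mathcal{S}$.

Fix $\mathcal{Y} = [2]$, and, for $\epsilon > 0$ and $\epsilon$ appropriately small,

$$p_{Y|X}(y|x) = \begin{cases} \frac{1}{2} - \epsilon f(x), & y = 1, \\ \frac{1}{2} + \epsilon f(x), & y = 2. \end{cases}$$

Note that it is sufficient to choose $\epsilon = (2\max_{x \in \mathcal{X}} |f(x)|)^{-1}$, so $\epsilon$ is strictly bounded away from 0. In addition, $p_Y(1) = 1/2$. Therefore,

$$I(X;Y) = 1 - \sum_{x \in \mathcal{X}} p_X(x)\, h_b\!\left(\frac{1}{2} + \epsilon f(x)\right) > 0. \tag{71}$$

Since $S \to X \to Y$,

$$\begin{aligned}
p_{Y|S}(y|s) &= \sum_{x \in \mathcal{X}} p_{Y|X}(y|x)\, p_{X|S}(x|s) \\
&= \sum_{x \in \mathcal{X}} \left(\frac{1}{2} + (-1)^y \epsilon f(x)\right) p_{X|S}(x|s) \\
&= 1/2 + (-1)^y \epsilon\, E[f(X)|S = s] \\
&= 1/2,
\end{aligned}$$

and, consequently, $S$ and $Y$ are independent. Then $I(S;Y) = 0$, and the result follows.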
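The construction in the proof can be reproduced numerically. The sketch below uses a made-up $2 \times 3$ joint pmf, so that $|\mathcal{X}| > |\mathcal{S}|$ and $\delta(p_{S,X}) = 0$, and picks $f$ as a vector orthogonal to both rows of the joint matrix via a cross product. That choice, the helper names and the numbers are assumptions of this example, not part of the paper.

```python
import math


def mutual_information(P):
    """I(A;B) in bits for a joint pmf given as a list of rows P[a][b]."""
    p_a = [sum(row) for row in P]
    p_b = [sum(col) for col in zip(*P)]
    return sum(
        v * math.log2(v / (p_a[i] * p_b[j]))
        for i, row in enumerate(P)
        for j, v in enumerate(row)
        if v > 0
    )


# Illustrative joint pmf: P_SX[s][x] = P(S = s, X = x)
P_SX = [[0.2, 0.1, 0.1],
        [0.1, 0.2, 0.3]]
p_X = [sum(col) for col in zip(*P_SX)]

# f orthogonal to both rows, so E[f(X)|S = s] = 0 for each s (and E[f(X)] = 0).
r1, r2 = P_SX
f = [r1[1] * r2[2] - r1[2] * r2[1],
     r1[2] * r2[0] - r1[0] * r2[2],
     r1[0] * r2[1] - r1[1] * r2[0]]
norm = math.sqrt(sum(p * v * v for p, v in zip(p_X, f)))
f = [v / norm for v in f]  # normalize so that ||f(X)||_2 = 1

# Privacy-assuring mapping from the proof, with eps = 1 / (2 max |f(x)|).
f_max = max(abs(v) for v in f)
P_Y_given_X = [[0.5 - v / (2 * f_max), 0.5 + v / (2 * f_max)] for v in f]

P_XY = [[p_X[x] * P_Y_given_X[x][y] for y in range(2)] for x in range(3)]
P_SY = [[sum(P_SX[s][x] * P_Y_given_X[x][y] for x in range(3)) for y in range(2)]
        for s in range(2)]

I_XY = mutual_information(P_XY)
I_SY = mutual_information(P_SY)
print(I_XY, I_SY)
```

On this example $I(S;Y)$ is zero up to floating-point error while $I(X;Y)$ is roughly 0.42 bits, in line with (71).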

The previous result proves that if either $|\mathcal{X}| > |\mathcal{S}|$ or the smallest principal inertia component of $p_{S,X}$ is 0 (i.e. $\delta(p_{S,X}) = 0$), then it is possible to achieve perfect privacy while disclosing some useful information. In particular, the value of $t_0$ in (100) is lower-bounded by the expression in (71). We note that this result would not necessarily hold if $\mathcal{S}$ and $\mathcal{X}$ are not finite sets.

Since $I(S;Y) = 0$ implies that $S$ and $Y$ are independent, Theorem 10 holds not only for mutual information, but also for any dependence measure $\mathcal{I}$, defined in Definition 8, that satisfies $\mathcal{I}(X;Y) = 0$ if and only if $X$ and $Y$ are independent. This leads to the following result.

Corollary 7. Let $p_{S,X}$ be given, and $\mathcal{I}$ be a non-negative dependence measure (e.g. total variation or maximal correlation, cf. Definition 8) such that for any two random variables $A$ and $B$, $\mathcal{I}(A;B) = 0 \iff A \perp\!\!\!\perp B$. Then there exists $p_{Y|X}$ such that $S \to X \to Y$, $\mathcal{I}(X;Y) > 0$ and $\mathcal{I}(S;Y) = 0$ if and only if $\delta(p_{S,X}) = 0$.

Proof. This is a direct consequence of Theorem 10, since $\mathcal{I}(X;Y) > 0 \iff I(X;Y) > 0$ and $\mathcal{I}(S;Y) = 0 \iff I(S;Y) = 0$.

Remark 13. As long as privacy is measured in terms of statistical dependence (with perfect privacy implying statistical independence) and some utility can be derived when $Y$ is not independent of $X$, then $\delta(p_{S,X})$ fully characterizes when perfect privacy is achievable with non-trivial utility.


We present next an explicit lower bound for the largest amount of useful information that can be disclosed while guaranteeing perfect privacy. The result follows directly from the construction used in the proof of Theorem 10, and is presented in Appendix D.

Corollary 8. For fixed $p_{S,X}$, let

$$\mathcal{F}_0 \triangleq \left\{ f : \mathcal{X} \to \mathbb{R} \mid E[f(X)] = 0,\ \|f(X)\|_2 = 1,\ \|E[f(X)|S]\|_2 = 0 \right\} \cup \{w_0\},$$

where $w_0$ is the trivial function that maps $\mathcal{X}$ to $\{0\}$. Then $G_I(t, p_{S,X}) = 0$ for $t \in [0, t^*]$, where

$$t^* \ge 1 - \max_{f \in \mathcal{F}_0} E\left[ h_b\!\left( \frac{1}{2} + \frac{f(X)}{2\|f\|_\infty} \right) \right]. \tag{72}$$

Furthermore, the lower bound for $t^*$ is sharp when $\delta(p_{S,X}) = 0$, i.e. there exists a $p_{S,X}$ such that $t^* > 0$ and $G_I(t, p_{S,X}) = 0$ if and only if $t \in [0, t^*]$.

The previous bound for $t^*$ can be loose, especially if $|\mathcal{X}|$ is large. In addition, the right-hand side of (72) can be made arbitrarily small by decreasing $\min_{x \in \mathcal{X}} p_X(x)$. Nevertheless, (72) is an explicit bound on the amount of useful information that can be disclosed with perfect privacy.

When $S^n = (S_1,\dots,S_n)$ and $X^n = (X_1,\dots,X_n)$, where $(S_i, X_i) \sim p_{S,X}$ are i.i.d. random variables, the next proposition states that $\delta(p_{S^n,X^n}) = \delta(p_{S,X})^n$. Consequently, as long as $\delta(p_{S,X}) < 1$, it is possible to disclose a non-trivial amount of useful information while disclosing an arbitrarily small amount of private information by making $n$ sufficiently large. Loosely speaking, this is similar to hiding a needle in a haystack: as the number of available samples of $S$ and $X$ increases, we can use the additional randomness to better hide the private variables $S_i$.

Proposition 1. Let $S^n = (S_1,\dots,S_n)$ and $X^n = (X_1,\dots,X_n)$, where $(S_i, X_i) \sim p_{S,X}$ are i.i.d. random variables. Then

$$v^*(p_{S^n,X^n}) \le \delta(p_{S^n,X^n}) = \delta(p_{S,X})^n. \tag{73}$$

Proof. The result is a direct consequence of the tensorization property of the principal inertia components, presented in Lemma 1.

6 Final Remarks

The PICs are powerful information-theoretic metrics that provide both (i) a measure of dependence between two random variables $X$ and $Y$, and (ii) a complete characterization of which functions of $X$ can be reliably estimated (in terms of mean-squared error) given an observation of $Y$. As shown here, the PICs can be used for deriving bounds on one-bit functions of a channel input given a channel output. Furthermore, in privacy applications, we proved that perfect privacy can be achieved if and only if the smallest PIC is zero. The PICs were also used to derive bounds on estimation error probability. In particular, we demonstrated that the largest PIC (equivalently, the maximal correlation $\rho_m(X;Y)$) plays a key role in estimation:

$$\mathrm{Adv}(f(X)|Y) \le \rho_m(X;Y),$$

i.e. the advantage over a random guess of estimating any function of $X$ given $Y$ is at most $\rho_m(X;Y)$.

Information-theoretic security and privacy applications provide fertile ground for the use of PICs, especially when privacy is measured in terms of how well an adversary can estimate a secret (private) variable. The principal functions (cf. Definition 1) provide a basis for the finite-variance functions of a random variable, and the PICs measure the MMSE of estimating each of these functions. Consequently, the PICs provide a characterization of which functions of $X$ can or cannot be inferred reliably (in terms of MMSE) from an observation of $Y$. This property can be used in privacy applications: for example, in order to quantify how well an adversary can estimate a private function $S = f(X)$ given a disclosed variable $Y$, it is sufficient to express $f(X)$ in terms of the principal functions of $p_{X,Y}$. The adversary's ability to correctly estimate $f(X)$ is then entirely determined by the PICs of $p_{X,Y}$.

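More precisely, for $f : \mathcal{X} \to \mathbb{R}$, the minimum mean-squared error of estimating $f(X)$ given $Y$ can be expressed as

$$\begin{aligned}
\mathrm{mmse}(f(X)|Y) &= E\left[ f(X)^2 - E[f(X)|Y]^2 \right] \\
&= \|f(X)\|_2^2 \left( 1 - \frac{\|E[f(X)|Y]\|_2^2}{\|f(X)\|_2^2} \right). 
\end{aligned} \tag{74}$$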

Consequently, the MMSE depends on the spectrum of the conditional expectation operator $(T_Y f)(y) \triangleq E[f(X)|Y = y]$ which, in turn, is composed of the principal inertia components (cf. Theorem 1). When $E[f(X)] = 0$ and $E[f(X)^2] = 1$, one can determine functions $f_1, f_2, \dots$ as in Theorem 1, such that $f_i$ is given by

$$f_i = \arg\max\left\{ \|E[f(X)|Y]\|_2^2 \;\middle|\; f : \mathcal{X} \to \mathbb{R},\ E[f(X)] = 0,\ E[f(X)^2] = 1,\ E[f(X) f_j(X)] = 0 \text{ for } 1 \le j \le i-1 \right\}.$$

Then

$$\|E[f_i(X)|Y]\|_2^2 = \lambda_i(X;Y).$$

It follows directly that, for any non-trivial function $f : \mathcal{X} \to \mathbb{R}$ with $E[f(X)] = 0$,

$$\mathrm{mmse}(f(X)|Y) \ge \|f(X)\|_2^2 \left( 1 - \rho_m(X;Y)^2 \right), \tag{75}$$

with equality if $f(X) = c f_1(X)$, where $c = \|f(X)\|_2$. Therefore, for a fixed norm $c$, $c f_1(X)$ is the function of $X$ that can be most reliably estimated (in terms of mean-squared error) from all possible mappings $\mathcal{X} \to \mathbb{R}$. Furthermore,

$$\mathrm{mmse}(f(X)|Y) = \|f(X)\|_2^2 \left( 1 - \sum_i c_i^2 \lambda_i(X;Y) \right), \tag{76}$$

where $c_i \triangleq E[f(X) f_i(X)] / \|f(X)\|_2$ and $\sum_i c_i^2 = 1$. Consequently, functions that are closely "aligned" with principal functions $f_i$ whose PICs $\lambda_i(X;Y)$ are small cannot be inferred with small mean-squared error.

In privacy applications with estimation constraints, this result sheds light on the nature of the fundamental tradeoff between privacy and utility. If $X$ and $Y$ correspond, respectively, to the input and output of a privacy-assuring mapping, then the PICs and corresponding principal functions of $p_{X,Y}$ will determine which functions (features) of $X$ remain private. If, for example, the principal functions corresponding to small PICs also span functions of $X$ that should be reliably estimated from $Y$ for utility purposes, then the privacy-assuring mapping $p_{Y|X}$ will provide an unfavorable tradeoff between privacy and utility.
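The relation between the MMSE and the PICs in (74) and (75) can be checked by direct computation when $X$ is binary. In that case every zero-mean function of $X$ is a multiple of the single principal function $f_1$, so (75) should hold with equality. The sketch below uses a made-up joint pmf and, as an assumption stated here rather than taken from the text, computes $\rho_m^2$ as the chi-squared divergence between $p_{X,Y}$ and $p_X p_Y$ (the sum of the PICs, which has a single term when $X$ is binary).

```python
# Illustrative joint pmf: P[x][y] = P(X = x, Y = y)
P = [[0.30, 0.10],
     [0.15, 0.45]]
p_x = [sum(row) for row in P]
p_y = [sum(col) for col in zip(*P)]

# A zero-mean function of binary X: E[f(X)] = p0*p1 - p1*p0 = 0.
f = [p_x[1], -p_x[0]]
norm_sq = sum(p * v * v for p, v in zip(p_x, f))

# Conditional mean E[f(X)|Y = y] and its squared L2 norm.
cond_mean = [sum(P[x][y] * f[x] for x in range(2)) / p_y[y] for y in range(2)]
proj_sq = sum(p_y[y] * cond_mean[y] ** 2 for y in range(2))

# (74): mmse = E[f^2] - E[E[f|Y]^2].
mmse = norm_sq - proj_sq

# Single PIC of a binary X: chi-squared divergence between p_{X,Y} and p_X p_Y.
rho_sq = sum(
    P[x][y] ** 2 / (p_x[x] * p_y[y]) for x in range(2) for y in range(2)
) - 1.0

print(mmse, norm_sq * (1.0 - rho_sq))
```

For larger alphabets the same comparison would require the full set of principal functions, for instance from a singular value decomposition of the matrix $Q$ in (14).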

As another example, assume that we wish to design a privacy-assuring mapping where the secret $S = (h_1(X),\dots,h_t(X))$ is composed of a certain set of functions (features) $h_1,\dots,h_t$ of $X$ that are supposed to remain private. The privacy-assuring mapping $p_{Y|X}$ should then ensure that the principal functions that span the subspace formed by $(h_1(X),\dots,h_t(X))$ have small PICs. These examples, together with the results presented here, motivate the future use of PICs to drive the design of privacy-assuring mappings that achieve a favorable tradeoff between privacy and utility.

Acknowledgments

The authors gratefully acknowledge Stefano Tessaro (University of California Santa Barbara), Nadia Fawaz (LinkedIn) and Yury Polyanskiy (Massachusetts Institute of Technology) for helpful and insightful discussions and feedback on the results contained in this paper. We also thank the anonymous reviewers and the Associate Editor for many helpful comments and suggestions.

Appendix A Proofs from Section 2

Lemma 2

Proof. Let $f \in L_2(p_X)$, $E[f(X)] = 0$ and $g \in L_2(p_Z)$, $E[g(Z)] = 0$, $\|g(Z)\|_2 = 1$. Then

$$\begin{aligned}
E[f(X)g(Z)] &= E\big[E[f(X)g(Z)|Y]\big] \\
&\overset{(a)}{=} E\big[E[f(X)|Y]\,E[g(Z)|Y]\big] \\
&\overset{(b)}{\le} \|E[f(X)|Y]\|_2\, \|E[g(Z)|Y]\|_2 \\
&\overset{(c)}{\le} \sqrt{\lambda_1(Z;Y)}\, \|E[f(X)|Y]\|_2,
\end{aligned}$$

where (a) follows from the assumption that $X \to Y \to Z$, (b) follows from the Cauchy-Schwarz inequality, and (c) follows from characterization (3) in Theorem 1. By choosing $g(z) = E[f(X)|Z = z] / \|E[f(X)|Z]\|_2$ and using the last inequality, we have

$$E[f(X)g(Z)] = E\big[E[f(X)|Z]\, g(Z)\big] = \|E[f(X)|Z]\|_2 \le \sqrt{\lambda_1(Z;Y)}\, \|E[f(X)|Y]\|_2.$$

Squaring both sides, we arrive at (20).


Appendix B Proofs from Section 3

Lemma 5

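Proof. Let $Y^n = X^n \oplus Z^n$ for some $Z^n$ distributed over $\{-1,1\}^n$ and independent of $X^n$. Thus

$$\begin{aligned}
E[\chi_{\mathcal{S}}(Y^n)|X^n] &= E[\chi_{\mathcal{S}}(Z^n \oplus X^n) \mid X^n] \\
&= E[\chi_{\mathcal{S}}(X^n)\chi_{\mathcal{S}}(Z^n) \mid X^n] \\
&= \chi_{\mathcal{S}}(X^n)\, E[\chi_{\mathcal{S}}(Z^n)],
\end{aligned}$$

where the last equality follows from the assumption that $X^n \perp\!\!\!\perp Z^n$. By letting $c_{\mathcal{S}} = E[\chi_{\mathcal{S}}(Z^n)]$, it follows that $p_{Y^n|X^n} \in \mathcal{A}_n$ and, consequently, $\mathcal{B}_n \subseteq \mathcal{A}_n$.

Now let $y^n$ be fixed and $\delta_{y^n} : \{-1,1\}^n \to \{0,1\}$ be given by

$$\delta_{y^n}(x^n) = \begin{cases} 1, & x^n = y^n, \\ 0, & \text{otherwise.} \end{cases}$$

Since the function $\delta_{y^n}$ has Boolean inputs, it can be expressed in terms of its Fourier expansion [69, Prop. 1.1] as

$$\delta_{y^n}(x^n) = \sum_{\mathcal{S} \subseteq [n]} d_{\mathcal{S}}\, \chi_{\mathcal{S}}(x^n)$$

for some set of coefficients $d_{\mathcal{S}} \in \mathbb{R}$, $\mathcal{S} \subseteq [n]$. Now let $p_{Y^n|X^n} \in \mathcal{A}_n$. Observe that $p_{Y^n|X^n}(y^n|x^n) = E[\delta_{y^n}(Y^n) \mid X^n = x^n]$ and, for $z^n \in \{-1,1\}^n$,

$$\begin{aligned}
p_{Y^n|X^n}(y^n \oplus z^n \mid x^n \oplus z^n) &= E[\delta_{y^n \oplus z^n}(Y^n) \mid X^n = x^n \oplus z^n] \\
&= E[\delta_{y^n}(Y^n \oplus z^n) \mid X^n = x^n \oplus z^n] \\
&= E\Big[ \sum_{\mathcal{S} \subseteq [n]} d_{\mathcal{S}}\, \chi_{\mathcal{S}}(Y^n \oplus z^n) \;\Big|\; X^n = x^n \oplus z^n \Big] \\
&= E\Big[ \sum_{\mathcal{S} \subseteq [n]} d_{\mathcal{S}}\, \chi_{\mathcal{S}}(Y^n)\chi_{\mathcal{S}}(z^n) \;\Big|\; X^n = x^n \oplus z^n \Big] \\
&\overset{(a)}{=} \sum_{\mathcal{S} \subseteq [n]} c_{\mathcal{S}}\, d_{\mathcal{S}}\, \chi_{\mathcal{S}}(x^n \oplus z^n)\chi_{\mathcal{S}}(z^n) \\
&= \sum_{\mathcal{S} \subseteq [n]} c_{\mathcal{S}}\, d_{\mathcal{S}}\, \chi_{\mathcal{S}}(x^n) \\
&\overset{(b)}{=} E\Big[ \sum_{\mathcal{S} \subseteq [n]} d_{\mathcal{S}}\, \chi_{\mathcal{S}}(Y^n) \;\Big|\; X^n = x^n \Big] \\
&= E[\delta_{y^n}(Y^n) \mid X^n = x^n] \\
&= p_{Y^n|X^n}(y^n|x^n).
\end{aligned}$$

Equalities (a) and (b) follow from the definition of $\mathcal{A}_n$. By defining the distribution of $Z^n$ as $p_{Z^n}(z^n) \triangleq p_{Y^n|X^n}(z^n|1^n)$, where $1^n$ is the vector with all entries equal to 1, it follows that $Z^n = X^n \oplus Y^n$, $Z^n \perp\!\!\!\perp X^n$ and $p_{Y^n|X^n} \in \mathcal{B}_n$.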

Lemma 7

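Proof. Let $\mathbf{x} \in \mathcal{C}^m(a, \mathbf{P}^T)$ and $\mathbf{y} \in \mathcal{C}^n(b, \mathbf{P})$. Then, for $\mathbf{P}$ decomposed as $\mathbf{P} = \mathbf{D}_X^{1/2} \mathbf{Q} \mathbf{D}_Y^{1/2}$, where $\mathbf{Q}$ is given in (14), and denoting $\Sigma^- = \mathrm{diag}(0, \sigma_1, \dots, \sigma_d)$,

$$\begin{aligned}
\mathbf{x}^T \mathbf{P} \mathbf{y} &= ab + \mathbf{x}^T \mathbf{D}_X^{1/2} \mathbf{U} \Sigma^- \mathbf{V}^T \mathbf{D}_Y^{1/2} \mathbf{y} \\
&= ab + \tilde{\mathbf{x}}^T \Sigma^- \tilde{\mathbf{y}},
\end{aligned} \tag{77}$$

where $\tilde{\mathbf{x}} \triangleq \mathbf{U}^T \mathbf{D}_X^{1/2} \mathbf{x}$ and $\tilde{\mathbf{y}} \triangleq \mathbf{V}^T \mathbf{D}_Y^{1/2} \mathbf{y}$. Since $\tilde{x}_1 = \|\tilde{\mathbf{x}}\|_2^2 = a$ and $\tilde{y}_1 = \|\tilde{\mathbf{y}}\|_2^2 = b$, then

$$\begin{aligned}
\tilde{\mathbf{x}}^T \Sigma^- \tilde{\mathbf{y}} &= \sum_{i=2}^{d+1} \sigma_{i-1} \tilde{x}_i \tilde{y}_i \\
&\le \sigma_1 \sqrt{\left( \|\tilde{\mathbf{x}}\|_2^2 - \tilde{x}_1^2 \right)\left( \|\tilde{\mathbf{y}}\|_2^2 - \tilde{y}_1^2 \right)} \\
&= \sigma_1 \sqrt{(a - a^2)(b - b^2)}.
\end{aligned}$$

The result follows by noting that $\sigma_1 = \rho_m(X;Y)$.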

Appendix C Proofs from Section 4

Theorem 6

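Consider the matrix $\mathbf{Q} = \mathbf{U}\Sigma\mathbf{V}^T$ given in (14), and define

$$\mathbf{A} \triangleq \mathbf{D}_X^{1/2}\mathbf{U}, \qquad \mathbf{B} \triangleq \mathbf{D}_Y^{1/2}\mathbf{V}.$$

Then

$$\mathbf{P} = \mathbf{A}\Sigma\mathbf{B}^T, \tag{78}$$

where $\mathbf{A}^T \mathbf{D}_X^{-1} \mathbf{A} = \mathbf{B}^T \mathbf{D}_Y^{-1} \mathbf{B} = \mathbf{I}$.

It follows directly from Theorem 1 that $\mathbf{A}$, $\mathbf{B}$ and $\Sigma$ have the form

$$\mathbf{A} = [\mathbf{p}_X \ \bar{\mathbf{A}}], \qquad \mathbf{B} = [\mathbf{p}_Y \ \bar{\mathbf{B}}], \qquad \Sigma = \mathrm{diag}\left(1, \sqrt{\lambda_1}, \dots, \sqrt{\lambda_d}\right), \tag{79}$$

and, consequently, the joint distribution can be written as

$$p_{X,Y}(x,y) = p_X(x)p_Y(y) + \sum_{k=1}^{d} \sqrt{\lambda_k}\, b_{y,k}\, a_{x,k}, \tag{80}$$

where $a_{x,k}$ and $b_{y,k}$ are the entries of $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$ in (79), respectively.

Theorem 6 follows directly from the next two lemmas.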

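Lemma 16. Let the marginal distribution $\mathbf{p}_X$ and the PICs $\boldsymbol{\lambda} = (\lambda_1,\dots,\lambda_d)$ be given, where $d = m-1$. Then for any $p_{X,Y} \in \mathcal{R}(\mathbf{p}_X, \boldsymbol{\lambda})$, $0 \le \alpha \le 1$ and $0 \le \beta \le p_X(2)$,

$$P_e(X|Y) \ge 1 - \beta - \sqrt{ f_0(\alpha, \mathbf{p}_X, \boldsymbol{\lambda}) + \sum_{i=1}^{m} \left( [p_X(i) - \beta]^+ \right)^2 },$$

where

$$f_0(\alpha, \mathbf{p}_X, \boldsymbol{\lambda}) = \sum_{i=2}^{d+1} p_X(i)\left( \lambda_{i-1} + c_i - c_{i-1} \right) + p_X(1)(c_1 + \alpha) - \alpha\, \mathbf{p}_X^T \mathbf{p}_X, \tag{81}$$

and $c_i = [\lambda_i - \alpha]^+$ for $i = 1,\dots,d$ and $c_{d+1} = 0$.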

Proof. Let $X$ and $Y$ have a joint distribution matrix $\mathbf{P}$ with marginal $\mathbf{p}_X$ and principal inertias individually bounded by $\boldsymbol{\lambda} = (\lambda_1,\dots,\lambda_d)$. We assume without loss of generality that $d = m-1$, where $|\mathcal{X}| = m$. This can always be achieved by adding inertia components equal to 0.

Consider $X \to Y \to \hat{X}$, where $\hat{X}$ is the estimate of $X$ from $Y$. The mapping from $Y$ to $\hat{X}$ can be described without loss of generality by a $|\mathcal{Y}| \times |\mathcal{X}|$ row stochastic matrix, denoted by $\mathbf{F}$, where the $(i,j)$-th entry is the probability $p_{\hat{X}|Y}(j|i)$. The probability of correct estimation $P_c$ is then

$$P_c = 1 - P_e(X|Y) = \mathrm{tr}\left( \mathbf{P}_{X,\hat{X}} \right),$$

where $\mathbf{P}_{X,\hat{X}} \triangleq \mathbf{P}\mathbf{F}$.


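The matrix $\mathbf{P}_{X,\hat{X}}$ can be decomposed according to (78), resulting in

$$P_c = \mathrm{tr}\left( \mathbf{D}_X^{1/2} \mathbf{U} \tilde{\Sigma} \mathbf{V}^T \mathbf{D}_{\hat{X}}^{1/2} \right) = \mathrm{tr}\left( \tilde{\Sigma} \mathbf{V}^T \mathbf{D}_{\hat{X}}^{1/2} \mathbf{D}_X^{1/2} \mathbf{U} \right), \tag{82}$$

where

$$\begin{aligned}
\mathbf{U} &= \left[ \mathbf{p}_X^{1/2} \ \mathbf{u}_2 \ \cdots \ \mathbf{u}_m \right], \\
\mathbf{V} &= \left[ \mathbf{p}_{\hat{X}}^{1/2} \ \mathbf{v}_2 \ \cdots \ \mathbf{v}_m \right], \\
\tilde{\Sigma} &= \mathrm{diag}\left( 1, \sqrt{\tilde{\lambda}_1}, \dots, \sqrt{\tilde{\lambda}_d} \right), \\
\mathbf{D}_{\hat{X}} &= \mathrm{diag}\left( \mathbf{p}_{\hat{X}} \right),
\end{aligned}$$

and $\mathbf{U}$ and $\mathbf{V}$ are orthogonal matrices. The probability of correct detection can be written as

$$\begin{aligned}
P_c &= \mathbf{p}_X^T \mathbf{p}_{\hat{X}} + \sum_{k=2}^{m} \sum_{i=1}^{m} \left( \tilde{\lambda}_{k-1}\, p_X(i)\, p_{\hat{X}}(i) \right)^{1/2} u_{k,i} v_{k,i} \\
&= \mathbf{p}_X^T \mathbf{p}_{\hat{X}} + \sum_{k=2}^{m} \sum_{i=1}^{m} \tilde{\lambda}_{k-1}^{1/2}\, \tilde{u}_{k,i} \tilde{v}_{k,i},
\end{aligned}$$

where $u_{k,i} = [\mathbf{u}_k]_i$, $v_{k,i} = [\mathbf{v}_k]_i$, $\tilde{u}_{k,i} = \sqrt{p_X(i)}\, u_{k,i}$ and $\tilde{v}_{k,i} = \sqrt{p_{\hat{X}}(i)}\, v_{k,i}$. Applying the Cauchy-Schwarz inequality twice, we obtain

$$\begin{aligned}
P_c &\le \mathbf{p}_X^T \mathbf{p}_{\hat{X}} + \sum_{i=1}^{m} \left( \sum_{k=2}^{m} \tilde{v}_{k,i}^2 \right)^{1/2} \left( \sum_{k=2}^{m} \tilde{\lambda}_{k-1} \tilde{u}_{k,i}^2 \right)^{1/2} \\
&= \mathbf{p}_X^T \mathbf{p}_{\hat{X}} + \sum_{i=1}^{m} \left( p_{\hat{X}}(i)\left(1 - p_{\hat{X}}(i)\right) \sum_{k=2}^{m} \tilde{\lambda}_{k-1} \tilde{u}_{k,i}^2 \right)^{1/2} \\
&\le \mathbf{p}_X^T \mathbf{p}_{\hat{X}} + \left( 1 - \sum_{i=1}^{m} p_{\hat{X}}(i)^2 \right)^{1/2} \left( \sum_{i=1}^{m} \sum_{k=2}^{m} \tilde{\lambda}_{k-1} \tilde{u}_{k,i}^2 \right)^{1/2}.
\end{aligned} \tag{83}$$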

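Let $\bar{\mathbf{U}} = [\mathbf{u}_2 \cdots \mathbf{u}_m]$ and $\bar{\Sigma} = \mathrm{diag}\left( \tilde{\lambda}_1, \dots, \tilde{\lambda}_d \right)$. Then

$$\begin{aligned}
\sum_{i=1}^{m} \sum_{k=2}^{m} \tilde{\lambda}_{k-1} \tilde{u}_{k,i}^2 &= \mathrm{tr}\left( \bar{\Sigma}\, \bar{\mathbf{U}}^T \mathbf{D}_X \bar{\mathbf{U}} \right) \\
&\le \sum_{k=1}^{d} \sigma_k \tilde{\lambda}_k \\
&\le \sum_{k=1}^{d} \sigma_k \lambda_k,
\end{aligned} \tag{84}$$

where $\sigma_k$ are the singular values of $\bar{\mathbf{U}}^T \mathbf{D}_X \bar{\mathbf{U}}$. The first inequality follows from the application of Von Neumann's trace inequality [24, Thm. 7.4.1.1] and the fact that $\bar{\mathbf{U}}^T \mathbf{D}_X \bar{\mathbf{U}}$ is symmetric and positive semi-definite. The second inequality follows by observing that the PICs satisfy the DPI and, therefore, $\tilde{\lambda}_k \le \lambda_k$.

We will now find an upper bound for (84) by bounding the eigenvalues $\sigma_k$. First, note that $\bar{\mathbf{U}}\bar{\mathbf{U}}^T = \mathbf{I} - \mathbf{p}_X^{1/2} \left( \mathbf{p}_X^{1/2} \right)^T$ and consequently

$$\begin{aligned}
\sum_{k=1}^{d} \sigma_k &= \mathrm{tr}\left( \bar{\mathbf{U}}^T \mathbf{D}_X \bar{\mathbf{U}} \right) \\
&= \mathrm{tr}\left( \mathbf{D}_X \left( \mathbf{I} - \mathbf{p}_X^{1/2} \left( \mathbf{p}_X^{1/2} \right)^T \right) \right) \\
&= 1 - \sum_{i=1}^{m} p_X(i)^2.
\end{aligned} \tag{85}$$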


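Second, note that $\bar{\mathbf{U}}^T \mathbf{D}_X \bar{\mathbf{U}}$ is a principal submatrix of $\mathbf{U}^T \mathbf{D}_X \mathbf{U}$, formed by removing the first row and column of $\mathbf{U}^T \mathbf{D}_X \mathbf{U}$. It then follows from Cauchy's interlacing theorem [24, Theorem 4.3.17] that

$$p_X(m) \le \sigma_{m-1} \le p_X(m-1) \le \dots \le p_X(2) \le \sigma_1 \le p_X(1). \tag{86}$$

Combining (85) and (86), an upper bound for (84) can be found by solving the following linear program:

$$\begin{aligned}
\max_{s_i} \quad & \sum_{i=1}^{d} \lambda_i s_i \\
\text{subject to} \quad & \sum_{i=1}^{d} s_i = 1 - \mathbf{p}_X^T \mathbf{p}_X, \\
& p_X(i+1) \le s_i \le p_X(i), \quad i = 1,\dots,d.
\end{aligned} \tag{87}$$

Let $\delta_i \triangleq p_X(i) - p_X(i+1)$ and $\gamma_i \triangleq \lambda_i p_X(i+1)$. The dual of (87) is

$$\begin{aligned}
\min_{y_i, \mu} \quad & \mu\left( p_X(1) - \mathbf{p}_X^T \mathbf{p}_X \right) + \sum_{i=1}^{m-1} \left( \delta_i y_i + \gamma_i \right) \\
\text{subject to} \quad & y_i \ge [\lambda_i - \mu]^+, \quad i = 1,\dots,d.
\end{aligned} \tag{88}$$

For any given value of $\mu$, the optimal values of the dual variables $y_i$ in (88) are

$$y_i = [\lambda_i - \mu]^+ = c_i, \quad i = 1,\dots,d.$$

Therefore the linear program (88) is equivalent to

$$\min_{\mu} f_0(\mu, \mathbf{p}_X, \boldsymbol{\lambda}), \tag{89}$$

where $f_0(\mu, \mathbf{p}_X, \boldsymbol{\lambda})$ is defined in the statement of the lemma.

Denote the solution of (87) by $f_P^*(\mathbf{p}_X, \boldsymbol{\lambda})$ and of (88) by $f_D^*(\mathbf{p}_X, \boldsymbol{\lambda})$. It follows that (84) can be bounded as

$$\sum_{k=1}^{d} \sigma_k \lambda_k \le f_P^*(\mathbf{p}_X, \boldsymbol{\lambda}) = f_D^*(\mathbf{p}_X, \boldsymbol{\lambda}) \le f_0(\alpha, \mathbf{p}_X, \boldsymbol{\lambda}) \quad \forall\, \alpha \in \mathbb{R}. \tag{90}$$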

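We may consider $0 \le \alpha \le 1$ in (90) without loss of generality.

Using (90) to bound (83), we find

$$P_c \le \mathbf{p}_X^T \mathbf{p}_{\hat{X}} + \left[ f_0(\alpha, \mathbf{p}_X, \boldsymbol{\lambda}) \left( 1 - \sum_{i=1}^{m} p_{\hat{X}}(i)^2 \right) \right]^{1/2}. \tag{91}$$

The previous bound can be maximized over all possible output distributions $\mathbf{p}_{\hat{X}}$ by solving:

$$\begin{aligned}
\max_{x_i} \quad & \left[ f_0(\alpha, \mathbf{p}_X, \boldsymbol{\lambda}) \left( 1 - \sum_{i=1}^{m} x_i^2 \right) \right]^{1/2} + \sum_{i=1}^{m} p_X(i) x_i \\
\text{subject to} \quad & \sum_{i=1}^{m} x_i = 1, \\
& x_i \ge 0, \quad i = 1,\dots,m.
\end{aligned} \tag{92}$$

The dual function of (92) over the constraint $\sum_{i=1}^{m} x_i = 1$ is

$$\begin{aligned}
L(\beta) &= \max_{x_i \ge 0}\ \beta + \left[ f_0(\alpha, \mathbf{p}_X, \boldsymbol{\lambda}) \left( 1 - \sum_{i=1}^{m} x_i^2 \right) \right]^{1/2} + \sum_{i=1}^{m} (p_X(i) - \beta) x_i \\
&= \beta + \sqrt{ f_0(\alpha, \mathbf{p}_X, \boldsymbol{\lambda}) + \sum_{i=1}^{m} \left( [p_X(i) - \beta]^+ \right)^2 }.
\end{aligned} \tag{93}$$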

Since $L(\beta)$ is an upper bound of (92) for any $\beta$ and, therefore, is also an upper bound of (91), then

$$P_c \le \beta + \sqrt{ f_0(\alpha, \mathbf{p}_X, \boldsymbol{\lambda}) + \sum_{i=1}^{m} \left( [p_X(i) - \beta]^+ \right)^2 }. \tag{94}$$

Note that we can consider $0 \le \beta \le p_X(2)$ in (94), since $L(\beta)$ is increasing for $\beta > p_X(2)$. Taking $P_e(X|Y) = 1 - P_c$, the result follows.

The next result tightens the bound introduced in Lemma 16 by optimizing over all values of α.

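Lemma 17. Let $f_0^*(\mathbf{p}_X, \boldsymbol{\lambda}) \triangleq \min_\alpha f_0(\alpha, \mathbf{p}_X, \boldsymbol{\lambda})$ and $k^*$ be defined as in (48). Then

$$f_0^*(\mathbf{p}_X, \boldsymbol{\lambda}) = \sum_{i=1}^{k^*} \lambda_i p_X(i) + \sum_{i=k^*+1}^{m} \lambda_{i-1} p_X(i) - \lambda_{k^*}\, \mathbf{p}_X^T \mathbf{p}_X, \tag{95}$$

where $\lambda_m = 0$.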

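Proof. Let $\mathbf{p}_X$ and $\boldsymbol{\lambda}$ be fixed, and $\lambda_k \le \alpha \le \lambda_{k-1}$, where we define for ease of notation $\lambda_0 \triangleq 1$ and $\lambda_m \triangleq 0$ (recall that the PICs correspond to $\lambda_1,\dots,\lambda_{m-1}$). Then $c_i = \lambda_i - \alpha$ for $1 \le i \le k-1$ and $c_i = 0$ for $k \le i \le d$ in (81). Therefore

$$f_0(\alpha, \mathbf{p}_X, \boldsymbol{\lambda}) = \sum_{i=1}^{k-1} \lambda_i p_X(i) + \alpha\, p_X(k) + \sum_{i=k+1}^{m} \lambda_{i-1} p_X(i) - \alpha\, \mathbf{p}_X^T \mathbf{p}_X. \tag{96}$$

Note that (96) is convex in $\alpha$, and is decreasing when $p_X(k) - \mathbf{p}_X^T \mathbf{p}_X \le 0$ and increasing when $p_X(k) - \mathbf{p}_X^T \mathbf{p}_X \ge 0$. Therefore, $f_0(\alpha, \mathbf{p}_X, \boldsymbol{\lambda})$ is minimized when $\alpha = \lambda_k$ such that $p_X(k) \ge \mathbf{p}_X^T \mathbf{p}_X$ and $p_X(k-1) \le \mathbf{p}_X^T \mathbf{p}_X$. If $p_X(k) - \mathbf{p}_X^T \mathbf{p}_X \ge 0$ for all $k$ (i.e. $\mathbf{p}_X$ is uniform), then we can take $\alpha = 0$. Theorem 6 follows directly.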
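The step from (81) to the piecewise-linear form (96) can be verified numerically. The sketch below implements both expressions for a made-up $\mathbf{p}_X$ and PIC vector (the function names and numbers are assumptions of this example), compares them on a grid of $\alpha$ values, and locates the grid minimizer.

```python
def f0(alpha, p, lam):
    """f_0 as written in (81); p[i-1] = p_X(i), lam[i-1] = lambda_i, d = m - 1."""
    d = len(p) - 1
    c = [max(l - alpha, 0.0) for l in lam] + [0.0]  # c_1..c_d and c_{d+1} = 0
    total = sum(p[i - 1] * (lam[i - 2] + c[i - 1] - c[i - 2]) for i in range(2, d + 2))
    return total + p[0] * (c[0] + alpha) - alpha * sum(q * q for q in p)


def f0_piecewise(alpha, p, lam):
    """f_0 as written in (96), valid for lambda_k <= alpha <= lambda_{k-1}."""
    m = len(p)
    ext = [1.0] + list(lam) + [0.0]  # ext[i] = lambda_i with lambda_0 = 1, lambda_m = 0
    k = next(k for k in range(1, m + 1) if ext[k] <= alpha <= ext[k - 1])
    head = sum(ext[i] * p[i - 1] for i in range(1, k))
    tail = sum(ext[i - 1] * p[i - 1] for i in range(k + 1, m + 1))
    return head + alpha * p[k - 1] + tail - alpha * sum(q * q for q in p)


p_X = [0.5, 0.3, 0.2]  # illustrative marginal, sorted in decreasing order
lam = [0.6, 0.2]       # illustrative PICs, lambda_1 >= lambda_2

grid = [j / 100 for j in range(101)]
max_gap = max(abs(f0(a, p_X, lam) - f0_piecewise(a, p_X, lam)) for a in grid)
best_alpha = min(grid, key=lambda a: f0(a, p_X, lam))
print(max_gap, best_alpha, f0(best_alpha, p_X, lam))
```

On this example the minimum over the grid is attained at $\alpha = \lambda_1 = 0.6$, where the slope $p_X(k) - \mathbf{p}_X^T \mathbf{p}_X$ of (96) changes sign.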

Theorem 7

Proof. Consider two probability distributions $p_X$ and $q_X$ defined over $\mathcal{X} = \{1, \dots, m\}$, and assume that $p_X$ majorizes $q_X$, i.e. $\sum_{i=1}^{k} q_X(i) \le \sum_{i=1}^{k} p_X(i)$ for $1 \le k \le m$. Therefore $q_X$ is a convex combination of permutations of $p_X$ [79], and can be written as $q_X = \sum_{i=1}^{l} a_i (p_X \circ \pi_i)$ for some $l \ge 1$, where $a_i \ge 0$, $\sum_i a_i = 1$ and $\pi_i$ is a permutation of $p_X$, i.e. $p_X \circ \pi_i = p_{\pi_i(X)}$. Hence, for a fixed $p_{\hat{X}|X}$:

$$
\begin{aligned}
I(q_X, p_{\hat{X}|X}) &= I\Big(\sum_{i=1}^{l} a_i (p_X \circ \pi_i),\, p_{\hat{X}|X}\Big) \\
&\ge \sum_{i=1}^{l} a_i I(p_X \circ \pi_i,\, p_{\hat{X}|X}) \\
&= \sum_{i=1}^{l} a_i I(p_X,\, \pi_i \circ p_{\hat{X}|X}),
\end{aligned} \tag{97}
$$

where the inequality follows from the concavity assumption and the last equality from $I(X; \hat{X})$ being invariant to one-to-one mappings of $X$ and $\hat{X}$. Consequently, from the definition of the error-rate function in Defn. 10,

$$
\begin{aligned}
e_I(q_X, \theta) &= \inf_{p_{\hat{X}|X}} \Bigg\{ \sum_{x, x' \in [m]} d_H(x, x')\, q_X(x)\, p_{\hat{X}|X}(x'|x) \,\Bigg|\, I(q_X, p_{\hat{X}|X}) \le \theta \Bigg\} \\
&\overset{(a)}{\ge} \inf_{p_{\hat{X}|X}} \Bigg\{ \sum_{i \in [l]} a_i \sum_{x, x' \in [m]} d_H(\pi_i(x), x')\, p_X(x)\, p_{\hat{X}|X}(x'|\pi_i(x)) \,\Bigg|\, \sum_{i \in [l]} a_i I(p_X, \pi_i \circ p_{\hat{X}|X}) \le \theta \Bigg\} \\
&\overset{(b)}{=} \inf_{p_{\hat{X}|X}} \Bigg\{ \sum_{i \in [l]} a_i \sum_{x, x' \in [m]} d_H(\pi_i(x), \pi_i(x'))\, p_X(x)\, p_{\hat{X}|X}(\pi_i(x')|\pi_i(x)) \,\Bigg|\, \sum_{i \in [l]} a_i I(p_X, \pi_i \circ p_{\hat{X}|X} \circ \pi_i) \le \theta \Bigg\} \\
&\overset{(c)}{\ge} \inf_{p^1_{\hat{X}|X}, \dots, p^l_{\hat{X}|X}} \Bigg\{ \sum_{i \in [l]} a_i \sum_{x, x' \in [m]} d_H(x, x')\, p_X(x)\, p^i_{\hat{X}|X}(x'|x) \,\Bigg|\, \sum_{i \in [l]} a_i I(p_X, p^i_{\hat{X}|X}) \le \theta \Bigg\} \\
&\overset{(d)}{=} \inf_{\theta_1, \dots, \theta_l \ge 0} \Bigg\{ \sum_{i=1}^{l} a_i e_I(p_X, \theta_i) \,\Bigg|\, \sum_{i=1}^{l} a_i \theta_i = \theta \Bigg\} \\
&\overset{(e)}{\ge} \inf_{\theta_1, \dots, \theta_l \ge 0} \Bigg\{ e_I\Big(p_X, \sum_{i} a_i \theta_i\Big) \,\Bigg|\, \sum_{i=1}^{l} a_i \theta_i = \theta \Bigg\} \\
&= e_I(p_X, \theta),
\end{aligned}
$$

where inequality (a) follows from (97), (b) follows from the fact that the infimum is taken over all mappings $p_{\hat{X}|X}$ and that $I(X; \hat{X})$ is invariant to one-to-one mappings of $X$ and $\hat{X}$, (c) follows by allowing a mapping $p^i_{\hat{X}|X}$ to be independently minimized for each $i$ (as opposed to minimizing the same mapping $p_{\hat{X}|X}$ for all $i$), (d) is obtained by noting that the optimal choice of $p^i_{\hat{X}|X}$ is the one that minimizes the Hamming distortion $d_H$ for a given upper bound on $I(p_X, p^i_{\hat{X}|X})$, and (e) follows from the convexity of $e_I(p_X, \theta)$ in $\theta$. Since this holds for any $q_X$ that is majorized by $p_X$, $e_I(p_X, \theta)$ is Schur-concave.
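The majorization relation used at the start of this proof is easy to test numerically. The sketch below assumes the standard convention that the partial sums are taken after sorting both vectors in decreasing order; the helper names `majorizes` and `mix_permutations` and the example values are hypothetical. It builds $q_X$ as a convex combination of permutations of $p_X$ and confirms that $p_X$ majorizes the result.

```python
from itertools import accumulate


def majorizes(p, q, tol=1e-12):
    """Return True if p majorizes q (partial sums of the sorted p dominate those of q)."""
    ps = list(accumulate(sorted(p, reverse=True)))
    qs = list(accumulate(sorted(q, reverse=True)))
    return (len(p) == len(q)
            and abs(ps[-1] - qs[-1]) <= tol
            and all(a + tol >= b for a, b in zip(ps, qs)))


def mix_permutations(p, perms, weights):
    """q(x) = sum_i a_i p(pi_i(x)), where each pi_i is a tuple of indices."""
    return [sum(a * p[pi[x]] for a, pi in zip(weights, perms))
            for x in range(len(p))]


# Hypothetical example values.
p_X = [0.6, 0.3, 0.1]
perms = [(0, 1, 2), (1, 0, 2), (2, 1, 0)]
weights = [0.5, 0.3, 0.2]

q_X = mix_permutations(p_X, perms, weights)
print(q_X, majorizes(p_X, q_X), majorizes(q_X, p_X))
```

Mixing permutations can only flatten a distribution, which is why the check succeeds in one direction and fails in the other.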

Appendix D Proofs from Section 5

Lemma 10

Proof. Let $p_{S,X}$ and $p_{Y|X}$ be given, with $S \to X \to Y$. Denote by $w_i$ the vector in the $|\mathcal{X}|$-simplex with entries $p_{X|Y}(\cdot|i)$. Furthermore, let $a_i \triangleq H(X) - H(X|Y = i)$, and $b_i \triangleq H(S) - H(S|Y = i)$. Therefore

$$
\sum_{i=1}^{|\mathcal{Y}|} p_Y(i)\, [w_i, a_i, b_i] = [p_X, I(X;Y), I(S;Y)]. \tag{98}
$$

Since $w_i$ belongs to the $|\mathcal{X}|$-simplex, the vector $[w_i, a_i, b_i]$ is taken from a connected, compact $(|\mathcal{X}|+1)$-dimensional space. Then, from the Fenchel-Eggleston strengthening of Carathéodory’s theorem [80, Theorem 18, pg. 35], the point $[p_X, I(X;Y), I(S;Y)]$ can also be achieved by at most $|\mathcal{X}|+1$ non-zero values of $p_Y(i)$. It follows directly that it is sufficient to consider $|\mathcal{Y}| \le |\mathcal{X}|+1$ for the mappings that approach the infimum $G_I(t, p_{S,X})$ in (59). The set of all mappings $p_{Y|X}$ for $|\mathcal{Y}| \le |\mathcal{X}|+1$ is compact, and both $p_{Y|X} \to I(S;Y)$ and $p_{Y|X} \to I(X;Y)$ are continuous and bounded when $S$, $X$ and $Y$ have finite support. Consequently, the infimum in (59) is attainable.

For $0 < t \le H(X)$ and $p_{S,X}$ fixed, let $G_I(t, p_{S,X}) = \alpha$. From the discussion above, there exists $p_{Y|X}$ that achieves $I(S;Y) = \alpha$ for $I(X;Y) \ge t$. Now consider $p_{\tilde{Y}|X}$ where $\tilde{\mathcal{Y}} = [|\mathcal{Y}|+1]$ and, for $0 < \lambda \le 1$,

$$
p_{\tilde{Y}|X}(y|x) = (1 - \lambda)\, \mathbb{1}_{\{y = |\mathcal{Y}|+1\}} + \lambda\, \mathbb{1}_{\{y \ne |\mathcal{Y}|+1\}}\, p_{Y|X}(y|x).
$$

Note that $\tilde{Y}$ can be understood as an erased version of $Y$, with the erasure symbol being $|\mathcal{Y}|+1$. It follows (cf. [9, Sec. 7.1.5]) that $I(S; \tilde{Y}) = \lambda I(S;Y) = \lambda\alpha$ and $I(X; \tilde{Y}) = \lambda I(X;Y) \ge \lambda t$. We have thus explicitly constructed a new mapping $p_{\tilde{Y}|X}$ that satisfies $S \to X \to \tilde{Y}$ and achieves $I(S; \tilde{Y}) = \lambda\alpha$ and $I(X; \tilde{Y}) \ge \lambda t$. Therefore, from the definition of $G_I$ in (59), $G_I(\lambda t, p_{S,X}) \le \lambda\alpha = \lambda I(S;Y)$. Consequently,

$$
\frac{G_I(\lambda t, p_{S,X})}{\lambda t} \le \frac{\lambda I(S;Y)}{\lambda t} = \frac{G_I(t, p_{S,X})}{t}. \tag{99}
$$

Since this holds for any $0 < \lambda \le 1$, $G_I(t, p_{S,X})/t$ is non-decreasing in $t$. Finally, for a fixed $p_{S,X}$, the set of points $(w_i, a_i, b_i) \in \mathbb{R}^{|\mathcal{X}|+2}$ that satisfy (98) is convex, and thus, for a fixed $p_X$, its lower boundary, which corresponds to the graph of $(t, G_I(t, p_{S,X}))$, is convex.
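The scaling property of the erasure construction used in this proof can be checked numerically. The following is a minimal sketch: the joint distribution `p_sx`, the channel `p_y_given_x` and the value of `lam` are hypothetical, and the helper names are illustrative. It appends an erasure symbol to the channel output and compares the mutual informations before and after erasure.

```python
from math import log2


def mutual_information(joint):
    """I(A;B) in bits from a joint pmf given as rows joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(v * log2(v / (pa[a] * pb[b]))
               for a, row in enumerate(joint)
               for b, v in enumerate(row) if v > 0)


def joint_with_output(p_in, channel):
    """Joint pmf of (input, output): p_in[a] * channel[a][y]."""
    return [[p_in[a] * w for w in channel[a]] for a in range(len(p_in))]


def compose(p_sx, channel):
    """Joint pmf of (S, Y) under the Markov chain S -> X -> Y."""
    n_y = len(channel[0])
    return [[sum(row[x] * channel[x][y] for x in range(len(row)))
             for y in range(n_y)] for row in p_sx]


def erase(channel, lam):
    """Keep each output with probability lam; otherwise emit an extra erasure symbol."""
    return [[lam * w for w in row] + [1.0 - lam] for row in channel]


# Hypothetical example values.
p_sx = [[0.30, 0.10, 0.05],
        [0.05, 0.20, 0.30]]                             # joint pmf of (S, X)
p_x = [sum(col) for col in zip(*p_sx)]
p_y_given_x = [[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]]      # rows indexed by x
lam = 0.4
erased = erase(p_y_given_x, lam)

i_sy = mutual_information(compose(p_sx, p_y_given_x))
i_xy = mutual_information(joint_with_output(p_x, p_y_given_x))
i_sy_erased = mutual_information(compose(p_sx, erased))
i_xy_erased = mutual_information(joint_with_output(p_x, erased))
print(i_sy_erased, lam * i_sy, i_xy_erased, lam * i_xy)
```

Both mutual informations shrink by the same factor $\lambda$, which is what makes the ratio argument in (99) work.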


Lemma 12

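Proof. For fixed $p_{Y|X}$ and $p_{S,X}$, and assuming $I(X;Y) > 0$,

$$
\begin{aligned}
\frac{I(S;Y)}{I(X;Y)} &= \frac{\sum_{y \in \mathcal{Y}} p_Y(y)\, D(p_{S|Y=y} \| p_S)}{\sum_{y \in \mathcal{Y}} p_Y(y)\, D(p_{X|Y=y} \| p_X)} \\
&\ge \min_{y \in \mathcal{Y}:\, D(p_{X|Y=y} \| p_X) > 0} \frac{D(p_{S|Y=y} \| p_S)}{D(p_{X|Y=y} \| p_X)} \\
&\ge \inf_{q_X \ne p_X} \frac{D(q_S \| p_S)}{D(q_X \| p_X)}.
\end{aligned}
$$

Now let $d^*$ be the infimum in the right-hand side of (64), and $q_X$ satisfy

$$
\frac{D(q_S \| p_S)}{D(q_X \| p_X)} = d^* + \delta,
$$

where $\delta > 0$. For $\epsilon > 0$ and sufficiently small, let $p_{Y|X}$ be such that $\mathcal{Y} = [2]$, $p_Y(1) = \epsilon$, $p_{X|Y}(x|1) = q_X(x)$ and

$$
p_{X|Y}(x|2) = \frac{1}{1 - \epsilon}\, p_X(x) - \frac{\epsilon}{1 - \epsilon}\, q_X(x).
$$

Since for any distribution $r_X$ with support $\mathcal{X}$ we have $D((1 - \epsilon) p_X + \epsilon r_X \| p_X) = o(\epsilon)$, we find

$$
I(S;Y) = \epsilon D(p_{S|Y=1} \| p_S) + (1 - \epsilon) D(p_{S|Y=2} \| p_S) = \epsilon D(q_S \| p_S) + o(\epsilon),
$$

and, similarly, $I(X;Y) = \epsilon D(q_X \| p_X) + o(\epsilon)$. Consequently,

$$
\frac{I(S;Y)}{I(X;Y)} = \frac{\epsilon D(q_S \| p_S) + o(\epsilon)}{\epsilon D(q_X \| p_X) + o(\epsilon)} \to d^* + \delta,
$$

where the limit is taken as $\epsilon \to 0$. Since this holds for any $\delta > 0$, $v^*(p_{S,X}) \le d^*$, proving the result.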
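The limiting behaviour of the two-output construction can be observed numerically. This is a minimal sketch under stated assumptions: the distributions `p_x`, `q_x` and the channel `p_s_given_x` are hypothetical, the second output symbol is labelled 2, and the helper names are illustrative. It uses the fact that, under $S \to X \to Y$, the conditional distribution of $S$ given $Y = y$ is obtained by passing $p_{X|Y=y}$ through $p_{S|X}$.

```python
from math import log2


def kl(q, p):
    """Kullback-Leibler divergence D(q||p) in bits (p must have full support)."""
    return sum(a * log2(a / b) for a, b in zip(q, p) if a > 0)


def push(p_s_given_x, r_x):
    """Distribution of S when X ~ r_x and S is drawn through p_{S|X}."""
    n_s = len(p_s_given_x[0])
    return [sum(r_x[x] * p_s_given_x[x][s] for x in range(len(r_x)))
            for s in range(n_s)]


def info_ratio(p_x, q_x, p_s_given_x, eps):
    """I(S;Y)/I(X;Y) for the two-output construction with p_Y(1) = eps."""
    rest = [(a - eps * b) / (1 - eps) for a, b in zip(p_x, q_x)]   # p_{X|Y}(.|2)
    p_s = push(p_s_given_x, p_x)
    i_xy = eps * kl(q_x, p_x) + (1 - eps) * kl(rest, p_x)
    i_sy = (eps * kl(push(p_s_given_x, q_x), p_s)
            + (1 - eps) * kl(push(p_s_given_x, rest), p_s))
    return i_sy / i_xy


# Hypothetical example values.
p_x = [0.5, 0.3, 0.2]
q_x = [0.2, 0.3, 0.5]
p_s_given_x = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]     # rows indexed by x

target = kl(push(p_s_given_x, q_x), push(p_s_given_x, p_x)) / kl(q_x, p_x)
ratios = {eps: info_ratio(p_x, q_x, p_s_given_x, eps) for eps in (1e-2, 1e-3, 1e-4)}
print(target, ratios)
```

As $\epsilon$ shrinks, the ratio of mutual informations approaches the ratio of divergences for the chosen $q_X$.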

Lemma 13

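Proof. Let $f : \mathcal{X} \to \mathbb{R}$, $E[f(X)] = 0$ and $\|f(X)\|_2^2 = 1$, and $\mathbf{f} \in \mathbb{R}^{|\mathcal{X}|}$ be a vector with entries $f_i = f(i)$ for $i \in \mathcal{X}$. Observe that

$$
\begin{aligned}
\|E[f(X)|S]\|_2^2 &= \sum_{s \in \mathcal{S}} p_S(s)\, E[f(X)|S = s]^2 \\
&= \mathbf{f}^T P_{X|S}^T D_S P_{X|S} \mathbf{f} \\
&= \mathbf{f}^T D_X^{1/2} Q^T Q D_X^{1/2} \mathbf{f} \\
&\ge \delta(p_{S,X}),
\end{aligned}
$$

where the last inequality follows by noting that $x \triangleq \mathbf{f}^T D_X^{1/2}$ satisfies $\|x\|_2 = 1$ and that $\delta(p_{S,X})$ is the smallest eigenvalue of the positive semi-definite matrix $Q^T Q$, where $Q$ was defined in Definition 1 as $Q \triangleq D_S^{-1/2} P_{X,S} D_X^{-1/2}$.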
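The bound above can be checked on a random example. The sketch below assumes NumPy is available and stores the joint distribution as a matrix `P` with rows indexed by $s$ and columns indexed by $x$, so that $Q = D_S^{-1/2} P D_X^{-1/2}$; this indexing convention, the random joint distribution and the helper names are assumptions made for illustration. It draws random zero-mean, unit-variance functions $f$ and compares $\|E[f(X)|S]\|_2^2$ with the smallest eigenvalue of $Q^T Q$.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((4, 4))
P /= P.sum()                                   # hypothetical joint pmf, rows: s, columns: x
p_s, p_x = P.sum(axis=1), P.sum(axis=0)

Q = P / np.sqrt(np.outer(p_s, p_x))            # D_S^{-1/2} P D_X^{-1/2}
delta = np.linalg.eigvalsh(Q.T @ Q).min()      # smallest eigenvalue of Q^T Q


def cond_mean_energy(f):
    """||E[f(X)|S]||_2^2 = sum_s p_S(s) E[f(X)|S=s]^2."""
    cond_mean = (P / p_s[:, None]) @ f         # E[f(X)|S=s] for each s
    return float(p_s @ cond_mean**2)


def random_test_function():
    """Draw f with E[f(X)] = 0 and E[f(X)^2] = 1 under p_X."""
    f = rng.standard_normal(len(p_x))
    f -= p_x @ f
    return f / np.sqrt(p_x @ f**2)


energies = [cond_mean_energy(random_test_function()) for _ in range(1000)]
print(delta, min(energies), max(energies))
```

Every sampled value lies between the smallest eigenvalue of $Q^T Q$ and 1, the largest one.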

Lemma 14

Proof. Let $p_{S|X}$ be fixed, and define

$$
g_\lambda(p_X) \triangleq H(S) - \lambda H(X),
$$

where $H(S)$ and $H(X)$ are the entropy of $S$ and $X$, respectively, when $(S, X) \sim p_{S|X} p_X$. For $0 < \epsilon \ll 1$, let

$$
p_\epsilon(i) \triangleq p_X(i)(1 + \epsilon f(i))
$$

be a perturbed version of $p_X$, where $E[f(X)] = 0$ and, w.l.o.g., $\|f(X)\|_2 = 1$. The second derivative of $g_\lambda(p_\epsilon)$ at $\epsilon = 0$ is⁹

$$
\begin{aligned}
\frac{\partial^2 g_\lambda(p_\epsilon)}{\partial \epsilon^2}\bigg|_{\epsilon = 0} &= \log_2(e) \left( -\|E[f(X)|S]\|_2^2 + \lambda \|f(X)\|_2^2 \right) \\
&= \log_2(e) \left( -\|E[f(X)|S]\|_2^2 + \lambda \right).
\end{aligned} \tag{100}
$$

Thus, from Lemma 13, if $\lambda \le \delta(p_{S,X})$ then for any sufficiently small perturbation of $p_X$, (100) will be non-positive. Conversely, if $\lambda > \delta(p_{S,X})$, then we can find a perturbation $f(X)$ such that (100) is positive. Therefore, $g_\lambda(p_X)$ has a negative semi-definite Hessian if and only if $0 \le \lambda \le \delta(p_{S,X})$.
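The second-derivative formula (100) can be compared against a finite-difference approximation. This is a minimal sketch: the distribution `p_x`, the channel `p_s_given_x`, the value of `lam`, the direction `raw` and the step size `h` are all hypothetical choices, and the helper names are illustrative.

```python
from math import log2, e, sqrt


def entropy(p):
    """Shannon entropy in bits."""
    return -sum(v * log2(v) for v in p if v > 0)


def g(p_x, p_s_given_x, lam):
    """g_lambda(p_X) = H(S) - lambda * H(X) for a fixed channel p_{S|X}."""
    n_s = len(p_s_given_x[0])
    p_s = [sum(p_x[x] * p_s_given_x[x][s] for x in range(len(p_x)))
           for s in range(n_s)]
    return entropy(p_s) - lam * entropy(p_x)


# Hypothetical example values.
p_x = [0.5, 0.3, 0.2]
p_s_given_x = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]     # rows indexed by x
lam = 0.3

# Build f with E[f(X)] = 0 and E[f(X)^2] = 1 under p_X.
raw = [1.0, -2.0, 0.5]
mean = sum(p * v for p, v in zip(p_x, raw))
centred = [v - mean for v in raw]
norm = sqrt(sum(p * v * v for p, v in zip(p_x, centred)))
f = [v / norm for v in centred]


def perturbed(eps):
    """p_eps(i) = p_X(i) * (1 + eps * f(i))."""
    return [p * (1 + eps * v) for p, v in zip(p_x, f)]


# Central finite difference for the second derivative at eps = 0.
h = 1e-3
numeric = (g(perturbed(h), p_s_given_x, lam) - 2 * g(p_x, p_s_given_x, lam)
           + g(perturbed(-h), p_s_given_x, lam)) / h**2

# Closed form (100): log2(e) * (-||E[f(X)|S]||_2^2 + lambda).
p_s = [sum(p_x[x] * p_s_given_x[x][s] for x in range(3)) for s in range(2)]
cond_mean = [sum(p_x[x] * p_s_given_x[x][s] * f[x] for x in range(3)) / p_s[s]
             for s in range(2)]
energy = sum(ps * m * m for ps, m in zip(p_s, cond_mean))
closed_form = log2(e) * (-energy + lam)
print(numeric, closed_form)
```

The two values agree up to the discretization error of the finite difference, which is of order $h^2$.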

For any $S \to X \to Y$, we have $I(S;Y)/I(X;Y) \ge v^*(p_{S,X})$, and, consequently, for $0 \le \lambda^\dagger \le v^*(p_{S,X})$,

$$
g_{\lambda^\dagger}(p_X) \ge H(S|Y) - \lambda^\dagger H(X|Y),
$$

and $g_{\lambda^\dagger}(p_X)$ touches the upper-concave envelope of $g_{\lambda^\dagger}$ at $p_X$. Since a function must be concave at the points where it matches its concave envelope, $g_{\lambda^\dagger}$ has a negative semi-definite Hessian at $p_X$ and, from (100), $\lambda^\dagger \le \delta(p_{S,X})$. Since this holds for any $0 \le \lambda^\dagger \le v^*(p_{S,X})$, we find $v^*(p_{S,X}) \le \delta(p_{S,X})$.

For a fixed $p_{S|X}$, the function $g_\lambda(p_X)$ is concave when $\lambda = 0$ and convex when $\lambda = 1$. Consequently, the maximum $\lambda$ for which $g_\lambda(p_X)$ has a negative Hessian at $p_X$ is $\delta(p_{S,X})$. Furthermore, Lemma 12 implies that a value $\lambda_1$ for which $g_\lambda(p_X)$ touches its lower concave envelope at $p_X$ for all $\lambda_1 \ge \lambda$ is $v^*(p_{S,X})$. Therefore, both $\inf_{p_X} v^*(p_{S,X})$ and $\inf_{p_X} \delta(p_{S,X})$ equal the maximum value of $\lambda$ such that the function $g_\lambda(p_X)$ is concave at all values of $p_X$. Therefore, we established that for a given $p_{S|X}$,

$$
\inf_{p_X} v^*(p_{S,X}) = \inf_{p_X} \delta(p_{S,X}).
$$

Lemma 15

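Proof. Theorem 14 immediately gives $\delta(p_{S,X}) = 0 \Rightarrow v^*(p_{S,X}) = 0$. Let $v^*(p_{S,X}) = 0$. Then, since $D(q_X \| p_X) \le -\min_{i \in \mathcal{X}} \log_2 p_X(i)$ and $\mathcal{X}$ is finite, Lemma 12 implies that for any $\epsilon > 0$ there exist $q_X$ and $0 < \delta \le -\min_{i \in \mathcal{X}} \log_2 p_X(i)$ such that

$$
D(q_X \| p_X) \ge \delta > 0 \quad \text{and} \quad D(q_S \| p_S) < \epsilon.
$$

We can then construct a sequence $q_X^1, q_X^2, q_X^3, \dots$ such that $q_X^k \ne p_X$, $D(q_S^k \| p_S) \le \epsilon_k$ and

$$
\lim_{k \to \infty} \epsilon_k = 0.
$$

Let $\mathbf{q}_S^k$ be a vector whose entries are $q_S^k(\cdot)$. Then, from Pinsker’s inequality,

$$
\epsilon_k \ge \frac{1}{2} \|\mathbf{q}_S^k - \mathbf{p}_S\|_1^2 \ge \frac{1}{2} \|\mathbf{q}_S^k - \mathbf{p}_S\|_2^2. \tag{101}
$$

Defining $x_k = \mathbf{q}_X^k - \mathbf{p}_X$, observe that $0 < \|x_k\|_2^2 \le 2$ and, from (101), $\|P_{S|X} x_k\|_2 \le \sqrt{2\epsilon_k}$. Hence,

$$
\lim_{k \to \infty} \frac{\|P_{S|X} x_k\|_2^2}{\|x_k\|_2^2} = 0. \tag{102}
$$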

In addition, denoting $s_m \triangleq \min_{s \in \mathcal{S}} p_S(s)$ and $x_M \triangleq \max_{x \in \mathcal{X}} p_X(x)$, for each $k$ we have

$$
\begin{aligned}
\frac{\|P_{S|X} x_k\|_2^2}{\|x_k\|_2^2} &\ge \min_{\|y\|_2^2 > 0} \frac{\|P_{S|X} y\|_2^2}{\|y\|_2^2} = \min_{\|y\|_2^2 > 0} \frac{\|P_{S,X} D_X^{-1/2} y\|_2^2}{\|D_X^{1/2} y\|_2^2} && (103) \\
&\ge \min_{\|y\|_2^2 > 0} \frac{s_m \|D_S^{-1/2} P_{S,X} D_X^{-1/2} y\|_2^2}{x_M \|y\|_2^2} && (104) \\
&= \frac{s_m}{x_M} \min_{\|y\|_2^2 > 0} \frac{\|Q y\|_2^2}{\|y\|_2^2} && (105) \\
&= \frac{s_m\, \delta(p_{S,X})}{x_M}. && (106)
\end{aligned}
$$

⁹ This was observed in [23] and [77], and follows directly from $-\frac{\partial^2}{\partial \epsilon^2}\, a(1 + b\epsilon) \log_2 a(1 + b\epsilon) = -b^2 a \log_2(e)$ at $\epsilon = 0$.

In the derivation above, (103) follows from $D_X$ being invertible (by definition), (104) is a direct consequence of $\|D_S^{-1/2} y\|_2^2 \le s_m^{-1} \|y\|_2^2$ and $\|D_X^{1/2} y\|_2^2 \le x_M \|y\|_2^2$ for any $y$, and (105) and (106) follow from the definition of $Q$ and $\delta(p_{S,X})$, respectively. Combining (106) with (102), it follows that $\delta(p_{S,X}) = 0$, proving the desired result.
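The chain (103) to (106) can be sanity-checked on a random joint distribution. The sketch below assumes NumPy is available, stores the joint distribution as a matrix `P` with rows indexed by $s$ and columns indexed by $x$, and takes $x_M$ to be the largest entry of $p_X$, which is what the bound $\|D_X^{1/2} y\|_2^2 \le x_M \|y\|_2^2$ requires; the random example and variable names are hypothetical. It uses the fact that the minimum of $\|Ay\|_2^2 / \|y\|_2^2$ over non-zero $y$ is the smallest eigenvalue of $A^T A$.

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((4, 3))
P /= P.sum()                                   # hypothetical joint pmf, rows: s, columns: x
p_s, p_x = P.sum(axis=1), P.sum(axis=0)

P_s_given_x = P / p_x                          # column x holds p_{S|X}(.|x)
Q = P / np.sqrt(np.outer(p_s, p_x))            # D_S^{-1/2} P_{S,X} D_X^{-1/2}

# Left-hand side of (103): min over y != 0 of ||P_{S|X} y||^2 / ||y||^2.
lhs = np.linalg.eigvalsh(P_s_given_x.T @ P_s_given_x).min()

# Right-hand side (106): s_m * delta / x_M.
delta = np.linalg.eigvalsh(Q.T @ Q).min()
s_m = p_s.min()
x_M = p_x.max()
rhs = s_m * delta / x_M
print(lhs, rhs)
```

The bound is typically loose, but it suffices for the proof: if the left-hand side tends to zero, then so must $\delta(p_{S,X})$.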

Corollary 8

Proof. If $\delta(p_{S,X}) = 0$, then the lower bound for $t^*$ follows from the construction used in (71) and, more specifically, by (i) maximizing the right-hand side of (71) across all functions in $\mathcal{F}_0$ and (ii) observing that the maximum value of $\epsilon$ such that $p_{Y|X}$ is non-negative is $\epsilon = 1/(2\|f\|_\infty)$. If $\delta(p_{S,X}) > 0$, then $\mathcal{F}_0$ is singular (i.e. $\mathcal{F}_0 = \{w_0\}$), and the lower bound (72) reduces to the trivial bound $t^* \ge 0$.

In order to prove that the lower bound is sharp, consider $S$ being an unbiased bit, drawn from $\{1, 2\}$, and $X$ the result of sending $S$ through an erasure channel with erasure probability 1/2 and $\mathcal{X} = \{1, 2, 3\}$, with 3 playing the role of the erasure symbol. Let

$$
f(x) \triangleq \begin{cases} 1, & x \in \{1, 2\}, \\ -1, & x = 3. \end{cases}
$$

Then $f \in \mathcal{F}_0$, $h_b\left(\frac{1}{2} + \frac{f(x)}{2\|f\|_\infty}\right) = 0$ for $x \in \mathcal{X}$ and $t^* = 1$. But, from Lemma 11, $t^* \le H(X|S) = 1$. The result follows.
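The quantities in this erasure example can be verified directly. The sketch below assumes that membership in $\mathcal{F}_0$ means $E[f(X)|S = s] = 0$ for every $s$ (the definition of $\mathcal{F}_0$ appears earlier in the paper and is not restated here), and the variable names are illustrative.

```python
from math import log2


def h_b(p):
    """Binary entropy in bits, with h_b(0) = h_b(1) = 0."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)


p_s = {1: 0.5, 2: 0.5}                      # S is an unbiased bit
# Erasure channel: X = S with probability 1/2, X = 3 (erasure) otherwise.
p_x_given_s = {1: {1: 0.5, 2: 0.0, 3: 0.5},
               2: {1: 0.0, 2: 0.5, 3: 0.5}}
f = {1: 1.0, 2: 1.0, 3: -1.0}

# E[f(X)|S=s] for each s (assumed membership condition for F_0).
cond_means = {s: sum(p_x_given_s[s][x] * f[x] for x in f) for s in p_s}

# h_b(1/2 + f(x) / (2 ||f||_inf)) for each x.
f_inf = max(abs(v) for v in f.values())
flip_entropies = {x: h_b(0.5 + f[x] / (2 * f_inf)) for x in f}

# H(X|S): given S = s, X is either s or the erasure symbol.
h_x_given_s = sum(p_s[s] * h_b(p_x_given_s[s][3]) for s in p_s)
print(cond_means, flip_entropies, h_x_given_s)
```

The conditional means vanish, the binary entropies are zero for every $x$, and $H(X|S)$ equals one bit, matching the values quoted in the proof.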


References

[1] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning From Data. AMLBook, Mar. 2012.

[2] L. Sweeney, “K-anonymity: a model for protecting privacy,” International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002.

[3] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in TCC, 2006.

[4] C. Dwork, “Differential privacy,” in Automata, Languages and Programming. Springer, 2006, vol. 4052, pp. 1–12.

[5] S. Salamatian, A. Zhang, F. P. Calmon, S. Bhamidipati, N. Fawaz, B. Kveton, P. Oliveira, and N. Taft, “How to hide the elephant-or the donkey-in the room: Practical privacy against statistical inference for large data,” IEEE GlobalSIP, 2013.

[6] S. Salamatian, A. Zhang, F. P. Calmon, S. Bhamidipati, N. Fawaz, B. Kveton, P. Oliveira, and N. Taft, “Managing your private and public data: Bringing down inference attacks against your privacy,” IEEE J. Sel. Topics Signal Process., October 2015.

[7] S. Bhamidipati, N. Fawaz, B. Kveton, and A. Zhang, “PriView: Personalized Media Consumption Meets Privacy against Inference Attacks,” IEEE Software, vol. 32, no. 4, pp. 53–59, Jul. 2015.

[8] C. E. Shannon, “Communication theory of secrecy systems,” Bell System Technical Journal, vol. 28, no. 4, pp. 656–715, 1949.

[9] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Wiley-Interscience, Jul. 2006.

[10] J. C. Duchi and M. J. Wainwright, “Distance-based and continuum Fano inequalities with applications to statistical estimation,” arXiv preprint arXiv:1311.2669, 2013.

[11] M. J. Greenacre, Theory and Applications of Correspondence Analysis. Academic Pr, Mar. 1984.

[12] L. Breiman and J. H. Friedman, “Estimating Optimal Transformations for Multiple Regression and Correlation,” Journal of the American Statistical Association, vol. 80, no. 391, pp. 580–598, Sep. 1985.

[13] A. Rényi, “On measures of dependence,” Acta mathematica hungarica, vol. 10, no. 3, pp. 441–451, 1959.

[14] I. Csiszár and P. C. Shields, “Information theory and statistics: A tutorial,” Communications and Information Theory, vol. 1, no. 4, pp. 417–528, 2004.

[15] A. Rényi, “On measures of dependence,” Acta Math. Hung., vol. 10, no. 3-4, pp. 441–451, Sep. 1959.

[16] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, 3rd ed. Upper Saddle River: Pearson, Aug. 2009.

[17] G. R. Kumar and T. A. Courtade, “Which Boolean functions maximize information of noisy inputs?” IEEE Trans. Inf. Theory, vol. 60, no. 8, pp. 4515–4525, Aug. 2014.

[18] R. Ahlswede, “Extremal properties of rate distortion functions,” IEEE Trans. on Info. Theory, vol. 36, no. 1, pp. 166–171, 1990.

[19] F. P. Calmon and N. Fawaz, “Privacy against statistical inference,” in Proc. 50th Ann. Allerton Conf. Commun., Contr., and Comput., 2012, pp. 1401–1408.

[20] A. Zhang, S. Bhamidipati, N. Fawaz, and B. Kveton, “PriView: Media Consumption and Recommendation Meet Privacy Against Inference Attacks,” in IEEE Web 2.0 Security and Privacy Workshop, 2014.

[21] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” in Proc. 37th Ann. Allerton Conf. Commun., Contr., and Comput., 1999, pp. 368–377.

[22] R. Ahlswede and P. Gács, “Spreading of sets in product spaces and hypercontraction of the Markov operator,” Ann. Probab., vol. 4, no. 6, pp. 925–939, Dec. 1976.

[23] V. Anantharam, A. Gohari, S. Kamath, and C. Nair, “On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover,” arXiv e-print 1304.6133, Apr. 2013.

[24] R. A. Horn and C. R. Johnson, Matrix Analysis, 2nd ed. Cambridge University Press, Oct. 2012.

[25] I. Csiszar, Information Theory And Statistics: A Tutorial. Now Publishers Inc, 2004.


[26] Y. Polyanskiy and S. Verdú, “Arimoto channel coding converse and Rényi divergence,” in Proc. 48th Ann. Allerton Conf. Commun., Contr., and Comput., 2010, pp. 1327–1333.

[27] M. Greenacre, Correspondence Analysis in Practice, Second Edition, 2nd ed. Chapman and Hall/CRC, May 2007.

[28] M. Greenacre and T. Hastie, “The geometric interpretation of correspondence analysis,” J. Am. Stat. Assoc., vol. 82, no. 398, pp. 437–447, Jun. 1987.

[29] H. O. Hirschfeld, “A connection between correlation and contingency,” in Proc. Cambridge Philos. Soc., vol. 31, 1935, pp. 520–524.

[30] H. Gebelein, “Das statistische problem der korrelation als variations- und eigenwertproblem und sein zusammenhang mit der ausgleichsrechnung,” ZAMM-Z. Angew. Math. Me., vol. 21, no. 6, pp. 364–379, 1941.

[31] O. Sarmanov, “Maximum correlation coefficient (nonsymmetric case),” Selected Translations in Mathematical Statistics and Probability, vol. 2, pp. 207–210, 1962.

[32] H. S. Witsenhausen, “On sequences of pairs of dependent random variables,” SIAM J. on Appl. Math., vol. 28, no. 1, pp. 100–113, Jan. 1975.

[33] Y. Polyanskiy, “Hypothesis testing via a comparator,” in Proc. 2012 IEEE Int. Symp. on Inf. Theory, Jul. 2012, pp. 2206–2210.

[34] M. Raginsky, “Logarithmic Sobolev inequalities and strong data processing theorems for discrete channels,” in Proc. 2013 IEEE Int. Symp. on Inf. Theory, Jul. 2013, pp. 419–423.

[35] F. P. Calmon, M. Varia, M. Médard, M. Christiansen, K. Duffy, and S. Tessaro, “Bounds on inference,” in Proc. 51st Ann. Allerton Conf. Commun., Contr., and Comput., Oct. 2013, pp. 567–574.

[36] A. Makur and L. Zheng, “Bounds between Contraction Coefficients,” arXiv:1510.01844 [cs, math], Oct. 2015.

[37] J. Liu, T. A. Courtade, P. Cuff, and S. Verdú, “Brascamp-Lieb Inequality and Its Reverse: An Information Theoretic View,” arXiv:1605.02818 [cs, math], May 2016.

[38] S.-L. Huang, C. Suh, and L. Zheng, “Euclidean information theory of networks,” IEEE Trans. on Info. Theory, vol. 61, no. 12, pp. 6795–6814, 2015.

[39] A. Buja, “Remarks on Functional Canonical Variates, Alternating Least Squares Methods and Ace,” The Annals of Statistics, vol. 18, no. 3, pp. 1032–1069, Sep. 1990.

[40] A. Makur, F. Kozynski, S.-L. Huang, and L. Zheng, “An efficient algorithm for information decomposition and extraction,” in Proceedings of the 53rd Annual Allerton Conference on Communication, Control and Computing, Allerton House, UIUC, Illinois, USA, 2015.

[41] W. Kang and S. Ulukus, “A new data processing inequality and its applications in distributed source and channel coding,” IEEE Trans. Inf. Theory, vol. 57, no. 1, pp. 56–69, 2011.

[42] A. Guntuboyina, “Lower bounds for the minimax risk using f-divergences, and applications,” IEEE Trans. Inf. Theory, vol. 57, no. 4, pp. 2386–2399, 2011.

[43] A. Guntuboyina, S. Saha, and G. Schiebinger, “Sharp inequalities for f-divergences,” arXiv:1302.0336, Feb. 2013.

[44] V. Doshi, D. Shah, M. Médard, and M. Effros, “Functional compression through graph coloring,” IEEE Trans. Inf. Theory, vol. 56, no. 8, pp. 3901–3917, Aug. 2010.

[45] A. Orlitsky and J. Roche, “Coding for computing,” IEEE Trans. Inf. Theory, vol. 47, no. 3, pp. 903–917, Mar. 2001.

[46] G. Kindler, R. O’Donnell, and D. Witmer, “Remarks on the Most Informative Function Conjecture at fixed mean,” arXiv:1506.03167, 2015.

[47] V. Anantharam, A. Gohari, S. Kamath, and C. Nair, “On hypercontractivity and the mutual information between Boolean functions,” in Proc. 51st Ann. Allerton Conf. Commun., Contr., and Comput., Oct. 2013, pp. 13–19.

[48] O. Ordentlich, O. Shayevitz, and O. Weinstein, “An Improved Upper Bound for the Most Informative Boolean Function Conjecture,” arXiv:1505.05794 [cs, math], May 2015.

[49] V. Chandar and A. Tchamkerten, “Most informative quantization functions,” in Proc. ITA Workshop, San Diego, CA, USA, 2014.


[50] A. Samorodnitsky, “The ‘Most informative Boolean function’ conjecture holds for high noise,” arXiv:1510.08656 [cs, math], Oct. 2015.

[51] I. S. Reed, “Information theory and privacy in data banks,” in Proc. of the National Computer Conference and Exposition, ser. AFIPS ’73. New York, NY, USA: ACM, June 1973, pp. 581–587.

[52] D. Rebollo-Monedero, J. Forné, and J. Domingo-Ferrer, “From t-closeness-like privacy to postrandomization via information theory,” IEEE Trans. on Knowledge and Data Engineering, vol. 22, no. 11, pp. 1623–1636, Nov. 2010.

[53] L. Sankar, S. Rajagopalan, and H. Poor, “Utility-Privacy Tradeoffs in Databases: An Information-Theoretic Approach,” IEEE Trans. on Inf. Forensics and Security, vol. 8, no. 6, pp. 838–852, Jun. 2013.

[54] R. Tandon, L. Sankar, and H. Poor, “Discriminatory lossy source coding: Side information privacy,” IEEE Transactions on Information Theory, vol. 59, no. 9, pp. 5665–5677, Sep. 2013.

[55] A. Evfimievski, J. Gehrke, and R. Srikant, “Limiting privacy breaches in privacy preserving data mining,” in Proceedings of the twenty-second ACM Symposium on Principles of Database Systems, New York, NY, USA, 2003, pp. 211–222.

[56] A. Makhdoumi and N. Fawaz, “Privacy-utility tradeoff under statistical uncertainty,” in Proc. 51st Annual Allerton Conference on Communication, Control, and Computation, 2013, pp. 1627–1634.

[57] Y. Polyanskiy and Y. Wu, “Dissipation of information in channels with input constraints,” arXiv:1405.3629 [cs, math], May 2014.

[58] C. T. Li and A. E. Gamal, “Maximal correlation secrecy,” arXiv:1412.5374 [cs, math], Dec. 2014.

[59] F. P. Calmon, M. Varia, and M. Médard, “On information-theoretic metrics for symmetric-key encryption and privacy,” in Proc. 52nd Annual Allerton Conference on Communication, Control, and Computing, 2014.

[60] S. Chakraborty, N. Bitouzé, M. Srivastava, and L. Dolecek, “Protecting data against unwanted inferences,” in 2013 IEEE Information Theory Workshop (ITW), Sep. 2013, pp. 1–5.

[61] S. Asoodeh, F. Alajaji, and T. Linder, “Notes on information-theoretic privacy,” in Proc. 52nd Ann. Allerton Conf. Commun., Contr., and Comput., Sep. 2014, pp. 1272–1278.

[62] S. Asoodeh, M. Diaz, F. Alajaji, and T. Linder, “Information Extraction Under Privacy Constraints,” arXiv preprint arXiv:1511.02381, 2015.

[63] A. Makhdoumi, S. Salamatian, N. Fawaz, and M. Médard, “From the information bottleneck to the privacy funnel,” in IEEE Inf. Theory Workshop (ITW), 2014, pp. 501–505.

[64] A. Makhdoumi, F. P. Calmon, and M. Médard, “Forgot your password: Correlation dilution,” in International Symp. on Info. Theory, 2015, pp. 2944–2948.

[65] S. Beigi and A. Gohari, “On the duality of additivity and tensorization,” in International Symp. on Info. Theory. IEEE, 2015, pp. 2381–2385.

[66] R. A. Horn and C. R. Johnson, Topics in Matrix Analysis. Cambridge University Press, 1994.

[67] S. P. Boyd and L. Vandenberghe, Convex optimization. Cambridge, UK; New York: Cambridge University Press, 2004.

[68] C. Deniau, G. Oppenheim, and J. P. Benzécri, “Effet de l’affinement d’une partition sur les valeurs propres issues d’un tableau de correspondance,” Cahiers de l’analyse des données, vol. 4, no. 3, pp. 289–297.

[69] R. O’Donnell, “Some topics in analysis of Boolean functions,” in Proc. 40th ACM Symp. on Theory of Computing, 2008, pp. 569–578.

[70] M. Raginsky, J. G. Silva, S. Lazebnik, and R. Willett, “A recursive procedure for density estimation on the binary hypercube,” Electron. J. Statist., vol. 7, pp. 820–858, 2013.

[71] R. O’Donnell, Analysis of Boolean Functions, 1st ed. New York, NY: Cambridge University Press, Jun. 2014.

[72] F. P. Calmon, M. Médard, L. Zeger, J. Barros, M. M. Christiansen, and K. R. Duffy, “Lists that are smaller than their parts: A coding approach to tunable secrecy,” in Proc. 50th Annual Allerton Conf. on Commun., Control, and Comput., 2012.

[73] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” arXiv:physics/0004057 [physics.data-an], Apr. 2000.


[74] S. Goldwasser and S. Micali, “Probabilistic encryption,” Journal of Computer and System Sciences, vol. 28, no. 2, pp. 270–299, Apr. 1984.

[75] R. G. Gallager, Information theory and reliable communication. New York: Wiley, 1968.

[76] A. Guntuboyina, “Minimax lower bounds,” Ph.D., Yale University, United States – Connecticut, 2011.

[77] S. Kamath and V. Anantharam, “Non-interactive simulation of joint distributions: The Hirschfeld-Gebelein-Rényi maximal correlation and the hypercontractivity ribbon,” in Proc. 50th Ann. Allerton Conf. Commun., Contr., and Comput. IEEE, 2012, pp. 1057–1064.

[78] T. Berger and R. Yeung, “Multiterminal source encoding with encoder breakdown,” IEEE Trans. on Inf. Theory, vol. 35, no. 2, pp. 237–244, Mar. 1989.

[79] A. W. Marshall, I. Olkin, and B. C. Arnold, Inequalities: theory of majorization and its applications. New York: Springer Series in Statistics, 2011.

[80] H. G. Eggleston, Convexity, 1st ed. Cambridge England: Cambridge University Press, Jan. 2009.
