Information-Theoretic Compressive Measurement Design

Liming Wang, Minhua Chen, Miguel Rodrigues, David Wilcox, Robert Calderbank and Lawrence Carin


Abstract—An information-theoretic projection design framework is proposed, of interest for feature design and compressive measurements. Both Gaussian and Poisson measurement models are considered. The gradient of a proposed information-theoretic metric (ITM) is derived, and a gradient-descent algorithm is applied in design; connections are made to the information bottleneck. The fundamental solution structure of such design is revealed in the case of a Gaussian measurement model and arbitrary input statistics. This new theoretical result reveals how ITM parameter settings impact the number of needed projection measurements, with this verified experimentally. The ITM achieves promising results on real data, for both signal recovery and classification.

Index Terms—Information-theoretic metric; information bottleneck; projection design; gradient of mutual information; compressive sensing.

1 INTRODUCTION

DIMENSIONALITY reduction plays a pivotal role in many machine-learning applications, including compressive measurements [1] and feature design [2], [3], [4]. Although there is interest in nonlinear representations [5], linear compressive measurements and feature design are desirable because they are amenable to theoretical analysis and guarantees [6], and such representations can also be readily implemented experimentally [7]. Linear dimensionality reduction is achieved by multiplying a signal of interest with a measurement matrix, and the number of rows of this matrix defines the number of measurements/features; that number is ideally small relative to the dimension of the original data vector. One typically assumes additive noise in the measurement, which is often modeled as being Gaussian [8], [9].

• L. Wang, R. Calderbank and L. Carin are with the Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708, USA. email: {liming.w, robert.calderbank, lcarin}@duke.edu

• M. Chen is with the Department of Computer Science and Statistics, University of Chicago, Chicago, IL 60637, USA. email: [email protected]

• D. Wilcox is with the Department of Chemistry, Purdue University, West Lafayette, IN 47907, USA. email: [email protected]

• M. Rodrigues is with the Department of Electronic and Electrical Engineering, University College London, London WC1E 7JE, U.K. email: [email protected]

However, there are low-photon-count applications for which a Gaussian noise model is inappropriate, and a Poisson measurement model is desired. There are far fewer theoretical results for the Poisson case.

Among the various criteria for measuring the information in a compressive measurement or in feature design, mutual information is widely used [10], [11], due to its ubiquitous presence as an information-loss measure, as well as its close relationship to Bayesian classification error [12]. Mutual-information-based measurement design for the Gaussian model has been considered in [8] for signal recovery, and in [4] for classification (feature design). If one designs compressive measurements based only on a classification criterion, the features may be good for an algorithm to classify, but a human may also wish to look at the underlying data (e.g., a medical scientist may wish to confirm an algorithm-defined diagnosis), and poor signal recovery may undermine that objective. It is therefore of interest to jointly balance the goals of feature design (for classification) and signal-recovery quality.

In this paper, we propose an information-theoretic metric (ITM) by balancing two performance metrics linked to the data. The proposed ITM is able to accommodate non-Gaussian inputs and Poisson measurement noise. Further, we also consider two types of constraints for the ITM, an energy constraint and an orthonormality constraint (on the characteristics of the projection vectors), which are widely utilized in many applications.

Under proper settings of the model, the proposed ITM is like the information bottleneck (IB) [13], which has been widely used in various applications, including document clustering [14], gene-expression analysis [15], video search [16], feature design [17] and speech recognition [18], among many others. The case for which the input signal is Gaussian and there is Gaussian measurement noise has been investigated in [19], where an analytic solution of the IB problem has been revealed.

In situations for which the input signal and/or the measurement is not Gaussian, the mutual information terms involved in the ITM generally do not possess analytic forms under almost all common input distributions. A goal of this paper is to develop numerical algorithms as well as theoretical results for projection-matrix design under the ITM, for general Gaussian and Poisson measurement models, for non-Gaussian input signals. Despite the lack of analytic mutual-information expressions, the gradients of mutual information for both Gaussian and Poisson measurement models possess relatively simple forms, which have connections to the Minimum Mean Square Error (MMSE) estimator [20], [8], [21], [22], [23]. Hence, gradient-descent methods may be adopted to infer a desired measurement matrix. In the context of compressive sensing, this paper is believed to be the first for which a balance between signal recovery and classification is considered.

The ITM setup has a parameter β that may be tuned, which weighs between the classification and signal-recovery objectives. As detailed below, for β < 0 we jointly consider the goals of classification and recovery; with β → −∞ we optimize only for signal recovery, and when β = 0 we optimize only for classification. For β > 0 the ITM considers classification with a penalty ("bottleneck") imposed on signal recovery. For β > 0, our new theory demonstrates that larger β encourages a smaller number of linear measurements/features used in the classifier. This result for β > 0 under the energy constraint has an interesting waterfilling-like interpretation, generalizing the mode-alignment scheme in [24], [8] to the IB. Further, the derived waterfilling-like interpretation for the ITM under the energy constraint is valid for an arbitrary source model and extends the result in [19].

The ITM under the orthonormality constraint is related to calculation of volumes on the Stiefel manifold, where the derivation of an analytical solution is known to be challenging [25]. We shed light on its analytical solution by postulating a Gaussian input and Gaussian noise with diagonal variance as in [19], and a fixed-point-type solution is derived, for which analytic solutions are realized for β → −∞ and β = 0. While aspects of this theory are limited to the case of a Gaussian source, the experiments (numerical methods) are not.

We demonstrate how the theoretical results may be used in practice, for projection design or supervised dimensionality reduction. Specifically, we provide numerical results on several datasets, and compare the performance of the proposed ITM to Linear Discriminant Analysis (LDA) [26], Information Discriminant Analysis (IDA) [12] and a Renyi-entropy method [2], [3]. For a Gaussian measurement model, limiting cases of the ITM correspond to special cases in [8] and [4], to which we also compare. For the Poisson measurement model, we apply the ITM on real measured data from a compressive photon-counting hyperspectral camera (the measurement system in [27]), and demonstrate that designed projection matrices can yield improved performance on preserving both signal-recovery and classification information, relative to random measurement design [28].

The remainder of the paper is organized as follows. In Section 2, we provide a formal definition of the proposed ITM and related optimization problems of interest. We derive explicit forms of the gradients of the ITM in Section 3, and summarize a gradient-based numerical algorithm for the ITM. In Section 4 we present two theorems on the solution structure for the ITM under energy and orthonormality constraints on the measurement matrix, and we elaborate on the interpretations of these theoretical results. We present experimental results for the ITM in Section 5, for both Gaussian and Poisson measurement models; for the Poisson case, the methods are applied to real data from a measurement system. Finally, we conclude the paper in Section 6.

2 INFORMATION-THEORETIC FRAMEWORK FOR DESIGNED MEASUREMENTS

2.1 Problem Statements

The setup we consider is characterized by the Markov sequence X → Y → Ỹ → X̂. Two types of signal models are considered. For the first, discrete X ∈ {1, . . . , L} denotes the class label, and Y ∈ R^n are data conditioned on X. Alternatively, for the second signal model, the discrete X in the Markov sequence is replaced by a continuous random variable X′ ∈ R^p, and (X′, Y) are assumed to be jointly Gaussian. Ỹ ∈ R^m is the compressive form of Y with m ≪ n, and X̂ is the estimated X (or X̂′ is the estimated X′), based on Ỹ.

For the first model, P_X(X = i) = π_i, with π_i > 0 and Σ_{i=1}^L π_i = 1. The random variable Y is modeled as drawn from a mixture model: P_Y = Σ_{i=1}^L P_X(X = i) P_{Y|X}(Y|X = i) = Σ_{i=1}^L π_i P_{Y|X}(Y|X = i). We consider general conditional distributions P_{Y|X}.

The mapping Y → Ỹ is stochastic, and is here effected first in terms of the deterministic linear mapping ΦY, where Φ ∈ R^{m×n} is a measurement matrix. For both the Gaussian and Poisson measurement models that are the focus of this paper, the m-dimensional signal ΦY defines the mean of the mapping Y → Ỹ. Specifically, for the Gaussian measurement channel

P_{Ỹ|Y} = N(Ỹ; ΦY, Λ^{-1}),    (1)

where Λ ∈ R^{m×m} is the precision of the additive noise in the Gaussian measurement.

For the Poisson measurement model

P_{Ỹ|Y} = Pois(Ỹ; ΦY + λ) = ∏_{i=1}^m Pois(Ỹ_i; (ΦY)_i + λ_i),    (2)

where Φ ∈ R^{m×n}_+, λ ∈ R^m_+ is the "dark current," and Ỹ_i represents component i of vector Ỹ (this subscript notation identifies vector components throughout the paper). For the Poisson case we also require that the components of Y are non-negative. One may view Y ∈ Z^n_+ as the n-dimensional Poisson rate of the original signal, and ΦY is the compressive form of the Poisson rate. In (2), Pois(Ỹ; ΦY + λ) denotes a vector Poisson distribution, while Pois(Ỹ_i; (ΦY)_i + λ_i) is the common scalar Poisson distribution; the form of the distribution is assumed understood from the context in which it is used, so as to not complicate notation.
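To make the two measurement channels concrete, the following is a minimal sketch (in Python/NumPy, with illustrative sizes and parameter values; it is not the acquisition code of any particular system) of drawing one measurement from the Gaussian model (1) and one from the Poisson model (2).

    import numpy as np

    rng = np.random.default_rng(0)

    def gaussian_channel(Phi, Y, Lambda):
        """Draw one noisy measurement from (1): Ytil ~ N(Phi Y, Lambda^{-1})."""
        noise_cov = np.linalg.inv(Lambda)          # Lambda is the m x m noise precision
        return rng.multivariate_normal(Phi @ Y, noise_cov)

    def poisson_channel(Phi, Y, dark):
        """Draw one count vector from (2): Ytil_i ~ Pois((Phi Y)_i + dark_i)."""
        rate = Phi @ Y + dark                      # Phi, Y and dark must be non-negative
        return rng.poisson(rate)

    # toy example: n = 4 dimensional signal, m = 2 measurements
    n, m = 4, 2
    Y = np.abs(rng.normal(size=n))                 # non-negative signal (required for Poisson)
    Phi = np.abs(rng.normal(size=(m, n)))          # non-negative projection matrix
    Lambda = np.eye(m) / 0.01                      # precision of the additive Gaussian noise
    dark = 0.1 * np.ones(m)                        # "dark current" lambda

    print(gaussian_channel(Phi, Y, Lambda))
    print(poisson_channel(Phi, Y, dark))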


For the second model discussed above, (X′, Y) are drawn jointly Gaussian, and therefore Y is not restricted to be nonnegative. Hence, the second model is restricted to the case in (1), the Gaussian measurement model. If interested in the Poisson measurement model, Y may be treated as the log of the Poisson rate.

The proposed ITM is characterized by a linear combination of the mutual informations I(X; Ỹ) and I(Y; Ỹ); Shannon entropy with the natural logarithm is considered below. The ITM setup for defining Φ is

max_Φ ITM(Φ, β) := max_Φ {I(X; Ỹ) − β I(Y; Ỹ)},    (3)

where β ∈ R controls the relative importance of the two mutual-information metrics.

We consider two different types of constraints on Φ. The first is the energy constraint:

(1/m) tr(ΦΦ^T) ≤ 1,    (4)

where the trace tr(·) manifests an energy regularization on the measurement. The second constraint imposes that the projections are orthonormal:

ΦΦ^T = I_m,    (5)

where I_m is the m×m identity matrix. In compressive sensing, one often requires that Φ obey unit-norm row constraints or, instead, orthonormality constraints [12]. For communications applications, one may rather desire the energy constraint, requiring the rows to have unit norm on average [29].

We consider the first signal model (discrete X) unless otherwise specified, for both the Gaussian and Poisson measurement models. It is assumed that P_X and P_{Y|X} are known a priori. However, we do not assume any specific form of P_{Y|X}; thus a general mixture model is considered for Y. On the other hand, for the second signal model, for which (X′, Y) are jointly Gaussian, the ITM for the Gaussian measurement model possesses an analytical expression, thereby enabling a direct optimization and analysis [8], [13].

2.2 Mutual Information Metrics

The mapping Y → Ỹ constitutes the heart of the compressive measurement or linear feature design, and we wish to design Φ such that it preserves desired information. The mutual information is employed as in (3), from two perspectives, which we motivate here.

In [8] the authors considered compressive sensing, and the goal was to recover Y from Ỹ. The measurement model P_{Ỹ|Y} was assumed to be Gaussian as in (1), and P_Y was arbitrary. In the experiments in [8], a mixture model was used for Y: P_Y = Σ_{i=1}^L P_{X=i} P_{Y|X=i}; in [8] the focus was not on recovering X, but rather on estimating Y. In that case the mutual information of interest is I(Y; Ỹ), and the goal is to design Φ such that I(Y; Ỹ) is maximized.

This design framework may be justified by noting that it has been shown recently that [30]

MMSE ≥ (1/(2πe)) exp{2[h(Y) − I(Y; Ỹ)]},    (6)

where h(Y) is the differential entropy of Y and MMSE = E{tr[(Y − E(Y|Ỹ))(Y − E(Y|Ỹ))^T]} is the minimum mean-square error, so that by maximizing mutual information one may hope to achieve a lower reconstruction error.

For the classification problem considered in [4], the objective was to recover the class label X from Ỹ, and therefore the goal there was to design Φ to maximize I(X; Ỹ). This is justified by recalling the Bayesian classification error P_e = ∫ P_Ỹ(y)[1 − max_x P_{X|Ỹ}(x|y)] dy, and noting that it has been shown in [31] that

P_e ≤ (1/2) H(X|Ỹ),    (7)

where H(X|Ỹ) = H(X) − I(X; Ỹ), and H(·) denotes the entropy of a discrete random variable. Since H(X) is independent of Φ, minimizing the upper bound to P_e is equivalent to maximizing I(X; Ỹ).

There are practical settings for which both signal recovery and classification are of interest simultaneously, which motivates the proposed ITM in (3) when β ≤ 0. Note that β = 0 corresponds to the case that only the classification task is of interest, and β → −∞ optimizes only for the signal recovery.

2.3 Connections to the Information Bottleneck

When β > 0, the proposed ITM is like the IB and the goal is to recover X from Ỹ, but one wishes to constrain the number of measurements/features m, this defining the "bottleneck". In the language of the IB [13], X is the "relevant" information we principally wish to recover; Y ∈ R^n is the input signal that depends on X, Ỹ ∈ R^m is the compressive measurement or representation of Y, and X̂ is the estimate of X, based on Ỹ. We note, however, that the IB does not typically impose constraints on Φ, like we do in (4) and (5).

Due to different settings in the model, our ITM setup in (3) differs from the one proposed in [13]. Most importantly, in [13] β was assumed to be nonnegative, which implies that the goal is to design Φ to maximize I(X; Ỹ) while simultaneously minimizing I(Y; Ỹ); since the maximization of I(X; Ỹ) and the minimization of I(Y; Ỹ) do not generally occur simultaneously for some Φ, the choice of β defines the relative importance of these two metrics on the design of Φ. In this manner the model effectively seeks a low-complexity representation, which is effective at retaining the "relevant" information X, while reducing complexity through a simplified Ỹ, that may not be effective for recovering Y. Our motivation is expanded beyond cases typically associated with the IB model, as we will also allow β < 0, permitting consideration of the case of joint interest in signal recovery and classification.

In general, one cannot perform the optimization in (3) analytically, and therefore numerical procedures are required.


Specifically, one must take the gradient of ITM(Φ, β) with respect to Φ, for specified choices of β. While ITM(Φ, β) itself may be difficult to compute, expanding on recent analysis, one may express the gradient of ITM(Φ, β) in closed form, amenable to optimization algorithms (e.g., gradient descent).

3 GRADIENTS OF THE ITM MODEL

3.1 Explicit Forms for the Gradients

We introduce two theorems on gradients of the ITM, for both Gaussian and Poisson measurement models. We always assume the regularity conditions, specifically, that the order of integration and differentiation can be interchanged freely, i.e., the order of the differential operators ∂/∂Φ_ij and the expectation operator E(·) may be interchanged. This assumption is mild and almost always valid in practice [32].

The following two theorems are straightforward to establish given results from [20], [4], [21].

Theorem 1. For the Gaussian measurement model in (1) under the first signal model, for arbitrary P_{Y|X} such that the mixture distribution P_Y is consistent with the assumed regularity conditions, the gradient of the ITM, ∇_Φ ITM(Φ, β), with respect to the projection matrix Φ is given by

∇_Φ ITM(Φ, β) = ΛΦĒ − βΛΦE,    (8)

where E = E[(Y − E[Y|Ỹ])(Y − E[Y|Ỹ])^T] is the MMSE matrix, and Ē = E[(E[Y|Ỹ, X] − E[Y|Ỹ])(E[Y|Ỹ, X] − E[Y|Ỹ])^T].

Theorem 2. For the Poisson measurement model in (2) under the first signal model, for arbitrary P_{Y|X} such that the mixture distribution P_Y is consistent with the assumed regularity conditions, the gradient of the ITM, ∇_Φ ITM(Φ, β), with respect to the projection matrix Φ is given by

[∇_Φ ITM(Φ, β)]_ij = E[E[Y_j|Ỹ, X] log(E[(ΦY)_i|Ỹ, X] / E[(ΦY)_i|Ỹ])] − β{E[Y_j log((ΦY)_i)] − E[E[Y_j|Ỹ] log E[(ΦY)_i|Ỹ]]},    (9)

where (·)_ij denotes the ij-th entry of a matrix and (·)_i denotes the i-th entry of a vector.

All theorem proofs are provided in the Appendix. Note that Theorems 1 and 2 are valid for an arbitrary mixture model for P_Y, provided it satisfies the regularity conditions.

3.2 Gradient-Based Numerical Design

A numerical solution to the optimization in (3) can be realized via a gradient-descent method. The matrices E and Ē involved in Theorems 1 and 2 can be readily calculated by Monte Carlo integration; we elaborate on this calculation when presenting experimental results. The algorithm is summarized as follows:

1) Initialize Φ and either set or learn all the input distributions from the training datasets.

2) Use Monte Carlo integration to calculate the gradient. For the Gaussian measurement model, the posteriors P_{Y|Ỹ} and P_{Y|Ỹ,X} possess analytical expressions when the signal Y is assumed to be a Gaussian mixture model (GMM) [4]. Likewise, these posteriors may be readily approximated when the signal Y is assumed to be a log-GMM, and the details are presented in Section 5.4. Therefore, the calculation of the gradients in Theorems 1 and 2 reduces to calculating the expectation E[f(Ỹ, X)] of some function f. It can be approximated via the Monte Carlo integration E[f(Ỹ, X)] ≈ (1/S) Σ_{i=1}^S f((ỹ, x)_i), where {(ỹ, x)_i}, i = 1, . . . , S, are samples drawn from the distribution P_{Ỹ,X}, which can be obtained via the well-known Metropolis-Hastings algorithm [33].

3) Update the projection matrix as Φ_new = proj(Φ_old + δ∇_Φ ITM(Φ, β)), where δ is the step size and proj(·) projects the matrix onto the feasible set defined either by the energy constraint in (4) or the orthonormality constraint in (5), i.e., re-normalize the matrix to satisfy the energy constraint, or use the Gram-Schmidt method to ortho-normalize the matrix to satisfy the orthonormality constraint (a sketch of this projected update appears at the end of this subsection).

4) Repeat the previous step until convergence.

The above procedure (particularly the Monte Carlo integration) is attractive to implement for the specific forms of P_{Y|X} assumed when presenting experimental results. In general, the ITM is neither a convex nor a concave function of Φ, and therefore we are not guaranteed a globally optimal solution. In all experiments we have considered, the solution converged to a useful/effective solution from a random start. When presenting experimental results, we also discuss how the positivity constraint on Φ is imposed for the Poisson measurement model.

4 THEORETICAL ANALYSIS

In the previous section we proposed a numerical algorithm for design of the measurement matrix Φ based on the gradient-descent method, for solving the ITM construction in (3), with either constraint (4) or (5). We now investigate the theoretical characteristics of the solution structure. In [19], an analytic solution to the unconstrained IB model has been investigated, under the assumptions that X′ is a continuous random variable and (X′, Y) are jointly Gaussian (our second signal model), and a Gaussian measurement model has been assumed as well; by contrast, for our first signal model, of principal concern, X is discrete. In this section, we first present a generalized water-filling-like [34], [35] solution to the Gaussian measurement model with the energy constraint as in (4). Specifically, as in [19] we assume a Gaussian measurement model, but we consider general statistics for Y, rather than assuming P_Y is Gaussian (because X is also not Gaussian, but is discrete).

Assuming the Gaussian measurement model in (1), we first define symbols that will be used in the statement of the theorem. Consider Ē and E as defined in Theorem 1. The covariance matrices of X, X′, Y, Y|X and Y|X′ are denoted Σ_X, Σ_{X′}, Σ_Y, Σ_{Y|X} and Σ_{Y|X′}, respectively. Define E′_β = Ē − βE. Consider the singular value decomposition of Φ, and eigenvalue decompositions of E′_β and Σ = Λ^{-1}: Φ = U_Φ D_Φ V_Φ^T, E′_β = U_{E′_β} D_{E′_β} U_{E′_β}^T and Σ = U_Σ D_Σ U_Σ^T, respectively, where U_Φ, V_Φ, U_{E′_β} and U_Σ are m×m, n×n, n×n and m×m orthonormal matrices, respectively. The eigenvalues in D_Φ and D_{E′_β} are in descending order. The eigenvalues in D_Σ are in ascending order, and we express D_Σ = diag(σ²_1, . . . , σ²_m).

Theorem 3. The optimal linear matrix Φ* for the ITM for the Gaussian measurement model in (1), under the first signal model and the energy constraint in (4), for arbitrary source statistics P_Y, is

Φ* = U*_Φ D*_Φ V*_Φ^T,    (10)

where U*_Φ = U_Σ, V*_Φ = U_{E′_β} and D*_Φ = [diag(√λ*_1, . . . , √λ*_m), 0]. Define

f_i(β, λ*_i) = σ²_i {[U^T_{E′_β} Ē U_{E′_β}]_ii − β[U^T_{E′_β} E U_{E′_β}]_ii}^{-1}.    (11)

The elements λ*_1, . . . , λ*_m satisfy the following fixed-point solution:

if 1/η < f_i(β, 0), then λ*_i = 0; otherwise λ*_i is such that 1/η = f_i(β, λ*_i),    (12)

where η ≥ 0 is the Lagrange multiplier which ensures the energy constraint (1/m) tr(ΦΦ^T) ≤ 1. In particular, if β > 1, λ*_i = 0 for i = 1, . . . , m.

The ITM with the orthonormality constraint under arbitrary P_{Y|X} can be viewed as an optimization problem related to calculating volumes on the Stiefel manifold, where the derivation of an analytical solution is known to be particularly challenging [25]. Rather than seeking the solution under the first signal model, we alternatively address the problem with the second signal model, in which (X′, Y) are jointly Gaussian and the noise precision matrix is diagonal, Λ = σ^{-1} I_m with σ > 0. For this case a structural fixed-point-type solution is presented, and analytical solutions when β = 0 and β → −∞ can be readily deduced. We now present a result for the solution of the ITM under the orthonormality constraint.

Theorem 4. Assume the second signal model in which (X′, Y) are jointly Gaussian and the noise precision Λ = σ^{-1} I_m with σ > 0. There exists a nonsingular matrix W(Φ) ∈ R^{m×m} which simultaneously diagonalizes Φ(Σ_Y + σI_n)Φ^T and Φ(Σ_{Y|X′} + σI_n)Φ^T, such that WΦ(Σ_Y + σI_n)Φ^T W^T = I_m and WΦ(Σ_{Y|X′} + σI_n)Φ^T W^T = diag{α_1, . . . , α_m}. We have that {α_i}_{i=1}^m is a collection of m arbitrary eigenvalues out of the n eigenvalues {ω_i}_{i=1}^n of the matrix (Σ_Y + σI_n)^{-1/2}(Σ_{Y|X′} + σI_n)(Σ_Y + σI_n)^{-1/2}, whose specific choice of those m eigenvalues depends on Φ, i.e., {α_i}_{i=1}^m = ∪_{i∈Γ_1(Φ)} {ω_i} for some Γ_1(Φ) ⊂ {1, 2, . . . , n} depending on Φ with |Γ_1| = m. Let U(Φ) ∈ R^{n×m} be constituted by the m eigenvectors corresponding respectively to the eigenvalues {α_i}_{i=1}^m. The optimal orthogonal linear matrix Φ* under the orthonormality constraint with β ∈ (−∞, 0) satisfies the following fixed-point equation:

Φ* = W^{-1}(Φ*) U^T(Φ*) (Σ_Y + σI_n)^{-1/2},    (13)

and the following optimization:

arg min_Φ {log det((W(Φ))²) − (1/β) Σ_{i∈Γ_1(Φ)} log ω_i}.    (14)

When β = 0, the above fixed-point solution admits an analytic solution,

Φ* = orth(U_1^T (Σ_Y + σI_n)^{-1/2}),    (15)

where U_1 is constituted by the eigenvectors corresponding to the smallest m eigenvalues of the matrix (Σ_Y + σI_n)^{-1/2}(Σ_{Y|X′} + σI_n)(Σ_Y + σI_n)^{-1/2}, and orth(·) denotes an orthonormal basis of the row space of the argument matrix.

When β → −∞, the above fixed-point solution also admits an analytic solution,

Φ* = orth(U_2^T (Σ_Y + σI_n)^{-1/2}),    (16)

where U_2 is constituted by the eigenvectors corresponding to the smallest m eigenvalues of the matrix (Σ_Y + σI_n)^{-1}.
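As an illustration of the analytic β = 0 solution in (15), the following sketch (Python/NumPy, with illustrative argument names) computes Φ* from Σ_Y, Σ_{Y|X′} and σ. It assumes the identity matrices added to the n×n covariances are n×n (the dimensionally consistent reading of Theorem 4) and uses a QR factorization to realize orth(·).

    import numpy as np

    def itm_orthonormal_beta0(Sigma_Y, Sigma_Y_given_Xp, sigma, m):
        """Analytic solution (15) for beta = 0 under the orthonormality constraint.

        Sigma_Y and Sigma_Y_given_Xp are the n x n (conditional) covariances of Y.
        """
        n = Sigma_Y.shape[0]
        A = Sigma_Y + sigma * np.eye(n)
        # symmetric inverse square root of A
        w, V = np.linalg.eigh(A)
        A_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
        # eigenvectors of A^{-1/2} (Sigma_{Y|X'} + sigma I) A^{-1/2} for the m smallest eigenvalues
        M = A_inv_sqrt @ (Sigma_Y_given_Xp + sigma * np.eye(n)) @ A_inv_sqrt
        w2, V2 = np.linalg.eigh(M)                 # eigh returns eigenvalues in ascending order
        U1 = V2[:, :m]
        # orth(.): orthonormal basis of the row space of U1^T A^{-1/2}
        Q, _ = np.linalg.qr((U1.T @ A_inv_sqrt).T)
        return Q[:, :m].T                          # Phi* with Phi* Phi*^T = I_m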

4.1 Discussion

Theorems 3 and 4 provide fixed-point-type solutions for the form of Φ*, under the energy and orthonormality constraints, respectively. Note that f_i(β, λ*_i) defined in (11) is a function of Φ*, and hence so are all {λ*_i} and all {σ_i}, through the matrices Ē and E, which are both a function of the complete Φ*. Therefore, f_i(β, λ*_i) should be viewed in (11) as a function of λ*_i, with all other parameters associated with Φ* at their optimal settings (the {σ_i} are not optimized, but are assumed known characteristics of the measurement noise). This suggests an iterative solution for the parameters of Φ* in (10) under the energy constraint. Similarly, an iterative solution is also available for Φ* under the orthonormality constraint via (13). In the below experiments we perform design of Φ* based on gradient descent (using Theorems 1 and 2), since that approach is also applicable to the Poisson case we consider in experiments. The key contribution of Theorem 3 is that it provides insight into the characteristics of the solution of the ITM for Gaussian measurements but arbitrary source statistics P_{Y|X}; these insights are consistent with findings of experimental results in Section 5. Theorem 4 sheds light on the solution structure of the ITM under the orthonormality constraint, and provides analytic solutions for the two extreme cases β = 0 and β → −∞.

The values {f_i(β, 0)} correspond to "steps", and if the "water level" 1/η is lower than the step level f_i(β, 0), then the corresponding λ*_i = 0. When the water level 1/η > f_i(β, 0), the corresponding λ*_i ≠ 0. The non-zero values λ*_i are not the difference 1/η − f_i(β, 0), but rather the value λ†_i for which we achieve equality 1/η = f_i(β, λ†_i).
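The following is an illustrative sketch (Python/NumPy) of evaluating rule (12) for a given water level 1/η, treating each f_i(β, ·) as a supplied monotone function; in Theorem 3 the f_i themselves depend on Φ*, so the actual design requires an outer fixed-point (or gradient-based) iteration around this step.

    import numpy as np

    def waterfill(f_list, water_level, lam_max=1e3, tol=1e-10):
        """Illustrative evaluation of the fixed-point rule (12).

        f_list[i] is a callable lam -> f_i(beta, lam), assumed continuous and increasing
        in lam (a modelling assumption for this sketch); water_level is 1/eta.
        """
        lams = []
        for f in f_list:
            if water_level < f(0.0):
                lams.append(0.0)                   # the "step" f_i(beta, 0) is above the water level
                continue
            lo, hi = 0.0, lam_max
            while hi - lo > tol:                   # bisection for 1/eta = f_i(beta, lam)
                mid = 0.5 * (lo + hi)
                if f(mid) < water_level:
                    lo = mid
                else:
                    hi = mid
            lams.append(0.5 * (lo + hi))
        return np.array(lams)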


The results in Theorems 3 and 4 reduce to published results for special cases. For example, when β = 0 the problem reduces to the goal of classification based on Ỹ, without any concern for recovering Y, or on the complexity of Φ* [4]. In this case f_i(β = 0, λ*_i) is only a function of the generalized MMSE matrix Ē, and does not depend on the spectral properties of E, which is explicitly tied to reconstruction quality. It can also be demonstrated that in the limiting case where β → −∞, we recover the result in [8], which was only concerned with recovering Y from Ỹ.

We now examine what the result in Theorem 3 implies. First consider the role of σ_i in f_i(β, λ*_i). The smaller the noise variance σ²_i, the more likely 1/η > f_i(β, 0), and therefore the more likely that the i-th eigenvector of the noise covariance will contribute to the utilized U*_Φ on the left decomposition of Φ*. What this says is that Φ*Y ∈ R^m resides in a linear subspace of R^m spanned by the weakest-energy noise eigenvectors (the number of which are "turned on" depends on the water level 1/η, which is connected to the energy constraint). This is expected: by utilizing the eigenvectors of Σ where the additive noise is weakest to define the utilized elements of U*_Φ, the projected signal resides in a linear subspace where the noise is weakest. This characteristic of the left singular vectors of Φ* is consistent across all designs for a Gaussian measurement model [8], [4].

Concerning the right singular vectors of Φ*, they are linked to the eigenspectrum of the MMSE matrix E and the generalized form Ē. For the simplified case of a Gaussian source P_Y and a focus only on recovery (β → −∞), this simplifies to the case in [8], for which the right singular vectors of Φ* span a linear subspace where the input signal is largest, corresponding to the principal components of the source.

We now consider the role of β under the energy constraint, which from (3) impacts the care one places in recovery (the degree to which I(Y; Ỹ) is emphasized). For large and positive β, large values of I(Y; Ỹ) are de-emphasized, implying that the measurement matrix Φ* should only capture the most "relevant" information about the label X, but the "complexity" of Φ* should be as little as possible. This was discussed in [19] heuristically for a Gaussian source; here we extend it to arbitrary input statistics P_{Y|X}, and we make the heuristic reasoning explicit. Note that as β becomes more positive, more indices i are likely to satisfy the inequality η > (1/σ²_i){[U^T_{E′_β} Ē U_{E′_β}]_ii − β[U^T_{E′_β} E U_{E′_β}]_ii}, which means that the associated λ*_i = 0. This demonstrates explicitly that with increasing β the notion of reducing the complexity of Φ* means that its rank is diminished; we emphasize and demonstrate below that this defines the number of needed measurements m. A reduced rank of Φ* implies that fewer measurements (features) are used, because the number of measurements may be set to the rank. Hence, for increasing β > 0 the "bottleneck" in the ITM is the number of features/measurements used for classification.

While Theorem 3 is only for a Gaussian measurement model, we observed the same phenomenon with respect to β for the Poisson measurement model. Specifically, we observed that increasing β favors lower-rank Φ*, even in the Poisson case (detailed in Section 5.4).

According to Theorem 4, the optimal orthogonal projection is applied to the input signal Y via the following three steps. First, Y is multiplied by the matrix (Σ_Y + σI_n)^{-1/2}, so that the signal Y is transformed in a way that incorporates the variance of the noise Σ. The second step, where the essential compression occurs, is that the transformed signal is then mapped via U^T to the eigenspaces spanned by m eigenvectors of (Σ_Y + σI_n)^{-1/2}(Σ_{Y|X′} + σI_n)(Σ_Y + σI_n)^{-1/2}, and the specific choice of such m eigenvectors is manifested via the optimization in (14). Finally, an invertible transform W^{-1} is applied to form an orthonormal basis of the compressed m-dimensional space.

Since Φ is naturally required to be full-rank, the range of β under the orthonormality constraint is confined to β ∈ (−∞, 0]. When β → −∞, the ITM is simplified to the case where pure signal recovery is considered, and β = 0 corresponds to the pure classification problem. As we observe from the analytic expressions presented in Theorem 4, in the former extreme case the m-dimensional eigen-matrix U_2 solely depends on the signal variance Σ_Y itself, whereas in the latter case the conditional variance Σ_{Y|X′} plays a role in the matrix U_1.

5 EXPERIMENTS

5.1 Summary of the Proposed Algorithms

The experiments considered below are performed in the context of the first model discussed in Sec. 2.1 (discrete X). There are three steps associated with implementing the above methods in the subsequent experiments:

1) Learn P_{Y|X} and P_Y, based on available training data. The X corresponds to the class label in the examples considered. The distribution P_{Y|X} is modeled as a GMM, and the GMM parameters are learned using the EM algorithm based on training data, as in [36]. There is a discrete probability distribution on X (probability of each data class), and therefore P_Y is also a GMM (discussed further below).

2) Given the learned signal statistics, design the optimal projection Φ via the algorithm described in Section 3.2.

3) In the testing phase, we test the accuracy of the estimated X and/or Y, based on the designed Φ. To estimate X and/or Y, we use the GMM signal model from Step 1. To estimate Y, we use the same analytic CS inversion method as introduced in [36]. When estimating X, we maximize P_{X|Ỹ} (which can be expressed analytically, given the signal model).

Providing further details, in our experiments for the Gaussian measurement model we impose the GMM P_{Y|X=i} = Σ_{j=1}^{L(i)} ν_j^{(i)} N(μ_j^{(i)}, Σ_j^{(i)}), meaning that the distribution of Y for each discrete X is a GMM. Further, P_Y, with the class label summed out, is also a GMM: P_Y = Σ_{i=1}^L P_{Y|X=i} P_{X=i} = Σ_{i=1}^L Σ_{j=1}^{L(i)} π_i ν_j^{(i)} N(μ_j^{(i)}, Σ_j^{(i)}), where ν_j^{(i)}, j = 1, . . . , L(i), are the GMM coefficients within class i, and μ_j^{(i)} and Σ_j^{(i)}, j = 1, . . . , L(i), are the means and covariance matrices of the respective L(i) Gaussian components. As detailed in [36], [4], the GMM parameters may be readily learned based on training data.

In addition to being easily sampled, the GMM has the advantage of an analytic posterior P_{Y|Ỹ}, which is also a GMM [36]; specifically, P_{Y|Ỹ} = Σ_{i=1}^L π̃_i P_{Y|Ỹ,X=i}, with analytic expressions for the updated weights {π̃_i} and for P_{Y|Ỹ,X=i} (see [36] for details on these GMM expressions). This posterior naturally induces an analytical Bayesian classifier max_i P_{X=i|Ỹ}, where P_{X=i|Ỹ} = π̃_i, and an analytical MMSE estimator Ŷ for the input signal Y, Ŷ := E[Y|Ỹ] = Σ_{i=1}^L π̃_i Σ_{j=1}^{L(i)} ν̃_j^{(i)} μ̃_j^{(i)}, where ν̃_j^{(i)} and μ̃_j^{(i)} denote the posterior-updated mixture weights and component means.

5.2 Related Methods

IDA [12] and LDA [26] are two popular information-theoretic supervised dimensionality reduction algorithms, and they both are derived for a mixture model for Y across the L classes, i.e., P_Y = Σ_{i=1}^L π_i N(μ_i, Σ_i) (LDA and IDA assume a single Gaussian per class). LDA aims to simultaneously maximize between-class features and minimize the scatter of the projected data within each class. Let μ_Y = Σ_{i=1}^L π_i μ_i denote the mean of Y, let Σ_Y = Σ_{i=1}^L π_i(Σ_i + (μ_i − μ_Y)(μ_i − μ_Y)^T) represent the covariance of Y, and let Λ be the precision matrix of the Gaussian noise. The information-theoretic criterion I_LDA(X; Ỹ) is

I_LDA(X; Ỹ) = (1/2) log((2πe)^m det(ΦΣ_Y Φ^T + Λ^{-1})) − (1/2) log((2πe)^m det(Φ(Σ_{i=1}^L π_i Σ_i)Φ^T + Λ^{-1})).    (17)

The optimal solution maximizing I_LDA(X; Ỹ) is available analytically [26].

IDA seeks to maximize mutual information I(X; Ỹ) directly, which can be expressed as I_IDA(X; Ỹ) = h(Ỹ) − h(Ỹ|X), where h(·) is the differential entropy, and h(Ỹ|X) = (1/2) Σ_{i=1}^L π_i log((2πe)^m det(ΦΣ_i Φ^T + Λ^{-1})). In order to calculate h(Ỹ), IDA imposes a Gaussian approximation, with learned covariance Σ_Y. Hence, under this approximation, h(Ỹ) = (1/2) log((2πe)^m det(ΦΣ_Y Φ^T + Λ^{-1})). The maximization of I_IDA(X; Ỹ) can be carried out by the gradient-descent method [12].

Making a connection to the proposed ITM framework in (3), consider the case β = 0, for which we are only interested in classification. Further, assume only a single Gaussian (not a GMM) per class X. In this case the ITM reduces to the IDA criterion. Hence, in the limit β → 0, the difference between our ITM and IDA is that in the ITM we employ a GMM per class X. Of course, the ITM also extends to cases β ≠ 0.

In order to apply LDA and IDA, we must convert the model P_{Y|X=i} = Σ_{j=1}^{L(i)} ν_j^{(i)} N(μ_j^{(i)}, Σ_j^{(i)}) used in our projection design to a form that LDA and IDA may employ (a single Gaussian per class). We use the overall mean and covariance for class X = i as the associated parameters of the class-i Gaussian model. While this simplified model is used in LDA and IDA design, classification and signal recovery with the Φ so learned are performed using the full mixture model, with P_{Y|X=i} = Σ_{j=1}^{L(i)} ν_j^{(i)} N(μ_j^{(i)}, Σ_j^{(i)}). Therefore, all algorithms studied here use the same model for classification and for estimation of Y, and the only difference is manifested in how Φ is learned. A similar LDA and IDA design and test procedure was employed in [4].

We also consider the information-theoretic supervised dimensionality reduction method proposed in [2], [3]. Instead of using the Shannon entropy, they used the quadratic Renyi entropy to define the mutual information as

I_T(X; Ỹ) = Σ_{i=1}^L ∫ (P_{Ỹ=y,X=i} − P_{Ỹ=y} P_{X=i})² dy.    (18)

The derivative of the quadratic Renyi mutual information can be expressed analytically for the GMM signal [2], [3].
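For one projected Gaussian per class, (18) can be evaluated in closed form using the Gaussian overlap identity ∫ N(y; m_1, C_1) N(y; m_2, C_2) dy = N(m_1; m_2, C_1 + C_2); a sketch (Python/SciPy, illustrative names) is given below.

    import numpy as np
    from scipy.stats import multivariate_normal as mvn

    def overlap(mu1, C1, mu2, C2):
        """Gaussian overlap integral: int N(y; mu1, C1) N(y; mu2, C2) dy."""
        return mvn.pdf(mu1, mean=mu2, cov=C1 + C2)

    def renyi_quadratic_mi(pis, mus, Covs):
        """Quadratic-Renyi mutual information (18) for one Gaussian per projected class.

        mus[i], Covs[i] are the mean and covariance of P(Ytil | X = i); pis[i] = P(X = i).
        """
        L = len(pis)
        G = np.array([[overlap(mus[a], Covs[a], mus[b], Covs[b]) for b in range(L)]
                      for a in range(L)])
        total = 0.0
        for i in range(L):
            cross = sum(pis[j] * G[i, j] for j in range(L))
            marg2 = sum(pis[j] * pis[k] * G[j, k] for j in range(L) for k in range(L))
            total += pis[i] ** 2 * (G[i, i] - 2 * cross + marg2)
        return total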

5.3 Satellite and USPS Data

For the Gaussian measurement model in (1), we conduct experiments on the Satellite and USPS datasets. Both datasets are available online and have been used in [12], [4]. The Satellite dataset contains 36-dimensional feature vectors of satellite data, comprised of pixel values of a 3 × 3 neighborhood in 4 spectral channels. The six class labels for the central pixel are red soil, cotton crop, grey soil, damp grey soil, soil with vegetation stubble, and very damp grey soil. The training set contains 4435 samples, and the testing set contains 2000 samples. The USPS data contain grey-scale images of dimension 256 for ten handwritten single digits. There are 7291 training samples and 2007 testing samples.

As stated above, the parameters of the GMM are learned via the EM algorithm. Unless otherwise stated, we consider L(i) = 5 mixture components for each class; similar results were found for L(i) = 1 and L(i) = 10. The noise covariance matrix Σ is set to 10^{-6} I_m, where I_m is the m×m identity matrix (we are essentially doing feature learning in this application, as in [4]). When performing the Monte Carlo integrations, 500 samples were applied to calculate E and Ē. The performance of the gradient-descent algorithm can be significantly affected by the choice of step size, and a smaller step size usually leads to better performance but a slower convergence rate. For these two datasets, we find that a good trade-off between speed and performance is achieved when the step size is chosen between 0.1% and 1% of tr(ΦΦ^T), and we use 500 gradient iterations in the experiments on these two datasets. We compare performance of the proposed ITM under various settings of β to the LDA, IDA and Renyi methods.

In Tables 1, 2, 3 and 4 we present, respectively, classification accuracy and fractional error for signal recovery on the Satellite data under the energy and orthonormality constraints with L(i) = 5, where the fractional error is defined as ‖Y − Ŷ‖_2^2 / ‖Y‖_2^2, with Ŷ the estimate of Y.


Table 1. Classification accuracy on the Satellite dataset under the energy constraint with L(i) = 5. PC and PSR stand for pure classification (as in [4]) and pure signal recovery (as in [8]), which correspond to β = 0 and β = −1000, respectively.

m   IDA     LDA     Renyi   PC      PSR     β=−10   β=−5    β=−2    β=−1.5  β=−1    β=−0.5
1   0.4940  0.6325  0.6213  0.6455  0.4835  0.4815  0.4840  0.4860  0.4865  0.4890  0.4935
2   0.8190  0.7925  0.8012  0.8295  0.8230  0.8270  0.8265  0.8275  0.8260  0.8262  0.8265
3   0.8565  0.8535  0.8551  0.8600  0.8400  0.8415  0.8440  0.8450  0.8545  0.8550  0.8575
4   0.8675  0.8560  0.8610  0.8695  0.8525  0.8545  0.8620  0.8625  0.8620  0.8675  0.8690
5   0.8685  0.8640  0.8655  0.8680  0.8650  0.8660  0.8675  0.8685  0.8690  0.8685  0.8695

Table 2. Fractional error of the signal recovery on the Satellite dataset with various values of β under the energy constraint with L(i) = 5.

m   IDA     LDA     Renyi   PC      PSR     β=−10   β=−5    β=−2    β=−1.5  β=−1    β=−0.5
1   2.2632  2.5170  2.3270  2.3882  2.2552  2.2588  2.2608  2.2704  2.2761  2.2903  2.3334
2   0.6451  1.9027  0.6761  0.5558  0.5337  0.5341  0.5377  0.5430  0.5449  0.5482  0.5524
3   0.6244  0.6876  0.6512  0.5233  0.4250  0.4216  0.4311  0.4400  0.4501  0.4604  0.4897
4   0.5143  0.5975  0.5170  0.4718  0.3204  0.3271  0.3275  0.3285  0.3290  0.3298  0.3320
5   0.4871  0.5813  0.5042  0.3889  0.2903  0.2909  0.2919  0.2930  0.2929  0.2913  0.2989

Table 3. Classification accuracy on the Satellite dataset under the orthonormality constraint with L(i) = 5. PC and PSR stand for pure classification (as in [4]) and pure signal recovery (as in [8]), which correspond to β = 0 and β = −1000, respectively.

m   IDA     LDA     Renyi   PC      PSR     β=−10   β=−5    β=−2    β=−1.5  β=−1    β=−0.5
1   0.4840  0.5385  0.5211  0.5940  0.4785  0.4805  0.4810  0.4835  0.4860  0.4935  0.4900
2   0.8240  0.7760  0.7904  0.8290  0.7990  0.8050  0.8070  0.8080  0.8095  0.8125  0.8255
3   0.8480  0.8500  0.8502  0.8525  0.8235  0.8280  0.8300  0.8315  0.8420  0.8465  0.8480
4   0.8680  0.8575  0.8619  0.8710  0.8610  0.8615  0.8620  0.8635  0.8640  0.8655  0.8695
5   0.8720  0.8620  0.8683  0.8725  0.8670  0.8635  0.8645  0.8675  0.8675  0.8715  0.8795

Table 4. Fractional error of signal recovery on the Satellite dataset with various values of β under the orthonormality constraint with L(i) = 5.

m   IDA     LDA     Renyi   PC      PSR     β=−10   β=−5    β=−2    β=−1.5  β=−1    β=−0.5
1   2.3318  3.7341  2.8723  2.4284  2.2841  2.2847  2.2852  2.2896  2.2949  2.3071  2.3356
2   0.6811  1.4982  0.7497  0.6846  0.6240  0.6286  0.6365  0.6439  0.6510  0.6684  0.6773
3   0.6524  0.6111  0.6237  0.5499  0.4906  0.4976  0.4993  0.5004  0.4993  0.5203  0.5287
4   0.5343  0.5393  0.5562  0.4504  0.3617  0.3633  0.3632  0.3782  0.3797  0.3863  0.3894
5   0.4284  0.5228  0.4342  0.3883  0.2906  0.2901  0.2909  0.2987  0.2983  0.3006  0.3079

Table 5. Classification accuracy on the Satellite dataset under the energy constraint with L(i) = 1. PC and PSR stand for pure classification (as in [4]) and pure signal recovery (as in [8]), which correspond to β = 0 and β = −1000, respectively.

m   IDA     LDA     Renyi   PC      PSR     β=−10   β=−5    β=−2    β=−1.5  β=−1    β=−0.5
1   0.4515  0.5822  0.5835  0.6315  0.4592  0.4614  0.4620  0.4625  0.4633  0.4645  0.4690
2   0.7918  0.7714  0.7892  0.8083  0.7995  0.8014  0.8025  0.8033  0.8045  0.8053  0.8071
3   0.8218  0.8227  0.8220  0.8412  0.8294  0.8304  0.8314  0.8325  0.8333  0.8345  0.8352
4   0.8282  0.8301  0.8293  0.8514  0.8383  0.8403  0.8416  0.8422  0.8438  0.8450  0.8463
5   0.8323  0.8338  0.8319  0.8592  0.8321  0.8330  0.8346  0.8359  0.8366  0.8375  0.8389

Table 6. Fractional error of the signal recovery on the Satellite dataset with various values of β under the energy constraint with L(i) = 1.

m   IDA     LDA     Renyi   PC      PSR     β=−10   β=−5    β=−2    β=−1.5  β=−1    β=−0.5
1   2.3812  2.4131  2.3911  2.3982  2.3452  2.3481  2.3542  2.3601  2.3663  2.3722  2.3783
2   0.6812  2.1396  0.7821  0.5912  0.5512  0.5541  0.5617  0.5680  0.5712  0.5753  0.5812
3   0.6512  0.7123  0.7029  0.5612  0.4325  0.4416  0.4443  0.4480  0.4513  0.4652  0.4712
4   0.6125  0.6240  0.6284  0.5016  0.3641  0.3671  0.3718  0.3807  0.3893  0.3984  0.4012
5   0.5471  0.5618  0.5345  0.4234  0.3121  0.3292  0.3378  0.3406  0.3484  0.3581  0.3691


[Figure 1: panel (a) "Classification Performance" plots classification accuracy (%) versus number of projections; panel (b) "Signal Reconstruction Performance" plots fractional error (%) versus number of projections; curves correspond to β = −10, −5, −2, −1.5, −1, −0.5, pure classification, pure reconstruction, LDA and IDA.]

Figure 1. Classification accuracy and fractional error for signal recovery on the USPS dataset under the energy constraint. (a) The classification accuracy under various values of β. (b) The fractional error of signal recovery under various values of β. In (b), the same symbol identifications are used as in (a).

As explained in the previous section, the pure classification and pure signal recovery cases correspond to β = 0 and β = −1000, respectively (values of β < −1000 yielded almost identical results, indicating that this approached the asymptotic limit).

We also present classification accuracy and fractional error for signal recovery on the Satellite data under the energy constraint with L(i) = 1 in Tables 5 and 6. The results under the orthonormality constraint are similar, and are omitted for brevity.

Under both the energy and orthonormality constraints, it can be easily observed that the advantage of the ITM for joint classification and recovery comes from β < 0, when one can balance the goals of signal classification (Tables 1, 3 and 5) and recovery (Tables 2, 4 and 6). For m = 1, it seems that the classification performance is less sensitive to the tuning of β. Note that for β = −0.5 and for a number of measurements m ≥ 2, the ITM classification accuracy is almost as good as the pure-classification case (β = 0 and [4]), and better than the LDA, IDA and Renyi methods. However, there is a marked gain in recovery accuracy versus these three alternatives, as indicated in Tables 2, 4 and 6. Such a designed Φ has the significant advantage of excellent recovery and classification simultaneously.

When β > 0, the ITM considers pure classification, and increasing β further penalizes signal recovery (β > 0 seeks the fewest number of classification features). More specifically, the rank of the projection matrix is controlled by the value of β > 0, as suggested by Theorem 3.

[Figure 2: panel (a) "Classification Performance" plots classification accuracy (%) versus number of projections; panel (b) "Signal Reconstruction Performance" plots fractional error (%) versus number of projections; curves correspond to β = −10, −5, −2, −1.5, −1, −0.5, pure classification, pure reconstruction, LDA and IDA.]

Figure 2. Classification accuracy and fractional error for signal recovery on the USPS dataset under the orthonormality constraint. (a) The classification accuracy under various values of β. (b) The fractional error of signal recovery under various values of β. In (b), the same symbol identifications are used as in (a).

Table 7. Classification and rank under various values of β on the Satellite dataset. CA stands for classification accuracy.

β     Rank   CA       CA (rank-reduced)
100   1      0.6343   0.6311
50    2      0.8139   0.8083
10    4      0.8642   0.8622
1     9      0.9012   0.8981

In order to verify the theoretical result, we set m = n, i.e., Φ ∈ R^{n×n}. Note that in our experiments Φ is non-zero for any β, since we implement the energy constraint as (1/m) tr(ΦΦ^T) = 1 in all the experiments.

In order to calculate an approximate rank, we threshold the singular values of the numerically obtained projection matrix at 10% of the largest singular value (because of the numerics, Φ* is not exactly low-rank, as stipulated in Theorem 3). In order to obtain the rank-reduced Φ* ∈ R^{r×n}, one may truncate Φ* to be in R^{r×n}, by taking any r linearly independent rows and normalizing to satisfy the energy constraint on tr(Φ*(Φ*)^T). In Table 7, we list the rank and classification performance under various values of β.

The rank-controlling property of the ITM, for β > 0, has practical importance: β > 0 defines the number of measurements needed, an important output of the ITM setup. The performance of the original Φ* ∈ R^{n×n} and the rank-reduced and renormalized Φ* ∈ R^{r×n} were virtually identical for signal classification.
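A minimal sketch of this rank estimate and truncation (Python/NumPy; here the reduced matrix is built from the top-r singular directions, one convenient choice of r linearly independent rows) is the following.

    import numpy as np

    def approx_rank_and_truncate(Phi, thresh=0.1):
        """Estimate the approximate rank of a designed Phi and build a rank-reduced r x n version.

        Singular values below thresh * (largest singular value) are treated as zero; the
        reduced matrix is re-scaled to satisfy the energy constraint (1/r) tr(Phi Phi^T) = 1.
        """
        U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
        r = int(np.sum(s > thresh * s[0]))
        Phi_r = np.diag(s[:r]) @ Vt[:r, :]                    # r x n, rows linearly independent
        Phi_r *= np.sqrt(r / np.trace(Phi_r @ Phi_r.T))       # re-normalize the energy
        return r, Phi_r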

We next consider the USPS dataset.


Table 8. Classification and rank under various values of β on the USPS dataset. CA stands for classification accuracy.

β     Rank   CA       CA (rank-reduced)
500   1      0.3416   0.3410
400   2      0.4129   0.4092
100   6      0.8812   0.8821
50    11     0.9131   0.9119

In Figures 1 and 2 we show signal classification and recovery results under the energy and orthonormality constraints, for the same cases as considered in the above tabular results, as a function of β < 0, but for a larger range of measurements m. We observe that for classification under both constraints, quantified in Figures 1(a) and 2(a), IDA performs poorly and the performance of the ITM gradually grows with increase of β, with the best result achieved under β = 0 (which corresponds to [4]). However, the β = 0 case yields poor signal recovery for both cases (Figures 1(b) and 2(b)), and LDA and IDA yield particularly poor recovery results. The results for the Renyi method are similar to IDA, and are omitted for brevity. Part of the reason for the poor performance of IDA in the classification case may be due to the fact that IDA approximates the mutual information I(X; Ỹ) up to second-order statistics, which does not work well on the USPS dataset. Again, for β = −0.5, the ITM setup yields a good balance between recovery and classification.

For the USPS data, we consider β > 0 in Table 8. We again observe the rank-controlling property of β > 0, this defining the number of measurements/features for classification, with recovery penalized.

5.4 Poisson Compressive Hyperspectral Camera

We consider a compressive chemical sensing system based on the wavelength-dependent signature of chemicals, at optical frequencies. The details of the measurement system are provided in [27]. Summarizing briefly, multi-wavelength (hyperspectral) photons are scattered off a sample, and a digital micromirror device (DMD) is used to perform binary linear projections (binary projections on the frequency-dependent signature). For our purposes the key points are that the data are Poisson, as in (2), and that the elements of the measurement matrix Φ are either 1 or 0. Neither [8] nor [4] considered the Poisson measurement model.

Assume that there are T chemicals of interest, and that the hyperspectral sensor performs measurements at n wavelengths. The observed data are represented Ỹ|X ∼ Pois(Ỹ; Φ(Ψ exp(Z_X) + λ)), where Y = Ψ exp(Z_X), Z_X ∈ R^T is conditioned on label X, Y ∈ Z^n_+ represents the count of photons at the n sensor wavelengths, λ ∈ R^n_+ represents the sensor dark current, and the t-th column of Ψ ∈ R^{n×T}_+ reflects the mean Poisson rate for chemical t. The expression exp(Z_X) denotes a pointwise exponentiation of Z_X, and Z_X is drawn from a GMM. One Gaussian per chemical was sufficient.
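A small generative sketch of this observation model (Python/NumPy, with illustrative sizes; the Ψ, μ and binary Φ below are placeholders, not the learned quantities) may clarify the notation.

    import numpy as np

    rng = np.random.default_rng(1)

    def sample_chemical_measurement(Phi, Psi, mu, Lam_inv, dark):
        """Simulate one compressive photon-count measurement for a single chemical class.

        Z ~ N(mu, Lam_inv) is the factor score, Y = Psi exp(Z) is the n-wavelength rate,
        and Ytil ~ Pois(Phi (Y + dark)) per the stated model.
        """
        Z = rng.multivariate_normal(mu, Lam_inv)
        Y = Psi @ np.exp(Z)                         # pointwise exponentiation of the factor score
        return rng.poisson(Phi @ (Y + dark)), Y

    # illustrative sizes: T chemicals, n wavelengths, m binary projections
    T, n, m = 10, 50, 4
    Psi = np.abs(rng.normal(size=(n, T)))           # factor loadings (mean rate per chemical)
    Phi = (rng.random((m, n)) > 0.5).astype(float)  # binary DMD projection pattern
    ytil, Y = sample_chemical_measurement(Phi, Psi, np.zeros(T), 0.1 * np.eye(T), 0.05 * np.ones(n))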

This signal model corresponds to Poisson factor analysis [37], where Ψ are the factor loadings and exp(Z_X) are the factor scores; we have also added a dark current. We here consider T factors, as there are T chemicals, but this is not necessary. Methods like those discussed in [37] may be used to learn the model. The measurements for this study were performed in our laboratory, and will be made available for others to experiment with.

The measurement matrix, which controls the states of the micro-mirrors, is binary, i.e., Φ ∈ {0, 1}^{m×n}. Instead of directly optimizing Φ, we put a logistic link on each value, Φ_ij = logistic(M_ij). By the chain rule, we can state the gradient with respect to M as [∇_M ITM(Φ(M), β)]_ij = [∇_Φ ITM(Φ, β)]_ij [∇_M Φ_ij]. The matrix M was initialized at random, and we threshold the logistic at 0.5 to get the final binary Φ. 100 Monte Carlo samples are used to calculate the MMSE matrix, and 1000 iterations of gradient descent have been used. Ten (T = 10) chemicals are considered in this test: acetone, acetonitrile, benzene, dimethylacetamide, dioxane, ethanol, hexane, methylcyclohexane, octane, and toluene. A Bayesian classifier max_i P_{X=i|Ỹ} and an MMSE estimator Ŷ = E[Y|Ỹ] are readily employed for the classification and signal-recovery problems, respectively, and they can be calculated as follows.
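The logistic-link re-parameterization and its chain-rule gradient can be sketched as follows (Python/NumPy; the gradient with respect to Φ is assumed supplied by the Monte Carlo evaluation of Theorem 2).

    import numpy as np

    def sigmoid(M):
        """Logistic link mapping the unconstrained matrix M to Phi in (0, 1)."""
        return 1.0 / (1.0 + np.exp(-M))

    def grad_wrt_M(grad_wrt_Phi, M):
        """Chain rule: d ITM / d M_ij = (d ITM / d Phi_ij) * sigma'(M_ij)."""
        P = sigmoid(M)
        return grad_wrt_Phi * P * (1.0 - P)         # elementwise, since Phi_ij depends only on M_ij

    def binarize(M):
        """Threshold the logistic at 0.5 to obtain the final binary DMD pattern."""
        return (sigmoid(M) > 0.5).astype(float)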

The posterior of the class label X can be expressed as

P_{X=i|Ỹ} = P_{Ỹ|X=i} P_{X=i} / Σ_{j=1}^T P_{Ỹ|X=j} P_{X=j},    (19)

where P_{Ỹ|X=i} = ∫_Z Pois(Ỹ; Φ(Ψ exp(Z) + λ)) N(Z; μ_i, Λ_i^{-1}) dZ, which can be readily calculated via Monte Carlo integration.

In order to calculate the posterior P_{Y|Ỹ}, we employ the Laplace approximation [33]. More specifically, each mixture component in the prior Z_X ∼ Σ_{i=1}^T π_i N(μ_i, Λ_i^{-1}) may be viewed as a model for the observed data, and π_i represents the probability of model i. Using the Poisson likelihood function Pois(Φ(Ψ exp(Z_X) + λ)), we may apply a Laplace approximation for the posterior of Z_X for model/mixture component i. Via the Laplace approximation, we update each mixture-component prior N(μ_i, Λ_i^{-1}) to a Laplace-approximated posterior. This is done for each mixture component i ∈ {1, . . . , L}, and the model evidence is employed for the updated/posterior mixture probabilities. The overall posterior on Z_X is an updated Gaussian mixture model, and the MMSE estimator is the mean of the updated Gaussian mixture model.
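A rough sketch of the per-component Laplace step (Python/SciPy; it uses a generic quasi-Newton optimizer and its inverse-Hessian estimate as the posterior covariance, which is a simplification rather than the exact procedure used in the experiments) is the following.

    import numpy as np
    from scipy.optimize import minimize

    def laplace_component_posterior(ytil, Phi, Psi, dark, mu, Lam):
        """Laplace approximation to p(Z | Ytil) for one mixture component N(mu, Lam^{-1}).

        Minimizes the negative log posterior under the Poisson likelihood
        Ytil ~ Pois(Phi (Psi exp(Z) + dark)).
        """
        def neg_log_post(Z):
            rate = Phi @ (Psi @ np.exp(Z) + dark)
            nll = np.sum(rate - ytil * np.log(rate))          # Poisson NLL up to constants
            nlp = 0.5 * (Z - mu) @ Lam @ (Z - mu)             # Gaussian prior term
            return nll + nlp

        res = minimize(neg_log_post, x0=mu, method="BFGS")
        return res.x, res.hess_inv                            # Laplace mean and (approximate) covariance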

In Figure 3 we present the average performances of classification and signal recovery under various settings of β (we learn a signal model based on measured training data, and that model is used with different β for design of Φ); 10 experiments were performed for each case, and the error bars were tight (omitted for simplicity, and to not clutter the figures). The β = 0 results are best for classification (subfigure (a)) and β = −1000 is best for recovery (subfigure (b)). Random design performs poorly (probability of 1/0 in Φ set at 0.5). As in the previous examples, by setting β = −0.5, we achieve a good balance between classification and recovery performance (here recovery of the factor score).


[Figure 3 appears here: panel (a) "10 Chemical: Compressive Sensing Classification" plots accuracy (in percentage) versus number of measurements; panel (b) "10 Chemical: Compressive Sensing Reconstruction" plots fractional error versus number of projections; curves correspond to Random, β = 0, β = −0.5, β = −1, and β = −1000.]

Figure 3. Classification accuracy and fractional error for signal recovery on the chemical sensing dataset. (a) The classification accuracy under various values of β. (b) The fractional error of signal recovery under various values of β. In (b), the same symbol identifications are used as in (a).

Table 9. Classification and rank under various values of β on the chemical-sensing dataset. CA stands for classification accuracy.

β     Rank   CA       CA (rank-reduced)
1     1      0.6771   0.6766
0.9   4      0.8664   0.8583
0.7   8      0.9012   0.8931
0.5   9      0.9411   0.9373


In Table 9, we summarize design results when β > 0, where we set m = n, and via the rank infer the number of needed measurements r (in this case we are only interested in classification, and β controls the number of measurements). It is observed that the rank of the projection is controlled by the value of β, which is similar to the Gaussian case.

6 CONCLUSIONS

We have developed an information-theoretic metric (ITM) for linear dimensionality reduction. Gradients of the ITM have been derived for Gaussian and Poisson measurement models. Two types of constraints have been considered, and the structure of the optimal linear projections has been derived explicitly for the Gaussian-measurement case. It has been shown that the proposed ITM is able to simultaneously preserve the information for signal recovery and classification when the ITM parameter β < 0, and that for β > 0 one can control the number of measurements for classification (number of linear features).

APPENDIX A
PROOFS OF THEOREMS

Proof of Theorems 1 and 2: These two theorems directly follow from gradient results in [20], [32], [4], [21]. Note that

∇_Φ ITM(Φ, β) = ∇_Φ I(X; Ỹ) − β ∇_Φ I(Y; Ỹ).   (20)

For the Gaussian measurement model, by [4], we have that

∇_Φ I(X; Ỹ) = ΛΦE,   (21)

where E = E[(E[Y|Ỹ, X] − E[Y|Ỹ])(E[Y|Ỹ, X] − E[Y|Ỹ])^T]. By [20], [32],

∇_Φ I(Y; Ỹ) = ΛΦE′,   (22)

where E′ = E[(Y − E[Y|Ỹ])(Y − E[Y|Ỹ])^T]. Combining these two equalities, we have Theorem 1.

Similarly, for the Poisson measurement model, by [21], we have that

[∇_Φ I(X; Ỹ)]_{ij} = E[ E[Y_j|Ỹ, X] log( E[(ΦY)_i|Ỹ, X] / E[(ΦY)_i|Ỹ] ) ],   (23)

and

[∇_Φ I(Y; Ỹ)]_{ij} = E[Y_j log((ΦY)_i)] − E[ E[Y_j|Ỹ] log E[(ΦY)_i|Ỹ] ].   (24)

Combining these two equalities, we have Theorem 2.

Proof of Theorem 3:

Consider the Karush–Kuhn–Tucker (KKT) conditions of the ITM; the optimal Φ* should satisfy the following conditions:

∇_Φ [ ITM(Φ, β) + η·(1 − (1/m) tr(ΦΦ^T)) ] |_{Φ=Φ*} = 0,   (25)

η·(1 − (1/m) tr(Φ*Φ*^T)) = 0,   (26)

where η ≥ 0 is the Lagrange multiplier. By Theorem 1, we have

(2η/m) Φ* = ΛΦ*E′*_β,   (27)

(2η/m) Φ*Φ*^T = Λ(Φ*E′*_β Φ*^T).   (28)

By taking a transpose on both sides of (27), we obtain (2η/m) Φ*^T = E′*_β Φ*^T Λ, and thus Λ and Φ*E′*_β Φ*^T commute. By [38], two symmetric matrices commute if and only if they are simultaneously diagonalizable. Therefore we have

U*_Φ = U_Λ Π*_U = U_Σ Π*_U,   (29)
V*_Φ = U_{E′*_β} Π*_V,   (30)


where the optimal orthogonal matrices Π*_U and Π*_V account for the effect of reordering among the associated eigenvalues. Since we have ordered the eigenvalues of Σ and E′*_β, without loss of generality we may assume that Π*_U = I_m and Π*_V = I_n. It is straightforward to verify that our analysis still holds with non-identity Π*_U and Π*_V.

Notice that ITM(Φ, β) and the energy constraint are invariant under an arbitrary orthogonal transformation of the measurement Ỹ. We apply the orthogonal transform Σ^{1/2}U_Σ^T to Ỹ and obtain the equivalent Gaussian measurement model

Ỹ′ = Σ^{1/2} D_Φ V_Φ^T Y + W′,   (31)

where W′ ∼ N(0, I_m). Hence the original ITM problem is equivalent to the following optimization problem:

max_{λ_1, . . . , λ_m}  I(X; Ỹ′) − β I(Y; Ỹ′)
s.t.  Σ_{i=1}^m λ_i ≤ m  and  λ_i ≥ 0,  i = 1, . . . , m,   (32)

where {λ_1, . . . , λ_m} = Diag(D) := Diag(D_Φ D_Φ^T). L(D) is the Lagrangian

L(D, η, η_i) = I(X; Ỹ′) − β I(Y; Ỹ′) + η·(m − Σ_{i=1}^m λ_i) + Σ_{i=1}^m η_i λ_i,   (33)

where η, η_i ≥ 0 are the Lagrange multipliers. By the result in [29], we have

∂( I(X; Ỹ′) − β I(Y; Ỹ′) ) / ∂D = Diag(Λ D_Σ^{-1}),   (34)

where Λ = U_{E′_β}^T E′_β U_{E′_β}. Applying the KKT condition ∂L(D, η, η_i)/∂D = 0, we obtain that the optimal eigenvalues {λ*_i}_{i=1}^m should satisfy the following fixed-point equation

η(D_Σ)_i − (Λ)_i = η_i(D_Σ)_i,   (35)

where (·)_i denotes the i-th diagonal element and (Λ)_i is the f_i defined in the statement of the theorem.

By the complementary slackness of the KKT conditions, if λ*_i > 0, then η_i = 0; hence, we may conclude that η > 0. Therefore, in this case, we obtain λ*_i = λ†_i, where λ†_i is a positive solution of the equation η(D_Σ)_i = (Λ)_i. We note that (Λ)_i is a function of λ_i.

On the other hand, if η(D_Σ)_i = (Λ)_i does not possess a positive solution in λ_i, then η_i > 0. Therefore, we must have λ*_i = 0 and η(D_Σ)_i − (Λ)_i = η_i(D_Σ)_i > 0. Thus we derive the condition η(D_Σ)_i > (Λ)_i.

We now consider the case when β > 1. Notice that X → Y → Ỹ forms a Markov chain. By the data processing inequality [39], we have

I(X; Ỹ) ≤ I(Y; Ỹ).   (36)

Therefore,

ITM(Φ, β) ≤ (1 − β) I(Y; Ỹ).   (37)

When β > 1, we have ITM(Φ, β) < 0, given that I(Y; Ỹ) > 0. The maximum 0 is obtained when I(Y; Ỹ) = 0, which can be achieved by setting Φ = 0, i.e., λ*_i = 0 for all i = 1, . . . , m.

Proof of Theorem 4: The existence of the matrix W is guaranteed by constituting W via the generalized eigenvectors of Φ(Σ_{Y|X′} + σI_m)Φ^T and Φ(Σ_Y + σI_m)Φ^T [38], and such W is nonsingular.
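As a hedged numerical illustration (not part of the proof), such a W can be obtained with a generalized symmetric eigendecomposition; the arguments A and B below are stand-ins for Φ(Σ_{Y|X′} + σI)Φ^T and Φ(Σ_Y + σI)Φ^T.

import numpy as np
from scipy.linalg import eigh

def simultaneous_diagonalizer(A, B):
    """A, B symmetric with B positive definite. eigh(A, B) solves A v = alpha B v,
    with the eigenvectors normalized so that V^T B V = I. Stacking them as rows
    gives a nonsingular W with W A W^T diagonal and W B W^T = I."""
    alphas, V = eigh(A, B)
    W = V.T
    return W, alphas    # W @ A @ W.T = diag(alphas), W @ B @ W.T = I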

The ITM can be calculated as

ITM(Φ, β) = I(X′; Ỹ) − β I(Y; Ỹ)   (38)
          = h(Ỹ) − h(Ỹ|X′) − β h(Ỹ) + β h(Ỹ|Y).   (39)

Recall that for a d-dimensional Gaussian random variable X′ ∼ N(μ_{X′}, Σ_{X′}), the differential entropy is h(X′) = (1/2) log((2πe)^d det(Σ_{X′})) [39]. Since (X′, Y) is jointly Gaussian, it follows that Ỹ and Ỹ|X′ are also Gaussian random variables with covariances Σ_Ỹ = ΦΣ_YΦ^T + Σ and Σ_{Ỹ|X′} = ΦΣ_{Y|X′}Φ^T + Σ, respectively [40]. Therefore, we have

arg max_Φ ITM(Φ, β)
= arg max_Φ { h(Ỹ) − h(Ỹ|X′) − β h(Ỹ) + β h(Ỹ|Y) }   (40)
= arg max_Φ { (1 − β) log det(Σ_Ỹ) − log det(Σ_{Ỹ|X′}) + β log det(Σ) }   (41)
= arg max_Φ { β( (1/β − 1) log det(Σ_Ỹ) − (1/β) log det(Σ_{Ỹ|X′}) + log det(Σ) ) }.   (42)

As β ∈ (−∞, 0), we have

arg max_Φ ITM(Φ, β)
= arg min_Φ { (1/β − 1) log det(Σ_Ỹ) − (1/β) log det(Σ_{Ỹ|X′}) + log det(Σ) }   (43)
= arg min_Φ { (1/β − 1) log det(Σ_Ỹ) − (1/β) log det(Σ_{Ỹ|X′}) }   (44)
= arg min_Φ { (1/β − 1) log det(ΦΣ_YΦ^T + Σ) − (1/β) log det(ΦΣ_{Y|X′}Φ^T + Σ) }.   (45)

Let W(Φ) be the matrix simultaneously diagonalizing Φ(Σ_{Y|X′} + σI_m)Φ^T and Φ(Σ_Y + σI_m)Φ^T such that

WΦ(Σ_{Y|X′} + σI_m)Φ^T W^T = diag{α_1, . . . , α_m},   (46)
WΦ(Σ_Y + σI_m)Φ^T W^T = I_m.   (47)

We claim that {α_i}_{i=1}^m are eigenvalues of the matrix (Σ_Y + σI_m)^{-1/2}(Σ_{Y|X′} + σI_m)(Σ_Y + σI_m)^{-1/2} and that U(Φ) = (Σ_Y + σI_m)^{1/2} Φ^T W^T(Φ) is formed by their corresponding eigenvectors. To see this, it is straightforward to check that U^T(Φ)U(Φ) = I_m and U^T(Φ)(Σ_Y + σI_m)^{-1/2}(Σ_{Y|X′} + σI_m)(Σ_Y + σI_m)^{-1/2}U(Φ) = diag{α_1, . . . , α_m}.


We note that even though the eigenvalues of (Σ_Y + σI_m)^{-1/2}(Σ_{Y|X′} + σI_m)(Σ_Y + σI_m)^{-1/2} are independent of Φ, the specific choice of the m eigenvalues {α_i}_{i=1}^m does depend on Φ. Hence U(Φ) is also a function of Φ. Therefore, we end up with a fixed-point equation which the optimal orthogonal Φ* must obey:

Φ* = W^{-T}(Φ*) U^T(Φ*) (Σ_Y + σI_m)^{-1/2}.   (48)

The original ITM can be expressed as

arg max_Φ ITM(Φ, β)
= arg min_Φ { (1/β − 1) log det(ΦΣ_YΦ^T + Σ) − (1/β) log det(ΦΣ_{Y|X′}Φ^T + Σ) }   (49)
= arg min_Φ { (1/β − 1) log det(W^{-1}W^{-T}) − (1/β) log det(W^{-1} diag{α_1, . . . , α_m} W^{-T}) }   (50)
= arg min_Φ { − log det(W^{-1}W^{-T}) − (1/β) log det(diag{α_1, . . . , α_m}) }   (51)
= arg min_Φ { log det((W(Φ))^2) − (1/β) Σ_{i∈Γ_1(Φ)} log ω_i }.   (52)

When β → 0⁻, the above ITM is essentially the solution of the following optimization:

arg max_Φ ITM(Φ, β) = arg min_Φ { −(1/β) Σ_{i=1}^m log α_i }   (53)
                    = arg min_Φ { Σ_{i=1}^m log α_i },   (54)

where the minimum is achieved by choosing {α_i}_{i=1}^m as the smallest m eigenvalues of (Σ_Y + σI_m)^{-1/2}(Σ_{Y|X′} + σI_m)(Σ_Y + σI_m)^{-1/2}. Let U_1 be the matrix formed by the corresponding eigenvectors. The fixed-point equation (48) becomes

Φ* = W^{-T}(Φ*) U_1^T (Σ_Y + σI_m)^{-1/2}.   (55)

We claim that the above fixed-point equation admits an analytical solution:

Φ* = orth(U_1^T (Σ_Y + σI_m)^{-1/2}).   (56)

It is enough to verify that U* = (Σ_Y + σI_m)^{1/2} Φ*^T W^T(Φ*) is the eigenvector matrix corresponding to the smallest m eigenvalues of (Σ_Y + σI_m)^{-1/2}(Σ_{Y|X′} + σI_m)(Σ_Y + σI_m)^{-1/2}. By using the Gram–Schmidt method, we can rewrite Φ* = L U_1^T (Σ_Y + σI_m)^{-1/2}, where L ∈ R^{m×m} is a lower triangular matrix with positive diagonal entries. Hence, we have that W(Φ*) = L^{-1} and the following expressions hold:

W(Φ*) Φ* (Σ_{Y|X′} + σI_m) Φ*^T W^T(Φ*) = diag{α_1, . . . , α_m},   (57)
W(Φ*) Φ* (Σ_Y + σI_m) Φ*^T W^T(Φ*) = I_m.   (58)

Note that the optimization in (53) is independent of W(Φ). Therefore, we have that Φ* = orth(U_1^T (Σ_Y + σI_m)^{-1/2}).

For the case when β → −∞, the ITM boils down to the following problem:

arg max_Φ ITM(Φ, β) = arg min_Φ { − log det(ΦΣ_YΦ^T + Σ) }.   (59)

By similar arguments, there exists a W ∈ R^{m×m} such that

W(Φ) W^T(Φ) = diag{α_1, . . . , α_m},   (60)
W(Φ) Φ (Σ_Y + σI_m) Φ^T W^T(Φ) = I_m.   (61)

Further, {α_i}_{i=1}^m are eigenvalues of the matrix (Σ_Y + σI_m)^{-1} and U(Φ) = (Σ_Y + σI_m)^{1/2} Φ^T W^T(Φ) is formed by their corresponding eigenvectors. Hence, we have that

arg max_Φ ITM(Φ, β) = arg min_Φ { − log det(ΦΣ_YΦ^T + Σ) }   (62)
                    = arg min_Φ { − log det(W^{-1}W^{-T}) }   (63)
                    = arg min_Φ { Σ_{i=1}^m log α_i }.   (64)

In order to solve the above optimization, we require that {α_i}_{i=1}^m are the smallest m eigenvalues of (Σ_Y + σI_m)^{-1}, and the optimal projection matrix is given by

Φ* = orth(U_2^T (Σ_Y + σI_m)^{-1/2}),   (65)

where U_2 is constituted by the eigenvectors corresponding to the smallest m eigenvalues of the matrix (Σ_Y + σI_m)^{-1}.
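For completeness, a small numerical sketch of the two limiting closed-form designs in (56) and (65) is given below. It assumes Σ_Y and Σ_{Y|X′} are known n×n covariances (e.g., from a fitted Gaussian model) and realizes orth(·) via a QR factorization, an implementation detail not specified in the paper.

import numpy as np

def _inv_sqrt(A):
    # Symmetric inverse square root via an eigendecomposition.
    w, V = np.linalg.eigh(A)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def _orth_rows(B):
    # Orthonormalize the rows of B (QR on B^T).
    Q, _ = np.linalg.qr(B.T)
    return Q.T

def design_beta_to_zero(Sigma_Y, Sigma_Y_given_X, sigma, m):
    """beta -> 0^-: eigenvectors of the smallest m eigenvalues of
    (Sigma_Y + sigma I)^{-1/2} (Sigma_Y|X' + sigma I) (Sigma_Y + sigma I)^{-1/2}."""
    n = Sigma_Y.shape[0]
    S = _inv_sqrt(Sigma_Y + sigma * np.eye(n))
    w, V = np.linalg.eigh(S @ (Sigma_Y_given_X + sigma * np.eye(n)) @ S)
    U1 = V[:, np.argsort(w)[:m]]            # smallest m eigenvalues
    return _orth_rows(U1.T @ S)             # Phi* = orth(U1^T (Sigma_Y + sigma I)^{-1/2})

def design_beta_to_minus_inf(Sigma_Y, sigma, m):
    """beta -> -infinity: smallest m eigenvalues of (Sigma_Y + sigma I)^{-1},
    i.e., the m largest-variance directions of the signal."""
    n = Sigma_Y.shape[0]
    w, V = np.linalg.eigh(np.linalg.inv(Sigma_Y + sigma * np.eye(n)))
    U2 = V[:, np.argsort(w)[:m]]
    return _orth_rows(U2.T @ _inv_sqrt(Sigma_Y + sigma * np.eye(n)))

The second design is PCA-like, since the smallest eigenvalues of (Σ_Y + σI)^{-1} correspond to the largest-variance directions of the signal.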

REFERENCES

[1] M. Seeger and H. Nickisch, "Compressed sensing and Bayesian experimental design," in Proceedings of the 25th International Conference on Machine Learning (ICML), 2008, pp. 912–919.
[2] K. Torkkola, "Learning discriminative feature transforms to low dimensions in low dimensions," in Advances in Neural Information Processing Systems (NIPS), 2001, pp. 969–976.
[3] ——, "Feature extraction by non-parametric mutual information maximization," The Journal of Machine Learning Research, vol. 3, pp. 1415–1438, 2003.
[4] M. Chen, W. Carson, M. Rodrigues, R. Calderbank, and L. Carin, "Communications inspired linear discriminant analysis," in Proceedings of the International Conference on Machine Learning (ICML), 2012.
[5] L. Song, A. Smola, K. Borgwardt, and A. Gretton, "Colored maximum variance unfolding," in Advances in Neural Information Processing Systems (NIPS), 2008.
[6] E. Candes, J. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.
[7] M. Gehm, R. John, D. Brady, R. Willett, and T. Schulz, "Single-shot compressive spectral imaging with a dual-disperser architecture," Optics Express, vol. 15, no. 21, pp. 14013–14027, 2007.
[8] W. Carson, M. Chen, M. Rodrigues, R. Calderbank, and L. Carin, "Communications-inspired projection design with application to compressive sensing," SIAM Journal on Imaging Sciences, vol. 5, no. 4, pp. 1185–1212, 2012.


[9] S. Ji, Y. Xue, and L. Carin, "Bayesian compressive sensing," IEEE Transactions on Signal Processing, vol. 56, no. 6, pp. 2346–2356, 2008.
[10] K. Hild, D. Erdogmus, K. Torkkola, and J. Principe, "Feature extraction using information-theoretic learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1385–1392, 2006.
[11] S. Kaski and J. Peltonen, "Informative discriminant analysis," in Proceedings of the International Conference on Machine Learning (ICML), 2003, pp. 329–336.
[12] Z. Nenadic, "Information discriminant analysis: Feature extraction with an information-theoretic objective," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 8, pp. 1394–1407, 2007.
[13] N. Tishby, F. Pereira, and W. Bialek, "The information bottleneck method," in Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, 1999, pp. 368–377.
[14] N. Tishby and N. Slonim, "Data clustering by Markovian relaxation and the information bottleneck method," in Advances in Neural Information Processing Systems (NIPS), 2000, pp. 640–646.
[15] N. Slonim, N. Friedman, and N. Tishby, "Multivariate information bottleneck," Neural Computation, vol. 18, no. 8, pp. 1739–1789, 2006.
[16] W. Hsu, L. Kennedy, and S. F. Chang, "Video search reranking via information bottleneck principle," in Proceedings of the ACM International Conference on Multimedia, 2006, pp. 35–44.
[17] F. Creutzig and H. Sprekeler, "Predictive coding and the slowness principle: An information-theoretic approach," Neural Computation, vol. 20, no. 4, pp. 1026–1041, 2008.
[18] R. Hecht, E. Noor, and N. Tishby, "Speaker recognition by Gaussian information bottleneck," in Proceedings of INTERSPEECH, 2009, pp. 1567–1570.
[19] G. Chechik, A. Globerson, N. Tishby, and Y. Weiss, "Information bottleneck for Gaussian variables," The Journal of Machine Learning Research, vol. 6, no. 1, pp. 165–188, 2005.
[20] D. Guo, S. Shamai, and S. Verdu, "Mutual information and minimum mean-square error in Gaussian channels," IEEE Transactions on Information Theory, vol. 51, no. 4, pp. 1261–1282, 2005.
[21] L. Wang, M. Rodrigues, and L. Carin, "Generalized Bregman divergence and gradient of mutual information in vector Poisson channels," in IEEE International Symposium on Information Theory Proceedings (ISIT), 2013, pp. 454–458.
[22] L. Wang, D. Carlson, M. Rodrigues, D. Wilcox, R. Calderbank, and L. Carin, "Designed measurements for vector count data," in Advances in Neural Information Processing Systems (NIPS), 2013, pp. 1142–1150.
[23] L. Wang, D. Carlson, M. Rodrigues, R. Calderbank, and L. Carin, "A Bregman matrix and the gradient of mutual information for vector Poisson and Gaussian channels," IEEE Transactions on Information Theory, to appear, 2014.
[24] B. Tang, J. Tang, and Y. Peng, "MIMO radar waveform design in colored noise based on information theory," IEEE Transactions on Signal Processing, vol. 58, no. 9, pp. 4684–4697, 2010.
[25] L. Zheng and D. Tse, "Communication on the Grassmann manifold: A geometric approach to the noncoherent multiple-antenna channel," IEEE Transactions on Information Theory, vol. 48, no. 2, pp. 359–383, 2002.
[26] R. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
[27] D. Wilcox, G. Buzzard, B. Lucier, P. Wang, and D. Ben-Amotz, "Photon level chemical classification using digital compressive detection," Analytica Chimica Acta, vol. 755, pp. 17–27, 2012.
[28] N. Halko, P. Martinsson, and J. Tropp, "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions," SIAM Review, vol. 53, no. 2, pp. 217–288, 2011.
[29] C. Xiao, Y. Zheng, and Z. Ding, "Globally optimal linear precoders for finite alphabet signals over complex vector Gaussian channels," IEEE Transactions on Signal Processing, vol. 59, no. 7, pp. 3301–3314, 2011.
[30] S. Prasad, "Certain relations between mutual information and fidelity of statistical estimation," http://arxiv.org/pdf/1010.1508v1.pdf, 2012.
[31] M. Hellman and J. Raviv, "Probability of error, equivocation, and the Chernoff bound," IEEE Transactions on Information Theory, vol. 16, no. 4, pp. 368–372, 1970.
[32] D. Palomar and S. Verdu, "Representation of mutual information via input estimates," IEEE Transactions on Information Theory, vol. 53, no. 2, pp. 453–470, 2007.
[33] C. Bishop and N. Nasrabadi, Pattern Recognition and Machine Learning. Springer New York, 2006, vol. 1.
[34] A. Lozano, A. Tulino, and S. Verdu, "Optimum power allocation for parallel Gaussian channels with arbitrary input distributions," IEEE Transactions on Information Theory, vol. 52, no. 7, pp. 3033–3051, 2006.
[35] F. Perez-Cruz, M. Rodrigues, and S. Verdu, "MIMO Gaussian channels with arbitrary inputs: Optimal precoding and power allocation," IEEE Transactions on Information Theory, vol. 56, no. 3, pp. 1070–1084, 2010.
[36] M. Chen, J. Silva, J. Paisley, C. Wang, D. Dunson, and L. Carin, "Compressive sensing on manifolds using a nonparametric mixture of factor analyzers: Algorithm and performance bounds," IEEE Transactions on Signal Processing, vol. 58, no. 12, pp. 6140–6155, 2010.
[37] M. Zhou, L. Hannah, D. Dunson, and L. Carin, "Beta-negative binomial process and Poisson factor analysis," in Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2012, pp. 1462–1471.
[38] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 2012.
[39] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[40] A. Gut, An Intermediate Course in Probability. Springer, 1995, vol. 1.

Liming Wang (S'08–M'11) received the B.S. degree in Electronic Engineering from the Huazhong University of Science and Technology, China, in 2006, and the M.S. degree in Mathematics and Ph.D. degree in Electrical and Computer Engineering from the University of Illinois at Chicago, both in 2011. From 2011 to 2012, he was a Postdoctoral Research Scientist in the Department of Electrical Engineering, Columbia University. Since May 2012, he has been a Postdoctoral Associate in the Department of Electrical and Computer Engineering, Duke University. His research interests are in high-dimensional data processing; machine learning; compressive sensing; statistical signal processing; bioinformatics and genomic signal processing.

Minhua Chen received his B.S. degree in Electronic Engineering from Tsinghua University in 2004, and M.S. and Ph.D. degrees in Electrical and Computer Engineering from Duke University in 2009 and 2012. He was a postdoctoral researcher at the University of Chicago from September 2012 to May 2014. He is currently a research scientist at Amazon. His research interest is in machine learning and signal processing.

Miguel R. D. Rodrigues (S'98–A'02–M'03) received the Licenciatura degree in Electrical Engineering from the University of Porto, Portugal, in 1998 and the Ph.D. degree in Electronic and Electrical Engineering from University College London, U.K., in 2002. He is currently a Senior Lecturer with the Department of Electronic and Electrical Engineering, University College London, UK. He was previously with the Department of Computer Science, University of Porto, Portugal, rising through the ranks from Assistant to Associate Professor, where he also led the Information Theory and Communications Research Group at Instituto de Telecomunicações – Porto. He has held postdoctoral or visiting appointments with various institutions worldwide, including University College London, Cambridge University, Princeton University, and Duke University, in the period 2003–2013. His research work, which lies in the general areas of information theory, communications and signal processing, has led to over 100 papers in journals and conferences to date. Dr. Rodrigues was honored by the IEEE Communications and Information Theory Societies Joint Paper Award in 2011 for his work on wireless information-theoretic security (joint with M. Bloch, J. Barros and S. M. McLaughlin).

David Wilcox earned a Ph.D. in Chemistry from Purdue University in 2012. He is now a postdoctoral researcher in Chemistry at the University of Chicago. His research focus is on optical chemical sensing and compressive sensing. [No photo available]

Robert Calderbank (M'89–SM'97–F'98) received the BSc degree in 1975 from Warwick University, England, the MSc degree in 1976 from Oxford University, England, and the PhD degree in 1980 from the California Institute of Technology, all in mathematics. Dr. Calderbank is Professor of Electrical Engineering at Duke University, where he now directs the Information Initiative at Duke (iiD) after serving as Dean of Natural Sciences (2010-2013). Dr. Calderbank was previously Professor of Electrical Engineering and Mathematics at Princeton University, where he directed the Program in Applied and Computational Mathematics. Prior to joining Princeton in 2004, he was Vice President for Research at AT&T, responsible for directing the first industrial research lab in the world where the primary focus is data at scale. At the start of his career at Bell Labs, innovations by Dr. Calderbank were incorporated in a progression of voiceband modem standards that moved communications practice close to the Shannon limit. Together with Peter Shor and colleagues at AT&T Labs he showed that good quantum error correcting codes exist and developed the group theoretic framework for quantum error correction. He is a co-inventor of space-time codes for wireless communication, where correlation of signals across different transmit antennas is the key to reliable transmission. Dr. Calderbank served as Editor in Chief of the IEEE Transactions on Information Theory from 1995 to 1998, and as Associate Editor for Coding Techniques from 1986 to 1989. He was a member of the Board of Governors of the IEEE Information Theory Society from 1991 to 1996 and from 2006 to 2008. Dr. Calderbank was honored by the IEEE Information Theory Prize Paper Award in 1995 for his work on the Z4 linearity of Kerdock and Preparata Codes (joint with A. R. Hammons Jr., P. V. Kumar, N. J. A. Sloane, and P. Sole), and again in 1999 for the invention of space-time codes (joint with V. Tarokh and N. Seshadri). He has received the 2006 IEEE Donald G. Fink Prize Paper Award, the IEEE Millennium Medal, the 2013 IEEE Richard W. Hamming Medal, and he was elected to the US National Academy of Engineering in 2005.

Lawrence Carin (SM'96–F'01) earned the BS, MS, and PhD degrees in electrical engineering from the University of Maryland, College Park, in 1985, 1986, and 1989, respectively. In 1989 he joined the Electrical Engineering Department at Polytechnic University (Brooklyn) as an Assistant Professor, and became an Associate Professor there in 1994. In September 1995 he joined the Electrical Engineering Department at Duke University, where he is now a professor. He was Chairman of the Electrical and Computer Engineering Department from 2011-2014. He held the William H. Younger Professorship from 2003-2013. He is now Vice Provost for Research at Duke. He is a co-founder of Signal Innovations Group, Inc. (SIG), a small business that was acquired by BAE Systems in 2014. His current research interests include machine learning and applied statistics. He has published over 300 peer-reviewed papers, and he is a member of the Tau Beta Pi and Eta Kappa Nu honor societies.