
Toward a Theory of Information Processing

Sinan Sinanović, Don H. Johnson

Computer and Information Technology Institute, Department of Electrical and Computer Engineering, MS 380, Rice University, 6100 Main Street, Houston, Texas 77005-1892

Abstract

Information processing theory endeavors to quantify how well signals encode information and how well systems, by acting on signals, process information. We use information-theoretic distance measures, the Kullback-Leibler distance in particular, to quantify how well signals represent information. The ratio of distances calculated between two informationally different signals at a system's output and input quantifies the system's information processing properties. Using this approach, we derive the fundamental processing capabilities of simple system architectures that apply universally: the systems and the kinds of signals they process and produce don't affect our general results. Applications in array signal processing and in neural signal analysis illustrate how to apply the theory.

Key words: information theory, information processing, Kullback-Leibler distance

1 Introduction

Information-bearing signals take on a variety of forms: analog and sampled, continuous and discrete-valued, numeric and symbolic, single and multi-channel, and single and multi-dimensional. We seek a comprehensive framework that allows us to quantify how well a signal represents information and how well systems extract and process information.

Supported by grants from the National Science Foundation and Halliburton Energy Services.
Corresponding author. Email addresses: [email protected] (Sinan Sinanović), [email protected] (Don H. Johnson).
Symbolic signals are not real or complex-valued, but simply members of a set. Examples of symbolic signals are text and DNA sequences.

Preprint submitted to Signal Processing 27 September 2006


An important application area that strongly influenced our thinking was neural coding. Neuroscientists want to determine from neural recordings whether the measured signals represent information and, if so, what information is represented and with what fidelity. If the signals comprising a neuron's (or a group of neurons') inputs are known, how is information processed and represented in the output? Classic information theory [1] has few answers to these representation and fidelity questions. For example, entropy is commonly used to assess a signal's "uncertainty," but signals have an entropy regardless of whether they bear information or not and whether the information is relevant or not. Assuming an answer to the information-representation question can be found, rate distortion theory [1-3] provides an approach to answering the fidelity question. Unfortunately, using this approach requires knowledge of the joint probability distribution of the input and output information. Because of the elusive nature of the question "What is information?" and because the signal being studied may or may not be the ultimate output, determining the required joint probability function is quite difficult. Furthermore, rate distortion theory rests on specifying a distortion measure and on computing the mutual information between information and signals. Recent work has determined what distortion measure matches signal encoding mechanisms [4]. One of the reasons neuroscientists are making recordings is to determine the actual distortion measure. Tishby [5] has developed the information bottleneck approach, wherein rate-distortion theory, the Data Processing Theorem, and compression play major roles. There, mutual information is the primary quantity; using mutual information in this way requires information to obey a probabilistic model. We question the ultimate validity of creating a probabilistic model for information. Is the information contained in Shannon's paper produced probabilistically? What we mean by information is not the sequence of words and phrases, but rather the meaning of the prose. If the information (the meaning) is random, what kind of random quantity is it and from what probability distribution was it drawn? Because these questions ultimately have no concrete answers, we developed a new approach to the analysis of information processing systems. In our view, a viable information processing theory requires an "information probe" that can examine any kind of signal measured anywhere in a complex system and determine how well it conveys relevant information. By comparing the information probe's result at a system's input and output, we should be able to quantify how well the system processes information.

In this paper, we frame a theory that recasts how one approaches quantifying the information representation of signals and the information processing capabilities of systems. Our theory weds results from information theory and statistical signal processing. We use information-theoretic, generalized-distance measures between probability distributions that describe stochastic signals. We require viable distance measures to be analytically tractable, empirically measurable from any kind of signal, and related to the performance of optimal classification and estimation algorithms. By requiring a single distance measure to be inversely related to the error optimal processing systems incur, we find that the Kullback-Leibler distance best satisfies our requirements. We use this measure to create a system theory of information processing structures that quantifies the inherent processing capabilities of various architectures for information processing.


Fig. 1. An information source produces information considered relevant, an abstract quantity represented by the symbol α. This information, as well as extraneous information represented by χ0 and χ1, is encoded (modulated) onto the signal X, which we assume to be stochastic. A system, having its input-output relationship defined by the conditional probability function p_{Y|X}(y|x), serves as an information filter, changing the fidelity with which the information is represented by its output Y, possibly accentuating some aspects of the information while suppressing others. The information sink responds to the information encoded in Y by exhibiting an action Z.


This preliminary theory complements classic information theory, which answers different questions than the ones posed here. Warren Weaver described in his introduction to Shannon's classic paper [6] that while Shannon's work was the mathematical theory of communication, it did not touch the entire realm of information processing notions that require analysis. He stratified communication problems on three levels: technical, semantic, and influential. He cast Shannon's work as the engineering or technical side of the problem because it did not deal with the extraction of meaning and with taking actions in response to the transmitted information. Our theory takes the approach that the meaning of signals can only be determined from actions taken as a result of interpreting the information. Thus, we merge semantic interpretation into quantifying the effectiveness of action (i.e., influence). In our framework, information processing ultimately results in actions, and these actions result in signals of some sort that our information probe (the information-theoretic Kullback-Leibler distance) can assess.

2 Assumptions

Our theory rests on several fundamental assumptions. We use a standard communication-like model as a backdrop (Fig. 1).

• Information can have any form and need not be stochastic. In general, we make no assumption about what form information takes. We represent it by the symbol α that we take to be a member of a set (either countable or uncountable). This parameter could be multi-valued so as to represent multi-dimensional information, each component of which is a member of a different set. We represent parameter vectors by the boldfaced symbol α. This symbol represents the meaning of the information that signals encode. Information uninteresting to the information sink may also be present; the symbols χ0, χ1 in Fig. 1 represent extraneous information. We do not assume information results from a probabilistic generative process. Rather, information takes on a value selected somehow from an abstract set.



• Signals encode information. Information does not exist in a tangible form; rather, information is always encoded into a signal. Signals representing the same information can take on very different forms. For example, this text's alphabetic/punctuation symbol sequence, the image of the page you are reading, and the acoustic signal produced by reading this page aloud each represent the same information. In the first case, the signal is a sequence drawn from a discrete, finite alphabet; in the latter two, the signals are analog. Any viable information processing theory must place a variety of signals on the same footing and allow comparison of the various ways information can be encoded.

• What may or may not be information is determined by the information sink. The recipient of the information (the information sink) determines what about a signal it receives is informative. Additional, often extraneous, information χ0, χ1 is represented by signals as well. Returning to the example, information extraneous to the paper's central theme is also contained in the signal. The text comes from a Roman alphabet expressing English prose. The page image reveals page numbers, fonts and details of the mathematical notation. The spoken text encodes the gender of the reader and dialect, which provides clues as to his/her region of origin. Such information may or may not be considered extraneous by some succeeding processing. For example, speaker identification systems are presumably designed to ignore what is being said. It is for the recipient to determine what information is. An immediate consequence of this assumption is that no single objective measure can quantify the information contained in a signal. "Objective" here means analysis of a signal out of context and without regard to the recipient.

• When systems act on their input signal(s) and produce output signal(s), they indirectly perform information processing. Systems map signals from their input space onto their output space. Because information does not exist as an entity, systems affect the information encoded in their inputs by their input-output map. We assess the effect a system has on information-bearing signals by comparing the fidelity of the relevant information represented by the input and output signals.

• The result of processing information is an action. An action is a quantifiable behavior taken by the information sink when it interprets the signal's information content. We assume that every action is a signal that can be measured. For example, the influence of a voice command can be measured by observing the recipient's consequent behavior.


3 Quantifying information processing

In this preliminary information processing theory, we consider information to be a scalar parameter α or a parameter vector α that controls signal features. In this paper, we take the parameters to be real-valued, but in general they could be complex-valued or even symbolic. While parameterizing information reduces the generality of the approach, it helps us develop results that can be applied to many information-processing systems.

3.1 Quantifying signal encoding

As shown in Fig. 1, let X denote a signal that encodes information represented by the parameter α as well as χ0 and χ1. We assume that signals in our theory are stochastic, with their statistical structures completely specified by the probability function (either a density or a mass function) p_X(x; α) that depends on the information parameter α. Implicitly, this probability function also depends on the extraneous information. This notation for a signal's probability function could be misinterpreted as suggesting that signals are simple random variables. Rather, the notation is meant to express succinctly that signals could be of any kind. We assume that signals are observed over some interval in the signal's domain and that the signal's probability function expresses the joint probability distribution of these observations. The dimensionality of the signal domain is suppressed notationally because our results do not depend on it; speech, image and video signals are all represented by the same generic symbol. For discrete domains, we use the joint distribution of the signal's values for the required probability function. We include continuous-domain processes having a Karhunen-Loève expansion [7, §1.4] because they are specified by the joint distribution of the expansion coefficients (see Appendix A). Those that don't have a Karhunen-Loève expansion are also included because of a result due to Kolmogorov [7, §1.14]. Later, we will need to specify when the signal has multiple components (i.e., multi-channel signals); in these cases, the signal will be written in boldface.

Because we have argued that analyzing signals for their relevant information content statically is a hopeless task, we conceptually (or in reality for empirical work) consider controlled changes of the relevant and/or irrelevant information and determine how well the signals encode this information change [8]. By considering induced information changes, we essentially specify what constitutes information. We quantify the effectiveness of the information representation by calculating how different are the signals corresponding to the two information states. In this way, we have what amounts to an information "probe" that can be used anywhere in an information system architecture to quantify the processing. Because the signals can have an arbitrary form, usual choices for assessing signal difference like mean-squared error make little sense.


Instead, we rely on distance measures that quantify differences between the signals' probabilistic descriptions. More concretely, when we change the information from a value of α0 to α1, the change in X is quantified by the distance d(α0, α1), which depends not on the signal itself, but on the probability functions p_X(x; α0) and p_X(x; α1). Basseville reviews distance measures for probability functions in the context of signal processing applications [10]. We require potential distance measures to satisfy d(α0, α1) ≥ 0 with d(α0, α0) = 0. Conceptually, we want increasing distance to mean a more easily discerned information change in the signal. We do not require viable distance measures to be monotonically related to the information change because doing so would imply that information has an ordering; since we include symbolic information, imposing an ordering would be overly restrictive. We envision calculating or estimating the distance for several values of α1 while maintaining a reference point α0. In this way, we can explore how information changes with respect to a reference produce distance changes. We have found that a systematic empirical study of a signal's information representation requires that several reference points be used. This requirement would seem excessive, but the geometries of both estimation and classification [9] clearly state that information extraction performance depends on the reference. Consequently, a thorough characterization of information encoding demands that the information space be explored.

A widely used class of distance measures for probability functions is known as the Ali-Silvey class [11]. Distances in this class have the form d(α0, α1) = f(E_0[c(Λ(X))]), where Λ(X) represents the likelihood ratio p_X(X; α1)/p_X(X; α0), c(·) is convex, E_0[·] denotes expected value with respect to the probability function specified by the parameter α0, and f(·) is a non-decreasing function. Thus, each distance measure in this class is defined by the choices for f(·) and c(·). Because we require d(α0, α0) = 0, we restrict attention here to those distances that satisfy f(c(1)) = 0. Among the many distance measures in this class are the Kullback-Leibler distance and the Chernoff distance. The Kullback-Leibler distance is defined to be

D(\alpha_1 \| \alpha_0) = \int p_X(x;\alpha_1)\,\ln\frac{p_X(x;\alpha_1)}{p_X(x;\alpha_0)}\,dx    (1)

While the choice of the logarithm's base is arbitrary, log(·) denotes the natural logarithm here unless otherwise stated. To create the Kullback-Leibler distance within the Ali-Silvey framework, c(Λ) = Λ ln Λ and f(x) = x. The Kullback-Leibler distance is not necessarily symmetric in its arguments, which means that it cannot serve as a metric. Another distance measure of importance here is the Chernoff distance [12]:

\mathcal{C}(\alpha_0,\alpha_1) = \max_{t\in[0,1]}\; -\ln\mu(t), \qquad \mu(t) = \int p_X^{\,1-t}(x;\alpha_0)\,p_X^{\,t}(x;\alpha_1)\,dx    (2)

The quantity -ln μ(t) is in the Ali-Silvey class for each value of t, with c(Λ) = Λ^t and f(x) = -ln x. Because of the optimization in its definition, however, the Chernoff distance itself


is not in the Ali-Silvey class. A special case of the Chernoff distance in the Ali-Silvey class is the Bhattacharyya distance B(α0, α1), which equals -ln μ(1/2) [13,14].
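To make the preceding definitions concrete, the short numerical sketch below (ours, not part of the original development) evaluates the Kullback-Leibler, Chernoff, and Bhattacharyya distances for two assumed discrete probability mass functions; a grid search over t stands in for the maximization in (2).

    import numpy as np

    p0 = np.array([0.5, 0.3, 0.2])   # p_X(x; alpha_0) -- assumed example values
    p1 = np.array([0.2, 0.3, 0.5])   # p_X(x; alpha_1)

    def kl(pa, pb):
        """Kullback-Leibler distance D(pa || pb) in nats, as in (1)."""
        return float(np.sum(pa * np.log(pa / pb)))

    def neg_log_mu(t):
        """-ln mu(t), where mu(t) = sum_x p0(x)^(1-t) * p1(x)^t, cf. (2)."""
        return -float(np.log(np.sum(p0 ** (1.0 - t) * p1 ** t)))

    ts = np.linspace(0.0, 1.0, 1001)
    chernoff = max(neg_log_mu(t) for t in ts)   # grid search replaces the maximization in (2)
    bhattacharyya = neg_log_mu(0.5)             # the t = 1/2 special case

    print("D(a1||a0) =", kl(p1, p0), "  D(a0||a1) =", kl(p0, p1))
    print("Chernoff  =", chernoff, "  Bhattacharyya =", bhattacharyya)

A finer grid or a scalar optimizer could replace the grid search when more accuracy is needed.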

As the Kullback-Leibler distance exemplifies, the term "distance" should not be taken rigorously; each of the distances defined here fails to obey some of the fundamental axioms true distances must satisfy. The Kullback-Leibler distance is not symmetric, and the Chernoff and Bhattacharyya distances do not satisfy the triangle inequality [14]. D(α1‖α0) is taken to mean the distance from p_X(x; α1) to p_X(x; α0); because of the asymmetry, the distance D(α0‖α1) from p_X(x; α0) to p_X(x; α1) usually differs from D(α1‖α0). Despite these technical difficulties, recent work has shown that the Kullback-Leibler distance is geometrically important [9,15,16]. If a manifold of probability distributions were created so that probability function pairs having an equivalent optimal classifier defined the manifold's invariance structure, no distance metric can exist for the manifold because distance cannot be a symmetric quantity. The Kullback-Leibler distance takes on that role for this manifold. Delving further into this geometry yields a relationship between the Kullback-Leibler and Chernoff distance measures [17]. On the manifold, the geodesic curve p_t(x) linking two given probability distributions p_X(x; α0) and p_X(x; α1) is given by

p_t(x) = \frac{p_X^{\,1-t}(x;\alpha_0)\,p_X^{\,t}(x;\alpha_1)}{\mu(t)}, \qquad 0 \le t \le 1,

where μ(t) is defined in equation (2). Define the halfway point on the manifold according to the Kullback-Leibler distance as the distribution equidistant from the endpoints: D(p_{t*}‖p_X(·; α0)) = D(p_{t*}‖p_X(·; α1)). The location t* of the halfway point turns out to also be the parameter value that maximizes -ln μ(t), and the halfway distance equals the Chernoff distance: C(α0, α1) = D(p_{t*}‖p_X(·; α0)) [10,17]. The Bhattacharyya distance essentially chooses "halfway" to mean t = 1/2, which equals t* only in special cases.

These distance measures satisfy the following important properties.

(1) The Kullback-Leibler distance is not a symmetric quantity, but the Chernoff and Bhattacharyya distances are.

(2) These three distances have the additivity property: the distance between two joint distributions of statistically independent, identically distributed random variables equals the sum of the marginal distances. Note that because of the optimization step, the Chernoff distance is not additive when the random variables are not identically distributed; the Kullback-Leibler and Bhattacharyya distances are. When jointly defined random variables have a Markovian dependence structure, the Kullback-Leibler distance between their joint distributions equals the sum of distances between the conditional distributions. Taking the first-order Markovian case as an example, wherein p_X(x; α1) = p_{X_L|X_{L-1}}(x_L|x_{L-1}; α1) ··· p_{X_2|X_1}(x_2|x_1; α1) p_{X_1}(x_1; α1) and p_X(x; α0) has a similar structure,

D(\alpha_1\|\alpha_0) = D_{X_1}(\alpha_1\|\alpha_0) + \sum_{l=2}^{L} D_{X_l|X_{l-1}}(\alpha_1\|\alpha_0)    (3)

where

D_{X_l|X_{l-1}}(\alpha_1\|\alpha_0) = \int p_{X_l,X_{l-1}}(x_l,x_{l-1};\alpha_1)\,\ln\frac{p_{X_l|X_{l-1}}(x_l|x_{l-1};\alpha_1)}{p_{X_l|X_{l-1}}(x_l|x_{l-1};\alpha_0)}\,dx_l\,dx_{l-1}    (4)

The Bhattacharyya distance does not have this property.

(3) Through Stein's Lemma [18], the Kullback-Leibler and Chernoff distances are the exponential rates of optimal classifier performance probabilities. If X is a random vector having N statistically independent, identically distributed components with respect to the probabilistic models characterized by α0 and α1, the optimal (likelihood ratio) classifier results in error probabilities that obey the asymptotics

\lim_{N\to\infty}\frac{1}{N}\ln P_M = -D(\alpha_0\|\alpha_1), \quad P_F \text{ fixed}
\lim_{N\to\infty}\frac{1}{N}\ln P_F = -D(\alpha_1\|\alpha_0), \quad P_M \text{ fixed}
\lim_{N\to\infty}\frac{1}{N}\ln P_e = -\mathcal{C}(\alpha_0,\alpha_1)

Here, P_F, P_M, and P_e are the false-alarm, miss, and average-error probabilities, respectively. Loosely speaking, Stein's Lemma suggests that these error probabilities decay exponentially in the amount of data available to the likelihood-ratio classifier: for example, P_M ≈ exp(-N D(α0‖α1)) for a Neyman-Pearson classifier [19]. Because of the form of Stein's Lemma, these distances are known as the exponential rates of their respective error probabilities: the relevant distance for each error probability determines the rate of decay. Whether all Ali-Silvey distances satisfy some variant of Stein's Lemma is not known. The only distances known to be asymptotically related directly to optimal classifier error probabilities are the Kullback-Leibler and Chernoff distances.

(4) When the information vector has real, continuous-valued components and the derivatives exist, all Ali-Silvey distances have the properties that the first partial of d(α0, α) with respect to each component of α is zero and that the Hessian is proportional to the Fisher information matrix when evaluated at α = α0:

\left.\frac{\partial\, d(\boldsymbol{\alpha}_0,\boldsymbol{\alpha})}{\partial\alpha_i}\right|_{\boldsymbol{\alpha}=\boldsymbol{\alpha}_0} = 0, \qquad \left.\frac{\partial^2 d(\boldsymbol{\alpha}_0,\boldsymbol{\alpha})}{\partial\alpha_i\,\partial\alpha_j}\right|_{\boldsymbol{\alpha}=\boldsymbol{\alpha}_0} \propto \left[\mathbf{F}_X(\boldsymbol{\alpha}_0)\right]_{ij}

A Taylor series argument based on these results suggests that for perturbational changes in the parameter vector, α1 = α0 + δα, the distance between perturbed stochastic models is proportional to a quadratic form consisting of the perturbation and the Fisher information matrix evaluated at α0:

d(\boldsymbol{\alpha}_0,\boldsymbol{\alpha}_0+\delta\boldsymbol{\alpha}) \propto \delta\boldsymbol{\alpha}^{\mathsf T}\,\mathbf{F}_X(\boldsymbol{\alpha}_0)\,\delta\boldsymbol{\alpha}    (5)

The constant of proportionality equals 1/2 for the Kullback-Leibler distance, 1/8 for the Bhattacharyya distance, and t*(1 - t*)/2 for the Chernoff distance. We refer to this result as the locally Gaussian property of distance measures: when X is Gaussian with only the mean m(α) depending on the information parameter, all of the three distance measures of interest here (and many others) have the form

d(\boldsymbol{\alpha}_0,\boldsymbol{\alpha}_1) \propto \left[\mathbf{m}(\boldsymbol{\alpha}_1)-\mathbf{m}(\boldsymbol{\alpha}_0)\right]^{\mathsf T}\mathbf{K}^{-1}\left[\mathbf{m}(\boldsymbol{\alpha}_1)-\mathbf{m}(\boldsymbol{\alpha}_0)\right]

for all choices of the information parameter (whether perturbationally different or not). Consequently, the distance between perturbationally different probability functions resembles the distance between Gaussian distributions differing in the mean. The reason this property is important traces to the Cramér-Rao bound, a fundamental bound on the mean-squared estimation error [20]. The bound states that for scalar parameter changes, the mean-squared error incurred in estimating a parameter when it equals α0 can be no smaller than the reciprocal Fisher information. Thus, the larger the distance for a perturbational change in the parameter, the greater the Fisher information, and the smaller the estimation error can be.

The last two properties directly relate our distances to the performances of optimal classifiers and optimal parameter estimators. By analyzing distances, we at once assess how well two informationally different situations can be distinguished and how well the information parameter can be estimated through observation of the signal X.
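As a quick numerical illustration of property (4), the sketch below (our construction) uses an assumed Poisson model X ~ Poisson(α), for which both the Kullback-Leibler distance and the Fisher information have closed forms, and compares the exact distance for a shrinking perturbation δ with the quadratic form (1/2)δ²F(α0) of equation (5).

    import numpy as np

    def kl_poisson(a1, a0):
        """D(a1 || a0) between Poisson(a1) and Poisson(a0), in nats (closed form)."""
        return a1 * np.log(a1 / a0) + a0 - a1

    alpha0 = 4.0
    fisher = 1.0 / alpha0                      # Fisher information of the Poisson mean
    for delta in (1.0, 0.1, 0.01):
        exact = kl_poisson(alpha0 + delta, alpha0)
        quadratic = 0.5 * delta ** 2 * fisher  # the quadratic form of (5) with constant 1/2
        print(f"delta={delta:5.2f}   D={exact:.8f}   0.5*delta^2*F={quadratic:.8f}")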

3.2 Quantifying system performance

To analyze how well systems process information, we let Y denote the output of a system that has X as its input, as shown in Fig. 1. Because the input is stochastic and encodes information, Y is also a stochastic process that encodes information. Mathematically, the input-output relationship for a system is defined by the conditional probability function p_{Y|X}(y|x), which means that (α, X, Y) form a Markov chain α → X → Y: the input signal is all that is needed to calculate the output. Note that this formulation means that the system cannot access the information parameter α other than through the encoded input X.

We require the distance measure to be what we call information theoretic: it must obey the so-called Data Processing Theorem [17,21], which loosely states that a system operating on a signal cannot produce an output containing more information than that encoded in its input. The Data Processing Theorem is usually stated in terms of mutual information. Here, we require any viable distance measure to have the property that for any information change, when X → Y, d_Y(α0, α1) ≤ d_X(α0, α1): output distance cannot exceed input distance. All distances in the Ali-Silvey class [11] have this property by construction. Many others, such as the resistor-average (to be described later) and Chernoff distances, do as well.
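The following sketch (ours) checks this required property on an assumed discrete example: a Bernoulli input, whose probability of a one carries the information, passed through a binary symmetric channel with an assumed crossover probability.

    import numpy as np

    def kl_bernoulli(p, q):
        """Kullback-Leibler distance between Bernoulli(p) and Bernoulli(q), in nats."""
        return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

    eps = 0.2                                  # assumed crossover probability of the channel
    a0, a1 = 0.3, 0.6                          # two information states: P{X = 1}
    b0 = (1 - eps) * a0 + eps * (1 - a0)       # P{Y = 1} under alpha_0
    b1 = (1 - eps) * a1 + eps * (1 - a1)       # P{Y = 1} under alpha_1
    print("input distance :", kl_bernoulli(a1, a0))
    print("output distance:", kl_bernoulli(b1, b0))   # never exceeds the input distance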

To evaluate how well systems process information, we define the quantity γ_{X,Y}, the information transfer ratio, to be the ratio of the distance between the two output distributions corresponding to two information states and the distance between the corresponding input distributions:

\gamma_{X,Y}(\alpha_0,\alpha_1) = \frac{d_Y(\alpha_0,\alpha_1)}{d_X(\alpha_0,\alpha_1)}

This ratio is always less than or equal to one for information-theoretic distances. A value of one means that the information expressed by the input is perfectly reproduced in the output; a value of zero means the output doesn't represent the information change. Note that achieving a value of one does not require the output signal to be of the same kind as the input. Kullback [22] showed that if the information transfer ratio equals one for all α0, α1, Y is a sufficient statistic for X. More generally, the information transfer ratio may equal one only for some choices of α1 at each reference value α0 of the information parameter. (Consider a statistically independent sequence of Gaussian random variables wherein the mean and variance constitute the components of the parameter vector, and let a system compute the sample average: changes in the mean will yield an information transfer ratio of one while changes in the variance will not.) In such cases, the information transfer ratio quantifies those aspects of the information expressed perfectly in the system's output and those that aren't. A systematic study of a system's information processing demands that α0 and α1 vary systematically, with α0 usually taken as a reference point.
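As an elementary illustration of the information transfer ratio (our sketch, with assumed variances), consider a system that merely adds independent Gaussian noise to a Gaussian input whose mean encodes the information; both Kullback-Leibler distances are available in closed form.

    def kl_gauss_mean(a1, a0, var):
        """D(a1 || a0) for Gaussians sharing the variance 'var' (nats)."""
        return (a1 - a0) ** 2 / (2.0 * var)

    sx2, sw2 = 1.0, 3.0            # assumed input and additive-noise variances
    a0, a1 = 0.0, 2.0              # two information states (means)
    dX = kl_gauss_mean(a1, a0, sx2)          # input distance
    dY = kl_gauss_mean(a1, a0, sx2 + sw2)    # output distance: Y = X + W
    print("gamma_{X,Y} =", dY / dX)          # equals sx2/(sx2 + sw2) = 0.25 here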

For the special case wherein the information parameter vector is perturbed (α1 = α0 + δα), we can explicitly write the information transfer ratio for all distance measures that satisfy the locally Gaussian property (5):



\gamma_{X,Y}(\boldsymbol{\alpha}_0,\boldsymbol{\alpha}_0+\delta\boldsymbol{\alpha}) = \frac{\delta\boldsymbol{\alpha}^{\mathsf T}\,\mathbf{F}_Y(\boldsymbol{\alpha}_0)\,\delta\boldsymbol{\alpha}}{\delta\boldsymbol{\alpha}^{\mathsf T}\,\mathbf{F}_X(\boldsymbol{\alpha}_0)\,\delta\boldsymbol{\alpha}}

Note that the information transfer ratio for perturbational changes does not depend on the choice of distance measure. A subtlety emerges here in the multiparameter case: the value of γ depends, in general, on the direction of the perturbation δα, which means that lim_{δα→0} γ_{X,Y}(α0, α0 + δα) does not exist. This result is not surprising since it tells us that the information transfer ratio differs in general for each parameter component α_i. For example, there is no reason to believe that a system would preserve the information about amplitude and phase to the same fidelity. Examples show that when α0 and α1 differ substantially, the information transfer ratio's value does depend on distance measure choice. For example, symmetric and non-symmetric measures must yield different ratios.

The information transfer ratio quantifies the information-processing performance of systems and allows us to think of them as information filters. Let α0 define an operating point and allow α1 to vary about it. Some changes might yield a dramatic reduction in γ while other changes leave γ near its maximal value of one. In this sense, the system, by acting on its input signals, reduces the ability to discern certain information changes relative to others. A plot of γ_{X,Y}(α0, α1) as a function of α1 reveals the system's information transfer ability. The "passband" occurs when the information transfer ratio is relatively large and the "stopband" when it is much smaller. Note that in general, this ability varies with the operating point defined by α0. The maximum value of the information transfer ratio defines the system's information gain and quantifies how well the system's output can be used to extract information relative to the input.

To illustrate the information filter concept, we consider an array-processing example. Fig. 2 shows a five-sensor linear array that attempts to determine the waveform of a propagating sinusoidal signal observed in the presence of additive Gaussian noise by using classic delay-and-sum beamforming [23]. With the information parameters being the amplitude and propagation angle, the Kullback-Leibler distance between inputs corresponding to the reference and to other amplitudes and angles is given by (see Appendix A for how the Kullback-Leibler distance between analog Gaussian processes is calculated)

D_{\mathbf{X}}\big((a_1,\theta_1)\,\|\,(a_0,\theta_0)\big) = \frac{T}{2N_0}\sum_{m=0}^{M-1}\Big[a_0^2 + a_1^2 - 2a_0 a_1\cos\!\big(2\pi f_0\,m\,[\tau(\theta_1)-\tau(\theta_0)]\big)\Big],

where T is the observation interval, N_0/2 is the spectral height of the white noise, and a, f_0, θ are the propagating sinusoid's amplitude, frequency and angle, respectively. τ(θ) = (d/c) sin θ is the propagation delay between sensors spaced d apart, with c


Fig. 2. A sinusoidal signal propagates toward a five-sensor array. Each sensor measures the propagating field at a particular spatial location in the presence of noise, which we assume to be Gaussian and white both temporally and spatially. The left column shows analog and digital beamformers, each of which delays each sensor's output by an amount designed to cancel the propagation delay corresponding to a particular angle, here equal to θ*. The digital system's delays are quantized versions of these delays. The delayed signals are then averaged (the summers in the block diagrams represent averagers). For analyzing the information processing capabilities of these beamformers, the input is the vector of sensor outputs over a time interval lasting T seconds and the output is defined over a shorter time interval. The propagating signal is a sinusoid of amplitude a and frequency f0, and the noise at each sensor had spectral height 500. In the digital case, each sensor output is lowpass-filtered then sampled. The information vector contains the propagating signal's amplitude and propagation angle: α = (a, θ). The top plot shows the Kullback-Leibler distance between inputs measured in bits (choosing base 2 logarithms in (1)), with the reference being α0 = (a0, θ0). The right column shows the information transfer ratios for the two beamformers.


being the propagation speed. M is the number of sensors in the linear array. How the distance varies with angle reflects the signal's sinusoidal nature. The analog beamformer, whose output is formed over the shorter interval that remains once the steering delays have been applied, yields an output distance D_Y((a_1,θ_1)‖(a_0,θ_0)) of the same general form: a quadratic function of the two amplitudes whose terms depend, through the array's response, on the residual delays τ(θ_i) - τ(θ*) left after steering toward θ*.

Fig. 2 shows the resulting information transfer ratio. The design behind a beamformer is to produce an estimate of the waveform propagating from a specific direction θ*. Signals propagating from other directions should be suppressed to some degree. Because the beamformer is linear, it does not affect signal amplitude. The information transfer ratio should reflect these signal processing capabilities, but in addition indicate how the processing affects the ability of optimal processors to determine information represented by the propagating signal. Because we selected amplitude and propagation angle as the information parameters, we can examine how the beamformer reduces the ability of optimal systems to extract these parameters. First of all, the analog beamformer's maximal information transfer ratio does not achieve the largest possible value of one. This fundamental reduction occurs because each sensor produces a signal over the same time interval that is delayed by the beamforming algorithm relative to the others before averaging. Since averaging only makes sense over a time interval when all delayed signals overlap, the signal-to-noise ratio at the output is smaller than that of the vector of sensor outputs. The maximal information transfer ratio is maintained for two conditions. The ridge extending along the angle θ* for amplitudes larger than the reference indicates the intended spatial filtering properties of the array. Signals propagating from other directions are suppressed. For amplitudes smaller than the reference, the information transfer ratio is essentially constant for all propagation angles. This property, which is certainly not part of the beamformer's design, means that the beamformer has little effect on the ability to detect smaller-amplitude signals.

The Kullback-Leibler distance at the output of the digital beamformer has the same form, with integrals over time replaced by sums over samples taken every sampling interval; it expresses the additional loss due to sampling. The digital beamformer's information transfer ratio is smaller because sampling restricts the steering to angles for which the required delay τ(θ) is an integer number of sampling intervals. The sampling interval and reference propagation direction were selected in this example so that the required delay would not be an integer number of samples. The beamformer no longer provides maximal information at an angle corresponding to that of the reference signal, and the information


transfer ratio shown in Fig. 2 reflects this loss.
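The sketch below is a simplified, discrete-time analogue (ours) of the Fig. 2 computation: M sensors observe a delayed sinusoid in white Gaussian noise, the Kullback-Leibler distance between two (amplitude, angle) hypotheses follows from the Gaussian equal-covariance formula, and a delay-and-sum beamformer with integer sample delays supplies the output distance. All numerical values are assumptions chosen for illustration, not the parameters used in the paper.

    import numpy as np

    M, fs, f0 = 5, 8000.0, 1000.0      # sensors, sampling rate (Hz), tone frequency (Hz) -- assumed
    N = 400                            # samples per sensor
    noise_var = 500.0                  # white Gaussian noise variance per sample -- assumed
    d_over_c = 1.0 / (4.0 * f0)        # sensor spacing divided by propagation speed (s) -- assumed

    def tau(theta_deg):
        """Propagation delay between adjacent sensors for arrival angle theta."""
        return d_over_c * np.sin(np.radians(theta_deg))

    def sensor_means(a, theta_deg):
        """Noise-free sensor outputs a*sin(2*pi*f0*(t - m*tau(theta))), m = 0..M-1."""
        t = np.arange(N) / fs
        return np.array([a * np.sin(2 * np.pi * f0 * (t - m * tau(theta_deg))) for m in range(M)])

    def kl_equal_cov(m1, m0, var):
        """KL distance (nats) between Gaussian models differing only in their means."""
        return float(np.sum((m1 - m0) ** 2) / (2.0 * var))

    def beamform(means, theta_steer):
        """Delay-and-sum toward theta_steer using integer sample delays (digital version)."""
        shifts = [int(round(m * tau(theta_steer) * fs)) for m in range(M)]
        L = N - max(shifts)            # keep only the interval where all channels overlap
        return np.mean([means[m, shifts[m]:shifts[m] + L] for m in range(M)], axis=0)

    a0, th0 = 10.0, 20.0               # reference amplitude and angle -- assumed
    a1, th1 = 12.0, 30.0               # alternative information state -- assumed
    dX = kl_equal_cov(sensor_means(a1, th1), sensor_means(a0, th0), noise_var)
    dY = kl_equal_cov(beamform(sensor_means(a1, th1), th0),
                      beamform(sensor_means(a0, th0), th0), noise_var / M)
    print("input distance:", dX, " output distance:", dY, " gamma:", dY / dX)

Because the beamformer output is a deterministic function of the sensor data, the printed ratio never exceeds one, as the Data Processing Theorem requires.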

3.3 Choosing a distance measure

We focus here on the Kullback-Leibler distance and measures related to it. We use the Kullback-Leibler distance because its computational and analytic properties are superior to those of the Chernoff distance. For example, estimating the Chernoff distance from data would require solving an optimization problem, and the fact that the Chernoff distance does not have the additivity property means it is less convenient to use analytically. We have a complete empirical theory [24] that frames how to estimate Kullback-Leibler distances from data, examples of which can be found in [25,26].

Despite the Kullback-Leibler distance's computational and theoretical advantages, what becomes a nuisance in applications is its lack of symmetry. We have found a simple geometric relationship among the distance measures described here, and this relationship leads to a symmetric distance measure related to the Kullback-Leibler distance. Although Jeffreys [27] did not develop it to symmetrize the Kullback-Leibler distance, the so-called J-divergence equals the average of the two possible Kullback-Leibler distances between two probability distributions. (Many authors, including Jeffreys, define the J-divergence as the sum rather than the average; using the average fits more neatly into the graphical relations developed subsequently.)

Assuming the component Kullback-Leibler distances exist,

\mathcal{J}(\alpha_0,\alpha_1) = \frac{D(\alpha_0\|\alpha_1) + D(\alpha_1\|\alpha_0)}{2}.

We now have a symmetric quantity that is easily calculated, and the Ali-Silvey class contains it (c(Λ) = (Λ - 1) ln Λ). However, its relation to classifier performance is more tenuous than that of the other distances [10,14]:

\lim_{N\to\infty} -\frac{1}{N}\ln P_e \le \mathcal{J}(\alpha_0,\alpha_1).

We have found this bound to be loose, with it not indicating well the exponential rate of the average-error probability P_e (which is equal to the Chernoff distance). Symmetrizations beyond simple averaging are the geometric and harmonic means. The geometric mean G(α0, α1) = [D(α0‖α1) D(α1‖α0)]^{1/2} does not seem to have as interesting properties as the harmonic mean. We define a new symmetric distance, which we call the resistor-average distance, via the harmonic mean:

\mathcal{R}(\alpha_0,\alpha_1)^{-1} = D(\alpha_0\|\alpha_1)^{-1} + D(\alpha_1\|\alpha_0)^{-1}    (6)



This quantity gets its name from the formula for the equivalent resistance of a set of parallel resistors: 1/R_equiv = 1/R_1 + 1/R_2 + ···. It equals the harmonic sum (half the harmonic mean) of the component Kullback-Leibler distances. It is not as arbitrary as the J-divergence or the geometric mean, as it arises by considering the halfway point on the geodesic using the opposite sense of "equidistant" from the one used in formulating the Chernoff distance. Denoting this point along the geodesic by t**, we seek it by requiring D(α0‖p_{t**}) = D(α1‖p_{t**}); unlike the Chernoff notion of "halfway," this one has a closed-form solution:

t^{**} = \frac{D(\alpha_1\|\alpha_0)}{D(\alpha_0\|\alpha_1) + D(\alpha_1\|\alpha_0)}.

The quantities t* and t** are equal when the probability distributions are symmetric and differ only in mean. In these special cases, the Kullback-Leibler distance is symmetric, making R(α0, α1) = D(α0‖α1)/2. The resistor-average is not an Ali-Silvey distance, but because of its direct relationship to the Kullback-Leibler distance, it is locally Gaussian (as are the J-divergence and the geometric mean). The resistor-average distance is not additive when the signal has statistically independent components. However, it is directly computed from quantities that are (Kullback-Leibler distances), making it share the computational and interpretive attributes that additivity offers while producing a symmetric result. For these reasons, the resistor-average distance has been used in pattern and speech recognition algorithms [28,29].

The relation between the various distance measures can be visualized graphically (Fig. 3). The quantity -ln μ(t) is concave, and its derivatives at t = 0 and t = 1 are D(α0‖α1) and -D(α1‖α0), respectively. The quantity t** occurs at the intersection of these tangent lines, which means that the resistor-average distance upper bounds the Chernoff distance: C(α0, α1) ≤ R(α0, α1). Consequently,

\lim_{N\to\infty} -\frac{1}{N}\ln P_e \le \mathcal{R}(\alpha_0,\alpha_1).

The various distance measures described here have a specific ordering that applies no matter what the component probability distributions may be:

\max\{D(\alpha_0\|\alpha_1), D(\alpha_1\|\alpha_0)\} \ge \mathcal{J}(\alpha_0,\alpha_1) \ge \mathcal{G}(\alpha_0,\alpha_1) \ge \min\{D(\alpha_0\|\alpha_1), D(\alpha_1\|\alpha_0)\} \ge \mathcal{R}(\alpha_0,\alpha_1) \ge \mathcal{C}(\alpha_0,\alpha_1) \ge \mathcal{B}(\alpha_0,\alpha_1).
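The ordering chain can be checked numerically; the sketch below (ours, using two assumed discrete distributions) computes all six quantities and asserts the chain.

    import numpy as np

    p0 = np.array([0.7, 0.2, 0.1])     # assumed example distributions
    p1 = np.array([0.1, 0.3, 0.6])

    D01 = float(np.sum(p0 * np.log(p0 / p1)))      # D(alpha_0 || alpha_1)
    D10 = float(np.sum(p1 * np.log(p1 / p0)))      # D(alpha_1 || alpha_0)
    J = 0.5 * (D01 + D10)                          # J-divergence (average form)
    G = float(np.sqrt(D01 * D10))                  # geometric mean
    R = D01 * D10 / (D01 + D10)                    # resistor-average distance, (6)
    mu = lambda t: float(np.sum(p0 ** (1.0 - t) * p1 ** t))
    C = max(-np.log(mu(t)) for t in np.linspace(0.0, 1.0, 1001))   # Chernoff
    B = -np.log(mu(0.5))                           # Bhattacharyya

    assert max(D01, D10) >= J >= G >= min(D01, D10) >= R >= C >= B
    print(J, G, R, C, B)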

In many realistic examples the Chernoff distance roughly equals half the resistor-average distance. (Computer experiments show that this relationship, found in analytic examples, does not apply in general: in some examples, the curve -ln μ(t) can lie close to the t-axis, leaving the Chernoff and resistor-average distances far apart, while in others it can hug the tangent lines so that the two are numerically close.) Consequently, we have an easily computed quantity that can approximate the more difficult-to-compute Chernoff distance. The J-divergence differs much more from the Chernoff distance, while the more difficult-to-compute Bhattacharyya distance can be quite close. This graphical depiction of these distance measures suggests that as the two Kullback-Leibler distances differ more and more, the greater the discrepancies between the Chernoff distance and both the J-divergence and the Bhattacharyya distance.



Fig. 3. This figure portrays relations among many of the most frequently used information-theoretic distances. The focus is the function -ln μ(t) used to define the Chernoff distance (2). The slopes of its tangent lines at t = 0 and t = 1 correspond to the Kullback-Leibler distances. The tangent lines intersect at t = t**, and the value of the intersection corresponds to the resistor-average distance defined in (6). The Bhattacharyya distance B(α0, α1) equals -ln μ(1/2). The J-divergence J(α0, α1) equals the average of the two Kullback-Leibler distances, with the geometric mean G(α0, α1) lying somewhere between the J-divergence and the smaller of the Kullback-Leibler distances.


4 A system theory for information processing

Using the information transfer ratio based on the Kullback-Leibler distance, we can quantify how various system organizations affect the ability to perform information processing. We emphasize that these results make few assumptions about the systems (they can be linear or nonlinear), and we assume little about the signals they process and produce.

Cascaded Systems. If two systems are in cascade (Fig. 4a), with X → Y → Z, the overall information transfer ratio is the product of the component ratios:

\gamma_{X,Z}(\alpha_0,\alpha_1) = \gamma_{X,Y}(\alpha_0,\alpha_1)\cdot\gamma_{Y,Z}(\alpha_0,\alpha_1)    (7)

Notice that this result holds for general distance measures (not just the Kullback-Leibler distance).



Fig. 4. The various system structures for which we can calculate the information processing transfer ratio are shown. a: The cascade structure. b: A system having several statistically independent inputs. c: A system having several distinct outputs. d: Several systems acting separately in parallel on their individual inputs. e: Several systems acting separately on the same input.

Because information transfer ratios cannot exceed one, this result means that once a system reduces γ for some information change, that loss of information representation capability cannot be recovered by subsequent processing. This result easily generalizes to any number of systems in cascade. However, insertion of a pre-processing system can increase the information transfer ratio. For example, consider a memoryless system that center-clips: whenever the input falls below a threshold, the output is zero; otherwise the output equals the input. Let the parameter α correspond to the input signal's amplitude. If the amplitude corresponding to α0 lies below the threshold, the output is zero and, for some range of amplitude changes, the output remains zero, which yields γ_{X,Y}(α0, α1) = 0; no amount of post-processing will change this. On the other hand, if we insert an amplifier of sufficient gain before the center-clipper, γ will be non-zero, making the overall information transfer ratio larger than its original value, but smaller than that of the amplifier. Though (7) may bear resemblance to transfer-function formulas in linear system theory, wherein systems don't "load" each other and system order doesn't matter, these properties don't apply to information processing systems in general. In particular, our theory suggests that pre-processing can improve information processing performance; post-processing cannot.
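The center-clipping remark can be made concrete with a small numerical sketch (ours). The model here is an assumption chosen for illustration: the information is the mean α of a Gaussian input, the clipper outputs zero whenever its input falls below a threshold c, and a pre-amplifier multiplies the input by a gain g before clipping. The output distance combines the probability mass at zero with an integral over the surviving density.

    import numpy as np
    from math import erf, sqrt

    def gamma_clipper(a0, a1, sigma, c, gain=1.0, npts=200001):
        """Information transfer ratio of (amplifier ->) center-clipper for mean information."""
        m0, m1, s = gain * a0, gain * a1, gain * sigma
        atom = lambda m: 0.5 * (1.0 + erf((c - m) / (s * sqrt(2.0))))   # P{output = 0}
        y = np.linspace(c, m1 + 12.0 * s, npts)                         # support above the clip level
        pdf = lambda m: np.exp(-(y - m) ** 2 / (2.0 * s ** 2)) / (s * np.sqrt(2.0 * np.pi))
        p0, p1, q0, q1 = pdf(m0), pdf(m1), atom(m0), atom(m1)
        dY = q1 * np.log(q1 / q0) + float(np.sum(p1 * np.log(p1 / p0)) * (y[1] - y[0]))
        dX = (a1 - a0) ** 2 / (2.0 * sigma ** 2)      # distance at the original (pre-amplifier) input
        return dY / dX

    print("clipper alone      :", gamma_clipper(0.0, 1.0, 1.0, c=5.0))
    print("amplifier (gain 10):", gamma_clipper(0.0, 1.0, 1.0, c=5.0, gain=10.0))

With the assumed numbers, the ratio rises from nearly zero to a value well below one: better than the clipper alone, but still less than the amplifier's ratio of one, matching the qualitative claim above.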

Multiple Inputs. When the input consists of several statistically independent components (Fig. 4b), the overall information transfer ratio is related to the individual transfer ratios by an expression identical to the parallel resistor formula:

\gamma_{\mathbf{X},Y}(\alpha_0,\alpha_1)^{-1} = \sum_{i} \gamma_{X_i,Y}(\alpha_0,\alpha_1)^{-1}

This result applies only to those distance measures having the additivity property. Note that each γ_{X_i,Y} can exceed one because each input-output pair (X_i, Y) does not necessarily form a Markov chain. For example, if the input consists of N independent and identically distributed Gaussian variables that encode information in the mean and the system's output equals the sum of the inputs, every γ_{X_i,Y} = N, while γ_{X,Y} = 1. (The information transfer ratio equaling one is not surprising because the sum of the observations is the sufficient statistic for Gaussian distributions parameterized by the mean.) The parallel resistor formula implies γ_{X,Y}(α0, α1) ≤ min_i γ_{X_i,Y}(α0, α1). While this bound may not be tight (as we have seen, the right side can actually be greater than one), it tells us that the system's overall information transfer ratio is smaller than the individual ratios for each input. Processing one input badly can dramatically affect a multi-input system's overall processing capability.
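For the Gaussian-sum example just described, the closed-form distances make the parallel-resistor relation easy to verify (our sketch, with assumed values):

    N, s2 = 8, 1.0                      # assumed number of inputs and per-input variance
    a0, a1 = 0.0, 1.0                   # two values of the common mean
    d_each = (a1 - a0) ** 2 / (2.0 * s2)               # input distance per component
    d_input = N * d_each                               # additivity over independent inputs
    d_output = (N * (a1 - a0)) ** 2 / (2.0 * N * s2)   # the sum is Gaussian with mean N*alpha, variance N*s2
    gamma_i = d_output / d_each                        # individual ratios: each equals N
    gamma = d_output / d_input                         # overall ratio
    print(gamma_i, gamma, 1.0 / (N * (1.0 / gamma_i)))  # parallel-resistor formula gives the same value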

Multiple Outputs. Let a system have one input and N outputs Y = (Y_1, ..., Y_N), as in Fig. 4c. When we use the Kullback-Leibler distance, the overall information transfer ratio is related to the component ratios as

\gamma_{X,\mathbf{Y}}(\alpha_0,\alpha_1) = \gamma_{X,Y_1}(\alpha_0,\alpha_1) + \sum_{n=2}^{N}\gamma_{X,\,Y_n|Y_1,\ldots,Y_{n-1}}(\alpha_0,\alpha_1)    (8)

The conditional information transfer ratio in the summation derives from the Kullback-Leibler distance's additivity property for the Markovian dependence case. Since the indexing of the outputs is arbitrary, result (8) applies to all permutations of the outputs. Also note that this result applies regardless of the systems, the nature of the signals Y_n, and how each signal represents the information. Result (8) says that the total information transfer ratio equals one individual transfer ratio plus the sum of the incremental ratios that quantify how much each additional output increases the ability to discriminate the two information states. Since γ_{X,Y} ≤ 1, the sum of incremental information transfer ratios must attain some asymptotic value, which means that beyond some point, additional outputs may not significantly increase the information expressed by the aggregate output. The point of diminishing returns is reached, and producing more outputs does not significantly enhance the information processing capability.

One important special case has several systems in parallel, each of which processes the same input. Mathematically, the N outputs are conditionally independent of each other given the input: p_{\mathbf{Y}|X}(\mathbf{y}|x) = \prod_n p_{Y_n|X}(y_n|x). We consider two cases. The simplest has each system with its own input that is statistically independent of the other systems' inputs (Fig. 4d), as in the common model for distributed detection [30]. Because of the additivity property, the input and output Kullback-Leibler distances equal the sums of the individual distances. Simple bounding arguments show that the overall information transfer ratio γ_{X,Y} is bounded below by the smallest information transfer ratio expressed by the individual systems and above by the largest [31]. Consequently, this structure shows no information gain as the number of systems grows (i.e., little variation with N). A more interesting case occurs when the systems have a common input, wherein one signal forms the input to all N systems (Fig. 4e). When each system's individual information transfer ratio is



greater than zero, we show in Appendix B that as the number of systems increases, γ_{X,Y} will always increase strictly monotonically and will have a limiting value of one as long as the systems being added don't have information transfer ratios that decrease too rapidly. This result applies widely: we require no assumption about the nature of the signals involved or about the systems (other than γ > 0 for each system). This result holds for any information processing system having the structure shown in Fig. B.1 regardless of what the systems do, what signal serves as the common input, or what kinds of output signals are produced. For example, a sufficiently large population of neurons operating independently on their inputs will convey every information-bearing aspect of their inputs regardless of the neural code employed [31], and space-time coding systems, wherein the coding and the channels are independent of each other, can transmit information asymptotically error-free.

As a simple illustrative case, consider parallel systems that share a Gaussian input (X ~ N(m, σ_X²)), with each system adding a Gaussian random variable (mean zero and variance σ_i²) statistically independent of the input and of the variables added by the other systems. Let the mean m encode the information. Each output Y_i has a Gaussian distribution, and its individual information transfer ratio calculated with the Kullback-Leibler distance equals σ_X²/(σ_X² + σ_i²). The collective output Y is a Gaussian random vector having mean m·1 and covariance diag{σ_1², ..., σ_N²} + σ_X² 1 1', where 1 = col{1, ..., 1} and ' denotes matrix transpose. The overall information transfer ratio calculated with the Kullback-Leibler distance equals

\gamma_{X,\mathbf{Y}}(N) = \frac{\sigma_X^2\sum_{i=1}^{N}\sigma_i^{-2}}{1 + \sigma_X^2\sum_{i=1}^{N}\sigma_i^{-2}}.

If the sum \sum_i \sigma_i^{-2} converges, the information transfer ratio does not achieve its maximal value of one. Convergence only occurs when lim_{i→∞} σ_i² = ∞; in other words, when the systems become increasingly noisy at a sufficiently rapid rate. When the sum of reciprocal variances diverges (when, for example, σ_i² is constant), the collective output can be used to extract the mean with the same fidelity as the mean can be estimated from the input. When the noise variances are bounded, σ_i² ≤ σ_max², we have

\gamma_{X,\mathbf{Y}}(N) \ge \frac{N\sigma_X^2/\sigma_{\max}^2}{1 + N\sigma_X^2/\sigma_{\max}^2},

so that 1 - γ_{X,Y}(N) decays at least hyperbolically (proportionally to 1/N).

Somewhat surprisingly, the asymptotic rates at which the information transfer ratio increases depend only on the nature of the signal X and not on the nature of Y (see Appendix B):

1 - \gamma_{X,\mathbf{Y}}(N) \sim \begin{cases} c_1\,e^{-c_2 N}, & X \text{ discrete-valued} \\ c/N, & X \text{ continuous-valued} \end{cases}    (9)
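For the Gaussian parallel structure with equal (assumed) noise variances, the closed-form ratio derived above makes the hyperbolic approach to one easy to tabulate (our sketch):

    sx2, s2 = 1.0, 4.0                  # assumed input and (equal) noise variances
    for N in (1, 10, 100, 1000):
        g = (sx2 * N / s2) / (1.0 + sx2 * N / s2)
        print(N, g, N * (1.0 - g))      # N*(1 - gamma) tends to the constant s2/sx2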

To illustrate that the asymptotic behavior depends on the nature of the input distribution, consider the case wherein the input is an exponentially distributed random variable and each system's output is a Poisson random variable:


p_X(x;\lambda) = \lambda e^{-\lambda x},\; x \ge 0, \qquad P_{Y_i|X}(y|x) = \frac{(g_i x)^y}{y!}\,e^{-g_i x},\; y = 0,1,2,\ldots

Here, g_i is the gain between the input and the i-th output. Thus, the input is continuous-valued. Despite the fact that the output is discrete-valued, our theory predicts that the asymptotics of the information transfer ratio will resemble those of the Gaussian example examined above. The unconditional output probability function for each system is geometric:

P_{Y_i}(y;\lambda) = \frac{\lambda}{\lambda+g_i}\left(\frac{g_i}{\lambda+g_i}\right)^{y}, \qquad y = 0,1,2,\ldots

For the aggregate parallel system, the input and output Kullback-Leibler distances are

D_X(\lambda_1\|\lambda_0) = \ln\frac{\lambda_1}{\lambda_0} + \frac{\lambda_0-\lambda_1}{\lambda_1},
\qquad
D_{\mathbf{Y}}(\lambda_1\|\lambda_0) = \ln\frac{\lambda_1}{\lambda_0} + \left(1 + \frac{\sum_i g_i}{\lambda_1}\right)\ln\frac{\lambda_0 + \sum_i g_i}{\lambda_1 + \sum_i g_i}.

For the information transfer ratio to achieve a value of one, the sum of the gains must diverge. In this case, plotting the information transfer ratio when g_i = g reveals a hyperbolic asymptotic behavior. (While the hyperbolic asymptotic formula in (9) is valid, we found that the formula γ_{X,Y}(N) ≈ cN/(1 + cN) more accurately characterizes how γ varies with N in several examples; this formula is exact in the Gaussian example and closely approximates the information transfer ratio in this Poisson example.)
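Using the distances given above for this model, a short sketch (ours, with assumed λ0, λ1, and equal gains) exhibits the hyperbolic behavior: N(1 - γ) settles toward a constant as N grows.

    import numpy as np

    lam0, lam1, g = 1.0, 2.0, 1.0       # assumed rates and per-system gain
    DX = np.log(lam1 / lam0) + (lam0 - lam1) / lam1
    for N in (1, 10, 100, 1000):
        G = N * g
        DY = np.log(lam1 / lam0) + (1.0 + G / lam1) * np.log((lam0 + G) / (lam1 + G))
        gamma = DY / DX
        print(N, gamma, N * (1.0 - gamma))   # the product settles near a constant (hyperbolic approach)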

Changing the input in this Poisson example to a Bernoulli random variable results in exponential asymptotics. Here, P_X(x) equals 1 - p when x = 0 and equals p when x = 1. The input and output Kullback-Leibler distances are

D_X(p_1\|p_0) = p_1\ln\frac{p_1}{p_0} + (1-p_1)\ln\frac{1-p_1}{1-p_0},

D_{\mathbf{Y}}(p_1\|p_0) = \Big[(1-p_1)+p_1 e^{-\sum_i g_i}\Big]\,\ln\frac{(1-p_1)+p_1 e^{-\sum_i g_i}}{(1-p_0)+p_0 e^{-\sum_i g_i}} \;+\; p_1\Big(1-e^{-\sum_i g_i}\Big)\ln\frac{p_1}{p_0}.

Thus, if the sum of gains diverges, the information transfer ratio has an asymptoticvalue of one that it approaches exponentially. Consequently, a change in the natureof the input changes the asymptotics of the information transfer ratio.



From a general perspective, a parallel structure of information processing by noisy systems can overcome component-system noisiness and achieve a nearly perfectly effective information representation (an information transfer ratio close to one) with relatively few systems. The information transfer ratio for each system determines how many systems are needed: the smaller γ_{X,Y_i} is, the more systems, proportionally, are needed to compensate.

5 Summary and conclusions

Our theory of information processing rests on the approach of using information change to probe how well a signal (any signal) represents information and how a system "filters" the information encoded in its input. In the theory's current form, signals must be completely described by a probability function. Both continuous-time and discrete-time signals are certainly in this class. Signals could be symbolic-valued, and multichannel signals having a mixed character are embraced by our theory. Thus, we can, within one framework, describe a wide variety of stochastic signals. The main idea is to use information-theoretic distances, especially the Kullback-Leibler distance, to quantify how the signal's probability function changes as a consequence of an informational change. The choice of the Kullback-Leibler distance follows from the desire for the distance to express information processing performance. In the current theory, information processing means either classification (distinguishing between informational states) or estimation (estimating the information). If one informational state change yields a distance of 2 bits and another a distance of 3 bits, the detector for the second change has error probabilities roughly a factor of 2 smaller than the one for the first. For perturbational changes of real-valued information parameters, the Kullback-Leibler distance is proportional to the Fisher information matrix, thus quantifying how well the information parameter can be estimated in the mean-square sense. We used the information transfer ratio to quantify how the information filtering properties of analog and discrete-time beamformers differ (Fig. 2) and, in another paper, how effective a neural signal generator is [26]. Thus, our approach can cope with a variety of signals and systems, and analyze them in a unified way.

Conventional wisdom suggests that static measures (those that do not require the information to change), such as mutual information and entropy, could be used to assess a signal's information content. Unfortunately, entropy does not quantify the information contained in a signal, for two reasons. First of all, it considers a signal as an indivisible quantity, not reflecting what is information-bearing and what is not, and not reflecting what information expressed by the signal is relevant. Paralleling our approach, one could consider measuring entropy changes when information changes. However, entropy does not obey the Data Processing Theorem. For example, increasing (decreasing) the amplitude of a signal increases (decreases) its entropy, which means the output entropy is greater (less) than the input entropy.

Secondly, entropy cannot be defined for all signals. The entropy of an analog signal can either be considered to be infinite or quantified by differential entropy [2]. Differential entropy is problematic because it is not scale-invariant. For example, the differential entropy of a Gaussian random variable is $\frac{1}{2}\log 2\pi e\sigma^2$. Depending on whether the standard deviation is expressed in volts or millivolts, the entropy changes! Differential entropy can be related to Fisher information (de Bruijn's identity; see [17, Theorem 16.6.2]), but this relationship is not particularly useful. The Kullback-Leibler distance remains constant under all transformations that result in sufficient statistics, scale being only one such transformation.
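
A small numerical check of this point (illustrative values only): rescaling a Gaussian signal shifts its differential entropy by the logarithm of the scale factor, while the Kullback-Leibler distance between two informationally different Gaussian models is unchanged when both are expressed in the new units.

import math

def diff_entropy_gaussian(sigma):
    # differential entropy (nats) of a Gaussian with standard deviation sigma
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def kl_gaussian(mean0, mean1, sigma):
    return (mean1 - mean0) ** 2 / (2.0 * sigma ** 2)

sigma = 0.001            # standard deviation expressed in volts (1 mV)
scale = 1000.0           # conversion from volts to millivolts

print(diff_entropy_gaussian(sigma))           # entropy with data in volts
print(diff_entropy_gaussian(scale * sigma))   # shifts by log(scale): not scale-invariant

# the Kullback-Leibler distance is unchanged when means and sigma are rescaled together
print(kl_gaussian(0.0, 0.002, sigma))
print(kl_gaussian(0.0, scale * 0.002, scale * sigma))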

An alternative choice for the quantity upon which to base a theory of information processing is mutual information. Mutual information is a statistical similarity measure, quantifying how closely the probability functions of a system's input and output agree. Mutual information plays a central role in rate-distortion theory, where it quantifies how well the output expresses the input. The information bottleneck method uses this property to assess how well signals encode information when faced with an ultimate signal compression [5]. This approach not only requires specifying what is relevant information, but the information must also have a stochastic model. While in many problems information could be considered parametric (this paper uses many such examples), requiring a stochastic model is overly restrictive. We attempted to develop a parallel theory that used mutual information, using notions described by Popoli and Mendel [32]. Not only is the relationship between mutual information and classification/estimation tenuous and indirect, but no simple structural theory could be obtained. Furthermore, using mutual information would require all empirical studies to have access to each system's input and output because the joint probability function between input and output is required. From our experience in neuroscience, this requirement can rarely be met in practice. Usually both can't be measured simultaneously, and what comprises input and output is only broadly known. Furthermore, the well-known curse of dimensionality becomes apparent. If O(K) data values are required to estimate either of the marginal probability functions to some accuracy, estimating the joint distribution requires O(K²). Estimating the Kullback-Leibler or resistor-average distances only requires estimates of a single signal's probability function under two conditions, which is O(K).
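
For instance, a histogram-based estimate of the Kullback-Leibler distance needs only recordings of one signal under the two informational conditions; the sketch below (synthetic Gaussian data, an arbitrary bin grid, and a crude regularization of empty bins are all illustrative assumptions) shows the idea.

import numpy as np

rng = np.random.default_rng(0)
y_cond0 = rng.normal(0.0, 1.0, 5000)   # hypothetical recordings under condition alpha_0
y_cond1 = rng.normal(0.5, 1.0, 5000)   # hypothetical recordings under condition alpha_1

edges = np.linspace(-5.0, 5.0, 41)
counts1, _ = np.histogram(y_cond1, bins=edges)
counts0, _ = np.histogram(y_cond0, bins=edges)
p1 = (counts1 + 1e-12) / (counts1.sum() + 1e-12 * len(counts1))
p0 = (counts0 + 1e-12) / (counts0.sum() + 1e-12 * len(counts0))

d_kl = np.sum(p1 * np.log(p1 / p0))    # estimated from the two marginals only
print(d_kl, 0.5 ** 2 / 2.0)            # compare with the true value (m1 - m0)^2 / (2 sigma^2)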

Ultimately, we believe Weaver was right [6]: information is about meaning and how processing information leads to effective action. On more technical grounds, using rate-distortion theory or the information bottleneck requires specification of a distortion function that measures the error between the encoded and decoded information. In empirical work, such as in neuroscience, wherein you gather data to study a part of an information processing system, usually the distortion function is both unknown and irrelevant. Unknown because intermediate signals are just that and only encode the information: when the decoded information is not evident, distortion can't be measured. Irrelevant because we don't know how to quantify the distortion between intended and derived meaning. Information for us ultimately concerns actions derived from signal interpretation. If action is the measurable output, it certainly lies in a different space than information or information-bearing signals. In this case, the distortion function is again unknown. Our use of the Kullback-Leibler distance encompasses both classification-related and mean-squared error distortion measures. In cases where information is parametric, the mean-squared error viewpoint is arbitrary and restrictive. However, when the "true" distortion measure is not known, mean-squared error at least represents a well-understood choice.

Our analysis of various information processing system architectures indicates that structure (how systems are interconnected) affects information processing capabilities regardless of the kinds of signals and, in many cases, regardless of the specific information-theoretic distance chosen. The simple cascade result shows that the ability to extract information can be easily degraded as a succession of systems process it and the representation changes from one form to another. Once a loss occurs, such as in our neural system, it can never be regained if only the output is used in subsequent processing. This conclusion follows from the Data Processing Theorem and consequently is not that surprising. However, the other structures we have been able to analyze do represent new insights. One important example is the parallel system result: when a sufficiently large number of systems operate separately on a common input, their aggregated output can express whatever information the input expresses with little loss. Inhomogeneity among the systems and their outputs can occur without affecting this result. If instead these same systems view statistically independent inputs that encode the same information, the information transfer ratio cannot exceed the largest information transfer ratio provided by the component systems [31]. Consequently, this structure is ineffective in that it provides no more information processing capability than its best component system. In related work [33, 34], we studied alternative decision architectures for sensor network processing, some of them incorporating feedback. Instead of trying to solve the difficult problem of finding the optimal distributed quantizer and directly analyzing its performance, we used the Kullback-Leibler distance to understand how decision performance would be affected by various design choices. These information processing results show that exchanging decision information is just as effective as sharing raw data, meaning that communication costs can be greatly reduced without affecting performance.

From a broader perspective, our information processing theory goes beyond Shannon's classic results that apply to communicating information. The entropy limit on source coding defines how to efficiently represent discrete-symbol streams. Channel capacity defines how to reliably communicate signals over noisy channels, be they discrete or continuous-valued. Rate-distortion theory assesses how compression affects signal fidelity. As useful as these entities are, they do not help analyze existing systems to assess how well they work. More penetrating theories must be concerned with what signals represent, how well they do so, and how classes of systems more general than channels affect that representation. Our theory tries to address Weaver's vision [6] of a broader theory that concerns information content. By requiring the information to be changed, we effectively probe the signal's semantic content. By using information-theoretic measures that reflect optimal-processing performance, we can quantify the effectiveness of any signal's information-bearing capability.

A Kullback-Leibler distance between two Gaussian random processes

Consider two Gaussian random processes, Y_0(t) and Y_1(t), with two different mean functions, m_0(t) and m_1(t), and the same covariance structure K(t, u). To calculate the Kullback-Leibler distance between Y_1(t) and Y_0(t) over some time interval, we used the Karhunen-Loève expansion [7, §1.4] to represent the processes in terms of uncorrelated random variables Y_k^(i), k = 1, 2, .... Since these variables are Gaussian, they are also statistically independent, which means that the Kullback-Leibler distance between the two processes can be written as a sum of Kullback-Leibler distances between their representations Y_k^(1) and Y_k^(0):

\[ D(Y_1 \| Y_0) = \sum_k D\bigl(Y_k^{(1)} \,\big\|\, Y_k^{(0)}\bigr). \]

By using the result for the Kullback-Leibler distance between two Gaussian densities with different means and the same covariance [19], the right-hand side equals

\[ \sum_k \frac{\bigl(m_k^{(1)} - m_k^{(0)}\bigr)^2}{2\lambda_k}. \]

Here, {λ_k} are the eigenvalues of K(t, u) and m_k^(i) = ∫ m_i(t) φ_k(t) dt are the coefficients of the series expansion of the means using the Karhunen-Loève basis {φ_k(t)}. Defining Q(t, u) to be the so-called inverse kernel of the covariance function [35] that satisfies ∫ K(t, u) Q(u, v) du = δ(t - v), it has the same eigenfunctions as K(t, u) with eigenvalues equal to 1/λ_k. The Kullback-Leibler distance between the processes has the following expression.

\[ D(Y_1 \| Y_0) = \frac{1}{2}\iint Q(t, u)\,\bigl[m_1(t) - m_0(t)\bigr]\bigl[m_1(u) - m_0(u)\bigr]\,dt\,du. \]

When the process is white with spectral height N_0/2, K(t, u) = (N_0/2) δ(t - u) and Q(t, u) = (2/N_0) δ(t - u), in which case the Kullback-Leibler distance between two white-noise processes differing in their means is

\[ D(Y_1 \| Y_0) = \frac{1}{N_0}\int \bigl[m_1(t) - m_0(t)\bigr]^2\,dt. \]
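
As a numerical sanity check (not part of the derivation above; the sampling step, mean waveforms, and spectral height below are illustrative assumptions), the white-noise formula can be verified by discretizing the observation interval: sampled white noise of spectral height N_0/2 has per-sample variance (N_0/2)/Δt, and the Kullback-Leibler distance between the resulting Gaussian vectors reduces to the integral expression.

import numpy as np

N0 = 2.0                                  # spectral height parameter (illustrative)
dt = 1e-3
t = np.arange(0.0, 1.0, dt)
m0 = np.zeros_like(t)                     # mean function under state alpha_0
m1 = np.sin(2.0 * np.pi * 5.0 * t)        # mean function under state alpha_1 (hypothetical)

var_sample = (N0 / 2.0) / dt              # per-sample variance of sampled white noise
dm = m1 - m0

# Kullback-Leibler distance between Gaussian vectors with equal diagonal covariance
d_discrete = np.sum(dm ** 2) / (2.0 * var_sample)
# the integral formula from the appendix, evaluated by a Riemann sum
d_integral = np.sum(dm ** 2) * dt / N0

print(d_discrete, d_integral)             # the two expressions coincide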

[Fig. B.1 block diagram: a common input X drives N separate systems whose outputs Y_1, Y_2, ..., Y_N feed an appended optimal processing system producing Z.]

Fig. B.1. The initial portion leading to the output vector Y = (Y_1, ..., Y_N) shows what we call a non-cooperative structure: systems transform a common input X to produce outputs Y_n that are conditionally independent and identically distributed. To find the asymptotic behavior of the information transfer ratio, we append an optimal processing system. In the discrete case, the optimal processor is the likelihood ratio detector that indicates which value of X occurred. In the continuous case, the optimal processor is the maximum likelihood estimator of X.

B Multi-output and parallel systems

First of all, we prove that the information transfer ratio monotonically increases as the number of systems increases, regardless of the systems and the input encoding, for the separate-system case shown in Fig. B.1: γ_{X,Y^(N)} ≤ γ_{X,Y^(N+1)}. This result rests on the log-sum inequality [17]:

\[ \int f_1(y)\,\log\frac{f_1(y)}{f_2(y)}\,dy \;\ge\; \Bigl(\int f_1(y)\,dy\Bigr)\,\log\frac{\int f_1(y)\,dy}{\int f_2(y)\,dy}, \]

with f_1(y), f_2(y) nonnegative functions (the integrals become sums in the case of discrete probability functions). Equality occurs only when f_1(y)/f_2(y) is constant. To apply this result, note that the probability function for each system's output is given by

\[ p_{Y_n}(y_n;\alpha) = \int p_{Y_n|X}(y_n|x)\,p_X(x;\alpha)\,dx. \]

Note that p_{Y_n|X}(y_n|x), which defines each system's input-output relation, does not depend on α. The Kullback-Leibler distance D_{Y^(N)}(α_1 ‖ α_0) between the outputs responding to the two information conditions α_0 and α_1 equals

\[ \int \Bigl[\int \prod_{n=1}^{N} p_{Y_n|X}(y_n|x)\,p_X(x;\alpha_1)\,dx\Bigr]\, \log\frac{\int \prod_{n=1}^{N} p_{Y_n|X}(y_n|x)\,p_X(x;\alpha_1)\,dx}{\int \prod_{n=1}^{N} p_{Y_n|X}(y_n|x)\,p_X(x;\alpha_0)\,dx}\;d\mathbf{y}. \]

By applying the log-sum inequality with respect to the input, we upper-bound this distance and demonstrate the Data Processing Theorem: D_{Y^(N)}(α_1 ‖ α_0) ≤ D_X(α_1 ‖ α_0). Applying the log-sum inequality to the integral over y_{N+1}, we find that D_{Y^(N)}(α_1 ‖ α_0) ≤ D_{Y^(N+1)}(α_1 ‖ α_0), with strict inequality arising because the individual system Kullback-Leibler distances are not zero. Thus, as the population size increases, the Kullback-Leibler distance strictly increases. In what follows, we find a lower bound on the rate of increase that indicates that γ_{X,Y^(N)} approaches one asymptotically under mild conditions.
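
The monotonic increase is easy to observe numerically. The sketch below is an illustrative example of ours (not taken from the paper): a Bernoulli input X with Pr[X = 1] = α carries the information parameter, and each of the N parallel systems relays X through a binary symmetric channel with crossover probability eps. Because the outputs are conditionally i.i.d. given X, the count of ones is a sufficient statistic, so the output Kullback-Leibler distance can be computed exactly and the ratio γ is seen to increase toward one with N.

import numpy as np
from scipy.stats import binom

def kl(p, q):
    # Kullback-Leibler distance (nats) between discrete distributions p and q
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

alpha0, alpha1 = 0.3, 0.7     # two values of the information parameter Pr[X = 1]
eps = 0.2                     # crossover probability of each component system

d_input = kl(np.array([1 - alpha1, alpha1]), np.array([1 - alpha0, alpha0]))

for N in range(1, 11):
    k = np.arange(N + 1)
    def output_pmf(alpha):
        # distribution of the number of ones among the N conditionally i.i.d. outputs
        return alpha * binom.pmf(k, N, 1 - eps) + (1 - alpha) * binom.pmf(k, N, eps)
    gamma = kl(output_pmf(alpha1), output_pmf(alpha0)) / d_input
    print(N, round(gamma, 4))   # gamma increases monotonically toward one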

To determine the rate at which the information transfer ratio approaches one, we consider an optimal processing system that collects the outputs Y_1, ..., Y_N to yield an estimate Z of the input X (Fig. B.1). We then calculate the asymptotic distribution of the estimate, derive the distance between the estimates, and find the information transfer ratio between the input and the estimate. Because of the cascade property (7), this ratio forms a lower bound on the information transfer ratio between the input and the parallel system's collective output.

Discrete-valued inputs

Let X be drawn from a finite set {a_1, ..., a_M} and have a discrete probability distribution. We are interested in the asymptotic (in N, the number of parallel systems) behavior of the information transfer ratio γ_{X,Y^(N)}(α_0, α_1) = d_{Y^(N)}(α_0, α_1)/d_X(α_0, α_1). Because the input distance does not vary with N, we need only consider the numerator distance. Because the input is discrete, the "optimal processing system" of Fig. B.1 is a likelihood ratio classifier that uses Y_1, ..., Y_N to determine which value of X occurred. Let Z denote the output decision, which also takes values in {a_1, ..., a_M}. The probabilistic relation between the input set and the decision set can be expressed by an M-ary crossover diagram. Since we will consider asymptotics in N, here equal to the number of systems, we know that the error probabilities in this crossover diagram do not depend on the a priori symbol probabilities Pr[X = a_i] so long as the symbol probabilities are non-zero. Let ε_{ij} = Pr[Z = a_j | X = a_i], i ≠ j, denote the crossover probabilities. Then, the output symbol probabilities are

\[ \Pr[Z = a_j;\alpha] = \Pr[X = a_j;\alpha]\Bigl(1 - \sum_{i \ne j}\epsilon_{ji}\Bigr) + \sum_{i \ne j}\Pr[X = a_i;\alpha]\,\epsilon_{ij}. \]

Note that ε_{ij} → 0 as N → ∞. This expression for Pr[Z = a_j; α] is written in terms of the crossover probabilities ε_{ij}, i ≠ j, that all tend to zero with increasing N. The crossover probabilities do depend on the distribution of the system outputs but don't vary with the information parameter. We can collect these crossover probabilities into the set E and explicitly indicate that the distance between classifier outputs depends on E as d_Z(α_0, α_1; E). As long as the distance is differentiable with respect to the probabilities Pr[Z = a_j; α], the distance can be represented by a Taylor series approximation about the origin:

\[ d_Z(\alpha_0,\alpha_1;E) = d_Z(\alpha_0,\alpha_1;\mathbf{0}) + \nabla_{E}\,d_Z(\alpha_0,\alpha_1;E)\big|_{E=\mathbf{0}}\cdot E + O(\epsilon_{\max}^2). \]

The first term equals the distance d_X(α_0, α_1) since having zero crossover probabilities corresponds to a classifier that makes no errors. When the distance is information theoretic, the remaining terms must total a negative quantity because of the Data Processing Theorem. The maximum of all the crossover probabilities, ε_max, asymptotically equals [36]

\[ \epsilon_{\max} \asymp f(N)\,\exp\Bigl\{-\min_{i \ne j}\,\mathcal{C}_{Y^{(N)}}(a_i,a_j)\Bigr\}, \]

with f(N) a slowly varying function in the sense that lim_{N→∞} log f(N)/N = 0, and with the exponent involving the Chernoff distance between the conditional output distributions. Because ε_max dominates the performance of the optimal classifier as the number of systems increases, the distance d_Z(α_0, α_1; E) decreases linearly as the crossover probabilities decrease, with the largest of these dominating the decrease. Consequently,

\[ \gamma_{X,Y^{(N)}}(\alpha_0,\alpha_1) \;\ge\; 1 - \frac{C}{d_X(\alpha_0,\alpha_1)}\,f(N)\,\exp\Bigl\{-\min_{i \ne j}\,\mathcal{C}_{Y^{(N)}}(a_i,a_j)\Bigr\}. \]  (B.1)

Here, C is the value of that component of the negative gradient of the distance corresponding to ε_max. Whenever the Chernoff distance diverges as N → ∞, the information transfer ratio approaches one asymptotically. Typically, the Chernoff distance for each system's output Y_n is bounded from above and below by constants: 0 < c_min ≤ C_{Y_n}(a_i, a_j) ≤ c_max < ∞. This special case means that systems added in parallel do not have systematically larger or smaller discrimination abilities than the others. In this case, the Chernoff distance term in the exponent of (B.1) increases linearly in N. We thus conclude that, for a discrete input distribution with finite support, the asymptotic increase in the information transfer ratio (as we increase the number of parallel outputs) is exponential (or faster) and that the information transfer ratio approaches one as N → ∞ so long as the systems being added don't suffer too rapid a systematic decline in their ability to discriminate which input value occurred. If the regularity condition (differentiability with respect to the probabilities of signal values) is satisfied, our result applies to any distance measure used to define the information transfer ratio.
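
To see the exponential approach in a concrete case, the following sketch continues the illustrative Bernoulli/binary-symmetric-channel example introduced earlier (again an assumption of ours, not the paper's example; the decay-rate comparison is only approximate, in the spirit of the bound (B.1)). It computes the per-system Chernoff distance between the conditional output distributions and compares it with the empirical per-system decay rate of 1 - γ_N.

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import binom

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def chernoff(p, q):
    # Chernoff distance between two discrete distributions
    f = lambda s: np.log(np.sum(p ** s * q ** (1.0 - s)))
    return -minimize_scalar(f, bounds=(1e-6, 1.0 - 1e-6), method="bounded").fun

alpha0, alpha1, eps = 0.3, 0.7, 0.2
p_given_1 = np.array([eps, 1 - eps])      # conditional output distribution when X = 1
p_given_0 = np.array([1 - eps, eps])      # conditional output distribution when X = 0
C = chernoff(p_given_1, p_given_0)        # per-system Chernoff distance

d_input = kl(np.array([1 - alpha1, alpha1]), np.array([1 - alpha0, alpha0]))
for N in (2, 4, 8, 16, 32):
    k = np.arange(N + 1)
    out = lambda a: a * binom.pmf(k, N, 1 - eps) + (1 - a) * binom.pmf(k, N, eps)
    gap = 1.0 - kl(out(alpha1), out(alpha0)) / d_input
    # empirical per-system decay rate of the loss; per (B.1) it is governed by C
    print(N, round(np.log(gap) / N, 3), round(-C, 3))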

Continuous-valued inputs

To determine the rate of increase of the information transfer ratio when X has a probability function defined over an interval or over a Cartesian product of intervals, we use the approach shown in Fig. B.1 with Z being the maximum likelihood estimator (MLE) of X. Under certain regularity conditions [37] and because the Y_n's are conditionally independent, the MLE is asymptotically (in the number of systems N) Gaussian with a mean equal to X and variance equal to the reciprocal of the Fisher information F_{Y^(N)}(X). This result applies when the systems aren't homogeneous so long as the third absolute moment of the score function of Y_n is bounded and the Fisher information F_{Y^(N)}(X) diverges as N → ∞. Because the quantity Y^(N) has conditionally independent components, F_{Y^(N)}(X) = Σ_{n=1}^{N} F_{Y_n}(X). Thus, if this sum diverges, we can obtain the probability density of Z:

\[ p_Z(z;\alpha) = \int p_X(x;\alpha)\,\sqrt{\frac{F_{Y^{(N)}}(x)}{2\pi}}\;\exp\Bigl\{-\tfrac{1}{2}(z-x)^2\,F_{Y^{(N)}}(x)\Bigr\}\,dx. \]

If the third derivative of the input probability density function p_X(x; α) is bounded, we can expand p_X(x; α) in a Taylor series around z, up to the third-order term, and then perform term-by-term integration. This procedure amounts to the Laplace approximation for an integral. The probability density of Z can then be expressed as

\[ p_Z(z;\alpha) = p_X(z;\alpha) + \frac{g(z;\alpha)}{2\,F_{Y^{(N)}}(z)} + O\bigl(F_{Y^{(N)}}(z)^{-2}\bigr), \]  (B.2)

where g(z; α) is the second derivative of p_X(x; α) evaluated at z: g(z; α) = ∂²p_X(x; α)/∂x²|_{x=z}. Result (B.2) says that the distance between MLEs corresponding to the two information states equals the distance between the input probability densities perturbed in a particular way. If the per-system Fisher informations are comparable, F_{Y_n}(x) ≈ F_{Y_1}(x), then the sum of Fisher information terms increases linearly in N. In this case, a Taylor-like series can be written for the distance between the perturbed input probability functions using the Gateaux derivative, a generalized directional derivative:

\[ d_Z(\alpha_0,\alpha_1) = d_X(\alpha_0,\alpha_1) - \frac{C}{N} + o\Bigl(\frac{1}{N}\Bigr). \]

Here C summarizes the negative first Gateaux derivative of the distance in the perturbational direction. Consequently, for all information-theoretic distances for which the first-order Gateaux derivatives exist (all Ali-Silvey distances are in this class), the information transfer ratio between the input and the output Z is lower-bounded by the quantity above divided by d_X(α_0, α_1):

\[ \gamma_{X,Y^{(N)}}(\alpha_0,\alpha_1) \;\ge\; 1 - \frac{C}{d_X(\alpha_0,\alpha_1)\,N} + o\Bigl(\frac{1}{N}\Bigr). \]

This result states that the asymptotic information transfer ratio approaches one at least as fast as a hyperbola. The Gaussian example described previously had this asymptotic behavior, which means this hyperbolic behavior is tight.
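
A closed-form Gaussian case illustrates the hyperbolic approach. In the sketch below (our illustrative stand-in for the parallel structure, with assumed parameter values), Y_n = X + W_n with independent Gaussian noise, so the joint output covariance is var_x·11ᵀ + var_w·I; the output Kullback-Leibler distance follows from the equal-covariance Gaussian formula, and γ_N behaves as 1 - (var_w/var_x)/N for large N (the rightmost column).

import numpy as np

alpha0, alpha1 = 0.0, 1.0     # two values of the real information parameter (the mean of X)
var_x, var_w = 1.0, 4.0       # signal and per-system noise variances (illustrative)
dm = alpha1 - alpha0

d_input = dm ** 2 / (2.0 * var_x)

for N in (1, 2, 5, 10, 50, 100):
    ones = np.ones(N)
    cov = var_x * np.outer(ones, ones) + var_w * np.eye(N)   # joint output covariance
    # Kullback-Leibler distance between Gaussians with equal covariance and mean offset dm*1
    d_output = 0.5 * dm ** 2 * float(ones @ np.linalg.solve(cov, ones))
    gamma = d_output / d_input
    print(N, round(gamma, 4), round(1.0 - (var_w / var_x) / N, 4))   # hyperbolic approach to one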

References

[1] C. Shannon, A mathematical theory of communication, Bell System Tech. J. (1948) 379–423, 623–656.

[2] T. Berger, Rate Distortion Theory, Prentice-Hall, Englewood Cliffs, NJ, 1971.

[3] A. Kolmogorov, On the Shannon theory of information transmission in the case of continuous signals, IRE Trans. Info. Th. 3 (1956) 102–108.

[4] M. Gastpar, B. Rimoldi, M. Vetterli, To code, or not to code: Lossy source-channel communication revisited, IEEE Trans. Info. Th. 49 (2003) 1147–1158.

[5] N. Tishby, F. Pereira, W. Bialek, The information bottleneck method, in: Proc. 37th Allerton Conference on Communication, Control and Computing, 1999, pp. 368–377.

[6] C. Shannon, W. Weaver, The Mathematical Theory of Communication, U. Illinois Press, Urbana, IL, 1949.

[7] R. Ash, M. Gardner, Topics in Stochastic Processes, Academic Press, New York, NY, 1975.

[8] D. Johnson, Toward a theory of signal processing, in: Info. Th. Workshop on Detection, Estimation, Classification and Imaging, Santa Fe, NM, 1999.

[9] S. Amari, H. Nagaoka, Methods of Information Geometry, Vol. 191 of Translations of Mathematical Monographs, American Mathematical Society, Providence, Rhode Island, 2000.

[10] M. Basseville, Distance measures for signal processing and pattern recognition, Signal Processing 18 (1989) 349–369.

[11] S. Ali, S. Silvey, A general class of coefficients of divergence of one distribution from another, J. Roy. Stat. Soc., Series B 28 (1966) 131–142.

[12] H. Chernoff, A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations, Ann. Math. Stat. 23 (1952) 493–507.

[13] A. Bhattacharyya, On a measure of divergence between two statistical populations defined by their probability distributions, Bull. Calcutta Math. Soc. 35 (1943) 99–109.

[14] T. Kailath, The divergence and Bhattacharyya distance measures in signal selection, IEEE Trans. on Comm. Tech. COM-15 (1) (1967) 52–60.

[15] N. Cencov, Statistical Decision Rules and Optimal Inference, Vol. 14 of Translations in Mathematics, American Mathematical Society, Providence, RI, 1982.

[16] A. Dabak, A geometry for detection theory, Ph.D. thesis, Dept. Electrical & Computer Engineering, Rice University, Houston, TX (1992).

[17] T. M. Cover, J. A. Thomas, Elements of Information Theory, John Wiley & Sons, Inc., 1991.

[18] H. Chernoff, Large-sample theory: Parametric case, Ann. Math. Stat. 27 (1956) 1–22.

[19] D. Johnson, G. Orsak, Relation of signal set choice to the performance of optimal non-Gaussian detectors, IEEE Trans. Comm. 41 (1993) 1319–1328.

[20] L. Scharf, Statistical Signal Processing: Detection, Estimation, and Time Series Analysis, Addison-Wesley, Reading, MA, 1990.

[21] R. Blahut, Principles and Practice of Information Theory, Addison-Wesley, Reading, MA, 1987.

[22] S. Kullback, Information Theory and Statistics, Wiley, New York, 1959.

[23] D. Johnson, D. Dudgeon, Array Signal Processing: Concepts and Techniques, Prentice-Hall, 1993.

[24] D. Johnson, C. Gruner, K. Baggerly, C. Seshagiri, Information-theoretic analysis of neural coding, J. Computational Neuroscience 10 (2001) 47–69.

[25] C. Miller, D. Johnson, J. Schroeter, L. Myint, R. Glantz, Visual responses of crayfish ocular motoneurons: An information theoretical analysis, J. Computational Neuroscience 15 (2003) 247–269.

[26] C. Rozell, D. Johnson, R. Glantz, Measuring information transfer in crayfish sustaining fiber spike generators: Methods and analysis, Biological Cybernetics 90 (2004) 89–97.

[27] H. Jeffreys, An invariant form for the prior probability in estimation problems, Proc. Roy. Soc. A 186 (1946) 453–461.

[28] O. Arandjelovic, R. Cipolla, Face recognition from face motion manifolds using robust kernel resistor-average distance, in: Conference on Computer Vision and Pattern Recognition Workshops, 2004, p. 88.

[29] T. Spiess, B. Wrede, F. Kummert, G. A. Fink, Data-driven pronunciation modeling for ASR using acoustic subword units, in: Proc. European Conf. Speech Communication and Technology, Geneva, 2003, pp. 2549–2552.

[30] P. Varshney, Distributed Detection and Data Fusion, Springer-Verlag, New York, NY, 1997.

[31] D. Johnson, Neural population structures and consequences for neural coding, J. Computational Neuroscience 16 (2004) 69–80.

[32] R. Popoli, J. Mendel, Relative sufficiency, IEEE Trans. Automatic Control 38 (1993) 826–828.

[33] M. Lexa, C. Rozell, S. Sinanovic, D. Johnson, To cooperate or not to cooperate: Detection strategies in sensor networks, in: ICASSP '04, Montreal, Canada, 2004.

[34] M. Lexa, D. Johnson, Broadcast detection structures with applications to sensor networks, IEEE Trans. Signal Processing. Submitted.

[35] H. van Trees, Detection, Estimation, and Modulation Theory, Part I, Wiley, New York, 1968.

[36] C. Leang, D. Johnson, On the asymptotics of M-hypothesis Bayesian detection, IEEE Trans. Info. Th. 43 (1997) 280–282.

[37] H. Cramer, Mathematical Methods of Statistics, Princeton University Press, Princeton, NJ, 1946.
