

Journal of Computational Neuroscience 10, 47–69, 2001. © 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.

Information-Theoretic Analysis of Neural Coding

DON H. JOHNSON
Department of Electrical and Computer Engineering and Department of Statistics, Computer and Information Technology Institute, Rice University, MS 366, Houston, TX 77251-1892
[email protected]

CHARLOTTE M. GRUNER
Department of Electrical and Computer Engineering, Computer and Information Technology Institute, Rice University, MS 366, Houston, TX 77251-1892
[email protected]

KEITH BAGGERLY
Department of Statistics, Computer and Information Technology Institute, Rice University, MS 366, Houston, TX 77251-1892
[email protected]

CHANDRAN SESHAGIRI
Department of Electrical and Computer Engineering, Computer and Information Technology Institute, Rice University, MS 366, Houston, TX 77251-1892
[email protected]

Received August 6, 1999; Revised June 5, 2000; Accepted June 19, 2000

Action Editor: Barry Richmond

Abstract. We describe an approach to analyzing single- and multiunit (ensemble) discharge patterns based on information-theoretic distance measures and on empirical theories derived from work in universal signal processing. In this approach, we quantify the difference between response patterns, whether time-varying or not, using information-theoretic distance measures. We apply these techniques to single- and multiple-unit processing of sound amplitude and sound location. These examples illustrate that neurons can simultaneously represent at least two kinds of information with different levels of fidelity. The fidelity can persist through a transient and a subsequent steady-state response, indicating that it is possible for an evolving neural code to represent information with constant fidelity.

Keywords: neural coding, information theory, lateral superior olive

1. Introduction

Neural coding has been classified into two broadly defined types: rate codes (the average rate of spike discharge) and timing codes (the timing pattern of discharges). Debates rage over the effectiveness of one code over another and over whether this categorization even applies. Complicating the debate are recent


results that suggest single-unit responses inadequately explain information coding. For example, theoretical considerations indicate that single-neuron discharge patterns in the mammalian auditory pathway are too random to effectively represent sound (Johnson, 1996). Coordinated responses of neurons within sensory and motor nuclei have been found, with discharge timing relations among neural outputs having significance (Abeles, 1991; Alonso et al., 1996; De Charms and Merzenich, 1996; Riehle et al., 1997; Singer and Gray, 1995; Vaadia et al., 1995; Wehr and Laurent, 1996). Consequently, much current work has focused on population activity, using the fundamental assumption that coordinated sequences of action potential occurrence times produced by groups of neurons collectively represent the stimulus-response relationship. Thus, today the "neural code" is taken to mean how groups of neurons, responding individually and collectively, represent sensory information with their discharge patterns (Bialek et al., 1991). Knowing the code would unlock the secrets of how neurons, working in concert, process and represent information.

In developing data-analysis techniques that seek to elucidate the neural code, some kind of averaging is required because of the stochastic nature of the neural response. The simplest techniques require stationary responses: the probability law governing the response must not change with time. Among these are several interspike interval statistics (Johnson, 1996) and auto- and cross-correlation techniques (Abeles and Goldstein, 1977). To quantify time-varying responses, responses are averaged over several stimulus presentations that are spaced sufficiently far apart in time to prevent adaptation and sequential stimulus effects. Such responses are said to be cyclostationary (Gardner, 1994): the probability law varies with time but does so periodically, the period equaling the interstimulus interval. The well-known PST histogram measures, under a Poisson point-process assumption, how the discharge rate changes after each stimulus presentation (Johnson, 1996). All of these measures spring from a point-process view of neural discharge patterns. Using point-process-based measures and elaborations of them, we could in principle estimate the point-process model that accurately describes a neuron's response to each stimulus. In our experience, having such a point-process model does not reveal what stimulus features are being coded, when, and how effective the code is in representing the stimulus. For example, we developed a point-process model for the tone-burst responses of

single units located in the lateral superior olive (LSO) (Zacksenhouse et al., 1992, 1993) and an underlying computational biophysical model (Zacksenhouse et al., 1998). While these models provide a notion of the response's structure, they do not help us determine the typical LSO unit's information-processing role and what processing function the variety of LSO response types may engender. What has been left out is quantifying both the response's significance and its effectiveness in representing sensory information.

New techniques are emerging that take a broader view. One, developed by Victor and Purpura (1997), measures the distance between two responses by determining the number of steps it takes to systematically transform one discharge pattern into another. Mutual information calculations (Gabbiani and Koch, 1996; Rieke et al., 1995) measure how effectively a spike train encodes the stimulus under Poisson assumptions. Neural network models have been trained on recorded neural responses to determine how well these responses can be used to determine stimulus parameters (Middlebrooks et al., 1994). These methods yield response metrics that are difficult to relate to how well systems can extract information from single- or multiple-neuron responses. When inputs to a neural population have been measured and characterized, how do we judge how well the population extracts sensory information? In fact, how would we know if a population did extract information or simply served as a relay, passing the information along (possibly in a different neural code) to its projections? We must be able to quantify the neural code: what aspect of a neural ensemble's collective output represents information, and what is the fidelity of this representation? This article describes a new information-theoretic technique for measuring when and to what degree responses differ, in a way that can be related to how optimal systems perform in decoding neural responses.

Consider the simple system shown in Fig. 1. Conceptually, this system accepts inputs X that represent a stimulus or a neural population conveying information (parameterized by α) and produces outputs R that collectively code some or all of the stimulus attributes. The boldfaced symbols represent vectors and are intended to convey the notion that our system—a neural ensemble—has multiple inputs and multiple outputs. Presumably, input stimulus features preserved in the output are those extracted by the system; those deemphasized in the output are discarded by the system and define its feature-extraction properties. To probe the


Figure 1. A neural system has as inputs the vector quantity X that depends on a collection of stimulus parameters denoted by the vector α. The output R thus also depends on the stimulus parameters. Both input and output implicitly depend on time. Note that the precise nature of the input is deliberately unclear. It can represent the stimulus itself or a population's collective input.

system and its representation of sensory information, we experimentally measure the system's output and its inputs as we vary stimulus conditions.

Relating input and output for each condition amounts to measuring the intensity of an underlying point-process model, which does not help quantify the effectiveness of neural coding. Instead, what we look for is how the inputs and the outputs change as the stimulus undergoes a controlled change. No change means no coding of the perturbed aspect of the stimulus; the bigger the change, the more the system accentuates that sensory aspect. To quantify change, we need a measure that quantifies its degree. In short, what we seek is a distance measure. Given two sets of stimulus conditions θ_1, θ_2, we need to measure how different the corresponding responses R(θ_1), R(θ_2) are—how far apart they are—with some distance metric d(R(θ_1), R(θ_2)). Assuming that population codes are subtle, this metric needs to apply to ensemble responses, to dynamic as well as steady-state responses, to changes in transneural correlations, and to changes in temporal correlation structure.

Figure 2. The PST histograms shown in the upper two panels show the dramatically time-varying nature of a simulated auditory neuron's response to brief tone pulses that had different amplitudes. Shown separately in the bottom panel is the distance between these cyclostationary responses. Distance between responses accumulates with poststimulus time, meaning that the distance at poststimulus time t is the distance between responses measured up to time t. Detailed analysis results are shown in Fig. 5 for the same simulations.

Anticipating our results, Fig. 2 demonstrates quantifying how two responses differ. The distance


computation does not have any a priori notion of the neural code and is based on the data, not the PST histogram. What we want the distance measure to uncover is (1) what portion of the response most represents the stimulus change, (2) whether the later constant-rate response reveals more or less than the early time-varying response, and (3) how well the stimulus change is represented by the response change. By plotting how distance accumulates with time since stimulus presentation, we can answer these questions. If the distance measure plateaus, the responses don't differ over that time interval; if it increases, the responses differ in their statistical structure, and the sharper the increase, the more the responses differ during that time segment than in other segments. First of all, we see that the response begins at about a 3 ms latency, but the distance measure suggests that the responses don't differ significantly until after 5 ms. Between 5 and 10 ms, the distance accumulates about 2 bits, and from 10 to 25 ms, the distance increases by another 2 bits.^1 Thus, the response during the 5 to 10 ms interval reveals much about the amplitude change, while the constant-rate portion reveals just as much but over a time interval three times longer. From this example, we see that the distance calculation applies to both time-varying and constant-rate responses. The distance measurement applies to single and multineuron responses equally well. What the distance measure does not reveal is the precise nature of amplitude coding by this neuron; it assesses only the code's quality and when the coding occurs. However, just because some component of the neural response more accurately represents some stimulus component than others does not mean that destination populations make more effective use of that response component than others. As we shall see, results from information theory can be used to determine the limits to which systems can extract information no matter what response component(s) they may use.

2. Information-Theoretic Distance Measures

While the merits of one distance measure versus another can be debated (Basseville, 1989), we describe here a collection of information-theoretic distances that have a clear, intuitive mathematical foundation. The underlying theory is not rooted in the classic results of Shannon but in modern classification theory, the key results of which are detailed in the appendix. In this theory, we try to assign a response to one of a set of preassigned response categories. For example,

discerning whether the stimulus is on or off is a two-category classification problem. The ease of classification depends on how different the categories are; it is through this aspect of the classification problem that distance measures arise. We use this classification-theoretic approach because recent results from universal signal processing^2 provide distance measures and classification techniques that assume little about the data yet yield (in a certain sense) optimal classification results. In addition to information-theoretic distance measures having a strong mathematical foundation, direct empirical results have been derived. For example, we can determine how complex a data analysis we can perform given a certain amount of data.

Error probabilities in optimal classifiers decrease exponentially with the distance between the categories. Using Pr[error] to denote a generic classification error probability, Pr[error] ~ 2^{−d(C_1, C_2)}, where d(C_1, C_2) denotes an information-theoretic distance between two categories. The distance measure of particular interest here is the Kullback-Leibler distance, defined to be

D(p_1 \| p_2) = \int p_1(R) \log \frac{p_1(R)}{p_2(R)} \, dR, \qquad (1)

where p_1, p_2 are probability distributions that characterize the two categories and R symbolically represents a neural response, whether from one or several neurons. Following the convention of information theory, we use the base-two logarithm, which means that distance has units of bits. In the Gaussian case (categories defined to have different means but the same variance), the Kullback-Leibler distance equals d'^2/(2 ln 2) bits, with d' = |m_1 − m_2|/σ. The quantity d' is frequently used in psychophysics to assess how easily stimuli can be distinguished. When applied to non-Gaussian problems, the Kullback-Leibler distance represents a generalization of d' to all binary classification problems: the larger this distance, the easier the classification problem. It measures how different two probability distributions are, and it has several important properties:
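(The Gaussian value follows from a standard one-line calculation, which we sketch here for completeness; it is not spelled out in the text. With p_i = \mathcal{N}(m_i, \sigma^2),

D(p_1 \| p_2) = \int p_1(R) \log_2 \frac{p_1(R)}{p_2(R)} \, dR = \frac{(m_1 - m_2)^2}{2\sigma^2} \log_2 e = \frac{(d')^2}{2 \ln 2}\ \text{bits},

since the expected value under p_1 of the log-likelihood ratio leaves only the mean-difference term.)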

1. D(p_1 \| p_2) ≥ 0 and D(p \| p) = 0. The Kullback-Leibler distance is always nonnegative, with zero distance occurring only when the probability distributions are the same.

2. D(p_1 \| p_2) = ∞ whenever, over some domain of R, p_2(R) = 0 and p_1(R) ≠ 0. If p_1(R) = 0, the value of p_1(R) \log \frac{p_1(R)}{p_2(R)} is defined to be zero.

3. When the underlying stochastic quantities are random vectors having statistically independent components with respect to both p_1 and p_2, the Kullback-Leibler distance equals the sum of the component distances. Stated mathematically, if p_1(R) = \prod_i p_1(R_i) and p_2(R) = \prod_i p_2(R_i),

D(p_1(R) \| p_2(R)) = \sum_i D(p_1(R_i) \| p_2(R_i)). \qquad (2)

Furthermore, if p_1, p_2 describe Markovian data,^3 the Kullback-Leibler distance has a similar summation property. Taking the first-order Markovian case as an example, wherein p_1(R) = p_1(R_1) \prod_i p_1(R_{i+1} \mid R_i) and p_2(\cdot) has a similar structure,

D(p_1(R) \| p_2(R)) = D(p_1(R_1) \| p_2(R_1)) + \sum_i D(p_1(R_{i+1} \mid R_i) \| p_2(R_{i+1} \mid R_i)), \qquad (3)

where

D(p_1(R_{i+1} \mid R_i) \| p_2(R_{i+1} \mid R_i)) = \int p_1(R_i, R_{i+1}) \log \frac{p_1(R_{i+1} \mid R_i)}{p_2(R_{i+1} \mid R_i)} \, dR_i \, dR_{i+1}. \qquad (4)

4. D(p_1 \| p_2) ≠ D(p_2 \| p_1). The Kullback-Leibler distance is usually not a symmetric quantity. In some special cases it can be symmetric (like the just-described Gaussian example), but symmetry cannot, and should not, be expected.

5. D(p(x_1, x_2) \| p(x_1)p(x_2)) = I(x_1; x_2). The Kullback-Leibler distance between a joint probability density and the product of the marginal distributions equals what is known in information theory as the mutual information between the random variables x_1, x_2. From the properties of the Kullback-Leibler distance, we see that the mutual information equals zero only when the random variables are statistically independent.
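To make these properties concrete, here is a minimal numerical sketch (ours, not the authors' code; the function name kl_bits is our own) that computes the Kullback-Leibler distance in bits for discrete distributions and exhibits the asymmetry of property 4:

```python
import numpy as np

def kl_bits(p1, p2):
    """Kullback-Leibler distance D(p1 || p2) in bits between two
    discrete distributions.  Letters with p1 == 0 contribute zero
    (property 2); a letter with p2 == 0 but p1 > 0 yields infinity."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    mask = p1 > 0
    return float(np.sum(p1[mask] * np.log2(p1[mask] / p2[mask])))

p = [0.9, 0.1]          # two distributions on a two-letter alphabet
q = [0.5, 0.5]
print(kl_bits(p, q))    # ~0.531 bits
print(kl_bits(q, p))    # ~0.737 bits -- asymmetry (property 4)
```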

The word "distance" should appear in quotes because D(· \| ·) violates some of the fundamental properties a distance metric must have: a distance must be symmetric in its arguments. As explained in the appendix, classification error probabilities need not have the same exponential decay rate, and this results in the Kullback-Leibler distance's asymmetry. This asymmetry property does not hinder theoretical developments but does affect measuring the distance between recorded neural responses. Consequently, we later propose a symmetric distance measure directly related to the Kullback-Leibler distance. The Gaussian example also indicates that the Kullback-Leibler distance has the form of a squared distance: these distances are proportional to \sum_i (m_1^{(i)} − m_2^{(i)})^2, which corresponds to the square of the Euclidean distance. Thus, we have a second reason to put "distance" in quotes.

In addition to quantifying the exponential decay rate of the error probabilities in optimal classifiers, information-theoretic distances determine the ease of estimating parameters represented by the data. Consider the situation where two categories differ slightly according to the values of a scalar parameter θ: symbolically, p_1(R) = p_θ(R) and p_2(R) = p_{θ+δθ}(R). Intuitively, if we can easily distinguish between two such categories (small error probabilities), we should also be able to estimate the parameter accurately (smaller estimation error). For sufficiently small values of the difference δθ, the Kullback-Leibler distance is proportional to the reciprocal of the smallest mean-squared estimation error that can be achieved. The mathematical results are^4

D(p_{θ+δθ} \| p_θ) ≈ \frac{1}{2 \ln 2} F(θ) (δθ)^2. \qquad (5)

Here, F(θ) denotes the Fisher information:

F(θ) = E\left[\left(\frac{\partial \ln p_θ(R)}{\partial θ}\right)^2\right] = \int \left(\frac{\partial \ln p_θ(R)}{\partial θ}\right)^2 p_θ(R) \, dR,

with E[·] denoting expected value. The significance of these formulas rests in the Cramer-Rao bound, which states that the mean-squared error for any unbiased estimator \hat{θ} of θ cannot be smaller than 1/F(θ) (Johnson and Dudgeon, 1993, sec. 6.2.4):

E[(\hat{θ} − θ)^2] ≥ \frac{1}{F(θ)}. \qquad (6)
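A quick worked instance (ours, using the standard Poisson formulas rather than an example from the paper): for a Poisson count R with mean θ,

\ln p_θ(R) = R \ln θ − θ − \ln R!, \qquad \frac{\partial \ln p_θ(R)}{\partial θ} = \frac{R}{θ} − 1,

so F(θ) = E[(R/θ − 1)^2] = \operatorname{Var}(R)/θ^2 = 1/θ. Equation (5) then gives D(p_{θ+δθ} \| p_θ) ≈ (δθ)^2/(2θ \ln 2) bits, and the bound (6) says that no unbiased estimate of θ from a single count can have mean-squared error smaller than θ.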

When two or more parameters change, the Fisher information becomes a matrix, and the distance formulas become what are known as quadratic forms:

D(p_{θ+δθ} \| p_θ) ≈ \frac{1}{2 \ln 2} δθ' F(θ) δθ, \qquad (7)

with F(θ) = E[(∇_θ \ln p_θ(R))(∇_θ \ln p_θ(R))']. Here, (·)' means transpose and ∇_θ \ln p_θ(R) means the gradient of the log probability density function: ∇_θ \ln p_θ(R) = col[\frac{∂}{∂θ_1} \ln p_θ(R), \ldots, \frac{∂}{∂θ_N} \ln p_θ(R)]. The Cramer-Rao bound still holds, but in a more complicated form:

E[(\hat{θ} − θ)(\hat{θ} − θ)'] ≥ F^{−1}(θ).

This result means that the mean-squared estimation error for any one parameter must be greater than the corresponding diagonal entry in the inverse of the Fisher information matrix: E[(\hat{θ}_i − θ_i)^2] ≥ (F^{−1})_{ii}(θ). Thus, for any given stimulus parameter perturbation δθ, the larger the Kullback-Leibler distance becomes (the further apart the distributions become), the larger the Fisher information (Eq. (5)), and the proportionally smaller the least possible mean-squared error in estimating the parameter. In short, larger distances mean smaller estimation errors. This relationship not only reinforces the notion that our distance measures do indeed measure how distinct two classification categories are but also allows us to determine how well parametric information can be gleaned from data. Information-theoretic distance measures assess the limits of information processing.

3. Digital Representation of Neural Responses

To develop a measure of the population's response, we first convert the population's discharge pattern into a convenient representation for computational analysis (Fig. 3B). Here, a neural population's response during the bth bin is summarized by a single number R_b that equals a binary coding for which neurons, if any, discharged during the bin. This procedure generalizes the approach taken for the single-neuron case, wherein the occurrence of a spike in a bin was represented by a zero or a one. Note that this digitization process for a neural ensemble is reversible (up to the temporal precision of the binwidth). The sequence R = {R_1, \ldots, R_b, \ldots, R_B} completely characterizes the population response, and the entire discharge pattern can be recreated from it. In developing techniques to analyze neural coding, we need consider only the statistical structure of this sequence. When we present a stimulus periodically M times, we form the dataset {R_1, \ldots, R_M} from the component responses. Here, the response to the mth stimulus is R_m = {R_{m,1}, \ldots, R_{m,B}}. Because we repeat the same stimulus and take the usual precautions to mitigate adaptation, the resulting response is cyclostationary (Gardner, 1994) (each R_m obeys the same probability law), and, in addition, the component responses are statistically independent of each other.

In information theory terminology, a discrete-valued random variable, such as the response in a bin, takes on values that are letters r drawn from an alphabet A. When we have an N-neuron population and a sufficiently small binwidth, R_b takes on values from the alphabet {r_0, \ldots, r_{K−1}} = {0, \ldots, 2^N − 1}. For the three-neuron population exemplified in Fig. 3, the collection {0, 1, 2, 3, 4, 5, 6, 7} forms the alphabet. The letter r = 3 = 011_2 means that a discharge occurred in both neurons 2 and 3 and not in neuron 1 during a particular bin. We could form a population PST histogram of the population's response at a particular bin to estimate the probability Pr[R_b = r_k]. Information theorists term such histogram estimates of probabilities types (Cover and Thomas, 1991, chap. 12):

\hat{P}_{R_b}(r_k) = \frac{\#\,\text{times } r_k \text{ occurs in } \{R_{1,b}, \ldots, R_{M,b}\}}{M}.

By accumulating this multineuron PST histogram in this way, we obtain the distribution of neural discharge occurrence across the entire population within each bin. This histogram generalizes the PST histogram used to analyze single-neuron responses (Johnson, 1996). In the single-neuron case, the alphabet consists of {0, 1}, and only the probability of one letter need be calculated. The PST histogram consists of a type at each bin that estimates the probability of a discharge (the letter 1) occurring. The only other remaining value of the type—\hat{P}_{R_b}(R_b = 0)—is found by subtracting \hat{P}_{R_b}(R_b = 1) from one.
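As an illustration of this digitization and the per-bin types, here is a minimal numerical sketch (our own; the array layout, the bit-order convention, and names such as `spikes` and `letters` are assumptions for illustration, not the authors' software):

```python
import numpy as np

# spikes[m, n, b] = 1 if neuron n fired in bin b on stimulus repetition m.
M, N, B = 200, 3, 50                      # repetitions, neurons, bins
rng = np.random.default_rng(0)
spikes = (rng.random((M, N, B)) < 0.05).astype(int)   # toy data

# Encode each bin's ensemble pattern as a letter 0 .. 2**N - 1.
weights = 2 ** np.arange(N)               # here, neuron n contributes bit n
letters = np.tensordot(spikes, weights, axes=([1], [0]))  # shape (M, B)

# Type (empirical probability) of each letter in each bin.
K = 2 ** N
types = np.zeros((B, K))
for k in range(K):
    types[:, k] = (letters == k).sum(axis=0) / M
# types[b, k] estimates Pr[R_b = r_k] from the M presentations.
```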

Just as in the usual PST histogram, this multiunit PST histogram does not faithfully represent temporal dependence that may be present in the ensemble response (Johnson and Swami, 1983). The multiunit PST histogram essentially assumes responses occur independently from bin to bin—what amounts to a Poisson assumption—because no record is kept of what preceded a particular population discharge pattern in each bin when the type is calculated. This assumption is more serious here than in the single-unit case: while departures from Poisson behavior may not be significant in the single-unit case, a discharge in one neuron may well affect another's discharge occurring several bins later. We want our analysis techniques to be sensitive to this possibility and go beyond the PST histogram in providing insight into the neural code. In the most general case, we should estimate the joint probability of the population response across all bins: Pr[R_1 = r_{k_1}, R_2 = r_{k_2}, \ldots, R_B = r_{k_B}]. Because we have 2^N possible letters in each bin and B bins, we need to estimate 2^{NB} probabilities. Most of these will be zero—certain discharge patterns will not occur—but knowing this does not alleviate the overwhelming demand achieving an accurate probability estimate places on data collection.

Figure 3. A: Portrays how we estimate the Kullback-Leibler distance between single-neuron responses to different stimuli. The response to each stimulus repetition is time aligned as in PST histogram computation, and a table is formed from the spike occurrence times (denoted by an ×) quantized to the binwidth Δ. For each bin, a 1 indicates that a spike occurred in the bin and a 0 indicates that a spike did not occur. We accumulate the type for each bin, forming a histogram of spike occurrences and nonoccurrences separately for each bin from the M stimulus presentations (four are shown in the figure). A similar set of types is computed from the responses to a second stimulus. When we assume the Markov order D to be zero, we compute the Kullback-Leibler distance between corresponding bins and sum the results. B: Generalizes the computations of panel A to the multineuron case. In the depicted ensemble of three neurons, the spike pattern at any bin could be one of eight (2^3 = 8) possible patterns. Each possible pattern is represented by an integer between 0 and 7 in the table to the lower right. Types are formed from these quantities, and distances are again computed separately between corresponding bins and summed when D = 0. C: Illustrates how first-order distance analysis is computed. For each neuron's (or ensemble's) responses, the response pattern for two bins at a time is represented by an integer between 0 and 3 in the bottom table. Note that the first bin is special, as no bin precedes it. This edge effect corresponds to the first term in Eq. (3). We compute the zero-order distance for it and the first-order distances for the others and then sum the result to form the total distance.

To approximate the temporal dependence structure of the population response, we assume that it has a Markovian structure: the probability of a particular population response in a bin depends only on what responses occurred in the previous D bins. This assumption means that we approximate the joint probability of the neural response by

P(R) = P(R_1, \ldots, R_D) \prod_{b=D+1}^{B} P(R_b \mid R_{b−1}, \ldots, R_{b−D}).


The demands placed on data collection are greatly reduced when we employ this approximation. Recent results in information theory (Weinberger et al., 1995) prescribe how much data are needed by any bin-based technique to analyze data to a given degree of dependence:

D ≤ \frac{\log(L + 1)}{\log(2^N + 1)}, \qquad (8)

where L is the amount of averaging used in analyzing the data and N the number of neurons. For nonstationary responses, L equals the number of stimulus repetitions M, while for stationary (constant-rate) responses it equals the number of bins in the measured response. This result makes the point that the amount of data needed grows exponentially in the dependence order and in the number of neurons in the ensemble: L > 2^{D·N}. Computational experiments indicate that this bound is not particularly tight, and we do not analyze data to as high an order as the bound permits.
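To see the bound's bite with representative numbers (our illustration, not a calculation from the paper): for N = 3 neurons and L = M = 200 stimulus repetitions,

D ≤ \frac{\log(200 + 1)}{\log(2^3 + 1)} = \frac{\log 201}{\log 9} ≈ 2.4,

so at most second-order dependence (D = 2) can be analyzed at that binwidth.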


As shown in Fig. 3C, temporal dependence is easily incorporated into type-based distance calculations. For each bin, a joint type estimates the joint probability that a given ensemble response pattern occurs in it and the preceding D bins. Extending our example, rather than just counting the number of times R_b assumes various values, we need to know the joint distribution of (R_{b−1}, R_b) to assess first-order Markovian dependence. Because the PST histogram is equivalent to zero-order analysis, employing joint types in measuring response differences can reveal response changes not revealed by the PST histogram, whether a single- or multiunit histogram. Temporal dependence in discharge probabilities can arise in a variety of ways: among them are dependence on discharge history (Johnson et al., 1986; Zacksenhouse et al., 1992, 1993), nonexponential interval distributions,^5 and syn-fire response patterns in ensembles (Abeles, 1991; Riehle et al., 1997). The Markov dependence order D corresponds to a temporal analysis window DΔ s long. The bound on D determines the longest interval over which interneuron and intraneuron response dependencies can be included in the analysis. For a given

4. Calculating Distance Between Responses

Let R(θ_1) and R(θ_2) represent the responses of a neural population to two stimulus conditions parameterized by θ_1, θ_2. What we want to measure is the distance between the joint probability distributions corresponding to these responses. Using the Kullback-Leibler distance as an example, we would want to find D(P_{θ_1}(R) \| P_{θ_2}(R)). To manage this statistical complexity, we must assume that the response in a given bin depends (in the statistical and practical sense) only on the responses that occur in the immediately preceding D bins. Once this analysis dependence order is chosen, the distance calculation generalizes Eq. (3):

D(P_{θ_1}(R) \| P_{θ_2}(R)) \cong D(P_{θ_1}(R_1, \ldots, R_D) \| P_{θ_2}(R_1, \ldots, R_D)) + \sum_{b=D+1}^{B} D(P_{θ_1}(R_b \mid R_{b−1}, \ldots, R_{b−D}) \| P_{θ_2}(R_b \mid R_{b−1}, \ldots, R_{b−D})). \qquad (9)

The \cong relation means that this relation is true only according to assumption; the data's actual dependence structure may differ. If D equals or exceeds the memory present in the responses, this equation holds: picking D too large does not hurt. The problem arises when D is chosen too small; in this case the two sides of Eq. (9) are not equal. Mathematical analysis suggests that Kullback-Leibler distances calculated using a smaller-than-required dependence order could be smaller or larger than the actual value. Thus, to measure accurately the distance between two responses, distances must be computed using increasingly larger values of D until the calculated values stabilize (don't change with increasing D) or the upper limit of (8) is reached. We decide when the distance reaches a constant value by employing statistical tests; in the following sections we describe the statistical properties of distances and how to estimate their confidence intervals. On the other hand, if we have insufficient data to reach a stabilizing value of D, we cannot say whether the computed value is a lower or an upper bound. Because temporal dependence in discharge patterns usually spans some time interval, we need to choose a


larger binwidth (at the sacrifice of temporal resolution) to span this interval with the same number of bins and so obtain accurate distance measurements. We address the binwidth selection problem in Section 5.3.
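A sketch of how Eq. (9) might be estimated at order D = 0 (our code, reusing the letter arrays of the earlier sketch, one per stimulus condition; higher orders would substitute joint types over (R_{b−D}, ..., R_b) in the same loop):

```python
import numpy as np

def accumulated_kl_bits(letters1, letters2, K):
    """Zero-order (D = 0) estimate of Eq. (9): the sum over bins of the
    Kullback-Leibler distance between the per-bin types, in bits.
    letters1, letters2: (M, B) integer arrays of ensemble letters."""
    M, B = letters1.shape
    total = 0.0
    for b in range(B):
        p1 = np.bincount(letters1[:, b], minlength=K) / M
        p2 = np.bincount(letters2[:, b], minlength=K) / M
        mask = p1 > 0
        # Zeros in p2 where p1 > 0 give an infinite distance;
        # the K-T estimates of Section 5.1 are used to avoid this.
        total += np.sum(p1[mask] * np.log2(p1[mask] / p2[mask]))
    return total

# Recompute with D = 1, 2, ... (joint types) until the estimate
# stabilizes or the bound (8) is reached.
```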

As succeeding examples will show, the Kullback-Leibler distance's asymmetry property—D(p \| q) ≠ D(q \| p)—is indeed real, with the distance between two responses depending on the order in which the responses appear in the formula. In some applications, a reference stimulus condition occurs naturally, and we want to measure how responses differ from the reference; the Kullback-Leibler distance can be used in such situations (the second argument denotes the reference distribution). Otherwise, the asymmetry becomes a nuisance, and we need a symmetric distance measure, like the Chernoff distance defined in the appendix. Calculating the Chernoff distance can be so daunting that we need to consider alternative symmetric distances. One possibility is known as the J-divergence, which equals the average of the two Kullback-Leibler distances that can be defined for two probability distributions:

J(p_1, p_2) = \frac{D(p_1 \| p_2) + D(p_2 \| p_1)}{2}. \qquad (10)

This distance is not as powerful as the Kullback-Leibler or the Chernoff distances in that it only bounds the average error probability of an optimal classifier:

\lim_{M \to \infty} \frac{\log P_e}{M} ≥ −J(p_1, p_2).

Calculations show that this bound can be quite generous (not very tight). Though the J-divergence is easy to find (it is built from two easily calculated Kullback-Leibler distances), we can relate it only approximately to classification error rates: the J-divergence is overly optimistic.

A more accurate approximation is the so-called resistor average of the two Kullback-Leibler distances:

R(p_1, p_2) = \frac{D(p_1 \| p_2)\, D(p_2 \| p_1)}{D(p_1 \| p_2) + D(p_2 \| p_1)}. \qquad (11)

The name "resistor average" arises because a simple rewriting of this definition yields a formula that resembles the parallel-resistor formula:

\frac{1}{R(p_1, p_2)} = \frac{1}{D(p_1 \| p_2)} + \frac{1}{D(p_2 \| p_1)}.

This quantity is not arbitrary. Rather, it is derived in a way analogous to the Chernoff distance, and half of it approximates the Chernoff distance well: C(p_1, p_2) ≈ R(p_1, p_2)/2. In our Gaussian and Poisson examples of Fig. 9, the Chernoff distances are 0.03125 and 0.01796, respectively. The corresponding J-divergences are 0.125 and 0.07192, and half the resistor-average values are 0.03125 (exact equality) and 0.01794. Thus, when we want to contrast two response patterns, rather than computing the correct distance measure, the Chernoff distance, we compute the much simpler quantity, the resistor average, instead.
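A small numerical sketch of Eqs. (10) and (11) (ours, using the standard closed-form Kullback-Leibler distance between Poisson distributions, which is not derived in the text):

```python
import numpy as np

def kl_poisson_bits(lam1, lam2):
    # D(Poisson(lam1) || Poisson(lam2)) in bits: standard closed form.
    return lam1 * np.log2(lam1 / lam2) + (lam2 - lam1) / np.log(2)

d12, d21 = kl_poisson_bits(5.0, 8.0), kl_poisson_bits(8.0, 5.0)
J = (d12 + d21) / 2                 # Eq. (10): J-divergence
R = d12 * d21 / (d12 + d21)         # Eq. (11): resistor average
print(d12, d21, J, R / 2)           # R/2 approximates the Chernoff distance
```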

One note on these "distances." None of the Kullback-Leibler, Chernoff, and resistor-average "distances" can be distances in the mathematical sense. To qualify, a proposed distance d(a, b) must be symmetric (d(a, b) = d(b, a)), be strictly positive unless the two arguments to the distance are equal (d(a, b) > 0 for a ≠ b, d(a, a) = 0), and obey the triangle inequality (d(a, b) ≤ d(a, c) + d(c, b)). The Kullback-Leibler distance is not symmetric, and the Chernoff and resistor-average distances don't satisfy the triangle inequality. This mathematical issue does not affect their utility in judging how different two responses are in a meaningful way. All of these measures can be related to the performance of optimal signal processing systems (see the discussion following Eq. (5) and the appendix).

5. Statistical Properties

5.1. Estimation of Distance Measures

The most direct approach to estimating distance measures is to use types in their definitions:

\hat{D}(P_{θ_1}(R) \| P_{θ_2}(R)) = D(\hat{P}_{θ_1}(R) \| \hat{P}_{θ_2}(R)),
\hat{R}(P_{θ_1}(R), P_{θ_2}(R)) = R(\hat{P}_{θ_1}(R), \hat{P}_{θ_2}(R)).

Here, the Kullback-Leibler distances are computed assuming some Markov order, as described by Eq. (9).

However, this direct approach does have problems. When the type for the reference distribution has a zero-valued probability estimate for some letter at which the other type is nonzero, we obtain an infinite answer, which may not be accurate (the true reference distribution has a nonzero probability for the offending letter). To alleviate this problem, the so-called K-T estimate (Krichevsky and Trofimov, 1981) is employed. Each type is modified by adding one-half to the histogram estimate before it is normalized to yield


a type. Thus, for the kth letter, the K-T estimate at bin b is

\hat{P}^{KT}_{R_b}(r_k) = \frac{(\#\,\text{times } r_k \text{ occurs in } \{R_{1,b}, \ldots, R_{M,b}\}) + \tfrac{1}{2}}{M + \tfrac{K}{2}}.

Now, no letter will be assigned a zero estimate of its probability of occurrence, and the estimate remains asymptotically unbiased with an increasing number of observations. When applied to joint types, we add 1/2 to each bin and normalize according to the total number of letters in the joint type. For first-order Markovian dependence analysis, we need the joint type defined over two successive bins, and we apply the K-T procedure according to

\hat{P}^{KT}_{R_{b−1}, R_b}(r_{k_1}, r_{k_2}) = \frac{(\#\,\text{times } (r_{k_1}, r_{k_2}) \text{ occurs in } \{(R_{1,b−1}, R_{1,b}), \ldots, (R_{M,b−1}, R_{M,b})\}) + \tfrac{1}{2}}{M + \tfrac{K^2}{2}}.

This estimation procedure is not arbitrary: it is based on theoretical considerations of which a priori distribution for the probabilities estimated by a type sways the estimate the least.
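In code, the K-T modification is essentially one line (our sketch; `counts` stands for the per-letter histogram of one bin, a name we introduce here):

```python
import numpy as np

def kt_type(counts, M):
    """Krichevsky-Trofimov type estimate for one bin: add 1/2 to each
    letter's count, then renormalize by M + K/2."""
    counts = np.asarray(counts, float)
    K = len(counts)
    return (counts + 0.5) / (M + K / 2)

# Toy three-neuron bin, M = 200 presentations: letters 3..7 never occur,
# yet none receives probability zero.
print(kt_type([180, 12, 8, 0, 0, 0, 0, 0], 200))
```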

5.2. Bootstrap Removal of Bias

All distance measures presented here have the property that they can attain only nonnegative values. Any quantity having this property cannot be estimated without bias. For example, if the true distributions are identical, distance measures are zero, but types calculated from two datasets drawn from the same distribution are unlikely to themselves be equal, and the resulting distance estimate will be positively biased. While the estimates are asymptotically unbiased, experience shows that the bias is significant even for large datasets and can lead to analysis difficulties. Analytic expressions for the bias of a related quantity—entropy—are known (Carlton, 1969), and they indicate that bias expressions will depend on the underlying distribution in complicated ways.

Fortunately, recent work in statistics provides a way of estimating the bias and removing it from any estimator without requiring additional data. The essence of this procedure, known as the bootstrap, is to employ computation as a substitute for a larger dataset. The bootstrap procedure is one of several resampling techniques that attempt to provide auxiliary information—variance, bias, and confidence intervals—about a statistical estimate. Another method in this family is the so-called jackknife method, which has been used for removal of bias in entropy calculations (Fagan, 1978). The book by Efron and Tibshirani (1993) provides excellent descriptions of the bootstrap procedure and its theoretical properties.

In a general setting, let R = {R_1, \ldots, R_M} denote a dataset from which we estimate the quantity \hat{θ}(R). Our quantities of interest here are the Kullback-Leibler and resistor-average distance measures. We create a sequence of bootstrap datasets R^*_j = {R^*_{1,j}, \ldots, R^*_{M,j}}, j = 1, \ldots, M_B. Each bootstrap dataset has the same number of elements as the original and is created by selecting elements from the original randomly and with replacement. Thus, elements in the original dataset may or may not appear in a given bootstrap dataset, and each can appear more than once. For example, suppose we had a dataset having four data elements {R_1, R_2, R_3, R_4}; a possible bootstrap dataset might be R^* = {R_2, R_3, R_1, R_1}. The parameter estimate from the mth bootstrap dataset is denoted by \hat{θ}^*_m = \hat{θ}(R^*_m). From the M_B bootstrap datasets, we estimate the quantity of interest, obtaining the collection of estimates {\hat{θ}^*_1, \ldots, \hat{θ}^*_{M_B}}. The suggested number of bootstrap datasets and estimates is a few hundred (Efron and Tibshirani, 1993).

The bootstrap estimates cannot be used to improve the precision of the original estimate, but they can provide estimates of \hat{θ}(R)'s auxiliary statistics, such as variance, bias, and confidence intervals. The bootstrap estimate of bias is found by averaging the bootstrap estimates and subtracting the original estimate from this average:

\widehat{\text{bias}} = \frac{1}{M_B} \sum_m \hat{θ}^*_m − \hat{θ}(R).

The bootstrap-debiased estimate is, therefore, 2\hat{θ}(R) − \frac{1}{M_B} \sum_m \hat{θ}^*_m. Calculation of bootstrap-debiased distances can result in negative distances when the actual distance is small.
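In outline (our sketch, not the authors' code; `estimate` stands for any scalar distance estimator, such as the type-based resistor average, applied to a list of single-presentation responses):

```python
import numpy as np

def bootstrap_debias(data, estimate, n_boot=200, seed=0):
    """Return the bootstrap-debiased value 2*theta(R) - mean(theta*_m)
    and the sorted bootstrap estimates (used for confidence intervals)."""
    rng = np.random.default_rng(seed)
    M = len(data)
    theta_hat = estimate(data)
    boots = np.sort([estimate([data[i] for i in rng.integers(0, M, M)])
                     for _ in range(n_boot)])
    return 2 * theta_hat - boots.mean(), boots
```

Sorting the bootstrap estimates here anticipates the confidence-interval construction described next.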

Confidence intervals of level β can be estimated from the bootstrap estimates by sorting them and determining which values correspond to the (1 − β)/2 and (1 + β)/2 quantiles. Let {\hat{θ}^*_{(1)}, \ldots, \hat{θ}^*_{(M_B)}} denote the sorted (from smallest to largest) estimates. A raw confidence-interval estimate corresponds to [\hat{θ}^*_{(\lfloor (1−β)M_B/2 \rfloor)}, \hat{θ}^*_{(\lceil (1+β)M_B/2 \rceil)}]. Thus, for the 90% confidence interval, β = 0.9, and the raw confidence interval corresponds to the 5th and 95th percentiles. Because we want confidence intervals on the bootstrap-debiased estimate rather than the original, we reverse the interval and center it around the debiased estimate: [2\hat{θ}(R) − \hat{θ}^*_{(\lceil (1+β)M_B/2 \rceil)}, 2\hat{θ}(R) − \hat{θ}^*_{(\lfloor (1−β)M_B/2 \rfloor)}].

5.3. Dependence on Binwidth

Ideally, the calculation of distance measures between two responses would not depend on the binwidth Δ used in the digitization process. However, discharge probability at any specific time varies as binwidth varies. Since distances measure how different two probability distributions are, we expect that distance calculations could depend on binwidth. To analyze this situation, let's assume a single-neuron population, with the probability of an event equaling some rate times the binwidth: Pr[R_b = 1] = λΔ and Pr[R_b = 0] = 1 − λΔ. The Kullback-Leibler distance between two such random variables (having rates λ_1 and λ_2) is given by

D(λ_2 \| λ_1) = λ_2Δ \log \frac{λ_2Δ}{λ_1Δ} + (1 − λ_2Δ) \log \frac{1 − λ_2Δ}{1 − λ_1Δ}.

The first term is clearly proportional to binwidth; if we assume that the discharge probability is small (λΔ ≪ 1), then the total expression is proportional to the binwidth:

D(λ_2 \| λ_1) ≈ \left( λ_2 \log \frac{λ_2}{λ_1} + λ_1 − λ_2 \right) Δ.

All the other distances are also proportional to binwidth when λΔ ≪ 1. Once the binwidth is chosen small enough, we have found the temporal resolution necessary to maximally distinguish the two responses. Distance calculations can also be deliberately made with larger binwidths to assess the role temporal resolution has in distinguishing two responses.

When we accumulate the distance across bins that span a given time interval having duration T, as suggested in property 3 and Eq. (9), the number of bins equals T/Δ. If the discharge rates are such that discharge probabilities are small, the accumulation over a given time interval cancels the binwidth dependence, which leaves the accumulated distance independent of the binwidth. Let's be more concrete about this point. Assuming for the moment that the data are statistically independent from bin to bin (Markov order D = 0), the computation of the Kullback-Leibler distance between two responses equals the sum of the distances between the responses occurring within each bin. This distance will be proportional to binwidth if the bins are small enough. However, when we add the per-bin distances up to form the total interresponse distance, the value we get will not depend on the binwidth. For this reason, we prefer plotting accumulated distance (as expressed in Eqs. (2) and (9) in the independent and Markov cases, respectively) across the response.
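Written out (our elaboration of the argument): with B = T/Δ bins and rates that vary slowly relative to Δ, the accumulated distance is a Riemann sum,

\sum_{b=1}^{T/Δ} D_b ≈ \sum_b \left( λ_2(t_b) \log \frac{λ_2(t_b)}{λ_1(t_b)} + λ_1(t_b) − λ_2(t_b) \right) Δ ≈ \int_0^T \left( λ_2(t) \log \frac{λ_2(t)}{λ_1(t)} + λ_1(t) − λ_2(t) \right) dt,

whose value no longer depends on Δ once λΔ ≪ 1.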

5.4. Example

Figure 4 illustrates a simple application of this procedure for simulated (Poisson) single-neuron discharges. The two ways of computing the Kullback-Leibler distance from the simulated responses differ substantially. We find that this difference is statistically significant and occurs frequently in simulations and in actual recordings. Because we have no clear reference stimulus in this example, we use the Chernoff distance or its resistor-average approximation to compare two responses. The resistor average depicted in the top right panel consists of a series of straight lines, which correspond to time segments of constant rate differences between the responses. The greater slopes correspond to greater rate differences. Note that when the rates are equal, the distances do not change, indicating no response difference. The middle plot shows that the bias in the initial estimate of the resistor average is quite large. We have found the bootstrap bias-compensation procedure described in Section 5.2 to be necessary for obtaining accurate distance estimates. To employ the bootstrap in the cyclostationary case, we consider our dataset to consist of the responses to individual stimulus presentations for a given parameter setting: {R_1(θ), \ldots, R_M(θ)}, and our bootstrap datasets contain M responses selected randomly from this original. We independently perform the bootstrap on each response resulting from each stimulus condition, compute types from each bootstrap sample, and calculate the distance between these samples. As illustrated in Fig. 4, the bootstrap substantially removes the inherent positive bias. We also see that half the resistor-average distance quite closely approximates the actual Chernoff distance between the responses. Examining the bottom right panel of Fig. 4 shows that the actual Chernoff distance would lie well within the 90% confidence interval. Hence, we use the computationally simpler resistor-average distance measure. Note how the confidence interval widens as we progress across the response. This effect occurs naturally because we are adding more and more statistical quantities as we accumulate the total distance. These intervals would be substantially smaller if we considered accumulated distances over portions of the response.

Figure 4. Single-neuron responses were simulated based on a Poisson discharge model. The first response had a constant rate, and the second response was a staircase; these constitute an example chosen to illustrate type-based analysis. These two responses equaled each other during the initial and final 10 bins. The discharge probabilities controlled the occurrence of M = 200 simulated responses. The resulting PST histograms are shown, with the actual discharge probability shown by dashed lines and the dotted vertical lines indicating when rate changes occurred. The right column displays the various information-theoretic distance measures calculated from these responses. The top panel shows the accumulated Kullback-Leibler distances estimated with the K-T modification using each response as the reference (dashed lines), along with the resistor average of these two (solid line). All of these were debiased using the bootstrap. In the middle panel, the resistor average (scaled by two) before (dot-dashed) and after (solid) applying the bootstrap is compared with the theoretical Chernoff distance (dashed). The bottom panel again shows the debiased resistor average (again scaled by two) along with the 90% confidence limits (dotted) estimated via the bootstrap. In all cases, 200 bootstrap samples were used.

To interpret this distance calculation, we refer to the modern classification theory reviewed in the appendix. Because the Chernoff distance is related through Eq. (A3) to the classification error rate, it reveals how easily the two responses can be distinguished: the bigger the distance, the smaller the probability of an error in distinguishing the two. Note that this error probability is known only up to a constant: we cannot compute it precisely. Asymptotic error probability changes with time roughly according to 2^{−d(t)}, where d(t) is the accumulated distance, whether the Kullback-Leibler or Chernoff distance, and t is poststimulus time. Thus, each unit (one bit) increase in distance corresponds to a factor-of-two smaller error probability. The accumulation of distance with time is not an arbitrary choice. This procedure corresponds to the Kullback-Leibler distance's property 3, which states that the distance between the joint probability distributions characterizing a response over a given number of bins equals the sum of the component distances.

As the two responses are identical over the first 10 bins, no distance is accumulated. As the rates differ more in each 20-bin section, we see that the distance accumulated in each section increases. In this example, the increases in accumulated (approximate) Chernoff distance from the beginning to the end of each section are 0.1, 0.3, 0.55, and 0.95 bits. These quantities were calculated by subtracting the accumulated distance at a section's beginning from its value at the end. Finally, the responses have identical rates during the last 10 bins, and we see distance does not increase further. When we analyze responses, we concentrate on those portions of the response that contribute most to accumulated distance, since they provide the most effective coding (in terms of classification errors). In our simple example, the response during bins 70 to 90 contributes most because the rate difference is greater there. As we consider more complicated examples of coding, it becomes increasingly important that we can use type-based analysis to determine important sections of the response without assuming the nature of the code.

To relate these distance calculations to estimation error, let's assume that the staircase response corresponds to increasing some stimulus parameter by equal amounts. The smallest increment yielded a difference of 0.1 bits over a 20-bin interval. Using the perturbational results of Eq. (5), we find that the Fisher information equals 0.1 × 8 ln 2/(δθ)^2 = 0.555/(δθ)^2. This calculation means that this parameter is encoded by a rate code in such a way that the mean-squared estimation error incurred in determining the parameter from this response must be at least (δθ)^2/0.555 = 1.8(δθ)^2, where we need to know the amount of perturbation δθ to produce a numeric value. If N statistically independent neurons represented the stimulus the same way, the mean-squared error would decrease inversely with N.

Figure 5. The upper panels show the PST histograms of a simulated lateral superior olive neuron's response to two choices of stimulus level (binwidth equals 0.5 ms). The simulations modeled the neuron's biophysics (Zacksenhouse et al., 1998). The bottom panels show the resistor-average distance between these two responses; the computations were performed under several conditions. The first of these shows the resistor-average distance (divided by two) between these responses computed for D = 0, 2, 4 bins (corresponding to 0, 1, and 2 ms of temporal dependence, respectively). The dotted lines straddling the D = 4 curve portray the 90% confidence interval. The curve superimposed on the PST histograms is the D = 4 curve. Finally, the bottom plot displays the resistor-average distance (divided by two) between the responses for two choices of binwidth, but with the dependence parameter D chosen so that the assumed temporal dependence for each spans the same time interval. The 90% confidence interval for the Δ = 0.5 ms case is displayed with dotted lines.

Figure 5 portrays how the choice of analysis order can affect distance calculations. Recall that exploring nonzero analysis orders amounts to seeking response differences not conveyed by the PST histogram.


During the first few milliseconds, no significant response differences are evident. After about 5 ms, significant differences occur, with the various choices of analysis order yielding about the same result. These distances then depart at about 7 ms, with the D = 4 curve being significantly larger. This result indicates significant temporal dependence in the responses, as it differs greatly from the D = 0 curve, which always corresponds to assuming the data are statistically independent from bin to bin. The value of the dependence parameter D is one of the few assumptions our information-theoretic approach must make. Ideally, all values that can be computed based on the amount of available data (Eq. (8)) should be explored. As D increases, the distance calculations will eventually not change, and the best value for the dependence parameter is the smallest of these. In the example portrayed in Fig. 5, the resistor-average distance kept increasing, leaving us no choice but to use the largest possible value. What the actual distance might be, even whether it is larger or smaller than the D = 4 result, cannot be determined without more data. Using the D = 4 result, the distance between the responses increases most sharply during the second portion of the transient response.

If one sacrifices temporal resolution by using larger bins, the distance computation can span longer time intervals. The bottom panel of Fig. 5 shows two distance calculations that span the same amount of temporal dependence, one using twice the binwidth of the other and half the dependence order. The restrictions placed on the dependence parameter by (8) apply to the Markov order, not to the amount of time spanned by a discharge pattern's dependence structure. Thus, we could analyze the data with the same maximal dependence order allowed by the bound, but over longer dependence time intervals, by manipulating the binwidth. It should be reemphasized that the restrictions of (8) apply to any data analysis technique and that the idea of choosing different binwidths applies to other methods as well.

This way of displaying distance—accumulated as poststimulus time increases—also illustrates our general finding that the distance measures smooth the response variations found in PST histograms. Although the displayed responses came from simulations, actual recordings also demonstrate the rapid rate oscillations found during the first 10 ms. One of our analysis technique's most powerful features is that it can assess response differences without regard to whether response rates and/or interspike dependencies are time varying or not. Note that during the latter portion of the response the distance measures increase roughly linearly. This effect usually indicates a difference in sustained rates, which can be discerned from the PST histograms. Furthermore, about half the total distance accumulated over 25 ms (4.65 bits) is garnered in the first 10 ms. We conclude that the initial transient of the response allows equal discriminability in the first 10 ms (actually 7 ms, as there is about a 3 ms latency) as does the response obtained during the last 13 ms. Thus, the initial portion of the response conveys as much about the stimulus as the latter portion does, in less time.

Binwidth effects are also demonstrated in Fig. 5. From the example shown there, we conclude that the larger binwidth of 0.5 ms would suffice, as joint types computed over the same time span but with different binwidths yield nearly the same results. The time epochs over which the distance calculations disagree most occur during the high-probability-of-discharge segments of both responses, a result consistent with the analysis of Section 5.3.

6. Applications

We have shown how information-theoretic distances can assess how two responses differ in a meaningful way: using them, we can infer the performance limits of information-processing systems. We can also probe interdependencies in population responses. We describe this and other applications of our approach for understanding the neural code.

6.1. Assessing Neural Codes

The simplest application of distance analysis is assessing which part of the response changes significantly with stimulus changes. Perhaps the most powerful aspect of type-based analysis is that it makes no a priori assumption about the nature of neural encoding. It and other techniques that make no a priori assumptions about the neural code are limited to the Markov dependence orders that (8) allows. Calculating response distance quantifies how well the code expresses stimulus changes regardless of its form, whether a timing code, a rate code, or some combination of these. "Significant change" has two meanings here. The first is whether the distance measure is significantly different from zero during some portion of the response. Inferring this statistical significance is the role for


confidence intervals, which we compute using the bootstrap. The second type of significance is which portion of the response contributes most to accumulated distance. We judge this by computing how much distance changes over a given time interval. One consequence of making this kind of calculation is that we can directly evaluate one response component's importance relative to another's. For well-defined portions of the response, like the initial transient and later sustained response that typifies auditory neuron responses to tone bursts, we can directly compare how different the portions are. Furthermore, the cumulative distance reveals how long it takes to yield a certain level of discrimination. We can then begin to answer questions such as how long it takes to determine from the population's response a just-noticeable stimulus change.

An example of this analysis for the single-neuron case is displayed in Fig. 5. Figure 6 illustrates applying this approach to a simple population of three neurons.

Figure 6. We simulated a three-neuron ensemble responding to two stimulus conditions. The left portion of the display shows PST histograms of each neuron. As far as can be discerned from these histograms, the first stimulus yielded a constant-rate response in each simulated neuron. The second stimulus produced different responses in each neuron: the first had an oscillatory response lasting 20 bins, the second a transient rate increase for 10 bins, and the third a rate change. The dashed vertical lines in the PST histograms indicate the boundaries of these various response portions. During the first stimulus, and until the last 10 bins of the second stimulus, the neurons produced discharges statistically independent of the others. In the last 10 bins, the first and third neurons' discharges became correlated (coefficient = 0.6). Throughout, all responses were produced by a first-order Markov model having a correlation coefficient of −0.1. The right panel shows the result of computing the resistor-average distance between the two responses. The solid line shows half the resistor-average distance, with its 90% confidence interval shown with a dotted line. Dashed vertical lines correspond to stimulus 2 response components.

Both a stimulus-induced rate response and a transneural correlation can be detected, and the relative contribution of each response component to sensory discrimination quantified. Clearly, the initial portion of the response produced the greatest distance change. During the next 10 bins, when the latter portion of neuron #1's oscillatory response and the rate responses of the other two are present, about 0.5 bits of distance were gained. This increase means that the probability of not being able to discriminate between the two stimulus conditions decreased by a factor of about 2^0.5 ≈ 1.4. A much larger change (1.4 bits) occurred during the first 10 bins. Consequently, the first portion of the response contributes much more to stimulus discrimination than the second. The third portion of the response contains only constant discharge rates. The distance accumulated during this time (bins 20 to 29) roughly equals the distance accumulated during the previous 10 bins, when neuron #1's response contained an oscillatory component. This equality of accumulated distance means that the oscillatory response and the constant-rate response are equally effective in representing the stimulus difference. Interestingly, the introduction of spatial correlation (found in the last 10 bins) increased the accumulated distance only slightly beyond what the rate response by itself would have produced.

6.2. Uncovering Neural Codes

The calculation of distances between responses quantifies neural coding without revealing what the code is. Distance calculations can offer some insights as well into what aspects of the response contribute to the code. For example, we can determine the presence of correlation in an ensemble's response, be it stimulus- or connectivity-induced. In the former case, spike trains can be correlated merely because neurons are responding to the same stimulus. In the latter, the neurons receive common inputs or are interconnected. We compute the type of the measured ensemble response and derive from it the type that

Figure 7. The resistor-average distance (divided by two) between the type computed from the response and the type derived from it that forces a spatially independent ensemble response structure, shown for the two stimulus conditions used in Fig. 6. The dashed line shows the result for the first response (histograms shown in the left column of Fig. 6), the solid line for the second (center column). As was simulated, the responses to stimulus 1 demonstrated no transneural correlation. The second stimulus did induce a correlation in the latter portion of the response, and the distance clearly indicates the presence of such correlation. The 90% confidence interval for the second response is indicated by the dotted lines. Note that the confidence interval's lower edge was less than zero for the first 30 bins.

would have been produced by the ensemble if it had statistically independent members (removing spatial dependence) and/or had no temporal dependence. Referring to Fig. 3 for an example, the probability of each neuron discharging in each bin can be calculated from the joint probability of the various response patterns occurring in a bin. For example, Pr[discharge in neuron #1] = Pr[Rn = 4] + Pr[Rn = 5] + Pr[Rn = 6] + Pr[Rn = 7] because the leading bit of the binary representations of these symbols, which corresponds to neuron #1, equals 1: 4 = 100₂, 5 = 101₂, etc. From these component probabilities, we estimate the probability of all possible ensemble response patterns by multiplying, according to the pattern's bits, the probabilities of each neuron discharging or not (Pr[Rn = 3] = Pr[no discharge in neuron #1] · Pr[discharge in neuron #2] · Pr[discharge in neuron #3] because 3 = 011₂). By calculating the distance between these two types, we can infer when correlated responses are present; Fig. 7 illustrates an example.
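To illustrate the binary bookkeeping, the following sketch (ours, not the paper's code) recovers per-neuron discharge probabilities from a per-bin type over the eight ensemble symbols of a three-neuron population, forms the corresponding spatially independent type, and computes the Kullback-Leibler distance between the two; the joint type shown is made-up for illustration.

```python
import numpy as np

def independent_type(joint, n_neurons=3):
    """Given the type over the 2**n ensemble symbols of one bin, return
    the type a spatially independent ensemble would have produced."""
    symbols = np.arange(2 ** n_neurons)
    # Pr[discharge in neuron k]: sum the type over symbols whose bit for
    # neuron k equals 1 (neuron #1 is the leading bit, so symbols 4..7).
    bits = [(symbols >> (n_neurons - 1 - k)) & 1 for k in range(n_neurons)]
    p_fire = np.array([joint[bits[k] == 1].sum() for k in range(n_neurons)])
    # Multiply per-neuron probabilities bit by bit, e.g.
    # Pr[R = 3 = 011] = (1 - p1) * p2 * p3.
    indep = np.ones_like(joint)
    for k in range(n_neurons):
        indep *= np.where(bits[k] == 1, p_fire[k], 1.0 - p_fire[k])
    return indep

joint = np.array([0.30, 0.05, 0.05, 0.15, 0.05, 0.10, 0.05, 0.25])
indep = independent_type(joint)
# The distance between measured and independent types quantifies the
# interneuron dependence in this bin (zero iff neurons are independent).
print(np.sum(joint * np.log2(joint / indep)), "bits")
```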

The presence of interneuron correlation in the fourth response segment shown in Fig. 6 is not discernible when compared to the distance accumulated in the third segment, when only rate differences are present. One might infer from the analysis shown in Fig. 7 that the amount accumulated in the fourth segment should exceed that of the third by about 0.5 bit. The fact that this difference does not occur when analyzing the data demonstrates a subtlety in using distance measures. The distance between responses does represent how easily an optimal classifier can distinguish them, but the various factors that contribute to this distance are not necessarily additive. Just because correlation analysis reveals 0.5 bit of difference does not mean that interneuron correlation increases the distance contributed by average-rate differences by the same amount.

In this analysis, we can use the Kullback-Leibler distance directly. It equals the mutual information between the component discharge patterns of the population (property 5). Zero mutual information corresponds to statistically independent responses, and it increases as the discharge patterns become more interdependent. When applying this analysis to populations of three or more neurons, we are extending the definition of mutual information (property 5) to N random variables in a novel way:

$$I(x_1; \ldots; x_N) = D\bigl(p(\mathbf{x}) \,\|\, p(x_1)\,p(x_2)\cdots p(x_N)\bigr).$$

Here, p(x) denotes the joint distribution of the N random variables. We note that from an information transfer viewpoint, statistically independent responses do not always correspond to the best situation (Gruner and Johnson, 1999).
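For populations of three or more neurons, this extended mutual information can be computed directly from a joint distribution array; a small sketch (ours, not the authors' code), using base-2 logs so the result is in bits:

```python
import numpy as np
from functools import reduce

def multi_information(p):
    """D(p(x) || p(x1) p(x2) ... p(xN)) in bits, for a joint distribution
    given as an N-dimensional probability array."""
    axes = range(p.ndim)
    # Marginal of variable i: sum the joint over every other axis.
    marginals = [p.sum(axis=tuple(j for j in axes if j != i)) for i in axes]
    product = reduce(np.multiply.outer, marginals)   # independent joint
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / product[mask]))
```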

6.3. Uncovering Feature Extraction

As part of developing these new techniques, we re-examined how the signal processing function of any system should be assessed. Consider a nonlinear, adaptive system (a neural ensemble) that accepts inputs and produces outputs (as shown in Fig. 1), about whose function we have only general insight (for example, it processes acoustic information). Assume that the inputs depend on a collection of stimulus parameters represented by the vector θ. Curiously, knowing the system's input-output relation may not be helpful in understanding its signal processing function: nonlinear systems are just too complicated. Our approach examines response sensitivity to stimulus changes and derives from it the ability of an optimal signal processing system to estimate the stimulus parameters. The key idea underlying this approach is the perturbational result of Eq. (5), which relates distance measure changes to the Fisher information matrix.

In our approach, we record responses to a reference stimulus parameterized by θ0 and to a family of stimuli parameterized by θ0 + δθ, with δθ a perturbation. We compute types from ensemble responses to both stimuli and quantify the "distance" between them. We use the Kullback-Leibler distance in this application since we have a natural choice for a reference response. Figure 8 shows the surfaces generated by perturbing two stimulus parameters (sound amplitude and azimuthal location of the sound) about a reference stimulus. Our responses, obtained from accurate biophysical simulations of binaurally sensitive lateral superior olive (LSO) neurons (Zacksenhouse et al., 1998), indicate that during different portions of the response, the two stimulus features are coded with differing fidelity. We measure fidelity as the ability (standard deviation of error) of an optimal system to estimate the stimulus parameters from the response. Early on, the transient response encodes both stimulus features well. Twenty milliseconds later, the fidelity of angle encoding remains about the same, although the form of the response has changed from a transient to a gradual rate change. During this period, the amplitude encoding has greatly worsened, with the standard deviation increasing by over a factor of five. During the constant-rate portion of the response starting 20 ms later still, the amplitude estimate has worsened further, while the angle estimate's quality remains about the same.

What these results indicate is that this LSO neuron is processing its inputs (which greatly resemble the primary neural outputs of the two ears) in such a way that stimulus amplitude and angle are both encoded in its response. In short, the neuron's discharge pattern multiplexes stimulus information. The fidelity of this representation changes rapidly with time after stimulus onset, with azimuth being the primary stimulus attribute encoded in the response. Thus, the information coding provided by this neuron's discharges is multidimensional and time varying. The initial portion of the response could be used along with other neural responses in the auditory pathway to estimate stimulus amplitude, but later portions are less useful. Because azimuth is consistently represented by these neural discharges, we conclude that the primary, but not only, role of the lateral superior olive is sound localization. However, downstream neural structures could use the amplitude information conveyed by these responses. Parallel


Figure 8. We simulated (Zacksenhouse et al., 1998) the response of a single lateral superior olive neuron (Tsuchitani, 1988a, 1988b) to high-frequency tone bursts presented at various amplitudes and azimuthal locations. The top row (panels A–C) shows the PST histogram of the simulated response at the reference condition (20 dB, 30°). The light areas in each indicate the 40 ms portion of the response subjected to type-based analysis. The next row (panels D–F) shows three-dimensional surfaces of the corresponding values of Kullback-Leibler distance between the reference response and the responses resulting from varying stimulus amplitude and angle. To these surfaces we fit a two-dimensional third-order polynomial and used its parabolic terms to estimate the elements of the Fisher information matrix according to Eq. (5). The inverse of this matrix provides lower bounds on estimates of angle and amplitude derived from the analyzed portion of the response. Panels G–I show sensitivity ellipses that trace the one-standard-deviation contour an optimal system would yield if it estimated amplitude and angle simultaneously. The horizontal and vertical extents of these ellipses correspond to the standard deviations of angle and amplitude estimates, respectively, and these determined the rectangles shown in each panel. Panel G's rectangle is repeated in the other panels for comparison. The bottom panel shows how these standard deviations changed during the response. The circles indicate the standard deviation of the amplitude estimate (left vertical scale) and the asterisks the standard deviation of the angle estimate (right scale).
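The caption's fit-and-invert step can be sketched as follows. We assume, per Note 4, that Eq. (5) takes the form D(θ0 ‖ θ0 + δθ) ≈ δθᵀ F(θ0) δθ / (2 ln 2) when the distance is in bits; the grids and Fisher matrix below are made-up illustrations, and we fit only the parabolic terms for brevity (the paper fits a third-order polynomial and keeps those terms).

```python
import numpy as np

# Synthetic stand-in for a measured KL surface (bits) over perturbations
# of amplitude (da) and angle (dg); F_true is a hypothetical Fisher matrix.
F_true = np.array([[0.8, 0.2],
                   [0.2, 0.5]])
da, dg = np.meshgrid(np.linspace(-2, 2, 21),      # amplitude perturbations
                     np.linspace(-10, 10, 21))    # angle perturbations
kl = (F_true[0, 0] * da**2 + 2 * F_true[0, 1] * da * dg
      + F_true[1, 1] * dg**2) / (2 * np.log(2))

# Least-squares fit of the parabolic terms a*da^2 + b*da*dg + c*dg^2.
A = np.column_stack([da.ravel()**2, (da * dg).ravel(), dg.ravel()**2])
a, b, c = np.linalg.lstsq(A, kl.ravel(), rcond=None)[0]

# Invert the assumed Eq. (5): F = 2 ln 2 * [[a, b/2], [b/2, c]].
F = 2 * np.log(2) * np.array([[a, b / 2], [b / 2, c]])
# The inverse Fisher matrix lower-bounds the estimate covariance; its
# diagonal gives the best achievable error variances (Fig. 8's ellipses).
stds = np.sqrt(np.diag(np.linalg.inv(F)))
print(f"std(amplitude) >= {stds[0]:.3f}, std(angle) >= {stds[1]:.3f}")
```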

neural systems present in the auditory pathway clearly represent amplitude more effectively; presumably they have greater impact on amplitude processing.

7. Conclusions and Discussion

The goal of information-theoretic distance analysis is to compute the distance between responses. We explored several ideas on how these distance calculations can be used to measure and assess the neural code. In all of these, the basic procedure is as follows (a code sketch of the core computations appears after the list):

1. Given sets of individual or simultaneous recordings, the analysis of the population's response begins with the digitization process described in Section 3. The important consequence of this procedure is that single and multiunit recordings have a common data representation.


2. Compute the joint type of user-specified order D, employing the K-T modification if the Kullback-Leibler distance is needed:

$$P_{R_b,\ldots,R_{b-D}}(r_0,\ldots,r_D) = \frac{\#\{\text{times } R_b = r_0,\ldots,R_{b-D} = r_D\} + \frac{1}{2}}{M + 2^{D \cdot N - 1}}.$$

3. Compute the Kullback-Leibler distance or the resistor-average approximation to the Chernoff distance using the Markov decomposition expressed in Eq. (9). The conditional distribution needed in this computation is found from the joint type by a formula that mimics the definition of conditional probabilities:

$$P_{R_b \mid R_{b-1},\ldots,R_{b-D}}(r_0 \mid r_1,\ldots,r_D) = \frac{P_{R_b,\ldots,R_{b-D}}(r_0, r_1,\ldots,r_D)}{\sum_{r} P_{R_b,R_{b-1},\ldots,R_{b-D}}(r, r_1,\ldots,r_D)}.$$

4. Use the bootstrap debiasing and confidence-interval procedure on the distance thus calculated. When analyzing cyclostationary responses, we consider the responses to individual stimulus presentations as the fundamental "data quantum." Our bootstrap samples are drawn from this collection of M datasets.

5. Our examples plot the cumulative debiased distance as each term is accumulated (using an expression similar to that of Eq. (9)).
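To make steps 2 and 3 concrete, here is a minimal sketch of the type and distance computations. It is our reconstruction, not the authors' code: the letter encoding follows Section 3, the K-T smoothing adds 1/2 to each count and half the number of joint states to the total, and `kl_increment` is one plausible reading of the Markov decomposition of Eq. (9).

```python
import numpy as np

def kt_joint_type(letters, b, D, n_letters):
    """K-T-smoothed joint type of (R_b, ..., R_{b-D}) across M stimulus
    presentations.  `letters` is an M x B integer array: each entry encodes
    one bin of an N-neuron response as a symbol in 0..2**N - 1."""
    M = letters.shape[0]
    counts = np.zeros((n_letters,) * (D + 1))
    for m in range(M):
        # Index order (r0, r1, ..., rD) = (R_b, R_{b-1}, ..., R_{b-D}).
        counts[tuple(letters[m, b - D:b + 1][::-1])] += 1
    return (counts + 0.5) / (M + n_letters ** (D + 1) / 2)

def kl_increment(type_p, type_q):
    """Per-bin Kullback-Leibler term in bits: distance between the
    conditional distributions of R_b given the past D letters, averaged
    over p's distribution of the past."""
    cond_p = type_p / type_p.sum(axis=0, keepdims=True)   # P(r0 | past)
    cond_q = type_q / type_q.sum(axis=0, keepdims=True)
    past_p = type_p.sum(axis=0)                           # P(past) under p
    return np.sum(past_p * np.sum(cond_p * np.log2(cond_p / cond_q), axis=0))

def cumulative_kl(letters_p, letters_q, D=1, n_letters=8):
    """Cumulative debiased-style distance trace (step 5's plot, without
    the bootstrap debiasing shown earlier)."""
    B = letters_p.shape[1]
    increments = [kl_increment(kt_joint_type(letters_p, b, D, n_letters),
                               kt_joint_type(letters_q, b, D, n_letters))
                  for b in range(D, B)]
    return np.cumsum(increments)
```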

We have presented information-theoretic distance measures that can be used to quantify neural coding and described techniques that exploit them. These distances depend on the probabilistic descriptions of the neural discharges, about which we want to assume as little as possible. The theory of types suggests that empirical estimates of these distributions can be used to accurately compute these distance measures, with the sole modeling assumption being the Markov dependence parameter. Given sufficient data, this parameter can also be determined solely by the data's statistical structure. If this statistical structure spans a long time interval, our technique and any other will not fully reveal the neural code unless temporal resolution is compromised or more data are acquired. The examples we have presented here, particularly the feature extraction one, demonstrate that neural information coding can be quite complex, being both time varying and expressed by both discharge timing and rate. Thus, any technique for assessing information-coding fidelity must make as few assumptions as possible; the type-based analysis described here fulfills that criterion.

A second powerful aspect of our approach is its ability to cope with ensemble responses. As shown in Fig. 3, the analysis technique can conceptually be applied to any sized population. The information bound (Eq. (8)) suggests that the amount of data required for a given level of analysis grows exponentially in the number of neurons and in the Markov parameter D. In practical terms, our technique can be used only for small populations. However, since the bound applies to any bin-based technique, without additional assumptions about the neural code, all such techniques are similarly data-limited. Whether a similar limitation applies to other techniques, such as those based on interspike intervals, is not known.

For judging coding quality, we prefer the Chernoff distance. Because of its computational complexity, we use the resistor-average distance to approximate it. The Kullback-Leibler distance, despite its theoretical importance, is difficult to use empirically because it is asymmetric with respect to the two component responses. We used it in the stimulus perturbation analysis because a natural reference response emerges there, and it is computationally the simplest to estimate.

The procedures we have described here can assess neural coding, but they do not directly reveal what the code is. We showed one approach to assessing transneural correlation in Fig. 7. In general, coding mechanisms can be inferred from the component types; precisely how, we have not yet determined. Be that as it may, the information-theoretic procedures developed here offer flexible but computationally intense analysis techniques that can meaningfully quantify the nature of neural coding within populations.

Appendix: Classification Theory

Classification theory concerns how observations can be optimally classified into predefined categories. Stating the problem formally in the paper's context of neural signal analysis, a set of measured responses {R1, . . . , RM} is to be classified as belonging to one of J categories. Here, the parameter M represents the number of stimulus presentations, and each Rm represents the population response to one presentation. Each category is defined according to a known probabilistic description (which may have unknown parameters or other controlled uncertainties). The most frequently studied variant of the classification problem is the binary classification problem: which of two categories C1, C2 best matches the observations. Given the probabilistic descriptions of the categories, the optimal rule for classifying observations, the likelihood ratio test, is well known (Hogg and Craig, 1995). What is very difficult to calculate is how well this optimal rule works. Developing approximations for calculating performance leads to important notions that directly apply to the neural response analysis problem. Classifier performance is usually expressed in terms of error probabilities. Terminology for the error probabilities originated in radar engineering, where C1 meant that no target was present and C2 that one was. The false-alarm probability Pfa = Pr[say C2 | C1] is the probability that the classification rule announces C2 when the data actually were produced according to C1 (the classifier incorrectly declared a target was present), and the miss probability is Pmiss = Pr[say C1 | C2] (the classifier incorrectly announced that no target was present when one was). The average error probability Pe is the weighted average of these individual error probabilities: Pe = π1 Pfa + π2 Pmiss, where π1, π2 are the a priori probabilities that data conform to the categories. Note that in the two-category problem π1 = 1 − π2.

The likelihood ratio test results when we seek the classifier that minimizes either the average probability of error or the false-alarm probability (Johnson and Dudgeon, 1993).⁶ Picking which error probability to minimize would seem not to matter (the same classifier results) if it were not for the fact that the optimized error probabilities usually differ, easily by orders of magnitude in many cases. Furthermore, in many situations the optimal classifier's false-alarm probability can differ significantly from its miss probability. Because error probabilities provide the portal through which we develop information-theoretic analysis techniques, appreciating which error probability best suits a given experimental paradigm leads to better interpretation of measured distances. In neuroscience applications, we want to present two different stimulus conditions and to quantify how easily these stimuli can be distinguished based on the responses of some neural population. Average error probability summarizes how well an optimal classifier can distinguish data that could arise from either of two stimulus conditions. False-alarm (or miss) probability better summarizes performance when one of our stimuli can be considered a reference, and we want to know how well an optimal classifier can distinguish some neural response from a nominal one.
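A minimal likelihood-ratio classifier for discrete response symbols can make this concrete; the sketch below is ours, and the pmfs and threshold are illustrative.

```python
import numpy as np

def log_likelihood_ratio(responses, p_c1, p_c2):
    """Log-likelihood ratio (base 2) of observed symbols under the two
    category pmfs; large values favor C2."""
    return np.sum(np.log2(p_c2[responses]) - np.log2(p_c1[responses]))

p_c1 = np.array([0.7, 0.2, 0.1])   # hypothetical category descriptions
p_c2 = np.array([0.4, 0.4, 0.2])
rng = np.random.default_rng(0)
responses = rng.choice(3, size=50, p=p_c1)   # data actually drawn from C1

# Threshold 0 gives the minimum-Pe rule with equal priors; raising it
# trades false alarms against misses (the Neyman-Pearson setting).
say_c2 = log_likelihood_ratio(responses, p_c1, p_c2) > 0
print("classifier announces:", "C2" if say_c2 else "C1")
```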

No general formulae for error probabilities are known for optimal classifiers except in special cases, such as the classic Gaussian problem. We can, however, answer the question "How does performance change as the amount of data becomes large?" When the observations are statistically independent and identically distributed under both categories, $p_{C_j}(\{\mathbf{R}_1,\ldots,\mathbf{R}_M\}) = \prod_m p_{C_j}(\mathbf{R}_m)$, results generically known as Stein's Lemma (Cover and Thomas, 1991, secs. 12.8, 12.9) state that error probabilities decay exponentially in the amount of data available, with the so-called exponential rate being an information-theoretic distance measure:

$$\lim_{M\to\infty} \frac{\log P_{fa}}{M} = -D\bigl(p_{C_2}(\mathbf{R}) \,\big\|\, p_{C_1}(\mathbf{R})\bigr) \quad \text{for fixed } P_{miss} \qquad (12\text{a})$$

$$\lim_{M\to\infty} \frac{\log P_{miss}}{M} = -D\bigl(p_{C_1}(\mathbf{R}) \,\big\|\, p_{C_2}(\mathbf{R})\bigr) \quad \text{for fixed } P_{fa} \qquad (12\text{b})$$

$$\lim_{M\to\infty} \frac{\log P_{e}}{M} = -C\bigl(p_{C_1}(\mathbf{R}),\, p_{C_2}(\mathbf{R})\bigr). \qquad (12\text{c})$$

D(· ‖ ·) is known as the Kullback-Leibler distance between two probability distributions. It applies to both probability densities p, q and probability mass functions P, Q:

$$D(p \,\|\, q) = \int p(\mathbf{R}) \log \frac{p(\mathbf{R})}{q(\mathbf{R})}\, d\mathbf{R}. \qquad (13)$$

C(·, ·) is the Chernoff distance, defined as

$$C(p, q) = -\min_{0 \le u \le 1} \log \int [p(\mathbf{R})]^{1-u}\,[q(\mathbf{R})]^{u}\, d\mathbf{R}. \qquad (14)$$
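For discrete types, both distances (and the resistor-average surrogate used in the main text) reduce to a few lines. A sketch with base-2 logs so results are in bits, assuming the parallel-resistor combination of the two Kullback-Leibler distances that the resistor-average's name suggests:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl(p, q):
    # Kullback-Leibler distance (Eq. (13)), in bits.
    return np.sum(p * np.log2(p / q))

def chernoff(p, q):
    # Chernoff distance (Eq. (14)); the objective is bowl-shaped in u,
    # so a bounded 1-D minimization suffices.
    f = lambda u: np.log2(np.sum(p ** (1 - u) * q ** u))
    return -minimize_scalar(f, bounds=(0.0, 1.0), method="bounded").fun

def resistor_average(p, q):
    # Symmetric surrogate for the Chernoff distance.
    return 1.0 / (1.0 / kl(p, q) + 1.0 / kl(q, p))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(chernoff(p, q), resistor_average(p, q) / 2)  # half the resistor average
```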

Note that these definitions apply to both univariate and multivariate distributions. When the observations are not statistically independent, all these results apply to the multivariate distribution of the observations (Johnson and Orsak, 1993). For example,

$$\lim_{M\to\infty} \frac{\log P_{fa}}{M} = -\lim_{M\to\infty} \frac{D\bigl(p_{C_2}(\{\mathbf{R}_1,\ldots,\mathbf{R}_M\}) \,\big\|\, p_{C_1}(\{\mathbf{R}_1,\ldots,\mathbf{R}_M\})\bigr)}{M} \quad \text{for fixed } P_{miss}.$$

Stein’s Lemma (A1) is not stated directly in termof error probabilities because of subtle, but important,

Information-Theoretic Analysis of Neural Coding 67

technical details. Focusing on the false-alarm proba-bility, Stein’s Lemma for the case of independent ob-servations can be stated more directly as

$$P_{fa} \to f(M)\, 2^{-M D(p_{C_2}(\mathbf{R}) \| p_{C_1}(\mathbf{R}))} \quad \text{for fixed } P_{miss},$$

with $\lim_{M\to\infty} [\log f(M)]/M = 0$. The term f(·) changes slowly in comparison to the exponential, and it depends on the problem at hand. What this formula means is that if we plot any of the error probabilities logarithmically against M linearly, we will always obtain a straight line for large values of M (see Fig. 9). Stein's Lemma says that the false-alarm probability's slope on semi-logarithmic axes (its exponential rate) equals the negative of the Kullback-Leibler distance between the probability distributions defining our classification problem. Because of the presence of the problem-dependent quantity f(·), we cannot determine in general the vertical origin for the error probability or how large M must be for straight-line behavior to take over. Consequently, we cannot

Figure 9. Using the Gaussian and Poisson classification problems as examples, we plot the false-alarm probability (left panels) and average probability of error (right panels) for each as a function of the amount of statistically independent data used by the optimal classifier. The miss-probability criterion was that it be less than or equal to 0.1. The a priori probabilities are 1/2 in the right-column examples. As shown here, the average error probability produced by the minimum-Pe classifier typically decays more slowly than the false-alarm probability for the classifier that fixes the miss probability. The dashed lines depict the behavior of the error probabilities as predicted by the asymptotic theory (Eq. (12)). In each case, these theoretical lines have been shifted vertically for ease of comparison.

use asymptotic formulas to compute error probabilities, but we do know that error probabilities ultimately decay exponentially for any classification problem solved with the optimal classifier, and we know the rate of this decay. We can also say that if further observations increase any of these distances by one unit (a bit), the corresponding error probability decreases by a factor of two. Furthermore, having exponentially decreasing error probabilities defines a set of "good" classifiers. Optimal classifiers produce error probabilities that decay exponentially with the quantity multiplying M equal to the Kullback-Leibler distance or the Chernoff distance. Suboptimal but "good" ones will have a smaller slope, with poor ones not yielding exponentially decaying error probabilities. The exponential rate cannot be steeper than the Chernoff distance for the classifier that optimizes average error probability, nor steeper than the Kullback-Leibler distance for the Neyman-Pearson classifier. Thus, these distances define any classification problem's difficulty. The greater the distance, the more quickly error probabilities decrease (the exponential rate is larger) and the "easier" the classification problem. Whether we use an optimal classifier or not, the Chernoff and Kullback-Leibler distances quantify the ultimate performance any classifier can achieve, and therefore measure intrinsic classification problem difficulty. We therefore want to estimate these distances from measured responses to quantify what response differences make distinguishing them easier.
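The exponential decay and the one-bit-halves-the-error rule can be checked on a problem with closed-form error probabilities. The sketch below (ours) uses the Gaussian mean-shift problem, N(0,1) versus N(d,1) with equal priors, for which the minimum-Pe classifier yields Pe = Q(√M d/2) and the Chernoff distance is d²/(8 ln 2) bits.

```python
import numpy as np
from scipy.stats import norm

d = 1.0
M = np.arange(1, 201)
# Closed-form average error probability of the minimum-Pe classifier for
# M i.i.d. observations: threshold the sample mean at d/2.
Pe = norm.sf(np.sqrt(M) * d / 2.0)

# Empirical exponential rate: negative slope of log2(Pe) versus M.
rate = -np.diff(np.log2(Pe))
# Chernoff distance between N(0,1) and N(d,1), in bits.
chernoff_bits = d ** 2 / (8.0 * np.log(2.0))

print(f"slope at M = 200:  {rate[-1]:.4f} bits per observation")
print(f"Chernoff distance: {chernoff_bits:.4f} bits")
# The residual gap at finite M reflects the slowly varying f(M) factor;
# one additional bit of distance halves the error probability.
```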

Note that the Kullback-Leibler distance (Eq. (13)) is asymmetric with respect to the two distributions defining the classification problem. The false-alarm probability achieved under a fixed miss-probability constraint and the miss probability achieved under a fixed false-alarm-probability constraint not only have different values, they may have different exponential rates. The Chernoff distance (Eq. (14)) is symmetric with respect to the two probability distributions and should be used to assess the classification problem. What prevents using it in applications is the required minimization process. Technically, this minimization is simple: the function to be minimized is bowl-shaped, leaving it with a unique minimum. Computationally, finding the quantity to be minimized can be quite complicated. In typical problems, many more calculations are needed to find the Chernoff distance than are required to compute the Kullback-Leibler distance.

Notes

1. Our distance measure has units of bits only because we use base-2 logarithms in its computation; it does not imply an information rate. We describe in succeeding sections how to interpret distance values.

2. The theory surrounding how to process information universally, without much regard to the underlying distribution of the data.

3. A sequence of random variables is Dth-order Markov if the conditional probability of any element of the sequence given values for the previous ones depends only on the D most previous: p(Ri | Ri−1, Ri−2, . . .) = p(Ri | Ri−1, Ri−2, . . . , Ri−D).

4. The term ln 2 arises because the definition of Kullback-Leibler distance (1) uses log2 and the definition of Fisher information uses natural logarithms.

5. This situation is particularly subtle. Even when the response can be well modeled as a renewal process (interspike intervals are statistically independent of each other), the probability of a discharge in a bin depends on how long ago the previous discharge occurred.

6. Note that in optimizing the false-alarm probability (making it as small as possible), we must constrain the miss probability to not exceed a prespecified value. If not, we can make the false-alarm probability zero by having our classifier always announce "class C2 models the data." That way we are never wrong when category C2 is true, but we will always be wrong if category C1 is true (the miss probability will be one). The likelihood ratio test emerges when we minimize Pfa subject to Pmiss ≤ α.

References

Abeles M (1991) Corticonics: Neural Circuits of the Cerebral Cortex. Cambridge University Press, New York.

Abeles M, Goldstein Jr. MH (1977) Multispike train analysis. Proc. IEEE 65:762–773.

Alonso JM, Usrey WM, Reid RC (1996) Precisely correlated firing in cells of the lateral geniculate nucleus. Nature 383:815–818.

Basseville M (1989) Distance measures for signal processing and pattern recognition. Signal Processing 18:349–369.

Bialek W, Rieke F, de Ruyter van Steveninck RR, Warland D (1991) Reading a neural code. Science 252:1852–1856.

Carlton AG (1969) On the bias of information estimates. Psychological Bulletin 71:108–109.

Cover TM, Thomas JA (1991) Elements of Information Theory. Wiley, New York.

deCharms RC, Merzenich MM (1996) Primary cortical representation of sounds by the coordination of action-potential timing. Nature 381:610–613.

Efron B, Tibshirani RJ (1993) An Introduction to the Bootstrap. Chapman & Hall, New York.

Fagan RM (1978) Information measures: Statistical confidence limits and inference. J. Theor. Biol. 73:61–79.

Gabbiani F, Koch C (1996) Coding of time-varying signals in spike trains of integrate-and-fire neurons with random threshold. Neural Comput. 8:44–66.

Gardner WA (1994) An introduction to cyclostationary signals. In: Cyclostationarity in Communications and Signal Processing. IEEE Press, New York, chap. 1.

Gruner CM, Johnson DH (1999) Correlation and neural information coding efficiency. In: Computational Neuroscience. Elsevier, Santa Barbara, CA.

Hogg RV, Craig AT (1995) Introduction to Mathematical Statistics (5th ed.). Prentice-Hall, Englewood Cliffs, NJ.

Johnson DH (1996) Point process models of single-neuron discharges. J. Comput. Neurosci. 3:275–299.

Johnson DH, Dudgeon DE (1993) Array Signal Processing: Concepts and Techniques. Prentice-Hall, Englewood Cliffs, NJ.

Johnson DH, Orsak GC (1993) Performance of optimal non-Gaussian detectors. IEEE Trans. Comm. 41:1319–1328.

Johnson DH, Swami A (1983) The transmission of signals by auditory-nerve fiber discharge patterns. J. Acoust. Soc. Am. 74:493–501.

Johnson DH, Tsuchitani C, Linebarger DA, Johnson M (1986) The application of a point process model to the single unit responses of the cat lateral superior olive to ipsilaterally presented tones. Hearing Res. 21:135–159.

Krichevsky RE, Trofimov VK (1981) The performance of universal encoding. IEEE Trans. Info. Theory IT-27:199–207.

Middlebrooks JC, Clock AE, Xu L, Green DM (1994) A panoramic code for sound location by cortical neurons. Science 264:842–844.

Riehle A, Grün S, Diesmann M, Aertsen A (1997) Spike synchronization and rate modulation differentially involved in motor cortical function. Science 278:1950–1953.

Rieke F, Bodnar DA, Bialek W (1995) Naturalistic stimuli increase the rate and efficiency of information transmission by primary auditory afferents. Proc. R. Soc. Lond. B 262:259–265.

Singer W, Gray CM (1995) Visual feature integration and the temporal correlation hypothesis. Ann. Rev. Neurosci. 18:555–586.

Tsuchitani C (1988a) The inhibition of cat lateral superior olivary unit excitatory responses to binaural tone bursts: I. The transient chopper discharges. J. Neurophysiol. 59:164–183.

Tsuchitani C (1988b) The inhibition of cat lateral superior olivary unit excitatory responses to binaural tone bursts: II. The sustained discharges. J. Neurophysiol. 59:184–211.

Vaadia E, Haalman I, Abeles M, Bergman H, Prut Y, Slovin H, Aertsen A (1995) Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature 373:515–518.

Victor JD, Purpura KP (1997) Metric-space analysis of spike trains: Theory, algorithms and applications. Network: Comput. Neural Sys. 8:127–164.

Wehr M, Laurent G (1996) Odour encoding by temporal sequences of firing in oscillating neural assemblies. Nature 384:162–166.

Weinberger MJ, Rissanen JJ, Feder M (1995) A universal finite memory source. IEEE Trans. Info. Theory 41:643–652.

Zacksenhouse M, Johnson DH, Tsuchitani C (1992) Excitatory/inhibitory interaction in the LSO revealed by point process modeling. Hearing Res. 62:105–123.

Zacksenhouse M, Johnson DH, Tsuchitani C (1993) Excitation effects on LSO unit sustained responses: Point process characterization. Hearing Res. 68:202–216.

Zacksenhouse M, Johnson DH, Williams J, Tsuchitani C (1998) Single-neuron modeling of LSO unit responses. J. Neurophysiol. 79:3098–3110.