Introduction: statistical and machine learning based
approaches to neurobiology
Shin Ishii
Nara Institute of Science and Technology
1. Mathematical Fundamentals: Maximum Likelihood and Bayesian Inference
Coin tossing
• Tossing a skewed coin
• How often does the head come up for this coin?
– Probability of the observed sequence in five tosses:
Head Tail Tail Tail Tail
One head comes up in five tosses.
(Note: each trial is independent)
Parameter $q$: the probability that the head comes up in an individual trial.
$P(\text{H,T,T,T,T} \mid q) = q(1-q)^4$
Likelihood
Likelihood function
• Likelihood: evaluation of the observed data
– viewed as a function of the parameter $q$:
$L(q) = P(\text{H,T,T,T,T} \mid q) = q(1-q)^4$
What is the most likely parameter $q$? How do we determine it?
It seems natural to set $q$ according to the frequency of heads, $q = 1/5$.
Really?
Which parameter value better explains the observed data?
Compare the likelihoods $L(q_1)$ and $L(q_2)$ of two candidate parameters: the value with the larger likelihood explains the data better.
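As a quick illustration (a minimal sketch; the variable names and the grid of candidate values are my own, not from the slides), the likelihood of the observed sequence can be evaluated over a grid of parameter values:

```python
import numpy as np

# Observed coin-toss data: 1 = head, 0 = tail (one head in five tosses)
data = np.array([1, 0, 0, 0, 0])

# Candidate values of q, the per-toss probability of a head
q_grid = np.linspace(0.01, 0.99, 99)

# Likelihood of the observed sequence as a function of q:
# L(q) = q^(#heads) * (1-q)^(#tails)
heads, tails = data.sum(), len(data) - data.sum()
likelihood = q_grid**heads * (1 - q_grid)**tails

print("q maximizing the likelihood:", q_grid[np.argmax(likelihood)])  # ~0.2
```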
Kullback-Leibler (KL) divergence
• A measure of the difference between two probability distributions $P(x)$ and $Q(x)$:
$D_{\mathrm{KL}}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
We can measure the difference between the distributions as an objective, numerical value.
Note: the KL divergence is not a metric (it is not symmetric and does not satisfy the triangle inequality).
Minimize the KL divergence
• Random events are drawn from the real (true) distribution $P_{\mathrm{true}}(x)$.
– observed data set: $x_1, x_2, \ldots, x_N$
Using the observed data, we want to estimate the true distribution with a trial distribution $P(x \mid \theta)$, by minimizing the divergence $D_{\mathrm{KL}}\big(P_{\mathrm{true}} \,\|\, P(\cdot \mid \theta)\big)$.
The smaller the KL divergence, the better the estimate.
Minimize the KL divergence
• KL divergence between the two distributions:
$D_{\mathrm{KL}}\big(P_{\mathrm{true}} \,\|\, P(\cdot \mid \theta)\big) = \sum_x P_{\mathrm{true}}(x) \log P_{\mathrm{true}}(x) - \sum_x P_{\mathrm{true}}(x) \log P(x \mid \theta)$
The first term is constant: independent of the parameter $\theta$.
To minimize the KL divergence, we only have to maximize the second term with respect to the parameter $\theta$.
Likelihood and KL divergence
• The second term is approximated by the sample mean over the data set:
$\sum_x P_{\mathrm{true}}(x) \log P(x \mid \theta) \approx \frac{1}{N} \sum_{n=1}^{N} \log P(x_n \mid \theta)$
The right-hand side is ($1/N$ times) the log likelihood of the data.
They are therefore the same problem:
• Minimizing the KL divergence
• Maximizing the likelihood
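A small numerical check of this equivalence (a sketch with an example Bernoulli distribution of my own choosing): the average log likelihood differs from the negative KL divergence only by a near-constant term, so their optima coincide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data drawn from a "true" Bernoulli distribution with parameter q_true
q_true = 0.2
data = rng.random(100_000) < q_true      # boolean samples (True = head)

q_grid = np.linspace(0.01, 0.99, 99)     # candidate trial distributions Bernoulli(q)

# Average log likelihood of the data under each candidate q (sample-mean approximation)
f = data.mean()
avg_loglik = f * np.log(q_grid) + (1 - f) * np.log(1 - q_grid)

# KL divergence from the true distribution to each candidate
kl = (q_true * np.log(q_true / q_grid)
      + (1 - q_true) * np.log((1 - q_true) / (1 - q_grid)))

# The two curves differ only by a near-constant term, so the optima agree
print("argmax avg log-likelihood:", q_grid[np.argmax(avg_loglik)])
print("argmin KL divergence     :", q_grid[np.argmin(kl)])
```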
Maximum Likelihood (ML) estimation
• Maximum likelihood (ML) estimate: $\hat{\theta}_{\mathrm{ML}} = \arg\max_\theta L(\theta)$
• What is the most likely parameter in the coin tossing?
Head Tail Tail Tail Tail
Maximization condition: $\frac{d}{dq}\, q(1-q)^4 = 0 \;\Rightarrow\; \hat{q}_{\mathrm{ML}} = \frac{1}{5}$
Same as the intuition (the observed frequency of heads).
Property of the ML estimate
• As the number of observations $N$ increases, the squared error of the estimate decreases, on the order of $1/N$.
• The ML estimate is asymptotically unbiased. (R. Fisher, 1890-1962)
If an infinite number of observations could be obtained, the ML estimate would converge to the real parameter.
Infeasible: what happens when only a limited number of observations has been obtained from the real environment?
Problem with ML estimation
• Is it really a skewed coin?
Head Tail Tail Tail Tail
It may just happen that four consecutive tails came up. It may be harmful to fix the parameter at $q = 1/5$.
Five more tosses...
Head Tail Head Head Head
See... the four consecutive tails occurred by chance. The ML estimate overfits to the first observations. How can we avoid this overfitting?
Consider an extreme case: if the data consist of one head in a single toss, the ML estimate gives $q = 1$ (100% heads). Not reasonable.
Bayesian approach
• Bayes theorem: Posterior ∝ Likelihood × Prior
$P(q \mid D) = \frac{P(D \mid q)\, P(q)}{P(D)}$
a posteriori information = a priori information + information obtained from data
We have no information about the possibly skewed coin. So we assume that the parameter is distributed around $q = 0.5$.
Prior distribution $P(q)$
Bayesian approach
• Bayes theorem: Posterior ∝ Likelihood × Prior
a posteriori information = a priori information + information obtained from data
Observed data: one head and four tails.
Hmm... it may be a skewed coin, but we had better consider other possibilities as well.
Likelihood function $P(D \mid q) = q(1-q)^4$
Bayesian approach
• Bayes theorem: Posterior ∝ Likelihood × Prior
a posteriori information = a priori information + information obtained from data
Posterior distribution $P(q \mid D) \propto q(1-q)^4\, P(q)$
The parameter is distributed mainly over intermediate values of $q$; some variance (uncertainty) remains.
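A minimal numerical sketch of this posterior (the Gaussian-shaped prior centred on 0.5 and its width are my own illustrative choices; the slides only say the prior is centred around 0.5):

```python
import numpy as np

q = np.linspace(0.001, 0.999, 999)          # grid over the parameter

# Prior: belief centred around q = 0.5 (illustrative width)
prior = np.exp(-0.5 * ((q - 0.5) / 0.2)**2)
prior /= prior.sum()

# Likelihood of one head and four tails
likelihood = q * (1 - q)**4

# Posterior ∝ likelihood × prior (normalized on the grid)
posterior = likelihood * prior
posterior /= posterior.sum()

print("ML estimate   :", q[np.argmax(likelihood)])   # ~0.2
print("Posterior mode:", q[np.argmax(posterior)])    # pulled toward the prior centre 0.5
print("Posterior mean:", np.sum(q * posterior))
```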
Property of Bayesian inference
• Bayesian view: probability represents the uncertainty of random events (a subjective value).
Frequentist (R. Fisher, 1890-1962): "That can't be! The prior distribution introduces a subjective distortion into the estimation. The estimation process must be objective, based only on the obtained data."
Bayesian (T. Bayes, 1702-1761): "No problem. The uncertainty of random events (subjective probability) depends both on the amount of information obtained from the data and on prior knowledge about the events."
Application of Bayesian approaches
• Data obtained from the real world are often:
– sparse
– high dimensional
– affected by unobservable variables
Bayesian methods are applicable in such cases, e.g.:
• User support systems (Bayesian networks)
• Bioinformatics
2. Bayesian Approaches to Reconstruction of Neural Codes
A neural decoding problem
How does the brain work?
Sensory information is represented in sequences of spikes.
When the same stimulus is repeatedly presented, the spike occurrences vary between trials.
An indirect approach is to reconstruct the stimuli from the observed spike trains.
Bayesian application to a neural code
Spike train (observation); stimulus statistics (prior knowledge)
Possible algorithms for stimulus reconstruction (estimation):
• from the spike train only (maximum likelihood estimation)
• from the spike train and the stimulus statistics (Bayes estimation)
Note: we focus on whether the spike trains include stimulus information, NOT on whether the algorithm itself is implemented in the brain.
‘Observation’ depends on ‘Prior’
Stimulus → Neural system (black box) → Spike train → Estimation algorithm → Estimated stimulus
(Bialek et al., Science, 1991)
‘Observation’ depends on ‘Prior’
Stimulus distribution $P(s)$ → Neural system (black box) $P(x \mid s)$ → Estimation algorithm → Estimated stimulus distribution $P(s \mid x)$
Simple example of signal estimation
Observation = Signal + Noise: $x = s + \eta$
Particular value of the observation: $x$; incoming signal: $s$; noise: $\eta$ (mean 0)
Prior $P(s)$, likelihood $P(x \mid s)$, posterior $P(s \mid x)$
Estimated stimulus: $s_{\mathrm{est}} = f(x)$
Simple example of signal estimation
If the signal $s$ is supposed to be chosen from a Gaussian,
$P(s) = \frac{1}{\sqrt{2\pi\sigma_s^2}} \exp\!\left(-\frac{s^2}{2\sigma_s^2}\right)$
and if the probability of observing a particular $x$ given the signal $s$ depends only on the noise $\eta$, with the noise also supposed to be chosen from a Gaussian,
$P(x \mid s) = \frac{1}{\sqrt{2\pi\sigma_\eta^2}} \exp\!\left(-\frac{(x-s)^2}{2\sigma_\eta^2}\right)$
then the posterior is
$P(s \mid x) \propto P(x \mid s)\, P(s) \propto \exp\!\left(-\frac{(x-s)^2}{2\sigma_\eta^2}\right) \exp\!\left(-\frac{s^2}{2\sigma_s^2}\right)$
Simple example of signal estimation
Bayes theorem: Posterior $P(s \mid x)$ ∝ Likelihood $P(x \mid s)$ (observation) × Prior knowledge $P(s)$
Maximum likelihood estimation: maximize $P(x \mid s)$
$\frac{\partial}{\partial s} P(x \mid s) = 0 \;\Rightarrow\; s_{\mathrm{est}} = x$
Bayes estimation: maximize $P(s \mid x)$
$\frac{\partial}{\partial s} P(s \mid x) = 0 \;\Rightarrow\; s_{\mathrm{est}} = K x$, where
$K = \frac{\mathrm{SNR}}{1 + \mathrm{SNR}}, \qquad \mathrm{SNR} = \frac{\sigma_s^2}{\sigma_\eta^2}$
As SNR → ∞, $K \to 1$ (the Bayes estimate approaches the ML estimate); as SNR → 0, $K \to 0$ (the estimate shrinks toward the prior mean).
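A minimal simulation of this example (the particular variances are illustrative assumptions), comparing the ML estimate $s_{\mathrm{est}} = x$ with the Bayes estimate $s_{\mathrm{est}} = Kx$:

```python
import numpy as np

rng = np.random.default_rng(1)

sigma_s, sigma_n = 1.0, 2.0              # signal and noise standard deviations (illustrative)
snr = sigma_s**2 / sigma_n**2
K = snr / (1 + snr)

s = rng.normal(0, sigma_s, 100_000)          # Gaussian signal
x = s + rng.normal(0, sigma_n, s.shape)      # observation = signal + noise

s_ml = x            # maximum likelihood estimate
s_bayes = K * x     # Bayes (MAP / posterior mean) estimate

print("MSE, ML   :", np.mean((s_ml - s)**2))     # ~ sigma_n^2
print("MSE, Bayes:", np.mean((s_bayes - s)**2))  # smaller: the posterior variance K * sigma_n^2
```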
Signal estimation of a fly (Bialek et al., Science, 1991)
Calliphora erythrocephala, movement-sensitive neuron (H1)
Gaussian visual stimulus
Time scale of visually guided flight behavior: ~30 ms; H1 firing rate: 100-200 spikes/s; behavioral decisions are therefore based on only a few spikes.
Signal estimation of a fly (Bialek et al., Science, 1991)
Stimulus $s(t)$ → Encoder → Observation: spike times $\{t_i\}$
Bayes theorem: Posterior $P[s \mid \{t_i\}]$ ∝ Likelihood $P[\{t_i\} \mid s]$ (observation) × Prior knowledge $P[s]$
The estimated stimulus maximizes $P[s \mid \{t_i\}]$. However, $P[s \mid \{t_i\}]$ cannot be measured directly.
Kernel reconstruction and least squares
The conditional mean
$s_{\mathrm{est}} = \int ds\; P[s \mid \{t_i\}]\; s$
still cannot be calculated, because $P[s \mid \{t_i\}]$ cannot be defined explicitly.
Alternative calculation: estimate the stimulus by a sum of kernels placed at the spike times,
$s_{\mathrm{est}}(t) = \sum_i F(t - t_i)$
choosing the kernel $F(\cdot)$ that minimizes the square error
$\int dt\, \big| s_{\mathrm{est}}(t) - s(t) \big|^2$
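A toy sketch of this kernel (linear) reconstruction (the stimulus, the encoder, and the least-squares fit below are all simplified assumptions for illustration, not the exact procedure used in the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

dt, T = 0.001, 10.0                      # time step (s) and duration (s)
t = np.arange(0, T, dt)
# Smoothed Gaussian "stimulus"
s = np.convolve(rng.normal(size=t.size), np.ones(50) / 50, mode="same")

# Toy encoder: Poisson-like spikes whose rate follows the stimulus
rate = 50 * np.clip(1 + s, 0, None)      # spikes/s
spikes = rng.random(t.size) < rate * dt  # binary spike train

# Fit the kernel F by least squares: s_est(t) = sum_i F(t - t_i)
lags = np.arange(-100, 101)              # +/- 100 ms of kernel support
X = np.stack([np.roll(spikes.astype(float), k) for k in lags], axis=1)
F, *_ = np.linalg.lstsq(X, s, rcond=None)

s_est = X @ F
print("reconstruction correlation:", np.corrcoef(s, s_est)[0, 1])
```

Allowing both negative and positive lags lets the kernel use spikes occurring before and after the stimulus time, as in acausal reconstruction.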
Signal estimation of a fly (Bialek et al., Science, 1991)
[Figure: the reconstruction kernel $F(\cdot)$, and the stimulus compared with the estimated stimulus]
Case of a mammal
Rat hippocampal CA1 cells
O'Keefe's place cells
(Lever et al., Nature, 2002)
Each place cell shows high activity when the rat is located at a specific position.
It is known that hippocampal CA1 cells represent the animal's position in a familiar field.
Case of a mammal
(Lever et al., Nature, 2002)
Each place cell shows high activity when the rat is located at a specific position.
Question:
Can one estimate the rat's position in the field from the firing patterns of hippocampal place cells?
Incremental Bayes estimation (Brown et al., J. Neurosci., 1998)
Sequential Bayes estimation from the spike train of a place cell, updated at the spike times $t_1, \ldots, t_{k-1}, t_k, \ldots$
$P\big(s(t_k) \mid \text{spikes in } [t_1, t_k]\big) \;\propto\; P\big(\text{spikes at } t_k \mid s(t_k), t_k\big)\; P\big(s(t_k) \mid \text{spikes in } [t_1, t_{k-1}]\big)$
Bayes theorem: Posterior ∝ Likelihood (observation) × Prior knowledge
$s(t_k)$: the rat's position at time $t_k$
Observation: the spikes at $t_k$
Prior: the position estimate carried over from the spikes up to $t_{k-1}$
Incremental Bayes estimation (Brown et al., J. Neurosci., 1998)
The rat's position can be estimated by integrating the recent place-cell activities with the position estimate carried over from the history of activities.
Prior (one-step prediction from the previous posterior):
$P\big(s(t_k) \mid \text{spikes in } [t_1, t_{k-1}]\big) = \int P\big(s(t_k) \mid s(t_{k-1})\big)\; P\big(s(t_{k-1}) \mid \text{spikes in } [t_1, t_{k-1}]\big)\, ds(t_{k-1})$
Incremental Bayes estimation from the spike train (Brown et al., J. Neurosci., 1998)
Posterior at spike time $t_k$: likelihood $P\big(\text{spikes at } t_k \mid s(t_k), t_k\big)$ × prior $P\big(s(t_k) \mid \text{spikes in } [t_1, t_{k-1}]\big)$
The observation probability $P\big(\text{spikes at } t_k \mid s(t_k), t_k\big)$ is a function of the firing rates of the cells, which depend on the rat's position and on the theta rhythm.
The firing rate of a place cell depends on
1. a position component (receptive field)
2. a theta-phase component
Inhomogeneous Poisson process for the spike train (Brown et al., J. Neurosci., 1998)
Position component (asymmetric Gaussian receptive field):
$\lambda_x\big(t \mid x(t), \theta_x\big) = \exp\!\Big(-\tfrac{1}{2}\big(x(t) - \mu\big)^{\!\top} W^{-1} \big(x(t) - \mu\big)\Big)$
Theta-phase component (cosine modulation):
$\lambda_\theta\big(t \mid \theta(t), \theta_\theta\big) = \exp\!\big(\beta \cos(\theta(t) - \theta_0)\big)$
Instantaneous firing rate:
$\lambda\big(t \mid x(t), \theta(t)\big) = \lambda_x\big(t \mid x(t), \theta_x\big)\; \lambda_\theta\big(t \mid \theta(t), \theta_\theta\big)$
The parameters were determined by maximum likelihood.
The firing rate of a place cell thus depends on
1. the preferred position (receptive field)
2. the preferred phase in the theta rhythm
Position estimation from the spike train (Brown et al., J. Neurosci., 1998)
Assumption:
The path of the rat is approximated as a zero-mean two-dimensional Gaussian random walk; its variance parameters $\sigma_{x_1}^2$ and $\sigma_{x_2}^2$ were estimated by ML.
Finally, the estimation procedure is as follows (a simplified code sketch follows after the next slide):
1. Encoding stage: estimate the place-field and theta-phase parameters ($\mu$, $W$, $\theta_0$, $\beta$) and the random-walk variances $\sigma_{x_1}^2$, $\sigma_{x_2}^2$ by maximum likelihood.
2. Decoding stage: estimate the rat's position by the incremental Bayes method at each spike event, under the Gaussian random-walk assumption.
Bayes estimation from the spike train (Brown et al., J. Neurosci., 1998)
[Figure: the real rat's trajectory compared with the EKF-style estimate and its variance, updated at each spike event]
The calculation of the posterior distribution is done at discontinuous time steps: whenever a spike occurs as a new observation.
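A highly simplified 1-D sketch of this incremental (recursive) Bayes decoder on a discrete position grid (the place fields, rates, and random-walk width below are illustrative assumptions; the original work uses a Gaussian approximation rather than a grid, and includes the theta-phase component):

```python
import numpy as np

rng = np.random.default_rng(3)

# Discrete 1-D position grid and a handful of place cells (illustrative)
grid = np.linspace(0, 1, 101)
centers = np.array([0.2, 0.4, 0.6, 0.8])   # place-field centres
width, peak_rate, dt = 0.1, 20.0, 0.01     # field width, peak rate (Hz), time step (s)

def rates(pos):
    """Place-cell firing rates at position(s) pos (Gaussian receptive fields)."""
    return peak_rate * np.exp(-0.5 * ((pos[..., None] - centers) / width) ** 2)

# Simulate a random-walk trajectory and Poisson spikes
T = 2000
x = np.clip(np.cumsum(rng.normal(0, 0.01, T)) + 0.5, 0, 1)
spikes = rng.poisson(rates(x) * dt)        # shape (T, n_cells)

# Incremental Bayes decoding on the grid
posterior = np.full(grid.size, 1.0 / grid.size)
rw_kernel = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / 0.02) ** 2)
rw_kernel /= rw_kernel.sum(axis=1, keepdims=True)
x_est = np.empty(T)
for k in range(T):
    prior = rw_kernel @ posterior                     # one-step random-walk prediction
    lam = rates(grid) * dt                            # expected counts per cell at each position
    loglik = (spikes[k] * np.log(lam + 1e-12) - lam).sum(axis=1)  # Poisson log likelihood
    posterior = prior * np.exp(loglik - loglik.max())
    posterior /= posterior.sum()
    x_est[k] = grid @ posterior                       # posterior mean position

print("mean absolute decoding error:", np.mean(np.abs(x_est - x)))
```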
Position estimation from the spike train (1) (Brown et al., J. Neurosci., 1998)
[Figure: mouse position vs. estimation]
Bayes estimation: Posterior = Prior × Likelihood
Maximum likelihood: uses only the likelihood of the observed firing pattern
Maximum correlation: uses the correlation between the observed firing pattern and the model activity
Position estimation from the spike train (2) (Brown et al., J. Neurosci., 1998)
[Figure: mouse position vs. estimation for Bayes estimation, maximum likelihood, and maximum correlation]
The ML and maximum-correlation methods ignore the history of neural activities, but the incremental Bayes method incorporates it as a prior.
3. Information Theoretic Analysisof Spike Trains
Information transmission in neural systems
Environmental stimulus X → Encoding → Spike train (neural response) Y → Decoding
• How does a spike train code information about the corresponding stimuli?
• How efficient is the information transmission?
• Which kind of coding is optimal?
Information transmission: generalized view
Shannon’s communication system (Shannon, 1948):
Information source → Message Z → Transmitter → Signal X (observable) → Channel (+ noise source) → Received signal Y (observable) → Receiver → Decoded message Z̃ → Destination
Z: symbol; X: encoded symbol (stimulus); Y: transmitted symbol (response); Z̃: decoded symbol
Neural coding is a stochastic process
Neuronal responses to a given stimulus X are not deterministic but stochastic (different spike trains Y₁|X, Y₂|X, Y₃|X, ... are observed on repeated trials), and the stimulus underlying each response is likewise only probabilistically determined.
Shannon’s Information
• The smallest unit of information is the “bit”
– 1 bit = the amount of information needed to choose between two equally likely outcomes (e.g., tossing a coin)
• Properties:
1. Information for independent events is additive over the constituent events
2. If we already know the outcome, there is no information
Shannon’s Information
Independent events:
$P(X_1, X_2, \ldots, X_N) = P(X_1)\, P(X_2) \cdots P(X_N)$
implies (Property 1):
$I\big(P(X_1, X_2, \ldots, X_N)\big) = I\big(P(X_1)\big) + I\big(P(X_2)\big) + \cdots + I\big(P(X_N)\big)$
Certain events:
$P(X_1) = 1$
implies (Property 2):
$I\big(P(X_1)\big) = 0$
These properties are satisfied by
$I\big(P(X)\big) = -\log_2 P(X)$
Eg. Tossing a coin
• Tossing a fair coin
$P(X=\text{Head}) = 0.5$, $P(X=\text{Tail}) = 0.5$
$I(X=\text{Head}) = -\log_2 0.5 = 1\ \text{bit}$, $I(X=\text{Tail}) = 1\ \text{bit}$
Eg. Tossing a coin
• Tossing a horribly skewed coin...
$P(X=\text{Head}) = 0.99$, $P(X=\text{Tail}) = 0.01$
$I(X=\text{Head}) = -\log_2 0.99 \approx 0.0145\ \text{bits}$, $I(X=\text{Tail}) = -\log_2 0.01 \approx 6.64\ \text{bits}$
Observing an ordinary event carries little information, but observing a rare event is highly informative.
Eg. Tossing 5 coins
• Case 1: five fair coins, $P(X=\text{Head}) = 0.5$
$I(X=\text{H,T,T,T,T}) = 1 \times 5 = 5\ \text{bits}$
• Case 2: five skewed coins, $P(X=\text{Head}) = 0.2$
Head Tail Tail Tail Tail
$I(X=\text{H,T,T,T,T}) = -\log_2 0.2 - 4\log_2 0.8 \approx 3.6\ \text{bits}$
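A quick check of these numbers (a minimal sketch; the sequences and probabilities are those quoted on the slides):

```python
import numpy as np

def self_information(probs):
    """Self-information, in bits, of observing a sequence of independent events."""
    return float(-np.sum(np.log2(probs)))

# One head and four tails
fair = [0.5, 0.5, 0.5, 0.5, 0.5]       # fair coins
skewed = [0.2, 0.8, 0.8, 0.8, 0.8]     # coins with P(Head) = 0.2

print(self_information(fair))    # 5.0 bits
print(self_information(skewed))  # ~3.61 bits
```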
Entropy
• Entropy is the expectation of the information over all possible observations:
on average, how much information do we get from an observation drawn from the distribution?
$H = \mathrm{E}_{P}\big[-\log_2 P(X)\big]$
Entropy can be defined for discrete and continuous variables:
discrete: $H = -\sum_X P(X) \log_2 P(X)$
continuous: $H = -\int P(X) \log_2 P(X)\, dX$
Some properties of entropy
• Entropy is a scalar property of a probability distribution
• Entropy is maximum if P(X) is constant (uniform)
– least certainty about the event
• Entropy is minimum if P(X) is a delta function
• Entropy is always non-negative (for discrete distributions)
• The higher the entropy, the more you learn (on average) by observing values of the random variable
• The higher the entropy, the less you can predict the values of the random variable
Eg. Tossing a coin
$P(X=\text{Head}) = p$, $P(X=\text{Tail}) = 1 - p$
Entropy: $H_P = -p \log_2 p - (1-p) \log_2 (1-p)$
The entropy reaches its maximum (1 bit, at $p = 0.5$) when each event occurs with equal probability, i.e., most randomly.
Eg. Entropy of Gaussian distributions
For $x \sim \mathcal{N}(\mu, \sigma^2)$:
$H = -\int \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-x^2/2\sigma^2} \log_2\!\left(\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-x^2/2\sigma^2}\right) dx = \frac{1}{2}\log_2\!\big(2\pi e \sigma^2\big)\ \text{bits}$
The entropy depends only on the standard deviation, i.e., the entropy reflects the variability of the information source.
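A short numerical check of this closed form (a sketch; the particular σ is arbitrary), estimating the differential entropy by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 1.5

# Monte Carlo estimate of H = E[-log2 p(x)] for x ~ N(0, sigma^2)
x = rng.normal(0, sigma, 1_000_000)
log2_p = -0.5 * np.log2(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2 * np.log(2))
h_mc = -log2_p.mean()

h_closed = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)
print(h_mc, h_closed)   # the two values agree closely
```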
What distribution maximizes the entropy of a random variable?
discrete: $H = -\sum_X P(X) \log_2 P(X)$, with $X \in \{1, \ldots, M\}$
→ the uniform distribution $P(X) = \frac{1}{M}$
continuous: $H = -\int P(X) \log_2 P(X)\, dX$, with fixed mean $\mathrm{E}[X] = \mu$ and variance $\mathrm{V}[X] = \sigma^2$
→ the Gaussian $P(X) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(X - \mu)^2}{2\sigma^2}\right)$
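A minimal numerical illustration of the discrete case (a sketch; the random comparison distributions are my own choice): random distributions on M outcomes never exceed the entropy of the uniform one.

```python
import numpy as np

rng = np.random.default_rng(5)
M = 8

def entropy(p):
    """Entropy in bits of a discrete distribution p."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

uniform = np.full(M, 1.0 / M)
random_dists = rng.dirichlet(np.ones(M), size=1000)   # random distributions on M outcomes

print("uniform entropy      :", entropy(uniform))                       # log2(M) = 3 bits
print("max over random dists:", max(entropy(p) for p in random_dists))  # < 3 bits
```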
Entropy of Spike Trains
• A spike train can be transformed into a binary vector (a binary word of 0s and 1s) by discretizing time into small bins.
• Computing the entropy of the possible spike trains tells us how informative such spike trains can be.
MacKay and McCulloch (1952):
$\Delta t$: time resolution (bin width); $T$: duration of the time window
How many different binary words may occur over all the bins?
Entropy of spike trains — Brillouin (1962)
Number of bins: $N = T/\Delta t$
$p = r\,\Delta t$: firing probability per bin, where $r$ is the mean spike rate
$N_1 = pN$: number of 1's; $N_0 = (1-p)N$: number of 0's
Number of possible words with that spike count:
$N_{\mathrm{total}} = \frac{N!}{N_1!\, N_0!}$
Entropy, using the Stirling approximation $\ln x! \approx x(\ln x - 1)$:
$H = \log_2 N_{\mathrm{total}} = \frac{1}{\ln 2}\big(\ln N! - \ln N_1! - \ln N_0!\big) \approx -\frac{T}{\Delta t}\,\frac{1}{\ln 2}\Big[r\Delta t \ln(r\Delta t) + (1 - r\Delta t)\ln(1 - r\Delta t)\Big]$
(assuming $T$ is much larger than $\Delta t$)
The entropy is linear in the length of the time window, $T$.
Entropy rate
[Plot: entropy rate as a function of the bin width $\Delta t$ (ms); e.g., $r \approx 50$ spikes/s]
Entropy rate: information in units of bits per second.
• If the chance of a spike in a bin is small (low rate, or high sampling rate, $r\Delta t \ll 1$), then using $\ln(1 - r\Delta t) \approx -r\Delta t$ we can approximate the entropy rate as
$\frac{H_{\mathrm{total}}}{T} \approx r \log_2\!\frac{e}{r\,\Delta t}$
Entropy rate of a temporal (timing) code: e.g., with $r \approx 50$ spikes/s this gives about $5.76$ bits per spike ($\approx 288$ bit/s), which corresponds to $\Delta t = 1$ ms.
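A small check of this approximation against the exact per-bin entropy (a sketch; Δt = 1 ms is inferred from the 5.76 bits/spike figure quoted on the slide):

```python
import numpy as np

r = 50.0        # mean spike rate (spikes/s)
dt = 0.001      # bin width (s), assumed to be 1 ms
p = r * dt      # firing probability per bin

# Exact entropy per bin (binary entropy), converted to a rate in bits/s
h_bin = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
rate_exact = h_bin / dt

# Low-rate approximation: r * log2(e / (r * dt))
rate_approx = r * np.log2(np.e / p)

print(rate_exact, rate_approx)             # ~286 vs ~288 bit/s
print("bits per spike:", rate_approx / r)  # ~5.76
```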
Entropy of the spike count distribution (rate code)
$H(\text{spike count}) = -\sum_n p(n) \log_2 p(n)$
$p(n)$: probability of observing $n$ spikes in a time window $T$
What should we choose for $p(n)$? We know only two constraints on $p(n)$:
1. the probability distribution must be normalized: $\sum_n p(n) = 1$
2. the average spike count in $T$ should be $\langle n \rangle = rT$
We cannot determine $p(n)$ uniquely, but we can obtain the $p(n)$ that maximizes the entropy.
Entropy of the spike count distribution
• The entropy of the spike count is maximized by an exponential (geometric) distribution:
$p(n) \propto \exp(-\lambda n)$
The entropy is then:
$H = \log_2(1 + rT) + rT \log_2\!\Big(1 + \frac{1}{rT}\Big)$
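A quick numerical check (a sketch; the mean count rT = 4 is an arbitrary example): the geometric (exponential-form) distribution with that mean attains the entropy given by the formula, and it exceeds, for instance, the Poisson distribution with the same mean.

```python
import numpy as np
from math import lgamma

mean_count = 4.0                     # rT, arbitrary example
n = np.arange(0, 200)

# Geometric (exponential-form) distribution with the required mean:
# p(n) = (1 - q) q^n has mean q / (1 - q) = rT
q = mean_count / (1 + mean_count)
p_geom = (1 - q) * q**n

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Closed-form entropy from the slide
h_formula = np.log2(1 + mean_count) + mean_count * np.log2(1 + 1 / mean_count)

# Poisson distribution with the same mean, for comparison
log_p_pois = n * np.log(mean_count) - mean_count - np.array([lgamma(k + 1) for k in n])
p_pois = np.exp(log_p_pois)

print(entropy(p_geom), h_formula)   # agree (~3.61 bits)
print(entropy(p_pois))              # smaller: below the maximum-entropy value
```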
Conditional entropy and mutual information
• The entropy $H_r$ represents the uncertainty about the response in the absence of any other information.
• The conditional entropy $H_{r|s}$ represents the remaining uncertainty about the response for a fixed (known) stimulus s.
• $I(r,s)$ is the mutual information between s and r, representing the reduction of uncertainty in r achieved by measuring s:
$I(r, s) = H_r - H_{r|s} = I(s, r) \ge 0$
• If r and s are statistically independent, then $I(r, s) = 0$.
Mutual information
The conditional entropy appearing in $I(r,s) = H_r - H_{r|s}$:
$H_{r|s} = \mathrm{E}\big[-\log_2 p(r \mid s)\big]$
discrete: $H_{r|s} = -\sum_r p(r \mid s) \log_2 p(r \mid s)$
continuous: $H_{r|s} = -\int dr\, p(r \mid s) \log_2 p(r \mid s)$
Reproducibility and variability in neural spike trains (van Steveninck et al., Science, 1997)
Calliphora erythrocephala, movement-sensitive neuron (H1)
Dynamic stimuli (natural condition): random walk with diffusion constant 14 degrees²/s
→ ordered firing patterns (high reproducibility), low spike count variance
Static stimuli (artificial condition)
→ irregular firing patterns (low reproducibility, Poisson-like patterns)
Reproducibility and variability in neural spike trains (van Steveninck et al., Science, 1997)
Calliphora erythrocephala, movement-sensitive neuron (H1)
Static stimuli (artificial condition): high spike count variance (Poisson-like mean-variance relationship)
Dynamic stimuli (natural condition, random walk with diffusion constant 14 degrees²/s): low spike count variance
[Plot: spike count variance vs. mean count]
Does more precise spike timing convey more information about the input stimuli?
Quantifying information transfer (van Steveninck et al., Science, 1997)
[Figure: raster of ~100 responses to repeated presentations of the dynamic stimulus]
1. At each time t, divide the spike trains into 10 contiguous 3-ms bins (a 30-ms window) and construct the local word frequency $P(W \mid t)$.
2. Stepping in 3-ms bins, words are sampled (900 trials, 10 s), giving ~$3 \times 10^6$ word samples and a distribution over ~1500 distinct words:
$P(W) = \big\langle P(W \mid t) \big\rangle_t$
Quantifying information transfer (van Steveninck et al., Science, 1997)
Total entropy of the spike-train words:
$H_{\mathrm{total}} = -\sum_W P(W) \log_2 P(W)\ \text{bits}$
Noise entropy (conditional entropy of the neuronal response given the stimulus, i.e., given the time t in the stimulus sequence):
$H_{\mathrm{noise}} = \Big\langle -\sum_W P(W \mid t) \log_2 P(W \mid t) \Big\rangle_t\ \text{bits}$
Transmitted information (mutual information between W and t):
$I = H_{\mathrm{total}} - H_{\mathrm{noise}}\ \text{bits}$
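A compact sketch of this "direct method" computation on simulated data (the responses below are synthetic, and entropy estimates from limited samples are biased; only the word-entropy bookkeeping follows the slides):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(6)

def entropy_bits(counts):
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-np.sum(p * np.log2(p)))

n_trials, n_times, word_len = 900, 300, 10
# Synthetic binary words: spike probability per 3-ms bin varies with time t (stimulus-locked)
p_spike = 0.1 + 0.3 * (0.5 + 0.5 * np.sin(np.linspace(0, 20, n_times)))[:, None, None]
words = (rng.random((n_times, n_trials, word_len)) < p_spike).astype(int)

# Total entropy from the pooled word distribution P(W)
word_strings = ["".join(map(str, w)) for t in range(n_times) for w in words[t]]
H_total = entropy_bits(Counter(word_strings))

# Noise entropy: entropy of P(W | t), averaged over t
H_noise = np.mean([
    entropy_bits(Counter("".join(map(str, w)) for w in words[t]))
    for t in range(n_times)
])

print("H_total:", H_total, "H_noise:", H_noise, "I:", H_total - H_noise)
```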
H1’s responses to dynamic stimuli
$H_{\mathrm{total}} = 5.05$ bits, $H_{\mathrm{noise}} = 2.62$ bits, $I = 2.43$ bits
Comparison with simulated spike trains
H1’s responses to dynamic stimuli: $H_{\mathrm{total}} = 5.05$ bits, $H_{\mathrm{noise}} = 2.62$ bits, $I = 2.43$ bits
Simulated responses: $H_{\mathrm{total}} = 5.17$ bits, $H_{\mathrm{noise}} = 4.22$ bits, $I = 0.95$ bits
The spike trains are simulated by a modulated Poisson process that
1. has the correct dynamics of the firing rate of the responses to dynamic stimuli,
2. but follows the mean-variance relation of static stimuli (mean = variance).
The real H1 responses carry more than twice as much information as the Poisson simulation.
Models that accurately account for the H1 neural response to static stimuli can significantly underestimate the signal transfer under more natural conditions.
Summary
• Statistical inference
– Maximum likelihood inference
– Bayesian inference
– Bayesian approaches to neural decoding problems
• Information theory
– Information amount and entropy
– Information theoretic approaches to a neural encoding system