Introduction to Bayesian statistics
Yves Moreau
Overview
The Cox-Jaynes axioms
Bayes’ rule
Probabilistic models
Maximum likelihood
Maximum a posteriori
Bayesian inference
Multinomial and Dirichlet distributions
Estimation of frequency matrices
Pseudocounts
Dirichlet mixture
The Cox-Jaynes axioms and Bayes’ rule
Probability vs. belief
What is a probability?
Frequentist point of view
Probabilities are limiting frequency counts (coin, die) and histograms (height of people)
Such definitions are somewhat circular because of the dependency on the Central Limit Theorem
Measure theory point of view
Probabilities satisfy Kolmogorov’s σ-algebra axioms
Rigorous definition that fits well within measure and integration theory
But the definition is ad hoc, made to fit within this framework
Bayesian point of view
Probabilities are models of the uncertainty regarding propositions within a given domain
Induction vs. deduction
Deduction: IF ( A ⇒ B AND A = TRUE ) THEN B = TRUE
Induction: IF ( A ⇒ B AND B = TRUE ) THEN A becomes more plausible
Probabilities satisfy Bayes’ rule
The Cox-Jaynes axioms
The Cox-Jaynes axioms allow us to build up a broad probabilistic framework from minimal assumptions
First, some concepts
A is a proposition (A is TRUE or FALSE)
D is a domain: the information available about the current situation
BELIEF: (A = TRUE | D) is the belief we hold regarding the proposition A given the domain knowledge D
Second, some assumptions
1. Suppose we can compare beliefs: (A|D) > (B|D) means A is more plausible than B given D, and suppose the comparison is transitive:
IF (A|D) > (B|D) AND (B|D) > (C|D) THEN (A|D) > (C|D)
We then have an ordering relation, so a belief can be represented by a real number
2. Suppose there exists a fixed relation between the belief in a proposition and the belief in the negation of this proposition:
$(\bar{A}|D) = f((A|D))$
3. Suppose there exists a fixed relation between, on the one hand, the belief in the conjunction of two propositions and, on the other hand, the belief in the first proposition and the belief in the second proposition given the first:
$(A,B|D) = g((A|D), (B|A,D))$
Bayes’ rule
THEN it can be shown (after rescaling of the beliefs) that
$P(A|D) + P(\bar{A}|D) = 1$
$P(A,B|D) = P(B|A,D)\,P(A|D)$
Bayes’ rule
If we accept the Cox-Jaynes axioms, we can always apply Bayes’ rule, independently of the specific definition of the probabilities:
$P(A|B,D) = \dfrac{P(B|A,D)\,P(A|D)}{P(B|D)}$
Bayes’ rule
Bayes’ rule will be our main tool for building probabilistic models and estimating them
Bayes’ rule holds not only for statements (TRUE/FALSE) but also for any random variables (discrete or continuous)
Bayes’ rule holds for specific realizations of the random variables as well as for whole distributions:
$p(Y=y \mid X=x, D) = \dfrac{p(X=x \mid Y=y, D)\,p(Y=y \mid D)}{p(X=x \mid D)}$
$p(Y \mid X, D) = \dfrac{p(X \mid Y, D)\,p(Y \mid D)}{p(X \mid D)}$
Importance of the domain D
The domain D is a flexible concept that encapsulates the background information that is relevant for the problem
It is important to set up the problem within the right domain
Example
Diagnosis of Tay-Sachs disease
A rare disease that appears more frequently among Ashkenazi Jews
With the same symptoms, the probability of the disease will be smaller if we are in a hospital in Brussels than if we are in Mount Sinai Hospital in New York:
$P(D \mid S, \mathcal{D}_{BE}) \ne P(D \mid S, \mathcal{D}_{NY})$
If we try to build a single model for all the patients in the world, this model will not be more effective; what helps is conditioning on the relevant background information:
$P(D \mid S, \text{Ashkenazi}, \mathcal{D}_{NY}) \approx P(D \mid S, \text{Ashkenazi}, \mathcal{D}_{World})$
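As a minimal sketch of this effect (all numbers invented for illustration), Bayes’ rule with the same symptom likelihoods but different domain-dependent priors gives very different posteriors:

```python
# Hypothetical numbers, for illustration only.
def posterior(prior_disease, p_sympt_given_disease, p_sympt_given_healthy):
    """Bayes' rule: P(D|S) = P(S|D) P(D) / P(S)."""
    p_sympt = (p_sympt_given_disease * prior_disease
               + p_sympt_given_healthy * (1.0 - prior_disease))
    return p_sympt_given_disease * prior_disease / p_sympt

# Same symptom likelihoods, different domain-dependent priors:
print(posterior(1e-6, 0.9, 0.01))  # Brussels: ~9e-5
print(posterior(1e-4, 0.9, 0.01))  # Mount Sinai: ~9e-3, roughly 100x larger
```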
Probabilistic models and inference
Probabilistic models
We have a domain D
We have observations D
We have a model M with parameters θ
Example 1
Domain D: the genome of a given organism
Data D: a DNA sequence S = ’ACCTGATCACCCT’
Model M: the sequences are generated by a discrete distribution over the alphabet {A, C, G, T}
Parameters θ: $\theta = (\theta_A, \theta_C, \theta_G, \theta_T)$ with $\theta_A + \theta_C + \theta_G + \theta_T = 1$
Example 2
Domain D: all European people
Data D: the heights of people from a given group
Model M: the height is normally distributed, N(m, σ)
Parameters θ: the mean m and the standard deviation σ
Generative models
It is often possible to set up a model of the likelihood of the data; for example, for the DNA sequence:
$P(S \mid \theta, M) = \prod_{i=1}^{L} \theta_{S_i}$
More sophisticated models are possible: HMMs, Gibbs sampling for motif finding, Bayesian networks
We want to find the model that best describes our observations
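A minimal sketch of this likelihood for the i.i.d. model above (computed in log space, as the slides suggest later):

```python
import math

def sequence_log_likelihood(seq, theta):
    """log P(S | theta, M) = sum over positions i of log theta_{S_i}."""
    return sum(math.log(theta[c]) for c in seq)

theta = {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}  # must sum to 1
print(sequence_log_likelihood('ACCTGATCACCCT', theta))
```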
Maximum likelihood
Maximum likelihood (ML)
Consistent: if the observations were generated by the model M with parameters θ*, then θ_ML will converge to θ* when the number of observations goes to infinity
Note that the data might not be generated by any instance of the model
If the data set is small, there might be a large difference between θ_ML and θ*
$\theta_{ML} = \arg\max_\theta P(D \mid \theta, M)$
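For the DNA model above, the ML estimate has the closed form $\theta_i = n_i/N$ (this is derived in the frequency-matrix section below); a quick sketch:

```python
from collections import Counter

def ml_estimate(seq, alphabet='ACGT'):
    """Maximum likelihood estimate theta_i = n_i / N (counting)."""
    counts = Counter(seq)
    return {a: counts[a] / len(seq) for a in alphabet}

print(ml_estimate('ACCTGATCACCCT'))
# {'A': 0.2307..., 'C': 0.4615..., 'G': 0.0769..., 'T': 0.2307...}
```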
Maximum a posteriori probability
Maximum a posteriori probability (MAP)
$\theta_{MAP} = \arg\max_\theta P(\theta \mid D, M)$
Bayes’ rule:
$P(\theta \mid D, M) = P(D \mid \theta, M)\,P(\theta \mid M)\,/\,P(D \mid M)$
(posterior = likelihood of the data × prior, up to the factor $P(D \mid M)$)
Thus
$\theta_{MAP} = \arg\max_\theta P(D \mid \theta, M)\,P(\theta \mid M)$
where $P(\theta \mid M)$ encodes the a priori knowledge and $P(D \mid M)$ plays no role in the optimization over θ
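A sketch contrasting ML and MAP on a two-outcome (Bernoulli) model with a Beta prior, the K = 2 case of the Dirichlet prior used later in this section; the closed-form posterior mode is a standard result, not specific to these slides:

```python
# Coin with unknown probability theta of heads; Beta(a, b) prior,
# i.e. a Dirichlet with K = 2. Numbers invented for illustration.
heads, tails = 2, 0        # a tiny data set
a, b = 5.0, 5.0            # prior pseudo-observations

theta_ml = heads / (heads + tails)                         # argmax P(D|theta)
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)  # posterior mode

print(theta_ml)   # 1.0 -- overconfident after only 2 observations
print(theta_map)  # 0.6 -- pulled toward the prior mean of 0.5
```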
Posterior mean estimate
Posterior mean estimate
$\theta_{PME} = \int \theta\, P(\theta \mid D, M)\, d\theta$
Distributions over parameters
Let us look carefully at P(θ|M) (or at P(θ|D,M))
P(θ|M) is a probability distribution over the PARAMETERS
We have to handle distributions over observations and distributions over parameters at the same time
Example
Distribution of the heights of people: P(D|θ,M)
Prior: P(θ|M)
[Figures: the data distribution p(L) = N(m, σ) over height (150–200 cm); the prior p(m) over the mean height (150–200 cm); the prior p(σ) over the standard deviation of the height (5–15 cm)]
Bayesian inference
If we want to update the probability of the parameters with new observations D
1. Choose a reasonable prior
2. Add the information from the data
3. Get the updated distributions of the parameters
(We often work with logarithms)
$P(\theta \mid D, M) = \dfrac{P(D \mid \theta, M)\,P(\theta \mid M)}{P(D \mid M)} = \dfrac{P(D \mid \theta, M)\,P(\theta \mid M)}{\int P(D \mid \theta, M)\,P(\theta \mid M)\,d\theta}$
Bayesian inference
Example
[Figures: the prior p(m|M) over the mean height; the posterior p(m|B,M) after observing 100 Belgian men; the posterior p(m|H,M) after additionally observing 100 Dutchmen (horizontal axis: mean height, 150–200 cm)]
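A sketch of this sequential update for the mean of a normal model with known standard deviation, using the standard conjugate normal prior (all numbers invented for illustration):

```python
import numpy as np

def update_normal_mean(mu0, tau0, data, sigma):
    """Conjugate update of an N(mu0, tau0^2) prior on the mean m,
    for observations with known standard deviation sigma."""
    precision = 1.0 / tau0**2 + len(data) / sigma**2
    mu = (mu0 / tau0**2 + np.sum(data) / sigma**2) / precision
    return mu, 1.0 / np.sqrt(precision)

rng = np.random.default_rng(0)
sigma = 7.0                    # assumed known spread of heights (cm)
mu, tau = 175.0, 10.0          # broad prior p(m|M)
mu, tau = update_normal_mean(mu, tau, rng.normal(178, sigma, 100), sigma)
print(mu, tau)                 # p(m|B,M): much narrower, near 178
mu, tau = update_normal_mean(mu, tau, rng.normal(182, sigma, 100), sigma)
print(mu, tau)                 # p(m|H,M): shifted toward the Dutch data
```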
Marginalization
A major technique for working with probabilistic models is to introduce or remove a variable through marginalization wherever appropriate
If a variable Y can take only K mutually exclusive outcomes, we have
$\sum_{k=1}^{K} P(Y=k) = 1 \qquad \sum_{k=1}^{K} P(X, Y=k) = P(X)$
If the variable is continuous
$\int_Y P(X, Y=y)\,dy = P(X)$
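A minimal sketch with a discrete joint distribution stored as a numpy table:

```python
import numpy as np

# Joint distribution P(X, Y) as a table: rows index X, columns index Y.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.15, 0.05, 0.40]])

p_x = joint.sum(axis=1)     # P(X) = sum_k P(X, Y=k): marginalize out Y
p_y = joint.sum(axis=0)     # P(Y): marginalize out X
print(p_x, p_y, p_x.sum())  # [0.4 0.6] [0.25 0.25 0.5] 1.0
```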
Multinomial and Dirichlet distributions
Multinomial distribution
Discrete distribution with K mutually exclusive outcomes with probabilities $\theta_i$:
$P(X = i) = \theta_i, \quad i = 1, \ldots, K$
with $\theta = (\theta_1, \ldots, \theta_K) \in \Theta = \{\theta \mid 0 \le \theta_i \le 1 \text{ and } \textstyle\sum_i \theta_i = 1\}$
Example: die (K = 6), DNA sequence (K = 4), amino acid sequence (K = 20)
For K = 2 we have a Bernoulli variable (giving rise to a binomial distribution)
The multinomial distribution gives the probability of the number of times that the different outcomes were observed
The multinomial distribution is the natural distribution for the modeling of biological sequences
$P(N_1 = n_1, N_2 = n_2, \ldots, N_K = n_K) = \mathcal{M}(n;\theta) = \dfrac{1}{M((n_1,\ldots,n_K))} \prod_{i=1}^{K} \theta_i^{n_i}$
with normalization factor $M((n_1,\ldots,n_K)) = \dfrac{\prod_{i=1}^{K} n_i!}{\left(\sum_{k=1}^{K} n_k\right)!}$
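A sketch evaluating this distribution for nucleotide counts with scipy.stats.multinomial, which implements the same pmf:

```python
from scipy.stats import multinomial

theta = [0.3, 0.2, 0.2, 0.3]   # probabilities for A, C, G, T
counts = [3, 6, 1, 3]          # observed n_A, n_C, n_G, n_T (N = 13)
print(multinomial.pmf(counts, n=sum(counts), p=theta))
```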
Dirichlet distribution
Distribution over the region Θ of the parameter space where $0 \le \theta_i \le 1,\ i = 1,\ldots,K$ and $\sum_i \theta_i = 1$
The distribution has parameters $\alpha = (\alpha_1, \ldots, \alpha_K)$ with $\alpha_i > 0$
The Dirichlet distribution gives the probability of θ
The distribution is like a ‘dice factory’
$\mathcal{D}(\theta;\alpha) = \dfrac{1}{Z(\alpha)} \prod_{i=1}^{K} \theta_i^{(\alpha_i - 1)}$
with $Z(\alpha) = \int \prod_{i=1}^{K} \theta_i^{(\alpha_i - 1)}\,d\theta = \dfrac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)}$
Dirichlet distribution
Z(α) is a normalization factor such that $\int P(\theta \mid \alpha)\,d\theta = 1$
Γ is the gamma function, the generalization of the factorial to real numbers:
$\Gamma(n) = (n-1)!, \qquad \Gamma(x+1) = x\,\Gamma(x)$
The Dirichlet distribution is the natural prior for sequence analysis because it is conjugate to the multinomial distribution: if we have a Dirichlet prior and update it with multinomial observations, the posterior also has the form of a Dirichlet distribution
This is computationally very attractive
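The ‘dice factory’ picture, sketched with numpy: every sample from a Dirichlet with K = 6 is itself a die, i.e. a probability vector over six outcomes:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = [10, 10, 10, 10, 10, 10]     # Dirichlet parameters, K = 6
dice = rng.dirichlet(alpha, size=3)  # three dice from the 'factory'
print(dice)                          # each row is one die
print(dice.sum(axis=1))              # [1. 1. 1.]: each die is normalized
# Larger alpha_i concentrate the dice near the fair die (1/6, ..., 1/6).
```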
Estimation of frequency matrices
Estimation on the basis of counts (e.g., the Position-Specific Scoring Matrix in PSI-BLAST)
Example: matrix model of a local motif
GACGTGCTCGAG
CGCGTGAACGTG
CACGTG
...
[Count matrix: one row per symbol (A, C, G, T), one column per alignment position]
Count the number of instances in each column
If there are many aligned sites (N >> 1), we can estimate the frequencies as
$\theta_A = n_A/N,\ \theta_C = n_C/N,\ \theta_G = n_G/N,\ \theta_T = n_T/N$
This is the maximum likelihood estimate for θ:
$P(N_A = n_A, N_C = n_C, N_G = n_G, N_T = n_T \mid \theta) = P(n \mid \theta)$
$\theta^{ML} = \arg\max_\theta P(n \mid \theta), \qquad \theta_i^{ML} = n_i / N$
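A sketch of this count-based estimate on a toy alignment (equal-length sites, invented for illustration):

```python
import numpy as np

sites = ['GACGTG', 'CGCGTG', 'CACGTG']   # toy aligned motif instances
alphabet = 'ACGT'

def frequency_matrix(sites):
    """ML estimate theta_i = n_i / N for every column of the alignment."""
    counts = np.zeros((len(alphabet), len(sites[0])))
    for site in sites:
        for j, c in enumerate(site):
            counts[alphabet.index(c), j] += 1
    return counts / len(sites)

print(frequency_matrix(sites))   # rows A, C, G, T; one column per position
```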
Proof
We want to show that
$P(n \mid \theta) \le P(n \mid \theta^{ML}) \text{ for all } \theta$
This is equivalent to
$\log\left( P(n \mid \theta^{ML}) \,/\, P(n \mid \theta) \right) \ge 0$
Further
$\log \dfrac{P(n \mid \theta^{ML})}{P(n \mid \theta)} = \log \prod_i \left( \dfrac{\theta_i^{ML}}{\theta_i} \right)^{n_i} \quad \text{(definition of the multinomial)}$
$= N \sum_i \theta_i^{ML} \log \dfrac{\theta_i^{ML}}{\theta_i} \ge 0 \quad \text{(property of relative entropy)}$
using $n_i = N\,\theta_i^{ML}$
Pseudocounts
If we have a limited number of counts, the maximum likelihood estimate will not be reliable (e.g., for symbols not observed in the data)
In such a situation, we can combine the observations with prior knowledge
Suppose we use a Dirichlet prior $\mathcal{D}(\theta;\alpha)$
Let us compute the Bayesian update:
$P(\theta \mid n) = \dfrac{P(n \mid \theta)\,\mathcal{D}(\theta;\alpha)}{P(n)}$
$P(\theta \mid n) = \dfrac{1}{P(n)\,Z(\alpha)\,M(n)} \prod_{i=1}^{K} \theta_i^{(n_i + \alpha_i - 1)} = \dfrac{Z(n+\alpha)}{P(n)\,Z(\alpha)\,M(n)}\,\mathcal{D}(\theta;\,n+\alpha)$
The factor $\dfrac{Z(n+\alpha)}{P(n)\,Z(\alpha)\,M(n)} = 1$ because both distributions are normalized, so the Bayesian update is
$P(\theta \mid n) = \mathcal{D}(\theta;\,n+\alpha)$
Computation of the posterior mean estimate, with normalization integral Z(·) and $\delta_i = (0,\ldots,0,1,0,\ldots,0)$ (a 1 in position i):
$\theta_i^{PME} = \int \theta_i\,\mathcal{D}(\theta;\,n+\alpha)\,d\theta = \dfrac{Z(n+\alpha+\delta_i)}{Z(n+\alpha)} = \dfrac{n_i + \alpha_i}{N + A}$
Pseudocounts
Pseudocounts
The prior contributes to the estimation through pseudo-observations
If few observations are available, then the prior plays an important role
If many observations are available, then the pseudocounts play a negligible role
$\theta_i^{PME} = \dfrac{n_i + \alpha_i}{N + A} \quad \text{with } N = \sum_i n_i,\ A = \sum_i \alpha_i$
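A sketch of the pseudocount estimate next to the ML estimate for one column of counts (α = 1 for every symbol is a hypothetical choice, not prescribed by the slides):

```python
import numpy as np

counts = np.array([1, 8, 0, 1])          # n_A, n_C, n_G, n_T in one column
alpha = np.array([1.0, 1.0, 1.0, 1.0])   # pseudocounts from the prior

theta_ml = counts / counts.sum()         # 'G' gets probability exactly 0
theta_pme = (counts + alpha) / (counts.sum() + alpha.sum())
print(theta_ml)    # [0.1 0.8 0.  0.1]
print(theta_pme)   # [0.143 0.643 0.071 0.143] -- no zero probabilities
```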
Dirichlet mixture
Sometimes the observations are generated by a heterogeneous process (e.g., hydrophobic vs. hydrophilic domains in proteins)
In such situations, we should use different priors depending on the context
But we do not necessarily know the context beforehand
A possibility is the use of a Dirichlet mixture: the frequency parameter θ can be generated from m different sources S with different Dirichlet parameters $\alpha^k$
$P(\theta) = \sum_k P(S=k)\,\mathcal{D}(\theta;\alpha^k) = \sum_k q_k\,\mathcal{D}(\theta;\alpha^k)$
Dirichlet mixture
Posterior
$P(\theta \mid n) = \sum_k P(S=k \mid n)\,P(\theta \mid S=k, n) \quad \text{(disjunction)}$
$P(\theta \mid S=k, n) = \mathcal{D}(\theta;\,n+\alpha^k) \quad \text{(pseudocounts)}$
Via Bayes’ rule
$P(S=k \mid n) = \dfrac{P(n \mid S=k)\,P(S=k)}{\sum_l P(n \mid S=l)\,P(S=l)} = \dfrac{q_k\,P(n \mid S=k)}{\sum_l q_l\,P(n \mid S=l)} = \dfrac{q_k\,Z(n+\alpha^k)/Z(\alpha^k)}{\sum_l q_l\,Z(n+\alpha^l)/Z(\alpha^l)}$
Dirichlet mixture
Posterior mean estimate
The different components of the Dirichlet mixture are first considered as separate pseudocounts
These components are then combined with a weight depending on the likelihood of the Dirichlet component
$\theta_i^{PME} = \sum_k P(S=k \mid n)\,\dfrac{n_i + \alpha_i^k}{N + A^k}$
where $(n_i + \alpha_i^k)/(N + A^k)$ is the pseudocount estimate for component k and the weights are
$P(S=k \mid n) = \dfrac{q_k\,Z(n+\alpha^k)/Z(\alpha^k)}{\sum_l q_l\,Z(n+\alpha^l)/Z(\alpha^l)}$
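A sketch of this two-step computation, with the Z(·) ratios evaluated in log space via the gamma function (the two mixture components are invented for illustration):

```python
import numpy as np
from scipy.special import gammaln

def log_Z(alpha):
    """log Z(alpha) = sum_i log Gamma(alpha_i) - log Gamma(sum_i alpha_i)."""
    return np.sum(gammaln(alpha)) - gammaln(np.sum(alpha))

def mixture_pme(counts, q, alphas):
    """Posterior mean estimate under a Dirichlet mixture prior."""
    counts = np.asarray(counts, dtype=float)
    # log of q_k * Z(n + alpha^k) / Z(alpha^k) for each component k
    logw = np.array([np.log(qk) + log_Z(counts + ak) - log_Z(ak)
                     for qk, ak in zip(q, alphas)])
    w = np.exp(logw - logw.max())
    w /= w.sum()                          # P(S = k | n)
    # weighted combination of the per-component pseudocount estimates
    return sum(wk * (counts + ak) / (counts.sum() + ak.sum())
               for wk, ak in zip(w, alphas))

# Hypothetical two-component mixture: a uniform prior and a 'GC-rich' prior.
q = [0.5, 0.5]
alphas = [np.ones(4), np.array([0.5, 4.0, 4.0, 0.5])]
print(mixture_pme([1, 8, 0, 1], q, alphas))
```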
Summary
The Cox-Jaynes axioms
Bayes’ rule
Probabilistic models
Maximum likelihood
Maximum a posteriori
Bayesian inference
Multinomial and Dirichlet distributions
Estimation of frequency matrices
Pseudocounts
Dirichlet mixture