School of Computer Science
Hidden Markov Model and
Conditional Random Fields
Probabilistic Graphical Models (10-708)
Lecture 12, Oct 29, 2007
Eric Xing

[Figure: a cell-signaling pathway (Receptor A, Receptor B, Kinase C, Kinase D, Kinase E, TF F, Gene G, Gene H) and its representation as a graphical model over variables X1-X8]
Reading: J-Chap. 12, and additional papers
Feedbacks: today
Office hour
Recitation
Project milestone: this Wednesday
Hidden Markov Model revisited

[Figure: the HMM as a chain graphical model: hidden states y_1, y_2, y_3, ..., y_T emitting observations x_1, x_2, x_3, ..., x_T]

Transition probabilities between any two states:
$$p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}$$
or, in general:
$$p(y_t \mid y_{t-1}^i = 1) \sim \mathrm{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}),\ \forall i \in I$$

Start probabilities:
$$p(y_1) \sim \mathrm{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M)$$

Emission probabilities associated with each state:
$$p(x_t \mid y_t^i = 1) \sim \mathrm{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}),\ \forall i \in I$$
or, in general:
$$p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i),\ \forall i \in I$$
Applications of HMMs

Some early applications of HMMs:
finance (but we never saw them)
speech recognition
modelling ion channels

In the mid-late 1980s, HMMs entered genetics and molecular biology, and they are now firmly entrenched.

Some current applications of HMMs to biology:
mapping chromosomes
aligning biological sequences
predicting sequence structure
inferring evolutionary relationships
finding genes in DNA sequence
Typical structure of a gene
[Figure: gene structure model (as in GENSCAN): promoter, 5'UTR, exon states E0, E1, E2, Es, tEi, intron states I0, I1, I2, 3'UTR, and the poly-A signal on the forward (+) and reverse (-) strands, separated by intergenic regions; illustrated over an example raw genomic DNA sequence]

GENSCAN (Burge & Karlin):
Transition probabilities: $p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}$
State-specific emission models $p(\cdot \mid y, \theta_i)$, with parameters $\theta_1, \ldots, \theta_4$ for the different state types
The Forward Algorithm

We want to calculate P(x), the likelihood of x, given the HMM.

Sum over all possible ways of generating x:
$$p(\mathbf{x}) = \sum_{\mathbf{y}} p(\mathbf{x}, \mathbf{y}) = \sum_{y_1} \sum_{y_2} \cdots \sum_{y_T} \pi_{y_1} \prod_{t=2}^{T} a_{y_{t-1}, y_t} \prod_{t=1}^{T} p(x_t \mid y_t)$$

To avoid summing over an exponential number of paths y, define
$$\alpha_t^k \overset{\mathrm{def}}{=} P(x_1, \ldots, x_t, y_t^k = 1)$$
(the forward probability).

The recursion:
$$\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$$
$$P(\mathbf{x}) = \sum_k \alpha_T^k$$
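A rough sketch of this recursion in NumPy (the names `pi`, `trans`, and `emit` are hypothetical stand-ins for π, a_{i,k}, and p(x_t | y_t^k = 1); observations are integer-coded):

```python
import numpy as np

def forward(pi, trans, emit, x):
    """Forward algorithm: alpha[t, k] = P(x_1..x_t, y_t^k = 1)."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * emit[:, x[0]]          # base case: pi_k * p(x_1 | y_1^k = 1)
    for t in range(1, T):
        # alpha_t^k = p(x_t | y_t^k = 1) * sum_i alpha_{t-1}^i a_{i,k}
        alpha[t] = emit[:, x[t]] * (alpha[t - 1] @ trans)
    return alpha

# likelihood of the whole sequence: P(x) = sum_k alpha_T^k
# p_x = forward(pi, trans, emit, x)[-1].sum()
```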
The Backward Algorithm

We want to compute $P(y_t^k = 1 \mid \mathbf{x})$, the posterior probability distribution on the t-th position, given x.

We start by computing
$$P(y_t^k = 1, \mathbf{x}) = P(x_1, \ldots, x_t, y_t^k = 1, x_{t+1}, \ldots, x_T)$$
$$= P(x_1, \ldots, x_t, y_t^k = 1)\, P(x_{t+1}, \ldots, x_T \mid x_1, \ldots, x_t, y_t^k = 1)$$
$$= P(x_1, \ldots, x_t, y_t^k = 1)\, P(x_{t+1}, \ldots, x_T \mid y_t^k = 1)$$

Forward, $\alpha_t^k$; backward,
$$\beta_t^k \overset{\mathrm{def}}{=} P(x_{t+1}, \ldots, x_T \mid y_t^k = 1)$$

The recursion:
$$\beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$$

[Figure: the HMM chain split at position t: the forward pass covers x_1, ..., x_t; the backward pass covers x_{t+1}, ..., x_T]
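The mirror-image sketch for the backward recursion, under the same hypothetical naming conventions:

```python
import numpy as np

def backward(trans, emit, x):
    """Backward algorithm: beta[t, k] = P(x_{t+1}..x_T | y_t^k = 1)."""
    T, K = len(x), trans.shape[0]
    beta = np.zeros((T, K))
    beta[T - 1] = 1.0                      # base case: the empty suffix has probability 1
    for t in range(T - 2, -1, -1):
        # beta_t^k = sum_i a_{k,i} p(x_{t+1} | y_{t+1}^i = 1) beta_{t+1}^i
        beta[t] = trans @ (emit[:, x[t + 1]] * beta[t + 1])
    return beta

# posterior at position t: P(y_t^k = 1 | x) = alpha[t, k] * beta[t, k] / P(x)
```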
The Junction tree algorithm

A junction tree for the HMM:

[Figure: junction tree for the HMM chain, with cliques ψ(y_1, x_1), ψ(y_1, y_2), ψ(y_2, y_3), ..., ψ(y_{T-1}, y_T) and ψ(y_t, x_t), connected through separator potentials φ(y_1), ζ(y_2), ζ(y_3), ..., ζ(y_T)]

Rightward pass:
$$\mu_{t \to t+1}(y_{t+1}) = \sum_{y_t} \psi(y_t, y_{t+1})\, \mu_{t-1 \to t}(y_t)\, \mu_{t}^{\uparrow}(y_{t+1})$$
$$= \sum_{y_t} p(y_{t+1} \mid y_t)\, \mu_{t-1 \to t}(y_t)\, p(x_{t+1} \mid y_{t+1}) = p(x_{t+1} \mid y_{t+1}) \sum_{y_t} a_{y_t, y_{t+1}}\, \mu_{t-1 \to t}(y_t)$$
This is exactly the forward algorithm!

Leftward pass:
$$\mu_{t-1 \leftarrow t}(y_t) = \sum_{y_{t+1}} \psi(y_t, y_{t+1})\, \mu_{t \leftarrow t+1}(y_{t+1})\, \mu_{t}^{\uparrow}(y_{t+1})$$
$$= \sum_{y_{t+1}} p(y_{t+1} \mid y_t)\, \mu_{t \leftarrow t+1}(y_{t+1})\, p(x_{t+1} \mid y_{t+1})$$
This is exactly the backward algorithm!
Summary of the F-B algorithm

$$\alpha_t^k = \mu_{t-1 \to t}(k) \overset{\mathrm{def}}{=} P(x_1, \ldots, x_{t-1}, x_t, y_t^k = 1)$$
$$\beta_t^k = \mu_{t-1 \leftarrow t}(k) \overset{\mathrm{def}}{=} P(x_{t+1}, \ldots, x_T \mid y_t^k = 1)$$
$$\xi_t^{i,j} \overset{\mathrm{def}}{=} p(y_t^i = 1, y_{t+1}^j = 1 \mid \mathbf{x}_{1:T}) \propto \mu_{t-1 \to t}(y_t^i = 1)\, \mu_{t \leftarrow t+1}(y_{t+1}^j = 1)\, p(x_{t+1} \mid y_{t+1}^j = 1)\, p(y_{t+1}^j = 1 \mid y_t^i = 1)$$
$$\gamma_t^i \overset{\mathrm{def}}{=} p(y_t^i = 1 \mid \mathbf{x}_{1:T}) \propto \alpha_t^i \beta_t^i = \sum_j \xi_t^{i,j}$$

The recursions:
$$\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$$
$$\beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$$
$$\xi_t^{i,j} = \alpha_t^i\, \beta_{t+1}^j\, a_{i,j}\, p(x_{t+1} \mid y_{t+1}^j = 1)$$

The matrix-vector form: with
$$A_t(i) \overset{\mathrm{def}}{=} p(x_t \mid y_t^i = 1), \qquad B(i,j) \overset{\mathrm{def}}{=} p(y_{t+1}^j = 1 \mid y_t^i = 1),$$
$$\alpha_t = A_t \ast (B^{\mathsf{T}} \alpha_{t-1})$$
$$\beta_t = B\, (A_{t+1} \ast \beta_{t+1})$$
$$\xi_t = B \ast \big( \alpha_t (A_{t+1} \ast \beta_{t+1})^{\mathsf{T}} \big)$$
$$\gamma_t = \alpha_t \ast \beta_t$$
where $\ast$ denotes the element-wise product.
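The matrix-vector form translates almost line for line into NumPy; a sketch (with `A` holding the rows A_t and `B` the transition matrix, both hypothetical names):

```python
import numpy as np

def forward_backward(pi, B, A):
    """A: (T, K) with A[t, i] = p(x_t | y_t^i = 1); B: (K, K) transition matrix."""
    T, K = A.shape
    alpha, beta = np.zeros((T, K)), np.zeros((T, K))
    alpha[0] = A[0] * pi
    for t in range(1, T):
        alpha[t] = A[t] * (B.T @ alpha[t - 1])       # alpha_t = A_t * (B^T alpha_{t-1})
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = B @ (A[t + 1] * beta[t + 1])       # beta_t = B (A_{t+1} * beta_{t+1})
    gamma = alpha * beta                             # gamma_t = alpha_t * beta_t
    gamma /= gamma.sum(axis=1, keepdims=True)        # normalize to p(y_t = i | x)
    return alpha, beta, gamma
```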
Posterior decoding

We can now calculate
$$P(y_t^k = 1 \mid \mathbf{x}) = \frac{P(y_t^k = 1, \mathbf{x})}{P(\mathbf{x})} = \frac{\alpha_t^k \beta_t^k}{P(\mathbf{x})}$$

Then, we can ask: what is the most likely state at position t of sequence x?
$$k_t^* = \arg\max_k P(y_t^k = 1 \mid \mathbf{x})$$

Note that this is an MPA of a single hidden state; what if we want an MPA of the whole hidden state sequence?

Posterior decoding:
$$\hat{y}_t = y_t^{k_t^*},\quad t = 1, \ldots, T$$

This is different from the MPA of a whole sequence of hidden states.
This can be understood as bit error rate vs. word error rate.

Example: MPA of X? MPA of (X, Y)?

x   y   P(x, y)
0   0   0.35
0   1   0.05
1   0   0.30
1   1   0.30

Here the MPA of X alone is x = 1 (since P(x = 1) = 0.6 > P(x = 0) = 0.4), while the MPA of the pair (X, Y) is (0, 0), whose joint probability 0.35 is the largest, so the two criteria disagree.
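A quick numerical check of this example:

```python
import numpy as np

P = np.array([[0.35, 0.05],                   # rows: x = 0, 1; columns: y = 0, 1
              [0.30, 0.30]])
print(P.sum(axis=1).argmax())                 # MPA of X alone: 1, since P(x=1) = 0.6
print(np.unravel_index(P.argmax(), P.shape))  # MPA of (X, Y): (0, 0), with P = 0.35
```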
Viterbi decoding

GIVEN x = x1, …, xT, we want to find y = y1, …, yT, such that P(y|x) is maximized:
$$\mathbf{y}^* = \arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}) = \arg\max_{\mathbf{y}} P(\mathbf{y}, \mathbf{x})$$

Let
$$V_t^k = \max_{\{y_1, \ldots, y_{t-1}\}} P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t^k = 1)$$
= probability of the most likely sequence of states ending at state y_t = k.

The recursion:
$$V_t^k = p(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i$$

[Figure: the Viterbi trellis: observations x1 x2 x3 … xN across the top, states 1, 2, …, K down the side; each cell holds V_t^k]

Recall that
$$p(x_1, \ldots, x_t, y_1, \ldots, y_t) = \pi_{y_1}\, a_{y_1, y_2} \cdots a_{y_{t-1}, y_t}\, b_{y_1, x_1} \cdots b_{y_t, x_t}$$

Underflows are a significant problem: these numbers become extremely small. Solution: take the logs of all values:
$$V_t^k = \log p(x_t \mid y_t^k = 1) + \max_i \big( \log a_{i,k} + V_{t-1}^i \big)$$
The Viterbi Algorithm – derivation

Define the Viterbi probability:
$$V_{t+1}^k = \max_{\{y_1, \ldots, y_t\}} P(x_1, \ldots, x_t, y_1, \ldots, y_t, x_{t+1}, y_{t+1}^k = 1)$$
$$= \max_{\{y_1, \ldots, y_t\}} P(x_{t+1}, y_{t+1}^k = 1 \mid x_1, \ldots, x_t, y_1, \ldots, y_t)\, P(x_1, \ldots, x_t, y_1, \ldots, y_t)$$
$$= \max_{\{y_1, \ldots, y_t\}} P(x_{t+1}, y_{t+1}^k = 1 \mid y_t)\, P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t)$$
$$= \max_i P(x_{t+1}, y_{t+1}^k = 1 \mid y_t^i = 1) \max_{\{y_1, \ldots, y_{t-1}\}} P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t^i = 1)$$
$$= \max_i P(x_{t+1} \mid y_{t+1}^k = 1)\, a_{i,k}\, V_t^i$$
$$= P(x_{t+1} \mid y_{t+1}^k = 1) \max_i a_{i,k}\, V_t^i$$
Computational Complexity and implementation details

What is the running time, and space required, for Forward and Backward?
$$\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$$
$$\beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$$
$$V_t^k = p(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i$$
Time: O(K²N); Space: O(KN).

Useful implementation techniques to avoid underflows (see the rescaling sketch below):
Viterbi: sum of logs
Forward/Backward: rescaling at each position by multiplying by a constant
Learning HMM

Supervised learning: estimation when the "right answer" is known.
Examples:
GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls

Unsupervised learning: estimation when the "right answer" is unknown.
Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice

QUESTION: Update the parameters θ of the model to maximize P(x|θ) --- maximum likelihood (ML) estimation
Learning HMM: two scenarios

Supervised learning: if only we knew the true state path, then ML parameter estimation would be trivial.
E.g., recall that for a completely observed tabular BN:
$$\theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{k'} n_{ijk'}}$$
so here (see the count-based sketch after this slide):
$$a_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i\, y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i}$$
$$b_{ik}^{ML} = \frac{\#(i \to k)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=1}^{T} y_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i}$$

What if y is continuous? We can treat $\{(x_{n,t}, y_{n,t}) : t = 1{:}T,\ n = 1{:}N\}$ as N×T observations of, e.g., a GLIM, and apply learning rules for GLIM ...

Unsupervised learning: when the true state path is unknown, we can fill in the missing values using inference recursions.
The Baum-Welch algorithm (i.e., EM):
Guaranteed to increase the log likelihood of the model after each iteration
Converges to a local optimum, depending on initial conditions
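The supervised counts-and-normalize estimates in code (a sketch; `ys`/`xs` are assumed to be lists of integer-coded state paths and observation sequences):

```python
import numpy as np

def supervised_mle(ys, xs, K, M):
    """ML estimates of the K x K transition and K x M emission matrices."""
    a, b = np.zeros((K, K)), np.zeros((K, M))
    for y, x in zip(ys, xs):
        for t in range(1, len(y)):
            a[y[t - 1], y[t]] += 1             # count transitions i -> j
        for t in range(len(y)):
            b[y[t], x[t]] += 1                 # count state i emitting symbol k
    a /= a.sum(axis=1, keepdims=True)          # a_ij = #(i -> j) / #(i -> .)
    b /= b.sum(axis=1, keepdims=True)          # b_ik = #(i -> k) / #(i -> .)
    return a, b
```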
The Baum-Welch algorithm

The complete log likelihood:
$$\ell_c(\theta; \mathbf{x}, \mathbf{y}) = \log p(\mathbf{x}, \mathbf{y}) = \log \prod_n \left( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \right)$$

The expected complete log likelihood:
$$\langle \ell_c(\theta; \mathbf{x}, \mathbf{y}) \rangle = \sum_n \Big( \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid \mathbf{x}_n)} \log \pi_i \Big) + \sum_n \sum_{t=2}^{T} \Big( \langle y_{n,t-1}^i y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid \mathbf{x}_n)} \log a_{i,j} \Big) + \sum_n \sum_{t=1}^{T} \Big( x_{n,t}^k \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid \mathbf{x}_n)} \log b_{i,k} \Big)$$

EM
The E step:
$$\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid \mathbf{x}_n)$$
$$\xi_{n,t}^{i,j} = \langle y_{n,t-1}^i y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid \mathbf{x}_n)$$

The M step ("symbolically" identical to MLE; a one-iteration sketch follows this slide):
$$\pi_i^{ML} = \frac{\sum_n \gamma_{n,1}^i}{N}$$
$$a_{ij}^{ML} = \frac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}$$
$$b_{ik}^{ML} = \frac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i}$$
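One Baum-Welch iteration for a single sequence, built on the earlier forward/backward sketches (hypothetical names; a real implementation would sum the statistics over all N sequences and rescale for numerical stability):

```python
import numpy as np

def baum_welch_step(pi, trans, emit, x):
    """One EM iteration on a single observation sequence x."""
    T, K = len(x), len(pi)
    alpha = forward(pi, trans, emit, x)        # from the earlier sketches
    beta = backward(trans, emit, x)
    px = alpha[-1].sum()
    gamma = alpha * beta / px                  # E step: gamma[t, i] = p(y_t^i = 1 | x)
    xi = np.zeros((T - 1, K, K))
    for t in range(T - 1):                     # xi[t, i, j] = p(y_t^i = 1, y_{t+1}^j = 1 | x)
        xi[t] = alpha[t][:, None] * trans * (emit[:, x[t + 1]] * beta[t + 1]) / px
    # M step: "symbolically" identical to the supervised MLE
    pi_new = gamma[0]
    trans_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    emit_new = np.zeros_like(emit)
    for t in range(T):
        emit_new[:, x[t]] += gamma[t]
    emit_new /= gamma.sum(axis=0)[:, None]
    return pi_new, trans_new, emit_new
```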
Shortcomings of Hidden Markov Model
HMM models capture dependencies between each state and only its corresponding observation.
NLP example: in a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental states), but also on (non-local) features of the whole line, such as line length, indentation, amount of white space, etc.
Mismatch between learning objective function and prediction objective function
HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task, we need the conditional probability P(Y|X)
[Figure: HMM graphical model with states Y1, Y2, ..., Yn and observations X1, X2, ..., Xn]
Solution: Maximum Entropy Markov Model (MEMM)

[Figure: MEMM graphical model: states Y1, Y2, ..., Yn, each conditioned on its predecessor and on the full observation sequence X1:n]

Models dependence between each state and the full observation sequence explicitly
More expressive than HMMs
Discriminative model:
Completely ignores modeling P(X): saves modeling effort
Learning objective function consistent with predictive function: P(Y|X)
MEMM: the Label bias problem
[Figure: a lattice of five states over four observations with locally normalized transition probabilities on the arcs; e.g., from state 1 the probabilities of staying vs. moving to state 2 are 0.4/0.6, 0.45/0.55, and 0.5/0.5 at successive observations, while state 2 spreads its probability mass over five successors (e.g., 0.2 each at the first observation)]
What the local transition probabilities say:
• State 1 almost always prefers to go to state 2
• State 2 almost always prefers to stay in state 2
MEMM: the Label bias problem
Probability of path 1→1→1→1:
• 0.4 × 0.45 × 0.5 = 0.09
MEMM: the Label bias problem
Probability of path 2→2→2→2:
• 0.2 × 0.3 × 0.3 = 0.018
Other paths:
1→1→1→1: 0.09
MEMM: the Label bias problem
Probability of path 1→2→1→2:
• 0.6 × 0.2 × 0.5 = 0.06
Other paths:
1→1→1→1: 0.09
2→2→2→2: 0.018
MEMM: the Label bias problem
Probability of path 1→1→2→2:
• 0.4 × 0.55 × 0.3 = 0.066
Other paths:
1→1→1→1: 0.09
2→2→2→2: 0.018
1→2→1→2: 0.06
MEMM: the Label bias problem
Most likely path: 1→1→1→1
• Although locally it seems that state 1 wants to go to state 2 and state 2 wants to remain in state 2.
• Why?
MEMM: the Label bias problem
Most likely path: 1→1→1→1
• State 1 has only two outgoing transitions, but state 2 has five:
• The average transition probability from state 2 is lower
MEMM: the Label bias problem
Label bias problem in MEMM:
• Preference for states with fewer outgoing transitions over others
Solution: Do not normalize probabilities locally
From local probabilities ….
Solution: Do not normalize probabilities locally
[Figure: the same lattice, now with unnormalized local potentials on the arcs (e.g., 20 and 30 out of state 1) in place of locally normalized probabilities]
From local probabilities to local potentials
• States with fewer transitions do not have an unfair advantage!
Office hour
Mid-term project milestone due today
Next week
From MEMM ….

[Figure: MEMM graphical model: states Y1, Y2, ..., Yn with the full observation X1:n]
From MEMM to CRF

[Figure: CRF as a partially directed model over Y1, Y2, ..., Yn and x1:n]

CRF is a partially directed model:
Discriminative model, like MEMM
Usage of a global normalizer Z(x) overcomes the label bias problem of MEMM
Models the dependence between each state and the entire observation sequence (like MEMM)
Conditional Random Fields

General parametric form, with weights λ on edge features f_k and µ on node features g_l:
$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x}, \lambda, \mu)} \exp\left( \sum_t \Big( \sum_k \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}) + \sum_l \mu_l g_l(y_t, \mathbf{x}) \Big) \right)$$

[Figure: chain CRF over Y1, Y2, ..., Yn conditioned on x1:n]
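A sketch of evaluating this form on a chain, assuming the feature sums have already been folded into score tables (`node_score[t, i]` for the µ·g terms and `edge_score[i, j]` for the λ·f terms; both names hypothetical):

```python
import numpy as np

def log_partition(node_score, edge_score):
    """log Z(x) via the forward recursion in log space."""
    T, K = node_score.shape
    a = node_score[0]
    for t in range(1, T):
        m = a[:, None] + edge_score                                  # m[i, j] = a_i + edge(i, j)
        mx = m.max(axis=0)
        a = node_score[t] + mx + np.log(np.exp(m - mx).sum(axis=0))  # logsumexp over i
    return a.max() + np.log(np.exp(a - a.max()).sum())

def crf_log_prob(y, node_score, edge_score):
    """log P(y | x) for a chain CRF with the given score tables."""
    score = node_score[0, y[0]]
    for t in range(1, len(y)):
        score += edge_score[y[t - 1], y[t]] + node_score[t, y[t]]
    return score - log_partition(node_score, edge_score)
```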
CRFs: Inference

Given CRF parameters λ and µ, find the y* that maximizes P(y|x):
$$\mathbf{y}^* = \arg\max_{\mathbf{y}} \exp\left( \sum_t \Big( \sum_k \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}) + \sum_l \mu_l g_l(y_t, \mathbf{x}) \Big) \right)$$
Can ignore Z(x) because it is not a function of y.

Run the max-product algorithm on the junction tree of the CRF:

[Figure: chain junction tree with cliques (Y1,Y2), (Y2,Y3), ..., (Yn-1,Yn) and separators Y2, Y3, ..., Yn-1]

Same as Viterbi decoding used in HMMs!
CRF learning

Given $\{(\mathbf{x}_d, \mathbf{y}_d)\}_{d=1}^{N}$, find $\lambda^*, \mu^*$ such that
$$\lambda^*, \mu^* = \arg\max_{\lambda, \mu} L(\lambda, \mu) = \arg\max_{\lambda, \mu} \sum_{d=1}^{N} \log P(\mathbf{y}_d \mid \mathbf{x}_d; \lambda, \mu)$$

Computing the gradient w.r.t. λ:
$$\nabla_\lambda L(\lambda, \mu) = \sum_d \sum_t \Big( \mathbf{f}(y_{d,t}, y_{d,t-1}, \mathbf{x}_d) - \mathbb{E}_{P(\mathbf{y} \mid \mathbf{x}_d)}\big[ \mathbf{f}(y_t, y_{t-1}, \mathbf{x}_d) \big] \Big)$$

Gradient of the log-partition function in an exponential family is the expectation of the sufficient statistics.
CRF learning

Computing the model expectations
$$\mathbb{E}_{P(\mathbf{y} \mid \mathbf{x}_d)}\big[ \mathbf{f}(y_t, y_{t-1}, \mathbf{x}_d) \big] = \sum_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x}_d)\, \mathbf{f}(y_t, y_{t-1}, \mathbf{x}_d)$$
requires an exponentially large number of summations: is it intractable?

Tractable! The summand depends only on $(y_{t-1}, y_t)$, so the sum collapses to the pairwise marginal:
$$\sum_{y_{t-1}, y_t} P(y_{t-1}, y_t \mid \mathbf{x}_d)\, \mathbf{f}(y_t, y_{t-1}, \mathbf{x}_d)$$
Can compute marginals using the sum-product algorithm on the chain.
Expectation of f over the corresponding marginal probability of neighboring nodes!
CRF learning

Computing marginals using junction-tree calibration:

[Figure: the chain CRF over Y1, ..., Yn given x1:n, and its junction tree with cliques (Y1,Y2), (Y2,Y3), ..., (Yn-1,Yn) and separators Y2, Y3, ..., Yn-1]

Junction tree initialization: each clique (Y_{t-1}, Y_t) is initialized with its local factor $\exp\big( \sum_k \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}) + \sum_l \mu_l g_l(y_t, \mathbf{x}) \big)$.

After calibration: each clique potential is proportional to the pairwise marginal $P(y_{t-1}, y_t \mid \mathbf{x})$.

This is also called the forward-backward algorithm.
CRF learning

Computing feature expectations using calibrated potentials: the pairwise marginals $P(y_{t-1}, y_t \mid \mathbf{x}_d)$ read off the calibrated cliques give exactly the terms needed above.

Now we know how to compute $\nabla_\lambda L(\lambda, \mu)$.

Learning can now be done using gradient ascent:
$$\lambda^{(k+1)} = \lambda^{(k)} + \eta\, \nabla_\lambda L(\lambda^{(k)}, \mu^{(k)}), \qquad \mu^{(k+1)} = \mu^{(k)} + \eta\, \nabla_\mu L(\lambda^{(k)}, \mu^{(k)})$$
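A schematic of the resulting training loop (a sketch: `empirical_counts` and `feature_expectations` are assumed callables, the latter computed from the calibrated chain marginals as above):

```python
import numpy as np

def crf_train(data, lam, empirical_counts, feature_expectations, eta=0.1, n_iters=100):
    """Gradient ascent on the conditional log-likelihood L(lambda, mu) (sketch)."""
    for _ in range(n_iters):
        grad = np.zeros_like(lam)
        for x, y in data:
            # dL/dlambda = observed feature counts - model-expected feature counts
            grad += empirical_counts(x, y) - feature_expectations(x, lam)
        lam = lam + eta * grad          # lambda^{(k+1)} = lambda^{(k)} + eta * grad
    return lam
```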
CRF learning

In practice, we use a Gaussian regularizer on the parameter vector to improve generalizability, i.e., we maximize
$$L(\lambda, \mu) - \frac{\lVert \lambda \rVert^2 + \lVert \mu \rVert^2}{2\sigma^2}$$

In practice, gradient ascent has very slow convergence. Alternatives:
Conjugate gradient method
Limited-memory quasi-Newton methods
CRFs: some empirical results

Comparison of error rates on synthetic data:

[Figure: three scatter plots comparing per-dataset error rates: CRF error vs. HMM error, MEMM error vs. HMM error, and MEMM error vs. CRF error; the data is increasingly higher order in the direction of the arrow]

CRFs achieve the lowest error rate for higher-order data.
CRFs: some empirical results

Parts-of-speech tagging:
Using the same set of features: HMM >=< CRF > MEMM
Using additional overlapping features: CRF+ > MEMM+ >> HMM
Other CRFs

So far we have discussed only 1-dimensional chain CRFs. Inference and learning: exact.

We could also have CRFs for arbitrary graph structures, e.g., grid CRFs. Inference and learning are then no longer tractable, and approximate techniques are used:
MCMC sampling
Variational inference
Loopy belief propagation

We will discuss these techniques in the future.
Summary

Conditional random fields are partially directed discriminative models.
They overcome the label bias problem of MEMMs by using a global normalizer.
Inference for 1-D chain CRFs is exact: same as max-product or Viterbi decoding.
Learning is also exact: globally optimal parameters can be learned, using the sum-product or forward-backward algorithm.
CRFs involving arbitrary graph structure are intractable in general, e.g., grid CRFs: inference and learning require approximation techniques such as MCMC sampling, variational methods, and loopy BP.