
Hidden Markov Models: Advances and Applications

Diego Milone

d.milone@ieee.org

Tópicos Selectos en Aprendizaje Maquinal

Doctorado en Ingeniería, FICH-UNL

November 14, 2016


Hidden Markov Trees as Observation Densities (HMM-HMT)


Motivation: Sequences of Trees

The main motivation for this section is to learn variable-length sequences of tree-structured data.


First: A Model for Just One Tree

Let $\mathbf{w} = [w_1, w_2, \ldots, w_N]$ be tree-structured data with $J$ levels ($N = 2^J - 1$ for binary trees). The Hidden Markov Tree (HMT) can be defined as $\theta = \langle \mathcal{U}, \mathcal{R}, \kappa, \varepsilon, \mathcal{F} \rangle$, where

i) $\mathcal{U} = \{u \in [1 \ldots N]\}$ is the set of nodes in the tree;

ii) $\mathcal{R} = \{R \in [1 \ldots NM]\}$ is the set of states in all the nodes, and $\mathcal{R}_u = \{R_u \in [1 \ldots M]\}$ is the set of states in node $u$;

iii) $\varepsilon = [\varepsilon_{u,mn} = \Pr(R_u = m \mid R_{\rho(u)} = n)]$ is the conditional probability of node $u$ being in state $m$ given that its parent node $\rho(u)$ is in state $n$;

iv) $\kappa = [\kappa_p = \Pr(R_1 = p)]$, $\forall p \in \mathcal{R}_1$, are the probabilities of the root node being in state $p$;

v) $\mathcal{F} = \{f_{u,m}(w_u) = \Pr(W_u = w_u \mid R_u = m)\}$ are the observation probability distributions.
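To make the definition concrete, here is a minimal sketch of the parameter set $\theta$ and of ancestral sampling from it. It assumes a binary tree with nodes indexed $1 \ldots N$ and parent $\rho(u) = \lfloor u/2 \rfloor$, scalar Gaussian emissions, and randomly initialized parameters; names such as `HMTParams` and `sample_tree` are illustrative, not from the course material.

```python
import numpy as np

rng = np.random.default_rng(0)

class HMTParams:
    """Minimal container for an HMT theta = <U, R, kappa, eps, F>.

    Hypothetical layout: binary tree with N = 2**J - 1 nodes indexed
    1..N, parent rho(u) = u // 2, M hidden states per node, and scalar
    Gaussian observation densities f_{u,m}.
    """
    def __init__(self, J, M):
        self.N = 2**J - 1                    # number of nodes in U
        self.M = M                           # states per node
        self.kappa = np.full(M, 1.0 / M)     # root state prior kappa_p
        # eps[u, m, n] = Pr(R_u = m | R_rho(u) = n), normalized over m
        e = rng.random((self.N + 1, M, M))
        self.eps = e / e.sum(axis=1, keepdims=True)
        self.mu = rng.normal(0, 1, (self.N + 1, M))   # Gaussian means
        self.sigma = np.ones((self.N + 1, M))         # Gaussian std devs

def sample_tree(p):
    """Ancestral sampling: draw states top-down, then observations."""
    r = np.zeros(p.N + 1, dtype=int)
    w = np.zeros(p.N + 1)
    r[1] = rng.choice(p.M, p=p.kappa)                 # root from kappa
    for u in range(2, p.N + 1):
        r[u] = rng.choice(p.M, p=p.eps[u, :, r[u // 2]])
    for u in range(1, p.N + 1):
        w[u] = rng.normal(p.mu[u, r[u]], p.sigma[u, r[u]])
    return w[1:], r[1:]

params = HMTParams(J=3, M=2)   # 7-node binary tree, 2 states per node
w, r = sample_tree(params)
print("observations:", np.round(w, 2), "states:", r)
```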


HMT: Preliminaries

Additional notation:

$C(u) = \{c_1(u), \ldots, c_{N_u}(u)\}$ is the set of children of node $u$.

$\mathcal{T}_u$ is the subtree observed from node $u$ (including all its descendants).

$\mathcal{T}_{u \setminus v}$ is the subtree from node $u$ but excluding node $v$ and all its descendants.

Likelihood:

$$\mathcal{L}_\theta(\mathbf{w}) \triangleq \Pr(\mathbf{w}|\theta) = \sum_{\forall r} \Pr(\mathbf{w}, r|\theta) = \sum_{\forall r} \prod_u \varepsilon_{u,\, r_u r_{\rho(u)}}\, f_{u, r_u}(w_u) = \sum_{\forall r} \mathcal{L}_\theta(\mathbf{w}, r)$$
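As a sanity check on this formula, a brute-force sketch that enumerates all $M^N$ hidden-state assignments of a 3-node binary tree (with the root term using $\kappa$ in place of $\varepsilon$). This is tractable only for toy trees, which is exactly why the recursions on the next slides matter; all names and the random parameters below are illustrative.

```python
import itertools
import numpy as np
from scipy.stats import norm

# Tiny binary tree: nodes 1..3, parent rho(u) = u // 2, M = 2 states.
M, N = 2, 3
rng = np.random.default_rng(1)
kappa = np.array([0.6, 0.4])                   # root prior
e = rng.random((N + 1, M, M))
eps = e / e.sum(axis=1, keepdims=True)         # eps[u, m, n]
mu = rng.normal(0, 1, (N + 1, M))              # Gaussian means, sigma = 1
w = np.array([np.nan, 0.1, -0.3, 0.7])         # w[1..3]; w[0] unused

def joint(r):
    """L_theta(w, r) for one complete state assignment r[1..N]."""
    p = kappa[r[1]] * norm.pdf(w[1], mu[1, r[1]], 1.0)
    for u in (2, 3):
        p *= eps[u, r[u], r[u // 2]] * norm.pdf(w[u], mu[u, r[u]], 1.0)
    return p

# L_theta(w) = sum over all M**N hidden state assignments
L = sum(joint((None,) + r) for r in itertools.product(range(M), repeat=N))
print("likelihood:", L)
```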


Expectation Maximization for HMT

Expected values and counters (upward-downward algorithm):

$$\beta_u(n) \triangleq \Pr(\mathcal{T}_u \mid r_u = n, \theta) = f_{u,n}(w_u) \prod_{v \in C(u)} \beta_{u,v}(n),$$

$$\beta_{\rho(u),u}(n) \triangleq \Pr(\mathcal{T}_u \mid r_{\rho(u)} = n, \theta) = \sum_m \beta_u(m)\, \varepsilon_{u,mn},$$

$$\alpha_u(n) \triangleq \Pr(\mathcal{T}_{1 \setminus u}, r_u = n \mid \theta) = \sum_m \varepsilon_{u,nm}\, \frac{\beta_{\rho(u)}(m)\, \alpha_{\rho(u)}(m)}{\beta_{\rho(u),u}(m)},$$

$$\gamma_u(m) \triangleq \Pr(r_u = m \mid \mathbf{w}, \theta) = \frac{\alpha_u(m)\, \beta_u(m)}{\sum_n \alpha_u(n)\, \beta_u(n)},$$

$$\xi_u(m,n) \triangleq \Pr(r_u = m, r_{\rho(u)} = n \mid \mathbf{w}, \theta) = \frac{\beta_u(m)\, \varepsilon_{u,mn}\, \alpha_{\rho(u)}(n)\, \beta_{\rho(u)}(n) / \beta_{\rho(u),u}(n)}{\sum_n \alpha_u(n)\, \beta_u(n)}.$$
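A sketch of the upward ($\beta$) pass only, under the same hypothetical binary-tree indexing used above; the downward ($\alpha$) pass and the posteriors $\gamma$, $\xi$ follow the same looping pattern. Gaussian densities come from scipy.stats.norm, and all parameters are random placeholders.

```python
import numpy as np
from scipy.stats import norm

# Binary tree, nodes 1..N with parent u // 2; leaves have no children.
J, M = 3, 2
N = 2**J - 1
rng = np.random.default_rng(2)
e = rng.random((N + 1, M, M))
eps = e / e.sum(axis=1, keepdims=True)       # eps[u, m, n]
mu = rng.normal(0, 1, (N + 1, M))
w = rng.normal(0, 1, N + 1)
kappa = np.full(M, 1.0 / M)

beta = np.zeros((N + 1, M))                  # beta_u(n)
beta_pu = np.zeros((N + 1, M))               # beta_{rho(u),u}(n)

for u in range(N, 0, -1):                    # upward pass: leaves to root
    f_u = norm.pdf(w[u], mu[u], 1.0)         # f_{u,n}(w_u) for all n
    children = [v for v in (2 * u, 2 * u + 1) if v <= N]
    beta[u] = (f_u * np.prod([beta_pu[v] for v in children], axis=0)
               if children else f_u)
    beta_pu[u] = eps[u].T @ beta[u]          # sum_m beta_u(m) eps[u,m,n]

# Likelihood of the whole tree: sum_p kappa_p beta_1(p)
print("L_theta(w) =", kappa @ beta[1])
```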


EM for HMT: Reestimation Formulas

For multiple observations $\mathcal{W} = \{\mathbf{w}^1, \mathbf{w}^2, \ldots, \mathbf{w}^L\}$, and using $f_{u,r_u}(w^\ell_u) = \mathcal{N}(w^\ell_u; \mu_{u,r_u}, \sigma_{u,r_u})$, we have

$$\varepsilon_{u,mn} = \frac{\sum_\ell \xi^\ell_u(m,n)}{\sum_\ell \gamma^\ell_{\rho(u)}(n)}, \qquad \mu_{u,m} = \frac{\sum_\ell w^\ell_u\, \gamma^\ell_u(m)}{\sum_\ell \gamma^\ell_u(m)},$$

and

$$\sigma^2_{u,m} = \frac{\sum_\ell (w^\ell_u - \mu_{u,m})^2\, \gamma^\ell_u(m)}{\sum_\ell \gamma^\ell_u(m)}.$$
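The M-step is just weighted averaging. The sketch below applies the three formulas at a single node $u$; for brevity the E-step outputs ($\gamma$, $\xi$) are drawn at random rather than computed by the upward-downward pass, so everything here is illustrative.

```python
import numpy as np

# Hypothetical E-step outputs for L trees at one node u with M states:
# gamma[l, m] = gamma^l_u(m), xi[l, m, n] = xi^l_u(m, n), w[l] = w^l_u.
rng = np.random.default_rng(3)
L_trees, M = 100, 2
g = rng.random((L_trees, M)); gamma = g / g.sum(axis=1, keepdims=True)
x = rng.random((L_trees, M, M)); xi = x / x.sum(axis=(1, 2), keepdims=True)
gamma_parent = xi.sum(axis=1)      # gamma^l_{rho(u)}(n) = sum_m xi(m, n)
w = rng.normal(1.0, 2.0, L_trees)

# Reestimation formulas from the slide:
eps_new = xi.sum(axis=0) / gamma_parent.sum(axis=0)           # eps_{u,mn}
mu_new = (w[:, None] * gamma).sum(axis=0) / gamma.sum(axis=0)
var_new = (((w[:, None] - mu_new)**2) * gamma).sum(axis=0) / gamma.sum(axis=0)

print("eps:\n", eps_new.round(3))
print("mu:", mu_new.round(3), "sigma^2:", var_new.round(3))
```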


Hidden Markov Models and Hidden Markov Trees

To model a sequence $\mathbf{W} = \mathbf{w}^1, \mathbf{w}^2, \ldots, \mathbf{w}^T$, where $\mathbf{w}^t = [w^t_1, w^t_2, \ldots, w^t_N]$ is tree-structured data with $J$ levels, the HMM-HMT model is defined as the composite $\Theta \triangleq \vartheta[\theta]$.

That is, each state $k \in Q$ in the external HMM has an HMT $\theta^k = \langle \mathcal{U}^k, \mathcal{R}^k, \kappa^k, \varepsilon^k, \mathcal{F}^k \rangle$.

Then, in $\Theta$ we have

$$b_k(\mathbf{w}^t) \triangleq \Pr(\mathbf{w}^t|\theta^k) = \sum_{\forall r} \Pr(\mathbf{w}^t, r|\theta^k) = \sum_{\forall r} \prod_{\forall u} \varepsilon^k_{u,\, r_u r_{\rho(u)}}\, f^k_{u, r_u}(w^t_u).$$
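A sketch of the emission computation: one upward pass per external state $k$, each with its own HMT parameters, yields $b_k(\mathbf{w}^t)$. The function name, the binary-tree indexing, and the random parameters are all assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

def hmt_likelihood(w, eps, mu, kappa):
    """b_k(w) for one HMT: upward pass on a binary tree, nodes 1..N."""
    N, M = mu.shape[0] - 1, mu.shape[1]
    beta = np.zeros((N + 1, M)); beta_pu = np.zeros((N + 1, M))
    for u in range(N, 0, -1):
        f_u = norm.pdf(w[u], mu[u], 1.0)
        kids = [v for v in (2 * u, 2 * u + 1) if v <= N]
        beta[u] = f_u * (np.prod([beta_pu[v] for v in kids], axis=0)
                         if kids else 1.0)
        beta_pu[u] = eps[u].T @ beta[u]
    return kappa @ beta[1]

# One HMT per external HMM state k = 0..K-1 (hypothetical random params)
rng = np.random.default_rng(4)
K, J, M = 2, 3, 2
N = 2**J - 1
hmts = []
for k in range(K):
    e = rng.random((N + 1, M, M))
    hmts.append((e / e.sum(axis=1, keepdims=True),
                 rng.normal(0, 1, (N + 1, M)),
                 np.full(M, 1.0 / M)))

w_t = np.concatenate(([np.nan], rng.normal(0, 1, N)))  # one observed tree
b = [hmt_likelihood(w_t, *hmts[k]) for k in range(K)]
print("emission likelihoods b_k(w^t):", np.round(b, 6))
```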


HMM-HMT: Likelihood

The full-model likelihood for the composite model is

$$\mathcal{L}_\Theta(\mathbf{W}) = \sum_{\forall q} \prod_{t=1}^{T} \left( a_{q^{t-1} q^t} \sum_{\forall r} \prod_{\forall u} \varepsilon^{q^t}_{u,\, r_u r_{\rho(u)}}\, f^{q^t}_{u, r_u}(w^t_u) \right)$$

$$= \sum_{\forall q} \sum_{\forall \mathcal{R}} \prod_t a_{q^{t-1} q^t} \prod_{\forall u} \varepsilon^{q^t}_{u,\, r^t_u r^t_{\rho(u)}}\, f^{q^t}_{u, r^t_u}(w^t_u) = \sum_{\forall q} \sum_{\forall \mathcal{R}} \mathcal{L}_\Theta(\mathbf{W}, q, \mathcal{R})$$
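The sum over external paths $q$ is handled by the standard forward recursion, with the HMT likelihoods acting as emission probabilities. In the sketch below the emission matrix $b[t,k] = b_k(\mathbf{w}^t)$ would come from the upward pass of the previous sketch; here it is filled with illustrative random values.

```python
import numpy as np

rng = np.random.default_rng(5)
T, K = 6, 3
a = rng.random((K, K)); a /= a.sum(axis=1, keepdims=True)  # a_{q^{t-1} q^t}
pi = np.full(K, 1.0 / K)                                   # initial probs
b = rng.random((T, K))            # b[t, k] = b_k(w^t), placeholder values

alpha = np.zeros((T, K))
alpha[0] = pi * b[0]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ a) * b[t]   # sum over the previous state

print("L_Theta(W) =", alpha[-1].sum())     # sum over all state paths q
```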


EM for HMM-HMT: Maximization

In this case, the auxiliary function is defined as

$$D(\Theta, \bar{\Theta}) \triangleq \sum_{\forall q} \sum_{\forall \mathcal{R}} \mathcal{L}_\Theta(\mathbf{W}, q, \mathcal{R}) \log\left(\mathcal{L}_{\bar{\Theta}}(\mathbf{W}, q, \mathcal{R})\right)$$

...

$$D(\Theta, \bar{\Theta}) = \sum_{\forall q} \sum_{\forall \mathcal{R}} \mathcal{L}_\Theta(\mathbf{W}, q, \mathcal{R}) \cdot \left[ \sum_t \log\left(a_{q^{t-1} q^t}\right) + \sum_t \sum_{\forall u} \log\left(\varepsilon^{q^t}_{u,\, r^t_u r^t_{\rho(u)}}\right) + \log\left(f^{q^t}_{u, r^t_u}(w^t_u)\right) \right]$$


EM for HMM-HMT: Maximization

Example for a single Gaussian (HMM(states(HMT(nodes(states(Gaussian)))))):

$$D(\Theta, \bar{\Theta}) = \sum_{\forall q} \sum_{\forall \mathcal{R}} \mathcal{L}_\Theta(\mathbf{W}, q, \mathcal{R}) \cdot \left[ \sum_t \log\left(a_{q^{t-1} q^t}\right) + \sum_t \sum_{\forall u} \log\left(\varepsilon^{q^t}_{u,\, r^t_u r^t_{\rho(u)}}\right) + \sum_t \sum_{\forall u} \left( -\frac{\log(2\pi)}{2} - \log\left(\sigma^{q^t}_{u, r^t_u}\right) - \frac{\left(w^t_u - \mu^{q^t}_{u, r^t_u}\right)^2}{2\left(\sigma^{q^t}_{u, r^t_u}\right)^2} \right) \right]$$


EM for HMM-HMT: Maximization

For $\varepsilon^k_{u,mn}$ the restriction $\sum_m \varepsilon^k_{u,mn} = 1$ should be satisfied. If we use

$$\tilde{D}(\Theta, \bar{\Theta}) \triangleq D(\Theta, \bar{\Theta}) + \sum_n \lambda_n \left( \sum_m \varepsilon^k_{u,mn} - 1 \right),$$

with $\nabla_{\varepsilon^k_{u,mn}} \tilde{D}(\Theta, \bar{\Theta}) = 0$ the learning rule results in

$$\varepsilon^k_{u,mn} = \frac{\sum_t \gamma_t(k)\, \xi^{tk}_u(m,n)}{\sum_t \gamma_t(k)\, \gamma^{tk}_{\rho(u)}(n)}.$$


EM for HMM-HMT: Maximization

From $\nabla_{\mu^k_{u,m}} D(\Theta, \bar{\Theta}) = 0$ we obtain

$$\mu^k_{u,m} = \frac{\sum_t \gamma_t(k)\, \gamma^{tk}_u(m)\, w^t_u}{\sum_t \gamma_t(k)\, \gamma^{tk}_u(m)}$$

and, in a similar way,

$$(\sigma^k_{u,m})^2 = \frac{\sum_t \gamma_t(k)\, \gamma^{tk}_u(m)\, \left(w^t_u - \mu^k_{u,m}\right)^2}{\sum_t \gamma_t(k)\, \gamma^{tk}_u(m)}.$$
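These are the same weighted averages as in the single-tree case, except each sufficient statistic carries the extra outer-HMM occupancy factor $\gamma_t(k)$. A sketch at one node $u$ and one external state $k$, with illustrative random posteriors standing in for the real E-step:

```python
import numpy as np

# Illustrative posteriors for T frames, one node u, M tree-states, at a
# fixed external state k: gamma_t_k[t] = gamma_t(k) from the outer HMM,
# gamma_tk_u[t, m] = gamma^{tk}_u(m) from the HMT upward-downward pass.
rng = np.random.default_rng(6)
T, M = 50, 2
gamma_t_k = rng.random(T)
g = rng.random((T, M)); gamma_tk_u = g / g.sum(axis=1, keepdims=True)
w_t_u = rng.normal(0.5, 1.5, T)

wgt = gamma_t_k[:, None] * gamma_tk_u      # gamma_t(k) gamma^{tk}_u(m)
mu = (wgt * w_t_u[:, None]).sum(axis=0) / wgt.sum(axis=0)
var = (wgt * (w_t_u[:, None] - mu)**2).sum(axis=0) / wgt.sum(axis=0)
print("mu^k_{u,m}:", mu.round(3), " (sigma^k_{u,m})^2:", var.round(3))
```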


Minimum Classification Error Approach for HMM-HMT


Some hidden requirements for ML training

Using ML for classification purposes:

Isolated training of models

Enough expressive power in the model

Enough representative training data

Enough (infinite?) reestimation steps

The Minimum Classification Error (MCE) approach:

1. Simulation of the classifier decision

2. Soft approximation of the 0-1 loss

3. Minimization of the empirical classification risk


Minimum Classification Error Training

Let $g_j(\mathbf{W}; \Lambda)$ be a set of discriminant functions, with $\Lambda$ the whole parameter set. For the MCE approach we can choose:

1. Simulation of the classifier decision:
$$d_i(\mathbf{W}; \Lambda) = -g_i(\mathbf{W}; \Lambda) + \max_{j \neq i} g_j(\mathbf{W}; \Lambda)$$

2. Soft approximation of the 0-1 loss:
$$\ell(d_i(\mathbf{W}; \Lambda)) = \ell_i(\mathbf{W}; \Lambda) = \left(1 + \exp\left(-\gamma d_i(\mathbf{W}; \Lambda) + \beta\right)\right)^{-1}$$

3. Minimization of the empirical classification risk. The classification risk conditioned on $\mathbf{W}$ can be written as
$$\ell(\mathbf{W}; \Lambda) = \sum_{i=1}^{M} \ell_i(\mathbf{W}; \Lambda)\, I(\mathbf{W} \in \Omega_i),$$
where $I(\cdot)$ is the indicator function and $\Omega_i$ stands for the set of patterns which belong to class $i$.
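Steps 1 and 2 compose into a differentiable loss, which is what makes gradient-based training possible. A small self-contained sketch (the scores below are illustrative, e.g. log-likelihoods from each class model):

```python
import numpy as np

def mce_loss(g, i, gamma=1.0, beta=0.0):
    """Sigmoid 0-1 loss for true class i, given discriminants g_j(W; Lambda).

    Uses the max-based misclassification measure of step 1; gamma and
    beta are the slope and offset of the soft 0-1 approximation (step 2).
    """
    d_i = -g[i] + np.max(np.delete(g, i))            # step 1
    return 1.0 / (1.0 + np.exp(-gamma * d_i + beta))  # step 2

# Three-class example: class 0 is the true class.
g = np.array([-120.0, -135.0, -128.0])   # e.g. log-likelihood scores
print("loss (correct, margin 8):", round(mce_loss(g, i=0), 4))
print("loss (wrong class wins):", round(mce_loss(g, i=1), 4))
```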


MCE training for HMM-HMT

For the HMM-HMT we propose a Viterbi-based discrimination function

$$-g(\mathbf{W}; \Lambda) = \log\left(\max_{q,\mathcal{R}} \mathcal{L}_\Theta(\mathbf{W}, q, \mathcal{R})\right) = \sum_t \log a_{q^{t-1} q^t} + \sum_t \sum_{\forall u} \log \varepsilon^{q^t}_{u,\, r^t_u r^t_{\rho(u)}} + \sum_t \sum_{\forall u} \log f^{q^t}_{u, r^t_u}(w^t_u)$$

and the misclassification function defined as the inverse relation

$$d_i(\mathbf{W}; \Lambda) = 1 - \frac{\left[\frac{1}{M-1} \sum_{j \neq i} g_j(\mathbf{W}; \Lambda)^{-\eta}\right]^{-1/\eta}}{g_i(\mathbf{W}; \Lambda)}$$


MCE training for HMM-HMT

Based on the Generalized Probabilistic Descent (GPD) method, the algorithm can be summarized as

$$\Lambda \longleftarrow \Lambda - \alpha_\tau \left. \frac{\partial \ell(\mathbf{W}_\tau; \Lambda)}{\partial \Lambda} \right|_{\Lambda = \Lambda_\tau}.$$

Thus, the updating process that works upon the transformed means $\mu^{(j)k}_{u,m}$ is given by

$$\mu^{(j)k}_{u,m} \longleftarrow \mu^{(j)k}_{u,m} - \alpha_\tau \left. \frac{\partial \ell_i(\mathbf{W}_\tau; \Lambda)}{\partial \mu^{(j)k}_{u,m}} \right|_{\Lambda = \Lambda_\tau} \tag{1}$$

and applying the chain rule of differentiation we get...
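The GPD loop itself is plain stochastic descent on the smoothed loss, one training token per step with a decreasing step size. A toy sketch on a 1-D parameter with a numerical gradient (in the real HMM-HMT case $\Lambda$ collects all means, variances, and transitions, and the gradient comes from the chain rule as on the next slides; everything here is an illustrative stand-in):

```python
import numpy as np

def ell(lmbda, w):
    """Toy smoothed loss: class i modeled by a Gaussian at lmbda, class j
    fixed at 2.0, so d = -g_i + g_j = (w - lmbda)**2 - (w - 2)**2."""
    d = (w - lmbda)**2 - (w - 2.0)**2
    return 1.0 / (1.0 + np.exp(-d))

rng = np.random.default_rng(7)
lmbda, alpha0 = -1.0, 0.5
for tau in range(200):
    w_tau = rng.normal(0.0, 0.3)          # one training token W_tau
    alpha_tau = alpha0 / (1 + tau / 50)   # decreasing step size
    h = 1e-5                              # numerical d ell / d Lambda
    grad = (ell(lmbda + h, w_tau) - ell(lmbda - h, w_tau)) / (2 * h)
    lmbda -= alpha_tau * grad
print("learned class mean:", round(lmbda, 3))  # moves toward the data mean 0.0
```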


MCE training for the Gaussian means

For $j = i$:

$$\mu^{(i)k}_{u,m} \longleftarrow \mu^{(i)k}_{u,m} - \alpha_\tau\, \gamma\, \ell_i (1 - \ell_i)\, \frac{d_i - 1}{g_i} \times \sum_t \delta(q^t - k, r^t_u - m) \left[ \frac{w^t_u - \mu^{(i)k}_{u,m}}{\sigma^{(i)k}_{u,m}} \right]$$

...and for $j \neq i$:

$$\mu^{(j)k}_{u,m} \longleftarrow \mu^{(j)k}_{u,m} - \alpha_\tau\, \gamma\, \ell_i (1 - \ell_i)(1 - d_i)\, \frac{g_j^{-\eta-1}}{\sum_{k \neq i} g_k^{-\eta}} \times \sum_t \delta(q^t - k, r^t_u - m) \left[ \frac{w^t_u - \mu^{(j)k}_{u,m}}{\sigma^{(j)k}_{u,m}} \right]$$
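One GPD step on the means of the true-class model (the $j = i$ rule) can be written directly from the formula: the Kronecker delta just selects the frames whose Viterbi alignment lands on state $k$ of the HMM and state $m$ of node $u$. The alignment and the scalar factors $\ell_i$, $d_i$, $g_i$ below are illustrative inputs, not computed from a real model.

```python
import numpy as np

rng = np.random.default_rng(8)
T, K, M, N = 20, 2, 2, 7
mu = rng.normal(0, 1, (K, N + 1, M))          # mu^{(i)k}_{u,m}
sigma = np.ones((K, N + 1, M))
w = rng.normal(0, 1, (T, N + 1))              # w^t_u
q = rng.integers(0, K, T)                     # Viterbi HMM states q^t
r = rng.integers(0, M, (T, N + 1))            # Viterbi HMT states r^t_u
alpha_tau, gamma_s, ell_i, d_i, g_i = 0.1, 1.0, 0.4, -2.0, 50.0

coef = alpha_tau * gamma_s * ell_i * (1 - ell_i) * (d_i - 1) / g_i
for k in range(K):
    for u in range(1, N + 1):
        for m in range(M):
            # delta(q^t - k, r^t_u - m) selects the aligned frames
            sel = (q == k) & (r[:, u] == m)
            step = np.sum((w[sel, u] - mu[k, u, m]) / sigma[k, u, m])
            mu[k, u, m] -= coef * step
print("updated means, state k=0, node u=1:", mu[0, 1].round(3))
```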


MCE training for the Gaussian variances

In a similar way, for $j = i$:

$$\sigma^{(i)k}_{u,m} \longleftarrow \sigma^{(i)k}_{u,m} - \alpha_\tau\, \gamma\, \ell_i (1 - \ell_i)\, \frac{d_i - 1}{g_i} \times \sum_t \delta(q^t - k, r^t_u - m) \left[ \left( \frac{w^t_u - \mu^{(i)k}_{u,m}}{\sigma^{(i)k}_{u,m}} \right)^2 - 1 \right]$$

...and for $j \neq i$:

$$\sigma^{(j)k}_{u,m} \longleftarrow \sigma^{(j)k}_{u,m} - \alpha_\tau\, \gamma\, \ell_i (1 - \ell_i)(1 - d_i)\, \frac{g_j^{-\eta-1}}{\sum_{k \neq i} g_k^{-\eta}} \times \sum_t \delta(q^t - k, r^t_u - m) \left[ \left( \frac{w^t_u - \mu^{(j)k}_{u,m}}{\sigma^{(j)k}_{u,m}} \right)^2 - 1 \right]$$


MCE training for the transition probabilities in the HMT

For $j = i$:

$$\varepsilon^{(i)k}_{u,mn} \longleftarrow \varepsilon^{(i)k}_{u,mn} - \alpha_\tau\, \gamma\, \ell_i (1 - \ell_i)\, \frac{d_i - 1}{g_i} \times \left[ \sum_t \delta(q^t - k, r^t_u - m, r^t_{\rho(u)} - n) - \sum_t \sum_p \delta(q^t - k, r^t_u - p, r^t_{\rho(u)} - n)\, \varepsilon^{(i)k}_{u,mn} \right]$$

...and for $j \neq i$:

$$\varepsilon^{(j)k}_{u,mn} \longleftarrow \varepsilon^{(j)k}_{u,mn} - \alpha_\tau\, \gamma\, \ell_i (1 - \ell_i)(1 - d_i)\, \frac{g_j^{-\eta-1}}{\sum_{k \neq i} g_k^{-\eta}} \times \left[ \sum_t \delta(q^t - k, r^t_u - m, r^t_{\rho(u)} - n) - \sum_t \sum_p \delta(q^t - k, r^t_u - p, r^t_{\rho(u)} - n)\, \varepsilon^{(j)k}_{u,mn} \right]$$


MCE training for the transition probabilities in the HMM

For $j = i$:

$$a^{(i)}_{sj} \longleftarrow a^{(i)}_{sj} - \alpha_\tau\, \gamma\, \ell_i (1 - \ell_i)\, \frac{d_i - 1}{g_i} \times \left[ \sum_{t=1}^{T} \delta(q^{t-1} - s, q^t - j) - \sum_{t=1}^{T} \delta(q^{t-1} - s)\, a^{(i)}_{sj} \right]$$

...and for $j \neq i$:

$$a^{(j)}_{sj} \longleftarrow a^{(j)}_{sj} - \alpha_\tau\, \gamma\, \ell_i (1 - \ell_i)(1 - d_i)\, \frac{g_j^{-\eta-1}}{\sum_{k \neq i} g_k^{-\eta}} \times \left[ \sum_{t=1}^{T} \delta(q^{t-1} - s, q^t - j) - \sum_{t=1}^{T} \delta(q^{t-1} - s)\, a^{(j)}_{sj} \right]$$


Applications


The first application: speech recognition

[Diagram: ASR system — speech signal → signal processing and analysis → pattern recognition → text (e.g. "en que")]

The problem complexities:

Speech continuity and context dependence

Inter-/intra-speaker variability

The structures of speech

Noise and other distortions

Types of speech recognizers:

Vocabulary size: small, medium, large, very large, unrestricted

Speakers: mono-speaker, multi-speaker, speaker-independent

Speaking styles: isolated, connected, read, fluent, spontaneous

Distortions: robust/non-robust speech recognition


The speech recognition model

[Diagram: ASR system architecture]


The speech recognition model

[Diagram: from the signal s(t) and its analysis s[m,k] to a network of word and phoneme models — "silencio el sótano silencio" between begin (B) and end (E) nodes, with "e l s o t a n o" at the phoneme level and alternative words such as "está", "en", "del", "comedor"]


Language model for speech recognition

[Diagram: word network between begin (B) and end (E) nodes linking words such as sil, de, a, caudal, Jucar, veinte, y]


Language model and prosody

[Diagram: multi-layer language model — Layer 1 through Layer N, each layer a word network (sil, de, a, caudal, Jucar, veinte, y) between begin (B) and end (E) nodes]


Speech recognition: configuration for similar tasks

Speaker recognition/identification

Word-spotting

Emotion recognition

Language recognition

Pronunciation quality evaluation (for teaching)

Automatic translation

Recognition of pathologies in voices

Multimodal speech recognition

...


Recognition of masticatory events in cows and sheep


Masticatory events in sheep


Model for classification of ruminant ingestive events

[Diagram: hierarchical model — language model, compound level (chew-bite), event level (chew, bite, sil), sub-event level, and acoustic level, with three-state (s1, s2, s3) models c1, c2, c3]


Language model for masticatory events in sheep/cows

[Diagram: language model network from start (S) to end (E) through sil, with bite, chew, and chew-bite events in a loop]


HMM in Computer Vision


HMM in Computer Vision: OCR


HMM in Computer Vision: OCR by frames


HMM in Computer Vision: OCR by grammars


HMM in Computer Vision: Cancer


HMM in Computer Vision: Chromosomes


HMM for Denoising

Denoising with HMM-HMT in the wavelet domain and re-synthesis

$N_x$: signal length; $N_w$: window width; $N_s$: step of the analysis; DWT: discrete wavelet transform (Daubechies-8)
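A sketch of the pipeline shape only, assuming PyWavelets for the DWT: windowed analysis with width $N_w$ and step $N_s$, coefficient shrinkage, inverse DWT, and overlap-add re-synthesis. The slides shrink coefficients with the HMM-HMT model; here plain soft thresholding stands in for that step, and the function name and test signal are illustrative.

```python
import numpy as np
import pywt  # PyWavelets

def denoise_windowed(x, Nw=128, Ns=64, wavelet="db8"):
    """Windowed DWT denoising and overlap-add re-synthesis (sketch).

    Stand-in shrinkage: soft thresholding with a median-based noise
    estimate, NOT the HMM-HMT coefficient model of the slides.
    """
    y = np.zeros_like(x)
    norm = np.zeros_like(x)
    for start in range(0, len(x) - Nw + 1, Ns):
        frame = x[start:start + Nw]
        coeffs = pywt.wavedec(frame, wavelet)          # analysis DWT
        thr = np.median(np.abs(coeffs[-1])) / 0.6745   # noise estimate
        coeffs = [coeffs[0]] + [
            pywt.threshold(c, thr * np.sqrt(2 * np.log(Nw)), mode="soft")
            for c in coeffs[1:]
        ]
        y[start:start + Nw] += pywt.waverec(coeffs, wavelet)[:Nw]
        norm[start:start + Nw] += 1.0                  # overlap counter
    return y / np.maximum(norm, 1.0)

# Noisy Doppler-like test signal
n = np.arange(1024)
clean = np.sin(2 * np.pi * 50 / (n + 32))
noisy = clean + 0.2 * np.random.default_rng(9).normal(size=n.size)
print("residual MSE:", round(np.mean((denoise_windowed(noisy) - clean)**2), 5))
```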


Denoising: Doppler

[Figure: Doppler signal denoising, four panels a)–d), amplitude range −15 to 15, 1000 samples]


Denoising: Heavisine

[Figure: Heavisine signal denoising, four panels a)–d), amplitude range −15 to 10, 1000 samples]


Denoising: warping Doppler


Feature Extraction: DWT post-processing

[Figure: /eh/ phoneme analyzed with the standard DWT, frame by frame]


Feature Extraction: DWT post-processing

[Figure: /eh/ phoneme — standard DWT frame by frame vs. the Spectrum Modulus by Scale (SMS) of the DWT]


HMM in Bioinformatics

Basic base-pair modelling of DNA: classification, prediction, search, ...


HMM in Bioinformatics

Classification of coding/non-coding regions of nucleotide sequences.


HMM in Bioinformatics

Alignment and prediction of exons using genomic DNA from two different organisms.


HMM in Bioinformatics

Signal peptide discrimination
