Introduction to Machine Learning - CmpE WEB

Lecture Slides for Introduction to Machine Learning
ETHEM ALPAYDIN © The MIT Press, 2010
[email protected]
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e

Introduction

Modeling dependencies in the input; the observations are no longer iid (independent and identically distributed).

Sequences:

Temporal: In speech; phonemes in a word (dictionary), words in a sentence (syntax, semantics of the language).

In handwriting, pen movements

Spatial: In a DNA sequence; base pairs


Discrete Markov Process

N states: S1, S2, ..., SN

State at “time” t: qt = Si

First-order Markov

P(qt+1=Sj | qt=Si, qt-1=Sk ,...) = P(qt+1=Sj | qt=Si)

Transition probabilities

aij ≡ P(qt+1=Sj | qt=Si), with aij ≥ 0 and Σj=1..N aij = 1

Initial probabilities

πi ≡ P(q1=Si), with Σi=1..N πi = 1
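
A minimal sketch of these definitions in Python (numpy and the concrete A, Π values are assumptions for illustration; they match the balls-and-urns example below):

import numpy as np

# Assumed example values (same as the balls-and-urns slide below).
A = np.array([[0.4, 0.3, 0.3],   # aij = P(qt+1=Sj | qt=Si); each row sums to 1
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
pi = np.array([0.5, 0.2, 0.3])   # πi = P(q1=Si); sums to 1

def sample_chain(A, pi, T, seed=0):
    """Sample a state sequence q1..qT from a first-order Markov chain."""
    rng = np.random.default_rng(seed)
    q = [rng.choice(len(pi), p=pi)]                   # draw q1 from Π
    for _ in range(T - 1):
        q.append(rng.choice(A.shape[1], p=A[q[-1]]))  # draw qt+1 given qt
    return q

print(sample_chain(A, pi, T=10))  # e.g. [0, 0, 2, 2, ...] (0-based state indices)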

4Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Stochastic Automaton

P(O = Q | A, Π) = P(q1) ∏t=2..T P(qt | qt−1) = πq1 aq1q2 ∙∙∙ aqT−1qT


Example: Balls and Urns

Three urns, each full of balls of one color:
S1: red, S2: blue, S3: green

        | 0.4  0.3  0.3 |
    A = | 0.2  0.6  0.2 | ,   Π = [0.5, 0.2, 0.3]
        | 0.1  0.1  0.8 |

O = {S1, S1, S3, S3}

P(O | A, Π) = P(S1) · P(S1|S1) · P(S3|S1) · P(S3|S3)
= π1 · a11 · a13 · a33
= 0.5 · 0.4 · 0.3 · 0.8 = 0.048
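
As a quick check of this arithmetic in Python (the numbers are straight from the slide):

import numpy as np

A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
pi = np.array([0.5, 0.2, 0.3])
O = [0, 0, 2, 2]  # S1, S1, S3, S3 as 0-based indices

p = pi[O[0]] * np.prod([A[O[t - 1], O[t]] for t in range(1, len(O))])
print(p)  # 0.048 (up to floating-point rounding)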


Balls and Urns: Learning

Given K example sequences of length T:

π̂i = #{sequences starting with Si} / #{sequences}
   = Σk 1(q1k = Si) / K

âij = #{transitions from Si to Sj} / #{transitions from Si}
    = Σk Σt=1..T−1 1(qtk = Si and qt+1k = Sj) / Σk Σt=1..T−1 1(qtk = Si)
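
These counting estimates translate directly into code; a minimal sketch (sequences given as lists of 0-based state indices; the function name is mine):

import numpy as np

def mle_markov(sequences, N):
    """Estimate pi and A by counting, from K fully observed state sequences."""
    pi = np.zeros(N)
    counts = np.zeros((N, N))
    for q in sequences:
        pi[q[0]] += 1                     # sequences starting with Si
        for t in range(len(q) - 1):
            counts[q[t], q[t + 1]] += 1   # transitions from Si to Sj
    pi /= len(sequences)                  # divide by K
    A = counts / counts.sum(axis=1, keepdims=True)  # normalize each row
    return pi, A                          # (states never left give 0/0; smoothing omitted)

pi_hat, A_hat = mle_markov([[0, 0, 2, 2, 1], [1, 1, 2, 0]], N=3)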


Hidden Markov Models

States are not observable.

Discrete observations {v1, v2, ..., vM} are recorded; each is a probabilistic function of the state.

Emission probabilities

bj(m) ≡ P(Ot=vm | qt=Sj)

Example: In each urn, there are balls of different colors, but with different probabilities.

For each observation sequence, there are multiple state sequences that could have generated it.


HMM Unfolded in Time


Elements of an HMM

N: Number of states

M: Number of observation symbols

A = [aij]: N by N state transition probability matrix

B = bj(m): N by M observation probability matrix

Π = [πi]: N by 1 initial state probability vector

λ = (A, B, Π), parameter set of HMM
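
As a small sketch, the parameter set λ = (A, B, Π) maps naturally onto a container like the following (the class and field names are my own choices, not from the slides):

import numpy as np
from dataclasses import dataclass

@dataclass
class HMM:
    A: np.ndarray    # N x N transitions, A[i, j] = aij
    B: np.ndarray    # N x M emissions,   B[j, m] = bj(m)
    pi: np.ndarray   # length-N initial state probabilities

# A 2-state, 2-symbol toy model (values are arbitrary, for illustration)
lam = HMM(A=np.array([[0.7, 0.3], [0.4, 0.6]]),
          B=np.array([[0.9, 0.1], [0.2, 0.8]]),
          pi=np.array([0.6, 0.4]))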


Three Basic Problems of HMMs

1. Evaluation: Given λ and O, calculate P(O | λ)

2. State sequence: Given λ and O, find Q* such that

P(Q* | O, λ) = maxQ P(Q | O, λ)

3. Learning: Given X = {Ok}k, find λ* such that

P(X | λ*) = maxλ P(X | λ)


(Rabiner, 1989)


Evaluation

Forward variable: αt(i) ≡ P(O1∙∙∙Ot, qt = Si | λ)

Initialization: α1(i) = πi bi(O1)

Recursion: αt+1(j) = [Σi=1..N αt(i) aij] bj(Ot+1)

P(O | λ) = Σi=1..N αT(i)
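
A minimal forward-pass sketch (numpy assumed; O is a sequence of 0-based symbol indices):

import numpy as np

def forward(A, B, pi, O):
    """alpha[t, i] = P(O1..Ot, qt = Si | lambda); returns alphas and P(O | lambda)."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                          # initialization
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]  # recursion
    return alpha, alpha[-1].sum()                       # sum over alphaT(i)

On long sequences the α's underflow, so practical implementations rescale at each step or work in log space.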


Backward variable: βt(i) ≡ P(Ot+1∙∙∙OT | qt = Si, λ)

Initialization: βT(i) = 1

Recursion: βt(i) = Σj=1..N aij bj(Ot+1) βt+1(j)
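
And the matching backward pass, with the same conventions as the forward sketch:

import numpy as np

def backward(A, B, O):
    """beta[t, i] = P(Ot+1..OT | qt = Si, lambda)."""
    T, N = len(O), A.shape[0]
    beta = np.ones((T, N))                            # initialization: betaT(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])  # recursion
    return beta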


Finding the State Sequence

γt(i) ≡ P(qt = Si | O, λ) = αt(i) βt(i) / Σj=1..N αt(j) βt(j)

Choose the state that has the highest probability, for each time step:

qt* = argmaxi γt(i)

No! The individually most likely states need not form a feasible sequence; two consecutive choices qt*, qt+1* may even have transition probability aij = 0.
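
Given α and β from the sketches above, the posteriors and the per-step argmax are one-liners:

import numpy as np

def state_posteriors(alpha, beta):
    """gamma[t, i] = P(qt = Si | O, lambda)."""
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)  # normalize over states at each t

# qt* = argmax_i gamma_t(i); individually most likely, but not necessarily a valid path:
# q_star = state_posteriors(alpha, beta).argmax(axis=1)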


Viterbi’s Algorithm

δt(i) ≡ maxq1q2∙∙∙qt−1 p(q1q2∙∙∙qt−1, qt = Si, O1∙∙∙Ot | λ)

Initialization: δ1(i) = πi bi(O1), ψ1(i) = 0

Recursion: δt(j) = maxi δt−1(i) aij bj(Ot), ψt(j) = argmaxi δt−1(i) aij

Termination: p* = maxi δT(i), qT* = argmaxi δT(i)

Path backtracking: qt* = ψt+1(qt+1*), t = T−1, T−2, ..., 1
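
A compact Viterbi sketch (numpy assumed; real implementations work in log space to avoid underflow):

import numpy as np

def viterbi(A, B, pi, O):
    """q* = argmax_Q P(Q | O, lambda), by dynamic programming."""
    T, N = len(O), len(pi)
    delta = np.zeros((T, N))           # delta_t(i): best path probability ending in Si
    psi = np.zeros((T, N), dtype=int)  # psi_t(j): best predecessor of Sj
    delta[0] = pi * B[:, O[0]]         # initialization
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A          # delta_{t-1}(i) * aij for all i, j
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B[:, O[t]]  # recursion
    q = np.zeros(T, dtype=int)
    q[-1] = delta[-1].argmax()         # termination
    for t in range(T - 2, -1, -1):     # path backtracking
        q[t] = psi[t + 1][q[t + 1]]
    return q, delta[-1].max()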


Learning

Baum-Welch algorithm (an EM procedure):

ξt(i,j) ≡ P(qt = Si, qt+1 = Sj | O, λ)
= αt(i) aij bj(Ot+1) βt+1(j) / Σk Σl αt(k) akl bl(Ot+1) βt+1(l)

Indicator variables:
zt(i) = 1 if qt = Si, and 0 otherwise
zt(i,j) = 1 if qt = Si and qt+1 = Sj, and 0 otherwise


Baum-Welch (EM)

E-step:
E[zt(i)] = γt(i), E[zt(i,j)] = ξt(i,j)

M-step (γtk and ξtk are computed on the k-th sequence Ok):

π̂i = Σk γ1k(i) / K

âij = Σk Σt=1..Tk−1 ξtk(i,j) / Σk Σt=1..Tk−1 γtk(i)

b̂j(m) = Σk Σt=1..Tk γtk(j) 1(Otk = vm) / Σk Σt=1..Tk γtk(j)
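
One full EM iteration as a sketch, for a single sequence (numpy assumed; a real implementation would sum the expected counts over all K sequences, scale α and β, and iterate to convergence):

import numpy as np

def baum_welch_step(A, B, pi, O):
    """One Baum-Welch update of (A, B, pi) from one observation sequence O."""
    O = np.asarray(O)
    T, N, M = len(O), len(pi), B.shape[1]
    # E-step: forward-backward
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[:, O[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)              # gamma_t(i)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, O[t + 1]] * beta[t + 1]
        xi[t] /= xi[t].sum()                               # xi_t(i, j)
    # M-step: re-estimate the parameters from expected counts
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.stack([gamma[O == m].sum(axis=0) for m in range(M)], axis=1)
    B_new /= gamma.sum(axis=0)[:, None]
    return A_new, B_new, pi_new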


Continuous Observations

Discrete: P(Ot | qt = Sj, λ) = ∏m=1..M bj(m)^rtm, where rtm = 1 if Ot = vm and 0 otherwise

Gaussian mixture (discretize using k-means):
P(Ot | qt = Sj, λ) = Σl=1..L P(Gl | qt = Sj) p(Ot | Gl, qt = Sj, λ), with p(Ot | Gl, qt = Sj) ~ N(μjl, Σjl)

Continuous (scalar): P(Ot | qt = Sj, λ) ~ N(μj, σj²)

Use EM to learn the parameters, e.g.,

μ̂j = Σt γt(j) Ot / Σt γt(j)
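
For the scalar Gaussian case, the emission density and the EM mean update above might look like this (scipy assumed for the density; function names are mine):

import numpy as np
from scipy.stats import norm

def gaussian_emission(O_t, mu_j, sigma_j):
    """p(Ot | qt = Sj) for scalar continuous observations: N(mu_j, sigma_j^2) density."""
    return norm.pdf(O_t, loc=mu_j, scale=sigma_j)

def update_means(gamma, O):
    """mu_j = sum_t gamma_t(j) Ot / sum_t gamma_t(j), for all j at once."""
    O = np.asarray(O, dtype=float)
    return (gamma * O[:, None]).sum(axis=0) / gamma.sum(axis=0)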


HMM with Input

Input-dependent observations:
P(Ot | qt = Sj, xt, λ) ~ N(gj(xt | θj), σj²)

Input-dependent transitions (Meila and Jordan, 1996; Bengio and Frasconi, 1996):
P(qt+1 = Sj | qt = Si, xt)

Time-delay input:
xt = f(Ot−τ, ..., Ot−1)


Model Selection in HMM

Left-to-right HMMs:

        | a11  a12  a13   0  |
    A = |  0   a22  a23  a24 |
        |  0    0   a33  a34 |
        |  0    0    0   a44 |

In classification, for each Ci, estimate P(O | λi) by a separate HMM and use Bayes’ rule:

P(λi | O) = P(O | λi) P(λi) / Σj P(O | λj) P(λj)
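
A sketch of this classification rule (the loglik argument stands for any routine that returns log P(O | λi), e.g. a log-space forward pass; names are illustrative):

import numpy as np

def classify(O, models, priors, loglik):
    """Pick argmax_i P(O | lambda_i) P(lambda_i); the denominator is common to all i."""
    scores = [loglik(O, lam) + np.log(p) for lam, p in zip(models, priors)]
    return int(np.argmax(scores))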
