Transcript of Slide3 HMM

Slide 1/51

Hidden Markov Models

Dr. Nguyễn Văn Vinh
Department of Computer Science, University of Engineering and Technology, Vietnam National University, Hanoi

Slide 2/51

Why Learn?

Machine learning is programming computers to optimize a performance criterion using example data or past experience.

There is no need to learn to calculate payroll.

Learning is used when:
Human expertise does not exist (navigating on Mars)
Humans are unable to explain their expertise (speech recognition)
The solution changes over time (routing on a computer network)
The solution needs to be adapted to particular cases (user biometrics)

Slide 3/51

What We Talk About When We Talk About Learning

Learning general models from data of particular examples.

Data is cheap and abundant (data warehouses, data marts); knowledge is expensive and scarce.

Example in retail, from customer transactions to consumer behavior:
People who bought "The Da Vinci Code" also bought "The Five People You Meet in Heaven" (www.amazon.com)

Build a model that is a good and useful approximation to the data.

Slide 4/51

What is Machine Learning?

Optimize a performance criterion using example data or past experience.

Role of Statistics: inference from a sample.
Role of Computer Science: efficient algorithms to solve the optimization problem, and to represent and evaluate the model for inference.

Slide 5/51

Slide 6/51

Part-of-Speech Tagging

Input:
Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.

Output:
Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.

N = Noun
V = Verb
P = Preposition
ADV = Adverb
ADJ = Adjective

Slide 7/51

Face Recognition

Training examples of a person; test images.

AT&T Laboratories, Cambridge UK: http://www.uk.research.att.com/facedatabase.html

Slide 8/51

Introduction

Modeling dependencies in input:

Sequences:
Temporal: in speech, phonemes in a word (dictionary), words in a sentence (syntax, semantics of the language); in handwriting, pen movements.
Spatial: in a DNA sequence, base pairs.

Slide 9/51

Discrete Markov Process

N states: S_1, S_2, ..., S_N
State at time t: q_t = S_i

First-order Markov:
P(q_{t+1} = S_j | q_t = S_i, q_{t-1} = S_k, \ldots) = P(q_{t+1} = S_j | q_t = S_i)

Transition probabilities:
a_{ij} = P(q_{t+1} = S_j | q_t = S_i), with a_{ij} \ge 0 and \sum_{j=1}^N a_{ij} = 1

Initial probabilities:
\pi_i = P(q_1 = S_i), with \sum_{i=1}^N \pi_i = 1

Slide 10/51

Stochastic Automaton

P(O = Q | A, \Pi) = P(q_1) \prod_{t=2}^T P(q_t | q_{t-1}) = \pi_{q_1} a_{q_1 q_2} \cdots a_{q_{T-1} q_T}

Slide 11/51

Example: Balls and Urns

Three urns, each full of balls of one color:
S_1: red, S_2: blue, S_3: green

\Pi = [0.5, 0.2, 0.3]^T

A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

O = {S_1, S_1, S_3, S_3}

P(O | A, \Pi) = P(S_1) \cdot P(S_1 | S_1) \cdot P(S_3 | S_1) \cdot P(S_3 | S_3)
             = \pi_1 \cdot a_{11} \cdot a_{13} \cdot a_{33}
             = 0.5 \cdot 0.4 \cdot 0.3 \cdot 0.8 = 0.048
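The worked example above is small enough to check directly. Below is a minimal Python sketch (state indices 0..2 stand for S_1..S_3) that multiplies \pi_{q_1} by the transition probabilities along the sequence; it reproduces the 0.048 computed above.

```python
pi = [0.5, 0.2, 0.3]                   # initial probabilities pi_i
A = [[0.4, 0.3, 0.3],                  # transition probabilities a_ij
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

def sequence_probability(states, pi, A):
    """P(Q | A, Pi) = pi_{q1} * prod_t a_{q_{t-1} q_t}."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

O = [0, 0, 2, 2]                       # S1, S1, S3, S3
print(sequence_probability(O, pi, A))  # 0.5 * 0.4 * 0.3 * 0.8 = 0.048
```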

Slide 12/51

Balls and Urns: Learning

Given K example sequences of length T, the maximum-likelihood estimates are:

\hat{\pi}_i = (# sequences starting in S_i) / (# sequences)
           = \sum_k 1(q_1^k = S_i) / K

\hat{a}_{ij} = (# transitions from S_i to S_j) / (# transitions from S_i)
            = \sum_k \sum_{t=1}^{T-1} 1(q_t^k = S_i and q_{t+1}^k = S_j) / \sum_k \sum_{t=1}^{T-1} 1(q_t^k = S_i)
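A sketch of these counting estimates in Python, continuing the state-index convention of the previous snippet. The three example sequences are hypothetical, made up just to exercise the code.

```python
def estimate(sequences, n_states):
    """MLE of pi and A from fully observed state sequences."""
    pi = [0.0] * n_states
    A = [[0.0] * n_states for _ in range(n_states)]
    from_count = [0] * n_states
    for seq in sequences:
        pi[seq[0]] += 1                      # count sequence starts
        for prev, cur in zip(seq, seq[1:]):
            A[prev][cur] += 1                # count transitions prev -> cur
            from_count[prev] += 1            # count transitions out of prev
    pi = [c / len(sequences) for c in pi]
    A = [[c / from_count[i] if from_count[i] else 0.0 for c in row]
         for i, row in enumerate(A)]
    return pi, A

seqs = [[0, 0, 2, 2], [0, 1, 1, 2], [2, 2, 2, 0]]   # hypothetical data
pi_hat, A_hat = estimate(seqs, n_states=3)
```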

Slide 13/51

Hidden Markov Models

States are not observable.

Discrete observations {v_1, v_2, ..., v_M} are recorded; a probabilistic function of the state.

Emission probabilities: b_j(m) = P(O_t = v_m | q_t = S_j)

Example: In each urn, there are balls of different colors, but with different probabilities.

For each observation sequence, there are multiple possible state sequences.

Slide 14/51

Hidden Markov Model (HMM)

HMMs allow you to estimate probabilities of unobserved events.

Given the observed surface data (e.g., plain text), infer which underlying parameters generated it.

E.g., in speech recognition, the observed data is the acoustic signal and the words are the hidden parameters.

Slide 15/51

HMMs and their Usage

HMMs are very common in Computational Linguistics:
Speech recognition (observed: acoustic signal, hidden: words)
Handwriting recognition (observed: image, hidden: words)
Part-of-speech tagging (observed: words, hidden: part-of-speech tags)
Machine translation (observed: foreign words, hidden: words in target language)

Slide 16/51

Noisy Channel Model

In speech recognition you observe an acoustic signal (A = a_1, ..., a_n) and you want to determine the most likely sequence of words (W = w_1, ..., w_n): P(W | A).

Problem: A and W are too specific for reliable counts on observed data, and are very unlikely to occur in unseen data.

Slide 17/51

Noisy Channel Model

Assume that the acoustic signal (A) is already segmented with respect to word boundaries.

P(W | A) could be computed as

P(W | A) \approx \max_W \prod_i P(w_i | a_i)

Problem: Finding the most likely word corresponding to an acoustic representation depends on the context.
E.g., /'pre-z&ns/ could mean "presents" or "presence" depending on the context.

Slide 18/51

Noisy Channel Model

Given a candidate sequence W we need to compute P(W) and combine it with P(W | A).

Applying Bayes' rule:

\argmax_W P(W | A) = \argmax_W P(A | W) P(W) / P(A)

The denominator P(A) can be dropped, because it is constant for all W.

Slide 19/51

Noisy Channel in a Picture

Slide 20/51

Decoding

The decoder combines evidence from:

The likelihood P(A | W), which can be approximated as
P(A | W) \approx \prod_{i=1}^n P(a_i | w_i)

The prior P(W), which can be approximated as
P(W) \approx P(w_1) \prod_{i=2}^n P(w_i | w_{i-1})

Slide 21/51

Search Space

Given a word-segmented acoustic sequence, list all candidates and compute the most likely path.

'bot      ik-'spen-siv    'pre-z&ns
boat      excessive       presidents
bald      expensive       presence
bold      expressive      presents
bought    inactive        press

Transitions are scored by the prior, e.g. P(inactive | bald); emissions by the likelihood, e.g. P('bot | bald).
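As a toy illustration, the sketch below scores one candidate path through such a search space by multiplying the likelihood P(a_i | w_i) with the bigram prior P(w_i | w_{i-1}). All probability values here are invented for illustration; a real recognizer would estimate them from data.

```python
# Invented likelihoods P(a | w) and bigram priors P(w | prev).
likelihood = {("'bot", "bought"): 0.3, ("ik-'spen-siv", "expensive"): 0.4,
              ("'pre-z&ns", "presents"): 0.3}
bigram = {("<s>", "bought"): 0.1, ("bought", "expensive"): 0.05,
          ("expensive", "presents"): 0.02}

def score(acoustic, words):
    """P(A | W) * P(W) under the independence/bigram approximations."""
    p = 1.0
    prev = "<s>"
    for a, w in zip(acoustic, words):
        p *= likelihood[(a, w)] * bigram[(prev, w)]
        prev = w
    return p

print(score(["'bot", "ik-'spen-siv", "'pre-z&ns"],
            ["bought", "expensive", "presents"]))
```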

Slide 22/51

Markov Assumption

The Markov assumption states that the probability of the occurrence of word w_i at time t depends only on the occurrence of word w_{i-1} at time t-1.

Chain rule:
P(w_1, ..., w_n) = P(w_1) \prod_{i=2}^n P(w_i | w_1, ..., w_{i-1})

Markov assumption:
P(w_1, ..., w_n) \approx P(w_1) \prod_{i=2}^n P(w_i | w_{i-1})
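A tiny sketch of the bigram approximation above: the probability of a sentence is P(w_1) times the product of P(w_i | w_{i-1}), with both estimated from counts. The toy corpus is invented for illustration.

```python
from collections import Counter

corpus = "the cat sat on the mat . the dog sat .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def sentence_probability(words):
    p = unigrams[words[0]] / len(corpus)            # P(w_1)
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]  # P(w_i | w_{i-1})
    return p

print(sentence_probability(["the", "cat", "sat"]))  # 0.3 * (1/3) * 1 = 0.1
```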

Slide 23/51

The Trellis

Slide 24/51

Parameters of an HMM

States: a set of states S = {s_1, ..., s_n}.
Transition probabilities: A = a_{1,1}, a_{1,2}, ..., a_{n,n}; each a_{i,j} represents the probability of transitioning from state s_i to s_j.
Emission probabilities: a set B of functions of the form b_i(o_t), the probability of observation o_t being emitted by s_i.
Initial state distribution: \pi_i is the probability that s_i is a start state.
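A minimal container for \lambda = (A, B, \pi) as listed on this slide, reusing the urn example's \pi and A. The emission matrix B is an assumed example, not from the slides: row i gives b_i(v_k) for colors v_0 = red, v_1 = blue, v_2 = green. Later sketches build on this object.

```python
class HMM:
    def __init__(self, pi, A, B):
        self.pi = pi      # initial state distribution, pi_i
        self.A = A        # transition probabilities, a_ij
        self.B = B        # emission probabilities, b_i(k)
        self.n = len(pi)  # number of states

hmm = HMM(pi=[0.5, 0.2, 0.3],
          A=[[0.4, 0.3, 0.3], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]],
          B=[[0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.2, 0.2, 0.6]])
```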

Slide 25/51

The Three Basic HMM Problems

Problem 1 (Evaluation): Given the observation sequence O = o_1, ..., o_T and an HMM model \lambda = (A, B, \pi), how do we compute the probability of O given the model?

Problem 2 (Decoding): Given the observation sequence O = o_1, ..., o_T and an HMM model \lambda = (A, B, \pi), how do we find the state sequence that best explains the observations?

Slide 26/51

The Three Basic HMM Problems

Problem 3 (Learning): How do we adjust the model parameters \lambda = (A, B, \pi) to maximize P(O | \lambda)?

Slide 27/51

Problem 1: Probability of an Observation Sequence

What is P(O | \lambda)?

The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.

Naive computation is very expensive. Given T observations and N states, there are N^T possible state sequences.

Even small HMMs, e.g. T = 10 and N = 10, contain 10 billion different paths.

The solution to this and Problem 2 is to use dynamic programming.

Slide 28/51

Forward Probabilities

What is the probability that, given an HMM \lambda, at time t the state is i and the partial observation o_1, ..., o_t has been generated?

\alpha_t(i) = P(o_1 ... o_t, q_t = s_i | \lambda)

Slide 29/51

Forward Probabilities

\alpha_t(i) = P(o_1 ... o_t, q_t = s_i | \lambda)

\alpha_t(j) = [ \sum_{i=1}^N \alpha_{t-1}(i) a_{ij} ] b_j(o_t)

Slide 30/51

Forward Algorithm

Initialization:
\alpha_1(i) = \pi_i b_i(o_1),  1 \le i \le N

Induction:
\alpha_t(j) = [ \sum_{i=1}^N \alpha_{t-1}(i) a_{ij} ] b_j(o_t),  2 \le t \le T, 1 \le j \le N

Termination:
P(O | \lambda) = \sum_{i=1}^N \alpha_T(i)
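A direct transcription of the three steps above into Python, assuming the HMM container sketched earlier. Observations are symbol indices into the columns of B.

```python
def forward(hmm, obs):
    """alpha[t][i] = P(o_1..o_t, q_t = s_i | lambda)."""
    T, N = len(obs), hmm.n
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):                                # initialization
        alpha[0][i] = hmm.pi[i] * hmm.B[i][obs[0]]
    for t in range(1, T):                             # induction
        for j in range(N):
            alpha[t][j] = sum(alpha[t-1][i] * hmm.A[i][j]
                              for i in range(N)) * hmm.B[j][obs[t]]
    return alpha

obs = [0, 0, 2, 2]                                    # red, red, green, green
alpha = forward(hmm, obs)
print(sum(alpha[-1]))                                 # termination: P(O | lambda)
```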

Slide 31/51

Forward Algorithm Complexity

In the naive approach to solving Problem 1, it takes on the order of 2T \cdot N^T computations.

The forward algorithm takes on the order of N^2 T computations.

Slide 32/51

Backward Probabilities

Analogous to the forward probability, just in the other direction.

What is the probability that, given an HMM \lambda and given that the state at time t is i, the partial observation o_{t+1}, ..., o_T is generated?

\beta_t(i) = P(o_{t+1} ... o_T | q_t = s_i, \lambda)

Slide 33/51

Backward Probabilities

\beta_t(i) = P(o_{t+1} ... o_T | q_t = s_i, \lambda)

\beta_t(i) = \sum_{j=1}^N a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)

Slide 34/51

Backward Algorithm

Initialization:
\beta_T(i) = 1,  1 \le i \le N

Induction:
\beta_t(i) = \sum_{j=1}^N a_{ij} b_j(o_{t+1}) \beta_{t+1}(j),  t = T-1, ..., 1,  1 \le i \le N

Termination:
P(O | \lambda) = \sum_{i=1}^N \pi_i b_i(o_1) \beta_1(i)
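The backward counterpart, again following the initialization/induction/termination steps above; its termination value agrees with the forward algorithm's P(O | \lambda).

```python
def backward(hmm, obs):
    """beta[t][i] = P(o_{t+1}..o_T | q_t = s_i, lambda)."""
    T, N = len(obs), hmm.n
    beta = [[0.0] * N for _ in range(T)]
    beta[T-1] = [1.0] * N                             # initialization
    for t in range(T-2, -1, -1):                      # induction, t = T-1..1
        for i in range(N):
            beta[t][i] = sum(hmm.A[i][j] * hmm.B[j][obs[t+1]] * beta[t+1][j]
                             for j in range(N))
    return beta

beta = backward(hmm, obs)
print(sum(hmm.pi[i] * hmm.B[i][obs[0]] * beta[0][i]   # termination: same
          for i in range(hmm.n)))                     # P(O | lambda) as forward
```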

Slide 35/51

Problem 2: Decoding

The solution to Problem 1 (Evaluation) gives us the sum over all paths through an HMM efficiently.

For Problem 2, we want to find the path with the highest probability.

We want to find the state sequence Q = q_1, ..., q_T such that

Q = \argmax_{Q'} P(Q' | O, \lambda)

Slide 36/51

Viterbi Algorithm

Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum.

Forward:
\alpha_t(j) = [ \sum_{i=1}^N \alpha_{t-1}(i) a_{ij} ] b_j(o_t)

Viterbi recursion:
\delta_t(j) = [ \max_{1 \le i \le N} \delta_{t-1}(i) a_{ij} ] b_j(o_t)

Slide 37/51

Viterbi Algorithm

Initialization:
\delta_1(i) = \pi_i b_i(o_1),  1 \le i \le N

Induction:
\delta_t(j) = [ \max_{1 \le i \le N} \delta_{t-1}(i) a_{ij} ] b_j(o_t)
\psi_t(j) = \argmax_{1 \le i \le N} \delta_{t-1}(i) a_{ij},  2 \le t \le T, 1 \le j \le N

Termination:
p^* = \max_{1 \le i \le N} \delta_T(i)
q_T^* = \argmax_{1 \le i \le N} \delta_T(i)

Read out path:
q_t^* = \psi_{t+1}(q_{t+1}^*),  t = T-1, ..., 1
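A sketch of the full recursion above: identical in shape to the forward pass, with max in place of sum, plus backpointers \psi to read out the best path.

```python
def viterbi(hmm, obs):
    """Most likely state sequence for obs under hmm."""
    T, N = len(obs), hmm.n
    delta = [[0.0] * N for _ in range(T)]
    psi = [[0] * N for _ in range(T)]
    for i in range(N):                                # initialization
        delta[0][i] = hmm.pi[i] * hmm.B[i][obs[0]]
    for t in range(1, T):                             # induction
        for j in range(N):
            best = max(range(N), key=lambda i: delta[t-1][i] * hmm.A[i][j])
            psi[t][j] = best                          # backpointer
            delta[t][j] = delta[t-1][best] * hmm.A[best][j] * hmm.B[j][obs[t]]
    path = [max(range(N), key=lambda i: delta[T-1][i])]   # termination
    for t in range(T-1, 0, -1):                           # read out path
        path.append(psi[t][path[-1]])
    return list(reversed(path))

print(viterbi(hmm, obs))
```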

Slide 38/51

Problem 3: Learning

Up to now we've assumed that we know the underlying model \lambda = (A, B, \pi).

Often these parameters are estimated on annotated training data, which has two drawbacks:
Annotation is difficult and/or expensive.
Training data is different from the current data.

We want to maximize the parameters with respect to the current data, i.e., we're looking for a model \lambda' such that

\lambda' = \argmax_\lambda P(O | \lambda)

Slide 39/51

Problem 3: Learning

Unfortunately, there is no known way to analytically find a global maximum, i.e., a model \lambda' such that

\lambda' = \argmax_\lambda P(O | \lambda)

But it is possible to find a local maximum.

Given an initial model \lambda, we can always find a model \lambda' such that

P(O | \lambda') \ge P(O | \lambda)

Slide 40/51

Parameter Re-estimation

Use the forward-backward (or Baum-Welch) algorithm, which is a hill-climbing algorithm.

Using an initial parameter instantiation, the forward-backward algorithm iteratively re-estimates the parameters and improves the probability that the given observations are generated by the new parameters.

Slide 41/51

Parameter Re-estimation

Three parameters need to be re-estimated:
Initial state distribution: \pi_i
Transition probabilities: a_{i,j}
Emission probabilities: b_i(o_t)

Slide 42/51

Re-estimating Transition Probabilities

What's the probability of being in state s_i at time t and going to state s_j, given the current model and parameters?

\xi_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, \lambda)

Slide 43/51

Re-estimating Transition Probabilities

\xi_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, \lambda)

\xi_t(i, j) = \alpha_t(i) a_{i,j} b_j(o_{t+1}) \beta_{t+1}(j) / [ \sum_{i=1}^N \sum_{j=1}^N \alpha_t(i) a_{i,j} b_j(o_{t+1}) \beta_{t+1}(j) ]

Slide 44/51

Re-estimating Transition Probabilities

The intuition behind the re-estimation equation for transition probabilities is:

\hat{a}_{i,j} = (expected number of transitions from state s_i to state s_j) / (expected number of transitions from state s_i)

Formally:

\hat{a}_{i,j} = \sum_{t=1}^{T-1} \xi_t(i, j) / \sum_{t=1}^{T-1} \sum_{j'=1}^N \xi_t(i, j')

Slide 45/51

Re-estimating Transition Probabilities

Defining

\gamma_t(i) = \sum_{j=1}^N \xi_t(i, j)

as the probability of being in state s_i, given the complete observation O, we can say:

\hat{a}_{i,j} = \sum_{t=1}^{T-1} \xi_t(i, j) / \sum_{t=1}^{T-1} \gamma_t(i)
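A sketch of \xi and \gamma built from the forward and backward tables computed earlier; the last two lines show the re-estimation of one transition probability, \hat{a}_{0,1}.

```python
def xi_gamma(hmm, obs, alpha, beta):
    """xi[t][i][j] = P(q_t = s_i, q_{t+1} = s_j | O, lambda);
    gamma[t][i] = sum_j xi[t][i][j]."""
    T, N = len(obs), hmm.n
    p_obs = sum(alpha[T-1])                           # P(O | lambda), normalizer
    xi = [[[alpha[t][i] * hmm.A[i][j] * hmm.B[j][obs[t+1]] * beta[t+1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T-1)]
    gamma = [[sum(xi[t][i]) for i in range(N)] for t in range(T-1)]
    return xi, gamma

xi, gamma = xi_gamma(hmm, obs, alpha, beta)
a_01 = sum(xi[t][0][1] for t in range(len(obs)-1)) / \
       sum(gamma[t][0] for t in range(len(obs)-1))
```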

Slide 46/51

Review of Probabilities

Forward probability \alpha_t(i): the probability of being in state s_i, given the partial observation o_1, ..., o_t.

Backward probability \beta_t(i): the probability of being in state s_i, given the partial observation o_{t+1}, ..., o_T.

Transition probability \xi_t(i, j): the probability of going from state s_i to state s_j, given the complete observation o_1, ..., o_T.

State probability \gamma_t(i): the probability of being in state s_i, given the complete observation o_1, ..., o_T.

Slide 47/51

Re-estimating Initial State Probabilities

Initial state distribution: \pi_i is the probability that s_i is a start state.

Re-estimation is easy:
\hat{\pi}_i = expected number of times in state s_i at time 1

Formally:
\hat{\pi}_i = \gamma_1(i)

Slide 48/51

Re-estimation of Emission Probabilities

Emission probabilities are re-estimated as:

\hat{b}_i(k) = (expected number of times in state s_i observing symbol v_k) / (expected number of times in state s_i)

Formally:

\hat{b}_i(k) = \sum_{t=1}^T \delta(o_t, v_k) \gamma_t(i) / \sum_{t=1}^T \gamma_t(i)

where \delta(o_t, v_k) = 1 if o_t = v_k, and 0 otherwise.

Note that \delta here is the Kronecker delta function and is not related to the \delta in the discussion of the Viterbi algorithm!

Slide 49/51

The Updated Model

Coming from \lambda = (A, B, \pi) we get to \lambda' = (\hat{A}, \hat{B}, \hat{\pi}) by the following update rules:

\hat{\pi}_i = \gamma_1(i)

\hat{a}_{i,j} = \sum_{t=1}^{T-1} \xi_t(i, j) / \sum_{t=1}^{T-1} \gamma_t(i)

\hat{b}_i(k) = \sum_{t=1}^T \delta(o_t, v_k) \gamma_t(i) / \sum_{t=1}^T \gamma_t(i)
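The three update rules can be combined into a single re-estimation step, sketched below using the earlier forward, backward, and xi_gamma helpers. Note that the \gamma used for \hat{b}_i(k) must run over all T time steps, so it is recomputed here as \alpha_t(i) \beta_t(i) / P(O | \lambda) rather than from \xi. In practice one iterates this step until P(O | \lambda) stops improving.

```python
def baum_welch_step(hmm, obs):
    """One forward-backward (Baum-Welch) update of pi, A, B in place."""
    T, N = len(obs), hmm.n
    alpha, beta = forward(hmm, obs), backward(hmm, obs)
    p_obs = sum(alpha[T-1])
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)]
             for t in range(T)]                            # all T steps
    xi, _ = xi_gamma(hmm, obs, alpha, beta)
    hmm.pi = gamma[0][:]                                   # pi_i = gamma_1(i)
    for i in range(N):
        denom = sum(gamma[t][i] for t in range(T-1))
        for j in range(N):                                 # a_ij update
            hmm.A[i][j] = sum(xi[t][i][j] for t in range(T-1)) / denom
        total = sum(gamma[t][i] for t in range(T))
        for k in range(len(hmm.B[i])):                     # b_i(k) update
            hmm.B[i][k] = sum(gamma[t][i] for t in range(T)
                              if obs[t] == k) / total
    return p_obs

baum_welch_step(hmm, obs)   # repeat until P(O | lambda) stops improving
```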

Slide 50/51

Expectation Maximization

The forward-backward algorithm is an instance of the more general EM algorithm.

The E step: compute the forward and backward probabilities for a given model.

The M step: re-estimate the model parameters.

Slide 51/51

Exercise

Programming with the Viterbi algorithm.

Apply HMM for part-of-speech tagging.