Lecture 13 (Xing / Fei-Fei)


Transcript of Lecture 13 (Xing / Fei-Fei)

  • Machine Learning

    Structured Models: Hidden Markov Models versus Conditional Random Fields

    Eric Xing, Lecture 13, August 15, 2010

    Reading:

  • From static to dynamic mixture models

    Static mixture vs. dynamic mixture (figure: graphical models with hidden states $Y_1, \ldots, Y_T$ and observations $X_1, \ldots, X_T$)

    The underlying source: speech signal, dice, ...

    The sequence: phonemes, sequence of rolls, ...

  • Hidden Markov Model

    Observation space. Alphabetic set: $\mathcal{C} = \{c_1, c_2, \ldots, c_K\}$; Euclidean space: $\mathbb{R}^d$

    Index set of hidden states: $\mathcal{I} = \{1, 2, \ldots, M\}$

    (figure: graphical model / state automaton with hidden states $y_1, \ldots, y_T$ emitting observations $x_1, \ldots, x_T$)

    Transition probabilities between any two states:
    $p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}$, or
    $p(y_t \mid y_{t-1}^i = 1) \sim \mathrm{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}), \; \forall i \in \mathcal{I}$

    Start probabilities:
    $p(y_1) \sim \mathrm{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M)$

    Emission probabilities associated with each state:
    $p(x_t \mid y_t^i = 1) \sim \mathrm{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}), \; \forall i \in \mathcal{I}$,
    or in general: $p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i), \; \forall i \in \mathcal{I}$

    (A minimal numerical sketch of this parameterization follows this slide.)
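A minimal sketch of this parameterization in Python with discrete (multinomial) emissions. The state/symbol counts and probability values are invented for illustration, and the variable names `pi`, `A`, `B` are not from the lecture:

```python
import numpy as np

M, K, T = 3, 4, 5   # number of hidden states, observation symbols, and time steps (made up)

pi = np.array([0.5, 0.3, 0.2])            # start probabilities pi_i = p(y_1 = i)
A = np.array([[0.8, 0.1, 0.1],            # transition probabilities a_{i,j} = p(y_t = j | y_{t-1} = i)
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])
B = np.array([[0.7, 0.1, 0.1, 0.1],       # emission probabilities b_{i,k} = p(x_t = k | y_t = i)
              [0.1, 0.6, 0.2, 0.1],
              [0.1, 0.1, 0.2, 0.6]])

rng = np.random.default_rng(0)
y = [rng.choice(M, p=pi)]                 # sample a hidden state sequence ...
x = [rng.choice(K, p=B[y[0]])]            # ... and the corresponding observation sequence
for t in range(1, T):
    y.append(rng.choice(M, p=A[y[-1]]))
    x.append(rng.choice(K, p=B[y[-1]]))
print(y, x)
```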

  • Probability of a Parse

    Given a sequence $x = x_1 \ldots x_T$ and a parse $y = y_1, \ldots, y_T$, how likely is the parse (given our HMM and the sequence)?

    Joint probability:
    $p(x, y) = p(x_1 \ldots x_T,\; y_1, \ldots, y_T)$
    $= p(y_1)\, p(x_1 \mid y_1)\, p(y_2 \mid y_1)\, p(x_2 \mid y_2) \cdots p(y_T \mid y_{T-1})\, p(x_T \mid y_T)$
    $= p(y_1)\, p(y_2 \mid y_1) \cdots p(y_T \mid y_{T-1}) \times p(x_1 \mid y_1)\, p(x_2 \mid y_2) \cdots p(x_T \mid y_T)$

    Marginal probability:
    $p(x) = \sum_{y} p(x, y) = \sum_{y_1} \sum_{y_2} \cdots \sum_{y_T} \pi_{y_1} \prod_{t=2}^{T} a_{y_{t-1}, y_t} \prod_{t=1}^{T} p(x_t \mid y_t)$

    Posterior probability: $p(y \mid x) = p(x, y) / p(x)$

    (A sketch of these computations in code follows this slide.)
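A hedged sketch of the joint and marginal computations above for a discrete HMM, reusing the `pi`, `A`, `B` arrays from the earlier sketch; evaluating the marginal sum with the forward algorithm is a standard choice, not something spelled out on this slide:

```python
import numpy as np

def joint_prob(x, y, pi, A, B):
    """p(x, y) = p(y1) p(x1|y1) * prod_{t>1} p(y_t|y_{t-1}) p(x_t|y_t)."""
    p = pi[y[0]] * B[y[0], x[0]]
    for t in range(1, len(x)):
        p *= A[y[t - 1], y[t]] * B[y[t], x[t]]
    return p

def marginal_prob(x, pi, A, B):
    """p(x) = sum_y p(x, y), via the forward algorithm instead of enumerating all M^T parses."""
    alpha = pi * B[:, x[0]]                  # alpha_1(i) = pi_i * b_{i, x_1}
    for t in range(1, len(x)):
        alpha = (alpha @ A) * B[:, x[t]]     # alpha_t(j) = (sum_i alpha_{t-1}(i) a_{i,j}) * b_{j, x_t}
    return float(alpha.sum())

# Posterior probability of a parse: p(y | x) = joint_prob(x, y, pi, A, B) / marginal_prob(x, pi, A, B)
```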

  • Shortcomings of Hidden Markov Models

    (figure: HMM graphical model with states $Y_1, \ldots, Y_n$ and observations $X_1, \ldots, X_n$)

    HMMs capture dependences between each state and only its corresponding observation.

    NLP example: in a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental states), but also on (non-local) features of the whole line, such as line length, indentation, amount of white space, etc.

    Mismatch between the learning objective function and the prediction objective function: an HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task we need the conditional probability P(Y | X).

  • Recall: Generative vs. Discriminative Classifiers

    Goal: wish to learn $f: X \to Y$, e.g., $P(Y \mid X)$.

    Generative classifiers (e.g., Naïve Bayes):
    Assume some functional form for $P(X \mid Y)$ and $P(Y)$: this is a generative model of the data.
    Estimate the parameters of $P(X \mid Y)$ and $P(Y)$ directly from training data.
    Use Bayes' rule to calculate $P(Y \mid X = x)$.

    Discriminative classifiers (e.g., logistic regression):
    Directly assume some functional form for $P(Y \mid X)$: this is a discriminative model of the data.
    Estimate the parameters of $P(Y \mid X)$ directly from training data.

  • Structured Conditional Models

    Model the conditional probability P(label sequence y | observation sequence x) rather than the joint probability P(y, x): specify the probability of possible label sequences given an observation sequence.

    Allow arbitrary, non-independent features on the observation sequence X.

    The probability of a transition between labels may depend on past and future observations.

    Relax the strong independence assumptions made in generative models.

  • Conditional Distribution

    If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, by the Hammersley-Clifford theorem of random fields is:

    $p_\theta(\mathbf{y} \mid \mathbf{x}) \propto \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, \mathbf{y}|_e, \mathbf{x}) \;+\; \sum_{v \in V,\, k} \mu_k g_k(v, \mathbf{y}|_v, \mathbf{x}) \Big)$

    x is a data sequence; y is a label sequence.
    v is a vertex from the vertex set V = set of label random variables; e is an edge from the edge set E over V.
    $f_k$ and $g_k$ are given and fixed: $g_k$ is a Boolean vertex feature and $f_k$ is a Boolean edge feature.
    k is the number of features; $\theta = (\lambda_1, \lambda_2, \ldots, \lambda_n;\ \mu_1, \mu_2, \ldots, \mu_k)$, where the $\lambda_k$ and $\mu_k$ are parameters to be estimated.
    $\mathbf{y}|_e$ is the set of components of y defined by edge e; $\mathbf{y}|_v$ is the set of components of y defined by vertex v.

    (A toy example of such feature functions follows this slide.)
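To make the Boolean vertex and edge features concrete, here is a small illustrative sketch for a toy part-of-speech-style labeling task; the tag names and the capitalization test are invented for illustration and are not from the lecture:

```python
def g_capitalized_noun(v, y_v, x):
    """Vertex feature g_k: fires when position v is labeled NOUN and the observed word is capitalized."""
    return float(y_v == "NOUN" and x[v][:1].isupper())

def f_det_then_noun(e, y_e, x):
    """Edge feature f_k on e = (v, v+1): fires on a DET -> NOUN label transition, regardless of the words."""
    y_prev, y_curr = y_e
    return float(y_prev == "DET" and y_curr == "NOUN")

# Each g_k is weighted by mu_k and each f_k by lambda_k; the weighted sums over all vertices,
# edges, and features are exponentiated and normalized by Z(x), as in the formula above.
```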

  • Conditional Random Fields

    (figure: linear-chain CRF over $Y_1, \ldots, Y_n$ conditioned on $\mathbf{x}_{1:n}$)

    A CRF is a partially directed model:
    a discriminative model,
    with a global normalizer $Z(\mathbf{x})$,
    that models the dependence between each state and the entire observation sequence.

  • Conditional Random Fields

    General parametric form (linear chain, in the notation of the earlier slides):

    $p_\theta(\mathbf{y} \mid \mathbf{x}) = \dfrac{1}{Z(\mathbf{x}, \lambda, \mu)} \exp\Big( \sum_{t=1}^{T} \Big[ \sum_k \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}) + \sum_k \mu_k g_k(y_t, \mathbf{x}) \Big] \Big)$

  • Conditional Random Fields

    $p(\mathbf{y} \mid \mathbf{x}) = \dfrac{1}{Z(\mathbf{x})} \exp\Big( \sum_{c} \lambda_c f_c(\mathbf{x}, \mathbf{y}_c) \Big)$

    Allow arbitrary dependencies on the input.
    Clique dependencies on the labels.
    Use approximate inference for general graphs.

  • CRFs: Inference

    Given CRF parameters $\lambda$ and $\mu$, find the $\mathbf{y}^*$ that maximizes $P(\mathbf{y} \mid \mathbf{x})$.

    $Z(\mathbf{x})$ can be ignored because it is not a function of $\mathbf{y}$.

    Run the max-product algorithm on the junction tree of the CRF, whose cliques are $\{Y_1, Y_2\}, \{Y_2, Y_3\}, \ldots, \{Y_{n-1}, Y_n\}$ with separators $Y_2, \ldots, Y_{n-1}$: this is the same as the Viterbi decoding used in HMMs. (A decoding sketch follows this slide.)
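A minimal Viterbi (max-product) sketch for a linear-chain CRF. It assumes the weighted features have already been collapsed into per-position and per-transition score matrices; that precomputation, and the assumption that transition scores do not vary with t, are illustrative choices, not part of the slide:

```python
import numpy as np

def viterbi(node, edge):
    """Max-product / Viterbi decoding for a linear-chain CRF in the log domain (so Z(x) never appears).

    node[t, i] : sum_k mu_k * g_k(y_t = i, x)                   (per-position label scores)
    edge[i, j] : sum_k lambda_k * f_k(y_{t-1} = i, y_t = j, x)  (label-transition scores)
    Returns the labeling maximizing sum_t node[t, y_t] + sum_{t>0} edge[y_{t-1}, y_t].
    """
    T, M = node.shape
    delta = node[0].copy()                        # best score of any prefix ending in label i at t = 0
    back = np.zeros((T, M), dtype=int)            # backpointers
    for t in range(1, T):
        scores = delta[:, None] + edge + node[t]  # scores[i, j]: extend the best prefix ending in i by label j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0)
    y = [int(delta.argmax())]                     # backtrace from the best final label
    for t in range(T - 1, 0, -1):
        y.append(int(back[t, y[-1]]))
    return y[::-1]
```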

  • CRF Learning

    Given training data $\{(\mathbf{x}_d, \mathbf{y}_d)\}_{d=1}^{N}$, find $\lambda^*, \mu^*$ that maximize the conditional log-likelihood.

    Computing the gradient w.r.t. $\lambda$: the gradient of the log-partition function of an exponential family is the expectation of its sufficient statistics.

    (A reconstruction of the objective and its gradient follows this slide.)
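A hedged reconstruction, in the notation of the earlier slides, of the objective and of the gradient fact the slide appeals to; the exact expressions on the original slide are not recoverable from this transcript:

```latex
% Conditional log-likelihood to be maximized over (lambda, mu)
\ell(\lambda, \mu)
  = \sum_{d=1}^{N} \Big[ \sum_{t} \Big( \sum_k \lambda_k f_k(y_{d,t}, y_{d,t-1}, \mathbf{x}_d)
      + \sum_k \mu_k g_k(y_{d,t}, \mathbf{x}_d) \Big) - \log Z(\mathbf{x}_d, \lambda, \mu) \Big]

% Gradient of the log-partition term: the expected feature count under the model, so
% d(ell)/d(lambda_k) = (empirical count of f_k) - (expected count of f_k under p(y | x_d)).
\frac{\partial \log Z(\mathbf{x}_d, \lambda, \mu)}{\partial \lambda_k}
  = \mathbb{E}_{p_{\lambda,\mu}(\mathbf{y} \mid \mathbf{x}_d)} \Big[ \sum_{t} f_k(y_t, y_{t-1}, \mathbf{x}_d) \Big]
```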

  • CRFs: some empirical results

    Comparison of error rates on synthetic data.

    (figure: scatter plots of CRF error vs. HMM error, MEMM error vs. HMM error, and MEMM error vs. CRF error; the data is increasingly higher-order in the direction of the arrow)

    CRFs achieve the lowest error rate for higher-order data.

  • CRFs: some empirical results

    Parts-of-speech tagging:
    Using the same set of features: HMM >=< CRF > MEMM
    Using additional overlapping features: CRF+ > MEMM+ >> HMM

  • Summary

    A Conditional Random Field is a discriminative structured input-output model.

    An HMM is a generative structured input-output model.

    Their strengths and weaknesses are complementary.

  • Conditional Random Fields: applications in vision

    L. Fei-Fei
    Computer Science Dept.
    Stanford University

  • Machine learning in computer vision

    Aug 15, Lecture 13: Conditional Random Field
    Applications in computer vision: image labeling, segmentation, object recognition & image annotation

  • Image labeling (slides courtesy of Sanjiv Kumar (Google))

    Reference: Sanjiv Kumar & Martial Hebert, "Discriminative Random Fields: A Discriminative Framework for Contextual Interaction in Classification," ICCV, 2003.

  • Context helps visual recognition

    Context from larger neighborhoods!

  • Context helps visual recognition

    Context from the whole image!

  • Contextual Interactions

    (figure: a scene with regions labeled sky, building, vegetation, and water)

    Pixel-Pixe