Cyclone: Safe Programming at the C Level of...

41
Machine Learning CSE 6363 (Fall 2016) Lecture 20 Conditional Random Field Heng Huang, Ph.D. Department of Computer Science and Engineering

Transcript of Cyclone: Safe Programming at the C Level of...

Page 1: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Machine Learning CSE 6363 (Fall 2016)

Lecture 20 Conditional Random Field

Heng Huang, Ph.D.

Department of Computer Science and Engineering

Page 2: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 2

Generative Models

• Hidden Markov models (HMMs) and stochastic grammars

– Assign a joint probability to paired observation and label sequences

– The parameters typically trained to maximize the joint likelihood of train

examples

Page 3: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 3

Conditional Models

• Conditional probability P(label sequence y | observation

sequence x) rather than joint probability P(y, x)

– Specify the probability of possible label sequences given an

observation sequence

• Allow arbitrary, non-independent features on the observation

sequence X

• The probability of a transition between labels may depend on past

and future observations

– Relax strong independence assumptions in generative models

Page 4: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 4

Implications of the model

• Does this do what we want?

• Q: does Y[i-1] depend on X[i+1] ?

– “a nodes is conditionally independent of its non-

descendents given its parents”

Page 5: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 5

Discriminative Models

Maximum Entropy Markov Models (MEMMs)

• Exponential model

• Given training set X with label sequence Y:

– Train a model θ that maximizes P(Y|X, θ)

– For a new data sequence x, the predicted label y maximizes P(y|x, θ)

– Notice the per-state normalization

L(Θy)= Σ log pΘy(yi|xi)

Page 6: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 6

MEMM-Supervise Learning

Page 7: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 7

MEMM-Linear Regression

Page 8: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 8

MEMM-Linear Regression

Page 9: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 9

MEMM-Logistic Regression

Page 10: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 10

MEMM-Logistic Regression/Classification

Page 11: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 11

MEMM-Logistic Regression/Learning

Page 12: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 12

MEMM-Maximum Entropy Model

Page 13: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 13

HMMs and MEMMs

Page 14: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Maximum Entropy Markov Models (MEMMs)

Fall 2016 Heng Huang Machine Learning 14

Page 15: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

HMM vs MEMM

Fall 2016 Heng Huang Machine Learning 15

Page 16: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 16

Maximum Entropy Markov Model (MEMM)

• Models dependence between each state and the full observation sequence

explicitly

– More expressive than HMMs

• Discriminative model

– Completely ignores modeling P(X): saves modeling effort

– Learning objective function consistent with predictive function: P(Y|X)

Y1 Y2 … … … Yn

X1:n

Page 17: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Viterbi in MEMMs

Fall 2016 Heng Huang Machine Learning 17

Page 18: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 18

MEMM: Label Bias Problem

State 1

State 2

State 3

State 4

State 5

Observation 1 Observation 2 Observation 3 Observation 4 0.4

0.6 0.2

0.2

0.2

0.2

0.2

0.45

0.55 0.2

0.3

0.1

0.1

0.3

0.5

0.5 0.1

0.3

0.2

0.2

0.2

What the local transition probabilities say:

• State 1 almost always prefers to go to state 2

• State 2 almost always prefer to stay in state 2

Page 19: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 19

State 1

State 2

State 3

State 4

State 5

Observation 1 Observation 2 Observation 3 Observation 4

0.4

0.6 0.2

0.2

0.2

0.2

0.2

0.45

0.55 0.2

0.3

0.1

0.1

0.3

0.5

0.5 0.1

0.3

0.2

0.2

0.2

Probability of path 1-> 1-> 1-> 1:

• 0.4 x 0.45 x 0.5 = 0.09

MEMM: Label Bias Problem

Page 20: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 20

State 1

State 2

State 3

State 4

State 5

Observation 1 Observation 2 Observation 3 Observation 4

0.4

0.6 0.2

0.2

0.2

0.2

0.2

0.45

0.55 0.2

0.3

0.1

0.1

0.3

0.5

0.5 0.1

0.3

0.2

0.2

0.2

Probability of path 2->2->2->2 :

• 0.2 X 0.3 X 0.3 = 0.018 Other paths:

1-> 1-> 1-> 1: 0.09

MEMM: Label Bias Problem

Page 21: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 21

State 1

State 2

State 3

State 4

State 5

Observation 1 Observation 2 Observation 3 Observation 4

0.4

0.6 0.2

0.2

0.2

0.2

0.2

0.45

0.55 0.2

0.3

0.1

0.1

0.3

0.5

0.5 0.1

0.3

0.2

0.2

0.2

Probability of path 1->2->1->2:

• 0.6 X 0.2 X 0.5 = 0.06 Other paths:

1->1->1->1: 0.09

2->2->2->2: 0.018

MEMM: Label Bias Problem

Page 22: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 22

State 1

State 2

State 3

State 4

State 5

Observation 1 Observation 2 Observation 3 Observation 4

0.4

0.6 0.2

0.2

0.2

0.2

0.2

0.45

0.55 0.2

0.3

0.1

0.1

0.3

0.5

0.5 0.1

0.3

0.2

0.2

0.2

Probability of path 1->1->2->2:

• 0.4 X 0.55 X 0.3 = 0.066 Other paths:

1->1->1->1: 0.09

2->2->2->2: 0.018

1->2->1->2: 0.06

MEMM: Label Bias Problem

Page 23: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 23

State 1

State 2

State 3

State 4

State 5

Observation 1 Observation 2 Observation 3 Observation 4

0.4

0.6 0.2

0.2

0.2

0.2

0.2

0.45

0.55 0.2

0.3

0.1

0.1

0.3

0.5

0.5 0.1

0.3

0.2

0.2

0.2

Most Likely Path: 1-> 1-> 1-> 1

• Although locally it seems state 1 wants to go to state 2 and state 2 wants to remain in state 2.

• why?

MEMM: Label Bias Problem

Page 24: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 24

State 1

State 2

State 3

State 4

State 5

Observation 1 Observation 2 Observation 3 Observation 4

0.4

0.6 0.2

0.2

0.2

0.2

0.2

0.45

0.55 0.2

0.3

0.1

0.1

0.3

0.5

0.5 0.1

0.3

0.2

0.2

0.2

Most Likely Path: 1-> 1-> 1-> 1

• State 1 has only two transitions but state 2 has 5:

• Average transition probability from state 2 is lower

MEMM: Label Bias Problem

Page 25: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 25

State 1

State 2

State 3

State 4

State 5

Observation 1 Observation 2 Observation 3 Observation 4

0.4

0.6 0.2

0.2

0.2

0.2

0.2

0.45

0.55 0.2

0.3

0.1

0.1

0.3

0.5

0.5 0.1

0.3

0.2

0.2

0.2

Label bias problem in MEMM:

• Preference of states with lower number of transitions over others

MEMM: Label Bias Problem

Page 26: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 26

Solution: Do not normalize probabilities locally

State 1

State 2

State 3

State 4

State 5

Observation 1 Observation 2 Observation 3 Observation 4 0.4

0.6 0.2

0.2

0.2

0.2

0.2

0.45

0.55 0.2

0.3

0.1

0.1

0.3

0.5

0.5 0.1

0.3

0.2

0.2

0.2

From local probabilities ….

Page 27: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 27

Solution: Do not normalize probabilities locally

State 1

State 2

State 3

State 4

State 5

Observation 1 Observation 2 Observation 3 Observation 4 20

30 10

20

10

20

20

30

20 20

30

10

10

30

5

5 10

30

20

20

20

From local probabilities to local potentials

• States with lower transitions do not have an unfair advantage!

Page 28: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 28

From MEMM ….

Y1 Y2 … … … Yn

X1:n

Page 29: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 29

• CRF is a partially directed model

– Discriminative model like MEMM

– Usage of global normalizer Z(x) overcomes the label bias problem of

MEMM

– Models the dependence between each state and the entire observation

sequence (like MEMM)

From MEMM to CRF

Y1 Y2 … … … Yn

x1:n

Page 30: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 30

Conditional Random Fields

• General parametric form:

Y1 Y2 … … … Yn

x1:n

Page 31: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Ramesh Nallapati

31

CRFs: Inference

• Given CRF parameters and , find the y* that maximizes P(y|x)

– Can ignore Z(x) because it is not a function of y

• Run the max-product algorithm on the junction-tree of CRF:

Y1 Y2 … … … Yn

x1:n

Y1,Y2 Y2,Y3 ……. Yn-2,Yn-1 Yn-1,Yn

Y2 Y3 Yn-2 Yn-1

Same as Viterbi decoding used

in HMMs!

Page 32: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 32

Recall: Random Field

Page 33: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 33

Conditional Random Fields (CRFs)

• CRFs have all the advantages of MEMMs without

label bias problem – MEMM uses per-state exponential model for the conditional

probabilities of next states given the current state

– CRF has a single exponential model for the joint probability of the

entire sequence of labels given the observation sequence

• Undirected acyclic graph

• Allow some transitions “vote” more strongly than others

depending on the corresponding observations

Page 34: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 34

Definition of CRFs

X is a random variable over data sequences to be labeled

Y is a random variable over corresponding label sequences

Page 35: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 35

Example of CRFs

Page 36: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 36

Graphical comparison among

HMMs, MEMMs and CRFs

HMM MEMM CRF

Page 37: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Functional Models

Fall 2016 Heng Huang Machine Learning 37

Page 38: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 38

x is a data sequence

y is a label sequence

v is a vertex from vertex set V = set of label random variables

e is an edge from edge set E over V

fk and gk are given and fixed. gk is a Boolean vertex feature; fk is a

Boolean edge feature

k is the number of features

are parameters to be estimated

y|e is the set of components of y defined by edge e

y|v is the set of components of y defined by vertex v

1 2 1 2( , , , ; , , , ); andn n k k

Conditional Distribution

If the graph G = (V, E) of Y is a tree, the conditional distribution over the

label sequence Y = y, given X = x, by fundamental theorem of random

fields is:

(y | x) exp ( , y | , x) ( , y | , x)

k k e k k v

e E,k v V ,k

p f e g v

Page 39: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Training of CRFs (From Prof. Dietterich)

log ( | )( , y | , x) ( , y | , x) log (x)

k k e k k v

e E,k v V ,k

p y xf e g v Z

log ( | ) ( , y | , x) ( , y | , x) log (x)k k e k k v

e E,k v V ,k

p y x f e g v Z

• First, we take the log of the equation

• Then, take the derivative of the above equation

• For training, the first 2 items are easy to get.

• For example, for each k, fk is a sequence of Boolean numbers, such

as 00101110100111.

is just the total number of 1’s in the sequence. ( , y | , x)k k ef e

• The hardest thing is how to calculate Z(x)

Fall 2016 Heng Huang Machine Learning 39

Page 40: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 40

Another Definition

CRF is a Markov Random Fields.

By the Hammersley-Clifford theorem, the probability of a label can be

expressed as a Gibbs distribution, so that

What is clique?

|

1

1( | , , ) exp( ( , ))

( , ) ( , , )

j j

j

n

j j c

i

p y x F y xZ

F y x f y x i

| |

1( | , , ) exp( ( , , ) ( , , ))j j e k k s

j k

p y x t y x i s y x iZ

By only taking consideration of the one node and two nodes cliques, we

have

clique

Page 41: Cyclone: Safe Programming at the C Level of Abstractionranger.uta.edu/~heng/CSE6363_slides/Lecture20_2016.pdf · MEMM-Linear Regression . ... MEMM-Logistic Regression/Learning . Fall

Fall 2016 Heng Huang Machine Learning 41

Definition (cont.)

Moreover, let us consider the problem in a first-order chain model, we

have

For simplifying description, let fj(y, x) denote tj(yi-1, yi, x, i) and sk(yi, x, i)

|

1

1( | , , ) exp( ( , ))

( , ) ( , , )

j j

j

n

j j c

i

p y x F y xZ

F y x f y x i

1

1( | , , ) exp( ( , , , ) ( , , ))j j i i k k i

j k

p y x t y y x i s y x iZ