Learning Seminar, 2004
Conditional Random Fields: Probabilistic Models
for Segmenting and Labeling Sequence Data
J. Lafferty, A. McCallum, F. Pereira
Presentation: Inna Weiner
Outline
• Labeling sequence data problem
• Classification with probabilistic models: generative and discriminative
  – Why HMMs and MEMMs are not good enough
• Conditional Random Field model
• Experimental Results
Labeling Sequence Data Problem
• X is a random variable over data sequences
• Y is a random variable over label sequences
• Each Yi is assumed to range over a finite label alphabet A
• The problem:
  – Learn how to assign labels from a closed set Y to a data sequence X
• Example:
X: Thinking is being (x1 x2 x3)
Y: noun verb noun (y1 y2 y3)
Labeling Sequence Data Problem
• The lab setup: Let a monkey do some behavioral task while recording movement and neural activity
• Motor task: reach to target
• Goal: map neural activity to behavior
• In our notation:
  – X: neural data
  – Y: hand movements
Generative Probabilistic Models
• Learning problem: choose Θ to maximize the joint likelihood of the training examples:
L(Θ) = Σi log pΘ(yi, xi)
• Classification: predict the most likely labels under the learned model
y* = argmaxy pΘ(y|x) = argmaxy pΘ(y,x)/p(x)
• Requires modeling all possible observation sequences
Markov Model
• A Markov process or model assumes that we can predict the future based just on the present (or on a limited horizon into the past):
• Let {X1,…,XT} be a sequence of random variables taking values in {1,…,N}. Then the Markov properties are:
• Limited horizon: P(Xt+1|X1,…,Xt) = P(Xt+1|Xt)
• Time invariant (stationary): P(Xt+1|Xt) = P(X2|X1) for all t
Describing a Markov Chain
• A Markov chain can be described by the transition matrix A and the initial probabilities q:
Aij = P(Xt+1=j|Xt=i)
qi = P(X1=i)
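To make this concrete, here is a minimal numpy sketch (my illustration, with a made-up A and q) that samples a sequence from a Markov chain:

```python
import numpy as np

# Toy 3-state chain: A[i, j] = P(X_{t+1}=j | X_t=i), q[i] = P(X_1=i).
A = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
q = np.array([0.5, 0.3, 0.2])

def sample_chain(A, q, T, seed=0):
    """Draw X_1,...,X_T from the Markov chain described by (A, q)."""
    rng = np.random.default_rng(seed)
    x = [rng.choice(len(q), p=q)]            # X_1 ~ q
    for _ in range(T - 1):
        x.append(rng.choice(A.shape[0], p=A[x[-1]]))  # X_{t+1} ~ A[X_t]
    return x

print(sample_chain(A, q, T=10))
```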
Hidden Markov Model
• In a Hidden Markov Model (HMM) we do not observe the sequence that the model passed through (X) but only some probabilistic function of it (Y). Thus, it is a Markov model with the addition of emission probabilities:
Bik = P(Yt = k|Xt = i)
The Three Problems of HMM
• Likelihood: Given a series of observations y and a model λ = {A,B,q}, compute the likelihood p(y| λ)
• Inference: Given a series of observations y and a model λ, compute the most likely series of hidden states x.
• Learning: Given a series of observations, learn the best model λ
Likelihood in HMMs
• Given a model λ = {A,B,q}, we can compute the likelihood by summing over all hidden state sequences x:
P(y) = p(y|λ) = Σx p(x)p(y|x) = Σx q(x1) Π A(xt+1|xt) Π B(yt|xt)
• But … this computation has complexity O(N^T), where N is the number of states, which is impossible in practice (see the brute-force sketch below)
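A brute-force sketch of this sum over all N^T paths (my illustration, with a made-up two-state model), which shows why the naive computation is infeasible for realistic T:

```python
import itertools
import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])  # A[i, j] = P(X_{t+1}=j | X_t=i)
B = np.array([[0.9, 0.1], [0.2, 0.8]])  # B[i, k] = P(Y_t=k | X_t=i)
q = np.array([0.6, 0.4])                # q[i] = P(X_1=i)

def likelihood_bruteforce(y, A, B, q):
    """P(y) = sum over all N**T hidden paths x of p(x) p(y|x)."""
    N, T = A.shape[0], len(y)
    total = 0.0
    for x in itertools.product(range(N), repeat=T):  # N**T paths!
        p = q[x[0]] * B[x[0], y[0]]
        for t in range(1, T):
            p *= A[x[t - 1], x[t]] * B[x[t], y[t]]
        total += p
    return total

print(likelihood_bruteforce([0, 1, 0, 0], A, B, q))
```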
Forward-Backward algorithm
• To compute the likelihood we need to enumerate over all paths in the lattice (all possible instantiations of X1…XT)
• But … some starting sub-path (blue in the slide figure) is common to many continuing paths (blue + red)
• The idea: using dynamic programming, calculate a path in terms of shorter sub-paths
Forward-Backward algorithm (cont’d)
• We build a matrix of the probability of being at state i at time t: αt(i) = P(xt=i, y1y2…yt). Each column is a function of the previous column (forward procedure):
α1(i) = q(i) B(y1|i),  αt+1(j) = [Σi αt(i) A(j|i)] B(yt+1|j)
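A minimal sketch of the forward procedure (same toy A, B, q as in the brute-force example); each step costs O(N^2), so the whole likelihood is O(N^2 T):

```python
import numpy as np

def forward(y, A, B, q):
    """alpha[t, i] = P(x_t = i, y_1..y_t), filled column by column."""
    T, N = len(y), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = q * B[:, y[0]]                      # alpha_1(i) = q(i) B(y_1|i)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]
    return alpha

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
q = np.array([0.6, 0.4])
# P(y) = sum_i alpha_T(i); agrees with the brute-force value above.
print(forward([0, 1, 0, 0], A, B, q)[-1].sum())
```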
Forward-Backward algorithm (cont’d)
• We can similarly define a backward procedure for filling the matrix βt(i) = P(yt+1…yT|xt=i):
βT(i) = 1,  βt(i) = Σj A(j|i) B(yt+1|j) βt+1(j)
Forward-Backward algorithm (cont’d)
• And we can easily combine the two:
P(y, xt=i) = P(xt=i, y1y2…yt) · P(yt+1…yT|xt=i) = αt(i) βt(i)
• And then, for any t, we get:
P(y) = Σi P(y, xt=i) = Σi αt(i) βt(i)
• Summary: we presented a polynomial-time algorithm for computing likelihood in HMMs (sketched below)
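A sketch of the backward procedure plus the combination check (same toy model as above); every time step t yields the same P(y):

```python
import numpy as np

def forward(y, A, B, q):
    alpha = np.zeros((len(y), len(q)))
    alpha[0] = q * B[:, y[0]]
    for t in range(1, len(y)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]
    return alpha

def backward(y, A, B):
    """beta[t, i] = P(y_{t+1}..y_T | x_t = i), filled right to left."""
    beta = np.ones((len(y), A.shape[0]))            # beta_T(i) = 1
    for t in range(len(y) - 2, -1, -1):
        beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])
    return beta

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
q = np.array([0.6, 0.4])
y = [0, 1, 0, 0]
alpha, beta = forward(y, A, B, q), backward(y, A, B)
# P(y) = sum_i alpha_t(i) beta_t(i), identical for every t.
print([round(float((alpha[t] * beta[t]).sum()), 10) for t in range(len(y))])
```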
HMM – why not?
• Advantages:
  – Estimation is very easy: closed-form solution
  – The parameters can be estimated with relatively high confidence from small samples
• But:
  – The model represents all possible (x,y) sequences and defines a joint probability over all possible observation and label sequences: needless effort, since at test time the observations are given
Discriminative Probabilistic Models
“Solve the problem you need to solve”: The traditional approach inappropriately uses a generative joint model in order to solve a conditional problem in which the observations are given. To classify we need p(y|x) – there’s no need to implicitly approximate p(x).
[Figure: generative model of p(y,x) vs. discriminative model of p(y|x)]
Discriminative Models - Estimation
• Choose Θy to maximize conditional likelihood:
L(Θy) = Σi log pΘy(yi|xi)
• Estimation usually doesn't have a closed-form solution (a gradient-based sketch follows below)
• Example – MinMI discriminative approach (2nd week lecture)
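As a minimal concrete instance (my illustration, not from the slides): fitting a logistic-regression classifier by conditional likelihood has no closed form, so one typically runs gradient ascent:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # observations x_i
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.random(200) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

w = np.zeros(3)
for _ in range(1000):                         # gradient ascent on L(w)
    p = 1 / (1 + np.exp(-X @ w))              # p_w(y_i = 1 | x_i)
    w += 0.5 * X.T @ (y - p) / len(y)         # grad of sum_i log p_w(y_i|x_i)
print(w)                                      # roughly recovers w_true
```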
Maximum Entropy Markov Model
• MEMM:
  – a conditional model that represents the probability of reaching a state given an observation and the previous state
  – these conditional probabilities are specified by exponential models based on arbitrary observation features
The Label Bias Problem
• The probability mass that arrives at a state must be distributed among its possible successor states; observations can affect only how the mass is split, not its total
• Potential victims: discriminative models with per-state normalization, such as MEMMs
The Label Bias Problem: Solutions
• Determinization of the finite-state machine
  – not always possible
  – may lead to combinatorial explosion
• Start with a fully connected model and let the training procedure find a good structure
  – but prior structural knowledge has proven to be valuable in information extraction tasks
Random Field Model: Definition
• Let G = (V, E) be a finite graph, and let A be a finite alphabet.
• The configuration space Ω is the set of all labelings of the vertices in V by letters in A. If C ⊆ V and ω ∈ Ω is a configuration, then ωC denotes the configuration restricted to C.
• A random field on G is a probability distribution on Ω.
Random Field Model: The Problem
• Assume that the distribution is defined by a finite number of features
• The features fi(ω) are given and fixed
• The model is then the log-linear form pλ(ω) = (1/Z) exp(Σi λi fi(ω))
• The goal: estimate λ to maximize the likelihood of the training examples
Conditional Random Field: Definition
• X – random variable over data sequences
• Y - random variable over label sequences
• Yi is assumed to range over a finite label alphabet A
• Discriminative approach: we construct a conditional model p(y|x) and do not explicitly model marginal p(x)
CRF - Definition
• Let G = (V, E) be a finite graph, and let A be a finite alphabet
• Y is indexed by the vertices of G
• Then (X,Y) is a conditional random field if the random variables Yv, conditioned on X, obey the Markov property with respect to the graph:
p(Yv|X,Yw,w≠v) = p(Yv|X,Yw,w~v),
where w~v means that w and v are neighbors in G
CRF on Simple Chain Graph
• We will handle the case when G is a simple chain: G = (V = {1,…,m}, E = {(i, i+1)})
[Figure: graphical structures of the HMM (generative), the MEMM (discriminative), and the CRF]
Fundamental Theorem of Random Fields (Hammersley & Clifford)
• Assumption: the structure of G is a tree, of which a simple chain is a special case
• Then the distribution factors over the edges and vertices of G:
pΘ(y|x) ∝ exp( Σe∈E,k λk fk(e, y|e, x) + Σv∈V,k µk gk(v, y|v, x) )
CRF – the Learning Problem
• Assumption: the features fk and gk are given and fixed
  – For example, a boolean feature gk is TRUE if the word Xi is upper case and the label Yi is "noun"
• The learning problem: we need to determine the parameters Θ = (λ1, λ2, …; µ1, µ2, …) from training data D = {(x(i), y(i))} with empirical distribution p̃(x, y)
CRF – Estimation
• And we return to the log-likelihood maximization problem; this time we need to find the Θ that maximizes the conditional log-likelihood:
O(Θ) = Σi log pΘ(y(i)|x(i))
CRF – Estimation
• From now on we assume that the dependencies of Y, conditioned on X, form a chain.
• To simplify some expressions, we add special start and stop states Y0 = start and Yn+1 = stop.
CRF – Estimation
• Suppose that p(Y|X) is a CRF. For each position i in the observation sequence x, we define the |Y|×|Y| matrix random variable Mi(x) = [Mi(y′, y|x)] by:
Mi(y′, y|x) = exp( Σk λk fk(ei, (y′, y), x) + Σk µk gk(vi, y, x) )
• Here ei is the edge with labels (Yi-1, Yi) and vi is the vertex with label Yi
CRF – Estimation
• The normalization function Z(x) is the (start, stop) entry of the product of these matrices:
Z(x) = ( M1(x) M2(x) ⋯ Mn+1(x) )start,stop
• The conditional probability of a label sequence y is then written as
p(y|x) = Πi=1…n+1 Mi(yi-1, yi|x) / Z(x)
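To make the matrix formulation concrete, here is a small numpy sketch (my illustration; the random entries stand in for real feature-weighted potentials, not the paper's code) that builds the Mi matrices for a chain CRF and computes Z(x) and p(y|x):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n, n_labels = 4, 3                    # sequence length and |Y|
START, STOP = n_labels, n_labels + 1
S = n_labels + 2                      # label set augmented with start/stop

# M[i] stands in for M_i(y', y | x); random positive entries play the
# role of exp(sum_k lambda_k f_k + sum_k mu_k g_k) in this sketch.
M = np.zeros((n + 1, S, S))
for i in range(n + 1):
    Mi = np.exp(rng.normal(size=(S, S)))
    Mi[:, START] = 0.0                # no transitions back into start
    Mi[STOP, :] = 0.0                 # no transitions out of stop
    if i > 0:
        Mi[START, :] = 0.0            # start is left only at position 1
    if i < n:
        Mi[:, STOP] = 0.0             # stop is entered only at the last step
    M[i] = Mi

def Z_of_x(M):
    """Z(x) = (M_1(x) M_2(x) ... M_{n+1}(x))_{start,stop}."""
    prod = np.eye(S)
    for Mi in M:
        prod = prod @ Mi
    return prod[START, STOP]

def p_y_given_x(y, M):
    """p(y|x) = prod_i M_i(y_{i-1}, y_i | x) / Z(x)."""
    path = [START] + list(y) + [STOP]
    num = 1.0
    for i in range(len(M)):
        num *= M[i][path[i], path[i + 1]]
    return num / Z_of_x(M)

# Sanity check: the probabilities of all |Y|^n label sequences sum to 1.
print(sum(p_y_given_x(y, M) for y in product(range(n_labels), repeat=n)))
```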
Parameter Estimation for CRFs
• The parameter vector Θ that maximizes the log-likelihood is found using an iterative scaling algorithm
• We define standard HMM-like forward and backward vectors α and β, which allow polynomial-time calculations
• For example, the forward recursion αi(x) = αi-1(x) Mi(x) yields Z(x) = αn+1(stop|x)
Experimental Results – Set 1
• Set 1: modeling label bias
• Data was generated from a simple HMM which encodes a noisy version of the finite-state network ("rib" / "rob")
• Each state emits its designated symbol with probability 29/32 and any of the other symbols with probability 1/32
• We train both an MEMM and a CRF
• The observation features are simply the identity of the observation symbols
• 2,000 training and 500 test samples were used
• Results:
  – CRF error: 4.6%
  – MEMM error: 42%
• Conclusion: the MEMM fails to discriminate between the two branches, and we get the label bias problem
Experimental Results – Set 2
• Set 2: modeling mixed-order sources
• Data was generated from a mixed-order HMM with state transition probabilities given by
p(yi|yi-1, yi-2) = α p2(yi|yi-1, yi-2) + (1 − α) p1(yi|yi-1)
• Similarly, emission probabilities are given by
p(xi|yi, xi-1) = α p2(xi|yi, xi-1) + (1 − α) p1(xi|yi)
• Thus, for α = 0 we have a standard first-order HMM
• For each randomly generated model, a sample of 1,000 sequences of length 25 is generated for training and testing (a generation sketch follows below)
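A minimal sketch of this generation scheme (my illustration; the state/alphabet sizes, α, and the random probability tables are placeholders, not the paper's actual settings):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_obs, alpha, T = 5, 5, 0.5, 25

def random_cpt(*shape):
    """Random conditional probability table, normalized over the last axis."""
    d = rng.random(shape)
    return d / d.sum(axis=-1, keepdims=True)

p1_trans = random_cpt(n_states, n_states)            # p1(y_i | y_{i-1})
p2_trans = random_cpt(n_states, n_states, n_states)  # p2(y_i | y_{i-1}, y_{i-2})
p1_emit = random_cpt(n_states, n_obs)                # p1(x_i | y_i)
p2_emit = random_cpt(n_states, n_obs, n_obs)         # p2(x_i | y_i, x_{i-1})

def sample_sequence():
    """Sample (x, y) of length T from p = alpha * p2 + (1 - alpha) * p1."""
    y = [int(rng.integers(n_states))]
    x = [int(rng.choice(n_obs, p=p1_emit[y[0]]))]    # no x_{i-1} yet: use p1
    for i in range(1, T):
        if i == 1:                                   # no y_{i-2} yet: use p1
            pt = p1_trans[y[-1]]
        else:
            pt = alpha * p2_trans[y[-2], y[-1]] + (1 - alpha) * p1_trans[y[-1]]
        y.append(int(rng.choice(n_states, p=pt)))
        pe = alpha * p2_emit[y[-1], x[-1]] + (1 - alpha) * p1_emit[y[-1]]
        x.append(int(rng.choice(n_obs, p=pe)))
    return x, y

x, y = sample_sequence()
print(list(zip(x, y))[:5])
```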
Experimental Results – Set 2
[Figure: plots comparing CRF, MEMM, and HMM error rates on the mixed-order data]
Experimental Results – Set 3
• Set 3: Part-Of-Speech tagging experiments
Conclusions
• Conditional random fields offer a unique combination of properties:
  – discriminatively trained models for sequence segmentation and labeling
  – combination of arbitrary and overlapping observation features from both the past and the future
  – efficient training and decoding based on dynamic programming for a simple chain graph
  – parameter estimation guaranteed to find the global optimum
• CRFs' main current limitation is the slow convergence of the training algorithm relative to MEMMs, let alone to HMMs, for which training on fully observed data is very efficient.
Thank you