2. Probabilistic Document Clustering and Topic Models
A popular method for probabilistic document clustering is that of topic
modeling. The idea of topic modeling is to create a probabilistic
generative model for the text documents in the corpus. The main
approach is to represent a corpus as a function of hidden random
variables, the parameters of which are estimated using a particular
document collection. The primary assumptions in any topic modeling
approach (together with the corresponding random variables) are as
follows: The n documents in the corpus are assumed to have a
probability of belonging to one of k topics. Thus, a given document
may have a probability of belonging to multiple topics, and this
reflects the fact that the same document may contain a multitude of
subjects.
3. For a given document Di, and a set of topics T1 . . . Tk,
the probability that the document Di belongs to the topic Tj is
given by P(Tj |Di). The topics are essentially analogous to
clusters, and the value of P(Tj |Di) provides a probability of
cluster membership of the ith document to the jth cluster. In
non-probabilistic clustering methods, the membership of documents
to clusters is deterministic in nature, and therefore the
clustering is typically a clean partitioning of the document
collection. This becomes a problem when there are overlaps in document
subject matter across multiple clusters. The use of a soft cluster membership in
terms of probabilities is an elegant solution to this dilemma.
4. In this scenario, the determination of the membership of the
documents to clusters is a secondary goal to that of finding the
latent topical clusters in the underlying text collection. Although topic
modeling is related to the clustering problem, it is often studied
as a distinct area of research from clustering. The value of P(Tj
|Di) is estimated using the topic modeling approach, and is one of
the primary outputs of the algorithm. The value of k is one of the
inputs to the algorithm and is analogous to the number of clusters.
Each topic is associated with a probability vector, which
quantifies the probability of the different terms in the lexicon
for that topic.
5. Let t1 . . . td be the d terms in the lexicon. Then, for a
document that belongs completely to topic Tj , the probability that
the term tl occurs in it is given by P(tl|Tj ). The value of
P(tl|Tj) is another important parameter which needs to be estimated
by the topic modeling approach. The number of documents is denoted
by n, topics by k and lexicon size (terms) by d. Most topic
modeling methods attempt to learn the above parameters using
maximum likelihood methods, so that the probabilistic fit to the
given corpus of documents is as large as possible. There are two
basic methods which are used for topic modeling: Probabilistic
Latent Semantic Indexing (PLSI) and Latent Dirichlet Allocation
(LDA).
6. Probabilistic Latent Semantic Indexing Method
The set of random variables P(Tj|Di) and P(tl|Tj) models the probability of a term tl occurring in any document Di. The probability P(tl|Di) of the term tl occurring in document Di can be expressed in terms of these parameters by summing over the topics:

$$P(t_l \mid D_i) = \sum_{j=1}^{k} P(t_l \mid T_j)\, P(T_j \mid D_i)$$

For each term tl and document Di, this generates an n × d matrix of probabilities in terms of these parameters, where n is the number of documents and d is the number of terms. For a given corpus, the n × d term-document occurrence matrix X tells us which terms actually occur in each document and how many times each term occurs. In other words, X(i, l) is the number of times that term tl occurs in document Di. Therefore, we can use a maximum likelihood estimation algorithm which maximizes the product of the probabilities of terms that are observed in each document in the entire collection.
7. The logarithm of this likelihood is

$$\sum_{i,l} X(i, l)\, \log P(t_l \mid D_i)$$

which is maximized subject to the constraints that the probability values over each of the topic-document and term-topic spaces must sum to 1:

$$\sum_{l} P(t_l \mid T_j) = 1 \;\; \forall\, T_j, \qquad \sum_{j} P(T_j \mid D_i) = 1 \;\; \forall\, D_i$$

The Lagrangian solution essentially leads to a set of iterative update equations for the parameters that need to be estimated. These parameters can be estimated with the iterative update of two matrices, $[P_1]_{k \times n}$ and $[P_2]_{d \times k}$, containing the topic-document probabilities and term-topic probabilities respectively.
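8. Start by initializing the matrices randomly and normalizing each of them so that the probability values in their columns sum to one. Then, iteratively perform the following steps on each of P1 and P2 respectively:

$$P_1(j, i) \leftarrow P_1(j, i) \cdot \sum_{r} P_2(r, j)\, \frac{X(i, r)}{\sum_{v} P_1(v, i)\, P_2(r, v)} \quad \text{for each entry } (j, i) \text{ of } P_1$$

$$P_2(l, j) \leftarrow P_2(l, j) \cdot \sum_{q} P_1(j, q)\, \frac{X(q, l)}{\sum_{v} P_1(v, q)\, P_2(l, v)} \quad \text{for each entry } (l, j) \text{ of } P_2$$

After each update, the columns of the matrix are normalized to sum to one. The process is iterated to convergence. The output of this approach is the two matrices P1 and P2, the entries of which provide the topic-document and term-topic probabilities respectively.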
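The iterative estimation of P1 and P2 can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes the standard EM-style multiplicative updates for PLSI, updates both matrices from the same current model in each pass, and uses a made-up 4 × 4 count matrix. The function names (plsi, normalize_columns, log_likelihood) are hypothetical.

```python
import math
import random

def normalize_columns(M):
    """Rescale every column of M (a list of rows) to sum to one, in place."""
    for c in range(len(M[0])):
        s = sum(row[c] for row in M)
        for row in M:
            row[c] /= s

def log_likelihood(X, P1, P2):
    """sum over (i, l) of X(i, l) * log P(t_l | D_i), skipping zero counts."""
    k = len(P1)
    total = 0.0
    for i, doc in enumerate(X):
        for l, count in enumerate(doc):
            if count:
                p = sum(P2[l][j] * P1[j][i] for j in range(k))
                total += count * math.log(p)
    return total

def plsi(X, k, iters=50, seed=0):
    """X[i][l] is the count of term l in document i.
    Returns P1 (k x n, topic-document) and P2 (d x k, term-topic)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    P1 = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    P2 = [[rng.random() + 0.1 for _ in range(k)] for _ in range(d)]
    normalize_columns(P1)
    normalize_columns(P2)
    for _ in range(iters):
        # ratio[i][l] = X(i, l) / P(t_l | D_i) under the current parameters;
        # zero counts are skipped so that 0/0 can never occur.
        ratio = [[X[i][l] / sum(P2[l][j] * P1[j][i] for j in range(k)) if X[i][l] else 0.0
                  for l in range(d)] for i in range(n)]
        new_P1 = [[P1[j][i] * sum(P2[l][j] * ratio[i][l] for l in range(d))
                   for i in range(n)] for j in range(k)]
        new_P2 = [[P2[l][j] * sum(P1[j][i] * ratio[i][l] for i in range(n))
                   for j in range(k)] for l in range(d)]
        P1, P2 = new_P1, new_P2
        normalize_columns(P1)
        normalize_columns(P2)
    return P1, P2

# Hypothetical corpus: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3.
X = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 5]]
P1, P2 = plsi(X, k=2)
```

Each column of P1 can then be read as the soft topic membership of one document, and each column of P2 as the term distribution of one topic.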
9. Latent Dirichlet Allocation
In LDA, the term-topic probabilities and topic-document probabilities are modeled with a Dirichlet distribution as a prior. The LDA method is thus the Bayesian version of the PLSI technique, and the PLSI method is equivalent to the LDA technique when applied with a uniform Dirichlet prior. The LDA method can be used to model the topic distribution of a new document more robustly, even if it is not present in the original data set. The EM concepts used for topic modeling are quite general and can be used for different variations on the text clustering task, such as text classification or incorporating user feedback into clustering.
10. LDA's main advantage over the PLSI method is that it is not
quite as susceptible to overfitting. This is generally true of
Bayesian methods which reduce the number of model parameters to be
estimated, and therefore work much better for smaller data sets.
Even for larger data sets, PLSI has the disadvantage that the
number of model parameters grows linearly with the size of the
collection. The PLSI model is not a fully generative model, because
there is no accurate way to model the topical distribution of a
document which is not included in the current data set.
11. Probabilistic Models for Information Extraction
Probabilistic models show better accuracy and robustness against noise than categorical models. They are useful for the different tasks involved in extracting meaning from natural language texts. Most prominent among these probabilistic approaches are Hidden Markov Models (HMMs), Stochastic Context-Free Grammars (SCFGs), and Maximal Entropy (ME).
12. Probabilistic Models for Information Extraction
Hidden Markov Models
  The Three Classic Problems Related to HMMs
  The Forward-Backward Procedure
  The Viterbi Algorithm
  The Training of the HMM
  Dealing with Training Data Sparseness
Stochastic Context-Free Grammars
  Using SCFGs
Maximal Entropy Modeling
  Computing the Parameters of the Model
Maximal Entropy Markov Models
  Training the MEMM
Conditional Random Fields
  The Three Classic Problems Relating to CRF
  Computing the Conditional Probability
  Finding the Most Probable Label Sequence
  Training the CRF
13. Hidden Markov Models
An HMM is a finite-state automaton with stochastic state transitions and symbol emissions. The automaton models a probabilistic generative process. In this process, a sequence of symbols is produced by starting in an initial state, emitting a symbol selected by the state, making a transition to a new state, emitting a symbol selected by that state, and repeating this transition-emission cycle until a designated final state is reached.
14. HMM Assumptions
Markov assumption: the state transition depends only on the origin and destination states.
Output-independence assumption: each observation frame depends only on the state that generated it, not on neighbouring observation frames.
15. Formally, let $O = \{o_1, \ldots, o_M\}$ be the finite set of observation symbols and $Q = \{q_1, \ldots, q_N\}$ be the finite set of states. A first-order Markov model $\lambda$ is a triple $(\pi, A, B)$, where $\pi : Q \to [0, 1]$ defines the starting probabilities, $A : Q \times Q \to [0, 1]$ defines the transition probabilities, and $B : Q \times O \to [0, 1]$ denotes the emission probabilities. Because the functions $\pi$, $A$, and $B$ define true probabilities, they must satisfy

$$\sum_{q \in Q} \pi(q) = 1, \qquad \sum_{q' \in Q} A(q, q') = 1 \;\text{ and }\; \sum_{o \in O} B(q, o) = 1 \;\text{ for all states } q.$$

A model together with the random process described above induces a probability distribution over the set $O^*$ of all possible observation sequences.
16. The Three Classic Problems Related to HMMs
Most applications of hidden Markov models can be reduced to three basic problems:
1. Find $P(T \mid \lambda)$ [Evaluation]: the probability of a given observation sequence $T$ in a given model $\lambda$ (computes the probability distribution induced by the model).
2. Find $\arg\max_{S \in Q^{|T|}} P(T, S \mid \lambda)$ [Decoding]: the most likely state trajectory given $\lambda$ and $T$ (finds the most probable state sequence for a given observation sequence).
3. Find $\arg\max_{\lambda} P(T \mid \lambda)$ [Learning]: the model that best accounts for a given sequence (adjusts the model itself to maximize the likelihood of the given observation).
17. The three problems can be solved as follows. First, calculate $P(T \mid \lambda)$, where $T = t_1 t_2 \ldots t_k \in O^*$ is a sequence of observation symbols. Enumerate every possible state sequence of length $|T|$. Let $S = s_1 s_2 \ldots s_{|T|} \in Q^{|T|}$ be one such sequence. Calculate the probability $P(T \mid S, \lambda)$ of generating $T$ knowing that the process went through the state sequence $S$. By the Markovian assumption, the emission probabilities are all independent of each other. Therefore,

$$P(T \mid S, \lambda) = \prod_{i=1}^{|T|} B(s_i, t_i)$$
18. Similarly, the transition probabilities are independent. Thus the probability $P(S \mid \lambda)$ for the process to go through the state sequence $S$ is

$$P(S \mid \lambda) = \pi(s_1) \cdot \prod_{i=1}^{|T|-1} A(s_i, s_{i+1})$$

Using the above probabilities, we find that the probability $P(T \mid \lambda)$ of generating the sequence can be calculated as

$$P(T \mid \lambda) = \sum_{S \in Q^{|T|}} P(T \mid S, \lambda) \cdot P(S \mid \lambda)$$

This solution is of course infeasible in practice because of the exponential number of possible state sequences. To solve the problem efficiently, we use a dynamic programming technique. The resulting algorithm is called the forward-backward procedure.
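The enumeration over all state sequences can be written down directly, which makes the exponential cost visible. The sketch below assumes a hypothetical two-state HMM over the symbols a and b (the dictionaries pi, A and B hold the starting, transition and emission probabilities) and is only meant for very short sequences.

```python
from itertools import product

# Hypothetical model: starting, transition and emission probabilities.
pi = {"s1": 0.6, "s2": 0.4}
A = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
B = {"s1": {"a": 0.5, "b": 0.5}, "s2": {"a": 0.1, "b": 0.9}}

def brute_force_likelihood(T, pi, A, B):
    """P(T | model) as a sum over every state sequence S of length |T|."""
    states = list(pi)
    total = 0.0
    for S in product(states, repeat=len(T)):  # |Q|^|T| sequences
        # P(S | model): start probability times the transitions.
        p_S = pi[S[0]]
        for i in range(len(T) - 1):
            p_S *= A[S[i]][S[i + 1]]
        # P(T | S, model): product of the emission probabilities.
        p_T_given_S = 1.0
        for s, t in zip(S, T):
            p_T_given_S *= B[s][t]
        total += p_T_given_S * p_S
    return total

likelihood = brute_force_likelihood("aba", pi, A, B)
```

Because the model defines a probability distribution over observation sequences, the values returned for all sequences of one fixed length add up to one, which is a convenient sanity check.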
19. The Forward-Backward Procedure
Let $\alpha_m(q)$, the forward variable, denote the probability of generating the initial segment $t_1 t_2 \ldots t_m$ of the sequence $T$ and finishing at the state $q$ at time $m$. This forward variable can be computed recursively as follows:

$$1.\;\; \alpha_1(q) = \pi(q) \cdot B(q, t_1)$$

$$2.\;\; \alpha_{n+1}(q) = \sum_{q' \in Q} \alpha_n(q') \cdot A(q', q) \cdot B(q, t_{n+1})$$

Then, the probability of the whole sequence $T$ can be calculated as

$$P(T \mid \lambda) = \sum_{q \in Q} \alpha_{|T|}(q)$$
20. In a similar manner, one can define $\beta_m(q)$, the backward variable, which denotes the probability of starting at the state $q$ and generating the final segment $t_{m+1} \ldots t_{|T|}$ of the sequence $T$. The backward variable can be calculated starting from the end and going backward to the beginning of the sequence:

$$1.\;\; \beta_{|T|}(q) = 1$$

$$2.\;\; \beta_{n-1}(q) = \sum_{q' \in Q} A(q, q') \cdot B(q', t_n) \cdot \beta_n(q')$$

The probability of the whole sequence is then

$$P(T \mid \lambda) = \sum_{q \in Q} \pi(q) \cdot B(q, t_1) \cdot \beta_1(q)$$
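The forward and backward recursions translate almost line by line into code. The following sketch assumes a hypothetical two-state HMM stored in the dictionaries pi (starting), A (transition) and B (emission); the lists alpha and beta are indexed from 0, so alpha[n] holds the forward variable at time n + 1.

```python
# Hypothetical model: starting, transition and emission probabilities.
pi = {"s1": 0.6, "s2": 0.4}
A = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
B = {"s1": {"a": 0.5, "b": 0.5}, "s2": {"a": 0.1, "b": 0.9}}

def forward(T, pi, A, B):
    """alpha[n][q]: probability of emitting T[0..n] and being in state q at that time."""
    alpha = [{q: pi[q] * B[q][T[0]] for q in pi}]
    for t in T[1:]:
        prev = alpha[-1]
        alpha.append({q: sum(prev[r] * A[r][q] for r in pi) * B[q][t] for q in pi})
    return alpha

def backward(T, pi, A, B):
    """beta[n][q]: probability of emitting T[n+1..] when starting from state q."""
    beta = [{q: 1.0 for q in pi}]  # base case at the last position
    for t in reversed(T[1:]):
        nxt = beta[0]
        beta.insert(0, {q: sum(A[q][r] * B[r][t] * nxt[r] for r in pi) for q in pi})
    return beta

T = "abba"
alpha = forward(T, pi, A, B)
beta = backward(T, pi, A, B)
# Two ways of obtaining the probability of the whole sequence.
p_forward = sum(alpha[-1].values())
p_backward = sum(pi[q] * B[q][T[0]] * beta[0][q] for q in pi)
```

Both quantities agree up to rounding, and so does the sum over states of alpha[n][q] * beta[n][q] at any position n. The cost is proportional to |T| times the square of the number of states, instead of exponential in |T|.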
21. The Viterbi Algorithm
The Viterbi algorithm solves the second problem: finding the most likely state sequence for a given sequence $T$. As with the previous problem, enumerating all possible state sequences $S$ and choosing the one maximizing $P(T, S \mid \lambda)$ is infeasible. Instead, we use dynamic programming, utilizing the following property of the optimal state sequence: if $T'$ is some initial segment of the sequence $T = t_1 t_2 \ldots t_{|T|}$ and $S = s_1 s_2 \ldots s_{|T|}$ is a state sequence maximizing $P(T, S \mid \lambda)$, then $S' = s_1 s_2 \ldots s_{|T'|}$ maximizes $P(T', S' \mid \lambda)$ among all state sequences of length $|T'|$ ending with $s_{|T'|}$. The resulting algorithm is called the Viterbi algorithm.
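22. Let $\gamma_n(q)$ denote the state sequence ending with the state $q$ that is optimal for the initial segment $T_n = t_1 t_2 \ldots t_n$ among all sequences ending with $q$, and let $\delta_n(q)$ denote the probability $P(T_n, \gamma_n(q) \mid \lambda)$ of generating this initial segment following those optimal states. Delta and gamma can be recursively calculated as follows:

$$1.\;\; \delta_1(q) = \pi(q) \cdot B(q, t_1), \qquad \gamma_1(q) = q$$

$$2.\;\; \delta_{n+1}(q) = \max_{q' \in Q} \delta_n(q') \cdot A(q', q) \cdot B(q, t_{n+1}), \qquad \gamma_{n+1}(q) = \gamma_n(\hat{q})\, q$$

where

$$\hat{q} = \arg\max_{q' \in Q} \delta_n(q') \cdot A(q', q) \cdot B(q, t_{n+1})$$

Then the best state sequence among $\{\gamma_{|T|}(q) : q \in Q\}$ is the optimal one:

$$\arg\max_{S \in Q^{|T|}} P(T, S \mid \lambda) = \gamma_{|T|}\big(\arg\max_{q \in Q} \delta_{|T|}(q)\big)$$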
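As a rough illustration, the recursion for the best-path probability and the best path itself can be coded as below. The two-state HMM (pi, A, B) is hypothetical, and the exhaustive search at the end is included only to check the dynamic-programming result on a short sequence.

```python
from itertools import product

# Hypothetical model: starting, transition and emission probabilities.
pi = {"s1": 0.6, "s2": 0.4}
A = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
B = {"s1": {"a": 0.5, "b": 0.5}, "s2": {"a": 0.1, "b": 0.9}}

def joint_probability(T, S, pi, A, B):
    """P(T, S | model) for one fixed state sequence S."""
    p = pi[S[0]] * B[S[0]][T[0]]
    for i in range(1, len(T)):
        p *= A[S[i - 1]][S[i]] * B[S[i]][T[i]]
    return p

def viterbi(T, pi, A, B):
    """Return the most probable state sequence for T and its joint probability."""
    states = list(pi)
    delta = {q: pi[q] * B[q][T[0]] for q in states}  # best probability ending in q
    gamma = {q: [q] for q in states}                 # best path ending in q
    for t in T[1:]:
        new_delta, new_gamma = {}, {}
        for q in states:
            # Best predecessor of q; B(q, t) does not depend on it.
            best = max(states, key=lambda r: delta[r] * A[r][q])
            new_delta[q] = delta[best] * A[best][q] * B[q][t]
            new_gamma[q] = gamma[best] + [q]
        delta, gamma = new_delta, new_gamma
    q_best = max(states, key=lambda q: delta[q])
    return gamma[q_best], delta[q_best]

T = "abba"
path, prob = viterbi(T, pi, A, B)
# Exhaustive check, feasible only because the sequence is short.
best_by_enumeration = max(
    joint_probability(T, S, pi, A, B) for S in product(pi, repeat=len(T))
)
```

Storing whole paths in gamma keeps the sketch close to the definitions; practical implementations usually store back-pointers and work with log probabilities to avoid underflow.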
23. Example of the Viterbi Computation
Using the sample HMM described in the figure with the sequence (a, b, a), the Viterbi algorithm proceeds through the following steps.
[Figure: A sample HMM]
24. [Figure: Computation of the optimal path using the Viterbi algorithm]
Two optimal paths: {S1, S3, S1} and {S3, S2, S3}.
25. The Training of the HMM
The training uses the Baum-Welch re-estimation formulas. Let $\mu_n(q)$ be the probability $P(s_n = q \mid T, \lambda)$ of being in the state $q$ at time $n$ while generating the observation sequence $T$. Then $\mu_n(q) \cdot P(T \mid \lambda)$ is the probability of generating $T$ passing through the state $q$ at time $n$. By definition of the forward and backward variables $\alpha_n$ and $\beta_n$, this probability is equal to $\alpha_n(q) \cdot \beta_n(q)$. Thus,

$$\mu_n(q) = \alpha_n(q) \cdot \beta_n(q) \,/\, P(T \mid \lambda)$$

Also let $\varphi_n(q, q')$ be the probability $P(s_n = q, s_{n+1} = q' \mid T, \lambda)$ of passing from state $q$ to state $q'$ at time $n$ while generating the observation sequence $T$. As in the preceding equation,

$$\varphi_n(q, q') = \alpha_n(q) \cdot A(q, q') \cdot B(q', t_{n+1}) \cdot \beta_{n+1}(q') \,/\, P(T \mid \lambda)$$

The sum of $\mu_n(q)$ over all $n = 1 \ldots |T|$ can be seen as the expected number of times the state $q$ was visited while generating the sequence $T$. If one instead sums over $n = 1 \ldots |T| - 1$, the result is the expected number of transitions out of the state $q$, because there is no transition at time $|T|$.
26. Similarly, the sum over all $n = 1 \ldots |T| - 1$ of $\varphi_n(q, q')$, the probability of passing from state $q$ to state $q'$ at time $n$, can be interpreted as the expected number of transitions from the state $q$ to $q'$. The Baum-Welch formulas re-estimate the parameters of the model according to these expectations, where $\mu_n(q)$ is the probability of being in state $q$ at time $n$:

$$\pi'(q) := \mu_1(q)$$

$$A'(q, q') := \sum_{n=1}^{|T|-1} \varphi_n(q, q') \Big/ \sum_{n=1}^{|T|-1} \mu_n(q)$$

$$B'(q, o) := \sum_{n \,:\, t_n = o} \mu_n(q) \Big/ \sum_{n=1}^{|T|} \mu_n(q)$$

It can be shown that the model $\lambda' = (\pi', A', B')$ is either equal to $\lambda$, in which case $\lambda$ is a critical point of the likelihood function $P(T \mid \lambda)$, or better accounts for the training sequence $T$ than the original model $\lambda$ in the sense that $P(T \mid \lambda') > P(T \mid \lambda)$. Therefore, the training problem can be solved by iteratively applying the re-estimation formulas until convergence.
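One re-estimation pass can be sketched as follows for a single training sequence. The starting model (pi, A, B) is hypothetical, and the names mu and phi for the state-occupancy and transition expectations are chosen for this illustration only.

```python
# Hypothetical starting model: starting, transition and emission probabilities.
pi = {"s1": 0.6, "s2": 0.4}
A = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
B = {"s1": {"a": 0.5, "b": 0.5}, "s2": {"a": 0.1, "b": 0.9}}

def forward(T, pi, A, B):
    alpha = [{q: pi[q] * B[q][T[0]] for q in pi}]
    for t in T[1:]:
        prev = alpha[-1]
        alpha.append({q: sum(prev[r] * A[r][q] for r in pi) * B[q][t] for q in pi})
    return alpha

def backward(T, pi, A, B):
    beta = [{q: 1.0 for q in pi}]
    for t in reversed(T[1:]):
        nxt = beta[0]
        beta.insert(0, {q: sum(A[q][r] * B[r][t] * nxt[r] for r in pi) for q in pi})
    return beta

def likelihood(T, pi, A, B):
    return sum(forward(T, pi, A, B)[-1].values())

def baum_welch_step(T, pi, A, B):
    """One re-estimation of (pi, A, B) from a single observation sequence T."""
    states = list(pi)
    symbols = sorted({o for row in B.values() for o in row})
    alpha, beta = forward(T, pi, A, B), backward(T, pi, A, B)
    p_T = sum(alpha[-1].values())
    N = len(T)
    # mu[n][q]: probability of being in state q at position n, given T.
    mu = [{q: alpha[n][q] * beta[n][q] / p_T for q in states} for n in range(N)]
    # phi[n][(q, r)]: probability of the transition q -> r at position n, given T.
    phi = [{(q, r): alpha[n][q] * A[q][r] * B[r][T[n + 1]] * beta[n + 1][r] / p_T
            for q in states for r in states} for n in range(N - 1)]
    new_pi = dict(mu[0])
    new_A = {q: {r: sum(phi[n][(q, r)] for n in range(N - 1))
                    / sum(mu[n][q] for n in range(N - 1)) for r in states}
             for q in states}
    new_B = {q: {o: sum(mu[n][q] for n in range(N) if T[n] == o)
                    / sum(mu[n][q] for n in range(N)) for o in symbols}
             for q in states}
    return new_pi, new_A, new_B

T = "abbab"
pi2, A2, B2 = baum_welch_step(T, pi, A, B)
```

Calling baum_welch_step repeatedly on its own output gives the iterative training loop; each pass should leave the likelihood of T unchanged or higher.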
27. Dealing with Training Data Sparseness
Two techniques address data sparseness problems in probabilistic modeling: smoothing and shrinkage.
Smoothing is the process of flattening a probability distribution implied by a model so that all reasonable sequences can occur with some probability. It broadens the distribution by redistributing weight from high-probability regions to zero-probability regions. An example is Laplace smoothing, in which every possible training event is treated as occurring one time more than it actually does; any constant can be used instead of one. Other possible methods include back-off smoothing and deleted interpolation.
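Laplace smoothing is simple enough to show in full. The sketch below assumes that every observed event belongs to a known, finite vocabulary; the function name laplace_smoothed and the toy data are illustrative only.

```python
from collections import Counter

def laplace_smoothed(observations, vocabulary, k=1.0):
    """Estimate event probabilities as if each event had been seen k extra times.
    Assumes every observation is a member of vocabulary."""
    counts = Counter(observations)
    total = len(observations) + k * len(vocabulary)
    return {v: (counts[v] + k) / total for v in vocabulary}

# "c" never occurs in the training data but still receives some probability.
probs = laplace_smoothed(["a", "a", "b"], ["a", "b", "c"])
```

With three observations and a three-symbol vocabulary, the smoothed estimates are 3/6, 2/6 and 1/6 instead of the unsmoothed 2/3, 1/3 and 0.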
28. Shrinkage
Shrinkage is defined in terms of some hierarchy representing the expected similarity between parameter estimates. With respect to HMMs, the hierarchy can be defined as a tree with the HMM states for the leaves, all at the same depth. The hierarchy is created as follows. First, the most complex HMM is built and its states are used for the leaves of the tree. Then the states are separated into disjoint classes within which the states are expected to have similar probability distributions. The classes become the parents of their constituent states in the hierarchy (the HMM structure at the leaves induces a simpler HMM structure at the level of the classes). This simpler HMM is generated by summing the probabilities of emissions and transitions of all states in a class. This process may be repeated until only a single-state HMM remains at the root of the hierarchy.
29. Successful Application Areas of HMM
Online handwriting recognition; speech recognition; gesture recognition; language recognition; motion video analysis and tracking; protein sequence / gene sequence alignment; stock price prediction.
30. Stochastic Context-Free Grammars
An SCFG is a quintuple $G = (T, N, S, R, P)$, where $T$ is the alphabet of terminal symbols (tokens), $N$ is the set of nonterminals, $S$ is the starting nonterminal, $R$ is the set of rules, and $P : R \to [0, 1]$ defines their probabilities. The rules have the form $n \to s_1 s_2 \ldots s_k$, where $n$ is a nonterminal and each $s_i$ is either a token or another nonterminal. An SCFG generates (or accepts) a given string (sequence of tokens) if the string can be produced starting from a sequence containing just the starting symbol $S$ and expanding nonterminals one by one in the sequence using the rules from the grammar. The string generated can be naturally represented by a parse tree, with the starting symbol as the root, nonterminals as internal nodes, and tokens as leaves.
31. An SCFG is a usual context-free grammar with the addition of the $P$ function. The semantics of the probability function $P$ are straightforward. If $r$ is the rule $n \to s_1 s_2 \ldots s_k$, then $P(r)$ is the frequency of expanding $n$ using this rule. In Bayesian terms, if it is known that a given sequence of tokens was generated by expanding $n$, then $P(r)$ is the a priori likelihood that $n$ was expanded using the rule $r$. For every nonterminal $n$, the sum $\sum_r P(r)$ of the probabilities of all rules $r$ headed by $n$ must be equal to one.
32. Using SCFGs
Classical definition of SCFG: it is assumed that the rules are all independent. The (unconditional) probability of a given parse tree is then found by simply multiplying the probabilities of all rules participating in it. The parsing problem is formulated as follows: given a sequence of tokens (a string), find the most probable parse tree that could generate the string. A simple generalization of the Viterbi algorithm is able to solve this problem efficiently.
Practical applications of SCFGs: it is rarely the case that the rules are truly independent. In that case, the probabilities $P(r)$ can be conditioned on the context where the rule is applied. If the conditioning context is chosen reasonably, the Viterbi algorithm still works correctly even for this more general problem.
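Under the independence assumption, the probability of a parse tree is just the product of the probabilities of the rules used in it. The sketch below uses a small hypothetical grammar (the RULES table and its probabilities are invented for illustration) and represents a tree as a nested tuple whose first element is the nonterminal.

```python
# Hypothetical grammar: (head, right-hand side) -> probability.
# For each head, the probabilities of its rules sum to one.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("students",)): 0.5,
    ("NP", ("ART", "N")): 0.5,
    ("VP", ("V", "NP")): 1.0,
    ("V", ("need",)): 1.0,
    ("ART", ("another",)): 1.0,
    ("N", ("break",)): 1.0,
}

def tree_probability(tree):
    """tree = (nonterminal, child, ...); a child is a token (str) or a subtree."""
    head, children = tree[0], tree[1:]
    # The rule applied at this node: head -> labels of the children.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULES[(head, rhs)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_probability(c)
    return p

# Parse tree for "students need another break".
tree = ("S",
        ("NP", "students"),
        ("VP", ("V", "need"),
               ("NP", ("ART", "another"), ("N", "break"))))
p_tree = tree_probability(tree)
```

Here the only rules with probability below one are the two NP expansions, each used once, so the tree has probability 0.5 × 0.5 = 0.25.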
33. Maximal Entropy Modeling
Consider a random process of an unknown nature that produces a single output value y, a member of a finite set Y of possible output values. The process of generating y may be influenced by some contextual information x, a member of the set X of possible contexts. The task is to construct a statistical model that accurately represents the behavior of the random process. Such a model is a method of estimating the conditional probability of generating y given the context x. Let P(x, y) denote the unknown true joint probability distribution of the random process, and let p(y | x) be the model we are trying to build, taken from the class of all possible models. To build the model we are given a set of training samples generated by observing the random process for some time. The training data consist of a sequence of pairs (xi, yi) of different outputs produced in different contexts.
34. In many cases the set X is too large and underspecified to be used directly. For instance, X may be the set of all dots "." in all possible English texts. In contrast, the set Y may be extremely simple while remaining interesting. In the preceding case, Y may contain just two outcomes: SentenceEnd and NotSentenceEnd. The target model p(y | x) would in this case solve the problem of finding sentence boundaries. In such cases it is impossible to use the context x directly to generate the output y. There are usually many regularities and correlations, however, that can be exploited. Different contexts are usually similar to each other in all manner of ways, and similar contexts tend to produce similar output distributions.
35. To express such regularities and their statistics, one can use constraint functions and their expected values. A constraint function $f : X \times Y \to \mathbb{R}$ can be any real-valued function. A common choice is binary-valued trigger functions: such a trigger function returns one for a pair (x, y) if the context x satisfies the condition predicate C and the output value y is $y_i$. A common short notation for such a trigger function is $C \to y_i$. For the example above, useful triggers are
previous token is "Mr" → NotSentenceEnd,
next token is capitalized → SentenceEnd.
Given a constraint function $f$, we express its importance by requiring our target model to reproduce $f$'s expected value faithfully in the true distribution:

$$p(f) = \sum_{x, y} P(x)\, p(y \mid x)\, f(x, y) = P(f) = \sum_{x, y} P(x, y)\, f(x, y)$$
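36. In practice we cannot calculate the true expectation and must use an empirical expected value calculated by summing over the N training samples:

$$p_E(f) = \frac{1}{N} \sum_{i=1}^{N} \sum_{y \in Y} p(y \mid x_i)\, f(x_i, y) = P_E(f) = \frac{1}{N} \sum_{i=1}^{N} f(x_i, y_i)$$

The choice of feature functions is domain dependent. Let us assume the complete set of features $F = \{f_k\}$ is given. We express the completeness of the set of features by requiring that the model agree with all the expected value constraints while otherwise being as uniform as possible. The uniformity requirement defines the target model uniquely. The degree of uniformity of a model is expressed by its conditional entropy

$$H(p) = -\sum_{x, y} P(x)\, p(y \mid x) \log p(y \mid x)$$

or, empirically,

$$H_E(p) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{y \in Y} p(y \mid x_i) \log p(y \mid x_i)$$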
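37. The constrained optimization problem of finding the maximal-entropy target model is solved by application of Lagrange multipliers and the Kuhn-Tucker theorem. Let us introduce a parameter $\lambda_k$ (the Lagrange multiplier) for every feature. The Lagrangian $\Lambda(p, \lambda)$ combines the entropy $H_E(p)$ with the expected-value constraints, each weighted by its multiplier $\lambda_k$. Holding $\lambda$ fixed, we compute the unconstrained maximum of the Lagrangian over all $p$. Denote by $p_\lambda$ the $p$ where $\Lambda(p, \lambda)$ achieves its maximum and by $\Psi(\lambda)$ the value of $\Lambda$ at this point. The functions $p_\lambda$ and $\Psi(\lambda)$ can be calculated using simple calculus; in particular,

$$p_\lambda(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_k \lambda_k f_k(x, y) \Big)$$

where $Z(x)$ is a normalizing constant determined by the requirement that $\sum_{y \in Y} p_\lambda(y \mid x) = 1$. The dual optimization problem is to find

$$\lambda^* = \arg\max_\lambda \Psi(\lambda)$$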
38. The Kuhn-Tucker theorem asserts that, under certain conditions, the solutions of the primal and dual optimization problems coincide. The model $p$ that maximizes $H_E(p)$ while satisfying the constraints has the parametric form $p_{\lambda^*}$. The function $\Psi(\lambda)$ is simply the log-likelihood of the training sample as predicted by the model $p_\lambda$. Thus, the model $p_{\lambda^*}$ maximizes the likelihood of the training sample among all models of the parametric form $p_\lambda$.
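39. Computing the Parameters of the Model
The function $\Psi(\lambda)$ is well behaved from the perspective of numerical optimization, for it is smooth and concave. Consequently, various methods can be used for calculating $\lambda^*$. Generalized iterative scaling is the algorithm specifically tailored for the problem. This algorithm is applicable whenever all constraint functions are non-negative: $f_k(x, y) \ge 0$. The algorithm starts with an arbitrary choice of the $\lambda$s, for instance $\lambda_k = 0$ for all $k$. At each iteration the $\lambda$s are adjusted as follows:
1. For all $k$, let $\Delta\lambda_k$ be the solution to the equation
$$P_E(f_k) = \frac{1}{N} \sum_{i=1}^{N} \sum_{y \in Y} p_\lambda(y \mid x_i)\, f_k(x_i, y)\, \exp\big(\Delta\lambda_k\, f^\#(x_i, y)\big), \qquad f^\#(x, y) = \sum_k f_k(x, y)$$
2. For all $k$, let $\lambda_k := \lambda_k + \Delta\lambda_k$.
In the simplest case, when $f^\#$ is constant, $\Delta\lambda_k$ is simply $(1/f^\#) \cdot \log\big(P_E(f_k) / p_E(f_k)\big)$, where $p_E(f_k)$ is computed under the current model $p_\lambda$. Otherwise, any numerical algorithm for solving the equation can be used, such as Newton's method.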
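For the simplest case, in which the features of every (x, y) pair add up to the same constant, the update has a closed form and the whole procedure fits in a short function. The sketch below makes that assumption explicitly; the training data, the two trigger-style features and all names (gis, f_ab, f_cd) are hypothetical.

```python
import math

def gis(samples, outputs, features, iters=50):
    """Iterative scaling for a conditional maximum-entropy model.
    samples: list of (x, y) pairs; features: functions f_k(x, y) >= 0.
    Assumes sum_k f_k(x, y) is the same constant C for every (x, y)."""
    N = len(samples)
    C = sum(f(samples[0][0], samples[0][1]) for f in features)
    lam = [0.0] * len(features)

    def model(x):
        """p(y | x) under the current parameters lam."""
        scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lam, features)))
                  for y in outputs}
        Z = sum(scores.values())
        return {y: s / Z for y, s in scores.items()}

    # Empirical expectation of each feature over the training samples.
    empirical = [sum(f(x, y) for x, y in samples) / N for f in features]
    for _ in range(iters):
        # Expectation of each feature under the current model.
        expected = [0.0] * len(features)
        for x, _y in samples:
            p = model(x)
            for k, f in enumerate(features):
                expected[k] += sum(p[y] * f(x, y) for y in outputs) / N
        lam = [l + math.log(e / m) / C for l, e, m in zip(lam, empirical, expected)]
    return lam, model

# Hypothetical data: one context, four outputs, 80% of the samples are A or B.
outputs = ["A", "B", "C", "D"]
samples = [("red", y) for y in "AAAABBBBCD"]

def f_ab(x, y):
    return 1.0 if y in ("A", "B") else 0.0

def f_cd(x, y):
    return 1.0 if y in ("C", "D") else 0.0

lam, model = gis(samples, outputs, [f_ab, f_cd])
p_red = model("red")
```

The fitted model reproduces the constraint that A and B together carry 80% of the probability, and is otherwise uniform: 0.4 each for A and B, 0.1 each for C and D.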
40. Maximal Entropy Markov Models
A MEMM is a probabilistic finite-state acceptor. Unlike an HMM, which has separate transition and emission probabilities, a MEMM has only transition probabilities, which, however, depend on the observations. A slightly modified version of the Viterbi algorithm solves the problem of finding the most likely state sequence for a given observation sequence. A MEMM consists of a set $Q = \{q_1, \ldots, q_N\}$ of states and a set of transition probability functions $A_q : X \times Q \to [0, 1]$, where $X$ denotes the set of all possible observations. $A_q(x, q')$ gives the probability $P(q' \mid q, x)$ of transition from $q$ to $q'$, given the observation $x$. The model does not generate $x$ but only conditions on it. The set $X$ therefore need not be small and need not even be fully defined. The transition probabilities $A_q$ are separate exponential models trained using maximal entropy.
41. The task of a trained MEMM is to produce the most probable sequence of states given the observation. This task is solved by a simple modification of the Viterbi algorithm. The forward-backward algorithm, however, loses its meaning, because here it computes the probability of the observation being generated by any state sequence, which is always one. The forward and backward variables
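are still useful for the MEMM training. The forward variable $\alpha_m(q)$ (compare the forward variable of the HMM) denotes the probability of being in state $q$ at time $m$ given the observation. It is computed recursively in the same manner as for the HMM, with the observation-conditioned transition probabilities $A_{q'}(x, q)$ taking the place of the product of transition and emission probabilities. The backward variable $\beta_m(q)$ denotes the probability of starting from state $q$ at time $m$ given the observation. It is computed similarly, proceeding backward from the end of the sequence. The model $A_q$ for transition probabilities from a state is defined parametrically using constraint functions. If $f_k : X \times Q \to \mathbb{R}$ is the set of such functions for a given state $q$, then the model $A_q$ can be represented in the form

$$A_q(x, q') = \frac{1}{Z(x, q)} \exp\Big( \sum_k \lambda_k f_k(x, q') \Big)$$

where $\lambda_k$ are the parameters to be trained and $Z(x, q)$ is the normalizing factor making the probabilities of all transitions from a state sum to one.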
42. Training the MEMM
If the true state sequence for the training data is known, the parameters of the models can be straightforwardly estimated using the GIS algorithm for training ME models. If the sequence is not known (for instance, if there are several states with the same label in a fully connected MEMM), the parameters must be estimated using a combination of the Baum-Welch procedure and iterative scaling. Every iteration consists of two steps:
1. Using the forward-backward algorithm and the current transition functions, compute the state occupancies for all training sequences.
2. Compute the new transition functions using GIS with the feature frequencies based on the state occupancies computed in step 1.
It is unnecessary to run GIS to convergence in step 2; a single GIS iteration is sufficient.
43. Conditional Random Fields (CRF)
Problem description
Why conditional random fields (CRF)
Introduction to CRF
CRF model
Inference of CRF
Learning of CRF
44. Problem Description
Given observed data X, we wish to predict the labels Y.
Example: X = {Temperature, Humidity, ...}, where Xn is the observation on day n; Y = {Sunny, Rainy, Cloudy}, where Yn is the weather on day n. For instance, given the observation {30°C, 20%, light breeze}, is the weather sunny, rainy, or cloudy?
The observed features may depend on one another, and the weather may depend on the weather of yesterday.
45. Generative Model vs. Discriminative Model
Generative model: a model that generates the observed data randomly; it models the joint probability p(x, y).
Discriminative model: directly estimates the posterior probability p(y|x); it aims at modeling the discrimination between different outputs.

                    Generative                Conditional
Single variable     Naive Bayes               Logistic regression
Sequence            HMM                       Linear-chain CRF, MEMM
General             Bayesian network, MRF     General CRF
46. Why Conditional Random Fields
Generative model: a generative model aims to find the joint probability p(x, y) and makes its prediction by using Bayes' rule to calculate p(y|x). Examples are Naive Bayes (single output) and the HMM (Hidden Markov Model; sequence output).
Naive Bayes (single output), where $x$ is a vector of features that are assumed to be independent given $y$:

$$p(x, y) = p(y) \prod_{k=1}^{K} p(x_k \mid y)$$

HMM (sequence output):

$$p(x, y) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$$

Assumptions: 1. each state $y_t$ depends only on its immediate predecessor $y_{t-1}$; 2. each observation $x_t$ is conditionally independent of the others given its state $y_t$.
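47. Why Conditional Random Fields
Under the generative assumptions, humidity, temperature and the wind scale are treated as independent. Example observations:
Mon. {30°C, 20%, light breeze}
Tue. {28°C, 30%, light breeze}
Wed. {25°C, 40%, moderate breeze}
Thu. {22°C, 60%, moderate breeze}
[Figure: graphical model over the observations 30°C, 20% and light breeze; an arrow A → B means that A causes B]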
48. Why Conditional Random Fields
Difficulties for generative models:
It is not practical to represent multiple interacting features (it is hard to model p(x)) or long-range dependencies of the observations.
They make very strict independence assumptions on the observations.
Mon. {30°C, 20%, light breeze}
Tue. {28°C, 30%, light breeze}
Wed. {25°C, 40%, moderate breeze}
Thu. {22°C, 60%, moderate breeze}
49. Why Conditional Random Fields
Discriminative models directly model the posterior p(y|x) and aim at modeling the discrimination between different outputs. Examples: logistic regression (maximum entropy) and CRF.
Advantages of discriminative models:
The training process aims at finding optimal coefficients for the features, regardless of whether the features are correlated.
They are not sensitive to unbalanced training data.
Especially for the classification problem, we don't have to care about p(x).
50. Why Conditional Random Fields
Logistic regression (maximum entropy): suppose we have a bin of candies, each with an associated label (A, B, C, or D). Each candy has multiple colors in its wrapper. Each candy is assigned a label randomly based on some distribution over wrapper colors.
Observation: the color of the wrapper.
Label: 4 kinds of flavors (A: chocolate, B: strawberry, C: lemon, D: milk).
51. Why Conditional Random Fields
For any candy with a red wrapper pulled from the bin: P(A|red) + P(B|red) + P(C|red) + P(D|red) = 1.
An infinite number of distributions exist that fit this constraint. The distribution that fits with the idea of maximum entropy is the most uniform one:
o P(A|red) = 0.25
o P(B|red) = 0.25
o P(C|red) = 0.25
o P(D|red) = 0.25
52. Why Conditional Random Fields
Now suppose we add some evidence to our model. We note that 80% of all candies with red wrappers are labeled either A or B:
o P(A|red) + P(B|red) = 0.8
The updated model that reflects this would be:
o P(A|red) = 0.4
o P(B|red) = 0.4
o P(C|red) = 0.1
o P(D|red) = 0.1
As we make more observations and find more constraints, the model gets more complex.
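53. Why Conditional Random Fields
Given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible:

$$p(y \mid x; w) = \frac{\exp\big(\sum_j w_j F_j(x, y)\big)}{Z(x, w)}, \qquad Z(x, w) = \sum_{y'} \exp\Big(\sum_j w_j F_j(x, y')\Big)$$

where $Z(x, w)$ is a normalization term, the weights $w_j$ are obtained by learning, and the $F_j$ are defined feature functions.
[Figure: factor graph in which the evidence nodes x1, x2, ..., xd are connected to the output node y; a factor node between nodes A and B represents f(A, B)]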
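54. Linear-Chain CRF
If we extend logistic regression to a sequence problem, the model keeps the same form:

$$p(y \mid x; w) = \frac{\exp\big(\sum_j w_j F_j(x, y)\big)}{Z(x, w)}, \qquad Z(x, w) = \sum_{y'} \exp\Big(\sum_j w_j F_j(x, y')\Big)$$

where $Z(x, w)$ is a normalization term and $F_j(x, y) = \sum_t f_j(y_{t-1}, y_t, x)$ is a sum along the sentence. Each feature may depend on the entire observation sequence $x$.
[Figure: the single output y with inputs x1, x2, ..., xd is replicated into a chain of outputs y(t-1), y(t), y(t+1), each with its own inputs x1, x2, ..., xd]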
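55. Linear-Chain CRF
[Figure: linear-chain CRF over the labels y1, y2, y3, shown once with per-position observations x1, x2, x3 and once with a single observation x shared by all labels]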
56. General CRF
Divide the graph G into many templates A. The parameters inside each template are tied, and K(A) is the number of feature functions for the template:

$$p(y \mid x) = \frac{1}{Z(x)} \prod_{A \in G} \exp\Big( \sum_{k=1}^{K(A)} \lambda_{ak}\, f_{ak}(y_a, x_a) \Big)$$
57. Inference of CRF
Problem description: given the observations ({x_i}) and the probability model (parameters such as $\lambda_i$ mentioned above), we aim to find the best state sequence.
For general graphs, the problem of exact inference in CRFs is intractable.
Chain-like or tree-like CRFs admit exact inference.
Otherwise, approximate solutions are used.
58. Inference of Linear-Chain CRF
The inference of a linear-chain CRF is very similar to that of an HMM.
Example: POS (part of speech) tagging, the identification of words as nouns, verbs, adjectives, adverbs, etc.
Students/noun need/verb another/article break/noun
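60. Inference of Linear-Chain CRF
Then, back to the CRF:

$$\begin{aligned}
y^* &= \arg\max_y\, p(y \mid x; \lambda) = \arg\max_y \frac{\exp\big(\sum_j \lambda_j F_j(x, y)\big)}{Z(x)} \\
&= \arg\max_y \sum_j \lambda_j F_j(x, y) \\
&= \arg\max_y \sum_j \lambda_j \sum_i f_j(y_{i-1}, y_i, x) \\
&= \arg\max_y \sum_i \sum_j \lambda_j f_j(y_{i-1}, y_i, x) \\
&= \arg\max_y \sum_i g_i(y_{i-1}, y_i)
\end{aligned}$$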
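61. Inference of Linear-Chain CRF
Each $g_i$ can be represented as an m × m matrix, where m is the cardinality of the set of tags:

$$g_i(y_{i-1}, y_i) = \sum_j \lambda_j f_j(y_{i-1}, y_i, x)$$

[Figure: 3 × 3 matrix with rows indexed by y(i-1) and columns indexed by y(i), each ranging over the tags V, ART, N]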
62. Inference of Linear-Chain CRF
The inference of a linear-chain CRF is similar to that of an HMM, which uses the Viterbi algorithm. Let v range over the tags, and define U(k, v) to be the score of the best sequence of tags from 1 to k, where tag k is required to be v:

$$U(k, v) = \max_{y_1, \ldots, y_{k-1}} \Big[ \sum_{i=1}^{k-1} g_i(y_{i-1}, y_i) + g_k(y_{k-1}, v) \Big] = \max_{y_{k-1}} \big[ U(k-1, y_{k-1}) + g_k(y_{k-1}, v) \big]$$
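The recursion for U(k, v) can be sketched in code as follows. The scores are passed in as a function g(i, prev, cur) that plays the role of the per-position score of a tag pair, with prev set to None at the first position. The word and transition scores below are invented for the tagging example and do not come from a trained model.

```python
def crf_viterbi(n, tags, g):
    """Best tag sequence of length n under the additive scores g(i, prev, cur).
    Positions are numbered from 1; prev is None at position 1."""
    U = {v: g(1, None, v) for v in tags}  # U(1, v)
    back = []                             # back-pointers for positions 2..n
    for k in range(2, n + 1):
        new_U, ptr = {}, {}
        for v in tags:
            best = max(tags, key=lambda u: U[u] + g(k, u, v))
            new_U[v] = U[best] + g(k, best, v)
            ptr[v] = best
        back.append(ptr)
        U = new_U
    last = max(tags, key=lambda v: U[v])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, U[last]

# Hypothetical scores for "Students need another break".
words = ["Students", "need", "another", "break"]
tags = ["N", "V", "ART"]
word_score = {("Students", "N"): 2, ("need", "V"): 2, ("another", "ART"): 2,
              ("break", "N"): 1, ("break", "V"): 1}
trans_score = {("ART", "N"): 1, ("N", "V"): 0.5, ("V", "ART"): 0.5}

def g(i, prev, cur):
    return word_score.get((words[i - 1], cur), 0) + trans_score.get((prev, cur), 0)

path, score = crf_viterbi(len(words), tags, g)
```

With these scores the best sequence is N, V, ART, N with a total score of 9. The word "break" on its own is ambiguous between N and V, and it is the transition score after ART that resolves it.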
63. Learning of CRF
Problem description: given training pairs ({x_i, y_i}), we wish to estimate the parameters of the model ({λ_i}).
Method: chain-structured or tree-structured CRFs can be trained by maximum likelihood; we will focus on the learning of the linear-chain CRF. General CRFs are intractable, hence approximate solutions are necessary.
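64. Learning of Linear-Chain CRF
Conditional maximum likelihood (CML), with x the observations and y the labels:

$$\max_\lambda L(\lambda; y \mid x) = \max_\lambda p(y \mid x; \lambda), \quad \text{or equivalently} \quad \max_\lambda \log p(y \mid x; \lambda)$$

Apply CML to the learning of the CRF:

$$\frac{\partial}{\partial \lambda_j} \log p(y \mid x; \lambda) = F_j(x, y) - \frac{\partial}{\partial \lambda_j} \log Z(x, \lambda) = 0$$

It can be shown that the conditional log-likelihood of the linear-chain CRF is a concave function of the parameters, so we can apply gradient ascent to the CML problem.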
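65. Learning of Linear-Chain CRF

$$\begin{aligned}
\frac{\partial}{\partial \lambda_j} \log p(y \mid x; \lambda) &= F_j(x, y) - \frac{\partial}{\partial \lambda_j} \log Z(x, \lambda) \\
&= F_j(x, y) - \sum_{y'} F_j(x, y')\, p(y' \mid x; \lambda) \\
&= F_j(x, y) - E_{y' \sim p(y' \mid x; \lambda)}[F_j(x, y')] = 0
\end{aligned}$$

Here $E_p[\cdot]$ denotes expectation with respect to the distribution p. For the entire training set T:

$$\sum_{(x, y) \in T} F_j(x, y) = \sum_{x \in T} E_{y \sim p(y \mid x; \lambda)}[F_j(x, y)]$$

The left-hand side is the expectation of the feature $F_j$ with respect to the empirical distribution, and the right-hand side is the expectation of the feature $F_j$ with respect to the model distribution.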
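The gradient of the conditional log-likelihood, observed feature value minus expected feature value under the model, can be checked numerically on a tiny example. The sketch below computes the expectation by brute-force enumeration of all label sequences, which is only feasible for very short inputs (a practical implementation would use forward-backward instead). The two features and the tag set are hypothetical.

```python
import math
from itertools import product

TAGS = ["N", "V"]

def features(x, y):
    """Global feature vector F(x, y), summed along the sentence.
    Hypothetical features: F_0 counts capitalised words tagged N,
    F_1 counts N -> V tag transitions."""
    F = [0.0, 0.0]
    for i, (word, tag) in enumerate(zip(x, y)):
        if word[0].isupper() and tag == "N":
            F[0] += 1
        if i > 0 and y[i - 1] == "N" and tag == "V":
            F[1] += 1
    return F

def score(lam, x, y):
    return sum(l * f for l, f in zip(lam, features(x, y)))

def log_likelihood(lam, x, y):
    """log p(y | x; lam), with Z(x) computed by enumerating all label sequences."""
    log_Z = math.log(sum(math.exp(score(lam, x, yy))
                         for yy in product(TAGS, repeat=len(x))))
    return score(lam, x, y) - log_Z

def gradient(lam, x, y):
    """Observed features minus their expectation under p(y' | x; lam)."""
    seqs = list(product(TAGS, repeat=len(x)))
    weights = [math.exp(score(lam, x, yy)) for yy in seqs]
    Z = sum(weights)
    expected = [sum(w / Z * features(x, yy)[j] for w, yy in zip(weights, seqs))
                for j in range(len(lam))]
    return [o - e for o, e in zip(features(x, y), expected)]

x = ["Students", "need", "breaks"]
y = ("N", "V", "N")
lam = [0.3, -0.2]
grad = gradient(lam, x, y)
```

At the maximum-likelihood parameters this gradient is zero, which is exactly the condition that the observed and expected feature values coincide.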
66. Learning of Linear-Chain CRF
To yield the best model, the expectation of each feature with respect to the model distribution must be equal to its expected value under the empirical distribution of the training data. This is the same condition as for the maximum entropy model: logistic regression (maximum entropy), extended to sequences, gives the linear-chain CRF.