Tdm probabilistic models (part 2)

1. Text Data Mining (part-2) PROBABILISTIC MODELS

2. Probabilistic Document Clustering and Topic Models. A popular method for probabilistic document clustering is topic modeling. The idea of topic modeling is to create a probabilistic generative model for the text documents in the corpus. The main approach is to represent a corpus as a function of hidden random variables, the parameters of which are estimated using a particular document collection. The primary assumption in any topic modeling approach (together with the corresponding random variables) is as follows: each of the n documents in the corpus is assumed to have a probability of belonging to each of k topics. Thus, a given document may have a probability of belonging to multiple topics, which reflects the fact that the same document may contain a multitude of subjects.

3. For a given document $D_i$ and a set of topics $T_1, \ldots, T_k$, the probability that the document $D_i$ belongs to the topic $T_j$ is given by $P(T_j \mid D_i)$. The topics are essentially analogous to clusters, and the value of $P(T_j \mid D_i)$ provides the probability of cluster membership of the ith document in the jth cluster. In non-probabilistic clustering methods, the membership of documents in clusters is deterministic, and therefore the clustering is typically a clean partitioning of the document collection. Such a hard partitioning is a poor fit when document subject matter overlaps across multiple clusters. Soft cluster membership in terms of probabilities is an elegant solution to this dilemma.

4. In this scenario, determining the membership of the documents in clusters is a secondary goal to that of finding the latent topical clusters in the underlying text collection. Although topic modeling is related to the clustering problem, it is often studied as a distinct area of research. The value of $P(T_j \mid D_i)$ is estimated using the topic modeling approach and is one of the primary outputs of the algorithm. The value of k is one of the inputs to the algorithm and is analogous to the number of clusters. Each topic is associated with a probability vector, which quantifies the probability of the different terms in the lexicon for that topic.

5. Let $t_1, \ldots, t_d$ be the d terms in the lexicon. Then, for a document that belongs completely to topic $T_j$, the probability that the term $t_l$ occurs in it is given by $P(t_l \mid T_j)$. The value of $P(t_l \mid T_j)$ is another important parameter that needs to be estimated by the topic modeling approach. The number of documents is denoted by n, the number of topics by k, and the lexicon size (number of terms) by d. Most topic modeling methods attempt to learn the above parameters using maximum likelihood methods, so that the probabilistic fit to the given corpus of documents is as large as possible. Two basic methods are used for topic modeling: Probabilistic Latent Semantic Indexing (PLSI) and Latent Dirichlet Allocation (LDA).

6. Probabilistic Latent Semantic Indexing Method. The random variables $P(T_j \mid D_i)$ and $P(t_l \mid T_j)$ together model the probability of a term $t_l$ occurring in any document $D_i$. The probability $P(t_l \mid D_i)$ of the term $t_l$ occurring in document $D_i$ can be expressed in terms of these parameters: $P(t_l \mid D_i) = \sum_{j=1}^{k} P(t_l \mid T_j) \cdot P(T_j \mid D_i)$. For each term $t_l$ and document $D_i$, this generates an $n \times d$ matrix of probabilities, where n is the number of documents and d is the number of terms. For a given corpus, the $n \times d$ term-document occurrence matrix X tells us which terms actually occur in each document and how many times. In other words, $X(i, l)$ is the number of times that term $t_l$ occurs in document $D_i$. Therefore, we can use a maximum likelihood estimation algorithm that maximizes the product of the probabilities of the terms observed in each document over the entire collection.

7. The log-likelihood is $\sum_{i,l} X(i, l) \log P(t_l \mid D_i)$. It is maximized subject to the constraints that the probability values over each of the topic-document and term-topic spaces must sum to 1: $\sum_{l} P(t_l \mid T_j) = 1$ for every topic $T_j$, and $\sum_{j} P(T_j \mid D_i) = 1$ for every document $D_i$. The Lagrangian solution leads to a set of iterative update equations for the parameters that need to be estimated. These parameters can be estimated with the iterative update of two matrices $[P_1]_{k \times n}$ and $[P_2]_{d \times k}$, containing the topic-document probabilities and the term-topic probabilities respectively.

8. Start by initializing the matrices $P_1$ and $P_2$ randomly, and normalize each of them so that the probability values in their columns sum to one. Then iteratively apply the update equations to each of $P_1$ and $P_2$ in turn. The process is iterated to convergence. The outputs of this approach are the two matrices $P_1$ and $P_2$, whose entries provide the topic-document and term-topic probabilities respectively.
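The iterative procedure can be sketched in code. The update equations are not reproduced in the slide text, so this minimal sketch assumes the standard EM updates for PLSI (expected topic counts followed by column normalization); the names `plsi` and `normalize_columns`, the seed, and the toy count matrix in the usage note are illustrative, not taken from the source.

```python
import random

def normalize_columns(M):
    """Scale every column of a list-of-rows matrix so that it sums to one."""
    rows, cols = len(M), len(M[0])
    for c in range(cols):
        s = sum(M[r][c] for r in range(rows))
        if s > 0:
            for r in range(rows):
                M[r][c] /= s
    return M

def plsi(X, k, iters=100, seed=0):
    """Fit PLSI by EM. X[i][l] is the count of term l in document i.

    Returns P1 (k x n, P1[j][i] = P(T_j | D_i)) and
            P2 (d x k, P2[l][j] = P(t_l | T_j)).
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Random initialization, then column normalization.
    P1 = normalize_columns([[rng.random() + 0.1 for _ in range(n)] for _ in range(k)])
    P2 = normalize_columns([[rng.random() + 0.1 for _ in range(k)] for _ in range(d)])
    for _ in range(iters):
        new1 = [[0.0] * n for _ in range(k)]
        new2 = [[0.0] * k for _ in range(d)]
        for i in range(n):
            for l in range(d):
                if X[i][l] == 0:
                    continue
                # E-step: posterior over topics for the pair (D_i, t_l).
                joint = [P2[l][j] * P1[j][i] for j in range(k)]
                total = sum(joint)
                if total == 0:
                    continue
                for j in range(k):
                    w = X[i][l] * joint[j] / total
                    # M-step accumulators: expected counts per topic.
                    new1[j][i] += w
                    new2[l][j] += w
        P1 = normalize_columns(new1)
        P2 = normalize_columns(new2)
    return P1, P2
```

For example, `plsi([[4, 4, 0, 0], [3, 5, 0, 0], [0, 0, 5, 3], [0, 0, 4, 4]], k=2)` returns a 2 x 4 topic-document matrix and a 4 x 2 term-topic matrix whose columns each sum to one. The sketch assumes every document contains at least one term.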

9. Latent Dirichlet Allocation. In LDA, the term-topic probabilities and topic-document probabilities are modeled with a Dirichlet distribution as a prior. The LDA method is thus the Bayesian version of the PLSI technique, and the PLSI method is equivalent to the LDA technique when applied with a uniform Dirichlet prior. The LDA method can model the topic distribution of a new document more robustly, even if that document is not present in the original data set. The EM concepts used for topic modeling are quite general and can be used for different variations on the text clustering task, such as text classification or incorporating user feedback into clustering.

10. LDA's main advantage over the PLSI method is that it is not quite as susceptible to overfitting. This is generally true of Bayesian methods, which reduce the number of model parameters to be estimated and therefore work much better for smaller data sets. Even for larger data sets, PLSI has the disadvantage that the number of model parameters grows linearly with the size of the collection. The PLSI model is not a fully generative model, because there is no accurate way to model the topical distribution of a document that is not included in the current data set.

11. Probabilistic Models for Information Extraction. Probabilistic models show better accuracy and robustness against noise than categorical models, and they are useful for the different tasks involved in extracting meaning from natural language texts. Most prominent among these probabilistic approaches are Hidden Markov Models (HMMs), Stochastic Context-Free Grammars (SCFGs), and Maximal Entropy (ME).

12. Probabilistic Models for Information Extraction (outline). Hidden Markov Models: The Three Classic Problems Related to HMMs; The Forward-Backward Procedure; The Viterbi Algorithm; The Training of the HMM; Dealing with Training Data Sparseness. Stochastic Context-Free Grammars: Using SCFGs. Maximal Entropy Modeling: Computing the Parameters of the Model. Maximal Entropy Markov Models: Training the MEMM. Conditional Random Fields: The Three Classic Problems Relating to CRF; Computing the Conditional Probability; Finding the Most Probable Label Sequence; Training the CRF.

13. Hidden Markov Models. An HMM is a finite-state automaton with stochastic state transitions and symbol emissions. The automaton models a probabilistic generative process. In this process, a sequence of symbols is produced by starting in an initial state, emitting a symbol selected by the state, making a transition to a new state, emitting a symbol selected by that state, and repeating this transition-emission cycle until a designated final state is reached.

14. HMM Assumptions. Markov assumption: a state transition depends only on the origin and destination states. Output-independence assumption: each observation frame depends only on the state that generated it, not on neighbouring observation frames.

15. Formally, let $O = \{o_1, \ldots, o_M\}$ be the finite set of observation symbols and $Q = \{q_1, \ldots, q_N\}$ the finite set of states. A first-order Markov model $\lambda$ is a triple $(\pi, A, B)$, where $\pi : Q \to [0, 1]$ defines the starting probabilities, $A : Q \times Q \to [0, 1]$ defines the transition probabilities, and $B : Q \times O \to [0, 1]$ denotes the emission probabilities. Because the functions $\pi$, A, and B define true probabilities, they must satisfy $\sum_{q \in Q} \pi(q) = 1$, and $\sum_{q' \in Q} A(q, q') = 1$ and $\sum_{o \in O} B(q, o) = 1$ for all states q. A model together with the random process described above induces a probability distribution over the set $O^*$ of all possible observation sequences.
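The triple and its normalization constraints can be written down directly. This is a minimal sketch that assumes a dictionary representation; the function name `is_valid_hmm` and the two-state model over the symbols "a" and "b" are hypothetical and are not the sample HMM from the slides.

```python
import math

def is_valid_hmm(pi, A, B, tol=1e-9):
    """Check the three normalization constraints of a model (pi, A, B).

    pi[q] is the starting probability of state q, A[q][q2] the transition
    probability from q to q2, and B[q][o] the probability that q emits o.
    """
    ok = math.isclose(sum(pi.values()), 1.0, abs_tol=tol)
    for q in pi:
        ok = ok and math.isclose(sum(A[q].values()), 1.0, abs_tol=tol)
        ok = ok and math.isclose(sum(B[q].values()), 1.0, abs_tol=tol)
    return ok

# Hypothetical two-state model over the observation symbols "a" and "b".
pi = {"s1": 0.6, "s2": 0.4}
A = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
B = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}
```

Calling `is_valid_hmm(pi, A, B)` on this model succeeds; lowering any single transition or emission probability without compensating elsewhere makes the check fail.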

16. The Three Classic Problems Related to HMMs. Most applications of hidden Markov models can be reduced to three basic problems: 1. Find $P(T \mid \lambda)$ [Evaluation]: the probability of a given observation sequence T in a given model $\lambda$ (computes the probability distribution induced by the model). 2. Find $\arg\max_{S \in Q^{|T|}} P(T, S \mid \lambda)$ [Decoding]: the most likely state trajectory given $\lambda$ and T (finds the most probable state sequence for a given observation sequence). 3. Find $\arg\max_{\lambda} P(T \mid \lambda)$ [Learning]: the model that best accounts for a given sequence (adjusts the model itself to maximize the likelihood of the given observation).

17. These three problems can be solved as follows. To calculate $P(T \mid \lambda)$, where $T = t_1 t_2 \ldots t_k \in O^*$ is a sequence of observation symbols, one could enumerate every possible state sequence of length $|T|$. Let $S = s_1 s_2 \ldots s_{|T|} \in Q^{|T|}$ be one such sequence. We then calculate the probability $P(T \mid S, \lambda)$ of generating T knowing that the process went through the state sequence S. By the Markovian assumption, the emission probabilities are all independent of each other. Therefore, $P(T \mid S, \lambda) = \prod_{i=1}^{|T|} B(s_i, t_i)$.

18. Similarly, the transition probabilities are independent. Thus the probability $P(S \mid \lambda)$ for the process to go through the state sequence S is $P(S \mid \lambda) = \pi(s_1) \cdot \prod_{i=1}^{|T|-1} A(s_i, s_{i+1})$. Using the above probabilities, we find that the probability $P(T \mid \lambda)$ of generating the sequence can be calculated as $P(T \mid \lambda) = \sum_{S \in Q^{|T|}} P(T \mid S, \lambda) \cdot P(S \mid \lambda)$. This solution is of course infeasible in practice because of the exponential number of possible state sequences. To solve the problem efficiently, we use a dynamic programming technique. The resulting algorithm is called the forward-backward procedure.

19. The Forward-Backward Procedure. Let $\alpha_m(q)$, the forward variable, denote the probability of generating the initial segment $t_1 t_2 \ldots t_m$ of the sequence T and finishing at the state q at time m. This forward variable can be computed recursively as follows: 1. $\alpha_1(q) = \pi(q) \cdot B(q, t_1)$; 2. $\alpha_{n+1}(q) = \sum_{q' \in Q} \alpha_n(q') \cdot A(q', q) \cdot B(q, t_{n+1})$. Then the probability of the whole sequence T can be calculated as $P(T \mid \lambda) = \sum_{q \in Q} \alpha_{|T|}(q)$.

20. In a similar manner, one can define $\beta_m(q)$, the backward variable, which denotes the probability of starting at the state q and generating the final segment $t_{m+1} \ldots t_{|T|}$ of the sequence T. The backward variable can be calculated starting from the end and going backward to the beginning of the sequence: 1. $\beta_{|T|}(q) = 1$; 2. $\beta_{n-1}(q) = \sum_{q' \in Q} A(q, q') \cdot B(q', t_n) \cdot \beta_n(q')$. The probability of the whole sequence is then $P(T \mid \lambda) = \sum_{q \in Q} \pi(q) \cdot B(q, t_1) \cdot \beta_1(q)$.
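The two recursions, and the exponential enumeration they replace, can be compared on a small case. The sketch below assumes a dictionary representation of $(\pi, A, B)$; the function names and the toy two-state model are illustrative and not from the source.

```python
import itertools

def forward(pi, A, B, T):
    """Forward tables: alpha[n][q] for positions n = 0 .. len(T) - 1."""
    alpha = [{q: pi[q] * B[q][T[0]] for q in pi}]
    for t in T[1:]:
        prev = alpha[-1]
        alpha.append({q: sum(prev[p] * A[p][q] for p in pi) * B[q][t] for q in pi})
    return alpha

def backward(pi, A, B, T):
    """Backward tables: beta[n][q] for positions n = 0 .. len(T) - 1."""
    beta = [{q: 1.0 for q in pi}]  # table for the last position
    for t in reversed(T[1:]):
        nxt = beta[0]
        beta.insert(0, {q: sum(A[q][p] * B[p][t] * nxt[p] for p in pi) for q in pi})
    return beta

def brute_force_likelihood(pi, A, B, T):
    """Sum P(T | S) * P(S) over every state sequence S (exponential cost)."""
    total = 0.0
    for S in itertools.product(pi, repeat=len(T)):
        p = pi[S[0]] * B[S[0]][T[0]]
        for i in range(1, len(T)):
            p *= A[S[i - 1]][S[i]] * B[S[i]][T[i]]
        total += p
    return total

# Hypothetical two-state model over the observation symbols "a" and "b".
pi = {"s1": 0.6, "s2": 0.4}
A = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
B = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}
T = ["a", "b", "a"]

p_forward = sum(forward(pi, A, B, T)[-1].values())
p_backward = sum(pi[q] * B[q][T[0]] * backward(pi, A, B, T)[0][q] for q in pi)
```

Both `p_forward` and `p_backward` agree with `brute_force_likelihood(pi, A, B, T)` up to rounding, while the recursions cost time proportional to $|T| \cdot |Q|^2$ rather than $|Q|^{|T|}$.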

21. The Viterbi Algorithm. This algorithm solves the second problem: finding the most likely state sequence for a given sequence T. As with the previous problem, enumerating all possible state sequences S and choosing the one maximizing $P(T, S \mid \lambda)$ is infeasible. Instead, we use dynamic programming, utilizing the following property of the optimal state sequence: if $T'$ is some initial segment of the sequence $T = t_1 t_2 \ldots t_{|T|}$ and $S = s_1 s_2 \ldots s_{|T|}$ is a state sequence maximizing $P(T, S \mid \lambda)$, then $S' = s_1 s_2 \ldots s_{|T'|}$ maximizes $P(T', S' \mid \lambda)$ among all state sequences of length $|T'|$ ending with $s_{|T'|}$. The resulting algorithm is called the Viterbi algorithm.

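22. Let $\gamma_n(q)$ denote the state sequence ending with the state q that is optimal for the initial segment $T_n = t_1 t_2 \ldots t_n$ among all sequences ending with q, and let $\delta_n(q)$ denote the probability $P(T_n, \gamma_n(q) \mid \lambda)$ of generating this initial segment following those optimal states. Delta and gamma can be recursively calculated as follows: 1. $\delta_1(q) = \pi(q) \cdot B(q, t_1)$, $\gamma_1(q) = q$; 2. $\delta_{n+1}(q) = \max_{q' \in Q} \delta_n(q') \cdot A(q', q) \cdot B(q, t_{n+1})$, $\gamma_{n+1}(q) = \gamma_n(\hat{q})\, q$, where $\hat{q} = \arg\max_{q' \in Q} \delta_n(q') \cdot A(q', q) \cdot B(q, t_{n+1})$. Then the best state sequence among $\{\gamma_{|T|}(q) : q \in Q\}$ is the optimal one: $\arg\max_{S \in Q^{|T|}} P(T, S \mid \lambda) = \gamma_{|T|}\left(\arg\max_{q \in Q} \delta_{|T|}(q)\right)$.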
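The delta and gamma recursion translates almost line for line into code. This is a minimal sketch on a hypothetical two-state model (not the sample HMM of the next slide); the function name `viterbi` and the dictionary layout are assumptions made for illustration.

```python
def viterbi(pi, A, B, T):
    """Return (best state sequence, its joint probability) for observations T."""
    # delta[q]: probability of the best path ending in q; gamma[q]: that path.
    delta = {q: pi[q] * B[q][T[0]] for q in pi}
    gamma = {q: [q] for q in pi}
    for t in T[1:]:
        new_delta, new_gamma = {}, {}
        for q in pi:
            # The emission B[q][t] does not depend on the predecessor,
            # so it can be left out of the argmax.
            best = max(pi, key=lambda p: delta[p] * A[p][q])
            new_delta[q] = delta[best] * A[best][q] * B[q][t]
            new_gamma[q] = gamma[best] + [q]
        delta, gamma = new_delta, new_gamma
    last = max(delta, key=delta.get)
    return gamma[last], delta[last]

# Hypothetical two-state model over the observation symbols "a" and "b".
pi = {"s1": 0.6, "s2": 0.4}
A = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
B = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}

path, prob = viterbi(pi, A, B, ["a", "b", "a"])
```

On this toy model the best path for (a, b, a) is s1, s2, s1, with probability 0.6 x 0.9 x 0.3 x 0.8 x 0.4 x 0.9 = 0.046656. Storing whole paths in `gamma` keeps the sketch close to the slide's notation; practical implementations usually keep back-pointers and work with log-probabilities.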

23. Example of the Viterbi Computation. Using the HMM described in the figure with the sequence (a, b, a), the Viterbi algorithm proceeds through the following steps. [Figure: A sample HMM]

24. [Figure: Computation of the optimal path using the Viterbi algorithm.] There are two optimal paths: {S1, S3, S1} and {S3, S2, S3}.

25. The Training of the HMM: the Baum-Welch re-estimation formulas. Let $\mu_n(q)$ be the probability $P(s_n = q \mid T, \lambda)$ of being in the state q at time n while generating the observation sequence T. Then $\mu_n(q) \cdot P(T \mid \lambda)$ is the probability of generating T passing through the state q at time n. By definition of the forward and backward variables, this probability is equal to $\alpha_n(q) \cdot \beta_n(q)$. Thus, $\mu_n(q) = \alpha_n(q) \cdot \beta_n(q) / P(T \mid \lambda)$. Also let $\nu_n(q, q')$ be the probability $P(s_n = q, s_{n+1} = q' \mid T, \lambda)$ of passing from state q to state q' at time n while generating the observation sequence T. As in the preceding equation, $\nu_n(q, q') = \alpha_n(q) \cdot A(q, q') \cdot B(q', t_{n+1}) \cdot \beta_{n+1}(q') / P(T \mid \lambda)$. The sum of $\mu_n(q)$ over all $n = 1 \ldots |T|$ can be seen as the expected number of times the state q was visited while generating the sequence T. If one instead sums over $n = 1 \ldots |T| - 1$, the result is the expected number of transitions out of the state q, because there is no transition at time $|T|$.

26. Similarly, the sum of $\nu_n(q, q')$ over all $n = 1 \ldots |T| - 1$ can be interpreted as the expected number of transitions from the state q to q', where $\mu_n(q)$ is the probability of being in state q at time n and $\nu_n(q, q')$ the probability of passing from q to q' at time n, both given T. The Baum-Welch formulas re-estimate the parameters of the model according to these expectations: $\pi'(q) := \mu_1(q)$; $A'(q, q') := \sum_{n=1}^{|T|-1} \nu_n(q, q') \,/\, \sum_{n=1}^{|T|-1} \mu_n(q)$; $B'(q, o) := \sum_{n : t_n = o} \mu_n(q) \,/\, \sum_{n=1}^{|T|} \mu_n(q)$. It can be shown that the model $\lambda' = (\pi', A', B')$ is either equal to $\lambda$, in which case $\lambda$ is a critical point of the likelihood function $P(T \mid \lambda)$, or better accounts for the training sequence T than the original model, in the sense that $P(T \mid \lambda') > P(T \mid \lambda)$. Therefore, the training problem can be solved by iteratively applying the re-estimation formulas until convergence.
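One re-estimation step can be sketched as follows, assuming a single training sequence and a dictionary representation of the model. The function names, the toy model, and the training sequence are illustrative; `mu` and `nu` are the state-occupancy and transition probabilities described above.

```python
def forward(pi, A, B, T):
    alpha = [{q: pi[q] * B[q][T[0]] for q in pi}]
    for t in T[1:]:
        prev = alpha[-1]
        alpha.append({q: sum(prev[p] * A[p][q] for p in pi) * B[q][t] for q in pi})
    return alpha

def backward(pi, A, B, T):
    beta = [{q: 1.0 for q in pi}]
    for t in reversed(T[1:]):
        nxt = beta[0]
        beta.insert(0, {q: sum(A[q][p] * B[p][t] * nxt[p] for p in pi) for q in pi})
    return beta

def baum_welch_step(pi, A, B, T):
    """One re-estimation step; returns (pi', A', B', P(T | old model))."""
    Q, N = list(pi), len(T)
    alpha, beta = forward(pi, A, B, T), backward(pi, A, B, T)
    p_T = sum(alpha[-1].values())
    # mu[n][q]: probability of being in q at position n, given T.
    mu = [{q: alpha[n][q] * beta[n][q] / p_T for q in Q} for n in range(N)]
    # nu[n][(q, r)]: probability of the transition q -> r at position n, given T.
    nu = [{(q, r): alpha[n][q] * A[q][r] * B[r][T[n + 1]] * beta[n + 1][r] / p_T
           for q in Q for r in Q} for n in range(N - 1)]
    new_pi = dict(mu[0])
    new_A = {q: {r: sum(nu[n][(q, r)] for n in range(N - 1)) /
                    sum(mu[n][q] for n in range(N - 1)) for r in Q} for q in Q}
    new_B = {q: {o: sum(mu[n][q] for n in range(N) if T[n] == o) /
                    sum(mu[n][q] for n in range(N)) for o in B[q]} for q in Q}
    return new_pi, new_A, new_B, p_T

# Hypothetical starting model and training sequence.
pi = {"s1": 0.6, "s2": 0.4}
A = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
B = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}
T = list("abbaab")

pi1, A1, B1, likelihood0 = baum_welch_step(pi, A, B, T)
pi2, A2, B2, likelihood1 = baum_welch_step(pi1, A1, B1, T)
```

Here `likelihood1` is the likelihood of T under the re-estimated model and is never smaller than `likelihood0`. The sketch assumes every state has non-zero occupancy; a practical implementation would also rescale the forward and backward variables to avoid underflow on long sequences.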

27. Dealing with Training Data Sparseness. Two techniques address data sparseness problems in probabilistic modeling: smoothing and shrinkage. Smoothing is the process of flattening a probability distribution implied by a model so that all reasonable sequences can occur with some probability. This involves broadening the distribution by redistributing weight from high-probability regions to zero-probability regions. An example is Laplace smoothing, in which every possible training event is treated as occurring one time more than it actually does; any constant can be used instead of one. Other possible methods include back-off smoothing and deleted interpolation.
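Laplace smoothing is short enough to show in full. In this sketch the function name `laplace_estimate` and its parameter `k` (the constant added to every count) are illustrative choices.

```python
from collections import Counter

def laplace_estimate(observations, vocabulary, k=1.0):
    """Additive smoothing: treat every symbol as seen k extra times."""
    counts = Counter(observations)
    total = len(observations) + k * len(vocabulary)
    return {sym: (counts[sym] + k) / total for sym in vocabulary}

probs = laplace_estimate(["a", "a", "b"], ["a", "b", "c"])
```

With the three observations above, the unseen symbol "c" receives probability 1/6 instead of zero, while "a" drops from 2/3 to 1/2 and "b" stays at 1/3.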

28. Shrinkage. Shrinkage is defined in terms of some hierarchy representing the expected similarity between parameter estimates. With respect to HMMs, the hierarchy can be defined as a tree with the HMM states for the leaves, all at the same depth. The hierarchy is created as follows. First, the most complex HMM is built and its states are used for the leaves of the tree. Then the states are separated into disjoint classes within which the states are expected to have similar probability distributions. The classes become the parents of their constituent states in the hierarchy (the HMM structure at the leaves induces a simpler HMM structure at the level of the classes). This simpler HMM is generated by summing the probabilities of emissions and transitions of all states in a class. The process may be repeated until only a single-state HMM remains at the root of the hierarchy.

29. Successful Application Areas of HMMs: online handwriting recognition; speech recognition; gesture recognition; language recognition; motion video analysis and tracking; protein sequence / gene sequence alignment; stock price prediction.

30. Stochastic Context-Free Grammars. An SCFG is a quintuple $G = (T, N, S, R, P)$, where T is the alphabet of terminal symbols (tokens), N is the set of nonterminals, S is the starting nonterminal, R is the set of rules, and $P : R \to [0, 1]$ defines their probabilities. The rules have the form $n \to s_1 s_2 \ldots s_k$, where n is a nonterminal and each $s_i$ is either a token or another nonterminal. An SCFG generates (or accepts) a given string (sequence of tokens) if the string can be produced by starting from a sequence containing just the starting symbol S and expanding nonterminals one by one using the rules of the grammar. The generated string can be naturally represented by a parse tree, with the starting symbol as the root, nonterminals as internal nodes, and tokens as leaves.

31. An SCFG is an ordinary context-free grammar with the addition of the P function. The semantics of the probability function P are straightforward. If r is the rule $n \to s_1 s_2 \ldots s_k$, then $P(r)$ is the frequency of expanding n using this rule. In Bayesian terms, if it is known that a given sequence of tokens was generated by expanding n, then $P(r)$ is the a priori likelihood that n was expanded using the rule r. For every nonterminal n, the sum $\sum_r P(r)$ of the probabilities of all rules r headed by n must be equal to one.

32. Using SCFGs. In the classical definition of SCFGs, the rules are all assumed to be independent. The (unconditional) probability of a given parse tree is then found by simply multiplying the probabilities of all rules participating in it. The parsing problem is formulated as follows: given a sequence of tokens (a string), find the most probable parse tree that could generate the string. A simple generalization of the Viterbi algorithm is able to solve this problem efficiently. In practical applications of SCFGs, it is rarely the case that the rules are truly independent. The remedy is to let the probabilities $P(r)$ be conditioned on the context where the rule is applied. If the conditioning context is chosen reasonably, the Viterbi algorithm still works correctly even for this more general problem.
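Under the independence assumption, the probability of a parse tree is just the product of its rule probabilities. The toy grammar below and the tuple encoding of trees are hypothetical, chosen only to make the product concrete.

```python
import math

# Hypothetical toy grammar: (head, body) -> probability.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("students",)): 0.6,
    ("NP", ("breaks",)): 0.4,
    ("VP", ("need", "NP")): 0.7,
    ("VP", ("sleep",)): 0.3,
}

def tree_probability(tree, rules=RULES):
    """tree = (nonterminal, [children]); a child is a token or a subtree."""
    head, children = tree
    # The rule body lists tokens as-is and subtrees by their head symbol.
    body = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rules[(head, body)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_probability(c, rules)
    return p

def heads_are_normalized(rules=RULES):
    """Check that the probabilities of rules sharing a head sum to one."""
    totals = {}
    for (head, _body), p in rules.items():
        totals[head] = totals.get(head, 0.0) + p
    return all(math.isclose(t, 1.0) for t in totals.values())

tree = ("S", [("NP", ["students"]), ("VP", ["need", ("NP", ["breaks"])])])
```

For this tree, `tree_probability(tree)` multiplies 1.0 x 0.6 x 0.7 x 0.4 and gives 0.168. Finding the most probable tree for a string would additionally need a chart parser, which this sketch does not attempt.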

33. Maximal Entropy Modeling. Consider a random process of an unknown nature that produces a single output value y, a member of a finite set Y of possible output values. The process of generating y may be influenced by some contextual information x, a member of the set X of possible contexts. The task is to construct a statistical model that accurately represents the behavior of the random process. Such a model is a method of estimating the conditional probability of generating y given the context x. Let $P(x, y)$ denote the unknown true joint probability distribution of the random process, and let $p(y \mid x)$ be the model we are trying to build, taken from the class of all possible models. To build the model we are given a set of training samples generated by observing the random process for some time. The training data consist of a sequence of pairs $(x_i, y_i)$ of different outputs produced in different contexts.

34. In many cases the set X is too large and underspecified to be used directly. For instance, X may be the set of all dots "." in all possible English texts. By contrast, the set Y may be extremely simple while remaining interesting. In the preceding case, Y may contain just two outcomes: SentenceEnd and NotSentenceEnd. The target model $p(y \mid x)$ would in this case solve the problem of finding sentence boundaries. In such cases it is impossible to use the context x directly to generate the output y. There are usually many regularities and correlations, however, that can be exploited. Different contexts are usually similar to each other in all manner of ways, and similar contexts tend to produce similar output distributions.

35. To express such regularities and their statistics, one can use constraint functions and their expected values. A constraint function $f : X \times Y \to \mathbb{R}$ can be any real-valued function. In practice it is common to use binary-valued trigger functions: such a trigger function returns one for a pair (x, y) if the context x satisfies the condition predicate C and the output value y is $y_i$. A common short notation for such a trigger function is $C \to y_i$. For the example above, useful triggers are "previous token is Mr → NotSentenceEnd" and "next token is capitalized → SentenceEnd". Given a constraint function f, we express its importance by requiring our target model to reproduce f's expected value faithfully in the true distribution: $p(f) = \sum_{x,y} P(x)\, p(y \mid x)\, f(x, y) = P(f) = \sum_{x,y} P(x, y)\, f(x, y)$, where $P(x)$ is the marginal of $P(x, y)$.

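36. In practice we cannot calculate the true expectation and must use an empirical expected value calculated by summing over the N training samples: $p_E(f) = \frac{1}{N} \sum_{i=1}^{N} \sum_{y \in Y} p(y \mid x_i)\, f(x_i, y) = P_E(f) = \frac{1}{N} \sum_{i=1}^{N} f(x_i, y_i)$. The choice of feature functions is domain dependent. Let us assume the complete set of features $F = \{f_k\}$ is given. We express the completeness of the set of features by requiring that the model agree with all the expected value constraints $p_E(f_k) = P_E(f_k)$ while otherwise being as uniform as possible. The uniformity requirement defines the target model uniquely. The degree of uniformity of a model is expressed by its conditional entropy $H(p) = -\sum_{x,y} P(x)\, p(y \mid x) \log p(y \mid x)$ or, empirically, $H_E(p) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{y \in Y} p(y \mid x_i) \log p(y \mid x_i)$.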

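37. The constrained optimization problem of finding the maximal-entropy target model is solved by application of Lagrange multipliers and the Kuhn-Tucker theorem. Let us introduce a parameter $\lambda_k$ (the Lagrange multiplier) for every feature. Define the Lagrangian $\Lambda(p, \lambda)$ by combining the entropy $H_E(p)$ with the expected value constraints, each weighted by its multiplier $\lambda_k$. Holding $\lambda$ fixed, we compute the unconstrained maximum of the Lagrangian over all p. Denote by $p_\lambda$ the p where $\Lambda(p, \lambda)$ achieves its maximum and by $\Psi(\lambda)$ the value of $\Lambda$ at this point. The functions $p_\lambda$ and $\Psi(\lambda)$ can be calculated using simple calculus; in particular, $p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\left(\sum_k \lambda_k f_k(x, y)\right)$, where $Z_\lambda(x)$ is a normalizing constant determined by the requirement that $\sum_{y \in Y} p_\lambda(y \mid x) = 1$. The dual optimization problem is to find $\lambda^* = \arg\max_\lambda \Psi(\lambda)$.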

38. The Kuhn-Tucker theorem asserts that, under certain conditions, the solutions of the primal and dual optimization problems coincide. The model $p^*$, which maximizes $H_E(p)$ while satisfying the constraints, has the parametric form $p_{\lambda^*}$. The function $\Psi(\lambda)$ is simply the log-likelihood of the training sample as predicted by the model $p_\lambda$. Thus, the model $p^*$ maximizes the likelihood of the training sample among all models of the parametric form $p_\lambda$.

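39. Computing the Parameters of the Model. The function $\Psi(\lambda)$ is well behaved from the perspective of numerical optimization, for it is smooth and concave. Consequently, various methods can be used for calculating $\lambda^*$. Generalized iterative scaling is the algorithm specifically tailored for the problem. This algorithm is applicable whenever all constraint functions are non-negative: $f_k(x, y) \geq 0$. The algorithm starts with an arbitrary choice of the $\lambda$'s, for instance $\lambda_k = 0$ for all k. At each iteration the $\lambda$'s are adjusted as follows: $\lambda_k := \lambda_k + \Delta\lambda_k$, where $\Delta\lambda_k$ solves $P_E(f_k) = \frac{1}{N} \sum_{i=1}^{N} \sum_{y \in Y} p_\lambda(y \mid x_i)\, f_k(x_i, y) \exp\left(\Delta\lambda_k f^{\#}(x_i, y)\right)$ and $f^{\#}(x, y) = \sum_k f_k(x, y)$. In the simplest case, when $f^{\#}$ is constant, $\Delta\lambda_k$ is simply $(1/f^{\#}) \log\left(P_E(f_k) / p_E(f_k)\right)$. Otherwise, any numerical algorithm for solving the equation can be used, such as Newton's method.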
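The constant-$f^{\#}$ case of generalized iterative scaling can be sketched on the sentence-boundary example. Everything concrete here is an assumption made for illustration: the function name `gis`, the two coarse contexts "Mr" and "other", the 20 training pairs, and the third "slack" feature added so that the features sum to one for every pair.

```python
import math

def gis(samples, labels, features, iters=200):
    """Generalized iterative scaling for a conditional maximum entropy model.

    Assumes non-negative features whose sum f_sharp is the same constant
    for every (x, y) pair. Returns the weights and the fitted model.
    """
    n = len(samples)
    lam = [0.0] * len(features)
    f_sharp = sum(f(*samples[0]) for f in features)
    # Empirical expectation P_E(f_k) of each feature.
    emp = [sum(f(x, y) for x, y in samples) / n for f in features]

    def model(x):
        scores = [math.exp(sum(l * f(x, y) for l, f in zip(lam, features)))
                  for y in labels]
        z = sum(scores)
        return [s / z for s in scores]

    for _ in range(iters):
        # Expectation p_E(f_k) under the current model.
        expected = [0.0] * len(features)
        for x, _y in samples:
            probs = model(x)
            for k, f in enumerate(features):
                expected[k] += sum(p * f(x, y) for p, y in zip(probs, labels)) / n
        for k in range(len(features)):
            lam[k] += math.log(emp[k] / expected[k]) / f_sharp
    return lam, model

LABELS = ["SentenceEnd", "NotSentenceEnd"]

def f_mr(x, y):       # trigger: previous token is Mr -> NotSentenceEnd
    return 1.0 if x == "Mr" and y == "NotSentenceEnd" else 0.0

def f_other(x, y):    # trigger: any other context -> SentenceEnd
    return 1.0 if x == "other" and y == "SentenceEnd" else 0.0

def f_slack(x, y):    # slack feature keeping the feature sum constant at 1
    return 1.0 - f_mr(x, y) - f_other(x, y)

samples = ([("Mr", "NotSentenceEnd")] * 9 + [("Mr", "SentenceEnd")] * 1 +
           [("other", "SentenceEnd")] * 7 + [("other", "NotSentenceEnd")] * 3)
weights, model = gis(samples, LABELS, [f_mr, f_other, f_slack])
```

After fitting, `model("Mr")` gives NotSentenceEnd a probability close to 0.9 and `model("other")` gives SentenceEnd a probability close to 0.7, matching the empirical frequencies in the hypothetical sample. The sketch requires every feature to have a non-zero empirical expectation, since it takes the logarithm of the ratio.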

40. Maximal Entropy Markov Models. A MEMM is a probabilistic finite-state acceptor. Unlike an HMM, which has separate transition and emission probabilities, a MEMM has only transition probabilities, which depend on the observations. A slightly modified version of the Viterbi algorithm solves the problem of finding the most likely state sequence for a given observation sequence. A MEMM consists of a set $Q = \{q_1, \ldots, q_N\}$ of states and a set of transition probability functions $A_q : X \times Q \to [0, 1]$, where X denotes the set of all possible observations. $A_q(x, q')$ gives the probability $P(q' \mid q, x)$ of a transition from q to q', given the observation x. The model does not generate x but only conditions on it. The set X need not be small and need not even be fully defined. The transition probabilities $A_q$ are separate exponential models trained using maximal entropy.

41. The task of a trained MEMM is to produce the most probable sequence of states given the observation, which is solved by a simple modification of the Viterbi algorithm. The forward-backward algorithm loses its meaning, because here it computes the probability of the observation being generated by any state sequence, which is always one. The forward and backward variables are still useful for MEMM training. The forward variable $\alpha_m(q)$ (compare the HMM case) denotes the probability of being in state q at time m given the observation. It is computed recursively as $\alpha_{m+1}(q) = \sum_{q' \in Q} \alpha_m(q') \cdot A_{q'}(x_{m+1}, q)$. The backward variable $\beta_m(q)$ denotes the probability of starting from state q at time m given the observation. It is computed similarly as $\beta_{m-1}(q) = \sum_{q' \in Q} A_q(x_m, q') \cdot \beta_m(q')$. The model $A_q$ for the transition probabilities from a state is defined parametrically using constraint functions. If $f_k : X \times Q \to \mathbb{R}$ is the set of such functions for a given state q, then the model $A_q$ can be represented in the form $A_q(x, q') = Z(x, q)^{-1} \exp\left(\sum_k \lambda_k f_k(x, q')\right)$, where the $\lambda_k$ are the parameters to be trained and $Z(x, q)$ is the normalizing factor making the probabilities of all transitions from a state sum to one.

42. Training the MEMM. If the true state sequence for the training data is known, the parameters of the models can be straightforwardly estimated using the GIS algorithm for training ME models. If the sequence is not known (for instance, if there are several states with the same label in a fully connected MEMM), the parameters must be estimated using a combination of the Baum-Welch procedure and iterative scaling. Every iteration consists of two steps: 1. Use the forward-backward algorithm and the current transition functions to compute the state occupancies for all training sequences. 2. Compute the new transition functions using GIS, with the feature frequencies based on the state occupancies computed in step 1. It is unnecessary to run GIS to convergence in step 2; a single GIS iteration is sufficient.

43. Conditional Random Fields (CRF), outline: problem description; why conditional random fields; introduction to CRF; CRF model; inference of CRF; learning of CRF.

44. Problem Description. Given observed data X, we wish to predict the labels Y. Example: X = {Temperature, Humidity, ...}, with $X_n$ the observation on day n; Y = {Sunny, Rainy, Cloudy}, with $Y_n$ the weather on day n. For instance, given 30°C, 20% humidity, and a light breeze, is the day sunny, rainy, or cloudy? The observations may depend on one another, and the weather may depend on the weather of yesterday.

45. Generative Model vs. Discriminative Model. Generative model: a model that generates the observed data randomly; it models the joint probability p(x, y). Discriminative model: directly estimates the posterior probability p(y|x) and aims at modeling the discrimination between different outputs. Arranged by output structure, the generative models are Naive Bayes (single variable), HMM (sequence), and Bayesian networks and MRFs (general); their conditional counterparts are logistic regression (single variable), linear-chain CRF and MEMM (sequence), and general CRF (general).

46. Why Conditional Random Fields. A generative model aims to find the joint probability p(x, y) and makes its prediction by using Bayes' rule to calculate p(y|x). Examples: Naive Bayes (single output) and the HMM (Hidden Markov Model, sequence output). Naive Bayes: $p(x, y) = p(y) \prod_{k=1}^{K} p(x_k \mid y)$, where x is a vector of features that are assumed independent given y. HMM: $p(x, y) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$, under two assumptions: 1. each state $y_t$ depends only on its immediate predecessor; 2. the observations are conditionally independent given the states.

47. Why Conditional Random Fields. Example observations: Mon. {30°C, 20%, light breeze}; Tue. {28°C, 30%, light breeze}; Wed. {25°C, 40%, moderate breeze}; Thu. {22°C, 60%, moderate breeze}. In the generative model, humidity, temperature, and the wind scale are treated as independent. [Figure legend: A → B means A causes B.]

48. Why Conditional Random Fields. Difficulties for generative models: it is not practical to represent multiple interacting features (it is hard to model p(x)) or long-range dependencies of the observations, and very strict independence assumptions must be made on the observations. Example observations: Mon. {30°C, 20%, light breeze}; Tue. {28°C, 30%, light breeze}; Wed. {25°C, 40%, moderate breeze}; Thu. {22°C, 60%, moderate breeze}.

49. Why Conditional Random Fields. Discriminative models directly model the posterior p(y|x) and aim at modeling the discrimination between different outputs. Examples: logistic regression (maximum entropy) and CRF. Advantages of discriminative models: the training process aims at finding optimal coefficients for the features regardless of whether the features are correlated; they are not sensitive to unbalanced training data; and, especially for the classification problem, we don't have to care about p(x).

50. Why Conditional Random Fields. Logistic regression (maximum entropy): suppose we have a bin of candies, each with an associated label (A, B, C, or D). Each candy has multiple colors in its wrapper, and each candy is assigned a label randomly based on some distribution over wrapper colors. Observation: the color of the wrapper. Label: four kinds of flavors, A: chocolate, B: strawberry, C: lemon, D: milk.

52. Why Conditional Random Fields. Now suppose we add some evidence to our model. We note that 80% of all candies with red wrappers are labeled either A or B, so P(A|red) + P(B|red) = 0.8. The updated model that reflects this would be: P(A|red) = 0.4, P(B|red) = 0.4, P(C|red) = 0.1, P(D|red) = 0.1. As we make more observations and find more constraints, the model gets more complex.
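The updated model is the most uniform distribution that still satisfies the 80% constraint, which can be checked numerically by comparing entropies. The helper names in this sketch are illustrative.

```python
import math

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(v * math.log(v) for v in p if v > 0)

def most_uniform(group_size, rest_size, group_mass):
    """Spread group_mass evenly over the group and the remainder over the rest."""
    return ([group_mass / group_size] * group_size +
            [(1.0 - group_mass) / rest_size] * rest_size)

# Labels A and B share 0.8; labels C and D share the remaining 0.2.
p_red = most_uniform(group_size=2, rest_size=2, group_mass=0.8)
```

Here `p_red` is approximately [0.4, 0.4, 0.1, 0.1]. Any other split that keeps P(A|red) + P(B|red) = 0.8, such as [0.5, 0.3, 0.1, 0.1] or [0.4, 0.4, 0.15, 0.05], has strictly lower entropy, which is why the maximum entropy principle selects the even split.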

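53. Why Conditional Random Fields. Given a collection of facts, choose a model which is consistent with all the facts but otherwise as uniform as possible: $p(y \mid x; w) = \frac{\exp\left(\sum_j w_j F_j(x, y)\right)}{Z(x, w)}$, where $Z(x, w) = \sum_{y'} \exp\left(\sum_j w_j F_j(x, y')\right)$ is a normalization term. The weights $w_j$ are obtained by learning, the $F_j$ are defined feature functions, and x is the evidence. Factor graph: the evidence nodes $x_1, x_2, \ldots, x_d$ are connected to y through factor nodes, where a factor node between nodes A and B represents f(A, B).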

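54. Linear-Chain CRF. If we extend logistic regression to a sequence problem, the output y becomes a sequence of labels and the model keeps the same form: $p(y \mid x; w) = \frac{\exp\left(\sum_j w_j F_j(x, y)\right)}{Z(x, w)}$, where $Z(x, w) = \sum_{y'} \exp\left(\sum_j w_j F_j(x, y')\right)$ is a normalization term and $F_j(x, y) = \sum_t f_j(y_{t-1}, y_t, x)$ is a sum along the sentence. In the graph, each of the labels $y_{t-1}$, $y_t$, $y_{t+1}$ is connected to its neighbours and to the observations $x_1, x_2, \ldots, x_d$, so every feature may look at the entire x.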
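For very short sentences the normalization term can be computed by brute force, which makes the definition concrete. This sketch is illustrative only: the two feature functions, the tag set, and the "START" placeholder for $y_0$ are hypothetical, and the features take an explicit position argument `t`, which is a common convention but an assumption relative to the slide's $f_j(y_{t-1}, y_t, x)$.

```python
import itertools
import math

TAGS = ["N", "V"]

def f_plural_noun(prev, cur, x, t):   # hypothetical observation feature
    return 1.0 if x[t].endswith("s") and cur == "N" else 0.0

def f_noun_verb(prev, cur, x, t):     # hypothetical transition feature
    return 1.0 if prev == "N" and cur == "V" else 0.0

FEATURES = [f_plural_noun, f_noun_verb]

def score(y, x, w):
    """sum_j w_j F_j(x, y), where each F_j is summed along the sentence."""
    total = 0.0
    for t in range(len(x)):
        prev = y[t - 1] if t > 0 else "START"
        total += sum(wj * f(prev, y[t], x, t) for wj, f in zip(w, FEATURES))
    return total

def crf_probability(y, x, w):
    """p(y | x; w), with Z(x, w) computed by enumerating every tag sequence."""
    z = sum(math.exp(score(list(c), x, w))
            for c in itertools.product(TAGS, repeat=len(x)))
    return math.exp(score(y, x, w)) / z

x = ["students", "need"]
w = [1.0, 0.5]
p_nv = crf_probability(["N", "V"], x, w)
```

With all weights at zero every tag sequence is equally likely; with the weights above, the sequence N V scores highest among the four candidates. Enumeration costs $|\text{TAGS}|^{|x|}$ and is only for illustration; real implementations compute Z with a forward recursion.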

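55. Linear-Chain CRF. [Figure: linear-chain CRF over the labels y1, y2, y3, drawn once with per-position observations x1, x2, x3 and once with the whole observation sequence x.]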

56. General CRF. Divide the graph G into many templates $\Psi_A$. The parameters inside each template are tied, and $K(A)$ is the number of feature functions for the template: $p(y \mid x) = \frac{1}{Z(x)} \prod_{\Psi_A \in G} \exp\left(\sum_{k=1}^{K(A)} \lambda_{ak}\, f_{ak}(y_a, x_a)\right)$.

57. Inference of CRF. Problem description: given the observations $\{x_i\}$ and the probability model (parameters such as the $\lambda_i$ mentioned above), we aim to find the best state sequence. For general graphs, the problem of exact inference in CRFs is intractable, and approximate solutions are needed. Chain-like or tree-like CRFs admit exact inference.

58. Inference of Linear-Chain CRF. The inference of a linear-chain CRF is very similar to that of an HMM. Example: POS (part-of-speech) tagging, the identification of words as nouns, verbs, adjectives, adverbs, etc. For the sentence "Students need another break", the tags are: Students/noun need/verb another/article break/noun.

59. Inference of Linear-Chain CRF. We first illustrate the inference of an HMM on the sentence "students need another break". Starting from the start node (o/s), each word/tag node of the trellis carries the following score, with tags V, N, P, and ART:

           V           N           P        ART
students   7.6x10^-6   0.00725     0        0
need       0.00031     1.3x10^-5   0.0002   0
another    0           1.2x10^-7   0        7.2x10^-5
break      2.6x10^-9   4.3x10^-6   0        0

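60. Inference of Linear-Chain CRF. Then back to the CRF, writing the weights as $\lambda_j$: $y^* = \arg\max_y p(y \mid x; \lambda) = \arg\max_y \frac{\exp\left(\sum_j \lambda_j F_j(x, y)\right)}{Z(x)} = \arg\max_y \sum_j \lambda_j F_j(x, y) = \arg\max_y \sum_j \lambda_j \sum_i f_j(y_{i-1}, y_i, x) = \arg\max_y \sum_i \sum_j \lambda_j f_j(y_{i-1}, y_i, x) = \arg\max_y \sum_i g_i(y_{i-1}, y_i)$, where $g_i(y_{i-1}, y_i) = \sum_j \lambda_j f_j(y_{i-1}, y_i, x)$.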

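61. Inference of Linear-Chain CRF. Each $g_i$ can be represented as an $M \times M$ matrix, where M is the cardinality of the set of tags: $g_i(y_{i-1}, y_i) = \sum_j \lambda_j f_j(y_{i-1}, y_i, x)$. The rows of the matrix are indexed by the previous tag $y_{i-1}$ and the columns by the current tag $y_i$, for example over the tags V, ART, and N.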

62. Inference of Linear-Chain CRF. The inference of a linear-chain CRF is similar to that of an HMM, which uses the Viterbi algorithm. Let v range over the tags, and define $U(k, v)$ to be the score of the best sequence of tags from position 1 to k, where tag k is required to be v: $U(k, v) = \max_{y_1, \ldots, y_{k-1}} \left[\sum_{i=1}^{k-1} g_i(y_{i-1}, y_i) + g_k(y_{k-1}, v)\right] = \max_{y_{k-1}} \left[U(k-1, y_{k-1}) + g_k(y_{k-1}, v)\right]$.
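The recursion for $U(k, v)$ can be sketched directly on score tables $g_i$. The representation of each $g_i$ as a dictionary keyed by (previous tag, current tag), the "START" placeholder for $y_0$, and the scores themselves are assumptions made for this illustration.

```python
def crf_viterbi(g, tags, start="START"):
    """Maximize sum_i g_i(y_{i-1}, y_i) over tag sequences.

    g[i][(prev, cur)] holds the score g_i(prev, cur); the first table is
    keyed by (start, cur). Returns (best tag sequence, best score).
    """
    # U[v]: score of the best prefix whose last tag is v.
    U = {v: g[0][(start, v)] for v in tags}
    back = []
    for gi in g[1:]:
        new_U, ptr = {}, {}
        for v in tags:
            best = max(tags, key=lambda u: U[u] + gi[(u, v)])
            new_U[v] = U[best] + gi[(best, v)]
            ptr[v] = best
        U = new_U
        back.append(ptr)
    # Follow the back-pointers from the best final tag.
    last = max(U, key=U.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, U[last]

# Hypothetical scores for a three-word sentence over two tags.
tags = ["N", "V"]
g = [
    {("START", "N"): 1.0, ("START", "V"): 0.0},
    {("N", "N"): 0.0, ("N", "V"): 2.0, ("V", "N"): 0.5, ("V", "V"): 0.0},
    {("N", "N"): 0.2, ("N", "V"): 0.0, ("V", "N"): 1.5, ("V", "V"): 0.1},
]
best_tags, best_score = crf_viterbi(g, tags)
```

For these scores the best sequence is N, V, N with total score 1.0 + 2.0 + 1.5 = 4.5. Unlike the HMM version, the recursion adds scores rather than multiplying probabilities, because the normalization term Z(x) does not affect the argmax.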

63. Learning of CRF. Problem description: given training pairs $\{(x_i, y_i)\}$, we wish to estimate the parameters of the model $\{\lambda_i\}$. Method: chain-structured or tree-structured CRFs can be trained by maximum likelihood, and we will focus on the learning of the linear-chain CRF. Learning in general CRFs is intractable, hence approximate solutions are necessary.

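64. Learning of Linear-chain CRF. Conditional maximum likelihood (CML), with x the observations and y the labels: $\max_\lambda L(\lambda; y \mid x) = \max_\lambda p(y \mid x; \lambda)$, or equivalently $\max_\lambda \log p(y \mid x; \lambda)$. Applying CML to the learning of the CRF: it can be shown that the conditional log-likelihood of the linear-chain CRF is a concave function of $\lambda$, so we can apply gradient ascent to the CML problem. At the optimum the gradient vanishes: $\frac{\partial}{\partial \lambda_j} \log p(y \mid x; \lambda) = F_j(x, y) - \frac{\partial}{\partial \lambda_j} \log Z(x, \lambda) = 0$.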

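65. Learning of Linear-chain CRF. Expanding the derivative for a single training pair: $\frac{\partial}{\partial \lambda_j} \log p(y \mid x; \lambda) = F_j(x, y) - \frac{\partial}{\partial \lambda_j} \log Z(x, \lambda) = F_j(x, y) - \sum_{y'} F_j(x, y')\, p(y' \mid x; \lambda) = F_j(x, y) - E_{y' \sim p(y' \mid x; \lambda)}[F_j(x, y')] = 0$, where $E_p[\cdot]$ denotes expectation with respect to the distribution p. For the entire training set T: $\sum_{(x, y) \in T} F_j(x, y) = \sum_{x \in T} E_{y \sim p(y \mid x; \lambda)}[F_j(x, y)]$. The left-hand side is the expectation of the feature $F_j$ with respect to the empirical distribution; the right-hand side is its expectation with respect to the model distribution.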

66. Learning of Linear-chain CRF. To yield the best model, the expectation of each feature with respect to the model distribution must equal its expected value under the empirical distribution of the training data. This is the same condition as in the maximum entropy model: logistic regression (maximum entropy), extended to sequences, gives the linear-chain CRF.