Sequence Modelling with Features: Linear-Chain Conditional Random Fields
COMP-550
Oct 3, 2017
Outline
• Hidden Markov models: shortcomings
• Generative vs. discriminative models
• Linear-chain CRFs
• Inference and learning algorithms with linear-chain CRFs
Hidden Markov Model
The graph specifies how the joint probability decomposes:

$$P(\mathbf{O}, \mathbf{Q}) = P(Q_1) \prod_{t=1}^{T-1} P(Q_{t+1} \mid Q_t) \prod_{t=1}^{T} P(O_t \mid Q_t)$$

[Figure: HMM as a graphical model, with hidden states $Q_1, \dots, Q_5$ each emitting an observation $O_1, \dots, O_5$]

• $P(Q_1)$: initial state probability
• $P(Q_{t+1} \mid Q_t)$: state transition probabilities
• $P(O_t \mid Q_t)$: emission probabilities
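To make the decomposition concrete, here is a minimal sketch that evaluates $P(\mathbf{O}, \mathbf{Q})$ for a toy HMM; the states, vocabulary, and probability tables below are assumed purely for illustration.

```python
import numpy as np

# Toy HMM with N = 2 hidden states and a two-word vocabulary (all values assumed).
states = ["DT", "NN"]
vocab = {"the": 0, "dog": 1}
pi = np.array([0.8, 0.2])            # initial state probabilities P(Q_1)
A = np.array([[0.1, 0.9],            # transition probabilities P(Q_{t+1} | Q_t)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],            # emission probabilities P(O_t | Q_t)
              [0.2, 0.8]])

def joint_prob(obs, hidden):
    """P(O, Q) = P(Q_1) * prod_t P(Q_{t+1} | Q_t) * prod_t P(O_t | Q_t)."""
    p = pi[hidden[0]]
    for t in range(1, len(hidden)):
        p *= A[hidden[t - 1], hidden[t]]
    for t, o in enumerate(obs):
        p *= B[hidden[t], vocab[o]]
    return p

print(joint_prob(["the", "dog"], [0, 1]))  # P(O = "the dog", Q = DT NN)
```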
Shortcomings of Standard HMMs
How do we add more features to HMMs? The following might be useful for POS tagging:
• Word position within the sentence (1st, 2nd, last…)
• Capitalization
• Word prefixes and suffixes (-tion, -ed, -ly, -er, re-, de-)
• Features that depend on more than the current word or the previous word
Possible to Do with HMMs
Add more emissions at each timestep:

[Figure: a single hidden state $Q_1$ emitting four observations $O_{11}, O_{12}, O_{13}, O_{14}$, representing word identity, capitalization, a prefix feature, and word position]

This is clunky. Is there a better way to do this?
Discriminative Models
HMMs are generative models:
• They build a model of the joint probability distribution $P(\mathbf{O}, \mathbf{Q})$.
• Let's rename the variables: the observations $\mathbf{O}$ become $X$, and the states $\mathbf{Q}$ become $Y$.
• Generative models specify $P(X, Y; \theta_{gen})$.
If we are only interested in POS tagging, we can instead train a discriminative model:
• The model specifies $P(Y \mid X; \theta_{disc})$.
• This is now a task-specific model for sequence labelling; we cannot use it to generate new samples of word and POS sequences.
Generative or Discriminative?
Naive Bayes:

$$P(y \mid \vec{x}) = P(y) \prod_i P(x_i \mid y) \,/\, P(\vec{x})$$

Logistic regression:

$$P(y \mid \vec{x}) = \frac{1}{Z} e^{a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b}$$
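For concreteness, here is a minimal sketch of the discriminative (logistic regression) side, with toy weights $a$ and bias $b$ assumed; for binary $y$, the normalizer $Z$ reduces to the familiar sigmoid.

```python
import numpy as np

# Assumed toy weights for a logistic-regression classifier over n = 3 binary features.
a = np.array([1.5, -0.7, 0.3])   # feature weights a_1 .. a_n
b = 0.1                          # bias

def p_y_given_x(x):
    """P(y = 1 | x) = exp(a.x + b) / Z, with Z summing over y in {0, 1}."""
    score = a @ x + b
    # Taking the score of y = 0 to be 0, Z = 1 + exp(score): the sigmoid.
    return np.exp(score) / (1.0 + np.exp(score))

print(p_y_given_x(np.array([1, 0, 1])))
```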
Discriminative Sequence Model
The parallel to an HMM in the discriminative case: linear-chain conditional random fields (linear-chain CRFs) (Lafferty et al., 2001)

$$P(Y \mid X) = \frac{1}{Z(X)} \exp \sum_t \sum_k \theta_k f_k(y_t, y_{t-1}, x_t)$$

$Z(X)$ is a normalization constant:

$$Z(X) = \sum_{\mathbf{y}} \exp \sum_t \sum_k \theta_k f_k(y_t, y_{t-1}, x_t)$$

Here $\sum_k$ is a sum over all features, $\sum_t$ a sum over all time-steps, and $\sum_{\mathbf{y}}$ a sum over all possible sequences of hidden states.
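A brute-force sketch of this definition, assuming two toy feature functions and hand-picked weights; enumerating all label sequences to compute $Z(X)$ is only feasible for tiny examples (the forward algorithm later does it in $O(N^2 T)$).

```python
import itertools
import numpy as np

N = 2  # number of labels (assumed)

def features(y_t, y_prev, x_t):
    """Feature vector f(y_t, y_{t-1}, x_t); y_prev is None at t = 1."""
    return np.array([
        1.0 if (y_prev == 0 and y_t == 1) else 0.0,   # transition 0 -> 1
        1.0 if (y_t == 0 and x_t == "the") else 0.0,  # emit "the" from label 0
    ])

theta = np.array([2.0, 1.5])  # hand-picked weights, for illustration only

def score(ys, xs):
    """exp of the sum over t of theta . f(y_t, y_{t-1}, x_t)."""
    total = 0.0
    for t in range(len(xs)):
        y_prev = ys[t - 1] if t > 0 else None
        total += theta @ features(ys[t], y_prev, xs[t])
    return np.exp(total)

def p_y_given_x(ys, xs):
    # Z(X): brute-force sum over all N^T label sequences.
    Z = sum(score(list(y), xs)
            for y in itertools.product(range(N), repeat=len(xs)))
    return score(ys, xs) / Z

print(p_y_given_x([0, 1], ["the", "dog"]))
```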
Intuition
Standard HMM: a product of probabilities; these probabilities are defined over the identities of the states and words.
• Transition from state DT to NN: $P(y_{t+1} = NN \mid y_t = DT)$
• Emit word the from state DT: $P(x_t = the \mid y_t = DT)$
Linear-chain CRF: replace these probabilities by scores that are NOT probabilities, but (exponentiated) linear combinations of weights and feature values.
Features in CRFs
Standard HMM probabilities as CRF features:
• Transition from state DT to state NN: $f_{DT \to NN}(y_t, y_{t-1}, x_t) = \mathbf{1}(y_{t-1} = DT)\,\mathbf{1}(y_t = NN)$
• Emit the from state DT: $f_{DT \to the}(y_t, y_{t-1}, x_t) = \mathbf{1}(y_t = DT)\,\mathbf{1}(x_t = the)$
• Initial state is DT: $f_{DT}(y_1, x_1) = \mathbf{1}(y_1 = DT)$
Indicator function: $\mathbf{1}(condition) = 1$ if $condition$ is true, 0 otherwise.
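These indicator features translate directly into code; a minimal sketch, with string-valued labels and words assumed:

```python
def f_trans_DT_NN(y_t, y_prev, x_t):
    # 1(y_{t-1} = DT) * 1(y_t = NN)
    return 1.0 if (y_prev == "DT" and y_t == "NN") else 0.0

def f_emit_DT_the(y_t, y_prev, x_t):
    # 1(y_t = DT) * 1(x_t = "the")
    return 1.0 if (y_t == "DT" and x_t == "the") else 0.0

def f_init_DT(y_1, x_1):
    # 1(y_1 = DT)
    return 1.0 if y_1 == "DT" else 0.0
```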
Features in CRFs
Additional features that may be useful:
• Word is capitalized: $f_{cap}(y_t, y_{t-1}, x_t) = \mathbf{1}(y_t = ?)\,\mathbf{1}(x_t \text{ is capitalized})$
• Word ends in -ed: $f_{-ed}(y_t, y_{t-1}, x_t) = \mathbf{1}(y_t = ?)\,\mathbf{1}(x_t \text{ ends with } ed)$
• Exercise: propose more features (a few possibilities are sketched below)
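A sketch of how such features might look in code; the slide's "?" placeholder is filled with assumed tags (NNP, VBD, RB) purely for illustration.

```python
def f_cap(y_t, y_prev, x_t):
    # 1(y_t = NNP) * 1(x_t is capitalized); NNP is an assumed choice of tag
    return 1.0 if (y_t == "NNP" and x_t[:1].isupper()) else 0.0

def f_ed(y_t, y_prev, x_t):
    # 1(y_t = VBD) * 1(x_t ends with -ed); VBD is an assumed choice of tag
    return 1.0 if (y_t == "VBD" and x_t.endswith("ed")) else 0.0

def f_ly(y_t, y_prev, x_t):
    # One possible answer to the exercise: -ly suffix suggests an adverb (RB)
    return 1.0 if (y_t == "RB" and x_t.endswith("ly")) else 0.0
```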
Inference with LC-CRFs
Dynamic programming still works: modify the forward and Viterbi algorithms to work with the weight-feature products.

|                   | HMM                              | LC-CRF                           |
| ----------------- | -------------------------------- | -------------------------------- |
| Forward algorithm | $P(X \mid \theta)$               | $Z(X)$                           |
| Viterbi algorithm | $\arg\max_Y P(X, Y \mid \theta)$ | $\arg\max_Y P(Y \mid X, \theta)$ |
Forward Algorithm for HMMs
Create trellis $\alpha_i(t)$ for $i = 1 \dots N$, $t = 1 \dots T$

$\alpha_j(1) = \pi_j b_j(O_1)$ for $j = 1 \dots N$
for $t = 2 \dots T$:
    for $j = 1 \dots N$:
        $\alpha_j(t) = \sum_{i=1}^{N} \alpha_i(t-1)\, a_{ij}\, b_j(O_t)$

$$P(\mathbf{O} \mid \theta) = \sum_{j=1}^{N} \alpha_j(T)$$

Runtime: $O(N^2 T)$
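A direct sketch of this pseudocode in Python, reusing the toy HMM parameters (`pi`, `A`, `B`, `vocab`) assumed in the earlier example:

```python
import numpy as np

def forward(obs, pi, A, B, vocab):
    N, T = len(pi), len(obs)
    alpha = np.zeros((N, T))
    alpha[:, 0] = pi * B[:, vocab[obs[0]]]        # alpha_j(1) = pi_j b_j(O_1)
    for t in range(1, T):
        for j in range(N):
            # alpha_j(t) = sum_i alpha_i(t-1) a_ij b_j(O_t)
            alpha[j, t] = alpha[:, t - 1] @ A[:, j] * B[j, vocab[obs[t]]]
    return alpha[:, -1].sum()                     # P(O | theta)

print(forward(["the", "dog"], pi, A, B, vocab))
```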
Forward Algorithm for LC-CRFs
Create trellis $\alpha_i(t)$ for $i = 1 \dots N$, $t = 1 \dots T$

$\alpha_j(1) = \exp \sum_k \theta_k^{init} f_k^{init}(y_1 = j, x_1)$ for $j = 1 \dots N$
for $t = 2 \dots T$:
    for $j = 1 \dots N$:
        $\alpha_j(t) = \sum_{i=1}^{N} \alpha_i(t-1) \exp \sum_k \theta_k f_k(y_t = j, y_{t-1} = i, x_t)$

$$Z(X) = \sum_{j=1}^{N} \alpha_j(T)$$

Runtime: $O(N^2 T)$
Having $Z(X)$ allows us to compute $P(Y \mid X)$.
The transition and emission probabilities are replaced by the exponent of weighted sums of features.
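The same trellis with CRF scores; a sketch assuming a hypothetical `feats(j, i, x_t)` helper that returns the feature vector $f(y_t = j, y_{t-1} = i, x_t)$, with `i = None` at $t = 1$ standing in for the initial features:

```python
import numpy as np

def crf_forward(xs, feats, theta, N=2):
    T = len(xs)
    alpha = np.zeros((N, T))
    for j in range(N):
        # alpha_j(1) = exp(theta . f_init(y_1 = j, x_1))
        alpha[j, 0] = np.exp(theta @ feats(j, None, xs[0]))
    for t in range(1, T):
        for j in range(N):
            # alpha_j(t) = sum_i alpha_i(t-1) exp(theta . f(y_t=j, y_{t-1}=i, x_t))
            alpha[j, t] = sum(alpha[i, t - 1] * np.exp(theta @ feats(j, i, xs[t]))
                              for i in range(N))
    return alpha[:, -1].sum()   # Z(X)
```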
Viterbi Algorithm for HMMs
Create trellis $\delta_i(t)$ for $i = 1 \dots N$, $t = 1 \dots T$

$\delta_j(1) = \pi_j b_j(O_1)$ for $j = 1 \dots N$
for $t = 2 \dots T$:
    for $j = 1 \dots N$:
        $\delta_j(t) = \max_i \delta_i(t-1)\, a_{ij}\, b_j(O_t)$

Take $\max_i \delta_i(T)$
Runtime: $O(N^2 T)$
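A sketch of the recurrence with backpointers added (so the best state sequence can be recovered), again reusing the assumed toy HMM parameters:

```python
import numpy as np

def viterbi(obs, pi, A, B, vocab):
    N, T = len(pi), len(obs)
    delta = np.zeros((N, T))
    back = np.zeros((N, T), dtype=int)
    delta[:, 0] = pi * B[:, vocab[obs[0]]]
    for t in range(1, T):
        for j in range(N):
            scores = delta[:, t - 1] * A[:, j] * B[j, vocab[obs[t]]]
            back[j, t] = scores.argmax()     # remember the best predecessor
            delta[j, t] = scores.max()
    # Follow backpointers from the best final state.
    path = [int(delta[:, -1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    return list(reversed(path))

print(viterbi(["the", "dog"], pi, A, B, vocab))
```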
Viterbi Algorithm for LC-CRFs
Create trellis $\delta_i(t)$ for $i = 1 \dots N$, $t = 1 \dots T$

$\delta_j(1) = \exp \sum_k \theta_k^{init} f_k^{init}(y_1 = j, x_1)$ for $j = 1 \dots N$
for $t = 2 \dots T$:
    for $j = 1 \dots N$:
        $\delta_j(t) = \max_i \delta_i(t-1) \exp \sum_k \theta_k f_k(y_t = j, y_{t-1} = i, x_t)$

Take $\max_i \delta_i(T)$
Runtime: $O(N^2 T)$
Remember that we need backpointers.
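The CRF version only changes how the local score is computed; a sketch assuming the same hypothetical `feats` and `theta` as in the `crf_forward` sketch:

```python
import numpy as np

def crf_viterbi(xs, feats, theta, N=2):
    T = len(xs)
    delta = np.zeros((N, T))
    back = np.zeros((N, T), dtype=int)
    for j in range(N):
        delta[j, 0] = np.exp(theta @ feats(j, None, xs[0]))
    for t in range(1, T):
        for j in range(N):
            # Probabilities replaced by exp(theta . f(y_t=j, y_{t-1}=i, x_t)).
            scores = np.array([delta[i, t - 1] * np.exp(theta @ feats(j, i, xs[t]))
                               for i in range(N)])
            back[j, t] = scores.argmax()
            delta[j, t] = scores.max()
    path = [int(delta[:, -1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    return list(reversed(path))   # argmax_Y P(Y | X, theta)
```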
Training LC-CRFs
Unlike for HMMs, there is no analytic MLE solution. Instead, use an iterative method to improve the data likelihood:
• Gradient descent
• A version of Newton's method to find where the gradient is 0
Convexity
Fortunately, $l(\theta)$ is a concave function (equivalently, its negation is a convex function). That means that we will find the global maximum of $l(\theta)$ with gradient ascent (equivalently, the global minimum of $-l(\theta)$ with gradient descent).
Gradient Ascent
Walk in the direction of the gradient to maximize $l(\theta)$:
• a.k.a. gradient descent on a loss function

$$\theta^{new} = \theta^{old} + \gamma \nabla l(\theta)$$

$\gamma$ is a learning rate that specifies how large a step to take.
There are more sophisticated ways to do this update:
• Conjugate gradient
• L-BFGS (approximates the update using second-derivative information)
Gradient Descent Summary
Descent vs. ascent: by convention, think about the problem as a minimization problem, and minimize the negative log likelihood:
• $\theta \leftarrow \theta - \gamma(-\nabla l(\theta))$

Initialize $\theta = \theta_1, \theta_2, \dots, \theta_k$ randomly
Do for a while:
    Compute $\nabla l(\theta)$, which will require dynamic programming (i.e., the forward algorithm)
    $\theta \leftarrow \theta + \gamma \nabla l(\theta)$
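A sketch of this loop, assuming a hypothetical `gradient(theta, corpus)` helper that runs the dynamic program (forward algorithm) and returns $\nabla l(\theta)$:

```python
import numpy as np

def train(corpus, gradient, n_features, gamma=0.1, n_iters=100):
    theta = np.random.randn(n_features) * 0.01   # random initialization
    for _ in range(n_iters):                     # "do for a while"
        g = gradient(theta, corpus)              # requires dynamic programming
        theta = theta + gamma * g                # ascent step on l(theta)
    return theta
```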
Gradient of Log-Likelihood
Find the gradient of the log likelihood of the training corpus:

$$l(\theta) = \log \prod_i P(Y^{(i)} \mid X^{(i)})$$
Interpretation of Gradient
The overall gradient is the difference between:

$$\sum_i \sum_t f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)})$$

the empirical distribution of feature $f_k$ in the training corpus, and:

$$\sum_i \sum_t \sum_{y, y'} f_k(y, y', x_t^{(i)})\, P(y, y' \mid X^{(i)})$$

the expected distribution of $f_k$ as predicted by the current model.
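For a tiny example, both terms can be computed by brute force. This sketch reuses the toy `features`, `theta`, and `score` from the earlier CRF example; it takes the expectation over whole label sequences, which equals the sum over the pairwise marginals $P(y, y' \mid X)$ shown above.

```python
import itertools

def gradient_k(k, xs, ys, N=2):
    # Empirical count: f_k evaluated along the gold label sequence.
    empirical = sum(features(ys[t], ys[t - 1] if t > 0 else None, xs[t])[k]
                    for t in range(len(xs)))
    # Expected count: the same sum, averaged under the model P(Y | X).
    seqs = [list(y) for y in itertools.product(range(N), repeat=len(xs))]
    Z = sum(score(y, xs) for y in seqs)
    expected = sum((score(y, xs) / Z) *
                   sum(features(y[t], y[t - 1] if t > 0 else None, xs[t])[k]
                       for t in range(len(xs)))
                   for y in seqs)
    return empirical - expected   # gradient of log P(Y | X) w.r.t. theta_k

print(gradient_k(0, ["the", "dog"], [0, 1]))
```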
Interpretation of Gradient
When the corpus likelihood is maximized, the gradient is zero, so the difference is zero.
Intuitively, this means that finding the parameter estimates by gradient descent is equivalent to telling our model to predict the features in such a way that they are found in the same distribution as in the gold standard.
Regularization
To avoid overfitting, we can encourage the weights to be close to zero.
Add a term to the corpus log likelihood:

$$l^*(\theta) = \log \prod_i P(Y^{(i)} \mid X^{(i)}) - \sum_k \frac{\theta_k^2}{2\sigma^2}$$

$\sigma$ controls the amount of regularization. This results in an extra term in the gradient:

$$-\frac{\theta_k}{\sigma^2}$$
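A one-line sketch of folding the penalty into the gradient, assuming the same hypothetical unregularized `gradient` helper as before:

```python
def regularized_gradient(theta, corpus, gradient, sigma=1.0):
    g = gradient(theta, corpus)      # gradient of the log likelihood
    return g - theta / sigma**2      # extra term: -theta_k / sigma^2 per weight
```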
Stochastic Gradient Descent (SGD)
In the standard version of the algorithm, the gradient is computed over the entire training corpus.
• Weights are updated only once per iteration through the training corpus.
Alternative: calculate the gradient over a small mini-batch of the training corpus and update the weights then. SGD is the case where the mini-batch size is one.
• Many weight updates per iteration through the training corpus
• Usually results in much faster convergence to the final solution, without loss in performance
Stochastic Gradient Descent
Goal: minimize $-l^*(\theta)$

Initialize $\theta = \theta_1, \theta_2, \dots, \theta_k$ randomly
Do for a while:
    Randomize the order of the samples in the training corpus
    For each mini-batch (of size one) in the training corpus:
        Compute $\nabla l^*(\theta)$ over this mini-batch
        $\theta \leftarrow \theta + \gamma \nabla l^*(\theta)$
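A sketch of this loop with mini-batches of size one, assuming a hypothetical per-sample `gradient_one(theta, x, y)` that already includes the regularization term:

```python
import random
import numpy as np

def sgd(corpus, gradient_one, n_features, gamma=0.05, n_epochs=10):
    theta = np.random.randn(n_features) * 0.01
    for _ in range(n_epochs):                    # "do for a while"
        random.shuffle(corpus)                   # randomize sample order
        for x, y in corpus:                      # one weight update per sample
            theta = theta + gamma * gradient_one(theta, x, y)
    return theta
```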