Structured Perceptron
Ye Qiu, Xinghui Lu, Yue Lu, Ruofei Shen
Outline
1. Brief review of perceptron
2. Structured Perceptron
3. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms
4. Distributed Training Strategies for the Structured Perceptron
Brief review of perceptron
Error-Driven Updating: The perceptron algorithm
The perceptron is a classic learning algorithm for the neural model of learning, and one of the most fundamental algorithms. It has two main characteristics:
● It is online. Instead of considering the entire data set at once, it only ever looks at one example, processes it, and then moves on to the next one.
● It is error-driven. If there is no error, it does not bother updating its parameters.
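These two properties show up directly in a minimal sketch of the binary perceptron (the toy data below is invented for illustration):

```python
# Minimal binary perceptron: online and error-driven.
# Labels are +1/-1; inputs are plain Python lists used as vectors.

def predict(w, x):
    """Sign of the dot product w.x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

def perceptron_train(data, epochs=10):
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:            # online: one example at a time
            if predict(w, x) != y:   # error-driven: update only on a mistake
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

# Tiny linearly separable example (last component is a constant bias feature)
data = [([1, 2, 1], 1), ([2, 3, 1], 1), ([-1, -2, 1], -1), ([-2, -1, 1], -1)]
w = perceptron_train(data)
```

On separable data like this, the loop stops changing `w` once every example is classified correctly.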
Structured Perceptron
Prediction Problems
● Input x: a book review ("Oh, man I love this book!" / "This book is so boring...")
  Predict y: is it positive? (Yes / No)
  Prediction type: binary prediction (2 choices)
● Input x: a tweet ("On the way to the park!" / "正在去公园")
  Predict y: its language (English / Chinese)
  Prediction type: multi-class prediction (several choices)
● Input x: a sentence ("I read a book.")
  Predict y: its syntactic tags
  Prediction type: structured prediction (millions of choices)
Applications of Structured Perceptron
1. POS tagging with HMMs
- Collins, "Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms," ACL 2002
2. Parsing
- Huang et al., "Forest Reranking: Discriminative Parsing with Non-Local Features," ACL 2008
3. Machine translation
- Liang et al., "An End-to-End Discriminative Approach to Machine Translation," ACL 2006
4. Large-vocabulary speech recognition
- Roark et al., "Discriminative Language Modeling with Conditional Random Fields and the Perceptron Algorithm," ACL 2004
Structured perceptron algorithm
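A minimal sketch of the structured perceptron loop: decode with the current weights, and on a mistake add the gold features and subtract the predicted ones. The feature map and the brute-force argmax below are illustrative stand-ins (real implementations use the Viterbi algorithm for the argmax):

```python
from itertools import product

# Structured perceptron sketch on a toy tagging problem.
# phi is a global feature map counting emission (word, tag) and
# transition (prev_tag, tag) events.

TAGS = ["D", "N", "V"]

def phi(words, tags):
    feats = {}
    prev = "<S>"
    for w, t in zip(words, tags):
        feats[("emit", w, t)] = feats.get(("emit", w, t), 0) + 1
        feats[("trans", prev, t)] = feats.get(("trans", prev, t), 0) + 1
        prev = t
    return feats

def score(alpha, feats):
    return sum(alpha.get(f, 0.0) * v for f, v in feats.items())

def decode(alpha, words):
    # Brute-force argmax over GEN(words) = TAGS^n; fine for toy inputs,
    # Viterbi in practice.
    return max(product(TAGS, repeat=len(words)),
               key=lambda tags: score(alpha, phi(words, tags)))

def train(examples, epochs=5):
    alpha = {}
    for _ in range(epochs):
        for words, gold in examples:
            guess = decode(alpha, words)
            if tuple(gold) != guess:   # error-driven update
                for f, v in phi(words, gold).items():
                    alpha[f] = alpha.get(f, 0.0) + v
                for f, v in phi(words, guess).items():
                    alpha[f] = alpha.get(f, 0.0) - v
    return alpha

examples = [("the man saw the dog".split(), ["D", "N", "V", "D", "N"])]
alpha = train(examples)
```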
Feature representation
What is a feature representation? A feature vector is simply a vector containing information that describes an object's important characteristics.
Some examples of feature representations in different areas:
● In character recognition, features may include histograms counting the number of black pixels along the horizontal and vertical directions, the number of internal holes, stroke detection, and many others.
● In spam detection, features may include the presence or absence of certain email headers, the email structure, the language, the frequency of specific terms, and the grammatical correctness of the text.
Feature representation
In POS tagging, typical ways to create features:
● Are capitalized words nouns?
● Are words that end in -ed verbs?
You can easily try many different ideas!
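Ideas like these become indicator features in code. A hypothetical local feature extractor (the feature names and templates below are invented for illustration, not taken from any of the papers):

```python
def pos_features(words, i, prev_tag):
    """Indicator features for the tagging decision at position i."""
    w = words[i]
    feats = {
        f"word={w.lower()}": 1,
        f"prev_tag={prev_tag}": 1,
    }
    if w[0].isupper():
        feats["is_capitalized"] = 1   # capitalized words are often nouns
    if w.endswith("ed"):
        feats["suffix=ed"] = 1        # -ed words are often past-tense verbs
    return feats

feats = pos_features(["She", "walked", "home"], 1, "PRP")
```

In a real tagger these features would feed into the score ᾱ·φ for each candidate tag.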
Some notations
● w[1:n]: a sequence of words, short-hand for w1 w2 … wn.
● t[1:n]: a sequence of tags, short-hand for t1 t2 … tn.
● h: a history, the context in which a tagging decision is made. Usually a 4-tuple <t-1, t-2, w[1:n], i>.
● φ: a feature mapping function. Its size is d; it maps a history-tag pair to a d-dimensional feature vector. Each component can be an arbitrary function of h and t.
Some notations
A feature-vector representation φ maps a history-tag pair to a d-dimensional feature vector. For example, one such feature might be the indicator
φs(h, t) = 1 if the current word wi ends in "ing" and t = VBG, and 0 otherwise.
Some notations
● Φ: a function from (word sequence, tag sequence) pairs to d-dimensional feature vectors, formed by summing the local vectors: Φ(w[1:n], t[1:n]) = Σi φ(hi, ti).
We refer to Φ as a "global" representation, in contrast to φ as a "local" representation.
An example of feature vector - POS Tagging
x: The dog ate my homework.
y: <S> D N V D N <E>
(Tsochantaridis, Hofmann, Joachims, and Altun, "Support Vector Machine Learning for Interdependent and Structured Output Spaces," ICML 2004)
An example of feature vector - Parsing
(Tsochantaridis, Hofmann, Joachims, and Altun, "Support Vector Machine Learning for Interdependent and Structured Output Spaces," ICML 2004)
Some notations
● GEN: a function which enumerates a set of candidates for an input x. Given a set of possible tags T, we define GEN(w[1:n]) = T^n, the set of all tag sequences of length n.
● ᾱ: a d-dimensional parameter vector.
Recap: Structured perceptron algorithm
Recap: Decoding - Viterbi Algorithm in HMM
What is the decoding problem? Given a sequence of symbols (your observations) and a model, find the most likely sequence of states that produced those observations.
An example of Viterbi Algorithm
First, calculate the transition from <S> and the emission of the first word for every POS tag.
An example of Viterbi Algorithm
For the middle words, calculate the maximum score over all possible previous POS tags.
Viterbi algorithm with features
Same as with probabilities, but using feature weights in place of log probabilities.
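A sketch of Viterbi search over feature weights rather than probabilities, for a first-order model with hypothetical "trans"/"emit" features (a simplification of the slide's setup; the toy weights are invented):

```python
def viterbi(words, tags, weight):
    """Find the highest-scoring tag sequence under additive feature weights.

    weight maps ('trans', prev, t) and ('emit', word, t) features to scores,
    playing the role of log transition/emission probabilities.
    """
    def local(prev, t, w):
        return (weight.get(("trans", prev, t), 0.0)
                + weight.get(("emit", w, t), 0.0))

    # best[t] = (score of best path ending in tag t, that path)
    best = {t: (local("<S>", t, words[0]), [t]) for t in tags}
    for w in words[1:]:
        best = {t: max(((s + local(p, t, w), path + [t])
                        for p, (s, path) in best.items()),
                       key=lambda x: x[0])
                for t in tags}
    return max(best.values(), key=lambda x: x[0])[1]

weights = {("trans", "<S>", "D"): 1.0, ("trans", "D", "N"): 1.0,
           ("trans", "N", "V"): 1.0, ("trans", "V", "D"): 1.0,
           ("emit", "the", "D"): 2.0, ("emit", "man", "N"): 2.0,
           ("emit", "saw", "V"): 2.0, ("emit", "dog", "N"): 2.0}
path = viterbi("the man saw the dog".split(), ["D", "N", "V"], weights)
```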
Viterbi algorithm with features
Additional features can easily be added.
Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms
Michael Collins
Outline
1. A motivating example
2. Algorithm for tagging
3. General algorithm
   a. Proof of convergence
4. Experiment results
Part of Speech (POS) Tagging
1. Maximum-entropy (ME) models
   a. Advantages: flexible in the features that can be incorporated into the model.
   b. Disadvantages: the inference problem for such models is intractable.
2. Conditional Random Fields (CRFs)
   a. Better than ME
3. Structured Perceptron
   a. 11.9% relative reduction in error for POS tagging compared with ME
A Motivating Example
● Trigram HMM taggers (second-order HMMs)
● Parameters: αx,y,z for each trigram <x, y, z>, and αt,w for each tag/word pair <t, w>
● Common approach (maximum likelihood): estimate the parameters from smoothed training-set counts.
The score for a tag sequence t[1:n] with a word sequence w[1:n] is the sum of its trigram and tag/word parameters: Σi αt(i-2),t(i-1),t(i) + Σi αt(i),w(i).
Use the Viterbi algorithm to find the highest-scoring tag sequence.
A Motivating Example
● An alternative to maximum-likelihood parameter estimates:
  ○ Choose a T defining the number of iterations over the training set.
  ○ Initially set all parameters αx,y,z and αt,w to zero.
  ○ In each iteration, use Viterbi to find the best tagged sequence z[1:n] for each w[1:n].
  ○ Update the parameters:
    ■ if <x, y, z> occurs c1 times in t[1:n] and c2 times in z[1:n]: αx,y,z = αx,y,z + c1 - c2
    ■ if <t, w> occurs c1 times in t[1:n] and c2 times in z[1:n]: αt,w = αt,w + c1 - c2
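This count-based update can be written out directly. The sketch below compares trigram and tag/word event counts between the gold sequence t[1:n] and the Viterbi output z[1:n] (toy code, not from the paper; `<S>` padding is assumed for the first two trigram positions):

```python
from collections import Counter

def events(words, tags):
    """Trigram and tag/word events in a tagged sentence (with <S> padding)."""
    padded = ["<S>", "<S>"] + list(tags)
    trigrams = [tuple(padded[i:i + 3]) for i in range(len(tags))]
    emissions = list(zip(tags, words))
    return Counter(trigrams) + Counter(emissions)

def update(alpha, words, gold, guess):
    """alpha[e] += c1 - c2, where c1/c2 count event e in gold/guess."""
    diff = events(words, gold)
    diff.subtract(events(words, guess))   # keeps zero/negative counts
    for e, c in diff.items():
        if c:
            alpha[e] = alpha.get(e, 0) + c
    return alpha

words = "the man saw the dog".split()
gold = ["D", "N", "V", "D", "N"]
guess = ["D", "N", "N", "D", "N"]       # saw mis-tagged as N
alpha = update({}, words, gold, guess)
```

Events shared by both sequences (such as the (D, the) pairs) cancel out, so only the disagreeing trigrams and tag/word pairs change.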
A Motivating Example
● An example:
  ○ Training data: the/D man/N saw/V the/D dog/N
  ○ The highest-scoring tag sequence z[1:n]: the/D man/N saw/N the/D dog/N
  ○ Update the parameters:
    ■ αD,N,V, αN,V,D, αV,D,N, αV,saw each increase by 1
    ■ αD,N,N, αN,N,D, αN,D,N, αN,saw each decrease by 1
    ■ If the proposed tag sequence is correct, nothing changes.
Generalize the algorithm to tagged sequences
● Recap: feature-vector representations
  ○ φ maps a history-tag pair to a d-dimensional feature vector
Each component φs is an arbitrary function of (h, t), e.g. with h = <t-1, t-2, w[1:n], i>.
General Representation
● Maximum-entropy taggers: a new estimation method
  ○ Conditional probability of a tag given its history: p(t | h, ᾱ) = exp(ᾱ·φ(h, t)) / Σt' exp(ᾱ·φ(h, t'))
  ○ The "score" of a tagged sequence is its log probability: log p(t[1:n] | w[1:n]) = Σi log p(ti | hi, ᾱ)
The training algorithm for tagging
● Inputs: a training set of tagged sentences (w^i[1:ni], t^i[1:ni]), i = 1 … n
● Initialization: set the parameter vector ᾱ = 0
● For t = 1 … T, i = 1 … n:
  ○ Use the Viterbi algorithm to find z[1:ni], the highest-scoring tag sequence for w^i[1:ni] under the current ᾱ
  ○ If z[1:ni] ≠ t^i[1:ni]: ᾱ = ᾱ + Φ(w^i[1:ni], t^i[1:ni]) - Φ(w^i[1:ni], z[1:ni])
● Output: parameter vector ᾱ
● Refinement: the "averaged parameters" method
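The "averaged parameters" refinement keeps a running sum of the weight vector after every training step and outputs its average, which tends to be more stable on held-out data than the final weights. A minimal binary-classification sketch (toy data invented here):

```python
# Averaged perceptron refinement: instead of the final weights, output the
# average of the weight vector over all (epoch, example) steps.

def averaged_perceptron(data, epochs=5):
    d = len(data[0][0])
    w = [0.0] * d
    total = [0.0] * d        # running sum of w after each step
    steps = 0
    for _ in range(epochs):
        for x, y in data:    # y is +1 or -1
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
            total = [ti + wi for ti, wi in zip(total, w)]
            steps += 1
    return [ti / steps for ti in total]

data = [([1, 1], 1), ([-1, -1], -1)]
w_avg = averaged_perceptron(data)
```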
General Algorithm
Assumptions for the Algorithm for Taggers
Convergence for separable data
Theorem 1 bounds the number of mistakes made during training by R²/δ², independent of the number of candidates for each example and depending only on the separation (margin δ) of the training data.
Proof of Theorem 1
Let ᾱk be the weights before the k'th mistake is made. The proof is by induction on k.
Convergence for inseparable data
Proof of Theorem 2
Using Theorem 1:
Generalization to New Test Examples
Experiment Results
(cc = feature count cut-off set)
Conclusion
● Described a new structured perceptron algorithm for tagging.
● The generic algorithm and the theorems describing its convergence can apply to several other models in NLP.
Distributed Training Strategies for the Structured Perceptron
Ryan McDonald Keith Hall Gideon Mann
Outline
1. Introduction of distributed training strategies for the structured perceptron
2. Background (related work and structured perceptron recap)
3. Iterative parameter mixing for the distributed structured perceptron
4. Experiments and results
   a. Serial (All Data)
   b. Serial (Sub Sampling)
   c. Parallel (Parameter Mix)
   d. Parallel (Iterative Parameter Mix)
Limitations of the Structured Perceptron
● The size of available training sets keeps increasing
● Training complexity is proportional to inference, which is frequently non-linear in sequence length, even with strong structural independence assumptions
Distributed Training Strategies
● Iterative parameter mixing:
  ○ has similar convergence properties to the standard perceptron algorithm
  ○ finds a separating hyperplane if the training set is separable
  ○ reduces training time significantly
  ○ produces models with comparable (or superior) accuracy to those trained serially on all the data
Related Work
● Sub-gradient distributed training: asynchronous optimization via gradient descent
  ○ Requires shared memory
  ○ Less suitable for the more common cluster computing environment
● Other structured prediction classifiers, e.g. conditional random fields (CRFs)
  ○ Distributed through the gradient
  ○ Require computation of a partition function, which is expensive and sometimes intractable
Structured Perceptron Recap
Supporting Theorem 1 Recap
Theorem 1 implies that if T is separable, then:
● the perceptron will converge in a finite amount of time
● it will produce a w that separates T
Distributed Training Strategies for the Perceptron
● Parameter Mixing
● Iterative Parameter Mixing
Parameter Mixing
● Algorithm: shard the training data, train a separate perceptron on each shard in parallel, then mix the resulting weight vectors: w = Σi μi w(i).
Parameter Mixing
● Parameter mixing in a Map-Reduce framework:
  ○ Map stage: trains the individual models in parallel
  ○ Reduce stage: mixes their parameters
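A sketch of this map/reduce scheme for a plain binary perceptron (toy shards invented here; uniform mixture weights μi = 1/S are assumed by default):

```python
def train_shard(shard, d, epochs=5):
    """Standard binary perceptron trained on one shard (the map step)."""
    w = [0.0] * d
    for _ in range(epochs):
        for x, y in shard:   # y is +1 or -1
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def parameter_mix(shards, d, mu=None):
    """Reduce step: w = sum_i mu_i * w_i (uniform mixture by default)."""
    ws = [train_shard(s, d) for s in shards]
    mu = mu or [1.0 / len(ws)] * len(ws)
    return [sum(m * w[j] for m, w in zip(mu, ws)) for j in range(d)]

shards = [[([2, 1], 1), ([-2, -1], -1)],
          [([1, 2], 1), ([-1, -2], -1)]]
w = parameter_mix(shards, d=2)
```

Note the shards are trained completely independently; as the next slide shows, this independence is exactly what the paper's Theorem 2 counterexample exploits.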
Parameter Mixing
● Limitation: the analysis requires a stability bound on the parameters of a regularized maximum-entropy model, which is not known to hold for the perceptron.
Supporting Theorem 2
Theorem 2 Proof
Consider a binary classification setting where Y = {0, 1} and T has 4 instances:
T1 = {(x1,1, y1,1), (x1,2, y1,2)}   T2 = {(x2,1, y2,1), (x2,2, y2,2)}
y1,1 = y1,2 = 0   y2,1 = y2,2 = 1
f(x1,1, 0) = [1 1 0 0 0 0]   f(x1,1, 1) = [0 0 0 1 1 0]
f(x1,2, 0) = [0 0 1 0 0 0]   f(x1,2, 1) = [0 0 0 0 0 1]
f(x2,1, 0) = [0 1 1 0 0 0]   f(x2,1, 1) = [0 0 0 0 1 1]
f(x2,2, 0) = [1 0 0 0 0 0]   f(x2,2, 1) = [0 0 0 1 0 0]
Theorem 2 Proof cont.
Assuming a fixed tie-breaking rule on labels, parameter mixing might return:
w1 = [1 1 0 -1 -1 0]   w2 = [0 1 1 0 -1 -1]
● If both μ1 and μ2 are non-zero: all examples will be classified as 0
● If μ1 = 1 and μ2 = 0: (x2,2, y2,2) will be incorrectly classified as 0
● If μ1 = 0 and μ2 = 1: (x1,2, y1,2) will be incorrectly classified as 0
Yet there is a separating vector: w = [-1 2 -1 1 -2 1]!
Iterative Parameter Mixing
Iterative Parameter Mixing
● Iterative parameter mixing in Map-Reduce:
  ○ map computes the parameters for each shard for one epoch
  ○ reduce mixes the parameters and re-sends them
● The disadvantage of iterative parameter mixing, relative to simple parameter mixing, is that the amount of information sent across the network increases.
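A sketch of the iterative scheme for a plain binary perceptron: each epoch, every shard trains for one pass starting from the current mixed weights, and the results are mixed and re-broadcast (toy shards invented here; a uniform mixture is assumed):

```python
def one_epoch(w, shard):
    """One perceptron epoch over a shard, starting from weights w (map step)."""
    w = list(w)
    for x, y in shard:   # y is +1 or -1
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def iterative_parameter_mix(shards, d, epochs=5):
    w = [0.0] * d
    mu = 1.0 / len(shards)   # uniform mixture weight per shard
    for _ in range(epochs):
        # map: each shard trains for one epoch from the current mixed weights
        ws = [one_epoch(w, s) for s in shards]
        # reduce: mix the shard weights and re-send them
        w = [mu * sum(wi[j] for wi in ws) for j in range(d)]
    return w

shards = [[([2, 1], 1), ([-2, -1], -1)],
          [([1, 2], 1), ([-1, -2], -1)]]
w = iterative_parameter_mix(shards, d=2)
```

Because every epoch restarts from the shared mixed weights, information flows between shards, unlike the one-shot mixing above, at the cost of one network round-trip per epoch.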
Supporting Theorem 3
Theorem 3 Proof
● w(i,n): the weight vector for the ith shard after the nth epoch of the main loop
● w([i,n]-k): the weight vector that existed on shard i in the nth epoch, k errors before w(i,n)
● w(avg,n): the mixed vector from the weight vectors returned after the nth epoch
Theorem 3 Proof cont.
Theorem 3 Proof cont.
Base case and inductive hypothesis.
Analysis
If we set each μi,n to be the uniform mixture, μi,n = 1/S, Theorem 3 guarantees convergence to a separating hyperplane.
If an epoch produces no errors on any shard, then the previous weight vector already separated the data. Otherwise, the total number of errors is bounded and cannot increase indefinitely, so training terminates.
Analysis
● Non-distributed case: the standard perceptron mistake bound applies.
● Distributed case: setting μi,n = ki,n / kn, where kn = Σi ki,n, weights each shard's contribution by the number of errors it made in that epoch.
Experiments
● Serial (All Data): trained serially on all the available data
● Serial (Sub Sampling): shard the data, select one shard randomly, and train serially on it
● Parallel (Parameter Mix): the parameter-mixing distributed strategy
● Parallel (Iterative Parameter Mix): the iterative parameter-mixing distributed strategy
Experiment 1
Experiment 2
Convergence Properties
Conclusions
● The analysis shows that an iterative parameter mixing strategy is both guaranteed to separate the data (if possible) and significantly reduces the time required to train high-accuracy classifiers.
● There is a trade-off between reducing training time through distributed computation and slower convergence relative to the number of shards.