Transcript of: Machine Learning for NLP, Perceptron Learning
Source: cl.lingfil.uu.se/~nivre/master/ml-perc.pdf

Page 1:

Machine Learning for NLP

Perceptron Learning

Joakim Nivre

Uppsala University, Department of Linguistics and Philology

Slides adapted from Ryan McDonald, Google Research

Machine Learning for NLP 1(25)

Page 2:

Introduction

Linear Classifiers

- Classifiers covered so far:
  - Decision trees
  - Nearest neighbor

- Next three lectures: linear classifiers

- Statistics from Google Scholar (October 2009):
  - “Maximum Entropy” & “NLP”: 2660 hits, 141 before 2000
  - “SVM” & “NLP”: 2210 hits, 16 before 2000
  - “Perceptron” & “NLP”: 947 hits, 118 before 2000

- All are linear classifiers and basic tools in NLP

- They are also the bridge to deep learning techniques

Machine Learning for NLP 2(25)

Page 3:

Introduction

Outline

- Today:
  - Preliminaries: input/output, features, etc.
  - Perceptron
  - Assignment 2

- Next time:
  - Large-margin classifiers (SVMs, MIRA)
  - Logistic regression (Maximum Entropy)

- Final session:
  - Naive Bayes classifiers
  - Generative and discriminative models

Machine Learning for NLP 3(25)

Page 4:

Preliminaries

Inputs and Outputs

- Input: x ∈ X
  - e.g., a document or sentence with some words x = w1 ... wn, or a series of previous actions

- Output: y ∈ Y
  - e.g., a parse tree, document class, part-of-speech tags, word sense

- Input/output pair: (x, y) ∈ X × Y
  - e.g., a document x and its label y
  - Sometimes x is explicit in y, e.g., a parse tree y will contain the sentence x

Machine Learning for NLP 4(25)

Page 5:

Preliminaries

Feature Representations

- We assume a mapping from input-output pairs (x, y) to a high-dimensional feature vector
  - f(x, y) : X × Y → R^m

- In some cases, e.g., binary classification with Y = {−1, +1}, we can map only from the input to the feature space
  - f(x) : X → R^m

- However, most problems in NLP require more than two classes, so we focus on the multi-class case

- For any vector v ∈ R^m, let v_j be the j-th value

Machine Learning for NLP 5(25)

Page 6:

Preliminaries

Features and Classes

- All features must be numerical
  - Numerical features are represented directly as f_i(x, y) ∈ R
  - Binary (boolean) features are represented as f_i(x, y) ∈ {0, 1}

- Multinomial (categorical) features must be binarized:
  - Instead of: f_i(x, y) ∈ {v_0, ..., v_p}
  - We have: f_{i+0}(x, y) ∈ {0, 1}, ..., f_{i+p}(x, y) ∈ {0, 1}
  - Such that: f_{i+j}(x, y) = 1 iff f_i(x, y) = v_j

- We need distinct features for distinct output classes (see the sketch after this list):
  - Instead of: f_i(x) (1 ≤ i ≤ m)
  - We have: f_{i+0m}(x, y), ..., f_{i+Nm}(x, y) for Y = {0, ..., N}
  - Such that: f_{i+jm}(x, y) = f_i(x) iff y = y_j
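A minimal Python sketch of both points above (binarizing a categorical feature and duplicating input features once per class); this is illustrative only, not part of the original slides or the course starter code, and all names are made up:

    # Sketch: binarize a categorical feature and copy the input features
    # into one block per output class (illustrative names only).

    def binarize(value, values):
        """Turn a categorical value into 0/1 indicator features."""
        return [1.0 if value == v else 0.0 for v in values]

    def joint_features(f_x, y, num_classes):
        """Place f(x) in the block for class y; all other blocks stay zero."""
        m = len(f_x)
        f_xy = [0.0] * (m * num_classes)
        f_xy[y * m:(y + 1) * m] = f_x
        return f_xy

    f_x = binarize("Noun", ["Noun", "Verb", "Adj"])   # [1.0, 0.0, 0.0]
    print(joint_features(f_x, y=1, num_classes=2))    # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]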

Machine Learning for NLP 6(25)

Page 7:

Preliminaries

Examples

- x is a document and y is a label

  f_j(x, y) = 1 if x contains the word “interest” and y = “financial”, 0 otherwise

  f_j(x, y) = % of words in x with punctuation, if y = “scientific” (0 otherwise)

- x is a word and y is a part-of-speech tag

  f_j(x, y) = 1 if x = “bank” and y = Verb, 0 otherwise

Machine Learning for NLP 7(25)

Page 8:

Preliminaries

Examples

- x is a name, y is a label classifying the name

  f_0(x, y) = 1 if x contains “George” and y = “Person”, 0 otherwise
  f_1(x, y) = 1 if x contains “Washington” and y = “Person”, 0 otherwise
  f_2(x, y) = 1 if x contains “Bridge” and y = “Person”, 0 otherwise
  f_3(x, y) = 1 if x contains “General” and y = “Person”, 0 otherwise
  f_4(x, y) = 1 if x contains “George” and y = “Object”, 0 otherwise
  f_5(x, y) = 1 if x contains “Washington” and y = “Object”, 0 otherwise
  f_6(x, y) = 1 if x contains “Bridge” and y = “Object”, 0 otherwise
  f_7(x, y) = 1 if x contains “General” and y = “Object”, 0 otherwise

- x = General George Washington, y = Person → f(x, y) = [1 1 0 1 0 0 0 0]

- x = George Washington Bridge, y = Object → f(x, y) = [0 0 0 0 1 1 1 0]

- x = George Washington George, y = Object → f(x, y) = [0 0 0 0 1 1 0 0]

Machine Learning for NLP 8(25)

Page 9:

Preliminaries

Block Feature Vectors

- x = General George Washington, y = Person → f(x, y) = [1 1 0 1 0 0 0 0]

- x = George Washington Bridge, y = Object → f(x, y) = [0 0 0 0 1 1 1 0]

- x = George Washington George, y = Object → f(x, y) = [0 0 0 0 1 1 0 0]

- One equal-size block of the feature vector for each label

- Input features duplicated in each block

- Non-zero values allowed only in one block (see the sketch below)
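A small Python sketch of exactly this construction; it reproduces the vectors shown above (illustrative only, not the course starter code):

    # Four input features (contains "George", "Washington", "Bridge",
    # "General"), duplicated in one block per label; only the block of
    # the chosen label is non-zero.

    WORDS = ["George", "Washington", "Bridge", "General"]
    LABELS = ["Person", "Object"]

    def f(x, y):
        """f(x, y) as an 8-dimensional block feature vector."""
        block = [1 if w in x.split() else 0 for w in WORDS]
        vec = []
        for label in LABELS:
            vec.extend(block if label == y else [0] * len(WORDS))
        return vec

    print(f("General George Washington", "Person"))   # [1, 1, 0, 1, 0, 0, 0, 0]
    print(f("George Washington Bridge", "Object"))    # [0, 0, 0, 0, 1, 1, 1, 0]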

Machine Learning for NLP 9(25)

Page 10:

Linear Classifiers

Linear Classifiers

- Linear classifier: the score (or probability) of a particular classification is based on a linear combination of features and their weights

- Let w ∈ R^m be a high-dimensional weight vector

- If we assume that w is known, then we define our classifier as follows

- Multiclass classification: Y = {0, 1, ..., N}

  y = argmax_y w · f(x, y) = argmax_y Σ_{j=0}^{m} w_j × f_j(x, y)

- Binary classification is just a special case of multiclass (see the sketch below)
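As a sketch of the argmax above (reusing f and LABELS from the earlier block-vector sketch; illustrative only, not the course starter code):

    # Multiclass prediction: pick the label whose block feature vector
    # has the highest dot product with the weight vector w.

    def dot(w, v):
        return sum(wj * vj for wj, vj in zip(w, v))

    def predict(w, x, labels):
        """argmax_y w . f(x, y)"""
        return max(labels, key=lambda y: dot(w, f(x, y)))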

Machine Learning for NLP 10(25)

Page 11:

Linear Classifiers

Linear Classifiers – Bias Terms

- Linear classifiers are often presented as

  y = argmax_y Σ_{j=0}^{m} w_j × f_j(x, y) + b_y

  where b_y is a bias or offset term for label y

- But this can be folded into f:

  x = General George Washington, y = Person → f(x, y) = [1 1 0 1 1 0 0 0 0 0]
  x = General George Washington, y = Object → f(x, y) = [0 0 0 0 0 1 1 0 1 1]

  f_4(x, y) = 1 if y = “Person”, 0 otherwise
  f_9(x, y) = 1 if y = “Object”, 0 otherwise

- w_4 and w_9 are now the bias terms for the labels (see the sketch below)
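A sketch of the same trick in code, extending the earlier block-vector sketch (illustrative only):

    # Fold the bias into f(x, y): append a constant feature of 1 to the
    # active label's block, 0 to every other block.

    def f_with_bias(x, y):
        block = [1 if w in x.split() else 0 for w in WORDS] + [1]  # bias feature
        vec = []
        for label in LABELS:
            vec.extend(block if label == y else [0] * len(block))
        return vec

    print(f_with_bias("General George Washington", "Person"))
    # [1, 1, 0, 1, 1, 0, 0, 0, 0, 0] -- positions 4 and 9 hold the label biases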

Machine Learning for NLP 11(25)

Page 12:

Linear Classifiers

Binary Linear Classifier

Divides all points: [figure]

Machine Learning for NLP 12(25)

Page 13:

Linear Classifiers

Multiclass Linear Classifier

Defines regions of space: [figure]

- i.e., + marks all points (x, y) where + = argmax_y w · f(x, y)

Machine Learning for NLP 13(25)

Page 14:

Linear Classifiers

Separability

- A set of points is separable if there exists a w such that classification is perfect

  [Figures: a separable point set and a non-separable point set]

- This can also be defined mathematically (and we will do so shortly)

Machine Learning for NLP 14(25)

Page 15:

Linear Classifiers

Supervised Learning – how to find w

- Input: training examples T = {(x_t, y_t)}_{t=1}^{|T|}

- Input: feature representation f

- Output: w that maximizes/minimizes some important function on the training set
  - minimize error (Perceptron, SVMs, Boosting)
  - maximize likelihood of data (Logistic Regression, Naive Bayes)

- Assumption: the training data is separable
  - Not necessary, just makes life easier
  - There is a lot of good work in machine learning to tackle the non-separable case

Machine Learning for NLP 15(25)

Page 16:

Linear Classifiers

Perceptron

- Choose a w that minimizes error:

  w = argmin_w Σ_t ( 1 − 1[y_t = argmax_y w · f(x_t, y)] )

  where 1[p] = 1 if p is true, 0 otherwise

- This is a 0-1 loss function

- Aside: when minimizing error, people tend to use hinge loss or other smoother loss functions (see the sketch below)
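For concreteness, a one-line Python sketch of the 0-1 loss on a training set, reusing predict from the earlier sketch (illustrative only):

    # 0-1 loss: count one error per training example whose predicted
    # label differs from the gold label.

    def zero_one_loss(w, data, labels):
        return sum(1 for x, y in data if predict(w, x, labels) != y)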

Machine Learning for NLP 16(25)

Page 17:

Linear Classifiers

Perceptron Learning Algorithm

Training data: T = {(x_t, y_t)}_{t=1}^{|T|}

1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..|T|
4.     Let y′ = argmax_y w^(i) · f(x_t, y)
5.     if y′ ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y′)
7.       i = i + 1
8. return w^(i)
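A Python sketch of the algorithm above, reusing f, predict and LABELS from the earlier sketches (illustrative only, not the course starter code):

    def train_perceptron(data, labels, num_epochs=10):
        dim = len(f(data[0][0], labels[0]))
        w = [0.0] * dim                                # 1. w = 0
        for _ in range(num_epochs):                    # 2. for n : 1..N
            for x, y in data:                          # 3. for t : 1..|T|
                y_pred = predict(w, x, labels)         # 4. argmax_y w . f(x, y)
                if y_pred != y:                        # 5. mistake?
                    gold, pred = f(x, y), f(x, y_pred)
                    w = [wj + gj - pj                  # 6. w += f(x, y_t) - f(x, y')
                         for wj, gj, pj in zip(w, gold, pred)]
        return w

    data = [("General George Washington", "Person"),
            ("George Washington Bridge", "Object")]
    w = train_perceptron(data, LABELS)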

Machine Learning for NLP 17(25)

Page 18:

Linear Classifiers

Perceptron: Separability and Margin

- Given a training instance (x_t, y_t), define:
  - Y_t = Y − {y_t}
  - i.e., Y_t is the set of incorrect labels for x_t

- A training set T is separable with margin γ > 0 if there exists a vector u with ||u|| = 1 such that:

  u · f(x_t, y_t) − u · f(x_t, y′) ≥ γ

  for all y′ ∈ Y_t, where ||u|| = sqrt(Σ_j u_j²) (the Euclidean or L2 norm)

- Assumption: the training set is separable with margin γ (see the sketch below)
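A small sketch of the margin condition as code, reusing dot and f from the earlier sketches (illustrative only):

    # Check whether a unit-norm vector u separates the data with margin
    # gamma: u . f(x_t, y_t) - u . f(x_t, y') >= gamma for every wrong y'.

    def separates_with_margin(u, data, labels, gamma):
        for x, y in data:
            gold_score = dot(u, f(x, y))
            for y_wrong in labels:
                if y_wrong != y and gold_score - dot(u, f(x, y_wrong)) < gamma:
                    return False
        return True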

Machine Learning for NLP 18(25)

Page 19:

Linear Classifiers

Perceptron: Main Theorem

- Theorem: For any training set separable with a margin of γ, the following holds for the perceptron algorithm:

  number of mistakes made during training ≤ R² / γ²

  where R ≥ ||f(x_t, y_t) − f(x_t, y′)|| for all (x_t, y_t) ∈ T and y′ ∈ Y_t

- Thus, after a finite number of training iterations, the error on the training set will converge to zero

- For proof, see Collins (2002)

Machine Learning for NLP 19(25)

Page 20:

Linear Classifiers

Practical Considerations

- The perceptron is sensitive to the order of training examples
  - Consider: 500 positive instances followed by 500 negative instances

- Shuffling:
  - Randomly permute the training instances between iterations

- Voting and averaging:
  - Let w_1, ..., w_n be all the weight vectors seen during training
  - The voted perceptron predicts the majority vote of w_1, ..., w_n
  - The averaged perceptron predicts using the average vector (see the sketch below):

    w̄ = (1/n) Σ_{i=1}^{n} w_i
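A sketch of shuffling plus the averaged perceptron, in the straightforward version that stores the running sum of weight vectors explicitly (the lazy-update trick used in practice is not shown); f and predict come from the earlier sketches and the code is illustrative only:

    import random

    def train_averaged_perceptron(data, labels, num_epochs=10, seed=0):
        rng = random.Random(seed)
        data = list(data)
        dim = len(f(data[0][0], labels[0]))
        w = [0.0] * dim
        w_sum = [0.0] * dim            # running sum of all weight vectors seen
        n = 0
        for _ in range(num_epochs):
            rng.shuffle(data)          # shuffle between iterations
            for x, y in data:
                y_pred = predict(w, x, labels)
                if y_pred != y:
                    gold, pred = f(x, y), f(x, y_pred)
                    w = [wj + gj - pj for wj, gj, pj in zip(w, gold, pred)]
                w_sum = [sj + wj for sj, wj in zip(w_sum, w)]
                n += 1
        return [sj / n for sj in w_sum]    # averaged weight vector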

Machine Learning for NLP 20(25)

Page 21:

Linear Classifiers

Perceptron Summary

- Learns a linear classifier that minimizes error

- Guaranteed to find a separating w in a finite amount of time (if the data is separable)

- Improvement 1: shuffle the training data between iterations

- Improvement 2: average the weight vectors seen during training

- The perceptron is an example of an online learning algorithm
  - w is updated based on a single training instance in isolation:

    w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y′)

- Compare decision trees, which perform batch learning
  - All training instances are used to find the best split

Machine Learning for NLP 21(25)

Page 22:

Linear Classifiers

Assignment 2

- Implement the perceptron (starter code in Python)

- Three subtasks:
  - Implement the linear classifier (dot product)
  - Implement the perceptron update
  - Evaluate on the spambase data set

- For VG (pass with distinction):
  - Implement the averaged perceptron

Machine Learning for NLP 22(25)

Page 23:

Appendix

Proofs and Derivations

Machine Learning for NLP 23(25)

Page 24:

Convergence Proof for Perceptron

Perceptron Learning Algorithm

Training data: T = {(x_t, y_t)}_{t=1}^{|T|}

1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..|T|
4.     Let y′ = argmax_y w^(i) · f(x_t, y)
5.     if y′ ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y′)
7.       i = i + 1
8. return w^(i)

- w^(k−1) are the weights before the kth mistake

- Suppose the kth mistake is made at the tth example, (x_t, y_t):
  - y′ = argmax_y w^(k−1) · f(x_t, y)
  - y′ ≠ y_t
  - w^(k) = w^(k−1) + f(x_t, y_t) − f(x_t, y′)

- Now: u · w^(k) = u · w^(k−1) + u · (f(x_t, y_t) − f(x_t, y′)) ≥ u · w^(k−1) + γ

- Now: w^(0) = 0 and u · w^(0) = 0, so by induction on k, u · w^(k) ≥ kγ

- Now: since u · w^(k) ≤ ||u|| × ||w^(k)|| and ||u|| = 1, it follows that ||w^(k)|| ≥ kγ

- Now:

  ||w^(k)||² = ||w^(k−1)||² + ||f(x_t, y_t) − f(x_t, y′)||² + 2 w^(k−1) · (f(x_t, y_t) − f(x_t, y′))

  ||w^(k)||² ≤ ||w^(k−1)||² + R²

  (since R ≥ ||f(x_t, y_t) − f(x_t, y′)|| and w^(k−1) · f(x_t, y_t) − w^(k−1) · f(x_t, y′) ≤ 0, because y′ was the argmax under w^(k−1))

Machine Learning for NLP 24(25)

Page 25:

Convergence Proof for Perceptron

Perceptron Learning Algorithm

- We have just shown that ||w^(k)|| ≥ kγ and ||w^(k)||² ≤ ||w^(k−1)||² + R²

- By induction on k, and since w^(0) = 0 and ||w^(0)||² = 0:

  ||w^(k)||² ≤ kR²

- Therefore:

  k²γ² ≤ ||w^(k)||² ≤ kR²

- Solving for k:

  k ≤ R² / γ²

- Therefore the number of errors is bounded!

Machine Learning for NLP 25(25)