CS534-Machine learning
Linear classification models: Perceptron
Classification problem
• Given input x, the goal is to predict y, which is a categorical variable
  – y is called the class label; x is the feature vector
• Examples:
  – x: monthly income and bank saving amount; y: risky or not risky
  – x: bag-of-words representation of an email; y: spam or not spam
[Figure: training examples plotted in the (x1, x2) plane; the positive (+) and negative (–) points can be separated by a straight line]
Linear Classifier
• We will begin with the simplest choice: linear classifiers
Why linear model?
A Canonical Representation
• Each example is described by a d-dimensional feature vector x = (x_1, …, x_d) ∈ R^d, and assigned a class label y ∈ {−1, +1}
• We learn a linear function f(x, w) = w_0 + w_1 x_1 + ⋯ + w_d x_d
• Given an unlabeled example x, it predicts +1 if f(x, w) = w_0 + w_1 x_1 + ⋯ + w_d x_d ≥ 0, and −1 otherwise
• Introduce a dummy feature x_0 = 1 for all examples:
  f(x, w) = w^T x, where w = (w_0, w_1, …, w_d) and x = (x_0, x_1, ⋯, x_d)
• The classifier can be compactly represented by: h(x) = sign(w^T x)
• Given its true label y, our prediction for x is correct if y · (w^T x) > 0
• Goal of learning: find a good w
  – such that h(x) makes as few mis-predictions as possible
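As a concrete illustration of this representation, here is a minimal NumPy sketch of the prediction rule h(x) = sign(w^T x) with the dummy feature prepended; the helper names (add_dummy_feature, predict) and the example numbers are illustrative, not part of the slides.

```python
import numpy as np

def add_dummy_feature(X):
    """Prepend the dummy feature x0 = 1 to every example (each row of X)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def predict(w, X):
    """h(x) = sign(w^T x): returns +1 or -1 for each row of X."""
    scores = X @ w                       # f(x, w) = w^T x
    return np.where(scores >= 0, 1, -1)  # ties broken in favor of +1

# Tiny usage example with d = 2 features and w = (w0, w1, w2)
X = add_dummy_feature(np.array([[2.0, 1.0], [-1.0, 3.0]]))
w = np.array([0.5, 1.0, -1.0])
print(predict(w, X))  # -> [ 1 -1]
```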
Learning w: An Optimization Problem
• Formulate the learning problem as an optimization problem
  – Given: a set of n training examples
    (x_1, y_1), (x_2, y_2), …, (x_n, y_n)
  – Design a loss function L
  – Find the weight vector w that minimizes the expected/average loss on the training data:
    J(w) = (1/n) Σ_{m=1}^{n} L(w^T x_m, y_m)
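The objective above is just an average of a per-example loss. A short sketch (function and argument names are illustrative) that computes J(w) for any pluggable loss:

```python
import numpy as np

def average_loss(w, X, y, loss):
    """J(w) = (1/n) * sum_m loss(w^T x_m, y_m).

    X    : (n, d) array of feature vectors (dummy feature included)
    y    : (n,) array of labels in {-1, +1}
    loss : callable taking (score, label) and returning a scalar
    """
    scores = X @ w
    return float(np.mean([loss(s, t) for s, t in zip(scores, y)]))
```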
Loss Functions
• 0/1 loss function:
  J_{0/1}(w) = (1/n) Σ_{m=1}^{n} L_{0/1}(sign(w^T x_m), y_m)
  where L(y′, y) = 0 when y′ = y, and L(y′, y) = 1 otherwise
• Issue: does not produce a useful gradient since the surface of J is piece-wise flat
[Figure: the 0/1 loss plotted as a function of y w^T x]
Loss Functions
• Instead we will consider the "perceptron criterion":
  J_p(w) = (1/n) Σ_{m=1}^{n} max(0, −y_m w^T x_m)
• The term max(0, −y_m w^T x_m) is 0 when x_m is predicted correctly; otherwise it equals |w^T x_m|, which can be viewed as the "confidence" in the mis-prediction
• Has a nice gradient leading to the solution region
[Figure: the 0/1 loss and the perceptron criterion plotted as functions of y w^T x]
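For concreteness, the two losses can be written as small functions; this is a sketch, and both plug into the average_loss helper above.

```python
def zero_one_loss(score, y):
    """L_0/1(sign(w^T x), y): 1 on a mis-prediction, 0 otherwise."""
    pred = 1.0 if score >= 0 else -1.0
    return 0.0 if pred == y else 1.0

def perceptron_criterion(score, y):
    """max(0, -y * w^T x): 0 when correct, |w^T x| on a mistake."""
    return max(0.0, -y * score)
```

Averaging zero_one_loss over the data gives the piece-wise flat J_{0/1}, while averaging perceptron_criterion gives J_p, whose slope points toward the solution region.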
Stochastic Gradient Descent
• The objective function consists of a sum over data points, so we can update the parameters after observing each example
• This is the stochastic gradient descent approach
J(w) = (1/n) Σ_{m=1}^{n} max(0, −y_m w^T x_m)

∇_w max(0, −y_m w^T x_m) = 0 if y_m w^T x_m > 0, and −y_m x_m otherwise

After observing (x_m, y_m), if it is a mistake: w ← w + y_m x_m
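A minimal sketch of this (sub)gradient and the resulting single-example update, with η fixed at 1 as on the slide; the function names are illustrative.

```python
import numpy as np

def perceptron_subgradient(w, x, y):
    """(Sub)gradient of max(0, -y * w^T x) with respect to w."""
    if y * (w @ x) > 0:          # correctly classified: flat region, gradient 0
        return np.zeros_like(w)
    return -y * x                # mistake: gradient is -y * x

def sgd_step(w, x, y, eta=1.0):
    """One stochastic gradient step: w <- w - eta * subgradient, i.e. w + eta*y*x on a mistake."""
    return w - eta * perceptron_subgradient(w, x, y)
```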
Online Perceptron (stochastic gradient descent)
Let w = (0, 0, 0, …, 0)
Repeat: Accept training example m: (x_m, y_m)
  u = w^T x_m
  if y_m · u ≤ 0 (mistake):
    w ← w + y_m x_m
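The same algorithm in NumPy (a sketch; the epochs argument caps the number of passes so that the loop terminates even if the data is not separable):

```python
import numpy as np

def online_perceptron(X, y, epochs=10):
    """Online perceptron / stochastic gradient descent on the perceptron criterion.

    X : (n, d) array with the dummy feature x0 = 1 already included
    y : (n,) array of labels in {-1, +1}
    """
    w = np.zeros(X.shape[1])             # w = (0, 0, ..., 0)
    for _ in range(epochs):
        for x_m, y_m in zip(X, y):       # accept one training example at a time
            if y_m * (w @ x_m) <= 0:     # mistake
                w = w + y_m * x_m        # correct it immediately
    return w
```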
Online learning refers to the learning mode in which the model update is performed each time a single observation is received.
Batch learning, in contrast, performs the model update after observing the whole training set.
[Figure: successive weight vectors w1, w2, w3 and the corresponding decision boundaries 1, 2, 3 as the online perceptron corrects mistakes; red points belong to the positive class, blue points belong to the negative class]
When an error is made, the update moves the weight vector in a direction that corrects the error
Batch Perceptron Algorithm
Given: training examples (x_m, y_m), m = 1, …, n
Let w = (0, 0, 0, …, 0)
Repeat:
  delta = (0, 0, 0, …, 0)
  for m = 1 to n do
    u = w^T x_m
    if y_m · u ≤ 0 (mistake): delta ← delta + y_m x_m
  delta ← delta / n
  w ← w + η delta
until ||delta|| is sufficiently small
• Simplest case: η = 1 and don't normalize – the 'fixed increment perceptron'
• η – the step size
  – Also referred to as the learning rate
  – In practice, it is recommended to decrease η as learning continues
  – Some optimization approaches set the step size automatically, e.g., by line search, and converge faster
• If the data is linearly separable, there is only one basin for the hinge loss, thus a local minimum is the global minimum
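A sketch of the batch algorithm above in NumPy; the tol and max_iter safeguards are illustrative additions, not part of the slide's pseudocode.

```python
import numpy as np

def batch_perceptron(X, y, eta=1.0, tol=1e-6, max_iter=1000):
    """Batch perceptron: accumulate all corrections, apply them once per iteration."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iter):
        delta = np.zeros(d)
        for x_m, y_m in zip(X, y):
            if y_m * (w @ x_m) <= 0:      # mistake
                delta += y_m * x_m        # accumulate the correction
        delta /= n
        w = w + eta * delta
        if np.linalg.norm(delta) < tol:   # until ||delta|| is (near) zero
            break
    return w
```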
Online vs. Batch Perceptron
• Batch learning learns with a batch of examples collectively
• Online learning learns with one example at a time
• Both learning mechanisms are useful in practice
• Online perceptron is sensitive to the order in which training examples are received
• In batch training, the correction incurred by each mistake is accumulated and applied at once at the end of the iteration
• In online training, each correction is applied immediately once a mistake is encountered, which changes the decision boundary; thus different mistakes may be encountered by online and batch training
• Online training performs stochastic gradient descent, an approximation to the true gradient descent used by batch training
Convergence Theorem (Block, 1962, Novikoff, 1962)
Given a training example sequence (x_1, y_1), (x_2, y_2), …, (x_N, y_N), if ||x_i|| ≤ D for all i, and there exist a vector u with ||u|| = 1 and a margin γ > 0 such that y_i (u^T x_i) ≥ γ for all i, then the number of mistakes that the perceptron algorithm makes is at most (D/γ)^2.
Note that ||·|| is the Euclidean norm of a vector.
Proof
To show convergence, we just need to show that each update moves the weight vector closer to a solution vector by a lower-bounded amount.
Note that, given that u is a solution vector, αu is also a solution vector, where α is an arbitrary scaling factor.
Let (x_k, y_k) be the k-th mistake; we have w(k+1) = w(k) + y_k x_k.
Because α is an arbitrary scaling factor, we can set α = D^2/γ. We will show that
  ||w(k+1) − αu||^2 ≤ ||w(k) − αu||^2 − D^2
Indeed,
  ||w(k+1) − αu||^2 = ||w(k) + y_k x_k − αu||^2
    = ||w(k) − αu||^2 + 2 y_k (w(k) − αu)^T x_k + ||x_k||^2   (since y_k^2 = 1)
    = ||w(k) − αu||^2 + 2 y_k w(k)^T x_k − 2α y_k u^T x_k + ||x_k||^2
    ≤ ||w(k) − αu||^2 − 2α y_k u^T x_k + ||x_k||^2            (because y_k w(k)^T x_k ≤ 0 on a mistake)
    ≤ ||w(k) − αu||^2 − 2αγ + D^2                             (because y_k u^T x_k ≥ γ and ||x_k|| ≤ D)
    = ||w(k) − αu||^2 − D^2                                   (because α = D^2/γ)
Proof (cont.)
By induction on k, we can show that
  ||w(k+1) − αu||^2 ≤ ||w(1) − αu||^2 − k D^2
Since w(1) = 0 and ||w(k+1) − αu||^2 ≥ 0,
  0 ≤ ||αu||^2 − k D^2 = (D^2/γ)^2 − k D^2
Therefore k ≤ D^2/γ^2 = (D/γ)^2.
Margin
• γ is referred to as the margin
– The minimum distance from data points to the decision boundary
– The bigger the margin, the easier the classification problem is
– The bigger the margin, the more confident we are about our prediction
• We will see later in the course that this concept leads to one of the most exciting recent developments in the ML field – support vector machines
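A small numerical check of the mistake bound; the synthetic data, the chosen unit vector u, and the margin filter are all illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Separable toy data: labels come from a known unit vector u (||u|| = 1)
u = np.array([0.0, 0.6, 0.8])
X = np.hstack([np.ones((200, 1)), rng.uniform(-1, 1, size=(200, 2))])
y = np.where(X @ u >= 0, 1, -1)
keep = np.abs(X @ u) > 0.05            # drop borderline points to enforce a margin
X, y = X[keep], y[keep]

D = np.max(np.linalg.norm(X, axis=1))  # D >= ||x_i|| for all i
gamma = np.min(y * (X @ u))            # gamma <= y_i * u^T x_i for all i

# Run the online perceptron until an error-free pass, counting total mistakes
w, mistakes = np.zeros(X.shape[1]), 0
while True:
    errors = 0
    for x_m, y_m in zip(X, y):
        if y_m * (w @ x_m) <= 0:
            w = w + y_m * x_m
            mistakes += 1
            errors += 1
    if errors == 0:
        break

print(mistakes, "<=", (D / gamma) ** 2)  # theorem: mistakes <= (D / gamma)^2
```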
Not linearly separable case
• In such cases the algorithm will never stop! How to fix?
• Look for a decision boundary that makes as few mistakes as possible – NP-hard!
Fixing the Perceptron
• Idea one: only go through the data once, or a fixed number of times
• At least this stops
• Problem: the final w might not be good; e.g., right before it stops, the algorithm might perform an update on a total outlier
Let w = (0, 0, 0, …, 0)
Repeat for T times: Take a training example i: (x_i, y_i)
  u = w^T x_i
  if y_i · u ≤ 0 (mistake):
    w ← w + y_i x_i
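A sketch of this fix: the same mistake-driven loop, but hard-capped at T passes over the data (the function name and the default T are illustrative).

```python
import numpy as np

def perceptron_fixed_passes(X, y, T=5):
    """Online perceptron that stops after exactly T passes over the data."""
    w = np.zeros(X.shape[1])
    for _ in range(T):                   # a fixed number of passes, then stop
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:     # mistake
                w = w + y_i * x_i
    return w                             # may be poor if the last update hit an outlier
```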
Voted Perceptron
• Keep intermediate hypotheses and have them vote
[Freund and Schapire 1998]
• Store a collection of linear separators w_0, w_1, …, along with their survival times c_0, c_1, …
• The c's can be good measures of the reliability of the w's
• For classification, take a weighted vote among all separators:
  h(x) = sgn( Σ_{n=0}^{N} c_n · sgn(w_n^T x) )
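A compact sketch of the voted perceptron following the description above; function names are illustrative.

```python
import numpy as np

def voted_perceptron_train(X, y, epochs=10):
    """Keep every intermediate separator w_n along with its survival time c_n."""
    w, c = np.zeros(X.shape[1]), 0
    separators = []                        # list of (w_n, c_n) pairs
    for _ in range(epochs):
        for x_m, y_m in zip(X, y):
            if y_m * (w @ x_m) <= 0:       # mistake: retire current w, start a new one
                separators.append((w.copy(), c))
                w, c = w + y_m * x_m, 0
            else:
                c += 1                     # current w survives one more example
    separators.append((w.copy(), c))
    return separators

def voted_perceptron_predict(separators, x):
    """h(x) = sgn( sum_n c_n * sgn(w_n^T x) ): weighted vote among all separators."""
    vote = sum(c * np.sign(w @ x) for w, c in separators)
    return 1 if vote >= 0 else -1
```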
Summary of Perceptron
• Learns a classifier ŷ = f(x) directly
• Applies gradient descent search to optimize the hinge loss function
  – Online version performs stochastic gradient descent
• Guaranteed to converge in finite steps if linearly separable
  – There exists an upper bound on the number of corrections needed
  – The bound is inversely proportional to the square of the margin of the optimal decision boundary
• If not linearly separable, use voted perceptrons