Transcript of lecture slides: Data Mining - 2011 - Volinsky - Columbia University

Page 1:

Data Mining - 2011 - Volinsky - Columbia University

Topic 9: Advanced Classification: Neural Networks, Support Vector Machines

Credits: Shawndra Hill

Andrew Moore lecture notes

Page 2:

Outline

• Special Topics
  – Neural Networks
  – Support Vector Machines


Page 3:

Neural Networks Agenda

• The biological inspiration
• Structure of neural net models
• Using neural net models
• Training neural net models
• Strengths and weaknesses
• An example


Page 4:

What the heck are neural nets?

• A data mining algorithm, inspired by biological processes

• A type of non-linear regression/classification

• An ensemble method
  – Although not usually thought of as such

• A black box!


Page 5:

Inspiration from Biology

• Information processing inspired by biological nervous systems


Structure of the nervous system:

A large number of neurons (information processing units) connected together

A neuron’s response depends on the states of other neurons it is connected to and to the ‘strength’ of those connections.

The ‘strengths’ are learned based on experience.

Page 6:

From Real to Artificial


Page 7:

Nodes: A Closer Look

[Figure: a single node. Input values x1, x2, …, xm enter with weights w1, w2, …, wm; a summing function Σ combines them with the bias b; an activation function φ(·) produces the output y.]

Page 8:

Nodes: A Closer Look

A node (neuron) is the basic information processing unit of a neural net. It has:

• A set of inputs with weights w1, w2, …, wm, along with a default input called the bias
• An adder function (linear combiner) that computes the weighted sum of the inputs:

  v = Σ_{j=1}^{m} w_j x_j

• An activation function φ (squashing function) that transforms v, usually non-linearly:

  y = φ(v + b)
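To make this concrete, here is a minimal R sketch of one node (the function name and values are illustrative, not from any package; a sigmoid is used as the example activation):

  # One node: weighted sum of inputs, plus bias, through an activation function
  node_output <- function(x, w, b, phi = function(v) 1 / (1 + exp(-v))) {
    v <- sum(w * x)      # adder: v = sum_j w_j x_j
    phi(v + b)           # activation applied to v + b
  }

  node_output(x = c(0.5, -1, 2), w = c(0.4, 0.3, -0.1), b = 0.2)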


Page 9:

A Simple Node: A Perceptron

A simple activation function: a sign threshold

[Figure: inputs x1, x2, …, xn with weights w1, w2, …, wn and bias b feed into v, producing output y(v).]

  φ(v) = +1 if v ≥ 0
         −1 if v < 0

Page 10:

Common Activation Functions

• Step function
• Sigmoid (logistic) function:

  φ(v) = e^v / (1 + e^v) = 1 / (1 + e^{-v})

• Hyperbolic tangent (tanh) function:

  tanh(v) = (e^v − e^{-v}) / (e^v + e^{-v})

• The s-shape adds non-linearity

[Hornik (1989)]: combining many of these simple functions is sufficient to approximate any continuous non-linear function arbitrarily well over a compact interval.
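These are one-liners in base R (tanh is built in; the plot is just for eyeballing the s-shapes):

  step_fn <- function(v) ifelse(v >= 0, 1, 0)      # step function
  sigmoid <- function(v) 1 / (1 + exp(-v))         # logistic; equals e^v / (1 + e^v)

  v <- seq(-4, 4, by = 0.1)
  plot(v, sigmoid(v), type = "l", ylim = c(-1, 1), ylab = "activation")
  lines(v, tanh(v), lty = 2)                       # tanh: s-shaped, range (-1, 1)
  lines(v, step_fn(v), lty = 3)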

Page 11:

Neural Network: Architecture

• Big idea: a combination of simple non-linear models working together to model a complex function

• How many layers? Nodes? What is the function?
  – Magic
  – Luckily, defaults do well

[Figure: network diagram with an input layer, hidden layer(s), and an output layer.]

Page 12:

Neural Networks: The Model

• The model has two components:
  – A particular architecture
    • Number of hidden layers
    • Number of nodes in the input, output and hidden layers
    • Specification of the activation function(s)
  – The associated set of weights

• Weights and complexity are “learned” from the data
  – Supervised learning, applied iteratively
  – Out-of-sample methods; cross-validation

Page 13:

Fitting a Neural Net: Feed Forward

• Supply attribute values at input nodes
• Obtain predictions from the output node(s)
  – Predicting classes:
    • Two classes – single output node with threshold
    • Multiple classes – use multiple outputs, one for each class; predicted class = output node with highest value

Multiple-class problems are one of the main uses of NNs!

Page 14:

A Simple NN: Regression

• A one-node neural network:
  – Called a ‘perceptron’
  – Uses the identity function as the activation function
  – What’s the output?
    • The weighted sum of the inputs

Logistic regression just changes the activation function to the logistic function

[Figure: a single node with inputs x1, x2, …, xn, weights w1, w2, …, wn, bias b, and output y(v).]

Page 15:

Training a NN: What does it learn?

• It fits/learns the weights that best translate inputs into outputs, given its architecture

• Hidden units can be thought of as learning higher-order regularities or features of the inputs that can be used to predict outputs

A network with hidden layers is called a “multi-layer perceptron.”

Page 16:

Perceptron Training Rule

Perceptron = Adder + Threshold

1. Start with a random set of small weights.

2. Feed in a training example and calculate the actual output.
3. Change each weight by an amount proportional to the difference between the desired output and the actual output:

  ΔW_i = η (D − Y) I_i

  where η is the learning rate (step size), D is the desired output, Y is the actual output, and I_i is the i-th input.
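A minimal R sketch of this rule on a toy separable problem (the data and constants are illustrative; a real run would loop until the weights stop changing):

  set.seed(1)
  w <- runif(2, -0.1, 0.1)                       # 1. start with small random weights
  b <- 0
  eta <- 0.1                                     # learning rate / step size

  X <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1)) # toy inputs (logical AND)
  D <- c(-1, -1, -1, 1)                          # desired outputs

  for (epoch in 1:50) {
    for (k in 1:nrow(X)) {
      Y <- ifelse(sum(w * X[k, ]) + b >= 0, 1, -1)  # 2. actual output (adder + threshold)
      w <- w + eta * (D[k] - Y) * X[k, ]            # 3. ΔW_i = η (D − Y) I_i
      b <- b + eta * (D[k] - Y)                     # bias updated the same way
    }
  }
  w; b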

Page 17:

Training NNs: Back Propagation

• How to train a neural net (find the optimal weights):
  – Present a training sample to the neural network.
  – Calculate the error in each output neuron.
  – For each neuron, calculate what the output should have been, and a scaling factor: how much lower or higher the output must be adjusted to match the desired output. This is the local error.
  – Adjust the weights of each neuron to lower the local error.
  – Assign “blame” for the local error to neurons at the previous level, giving greater responsibility to neurons connected by stronger weights.
  – Repeat on the neurons at the previous level, using each one’s “blame” as its error.

• This ‘propagates’ the error backward. The sequence of forward and backward passes is called ‘back-propagation’.
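A compact R sketch of one such update for a single hidden layer with sigmoid activations and squared error (architecture and numbers are illustrative; bias terms are omitted for brevity):

  sigmoid <- function(v) 1 / (1 + exp(-v))

  set.seed(2)
  n_in <- 3; n_hid <- 4                        # illustrative: 3 inputs, 4 hidden units
  W1 <- matrix(runif(n_hid * n_in, -0.5, 0.5), n_hid, n_in)  # input-to-hidden weights
  W2 <- runif(n_hid, -0.5, 0.5)                # hidden-to-output weights (one output)
  eta <- 0.5                                   # learning rate

  x <- c(1, 0, 1); d <- 1                      # one training sample, desired output d

  # Forward pass
  h <- sigmoid(as.vector(W1 %*% x))            # hidden activations
  y <- sigmoid(sum(W2 * h))                    # output neuron

  # Backward pass: each unit's local error ("blame")
  delta_out <- (d - y) * y * (1 - y)           # output error scaled by the sigmoid slope
  delta_hid <- (W2 * delta_out) * h * (1 - h)  # blame shared backward through W2

  # Adjust weights to lower the local errors
  W2 <- W2 + eta * delta_out * h
  W1 <- W1 + eta * outer(delta_hid, x)         # outer product: local error x input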

Page 18:

Training NNs: How to do it

• A “gradient descent” algorithm is typically used to fit the weights during back-propagation

• You can imagine a surface in an n-dimensional space such that:
  – Each dimension is a weight
  – Each point in this space is a particular combination of weights
  – Each point on the “surface” is the output error that corresponds to that combination of weights
  – You want to minimize error, i.e. find the “valleys” on this surface
  – Note the potential for ‘local minima’

Page 19:

Training NNs: Gradient Descent

• Find the gradient in each direction:

• Moving in the direction opposite these gradients gives the move of ‘steepest descent’

• Note potential problem with ‘local minima’.

  ∂Error / ∂w_i
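A small R illustration of gradient descent using finite differences for the gradients (the error surface here is a made-up quadratic, chosen so the minimum is known):

  error_fn <- function(w) (w[1] - 1)^2 + 2 * (w[2] + 0.5)^2   # toy error surface

  num_grad <- function(f, w, eps = 1e-6) {                    # finite-difference gradient
    sapply(seq_along(w), function(i) {
      e <- replace(numeric(length(w)), i, eps)
      (f(w + e) - f(w - e)) / (2 * eps)
    })
  }

  w <- c(3, 2); eta <- 0.1
  for (step in 1:100) w <- w - eta * num_grad(error_fn, w)    # move downhill
  w                                        # converges near the minimum at (1, -0.5)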

Page 20:

Gradient Descent

• Direction of steepest descent can be found mathematically or via computational estimation

[Figure via A. Moore.]

Page 21:

Neural Nets: Strengths

• Can model very complex functions very accurately – non-linearity is built into the model
• Handles noisy data quite well
• Provides fast predictions
• Good for multiple-category problems
  – Many-class classification
  – Image detection
  – Speech recognition
  – Financial models
• Good for multiple-stage problems

Page 22:

Neural Nets: Weaknesses

• A black-box. Hard to explain or gain intuition.

• For complex problems, training time can be quite high

• Many, many training parameters
  – Layers, neurons per layer, output layers, bias, training algorithms, learning rate

• Highly prone to overfitting
  – The balance between complexity and parsimony can be learned through cross-validation

Page 23:

Example: Face Detection


Architecture of the complete system: they use another neural net to estimate the orientation of the face, then rectify it. They search over scales to find bigger/smaller faces.

Figure from “Rotation invariant neural-network based face detection,” H.A. Rowley, S. Baluja and T. Kanade, Proc. Computer Vision and Pattern Recognition, 1998, copyright 1998, IEEE


Page 24:

Rowley, Baluja and Kanade (1998):
  Image size: 20 × 20 pixels
  Input layer: 400 units
  Hidden layer: 15 units

Page 25:

Neural Nets: Face Detection


Goal: detect “face or no face”


Page 26:

Face Detection: Results


Page 27:

Face Detection Results: A Few Misses


Page 28:

Neural Nets

• Face detection in action

• For more:
  – See Hastie, et al., Chapter 11

• R packages
  – Basic: nnet
  – Better: AMORE
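For instance, a minimal nnet fit on R’s built-in iris data (the parameter values are illustrative; size is the number of hidden units):

  library(nnet)

  set.seed(3)
  fit <- nnet(Species ~ ., data = iris, size = 5, decay = 0.01, maxit = 200)
  table(predicted = predict(fit, iris, type = "class"), actual = iris$Species)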


Page 29:

Support Vector Machines


Page 30:

SVM

• A classification technique
• Start with a BIG assumption:
  – The classes can be separated linearly

Page 31:

Linear Classifiers

[Scatterplot: datapoints labeled +1 and −1; a linear classifier f maps x to an estimated label y_est.]

f(x, w, b) = sign(w · x − b)

How would you classify this data?
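As a one-line R function (the values of x, w, and b here are arbitrary illustrations):

  linear_classify <- function(x, w, b) sign(sum(w * x) - b)  # f(x, w, b) = sign(w · x − b)
  linear_classify(x = c(2, 1), w = c(1, -1), b = 0.5)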

Page 32:

Linear Classifiers

[Scatterplot as before, with another candidate separating line.]

f(x, w, b) = sign(w · x − b)

How would you classify this data?

Page 33:

Linear Classifiers

[Scatterplot as before, with another candidate separating line.]

f(x, w, b) = sign(w · x − b)

How would you classify this data?

Page 34:

Linear Classifiers

[Scatterplot as before, with another candidate separating line.]

f(x, w, b) = sign(w · x − b)

How would you classify this data?

Page 35:

Linear Classifiers

[Scatterplot: several candidate separating lines.]

f(x, w, b) = sign(w · x − b)

Any of these would be fine..

..but which is best?

Page 36:

Classifier Margin

[Scatterplot: a separating line with its margin shown.]

f(x, w, b) = sign(w · x − b)

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Page 37:

Maximum Margin

[Scatterplot: the maximum-margin separating line.]

f(x, w, b) = sign(w · x − b)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.

This is the simplest kind of SVM, called a linear SVM (LSVM).

Page 38:

Maximum Margin

[Scatterplot: the maximum-margin line with the support vectors marked.]

f(x, w, b) = sign(w · x − b)

The maximum margin linear classifier is the linear classifier with the, um, maximum margin.

This is the simplest kind of SVM, called a linear SVM (LSVM).

Support vectors are those datapoints that the margin pushes up against.

Page 39:

Why Maximum Margin?

f(x, w, b) = sign(w · x − b)

[Scatterplot: the maximum-margin line with the support vectors marked.]

1. Intuitively this feels safest.

2. If we’ve made a small error in the location of the boundary (it’s been jolted in its perpendicular direction) this gives us least chance of causing a misclassification.

3. LOOCV is easy since the model is immune to removal of any non-support-vector datapoints.

4. There’s some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.

5. Empirically it works very very well.

Page 40:

Specifying a line and margin

• How do we represent this mathematically?

• …in m input dimensions?

[Figure: the classifier boundary between a plus-plane and a minus-plane, with a “Predict Class = +1” zone on one side and a “Predict Class = −1” zone on the other.]

Page 41:

Specifying a line and margin

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }

[Figure: boundary w · x + b = 0 between the planes w · x + b = +1 and w · x + b = −1, with the predict-class zones.]

Classify as..
  +1 if w · x + b ≥ 1
  −1 if w · x + b ≤ −1
  Universe explodes if −1 < w · x + b < 1

Page 42:

Computing the margin width

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }

Claim: the vector w is perpendicular to the plus-plane.

[Figure: the planes w · x + b = +1, 0, −1, with M = margin width.]

How do we compute M in terms of w and b?

Page 43:

Computing the margin width

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }
• The vector w is perpendicular to the plus-plane
• Let x⁻ be any point on the minus-plane (any location in R^m: not necessarily a datapoint)
• Let x⁺ be the closest plus-plane point to x⁻

[Figure: the planes, the points x⁻ and x⁺, and M = margin width.]

How do we compute M in terms of w and b?

Page 44:

Computing the margin width

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }
• The vector w is perpendicular to the plus-plane
• Let x⁻ be any point on the minus-plane
• Let x⁺ be the closest plus-plane point to x⁻
• Claim: x⁺ = x⁻ + λw for some value of λ

[Figure: the planes, x⁻, x⁺, and M = margin width.]

How do we compute M in terms of w and b?

Page 45:

Computing the margin width

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }
• The vector w is perpendicular to the plus-plane
• Let x⁻ be any point on the minus-plane
• Let x⁺ be the closest plus-plane point to x⁻
• Claim: x⁺ = x⁻ + λw for some value of λ. Why?

[Figure: the planes, x⁻, x⁺, and M = margin width.]

The line from x⁻ to x⁺ is perpendicular to the planes. So to get from x⁻ to x⁺, travel some distance in direction w.

Page 46:

Computing the margin width

What we know:
• w · x⁺ + b = +1
• w · x⁻ + b = −1
• x⁺ = x⁻ + λw
• |x⁺ − x⁻| = M

It’s now easy to get M in terms of w and b.

[Figure: the planes, x⁻, x⁺, and M = margin width.]

Page 47:

Computing the margin width

What we know:
• w · x⁺ + b = +1
• w · x⁻ + b = −1
• x⁺ = x⁻ + λw
• |x⁺ − x⁻| = M

It’s now easy to get M in terms of w and b:

  w · (x⁻ + λw) + b = 1
  ⇒ w · x⁻ + b + λ w · w = 1
  ⇒ −1 + λ w · w = 1
  ⇒ λ = 2 / (w · w)

Page 48:

Computing the margin width

What we know:
• w · x⁺ + b = +1
• w · x⁻ + b = −1
• x⁺ = x⁻ + λw
• |x⁺ − x⁻| = M
• λ = 2 / (w · w)

So:

  M = |x⁺ − x⁻| = |λw| = λ √(w · w)
    = 2 √(w · w) / (w · w)
    = 2 / √(w · w)

M = Margin Width = 2 / √(w · w)
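As a one-line R check of the result (the weight vector is an arbitrary example):

  margin_width <- function(w) 2 / sqrt(sum(w * w))   # M = 2 / sqrt(w · w)
  margin_width(c(3, 4))                              # |w| = 5, so M = 0.4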

Page 49:

Learning the Maximum Margin Classifier

Given a guess of w and b we can:
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin

Search the space of w’s and b’s to find the widest margin that matches all the datapoints.

[Figure: the planes, with M = Margin Width = 2 / √(w · w).]

Page 50:

Uh-oh!

[Scatterplot: classes labeled +1 and −1 that cannot be separated by a line.]

This is going to be a problem!

What should we do?

Page 51:

Uh-oh!

[Same scatterplot.]

This is going to be a problem!

What should we do?

Idea 1:

Find minimum w · w while minimizing the number of training-set errors.

Problemette: Two things to minimize makes for an ill-defined optimization

Page 52:

Uh-oh!

[Same scatterplot.]

This is going to be a problem!

What should we do?

Idea 1.1: Minimize

  w · w + C · (#train errors)

where C is a tradeoff parameter. And: use a trick.
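For context, the standard textbook version of this idea (not spelled out on the slide) replaces the raw error count with slack variables ξ_k, which turns the problem into a well-behaved quadratic program:

  minimize    (1/2) w · w + C Σ_k ξ_k
  subject to  y_k (w · x_k + b) ≥ 1 − ξ_k  and  ξ_k ≥ 0  for every datapoint k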

Page 53:

Suppose we’re in 1-dimension

What would SVMs do with this data?

[Number line: 1-D datapoints plotted around x = 0.]

Page 54:

Suppose we’re in 1-dimension

Not a big surprise

[Number line around x = 0: the maximum-margin threshold, with a positive “plane” on one side and a negative “plane” on the other.]

Page 55:

Harder 1-dimensional dataset

What can be done about this?

[Number line around x = 0: no single threshold separates the two classes.]

Page 56:

Harder 1-dimensional dataset

Embed the data in a higher dimensional space

[Number line around x = 0, embedded as:]

  z_k = (x_k, x_k²)
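In R, with made-up data in the spirit of the picture:

  x <- c(-3, -2, 2, 3, -0.5, 0, 0.5)   # harder 1-D data: one class outside, one inside
  y <- c( 1,  1, 1, 1,  -1, -1,  -1)

  z <- cbind(x, x^2)                   # embed: z_k = (x_k, x_k^2)
  plot(z, col = ifelse(y > 0, "blue", "red"), pch = 19, xlab = "x", ylab = "x^2")
  # in the (x, x^2) plane a horizontal line now separates the classes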

Page 57:

Harder 1-dimensional dataset

  z_k = (x_k, x_k²)

[Plot: in the (x, x²) plane the embedded data is linearly separable.]

Page 58:

SVM Kernel Functions

• Embedding the data in a higher dimensional space where it is separable is called the “kernel trick”

• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right kernel function
  – Radial-basis-style kernel function:

    K(a, b) = exp( −(a − b)² / (2σ²) )

  – Neural-net-style kernel function:

    K(a, b) = tanh( κ a · b − δ )
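Both are simple to write in R (the defaults for σ, κ, and δ are illustrative):

  rbf_kernel <- function(a, b, sigma = 1)              # radial-basis-style
    exp(-sum((a - b)^2) / (2 * sigma^2))

  nn_kernel <- function(a, b, kappa = 1, delta = 0)    # neural-net-style
    tanh(kappa * sum(a * b) - delta)

  rbf_kernel(c(1, 2), c(2, 0))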

Page 59:

SVM Performance

• Trick: find linear boundaries in an enlarged space
  – These translate to nonlinear boundaries in the original space

• Magic: for more details, see Hastie et al., Section 12.3

• Anecdotally they work very very well indeed
  – Example: they are currently the best-known classifier on a well-studied handwritten-character recognition benchmark

• There is a lot of excitement and religious fervor about SVMs

• Despite this, some practitioners are a little skeptical

Page 60:

Page 61:

Doing multi-class classification

• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2)
• What can be done?
• Answer: with output arity N, learn N SVMs, one per class (see the sketch below)
  – SVM 1 learns “Output == 1” vs “Output != 1”
  – SVM 2 learns “Output == 2” vs “Output != 2”
  – :
  – SVM N learns “Output == N” vs “Output != N”
• Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region
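A sketch of this scheme in R. The slides name no package; e1071’s svm() is one common choice, and its decision values stand in for “how far into the positive region” (the orientation check reflects how e1071 labels decision values and is an implementation detail):

  library(e1071)

  classes <- levels(iris$Species)                  # N = 3 classes
  X <- iris[, 1:4]

  # Learn one "class k vs rest" SVM per class
  models <- lapply(classes, function(k)
    svm(X, factor(iris$Species == k), kernel = "radial"))

  # Score every row with every SVM; keep the signed distance toward "TRUE"
  scores <- sapply(models, function(m) {
    p  <- predict(m, X, decision.values = TRUE)
    dv <- attr(p, "decision.values")[, 1]
    if (colnames(attr(p, "decision.values"))[1] == "FALSE/TRUE") -dv else dv
  })

  pred <- classes[max.col(scores)]                 # furthest into the positive region
  mean(pred == iris$Species)                       # training accuracy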

Page 62:

References

• Hastie, et al., Chapter 11 (NN); Chapter 12 (SVM)
• Andrew Moore’s notes on neural nets
• Andrew Moore’s notes on SVMs
• Wikipedia has very good pages on both topics
• An excellent tutorial on VC-dimension and Support Vector Machines by C. Burges:
  – C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
• The SVM bible: Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998.