Data Mining - 2011 - Volinsky - Columbia University Topic 9: Advanced Classification Neural Networks...

Data Mining - 2011 - Volinsky - Columbia University

Topic 9: Advanced Classification

Neural Networks

Support Vector Machines

Credits: Shawndra Hill

Andrew Moore lecture notes


Outline

• Special Topics
  – Neural Networks
  – Support Vector Machines


Neural Networks Agenda

• The biological inspiration
• Structure of neural net models
• Using neural net models
• Training neural net models
• Strengths and weaknesses
• An example

What the heck are neural nets?

• A data mining algorithm, inspired by biological processes

• A type of non-linear regression/classification

• An ensemble method
  – Although not usually thought of as such

• A black box!


Inspiration from Biology

• Information processing inspired by biological nervous systems


Structure of the nervous system:

A large number of neurons (information processing units) connected together

A neuron’s response depends on the states of the other neurons it is connected to and on the ‘strength’ of those connections.

The ‘strengths’ are learned based on experience.


From Real to Artificial



Nodes: A Closer Look

[Figure: a single node. Input values x1, x2, …, xm are multiplied by weights w1, w2, …, wm; a summing function ∑ combines them with the bias b; an activation function ϕ(·) produces the output y.]


Nodes: A Closer Look

A node (neuron) is the basic information processing unit of a neural net. It has:

• A set of inputs with weights w1, w2, …, wm, along with a default input called the bias

• An adder function (linear combiner) that computes the weighted sum of the inputs:

$$v = \sum_{j=1}^{m} w_j x_j$$

• An activation function (squashing function) $\varphi$ that transforms v, usually non-linearly:

$$y = \varphi(v + b)$$
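The node computation above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the lecture: the function name `node_output`, the choice of `math.tanh` as the default activation, and the example numbers are all assumptions.

```python
import math

def node_output(inputs, weights, bias, activation=math.tanh):
    """Compute y = phi(v + b), where v is the weighted sum of the inputs."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Adder (linear combiner): v = sum_j w_j * x_j
    v = sum(w * x for w, x in zip(weights, inputs))
    # Activation (squashing) function applied to v + b
    return activation(v + bias)

# Hypothetical example values, chosen only for illustration.
y = node_output([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], bias=0.2)
print(y)
```

Passing a different `activation` (for example the identity `lambda v: v`) changes only the last step, which is the point of separating the adder from the squashing function.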


A Simple Node: A Perceptron

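A simple activation function: a sign threshold

[Figure: inputs x1, x2, …, xn with weights w1, w2, …, wn and a bias b are summed to give v, which is passed through the threshold to give the output y(v).]

$$\varphi(v) = \begin{cases} +1 & \text{if } v \ge 0 \\ -1 & \text{if } v < 0 \end{cases}$$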


Common Activation Functions

• Step function

• Sigmoid (logistic) function:

$$\varphi(v) = \frac{e^{v}}{1 + e^{v}} = \frac{1}{1 + e^{-v}}$$

• Hyperbolic Tangent (Tanh) function:

$$\tanh(v) = \frac{e^{v} - e^{-v}}{e^{v} + e^{-v}}$$

• The s-shape adds non-linearity

[Hornik (1989)]: combining many of these simple functions is sufficient to approximate any continuous non-linear function arbitrarily well over a compact interval.
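As a quick check on the activation functions listed on this slide, here is a small Python sketch. The function names are my own, and the step function is assumed to be the ±1 threshold used for the perceptron (the slide does not fix its output values).

```python
import math

def step(v):
    """Threshold activation; assumed here to output +1 or -1."""
    return 1.0 if v >= 0 else -1.0

def sigmoid(v):
    """Logistic function, written to avoid overflow for large |v|."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def tanh_explicit(v):
    """tanh written out from its definition (math.tanh is the library version)."""
    return (math.exp(v) - math.exp(-v)) / (math.exp(v) + math.exp(-v))

for v in (-2.0, 0.0, 2.0):
    print(v, step(v), sigmoid(v), tanh_explicit(v))
```

The sigmoid maps onto (0, 1) and tanh onto (-1, 1); the two are related by tanh(v) = 2·sigmoid(2v) − 1, so they differ only by shifting and rescaling.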

Neural Network: Architecture

• Big idea: a combination of simple non-linear models working together to model a complex function

• How many layers? Nodes? What is the function?
  – Magic
  – Luckily, defaults do well

[Figure: network diagram with an input layer, hidden layer(s), and an output layer.]


Neural Networks: The Model

• Model has two components
  – A particular architecture
    • Number of hidden layers
    • Number of nodes in the input, output and hidden layers
    • Specification of the activation function(s)
  – The associated set of weights

• Weights and complexity are “learned” from the data
  – Supervised learning, applied iteratively
  – Out-of-sample methods; Cross-validation


Fitting a Neural Net: Feed Forward

• Supply attribute values at input nodes
• Obtain predictions from the output node(s)
  – Predicting classes
    • Two classes – single output node with threshold
    • Multiple classes – use multiple outputs, one for each class. Predicted class = output node with highest value

Multiple class problems are one of the main uses of NN!
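The feed-forward step for a multi-class problem can be sketched as follows. This is an illustrative sketch only: the one-hidden-layer architecture, the sigmoid activations, and all weight values are assumptions, not taken from the lecture.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def layer_forward(inputs, weights, biases, activation):
    """Outputs of one layer; weights[k] holds the weights of node k."""
    return [activation(sum(w * x for w, x in zip(node_w, inputs)) + b)
            for node_w, b in zip(weights, biases)]

def predict_class(inputs, hidden_w, hidden_b, output_w, output_b):
    """Feed the inputs forward and pick the output node with the highest value."""
    hidden = layer_forward(inputs, hidden_w, hidden_b, sigmoid)
    outputs = layer_forward(hidden, output_w, output_b, sigmoid)
    best = max(range(len(outputs)), key=lambda k: outputs[k])
    return best, outputs

# Hypothetical weights: 2 inputs, 2 hidden nodes, 3 output nodes (one per class).
hidden_w, hidden_b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
output_w, output_b = [[2.0, -2.0], [-2.0, 2.0], [0.0, 0.0]], [0.0, 0.0, 0.0]

best, outputs = predict_class([1.0, 0.0], hidden_w, hidden_b, output_w, output_b)
print("predicted class index:", best)
print("output node values:", outputs)
```

For a two-class problem the same code applies with a single output node, replacing the `max` with a threshold on that node's value.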


A Simple NN: Regression

• A one-node neural network:
  – Called a ‘perceptron’
  – Use identity function as the activation function
  – What’s the output?
    • Weighted sum of inputs

Logistic regression just changes the activation function to the logistic function

[Figure: inputs x1, x2, …, xn with weights w1, w2, …, wn and a bias b feed a single node that computes v and outputs y(v).]


Training a NN: What does it learn?

• It fits/learns the weights that best translate inputs into outputs given its architecture

• Hidden units can be thought to learn some higher order regularities or features of the inputs that can be used to predict outputs.

“Multi-layer perceptron”


Perceptron Training Rule

Perceptron = Adder + Threshold

1. Start with a random set of small weights.

2. Calculate the output for an example.

3. Change each weight by an amount proportional to the difference between the desired output and the actual output:

$$\Delta W_i = \eta \, (D - Y) \, I_i$$

where η is the learning rate (step size), D is the desired output, Y is the actual output, and I_i is the i-th input.
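The training rule above can be sketched as a short loop. This is a minimal sketch under stated assumptions: the ±1 threshold, the bias treated as a weight on a constant input of 1, the stopping rule, the learning rate, and the logical-AND toy data are all my choices for illustration.

```python
import random

def predict(weights, bias, inputs):
    """Adder + threshold: output +1 if the weighted sum is >= 0, else -1."""
    v = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if v >= 0 else -1

def train_perceptron(examples, eta=0.1, epochs=1000, seed=0):
    rng = random.Random(seed)
    n_inputs = len(examples[0][0])
    # Step 1: start with a random set of small weights.
    weights = [rng.uniform(-0.05, 0.05) for _ in range(n_inputs)]
    bias = rng.uniform(-0.05, 0.05)
    for _ in range(epochs):
        mistakes = 0
        for inputs, desired in examples:
            # Step 2: calculate the output for an example.
            actual = predict(weights, bias, inputs)
            if actual != desired:
                mistakes += 1
            # Step 3: delta W_i = eta * (D - Y) * I_i (zero when the output is right).
            for i, x in enumerate(inputs):
                weights[i] += eta * (desired - actual) * x
            bias += eta * (desired - actual)  # bias acts as a weight on input 1
        if mistakes == 0:  # assumed stopping rule: a full pass with no errors
            break
    return weights, bias

# Hypothetical toy data: logical AND with labels in {-1, +1}.
and_data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
weights, bias = train_perceptron(and_data)
print(weights, bias)
```

The loop is only guaranteed to stop early when the classes are linearly separable, as they are for AND; otherwise it runs for the full number of epochs.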


Training NNs: Back Propagation

• How to train a neural net (find the optimal weights):
  – Present a training sample to the neural network.
  – Calculate the error in each output neuron.
  – For each neuron, calculate what the output should have been, and a scaling factor, how much lower or higher the output must be adjusted to match the desired output. This is the local error.
  – Adjust the weights of each neuron to lower the local error.
  – Assign "blame" for the local error to neurons at the previous level, giving greater responsibility to neurons connected by stronger weights.
  – Repeat on the neurons at the previous level, using each one's "blame" as its error.

• This ‘propagates’ the error backward. The sequence of forward and backward fits is called ‘back propagation’.
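The steps above can be made concrete for a tiny network. The sketch below is an illustration under assumptions that the slides do not specify: one hidden layer, sigmoid activations everywhere, a single output, squared error for one sample, and hypothetical starting weights.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, params):
    """Feed-forward pass: returns the hidden activations and the output y."""
    W1, b1, w2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return hidden, y

def error(x, d, params):
    """Squared error for one training sample (an assumed error measure)."""
    _, y = forward(x, params)
    return 0.5 * (y - d) ** 2

def backprop(x, d, params):
    """Gradient of the error with respect to every weight and bias."""
    W1, b1, w2, b2 = params
    hidden, y = forward(x, params)
    # Local error at the output neuron (the sigmoid derivative is y * (1 - y)).
    delta_out = (y - d) * y * (1.0 - y)
    grad_w2 = [delta_out * h for h in hidden]
    # "Blame" passed back to each hidden neuron, scaled by its outgoing weight.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w2, hidden)]
    grad_W1 = [[dh * xi for xi in x] for dh in delta_hidden]
    return grad_W1, delta_hidden, grad_w2, delta_out

def update(params, grads, eta):
    """One gradient step: move every weight against its gradient."""
    W1, b1, w2, b2 = params
    gW1, gb1, gw2, gb2 = grads
    return ([[w - eta * g for w, g in zip(row, grow)] for row, grow in zip(W1, gW1)],
            [b - eta * g for b, g in zip(b1, gb1)],
            [w - eta * g for w, g in zip(w2, gw2)],
            b2 - eta * gb2)

# Hypothetical starting weights and a single training sample.
params = ([[0.1, -0.2], [0.3, 0.4]], [0.0, 0.0], [0.5, -0.5], 0.0)
x, d = [1.0, 0.5], 1.0
before = error(x, d, params)
params_after = update(params, backprop(x, d, params), eta=0.1)
after = error(x, d, params_after)
print(before, after)
```

In practice the forward and backward passes are repeated over many samples and many passes through the data; a single step is shown here only to expose the mechanics.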


Training NNs: How to do it

• A “Gradient Descent” algorithm is typically used to fit the weights in back propagation

• You can imagine a surface in an n-dimensional space such that
  – Each dimension is a weight
  – Each point in this space is a particular combination of weights
  – Each point on the “surface” is the output error that corresponds to that combination of weights
  – You want to minimize error, i.e. find the “valleys” on this surface
  – Note the potential for ‘local minima’

Training NNs: Gradient Descent

• Find the gradient in each direction:

$$\frac{\partial \mathrm{Error}}{\partial w_i}$$

• Moving according to these gradients results in the move of ‘steepest descent’

• Note potential problem with ‘local minima’.

Gradient Descent

• Direction of steepest descent can be found mathematically or via computational estimation


Via A. Moore
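The "computational estimation" route can be sketched with finite differences. This is an illustrative sketch, not the lecture's method: the error surface `bowl` is a made-up function with a single minimum, and the step size `eta`, difference width `h`, and iteration count are assumptions.

```python
def numerical_gradient(error, weights, h=1e-6):
    """Estimate dError/dw_i for each weight by central differences."""
    grad = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += h
        down[i] -= h
        grad.append((error(up) - error(down)) / (2 * h))
    return grad

def gradient_descent(error, weights, eta=0.1, steps=200):
    """Repeatedly step against the gradient (the direction of steepest descent)."""
    w = list(weights)
    for _ in range(steps):
        g = numerical_gradient(error, w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

def bowl(w):
    """Hypothetical error surface with one valley, at w = (3, -1)."""
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2

w_min = gradient_descent(bowl, [0.0, 0.0])
print(w_min)
```

On a surface with several valleys the same loop settles into whichever valley is downhill from the starting point, which is the ‘local minima’ problem in practice.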



Neural Nets: Strengths

• Can model very complex functions, very accurately – non-linearity is built into the model

• Handles noisy data quite well
• Provides fast predictions
• Good for multiple category problems
  – Many-class classification
  – Image detection
  – Speech recognition
  – Financial models

• Good for multiple stage problems


Neural Nets: Weaknesses

• A black-box. Hard to explain or gain intuition.

• For complex problems, training time could be quite high

• Many, many training parameters
  – Layers, neurons per layer, output layers, bias, training algs, learning rate

• Highly prone to overfitting
  – The balance between complexity and parsimony can be learned through cross-validation

Example: Face Detection


Architecture of the complete system: they use another neural net to estimate orientation of the face, then rectify it. They search over scales to find bigger/smaller faces.

Figure from “Rotation invariant neural-network based face detection,” H.A. Rowley, S. Baluja and T. Kanade, Proc. Computer Vision and Pattern Recognition, 1998, copyright 1998, IEEE

Rowley, Baluja and Kanade’s (1998)
Image Size: 20 x 20
Input Layer: 400 units
Hidden Layer: 15 units

Neural Nets: Face Detection


Goal: detect “face or no face”



Face Detection: Results



Face Detection Results: A Few Misses


Neural Nets

• Face detection in action: http://cs.nyu.edu/~yann/research/cface/face_demo.avi

• For more:
  – See Hastie, et al Chapter 11

• R packages
  – Basic: nnet
  – Better: AMORE

Support Vector Machines


SVM

• Classification technique
• Start with a BIG assumption
  – The classes can be separated linearly



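Linear Classifiers

[Figure: scatter plot of two classes; one marker denotes +1, the other denotes -1. A classifier f maps an input x to an estimate y_est.]

f(x,w,b) = sign(w . x - b)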

How would you classify this data?





Any of these would be fine..

..but which is best?


Classifier Margin


Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.


Maximum Margin


The maximum margin linear classifier is the linear classifier with the, um, maximum margin.

This is the simplest kind of SVM, called a Linear SVM (LSVM).


Support Vectors are those datapoints that the margin pushes up against.


Why Maximum Margin?


1. Intuitively this feels safest.

2. If we’ve made a small error in the location of the boundary (it’s been jolted in its perpendicular direction) this gives us least chance of causing a misclassification.

3. LOOCV is easy since the model is immune to removal of any non-support-vector datapoints.

4. There’s some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.

5. Empirically it works very very well.


Specifying a line and margin

• How do we represent this mathematically?

• …in m input dimensions?

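[Figure: the Classifier Boundary lies midway between the Plus-Plane and the Minus-Plane. The “Predict Class = +1” zone lies beyond the Plus-Plane and the “Predict Class = -1” zone lies beyond the Minus-Plane.]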


Specifying a line and margin

• Plus-plane = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }

Classify as..

+1 if w . x + b >= 1

-1 if w . x + b <= -1

Universe explodes if -1 < w . x + b < 1

[Figure: the parallel lines w.x+b=1 (Plus-Plane), w.x+b=0 (Classifier Boundary) and w.x+b=-1 (Minus-Plane), with the “Predict Class = +1” and “Predict Class = -1” zones on either side.]


Computing the margin width

• Plus-plane = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }

Claim: The vector w is perpendicular to the Plus Plane.

M = Margin Width

How do we compute M in terms of w and b?



• Let x- be any point on the minus plane
• Let x+ be the closest plus-plane-point to x-.

x- and x+ can be any location in R^m: not necessarily a datapoint.



• Claim: x+ = x- + λ w for some value of λ. Why?

The line from x- to x+ is perpendicular to the planes.

So to get from x- to x+ travel some distance in direction w.



What we know:
• w . x+ + b = +1
• w . x- + b = -1
• x+ = x- + λ w
• |x+ - x- | = M

It’s now easy to get M in terms of w and b. Substituting the third fact into the first:

$$\mathbf{w} \cdot (\mathbf{x}^- + \lambda \mathbf{w}) + b = 1$$

$$\Rightarrow \quad \mathbf{w} \cdot \mathbf{x}^- + b + \lambda \, \mathbf{w} \cdot \mathbf{w} = 1$$

$$\Rightarrow \quad -1 + \lambda \, \mathbf{w} \cdot \mathbf{w} = 1$$

$$\Rightarrow \quad \lambda = \frac{2}{\mathbf{w} \cdot \mathbf{w}}$$

So the margin width is

$$M = |\mathbf{x}^+ - \mathbf{x}^-| = |\lambda \mathbf{w}| = \lambda |\mathbf{w}| = \lambda \sqrt{\mathbf{w} \cdot \mathbf{w}} = \frac{2 \sqrt{\mathbf{w} \cdot \mathbf{w}}}{\mathbf{w} \cdot \mathbf{w}} = \frac{2}{\sqrt{\mathbf{w} \cdot \mathbf{w}}}$$

Learning the Maximum Margin Classifier

Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin

Search the space of w’s and b’s to find the widest margin that matches all the datapoints.

$$M = \text{Margin Width} = \frac{2}{\sqrt{\mathbf{w} \cdot \mathbf{w}}}$$


Uh-oh!

denotes +1

denotes -1

This is going to be a problem!

What should we do?



Idea 1:

Find minimum w.w, while minimizing number of training set errors.

Problemette: Two things to minimize makes for an ill-defined optimization


Idea 1.1:

Minimize

w.w + C (#train errors)

where C is a tradeoff parameter.

And: Use a trick

Suppose we’re in 1-dimension

What would SVMs do with this data?

x=0


Suppose we’re in 1-dimension

Not a big surprise

Positive “plane” Negative “plane”

x=0


Harder 1-dimensional dataset

What can be done about this?

x=0



Embed the data in a higher dimensional space:

$$\mathbf{z}_k = (x_k, x_k^2)$$
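The effect of the embedding can be sketched on a made-up dataset. Everything below is an assumption for illustration: the data (one class near zero, the other on both sides of it) and the hand-picked boundary in the embedded space, which separates the classes but is not claimed to be the maximum-margin one.

```python
def embed(x):
    """Map a 1-D point x to the 2-D point z = (x, x^2)."""
    return (x, x * x)

# Hypothetical "harder" 1-D dataset: no single threshold on x separates it.
data = [(-3.0, 1), (-2.0, 1), (-0.5, -1), (0.0, -1), (0.5, -1), (2.0, 1), (3.0, 1)]

# Hand-picked linear boundary in the embedded space: z2 = 2, i.e. w = (0, 1), b = -2.
w, b = (0.0, 1.0), -2.0

def classify(x):
    z = embed(x)
    return 1 if w[0] * z[0] + w[1] * z[1] + b >= 0 else -1

for x, label in data:
    print(x, embed(x), label, classify(x))
```

The straight line z2 = 2 in the embedded space corresponds to the pair of thresholds x = ±√2 back on the original axis, which is a non-linear boundary in the original space.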


SVM Kernel Functions

• Embedding the data in a higher dimensional space where it is separable is called the “kernel trick”

• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function
  – Radial-Basis-style Kernel Function:

$$K(\mathbf{a}, \mathbf{b}) = \exp\left( -\frac{(\mathbf{a} - \mathbf{b})^2}{2 \sigma^2} \right)$$

  – Neural-net-style Kernel Function:

$$K(\mathbf{a}, \mathbf{b}) = \tanh(\kappa \, \mathbf{a} \cdot \mathbf{b} - \delta)$$
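The two kernel functions above are easy to write down directly. The sketch below also includes a small check, for the 1-D embedding z = (x, x^2), that a kernel can return the embedded dot product without building z explicitly. Function names and default parameter values are assumptions.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def rbf_kernel(a, b, sigma=1.0):
    """Radial-basis-style kernel: exp(-|a - b|^2 / (2 sigma^2))"""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def neural_net_kernel(a, b, kappa=1.0, delta=0.0):
    """Neural-net-style kernel: tanh(kappa * a.b - delta)"""
    return math.tanh(kappa * dot(a, b) - delta)

def embed(x):
    """Explicit 1-D embedding z = (x, x^2)."""
    return (x, x * x)

def embedding_kernel(a, b):
    """Same value as embed(a) . embed(b), computed from a and b directly."""
    return a * b + (a * b) ** 2

print(rbf_kernel([1.0, 2.0], [2.0, 0.0], sigma=1.5))
print(neural_net_kernel([1.0, 2.0], [3.0, 4.0], kappa=0.1, delta=0.1))
print(embedding_kernel(1.5, -2.0), dot(embed(1.5), embed(-2.0)))
```

The last line prints the same number twice, which is the practical content of the kernel idea: the classifier only ever needs dot products in the enlarged space, and the kernel supplies them.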


SVM Performance

• Trick: find linear boundaries in an enlarged space
  – Translate to nonlinear boundaries in the original space

• Magic: for more details, see Hastie et al 12.3

• Anecdotally they work very very well indeed.
• Example: They are currently the best-known classifier on a well-studied hand-written-character recognition benchmark

• There is a lot of excitement and religious fervor about SVMs.

• Despite this, some practitioners are a little skeptical.


Doing multi-class classification

• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2).

• What can be done?
• Answer: with output arity N, learn N SVM’s
  – SVM 1 learns “Output==1” vs “Output != 1”
  – SVM 2 learns “Output==2” vs “Output != 2”
  – :
  – SVM N learns “Output==N” vs “Output != N”

• Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
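The prediction step of this one-vs-rest scheme can be sketched as follows. The sketch assumes each SVM is linear and already trained, represented only by a hypothetical pair (w, b), and it reads "furthest into the positive region" as the largest value of w . x + b; dividing by |w| to compare geometric distances would be an equally plausible reading.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def svm_score(model, x):
    """Signed score w . x + b of one two-class SVM."""
    w, b = model
    return dot(w, x) + b

def predict_one_vs_rest(models, x):
    """Pick the class whose SVM pushes x furthest into its positive region."""
    return max(models, key=lambda label: svm_score(models[label], x))

# Hypothetical, hand-written models: class label -> (w, b).
models = {
    1: ([1.0, 0.0], 0.0),    # "Output==1" vs "Output != 1"
    2: ([0.0, 1.0], 0.0),    # "Output==2" vs "Output != 2"
    3: ([-1.0, -1.0], 0.0),  # "Output==3" vs "Output != 3"
}

print(predict_one_vs_rest(models, [2.0, 0.5]))
print(predict_one_vs_rest(models, [-1.0, -2.0]))
```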


References

• Hastie, et al Chapter 11 (NN); Chapter 12 (SVM)

• Andrew Moore notes on Neural nets: http://www.autonlab.org/tutorials/neural.html
• Andrew Moore notes on SVM: http://www.autonlab.org/tutorials/svm.html
• Wikipedia has very good pages on both topics
• An excellent tutorial on VC-dimension and Support Vector Machines by C. Burges:
  – A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://research.microsoft.com/pubs/67119/svmtutorial.pdf

• The SVM Bible: Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience; 1998
