Data Mining - 2011 - Volinsky - Columbia University
Topic 9: Advanced Classification
Neural Networks
Support Vector Machines
Credits: Shawndra Hill
Andrew Moore lecture notes
Outline
• Special Topics
– Neural Networks
– Support Vector Machines
Neural Networks Agenda
• The biological inspiration
• Structure of neural net models
• Using neural net models
• Training neural net models
• Strengths and weaknesses
• An example
What the heck are neural nets?
• A data mining algorithm, inspired by biological processes
• A type of non-linear regression/classification
• An ensemble method
– Although not usually thought of as such
• A black box!
Inspiration from Biology
• Information processing inspired by biological nervous systems
Structure of the nervous system:
A large number of neurons (information processing units) connected together
A neuron’s response depends on the states of the other neurons it is connected to and on the ‘strength’ of those connections.
The ‘strengths’ are learned based on experience.
From Real to Artificial
Nodes: A Closer Look
[Figure: a single node. Input values x1, x2, …, xm are multiplied by weights w1, w2, …, wm and combined by a summing function ∑; a bias b is added, and an activation function ϕ(·) produces the output y.]
A node (neuron) is the basic information processing unit of a neural net. It has:
• A set of inputs with weights w1, w2, …, wm, along with a default input called the bias
• An adder function (linear combiner) that computes the weighted sum of the inputs:
$$v = \sum_{j=1}^{m} w_j x_j$$
• An activation function ϕ (squashing function) that transforms v, usually non-linearly:
$$y = \varphi(v + b)$$
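As a rough illustration of the node just described, the sketch below computes v and y for one node in Python. The function name `node_output` and all the numbers are made up for this example; they do not come from the lecture.

```python
import math

def node_output(inputs, weights, bias, activation):
    """Compute y = activation(v + b), where v is the weighted sum of the inputs."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    v = sum(w * x for w, x in zip(weights, inputs))  # adder (linear combiner)
    return activation(v + bias)                      # squashing step

# Hypothetical example values
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 1.0

y_identity = node_output(x, w, b, lambda t: t)  # identity: output is just v + b
y_tanh = node_output(x, w, b, math.tanh)        # non-linear squashing of the same sum
```

Swapping the `activation` argument is the only change needed to move between a linear node and a non-linear one.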
A Simple Node: A Perceptron
A simple activation function: A signing threshold
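[Figure: a perceptron. Inputs x1, x2, …, xn with weights w1, w2, …, wn and a bias b are summed to give v, which is passed through the threshold to give the output y(v).]
$$\varphi(v) = \begin{cases} +1 & \text{if } v \ge 0 \\ -1 & \text{if } v < 0 \end{cases}$$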
Common Activation Functions
• Step function
• Sigmoid (logistic) function
$$\varphi(v) = \frac{e^{v}}{1 + e^{v}} = \frac{1}{1 + e^{-v}}$$
• Hyperbolic Tangent (Tanh) function
$$\tanh(v) = \frac{e^{v} - e^{-v}}{e^{v} + e^{-v}}$$
• The s-shape adds non-linearity
[Hornik (1989)]: combining many of these simple functions is sufficient to approximate any continuous non-linear function arbitrarily well over a compact interval.
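The three activation functions listed above can be sketched in a few lines of Python. This is only an illustration: the function names are chosen here, and the two-branch form of the sigmoid is an implementation choice to avoid overflow, not something taken from the slides.

```python
import math

def step(v):
    """Sign threshold: +1 if v >= 0, otherwise -1."""
    return 1.0 if v >= 0 else -1.0

def sigmoid(v):
    """Logistic function 1 / (1 + e^-v), written so that exp never overflows."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    ev = math.exp(v)
    return ev / (1.0 + ev)

def tanh(v):
    """Hyperbolic tangent (e^v - e^-v) / (e^v + e^-v)."""
    return math.tanh(v)

for v in (-2.0, 0.0, 2.0):
    print(v, step(v), round(sigmoid(v), 4), round(tanh(v), 4))
```

The two s-shaped curves are closely related: tanh(v) = 2·sigmoid(2v) − 1, so tanh is a rescaled sigmoid with outputs in (−1, 1) instead of (0, 1).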
Neural Network: Architecture
• Big idea: a combination of simple non-linear models working together to model a complex function
• How many layers? Nodes? What is the function?
– Magic
– Luckily, defaults do well
[Figure: network diagram with an input layer, hidden layer(s), and an output layer.]
Neural Networks: The Model
• Model has two components
– A particular architecture
  • Number of hidden layers
  • Number of nodes in the input, output and hidden layers
  • Specification of the activation function(s)
– The associated set of weights
• Weights and complexity are “learned” from the data
– Supervised learning, applied iteratively
– Out-of-sample methods; Cross-validation
Fitting a Neural Net: Feed Forward
• Supply attribute values at input nodes
• Obtain predictions from the output node(s)
– Predicting classes
  • Two classes – single output node with threshold
  • Multiple classes – use multiple outputs, one for each class
  • Predicted class = output node with highest value
Multiple class problems are one of the main uses of NN!
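A feed-forward pass for a multi-class problem might look like the following Python sketch. The network shape, the weights, and the helper names (`layer_forward`, `feed_forward`, `predict_class`) are all invented for illustration, and a sigmoid is assumed at every node.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def layer_forward(inputs, weights, biases):
    """One layer: each row of `weights` holds the incoming weights of one node."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def feed_forward(inputs, layers):
    """Pass the attribute values through every layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

def predict_class(outputs):
    """Predicted class = index of the output node with the highest value."""
    return max(range(len(outputs)), key=lambda k: outputs[k])

# Hypothetical network: 2 inputs, 2 hidden nodes, 3 output nodes (one per class)
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -0.5]),                   # hidden layer
    ([[2.0, -1.0], [-1.0, 2.0], [0.0, 0.0]], [0.0, 0.0, 0.1]),  # output layer
]
outputs = feed_forward([3.0, 0.0], layers)
predicted = predict_class(outputs)
```

For a two-class problem the same code applies with a single output node, replacing `predict_class` by a threshold on that one value.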
A Simple NN: Regression
• A one-node neural network:
– Called a ‘perceptron’
– Use identity function as the activation function
– What’s the output?
  • Weighted sum of inputs
Logistic regression just changes the activation function to the logistic function
[Figure: a perceptron with inputs x1, x2, …, xn, weights w1, w2, …, wn and bias b, producing v and the output y(v).]
Training a NN: What does it learn?
• It fits/learns the weights that best translate inputs into outputs given its architecture
• Hidden units can be thought of as learning higher-order regularities or features of the inputs that can be used to predict outputs.
“Multi-layer perceptron”
Perceptron Training Rule
Perceptron = Adder + Threshold
1. Start with a random set of small weights.
2. Calculate the output for an example.
3. Change each weight by an amount proportional to the difference between the desired output and the actual output:
$$\Delta W_i = \eta \, (D - Y) \, I_i$$
where η is the learning rate (step size), D is the desired output, Y is the actual output, and I_i is the i-th input.
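The training rule can be sketched in Python as below. The dataset (logical AND with targets in {−1, +1}), the learning rate, the number of passes, and the treatment of the bias as a weight on a constant input of 1 are all assumptions made for this example.

```python
import random

def threshold(v):
    """Sign threshold used by the perceptron."""
    return 1 if v >= 0 else -1

def predict(weights, bias, inputs):
    return threshold(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_perceptron(examples, eta=0.1, epochs=100, seed=0):
    """Apply delta W_i = eta * (D - Y) * I_i to each example in turn."""
    rng = random.Random(seed)
    n = len(examples[0][0])
    weights = [rng.uniform(-0.05, 0.05) for _ in range(n)]  # small random start
    bias = 0.0
    for _ in range(epochs):
        for inputs, desired in examples:
            actual = predict(weights, bias, inputs)
            error = desired - actual      # zero when the example is already right
            for i in range(n):
                weights[i] += eta * error * inputs[i]
            bias += eta * error           # bias acts as a weight on a constant input of 1
    return weights, bias

# Made-up, linearly separable data: logical AND
data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
w, b = train_perceptron(data)
```

Because the data here are linearly separable, the updates stop changing the weights once every example is classified correctly; on non-separable data this rule keeps oscillating.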
Training NNs: Back Propagation
• How to train a neural net (find the optimal weights):
– Present a training sample to the neural network.
– Calculate the error in each output neuron.
– For each neuron, calculate what the output should have been, and a scaling factor, how much lower or higher the output must be adjusted to match the desired output. This is the local error.
– Adjust the weights of each neuron to lower the local error.
– Assign "blame" for the local error to neurons at the previous level, giving greater responsibility to neurons connected by stronger weights.
– Repeat on the neurons at the previous level, using each one's "blame" as its error.
• This ‘propagates’ the error backward. The sequence of forward and backward passes is called ‘back propagation’.
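The steps above can be sketched for a tiny network with one hidden layer and one output neuron. This Python example assumes sigmoid activations and a squared-error loss, neither of which is specified on the slide; the parameter layout (`W1`, `b1`, `W2`, `b2`) and the starting weights are invented for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, p):
    """Forward pass through one hidden layer and a single output neuron."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(p["W1"], p["b1"])]
    y = sigmoid(sum(w * hj for w, hj in zip(p["W2"], h)) + p["b2"])
    return h, y

def backprop_step(x, target, p, eta=0.5):
    """One forward and one backward pass for the error 0.5 * (y - target)^2.
    Updates p in place and returns the error measured before the update."""
    h, y = forward(x, p)
    # Local error at the output neuron
    delta_out = (y - target) * y * (1.0 - y)
    # "Blame" for each hidden neuron, scaled by the weight connecting it to the output
    delta_hidden = [delta_out * p["W2"][j] * h[j] * (1.0 - h[j]) for j in range(len(h))]
    for j in range(len(h)):
        p["W2"][j] -= eta * delta_out * h[j]
        for i in range(len(x)):
            p["W1"][j][i] -= eta * delta_hidden[j] * x[i]
        p["b1"][j] -= eta * delta_hidden[j]
    p["b2"] -= eta * delta_out
    return 0.5 * (y - target) ** 2

# Hypothetical starting weights and a single training sample presented repeatedly
params = {"W1": [[0.1, -0.2], [0.3, 0.4]], "b1": [0.0, 0.0], "W2": [0.2, -0.1], "b2": 0.0}
errors = [backprop_step([1.0, 0.0], 1.0, params) for _ in range(200)]
```

Note that the hidden-layer blame is computed from the output weights as they were before the update, so both layers are adjusted using the same forward pass.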
Training NNs: How to do it
• A “Gradient Descent” algorithm is typically used to fit the weights during back propagation
• You can imagine a surface in an n-dimensional space such that
– Each dimension is a weight
– Each point in this space is a particular combination of weights
– Each point on the “surface” is the output error that corresponds to that combination of weights
– You want to minimize error, i.e. find the “valleys” on this surface
– Note the potential for ‘local minima’
Training NNs: Gradient Descent
• Find the gradient in each direction:
$$\frac{\partial \text{Error}}{\partial w_i}$$
• Moving according to these gradients results in the move of ‘steepest descent’
• Note potential problem with ‘local minima’.
Gradient Descent
• Direction of steepest descent can be found mathematically or via computational estimation
Via A. Moore
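To make the "computational estimation" route concrete, here is a small Python sketch that estimates each partial derivative with central differences and then steps downhill. The error surface `toy_error`, the step size, and the iteration count are made up; a real network would plug in its own error function.

```python
def numerical_gradient(error, weights, h=1e-6):
    """Estimate dError/dw_i for every weight with central differences."""
    grad = []
    for i in range(len(weights)):
        up = list(weights)
        up[i] += h
        down = list(weights)
        down[i] -= h
        grad.append((error(up) - error(down)) / (2 * h))
    return grad

def gradient_descent(error, weights, eta=0.1, steps=200):
    """Repeatedly move each weight a small step against its gradient."""
    w = list(weights)
    for _ in range(steps):
        g = numerical_gradient(error, w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# Made-up bowl-shaped error surface with its single minimum at (3, -1)
def toy_error(w):
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2

w_min = gradient_descent(toy_error, [0.0, 0.0])
```

On a surface with several valleys, the same procedure simply stops in whichever valley lies downhill from the starting point, which is the local-minima problem noted above.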
Neural Nets: Strengths
• Can model very complex functions, very accurately – non-linearity is built into the model
• Handles noisy data quite well
• Provides fast predictions
• Good for multiple category problems
– Many-class classification
– Image detection
– Speech recognition
– Financial models
• Good for multiple stage problems
Neural Nets: Weaknesses
• A black-box. Hard to explain or gain intuition.
• For complex problems, training time could be quite high
• Many, many training parameters
– Layers, neurons per layer, output layers, bias, training algorithms, learning rate
• Highly prone to overfitting
– The balance between complexity and parsimony can be learned through cross-validation
Example: Face Detection
Architecture of the complete system: they use another neural net to estimate orientation of the face, then rectify it. They search over scales to find bigger/smaller faces.
Figure from “Rotation invariant neural-network based face detection,” H.A. Rowley, S. Baluja and T. Kanade, Proc. Computer Vision and Pattern Recognition, 1998, copyright 1998, IEEE
Rowley, Baluja and Kanade’s (1998)
Image Size: 20 x 20
Input Layer: 400 units
Hidden Layer: 15 units
Neural Nets: Face Detection
Goal: detect “face or no face”
Face Detection: Results
Face Detection Results: A Few Misses
Neural Nets
• Face detection in action
• For more:
– See Hastie, et al Chapter 11
• R packages
– Basic: nnet
– Better: AMORE
Support Vector Machines
SVM
• Classification technique
• Start with a BIG assumption
– The classes can be separated linearly
Linear Classifiers
[Figure: x → f → yest; a scatter plot of datapoints, with one marker denoting +1 and the other denoting -1, shown with several candidate separating lines.]
f(x,w,b) = sign(w . x - b)
How would you classify this data?
Any of these would be fine..
..but which is best?
Classifier Margin
[Figure: x → f → yest; the same datapoints (+1 and -1) with the boundary widened until it touches a datapoint.]
f(x,w,b) = sign(w . x - b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
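A minimal Python sketch of a linear classifier and its margin follows. It assumes the boundary is widened symmetrically on both sides, so the margin is twice the distance from the boundary to the nearest datapoint; the data and the hand-picked w and b are made up.

```python
import math

def classify(x, w, b):
    """f(x, w, b) = sign(w . x - b), returning +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - b
    return 1 if score >= 0 else -1

def margin(points, w, b):
    """Width the boundary can grow to (symmetrically) before touching a datapoint:
    twice the distance from the boundary to the nearest point."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    nearest = min(abs(sum(wi * xi for wi, xi in zip(w, x)) - b) / norm for x in points)
    return 2.0 * nearest

# Made-up 2-D data and a hand-picked boundary x1 + x2 = 3
points = [(0.0, 0.0), (1.0, 1.0), (3.0, 3.0), (4.0, 2.0)]
labels = [-1, -1, 1, 1]
w, b = (1.0, 1.0), 3.0

predictions = [classify(x, w, b) for x in points]
width = margin(points, w, b)
```

Different (w, b) pairs that classify the data equally well will generally give different values of `width`, which is what makes the margin a useful tie-breaker.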
Maximum Margin
[Figure: x → f → yest; the datapoints (+1 and -1) with the maximum margin boundary drawn.]
f(x,w,b) = sign(w . x - b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM (called an LSVM, a Linear SVM).
Support Vectors are those datapoints that the margin pushes up against.
Why Maximum Margin?
1. Intuitively this feels safest.
2. If we’ve made a small error in the location of the boundary (it’s been jolted in its perpendicular direction) this gives us least chance of causing a misclassification.
3. LOOCV is easy since the model is immune to removal of any non-support-vector datapoints.
4. There’s some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very very well.
Specifying a line and margin
• How do we represent this mathematically?
• …in m input dimensions?
[Figure: the classifier boundary lies midway between the plus-plane and the minus-plane, with a “Predict Class = +1” zone on one side and a “Predict Class = -1” zone on the other.]
• Plus-plane = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
Classify as..
+1 if w . x + b >= 1
-1 if w . x + b <= -1
Universe explodes if -1 < w . x + b < 1
[Figure: the plus-plane w.x+b=1, the classifier boundary w.x+b=0 and the minus-plane w.x+b=-1, with the “Predict Class = +1” and “Predict Class = -1” zones on either side.]
Computing the margin width
• Plus-plane = { x : w . x + b = +1 }
• Minus-plane = { x : w . x + b = -1 }
Claim: The vector w is perpendicular to the Plus Plane.
M = Margin Width
How do we compute M in terms of w and b?
• Let x- be any point on the minus plane
• Let x+ be the closest plus-plane-point to x-.
(x- can be any location in R^m: not necessarily a datapoint)
• Claim: x+ = x- + λ w for some value of λ. Why?
The line from x- to x+ is perpendicular to the planes.
So to get from x- to x+ travel some distance in direction w.
What we know:
• w . x+ + b = +1
• w . x- + b = -1
• x+ = x- + λ w
• |x+ - x- | = M
It’s now easy to get M in terms of w and b:
$$w \cdot (x^- + \lambda w) + b = 1$$
$$\Rightarrow\; w \cdot x^- + b + \lambda\, w \cdot w = 1$$
$$\Rightarrow\; -1 + \lambda\, w \cdot w = 1$$
$$\Rightarrow\; \lambda = \frac{2}{w \cdot w}$$
Substituting λ = 2 / (w . w) into |x+ - x- | = M gives the margin width:
$$M = |x^+ - x^-| = |\lambda w| = \lambda \sqrt{w \cdot w} = \frac{2\sqrt{w \cdot w}}{w \cdot w} = \frac{2}{\sqrt{w \cdot w}}$$
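The derivation can be checked numerically. The Python sketch below picks an arbitrary w and b (made-up values), constructs a point x- on the minus-plane, steps to x+ using λ = 2 / (w . w), and compares the measured distance with the closed form 2 / sqrt(w . w).

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Made-up weight vector and offset in two dimensions
w = [3.0, 4.0]
b = -2.0

# One point on the minus-plane: x- = t * w, with t chosen so that w . x- + b = -1
t = (-1.0 - b) / dot(w, w)
x_minus = [t * wi for wi in w]

# lambda = 2 / (w . w), and x+ = x- + lambda * w
lam = 2.0 / dot(w, w)
x_plus = [xm + lam * wi for xm, wi in zip(x_minus, w)]

# Margin width measured directly, and from the closed form 2 / sqrt(w . w)
M_measured = math.sqrt(sum((p - m) ** 2 for p, m in zip(x_plus, x_minus)))
M_formula = 2.0 / math.sqrt(dot(w, w))
```

Here w . w = 25, so both routes give a margin of 0.4; note that b moves the planes but does not change their separation.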
Learning the Maximum Margin Classifier
Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin
Search the space of w’s and b’s to find the widest margin that matches all the datapoints.
$$M = \text{Margin Width} = \frac{2}{\sqrt{w \cdot w}}$$
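As a deliberately naive illustration of this search, the Python sketch below tries every (w, b) on a coarse grid, keeps the guesses that put all points in the correct half-planes, and returns the one with the widest margin. The dataset and the grid are made up, and brute force is used here only because it mirrors the description; it is not how SVMs are fitted in practice.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def all_correct(points, labels, w, b):
    """True when every point lies on or beyond its own plane: y * (w . x + b) >= 1."""
    return all(y * (dot(w, x) + b) >= 1.0 for x, y in zip(points, labels))

def margin_width(w):
    return 2.0 / math.sqrt(dot(w, w))

def grid_search(points, labels, values):
    """Try every (w1, w2, b) on the grid; keep the feasible guess with the widest margin."""
    best = None
    for w1 in values:
        for w2 in values:
            if w1 == 0.0 and w2 == 0.0:
                continue  # w = 0 does not define a plane
            for b in values:
                w = (w1, w2)
                if all_correct(points, labels, w, b):
                    if best is None or margin_width(w) > margin_width(best[0]):
                        best = (w, b)
    return best

# Made-up, linearly separable 2-D data
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 1.0)]
labels = [-1, -1, 1, 1]
grid = [k / 4.0 for k in range(-16, 17)]   # -4.0, -3.75, ..., 4.0
best_w, best_b = grid_search(points, labels, grid)
```

For this data the widest feasible guess on the grid is w = (1, 0), b = -2, a vertical boundary at x1 = 2 with margin 2, touching the points (1, 0) and (3, 0).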
Uh-oh!
[Figure: datapoints of the two classes (+1 and -1) that cannot be separated by a line.]
This is going to be a problem!
What should we do?
Idea 1:
Find minimum w.w, while minimizing number of training set errors.
Problemette: Two things to minimize makes for an ill-defined optimization
Idea 1.1:
Minimize
w.w + C (#train errors)
where C is a tradeoff parameter.
And: Use a trick
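The quantity in Idea 1.1 is easy to evaluate for a given guess. In this Python sketch a "training error" is taken to mean a point on the wrong side of the boundary w . x + b = 0, which is one reading of the slide; the 1-D data and the value of C are made up.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def count_train_errors(points, labels, w, b):
    """Number of training points on the wrong side of the boundary w . x + b = 0."""
    return sum(1 for x, y in zip(points, labels) if y * (dot(w, x) + b) <= 0)

def objective(points, labels, w, b, C):
    """Idea 1.1: w . w + C * (#train errors), with C the tradeoff parameter."""
    return dot(w, w) + C * count_train_errors(points, labels, w, b)

# Made-up 1-D data that no single threshold separates
points = [(-2.0,), (-1.0,), (0.5,), (1.0,), (2.0,)]
labels = [-1, 1, -1, 1, 1]

cost_a = objective(points, labels, (1.0,), 0.0, C=10.0)   # boundary at x = 0: two errors
cost_b = objective(points, labels, (1.0,), 1.5, C=10.0)   # boundary at x = -1.5: one error
```

A larger C makes each misclassified point more expensive relative to a narrow margin, so it pushes the minimizer toward fitting the training data more tightly.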
Suppose we’re in 1-dimension
What would SVMs do with this data?
[Figure: datapoints of the two classes along a line, with x=0 marked.]
Not a big surprise
[Figure: the same line with the positive “plane” and the negative “plane” marked on either side of the boundary.]
Harder 1-dimensional dataset
What can be done about this?
[Figure: datapoints along a line around x=0, where one class cannot be split from the other by a single threshold.]
Embed the data in a higher dimensional space:
$$z_k = (x_k, x_k^2)$$
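A small Python sketch of this embedding follows. The dataset is made up to resemble the "harder" case (one class sandwiched between two groups of the other), and the separating line in the embedded space is hand-picked rather than learned.

```python
def embed(x):
    """Map a 1-D point x_k to z_k = (x_k, x_k^2)."""
    return (x, x * x)

# Made-up dataset: the -1 class sits between two groups of the +1 class
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]

zs = [embed(x) for x in xs]

# In the embedded space the horizontal line z2 = 2 separates the classes:
# w = (0, 1), b = -2, classify with sign(w . z + b)
w, b = (0.0, 1.0), -2.0
predictions = [1 if w[0] * z[0] + w[1] * z[1] + b >= 0 else -1 for z in zs]
```

No threshold on x alone can reproduce these labels, but a straight line in the (x, x²) plane does, and mapped back to the original axis it corresponds to the two cut points x = ±√2.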
SVM Kernel Functions
• Embedding the data in a higher dimensional space where it is separable, while only ever computing inner products there through a kernel function, is called the “kernel trick”
• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function
– Radial-Basis-style Kernel Function:
$$K(a, b) = \exp\left(-\frac{(a - b)^2}{2\sigma^2}\right)$$
– Neural-net-style Kernel Function:
$$K(a, b) = \tanh(\kappa\, a \cdot b - \delta)$$
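The two kernels above, plus a quadratic kernel that shows why a kernel can stand in for an explicit embedding, can be sketched in Python as follows. The default values of sigma, kappa and delta, the function names, and the 2-D quadratic example are assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def rbf_kernel(a, b, sigma=1.0):
    """Radial-basis-style kernel: exp(-|a - b|^2 / (2 sigma^2))."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def tanh_kernel(a, b, kappa=1.0, delta=0.0):
    """Neural-net-style kernel: tanh(kappa * a . b - delta)."""
    return math.tanh(kappa * dot(a, b) - delta)

def quadratic_kernel(a, b):
    """Polynomial kernel (a . b)^2, computed without leaving the original space."""
    return dot(a, b) ** 2

def quadratic_features(a):
    """Explicit embedding of a 2-D point whose inner product equals (a . b)^2."""
    return (a[0] ** 2, math.sqrt(2.0) * a[0] * a[1], a[1] ** 2)

a, b = (1.0, 2.0), (3.0, -1.0)
via_kernel = quadratic_kernel(a, b)
via_embedding = dot(quadratic_features(a), quadratic_features(b))
```

`via_kernel` and `via_embedding` agree, but the kernel route never builds the higher-dimensional vectors, which is what keeps very high dimensional basis functions practical.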
SVM Performance
• Trick: find linear boundaries in an enlarged space
– Translate to nonlinear boundaries in the original space
• Magic: for more details, see Hastie et al 12.3
• Anecdotally they work very very well indeed.
• Example: They are currently the best-known classifier on a well-studied hand-written-character recognition benchmark
• There is a lot of excitement and religious fervor about SVMs.
• Despite this, some practitioners are a little skeptical.
Doing multi-class classification
• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2).
• What can be done?
• Answer: with output arity N, learn N SVMs
– SVM 1 learns “Output==1” vs “Output != 1”
– SVM 2 learns “Output==2” vs “Output != 2”
– :
– SVM N learns “Output==N” vs “Output != N”
• Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
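The prediction step of this one-vs-rest scheme can be sketched in Python. The three linear models below are hand-picked stand-ins for trained SVMs, and "furthest into the positive region" is read here as the largest raw score w . x + b, which assumes the N models are on comparable scales.

```python
def score(model, x):
    """Signed score w . x + b; larger means further into the positive region."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict_one_vs_rest(models, x):
    """models[k] separates class k+1 from the rest; return the class whose model scores x highest."""
    scores = [score(m, x) for m in models]
    return 1 + max(range(len(scores)), key=lambda k: scores[k])

# Three hypothetical linear models standing in for trained SVMs
models = [
    ((1.0, 0.0), 0.0),    # "Output==1" vs rest: favors large x1
    ((0.0, 1.0), 0.0),    # "Output==2" vs rest: favors large x2
    ((-1.0, -1.0), 0.0),  # "Output==3" vs rest: favors small x1 and x2
]
new_inputs = [(2.0, 0.5), (0.1, 3.0), (-2.0, -1.0)]
predicted = [predict_one_vs_rest(models, x) for x in new_inputs]
```

If the models have very different weight norms, dividing each score by the length of its w turns the comparison into one between geometric distances from each boundary.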
References
• Hastie, et al Chapter 11 (NN); Chapter 12 (SVM)
• Andrew Moore notes on Neural nets
• Andrew Moore notes on SVM
• Wikipedia has very good pages on both topics
• An excellent tutorial on VC-dimension and Support Vector Machines by C. Burges.
– A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
• The SVM Bible: Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience; 1998