Classification III
Tamara Berg
CS 560 Artificial Intelligence
Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer, Rob Pless, Killian Weinberger, Deva Ramanan
Discriminant Function
• It can be an arbitrary function of x, such as:
– Nearest neighbor
– Decision tree
– Linear functions
Linear classifier
• Find a linear function to separate the classes
f(x) = sgn(w1x1 + w2x2 + … + wDxD) = sgn(w · x)
Perceptron

[Figure: inputs x1, x2, x3, …, xD with weights w1, w2, w3, …, wD feeding a single unit]

Output: sgn(w · x + b)
We can incorporate the bias as a component of the weight vector by always including a feature with value fixed to 1.
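As a minimal sketch (illustrative code, not from the slides; the function names are my own), the bias trick amounts to appending a feature with value 1 to every input:

```python
def sgn(t):
    # sgn as used on the slide: map the activation to a +1 / -1 label
    return 1 if t >= 0 else -1

def predict(w, x):
    # w has D+1 components; its last component plays the role of the bias b,
    # because we append a feature with value fixed to 1 to every input x.
    x_aug = list(x) + [1.0]
    return sgn(sum(wi * xi for wi, xi in zip(w, x_aug)))
```

For example, predict([2.0, -1.0, 0.5], [1.0, 1.0]) computes sgn(2.0 − 1.0 + 0.5) = +1.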
Loose inspiration: Human neurons
Perceptron training algorithm
• Initialize weights
• Cycle through training examples in multiple passes (epochs)
• For each training example:
– If classified correctly, do nothing
– If classified incorrectly, update weights
Perceptron update rule
• For each training instance x with label y:
– Classify with current weights: y′ = sgn(w · x)
– Update weights (on a mistake): w ← w + α y x
– α is a learning rate that should decay as 1/t, e.g., 1000/(1000 + t)
– What happens if the answer is correct? (No update is made.)
– Otherwise, consider what happens to individual weights:
• If y = 1 and y′ = −1, wi is increased if xi is positive and decreased if xi is negative → w · x gets bigger
• If y = −1 and y′ = 1, wi is decreased if xi is positive and increased if xi is negative → w · x gets smaller
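The training algorithm and update rule above can be sketched as follows (an assumed, illustrative implementation; the 1000/(1000 + t) decay is taken from the slide, and updates are applied only on mistakes, matching the algorithm):

```python
def sgn(t):
    return 1 if t >= 0 else -1

def train_perceptron(data, epochs=100):
    # data: list of (x, y) pairs with y in {+1, -1}; x is a plain feature list
    dim = len(data[0][0])
    w = [0.0] * dim                        # zero initialization
    b = 0.0                                # explicit bias term
    t = 0
    for _ in range(epochs):
        for x, y in data:
            alpha = 1000.0 / (1000.0 + t)  # learning rate decaying as ~1/t
            t += 1
            y_pred = sgn(sum(wi * xi for wi, xi in zip(w, x)) + b)
            if y_pred != y:                # update only on mistakes
                w = [wi + alpha * y * xi for wi, xi in zip(w, x)]
                b += alpha * y
    return w, b
```

On a small linearly separable set, such as points labeled by AND, this loop converges to weights that classify every training example correctly.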
Implementation details
• Bias (add feature dimension with value fixed to 1) vs. no bias
• Initialization of weights: all zeros vs. random
• Number of epochs (passes through the training data)
• Order of cycling through training examples
Multi-class perceptrons
• Need to keep a weight vector wc for each class c
• Decision rule: predict argmaxc wc · x
• Update rule: suppose an example x from class c gets misclassified as c′
– Update for c: wc ← wc + α x
– Update for c′: wc′ ← wc′ − α x
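A sketch of the multi-class version (an assumed, illustrative implementation; a fixed learning rate α is used here for simplicity):

```python
def train_multiclass(data, num_classes, epochs=50, alpha=1.0):
    # One weight vector per class; predict argmax_c of w_c . x.
    dim = len(data[0][0])
    W = [[0.0] * dim for _ in range(num_classes)]

    def score(c, x):
        return sum(wi * xi for wi, xi in zip(W[c], x))

    for _ in range(epochs):
        for x, y in data:
            y_pred = max(range(num_classes), key=lambda c: score(c, x))
            if y_pred != y:
                # boost the true class, demote the wrongly predicted class
                W[y] = [wi + alpha * xi for wi, xi in zip(W[y], x)]
                W[y_pred] = [wi - alpha * xi for wi, xi in zip(W[y_pred], x)]
    return W
```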
Differentiable perceptron

[Figure: inputs x1, x2, x3, …, xd with weights w1, w2, w3, …, wd feeding a single unit]

Sigmoid function: σ(t) = 1 / (1 + e−t)

Output: σ(w · x + b)
Update rule for differentiable perceptron
• Define total classification error or loss on the training set:

E(w) = Σj (yj − f(xj))²,  where f(xj) = σ(w · xj)

• Update weights by gradient descent: w ← w − α ∂E/∂w, where

∂E/∂w = −2 Σj (yj − f(xj)) σ′(w · xj) xj = −2 Σj (yj − f(xj)) f(xj)(1 − f(xj)) xj

• For a single training point, the update is:

w ← w + α (y − f(x)) f(x)(1 − f(x)) x
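The single-point update can be sketched as (illustrative code; α = 0.5 is an arbitrary choice):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def gradient_step(w, x, y, alpha=0.5):
    # Single-example update from the slide:
    #   w <- w + alpha * (y - f) * f * (1 - f) * x,  where f = sigmoid(w . x)
    f = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    g = (y - f) * f * (1.0 - f)
    return [wi + alpha * g * xi for wi, xi in zip(w, x)]
```

Each step moves the output toward the target, so the squared error on that point decreases for a small enough α.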
Multi-Layer Neural Network
• Can learn nonlinear functions
• Training: find network weights to minimize the error between true and estimated labels of training examples:

E(f) = Σi (yi − f(xi))²

• Minimization can be done by gradient descent provided f is differentiable
– This training method is called back-propagation
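A minimal sketch of back-propagation for a two-layer network on XOR, a classic nonlinear function (an assumed, illustrative implementation: one hidden layer of two sigmoid units, per-example gradient descent on the squared error; the initial weights are arbitrary):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(net, x):
    W1, b1, W2, b2 = net
    # hidden activations, then a single sigmoid output unit
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    o = sigmoid(sum(w * hj for w, hj in zip(W2, h)) + b2)
    return h, o

def loss(net, data):
    # E(f) = sum_i (y_i - f(x_i))^2, as on the slide
    return sum((y - forward(net, x)[1]) ** 2 for x, y in data)

def train(net, data, epochs=2000, lr=0.5):
    W1, b1, W2, b2 = net
    for _ in range(epochs):
        for x, y in data:
            h, o = forward((W1, b1, W2, b2), x)
            d_o = (y - o) * o * (1.0 - o)  # output-layer delta
            # back-propagate through the (pre-update) output weights
            d_h = [d_o * W2[j] * h[j] * (1.0 - h[j]) for j in range(len(h))]
            W2 = [W2[j] + lr * d_o * h[j] for j in range(len(h))]
            b2 += lr * d_o
            for j in range(len(h)):
                W1[j] = [W1[j][i] + lr * d_h[j] * x[i] for i in range(len(x))]
                b1[j] += lr * d_h[j]
    return W1, b1, W2, b2
```

Training drives the loss down from its initial value, which is the behavior back-propagation is meant to deliver.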
Deep convolutional neural networks
Zeiler, M., and Fergus, R. Visualizing and Understanding Convolutional Neural Networks, tech report, 2013.
Krizhevsky, A., Sutskever, I., and Hinton, G.E. ImageNet classification with deep convolutional neural networks. NIPS, 2012.
Linear classifier
• Find a linear function to separate the classes
f(x) = sgn(w1x1 + w2x2 + … + wDxD) = sgn(w · x)
Linear Discriminant Function
• f(x) is a linear function: f(x) = wT x + b

[Figure: +1 and −1 points in the (x1, x2) plane separated by the hyperplane wT x + b = 0, with wT x + b > 0 on one side and wT x + b < 0 on the other]

A hyperplane in the feature space
Linear Discriminant Function
• How would you classify these points using a linear discriminant function in order to minimize the error rate?

[Figure: +1 and −1 points in the (x1, x2) plane with a candidate separating line]

Infinite number of answers!
Linear Discriminant Function
• How would you classify these points using a linear discriminant function in order to minimize the error rate?

[Figure: the same points with another candidate separating line]

Infinite number of answers!
Linear Discriminant Function
• How would you classify these points using a linear discriminant function in order to minimize the error rate?

[Figure: the same points with yet another candidate separating line]

Infinite number of answers!
Linear Discriminant Function
• How would you classify these points using a linear discriminant function in order to minimize the error rate?

[Figure: the same points with several candidate separating lines]

Infinite number of answers!
Which one is the best?
Large Margin Linear Classifier
• The linear discriminant function (classifier) with the maximum margin is the best
• Margin is defined as the width that the boundary could be increased by before hitting a data point
• Why is it the best? Strong generalization ability

[Figure: points in the (x1, x2) plane with a separating line and its margin ("safe zone"); label: Linear SVM]
Large Margin Linear Classifier

[Figure: the margin is bounded by the lines wT x + b = 1 and wT x + b = −1 around the boundary wT x + b = 0; the points x+ and x− lying on these lines are the support vectors]
Support vector machines
• Find the hyperplane that maximizes the margin between the positive and negative examples

Positive examples (yi = 1): w · xi + b ≥ 1
Negative examples (yi = −1): w · xi + b ≤ −1

Distance between a point and the hyperplane: |w · x + b| / ||w||

For support vectors, w · x + b = ±1

Therefore, the margin is 2 / ||w||

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
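The distance and margin formulas above can be checked numerically (illustrative helper functions, not from the slides):

```python
import math

def distance_to_hyperplane(w, b, x):
    # |w . x + b| / ||w||, the point-to-hyperplane distance from the slide
    num = abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return num / math.sqrt(sum(wi * wi for wi in w))

def margin(w):
    # 2 / ||w|| for a canonical max-margin hyperplane
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))
```

For w = [3, 4] and b = 0, the point [1, 0] is at distance 3/5 = 0.6 from the hyperplane, and the margin is 2/5 = 0.4.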
Finding the maximum margin hyperplane
1. Maximize margin 2 / ||w||
2. Correctly classify all training data: yi (w · xi + b) ≥ 1

Quadratic optimization problem:

min over w, b of (1/2) ||w||²  subject to  yi (w · xi + b) ≥ 1

C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
Solving the Optimization Problem

The linear discriminant function is:

f(x) = Σi∈SV αi yi (xiT x) + b

Notice that it relies on a dot product between the test point x and the support vectors xi
Linear separability
Non-linear SVMs: Feature Space
• General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)
Slide courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
Nonlinear SVMs: The Kernel Trick
With this mapping, our discriminant function becomes:

g(x) = Σi∈SV αi yi φ(xi)T φ(x) + b

No need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing.

A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:

K(xi, xj) = φ(xi)T φ(xj)
Nonlinear SVMs: The Kernel Trick
Examples of commonly used kernel functions:

Linear kernel: K(xi, xj) = xiT xj

Polynomial kernel: K(xi, xj) = (1 + xiT xj)^p

Gaussian (Radial Basis Function, RBF) kernel: K(xi, xj) = exp(−||xi − xj||² / (2σ²))

Sigmoid: K(xi, xj) = tanh(β0 xiT xj + β1)
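Three of these kernels can be written directly (illustrative implementations; the parameter defaults are arbitrary):

```python
import math

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def poly_kernel(x, z, p=2):
    # (1 + x . z)^p
    return (1.0 + linear_kernel(x, z)) ** p

def rbf_kernel(x, z, sigma=1.0):
    # exp(-||x - z||^2 / (2 sigma^2))
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))
```

For p = 2 and one-dimensional inputs, (1 + xz)² = 1 + 2xz + x²z², which is exactly the dot product of φ(t) = (1, √2·t, t²): the kernel computes a dot product in an expanded feature space without ever forming φ.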
Support Vector Machine: Algorithm
1. Choose a kernel function
2. Choose a value for C and any other parameters (e.g. σ)
3. Solve the quadratic programming problem (many software packages available)
4. Classify held out validation instances using the learned model
5. Select the best learned model based on validation accuracy
6. Classify test instances using the final selected model
Some Issues
• Choice of kernel
– A Gaussian or polynomial kernel is the default
– If these prove ineffective, more elaborate kernels are needed
– Domain experts can assist in formulating appropriate similarity measures
• Choice of kernel parameters
– e.g. σ in the Gaussian kernel
– In the absence of reliable criteria, applications rely on a validation set or cross-validation to set such parameters

This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt
Summary: Support Vector Machine
1. Large margin classifier
– Better generalization ability & less over-fitting
2. The kernel trick
– Map data points to a higher-dimensional space in order to make them linearly separable
– Since only the dot product is needed, we do not need to represent the mapping explicitly
SVMs in Computer Vision
Detection
• We slide a window over the image
• Extract features for each window
• Classify each window into pos/neg (+1 / −1)

[Figure: pipeline x → F(x) → y, mapping a window to features and then to a label]
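The window-sliding step can be sketched as a generator over top-left corners (an assumed, illustrative implementation; window size and stride are free parameters):

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    # Yield the top-left corner (x, y) of every window that fits inside
    # an img_w x img_h image, stepping by `stride` in each direction.
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield (x, y)
```

For example, sliding_windows(4, 4, 2, 2, 2) yields (0, 0), (2, 0), (0, 2), (2, 2); each window would then be featurized and classified as pos/neg.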
Sliding Window Detection
Representation