Discriminative & Deep Learning for
Big Data
Andrew Zisserman
Visual Geometry Group
University of Oxford
http://www.robots.ox.ac.uk/~vgg
AIMS-CDT Michaelmas 2019
Overview
Two aspects:
1. Learning from Big Data:
• Supervised and discriminative learning: provides training data
for large capacity models
• e.g. for data hungry deep neural networks
2. Searching and modelling Big Data:
• Need efficient (fast and low memory) methods to address this
Course Outline
Mon: Discriminative Learning 1 [AZ]
• Introduction, NN, linear classifiers; regression, SVMs, loss functions
Tue: Discriminative Learning 2 and searching Big Data [AZ, RA]
• Multiple classes, large scale retrieval; ANN, LSH, PQ
• Practical 1: Image classification
Wed: Deep Learning 1 [AV]
• Neural networks, CNNs, back-prop, applications
Thu: Deep Learning 2 and Large Scale Learning [AV]
• Architectures, visualization, applications; large scale learning
• Practical 2: CNNs
Mon 9th: Seminar by Alyosha Efros
• Applications of large scale matching in Computer Vision
Recommended book: general introduction
• Pattern Recognition and
Machine Learning
Christopher Bishop, Springer, 2006.
• Excellent on classification and
regression
Textbooks 2: also general
• Elements of Statistical
Learning
Hastie, Tibshirani, Friedman,
Springer, 2009, second edition
• Good explanation of algorithms
• pdf available online
Textbooks 3: specialized for SVMs and kernels
• Learning with Kernels
Bernhard Schölkopf and Alexander J. Smola
MIT Press, 2002.
Textbooks 4: specialized for deep learning
• Deep Learning
Goodfellow, Bengio, Courville
MIT Press, 2016
• html version available online
• http://www.deeplearningbook.org
Introduction: What is Machine Learning?
Algorithms that can improve their performance using
training data
• Typically the algorithm has a (large) number of
parameters whose values are learnt from the data
• Can be applied in situations where it is very challenging
(= impossible) to define rules by hand, e.g.:
• Face recognition
• Speech recognition
• Stock prediction
Example 1: hand-written digit recognition
Images are 28 x 28 pixels
How to proceed …
As a supervised classification problem
Start with training data, e.g. 6000 examples of each digit
• Can achieve testing error of 0.4%
• One of the first commercial and widely used ML systems (for zip codes & checks)
Example 2: Face detection
• Again, a supervised classification problem
• Need to classify an image window into three classes:
• non-face
• frontal-face
• profile-face
Classifier is learnt from labelled data
Training data for frontal faces
• 5000 faces
All near frontal
Age, race, gender, lighting
• 10^8 non-faces
• faces are normalized
scale, translation
Example 3: Spam detection
• This is a classification problem
• Task is to classify email into spam/non-spam
• Data x_i is word count, e.g. of viagra, outperform, “you may be
surprized to be contacted” …
• Requires a learning system as “enemy” keeps innovating
Example 4: Stock price prediction
• Task is to predict stock price at future date
• This is a regression task, as the output is continuous
Example 5: Computational biology
Protein Structure and Disulfide Bridges
Protein: 1IMT
AVITGACERDLQCG
KGTCCAVSLWIKSV
RVCTPVGTSGEDCH
PASHKIPFSGQRMH
HTCPCAPNLACVQT
SPKKFKCLSK
Regression task: given sequence predict 3D structure
Example 6: Colourization
• input: greyscale image (L); output: colour image (ab)
• “Free” supervisory signal: the colour image itself
Regression task: predict pixel colour from a monochrome input
Colourization examples
Train network to predict pixel colour from a monochrome input
Colorful Image Colorization, Zhang et al., ECCV 2016
Example 7: Machine translation
What is the anticipated cost of collecting fees under the new proposal?
En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?
[Figure: word-level alignment between the English sentence (x) and its French translation (y)]
e.g. Google translate
Use of aligned text
What is the anticipated cost of collecting fees under the new proposal?
Google Translate (2019)
Example 8: Machine transcription – sequences
Use of aligned speech and text
• Automated speech recognition
Use of aligned images and text
• Text spotting and recognition
Use of aligned face video and text
• Automated lip reading
Lip reading sentences
Chung, Senior, Vinyals, Zisserman, 2017
Speech and natural language
https://www.skype.com/en/features/skype-translator/
Google Translate App
• Translate between 103 languages by typing
• Offline: Translate 52 languages when you have no Internet
• Instant camera translation: Use your camera to translate
text instantly in 30 languages
• Camera Mode: Take pictures of text for higher-quality
translations in 37 languages
• Conversation Mode: Two-way instant speech translation in
32 languages
https://play.google.com/store/apps/details?id=com.google.android.apps.translate&hl=en
See also: The Great AI Awakening (New York Times Magazine)
Slide credit: Lana Lazebnik
Lecture outline
• Classification
• Regression
• Overfitting
• Regularization and Loss functions
• Part II
• Linear classifiers
• Support Vector Machine (SVM)
• Logistic regression (LR)
Supervised Learning: Overview
Learning machine
Classification
• Suppose we are given a training set of N observations (x_1, y_1), ..., (x_N, y_N)
• Classification problem is to estimate f(x) from this data such that f(x_i) = y_i
K Nearest Neighbour (K-NN) Classifier
Algorithm
• For each test point, x, to be classified, find the K nearest
samples in the training data
• Classify the point, x, according to the majority vote of their
class labels
e.g. K = 3
• applicable to
multi-class case
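As an illustration (not from the slides), here is a minimal NumPy sketch of the K-NN classification rule; the function name and toy data are hypothetical:

```python
import numpy as np

def knn_classify(X_train, y_train, x, K=3):
    """Classify a point x by majority vote of its K nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean distance to every training point
    nearest = np.argsort(dists)[:K]                  # indices of the K closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                 # majority vote over their class labels

# Toy usage: two 2-D classes
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.2, 0.1]), K=3))   # -> 0
```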
K = 1
Voronoi diagram:
• partitions the space into regions
• boundaries are equal distance
from training points
Classification boundary:
• non-linear
A sampling assumption: training and test data
[Figure: training data (left) and testing data (right)]
• Assume that the training examples are drawn independently from the set of all
possible examples.
• This makes it very unlikely that a strong regularity in the training data will be absent in
the test data.
• Measure classification error with the 0-1 loss function: the “risk” is the fraction of misclassified points,
R = (1/N) Σ_i I[ f(x_i) ≠ y_i ]
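For concreteness (not from the slides), the empirical risk under the 0-1 loss is just the fraction of misclassified points:

```python
import numpy as np

def classification_error(y_true, y_pred):
    """Empirical risk with the 0-1 loss: fraction of points where f(x_i) != y_i."""
    return np.mean(y_true != y_pred)

y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([1, -1, -1, 1, -1])
print(classification_error(y_true, y_pred))   # 0.2
```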
K = 1
Training data: error = 0.0
Testing data: error = 0.15
K = 3
Training data: error = 0.0760
Testing data: error = 0.1340
Generalization
• The real aim of supervised learning is to do well on test data that is not known during learning
• Choosing the values for the parameters that minimize the loss function on the training data is not necessarily the best policy
• We want the learning machine to model the true regularities in the data and to ignore the noise in the data.
K = 1
Training data: error = 0.0
Testing data: error = 0.15
K = 3
Training data: error = 0.0760
Testing data: error = 0.1340
K = 7
Training data: error = 0.1320
Testing data: error = 0.1110
K = 21
Training data: error = 0.1120
Testing data: error = 0.0920
Properties and training
As K increases:
• Classification boundary becomes smoother
• Training error can increase
Choose (learn) K on a validation set
• Split training data into training and validation
• Hold out validation data and measure error on this
• Validation set acts as a proxy for the test set
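A hedged sketch of this selection procedure, reusing the toy knn_classify function sketched earlier; the split fraction and candidate K values are illustrative, not from the slides:

```python
import numpy as np

def choose_K(X, y, candidate_Ks=(1, 3, 5, 7, 21), val_fraction=0.3, seed=0):
    """Hold out part of the training data and pick the K with lowest validation error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(val_fraction * len(X))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    X_tr, y_tr, X_val, y_val = X[train_idx], y[train_idx], X[val_idx], y[val_idx]

    best_K, best_err = None, np.inf
    for K in candidate_Ks:
        preds = np.array([knn_classify(X_tr, y_tr, x, K) for x in X_val])
        err = np.mean(preds != y_val)     # validation error as a proxy for test error
        if err < best_err:
            best_K, best_err = K, err
    return best_K
```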
Example: hand written digit recognition
• MNIST data set
• Distance = raw pixel distance between images
• 60K training examples
• 10K testing examples
• K-NN gives 5% classification error
Summary
Advantages:
• K-NN is a simple but effective classification procedure
• Applies to multi-class classification
• Decision surfaces are non-linear
• Quality of predictions automatically improves with more training
data
• Only a single parameter, K; easily tuned by cross-validation
• Use as a baseline/debugging classifier
Summary
Disadvantages:
• What does nearest mean? Need to specify or learn a distance
metric.
• Computational cost: must store and search through the entire
training set at test time. Can alleviate this by thinning the training
set and by using efficient data structures such as approximate NN
Lecture outline
• Classification
• Regression
• Overfitting
• Regularization and Loss functions
• Part II
• Linear classifiers
• Support Vector Machine (SVM)
• Logistic regression (LR)
Regression
• Suppose we are given a training set of N observations
• Regression problem is to estimate y(x) from this data
K-NN Regression
Algorithm
• For each test point, x, find the K nearest samples xi in the
training data and their values yi
• Output is mean of their values
• Again, need to choose (learn) K
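A minimal sketch of K-NN regression along the same lines (illustrative, not from the slides):

```python
import numpy as np

def knn_regress(X_train, y_train, x, K=3):
    """Predict y at x as the mean of the y-values of the K nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:K]
    return y_train[nearest].mean()
```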
Example:
Real time head pose regression
Fanelli et al., DAGM 2011
Image Regression Examples
IM2GPS
“When was that made?”
Age estimation
Slide credit: Lana Lazebnik
Regression example: polynomial curve fitting
• The green curve is the true function (which is
not a polynomial)
• The data points are uniform in x but have
noise in y.
• We will use a loss function that measures the
squared error in the prediction of y(x) from x.
The loss for the red polynomial is the sum of
the squared vertical errors.
from Bishop
[Figure: polynomial regression — target values t_n and the fitted polynomial y(x, w); the loss is the sum of the squared vertical errors]
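A small sketch of the least-squares polynomial fit described above (illustrative; uses NumPy's built-in least-squares solver, and the toy data is an assumption):

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares fit of y(x, w) = w_0 + w_1 x + ... + w_M x^M to targets t."""
    Phi = np.vander(x, M + 1, increasing=True)   # design matrix: columns 1, x, ..., x^M
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # minimises the sum of squared vertical errors
    return w

# Toy data in the style of the figure: noisy samples of a smooth function
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)
w = fit_polynomial(x, t, M=3)
y_fit = np.vander(x, 4, increasing=True) @ w     # predictions of the fitted cubic
```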
Some fits to the data: which is best?
from Bishop
[Figure: polynomial fits of increasing order M; the highest-order fit over-fits the data]
Over-fitting
Root-Mean-Square (RMS) Error: E_RMS = sqrt( 2 E(w*) / N )
• test data: a different sample from the same true function
• training error goes to zero, but test error increases with M
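An illustrative sketch of this effect, reusing the fit_polynomial helper above; the noise level and sample sizes are assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Points uniform in x with noise added to y (same 'true' function for train and test)."""
    x = np.linspace(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)

def rms_error(w, x, t):
    y = np.vander(x, len(w), increasing=True) @ w
    return np.sqrt(np.mean((y - t) ** 2))

x_train, t_train = sample(10)
x_test, t_test = sample(100)        # a different sample from the same true function

for M in (0, 1, 3, 9):
    w = fit_polynomial(x_train, t_train, M)
    # training error shrinks with M, while test error eventually grows
    print(M, rms_error(w, x_train, t_train), rms_error(w, x_test, t_test))
```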
Trading off goodness of fit against model complexity
• If the model has as many degrees of freedom as the data, it
can fit the training data perfectly
• But the objective in ML is generalization
• Can expect a model to generalize well if it explains the training
data surprisingly well given the complexity of the model.
How to prevent over-fitting? I
• Add more data than the model “complexity”
• For 9th order polynomial:
[Figure: 9th order polynomial fitted to larger numbers of data points]
Polynomial Coefficients
[Table: coefficient values — the magnitudes become very large for the over-fitted model]
How to prevent over-fitting? II
• Regularization: penalize large coefficient values
E(w) = Σ_n (y(x_n, w) − t_n)² + λ ||w||²    (loss function + regularization)
• In practice use validation data to choose lambda (not test data)
• Convex cost function, closed form solution
• This is “weight decay” in deep learning
“ridge” regression
Polynomial Coefficients
[Table: coefficient values — with regularization the coefficient magnitudes stay small]
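A sketch of the closed-form ridge solution for the polynomial basis (illustrative; the regularized normal equations give w directly):

```python
import numpy as np

def fit_ridge(x, t, M, lam):
    """Ridge regression for a polynomial basis: minimise ||Phi w - t||^2 + lam * ||w||^2."""
    Phi = np.vander(x, M + 1, increasing=True)
    # Closed-form solution of the regularized normal equations:
    # w* = (Phi^T Phi + lam I)^(-1) Phi^T t
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)
```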
Summary Point: How to set parameters?
Use a validation set
Divide the total dataset into three subsets:
• Training data is used for learning the parameters of the model.
• Validation data is not used for learning, but is used to determine the hyper-parameters, e.g. for deciding the type of model and the amount of regularization.
• Test data is used to get a final, unbiased estimate of how well the learning machine works. Expect this estimate to be worse than on the validation data.
Can re-divide the total dataset to get another unbiased estimate of the true error rate.
• Again, need to control the complexity of the (discriminant)
function
Lecture outline
• Classification
• Regression
• Overfitting
• Regularization and Loss functions
• Part II
• Linear classifiers
• Support Vector Machine (SVM)
• Logistic regression (LR)
What comes next?
• Learning by optimizing a cost function:
E(w) = Σ_i loss(f(x_i; w), y_i) + λ R(w)    (loss function + regularization)
• In general
• choose loss function for: classification, regression, ranking, clustering …
• choose regularization function
The “Lasso” or L1 norm regularization
E(w) = Σ_i (y_i − w·x_i)² + λ ||w||_1    (loss function + regularization)
• This is a quadratic optimization problem
• There is a unique solution
• p-norm definition: ||w||_p = ( Σ_i |w_i|^p )^(1/p)
• LASSO = Least Absolute Shrinkage and Selection Operator
Sparsity property of the Lasso
• contour plots for d = 2
ridge regression lasso
• Minimum where contours of loss and regularizer are tangent
• For the lasso case, minima occur at “corners”
• Consequently one of the weights is zero
• In high dimensions many weights can be zero
regularization parameter λ
[Figures: coefficient values as a function of λ (percent of λ_max) for Ridge Regression and for L1-Regularized Least Squares]
Lasso in action
Sparse weight vectors
• Weights being zero is a method of “feature selection” –
zeroing out the unimportant features
• (The SVM classifier also has this property: sparse α_i in the dual representation)
• Ridge regression does not
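One standard way to minimise the lasso objective is proximal (soft-thresholding) gradient descent; the sketch below is illustrative and not from the slides, but it shows how the L1 penalty drives weights exactly to zero:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of the L1 norm: shrink towards zero and clip at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fit_lasso(X, y, lam, n_iters=1000):
    """ISTA: gradient step on the squared loss, then soft-threshold for the L1 penalty."""
    n, d = X.shape
    w = np.zeros(d)
    step = 1.0 / np.linalg.norm(X, 2) ** 2       # 1 / Lipschitz constant of the gradient
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)                 # gradient of 0.5 * ||Xw - y||^2
        w = soft_threshold(w - step * grad, step * lam)
    return w                                     # many entries are exactly zero for large lam
```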
More loss functions for regression
Part II Outline
• Linear classifiers
• Linear separability
• Support Vector Machine (SVM) classifier
• Wide margin
• Cost function
• Hard and soft margins
• Optimization
• Logistic Regression classifier
• Cost function
Linear Classifiers and the SVM
Binary Classification
Linear separability
[Figure: a linearly separable dataset and a dataset that is not linearly separable]
Linear classifiers
A linear classifier has the form f(x) = w^T x + b
• in 2D the discriminant is a line
• w is the normal to the line, and b the bias
• w is known as the weight vector
Linear classifiers
A linear classifier has the form f(x) = w^T x + b
• in 3D the discriminant is a plane, and in nD it is a hyperplane
For a K-NN classifier it was necessary to `carry’ the training data
For a linear classifier, the training data is used to learn w and then discarded
Only w is needed for classifying new data
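In code, prediction then needs only w and b (illustrative sketch, not from the slides):

```python
import numpy as np

def linear_classify(w, b, x):
    """Assign x to class +1 or -1 according to the sign of w.x + b."""
    return 1 if float(w @ x + b) > 0 else -1
```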
What is the best w?
• maximum margin solution: most stable under perturbations of the inputs
Interlude: Why should we care about linear classifiers?
Example: bicycle image classification
Pre-deep learning – non-linear classifiers
input → Feature Extractor → representation x → interpretation
Deep learning – train network for linear classification
input → CNN Feature Extractor → representation x → interpretation
• Optimize network with loss function for a linear classifier
• Learns to produce feature vectors x that are linearly separable
How to find the maximum margin?
Support Vector Machine (linearly separable data)
[Figure: maximum-margin hyperplane w^T x + b = 0 with the support vectors marked]
SVM – sketch derivation (linearly separable data)
[Figure: hyperplane w^T x + b = 0 with margin planes w^T x + b = +1 and w^T x + b = -1; the support vectors lie on the margin planes]
Margin = 2 / ||w||
SVM – Optimization
min_w ||w||²  subject to  y_i (w^T x_i + b) ≥ 1 for all i
Linear separability again: What is the best w?
• the points can be linearly separated but
there is a very narrow margin
• but possibly the large margin solution is
better, even though one constraint is violated
In general there is a trade off between the margin and the number of
mistakes on the training data
Margin violations and misclassifications
[Figure: hyperplane w^T x + b = 0 with margin planes w^T x + b = ±1, showing a margin violation and a misclassified point]
Penalize points according to their distance from the margin
Loss functions
• SVM uses the “hinge” loss: max(0, 1 − y_i f(x_i))
• in contrast to the 0-1 loss
“Soft” margin problem
E(w) = Σ_i max(0, 1 − y_i (w^T x_i + b)) + λ ||w||²    (loss function + regularization)
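A sketch of minimising this soft-margin cost by subgradient descent (illustrative, not from the slides; assumes labels y_i ∈ {−1, +1} and illustrative hyper-parameter values):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, n_epochs=200):
    """Minimise sum_i max(0, 1 - y_i (w.x_i + b)) + lam * ||w||^2 by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_epochs):
        margins = y * (X @ w + b)
        active = margins < 1                     # only margin violators contribute a subgradient
        grad_w = -(y[active][:, None] * X[active]).sum(axis=0) + 2 * lam * w
        grad_b = -y[active].sum()
        w -= lr * grad_w / n                     # dividing by n just rescales the step size
        b -= lr * grad_b / n
    return w, b
```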
Loss function
[Figure: hyperplane w^T x + b = 0 with support vectors and the loss function overlaid]
[Figure: scatter plot of feature x vs. feature y; file az-margin.mat, # of points K = 27]
• data is linearly separable
• but only with a narrow margin
C = Infinity: hard margin
C = 10: soft margin
STPRTool, Franc & Hlavac
Optimization
• Does this cost function have a unique solution?
• Does the solution depend on the starting point of an iterative optimization algorithm (such as gradient descent)?
[Figure: a function with a local minimum and a global minimum]
Convex functions
If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over a convex set, which it is in our case)
Convex function examples
[Figure: a convex function and a non-convex function]
A non-negative sum of convex functions is convex
Summary: Learning by optimizing cost functions
E(w) = Σ_i loss_i + λ R(w)    (loss function + regularization)
We have seen:
• “ridge” regression — squared loss: Σ_i (y_i − w^T x_i)², squared regularizer: λ ||w||²
• Lasso regression — squared loss: Σ_i (y_i − w^T x_i)², lasso regularizer: λ ||w||_1
• SVM — hinge loss: Σ_i max(0, 1 − y_i (w^T x_i + b)), squared regularizer: λ ||w||²
Logistic Regression Classifier
Overview
The logistic function or sigmoid function: σ(z) = 1 / (1 + e^(−z))
[Figures: the sigmoid σ(z) plotted against z, rising smoothly from 0 to 1]
Margin property
• A sigmoid favours a larger margin cf. a step classifier
[Figure: sigmoid vs. step function across the decision boundary]
Though, need to control the gradient. How?
Learning
[Figure: sigmoid output between 0 and 1, with the decision threshold at 0.5]
Maximum Likelihood Estimation
Logistic Regression Loss function: L(w) = Σ_i log(1 + exp(−y_i (w^T x_i + b)))
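A sketch of maximum-likelihood training by gradient descent on this loss (illustrative, not from the slides; assumes labels y_i ∈ {−1, +1} and a small L2 regulariser):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lam=0.01, lr=0.1, n_epochs=500):
    """Minimise sum_i log(1 + exp(-y_i (w.x_i + b))) + lam * ||w||^2 by gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_epochs):
        z = y * (X @ w + b)
        coeff = -sigmoid(-z) * y          # d/df of log(1 + exp(-y f)) is -sigmoid(-y f) * y
        grad_w = X.T @ coeff + 2 * lam * w
        grad_b = coeff.sum()
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def predict_prob(w, b, x):
    """P(y = +1 | x) under the learnt model."""
    return sigmoid(w @ x + b)
```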
Comparison of SVM and LR cost functions
Note:
• both approximate 0-1 loss
• very similar asymptotic behaviour
• main difference is the smoothness of LR, and that it is non-zero outside the SVM margin
• SVM gives a sparse solution for the α_i
[Figure: hinge loss and logistic loss plotted against the margin y_i f(x_i)]
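A small illustrative sketch tabulating the two losses (plus the 0-1 loss) as functions of the margin y_i f(x_i):

```python
import numpy as np

z = np.linspace(-3, 3, 7)                    # margin values y_i * f(x_i)
hinge    = np.maximum(0.0, 1.0 - z)          # SVM: exactly zero once the margin exceeds 1
logistic = np.log(1.0 + np.exp(-z))          # LR: smooth, never exactly zero
zero_one = (z <= 0).astype(float)            # the 0-1 loss both approximate
for row in zip(z, zero_one, hinge, logistic):
    print("%+.1f   %.0f   %.2f   %.3f" % row)
```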
Next lecture
• Beyond Binary Classification
• Multi-class and Multi-label
• Using binary classifiers
• Big Data
• retrieval and ranking
• precision-recall curves
• Nearest Neighbours (NN)
• Approximate NN, Locality Sensitive Hashing, Product Quantization
• Intro to practical