Discriminative & Deep Learning for
Big Data
Andrew Zisserman
Visual Geometry Group
University of Oxford
http://www.robots.ox.ac.uk/~vgg
AIMS-CDT Michaelmas 2019
Overview
Two aspects:
1. Learning from Big Data:
• Supervised and discriminative learning: provides training data
for large capacity models
• e.g. for data hungry deep neural networks
2. Searching and modelling Big Data:
• Need efficient (fast and low memory) methods to address this
Course Outline
Mon: Discriminative Learning 1 [AZ]
• Introduction, NN, linear classifiers; regression, SVMs, loss functions
Tue: Discriminative Learning 2 and searching Big Data [AZ, RA]
• Multiple classes, large scale retrieval; ANN, LSH, PQ
• Practical 1: Image classification
Wed: Deep Learning 1 [AV]
• Neural networks, CNNs, back-prop, applications
Thu: Deep Learning 2 and Large Scale Learning [AV]
• Architectures, visualization, applications; large scale learning
• Practical 2: CNNs
Mon 9th: Seminar by Alyosha Efros
• Applications of large scale matching in Computer Vision
Recommended book: general introduction
• Pattern Recognition and
Machine Learning
Christopher Bishop, Springer, 2006.
• Excellent on classification and
regression
Textbooks 2: also general
• Elements of Statistical
Learning
Hastie, Tibshirani, Friedman,
Springer, 2009, second edition
• Good explanation of algorithms
• pdf available online
Textbooks 3: specialized for SVMs and kernels
• Learning with Kernels
Bernhard Schölkopf and Alexander J. Smola
MIT Press, 2002.
Textbooks 4: specialized for deep learning
• Deep Learning
Goodfellow, Bengio, Courville
MIT Press, 2016
• html version available online
• http://www.deeplearningbook.org
Introduction: What is Machine Learning?
Algorithms that can improve their performance using
training data
• Typically the algorithm has a (large) number of
parameters whose values are learnt from the data
• Can be applied in situations where it is very challenging
(= impossible) to define rules by hand, e.g.:
• Face recognition
• Speech recognition
• Stock prediction
Example 1: hand-written digit recognition
Images are 28 x 28 pixels
How to proceed …
As a supervised classification problem
Start with training data, e.g. 6000 examples of each digit
• Can achieve testing error of 0.4%
• One of the first commercial and widely used ML systems (for zip codes & checks)
Example 2: Face detection
• Again, a supervised classification problem
• Need to classify an image window into three classes:
• non-face
• frontal-face
• profile-face
Classifier is learnt from labelled data
Training data for frontal faces
• 5000 faces
All near frontal
Age, race, gender, lighting
• 10^8 non-faces
• faces are normalized
scale, translation
Example 3: Spam detection
• This is a classification problem
• Task is to classify email into spam/non-spam
• Data x_i is word count, e.g. of viagra, outperform, “you may be
surprized to be contacted” …
• Requires a learning system as “enemy” keeps innovating
Example 4: Stock price prediction
• Task is to predict stock price at future date
• This is a regression task, as the output is continuous
Example 5: Computational biology
Protein Structure and Disulfide Bridges
Protein: 1IMT
AVITGACERDLQCG
KGTCCAVSLWIKSV
RVCTPVGTSGEDCH
PASHKIPFSGQRMH
HTCPCAPNLACVQT
SPKKFKCLSK
Regression task: given sequence predict 3D structure
Example 6: Colourization
• input: greyscale image (L); output: colour image (ab)
• “Free” supervisory signal: the colour image itself
Regression task: predict pixel colour from a monochrome input
Colourization examples
Train network to predict pixel colour from a monochrome input
Colorful Image Colorization, Zhang et al., ECCV 2016
Example 7: Machine translation
What is the anticipated cost of collecting fees under the new proposal?
En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?
[Figure: word-level alignment between the English sentence (x) and its French translation (y)]
e.g. Google translate
Use of aligned text
What is the anticipated cost of collecting fees under the new proposal?
Google Translate (2019)
Example 8: Machine transcription – sequences
Use of aligned speech and text
• Automated speech recognition
Use of aligned images and text
• Text spotting and recognition
Use of aligned face video and text
• Automated lip reading
Lip reading sentences
Chung, Senior, Vinyals, Zisserman, 2017
Speech and natural language
https://www.skype.com/en/features/skype-translator/
Google Translate App
• Translate between 103 languages by typing
• Offline: Translate 52 languages when you have no Internet
• Instant camera translation: Use your camera to translate
text instantly in 30 languages
• Camera Mode: Take pictures of text for higher-quality
translations in 37 languages
• Conversation Mode: Two-way instant speech translation in
32 languages
https://play.google.com/store/apps/details?id=com.google.android.apps.translate&hl=en
See also: The Great AI Awakening (New York Times Magazine)
Slide credit: Lana Lazebnik
Lecture outline
• Classification
• Regression
• Overfitting
• Regularization and Loss functions
• Part II
• Linear classifiers
• Support Vector Machine (SVM)
• Logistic regression (LR)
Supervised Learning: Overview
Learning machine
Classification
• Suppose we are given a training set of N observations (x_1, y_1), ..., (x_N, y_N)
• Classification problem is to estimate f(x) from this data such that f(x_i) = y_i
K Nearest Neighbour (K-NN) Classifier
Algorithm
• For each test point, x, to be classified, find the K nearest
samples in the training data
• Classify the point, x, according to the majority vote of their
class labels
e.g. K = 3
• applicable to
multi-class case
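As an illustration (not from the slides), here is a minimal NumPy sketch of the K-NN classification rule; the function name and toy data are hypothetical:

```python
import numpy as np

def knn_classify(X_train, y_train, x, K=3):
    """Classify a point x by majority vote of its K nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean distance to every training point
    nearest = np.argsort(dists)[:K]                  # indices of the K closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                 # majority vote over their class labels

# Toy usage: two 2-D classes
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.2, 0.1]), K=3))   # -> 0
```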
K = 1
Voronoi diagram:
• partitions the space into regions
• boundaries are equal distance
from training points
Classification boundary:
• non-linear
A sampling assumption: training and test data
[Figure: training data (left) and testing data (right)]
• Assume that the training examples are drawn independently from the set of all
possible examples.
• This makes it very unlikely that a strong regularity in the training data will be absent in
the test data.
• Measure classification error with the 0-1 loss function: the “risk” is the fraction of misclassified points,
R = (1/N) Σ_i I[ f(x_i) ≠ y_i ]
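For concreteness (not from the slides), the empirical risk under the 0-1 loss is just the fraction of misclassified points:

```python
import numpy as np

def classification_error(y_true, y_pred):
    """Empirical risk with the 0-1 loss: fraction of points where f(x_i) != y_i."""
    return np.mean(y_true != y_pred)

y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([1, -1, -1, 1, -1])
print(classification_error(y_true, y_pred))   # 0.2
```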
K = 1
Training data: error = 0.0
Testing data: error = 0.15
K = 3
Training data: error = 0.0760
Testing data: error = 0.1340
Generalization
• The real aim of supervised learning is to do well on test data that is not known during learning
• Choosing the values for the parameters that minimize the loss function on the training data is not necessarily the best policy
• We want the learning machine to model the true regularities in the data and to ignore the noise in the data.
K = 1
Training data: error = 0.0
Testing data: error = 0.15
K = 3
Training data: error = 0.0760
Testing data: error = 0.1340
K = 7
Training data: error = 0.1320
Testing data: error = 0.1110
K = 21
Training data: error = 0.1120
Testing data: error = 0.0920
Properties and training
As K increases:
• Classification boundary becomes smoother
• Training error can increase
Choose (learn) K on a validation set
• Split training data into training and validation
• Hold out validation data and measure error on this
• Validation set acts as a proxy for the test set
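A hedged sketch of this selection procedure, reusing the toy knn_classify function sketched earlier; the split fraction and candidate K values are illustrative, not from the slides:

```python
import numpy as np

def choose_K(X, y, candidate_Ks=(1, 3, 5, 7, 21), val_fraction=0.3, seed=0):
    """Hold out part of the training data and pick the K with lowest validation error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(val_fraction * len(X))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    X_tr, y_tr, X_val, y_val = X[train_idx], y[train_idx], X[val_idx], y[val_idx]

    best_K, best_err = None, np.inf
    for K in candidate_Ks:
        preds = np.array([knn_classify(X_tr, y_tr, x, K) for x in X_val])
        err = np.mean(preds != y_val)     # validation error as a proxy for test error
        if err < best_err:
            best_K, best_err = K, err
    return best_K
```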
Example: hand written digit recognition
• MNIST data set
• Distance = raw pixel distance between images
• 60K training examples
• 10K testing examples
• K-NN gives 5% classification error
Summary
Advantages:
• K-NN is a simple but effective classification procedure
• Applies to multi-class classification
• Decision surfaces are non-linear
• Quality of predictions automatically improves with more training
data
• Only a single parameter, K; easily tuned by cross-validation
• Use as a baseline/debugging classifier
Summary
Disadvantages:
• What does nearest mean? Need to specify or learn a distance
metric.
• Computational cost: must store and search through the entire
training set at test time. Can alleviate this by thinning the training
set and by using efficient data structures such as approximate NN
Lecture outline
• Classification
• Regression
• Overfitting
• Regularization and Loss functions
• Part II
• Linear classifiers
• Support Vector Machine (SVM)
• Logistic regression (LR)
Regression
• Suppose we are given a training set of N observations
• Regression problem is to estimate y(x) from this data
K-NN Regression
Algorithm
• For each test point, x, find the K nearest samples xi in the
training data and their values yi
• Output is mean of their values
• Again, need to choose (learn) K
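A minimal sketch of K-NN regression along the same lines (illustrative, not from the slides):

```python
import numpy as np

def knn_regress(X_train, y_train, x, K=3):
    """Predict y at x as the mean of the y-values of the K nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:K]
    return y_train[nearest].mean()
```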
Example:
Real time head pose regression
Fanelli et al., DAGM 2011
Image Regression Examples
IM2GPS
“When was that made?”
Age estimation
Slide credit: Lana Lazebnik
Regression example: polynomial curve fitting
• The green curve is the true function (which is
not a polynomial)
• The data points are uniform in x but have
noise in y.
• We will use a loss function that measures the
squared error in the prediction of y(x) from x.
The loss for the red polynomial is the sum of
the squared vertical errors.
from Bishop
[Figure: polynomial regression — target values t_n and the fitted polynomial y(x, w); the loss is the sum of the squared vertical errors]
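A small sketch of the least-squares polynomial fit described above (illustrative; uses NumPy's built-in least-squares solver, and the toy data is an assumption):

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares fit of y(x, w) = w_0 + w_1 x + ... + w_M x^M to targets t."""
    Phi = np.vander(x, M + 1, increasing=True)   # design matrix: columns 1, x, ..., x^M
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # minimises the sum of squared vertical errors
    return w

# Toy data in the style of the figure: noisy samples of a smooth function
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)
w = fit_polynomial(x, t, M=3)
y_fit = np.vander(x, 4, increasing=True) @ w     # predictions of the fitted cubic
```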
Some fits to the data: which is best?
from Bishop
[Figure: polynomial fits of increasing order M; the highest-order fit over-fits the data]
Over-fitting
Root-Mean-Square (RMS) Error: E_RMS = sqrt( 2 E(w*) / N )
• test data: a different sample from the same true function
• training error goes to zero, but test error increases with M
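An illustrative sketch of this effect, reusing the fit_polynomial helper above; the noise level and sample sizes are assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Points uniform in x with noise added to y (same 'true' function for train and test)."""
    x = np.linspace(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)

def rms_error(w, x, t):
    y = np.vander(x, len(w), increasing=True) @ w
    return np.sqrt(np.mean((y - t) ** 2))

x_train, t_train = sample(10)
x_test, t_test = sample(100)        # a different sample from the same true function

for M in (0, 1, 3, 9):
    w = fit_polynomial(x_train, t_train, M)
    # training error shrinks with M, while test error eventually grows
    print(M, rms_error(w, x_train, t_train), rms_error(w, x_test, t_test))
```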
Trading off goodness of fit against model complexity
• If the model has as many degrees of freedom as the data, it
can fit the training data perfectly
• But the objective in ML is generalization
• Can expect a model to generalize well if it explains the training
data surprisingly well given the complexity of the model.
How to prevent over-fitting? I
• Add more data than the model “complexity”
• For 9th order polynomial:
[Figure: 9th order polynomial fitted to larger numbers of data points]
Polynomial Coefficients
[Table: coefficient values — the magnitudes become very large for the over-fitted model]
How to prevent over-fitting? II
• Regularization: penalize large coefficient values
E(w) = Σ_n (y(x_n, w) − t_n)² + λ ||w||²    (loss function + regularization)
• In practice use validation data to choose lambda (not test data)
• Convex cost function, closed form solution
• This is “weight decay” in deep learning
“ridge” regression
Polynomial Coefficients
[Table: coefficient values — with regularization the coefficient magnitudes stay small]
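A sketch of the closed-form ridge solution for the polynomial basis (illustrative; the regularized normal equations give w directly):

```python
import numpy as np

def fit_ridge(x, t, M, lam):
    """Ridge regression for a polynomial basis: minimise ||Phi w - t||^2 + lam * ||w||^2."""
    Phi = np.vander(x, M + 1, increasing=True)
    # Closed-form solution of the regularized normal equations:
    # w* = (Phi^T Phi + lam I)^(-1) Phi^T t
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)
```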
Summary Point: How to set parameters?
Use a validation set
Divide the total dataset into three subsets:
• Training data is used for learning the parameters of the model.
• Validation data is not used for learning, but is used to determine the hyper-parameters, e.g. for deciding the type of model and the amount of regularization.
• Test data is used to get a final, unbiased estimate of how well the learning machine works. Expect this estimate to be worse than on the validation data.
Can re-divide the total dataset to get another unbiased estimate of the true error rate.
• Again, need to control the complexity of the (discriminant)
function
Lecture outline
• Classification
• Regression
• Overfitting
• Regularization and Loss functions
• Part II
• Linear classifiers
• Support Vector Machine (SVM)
• Logistic regression (LR)
What comes next?
• Learning by optimizing a cost function:
E(w) = Σ_i loss(f(x_i; w), y_i) + λ R(w)    (loss function + regularization)
• In general
• choose loss function for: classification, regression, ranking, clustering …
• choose regularization function
The “Lasso” or L1 norm regularization
E(w) = Σ_i (y_i − w·x_i)² + λ ||w||_1    (loss function + regularization)
• This is a quadratic optimization problem
• There is a unique solution
• p-norm definition: ||w||_p = ( Σ_i |w_i|^p )^(1/p)
• LASSO = Least Absolute Shrinkage and Selection Operator
Sparsity property of the Lasso
• contour plots for d = 2
ridge regression lasso
• Minimum where contours of loss and regularizer are tangent
• For the lasso case, minima occur at “corners”
• Consequently one of the weights is zero
• In high dimensions many weights can be zero
regularization parameter λ
[Figures: coefficient values as a function of λ (percent of λ_max) for Ridge Regression and for L1-Regularized Least Squares]
Lasso in action
Sparse weight vectors
• Weights being zero is a method of “feature selection” –
zeroing out the unimportant features
• (The SVM classifier also has this property: sparse α_i in the dual representation)
• Ridge regression does not
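One standard way to minimise the lasso objective is proximal (soft-thresholding) gradient descent; the sketch below is illustrative and not from the slides, but it shows how the L1 penalty drives weights exactly to zero:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of the L1 norm: shrink towards zero and clip at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fit_lasso(X, y, lam, n_iters=1000):
    """ISTA: gradient step on the squared loss, then soft-threshold for the L1 penalty."""
    n, d = X.shape
    w = np.zeros(d)
    step = 1.0 / np.linalg.norm(X, 2) ** 2       # 1 / Lipschitz constant of the gradient
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)                 # gradient of 0.5 * ||Xw - y||^2
        w = soft_threshold(w - step * grad, step * lam)
    return w                                     # many entries are exactly zero for large lam
```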
More loss functions for regression
Part II Outline
• Linear classifiers
• Linear separability
• Support Vector Machine (SVM) classifier
• Wide margin
• Cost function
• Hard and soft margins
• Optimization
• Logistic Regression classifier
• Cost function
Linear Classifiers and the SVM
Binary Classification
Linear separability
[Figure: a linearly separable dataset and a dataset that is not linearly separable]
Linear classifiers
A linear classifier has the form f(x) = w^T x + b
• in 2D the discriminant is a line
• w is the normal to the line, and b the bias
• w is known as the weight vector
Linear classifiers
A linear classifier has the form f(x) = w^T x + b
• in 3D the discriminant is a plane, and in nD it is a hyperplane
For a K-NN classifier it was necessary to `carry’ the training data
For a linear classifier, the training data is used to learn w and then discarded
Only w is needed for classifying new data
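In code, prediction then needs only w and b (illustrative sketch, not from the slides):

```python
import numpy as np

def linear_classify(w, b, x):
    """Assign x to class +1 or -1 according to the sign of w.x + b."""
    return 1 if float(w @ x + b) > 0 else -1
```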
What is the best w?
• maximum margin solution: most stable under perturbations of the inputs
Interlude: Why should we care about linear classifiers?
Example: bicycle image classification
Pre-deep learning – non-linear classifiers
input → Feature Extractor → representation x → interpretation
Deep learning – train network for linear classification
input → CNN Feature Extractor → representation x → interpretation
• Optimize network with loss function for a linear classifier
• Learns to produce feature vectors x that are linearly separable
How to find the maximum margin?
Support Vector Machine (linearly separable data)
[Figure: maximum-margin hyperplane w^T x + b = 0 with the support vectors marked]
SVM – sketch derivation (linearly separable data)
[Figure: hyperplane w^T x + b = 0 with margin planes w^T x + b = +1 and w^T x + b = -1; the support vectors lie on the margin planes]
Margin = 2 / ||w||
SVM – Optimization
min_w ||w||²  subject to  y_i (w^T x_i + b) ≥ 1 for all i
Linear separability again: What is the best w?
• the points can be linearly separated but
there is a very narrow margin
• but possibly the large margin solution is
better, even though one constraint is violated
In general there is a trade off between the margin and the number of
mistakes on the training data
Margin violations and misclassifications
[Figure: hyperplane w^T x + b = 0 with margin planes w^T x + b = ±1, showing a margin violation and a misclassified point]
Penalize points according to their distance from the margin
Loss functions
• SVM uses the “hinge” loss: max(0, 1 − y_i f(x_i))
• in contrast to the 0-1 loss
“Soft” margin problem
E(w) = Σ_i max(0, 1 − y_i (w^T x_i + b)) + λ ||w||²    (loss function + regularization)
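A sketch of minimising this soft-margin cost by subgradient descent (illustrative, not from the slides; assumes labels y_i ∈ {−1, +1} and illustrative hyper-parameter values):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, n_epochs=200):
    """Minimise sum_i max(0, 1 - y_i (w.x_i + b)) + lam * ||w||^2 by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_epochs):
        margins = y * (X @ w + b)
        active = margins < 1                     # only margin violators contribute a subgradient
        grad_w = -(y[active][:, None] * X[active]).sum(axis=0) + 2 * lam * w
        grad_b = -y[active].sum()
        w -= lr * grad_w / n                     # dividing by n just rescales the step size
        b -= lr * grad_b / n
    return w, b
```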
Loss function
[Figure: hyperplane w^T x + b = 0 with support vectors and the loss function overlaid]
[Figure: scatter plot of feature x vs. feature y; file az-margin.mat, # of points K = 27]
• data is linearly separable
• but only with a narrow margin
C = Infinity: hard margin
C = 10: soft margin
STPRTool, Franc & Hlavac
Optimization
• Does this cost function have a unique solution?
• Does the solution depend on the starting point of an iterative optimization algorithm (such as gradient descent)?
[Figure: a function with a local minimum and a global minimum]
Convex functions
If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over a convex set, which it is in our case)
Convex function examples
[Figure: a convex function and a non-convex function]
A non-negative sum of convex functions is convex
Summary: Learning by optimizing cost functions
E(w) = Σ_i loss_i + λ R(w)    (loss function + regularization)
We have seen:
• “ridge” regression — squared loss: Σ_i (y_i − w^T x_i)², squared regularizer: λ ||w||²
• Lasso regression — squared loss: Σ_i (y_i − w^T x_i)², lasso regularizer: λ ||w||_1
• SVM — hinge loss: Σ_i max(0, 1 − y_i (w^T x_i + b)), squared regularizer: λ ||w||²
Logistic Regression Classifier
Overview
The logistic function or sigmoid function: σ(z) = 1 / (1 + e^(−z))
[Figures: the sigmoid σ(z) plotted against z, rising smoothly from 0 to 1]
Margin property
• A sigmoid favours a larger margin cf. a step classifier
[Figure: sigmoid vs. step function across the decision boundary]
Though, need to control the gradient. How?
Learning
[Figure: sigmoid output between 0 and 1, with the decision threshold at 0.5]
Maximum Likelihood Estimation
Logistic Regression Loss function: L(w) = Σ_i log(1 + exp(−y_i (w^T x_i + b)))
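A sketch of maximum-likelihood training by gradient descent on this loss (illustrative, not from the slides; assumes labels y_i ∈ {−1, +1} and a small L2 regulariser):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lam=0.01, lr=0.1, n_epochs=500):
    """Minimise sum_i log(1 + exp(-y_i (w.x_i + b))) + lam * ||w||^2 by gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_epochs):
        z = y * (X @ w + b)
        coeff = -sigmoid(-z) * y          # d/df of log(1 + exp(-y f)) is -sigmoid(-y f) * y
        grad_w = X.T @ coeff + 2 * lam * w
        grad_b = coeff.sum()
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def predict_prob(w, b, x):
    """P(y = +1 | x) under the learnt model."""
    return sigmoid(w @ x + b)
```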
Comparison of SVM and LR cost functions
Note:
• both approximate 0-1 loss
• very similar asymptotic behaviour
• main difference is the smoothness of LR, and that it is non-zero outside the SVM margin
• SVM gives a sparse solution for the α_i
[Figure: hinge loss and logistic loss plotted against the margin y_i f(x_i)]
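A small illustrative sketch tabulating the two losses (plus the 0-1 loss) as functions of the margin y_i f(x_i):

```python
import numpy as np

z = np.linspace(-3, 3, 7)                    # margin values y_i * f(x_i)
hinge    = np.maximum(0.0, 1.0 - z)          # SVM: exactly zero once the margin exceeds 1
logistic = np.log(1.0 + np.exp(-z))          # LR: smooth, never exactly zero
zero_one = (z <= 0).astype(float)            # the 0-1 loss both approximate
for row in zip(z, zero_one, hinge, logistic):
    print("%+.1f   %.0f   %.2f   %.3f" % row)
```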
Next lecture
• Beyond Binary Classification
• Multi-class and Multi-label
• Using binary classifiers
• Big Data
• retrieval and ranking
• precision-recall curves
• Nearest Neighbours (NN)
• Approximate NN, Locality Sensitive Hashing, Product Quantization
• Intro to practical