
Introduction to Deep Learning for Biomedical Engineering

After a presentation made by: Evan Shelhamer, Jeff Donahue, Jon Long

caffe.berkeleyvision.org | github.com/BVLC/caffe

Prof. Bart ter Haar Romeny

What is Deep Learning?


A typical Deep Convolutional Neural Network


ImageNet – Fei-Fei Li

ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

AlexNet


Litjens, Geert, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen AWM van der Laak, Bram van Ginneken, and Clara I. Sánchez. "A survey on deep learning in medical image analysis." arXiv preprint arXiv:1702.05747 (Feb 2017).


Power of heatmaps – Train on image level, visualize on pixel level.

Samaneh Abbasi, Bart ter Haar Romeny et al. (TU/e): Recurrent Convolutional Neural Networks, MICCAI 2017, Quebec City


Samaneh Abbasi et al. (TU/e): Recurrent Convolutional Neural Networks, MICCAI 2017, Quebec City


For diabetic retinopathy, the best detection performance is by Quellec et al.: Az = 0.954 on Kaggle's dataset and Az = 0.949 on e-Ophtha.


Deep Learning for Vision: outline
- Why Deep Learning? Applications; The Challenge of Recognition
- Dive into Deep Learning: What is DL? Why Now?
- Learning & Optimization
- Network Tour
- Transfer Learning
- Caffe First Sip

Why Deep Learning? End-to-End Learning for Many Tasks: vision, speech, text, control


Presenter
Presentation Notes
Deep learning has proven useful for many purposes, not just one single task. This is the point of learning end-to-end, that is, learning the whole problem from input to output: the same toolkit can work for different domains, whether vision, speech, text, or control and robotics. We'll focus on vision; next we'll look at core visual recognition tasks and the standard benchmarks for each problem. Deep learning approaches have delivered dramatic improvements across these and many other tasks.

Some examples

Demo: Google Translate on a smartphone (speech + images)

Demo: https://www.imageidentify.com/

How does this work?

Биомедицинская инженерия ("Biomedical Engineering"): today you can read this Russian text with your smartphone.

Kaggle: Diabetic Retinopathy Challenge (blog)

Google Photos


Other examples:

Robot vision and recognition: harvest robot for peppers.

Wageningen University, the Netherlands

Vision for self-driving cars


Aalsmeer, Netherlands, largest flower auction in the world


Quick facts and figures about the Dutch Horticulture industry

The Dutch horticulture sector is a global trendsetter and the undisputed international market leader in flowers, plants, bulbs and propagation material.

Did you know?
• Holland has a 44% share of the worldwide trade in floricultural products, making it the dominant global supplier of flowers and flower products. Some 77% of all flower bulbs traded worldwide come from the Netherlands, the majority of which are tulips. 40% of the trade in 2015 was cut flowers and flower buds.
• The sector is the number 1 exporter to the world of live trees, plants, bulbs, roots and cut flowers.
• The sector is the number 3 exporter of nutritional horticulture products.
• Of the approximately 1,800 new plant varieties that enter the European market each year, 65% originate in the Netherlands. In addition, Dutch breeders account for more than 35% of all applications for community plant variety rights.
• The Dutch are among the world's largest exporters of seeds: seed exports amounted to €3.1 billion in 2014.
• In 2014 the Netherlands was the world's second largest exporter (in value) of fresh vegetables, exporting vegetables with a market value of €7 billion.


From Wikipedia:

Deep learning is a class of machine learning algorithms that

• use a deep cascade of many layers of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output from the previous layer as input. The algorithms may be supervised or unsupervised; applications include pattern analysis (unsupervised) and classification (supervised).

• are based on the (unsupervised) learning of multiple levels of features or representations of the data. Higher-level features are derived from lower-level features to form a hierarchical representation.

Deep Learning

So we have to learn:

1. Overview in depth → Introduction, Caffe example
2. What are filters? → Convolution and convolutional networks
3. What is learned? → Invariant geometric features
4. How can kernels be learned? → Principal Component Analysis
5. How does the visual system do this? → Front-end vision, visual cortex
6. How can we use this? → Software developments in Deep Learning
7. Questions → and answers

Deep Learning is a very hot area of machine learning research, with many remarkable recent successes, such as 97.5% accuracy on face recognition, nearly perfect German traffic sign recognition, and Dogs vs. Cats image recognition with 98.9% accuracy.

Many winning entries in recent Kaggle Data Science competitions have used Deep Learning.

The term "deep learning" refers to the method of training multi-layered neural networks, and became popular after papers by Geoffrey Hinton and his co-workers which showed a fast way to train such networks.

http://www.kdnuggets.com/2014/05/learn-deep-learning-courses-tutorials-overviews.html

Yann LeCun, who did postdoctoral work with Geoff Hinton, also developed a very effective deep learning architecture, the convolutional network (ConvNet), which was successfully used in the late '80s and early '90s for automatic reading of amounts on bank checks.

In May 2014, Baidu, the Chinese search giant, hired Andrew Ng, a leading machine learning and deep learning expert (and co-founder of Coursera), to head its new AI Lab in Silicon Valley, setting up an AI and deep learning race with Google (which hired Geoffrey Hinton) and Facebook (which hired Yann LeCun to head Facebook AI Lab).


Human vision and convolutional neural networks:

A cascade of increasing complexity

• Hierarchical network
• Use of context


Wikipedia: Gestalt psychology or gestaltism (German: Gestalt "shape, form") is a philosophy of mind of the Berlin School of experimental psychology. Gestalt psychology is an attempt to understand the laws behind the ability to acquire and maintain meaningful perceptions in an apparently chaotic world. The central principle of gestalt psychology is that the mind forms a global whole with self-organizing tendencies. The assumed physiological mechanisms on which Gestalt theory rests are poorly defined and support for their existence is lacking. It is known as ‘perceptual grouping’.

AlexNet – PDF

Vision: the highest bandwidth input channel


Machines are useful mainly to the extent that they interact with the physical world. Visual information is the richest source of information about the real world.

Vision is the highest-bandwidth mode for machines to obtain real-world info

Embedded vision enables our things to be:
- More responsive
- More personal and secure
- Safer, more autonomous
- Easier to use

subaru.com


Top papers on arXiv (https://arxiv.org/):
http://www.kdnuggets.com/2017/02/top-arxiv-papers-january-convnets-wide-adversarial.html


Performance evaluation: http://www.robots.ox.ac.uk/~vgg/research/deep_eval/

VOC: Visual Object Classes

Why Now?
1. Data: ImageNet et al., millions of labeled (crowdsourced) images
2. Compute: GPUs, with terabytes/s of memory bandwidth and teraflops of compute
3. Technique: new optimization know-how, new variants on old architectures, new tools for rapid experimentation


Presenter
Presentation Notes
note the importance of memory bandwidth: it determines how fast you can look at all that data

Why Now? Data

For example:
- >14 million labeled images
- >1 million with bounding boxes
- >300,000 images with labeled and segmented objects


Why Now? GPUs

Parallel processors for parallel models:
- Inherent parallelism: same op, different data
- Bandwidth: lots of data in and out
- Tuned primitives: cuDNN for deep nets, cuBLAS for matrices

Nvidia News URL

Presenter
Presentation Notes
Mention ILSVRC in particular as the standard contest. Mention/include industrial data: e.g. Facebook and YouTube have much, much more data than represented here. The data is valuable!

GPU – Graphics Processing Unit


- Thousands of parallel cores
- Fully programmable, e.g. in CUDA
- Very affordable
- Large shared memory (e.g. 12 GB)
- Deployed in large server banks
- Can be rented from Amazon, Baidu, Alibaba, etc.

Titan Xp GPU


Why Now? Technique

Non-convex and high-dimensional learning is okay with the right design choices, e.g. non-saturating non-linearities: max(0, x) instead of the saturating sigmoid.

Learning by Stochastic Gradient Descent (SGD) with momentum and other variants (more later!)


Examples from NVIDIA: https://developer.nvidia.com/deep-learning


Deep Break

Presenter
Presentation Notes
mention the traditional picture of getting stuck in local minima, and how this is not a problem in practice

What is Deep Learning?

Compositional Models Learned End-to-End

Hierarchy of Representations (concrete → abstract):
- vision: pixel, motif, part, object
- text: character, word, clause, sentence
- speech: audio, band, phone, word

[Figure: layered network: input → layer1 (θ1) → layer2 (θ2) → output (θ3) → loss against the truth]


Back-propagation jointly learns all of the model parameters to optimize the output for the task (more on this later!)

What is Deep Learning?

Compositional Models Learned End-to-End

Shallow Learning

[slide credit K. Cho]

Separation of hand engineering and machine learning


Presenter
Presentation Notes
note that representations are learned, and don’t correspond exactly to examples given

Hand-Engineered Features

Features from years of vision expertise by the whole community are now surpassed by learned representations, and these transfer across tasks.

[figure credit R. Fergus]

Presenter
Presentation Notes
note that deep learning does not have to be backprop

Deep Learning

[slide credit K. Cho]

Presenter
Presentation Notes
shallow learning: logistic regression, SVM, decision tree; codebook → quantization → classification pipeline


End-to-End Learning Representations

The visual world is too vast and varied to fully describe by hand.

Learn the representation from data: local appearance, parts and texture, objects and semantics.

[figure credit H. Lee]

Presenter
Presentation Notes
all the data → learning; learning → all the tasks

Hierarchical growth of complexity


End-to-End Learning Tasks

The visual world is too vast and varied to fully describe by hand.

Learn the task from data

Presenter
Presentation Notes
layers: compositionality, feature sharing; learning: better task performance, other data, computation time

Types of Learning

Vast space of models!

[figure credit Marc’aurelio Ranzato, CVPR 2014 tutorial]

Deep Network

Recurrent Network

Convolutional Network


Example: TensorFlow (URL)


The Neural Network Zoo: http://www.asimovinstitute.org/neural-network-zoo/


Neural Network Graphs: http://www.asimovinstitute.org/neural-network-zoo-prequel-cells-layers/


History

Is deep learning 4, 20, or 50 years old? What’s changed?

2000s: sparse, probabilistic, and layer-wise models (Hinton, Bengio, Ng)
2012: DL popularized in vision by contest victory (Krizhevsky et al. 2012)

Rosenblatt's Perceptron

Radial Basis Function

Convolutional Networks: 1989

LeNet: a layered model composed of convolution and subsampling layers, followed by a holistic representation and ultimately a classifier for handwritten digits [LeNet]


Note: the channel dimension goes up as the spatial dimension goes down... still a common pattern today

AlexNet: a layered model composed of convolution, subsampling, and further operations, followed by a holistic representation and all-in-all a landmark classifier on ILSVRC12 [AlexNet]

+ data
+ GPU
+ non-saturating non-linearity
+ regularization

Convolutional Networks: 2012

Presenter
Presentation Notes
Gloss: connected the dots; exploration of model structure, optimization know-how, computation + data.


Convnet Design Patterns

AlexNet layer stack (input at the bottom):
Conv 11x11s4, 96 / ReLU → Local Response Norm → Max Pool 3x3s2 →
Conv 5x5s1, 256 / ReLU → Local Response Norm → Max Pool 3x3s2 →
Conv 3x3s1, 384 / ReLU → Conv 3x3s1, 384 / ReLU → Conv 3x3s1, 256 / ReLU → Max Pool 3x3s2 →
FC 4096 / ReLU → FC 4096 / ReLU → FC 1000

Conv-ReLU: all convs are followed by a non-linearity, in this case ReLU.
Conv-Pool: one or more convs are followed by pooling to subsample; spatial size shrinks, receptive field grows.
FC-ReLU: stacked at the end of the net to learn the output; the majority of the learned parameters.

Convnet Computation: 2012 & 2014

AlexNet inference for a single image (3x227x227 input):
- 725M FLOPs
- 60M parameters (60,965,224 to be exact)
- 408 MB GPU memory in Caffe; <12 GB for a batch size of 1,500
- <1 ms / image on a Titan X with cuDNN v4 for batch size >= 256


Compare GoogLeNet (ILSVRC14 winner):
- 2x the FLOPs
- 0.1x the parameters
- 14% more accurate

Architecture matters! But the computational primitives are the same.

AlexNet per-layer params and FLOPs (top of the net first; pool and norm layers have no parameters):

    layer                      params    FLOPs
    FC 1000                    4M        4M
    FC 4096 / ReLU             16M       16M
    FC 4096 / ReLU             37M       37M
    Conv 3x3s1, 256 / ReLU     442K      74M
    Conv 3x3s1, 384 / ReLU     1.3M      112M
    Conv 3x3s1, 384 / ReLU     884K      149M
    Conv 5x5s1, 256 / ReLU     307K      223M
    Conv 11x11s4, 96 / ReLU    35K       105M
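As a sanity check on these figures (standard parameter/FLOP arithmetic, not from the slides): a conv layer has C_out × (C_in × k × k) + C_out parameters, so conv1 has 96 × (3 × 11 × 11) + 96 = 34,944 ≈ 35K; its FLOPs are about one multiply per filter weight per output position, (3 × 11 × 11) × (96 × 55 × 55) = 363 × 290,400 ≈ 105M, matching the table.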

Convolutional Nets: 2014

GoogLeNet, ILSVRC14 winner: ~6.6% top-5 error
- a composition of multi-scale, dimension-reduced "Inception" modules
- no FC layers and only 5 million parameters

+ depth
+ auxiliary classifiers
+ dimensionality reduction

[Szegedy15]

1x1 Convolution

- reduce the channel dimension to control (1) parameter count and (2) computation
- stack with a non-linearity for a deeper net
- found in many of the latest nets

Each filter has size 64x1x1 and does a 64-dim dot product; here, a 1x1 conv with 32 filters.

[figure credit A. Karpathy]
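To make this concrete, a minimal NumPy sketch (64 input channels and 32 filters as in the figure; the spatial size is an arbitrary choice) showing that a 1x1 convolution is a per-pixel dot product across channels:

    import numpy as np

    x = np.random.randn(64, 56, 56)      # feature map: 64 channels
    w = np.random.randn(32, 64)          # 32 filters, each of size 64x1x1
    b = np.random.randn(32)

    # flatten space, apply the 32x64 matrix at every pixel, add bias
    y = (w @ x.reshape(64, -1) + b[:, None]).reshape(32, 56, 56)
    print(y.shape)                       # (32, 56, 56): 64 -> 32 channels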

Presenter
Presentation Notes
comment on inference v. training (this is the time for inference on a single image; a training iteration is roughly 2-3x the computation and is iterated many times) go through each AlexNet, then gloss over GoogLeNet FLOPS: 725,066,088 for all conv + fc w/ biases 000,659,272 for ReLU 000,027,000 for pooling 000,020,000 for LRN layer, weight ops, bias ops conv1 105415200 290400 conv2 223948800 186624 conv3 149520384 64896 conv4 112140288 64896 conv5 74760192 43264 fc6 37748736 4096 fc7 16777216 4096 fc8 4096000 1000 conv2 has 256 * (96 / 2) * 5^2 = 307,200 params

Convolutional Nets: 2014

VGG16, ILSVRC14 runner-up: ~7.3% top-5 error
- 13 layers of 3x3 convolution interleaved with max pooling, + 3 fully-connected layers
- simple architecture, good for transfer learning
- 155 million params, and more expensive to compute

+ depth
+ fine-tuning deeper and deeper
+ stacking small filters


VGG16 layer stack (input at the bottom):
Conv 3x3s1, 64 / ReLU (x2) → Max Pool 2x2s2 →
Conv 3x3s1, 128 / ReLU (x2) → Max Pool 2x2s2 →
Conv 3x3s1, 256 / ReLU (x3) → Max Pool 2x2s2 →
Conv 3x3s1, 512 / ReLU (x3) → Max Pool 2x2s2 →
Conv 3x3s1, 512 / ReLU (x3) → Max Pool 2x2s2 →
FC 4096 / ReLU → FC 4096 / ReLU → FC 1000

Stack two 3x3 convs for a 5x5 receptive field (see the worked arithmetic below).

[figure credit A. Karpathy]

[Simonyan15]
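Worked out (standard receptive-field arithmetic): one 3x3 conv sees a 3-pixel extent; a second stacked 3x3 conv (stride 1) extends this by 3 − 1 = 2, giving 3 + 2 = 5, i.e. a 5x5 receptive field, with 2 × 3² = 18 weights per channel pair instead of 5² = 25, plus an extra non-linearity in between.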

ILSVRC15 and COCO15 winner: MSRA ResNet, for classification, detection, and segmentation

Convolutional Nets: 2015

Learn residual mapping w.r.t. identity

- very deep 100+ layer nets

- skip connections across layers

- batch normalization


Kaiming He et al., "Deep Residual Learning for Image Recognition," arXiv:1512.03385, Dec. 2015.

[He15]

Convolutional Nets: 2015

MSRA ResNet

(~5x the layers shown here)

ILSVRC15 winner with 3.5% top-5 error, and COCO15 winner with a >10% lead for detection and segmentation
- MSRA Residual Net (ResNet): 101- and 152-layer networks
- skip and sum layers to form residuals
- batch normalization (an optimization trick)
[He15]


Presenter
Presentation Notes
http://arxiv.org/abs/1512.03385


Why Now? Deep Learning Frameworks

- frontend: a language for any network, any task
- internal representation: the network
- layer library: fast implementations of common functions and gradients
- backend: dispatch compute for learning and inference
- tools: visualization, profiling, debugging, etc.

Deep Learning Frameworks

All open source; we like to brew our networks with Caffe.

- Caffe (Berkeley / BVLC): C++ / CUDA, Python, MATLAB
- Torch (Facebook + NYU): Lua (C++)
- Theano (U. Montreal): Python
- TensorFlow (Google): Python (C++)

Not So “Neural”

These models are not how the brain works. We don’t know how the brain works!
- This isn’t a problem (except for neuroscientists)
- Be wary of neural-realism hype, or “it just works because it’s like the brain”
- Say network, not neural network; unit, not neuron

Visual Recognition Tasks

Classification
- what kind of image?
- which kind(s) of objects?

Challenges
- appearance varies by lighting, pose, context, ...
- clutter
- fine-grained categorization (horse, or the exact species)

❏ dog ❏ car ❏ horse ❏ bike ❏ cat ❏ bottle ❏ person



Image Classification: ILSVRC 2010-2015

[graph credit K. He; y-axis: top-5 error]

ImageNet Large Scale Visual Recognition Challenge

Website

AlexNet – PDF

Visual Recognition Tasks


car person horse

Detection
- what objects are there?
- where are the objects?

Challenges
- localization
- multiple instances
- small objects

Presenter
Presentation Notes
The "for-real" edition of "it works because": more data/supervision; more of the model is made learnable.

Detection: PASCAL VOC

[graph credit R. Girshick; y-axis: detection accuracy]

R-CNN: regions + convnets; state-of-the-art, in Caffe

Visual Object Classes

Presenter
Presentation Notes
Classification is the fundamental visual task of recognizing what is in an image or what type of image it is. For example, the kinds of objects in the image shown are car, horse, and person, but we could also consider tasks like whether this is a daytime or nighttime image. Classification is challenging because of the many differences in appearance seen in the visual world, like lighting, pose, style, and so on. Clutter or noise can obscure the information to be extracted from the image. Fine-grained categorization is a further difficulty when we want to recognize not just any horse but an exact species.

Visual Recognition Tasks

Semantic Segmentation
- what kind of thing is each pixel part of?
- what kind of stuff is each pixel?

Challenges
- tension between recognition and localization
- amount of computation

[figure labels: horse, car]


Some examples:

• NVIDIA news:
https://news.developer.nvidia.com/google-releases-tensorflow-1-0/
http://nvidianews.nvidia.com/news?q=neural+nets&year=&month=&c=&from=&to=
http://nvidianews.nvidia.com/news?q=deep+learning&year=&month=&c=&from=&to=

• Free book: http://neuralnetworksanddeeplearning.com/

• Other books: MIT: https://pdfs.semanticscholar.org/751f/aab15cbb955b07537fc38901bc96d4e70f57.pdf

• New companies: http://aidence.com/

• Papers:
Classical paper: http://www.nature.com/nature/journal/v521/n7553/abs/nature14539.html
ImageNet: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks (cited 11,342 times)
CAD: https://www.nature.com/articles/srep24454

• Google TensorFlow: https://www.tensorflow.org/get_started/

• Kaggle Diabetic Retinopathy Challenge: https://www.kaggle.com/c/diabetic-retinopathy-detection (see also our BMIE project: www.retinacheck.org/zh/index.html)

• Google Diabetic Retinopathy paper: https://research.googleblog.com/2016/11/deep-learning-for-detection-of-diabetic.html?m=1

Presenter
Presentation Notes
This graph shows the latest results as of the 2015 challenge, with years running right to left. The introduction of deep learning not only dropped the error by almost 10 points in 2012, but deep learning methods have improved in accuracy every year as networks are made deeper and deeper. Many of the contest winners and runners-up were done with Caffe or reproduced with Caffe, including the latest winner (ResNet). Mention speed as well as accuracy. Highlighted: done in Caffe (ResNet, VGG) or reproduced in Caffe (GoogLeNet, AlexNet).

Some Basics of Deep Learning


Presenter
Presentation Notes
Detection is the task of recognizing not only what but where: both the identity and the location of each object need to be predicted. While classification considered only presence or absence, detection demands the recognition of every instance, as we see for all three cars. Localization is difficult, especially for interacting or articulated objects like the person and horse, or for small objects that are easy to miss.


Deep Learning for Vision: outline
- Why Deep Learning? Applications; The Challenge of Recognition
- Dive into Deep Learning: What is DL? Why Now?
- Learning & Optimization
- Network Tour
- Transfer Learning
- Caffe First Sip

Embedded Vision Alliance Tutorial – © Shelhamer, Donahue, Long

Presenter
Presentation Notes
Deep learning is likewise having a remarkable impact on detection. PASCAL VOC is a gold-standard dataset and challenge with fierce competition; detection accuracy scores both recognition and localization. By 2012 progress had slowed and plateaued, only to be driven further by the adoption of deep learning and R-CNN. Gloss mean AP as "detection accuracy", a measure of recognition and localization. PASCAL VOC drove the dataset + challenge shift in computer vision; its successor is COCO.

First Dive Into Deep Learning


Deep Learning is Stacking Layers and Learning End-to-End

Presenter
Presentation Notes
Semantic segmentation is a visual recognition task that asks the identity of every pixel. For things, this means what kind of object the pixel is part of, as in this example output that shows which are the horse pixels and which are the person pixels. We could just as well ask what kind of stuff each pixel is, such as grass or sky; in the context of a satellite image there might be road, buildings, crops, water, and so forth. In this task there is a tension between recognizing what globally and where locally. Computational cost can be an obstacle now that there is a decision to be made for every pixel.

Deep networks are layered models made by stacking different types of transformations.

A layer is a transformation


Stacking Layers

x’ = layer(x)

x2 = layer1(x1)
x3 = layer2(x2)
...

How do layers stack?

Networks run layer-by-layer, composing the input-output transformation of each layer.
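As a minimal sketch in Python (the two toy layers here are made up purely for illustration), running a net is just function composition:

    # each layer is a transformation: x' = layer(x)
    def layer1(x):
        return 2.0 * x + 1.0      # a toy linear transformation

    def layer2(x):
        return max(0.0, x)        # a toy pointwise non-linearity (ReLU)

    def network(x):
        x = layer1(x)             # x2 = layer1(x1)
        x = layer2(x)             # x3 = layer2(x2)
        return x

    print(network(-3.0))          # layer1 gives -5.0; ReLU clips it to 0.0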


Layered Networks

[Figure: input → layer1 → layer2 → output; x1 = layer1(input), out = layer2(x1)]

During learning, the error is passed back layer-by-layer to tune the transformations (output + error).

What kind of layers should we stack?

The simplest layers (for example): Matrix Multiplication and Non-linearity


Matrix Multiplication

Multiply the input x by the weights W and add the bias b. Learns linear transformations.

W is K x O dimensional: K inputs, O outputs; b has O outputs.


Matrix Multiplication == Fully Connected Layer

Output is a function of every input, or the input and output are “fully connected”

Abbreviated as FC

[figure credit BDTI]
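A minimal NumPy sketch of a fully connected layer (the K and O values are arbitrary illustrative choices):

    import numpy as np

    K, O = 8, 3                   # K inputs, O outputs
    W = np.random.randn(K, O)     # K x O weight matrix
    b = np.random.randn(O)        # one bias per output

    x = np.random.randn(K)        # input vector
    y = x @ W + b                 # every output depends on every input
    print(y.shape)                # (3,)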

Presenter
Presentation Notes
note: animated

Linear Classification

- Suppose our data points x are 2D and each comes with a label y, where y = -1 or y = 1
- Learn a weight vector w = [w1; w2]
- Predict the class of a given x by sign(wTx) = sign(w1x1 + w2x2)

(a code sketch follows this list)
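A tiny sketch of this linear classifier (the weight values are made up for illustration; in practice w is learned):

    import numpy as np

    w = np.array([1.0, -2.0])                # weight vector [w1; w2]

    def predict(x):
        return np.sign(w @ x)                # sign(w1*x1 + w2*x2)

    print(predict(np.array([3.0, 1.0])))     # 1.0  -> predict class y = 1
    print(predict(np.array([0.5, 2.0])))     # -1.0 -> predict class y = -1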

To classify we need to separate the data into red vs. blue.

[Figure: 2D scatter of red (y = -1) and blue (y = 1) points on axes x1 and x2, with an unlabeled "?" point to classify]


Linearity is Not Enough

To classify we need to separate the data into red vs. blue, but no line separates this data (NO); a curved boundary does (YES). We need non-linearity!

[Figure: 2D scatter of red (y = -1) and blue (y = 1) points on axes x1 and x2, separable only by a non-linear boundary]

Presenter
Presentation Notes
Armed with matrix multiplication we can do linear classification: separate the data into red vs. blue. The data x is 2-dimensional, with axes x1 and x2, while the output y is simply -1 (red) or +1 (blue). Learn weights w1 and w2 to represent a separating line. What line? In practice linearity is not enough, and real-world data requires more sophisticated classifiers: what line separates this data? None. We need non-linearity.

The Limits of Linearity

Linear steps collapse and stay linear: composing linear layers yields just another linear map, so linear layers alone do not meaningfully stack.

The Shallowest Deep Net

Deep nets are made by stacking learned linear layers and simple pointwise non-linear layers.

Due to the Rectified Linear Unit (ReLU) non-linearity max(0, x), x3 cannot be computed as a linear function of x1.

Linear → Non-linear, Deep: add ReLU (see the sketch below)
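A sketch of this shallowest deep net in NumPy (toy weight values, chosen only to illustrate the structure):

    import numpy as np

    W1 = np.array([[1.0, -1.0],
                   [0.5,  2.0]])             # first linear layer (toy values)
    W2 = np.array([[1.0,  1.0]])             # second linear layer (toy values)

    def shallowest_deep_net(x1):
        x2 = W1 @ x1                         # linear step
        x2 = np.maximum(0.0, x2)             # ReLU: max(0, x), pointwise
        x3 = W2 @ x2                         # linear step
        return x3                            # not a linear function of x1

    print(shallowest_deep_net(np.array([1.0, 2.0])))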


Non-linearity is needed to deepen the representation. There are many non-linearities or activations to choose from: ReLU, Sigmoid, ...

Yet more non-linearities: ReLU, Sigmoid, TanH, Leaky ReLU, ELU

When in doubt, ReLU. Worth trying: Leaky ReLU, ELU. Avoid Sigmoid.
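For reference, a sketch of these activations as NumPy one-liners (the α defaults are common conventions, not prescribed by the slides):

    import numpy as np

    def relu(x):                   return np.maximum(0.0, x)
    def sigmoid(x):                return 1.0 / (1.0 + np.exp(-x))
    def tanh(x):                   return np.tanh(x)
    def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)
    def elu(x, alpha=1.0):         return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))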


Define Your First Net

Let’s go non-linear on a classification problem.

Try It Out: Deep Learning in-your-browser demos


Designing for Sight

Convolutional Networks or convnets are nets for vision

- a functional fit for the visual world by compositionality and feature sharing
- learned end-to-end to handle visual detail, for more accuracy and less engineering

Convnets are the dominant architectures for visual tasks


Visual Structure

Local Processing: pixels close together go together; receptive fields capture local detail.

Across Space: the same what, no matter where; recognize the same input in different places.

Can rely on spatial coherence: this is not a cat, but all of these are cats.

Presenter
Presentation Notes
Pointwise non-linearities

Vision Layers

- Convolution/Filtering: the linear layer for vision (a Learned Filter)
- Pooling: spatial summarization (e.g. max pool 2x2 with stride 2)

[figure credit A. Karpathy, cs231n course]

Convolution: A Linear Layer for Vision

Images have translation-invariant semantics: these are all equally squirrels. So use the same weights between nodes with the same spatial relationship.

This is convolution (or correlation; the terms are used interchangeably in vision). Convolution means fewer parameters for more efficient learning.


A Filter

Input is 3x32x32 data: a color image (3 RGB channels) and square (32x32).

A filter is a spatially local and cross-channel template. Convnet filters are learned.

The filter is 3x5x5 weights:
- spatially local: kernel size is 5x5
- cross-channel: connected across all input channels

Total parameters: 3·5² = 75 filter weights + 1 bias

[figure adapted from A. Karpathy]

One filter evaluation is a dot product between the input window and weights + bias
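A minimal NumPy sketch of this with the sizes above (3x32x32 input, one 3x5x5 filter); sliding the same dot product over every window yields the feature map described next:

    import numpy as np

    x = np.random.randn(3, 32, 32)   # input: 3 channels, 32x32
    w = np.random.randn(3, 5, 5)     # one filter: 3x5x5 = 75 weights
    b = 0.1                          # + 1 bias

    # output size: (32 - 5) / stride 1 + 1 = 28
    fmap = np.empty((28, 28))
    for i in range(28):
        for j in range(28):
            window = x[:, i:i+5, j:j+5]           # 3x5x5 input window
            fmap[i, j] = np.sum(window * w) + b   # dot product + bias

    print(fmap.shape)                # (28, 28): one feature map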

Convolution

input: 3x32x32; filter: 3x5x5; bias: 1; output: 1 feature map

[figure adapted from A. Karpathy]

Presenter
Presentation Notes
use the same weights for the same spatial relationship

Convolving the filter with the input gives a feature map: input 3x32x32, filter 3x5x5 (+1 bias) → feature map 1x28x28.

Filter parameters: 3·5² = 75. FC parameters (for comparison): 3·32² = 3,072.

[figure adapted from A. Karpathy]

Convolution Layer (conv)

input: 3x32x32; filters: 6x3x5x5; bias: 6; output: 6x28x28 feature maps

Convolution layers have multiple filters for more modeling capacity.

[figure adapted from A. Karpathy]

Learned Filters from AlexNet conv1

conv1 has 96 filters for edge, color, and frequency: richer than 3D RGB.

[figure adapted from A. Karpathy]

Presenter
Presentation Notes
weights are shared across space

Pooling (pool)

Spatial summary by computing an operation (max pooling or average pooling) over a window with a stride, e.g. 2x2 pooling with stride 2.
- overlapping or non-overlapping
- separate across channels
- current fashion: 3x3 max pooling with stride 2

[figure credit BDTI]

Presenter
Presentation Notes
weights are shared across space

[figure credit A. Karpathy]

Pooling


- reduce resolution
- increase receptive field size for later layers
- save computation
- add invariance to translation/noise within the pooling window

Example: 64x224x224 → 64x112x112 (a code sketch follows below)
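A minimal NumPy sketch of 2x2 max pooling with stride 2 (the 64x224x224 shape is taken from the example above):

    import numpy as np

    x = np.random.randn(64, 224, 224)        # 64 feature maps

    # non-overlapping 2x2 windows, stride 2: view each spatial axis as
    # (112, 2) blocks and take the max within each 2x2 block
    y = x.reshape(64, 112, 2, 112, 2).max(axis=(2, 4))
    print(y.shape)                           # (64, 112, 112)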

Presenter
Presentation Notes
weights are shared across space

Fully Connected Layers (FC)

Learn a global feature from the full feature maps. Often found at the end of convnets. Note: this could likewise be done by a large convolution kernel.

[Figure: 2x2x2 feature maps are unrolled into a 1x8 input, multiplied by 8x3 weights, and a 1x3 bias is added, giving 1x3 outputs or units; a code sketch follows below]
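A NumPy sketch of this unroll-and-multiply with the exact shapes from the figure:

    import numpy as np

    fmaps = np.random.randn(2, 2, 2)   # 2x2x2 feature maps
    W = np.random.randn(8, 3)          # 8x3 weights
    b = np.random.randn(1, 3)          # 1x3 bias

    x = fmaps.reshape(1, 8)            # unroll the feature maps to 1x8
    y = x @ W + b                      # 1x3 outputs (units)
    print(y.shape)                     # (1, 3)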


Normalization Layers (Deprecated)

Local response normalization was popular for a time but is now deprecated; more recent networks do not include these layers.

[figure credit BDTI]


- layers compute differentiable transformations

- types of layers: conv, ReLU, pool, FC

- parameters (conv, FC) or not (pool, ReLU)

- arguments like kernel size, stride, etc. (conv, pool)

Layer Review


Convnet Architecture

Input Image → stacked Conv 3x3s1, 10 / ReLU layers (Type: Conv, Kernel Size: 3x3, Stride: 1, Channels: 10, Activation: ReLU), with Max Pool 3x3s1 between groups → FC 10 → Scores

Stack convolution, non-linearity, and pooling until a global FC layer classifier.

[figure credit A. Karpathy]


Data augmentation: making much more data


Transform the training data without changing its truth:
- horizontal flips: a cat is still a cat
- random crops/scales: different views of the cat are still the cat
- relighting: a darker cat is still a cat
… and anything else you can come up with (and combinations of the above)!

[figure adapted from A. Karpathy]
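A sketch of these augmentations in NumPy (the crop fraction and brightness range are illustrative choices, not from the slides):

    import numpy as np

    def augment(img, rng):
        # img: HxWx3 array in [0, 255]; returns a transformed copy, same truth
        if rng.random() < 0.5:
            img = img[:, ::-1, :]                 # horizontal flip
        h, w = img.shape[:2]
        top  = rng.integers(0, h // 8 + 1)        # random crop offsets
        left = rng.integers(0, w // 8 + 1)
        img = img[top:top + h - h // 8, left:left + w - w // 8]
        img = img * rng.uniform(0.7, 1.3)         # relighting: scale brightness
        return np.clip(img, 0, 255)

    rng = np.random.default_rng(0)
    out = augment(np.random.rand(224, 224, 3) * 255, rng)
    print(out.shape)                              # cropped, e.g. (196, 196, 3)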


See a Net Learn to See

Let’s watch a convnet as it learns how to recognize objects in images.

MNIST demo: Try It Out
CIFAR-10 demo: Try It Out

Internal functionality


Supervised Learning

Given labeled data: (x1, y1), (x2, y2), …, (xN, yN)

Goal: find a function f such that yn = f(xn) for all n, “as well as possible”


What does “as well as possible” mean? Pick a loss function ℓ(y, ŷ): how wrong is it to predict ŷ when the true label is y? Minimize the total loss over all the data: Σn ℓ(yn, ŷn).

E.g. ℓ(y, ŷ) = ‖y − ŷ‖², the “Euclidean loss” of everyday linear regression

Supervised Loss


Parametric Learning

How do we find the label-prediction function f? Parametric answer: pick it from a family f(x) = f(x; θ) determined by a set of parameters θ.

E.g. f(x; θ) = θx, “linear prediction” (θ a matrix, x a vector). For us: f is a network, θ is a set of weights.


Parametric Supervised Learning

Altogether: our goal is to find θ to minimize the sum over the data

    Σn ℓ(yn, f(xn; θ))

where ℓ is the loss, yn the true label, f(xn; θ) the predicted label, f the model (network), and θ the parameters (weights).

Underfitting and Overfitting

underfitting: not enough parameters to model the data
overfitting: enough parameters to memorize the training set without generalizing
(the spectrum runs from fewer parameters to more parameters)

[figure credit A. Karpathy]

Regularization

How can we prevent overfitting without reducing the number of parameters? Add a regularization penalty to our loss: “complicated” solutions are made worse.

[figure credit A. Karpathy]

Regularization: Weight Decay and Dropout

Weight Decay: minimize L(θ) + λ‖θ‖² to pull the weights toward zero. λ (a scalar) is an optimization setting; pick it empirically. Aka “L2 regularization”.

Dropout: during training, randomly set a fraction p of the activations to zero. p is an optimization setting (often 0.5). Forces the model to be robust to noise. (A code sketch follows.)
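A sketch of dropout during training (“inverted” dropout, one common way to implement it):

    import numpy as np

    def dropout(activations, p=0.5, rng=np.random.default_rng()):
        # zero a random fraction p of the activations; scale the survivors
        # by 1/(1-p) so expected activations are unchanged at test time
        mask = rng.random(activations.shape) >= p
        return activations * mask / (1.0 - p)

    h = np.random.randn(8)
    print(dropout(h, p=0.5))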


Gradient Descent: Intuition

Want to minimize the “loss” function L(x; θ).

[Figure: loss curve L(x; θ) over the θ axis, showing a step from old θ to new θ]

Move against the direction of the gradient, i.e. downhill.

θ (vector): the parameter to update; x (vector): the input data (fixed on this slide)


The gradient tells you, for each element of the network parameters, how the loss changes in response to a change in that parameter.

Stochastic Gradient Descent (SGD)

Want to minimize the “loss” function L(x; θ):
1. Pick an input datum x
2. Compute the parameter gradient ∇θ L(x; θ)
3. Multiply by the learning rate η
4. Update the parameters: θ ← θ − η ∇θ L(x; θ)

(The alternative is to average the gradient over all available data, “batch gradient descent”: that’s too slow for big data!)

Why “Stochastic”?

The gradient depends on the choice of the input datum x. Choose x randomly (or just cycle through all the data in a fixed order).

SGD with Weight Decay and Momentum

v ← p v − η (∇θ L(x; θ) + λθ)
θ ← θ + v

Here λθ is the weight decay (regularization) term and p v is the momentum term (p is a number less than 1).

There are many other variants: Adam, RMSprop, AdaDelta, AdaGrad, Nesterov, ...

(a code sketch follows below)
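A minimal sketch of this optimizer on a toy problem (the quadratic loss, the data, and the hyperparameter values are illustrative assumptions, not from the slides):

    import numpy as np

    def grad_L(x, theta):
        # toy loss L(x; theta) = 0.5 * ||theta - x||^2, so the gradient is theta - x
        return theta - x

    rng = np.random.default_rng(0)
    data = rng.normal(loc=[1.0, -2.0], scale=0.1, size=(100, 2))

    theta, v = np.zeros(2), np.zeros(2)
    eta, lam, p = 0.1, 1e-4, 0.9           # learning rate, weight decay, momentum

    for step in range(1000):
        x = data[rng.integers(len(data))]  # pick an input datum at random
        g = grad_L(x, theta) + lam * theta # gradient plus weight decay
        v = p * v - eta * g                # momentum: remember past steps
        theta = theta + v                  # update parameters

    print(theta)                           # settles near the data mean [1, -2]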

Layer Gradients

Matrix multiply (y = Wx + b): ∂L/∂x = Wᵀ (∂L/∂y), ∂L/∂W = (∂L/∂y) xᵀ
ReLU (max(0, x)): the gradient is 1 where x > 0, and 0 elsewhere
Sigmoid σ(x): σ′(x) = σ(x)(1 − σ(x))

Back-propagation: The Chain Rule

[Figure: x → layer1 (θ) → loss (ℓ)]

A net is a composition of layer functions; the gradient of the net is the product of the layer gradients (the chain rule): ∂ℓ/∂θ = (∂ℓ/∂layer1)(∂layer1/∂θ).
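A sketch of the chain rule on the smallest possible net (a scalar toy example, made up for illustration):

    # net: prediction = theta * x, loss = 0.5 * (prediction - y)^2
    x, y, theta = 2.0, 3.0, 0.5

    pred = theta * x                            # forward pass through the layer
    loss = 0.5 * (pred - y) ** 2

    # backward pass: multiply the local (layer) gradients together
    dloss_dpred  = pred - y                     # gradient of loss w.r.t. output
    dpred_dtheta = x                            # gradient of layer w.r.t. theta
    dloss_dtheta = dloss_dpred * dpred_dtheta   # chain rule product
    print(dloss_dtheta)                         # (1.0 - 3.0) * 2.0 = -4.0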

Back-propagation in a Bigger Net

[Figure: forward pass: input x → layer1 (θ1) → layer2 (θ2) → output ŷ (θ3), compared against the truth y by the loss; backward pass: the error flows back through the same layers to every θ]