Deep Learning Demystified and GPU Dataflow Programming · 2017-12-23

Vladimir Golkov, Technical University of Munich, Computer Vision Group

Transcript of the slides:

Page 1: Deep Learning

Vladimir Golkov

Technical University of Munich

Computer Vision Group


Page 2: 1D Input, 1D Output

[Figure: input vs. target]

Page 3: 2D Input, 1D Output: Data Distribution Complexity

Imagine many dimensions

(data occupies sparse entangled regions)

Deep network: sequence of (simple) nonlinear disentangling transformations

(Transformation parameters are optimization variables)


Page 4: Nonlinear Coordinate Transformation

http://cs.stanford.edu/people/karpathy/convnetjs/

Dimensionality may change!

Page 5: Sequence of Simple Nonlinear Coordinate Transformations

Rectified linear units produce exactly such kinks

[Figure: linear separation of the purple and white sheet]
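As a tiny numerical illustration of such kinks, here is a minimal NumPy sketch (the weight matrix, biases and sample points are arbitrary choices, used only to show that a ReLU layer $x \mapsto \max\{0, Wx + b\}$ is piecewise linear and folds the input plane along the lines where the pre-activations are zero):

```python
import numpy as np

# A ReLU layer x -> max(0, W x + b) is piecewise linear: each output
# coordinate has a kink along the line where its pre-activation is zero.
W = np.array([[1.0, 1.0],
              [1.0, -1.0]])
b = np.array([0.0, 0.0])

points = np.array([[-1.0,  0.5],   # 2D input points on both sides of the kinks
                   [ 1.0,  0.5],
                   [-1.0, -0.5],
                   [ 1.0, -0.5]])
print(np.maximum(0, points @ W.T + b))
```

Points on the negative side of both kinks collapse to the origin, while the others are mapped linearly; this is the "folding" behavior sketched in the figures.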

Page 6: Sequence of Simple Nonlinear Coordinate Transformations

[Figure: linear separation of dough and butter]

Page 7: Sequence of Simple Nonlinear Coordinate Transformations


Data is sparse (almost lower-dimensional)

Page 8: The Increasing Complexity of Features

[Zeiler & Fergus, ECCV 2014]

Increasing dimensionality, receptive field, invariance, complexity

[Figure: feature spaces of successive layers]

Input features: RGB (“by design”)

Layer 1 feature-space coordinates: 45° edge yes/no; green patch yes/no

Layer 2 / Layer 3 feature-space coordinates (“by convergence”): person yes/no; car wheel yes/no

Page 9: Network Architecture: Your Decision!

Informed approach: if I were to choose layer-wise features by hand in a smart, optimal way, which features, how many features, and with which receptive fields would I choose?

Bottom-up approach: make the dataset simple (e.g. simple samples and/or few samples), get a simple network (few layers, few neurons/filters) to work at least on the training set, then re-train increasingly complex networks on increasingly complex data.

Top-down approach: use a network architecture that is known to work well on similar data, get it to work on your data, then tune the architecture if necessary.

You define the architecture; the network learns.

Page 10: FORWARD PASS (DATA TRANSFORMATION)

Page 11: Fully-Connected Layer

$x^{(0)}$ is the input feature vector of the neural network (one sample).

$x^{(L)}$ is the output vector of the neural network with $L$ layers.

Layer number $l$ has:

• Inputs (usually $x^{(l-1)}$, i.e. the outputs of layer number $l-1$)

• Weight matrix $W^{(l)}$ and bias vector $b^{(l)}$, both trained (e.g. with stochastic gradient descent) such that the network output $x^{(L)}$ for the training samples minimizes some objective (loss)

• Nonlinearity $s_l$ (fixed in advance, for example $\mathrm{ReLU}(z) := \max\{0, z\}$)

  • Quiz: why is nonlinearity useful?

• Output $x^{(l)}$ of layer $l$

Transformation from $x^{(l-1)}$ to $x^{(l)}$ performed by layer $l$:

$x^{(l)} = s_l\!\left( W^{(l)} x^{(l-1)} + b^{(l)} \right)$


Page 12: Example

$W^{(l)} = \begin{pmatrix} 0 & 0.1 & -1 \\ -0.2 & 0 & 1 \end{pmatrix}, \quad x^{(l-1)} = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}, \quad b^{(l)} = \begin{pmatrix} 0 \\ 1.2 \end{pmatrix}$

$W^{(l)} x^{(l-1)} + b^{(l)} = \begin{pmatrix} 0 \cdot 1 + 0.1 \cdot 2 - 1 \cdot 3 + 0 \\ -0.2 \cdot 1 + 0 \cdot 2 + 1 \cdot 3 + 1.2 \end{pmatrix} = \begin{pmatrix} -2.8 \\ 4 \end{pmatrix}$

[Figure: a two-neuron fully-connected layer with inputs 1, 2, 3, biases $b^{(l)}_1 = 0$ and $b^{(l)}_2 = 1.2$, and outputs $-2.8$ and $4$]
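The layer formula and this worked example can be checked with a few lines of NumPy. A minimal sketch (note that the slide's numbers are the pre-activation $W^{(l)} x^{(l-1)} + b^{(l)}$; applying the ReLU nonlinearity, one possible choice for $s_l$, would additionally clamp $-2.8$ to $0$):

```python
import numpy as np

def relu(z):
    # ReLU(z) := max{0, z}, applied elementwise
    return np.maximum(0, z)

def fully_connected(x_prev, W, b, nonlinearity=relu):
    # x^(l) = s_l(W^(l) x^(l-1) + b^(l))
    return nonlinearity(W @ x_prev + b)

W = np.array([[ 0.0, 0.1, -1.0],
              [-0.2, 0.0,  1.0]])   # weight matrix W^(l), shape (2, 3)
b = np.array([0.0, 1.2])            # bias vector b^(l)
x_prev = np.array([1.0, 2.0, 3.0])  # input x^(l-1)

print(W @ x_prev + b)                 # pre-activation, approximately [-2.8, 4.0]
print(fully_connected(x_prev, W, b))  # after ReLU, approximately [0.0, 4.0]
```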

Page 13: Structured Data

• “Zero-dimensional” data: multilayer perceptron

• Structured data: translation-covariant operations

• Neighborhood structure: convolutional networks (2D/3D images, 1D bio. sequences, …)

• Sequential structure (memory): recurrent networks (1D text, 1D audio, …)

[Figure: inputs and outputs for each of these architecture types]

Page 14: 2D Convolutional Layer

Appropriate for 2D structured data (e.g. images) where we want:

• Locality of feature extraction (far-away pixels do not influence the local output)

• Translation-equivariance (shifting the input in space (the $i, j$ dimensions) yields the same output, shifted in the same way)

$x^{(l)}_{i,j,k} = s_l\!\left( b^{(l)}_k + \sum_{\tilde{i},\tilde{j},\tilde{k}} w^{(l)}_{i-\tilde{i},\, j-\tilde{j},\, \tilde{k},\, k} \, x^{(l-1)}_{\tilde{i},\tilde{j},\tilde{k}} \right)$

• the size of $W$ along the $i, j$ dimensions is called the “filter size”

• the size of $W$ along the $\tilde{k}$ dimension is the number of input channels (e.g. three (red, green, blue) in the first layer)

• the size of $W$ along the $k$ dimension is the number of filters (the number of output channels)

• Equivalent for 1D, 3D, ...

• http://cs231n.github.io/assets/conv-demo/

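A minimal NumPy sketch of this formula, assuming "valid" padding and stride 1, and computing cross-correlation (i.e. without flipping the filter), as most deep learning frameworks do; the function name and the array shapes are only illustrative:

```python
import numpy as np

def conv2d_layer(x_prev, w, b, nonlinearity=lambda z: np.maximum(0, z)):
    """Naive 2D convolutional layer ('valid' padding, stride 1).

    x_prev: input of shape (H, W, C_in)            -- x^(l-1)
    w:      filters of shape (fh, fw, C_in, C_out) -- w^(l)
    b:      biases of shape (C_out,)               -- b^(l)
    """
    fh, fw, c_in, c_out = w.shape
    H, W, _ = x_prev.shape
    out = np.empty((H - fh + 1, W - fw + 1, c_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x_prev[i:i + fh, j:j + fw, :]   # local receptive field
            # sum over spatial offsets and input channels, for every filter k
            out[i, j, :] = b + np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return nonlinearity(out)

# e.g. an 8x8 RGB input with four 3x3 filters:
x = np.random.rand(8, 8, 3)
w = np.random.randn(3, 3, 3, 4) * 0.1
b = np.zeros(4)
print(conv2d_layer(x, w, b).shape)  # (6, 6, 4)
```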

Page 15: Convolutional Network

Interactive: http://scs.ryerson.ca/~aharley/vis/conv/flat.html

Page 16: Nonlinearity

Page 17: Loss Functions

N-class classification:
• N outputs
• nonlinearity in last layer: softmax
• loss: categorical cross-entropy between outputs $x^{(L)}$ and targets $t$ (sum over all training samples)

2-class classification:
• 1 output
• nonlinearity in last layer: sigmoid
• loss: binary cross-entropy between outputs $x^{(L)}$ and targets $t$ (sum over all training samples)

2-class classification (alternative formulation):
• 2 outputs
• nonlinearity in last layer: softmax
• loss: categorical cross-entropy between outputs $x^{(L)}$ and targets $t$ (sum over all training samples)

Many regression tasks:
• linear output in last layer
• loss: mean squared error between outputs $x^{(L)}$ and targets $t$ (sum over all training samples)

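A small NumPy sketch of the first two output/loss pairings (softmax with categorical cross-entropy, sigmoid with binary cross-entropy). The small constant inside the logarithms is only for numerical safety; deep learning frameworks provide fused, more stable versions of these losses:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(x_L, t):
    # x_L: softmax outputs, shape (n_samples, n_classes); t: one-hot targets
    return -np.sum(t * np.log(x_L + 1e-12))          # sum over all training samples

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(x_L, t):
    # x_L, t: shape (n_samples,), targets in {0, 1}
    return -np.sum(t * np.log(x_L + 1e-12) + (1 - t) * np.log(1 - x_L + 1e-12))

# 3-class example: two samples, pre-activations from the last layer
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2,  3.0]])
targets = np.array([[1, 0, 0],
                    [0, 0, 1]], dtype=float)
print(categorical_cross_entropy(softmax(logits), targets))
```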

Page 18: Neural Network Training Procedure

• Fix the number $L$ of layers
• Fix the sizes of the weight arrays and bias vectors
  • For a fully-connected layer, this corresponds to the “number of neurons”
• Fix the nonlinearities
• Initialize the weights and biases with random numbers
• Repeat:
  • Select a mini-batch (i.e. a small subset) of training samples
  • Compute the gradient of the loss with respect to all trainable parameters (all weights and biases)
    • Use the chain rule (“error backpropagation”) to compute the gradient for hidden layers
  • Perform a gradient-descent step (or similar) towards minimizing the error
    • (Called “stochastic” gradient descent because every mini-batch is a random subset of the entire training set)
• Important hyperparameter: learning rate (i.e. step length factor)
• “Early stopping”: stop when the loss on a validation set starts increasing (to avoid overfitting)

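A minimal sketch of this loop, assuming a toy dataset and a single fully-connected layer with a sigmoid output (so that the chain rule fits in two lines); the mini-batch size and learning rate are arbitrary placeholder choices, and early stopping on a validation set is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 2 input features, binary targets (replace with your own dataset)
X = rng.normal(size=(200, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(float)

# Initialize weights and biases with random numbers
W = rng.normal(scale=0.1, size=(2,))
b = 0.0
learning_rate = 0.1                 # important hyperparameter (step length factor)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(100):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), 32):               # mini-batches of 32 samples
        idx = perm[start:start + 32]
        x_batch, t_batch = X[idx], t[idx]

        y = sigmoid(x_batch @ W + b)                  # forward pass
        # gradient of binary cross-entropy w.r.t. W and b (chain rule)
        err = y - t_batch
        grad_W = x_batch.T @ err / len(idx)
        grad_b = err.mean()

        W -= learning_rate * grad_W                   # gradient-descent step
        b -= learning_rate * grad_b
```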

Page 19: Data Representation: Your Decision!

• The data representation should be natural (do not “outsource” known data transformations to the learning of the mapping)
• Make it easy for the network
• For angles, we use sine and cosine to avoid the jump from 360° to 0°
  • Redundancy is okay!
• Fair scaling of features (and of initial weights and learning rate) to facilitate optimization
• Data augmentation using natural assumptions
• Features from different distributions or missing features: use several disentangled inputs to tell the network!
• Trade-off: the more a network should be able to do, the more data and/or better techniques are required
• https://en.wikipedia.org/wiki/Statistical_data_type
  • Categorical variable: one-hot encoding (see the sketch below)
  • Ordinal variable: cumulative sum of the one-hot encoding [cf. Jianlin Cheng, arXiv:0704.1028]
  • etc.

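A few of these representation choices as a minimal NumPy sketch (the function names are only illustrative):

```python
import numpy as np

def encode_angle(degrees):
    # Represent an angle by (sin, cos) so that 359° and 1° end up close together
    radians = np.deg2rad(degrees)
    return np.array([np.sin(radians), np.cos(radians)])

def one_hot(category, num_categories):
    # Categorical variable, e.g. category 2 of 5 -> [0, 0, 1, 0, 0]
    v = np.zeros(num_categories)
    v[category] = 1.0
    return v

def ordinal(level, num_levels):
    # Ordinal variable: cumulative sum of the one-hot encoding,
    # e.g. level 2 of 5 -> cumsum([0, 0, 1, 0, 0]) = [0, 0, 1, 1, 1]
    return np.cumsum(one_hot(level, num_levels))

print(encode_angle(359), encode_angle(1))   # nearly identical vectors
print(one_hot(2, 5))                        # [0. 0. 1. 0. 0.]
print(ordinal(2, 5))                        # [0. 0. 1. 1. 1.]
```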

Page 20: Regularization to Avoid Overfitting

Impose meaningful invariances/assumptions about the mapping in a hard or soft manner:

• Limited complexity of the model: few layers, few neurons, weight sharing
• Locality and shift-equivariance of feature extraction: ConvNets
• Exact spatial locations of features don’t matter: pooling; strided convolutions
  • http://cs231n.github.io/assets/conv-demo/
• Deep features shouldn’t strongly rely on each other: dropout (randomly setting some deep features to zero during training; see the sketch below)
• Data augmentation:
  • Known meaningful transformations of training samples
  • Random noise (e.g. dropout) in the first or hidden layers
• Optimization algorithm tricks, e.g. early stopping

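The dropout mentioned in the list above can be sketched in a few lines of NumPy. This is the common "inverted dropout" variant, where activations are rescaled during training so that nothing needs to change at test time; the drop probability is a hyperparameter:

```python
import numpy as np

def dropout(x, drop_probability=0.5, training=True, rng=np.random.default_rng()):
    """Inverted dropout: randomly zero out features during training.

    Scaling by 1/(1 - p) keeps the expected activation the same,
    so the layer is left unchanged at test time (training=False).
    """
    if not training or drop_probability == 0.0:
        return x
    keep = rng.random(x.shape) >= drop_probability   # random mask per feature
    return x * keep / (1.0 - drop_probability)

# e.g. applied to the output of a hidden layer during training:
hidden = np.array([0.3, 1.2, 0.0, 2.5, 0.7])
print(dropout(hidden, drop_probability=0.5))
```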

Page 21: WHAT DID THE NETWORK LEARN?

Page 22: Convolutional Network

Interactive: http://scs.ryerson.ca/~aharley/vis/conv/flat.html

Page 23:

• Visualization: creative, no canonical way

• Look at standard networks to gain intuition

Page 24: Image Reconstruction from Deep-Layer Features

Inverse problem: loss in feature space [Mahendran & Vedaldi, CVPR 2015]

Another network: loss in image space [Dosovitskiy & Brox, CVPR 2016]
(not the entire information content can be retrieved; e.g. many solutions may be “known”, but only their “least-squares compromise” can be shown)

Another network: loss in feature space + adversarial loss + loss in image space [Dosovitskiy & Brox, NIPS 2016]
(the generative model “improvises” realistic details)

Page 25: Reasons for Activations of Each Neuron

[Selvaraju et al., arXiv 2016]

Page 26: Network Bottleneck

Page 27: Network Bottleneck

Page 28: An Unpopular and Counterintuitive Fact

[Figure: the values 2, 3, 5, 7, 11 pass through a one-neuron bottleneck as the single number 0.2030507011 and come out again as 2, 3, 5, 7, 11]

• Any amount of information can pass through a layer with just one neuron! (“smuggled” as fractional digits; see the sketch below)

• Quiz time: so why don’t we have one-neuron layers?

• Learning to entangle (before) and disentangle (after) is difficult for the usual reasons:
  • The solution is not the global optimum and not perfect
  • The optimum on the training set ≠ the optimum on the test set

• The difficulty of entangling/disentangling is used to our advantage: hard bottleneck (few neurons), soft bottleneck (additional loss terms):
  • dimensionality reduction
  • regularizing effect
  • understanding of the intrinsic factors of variation of the data

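The "smuggling" above can be made concrete with a toy encoder/decoder that packs small integers into the fractional digits of one number (two digits per value here, a slightly different digit layout than the number shown on the slide). A real network would of course have to learn such an entangling and disentangling, which is exactly what makes bottlenecks hard:

```python
def smuggle(values, digits_per_value=2):
    # Pack several small integers into the fractional digits of one number,
    # e.g. [2, 3, 5, 7, 11] -> 0.0203050711
    packed = "".join(f"{v:0{digits_per_value}d}" for v in values)
    return float("0." + packed)

def unsmuggle(number, count, digits_per_value=2):
    # Recover the integers from the fractional digits
    digits = f"{number:.{count * digits_per_value}f}".split(".")[1]
    return [int(digits[i:i + digits_per_value])
            for i in range(0, len(digits), digits_per_value)]

x = [2, 3, 5, 7, 11]
bottleneck = smuggle(x)                # one "neuron" carries all five values
print(bottleneck)                      # 0.0203050711
print(unsmuggle(bottleneck, len(x)))   # [2, 3, 5, 7, 11]
```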

Page 29: Cost Functions: Your Decision!

• Appropriate measure of quality
• Desired properties of the output (e.g. image smoothness, anatomical plausibility, ...)
• Desired properties of the mapping (e.g. $f(x_1) \approx f(x_2)$ for $x_1 \approx x_2$)
• Weird data distributions: class imbalance, domain adaptation, ...
• If the derivative is zero on more than a null set, better use a “smoothed” version
• Apart from that, all you need is subdifferentiability almost everywhere (not even everywhere)
  • i.e. kinks and jumps are OK
  • e.g. ReLU


Page 30: THANK YOU!