Lecture 13 - vision.stanford.edu/.../2016/winter1516_lecture13.pdf · Fei-Fei Li...

Lecture 13: Segmentation and Attention. Fei-Fei Li & Andrej Karpathy & Justin Johnson, 24 Feb 2016

Transcript of Lecture 13 - vision.stanford.edu/.../2016/winter1516_lecture13.pdf · Fei-Fei Li...

Page 1:

Lecture 13:

Segmentation and Attention

Page 2:

Administrative
● Assignment 3 due tonight!
● We are reading your milestones

Page 3:

Last time: Software Packages

Caffe

Torch

Theano

Lasagne

Keras

TensorFlow

Page 4:

Today

● Segmentation
  ○ Semantic Segmentation
  ○ Instance Segmentation

● (Soft) Attention
  ○ Discrete locations
  ○ Continuous locations (Spatial Transformers)

Page 5:

But first….

Page 6:

But first….


Szegedy et al, Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, arXiv 2016

New ImageNet Record today!

Page 7:

Inception-v4


V = Valid convolutions (no padding)

1x7, 7x1 filters

9 layers

Strided convolution AND max pooling
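Since the stem mixes "V" (valid, no padding) and padded convolutions, the spatial sizes can be checked with standard convolution arithmetic. A minimal sketch; the 299 → 149 example assumes the paper's 3x3 stride-2 valid conv applied to a 299x299 input:

```python
# Output-size arithmetic for valid ("V") vs. padded convolutions.
def conv_out(size, k, stride=1, pad=0):
    """Spatial output size of a convolution (floor division, as in most frameworks)."""
    return (size + 2 * pad - k) // stride + 1

# A 3x3 stride-2 valid conv shrinks 299 -> 149.
print(conv_out(299, k=3, stride=2, pad=0))  # 149
# A same-padded 3x3 stride-1 conv keeps the size: 35 -> 35.
print(conv_out(35, k=3, stride=1, pad=1))   # 35
```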

Page 8:

Inception-v4


x4

9 layers

4 x 3 layers

3 layers

Page 9:

Inception-v4


x7

9 layers

4 x 3 layers

3 layers

5 x 7 layers

4 layers

Page 10:

Inception-v4


x3

9 layers

4 x 3 layers

3 layers

5 x 7 layers

4 layers

3 x 4 layers

75 layers

Page 11:

Inception-ResNet-v2


9 layers

Page 12:

Inception-ResNet-v2


9 layers

x7

5 x 4 layers

3 layers

Page 13:

Inception-ResNet-v2


9 layers

x10

5 x 4 layers

3 layers

3 layers

10 x 4 layers

Page 14:

Inception-ResNet-v2


9 layers

x 5

5 x 3 layers

3 layers

3 layers

10 x 4 layers

5 x 4 layers

75 layers

Page 15:

Inception-ResNet-v2


Residual and non-residual versions converge to a similar value, but the residual version learns faster

Page 16:

Today

● Segmentation
  ○ Semantic Segmentation
  ○ Instance Segmentation

● (Soft) Attention
  ○ Discrete locations
  ○ Continuous locations (Spatial Transformers)

Page 17:

Segmentation

Page 18:

Computer Vision Tasks

Classification: CAT (single object)
Classification + Localization: CAT (single object)
Object Detection: CAT, DOG, DUCK (multiple objects)
Segmentation: CAT, DOG, DUCK (multiple objects)

Page 19:

Computer Vision Tasks: Classification, Classification + Localization, Object Detection, Segmentation

Classification and Classification + Localization: covered in Lecture 8

Page 20:

Computer Vision Tasks: Classification, Classification + Localization, Object Detection, Segmentation

Today: Segmentation

Page 21:

Semantic Segmentation


Label every pixel!

Don’t differentiate instances (cows)

Classic computer vision problem

Figure credit: Shotton et al, “TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context”, IJCV 2007

Page 22:

Instance Segmentation


Figure credit: Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015

Detect instances, give category, label pixels

“simultaneous detection and segmentation” (SDS)

Lots of recent work (MS-COCO)

Page 23:

Semantic Segmentation

Page 24:

Semantic Segmentation

Page 25:

Semantic Segmentation


Extract patch

Page 26:

Semantic Segmentation


CNN

Extract patch

Run through a CNN

Page 27:

Semantic Segmentation


CNN COW

Extract patch

Run through a CNN

Classify center pixel

Page 28:

Semantic Segmentation


CNN COW

Extract patch

Run through a CNN

Classify center pixel

Repeat for every pixel
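The patch-wise pipeline above can be sketched in a few lines. `classify_patch` is a hypothetical stand-in for a trained CNN; the nested loop shows why this is expensive (one forward pass per pixel):

```python
# Patch-wise semantic segmentation sketch: classify the center pixel of every
# k x k patch. classify_patch is a dummy stand-in for a trained CNN.
def classify_patch(patch):
    flat = [v for row in patch for v in row]
    return 1 if sum(flat) / len(flat) > 0 else 0  # toy "cow vs. background"

def segment_patchwise(img, k=3):
    H, W = len(img), len(img[0])
    labels = []
    for i in range(H - k + 1):              # slide over all valid patch positions
        row = []
        for j in range(W - k + 1):
            patch = [r[j:j + k] for r in img[i:i + k]]
            row.append(classify_patch(patch))  # one "CNN" call per pixel
        labels.append(row)
    return labels                            # smaller than img: no padding here

img = [[1, -1, 1, -1],
       [1,  1, 1,  1],
       [-1, 1, -1, 1],
       [1,  1, 1,  1]]
out = segment_patchwise(img)
print(len(out), len(out[0]))  # 2 2
```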

Page 29:

Semantic Segmentation


CNN

Run “fully convolutional” network to get all pixels at once

Smaller output due to pooling

Page 30:

Semantic Segmentation: Multi-Scale


Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013

Page 31:

Semantic Segmentation: Multi-Scale


Resize image to multiple scales

Page 32:

Semantic Segmentation: Multi-Scale


Resize image to multiple scales

Run one CNN per scale

Page 33:

Semantic Segmentation: Multi-Scale


Resize image to multiple scales

Run one CNN per scale

Upscale outputs and concatenate

Page 34:

Semantic Segmentation: Multi-Scale


Resize image to multiple scales

Run one CNN per scale

Upscale outputs and concatenate

External “bottom-up” segmentation

Page 35:

Semantic Segmentation: Multi-Scale


Resize image to multiple scales

Run one CNN per scale

Upscale outputs and concatenate

External “bottom-up” segmentation

Combine everything for final outputs
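A minimal sketch of the multi-scale pipeline, with `features` as a hypothetical stand-in for the shared-weight CNN and nearest-neighbor resizing in place of real image scaling:

```python
# Multi-scale sketch: run the same feature extractor at several scales,
# upscale the results to full resolution, and concatenate per pixel.
def resize(img, H2, W2):
    """Nearest-neighbor resize of a 2D list."""
    H, W = len(img), len(img[0])
    return [[img[i * H // H2][j * W // W2] for j in range(W2)] for i in range(H2)]

def features(img):
    return img  # hypothetical CNN: identity features, for illustration only

def multiscale(img, scales=(1, 2, 4)):
    H, W = len(img), len(img[0])
    maps = []
    for s in scales:
        f = features(resize(img, H // s, W // s))  # run "CNN" at this scale
        maps.append(resize(f, H, W))               # upscale back to full size
    # concatenate one feature per scale at every pixel
    return [[[m[i][j] for m in maps] for j in range(W)] for i in range(H)]

img = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
feat = multiscale(img)
print(len(feat), len(feat[0]), len(feat[0][0]))  # 4 4 3
```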

Page 36:

Semantic Segmentation: Refinement


Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014

Apply CNN once to get labels

Page 37:

Semantic Segmentation: Refinement


Apply CNN once to get labels

Apply AGAIN to refine labels

Page 38:

Semantic Segmentation: Refinement


Apply CNN once to get labels

Apply AGAIN to refine labels

And again!

Page 39:

Semantic Segmentation: Refinement


Apply CNN once to get labels

Apply AGAIN to refine labels

And again!

Same CNN weights: recurrent convolutional network

Page 40:

Semantic Segmentation: Refinement


Apply CNN once to get labels

Apply AGAIN to refine labels

And again!

Same CNN weights: recurrent convolutional network

More iterations improve results
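The recurrence can be sketched as applying one function with fixed (shared) weights to its own output; the 3x3 majority filter below is a toy stand-in for the CNN:

```python
# Iterative refinement sketch: the SAME function (shared weights) is applied
# repeatedly to its own output. The "CNN" here is a toy 3x3 majority filter
# over binary labels, just to show the recurrence.
def refine(labels):
    H, W = len(labels), len(labels[0])
    out = [[0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            votes = [labels[a][b]
                     for a in range(max(0, i - 1), min(H, i + 2))
                     for b in range(max(0, j - 1), min(W, j + 2))]
            out[i][j] = 1 if sum(votes) * 2 >= len(votes) else 0
    return out

labels = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]
for _ in range(3):           # more iterations -> smoother labels
    labels = refine(labels)  # same "weights" every time: a recurrent conv net
print(labels)  # [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
```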

Page 41:

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015

Semantic Segmentation: Upsampling

Page 42:

Semantic Segmentation: Upsampling

Learnable upsampling!

Page 43:

Semantic Segmentation: Upsampling

Page 44:

Semantic Segmentation: Upsampling

“skip connections”

Page 45:

Semantic Segmentation: Upsampling

Skip connections = Better results

“skip connections”
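A sketch of the skip-connection fusion, assuming nearest-neighbor upsampling and elementwise addition of score maps (the FCN paper uses learned upsampling and crops maps to align, omitted here):

```python
# Skip-connection sketch: upsample a coarse score map and fuse it with a
# finer score map from an earlier layer by elementwise addition.
def upsample2x(m):
    """Nearest-neighbor 2x upsampling of a 2D list."""
    return [[m[i // 2][j // 2] for j in range(2 * len(m[0]))]
            for i in range(2 * len(m))]

def fuse(coarse, fine):
    up = upsample2x(coarse)
    return [[up[i][j] + fine[i][j] for j in range(len(fine[0]))]
            for i in range(len(fine))]

coarse = [[1, 2],
          [3, 4]]                   # e.g. scores from the deepest layer
fine = [[1] * 4 for _ in range(4)]  # e.g. scores from a pool4-like layer
fused = fuse(coarse, fine)
print(fused)
```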

Page 46:

Learnable Upsampling: “Deconvolution”


Typical 3 x 3 convolution, stride 1 pad 1

Input: 4 x 4 Output: 4 x 4

Page 47:

Learnable Upsampling: “Deconvolution”


Typical 3 x 3 convolution, stride 1 pad 1

Input: 4 x 4 Output: 4 x 4

Dot product between filter and input

Page 48:

Learnable Upsampling: “Deconvolution”


Typical 3 x 3 convolution, stride 1 pad 1

Input: 4 x 4 Output: 4 x 4

Dot product between filter and input

Page 49:

Learnable Upsampling: “Deconvolution”


Typical 3 x 3 convolution, stride 2 pad 1

Input: 4 x 4 Output: 2 x 2

Page 50:

Learnable Upsampling: “Deconvolution”


Typical 3 x 3 convolution, stride 2 pad 1

Input: 4 x 4 Output: 2 x 2

Dot product between filter and input

Page 51:

Learnable Upsampling: “Deconvolution”


Typical 3 x 3 convolution, stride 2 pad 1

Input: 4 x 4 Output: 2 x 2

Dot product between filter and input

Page 52:

Learnable Upsampling: “Deconvolution”


3 x 3 “deconvolution”, stride 2 pad 1

Input: 2 x 2 Output: 4 x 4

Page 53:

Learnable Upsampling: “Deconvolution”


3 x 3 “deconvolution”, stride 2 pad 1

Input: 2 x 2 Output: 4 x 4

Input gives weight for filter

Page 54:

Learnable Upsampling: “Deconvolution”


3 x 3 “deconvolution”, stride 2 pad 1

Input: 2 x 2 Output: 4 x 4

Input gives weight for filter

Page 55:

Learnable Upsampling: “Deconvolution”


3 x 3 “deconvolution”, stride 2 pad 1

Input: 2 x 2 Output: 4 x 4

Input gives weight for filter

Sum where output overlaps

Page 56:

Learnable Upsampling: “Deconvolution”


3 x 3 “deconvolution”, stride 2 pad 1

Input: 2 x 2 Output: 4 x 4

Input gives weight for filter

Sum where output overlaps

Same as backward pass for normal convolution!

Page 57:

Learnable Upsampling: “Deconvolution”


3 x 3 “deconvolution”, stride 2 pad 1

Input: 2 x 2 Output: 4 x 4

Input gives weight for filter

Sum where output overlaps

Same as backward pass for normal convolution!

“Deconvolution” is a bad name: it is already defined as the inverse of convolution

Better names: convolution transpose, backward strided convolution, 1/2 strided convolution, upconvolution
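The slide's recipe (each input value weights a copy of the filter, and overlapping contributions are summed) can be written out directly; a 1D sketch with padding ignored:

```python
# 1D "deconvolution" (convolution transpose) sketch, ignoring padding: each
# input value places a weighted copy of the filter into the output, `stride`
# apart, and overlapping contributions are summed.
def conv_transpose1d(x, w, stride=2):
    out = [0.0] * ((len(x) - 1) * stride + len(w))
    for i, xi in enumerate(x):
        for j, wj in enumerate(w):
            out[i * stride + j] += xi * wj  # sum where copies overlap
    return out

# 2-element input, 3-tap filter, stride 2 -> 5-element output
print(conv_transpose1d([1.0, 2.0], [1.0, 1.0, 1.0]))  # [1.0, 1.0, 3.0, 2.0, 2.0]
```

The middle output element (3.0) is exactly the summed overlap of the two filter copies, which is also what the backward pass of a stride-2 convolution computes.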

Page 58:

Learnable Upsampling: “Deconvolution”


“Deconvolution” is a bad name: it is already defined as the inverse of convolution

Better names: convolution transpose, backward strided convolution, 1/2 strided convolution, upconvolution

Im et al, “Generating images with recurrent adversarial networks”, arXiv 2016

Radford et al, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, ICLR 2016

Page 59:

Learnable Upsampling: “Deconvolution”


“Deconvolution” is a bad name: it is already defined as the inverse of convolution

Better names: convolution transpose, backward strided convolution, 1/2 strided convolution, upconvolution


Great explanation in appendix

Page 60:

Semantic Segmentation: Upsampling


Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015

Page 61:

Semantic Segmentation: Upsampling


Normal VGG followed by an “upside down” VGG

6 days of training on Titan X…

Page 62:

Instance Segmentation


Page 63:

Instance Segmentation


Detect instances, give category, label pixels

“simultaneous detection and segmentation” (SDS)

Lots of recent work (MS-COCO)

Page 64:

Instance Segmentation


Hariharan et al, “Simultaneous Detection and Segmentation”, ECCV 2014

Similar to R-CNN, but with segments

Page 65:

Instance Segmentation


External Segment proposals

Similar to R-CNN, but with segments

Page 66:

Instance Segmentation


External Segment proposals

Similar to R-CNN, but with segments

Page 67:

Instance Segmentation


External Segment proposals

Mask out background with mean image

Similar to R-CNN, but with segments

Page 68:

Instance Segmentation


External Segment proposals

Mask out background with mean image

Similar to R-CNN, but with segments

Page 69:

Instance Segmentation


External Segment proposals

Mask out background with mean image

Similar to R-CNN, but with segments
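The background-masking step can be sketched directly; the values are made-up single-channel intensities, and `mean` stands in for the dataset-mean pixel:

```python
# SDS-style input preparation sketch: within a cropped box, replace background
# pixels (mask == 0) with the dataset mean so only the segment's appearance
# reaches the region CNN.
def mask_with_mean(patch, mask, mean):
    H, W = len(patch), len(patch[0])
    return [[patch[i][j] if mask[i][j] else mean
             for j in range(W)] for i in range(H)]

patch = [[10, 20],
         [30, 40]]
mask = [[1, 0],
        [0, 1]]
masked = mask_with_mean(patch, mask, mean=0)
print(masked)  # [[10, 0], [0, 40]]
```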

Page 70:

Instance Segmentation: Hypercolumns


Hariharan et al, “Hypercolumns for Object Segmentation and Fine-grained Localization”, CVPR 2015

Page 71:

Instance Segmentation: Hypercolumns

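A sketch of building one pixel's hypercolumn, with two made-up feature maps standing in for conv-layer outputs at different depths:

```python
# Hypercolumn sketch: for one pixel, stack the feature vectors from several
# layers of a CNN, each looked up at the corresponding (coarser) location.
def hypercolumn(maps, i, j, H, W):
    col = []
    for fmap in maps:                             # fmap: h x w grid of channel lists
        h, w = len(fmap), len(fmap[0])
        col.extend(fmap[i * h // H][j * w // W])  # nearest-neighbor lookup
    return col

shallow = [[[1, 2], [3, 4]],
           [[5, 6], [7, 8]]]   # 2x2 map with 2 channels (fine, shallow layer)
deep = [[[9, 10, 11]]]         # 1x1 map with 3 channels (coarse, deep layer)
col = hypercolumn([shallow, deep], i=3, j=3, H=4, W=4)
print(col)  # [7, 8, 9, 10, 11]
```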

Page 72:

Instance Segmentation: Cascades


Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015

Similar to Faster R-CNN

Won COCO 2015 challenge (with ResNet)

Page 73:

Instance Segmentation: Cascades


Similar to Faster R-CNN

Won COCO 2015 challenge (with ResNet)

Page 74:

Instance Segmentation: Cascades


Similar to Faster R-CNN

Region proposal network (RPN)

Won COCO 2015 challenge (with ResNet)

Page 75:

Instance Segmentation: Cascades

75

Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015

Similar to Faster R-CNN

Region proposal network (RPN)

Reshape boxes to fixed size,figure / ground logistic regression

Won COCO 2015 challenge (with ResNet)

Page 76: Lecture 13 - vision.stanford.eduvision.stanford.edu/.../2016/winter1516_lecture13.pdf · Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 13 - 24 Feb 2016 Learnable Upsampling:

Lecture 13 - Fei-Fei Li & Andrej Karpathy & Justin Johnson 24 Feb 2016

Instance Segmentation: Cascades

76

Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015

Similar to Faster R-CNN

Won COCO 2015 challenge (with ResNet)

Region proposal network (RPN)

Reshape boxes to fixed size,figure / ground logistic regression

Mask out background, predict object class

Page 77: Lecture 13 - vision.stanford.eduvision.stanford.edu/.../2016/winter1516_lecture13.pdf · Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 13 - 24 Feb 2016 Learnable Upsampling:

Lecture 13 - Fei-Fei Li & Andrej Karpathy & Justin Johnson 24 Feb 2016

Instance Segmentation: Cascades

77

Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015

Similar to Faster R-CNN

Won COCO 2015 challenge (with ResNet)

Region proposal network (RPN)

Reshape boxes to fixed size,figure / ground logistic regression

Mask out background, predict object class

Learn entire model end-to-end!

Page 78:

Instance Segmentation: Cascades

Dai et al, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, arXiv 2015

(Figure: example predictions vs. ground truth)

Page 79:

Segmentation Overview

● Semantic segmentation
○ Classify all pixels
○ Fully convolutional models, downsample then upsample
○ Learnable upsampling: fractionally strided convolution
○ Skip connections can help
● Instance segmentation
○ Detect instance, generate mask
○ Similar pipelines to object detection
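The "learnable upsampling: fractionally strided convolution" bullet can be made concrete with a minimal 1-D numpy sketch; the function name and filter values are illustrative, not from the lecture:

```python
import numpy as np

def fractionally_strided_conv1d(x, w, stride=2):
    """Upsample x by `stride` with a learnable filter w (a.k.a. transposed convolution).

    Each input value paints a scaled copy of the filter into the output,
    spaced `stride` apart; overlapping contributions are summed.
    """
    out = np.zeros(stride * (len(x) - 1) + len(w))
    for i, v in enumerate(x):
        out[i * stride : i * stride + len(w)] += v * w
    return out

x = np.array([1.0, 2.0, 3.0])          # low-resolution feature map
w = np.array([0.5, 1.0, 0.5])          # 3-tap filter (learned in practice)
y = fractionally_strided_conv1d(x, w)  # length 2*(3-1)+3 = 7
```

Because the filter weights are ordinary parameters, gradients flow through this upsampling and it can be trained end-to-end like any other layer.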

Page 80:

Attention Models

Pages 81-87:

Recall: RNN for Captioning

CNN: Image (H x W x 3) → Features: D → initial hidden state h0 (Hidden state: H).
At each timestep the RNN updates its hidden state (h1, h2, …) and produces a distribution over the vocab (d1, d2, …), from which the next word is drawn (y1 = first word, y2 = second word, …).

RNN only looks at whole image, once

What if the RNN looks at different parts of the image at each timestep?

Pages 88-98:

Soft Attention for Captioning

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

CNN: Image (H x W x 3) → Features: L x D → initial hidden state h0.
At each timestep the hidden state produces a distribution a over the L locations; a weighted combination of the features gives weighted features z (D-dimensional), which are fed back into the RNN together with the previous word. Each hidden state also produces a distribution d over the vocab, from which the next word y is drawn (y1 = first word, and so on).

Guess which framework was used to implement?
Crazy RNN = Theano
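One timestep of this soft-attention loop can be sketched in a few lines of numpy. The weights here are random placeholders (W_att is a hypothetical attention projection, not the trained model from Xu et al.):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
L, D, H = 4, 6, 5
features = rng.normal(size=(L, D))  # CNN features: L locations x D dims
h = rng.normal(size=H)              # current RNN hidden state
W_att = rng.normal(size=(H, L))     # hypothetical attention projection

a = softmax(h @ W_att)              # distribution over L locations
z = a @ features                    # weighted features: D, fed to the next RNN step
```

In the real model, z and the previous word embedding both enter the next RNN update, so the network can look at a different image region every timestep.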

Pages 99-102:

Soft vs Hard Attention

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

CNN: Image (H x W x 3) → grid of features a, b, c, d (each D-dimensional).
From RNN: distribution over grid locations pa, pb, pc, pd with pa + pb + pc + pd = 1.
The result is a context vector z (D-dimensional).

Soft attention:
Summarize ALL locations: z = pa·a + pb·b + pc·c + pd·d
Derivative dz/dp is nice! Train with gradient descent.

Hard attention:
Sample ONE location according to p; z = that vector.
With argmax, dz/dp is zero almost everywhere …
Can’t use gradient descent; need reinforcement learning.
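A minimal numpy sketch of the soft/hard distinction, assuming a fixed distribution p over the four grid cells (the feature values are random illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 3))     # vectors a, b, c, d (D = 3)
p = np.array([0.1, 0.2, 0.3, 0.4])  # pa + pb + pc + pd = 1

# Soft attention: expectation over ALL locations -> differentiable in p
z_soft = p @ feats

# Hard attention: sample ONE location -> z is one row, not differentiable in p
idx = rng.choice(4, p=p)
z_hard = feats[idx]
```

z_soft is a smooth function of p, so gradient descent works; z_hard depends on a discrete sample, which is why the hard version needs reinforcement-learning-style estimators instead.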

Pages 103-105:

Soft Attention for Captioning

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

(Figures: example captions with attention maps, comparing soft attention vs. hard attention)

Attention constrained to fixed grid! We’ll come back to this ….

Pages 106-109:

Soft Attention for Translation

“Mi gato es el mejor” -> “My cat is the best”

At each decoder step, a distribution over input words determines which source words to attend to.

Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015
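A sketch of that "distribution over input words": each encoder state is scored against the current decoder state and the scores are normalized. Dot-product scoring is used here for brevity; Bahdanau et al. actually score with a small MLP. All numbers are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
D = 8
# Five source words ("Mi gato es el mejor"), each encoded as a D-dim vector
encoder_states = rng.normal(size=(5, D))
decoder_state = rng.normal(size=D)      # current decoder hidden state

scores = encoder_states @ decoder_state # one score per source word
attn = softmax(scores)                  # distribution over input words
context = attn @ encoder_states        # weighted summary, fed to the decoder
```

When translating "gato" → "cat", a trained model would put most of attn's mass on the encoder state for "gato".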

Page 110:

Soft Attention for Everything!

Speech recognition, attention over input sounds:
- Chan et al, “Listen, Attend, and Spell”, arXiv 2015
- Chorowski et al, “Attention-based models for Speech Recognition”, NIPS 2015

Image, question to answer, attention over image:
- Xu and Saenko, “Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering”, arXiv 2015
- Zhu et al, “Visual7W: Grounded Question Answering in Images”, arXiv 2015

Video captioning, attention over input frames:
- Yao et al, “Describing Videos by Exploiting Temporal Structure”, ICCV 2015

Machine translation, attention over input:
- Luong et al, “Effective Approaches to Attention-based Neural Machine Translation”, EMNLP 2015

Page 111:

Attending to arbitrary regions?

Image: H x W x 3 → Features: L x D

The attention mechanism from Show, Attend, and Tell only lets us softly attend to fixed grid positions … can we do better?

Pages 112-114:

Attending to Arbitrary Regions

Graves, “Generating Sequences with Recurrent Neural Networks”, arXiv 2013

- Read text, generate handwriting using an RNN
- Attend to arbitrary regions of the output by predicting params of a mixture model

Which are real and which are generated? (Figure: handwriting samples labeled REAL and GENERATED)

Pages 115-116:

Attending to Arbitrary Regions: DRAW

Gregor et al, “DRAW: A Recurrent Neural Network For Image Generation”, ICML 2015

Classify images by attending to arbitrary regions of the input

Generate images by attending to arbitrary regions of the output


Page 118:

Attending to Arbitrary Regions: Spatial Transformer Networks

Attention mechanism similar to DRAW, but easier to explain

Jaderberg et al, “Spatial Transformer Networks”, NIPS 2015

Pages 119-126:

Spatial Transformer Networks

Jaderberg et al, “Spatial Transformer Networks”, NIPS 2015

Input image (H x W x 3) plus box coordinates (xc, yc, w, h) → cropped and rescaled image (X x Y x 3).

Can we make this function differentiable?

Idea: Function mapping pixel coordinates (xt, yt) of output to pixel coordinates (xs, ys) of input.

Repeat for all pixels in output to get a sampling grid.

Then use bilinear interpolation to compute output.

Network attends to input by predicting the box coordinates (xc, yc, w, h).
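The bilinear interpolation step for a single output pixel can be sketched as follows (single-channel image, no batching; a simplified illustration, not the paper's full sampler):

```python
import numpy as np

def bilinear_sample(img, xs, ys):
    """Read img (H x W, indexed img[y, x]) at real-valued coords (xs, ys).

    The output is a smooth function of the sampling coordinates, so
    gradients can flow back into whatever network predicted them.
    """
    H, W = img.shape
    x0, y0 = int(np.floor(xs)), int(np.floor(ys))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = xs - x0, ys - y0
    return ((1 - wx) * (1 - wy) * img[y0, x0] +
            wx * (1 - wy) * img[y0, x1] +
            (1 - wx) * wy * img[y1, x0] +
            wx * wy * img[y1, x1])

img = np.arange(16.0).reshape(4, 4)
v = bilinear_sample(img, 1.5, 2.5)  # halfway between 4 neighboring pixels
```

Each output pixel blends its four nearest input pixels, weighted by how close the sampling point is to each; this is what makes the crop differentiable end-to-end.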

Pages 127-130:

Spatial Transformer Networks

Input: full image → Output: region of interest from input.

A small localization network predicts a transform θ.

Grid generator uses θ to compute a sampling grid.

Sampler uses bilinear interpolation to produce the output.
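The grid generator plus sampler can be sketched for the simple box-crop case. Nearest-neighbor reads stand in for the bilinear sampler to keep it short, and crop_with_grid is a hypothetical helper name, not API from the paper:

```python
import numpy as np

def crop_with_grid(img, box, out_shape):
    """Crop the box (xc, yc, w, h) out of img (indexed img[y, x]) into a Y x X output.

    For every output pixel (xt, yt), the grid generator computes the source
    pixel (xs, ys) inside the box; the sampler then reads the input there.
    """
    xc, yc, w, h = box
    Y, X = out_shape
    out = np.empty((Y, X))
    for yt in range(Y):
        for xt in range(X):
            # Map output coords (normalized to [0, 1]) onto the box in the input
            xs = xc - w / 2 + w * xt / (X - 1)
            ys = yc - h / 2 + h * yt / (Y - 1)
            out[yt, xt] = img[int(round(ys)), int(round(xs))]
    return out

img = np.arange(100.0).reshape(10, 10)
crop = crop_with_grid(img, box=(5.0, 5.0, 4.0, 4.0), out_shape=(3, 3))
```

In the full model, the box (or a general affine θ) comes from the localization network, and replacing the rounded reads with bilinear interpolation makes the whole pipeline differentiable.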

Page 131:

Spatial Transformer Networks

Differentiable “attention / transformation” module

Insert spatial transformers into a classification network and it learns to attend and transform the input


Page 133:

Attention Recap

● Soft attention:
○ Easy to implement: produce distribution over input locations, reweight features and feed as input
○ Attend to arbitrary input locations using spatial transformer networks
● Hard attention:
○ Attend to a single input location
○ Can’t use gradient descent!
○ Need reinforcement learning!