Page 1: CS60010: Deep Learning sudeshna/courses/DL18/... · Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei Show and Tell: A Neural Image Caption

CS60010: Deep Learning Recurrent Neural Network

Sudeshna Sarkar Spring 2018

13 Feb 2018

Page 2:

ATTENTION MODEL

Page 3:

Early attention models

• Larochelle and Hinton, 2010, “Learning to combine foveal glimpses with a third-order Boltzmann machine”

• Misha Denil et al, 2011, “Learning where to Attend with Deep Architectures for Image Tracking”

2014: Neural Translation Breakthroughs

• Devlin et al, ACL’2014

• Cho et al, EMNLP’2014

• Bahdanau, Cho & Bengio, arXiv Sept. 2014

• Jean, Cho, Memisevic & Bengio, arXiv Dec. 2014

• Sutskever et al, NIPS’2014

Page 4:

Other Applications

• Ba et al 2014, Visual attention for recognition

• Mnih et al 2014, Visual attention for recognition

• Chorowski et al, 2014, Speech recognition

• Graves et al 2014, Neural Turing machines

• Yao et al 2015, Video description generation

• Vinyals et al, 2015, Conversational Agents

• Xu et al 2015, Image caption generation

• Xu et al 2015, Visual Question Answering

Page 5:

Soft vs Hard Attention Models

Hard attention:

• Attend to a single input location.

• Can’t use gradient descent.

Soft attention:

• Compute a weighted combination (attention) over some inputs using an attention network.

• Can use backpropagation to train end-to-end.

Page 6:

Problem With Long Sequences

• Consider the encoder-decoder model.

• It suffers from the constraint that all input sequences are forced to be encoded into a fixed-length internal vector.

Attention within Sequences

• Keep the intermediate outputs from the encoder LSTM from each step of the input sequence, and train the model to pay selective attention to these inputs and relate them to items in the output sequence.

Page 7:

Soft Attention for Translation

“I love coffee” -> “Me gusta el café”

Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015

Page 8:

Based on cs231n by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Soft Attention for Translation

“I love coffee” -> “Me gusta el café”

Distribution over input words

Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015

Page 12:

• Each time the model generates a word in a translation, it (soft) searches for a set of positions in a source sentence where the most relevant information is concentrated.

• The model then predicts a target word based on the context vectors associated with these source positions and all the previous generated target words.

• … it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation.

Page 13:

NMT

Bahdanau, ICLR’15

Page 14:

Learning to Align and Translate jointly

[Figure: bidirectional encoder over inputs x_1 … x_T with forward and backward states h_1 … h_T; attention weights α_{t,1} … α_{t,T} are combined (+) and fed to the decoder, which goes from y_{t-1}, s_{t-1} to y_t, s_t]

Annotation vector: $h_j = [\overrightarrow{h}_j ; \overleftarrow{h}_j]$

A set of annotation vectors $\{h_1, h_2, \ldots, h_T\}$

For each target word $y_t$:

1. Compute $\alpha_{t,j} = f(y_{t-1}, h_j, s_{t-1})$

   • $\sum_j \alpha_{t,j} = 1$

   • $f$ is a feedforward neural network

2. Get a context vector $c_t = \sum_j \alpha_{t,j} h_j$

Train the model with SGD and backpropagation
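The two numbered steps on this slide can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the one-hidden-layer form of f, the softmax used to make the weights sum to 1, and all weight names and shapes (W_h, W_s, W_y, v) are assumptions chosen for the example.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_step(h, s_prev, y_prev, W_h, W_s, W_y, v):
    """One decoder step of soft attention.

    h:      (T, d_h) annotation vectors h_1..h_T
    s_prev: (d_s,)   previous decoder state s_{t-1}
    y_prev: (d_y,)   embedding of the previous target word y_{t-1}
    Hypothetical weight shapes: W_h (d_a, d_h), W_s (d_a, d_s),
    W_y (d_a, d_y), v (d_a,).
    """
    # f: a small feedforward net that scores every source position j.
    hidden = np.tanh(h @ W_h.T + W_s @ s_prev + W_y @ y_prev)  # (T, d_a)
    scores = hidden @ v                                         # (T,)
    alpha = softmax(scores)   # alpha_{t,j}, sums to 1 over j
    c = alpha @ h             # c_t = sum_j alpha_{t,j} h_j
    return alpha, c

# Toy usage with random values.
rng = np.random.default_rng(0)
T, d_h, d_s, d_y, d_a = 5, 8, 6, 4, 7
h = rng.normal(size=(T, d_h))
s_prev = rng.normal(size=d_s)
y_prev = rng.normal(size=d_y)
W_h = rng.normal(size=(d_a, d_h))
W_s = rng.normal(size=(d_a, d_s))
W_y = rng.normal(size=(d_a, d_y))
v = rng.normal(size=d_a)
alpha, c = attention_step(h, s_prev, y_prev, W_h, W_s, W_y, v)
```

Every operation here is differentiable, which is why the whole model can be trained with SGD and backpropagation.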

Page 15:

Context vector (input to decoder):

Mixture weights:

Alignment score (how well do input words near j match output words at position i):

Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015

Soft Attention for Translation
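For reference, the three quantities named on this slide are defined in the cited Bahdanau et al. paper roughly as below, with i indexing output positions and j indexing input positions. The symbols a, v_a, W_a, U_a and T_x follow that paper's notation and do not appear elsewhere in this transcript.

```latex
\begin{align}
c_i &= \sum_{j=1}^{T_x} \alpha_{ij} h_j
  && \text{(context vector)} \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}
  && \text{(mixture weights)} \\
e_{ij} &= a(s_{i-1}, h_j) = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)
  && \text{(alignment score)}
\end{align}
```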

Page 16:

Soft Attention for Translation

Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015

Page 17:

Reached State of the art in one year:

Yoshua Bengio, NIPS RAM workshop 2015

Soft Attention for Translation

Page 18:

Luong, Pham and Manning’s Translation System (2015):

Luong and Manning IWSLT 2015

Soft Attention for Translation

Translation Error Rate vs Human

Page 19:

Stacked LSTM (c.f. bidirectional flat encoder in Bahdanau et al):

Effective Approaches to Attention-based Neural Machine Translation Minh-Thang Luong, Hieu Pham, Christopher D. Manning, EMNLP 15

Luong, Pham and Manning 2015

Page 20:

Effective Approaches to Attention-based Neural Machine Translation, Minh-Thang Luong, Hieu Pham, Christopher D. Manning, EMNLP 15

Global Attention Model

The global attention model is similar to Bahdanau’s but simpler. Different word matching functions were used.
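As a sketch of what "different word matching functions" can look like, the cited Luong et al. paper describes three scoring variants (dot, general, concat) between the current decoder state h_t and each encoder state h_s. The NumPy code below is an illustration under assumptions: the function names, array shapes and random weights are hypothetical, and only the score formulas come from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def score_dot(h_t, h_s):
    # h_t^T h_s for every source position; h_s has shape (S, d).
    return h_s @ h_t

def score_general(h_t, h_s, W_a):
    # h_t^T W_a h_s for every source position; W_a has shape (d, d).
    return h_s @ (h_t @ W_a)

def score_concat(h_t, h_s, W_a, v_a):
    # v_a^T tanh(W_a [h_t; h_s]); W_a has shape (d_a, 2d), v_a has shape (d_a,).
    tiled = np.broadcast_to(h_t, h_s.shape)        # repeat h_t for each position
    pairs = np.concatenate([tiled, h_s], axis=1)   # (S, 2d)
    return np.tanh(pairs @ W_a.T) @ v_a

def global_attention(scores, h_s):
    # Normalize scores over ALL source positions, then average the states.
    a_t = softmax(scores)
    c_t = a_t @ h_s
    return a_t, c_t

# Toy usage with random values.
rng = np.random.default_rng(0)
S, d, d_a = 6, 4, 5
h_s = rng.normal(size=(S, d))
h_t = rng.normal(size=d)
a_dot, c_dot = global_attention(score_dot(h_t, h_s), h_s)
a_gen, c_gen = global_attention(score_general(h_t, h_s, rng.normal(size=(d, d))), h_s)
a_cat, c_cat = global_attention(
    score_concat(h_t, h_s, rng.normal(size=(d_a, 2 * d)), rng.normal(size=d_a)), h_s)
```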

Page 21:

• Compute a best aligned position p_t first

• Then compute a context vector centered at that position

Effective Approaches to Attention-based Neural Machine Translation, Minh-Thang Luong, Hieu Pham, Christopher D. Manning, EMNLP 15

Local Attention Model
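The two steps of local attention can be sketched as below, loosely following the predictive variant in the cited Luong et al. paper: predict p_t with a small network, then weight a window of 2D+1 source states by a Gaussian centered at p_t. The dot-product scoring, the 0-based positions, the window clipping and the names W_p, v_p and D are assumptions made for this example.

```python
import numpy as np

def local_attention(h_t, h_s, W_p, v_p, D=2):
    """h_t: (d,) decoder state; h_s: (S, d) encoder states."""
    S = h_s.shape[0]
    # Step 1: predict the aligned position p_t in (0, S) with a sigmoid.
    p_t = S / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))
    # Step 2: attend only inside a window around p_t, clipped to the sentence.
    center = int(np.floor(p_t))
    lo, hi = max(0, center - D), min(S, center + D + 1)
    pos = np.arange(lo, hi)
    scores = h_s[lo:hi] @ h_t                     # dot-product scores
    e = np.exp(scores - scores.max())
    align = e / e.sum()                           # softmax inside the window
    sigma = D / 2.0
    # The Gaussian favors positions close to p_t.
    weights = align * np.exp(-((pos - p_t) ** 2) / (2 * sigma ** 2))
    c_t = weights @ h_s[lo:hi]                    # context vector
    return p_t, pos, weights, c_t

# Toy usage with random values.
rng = np.random.default_rng(1)
S, d, d_p = 10, 4, 3
h_s = rng.normal(size=(S, d))
h_t = rng.normal(size=d)
W_p = rng.normal(size=(d_p, d))
v_p = rng.normal(size=d_p)
p_t, pos, weights, c_t = local_attention(h_t, h_s, W_p, v_p, D=2)
```

Compared with global attention, the cost per step depends on the window size rather than on the full source length.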

Page 22:

Problem with Large Images

• Convolutional neural networks applied to computer vision problems also suffer from similar limitations, where it can be difficult to learn models on very large images.

• As a result, a series of glimpses can be taken of a large image to formulate an approximate impression of the image before making a prediction.

Page 23:

Image Captioning

Explain Images with Multimodal Recurrent Neural Networks, Mao et al.

Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei

Show and Tell: A Neural Image Caption Generator, Vinyals et al.

Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.

Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick

Fei-Fei Li & Justin Johnson & Serena Yeung, Lecture 10, May 4, 2017

Page 24:

Convolutional Neural Network

Recurrent Neural Network

Page 25:

test image

Page 26:

test image

X

Pages 27-33:

Recall: RNN for Captioning

[Figure, built up over several slides: Image (H x W x 3) -> CNN -> Features (D) -> h0 (hidden state: H) -> h1 -> h2, with words y1 (first word) and y2 (second word) and distributions over the vocabulary d1 and d2]

RNN only looks at whole image, once

What if the RNN looks at different parts of the image at each timestep?

Pages 34-42:

Soft Attention for Captioning

[Figure, built up over several slides: Image (H x W x 3) -> CNN -> Features (L x D). From h0 the model computes a1, a distribution over the L locations. The weighted combination of features gives z1 (weighted features: D). z1 and the first word y1 feed h1, which produces d1 (distribution over vocab) and a2. z2 and y2 feed h2, which produces d2 and a3.]

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
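The loop drawn on these slides (h0 -> a1 -> z1, then z1 and y1 -> h1 -> d1 and a2, and so on) can be sketched as below. This is a simplified illustration, not the model from the cited paper: it uses a plain tanh RNN instead of an LSTM, greedy word selection, random untrained weights, and hypothetical parameter names (W_hh, W_zh, W_yh, W_hd, W_ha) and a hypothetical start-token index of 0.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def caption_step(features, h_prev, a, y_embed, params):
    """features: (L, D) CNN grid features; a: (L,) attention over locations."""
    z = a @ features                                    # weighted features, (D,)
    h = np.tanh(params["W_hh"] @ h_prev
                + params["W_zh"] @ z
                + params["W_yh"] @ y_embed)             # new hidden state, (H,)
    d = softmax(params["W_hd"] @ h)                     # distribution over vocab
    a_next = softmax(features @ (params["W_ha"] @ h))   # distribution over L locations
    return h, d, a_next

# Toy usage with random values.
rng = np.random.default_rng(0)
L, D, H, V, Ey = 4, 6, 5, 10, 3
features = rng.normal(size=(L, D))
params = {
    "W_hh": rng.normal(size=(H, H)) * 0.1,
    "W_zh": rng.normal(size=(H, D)) * 0.1,
    "W_yh": rng.normal(size=(H, Ey)) * 0.1,
    "W_hd": rng.normal(size=(V, H)),
    "W_ha": rng.normal(size=(D, H)),
}
embed = rng.normal(size=(V, Ey))   # word embedding table (hypothetical)

h = np.zeros(H)                                    # h0
a = softmax(features @ (params["W_ha"] @ h))       # a1, computed from h0
word = 0                                           # hypothetical <START> token
attn_maps, words = [], []
for t in range(3):
    h, d, a_next = caption_step(features, h, a, embed[word], params)
    attn_maps.append(a)            # where the model looked for this word
    word = int(np.argmax(d))       # greedy choice of the next word
    words.append(word)
    a = a_next
```

Storing `attn_maps` is what makes the attended image region for each generated word easy to visualize.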

Pages 43-46:

Soft vs Hard Attention

[Figure, built up over several slides: Image (H x W x 3) -> CNN -> grid of features a, b, c, d (each D-dimensional). From RNN: distribution over grid locations p_a, p_b, p_c, p_d, with p_a + p_b + p_c + p_d = 1. Output: context vector z (D-dimensional).]

Soft attention:

• Summarize ALL locations: z = p_a a + p_b b + p_c c + p_d d

• Derivative dz/dp is nice!

• Train with gradient descent

Hard attention:

• Sample ONE location according to p; z = that vector

• With argmax, dz/dp is zero almost everywhere …

• Can’t use gradient descent; need reinforcement learning

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
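The contrast between the two context vectors can be made concrete with a four-location grid. The feature values and probabilities below are made up for illustration; only the formulas for z come from the slide.

```python
import numpy as np

# Grid features a, b, c, d (each D-dimensional, here D = 2); values are invented.
features = np.array([[1.0, 0.0],   # a
                     [0.0, 1.0],   # b
                     [1.0, 1.0],   # c
                     [2.0, 0.0]])  # d
p = np.array([0.1, 0.2, 0.3, 0.4])  # distribution over grid locations

# Soft attention: z = p_a*a + p_b*b + p_c*c + p_d*d.
z_soft = p @ features
# dz/dp is simply the feature matrix: column k is the feature at location k.
jacobian = features.T

# Hard attention: sample ONE location according to p and take that vector.
rng = np.random.default_rng(0)
idx = rng.choice(len(p), p=p)
z_hard = features[idx]

# With argmax, a small change in p leaves the selected vector unchanged,
# so the gradient with respect to p is zero almost everywhere.
eps = 1e-3
p_shift = p + eps * np.array([1.0, -1.0, 0.0, 0.0])
z_arg = features[np.argmax(p)]
z_arg_shift = features[np.argmax(p_shift)]
```

Here `z_soft` changes smoothly when `p` changes, while `z_arg` and `z_arg_shift` are identical, which is the reason the slide points to reinforcement learning for the hard case.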

Page 47:

Soft Attention for Captioning

Hard attention

Soft attention

Page 48:

Soft Attention for Captioning

Page 49:

Soft Attention for Video “Describing Videos by Exploiting Temporal Structure,” Li Yao et al, arXiv 2015.

Page 50:

Soft Attention for Video The attention model: “Describing Videos by Exploiting Temporal Structure,” Li Yao et al, arXiv 2015.

Page 51:

Soft Attention for Video “Describing Videos by Exploiting Temporal Structure,” Li Yao et al, arXiv 2015.

Page 52:

Soft Attention for Captioning

Attention constrained to fixed grid! We’ll come back to this …

Page 53:

Attention Takeaways

Performance: Attention models can improve accuracy and reduce computation.

Complexity: There are many design choices.

• Those choices have a big effect on performance.

Explainability: Attention models encode explanations.

• Both locus and trajectory help understand what’s going on.

Hard vs. Soft: Soft models are easier to train; hard models require reinforcement learning.