
CS60010: Deep Learning Recurrent Neural Network

Sudeshna Sarkar Spring 2018

13 Feb 2018

ATTENTION MODEL

Early attention models

• Larochelle and Hinton, 2010, “Learning to combine foveal glimpses with a third-order Boltzmann machine”

• Misha Denil et al, 2011, “Learning where to Attend with Deep Architectures for Image Tracking”

2014: Neural Translation Breakthroughs

• Devlin et al, ACL’2014

• Cho et al, EMNLP’2014

• Bahdanau, Cho & Bengio, arXiv Sept. 2014

• Jean, Cho, Memisevic & Bengio, arXiv Dec. 2014

• Sutskever et al, NIPS’2014

Other Applications

• Ba et al 2014, Visual attention for recognition

• Mnih et al 2014, Visual attention for recognition

• Chorowski et al, 2014, Speech recognition

• Graves et al 2014, Neural Turing machines

• Yao et al 2015, Video description generation

• Vinyals et al, 2015, Conversational Agents

• Xu et al 2015, Image caption generation

• Xu et al 2015, Visual Question Answering

Soft vs Hard Attention Models

Hard attention:

• Attend to a single input location.

• Can’t use gradient descent.

Soft attention:

• Compute a weighted combination (attention) over some inputs using an attention network.

• Can use backpropagation to train end-to-end.

Problem With Long Sequences

• Consider the encoder-decoder model.

• It suffers from the constraint that every input sequence is forced to be encoded into a fixed-length internal vector.

Attention within Sequences

• Keep the intermediate outputs from the encoder LSTM at each step of the input sequence, and train the model to pay selective attention to these inputs and relate them to items in the output sequence.

Soft Attention for Translation

“I love coffee” -> “Me gusta el café”

Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015

Based on cs231n by Fei-Fei Li & Andrej Karpathy & Justin Johnson



Distribution over input words


• Each time the model generates a word in a translation, it (soft) searches for a set of positions in a source sentence where the most relevant information is concentrated.

• The model then predicts a target word based on the context vectors associated with these source positions and all the previous generated target words.

• … it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation.

NMT

Bahdanau, ICLR’15


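Annotation vector: $h_j = [\overrightarrow{h}_j ; \overleftarrow{h}_j]$ (forward and backward encoder states, concatenated)

A set of annotation vectors $\{h_1, h_2, \ldots, h_T\}$

For each target word $y_t$:

1. Compute $\alpha_{t,j} = f(y_{t-1}, h_j, s_{t-1})$

• $\sum_j \alpha_{t,j} = 1$

• $f$ is a feedforward neural network

2. Get a context vector $c_t = \sum_j \alpha_{t,j} h_j$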

Train the model with SGD and backpropagation
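The two steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the scoring network f is assumed to be a one-hidden-layer feedforward network followed by a softmax over source positions, and the names `additive_attention`, `W_y`, `W_s`, `W_h`, `v` and all sizes are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_attention(y_prev, s_prev, H, W_y, W_s, W_h, v):
    """One attention step for target position t.

    y_prev: (E,)   embedding of the previous target word y_{t-1}
    s_prev: (S,)   previous decoder state s_{t-1}
    H:      (T, A) annotation vectors h_1..h_T, one per row
    Returns (alpha, c): weights alpha_{t,j} of shape (T,), context c_t of shape (A,).
    """
    # Assumed form of f: a small feedforward net that scores each h_j.
    hidden = np.tanh(W_y @ y_prev + W_s @ s_prev + H @ W_h.T)  # (T, K)
    scores = hidden @ v                                         # (T,)
    alpha = softmax(scores)   # step 1: weights sum to 1 over j
    c = alpha @ H             # step 2: c_t = sum_j alpha_{t,j} h_j
    return alpha, c

rng = np.random.default_rng(0)
T, A, S, E, K = 4, 6, 5, 3, 8   # hypothetical sizes
H = rng.normal(size=(T, A))
alpha, c = additive_attention(
    rng.normal(size=E), rng.normal(size=S), H,
    rng.normal(size=(K, E)), rng.normal(size=(K, S)),
    rng.normal(size=(K, A)), rng.normal(size=K),
)
print(alpha.sum(), c.shape)
```

Every operation here is differentiable, which is why the attention weights can be trained jointly with the rest of the model by backpropagation.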

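[Figure: inputs x1 … xT are encoded into annotation vectors h1 … hT; attention weights αt,1, αt,2, αt,3, …, αt,T combine them (+) into a context that, together with yt-1 and st-1, produces st and yt.]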

Learning to Align and Translate jointly

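Context vector (input to decoder): $c_i = \sum_j \alpha_{ij} h_j$

Mixture weights: $\alpha_{ij} = \exp(e_{ij}) / \sum_k \exp(e_{ik})$

Alignment score (how well do input words near j match output words at position i): $e_{ij} = a(s_{i-1}, h_j)$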



Reached State of the art in one year:

Yoshua Bengio, NIPS RAM workshop 2015


Luong, Pham and Manning’s Translation System (2015):

Luong and Manning IWSLT 2015


Translation Error Rate vs Human

Stacked LSTM (c.f. bidirectional flat encoder in Bahdanau et al):

Effective Approaches to Attention-based Neural Machine Translation, Minh-Thang Luong, Hieu Pham, Christopher D. Manning, EMNLP 2015

Luong, Pham and Manning 2015


Global Attention Model

• The global attention model is similar to Bahdanau’s, but simpler.

• Different word matching functions were used.

Local Attention Model

• Compute a best aligned position $p_t$ first.

• Then compute a context vector centered at that position.
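The local attention steps (compute an aligned position p_t, then a context vector centered on it) can be sketched as below. This is an illustrative sketch rather than the paper's code: it assumes the predictive variant, in which p_t is predicted from the decoder state and the weights are damped by a Gaussian centered at p_t, it uses a plain dot product as the matching function, and all names (`local_attention`, `W_p`, `v_p`, `D`) are hypothetical.

```python
import numpy as np

def local_attention(h_t, H_src, W_p, v_p, D=2):
    """Local attention around a predicted source position.

    h_t:   (N,)   current decoder hidden state
    H_src: (S, N) encoder hidden states, one per source position
    D:     half-width of the attention window
    """
    S = H_src.shape[0]
    # Step 1: predict the aligned position p_t, scaled to the 0-based range [0, S - 1].
    p_t = (S - 1) / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))
    # Source positions inside the window around p_t, clipped to the sentence.
    center = int(np.floor(p_t))
    pos = np.arange(max(0, center - D), min(S, center + D + 1))
    # Dot-product matching scores, normalized with a softmax over the window.
    scores = H_src[pos] @ h_t
    align = np.exp(scores - scores.max())
    align /= align.sum()
    # Favor positions close to p_t with a Gaussian (sigma = D / 2).
    sigma = D / 2.0
    weights = align * np.exp(-((pos - p_t) ** 2) / (2 * sigma ** 2))
    # Step 2: context vector centered at p_t.
    context = weights @ H_src[pos]
    return p_t, pos, weights, context

rng = np.random.default_rng(1)
S_len, N = 10, 4   # hypothetical source length and state size
p_t, pos, weights, context = local_attention(
    rng.normal(size=N), rng.normal(size=(S_len, N)),
    rng.normal(size=(N, N)), rng.normal(size=N),
)
print(p_t, pos, context.shape)
```

Because only 2D + 1 source positions are scored per target word, the cost per step does not grow with the source length, unlike global attention.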

Problem with Large Images

• Convolutional neural networks applied to computer vision problems also suffer from similar limitations, where it can be difficult to learn models on very large images.

• As a result, a series of glimpses can be taken of a large image to formulate an approximate impression of the image before making a prediction.
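As a rough illustration of taking glimpses, the sketch below crops small fixed-size patches from a larger image at a sequence of locations. The function name `take_glimpse`, the zero padding at the borders, and the hand-picked locations are assumptions for this example; in an attention model the locations would be chosen by the network.

```python
import numpy as np

def take_glimpse(image, center, size):
    """Crop a size x size patch centered at (row, col), zero-padded at the borders."""
    half = size // 2
    # Pad height and width so that patches near the edge stay full-sized.
    padded = np.pad(image, ((half, half), (half, half), (0, 0)))
    r, c = center
    return padded[r:r + size, c:c + size]

image = np.arange(8 * 8 * 3).reshape(8, 8, 3)   # stand-in for an H x W x 3 image
locations = [(0, 0), (4, 4), (7, 7)]            # hypothetical glimpse centers
glimpses = [take_glimpse(image, loc, size=3) for loc in locations]
print([g.shape for g in glimpses])
```

Each glimpse has the same small size regardless of how large the full image is, so the per-step cost of processing it stays fixed.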

Image Captioning

• Explain Images with Multimodal Recurrent Neural Networks, Mao et al.

• Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei

• Show and Tell: A Neural Image Caption Generator, Vinyals et al.

• Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.

• Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick

Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 10 - May 4, 2017

[Figure: image captioning pipeline. A test image is fed to a Convolutional Neural Network, whose output is fed to a Recurrent Neural Network.]


RNN for Captioning

[Figure: Image (H x W x 3) → CNN → Features (D) → initial hidden state h0 (hidden state: H) → h1 → h2 → …, with words y1 (first word), y2 (second word) and distributions over the vocabulary d1, d2 at each step.]

RNN only looks at whole image, once

What if the RNN looks at different parts of the image at each timestep?


Soft Attention for Captioning

Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015

[Figure: Image (H x W x 3) → CNN → Features (L x D). The hidden state h0 produces a1, a distribution over the L locations. a1 gives a weighted combination of features z1 (weighted features: D). z1 and the first word y1 feed h1, which produces a2 and d1; z2 and y2 feed h2, which produces a3 and d2; and so on.]
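One decoding step of soft attention for captioning can be sketched as follows. This is a simplified illustration, not the model from the paper: it assumes a plain tanh RNN instead of an LSTM and a bilinear attention scorer, and every parameter name (`W_att`, `W_h`, `W_z`, `W_y`, `W_out`) and size is hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(features, h, W_att):
    """Return (a, z): distribution over the L locations and weighted features (D,)."""
    a = softmax(features @ (W_att @ h))   # (L,)
    z = a @ features                      # (D,) weighted combination of features
    return a, z

def caption_step(features, h, y_embed, p):
    """One timestep: attend using h, then update the state from z and the word."""
    a, z = attend(features, h, p["W_att"])
    h_next = np.tanh(p["W_h"] @ h + p["W_z"] @ z + p["W_y"] @ y_embed)
    d = softmax(p["W_out"] @ h_next)      # distribution over the vocabulary
    return a, z, h_next, d

rng = np.random.default_rng(0)
L, D, Hd, E, V = 9, 5, 6, 4, 7   # locations, feature dim, hidden, embedding, vocab
p = {
    "W_att": rng.normal(size=(D, Hd)),
    "W_h": rng.normal(size=(Hd, Hd)),
    "W_z": rng.normal(size=(Hd, D)),
    "W_y": rng.normal(size=(Hd, E)),
    "W_out": rng.normal(size=(V, Hd)),
}
features = rng.normal(size=(L, D))   # stand-in for the CNN feature grid
h0 = rng.normal(size=Hd)
a1, z1, h1, d1 = caption_step(features, h0, rng.normal(size=E), p)
a2, z2, h2, d2 = caption_step(features, h1, rng.normal(size=E), p)
print(a1.shape, z1.shape, d1.shape)
```

The second call recomputes the attention distribution from h1, so the model can look at a different part of the image for each word.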



Soft vs Hard Attention

[Figure: Image (H x W x 3) → CNN → grid of features a, b, c, d (each D-dimensional). From the RNN: a distribution over grid locations $p_a, p_b, p_c, p_d$ with $p_a + p_b + p_c + p_d = 1$. Output: context vector z (D-dimensional).]

Soft attention:

• Summarize ALL locations: $z = p_a a + p_b b + p_c c + p_d d$

• Derivative dz/dp is nice!

• Train with gradient descent

Hard attention:

• Sample ONE location according to p; z = that vector

• With argmax, dz/dp is zero almost everywhere

• Can’t use gradient descent; need reinforcement learning
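The difference between the two can be seen in a few lines of NumPy. This is only a toy sketch: the feature values are random stand-ins for the grid features a, b, c, d, and the probabilities are fixed by hand rather than produced by an RNN.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
feats = rng.normal(size=(4, D))        # rows are the grid features a, b, c, d
p = np.array([0.1, 0.2, 0.3, 0.4])     # p_a, p_b, p_c, p_d (sum to 1)

# Soft attention: weighted sum over ALL locations.
z_soft = p @ feats
# z is linear in p, so dz/dp_k is simply the k-th feature vector.
dz_dp = feats

# Hard attention: sample ONE location according to p and use that vector.
k = rng.choice(4, p=p)
z_hard = feats[k]

print(z_soft, k, z_hard)
```

The soft context vector changes smoothly as p changes, so gradients flow back into whatever produced p. The hard context vector is a discrete choice, which is why the slides point to reinforcement learning for training it.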



[Figure: hard attention vs. soft attention examples]

Soft Attention for Video

The attention model:

“Describing Videos by Exploiting Temporal Structure,” Li Yao et al, arXiv 2015.


Soft Attention for Captioning

Attention constrained to fixed grid! We’ll come back to this ….

Attention Takeaways

Performance: Attention models can improve accuracy and reduce computation.

Complexity: There are many design choices.

• Those choices have a big effect on performance.

Explainability: Attention models encode explanations.

• Both locus and trajectory help understand what’s going on.

Hard vs. Soft: Soft models are easier to train; hard models require reinforcement learning.