CS60010: Deep Learning Recurrent Neural Network
Sudeshna Sarkar Spring 2018
13 Feb 2018
ATTENTION MODEL
Early attention models
• Larochelle and Hinton, 2010, “Learning to combine foveal glimpses with a third-order Boltzmann machine”
• Misha Denil et al, 2011, “Learning where to Attend with Deep Architectures for Image Tracking”
2014: Neural Translation Breakthroughs
• Devlin et al, ACL'2014
• Cho et al, EMNLP'2014
• Bahdanau, Cho & Bengio, arXiv Sept. 2014
• Jean, Cho, Memisevic & Bengio, arXiv Dec. 2014
• Sutskever et al, NIPS'2014
Other Applications
• Ba et al 2014, Visual attention for recognition
• Mnih et al 2014, Visual attention for recognition
• Chorowski et al, 2014, Speech recognition
• Graves et al 2014, Neural Turing machines
• Yao et al 2015, Video description generation
• Vinyals et al, 2015, Conversational Agents
• Xu et al 2015, Image caption generation
• Xu et al 2015, Visual Question Answering
Soft vs Hard Attention Models
Hard attention:
• Attend to a single input location.
• Can't use gradient descent.
Soft attention:
• Compute a weighted combination (attention) over some inputs using an attention network.
• Can use backpropagation to train end-to-end (see the sketch below).
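A minimal NumPy sketch of the soft case; the scoring network and all shapes here are illustrative stand-ins, not a model from the slides:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
inputs = rng.normal(size=(5, 8))   # 5 input vectors, 8 dims each
query = rng.normal(size=8)         # decoder state acting as a query

# Toy "attention network": score each input against the query.
scores = inputs @ query            # shape (5,)
weights = softmax(scores)          # sums to 1 over the 5 inputs

# Soft attention: differentiable weighted combination of ALL inputs.
context = weights @ inputs         # shape (8,)

# Hard attention would instead pick ONE input, e.g. inputs[scores.argmax()],
# which is not differentiable with respect to the scores.
```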
Problem With Long Sequences
• Consider the encoder-decoder model. It suffers from the constraint that all input sequences are forced to be encoded into a fixed-length internal vector.
Attention within Sequences
• Keep the intermediate outputs from the encoder LSTM at each step of the input sequence, and train the model to pay selective attention to these inputs and relate them to items in the output sequence (see the sketch below).
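A sketch of the point about keeping intermediate encoder outputs, assuming a toy vanilla-RNN (tanh) encoder rather than the LSTM named above; names and shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_in, d_h = 6, 4, 8
W_xh = rng.normal(scale=0.1, size=(d_in, d_h))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
xs = rng.normal(size=(T, d_in))    # input sequence

# A plain encoder-decoder keeps only the final state h_T (a fixed-length
# bottleneck). For attention we keep EVERY intermediate state instead.
h = np.zeros(d_h)
all_states = []
for x in xs:
    h = np.tanh(x @ W_xh + h @ W_hh)
    all_states.append(h)
all_states = np.stack(all_states)  # (T, d_h): one annotation per input step

# The decoder can now attend over all_states rather than just all_states[-1].
```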
Soft Attention for Translation
"I love coffee" -> "Me gusta el café"
[Figure: at each decoding step, a distribution over the input words is computed and used to generate the next output word.]
Bahdanau et al, "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015
Based on cs231n by Fei-Fei Li & Andrej Karpathy & Justin Johnson
• Each time the model generates a word in the translation, it (soft-)searches for a set of positions in the source sentence where the most relevant information is concentrated.
• The model then predicts the target word based on the context vectors associated with these source positions and all the previously generated target words.
• ... it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation.
NMT
Bahdanau, ICLR’15
Annotation vector $h_j = [\overrightarrow{h}_j; \overleftarrow{h}_j]$ (forward and backward encoder states concatenated).
A set of annotation vectors $\{h_1, h_2, \ldots, h_T\}$. For each target word $y_t$:
1. Compute $\alpha_{t,j} = f(y_{t-1}, h_j, s_{t-1})$, with $\sum_j \alpha_{t,j} = 1$; $f$ is a feedforward neural network.
2. Get a context vector $c_t = \sum_j \alpha_{t,j} h_j$.
Train the model with SGD and backpropagation
[Figure: a bidirectional encoder over inputs $x_1, \ldots, x_T$ produces annotations $h_1, \ldots, h_T$; attention weights $\alpha_{t,1}, \ldots, \alpha_{t,T}$, computed from $y_{t-1}$ and $s_{t-1}$, combine the annotations into a context vector that feeds the decoder state $s_t$ to emit $y_t$.]
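A runnable NumPy sketch of steps 1 and 2 above. The additive scoring form $e_{t,j} = v^\top \tanh(W s_{t-1} + U h_j)$ follows Bahdanau et al.'s alignment model; the slide's $f$ also conditions on $y_{t-1}$, which is omitted here for brevity, and all dimensions and weight names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
T, d_h, d_a = 6, 8, 10
H = rng.normal(size=(T, d_h))      # annotation vectors h_1..h_T
s_prev = rng.normal(size=d_h)      # previous decoder state s_{t-1}

# Parameters of the feedforward scoring network f (additive attention).
W_a = rng.normal(scale=0.1, size=(d_h, d_a))
U_a = rng.normal(scale=0.1, size=(d_h, d_a))
v_a = rng.normal(scale=0.1, size=d_a)

# Step 1: unnormalized scores e_{t,j} = v^T tanh(W s_{t-1} + U h_j),
# then a softmax so the alpha_{t,j} sum to 1 over j.
e = np.tanh(s_prev @ W_a + H @ U_a) @ v_a      # shape (T,)
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# Step 2: context vector c_t = sum_j alpha_{t,j} h_j.
c_t = alpha @ H                                 # shape (d_h,)
```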
Learning to Align and Translate Jointly
Context vector (input to decoder): $c_t = \sum_j \alpha_{t,j} h_j$
Mixture weights: $\alpha_{t,j} = \exp(e_{t,j}) / \sum_k \exp(e_{t,k})$
Alignment score (how well do input words near position $j$ match output words at position $t$): $e_{t,j} = f_{\text{att}}(s_{t-1}, h_j)$
Bahdanau et al, "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015
Soft Attention for Translation
Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015
Reached state of the art in one year (Yoshua Bengio, NIPS RAM workshop 2015).
Soft Attention for Translation
Luong, Pham and Manning’s Translation System (2015):
Luong and Manning IWSLT 2015
Soft Attention for Translation
Translation error rate vs. human (Luong, Pham and Manning 2015).
Stacked LSTM encoder (cf. the bidirectional flat encoder in Bahdanau et al).
Effective Approaches to Attention-based Neural Machine Translation, Minh-Thang Luong, Hieu Pham, Christopher D. Manning, EMNLP 2015
Global Attention Model
• Similar to, but simpler than, Bahdanau's model; different word-matching (score) functions were used.
Local Attention Model
• First compute a best aligned position pt.
• Then compute a context vector centered at that position.
(A sketch of both variants follows.)
Effective Approaches to Attention-based Neural Machine Translation, Minh-Thang Luong, Hieu Pham, Christopher D. Manning, EMNLP 2015
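A minimal NumPy sketch of the two variants. The dot-product score for the global model and the predicted position $p_t = S \cdot \sigma(v_p^\top \tanh(W_p h_t))$ with a Gaussian window for the local model follow the Luong et al. paper; all shapes, names, and the renormalization of the local weights are illustrative choices here:

```python
import numpy as np

rng = np.random.default_rng(3)
S, d = 10, 8                         # source length, hidden size
H_src = rng.normal(size=(S, d))      # source-side hidden states
h_t = rng.normal(size=d)             # current target hidden state

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Global attention: score every source position (dot-product variant).
a_global = softmax(H_src @ h_t)
c_global = a_global @ H_src

# Local attention: predict an aligned position p_t, then reweight the
# scores with a Gaussian window of half-width D centered at p_t.
W_p = rng.normal(scale=0.1, size=(d, d))
v_p = rng.normal(scale=0.1, size=d)
p_t = S / (1 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))   # scalar in [0, S]
D = 2
positions = np.arange(S)
window = np.exp(-((positions - p_t) ** 2) / (2 * (D / 2) ** 2))
a_local = softmax(H_src @ h_t) * window
a_local /= a_local.sum()             # renormalize (a simplification)
c_local = a_local @ H_src
```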
Problem with Large Images
• Convolutional neural networks applied to computer vision problems suffer from a similar limitation: it can be difficult to learn models on very large images.
• As a result, a series of glimpses can be taken of a large image to form an approximate impression of it before making a prediction.
Explain Images with Multimodal Recurrent Neural Networks, Mao et al.
Image Captioning
Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei Show and Tell: A Neural Image Caption Generator, Vinyals et al. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al. Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick
[Figure: a Convolutional Neural Network encodes a test image; a Recurrent Neural Network generates the description.]
Based on cs231n by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Recall: RNN for Captioning
[Figure: a CNN maps the image (H x W x 3) to a feature vector (Features: D), which initializes the hidden state h0 (Hidden state: H). The RNN then unrolls: h1 emits d1, a distribution over the vocabulary, and the first word y1; h2 emits d2 and the second word y2; and so on.]
The RNN only looks at the whole image, once.
What if the RNN looks at different parts of the image at each timestep?
Based on cs231n by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Soft Attention for Captioning
[Figure: the CNN now produces a grid of features (Features: L x D) instead of a single vector. From h0, the model computes a1, a distribution over the L locations; a weighted combination of the features gives z1 (weighted features: D). z1 and the first word y1 feed the RNN to produce h1, which emits d1, a distribution over the vocabulary, and a2, the next attention distribution; the process repeats with z2, y2, h2, d2, a3, and so on.]
(A sketch of one attention step follows.)
Xu et al, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
Based on cs231n by Fei-Fei Li & Andrej Karpathy & Justin Johnson
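A minimal NumPy sketch of one attention step over the L x D feature grid; the single bilinear scoring function and all shapes are illustrative simplifications, not the exact Show, Attend and Tell architecture:

```python
import numpy as np

rng = np.random.default_rng(4)
L, D, Hdim = 49, 16, 32              # e.g. a 7x7 grid of D-dim features
feats = rng.normal(size=(L, D))      # CNN features, one vector per location
h = rng.normal(size=Hdim)            # current RNN hidden state

# Score each location against the hidden state, softmax over locations.
W_f = rng.normal(scale=0.1, size=(D, Hdim))
scores = (feats @ W_f) @ h           # shape (L,)
a = np.exp(scores - scores.max())
a /= a.sum()                         # distribution over the L locations

# Weighted combination of features -> context vector z (D-dimensional),
# fed to the RNN along with the previous word at the next step.
z = a @ feats
```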
Soft vs Hard Attention
[Figure: the CNN produces a grid of features a, b, c, d (each D-dimensional); from the RNN comes a distribution over grid locations pa, pb, pc, pd, with pa + pb + pc + pd = 1, which is used to form a context vector z (D-dimensional).]
Soft attention: summarize ALL locations: z = pa·a + pb·b + pc·c + pd·d. The derivative dz/dp is nice! Train with gradient descent.
Hard attention: sample ONE location according to p; z = that vector. With argmax, dz/dp is zero almost everywhere... Can't use gradient descent; need reinforcement learning. (A sketch of both options follows.)
Xu et al, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
Based on cs231n by Fei-Fei Li & Andrej Karpathy & Justin Johnson
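A sketch of the two options on a four-location grid like the one above (NumPy; the sampling line stands in for the stochastic hard-attention policy, which the paper trains with reinforcement learning):

```python
import numpy as np

rng = np.random.default_rng(5)
D = 16
grid = rng.normal(size=(4, D))       # features a, b, c, d
p = np.array([0.1, 0.6, 0.2, 0.1])   # pa + pb + pc + pd = 1 (from the RNN)

# Soft attention: z is a smooth function of p, so dz/dp exists everywhere
# and the whole model trains with ordinary backpropagation.
z_soft = p @ grid

# Hard attention: sample one location; z is no longer differentiable in p,
# hence the need for reinforcement learning (e.g. REINFORCE) to train it.
idx = rng.choice(4, p=p)
z_hard = grid[idx]
```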
Soft Attention for Captioning
[Figure: example captions with attention maps, comparing hard attention and soft attention.]
Based on cs231n by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Soft Attention for Video
The attention model: "Describing Videos by Exploiting Temporal Structure," Li Yao et al, arXiv 2015.
Based on cs231n by Fei-Fei Li & Andrej Karpathy & Justin Johnson
Soft Attention for Captioning
Attention is constrained to a fixed grid! We'll come back to this...
Attention Takeaways
• Performance: attention models can improve accuracy and reduce computation.
• Complexity: there are many design choices, and those choices have a big effect on performance.
• Explainability: attention models encode explanations; both the locus and the trajectory of attention help us understand what's going on.
• Hard vs. soft: soft models are easier to train; hard models require reinforcement learning.