Deep Learning for Computer Vision: Language and vision (UPC 2016)
Transcript of Deep Learning for Computer Vision: Language and vision (UPC 2016)
Day 4 Lecture 3
Language and Vision
Xavier Giró-i-Nieto
2
Acknowledgments
Santi Pascual
3
In lecture D2L6 (RNNs):
Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).
Language IN
Language OUT
4
Motivation
5
Much earlier than lecture D2L6 (RNNs):
Ñeco, R.P. and Forcada, M.L. "Asynchronous translations with recurrent neural nets." In International Conference on Neural Networks, 1997 (Vol. 4, pp. 2535-2540). IEEE.
6
Encoder-Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
Representation or Embedding
For clarity, let's study a Neural Machine Translation (NMT) case.
7
Encoder One-hot encoding
One-hot encoding: a binary representation of the words in a vocabulary in which only combinations with a single hot (1) bit, all other bits cold (0), are allowed.
Word Binary One-hot encoding
zero 00 0001
one 01 0010
two 10 0100
three 11 1000
8
Encoder One-hot encoding
Natural language words can also be one-hot encoded as vectors of dimensionality equal to the size of the dictionary (K).
Word One-hot encoding
economic 000010
growth 001000
has 100000
slowed 000001
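A minimal sketch of this encoding (the toy vocabulary and helper name are illustrative):

```python
import numpy as np

def one_hot(word, vocab):
    """Return a K-dimensional one-hot vector for `word`, with K = len(vocab)."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

vocab = ["economic", "growth", "has", "slowed"]  # toy vocabulary, K = 4
print(one_hot("growth", vocab))  # [0. 1. 0. 0.]
```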
Encoder One-hot encoding
One-hot is a very simple representation: every word is equidistant from every other word.
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
10
Encoder Projection to continuous space
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
The one-hot vector w_i (of dimension K) is linearly projected to a space of lower dimension (typically 100-500) with a matrix E of learned weights: s_i = E w_i.
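One property worth noting: multiplying a one-hot vector by E just selects one column of E, so the projection is effectively an embedding lookup. A minimal sketch with toy dimensions:

```python
import numpy as np

K, d = 6, 3                       # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(d, K))       # projection matrix (learned during training)

w = np.zeros(K)
w[2] = 1.0                        # one-hot vector for the word with index 2
s = E @ w                         # continuous representation s_i = E w_i

# The projection simply selects column 2 of E (an embedding lookup):
assert np.allclose(s, E[:, 2])
```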
11
Encoder Projection to continuous space
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
The projection matrix E corresponds to a fully connected layer, so its parameters are learned during training.
12
Encoder Projection to continuous space
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
Sequence of words → sequence of continuous-space word representations.
13
Encoder Recurrence
Sequence
Figure: Christopher Olah, "Understanding LSTM Networks" (2015)
14
Encoder Recurrence
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
15
Encoder Recurrence
(Figure: the recurrence unrolled over time, shown as a front view and, after a 90° rotation, a side view.)
16
Encoder Recurrence
(Figure: side view, after a 90° rotation of the front view.)
Representation or embedding of the sentence
17
Sentence Embedding
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014.
Clusters by meaning appear in a 2-dimensional PCA projection of the LSTM hidden states.
18
(Word Embeddings)
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." In Advances in Neural Information Processing Systems, pp. 3111-3119. 2013.
19
Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
The RNN's internal state z_i depends on the sentence embedding h_t, the previous word u_{i-1}, and the previous internal state z_{i-1}.
20
Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
With z_i ready, we can score each word k in the vocabulary with a dot product:
e(k) = w_k^T z_i, the dot product between the RNN internal state z_i and the neuron weights w_k for word k.
21
Decoder
Bridle, John S. "Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters." NIPS 1989.
...and finally normalize to word probabilities with a softmax:
p(u_i = k | u_{i-1}, ..., u_1, h_t) = exp(e(k)) / Σ_j exp(e(j))
where e(k) is the score for word k, and the left-hand side is the probability that the i-th word is word k, given the previous words and the hidden state.
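Numerically, the scoring and normalization steps can be sketched as follows (the state and weights are toy values, not trained parameters):

```python
import numpy as np

def softmax(e):
    e = e - e.max()          # shift by the max for numerical stability
    p = np.exp(e)
    return p / p.sum()

z_i = np.array([0.5, -1.0, 2.0])           # RNN internal state (toy)
W = np.array([[ 1.0, 0.0, 0.5],            # one weight row w_k per word k
              [-0.5, 1.0, 0.0],
              [ 0.2, 0.3, 1.0]])

scores = W @ z_i                           # e(k) = w_k . z_i for every word k
probs = softmax(scores)                    # p(u_i = k | ...) for every word k
print(probs.argmax())                      # index of the most probable word
```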
22
Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
More words for the decoded sentence are generated until an <EOS> (End Of Sentence) "word" is predicted.
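This stopping rule can be sketched as a greedy decoding loop; `step` below is a hypothetical stand-in for the real decoder RNN, not the actual model:

```python
import numpy as np

EOS = 0  # index reserved for the <EOS> token in a toy 3-word vocabulary

def step(z, u_prev):
    """Hypothetical decoder step: returns (new state, scores over the vocabulary)."""
    z = np.tanh(z + u_prev)                          # stand-in for the real RNN update
    return z, np.array([2.0 * float(z), 1.0, -1.0])  # toy scores

def greedy_decode(z0, max_len=10):
    words, z, u = [], z0, 0
    for _ in range(max_len):
        z, scores = step(z, u)
        u = int(scores.argmax())     # pick the highest-scoring word
        if u == EOS:                 # stop once <EOS> is predicted
            break
        words.append(u)
    return words

print(greedy_decode(0.1))  # generates word indices until <EOS> wins the argmax
```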
23
Encoder-Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
24
Encoder-Decoder Training: dataset of pairs of sentences in the two languages to translate.
Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.
25
Encoder-Decoder Seq2Seq
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014.
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.
28
Captioning DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.
only takes into account image features in the first hidden state
Multimodal Recurrent Neural Network
29
Captioning Show & Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
30
Captioning Show & Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
31
Captioning LSTM for image & video
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015. [code]
32
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
34
Captioning (+ Detection) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
XAVI: "man has short hair", "man with short hair"
AMAIA: "a woman wearing a black shirt"
BOTH: "two men wearing black glasses"
35
Captioning (+ Retrieval) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
36
Captioning HRNE
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.
(Figure: LSTM units in a second layer over time, t = 1 ... T; the hidden state at t = T summarizes the first chunk of data.)
37
Visual Question Answering
Encode the question "Is economic growth decreasing?" ([z1, z2, ..., zN]) and encode the image; decode the answer "Yes" ([y1, y2, ..., yM]).
38
Pipeline: extract visual features; embed the question; merge; predict the answer.
Question: "What object is flying?" Answer: "Kite"
Visual Question Answering
Slide credit: Issey Masuda
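The merge-and-predict step can be sketched as follows; the layer sizes and the concatenation-based merge are illustrative choices, not the exact model on the slide:

```python
import numpy as np

rng = np.random.default_rng(1)

v = rng.normal(size=512)          # visual features from the CNN (toy size)
q = rng.normal(size=256)          # question embedding from the RNN (toy size)

m = np.concatenate([v, q])        # merge: concatenation is one common choice

W = rng.normal(size=(1000, m.size)) * 0.01   # classifier over 1000 answers
scores = W @ m
answer_id = int(scores.argmax())  # predicted answer index ("Kite" would be one class)
```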
39
Visual Question Answering
Noh, H., Seo, P.H., and Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016.
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." arXiv preprint arXiv:1603.01417 (2016).
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.
Main idea: split the image into local regions and consider each region equivalent to a sentence.
Local region feature extraction with a CNN (VGG-19): (1) rescale the input to 448×448; (2) take the output of the last pooling layer → D = 512×14×14 → 196 local region vectors of 512 dimensions.
Visual feature embedding: a matrix W projects the image features into the textual space of the question q.
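Step (2) amounts to a reshape of the pooling tensor, and the embedding is one matrix product; a sketch with random stand-ins for the real VGG-19 features and the learned W:

```python
import numpy as np

# Stand-in for the last pooling output of VGG-19 on a 448x448 input:
features = np.random.rand(512, 14, 14)   # D = 512 x 14 x 14

# Flatten the 14x14 grid into 196 local region vectors of 512 dimensions:
regions = features.reshape(512, 196).T   # shape (196, 512)

# Project every region into the textual space of the question (W is learned):
d_text = 300                             # illustrative textual-space size
W = np.random.rand(d_text, 512) * 0.01
projected = regions @ W.T                # shape (196, 300)
```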
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.
43
Datasets Visual Genome
Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016).
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft COCO
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description, Retrieval, and Fill-in-the-Blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
VQA accuracy (%):
Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM&CNN: 54.06
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline "All yes": 29.88
I. Masuda-Mora: 53.62. "Open-Ended Visual Question-Answering." Submitted as BSc ETSETB thesis. [Clean code in Keras, perfect for beginners!]
Challenges Visual Question Answering
51
Summary
- Embedding language and vision into semantic embeddings allows fusion learning.
- Very high interest among researchers: a great topic for your thesis.
- Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?
52
Conclusions
New Turing test: how to evaluate AI's image understanding?
Slide credit: Issey Masuda
53
Learn more: Julia Hockenmaier
54
Thanks! Q&A. Follow me at:
https://imatge.upc.edu/web/people/xavier-giro
DocXavi / ProfessorXavi
2
Acknowledgments
Santi Pascual
3
In lecture D2L6 RNNs
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation arXiv preprint arXiv14061078 (2014)
Language IN
Language OUT
4
Motivation
5
Much earlier than lecture D2L6 RNNs
Neco RP and Forcada ML 1997 June Asynchronous translations with recurrent neural nets In Neural Networks 1997 International Conference on (Vol 4 pp 2535-2540) IEEE
6
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Representation or Embedding
For clarity letrsquos study a Neural Machine Translation (NMT) case
7
Encoder One-hot encoding
One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed
Word Binary One-hot encoding
zero 00 0000
one 01 0010
two 10 0100
three 11 1000
8
Encoder One-hot encoding
Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)
Word One-hot encoding
economic 000010
growth 001000
has 100000
slowed 000001
Encoder One-hot encoding
One-hot is a very simple representation every word is equidistant from every other word
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
10
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM WiE
The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights
K
K
11
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM Wi
Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process
K
12
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Sequence of continious-space
word representations
Sequence of words
13
Encoder Recurrence
Sequence
Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)
14
Encoder Recurrence
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
15
Encoder Recurrence
time
time
Front View Side View
Rotation 90o
16
Encoder RecurrenceFront View
Rotation 90o
Side View
Representation or embedding of the sentence
17
Sentence Embedding
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
18
(Word Embeddings)
Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
3
In lecture D2L6 RNNs
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation arXiv preprint arXiv14061078 (2014)
Language IN
Language OUT
4
Motivation
5
Much earlier than lecture D2L6 RNNs
Neco RP and Forcada ML 1997 June Asynchronous translations with recurrent neural nets In Neural Networks 1997 International Conference on (Vol 4 pp 2535-2540) IEEE
6
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Representation or Embedding
For clarity letrsquos study a Neural Machine Translation (NMT) case
7
Encoder One-hot encoding
One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed
Word Binary One-hot encoding
zero 00 0000
one 01 0010
two 10 0100
three 11 1000
8
Encoder One-hot encoding
Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)
Word One-hot encoding
economic 000010
growth 001000
has 100000
slowed 000001
Encoder One-hot encoding
One-hot is a very simple representation every word is equidistant from every other word
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
10
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM WiE
The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights
K
K
11
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM Wi
Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process
K
12
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Sequence of continious-space
word representations
Sequence of words
13
Encoder Recurrence
Sequence
Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)
14
Encoder Recurrence
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
15
Encoder Recurrence
time
time
Front View Side View
Rotation 90o
16
Encoder RecurrenceFront View
Rotation 90o
Side View
Representation or embedding of the sentence
17
Sentence Embedding
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
18
(Word Embeddings)
Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
4
Motivation
5
Much earlier than lecture D2L6 RNNs
Neco RP and Forcada ML 1997 June Asynchronous translations with recurrent neural nets In Neural Networks 1997 International Conference on (Vol 4 pp 2535-2540) IEEE
6
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Representation or Embedding
For clarity letrsquos study a Neural Machine Translation (NMT) case
7
Encoder One-hot encoding
One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed
Word Binary One-hot encoding
zero 00 0000
one 01 0010
two 10 0100
three 11 1000
8
Encoder One-hot encoding
Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)
Word One-hot encoding
economic 000010
growth 001000
has 100000
slowed 000001
Encoder One-hot encoding
One-hot is a very simple representation every word is equidistant from every other word
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
10
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM WiE
The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights
K
K
11
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM Wi
Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process
K
12
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Sequence of continious-space
word representations
Sequence of words
13
Encoder Recurrence
Sequence
Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)
14
Encoder Recurrence
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
15
Encoder Recurrence
time
time
Front View Side View
Rotation 90o
16
Encoder RecurrenceFront View
Rotation 90o
Side View
Representation or embedding of the sentence
17
Sentence Embedding
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
18
(Word Embeddings)
Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until an <EOS> (End Of Sentence) "word" is predicted
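The decoding loop described over the last slides (state update, dot-product scores, softmax, stop at <EOS>) can be written compactly. Everything below is a toy vanilla-RNN stand-in for the real gated decoder, with random untrained weights, so the generated ids are meaningless; the structure of the loop is the point.

```python
import numpy as np

def softmax(e):
    e = e - e.max()                  # shift for numerical stability
    p = np.exp(e)
    return p / p.sum()

def greedy_decode(h, step, W_out, eos_id, max_len=20):
    """Generate word ids until <EOS> is predicted.
    h: sentence embedding; step: function updating the internal state z;
    W_out: one weight row per vocabulary word (dot-product scores)."""
    z = np.zeros(W_out.shape[1])
    u = eos_id                       # conventionally start from <EOS>/<BOS>
    out = []
    for _ in range(max_len):
        z = step(z, u, h)            # z_i from z_{i-1}, u_{i-1} and h
        probs = softmax(W_out @ z)   # score every word k, then normalize
        u = int(np.argmax(probs))    # greedy choice of the next word
        if u == eos_id:
            break
        out.append(u)
    return out

rng = np.random.default_rng(2)
V, zdim, hdim = 8, 6, 5              # toy sizes (assumptions)
W_out = rng.normal(size=(V, zdim))
Wz = 0.1 * rng.normal(size=(zdim, zdim))
Wh = 0.1 * rng.normal(size=(zdim, hdim))
Eu = 0.1 * rng.normal(size=(zdim, V))   # toy embedding of the previous word

def step(z, u, h):
    return np.tanh(Wz @ z + Wh @ h + Eu[:, u])

words = greedy_decode(rng.normal(size=hdim), step, W_out, eos_id=0)
```

Real systems replace the greedy argmax with beam search, but the stopping condition on <EOS> is the same.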
23
Encoder-Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
24
Encoder-Decoder Training: Dataset of pairs of sentences in the two languages to translate
Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015
only takes into account image features in the first hidden state
Multimodal Recurrent Neural Network
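A hedged sketch of that design choice: the CNN descriptor conditions only the initial hidden state, and every later step sees word inputs alone. All names and sizes below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
fdim, hdim, d = 7, 5, 4                  # toy sizes (assumptions)
W_img = 0.1 * rng.normal(size=(hdim, fdim))
Wh = 0.1 * rng.normal(size=(hdim, hdim))
Wx = 0.1 * rng.normal(size=(hdim, d))

cnn_feature = rng.normal(size=fdim)      # stand-in for a CNN image descriptor
h = np.tanh(W_img @ cnn_feature)         # image enters only at the first state

for x in rng.normal(size=(3, d)):        # later steps see word embeddings only
    h = np.tanh(Wh @ h + Wx @ x)
```

Later captioning models instead re-inject the image (or attend to it) at every step; this sketch shows the first-state-only variant criticized on the slide.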
29
Captioning Show amp Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015
30
Captioning Show amp Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015 [code]
32
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016
XAVI: "man has short hair", "man with short hair"
AMAIA: "a woman wearing a black shirt"
BOTH: "two men wearing black glasses"
35
Captioning (+ Retrieval) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016
36
Captioning HRNE
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016
Figure labels: LSTM unit (2nd layer); time axis t = 1 ... t = T; image; hidden state at t = T; first chunk of data
37
Visual Question Answering
[z1, z2, ..., zN] [y1, y2, ..., yM]
"Is economic growth decreasing?" → Encode / Encode → Decode → "Yes"
38
Extract visual features / Embedding (question) → Merge → Predict answer
Question: "What object is flying?" → Answer: "Kite"
Visual Question Answering
Slide credit Issey Masuda
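The extract-embed-merge-predict pipeline in this slide can be sketched as follows. The elementwise-product merge and all dimensions are illustrative choices, not the exact model from the lecture:

```python
import numpy as np

def vqa_merge(img_feat, q_emb, W_img, W_q, W_ans):
    """Project visual features and the question embedding to a common
    space, merge them (here: elementwise product), and classify the
    answer over a fixed answer vocabulary."""
    v = np.tanh(W_img @ img_feat)    # visual branch
    q = np.tanh(W_q @ q_emb)         # textual branch
    merged = v * q                   # merge step (one of several options)
    scores = W_ans @ merged          # one score per candidate answer
    scores = scores - scores.max()   # stable softmax
    p = np.exp(scores)
    return p / p.sum()

rng = np.random.default_rng(4)
fdim, qdim, hdim, n_answers = 8, 6, 5, 10     # toy sizes (assumptions)
probs = vqa_merge(rng.normal(size=fdim), rng.normal(size=qdim),
                  rng.normal(size=(hdim, fdim)),
                  rng.normal(size=(hdim, qdim)),
                  rng.normal(size=(n_answers, hdim)))
answer = int(np.argmax(probs))       # index into the answer vocabulary
```

Treating VQA as classification over frequent answers, rather than free-form decoding, is the common baseline formulation.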
39
Visual Question Answering
Noh, H., Seo, P. H., & Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." arXiv preprint arXiv:1603.01417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016
Main idea: split the image into local regions and consider each region equivalent to a sentence.
Local Region Feature Extraction: CNN (VGG-19). (1) Rescale input to 448x448. (2) Take the output of the last pooling layer → D = 512x14x14 → 196 local region vectors of 512 dimensions each.
Visual feature embedding: a matrix W projects image features into the "q" textual space.
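The feature-map reshape described above is a one-liner in numpy. The VGG-19 activations themselves are faked with random numbers here, and the textual-space size is an assumption:

```python
import numpy as np

rng = np.random.default_rng(5)
feat = rng.normal(size=(512, 14, 14))      # last pooling output, D = 512x14x14
regions = feat.reshape(512, 14 * 14).T     # -> 196 local region vectors, 512-d
assert regions.shape == (196, 512)

q_dim = 300                                # toy textual-space size (assumption)
W = 0.01 * rng.normal(size=(q_dim, 512))   # visual-to-textual projection
projected = regions @ W.T                  # each region now lives in "q" space
print(projected.shape)                     # (196, 300)
```

Each of the 196 rows then plays the role of one "sentence" for the memory module.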
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016
43
Datasets Visual Genome
Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
Results on the Visual Question Answering challenge (accuracy, %):
Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM & CNN: 54.06
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All yes: 29.88
I. Masuda-Mora, "Open-Ended Visual Question-Answering": 53.62. Submitted as BSc ETSETB thesis [clean code in Keras, perfect for beginners!]
Challenges Visual Question Answering
51
Summary: Embedding language and vision into semantic embeddings allows fusion learning.
Very high interest among researchers. Great topic for your thesis!
Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?
52
Conclusions
New Turing test: How to evaluate AI's image understanding?
Slide credit Issey Masuda
53
Learn more: Julia Hockenmaier
54
Thanks! Q&A. Follow me at
https://imatge.upc.edu/web/people/xavier-giro
@DocXavi / @ProfessorXavi
5
Much earlier than lecture D2L6 RNNs
Neco RP and Forcada ML 1997 June Asynchronous translations with recurrent neural nets In Neural Networks 1997 International Conference on (Vol 4 pp 2535-2540) IEEE
6
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Representation or Embedding
For clarity letrsquos study a Neural Machine Translation (NMT) case
7
Encoder One-hot encoding
One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed
Word Binary One-hot encoding
zero 00 0000
one 01 0010
two 10 0100
three 11 1000
8
Encoder One-hot encoding
Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)
Word One-hot encoding
economic 000010
growth 001000
has 100000
slowed 000001
Encoder One-hot encoding
One-hot is a very simple representation every word is equidistant from every other word
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
10
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM WiE
The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights
K
K
11
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM Wi
Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process
K
12
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Sequence of continious-space
word representations
Sequence of words
13
Encoder Recurrence
Sequence
Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)
14
Encoder Recurrence
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
15
Encoder Recurrence
time
time
Front View Side View
Rotation 90o
16
Encoder RecurrenceFront View
Rotation 90o
Side View
Representation or embedding of the sentence
17
Sentence Embedding
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
18
(Word Embeddings)
Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
6
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Representation or Embedding
For clarity letrsquos study a Neural Machine Translation (NMT) case
7
Encoder One-hot encoding
One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed
Word Binary One-hot encoding
zero 00 0000
one 01 0010
two 10 0100
three 11 1000
8
Encoder One-hot encoding
Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)
Word One-hot encoding
economic 000010
growth 001000
has 100000
slowed 000001
Encoder One-hot encoding
One-hot is a very simple representation every word is equidistant from every other word
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
10
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM WiE
The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights
K
K
11
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM Wi
Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process
K
12
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Sequence of continious-space
word representations
Sequence of words
13
Encoder Recurrence
Sequence
Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)
14
Encoder Recurrence
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
15
Encoder Recurrence
time
time
Front View Side View
Rotation 90o
16
Encoder RecurrenceFront View
Rotation 90o
Side View
Representation or embedding of the sentence
17
Sentence Embedding
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
18
(Word Embeddings)
Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
7
Encoder One-hot encoding
One-hot encoding Binary representation of the words in a vocabulary where the only combinations with a single hot (1) bit and all other cold (0) bits are allowed
Word Binary One-hot encoding
zero 00 0000
one 01 0010
two 10 0100
three 11 1000
8
Encoder One-hot encoding
Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)
Word One-hot encoding
economic 000010
growth 001000
has 100000
slowed 000001
Encoder One-hot encoding
One-hot is a very simple representation every word is equidistant from every other word
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
10
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM WiE
The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights
K
K
11
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM Wi
Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process
K
12
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Sequence of continious-space
word representations
Sequence of words
13
Encoder Recurrence
Sequence
Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)
14
Encoder Recurrence
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
15
Encoder Recurrence
time
time
Front View Side View
Rotation 90o
16
Encoder RecurrenceFront View
Rotation 90o
Side View
Representation or embedding of the sentence
17
Sentence Embedding
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
18
(Word Embeddings)
Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks! Q&A. Follow me at:
https://imatge.upc.edu/web/people/xavier-giro
@DocXavi / @ProfessorXavi
8
Encoder One-hot encoding
Natural language words can also be one-hot encoded on a vector of dimensionality equal to the size of the dictionary (K)
Word One-hot encoding
economic 000010
growth 001000
has 100000
slowed 000001
Encoder One-hot encoding
One-hot is a very simple representation every word is equidistant from every other word
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
10
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM WiE
The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights
K
K
11
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM Wi
Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process
K
12
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Sequence of continious-space
word representations
Sequence of words
13
Encoder Recurrence
Sequence
Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)
14
Encoder Recurrence
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
15
Encoder Recurrence
time
time
Front View Side View
Rotation 90o
16
Encoder RecurrenceFront View
Rotation 90o
Side View
Representation or embedding of the sentence
17
Sentence Embedding
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
18
(Word Embeddings)
Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
Encoder One-hot encoding
One-hot is a very simple representation every word is equidistant from every other word
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
10
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM WiE
The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights
K
K
11
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM Wi
Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process
K
12
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Sequence of continious-space
word representations
Sequence of words
13
Encoder Recurrence
Sequence
Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)
14
Encoder Recurrence
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
15
Encoder Recurrence
time
time
Front View Side View
Rotation 90o
16
Encoder RecurrenceFront View
Rotation 90o
Side View
Representation or embedding of the sentence
17
Sentence Embedding
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
18
(Word Embeddings)
Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
10
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM WiE
The one-hot is linearly projected to a space of lower dimension (typically 100-500) with matrix E for learned weights
K
K
11
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM Wi
Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process
K
12
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Sequence of continious-space
word representations
Sequence of words
13
Encoder Recurrence
Sequence
Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)
14
Encoder Recurrence
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
15
Encoder Recurrence
time
time
Front View Side View
Rotation 90o
16
Encoder RecurrenceFront View
Rotation 90o
Side View
Representation or embedding of the sentence
17
Sentence Embedding
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
18
(Word Embeddings)
Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
11
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
siM Wi
Projection matrix E corresponds to a fully connected layer so its parameters will be learned with a training process
K
12
Encoder Projection to continious space
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
Sequence of continious-space
word representations
Sequence of words
13
Encoder Recurrence
Sequence
Figure Cristopher Olah ldquoUnderstanding LSTM Networksrdquo (2015)
14
Encoder Recurrence
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
15
Encoder Recurrence
time
time
Front View Side View
Rotation 90o
16
Encoder RecurrenceFront View
Rotation 90o
Side View
Representation or embedding of the sentence
17
Sentence Embedding
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
Clusters by meaning appear on 2-dimensional PCA of LSTM hidden states
18
(Word Embeddings)
Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
12
Encoder: Projection to continuous space
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
Sequence of continuous-space word representations
Sequence of words
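The projection described above (one-hot times a learned matrix E) is just a column lookup. A minimal numpy sketch, with illustrative dimensions (real systems use a vocabulary of thousands and embeddings of 100-500 dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 6  # vocabulary size (toy)
d = 4  # embedding dimension (100-500 in practice)
E = rng.normal(size=(d, K))  # learned projection matrix

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

# Projecting a one-hot vector with E simply selects one column of E:
w = 2                   # index of some word in the vocabulary
s = E @ one_hot(w, K)   # continuous-space word representation
assert np.allclose(s, E[:, w])
```

This is why the projection is usually implemented as an embedding lookup table rather than an actual matrix product.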
13
Encoder Recurrence
Sequence
Figure: Christopher Olah, "Understanding LSTM Networks" (2015)
14
Encoder Recurrence
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
15
Encoder Recurrence
(Figure: the recurrence unrolled over time, shown as a Front View and a Side View rotated 90°.)
16
Encoder: Recurrence (Front View and Side View, rotated 90°)
The final hidden state is the representation, or embedding, of the sentence.
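The recurrence above can be sketched as follows. This is a hedged toy (a plain tanh RNN with random weights, not the lecture's trained LSTM/GRU encoder); the point is that a variable-length sequence is compressed into one fixed-size vector:

```python
import numpy as np

rng = np.random.default_rng(1)
d, h = 4, 5  # word-embedding size and hidden-state size (illustrative)
Wx = rng.normal(scale=0.5, size=(h, d))  # input-to-hidden weights
Wh = rng.normal(scale=0.5, size=(h, h))  # hidden-to-hidden weights

def encode(word_vectors):
    """Run a plain (tanh) RNN over the sequence; the last hidden
    state is the fixed-length sentence representation."""
    state = np.zeros(h)
    for x in word_vectors:
        state = np.tanh(Wx @ x + Wh @ state)
    return state

sentence = [rng.normal(size=d) for _ in range(7)]
embedding = encode(sentence)  # one h-dimensional vector, for any sentence length
```

Trained encoders replace the tanh cell with an LSTM or GRU, but the shape of the computation is the same.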
17
Sentence Embedding
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014.
Clusters by meaning appear in a 2-dimensional PCA of the LSTM hidden states.
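A PCA projection like the one in Sutskever et al.'s figure can be computed directly from the matrix of sentence embeddings. A minimal sketch (random vectors stand in for real LSTM hidden states):

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)                      # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                         # coordinates in the top-2 PC basis

rng = np.random.default_rng(2)
H = rng.normal(size=(10, 8))  # 10 sentence embeddings of dimension 8 (toy)
P = pca_2d(H)                 # 10 points in 2-D, ready to scatter-plot
```

Plotting `P` and labeling each point with its sentence reproduces the kind of semantic-cluster figure shown on the slide.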
18
(Word Embeddings)
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." In Advances in Neural Information Processing Systems, pp. 3111-3119. 2013.
19
Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
The RNN's internal state z_i depends on the sentence embedding h_T, the previous word u_(i-1), and the previous internal state z_(i-1).
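That dependency can be written as one update. A hedged sketch (random weights, a simple tanh cell rather than the GRU used in the cited paper; dimensions illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
h, d, e = 5, 4, 6  # sentence-embedding, word-embedding, decoder-state sizes
Wz = rng.normal(scale=0.5, size=(e, e))  # weights on z_(i-1)
Wu = rng.normal(scale=0.5, size=(e, d))  # weights on previous word u_(i-1)
Wc = rng.normal(scale=0.5, size=(e, h))  # weights on sentence embedding h_T

def decoder_step(z_prev, u_prev, h_T):
    # z_i = f(z_(i-1), u_(i-1), h_T): all three inputs enter every step
    return np.tanh(Wz @ z_prev + Wu @ u_prev + Wc @ h_T)

z = decoder_step(np.zeros(e), np.zeros(d), rng.normal(size=h))
```

Note that h_T is fed to every step, so the whole source sentence conditions each generated word.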
20
Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
With z_i ready, we can score each word k in the vocabulary with a dot product between the RNN internal state z_i and the neuron weights w_k for word k.
21
Decoder
Bridle, John S. "Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters." NIPS 1989.
... and finally normalize the scores to word probabilities with a softmax: the score for word k becomes the probability that the i-th word is word k, conditioned on the previous words and the hidden state.
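The score-then-softmax step can be sketched in a few lines (random weights, toy vocabulary; the max-subtraction is the standard numerical-stability trick):

```python
import numpy as np

def softmax(scores):
    s = scores - scores.max()  # subtract max for numerical stability
    p = np.exp(s)
    return p / p.sum()

rng = np.random.default_rng(4)
K, e = 6, 5                      # vocabulary size, decoder-state size (toy)
W_out = rng.normal(size=(K, e))  # one weight row w_k per vocabulary word

z_i = rng.normal(size=e)   # current decoder internal state
scores = W_out @ z_i       # dot product of z_i with each word's weights
probs = softmax(scores)    # probability that the i-th word is word k
```

The output word can then be sampled from `probs`, or taken greedily with `argmax`.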
22
Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
More words are generated for the decoded sentence until an <EOS> (End Of Sentence) "word" is predicted.
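The stopping rule is a simple loop. A toy sketch (the `next_word` callable stands in for taking the argmax of the softmax distribution; a `max_len` safety cap is standard so generation always terminates):

```python
EOS = 0  # reserved index for the <EOS> token (assumption for this toy)

def generate(first_word, next_word, max_len=20):
    """Greedy decoding: keep emitting the most likely next word
    until <EOS> comes out (or a safety limit is hit)."""
    words = [first_word]
    while words[-1] != EOS and len(words) < max_len:
        words.append(next_word(words[-1]))
    return words

# Toy stand-in predictor that counts down to <EOS>, so the loop stops:
out = generate(3, lambda w: w - 1)
assert out == [3, 2, 1, 0]
```

Real decoders usually replace the greedy choice with beam search, but the <EOS> termination is the same.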
23
Encoder-Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
24
Encoder-Decoder: Training. Dataset of pairs of sentences in the two languages to translate.
Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.
25
Encoder-Decoder Seq2Seq
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014.
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.
28
Captioning DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.
The Multimodal Recurrent Neural Network only takes image features into account in the first hidden state.
29
Captioning: Show & Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
30
Captioning: Show & Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015.
31
Captioning: LSTM for image & video
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015. [code]
32
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
34
Captioning (+ Detection) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
XAVI: "man has short hair", "man with short hair"
AMAIA: "a woman wearing a black shirt"
BOTH: "two men wearing black glasses"
35
Captioning (+ Retrieval) DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016.
36
Captioning HRNE
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.
(Figure: an LSTM unit in the 2nd layer reads the hidden state at t = T produced by the first layer over each chunk of data, for t = 1 to t = T.)
37
Visual Question Answering
The question "Is economic growth decreasing?" is encoded (states [z1, z2, …, zN]) and the answer "Yes" is decoded (states [y1, y2, …, yM]).
38
Pipeline: extract visual features from the image, embed the question, merge both, and predict the answer.
Question: "What object is flying?" Answer: "Kite".
Visual Question Answering
Slide credit Issey Masuda
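The extract / embed / merge / predict pipeline above can be sketched end to end. This is a hedged toy, not any published model: mean pooling stands in for an RNN question encoder, concatenation is the simplest possible merge, and a random linear layer stands in for a trained answer classifier:

```python
import numpy as np

rng = np.random.default_rng(5)
q_dim, v_dim, n_answers = 5, 7, 3  # illustrative sizes

def embed_question(word_vectors):
    # Stand-in for an RNN question encoder: mean of the word embeddings.
    return np.mean(word_vectors, axis=0)

def merge_and_predict(q, v, W):
    fused = np.concatenate([q, v])  # simplest fusion: concatenation
    scores = W @ fused              # linear answer classifier
    return int(np.argmax(scores))   # index of the predicted answer

W = rng.normal(size=(n_answers, q_dim + v_dim))
q = embed_question([rng.normal(size=q_dim) for _ in range(4)])
v = rng.normal(size=v_dim)          # stand-in for a CNN image feature vector
answer = merge_and_predict(q, v, W)  # e.g. an index into {"kite", "dog", ...}
```

Published systems differ mainly in the merge step (concatenation, element-wise product, attention), which is exactly what the following slides explore.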
39
Visual Question Answering
Noh, H., Seo, P. H., and Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016.
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." arXiv preprint arXiv:1603.01417 (2016).
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016.
Main idea: split the image into local regions and consider each region equivalent to a sentence.
Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448x448. (2) Take the output of the last pooling layer → D = 512x14x14 → 196 local region vectors of 512 dimensions each.
Visual feature embedding: a matrix W projects the image features into the "q" textual space.
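The reshape-and-project step can be sketched directly (random arrays stand in for real VGG-19 activations and for the learned W; the 300-d text dimension is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-in for the output of VGG-19's last pooling layer on a 448x448
# input: 512 channels over a 14x14 spatial grid.
features = rng.normal(size=(512, 14, 14))

# 14 * 14 = 196 local regions, each described by a 512-d vector.
regions = features.reshape(512, 196).T  # shape (196, 512)

# Project every region into the question-text embedding space with W.
d_text = 300                                 # illustrative text dimension
W = rng.normal(scale=0.01, size=(d_text, 512))
projected = regions @ W.T                    # shape (196, 300)
```

After this projection, each image region lives in the same space as the question words, so the two can be attended over jointly.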
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.
43
Datasets Visual Genome
Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016).
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge: Microsoft COCO
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
Results (accuracy, out of 100.0):
Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM & CNN: 54.06
Baseline nearest neighbor: 42.85
Baseline prior per question type: 37.47
Baseline all "yes": 29.88
I. Masuda-Mora: 53.62
I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as BSc ETSETB thesis. [clean code in Keras, perfect for beginners!]
Challenges Visual Question Answering
51
Summary: Embedding language and vision into shared semantic spaces allows fusion and learning.
Very high interest among researchers: a great topic for your thesis.
Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?
52
Conclusions
New Turing test: how do we evaluate an AI's image understanding?
Slide credit Issey Masuda
53
Learn more: Julia Hockenmaier
54
Thanks! Q&A. Follow me at:
https://imatge.upc.edu/web/people/xavier-giro
DocXavi / ProfessorXavi
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
17
Sentence Embedding
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
Clusters by meaning appear in a 2-dimensional PCA of the LSTM hidden states.
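A minimal sketch (toy sizes, random weights, all names assumed) of how such a sentence embedding arises: each one-hot word is projected to a continuous space with a matrix E, and a simple RNN folds the sequence into its last hidden state h_T, which is the vector that gets clustered.

```python
import numpy as np

# Toy encoder: one-hot words -> continuous embeddings (matrix E) -> RNN.
# Sizes and weights are illustrative, not the lecture's actual model.
K, d_emb, d_hid = 10, 4, 6                     # vocab, embedding, state dims
rng = np.random.default_rng(1)
E = rng.standard_normal((d_emb, K)) * 0.1      # projection to continuous space
W_hh = rng.standard_normal((d_hid, d_hid)) * 0.1
W_xh = rng.standard_normal((d_hid, d_emb)) * 0.1

sentence = [3, 7, 1]              # word indices into the vocabulary
h = np.zeros(d_hid)
for w in sentence:
    x = E[:, w]                   # multiplying E by a one-hot picks a column
    h = np.tanh(W_hh @ h + W_xh @ x)
sentence_embedding = h            # h_T summarizes the whole sentence
```

Running PCA on many such `sentence_embedding` vectors is what produces the 2-D meaning clusters shown on the slide.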
18
(Word Embeddings)
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words and phrases and their compositionality." NIPS 2013.
19
Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
The RNN's internal state z_i depends on the sentence embedding h_T, the previous word u_{i-1}, and the previous internal state z_{i-1}.
20
Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
With z_i ready, we can score each word k in the vocabulary with a dot product:
(diagram labels: the RNN internal state z_i, and the neuron weights for word k)
21
Decoder
Bridle, John S. "Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters." NIPS 1989.
and finally normalize to word probabilities with a softmax
(diagram labels: score for word k; probability that the i-th word is word k, given the previous words and the hidden state)
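The two decoder steps above, scoring by dot product and normalizing with a softmax, can be sketched with toy sizes (all dimensions and weights here are made up for illustration):

```python
import numpy as np

# Score every vocabulary word k against the decoder internal state z_i
# with a dot product, then turn the scores into word probabilities.
V, d = 5, 4                          # vocabulary size, state dimension (toy)
rng = np.random.default_rng(0)
W = rng.standard_normal((V, d))      # one weight row ("neuron") per word k
z_i = rng.standard_normal(d)         # decoder internal state

scores = W @ z_i                     # e_k = w_k . z_i, one score per word
scores -= scores.max()               # subtract the max for numerical stability
p = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

next_word = int(p.argmax())          # greedy choice of the i-th word
```

In practice the next word is sampled or beam-searched rather than taken greedily, but the scoring and normalization are exactly these two lines.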
22
Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
More words of the decoded sentence are generated until an <EOS> (End Of Sentence) "word" is predicted.
23
Encoder-Decoder
Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015)
24
Encoder-Decoder: Training
Dataset of pairs of sentences in the two languages to translate.
Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." EMNLP 2014.
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.
28
Captioning DeepImageSent
(Slides by Marc Bolaños) Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015.
only takes into account image features in the first hidden state
Multimodal Recurrent Neural Network
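A hedged sketch of that point (toy sizes and random weights, not the paper's exact equations): the CNN image feature conditions the caption RNN only through the initial hidden state; every later step sees only the previous word's embedding.

```python
import numpy as np

# Image feature enters only when forming h_0; afterwards the recurrence
# is a plain Elman-style update over word embeddings. All sizes are toy.
d_img, d_hid, d_emb = 8, 6, 5
rng = np.random.default_rng(2)
W_ih = rng.standard_normal((d_hid, d_img)) * 0.1   # image -> h_0 projection
W_hh = rng.standard_normal((d_hid, d_hid)) * 0.1
W_xh = rng.standard_normal((d_hid, d_emb)) * 0.1

cnn_feature = rng.standard_normal(d_img)     # e.g. a CNN fc-layer output
h = np.tanh(W_ih @ cnn_feature)              # t = 0: the image is used here only
word_embs = rng.standard_normal((3, d_emb))  # embeddings of 3 caption words
for x in word_embs:                          # t >= 1: the image no longer appears
    h = np.tanh(W_hh @ h + W_xh @ x)
```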
29
Captioning: Show & Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning: Show & Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning: LSTM for image & video
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015 [code]
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI: "man has short hair", "man with short hair"
AMAIA: "a woman wearing a black shirt"
BOTH: "two men wearing black glasses"
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.
(diagram: a second-layer LSTM unit runs over time t = 1 … t = T; its hidden state at t = T summarizes the first chunk of data)
37
Visual Question Answering
(diagram: the question "Is economic growth decreasing?" ([z1, z2, …, zN]) and a second input ([y1, y2, …, yM]) are each encoded, and a decoder produces the answer "Yes")
38
(pipeline: extract visual features from the image; embed the question "What object is flying?"; merge both; predict the answer: "Kite")
Visual Question Answering
Slide credit Issey Masuda
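A hypothetical sketch of that pipeline (the sizes, the concatenation merge, and the toy answer set are assumptions, not the lecture's exact model): fixed-size visual and question embeddings are merged, and a linear layer plus softmax scores a closed set of candidate answers.

```python
import numpy as np

# Merge a visual embedding and a question embedding, then classify over
# a small closed answer vocabulary. Everything here is illustrative.
rng = np.random.default_rng(3)
v = rng.standard_normal(16)               # visual features from a CNN
q = rng.standard_normal(12)               # question embedding from an RNN
answers = ["kite", "dog", "yes", "no"]    # toy answer vocabulary
W = rng.standard_normal((len(answers), v.size + q.size)) * 0.1

merged = np.concatenate([v, q])           # "merge" step
logits = W @ merged
p = np.exp(logits - logits.max())         # stable softmax over the answers
p /= p.sum()
answer = answers[int(p.argmax())]         # "predict answer" step
```

Treating VQA as classification over frequent answers, rather than free-form decoding, is a common design choice because a few thousand answers cover most questions.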
39
Visual Question Answering
Noh, H., Seo, P. H., & Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016.
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." arXiv preprint arXiv:1603.01417 (2016).
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea: split the image into local regions, and consider each region equivalent to a sentence.
Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448×448. (2) Take the output of the last pooling layer → D = 512×14×14 → 196 local region vectors of 512 dimensions.
Visual feature embedding: a matrix W projects the image features into the "q" textual space.
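The region bookkeeping above, written out as a shape check (W is random here, and its textual dimension of 300 is an assumption for illustration):

```python
import numpy as np

# VGG-19 last pooling output: 512 channels on a 14x14 grid.
rng = np.random.default_rng(4)
feat = rng.standard_normal((512, 14, 14))      # simulated pooling activations
regions = feat.reshape(512, 14 * 14).T         # -> 196 region vectors, 512-d each

d_text = 300                                   # assumed textual-space dimension
W = rng.standard_normal((d_text, 512)) * 0.01  # visual -> "q" textual projection
embedded = regions @ W.T                       # -> (196, 300) embedded regions
```

Each of the 196 rows then plays the role of one "sentence" for the dynamic memory network.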
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.
43
Datasets Visual Genome
Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016).
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge: Microsoft COCO
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
VQA challenge accuracy (%):
Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM & CNN: 54.06
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All yes: 29.88
I. Masuda-Mora: 53.62
I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as BSc ETSETB thesis. [clean code in Keras, perfect for beginners]
Challenges Visual Question Answering
51
Summary
Embedding language and vision into semantic embeddings allows fusion and learning.
Very high interest among researchers: a great topic for your thesis.
Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?
52
Conclusions
New Turing test: How to evaluate AI's image understanding?
Slide credit Issey Masuda
53
Learn more: Julia Hockenmaier
54
Thanks! Q&A. Follow me at:
https://imatge.upc.edu/web/people/xavier-giro
DocXavi / ProfessorXavi
18
(Word Embeddings)
Mikolov Tomas Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean Distributed representations of words and phrases and their compositionality In Advances in neural information processing systems pp 3111-3119 2013
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
19
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
RNNrsquos internal state zi depends on sentence embedding ht previous word ui-1 and previous internal state zi-1
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
20
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
With zi ready we can score each word k in the vocabulary with a dot product
RNN internal
state
Neuron weights for
word k
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
21
Decoder
Bridle John S Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters NIPS 1989
and finally normalize to word probabilities with a softmax
Score for word k
Probability that the ith word is word k
Previous words Hidden state
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
Multimodal Recurrent Neural Network: it only takes into account image features in the first hidden state.
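The "image only in the first hidden state" idea can be sketched in numpy (a toy sketch: all sizes and weights are invented; `W_img` projects the CNN descriptor into the hidden space):

```python
import numpy as np

rng = np.random.default_rng(0)
D_img, D_h, D_emb = 8, 4, 4               # toy dimensions (invented)
W_img = rng.normal(size=(D_h, D_img))     # projects CNN features into the hidden space
W_h   = rng.normal(size=(D_h, D_h))
W_x   = rng.normal(size=(D_h, D_emb))

cnn_features = rng.normal(size=D_img)     # stand-in for the CNN image descriptor
h = np.tanh(W_img @ cnn_features)         # the image conditions ONLY the initial state h0

for x in rng.normal(size=(3, D_emb)):     # word embeddings for 3 time steps
    h = np.tanh(W_h @ h + W_x @ x)        # later steps never see the image again

print(h.shape)  # (4,)
```

This is the design choice the slide notes: the image biases the recurrence once, at t = 0, and the language model takes over from there.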
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015 [code].
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI: "man has short hair", "man with short hair"
AMAIA: "a woman wearing a black shirt"
BOTH: "two men wearing black glasses"
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.
(Figure: the 2nd-layer LSTM unit runs over time from t = 1 to t = T across the image sequence; its hidden state at t = T summarizes the first chunk of data.)
37
Visual Question Answering
The question "Is economic growth decreasing?" is encoded into [z1, z2, …, zN]; the answer "Yes" is decoded from [y1, y2, …, yM].
38
Extract visual features, embed the question, merge both, and predict the answer.
Question: "What object is flying?"
Answer: "Kite"
Visual Question Answering
Slide credit Issey Masuda
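The extract-embed-merge-predict pipeline above can be sketched in numpy (a toy sketch: all dimensions, weights, and the answer vocabulary are invented; concatenation stands in for the merge step):

```python
import numpy as np

rng = np.random.default_rng(1)
D_img, D_q, N_ans = 6, 5, 3                  # toy dimensions (invented)

img_feat = rng.normal(size=D_img)            # stand-in CNN image features
q_feat   = rng.normal(size=D_q)              # stand-in question embedding

merged = np.concatenate([img_feat, q_feat])  # "Merge" step: simple concatenation

W = rng.normal(size=(N_ans, D_img + D_q))    # answer-classifier weights
scores = W @ merged
probs = np.exp(scores - scores.max())
probs /= probs.sum()                         # softmax over the answer vocabulary

answers = ["kite", "dog", "car"]             # invented answer vocabulary
print(answers[int(probs.argmax())])
```

Real systems replace the concatenation with richer fusion (element-wise product, attention, dynamic parameters as in DPPnet), but the overall shape of the pipeline is the same.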
39
Visual Question Answering
Noh, H., Seo, P. H., & Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016.
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea: split the image into local regions and consider each region equivalent to a sentence.
Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448x448. (2) Take the output of the last pooling layer → D = 512×14×14 → 196 local region vectors of 512 dimensions.
Visual feature embedding: a matrix W projects the image features into the textual ("q") space.
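The 512×14×14 → 196 region vectors step, plus the projection by W, is just a reshape and a matrix product (a sketch: the feature map is a zero placeholder and the 300-d textual embedding size is an assumption):

```python
import numpy as np

# Stand-in for the VGG-19 last-pooling output on a 448x448 input: 512 x 14 x 14
fmap = np.zeros((512, 14, 14))

# Flatten the 14x14 spatial grid into 196 local region vectors of dimension 512
regions = fmap.reshape(512, 14 * 14).T   # shape (196, 512)

# Project every region into the textual ("q") embedding space with a matrix W
D_text = 300                             # assumed embedding size
W = np.zeros((512, D_text))
regions_txt = regions @ W                # shape (196, 300)

print(regions.shape, regions_txt.shape)
```

Each row of `regions_txt` now plays the role of one "sentence" for the memory network to attend over.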
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016.
43
Datasets Visual Genome
Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." arXiv preprint arXiv:1602.07332 (2016).
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
Results on the Visual Question Answering challenge (accuracy, %):
Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM & CNN: 54.06
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline All yes: 29.88
Masuda-Mora: 53.62
I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as BSc ETSETB thesis. [clean code in Keras, perfect for beginners]
Challenges Visual Question Answering
51
Summary: embedding language and vision into semantic embeddings allows fusion learning.
Very high interest among researchers. Great topic for your thesis!
Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?
52
Conclusions
New Turing test: how to evaluate AI's image understanding?
Slide credit Issey Masuda
53
Learn more: Julia Hockenmaier
54
Thanks! Q&A. Follow me at
https://imatge.upc.edu/web/people/xavier-giro
@DocXavi / ProfessorXavi
22
Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
More words for the decoded sentence are generated until a ltEOSgt (End Of Sentence) ldquowordrdquo is predicted
EOS
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
23
Encoder-Decoder
Kyunghyun Cho ldquoIntroduction to Neural Machine Translation with GPUsrdquo (2015)
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
24
Encoder-Decoder TrainingDataset of pairs of sentences in the two languages to translate
Cho Kyunghyun Bart Van Merrieumlnboer Caglar Gulcehre Dzmitry Bahdanau Fethi Bougares Holger Schwenk and Yoshua Bengio Learning phrase representations using RNN encoder-decoder for statistical machine translation AMNLP 2014
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
25
Encoder-Decoder Seq2Seq
Sutskever Ilya Oriol Vinyals and Quoc V Le Sequence to sequence learning with neural networks NIPS 2014
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
26
Encoder-Decoder Beyond text
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
27
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning: LSTM for image &amp; video
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. "Long-term Recurrent Convolutional Networks for Visual Recognition and Description." CVPR 2015 [code]
32
Captioning (+ Detection): DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016
33
Captioning (+ Detection): DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016
34
Captioning (+ Detection): DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016
XAVI: "man has short hair", "man with short hair"
AMAIA: "a woman wearing a black shirt"
BOTH: "two men wearing black glasses"
35
Captioning (+ Retrieval): DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016
36
Captioning: HRNE
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016
[Figure: a first-layer LSTM encodes each chunk of frames from t = 1 to t = T; a second-layer LSTM unit runs over the chunk summaries, and its hidden state at t = T represents the video.]
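The two-level encoding in the figure can be sketched with a minimal NumPy RNN, assuming simple tanh units instead of the paper's LSTMs and toy dimensions throughout (a schematic, not the HRNE implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6  # feature/hidden size (toy value)

def rnn_encode(xs, W_x, W_h):
    """Plain tanh RNN; returns the final hidden state as a summary of xs."""
    h = np.zeros(d)
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
    return h

# Separate parameters for the two layers of the hierarchy
W1x, W1h = rng.standard_normal((d, d)) * 0.1, rng.standard_normal((d, d)) * 0.1
W2x, W2h = rng.standard_normal((d, d)) * 0.1, rng.standard_normal((d, d)) * 0.1

frames = rng.standard_normal((12, d))   # 12 per-frame feature vectors
chunk_size = 4

# First layer: encode each chunk of consecutive frames into one summary vector
chunks = [rnn_encode(frames[i:i + chunk_size], W1x, W1h)
          for i in range(0, len(frames), chunk_size)]

# Second layer: encode the (much shorter) sequence of chunk summaries
video_vector = rnn_encode(chunks, W2x, W2h)
print(len(chunks), video_vector.shape)
```

The point of the hierarchy is that the second layer sees a sequence of length 3 instead of 12, which shortens the paths gradients must travel through time.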
37
Visual Question Answering
[Figure: encoder-decoder view. The question ("Is economic growth decreasing?") and the image are each encoded into sequences of vectors ([z1, z2, …, zN] and [y1, y2, …, yM]); the merged encoding is then decoded into the answer ("Yes").]
38
Visual Question Answering
Pipeline: extract visual features from the image, embed the question, merge both representations, and predict the answer.
Question: "What object is flying?" → Answer: "Kite"
Slide credit: Issey Masuda
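The extract / embed / merge / predict pipeline can be sketched in a few lines of NumPy. Everything here is a hedged stand-in: random weights, an elementwise-product merge, and a tiny answer vocabulary chosen for illustration, not the layers used by any particular VQA system:

```python
import numpy as np

rng = np.random.default_rng(2)

d_img, d_txt, d_merge, n_answers = 10, 5, 8, 4
answers = ["kite", "dog", "yes", "no"]   # toy answer vocabulary

W_img = rng.standard_normal((d_merge, d_img)) * 0.1  # visual projection
W_txt = rng.standard_normal((d_merge, d_txt)) * 0.1  # question projection
W_out = rng.standard_normal((n_answers, d_merge)) * 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def answer(image_features, question_embedding):
    # Merge the two modalities (here: elementwise product of projections)
    merged = np.tanh(W_img @ image_features) * np.tanh(W_txt @ question_embedding)
    # Predict the answer as a classification over a fixed answer set
    probs = softmax(W_out @ merged)
    return answers[int(np.argmax(probs))]

img = rng.standard_normal(d_img)   # stand-in for CNN visual features
q = rng.standard_normal(d_txt)     # stand-in for the embedded question
print(answer(img, q))
```

Treating VQA as classification over frequent answers, rather than free-form decoding, is the design choice behind most of the baselines on the next slides.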
39
Visual Question Answering
Noh, H., Seo, P. H., and Han, B. "Image question answering using convolutional neural network with dynamic parameter prediction." CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering: Dynamic
(Slides and slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016
41
Visual Question Answering: Dynamic
(Slides and slidecast by Santi Pascual) Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016
Main idea: split the image into local regions and treat each region as the equivalent of a sentence.
Local region feature extraction: CNN (VGG-19). (1) Rescale the input to 448×448. (2) Take the output of the last pooling layer → 512×14×14 → 196 local region vectors of dimension 512.
Visual feature embedding: a matrix W projects the image features into the textual space of the question q.
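The reshaping and projection described above is a pair of tensor operations; a minimal NumPy sketch (with a random stand-in for the VGG-19 feature map and an assumed toy textual dimension of 100):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for the VGG-19 last-pooling output on a 448x448 input: 512 x 14 x 14
feature_map = rng.standard_normal((512, 14, 14))

# Flatten the 14x14 spatial grid into 196 local region vectors of dimension 512
regions = feature_map.reshape(512, -1).T           # shape: (196, 512)

# Project every region into the textual ("q") space with a learned matrix W
d_q = 100                                          # toy textual dimension
W = rng.standard_normal((d_q, 512)) * 0.01
regions_in_q_space = regions @ W.T                 # shape: (196, 100)

print(regions.shape, regions_in_q_space.shape)
```

After this projection, each of the 196 regions lives in the same space as the question encoding, so the memory network can attend over regions exactly as it attends over sentences.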
42
Visual Question Answering: Grounded
(Slides and screencast by Issey Masuda) Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded Question Answering in Images." CVPR 2016
43
Datasets: Visual Genome
Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. "Visual Genome: Connecting language and vision using crowdsourced dense image annotations." arXiv:1602.07332 (2016)
44
Datasets: Microsoft SIND (Sequential Image Narrative Dataset)
45
Challenge: Microsoft COCO Captioning
46
Challenge: Storytelling
47
Challenge: Movie Description (retrieval and fill-in-the-blank)
48
Challenges: Movie Question Answering
49
Challenges: Visual Question Answering
50
Challenges: Visual Question Answering
VQA challenge accuracy (%): Humans 83.30 · UC Berkeley &amp; Sony 66.47 · Baseline LSTM &amp; CNN 54.06 · Masuda-Mora 53.62 · Baseline nearest neighbor 42.85 · Baseline prior per question type 37.47 · Baseline "all yes" 29.88
I. Masuda-Mora, "Open-Ended Visual Question-Answering." Submitted as a BSc thesis at ETSETB [clean code in Keras, perfect for beginners]
51
Summary: Embedding language and vision into shared semantic embeddings allows learning their fusion.
Very high interest among researchers; a great topic for your thesis.
Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?
52
Conclusions
New Turing test: how to evaluate an AI's image understanding?
Slide credit: Issey Masuda
53
Learn more: Julia Hockenmaier
54
Thanks! Q&amp;A. Follow me at:
https://imatge.upc.edu/web/people/xavier-giro
@DocXavi / @ProfessorXavi
28
Captioning DeepImageSent
(Slides by Marc Bolantildeos) Karpathy Andrej and Li Fei-Fei Deep visual-semantic alignments for generating image descriptions CVPR 2015
only takes into accountimage features in the firsthidden state
Multimodal Recurrent Neural Network
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
29
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
30
Captioning Show amp Tell
Vinyals Oriol Alexander Toshev Samy Bengio and Dumitru Erhan Show and tell A neural image caption generator CVPR 2015
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
31
Captioning LSTM for image amp video
Jeffrey Donahue Lisa Anne Hendricks Sergio Guadarrama Marcus Rohrbach Subhashini Venugopalan Kate Saenko Trevor Darrel Long-term Recurrent Convolutional Networks for Visual Recognition and Description CVPR 2015 code
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
32
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
Captioning (+ Detection) DenseCap
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
33
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna Ranjay Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen et al Visual genome Connecting language and vision using crowdsourced dense image annotations arXiv preprint arXiv160207332 (2016)
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft Coco
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
1000
Humans
8330
UC Berkeley amp Sony
6647
Baseline LSTMampCNN
5406
Baseline Nearest neighbor
4285
Baseline Prior per question type
3747
Baseline All yes
2988
5362
I Masuda-Mora ldquoOpen-Ended Visual Question-Answeringrdquo Submitted as BSc ETSETB thesis [clean code in Keras perfect for beginners ]
Challenges Visual Question Answering
51
Summary Embedding language and vision into semantic embeddings
allows fusion learning
Very high interest among researchers Great topic for your
thesis
Will vision and language (and multimedia) communities be
merged with (absorbed by) the machine learning one
52
Conclusions
New Turing test How to evaluate AIrsquos image understanding
Slide credit Issey Masuda
53
Learn moreJulia Hockenmeirer
54
Thanks QampA Follow me at
httpsimatgeupceduwebpeoplexavier-giro
DocXaviProfessorXavi
34
Captioning (+ Detection) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
XAVI ldquoman has short hairrdquo ldquoman with short hairrdquo
AMAIArdquoa woman wearing a black shirtrdquo ldquo
BOTH ldquotwo men wearing black glassesrdquo
35
Captioning (+ Retrieval) DenseCap
Johnson Justin Andrej Karpathy and Li Fei-Fei Densecap Fully convolutional localization networks for dense captioning CVPR 2016
36
Captioning HRNE
( Slides by Marc Bolantildeos) Pingbo Pan Zhongwen Xu Yi YangFei WuYueting Zhuang Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
LSTM unit (2nd layer)
Time
Image
t = 1 t = T
hidden stateat t = T
first chunkof data
37
Visual Question Answering
[z1 z2 hellip zN] [y1 y2 hellip yM]
ldquoIs economic growth decreasing rdquo
ldquoYesrdquo
EncodeEncode
Decode
38
Extract visual features
Embedding
Predict answerMerge
Question
What object is flying
AnswerKite
Visual Question Answering
Slide credit Issey Masuda
39
Visual Question Answering
Noh H Seo P H amp Han B Image question answering using convolutional neural network with dynamic parameter prediction CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)
40
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering arXiv preprint arXiv160301417 (2016)
41
Visual Question Answering Dynamic
(Slides and Slidecast by Santi Pascual) Xiong Caiming Stephen Merity and Richard Socher Dynamic Memory Networks for Visual and Textual Question Answering ICML 2016
Main idea split image into local regions Consider each region equivalent to a sentence
Local Region Feature Extraction CNN (VGG-19) (1) Rescale input to 448x448 (2) Take output from last pooling layer rarr D=512x14x14 rarr 196 512-d local region vectors
Visual feature embedding W matrix to project image features to ldquoqrdquo-textual space
42
Visual Question Answering Grounded
(Slides and Screencast by Issey Masuda) Zhu Yuke Oliver Groth Michael Bernstein and Li Fei-FeiVisual7W Grounded Question Answering in Images CVPR 2016
43
Datasets Visual Genome
Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. arXiv preprint arXiv:1602.07332 (2016).
44
Datasets Microsoft SIND
Microsoft SIND
45
Challenge Microsoft COCO
Captioning
46
Challenge Storytelling
Storytelling
47
Challenge Movie Description
Movie Description Retrieval and Fill-in-the-blank
48
Challenges Movie Question Answering
Movie Question Answering
49
Challenges Visual Question Answering
Visual Question Answering
50
VQA challenge accuracy (%):
Humans: 83.30
UC Berkeley & Sony: 66.47
Baseline LSTM & CNN: 54.06
Baseline Nearest neighbor: 42.85
Baseline Prior per question type: 37.47
Baseline "All yes": 29.88
I. Masuda-Mora: 53.62
I. Masuda-Mora, "Open-Ended Visual Question-Answering". Submitted as BSc ETSETB thesis. [Clean code in Keras, perfect for beginners!]
Challenges Visual Question Answering
51
Summary: Embedding language and vision into semantic embeddings allows fusion learning.
Very high interest among researchers. Great topic for your thesis!
Will the vision and language (and multimedia) communities be merged with (absorbed by) the machine learning one?
52
Conclusions
New Turing test: How to evaluate AI's image understanding?
Slide credit Issey Masuda
53
Learn more: Julia Hockenmaier
54
Thanks! Q&A. Follow me at:
https://imatge.upc.edu/web/people/xavier-giro
@DocXavi / ProfessorXavi