Multilingual Multimodal Language Processing Using
Neural Networks
Mitesh M Khapra
IBM Research India
Sarath Chandar
Université de Montréal
Is it "University the Montreal"?
Tutorial Outline
1. Introduction and motivation
2. Neural networks – basics
3. Multilingual multimodal representation learning
4. Multilingual multimodal generation
5. Summary and open problems
What is multilingual multimodal NLP?
Designing language processing systems that can handle:
• Multiple languages (English, French, German, Hindi, Spanish, …)
• Multiple modalities (image, speech, video, …)
Why multilingual multimodal NLP?
We live in an increasingly multilingual, multimodal world.
Yet these people worship me like a God. Who am I??
Video, Tamil Audio, English subtitles
English: brown horses eating tall grass beside a body of water
French: chevaux brun manger l'herbe haute à côté d'un corps de l'eau
Why multilingual multimodal NLP?
*This example is taken from Calixto et al., 2012.
Seal – selo (stamp) vs. foca (marine animal).
"Seal pup" should have been translated as filhote de foca (young seal), but was translated as selo.
"Pup" in the title was wrongly translated as filhote de cachorro (young dog).
Why multilingual multimodal NLP?
*Article from forbes.com
Why multilingual multimodal NLP?
Why use Neural Networks?
• Backpropaganda of Neural Networks in the name of Deep Learning.
• Significant success in speech recognition, computer vision.
• Slowly conquering NLP?
Deep Representation Learning

• Rule-based systems: Input → hand-designed program → Output
• Classic machine learning: Input → hand-designed features → mapping from features → Output
• Shallow representation learning: Input → learned features → mapping from features → Output
• Deep learning: Input → simple features → more abstract features → mapping from features → Output
Tutorial Outline
1. Introduction and motivation
2. Neural networks – basics
   a) Neural network and backpropagation
   b) Matching data with architectures
   c) Auto-encoders
   d) Distributed natural language processing
3. Multilingual multimodal representation learning
4. Multilingual multimodal generation
5. Summary and open problems
Artificial Neuron / Perceptron

Neuron pre-activation: a(x) = b + Wx
Neuron activation: h(x) = g(a(x))
• W – connection weights
• b – neuron bias
• g(.) – activation function
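As a concrete sketch, the pre-activation and activation above can be computed in a few lines of numpy (the input, weights, and bias here are made-up values for illustration):

```python
import numpy as np

def neuron(x, W, b, g):
    """Artificial neuron: pre-activation a(x) = b + W.x, activation h(x) = g(a(x))."""
    a = np.dot(W, x) + b   # pre-activation
    return g(a)            # activation

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([1.0, -2.0])      # input
W = np.array([0.5, 0.5])       # connection weights
b = 0.5                        # neuron bias
h = neuron(x, W, b, sigmoid)   # sigmoid(0.5 - 1.0 + 0.5) = sigmoid(0) = 0.5
```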
Activation functions
• Identity
• tanh
• sigmoid
• ReLU
*Images from Hugo Larochelle's course.
Learning problem

Given training data {(x⁽ⁱ⁾, y⁽ⁱ⁾)}, find W and b that minimize the total loss Σᵢ ℓ(h(x⁽ⁱ⁾), y⁽ⁱ⁾).
θ = (W, b) are the parameters of the perceptron model.
Learning using gradient descent
• Compute the gradient of the loss w.r.t. the parameters.
• Take a small step in the direction of the negative gradient in parameter space.
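A minimal sketch of this update rule on a toy one-parameter objective J(θ) = (θ − 3)², whose gradient is 2(θ − 3); the learning rate and step count are arbitrary choices:

```python
theta = 0.0   # initial parameter
lr = 0.1      # learning rate (step size)

for _ in range(100):
    grad = 2.0 * (theta - 3.0)   # gradient of J(theta) = (theta - 3)^2
    theta -= lr * grad           # small step against the gradient

# theta has converged to the minimizer theta = 3
```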
Gradient Descent

[Animation: gradient descent on the cost surface J(θ₀, θ₁), converging to a minimum from different starting points.]

*This animation is taken from Andrew Ng's course.
Stochastic Gradient Descent (SGD)
• Approximate the gradient using a mini-batch of examples instead of the entire training set.
• Online SGD – when the mini-batch size is 1.
• SGD is used far more commonly than full-batch GD.
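A mini-batch SGD sketch on synthetic linearly separable data with a logistic loss; the data, batch size, and learning rate are all arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linearly separable data: y = 1 if x1 + x2 > 0 else 0.
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b, lr = np.zeros(2), 0.0, 0.5
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

for epoch in range(20):
    idx = rng.permutation(len(X))            # shuffle each epoch
    for start in range(0, len(X), 32):       # mini-batches of 32 examples
        batch = idx[start:start + 32]
        p = sigmoid(X[batch] @ w + b)
        err = p - y[batch]                   # gradient of the logistic loss
        w -= lr * X[batch].T @ err / len(batch)
        b -= lr * err.mean()

acc = ((sigmoid(X @ w + b) > 0.5) == (y > 0.5)).mean()
```

Setting the batch size to 1 in the inner loop turns this into online SGD.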
Online SGD for Perceptron Learning
• Perceptron learning objective is a convex function.
• GD is guaranteed to converge to the global minimum, while SGD converges to the global minimum if the learning rate is slowly decreased to zero.
What can a perceptron do?
• Can solve linearly separable problems.
Can’t solve non-linearly separable problems…
Unless the input is transformed to a better feature space…
Can the learning algorithm automatically learn these features?
Neural Networks
• You need some non-linearity f.
• Without f, this is still a perceptron!
• h – hidden layer.
Neural Networks with multiple outputs
Training Neural Networks
• The learning problem is still the same as the perceptron learning problem.
• Only the functional form of the output is more complicated.
• We can learn the parameters of the neural network using gradient descent.
• Can we exploit the layered structure of neural networks to compute the gradient more efficiently?
Backpropagation for Neural Networks
• Algorithm for efficient computation of gradients using chain rule of differentiation.
• Backpropagation is not a learning algorithm. We still use gradient descent.
• No more need to derive backprop manually! Theano/Torch/TensorFlow can do it for you!
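To see what backpropagation computes, here is a hand-derived backward pass for a one-hidden-layer network with a cross-entropy loss, checked against a finite-difference estimate of one gradient entry (shapes and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(params, X, y):
    """Forward pass: tanh hidden layer, sigmoid output, cross-entropy loss."""
    W1, b1, W2, b2 = params
    h = np.tanh(X @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    loss = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return loss, h, p

def backward(params, X, y):
    """Backward pass: apply the chain rule layer by layer, top to bottom."""
    W1, b1, W2, b2 = params
    _, h, p = forward(params, X, y)
    dp = p - y                           # dLoss/d(output pre-activation)
    dW2, db2 = h.T @ dp, dp.sum(0)
    dh = (dp @ W2.T) * (1 - h ** 2)      # chain rule through tanh
    dW1, db1 = X.T @ dh, dh.sum(0)
    return [dW1, db1, dW2, db2]

X = rng.normal(size=(5, 3))
y = (rng.random((5, 1)) > 0.5).astype(float)
params = [rng.normal(size=(3, 4)), np.zeros(4), rng.normal(size=(4, 1)), np.zeros(1)]

grads = backward(params, X, y)

# Finite-difference check of one weight of W1.
eps = 1e-6
params[0][0, 0] += eps
lp, _, _ = forward(params, X, y)
params[0][0, 0] -= 2 * eps
lm, _, _ = forward(params, X, y)
params[0][0, 0] += eps                   # restore
numeric = (lp - lm) / (2 * eps)
```

The analytic gradient from `backward` agrees with the numerical estimate, which is exactly the guarantee the chain rule gives.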
Deep Neural Networks
• Can have multiple hidden layers.
• The more hidden layers, the more non-linear the final projection!
Language Modeling – An application
• N-gram language modeling: given n-1 words, predict the n-th word.
• Example
Objects are often grouped spatially.
A 4-gram model will consider 'are', 'often', 'grouped' to predict 'spatially'.
Traditional n-gram models use frequency statistics to compute p(spatially | grouped, often, are).
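A count-based estimate of such a probability on a toy corpus (the corpus is made up for illustration):

```python
from collections import Counter

corpus = ("objects are often grouped spatially and "
          "objects are often grouped temporally").split()

# 4-gram model: count each 4-gram and each 3-word history (context).
four = Counter(tuple(corpus[i:i + 4]) for i in range(len(corpus) - 3))
three = Counter(tuple(corpus[i:i + 3]) for i in range(len(corpus) - 3))

# Maximum-likelihood estimate p(spatially | are, often, grouped).
p = four[("are", "often", "grouped", "spatially")] / three[("are", "often", "grouped")]
# The context occurs twice, once followed by "spatially", so p = 0.5.
```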
Neural Language Modeling (Bengio et al., 2001)
Word Embedding Matrix
• Feed-forward neural network.
• Word embedding matrix We is also learnt using backprop+GD.
Distributed Natural Language Processing
• Neural networks learn distributed word representations instead of localized word representations.
• Advantages of distributed word representations:
  • Similarity comes as a by-product.
  • Easier to generalize than localized representations.
  • Can be used to initialize word embeddings in other algorithms.
Modeling sequence data
• Feedforward networks ignore sequence information.
• We managed to include some sequence information in the neural language model by considering the previous n−1 words.
• Can we encode this sequence information implicitly in the network architecture?
Recurrent Neural Networks
h0 can be initialized to a zero vector or learned as a parameter.
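The slide's figure corresponds to the standard vanilla-RNN recurrence h_t = tanh(W_x x_t + W_h h_{t−1} + b); a sketch with arbitrary sizes and random inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_forward(xs, Wx, Wh, b, h0):
    """Vanilla RNN: h_t = tanh(Wx @ x_t + Wh @ h_{t-1} + b), for each step t."""
    h = h0
    states = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
        states.append(h)
    return states

d, n = 3, 4                          # input and hidden sizes (illustrative)
Wx = rng.normal(size=(n, d)) * 0.1
Wh = rng.normal(size=(n, n)) * 0.1
b = np.zeros(n)
h0 = np.zeros(n)                     # zero initial state (could also be learned)

xs = [rng.normal(size=d) for _ in range(5)]
states = rnn_forward(xs, Wx, Wh, b, h0)
```

The same weights Wx, Wh, b are reused at every time step, which is what lets the network handle sequences of any length.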
Recurrent Neural Networks
Backpropagation Through Time (BPTT)
To fit y1
Backpropagation Through Time (BPTT)
To fit y2
Backpropagation Through Time (BPTT)
To fit y3
Backpropagation Through Time (BPTT)
To fit y4
Computationally expensive for long sequences!
Truncated Backpropagation Through Time (T-BPTT)
To fit y4: truncated after 2 steps.
RNN Language Model (Mikolov et al., 2010)
Language modeling as sequential prediction problem.
I/P: Objects are often grouped spatially .
O/P: are often grouped spatially . <EOS>
RNN Language model
Inputs: Objects are often grouped
Outputs: are often grouped spatially
RNN Language model
Inputs: Objects are often grouped
Outputs: are often grouped spatially
Models p(grouped | often, are, objects)
Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997)
• LSTM is a variant of RNN that is good at modeling long-term dependencies.
• RNN uses multiplication to overwrite the hidden state while LSTM uses addition (better gradient flow!).
Long Short Term Memory (LSTM)
*Pictures from Chris Olah.
Long Short Term Memory (LSTM)
Additive context
Long Short Term Memory (LSTM)
Forget gate
Long Short Term Memory (LSTM)
Input gate
Long Short Term Memory (LSTM)
Additive context
Long Short Term Memory (LSTM)
Output gate
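Putting the gates together, one common form of the LSTM cell update can be sketched as follows (sizes are made up; real implementations differ in how they pack the gate weights):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; W, U, b stack the forget/input/candidate/output blocks."""
    z = W @ x + U @ h + b
    n = len(h)
    f = sigmoid(z[0 * n:1 * n])      # forget gate
    i = sigmoid(z[1 * n:2 * n])      # input gate
    g = np.tanh(z[2 * n:3 * n])      # candidate context
    o = sigmoid(z[3 * n:4 * n])      # output gate
    c_new = f * c + i * g            # additive context update (better gradient flow)
    h_new = o * np.tanh(c_new)       # gated hidden state
    return h_new, c_new

d, n = 3, 4                          # input and hidden sizes (illustrative)
W = rng.normal(size=(4 * n, d)) * 0.1
U = rng.normal(size=(4 * n, n)) * 0.1
b = np.zeros(4 * n)

h, c = np.zeros(n), np.zeros(n)
for _ in range(5):
    h, c = lstm_step(rng.normal(size=d), h, c, W, U, b)
```

The key line is `c_new = f * c + i * g`: the context is updated by addition, not overwritten by multiplication as in the vanilla RNN.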
Gated Recurrent Units (GRU) (Cho et al., 2014)
Recursive Neural Networks (Pollack, 1990)

[Figure: word vectors for "The" and "cat" are composed into an NP vector; NP, PP, and VP vectors are composed further up the parse tree into an S vector, with shared weights at every node.]
Convolutional Networks
• Standard network architecture for image representation.
• State-of-the-art performance in object recognition, object detection, image segmentation, …
Matching data with architecture
• Bag-of-words-like data → feedforward networks
• Sequence data → recurrent networks
• Tree-structured data → recursive networks
• Images → convolutional networks
Autoencoders
• Consists of two modules: an encoder and a decoder.
• The target output is the input itself.
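A minimal linear autoencoder sketch trained by gradient descent to reconstruct its input (all sizes and rates are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 4-d data, compressed to a 2-d code and decoded back.
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4)) * 0.5
Enc = rng.normal(size=(4, 2)) * 0.1   # encoder weights
Dec = rng.normal(size=(2, 4)) * 0.1   # decoder weights

def loss(X, Enc, Dec):
    """Mean squared reconstruction error: output should equal input."""
    return np.mean((X @ Enc @ Dec - X) ** 2)

lr = 0.01
before = loss(X, Enc, Dec)
for _ in range(2000):
    code = X @ Enc                    # encoder
    recon = code @ Dec                # decoder
    err = 2 * (recon - X) / X.size    # gradient of the squared error
    gDec = code.T @ err
    gEnc = X.T @ (err @ Dec.T)
    Dec -= lr * gDec
    Enc -= lr * gEnc
after = loss(X, Enc, Dec)             # reconstruction error has decreased
```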
Autoencoders
This is for bag-of-words-like data.
Recurrent Autoencoder

[Figure: a recurrent encoder reads the sequence x₁, x₂, x₃ into a final hidden state; a recurrent decoder then unrolls that state to reconstruct x₁, x₂, x₃.]
Recurrent Autoencoder (v2)

[Figure: same encoder; the decoder additionally feeds each reconstructed symbol (x₁, x₂, …) back as input to the next decoding step.]
You can also imagine Recursive Autoencoders, Convolutional Autoencoders…
For a quick introduction:
For a detailed introduction:
Tutorial Outline
1. Introduction and motivation
2. Neural networks – basics
3. Multilingual multimodal representation learning
4. Multilingual multimodal generation
5. Summary and research directions
Let's start by defining the goal of learning multilingual word representations.

Monolingual word representations capture syntactic and semantic similarities between words within one language:
• English: drink, drank, eat, ate, king, queen, prince, princess
• French: boire, buvait, manger, mangé, roi, reine, prince, princesse

Multilingual word representations capture syntactic and semantic similarities between words both within and across languages. In a joint English–French space, drink lies near boire, eat near manger, king near roi, princess near princesse, and so on.
First, let's try to understand how we learn monolingual word representations.
Consider this task: predict the n-th word given the previous n−1 words.

Example: he sat on a chair

Training data: all n-word windows in your corpus.

Now, let's try to answer two questions:
• How do you model this task?
• What is the connection between this task and learning word representations?
[Figure: a feed-forward neural language model.]
• Word embedding matrix W ∈ ℝ^{|V|×d}, randomly initialized; one row per vocabulary word (he, cat, sheep, duck, the, mat, on, chat, chair, sleep, sat, slept, a, you, …).
• Training instance: he sat on a chair.
• Look up the embeddings of the k context words (he, sat, on, a) and concatenate them.
• Hidden layer: W_h ∈ ℝ^{(k·d)×h}.
• Output layer: W_out ∈ ℝ^{h×|V|}, giving P(w | he, sat, on, a) for every word w in the vocabulary: P(he | he, sat, on, a), P(cat | he, sat, on, a), …, P(chair | he, sat, on, a).
• Objective: minimize −log P(chair | he, sat, on, a) and backpropagate the error.
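A sketch of this feed-forward neural language model in numpy, using the slide's vocabulary and made-up dimensions (the network is untrained here, so the probabilities are near-uniform):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["he", "cat", "sheep", "duck", "the", "mat", "on", "chat",
         "chair", "sleep", "sat", "slept", "a", "you"]
V, d, k, hdim = len(vocab), 5, 4, 8   # illustrative sizes
idx = {w: i for i, w in enumerate(vocab)}

W = rng.normal(size=(V, d)) * 0.1          # word embedding matrix, random init
Wh = rng.normal(size=(k * d, hdim)) * 0.1  # hidden layer
Wout = rng.normal(size=(hdim, V)) * 0.1    # output layer

def predict(context):
    """P(w | context) for every word w in the vocabulary."""
    x = np.concatenate([W[idx[w]] for w in context])  # look up + concatenate
    h = np.tanh(x @ Wh)
    scores = h @ Wout
    e = np.exp(scores - scores.max())                 # softmax over |V| words
    return e / e.sum()

p = predict(["he", "sat", "on", "a"])
loss = -np.log(p[idx["chair"]])   # minimize -log P(chair | he, sat, on, a)
```

Training would backpropagate this loss into Wout, Wh, and the embedding rows of W.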
[Figure: the same feed-forward neural language model on the training instance "he sat on a chair"; backpropagating the error updates the word embedding matrix W ∈ ℝ^{|V|×d} along with W_h and W_out.]
[Figure: the same feed-forward neural language model.]
In general, over the whole corpus:
minimize Σ_{i=1}^{T} −log P(w_i | w_{i−k}, …, w_{i−1})
T = total number of words in the corpus
![Page 69: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/69.jpg)
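As a concrete sketch, here is a minimal numpy forward pass for this k-gram neural language model; the toy vocabulary, the sizes d, h, k, and the random initialisation are illustrative assumptions, not the tutorial's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and sizes (illustrative assumptions)
vocab = ["he", "cat", "sheep", "duck", "the", "mat", "on",
         "chat", "chair", "sleep", "sat", "slept", "a", "you"]
V, d, h, k = len(vocab), 5, 8, 4          # |V|, embedding dim, hidden dim, context size
idx = {w: i for i, w in enumerate(vocab)}

W_emb = rng.normal(0, 0.1, (V, d))        # W     in R^{|V| x d}
W_h   = rng.normal(0, 0.1, (k * d, h))    # W_h   in R^{k.d x h}
W_out = rng.normal(0, 0.1, (h, V))        # W_out in R^{h x |V|}

def nll(context, target):
    """-log P(target | context) for one training instance."""
    x = np.concatenate([W_emb[idx[w]] for w in context])   # look up and concatenate
    hidden = np.tanh(x @ W_h)                              # hidden layer
    logits = hidden @ W_out
    p = np.exp(logits - logits.max())
    p /= p.sum()                                           # softmax over all |V| words
    return -np.log(p[idx[target]])

# Training instance: he sat on a chair
loss = nll(["he", "sat", "on", "a"], "chair")
```

During training one would back-propagate through this loss to update W_emb, W_h, and W_out.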
How does this result in meaningful word representations?
Intuition: similar words appear in similar contexts:

he sat on a chair
he sits on a chair

To predict "chair" in both cases, the model should learn to make the representations of "sits" and "sat" similar.
Alternate formulation …
[Figure: two copies of the network, each scoring a five-word window. The concatenated window embeddings pass through W_h ∈ ℝ^(k·d × h) and a single-unit output layer W_out ∈ ℝ^(h × 1), producing a scalar score: s for the positive window and s_c for the corrupted one.]

Positive: he sat on a chair
Negative: he sat on a oxygen

minimize max(0, 1 − s + s_c)

Back-propagate and update the word representations. Advantage: this does not require the expensive |V|-way output matrix multiplication.
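A matching sketch of this ranking formulation, again with toy sizes and random vectors as assumptions; note the single score unit replacing the |V|-way softmax.

```python
import numpy as np

rng = np.random.default_rng(1)
d, h, k = 5, 8, 5                    # embedding dim, hidden dim, window size (toy values)
W_h   = rng.normal(0, 0.1, (k * d, h))
w_out = rng.normal(0, 0.1, h)        # a single score unit instead of an h x |V| output

emb = {w: rng.normal(0, 0.1, d)
       for w in ["he", "sat", "on", "a", "chair", "oxygen"]}

def score(window):
    """Scalar score for a k-word window."""
    x = np.concatenate([emb[w] for w in window])
    return float(np.tanh(x @ W_h) @ w_out)

s   = score(["he", "sat", "on", "a", "chair"])    # positive window
s_c = score(["he", "sat", "on", "a", "oxygen"])   # corrupted window

loss = max(0.0, 1.0 - s + s_c)       # pairwise hinge loss to minimise
```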
Coming back to learning (multi)bilingual representations
[Figure: a joint English/French embedding space in which related words cluster together both within and across languages: (drink, drank, boire, buvait), (eat, ate, manger, mangé), and (king, queen, prince, princess) near (roi, reine, prince, princesse).]

Multilingual word representations capture syntactic and semantic similarities between words both within and across languages. They can also be extended to bigger units (sentences, documents, etc.).

Two paradigms:
• Offline bilingual alignment
• Joint training for bilingual alignment
Offline bilingual alignment:
• Stage 1: independently learn word representations for the two languages.
• Stage 2: now try to enforce similarity between the representations of similar words across the two languages.

[Figure: an English embedding table over {he, cat, sheep, duck, the, mat, on, chat, chair, sleep, sat, slept, a, you} and a French embedding table over {il, mouton, toi, canard, la, tapis, sur, bavarder, chaise, dormir, assis, dormi, un, chat}.]

Goal in Stage 2: transform the word representations such that the representations of (cat, chat), (sheep, mouton), (you, toi), etc. are close to each other.

How? Let's see …
After Stage 1: X = representations of English words, Y = representations of French words.

Use a bilingual dictionary to make X and Y parallel (i.e. corresponding rows in X and Y form a translation pair). (Faruqui and Dyer, 2014)
Goal: transform X and Y such that the transformed representations of (cat, chat), (you, toi), etc. are close to each other.

Search for projection vectors a and b such that, after projecting the original representations onto a and b, the projections are correlated. Use Canonical Correlation Analysis (CCA). (Faruqui and Dyer, 2014)
Transform X by projecting it onto A, giving XA.
Similarly, transform Y by projecting it onto B, giving YB.
The product (XA)ᵀ(YB) = AᵀXᵀYB is simply the correlation between the transformations XA and YB. We need to maximize this.
This yields the CCA optimization problem:

maximize trace(AᵀXᵀYB)
subject to AᵀXᵀXA = I, BᵀYᵀYB = I

(Faruqui and Dyer, 2014)
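Assuming X and Y have already been made row-aligned with a bilingual dictionary, CCA can be sketched directly in numpy via whitening plus an SVD; the toy data below just shares a random latent structure.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "after Stage 1" embeddings: row i of X and row i of Y are a translation pair
n, dx, dy = 200, 6, 6
Z = rng.normal(size=(n, 4))                          # shared latent structure
X = Z @ rng.normal(size=(4, dx)) + 0.1 * rng.normal(size=(n, dx))
Y = Z @ rng.normal(size=(4, dy)) + 0.1 * rng.normal(size=(n, dy))

def cca(X, Y, eps=1e-8):
    X = X - X.mean(0)
    Y = Y - Y.mean(0)

    def inv_sqrt(S):                                 # S^(-1/2) for symmetric PSD S
        w, U = np.linalg.eigh(S)
        return U @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ U.T

    Sxx = X.T @ X / len(X) + eps * np.eye(X.shape[1])
    Syy = Y.T @ Y / len(Y) + eps * np.eye(Y.shape[1])
    Sxy = X.T @ Y / len(X)
    U, s, Vt = np.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy))
    A = inv_sqrt(Sxx) @ U        # projection matrix for X
    B = inv_sqrt(Syy) @ Vt.T     # projection matrix for Y
    return A, B, s               # s = canonical correlations, sorted descending

A, B, corrs = cca(X, Y)          # XA and YB are the aligned representations
```

Because the toy X and Y share most of their variance, the top canonical correlation comes out high.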
Alternatively, one could use Deep CCA instead of CCA …
After Stage 1: X = representations of English words, Y = representations of French words. Again, use a bilingual dictionary to make X and Y parallel. (Lu et al., 2015)

For each translation pair (x, y) (e.g. sheep / mouton), extract deep features f(x) and g(y) using two neural networks whose parameters are W_f and W_g. Now find projection vectors a and b such that:

maximize aᵀf(x)ᵀg(y)b / √(aᵀf(x)ᵀf(x)a · bᵀg(y)ᵀg(y)b)
w.r.t. W_f, W_g, a, b (same objective as CCA)

Back-propagate and update W_f, W_g, a, b.
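A sketch of the Deep CCA idea: push x and y through two small networks and measure the top canonical correlation of the deep features. In DCCA this quantity is what back-propagation into W_f and W_g maximises; the data and network shapes below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
n, dx, dh = 100, 6, 5
W_f = rng.normal(0, 0.3, (dh, dx))       # parameters of the English-side network
W_g = rng.normal(0, 0.3, (dh, dx))       # parameters of the French-side network

f = lambda X: np.tanh(X @ W_f.T)         # deep features f(x)
g = lambda Y: np.tanh(Y @ W_g.T)         # deep features g(y)

X = rng.normal(size=(n, dx))
Y = X + 0.1 * rng.normal(size=(n, dx))   # toy "translations": noisy copies of X

def top_canonical_corr(Fx, Gy, eps=1e-6):
    Fx = Fx - Fx.mean(0)
    Gy = Gy - Gy.mean(0)

    def inv_sqrt(S):
        w, U = np.linalg.eigh(S)
        return U @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ U.T

    Sxx = Fx.T @ Fx / n + eps * np.eye(Fx.shape[1])
    Syy = Gy.T @ Gy / n + eps * np.eye(Gy.shape[1])
    K = inv_sqrt(Sxx) @ (Fx.T @ Gy / n) @ inv_sqrt(Syy)
    return float(np.linalg.svd(K, compute_uv=False)[0])

rho = top_canonical_corr(f(X), g(Y))     # DCCA maximises this w.r.t. W_f, W_g, a, b
```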
Recap, two paradigms:
• Offline bilingual alignment
• Joint training for bilingual alignment, which can
  • use only parallel data, or
  • use monolingual as well as parallel data
Compose word representations to get a sentence representation using a Compositional Vector Model (CVM). Let s be the sentence and w_i the representation of word i in the sentence. Two options are considered:

ADD (simply add the word vectors): f(s) = Σ_{i=1}^{n} w_i
BI (bigram): f(s) = Σ_{i=1}^{n} tanh(w_{i−1} + w_i)

Training data: parallel sentences, e.g. "he sat on a chair" / "il était assis sur une chaise". Let a = English sentence, b = parallel French sentence, n = random French sentence, and define

E(a, b) = ‖f(a) − g(b)‖²

Minimizing E(a, b) alone has a degenerate solution: make f(a) = g(b) = 0. To avoid this, use max-margin training:

minimize max(0, m + E(a, b) − E(a, n))

Back-propagate and update the w_i's in both languages. (Hermann & Blunsom, 2014)
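The ADD composition and the max-margin loss above can be sketched as follows; the random toy embeddings and the "random" French sentence are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
emb_en = {w: rng.normal(0, 0.1, d) for w in "he sat on a chair".split()}
emb_fr = {w: rng.normal(0, 0.1, d)
          for w in "il était assis sur une chaise le chien dort".split()}

def add_cvm(sentence, emb):
    """ADD compositional vector model: the sentence is the sum of its word vectors."""
    return np.sum([emb[w] for w in sentence.split()], axis=0)

def energy(u, v):
    return float(np.sum((u - v) ** 2))       # E(a, b) = ||f(a) - g(b)||^2

f_a = add_cvm("he sat on a chair", emb_en)              # English sentence a
g_b = add_cvm("il était assis sur une chaise", emb_fr)  # parallel French sentence b
g_n = add_cvm("le chien dort", emb_fr)                  # random French sentence n

m = 1.0                                                 # margin
loss = max(0.0, m + energy(f_a, g_b) - energy(f_a, g_n))  # max-margin loss to minimise
```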
To reduce the distance between f(a) and g(b), the model will eventually learn to reduce the distance between (chair, chaise), (sit, assis), (he, il), etc.
The previous approach strictly requires parallel data…
Can we exploit monolingual data in two languages in addition to parallel data between them?
Given monolingual data, we already know how to learn word representations. Recap, for English:

minimize Σ_{i=1}^{T_e} −log(P(w_i | w_{i−k}, …, w_{i−1}))  w.r.t. W_emb^e, W_h^e, W_out^e

where T_e = total number of words in the English corpus, W_emb^e = the word representations for English words, and W_h^e, W_out^e = the other parameters of the model. Similarly for French:

minimize Σ_{i=1}^{T_f} −log(P(w_i | w_{i−k}, …, w_{i−1}))  w.r.t. W_emb^f, W_h^f, W_out^f
Simply putting the two languages together, we get …
minimize Σ_{j∈{e,f}} Σ_{i=1}^{T_j} −log(P(w_i | w_{i−k}, …, w_{i−1}))  w.r.t. θ_e, θ_f

where θ_e = (W_emb^e, W_h^e, W_out^e) and θ_f = (W_emb^f, W_h^f, W_out^f).

Nothing great about this: it is the same as training θ_e and θ_f separately. Things become interesting when, in addition, we have parallel data. We can then modify the objective function …
minimize Σ_{j∈{e,f}} Σ_{i=1}^{T_j} −log(P(w_i | w_{i−k}, …, w_{i−1})) + λ · Ω(W_emb^e, W_emb^f)  w.r.t. θ_e, θ_f

The first term captures monolingual similarity; Ω is a bilingual-similarity term:

Ω(W_emb^e, W_emb^f) = Σ_{w_i∈V_e} Σ_{w_j∈V_f} sim(w_i, w_j) · distance(W_emb_i^e, W_emb_j^f)

This weighted sum will be low only when similar words across languages are embedded close to each other.
Now, let's look at two specific instances of this formulation …
W_emb^e ∈ ℝ^(|V_e| × d), W_emb^f ∈ ℝ^(|V_f| × d). Each cell (i, j) of a matrix A stores sim(w_i, w_j), obtained from word-alignment information in a parallel corpus, e.g.:

           assis   il    une   sur   chaise
he         0.02   0.90  0.05  0.01  0.02
sat        0.85   0.01  0.02  0.03  0.09
chair      0.06   0.01  0.01  0.01  0.95
a          0.02   0.02  0.92  0.02  0.02
on         0.10   0.05  0.05  0.81  0.04

English training instance: "he sat on a chair". In addition to the English updates, also update the French words in proportion to their similarity to {he, sat, on, a}. More formally, with

ℒ(θ_e) = Σ_{i=1}^{T_e} −log(P(w_i | w_{i−k}, …, w_{i−1}))

the French embeddings are updated as

W_emb_i^f = W_emb_i^f + Σ_{w_j∈V_e} A_{i,j} · ∂ℒ(θ_e)/∂W_emb_j^e

Similar words across the two languages undergo similar updates and hence remain close to each other. (Klementiev et al., 2012)
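The cross-lingual update above is just one matrix product; in this sketch the alignment matrix A and the gradient are random stand-ins for real word-alignment counts and a real NLL gradient.

```python
import numpy as np

rng = np.random.default_rng(4)
Ve, Vf, d = 5, 5, 4
W_e = rng.normal(0, 0.1, (Ve, d))        # English embeddings W_emb^e
W_f = rng.normal(0, 0.1, (Vf, d))        # French embeddings  W_emb^f

# A[i, j] = sim(w_i^f, w_j^e), e.g. normalised word-alignment counts (random here)
A = rng.random((Vf, Ve))
A /= A.sum(axis=1, keepdims=True)

# Stand-in for dL(theta_e)/dW_emb^e from one English training instance
grad_e = rng.normal(0, 0.01, (Ve, d))

# Each French embedding also moves, weighted by its similarity to the English words:
# W_emb_i^f <- W_emb_i^f + sum_j A[i, j] * dL/dW_emb_j^e
W_f = W_f + A @ grad_e
```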
(Gouws et al., 2015) Independently update θ_e and θ_f with the monolingual ranking losses:

minimize max(0, 1 − s^e + s_c^e)  w.r.t. θ_e
minimize max(0, 1 − s^f + s_c^f)  w.r.t. θ_f

En positive: he sat on a chair / En negative: he sat on a oxygen
Fr positive: il était assis sur une chaise / Fr negative: il était assis sur une oxygène

In addition, with parallel data

En: he sat on a chair (s_e = w_1^e, …, w_5^e)
Fr: il était assis sur une chaise (s_f = w_1^f, …, w_5^f)

now also minimize, w.r.t. W_emb^e, W_emb^f:

Ω(W_emb^e, W_emb^f) = ‖ (1/m) Σ_{w_i∈s_e} W_emb_i^e − (1/n) Σ_{w_j∈s_f} W_emb_j^f ‖²
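The parallel-data term Ω, the squared distance between the mean word vectors of a sentence pair, can be sketched as follows; the random toy embeddings are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 6
emb_en = {w: rng.normal(0, 0.1, d) for w in "he sat on a chair".split()}
emb_fr = {w: rng.normal(0, 0.1, d) for w in "il était assis sur une chaise".split()}

def omega(sent_e, sent_f):
    """Squared distance between the mean word vectors of a parallel sentence pair."""
    mean_e = np.mean([emb_en[w] for w in sent_e.split()], axis=0)   # (1/m) sum
    mean_f = np.mean([emb_fr[w] for w in sent_f.split()], axis=0)   # (1/n) sum
    return float(np.sum((mean_e - mean_f) ** 2))

val = omega("he sat on a chair", "il était assis sur une chaise")
```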
In fact, looking back, we can analyze all of these approaches under one framework:

minimize Σ_{j∈{e,f}} ℒ(θ_j) + λ · Ω(θ_e, θ_f)  w.r.t. θ_e, θ_f

where ℒ(θ_j) is a monolingual-similarity term and Ω(θ_e, θ_f) a bilingual-similarity term.
A compact summary, where NLL_j denotes Σ_{i=1}^{T_j} −log(P(w_i ∣ w_{i−k}, …, w_{i−1})):

| | ℒ(θ_e) | ℒ(θ_f) | Ω(θ_e, θ_f) | Training |
|---|---|---|---|---|
| (Faruqui and Dyer, 2014) | NLL_e | NLL_f | aᵀ(W_emb^e)ᵀW_emb^f b / √(aᵀ(W_emb^e)ᵀW_emb^e a · bᵀ(W_emb^f)ᵀW_emb^f b) | Ω is optimized after optimizing ℒ(θ_i) |
| (Lu et al., 2015) | NLL_e | NLL_f | aᵀf(W_emb^e)ᵀg(W_emb^f)b / √(aᵀf(W_emb^e)ᵀf(W_emb^e)a · bᵀg(W_emb^f)ᵀg(W_emb^f)b) | Ω is optimized after optimizing ℒ(θ_i) |
| (Hermann & Blunsom, 2014) | 0 | 0 | max(0, m + E(a, b) − E(a, n)), where E(a, b) = ‖f(a) − g(b)‖² | only Ω is optimized |
| (Klementiev et al., 2012) | NLL_e | NLL_f | 0.5 · (W_emb^e)ᵀ(A ⊗ I)W_emb^f | Ω and ℒ(θ_i) are optimized jointly |
| (Gouws et al., 2015) | max(0, 1 − s^e + s_c^e) | max(0, 1 − s^f + s_c^f) | ‖(1/m) Σ_{w_i∈s_e} W_emb_i^e − (1/n) Σ_{w_j∈s_f} W_emb_j^f‖² | Ω and ℒ(θ_i) are optimized jointly |
Now let's take a look at an approach that is based on autoencoders …
Background: a neural-network-based single-view autoencoder.

encoder: h(X) = f(WX + b)
decoder: X′ = g(h(X)) = g(W′h(X) + b′)

minimize Σ_{i=1}^{N} (X_i − g(h(X_i)))²

Train using backpropagation.
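A minimal numpy sketch of this single-view autoencoder; the sizes, tanh encoder, and identity decoder are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
d_in, d_h, N = 10, 4, 50                     # toy sizes (assumptions)
W  = rng.normal(0, 0.1, (d_h, d_in)); b  = np.zeros(d_h)    # encoder parameters
Wp = rng.normal(0, 0.1, (d_in, d_h)); bp = np.zeros(d_in)   # decoder parameters

def h(X):                                    # h(X) = f(WX + b), f = tanh (assumption)
    return np.tanh(X @ W.T + b)

def reconstruct(X):                          # X' = g(W'h(X) + b'), g = identity here
    return h(X) @ Wp.T + bp

X = rng.normal(size=(N, d_in))
loss = float(np.sum((X - reconstruct(X)) ** 2))   # sum_i (X_i - g(h(X_i)))^2
```

Training would back-propagate this reconstruction error into W, W′, b, b′.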
A multiview autoencoder: the Correlational Neural Network. Each view has its own encoder and decoder:

encoders: h_x(X) = f_x(W_x X + b),  h_y(Y) = f_y(W_y Y + b)
decoders: X′ = g_x(h_x(X)) = g_x(W_x′ h_x(X) + b′),  Y′ = g_y(h_y(Y)) = g_y(W_y′ h_y(Y) + b′)
Its objective is built up from four reconstruction terms: reconstruct each view from itself and from the other view.

minimize Σ_{i=1}^{N} (g_x(f_x(X_i)) − X_i)² + Σ_{i=1}^{N} (g_y(f_y(Y_i)) − Y_i)² + Σ_{i=1}^{N} (g_x(f_y(Y_i)) − X_i)² + Σ_{i=1}^{N} (g_y(f_x(X_i)) − Y_i)²
. . . . . . . . . . . .
𝑋
𝑋′
𝑌
𝑌′
𝑓𝑥(∙)
𝑔𝑥(∙)
𝑓𝑦(∙)
𝑔𝑦(∙)
ℎ(∙)
So far so good…. But will the representations h(X) and h(Y) be correlated?
Turns out that there is no guarantee for this !
114
. . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . .
A multiview autoencoder

minimize  ∑_{i=1}^{N} (g_x(f_x(X_i)) − X_i)²  +  ∑_{i=1}^{N} (g_y(f_y(Y_i)) − Y_i)²  +  ∑_{i=1}^{N} (g_x(f_y(Y_i)) − X_i)²  +  ∑_{i=1}^{N} (g_y(f_x(X_i)) − Y_i)²  −  corr(h(X), h(Y))

117

Correlational Neural Network

encoder:
h_x(X) = f_x(W_x X + b)
h_y(Y) = f_y(W_y Y + b)

decoder:
X′ = g_x(h(X)) = g_x(W′_x h_x(X) + b′)
Y′ = g_y(h(Y)) = g_y(W′_y h_y(Y) + b′)
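The CorrNet objective above can be sketched numerically. The sketch below is a minimal illustration, not the authors' code: it assumes linear (identity-activation) encoders and decoders with random, untrained weights, and computes the correlation term as per-dimension Pearson correlation summed over hidden units.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: n paired examples in two views, hidden size k (hypothetical).
n, dx, dy, k = 100, 8, 6, 4
X = rng.normal(size=(n, dx))
Y = rng.normal(size=(n, dy))

# Random, untrained linear encoder/decoder weights (illustrative only).
Wx, Wy = rng.normal(size=(dx, k)), rng.normal(size=(dy, k))
b = np.zeros(k)
Wx_, Wy_ = rng.normal(size=(k, dx)), rng.normal(size=(k, dy))
bx_, by_ = np.zeros(dx), np.zeros(dy)

def fx(A): return A @ Wx + b          # encoder for the X view: h_x(X)
def fy(A): return A @ Wy + b          # encoder for the Y view: h_y(Y)
def gx(h): return h @ Wx_ + bx_       # decoder back to the X view
def gy(h): return h @ Wy_ + by_       # decoder back to the Y view

def corrnet_loss(X, Y, lam=1.0):
    hx, hy = fx(X), fy(Y)
    # Four reconstruction terms: self-reconstruction and cross-reconstruction.
    rec = (np.sum((gx(hx) - X) ** 2) + np.sum((gy(hy) - Y) ** 2)
         + np.sum((gx(hy) - X) ** 2) + np.sum((gy(hx) - Y) ** 2))
    # Correlation between the two hidden representations, per hidden unit.
    hx_c, hy_c = hx - hx.mean(0), hy - hy.mean(0)
    corr = np.sum(hx_c * hy_c, axis=0) / (
        np.sqrt(np.sum(hx_c ** 2, axis=0)) * np.sqrt(np.sum(hy_c ** 2, axis=0)))
    # Minimize reconstruction error while maximizing correlation.
    return rec - lam * np.sum(corr)

loss = corrnet_loss(X, Y)
```

Training would minimize this loss with respect to the weights; here only the forward computation of the objective is shown.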
Let's compare the performance of some of these approaches on the task of cross-language document classification.
[Figure: labeled documents in L1 and unlabeled documents in L2 feed into a common representation learner]
Approach                     en→de    de→en
Klementiev et al., 2012      77.6     71.1
Hermann & Blunsom, 2014      83.7     71.4
Chandar et al., 2014         91.8     72.8
Gouws et al., 2015           86.5     75.0
We now look at multimodal representation learning
[Figure: a recursive neural network composes the sentence "a man jumping on his bike" (word vectors x_1 … x_6) into node representations h_1 … h_6 via g_θ(·); a convolutional neural network encodes the image]

h_1 = g_θ(x_1) = f(W_v x_1)
h_6 = g_θ(x_6, h_5) = f(W_v x_6 + W_l h_5)
h_3 = g_θ(x_3, h_2, h_4) = f(W_v x_3 + W_l h_2 + W_r h_4)

W_l for the left child, W_r for the right child
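The composition equations above can be sketched with toy weights. The snippet below mirrors the example composition h_3 = f(W_v x_3 + W_l h_2 + W_r h_4); all weights and word embeddings are random placeholders, not trained values.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4                          # shared word/node dimension (illustrative)
Wv = rng.normal(size=(d, d))   # projects a word embedding
Wl = rng.normal(size=(d, d))   # weight applied to the left child
Wr = rng.normal(size=(d, d))   # weight applied to the right child
f = np.tanh                    # elementwise nonlinearity

def compose(x, left=None, right=None):
    """h = f(Wv x + Wl h_left + Wr h_right); children are optional."""
    h = Wv @ x
    if left is not None:
        h = h + Wl @ left
    if right is not None:
        h = h + Wr @ right
    return f(h)

# "a man jumping on his bike" -> six toy word embeddings x1..x6.
xs = [rng.normal(size=d) for _ in range(6)]

h1 = compose(xs[0])                     # leaf: h1 = f(Wv x1)
h2 = compose(xs[1], left=h1)
h4 = compose(xs[3])
h3 = compose(xs[2], left=h2, right=h4)  # h3 = f(Wv x3 + Wl h2 + Wr h4)
```

Every node representation has the same dimension d, which is what lets the same composition function be reused at every level of the tree.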
[Figure: a recursive neural network maps the sentence and a convolutional neural network maps the image into a common space via W_t and W_im, giving image vector v_i and sentence vector y_j]

Objective:

∑_{(i,j)∈P} ∑_{c∈S∖S(i)} max(0, Δ − v_iᵀ y_j + v_iᵀ y_c)  +  ∑_{(i,j)∈P} ∑_{c∈I∖I(j)} max(0, Δ − v_iᵀ y_j + v_cᵀ y_j)

(i, j) = correct pair
(i, c) = incorrect pair
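A minimal sketch of this max-margin ranking objective. For simplicity it assumes image i is paired with sentence i, uses dot-product scores, and sums the two contrastive terms (wrong sentences for an image, wrong images for a sentence); the vectors are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 8
V = rng.normal(size=(n, d))   # image vectors v_i in the common space
Y = rng.normal(size=(n, d))   # sentence vectors y_j; pair (i, i) is correct

def ranking_loss(V, Y, delta=1.0):
    """Margin loss: the correct pair (i, i) should outscore contrastive pairs."""
    loss = 0.0
    for i in range(len(V)):
        pos = V[i] @ Y[i]                # score of the correct pair
        for c in range(len(V)):
            if c == i:
                continue
            # rank sentences for image i, and images for sentence i
            loss += max(0.0, delta - pos + V[i] @ Y[c])
            loss += max(0.0, delta - pos + V[c] @ Y[i])
    return loss

loss = ranking_loss(V, Y)
```

When correct pairs already outscore all contrastive pairs by the margin Δ, the loss is exactly zero, so well-separated embeddings receive no gradient.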
Interested in more?

Bridge Correlational Neural Networks @ NAACL 2016
Monday, June 13, 2016, 2:40 – 3:00 p.m.
A quick summary …
(Faruqui and Dyer, 2014) (Lu et al., 2015) (Hermann & Blunsom, 2014)
(Klementiev et al., 2012) (Gouws et al., 2015) (Chandar et al., 2015)
(Socher et al., 2013)
Research Directions: Representation Learning

• Learn task-specific bilingual embeddings
• Learn from comparable corpora (instead of parallel corpora)
• Handle data imbalance
  • More data for ℒ(θ_j)
  • Less data for λ · Ω(W_emb^e, W_emb^f)
• Handle larger vocabularies

maximize_{θ_e, θ_f, α}  ∑_{j∈{e,f}} ∑_{i=1}^{T_j} ℒ(θ_j)  +  λ · Ω(W_emb^e, W_emb^f)  +  ℒ_task(α)

θ_e = {W_e, W_h^e, W_out^e}
θ_f = {W_f, W_h^f, W_out^f}
α = task-specific parameters

(The ℒ(θ_j) terms capture monolingual similarity; Ω captures bilingual similarity.)
Tutorial Outline
1. Introduction and motivation
2. Neural networks – basics
3. Multilingual multimodal representation learning
4. Multilingual multimodal generation
a) Machine Translation
b) Image captioning
c) Visual Question Answering
d) Video captioning
e) Image generation from captions
5. Summary and open problems
127
Natural Language Generation

• Natural Language Processing
  • Natural Language Understanding (NLU)
  • Natural Language Generation (NLG)
• Natural language generation is hard.
• Applications of NLG:
  • Machine translation
  • Question answering
  • Captioning
  • Summarization
  • Dialogue systems
• Evaluating NLG systems is also hard! (More about this at the end)
128
Multilingual Multimodal NLG
Multilingual Multimodal NLG refers to conditional NLG where the generator could be conditioned on
• Multiple languages (machine translation, summarization, dialog systems)
• Multiple modalities like images, videos (visual QA, image captioning, video captioning)
129
Tutorial Outline
1. Introduction and motivation
2. Neural networks – basics
3. Multilingual multimodal representation learning
4. Multilingual multimodal generation
a) Machine Translation
b) Image captioning
c) Visual Question Answering
d) Video captioning
e) Image generation from captions
5. Summary and research directions
130
Machine Translation
En: Economic growth has slowed down in recent years.
Fr : La croissance économique a ralenti au cours des dernières années .
• Statistical Machine Translation (SMT) aims to design systems that can learn to translate between languages based on some training data.
• SMT maximizes p(e|f) ∝ p(f|e) · p(e)
• p(f|e) – translation model
• p(e) – language model
• Traditional methods – long pipeline (Example: Moses, Joshua)
131
Neural Machine Translation

• Neural network based machine translation system.
• Why neural MT?
  • Easy to train in an end-to-end fashion.
  • The whole system can be optimized for the actual task at hand.
  • No need to store gigantic phrase tables; small memory footprint.
  • Simple decoder, unlike the highly intricate decoders in standard MT.
• We will consider:
  • Single source – single target NMT
  • Multi source – single target NMT
  • Single source – multi target NMT
  • Multi source – multi target NMT
132
Learning phrase representations using RNN Encoder-Decoder for SMT (Cho et al., 2014)
• RNN Encoder: encode a variable-length sequence into a fixed-length vector representation.
• RNN Decoder: decode a given fixed-length vector representation back into a variable-length sequence.
133
Economic growth has slowed down in recent years.
La croissance économique a ralenti au cours des dernières années .
GRU Encoder
GRU Decoder
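The GRU encoder described above can be sketched as follows. This is a minimal, untrained cell (update gate z, reset gate r) that only illustrates how a variable-length sequence is folded into a single fixed-length vector; dimensions and weights are made up.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell with update gate z and reset gate r (untrained sketch)."""
    def __init__(self, d_in, d_h):
        shape = (d_h, d_in + d_h)
        self.Wz = rng.normal(scale=0.1, size=shape)   # update-gate weights
        self.Wr = rng.normal(scale=0.1, size=shape)   # reset-gate weights
        self.Wh = rng.normal(scale=0.1, size=shape)   # candidate-state weights

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                          # update gate
        r = sigmoid(self.Wr @ xh)                          # reset gate
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_tilde                   # interpolate old/new

def encode(cell, xs, d_h):
    """Fold a variable-length sequence into one fixed-length vector."""
    h = np.zeros(d_h)
    for x in xs:
        h = cell.step(x, h)
    return h

d_in, d_h = 6, 8
cell = GRUCell(d_in, d_h)
sentence = [rng.normal(size=d_in) for _ in range(5)]   # 5 toy word vectors
code = encode(cell, sentence, d_h)
```

Whatever the input length, the encoder output has the same dimension d_h; the decoder then unfolds this fixed-length vector back into a target sequence.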
RNN Encoder-Decoder

134

How to use the trained model?
1. Generate a target sequence given an input sequence.
2. Score a given pair of input/output sequences, p(y|x).
2-D embedding of the learned phrase representation
135
WMT’14 English/French SMT - rescoring
136
Baseline: Moses with default settings
Sequence to Sequence Learning with Neural Networks (Sutskever et al., 2014)

137

What is different from Cho et al., 2014?
1. An LSTM encoder/decoder instead of a GRU encoder/decoder.
2. A 4-layer deep encoder/decoder instead of a shallow one.
2-D embedding of the learned phrase representation
138
The performance of the LSTM on WMT’14 English to French test set
139
Trick-1: Reverse the input sequence.
Trick-2: Ensemble Neural Nets.
Performance as a function of sentence length
140
Neural MT by jointly learning to align and translate (Bahdanau et al., 2015)

• Issue with the encoder-decoder approach:
  • Can we compress all the necessary information in a sentence into a fixed-length vector? NO.
• How to choose the length of the vector?
  • It should be proportional to the length of the sentence.
• Bahdanau et al. proposed to use k fixed-length vectors in the encoder, where k is the length of the sentence.
141
Attention based NMT

142

• To generate the tth word, the model learns an attention mechanism over all the words in the input sentence.
• Input words are represented using a bi-directional RNN.
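The attention step can be sketched with additive (Bahdanau-style) scoring: each encoder annotation h_j is scored against the decoder state s, the scores are softmax-normalized, and the context is their weighted average. All weight matrices below are random stand-ins, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_h, d_s, d_a = 8, 8, 5                 # annotation / state / score dims (toy)
H = rng.normal(size=(7, d_h))           # 7 annotations from a bi-RNN encoder
s = rng.normal(size=d_s)                # decoder state before emitting word t
Wa = rng.normal(scale=0.1, size=(d_a, d_s))
Ua = rng.normal(scale=0.1, size=(d_a, d_h))
va = rng.normal(scale=0.1, size=d_a)

def attend(s, H):
    """Additive attention: e_j = v^T tanh(W s + U h_j); c = sum_j a_j h_j."""
    scores = np.array([va @ np.tanh(Wa @ s + Ua @ h) for h in H])
    alpha = softmax(scores)             # attention weights over source words
    context = alpha @ H                 # weighted average of annotations
    return alpha, context

alpha, context = attend(s, H)
```

The weights alpha are what the alignment visualizations on the following slides display: one row of weights per generated target word.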
Results on WMT'14 En-Fr Dataset

143

[Figures: BLEU scores; effect of increasing the sentence length]
Visualization of attention
144
Effective Approaches to Attention-based NMT (Luong et al., 2015)

• Exploring various architectural choices for attention-based NMT systems:
  • Global attention
  • Local attention
  • Input-feeding approach
145
Global Attention

146

• Stacked LSTM encoder instead of a bi-directional single-layer LSTM encoder.
• The hidden state ht of the final layer is used for context computation.
• All the source words are used for computing the context (similar to Bahdanau et al.).
Local Attention

147

• The model first predicts a single aligned position pt for the current target word.
• A window chosen around pt is used to compute the context ct.
• ct is a weighted average of the source hidden states in the window.
• In between soft attention and hard attention (more on hard attention later).
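A sketch of local-p attention under these assumptions: dot-product scores, the paper's σ = D/2 for the Gaussian, and random stand-in weights. Following the paper, the Gaussian-damped weights are not renormalized.

```python
import numpy as np

rng = np.random.default_rng(5)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

S, d = 20, 8                      # source length, hidden size (toy)
H = rng.normal(size=(S, d))       # source hidden states
h_t = rng.normal(size=d)          # current target hidden state
Wp = rng.normal(scale=0.1, size=(d, d))   # position-prediction weights
vp = rng.normal(scale=0.1, size=d)
D = 3                             # half window width

def local_p_attention(h_t, H):
    S = len(H)
    # 1. Predict an aligned source position p_t in (0, S).
    p_t = S * sigmoid(vp @ np.tanh(Wp @ h_t))
    # 2. Score states in the window [p_t - D, p_t + D] (dot-product scores),
    #    then damp them with a Gaussian centred on p_t (sigma = D/2).
    lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
    alpha = softmax(H[lo:hi] @ h_t)
    alpha = alpha * np.exp(-((np.arange(lo, hi) - p_t) ** 2) / (2 * (D / 2) ** 2))
    # 3. Context = weighted average of the windowed states.
    return p_t, alpha, alpha @ H[lo:hi]

p_t, alpha, c_t = local_p_attention(h_t, H)
```

Because only 2D+1 states are scored, the cost per target word no longer grows with the full source length, which is the point of the local variant.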
Input-feeding Approach

148

• Attention vectors are fed as input to the next time steps to inform the model.
• Similar to the coverage set in standard MT.
WMT’14 En-De results
149
Alignment visualizations
150
Global attention Local-m attention
Local-p attention Gold alignment
Alignment Error Rate on RWTH En-De alignment data
151
Multi-source NMT
• Why multiple sources?
  • Two strings can reduce ambiguity via triangulation.
  • Ex: the English word "bank" may be easily translated to French in the presence of a second, German input containing the word "Flussufer" (river bank).
• Sources should be distant languages for better triangulation:
  ✓ English and German to French
  × English and French to German
152
Multi-source NMT (Zoph and Knight, 2016)
• Train p(e|f, g) model directly on trilingual data.
• Use it to decode e given any (f, g) pair.
• How to combine information from f and g?
153
Multi-source Encoder-Decoder Model
154
Multi-source Attention model
• We can take the local-attention NMT model and concatenate contexts from multiple sources.
155
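One way to combine information from the two sources is to concatenate and project, both for initializing the decoder from the encoders' final states and for mixing the per-source attention contexts at each step. The sketch below uses hypothetical dimensions and random weights purely to show the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 8
h_fr = rng.normal(size=d)         # final encoder state for the French source
h_de = rng.normal(size=d)         # final encoder state for the German source
c_fr = rng.normal(size=d)         # attention context from the French encoder
c_de = rng.normal(size=d)         # attention context from the German encoder

Wc = rng.normal(scale=0.1, size=(d, 2 * d))   # combiner for the initial state
Wo = rng.normal(scale=0.1, size=(d, 3 * d))   # mixes both contexts + h_t

# Initial decoder state: combine the two encoders' final states.
h0 = np.tanh(Wc @ np.concatenate([h_fr, h_de]))

# At each step, concatenate the per-source contexts with the decoder state.
h_t = h0
h_att = np.tanh(Wo @ np.concatenate([c_fr, c_de, h_t]))
```

The projection keeps the decoder's state size fixed at d regardless of how many source encoders feed into it.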
English-French-German NMT
156
Multi-Target NMT (Dong et al., 2015)
157
Multi-task learning framework for multiple-target language translation
Optimization for end to multi-end model
Multi-Target NMT (Dong et al., 2015)
158
Size of training corpus for different language pairs
Multi-task neural translation vs. a single model given a large-scale corpus in all language pairs
Multi-task neural translation vs. a single model with a small-scale training corpus on some language pairs. * means that the language pair is sub-sampled.
Multi-task NMT vs. single model vs. Moses on the WMT 2013 test set
Faster and better convergence with multi-task learning in multiple-language translation
Multi-Way, Multilingual NMT (Firat et al., 2016)

• Multiple sources and multiple targets
• Advantages:
  • Sharing knowledge across multiple languages.
  • One universal common space for multiple languages.
  • Advantageous for low-resource languages.
• Challenges:
  • N-lingual data is difficult to get for N > 2.
  • The number of parameters grows fast with the number of languages.
159
Simple Solutions

• An encoder-decoder model with multiple encoders and multiple decoders which are shared across language pairs.
• Can we do the same with attention-based models?
  • Attention is language-pair specific.
  • With L languages, we need O(L²) attention mechanisms.
• Can we share the same attention module across multiple languages? YES
160
Multi-way Multilingual NMT (Firat et al., 2016)

• Model:
  • N encoders
  • N decoders
  • A shared attention mechanism
• Both encoders and decoders are shared across multiple language pairs.
• Each encoder can be of a different type (convolutional/RNN, different sizes).
161
One step of multiway multilingual NMT
162
Low Resource Translation
163
Large Scale Translation
164
Tutorial Outline
1. Introduction and motivation
2. Neural networks – basics
3. Multilingual multimodal representation learning
4. Multilingual multimodal generation
a) Machine Translation
b) Image captioning
c) Visual Question Answering
d) Video captioning
e) Image generation from captions
5. Summary and research directions
165
Image Captioning
166
A woman is throwing a frisbee in a park.
Input
Output
Image Captioning

167

Input

1. A building is surrounded by wide attractions, such as a horse statue and a statue of a giant hand and wrist, with a picnic table next to them.
2. A large hand statue outside of a country store.
3. A strange antique store with odd artwork outside near assorted tables and chairs.
4. A wooden shop with a large hand in the forecourt.
5. Country store has big hand with checkered-base statue and tables with benches on front yard

Crowdsourced Captions
Image Captioning
• Requires both visual understanding and language understanding.
• Hard problem.
• Several potential applications.
168
Show and Tell: A Neural Image Caption Generator (Vinyals et al.,2015)
169
Show and Tell
170
Generated captions
171
BLEU-1 scores
172
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (Xu et al., 2015)
173
Attention over time
174
Soft attention
Hard attention
Show, Attend and Tell

• Show: convolutional encoder. Instead of using the final fully connected representation, use a lower convolutional layer. This allows the decoder to selectively focus on certain parts of the image.
• Attend: soft attention or hard attention over the image (per time step).
• Tell: LSTM decoder conditioned on the context at the current time step, the word generated in the previous time step, and the previous hidden state.
175
Deterministic Soft Attention
• Same as the attention mechanism in Bahdanau et al.’s NMT system.
176
Stochastic Hard Attention
177
This is not differentiable!
Use REINFORCE
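Hard attention samples one location instead of averaging, so the loss is not differentiable in the attention distribution; REINFORCE replaces the exact gradient with a score-function estimate. The toy sketch below (made-up logits and a made-up reward marking one location as "correct") estimates the gradient of the expected reward with respect to the attention logits by sampling.

```python
import numpy as np

rng = np.random.default_rng(7)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = rng.normal(size=5)                     # unnormalized attention scores
reward = np.array([0.0, 0.0, 1.0, 0.0, 0.0])    # toy reward: location 2 is "correct"

def reinforce_grad(logits, n_samples=20000):
    """Monte-Carlo estimate of d E[r(j)] / d logits for j ~ softmax(logits)."""
    p = softmax(logits)
    js = rng.choice(len(p), size=n_samples, p=p)   # sampled hard locations
    onehots = np.eye(len(p))[js]
    # Score function: r(j) * d log p(j) / d logits = r(j) * (onehot(j) - p)
    return (reward[js, None] * (onehots - p)).mean(axis=0)

g = reinforce_grad(logits)
```

For a categorical distribution this estimate converges to the closed-form gradient p_k(r_k − E[r]), so the sampled and exact gradients can be compared directly.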
Attending to the correct objects
178
Examples of mistakes
179
![Page 180: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/180.jpg)
Results on 3 datasets
180
![Page 181: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/181.jpg)
Dense Captioning
181
![Page 182: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/182.jpg)
Dense Captioning (Johnson et al., 2015)
182
![Page 183: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/183.jpg)
Dense Captioning
183
![Page 184: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/184.jpg)
Towards interpretable image search systems
184
![Page 185: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/185.jpg)
Tutorial Outline
1. Introduction and motivation
2. Neural networks – basics
3. Multilingual multimodal representation learning
4. Multilingual multimodal generation
a) Machine Translation
b) Image captioning
c) Visual Question Answering
d) Video captioning
e) Image generation from captions
5. Summary and research directions
185
![Page 186: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/186.jpg)
Visual Question Answering
186
![Page 187: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/187.jpg)
Visual QA dataset (Agrawal et al., 2016)
187
![Page 188: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/188.jpg)
Image/Text Encoder with LSTM answer module (Agrawal et al., 2016)
188
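The fusion step of this kind of baseline can be sketched as follows: project the CNN image feature and the LSTM question encoding into a common space, fuse them by elementwise product, and classify over a fixed answer vocabulary. The dimensions and weight names here are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def vqa_answer_probs(img_feat, q_feat, W_img, W_q, W_cls):
    i = np.tanh(img_feat @ W_img)          # project image feature
    q = np.tanh(q_feat @ W_q)              # project question encoding
    fused = i * q                          # pointwise multiplicative fusion
    logits = fused @ W_cls                 # score each candidate answer
    p = np.exp(logits - logits.max())
    return p / p.sum()                     # distribution over answers

rng = np.random.default_rng(2)
p = vqa_answer_probs(rng.normal(size=8), rng.normal(size=6),
                     rng.normal(size=(8, 4)), rng.normal(size=(6, 4)),
                     rng.normal(size=(4, 10)))   # 10 candidate answers
```

Treating VQA as classification over a fixed answer set works because a small vocabulary of frequent answers covers most of the dataset.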
![Page 189: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/189.jpg)
Results
189
![Page 190: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/190.jpg)
Memory Network based VQA (Xiong et al., 2016)
190
![Page 191: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/191.jpg)
Visual Input Module
191
![Page 192: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/192.jpg)
Episodic Memory Module
192
![Page 193: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/193.jpg)
Results on VQA dataset
193
![Page 194: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/194.jpg)
Tutorial Outline
1. Introduction and motivation
2. Neural networks – basics
3. Multilingual multimodal representation learning
4. Multilingual multimodal generation
a) Machine Translation
b) Image captioning
c) Visual Question Answering
d) Video captioning
e) Image generation from captions
5. Summary and research directions
194
![Page 195: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/195.jpg)
Video Captioning
195
![Page 196: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/196.jpg)
Video Captioning (Venugopalan et al., 2015)
196
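Venugopalan et al.'s first model mean-pools per-frame CNN features into a single video vector that conditions the LSTM caption decoder (their later S2VT model instead encodes the frame sequence with an LSTM). A hedged sketch of the pooling step:

```python
import numpy as np

def video_vector(frame_feats):
    """frame_feats: (T, D) CNN features, one row per sampled frame.
    Returns a single (D,) summary vector, invariant to frame order."""
    return frame_feats.mean(axis=0)

rng = np.random.default_rng(3)
v = video_vector(rng.normal(size=(16, 5)))   # 16 frames, 5-dim features
```

The order-invariance of mean pooling is exactly the limitation that motivates the temporal-structure models discussed next.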
![Page 197: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/197.jpg)
Video Captioning (Venugopalan et al., 2015)
197
![Page 198: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/198.jpg)
Video Captioning (Venugopalan et al., 2015)
198
![Page 199: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/199.jpg)
Also:
Describing videos by exploiting temporal structure (Yao et al., 2015)
Proposes:
• Exploiting local structure using a spatio-temporal convolutional network.
• Exploiting global structure using a temporal attention mechanism.
199
![Page 200: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/200.jpg)
Tutorial Outline
1. Introduction and motivation
2. Neural networks – basics
3. Multilingual multimodal representation learning
4. Multilingual multimodal generation
a) Machine Translation
b) Image captioning
c) Visual Question Answering
d) Video captioning
e) Image generation from captions
5. Summary and research directions
200
![Page 201: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/201.jpg)
Generating images from captions (Mansimov et al., 2016)
201
![Page 202: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/202.jpg)
Generating images from captions (Mansimov et al., 2016)
202
![Page 203: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/203.jpg)
Tutorial Outline
1. Introduction and motivation
2. Neural networks – basics
3. Multilingual multimodal representation learning
4. Multilingual multimodal generation
5. Summary and open problems
203
![Page 204: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/204.jpg)
Summary
• Significant progress in multilingual/multimodal language processing in past few years.
• Reasons:
• Better representation learners
• Data
• Compute power
• Finally, AI is having a direct impact on everyday life!
• This is just the beginning!
204
![Page 205: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/205.jpg)
Research Directions: Representation Learning
• Learn task specific bilingual embeddings
• Learn from comparable corpora (instead of parallel corpora)
• Handle data imbalance
• More data for the monolingual term $\mathcal{L}(\theta_j)$
• Less data for the bilingual term $\lambda \cdot \Omega(W_{emb}^{e}, W_{emb}^{f})$
• Handle larger vocabulary

$$\underset{\theta_e,\, \theta_f,\, \alpha}{\text{maximize}} \;\; \sum_{j \in \{e,f\}} \sum_{i=1}^{T_j} \mathcal{L}(\theta_j) \;+\; \lambda \cdot \Omega(W_{emb}^{e}, W_{emb}^{f}) \;+\; \mathcal{L}_{task}(\alpha)$$

where $\mathcal{L}$ captures monolingual similarity and $\Omega$ captures bilingual similarity, with
$\theta_e = \{W_e, W_h^e, W_{out}^e\}$, $\theta_f = \{W_f, W_h^f, W_{out}^f\}$, and $\alpha$ = task-specific parameters.
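A numeric sketch of this joint objective: per-language monolingual losses, a bilingual similarity regulariser over the two embedding matrices, and a task-specific term. The concrete choice of Omega here (negative squared distance between aligned embeddings) is an illustrative assumption; the slide leaves Omega abstract.

```python
import numpy as np

def joint_objective(mono_losses, W_emb_e, W_emb_f, task_loss, lam=0.1):
    """Joint objective to maximize: monolingual terms + lam * bilingual
    similarity + task-specific loss (all illustrative)."""
    omega = -np.sum((W_emb_e - W_emb_f) ** 2)   # bilingual similarity (assumed form)
    return sum(mono_losses) + lam * omega + task_loss

val = joint_objective([1.0, 2.0], np.ones((3, 2)), np.zeros((3, 2)), 0.5)
```

The data-imbalance point above then amounts to the two terms seeing very different numbers of gradient updates, which makes the weighting $\lambda$ delicate to tune.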
![Page 206: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/206.jpg)
Research Directions: Language Generation
• Handling large vocabulary during generation:
• Hierarchical softmax (Morin & Bengio, 2005)
• NCE (Mnih & Teh, 2012)
• Hashing-based approaches (Shrivastava & Li, 2014)
• Sampling-based approaches (Jean et al., 2015)
• BlackOut (Ji et al., 2016)
• Character level instead of word level?
• Character-level decoder (Chung et al., 2016)
• Character-level encoder/decoder (Ling et al., 2015)
• Still an open problem.
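The common idea behind the sampling-based approaches above: instead of normalising over the full vocabulary at every step, score the target word against a small set of sampled negatives. This is a hedged sketch of that shared idea under a uniform proposal, not any one paper's exact estimator:

```python
import numpy as np

def sampled_scores(h, W_out, target, num_neg, rng):
    """Score the target word plus num_neg uniformly sampled negatives,
    instead of the full V-way softmax."""
    V = W_out.shape[0]
    negatives = [w for w in rng.permutation(V) if w != target][:num_neg]
    idx = np.array([target] + negatives)        # target first, then negatives
    return idx, W_out[idx] @ h                  # scores for the small set only

rng = np.random.default_rng(4)
idx, scores = sampled_scores(np.ones(3), rng.normal(size=(100, 3)),
                             target=7, num_neg=5, rng=rng)
```

Training cost then scales with the number of samples rather than the vocabulary size, which is what makes these methods attractive for large vocabularies.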
206
![Page 207: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/207.jpg)
Research Directions: Language Generation
• Handling out-of-vocabulary words.
• Out-of-vocabulary words on the source side?
• Pointing the Unknown Words (Gulcehre et al., 2016)
• Out-of-vocabulary words on the target side?
207
![Page 208: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/208.jpg)
Research Directions: Language Generation
• Better evaluation metrics?
• Perplexity and BLEU do not encourage diversity in generation.
• Human evaluation?
• Can we come up with better automated evaluation measures?
• Related work: How NOT to evaluate your dialogue system (Liu et al., 2016)
208
![Page 209: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/209.jpg)
Research Directions: Language Generation
• Transfer learning for low resource languages?
• Data efficient architectures?
209
![Page 210: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/210.jpg)
Acknowledgements
Many images/tables are directly taken from the respective papers.
210
![Page 211: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/211.jpg)
211
Slides will be made available at:
http://sarathchandar.in/mmnlp-tutorial/
Bibliography of related papers will be maintained at:
http://github.com/apsarath/mmnlp-papers
![Page 212: Multilingual Multimodal Language Processing …2. Neural networks –basics a) Neural network and backpropagation b) Matching data with architectures c) Auto-encoders d) Distributed](https://reader033.fdocuments.net/reader033/viewer/2022042318/5f07e9507e708231d41f604f/html5/thumbnails/212.jpg)
Questions?
212