Transcript of Kapil Thadani [email protected] - Liangliang...
2
Outline
◦ Recurrent neural networks
  - Connections and updates
  - Activation functions
  - Gated units
◦ Sequence-to-sequence networks
  - Machine translation
  - Encoder-decoder architectures
  - Attention mechanism
  - Large vocabularies
  - Copying mechanism
  - Scheduled sampling
  - Multilingual MT
3
Recurrent connections
[Diagram: the input vector xt feeds the hidden state ht through weights Wxh; ht feeds back into itself through recurrent weights Whh and produces the output vector yt through weights Why]
yt = φy(Why ht)
ht = φh(Wxh xt +Whh ht−1)
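These updates can be sketched in a few lines of NumPy (a minimal illustration, with tanh standing in for φh and the identity for φy; all sizes are arbitrary):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1}); tanh stands in for phi_h
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)

def rnn_output(h_t, W_hy):
    # y_t = phi_y(W_hy h_t); phi_y is the identity here for simplicity
    return W_hy @ h_t

rng = np.random.default_rng(0)
W_xh = 0.1 * rng.normal(size=(4, 3))   # input-to-hidden weights
W_hh = 0.1 * rng.normal(size=(4, 4))   # hidden-to-hidden (recurrent) weights
W_hy = 0.1 * rng.normal(size=(2, 4))   # hidden-to-output weights

h = np.zeros(4)                        # initial hidden state h_0
for t in range(5):                     # process a length-5 input sequence
    h = rnn_step(rng.normal(size=3), h, W_xh, W_hh)
y = rnn_output(h, W_hy)
```

Note that the same Wxh and Whh are reused at every timestep, which is exactly what unfolding the network makes explicit.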
4
Recurrent connections: Unfolding
[Diagram: the network unfolded in time; inputs x1 x2 x3 x4 · · · feed hidden states h1 h2 h3 h4 through shared weights Wxh, consecutive hidden states are connected by shared weights Whh, and each ht produces an output yt through shared weights Why]
4
Recurrent connections: Backprop through time
[Diagram: the same unfolded network; the loss gradients ∂L/∂Why, ∂L/∂Whh and ∂L/∂Wxh flow backward through every timestep, so each shared weight matrix accumulates gradient contributions from all positions in the sequence]
5
Activation functions

φh is typically a smooth, bounded function, e.g., σ, tanh
[Diagram: a plain recurrent cell; ht−1 and xt enter a tanh unit that produces ht]

ht = tanh(Wxh xt + Whh ht−1)
− Susceptible to vanishing gradients
− Can fail to capture long-term dependencies
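The vanishing-gradient problem can be seen numerically: the Jacobian of hT with respect to h0 is a product of per-step factors diag(1 − ht²) Whh, whose norm typically shrinks geometrically with depth (a small NumPy demonstration with an arbitrary weight scale):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
W_hh = 0.1 * rng.normal(size=(d, d))   # modest recurrent weights
W_xh = 0.1 * rng.normal(size=(d, d))

h = np.zeros(d)
J = np.eye(d)                          # Jacobian dh_t / dh_0, built step by step
norms = []
for t in range(30):
    h = np.tanh(W_xh @ rng.normal(size=d) + W_hh @ h)
    J = np.diag(1 - h**2) @ W_hh @ J   # chain rule through one tanh step
    norms.append(np.linalg.norm(J))
```

With bounded tanh derivatives and modest recurrent weights the norm collapses rapidly, so early inputs barely influence late gradients.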
6
Long short-term memory (LSTM) Hochreiter & Schmidhuber (1997)
[Diagram: LSTM cell; xt and ht−1 feed a candidate tanh unit and three σ-gates (forget, input, output); the cell state ct−1 is carried through an additive path to ct, and ht is emitted through the output gate applied to tanh(ct)]
ft = σ(Wfx xt + Wfh ht−1)
it = σ(Wix xt + Wih ht−1)
ot = σ(Wox xt + Woh ht−1)
c̃t = tanh(Wxh xt + Whh ht−1)
ct = ft ⊙ ct−1 + it ⊙ c̃t
ht = ot ⊙ tanh(ct)
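One LSTM step can be sketched directly from these equations (a minimal NumPy version; ⊙ becomes elementwise `*`, the candidate weights Wxh, Whh are named W["cx"], W["ch"] here, and all sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    f_t = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev)      # forget gate
    i_t = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev)      # input gate
    o_t = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev)      # output gate
    c_tilde = np.tanh(W["cx"] @ x_t + W["ch"] @ h_prev)  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                   # gated additive update
    h_t = o_t * np.tanh(c_t)                             # exposed hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
d, k = 4, 3                                              # hidden size, input size
W = {}
for gate in ("f", "i", "o", "c"):
    W[gate + "x"] = 0.1 * rng.normal(size=(d, k))
    W[gate + "h"] = 0.1 * rng.normal(size=(d, d))

h, c = np.zeros(d), np.zeros(d)
for t in range(6):
    h, c = lstm_step(rng.normal(size=k), h, c, W)
```

Because ct is an additive update gated by ft, gradients can flow along the cell state with far less shrinkage than through a repeated tanh recurrence.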
7
Gated Recurrent Unit (GRU) Cho et al. (2014)
[Diagram: GRU cell; xt and ht−1 feed σ-gated reset and update gates; the reset gate scales ht−1 before the candidate tanh unit, and the update gate interpolates between ht−1 and the candidate to produce ht]
rt = σ(Wrx xt + Wrh ht−1)
zt = σ(Wzx xt + Wzh ht−1)
h̃t = tanh(Wxh xt + Whh (rt ⊙ ht−1))
ht = (1− zt) ⊙ ht−1 + zt ⊙ h̃t
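The corresponding GRU step, sketched under the same conventions (⊙ as elementwise `*`; illustrative sizes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W):
    r_t = sigmoid(W["rx"] @ x_t + W["rh"] @ h_prev)              # reset gate
    z_t = sigmoid(W["zx"] @ x_t + W["zh"] @ h_prev)              # update gate
    h_tilde = np.tanh(W["xh"] @ x_t + W["hh"] @ (r_t * h_prev))  # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde                    # interpolation

rng = np.random.default_rng(0)
d, k = 4, 3                                                      # hidden, input size
W = {}
for pre in ("r", "z"):
    W[pre + "x"] = 0.1 * rng.normal(size=(d, k))
    W[pre + "h"] = 0.1 * rng.normal(size=(d, d))
W["xh"] = 0.1 * rng.normal(size=(d, k))
W["hh"] = 0.1 * rng.normal(size=(d, d))

h = np.zeros(d)
for t in range(6):
    h = gru_step(rng.normal(size=k), h, W)
```

With only two gates and no separate cell state, the GRU carries three weight-matrix pairs versus the LSTM's four.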
8
Processing text with RNNs

Input
- Word/sentence embeddings
- One-hot words/characters
- CNNs over characters/words/sentences, e.g., document modeling
- Absent, e.g., RNN-LMs

Recurrent layer
- Gated units: LSTMs, GRUs
- Forward, backward, bidirectional
- ReLUs initialized with identity matrix

Output
- Softmax over words/characters/labels, e.g., text generation
- Deeper RNN layers
- Absent, e.g., text encoders
9
Machine Translation
“One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’”

— Warren Weaver, Translation (1955)
10
The MT Pyramid
analysis
generation
Source Target
Interlingua
lexical
syntactic
semantic
pragmatic
11
Phrase-based MT
Tomorrow I will fly to the conference in Canada
Morgen fliege Ich nach Kanada zur Konferenz
12
Phrase-based MT
1. Collect bilingual dataset 〈Si, Ti〉 ∈ D
2. Unsupervised phrase-based alignment ▸ phrase table π
3. Unsupervised n-gram language modeling ▸ language model ψ
4. Supervised decoder ▸ parameters θ

T̂ = argmaxT p(T |S) = argmaxT p(S|T, π, θ) · p(T |ψ)
12
Neural MT
1. Collect bilingual dataset 〈Si, Ti〉 ∈ D
2. Unsupervised phrase-based alignment ▸ phrase table π
3. Unsupervised n-gram language modeling ▸ language model ψ
4. Supervised encoder-decoder framework ▸ parameters θ
13
Encoder
▸ Input: source words x1, . . . , xn
▸ Output: context vector c

[Diagram: an RNN with gated units reads x1 . . . xn into hidden states h1 . . . hn; the final state summarizes the input as the context vector c]
14
Decoder
▸ Input: context vector c
▸ Output: translated words y1, . . . , ym

[Diagram: an RNN with gated units produces decoder states s1 . . . sm, each feeding a softmax that emits y1 . . . ym; c conditions every step]

si = f(si−1, yi−1, c)
15
Sequence-to-sequence learning Sutskever, Vinyals & Le (2014)
[Diagram: encoder states h1 . . . hn over inputs x1 . . . xn; the final state hn is passed as the context c that initializes decoder states s1 . . . sm emitting y1 . . . ym]

si = f(si−1, yi−1, hn)
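A toy end-to-end sketch of this setup (all names and sizes are assumptions: a single shared embedding table E, a plain tanh recurrence instead of gated units, and greedy decoding):

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 8, 6                               # hidden size, toy vocabulary size
E = 0.1 * rng.normal(size=(V, d))         # token embeddings
Wxh = 0.1 * rng.normal(size=(d, d))       # embedded-input-to-hidden weights
Whh = 0.1 * rng.normal(size=(d, d))       # recurrent weights
Why = 0.1 * rng.normal(size=(V, d))       # hidden-to-vocabulary logits

def encode(src_ids):
    h = np.zeros(d)
    for i in src_ids:
        h = np.tanh(Wxh @ E[i] + Whh @ h)
    return h                              # final state h_n serves as context c

def greedy_decode(c, bos=0, eos=1, max_len=10):
    s, y, out = c, bos, []                # s_0 initialized from the context
    for _ in range(max_len):
        s = np.tanh(Wxh @ E[y] + Whh @ s) # s_i = f(s_{i-1}, y_{i-1}, h_n)
        y = int(np.argmax(Why @ s))       # argmax of logits == argmax of softmax
        if y == eos:
            break
        out.append(y)
    return out

tokens = greedy_decode(encode([2, 3, 4]))
```

Everything the decoder knows about the source is squeezed into the single vector hn, which is precisely the bottleneck attention later relaxes.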
16
Sequence-to-sequence learning Sutskever, Vinyals & Le (2014)
Produces a fixed-length representation of the input
- “sentence embedding” or “thought vector”
17
Sequence-to-sequence learning Sutskever, Vinyals & Le (2014)
LSTM units do not solve vanishing gradients
- Poor performance on long sentences
- Need to reverse the input
18
Attention-based translation Bahdanau et al (2014)
[Diagram: encoder states h1 . . . hn over inputs x1 . . . xn; a feedforward network scores each hj against the previous decoder state (e4,1 . . . e4,n), a softmax normalizes these into attention weights (α4,1 . . . α4,n), and their weighted average of the hj forms the context vector c5 used to compute s5]

eij = a(si−1, hj)
αij = exp(eij) / Σk exp(eik)
ci = Σj αij hj
si = f(si−1, yi−1, ci)
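One concrete choice of the scoring function a and the resulting context vector (a NumPy sketch using an additive feedforward score, similar in spirit to Bahdanau et al.; the weight names Wa, Ua, v are illustrative):

```python
import numpy as np

def attention_context(s_prev, H, Wa, Ua, v):
    # e_j = a(s_{i-1}, h_j): one-hidden-layer feedforward score per source position
    e = np.tanh(s_prev @ Wa + H @ Ua) @ v      # shape (n,)
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()                # softmax over source positions
    c = alpha @ H                              # weighted average of encoder states
    return c, alpha

rng = np.random.default_rng(0)
n, d, a = 5, 4, 6                              # source length, state size, score size
H = rng.normal(size=(n, d))                    # encoder states h_1 .. h_n
s_prev = rng.normal(size=d)                    # previous decoder state
Wa = rng.normal(size=(d, a))
Ua = rng.normal(size=(d, a))
v = rng.normal(size=a)

c, alpha = attention_context(s_prev, H, Wa, Ua, v)
```

Because a fresh context ci is computed for every output position, the decoder can look back at different parts of the source instead of relying on one fixed vector.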
19
Attention-based translation Bahdanau et al (2014)
◦ Encoder:
  - Bidirectional RNN: forward (hj) + backward (h′j)
◦ GRUs instead of LSTMs
  - Simpler, fewer parameters
◦ Decoder:
  - si and yi also depend on yi−1
  - Additional hidden layer prior to softmax for yi
  - Inference is O(mn) instead of O(m) for seq-to-seq
20
Attention-based translation Bahdanau et al (2014)
Improved results on long sentences
21
Attention-based translation Bahdanau et al (2014)
Sensible induced alignments
22
Images
Show, Attend & Tell: Neural Image Caption Generation with Visual Attention(Xu et al. 2015)
23
Videos
Describing Videos by Exploiting Temporal Structure (Yao et al. 2015)
24
Large vocabularies
Sequence-to-sequence models can typically scale to 30K-50K words
But real-world applications need at least 500K-1M words
25
Large vocabularies
Alternative 1: Hierarchical softmax
- Predict path in binary tree representation of output layer
- Reduces to log2(V) binary decisions
p(wt = “dog” | · · · ) = (1− σ(U0 ht)) × σ(U1 ht) × σ(U4 ht)

[Diagram: a binary tree over the vocabulary; root 0 has children 1 and 2, node 1 has children 3 and 4, node 2 has children 5 and 6; the leaves under nodes 3–6 are cow, duck, cat, dog, she, he, and, the]
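The path probability can be sketched for this eight-word tree (assuming the convention that σ scores a right branch and 1 − σ a left branch, with one score vector Un per internal node; all values are random placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 5
U = rng.normal(size=(7, d))        # score vectors for internal nodes 0..6
h = rng.normal(size=d)             # current hidden state h_t

def path_prob(path):
    # path is a list of (node, go_right) branch decisions from the root
    p = 1.0
    for node, right in path:
        s = sigmoid(U[node] @ h)
        p *= s if right else (1.0 - s)
    return p

# "dog" sits under root 0 -> node 1 -> node 4, branching left, right, right:
p_dog = path_prob([(0, 0), (1, 1), (4, 1)])

# The eight leaf probabilities form a valid distribution:
paths = {
    "cow": [(0,0),(1,0),(3,0)], "duck": [(0,0),(1,0),(3,1)],
    "cat": [(0,0),(1,1),(4,0)], "dog":  [(0,0),(1,1),(4,1)],
    "she": [(0,1),(2,0),(5,0)], "he":   [(0,1),(2,0),(5,1)],
    "and": [(0,1),(2,1),(6,0)], "the":  [(0,1),(2,1),(6,1)],
}
total = sum(path_prob(p) for p in paths.values())
```

Scoring one word touches only log2(V) = 3 internal nodes instead of all |V| = 8 outputs.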
26
Large vocabularies Jean et al (2015)
Alternative 2: Importance sampling
- Expensive to compute the softmax normalization term over V

p(yi = wj | y<i, x) = exp(Wj⊤ f(si, yi−1, ci)) / Σk=1..|V| exp(Wk⊤ f(si, yi−1, ci))

- Use a small subset of the target vocabulary for each update
- Approximate expectation over gradient of loss with fewer samples
- Partition the training corpus and maintain local vocabularies in each partition to use GPUs efficiently
27
Large vocabularies Wu et al (2016)
Alternative 3: Wordpiece units
- Reduce vocabulary by replacing infrequent words with wordpieces
Jet makers feud over seat width with big orders at stake
⇓
_J et _makers _fe ud _over _seat _width _with _big _orders _at _stake
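A greedy longest-match segmenter illustrates how a fixed wordpiece inventory covers arbitrary words (a simplified stand-in for the learned wordpiece model of Wu et al.; the toy vocabulary below is hypothetical):

```python
def wordpiece_segment(word, vocab):
    # Greedy longest-match-first: repeatedly take the longest prefix
    # of the remainder that exists in the wordpiece vocabulary.
    pieces, rest = [], "_" + word          # "_" marks a word boundary
    while rest:
        for end in range(len(rest), 0, -1):
            if rest[:end] in vocab:
                pieces.append(rest[:end])
                rest = rest[end:]
                break
        else:
            return None                    # remainder cannot be segmented
    return pieces

# A toy inventory containing pieces from the slide's example:
vocab = {"_J", "et", "_fe", "ud", "_seat", "_width"}
pieces = wordpiece_segment("Jet", vocab)   # ["_J", "et"]
```

Rare words decompose into frequent pieces, so the output vocabulary stays small while still covering open-class words.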
28
Copying mechanism Gu et al (2016)
In monolingual tasks, copy rare words directly from the input
Generation via standard attention-based decoder
ψg(yi = wj) = Wj⊤ f(si, yi−1, ci),  wj ∈ V
Copying via a non-linear projection of input hidden states
ψc(yi = xj) = tanh(hj⊤ U) f(si, yi−1, ci),  xj ∈ X
Both modes compete via the softmax
p(yi = wj | y<i, x) = (1/Z) [ exp(ψg(wj)) + Σk: xk=wj exp(ψc(xk)) ]
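A toy numeric sketch of how the two modes compete in one softmax (the scores ψg, ψc below are made up; note an out-of-vocabulary source token still receives probability mass through the copy mode):

```python
import numpy as np

vocab = ["the", "cat", "sat", "<unk>"]
source = ["cat", "zylophone", "sat"]        # "zylophone" is out-of-vocabulary

psi_g = np.array([1.0, 0.5, 0.2, 0.0])      # generate scores for each vocab word
psi_c = np.array([2.0, 1.5, 0.3])           # copy scores for each source token

Z = np.exp(psi_g).sum() + np.exp(psi_c).sum()   # shared normalizer over both modes

def p(word):
    # Generate-mode mass if the word is in the vocabulary ...
    gen = np.exp(psi_g[vocab.index(word)]) if word in vocab else 0.0
    # ... plus copy-mode mass from every source position holding that word
    copy = sum(np.exp(psi_c[k]) for k, x in enumerate(source) if x == word)
    return (gen + copy) / Z

candidates = set(vocab) | set(source)
total = sum(p(w) for w in candidates)
```

Here p("cat") pools generate and copy mass, while p("zylophone") is nonzero only because it can be copied.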
29
Copying mechanism Gu et al (2016)
Decoding probability p(yt| · · · )
30
Copying mechanism Gu et al (2016)
31
Scheduled sampling Bengio et al (2015)
Decoder outputs yi and hidden states si typically conditioned on previous outputs

si = f(si−1, yi−1, ci)

At training time, these are taken from labels (“teacher forcing”)
▸ Model can’t learn to recover from errors

Replace labels with model outputs over the course of training, using an annealed sampling parameter
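The sampling decision and one annealing schedule can be sketched as follows (the inverse-sigmoid decay is one of the schedules proposed by Bengio et al.; k is a tunable constant):

```python
import math
import random

def choose_prev_token(gold, predicted, eps, rng=random):
    # With probability eps feed the gold label (teacher forcing),
    # otherwise feed back the model's own previous prediction.
    return gold if rng.random() < eps else predicted

def inverse_sigmoid_decay(step, k=100.0):
    # eps_i = k / (k + exp(i / k)): starts near 1 and anneals toward 0,
    # so training gradually shifts from gold history to model samples.
    return k / (k + math.exp(step / k))
```

Early in training the decoder mostly sees gold history; late in training it mostly conditions on its own predictions, matching test-time behavior.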
32
Multilingual MT Johnson et al (2016)

One model for translating between multiple languages
- Just add a language identification token before each sentence
t-SNE projection of learned representations of 74 sentences and different translations in English, Japanese and Korean