Transcript of Kapil Thadani [email protected] - Liangliang...
2
Outline
◦ Recurrent neural networks
  - Connections and updates
  - Activation functions
  - Gated units
◦ Sequence-to-sequence networks
  - Machine translation
  - Encoder-decoder architectures
  - Attention mechanism
  - Large vocabularies
  - Copying mechanism
  - Scheduled sampling
  - Multilingual MT
3
Recurrent connections
[Diagram: the input vector xt feeds the hidden state ht through weights Wxh; ht feeds back into itself through recurrent weights Whh and produces the output vector yt through weights Why]
yt = φy(Why ht)
ht = φh(Wxh xt +Whh ht−1)
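These updates can be sketched in a few lines of NumPy (a minimal illustration, with tanh standing in for φh and the identity for φy; all sizes are arbitrary):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1}); tanh stands in for phi_h
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)

def rnn_output(h_t, W_hy):
    # y_t = phi_y(W_hy h_t); phi_y is the identity here for simplicity
    return W_hy @ h_t

rng = np.random.default_rng(0)
W_xh = 0.1 * rng.normal(size=(4, 3))   # input-to-hidden weights
W_hh = 0.1 * rng.normal(size=(4, 4))   # hidden-to-hidden (recurrent) weights
W_hy = 0.1 * rng.normal(size=(2, 4))   # hidden-to-output weights

h = np.zeros(4)                        # initial hidden state h_0
for t in range(5):                     # process a length-5 input sequence
    h = rnn_step(rng.normal(size=3), h, W_xh, W_hh)
y = rnn_output(h, W_hy)
```

Note that the same Wxh and Whh are reused at every timestep, which is exactly what unfolding the network makes explicit.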
4
Recurrent connections: Unfolding
[Diagram: the network unfolded in time; inputs x1 x2 x3 x4 · · · feed hidden states h1 h2 h3 h4 through shared weights Wxh, consecutive hidden states are connected by shared weights Whh, and each ht produces an output yt through shared weights Why]
4
Recurrent connections: Backprop through time
[Diagram: the same unfolded network; the loss gradients ∂L/∂Why, ∂L/∂Whh and ∂L/∂Wxh flow backward through every timestep, so each shared weight matrix accumulates gradient contributions from all positions in the sequence]
5
Activation functions

φh is typically a smooth, bounded function, e.g., σ, tanh
[Diagram: a plain recurrent cell; ht−1 and xt enter a tanh unit that produces ht]

ht = tanh(Wxh xt + Whh ht−1)
− Susceptible to vanishing gradients
− Can fail to capture long-term dependencies
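The vanishing-gradient problem can be seen numerically: the Jacobian of hT with respect to h0 is a product of per-step factors diag(1 − ht²) Whh, whose norm typically shrinks geometrically with depth (a small NumPy demonstration with an arbitrary weight scale):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
W_hh = 0.1 * rng.normal(size=(d, d))   # modest recurrent weights
W_xh = 0.1 * rng.normal(size=(d, d))

h = np.zeros(d)
J = np.eye(d)                          # Jacobian dh_t / dh_0, built step by step
norms = []
for t in range(30):
    h = np.tanh(W_xh @ rng.normal(size=d) + W_hh @ h)
    J = np.diag(1 - h**2) @ W_hh @ J   # chain rule through one tanh step
    norms.append(np.linalg.norm(J))
```

With bounded tanh derivatives and modest recurrent weights the norm collapses rapidly, so early inputs barely influence late gradients.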
6
Long short-term memory (LSTM) Hochreiter & Schmidhuber (1997)
[Diagram: LSTM cell; xt and ht−1 feed a candidate tanh unit and three σ-gates (forget, input, output); the cell state ct−1 is carried through an additive path to ct, and ht is emitted through the output gate applied to tanh(ct)]
ft = σ(Wfx xt + Wfh ht−1)
it = σ(Wix xt + Wih ht−1)
ot = σ(Wox xt + Woh ht−1)
c̃t = tanh(Wxh xt + Whh ht−1)
ct = ft ⊙ ct−1 + it ⊙ c̃t
ht = ot ⊙ tanh(ct)
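One LSTM step can be sketched directly from these equations (a minimal NumPy version; ⊙ becomes elementwise `*`, the candidate weights Wxh, Whh are named W["cx"], W["ch"] here, and all sizes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    f_t = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev)      # forget gate
    i_t = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev)      # input gate
    o_t = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev)      # output gate
    c_tilde = np.tanh(W["cx"] @ x_t + W["ch"] @ h_prev)  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                   # gated additive update
    h_t = o_t * np.tanh(c_t)                             # exposed hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
d, k = 4, 3                                              # hidden size, input size
W = {}
for gate in ("f", "i", "o", "c"):
    W[gate + "x"] = 0.1 * rng.normal(size=(d, k))
    W[gate + "h"] = 0.1 * rng.normal(size=(d, d))

h, c = np.zeros(d), np.zeros(d)
for t in range(6):
    h, c = lstm_step(rng.normal(size=k), h, c, W)
```

Because ct is an additive update gated by ft, gradients can flow along the cell state with far less shrinkage than through a repeated tanh recurrence.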
7
Gated Recurrent Unit (GRU) Cho et al. (2014)
[Diagram: GRU cell; xt and ht−1 feed σ-gated reset and update gates; the reset gate scales ht−1 before the candidate tanh unit, and the update gate interpolates between ht−1 and the candidate to produce ht]
rt = σ(Wrx xt + Wrh ht−1)
zt = σ(Wzx xt + Wzh ht−1)
h̃t = tanh(Wxh xt + Whh (rt ⊙ ht−1))
ht = (1− zt) ⊙ ht−1 + zt ⊙ h̃t
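The corresponding GRU step, sketched under the same conventions (⊙ as elementwise `*`; illustrative sizes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W):
    r_t = sigmoid(W["rx"] @ x_t + W["rh"] @ h_prev)              # reset gate
    z_t = sigmoid(W["zx"] @ x_t + W["zh"] @ h_prev)              # update gate
    h_tilde = np.tanh(W["xh"] @ x_t + W["hh"] @ (r_t * h_prev))  # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde                    # interpolation

rng = np.random.default_rng(0)
d, k = 4, 3                                                      # hidden, input size
W = {}
for pre in ("r", "z"):
    W[pre + "x"] = 0.1 * rng.normal(size=(d, k))
    W[pre + "h"] = 0.1 * rng.normal(size=(d, d))
W["xh"] = 0.1 * rng.normal(size=(d, k))
W["hh"] = 0.1 * rng.normal(size=(d, d))

h = np.zeros(d)
for t in range(6):
    h = gru_step(rng.normal(size=k), h, W)
```

With only two gates and no separate cell state, the GRU carries three weight-matrix pairs versus the LSTM's four.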
8
Processing text with RNNs

Input
- Word/sentence embeddings
- One-hot words/characters
- CNNs over characters/words/sentences, e.g., document modeling
- Absent, e.g., RNN-LMs

Recurrent layer
- Gated units: LSTMs, GRUs
- Forward, backward, bidirectional
- ReLUs initialized with identity matrix

Output
- Softmax over words/characters/labels, e.g., text generation
- Deeper RNN layers
- Absent, e.g., text encoders
9
Machine Translation
“One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’”

— Warren Weaver, Translation (1955)
10
The MT Pyramid
analysis
generation
Source Target
Interlingua
lexical
syntactic
semantic
pragmatic
11
Phrase-based MT
Tomorrow I will fly to the conference in Canada
Morgen fliege Ich nach Kanada zur Konferenz
12
Phrase-based MT
1. Collect bilingual dataset 〈Si, Ti〉 ∈ D
2. Unsupervised phrase-based alignment ▸ phrase table π
3. Unsupervised n-gram language modeling ▸ language model ψ
4. Supervised decoder ▸ parameters θ

T̂ = argmaxT p(T |S) = argmaxT p(S|T, π, θ) · p(T |ψ)
12
Neural MT
1. Collect bilingual dataset 〈Si, Ti〉 ∈ D
2. Unsupervised phrase-based alignment ▸ phrase table π
3. Unsupervised n-gram language modeling ▸ language model ψ
4. Supervised encoder-decoder framework ▸ parameters θ
13
Encoder
▸ Input: source words x1, . . . , xn
▸ Output: context vector c

[Diagram: an RNN with gated units reads x1 . . . xn into hidden states h1 . . . hn; the final state summarizes the input as the context vector c]
14
Decoder
▸ Input: context vector c
▸ Output: translated words y1, . . . , ym

[Diagram: an RNN with gated units produces decoder states s1 . . . sm, each feeding a softmax that emits y1 . . . ym; c conditions every step]

si = f(si−1, yi−1, c)
15
Sequence-to-sequence learning Sutskever, Vinyals & Le (2014)
[Diagram: encoder states h1 . . . hn over inputs x1 . . . xn; the final state hn is passed as the context c that initializes decoder states s1 . . . sm emitting y1 . . . ym]

si = f(si−1, yi−1, hn)
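A toy end-to-end sketch of this setup (all names and sizes are assumptions: a single shared embedding table E, a plain tanh recurrence instead of gated units, and greedy decoding):

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 8, 6                               # hidden size, toy vocabulary size
E = 0.1 * rng.normal(size=(V, d))         # token embeddings
Wxh = 0.1 * rng.normal(size=(d, d))       # embedded-input-to-hidden weights
Whh = 0.1 * rng.normal(size=(d, d))       # recurrent weights
Why = 0.1 * rng.normal(size=(V, d))       # hidden-to-vocabulary logits

def encode(src_ids):
    h = np.zeros(d)
    for i in src_ids:
        h = np.tanh(Wxh @ E[i] + Whh @ h)
    return h                              # final state h_n serves as context c

def greedy_decode(c, bos=0, eos=1, max_len=10):
    s, y, out = c, bos, []                # s_0 initialized from the context
    for _ in range(max_len):
        s = np.tanh(Wxh @ E[y] + Whh @ s) # s_i = f(s_{i-1}, y_{i-1}, h_n)
        y = int(np.argmax(Why @ s))       # argmax of logits == argmax of softmax
        if y == eos:
            break
        out.append(y)
    return out

tokens = greedy_decode(encode([2, 3, 4]))
```

Everything the decoder knows about the source is squeezed into the single vector hn, which is precisely the bottleneck attention later relaxes.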
16
Sequence-to-sequence learning Sutskever, Vinyals & Le (2014)
Produces a fixed-length representation of the input
- “sentence embedding” or “thought vector”
17
Sequence-to-sequence learning Sutskever, Vinyals & Le (2014)
LSTM units do not solve vanishing gradients
- Poor performance on long sentences
- Need to reverse the input
18
Attention-based translation Bahdanau et al (2014)
[Diagram: encoder states h1 . . . hn over inputs x1 . . . xn; a feedforward network scores each hj against the previous decoder state (e4,1 . . . e4,n), a softmax normalizes these into attention weights (α4,1 . . . α4,n), and their weighted average of the hj forms the context vector c5 used to compute s5]

eij = a(si−1, hj)
αij = exp(eij) / Σk exp(eik)
ci = Σj αij hj
si = f(si−1, yi−1, ci)
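One concrete choice of the scoring function a and the resulting context vector (a NumPy sketch using an additive feedforward score, similar in spirit to Bahdanau et al.; the weight names Wa, Ua, v are illustrative):

```python
import numpy as np

def attention_context(s_prev, H, Wa, Ua, v):
    # e_j = a(s_{i-1}, h_j): one-hidden-layer feedforward score per source position
    e = np.tanh(s_prev @ Wa + H @ Ua) @ v      # shape (n,)
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()                # softmax over source positions
    c = alpha @ H                              # weighted average of encoder states
    return c, alpha

rng = np.random.default_rng(0)
n, d, a = 5, 4, 6                              # source length, state size, score size
H = rng.normal(size=(n, d))                    # encoder states h_1 .. h_n
s_prev = rng.normal(size=d)                    # previous decoder state
Wa = rng.normal(size=(d, a))
Ua = rng.normal(size=(d, a))
v = rng.normal(size=a)

c, alpha = attention_context(s_prev, H, Wa, Ua, v)
```

Because a fresh context ci is computed for every output position, the decoder can look back at different parts of the source instead of relying on one fixed vector.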
19
Attention-based translation Bahdanau et al (2014)
◦ Encoder:
  - Bidirectional RNN: forward (hj) + backward (h′j)
◦ GRUs instead of LSTMs
  - Simpler, fewer parameters
◦ Decoder:
  - si and yi also depend on yi−1
  - Additional hidden layer prior to softmax for yi
  - Inference is O(mn) instead of O(m) for seq-to-seq
20
Attention-based translation Bahdanau et al (2014)
Improved results on long sentences
21
Attention-based translation Bahdanau et al (2014)
Sensible induced alignments
22
Images
Show, Attend & Tell: Neural Image Caption Generation with Visual Attention(Xu et al. 2015)
23
Videos
Describing Videos by Exploiting Temporal Structure (Yao et al. 2015)
24
Large vocabularies
Sequence-to-sequence models can typically scale to 30K-50K words
But real-world applications need at least 500K-1M words
25
Large vocabularies
Alternative 1: Hierarchical softmax
- Predict path in binary tree representation of output layer
- Reduces to log2(V) binary decisions
p(wt = “dog” | · · · ) = (1− σ(U0 ht)) × σ(U1 ht) × σ(U4 ht)

[Diagram: a binary tree over the vocabulary; root 0 has children 1 and 2, node 1 has children 3 and 4, node 2 has children 5 and 6; the leaves under nodes 3–6 are cow, duck, cat, dog, she, he, and, the]
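The path probability can be sketched for this eight-word tree (assuming the convention that σ scores a right branch and 1 − σ a left branch, with one score vector Un per internal node; all values are random placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 5
U = rng.normal(size=(7, d))        # score vectors for internal nodes 0..6
h = rng.normal(size=d)             # current hidden state h_t

def path_prob(path):
    # path is a list of (node, go_right) branch decisions from the root
    p = 1.0
    for node, right in path:
        s = sigmoid(U[node] @ h)
        p *= s if right else (1.0 - s)
    return p

# "dog" sits under root 0 -> node 1 -> node 4, branching left, right, right:
p_dog = path_prob([(0, 0), (1, 1), (4, 1)])

# The eight leaf probabilities form a valid distribution:
paths = {
    "cow": [(0,0),(1,0),(3,0)], "duck": [(0,0),(1,0),(3,1)],
    "cat": [(0,0),(1,1),(4,0)], "dog":  [(0,0),(1,1),(4,1)],
    "she": [(0,1),(2,0),(5,0)], "he":   [(0,1),(2,0),(5,1)],
    "and": [(0,1),(2,1),(6,0)], "the":  [(0,1),(2,1),(6,1)],
}
total = sum(path_prob(p) for p in paths.values())
```

Scoring one word touches only log2(V) = 3 internal nodes instead of all |V| = 8 outputs.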
26
Large vocabularies Jean et al (2015)
Alternative 2: Importance sampling
- Expensive to compute the softmax normalization term over V

p(yi = wj | y<i, x) = exp(Wj⊤ f(si, yi−1, ci)) / Σk=1..|V| exp(Wk⊤ f(si, yi−1, ci))

- Use a small subset of the target vocabulary for each update
- Approximate expectation over gradient of loss with fewer samples
- Partition the training corpus and maintain local vocabularies in each partition to use GPUs efficiently
27
Large vocabularies Wu et al (2016)
Alternative 3: Wordpiece units
- Reduce vocabulary by replacing infrequent words with wordpieces
Jet makers feud over seat width with big orders at stake
⇓
_J et _makers _fe ud _over _seat _width _with _big _orders _at _stake
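A greedy longest-match segmenter illustrates how a fixed wordpiece inventory covers arbitrary words (a simplified stand-in for the learned wordpiece model of Wu et al.; the toy vocabulary below is hypothetical):

```python
def wordpiece_segment(word, vocab):
    # Greedy longest-match-first: repeatedly take the longest prefix
    # of the remainder that exists in the wordpiece vocabulary.
    pieces, rest = [], "_" + word          # "_" marks a word boundary
    while rest:
        for end in range(len(rest), 0, -1):
            if rest[:end] in vocab:
                pieces.append(rest[:end])
                rest = rest[end:]
                break
        else:
            return None                    # remainder cannot be segmented
    return pieces

# A toy inventory containing pieces from the slide's example:
vocab = {"_J", "et", "_fe", "ud", "_seat", "_width"}
pieces = wordpiece_segment("Jet", vocab)   # ["_J", "et"]
```

Rare words decompose into frequent pieces, so the output vocabulary stays small while still covering open-class words.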
28
Copying mechanism Gu et al (2016)
In monolingual tasks, copy rare words directly from the input
Generation via standard attention-based decoder
ψg(yi = wj) = Wj⊤ f(si, yi−1, ci),  wj ∈ V
Copying via a non-linear projection of input hidden states
ψc(yi = xj) = tanh(hj⊤ U) f(si, yi−1, ci),  xj ∈ X
Both modes compete via the softmax
p(yi = wj | y<i, x) = (1/Z) [ exp(ψg(wj)) + Σk: xk=wj exp(ψc(xk)) ]
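A toy numeric sketch of how the two modes compete in one softmax (the scores ψg, ψc below are made up; note an out-of-vocabulary source token still receives probability mass through the copy mode):

```python
import numpy as np

vocab = ["the", "cat", "sat", "<unk>"]
source = ["cat", "zylophone", "sat"]        # "zylophone" is out-of-vocabulary

psi_g = np.array([1.0, 0.5, 0.2, 0.0])      # generate scores for each vocab word
psi_c = np.array([2.0, 1.5, 0.3])           # copy scores for each source token

Z = np.exp(psi_g).sum() + np.exp(psi_c).sum()   # shared normalizer over both modes

def p(word):
    # Generate-mode mass if the word is in the vocabulary ...
    gen = np.exp(psi_g[vocab.index(word)]) if word in vocab else 0.0
    # ... plus copy-mode mass from every source position holding that word
    copy = sum(np.exp(psi_c[k]) for k, x in enumerate(source) if x == word)
    return (gen + copy) / Z

candidates = set(vocab) | set(source)
total = sum(p(w) for w in candidates)
```

Here p("cat") pools generate and copy mass, while p("zylophone") is nonzero only because it can be copied.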
29
Copying mechanism Gu et al (2016)
Decoding probability p(yt| · · · )
30
Copying mechanism Gu et al (2016)
31
Scheduled sampling Bengio et al (2015)
Decoder outputs yi and hidden states si typically conditioned on previous outputs

si = f(si−1, yi−1, ci)

At training time, these are taken from labels (“teacher forcing”)
▸ Model can’t learn to recover from errors

Replace labels with model outputs over the course of training, using an annealed sampling parameter
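The sampling decision and one annealing schedule can be sketched as follows (the inverse-sigmoid decay is one of the schedules proposed by Bengio et al.; k is a tunable constant):

```python
import math
import random

def choose_prev_token(gold, predicted, eps, rng=random):
    # With probability eps feed the gold label (teacher forcing),
    # otherwise feed back the model's own previous prediction.
    return gold if rng.random() < eps else predicted

def inverse_sigmoid_decay(step, k=100.0):
    # eps_i = k / (k + exp(i / k)): starts near 1 and anneals toward 0,
    # so training gradually shifts from gold history to model samples.
    return k / (k + math.exp(step / k))
```

Early in training the decoder mostly sees gold history; late in training it mostly conditions on its own predictions, matching test-time behavior.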
32
Multilingual MT Johnson et al (2016)

One model for translating between multiple languages
- Just add a language identification token before each sentence
t-SNE projection of learned representations of 74 sentences and different translations in English, Japanese and Korean