
Sequence-to-Sequence Architectures

Kapil Thadani (kapil@cs.columbia.edu)

Transcript of lecture slides: llcao.net/cu-deeplearning17/lecture/lecture8_kapil.pdf

Outline

◦ Recurrent neural networks
  - Connections and updates
  - Activation functions
  - Gated units

◦ Sequence-to-sequence networks
  - Machine translation
  - Encoder-decoder architectures
  - Attention mechanism
  - Large vocabularies
  - Copying mechanism
  - Scheduled sampling
  - Multilingual MT

Recurrent connections

[Figure: the input vector xt feeds the hidden state ht through Wxh; the hidden state feeds back into itself through Whh and produces the output vector yt through Why]

ht = φh(Wxh xt + Whh ht−1)
yt = φy(Why ht)
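A minimal NumPy sketch of this recurrence; the toy dimensions, the tanh choice for φh, and leaving φy as the identity are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One step of the vanilla RNN above: update the hidden state, then emit an output."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)   # ht = φh(Wxh xt + Whh ht−1), with φh = tanh
    y_t = W_hy @ h_t                            # yt = φy(Why ht); φy left as the identity here
    return h_t, y_t

# Toy dimensions: 3-dim inputs, 4-dim hidden state, 2-dim outputs
rng = np.random.default_rng(0)
W_xh, W_hh, W_hy = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))

h = np.zeros(4)
for x in rng.normal(size=(5, 3)):               # unfold over a length-5 input sequence
    h, y = rnn_step(x, h, W_xh, W_hh, W_hy)
```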

Recurrent connections: Unfolding

The same cell is applied at every time step: each input xt enters the hidden state through the shared weights Wxh, each hidden state receives the previous one through the shared weights Whh, and each output yt is produced through the shared weights Why.

[Figure: the recurrence unfolded over inputs x1, x2, x3, x4, . . ., producing hidden states h1, . . . , h4 and outputs y1, . . . , y4, with Wxh, Whh and Why reused at every step]

Recurrent connections: Backprop through time

[Figure: the unfolded network from above, with the loss gradients ∂L/∂Why, ∂L/∂Wxh and ∂L/∂Whh flowing back from each output; since the weights are shared across time steps, their gradients accumulate a contribution from every step]
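A minimal NumPy sketch of backprop through time for the vanilla RNN above, assuming a squared-error loss at every output; the dimensions and loss choice are illustrative assumptions:

```python
import numpy as np

def bptt(xs, targets, W_xh, W_hh, W_hy):
    """Forward pass over the whole sequence, then backprop through time.
    Gradients for the shared weights accumulate a term from every time step."""
    T, h_dim = len(xs), W_hh.shape[0]
    hs = [np.zeros(h_dim)]                        # h_0
    ys = []
    for x in xs:                                  # unfolded forward pass
        hs.append(np.tanh(W_xh @ x + W_hh @ hs[-1]))
        ys.append(W_hy @ hs[-1])

    dW_xh, dW_hh, dW_hy = (np.zeros_like(W) for W in (W_xh, W_hh, W_hy))
    dh_next = np.zeros(h_dim)                     # gradient flowing in from the future
    for t in reversed(range(T)):
        dy = ys[t] - targets[t]                   # squared-error loss at each step
        dW_hy += np.outer(dy, hs[t + 1])
        dh = W_hy.T @ dy + dh_next
        da = dh * (1.0 - hs[t + 1] ** 2)          # back through tanh
        dW_xh += np.outer(da, xs[t])
        dW_hh += np.outer(da, hs[t])
        dh_next = W_hh.T @ da
    return dW_xh, dW_hh, dW_hy

rng = np.random.default_rng(0)
W_xh, W_hh, W_hy = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
grads = bptt(rng.normal(size=(5, 3)), rng.normal(size=(5, 2)), W_xh, W_hh, W_hy)
```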

Activation functions

φh is typically a smooth, bounded function, e.g., σ or tanh:

ht = tanh(Wxh xt + Whh ht−1)

− Susceptible to vanishing gradients
− Can fail to capture long-term dependencies
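A small numeric illustration of the vanishing-gradient problem; the matrix scale and sequence length are arbitrary assumptions chosen so the effect is visible (a large recurrent matrix would instead make the gradient explode):

```python
import numpy as np

# Push a gradient back through T tanh recurrences and watch its norm shrink geometrically:
# each step multiplies by diag(1 - ht^2) Whh, whose norm is well below 1 here.
rng = np.random.default_rng(0)
W_hh = 0.1 * rng.normal(size=(4, 4))              # small recurrent weights -> vanishing
h, grad = np.zeros(4), np.ones(4)
for t in range(50):
    h = np.tanh(W_hh @ h + rng.normal(size=4))
    grad = W_hh.T @ (grad * (1.0 - h ** 2))       # one step of the backward recurrence
    if t % 10 == 9:
        print(f"step {t + 1}: ||grad|| = {np.linalg.norm(grad):.2e}")
```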

Long short-term memory (LSTM)   Hochreiter & Schmidhuber (1997)

[Figure: the LSTM cell: a tanh candidate, forget/input/output gates (each a σ of xt and ht−1), the additive cell-state path ct−1 → ct, and the gated output ht]

ft = σ(Wfx xt + Wfh ht−1)      (forget gate)
it = σ(Wix xt + Wih ht−1)      (input gate)
ot = σ(Wox xt + Woh ht−1)      (output gate)
c̃t = tanh(Wxh xt + Whh ht−1)   (candidate cell state)

ct = ft ⊙ ct−1 + it ⊙ c̃t
ht = ot ⊙ tanh(ct)
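A minimal NumPy sketch of one LSTM step following these equations; biases are omitted as on the slide, and the toy dimensions and weight naming are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM step; W is a dict of the eight weight matrices named after the slide's notation."""
    f_t = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev)       # forget gate
    i_t = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev)       # input gate
    o_t = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev)       # output gate
    c_tilde = np.tanh(W["xh"] @ x_t + W["hh"] @ h_prev)   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                    # ct = ft ⊙ ct−1 + it ⊙ c̃t
    h_t = o_t * np.tanh(c_t)                              # ht = ot ⊙ tanh(ct)
    return h_t, c_t

# Toy usage: 3-dim inputs, 4-dim hidden/cell state
rng = np.random.default_rng(0)
in_keys = ("fx", "ix", "ox", "xh")                        # matrices applied to x_t
W = {k: rng.normal(size=(4, 3) if k in in_keys else (4, 4))
     for k in ("fx", "fh", "ix", "ih", "ox", "oh", "xh", "hh")}
h, c = np.zeros(4), np.zeros(4)
for x in rng.normal(size=(5, 3)):
    h, c = lstm_step(x, h, c, W)
```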

Gated Recurrent Unit (GRU)   Cho et al. (2014)

[Figure: the GRU cell: reset and update gates (each a σ of xt and ht−1), a tanh candidate computed from the reset-gated hidden state, and an interpolation between ht−1 and the candidate]

rt = σ(Wrx xt + Wrh ht−1)      (reset gate)
zt = σ(Wzx xt + Wzh ht−1)      (update gate)
h̃t = tanh(Wxh xt + Whh (rt ⊙ ht−1))

ht = (1 − zt) ⊙ ht−1 + zt ⊙ h̃t
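A matching NumPy sketch of one GRU step; biases are omitted as on the slide, and the toy dimensions are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W):
    """One GRU step following the slide's equations."""
    r_t = sigmoid(W["rx"] @ x_t + W["rh"] @ h_prev)              # reset gate
    z_t = sigmoid(W["zx"] @ x_t + W["zh"] @ h_prev)              # update gate
    h_tilde = np.tanh(W["xh"] @ x_t + W["hh"] @ (r_t * h_prev))  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                  # interpolate old and new

# Toy usage: 3-dim inputs, 4-dim hidden state
rng = np.random.default_rng(0)
in_keys = ("rx", "zx", "xh")                                     # matrices applied to x_t
W = {k: rng.normal(size=(4, 3) if k in in_keys else (4, 4))
     for k in ("rx", "rh", "zx", "zh", "xh", "hh")}
h = np.zeros(4)
for x in rng.normal(size=(5, 3)):
    h = gru_step(x, h, W)
```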

Processing text with RNNs

◦ Input
  - Word/sentence embeddings
  - One-hot words/characters
  - CNNs over characters/words/sentences, e.g., document modeling
  - Absent, e.g., RNN-LMs

◦ Recurrent layer
  - Gated units: LSTMs, GRUs
  - Forward, backward, bidirectional
  - ReLUs initialized with identity matrix

◦ Output
  - Softmax over words/characters/labels, e.g., text generation
  - Deeper RNN layers
  - Absent, e.g., text encoders

Machine Translation

“One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’”

— Warren Weaver, Translation (1955)

The MT Pyramid

[Figure: the MT pyramid: analysis ascends the source side and generation descends the target side through lexical, syntactic, semantic and pragmatic levels, meeting at an interlingua at the apex]

Phrase-based MT

[Figure: word/phrase alignments between the English sentence “Tomorrow I will fly to the conference in Canada” and the German sentence “Morgen fliege Ich nach Kanada zur Konferenz”]

Phrase-based MT

1. Collect bilingual dataset 〈Si, Ti〉 ∈ D
2. Unsupervised phrase-based alignment
   ▸ phrase table π
3. Unsupervised n-gram language modeling
   ▸ language model ψ
4. Supervised decoder
   ▸ parameters θ

   T̂ = argmaxT p(T | S)
     = argmaxT p(S | T, π, θ) · p(T | ψ)

Neural MT

1. Collect bilingual dataset 〈Si, Ti〉 ∈ D
2. Unsupervised phrase-based alignment
   ▸ phrase table π
3. Unsupervised n-gram language modeling
   ▸ language model ψ
4. Supervised encoder-decoder framework
   ▸ parameters θ

Encoder

▸ Input: source words x1, . . . , xn
▸ Output: context vector c

[Figure: an RNN with gated units reads x1 . . . xn into hidden states h1 . . . hn; the final hidden state provides the context vector c]

Decoder

▸ Input: context vector c
▸ Output: translated words y1, . . . , ym

[Figure: an RNN with gated units produces decoder states s1 . . . sm, each followed by a softmax over the output vocabulary to emit y1 . . . ym]

si = f(si−1, yi−1, c)

Sequence-to-sequence learning   Sutskever, Vinyals & Le (2014)

[Figure: the encoder and decoder chained together; the final encoder state hn serves as the context passed to every decoder step]

si = f(si−1, yi−1, hn)
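A toy NumPy sketch of the encoder-decoder loop, with plain tanh cells standing in for the gated units the slides use; the embedding tables, dimensions, greedy decoding, and BOS/EOS ids are all illustrative assumptions:

```python
import numpy as np

def encode(src_ids, E_src, W_xh, W_hh):
    """Run the encoder RNN over the source tokens; the final hidden state is the context c."""
    h = np.zeros(W_hh.shape[0])
    for i in src_ids:
        h = np.tanh(W_xh @ E_src[i] + W_hh @ h)
    return h                                            # c = hn

def decode(c, E_tgt, W_yh, W_hh_dec, W_out, bos_id, eos_id, max_len=20):
    """Greedy decoding: si = f(si−1, yi−1, c), with the context re-fed at every step."""
    s, y_prev, out = np.zeros(W_hh_dec.shape[0]), bos_id, []
    for _ in range(max_len):
        s = np.tanh(W_yh @ E_tgt[y_prev] + W_hh_dec @ s + c)   # condition on s_{i−1}, y_{i−1}, c
        y_prev = int(np.argmax(W_out @ s))              # softmax + argmax = argmax of the logits
        if y_prev == eos_id:
            break
        out.append(y_prev)
    return out

rng = np.random.default_rng(0)
V_src, V_tgt, d, hid = 10, 10, 8, 16
E_src, E_tgt = rng.normal(size=(V_src, d)), rng.normal(size=(V_tgt, d))
W_xh, W_hh = rng.normal(size=(hid, d)), rng.normal(size=(hid, hid))
W_yh, W_hh_dec, W_out = rng.normal(size=(hid, d)), rng.normal(size=(hid, hid)), rng.normal(size=(V_tgt, hid))
c = encode([3, 1, 4, 1, 5], E_src, W_xh, W_hh)
print(decode(c, E_tgt, W_yh, W_hh_dec, W_out, bos_id=0, eos_id=1))
```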

Sequence-to-sequence learning   Sutskever, Vinyals & Le (2014)

Produces a fixed-length representation of the input
- a “sentence embedding” or “thought vector”

Sequence-to-sequence learning   Sutskever, Vinyals & Le (2014)

LSTM units do not solve vanishing gradients
- Poor performance on long sentences
- Need to reverse the input

Attention-based translation   Bahdanau et al. (2014)

[Figure: the encoder produces h1 . . . hn; at each decoder step a feedforward scorer compares the previous decoder state with every encoder state, a softmax turns the scores into weights α, and the weighted average of the encoder states forms the context ci fed into the next decoder state]

eij = a(si−1, hj)                 (alignment scores from a feedforward network)

αij = exp(eij) / Σk exp(eik)      (softmax over source positions)

ci = Σj αij hj                    (weighted average of encoder states)

si = f(si−1, yi−1, ci)            (decoder update with a per-step context)
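A minimal NumPy sketch of one attention step; the additive form of the scorer a(·,·) (a small feedforward network with parameters W_a, U_a, v_a) and the toy dimensions are assumptions for illustration:

```python
import numpy as np

def attention_context(s_prev, H, W_a, U_a, v_a):
    """One decoder step of additive attention over the encoder states H (shape (n, hid))."""
    scores = np.tanh(H @ U_a.T + s_prev @ W_a.T) @ v_a    # e_ij = a(s_{i−1}, h_j), shape (n,)
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                                # α_ij = softmax_j(e_ij)
    return alphas @ H, alphas                             # c_i = Σ_j α_ij h_j

# Toy usage: 6 encoder states of dim 8, decoder state of dim 8, scorer dim 5
rng = np.random.default_rng(0)
H, s_prev = rng.normal(size=(6, 8)), rng.normal(size=8)
W_a, U_a, v_a = rng.normal(size=(5, 8)), rng.normal(size=(5, 8)), rng.normal(size=5)
c_i, alphas = attention_context(s_prev, H, W_a, U_a, v_a)
print(alphas.round(3), c_i.shape)
```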

Attention-based translation   Bahdanau et al. (2014)

◦ Encoder:
  - Bidirectional RNN: forward (hj) + backward (h′j)

◦ GRUs instead of LSTMs
  - Simpler, fewer parameters

◦ Decoder:
  - si and yi also depend on yi−1
  - Additional hidden layer prior to the softmax for yi
  - Inference is O(mn) instead of O(m) for seq-to-seq

Attention-based translation   Bahdanau et al. (2014)

Improved results on long sentences

Sensible induced alignments

Images

Show, Attend & Tell: Neural Image Caption Generation with Visual Attention (Xu et al. 2015)

Videos

Describing Videos by Exploiting Temporal Structure (Yao et al. 2015)

Large vocabularies

Sequence-to-sequence models can typically scale to 30K–50K words

But real-world applications need at least 500K–1M words

Large vocabularies

Alternative 1: Hierarchical softmax

- Predict a path in a binary tree representation of the output layer
- Reduces to log2(V) binary decisions

[Figure: a binary tree with internal nodes 0–6 and leaves cow, duck, cat, dog, she, he, and, the]

p(wt = “dog” | · · · ) = (1 − σ(U0 ht)) × σ(U1 ht) × σ(U4 ht)
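A small sketch of the hierarchical-softmax idea on the slide's 8-word tree: each word's probability is a product of sigmoid terms along its path. The path for "dog" mirrors the slide's example; the path bits for the other leaves are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# (node index, branch bit) pairs: bit 1 -> σ(U_node h), bit 0 -> 1 − σ(U_node h)
PATHS = {
    "dog": [(0, 0), (1, 1), (4, 1)],   # (1 − σ(U0 h)) · σ(U1 h) · σ(U4 h), as on the slide
    "cow": [(0, 0), (1, 0), (3, 0)],   # illustrative assumption for another leaf
}

def word_prob(word, h, U):
    p = 1.0
    for node, bit in PATHS[word]:
        s = sigmoid(U[node] @ h)
        p *= s if bit == 1 else (1.0 - s)
    return p

rng = np.random.default_rng(0)
U = rng.normal(size=(7, 4))            # one vector per internal node, hidden size 4
h = rng.normal(size=4)
print(word_prob("dog", h, U))
```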

Large vocabularies   Jean et al. (2015)

Alternative 2: Importance sampling

- Expensive to compute the softmax normalization term over V

  p(yi = wj | y<i, x) = exp(Wj⊤ f(si, yi−1, ci)) / Σk=1..|V| exp(Wk⊤ f(si, yi−1, ci))

- Use a small subset of the target vocabulary for each update
- Approximate the expectation over the gradient of the loss with fewer samples
- Partition the training corpus and maintain local vocabularies in each partition to use GPUs efficiently
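A sketch of the underlying idea: normalize over a small sampled subset of the vocabulary (plus the true target) instead of all |V| words. The uniform proposal and the absence of a bias-correction term are simplifying assumptions, not the exact estimator of Jean et al. (2015):

```python
import numpy as np

def sampled_softmax_loss(h, W, target, n_samples=50, rng=np.random.default_rng(0)):
    """Negative log-likelihood of the target under a softmax restricted to sampled candidates."""
    V = W.shape[0]
    negatives = rng.choice(V, size=n_samples, replace=False)
    cand = np.unique(np.concatenate(([target], negatives)))   # candidate vocabulary
    logits = W[cand] @ h                                      # scores for candidates only
    logits -= logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[np.where(cand == target)[0][0]]

rng = np.random.default_rng(1)
W, h = rng.normal(size=(50000, 32)), rng.normal(size=32)      # |V| = 50K, hidden size 32
print(sampled_softmax_loss(h, W, target=123))
```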

Large vocabularies   Wu et al. (2016)

Alternative 3: Wordpiece units

- Reduce the vocabulary by replacing infrequent words with wordpieces

  Jet makers feud over seat width with big orders at stake
  _J et _makers _fe ud _over _seat _width _with _big _orders _at _stake
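A sketch of greedy longest-match segmentation into wordpieces, using a toy vocabulary that reproduces the slide's example ("_" marks the start of a word); real wordpiece/BPE vocabularies are learned from corpus statistics:

```python
# Toy wordpiece vocabulary (an assumption for illustration)
VOCAB = {"_J", "et", "_makers", "_fe", "ud", "_over", "_seat", "_width",
         "_with", "_big", "_orders", "_at", "_stake"}

def segment(word):
    """Split one word (prefixed with '_') into the longest matching pieces."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):          # try the longest candidate first
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])                  # fall back to a single character
            i += 1
    return pieces

sentence = "Jet makers feud over seat width with big orders at stake"
print([p for w in sentence.split() for p in segment("_" + w)])
```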

Copying mechanism   Gu et al. (2016)

In monolingual tasks, copy rare words directly from the input.

Generation via the standard attention-based decoder:

  ψg(yi = wj) = Wj⊤ f(si, yi−1, ci)            wj ∈ V

Copying via a non-linear projection of the input hidden states:

  ψc(yi = xj) = tanh(hj⊤ U) f(si, yi−1, ci)    xj ∈ X

Both modes compete via the softmax:

  p(yi = wj | y<i, x) = (1/Z) [ exp(ψg(wj)) + Σk: xk=wj exp(ψc(xk)) ]

Copying mechanism   Gu et al. (2016)

[Figure: decoding probability p(yt | · · · )]

Scheduled sampling   Bengio et al. (2015)

Decoder outputs yi and hidden states si are typically conditioned on previous outputs:

  si = f(si−1, yi−1, ci)

At training time, the previous outputs are taken from the labels (“teacher forcing”)
▸ the model can't learn to recover from its own errors

Replace the labels with model outputs during training, using an annealed sampling parameter
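A sketch of the per-step choice scheduled sampling makes during training: with probability eps use the ground-truth previous token (teacher forcing), otherwise use the model's own prediction. The linear annealing schedule and its rate are illustrative assumptions (Bengio et al. (2015) discuss several schedules):

```python
import numpy as np

def teacher_forcing_prob(epoch, total_epochs):
    return max(0.0, 1.0 - epoch / total_epochs)       # anneal from 1 (always truth) toward 0

def choose_prev_token(gold_prev, model_prev, eps, rng):
    return gold_prev if rng.random() < eps else model_prev

rng = np.random.default_rng(0)
for epoch in range(0, 10, 3):
    eps = teacher_forcing_prob(epoch, total_epochs=10)
    # inside the training loop, at each decoder step:
    y_prev = choose_prev_token(gold_prev=42, model_prev=7, eps=eps, rng=rng)
    print(f"epoch {epoch}: eps={eps:.1f}, conditioned on token {y_prev}")
```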

Multilingual MT   Johnson et al. (2016)

One model for translating between multiple languages
- Just add a language identification token before each sentence

[Figure: t-SNE projection of learned representations of 74 sentences and different translations in English, Japanese and Korean]
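A tiny sketch of the language-token trick: prepend a token indicating the desired target language to the source sentence and train one model on the pooled data. The "<2xx>" token format is an assumption here, not quoted from the slides:

```python
def add_language_token(source_sentence, target_lang):
    """Prepend a target-language token so one model can serve many language pairs."""
    return f"<2{target_lang}> {source_sentence}"

print(add_language_token("Hello, how are you?", "ja"))
print(add_language_token("Hello, how are you?", "ko"))
```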