1

Sequence-to-Sequence Architectures

Kapil Thadani
kapil@cs.columbia.edu


2

Outline

◦ Recurrent neural networks
  - Connections and updates
  - Activation functions
  - Gated units

◦ Sequence-to-sequence networks
  - Machine translation
  - Encoder-decoder architectures
  - Attention mechanism
  - Large vocabularies
  - Copying mechanism
  - Scheduled sampling
  - Multilingual MT

3

Recurrent connections

[Figure: an RNN cell mapping input vector xt to hidden state ht, with recurrent weights Whh feeding ht back into the next step, and output vector yt]

ht = φh(Wxh xt + Whh ht−1)

yt = φy(Why ht)
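As an illustration, a minimal NumPy sketch of this update, assuming tanh for φh and a softmax for φy (one common choice; the names and shapes are illustrative):

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    # h_t = phi_h(W_xh x_t + W_hh h_{t-1}), with phi_h = tanh (assumed)
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
    # y_t = phi_y(W_hy h_t), with phi_y = softmax (assumed)
    logits = W_hy @ h_t
    y_t = np.exp(logits - logits.max())
    y_t /= y_t.sum()
    return h_t, y_t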

4

Recurrent connections: Unfolding

[Figure: the RNN unfolded over inputs x1, x2, x3, x4, · · ·, producing hidden states h1, h2, h3, h4 and outputs y1, y2, y3, y4; the weight matrices Wxh, Whh, Why are shared across all time steps]
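Unfolding can be written as a loop that reuses the same three weight matrices at every time step; a sketch under the same assumptions as above:

import numpy as np

def rnn_unroll(xs, h0, W_xh, W_hh, W_hy):
    # The same W_xh, W_hh, W_hy are applied at x1, x2, x3, x4, ...
    h, ys = h0, []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)   # h_t depends on x_t and h_{t-1}
        ys.append(W_hy @ h)                  # pre-softmax output y_t
    return ys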

4

Recurrent connections: Backprop through time

[Figure: the same unfolded RNN, with the gradients ∂L/∂Why, ∂L/∂Whh, and ∂L/∂Wxh accumulated across every time step during backpropagation through time]

5

Activation functions

φh is typically a smooth, bounded function, e.g., σ, tanh

ht = tanh(Wxh xt + Whh ht−1)

- Susceptible to vanishing gradients
- Can fail to capture long-term dependencies
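A small numerical illustration of why: the Jacobian ∂ht/∂ht−1 = diag(1 − ht²) Whh has entries bounded by the tanh derivative, so the product of many such Jacobians tends to shrink. The sizes and scales below are arbitrary assumptions:

import numpy as np

rng = np.random.default_rng(0)
d = 50
W_hh = rng.normal(scale=0.5 / np.sqrt(d), size=(d, d))
h = np.zeros(d)
jac_prod = np.eye(d)                      # accumulated d h_t / d h_0
norms = []
for t in range(60):
    x = rng.normal(size=d)                # stands in for W_xh x_t
    h = np.tanh(x + W_hh @ h)
    jac = (1.0 - h ** 2)[:, None] * W_hh  # d h_t / d h_{t-1}
    jac_prod = jac @ jac_prod
    norms.append(np.linalg.norm(jac_prod))
print([f"{n:.2e}" for n in norms[::10]])  # the norm typically decays toward zero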

6

Long short-term memory (LSTM) Hochreiter & Schmidhuber (1997)

[Figure: the LSTM cell, in which sigmoid forget, input, and output gates modulate the cell state ct and the hidden state ht]

ft = σ(Wfx xt + Wfh ht−1)

it = σ(Wix xt + Wih ht−1)

ot = σ(Wox xt + Woh ht−1)

c̃t = tanh(Wxh xt + Whh ht−1)

ct = ft ⊙ ct−1 + it ⊙ c̃t

ht = ot ⊙ tanh(ct)
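A minimal sketch of one LSTM step following the equations above (biases omitted, as in the equations; the dictionary key names are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    # W holds the weight matrices named above (W["cx"], W["ch"] correspond to
    # the slide's Wxh, Whh for the candidate); * is the element-wise product.
    f_t = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev)      # forget gate
    i_t = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev)      # input gate
    o_t = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev)      # output gate
    c_tilde = np.tanh(W["cx"] @ x_t + W["ch"] @ h_prev)  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                   # new cell state
    h_t = o_t * np.tanh(c_t)                             # new hidden state
    return h_t, c_t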

7

Gated Recurrent Unit (GRU) Cho et al. (2014)

[Figure: the GRU cell, in which sigmoid reset and update gates control how the candidate state h̃t is mixed with ht−1]

rt = σ(Wrx xt + Wrh ht−1)

zt = σ(Wzx xt + Wzh ht−1)

h̃t = tanh(Wxh xt + Whh (rt ⊙ ht−1))

ht = (1 − zt) ⊙ ht−1 + zt ⊙ h̃t
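A minimal sketch of one GRU step following the equations above, under the same assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W):
    # W holds the weight matrices named above; biases omitted, * is element-wise.
    r_t = sigmoid(W["rx"] @ x_t + W["rh"] @ h_prev)              # reset gate
    z_t = sigmoid(W["zx"] @ x_t + W["zh"] @ h_prev)              # update gate
    h_tilde = np.tanh(W["xh"] @ x_t + W["hh"] @ (r_t * h_prev))  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                  # interpolated new state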

8

Processing text with RNNs

Input
- Word/sentence embeddings
- One-hot words/characters
- CNNs over characters/words/sentences, e.g., document modeling
- Absent, e.g., RNN-LMs

Recurrent layer
- Gated units: LSTMs, GRUs
- Forward, backward, bidirectional
- ReLUs initialized with identity matrix

Output
- Softmax over words/characters/labels, e.g., text generation
- Deeper RNN layers
- Absent, e.g., text encoders

9

Machine Translation

“One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’”

— Warren Weaver, Translation (1955)

10

The MT Pyramid

[Figure: the MT pyramid, in which analysis of the Source climbs through lexical, syntactic, semantic, and pragmatic levels toward an Interlingua, and generation descends to the Target]

11

Phrase-based MT

Tomorrow I will fly to the conference in Canada

Morgen fliege Ich nach Kanada zur Konferenz


12

Phrase-based MT

1. Collect bilingual dataset 〈Si, Ti〉 ∈ D

2. Unsupervised phrase-based alignment → phrase table π

3. Unsupervised n-gram language modeling → language model ψ

4. Supervised decoder → parameters θ

T̂ = argmaxT p(T |S) = argmaxT p(S|T, π, θ) · p(T |ψ)

12

Neural MT

1. Collect bilingual dataset 〈Si, Ti〉 ∈ D

2. Unsupervised phrase-based alignment → phrase table π

3. Unsupervised n-gram language modeling → language model ψ

4. Supervised encoder-decoder framework → parameters θ

13

Encoder

- Input: source words x1, . . . , xn
- Output: context vector c

[Figure: an RNN with gated units reads x1, . . . , xn into hidden states h1, . . . , hn; the final state provides the context vector c]

14

Decoder

- Input: context vector c
- Output: translated words y1, . . . , ym

[Figure: an RNN with gated units produces decoder states s1, . . . , sm from the context vector c, and a softmax over each state yields the translated words y1, . . . , ym]

si = f(si−1, yi−1, c)
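Putting the encoder and decoder together, a minimal greedy-decoding sketch with plain tanh cells standing in for gated units; the weight names, the way c is injected at each step, and the absence of end-of-sentence handling are simplifying assumptions:

import numpy as np

def encode(xs, W_xh, W_hh):
    # Read the source words and return the final hidden state as the context vector c.
    h = np.zeros(W_hh.shape[0])
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
    return h

def greedy_decode(c, W_ys, W_ss, W_sy, y_start, max_len=20):
    # s_i = f(s_{i-1}, y_{i-1}, c); here f is a plain tanh cell with c added at
    # every step, and the previous prediction is fed back as a one-hot vector.
    s, y_prev, out = c, y_start, []
    for _ in range(max_len):                 # end-of-sentence handling omitted
        s = np.tanh(W_ys @ y_prev + W_ss @ s + c)
        idx = int(np.argmax(W_sy @ s))       # greedy choice of the next target word
        y_prev = np.eye(W_sy.shape[0])[idx]
        out.append(idx)
    return out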

15

Sequence-to-sequence learning Sutskever, Vinyals & Le (2014)

[Figure: encoder states h1, . . . , hn over x1, . . . , xn; the final state hn conditions decoder states s1, . . . , sm, which produce y1, . . . , ym]

si = f(si−1, yi−1, hn)

16

Sequence-to-sequence learning Sutskever, Vinyals & Le (2014)

Produces a fixed-length representation of the input
- “sentence embedding” or “thought vector”


17

Sequence-to-sequence learning Sutskever, Vinyals & Le (2014)

LSTM units do not solve vanishing gradients
- Poor performance on long sentences
- Need to reverse the input

18

Attention-based translation Bahdanau et al (2014)

[Figure: encoder states h1, . . . , hn over inputs x1, . . . , xn and decoder states s1, . . . , s4 with outputs y1, . . . , y4; a feedforward network scores each hj against the previous decoder state, a softmax turns the scores e4,1, . . . , e4,n into weights α4,1, . . . , α4,n, and their weighted average of the hj gives the context vector c5 used to compute s5]

eij = a(si−1, hj)

αij = exp(eij) / ∑k exp(eik)

ci = ∑j αij hj

si = f(si−1, yi−1, ci)
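A minimal sketch of one attention step, using an additive feedforward scorer for a(si−1, hj); the parameters W_a, U_a, v_a and the list-of-vectors encoding of the hj are assumptions:

import numpy as np

def attention_context(s_prev, H, W_a, U_a, v_a):
    # e_ij = a(s_{i-1}, h_j) with a small feedforward scorer,
    # alpha_ij = softmax over j, and c_i = sum_j alpha_ij h_j.
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()
    return alphas @ np.stack(H), alphas      # context vector c_i and weights alpha_i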

19

Attention-based translation Bahdanau et al (2014)

◦ Encoder:
  - Bidirectional RNN: forward (hj) + backward (h′j)

◦ GRUs instead of LSTMs
  - Simpler, fewer parameters

◦ Decoder:
  - si and yi also depend on yi−1
  - Additional hidden layer prior to softmax for yi
  - Inference is O(mn) instead of O(m) for seq-to-seq

20

Attention-based translation Bahdanau et al (2014)

Improved results on long sentences

21

Attention-based translation Bahdanau et al (2014)

Sensible induced alignments


22

Images

Show, Attend & Tell: Neural Image Caption Generation with Visual Attention(Xu et al. 2015)

23

Videos

Describing Videos by Exploiting Temporal Structure (Yao et al. 2015)

24

Large vocabularies

Sequence-to-sequence models can typically scale to 30K-50K words

But real-world applications need at least 500K-1M words

25

Large vocabularies

Alternative 1: Hierarchical softmax

- Predict path in a binary tree representation of the output layer
- Reduces to log2(V) binary decisions

p(wt = “dog” | · · · ) = (1 − σ(U0 ht)) × σ(U1 ht) × σ(U4 ht)

[Figure: a balanced binary tree with internal nodes 0–6 over the leaves cow, duck, cat, dog, she, he, and, the; “dog” is reached by going left at node 0, then right at nodes 1 and 4]
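A sketch of scoring one word under this scheme; the path encoding and parameter layout are assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hierarchical_softmax_prob(h_t, path, U):
    # path lists (node_id, go_right) decisions from the root to the word's leaf;
    # U[node_id] is that inner node's parameter vector, so only log2(V) sigmoids
    # are evaluated instead of a |V|-way softmax.
    p = 1.0
    for node, go_right in path:
        q = sigmoid(U[node] @ h_t)
        p *= q if go_right else (1.0 - q)
    return p

# For "dog" in the example tree: left at node 0, then right at nodes 1 and 4:
# p_dog = hierarchical_softmax_prob(h_t, [(0, False), (1, True), (4, True)], U)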

26

Large vocabularies Jean et al (2015)

Alternative 2: Importance sampling

- Expensive to compute the softmax normalization term over V

p(yi = wj | y<i, x) = exp(W⊤j f(si, yi−1, ci)) / ∑k=1..|V| exp(W⊤k f(si, yi−1, ci))

- Use a small subset of the target vocabulary for each update

- Approximate expectation over gradient of loss with fewer samples

- Partition the training corpus and maintain local vocabularies in each partition to use GPUs efficiently
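A rough sketch of the idea of normalizing over a sampled subset rather than all of V (not the exact importance-weighted estimator from the paper):

import numpy as np

def sampled_softmax_prob(f_i, W, target_idx, sample_ids):
    # Normalize only over the target word plus a small sampled subset of the
    # vocabulary, instead of all |V| rows of the output matrix W.
    idx = np.unique(np.append(sample_ids, target_idx))
    logits = W[idx] @ f_i                      # scores for the sampled words only
    weights = np.exp(logits - logits.max())
    target_pos = int(np.where(idx == target_idx)[0][0])
    return weights[target_pos] / weights.sum()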

27

Large vocabularies Wu et al (2016)

Alternative 3: Wordpiece units

- Reduce vocabulary by replacing infrequent words with wordpieces

Jet makers feud over seat width with big orders at stake

_J et _makers _fe ud _over _seat _width _with _big _orders _at _stake
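A sketch of applying a fixed wordpiece vocabulary with greedy longest-match segmentation; this illustrates falling back to subword units, not the procedure used to learn the wordpiece model in the paper:

def wordpiece_segment(word, vocab):
    # Greedily take the longest prefix of the remaining word that is in the
    # wordpiece vocabulary; return None if the word cannot be segmented.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return None
        pieces.append(word[start:end])
        start = end
    return pieces

# e.g. wordpiece_segment("_Jet", {"_J", "et", "_makers"}) -> ["_J", "et"]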

28

Copying mechanism Gu et al (2016)

In monolingual tasks, copy rare words directly from the input

Generation via standard attention-based decoder

ψg(yi = wj) = W⊤j f(si, yi−1, ci),  wj ∈ V

Copying via a non-linear projection of input hidden states

ψc(yi = xj) = tanh(h⊤j U) f(si, yi−1, ci),  xj ∈ X

Both modes compete via the softmax

p(yi = wj | y<i, x) = 1/Z [ exp(ψg(wj)) + ∑k: xk=wj exp(ψc(xk)) ]
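A sketch of combining the two modes in one softmax, assuming for simplicity that every source word is in V (the actual model also lets copying produce out-of-vocabulary words):

import numpy as np

def copy_generate_probs(f_i, H, src_ids, W_gen, U_copy):
    # psi_g: generation score for every word in V; psi_c: copy score for each
    # source position j; both sets of scores share one softmax normalizer Z.
    V = W_gen.shape[0]
    psi_g = W_gen @ f_i                                           # shape (V,)
    psi_c = np.array([np.tanh(h_j @ U_copy) @ f_i for h_j in H])  # shape (n,)
    scores = np.concatenate([psi_g, psi_c])
    p = np.exp(scores - scores.max())
    p /= p.sum()
    p_vocab = p[:V].copy()
    for j, w in enumerate(src_ids):     # fold copy probability into the copied word
        p_vocab[w] += p[V + j]
    return p_vocab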

29

Copying mechanism Gu et al (2016)

Decoding probability p(yt| · · · )

30

Copying mechanism Gu et al (2016)

31

Scheduled sampling Bengio et al (2015)

Decoder outputs yi and hidden states si typically conditioned on previous outputs

si = f(si−1, yi−1, ci)

At training time, these are taken from the gold labels (“teacher forcing”)
- Model can’t learn to recover from its own errors

Instead, gradually replace the gold labels with the model’s own outputs over training, using an annealed sampling parameter
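A sketch of the sampling decision and one possible annealing schedule (the paper discusses several):

import numpy as np

rng = np.random.default_rng(0)

def choose_decoder_input(y_gold_prev, y_model_prev, epsilon):
    # With probability epsilon feed the ground-truth previous token (teacher
    # forcing); otherwise feed back the model's own previous prediction.
    return y_gold_prev if rng.random() < epsilon else y_model_prev

def linear_schedule(step, total_steps, eps_min=0.0):
    # Anneal epsilon from 1 toward eps_min as training progresses.
    return max(eps_min, 1.0 - step / total_steps)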

32

Multilingual MT Johnson et al (2016)

One model for translating between multiple languages

- Just add a language identification token before each sentence

t-SNE projection of learned representations of 74 sentences and different translations in English, Japanese and Korean
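A sketch of the language-token preprocessing step; the exact token format shown here is an assumption for illustration:

def add_target_language_token(source_sentence: str, target_lang: str) -> str:
    # Prepend an artificial token telling the model which language to translate into.
    return f"<2{target_lang}> {source_sentence}"

# add_target_language_token("Hello, how are you?", "es") -> "<2es> Hello, how are you?"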