Neural’Machine’Transla/on’’ - Stanford NLP Group · 2015-10-23 ·...

Neural Machine Transla/on

Thang Luong Tutorial @ Stanford NLP lunch

(Thanks to Chris Manning, Abigail See, and

Russell Stewart for comments and discussions.)

Sixi roasted husband Meat Muscle Stupid Bean Sprouts

Source words Target words

SyntacGc structure SyntacGc structure

SemanGc structure SemanGc structure

Interlingual

The Vauquois Diagram (1968)

Direct approach Words Words

SyntacGc SyntacGc

SemanGc SemanGc

Interlingual

She loves

aime les chats

mignons

Translate locally!

Transfer approach Words Words

SyntacGc SyntacGc

SemanGc SemanGc

Interlingual

Complex, slow, need lots of resources!

Neural Machine Transla/on

•  Simple – Skip intermediate structures.

•  GeneralizaGon – With distributed representaGons.

Words Words

Interlingual

Outline

•  NMT basics (Sutskever et al., 2014) – Architecture. – Recurrent units. – BackpropagaGon.

•  AYenGon mechanism (Bahdanau et al., 2015)

Neural Machine Transla/on (NMT)

•  Big RNNs trained end-‐to-‐end. am a student _ Je suis étudiant

Je suis étudiant _

•  Big RNNs trained end-‐to-‐end: encoder-‐decoder. am a student _ Je suis étudiant

Je suis étudiant _

Word Embeddings

am a student _ Je suis étudiant

Je suis étudiant _

•  Randomly iniGalized, one for each language.

Source embeddings

Target embeddings

Je suis étudiant _

Recurrent Connec/ons IniGal states

•  O^en set to 0.

Je suis étudiant _

Recurrent Connec/ons

•  Different across layers and encoder / decoder.

Encoder 1st layer

Je suis étudiant _

Encoder 2nd layer

Je suis étudiant _

Decoder 1st layer

Je suis étudiant _

Decoder 2nd layer

Je suis étudiant _

SoBmax Layer

•  Hidden states ↦ scores.

Scores

Je suis étudiant _

SoBmax Layer

•  Scores ↦ probabiliGes.

Je suis étudiant _

Scores

so#max

P(suis | Je, source) |V|

Output embeddings

Training Loss

am a student _ Je suis étudiantI

Je suis étudiant _

Training Loss

-log P(Je)

-log P(suis)

•  Sum of all individual losses

Training Loss

-log P(Je)

-log P(suis)

Training Loss

-log P(étudiant)

Training Loss

-log P(_)

Summary Deep RNNs

(Sutskever et al., 2014)

Je suis étudiant _

am a student

_ Je suis étudiant

Je suis étudiant _

BidirecGonal RNNs (Bahdanau et al., 2015)

•  Generalize well.

•  Small memory.

•  Simple decoder.

Outline

Recurrent types – vanilla RNN

Shared over Gme

Vanishing gradient problem!

Vanishing gradients

(Pascanu et al., 2013)

Chain Rule

Bounding Rules

Bound Largest singular value

Vanishing gradients

Chain Rule

Bounding Rules

Sufficient Cond

(Pascanu et al., 2013)

Recurrent types – LSTM

•  Long-‐Short Term Memory (LSTM) –  (Hochreiter & Schmidhuber, 1997)

•  LSTM cells are addiGvely updated – Make backprop through Gme easier.

Com’on, it’s been around for 20 years!

LSTM cells

Building LSTM

•  A naïve version.

Nice gradients!

Building LSTM

•  Add input gates: control input signal.

Input gates

Building LSTM

•  Add forget gates: control memory.

Forget gates

Building LSTM

Output gates

•  Add output gates: extract informaGon. •  (Zaremba et al., 2014).

Why LSTM works?

•  The addiGve operaGon is the key! •  BackpropaGon path through the cell is effecGve.

LSTMt LSTMt-‐1

But forget gates can also be 0?

Important LSTM components?

•  (Jozefowicz et al., 2015): forget gate bias of 1.

•  (Greff et al., 2015): forget gates & output acts.

LSTMt LSTMt-‐1 +

Important LSTM components?

•  (Jozefowicz et al., 2015): forget gate bias of 1.

•  (Greff et al., 2015): forget gates & output acts.

LSTMt LSTMt-‐1 +

Forget gates are important!

•  (Graves, 2013): revived LSTM. – Direct connecGons between cells and gates.

•  Gated Recurrent Unit (GRU) – (Cho et al., 2014a) – No cells, same addiGve idea.

•  LSTM vs. GRU: mixed results (Chung et al., 2015).

Other RNN units

Encoder-‐decoder Summary Encoder Decoder

(Sutskever et al., 2014) (Luong et al., 2015a) (Luong et al., 2015b)

Deep LSTM Deep LSTM

(Cho et al., 2014a) (Bahdanau et al., 2015)

(Jean et al., 2015)

(BidirecGonal) GRU GRU

Deep LSTM Deep LSTM

(Jean et al., 2015)

(Kalchbrenner & Blunsom, 2013) CNN (Inverse CNN)

Deep LSTM Deep LSTM

(Jean et al., 2015)

(Kalchbrenner & Blunsom, 2013) CNN (Inverse CNN)

(Cho et al., 2014b) Gated Recursive CNN GRU

Outline

Backpropaga/on Through Time

-log P(_)

Init to 0

-log P(étudiant)

-log P(suis)

LSTM Backpropaga/on

•  Deltas sent back from the top layers.

LSTMt LSTMt-‐1

LSTM Backpropaga/on – Context

LSTMt LSTMt-‐1

•  Complete context vector gradient.

LSTMt LSTMt-‐1

•  First, use to compute gradients for .

LSTMt LSTMt-‐1

•  Then, update .

LSTMt LSTMt-‐1

LSTM Backprop

LSTMt LSTMt-‐1

LSTM Backprop

•  Compute gradients for , , .

LSTMt LSTMt-‐1

LSTM Backprop

•  Add gradients from the loss / upper layers.

LSTMt LSTMt-‐1

Summary

•  LSTM backpropagaGon is nasty.

•  But it will be much easier if: – Know your matrix calculus! – Pay aYenGon to and .

Outline

•  NMT basics

Sentence Length Problem

With aYenGon Without aYenGon

(Bahdanau et al., 2015)

APen/on Mechanism

•  Problem: Markovian process. •  SoluGon: random access memory – A memory of source hidden states. – cf. Neural Turing Machine (Graves et al., 2014).

Je suis étudiant _

APen/on mechanism

•  Recent innovaGon in deep learning: –  Control problem (Mnih et al., 14) –  Speech recogniGon (Chorowski et al., 15) –  Image capGoning (Xu et al., 15)

Simplified AYenGon (Bahdanau et al., 2015)

Deep LSTM (Sutskever et al., 2014)

Je suis étudiant _

am a student _ Je

Attention Layer

Context vector

What’s next?

Je suis étudiant _

•  Compare target and source hidden states.

APen/on Mechanism – Scoring

am a student _ Je

Attention Layer

Context vector

am a student _ Je

Attention Layer

Context vector

am a student _ Je

Attention Layer

Context vector

am a student _ Je

Attention Layer

Context vector

am a student _ Je

Attention Layer

Context vector

1 3 5 1

•  Convert into alignment weights.

APen/on Mechanism – Normaliza2on

am a student _ Je

Attention Layer

Context vector

0.1 0.3 0.5 0.1

am a student _ Je

Context vector

•  Build context vector: weighted average.

APen/on Mechanism – Context vector

am a student _ Je

Context vector

•  Compute the next hidden state.

APen/on Mechanism – Hidden state

am a student _ Je

Context vector

•  Predict the next word.

APen/on Mechanism – Predict

APen/on Func/ons

(Luong et al., 2015)

APen/on Func/ons

•  Works so far for NMT! •  Concern: insensiGve to locaGons – Similar weights for same source words – Bi-‐direcGonal RNNs can help.

(Luong et al., 2015b)

Other APen/on Func/ons

•  Content-‐based:

•  LocaGon-‐based: –  (Graves, 2013): hand-‐wriGng synthesis model.

•  Hybrid: –  (Chorowski et al., 2015) for speech recogniGon.

Summary

Simplified AYenGon (Bahdanau et al., 2015)

Deep LSTM (Sutskever et al., 2014)

Je suis étudiant _

Thank you!

References (1) •  [Bahdanau et al., 2015] Neural TranslaGon by Jointly Learning to Align and

Translate. hYp://arxiv.org/pdf/1409.0473.pdf •  [Cho et al., 2014a] Learning Phrase RepresentaGons using RNN Encoder–Decoder

for StaGsGcal Machine TranslaGon. hYp://aclweb.org/anthology/D/D14/D14-‐1179.pdf

•  [Cho et al., 2014b] On the ProperGes of Neural Machine TranslaGon: Encoder–Decoder Approaches. hYp://www.aclweb.org/anthology/W14-‐4012

•  [Chorowski et al., 2015] AYenGon-‐Based Models for Speech RecogniGon. hYp://arxiv.org/pdf/1506.07503v1.pdf

•  [Chung et al., 2015] Empirical EvaluaGon of Gated Recurrent Neural Networks on Sequence Modeling. hYp://arxiv.org/pdf/1412.3555.pdf

•  [Graves, 2013] GeneraGng Sequences With Recurrent Neural Networks. hYp://arxiv.org/pdf/1308.0850v5.pdf

•  [Graves, 2014] Neural Turing Machine. hYp://arxiv.org/pdf/1410.5401v2.pdf. •  [Hochreiter & Schmidhuber, 1997] Long Short-‐term Memory.

hYp://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf

References (2) •  [Kalchbrenner & Blunsom, 2013] Recurrent ConGnuous TranslaGon Models.

hYp://nal.co/papers/KalchbrennerBlunsom_EMNLP13 •  [Luong et al., 2015a] Addressing the Rare Word Problem in Neural Machine

TranslaGon. hYp://www.aclweb.org/anthology/P15-‐1002 •  [Luong et al., 2015b] EffecGve Approaches to AYenGon-‐based Neural Machine

TranslaGon. hYps://aclweb.org/anthology/D/D15/D15-‐1166.pdf •  [Mnih et al., 2014] Recurrent Models of Visual AYenGon.

hYp://papers.nips.cc/paper/5542-‐recurrent-‐models-‐of-‐visual-‐aYenGon.pdf •  [Pascanu et al., 2013] On the difficulty of training Recurrent Neural Networks.

hYp://arxiv.org/pdf/1211.5063v2.pdf •  [Xu et al., 2015] Show, AYend and Tell: Neural Image CapGon GeneraGon with

Visual AYenGon. hYp://jmlr.org/proceedings/papers/v37/xuc15.pdf •  [Sutskever et al., 2014] Sequence to Sequence Learning with Neural Networks.

hYp://papers.nips.cc/paper/5346-‐sequence-‐to-‐sequence-‐learning-‐with-‐neural-‐networks.pdf

•  [Zaremba et al., 2015] Recurrent Neural Network RegularizaGon. hYp://arxiv.org/pdf/1409.2329.pdf

Neural’Machine’Transla/on’’ - Stanford NLP Group · 2015-10-23 ·...

Documents

Transcript of Neural’Machine’Transla/on’’ - Stanford NLP Group · 2015-10-23 ·...

NLP Practitioner Heart of NLP - NLP Courses

Abigail See Minh-Thang Luong Christopher D. Manning … · Abigail See Minh-Thang Luong Christopher D. Manning Computer Science Department, Stanford University, Stanford, CA 94305

Department of Computer Science, Stanford University · Department of Computer Science, Stanford University CS 224n Winter 2018 Motivation •Appealing intersection of NLP and Computer

Nlp-Automata in Nlp

NLP Practitioner 'plus' Infomappe · NLP • NLP Practitioner • NLP Master • NLP Coach. Infomappe NLP Practitioner „plus ...

Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spark by Nicolas Claudon and Yana Ponomarova

NLP Training - NLP Certification

Deep Learning for NLP Part 2 - Stanford University...Deep Learning for NLP Part 2 CS224N Christopher Manning (Many slides borrowed from ACL 2012/NAACL 2013 Tutorials by me, Richard

Introduction to Information Retrieval - The Stanford NLP

Deep Learning for NLP Part 3 - Stanford University...Deep Learning for NLP Part 3 CS224N Christopher Manning (Many slides borrowed from ACL 2012/NAACL 2013 Tutorials by me, Richard

Discriminative Models in NLP - Stanford University

RECURSIVE DEEP LEARNING A DISSERTATION - Stanford NLP …manning/dissertations/Socher... · 2014. 8. 29. · me understand the NLP community. Andrew, thanks to you I found and fell

Carlos Salgado, 20 Jahre NLP - Erfahrung NLP ...nlp-insider.com/.../webinare/nlp-methodenkoffer-oder-einstellungssac… · Menschenbild - NLP • Thesen • Das Menschenbild eines

(Nlp) Nlp Secrets

An Extended Model of Natural Logic Bill MacCartney and Christopher D. Manning NLP Group Stanford University 8 January 2009.

From NLP to MLPcomsec.spb.ru/mmm-acns10/doc/MMM-ACNS2010/Session4/02_Te… · • NLP: • Extract sentences, analyze terms, ﬁnd relations (e.g. Stanford parser) • This beautiful_A

Sparse autoencoder - Stanford NLP Groupsocherr/sparseAutoencoder_2011new.pdf · Sparse autoencoder 1 Introduction ... and natural language processing, ... approach does not scale

NLP Lunch Tutorial: Smoothing - Stanford NLP Groupnlp.stanford.edu/~wcmac/papers/20050421-smoothing...NLP Lunch Tutorial: Smoothing Bill MacCartney 21 April 2005 Preface • Everything

Ph.D. Thesis Oral Defense - Stanford NLP Groupnlp.stanford.edu/valentin/pubs/PhD-slides.pdf · 2014-05-09 · Ph.D. Thesis Oral Defense Stanford Artiﬁcial Intelligence Laboratory

Selective Question Answering under ... - Stanford NLP Group · for current NLP models (Yogatama et al.,2019). Models trained on many domains may still strug-gle to generalize to new