Sequence to sequence (encoder-decoder) learning
Seq2seq...and beyond
Hello! I am Roberto Silveira
EE engineer, ML enthusiast
@rsilveira79
Sequence: is a matter of time
RNN: is what you need!
Basic Recurrent cells (RNN)
Source: http://colah.github.io/
Issues
× Difficulty dealing with long-term dependencies
× Difficult to train: vanishing-gradient issues
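The vanishing-gradient issue can be seen in a one-line sketch (my own toy example, not from the slides): in a scalar RNN `h_t = tanh(w·h_{t-1} + x_t)`, the gradient of the last state with respect to the first is a product of per-step factors `w·(1 - h_t²)`, which shrinks geometrically when `|w| < 1`:

```python
import numpy as np

# Toy scalar RNN: h_t = tanh(w * h_{t-1} + x). The gradient of h_T w.r.t.
# h_0 is a product of T local derivatives w * (1 - h_t**2); each factor is
# at most |w| (tanh' <= 1), so the product vanishes as T grows.
def grad_through_time(w, steps, x=0.1):
    h, grad = 0.0, 1.0
    for _ in range(steps):
        h = np.tanh(w * h + x)
        grad *= w * (1.0 - h ** 2)   # chain rule through one time step
    return grad

print(grad_through_time(0.5, 5))    # still a noticeable gradient
print(grad_through_time(0.5, 50))   # vanishingly small after 50 steps
```

This is why plain RNNs struggle with dependencies like the "Jane said hi to ___" examples on the next slide: the signal from the distant word has all but disappeared by the time it is needed.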
Long term issues
Source: http://colah.github.io/, CS224d notes
Sentence 1"Jane walked into the room. John walked in too. Jane said hi to ___"
Sentence 2"Jane walked into the room. John walked in too. It was late in the day, and everyone was walking home after a long day at work. Jane said hi to ___"
LSTM in 2 min...
Review
× Addresses long-term dependencies
× More complex to train
× Very powerful with lots of data
Source: http://colah.github.io/
Cell state
Forget gate
Input gate
Output gate
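The cell state and three gates above can be sketched as a single LSTM update step in NumPy (a minimal illustration; the stacked weight layout and names are my own, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM step for hidden size H and input size D. W stacks the weights
# of all four transforms (forget, input, output, candidate) into one matrix.
def lstm_step(x, h_prev, c_prev, W, b):
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[0:H])            # forget gate: what to drop from c_prev
    i = sigmoid(z[H:2 * H])        # input gate: what new info to admit
    o = sigmoid(z[2 * H:3 * H])    # output gate: what to expose as h
    g = np.tanh(z[3 * H:4 * H])    # candidate cell values
    c = f * c_prev + i * g         # new cell state
    h = o * np.tanh(c)             # new hidden state
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4
W = rng.standard_normal((4 * H, D + H)) * 0.1
b = np.zeros(4 * H)
h, c = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, b)
```

The key design point the slides emphasize: the cell state `c` flows through time with only elementwise gating, which is what eases the long-term-dependency problem of the basic RNN.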
Gated recurrent unit (GRU) in 2 min ...
Review
× Fewer hyperparameters
× Trains faster
× Better results with less data
Source: http://www.wildml.com/, arXiv:1412.3555
Reset gate
Update gate
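The reset and update gates above can likewise be sketched as one GRU step (again a minimal illustration with my own weight names, not code from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One GRU step: two gates instead of three, and no separate cell state.
def gru_step(x, h_prev, Wz, Wr, Wh):
    xh = np.concatenate([x, h_prev])
    z = sigmoid(Wz @ xh)                     # update gate: how much to renew
    r = sigmoid(Wr @ xh)                     # reset gate: how much past to use
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h_prev]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde    # interpolate old and new state

rng = np.random.default_rng(1)
D, H = 3, 4
Wz, Wr, Wh = (rng.standard_normal((H, D + H)) * 0.1 for _ in range(3))
h = gru_step(rng.standard_normal(D), np.zeros(H), Wz, Wr, Wh)
```

Compared with the LSTM there is no output gate and no separate cell state, which is where the "fewer hyperparameters, trains faster" review points come from.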
Seq2seq learning
Or encoder-decoder architectures
Basic idea: "variable"-size input (encoder) → fixed-size vector representation → "variable"-size output (decoder)
[Diagram: the first RNN (encoder) is a stateful model that reads the input one word at a time ("Machine", "Learning", "is", "fun") into a fixed-size encoded sequence (e.g. 0.636, 0.122, 0.981); the second RNN (decoder), also stateful, emits the output one word at a time ("Aprendizado", "de", "Máquina", "é", "divertido"), the memory of each previous word influencing the next result.]
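The encode/decode loop in the diagram can be sketched as plain Python. The "cell" below is a deterministic stand-in for a real recurrent cell, and the vocabulary lookup stands in for an argmax over a softmax; everything here is illustrative, not the talk's actual model:

```python
# Toy stand-in for a recurrent cell: mixes the running state with the token.
def toy_cell(state, token):
    return (state * 31 + sum(map(ord, token))) % 10 ** 9

def encode(tokens):
    state = 0
    for tok in tokens:              # one word at a time, statefully
        state = toy_cell(state, tok)
    return state                    # fixed-size "encoded sequence"

def decode(state, vocab, max_len):
    out = []
    for _ in range(max_len):        # one word at a time, statefully
        state = toy_cell(state, out[-1] if out else "<GO>")
        word = vocab[state % len(vocab)]   # stand-in for argmax over softmax
        if word == "<EOS>":
            break
        out.append(word)            # previous word influences the next
    return out

vocab = ["Aprendizado", "de", "Máquina", "é", "divertido", "<EOS>"]
out = decode(encode(["Machine", "Learning", "is", "fun"]), vocab, max_len=5)
```

The structure is the point: the decoder sees only the fixed-size state plus its own previous output, which is exactly the bottleneck the attention slides later remove.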
Sequence to Sequence Learning with Neural Networks (2014)
[Diagram: 1000-d word embeddings feed a 4-layer LSTM encoder (1000 cells/layer) that compresses "Machine Learning is fun" into a fixed encoded sequence; a 4-layer LSTM decoder (1000 cells/layer) then generates "Aprendizado de Máquina é divertido".]
Source: arXiv 1409.3215v3
TRAINING → SGD w/o momentum, fixed learning rate of 0.7, 7.5 epochs, batches of 128 sentences, 10 days of training (WMT'14 English-to-French dataset)
Recurrent encoder-decoders
[Diagram: the encoder reads the source sequence "Les chiens aiment les os <EOS>"; the decoder then emits the target sequence "Dogs love bones <EOS>", each generated word being fed back as the next decoder input.]
Source: arXiv 1409.3215v3
Recurrent encoder-decoders - issues
● Difficult to cope with long sentences (longer than those in the training corpus)
● Decoder w/ attention mechanism → relieves the encoder from squashing everything into a fixed-length vector
NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE (2015)
Source: arXiv 1409.0473v7
[Diagram: the decoder builds a separate context vector for each target word by weighting each annotation h_j.]
Non-monotonic alignment
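In the paper's notation, each target word $y_i$ gets its own context vector $c_i$, a weighted sum of the encoder annotations $h_j$:

```latex
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j,
\qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})},
\qquad
e_{ij} = a(s_{i-1}, h_j)
```

where $a$ is a small feedforward alignment model scoring how well the input around position $j$ matches the output at position $i$, and $s_{i-1}$ is the previous decoder state. Because the weights $\alpha_{ij}$ are computed freely per position, the alignment need not be monotonic.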
Attention models for NLP
Source: arXiv 1409.0473v7
[Diagram: at each decoding step, attention weights over the encoder states of "Les chiens aiment les os <EOS>" are summed (+) into a context vector; combined with the previously generated word, it produces the next output word: "Dogs", then "love", then "bones".]
Challenges in using the model
● Cannot handle truly variable-size input → PADDING
● Hard to deal with both short and long sentences → BUCKETING
● WORD EMBEDDINGS capture context / semantic meaning
Source: http://suriyadeepan.github.io/
padding
Source: http://suriyadeepan.github.io/
EOS: end of sentence
PAD: filler
GO: start decoding
UNK: unknown; word not in vocabulary
Q: "What time is it?"  A: "It is seven thirty."
Q: [ PAD, PAD, PAD, PAD, PAD, "?", "it", "is", "time", "What" ]
A: [ GO, "It", "is", "seven", "thirty", ".", EOS, PAD, PAD, PAD ]
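The padding scheme above can be sketched in a few lines (a minimal illustration assuming a bucket size of 10 on both sides; note the encoder input is reversed and front-padded, as in the TensorFlow seq2seq tutorial the slide follows):

```python
PAD, GO, EOS = "PAD", "GO", "EOS"

def pad_pair(question, answer, size=10):
    # Encoder side: reverse the tokens and pad at the front.
    enc = [PAD] * (size - len(question)) + list(reversed(question))
    # Decoder side: GO starts decoding, EOS ends it, PAD fills the rest.
    dec = [GO] + answer + [EOS]
    dec += [PAD] * (size - len(dec))
    return enc, dec

q = ["What", "time", "is", "it", "?"]
a = ["It", "is", "seven", "thirty", "."]
enc, dec = pad_pair(q, a)
# enc == [PAD, PAD, PAD, PAD, PAD, "?", "it", "is", "time", "What"]
# dec == [GO, "It", "is", "seven", "thirty", ".", EOS, PAD, PAD, PAD]
```

Both sequences now have the same fixed length, which is what lets batches of sentences go through the static graph.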
Source: https://www.tensorflow.org/
bucketing
Efficiently handle sentences of different lengths
Ex: 100 tokens is the largest sentence in corpus
How about short sentences like "How are you?" → lots of PAD
Bucket list: [(5, 10), (10, 15), (20, 25), (40, 50)] (default in TensorFlow's translate.py)
Q: [ PAD, PAD, ".", "go", "I" ]  A: [ GO, "Je", "vais", ".", EOS, PAD, PAD, PAD, PAD, PAD ]
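Bucket selection itself is simple: each (source, target) pair goes into the smallest bucket it fits, so short sentences avoid being padded to the corpus maximum. A sketch using the default bucket list above (the function name is my own):

```python
# Default buckets from TensorFlow's translate.py: (source_len, target_len).
BUCKETS = [(5, 10), (10, 15), (20, 25), (40, 50)]

def pick_bucket(src_len, tgt_len, buckets=BUCKETS):
    # Return the index of the smallest bucket that fits both lengths.
    for i, (s, t) in enumerate(buckets):
        if src_len <= s and tgt_len <= t:
            return i
    return None  # longer than the largest bucket: typically skipped

print(pick_bucket(3, 5))   # "I go ." / "Je vais ." fits the (5, 10) bucket
```

One padded graph is then built per bucket, trading a little padding inside each bucket for far less padding overall.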
Word embeddings (remember the previous presentation ;-)
Distributed representations → syntactic and semantic information is captured
"Take" = [0.286, 0.792, -0.177, -0.107, 0.109, -0.542, 0.349, 0.271]
Linguistic regularities (recap)
Phrase representations (Paper: Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation)
Source: arXiv 1406.1078v3
1000d vector representation
Applications
Neural conversational model - chatbots
Source: arXiv 1506.05869v3
Google Smart reply
Source: arXiv 1606.04870v1
Interesting facts
● Currently responsible for 10% of Inbox replies
● Training set: 238 million messages
Google Smart reply
Source: arXiv 1606.04870v1
● Seq2Seq model
● Feedforward triggering model
● Semi-supervised semantic clustering
Image captioning (Paper: Show and Tell: A Neural Image Caption Generator)
Source: arXiv 1411.4555v2
[Diagram: the encoder is a vision CNN over the image; the decoder is an LSTM that generates the caption.]
Source: arXiv 1411.4555v2
What's next?
And so?
Multi-task sequence to sequence (Paper: Multi-Task Sequence to Sequence Learning)
Source: arXiv 1511.06114v4
● One-to-many (common encoder)
● Many-to-one (common decoder)
● Many-to-many
Neural programmer (Paper: Neural Programmer: Inducing Latent Programs with Gradient Descent)
Source: arXiv 1511.04834v3
Unsupervised pre-training for seq2seq - 2017 (Paper: Unsupervised Pretraining for Sequence to Sequence Learning)
Source: arXiv 1611.02683v1
[Diagram: both the encoder and the decoder start from pre-trained weights before seq2seq training.]
@rsilveira79
A quick example in TensorFlow