Transcript of "New Developments in Neural Language Modeling" (slides: deeploria.gforge.inria.fr/adjiTalk.pdf)
New Developments in Neural Language Modeling
Adji Bousso Dieng
Joint work with Chong Wang, Jianfeng Gao, and John Paisley
Language modeling applications
Language modeling
• Denote by w_1, ..., w_n a sequence of n words.
• Example: (it, is, sunny, today)
• A language model computes → P(w_1, ..., w_n)
The chain rule of probability tells us that

P(w_1, ..., w_n) = P(w_1) ∏_{i=2}^{n} P(w_i | w_{1:i−1})

Goal → compute these conditional probabilities.
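The chain-rule factorization can be made concrete with a toy sketch. The conditional probabilities below are hard-coded stand-ins for illustration only (a real language model would estimate them):

```python
import math

# Toy chain-rule decomposition:
# P(w1, ..., wn) = P(w1) * prod_{i>=2} P(wi | w1, ..., w_{i-1}).
# Each key is a prefix ending in the word whose conditional probability
# the value gives; the numbers are made up for this example.
cond_probs = {
    ("it",): 0.20,                        # P(it)
    ("it", "is"): 0.50,                   # P(is | it)
    ("it", "is", "sunny"): 0.10,          # P(sunny | it, is)
    ("it", "is", "sunny", "today"): 0.30, # P(today | it, is, sunny)
}

def sequence_log_prob(words):
    """Sum log P(w_i | w_1, ..., w_{i-1}) over the sequence."""
    return sum(math.log(cond_probs[tuple(words[:i + 1])])
               for i in range(len(words)))

p = math.exp(sequence_log_prob(["it", "is", "sunny", "today"]))
# p = 0.20 * 0.50 * 0.10 * 0.30 = 0.003
```

Working in log space, as above, avoids numerical underflow for long sequences.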
N-grams
Unigram: Independence assumption (bag-of-words model)
P(w_1, ..., w_n) = ∏_{i=1}^{n} P(w_i)
N-gram: Markov assumption of order N
P(w_1, ..., w_n) = P(w_1) ∏_{i=2}^{n} P(w_i | w_{i−N+1:i−1})
Learn model with maximum likelihood
Problem → poor generalization, curse of dimensionality.
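A bigram (order-2) instance of the maximum-likelihood estimate can be sketched with simple counts; the two-sentence corpus is a toy assumption:

```python
from collections import Counter

# Toy corpus (assumption, for illustration only).
corpus = ["it is sunny today".split(), "it is rainy today".split()]

# Maximum likelihood for bigrams reduces to counting.
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1])
                  for sent in corpus for i in range(len(sent) - 1))

def p_bigram(w, prev):
    # MLE estimate: P(w | prev) = count(prev, w) / count(prev).
    return bigrams[(prev, w)] / unigrams[prev]
```

Note that any bigram absent from the corpus gets probability exactly zero, one face of the generalization problem mentioned above; smoothing methods exist to mitigate this.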
Feedforward neural networks
Source: Bengio et al. 2003
Problem → assumes a fixed-size context window; every prediction uses the same number of previous words.
Recurrent neural networks
s_t = f(U x_t + W s_{t−1})

s_t = g(x_t, x_{t−1}, ..., x_1, s_0), i.e. g is the composition f(f(f(...)))

o_t = softmax(V s_t)
Problem → vanishing gradients, hidden state has limited capacity.
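The recurrence above can be written out directly in NumPy; the dimensions and random weights are toy assumptions, and f is taken to be tanh as is standard:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, vocab = 8, 16, 50  # toy dimensions (assumptions)
U = rng.normal(scale=0.1, size=(d_h, d_in))   # input-to-hidden weights
W = rng.normal(scale=0.1, size=(d_h, d_h))    # hidden-to-hidden weights
V = rng.normal(scale=0.1, size=(vocab, d_h))  # hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def rnn_step(s_prev, x_t):
    # s_t = f(U x_t + W s_{t-1}) with f = tanh; o_t = softmax(V s_t).
    s_t = np.tanh(U @ x_t + W @ s_prev)
    return s_t, softmax(V @ s_t)

# Unroll over a few random input vectors.
s = np.zeros(d_h)
for _ in range(5):
    s, o = rnn_step(s, rng.normal(size=d_in))
```

Because the same W is multiplied in at every step, gradients through long unrolls shrink (or blow up) geometrically, which is the vanishing-gradient problem noted above.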
Current Challenges and Motivation
“The U.S. presidential race isn’t only drawing attention and controversy in the United States – it’s being closely watched across the globe. But what does the rest of the world think about a campaign that has already thrown up one surprise after another? CNN asked 10 journalists for their take on the race so far, and what their country might be hoping for in America’s next President.”
Intuition → syntax is local, semantics are global.
Interlude
Probabilistic Topic Models
Probabilistic Topic Models
Source: David Blei
Probabilistic Topic Models
Source: David Blei
Probabilistic Topic Models
Source: David Blei
Challenge
How can we combine ideas from probabilistic topic models and recurrent neural network-based language models to capture both local and global dependencies?
Existing Approach
Contextual RNN
Source: Mikolov 2012
s_t = g_1(U x_t + W s_{t−1} + F f(t))

y_t = softmax(V s_t + G f(t))
New Approach: TopicRNN
TopicRNN: A Generative Model
[Figure: unrolled TopicRNN architecture, showing the weight matrices U, W, V and the topic matrix B at each time step]

Model ... unrolled architecture
h_t = g_1(U x_t + W h_{t−1})

y_t = softmax(V h_t + (1 − l_t) B θ), where l_t ∼ Bernoulli(σ(Γ h_t)) and σ is the sigmoid; l_t = 1 marks a stop word, switching off the topic contribution B θ.
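The output side of the generative model can be sketched in NumPy. The sizes, random weights, and Dirichlet draw standing in for the inferred topic vector θ are all toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, vocab, n_topics = 16, 50, 5  # toy sizes (assumptions)
V = rng.normal(scale=0.1, size=(vocab, d_h))       # hidden-to-output weights
B = rng.normal(scale=0.1, size=(vocab, n_topics))  # topic matrix
Gamma = rng.normal(scale=0.1, size=d_h)            # stop-word indicator weights
theta = rng.dirichlet(np.ones(n_topics))           # stand-in for inferred topics

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def topicrnn_output(h_t):
    # l_t = 1 marks a stop word: the topic bias B @ theta is switched off,
    # so stop words are explained by the RNN state alone.
    l_t = rng.random() < sigmoid(Gamma @ h_t)
    logits = V @ h_t + (0 if l_t else 1) * (B @ theta)
    return softmax(logits), l_t

probs, l_t = topicrnn_output(rng.normal(size=d_h))
```

The design choice here is the point of the model: the global topic vector θ only biases the word distribution for content words, leaving syntax-carrying stop words to the recurrent state.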
TopicRNN: A Generative Model

[Figure: X_c is the bag-of-words representation of the document, stop words excluded; X is the full document, stop words included; Y is the target document; the RNN uses the weights U, V, W and the topic matrix B.]

Inference ... end-to-end architecture

q(θ|X_c) is the recognition network (an MLP) that embeds the bag-of-words representation of the document.
Maximum Likelihood
Ideally, maximize the log marginal likelihood of the observed sequence y_{1:T}, l_{1:T}:

log p(y_{1:T}, l_{1:T} | h_{1:T}) = log ∫ p(θ) ∏_{t=1}^{T} p(y_t | h_t, l_t, θ) p(l_t | h_t) dθ.

Problem → this integral is intractable.
Variational Objectives: ELBO
ELBO(Θ) = E_{q(θ|X_c)} [ ∑_{t=1}^{T} log p(y_t, l_t | h_t, θ) ] − KL( q(θ|X_c) || p(θ) )

ELBO(Θ) ≤ log p(y_{1:T}, l_{1:T} | h_{1:T}, Θ)

Maximize this lower bound on the log marginal likelihood.

Learning: end-to-end via backpropagation, using the ELBO as objective and Adam as optimizer.
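The structure of a Monte Carlo ELBO estimate can be shown on a deliberately simplified model. Here q(θ) is a univariate Gaussian, the prior is N(0, 1), and the likelihood is a stand-in, not TopicRNN's; the reparameterization trick (θ = μ + σε) is what makes the estimate differentiable in the variational parameters:

```python
import numpy as np

rng = np.random.default_rng(2)

# Variational parameters and toy observations (assumptions).
mu, log_sigma = 0.5, -1.0
y = np.array([0.4, 0.6, 0.5])

def log_lik(theta):
    # Stand-in likelihood: y_t ~ N(theta, 1), playing the role of
    # sum_t log p(y_t, l_t | h_t, theta) in the real objective.
    return -0.5 * np.sum((y - theta) ** 2) - 0.5 * len(y) * np.log(2 * np.pi)

def elbo_estimate():
    sigma = np.exp(log_sigma)
    eps = rng.normal()
    theta = mu + sigma * eps  # reparameterization trick: theta ~ q
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ).
    kl = 0.5 * (sigma**2 + mu**2 - 1.0) - log_sigma
    return log_lik(theta) - kl

estimates = [elbo_estimate() for _ in range(100)]
```

Averaging many such single-sample estimates approximates the ELBO; in practice one sample per minibatch is usually enough for Adam to make progress.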
Empirical evidence
Promising Results on Word Prediction
| 10 Neurons | Valid | Test |
|---|---|---|
| RNN (no features) | 239.2 | 225.0 |
| RNN (LDA features) | 197.3 | 187.4 |
| TopicRNN | 184.5 | 172.2 |
| TopicLSTM | 188.0 | 175.0 |
| TopicGRU | 178.3 | 166.7 |

| 100 Neurons | Valid | Test |
|---|---|---|
| RNN (no features) | 150.1 | 142.1 |
| RNN (LDA features) | 132.3 | 126.4 |
| TopicRNN | 128.5 | 122.3 |
| TopicLSTM | 126.0 | 118.1 |
| TopicGRU | 118.3 | 112.4 |

| 300 Neurons | Valid | Test |
|---|---|---|
| RNN (no features) | – | 124.7 |
| RNN (LDA features) | – | 113.7 |
| TopicRNN | 118.3 | 112.2 |
| TopicLSTM | 104.1 | 99.5 |
| TopicGRU | 99.6 | 97.3 |
Perplexity scores on PTB for different network sizes and models.
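For reference, perplexity is the exponential of the average per-word negative log-likelihood, so a model that assigned every word probability 1/k would score perplexity k. A minimal sketch:

```python
import math

def perplexity(word_probs):
    """Perplexity = exp( -(1/T) * sum_t log p(w_t | context) )."""
    nll = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(nll)

# A model assigning every word probability 0.01 has perplexity 100.
pp = perplexity([0.01] * 10)
```

Lower is better: the numbers in the tables above say how many equally likely words the model is effectively choosing among at each step.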
Inferred Topics
| Law | Company | Parties | Trading | Cars |
|---|---|---|---|---|
| law | spending | democratic | stock | gm |
| lawyers | sales | republicans | sp | auto |
| judge | advertising | gop | price | ford |
| rights | employees | republican | investor | jaguar |
| attorney | state | senate | standard | car |
| court | taxes | oakland | chairman | cars |
| general | fiscal | highway | investors | headquarters |
| common | appropriation | democrats | retirement | british |
| mr | budget | bill | holders | executives |
| insurance | ad | district | merrill | model |
Table: Five Topics from the TopicRNN Model
Inferred Document Distributions
[Figure: three bar plots of inferred topic distributions from TopicGRU, each over 50 topics.]

Figure: Inferred distributions using TopicGRU on three different documents. The content of these documents is given in the appendix. This shows that different topics are picked up depending on the input document.
Unsupervised Feature Extraction
Clustered learned features from IMDB 100K movie reviews.
Promising Results on Sentiment Classification
| Model | Reported Classification Error Rate |
|---|---|
| BoW (bnc) (Maas et al., 2011) | 12.20% |
| BoW (b∆tc) (Maas et al., 2011) | 11.77% |
| LDA (Maas et al., 2011) | 32.58% |
| Full + BoW (Maas et al., 2011) | 11.67% |
| Full + Unlabelled + BoW (Maas et al., 2011) | 11.11% |
| WRRBM (Dahl et al., 2012) | 12.58% |
| WRRBM + BoW (bnc) (Dahl et al., 2012) | 10.77% |
| MNB-uni (Wang & Manning, 2012) | 16.45% |
| MNB-bi (Wang & Manning, 2012) | 13.41% |
| SVM-uni (Wang & Manning, 2012) | 13.05% |
| SVM-bi (Wang & Manning, 2012) | 10.84% |
| NBSVM-uni (Wang & Manning, 2012) | 11.71% |
| seq2-bown-CNN (Johnson & Zhang, 2014) | 14.70% |
| NBSVM-bi (Wang & Manning, 2012) | 8.78% |
| Paragraph Vector (Le & Mikolov, 2014) | 7.42% |
| SA-LSTM with joint training (Dai & Le, 2015) | 14.70% |
| LSTM with tuning and dropout (Dai & Le, 2015) | 13.50% |
| LSTM initialized with word2vec embeddings (Dai & Le, 2015) | 10.00% |
| SA-LSTM with linear gain (Dai & Le, 2015) | 9.17% |
| LM-TM (Dai & Le, 2015) | 7.64% |
| SA-LSTM (Dai & Le, 2015) | 7.24% |
| TopicRNN | 6.28% |
Remaining challenges
1. Learning to encode and decode rare words well.
2. Better capturing of long-term dependencies.