Generating Text with Recurrent Neural Networks


Transcript of Generating Text with Recurrent Neural Networks

Page 1: Generating Text with Recurrent Neural Networks

GENERATING TEXT WITH RECURRENT NEURAL NETWORKS
Ilya Sutskever, James Martens, and Geoffrey Hinton, ICML 2011

2013-4-1, Institute of Electronics, NCTU
Advisor: 王聖智 (S. J. Wang)
Student: 陳冠廷 (K. T. Chen)

Page 2: Generating Text with Recurrent Neural Networks

Outline
• Introduction
  • Motivation
  • What is an RNN?
  • Why do we choose an RNN to solve this problem?
  • How to train an RNN?
• Contribution
  • Character-level language modeling
  • The multiplicative RNN
• The Experiments
• Discussion

Page 3: Generating Text with Recurrent Neural Networks

Outline
• Introduction
  • Motivation
  • What is an RNN?
  • Why do we choose an RNN to solve this problem?
  • How to train an RNN?
• Contribution
  • Character-level language modeling
  • The multiplicative RNN
• The Experiments
• Discussion

Page 4: Generating Text with Recurrent Neural Networks

Motivation
• Read some sentences and then try to predict the next character.

"Easter is a Christian festival and holiday celebrating the resurrection of Jesus Christ ?"  (what character comes next?)

"Easter is a Christian festival and holiday celebrating the resurrection of Jesus Christ on the third day after his crucifixion at Calvary as described in the New Testament."

Page 5: Generating Text with Recurrent Neural Networks

Recurrent neural networks

• A recurrent neural network (RNN) is a class of neural network where connections between units form a directed cycle.

[Figure: a feed-forward neural network (input → hidden → output layers) compared with a recurrent neural network, whose hidden layer also feeds back into itself.]

Page 6: Generating Text with Recurrent Neural Networks

Why do we choose RNNs?
• RNNs are well suited to sequential data (they have memory).
• An RNN is a neural network unrolled in time.

[Figure: the RNN unrolled over time steps t-1, t, t+1 — at each step the inputs feed the hiddens through $W_{hv}$, the hiddens feed the next hiddens through $W_{hh}$, and the hiddens produce the predictions through $W_{ho}$; the same weights are shared across all time steps.]

Page 7: Generating Text with Recurrent Neural Networks

How to train an RNN?
• Backpropagation through time (BPTT).
• The gradient is easy to compute with backpropagation.
• RNNs learn by minimizing the training error. (A minimal sketch follows the figure below.)

[Figure: the RNN unrolled over time steps t-1, t, t+1, with shared weights $W_{hv}$, $W_{hh}$, $W_{ho}$; the gradient is propagated backwards through this unrolled network of inputs, hiddens, and predictions.]
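The deck shows no code, but the BPTT computation on this slide can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation (the paper trains with the Hessian-Free optimizer discussed later); the sizes and variable names here are made up for the example.

```python
# Minimal BPTT sketch for a character-level RNN (illustrative only).
import numpy as np

V, H = 86, 100                      # vocabulary size, hidden units (sizes are illustrative)
rng = np.random.default_rng(0)
W_hv = rng.normal(0, 0.01, (H, V))  # input-to-hidden
W_hh = rng.normal(0, 0.01, (H, H))  # hidden-to-hidden
W_ho = rng.normal(0, 0.01, (V, H))  # hidden-to-output
b_h, b_o = np.zeros(H), np.zeros(V)

def bptt(inputs, targets, h_prev):
    """inputs/targets: lists of character indices; returns loss, gradients, last hidden state."""
    xs, hs, ps = {}, {-1: h_prev}, {}
    loss = 0.0
    # Forward pass: unroll the RNN over the whole subsequence.
    for t, (i, j) in enumerate(zip(inputs, targets)):
        xs[t] = np.zeros(V); xs[t][i] = 1.0                     # 1-of-86 encoding
        hs[t] = np.tanh(W_hv @ xs[t] + W_hh @ hs[t - 1] + b_h)  # hidden state
        o = W_ho @ hs[t] + b_o
        ps[t] = np.exp(o - o.max()); ps[t] /= ps[t].sum()       # softmax over the next character
        loss += -np.log(ps[t][j])                               # cross-entropy
    # Backward pass: accumulate the gradient through time.
    dW_hv, dW_hh, dW_ho = np.zeros_like(W_hv), np.zeros_like(W_hh), np.zeros_like(W_ho)
    db_h, db_o = np.zeros_like(b_h), np.zeros_like(b_o)
    dh_next = np.zeros(H)
    for t in reversed(range(len(inputs))):
        do = ps[t].copy(); do[targets[t]] -= 1.0                # d loss / d output
        dW_ho += np.outer(do, hs[t]); db_o += do
        dh = W_ho.T @ do + dh_next                              # backprop into h_t
        dz = (1.0 - hs[t] ** 2) * dh                            # through tanh
        dW_hv += np.outer(dz, xs[t]); dW_hh += np.outer(dz, hs[t - 1]); db_h += dz
        dh_next = W_hh.T @ dz                                   # pass the gradient to h_{t-1}
    return loss, (dW_hv, dW_hh, dW_ho, db_h, db_o), hs[len(inputs) - 1]
```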

Page 8: Generating Text with Recurrent Neural Networks

RNNs are hard to train
• They can be volatile and can exhibit long-range sensitivity to small parameter perturbations ("the Butterfly Effect").
• The "vanishing gradient problem" makes gradient descent ineffective.

[Figure: the unrolled RNN (inputs, hiddens, outputs over time) through which the gradient must be propagated.]

Page 9: Generating Text with Recurrent Neural Networks

How to overcome the vanishing gradient?
• Long Short-Term Memory (LSTM): modify the architecture of the neural network.
  [Figure: an LSTM memory cell — data is written, kept, and read through gates.]
• Hessian-Free optimizer (James Martens et al., 2011): based on Newton's method plus the conjugate gradient algorithm.
• Echo State Network: only the hidden-to-output weights are learned.

Page 10: Generating Text with Recurrent Neural Networks

Outline
• Introduction
  • Motivation
  • What is an RNN?
  • Why do we choose an RNN to solve this problem?
  • How to train an RNN?
• Contribution
  • Character-level language modeling
  • The multiplicative RNN
• The Experiments
• Discussion

Page 11: Generating Text with Recurrent Neural Networks

Character-Level language modeling

• The RNN observes a sequence of characters.
• The target output at each time step is defined as the input character at the next time step. (A small data-preparation sketch follows the figure below.)

[Figure: feeding "Hello" to the RNN one character at a time — inputs H, e, l, l, o; at each step the target is the next character (e, l, l, o), and the hidden state ("H", "He", "Hel", "Hell", "Hello") stores the relevant information seen so far.]
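As an illustration of this target definition (a toy example, not from the paper), the (input, target) pairs for the string "Hello" can be built like this:

```python
# Build next-character prediction pairs for a tiny example string.
text = "Hello"
chars = sorted(set(text))                       # toy vocabulary; the paper uses 86 characters
char_to_ix = {c: i for i, c in enumerate(chars)}

inputs  = [char_to_ix[c] for c in text[:-1]]    # H, e, l, l
targets = [char_to_ix[c] for c in text[1:]]     # e, l, l, o  (the input shifted by one step)

for i, j in zip(inputs, targets):
    print(f"input {chars[i]!r} -> target {chars[j]!r}")
```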

Page 12: Generating Text with Recurrent Neural Networks

The standard RNN

• The current input is transformed via the visible-to-hidden weight matrix, and then contributes additively to the input for the current hidden state.

[Figure: the standard RNN — the input character (1-of-86 encoding) feeds the hidden layer through $W_{hx}$, the previous hidden state feeds it through $W_{hh}$, and a softmax on the output predicts the distribution for the next character.]

$h_t = \tanh(W_{hx} x_t + W_{hh} h_{t-1} + b_h)$
$o_t = W_{ho} h_t + b_o$
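A single time step of this update, with the softmax that turns $o_t$ into a distribution over the next character, might look like the following sketch (illustrative numpy, not the authors' code):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hx, W_hh, W_ho, b_h, b_o):
    """One step of the standard RNN: new hidden state and a distribution over the next character."""
    h_t = np.tanh(W_hx @ x_t + W_hh @ h_prev + b_h)   # h_t = tanh(W_hx x_t + W_hh h_{t-1} + b_h)
    o_t = W_ho @ h_t + b_o                            # o_t = W_ho h_t + b_o
    p_t = np.exp(o_t - o_t.max())
    p_t /= p_t.sum()                                  # softmax: distribution over the 86 characters
    return h_t, p_t
```

To generate text, one would sample the next character from `p_t` and feed it back in as the next input.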

Page 13: Generating Text with Recurrent Neural Networks

Some motivation from modeling a tree

• Each node is a hidden state vector. The next character must transform this to a new node.

• The next hidden state needs to depend on the conjunction of the current character and the current hidden representation.

[Figure: a tree of strings — from the node "..fix", the character 'i' leads to "..fixi" and 'e' leads to "..fixe"; from "..fixi", the character 'n' leads to "..fixin".]

Page 14: Generating Text with Recurrent Neural Networks

The Multiplicative RNN
• They tried several neural network architectures and found the "Multiplicative RNN" (MRNN) to be more effective than the regular RNN.
• The hidden-to-hidden weight matrix is chosen by the current input character:

$h_t = \tanh(W_{hx} x_t + W_{hh}^{(x_t)} h_{t-1} + b_h)$
$o_t = W_{ho} h_t + b_o$
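Conceptually (an illustrative sketch, not the authors' code), the naïve MRNN step just indexes a stack of hidden-to-hidden matrices by the current character:

```python
import numpy as np

def mrnn_step_naive(char_ix, x_t, h_prev, W_hx, W_hh_per_char, W_ho, b_h, b_o):
    """Naïve multiplicative RNN step: a separate W_hh matrix for each of the 86 characters."""
    W_hh = W_hh_per_char[char_ix]                     # choose the transition matrix by the input character
    h_t = np.tanh(W_hx @ x_t + W_hh @ h_prev + b_h)
    o_t = W_ho @ h_t + b_o
    p_t = np.exp(o_t - o_t.max()); p_t /= p_t.sum()   # softmax over the next character
    return h_t, p_t
```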

Page 15: Generating Text with Recurrent Neural Networks

The Multiplicative RNN
• Naïve implementation: assign a matrix to each character.
  • This requires a lot of parameters (86 × 1500 × 1500).
  • This could make the net overfit.
  • Difficult to parallelize on a GPU.
• Factorize the matrices of each character.
  • Fewer parameters.
  • Easier to parallelize.
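To make the parameter saving concrete, here is the arithmetic (mine, not the paper's), assuming the 1500 hidden units and 1500 factors quoted on the later architecture slide:

```python
chars, hidden, factors = 86, 1500, 1500

naive = chars * hidden * hidden                                     # one 1500x1500 matrix per character
factored = hidden * factors + factors * hidden + chars * factors    # hidden-to-factor, factor-to-hidden, per-character gates
print(f"naive:    {naive:,} parameters")     # 193,500,000
print(f"factored: {factored:,} parameters")  # 4,629,000
```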

Page 16: Generating Text with Recurrent Neural Networks

The Multiplicative RNN
• We can get groups a and b to interact multiplicatively by using "factors".

[Figure: group a and group b connect to group c through a layer of factors f, with weight vectors $u_f$ (from a), $w_f$ (from b), and $v_f$ (to c).]

$c_f = (b^T w_f)(a^T u_f)\, v_f$
$c_f = (b^T w_f)(v_f u_f^T)\, a$, where $(b^T w_f)$ is a scalar coefficient and $(v_f u_f^T)$ is an outer-product transition matrix with rank 1.
$c = \Big( \sum_f (b^T w_f)(v_f u_f^T) \Big) a$
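In matrix form, stacking the $u_f$, $v_f$, $w_f$ vectors as columns of matrices U, V, W, the whole sum collapses to one elementwise product between two projections. A small numpy sketch (sizes and names are illustrative; here a and b are just generic group vectors as on the slide):

```python
import numpy as np

H, F = 1500, 1500                     # hidden units and factors (as in the deck)
rng = np.random.default_rng(0)
U = rng.normal(0, 0.01, (H, F))       # columns u_f: project group a onto the factors
V = rng.normal(0, 0.01, (H, F))       # columns v_f: project the factors back to group c
W = rng.normal(0, 0.01, (H, F))       # columns w_f: gate each factor by group b

a = rng.normal(size=H)                # e.g. the previous hidden state
b = rng.normal(size=H)                # e.g. the current character's contribution

# c = (sum_f (b^T w_f)(v_f u_f^T)) a  ==  V @ ((W^T b) * (U^T a))
c = V @ ((W.T @ b) * (U.T @ a))
```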

Page 17: Generating Text with Recurrent Neural Networks

The Multiplicative RNN

[Figure: the MRNN at one time step — the input character (1-of-86) and the previous 1500 hidden units both feed a layer of factors f through $w_{f}$ and $u_f$; each factor f defines a rank-one matrix, so $c = \big(\sum_f (b^T w_f)(v_f u_f^T)\big) a$, and the result drives the next 1500 hidden units through $v_f$, which then predict the distribution for the next character.]

Page 18: Generating Text with Recurrent Neural Networks

The Multiplicative RNN

[Figure: the MRNN unrolled over time steps t-1, t, t+1, t+2 — at each step the input character enters through $W_{hv}$ and selects the factored transition through $W_{vf}$ and $W_{hf}$, and the hidden state produces the output through $W_{ho}$.]

Page 19: Generating Text with Recurrent Neural Networks

The Multiplicative RNN: key advantages
• The MRNN combines the conjunction of contexts and characters more easily: after the context "fix" it predicts "i", "e", or "_"; after seeing "i" (context "fixi") it predicts "n".
• The MRNN has two nonlinearities per time step, which makes its dynamics even richer and more powerful.

Page 20: Generating Text with Recurrent Neural Networks

Outline
• Introduction
  • Motivation
  • What is an RNN?
  • Why do we choose an RNN to solve this problem?
  • How to train an RNN?
• Contribution
  • Character-level language modeling
  • The multiplicative RNN
• The Experiments
• Discussion

Page 21: Generating Text with Recurrent Neural Networks

The Experiments
• Training on three large datasets:
  • ~1 GB of the English Wikipedia
  • ~1 GB of articles from the New York Times
  • ~100 MB of JMLR and AISTATS papers
• Compare with the Sequence Memoizer (Wood et al.) and PAQ (Mahoney et al.)

Page 22: Generating Text with Recurrent Neural Networks

Training on subsequences
• The training data is an extremely long string of text (millions of characters long).
• It is cut into overlapping subsequences of length 250, each starting one character after the previous one (sketched below):
  "This is an extre…", "his is an extrem…", "is is an extreme…", "s is an extremel…", …
• Compute the gradient and the curvature on a subset of the subsequences, and use a different subset at each iteration.
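A toy version of this slicing and per-iteration subsampling (illustrative only; the paper's actual data pipeline and the HF optimizer are not shown here):

```python
import numpy as np

def make_subsequences(text, length=250):
    """All overlapping subsequences of the given length, each shifted by one character."""
    return [text[i:i + length] for i in range(len(text) - length)]

def minibatch(subseqs, size, rng):
    """Pick a different random subset of subsequences at each iteration."""
    idx = rng.choice(len(subseqs), size=size, replace=False)
    return [subseqs[i] for i in idx]

text = "This is an extremely long string of text " * 1000   # stand-in corpus
subseqs = make_subsequences(text, length=250)
rng = np.random.default_rng(0)
batch = minibatch(subseqs, size=32, rng=rng)   # the gradient and curvature would be computed on this subset
```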

Page 23: Generating Text with Recurrent Neural Networks

Parallelization
• Use the HF optimizer to evaluate the gradient and curvature on large minibatches of data (a toy data-parallel sketch follows the figure).

[Figure: the minibatch of data is split across 8 GPUs; each GPU computes its share, and the results are summed into the full gradient and curvature.]
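The data-parallel pattern in the figure, reduced to a toy single-process sketch (no real GPUs here; `grad_on_shard` is a hypothetical stand-in for the per-device gradient or curvature computation):

```python
import numpy as np

def grad_on_shard(params, shard):
    """Hypothetical placeholder: would run the model on one shard and return its gradient contribution."""
    return np.zeros_like(params)   # stand-in value

def parallel_gradient(params, minibatch, n_devices=8):
    """Split the minibatch across devices and sum the per-shard gradients (likewise for curvature)."""
    shards = np.array_split(minibatch, n_devices)
    return sum(grad_on_shard(params, shard) for shard in shards)
```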

Page 24: Generating Text with Recurrent Neural Networks

The architecture of the model
• Use 1500 hidden units and 1500 multiplicative factors on sequences of length 250.
• Arguably the largest and deepest neural network ever trained.

[Figure: the model unrolled over the 250-character sequence — rows of inputs, hidden layers of 1500 units, and predictions, making a very deep unrolled network.]

Page 25: Generating Text with Recurrent Neural Networks

Demo
• The MRNN extracts "higher level information", stores it for many timesteps, and uses it to make a prediction.
• Parentheses sensitivity — samples generated by the model:
  • (Wlching et al. 2005) the latter has received numerical testimony without much deeply grow
  • (Wlching, Wulungching, Alching, Blching, Clching et al." 2076) and Jill Abbas, The Scriptures reported that Achsia and employed a
  • the sequence memoizer (Wood et al McWhitt), now called "The Fair Savings.'"" interpreted a critic. In t
  • Wlching ethics, like virtual locations. The signature tolerator is necessary to en
  • Wlching et al., or Australia Christi and an undergraduate work in over knowledge, inc
  • They often talk about examples as of January 19, . The "Hall Your Way" (NRF film) and OSCIP
  • Her image was fixed through an archipelago's go after Carol^^'s first century, but simply to

Page 26: Generating Text with Recurrent Neural Networks

Outline
• Introduction
  • Motivation
  • What is an RNN?
  • Why do we choose an RNN to solve this problem?
  • How to train an RNN?
• Contribution
  • Character-level language modeling
  • The multiplicative RNN
• The Experiments
• Discussion

Page 27: Generating Text with Recurrent Neural Networks

Discussion
• The text generated by the MRNN contains very few non-words (e.g., "cryptoliation", "homosomalist"). This suggests the MRNN can deal with real words that it did not see in the training set.
• With more computational power, they could train much bigger MRNNs with millions of units and billions of connections.

Page 28: Generating Text with Recurrent Neural Networks

References
• Generating Text with Recurrent Neural Networks, Ilya Sutskever, James Martens, and Geoffrey Hinton, ICML 2011
• Factored Conditional Restricted Boltzmann Machines for Modeling Motion Style, Graham W. Taylor and Geoffrey E. Hinton
• Coursera: Neural Networks for Machine Learning, Geoffrey Hinton
• http://www.cs.toronto.edu/~ilya/rnn.html