
Lecture 10: Recurrent Neural Networks
Fei-Fei Li & Justin Johnson & Serena Yeung · May 3, 2018


Vanilla Neural Networks

“Vanilla” Neural Network: one fixed-size input -> one fixed-size output


Recurrent Neural Networks: Process Sequences

e.g. Image Captioning: image -> sequence of words


Recurrent Neural Networks: Process Sequences

e.g. Sentiment Classification: sequence of words -> sentiment


Recurrent Neural Networks: Process Sequences

e.g. Machine Translation: seq of words -> seq of words


Recurrent Neural Networks: Process Sequences

e.g. Video classification on frame level


Recurrent Neural Network

[Diagram: input x feeds into an RNN block with a recurrent loop]


Recurrent Neural Network

[Diagram: input x -> RNN -> output y]

y: usually we want to predict an output vector at some time steps


Recurrent Neural Network

[Diagram: input x -> RNN -> output y]

We can process a sequence of vectors x by applying a recurrence formula at every time step:

h_t = f_W(h_{t-1}, x_t)

where h_t is the new state, h_{t-1} is the old state, x_t is the input vector at time step t, and f_W is some function with parameters W.


Recurrent Neural Network

[Diagram: input x -> RNN -> output y]

We can process a sequence of vectors x by applying a recurrence formula at every time step.

Notice: the same function f_W and the same set of parameters W are used at every time step.


(Simple) Recurrent Neural Network

[Diagram: input x -> RNN -> output y]

The state consists of a single “hidden” vector h:

h_t = tanh(W_hh h_{t-1} + W_xh x_t)
y_t = W_hy h_t

Sometimes called a “Vanilla RNN” or an “Elman RNN” after Prof. Jeffrey Elman
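A minimal numpy sketch of this update (the sizes, initialization, and variable names are illustrative assumptions, not the lecture's exact code):

```python
import numpy as np

H, D = 4, 3                                 # hidden size and input size (arbitrary)
rng = np.random.default_rng(0)
W_hh = 0.1 * rng.standard_normal((H, H))    # hidden-to-hidden weights
W_xh = 0.1 * rng.standard_normal((H, D))    # input-to-hidden weights
W_hy = 0.1 * rng.standard_normal((D, H))    # hidden-to-output weights

def rnn_step(h_prev, x):
    """One step of the vanilla (Elman) RNN: h_t = tanh(W_hh h_{t-1} + W_xh x_t)."""
    h = np.tanh(W_hh @ h_prev + W_xh @ x)
    y = W_hy @ h                            # per-step output scores y_t = W_hy h_t
    return h, y

h = np.zeros(H)                             # initial hidden state h_0
for x in rng.standard_normal((5, D)):       # a sequence of 5 input vectors
    h, y = rnn_step(h, x)                   # the same weights are reused at every time step
```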


RNN: Computational Graph

[Diagram: h0 -> fW -> h1, with input x1 feeding into fW]


RNN: Computational Graph

[Diagram: h0 -> fW -> h1 -> fW -> h2, with inputs x1, x2]


RNN: Computational Graph

[Diagram: h0 -> fW -> h1 -> fW -> h2 -> fW -> h3 -> ... -> hT, with inputs x1, x2, x3, ...]


RNN: Computational Graph

Re-use the same weight matrix at every time-step

[Diagram: h0 -> fW -> h1 -> fW -> h2 -> fW -> h3 -> ... -> hT; the inputs x1, x2, x3, ... and the single shared weight matrix W feed into every fW node]


RNN: Computational Graph: Many to Many

[Diagram: the unrolled graph above, now producing an output y1, y2, y3, ..., yT from each hidden state h1, h2, h3, ..., hT]


RNN: Computational Graph: Many to Many

[Diagram: as above, with a per-time-step loss L1, L2, L3, ..., LT computed from each output y1, y2, y3, ..., yT]


RNN: Computational Graph: Many to Many

[Diagram: as above; the per-time-step losses L1, L2, L3, ..., LT are summed into a single total loss L]
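A hedged numpy sketch of this many-to-many graph (the softmax cross-entropy loss and the integer targets are assumptions for illustration): the same step is applied at every time step, a per-step loss L_t is computed from each output y_t, and the losses are summed into the single scalar L.

```python
import numpy as np

def softmax_cross_entropy(scores, target):
    """Per-step loss L_t: softmax over the scores, negative log-probability of the target index."""
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return -np.log(p[target])

def forward_many_to_many(xs, targets, h0, W_hh, W_xh, W_hy):
    """Unroll the RNN over the whole sequence and sum the per-step losses into one total loss L."""
    h, L = h0, 0.0
    for x, t in zip(xs, targets):
        h = np.tanh(W_hh @ h + W_xh @ x)    # same weights W at every time step
        y = W_hy @ h                        # per-step output y_t
        L += softmax_cross_entropy(y, t)    # per-step loss L_t added into the total L
    return L
```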


RNN: Computational Graph: Many to One

[Diagram: inputs x1, x2, x3, ... feed the unrolled RNN h0 -> h1 -> h2 -> h3 -> ... -> hT; a single output y is produced from the final hidden state hT]


RNN: Computational Graph: One to Many

[Diagram: a single input x (with shared weights W) starts the unrolled RNN h0 -> h1 -> h2 -> h3 -> ... -> hT, which produces an output y1, y2, y3, ..., yT at every time step]


Sequence to Sequence: Many-to-one + one-to-many

Many to one: Encode input sequence in a single vector

[Diagram: an encoder RNN with weights W1 consumes x1, x2, x3, ... and summarizes the sequence in its final hidden state hT]

Sutskever et al, “Sequence to Sequence Learning with Neural Networks”, NIPS 2014


Sequence to Sequence: Many-to-one + one-to-many

Many to one: Encode input sequence in a single vector
One to many: Produce output sequence from single input vector

[Diagram: the encoder (weights W1) summarizes x1, x2, x3, ... into hT; a second, one-to-many RNN (weights W2) starts from that vector and emits y1, y2, ...]

Sutskever et al, “Sequence to Sequence Learning with Neural Networks”, NIPS 2014
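A minimal numpy sketch of the two stages (the encoder/decoder weight names W1_*, W2_* and the fixed number of decoding steps are assumptions; in practice the decoder usually also feeds its previous output back in as the next input):

```python
import numpy as np

def encode(xs, h0, W1_hh, W1_xh):
    """Many to one: run the encoder RNN over the input sequence;
    its final hidden state summarizes the whole sequence."""
    h = h0
    for x in xs:
        h = np.tanh(W1_hh @ h + W1_xh @ x)
    return h

def decode(h, W2_hh, W2_hy, steps):
    """One to many: starting from the encoder's summary vector,
    a second RNN (with its own weights) emits one output per step."""
    ys = []
    for _ in range(steps):
        h = np.tanh(W2_hh @ h)      # simplified: driven only by its own state
        ys.append(W2_hy @ h)        # output y_t at this step
    return ys
```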


Example: Character-level Language Model

Vocabulary: [h, e, l, o]

Example training sequence: “hello”
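A sketch of how such a training sequence can be turned into inputs and targets, in the spirit of min-char-rnn.py (the one-hot encoding and variable names are assumptions):

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

text = "hello"
# Inputs are the first four characters; the target at each step is the next character.
inputs  = [char_to_ix[ch] for ch in text[:-1]]   # h, e, l, l
targets = [char_to_ix[ch] for ch in text[1:]]    # e, l, l, o

def one_hot(ix, size=len(vocab)):
    """Encode a character index as a one-hot vector over the 4-character vocabulary."""
    v = np.zeros(size)
    v[ix] = 1.0
    return v

xs = [one_hot(ix) for ix in inputs]              # one 4-dimensional input vector per character
```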


Example: Character-level Language Model (Sampling)

Vocabulary: [h, e, l, o]

At test-time sample characters one at a time, feed back to model

[Figure: at each time step the model's softmax gives a distribution over (h, e, l, o); a character is sampled from it (“e”, “l”, “l”, “o”) and fed back in as the next input]
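A hedged sketch of this sampling loop, reusing the vanilla-RNN weight names from the earlier sketch (the seed character and sequence length are arbitrary):

```python
import numpy as np

def sample(h, seed_ix, n, W_hh, W_xh, W_hy, vocab_size, rng=np.random.default_rng()):
    """Sample n characters: at each step take a softmax over the output scores,
    draw one character index, and feed its one-hot vector back in as the next input."""
    x = np.zeros(vocab_size)
    x[seed_ix] = 1.0
    ixes = []
    for _ in range(n):
        h = np.tanh(W_hh @ h + W_xh @ x)
        scores = W_hy @ h
        p = np.exp(scores - scores.max())
        p /= p.sum()                        # softmax distribution over the vocabulary
        ix = rng.choice(vocab_size, p=p)    # sample from it (rather than taking the argmax)
        x = np.zeros(vocab_size)
        x[ix] = 1.0                         # feed the sampled character back to the model
        ixes.append(ix)
    return ixes
```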


Backpropagation through time

Forward through entire sequence to compute loss, then backward through entire sequence to compute gradient


Truncated Backpropagation through time

Run forward and backward through chunks of the sequence instead of whole sequence


Truncated Backpropagation through time

Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps
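A sketch of that chunking in PyTorch-style code; the model, data, and chunk length are illustrative assumptions, and the key move is detaching the carried hidden state at each chunk boundary so gradients only flow within a chunk:

```python
import torch

seq_len, chunk, batch, D, H = 1000, 25, 1, 8, 16
model = torch.nn.RNN(input_size=D, hidden_size=H)
readout = torch.nn.Linear(H, D)
opt = torch.optim.SGD(list(model.parameters()) + list(readout.parameters()), lr=1e-2)

xs = torch.randn(seq_len, batch, D)        # dummy input sequence
ys = torch.randn(seq_len, batch, D)        # dummy per-step regression targets

h = torch.zeros(1, batch, H)               # hidden state carried forward through the whole sequence
for start in range(0, seq_len, chunk):
    out, h = model(xs[start:start + chunk], h)
    loss = ((readout(out) - ys[start:start + chunk]) ** 2).mean()
    opt.zero_grad()
    loss.backward()                        # gradients flow only within this chunk...
    opt.step()
    h = h.detach()                         # ...because the carried state is cut from the graph here
```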


min-char-rnn.py gist: 112 lines of Python

(https://gist.github.com/karpathy/d4dee566867f8291f086)


[Diagram: input x -> RNN -> output y]


[Figure: text samples from the character-level RNN at increasing amounts of training: at first the output is gibberish, and it looks more and more like real text as we train more]


The Stacks Project: open source algebraic geometry textbook

Latex source: http://stacks.math.columbia.edu/
The Stacks Project is licensed under the GNU Free Documentation License


Generated C code


Multilayer RNNs

[Diagram: RNN layers stacked in depth and unrolled in time; the hidden-state sequence of one layer is the input sequence of the layer above]
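A minimal numpy sketch of the stacking (the per-layer weights are illustrative assumptions): the sequence of hidden states produced by one layer becomes the input sequence of the next, deeper layer.

```python
import numpy as np

def run_layer(xs, h0, W_hh, W_xh):
    """Run one vanilla RNN layer over a sequence; return the hidden state at every time step."""
    h, hs = h0, []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x)
        hs.append(h)
    return hs

def run_stack(xs, layers):
    """Multilayer RNN: each layer's hidden-state sequence is the next layer's input sequence."""
    seq = xs
    for h0, W_hh, W_xh in layers:            # layers are stacked in depth
        seq = run_layer(seq, h0, W_hh, W_xh)
    return seq                               # top layer's hidden states, one per time step
```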


Vanilla RNN Gradient Flow

[Diagram: one vanilla RNN cell: h_{t-1} and x_t are stacked, multiplied by W, and passed through tanh to produce h_t]

Bengio et al, “Learning long-term dependencies with gradient descent is difficult”, IEEE Transactions on Neural Networks, 1994
Pascanu et al, “On the difficulty of training recurrent neural networks”, ICML 2013


Vanilla RNN Gradient Flow

[Diagram: the same cell, with the backward path from h_t to h_{t-1} highlighted]

Backpropagation from h_t to h_{t-1} multiplies by W (actually W_hh^T).


Vanilla RNN Gradient Flow

[Diagram: unrolled chain h0 -> h1 -> h2 -> h3 -> h4 with inputs x1, x2, x3, x4]

Computing the gradient of h0 involves many factors of W (and repeated tanh).


Vanilla RNN Gradient Flow

Computing the gradient of h0 involves many factors of W (and repeated tanh):

Largest singular value > 1: Exploding gradients
Largest singular value < 1: Vanishing gradients
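A small numerical illustration of the effect (ignoring the tanh factors, which only shrink gradients further; using a scaled orthogonal matrix so that every singular value equals the chosen scale is an assumption made purely for clarity):

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 16, 50
g0 = rng.standard_normal(H)                        # stand-in for the gradient at the last step

for scale in (0.9, 1.1):                           # largest singular value < 1 vs. > 1
    Q, _ = np.linalg.qr(rng.standard_normal((H, H)))
    W = scale * Q                                  # all singular values of W equal `scale`
    g = g0.copy()
    for _ in range(T):                             # backprop through T steps multiplies by W^T each time
        g = W.T @ g
    print(f"singular value {scale}: gradient norm after {T} steps = {np.linalg.norm(g):.3e}")
```

With scale 0.9 the norm shrinks by roughly 0.9^50, and with scale 1.1 it grows by roughly 1.1^50.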


Vanilla RNN Gradient Flow

Largest singular value > 1: Exploding gradients. Remedy: gradient clipping, i.e. scale the gradient if its norm is too big.
Largest singular value < 1: Vanishing gradients.
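A minimal sketch of norm-based clipping as described on the slide (the threshold value is an arbitrary assumption):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Scale the gradient down if its L2 norm exceeds max_norm; leave it unchanged otherwise."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```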


Vanilla RNN Gradient Flow

Largest singular value > 1: Exploding gradients. Remedy: gradient clipping.
Largest singular value < 1: Vanishing gradients. Remedy: change the RNN architecture.


Long Short Term Memory (LSTM)

Hochreiter and Schmidhuber, “Long Short-Term Memory”, Neural Computation 1997

[Side-by-side comparison of the Vanilla RNN update and the LSTM update equations]


Long Short Term Memory (LSTM) [Hochreiter et al., 1997]

The vector from before (h) and the vector from below (x) are stacked and multiplied by one weight matrix W (of size 4h x 2h when x also has size h), giving a 4h-dimensional vector that is split into four h-dimensional gates:

i = sigmoid(...)  Input gate: whether to write to cell
f = sigmoid(...)  Forget gate: whether to erase cell
o = sigmoid(...)  Output gate: how much to reveal cell
g = tanh(...)     Gate gate (?): how much to write to cell


Long Short Term Memory (LSTM) [Hochreiter et al., 1997]

[Diagram: h_{t-1} and x_t are stacked and multiplied by W to form the gates f, i, g, o; the cell and hidden states are then updated as]

c_t = f ⊙ c_{t-1} + i ⊙ g
h_t = o ⊙ tanh(c_t)
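A numpy sketch of one LSTM step matching the diagram (the layout of the single weight matrix W over the stacked [h_{t-1}; x_t] and the gate ordering are assumptions consistent with the slide, not the lecture's exact code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4H, H + D): one big matrix applied to the
    stacked [h_{t-1}; x_t], whose output is split into the i, f, o, g gates."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    i = sigmoid(z[0:H])            # input gate: whether to write to the cell
    f = sigmoid(z[H:2 * H])        # forget gate: whether to erase the cell
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much of the cell to reveal
    g = np.tanh(z[3 * H:4 * H])    # candidate values: how much to write to the cell
    c = f * c_prev + i * g         # elementwise cell-state update
    h = o * np.tanh(c)             # hidden state
    return h, c
```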


Long Short Term Memory (LSTM): Gradient Flow [Hochreiter et al., 1997]

[Diagram: the same LSTM cell, with the backward path along the cell state highlighted]

Backpropagation from c_t to c_{t-1} is only an elementwise multiplication by f, with no matrix multiply by W.


Long Short Term Memory (LSTM): Gradient Flow [Hochreiter et al., 1997]

[Diagram: unrolled cell states c0 -> c1 -> c2 -> c3 form an uninterrupted backward path]

Uninterrupted gradient flow!


Long Short Term Memory (LSTM): Gradient Flow [Hochreiter et al., 1997]

Uninterrupted gradient flow!

[Figure: the unrolled LSTM cell-state path shown alongside a ResNet (7x7 conv, stacks of 3x3 conv layers, pooling, FC-1000, softmax), whose additive skip connections play a similar role]

Similar to ResNet!



In between the two: Highway Networks

Srivastava et al, “Highway Networks”, ICML DL Workshop 2015


Other RNN Variants

[LSTM: A Search Space Odyssey, Greff et al., 2015]

[An Empirical Exploration of Recurrent Network Architectures, Jozefowicz et al., 2015]

GRU [Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Cho et al., 2014]


Summary

- RNNs allow a lot of flexibility in architecture design
- Vanilla RNNs are simple but don’t work very well
- Common to use LSTM or GRU: their additive interactions improve gradient flow
- Backward flow of gradients in RNNs can explode or vanish. Exploding is controlled with gradient clipping; vanishing is controlled with additive interactions (LSTM)
- Better/simpler architectures are a hot topic of current research
- Better understanding (both theoretical and empirical) is needed