Transcript of lecture slides for CS60010: Deep Learning, "Recurrent Neural Network" (23 pages).

Page 1:

CS60010: Deep Learning

Recurrent Neural Network

Sudeshna Sarkar Spring 2018

5 Feb 2018

Page 2:

Sequence

• Sequence data: sentences, speech, stock market, signal data
• Sequence of words in an English sentence
• Acoustic features at successive time frames in speech recognition
• Successive frames in video classification
• Rainfall measurements on successive days in Hong Kong
• Daily values of a currency exchange rate

Page 3:

Modeling Sequential Data

• Sample data sequences from a certain distribution P(x_1, x_2, …, x_T)

• Generate natural sentences to describe an image I: P(y_1, y_2, …, y_T | I)

• Activity recognition from a video sequence: P(y | x_1, x_2, …, x_T)

• Speech recognition: P(y_1, y_2, …, y_T | x_1, x_2, …, x_T)

• Machine translation: P(y_1, y_2, …, y_T | x_1, x_2, …, x_S)

Page 4:

Recurrent neural networks

• RNNs are very powerful because they combine two properties:
  • Distributed hidden state that allows them to store a lot of information about the past efficiently.
  • Non-linear dynamics that allow them to update their hidden state in complicated ways.

• With enough neurons and time, RNNs can compute anything that can be computed by your computer.

[Figure: an RNN unrolled over time, with input, hidden, and output units at each time step.]

Page 5:

Based on cs231n by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Recurrent Networks offer a lot of flexibility:

[Figure: input/output configurations, starting from "Vanilla Neural Networks" (fixed-size input to fixed-size output) and extending to sequence inputs and outputs.]

Page 6:

Why RNNs?

• Can model sequences having variable length
• Efficient: weights shared across time-steps

Page 7:

Dynamical system; unfolding; computational graph

• Dynamical system, classical form: s(t) = f(s(t-1); θ)
  Unfolded for 3 steps: s(3) = f(s(2); θ) = f(f(s(1); θ); θ)

• For a finite number of time steps τ, the graph can be unfolded by applying the definition τ-1 times.

• The same parameters are used at all time steps.
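A minimal Python sketch of this unfolding (not from the slides; the transition f here is an arbitrary toy function), just to show that one and the same parameter θ is applied at every step:

```python
# Toy sketch of unfolding s(t) = f(s(t-1); theta), reusing one parameter.
def f(s, theta):
    # arbitrary illustrative transition, not a neural network
    return theta * s * (1.0 - s)

def unfold(s1, theta, tau):
    """Apply the definition tau - 1 times, starting from s(1)."""
    s = s1
    for _ in range(tau - 1):
        s = f(s, theta)
    return s

# s(3) = f(f(s(1); theta); theta)
s3 = unfold(s1=0.5, theta=3.2, tau=3)
```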

Page 8:

Dynamical system driven by external signal

• Consider a dynamical system driven by an external (input) signal x(t): s(t) = f(s(t-1), x(t); θ)

• The state now contains information about the whole past input sequence. To indicate that the state is hidden, rewrite it using the variable h for the state:

h(t) = f(h(t-1), x(t); θ)

Page 9:

Output prediction by RNN

• Task: to predict the future from the past.

• The network typically learns to use h(t) as a summary of the task-relevant aspects of the past sequence of inputs up to t.

• The summary is in general lossy, since it maps a sequence of arbitrary length (x(t), x(t-1), …, x(2), x(1)) to a fixed-length vector h(t).

• Depending on the training criterion, the summary keeps some aspects of the past sequence more precisely than others.

Page 10:

โ„Ž(๐‘ก) = ๐‘“ โ„Ž(๐‘กโˆ’1), ๐‘ฅ(๐‘ก); ๐œƒ

can be written in two different ways: circuit diagram or an unfolded computational graph

The unfolded graph has a size dependent on the sequence length.

Unfolding: from circuit diagram to computational graph

Page 11:

Unfolding

• We can represent the unfolded recurrence after t steps with a function g(t):

  h(t) = g(t)(x(t), x(t-1), …, x(1))

• The function g(t) takes in the whole past sequence x(t), x(t-1), …, x(1) as input and produces the current state.

• But the unfolded recurrent structure allows us to factorize g(t) into repeated application of a single function f:

  h(t) = f(h(t-1), x(t); θ)

Page 12:

Advantages of the unfolded model

1. Regardless of sequence length, the learned model always has the same input size, because it is specified in terms of the transition from one state to another rather than in terms of a variable-length history of states.

2. It is possible to use the same function f with the same parameters at every time step.

• These two factors make it possible to learn a single model f that operates on all time steps and all sequence lengths.

• Learning a single shared model lets us apply the network to input sequences of different lengths, predict sequences of different lengths, and work with fewer training examples, as in the sketch below.
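A brief sketch (hypothetical sizes and step function, not from the slides) of what this buys in practice: one step function with one parameter set handles sequences of any length:

```python
import numpy as np

# One shared step function, one parameter set, applied to input sequences
# of different lengths (all sizes here are made up for illustration).
def rnn_step(h_prev, x_t, W, U, b):
    return np.tanh(W @ h_prev + U @ x_t + b)

def run(xs, W, U, b):
    h = np.zeros(W.shape[0])
    for x_t in xs:
        h = rnn_step(h, x_t, W, U, b)   # same f, same parameters, every step
    return h

hidden, n_in = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(hidden, hidden))
U = rng.normal(size=(hidden, n_in))
b = np.zeros(hidden)

h_short = run(rng.normal(size=(3, n_in)), W, U, b)   # length-3 sequence
h_long  = run(rng.normal(size=(7, n_in)), W, U, b)   # length-7 sequence, same model
```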

Page 13:
Page 14:
Page 15:

Unfolding Computational Graphs

• A computational graph is a way to formalize the structure of a set of computations, such as mapping inputs and parameters to outputs and loss.

• We can unfold a recursive or recurrent computation into a computational graph that has a repetitive structure.

[Figure: unfolded RNN with the per-step updates]
a(t) = b + W h(t-1) + U x(t)
h(t) = tanh(a(t))
o(t) = c + V h(t)

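A NumPy sketch of these per-step equations (sizes are hypothetical; the softmax output ŷ(t) used on page 18 is included for completeness):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, U, W, V, b, c):
    """a(t) = b + W h(t-1) + U x(t);  h(t) = tanh(a(t));  o(t) = c + V h(t)."""
    h = np.zeros(W.shape[0])
    hs, ys = [], []
    for x_t in xs:
        a = b + W @ h + U @ x_t        # pre-activation a(t)
        h = np.tanh(a)                 # hidden state h(t)
        o = c + V @ h                  # output o(t)
        hs.append(h)
        ys.append(softmax(o))          # y-hat(t) = softmax(o(t))
    return hs, ys

# Hypothetical sizes, for illustration only.
n_in, n_hid, n_out, T = 3, 5, 4, 6
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(n_hid, n_in))
W = rng.normal(scale=0.1, size=(n_hid, n_hid))
V = rng.normal(scale=0.1, size=(n_out, n_hid))
b, c = np.zeros(n_hid), np.zeros(n_out)
hs, ys = rnn_forward(rng.normal(size=(T, n_in)), U, W, V, b, c)
```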

Page 16:

A more complex unfolded computational graph

Page 17:

Three design patterns of RNNs

1. Output at each time step; recurrent connections between hidden units

2. Output at each time step; recurrent connections only from output at one time step to hidden units at next time step

3. Recurrent connections between hidden units that read an entire input sequence and produce a single output. This pattern can summarize a sequence into a fixed-size representation for further processing.

Page 18:

RNN 1: with recurrence between hidden units

• Maps an input sequence x to an output sequence o, with softmax outputs.
• The loss L internally computes ŷ = softmax(o) and compares it to the target y.
• The update equations are applied for each time step, from t = 1 to t = τ.

Parameters:
• bias vectors b and c
• weight matrices U, V, and W for the input-to-hidden, hidden-to-output, and hidden-to-hidden connections

Page 19:

Loss function for a given sequence

• The total loss for a given sequence of x values paired with a sequence of y values is the sum of the losses over all the time steps.

• If L(t) is the negative log-likelihood of y(t) given x(1), …, x(t), then

  L( {x(1), …, x(τ)}, {y(1), …, y(τ)} ) = Σ_t L(t)
                                        = -Σ_t log p_model( y(t) | x(1), …, x(t) )
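A sketch of computing this total loss from the per-step softmax outputs ŷ(t) (the helper name and the toy numbers are mine, not the lecture's):

```python
import numpy as np

def sequence_nll(ys, targets, eps=1e-12):
    """L = sum over t of L(t) = -sum over t of log p_model(y(t) | x(1),...,x(t)),
    where ys[t] is the softmax output y-hat(t) and targets[t] is the index of
    the correct class at step t."""
    return -sum(np.log(y_hat[y_t] + eps) for y_hat, y_t in zip(ys, targets))

# Toy example with made-up softmax outputs over 3 classes at 3 time steps:
ys = [np.array([0.7, 0.2, 0.1]),
      np.array([0.1, 0.8, 0.1]),
      np.array([0.3, 0.3, 0.4])]
loss = sequence_nll(ys, targets=[0, 1, 2])   # -(log 0.7 + log 0.8 + log 0.4)
```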

Page 20:

Backpropagation through time

• We can think of the recurrent net as a layered, feed-forward net with shared weights, and then train that feed-forward net with weight constraints.

• We can also think of this training algorithm in the time domain:
  • The forward pass builds up a stack of the activities of all the units at each time step.
  • The backward pass peels activities off the stack to compute the error derivatives at each time step.
  • After the backward pass, we add together the derivatives at all the different times for each weight.

Page 21:

Gradients on V, c, W and U

๐œ•๐ฟ๐œ•๐ฟ๐‘ก

= 1,๐œ•๐ฟ๐œ•๐‘œ๐‘ก

=๐œ•๐ฟ๐œ•๐ฟ๐‘ก

๐œ•๐ฟ๐‘ก๐œ•๐‘œ๐‘ก

=๐œ•๐ฟ๐‘ก๐œ•๐‘œ๐‘ก

๐œ•๐ฟ๐œ•๐‘‰

= ๏ฟฝ๐œ•๐ฟ๐‘ก๐œ•๐‘œ๐‘ก๐‘ก

๐œ•๐‘œ๐‘ก๐œ•๐‘‰

๐œ•๐ฟ๐œ•๐‘

= ๏ฟฝ๐œ•๐ฟ๐‘ก๐œ•๐‘œ๐‘ก๐‘ก

๐œ•๐‘œ๐‘ก๐œ•๐‘

๐œ•๐ฟ๐œ•๐‘Š

= ๏ฟฝ๐œ•๐ฟ๐‘ก๐œ•โ„Ž๐‘ก๐‘ก

๐œ•โ„Ž๐‘ก๐œ•๐‘Š

๐œ•๐ฟ๐œ•โ„Ž๐‘ก

=๐œ•๐ฟ

๐œ•โ„Ž๐‘ก+1๐œ•โ„Ž๐‘ก+1๐œ•โ„Ž๐‘ก

+๐œ•๐ฟ๐œ•๐‘œ๐‘ก

๐œ•๐‘œ๐‘ก๐œ•โ„Ž๐‘ก

Page 22:

Feedforward Depth (d_f)

• Feedforward depth: the longest path between an input and an output at the same time step.

[Figure: example with feedforward depth = 4; the topmost hidden unit is a high-level feature. Notation: h_{0,1} ⇒ time step 0, neuron #1.]

Page 23:

Option 2: Recurrent Depth (d_r)

• Recurrent depth: the longest path between the same hidden state in successive time steps.

[Figure: example with recurrent depth = 3.]