
- Recurrent Neural Networks-

Matteo Matteucci, PhD ([email protected])

Artificial Intelligence and Robotics Laboratory

Politecnico di Milano

Artificial Neural Networks and Deep Learning


Sequence Modeling

So far we have considered only «static» datasets

[Figure: a feedforward network for a single static input x = (x_1, …, x_I), mapped through weights w_{ji} to outputs g_1(x|w), …, g_K(x|w).]


[Figure: the same data viewed as a sequence over time: inputs X_0, X_1, X_2, X_3, …, X_t, each a vector (x_1, …, x_I).]


Different ways to deal with «dynamic» data:

Memoryless models:
• Autoregressive models
• Feedforward neural networks

Models with memory:
• Linear dynamical systems
• Hidden Markov models
• Recurrent Neural Networks
• ...



Memoryless Models for Sequences

Autoregressive models
• Predict the next input from previous ones using «delay taps»

Feedforward neural networks
• Generalize autoregressive models using non-linear hidden layers

[Figures: an autoregressive model with delay taps W_{t-2}, W_{t-1} feeding X_t directly, and a feedforward network that inserts a hidden layer between the delayed inputs and X_t.]
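As a concrete, hedged illustration of the autoregressive idea above, here is a minimal numpy sketch that fits an order-p model from «delay taps» by least squares; the function names and the toy signal are made up for illustration, not taken from the slides.

```python
import numpy as np

def fit_ar(series, p):
    """Fit an order-p autoregressive model x_t ~ sum_k a_k * x_{t-k} + b by least squares."""
    # Build the design matrix from p "delay taps" of the series.
    X = np.column_stack([series[p - k - 1:len(series) - k - 1] for k in range(p)])
    X = np.column_stack([X, np.ones(len(X))])      # bias term
    y = series[p:]                                  # next-step targets
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def predict_next(series, coeffs):
    p = len(coeffs) - 1
    taps = series[-1:-p - 1:-1]                     # last p values, most recent first
    return taps @ coeffs[:-1] + coeffs[-1]

# Toy usage: a noisy sine wave predicted from its last 5 samples.
t = np.arange(200)
x = np.sin(0.1 * t) + 0.05 * np.random.randn(200)
a = fit_ar(x, p=5)
print(predict_next(x, a))
```

A feedforward network generalizes this by replacing the linear map over the delay taps with one or more non-linear hidden layers.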


Dynamical Systems

Generative models with a real-valued hidden state which cannot be observed directly:
• The hidden state has its own dynamics, possibly affected by noise, and produces the output
• To compute the output we have to infer the hidden state
• Inputs are treated as driving inputs

In linear dynamical systems this becomes:
• The state is continuous with Gaussian uncertainty
• Transformations are assumed to be linear
• The state can be estimated using Kalman filtering

[Figure: the system unrolled over time: inputs X_0, X_1, …, X_t drive hidden states, which generate outputs Y_0, Y_1, …, Y_t.]

Stochastic systems ...
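The slides only name Kalman filtering; below is a minimal sketch of one predict/update step for such a linear-Gaussian model. The matrices A, B, C, Q, R and the function name are illustrative assumptions, not from the slides.

```python
import numpy as np

def kalman_step(mu, P, x, y, A, B, C, Q, R):
    """One predict/update step of a Kalman filter.

    State model:       h_t = A h_{t-1} + B x_t + w,   w ~ N(0, Q)
    Observation model: y_t = C h_t + v,               v ~ N(0, R)
    (mu, P) is the Gaussian belief over the hidden state h.
    """
    # Predict: propagate the belief through the linear dynamics, driven by input x_t.
    mu_pred = A @ mu + B @ x
    P_pred = A @ P @ A.T + Q
    # Update: correct the prediction with the observed output y_t.
    S = C @ P_pred @ C.T + R                      # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)           # Kalman gain
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    P_new = (np.eye(len(mu)) - K @ C) @ P_pred
    return mu_new, P_new
```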


In hidden Markov models this becomes:
• The state is assumed to be discrete and state transitions are stochastic (transition matrix)
• The output is a stochastic function of the hidden state
• The state sequence can be estimated via the Viterbi algorithm

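Since the slides only name the algorithm, here is a minimal sketch of Viterbi decoding for a discrete-state HMM; the variable names are generic and not from the slides.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state path of an HMM (log domain).

    obs : observed symbol indices, length T
    pi  : initial state distribution, shape (S,)
    A   : transition matrix, A[i, j] = P(state j | state i)
    B   : emission matrix,   B[i, k] = P(symbol k | state i)
    """
    T, S = len(obs), len(pi)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, S))          # best log-prob of any path ending in state s at time t
    psi = np.zeros((T, S), dtype=int) # best predecessor state
    delta[0] = logpi + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA       # (from-state, to-state) scores
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    # Backtrack the best path from the most likely final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```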


Recurrent Neural Networks

Memory via recurrent connections:
• A distributed hidden state allows information to be stored efficiently
• Non-linear dynamics allow complex hidden-state updates

Deterministic systems ...

“With enough neurons and time, RNNs can compute anything that can be computed by a computer.” (Computation Beyond the Turing Limit, Hava T. Siegelmann, 1995)

[Figure: a recurrent network in which context units c_1^t, …, c_B^t depend on the input x = (x_1, …, x_I) and on the previous context c_1^{t-1}, …, c_B^{t-1} through weights W^{(1)} and V^{(1)}, and feed, together with the hidden units h_j^t, the output g^t(x|w).]


In formulas, the forward pass at time t is:

g^t(x_n \mid w) = g\Big( \sum_{j=0}^{J} w_{1j}^{(2)} \, h_j^t(\cdot) + \sum_{b=0}^{B} v_{1b}^{(2)} \, c_b^t(\cdot) \Big)

h_j^t(\cdot) = h_j\Big( \sum_{i=0}^{I} w_{ji}^{(1)} \, x_{i,n} + \sum_{b=0}^{B} v_{jb}^{(1)} \, c_b^{t-1} \Big)

c_b^t(\cdot) = c_b\Big( \sum_{i=0}^{I} v_{bi}^{(1)} \, x_{i,n} + \sum_{b'=0}^{B} v_{bb'}^{(1)} \, c_{b'}^{t-1} \Big)
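A minimal numpy sketch of this forward pass follows. It is an assumption-laden reading of the equations above: the matrices W1, Vh, Vx, Vc, W2, V2 stand for the blocks of W^{(1)}, V^{(1)}, w^{(2)}, v^{(2)}, tanh is an assumed activation, and biases are omitted.

```python
import numpy as np

def rnn_forward(X, W1, Vh, Vx, Vc, W2, V2, c0, act=np.tanh):
    """Forward pass of the recurrent network above:

      h_t = act(W1 @ x_t + Vh @ c_{t-1})        # hidden units   h_j^t
      c_t = act(Vx @ x_t + Vc @ c_{t-1})        # context units  c_b^t
      y_t = W2 @ h_t + V2 @ c_t                 # output         g^t(x | w)
    """
    c, outputs = c0, []
    for x_t in X:                     # X has shape (T, I): one input vector per time step
        h = act(W1 @ x_t + Vh @ c)
        c_new = act(Vx @ x_t + Vc @ c)
        outputs.append(W2 @ h + V2 @ c_new)
        c = c_new                     # the context is carried to the next time step
    return np.stack(outputs), c
```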


Backpropagation Through Time



[Figure: the network unrolled over time, with one replica of the input layer and of the recurrent weights per time step.]

All these weights should be the same.


• Perform the network unroll for U steps
• Initialize the replicas of V, V_B to be the same
• Compute the gradients and update the replicas with the average of their gradients

V \leftarrow V - \eta \cdot \frac{1}{U} \sum_{u=0}^{U-1} \frac{\partial E}{\partial V^{t-u}}
\qquad\qquad
V_B \leftarrow V_B - \eta \cdot \frac{1}{U} \sum_{u=0}^{U-1} \frac{\partial E}{\partial V_B^{t-u}}
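A minimal PyTorch sketch of this procedure (truncated BPTT) is below. The U unrolled replicas share the same parameters, so backpropagating the loss over a U-step chunk accumulates all replica contributions into one shared gradient; averaging the per-step losses keeps the scale comparable to the 1/U factor above. Sizes and names are illustrative.

```python
import torch
import torch.nn as nn

U, I, H = 20, 8, 32                      # truncation length, input size, hidden size
rnn = nn.RNN(input_size=I, hidden_size=H, batch_first=True)
readout = nn.Linear(H, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=1e-2)

def tbptt_step(x_chunk, y_chunk, hidden):
    """One truncated-BPTT update on a chunk of U time steps."""
    opt.zero_grad()
    out, hidden = rnn(x_chunk, hidden)   # unroll U steps with shared weights
    loss = ((readout(out) - y_chunk) ** 2).mean()
    loss.backward()                      # backpropagate through the U replicas
    opt.step()
    return hidden.detach()               # cut the graph so the next chunk starts fresh

# Toy usage on random data split into chunks of length U.
x = torch.randn(1, 200, I); y = torch.randn(1, 200, 1)
hidden = None
for t in range(0, 200, U):
    hidden = tbptt_step(x[:, t:t + U], y[:, t:t + U], hidden)
```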


How much should we go back in time?

Sometimes the output might be related to an input that happened quite a long time before.

However, backpropagation through time was not able to train recurrent neural networks significantly far back in time ...


Jane walked into the room. John walked in too. It was late in the day. Jane said hi to <???>

This was due to not being able to backpropagate through many layers ...


How much can we go back in time?

To better understand why it was not working, consider a simplified case:

y^t = w^{(2)} g(h^t), \qquad h^t = g\big(v^{(1)} h^{t-1} + w^{(1)} x\big)

Backpropagation over an entire sequence is computed as

\frac{\partial E}{\partial w} = \sum_{t=1}^{S} \frac{\partial E^t}{\partial w}, \qquad
\frac{\partial E^t}{\partial w} = \sum_{k=1}^{t} \frac{\partial E^t}{\partial y^t}\,\frac{\partial y^t}{\partial h^t}\,\frac{\partial h^t}{\partial h^k}\,\frac{\partial h^k}{\partial w}, \qquad
\frac{\partial h^t}{\partial h^k} = \prod_{i=k+1}^{t} \frac{\partial h^i}{\partial h^{i-1}} = \prod_{i=k+1}^{t} v^{(1)} g'(h^{i-1})

If we consider the norm of these terms

\left\lVert \frac{\partial h^i}{\partial h^{i-1}} \right\rVert = \left\lVert v^{(1)} g'(h^{i-1}) \right\rVert
\quad\Rightarrow\quad
\left\lVert \frac{\partial h^t}{\partial h^k} \right\rVert \le \left(\gamma_v \cdot \gamma_{g'}\right)^{t-k}

If (\gamma_v \gamma_{g'}) < 1 this converges to 0 ...

With Sigmoids and Tanh we have vanishing gradients.
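A small numeric check of the bound above (not from the slides): for a scalar tanh unit with |v^{(1)}| < 1, the product of the factors v^{(1)} g'(h^{i-1}) shrinks geometrically with the gap t − k.

```python
import numpy as np

# The factor dh_t/dh_k is the product of v * g'(h_{i-1}) over the gap t - k.
np.random.seed(0)
v, w = 0.9, 0.5                      # recurrent and input weights (|v| * max|tanh'| < 1)
h, x = 0.0, np.random.randn(50)
grad = 1.0
for i in range(50):
    h_new = np.tanh(v * h + w * x[i])
    grad *= v * (1.0 - h_new ** 2)   # chain-rule factor v * g'(.), with tanh' = 1 - tanh^2
    h = h_new
    if i in (4, 19, 49):
        print(f"|dh_t/dh_0| after {i + 1} steps ~ {abs(grad):.2e}")
```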


Dealing with Vanishing Gradient

Force all gradients to be either 0 or 1:

g(a) = \mathrm{ReLU}(a) = \max(0, a), \qquad g'(a) = 1_{a > 0}

With a linear recurrent unit and a fixed self-connection,

y^t = w^{(2)} g(h^t), \qquad h^t = v^{(1)} h^{t-1} + w^{(1)} x, \qquad v^{(1)} = 1

it only accumulates the input ...

Build Recurrent Neural Networks using small modules that are designed to remember values for a long time.


Long Short-Term Memories

Hochreiter & Schmidhuber (1997) solved the vanishing gradient problem by designing a memory cell using logistic and linear units with multiplicative interactions:
• Information gets into the cell whenever its “write” gate is on.
• The information stays in the cell so long as its “keep” gate is on.
• Information is read from the cell by turning on its “read” gate.

We can backpropagate through this since the loop has a fixed weight.


RNN vs. LSTM

[Figure: the repeating module of a standard RNN.]

Long Short-Term Memory

[Figures: the repeating module of an LSTM, highlighting in turn its input gate, forget gate, memory gate, and output gate.]
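For reference, the gate computations these figures illustrate, in the widely used notation (σ is the logistic function, ⊙ the element-wise product); the formulation is standard, but the symbols are not taken from the slides:

\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f)            &&\text{forget gate}\\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i)            &&\text{input gate}\\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C)     &&\text{candidate memory}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t   &&\text{cell state update}\\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o)            &&\text{output gate}\\
h_t &= o_t \odot \tanh(C_t)                        &&\text{new hidden state}
\end{aligned}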


Gated Recurrent Unit (GRU)

It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes.
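In the same notation as the LSTM equations above (again added here for reference, not taken from the slides), the GRU update reads:

\begin{aligned}
z_t &= \sigma(W_z [h_{t-1}, x_t])                          &&\text{update gate}\\
r_t &= \sigma(W_r [h_{t-1}, x_t])                          &&\text{reset gate}\\
\tilde{h}_t &= \tanh(W [r_t \odot h_{t-1}, x_t])           &&\text{candidate state}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t     &&\text{new hidden state}
\end{aligned}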


LSTM Networks

You can build a computation graph with continuous transformations.

[Figure: a computation graph over time: inputs X_0, X_1, …, X_t feed hidden blocks that produce outputs Y_0, Y_1, …, Y_t.]



[Figure: a deep version of the same graph: the input sequence X_0, …, X_t flows through stacked LSTM layers and ReLU layers to produce the output sequence Y_0, …, Y_t.]
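A minimal PyTorch sketch of such a stack (layer sizes and names are made up for illustration): two LSTM layers followed by a ReLU readout applied at every time step.

```python
import torch
import torch.nn as nn

class StackedLSTM(nn.Module):
    """Two stacked LSTM layers plus a ReLU readout applied at every time step."""
    def __init__(self, input_size=16, hidden_size=64, output_size=8):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers=2, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU(),
                                  nn.Linear(hidden_size, output_size))

    def forward(self, x):                 # x: (batch, T, input_size)
        out, _ = self.lstm(x)             # out: (batch, T, hidden_size), one vector per step
        return self.head(out)             # y:  (batch, T, output_size)

y = StackedLSTM()(torch.randn(4, 10, 16))   # toy usage: batch of 4 sequences of length 10
print(y.shape)                               # torch.Size([4, 10, 8])
```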


Sequential Data Problems

• Fixed-sized input to fixed-sized output (e.g., image classification)
• Sequence output (e.g., image captioning takes an image and outputs a sentence of words)
• Sequence input (e.g., sentiment analysis, where a given sentence is classified as expressing positive or negative sentiment)
• Sequence input and sequence output (e.g., machine translation: an RNN reads a sentence in English and then outputs a sentence in French)
• Synced sequence input and output (e.g., video classification, where we wish to label each frame of the video)

Credits: Andrej Karpathy


Sequence to Sequence Modeling

Given <S, T> pairs, read S, and output T' that best matches T.
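A minimal PyTorch encoder-decoder sketch of this read-S-then-emit-T' idea (vocabulary sizes and names are illustrative; the decoder is shown with teacher forcing, i.e. fed the ground-truth prefix of T):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Read source sequence S with an encoder LSTM, then generate T' with a decoder LSTM."""
    def __init__(self, src_vocab, tgt_vocab, emb=32, hidden=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt_in):
        # Encode S; the final (h, c) state summarizes the whole source sequence.
        _, state = self.encoder(self.src_emb(src))
        # Decode conditioned on that state (here fed the ground-truth prefix tgt_in).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.out(dec_out)          # scores over the target vocabulary at every step

logits = Seq2Seq(100, 120)(torch.randint(0, 100, (2, 7)), torch.randint(0, 120, (2, 9)))
print(logits.shape)                       # torch.Size([2, 9, 120])
```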


Tips & Tricks

When conditioning on the full input sequence, Bidirectional RNNs exploit it:
• Have one RNN traverse the sequence left-to-right
• Have another RNN traverse the sequence right-to-left
• Use the concatenation of their hidden layers as the feature representation (see the sketch after this list)

When initializing an RNN we need to specify the initial state:
• We could initialize it to a fixed value (such as 0)
• It is better to treat the initial state as learned parameters
• Start off with random guesses of the initial state values
• Backpropagate the prediction error through time all the way to the initial state values and compute the gradient of the error with respect to these
• Update these parameters by gradient descent
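A minimal PyTorch sketch of both tricks (sizes and names are illustrative): a bidirectional LSTM whose two directions are concatenated, and initial states kept as learned parameters so that the backpropagated error can update them.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Bidirectional LSTM with a learned initial state."""
    def __init__(self, input_size=16, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
        # Treat the initial hidden/cell states as parameters (2 directions x 1 layer).
        self.h0 = nn.Parameter(torch.zeros(2, 1, hidden_size))
        self.c0 = nn.Parameter(torch.zeros(2, 1, hidden_size))

    def forward(self, x):                          # x: (batch, T, input_size)
        b = x.size(0)
        h0 = self.h0.expand(-1, b, -1).contiguous()
        c0 = self.c0.expand(-1, b, -1).contiguous()
        out, _ = self.lstm(x, (h0, c0))
        # out: (batch, T, 2 * hidden_size), left-to-right and right-to-left states concatenated
        return out

feats = BiLSTMEncoder()(torch.randn(4, 10, 16))
print(feats.shape)        # torch.Size([4, 10, 64])
```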