Recurrent Neural Networks
Viacheslav Khomenko, Ph.D.
Contents
Recap: feed-forward artificial neural network
Temporal dependencies
Recurrent neural network architectures
RNN training
New RNN architectures
Practical considerations
Neural models for locomotion
Application of RNNs
RECAP: FEED-FORWARD
ARTIFICIAL NEURAL
NETWORK
Feed-forward network
W. McCulloch and W. Pitts, 1940s: abstract mathematical model of a brain cell
Perceptron for classification: F. Rosenblatt, 1958
Multi-layer artificial neural network: P. Werbos, 1975
[Figure: feed-forward network for Iris flower classification. Input layer features: petals, sepal, yellow patch, veins; hidden layer(s); output layer with two decisions: Iris / ¬Iris.]
Feed-forward network
Decisions are based only on current inputs:
• No memory of the past
• No view of the future
[Diagram: input x → input layer → hidden layer(s) A → output layer → decision output 𝒚]
Simplified representation:
Vector of input features: x
Vector of predicted values: 𝒚
Neural activation:
A – some activation function (tanh, etc.)
𝑤, 𝑏 – network parameters
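A feed-forward pass like the one sketched above can be written in a few lines. This is a minimal illustration only (numpy, tanh standing in for the generic activation A, random weights standing in for trained parameters 𝑤, 𝑏):

```python
import numpy as np

def feed_forward(x, w, b):
    """Single feed-forward layer: activation A applied to w·x + b.

    Illustrative shapes: w is (n_out, n_in), x is (n_in,), b is (n_out,).
    tanh stands in for the generic activation A from the slide.
    """
    return np.tanh(w @ x + b)

# Toy iris-style example: 4 input features -> 3 hidden units -> 2 outputs
rng = np.random.default_rng(0)
x = np.array([0.5, 0.1, 0.9, 0.3])           # petals, sepal, yellow patch, veins
w1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
w2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
h = feed_forward(x, w1, b1)                  # hidden layer
y = feed_forward(h, w2, b2)                  # output layer (Iris / ¬Iris scores)
print(y.shape)  # (2,)
```

Note that nothing in this pass depends on previous inputs: the network has no state, which is exactly the limitation the next sections address.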
TEMPORAL
DEPENDENCIES
Temporal dependencies
Analyzing temporal dependencies
Per-frame classification of a partially observed flower:
Frame 0 (stem seen, petals hidden): P(Iris) = 0.1, P(¬Iris) = 0.9
Frame 1 (stem seen, petals hidden): P(Iris) = 0.11, P(¬Iris) = 0.89
Frame 2 (stem seen, petals partial): P(Iris) = 0.2, P(¬Iris) = 0.8
Frame 3 (stem partial, petals partial): P(Iris) = 0.45, P(¬Iris) = 0.55
Frame 4 (stem hidden, petals seen): P(Iris) = 0.9, P(¬Iris) = 0.1
A decision made on the sequence of observations improves the decision available for each state.
Reber Grammar
A synthetic problem that cannot be solved without memory.
Learn to predict
next possible edges
Transitions have equal probabilities:
P(1→2) = P(1→3) = 0.5
[Figure: Reber grammar state graph — states (nodes) 1–6 with labeled transitions (edges); each pair of outgoing edges is taken with probability 0.5.]
Word | Step | Current node (Begin 1 2 3 4 5 6) | Possible paths (1 2 3 4 5 6 End)
B    |  0   | 1 0 0 0 0 0 0 | 1 0 0 0 0 0 0
P    |  1   | 0 1 0 0 0 0 0 | 0 1 1 0 0 0 0
T    |  2   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
T    |  3   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
T    |  4   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
T    |  5   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
T    |  6   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
V    |  7   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
P    |  8   | 0 0 0 0 0 1 0 | 0 0 0 1 0 1 0
X    |  9   | 0 0 0 0 1 0 0 | 0 0 1 0 0 1 0
T    | 10   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
T    | 11   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
T    | 12   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
T    | 13   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
V    | 14   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
V    | 15   | 0 0 0 0 0 1 0 | 0 0 0 1 0 1 0
E    | 16   | 0 0 0 0 0 0 1 | 0 0 0 0 0 0 1
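Generating training words from the grammar is straightforward. Below is a sketch using the common formulation of the Reber transition graph; the node numbering and edge symbols follow the standard description of the grammar and may not match the figure's labels exactly:

```python
import random

# One common formulation of the Reber grammar transition graph:
# from each node, two equally likely edges (symbol, next_node);
# node 6 emits the terminal E.
TRANSITIONS = {
    1: [("T", 2), ("P", 3)],
    2: [("S", 2), ("X", 4)],
    3: [("T", 3), ("V", 5)],
    4: [("X", 3), ("S", 6)],
    5: [("P", 4), ("V", 6)],
    6: [("E", None)],
}

def make_reber_string(rng=random):
    """Generate one valid Reber string, starting with B and ending with E."""
    out, node = ["B"], 1
    while node is not None:
        symbol, node = rng.choice(TRANSITIONS[node])
        out.append(symbol)
    return "".join(out)

print(make_reber_string())  # e.g. 'BTXSE' or 'BPVVE'
```

A network is then trained, one symbol at a time, to output which edges may follow, exactly as in the table above.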
The current-node column gives the network input vector x at time t, and the possible-paths column gives the target output vector y at time t (shown, e.g., for t = 2).
Memory is important → reasoning relies on experience.
Time-delay neural network
• An FFNN with delayed inputs
• No internal state
Pro: dependencies between features at different timestamps
Cons:
• Limited history of the input (< 10 timestamps)
• Delay values must be set explicitly
• Not general; cannot solve complex tasks (such as the Reber Grammar)
[Diagram: time-delay network — inputs x(t), x(t−1), x(t−2), x(t−3) pass through delay lines into the input layer, then the hidden layer, then the output layer producing 𝒚(t).]
RECURRENT NEURAL
NETWORK
ARCHITECTURES
Naïve attempt: simple recurrence
Feed the output back to the input (the past output state, through a 1-step delay).
[Diagram: input x(t) → input layer → hidden layer A → output layer → 𝒚(t), with the output fed back to the input through a 1-step delay.]
But… this does not work, because it is not stable: lacking feedback control, the obtained output 𝒚 drifts away from the expected 𝒚.
Introducing recurrence: Jordan recurrent network
M.I. Jordan, 1986
Output-to-hidden connections through a context layer (1-step delay) give limited short-term memory.
[Diagram: input layer → hidden layer → output layer 𝒚(t); the output feeds a context layer connected back to the hidden layer with a 1-step delay.]
Pro: fast to train, because it can be parallelized in time
Cons:
• The output transforms the hidden state → nonlinear effects, information is distorted
• The output dimension may be too small → information in the hidden states is truncated
Elman recurrent network
J.L. Elman, 1990
Often referenced as the basic RNN structure and called the “vanilla” RNN.
Hidden-to-hidden connections make the system Turing-complete.
[Diagram: input layer → hidden layer → output layer 𝒚(t); the hidden state feeds a context layer connected back to the hidden layer with a 1-step delay.]
• Must see the complete sequence to be trained
• Cannot be parallelized across timestamps
• Has some important training difficulties…
Vanilla RNN
𝑾𝑖ℎ – weight matrix from input to hidden
𝑼 – weight matrix from hidden to hidden
𝑾𝑜 – weight matrix from hidden to output
𝒃 – bias parameter vector
𝒙𝑡 – input (feature) vector at time t
𝒉𝑡 – network internal (hidden) state vector at time t
𝒚𝑡 – network output vector at time t

𝒉𝑡 = 𝜎(𝑾𝑖ℎ ∙ 𝒙𝑡 + 𝑼 ∙ 𝒉𝑡−1 + 𝒃)
𝒚𝑡 = 𝜎(𝑾𝑜 ∙ 𝒉𝑡)

Unfolding the network in time applies these same equations, with shared parameters, at every timestamp.
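The two equations translate directly into code. A minimal sketch (numpy, with tanh standing in for the generic 𝜎 and small random weights as placeholders for trained parameters):

```python
import numpy as np

def rnn_forward(xs, W_ih, U, W_o, b):
    """Unfold a vanilla (Elman) RNN over an input sequence.

    h_t = sigma(W_ih @ x_t + U @ h_{t-1} + b)
    y_t = sigma(W_o @ h_t)
    tanh plays the role of sigma here; the slides leave sigma generic.
    """
    h = np.zeros(U.shape[0])                 # initial hidden state h_{-1} = 0
    ys, hs = [], []
    for x in xs:
        h = np.tanh(W_ih @ x + U @ h + b)    # hidden state update
        hs.append(h)
        ys.append(np.tanh(W_o @ h))          # output at this timestamp
    return np.array(ys), np.array(hs)

rng = np.random.default_rng(1)
n_in, n_hid, n_out, T = 3, 5, 2, 4
xs = rng.normal(size=(T, n_in))
ys, hs = rnn_forward(xs,
                     rng.normal(size=(n_hid, n_in)) * 0.1,
                     rng.normal(size=(n_hid, n_hid)) * 0.1,
                     rng.normal(size=(n_out, n_hid)) * 0.1,
                     np.zeros(n_hid))
print(ys.shape, hs.shape)  # (4, 2) (4, 5)
```

The same weight matrices are applied at every step; unfolding in time only replicates this loop body.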
RNN TRAINING
Target: obtain the network parameters that optimize the cost function.
Cost functions: log loss, root mean squared error, etc.
Tasks:
• For each timestamp of the input sequence x, predict the output y (synchronously)
• For the input sequence x, predict a scalar value y (e.g., at the end of the sequence)
• For the input sequence x of length Lx, generate an output sequence y of a different length Ly
Methods:
• Backpropagation: reliable and controlled convergence; supported by most ML frameworks
• Research: evolutionary methods, expectation maximization, non-parametric methods, particle swarm optimization
RNN training
1. Unfold the network.
2. Repeat over the training data:
   1. Take an input sequence 𝒙.
   2. For t in 0 … N−1:
      1. Initialize the hidden state to its past value 𝒉𝑡−1.
      2. Forward-propagate and compute the next hidden state 𝒉𝑡.
   3. Obtain the output sequence 𝒚̂.
   4. Calculate the error 𝑬(𝒚, 𝒚̂).
   5. Back-propagate the error across the unfolded network.
   6. Average the weight updates over the timestamps.
𝒉𝑡 = 𝜎(𝑾𝑖ℎ ∙ 𝒙𝑡 + 𝑼 ∙ 𝒉𝑡−1 + 𝒃)
𝒚𝑡 = 𝜎(𝑾𝑜 ∙ 𝒉𝑡)
E.g., cross-entropy loss: 𝑬(𝒚, 𝒚̂) = −Σ𝑡 𝒚𝑡 ∙ log 𝒚̂𝑡
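The cross-entropy loss can be computed directly (natural log used here; the base of the slide's logarithm only changes a constant factor):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Sequence cross-entropy E(y, y_hat) = -sum_t y_t · log(y_hat_t),
    summed over timestamps and classes. eps guards against log(0)."""
    return -np.sum(y_true * np.log(y_pred + eps))

# Toy sequence of T = 3 timestamps, 2 classes (one-hot targets)
y_true = np.array([[1, 0], [0, 1], [1, 0]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
print(round(cross_entropy(y_true, y_pred), 3))  # 0.839
```

Only the predicted probability of the correct class at each timestamp contributes, so confident correct predictions drive the loss toward 0.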
Back-propagation through time
Apply the chain rule (𝜽 – the network parameters). For time t = 2:

∂𝑬₂/∂𝜽 = Σ_{k=0}^{2} (∂𝑬₂/∂𝒚₂) ∙ (∂𝒚₂/∂𝒉₂) ∙ (∂𝒉₂/∂𝒉ₖ) ∙ (∂𝒉ₖ/∂𝜽)

where, e.g., ∂𝒉₂/∂𝒉₀ = (∂𝒉₂/∂𝒉₁) ∙ (∂𝒉₁/∂𝒉₀)
Problem: vanishing gradients
Saturation: for saturated neurons the gradient is close to 0, which drives the gradients of previous layers to 0 (especially for far timestamps).
• Smaller weight parameters lead to faster gradient vanishing.
• Very big initial parameters make gradient descent diverge fast (explode).
This is a known problem for deep feed-forward networks; for recurrent networks (even shallow ones) it makes learning long-term dependencies impossible!

∂𝒉𝑡/∂𝒉₀ = (∂𝒉𝑡/∂𝒉𝑡−₁) ∙ ⋯ ∙ (∂𝒉₃/∂𝒉₂) ∙ (∂𝒉₂/∂𝒉₁) ∙ (∂𝒉₁/∂𝒉₀)

• The product decays exponentially with distance in time
• The network stops learning and cannot update
• It becomes impossible to learn correlations between temporally distant events
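The decay of this product of Jacobians can be observed numerically. In this sketch (tanh units, an arbitrary small random recurrent matrix U), each factor ∂𝒉ₖ/∂𝒉ₖ₋₁ = diag(1 − 𝒉ₖ²) ∙ U has norm well below 1, so the running product collapses:

```python
import numpy as np

# Numeric illustration of dh_t/dh_0 = prod_k dh_k/dh_{k-1} for tanh units:
# each Jacobian is diag(1 - h_k^2) @ U, and with small recurrent weights
# the norm of the product shrinks exponentially with t.
rng = np.random.default_rng(0)
n = 8
U = rng.normal(size=(n, n)) * 0.1    # small recurrent weights (illustrative)
h = np.zeros(n)
J = np.eye(n)                        # running product of Jacobians
norms = []
for t in range(30):
    h = np.tanh(U @ h + rng.normal(size=n))
    J = (np.diag(1.0 - h ** 2) @ U) @ J
    norms.append(np.linalg.norm(J))
print(norms[0], norms[-1])           # the norm collapses toward 0
```

With large weights the same product instead grows without bound, which is the exploding-gradient case discussed next.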
Problem: exploding gradients
The network cannot converge and the weight parameters do not stabilize.
Diagnostics: NaNs; large fluctuations of the cost function; a large increase in the norm of the gradient during training.
Pascanu R. et al., On the difficulty of training recurrent neural networks. arXiv (2012)
Solutions:
• Use gradient clipping
• Try reducing the learning rate
• Change the loss function by setting constraints on the weights (L1/L2 norms)
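Gradient clipping from the first bullet is essentially a one-liner: rescale the gradient whenever its norm exceeds a threshold (the threshold values here are illustrative):

```python
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    """Rescale grad so its L2 norm is at most max_norm (norm clipping).

    max_norm is a tuned hyperparameter in practice; 1.0 is only a default
    for illustration. The direction of the gradient is preserved.
    """
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])              # "exploding" gradient, norm 50
print(clip_gradient(g, max_norm=5.0))   # [3. 4.] — direction kept, norm 5
```

Clipping bounds the size of each update without changing its direction, which keeps training stable through the occasional gradient spike.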
Fundamental deep learning problem
Training difficulties of deep networks:
• Vanishing gradients
• Exploding gradients
Possible solutions:
• One of the previously proposed solutions, or
• Unsupervised pre-training → difficult to implement, and sometimes the unsupervised solution differs greatly from the supervised one, or
• Improve the network architecture!
NEW RNN ARCHITECTURES
Echo State Network
Herbert Jaeger, 2001
Only the readout neurons are trained!
In practice:
• Easy to over-fit (the model learns the training data by heart) – gives good results on the training data only
• Optimization of the reservoir hyper-parameters is not straightforward
Reservoir computing
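A minimal echo state network can be sketched as follows: a fixed random reservoir with spectral radius below 1, and only the linear readout trained (here by least squares). All sizes, the 0.9 spectral radius, and the sine next-step-prediction task are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, T = 50, 300
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius 0.9 (fixed)
W_in = rng.normal(size=n_res)                      # fixed random input weights

u = np.sin(np.linspace(0, 20, T))                  # input signal
x = np.zeros(n_res)
states = np.zeros((T, n_res))
for t in range(T):
    x = np.tanh(W @ x + W_in * u[t])               # reservoir update (never trained)
    states[t] = x

# Train ONLY the readout: least-squares fit to predict the next input value,
# discarding a washout of the first 50 states.
X, y = states[50:-1], u[51:]
W_out, *_ = np.linalg.lstsq(X, y, rcond=None)
mse = float(np.mean((X @ W_out - y) ** 2))
print(mse)  # small training error
```

Because the reservoir is fixed, "training" reduces to one linear regression, which is the appeal of the approach; the over-fitting caveat above is visible here too, since the error is measured on the training signal itself.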
Liquid state machine
Similar to the ESN, but uses more biologically plausible neuron models → spiking (dynamic) neurons.
In practice:
• Still more of a research area
• Requires special hardware to be computationally efficient
[Image credits: Daniel Brunner; Tal Dahan and Astar Sade]
Long short-term memory
S. Hochreiter & J. Schmidhuber, 1997
Thanks to its gating (routing) mechanism, it can be efficiently trained to learn LONG-TERM dependencies.
Variants:
• No input gate
• No forget gate
• No output gate
• No input activation function
• No output activation function
• No peepholes
• Coupled input and forget gates
• Full gate recurrence
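The standard gated cell can be sketched as follows (input, forget, and output gates, with peepholes omitted; the stacked-parameter layout is an implementation convenience, not notation from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell.

    W, U, b hold the stacked parameters of the four transforms
    (input gate i, forget gate f, output gate o, candidate g).
    """
    n = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0 * n:1 * n])      # input gate: admit new information
    f = sigmoid(z[1 * n:2 * n])      # forget gate: decay the old cell state
    o = sigmoid(z[2 * n:3 * n])      # output gate: expose the cell state
    g = np.tanh(z[3 * n:4 * n])      # candidate cell update
    c = f * c + i * g                # gated (additive) cell-state update
    h = o * np.tanh(c)               # new hidden state
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(size=(4 * n_hid, n_in)) * 0.1
U = rng.normal(size=(4 * n_hid, n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):     # run over a short sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```

The additive cell-state update `c = f * c + i * g` is the key: gradients can flow through it over many timestamps without the repeated squashing that makes vanilla RNN gradients vanish.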
Has context in both directions, at any timestamp
Bidirectional RNN
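A bidirectional wrapper only needs the forward passes of two RNNs; a shape-level sketch (the toy cumulative-tanh "RNNs" are stand-ins for real recurrent layers):

```python
import numpy as np

def bidirectional(xs, forward_rnn, backward_rnn):
    """Run one RNN left-to-right and another right-to-left over the same
    sequence, then concatenate their hidden states per timestamp.

    Each *_rnn is assumed to map a (T, n_in) sequence to (T, n_hidden) states.
    """
    h_fwd = forward_rnn(xs)
    h_bwd = backward_rnn(xs[::-1])[::-1]   # reverse the input, re-align outputs
    return np.concatenate([h_fwd, h_bwd], axis=1)

# Toy stand-in "RNN": cumulative tanh state, just to show shapes and alignment
toy = lambda xs: np.tanh(np.cumsum(xs, axis=0))
out = bidirectional(np.ones((5, 3)), toy, toy)
print(out.shape)  # (5, 6)
```

Every timestamp of the combined state thus sees the whole sequence: the forward half summarizes the past, the backward half the future.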
Embedded Reber Grammar
Tests the capacity to maintain long-term dependencies: the (first+1)-th symbol must reappear as the (last−1)-th symbol.
Examples: BPXXXXXPE, BTXXXXXXXXTE
Correct cases: BT ….. TE, BP ….. PE
Incorrect cases: BT ….. PE, BP ….. TE
The system must be able to learn to compare the (first+1)-th symbol with the (last−1)-th symbol.
PRACTICAL
CONSIDERATIONS
Masking the input (output): the input (output) has variable length within a data batch.
When the length of the input ≠ the length of the output:
• CTC loss function
• Encoder-decoder architecture
CTC transforms the network outputs into a conditional probability distribution over label sequences by introducing a BLANK label “-”: e.g., - C - A - T -
Result decoding:
Raw output: -----CCCC---AA-TTTT---
1) Remove repeating symbols: -C-A-T-
2) Remove blanks: CAT
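The two decoding steps are straightforward to implement:

```python
def ctc_decode(raw, blank="-"):
    """Greedy CTC decoding as on the slide:
    1) collapse runs of repeated symbols, 2) drop blanks."""
    collapsed = []
    prev = None
    for ch in raw:
        if ch != prev:               # keep only the first symbol of each run
            collapsed.append(ch)
        prev = ch
    return "".join(ch for ch in collapsed if ch != blank)

print(ctc_decode("-----CCCC---AA-TTTT---"))  # CAT
```

The blank label is what lets the network emit genuine double letters: "HELLO" survives decoding only because a blank separates the two L runs.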
NEURAL MODELS FOR
LOCOMOTION
Locomotion principles in nature
[S. Roland et al., 2004]
Locomotion: movement, or the ability to move from one place to another.
Manipulation ≠ Locomotion
Aperiodic series of motions: stable. Periodic motion (gaits): quasi-stable. [A. Ijspeert et al., 2007]
Wheeled locomotion on soft ground [S. Roland et al., 2004]
Locomotion efficiency
Nature has no “pure” wheeled locomotion.
Reason: the variety of surfaces and rough terrain makes adaptation necessary.
Biological locomotion exploits patterns.
The number of legs influences:
• Mechanical complexity
• Control complexity
• The number of generated patterns (for 6 legs, N = (2k − 1)! = 11! = 39 916 800)
[S. Roland 2004]
Locomotion efficiency
• Gait control is on “automatic pilot”
• Automatic gait is energy efficient
• A perturbation introduces a modification
Still not fully nature’s way (weak adaptation, no decisions).
How does nature deal with locomotion?
- Initiate motion by putting in energy
- Passive stage
- Generate motion
- Control for stability
- Repeat
- The brain?
- The nervous system?
- The spinal cord?
Inconceivable automation: the complexity of the phenomena involved in motor control.
Central Nervous System → Motor Nervous System → Neuromuscular Junction
Models of the musculoskeletal system … Models of the Motor Nervous System
[Excerpts: Univ. du Québec – ETS (course); Collège de France (L. Damn); Univ. Paris 8, Licence course L.612]
Spinal cord
[P. Hénaff 2013]
Biological motor control
Motor unit (MU): an MU aggregates the muscular fibers innervated by a common motor neuron; contraction of these fibers is therefore simultaneous.
[Diagram: sensory nerve, motor nerve, dorsal root, posterior horn, anterior horn, ventral root, neuro-muscular fiber]
Reflexes: pathways
Muscle contraction as a response to its own elongation; muscle contraction as a response to external stimuli.
[P. Hénaff 2013]
Central Pattern Generator
• Automatic activity is controlled by spinal centers
• A CPG (Central Pattern Generator) is a group of synaptic connections that generates rhythmic motions
• The spinal pattern-generating networks do not require sensory input but are nevertheless strongly regulated by input from limb proprioceptors
Sensory-motor architecture for locomotion [McCrea 2006]
Biological sensory-motor architecture models
How learning occurs
Muscular contraction is established during embryonic life or after birth:
• Insects can walk immediately upon birth
• Most mammals require several minutes to stand
• Humans require more than a year to walk on two legs
[ejjack2]
Mathematical modeling of CPG
CPG approximation and limit cycle behavior [J. Nassour et al. 2010; P.F. Rowat, A.I. Selverston 1997]
Gait matrix; coupling of different CPGs; sensory feedback
Hopf oscillator
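A Hopf oscillator, a common CPG building block, can be integrated with a few lines of Euler stepping; trajectories converge to a stable limit cycle of radius √μ, giving a robust rhythmic signal. The values of μ, ω, and dt below are illustrative:

```python
import numpy as np

# Euler-integration sketch of a Hopf oscillator:
#   dx/dt = (mu - r^2) x - omega y
#   dy/dt = (mu - r^2) y + omega x,   r^2 = x^2 + y^2
# The radial term (mu - r^2) pulls any trajectory onto the circle r = sqrt(mu),
# while omega sets the oscillation frequency.
mu, omega, dt = 1.0, 2.0 * np.pi, 1e-3
x, y = 0.1, 0.0                       # start near the unstable fixed point
for _ in range(20000):                # 20 simulated seconds
    r2 = x * x + y * y
    dx = (mu - r2) * x - omega * y
    dy = (mu - r2) * y + omega * x
    x, y = x + dt * dx, y + dt * dy
print(round(float(np.hypot(x, y)), 3))  # radius settles near sqrt(mu) = 1.0
```

This convergence to the limit cycle from (almost) any initial condition is what makes such oscillators attractive for gait generation: perturbations decay and the rhythm recovers on its own.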
Neural controllers
Neural-based CPG controller for biped locomotion [Taga 1995]:
• 1 CPG per joint, including a CPG for the trunk
• 2 coupled neurons per CPG (neuron model: Matsuoka 1985)
• Inhibitions: ipsilateral and contralateral connections
• Sensorimotor integration
Internal coupling of the network; articular sensory inputs: speeds, forces, ground contact.
Excerpt from Taga 1995 (Biol. Cyb.) [P. Hénaff 2013]
Compensation of articulation defects
Temporal evaluation of the frequency components of the sagittal acceleration of the robot’s pelvis:
• Automatically determines the robot’s natural frequencies
• Continuously adapts to the evolution of defects
Phase portraits of the oscillator: without coupling vs. with coupling (synchronous after learning).
ROBIAN biped, LISV, UVSQ [V. Khomenko, 2013, LISV, UVSQ, France]
APPLICATION OF
RECURRENT NEURAL
NETWORKS
• Human-computer interaction
– Speech and handwriting recognition
– Music composition
– Activity recognition
• Identification and control
– Identification and control of dynamic systems by learning
– Biologically inspired robotics for adaptive locomotion
– Study of the formation and evaluation of biological pattern structures
Application of RNNs