Recurrent Neural Networks
Viacheslav Khomenko, Ph.D.
Contents
Recap: feed-forward artificial neural network
Temporal dependencies
Recurrent neural network architectures
RNN training
New RNN architectures
Practical considerations
Neural models for locomotion
Application of RNNs
RECAP: FEED-FORWARD
ARTIFICIAL NEURAL
NETWORK
Feed-forward network
W. McCulloch and W. Pitts, 1940s: abstract mathematical model of a brain cell
Perceptron for classification: F. Rosenblatt, 1958
Multi-layer artificial neural network: P. Werbos, 1975
[Figure: feed-forward network for Iris flower classification. Input layer features: petals, sepal, yellow patch, veins; hidden layer(s); output layer with two decisions: Iris / ¬Iris.]
Feed-forward network
Decisions are based only on current inputs:
• No memory of the past
• No view of the future
[Diagram: input x → input layer → hidden layer(s) A → output layer → decision output 𝒚]
Simplified representation:
Vector of input features: x
Vector of predicted values: 𝒚
Neural activation:
A – some activation function (tanh, etc.)
𝑤, 𝑏 – network parameters
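A feed-forward pass like the one sketched above can be written in a few lines. This is a minimal illustration only (numpy, tanh standing in for the generic activation A, random weights standing in for trained parameters 𝑤, 𝑏):

```python
import numpy as np

def feed_forward(x, w, b):
    """Single feed-forward layer: activation A applied to w·x + b.

    Illustrative shapes: w is (n_out, n_in), x is (n_in,), b is (n_out,).
    tanh stands in for the generic activation A from the slide.
    """
    return np.tanh(w @ x + b)

# Toy iris-style example: 4 input features -> 3 hidden units -> 2 outputs
rng = np.random.default_rng(0)
x = np.array([0.5, 0.1, 0.9, 0.3])           # petals, sepal, yellow patch, veins
w1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
w2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
h = feed_forward(x, w1, b1)                  # hidden layer
y = feed_forward(h, w2, b2)                  # output layer (Iris / ¬Iris scores)
print(y.shape)  # (2,)
```

Note that nothing in this pass depends on previous inputs: the network has no state, which is exactly the limitation the next sections address.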
TEMPORAL
DEPENDENCIES
Temporal dependencies
Analyzing temporal dependencies
Per-frame classification of a partially observed flower:
Frame 0 (stem seen, petals hidden): P(Iris) = 0.1, P(¬Iris) = 0.9
Frame 1 (stem seen, petals hidden): P(Iris) = 0.11, P(¬Iris) = 0.89
Frame 2 (stem seen, petals partial): P(Iris) = 0.2, P(¬Iris) = 0.8
Frame 3 (stem partial, petals partial): P(Iris) = 0.45, P(¬Iris) = 0.55
Frame 4 (stem hidden, petals seen): P(Iris) = 0.9, P(¬Iris) = 0.1
A decision made on the sequence of observations improves the decision available for each state.
Reber Grammar
A synthetic problem that cannot be solved without memory.
Learn to predict
next possible edges
Transitions have equal probabilities:
P(1→2) = P(1→3) = 0.5
[Figure: Reber grammar state graph — states (nodes) 1–6 with labeled transitions (edges); each pair of outgoing edges is taken with probability 0.5.]
Word | Step | Current node (Begin 1 2 3 4 5 6) | Possible paths (1 2 3 4 5 6 End)
B    |  0   | 1 0 0 0 0 0 0 | 1 0 0 0 0 0 0
P    |  1   | 0 1 0 0 0 0 0 | 0 1 1 0 0 0 0
T    |  2   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
T    |  3   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
T    |  4   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
T    |  5   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
T    |  6   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
V    |  7   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
P    |  8   | 0 0 0 0 0 1 0 | 0 0 0 1 0 1 0
X    |  9   | 0 0 0 0 1 0 0 | 0 0 1 0 0 1 0
T    | 10   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
T    | 11   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
T    | 12   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
T    | 13   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
V    | 14   | 0 0 0 1 0 0 0 | 0 0 1 0 1 0 0
V    | 15   | 0 0 0 0 0 1 0 | 0 0 0 1 0 1 0
E    | 16   | 0 0 0 0 0 0 1 | 0 0 0 0 0 0 1
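Generating training words from the grammar is straightforward. Below is a sketch using the common formulation of the Reber transition graph; the node numbering and edge symbols follow the standard description of the grammar and may not match the figure's labels exactly:

```python
import random

# One common formulation of the Reber grammar transition graph:
# from each node, two equally likely edges (symbol, next_node);
# node 6 emits the terminal E.
TRANSITIONS = {
    1: [("T", 2), ("P", 3)],
    2: [("S", 2), ("X", 4)],
    3: [("T", 3), ("V", 5)],
    4: [("X", 3), ("S", 6)],
    5: [("P", 4), ("V", 6)],
    6: [("E", None)],
}

def make_reber_string(rng=random):
    """Generate one valid Reber string, starting with B and ending with E."""
    out, node = ["B"], 1
    while node is not None:
        symbol, node = rng.choice(TRANSITIONS[node])
        out.append(symbol)
    return "".join(out)

print(make_reber_string())  # e.g. 'BTXSE' or 'BPVVE'
```

A network is then trained, one symbol at a time, to output which edges may follow, exactly as in the table above.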
The current-node column gives the network input vector x at time t, and the possible-paths column gives the target output vector y at time t (shown, e.g., for t = 2).
Memory is important → reasoning relies on experience.
Time-delay neural network
• An FFNN with delayed inputs
• No internal state
Pro: dependencies between features at different timestamps
Cons:
• Limited history of the input (< 10 timestamps)
• Delay values must be set explicitly
• Not general; cannot solve complex tasks (such as the Reber Grammar)
[Diagram: time-delay network — inputs x(t), x(t−1), x(t−2), x(t−3) pass through delay lines into the input layer, then the hidden layer, then the output layer producing 𝒚(t).]
RECURRENT NEURAL
NETWORK
ARCHITECTURES
Naïve attempt: simple recurrence
Feed the output back to the input (the past output state, through a 1-step delay).
[Diagram: input x(t) → input layer → hidden layer A → output layer → 𝒚(t), with the output fed back to the input through a 1-step delay.]
But… this does not work, because it is not stable: lacking feedback control, the obtained output 𝒚 drifts away from the expected 𝒚.
Introducing recurrence: Jordan recurrent network
M.I. Jordan, 1986
Output-to-hidden connections through a context layer (1-step delay) give limited short-term memory.
[Diagram: input layer → hidden layer → output layer 𝒚(t); the output feeds a context layer connected back to the hidden layer with a 1-step delay.]
Pro: fast to train, because it can be parallelized in time
Cons:
• The output transforms the hidden state → nonlinear effects, information is distorted
• The output dimension may be too small → information in the hidden states is truncated
Elman recurrent network
J.L. Elman, 1990
Often referenced as the basic RNN structure and called the “vanilla” RNN.
Hidden-to-hidden connections make the system Turing-complete.
[Diagram: input layer → hidden layer → output layer 𝒚(t); the hidden state feeds a context layer connected back to the hidden layer with a 1-step delay.]
• Must see the complete sequence to be trained
• Cannot be parallelized across timestamps
• Has some important training difficulties…
Vanilla RNN
𝑾𝑖ℎ – weight matrix from input to hidden
𝑼 – weight matrix from hidden to hidden
𝑾𝑜 – weight matrix from hidden to output
𝒃 – bias parameter vector
𝒙𝑡 – input (feature) vector at time t
𝒉𝑡 – network internal (hidden) state vector at time t
𝒚𝑡 – network output vector at time t

𝒉𝑡 = 𝜎(𝑾𝑖ℎ ∙ 𝒙𝑡 + 𝑼 ∙ 𝒉𝑡−1 + 𝒃)
𝒚𝑡 = 𝜎(𝑾𝑜 ∙ 𝒉𝑡)

Unfolding the network in time applies these same equations, with shared parameters, at every timestamp.
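The two equations translate directly into code. A minimal sketch (numpy, with tanh standing in for the generic 𝜎 and small random weights as placeholders for trained parameters):

```python
import numpy as np

def rnn_forward(xs, W_ih, U, W_o, b):
    """Unfold a vanilla (Elman) RNN over an input sequence.

    h_t = sigma(W_ih @ x_t + U @ h_{t-1} + b)
    y_t = sigma(W_o @ h_t)
    tanh plays the role of sigma here; the slides leave sigma generic.
    """
    h = np.zeros(U.shape[0])                 # initial hidden state h_{-1} = 0
    ys, hs = [], []
    for x in xs:
        h = np.tanh(W_ih @ x + U @ h + b)    # hidden state update
        hs.append(h)
        ys.append(np.tanh(W_o @ h))          # output at this timestamp
    return np.array(ys), np.array(hs)

rng = np.random.default_rng(1)
n_in, n_hid, n_out, T = 3, 5, 2, 4
xs = rng.normal(size=(T, n_in))
ys, hs = rnn_forward(xs,
                     rng.normal(size=(n_hid, n_in)) * 0.1,
                     rng.normal(size=(n_hid, n_hid)) * 0.1,
                     rng.normal(size=(n_out, n_hid)) * 0.1,
                     np.zeros(n_hid))
print(ys.shape, hs.shape)  # (4, 2) (4, 5)
```

The same weight matrices are applied at every step; unfolding in time only replicates this loop body.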
RNN TRAINING
Target: obtain the network parameters that optimize the cost function.
Cost functions: log loss, root mean squared error, etc.
Tasks:
• For each timestamp of the input sequence x, predict the output y (synchronously)
• For the input sequence x, predict a scalar value y (e.g., at the end of the sequence)
• For the input sequence x of length Lx, generate an output sequence y of a different length Ly
Methods:
• Backpropagation: reliable and controlled convergence; supported by most ML frameworks
• Research: evolutionary methods, expectation maximization, non-parametric methods, particle swarm optimization
RNN training
1. Unfold the network.
2. Repeat over the training data:
   1. Take an input sequence 𝒙.
   2. For t in 0 … N−1:
      1. Initialize the hidden state to its past value 𝒉𝑡−1.
      2. Forward-propagate and compute the next hidden state 𝒉𝑡.
   3. Obtain the output sequence 𝒚̂.
   4. Calculate the error 𝑬(𝒚, 𝒚̂).
   5. Back-propagate the error across the unfolded network.
   6. Average the weight updates over the timestamps.
𝒉𝑡 = 𝜎(𝑾𝑖ℎ ∙ 𝒙𝑡 + 𝑼 ∙ 𝒉𝑡−1 + 𝒃)
𝒚𝑡 = 𝜎(𝑾𝑜 ∙ 𝒉𝑡)
E.g., cross-entropy loss: 𝑬(𝒚, 𝒚̂) = −Σ𝑡 𝒚𝑡 ∙ log 𝒚̂𝑡
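The cross-entropy loss can be computed directly (natural log used here; the base of the slide's logarithm only changes a constant factor):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Sequence cross-entropy E(y, y_hat) = -sum_t y_t · log(y_hat_t),
    summed over timestamps and classes. eps guards against log(0)."""
    return -np.sum(y_true * np.log(y_pred + eps))

# Toy sequence of T = 3 timestamps, 2 classes (one-hot targets)
y_true = np.array([[1, 0], [0, 1], [1, 0]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
print(round(cross_entropy(y_true, y_pred), 3))  # 0.839
```

Only the predicted probability of the correct class at each timestamp contributes, so confident correct predictions drive the loss toward 0.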
Back-propagation through time
Apply the chain rule (𝜽 – the network parameters). For time t = 2:

∂𝑬₂/∂𝜽 = Σ_{k=0}^{2} (∂𝑬₂/∂𝒚₂) ∙ (∂𝒚₂/∂𝒉₂) ∙ (∂𝒉₂/∂𝒉ₖ) ∙ (∂𝒉ₖ/∂𝜽)

where, e.g., ∂𝒉₂/∂𝒉₀ = (∂𝒉₂/∂𝒉₁) ∙ (∂𝒉₁/∂𝒉₀)
Problem: vanishing gradients
Saturation: for saturated neurons the gradient is close to 0, which drives the gradients of previous layers to 0 (especially for far timestamps).
• Smaller weight parameters lead to faster gradient vanishing.
• Very big initial parameters make gradient descent diverge fast (explode).
This is a known problem for deep feed-forward networks; for recurrent networks (even shallow ones) it makes learning long-term dependencies impossible!

∂𝒉𝑡/∂𝒉₀ = (∂𝒉𝑡/∂𝒉𝑡−₁) ∙ ⋯ ∙ (∂𝒉₃/∂𝒉₂) ∙ (∂𝒉₂/∂𝒉₁) ∙ (∂𝒉₁/∂𝒉₀)

• The product decays exponentially with distance in time
• The network stops learning and cannot update
• It becomes impossible to learn correlations between temporally distant events
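The decay of this product of Jacobians can be observed numerically. In this sketch (tanh units, an arbitrary small random recurrent matrix U), each factor ∂𝒉ₖ/∂𝒉ₖ₋₁ = diag(1 − 𝒉ₖ²) ∙ U has norm well below 1, so the running product collapses:

```python
import numpy as np

# Numeric illustration of dh_t/dh_0 = prod_k dh_k/dh_{k-1} for tanh units:
# each Jacobian is diag(1 - h_k^2) @ U, and with small recurrent weights
# the norm of the product shrinks exponentially with t.
rng = np.random.default_rng(0)
n = 8
U = rng.normal(size=(n, n)) * 0.1    # small recurrent weights (illustrative)
h = np.zeros(n)
J = np.eye(n)                        # running product of Jacobians
norms = []
for t in range(30):
    h = np.tanh(U @ h + rng.normal(size=n))
    J = (np.diag(1.0 - h ** 2) @ U) @ J
    norms.append(np.linalg.norm(J))
print(norms[0], norms[-1])           # the norm collapses toward 0
```

With large weights the same product instead grows without bound, which is the exploding-gradient case discussed next.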
Problem: exploding gradients
The network cannot converge and the weight parameters do not stabilize.
Diagnostics: NaNs; large fluctuations of the cost function; a large increase in the norm of the gradient during training.
Pascanu R. et al., On the difficulty of training recurrent neural networks. arXiv (2012)
Solutions:
• Use gradient clipping
• Try reducing the learning rate
• Change the loss function by setting constraints on the weights (L1/L2 norms)
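Gradient clipping from the first bullet is essentially a one-liner: rescale the gradient whenever its norm exceeds a threshold (the threshold values here are illustrative):

```python
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    """Rescale grad so its L2 norm is at most max_norm (norm clipping).

    max_norm is a tuned hyperparameter in practice; 1.0 is only a default
    for illustration. The direction of the gradient is preserved.
    """
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])              # "exploding" gradient, norm 50
print(clip_gradient(g, max_norm=5.0))   # [3. 4.] — direction kept, norm 5
```

Clipping bounds the size of each update without changing its direction, which keeps training stable through the occasional gradient spike.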
Fundamental deep learning problem
Training difficulties of deep networks:
• Vanishing gradients
• Exploding gradients
Possible solutions:
• One of the previously proposed solutions, or
• Unsupervised pre-training → difficult to implement, and sometimes the unsupervised solution differs greatly from the supervised one, or
• Improve the network architecture!
NEW RNN ARCHITECTURES
Echo State Network
Herbert Jaeger, 2001
Only the readout neurons are trained!
In practice:
• Easy to over-fit (the model learns the training data by heart) – gives good results on the training data only
• Optimization of the reservoir hyper-parameters is not straightforward
Reservoir computing
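A minimal echo state network can be sketched as follows: a fixed random reservoir with spectral radius below 1, and only the linear readout trained (here by least squares). All sizes, the 0.9 spectral radius, and the sine next-step-prediction task are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, T = 50, 300
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius 0.9 (fixed)
W_in = rng.normal(size=n_res)                      # fixed random input weights

u = np.sin(np.linspace(0, 20, T))                  # input signal
x = np.zeros(n_res)
states = np.zeros((T, n_res))
for t in range(T):
    x = np.tanh(W @ x + W_in * u[t])               # reservoir update (never trained)
    states[t] = x

# Train ONLY the readout: least-squares fit to predict the next input value,
# discarding a washout of the first 50 states.
X, y = states[50:-1], u[51:]
W_out, *_ = np.linalg.lstsq(X, y, rcond=None)
mse = float(np.mean((X @ W_out - y) ** 2))
print(mse)  # small training error
```

Because the reservoir is fixed, "training" reduces to one linear regression, which is the appeal of the approach; the over-fitting caveat above is visible here too, since the error is measured on the training signal itself.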
Liquid state machine
Similar to the ESN, but uses more biologically plausible neuron models → spiking (dynamic) neurons.
In practice:
• Still more of a research area
• Requires special hardware to be computationally efficient
[Image credits: Daniel Brunner; Tal Dahan and Astar Sade]
Long short-term memory
S. Hochreiter & J. Schmidhuber, 1997
Thanks to its gating (routing) mechanism, it can be efficiently trained to learn LONG-TERM dependencies.
Variants:
• No input gate
• No forget gate
• No output gate
• No input activation function
• No output activation function
• No peepholes
• Coupled input and forget gates
• Full gate recurrence
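The standard gated cell can be sketched as follows (input, forget, and output gates, with peepholes omitted; the stacked-parameter layout is an implementation convenience, not notation from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell.

    W, U, b hold the stacked parameters of the four transforms
    (input gate i, forget gate f, output gate o, candidate g).
    """
    n = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0 * n:1 * n])      # input gate: admit new information
    f = sigmoid(z[1 * n:2 * n])      # forget gate: decay the old cell state
    o = sigmoid(z[2 * n:3 * n])      # output gate: expose the cell state
    g = np.tanh(z[3 * n:4 * n])      # candidate cell update
    c = f * c + i * g                # gated (additive) cell-state update
    h = o * np.tanh(c)               # new hidden state
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(size=(4 * n_hid, n_in)) * 0.1
U = rng.normal(size=(4 * n_hid, n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):     # run over a short sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```

The additive cell-state update `c = f * c + i * g` is the key: gradients can flow through it over many timestamps without the repeated squashing that makes vanilla RNN gradients vanish.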
Has context in both directions, at any timestamp
Bidirectional RNN
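A bidirectional wrapper only needs the forward passes of two RNNs; a shape-level sketch (the toy cumulative-tanh "RNNs" are stand-ins for real recurrent layers):

```python
import numpy as np

def bidirectional(xs, forward_rnn, backward_rnn):
    """Run one RNN left-to-right and another right-to-left over the same
    sequence, then concatenate their hidden states per timestamp.

    Each *_rnn is assumed to map a (T, n_in) sequence to (T, n_hidden) states.
    """
    h_fwd = forward_rnn(xs)
    h_bwd = backward_rnn(xs[::-1])[::-1]   # reverse the input, re-align outputs
    return np.concatenate([h_fwd, h_bwd], axis=1)

# Toy stand-in "RNN": cumulative tanh state, just to show shapes and alignment
toy = lambda xs: np.tanh(np.cumsum(xs, axis=0))
out = bidirectional(np.ones((5, 3)), toy, toy)
print(out.shape)  # (5, 6)
```

Every timestamp of the combined state thus sees the whole sequence: the forward half summarizes the past, the backward half the future.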
Embedded Reber Grammar
Tests the capacity to maintain long-term dependencies: the (first+1)-th symbol must reappear as the (last−1)-th symbol.
Examples: BPXXXXXPE, BTXXXXXXXXTE
Correct cases: BT ….. TE, BP ….. PE
Incorrect cases: BT ….. PE, BP ….. TE
The system must be able to learn to compare the (first+1)-th symbol with the (last−1)-th symbol.
PRACTICAL
CONSIDERATIONS
Masking the input (output): the input (output) has variable length within a data batch.
When the length of the input ≠ the length of the output:
• CTC loss function
• Encoder-decoder architecture
CTC transforms the network outputs into a conditional probability distribution over label sequences by introducing a BLANK label “-”: e.g., - C - A - T -
Result decoding:
Raw output: -----CCCC---AA-TTTT---
1) Remove repeating symbols: -C-A-T-
2) Remove blanks: CAT
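The two decoding steps are straightforward to implement:

```python
def ctc_decode(raw, blank="-"):
    """Greedy CTC decoding as on the slide:
    1) collapse runs of repeated symbols, 2) drop blanks."""
    collapsed = []
    prev = None
    for ch in raw:
        if ch != prev:               # keep only the first symbol of each run
            collapsed.append(ch)
        prev = ch
    return "".join(ch for ch in collapsed if ch != blank)

print(ctc_decode("-----CCCC---AA-TTTT---"))  # CAT
```

The blank label is what lets the network emit genuine double letters: "HELLO" survives decoding only because a blank separates the two L runs.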
NEURAL MODELS FOR
LOCOMOTION
Locomotion principles in nature
[S. Roland et al., 2004]
Locomotion: movement, or the ability to move from one place to another.
Manipulation ≠ Locomotion
Aperiodic series of motions: stable. Periodic motion (gaits): quasi-stable. [A. Ijspeert et al., 2007]
Wheeled locomotion on soft ground [S. Roland et al., 2004]
Locomotion efficiency
Nature has no “pure” wheeled locomotion.
Reason: the variety of surfaces and rough terrain makes adaptation necessary.
Biological locomotion exploits patterns.
The number of legs influences:
• Mechanical complexity
• Control complexity
• The number of generated patterns (for 6 legs, N = (2k − 1)! = 11! = 39 916 800)
[S. Roland 2004]
Locomotion efficiency
• Gait control is on “automatic pilot”
• Automatic gait is energy efficient
• A perturbation introduces a modification
Still not fully nature’s way (weak adaptation, no decisions).
How does nature deal with locomotion?
- Initiate motion by putting in energy
- Passive stage
- Generate motion
- Control for stability
- Repeat
- The brain?
- The nervous system?
- The spinal cord?
Inconceivable automation: the complexity of the phenomena involved in motor control.
Central Nervous System → Motor Nervous System → Neuromuscular Junction
Models of the musculoskeletal system … Models of the Motor Nervous System
[Excerpts: Univ. du Québec – ETS (course); Collège de France (L. Damn); Univ. Paris 8, Licence course L.612]
Spinal cord
[P. Hénaff 2013]
Biological motor control
Motor unit (MU): an MU aggregates the muscular fibers innervated by a common motor neuron; contraction of these fibers is therefore simultaneous.
[Diagram: sensory nerve, motor nerve, dorsal root, posterior horn, anterior horn, ventral root, neuro-muscular fiber]
Reflexes: pathways
Muscle contraction as a response to its own elongation; muscle contraction as a response to external stimuli.
[P. Hénaff 2013]
Central Pattern Generator
• Automatic activity is controlled by spinal centers
• A CPG (Central Pattern Generator) is a group of synaptic connections that generates rhythmic motions
• The spinal pattern-generating networks do not require sensory input but are nevertheless strongly regulated by input from limb proprioceptors
Sensory-motor architecture for locomotion [McCrea 2006]
Biological sensory-motor architecture models
How learning occurs
Muscular contraction is established during embryonic life or after birth:
• Insects can walk immediately upon birth
• Most mammals require several minutes to stand
• Humans require more than a year to walk on two legs
[ejjack2]
Mathematical modeling of CPG
CPG approximation and limit cycle behavior [J. Nassour et al. 2010; P.F. Rowat, A.I. Selverston 1997]
Gait matrix; coupling of different CPGs; sensory feedback
Hopf oscillator
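A Hopf oscillator, a common CPG building block, can be integrated with a few lines of Euler stepping; trajectories converge to a stable limit cycle of radius √μ, giving a robust rhythmic signal. The values of μ, ω, and dt below are illustrative:

```python
import numpy as np

# Euler-integration sketch of a Hopf oscillator:
#   dx/dt = (mu - r^2) x - omega y
#   dy/dt = (mu - r^2) y + omega x,   r^2 = x^2 + y^2
# The radial term (mu - r^2) pulls any trajectory onto the circle r = sqrt(mu),
# while omega sets the oscillation frequency.
mu, omega, dt = 1.0, 2.0 * np.pi, 1e-3
x, y = 0.1, 0.0                       # start near the unstable fixed point
for _ in range(20000):                # 20 simulated seconds
    r2 = x * x + y * y
    dx = (mu - r2) * x - omega * y
    dy = (mu - r2) * y + omega * x
    x, y = x + dt * dx, y + dt * dy
print(round(float(np.hypot(x, y)), 3))  # radius settles near sqrt(mu) = 1.0
```

This convergence to the limit cycle from (almost) any initial condition is what makes such oscillators attractive for gait generation: perturbations decay and the rhythm recovers on its own.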
Neural controllers
Neural-based CPG controller for biped locomotion [Taga 1995]:
• 1 CPG per joint, including a CPG for the trunk
• 2 coupled neurons per CPG (neuron model: Matsuoka 1985)
• Inhibitions: ipsilateral and contralateral connections
• Sensorimotor integration
Internal coupling of the network; articular sensory inputs: speeds, forces, ground contact.
Excerpt from Taga 1995 (Biol. Cyb.) [P. Hénaff 2013]
Compensation of articulation defects
Temporal evaluation of the frequency components of the sagittal acceleration of the robot’s pelvis:
• Automatically determines the robot’s natural frequencies
• Continuously adapts to the evolution of defects
Phase portraits of the oscillator: without coupling vs. with coupling (synchronous after learning).
ROBIAN biped, LISV, UVSQ [V. Khomenko, 2013, LISV, UVSQ, France]
APPLICATION OF
RECURRENT NEURAL
NETWORKS
• Human-computer interaction
– Speech and handwriting recognition
– Music composition
– Activity recognition
• Identification and control
– Identification and control of dynamic systems by learning
– Biologically inspired robotics for adaptive locomotion
– Study of the formation and evaluation of biological pattern structures
Application of RNNs