Echo State Networks -...
Transcript of Echo State Networks -...
Reservoir Computing
and
Echo State Networks
An Introduction to:
Claudio Gallicchio
Outline
• Focus: Supervised learning in domain of sequences
• Recurrent Neural networks for supervised learning in domains of
sequences
• Reservoir Computing: paradigm for efficient training of Recurrent
Neural Networks
• Echo State Network model theory and applications
Notation used for Tasks on Sequence Domains (Supervised Learning)
Input Sequence → Output Sequence
One output vector for each input vector
E.g. Sequence Transdution, Next Step Prediction
Input Sequence → Output Vector
One output vector for each input sequence
E.g. Sequence classification
time
Example or sample:
Example or sample:
Sequence Processing
Notation: in the following slides, the variable n denotes the time step.
• NN for processing temporal data exploit the idea of representing (more or
less explicitely) the past input context in which new input information is
observed
• Basic approaches: windowing strategies, feedback connections.
Neural Networks for Learning in Sequence Domains
𝑢(𝑛 − 𝐷𝑈)
𝑢(𝑛 − 1)
𝑢(𝑛)
Input TDL
of order
𝐷𝑈
.
.
.
𝑓 ∙
Multi Layer
Perceptron
𝑦(𝑛)
Input Delay Neural Networks (IDNN)
𝒚 𝑛 = 𝑓(𝒖 𝑛 − 𝐷𝑈 , … , 𝒖 𝑛 − 1 , 𝒖 𝑛 ),
• Pro: simplicity, training using
Back-propagation
• Con: fixed window size.
Temporal context represented using feed-forward neural networks + input windowing
Recurrent Neural Networks (RNNs)
Elman Network (Simple Recurrent Network)
• Pro: theoretically very
powerful; Universal
approximation through
training
• Con: drawbacks related to
training
• Neural network architectures with explicit recurrent connections
• Feedback allows the representation of temporal context of state information (neural
memory) – implement dynamical systems
• Potentially maintain input history information for arbitrary periods of time
𝒚 𝑛 = 𝑓𝑌 𝒙 𝑛
𝒙 𝑛 = 𝑓𝐻 𝒖 𝑛 , 𝒄 𝑛 𝒄 𝑛 = 𝒙(𝑛 − 1)
Learning with RNNs
• Universal approximation of RNNs (e.g. Elman, NARX) through learning
• However, training algorithms for RNNs involve some known drawbacks:
• High computational training costs and slow convergence
• Local minima (error function is generally a non convex function)
• Vanishing of the gradient and problem in learning long-term dependencies
Markovian Bias of RNNs
• Properties of RNNs state dynamics in the early stages of training
• RNNs initialized with small weights result in contractive state transition functions
and can discriminate among different input histories even prior to learning
• Markovian characterization of the state dynamics is a bias for RNN architectures
• Computational tasks with characterization compatible to such Markovian
characterization can be approached by RNNs in which recurrent connections are not
trained
• Reservoir Computation paradigm exploits this fixed Markovian characterization
Reservoir Computing (RC)
• Paradigm for efficient RNN modeling – state of the art for efficient learning in
sequential domains
• Implements dynamical system
• Conceptual separation: dynamical/recurrent non-linear part (reservoir)
feed-forward output tool (readout)
• Efficiency:
• training is restricted to the linear readout
• exploits Markovian characterization resulting from (untrained) contractive
dynamics
• Includes several classes: Echo State Networks (ESNs), Liquid State Machines,
Backpropagation Decorrelation, Evolino, ...
Echo State Networks
inW
W
outW( )nu ( )ny
( )nx
Input Layer Reservoir Readout
• Reservoir: untrained large, sparsely and randomly connected, non-linear layer
• Readout: trained linear layer
• linear units
• leaky-integrators
• spiking neurons
Train only the connections to the readout
encoding of the input
sequence
Input Space: Reservoir State Space: Output Space:
Reservoir Computation
The reservoir implements the state transition function:
Iterated version of the state transition function (application of the reservoir to an input
sequence):
initial state
Echo State Property (ESP)
Echo State Property
• Holds if the state of the network is determined uniquely by the left-inifinite input
history
• State contractive, state forgetting, input forgetting
• The state of the network asymptotically depends only on the driving input signal
• Dependencies on the initial conditions are progressively lost
Conditions for the Echo State Property
Conditions on
Sufficient: maximum singular value is less than 1
Necessary: spectral radius is less than 1
(contractive dynamics)
(asymptotically stable around the 0 state)
How to Initialize ESNs
Reservoir Initialization
• initialized randomly in
• Start with a randomly generated matrix
• Scale to meet the condition for the ESP
• initialization procedure:
inW
W
outW( )nu ( )ny
( )nx
Input Layer Reservoir Readout
Training ESNs
• Discard an initial transient (washout)
• Collect the reservoir states and target values for each n
•Train the linear readout:
Training Phase
inW
W
outW( )nu ( )ny
( )nx
Input Layer Reservoir Readout
Training the Readout
Moore-Penrose pseudo-inversion
Ridge Regression
(possible regularization using random noise)
• Off-line training: standard in most applications
is the regularization coefficient (tipically < 1)
• On-line training
• Least Mean Squares typically not suitable (ill posed problem)
• Recursive Least Squares more suitable
Training the Readout
• Other readouts:
• MLPs, SVMs, kNN, etc...
• Multiple readouts for the same reservoir: solving more tasks with the
same reservoir dynamics
Easy, Efficient, but many fixed hyper-parameters to set...
ESN Hyper-parametrization (Model Selection)
• Reservoir dimension
• Spectral radius
• Input Scaling
• Reservoir sparsity
• Non-linearity of reservoir activation function
• Input bias
• Architectural design
• Length of the transient (settling time)
• Readout Regularization
ESN hyper-parametrization should be chosen carefully through an
appropriate model selection procedure
ESN Architectural Variants
• direct input-to-readout connections
• output feedback connections (stability issues)
inW
W
outW( )tu ( )ty
( )tx
Input Layer Reservoir Readout
backW
properly initialized to guarantee
convergence of the encoding
Memory Capacity
How long is the effective short-term memory of ESNs?
Some results:
The MC is bounded by the reservoir dimension
The MC is maximum for linear reservoirs
• Longer delays cannot be learnt better than short delays (“fading memory”)
• It is impossible to train ESNs on tasks which require unbounded-time memory
Squared correlation
coefficient
Dilemma: memory capacity VS non-linearity
𝑀𝐶 = 𝜌2(𝒖 𝑛 − 𝑘 , 𝒚𝑘 𝑛 )
∞
𝑘=1
Applications of ESNs
• Several hundreds of relevant applications of ESNs are reported in literature
• (Chaotic) Time series prediction
• Non-linear system identification
• Speech recognition
• Sentiment analysis
• Robot localization & control
• Financial forecasting
• Bio-medical applications
• Ambient Assited Living
• Human Activity Recognition
• ...
Applications of ESN – Examples/1
Mackey-Glass time series
contraction coefficient
reservoir dimension
Extremely good approximation performance!
• Generalization of predictive performance to unseen environments
Training
Applications of ESN – Examples/2
Forecasting of user movements
Applications of ESN – Examples/3
• Input from heterogeneous sensor sources (data fusion)
• Predicting event occurrence and confidence
• Effectiveness in learning a variety of HAR tasks
• Training on new events
• Average test accuracy is 91.32%(±0.80)
Human Activity Recognition and Localization
Applications of ESN – Examples/4
Robotics
• Indoor localization estimation in critical environment (Stella Maris Hospital)
• Precise robot localization estimation using noisy RSSI data (35 cm)
• Recalibration in case of environmental alterations or sensor malfunctions
Applications of ESN – Examples/5
Prediction of Electricity Price on the Italian Market
Accurate prediction of hourly electricity price (less than 10% MAPE error)
Applications of ESN – Examples/6
Waveform of speech signal Average reservoir state Emotion Class
EVALITA 2014 - Emotion Recognition Track (Sentiment Analysis)
• Challenge: the reservoir encodes the temporal input signals avoiding the
need of explicitly resorting to fixed-size feature extraction
• Promising performances already in line with the state of the art
Speech and Text Processing
Markovianity
• Markovian nature: states assumed in correspondence of different input sequences
sharing a common suffix are close to each other proportionally to the length of the
common suffix
Contractivity
Markovianity
• Markovianity and contractivity: Iterated Function Systems, fractal theory,
architectural bias of RNNs
RNNs initialized with small weigths (with contractive state transition function)
and bounded state space implement (approximate arbitrarily well) definite memory
machines
• RNN dynamics constrained in region of state space with Markovian characterization
• Contractivity (in any norm) of state transition function impies Echo States (next slide)
• ESNs featured by fixed contractive dynamics
• Relations with the universality of RC for bounded memory computation
Contractivity and ESP
A contractive setting of the state transition function τ (in any norm) implies the ESP.
Assumption: τ is contractive with parameter C.
Approaches 0 as n goes to infinity
Contractivity and Reservoir Initialization
Reservoir is initialized to implement a contractive state transition function, so that the
ESP is guaranteed.
Assumption: Euclidean distance as metric in the reservoir space, tanh as reservoir
activation function
(sufficient condition for the ESN)
This leads to the formulation of the sufficient condition on the maximum singular value
of the recurrent reservoir weight matrix:
Markovianity
a a a a a
a a b b a
b a b a a
b b b b a
Influence on the state
• Input sequences sharing a common suffix drive the ESN into close states, proportionally
to the length of the suffix
• Ability to intrinsically discriminate among different input sequences in a suffix-based
fashion without adaptation of the reservoir parameters
• Target task should match Markovianity of reservoir state space
Markovianity
σ = 0.3
1st PC: last input symbol
2nd PC: next-to-last input symbol
e b h j a a c
e a h i e c c
j g j e e g f
j g h e e g h
Conclusions
• Reservoir Computing: paradigm for efficient modeling of RNNs
• Reservoir: non-linear dynamic component, untrained after contractive initialization
• Readout: linear feed-forward component, trained
• Easy to implement, fast to train
• Markovian flavour of reservoir state dynamics
• Successful applications (tasks compatible with Markovian characterization)
• Model Selection: many hyper-parameters to be set
Research Issues
• Optimization of reservoirs: supervised or unsupervised reservoir adaptation
(e.g. Intrinsic Plasticity)
• Architectural Studies: e.g. Minimum complexity ESNs, φ-ESNs, ...
• Non-linearity vs memory capacity
• Stability analysis in case of output feedbacks
• Reservoir Computing for Learning in Structured Domains
TreeESN, GraphESN
• Applications, applications, applications, applications ...