University of Zurich and ETH Zurich
Master Thesis
Online Learning in Event-based Restricted Boltzmann Machines
Author:
Daniel Neil
Supervisor:
Michael Pfeiffer and
Shih-Chii Liu
A thesis submitted in fulfilment of the requirements
for the degree of MSc UZH ETH in Neural Systems and Computation
in the
Sensors Group
Institute of Neuroinformatics
October 2013
Declaration of Authorship
I, Daniel Neil, declare that this thesis titled, ‘Online Learning in Event-based Restricted
Boltzmann Machines’ and the work presented in it are my own. I confirm that:
• This work was done wholly or mainly while in candidature for a research degree at this University.
• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
• Where I have consulted the published work of others, this is always clearly attributed.
• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.
Signed:
Date:
Abstract
Online Learning in Event-based Restricted Boltzmann Machines
by Daniel Neil
Restricted Boltzmann Machines (RBMs) constitute the main building blocks of Deep
Belief Networks and other state-of-the-art machine learning tools. It has recently been
shown how RBMs can be implemented in networks of spiking neurons, which is advantageous because the necessary repetitive updates can be performed in an efficient
asynchronous and event-driven manner. However, like any previously known method
for training RBMs, the training process for event-based RBMs was performed offline.
The offline training fails to exploit the computational advantages of spiking networks,
and does not capture the online learning characteristics of biological systems.
This thesis introduces the first online method of training event-based RBMs that combines the standard RBM-training method, called contrastive divergence (CD), with biologically inspired spike-based learning. The new rule, which we call “evtCD”, offers
sparse and asynchronous weight updates in spiking neural network implementations of
RBMs, and is the first online training algorithm for this architecture. Moreover, the
algorithm is shown to approximate the previous offline training process.
Performance of training was evaluated on the standard MNIST handwritten digit identification task, achieving 90.4% accuracy when combined with a linear decoder on the
features extracted by a single event-based RBM. Finally, evtCD was applied to the real-time output of an event-based vision sensor and achieved 86.7% accuracy after only 60
seconds of training time and presentation of less than 2.5% of the standard training
digits.
Acknowledgements
This thesis could not have been accomplished without the original effort of Peter O’Connor
and his boundless persistence to make event-based Deep Belief Networks possible. Saee
Paliwal has contributed her endless support and incomparable mathematical ability during the course of this work, and without our discussions I would have likely floundered in
the great space of ideas. A very large thanks to Michael Pfeiffer for his expert knowledge
and his key insights into the work, which critically moved the project forward at various
times when it stalled. Finally, this document will hopefully be just one of many written by me at the Institute of Neuroinformatics, thanks largely to the encouragement,
insights, and support of Shih-Chii Liu. I am deeply in all of your debt for your help.
Contents
Declaration of Authorship i
Abstract ii
Acknowledgements iii
Contents iv
1 Introduction 1
2 A Background in Restricted Boltzmann Machines and Deep Learning 4
2.1 A Historical Introduction to Deep Learning . . . . . . . . . . . . . . . . . 4
2.2 Energy-Based Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Products of Experts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 RBMs and Contrastive Divergence . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Extensions to Standard Learning Rules . . . . . . . . . . . . . . . . . . . 17
3 Derivation of evtCD 19
3.1 Spiking Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . 19
3.2 evtCD, an Online Learning Rule for Spiking Restricted Boltzmann Machines 21
4 Implementation of evtCD 26
4.1 Algorithm Recipe for Software Implementation . . . . . . . . . . . . . . . 26
4.2 Supervised Training with evtCD . . . . . . . . . . . . . . . . . . . . . . . 28
5 Test Methodology 30
5.1 MNIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.2 Time-stepped Training Methodology . . . . . . . . . . . . . . . . . . . . . 31
5.2.1 Extracting Spikes From Still Images . . . . . . . . . . . . . . . . . 33
5.3 Online Training Methodology . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.3.1 A Software Implementation of Fixational Eye Movements . . . . . 34
5.3.2 Training Environment . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.3.3 Java Reference Implementation . . . . . . . . . . . . . . . . . . . . 36
6 Quantification of evtCD Training 38
6.1 Improving Training through Parameter Optimizations . . . . . . . . . . . 38
6.1.1 Baseline Training Demonstration . . . . . . . . . . . . . . . . . . . 39
6.1.2 Learning Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.1.3 Number of Input Events . . . . . . . . . . . . . . . . . . . . . . . . 43
6.1.4 Batch Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.1.5 Noise Temperature . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.1.6 Persistent Contrastive Divergence . . . . . . . . . . . . . . . . . . . 50
6.1.7 Bounded Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.1.8 Inverse Decay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.2 Training as a Feature Extractor . . . . . . . . . . . . . . . . . . . . . . . . 56
6.3 Online Training with Spike-Based Sensors . . . . . . . . . . . . . . . . . . 60
7 Conclusions and Future Work 62
7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
A Java Implementation 66
B Matlab Implementation 69
C EyeMove Implementation 73
Bibliography 75
Chapter 1
Introduction
Deep networks, specifically Deep Belief Networks [1, 2] and Deep Boltzmann machines
[3], are achieving state-of-the-art performance on classification tasks for images, videos,
audio, and text [4–10]. Importantly, Restricted Boltzmann machines (RBMs) [11–13]
underlie both of these approaches and recent work [14–17] has strongly pushed to investigate the possibility of implementing RBMs on networks of spiking neurons. The
reason for this is two-fold.
First, fast and efficient silicon architectures [18–21] designed specifically to accelerate
spiking neural networks are emerging, and RBMs composed of spiking neurons would
pair progress in machine learning with novel and powerful computing architectures.
Additionally, it has been shown that the accuracy of networks on classification tasks
is guaranteed to improve with larger size or more layers [1, 14], implying that higher
accuracy could be achieved by just scaling up the network size. Therefore, scale is an
important factor, and these neural network accelerators are all designed to run large
networks of spiking neurons faster than a general-purpose computer.
Second, RBMs implemented with spiking neurons offer a fundamental advantage when
used in very large networks over traditional RBM approaches. As [14] points out, scale is
not the dominant factor in processing time for the brain, unlike standard computational
approaches, because the processing units are both parallel and event-driven. The brain
adapts its processing speed to the rate of input, so the computational effort is proportional to the number of events. This so-called event-driven computational approach is
a hallmark of neuromorphic designs that seek inspiration from the brain to build event-based, asynchronous, and often low-power silicon systems [22–25]. The advantages of
event-driven computation are well studied [16, 26, 27] and certain types of spiking neuron models can be implemented entirely in event-driven systems [28, 29]. Additionally,
current event-driven neuromorphic sensors produce sparse outputs for vision [24, 27, 30]
and audition [31]. By designing algorithms that can run on spiking neural networks,
it is possible to construct a complete hardware system using event-driven computation
alone.
However, the pre-existing implementations of RBMs composed of spiking neural networks [14–16] all use an offline and synchronous training algorithm. No training algorithm has yet been discovered for online training of RBMs composed of spiking neurons.
The main question this thesis is concerned with is the following: can RBMs composed
of spiking neurons be trained online? Specifically, there are three subgoals:
1. Derive a rule for online learning of an RBM composed of spiking neural networks;
2. Design an event-driven, asynchronous implementation of this rule to achieve high
performance in scalable systems;
3. Demonstrate this training rule’s effectiveness on a common benchmark task.
In this thesis, I will introduce evtCD, an online learning rule inspired by biological rules
of spike time-dependent plasticity (STDP) [32–34] that trains an RBM composed of
leaky integrate-and-fire (LIF) spiking neurons. This training algorithm can be run in an
entirely event-driven way by using updates from STDP with insights from contrastive
divergence (CD) learning, the standard method of learning for RBMs.
Beyond the description of this new algorithm, I demonstrate a proof sketch of how this
learning rule approximates a previously demonstrated offline learning rule introduced in
[14], which in turn arises from contrastive divergence learning.
Finally, this thesis contributes a real-time implementation of evtCD to demonstrate the
efficiency of this learning rule. An event-based image sensor [24], which produces image
events instead of image frames, generates spike trains which are used to run the evtCD
training algorithm on populations of spiking neurons. When applied to the commonly
used MNIST handwritten digit identification task [35], the receptive fields of the neurons
learn digit parts as they do in the standard frame-based training. After sixty seconds
of real-time learning, the network learns features that perform better than an optimal
linear decoder on the raw digits, achieving a classification accuracy of 86.7%.
This thesis is structured as follows. In Chapter 2, the history of deep learning is introduced and the derivations of previous methods are analyzed for their applicability
to spiking neural networks. Chapter 3 introduces evtCD and links the algorithm to
the previously successful rate-based offline learning algorithm in [14]. In Chapter 4, an
algorithm recipe is shown for implementing evtCD in software, and supervised learning
with the evtCD algorithm is explained. Following that, Chapter 5 explains the testing
methodology and setup for obtaining results that analyze evtCD training. Chapter 6
then studies the behaviour of the algorithm under various parameters and extensions,
and demonstrates rapid real-time learning of the MNIST handwritten digit dataset.
Finally, Chapter 7 concludes by introducing ideas for future work.
Chapter 2
A Background in Restricted
Boltzmann Machines and Deep
Learning
This chapter will introduce the prior work on RBMs and deep learning to lay the foundations for the introduction of the evtCD learning rule. In Section 2.1, a historical introduction will give an overview of the history and intuition of training RBMs. Subsequently, Sections 2.2 through 2.4 will focus on reproducing the mathematical derivations of RBM learning rules to understand the assumptions in them. Finally, standard extensions to the contrastive divergence learning rule will be briefly explained in Section 2.5, as they form the basis for many of the investigations found in Chapter 6.
2.1 A Historical Introduction to Deep Learning
The origins of deep learning begin with Boltzmann machines [36]. Introduced in 1986, Boltzmann machines are an undirected, two-layer probabilistic generative model. The connections between (and, unlike in RBMs, within) these layers are bidirectional (hence, “undirected”), allowing the system either to pull external states
into its internal representation or to generate data from that internal representation.
These machines are “probabilistic” in that they encode the probabilities of types of
inputs on which they are trained, and are modeled on a physical analogy to distributions
of matter that probabilistically settle into low-energy configurations. Finally, Boltzmann
machines are “generative” models, meaning they are capable of producing data that
look like the inputs on which they have been trained. If, for example, a network is
trained on handwritten digits, a Boltzmann machine will, after training, produce digit-like patterns on the visible part of the system when allowed to freely sample from
the distribution specified by the weights in the system. Section 2.2 addresses their
mathematical formulation.
Boltzmann machines are trained using a computationally intensive process in which
the machines are annealed into low-energy states, and these states are used to guide
a training algorithm to model the joint probabilities of the inputs presented to them
(Equation 2.20). These machines are typically discussed as having a “visible” state and
a “hidden” state, shown in Figure 2.1, in which the visible state corresponds to the
data that is fed into the machine and the hidden state corresponds to some abstracted
representation hidden from the outside world. The goal of the Boltzmann machine is
to use the energy dynamics of the system to learn arbitrary distributions of the input
data. The relationships that specify the distribution are mapped through the connection
weights (Equation 2.2), which force the hidden units to represent the input distributions
and cause the low energy states to correspond to probable configurations of the system
[36].
Ultimately, this distribution matching is a very important task for learning [37]. It is a
form of unsupervised learning in which the goal of the system is to design a probability
distribution that can arbitrarily approximate an input distribution, learning only from
samples from this unknown input distribution. This is an important goal because it
means the system can perform inference on that model, calculate likelihoods of a given
input, and produce samples similar to those it has been trained upon [13, 36, 37]. As
will be shown later in this work, it also allows a form of supervised learning if the labels
are presented as part of the joint distribution it needs to learn [2, 13].
RBMs, which were introduced under the name “Harmoniums” by [11], are a slight modification of the Boltzmann machine in that intra-layer connections have been removed
to make units in the same layer conditionally independent, as seen in Figure 2.1. As
mentioned before, both Boltzmann and Restricted Boltzmann machines can be trained
in an unsupervised fashion, which means that no training labels are necessary for the
system to learn the joint distribution of their inputs.
Unfortunately, training a Boltzmann machine through simulated annealing takes considerable computational time because the system must be allowed to stabilize to an equilibrium. This is necessary to obtain a single sample from the data and model distributions, which are used to calculate the gradient of learning for minimizing the difference between these two distributions (see Equation 2.20 in Section 2.2), and thus is the limiting factor for training a dataset. If every weight update takes a significant amount of
Figure 2.1: Diagrammatic view of a Boltzmann machine and a Restricted Boltzmann machine, each with a visible and a hidden layer. Note that the Restricted Boltzmann machine lacks intra-layer connections. Figure taken from [14].
time to calculate (i.e., using Equation 2.20), this precluded training and hindered the adoption of Boltzmann machines [13, 37].
The Restricted Boltzmann machine attempts to address one significant issue of training
the Boltzmann machine: the difficulty of obtaining a true sample from a Boltzmann
machine. By removing intra-layer connections, inference becomes tractable within this
model, as recognized by [12], and Gibbs sampling can be used to infer likely states of
the model instead of annealing the system into equilibrium [12, 13].
After their initial introduction in the mid-1980s, progress on RBMs was sporadic and largely ineffectual; attention shifted instead to training supervised shallow architectures with backpropagation, as in [38]. Moreover, shallow architectures are already universal, meaning that
a shallow architecture could theoretically approximate any function given enough units.
Although it can be exponentially more efficient to have deep layers than a single layer
[37], no effective training algorithms were yet created that could train deep networks.
Unfortunately, standard error-gradient techniques like backpropagation assign excessive error updates to the final layer in a deep architecture. The error used in learning effectively disappears after being propagated back through all layers [1, 37], causing very little training signal to reach the initial layers. This results in over-learning of the top layer and under-learning in lower layers, using the resources of a deep network inefficiently, and ultimately performing worse than a single well-trained layer [37, 39].
An alternative approach was championed by Yann Le Cun (for example, [37, 40]), which
was to use convolutional networks. These networks have templates mapped across the
input space, combined with transformation layers. This effectively decreases the number of independent weights and allows for more efficient training, as well as making the system more robust to certain transformations; depending on the architecture, it can be
made to be more robust to both translation and scaling. Unfortunately, this approach
imposes certain assumptions about the inputs (translation invariance, for example), and
ultimately gives up the descriptive power of large numbers of weights in favor of simpler,
more effective training.
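The parameter reduction from weight sharing can be made concrete with a back-of-the-envelope count. The sketch below compares a fully connected layer against a shared-template convolutional layer on a 28×28 input; the layer sizes are illustrative choices, not figures from this thesis.

```python
# Hypothetical illustration: independent-parameter counts for a dense layer
# versus a shared-weight convolutional layer on the same 28x28 input.

def dense_params(in_size, out_size):
    # Every input-output pair has its own weight, plus one bias per output.
    return in_size * out_size + out_size

def conv_params(kernel_size, n_filters):
    # One small template per filter is swept across the input, so the
    # parameter count is independent of the input size.
    return kernel_size * kernel_size * n_filters + n_filters

dense = dense_params(28 * 28, 100)   # 78,500 independent parameters
conv = conv_params(5, 100)           # 2,600 independent parameters
print(dense, conv)
```

The roughly thirty-fold reduction is exactly the trade described above: fewer independent weights in exchange for baked-in assumptions such as translation invariance.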
Then, in 2002, work was published which began shifting progress back towards deep
networks [2, 13]. Data was becoming very plentiful, but, unfortunately, most of that data was unlabeled; a simple Internet search could easily yield troves of data, but it is not structured in ways that allow computers to easily learn from it [13]. To take
advantage of this available information, unsupervised learning is a very powerful tool
because it allows a computer to pre-learn from large volumes of unlabeled data, learning
about the differences between classes that it sees. Then, when presented with labels,
it can fine-tune its learning with a final supervised step. RBMs, however, took far too long to train using simulated annealing; in this 2002 paper, Hinton discovered a very effective approximation that works well in practice and enabled the rise of deep networks
[37].
One way to obtain a sample from the RBM’s model distribution is to begin a Markov
chain starting with a current data sample. This Markov chain can alternately draw
samples from the hidden layer given the visible, and the visible layer given the hidden
(see Figure 2.2). It turns out this process is equivalent to a Markov Chain Monte
Carlo (MCMC) sampling method known as Gibbs sampling, and Gibbs sampling in this
context converges to the stationary distribution specified by the RBM regardless of the
starting point. Informally, this means that no matter how the system is initialized,
this process of repeatedly generating samples from the system causes those samples to
gradually become closer to the distribution specified by the system. The first sample
may be random data, but by the time the energy dynamics have adjusted the activations
many, many times, the samples begin to be the types of samples specified by the RBM.
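The alternating chain described above can be sketched in a few lines. The following is a minimal illustration (not the thesis code): hidden units are sampled given the visible layer, then visible units given the hidden layer, repeatedly. Network sizes, weight scales, and step count are arbitrary choices.

```python
# A minimal sketch of the alternating Gibbs chain between the visible and
# hidden layers of an RBM. Sizes and parameters are illustrative only.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_chain(v0, W, b, c, n_steps, rng):
    """Run n_steps of alternating Gibbs sampling from initial visible v0."""
    v = v0.copy()
    for _ in range(n_steps):
        # Conditional independence within a layer (the "restriction") lets
        # a whole layer be sampled in one vectorized step.
        h = (rng.random(c.shape) < sigmoid(W @ v + c)).astype(float)
        v = (rng.random(b.shape) < sigmoid(W.T @ h + b)).astype(float)
    return v, h

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, size=(20, 50))   # 50 visible, 20 hidden units
b, c = np.zeros(50), np.zeros(20)
v_n, h_n = gibbs_chain(rng.integers(0, 2, 50).astype(float), W, b, c, 10, rng)
```

However the chain is initialized, running it long enough yields samples from the distribution the weights specify.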
Figure 2.2: The Markov chain used in contrastive divergence, alternating between the visible and hidden layers to produce successive samples Q0, Q1, ..., Qn. Gibbs steps are taken to create samples closer to the equilibrium distribution than the original data sample.
Gibbs sampling is designed to generate approximate samples from a distribution where direct sampling is difficult. The key insight that underlies Gibbs sampling is that sampling from a conditional distribution may be easier than obtaining pure unbiased samples from the distribution. The process is surprisingly simple: for all joint variables in the distribution, hold all but one fixed, and draw a sample conditioned on the others. In the next step, use the updated sample value for that variable to draw a sample for a different variable. Intuitively, the process “walks” a sampling process towards the distribution specified by the parameters, and given infinite steps will yield samples from the distribution specified by the parameters.
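The hold-all-but-one procedure can be seen in a toy setting. Below, an arbitrary example joint distribution over two binary variables is sampled by alternately resampling each variable from its conditional; the empirical marginals approach the true ones.

```python
# A toy sketch of generic Gibbs sampling: hold all variables but one fixed
# and resample that one from its conditional. The joint table is an
# arbitrary example distribution, not data from this thesis.
import numpy as np

joint = np.array([[0.1, 0.2],
                  [0.3, 0.4]])   # joint[x, y] = P(x, y); entries sum to 1

def gibbs(joint, n_steps, rng):
    x, y = 0, 0
    samples = []
    for _ in range(n_steps):
        # Resample x conditioned on the current y ...
        px = joint[:, y] / joint[:, y].sum()
        x = rng.choice(2, p=px)
        # ... then resample y conditioned on the new x.
        py = joint[x, :] / joint[x, :].sum()
        y = rng.choice(2, p=py)
        samples.append((x, y))
    return np.array(samples)

rng = np.random.default_rng(0)
s = gibbs(joint, 20000, rng)
print(s.mean(axis=0))  # approaches the true marginals P(x=1)=0.7, P(y=1)=0.6
```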
In practice, this process cannot be repeated indefinitely, hence the use of simulated annealing to try to stabilize the system with an energy-cooling process. Unfortunately, it cannot be known a priori how many steps are necessary to obtain a “true enough” sample from the model distribution, and in practice this number can be quite large and also quite variable. The key insight of the [13] paper, however, is that a single step is sufficient to learn effectively. It is not obvious that this should be the case, and the next section deals with the mathematics to explain why this assumption is valid.
Intuitively, this single-step sampling method known as CD-1 (see Equation 2.33) works
because a sample drawn from a Gibbs step is closer to the model distribution specified
by the RBM than the input data. Even though this model sample is still highly correlated with the current input, it still contains information about the gradient that would
decrease the difference between the model distribution and the data distribution, and
this single sample can be used to approximate a sample from the model distribution.
A step then taken to minimize the difference between the model sample and the data
sample will make the model distribution more like the data distribution. In practice, a
single Gibbs step is quick to calculate and very effective [2, 13, 37].
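A single CD-1 update can be sketched as follows. This is a hedged illustration of the idea just described, not the thesis's evtCD code: one Gibbs step supplies "model" statistics that are subtracted from "data" statistics (cf. Equation 2.20). Layer sizes and the learning rate are arbitrary.

```python
# A minimal sketch of a CD-1 weight update for a binary RBM.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, b, c, lr, rng):
    # Positive (data) phase: hidden sample driven by the clamped data.
    h_data = (rng.random(c.shape) < sigmoid(W @ v_data + c)).astype(float)
    # One Gibbs step provides the negative (model) phase sample.
    v_model = (rng.random(b.shape) < sigmoid(W.T @ h_data + b)).astype(float)
    h_model_prob = sigmoid(W @ v_model + c)
    # Delta w ~ <v h>_data - <v h>_model, with the model side approximated
    # by the single reconstruction rather than a converged chain.
    dW = np.outer(h_data, v_data) - np.outer(h_model_prob, v_model)
    return W + lr * dW

rng = np.random.default_rng(1)
W = rng.normal(0, 0.01, size=(10, 25))   # 25 visible, 10 hidden units
W = cd1_update(rng.integers(0, 2, 25).astype(float),
               W, np.zeros(25), np.zeros(10), 0.1, rng)
```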
The second key insight, which appeared a few years later in [2], was to use RBMs as a
building block in training deep networks. Each layer in a deep network could be trained
as an RBM and the whole system can be composed together out of RBMs. When
training, the first layer is trained in an unsupervised way with the visible layer as the
input layer and the layer above as the hidden layer. Then, to train the next layer, the
learned weights from the first layer are fixed as the hidden layer becomes the visible layer
of the next layer. This process continues until the whole network is trained in a greedy
layer-wise fashion [39]. These deep networks will only be very sparingly addressed in
this thesis, although there is much more room to explore.
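The greedy layer-wise procedure can be summarized schematically. In the sketch below, `train_rbm` is a stand-in for any RBM trainer (such as CD-1); the point is only the stacking logic, in which each trained layer's hidden activations become the "visible" data for the next layer. Layer sizes are illustrative.

```python
# Schematic sketch of greedy layer-wise stacking of RBMs.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, rng):
    # Placeholder trainer: returns weights of the right shape. A real
    # implementation would run contrastive divergence here.
    return rng.normal(0, 0.1, size=(n_hidden, data.shape[1]))

def train_deep_network(data, layer_sizes, rng):
    weights = []
    for n_hidden in layer_sizes:
        W = train_rbm(data, n_hidden, rng)
        weights.append(W)
        # Freeze this layer; its hidden activations feed the next RBM.
        data = sigmoid(data @ W.T)
    return weights

rng = np.random.default_rng(0)
weights = train_deep_network(rng.random((100, 784)), [500, 200, 50], rng)
print([W.shape for W in weights])  # [(500, 784), (200, 500), (50, 200)]
```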
Now, the mathematics of these systems should be examined to firmly ground the theory
on which the rest of this thesis is based.
2.2 Energy-Based Models
We begin with a description of energy-based models, a class that contains both the original Boltzmann machine and the Restricted Boltzmann machine. This derivation follows the outline given in the Appendix of [36] and updates the notation to be consistent with the formulae given elsewhere in this manuscript. In analogy to a physical system, we begin
by defining the probability of a visible state in the Boltzmann machine as exponential
in the energy:
P(v) = \sum_h P(v,h) = \frac{\sum_h e^{-E(v,h)}}{\sum_{x,h} e^{-E(x,h)}}    (2.1)
Here, the vector v denotes the current visible state of the system, recalling that the
Boltzmann machine and RBM are both bipartite machines composed of a visible state v
and a hidden state h. This equivalence marginalizes out the hidden states vector h given
some visible vector v, and additionally relates the energy function to the probability of a
state v. The normalizing factor on the denominator at the right of Equation 2.1, called
the partition function by analogy to physics, simply rescales the energy of the current
visible state by the sum over all possible system states. Note that to avoid confusion, x
is used for the visible vector in the partition function summation rather than v.
Where the ′ symbol indicates the transpose operation, b is an energy bias on the visible states, c an energy bias on the hidden states, and W a weight matrix providing an energy description of the joint activations of the hidden and visible states, the energy of a system configuration can be defined as:

E(v,h) = -b'v - c'h - h'Wv    (2.2)
Or, equivalently, as a sum over the i visible neurons and j hidden neurons:

E(v,h) = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} w_{ij} v_i h_j    (2.3)
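The equivalence of the vectorized form (Equation 2.2) and the element-wise sum (Equation 2.3) is easy to verify numerically; the check below uses illustrative sizes, with W stored as (hidden × visible) so that W[j, i] plays the role of w_ij.

```python
# Numerical check that Equation 2.2 and Equation 2.3 agree.
import numpy as np

def energy_matrix(v, h, b, c, W):
    # E(v,h) = -b'v - c'h - h'Wv  (Equation 2.2)
    return -b @ v - c @ h - h @ W @ v

def energy_sums(v, h, b, c, W):
    # E(v,h) = -sum_i b_i v_i - sum_j c_j h_j - sum_{i,j} w_ij v_i h_j
    return (-sum(b[i] * v[i] for i in range(len(v)))
            - sum(c[j] * h[j] for j in range(len(h)))
            - sum(W[j, i] * v[i] * h[j]
                  for i in range(len(v)) for j in range(len(h))))

rng = np.random.default_rng(0)
v = rng.integers(0, 2, 6).astype(float)   # 6 visible units
h = rng.integers(0, 2, 4).astype(float)   # 4 hidden units
b, c, W = rng.normal(size=6), rng.normal(size=4), rng.normal(size=(4, 6))
assert np.isclose(energy_matrix(v, h, b, c, W), energy_sums(v, h, b, c, W))
```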
The goal of this derivation will be to determine a method of updating the weights W to
force the system to achieve certain co-activations of v and h, and the tool used will be
minimizing the KL divergence. The KL divergence is a directional measure of similarity
between distributions, and in this case we want to minimize the difference between the
distribution of the visible data the machine is presented with and the distribution of data
Chapter 2. A Background in Restricted Boltzmann Machines and Deep Learning 10
it generates. Ideally, if we can find the gradient that minimizes the difference between the
distribution of input states forced by the outside world and the states the system settles
into according to its energy function, the system will eventually match the distribution
of the input states specified by the data. Let P+(v,h) denote the probabilities of input
data state (v,h), and P−(v,h) denote the model probability of a state (v,h).
D_{KL}(P^+ \,\|\, P^-) = \sum_{v,h} P^+(v,h) \log \frac{P^+(v,h)}{P^-(v,h)}    (2.4)
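The directionality of this measure is worth emphasizing: D_KL(P⁺‖P⁻) and D_KL(P⁻‖P⁺) generally differ. A small illustrative computation, with arbitrary example distributions:

```python
# Illustrative computation of the (asymmetric) KL divergence of Eq. 2.4
# between two small discrete distributions.
import numpy as np

def kl(p, q):
    # D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)); states with P(x) = 0
    # contribute nothing to the sum.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]
print(kl(p, q), kl(q, p))  # directional: the two values differ
```

The divergence is zero exactly when the two distributions coincide, which is why driving it to zero makes the model distribution match the data distribution.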
Noting that P^+(v,h) is the true probability of the data and is independent of the parameters, it is a constant when taking the gradient with respect to w_{ij}:

\frac{\partial D_{KL}(P^+ \,\|\, P^-)}{\partial w_{ij}} = -\sum_{v,h} \frac{P^+(v,h)}{P^-(v,h)} \frac{\partial P^-(v,h)}{\partial w_{ij}}    (2.5)
Now, we seek the term \partial P^-(v,h) / \partial w_{ij}, which we can calculate given the probability-energy relation given in Equation 2.1 and the energy-weight relation given in Equation 2.3. We pre-compute the partial derivative of the numerator in Equation 2.1 by using Equation 2.3:

\frac{\partial e^{-E(v,h)}}{\partial w_{ij}} = v_i h_j e^{-E(v,h)}    (2.6)
This term is now used in taking the full form of \partial P^-(v) / \partial w_{ij}:

\frac{\partial P^-(v)}{\partial w_{ij}} = \frac{\sum_h v_i^+ h_j^+ e^{-E(v,h)} \sum_{x,h} e^{-E(x,h)} - v_i^- h_j^- \sum_h e^{-E(v,h)} \sum_{x,h} e^{-E(x,h)}}{\left( \sum_{x,h} e^{-E(x,h)} \right)^2}    (2.7)
which can be simplified using the marginal probabilities in Equation 2.1:

\frac{\partial P^-(v)}{\partial w_{ij}} = v_i^+ h_j^+ \sum_h P(v,h) - v_i^- h_j^- P(v) \sum_{v',h} P(v',h)    (2.8)
The expression found in Equation 2.8 is the term sought for the KL divergence, and can be substituted into Equation 2.5 (writing G for the divergence D_{KL}(P^+ \,\|\, P^-)):
\frac{\partial G}{\partial w_{ij}} = -\sum_v \frac{P^+(v)}{P^-(v)} v_i^+ h_j^+ \sum_h P(v,h) + \sum_v \frac{P^+(v)}{P^-(v)} v_i^- h_j^- P(v) \sum_{v',h} P(v',h)    (2.9)
By the rules of conditional probability:
P^+(v,h) = P^+(h|v) \, P^+(v)    (2.10)
P^-(v,h) = P^-(h|v) \, P^-(v)    (2.11)
And since the hidden states are chosen according to the same parameterized model,
regardless of whether the visible states are given by the environment or whether the
visible states are chosen from the model distribution:
P^-(h|v) = P^+(h|v)    (2.12)
Using these facts to prepare a term and simplify:

\frac{P^-(v,h) \, P^+(v)}{P^-(v)} = P^-(h|v) \, P^-(v) \, \frac{P^+(v)}{P^-(v)}    (2.13)
= P^-(h|v) \, P^+(v) = P^+(h|v) \, P^+(v)    (2.14)
= P^+(v,h)    (2.15)
Of course, since we are dealing with probabilities:

\sum_v P^+(v) = 1    (2.16)
\sum_v P^-(v) = 1    (2.17)
Substituting all this into 2.9, reproduced here, simplifies the equation:
\frac{\partial G}{\partial w_{ij}} = -\sum_v \frac{P^+(v)}{P^-(v)} v_i^+ h_j^+ \sum_h P(v,h) + \sum_v \frac{P^+(v)}{P^-(v)} v_i^- h_j^- P(v) \sum_{v',h} P(v',h)    (2.18)

= -\left( \sum_{v,h} P^+(v,h) \, v_i^+ h_j^+ - \sum_{v',h} P(v',h) \, v_i^- h_j^- \right)    (2.19)
which is, up to a sign, just the difference between the expectations of the data distribution and the model distribution. Therefore, a weight update that decreases the distance between the model distribution P^- and the data distribution P^+ (a step against this gradient) is proportional to the difference in expectations of the binary product v_i h_j between the data and model distributions:

\Delta w_{ij} \propto \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}    (2.20)
Unfortunately, at this point training requires obtaining the expectation from the model distribution, a difficult problem that took decades to work around. However, this formulation appears many times in subsequent derivations and is an important result to build on. Ultimately, contrastive divergence, the Siegert approach, and evtCD all work by generating an estimate of the correlations between visible and hidden layers, and then minimizing the difference between the correlations the model generates and the correlations the data produces, so this is an important derivation to understand.
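For a tiny RBM, the model expectation in Equation 2.20 can actually be computed exactly by enumerating every (v, h) configuration; this is precisely the brute force that becomes intractable at realistic sizes and motivates the approximations above. Sizes here are illustrative.

```python
# Exact <v_i h_j>_model for a tiny RBM via brute-force enumeration of the
# partition function. Illustrative sizes only: 2^(4+3) = 128 states.
import itertools
import numpy as np

def model_expectation(W, b, c):
    """Exact <v_i h_j>_model by summing over all binary (v, h) states."""
    nh, nv = W.shape
    Z = 0.0
    expect = np.zeros_like(W)
    for v_bits in itertools.product([0, 1], repeat=nv):
        for h_bits in itertools.product([0, 1], repeat=nh):
            v = np.array(v_bits, float)
            h = np.array(h_bits, float)
            # e^{-E(v,h)} with E from Equation 2.2, i.e. e^{b'v + c'h + h'Wv}
            p_unnorm = np.exp(b @ v + c @ h + h @ W @ v)
            Z += p_unnorm
            expect += p_unnorm * np.outer(h, v)
    return expect / Z

rng = np.random.default_rng(0)
W = rng.normal(0, 0.5, size=(3, 4))   # 4 visible, 3 hidden units
b, c = rng.normal(size=4), rng.normal(size=3)
E_vh = model_expectation(W, b, c)
print(E_vh.shape)   # (3, 4); each entry is a probability in [0, 1]
```

Doubling the number of units squares the number of terms in this sum, which is why sampled estimates of these correlations are indispensable.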
2.3 Products of Experts
By a very different route, a similar formulation was described in 2002 with the invention
of product of experts (POE) [13] to yield the contrastive divergence learning rule. The
derivation of the POE given here, following [13], will incorporate lessons from Gibbs
sampling that are valid on POEs to yield a training rule that uniquely works for RBMs.
Imagine a system composed of n “experts,” each of which independently computes the
probability of some data vector d out of all possible data vectors c. These experts then
multiply their probabilities to combine their opinions on the likelihood of the data.
The intuition for this model is that the product of experts can constrain the system so
that the final result is a sharper distribution than any individual model, unlike mixture
models where each expert is summed with the other experts. In mixture models, the final
distribution will be at least as broad as the tightest individual distribution. In contrast,
a POE system can have individual experts that specialize in different dimensions, each
of which constrains different attributes to yield a very sharp posterior. For problems
that factorize well into component decisions, this is very advantageous; for example, a
system designed to detect human beings can be composed of a torso detector, a head
detector, and a limb detector whose individual contributions can be combined to yield an
overall human detector made of factorized experts. Many real-world problems factorize
into attributes in this way, which provides the motivation for POE systems. Moreover,
RBMs are POE models (other types of experts are also allowed), whereas general
Boltzmann machines are not POEs.
The POE can be formulated as follows:

p(d \mid \theta_1 \ldots \theta_n) = \frac{\prod_m p_m(d \mid \theta_m)}{\sum_c \prod_m p_m(c \mid \theta_m)} \qquad (2.21)
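For intuition, Equation 2.21 can be evaluated by brute force for a small binary state space. The sketch below is illustrative only: the two "experts" are hypothetical independent-Bernoulli models, not models from this thesis.

```python
import itertools
import numpy as np

def poe_probability(d, experts):
    """Brute-force evaluation of Eq. 2.21 for a small binary state space.

    `experts` is a list of functions, each returning the (unnormalized)
    probability p_m(d | theta_m) that expert m assigns to a binary vector."""
    k = len(d)
    numerator = np.prod([p(d) for p in experts])
    # Partition function: product of expert opinions, summed over all 2^k vectors c.
    denominator = sum(
        np.prod([p(np.array(c)) for p in experts])
        for c in itertools.product([0, 1], repeat=k)
    )
    return numerator / denominator

# Two toy experts over 3 binary units, each a product of independent Bernoullis
# (hypothetical probabilities chosen for illustration).
expert_a = lambda v: np.prod(np.where(v == 1, [0.9, 0.5, 0.1], [0.1, 0.5, 0.9]))
expert_b = lambda v: np.prod(np.where(v == 1, [0.8, 0.2, 0.2], [0.2, 0.8, 0.8]))

p = poe_probability(np.array([1, 0, 0]), [expert_a, expert_b])
```

Because both experts assign high probability to the first unit being on and the others off, the combined distribution is sharper than either expert alone, as the text describes.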
To train the POE, one possible goal is to maximize the probability of the data. Equiva-
lently, the derivative of the log likelihood of the data d can be calculated with respect to
the parameters θm. Since the models are independent with their probabilities multiplied
together, the derivative of the probability p with respect to the model parameter θm can
be found as follows:
\frac{\partial \log p(d \mid \theta_1 \ldots \theta_n)}{\partial \theta_m} = \frac{\partial \log p_m(d \mid \theta_m)}{\partial \theta_m} - \sum_c p(c \mid \theta_1 \ldots \theta_n) \frac{\partial \log p_m(c \mid \theta_m)}{\partial \theta_m} \qquad (2.22)
The first term, the derivative of the log probability of the data sample with respect to
the parameters θ_m, is controllable by design. If an expert model is chosen for which
the derivative of the data's log probability with respect to the model parameters can
be computed, then this term is straightforward to calculate. Commonly,
the “expert” used is a sigmoid:
p_m = \frac{1}{1 + e^{-(b + \sum_j x_j w_j)}} \qquad (2.23)
which yields a probability pm based on the inputs xj given, respectively, the bias and
weight parameters θm = {b,w}. It is clear, in this case, that the first term in 2.22 is
easily calculable:
\frac{\partial \log p_m(d \mid \theta_m)}{\partial \theta_m} = \left( 1 - \frac{1}{1 + e^{-(b + \sum_j x_j w_j)}} \right) \frac{1}{1 + e^{-(b + \sum_j x_j w_j)}} \qquad (2.24)
Returning to Equation 2.22, though its first term is easily calculable, the second term is
more challenging. The second term combines the probability of the product of experts
with the log probability of a single expert over all data vectors c. For any given data
point c, this term is straightforward to calculate, because it is the product of two
values: the POE probability p(c|θ_1...θ_n) and the derivative of the log likelihood as in
Equation 2.24. With sufficient time to evaluate all data vectors c, this would not be a
problem, but the size of the possible state space likely precludes calculating all possible
vectors. For example, for a vector of k binary neurons, the number of combinations is
exponential in k: there are 2^k possible vectors, a number which quickly grows beyond
available computational power.
However, this second term in Equation 2.22 can be alternately viewed as the expected
derivative of the log probability of an expert on data. This means that if accurate
samples can be drawn from the model distribution, the expectation can be approximated.
To obtain samples from the model distribution, Gibbs sampling can be used to draw
samples from a POE, unlike for Boltzmann machines. Now, the architecture of the
POE (and RBMs) becomes very important: with no intra-layer connections, all units
are conditionally independent of the other units in their layer. Since this is true, a
Gibbs sampling chain can be run in which all hidden units are updated in parallel while
visible units remain fixed, then all visible units are updated in parallel while hidden
units remain fixed. Once this MCMC chain converges to the equilibrium distribution,
the expectation can be calculated from the samples produced.
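The alternating parallel updates described above can be sketched as follows. Layer sizes, weight scales, and the number of Gibbs steps are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b_h, b_v):
    """One full Gibbs step: sample all hidden units in parallel given the
    visible layer, then all visible units in parallel given the hiddens.
    This is valid because units within a layer are conditionally independent."""
    p_h = sigmoid(v @ W + b_h)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + b_v)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h

# Toy dimensions: 6 visible and 4 hidden units (hypothetical).
W = rng.normal(0.0, 0.1, size=(6, 4))
b_h, b_v = np.zeros(4), np.zeros(6)
v = rng.integers(0, 2, size=6).astype(float)
for _ in range(100):    # run the chain toward its equilibrium distribution
    v, h = gibbs_step(v, W, b_h, b_v)
```

After many such steps the chain samples approximately from the model distribution, and the expectation in Equation 2.22 can be estimated from the collected samples.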
However, there is a faster way that Hinton introduces in [13]. Imagine that the data
is produced by some true distribution Q0. For a concrete example, learning the joint
probabilities of pixels being “on” together is a common task for a visual classifier, because
the machine learns the relationships between pixels in an image. Now the POE has some
model distribution Q∞ which would like to approximate the true distribution Q0. This
naming convention was chosen to mimic the behavior of a Markov chain beginning
with the true data Q0 and ending at the model’s equilibrium distribution Q∞ after
∞ steps. The idea of contrastive divergence is to minimize the difference between the
model distribution Q∞ and the true distribution Q0. Once again, a way to measure
the difference between these two distributions is to use the Kullback-Liebler divergence.
The KL divergence of these two distributions can be calculated for Q0 and Q∞ as:
D_{KL}(Q^0 \| Q^\infty) = \sum_d p_{Q^0}(d) \log p_{Q^0}(d) - \sum_d p_{Q^0}(d) \log p_{Q^\infty}(d) \qquad (2.25)

\phantom{D_{KL}(Q^0 \| Q^\infty)} = \sum_d Q^0_d \log Q^0_d - \sum_d Q^0_d \log Q^\infty_d \qquad (2.26)
Note that the first term in Equation 2.26 is entirely dependent on the data and is
not affected by the parameters of the model that specify Q∞. This means that during
training, no update of the parameters will affect this constant, so it can be safely ignored.
Unfortunately, calculating the second term is intractable, as it is the expectation of
the model's log probability of the data taken under the true data distribution.
However, note that the value Q^∞_d is just the probability the POE assigns to some data
given the parameters, p(d|θ_1...θ_n), and the term is again an expectation. Revisiting Equation 2.22
with this understanding yields:
\left\langle \frac{\partial \log Q^\infty_d}{\partial \theta_m} \right\rangle_{Q^0} = \left\langle \frac{\partial \log p_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{Q^0} - \left\langle \frac{\partial \log p_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{Q^\infty} \qquad (2.27)
Here, finally, the key insight of contrastive divergence is applied. Imagine that instead of
minimizing the KL divergence DKL(Q0||Q∞), the difference between the KL divergences
D_{KL}(Q^0 \| Q^\infty) and D_{KL}(Q^1 \| Q^\infty) is minimized, where Q^1 refers to the
distribution after one full Gibbs step of sampling. Since Q^1 is closer to the equilibrium
distribution than Q^0, D_{KL}(Q^0 \| Q^\infty) exceeds D_{KL}(Q^1 \| Q^\infty) unless Q^0 = Q^1,
which implies Q^0 = Q^\infty; in that case it is no worse, and learning is already complete.
This "contrastive divergence" is therefore never negative. Most importantly, the intractable
expectation \langle \partial \log p_m(c \mid \theta_m) / \partial \theta_m \rangle_{Q^\infty} cancels out:
-\frac{\partial}{\partial \theta_m} \left( D_{KL}(Q^0 \| Q^\infty) - D_{KL}(Q^1 \| Q^\infty) \right) = \left\langle \frac{\partial \log p_m(d^0 \mid \theta_m)}{\partial \theta_m} \right\rangle_{Q^0} - \left\langle \frac{\partial \log p_m(d^1 \mid \theta_m)}{\partial \theta_m} \right\rangle_{Q^1} + \frac{\partial Q^1}{\partial \theta_m} \frac{\partial D_{KL}(Q^1 \| Q^\infty)}{\partial Q^1}
In this equation, the terms d^0 and d^1 are introduced; they refer to the starting data
and the data after one Gibbs sampling step, respectively. In [13], Hinton showed that the
final term \frac{\partial Q^1}{\partial \theta_m} \frac{\partial D_{KL}(Q^1 \| Q^\infty)}{\partial Q^1} is small and rarely opposes the direction of learning. By
ignoring this term, a very tractable learning rule is obtained for POEs in general:
\Delta \theta_m \propto \left\langle \frac{\partial \log p_m(d^0 \mid \theta_m)}{\partial \theta_m} \right\rangle_{Q^0} - \left\langle \frac{\partial \log p_m(d^1 \mid \theta_m)}{\partial \theta_m} \right\rangle_{Q^1} \qquad (2.28)
This discovery is relevant because the contrastive divergence rule holds for any system
composed of independent, probabilistic nodes that specify a probability through the
product of their independent activations. For a network of spiking neurons encoding the
joint probabilities of certain features being on, the difference between the expectations
under the data and under a quickly-derived model sample may be sufficient to train the
neural network.
2.4 RBMs and Contrastive Divergence
Finally, contrastive divergence on RBMs will be examined here because of the similarities
between RBMs and the evtCD learning architecture. This section continues the derivation
given in [13]. Contrastive divergence in RBMs draws from both the Boltzmann machine's
KL divergence minimization and the POE's contrastive divergence form to yield a
particularly easy-to-compute result. Beginning again with the POE log-likelihood
given in Equation 2.22:
\frac{\partial \log p(d \mid \theta_1 \ldots \theta_n)}{\partial \theta_m} = \frac{\partial \log p_m(d \mid \theta_m)}{\partial \theta_m} - \sum_c p(c \mid \theta_1 \ldots \theta_n) \frac{\partial \log p_m(c \mid \theta_m)}{\partial \theta_m} \qquad (2.29)
Imagining a Boltzmann machine with a single hidden node j, note that \theta_m = w_j, and the
term \frac{\partial \log p_m(d \mid \theta_m)}{\partial \theta_m} can be obtained from the Boltzmann derivation given above:

\frac{\partial \log p_m(d \mid w_j)}{\partial w_{ij}} = \langle s_i s_j \rangle_d - \langle s_i s_j \rangle_{Q^\infty(j)} \qquad (2.30)
The second term can also be calculated similarly:
\sum_c p(c \mid \mathbf{w}) \frac{\partial \log p_j(c \mid w_j)}{\partial w_{ij}} = \langle s_i s_j \rangle_{Q^\infty} - \langle s_i s_j \rangle_{Q^\infty(j)} \qquad (2.31)
Subtracting these two equations from each other per the derivation, and then taking the
expectation over the whole dataset yields:
\left\langle \frac{\partial \log Q^\infty_d}{\partial w_{ij}} \right\rangle_{Q^0} = -\frac{\partial D_{KL}(Q^0 \| Q^\infty)}{\partial w_{ij}} = \langle s_i s_j \rangle_{Q^0} - \langle s_i s_j \rangle_{Q^\infty} \qquad (2.32)
Finally, applying the contrastive divergence approximation gives:
-\frac{\partial}{\partial w_{ij}} \left( D_{KL}(Q^0 \| Q^\infty) - D_{KL}(Q^1 \| Q^\infty) \right) = \Delta w_{ij} \approx \langle s_i s_j \rangle_{Q^0} - \langle s_i s_j \rangle_{Q^1} \qquad (2.33)
This is the final form of contrastive divergence in RBMs. Next, we will examine the
intuition behind some subsequent improvements to this rule.
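Equation 2.33 can be sketched as a CD-1 weight update. This is a minimal illustration with biases omitted; the variable names and learning rate are hypothetical, not taken from this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, eta=0.1):
    """One contrastive divergence (CD-1) update for an RBM, following Eq. 2.33."""
    ph0 = sigmoid(v0 @ W)                               # hidden probabilities under Q0
    h0 = (rng.random(ph0.shape) < ph0).astype(float)    # sample binary hidden states
    pv1 = sigmoid(h0 @ W.T)                             # reconstruct the visible layer
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W)                               # hidden probabilities under Q1
    # <s_i s_j>_Q0 - <s_i s_j>_Q1, using probabilities for the hidden factor.
    return W + eta * (np.outer(v0, ph0) - np.outer(v1, ph1))

# Toy example: 4 visible units, 3 hidden units (hypothetical sizes).
v0 = np.array([1.0, 0.0, 1.0, 1.0])
W = rng.normal(0.0, 0.1, size=(4, 3))
W = cd1_update(v0, W)
```

Using the hidden probabilities rather than sampled binary states for the correlation estimate is a common variance-reduction choice; sampled states would also be a valid instance of the rule.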
2.5 Extensions to Standard Learning Rules
This standard learning rule has been modified over time to yield practical improvements
in learning, or performance optimizations to speed it up. This section will examine the
most common of these extensions, and informally discuss why these extensions work.
The first extension, introduced in the early papers, is batch training: processing a set
of samples in parallel to obtain a better estimate of the gradient. In [13], Hinton
introduces batch training to combat the variance of individual samples. In training, the
goal is to estimate the gradient of a distribution from samples, but the variance of a
single sample can be large enough to swamp the gradient of the true distribution. By
averaging the gradient over several parallel samples, the inter-sample variance is
reduced relative to the learning signal. Moreover, the
parallel processing enables vector operation speedups, which are a common method of
optimization on modern computers. This idea is more fully explored in Section 6.1.4.
A common extension to gradient learning rules is momentum. It is straightforward to
implement, and acts by adding a fraction of the previous update to the current one.
When training is far from convergence, momentum accelerates progress by combining the
previous step and the current step into a larger step. The downside is that too much
momentum can cause learning to overshoot the goal distribution near the end of the
learning procedure.
Decay is also commonly applied to networks to help regularize the weights. By slowly
decaying the weight values, overlearned solutions with very strong weights are pulled
back towards equilibrium, regularizing the system. If the decay is too fast, the learning
procedure performs worse, but the right amount of decay makes the system more stable to
new data points and decreases overlearning.
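Momentum and weight decay combine naturally into a single parameter update. The sketch below is illustrative; the coefficient values are hypothetical, not tuned values from this thesis.

```python
import numpy as np

def apply_update(W, grad, velocity, eta=0.1, momentum=0.9, decay=1e-4):
    """Gradient step with momentum and weight decay (hypothetical coefficients).

    Momentum carries a fraction of the previous update forward, accelerating
    learning far from convergence; weight decay pulls large weights back
    toward zero, discouraging overlearned solutions."""
    velocity = momentum * velocity + eta * (grad - decay * W)
    W = W + velocity
    return W, velocity

# With a zero gradient, only decay acts, shrinking the weights slightly.
W = np.ones((2, 2))
vel = np.zeros((2, 2))
W, vel = apply_update(W, np.zeros((2, 2)), vel)
```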
Persistent contrastive divergence is a significant contribution by [41], in which the
equilibrium distribution is approached faster over time by continuing the Gibbs process
with every new data point, moving the sample points closer to the equilibrium
distribution. Interestingly, in this case the data and the model interact only through
the weights; the Gibbs chain that samples from the model distribution is initialized once
and slowly steps towards equilibrium. Of course, the equilibrium distribution specified
by the weights is itself changing slowly, but if the learning rate is low enough then the
process yields samples more closely tied to the equilibrium distribution. The comparison
between regular CD (called CD-1 because each model sample point comes from a
distribution one Gibbs step away from the data) and persistent contrastive divergence is
visualized in Figure 2.3. This idea is explored for spiking networks in Section 6.1.6.
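The persistent-chain idea can be sketched as follows; the chain state is carried across updates instead of being reset to the data. Sizes, learning rate, and data vectors are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def pcd_update(v_data, v_chain, W, eta=0.05):
    """Persistent CD sketch: the negative-phase chain `v_chain` is never reset
    to the data; it keeps stepping between updates, so its samples drift
    toward the (slowly moving) equilibrium distribution."""
    ph_data = sigmoid(v_data @ W)                        # positive-phase statistics
    ph = sigmoid(v_chain @ W)                            # advance the persistent chain
    h = (rng.random(ph.shape) < ph).astype(float)
    pv = sigmoid(h @ W.T)
    v_chain = (rng.random(pv.shape) < pv).astype(float)
    W = W + eta * (np.outer(v_data, ph_data)
                   - np.outer(v_chain, sigmoid(v_chain @ W)))
    return W, v_chain

W = rng.normal(0.0, 0.1, size=(4, 3))
v_chain = rng.integers(0, 2, size=4).astype(float)       # initialized once
for v_data in [np.array([1.0, 0.0, 1.0, 0.0]), np.array([0.0, 1.0, 0.0, 1.0])]:
    W, v_chain = pcd_update(v_data, v_chain, W)
```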
Finally, sparsity and selectivity have been incorporated into the learning procedure in
[42]. This innovation was originally inspired by biological evidence indicating that
neural receptive fields tend to be sparse, i.e. a neuron responds to a very specific set
of stimuli, and selective, i.e. a neuron responds rarely over time. By incorporating a
cost function and biasing the activations, neurons in an RBM can be forced to learn
different stimuli than are covered by others, and to choose discriminative features by
choosing receptive fields that occur only rarely. This scatters the receptive fields in a
more dispersed way over the problem space and leads to improved generalization and
test set performance.
Figure 2.3: Comparison of CD-1 and PCD. (a) In the CD-1 form of contrastive divergence,
every training iteration pairs the data distribution Q0 with model layers obtained one
Gibbs step away (Q1). (b) In persistent contrastive divergence, the model chain is not
reset, so successive training iterations pair Q0 with Q1, Q2, and Q3.
Chapter 3
Derivation of evtCD
This chapter introduces an online, event-based training rule for spiking RBMs. In Sec-
tion 3.1, previous work on training spiking RBMs is examined. Subsequently, the evtCD
algorithm that forms the core contribution of this thesis is introduced in Section 3.2.
3.1 Spiking Restricted Boltzmann Machines
Variations of training spiking RBMs have been tried, generally without much success. In
1999, Hinton and Brown [43] investigated using sigmoids that spike over time
(eschewing more biological neuron models), beginning an investigation into sequence
learning. However, this work was largely not continued; in the next appearance, Teh
and Hinton [44] applied contrastive divergence to continuously-valued face images using
rates instead of binary spike events. Purely binary representations do not work well
with the real-valued intensities of the face, so the authors proposed encoding a single
intensity with multiple binary features to allow discretized variations in intensity.
Alternatively, they viewed this as a rate-based code of binary events, and called their system
“RBMrate” as a result. This was subsequently applied to time-series data in [45], which
then moved further from the rate-based approximation of spiking to diffusion networks
of continuous values.
A significant advance was made when O’Connor proposed [46] using a rate-based model
of a neuron in his thesis work to encode the state values in contrastive divergence. In this
way, networks of spiking LIF neurons can be trained offline using a rule very similar to
standard contrastive divergence. However, instead of using binary activations adopted
from sigmoid probabilities, the layer activations were taken to be the continuously-
valued neuron spike rates given by the Siegert formula. Once trained, the weights are
transferred from the rate-based model to a network of spiking LIF neurons, and the
network of spiking neurons behaves according to the energy functions of an RBM. The
framework introduced there forms the starting point for the work in this thesis.
The actual function that transforms input rates and weights into an output rate is called
the Siegert function, from [47], and has a complicated analytical form. For completeness
and the relevance of this method, it is given here, but the reader is encouraged to
consult other works such as [48] for a more in-depth analysis of this function. Given
excitatory input rates ρe and inhibitory input rates ρi, the following auxiliary
variables can be calculated:
\mu_Q = \tau \sum (\vec{w}_e \vec{\rho}_e + \vec{w}_i \vec{\rho}_i) \qquad (3.1)

\sigma_Q^2 = \frac{\tau}{2} \sum (\vec{w}_e^2 \vec{\rho}_e + \vec{w}_i^2 \vec{\rho}_i) \qquad (3.2)

\Upsilon = V_{rest} + \mu_Q \qquad (3.3)

\Gamma = \sigma_Q \qquad (3.4)

k = \sqrt{\tau_{syn}/\tau} \qquad (3.5)

\gamma = |\zeta(1/2)| \qquad (3.6)
where τsyn is the synaptic time constant, τ is the membrane time constant, and ζ is the
Riemann zeta function. With these auxiliary variables, the average firing rate ρout of
the neuron with resting potential Vrest and reset potential Vreset can be computed as
[49]:
\rho_{out} = \left( t_{ref} + \frac{\tau}{\Gamma} \sqrt{\frac{\pi}{2}} \int_{V_{reset}+k\gamma\Gamma}^{V_{th}+k\gamma\Gamma} \exp\left[ \frac{(u-\Upsilon)^2}{2\Gamma^2} \right] \cdot \left[ 1 + \mathrm{erf}\left( \frac{u-\Upsilon}{\Gamma\sqrt{2}} \right) \right] du \right)^{-1} \qquad (3.7)
Let the function r_j = φ(r_i, w, θ_sgrt)/r_max denote the resulting firing rates r_j returned
by the Siegert function φ for input rates r_i, weights w, and parameter set θ_sgrt. The
Siegert-computed rate is then normalized by the maximum firing rate r_max = 1/t_ref. Since a
neuron cannot spike faster than 1/t_ref spikes per second, this normalization maps
the output of the Siegert function to the range [0, 1]. Now, for visible unit rates r_i and
hidden unit rates r_j, the contrastive divergence rule becomes:
\Delta w_{ij} \propto \langle r_i r_j \rangle_{Q^0} - \langle r_i r_j \rangle_{Q^1} \qquad (3.8)
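As a concrete illustration, the Siegert rate of Equations 3.1-3.7 can be evaluated numerically. The sketch below uses a simple trapezoid-rule integration, and all neuron parameters (time constants, thresholds, input weights and rates) are hypothetical values, not parameters from this thesis.

```python
import numpy as np
from math import pi, sqrt, erf

def siegert_rate(w_e, rho_e, w_i, rho_i, tau=0.02, tau_syn=0.001,
                 t_ref=0.002, v_rest=0.0, v_reset=0.0, v_th=1.0):
    """Numerical sketch of the Siegert firing rate (Eqs. 3.1-3.7)."""
    gamma_zeta = 1.4603545                                # |zeta(1/2)|, Eq. 3.6
    mu_q = tau * (np.dot(w_e, rho_e) + np.dot(w_i, rho_i))            # Eq. 3.1
    sigma_q = sqrt(tau / 2.0 * (np.dot(w_e ** 2, rho_e)
                                + np.dot(w_i ** 2, rho_i)))           # Eq. 3.2
    upsilon = v_rest + mu_q                                           # Eq. 3.3
    Gamma = sigma_q                                                   # Eq. 3.4
    k = sqrt(tau_syn / tau)                                           # Eq. 3.5
    # Integral of Eq. 3.7, evaluated with the trapezoid rule on a fixed grid.
    u = np.linspace(v_reset + k * gamma_zeta * Gamma,
                    v_th + k * gamma_zeta * Gamma, 2000)
    z = (u - upsilon) / Gamma
    f = np.exp(z ** 2 / 2.0) * (1.0 + np.array([erf(x / sqrt(2.0)) for x in z]))
    integral = float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(u)))
    return 1.0 / (t_ref + (tau / Gamma) * sqrt(pi / 2.0) * integral)

# Toy inputs: two excitatory and one (negative-weight) inhibitory Poisson source.
rate = siegert_rate(np.array([0.5, 0.5]), np.array([40.0, 40.0]),
                    np.array([-0.2]), np.array([10.0]))
```

The output rate is bounded above by 1/t_ref, consistent with the normalization to [0, 1] described above.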
This works quite well in practice, yielding networks that can achieve accuracies greater
than 95% on the MNIST handwritten digit benchmark task [14]. The receptive fields
resemble the digit parts that constitute handwritten digits as can be seen in Figure 3.1.
Figure 3.1: A visualization of the receptive fields of hidden layer neurons trained using
the Siegert method. Each square represents the weights of the visible layer connected to
that hidden layer neuron. Note here that the fields "factorize" handwritten digits into
digit parts. Figure from [14].
Knowing that Equation 3.8 performs well for training rate-based approximations of LIF
neurons, a link can be sought between a spike-based rule and this rate-based rule.
3.2 evtCD, an Online Learning Rule for Spiking Restricted
Boltzmann Machines
In addition to a possible performance improvement by switching from evaluating the
complex Siegert function to a simple update rule that occurs upon spiking, there is a
biological justification for examining spike-timing dependent plasticity (STDP). In the
brain, neurons change their connection strengths in response to the relative time between
the input neuron firing and the postsynaptic neuron firing (for a review of this so-called
spike-timing dependent plasticity, see [32, 33] and Figure 3.2). Many variations of these
STDP rules exist to capture the varieties of learning that spiking neurons display.
Since a rate-based rule can describe the net result of a large number of spikes, could it
be possible to design a spike-based rule that, in the rate-based limit, approximates the
Siegert update rule given in Equation 3.8? We begin by separating the problem into
identical subproblems specified by \Delta w^+_{ij} and \Delta w^-_{ij}:
Figure 3.2: Visualization of spike-timing dependent plasticity. This figure is taken from
[50], a seminal experiment measuring the effect of spike timing on changes in synaptic
strength. Shown here is the percent change of synaptic strength as a result of the
relative timing of presynaptic and postsynaptic spikes. In this experiment, presynaptic
spikes arriving before the postsynaptic neuron fires result in synaptic strengthening
(causal reinforcement), and presynaptic spikes arriving after the postsynaptic neuron
fires result in synaptic weakening (acausal depression). This differs from the STDP rule
used in evtCD, which only responds to causal spikes.
\Delta w_{ij} \propto \langle r_i r_j \rangle_{Q^0} - \langle r_i r_j \rangle_{Q^1} \qquad (3.9)

\Delta w_{ij} = \Delta w^+_{ij} - \Delta w^-_{ij} \qquad (3.10)

\Delta w^+_{ij} = \eta \langle r_i r_j \rangle_{Q^0} \qquad (3.11)

\Delta w^-_{ij} = \eta \langle r_i r_j \rangle_{Q^1} \qquad (3.12)
In this weight update rule, there are four different populations of neurons with their
individual rates: for Q^0, which we will call the data layers, there are the r_i rates
representing the visible layer, and the r_j rates representing the hidden layer.
Similarly, for Q^1, which we will call the model layers, there are the r_i rates
representing the visible layer, and the r_j rates representing the hidden layer. Since
this problem decomposes cleanly into the rates of four different populations of neurons,
those in the visible and hidden layers of the data and model distributions, we begin by
proposing a four-layer
architecture (Figure 3.3). Unlike contrastive divergence, a spike-based learning rule
requires populations to track states. In contrastive divergence, the sigmoids have no
continuity through time, so the notion of populations is not necessary; the state of a
layer is a random sample, drawn according to the probability function of its given inputs.
Here, however, networks of spiking LIF neurons maintain the states, so the four states
of the network are represented by four physically distinct populations: the data visible,
data hidden, model visible, and model hidden populations. Since the data and model
distributions share a weight matrix, they must be matched in size, but the visible and
hidden layers can be sized according to problem constraints. Similar to the standard
RBM, there are no intra-layer connections, the weights propagate activations forward
between visible and hidden layers, and the hidden-visible weights are the transpose of
the forward weights.
Figure 3.3: Architecture of a network used for event-based STDP-like updates. The evtCD
algorithm relies on a network of four neural layers (data visible, data hidden, model
visible, and model hidden), each encoding a different set of rates. The arrows indicate
the direction of information flow in the network. Importantly, the weight matrix W is
shared between the data and the model layers and determines the connection strength
between the visible and the hidden layers. The weight transpose W′ connects the data
hidden layer back to the model visible layer.
If the inputs are assumed to be Poisson-distributed with a rate specified by r, then the
expected number of spikes per unit time is r. We begin by proposing the following
STDP-inspired weight update rule:
\Delta w^+_{ij} = \begin{cases} \eta & \text{if } h^+_i = 1,\ v^+_j = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (3.13)

\Delta w^-_{ij} = \begin{cases} -\eta & \text{if } h^-_i = 1,\ v^-_j = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (3.14)
Since two samples are unlikely to ever be 1 at exactly the same point in continuous time,
we define a windowing period twin over which this statement can be valid (see Figure
3.4). Then, the ratio of number of spikes expected to occur in each window is the ratio
of the size of the window to the overall unit time length of the rate:
\mathbb{E}[h_i]_{t_{win}} = \langle h_i \rangle_{t_{win}} = t_{win}\, r_{h,i} \qquad (3.15)
The update rule, however, needs to be causal, since it is designed to operate in real
time: the system cannot learn about inputs that have not yet occurred. The windowing on
the Poisson rates is just an average rate over an arbitrarily chosen constant time
period, so we choose the period to end at the current time t and begin at t − t_win. In
the limit, then, the expected number of h_i events produced over a unit time period is
the rate r_{h,i}.
Therefore, the update rule is equivalent:
\Delta w_{ij} \propto \langle \langle h_i \rangle_{t_{win}} \langle v_j \rangle_{t_{win}} \rangle_{Q^0} - \langle \langle h_i \rangle_{t_{win}} \langle v_j \rangle_{t_{win}} \rangle_{Q^1} \qquad (3.16)

\Delta w_{ij} \propto \langle r_i r_j \rangle_{Q^0} - \langle r_i r_j \rangle_{Q^1} \qquad (3.17)
where the spike states h_i and v_j indicate the presence or absence of a spike, and the
time window t_win denotes the time over which the expectation is carried out. Importantly,
this is a spike-based rule and exceptionally sparse in its computation. Since both h_i and
v_j must be 1 for learning to occur, and the time window is defined such that the spike
h_i occurs at the end of the window, a possible weight update only needs to be calculated
when a hidden layer neuron spikes. At that point, the neuron can check which of its
inputs spiked within the previous t_win and either potentiate or depress its weights as
specified by the rule.
This rule, called the evtCD learning rule, is shown in Figure 3.4, and four examples
are shown in Figure 3.5.
Importantly, the evtCD rule only needs to be evaluated when a hidden neuron spikes.
Since the update is a product of two binary events, v_j h_i is only 1 when both v_j and
h_i are active. If either one is not active, the system does not update the weights, and
the connecting weight w_ij remains fixed. This is a key feature that allows very sparse
event-based computation.
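The spike-triggered check described above can be sketched compactly. The learning rate, window length, and array layout below are hypothetical illustration choices, not values from this thesis.

```python
import numpy as np

ETA, T_WIN = 0.005, 0.01  # learning rate and STDP window (hypothetical values)

def on_hidden_spike(W, j, last_visible_spike, t, data_layer):
    """Apply the evtCD update when hidden neuron j spikes at time t.

    `last_visible_spike[i]` holds the most recent spike time of visible
    neuron i. Visible spikes that occurred within the window t_win before t
    are causal candidates: their weights w_ij are potentiated for spikes in
    the data layers and depressed for spikes in the model layers."""
    dt = t - last_visible_spike
    causal = (dt >= 0.0) & (dt <= T_WIN)
    W[causal, j] += ETA if data_layer else -ETA
    return W

# Example: three visible neurons, two hidden neurons; hidden neuron 0 spikes
# at t = 1.0 s, and only the visible spikes at 0.992 s and 0.995 s are causal.
W = np.zeros((3, 2))
last_spiked = np.array([0.000, 0.992, 0.995])
W = on_hidden_spike(W, 0, last_spiked, t=1.0, data_layer=True)
```

No computation is required for weights whose visible neuron did not spike in the window, which is exactly the sparsity the text highlights.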
Next, we examine how to implement this architecture in practice.
Figure 3.4: The evtCD rule derived for this work. The learning rule is divided into two
halves, a weight-potentiating rule and a weight-depressing rule. Unlike most learning
rules, spikes from one set of populations (the data layers) only potentiate the weight
matrix, and spikes from another set of populations (the model layers) only depotentiate
the weight matrix. In both cases, the weight change only occurs if a hidden layer spike
("post") occurs within a window t_win after a visible layer spike ("pre"). In all other
cases, the weight remains fixed.
Figure 3.5: Four examples of the applied evtCD rule. This diagram is divided into two
halves, like the evtCD learning rule: spikes on the left side occur in the data layers,
and spikes on the right side occur in the model layers. The gray box preceding a hidden
layer spike represents the time window t_win, and spikes are vertical jumps with time on
the horizontal axis. The upper left quadrant represents a weight increase for the w_ij
connecting two neurons, because the visible layer spikes before the hidden layer and
within the time window t_win. The lower left produces no change because the visible layer
spike occurred outside the hidden layer spike window. On the right, the model
distribution behaves identically, but its weight update results in a decrease of w_ij
rather than an increase, since these spikes take place in the model layers.
Chapter 4
Implementation of evtCD
This chapter describes the software implementation of the evtCD learning rule. Now
that it has been shown mathematically in Equations 3.13 and 3.14, as well as
diagrammatically in Figure 3.4, Section 4.1 presents a method for implementing the algorithm.
In Section 4.2, a method of supervised training using the primarily unsupervised evtCD
algorithm is introduced using an idea from earlier work on RBMs [13], allowing labels
to guide the training of networks of spiking neurons.
4.1 Algorithm Recipe for Software Implementation
The aforementioned evtCD algorithm is straightforward to implement for a software
simulation. The process is as follows:
1. Begin by initializing auxiliary variables:
(a) Membrane{1:4}, the membrane potentials of the four neuron layers;
(b) Last Spiked{1:4}, the last time each neuron has spiked, by layer, in order to
find whether a neuron is a possible cause of spiking;
(c) Refrac End{1:4}, the time when the refractory period for a given neuron in a
given layer last ended, used to determine if a neuron is currently refractory;
(d) last update[1:4], the last time a neuron layer was updated with an incoming
spike, used in calculating membrane potential decay;
(e) Thr{1:2}, the thresholds for the visible and hidden neuron layers (shared
between the data and the model);
(f) W, the shared weight matrix.
The matrix Last Spiked that stores the time of the previous spike should be initialized
to a large negative value so as not to artificially potentiate weights from the initial
spikes. Membrane, Refrac End, and last update can be initialized to zero.
Thr, the spike threshold, is typically initialized to a value of 1 for all neurons.
Finally, W is chosen to be initialized from the uniform distribution on [0, 1], because
these uniformly excitatory weights cause more initial spikes for training than a
Gaussian centered around zero.
2. Begin processing input spikes. For an event-based implementation, a priority queue of
spikes is preferable, using a key composed of the time and the layer; every insertion
and extraction is O(log n). Every spike should be a triple of (time, address, layer).
This data structure is a convenience to accomplish a task biology does very simply:
delaying spikes between their generation and arrival. Here, a priority queue keeps the
spikes sorted according to time and layer, and ensures that the first spike processed
is the one that happens first; biology takes care of this problem automatically.
3. For each input spike:
(a) Decay the membrane potential of the receiving layer by e^{-\Delta t/\tau}, calculating
\Delta t from the spike time and the last update time for the receiving layer.
(b) Add an impulse corresponding to the weight w_{i,j} from the input spiking neuron j
to the receiving neuron i, provided the receiving neuron's refractory end period
refrac end is less than the current time. Since the visible-to-hidden layer weights
are W and the hidden-to-visible layer weights are W', index into the weight matrix
appropriately based on the layer.
(c) If desired, add noise to the neuron membrane potentials.
(d) Examine the updated neurons, comparing their membrane potentials to the
threshold membrane potentials.
(e) For every neuron that exceeds the threshold for that neuron:
i. Record a new refractory end period: refrac end{layer}[i] = spike time
+ t ref.
ii. Reset the membrane potential: Membrane{layer}[i] = 0.
iii. Record this time as the last time the neuron spiked: Last Spiked{layer}[i]
= spike time.
iv. Adjust the threshold by either lowering the threshold if this spike comes
from the data distribution (making more spikes more likely in the model
distribution), or by raising the threshold if this spike originates from a
model distribution.
v. If this layer is a hidden layer, an STDP weight update can be performed.
If this layer is a data distribution layer (layers 0 or 1), then every weight
w_ij corresponding to an input neuron j whose Last Spiked{layer-1}[j]
is within the time window t_win should be potentiated. If the current
layer instead belongs to the model distribution, spikes that occurred within
the previous window will be depotentiated. Regardless, if there was no
spike from a preceding neuron, its weight is unaffected.
vi. Add new spikes to the spike queue so downstream neurons receive these
spikes.
Note that in its simplest form, the exponential decay can be handled by bitshifting. The
summation of input currents is a sum, learning requires only a lookup and a comparison,
and updating a weight with a new value is only another addition. Even if the only
available operations are bitshifting, addition, and subtraction, this rule is
implementable, making it ideal for low-compute architectures [21].
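The event-processing loop of steps 2-3 can be sketched for a single receiving layer using Python's `heapq` as the priority queue. Neuron parameters, weights, and the single-layer simplification are illustrative assumptions; the full algorithm maintains four layers and the STDP bookkeeping described above.

```python
import heapq
import math

TAU, T_REF, THRESHOLD = 0.02, 0.002, 1.0  # hypothetical neuron parameters

def run(spikes, W, n_out):
    """Process (time, address, layer) spike triples in time order for one
    receiving layer of n_out LIF neurons; returns the output spikes."""
    queue = list(spikes)
    heapq.heapify(queue)              # priority queue keyed on spike time
    membrane = [0.0] * n_out
    refrac_end = [0.0] * n_out
    last_update = 0.0
    out_spikes = []
    while queue:
        t, j, layer = heapq.heappop(queue)
        decay = math.exp(-(t - last_update) / TAU)    # step 3(a): decay
        last_update = t
        for i in range(n_out):
            membrane[i] *= decay
            if t >= refrac_end[i]:                    # step 3(b): not refractory
                membrane[i] += W[j][i]
            if membrane[i] > THRESHOLD:               # steps 3(d)-3(e): spike
                refrac_end[i] = t + T_REF
                membrane[i] = 0.0
                out_spikes.append((t, i))
    return out_spikes

# One input neuron strongly connected to output neuron 0 (toy weights).
out = run([(0.001, 0, 0)], [[1.5, 0.0]], n_out=2)
```

A full implementation would also push the emitted spikes back onto the queue for downstream layers (step 3(e)vi) and record Last Spiked for the STDP update.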
4.2 Supervised Training with evtCD
To measure the efficacy of the evtCD algorithm, it is necessary to objectively assess
its accuracy. In Chapter 5, the exact methods of performance measurement will be
explained, but it is worth discussing the training process by which supervised training
can occur in the unsupervised process of distribution matching. The idea is quite old
and goes back to the early days of training unsupervised learners: by making the label
part of the input distribution, the system is forced to learn the relationship between
the input distribution and the labels and to cluster these elements together [1, 13].
See Figure 4.1 for an example; by concatenating the top label layer with the input
layer, a new RBM can be made that trains in a supervised way when it performs
distribution matching.
After training this RBM as if it were a normal RBM, it can be unrolled back to its
original configuration to perform classification. By separating the weights and biases
from the joint layer correctly, the original three-layer architecture can be reconstructed.
Then, passing the activations through the three layers, input to hidden to label, will
classify an example according to the weights of the system.
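The concatenation and subsequent unrolling can be sketched as follows. The helper names, sizes, and the bias-free sigmoid pass are hypothetical illustration choices, not the exact implementation used in this thesis.

```python
import numpy as np

def concat_label(v, label, n_classes=10):
    """Append a one-hot label to the visible vector so the RBM learns the
    joint distribution of inputs and labels."""
    one_hot = np.zeros(n_classes)
    one_hot[label] = 1.0
    return np.concatenate([v, one_hot])

def classify(v, W_joint, n_visible):
    """Unroll after training: split the joint weights back into input weights
    W1 and label weights W2, then pass input -> hidden -> label."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    W1, W2 = W_joint[:n_visible], W_joint[n_visible:]   # rows: visible then label units
    h = sigmoid(v @ W1)
    return int(np.argmax(h @ W2.T))                     # most active label unit

# Toy example: 4 input units, 5 classes, 6 hidden units (hypothetical sizes).
rng = np.random.default_rng(0)
W_joint = rng.normal(0.0, 0.1, size=(4 + 5, 6))
joint = concat_label(np.ones(4), 3, n_classes=5)
label = classify(np.ones(4), W_joint, n_visible=4)
```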
Figure 4.1: Method of supervised learning using unsupervised learning: by concatenating
the label layer and the visual input layer into a joint visible layer (with weights
{W1, W2} to the hidden layer), and learning the joint representation of input and label,
the system is forced to learn to cluster the labels with the data.
Chapter 5
Test Methodology
This chapter describes the setup used to train and evaluate the evtCD networks. In
Section 5.1, the dataset used to assess the performance of the training algorithm is
introduced. Following that, two implementations are described for achieving different
aims. In Section 5.2, the methodology for the time-stepped simulation is described, and
Section 5.3 discusses the methodology of using the network for online training of a
spiking RBM.
5.1 MNIST
The MNIST handwritten digit dataset is an extremely popular dataset used in machine learning, compiled by [35], and often used as a benchmark task for new learning algorithms. The challenge is straightforward: given 60,000 training digits, each a 28 by 28 pixel image with the digit in the center, correctly identify a handwritten digit from the 10,000 digit test set.
A human achieves about 99.6% accuracy on this dataset, and the best algorithms in
the world achieve equivalent performance [1, 35, 51]. Often, to boost performance,
transformations such as rotations, translations, and deformations are applied to the images to enlarge the training set and improve robustness [14, 37]. These transformations
were not performed here.
There are several attributes that make this dataset attractive for initial investigations of a new machine learning rule. First, its dimensionality of 784 inputs is modest: large enough to challenge simple algorithms, yet small relative to a modern computer's processing power. Second, the data can be represented as binary activations without losing its identifying characteristics; a pen mark can be characterized as either present or missing, and learning rules (including evtCD and CD [13]) initially only
supported binary data. Standard RBMs have since been extended to represent real-valued inputs [37], but that investigation has not yet been performed for evtCD training. Third, the weights of the neurons are interpretable, as they represent receptive fields over the digits: by visually inspecting the weights of hidden neurons, it is possible to tell whether the learning rule has produced proper receptive fields that decompose the input into component parts. Finally, members of the machine learning field are very familiar with this dataset, so conclusions about an algorithm's strengths and weaknesses can be drawn clearly on this common benchmark.
Figure 5.1: Six digits from the MNIST corpus. Across the top row are examples of easily classified digits, and the bottom row contains digits 1, 2, and 8 that posed difficulty for the evtCD algorithm.
Performance on the MNIST benchmark task can be found later in the results section of
this work, specifically in Figure 6.22. It is worth pointing out an important benchmark
here, however: a least-squares (optimal) linear regression can achieve 86.03% classification accuracy when trained on the full 60,000 digit training set and all pixels. Ideally,
the evtCD training algorithm would surpass this level of accuracy, given the nonlinear
transformations that the spiking RBM performs.
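For reference, that least-squares baseline amounts to fitting one-hot label targets with a linear map (plus bias) and taking the argmax at test time. Below is a hedged Python/numpy sketch on random stand-in data, not the thesis code; random data will of course not reproduce the 86.03% figure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for MNIST: n examples of d = 784 pixels, labels 0-9.
n, d, k = 500, 784, 10
X = rng.random((n, d))
y = rng.integers(0, k, size=n)

# One-hot targets; append a constant column so the fit includes a bias term.
T = np.eye(k)[y]
Xb = np.hstack([X, np.ones((n, 1))])

# Optimal linear map in the least-squares sense.
Wls, *_ = np.linalg.lstsq(Xb, T, rcond=None)

# Classify by the largest linear response.
pred = np.argmax(Xb @ Wls, axis=1)
train_acc = float(np.mean(pred == y))
```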
5.2 Time-stepped Training Methodology
For experimental and debugging reasons, the Matlab implementation was time-stepped.
The task of the time-stepped implementation is to provide a platform to easily study
the parameters of the system and find a method to optimize the overall operation of the
evtCD algorithm.
The time-stepped testing methodology consists of the following steps:
1. Load the MNIST handwritten digit database of 60,000 training digits and 10,000
test digits.
2. Establish parameters for an evtCD training simulation, and initialize the network
architecture for supervised training.
3. Draw spikes from each of the 60,000 digits in the training set, and pass these spikes
as samples from the data-visible layer.
4. Train the network according to the evtCD algorithm, and dispose of spikes emitted
from the model hidden layer.
5. At specific timepoints during the training, as shown in Section 6.1, save network
snapshots for offline analysis.
The frame-based MNIST database was transformed into spike trains as described in
the next section (Section 5.2.1). Every digit was presented for 1/10th of a second of
simulated time, with each “on” pixel emitting 10 spikes on average. The likelihood of a
spike emitted from a pixel was proportional to its intensity, as explained in the following
section (Section 5.2.1).
In addition, the labels were used as part of the input. This supervised training method
is described in Section 4.2 and shown in Figure 4.1. To learn the labels, the label layer
and the input layer were concatenated into a single layer, and the joint distribution of pixels and labels was learned. This allows an objective metric of training
performance: classification accuracy after 1 epoch on the MNIST handwritten digit
classification task. To determine the network’s choice for a presented digit, the output
layer neuron with the most spikes was chosen as the selected digit.
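That winner-take-all readout is simple enough to state directly. A Python sketch, assuming label-layer events arrive as (neuron_index, time) pairs (the function name is hypothetical):

```python
import numpy as np

def classify_by_spike_count(label_spikes, n_labels=10):
    """Return the index of the label-layer neuron that spiked most
    during one digit presentation."""
    counts = np.bincount([idx for idx, _ in label_spikes], minlength=n_labels)
    return int(np.argmax(counts))

# Example: neuron 7 fires three times, neuron 1 once -> digit 7 is chosen.
events = [(7, 0.012), (7, 0.031), (1, 0.040), (7, 0.088)]
```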
The architecture for the network is the one illustrated in Figure 6.1, with 784 neurons
in the first layer, corresponding to the 28*28 pixel images of MNIST [35], 100 neurons
in the hidden layer, and 10 neurons in the output layer corresponding to the 10 labels.
This manuscript also contains the full source for a Matlab implementation of time-
stepped training, which can be found in Appendix B.
This training begins by extracting spikes from MNIST images, a process detailed in the
next section.
Figure 5.2: Drawing an increasing number of spikes from the MNIST handwritten digit database. Shown here are 10, 50, 100, 500, and 5000 spikes drawn from a sample "four" digit using Algorithm 1, shown in Section 5.2.1.
5.2.1 Extracting Spikes From Still Images
This technique predates this thesis work, going back at least to [14], but the specification
has not been fully described in print before.
Given a sample image, the spike train should converge, in the limit as the number of spikes increases, to a rate encoding of the image. The absolute spike rate should be a fixed parameter, while the relative spike rate should emphasize bright pixels over dark ones. This can be accomplished by drawing each spike from the image with probability proportional to pixel intensity. The Matlab function randsample efficiently addresses this task, generating spike trains such as those seen in Figure 5.2.
This conversion from fixed image to spike train is used in this work whenever spike trains
are needed from frames.
The timing of each spike is randomly generated: since a given number of spikes are emitted over a given amount of time, a random time is assigned to each spike, and the rates average out correctly over the presentation period. Algorithm 1 lists the source of this routine, which efficiently generates spike trains from data vectors.
function [addr, times] = drawspikes(train_x, opts)

trials = size(train_x, 2);
addr = zeros(opts.numspikes, trials);
times = zeros(opts.numspikes, trials);

for trial = 1:trials
    % Assign addresses
    addr(:, trial) = randsample(numel(train_x(:, trial)), ...
        opts.numspikes, true, train_x(:, trial));
    % Assign times
    times(:, trial) = sort(rand(size( ...
        addr(:, trial)))).*opts.timespan;
end

Algorithm 1: Generating spike trains from data vectors.
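For readers without Matlab, the same procedure can be sketched in Python/numpy, with np.random.choice playing the role of randsample (an illustrative translation, not part of the thesis code):

```python
import numpy as np

def drawspikes(train_x, numspikes, timespan, rng=None):
    """Draw spike addresses with replacement, with probability proportional
    to pixel intensity, and assign each spike a sorted uniform-random time
    within the presentation window."""
    rng = rng or np.random.default_rng()
    n_pixels, trials = train_x.shape
    addr = np.zeros((numspikes, trials), dtype=int)
    times = np.zeros((numspikes, trials))
    for trial in range(trials):
        p = train_x[:, trial] / train_x[:, trial].sum()
        addr[:, trial] = rng.choice(n_pixels, size=numspikes, replace=True, p=p)
        times[:, trial] = np.sort(rng.random(numspikes)) * timespan
    return addr, times

img = np.zeros(784)
img[100] = 1.0  # a single "on" pixel receives all the probability mass
addr, times = drawspikes(img.reshape(-1, 1), numspikes=10, timespan=0.1)
```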
5.3 Online Training Methodology
Before discussing the online training methodology, it is necessary to describe the method
of generating the input to the online training. To test the real-time nature of this system,
it is necessary to use an image sensor that can produce a spiking output, such as the DVS
spiking vision sensor [24]. Since the DVS produces spikes only in response to temporal contrast changes, it does not spike in static scenes. Therefore, either the scene or the sensor must be moved in intelligent ways to produce spikes in response to static images. Since such movement is necessary, it makes sense to adopt a model used in biology to solve the same problem.
5.3.1 A Software Implementation of Fixational Eye Movements
The implementation described here was published by Engbert et al. [52] to model the
fixational movements of eyes and includes fixational microsaccades. This model does
not include the large saccadic eye movements which can be driven by top-down or
bottom-up attention, but is designed solely to emulate the movement of an eye focusing
on a particular point and the small movements it undergoes to prevent saturation of
photoreceptors. Though the implementation described here moves the image of the
world while keeping an eye fixed, rather than moving an eye while keeping the world
fixed, this model stimulates the DVS camera in a biologically realistic way.
The model is composed of three factors:
1. A self-avoiding random walk designed to mimic the small tremors of eye muscles;
2. An energy well designed to pull the focus of the eye back to the center;
3. Occasional, small-amplitude rapid movements of the eye known as microsaccades
to reach a new location.
The first component is the self-avoiding random walk. Informally, this is modeled as a walk across a surface, where the direction chosen is toward the lowest neighbor. After stepping onto a position, the energy level of that position rises, making it less desirable, and then slowly decays back to its starting level. When that spot is encountered later, it may have decayed back to being a desirable choice or may still be less desirable than its neighbors. This makes the random walk self-avoiding, which is a better model of biology because the muscles do not suddenly reverse direction and return to the exact spot from which they arrived [52].
Figure 5.3: Fixational eye movements. The background color represents the energy well that pulls the eye movements back to the center, the red circle indicates the current fixational point, and the black line connects 40 previously chosen points. This figure was generated by the implementation in Appendix C of the model introduced in [52].
Secondly, a model of fixational eye movements needs a factor to focus the eye towards the
center of the region of interest. In this case, there is an energy well that is quadratically
defined over the surface of interest, with its minimum at the center of the image. This can
be combined with the energy state defined by the self-avoiding random walk. Choosing the next position then amounts to finding the minimum of the sum of the energy well and the walk surface, and stepping to that position [52].
Thirdly, the eye makes small movements around the center of the image known as
microsaccades. In the model proposed in [52], a microsaccade is triggered when the local energy of the current position is too high: when the level exceeds a threshold, the focus jumps from the current position to the global energy minimum. However, to model the
effects of the muscles on the eye, a cost is added that encourages the movement to occur
predominantly along a major axis, either horizontal or vertical, rather than a biologically
unrealistic diagonal movement. Figure 5.3 shows a visualization of the movement output
as well as the energy landscape used to generate the movement.
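Putting the three factors together, the model can be sketched as follows in Python/numpy. This is written from the description above rather than from the Appendix C code: the grid size is illustrative, and the axis-bias cost on microsaccade direction (chi) is omitted for brevity.

```python
import numpy as np

L = 51                              # illustrative grid size
lam, sinkeps, hc = 1.0, 0.001, 7.9  # parameters as in Table 5.1
yy, xx = np.mgrid[0:L, 0:L]
center = (L - 1) / 2.0
# Quadratic potential well pulling fixation back toward the center.
potential = lam * ((xx - center) ** 2 + (yy - center) ** 2) / center ** 2
h = np.zeros((L, L))                # activation (self-avoidance) field
pos = np.array([L // 2, L // 2])

def step(pos, h):
    """Move to the lowest-energy 4-neighbor, bump the activation there,
    relax the field, and jump to the global minimum (a microsaccade)
    if the local activation exceeds the threshold hc."""
    nbrs = np.clip(pos + np.array([[1, 0], [-1, 0], [0, 1], [0, -1]]), 0, L - 1)
    energies = [h[r, c] + potential[r, c] for r, c in nbrs]
    pos = nbrs[int(np.argmin(energies))]
    h[pos[0], pos[1]] += 1.0        # visited spots become less desirable
    h *= (1.0 - sinkeps)            # slow relaxation of the field
    if h[pos[0], pos[1]] > hc:      # microsaccade trigger
        pos = np.array(np.unravel_index(np.argmin(h + potential), h.shape))
    return pos, h

path = [pos.copy()]
for _ in range(200):
    pos, h = step(pos, h)
    path.append(pos.copy())
```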
The full algorithm is listed in Appendix C. The default parameters were initialized following [52] and can be found in Table 5.1.
Parameter | Description | Value
lambda | Slope of the potential | 1
sinkeps | Relaxation rate | 0.001
chi | Vertical/horizontal constraint for stabilizing microsaccade direction | 2*lambda
hc | Critical value for triggering microsaccades | 7.9

Table 5.1: Parameters for the eye movement, adapted from [52].
This code generates offsets that shift an image on the screen in a biologically realistic way, producing the visual input that is displayed on the screen in front of the DVS.
5.3.2 Training Environment
The training environment for the jAER implementation consists of three components:
the screen to display the data, the DVS system to receive the spike-based representation,
and the computer running the jAER environment. The completed setup can be seen in
Figure 5.4. In this environment, the training consists of rapidly displaying images from
the Matlab environment and performing online learning within Java.
5.3.3 Java Reference Implementation
The Java reference implementation relies on several environmental factors as well as
constants to appropriately build receptive fields. The secondary visualizer displays digits
at around 26 digits per second to the observing DVS. These spikes form the input to the visible layer in the evtCD algorithm, triggering learning in the hidden neuron layer with the parameters shown in Table 5.2.
Figure 5.4: Photograph of the training environment.
Parameter | Description | Value
t win | Window width of evtCD learning | 0.005s
tau ref | Refractory period | 0.001s
inv decay | Inverse decay offset to slowly aid learning | 1e-5
eta | Learning rate | 1e-3
tau recon | Reconstruction time constant for visualizing reconstructions | 0.010s
tau | Membrane time constant | 0.100s
thresh eta | Threshold learning rate | 0

Table 5.2: Parameters for online training using the Java reference implementation.
Chapter 6
Quantification of evtCD Training
After the description of an implementation of the evtCD training rule and the training
methodology, this chapter evaluates the performance of the algorithm compared to other
training methods and standard training paradigms. In Section 6.1, an examination of
parameters is performed to assess the optimal parameter space of learning in the network.
In Section 6.2, the training algorithm is combined with a linear decoder to achieve 90.3%
accuracy on the MNIST handwritten digit identification task [35]. Finally, Section 6.3
demonstrates the real-time nature of learning, as the system was rapidly trained to form
receptive fields for identifying digits, achieving 86.7% accuracy after training on 2.5% of
the available data presented over 60 seconds.
6.1 Improving Training through Parameter Optimizations
A major aim of this work is to give an intuitive understanding of how neuron and learning
parameters can affect the evtCD training algorithm, as well as to propose insights for
future work. The default setup for training, called the baseline parameters, can be found
in Table 6.1 and the methodology for this training was described in Section 5.2.
The following alterations to the standard training paradigm will be investigated:
1. Learning rate: how large should a standard weight update be?
2. Number of input events: how do different quantities of input spikes affect learning,
and can the system degrade gracefully with less input?
3. Batch size: can training be parallelized for performance reasons without sacrificing
accuracy?
4. Noise temperature: what role do membrane-potential noise and stochasticity play in training?
5. Persistent Contrastive Divergence: does this powerful tool from regular CD also
aid the evtCD training algorithm?
6. Weight limiting: can limiting the range of weights result in better training?
7. Inverse decay: can a constant potentiation in weights help the training process?
These parameters and extensions will be examined in turn.
Figure 6.1: The architecture of the trained networks, to scale: input layer, hidden layer, and label layer. The first layer is 784 neurons (trained on 28*28 pixel digits), the hidden layer is 100 neurons, and the label layer is 10 neurons.
6.1.1 Baseline Training Demonstration
These parameter evaluations for the evtCD rule were compared against a trial run with a default, accurate, and reasonably fast set of parameters, referred to here as the baseline training parameters. This section introduces the typical behaviour of the network using these parameters.
As can be seen from Figure 6.2, the time evolution of the weights follows a very stereotyped pattern: after about 10,000 digit presentations, each randomly initialized receptive field begins converging toward a nearby local minimum [53]. This happens when a hidden neuron successfully finds a component of the input space
Parameter | Description | Value
temperatures | Variance of the noise | 0.01, 0.01, 0.01, 0.01
epochs | Number of times the entire data set is presented for training | 1
eta | Learning rate | 0.005
momentum | Momentum of weight updates | 0
decay | Decay of weights | 0
t win | STDP rule window width | 0.030s
t refrac | Refractory period of a neuron | 0.001s
tau | Membrane time constant | 0.050s
inv decay | Inverse decay offset to slowly aid learning | 0.001 * eta
batchsize | Number of parallel training samples | 10
thr | Threshold of a neuron | 1

Table 6.1: Parameters for the baseline Matlab-based implementation.
that helps to factorize the image into parts, as discussed in [13, 37]. For the handwritten digits examined here, the digits factorize into parts such as a vertical element or a curve, and over successive presentations the hidden neurons develop receptive fields that correspond to these factored elements.
The baseline parameters result in the rapidly increasing accuracy shown in Figure 6.3, eventually peaking at 79.13% classification accuracy. If training for longer than one epoch were desired, it would be beneficial to decrease the learning rate, as accuracy plateaus too early; reaching minimum error within a single epoch of training is a clear sign that the learning rate is too high [54].
Figure 6.2: The weights of four example hidden-layer neurons with increasing training examples: (a) after 10,000 input digits; (b) after 30,000 input digits; (c) after 60,000 input digits. The brightness encodes the weight value, and each neuron can be seen becoming tuned to a particular set of digit regions.
Figure 6.3: Baseline accuracy of the evtCD training algorithm for one epoch of 60,000 training digits (horizontal axis: digits presented, in thousands; vertical axis: accuracy [%]), eventually peaking at 81.46% classification accuracy. The overall record for this size network trained with evtCD is 81.5%, so this accuracy lies close to the peak accuracy achieved so far.
6.1.2 Learning Rate
The most basic parameter of training is the learning rate, eta. This parameter determines the size of a weight update when a hidden layer neuron spikes, and controls how
quickly the system changes its weights to approximate the input distribution. The learning rate that results in peak performance is smaller than for typical sigmoid networks, on the order of 10^-3, compared to traditional CD, which can train with an eta value of 1.
When using the default threshold value for evtCD training, a single weight update using
eta = 1 could cause the weight to exceed the threshold value of that neuron. In an ideal
case, the system would recover and raise its threshold to compensate, but it is possible
for the weights of the network to move the system into a nonspiking regime from which it
will never recover (unlike a sigmoidal network). For this reason, using a smaller learning
rate is preferable.
As can be seen in Figure 6.5, a learning rate that is too fast learns quickly but achieves its peak performance early and is then unable to improve, overshooting the learning target [37]. On the other hand, too small a learning rate requires more learning iterations to reach its saturation level.
Figure 6.4: Effect of learning rate on receptive field formation. The horizontal axis increases with the number of presented digits, in thousands; the vertical axis indicates the learning rate for weight updates (1e-05, 1e-04, 1e-03, baseline 5e-3, 1e-02, 1e-01). Note that slower learning rates take longer to develop receptive fields, but fast ones learn improper features and then stop learning.
Figure 6.5: Accuracy evolution with different learning rates (1e-05 to 1e-01; horizontal axis: digits presented, in thousands; vertical axis: accuracy [%]). The baseline parameter of 0.005 was chosen to balance the rapidity of the 0.01 learning rate with the more careful 0.001 learning rate.
6.1.3 Number of Input Events
The number of input events controls the amount of coincident information available for
training in the evtCD algorithm. By raising the number of input events, a neuron is
more likely to encounter joint activations of input and to develop a receptive field for
those regions of input. On the other hand, by lowering the number of input events, it
is possible that spikes will never overlap and a hidden neuron will never uncover joint
probabilities to encode. Because of this, it is important to establish the membrane
time constant tau and the STDP window t win in relation to the spike rate. In these
experiments, tau and t win were scaled proportionately to the baseline input rate of 10
spikes per pixel per digit presentation (which results in a spike rate, in a maximally
“on” pixel, of 100 Hz). No learning will occur if the spike rate drops low enough; in that
case, the exponential decay relaxes the neuron’s membrane potential to resting voltage
before another spike comes in, so it is necessary to lengthen these windows to allow a
fair comparison.
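The scaling used here can be written down explicitly. Below is a small Python helper, under the assumption that both windows scale inversely with the input rate relative to the 10-spikes-per-pixel baseline (the helper name and exact rule are illustrative, not taken from the thesis code):

```python
# Baseline: 10 spikes per "on" pixel per digit, tau = 50 ms, t_win = 30 ms.
BASE_RATE, BASE_TAU, BASE_TWIN = 10, 0.050, 0.030

def scaled_windows(spikes_per_pixel):
    """Lengthen tau and t_win when the input is sparser, shorten them
    when it is denser, keeping the windows matched to inter-spike intervals."""
    scale = BASE_RATE / spikes_per_pixel
    return BASE_TAU * scale, BASE_TWIN * scale

tau4, twin4 = scaled_windows(4)   # sparser input -> longer windows
```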
The spike rate shown on the vertical axis in Figure 6.6 and in the legend of Figure 6.7
is the expected number of spikes an “on” pixel can send over the presentation of a digit
(100 ms). Note that the accuracy reaches a peak around the chosen baseline parameter
value of 10 spikes per pixel per digit presentation, and falls off in accuracy with either
fewer or more input events. There appear to be two modes of peak performance in Figure 6.7, one peak around 100 Hz and a second at a much higher input rate of 250 Hz. Since the system is flexible in the number of input events, and the number of events dominates the training time, it is better to choose the lower mode of 100 Hz for these initial investigations.
Qualitatively, Figure 6.6 suggests that as the spike rates increase, the receptive field
specialization increases as well. The receptive fields appear to be more detailed as the
number of coincident input spikes increases, allowing more selectivity in the types of
inputs that drive them.
Figure 6.6: Effect of input rates on receptive field formation. The horizontal axis increases with the number of presented digits, in thousands; the vertical axis indicates the input rate in spikes per "on" pixel per digit (4, 8, baseline 10, 12, 15, 25). Note that with increasing input rates, the specificity of features appears to increase.
Figure 6.7: Accuracy evolution with different input rates (4 to 25 spikes per "on" pixel per digit; horizontal axis: digits presented, in thousands; vertical axis: accuracy [%]). The accuracy increases with input rate and reaches a peak at 100 Hz (10 spikes per 100 ms), with a second mode at the much higher rate of 250 Hz.
6.1.4 Batch Size
In the standard contrastive divergence model, taking a batch of data and processing it
in parallel is an important step for two reasons [13, 55]. First, parallel processing is often
more efficient on modern computers, and batch processing enables the learning algorithm
to capitalize on heavily optimized matrix operations [37, 54]. Secondly, it decreases the variance of a single learning sample, preventing the algorithm from erroneously avoiding areas of high variance. In [13], Hinton offers an analogy: when a thin sheet of metal is vibrated, sand particles (following gradient descent) scattered over its surface settle into the regions between the oscillating peaks, avoiding regions of high variance even though the time-averaged mean is zero everywhere. Averaging a large number of parallel training iterations therefore results in better learning of the true gradient [37, 54].
Batch training in this network is implemented as if there are batchsize parallel networks
updating a common weight structure once per ms. Since the implementation is time-
stepped, all the weight updates happen in parallel based on the activity across the past 1
millisecond timestep. Each parallel network adds a vote to the direction of the gradient,
and the average direction is taken with a normalized weight update. Their collective
update has the same learning rate as a single step from the baseline training example, but
incorporates more evidence about the correct gradient direction for learning. Because
of this, a weight update from a batch run can achieve equal accuracy with fewer weight
updates. Eventually, the training with parallel updates should provide a better estimate
than a single sample point, so its accuracy should exceed training using a single point
as described above [13].
The parallelization comes at effectively zero computational cost in the time-stepped
implementation for small levels of parallelization (the values shown here). The execution
time using batches of 10 is the same as using batches of 1, so a batchsize of 10 was
chosen as the optimal parameter. This additionally coincides with previous suggestions
of batch sizes equal to the number of classes in the data [54]. The receptive fields form in approximately the same number of digit presentations, and the accuracy suggests that each weight update is more reliable than in the batch-of-one case. Finally, the slow increase in accuracy for larger batch sizes appears promising for future investigations, as it could allow more accurate learning given more training time.
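The averaged update described above can be stated in one line. A Python sketch, under the assumption that each parallel network contributes one gradient "vote" per millisecond timestep:

```python
import numpy as np

def batched_update(W, grads, eta):
    """Shared weights move by the learning rate times the *average* vote,
    so the step size matches a single-network update while averaging
    evidence across all parallel networks."""
    return W + eta * np.mean(grads, axis=0)

W = np.zeros((4, 3))
grads = np.stack([np.ones((4, 3)), -np.ones((4, 3))])  # two opposing votes
W_new = batched_update(W, grads, eta=0.005)            # votes cancel out
```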
Figure 6.8: Effect of batch size on receptive field formation. The horizontal axis increases with the number of presented digits, in thousands; the vertical axis indicates the number of parallel-training network batches used to calculate the learning gradient (2, baseline 10, 50, 100). Batches of size 10 and size 1 develop receptive fields equally fast, and the performance advantages of larger batch sizes made it preferable as a choice for a baseline parameter.
Figure 6.9: Accuracy evolution with increasingly parallel estimates of the gradient (batch sizes 2, baseline 10, 50, 100; horizontal axis: digits presented, in thousands; vertical axis: accuracy [%]). Accuracy is slightly improved at moderate batch sizes, and learning occurs much more slowly on a per-digit basis for larger batches.
6.1.5 Noise Temperature
In the evtCD training algorithm, noise can have beneficial effects as well as the expected
detrimental ones. This occurs for two reasons: first, noise helps to regularize the weights
[37], and secondly, noise helps to ensure that the neurons always fire. As has been
mentioned before, the evtCD learning rule only functions when neurons emit spikes,
and the noise term helps to cause neurons to spike.
Moreover, in the evtCD training algorithm, samples are obtained by propagating the
activations of one layer to the next, and there can be significant losses in activations.
Without a term like a bias to encourage more spiking, the activation can decay away;
this is explored more fully in Section 6.1.6.
The noise term was added in two different ways: the first was to take a Gaussian with
mean zero and variance as indicated by the vertical axis in Figure 6.10, then perturb
the membrane potential of each neuron by this noise amount once per timestep (one
millisecond). Note that the threshold was fixed at 1, so the variance on these plots can
actually result in many erroneous spikes.
The second method of adding noise is more biologically plausible: the Ornstein-Uhlenbeck process [34, 56], a low-pass filtered Gaussian with a time constant of 25 ms [34]. This prevents the noise from rapidly fluctuating the membrane potential, instead providing a random offset that moves much more slowly.
Interestingly, it appears that purely Gaussian noise helps the neurons to learn more
quickly than in the absence of noise, due to their increased activity. The low-pass filtered
noise results in more precisely-defined receptive fields, which reflects the observations
found in Section 6.1.3.
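The two noise generators can be contrasted in a few lines of Python/numpy. This is a sketch of the general technique, not the thesis code; the discretization and the sigma scaling are assumptions.

```python
import numpy as np

def ou_noise(n_steps, dt=0.001, tau=0.025, sigma=0.01, rng=None):
    """Ornstein-Uhlenbeck (low-pass filtered Gaussian) noise: each step
    decays toward zero with time constant tau and adds a fresh Gaussian
    innovation, so the offset drifts slowly rather than fluctuating
    independently every millisecond."""
    rng = rng or np.random.default_rng(0)
    x = np.zeros(n_steps)
    alpha = dt / tau
    for t in range(1, n_steps):
        x[t] = x[t - 1] * (1.0 - alpha) + np.sqrt(sigma * alpha) * rng.normal()
    return x

white = np.random.default_rng(0).normal(0.0, 0.1, 1000)  # unfiltered Gaussian
smooth = ou_noise(1000)
# `smooth` changes far less between consecutive timesteps than `white`.
```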
Figure 6.10: Comparison of receptive fields in neurons trained with evtCD under various noise variances (0.000, baseline 0.01, 0.050, 0.200, 0.400, 0.600): (a) neurons trained with Gaussian noise; (b) neurons trained according to the Ornstein-Uhlenbeck process [34, 56]. Note that the introduction of a small amount of noise helps accelerate learning, and the system is able to develop features that ignore the noise.
Figure 6.11: Comparison of accuracy in neurons trained with evtCD under various noise variances (0.000, baseline 0.01, 0.050, 0.200, 0.400, 0.600; horizontal axis: digits presented, in thousands; vertical axis: accuracy [%]): (a) Gaussian noise; (b) the Ornstein-Uhlenbeck process [34, 56]. Surprisingly, the system is very stable in the presence of noise, and accuracy remains largely unaffected until quite significant noise is introduced.
6.1.6 Persistent Contrastive Divergence
Persistent contrastive divergence, originally introduced in [41] and shown in Figure 2.3, creates a persistent Markov chain that is driven entirely separately from the input. The model-distribution samples are generated separately from the data distribution, and each data point moves the Markov chain closer to the equilibrium distribution. This ignores the fact that the model, specified by the system weights, changes slightly with each weight update, so the equilibrium distribution does not remain fixed; however, given a small enough learning rate, the system gathers samples closer to the equilibrium distribution than the normal CD-1 algorithm can.
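For comparison with the spiking case discussed next, the mechanics of PCD in a standard binary RBM can be sketched in numpy: the persistent chain state v_model is advanced by one Gibbs step per update and never sees the data. This is a minimal sketch without bias terms, not the thesis implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
n_vis, n_hid, batch = 20, 8, 10
W = rng.normal(0.0, 0.01, (n_vis, n_hid))
eta = 0.01

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Persistent chain state: initialized randomly, never reset to the data.
v_model = (rng.random((batch, n_vis)) < 0.5).astype(float)

def pcd_update(v_data, W, v_model):
    # Positive phase: hidden probabilities driven by the data minibatch.
    h_data = sigmoid(v_data @ W)
    # Negative phase: one Gibbs step on the persistent chain (no data input).
    h_model = (rng.random((batch, n_hid)) < sigmoid(v_model @ W)).astype(float)
    v_model = (rng.random((batch, n_vis)) < sigmoid(h_model @ W.T)).astype(float)
    h_model_p = sigmoid(v_model @ W)
    # Contrastive update: data correlations minus model correlations.
    W = W + eta * (v_data.T @ h_data - v_model.T @ h_model_p) / batch
    return W, v_model

v_data = (rng.random((batch, n_vis)) < 0.5).astype(float)
W, v_model = pcd_update(v_data, W, v_model)
```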
However, when training spiking neural networks with evtCD, the model distribution is
now a recurrent visible-hidden network. The only mixing with the input data comes
from the weight matrix that is shared between the data and model distribution. Since
the model distribution sampling process is run independently with no external input,
its activity is driven almost entirely by noise and is responsible for setting up a persistent recurrent network that maintains activity and produces digit-like samples. A true demonstration of the power of this balanced sampling approach is that the weights of the system coerce random membrane-potential noise into persistent activation that corresponds to a real digit.
This process is demonstrated in Figure 6.14, which visualizes digit reconstructions arising from the model layers sampling under persistent contrastive divergence. These confabulations are clearly distinct from real digits, but considering that the network is small (100 hidden neurons) and trained for a single epoch, it does a remarkable job of creating digit-like patterns. There are clear digit parts, tending to be centrally located and continuous, and many of these are feasible approximations of digits.
Overall, however, the faster training reported for persistent contrastive divergence in
standard CD does not seem to carry over to evtCD. At all time points, the baseline
accuracy outperforms the PCD-trained network, as shown in Figure 6.13. The weights
can be qualitatively assessed in Figure 6.12.
Chapter 6. Quantification of evtCD Training 51

Figure 6.12: Effect of persistent contrastive divergence on receptive field formation. The
horizontal axis increases with the number of presented digits, in thousands, and the
vertical axis indicates the presence or absence of persistent contrastive divergence during
learning. Features appear to require more time to emerge with PCD, given the lack of
direct input to the model layers.
Figure 6.13: Accuracy evolution with and without persistent contrastive divergence.
Baseline parameters consistently outperform PCD.
Figure 6.14: Demonstration of 9 digit reconstructions from activation on the visible
model layer. Since the network samples freely, the units are not driven by external
input but are instead sampled from noise and guided by the energy function specified by
the network weights.
6.1.7 Bounded Weights

One major difference between the networks modeled in evtCD and true biological
networks is the large value range and high precision available to digital simulations.
The default double-precision implementation allows membrane voltages in excess of 1000
volts and weight updates smaller than 10^-8. In this section, the possibility of capping
weights is examined to determine whether all that range is necessary and whether losing
precision might actually improve performance. Figures 6.15 and 6.16 demonstrate the
effects of capping the weights at 0.25. Largely this has no negative effect, though it
qualitatively alters the features that are selected.
Weight capping also affects the initial distribution of weights. It has been suggested that
weight initialization plays a very important role in learning, and that properly
initializing the weights can save significant computational effort and drastically affect
the eventual accuracy [53, 54]. By initializing the weights closer to the extrema, training
decreases weights to yield features rather than sharpening weights that are already
present.
Interestingly, depriving the weights of much of their precision has little effect on the
overall system. This could be a fruitful avenue for future exploration, as low-precision
weights are necessary for some platform implementations [18, 21], and the full
implications of different initialization regimes should be evaluated for possible
performance improvements.
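The capping operation itself is a simple clamp after each update; a minimal sketch, mirroring the wt_lim bound used in the Matlab implementation of Appendix B (class and method names here are illustrative):

```java
public class WeightCap {
    /** Clamp every weight into [-cap, cap] after a weight update, as in
     *  the wt_lim bound of the Matlab implementation (Appendix B). */
    public static void capWeights(double[][] w, double cap) {
        for (double[] row : w)
            for (int j = 0; j < row.length; j++)
                row[j] = Math.max(-cap, Math.min(cap, row[j]));
    }
}
```

Applied after every update with cap = 0.25, this reproduces the bounded-weight condition examined in Figures 6.15 and 6.16.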
Figure 6.15: Effect of bounding weight magnitude on receptive field formation. The
horizontal axis increases with the number of presented digits, in thousands, and the
vertical axis indicates whether weights were bounded to [-0.25, 0.25].
Figure 6.16: Accuracy evolution without and with weight bounding. Bounding weights
may offer some tangible improvement over the baseline accuracy, or is at the very least
not detrimental.

6.1.8 Inverse Decay

Finally, in an evtCD-specific optimization, a slow potentiation of all weights was added
as a possible extension. Since these networks learn only when neurons spike, a constant
positive learning offset can improve learning by steadily forcing all neurons to spike at
least rarely. At every timestep, every weight in the weight matrix is increased by a
constant positive offset (here set to 0.01 * eta), which tends to cause downstream
neurons to spike. If a spike turns out to be spurious, the weight penalty punishes it and
strongly depotentiates the weight by a value of -eta, but if the spike was correct it is
reinforced. In either case, the features become more selective. This tends to speed up
learning by causing more initial activity and forcing neurons to adopt receptive fields
rather than remain generic in their selectivity, as can be seen in Figure 6.17.
Like momentum, this parameter should be applied early in training to encourage
appropriate initialization, then decreased over time. Consistent weight inflation proves
detrimental to the overall accuracy of the system once the weights are close to their
equilibrium values, since it causes undesirable shifts in the energy distribution. Because
the constant positive increase tends to cause added spiking, it has a tendency to shift
the receptive fields over time to respond to novel stimuli instead of approaching an
equilibrium.
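As a sketch, the per-timestep inverse-decay update amounts to a constant additive offset on every weight (compare the line W = W + opts.inv_decay in the Matlab implementation of Appendix B; the class and parameter names here are illustrative):

```java
public class InverseDecay {
    /** One timestep of the inverse-decay extension: every weight grows by
     *  a small constant fraction of the learning rate eta (baseline
     *  0.001 * eta per 1 ms timestep), nudging downstream neurons toward
     *  spiking.  A spurious spike is later depotentiated at full -eta by
     *  the model-layer update, so features still become selective. */
    public static void applyInverseDecay(double[][] w, double eta, double ratio) {
        double offset = ratio * eta; // e.g. ratio = 0.001 for the baseline
        for (double[] row : w)
            for (int j = 0; j < row.length; j++)
                row[j] += offset;
    }
}
```

Decaying ratio toward zero over training would realize the schedule suggested above, where inverse decay helps early but hurts near equilibrium.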
Chapter 6. Quantification of evtCD Training 55
Invers
e D
ecay P
er
ms
Digits Presented (thousands)
1 10 20 30 40 50 60
0
Baseline (0.001)
0.02
Figure 6.17: Effect of inverse decay on receptive field formation. Horizontal axis isincreasing with the number of presented digits, in thousands, and vertical axis indicatesthe ratio of the constant increase in weight to eta. That is, the baseline increase pertimestep (1 ms) is 1/1000 * eta. Note that the receptive fields without inverse decay
are largely undifferentiated due to lack of spiking, so a small amount is beneficial.
Figure 6.18: Accuracy evolution with and without inverse decay. Though inverse decay
helps the network develop receptive fields, too much decreases the eventual accuracy
of the system and lengthens training time.
6.2 Training as a Feature Extractor
Figure 6.19: Architectures for the evtCD-trained network (input, hidden, and label
layers, with both weight matrices trained by evtCD), the linear regression network
(input to label, with weights trained by an optimal linear decoder), and the combination
network (input to hidden trained by evtCD, hidden to label trained by an optimal
linear decoder).
Besides supervised training, evtCD can also be used to train networks to extract features
in a purely unsupervised way. Unsupervised learning typically examines the input,
extracts joint correlations, and clusters the data. This process can be used to learn
receptive fields that reduce the dimensionality of the data (for example, as in [2, 57])
while preserving relevant information. Moreover, if desired, the output of the reduced
layer can then be trained in a more traditional approach using another classification
technique [13].
To begin, evtCD was used to train the spiking RBM in a purely unsupervised way to
establish relevant receptive fields for the data. Then, the activations of the network in
response to the MNIST training set (60,000 digits) were recorded. This process yields
a new training set of reduced dimensionality, of size trials by hidden-layer size, on which
a linear regressor was trained. The architecture can be seen in Figure 6.19.
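The decoder stage of this pipeline can be sketched as a one-vs-all linear readout trained on the recorded hidden-layer activations. The thesis used an optimal linear decoder; the gradient-descent least-squares version below is a simplified stand-in with hypothetical names:

```java
/** Least-squares linear readout trained by gradient descent on recorded
 *  hidden-layer activations; a sketch of the decoder stage of Figure 6.19,
 *  not the exact optimal linear decoder used in the thesis. */
public class LinearReadout {
    public final double[][] w; // [numClasses][numHidden]

    public LinearReadout(int numClasses, int numHidden) {
        w = new double[numClasses][numHidden];
    }

    /** Linear scores for one recorded activation vector. */
    public double[] scores(double[] features) {
        double[] s = new double[w.length];
        for (int c = 0; c < w.length; c++)
            for (int i = 0; i < features.length; i++)
                s[c] += w[c][i] * features[i];
        return s;
    }

    /** One SGD step on the squared error against a one-hot label. */
    public void train(double[] features, int label, double eta) {
        double[] s = scores(features);
        for (int c = 0; c < w.length; c++) {
            double err = (c == label ? 1.0 : 0.0) - s[c];
            for (int i = 0; i < features.length; i++)
                w[c][i] += eta * err * features[i];
        }
    }

    /** Predicted class: the arg-max of the linear scores. */
    public int predict(double[] features) {
        double[] s = scores(features);
        int best = 0;
        for (int c = 1; c < s.length; c++) if (s[c] > s[best]) best = c;
        return best;
    }
}
```

In the combination network, features would be the spike counts of the 225-unit hidden layer over one digit presentation, so the readout operates on the reduced representation rather than the 784 raw pixels.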
The combination network composed of the evtCD network and the linear regression
network recorded the highest performance of the architectures tried here, achieving
90.03% accuracy. It is a powerful result that the evtCD unsupervised learning method
can reduce the dimensionality of the data from 784 pixels (28*28) to 225 and still achieve
a better score than a linear regression alone.
The confusion matrices in Figure 6.20 indicate which digits challenge the learning
algorithm. The evtCD algorithm, when used for supervised training, had the most
difficulty with the digit 5: the tested network often selected “8”, “3”, and “6” as
alternative candidates when presented with a “5”. The linear regression generally
confused the same digits, but made fewer mistakes overall. On the other hand, a few
mistakes appear in the linear classification result but not in the evtCD result; for
example, the linear classifier had more difficulty identifying “5”s that were actually “8”s,
Chapter 6. Quantification of evtCD Training 57
Co
rre
ct
Dig
it
Chosen Digit
0 1 2 3 4 5 6 7 8 9
0
1
2
3
4
5
6
7
8
90
10
20
30
40
50
60
70
80
90
100
(a) Confusion matrix of an RBM trained withthe evtCD learning algorithm.
Co
rre
ct
Dig
it
Chosen Digit
0 1 2 3 4 5 6 7 8 9
0
1
2
3
4
5
6
7
8
90
10
20
30
40
50
60
70
80
90
100
(b) Confusion matrix of MNIST digits usinglinear classification on the pixels.
Co
rre
ct
Dig
it
Chosen Digit
0 1 2 3 4 5 6 7 8 9
0
1
2
3
4
5
6
7
8
90
10
20
30
40
50
60
70
80
90
100
(c) Confusion matrix of combination learning,using linear regression on the output of an RBM
trained with evtCD.
Figure 6.20: Confusion matrices of classification using evtCD learning, optimal linearregression, and combination evtCD and linear regression. Across the vertical axis isthe correct digit, and the horizontal axis is the digit chosen. Color indicates accuracy,in percent, of guessing the chosen digit. Note the common difficulty of distinguishing
“4”s from “9”s and “5”s from “3”s, for example.
and “1”s that were actually “4”s. Qualitatively, it appears that the confusion matrix of
the combination network is the intersection of the mistakes of each network individually.
Additionally, an advantage of the evtCD-trained networks after training is a level of
interpretability in the weights, unlike for a pure linear classifier. In Figure 6.21, the
final receptive fields of the digits “0” through “9” are shown, with “0” on the upper
left and “4” on the upper right. For the linear network, the values shown here are
the linear relationship of that index to the input pixels, and can be thought of as
receptive fields. These features are generally not intuitive, though a “1”-like receptive
field can be made out for the “1” digit, and a dim representation of a “6” appears for
the “6” digit (Figure 6.21a). However, by linearly weighting the receptive fields of the
evtCD-trained network, much more intuitive features appear: though noisy, all of the
weighted receptive fields in Figure 6.21b suggest the form of the digit they are supposed
to represent.
Finally, a comparison of the performance of these techniques appears in Figure 6.22.
Though the learning algorithm has much to improve before achieving state-of-the-art
accuracy, it nonetheless surpasses the optimal linear methods and achieves impressive
accuracy for a single training epoch executing on spiking neurons.
(a) Linear decoder weights.
(b) STDP-trained linear classifier weights.
Figure 6.21: Examination of the weights of a linear classifier built on top of the
dimensionality-reducing STDP-trained system. A. The purely linear classifier weights
do not build particularly intuitive representations of their sensitivities (with the
exception of “1”). B. The combination network weights the receptive fields of the RBM
to produce much more representative versions of their digits, though noisier and blurred.
Figure 6.22: Accuracy of learning on the MNIST dataset [2, 35, 46], comparing evtCD,
linear regression, the combined linear and evtCD method, the Siegert approach, and the
state of the art. The combined linear and evtCD method presented in Section 6.2
achieves 90.3% accuracy. As a supervised learning algorithm, evtCD peaks at 81.46%
identification accuracy.
6.3 Online Training with Spike-Based Sensors

To demonstrate the rapidity with which these networks can be trained, evtCD was used
to quickly train a network online with the spiking DVS image sensor [58]. The input rate
to the network is limited by the refresh rate of the display used to train the network;
in this case, 30 FPS was the maximum digit presentation speed at a 60 Hz refresh rate
with a blank frame between each digit.

During 58 seconds, 1500 digits were presented to the spiking DVS system, which
produced events used as inputs to the evtCD algorithm. These digits comprise 2.5% of
the typical MNIST training set. The algorithm trained a 14*14 = 196 neuron hidden
layer in a purely unsupervised way, developing receptive fields that correspond to the
digit inputs. The weights for this hidden layer can be found in Figure 6.24; though
clearly less ordered than the full-epoch training examples shown in Section 6.1, the
receptive fields display the qualitative features expected of a system trained on
handwritten digits.

After the training examples are presented, the network weights are saved and a linear
classifier is trained on the spiking output of the network in response to the digits, as in
Section 6.2. The final performance of this system again exceeds that of a pure linear
classifier operating on the full MNIST training set, achieving an 86.7% classification
accuracy. This is a promising result after processing such a small percentage of the
training data.
Figure 6.23: Screenshot of the Java-based implementation. Shown at the left is the
weight matrix, with currently updating neurons framed in blue. The original input to
the system can be seen in red in the upper right, and next to it is the live reconstruction
of that digit, performed by the model layer, shown in blue.
Figure 6.24: Weights of the network learned by evtCD from 1500 digits presented over
58 seconds. Qualitatively, these features correspond to those seen in the earlier
Section 6.2, factorizing the input into digit parts.
Chapter 7
Conclusions and Future Work
In this final chapter, Section 7.1 reviews and assesses the main aims of the thesis. Section
7.2 discusses possible directions for future work to continue this research.
7.1 Conclusions
In this thesis, the online evtCD algorithm was introduced through two implementations,
one for simulation (Appendix A) and one for live data (Appendix B), which were used
to train spiking neural networks on the standard MNIST handwritten digit dataset [35].
The impact of the parameters of the algorithm was assessed (Section 6.1) to give the
reader an intuitive understanding of how to employ evtCD and to suggest an optimal
starting point for future work. To compare a spike-based implementation of an RBM to
a standard RBM, this thesis also presents a novel method, modeled after biology [52],
to generate spike-based representations from static images that can be used with any
event-based network (Section 5.3.1).
To assess the success of the aims of this endeavour, we examine the original question
posed in the introduction:
Can RBMs composed of spiking neurons be trained online? Three subgoals
were proposed to evaluate this question. The first of these was to:
1. Derive a rule for online learning of an RBM composed of spiking neural networks;
In Section 3.2, I introduced the evtCD algorithm, which is a novel contribution as the
first online learning method for an RBM composed of spiking neurons. The algorithm
uses four spiking neuron populations to represent the four different samples required
for the standard contrastive divergence algorithm (data visible, data hidden, model
visible, and model hidden as shown in Figure 3.3), which encode a state through their
spiking behaviour. Using these encoded states as samples, correlations between the
spiking behaviours in the data layers strengthen a shared weight matrix, and correlations
between the model layers weaken that shared weight matrix. Changes in the weight
matrix reach an equilibrium when the data correlations match the model correlations,
canceling out the update and implying the model distribution has correctly learned the
data distribution.
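This equilibrium condition can be sketched in standard CD notation (angle brackets here denote spike correlations gathered over the STDP window):

```latex
\Delta W \;\propto\; \langle v_{\text{data}}\, h_{\text{data}}^{\top} \rangle
\;-\; \langle v_{\text{model}}\, h_{\text{model}}^{\top} \rangle ,
\qquad \Delta W \to 0 \;\text{ when the data and model correlations match.}
```

The data-layer term is the potentiating (positive) phase and the model-layer term the depotentiating (negative) phase described above.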
The idea of using a windowing function and persistent populations to represent the
instantaneous samples needed for contrastive divergence can be extended to any RBM
composed of time-persistent elements, and the weight update rule presented here can be
altered depending on the dynamics of the elements.
The next goal was to:
2. Design an event-driven, asynchronous implementation of this rule to achieve high
performance in scalable systems;
The evtCD algorithm described in Chapter 3 adjusts the weights in the system only in
response to spike events from the hidden layers, and can be implemented in a purely
event-driven, asynchronous fashion, as shown in Appendix A. To demonstrate, the evtCD
training algorithm was paired with a real-time spiking image sensor [24]. This
implementation brings together an event-based vision sensor with an event-based
training algorithm to yield an entirely event-driven, asynchronously updating training
paradigm.
Real-time event-driven training was demonstrated in Section 6.3 with the unsupervised
training of a spiking RBM. After presentation of 1500 handwritten digits during 60
seconds, the network developed receptive fields corresponding to the presented digits and
achieved an 86.7% classification accuracy when using a linear classifier on the extracted
features.
The final of the three goals was to:
3. Demonstrate this training rule’s effectiveness on a common benchmark task.
Even without the additional computational power of a linear decoder, Section 6.1
demonstrated how a single training epoch and a hidden layer of only 100 neurons
could achieve 81.5% classification accuracy on the MNIST handwritten digit
classification task [35], without employing any of the transformations or distortions that
commonly provide additional training samples [14, 37]. When paired with a linear
decoder, the network recorded 90.4% classification accuracy on handwritten digits,
exceeding the accuracy of an optimal linear system.
In a much broader context, evtCD now allows training of RBMs on platforms designed
for simulations of neurons and STDP [18, 19], though the accuracy is not yet high
enough for evtCD to be a feasible replacement for standard contrastive divergence
implementations in general. In cases where online learning is important, as in robotics,
evtCD still holds an efficiency advantage. Moreover, the results shown in Chapter 6
indicate that this method is promising, and many optimizations remain to close the gap
between evtCD and the state of the art.
In terms of biological modeling, the evtCD algorithm currently requires a biologically
unrealistic sharing of weights between different neurons, so that neurons of the data
distribution population can potentiate the weights while neurons of the model
distribution population depotentiate them. However, this requirement may be relaxed
in the future: with population coding of the states, the net connection strength between
one population and another must match, rather than individual weights, and methods
of duplicating or adjusting the connection strength between two populations can be
examined. Moreover, the similarity between long-chain Gibbs sampling and network
recurrence shown in Section 6.1.6 suggests that examining sampling methods as
biological models may prove fruitful, as suggested in [59].
In summary, it was argued that event-driven algorithms for online training of spiking
Restricted Boltzmann machines would be a valuable contribution indeed, as such an
algorithm could benefit from the state-of-the-art in machine learning [4–10] as well as
neuromorphic engineering [24, 27, 30, 31]. The first such algorithm is evtCD.
7.2 Future Work
There remains a significant amount of future work to be done with these networks, as
they are still quite new and many open questions remain. We list several areas in which
improvements can be made in the online learning of RBMs composed of spiking neurons.
First, the experiments in Section 6.1 involving extensions and parameters have only
begun to examine the ideas that have been implemented as optimizations for contrastive
divergence since its invention. Fast weights [60] are a way of rapidly exploring the
parameter space, and could prove effective, especially after being successfully
demonstrated with rate-based networks in [14]. Sparsity and selectivity, two constraints
that are biological in origin and can improve performance in deep networks [37, 42], may
also prove fruitful for spiking RBMs. The initialization conditions and the number of
neurons in the hidden layer of the network also play a strong role in the overall accuracy
of classification systems trained with CD [2, 13, 37, 53], but remain an open research
topic for the evtCD algorithm. As pointed out in [1, 14], a larger network size increases
the representational capacity of a neural network: by adding hidden nodes, the network
is able to learn more discriminative features and will likely improve performance.
Second, standard RBMs have been trained to represent continuous values. Accomplished
by encoding the parameters of Gaussians, this extension is very important for real-world
and computer vision tasks; for example, pixel intensity encoding is very important in
visual identification tasks [2, 37, 39]. Using spike rates to encode continuous-valued
inputs in spiking RBMs has not yet been investigated, but forms an important area for
future research.
Third, one of the fundamental advantages of RBMs is the possibility of stacking them
into deep architectures that can dramatically reduce error rates [1–3, 37]. Now that a
method for training spiking RBMs online has been introduced, it is possible to investigate
whether the evtCD learning rule, too, can be implemented to yield deep networks made
of spiking RBMs. Though the offline training algorithm for deep networks is a greedy
layer-wise training paradigm [2, 37, 39], greedy online training with early stopping could
preserve the properties of layer-wise training that work so well offline.
Finally, one of the powerful advantages of moving to a time-based representation is that
the RBM now exists in the time domain, and could have the ability to learn about
the passage of time. Sequence learning with the evtCD algorithm has not yet been
investigated, but could prove an intriguing direction if weights in the system could
encode the sequential firing patterns of neurons over time as in previous investigations
[43, 61, 62]. Specifically, STDP has been demonstrated to exploit firing information
to learn patterns [63, 64], suggesting that evtCD-trained networks may be similarly
capable.
Appendix A
Java Implementation
Printed here is the reference implementation of the evtCD learning rule in Java, current
as of this publication.
public void processSpike(SpikeTriplet spike_triplet){
    // Check for real-world errors
    if(spike_triplet.time < sys_time)
        resetTimes();

    sys_time = spike_triplet.time;
    int layer = spike_triplet.layer + 1;

    // Reconstruct the imaginary first-layer action that resulted in this spike
    if(layer == 1){
        last_spiked[0].put(spike_triplet.address, spike_triplet.time - axon_delay);
        thr[0].put(spike_triplet.address, (thr[0].get(spike_triplet.address) < min_thr) ?
            min_thr : thr[0].get(spike_triplet.address) - eta * thresh_eta);
        spike_count[0]++;
        if(calc_recons[0]){
            recon[0].muli(Math.exp( - (spike_triplet.time-last_recon[0])/recon_tau));
            recon[0].put(spike_triplet.address, recon[0].get(spike_triplet.address) + recon_imp);
            last_recon[0] = spike_triplet.time - axon_delay;
        }
    }

    // Update neurons
    // Decay membrane
    membranes[layer].muli(Math.exp(-(spike_triplet.time - last_update[layer]) / tau));

    // Add impulse
    if(layer == 0){
        membranes[layer].put(spike_triplet.address,
            membranes[layer].get(spike_triplet.address) + inp_scale);
    }
    else if (layer % 2 == 0){
        membranes[layer].addi(weights.getColumn(spike_triplet.address).muli(
            refrac_end[layer].lt(spike_triplet.time)));
    }
    else{
        membranes[layer].addi(weights.getRow(spike_triplet.address).muli(
            refrac_end[layer].lt(spike_triplet.time)));
    }

    // Add noise
    addNoise(membranes[layer], layer);

    // Update last_update
    last_update[layer] = spike_triplet.time;

    // Add firings to queue
    int [] newspikes = membranes[layer].gt(thr[layer % 2]).findIndices();
    for(int n=0; n<newspikes.length; n++){
        // Update counts
        spike_count[layer]++;

        // Update refrac end
        refrac_end[layer].put(newspikes[n], spike_triplet.time + t_refrac);

        // Reset firings
        membranes[layer].put(newspikes[n], 0);

        // Record time for STDP
        last_spiked[layer].put(newspikes[n], spike_triplet.time);

        // STDP Threshold Adjustment
        double thr_direction = (layer < 2) ? -1.0 : 1.0;
        double wt_direction = (layer < 2) ? 1.0 : -1.0;
        thr[layer % 2].put(newspikes[n], thr[layer % 2].get(newspikes[n]) +
            thr_direction * eta * thresh_eta);
        thr[layer % 2].put(thr[layer % 2].lt(min_thr).findIndices(), min_thr);

        // STDP Weight Adjustment
        if (layer % 2 == 1) {
            weights.putColumn(newspikes[n],
                weights.getColumn(newspikes[n]).addi(
                    last_spiked[layer-1].gt(spike_triplet.time - stdp_lag).
                    muli(wt_direction * eta)));
        }

        // Reconstruct the layer if desired
        if(calc_recons[layer]){
            recon[layer].muli(Math.exp( - (spike_triplet.time-last_recon[layer])/recon_tau));
            recon[layer].put(newspikes[n], recon[layer].get(newspikes[n]) + recon_imp);
            last_recon[layer] = spike_triplet.time;
        }

        // Add spikes to the queue if not in the end layer
        if (layer != 3) {
            pq.add(new SpikeTriplet(spike_triplet.time + 2*axon_delay*rng.nextFloat(),
                layer,
                newspikes[n]));
        }
    }
}
Algorithm 2.
Appendix B
Matlab Implementation
Printed here for reference is the time-stepped Matlab implementation, current as of this
publication.
%% Training
numtrains = size(train_x, 1);
numbatches = numtrains / opts.batchsize;
train_inp = [train_x train_y];
start_time = tic;
for e = 1 : opts.numepochs
    % Choose random order
    kk = randperm(numtrains);

    % Reset membrane potentials
    membranes = cell(1,4);
    membranes{2} = zeros(opts.batchsize, dims(2));
    membranes{3} = zeros(opts.batchsize, dims(1));
    membranes{4} = zeros(opts.batchsize, dims(2));

    % Reset refractory period ends
    refrac_end = cell(1,4);
    refrac_end{2} = -inf(opts.batchsize, dims(2));
    refrac_end{3} = -inf(opts.batchsize, dims(1));
    refrac_end{4} = -inf(opts.batchsize, dims(2));

    % Reset timecounter
    last_active{1} = -inf(opts.batchsize, dims(1));
    last_active{2} = -inf(opts.batchsize, dims(2));
    last_active{3} = -inf(opts.batchsize, dims(1));
    last_active{4} = -inf(opts.batchsize, dims(2));

    % Reset firings
    firings = cell(1,4);
    firings{1} = zeros(opts.batchsize, dims(1));
    firings{2} = zeros(opts.batchsize, dims(2));
    firings{3} = zeros(opts.batchsize, dims(1));
    firings{4} = zeros(opts.batchsize, dims(2));

    % Reset noise
    noise = cell(1,4);
    noise{1} = zeros(opts.batchsize, dims(1));
    noise{2} = zeros(opts.batchsize, dims(2));
    noise{3} = zeros(opts.batchsize, dims(1));
    noise{4} = zeros(opts.batchsize, dims(2));

    % Reset time
    t_curr = 0;

    % Loop through all batches
    for b = 1:numbatches
        % Get batch
        batch = train_inp(kk((b - 1) * opts.batchsize + 1 : b * opts.batchsize), :);

        % Clear out the slowpass
        slowpass{1} = zeros(opts.batchsize, dims(1));
        slowpass{2} = zeros(opts.batchsize, dims(2));
        slowpass{3} = zeros(opts.batchsize, dims(1));
        slowpass{4} = zeros(opts.batchsize, dims(2));

        % Go through all repeats
        for br=1:opts.batchrepeat
            for t = t_curr:opts.dt:t_curr+opts.t_stop
                % Generate input and log
                % --------------------------------------------
                noise_batch = (1-opts.temps(1)) * batch + ...
                    opts.temps(1) * (rand(size(batch)) > 0.5);
                in_current = opts.input_rescale * ...
                    (noise_batch .* opts.rate_rescale) > rand(size(batch));
                firings{1} = (in_current > 0);
                last_active{1}(firings{1}) = t;
                slowpass{1} = slowpass{1} + firings{1};

                for pop=2:4
                    % Hidden Layer
                    % ----------------------------------------
                    % Decay membrane
                    membranes{pop} = membranes{pop} * exp(- opts.dt/opts.tau);
                    % Add impulse
                    if(pop == 2)
                        membranes{pop} = membranes{pop} + ...
                            (t > refrac_end{pop}) .* (firings{1} * W');
                    elseif(pop == 3)
                        if(opts.pcd == 1)
                            membranes{pop} = membranes{pop} + ...
                                (t > refrac_end{pop}) .* (firings{4} * W);
                        else
                            membranes{pop} = membranes{pop} + ...
                                (t > refrac_end{pop}) .* (firings{2} * W);
                        end
                    elseif(pop == 4)
                        membranes{pop} = membranes{pop} + ...
                            (t > refrac_end{pop}) .* (firings{3} * W');
                    end

                    % Add noise
                    if(opts.n_noise == 1)
                        noise{pop} = noise{pop}*(1-1/opts.n_tau) + ...
                            1/opts.n_tau * opts.temps(pop)*randn(size(membranes{pop}));
                    else
                        noise{pop} = noise{pop}*(1-1/opts.n_tau) + ...
                            1/opts.n_tau * opts.temps(pop)*rand(size(membranes{pop}));
                    end

                    membranes{pop} = membranes{pop} + noise{pop};

                    % Get firings
                    full_firings = bsxfun(@gt, membranes{pop}, Thr{mod(pop-1,2)+1});
                    firings{pop} = full_firings;
                    slowpass{pop} = slowpass{pop} + (full_firings);
                    % Reset
                    membranes{pop}(firings{pop}) = 0;
                    refrac_end{pop}(firings{pop}) = t + opts.t_refrac;
                    last_active{pop}(firings{pop}) = t;

                    % Bound
                    membranes{pop}(membranes{pop} < opts.min_m) = opts.min_m;
                end

                % Learn
                % Threshold adjustment
                dBv = zeros(opts.batchsize, dims(1));
                dBv(firings{1}) = - opts.eta * opts.thresh_eta;
                dBv(firings{3}) = dBv(firings{3}) + opts.eta * opts.thresh_eta;
                dBh = zeros(opts.batchsize, dims(2));
                dBh(firings{2}) = - opts.eta * opts.thresh_eta;
                dBh(firings{4}) = dBh(firings{4}) + opts.eta * opts.thresh_eta;

                Thr{1} = Thr{1} + sum(dBv, 1) / opts.batchsize;
                Thr{2} = Thr{2} + sum(dBh, 1) / opts.batchsize;

                % STDP
                if(~isempty(firings{2}))
                    dWp = opts.eta .* ...
                        (double(firings{2})' * (last_active{1} > (t - opts.pos_lag)));
                end
                if(~isempty(firings{4}))
                    dWn = opts.eta .* ...
                        (double(firings{4})' * (last_active{3} > (t - opts.pos_lag)));
                end

                dW = (dWp - dWn) / opts.batchsize;
                W = W + dW;

                W = W * (1 - opts.decay);
                W = W + opts.inv_decay;

                if(opts.wt_lim)
                    W(W>opts.wt_lim) = opts.wt_lim;
                    W(W<-opts.wt_lim) = -opts.wt_lim;
                end
            end

            % Clear between presentations
            t_curr = t + opts.t_gap;
            for pop=1:4
                membranes{pop} = membranes{pop} * exp(- opts.t_gap/opts.tau);
            end
        end
    end
end
Algorithm 3.
Appendix C
EyeMove Implementation
Printed here is the reference implementation of the eye movement script, adapted from
[52] and discussed in Section 5.3.
function offsets = get_tremor_offsets(iters, vis_size, varargin)

% Unpack the options struct (lambda, chi, hc, sinkeps, throwout),
% assumed here to be passed as the first variable argument
opts = varargin{1};

% Build regions
L = vis_size;
x_0 = floor(L / 2);
y_0 = floor(L / 2);
x = x_0;
y = y_0;

% Set up energy surface
[X, Y] = meshgrid(1:L, 1:L);
activations = zeros(L, L);
focuser = opts.lambda * L * ((X - x_0).^2/x_0 + (Y - y_0).^2/y_0);

% Create the choices
offsets_for_choice = {[0 1] [0 -1] [-1 0] [1 0]};
% Pre-allocate the output offsets
offsets = zeros(iters+opts.throwout, 2);
for k=1:iters+opts.throwout
    % Create a saccade?
    if(activations(x,y) > opts.hc)
        % Find global minimum with orientation preference
        u1 = opts.chi * L * ((X - x).^2/x_0 .* (Y - y).^2/y_0);
        costmat = focuser + activations + u1;
        [gminj, gmini] = find(costmat==min(min(costmat)));
        newx = gminj(1);
        newy = gmini(1);
    else
        % Calculate choices
        u_up = focuser(x,y+1) + activations(x,y+1);
        u_down = focuser(x,y-1) + activations(x,y-1);
        u_l = focuser(x-1,y) + activations(x-1,y);
        u_r = focuser(x+1,y) + activations(x+1,y);

        % Choose new spot
        choices = [u_up u_down u_l u_r];
        permidx = randperm(4);
        [~, choice] = min(choices(permidx));

        % Make update
        newx = x + offsets_for_choice{permidx(choice)}(1);
        newy = y + offsets_for_choice{permidx(choice)}(2);
    end

    % Update avoidance path
    oldspot = activations(x,y);
    activations = (1-opts.sinkeps) * activations;
    activations(x,y) = oldspot + 1;

    % Yield a result
    offsets(k,:) = [newx-x newy-y];

    % Move
    x = newx;
    y = newy;
end
% Remove "burn-in" period used to set up a valid energy well
offsets = offsets(opts.throwout+1:end,:);
Algorithm 4.
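The same self-avoiding random walk can be sketched in NumPy for readers without MATLAB. This is a simplified re-implementation of the Engbert et al. model [52] realized in the listing above: the walker greedily descends a quadratic confining potential while an activation field it deposits repels it from recently visited sites, and crossing the threshold `hc` triggers a saccade-like jump to the global minimum of the combined potential. Parameter names and default values here are illustrative and would need tuning to match the thesis experiments.

```python
import numpy as np

def tremor_offsets(iters, L, lam=1.0, chi=2.0, hc=7.9,
                   sink_eps=1e-3, throwout=100, rng=None):
    """Fixational eye movement offsets on an L x L lattice."""
    rng = np.random.default_rng(rng)
    x0 = y0 = L // 2
    x, y = x0, y0
    X, Y = np.meshgrid(np.arange(L), np.arange(L), indexing='ij')
    # Quadratic confining potential centered on the lattice
    focuser = lam * L * ((X - x0) ** 2 / x0 + (Y - y0) ** 2 / y0)
    act = np.zeros((L, L))  # self-avoidance ("activation") field
    moves = [(0, 1), (0, -1), (-1, 0), (1, 0)]
    offsets = np.zeros((iters + throwout, 2), dtype=int)
    for k in range(iters + throwout):
        if act[x, y] > hc:
            # Saccade: jump to the global minimum of the combined potential
            u1 = chi * L * ((X - x) ** 2 / x0 * (Y - y) ** 2 / y0)
            nx, ny = np.unravel_index(np.argmin(focuser + act + u1), (L, L))
        else:
            # Tremor step: greedy descent with random tie-breaking
            cand = [(x + dx, y + dy) for dx, dy in moves]
            costs = np.array([focuser[c] + act[c] for c in cand])
            order = rng.permutation(4)
            nx, ny = cand[order[np.argmin(costs[order])]]
        # Decay the avoidance field, then reinforce the current site
        old = act[x, y]
        act *= 1.0 - sink_eps
        act[x, y] = old + 1.0
        offsets[k] = (nx - x, ny - y)
        x, y = nx, ny
    # Discard the burn-in used to establish a valid energy well
    return offsets[throwout:]
```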
Bibliography
[1] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm
for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[2] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with
neural networks. Science, 313(5786):504–507, 2006.
[3] Ruslan Salakhutdinov and Geoffrey E. Hinton. Deep Boltzmann machines. In
International Conference on Artificial Intelligence and Statistics, pages 448–455,
2009.
[4] MIT Technology Review. 10 breakthrough technologies 2013: Deep
learning. http://www.technologyreview.com/featuredstory/513696/deep-learning/,
April 2013.
[5] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical
evaluation of deep architectures on problems with many factors of variation. In
Proc. of ICML, pages 473–480. ACM, 2007.
[6] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep, big,
simple neural nets for handwritten digit recognition. Neural Comp., 22(12):3207–
3220, 2010.
[7] F. Seide, G. Li, and D. Yu. Conversational speech transcription using context-
dependent deep neural networks. In Proc. Interspeech, pages 437–440, 2011.
[8] Q. V. Le, M. A. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and
A. Y. Ng. Building high-level features using large scale unsupervised learning. In
ICML, 2012.
[9] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks
for scalable unsupervised learning of hierarchical representations. In Proc. of ICML,
pages 609–616, 2009.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with
deep convolutional neural networks. In Advances in Neural Information Processing
Systems 25, pages 1106–1114, 2012.
[11] Paul Smolensky. Information processing in dynamical systems: Foundations of
harmony theory. 1986.
[12] Yoav Freund and David Haussler. Unsupervised learning of distributions of binary
vectors using two layer networks. Technical report, Computer Research Laboratory,
University of California, Santa Cruz, 1994.
[13] Geoffrey E. Hinton. Training products of experts by minimizing contrastive diver-
gence. Neural Computation, 14(8):1771–1800, 2002.
[14] Peter O’Connor, Daniel Neil, Shih-Chii Liu, Tobi Delbruck, and Michael Pfeiffer.
Real-time classification and sensor fusion with a spiking deep belief network. Fron-
tiers in Neuroscience, 7, 2013.
[15] C. Eliasmith, T. C. Stewart, X. Choo, T. Bekolay, T. DeWolf, Y. Tang, and D. Ras-
mussen. A large-scale model of the functioning brain. Science, 338(6111):1202–1205,
2012.
[16] J. Perez-Carrasco, B. Zhao, C. Serrano, B. Acha, T. Serrano-Gotarredona, S. Chen,
and B. Linares-Barranco. Mapping from frame-driven to frame-free event-driven
vision systems by low-rate rate-coding and coincidence processing. Application to
feed forward ConvNets. IEEE Trans. on Pattern Analysis and Machine Intelligence,
in press, 2013.
[17] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, J-M. Bussat, and K. A. Boahen. A
multicast tree router for multichip neuromorphic systems. IEEE Transactions on
Circuits and Systems I, 2013.
[18] M. M. Khan, D. R. Lester, L. A. Plana, A. Rast, X. Jin, E. Painkras, and S. B.
Furber. SpiNNaker: mapping neural networks onto a massively-parallel chip mul-
tiprocessor. In Proc. 2008 International Joint Conference on Neural Networks
(IJCNN’08), 2008.
[19] DARPA SyNAPSE Program, 2013. URL http://www.artificialbrains.com/darpa-synapse-program.
[20] A. Cassidy, A.G. Andreou, and J. Georgiou. Design of a one million neuron single
FPGA neuromorphic system for real-time multimodal scene analysis. In 2011 45th
Annual Conference on Information Sciences and Systems (CISS)., pages 1–6. IEEE,
2011.
[21] Daniel L. Neil and Shih-Chii Liu. Minitaur, an event-driven FPGA-based spiking
network accelerator. submitted (under review), 2013.
[22] G. Indiveri, B. Linares-Barranco, T.J. Hamilton, A. van Schaik, R. Etienne-
Cummings, T. Delbruck, S.-C. Liu, P. Dudek, P. Hafliger, S. Renaud, J. Schem-
mel, G. Cauwenberghs, J. Arthur, K. Hynna, F. Folowosele, S. Saighi, T. Serrano-
Gotarredona, J. Wijekoon, Y. Wang, and K. Boahen. Neuromorphic silicon neuron
circuits. Frontiers in Neuroscience, 5:1–23, 2011. ISSN 1662-453X.
[23] S.-C. Liu and T. Delbruck. Neuromorphic sensory systems. Current Opinion in
Neurobiology, 20(3):288–295, 2010.
[24] P. Lichtsteiner, C. Posch, and T. Delbruck. A 128×128 120 dB 15 µs latency
asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits,
43(2):566–576, 2008.
[25] S-C. Liu, A. van Schaik, B. Minch, and T. Delbruck. Event-based 64-channel
binaural silicon cochlea with Q enhancement mechanisms. In Proceedings of the
2010 IEEE International Symposium on Circuits and Systems, pages 2027–2030,
May 2010. ISCAS 2010: Paris, France, 30 May–2 June.
[26] C. Farabet, R. Paz, J. Perez-Carrasco, C. Zamarreno, A. Linares-Barranco, Y. Le-
Cun, E. Culurciello, T. Serrano-Gotarredona, and B. Linares-Barranco. Compari-
son between frame-constrained fix-pixel-value and frame-free spiking-dynamic-pixel
ConvNets for visual processing. Frontiers in Neuroscience, 6(32), 2012.
[27] T. Delbruck, Bernabe Linares-Barranco, Eugenio Culurciello, and Christoph Posch.
Activity-driven, event-based vision sensors. In Proceedings of 2010 IEEE Interna-
tional Symposium on Circuits and Systems (ISCAS), pages 2426–2429. IEEE, 2010.
[28] J.M. Nageswaran, N. Dutt, J.L. Krichmar, A. Nicolau, and A. Veidenbaum. Effi-
cient simulation of large-scale spiking neural networks using CUDA graphics proces-
sors. In International Joint Conference on Neural Networks (IJCNN) 2009, pages
2145–2152. IEEE, 2009.
[29] L.P. Maguire, T.M. McGinnity, B. Glackin, A. Ghani, A. Belatreche, and J. Harkin.
Challenges for large-scale implementations of spiking neural networks on FPGAs.
Neurocomputing, 71(1):13–29, 2007.
[30] Teresa Serrano-Gotarredona and Bernabe Linares-Barranco. A 128×128 1.5%
contrast sensitivity 0.9% FPN 3 µs latency 4 mW asynchronous frame-free dynamic
vision sensor using transimpedance preamplifiers. 2013.
[31] S-C. Liu, A. van Schaik, B. Minch, and T. Delbruck. Asynchronous binaural spatial
audition sensor with 2×64×4 channel output. IEEE Transactions on Biomedical
Circuits and Systems, 2013. In press.
[32] Yang Dan and Mu-ming Poo. Spike timing-dependent plasticity of neural circuits.
Neuron, 44(1):23–30, 2004.
[33] Daniel E. Feldman. The spike-timing dependence of plasticity. Neuron, 75(4):
556–571, 2012.
[34] B. Nessler, M. Pfeiffer, L. Buesing, and W. Maass. Bayesian computation emerges
in generic cortical microcircuits through spike-timing-dependent plasticity. PLoS
Computational Biology, 9(4):e1003037, 2013.
[35] Yann Lecun and Corinna Cortes. The MNIST database of handwritten digits. URL
http://yann.lecun.com/exdb/mnist/.
[36] G. E. Hinton and T. J. Sejnowski. Learning and relearning in Boltzmann machines.
MIT Press, Cambridge, Mass., 1:282–317, 1986.
[37] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine
Learning, 2(1):1–127, 2009.
[38] David E. Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning repre-
sentations by back-propagating errors. Cognitive Modeling, 1:213, 2002.
[39] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training
of deep networks. In Advances in Neural Information Processing Systems 19. MIT
Press, 2006.
[40] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based
learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–
2324, 1998.
[41] T. Tieleman. Training restricted Boltzmann machines using approximations to the
likelihood gradient. In Proc. of ICML, pages 1064–1071. ACM, 2008.
[42] H. Goh, N. Thome, and M. Cord. Biasing restricted Boltzmann machines to ma-
nipulate latent selectivity and sparsity. In NIPS workshop on deep learning and
unsupervised feature learning, 2010.
[43] Geoffrey E. Hinton and Andrew D. Brown. Spiking Boltzmann machines. In NIPS,
pages 122–128. Citeseer, 1999.
[44] Yee Whye Teh and Geoffrey E. Hinton. Rate-coded restricted Boltzmann machines
for face recognition. Advances in Neural Information Processing Systems, pages 908–
914, 2001.
[45] Hsin Chen and Alan Murray. A continuous restricted Boltzmann machine with
a hardware-amenable learning algorithm. In Artificial Neural Networks—ICANN
2002, pages 358–363. Springer, 2002.
[46] Peter O’Connor. A real-time sensory-fusion model using a Deep Belief Network with
spiking neurons. Master’s thesis, Institute of Neuroinformatics, Zurich, Switzerland,
2012.
[47] A. J. F. Siegert. On the first passage time probability problem. Physical Review,
81(4):617, 1951.
[48] F. Jug, J. Lengler, C. Krautz, and A. Steger. Spiking networks and their rate-
based equivalents: does it make sense to use Siegert neurons? In Swiss Society for
Neuroscience. 2012.
[49] F. Jug, M. Cook, and A. Steger. Recurrent competitive networks can learn lo-
cally excitatory topologies. In International Joint Conference on Neural Networks
(IJCNN), pages 1–8, 2012.
[50] G-Q. Bi and M-M. Poo. Synaptic modifications in cultured hippocampal neurons:
Dependence on spike timing, synaptic strength, and postsynaptic cell type. Jour.
of Neuroscience, 18(24):10464–10472, 1998.
[51] Dan Ciresan, Ueli Meier, and Jurgen Schmidhuber. Multi-column deep neural
networks for image classification. In Computer Vision and Pattern Recognition
(CVPR), 2012 IEEE Conference on, pages 3642–3649. IEEE, 2012.
[52] Ralf Engbert, Konstantin Mergenthaler, Petra Sinn, and Arkady Pikovsky. An
integrated model of fixational eye movements and microsaccades. Proceedings of
the National Academy of Sciences, 108(39):E765–E770, 2011.
[53] D. Erhan, Y. Bengio, A. Courville, P.A. Manzagol, P. Vincent, and S. Bengio.
Why does unsupervised pre-training help deep learning? The Journal of Machine
Learning Research, 11:625–660, 2010.
[54] G. Hinton. A practical guide to training restricted Boltzmann machines. Momen-
tum, 9:1, 2010.
[55] Ilya Sutskever, James Martens, and Geoffrey E. Hinton. Generating text with
recurrent neural networks. In Proceedings of the 28th International Conference on
Machine Learning (ICML-11), pages 1017–1024, 2011.
[56] Alain Destexhe, Michael Rudolph, J-M Fellous, and Terrence J. Sejnowski. Fluc-
tuating synaptic conductances recreate in vivo-like activity in neocortical neurons.
Neuroscience, 107(1):13–24, 2001.
[57] J. Masci, U. Meier, D. Ciresan, and J. Schmidhuber. Stacked convolutional auto-
encoders for hierarchical feature extraction. Artificial Neural Networks and Machine
Learning–ICANN 2011, pages 52–59, 2011.
[58] P. Lichtsteiner, T. Delbruck, and C. Posch. A 100 dB dynamic range high-speed dual-
line optical transient sensor with asynchronous readout. In International Symposium
on Circuits and Systems, ISCAS 2006. IEEE, 2006.
[59] L. Busing, J. Bill, B. Nessler, and W. Maass. Neural Dynamics as Sampling: A
Model for Stochastic Computation in Recurrent Networks of Spiking Neurons. PLoS
Computational Biology, 7(11):e1002211, 2011. ISSN 1553-7358.
[60] T. Tieleman and G. Hinton. Using fast weights to improve persistent contrastive
divergence. In Proc. of ICML, pages 1033–1040. ACM, 2009.
[61] O. Bichler, D. Querlioz, S.J. Thorpe, J.P. Bourgoin, and C. Gamrat. Extraction
of temporally correlated features from dynamic vision sensors with spike-timing-
dependent plasticity. Neural Networks, 2012.
[62] S. Mitra, S. Fusi, and G. Indiveri. Real-time classification of complex patterns using
spike-based learning in neuromorphic VLSI. IEEE Transactions on Biomedical
Circuits and Systems, 3(1):32–42, Feb. 2009.
[63] Timothee Masquelier and Simon J Thorpe. Unsupervised learning of visual features
through spike timing dependent plasticity. PLoS Computational Biology, 3(2):e31,
2007.
[64] T. Masquelier, R. Guyonneau, and S.J. Thorpe. Competitive STDP-based spike
pattern learning. Neural Computation, 21:1259–1276, 2009.