University of Zurich and ETH Zurich
Master Thesis
Online Learning in Event-based Restricted Boltzmann Machines
Author:
Daniel Neil
Supervisor:
Michael Pfeiffer and
Shih-Chii Liu
A thesis submitted in fulfilment of the requirements
for the degree of MSc UZH ETH in Neural Systems and Computation
in the
Sensors Group
Institute of Neuroinformatics
October 2013
Declaration of Authorship
I, Daniel Neil, declare that this thesis titled, ‘Online Learning in Event-based Restricted
Boltzmann Machines’ and the work presented in it are my own. I confirm that:
• This work was done wholly or mainly while in candidature for a research degree at this University.
• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
• Where I have consulted the published work of others, this is always clearly attributed.
• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.
Signed:
Date:
Abstract
Online Learning in Event-based Restricted Boltzmann Machines
by Daniel Neil
Restricted Boltzmann Machines (RBMs) constitute the main building blocks of Deep
Belief Networks and other state-of-the-art machine learning tools. It has recently been
shown how RBMs can be implemented in networks of spiking neurons, which is advantageous because the necessary repetitive updates can be performed in an efficient
asynchronous and event-driven manner. However, like any previously known method
for training RBMs, the training process for event-based RBMs was performed offline.
The offline training fails to exploit the computational advantages of spiking networks,
and does not capture the online learning characteristics of biological systems.
This thesis introduces the first online method of training event-based RBMs that combines the standard RBM-training method, called contrastive divergence (CD), with biologically inspired spike-based learning. The new rule, which we call “evtCD”, offers
sparse and asynchronous weight updates in spiking neural network implementations of
RBMs, and is the first online training algorithm for this architecture. Moreover, the
algorithm is shown to approximate the previous offline training process.
Performance of training was evaluated on the standard MNIST handwritten digit identification task, achieving 90.4% accuracy when combined with a linear decoder on the
features extracted by a single event-based RBM. Finally, evtCD was applied to the real-time output of an event-based vision sensor and achieved 86.7% accuracy after only 60
seconds of training time and presentation of less than 2.5% of the standard training
digits.
Acknowledgements
This thesis could not have been accomplished without the original effort of Peter O’Connor
and his boundless persistence to make event-based Deep Belief Networks possible. Saee
Paliwal has contributed her endless support and incomparable mathematical ability during the course of this work, and without our discussions I would have likely floundered in
the great space of ideas. A very large thanks to Michael Pfeiffer for his expert knowledge
and his key insights into the work, which critically moved the project forward at various
times when it stalled. Finally, this document will hopefully be just one of many written by me at the Institute of Neuroinformatics, thanks largely to the encouragement,
insights, and support of Shih-Chii Liu. I am deeply in all of your debt for your help.
Contents
Declaration of Authorship i
Abstract ii
Acknowledgements iii
Contents iv
1 Introduction 1
2 A Background in Restricted Boltzmann Machines and Deep Learning 4
2.1 A Historical Introduction to Deep Learning . . . . . . . . . . . . . . . . . 4
2.2 Energy-Based Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Products of Experts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 RBMs and Contrastive Divergence . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Extensions to Standard Learning Rules . . . . . . . . . . . . . . . . . . . 17
3 Derivation of evtCD 19
3.1 Spiking Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . 19
3.2 evtCD, an Online Learning Rule for Spiking Restricted Boltzmann Machines 21
4 Implementation of evtCD 26
4.1 Algorithm Recipe for Software Implementation . . . . . . . . . . . . . . . 26
4.2 Supervised Training with evtCD . . . . . . . . . . . . . . . . . . . . . . . 28
5 Test Methodology 30
5.1 MNIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.2 Time-stepped Training Methodology . . . . . . . . . . . . . . . . . . . . . 31
5.2.1 Extracting Spikes From Still Images . . . . . . . . . . . . . . . . . 33
5.3 Online Training Methodology . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.3.1 A Software Implementation of Fixational Eye Movements . . . . . 34
5.3.2 Training Environment . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.3.3 Java Reference Implementation . . . . . . . . . . . . . . . . . . . . 36
6 Quantification of evtCD Training 38
6.1 Improving Training through Parameter Optimizations . . . . . . . . . . . 38
6.1.1 Baseline Training Demonstration . . . . . . . . . . . . . . . . . . . 39
6.1.2 Learning Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
6.1.3 Number of Input Events . . . . . . . . . . . . . . . . . . . . . . . . 43
6.1.4 Batch Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.1.5 Noise Temperature . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.1.6 Persistent Contrastive Divergence . . . . . . . . . . . . . . . . . . . 50
6.1.7 Bounded Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.1.8 Inverse Decay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.2 Training as a Feature Extractor . . . . . . . . . . . . . . . . . . . . . . . . 56
6.3 Online Training with Spike-Based Sensors . . . . . . . . . . . . . . . . . . 60
7 Conclusions and Future Work 62
7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
A Java Implementation 66
B Matlab Implementation 69
C EyeMove Implementation 73
Bibliography 75
Chapter 1
Introduction
Deep networks, specifically Deep Belief Networks [1, 2] and Deep Boltzmann machines
[3], are achieving state-of-the-art performance on classification tasks for images, videos,
audio, and text [4–10]. Importantly, Restricted Boltzmann machines (RBMs) [11–13]
underlie both of these approaches and recent work [14–17] has strongly pushed to investigate the possibility of implementing RBMs on networks of spiking neurons. The
reason for this is two-fold.
First, fast and efficient silicon architectures [18–21] designed specifically to accelerate
spiking neural networks are emerging, and RBMs composed of spiking neurons would
pair progress in machine learning with novel and powerful computing architectures.
Additionally, it has been shown that the accuracy of networks on classification tasks
is guaranteed to improve with larger size or more layers [1, 14], implying that higher
accuracy could be achieved by just scaling up the network size. Therefore, scale is an
important factor, and these neural network accelerators are all designed to run large
networks of spiking neurons faster than a general-purpose computer.
Second, RBMs implemented with spiking neurons offer a fundamental advantage when
used in very large networks over traditional RBM approaches. As [14] points out, scale is
not the dominant factor in processing time for the brain, unlike standard computational
approaches, because the processing units are both parallel and event-driven. The brain
adapts its processing speed to the rate of input, so the computational effort is proportional to the number of events. This so-called event-driven computational approach is
a hallmark of neuromorphic designs that seek inspiration from the brain to build event-based, asynchronous, and often low-power silicon systems [22–25]. The advantages of
event-driven computation are well studied [16, 26, 27] and certain types of spiking neuron models can be implemented entirely in event-driven systems [28, 29]. Additionally,
current event-driven neuromorphic sensors produce sparse outputs for vision [24, 27, 30]
and audition [31]. By designing algorithms that can run on spiking neural networks,
it is possible to construct a complete hardware system using event-driven computation
alone.
However, the pre-existing implementations of RBMs composed of spiking neural networks [14–16] all use an offline and synchronous training algorithm. No training algorithm has yet been discovered for online training of RBMs composed of spiking neurons.
The main question this thesis is concerned with is the following: can RBMs composed
of spiking neurons be trained online? Specifically, there are three subgoals:
1. Derive a rule for online learning of an RBM composed of spiking neural networks;
2. Design an event-driven, asynchronous implementation of this rule to achieve high
performance in scalable systems;
3. Demonstrate this training rule’s effectiveness on a common benchmark task.
In this thesis, I will introduce evtCD, an online learning rule inspired by biological rules
of spike time-dependent plasticity (STDP) [32–34] that trains an RBM composed of
leaky integrate-and-fire (LIF) spiking neurons. This training algorithm can be run in an
entirely event-driven way by using updates from STDP with insights from contrastive
divergence (CD) learning, the standard method of learning for RBMs.
Beyond the description of this new algorithm, I demonstrate a proof sketch of how this
learning rule approximates a previously demonstrated offline learning rule introduced in
[14], which in turn arises from contrastive divergence learning.
Finally, this thesis contributes a real-time implementation of evtCD to demonstrate the
efficiency of this learning rule. An event-based image sensor [24], which produces image
events instead of image frames, generates spike trains which are used to run the evtCD
training algorithm on populations of spiking neurons. When applied to the commonly
used MNIST handwritten digit identification task [35], the receptive fields of the neurons
learn digit parts as they do in the standard frame-based training. After sixty seconds
of real-time learning, the network learns features that perform better than an optimal
linear decoder on the raw digits, achieving a classification accuracy of 86.7%.
This thesis is structured as follows. In Chapter 2, the history of deep learning is introduced and the derivations of previous methods are analyzed for their applicability
to spiking neural networks. Chapter 3 introduces evtCD and links the algorithm to
the previously successful rate-based offline learning algorithm in [14]. In Chapter 4, an
algorithm recipe is shown for implementing evtCD in software, and supervised learning
with the evtCD algorithm is explained. Following that, Chapter 5 explains the testing
methodology and setup for obtaining results that analyze evtCD training. Chapter 6
then studies the behaviour of the algorithm under various parameters and extensions,
and demonstrates rapid real-time learning of the MNIST handwritten digit dataset.
Finally, Chapter 7 concludes by introducing ideas for future work.
Chapter 2
A Background in Restricted
Boltzmann Machines and Deep
Learning
This chapter will introduce the prior work on RBMs and deep learning to lay the foundations for the introduction of the evtCD learning rule. In Section 2.1, a historical introduction will give an overview of the history and intuition of training RBMs. Subsequently, Sections 2.2 through 2.4 will focus on reproducing the mathematical derivations of RBM learning rules to understand the assumptions in them. Finally, standard extensions to the contrastive divergence learning rule will be briefly explained in Section 2.5, as they form the basis for many of the investigations found in Chapter 6.
2.1 A Historical Introduction to Deep Learning
The origins of deep learning begin with Boltzmann machines [36]. Introduced in 1986, Boltzmann machines are an undirected, two-layer probabilistic generative model. The connections between (and, unlike in RBMs, within) these layers are bidirectional (hence, “undirected”), allowing the system either to pull external states
into its internal representation or to generate data from that internal representation.
These machines are “probabilistic” in that they encode the probabilities of types of
inputs on which they are trained, and are modeled on a physical analogy to distributions
of matter that probabilistically settle into low-energy configurations. Finally, Boltzmann
machines are “generative” models, meaning they are capable of producing data that
look like the inputs on which they have been trained. If, for example, a network is
trained on handwritten digits, a Boltzmann machine will, after training, produce digit-like patterns on the visible part of the system when allowed to freely sample from
the distribution specified by the weights in the system. Section 2.2 addresses their
mathematical formulation.
Boltzmann machines are trained using a computationally intensive process in which
the machines are annealed into low-energy states, and these states are used to guide
a training algorithm to model the joint probabilities of the inputs presented to them
(Equation 2.20). These machines are typically discussed as having a “visible” state and
a “hidden” state, shown in Figure 2.1, in which the visible state corresponds to the
data that is fed into the machine and the hidden state corresponds to some abstracted
representation hidden from the outside world. The goal of the Boltzmann machine is
to use the energy dynamics of the system to learn arbitrary distributions of the input
data. The relationships that specify the distribution are mapped through the connection
weights (Equation 2.2), which force the hidden units to represent the input distributions
and cause the low energy states to correspond to probable configurations of the system
[36].
Ultimately, this distribution matching is a very important task for learning [37]. It is a
form of unsupervised learning in which the goal of the system is to design a probability
distribution that can arbitrarily approximate an input distribution, learning only from
samples from this unknown input distribution. This is an important goal because it
means the system can perform inference on that model, calculate likelihoods of a given
input, and produce samples similar to those it has been trained upon [13, 36, 37]. As
will be shown later in this work, it also allows a form of supervised learning if the labels
are presented as part of the joint distribution it needs to learn [2, 13].
RBMs, which were introduced under the name “Harmoniums” by [11], are a slight modification of the Boltzmann machine in that intra-layer connections have been removed
to make units in the same layer conditionally independent, as seen in Figure 2.1. As
mentioned before, both Boltzmann and Restricted Boltzmann machines can be trained
in an unsupervised fashion, which means that no training labels are necessary for the
system to learn the joint distribution of their inputs.
Unfortunately, training a Boltzmann machine through simulated annealing takes considerable computational time because the system must be allowed to stabilize to an equilibrium. This is necessary to obtain a single sample from the data and model distributions, which are used to calculate the gradient of learning for minimizing the difference between these two distributions (see Equation 2.20 in Section 2.2), and thus is the limiting factor for training a dataset. If every weight update takes a significant amount of
Figure 2.1: Diagrammatic view of a Boltzmann machine and a Restricted Boltzmann machine, each with a visible and a hidden layer. Note that the Restricted Boltzmann machine lacks intra-layer connections. Figure taken from [14].
time to calculate (i.e., using Equation 2.20), this precluded training and hindered the adoption of Boltzmann machines [13, 37].
The Restricted Boltzmann machine attempts to address one significant issue of training
the Boltzmann machine: the difficulty of obtaining a true sample from a Boltzmann
machine. By removing intra-layer connections, inference becomes tractable within this
model, as recognized by [12], and Gibbs sampling can be used to infer likely states of
the model instead of annealing the system into equilibrium [12, 13].
After their initial introduction in the mid-1980s, progress on RBMs was sporadic and largely ineffectual; attention shifted instead to training supervised shallow architectures with backpropagation, as in [38]. Moreover, shallow architectures are already universal, meaning that
a shallow architecture could theoretically approximate any function given enough units.
Although it can be exponentially more efficient to have deep layers than a single layer
[37], no effective training algorithms were yet created that could train deep networks.
Unfortunately, standard error-gradient techniques like backpropagation assign excessive error updates to the final layer in a deep architecture. The error used in learning effectively disappears after being propagated back through all layers [1, 37], causing very little training signal to reach the initial layers. This results in over-learning of the top layer and under-learning in lower layers, using the resources of a deep network inefficiently, and ultimately performing worse than a single well-trained layer [37, 39].
An alternative approach was championed by Yann Le Cun (for example, [37, 40]), which
was to use convolutional networks. These networks have templates mapped across the
input space, combined with transformation layers. This effectively decreases the number of independent weights and allows for more efficient training, as well as making the system more robust to certain transformations; depending on the architecture, it can be
made to be more robust to both translation and scaling. Unfortunately, this approach
imposes certain assumptions about the inputs (translation invariance, for example), and
ultimately gives up the descriptive power of large numbers of weights in favor of simpler,
more effective training.
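The parameter reduction from weight sharing can be made concrete with a back-of-the-envelope count. The sketch below compares a fully connected layer against a shared-template convolutional layer on a 28×28 input; the layer sizes are illustrative choices, not figures from this thesis.

```python
# Hypothetical illustration: independent-parameter counts for a dense layer
# versus a shared-weight convolutional layer on the same 28x28 input.

def dense_params(in_size, out_size):
    # Every input-output pair has its own weight, plus one bias per output.
    return in_size * out_size + out_size

def conv_params(kernel_size, n_filters):
    # One small template per filter is swept across the input, so the
    # parameter count is independent of the input size.
    return kernel_size * kernel_size * n_filters + n_filters

dense = dense_params(28 * 28, 100)   # 78,500 independent parameters
conv = conv_params(5, 100)           # 2,600 independent parameters
print(dense, conv)
```

The roughly thirty-fold reduction is exactly the trade described above: fewer independent weights in exchange for baked-in assumptions such as translation invariance.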
Then, in 2002, work was published which began shifting progress back towards deep
networks [2, 13]. Data was becoming very plentiful, but, unfortunately, most of that data was unlabeled; a simple Internet search could easily yield troves of data, but it is not structured in ways that allow computers to easily learn from it [13]. To take
advantage of this available information, unsupervised learning is a very powerful tool
because it allows a computer to pre-learn from large volumes of unlabeled data, learning
about the differences between classes that it sees. Then, when presented with labels,
it can fine-tune its learning with a final supervised step. RBMs, however, took far too long to train using simulated annealing; in this 2002 paper, Hinton discovered a very effective approximation that works well in practice and enabled the rise of deep networks
[37].
One way to obtain a sample from the RBM’s model distribution is to begin a Markov
chain starting with a current data sample. This Markov chain can alternately draw
samples from the hidden layer given the visible, and the visible layer given the hidden
(see Figure 2.2). It turns out this process is equivalent to a Markov Chain Monte
Carlo (MCMC) sampling method known as Gibbs sampling, and Gibbs sampling in this
context converges to the stationary distribution specified by the RBM regardless of the
starting point. Informally, this means that no matter how the system is initialized,
this process of repeatedly generating samples from the system causes those samples to
gradually become closer to the distribution specified by the system. The first sample
may be random data, but by the time the energy dynamics have adjusted the activations
many, many times, the samples begin to be the types of samples specified by the RBM.
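The alternating chain described above can be sketched in a few lines. The following is a minimal illustration (not the thesis code): hidden units are sampled given the visible layer, then visible units given the hidden layer, repeatedly. Network sizes, weight scales, and step count are arbitrary choices.

```python
# A minimal sketch of the alternating Gibbs chain between the visible and
# hidden layers of an RBM. Sizes and parameters are illustrative only.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_chain(v0, W, b, c, n_steps, rng):
    """Run n_steps of alternating Gibbs sampling from initial visible v0."""
    v = v0.copy()
    for _ in range(n_steps):
        # Conditional independence within a layer (the "restriction") lets
        # a whole layer be sampled in one vectorized step.
        h = (rng.random(c.shape) < sigmoid(W @ v + c)).astype(float)
        v = (rng.random(b.shape) < sigmoid(W.T @ h + b)).astype(float)
    return v, h

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, size=(20, 50))   # 50 visible, 20 hidden units
b, c = np.zeros(50), np.zeros(20)
v_n, h_n = gibbs_chain(rng.integers(0, 2, 50).astype(float), W, b, c, 10, rng)
```

However the chain is initialized, running it long enough yields samples from the distribution the weights specify.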
Figure 2.2: The Markov chain used in contrastive divergence, alternating between the visible and hidden layers to produce successive samples Q0, Q1, ..., Qn. Gibbs steps are taken to create samples closer to the equilibrium distribution than the original data sample.
Gibbs sampling is designed to generate approximate samples from a distribution where direct sampling is difficult. The key insight that underlies Gibbs sampling is that sampling from a conditional distribution may be easier than obtaining pure unbiased samples from the distribution. The process is surprisingly simple: for all joint variables in the distribution, hold all but one fixed, and draw a sample conditioned on the others. In the next step, use the updated sample value for that variable to draw a sample for a different variable. Intuitively, the process “walks” a sampling process towards the distribution specified by the parameters, and given infinite steps will yield samples from the distribution specified by the parameters.
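The hold-all-but-one procedure can be seen in a toy setting. Below, an arbitrary example joint distribution over two binary variables is sampled by alternately resampling each variable from its conditional; the empirical marginals approach the true ones.

```python
# A toy sketch of generic Gibbs sampling: hold all variables but one fixed
# and resample that one from its conditional. The joint table is an
# arbitrary example distribution, not data from this thesis.
import numpy as np

joint = np.array([[0.1, 0.2],
                  [0.3, 0.4]])   # joint[x, y] = P(x, y); entries sum to 1

def gibbs(joint, n_steps, rng):
    x, y = 0, 0
    samples = []
    for _ in range(n_steps):
        # Resample x conditioned on the current y ...
        px = joint[:, y] / joint[:, y].sum()
        x = rng.choice(2, p=px)
        # ... then resample y conditioned on the new x.
        py = joint[x, :] / joint[x, :].sum()
        y = rng.choice(2, p=py)
        samples.append((x, y))
    return np.array(samples)

rng = np.random.default_rng(0)
s = gibbs(joint, 20000, rng)
print(s.mean(axis=0))  # approaches the true marginals P(x=1)=0.7, P(y=1)=0.6
```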
In practice, this process cannot be repeated indefinitely, hence the use of simulated annealing to try to stabilize the system with an energy-cooling process. Unfortunately, it cannot be known a priori how many steps are necessary to obtain a “true enough” sample from the model distribution, and in practice this number can be quite large and also quite variable. The key insight of the [13] paper, however, is that a single step is sufficient to learn effectively. It is not obvious that this should be the case, and the next section deals with the mathematics to explain why this assumption is valid.
Intuitively, this single-step sampling method known as CD-1 (see Equation 2.33) works
because a sample drawn from a Gibbs step is closer to the model distribution specified
by the RBM than the input data. Even though this model sample is still highly correlated with the current input, it still contains information about the gradient that would
decrease the difference between the model distribution and the data distribution, and
this single sample can be used to approximate a sample from the model distribution.
A step then taken to minimize the difference between the model sample and the data
sample will make the model distribution more like the data distribution. In practice, a
single Gibbs step is quick to calculate and very effective [2, 13, 37].
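A single CD-1 update can be sketched as follows. This is a hedged illustration of the idea just described, not the thesis's evtCD code: one Gibbs step supplies "model" statistics that are subtracted from "data" statistics (cf. Equation 2.20). Layer sizes and the learning rate are arbitrary.

```python
# A minimal sketch of a CD-1 weight update for a binary RBM.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, b, c, lr, rng):
    # Positive (data) phase: hidden sample driven by the clamped data.
    h_data = (rng.random(c.shape) < sigmoid(W @ v_data + c)).astype(float)
    # One Gibbs step provides the negative (model) phase sample.
    v_model = (rng.random(b.shape) < sigmoid(W.T @ h_data + b)).astype(float)
    h_model_prob = sigmoid(W @ v_model + c)
    # Delta w ~ <v h>_data - <v h>_model, with the model side approximated
    # by the single reconstruction rather than a converged chain.
    dW = np.outer(h_data, v_data) - np.outer(h_model_prob, v_model)
    return W + lr * dW

rng = np.random.default_rng(1)
W = rng.normal(0, 0.01, size=(10, 25))   # 25 visible, 10 hidden units
W = cd1_update(rng.integers(0, 2, 25).astype(float),
               W, np.zeros(25), np.zeros(10), 0.1, rng)
```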
The second key insight, which appeared a few years later in [2], was to use RBMs as a
building block in training deep networks. Each layer in a deep network could be trained
as an RBM and the whole system can be composed together out of RBMs. When
training, the first layer is trained in an unsupervised way with the visible layer as the
input layer and the layer above as the hidden layer. Then, to train the next layer, the
learned weights from the first layer are fixed as the hidden layer becomes the visible layer
of the next layer. This process continues until the whole network is trained in a greedy
layer-wise fashion [39]. These deep networks will only be very sparingly addressed in
this thesis, although there is much more room to explore.
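The greedy layer-wise procedure can be summarized schematically. In the sketch below, `train_rbm` is a stand-in for any RBM trainer (such as CD-1); the point is only the stacking logic, in which each trained layer's hidden activations become the "visible" data for the next layer. Layer sizes are illustrative.

```python
# Schematic sketch of greedy layer-wise stacking of RBMs.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, rng):
    # Placeholder trainer: returns weights of the right shape. A real
    # implementation would run contrastive divergence here.
    return rng.normal(0, 0.1, size=(n_hidden, data.shape[1]))

def train_deep_network(data, layer_sizes, rng):
    weights = []
    for n_hidden in layer_sizes:
        W = train_rbm(data, n_hidden, rng)
        weights.append(W)
        # Freeze this layer; its hidden activations feed the next RBM.
        data = sigmoid(data @ W.T)
    return weights

rng = np.random.default_rng(0)
weights = train_deep_network(rng.random((100, 784)), [500, 200, 50], rng)
print([W.shape for W in weights])  # [(500, 784), (200, 500), (50, 200)]
```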
Now, the mathematics of these systems should be examined to firmly ground the theory
on which the rest of this thesis is based.
2.2 Energy-Based Models
We begin with a description of energy-based models, a class that contains both the original Boltzmann machine and the Restricted Boltzmann machine. This derivation follows the outline given in the Appendix of [36] and updates the notation to be consistent with the formulae given elsewhere in this manuscript. In analogy to a physical system, we begin
by defining the probability of a visible state in the Boltzmann machine as exponential
in the energy:
P(v) = \sum_h P(v,h) = \frac{\sum_h e^{-E(v,h)}}{\sum_{x,h} e^{-E(x,h)}}    (2.1)
Here, the vector v denotes the current visible state of the system, recalling that the
Boltzmann machine and RBM are both bipartite machines composed of a visible state v
and a hidden state h. This equivalence marginalizes out the hidden states vector h given
some visible vector v, and additionally relates the energy function to the probability of a
state v. The normalizing factor on the denominator at the right of Equation 2.1, called
the partition function by analogy to physics, simply rescales the energy of the current
visible state by the sum over all possible system states. Note that to avoid confusion, x
is used for the visible vector in the partition function summation rather than v.
Where the ′ symbol indicates the transpose operation, b is an energy bias on the visible states, c an energy bias on the hidden states, and W a weight matrix providing an energy description of the joint activations of the hidden and visible states, the energy of a system configuration can be defined as:

E(v,h) = -b'v - c'h - h'Wv    (2.2)
Or, equivalently, as a sum over the i visible neurons and j hidden neurons:

E(v,h) = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} w_{ij} v_i h_j    (2.3)
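The equivalence of the vectorized form (Equation 2.2) and the element-wise sum (Equation 2.3) is easy to verify numerically; the check below uses illustrative sizes, with W stored as (hidden × visible) so that W[j, i] plays the role of w_ij.

```python
# Numerical check that Equation 2.2 and Equation 2.3 agree.
import numpy as np

def energy_matrix(v, h, b, c, W):
    # E(v,h) = -b'v - c'h - h'Wv  (Equation 2.2)
    return -b @ v - c @ h - h @ W @ v

def energy_sums(v, h, b, c, W):
    # E(v,h) = -sum_i b_i v_i - sum_j c_j h_j - sum_{i,j} w_ij v_i h_j
    return (-sum(b[i] * v[i] for i in range(len(v)))
            - sum(c[j] * h[j] for j in range(len(h)))
            - sum(W[j, i] * v[i] * h[j]
                  for i in range(len(v)) for j in range(len(h))))

rng = np.random.default_rng(0)
v = rng.integers(0, 2, 6).astype(float)   # 6 visible units
h = rng.integers(0, 2, 4).astype(float)   # 4 hidden units
b, c, W = rng.normal(size=6), rng.normal(size=4), rng.normal(size=(4, 6))
assert np.isclose(energy_matrix(v, h, b, c, W), energy_sums(v, h, b, c, W))
```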
The goal of this derivation will be to determine a method of updating the weights W to
force the system to achieve certain co-activations of v and h, and the tool used will be
minimizing the KL divergence. The KL divergence is a directional measure of similarity
between distributions, and in this case we want to minimize the difference between the
distribution of the visible data the machine is presented with and the distribution of data
Chapter 2. A Background in Restricted Boltzmann Machines and Deep Learning 10
it generates. Ideally, if we can find the gradient that minimizes the difference between the
distribution of input states forced by the outside world and the states the system settles
into according to its energy function, the system will eventually match the distribution
of the input states specified by the data. Let P+(v,h) denote the probabilities of input
data state (v,h), and P−(v,h) denote the model probability of a state (v,h).
D_{KL}(P^+ \,\|\, P^-) = \sum_{v,h} P^+(v,h) \log \frac{P^+(v,h)}{P^-(v,h)}    (2.4)
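The directionality of this measure is worth emphasizing: D_KL(P⁺‖P⁻) and D_KL(P⁻‖P⁺) generally differ. A small illustrative computation, with arbitrary example distributions:

```python
# Illustrative computation of the (asymmetric) KL divergence of Eq. 2.4
# between two small discrete distributions.
import numpy as np

def kl(p, q):
    # D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)); states with P(x) = 0
    # contribute nothing to the sum.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]
print(kl(p, q), kl(q, p))  # directional: the two values differ
```

The divergence is zero exactly when the two distributions coincide, which is why driving it to zero makes the model distribution match the data distribution.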
Noting that P^+(v,h) is the true probability of the data and is independent of the parameters, it is a constant when taking the gradient with respect to w_{ij}:

\frac{\partial D_{KL}(P^+ \,\|\, P^-)}{\partial w_{ij}} = -\sum_{v,h} \frac{P^+(v,h)}{P^-(v,h)} \frac{\partial P^-(v,h)}{\partial w_{ij}}    (2.5)
Now, we seek the term \partial P^-(v,h) / \partial w_{ij}, which we can calculate given the probability-energy relation given in Equation 2.1 and the energy-weight relation given in Equation 2.3. We pre-compute the partial derivative of the numerator in Equation 2.1 by using Equation 2.3:

\frac{\partial e^{-E(v,h)}}{\partial w_{ij}} = v_i h_j e^{-E(v,h)}    (2.6)
This term is now used in taking the full form of \partial P^-(v) / \partial w_{ij}:

\frac{\partial P^-(v)}{\partial w_{ij}} = \frac{\sum_h v_i^+ h_j^+ e^{-E(v,h)} \sum_{x,h} e^{-E(x,h)} - v_i^- h_j^- \sum_h e^{-E(v,h)} \sum_{x,h} e^{-E(x,h)}}{\left( \sum_{x,h} e^{-E(x,h)} \right)^2}    (2.7)
which can be simplified using the marginal probabilities in Equation 2.1:

\frac{\partial P^-(v)}{\partial w_{ij}} = v_i^+ h_j^+ \sum_h P(v,h) - v_i^- h_j^- P(v) \sum_{v',h} P(v',h)    (2.8)
The expression found in Equation 2.8 is the term sought for the KL divergence, and can be substituted into Equation 2.5 (writing G for the divergence D_{KL}(P^+ \,\|\, P^-)):
\frac{\partial G}{\partial w_{ij}} = -\sum_v \frac{P^+(v)}{P^-(v)} v_i^+ h_j^+ \sum_h P(v,h) + \sum_v \frac{P^+(v)}{P^-(v)} v_i^- h_j^- P(v) \sum_{v',h} P(v',h)    (2.9)
By the rules of conditional probability:
P^+(v,h) = P^+(h|v) \, P^+(v)    (2.10)
P^-(v,h) = P^-(h|v) \, P^-(v)    (2.11)
And since the hidden states are chosen according to the same parameterized model,
regardless of whether the visible states are given by the environment or whether the
visible states are chosen from the model distribution:
P^-(h|v) = P^+(h|v)    (2.12)
Using these facts to prepare a term and simplify:

\frac{P^-(v,h) \, P^+(v)}{P^-(v)} = P^-(h|v) \, P^-(v) \, \frac{P^+(v)}{P^-(v)}    (2.13)
= P^-(h|v) \, P^+(v) = P^+(h|v) \, P^+(v)    (2.14)
= P^+(v,h)    (2.15)
Of course, since we are dealing with probabilities:

\sum_v P^+(v) = 1    (2.16)
\sum_v P^-(v) = 1    (2.17)
Substituting all this into 2.9, reproduced here, simplifies the equation:
\frac{\partial G}{\partial w_{ij}} = -\sum_v \frac{P^+(v)}{P^-(v)} v_i^+ h_j^+ \sum_h P(v,h) + \sum_v \frac{P^+(v)}{P^-(v)} v_i^- h_j^- P(v) \sum_{v',h} P(v',h)    (2.18)

= -\left( \sum_{v,h} P^+(v,h) \, v_i^+ h_j^+ - \sum_{v',h} P(v',h) \, v_i^- h_j^- \right)    (2.19)
which is, up to a sign, just the difference between the expectations of the data distribution and the model distribution. Therefore, a weight update that decreases the distance between the model distribution P^- and the data distribution P^+ (a step against this gradient) is proportional to the difference in expectations of the binary product v_i h_j between the data and model distributions:

\Delta w_{ij} \propto \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}    (2.20)
Unfortunately, at this point training requires obtaining the expectation from the model distribution, a difficult problem that took decades to work around. However, this formulation appears many times in subsequent derivations and is an important result to build on. Ultimately, contrastive divergence, the Siegert approach, and evtCD all work by generating an estimate of the correlations between visible and hidden layers, and then minimizing the difference between the correlations the model generates and the correlations the data produces, so this is an important derivation to understand.
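For a tiny RBM, the model expectation in Equation 2.20 can actually be computed exactly by enumerating every (v, h) configuration; this is precisely the brute force that becomes intractable at realistic sizes and motivates the approximations above. Sizes here are illustrative.

```python
# Exact <v_i h_j>_model for a tiny RBM via brute-force enumeration of the
# partition function. Illustrative sizes only: 2^(4+3) = 128 states.
import itertools
import numpy as np

def model_expectation(W, b, c):
    """Exact <v_i h_j>_model by summing over all binary (v, h) states."""
    nh, nv = W.shape
    Z = 0.0
    expect = np.zeros_like(W)
    for v_bits in itertools.product([0, 1], repeat=nv):
        for h_bits in itertools.product([0, 1], repeat=nh):
            v = np.array(v_bits, float)
            h = np.array(h_bits, float)
            # e^{-E(v,h)} with E from Equation 2.2, i.e. e^{b'v + c'h + h'Wv}
            p_unnorm = np.exp(b @ v + c @ h + h @ W @ v)
            Z += p_unnorm
            expect += p_unnorm * np.outer(h, v)
    return expect / Z

rng = np.random.default_rng(0)
W = rng.normal(0, 0.5, size=(3, 4))   # 4 visible, 3 hidden units
b, c = rng.normal(size=4), rng.normal(size=3)
E_vh = model_expectation(W, b, c)
print(E_vh.shape)   # (3, 4); each entry is a probability in [0, 1]
```

Doubling the number of units squares the number of terms in this sum, which is why sampled estimates of these correlations are indispensable.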
2.3 Products of Experts
By a very different route, a similar formulation was described in 2002 with the invention
of product of experts (POE) [13] to yield the contrastive divergence learning rule. The
derivation of the POE given here, following [13], will incorporate lessons from Gibbs
sampling that are valid on POEs to yield a training rule that uniquely works for RBMs.
Imagine a system composed of n “experts,” each of which independently computes the
probability of some data vector d out of all possible data vectors c. These experts then
multiply their probabilities to combine their opinions on the likelihood of the data.
The intuition for this model is that the product of experts can constrain the system so
that the final result is a sharper distribution than any individual model, unlike mixture
models where each expert is summed with the other experts. In mixture models, the final
distribution will be at least as broad as the tightest individual distribution. In contrast,
a POE system can have individual experts that specialize in different dimensions, each
of which constrains different attributes to yield a very sharp posterior. For problems
that factorize well into component decisions, this is very advantageous; for example, a
system designed to detect human beings can be composed of a torso detector, a head
detector, and a limb detector whose individual contributions can be combined to yield an
overall human detector made of factorized experts. Many real-world problems factorize
into attributes in this way, which provides the motivation for POE systems. Moreover,
RBMs are POE models (other types of experts are also allowed), whereas general
Boltzmann machines are not POEs.
The POE can be formulated as follows:

p(d \mid \theta_1 \ldots \theta_n) = \frac{\prod_m p_m(d \mid \theta_m)}{\sum_c \prod_m p_m(c \mid \theta_m)} \qquad (2.21)
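For intuition, Equation 2.21 can be evaluated by brute force for a small binary state space. The sketch below is illustrative only: the two "experts" are hypothetical independent-Bernoulli models, not models from this thesis.

```python
import itertools
import numpy as np

def poe_probability(d, experts):
    """Brute-force evaluation of Eq. 2.21 for a small binary state space.

    `experts` is a list of functions, each returning the (unnormalized)
    probability p_m(d | theta_m) that expert m assigns to a binary vector."""
    k = len(d)
    numerator = np.prod([p(d) for p in experts])
    # Partition function: product of expert opinions, summed over all 2^k vectors c.
    denominator = sum(
        np.prod([p(np.array(c)) for p in experts])
        for c in itertools.product([0, 1], repeat=k)
    )
    return numerator / denominator

# Two toy experts over 3 binary units, each a product of independent Bernoullis
# (hypothetical probabilities chosen for illustration).
expert_a = lambda v: np.prod(np.where(v == 1, [0.9, 0.5, 0.1], [0.1, 0.5, 0.9]))
expert_b = lambda v: np.prod(np.where(v == 1, [0.8, 0.2, 0.2], [0.2, 0.8, 0.8]))

p = poe_probability(np.array([1, 0, 0]), [expert_a, expert_b])
```

Because both experts assign high probability to the first unit being on and the others off, the combined distribution is sharper than either expert alone, as the text describes.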
To train the POE, one possible goal is to maximize the probability of the data. Equiva-
lently, the derivative of the log likelihood of the data d can be calculated with respect to
the parameters θm. Since the models are independent with their probabilities multiplied
together, the derivative of the probability p with respect to the model parameter θm can
be found as follows:
\frac{\partial \log p(d \mid \theta_1 \ldots \theta_n)}{\partial \theta_m} = \frac{\partial \log p_m(d \mid \theta_m)}{\partial \theta_m} - \sum_c p(c \mid \theta_1 \ldots \theta_n) \frac{\partial \log p_m(c \mid \theta_m)}{\partial \theta_m} \qquad (2.22)
The first term, the derivative of the log probability of the data sample with respect to
the parameters θ_m, is controllable by design. If an expert model is chosen for which
the derivative of the data's log probability with respect to the model parameters can
be computed, then this term is straightforward to calculate. Commonly,
the “expert” used is a sigmoid:
p_m = \frac{1}{1 + e^{-(b + \sum_j x_j w_j)}} \qquad (2.23)
which yields a probability pm based on the inputs xj given, respectively, the bias and
weight parameters θm = {b,w}. It is clear, in this case, that the first term in 2.22 is
easily calculable:
\frac{\partial \log p_m(d \mid \theta_m)}{\partial \theta_m} = \left( 1 - \frac{1}{1 + e^{-(b + \sum_j x_j w_j)}} \right) \frac{1}{1 + e^{-(b + \sum_j x_j w_j)}} \qquad (2.24)
Returning to Equation 2.22, though its first term is easily calculable, the second term is
more challenging. The second term combines the probability of the product of experts
with the log probability of a single expert over all data vectors c. For any given data
point c, this term is straightforward to calculate, because it is the product of two
values: the POE probability p(c|θ_1...θ_n) and the derivative of the log likelihood as in
Equation 2.24. With sufficient time to evaluate all data vectors c, this would not be a
problem, but the size of the possible state space likely precludes calculating all possible
vectors. For example, for a vector of k binary neurons, the number of combinations is
exponential in k: there are 2^k possible vectors, a number which quickly grows beyond
available computational power.
However, this second term in Equation 2.22 can be alternately viewed as the expected
derivative of the log probability of an expert on data. This means that if accurate
samples can be drawn from the model distribution, the expectation can be approximated.
To obtain samples from the model distribution, Gibbs sampling can be used to draw
samples from a POE, unlike for Boltzmann machines. Now, the architecture of the
POE (and RBMs) becomes very important: with no intra-layer connections, all units
are conditionally independent of the other units in their layer. Since this is true, a
Gibbs sampling chain can be run in which all hidden units are updated in parallel while
visible units remain fixed, then all visible units are updated in parallel while hidden
units remain fixed. Once this MCMC chain converges to the equilibrium distribution,
the expectation can be calculated from the samples produced.
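The alternating parallel updates described above can be sketched as follows. Layer sizes, weight scales, and the number of Gibbs steps are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b_h, b_v):
    """One full Gibbs step: sample all hidden units in parallel given the
    visible layer, then all visible units in parallel given the hiddens.
    This is valid because units within a layer are conditionally independent."""
    p_h = sigmoid(v @ W + b_h)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + b_v)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h

# Toy dimensions: 6 visible and 4 hidden units (hypothetical).
W = rng.normal(0.0, 0.1, size=(6, 4))
b_h, b_v = np.zeros(4), np.zeros(6)
v = rng.integers(0, 2, size=6).astype(float)
for _ in range(100):    # run the chain toward its equilibrium distribution
    v, h = gibbs_step(v, W, b_h, b_v)
```

After many such steps the chain samples approximately from the model distribution, and the expectation in Equation 2.22 can be estimated from the collected samples.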
However, there is a faster way that Hinton introduces in [13]. Imagine that the data
is produced by some true distribution Q0. For a concrete example, learning the joint
probabilities of pixels being “on” together is a common task for a visual classifier, because
the machine learns the relationships between pixels in an image. Now the POE has some
model distribution Q∞ which would like to approximate the true distribution Q0. This
naming convention was chosen to mimic the behavior of a Markov chain beginning
with the true data Q0 and ending at the model’s equilibrium distribution Q∞ after
∞ steps. The idea of contrastive divergence is to minimize the difference between the
model distribution Q∞ and the true distribution Q0. Once again, a way to measure
the difference between these two distributions is to use the Kullback-Liebler divergence.
The KL divergence of these two distributions can be calculated for Q0 and Q∞ as:
D_{KL}(Q^0 \| Q^\infty) = \sum_d p_{Q^0}(d) \log p_{Q^0}(d) - \sum_d p_{Q^0}(d) \log p_{Q^\infty}(d) \qquad (2.25)

\phantom{D_{KL}(Q^0 \| Q^\infty)} = \sum_d Q^0_d \log Q^0_d - \sum_d Q^0_d \log Q^\infty_d \qquad (2.26)
Note that the first term in Equation 2.26 is entirely dependent on the data and is
not affected by the parameters of the model that specify Q∞. This means that during
training, no update of the parameters will affect this constant, so it can be safely ignored.
Unfortunately, calculating the second term is intractable, as it is the expectation of
the model's log probability of the data taken under the true data distribution.
However, note that the value Q^∞_d is just the probability the POE assigns to some data
given the parameters, p(d|θ_1...θ_n), and the term is again an expectation. Revisiting Equation 2.22
with this understanding yields:
\left\langle \frac{\partial \log Q^\infty_d}{\partial \theta_m} \right\rangle_{Q^0} = \left\langle \frac{\partial \log p_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{Q^0} - \left\langle \frac{\partial \log p_m(c \mid \theta_m)}{\partial \theta_m} \right\rangle_{Q^\infty} \qquad (2.27)
Here, finally, the key insight of contrastive divergence is applied. Imagine that instead of
minimizing the KL divergence DKL(Q0||Q∞), the difference between the KL divergences
D_{KL}(Q^0 \| Q^\infty) and D_{KL}(Q^1 \| Q^\infty) is minimized, where Q^1 refers to the
distribution after one full Gibbs step of sampling. Since Q^1 is closer to the equilibrium
distribution than Q^0, D_{KL}(Q^0 \| Q^\infty) exceeds D_{KL}(Q^1 \| Q^\infty) unless Q^0 = Q^1,
which implies Q^0 = Q^\infty; in that case it is no worse, and learning is already complete.
This "contrastive divergence" is therefore never negative. Most importantly, the intractable
expectation \langle \partial \log p_m(c \mid \theta_m) / \partial \theta_m \rangle_{Q^\infty} cancels out:
-\frac{\partial}{\partial \theta_m} \left( D_{KL}(Q^0 \| Q^\infty) - D_{KL}(Q^1 \| Q^\infty) \right) = \left\langle \frac{\partial \log p_m(d^0 \mid \theta_m)}{\partial \theta_m} \right\rangle_{Q^0} - \left\langle \frac{\partial \log p_m(d^1 \mid \theta_m)}{\partial \theta_m} \right\rangle_{Q^1} + \frac{\partial Q^1}{\partial \theta_m} \frac{\partial D_{KL}(Q^1 \| Q^\infty)}{\partial Q^1}
In this equation, the terms d^0 and d^1 are introduced; they refer to the starting data
and the data after one Gibbs sampling step, respectively. In [13], Hinton showed that the
final term \frac{\partial Q^1}{\partial \theta_m} \frac{\partial D_{KL}(Q^1 \| Q^\infty)}{\partial Q^1} is small and rarely opposes the direction of learning. By
ignoring this term, a very tractable learning rule is obtained for POEs in general:
\Delta \theta_m \propto \left\langle \frac{\partial \log p_m(d^0 \mid \theta_m)}{\partial \theta_m} \right\rangle_{Q^0} - \left\langle \frac{\partial \log p_m(d^1 \mid \theta_m)}{\partial \theta_m} \right\rangle_{Q^1} \qquad (2.28)
This discovery is relevant because the contrastive divergence rule holds for any system
composed of independent, probabilistic nodes that specify a probability through the
product of their independent activations. For a network of spiking neurons encoding the
joint probabilities of certain features being on, the difference between the expectations
under the data and under a quickly-derived model sample may be sufficient to train the
neural network.
2.4 RBMs and Contrastive Divergence
Finally, contrastive divergence on RBMs will be examined here because of the similarities
between RBMs and the evtCD learning architecture. This section continues the derivation
given in [13]. Contrastive divergence in RBMs draws from both the Boltzmann machine's
KL divergence minimization and the POE's contrastive divergence form to yield a
particularly easy-to-compute result. Beginning again with the POE log-likelihood
given in Equation 2.22:
\frac{\partial \log p(d \mid \theta_1 \ldots \theta_n)}{\partial \theta_m} = \frac{\partial \log p_m(d \mid \theta_m)}{\partial \theta_m} - \sum_c p(c \mid \theta_1 \ldots \theta_n) \frac{\partial \log p_m(c \mid \theta_m)}{\partial \theta_m} \qquad (2.29)
Imagining a Boltzmann machine with a single hidden node j, note that \theta_m = w_j, and the
term \frac{\partial \log p_m(d \mid \theta_m)}{\partial \theta_m} can be obtained from the Boltzmann derivation given above:

\frac{\partial \log p_m(d \mid w_j)}{\partial w_{ij}} = \langle s_i s_j \rangle_d - \langle s_i s_j \rangle_{Q^\infty(j)} \qquad (2.30)
The second term can also be calculated similarly:
\sum_c p(c \mid \mathbf{w}) \frac{\partial \log p_j(c \mid w_j)}{\partial w_{ij}} = \langle s_i s_j \rangle_{Q^\infty} - \langle s_i s_j \rangle_{Q^\infty(j)} \qquad (2.31)
Subtracting these two equations from each other per the derivation, and then taking the
expectation over the whole dataset yields:
\left\langle \frac{\partial \log Q^\infty_d}{\partial w_{ij}} \right\rangle_{Q^0} = -\frac{\partial D_{KL}(Q^0 \| Q^\infty)}{\partial w_{ij}} = \langle s_i s_j \rangle_{Q^0} - \langle s_i s_j \rangle_{Q^\infty} \qquad (2.32)
Finally, applying the contrastive divergence approximation gives:
-\frac{\partial}{\partial w_{ij}} \left( D_{KL}(Q^0 \| Q^\infty) - D_{KL}(Q^1 \| Q^\infty) \right) = \Delta w_{ij} \approx \langle s_i s_j \rangle_{Q^0} - \langle s_i s_j \rangle_{Q^1} \qquad (2.33)
This is the final form of contrastive divergence in RBMs. Next, we will examine the
intuition behind some subsequent improvements to this rule.
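Equation 2.33 can be sketched as a CD-1 weight update. This is a minimal illustration with biases omitted; the variable names and learning rate are hypothetical, not taken from this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, eta=0.1):
    """One contrastive divergence (CD-1) update for an RBM, following Eq. 2.33."""
    ph0 = sigmoid(v0 @ W)                               # hidden probabilities under Q0
    h0 = (rng.random(ph0.shape) < ph0).astype(float)    # sample binary hidden states
    pv1 = sigmoid(h0 @ W.T)                             # reconstruct the visible layer
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W)                               # hidden probabilities under Q1
    # <s_i s_j>_Q0 - <s_i s_j>_Q1, using probabilities for the hidden factor.
    return W + eta * (np.outer(v0, ph0) - np.outer(v1, ph1))

# Toy example: 4 visible units, 3 hidden units (hypothetical sizes).
v0 = np.array([1.0, 0.0, 1.0, 1.0])
W = rng.normal(0.0, 0.1, size=(4, 3))
W = cd1_update(v0, W)
```

Using the hidden probabilities rather than sampled binary states for the correlation estimate is a common variance-reduction choice; sampled states would also be a valid instance of the rule.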
2.5 Extensions to Standard Learning Rules
This standard learning rule has been modified over time to yield practical improvements
in learning, or performance optimizations to speed it up. This section will examine the
most common of these extensions, and informally discuss why these extensions work.
The first extension, introduced in the early papers, is batch training: processing a set
of samples in parallel to obtain a better estimate of the gradient. In [13], Hinton
introduces batch training to combat the variance of individual samples. In training, the
goal is to estimate the gradient of a distribution from samples, but the variance of a
single sample can be large enough to swamp the gradient of the true distribution. By
averaging the gradient over several parallel samples, the inter-sample variance is
reduced relative to the learning signal. Moreover, the
parallel processing enables vector operation speedups, which are a common method of
optimization on modern computers. This idea is more fully explored in Section 6.1.4.
A common extension to gradient learning rules is momentum. It is straightforward to
implement, and acts by adding a fraction of the previous update to the current one.
When training is far from convergence, momentum accelerates progress by combining the
previous step and the current step into a larger step. The downside is that too much
momentum can cause learning to overshoot the goal distribution near the end of the
learning procedure.
Decay is also commonly applied to networks to help regularize the weights. By slowly
decaying the weight values, overlearned solutions with very strong weights are pulled
back towards equilibrium, regularizing the system. If the decay is too fast, the learning
procedure performs worse, but the right amount of decay makes the system more stable to
new data points and decreases overlearning.
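Momentum and weight decay combine naturally into a single parameter update. The sketch below is illustrative; the coefficient values are hypothetical, not tuned values from this thesis.

```python
import numpy as np

def apply_update(W, grad, velocity, eta=0.1, momentum=0.9, decay=1e-4):
    """Gradient step with momentum and weight decay (hypothetical coefficients).

    Momentum carries a fraction of the previous update forward, accelerating
    learning far from convergence; weight decay pulls large weights back
    toward zero, discouraging overlearned solutions."""
    velocity = momentum * velocity + eta * (grad - decay * W)
    W = W + velocity
    return W, velocity

# With a zero gradient, only decay acts, shrinking the weights slightly.
W = np.ones((2, 2))
vel = np.zeros((2, 2))
W, vel = apply_update(W, np.zeros((2, 2)), vel)
```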
Persistent contrastive divergence is a significant contribution by [41], in which the
equilibrium distribution is approached faster over time by continuing the Gibbs process
with every new data point, moving the sample points closer to the equilibrium
distribution. Interestingly, in this case the data and the model interact only through
the weights; the Gibbs chain that samples from the model distribution is initialized once
and slowly steps towards equilibrium. Of course, the equilibrium distribution specified
by the weights is itself changing slowly, but if the learning rate is low enough then the
process yields samples more closely tied to the equilibrium distribution. The comparison
between regular CD (called CD-1 because each model sample point comes from a
distribution one Gibbs step away from the data) and persistent contrastive divergence is
visualized in Figure 2.3. This idea is explored for spiking networks in Section 6.1.6.
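The persistent-chain idea can be sketched as follows; the chain state is carried across updates instead of being reset to the data. Sizes, learning rate, and data vectors are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def pcd_update(v_data, v_chain, W, eta=0.05):
    """Persistent CD sketch: the negative-phase chain `v_chain` is never reset
    to the data; it keeps stepping between updates, so its samples drift
    toward the (slowly moving) equilibrium distribution."""
    ph_data = sigmoid(v_data @ W)                        # positive-phase statistics
    ph = sigmoid(v_chain @ W)                            # advance the persistent chain
    h = (rng.random(ph.shape) < ph).astype(float)
    pv = sigmoid(h @ W.T)
    v_chain = (rng.random(pv.shape) < pv).astype(float)
    W = W + eta * (np.outer(v_data, ph_data)
                   - np.outer(v_chain, sigmoid(v_chain @ W)))
    return W, v_chain

W = rng.normal(0.0, 0.1, size=(4, 3))
v_chain = rng.integers(0, 2, size=4).astype(float)       # initialized once
for v_data in [np.array([1.0, 0.0, 1.0, 0.0]), np.array([0.0, 1.0, 0.0, 1.0])]:
    W, v_chain = pcd_update(v_data, v_chain, W)
```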
Finally, sparsity and selectivity have been incorporated into the learning procedure in
[42]. This innovation was originally inspired by biological evidence indicating that
neural receptive fields tend to be sparse, i.e. a neuron responds to a very specific set
of stimuli, and selective, i.e. a neuron responds rarely over time. By incorporating a
cost function and biasing the activations, neurons in an RBM can be forced to learn
different stimuli than are covered by others, and to choose discriminative features by
choosing receptive fields that occur only rarely. This scatters the receptive fields in a
more dispersed way over the problem space and leads to improved generalization and
test set performance.
Figure 2.3: Comparison of CD-1 and PCD. (a) In the CD-1 form of contrastive divergence,
every training iteration pairs the data distribution Q0 with model layers obtained one
Gibbs step away (Q1). (b) In persistent contrastive divergence, the model chain is not
reset, so successive training iterations pair Q0 with Q1, Q2, and Q3.
Chapter 3
Derivation of evtCD
This chapter introduces an online, event-based training rule for spiking RBMs. In Sec-
tion 3.1, previous work on training spiking RBMs is examined. Subsequently, the evtCD
algorithm that forms the core contribution of this thesis is introduced in Section 3.2.
3.1 Spiking Restricted Boltzmann Machines
Variations of training spiking RBMs have been tried, generally without much success. In
1999, Hinton and Brown [43] investigated using sigmoids that spike over time
(eschewing more biological neuron models), beginning an investigation into sequence
learning. However, this work was largely not continued; in the next appearance, Teh
and Hinton [44] applied contrastive divergence to continuously-valued face images using
rates instead of binary spike events. Purely binary representations do not work well
with the real-valued intensities of the face, so the authors proposed encoding a single
intensity with multiple binary features to allow discretized variations in intensity.
Alternatively, they viewed this as a rate-based code of binary events, and called their system
“RBMrate” as a result. This was subsequently applied to time-series data in [45], which
then moved further from the rate-based approximation of spiking to diffusion networks
of continuous values.
A significant advance was made when O’Connor proposed [46] using a rate-based model
of a neuron in his thesis work to encode the state values in contrastive divergence. In this
way, networks of spiking LIF neurons can be trained offline using a rule very similar to
standard contrastive divergence. However, instead of using binary activations adopted
from sigmoid probabilities, the layer activations were taken to be the continuously-
valued neuron spike rates given by the Siegert formula. Once trained, the weights are
transferred from the rate-based model to a network of spiking LIF neurons, and the
network of spiking neurons behaves according to the energy functions of an RBM. The
framework introduced there forms the starting point for the work in this thesis.
The actual function that transforms input rates and weights into an output rate is called
the Siegert function, from [47], and has a complicated analytical form. For completeness
and the relevance of this method, it is given here, but the reader is encouraged to
consult other works such as [48] for a more in-depth analysis of this function. Given
excitatory input rates ρe and inhibitory input rates ρi, the following auxiliary
variables can be calculated:
\mu_Q = \tau \sum (\vec{w}_e \vec{\rho}_e + \vec{w}_i \vec{\rho}_i) \qquad (3.1)

\sigma_Q^2 = \frac{\tau}{2} \sum (\vec{w}_e^2 \vec{\rho}_e + \vec{w}_i^2 \vec{\rho}_i) \qquad (3.2)

\Upsilon = V_{rest} + \mu_Q \qquad (3.3)

\Gamma = \sigma_Q \qquad (3.4)

k = \sqrt{\tau_{syn}/\tau} \qquad (3.5)

\gamma = |\zeta(1/2)| \qquad (3.6)
where τsyn is the synaptic time constant, τ is the membrane time constant, and ζ is the
Riemann zeta function. With these auxiliary variables, the average firing rate ρout of
the neuron with resting potential Vrest and reset potential Vreset can be computed as
[49]:
\rho_{out} = \left( t_{ref} + \frac{\tau}{\Gamma} \sqrt{\frac{\pi}{2}} \int_{V_{reset}+k\gamma\Gamma}^{V_{th}+k\gamma\Gamma} \exp\left[ \frac{(u-\Upsilon)^2}{2\Gamma^2} \right] \cdot \left[ 1 + \mathrm{erf}\left( \frac{u-\Upsilon}{\Gamma\sqrt{2}} \right) \right] du \right)^{-1} \qquad (3.7)
Let the function r_j = φ(r_i, w, θ_sgrt)/r_max denote the resulting firing rates r_j returned
by the Siegert function φ for input rates r_i, weights w, and parameter set θ_sgrt. The
Siegert-computed rate is then normalized by the maximum firing rate r_max = 1/t_ref. Since a
neuron cannot spike faster than 1/t_ref spikes per second, this normalization maps
the output of the Siegert function to the range [0, 1]. Now, for visible unit rates r_i and
hidden unit rates r_j, the contrastive divergence rule becomes:
\Delta w_{ij} \propto \langle r_i r_j \rangle_{Q^0} - \langle r_i r_j \rangle_{Q^1} \qquad (3.8)
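As a concrete illustration, the Siegert rate of Equations 3.1-3.7 can be evaluated numerically. The sketch below uses a simple trapezoid-rule integration, and all neuron parameters (time constants, thresholds, input weights and rates) are hypothetical values, not parameters from this thesis.

```python
import numpy as np
from math import pi, sqrt, erf

def siegert_rate(w_e, rho_e, w_i, rho_i, tau=0.02, tau_syn=0.001,
                 t_ref=0.002, v_rest=0.0, v_reset=0.0, v_th=1.0):
    """Numerical sketch of the Siegert firing rate (Eqs. 3.1-3.7)."""
    gamma_zeta = 1.4603545                                # |zeta(1/2)|, Eq. 3.6
    mu_q = tau * (np.dot(w_e, rho_e) + np.dot(w_i, rho_i))            # Eq. 3.1
    sigma_q = sqrt(tau / 2.0 * (np.dot(w_e ** 2, rho_e)
                                + np.dot(w_i ** 2, rho_i)))           # Eq. 3.2
    upsilon = v_rest + mu_q                                           # Eq. 3.3
    Gamma = sigma_q                                                   # Eq. 3.4
    k = sqrt(tau_syn / tau)                                           # Eq. 3.5
    # Integral of Eq. 3.7, evaluated with the trapezoid rule on a fixed grid.
    u = np.linspace(v_reset + k * gamma_zeta * Gamma,
                    v_th + k * gamma_zeta * Gamma, 2000)
    z = (u - upsilon) / Gamma
    f = np.exp(z ** 2 / 2.0) * (1.0 + np.array([erf(x / sqrt(2.0)) for x in z]))
    integral = float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(u)))
    return 1.0 / (t_ref + (tau / Gamma) * sqrt(pi / 2.0) * integral)

# Toy inputs: two excitatory and one (negative-weight) inhibitory Poisson source.
rate = siegert_rate(np.array([0.5, 0.5]), np.array([40.0, 40.0]),
                    np.array([-0.2]), np.array([10.0]))
```

The output rate is bounded above by 1/t_ref, consistent with the normalization to [0, 1] described above.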
This works quite well in practice, yielding networks that can achieve accuracies greater
than 95% on the MNIST handwritten digit benchmark task [14]. The receptive fields
resemble the digit parts that constitute handwritten digits as can be seen in Figure 3.1.
Figure 3.1: A visualization of the receptive fields of hidden layer neurons trained using
the Siegert method. Each square represents the weights of the visible layer connected to
that hidden layer neuron. Note here that the fields "factorize" handwritten digits into
digit parts. Figure from [14].
Knowing that Equation 3.8 performs well for training rate-based approximations of LIF
neurons, a link can be sought between a spike-based rule and this rate-based rule.
3.2 evtCD, an Online Learning Rule for Spiking Restricted
Boltzmann Machines
In addition to a possible performance improvement by switching from evaluating the
complex Siegert function to a simple update rule that occurs upon spiking, there is a
biological justification for examining spike-timing dependent plasticity (STDP). In the
brain, neurons change their connection strengths in response to the relative time between
the input neuron firing and the postsynaptic neuron firing (for a review of this so-called
spike-timing dependent plasticity, see [32, 33] and Figure 3.2). Many variations of these
STDP rules exist to capture the varieties of learning that spiking neurons display.
Since a rate-based rule can describe the net result of a large number of spikes, could it
be possible to design a spike-based rule that, in the rate-based limit, approximates the
Siegert update rule given in Equation 3.8? We begin by separating the problem into
identical subproblems specified by \Delta w^+_{ij} and \Delta w^-_{ij}:
Figure 3.2: Visualization of spike-timing dependent plasticity. This figure is taken from
[50], a seminal experiment measuring the effect of spike timing on changes in synaptic
strength. Shown here is the percent change of synaptic strength as a result of the
relative timing of presynaptic and postsynaptic spikes. In this experiment, presynaptic
spikes arriving before the postsynaptic neuron fires result in synaptic strengthening
(causal reinforcement), and presynaptic spikes arriving after the postsynaptic neuron
fires result in synaptic weakening (acausal depression). This differs from the STDP rule
used in evtCD, which only responds to causal spikes.
\Delta w_{ij} \propto \langle r_i r_j \rangle_{Q^0} - \langle r_i r_j \rangle_{Q^1} \qquad (3.9)

\Delta w_{ij} = \Delta w^+_{ij} - \Delta w^-_{ij} \qquad (3.10)

\Delta w^+_{ij} = \eta \langle r_i r_j \rangle_{Q^0} \qquad (3.11)

\Delta w^-_{ij} = \eta \langle r_i r_j \rangle_{Q^1} \qquad (3.12)
In this weight update rule, there are four different populations of neurons with their
individual rates: for Q^0, which we will call the data layers, there are the r_i rates
representing the visible layer, and the r_j rates representing the hidden layer.
Similarly, for Q^1, which we will call the model layers, there are the r_i rates
representing the visible layer, and the r_j rates representing the hidden layer. Since
this problem decomposes cleanly into the rates of four different populations of neurons,
those in the visible and hidden layers of the data and model distributions, we begin by
proposing a four-layer
architecture (Figure 3.3). Unlike contrastive divergence, a spike-based learning rule
requires populations to track states. In contrastive divergence, the sigmoids have no
continuity through time, so the notion of populations is not necessary; the state of a
layer is a random sample, drawn according to the probability function of its given inputs.
Here, however, networks of spiking LIF neurons maintain the states, so the four states
of the network are represented by four physically distinct populations: the data visible,
data hidden, model visible, and model hidden populations. Since the data and model
distributions share a weight matrix, they must be matched in size, but the visible and
hidden layers can be sized according to problem constraints. Similar to the standard
RBM, there are no intra-layer connections, the weights propagate activations forward
between visible and hidden layers, and the hidden-visible weights are the transpose of
the forward weights.
Figure 3.3: Architecture of a network used for event-based STDP-like updates. The evtCD
algorithm relies on a network of four neural layers (data visible, data hidden, model
visible, and model hidden), each encoding a different set of rates. The arrows indicate
the direction of information flow in the network. Importantly, the weight matrix W is
shared between the data and the model layers and determines the connection strength
between the visible and the hidden layers. The weight transpose W′ connects the data
hidden layer back to the model visible layer.
If the inputs are assumed to be Poisson-distributed with a rate specified by r, then the
expected number of spikes per unit time is r. We begin by proposing the following
STDP-inspired weight update rule:
\Delta w^+_{ij} = \begin{cases} \eta & \text{if } h^+_i = 1,\ v^+_j = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (3.13)

\Delta w^-_{ij} = \begin{cases} -\eta & \text{if } h^-_i = 1,\ v^-_j = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (3.14)
Since two samples are unlikely to ever be 1 at exactly the same point in continuous time,
we define a windowing period twin over which this statement can be valid (see Figure
3.4). Then, the ratio of number of spikes expected to occur in each window is the ratio
of the size of the window to the overall unit time length of the rate:
\mathbb{E}[h_i]_{t_{win}} = \langle h_i \rangle_{t_{win}} = t_{win}\, r_{h,i} \qquad (3.15)
The update rule, however, needs to be causal, since it is designed to operate in real
time: the system cannot learn about inputs that have not yet occurred. The windowing on
the Poisson rates is just an average rate over an arbitrarily chosen constant time
period, so we choose the period to end at the current time t and begin at t − t_win. In
the limit, then, the expected number of h_i events produced over a unit time period is
the rate r_{h,i}.
Therefore, the update rule is equivalent:
\Delta w_{ij} \propto \langle \langle h_i \rangle_{t_{win}} \langle v_j \rangle_{t_{win}} \rangle_{Q^0} - \langle \langle h_i \rangle_{t_{win}} \langle v_j \rangle_{t_{win}} \rangle_{Q^1} \qquad (3.16)

\Delta w_{ij} \propto \langle r_i r_j \rangle_{Q^0} - \langle r_i r_j \rangle_{Q^1} \qquad (3.17)
where the spike states h_i and v_j indicate the presence or absence of a spike, and the
time window t_win denotes the time over which the expectation is carried out. Importantly,
this is a spike-based rule and exceptionally sparse in its computation. Since both h_i and
v_j must be 1 for learning to occur, and the time window is defined such that the spike
h_i occurs at the end of the window, a possible weight update only needs to be calculated
when a hidden layer neuron spikes. At that point, the neuron can check which of its
inputs spiked within the previous t_win and either potentiate or depress its weights as
specified by the rule.
This rule, called the evtCD learning rule, is shown in Figure 3.4, and four examples
are shown in Figure 3.5.
Importantly, the evtCD rule only needs to be evaluated when a hidden neuron spikes.
Since the update is a product of two binary events, v_j h_i is only 1 when both v_j and
h_i are active. If either one is not active, the system does not update the weights, and
the connecting weight w_ij remains fixed. This is a key feature that allows very sparse
event-based computation.
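The spike-triggered check described above can be sketched compactly. The learning rate, window length, and array layout below are hypothetical illustration choices, not values from this thesis.

```python
import numpy as np

ETA, T_WIN = 0.005, 0.01  # learning rate and STDP window (hypothetical values)

def on_hidden_spike(W, j, last_visible_spike, t, data_layer):
    """Apply the evtCD update when hidden neuron j spikes at time t.

    `last_visible_spike[i]` holds the most recent spike time of visible
    neuron i. Visible spikes that occurred within the window t_win before t
    are causal candidates: their weights w_ij are potentiated for spikes in
    the data layers and depressed for spikes in the model layers."""
    dt = t - last_visible_spike
    causal = (dt >= 0.0) & (dt <= T_WIN)
    W[causal, j] += ETA if data_layer else -ETA
    return W

# Example: three visible neurons, two hidden neurons; hidden neuron 0 spikes
# at t = 1.0 s, and only the visible spikes at 0.992 s and 0.995 s are causal.
W = np.zeros((3, 2))
last_spiked = np.array([0.000, 0.992, 0.995])
W = on_hidden_spike(W, 0, last_spiked, t=1.0, data_layer=True)
```

No computation is required for weights whose visible neuron did not spike in the window, which is exactly the sparsity the text highlights.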
Next, we examine how to implement this architecture in practice.
Figure 3.4: The evtCD rule derived for this work. The learning rule is divided into two
halves, a weight-potentiating rule and a weight-depressing rule. Unlike most learning
rules, spikes from one set of populations (the data layers) only potentiate the weight
matrix, and spikes from another set of populations (the model layers) only depotentiate
the weight matrix. In both cases, the weight change only occurs if a hidden layer spike
("post") occurs within a window t_win after a visible layer spike ("pre"). In all other
cases, the weight remains fixed.
Figure 3.5: Four examples of the applied evtCD rule. This diagram is divided into two
halves, like the evtCD learning rule: spikes on the left side occur in the data layers,
and spikes on the right side occur in the model layers. The gray box preceding a hidden
layer spike represents the time window t_win, and spikes are vertical jumps with time on
the horizontal axis. The upper left quadrant represents a weight increase for the w_ij
connecting two neurons, because the visible layer spikes before the hidden layer and
within the time window t_win. The lower left produces no change because the visible layer
spike occurred outside the hidden layer spike window. On the right, the model
distribution behaves identically, but its weight update results in a decrease of w_ij
rather than an increase, since these spikes take place in the model layers.
Chapter 4
Implementation of evtCD
This chapter describes the software implementation of the evtCD learning rule. Now
that it has been shown mathematically in Equations 3.13 and 3.14, as well as
diagrammatically in Figure 3.4, Section 4.1 presents a method for implementing the algorithm.
In Section 4.2, a method of supervised training using the primarily unsupervised evtCD
algorithm is introduced using an idea from earlier work on RBMs [13], allowing labels
to guide the training of networks of spiking neurons.
4.1 Algorithm Recipe for Software Implementation
The aforementioned evtCD algorithm is straightforward to implement for a software
simulation. The process is as follows:
1. Begin by initializing auxiliary variables:
(a) Membrane{1:4}, the membrane potentials of the four neuron layers;
(b) Last Spiked{1:4}, the last time each neuron has spiked, by layer, in order to
find whether a neuron is a possible cause of spiking;
(c) Refrac End{1:4}, the time when the refractory period for a given neuron in a
given layer last ended, used to determine if a neuron is currently refractory;
(d) last update[1:4], the last time a neuron layer was updated with an incoming
spike, used in calculating membrane potential decay;
(e) Thr{1:2}, the thresholds for the visible and hidden neuron layers (shared
between the data and the model);
(f) W, the shared weight matrix.
The matrix Last Spiked that stores the time of the previous spike should be initialized
to a large negative value so as not to artificially potentiate weights from the initial
spikes. Membrane, Refrac End, and last update can be initialized to zero.
Thr, the spike threshold, is typically initialized to a value of 1 for all neurons.
Finally, W is chosen to be initialized from the uniform distribution on [0, 1], because
these uniformly excitatory weights cause more initial spikes for training than a
Gaussian centered around zero.
2. Begin processing input spikes. For an event-based implementation, a priority queue of
spikes is preferable, using a key composed of the time and the layer; every insertion
and extraction is O(log n). Every spike should be a triple of (time, address, layer).
This data structure is a convenience to accomplish a task biology does very simply:
delaying spikes between their generation and arrival. Here, a priority queue keeps the
spikes sorted according to time and layer, and ensures that the first spike processed
is the one that happens first; biology takes care of this problem automatically.
3. For each input spike:
(a) Decay the membrane potential of the receiving layer by e^{-\Delta t/\tau}, calculating
\Delta t from the spike time and the last update time for the receiving layer.
(b) Add an impulse corresponding to the weight w_{i,j} from the input spiking neuron j
to the receiving neuron i, provided the receiving neuron's refractory end period
refrac end is less than the current time. Since the visible-to-hidden layer weights
are W and the hidden-to-visible layer weights are W', index into the weight matrix
appropriately based on the layer.
(c) If desired, add noise to the neuron membrane potentials.
(d) Examine the updated neurons, comparing their membrane potentials to the
threshold membrane potentials.
(e) For every neuron that exceeds the threshold for that neuron:
i. Record a new refractory end period: refrac end{layer}[i] = spike time
+ t ref.
ii. Reset the membrane potential: Membrane{layer}[i] = 0.
iii. Record this time as the last time the neuron spiked: Last Spiked{layer}[i]
= spike time.
iv. Adjust the threshold by either lowering the threshold if this spike comes
from the data distribution (making more spikes more likely in the model
distribution), or by raising the threshold if this spike originates from a
model distribution.
v. If this layer is a hidden layer, an STDP weight update can be performed.
If this layer is a data distribution layer (layers 0 or 1), then every weight
w_ij corresponding to an input neuron j whose Last Spiked{layer-1}[j]
is within the time window t_win should be potentiated. If the current
layer instead belongs to the model distribution, spikes that occurred within
the previous window will be depotentiated. Regardless, if there was no
spike from a preceding neuron, its weight is unaffected.
vi. Add new spikes to the spike queue so downstream neurons receive these
spikes.
Note that in its simplest form, the exponential decay can be handled by bitshifting. The
summation of input currents is a sum, learning requires only a lookup and a comparison,
and updating a weight with a new value is only another addition. Even if the only
available operations are bitshifting, addition, and subtraction, this rule is
implementable, making it ideal for low-compute architectures [21].
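The event-processing loop of steps 2-3 can be sketched for a single receiving layer using Python's `heapq` as the priority queue. Neuron parameters, weights, and the single-layer simplification are illustrative assumptions; the full algorithm maintains four layers and the STDP bookkeeping described above.

```python
import heapq
import math

TAU, T_REF, THRESHOLD = 0.02, 0.002, 1.0  # hypothetical neuron parameters

def run(spikes, W, n_out):
    """Process (time, address, layer) spike triples in time order for one
    receiving layer of n_out LIF neurons; returns the output spikes."""
    queue = list(spikes)
    heapq.heapify(queue)              # priority queue keyed on spike time
    membrane = [0.0] * n_out
    refrac_end = [0.0] * n_out
    last_update = 0.0
    out_spikes = []
    while queue:
        t, j, layer = heapq.heappop(queue)
        decay = math.exp(-(t - last_update) / TAU)    # step 3(a): decay
        last_update = t
        for i in range(n_out):
            membrane[i] *= decay
            if t >= refrac_end[i]:                    # step 3(b): not refractory
                membrane[i] += W[j][i]
            if membrane[i] > THRESHOLD:               # steps 3(d)-3(e): spike
                refrac_end[i] = t + T_REF
                membrane[i] = 0.0
                out_spikes.append((t, i))
    return out_spikes

# One input neuron strongly connected to output neuron 0 (toy weights).
out = run([(0.001, 0, 0)], [[1.5, 0.0]], n_out=2)
```

A full implementation would also push the emitted spikes back onto the queue for downstream layers (step 3(e)vi) and record Last Spiked for the STDP update.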
4.2 Supervised Training with evtCD
To measure the efficacy of the evtCD algorithm, it is necessary to objectively assess
its accuracy. In Chapter 5, the exact methods of performance measurement will be
explained, but it is worth discussing the training process by which supervised training
can occur in the unsupervised process of distribution matching. The idea is quite old
and goes back to the early days of training unsupervised learners: by making the label
part of the input distribution, the system is forced to learn the relationship between
the input distribution and the labels and to cluster these elements together [1, 13].
See Figure 4.1 for an example; by concatenating the top label layer with the input
layer, a new RBM can be made that trains in a supervised way when it performs
distribution matching.
After training this RBM as if it were a normal RBM, it can be unrolled back to its
original configuration to perform classification. By separating the weights and biases
from the joint layer correctly, the original three-layer architecture can be reconstructed.
Then, passing the activations through the three layers, input to hidden to label, will
classify an example according to the weights of the system.
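The concatenation and subsequent unrolling can be sketched as follows. The helper names, sizes, and the bias-free sigmoid pass are hypothetical illustration choices, not the exact implementation used in this thesis.

```python
import numpy as np

def concat_label(v, label, n_classes=10):
    """Append a one-hot label to the visible vector so the RBM learns the
    joint distribution of inputs and labels."""
    one_hot = np.zeros(n_classes)
    one_hot[label] = 1.0
    return np.concatenate([v, one_hot])

def classify(v, W_joint, n_visible):
    """Unroll after training: split the joint weights back into input weights
    W1 and label weights W2, then pass input -> hidden -> label."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    W1, W2 = W_joint[:n_visible], W_joint[n_visible:]   # rows: visible then label units
    h = sigmoid(v @ W1)
    return int(np.argmax(h @ W2.T))                     # most active label unit

# Toy example: 4 input units, 5 classes, 6 hidden units (hypothetical sizes).
rng = np.random.default_rng(0)
W_joint = rng.normal(0.0, 0.1, size=(4 + 5, 6))
joint = concat_label(np.ones(4), 3, n_classes=5)
label = classify(np.ones(4), W_joint, n_visible=4)
```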
Figure 4.1: Method of supervised learning using unsupervised learning: by concatenating
the label layer and the visual input layer into a joint visible layer (with weights
{W1, W2} to the hidden layer), and learning the joint representation of input and label,
the system is forced to learn to cluster the labels with the data.
Chapter 5
Test Methodology
This chapter describes the setup used to train and evaluate the evtCD networks. In
Section 5.1, the dataset used to assess the performance of the training algorithm is
introduced. Following that, two implementations are described for achieving different
aims. In Section 5.2, the methodology for the time-stepped simulation is described, and
Section 5.3 discusses the methodology of using the network for online training of a
spiking RBM.
5.1 MNIST
The MNIST handwritten digit dataset is an extremely popular dataset used in machine learning, compiled by [35], and often used as a benchmark task for new learning algorithms. The challenge is straightforward: given 60,000 training digits, each a 28 by 28 pixel image with the digit in the center, correctly identify a handwritten digit from the 10,000 digit test set.
A human achieves about 99.6% accuracy on this dataset, and the best algorithms in
the world achieve equivalent performance [1, 35, 51]. Often, to boost performance,
transformations such as rotations, translations, and deformations are applied to the images to enlarge the training set and improve robustness [14, 37]. These transformations
were not performed here.
There are several attributes that make this dataset attractive for initial investigations of a new machine learning rule. First, its dimensionality of 784 inputs is modest: large enough to challenge simple algorithms, yet small relative to a modern computer's processing power. Second, the data can be represented as binary activations without losing its identifying characteristics; a pen mark can be characterized as either present or missing, and learning rules (including evtCD and CD [13]) initially only
supported binary data. Standard RBMs have since been extended to represent real-valued inputs [37], but that investigation has not yet been performed for evtCD training. Third, the weights of the neurons are interpretable, as they represent receptive fields over the digits: by visually inspecting the weights of hidden neurons, it is possible to tell whether the learning rule has produced proper receptive fields that decompose the input into component parts. Finally, members of the machine learning field are very familiar with this dataset, so conclusions about an algorithm's strengths and weaknesses can be drawn clearly on this common benchmark.
Figure 5.1: Six digits from the MNIST corpus. Across the top row are examples of easily classified digits, and the bottom row contains digits 1, 2, and 8 that posed difficulty for the evtCD algorithm.
Performance on the MNIST benchmark task can be found later in the results section of
this work, specifically in Figure 6.22. It is worth pointing out an important benchmark
here, however: a least-squares (optimal) linear regression can achieve 86.03% classification accuracy when trained on the full 60,000 digit training set and all pixels. Ideally,
the evtCD training algorithm would surpass this level of accuracy, given the nonlinear
transformations that the spiking RBM performs.
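For reference, that least-squares baseline amounts to fitting one-hot label targets with a linear map (plus bias) and taking the argmax at test time. Below is a hedged Python/numpy sketch on random stand-in data, not the thesis code; random data will of course not reproduce the 86.03% figure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for MNIST: n examples of d = 784 pixels, labels 0-9.
n, d, k = 500, 784, 10
X = rng.random((n, d))
y = rng.integers(0, k, size=n)

# One-hot targets; append a constant column so the fit includes a bias term.
T = np.eye(k)[y]
Xb = np.hstack([X, np.ones((n, 1))])

# Optimal linear map in the least-squares sense.
Wls, *_ = np.linalg.lstsq(Xb, T, rcond=None)

# Classify by the largest linear response.
pred = np.argmax(Xb @ Wls, axis=1)
train_acc = float(np.mean(pred == y))
```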
5.2 Time-stepped Training Methodology
For experimental and debugging reasons, the Matlab implementation was time-stepped.
The task of the time-stepped implementation is to provide a platform to easily study
the parameters of the system and find a method to optimize the overall operation of the
evtCD algorithm.
The time-stepped testing methodology consists of the following steps:
1. Load the MNIST handwritten digit database of 60,000 training digits and 10,000
test digits.
2. Establish parameters for an evtCD training simulation, and initialize the network
architecture for supervised training.
3. Draw spikes from each of the 60,000 digits in the training set, and pass these spikes
as samples from the data-visible layer.
4. Train the network according to the evtCD algorithm, and dispose of spikes emitted
from the model hidden layer.
5. At specific timepoints during the training, as shown in Section 6.1, save network
snapshots for offline analysis.
The frame-based MNIST database was transformed into spike trains as described in
the next section (Section 5.2.1). Every digit was presented for 1/10th of a second of
simulated time, with each “on” pixel emitting 10 spikes on average. The likelihood of a
spike emitted from a pixel was proportional to its intensity, as explained in the following
section (Section 5.2.1).
In addition, the labels were used as part of the input. This supervised training method
is described in Section 4.2 and shown in Figure 4.1. To learn the labels, the label layer
and the input layer were concatenated into a single layer, and the joint distribution of pixels and labels was learned. This allows an objective metric of training
performance: classification accuracy after 1 epoch on the MNIST handwritten digit
classification task. To determine the network’s choice for a presented digit, the output
layer neuron with the most spikes was chosen as the selected digit.
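That winner-take-all readout is simple enough to state directly. A Python sketch, assuming label-layer events arrive as (neuron_index, time) pairs (the function name is hypothetical):

```python
import numpy as np

def classify_by_spike_count(label_spikes, n_labels=10):
    """Return the index of the label-layer neuron that spiked most
    during one digit presentation."""
    counts = np.bincount([idx for idx, _ in label_spikes], minlength=n_labels)
    return int(np.argmax(counts))

# Example: neuron 7 fires three times, neuron 1 once -> digit 7 is chosen.
events = [(7, 0.012), (7, 0.031), (1, 0.040), (7, 0.088)]
```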
The architecture for the network is the one illustrated in Figure 6.1, with 784 neurons
in the first layer, corresponding to the 28*28 pixel images of MNIST [35], 100 neurons
in the hidden layer, and 10 neurons in the output layer corresponding to the 10 labels.
This manuscript also contains the full source for a Matlab implementation of time-
stepped training, which can be found in Appendix B.
This training begins by extracting spikes from MNIST images, a process detailed in the
next section.
Figure 5.2: Drawing an increasing number of spikes from the MNIST handwritten digit database. Shown here are 10, 50, 100, 500, and 5000 spikes drawn from a sample "four" digit using Algorithm 1, shown in Section 5.2.1.
5.2.1 Extracting Spikes From Still Images
This technique predates this thesis work, going back at least to [14], but the specification
has not been fully described in print before.
Given a sample image, the spike train should converge, in the limit as the number of spikes increases, to a rate encoding of the image. The absolute spike rate should be a fixed parameter, while the relative spike rate should emphasize bright pixels over dark ones. This can be accomplished by drawing each spike from the image with probability proportional to pixel intensity. The Matlab function randsample efficiently addresses this task, generating spike trains such as those seen in Figure 5.2.
This conversion from fixed image to spike train is used in this work whenever spike trains
are needed from frames.
The timing of each spike is randomly generated: since a given number of spikes are emitted over a given amount of time, a random time is assigned to each spike, and the rates average out correctly over the presentation period. Algorithm 1 lists the source of this routine, which efficiently generates spike trains from data vectors.
function [addr, times] = drawspikes(train_x, opts)

trials = size(train_x, 2);
addr = zeros(opts.numspikes, trials);
times = zeros(opts.numspikes, trials);

for trial = 1:trials
    % Assign addresses
    addr(:, trial) = randsample(numel(train_x(:, trial)), ...
        opts.numspikes, true, train_x(:, trial));
    % Assign times
    times(:, trial) = sort(rand(size( ...
        addr(:, trial)))).*opts.timespan;
end

Algorithm 1: Generating spike trains from data vectors.
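For readers without Matlab, the same procedure can be sketched in Python/numpy, with np.random.choice playing the role of randsample (an illustrative translation, not part of the thesis code):

```python
import numpy as np

def drawspikes(train_x, numspikes, timespan, rng=None):
    """Draw spike addresses with replacement, with probability proportional
    to pixel intensity, and assign each spike a sorted uniform-random time
    within the presentation window."""
    rng = rng or np.random.default_rng()
    n_pixels, trials = train_x.shape
    addr = np.zeros((numspikes, trials), dtype=int)
    times = np.zeros((numspikes, trials))
    for trial in range(trials):
        p = train_x[:, trial] / train_x[:, trial].sum()
        addr[:, trial] = rng.choice(n_pixels, size=numspikes, replace=True, p=p)
        times[:, trial] = np.sort(rng.random(numspikes)) * timespan
    return addr, times

img = np.zeros(784)
img[100] = 1.0  # a single "on" pixel receives all the probability mass
addr, times = drawspikes(img.reshape(-1, 1), numspikes=10, timespan=0.1)
```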
5.3 Online Training Methodology
Before discussing the online training methodology, it is necessary to describe the method
of generating the input to the online training. To test the real-time nature of this system,
it is necessary to use an image sensor that can produce a spiking output, such as the DVS
spiking vision sensor [24]. Since the DVS produces spikes only in response to temporal contrast changes, it does not spike in static scenes. Therefore, either the scene or the sensor must be moved in intelligent ways to produce spikes in response to static images. Since such movement is necessary, it makes sense to adopt a model used in biology to solve the same problem.
5.3.1 A Software Implementation of Fixational Eye Movements
The implementation described here was published by Engbert et al. [52] to model the
fixational movements of eyes and includes fixational microsaccades. This model does
not include the large saccadic eye movements which can be driven by top-down or
bottom-up attention, but is designed solely to emulate the movement of an eye focusing
on a particular point and the small movements it undergoes to prevent saturation of
photoreceptors. Though the implementation described here moves the image of the
world while keeping an eye fixed, rather than moving an eye while keeping the world
fixed, this model stimulates the DVS camera in a biologically realistic way.
The model is composed of three factors:
1. A self-avoiding random walk designed to mimic the small tremors of eye muscles;
2. An energy well designed to pull the focus of the eye back to the center;
3. Occasional, small-amplitude rapid movements of the eye known as microsaccades
to reach a new location.
The first component is the self-avoiding random walk. Informally, this is modeled as a walk across a surface, where the direction chosen is toward the lowest neighbor. After stepping onto a position, the energy level of that position rises, making it less desirable, and then slowly decays back to its starting level. When that spot is encountered later, it may have decayed back to being a desirable choice or may still be less desirable than its neighbors. This makes the random walk self-avoiding, which is a better model of biology because the muscles do not suddenly reverse direction and return to the exact spot from which they arrived [52].
Figure 5.3: Fixational eye movements. The background color represents the energy well that pulls the eye movements back to the center, the red circle indicates the current fixational point, and the black line connects 40 previously chosen points. This figure was generated by the implementation in Appendix C of the model introduced in [52].
Secondly, a model of fixational eye movements needs a factor to focus the eye towards the
center of the region of interest. In this case, there is an energy well that is quadratically
defined over the surface of interest, with its minimum at the center of the image. This can
be combined with the energy state defined by the self-avoiding random walk. Choosing the next position then amounts to finding the minimum of the sum of the energy well and the walk surface, and stepping to that position [52].
Thirdly, the eye makes small movements around the center of the image known as
microsaccades. In the model proposed in [52], a microsaccade is triggered when the local energy of the current position is too high: when the level exceeds a threshold, the focus jumps from the current position to the global energy minimum. However, to model the
effects of the muscles on the eye, a cost is added that encourages the movement to occur
predominantly along a major axis, either horizontal or vertical, rather than a biologically
unrealistic diagonal movement. Figure 5.3 shows a visualization of the movement output
as well as the energy landscape used to generate the movement.
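Putting the three factors together, the model can be sketched as follows in Python/numpy. This is written from the description above rather than from the Appendix C code: the grid size is illustrative, and the axis-bias cost on microsaccade direction (chi) is omitted for brevity.

```python
import numpy as np

L = 51                              # illustrative grid size
lam, sinkeps, hc = 1.0, 0.001, 7.9  # parameters as in Table 5.1
yy, xx = np.mgrid[0:L, 0:L]
center = (L - 1) / 2.0
# Quadratic potential well pulling fixation back toward the center.
potential = lam * ((xx - center) ** 2 + (yy - center) ** 2) / center ** 2
h = np.zeros((L, L))                # activation (self-avoidance) field
pos = np.array([L // 2, L // 2])

def step(pos, h):
    """Move to the lowest-energy 4-neighbor, bump the activation there,
    relax the field, and jump to the global minimum (a microsaccade)
    if the local activation exceeds the threshold hc."""
    nbrs = np.clip(pos + np.array([[1, 0], [-1, 0], [0, 1], [0, -1]]), 0, L - 1)
    energies = [h[r, c] + potential[r, c] for r, c in nbrs]
    pos = nbrs[int(np.argmin(energies))]
    h[pos[0], pos[1]] += 1.0        # visited spots become less desirable
    h *= (1.0 - sinkeps)            # slow relaxation of the field
    if h[pos[0], pos[1]] > hc:      # microsaccade trigger
        pos = np.array(np.unravel_index(np.argmin(h + potential), h.shape))
    return pos, h

path = [pos.copy()]
for _ in range(200):
    pos, h = step(pos, h)
    path.append(pos.copy())
```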
The full algorithm is listed in Appendix C. The default parameters were initialized following [52] and can be found in Table 5.1.
Parameter | Description | Value
lambda | Slope of the potential | 1
sinkeps | Relaxation rate | 0.001
chi | Vertical/horizontal constraint for stabilizing microsaccade direction | 2*lambda
hc | Critical value for triggering microsaccades | 7.9

Table 5.1: Parameters for the eye movement, adapted from [52].
This code generates offsets that shift an image on the screen in a biologically realistic way, producing the visual input that is displayed on the screen in front of the DVS.
5.3.2 Training Environment
The training environment for the jAER implementation consists of three components:
the screen to display the data, the DVS system to receive the spike-based representation,
and the computer running the jAER environment. The completed setup can be seen in
Figure 5.4. In this environment, the training consists of rapidly displaying images from
the Matlab environment and performing online learning within Java.
5.3.3 Java Reference Implementation
The Java reference implementation relies on several environmental factors as well as
constants to appropriately build receptive fields. The secondary visualizer displays digits
at around 26 digits per second to the observing DVS. These spikes form the input to the visible layer in the evtCD algorithm, triggering learning in the hidden neuron layer with the parameters shown in Table 5.2.
Figure 5.4: Photograph of the training environment.
Parameter | Description | Value
t win | Window width of evtCD learning | 0.005s
tau ref | Refractory period | 0.001s
inv decay | Inverse decay offset to slowly aid learning | 1e-5
eta | Learning rate | 1e-3
tau recon | Reconstruction time constant for visualizing reconstructions | 0.010s
tau | Membrane time constant | 0.100s
thresh eta | Threshold learning rate | 0

Table 5.2: Parameters for online training using the Java reference implementation.
Chapter 6
Quantification of evtCD Training
After the description of an implementation of the evtCD training rule and the training
methodology, this chapter evaluates the performance of the algorithm compared to other
training methods and standard training paradigms. In Section 6.1, an examination of
parameters is performed to assess the optimal parameter space of learning in the network.
In Section 6.2, the training algorithm is combined with a linear decoder to achieve 90.3%
accuracy on the MNIST handwritten digit identification task [35]. Finally, Section 6.3
demonstrates the real-time nature of learning, as the system was rapidly trained to form
receptive fields for identifying digits, achieving 86.7% accuracy after training on 2.5% of
the available data presented over 60 seconds.
6.1 Improving Training through Parameter Optimizations
A major aim of this work is to give an intuitive understanding of how neuron and learning
parameters can affect the evtCD training algorithm, as well as to propose insights for
future work. The default setup for training, called the baseline parameters, can be found
in Table 6.1 and the methodology for this training was described in Section 5.2.
The following alterations to the standard training paradigm will be investigated:
1. Learning rate: how large should a standard weight update be?
2. Number of input events: how do different quantities of input spikes affect learning,
and can the system degrade gracefully with less input?
3. Batch size: can training be parallelized for performance reasons without sacrificing
accuracy?
4. Noise temperature: what role do membrane-potential noise and stochasticity play in training?
5. Persistent Contrastive Divergence: does this powerful tool from regular CD also
aid the evtCD training algorithm?
6. Weight limiting: can limiting the range of weights result in better training?
7. Inverse decay: can a constant potentiation in weights help the training process?
These parameters and extensions will be examined in turn.
Figure 6.1: The architecture of the trained networks, to scale: input layer, hidden layer, and label layer. The first layer is 784 neurons (trained on 28*28 pixel digits), the hidden layer is 100 neurons, and the label layer is 10 neurons.
6.1.1 Baseline Training Demonstration
These parameter evaluations for the evtCD rule were compared against a trial run with a default, accurate, and reasonably fast set of parameters, referred to here as the baseline training parameters. This section introduces the typical behaviour of the network using these parameters.
As can be seen from Figure 6.2, the time evolution of the weights follows a very stereotyped pattern: after about 10,000 digit presentations, each randomly initialized receptive field begins converging toward a nearby local minimum [53]. This happens when a hidden neuron successfully finds a component of the input space
Parameter | Description | Value
temperatures | Variance of the noise | 0.01, 0.01, 0.01, 0.01
epochs | Number of times the entire data set is presented for training | 1
eta | Learning rate | 0.005
momentum | Momentum of weight updates | 0
decay | Decay of weights | 0
t win | STDP rule window width | 0.030s
t refrac | Refractory period of a neuron | 0.001s
tau | Membrane time constant | 0.050s
inv decay | Inverse decay offset to slowly aid learning | 0.001 * eta
batchsize | Number of parallel training samples | 10
thr | Threshold of a neuron | 1

Table 6.1: Parameters for the baseline Matlab-based implementation.
that helps to factorize the image into parts, as discussed in [13, 37]. For the handwritten digits examined here, the digits factorize into parts such as a vertical element or a curve, and over successive presentations the hidden neurons develop receptive fields that correspond to these factored elements.
The baseline parameters result in the rapidly increasing accuracy shown in Figure 6.3, eventually peaking at 79.13% classification accuracy. If training for longer than one epoch were desired, it would be beneficial to decrease the learning rate, as accuracy plateaus too early; reaching minimum error within a single epoch of training is a clear sign that the learning rate is too high [54].
Figure 6.2: The weights of four example hidden-layer neurons with increasing training examples: (a) after 10,000 input digits; (b) after 30,000 input digits; (c) after 60,000 input digits. The brightness encodes the weight value, and each neuron can be seen becoming tuned to a particular set of digit regions.
Figure 6.3: Baseline accuracy of the evtCD training algorithm for one epoch of 60,000 training digits (horizontal axis: digits presented, in thousands; vertical axis: accuracy [%]), eventually peaking at 81.46% classification accuracy. The overall record for this size network trained with evtCD is 81.5%, so this accuracy lies close to the peak accuracy achieved so far.
6.1.2 Learning Rate
The most basic parameter of training is the learning rate, eta. This parameter determines the size of a weight update when a hidden layer neuron spikes, and controls how
quickly the system changes its weights to approximate the input distribution. The learning rate that results in peak performance is smaller than for typical sigmoid networks, on the order of 10^-3, compared to traditional CD, which can train with an eta value of 1.
When using the default threshold value for evtCD training, a single weight update using
eta = 1 could cause the weight to exceed the threshold value of that neuron. In an ideal
case, the system would recover and raise its threshold to compensate, but it is possible
for the weights of the network to move the system into a nonspiking regime from which it
will never recover (unlike a sigmoidal network). For this reason, using a smaller learning
rate is preferable.
As can be seen in Figure 6.5, a learning rate that is too fast learns quickly but achieves its peak performance early and is then unable to improve, overshooting the learning target [37]. On the other hand, too small a learning rate requires more learning iterations to reach its saturation level.
Figure 6.4: Effect of learning rate on receptive field formation. The horizontal axis increases with the number of presented digits, in thousands; the vertical axis indicates the learning rate for weight updates (1e-05, 1e-04, 1e-03, baseline 5e-3, 1e-02, 1e-01). Note that slower learning rates take longer to develop receptive fields, but fast ones learn improper features and then stop learning.
Figure 6.5: Accuracy evolution with different learning rates (1e-05 to 1e-01; horizontal axis: digits presented, in thousands; vertical axis: accuracy [%]). The baseline parameter of 0.005 was chosen to balance the rapidity of the 0.01 learning rate with the more careful 0.001 learning rate.
6.1.3 Number of Input Events
The number of input events controls the amount of coincident information available for
training in the evtCD algorithm. By raising the number of input events, a neuron is
more likely to encounter joint activations of input and to develop a receptive field for
those regions of input. On the other hand, by lowering the number of input events, it
is possible that spikes will never overlap and a hidden neuron will never uncover joint
probabilities to encode. Because of this, it is important to establish the membrane
time constant tau and the STDP window t win in relation to the spike rate. In these
experiments, tau and t win were scaled proportionately to the baseline input rate of 10
spikes per pixel per digit presentation (which results in a spike rate, in a maximally
“on” pixel, of 100 Hz). No learning will occur if the spike rate drops low enough; in that
case, the exponential decay relaxes the neuron’s membrane potential to resting voltage
before another spike comes in, so it is necessary to lengthen these windows to allow a
fair comparison.
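The scaling used here can be written down explicitly. Below is a small Python helper, under the assumption that both windows scale inversely with the input rate relative to the 10-spikes-per-pixel baseline (the helper name and exact rule are illustrative, not taken from the thesis code):

```python
# Baseline: 10 spikes per "on" pixel per digit, tau = 50 ms, t_win = 30 ms.
BASE_RATE, BASE_TAU, BASE_TWIN = 10, 0.050, 0.030

def scaled_windows(spikes_per_pixel):
    """Lengthen tau and t_win when the input is sparser, shorten them
    when it is denser, keeping the windows matched to inter-spike intervals."""
    scale = BASE_RATE / spikes_per_pixel
    return BASE_TAU * scale, BASE_TWIN * scale

tau4, twin4 = scaled_windows(4)   # sparser input -> longer windows
```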
The spike rate shown on the vertical axis in Figure 6.6 and in the legend of Figure 6.7
is the expected number of spikes an “on” pixel can send over the presentation of a digit
(100 ms). Note that the accuracy reaches a peak around the chosen baseline parameter
value of 10 spikes per pixel per digit presentation, and falls off in accuracy with either
fewer or more input events. There appear to be two modes of peak performance in Figure 6.7, one peak around 100 Hz and a second at a much higher input rate of 250 Hz. Since the system is flexible in the number of input events, and the number of events dominates the training time, it is better to choose the lower mode of 100 Hz for these initial investigations.
Qualitatively, Figure 6.6 suggests that as the spike rates increase, the receptive field
specialization increases as well. The receptive fields appear to be more detailed as the
number of coincident input spikes increases, allowing more selectivity in the types of
inputs that drive them.
Figure 6.6: Effect of input rates on receptive field formation. The horizontal axis increases with the number of presented digits, in thousands; the vertical axis indicates the input rate in spikes per "on" pixel per digit (4, 8, baseline 10, 12, 15, 25). Note that with increasing input rates, the specificity of features appears to increase.
Figure 6.7: Accuracy evolution with different input rates (4 to 25 spikes per "on" pixel per digit; horizontal axis: digits presented, in thousands; vertical axis: accuracy [%]). The accuracy increases with input rate and reaches a peak at 100 Hz (10 spikes per 100 ms), with a second mode at the much higher rate of 250 Hz.
6.1.4 Batch Size
In the standard contrastive divergence model, taking a batch of data and processing it
in parallel is an important step for two reasons [13, 55]. First, parallel processing is often
more efficient on modern computers, and batch processing enables the learning algorithm
to capitalize on heavily optimized matrix operations [37, 54]. Secondly, it decreases the variance of a single learning sample, preventing the algorithm from erroneously avoiding areas of high variance. In [13], Hinton offers an analogy: when a thin sheet of metal is vibrated, sand particles (following gradient descent) scattered over its surface settle into the regions between the oscillating peaks, avoiding regions of high variance even though the time-averaged mean is zero everywhere. Averaging a large number of parallel training iterations therefore results in better learning of the true gradient [37, 54].
Batch training in this network is implemented as if there are batchsize parallel networks
updating a common weight structure once per ms. Since the implementation is time-
stepped, all the weight updates happen in parallel based on the activity across the past 1
millisecond timestep. Each parallel network adds a vote to the direction of the gradient,
and the average direction is taken with a normalized weight update. Their collective
update has the same learning rate as a single step from the baseline training example, but
incorporates more evidence about the correct gradient direction for learning. Because
of this, a weight update from a batch run can achieve equal accuracy with fewer weight
updates. Eventually, the training with parallel updates should provide a better estimate
than a single sample point, so its accuracy should exceed training using a single point
as described above [13].
The parallelization comes at effectively zero computational cost in the time-stepped
implementation for small levels of parallelization (the values shown here). The execution
time using batches of 10 is the same as using batches of 1, so a batchsize of 10 was
chosen as the optimal parameter. This additionally coincides with previous suggestions
of batch sizes equal to the number of classes in the data [54]. The receptive fields form in approximately the same number of digit presentations, and the accuracy suggests that each weight update is more reliable than in the batch-of-one case. Finally, the slow increase in accuracy for larger batch sizes appears promising for future investigations, as it could allow more accurate learning given more training time.
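The averaged update described above can be stated in one line. A Python sketch, under the assumption that each parallel network contributes one gradient "vote" per millisecond timestep:

```python
import numpy as np

def batched_update(W, grads, eta):
    """Shared weights move by the learning rate times the *average* vote,
    so the step size matches a single-network update while averaging
    evidence across all parallel networks."""
    return W + eta * np.mean(grads, axis=0)

W = np.zeros((4, 3))
grads = np.stack([np.ones((4, 3)), -np.ones((4, 3))])  # two opposing votes
W_new = batched_update(W, grads, eta=0.005)            # votes cancel out
```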
Figure 6.8: Effect of batch size on receptive field formation. The horizontal axis increases with the number of presented digits, in thousands; the vertical axis indicates the number of parallel-training network batches used to calculate the learning gradient (2, baseline 10, 50, 100). Batches of size 10 and size 1 develop receptive fields equally fast, and the performance advantages of larger batch sizes made it preferable as a choice for a baseline parameter.
Figure 6.9: Accuracy evolution with increasingly parallel estimates of the gradient (batch sizes 2, baseline 10, 50, 100; horizontal axis: digits presented, in thousands; vertical axis: accuracy [%]). Accuracy is slightly improved at moderate batch sizes, and learning occurs much more slowly on a per-digit basis for larger batches.
6.1.5 Noise Temperature
In the evtCD training algorithm, noise can have beneficial effects as well as the expected
detrimental ones. This occurs for two reasons: first, noise helps to regularize the weights
[37], and secondly, noise helps to ensure that the neurons always fire. As has been
mentioned before, the evtCD learning rule only functions when neurons emit spikes,
and the noise term helps to cause neurons to spike.
Moreover, in the evtCD training algorithm, samples are obtained by propagating the
activations of one layer to the next, and there can be significant losses in activations.
Without a term like a bias to encourage more spiking, the activation can decay away;
this is explored more fully in Section 6.1.6.
The noise term was added in two different ways: the first was to take a Gaussian with
mean zero and variance as indicated by the vertical axis in Figure 6.10, then perturb
the membrane potential of each neuron by this noise amount once per timestep (one
millisecond). Note that the threshold was fixed at 1, so the variance on these plots can
actually result in many erroneous spikes.
The second method of adding noise is more biologically plausible: the Ornstein-Uhlenbeck process [34, 56], a low-pass filtered Gaussian with a time constant of 25 ms [34]. This prevents the noise from rapidly fluctuating the membrane potential, instead providing a random offset that moves much more slowly.
Interestingly, it appears that purely Gaussian noise helps the neurons to learn more
quickly than in the absence of noise, due to their increased activity. The low-pass filtered
noise results in more precisely-defined receptive fields, which reflects the observations
found in Section 6.1.3.
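The two noise generators can be contrasted in a few lines of Python/numpy. This is a sketch of the general technique, not the thesis code; the discretization and the sigma scaling are assumptions.

```python
import numpy as np

def ou_noise(n_steps, dt=0.001, tau=0.025, sigma=0.01, rng=None):
    """Ornstein-Uhlenbeck (low-pass filtered Gaussian) noise: each step
    decays toward zero with time constant tau and adds a fresh Gaussian
    innovation, so the offset drifts slowly rather than fluctuating
    independently every millisecond."""
    rng = rng or np.random.default_rng(0)
    x = np.zeros(n_steps)
    alpha = dt / tau
    for t in range(1, n_steps):
        x[t] = x[t - 1] * (1.0 - alpha) + np.sqrt(sigma * alpha) * rng.normal()
    return x

white = np.random.default_rng(0).normal(0.0, 0.1, 1000)  # unfiltered Gaussian
smooth = ou_noise(1000)
# `smooth` changes far less between consecutive timesteps than `white`.
```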
Figure 6.10: Comparison of receptive fields in neurons trained with evtCD under various noise variances (0.000, baseline 0.01, 0.050, 0.200, 0.400, 0.600): (a) neurons trained with Gaussian noise; (b) neurons trained according to the Ornstein-Uhlenbeck process [34, 56]. Note that the introduction of a small amount of noise helps accelerate learning, and the system is able to develop features that ignore the noise.
Figure 6.11: Comparison of accuracy in neurons trained with evtCD under various noise variances (0.000, baseline 0.01, 0.050, 0.200, 0.400, 0.600; horizontal axis: digits presented, in thousands; vertical axis: accuracy [%]): (a) Gaussian noise; (b) the Ornstein-Uhlenbeck process [34, 56]. Surprisingly, the system is very stable in the presence of noise, and accuracy remains largely unaffected until quite significant noise is introduced.
6.1.6 Persistent Contrastive Divergence
Persistent contrastive divergence, originally introduced in [41] and shown in Figure 2.3, creates a persistent Markov chain that is driven entirely separately from the input. The model-distribution samples are generated separately from the data distribution, and each data point moves the Markov chain closer to the equilibrium distribution. This ignores the fact that the model, specified by the system weights, changes slightly with each weight update, so the equilibrium distribution does not remain fixed; however, given a small enough learning rate, the system gathers samples closer to the equilibrium distribution than the normal CD-1 algorithm can.
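For comparison with the spiking case discussed next, the mechanics of PCD in a standard binary RBM can be sketched in numpy: the persistent chain state v_model is advanced by one Gibbs step per update and never sees the data. This is a minimal sketch without bias terms, not the thesis implementation.

```python
import numpy as np

rng = np.random.default_rng(3)
n_vis, n_hid, batch = 20, 8, 10
W = rng.normal(0.0, 0.01, (n_vis, n_hid))
eta = 0.01

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Persistent chain state: initialized randomly, never reset to the data.
v_model = (rng.random((batch, n_vis)) < 0.5).astype(float)

def pcd_update(v_data, W, v_model):
    # Positive phase: hidden probabilities driven by the data minibatch.
    h_data = sigmoid(v_data @ W)
    # Negative phase: one Gibbs step on the persistent chain (no data input).
    h_model = (rng.random((batch, n_hid)) < sigmoid(v_model @ W)).astype(float)
    v_model = (rng.random((batch, n_vis)) < sigmoid(h_model @ W.T)).astype(float)
    h_model_p = sigmoid(v_model @ W)
    # Contrastive update: data correlations minus model correlations.
    W = W + eta * (v_data.T @ h_data - v_model.T @ h_model_p) / batch
    return W, v_model

v_data = (rng.random((batch, n_vis)) < 0.5).astype(float)
W, v_model = pcd_update(v_data, W, v_model)
```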
However, when training spiking neural networks with evtCD, the model distribution is
now a recurrent visible-hidden network. The only mixing with the input data comes
from the weight matrix that is shared between the data and model distribution. Since
the model distribution sampling process is run independently with no external input,
its activity is driven almost entirely by noise and is responsible for setting up a persistent recurrent network that maintains activity and produces digit-like samples. A true demonstration of the power of this balanced sampling approach is that the weights of the system coerce random membrane-potential noise into persistent activation that corresponds to a real digit.
This process is demonstrated in Figure 6.14, which visualizes digit reconstructions arising from the model layers sampling under persistent contrastive divergence. These confabulations are clearly distinct from real digits, but considering that the network is small (100 hidden neurons) and trained for a single epoch, it does a remarkable job of creating digit-like patterns. There are clear digit parts, tending to be centrally located and continuous, and many of these are feasible approximations of digits.
Overall, however, the faster training reported for persistent contrastive divergence in
standard CD does not seem to carry over to evtCD. At all time points, the baseline
accuracy outperforms the PCD-trained network, as shown in Figure 6.13. The weights
can be qualitatively assessed in Figure 6.12.
Chapter 6. Quantification of evtCD Training 51

Figure 6.12: Effect of persistent contrastive divergence on receptive field formation. The
horizontal axis increases with the number of presented digits, in thousands, and the
vertical axis indicates the presence or absence of persistent contrastive divergence during
learning. Features appear to require more time to emerge with PCD, given the lack of
direct input to the model layers.
Figure 6.13: Accuracy evolution with and without persistent contrastive divergence.
Baseline parameters consistently outperform PCD.
Figure 6.14: Demonstration of 9 digit reconstructions from activation on the visible
model layer. Since the network samples freely, the units are not driven by external
input but are instead sampled from noise and guided by the energy function specified by
the network weights.
6.1.7 Bounded Weights

One major difference between the networks modeled in evtCD and true biological
networks is the large value range and high precision available to digital simulations.
The default double-precision implementation allows membrane voltages in excess of 1000
volts and weight updates smaller than 10^-8. In this section, the possibility of capping
weights is examined to determine whether all that range is necessary and whether losing
precision might actually improve performance. Figures 6.15 and 6.16 demonstrate the
effects of capping the weights at 0.25. Largely this has no negative effect, though it
qualitatively alters the features that are selected.
Weight capping also affects the initial distribution of weights. It has been suggested that
weight initialization plays a very important role in learning, and that properly
initializing the weights can save significant computational effort and drastically affect
the eventual accuracy [53, 54]. By initializing the weights closer to the extrema, training
decreases weights to yield features rather than sharpening weights that are already
present.
Interestingly, depriving the weights of much of their precision has little effect on the
overall system. This could be a fruitful avenue for future exploration, as low-precision
weights are necessary for some platform implementations [18, 21], and the full
implications of different initialization regimes should be evaluated for possible
performance improvements.
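The capping operation itself is a simple clamp after each update; a minimal sketch, mirroring the wt_lim bound used in the Matlab implementation of Appendix B (class and method names here are illustrative):

```java
public class WeightCap {
    /** Clamp every weight into [-cap, cap] after a weight update, as in
     *  the wt_lim bound of the Matlab implementation (Appendix B). */
    public static void capWeights(double[][] w, double cap) {
        for (double[] row : w)
            for (int j = 0; j < row.length; j++)
                row[j] = Math.max(-cap, Math.min(cap, row[j]));
    }
}
```

Applied after every update with cap = 0.25, this reproduces the bounded-weight condition examined in Figures 6.15 and 6.16.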
Figure 6.15: Effect of bounding weight magnitude on receptive field formation. The
horizontal axis increases with the number of presented digits, in thousands, and the
vertical axis indicates whether weights were bounded to [-0.25, 0.25].
Figure 6.16: Accuracy evolution without and with weight bounding. Bounding weights
may offer some tangible improvement over the baseline accuracy, or is at the very least
not detrimental.

6.1.8 Inverse Decay

Finally, in an evtCD-specific optimization, a slow potentiation of all weights was added
as a possible extension. Since these networks learn only when neurons spike, a constant
positive learning offset can improve learning by steadily forcing all neurons to spike at
least rarely. At every timestep, every weight in the weight matrix is increased by a
constant positive offset (here set to 0.01 * eta), which tends to cause downstream
neurons to spike. If a spike turns out to be spurious, the weight penalty punishes it and
strongly depotentiates the weight by a value of -eta, but if the spike was correct it is
reinforced. In either case, the features become more selective. This tends to speed up
learning by causing more initial activity and forcing neurons to adopt receptive fields
rather than remain generic in their selectivity, as can be seen in Figure 6.17.
Like momentum, this parameter should be applied early in training to encourage
appropriate initialization, then decreased over time. Consistent weight inflation proves
detrimental to the overall accuracy of the system once the weights are close to their
equilibrium values, since it causes undesirable shifts in the energy distribution. Because
the constant positive increase tends to cause added spiking, it has a tendency to shift
the receptive fields over time to respond to novel stimuli instead of approaching an
equilibrium.
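As a sketch, the per-timestep inverse-decay update amounts to a constant additive offset on every weight (compare the line W = W + opts.inv_decay in the Matlab implementation of Appendix B; the class and parameter names here are illustrative):

```java
public class InverseDecay {
    /** One timestep of the inverse-decay extension: every weight grows by
     *  a small constant fraction of the learning rate eta (baseline
     *  0.001 * eta per 1 ms timestep), nudging downstream neurons toward
     *  spiking.  A spurious spike is later depotentiated at full -eta by
     *  the model-layer update, so features still become selective. */
    public static void applyInverseDecay(double[][] w, double eta, double ratio) {
        double offset = ratio * eta; // e.g. ratio = 0.001 for the baseline
        for (double[] row : w)
            for (int j = 0; j < row.length; j++)
                row[j] += offset;
    }
}
```

Decaying ratio toward zero over training would realize the schedule suggested above, where inverse decay helps early but hurts near equilibrium.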
Chapter 6. Quantification of evtCD Training 55
Invers
e D
ecay P
er
ms
Digits Presented (thousands)
1 10 20 30 40 50 60
0
Baseline (0.001)
0.02
Figure 6.17: Effect of inverse decay on receptive field formation. Horizontal axis isincreasing with the number of presented digits, in thousands, and vertical axis indicatesthe ratio of the constant increase in weight to eta. That is, the baseline increase pertimestep (1 ms) is 1/1000 * eta. Note that the receptive fields without inverse decay
are largely undifferentiated due to lack of spiking, so a small amount is beneficial.
Figure 6.18: Accuracy evolution with and without inverse decay. Though inverse decay
helps the network develop receptive fields, too much decreases the eventual accuracy
of the system and lengthens training time.
6.2 Training as a Feature Extractor
Figure 6.19: Architectures for the evtCD-trained network (input, hidden, and label
layers, with both weight matrices trained by evtCD), the linear regression network
(input to label, with weights trained by an optimal linear decoder), and the combination
network (input to hidden trained by evtCD, hidden to label trained by an optimal
linear decoder).
Besides supervised training, evtCD can also be used to train networks to extract features
in a purely unsupervised way. Unsupervised learning typically examines the input,
extracts joint correlations, and clusters the data. This process can be used to learn
receptive fields that reduce the dimensionality of the data (for example, as in [2, 57])
while preserving relevant information. Moreover, if desired, the output of the reduced
layer can then be trained in a more traditional approach using another classification
technique [13].
To begin, evtCD was used to train the spiking RBM in a purely unsupervised way to
establish relevant receptive fields for the data. Then, the activations of the network in
response to the MNIST training set (60,000 digits) were recorded. This process yields
a new training set of reduced dimensionality, of size trials by hidden-layer size, on which
a linear regressor was trained. The architecture can be seen in Figure 6.19.
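The decoder stage of this pipeline can be sketched as a one-vs-all linear readout trained on the recorded hidden-layer activations. The thesis used an optimal linear decoder; the gradient-descent least-squares version below is a simplified stand-in with hypothetical names:

```java
/** Least-squares linear readout trained by gradient descent on recorded
 *  hidden-layer activations; a sketch of the decoder stage of Figure 6.19,
 *  not the exact optimal linear decoder used in the thesis. */
public class LinearReadout {
    public final double[][] w; // [numClasses][numHidden]

    public LinearReadout(int numClasses, int numHidden) {
        w = new double[numClasses][numHidden];
    }

    /** Linear scores for one recorded activation vector. */
    public double[] scores(double[] features) {
        double[] s = new double[w.length];
        for (int c = 0; c < w.length; c++)
            for (int i = 0; i < features.length; i++)
                s[c] += w[c][i] * features[i];
        return s;
    }

    /** One SGD step on the squared error against a one-hot label. */
    public void train(double[] features, int label, double eta) {
        double[] s = scores(features);
        for (int c = 0; c < w.length; c++) {
            double err = (c == label ? 1.0 : 0.0) - s[c];
            for (int i = 0; i < features.length; i++)
                w[c][i] += eta * err * features[i];
        }
    }

    /** Predicted class: the arg-max of the linear scores. */
    public int predict(double[] features) {
        double[] s = scores(features);
        int best = 0;
        for (int c = 1; c < s.length; c++) if (s[c] > s[best]) best = c;
        return best;
    }
}
```

In the combination network, features would be the spike counts of the 225-unit hidden layer over one digit presentation, so the readout operates on the reduced representation rather than the 784 raw pixels.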
The combination network composed of the evtCD network and the linear regression
network recorded the highest performance of the architectures tried here, achieving
90.03% accuracy. It is a powerful result that the evtCD unsupervised learning method
can reduce the dimensionality of the data from 784 pixels (28*28) to 225 and still achieve
a better score than a linear regression alone.
The confusion matrices in Figure 6.20 indicate which digits challenge the learning
algorithm. The evtCD algorithm, when used for supervised training, had the most
difficulty with the digit 5: the tested network often selected “8”, “3”, and “6” as
alternative candidates when presented with a “5”. The linear regression generally
confused the same digits, but made fewer mistakes overall. On the other hand, a few
mistakes appear in the linear classification result but not in the evtCD result; for
example, the linear classifier had more difficulty identifying “5”s that were actually “8”s,
Chapter 6. Quantification of evtCD Training 57
Co
rre
ct
Dig
it
Chosen Digit
0 1 2 3 4 5 6 7 8 9
0
1
2
3
4
5
6
7
8
90
10
20
30
40
50
60
70
80
90
100
(a) Confusion matrix of an RBM trained withthe evtCD learning algorithm.
Co
rre
ct
Dig
it
Chosen Digit
0 1 2 3 4 5 6 7 8 9
0
1
2
3
4
5
6
7
8
90
10
20
30
40
50
60
70
80
90
100
(b) Confusion matrix of MNIST digits usinglinear classification on the pixels.
Co
rre
ct
Dig
it
Chosen Digit
0 1 2 3 4 5 6 7 8 9
0
1
2
3
4
5
6
7
8
90
10
20
30
40
50
60
70
80
90
100
(c) Confusion matrix of combination learning,using linear regression on the output of an RBM
trained with evtCD.
Figure 6.20: Confusion matrices of classification using evtCD learning, optimal linearregression, and combination evtCD and linear regression. Across the vertical axis isthe correct digit, and the horizontal axis is the digit chosen. Color indicates accuracy,in percent, of guessing the chosen digit. Note the common difficulty of distinguishing
“4”s from “9”s and “5”s from “3”s, for example.
and “1”s that were actually “4”s. Qualitatively, it appears that the confusion matrix of
the combination network is the intersection of the mistakes of each network individually.
Additionally, an advantage of the evtCD-trained networks after training is a level of
interpretability in the weights, unlike for a pure linear classifier. In Figure 6.21, the
final receptive fields of the digits “0” through “9” are shown, with “0” on the upper
left and “4” on the upper right. For the linear network, the values shown here are
the linear relationship of that index to the input pixels, and can be thought of as
receptive fields. These features are generally not intuitive, though a “1”-like receptive
field can be made out for the “1” digit, and a dim representation of a “6” appears for
the “6” digit (Figure 6.21a). However, by linearly weighting the receptive fields of the
evtCD-trained network, much more intuitive features appear: though noisy, all of the
weighted receptive fields in Figure 6.21b suggest the form of the digit they are supposed
to represent.
Finally, a comparison of the performance of these techniques appears in Figure 6.22.
Though the learning algorithm has much to improve before achieving state-of-the-art
accuracy, it nonetheless surpasses the optimal linear methods and achieves impressive
accuracy for a single training epoch executing on spiking neurons.
(a) Linear decoder weights.
(b) STDP-trained linear classifier weights.
Figure 6.21: Examination of the weights of a linear classifier built on top of the
dimensionality-reducing STDP-trained system. A. The purely linear classifier weights
do not build particularly intuitive representations of their sensitivities (with the
exception of “1”). B. The combination network weights the receptive fields of the RBM
to produce much more representative versions of their digits, though noisier and blurred.
Figure 6.22: Accuracy of learning on the MNIST dataset [2, 35, 46], comparing evtCD,
linear regression, the combined linear and evtCD method, the Siegert approach, and the
state of the art. The combined linear and evtCD method presented in Section 6.2
achieves 90.3% accuracy. As a supervised learning algorithm, evtCD peaks at 81.46%
identification accuracy.
6.3 Online Training with Spike-Based Sensors

To demonstrate the rapidity with which these networks can be trained, evtCD was used
to quickly train a network online with the spiking DVS image sensor [58]. The input rate
to the network is limited by the refresh rate of the display used to train the network;
in this case, 30 FPS was the maximum digit presentation speed at a 60 Hz refresh rate
with a blank frame between each digit.

During 58 seconds, 1500 digits were presented to the spiking DVS system, which
produced events used as inputs to the evtCD algorithm. These digits comprise 2.5% of
the typical MNIST training set. The algorithm trained a 14*14 = 196 neuron hidden
layer in a purely unsupervised way, developing receptive fields that correspond to the
digit inputs. The weights for this hidden layer can be found in Figure 6.24; though
clearly less ordered than the full-epoch training examples shown in Section 6.1, the
receptive fields display the qualitative features expected of a system trained on
handwritten digits.

After the training examples are presented, the network weights are saved and a linear
classifier is trained on the spiking output of the network in response to the digits, as in
Section 6.2. The final performance of this system again exceeds that of a pure linear
classifier operating on the full MNIST training set, achieving an 86.7% classification
accuracy. This is a promising result after processing such a small percentage of the
training data.
Figure 6.23: Screenshot of the Java-based implementation. Shown at the left is the
weight matrix, with currently updating neurons framed in blue. The original input to
the system can be seen in red in the upper right, and next to it is the live reconstruction
of that digit, performed by the model layer, shown in blue.
Figure 6.24: Weights of the network learned by evtCD from 1500 digits presented over
58 seconds. Qualitatively, these features correspond to those seen in the earlier
Section 6.2, factorizing the input into digit parts.
Chapter 7
Conclusions and Future Work
In this final chapter, Section 7.1 reviews and assesses the main aims of the thesis. Section
7.2 discusses possible directions for future work to continue this research.
7.1 Conclusions
In this thesis, the online evtCD algorithm was introduced through two implementations,
one for simulation (Appendix A) and one for live data (Appendix B), which were used
to train spiking neural networks on the standard MNIST handwritten digit dataset [35].
The impact of the parameters of the algorithm was assessed (Section 6.1) to give the
reader an intuitive understanding of how to employ evtCD and to suggest an optimal
starting point for future work. To compare a spike-based implementation of an RBM to
a standard RBM, this thesis also presents a novel method, modeled after biology [52],
to generate spike-based representations from static images that can be used with any
event-based network (Section 5.3.1).
To assess the success of the aims of this endeavour, we examine the original question
posed in the introduction:
Can RBMs composed of spiking neurons be trained online? Three subgoals
were proposed to evaluate this question. The first of these was to:
1. Derive a rule for online learning of an RBM composed of spiking neural networks;
In Section 3.2, I introduced the evtCD algorithm, which is a novel contribution as the
first online learning method for an RBM composed of spiking neurons. The algorithm
uses four spiking neuron populations to represent the four different samples required
for the standard contrastive divergence algorithm (data visible, data hidden, model
visible, and model hidden as shown in Figure 3.3), which encode a state through their
spiking behaviour. Using these encoded states as samples, correlations between the
spiking behaviours in the data layers strengthen a shared weight matrix, and correlations
between the model layers weaken that shared weight matrix. Changes in the weight
matrix reach an equilibrium when the data correlations match the model correlations,
canceling out the update and implying the model distribution has correctly learned the
data distribution.
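This equilibrium condition can be sketched in standard CD notation (angle brackets here denote spike correlations gathered over the STDP window):

```latex
\Delta W \;\propto\; \langle v_{\text{data}}\, h_{\text{data}}^{\top} \rangle
\;-\; \langle v_{\text{model}}\, h_{\text{model}}^{\top} \rangle ,
\qquad \Delta W \to 0 \;\text{ when the data and model correlations match.}
```

The data-layer term is the potentiating (positive) phase and the model-layer term the depotentiating (negative) phase described above.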
The idea of using a windowing function and persistent populations to represent the
instantaneous samples needed for contrastive divergence can be extended to any RBM
composed of time-persistent elements, and the weight update rule presented here can be
altered depending on the dynamics of the elements.
The next goal was to:
2. Design an event-driven, asynchronous implementation of this rule to achieve high
performance in scalable systems;
The evtCD algorithm described in Chapter 3 adjusts the weights in the system only in
response to spike events from the hidden layers, and can be implemented in a purely
event-driven, asynchronous fashion, as shown in Appendix A. To demonstrate, the evtCD
training algorithm was paired with a real-time spiking image sensor [24]. This
implementation brings together an event-based vision sensor with an event-based
training algorithm to yield an entirely event-driven, asynchronously updating training
paradigm.
Real-time event-driven training was demonstrated in Section 6.3 with the unsupervised
training of a spiking RBM. After presentation of 1500 handwritten digits during 60
seconds, the network developed receptive fields corresponding to the presented digits and
achieved an 86.7% classification accuracy when using a linear classifier on the extracted
features.
The final of the three goals was to:
3. Demonstrate this training rule’s effectiveness on a common benchmark task.
Even without the additional computational power of a linear decoder, Section 6.1
demonstrated how a single training epoch and a hidden layer of only 100 neurons
could achieve 81.5% classification accuracy on the MNIST handwritten digit
classification task [35], without employing any of the transformations or distortions that
commonly provide additional training samples [14, 37]. When paired with a linear
decoder, the network recorded 90.4% classification accuracy on handwritten digits,
exceeding the accuracy of an optimal linear system.
In a much broader context, evtCD now allows training of RBMs on platforms designed
for simulations of neurons and STDP [18, 19], though the accuracy is not yet high
enough for evtCD to be a feasible replacement for standard contrastive divergence
implementations in general. In cases where online learning is important, as in robotics,
evtCD still holds an efficiency advantage. Moreover, the results shown in Chapter 6
indicate that this method is promising, and many optimizations remain to close the gap
between evtCD and the state of the art.
In terms of biological modeling, the evtCD algorithm currently requires a biologically
unrealistic sharing of weights between different neurons, so that neurons of the data
distribution population can potentiate the weights while neurons of the model
distribution population depotentiate them. However, this requirement may be relaxed
in the future: with population coding of the states, the net connection strength between
one population and another must match, rather than individual weights, and methods
of duplicating or adjusting the connection strength between two populations can be
examined. Moreover, the similarity between long-chain Gibbs sampling and network
recurrence shown in Section 6.1.6 suggests that examining sampling methods as
biological models may prove fruitful, as suggested in [59].
In summary, it was argued that event-driven algorithms for online training of spiking
Restricted Boltzmann machines would be a valuable contribution indeed, as such an
algorithm could benefit from the state-of-the-art in machine learning [4–10] as well as
neuromorphic engineering [24, 27, 30, 31]. The first such algorithm is evtCD.
7.2 Future Work
There remains a significant amount of future work to be done with these networks, as
they are still quite new and many open questions remain. We list several areas in which
improvements can be made in the online learning of RBMs composed of spiking neurons.
First, the experiments in Section 6.1 involving extensions and parameters have only
begun to examine the ideas that have been implemented as optimizations for contrastive
divergence since its invention. Fast weights [60] are a way of rapidly exploring the
parameter space, and could prove effective, especially after being successfully
demonstrated with rate-based networks in [14]. Sparsity and selectivity, two constraints
that are biological in origin and can improve performance in deep networks [37, 42], may
also prove fruitful for spiking RBMs. The initialization conditions and the number of
neurons in the hidden layer of the network also play a strong role in the overall accuracy
of classification systems trained with CD [2, 13, 37, 53], but remain an open research
topic for the evtCD algorithm. As pointed out in [1, 14], a larger network size increases
the representational capacity of a neural network: by adding hidden nodes, the network
is able to learn more discriminative features and will likely improve performance.
Second, standard RBMs have been trained to represent continuous values. Accomplished
by encoding the parameters of Gaussians, this extension is very important for real-world
and computer vision tasks; for example, pixel intensity encoding is very important in
visual identification tasks [2, 37, 39]. Using spike rates to encode continuous-valued
inputs in spiking RBMs has not yet been investigated, but forms an important area for
future research.
Third, one of the fundamental advantages of RBMs is the possibility of stacking them
into deep architectures that can dramatically reduce error rates [1–3, 37]. Now that a
method for training spiking RBMs online has been introduced, it is possible to investigate
whether the evtCD learning rule, too, can be implemented to yield deep networks made
of spiking RBMs. Though the offline training algorithm for deep networks is a greedy
layer-wise training paradigm [2, 37, 39], greedy online training with early stopping could
preserve the properties of layer-wise training that work so well offline.
Finally, one of the powerful advantages of moving to a time-based representation is that
the RBM now exists in the time domain, and could have the ability to learn about
the passage of time. Sequence learning with the evtCD algorithm has not yet been
investigated, but could prove an intriguing direction if weights in the system could
encode the sequential firing patterns of neurons over time as in previous investigations
[43, 61, 62]. Specifically, STDP has been demonstrated to exploit firing information
to learn patterns [63, 64], suggesting that evtCD-trained networks may be similarly
capable.
Appendix A
Java Implementation
Printed here is the reference implementation of the evtCD learning rule in Java, current
as of this publication.
public void processSpike(SpikeTriplet spike_triplet){
    // Check for real-world errors
    if(spike_triplet.time < sys_time)
        resetTimes();

    sys_time = spike_triplet.time;
    int layer = spike_triplet.layer + 1;

    // Reconstruct the imaginary first-layer action that resulted in this spike
    if(layer == 1){
        last_spiked[0].put(spike_triplet.address, spike_triplet.time - axon_delay);
        thr[0].put(spike_triplet.address, (thr[0].get(spike_triplet.address) < min_thr) ?
            min_thr : thr[0].get(spike_triplet.address) - eta * thresh_eta);
        spike_count[0]++;
        if(calc_recons[0]){
            recon[0].muli(Math.exp( - (spike_triplet.time-last_recon[0])/recon_tau));
            recon[0].put(spike_triplet.address, recon[0].get(spike_triplet.address) + recon_imp);
            last_recon[0] = spike_triplet.time - axon_delay;
        }
    }

    // Update neurons
    // Decay membrane
    membranes[layer].muli(Math.exp(-(spike_triplet.time - last_update[layer]) / tau));

    // Add impulse
    if(layer == 0){
        membranes[layer].put(spike_triplet.address,
            membranes[layer].get(spike_triplet.address) + inp_scale);
    }
    else if (layer % 2 == 0){
        membranes[layer].addi(weights.getColumn(spike_triplet.address).muli(
            refrac_end[layer].lt(spike_triplet.time)));
    }
    else{
        membranes[layer].addi(weights.getRow(spike_triplet.address).muli(
            refrac_end[layer].lt(spike_triplet.time)));
    }

    // Add noise
    addNoise(membranes[layer], layer);

    // Update last_update
    last_update[layer] = spike_triplet.time;

    // Add firings to queue
    int [] newspikes = membranes[layer].gt(thr[layer % 2]).findIndices();
    for(int n=0; n<newspikes.length; n++){
        // Update counts
        spike_count[layer]++;

        // Update refrac end
        refrac_end[layer].put(newspikes[n], spike_triplet.time + t_refrac);

        // Reset firings
        membranes[layer].put(newspikes[n], 0);

        // Record time for STDP
        last_spiked[layer].put(newspikes[n], spike_triplet.time);

        // STDP Threshold Adjustment
        double thr_direction = (layer < 2) ? -1.0 : 1.0;
        double wt_direction = (layer < 2) ? 1.0 : -1.0;
        thr[layer % 2].put(newspikes[n], thr[layer % 2].get(newspikes[n]) +
            thr_direction * eta * thresh_eta);
        thr[layer % 2].put(thr[layer % 2].lt(min_thr).findIndices(), min_thr);

        // STDP Weight Adjustment
        if (layer % 2 == 1) {
            weights.putColumn(newspikes[n],
                weights.getColumn(newspikes[n]).addi(
                    last_spiked[layer-1].gt(spike_triplet.time - stdp_lag).
                    muli(wt_direction * eta)));
        }

        // Reconstruct the layer if desired
        if(calc_recons[layer]){
            recon[layer].muli(Math.exp( - (spike_triplet.time-last_recon[layer])/recon_tau));
            recon[layer].put(newspikes[n], recon[layer].get(newspikes[n]) + recon_imp);
            last_recon[layer] = spike_triplet.time;
        }

        // Add spikes to the queue if not in the end layer
        if (layer != 3) {
            pq.add(new SpikeTriplet(spike_triplet.time + 2*axon_delay*rng.nextFloat(),
                layer,
                newspikes[n]));
        }
    }
}
Algorithm 2.
Appendix B
Matlab Implementation
Printed here for reference is the time-stepped Matlab implementation, current as of this
publication.
%% Training
numtrains = size(train_x, 1);
numbatches = numtrains / opts.batchsize;
train_inp = [train_x train_y];
start_time = tic;
for e = 1 : opts.numepochs
    % Choose random order
    kk = randperm(numtrains);

    % Reset membrane potentials
    membranes = cell(1,4);
    membranes{2} = zeros(opts.batchsize, dims(2));
    membranes{3} = zeros(opts.batchsize, dims(1));
    membranes{4} = zeros(opts.batchsize, dims(2));

    % Reset refractory period ends
    refrac_end = cell(1,4);
    refrac_end{2} = -inf(opts.batchsize, dims(2));
    refrac_end{3} = -inf(opts.batchsize, dims(1));
    refrac_end{4} = -inf(opts.batchsize, dims(2));

    % Reset timecounter
    last_active{1} = -inf(opts.batchsize, dims(1));
    last_active{2} = -inf(opts.batchsize, dims(2));
    last_active{3} = -inf(opts.batchsize, dims(1));
    last_active{4} = -inf(opts.batchsize, dims(2));

    % Reset firings
    firings = cell(1,4);
    firings{1} = zeros(opts.batchsize, dims(1));
    firings{2} = zeros(opts.batchsize, dims(2));
    firings{3} = zeros(opts.batchsize, dims(1));
    firings{4} = zeros(opts.batchsize, dims(2));

    % Reset noise
    noise = cell(1,4);
    noise{1} = zeros(opts.batchsize, dims(1));
    noise{2} = zeros(opts.batchsize, dims(2));
    noise{3} = zeros(opts.batchsize, dims(1));
    noise{4} = zeros(opts.batchsize, dims(2));

    % Reset time
    t_curr = 0;

    % Loop through all batches
    for b = 1:numbatches
        % Get batch
        batch = train_inp(kk((b - 1) * opts.batchsize + 1 : b * opts.batchsize), :);

        % Clear out the slowpass
        slowpass{1} = zeros(opts.batchsize, dims(1));
        slowpass{2} = zeros(opts.batchsize, dims(2));
        slowpass{3} = zeros(opts.batchsize, dims(1));
        slowpass{4} = zeros(opts.batchsize, dims(2));

        % Go through all repeats
        for br=1:opts.batchrepeat
            for t = t_curr:opts.dt:t_curr+opts.t_stop
                % Generate input and log
                % --------------------------------------------
                noise_batch = (1-opts.temps(1)) * batch + ...
                    opts.temps(1) * (rand(size(batch)) > 0.5);
                in_current = opts.input_rescale * ...
                    (noise_batch .* opts.rate_rescale) > rand(size(batch));
                firings{1} = (in_current > 0);
                last_active{1}(firings{1}) = t;
                slowpass{1} = slowpass{1} + firings{1};

                for pop=2:4
                    % Hidden Layer
                    % ----------------------------------------
                    % Decay membrane
                    membranes{pop} = membranes{pop} * exp(- opts.dt/opts.tau);
                    % Add impulse
                    if(pop == 2)
                        membranes{pop} = membranes{pop} + ...
                            (t > refrac_end{pop}) .* (firings{1} * W');
                    elseif(pop == 3)
                        if(opts.pcd == 1)
                            membranes{pop} = membranes{pop} + ...
                                (t > refrac_end{pop}) .* (firings{4} * W);
                        else
                            membranes{pop} = membranes{pop} + ...
                                (t > refrac_end{pop}) .* (firings{2} * W);
                        end
                    elseif(pop == 4)
                        membranes{pop} = membranes{pop} + ...
                            (t > refrac_end{pop}) .* (firings{3} * W');
                    end

                    % Add noise
                    if(opts.n_noise == 1)
                        noise{pop} = noise{pop}*(1-1/opts.n_tau) + ...
                            1/opts.n_tau * opts.temps(pop)*randn(size(membranes{pop}));
                    else
                        noise{pop} = noise{pop}*(1-1/opts.n_tau) + ...
                            1/opts.n_tau * opts.temps(pop)*rand(size(membranes{pop}));
                    end

                    membranes{pop} = membranes{pop} + noise{pop};

                    % Get firings
                    full_firings = bsxfun(@gt, membranes{pop}, Thr{mod(pop-1,2)+1});
                    firings{pop} = full_firings;
                    slowpass{pop} = slowpass{pop} + (full_firings);
                    % Reset
                    membranes{pop}(firings{pop}) = 0;
                    refrac_end{pop}(firings{pop}) = t + opts.t_refrac;
                    last_active{pop}(firings{pop}) = t;

                    % Bound
                    membranes{pop}(membranes{pop} < opts.min_m) = opts.min_m;
                end

                % Learn
                % Threshold adjustment
                dBv = zeros(opts.batchsize, dims(1));
                dBv(firings{1}) = - opts.eta * opts.thresh_eta;
                dBv(firings{3}) = dBv(firings{3}) + opts.eta * opts.thresh_eta;
                dBh = zeros(opts.batchsize, dims(2));
                dBh(firings{2}) = - opts.eta * opts.thresh_eta;
                dBh(firings{4}) = dBh(firings{4}) + opts.eta * opts.thresh_eta;

                Thr{1} = Thr{1} + sum(dBv, 1) / opts.batchsize;
                Thr{2} = Thr{2} + sum(dBh, 1) / opts.batchsize;

                % STDP
                if(~isempty(firings{2}))
                    dWp = opts.eta .* ...
                        (double(firings{2})' * (last_active{1} > (t - opts.pos_lag)));
                end
                if(~isempty(firings{4}))
                    dWn = opts.eta .* ...
                        (double(firings{4})' * (last_active{3} > (t - opts.pos_lag)));
                end

                dW = (dWp - dWn) / opts.batchsize;
                W = W + dW;

                W = W * (1 - opts.decay);
                W = W + opts.inv_decay;

                if(opts.wt_lim)
                    W(W>opts.wt_lim) = opts.wt_lim;
                    W(W<-opts.wt_lim) = -opts.wt_lim;
                end
            end

            % Clear between presentations
            t_curr = t + opts.t_gap;
            for pop=1:4
                membranes{pop} = membranes{pop} * exp(- opts.t_gap/opts.tau);
            end
        end
    end
end
Algorithm 3.
Appendix C
EyeMove Implementation
Printed here is the reference implementation of the eye movement script, adapted from
[52] and discussed in Section 5.3.
function offsets = get_tremor_offsets(iters, vis_size, varargin)

% Unpack the options struct (lambda, chi, hc, sinkeps, throwout),
% assumed here to be passed as the first variable argument
opts = varargin{1};

% Build regions
L = vis_size;
x_0 = floor(L / 2);
y_0 = floor(L / 2);
x = x_0;
y = y_0;

% Set up energy surface
[X, Y] = meshgrid(1:L, 1:L);
activations = zeros(L, L);
focuser = opts.lambda * L * ((X - x_0).^2/x_0 + (Y - y_0).^2/y_0);

% Create the choices
offsets_for_choice = {[0 1] [0 -1] [-1 0] [1 0]};
% Pre-allocate the output offsets
offsets = zeros(iters+opts.throwout, 2);
for k=1:iters+opts.throwout
    % Create a saccade?
    if(activations(x,y) > opts.hc)
        % Find global minimum with orientation preference
        u1 = opts.chi * L * ((X - x).^2/x_0 .* (Y - y).^2/y_0);
        costmat = focuser + activations + u1;
        [gminj, gmini] = find(costmat==min(min(costmat)));
        newx = gminj(1);
        newy = gmini(1);
    else
        % Calculate choices
        u_up = focuser(x,y+1) + activations(x,y+1);
        u_down = focuser(x,y-1) + activations(x,y-1);
        u_l = focuser(x-1,y) + activations(x-1,y);
        u_r = focuser(x+1,y) + activations(x+1,y);

        % Choose new spot
        choices = [u_up u_down u_l u_r];
        permidx = randperm(4);
        [~, choice] = min(choices(permidx));

        % Make update
        newx = x + offsets_for_choice{permidx(choice)}(1);
        newy = y + offsets_for_choice{permidx(choice)}(2);
    end

    % Update avoidance path
    oldspot = activations(x,y);
    activations = (1-opts.sinkeps) * activations;
    activations(x,y) = oldspot + 1;

    % Yield a result
    offsets(k,:) = [newx-x newy-y];

    % Move
    x = newx;
    y = newy;
end
% Remove "burn-in" period used to set up a valid energy well
offsets = offsets(opts.throwout+1:end,:);
Algorithm 4.
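The same self-avoiding random walk can be sketched in NumPy for readers without MATLAB. This is a simplified re-implementation of the Engbert et al. model [52] realized in the listing above: the walker greedily descends a quadratic confining potential while an activation field it deposits repels it from recently visited sites, and crossing the threshold `hc` triggers a saccade-like jump to the global minimum of the combined potential. Parameter names and default values here are illustrative and would need tuning to match the thesis experiments.

```python
import numpy as np

def tremor_offsets(iters, L, lam=1.0, chi=2.0, hc=7.9,
                   sink_eps=1e-3, throwout=100, rng=None):
    """Fixational eye movement offsets on an L x L lattice."""
    rng = np.random.default_rng(rng)
    x0 = y0 = L // 2
    x, y = x0, y0
    X, Y = np.meshgrid(np.arange(L), np.arange(L), indexing='ij')
    # Quadratic confining potential centered on the lattice
    focuser = lam * L * ((X - x0) ** 2 / x0 + (Y - y0) ** 2 / y0)
    act = np.zeros((L, L))  # self-avoidance ("activation") field
    moves = [(0, 1), (0, -1), (-1, 0), (1, 0)]
    offsets = np.zeros((iters + throwout, 2), dtype=int)
    for k in range(iters + throwout):
        if act[x, y] > hc:
            # Saccade: jump to the global minimum of the combined potential
            u1 = chi * L * ((X - x) ** 2 / x0 * (Y - y) ** 2 / y0)
            nx, ny = np.unravel_index(np.argmin(focuser + act + u1), (L, L))
        else:
            # Tremor step: greedy descent with random tie-breaking
            cand = [(x + dx, y + dy) for dx, dy in moves]
            costs = np.array([focuser[c] + act[c] for c in cand])
            order = rng.permutation(4)
            nx, ny = cand[order[np.argmin(costs[order])]]
        # Decay the avoidance field, then reinforce the current site
        old = act[x, y]
        act *= 1.0 - sink_eps
        act[x, y] = old + 1.0
        offsets[k] = (nx - x, ny - y)
        x, y = nx, ny
    # Discard the burn-in used to establish a valid energy well
    return offsets[throwout:]
```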
Bibliography
[1] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm
for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[2] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with
neural networks. Science, 313(5786):504–507, 2006.
[3] Ruslan Salakhutdinov and Geoffrey E. Hinton. Deep Boltzmann machines. In
International Conference on Artificial Intelligence and Statistics, pages 448–455,
2009.
[4] MIT Technology Review. 10 breakthrough technologies 2013: Deep
learning. http://www.technologyreview.com/featuredstory/513696/deep-learning/,
April 2013.
[5] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical
evaluation of deep architectures on problems with many factors of variation. In
Proc. of ICML, pages 473–480. ACM, 2007.
[6] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep, big,
simple neural nets for handwritten digit recognition. Neural Comp., 22(12):3207–
3220, 2010.
[7] F. Seide, G. Li, and D. Yu. Conversational speech transcription using context-
dependent deep neural networks. In Proc. Interspeech, pages 437–440, 2011.
[8] Q. V. Le, M. A. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and
A. Y. Ng. Building high-level features using large scale unsupervised learning. In
ICML, 2012.
[9] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks
for scalable unsupervised learning of hierarchical representations. In Proc. of ICML,
pages 609–616, 2009.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with
deep convolutional neural networks. In Advances in Neural Information Processing
Systems 25, pages 1106–1114, 2012.
[11] Paul Smolensky. Information processing in dynamical systems: Foundations of
harmony theory. 1986.
[12] Yoav Freund and David Haussler. Unsupervised learning of distributions of binary
vectors using two layer networks. Technical report, Computer Research Laboratory,
University of California, Santa Cruz, 1994.
[13] Geoffrey E. Hinton. Training products of experts by minimizing contrastive diver-
gence. Neural Computation, 14(8):1771–1800, 2002.
[14] Peter O’Connor, Daniel Neil, Shih-Chii Liu, Tobi Delbruck, and Michael Pfeiffer.
Real-time classification and sensor fusion with a spiking deep belief network. Fron-
tiers in Neuroscience, 7, 2013.
[15] C. Eliasmith, T. C. Stewart, X. Choo, T. Bekolay, T. DeWolf, Y. Tang, and D. Ras-
mussen. A large-scale model of the functioning brain. Science, 338(6111):1202–1205,
2012.
[16] J. Perez-Carrasco, B. Zhao, C. Serrano, B. Acha, T. Serrano-Gotarredona, S. Chen,
and B. Linares-Barranco. Mapping from frame-driven to frame-free event-driven
vision systems by low-rate rate-coding and coincidence processing. Application to
feed forward ConvNets. IEEE Trans. on Pattern Analysis and Machine Intelligence,
in press, 2013.
[17] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, J-M. Bussat, and K. A. Boahen. A
multicast tree router for multichip neuromorphic systems. IEEE Transactions on
Circuits and Systems I, 2013.
[18] M. M. Khan, D. R. Lester, L. A. Plana, A. Rast, X. Jin, E. Painkras, and S. B.
Furber. SpiNNaker: mapping neural networks onto a massively-parallel chip mul-
tiprocessor. In Proc. 2008 International Joint Conference on Neural Networks
(IJCNN’08), 2008.
[19] DARPA SyNAPSE Program, 2013. URL http://www.artificialbrains.com/darpa-synapse-program.
[20] A. Cassidy, A.G. Andreou, and J. Georgiou. Design of a one million neuron single
FPGA neuromorphic system for real-time multimodal scene analysis. In 2011 45th
Annual Conference on Information Sciences and Systems (CISS)., pages 1–6. IEEE,
2011.
[21] Daniel L. Neil and Shih-Chii Liu. Minitaur, an event-driven FPGA-based spiking
network accelerator. submitted (under review), 2013.
[22] G. Indiveri, B. Linares-Barranco, T.J. Hamilton, A. van Schaik, R. Etienne-
Cummings, T. Delbruck, S.-C. Liu, P. Dudek, P. Hafliger, S. Renaud, J. Schem-
mel, G. Cauwenberghs, J. Arthur, K. Hynna, F. Folowosele, S. Saighi, T. Serrano-
Gotarredona, J. Wijekoon, Y. Wang, and K. Boahen. Neuromorphic silicon neuron
circuits. Frontiers in Neuroscience, 5:1–23, 2011. ISSN 1662-453X.
[23] S.-C. Liu and T. Delbruck. Neuromorphic sensory systems. Current Opinion in
Neurobiology, 20(3):288–295, 2010.
[24] P. Lichtsteiner, C. Posch, and T. Delbruck. A 128×128 120 dB 15 µs latency
asynchronous temporal contrast vision sensor. IEEE Journal of Solid-State Circuits,
43(2):566–576, 2008.
[25] S-C. Liu, A. van Schaik, B. Minch, and T. Delbruck. Event-based 64-channel
binaural silicon cochlea with Q enhancement mechanisms. In Proceedings of the
2010 IEEE International Symposium on Circuits and Systems, pages 2027–2030,
May 2010. ISCAS 2010: Paris, France, 30 May–2 June.
[26] C. Farabet, R. Paz, J. Perez-Carrasco, C. Zamarreno, A. Linares-Barranco, Y. Le-
Cun, E. Culurciello, T. Serrano-Gotarredona, and B. Linares-Barranco. Compari-
son between frame-constrained fix-pixel-value and frame-free spiking-dynamic-pixel
ConvNets for visual processing. Frontiers in Neuroscience, 6(32), 2012.
[27] T. Delbruck, Bernabe Linares-Barranco, Eugenio Culurciello, and Christoph Posch.
Activity-driven, event-based vision sensors. In Proceedings of 2010 IEEE Interna-
tional Symposium on Circuits and Systems (ISCAS), pages 2426–2429. IEEE, 2010.
[28] J.M. Nageswaran, N. Dutt, J.L. Krichmar, A. Nicolau, and A. Veidenbaum. Effi-
cient simulation of large-scale spiking neural networks using CUDA graphics proces-
sors. In International Joint Conference on Neural Networks (IJCNN) 2009, pages
2145–2152. IEEE, 2009.
[29] L.P. Maguire, T.M. McGinnity, B. Glackin, A. Ghani, A. Belatreche, and J. Harkin.
Challenges for large-scale implementations of spiking neural networks on FPGAs.
Neurocomputing, 71(1):13–29, 2007.
[30] Teresa Serrano-Gotarredona and Bernabe Linares-Barranco. A 128×128 1.5%
contrast sensitivity 0.9% FPN 3 µs latency 4 mW asynchronous frame-free dynamic
vision sensor using transimpedance preamplifiers. 2013.
[31] S-C. Liu, A. van Schaik, B. Minch, and T. Delbruck. Asynchronous binaural spatial
audition sensor with 2×64×4 channel output. IEEE Transactions on Biomedical
Circuits and Systems, 2013. In press.
[32] Yang Dan and Mu-ming Poo. Spike timing-dependent plasticity of neural circuits.
Neuron, 44(1):23–30, 2004.
[33] Daniel E. Feldman. The spike-timing dependence of plasticity. Neuron, 75(4):
556–571, 2012.
[34] B. Nessler, M. Pfeiffer, L. Buesing, and W. Maass. Bayesian computation emerges
in generic cortical microcircuits through spike-timing-dependent plasticity. PLoS
Computational Biology, 9(4):e1003037, 2013.
[35] Yann Lecun and Corinna Cortes. The MNIST database of handwritten digits. URL
http://yann.lecun.com/exdb/mnist/.
[36] G. E. Hinton and T. J. Sejnowski. Learning and relearning in Boltzmann machines.
MIT Press, Cambridge, Mass., 1:282–317, 1986.
[37] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine
Learning, 2(1):1–127, 2009.
[38] David E. Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning repre-
sentations by back-propagating errors. Cognitive Modeling, 1:213, 2002.
[39] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training
of deep networks. In Advances in Neural Information Processing Systems 19. MIT
Press, 2006.
[40] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based
learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–
2324, 1998.
[41] T. Tieleman. Training restricted Boltzmann machines using approximations to the
likelihood gradient. In Proc. of ICML, pages 1064–1071. ACM, 2008.
[42] H. Goh, N. Thome, and M. Cord. Biasing restricted Boltzmann machines to ma-
nipulate latent selectivity and sparsity. In NIPS workshop on deep learning and
unsupervised feature learning, 2010.
[43] Geoffrey E. Hinton and Andrew D. Brown. Spiking Boltzmann machines. In NIPS,
pages 122–128. Citeseer, 1999.
[44] Yee Whye Teh and Geoffrey E. Hinton. Rate-coded restricted Boltzmann machines
for face recognition. Advances in Neural Information Processing Systems, pages 908–
914, 2001.
[45] Hsin Chen and Alan Murray. A continuous restricted Boltzmann machine with
a hardware-amenable learning algorithm. In Artificial Neural Networks—ICANN
2002, pages 358–363. Springer, 2002.
[46] Peter O’Connor. A real-time sensory-fusion model using a Deep Belief Network with
spiking neurons. Master’s thesis, Institute of Neuroinformatics, Zurich, Switzerland,
2012.
[47] A. J. F. Siegert. On the first passage time probability problem. Physical Review,
81(4):617, 1951.
[48] F. Jug, J. Lengler, C. Krautz, and A. Steger. Spiking networks and their rate-
based equivalents: does it make sense to use Siegert neurons? In Swiss Society for
Neuroscience. 2012.
[49] F. Jug, M. Cook, and A. Steger. Recurrent competitive networks can learn lo-
cally excitatory topologies. In International Joint Conference on Neural Networks
(IJCNN), pages 1–8, 2012.
[50] G-Q. Bi and M-M. Poo. Synaptic modifications in cultured hippocampal neurons:
Dependence on spike timing, synaptic strength, and postsynaptic cell type. Jour.
of Neuroscience, 18(24):10464–10472, 1998.
[51] Dan Ciresan, Ueli Meier, and Jurgen Schmidhuber. Multi-column deep neural
networks for image classification. In Computer Vision and Pattern Recognition
(CVPR), 2012 IEEE Conference on, pages 3642–3649. IEEE, 2012.
[52] Ralf Engbert, Konstantin Mergenthaler, Petra Sinn, and Arkady Pikovsky. An
integrated model of fixational eye movements and microsaccades. Proceedings of
the National Academy of Sciences, 108(39):E765–E770, 2011.
[53] D. Erhan, Y. Bengio, A. Courville, P.A. Manzagol, P. Vincent, and S. Bengio.
Why does unsupervised pre-training help deep learning? The Journal of Machine
Learning Research, 11:625–660, 2010.
[54] G. Hinton. A practical guide to training restricted Boltzmann machines. Momen-
tum, 9:1, 2010.
[55] Ilya Sutskever, James Martens, and Geoffrey E. Hinton. Generating text with
recurrent neural networks. In Proceedings of the 28th International Conference on
Machine Learning (ICML-11), pages 1017–1024, 2011.
[56] Alain Destexhe, Michael Rudolph, J-M Fellous, and Terrence J. Sejnowski. Fluc-
tuating synaptic conductances recreate in vivo-like activity in neocortical neurons.
Neuroscience, 107(1):13–24, 2001.
[57] J. Masci, U. Meier, D. Ciresan, and J. Schmidhuber. Stacked convolutional auto-
encoders for hierarchical feature extraction. Artificial Neural Networks and Machine
Learning–ICANN 2011, pages 52–59, 2011.
[58] P. Lichtsteiner, T. Delbruck, and C. Posch. A 100 dB dynamic range high-speed dual-
line optical transient sensor with asynchronous readout. In International Symposium
on Circuits and Systems, ISCAS 2006. IEEE, 2006.
[59] L. Busing, J. Bill, B. Nessler, and W. Maass. Neural Dynamics as Sampling: A
Model for Stochastic Computation in Recurrent Networks of Spiking Neurons. PLoS
Computational Biology, 7(11):e1002211, 2011. ISSN 1553-7358.
[60] T. Tieleman and G. Hinton. Using fast weights to improve persistent contrastive
divergence. In Proc. of ICML, pages 1033–1040. ACM, 2009.
[61] O. Bichler, D. Querlioz, S.J. Thorpe, J.P. Bourgoin, and C. Gamrat. Extraction
of temporally correlated features from dynamic vision sensors with spike-timing-
dependent plasticity. Neural Networks, 2012.
[62] S. Mitra, S. Fusi, and G. Indiveri. Real-time classification of complex patterns using
spike-based learning in neuromorphic VLSI. IEEE Transactions on Biomedical
Circuits and Systems, 3(1):32–42, Feb. 2009.
[63] Timothee Masquelier and Simon J Thorpe. Unsupervised learning of visual features
through spike timing dependent plasticity. PLoS Computational Biology, 3(2):e31,
2007.
[64] T. Masquelier, R. Guyonneau, and S.J. Thorpe. Competitive STDP-based spike
pattern learning. Neural Computation, 21:1259–1276, 2009.