
CS229 FINAL REPORT, FALL 2015

Neural Memory Networks

David Biggs and Andrew Nuttall

I. INTRODUCTION

Context is vital in formulating intelligent classifications and responses, especially under uncertainty. In a standard feed-forward neural network (FFNN), context comes in the form of information encoded in the input vector and trained into the weight parameters. However, useful information can also be present in the temporal nature of the input vectors, or in past internal states of the network. Future outputs can achieve better accuracy by observing transient trends in the input data, or by utilizing key memories from distant inputs that could be crucial to formulating a correct output. By providing a neural network with an architecture for storing and maintaining memories, this additional context can be effectively leveraged.

A simple implementation of memory in a neural network is to write past inputs to an external memory and concatenate additional inputs drawn from that memory onto the network input. For noisy analog inputs, memory reads weighted by Gaussian distributions act to preprocess and filter the data. Fig. 1 shows a schematic and the memory weight distribution of an FFNN with external memory augmentation.
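As a rough sketch (not the report's code) of this external-memory augmentation, the following NumPy snippet keeps the last ten inputs in a buffer and forms five additional inputs as Gaussian-weighted reads of that buffer; the buffer length, number of reads, and Gaussian width are assumptions chosen to match the description above.

import numpy as np

MEM_LEN = 10     # number of past inputs kept in external memory
N_TAPS = 5       # number of additional memory-derived inputs
SIGMA = 1.5      # width of the Gaussian weighting (assumed)

# Gaussian weight profiles centered at different memory locations
centers = np.linspace(0, MEM_LEN - 1, N_TAPS)
locs = np.arange(MEM_LEN)
W_mem = np.exp(-0.5 * ((locs[None, :] - centers[:, None]) / SIGMA) ** 2)
W_mem /= W_mem.sum(axis=1, keepdims=True)          # normalize each weight profile

def augment_input(x_t, memory):
    """Append Gaussian-weighted reads of the memory buffer to the current input."""
    memory = np.roll(memory, 1)
    memory[0] = x_t                                # write the newest input
    reads = W_mem @ memory                         # N_TAPS filtered memory values
    return np.concatenate(([x_t], reads)), memory

memory = np.zeros(MEM_LEN)
x_aug, memory = augment_input(0.7, memory)          # x_aug has 1 + N_TAPS entries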

Fig. 1: Architecture of an FFNN with a simple memory implementation. (a) Network schematic. (b) Weights of the memory matrix (weight versus memory location).

This architecture was implemented to predict the next value of a noisy sinusoidal signal. The ten previous inputs were stored in memory, from which five additional inputs were drawn using Gaussian weights on the memory. The network was trained on a single sinusoid and then tested on a sum of three sinusoids of varying frequencies, with errors of roughly 10% for training and 15% for testing. The convergence and input/output waveforms are shown in Fig. 2.

Network architectures with delayed inputs or additional inputs drawn from memory can be useful tools, but they are limited in functionality because the memory architecture is predetermined. A more powerful approach is to store memory inside the network and allow the network to learn how to utilize it via read and write commands.

II. RELATED WORK

Neural networks were first introduced over 60 years ago as one of the first learning algorithms. Recurrent neural networks (RNNs) were then introduced in the 1980s to better process sequential inputs by maintaining an internal state between inputs. Long Short-Term Memory (LSTM) improved upon RNNs in the late 1990s by adding gates to read, write, and forget internal memory states. Recently, Weston et al. [1] defined a framework for neural networks to interact with external memory in order to read and write long-term memories. In their paper, Weston et al. implement an RNN with memory for textual story processing, in which several actors move between rooms while carrying and depositing objects. Their results showed that, with memory, their network outperformed a similar RNN and an LSTM without memory access at answering questions about a story. Around the same time, Graves et al. [2] published an algorithm they call the Neural Turing Machine (NTM). The NTM has memory structures called memory heads that can be accessed in order to read, write, and erase data. They ran several tests on their NTM, one of which was teaching the network to store and then copy sequences of numbers. They trained their network on sequences of up to twenty 8-bit numbers and tested on sequences of up to 120 numbers with good results. This project differs from related work by implementing internal neural memory, assigning a set of memory nodes (a memory bank) to each standard neuron, as discussed in the following section. With this approach a richer memory can be achieved, not only by maintaining key memories but also by retaining memory histories with associated temporal information.

III. METHODOLOGY

The neural memory network (NMN) architecture is composed of an FFNN with a dedicated internal memory bank associated with each standard neuron in each of the hidden layers. The input and output layers are composed only of standard nodes, while the hidden layers are composed of standard nodes together with their respective memory nodes. In this formulation, the output of each node is given by a sigmoid function of an affine combination of its inputs. The output of a node is used as a linear switch to activate the memories dedicated to that node. During forward propagation the network can learn to call upon these memories as needed to provide additional information when making classifications. Figure 3 depicts the general network layout for a simple neural memory network.

Training and testing of the algorithm comprise three key steps:


Fig. 2: Signal prediction results from the memory augmented FFNN. (a) Error convergence: training and test error [%] versus training points. (b) Waveforms: input and output signals for the training and testing data.

Fig. 3: A neural memory network has input and output layers composed of standard sigmoid nodes and hidden layers composed of both standard and memory nodes. Each standard node in a hidden layer has a memory bank associated with it, and each memory bank is composed of multiple memory nodes. The outputs of standard nodes are used as switches to activate memories when called upon.

• Forward propagation of the input across each network layer to form the output

• Back propagation of the errors across each layer, starting at the output, to perform gradient descent updates on the network parameters

• Memory storage in each layer of nearby neuron outputs as an orthogonal set in a higher dimensional space

A. Forward Propagation

Forward propagation in an NMN functions in a similar manner to that of an FFNN, with an augmented output vector from each hidden layer. The output of each layer i is given by a two-step equation containing outputs from both standard neurons and memory neurons,

x^{(i)} = f\left(W^{(i-1)} x^{(i-1)} + b^{(i-1)}\right) \quad (1)

x^{(i)} := \left[x^{(i)}_1,\; x^{(i)}_1 M^{(i)\top}_1,\; \ldots,\; x^{(i)}_k,\; x^{(i)}_k M^{(i)\top}_k\right]^{\top} \quad (2)

The output state vector of each layer's standard nodes is augmented with the outputs of the layer's memory nodes. The output of each memory node is the product of the content of that memory slot and the output of the standard node it is attached to. With this approach the standard node acts as a linear switch to turn memories on or off as desired. During forward propagation in the hidden layers, the outputs of a hidden layer go only to standard nodes; memory nodes receive input only from their respective standard node in the same layer.
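To make the augmented forward pass concrete, a minimal NumPy sketch of Eqs. (1)-(2) for one hidden layer is given below; layer sizes, memory lengths, and the random parameters are placeholders rather than the authors' settings.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_layer(x_prev, W, b, M):
    """One hidden layer of the NMN.
    x_prev : output vector of the previous layer
    W, b   : weights and biases of this layer (standard nodes only)
    M      : memory banks, one row of length m per standard node
    Returns the standard outputs and the augmented vector of Eq. (2)."""
    x = sigmoid(W @ x_prev + b)                  # Eq. (1): standard node outputs
    pieces = []
    for j in range(len(x)):
        pieces.append([x[j]])                    # standard node output x_j
        pieces.append(x[j] * M[j])               # Eq. (2): x_j switches its memory bank
    return x, np.concatenate(pieces)

rng = np.random.default_rng(0)
k, m, n_in = 4, 3, 6                             # nodes, memories per node, inputs (assumed)
W, b = rng.normal(size=(k, n_in)), np.zeros(k)
M = rng.normal(size=(k, m))
x_std, x_aug = forward_layer(rng.normal(size=n_in), W, b, M)   # x_aug has k*(1+m) entries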

B. Backward Propagation and Gradient Descent

The network is trained by updating the parameters W and b using gradient descent with the back-propagation technique. In the training process, an input is forward propagated through the network to produce a hypothesis h(x_j) that is then compared to a known solution y. The error is calculated as

\varepsilon = \frac{1}{2}\left\| h\!\left(x^{(\mathrm{output})}\right) - y \right\|_2^2 \quad (3)

Performing gradient descent on an NMN requires different considerations for the set of first and last layers and for the set of hidden layers, due to the network's asymmetry. To perform gradient descent on the weights, the partial derivatives with respect to W^{(i)}_{jk} are taken as

\frac{\partial \varepsilon}{\partial W^{(i)}_{jk}} = \frac{\partial \varepsilon}{\partial h(x^{(i)}_j)} \, \frac{\partial h(x^{(i)}_j)}{\partial x^{(i)}_j} \, \frac{\partial x^{(i)}_j}{\partial W^{(i)}_{jk}} \quad (4)

Evaluating this expression for the final layer I of weights yields

\frac{\partial \varepsilon}{\partial W^{(I)}_{jk}} = \left(h(x^{(I)}_j) - y\right) h(x^{(I)}_j)\left(1 - h(x^{(I)}_j)\right) x^{(I-1)} \quad (5)

Back propagation is implemented in the standard way by storing gradients in the variable δ and passing them to previous layers. This process starts at the output layer as

\delta^{(\mathrm{output})} = \left(x^{(\mathrm{output})} - y\right) x^{(\mathrm{output})}\left(1 - x^{(\mathrm{output})}\right) \quad (6)


The errors of the remaining layers, moving backwards through the network, are weighted by the parameters

\delta^{(i)} = W^{(i+1)} \delta^{(i+1)} x^{(i+1)}\left(1 - x^{(i+1)}\right) \quad (7)

There are two paths back to the primary neurons, so the errors of each hidden layer must be summed across the memory nodes

\delta^{(i)}_j := \delta^{(i)}_j + \sum_k \delta^{(i)}_{j+k} M^{(i)}_{j+k} \quad (8)

The updates on W and b are then,

W^{(i)} := W^{(i)} - \alpha\, \delta^{(i)} x^{(i+1)} \quad (9)

b^{(i)} := b^{(i)} - \alpha\, \delta^{(i)} \quad (10)
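As a hedged illustration of how Eqs. (6)-(10) translate into code, the following NumPy sketch implements standard sigmoid-network back propagation with the memory path of Eq. (8) omitted; the layer indexing is adapted so the matrix shapes work out (each weight gradient uses the outputs of the layer below), and all sizes are placeholders rather than the report's settings.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_ffnn(xs, Ws, bs, y, lr=0.1):
    """xs[i] is the (already computed) output of layer i, with xs[0] the input.
    Deltas follow Eq. (6) at the output and Eq. (7) going backwards;
    Eqs. (9)-(10) become outer-product updates on W and b."""
    L = len(Ws)
    delta = (xs[-1] - y) * xs[-1] * (1 - xs[-1])              # Eq. (6)
    for i in reversed(range(L)):
        dW = np.outer(delta, xs[i])                           # gradient w.r.t. W of layer i
        prev_delta = None
        if i > 0:
            prev_delta = (Ws[i].T @ delta) * xs[i] * (1 - xs[i])   # Eq. (7)
        Ws[i] -= lr * dW                                      # Eq. (9)
        bs[i] -= lr * delta                                   # Eq. (10)
        delta = prev_delta
    return Ws, bs

# tiny usage example with assumed layer sizes
rng = np.random.default_rng(1)
sizes = [5, 4, 2]
Ws = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(2)]
bs = [np.zeros(sizes[i + 1]) for i in range(2)]
xs = [rng.normal(size=5)]
for W, b in zip(Ws, bs):
    xs.append(sigmoid(W @ xs[-1] + b))
Ws, bs = backprop_ffnn(xs, Ws, bs, y=np.array([0.0, 1.0]))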

C. Memory Architecture

Every node in the hidden layers contains a static memory bank. Each unique memory in a node is orthogonal to all other memories in that node, such that no information is lost when the memories are added together. A collection of memories is then represented as a single vector in a higher dimensional space, called a composite memory. The direction of the vector dictates which memories are present, and the magnitude of the vector when projected onto each memory's axis determines the order in which the memories occurred. When a node fires, the current composite vector stored in memory is scaled according to some scheduled decay and then the new memory is added. The new memory is created by an orthogonal mapping of the outputs of the node and its τ nearest neighbors. Nearest neighbors can be defined in a spatial sense as nodes in the same layer with close indices, or they can also be defined to include nodes in previous and subsequent layers. The memory bank update equation for each node is given by

M^{(i)}_j := \gamma^{\lfloor x^{(i)}_j \rceil} M^{(i)}_j + g\!\left(\left[x^{(i)}_{j-\tau/2}, \ldots, x^{(i)}_j, \ldots, x^{(i)}_{j+\tau/2}\right]\right) \quad (11)

where g(x) : \mathbb{R}^{\tau+1} \to \mathbb{R}^{2^{\tau+1}} is a function mapping the outputs of the node and its τ nearest neighbors to a 2^{\tau+1}-dimensional space, and γ is the given decay schedule, which is turned on and off by the rounded output of the standard node. When a node fires, \gamma^{\lfloor x^{(i)}_j \rceil} evaluates to γ, and if it does not fire it evaluates to 1.

During the training phase memories are allowed to decay to zero quickly, to provide the network with the plasticity to converge to its desired orientation. During use, memories can be retained indefinitely or according to any desired decay schedule. With an asymptotic memory decay schedule, transient information is retained for the more recent memories; any remaining memories that have reached the asymptotic magnitude are still retained, but their time ordering can no longer be compared to that of other memories which have also reached the asymptotic decay magnitude.
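A minimal sketch of the memory bank update of Eq. (11) follows, assuming a concrete choice for the orthogonal mapping g: the rounded activation pattern of the neighborhood is encoded as a one-hot vector, so distinct patterns occupy orthogonal axes. The decay value, the layer vector, and the neighborhood indexing are illustrative assumptions, not the authors' implementation.

import numpy as np

def g(neighborhood):
    """Map the rounded activation pattern of a node's neighborhood to a
    one-hot vector, one axis per pattern (an illustrative orthogonal mapping;
    the report leaves g abstract)."""
    bits = np.rint(neighborhood).astype(int)
    idx = int("".join(map(str, bits)), 2)        # binary pattern -> slot index
    out = np.zeros(2 ** len(bits))
    out[idx] = 1.0
    return out

def update_memory(M_j, x_layer, j, tau, gamma=0.9):
    """Eq. (11): decay the composite memory only when node j fires
    (gamma**1 vs. gamma**0 = 1), then add the new orthogonal memory formed
    from node j and its tau nearest neighbors. Node j is assumed to be at
    least tau/2 positions away from the layer edges."""
    half = tau // 2
    neigh = x_layer[j - half: j + half + 1]      # node j and its tau neighbors
    fired = int(np.rint(x_layer[j]))             # 1 if the node fires, else 0
    return (gamma ** fired) * M_j + g(neigh)

x_layer = np.array([0.1, 0.9, 0.8, 0.2, 0.7])
M_1 = np.zeros(2 ** 3)                           # tau = 2 -> 2^(tau+1) = 8 slots
M_1 = update_memory(M_1, x_layer, j=1, tau=2)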

IV. EXPERIMENTS/RESULTS

The NMN algorithm was implemented and then tested with several test data sets:

• Sorted Classification
• Classification and Recall
• Memory Recall
• Extended Recall

Fig. 4: Each standard node stores a composite memory in its dedicated memory bank. A composite memory is the sum of all of the orthogonal memories associated with that node, such that the original memories are retained by projecting the composite memory onto different dimensions. In the nearest neighbor approach, the activation pattern of a cell's nearest neighbors is mapped to a higher dimensional orthogonal memory and then added to the cell's composite memory.

Each of the test sets is meant to evaluate a different aspect of the algorithm's functionality. For comparison, a standard FFNN (an NMN with memory turned off) and an RNN available in the MATLAB Neural Network Toolbox (layrecnet()) were also trained and evaluated on the test data sets. layrecnet() has one hidden layer with feedback and one output layer; a block diagram of it is shown in Fig. 6.

A. Sorted Classification

This data set is composed of points sampled from 10 randomly generated multivariate Gaussian distributions, with the goal of classifying which distribution a point was sampled from. Samples were drawn from the distributions in a stationary order. Due to overlap in the distributions, a standard feed-forward neural network was only able to achieve a classification error of 4.5%. By implementing memory, the neural network was able to identify that there was a pattern to the input points and reduce the classification error to 0.7%, roughly a sixfold improvement. An RNN with a single feedback delay was also tested on the data, with a classification error of 1.8%. Convergence results are shown in Fig. 7.
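A hypothetical sketch of how such a data set might be generated is given below; the dimensionality, covariance scale, and the particular cyclic class ordering are assumptions, since the report does not specify them.

import numpy as np

rng = np.random.default_rng(2)
N_CLASSES, DIM, N_POINTS = 10, 2, 5000            # assumed sizes

means = rng.uniform(-3, 3, size=(N_CLASSES, DIM))
covs = [np.eye(DIM) * rng.uniform(0.5, 1.5) for _ in range(N_CLASSES)]

# samples are drawn in a fixed (stationary) class order, e.g. 0, 1, ..., 9, 0, 1, ...
labels = np.arange(N_POINTS) % N_CLASSES
X = np.stack([rng.multivariate_normal(means[c], covs[c]) for c in labels])
Y = np.eye(N_CLASSES)[labels]                     # one-hot classification targets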

B. Classification and Recall

The task of this data set was to classify an input and also state the previous two classifications the network made, in the order it made them. A memory-less neural network can perform no better than converging to the most frequent output on this type of task. A memory based neural network with the same parameters was able to achieve 7% classification error on the same data set. An RNN with a single feedback delay was also tested on the data, with a classification error of < 0.1%. Convergence results are shown in Fig. 8.
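One way to construct targets for this task is sketched below, assuming one-hot class labels; the helper name and the zero-padding at the start of the sequence are illustrative choices rather than the report's exact setup.

import numpy as np

def recall_targets(labels, n_classes, history=2):
    """Target at step t = one-hot of the current class concatenated with
    one-hot encodings of the previous `history` classes (zeros when none)."""
    eye = np.eye(n_classes)
    T = []
    for t in range(len(labels)):
        parts = [eye[labels[t]]]
        for h in range(1, history + 1):
            parts.append(eye[labels[t - h]] if t - h >= 0 else np.zeros(n_classes))
        T.append(np.concatenate(parts))
    return np.array(T)

targets = recall_targets(np.array([3, 1, 4, 1, 5]), n_classes=10)   # shape (5, 30)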

C. Memory Recall

The networks were given sequences of 5-bit numbers to write into memory, followed by empty input vectors of zeros of the same series length, signifying to the networks to read and output the stored sequences. The NMN algorithm was set up with one internal memory layer to process the data. Both memory writing procedures were tested: external writing, which recorded the input layer upon a standard neuron firing, and nearest neighbor writing, which recorded nearby neuron outputs as outlined above. Convergence results for sequences of length two are shown in Fig. 9.
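A sketch of how one sample of this copy-style data set might be constructed is shown below; the pairing of zero targets during the write phase with the stored sequence during the read phase is an assumption about the exact target layout.

import numpy as np

def copy_task_sample(seq_len=2, bits=5, rng=np.random.default_rng(3)):
    """Inputs: seq_len random 5-bit vectors, then seq_len all-zero vectors
    (the read cue). Targets: zeros during writing, the stored sequence
    during reading."""
    seq = rng.integers(0, 2, size=(seq_len, bits)).astype(float)
    inputs = np.vstack([seq, np.zeros((seq_len, bits))])
    targets = np.vstack([np.zeros((seq_len, bits)), seq])
    return inputs, targets

X, Y = copy_task_sample()        # X, Y each have shape (2*seq_len, 5)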


Fig. 5: Memory vector magnitude (with and without memory permanence) and memory component magnitude versus node firings. Every time a standard node is activated, the memories associated with that node have their magnitudes decremented according to some decay schedule, to allow for the retention of temporal information. Memories can either decay out of existence or be given some asymptotic decay.

Fig. 6: Block diagram of layrecnet(), a layered recurrent neural network function available in the MATLAB Neural Network Toolbox.

Fig. 7: A comparison of training error convergence versus epochs between an FFNN, an NMN, and an RNN on a classification task demonstrates the ability of the NMN to leverage temporal patterns in the input data.


Fig. 8: A convergence plot of training error versus epochs for the classification and recall task, where the outputs are given by the current classification and the previous two classifications.

Fig. 9: Convergence plot of training error versus epochs for the memory recall task for the various networks: the RNN, the NMN with an input write function, and the NMN with a nearest neighbor write function.


Fig. 10: Convergence plot of testing error versus epochs for the extended recall task, showing total and long recall error for both the NMN and the RNN. In this data set, some classifications depend only on the current inputs, while others depend on inputs from 25 time units ago. A standard RNN is unable to perform those long term classifications better than random guessing, while the NMN is able to achieve near perfect long term recall.

D. Extended Recall

In the extended recall data set, the task was to perform classifications, with some classifications depending only upon the current input and some depending on information from 25 inputs prior. Both the RNN and the NMN were able to achieve perfect testing performance on the classifications that depend only upon the current input. However, for outputs depending upon data from 25 inputs ago, the RNN was only able to achieve random-guess (66% error) levels of performance, while the NMN was able to achieve near perfect long term recall. The NMN in this setup was composed of 2 hidden layers. Convergence results for all four error rates are shown in Fig. 10.

V. CONCLUSION

We have developed and demonstrated a feed-forward neural network algorithm that stores memory associated with the standard hidden layer neurons. The benefit of neural memory is shown to be the context it gives the algorithm when computing pattern recognition and time series data sets. Memories are stored with a higher dimensional mapping to ensure orthogonality, and memories are decremented upon recall to maintain a time history. The neural memory network was evaluated on several classification and recall data sets, against a similar feed-forward neural network without memory and a commercial recurrent neural network. The results show that although the neural memory network indeed learns to utilize memory in order to solve problems requiring context, in the current implementation it does not outperform the commercial recurrent neural network on most tasks. On the other hand, neural memory allows the network to excel at long term memory storage and recall, a functionality that escapes the standard recurrent neural network.

VI. FUTURE WORK

The algorithms described in this paper are only a first approach towards the NMN architecture. Future developments will depend upon unlinking the read and write operators, addressing memory scaling issues, refining training schedules, and improving the general optimization techniques used to perform gradient descent.

In this implementation the NMN learns to recall information from memory as needed, but it is not explicitly trained on how and when to write information. To address this, the current implementation ties together the write and read commands so that any potentially useful information is saved to memory in the appropriate memory slots as determined by the read command. However, for more robust performance the read and write commands need to be learned separately and independently.

In the current nearest neighbor approach, the size of the memory banks scales as 2^τ, which can cause major performance issues if τ becomes too large. This constraint is introduced to maintain the orthogonal nature of the memories and avoid the problem of having to learn to reference specific memories: with orthogonal memories, a composite memory is referenced instead of a single component memory. Instead of this approach, the permutations of the nearest neighbors can be used to address a specific memory location in a node's memory bank through a binary to linear mapping, thus eliminating the orthogonality constraint and the 2^τ scaling law. An example output of a memory node with 2 nearest neighbors under this approach could be given by

M_1 = (1 - x_3)(1 - x_2)\, x_1 M_{1,1} + (1 - x_3)\, x_2 x_1 M_{1,2} + x_3 (1 - x_2)\, x_1 M_{1,3} + x_3 x_2 x_1 M_{1,4} \quad (12)

which is a differentiable expression for selecting a memory from a learned location. In this expression x_2 and x_3 are the outputs of the two nearest neighbors. This expression could be further processed with some type of focusing technique, but the approach allows for linear memory scaling and does not impose an orthogonality constraint.
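The sketch below generalizes the structure of Eq. (12) to an arbitrary number of neighbors: each memory slot is weighted by a product of x or (1 − x) terms over the neighbor outputs and gated by the node's own output. The slot ordering and the scalar memory contents are illustrative choices, not the report's implementation.

import numpy as np
from itertools import product

def select_memory(x_self, x_neigh, M_slots):
    """Differentiable memory selection in the spirit of Eq. (12): each slot is
    weighted by a product of neighbor terms (x or 1-x), gated by x_self."""
    out = 0.0
    for slot, bits in enumerate(product([0, 1], repeat=len(x_neigh))):
        w = x_self
        for b, xn in zip(bits, x_neigh):
            w *= xn if b else (1.0 - xn)
        out += w * M_slots[slot]
    return out

M_slots = np.array([0.2, 0.9, 0.4, 0.7])             # 2^2 slots for 2 neighbors
m = select_memory(x_self=0.95, x_neigh=[0.1, 0.8], M_slots=M_slots)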

To allow for more accurate comparisons between the NMN and other off-the-shelf algorithms, the optimization techniques used to perform gradient descent on the NMN can be improved to allow for better convergence. The implementation of a dynamic step size achieved via a line search could prove to be very effective.
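As an illustration of such a dynamic step size (not part of the current implementation), a simple Armijo backtracking line search is sketched below; the loss and gradient arguments are placeholders standing in for the NMN's error and its gradient.

import numpy as np

def backtracking_step(loss, params, grad, alpha0=1.0, beta=0.5, c=1e-4):
    """Armijo backtracking: shrink the step until the loss decreases enough.
    `loss` maps a parameter vector to a scalar; `grad` is its gradient at
    `params` (hypothetical names, not the report's code)."""
    alpha = alpha0
    f0, g2 = loss(params), np.dot(grad, grad)
    while loss(params - alpha * grad) > f0 - c * alpha * g2:
        alpha *= beta
        if alpha < 1e-8:          # give up if no sufficient decrease is found
            break
    return params - alpha * grad

# usage on a toy quadratic loss
p = backtracking_step(lambda w: np.sum((w - 3.0) ** 2), np.zeros(2), 2 * (np.zeros(2) - 3.0))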

REFERENCES

[1] J. Weston, S. Chopra, and A. Bordes, "Memory Networks," arXiv preprint arXiv:1410.3916, 2014.

[2] A. Graves, G. Wayne, and I. Danihelka, "Neural Turing Machines," arXiv preprint arXiv:1410.5401, 2014.