
Neural Networks 22 (2009) 568–578

Contents lists available at ScienceDirect

Neural Networks

journal homepage: www.elsevier.com/locate/neunet

2009 Special Issue

A new bidirectional heteroassociative memory encompassing correlational, competitive and topological properties

Sylvain Chartier a,∗, Gyslain Giguère b,1, Dominic Langlois a,2

a University of Ottawa, School of Psychology, 200 Lees E217, 125 University Street, Ottawa, ON K1N 6N5, Canada
b The University of Texas at Austin, Department of Psychology, 1 University Station, A8000, Austin, TX 78712-0187, USA

Article info

Article history:
Received 6 May 2009
Received in revised form 26 May 2009
Accepted 25 June 2009

Keywords:
Bidirectional heteroassociative memory
Competitive learning
Topological architecture
Unsupervised learning
Categorization
Self-organizing map

Abstract

In this paper, we present a new recurrent bidirectional model that encompasses correlational, competitive and topological model properties. The simultaneous use of many classes of network behaviors allows for the unsupervised learning/categorization of perceptual patterns (through input compression) and the concurrent encoding of proximities in a multidimensional space. All of these operations are achieved within a common learning operation, and using a single set of defining properties. It is shown that the model can learn categories by developing prototype representations strictly from exposition to specific exemplars. Moreover, because the model is recurrent, it can reconstruct perfect outputs from incomplete and noisy patterns. Empirical exploration of the model's properties and performance shows that its ability for adequate clustering stems from: (1) properly distributing connection weights, and (2) producing a weight space with a low dispersion level (or higher density). In addition, since the model uses a sparse representation (k-winners), the size of the topological neighborhood can be fixed, and no longer requires a decrease through time as was the case with classic self-organizing feature maps. Since the model's learning and transmission parameters are independent from learning trials, the model can develop stable fixed points in a constrained topological architecture, while being flexible enough to learn novel patterns.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

1.1. Cognitive science background

Every day, humans are exposed to situations in which theyare required to either differentiate or regroup perceptual patterns(such as objects presented to the visual system). Their percep-tual/cognitive system achieves these operations in order to pro-duce appropriate responses, upon the identity and properties ofthe encountered stimuli. To accomplish these perceptual tasks, thesystemmust create and enrich context-dependent memory repre-sentations, which are adapted to different environments, but canbe shared between these environments through generalization.This general process, known as perceptual learning, mainly con-sists in the implicit abstraction of previously unavailable informa-tion, leading to semi-permanent changes at the memory structurelevel (Gibson & Gibson, 1955). Most perceptual learning processes

∗ Corresponding author. Tel.: +1 613 562 5800x4419; fax: +1 613 562 5147.
E-mail addresses: [email protected] (S. Chartier), [email protected] (G. Giguère), [email protected] (D. Langlois).
1 Tel.: +1 512 471 4253.
2 Tel.: +1 613 562 5800x4481.

0893-6080/$ – see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.neunet.2009.06.011

can be achieved autonomously, through the associative abstraction of the environment's statistical structures (as some neural networks do: Goldstone, 1998; Hall, 1991).

One of the system's main goals in defining mental representations is cognitive economy (Goldstone & Kersten, 2003). Invariance is a quality that leads to economy by reducing the quantity of information that the system must take into account in a given situation, thus accelerating cognitive processes (or more precisely, retrieval of memory traces). Hence, in order to be efficient, the human perceptual/cognitive system must create representations including relevant statistical properties leading to quick differentiation, identification and recognition (Goldstone, 1998). These representations will be useful in future situations, when the need for a decision based on a new or repeating perceptual stimulation arises. Strictly memorizing invariant information lessens the computational burden on the cognitive system, and allows it to follow an information reduction strategy. At the object level, this strategy would be found in the form of a dimensional reduction applied to inputs (Edelman & Intrator, 1997).

Another strategy used by the system to reduce information and increase processing speed is the representation of similar perceptual stimulations by a single higher-level entity, created according to the common perceptual properties of the member objects. This process is called perceptual category learning (Murphy, 2002). Cognitive scientists have historically argued over the fact


that the human cognitive system either uses generic abstractions (such as prototypes) or very specific perceptual stimulations (such as complete exemplars) to achieve category learning and further classification (Komatsu, 1992). Prototype supporters believe that categories possess a special representational status within the cognitive system. Hence, every exemplar set is linked to a more generic, summary representation that can be retrieved independently, and for which the memory trace is stronger and more durable than for any exemplar (Posner & Keele, 1968, 1970). These representations help the system in quickly deciding on a stimulus' category membership, and are enhanced according to new incoming information. To achieve this, categories must be easy to differentiate; this is why the natural world is composed of cohesive categories, in which associated exemplars are similar to each other (Rosch, 1973).

Following the work of Knowlton and Squire (1993) and Knowlton, Mangels, and Squire (1996), we now know that the prototype theory is valid in categorization. Neuropsychological dissociation studies have led to the conclusion that while single object representations must be memorized in order to achieve recognition, identification and discrimination, categorical representations are completely separate in the system, and operate according to a prototype principle.

1.2. Modeling background

1.2.1. Recurrent/Bidirectional associative memories

When trying to achieve autonomous (or unsupervised) learning and categorization, many neural network options are available. A class of networks known to achieve these types of tasks is that of recurrent autoassociative memories (RAMs). In psychology, AI and engineering, autoassociative memory models are widely used to store correlated patterns. One characteristic of RAMs is the use of a feedback loop, which allows for generalization to new patterns, noise filtering and pattern completion, among other uses (Hopfield, 1982). Feedback enables a given network to shift progressively from an initial pattern towards an invariable state (namely, an attractor).

A problem with these models is that contrarily to what is found in real-world situations, they store learned information using noise-free versions of the input patterns. In comparison, to overcome the possibly infinite number of stimuli stemming, for instance, from multiple perceptual events, humans must regroup these unique inputs into a finite number of stimuli or categories. Also, while RAMs can perfectly reconstruct learned patterns through an iterative process, they cannot associate many distinct inputs with a single representation; that is, they cannot categorize.

A direct generalization of RAM models is the development of bidirectional associative memory (BAM) models (Kosko, 1988). BAMs can associate any two data vectors of equal or different lengths (representing, for example, a visual input and a prototype or category). These networks possess the advantage of being both autoassociative and heteroassociative memories (Kosko, 1988) and therefore encompass both unsupervised and supervised learning. A BAM model would be able to develop prototype representations linked to different exemplars, but only in a supervised fashion. Nowadays, many RAM/BAM models can display both stability and plasticity for a data set of exemplars (e.g. Davey and Hunt (2000) and Du, Chen, Yuan, and Zhang (2005)).
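A minimal Kosko-style BAM illustrates this feedback-to-attractor behavior; the toy bipolar pairs below are hypothetical, and the sketch is not the model proposed in this paper:

```python
import numpy as np

def train_bam(pairs):
    """Hebbian outer-product learning for a Kosko-style BAM:
    W accumulates the sum of y x^T over the pattern pairs."""
    return sum(np.outer(y, x) for x, y in pairs)

def recall(W, x, steps=10):
    """Bidirectional recall: iterate x -> y -> x with sign thresholding
    until the pair stabilizes (a bipolar fixed point, i.e. an attractor)."""
    for _ in range(steps):
        y = np.sign(W @ x)
        x_new = np.sign(W.T @ y)
        if np.array_equal(x_new, x):
            break
        x = x_new
    return x, y

# Two hypothetical bipolar pattern pairs of different lengths
x1 = np.array([1, -1, 1, -1]); y1 = np.array([1, 1, -1])
x2 = np.array([-1, -1, 1, 1]); y2 = np.array([-1, 1, 1])
W = train_bam([(x1, y1), (x2, y2)])
```

Because x1 and x2 are orthogonal here, each learned pair is a fixed point of the recall loop.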

1.2.2. Competitive networks

Overall, category formation in a perceptual framework is often seen as a process akin to classic clustering techniques, which involve partitioning stimulus spaces in a number of finite sets, or clusters (Ashby & Waldron, 1999). In cognitive modeling, competitive networks (e.g. Grossberg (1988) and Kohonen (1982)) are known for their capacity to achieve clustering behavior. In fact, they constitute local, dynamic versions of clustering algorithms. In these models, each output unit represents a specific cluster. When decisions are made, the association between an exemplar and its determined cluster unit in the output layer is strengthened. In winner-take-all (WTA) networks (Grossberg, 1988; Kohonen, 1982), exemplars may only be associated with one cluster (i.e. only one output unit at a time can be activated).

An example of one such hard competitive framework is that of the Adaptive Resonance Theory (ART: Grossberg, 1988). ART networks possess the advantage of being able to deal effectively with the exemplars/prototype scheme, while solving the stability/plasticity dilemma. These unsupervised models achieve the desired behavior through the use of a novelty detector (using vigilance). Various degrees of generalization can be achieved by this procedure: low vigilance parameter values lead to the creation of broad categories, while high values lead to narrow categories, with the network ultimately performing exemplar learning.

Another example of this framework is the self-organizing feature map (SOFM: Kohonen, 1982), which, in addition to showing WTA behaviors, uses a topological representation of inputs and outputs. Although SOFMs only consider one active output unit at a time, the learning algorithm also allows for physically close neighboring units to update their connection weights. In a SOFM, an exemplar may thus, for instance, be geometrically positioned between two clusters, and possess various degrees of membership.

An extension of the WTA principle selects the k largest outputs from the total n outputs (Majani, Erlanson, & Abu-Mostafa, 1989). This k-winners-take-all (kWTA) rule is thus a more general case of the WTA principle, within which exemplars may be associated with many clusters at differing degrees. This procedure provides a more distributed classification.

While extremely useful for object and category processing, competitive networks do not achieve recurrent behavior. They are thus generally sensitive to input noise during recall. ART networks can deal with noise through a novelty detection procedure, but do not encompass topological properties. In comparison, SOFMs show topological properties, but are not resistant to noise and do not show plasticity.
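The kWTA rule described above amounts to keeping the k largest of the n outputs; a small illustrative helper (activation values hypothetical):

```python
import numpy as np

def kwta(activations, k):
    """k-winners-take-all: keep the k largest activations, zero the rest.
    With k = 1 this reduces to the classic WTA rule."""
    a = np.asarray(activations, dtype=float)
    winners = np.argsort(a)[-k:]   # indices of the k largest outputs
    out = np.zeros_like(a)
    out[winners] = a[winners]
    return out

# Example: of n = 5 outputs, keep the k = 2 strongest
print(kwta([0.1, 0.9, 0.4, 0.7, 0.2], 2))   # only 0.9 and 0.7 survive
```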

1.3. Goals and presentation

In the present paper, we propose a new bidirectional heteroassociative memory (BHM), named BHM + SOFM, which encompasses kWTA and SOFM properties. This modification will enable a known BHM model (Chartier & Boukadoum, 2006a) to increase its clustering capacity; hence, higher network performance and readability should follow. In addition, this kWTA–SOFM version of a BHM will allow for sparse coding, which is a distributed representation principle supported by neuropsychological findings (Olshausen & Field, 2004). The network will also inherit properties from its recurrent memory status, such as attractor development and noise tolerance (Hassoun, 1989). Finally, using sparse coding, we will propose a modification to the original network (Chartier, Giguère, Langlois, & Sioufi, in press) that will enable it to solve a simple version (exemplar data set) of the stability–plasticity problem (Grossberg, 1987). Therefore, the goal of this study is to propose a general model based on BAMs, that shows properties similar to those found in other network types, such as competitive behavior in SOFMs.


2. Theoretical background: SOFMs and other general competitive models

Competitive models are based on a linear output function defined by:

y = Wx  (1)

where x is the original input vector, W is the weight matrix, and y is the output. Weight connection modifications follow a Hebbian learning principle with a forgetting term³ g(y_j)w_j, where g(y_j) must satisfy the following requirement:

g(y_j) = 0 for y_j = 0.  (2)

Therefore, the learning rule can be expressed by

w_j(p + 1) = w_j(p) + η(p)y_j x − g(y_j)w_j(p)  (3)

where w_j represents neuron j's weight vector, η represents a learning parameter, and p is the learning trial number. The specific choice of function g(y_j) can lead to two different unsupervised model rules. Eq. (3) will lead to Kohonen (1982)'s competitive learning rule if:

g(y_j) = ηy_j  (4)

and to a PCA learning rule (Oja, 1982) if:

g(y_j) = y_j²  and  η(p) = η.  (5)

Hence, both PCA and competitive neural networks constitute special cases of Hebbian learning using a forgetting term.

In a WTA perspective, the largest output must be selected. This allows for the detection of the winning unit, and of the topological neighborhood center's position for SOFMs. This is equivalent to minimizing the Euclidean distance between the input vector x and the weight vector w_j. If i(x) identifies the closest match for x, then the winning unit is determined by

i(x) = argmin_j ‖x − w_j‖,  j = 1, ..., n.  (6)

Moreover, in the original SOFM, the choice of a topological neighborhood h is based on a Gaussian function:

h_j,i(x)(p) = exp(−d²_j,i(x) / (2σ²(p))),  d²_j,i(x) = ‖r_j − r_i‖²  (7)

where r_j defines the position of neuron j, r_i defines the position of winning neuron i, and σ represents neighborhood size. Therefore, the connection between input x and winning unit i(x) will be maximally reinforced, and the level of strengthening for the remaining connections will be lower as the distance from the winning unit d_j,i(x) increases. In order to achieve stability, the size of the topological neighborhood and the value of the learning parameter on trial p are also required to decrease with time. This is done by using exponential decay:

σ(p) = σ0 exp(−p/τ1)  and  η(p) = η0 exp(−p/τ2)  (8)

where τ1 and τ2 represent distinct time constants, and σ0 and η0 respectively represent the initial values for the neighborhood size parameter σ and the learning parameter η. Because both the size of the topological neighborhood and the learning parameter decrease until stability, once a SOFM completes learning of a given set of patterns, it cannot adapt to a new extended set. Learning must be completely re-done, using new initial weights and a new learning set composed of old and new patterns. This type of model thus cannot show plasticity and stability simultaneously.

³ Using a forgetting term prevents the model's weight values from growing indefinitely, which would make the model ''explode'', as happens with strict Hebbian learning (Sutton, 1988).
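A sketch combining the forgetting-term update (Eq. (3) with g(y_j) = ηy_j), winner selection (Eq. (6)), the Gaussian neighborhood (Eq. (7)) and the decay schedules (Eq. (8)); grid size and parameter values below are hypothetical:

```python
import numpy as np

def winner(x, W):
    """Eq. (6): index of the weight vector closest to x (Euclidean)."""
    return int(np.argmin(np.linalg.norm(W - x, axis=1)))

def gaussian_neighborhood(positions, i, sigma):
    """Eq. (7): Gaussian falloff with squared grid distance from winner i."""
    d2 = np.sum((positions - positions[i]) ** 2, axis=1)
    return np.exp(-d2 / (2 * sigma**2))

def sofm_step(x, W, positions, p, eta0=0.1, sigma0=2.0, tau1=100.0, tau2=100.0):
    """One SOFM update: weights move toward x, scaled by the neighborhood
    (Kohonen form of Eq. (3)), with Eq. (8) exponential decays."""
    sigma = sigma0 * np.exp(-p / tau1)     # shrinking neighborhood
    eta = eta0 * np.exp(-p / tau2)         # decaying learning rate
    i = winner(x, W)
    h = gaussian_neighborhood(positions, i, sigma)
    W += eta * h[:, None] * (x - W)        # pull weights toward the input
    return W, i
```

Repeated application of `sofm_step` moves the winning unit (and, to a lesser degree, its grid neighbors) toward the presented input.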

Fig. 1. Network architecture.

3. Model properties

Like any artificial neural network, the proposed model, BHM + SOFM, can be entirely described by its architecture, as well as its transmission function and learning rule.

3.1. Architecture

The BHM + SOFM architecture (Fig. 1) is closely linked to that of the Bidirectional Heteroassociative Memory (BHM) proposed by Chartier and Boukadoum (2006a). It extends a previous effort by Chartier and Giguère (2008; see also: Chartier, Giguère, Renaud, Proulx, and Lina (2007); Giguère, Chartier, Proulx, and Lina (2007a) and Giguère, Chartier, Proulx, and Lina (2007b)). The architecture consists of two Hopfield-like neural networks interconnected in head-to-toe fashion (Hassoun, 1989). When connected, these networks allow a recurrent flow of information that is processed bidirectionally. As in a standard BAM, both layers serve as a ''teacher'' for the other layer, and the connections are explicitly depicted in the model.

The model thus includes two layers, x and y. The x layer serves as an initial network input, as well as for input reconstruction. Its size is equal to the number of pixels in the original input images. The y layer serves as an information compression layer; it applies and memorizes a dimensional reduction mapping (Chartier et al., 2007). Here, the compression is in fact a topological transformation into a 2D space. The compression level depends on the number of winning units used for layer y (Chartier & Giguère, 2008).

As with the BHM, two separate weight matrices are used, in order to represent the possibility of asymmetric connections, thus creating two distinct memories. Matrix W (the compression matrix) links layer x to layer y, and is used to determine a mapping allowing the transformation of inputs into their topologically-compressed version. Matrix V (the reconstruction matrix) links layer y back to layer x and allows the network to reconstruct the original pattern using a nonlinear combination of the learned components (or filters). Here, the main difference between the present model and Chartier and Boukadoum (2006a)'s BHM lies in the absence of an ''external'' input for layer y. In general, a BAM (such as the BHM) is used to associate two sets of known vectors; for each vector x, a pre-associated vector y must also be provided. In the present case, there is no initial input for layer y. This layer's content will be created during the original compression process, in which an original pattern will be associated with its own topologically reduced version.

Here, the network's compression layer y is further constrained using a rectangular 2D topology; this enables the network to behave like a SOFM. A SOFM not only categorizes the input data,


Fig. 2. ‘‘Mexican-hat’’ function for p = 1, τ = 435, σ0 = 10.

but also recognizes which input patterns are nearby each other in stimulus space. In order to achieve this, the SOFM's original topological neighborhood function was extended into a ''Mexican-hat'' function (Fig. 2). This function is well-suited to bipolar stimuli and is described by:

h_j,i(x)(p) = (1/2)[3 exp(−d²_j,i(x) / σ²(p)) − exp(−d²_j,i(x) / (4σ²(p)))];  d²_j,i(x) = ‖r_j − r_i‖²  (9)

where r_j defines the position of neuron j, r_i defines the position of winning neuron i, p represents the learning trial number, and σ represents neighborhood size. The size of the topological neighborhood on trial p is also required to exponentially decrease with time:

σ(p) = σ0 exp(−p/τ)  (10)

where τ represents a time constant, and σ0 represents the initial value for σ.

In order to emulate kWTA processes, the compressed (y) layer's architecture must be constrained using inhibitory connections from each compression unit to all units from that same layer. As more and more y units become active (i.e. going from a WTA to a kWTA-type setting), the network will evolve towards a more distributed (or ''sparse'') type of representation (Chartier & Giguère, 2008).

The learning process used by the new model can thus be described as follows: the network must first compute the product between W and initial input x(0), and transform the result according to the transmission function (described in the next section). This allows the model to determine the initial activation of each unit in layer y. Then, the k most activated y units are chosen through the Mexican-hat inhibitory process (Eq. (9)), and the initial compression y(0) is computed. This topologically-compressed version goes through matrix V, in order to produce final reconstruction x(1). Finally, the reconstruction goes through W once again, producing final compression y(1).⁴ The W weight matrix will be used to memorize the mappings between the input and the associated constrained representation, while the V connection matrix will be used to reconstruct inputs using compressed and constrained versions.
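Eqs. (9) and (10) are straightforward to evaluate; a small numerical sketch (parameter defaults taken from the Fig. 2 caption):

```python
import numpy as np

def mexican_hat(d2, sigma):
    """Eq. (9): difference of two Gaussians of the squared grid distance d2.
    Positive (excitatory) near the winner, negative (inhibitory) farther out."""
    return 0.5 * (3 * np.exp(-d2 / sigma**2) - np.exp(-d2 / (4 * sigma**2)))

def sigma_at(p, sigma0=10.0, tau=435.0):
    """Eq. (10): exponentially shrinking neighborhood size."""
    return sigma0 * np.exp(-p / tau)

# At the winner itself (d2 = 0) the function peaks at 0.5 * (3 - 1) = 1.0
```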

3.2. Transmission function

The transmission function is expressed by the following equations:

∀ i = 1, ..., N:
y_i(t + 1) = 1, if [Wx(t)]_i > 1;
y_i(t + 1) = −1, if [Wx(t)]_i < −1;
y_i(t + 1) = (δ + 1)[Wx(t)]_i − δ[Wx(t)]³_i, otherwise  (11)

and

∀ i = 1, ..., M:
x_i(t + 1) = 1, if [Vy(t)]_i > 1;
x_i(t + 1) = −1, if [Vy(t)]_i < −1;
x_i(t + 1) = (δ + 1)[Vy(t)]_i − δ[Vy(t)]³_i, otherwise  (12)

where N and M are the number of units in layers x and y respectively, i is the index of the respective vector element, y(t) and x(t) represent layer contents at iteration cycle t, W is the compression matrix, V is the reconstruction matrix, and δ is a general positive transmission parameter that should be fixed at a value lower than 0.5 to assure fixed-point behavior (Chartier & Proulx, 2005; Hélie, 2008). This function possesses the advantage of exhibiting continuous-valued (gray-level) attractor behavior (for a detailed example, see Chartier and Boukadoum (2006a)). Such properties contrast with networks using a standard nonlinear output function, which can only exhibit bipolar attractor behavior (e.g. Kosko (1988)).

⁴ Here, it is necessary to go through the W matrix twice to achieve weight updates, since learning is time-difference based.
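A sketch of this cubic transmission function, applied elementwise (δ = 0.1, below the 0.5 stability limit):

```python
import numpy as np

def transmit(a, delta=0.1):
    """Cubic transmission of Eqs. (11)-(12), applied elementwise to
    a = Wx(t) or Vy(t). Saturates at +/-1 outside [-1, 1]."""
    a = np.asarray(a, dtype=float)
    cubic = (delta + 1) * a - delta * a**3
    return np.where(a > 1, 1.0, np.where(a < -1, -1.0, cubic))

# -1, 0 and +1 are fixed points; values strictly between 0 and 1 drift
# toward +1 under iteration, which is what produces attractor behavior.
```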

3.3. Learning rule

Learning follows correlational principles. It is based on time-difference Hebbian association (Chartier & Boukadoum, 2006a; Chartier, Renaud, & Boukadoum, 2008; Chartier & Proulx, 2005; Kosko, 1990; Oja, 1989; Sutton, 1988), and is formally expressed by the following equations:

W(p + 1) = W(p) + η(y(0) − y(t))(x(0) + x(t))^T  (13)

V(p + 1) = V(p) + η(x(0) − x(t))(y(0) + y(t))^T  (14)

where η represents a learning parameter, y(0) and x(0) the initial patterns at t = 0, y(t) and x(t) the state vectors after t iterations through the network, and p the learning trial. The learning rule is thus very simple, and can be shown to constitute a generalization of Hebbian/anti-Hebbian correlation in its autoassociative memory version (Chartier & Boukadoum, 2006a; Chartier & Proulx, 2005). For weight convergence to occur, η must be set according to the following condition (Chartier et al., 2008):

η < 1 / (2(1 − 2δ)Max[N, M]),  δ ≠ 1/2.  (15)

Eqs. (13) and (14) show that the weights can only converge when the content of the layers is identical to the initial inputs (that is, y(t) = y(0) and x(t) = x(0)). The function therefore correlates directly with network outputs, instead of activations. As a result, the learning rule is dynamically linked to the network's output (unlike most BAMs).
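Eqs. (13)-(15) translate directly into code; a sketch (function names are ours):

```python
import numpy as np

def weight_update(W, V, x0, y0, xt, yt, eta):
    """Eqs. (13)-(14): time-difference Hebbian updates for both matrices.
    Updates vanish once yt = y0 and xt = x0 (weight convergence)."""
    W = W + eta * np.outer(y0 - yt, x0 + xt)
    V = V + eta * np.outer(x0 - xt, y0 + yt)
    return W, V

def eta_bound(N, M, delta):
    """Eq. (15): upper bound on the learning rate for convergence."""
    assert delta != 0.5
    return 1.0 / (2 * (1 - 2 * delta) * max(N, M))
```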

4. Simulations: General learning behavior

This first empirical section is concerned with the more ''cognitive'' simulations, that is, those related to the model's learning and categorization behavior. We will be interested in determining if we can successfully implement the kWTA and topological properties.

4.1. Simulation: Prototype-based categorization

Categorization, as already argued, is a crucial cognitive task. Adding competitive and topological properties to the original BHM will enable it to emulate this process, by creating a categorical landscape based on a topological 2D map. A first simulation was conducted in order to show the network's ability to cluster exemplars into categories. During recall, whenever the network associated two or more inputs with the exact same x layer reconstruction, these inputs were considered as being categorized together.


Fig. 3. (a) Examples of patterns used to train the network. (b) Converged inputs for the x layer.

4.1.1. Methodology

For this simulation, artificial pixel-based stimuli (Fig. 3(a)) were created. Four category prototypes were produced by generating bipolar-valued, 36-position vectors, for which each value (or ''feature'') followed a random discrete uniform distribution. The presence of a feature (black pixel) was represented by a value of +1, and absence (white pixel) by a value of −1. Five exemplars were generated using each prototype, by randomly ''flipping'' the value of one to four features (Fig. 3(a)). The average intra-category correlation was r = 0.72 (average inter-category correlation: r = 0.08).

Learning was conducted according to the following procedure:

0. Random weight initialization;
1. Random selection of a given exemplar (x(0));
2. Computation of the activation according to Eq. (11);
3. Selection of k winners;
4. Computation of the initial compression (y(0)) according to Eq. (9);
5. Computation of the reconstructed input (x(1)) according to Eq. (12);
6. Computation of the compressed output (y(1)) according to Eq. (11);
7. Weight updates according to Eqs. (13) and (14);
8. Repetition of 1 to 7 for all exemplars;
9. Repetition of 1 to 8 until the mean error (‖y(0) − y(1)‖/g), where g is the number of exemplars in the learning set, is sufficiently low (<0.1).
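Steps 0-9 can be condensed into a short sketch; the sizes and data below are hypothetical toys, and plain top-k selection stands in for the Mexican-hat inhibitory process of Eq. (9):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, k, eta, delta = 6, 4, 2, 0.01, 0.1     # hypothetical small sizes

def f(a):  # transmission function, Eqs. (11)-(12)
    c = (delta + 1) * a - delta * a**3
    return np.where(a > 1, 1.0, np.where(a < -1, -1.0, c))

W = rng.normal(0.0, 0.1, (M, N))             # step 0: random weights
V = rng.normal(0.0, 0.1, (N, M))
patterns = rng.choice([-1.0, 1.0], (8, N))   # hypothetical bipolar exemplars

errors = []
for block in range(30):                      # step 9: repeat over blocks
    err = 0.0
    for x0 in patterns[rng.permutation(len(patterns))]:  # step 1
        a = f(W @ x0)                                    # step 2
        y0 = np.where(a >= np.sort(a)[-k], a, 0.0)       # steps 3-4
        x1 = f(V @ y0)                                   # step 5
        y1 = f(W @ x1)                                   # step 6
        W += eta * np.outer(y0 - y1, x0 + x1)            # step 7, Eq. (13)
        V += eta * np.outer(x0 - x1, y0 + y1)            #          Eq. (14)
        err += np.linalg.norm(y0 - y1)
    errors.append(err / len(patterns))       # mean-error criterion
```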

Since the BHM + SOFM is a recurrent network, the following iterative recall process, using the converged weight matrices, was used:

1. Selection of a given exemplar;
2. Iteration through the network (with selection of k winners during each iteration) until the contents of both layers have converged, that is: x(t) = x(t + 1) and y(t) = y(t + 1);
3. Repetition of 1 and 2 for all exemplars;
4. Determination of the category landscape: if the final compressions y for two or more exemplars are identical, then these exemplars are members of the same category.
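The recall loop above can be sketched as follows (again with plain top-k selection and hypothetical sizes; converged weights would normally come from training):

```python
import numpy as np

delta, k = 0.1, 2

def f(a):  # transmission function, Eqs. (11)-(12)
    c = (delta + 1) * a - delta * a**3
    return np.where(a > 1, 1.0, np.where(a < -1, -1.0, c))

def recall(x, W, V, max_iter=200):
    """Step 2: iterate x -> y -> x until the x layer stops changing,
    i.e. the network has settled on an attractor."""
    y = np.zeros(W.shape[0])
    for _ in range(max_iter):
        a = f(W @ x)
        y = np.where(a >= np.sort(a)[-k], a, 0.0)  # k winners each pass
        x_new = f(V @ y)
        if np.allclose(x_new, x):                  # x(t) = x(t + 1)
            return x_new, y
        x = x_new
    return x, y

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, (4, 6))   # stand-in weights for illustration
V = rng.normal(0.0, 0.1, (6, 4))
xc, yc = recall(rng.choice([-1.0, 1.0], 6), W, V)
```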

For this simulation, parameters were set to the following values: η = 0.01; δ = 0.1; σ0 = 10; and τ = 100. The number of y units was set to 100 (10 × 10 grid), and the number of winning units was set to 15.

4.1.2. Results

After 236 learning blocks, the error (‖y(0) − y(1)‖/g) fell below the criterion, and learning therefore stopped. Fig. 4 shows the associations between category membership and y-layer units; each category is represented by a different color. To determine category association, each exemplar was passed once through the transmission function (Eq. (11)), using the W matrix in its current state, at different stages of learning (i.e., after p blocks). For each unit of layer y, the activation produced by each exemplar was computed, and the category containing the exemplar with the highest level of activation was selected for that unit. Each panel represents the ‘‘categorical landscape’’ after p learning blocks. It can be seen that as the number of learning blocks increases, category sensitivity is adequately represented topologically in the matrix. The network creates four well-separated categories, each occupying approximately 25% of the total space. The model thus seemingly acts in a way similar to Kohonen’s SOFM model (see Chartier et al. (in press)).

The BHM + SOFM is a recurrent model, and should therefore preserve autoassociative memory properties (Hassoun, 1989). During the recall phase, when the initial inputs (Fig. 3(a); x(0)) are iterated until convergence (x(c)), the network is led to an invariant state (or attractor). Fig. 3(b) shows the converged recall output for the x layer (x(c)). The network develops four attractors, one for each category. Each exemplar from a category reaches a common attractor, and the groupings respect predetermined category membership: the network has extracted prototypes from the related exemplars. In this case, because the within-category correlations are quite high and the number of output units (100) is low, the task can be achieved without supervision. As shown in the next simulation, if one is to store specific exemplars using unsupervised learning, then the number of output units must be increased.5

4.2. Simulation: Learning pixel-matrix letters

4.2.1. Methodology

The previous simulation was replicated using stimuli with no predetermined categorical (or cluster) membership, thus making it an exemplar learning (or identification) task. Here, input patterns must not be clustered (as in the categorization task), but rather differentiated; each stimulus must ‘‘form its own cluster’’. We were interested once again in studying the topological properties, but we also wished to explore the simultaneous reconstruction capacities of the network, under normal and noisy

5 Of course, there are other ways to encode the desired number of exemplars in a more efficient way. For instance, a vigilance parameter could be used (Grossberg, 1988).


S. Chartier et al. / Neural Networks 22 (2009) 568–578 573

Fig. 4. 2D topology as a function of p learning blocks.

Fig. 5. Set of patterns used for training.

Fig. 6. 2D topology for weight connections. Each letter represents the output’s stable solution. Blank boxes indicate unused weight connections.

conditions (recurrent network properties). The patterns used for the simulation were 7 × 7 pixel matrices, each representing a letter of the alphabet (Fig. 5). White and black pixels were respectively assigned values of −1 and +1. Correlations (in absolute values) between the patterns varied from 0.02 to 0.84. Because the network had to discriminate between a larger number of ‘‘classes’’ than in the previous simulation (10 stimuli vs. 4 categories), the size of the output grid was increased to 400 units (20 × 20). The learning parameter η was set to 0.001 and the number of winning units was increased to 20 (5% of the output units’ resources). Learning was conducted using the procedure described in the previous section. However, in this case, the learning error is based on correct reconstruction of the x layer, and was therefore defined as ‖x(0) − x(1)‖/g. The stopping criterion was changed to 0.001.

In addition to regular recall, an additional recall phase was

conducted. The same recall procedure was used; however, the original exemplars were initially contaminated with various types of noise (random pixels flipped, addition of normally distributed random noise, a specific portion of a given pattern removed).
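The three contamination schemes can be sketched as follows; the 7 × 7 bipolar pattern here is a random stand-in for the letter matrices of Fig. 5, and the helper names are ours:

```python
import numpy as np

rng = np.random.default_rng(7)
letter = rng.choice([-1.0, 1.0], size=(7, 7))   # stand-in for a letter matrix

# (1) Randomly flip n pixels (e.g., 6 pixels of the letter "A").
def flip_pixels(pattern, n, rng):
    noisy = pattern.copy().ravel()
    idx = rng.choice(noisy.size, size=n, replace=False)
    noisy[idx] *= -1.0
    return noisy.reshape(pattern.shape)

# (2) Add zero-mean Gaussian noise (e.g., sd = 0.5, as for the letter "F").
def add_gaussian(pattern, sd, rng):
    return pattern + rng.normal(0.0, sd, size=pattern.shape)

# (3) Remove a portion of the pattern by zeroing it (as for "B" and "J").
def remove_rows(pattern, n_rows):
    noisy = pattern.copy()
    noisy[:n_rows, :] = 0.0
    return noisy

flipped = flip_pixels(letter, 6, rng)
```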

Fig. 7. Noisy recall of different patterns after 20 cycles through the network. The first pattern is the letter ‘‘A’’, for which 6 pixels have been randomly flipped. For the second pattern, a random noise vector N ∼ (0, 0.5) has been added to the letter ‘‘F’’. The third pattern is the letter ‘‘B’’, for which the upper-portion pixels have been set to zero. The fourth pattern is the letter ‘‘J’’, for which the right-portion pixels have been set to zero.

4.2.2. Results

4.2.2.1. SOFM properties. The weights converged after approximately 4500 learning blocks. Fig. 6 shows the best-matched pattern for each unit of the y layer, once all 20 winning output units have stabilized. Once again, for each unit, the most activated pattern was selected. Each pattern is well separated into a closed, well-delimited region, and the topological representation shows how common features lead the network to represent some letters closer together in space. For example, letters E and F, which share a left vertical bar as well as two horizontal bars, are represented as closer to each other than to letter I, which does not share any of these features. Letters C and G, which are highly similar, are also closer in 2D space. It is clear that only a small portion of the 2D map is used at one time. It is thus apparent that the BHM can successfully encompass topological and kWTA properties in a more general learning situation.

4.2.2.2. Autoassociative properties. The model was able to perfectly reconstruct the original patterns from Fig. 5. Because the architecture is bidirectional, these reconstructed vectors are also invariant states in the network (Hassoun, 1989). This allows the network to filter noise and complete patterns just like any recurrent associative memory (e.g. Hopfield (1982)). Fig. 7 illustrates these properties through a noisy recall task, where different examples of input patterns are contaminated with various types of noise (random pixels flipped, addition of normally distributed random noise, a specific portion of a given pattern removed). As shown, the contamination does not prevent the network from recalling one of the original patterns.

We compared the performance of the new model, BHM + SOFM,

with the original BHM in a more systematic fashion; random pixels were flipped during recall. Fig. 8 shows the recall performance of both networks when the value of 1 to 15 pixels was changed (from 1 to −1 or vice versa). The performance level of the new model is generally slightly lower than that of the original BHM when pixels are flipped. However, the original BHM is not an unsupervised model per se, and does not display any SOFM properties. Also, the new model’s performance is still higher than that of the original BAM (Kosko, 1988).


Fig. 8. Performance comparison between the BHM + SOFM and the standard BHM, as a function of the proportion of pixels flipped.

4.3. Discussion

In this section, we have shown that it can be advantageous to reunite principles stemming from competitive and topological networks with bidirectional Hebbian (correlational) properties. The resulting model can achieve topological categorization and identification, while preserving autoassociative memory behaviors, such as attractor development and resistance to noise.

While the model performs well, it still retains one of the major downsides of the original SOFM. In order for this model to learn adequately, the size of the topological neighborhood and the value of the learning parameter are required to decrease with time (Eqs. (6) and (7)). This leads to stability in the model, but prevents the model from learning novel stimuli without having to restart the learning process, using new weights and an extended data set.

Although the BHM + SOFM can use a constant learning parameter value (Eqs. (13) and (14)), the variance of the Gaussian function must still decrease with time (Eqs. (9) and (10)), making it impossible to learn novel stimuli after a finalized learning phase. Hence, just like the SOFM, the BHM + SOFM model shows stability, but not plasticity (Grossberg, 1988). In a topologically-constrained network using a WTA learning algorithm (such as the SOFM), there is no way to achieve simultaneous stability and plasticity. However, if one allows the use of several winning units (a kWTA situation), then, under some conditions, it is possible to preserve essential topological properties while achieving novel learning.

The principle is straightforward. Fig. 9(a) illustrates a single y unit using a Mexican-hat function (generally used in a WTA case for bipolar patterns, as well as in our model so far). In a WTA situation, if unit A wins the competition, then, depending on the size of its topological neighborhood (solid line), the connections of another unit C will either be positively reinforced, if unit C is included in unit A’s neighborhood (Fig. 9(a)), or negatively reinforced (inhibited), if it is outside of that neighborhood (Fig. 9(b)). If two units, both using Mexican-hat functions, are allowed to win (a kWTA situation), then, depending on the distance between winning units A and B, unit C may be negatively reinforced by both units if it is positioned outside of their neighborhoods (Fig. 9(c); dashed line), slightly positively reinforced if it is part of both units’ neighborhoods (Fig. 9(d)), and may even receive more combined positive reinforcement than either of the two winning units if these units’ neighborhoods show a large overlap (Fig. 9(e)).

This last situation illustrates how the combination of excitation and inhibition from many winning units can ‘‘bring the units together’’ to form a single cluster while keeping the topological size constant. Therefore, in a kWTA setting, instead of having the size of the topological neighborhood function decrease over time, the distance between the winners has to decrease over time. This can be accomplished by using a converging learning algorithm that uses time-difference association (e.g., Eqs. (13) and (14)).

To sum up, the size of the topological neighborhood (standard deviation in Eq. (9)) will from now on be independent of the number of learning trials. Eq. (10) thus becomes:

$$h_{j,i(x)} = \frac{1}{2}\left(3e^{-\frac{d_{j,i(x)}^{2}}{\sigma^{2}}} - e^{-\frac{d_{j,i(x)}^{2}}{4\sigma^{2}}}\right); \qquad d_{j,i(x)}^{2} = \left\| \mathbf{r}_{j} - \mathbf{r}_{i} \right\|^{2} \tag{16}$$

where σ (>0) is a general free parameter. The value for σ should be selected as a function of the number of desired winners and the size of the grid. A higher number of winners, or a larger grid, should lead to a higher σ value being used.
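Eq. (16) with a constant σ can be implemented directly. The sketch below also illustrates the overlap effect of Fig. 9(e): with two winners placed close together, a unit midway between them receives more combined reinforcement than either winner does. The grid and parameter values are ours, chosen for illustration:

```python
import numpy as np

def mexican_hat(d2, sigma):
    """Eq. (16): h = (1/2) * (3 exp(-d^2/sigma^2) - exp(-d^2/(4 sigma^2)))."""
    return 0.5 * (3.0 * np.exp(-d2 / sigma**2) - np.exp(-d2 / (4.0 * sigma**2)))

def neighborhood(winner, grid_shape, sigma):
    """h_{j,i(x)} over a 2D grid, with d^2 = ||r_j - r_i||^2 and constant sigma."""
    rows, cols = np.indices(grid_shape)
    d2 = (rows - winner[0]) ** 2 + (cols - winner[1]) ** 2
    return mexican_hat(d2, sigma)

h = neighborhood((5, 5), (10, 10), sigma=4.0)

# Fig. 9(e): two winners 2 grid steps apart; compare the combined reinforcement
# at a winner's own position with that received by the unit halfway between them.
sigma = 3.0
at_winner = mexican_hat(0.0, sigma) + mexican_hat(2.0**2, sigma)
at_midpoint = 2.0 * mexican_hat(1.0**2, sigma)
```

With these values, the midpoint unit receives more total reinforcement than either winner, which is the mechanism that pulls overlapping winners into a single cluster.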

5. Simulations: General properties

The following subsections will explore some of the more general properties of the model, which now uses a new, constant definition of neighborhood size. We will use results from the next set of simulations in order to determine, in a following study, if the change in defining neighborhood size allows for simultaneous stability and plasticity.

5.1. Simulation: Number of winning units

The network replicated the categorization task used in Section 4.1. We studied the effect of varying the number of winning units used in a given simulation (from 1 to 15 units). Two performance indicators were used. We first measured the distance between the actual and optimal (uniform) weight distributions. Given that four categories were used, the optimal proportion of weights related to each category should be equal to 25% (or 0.25). In addition, a dispersion index was used. Here, for each category, the distances between each unit in the grid and all the other units belonging to the same category are computed, averaged and standardized. The process is repeated for all categories,


Fig. 9. Effect of combining 2 winners using the ‘‘Mexican-hat’’ transmission. In a WTA setting, the reinforcement of unit C will be a function of the size of the topological neighborhood of winning unit A. The reinforcement can be positive (a) or negative (b). In a kWTA setting, the reinforcement of unit C will be a function of the distance between winning units A and B. If the distance is too large, unit C will be negatively reinforced (c). If the distance is small, then unit C can be positively reinforced (d), as in a WTA situation (a), or even be more reinforced than the two winning units (e) to form a single cluster.

and then globally averaged. More formally, the distance d for a given category c is given by

$$d_{c} = \frac{\sum_{i}\sum_{j}\sum_{k}\sum_{l}\left| z_{ij} - z_{kl} \right|}{n_{c}} \tag{17}$$

where z represents the position of a particular unit from category c, and n_c the total number of units in category c. The global standardized value of the dispersion index is given by

$$\mathrm{dispIndex} = \frac{\bar{d}_{c} - \mathrm{MinDisp}}{\mathrm{MaxDisp} - \mathrm{MinDisp}} \tag{18}$$

where d̄_c represents the average distance over all categories, and MinDisp and MaxDisp respectively represent the minimum and maximum theoretical dispersion. Both the minimum and maximum will vary according to the size of the topological grid. The dispersion index measures within-cluster variability, and thus establishes how dense the developed clusters are. Values range from 0 (maximally dense) to 1 (maximally sparse).
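Eqs. (17) and (18) can be implemented as below. Following Eq. (17) as printed, the pairwise sum is normalized by n_c; the MinDisp and MaxDisp bounds must be supplied for the grid at hand, and the function names are ours:

```python
import numpy as np

def category_distance(positions):
    """Eq. (17): sum of pairwise distances |z_ij - z_kl| between the grid
    positions of one category's units, divided by n_c (as printed)."""
    z = np.asarray(positions, dtype=float)
    diffs = z[:, None, :] - z[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=-1)   # all-pairs distance matrix
    return pairwise.sum() / len(z)

def dispersion_index(mean_distance, min_disp, max_disp):
    """Eq. (18): standardize the average category distance between the
    theoretical minimum and maximum dispersion of the grid."""
    return (mean_distance - min_disp) / (max_disp - min_disp)

d_tight = category_distance([(2, 2), (2, 3), (3, 2)])   # a dense cluster
d_loose = category_distance([(0, 0), (0, 9), (9, 0)])   # scattered units
```

A tight cluster yields a smaller category distance (and hence a lower dispersion index) than the same number of units scattered across the grid.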

5.1.1. Methodology

For each separate simulation, the learning and recall procedures were identical to those of Section 4.1. In order to obtain a robust

Fig. 10. Distance and dispersion indices as a function of the number of winning units.

estimate of the model’s performance, for each different variable value (1 to 15 units), learning and recall were repeated 50 times, each time with a different set of exemplars (stemming from randomly-generated prototypes).

5.1.2. Results

Fig. 10 shows that the dispersion index (and thus within-cluster variability) decreases as the number of winners increases. This


Fig. 11. (a) Distance index as a function of output grid size for the BHM + SOFM and the SOFM. (b) Dispersion index.

means that cluster density increases as the number of winning units increases from 1 to 11, after which it stabilizes. Fig. 10 also shows that the weight distribution was fairly optimal (low average distance between the observed and theoretical proportions) for all conditions. Therefore, if the number of winning units is too low (10 or less), the network will use about 25% of the available units to encode the patterns (low distance), but these units will be scattered all over the map (as shown by the high dispersion index). With approximately 11 units or more, the network will encode the four categories with the same optimal weight proportion and will create well-defined clusters (as shown by the low dispersion index).

5.2. Simulation: Size of the topological grid

The previous task was replicated using a different variable, namely the size of the 2D topological grid (from 4 × 4 (16 units) to 8 × 8 (64 units)). Following the results from the previous simulation, in order to obtain well-distributed weights and dense clusters, the number of winning units was kept constant at 15. For this simulation, performance was compared to that of the original SOFM (Haykin, 1999; Kohonen, 1982).

5.2.1. Methodology

For each separate simulation, the learning and recall procedures were identical to those of Section 4.1, except of course for the size of the topological function σ, which remained constant throughout (σ = 4). For each different variable value, learning and test were repeated 50 times, each time using a different set of patterns. The performance indicators from the preceding simulation were used.

5.2.2. Results

Fig. 11 illustrates the performance for the BHM + SOFM and the SOFM as it relates to distance from the optimal weight

Fig. 12. Error as a function of the number of epochs, for each learning phase.

proportions (Fig. 11(a)) and the dispersion index (Fig. 11(b)). In general, the value of both performance indicators for the SOFM remained relatively unchanged when varying grid size. In contrast, because the BHM + SOFM model uses several winning units, its performance is a function of the grid’s dimensionality. Using a grid size of at least 8 × 8 with 15 winning units enables the BHM + SOFM to show better performance than the original SOFM on both measures.

6. Simulations: Simultaneous stability and plasticity

Now that certain constants have been established, we can determine the usefulness of the constant neighborhood size parameter, by testing whether the network can learn novel patterns without losing memory for a previously topologically-encoded pattern set.

6.1. Simulation: Learning pixel-matrix letters

This simulation aims at determining whether the model can increase its memory for novel patterns. Two learning phases are defined. During the first phase, the network must learn four letter patterns (Fig. 5(a): letters A to D). Once learning is complete, the network enters a second learning phase, where the number of patterns to learn is increased from four to six (the four original patterns plus two new ones; Fig. 5(a): letters A to F).

6.1.1. Methodology

Patterns used in both phases are identical to those illustrated in Fig. 5(a). For each phase, the learning and recall procedures were identical to those described in Section 4.1, using the error measure defined as ‖x(0) − x(1)‖/g, and a stopping criterion set to 0.001. During the first phase, pixel versions of the uppercase letters A, B, C and D were presented to the network. During the second phase, the letters E and F were added to the original pattern set. Because a total of only 6 patterns was to be learned, a 15 × 15 topological grid was used. The number of winning units was set to 20; the size of the topological function was set to σ = 4 and remained constant throughout learning. Learning was achieved in a blocked fashion.

6.1.2. Results

Fig. 12 shows the error as a function of the number of learning blocks, or epochs. Following completion of the first learning phase, the error is minimal, as expected. When the second phase begins, the error value increases suddenly, but remains much lower than the error measured on the first block of the initial learning phase. The sudden increase is clearly attributable to the new patterns


Fig. 13. 2D topology for weight connections, for each of the learned letters. (a) Beginning of Phase 1. The first four letters are scattered all over the map. (b) End of Phase 1 and beginning of Phase 2. The first four letters have been learned, but the new inputs (‘‘E’’ and ‘‘F’’) are scattered all over the map. (c) End of Phase 2. The six letters have been learned properly.

being learned; if the initial patterns had been forgotten, the error value would have been similar to that measured at the beginning of the initial learning phase. This result thus suggests that the network has preserved its memory for the original patterns, even though new patterns were inserted in the learning set.

Another way to explore this behavior is by examining the topological grid at certain stages of learning. Fig. 13(a) shows the 15 × 15 topological grids for the first four patterns (letters A to D) after only one block of training (Phase 1). It can be seen that no cluster has been formed in stimulus space: the 20 winning units are scattered all over the bidimensional map. After completing the first learning phase, clusters have been formed, and the network has correctly learned the initial pattern set. During the subsequent phase, two additional patterns are added to the learning set. Fig. 13(b) shows the grids after one block of learning in the second phase. It can be seen that while the new patterns have not been learned yet, adding new patterns to the learning set has absolutely no effect on the grids for the initial four patterns. Once the second learning phase is complete, all six patterns have been learned correctly (Fig. 13(c)); the output now forms six well-defined clusters. These results therefore show that the network is able to increase its memory for patterns without losing what it had already stored. This behavior cannot be achieved by a strict SOFM.

7. General discussion

In this paper, we have proposed a new model encompassing correlational, competitive and topological properties. The new model, BHM + SOFM, primarily stems from Chartier and Boukadoum’s (2006a) BHM. The first modification to the original BHM concerns the architecture: while the BHM uses two inputs (supervised model), the BHM + SOFM only needs one (unsupervised model). Also, a kWTA setting, a topological grid, and a Mexican-hat function were added to the original model.

A first set of simulations showed that uniting different

classes of network properties allows the model to categorize and learn patterns in a topological fashion, while retaining the BHM’s basic properties, such as attractor development and noise tolerance during recall. The recurrent architecture allows for the simultaneous development of prototypes in one layer, and of clusters in the other layer. All the tasks are achieved using a single architecture, as well as the same learning and transmission rules.

A slight modification to the model allows it to show both

stability and plasticity, properties that had not previously been shown simultaneously in neural networks based on topological architectures. This is a very interesting breakthrough since, as humans, we do not forget as soon as we learn something new. However, the variance of the Gaussian is fixed, making convergence to a properly clustered solution a function of the number of winning units and the size of the grid. As more and more units are allowed, the network will produce representations that are less sparse.

A solution would be to increase the value of parameter σ.

However, if this value is too high, the network will not be able to differentiate between categories. In comparison, if σ is small, then it is hypothesized that the network will regroup the outputs into more than one cluster, but fewer than the number of winning units. Therefore, further studies should evaluate the possibility of optimizing the value of σ as a function of the size of the grid and the number of winning units. The type and number of clusters could then be evaluated, in order to provide an estimation of grouping (shared clusters) between exemplars and categories. In addition, although the type of stability/plasticity shown by the model is identical to the one achieved by RAM/BAMs, its flexibility could be increased if a novelty detector (vigilance) were implemented (Grossberg, 1988).

The new set of properties presented in this paper adds to

the many other BHM framework properties, such as bidirectional association of gray-level patterns, aperiodic recall, many-to-one association, multi-step pattern recognition, perceptual feature extraction, input compression and signal separation (Chartier & Boukadoum, 2006a, 2006b; Chartier & Giguère, 2008; Chartier et al., 2007, 2008; Giguère et al., 2007a, 2007b). In short, the BHM framework can replicate various behaviors shown by recurrent associative memories, principal component analysis and self-organizing maps, while keeping the same learning and output functions, as well as its general bidirectional heteroassociative architecture. This result constitutes a step toward unifying various classes of networks within a general architecture.

Finally, while it was shown that the addition of competitive

and topological properties was beneficial to the network, some of the desirable properties of the model may stem from the


simple modification made to the architecture. Work is already under way to explore the basic features of a fully distributed version of the network, which then becomes totally equivalent to a dynamic heteroassociative memory (Hassoun, 1989). Further studies will also focus on the extent to which this framework can reproduce other kinds of human learning behaviors.

References

Ashby, F. G., & Waldron, E. M. (1999). On the nature of implicit categorization. Psychonomic Bulletin & Review, 6, 363–378.

Edelman, S., & Intrator, N. (1997). Learning as extraction of low-dimensional representations. In R. L. Goldstone, D. L. Medin, & P. G. Schyns (Eds.), The psychology of learning and motivation: Vol. 36. Perceptual learning (pp. 353–380). San Diego, CA: Academic Press.

Chartier, S., & Boukadoum, M. (2006a). A bidirectional heteroassociative memory for binary and grey-level patterns. IEEE Transactions on Neural Networks, 17, 385–396.

Chartier, S., & Boukadoum, M. (2006b). A sequential dynamic heteroassociative memory for multi-step pattern recognition and one-to-many association. IEEE Transactions on Neural Networks, 17, 59–68.

Chartier, S., & Giguère, G. (2008). Autonomous perceptual feature extraction in a topology-constrained architecture. In B. C. Love, K. McRae, & V. M. Sloutsky (Eds.), Proceedings of the 30th Annual Conference of the Cognitive Science Society (pp. 1868–1873). Austin, TX: Cognitive Science Society.

Chartier, S., Giguère, G., Langlois, D., & Sioufi, R. Bidirectional associative memories, self-organizing maps and k-winners-take-all: Uniting feature extraction and topological principles. In Proceedings of the 2009 International Joint Conference on Neural Networks (in press).

Chartier, S., Giguère, G., Renaud, P., Proulx, R., & Lina, J.-M. (2007). FEBAM: A feature-extracting bidirectional associative memory. In Proceedings of the 2007 International Joint Conference on Neural Networks (pp. 1679–1687).

Chartier, S., Renaud, P., & Boukadoum, M. (2008). A nonlinear dynamic artificial neural network model of memory. New Ideas in Psychology, 26, 252–277.

Chartier, S., & Proulx, R. (2005). NDRAM: Nonlinear dynamic recurrent associative memory for learning bipolar and nonbipolar correlated patterns. IEEE Transactions on Neural Networks, 16, 1393–1400.

Davey, N., & Hunt, S. P. (2000). A comparative analysis of high performance associative memory models. In Proceedings of the 2nd International ICSC Symposium on Neural Computation (NC’2000) (pp. 55–61).

Du, S., Chen, Z., Yuan, Z., & Zhang, X. (2005). Sensitivity to noise in bidirectional associative memory (BAM). IEEE Transactions on Neural Networks, 16, 887–898.

Gibson, J. J., & Gibson, E. (1955). Perceptual learning: Differentiation or enrichment? Psychological Review, 62, 32–41.

Giguère, G., Chartier, S., Proulx, R., & Lina, J.-M. (2007a). Category development and reorganization using a bidirectional associative memory-inspired architecture. In R. L. Lewis, T. A. Polk, & J. E. Laird (Eds.), Proceedings of the 8th International Conference on Cognitive Modeling (pp. 97–102). Ann Arbor, MI: University of Michigan.

Giguère, G., Chartier, S., Proulx, R., & Lina, J.-M. (2007b). Creating perceptual features using a BAM-inspired architecture. In D. S. McNamara, & J. G. Trafton (Eds.), Proceedings of the 29th Annual Conference of the Cognitive Science Society (pp. 1025–1030). Austin, TX: Cognitive Science Society.

Goldstone, R. L. (1998). Perceptual learning. Annual Review of Psychology, 49, 585–612.

Goldstone, R. L., & Kersten, A. (2003). Concepts and categories. In A. F. Healy, & R. W. Proctor (Eds.), Experimental psychology: Vol. 4. Comprehensive handbook of psychology (pp. 591–621). New York: Wiley.

Grossberg, S. (1987). Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11, 23–63.

Grossberg, S. (1988). Neural networks and natural intelligence. Cambridge, MA: MIT Press.

Hall, G. (1991). Perceptual and associative learning. Oxford: Clarendon Press.

Hassoun, M. H. (1989). Dynamic heteroassociative neural memories. Neural Networks, 2, 275–287.

Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice Hall.

Hélie, S. (2008). Energy minimization in the nonlinear dynamic recurrent associative memory. Neural Networks, 21, 1041–1044.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the USA, 79, 2554–2558.

Knowlton, B. J., Mangels, J. A., & Squire, L. R. (1996). A neostriatal habit learning system in humans. Science, 273, 1399–1402.

Knowlton, B. J., & Squire, L. R. (1993). The learning of categories: Parallel brain systems for item memory and category knowledge. Science, 262, 1747–1749.

Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.

Komatsu, L. (1992). Recent views of conceptual structure. Psychological Bulletin, 112, 500–526.

Kosko, B. (1988). Bidirectional associative memories. IEEE Transactions on Systems, Man and Cybernetics, 18, 49–60.

Kosko, B. (1990). Unsupervised learning in noise. IEEE Transactions on Neural Networks, 1, 44–57.

Majani, E., Erlanson, R., & Abu-Mostafa, Y. (1989). On the k-winners-take-all network. In D. Touretsky (Ed.), Advances in neural information processing systems (pp. 634–642). San Mateo, CA: Morgan Kaufmann.

Murphy, G. L. (2002). The big book of concepts. Cambridge, MA: MIT Press.

Oja, E. (1982). Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15, 267–273.

Oja, E. (1989). Neural networks, principal components and subspaces. International Journal of Neural Systems, 1, 61–68.

Olshausen, B. A., & Field, D. J. (2004). Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14, 481–487.

Posner, M. I., & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77, 353–363.

Posner, M. I., & Keele, S. W. (1970). Retention of abstract ideas. Journal of Experimental Psychology, 83, 304–308.

Rosch, E. (1973). Natural categories. Cognitive Psychology, 4, 328–350.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.