
The Organization of Behavior into Temporal and Spatial Neighborhoods

Mark Ring
IDSIA / University of Lugano / SUPSI
Galleria 1
6928 Manno-Lugano, Switzerland
Email: [email protected]

Tom Schaul
Courant Institute of Mathematical Sciences
New York University
715 Broadway, New York, NY 10003
Email: [email protected]

Abstract—The mot¹ framework [1] is a system for learning behaviors while organizing them across a two-dimensional, topological map such that similar behaviors are represented in nearby regions of the map. The current paper introduces temporal coherence into the framework, whereby temporally extended behaviors are more likely to be represented within a small, local region of the map. In previous work, the regions of the map represented arbitrary parts of a single global policy. This paper introduces and examines several different methods for achieving temporal coherence, each applying updates to the map using both spatial and temporal neighborhoods, thus encouraging parts of the policy that commonly occur together in time to reside within a common region. These methods are analyzed experimentally in a setting modeled after a human behavior-switching game, in which players are rewarded for producing a series of short but specific behavior sequences. The new methods achieve varying degrees—in some cases high degrees—of temporal coherence. An important byproduct of these methods is the automatic decomposition of behavior sequences into cohesive groupings, each represented individually in local regions.

I. BACKGROUND

The mot framework [1] is a system for continual learning [2] in which behaviors are organized into a two-dimensional map according to their similarity.² This organization was conjectured to convey many useful properties to the learning agent—properties such as robustness, non-catastrophic forgetting, and intelligent resource allocation. One method shown for achieving such a map in practice was by laying out reinforcement-learning modules (SERL modules [3]) in a two-dimensional grid and then updating these modules in local spatial neighborhoods, much like the nodes of self-organizing maps (SOMs) are updated in local spatial neighborhoods [4]. As a result, the maps show spatial smoothness, the property underlying most of the advantages conjectured. Our goal in the current paper is to achieve temporal smoothness as well, such that temporally extended, coherent behaviors tend to be represented in small, local regions of the map. The methods presented here for achieving temporal smoothness apply updates not just to local spatial neighborhoods but to local temporal neighborhoods as well.

¹ Pronounced "mōt" or "moUt", like moat or mote, rhyming with "boat" as in "motor boat".

² "Continual learning" refers to the constant and incremental acquisition of new behaviors built on previously acquired behaviors, where each behavior is learned through interaction with an environment that can provide positive and negative rewards.

The mot framework was inspired by recent evidence in neuroscience that the motor cortex may be laid out as a topological map organized according to behavior, where similar behaviors occur close together and very different behaviors lie far apart [5], [6]. As described by Ring et al. (2011b), this organization in and of itself conveys many surprising advantages, including: smoothness (the closer two regions are, the more likely they are to represent similar behaviors, thus providing a gradient in behavior space); robustness (should a failure occur, nearby regions can provide similar behavior, and learning can be done simultaneously across entire regions); hierarchical organization (large regions tend to represent generic behaviors, smaller regions represent more specific behaviors); safe dimensionality reduction (only those sensorimotor connections needed by a region are delivered there, but all connections remain accessible somewhere in the map); intelligent use and reuse of resources (obsolete behaviors can be replaced by similar ones, and new behaviors can be learned in those regions already representing the most similar behaviors); state aggregation by policy similarity (the position in the map of the currently active behavior provides a compact representation of state); continual learning and graceful degradation (regions can compete to best cover the useful behavior space).

The Motmap is the two-dimensional sheet of mots whose purpose is to achieve the above advantages for an artificial agent. Each mot has a location in the map where it receives its input and computes its output. While the mot framework is quite general and allows a large number of possible instantiations, the system underlying the methods discussed here is exactly that described in detail by Ring et al. (2011b). It is composed of a fixed number of SERL modules that learn to combine their individually limited capacities to represent a complex policy. If there are more modules than necessary, they redundantly represent large areas of behavior space (thus increasing robustness). If there are too few modules (the more common case), they spread out to cover the most important areas best.

The learning rule encourages smoothness, and the map becomes organized such that nearby mots compute similar outputs to similar inputs; however, this organization does not imply that the behaviors represented in a region will be temporally cohesive, or that the input-output pairs of any frequently occurring sequential behavior will likely be represented within the same region, as seems to be evidenced by the motor cortex [5], [6]. Thus, the current paper addresses this issue by introducing mechanisms that encourage temporally extended behaviors to be represented in small, local regions of the map.

II. FORMAL DESCRIPTION

In the current system, each mot is implemented as a single SERL module, extended with a coordinate on a two-dimensional grid (Figure 1, left). Since neither SERL nor the mot system is widely known, we repeat their formal description here.

SERL is best understood as a multi-modular system for reinforcement learning (RL) based in the standard RL framework [7], in which a learning agent interacts with a Markov decision process (MDP) over a series of time steps $t \in \{0, 1, 2, \ldots\}$. At each time step the agent takes an action $a_t \in A$ from its current state $s_t \in S$. As a result of the action the agent transitions to a state $s_{t+1} \in S$, and receives a reward $r_t \in \mathbb{R}$. The dynamics underlying the environment are described by the state-to-state transition probabilities $P^a_{ss'} = \Pr\{s_{t+1}{=}s' \mid s_t{=}s, a_t{=}a\}$ and expected rewards $R^a_{ss'} = E\{r_{t+1} \mid s_t{=}s, a_t{=}a, s_{t+1}{=}s'\}$. The agent's decision-making process is described by a policy, $\pi(s, a) = \Pr\{a_t{=}a \mid s_t{=}s\}$, which the agent refines through repeated interaction with the environment so as to maximize $Q(s, a) = E\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\}$, the total future reward (discounted by $\gamma \in [0, 1]$) that it can expect to receive by taking any action $a$ in any state $s$ and following policy $\pi$ thereafter.

SERL is an online, incremental, modular learning method that autonomously assigns different parts of a reinforcement-learning task to different modules, requiring no intervention or prior knowledge.

SERL consists of a set of modules, $M$. Each receives as input an observed feature vector $o \in O$, which uniquely identifies the state. Each module $i \in M$ contains two components: a controller function,

$$f^{c,i} : O \to \mathbb{R}^{|A|},$$

which generates a vector of action-value estimates; and an expertise estimator (also called "predictor function"),

$$f^{p,i} : O \to \mathbb{R}^{|A|},$$

which generates a vector of predicted action-value errors. At every time step, each module produces values based on the current observation vector, $o_t$:

$$q^i_t = f^{c,i}(o_t)$$
$$p^i_t = f^{p,i}(o_t)$$

These are combined for each module to create an $|M| \times |A|$ matrix $L_t$ of lower confidence values such that

$$L^i_t = q^i_t - |p^i_t|,$$

where $L^i_t$ is the $i$th row of $L_t$.

At every time step there is a winning module, $w_t$, which is generally one whose highest $L$ value matches $L^*_t$, the highest value in $L_t$. But this rule is modified in an $\varepsilon$-greedy fashion [7] to allow occasional random selection of winners, based on a random value, $x_t \sim U(0, 1)$:

$$W_t = \{i \in M : \max_a L^{ia}_t = L^*_t\}$$

$$\Pr\{w_t = i \mid L_t\} = \begin{cases} \frac{1}{|M|} & \text{if } x_t < \varepsilon_M \\ \frac{1}{|W_t|} & \text{if } x_t \geq \varepsilon_M \text{ and } i \in W_t \\ 0 & \text{otherwise,} \end{cases}$$

where $L^{ia}_t$ is the value for action $a$ in $L^i_t$. Once a winner is chosen, SERL calculates an $\varepsilon$-greedy policy based on the winner's $L$ values, $L^{w_t}_t$, using a potentially different constant, $\varepsilon_A$.
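The selection rule above can be illustrated with a short sketch (Python, not the authors' implementation; the module outputs and the value of $\varepsilon_M$ below are placeholder assumptions):

```python
# Illustrative sketch: compute the lower-confidence matrix L_t and choose an
# epsilon-greedy winning module, following the equations in the text.
import numpy as np

def select_winner(q, p, eps_M, rng):
    """q, p: |M| x |A| arrays of action-value estimates and predicted errors."""
    L = q - np.abs(p)                              # L^i_t = q^i_t - |p^i_t|
    L_star = L.max()                               # highest value in L_t
    W = np.flatnonzero(L.max(axis=1) == L_star)    # modules whose best value matches L*
    if rng.random() < eps_M:                       # occasional random winner
        w = int(rng.integers(len(q)))
    else:                                          # otherwise uniform over W_t
        w = int(rng.choice(W))
    return w, L

# Example with 4 modules and 3 actions (arbitrary placeholder values):
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 3))
p = rng.normal(size=(4, 3))
winner, L = select_winner(q, p, eps_M=0.5, rng=rng)
```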

Learning. The function approximators for both controllers and expertise estimators are updated with targets generated by TD-learning [8]. Unlike simple SERL modules, in addition to the controller and expertise estimator, each mot is assigned a coordinate in an evenly spaced, two-dimensional grid. Whereas in SERL only the winner's controller is updated, the mot system updates the controllers for a set of mots $w^+_t$ that surround the winner in the Motmap within a given radius.

The controllers are updated using Q-learning [9]; thus for each $i \in w^+_t$ the target for $q^{i a_t}_t$ (the component of $q^i_t$ corresponding to action $a_t$) is $r_t + \gamma L^*_{t+1}$.

All the expertise estimators are updated at every step, and their target is the magnitude of the mot's TD error:

$$\delta^i_t = r_t + \gamma L^*_{t+1} - q^{i a_t}_t.$$
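A minimal tabular sketch of this update step, under the assumption that the function approximators are replaced by simple arrays for one observation (an illustration only, not the system's actual approximators):

```python
# Illustrative sketch of one Motmap learning step: q and p stand in for the
# controller and expertise-estimator outputs for the current observation o_t,
# and L_next_star stands in for L*_{t+1} computed from the next observation.
import numpy as np

def mot_learning_step(q, p, neighborhood, a_t, r_t, L_next_star,
                      alpha_c=0.005, alpha_e=0.05, gamma=0.95):
    """q, p: |M| x |A| arrays for observation o_t; neighborhood: indices in w_t^+."""
    target = r_t + gamma * L_next_star            # Q-learning target
    delta = target - q[:, a_t]                    # per-mot TD error
    # Controllers: only mots in the winner's spatial neighborhood are updated.
    q[neighborhood, a_t] += alpha_c * delta[neighborhood]
    # Expertise estimators: every mot is updated toward the magnitude |delta|.
    p[:, a_t] += alpha_e * (np.abs(delta) - p[:, a_t])
    return q, p
```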

III. TEMPORAL COHERENCE

The method just outlined does result in a Motmap where similar policies are organized nearby each other and where the mapping represented by each mot is more similar to its neighbors than to mots farther away. However, the policies themselves are not necessarily temporally coherent: in the general case, if a mot has high expertise in a given state, it may not have high expertise in any of the immediately subsequent states. Conversely, for a learned global policy in which one state frequently or always follows immediately after another, there is no increased probability that the regions of greatest expertise for the two states are nearby each other. Indeed, the individual state-action mappings of extended behavior sequences are distributed arbitrarily throughout the Motmap. Thus, it cannot be argued with the above rules that extended behaviors are represented within local regions, as seems to be the case in the motor cortex. A temporally coherent organization, however, would be advantageous in all of the following ways:

Motor Vocabulary. When we learn extended coordinatedbehaviors, such as picking up a glass, crawling or walking,brushing our teeth, etc., these activities are generally composedof smaller behaviors that are themselves useful, such as reach-ing the hand to a specific location, swinging a leg forward,

Page 3: The Organization of Behavior into Temporal and Spatial ...ring/Ring,Schaul; The...The Organization of Behavior into Temporal and Spatial Neighborhoods Mark Ring IDSIA / University

Fig. 1. Left: A Motmap of 100 mots laid out in a two-dimensional grid. Thered mots depict a learning update: the controllers and predictors of all motswithin a certain radius around the winning mot are updated. For the othermots, only the predictors are updated. Right: Smoothly varying expertiselevels (darker corresponds to greater expertise) across the motmap, on ninerandom observations. Both figures adapted from Ring, et al. (2011b).

grasping and holding, etc.—meaningful behaviors, that recurin a variety of situations for a variety of purposes. One ofthe goals of the mot framework is to learn such behaviors,to isolate them, and organize them for reuse. One of thehypotheses of the system is that it is particularly beneficialto build up a vocabulary of small, useful motor behaviors thatcan be strung together on the fly, much as words are strungtogether as circumstances demand to form desired meanings.Thus, it would be beneficial if there were a specific region ofthe map where each entire extended behavior could be located,learned, and accessed.

Larger-scale smoothness. The Motmap described formally in Section II encourages the development of small-scale smoothness: for each individual observation, expertise in the Motmap generally varies smoothly. There is usually a center of expertise somewhere in the map with expertise gradually dropping off as distance from the center increases (see Figure 1, right). However, if two observations always occur in succession, their centers of expertise are no more likely to be near each other than if they never occur in succession. Larger-scale smoothness could be achieved if entire extended behaviors represented in one region were similar to extended behaviors represented by its neighbors. If the map were organized in this way, then taking small steps between neighboring regions would correspond to navigating through a smooth space of full, extended behaviors.

Behavior-based hierarchy. One of the most attractive potential properties expected from the Motmap (which is, however, yet to be demonstrated in publication) is the autonomous formation of hierarchies, in which smaller subregions represent more refined behaviors than the larger regions they make up. Without temporal coherence, these hierarchies are small scale: formed with respect to isolated observation-action mappings only. Temporal coherence encourages the larger-scale correlate, where larger regions represent coarse versions of extended behaviors, and smaller regions represent more specific, more refined, extended behaviors.

Increased robustness. In the case of environmental noise, locality of behavior provides additional information to be exploited for increased robustness. If the agent preferentially chooses winners near the previous winner, it is more likely to remain within the region of state space appropriate to the current behavior.

Autonomous discovery of behavior. Perhaps the most important goal of the temporal-coherence mechanisms described above is the possibility of automatically and autonomously discovering useful units of behavior—the words of the motor vocabulary described above. We hypothesize that the continual learning of motor skills is not so much a matter of stringing together small motor skills into larger ones, but a matter of finding an ever more refined set of useful skills. That is to say, it is more about subtlety than sequencing.

The question obviously arises, then: how can useful skills be discovered? How can a good vocabulary of behaviors be found that can be put together in useful ways? In other words, how can we, as Plato said, "carve nature at its joints" [10], or in this case, carve an agent's behavior space at its joints? While we do not claim to have solved this long-standing problem, we believe that an important part of the answer is to allow these motor words to form around the places where decisions must be made, drawing inspiration from Dawkins [11].

Dawkins proposed a method for describing sequences hierarchically by focusing on the decision points within the sequences, the places where successors are less clearly determined by their predecessors. For example, in the sequence "AbcXyzAbcAbcXyzXyz", whenever A occurs, "bc" always follows; whenever X occurs, "yz" always follows. But there are no such deterministic successors for "c" or "z"; these are the decision points that suggest the boundaries between subsequences. The joints of behavior space are the places where decisions must be made, the places where it is less obvious to the agent which action should occur next, in other words, the states where the entropy of the policy is highest. These are the clues that a coherent behavior has ended and a new one should begin.
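As an illustration of this decision-point idea (a sketch, not part of the mot system), the successor entropy of each symbol in the example string above can be computed directly; symbols with deterministic successors get entropy zero, while the boundary symbols do not:

```python
# Illustrative sketch: estimate how unpredictable the successor of each symbol is.
# High successor entropy marks a candidate behavior boundary.
from collections import Counter, defaultdict
from math import log2

def successor_entropy(sequence):
    successors = defaultdict(Counter)
    for cur, nxt in zip(sequence, sequence[1:]):
        successors[cur][nxt] += 1
    entropy = {}
    for sym, counts in successors.items():
        total = sum(counts.values())
        entropy[sym] = -sum((c / total) * log2(c / total) for c in counts.values())
    return entropy

print(successor_entropy("AbcXyzAbcAbcXyzXyz"))
# 'A' and 'X' have entropy 0 (their successors are fixed); 'c' and 'z' do not.
```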

As adults, during our daily lives we are constantly engaging in an enormous variety of different short-term behaviors: we reach for things, touch things, grasp things, pick things up (each different thing in a slightly different way), point at things; we move our heads and eyes to locate and watch things; make hand gestures and facial gestures; we form an enormous variety of exact tongue and mouth positions when we communicate, and body-posture and leg movements to get from place to place; we scratch; we rub; we pick and preen; and on and on. The range of short-term behaviors we have to draw from is considerable. We choose each of these in context, stringing them together to achieve our desires of the moment.

But how do we learn to divide our overall behavior into meaningful components? Our approach is to encourage individual responses (state-action pairs) to be stored in the same location as other responses depending on both their similarity and their temporal contiguity. As a consequence of this update rule, we anticipate that strings of responses occurring together frequently will be stored in nearby locations of the map.

We imagine that in early learning, small behaviors (small sequences of responses) are formed by a planning process in which the agent chooses actions to exploit the regularities within its immediate environment. Due to the presence of these environmental regularities, the agent will frequently produce similar or identical plans, followed by moments of decision making in which a new plan is formed that may or may not be related to the previous behavior sequence. Thus, boundaries emerge automatically as a result of planning, decision making, and the statistical regularities of the environment.

IV. IMPLEMENTATION AND TESTING

To encourage the mots to represent temporally contiguous portions of the global policy, we introduce, test, and compare several new possible update mechanisms. In all cases, the expertise estimators are updated as above; only the controller updates are modified here.

• Method $W^{t,t+1}$: at time $t$ all controllers in $w^+_{t-1} \cup w^+_t$ are updated using $r_{t-1} + \gamma L^*_t$ as the target. This means that if two mots tend to be temporal neighbors, they will often get trained on the same data, until one of them dominates on both.

• Method $W^{t-1,t,t+1}$: same as Method $W^{t,t+1}$, but the controllers in $w^+_{t+1}$ are also updated, making the method symmetric with respect to the past and future. (This method of course necessitates keeping the target until $t+1$.)

• Method $W^{t-1,t}$: same as Method $W^{t-1,t,t+1}$, but the controllers in $w^+_{t-1}$ are not updated.

In addition to these are three more aggressive variants ($W^{t,t+1}_{2\alpha}$, $W^{t-1,t,t+1}_{2\alpha}$, and $W^{t-1,t}_{2\alpha}$) that double the learning rate on those controllers not in $w^+_t$. The intuition here is that the temporally adjacent winning mots will thereby be forced to assume some of the current winners' expertise.
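The following sketch illustrates the shape of such a temporal update for Method $W^{t,t+1}$ and its $2\alpha$ variant (a tabular stand-in with assumed data structures, not the authors' code):

```python
# Illustrative sketch: the target r_{t-1} + gamma * L*_t is applied to the union
# of the previous and current winner neighborhoods; the "2-alpha" variants double
# the learning rate for controllers outside w_t^+.
import numpy as np

def temporal_update(q, prev_neigh, cur_neigh, a_prev, r_prev, L_star_t,
                    alpha_c=0.005, gamma=0.95, double_outside=False):
    """q: |M| x |A| controller values for the observation at time t-1.
    prev_neigh, cur_neigh: iterables of mot indices (w_{t-1}^+ and w_t^+)."""
    target = r_prev + gamma * L_star_t
    for i in set(prev_neigh) | set(cur_neigh):
        lr = alpha_c
        if double_outside and i not in cur_neigh:
            lr = 2 * alpha_c              # more aggressive variant
        q[i, a_prev] += lr * (target - q[i, a_prev])
    return q
```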

Benchmark task. The video game Dance Central provides a useful illustration of the process of planning, choosing, and ultimately learning short-term behaviors. In the game, the screen shows a picture representing which dance move the player should perform next. The player, whose actions are analyzed by computer as part of the Kinect™ gaming system, receives points for performing the sequence correctly. The game then displays a different movement sequence from a fixed library of such sequences, and the process continues until the song is over. The player thus learns a broad range of different motor-control behaviors, each having sequential contiguity, punctuated by decision points in which new plans and decisions are made.

For training purposes, we model the game as an MDP, where each of $D$ dance sequences is represented as a chain of $N$ states: the starting state is at one end of the chain and the final state is at the other. In each state of the chain, the agent can choose from a variety of actions; a single "correct" action advances it to the next state of the chain, while all other actions take the agent back to the previous state. When the agent reaches the end state of a chain (which corresponds to successfully producing the dance sequence), it receives a reward of 1.0 and is placed at the starting state of a randomly chosen chain. (All other state transitions return a reward of zero.) The agent's observation in each state is a feature vector comprising two concatenated, binary unit subvectors: the first subvector identifies the task, while the second identifies the current step within the task. (Each subvector has a 1 in a single position; all other positions are 0.) This abstract representation captures at a high level the agent's overall goal of mapping a target dance move and a current progress indicator to a discrete action.

Before training begins, the actions that take the agent to the next state are assigned randomly, independently and uniformly. Generalization is therefore impossible, and the agent cannot choose the correct action based on regularities in the inputs. Thus, for each behavior the agent should learn to produce a string of actions that are temporally coherent, but once the behavior is finished, a new one is selected at random. This scenario allows us to test the mechanisms proposed above for their ability to capture the temporal coherence of the learned behaviors and to assign them to nearby locations of the Motmap.
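A sketch of this benchmark environment, reconstructed from the description above (the exact handling of the chain's end state and of wrong actions in the starting state is our assumption):

```python
# Illustrative sketch of the dance-chain MDP: D chains of N states, one randomly
# assigned correct action per state, reward 1.0 on completing a chain, and
# observations built from two one-hot subvectors (task id, step id).
import numpy as np

class DanceChains:
    def __init__(self, D=16, N=4, num_actions=10, seed=0):
        self.D, self.N, self.A = D, N, num_actions
        self.rng = np.random.default_rng(seed)
        self.correct = self.rng.integers(num_actions, size=(D, N))  # correct action per state
        self.reset()

    def reset(self):
        self.d = int(self.rng.integers(self.D))  # pick a random chain (dance sequence)
        self.n = 0                               # start at the beginning of the chain
        return self.observation()

    def observation(self):
        obs = np.zeros(self.D + self.N)
        obs[self.d] = 1.0                        # one-hot task identifier
        obs[self.D + self.n] = 1.0               # one-hot step identifier
        return obs

    def step(self, action):
        if action == self.correct[self.d, self.n]:
            self.n += 1
            if self.n == self.N:                 # reached the end of the chain
                return self.reset(), 1.0
        else:
            self.n = max(0, self.n - 1)          # wrong action: fall back one step (clamped)
        return self.observation(), 0.0
```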

We examined games of different sizes, varying the number of dance sequences (behaviors or tasks), their length, and the number of possible actions that are available to the agent at every step.

Measures of coherence. To compare the methods described above, we use two measures of temporal coherence: (P) the probability of switching to a mot outside the current winner's local neighborhood, and (H) the entropy of a sequence of mot winners. For measure (P), we compare two values: ($P_s$) the probability of a switch during the sequence, and ($P_e$) the probability of a switch at the end of the sequence (when a new dance sequence is chosen); thus,

$$P_s = \frac{1}{D(N-1)} \sum_{d=1}^{D} \sum_{n=1}^{N-1} \begin{cases} 0 & \text{if adjacent}(w_{d,n}, w_{d,n+1}) \\ 1 & \text{otherwise,} \end{cases}$$

$$P_e = \frac{1}{D(D-1)} \sum_{d=1}^{D} \sum_{d' \neq d}^{D} \begin{cases} 0 & \text{if adjacent}(w_{d,N}, w_{d',1}) \\ 1 & \text{otherwise,} \end{cases}$$

where adjacent$(x, y)$ is true if mot $x$ is immediately above, below, left or right of mot $y$ on the Motmap, and $w_{d,n}$ is the winning mot when the agent's input is the observation from the $n$th step in dance sequence $d$.
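These two measures can be computed from the matrix of winning mots as in the following sketch (the grid indexing, and treating an unchanged winner as "no switch", are our assumptions):

```python
# Illustrative sketch: compute P_s and P_e from w[d, n], the index of the
# winning mot at step n of dance sequence d, on a Motmap of the given width.
import numpy as np

def no_switch(x, y, width):
    """Same mot, or a mot immediately above, below, left, or right on the grid
    (counting the unchanged winner as 'no switch' is our assumption)."""
    rx, cx = divmod(x, width)
    ry, cy = divmod(y, width)
    return abs(rx - ry) + abs(cx - cy) <= 1

def switching_probabilities(w, width=10):
    D, N = w.shape
    Ps = np.mean([not no_switch(w[d, n], w[d, n + 1], width)
                  for d in range(D) for n in range(N - 1)])
    Pe = np.mean([not no_switch(w[d, N - 1], w[d2, 0], width)
                  for d in range(D) for d2 in range(D) if d2 != d])
    return Ps, Pe
```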

For measure H, we define the entropy of a sequence as:

$$H(s) = -\sum_{i}^{M} p^s_i \log_M p^s_i,$$

where $M$ is the number of mots, and $p^s_i$ is the fraction of states in sequence $s$ in which mot $i$ is the winner. (If a single mot is the winner for every state in the sequence, its entropy is zero.) Then we compare $H_d$ (the average entropy of those dance sequences to be learned) with $H_r$ (the expected entropy of a random dance sequence):

$$H_d = \frac{1}{D} \sum_{d=1}^{D} H(s_d)$$

$$H_r = E\,[H(s_r)],$$

where $s_d$ is the $d$th dance sequence to be learned, and $s_r$ is a randomly generated sequence where for step $i$ of the sequence, each observation is drawn uniformly from the $i$th step of all $D$ sequences to be learned. The expectation in $H_r$ is approximated by averaging over 100 such sequences.
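A corresponding sketch for the entropy measures (again an assumed reconstruction, reusing the winner matrix from the previous sketch):

```python
# Illustrative sketch: H(s), H_d, and a Monte Carlo estimate of H_r from the
# winner matrix w[d, n]; the base-M logarithm keeps each entropy in [0, 1].
import numpy as np

def sequence_entropy(winners, M):
    """Entropy (base M) of the distribution of winning mots over one sequence."""
    counts = np.bincount(winners, minlength=M)
    p = counts[counts > 0] / len(winners)
    return float(-(p * np.log(p) / np.log(M)).sum())

def coherence_entropies(w, M=100, num_random=100, seed=0):
    rng = np.random.default_rng(seed)
    D, N = w.shape
    Hd = np.mean([sequence_entropy(w[d], M) for d in range(D)])
    # Random reference: step i of a random sequence is drawn uniformly from
    # the i-th step of all D learned sequences.
    Hr = np.mean([sequence_entropy(w[rng.integers(D, size=N), np.arange(N)], M)
                  for _ in range(num_random)])
    return Hd, Hr
```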

V. RESULTS

We use the following parameter settings: $\alpha_c = 0.005$ (learning rate for controllers), $\alpha_e = 0.05$ (learning rate for expertise estimators), $\varepsilon_A = 0.02$ (exploration on actions), $\varepsilon_M = 0.5$ (exploration on mots), $\gamma = 0.95$ (reward discount), and $M = 100$ (number of mots).

Figure 2 displays results for two variants of the mot system: (1) with update radius 1 (corresponding to one neighbor of the winner in each cardinal direction), denoted as the 'Motmap' scenario; and (2) with update radius 0 (no neighbors), denoted as the 'SERL' scenario. In the latter scenario, no spatial organization is applicable, and we study purely the dynamics of the temporal coherence. In each case we test the six update mechanisms from Section IV on the same task. The task uses 16 dance sequences ($D = 16$), each dance sequence representing a behavior of length 4 ($N = 4$), with 10 actions possible in each state ($|A| = 10$). The results of all 14 variants are shown. Figure 3 then shows the results of the best method on three different task settings to show how the results generalize.
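For concreteness, the following sketch shows one way to construct the radius-1 spatial neighborhood on the 10 x 10 grid implied by $M = 100$ (reading "radius" as Manhattan distance is our assumption, consistent with "one neighbor in each cardinal direction"):

```python
# Illustrative sketch: the spatial neighborhood w_t^+ around a winning mot.
# Radius 1 yields the winner plus its four cardinal neighbors; radius 0
# reduces to the winner alone (the 'SERL' scenario).
import numpy as np

def spatial_neighborhood(winner, width=10, height=10, radius=1):
    row, col = divmod(winner, width)
    neigh = [r * width + c
             for r in range(height) for c in range(width)
             if abs(r - row) + abs(c - col) <= radius]   # Manhattan distance on the grid
    return np.array(neigh)

print(spatial_neighborhood(55, radius=1))   # winner 55 and its 4 grid neighbors
print(spatial_neighborhood(55, radius=0))   # just the winner
```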

VI. DISCUSSION

The results demonstrate temporal coherence emerging from several of the methods tested, visible both through the low entropy and the low switching probability within a sequence, in comparison to the relatively high entropy of the random sequence and the relatively high switching probability between different sequences. This shows two things. First, behavior sequences are coalescing within local regions of the map. Second, the boundaries of the behaviors are being discerned and captured as an emergent property of the system; i.e., the mots are learning to carve the agent's behavior at its joints.

Interestingly, it seems that temporal coherence can be achieved without harming task performance—compare the black curves from the top row in Figure 2 to the rows below.

The shape of the two probability curves $P_s$ and $P_e$ depends on the update method used. Without an explicit method for encouraging temporal coherence (top row), these curves are essentially the same. Thus, the probability of switching to a mot outside the current winner's local neighborhood is no lower during the course of a dance sequence than when changing to a new sequence. In contrast, all the temporal-coherence methods successfully reduce $P_s$ (intra-sequence switching) and $H_d$ (intra-sequence entropy) without drastically reducing $P_e$ (inter-sequence switching) or $H_r$ (random-sequence entropy). In both the Motmap and SERL cases, $W^{t,t+1}_{2\alpha}$ has the greatest impact.

[Figure 2: two columns of panels (SERL and Motmap) by seven rows (base, T, T-sym, T-rev, T+, T-sym+, T-rev+), each panel plotted over 0 to 1M update steps.]

Fig. 2. Average performance curves (over 10 trials) for a million update steps, with 14 different settings. Left column is the SERL scenario, right column the Motmap scenario. The first row is the baseline performance (using none of the temporal update mechanisms), and the following 6 rows contain one mechanism variant each. In each graph, the black circles show the percentage of actions the agent has learned to choose correctly. The crosses measure the switching probabilities: red 'x's measure $P_s$ (see text), to compare to its reference value $P_e$ (purple '+'s). The dashed dark blue line measures $H_d$ (reference value is the dotted light blue line, $H_r$).


[Figure 3: two columns of panels (SERL and Motmap) by three rows of tasks (D = 16, |A| = 10; D = 8, |A| = 10; D = 16, |A| = 4), each panel plotted over 0 to 1M update steps.]

Fig. 3. Results using the best mechanism found ($W^{t,t+1}_{2\alpha}$), but now across three different tasks, where N = 4 in all cases and the first row coincides with the default task from before. See Figure 2 for plot details.

In each graph a phase transition can be seen in which a sudden reduction in $P_s$ and $H_d$ coincides with an increase in the number of correct action choices. Before learning the sequence well, the agent generally takes many actions to reach the end; thus it repeatedly experiences the same small set of observations, and the updates are applied consistently to a small number of winning mots. However, once the agent has learned the sequence, it spends less time within sequences and relatively more time at the ends of sequences, where it is exposed to a broader range of mot winners at the following step. The agent then applies relatively more updates to the winners dedicated to one sequence with values from the following (unrelated) sequence, which may explain the gradual loss in coherence visible in some graphs after a good policy is learned. Keeping high values of $\varepsilon$ helps to avert the loss by guaranteeing more intra-sequence updates even after the optimal policy has been learned (thus the high value we use for $\varepsilon_M$, the probability of switching to a random mot).

In addition to the temporal update mechanisms presented here, we tried numerous other methods that were less successful. For the reader interested in negative results, those included: (a) sharing the eligibility traces of Q(λ) among spatio-temporal neighbors; and (b) leaving the learning rule unchanged but giving a preference to the previous winner(s), boosting their probability of being chosen again.

Finally, the desire to carve behavior at its joints is hardly new in AI or RL. A different method proposed for segmenting RL policies into meaningful chunks is to find bottleneck states ("doorways") through which many possible paths can pass [12]. These methods bear a certain resemblance to the current method, in that doorways are also likely to be states of higher entropy.

VII. CONCLUSIONS

The temporal and spatial organization of behavior is of great potential benefit for continual-learning agents, promoting increased robustness, hierarchical learning, and the navigation and search of learned behaviors. Building upon our earlier work, which had introduced mechanisms for the spatial organization of behavior in a two-dimensional "Motmap," the current paper introduced six related, novel update mechanisms for achieving temporal coherence in SERL and the Motmap. In each case, this coherence is achieved as an emergent property of the update rule.

A variable set of sequential tasks patterned after the game Dance Central allowed us to demonstrate that the new mechanisms can segment behavior into cohesive chunks, representing each chunk within individual modules of a SERL system or within local neighborhoods of the two-dimensional Motmap. The latter result, a topological map organized in both space and time, is an important step towards an agent that maintains a library of useful motor behaviors—ordered, accessible, and extensible according to their similarities. We posit that this organization will be critical for an agent that learns continually and is constantly expanding the sophistication of its behavior.

VIII. ACKNOWLEDGMENTS

This research was funded in part through EU project IM-CLeVeR (231722) and through AFR postdoc grant number 2915104 of the National Research Fund Luxembourg.

REFERENCES

[1] M. B. Ring, T. Schaul, and J. Schmidhuber, "The Two-Dimensional Organization of Behavior," in First Joint IEEE International Conference on Developmental Learning and Epigenetic Robotics, Frankfurt, 2011.

[2] M. B. Ring, "Continual learning in reinforcement environments," Ph.D. dissertation, University of Texas at Austin, Austin, Texas 78712, August 1994.

[3] M. Ring and T. Schaul, "Q-error as a Selection Mechanism in Modular Reinforcement-Learning Systems," in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2011, to appear.

[4] T. Kohonen, Self-Organization and Associative Memory. Springer, second edition, 1988.

[5] M. S. A. Graziano and T. N. Aflalo, "Rethinking cortical organization: moving away from discrete areas arranged in hierarchies," Neuroscientist, vol. 13, no. 2, pp. 138–147, 2007.

[6] M. Graziano, The Intelligent Movement Machine: An Ethological Perspective on the Primate Motor System. Oxford University Press, 2009.

[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998. [Online]. Available: http://www-anw.cs.umass.edu/~rich/book/the-book.html

[8] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, pp. 9–44, 1988.

[9] C. J. C. H. Watkins, "Learning from delayed rewards," Ph.D. dissertation, King's College, May 1989.

[10] H. Fowler, W. Lamb, and R. Bury, Plato. 4. Cratylus, Parmenides, Greater Hippias, Lesser Hippias, ser. Loeb Classical Library. Harvard University Press, 1970.


[11] R. Dawkins, "Hierarchical organisation: a candidate principle for ethology," in Growing Points in Ethology, P. P. G. Bateson and R. A. Hinde, Eds. Cambridge: Cambridge University Press, 1976, pp. 7–54.

[12] A. McGovern and A. G. Barto, "Automatic discovery of subgoals in reinforcement learning using diverse density," in Proceedings of the Eighteenth International Conference on Machine Learning. Morgan Kaufmann, 2001, pp. 361–368.