
Original Paper

Adaptive Behavior 0(0) 1–20
© The Author(s) 2012
Reprints and permissions: sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/1059712312445902
adb.sagepub.com

Computational modeling of observational learning inspired by the cortical underpinnings of human primates

Emmanouil Hourdakis1,2 and Panos Trahanias1,2

Abstract
Recent neuroscientific evidence in human and non-human primates indicates that the regions that become active during motor execution and motor observation overlap extensively in the cerebral cortex. This suggests that to observe an action, these primates employ their motor and somatosensation areas in order to simulate it internally. In line with this finding, in the current paper, we examine relevant neuroscientific evidence in order to design a computational agent that can facilitate observational learning of reaching movements. For this reason, we develop a novel motor control system, inspired by contemporary theories of motor control, and demonstrate how it can be used during observation to facilitate learning, without the active involvement of the agent's body. Our results show that novel motor skills can be acquired only by observation, by optimizing the peripheral components of the agent's motion.

Keywords
Observational learning, computational modeling, liquid state machines, overlapping pathways

1 Introduction

Observational learning is the ability to learn and acquire new skills only by observation and is defined as the symbolic rehearsal of a physical activity in the absence of any gross muscular movements (Richardson, 1967). Recently, it has gained increased attention due to evidence in human and non-human primates indicating that during action observation and action execution similar cortical regions are being activated (Caspers, Zilles, Laird, & Eickhoff, 2010; Raos, Evangeliou, & Savaki, 2004, 2007). Inspired by this finding, in the current paper, we examine how a computational agent can be designed with the capacity to learn new reaching skills only by observation, using an overlapping pathway of activations between execution and observation.

The fact that common regions are being activated during motor observation and motor execution suggests that, when we observe, we employ our motor system in order to simulate what we observe. Recently, neurophysiological studies have shed light on the neural substrate of action observation in primates, by discovering that the cortical regions that pertain to action execution and action observation overlap extensively in the brain (Caspers et al., 2010; Raos et al., 2004, 2007). Similar evidence of common systems encoding observation and execution has been reported for both humans and monkeys, even though the two species are capable of different kinds of imitation (Byrne & Whiten, 1989). In a recent paper (Hourdakis, Savaki, & Trahanias, 2011), we have examined how learning can be facilitated during observation, based on the overlapping cortical pathways that are being activated during execution and observation in macaques.

In contrast to monkeys, humans are capable of acquiring novel skills during observation (Byrne & Whiten, 1989). This claim is supported by various studies that show how observation alone can improve the performance of motor skills in a variety of experimental conditions (Denis, 1985). This enhanced ability is also supported by a network of cortical regions that is activated during motor execution and observation, and includes the somatosensory and motor control regions of the agent (Caspers et al., 2010).

1 Institute of Computer Science, Foundation for Research and Technology – Hellas (FORTH), Heraklion, Crete, Greece
2 Department of Computer Science, University of Crete, Heraklion, Crete, Greece

Corresponding author:
Panos Trahanias, Institute of Computer Science, Foundation for Research and Technology – Hellas (FORTH), 100 N. Plastira str., GR 700 13, Heraklion, Crete, Greece
Email: [email protected]

In the robotics literature, motor learning has extensively focused on the process of imitation (Dautenhahn & Nehaniv, 2002; Schaal, 1999), because it is a convenient way to reduce the size of the state-action space that must be explored by a computational agent. A great impetus in this effort has been given by the discovery of mirror neurons, a group of visuo-motor cells in the pre-motor cortex of the macaque monkey that respond to both execution and observation of a behavior (Iacoboni, 2009).

In one of the mirror neuron computational models (Tani, Ito, & Sugita, 2004), the authors use a recurrent neural network in order to learn new behaviors in the form of spatio-temporal motor control patterns. Using a similar line of approach, other authors have suggested that new behaviors can be learned based on the cooperation between a forward and an inverse model (Demiris & Hayes, 2002). The same intuition is also used in the MOSAIC model (Haruno, Wolpert, & Kawato, 2001), which employs pairs of forward and inverse models in order to implement learning in the motor control component of the agent. Finally, using hierarchical representations, Demiris and Simmons (2006) have developed the HAMMER architecture, which is able to generate biologically plausible trajectories through the rehearsal of candidate actions.

Similarly to the above, other research groups have used principles of associative learning in order to develop imitation models for motor control. Inspired by mirror neurons (Iacoboni, 2009), the associations in these architectures are formed between visual and motor stimuli. For example, Elshaw, Weber, Zochios, and Wermter (2004) use an associative memory that can learn new motor control patterns by correlating visual and motor representations. Billard and Hayes (1999) present the DRAMA architecture, which can learn new spatio-temporal motor patterns using a recurrent neural network with Hebbian synapses. In contrast to these approaches, which employ the same networks for action generation and action execution, the mirror neuron system model (Oztop & Arbib, 2002) uses mirror neurons only for action recognition, which is consistent with the cognitive role that has been identified for these cells (Fabbri-Destro & Rizzolatti, 2008). However, despite its close biological relevance, this modeling choice is also reflected in the design of the computational agent, which is unable to generate any motor control actions.

The models discussed above employ the motor control system of the agent in order to facilitate learning. Because observational learning implies that the embodiment of the agent must remain immobile at all times, it has recently started to attract a lot of attention in robotics (Saunders, Nehaniv, & Dautenhahn, 2004). For example, Bentivegna and Atkeson (2002) have developed a model that can learn by observing a continuously changing game, while in Scassellati (1999), observation is employed in order to identify what to imitate.

The property of immobility raises more challenges when implementing motor learning, since the computational model must rely on a covert system in order to perceive and acquire new behaviors. In the current paper, we address this problem by developing a computational model that can learn new reaching behaviors only by observation. For this reason, we examine the biological evidence that underpins observational learning in humans, and investigate how motor control can be structured accordingly in order to enable some of its peripheral components to be optimized during observation.

In the rest of the paper, we describe the development of the proposed computational model of observational learning. We first examine the evidence of overlapping pathways in human and non-human primates (Section 2) and use it to derive a formal definition of observational learning in the context of computational modeling (Section 3). Based on this definition, in Section 4, we describe the implementation of the model, while in Section 5 we show how it can facilitate learning during observation in a series of experiments that involve execution and observation of motor control tasks. Finally, in the discussion section (Section 6) we revisit the most important properties of the model and suggest directions for future work.

2 Biological underpinnings of observational learning

In the current section, we examine neurophysiological evidence derived from human and macaque experiments, which study the cortical processes that take place during action observation. We mainly focus on the fact that in both humans and monkeys the observation of an action activates a similar network of regions as in its execution (Caspers et al., 2010; Raos et al., 2004, 2007).

2.1 Cortical overlap during action observation and action execution

In monkeys, due to the ability to penetrate and record single neurons in the cerebral cortex, studies were able to identify specific neurons that respond to the observation of actions done by others (Iacoboni, 2009). These neurons, termed mirror neurons in the literature, have been discovered in frontal and parietal areas (Gallese, Fadiga, Fogassi, & Rizzolatti, 1996), and are characterized by the fact that they discharge both when the monkey executes a goal-directed action and when it observes the same action being executed by a demonstrator. More recently, 14C-deoxyglucose experiments have shown that the network of regions that is activated in macaques during observation extends further than the parieto-frontal regions and includes the primary motor and somatosensory cortices (Raos et al., 2004, 2007).

In humans, even though single cell recordings are not feasible, data from neuroimaging experiments have also pointed to the existence of an overlapping network during action execution and action observation (Caspers et al., 2010). Studies have identified several areas that become active during observation, including the supplementary motor area (SMA; Roland, Skinhoj, Lassen, & Larsen, 1980), the prefrontal cortex (Decety, Philippon, & Ingvar, 1988), the basal ganglia and the premotor cortex (Decety, Ryding, Stenberg, & Ingvar, 1990). Decety et al. (1994) have reported the bilateral activation of parietal areas (both superior and inferior parietal), while Fieldman et al. (1993) reported on the activation of the primary motor cortex.

These activations are indications of a very important property of the human brain: when observing an action, it activates a network of cortical regions that overlaps with the one used during the action's execution. This network of motor and somatosensation areas, as has been suggested (Raos et al., 2007; Savaki, 2010), is responsible for internally simulating an observed movement. In the following section, we investigate how such evidence can be employed in a computational model, by examining what processes must be employed during observation, and how these can be integrated together in order to facilitate observational learning.

3 Definition of observational learning in the context of computational modeling

In the current section, we derive a mathematical formulation of observational learning based on the cognitive functions that participate in the process. Following that, we outline the architecture of the proposed model based on modular subsystems, termed computational pathways (Hourdakis et al., 2011; Hourdakis & Trahanias, 2011b), which are responsible for carrying out specific functions.

3.1 Mathematical derivation of observational learning

In the previous section, we examined the available neuroscientific data and identified the cortical regions that become active during execution and observation of an action. The fact that the regions that are activated during action execution and action observation overlap, at a lower intensity in the latter case (Caspers et al., 2010; Raos et al., 2004, 2007), suggests that when we observe, we recruit our motor system to simulate an observed action (Savaki, 2010). Consequently, in terms of the motor component, the representations evoked during action observation are similar to the ones evoked during the execution of the same action (except that in the former case, the activation of the muscles is inhibited at the lower levels of the corticospinal system). This suggests that the implementation of observational learning in a computational agent presupposes the definition of the motor control component of the model.

To accomplish this we adopt the definition from Schaal, Ijspeert, and Billard (2003), where the authors suggest that when we reach towards an arbitrary location, we look for a control policy $\pi$ that generates the appropriate torques so that the agent moves to a desired state. This control policy is defined as:

$v = \pi(q, t, \alpha)$    (1)

where $v$ are the joint torques that must be applied to perform reaching, $q$ is the agent's state, $t$ stands for time and $\alpha$ is the parameterization of the computational model. The main difference between action execution and action observation is that, in the latter case, the vector $q$ is not available to the agent, since its hand is immobile. Consequently, the representations during action observation must be derived using other available sources of information. Therefore, in the second case, Equation 1 becomes:

$v_o = \pi(q_o, t, \alpha_o)$    (2)

where qo is the agent’s representation of the observedmovement, p is as in Equation 1 and ao is the parame-terization of the computational model that is responsi-ble for action observation. t denotes the time lapse ofthe (executed or observed) action and is not distin-guished in Equations 1 and 2 because neuroscientificevidence suggests that the time required to perform anaction covertly and overtly is the same (Parsons,Gabrieli, Phelps, & Gazzaniga, 1998). Moreover, basedon the adopted neuroscientific evidence, that we useour motor system to simulate an observed action(Jeannerod, 1994; Savaki, 2010), the policy p inEquations 1 and 2 is the same. This assumption is sup-ported by evidence from neuroscience; Jeannerod hasshown that action observation uses the same internalmodels that are employed during action execution(Jeannerod, 1988). Savaki suggested that an observedaction is simulated by activating the motor regions thatare responsible for its execution (Savaki, 2010).

From the definitions of Equations 1 and 2, we identify a clear distinction between observation and execution. In the first case, the agent produces a movement and calculates its state based on the proprioception of its hand, while in the second the agent must use other sources of information to keep track of its state during observation. Another dissimilarity between Equations 1 and 2 is that the computational models $\alpha$ and $\alpha_o$ are different. However, as we discussed in the previous section, the activations of the regions that pertain to motor control in the computational models $\alpha$ and $\alpha_o$ overlap during observation and execution. Therefore, there is a shared subsystem in both models that is responsible for implementing the policy $\pi$ (which is the same in both equations). In the following, we refer to this shared subsystem as the motor control system $m$.

In the case of the execution computational model $\alpha$, the state of the agent is predicted based on the proprioceptive information of its movement, by a module $p$:

$q = p(p_r, m)$    (3)

where p is the agent’s internal (execution) module, pr isits proprioceptive state and m is the parameterization ofits motor control system; q is the state of the agent.

During observation, the state estimate can be derived from the observation of the demonstrator's motion and the internal model of the agent's motor control system. Therefore, in the case of action observation, the state estimate $q_o$ is obtained by a module $p_o$:

$q_o = p_o(v_o, m)$    (4)

where $q_o$ is the state of the agent during observation, $p_o$ is the agent's internal (observation) model, $v_o$ is the visual observation of the demonstrator's action and $m$ is the same motor control component as in Equation 3. The co-occurrence of the visual observation component $v_o$ and the motor control system $m$ in Equation 4 makes a clear suggestion about the computational implementation of observational learning. It states that the computational agent must be able to integrate the information from the observation of the demonstrator with its innate motor control system. This claim is supported by neuroscientific evidence that suggests that action perception pertains to visuospatial representations, rather than purely motor ones (Chaminade, Meltzoff, & Decety, 2005).
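To make the distinction concrete, the structure of Equations 1–4 could be organized in code as sketched below: execution and observation call the same policy $\pi$ and motor system $m$, and differ only in whether the state estimate comes from proprioception (Equation 3) or from vision (Equation 4). The estimators and the toy policy are illustrative placeholders, not the paper's implementation.

```python
# Sketch of the shared execution/observation structure of Equations 1-4.
# The estimators and the toy policy are placeholders, not the paper's model.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MotorSystem:
    policy: Callable[[List[float], float], List[float]]   # pi(q, t) -> torques

def estimate_from_proprioception(pr, m):     # Eq. 3: q = p(pr, m)
    return list(pr)

def estimate_from_vision(vo, m):             # Eq. 4: qo = po(vo, m)
    return list(vo)

def execute_step(m, proprioception, t):
    q = estimate_from_proprioception(proprioception, m)
    return m.policy(q, t)                    # Eq. 1: v = pi(q, t, alpha)

def observe_step(m, visual_input, t):
    q_o = estimate_from_vision(visual_input, m)
    return m.policy(q_o, t)                  # Eq. 2: covert torques, same policy

m = MotorSystem(policy=lambda q, t: [0.5 * qi for qi in q])   # toy policy
print(execute_step(m, [0.1, 0.2], 0.0), observe_step(m, [0.1, 0.2], 0.0))
```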

3.2 Model architecture

The theoretical framework outlined in Equations 1–4 formulates the ground principles of our model. To implement it computationally, we must identify how similar functions, such as those described above, are carried out in the primate's brain. To accomplish this, we assemble the functionality of the regions that become active during observation/execution (Section 2) into segregated processing streams, which we call computational pathways (Hourdakis et al., 2011; Hourdakis & Trahanias, 2011b). Each pathway is assigned a distinct cognitive function and is characterized by two factors: (i) the regions that participate in its processing and (ii) the directionality of the flow of its information. For the problem of observational learning of motor actions, we identify five different pathways: (i) motor control, (ii) reward assignment, (iii) higher-order control, (iv) state estimation and (v) visual.

Because each pathway implements a distinct and independent cognitive function, we can map it directly onto the components defined in Equations 1–4. The motor control, reward assignment and higher-order control pathways implement the motor control module $m$. The state estimation pathway is responsible for calculating executed or observed state estimates, while the proprioceptive and visual estimation pathways implement the proprioceptive state $p_r$ and visual observation $v_o$ functions, respectively.

Since Equations 1 and 2 use the same control policy $\pi$ to match the vectors $v$ and $v_o$ during observation and execution, the agent must produce an observed state estimate $q_o$ that is the same as its state $q$ would be if it was executing. Equations 3 and 4 state that these estimates can be computed using the shared motor control system $m$ of the agent. Moreover, by examining Equations 1 and 3 (execution), and Equations 2 and 4 (observation), we derive another important property of the system: to implement the policy $\pi$ the motor control system must employ a state estimate, which in turn requires the motor control system in order to be computed. This indicates a recurrent component in the circuit that must be implemented by connecting motor control with the proprioception and visual perception modules respectively (Figure 1).

Each subsystem in Figure 1 consists of a separate component in the model. The motor control and state estimation modules are shared during execution and observation, whereas the proprioception and visual perception modules are activated only for execution and observation respectively. In the current model, visual feedback is only considered during the demonstration of a movement in the observation phase. In the execution phase, where the agent has been taught the properties of a motor skill, the model relies on the feedforward contribution of the proprioceptive information in order to compensate for the absence of visual feedback.

Figure 1. Schematic illustration of the observation/execution system described by Equations 1–4. The components marked in green are used during execution, while the ones marked in blue during observation. The motor control and state estimation components are used during both execution and observation.


Motor control is an integrated process that combines several different computations, including monitoring and higher-order control of the movement. Many of our motor control skills are acquired at the early stages of infant imitation, where we learn to regulate and control our complex musculoskeletal system (Touwen, 1998). The above suggests that learning in the human motor system is implemented at different functional levels of processing: (i) learn to regulate the body and reach effortlessly during the first developmental stages of an infant's life and (ii) adapt and learn new control strategies after the basic reaching skills have been established. This hierarchical form offers a very important benefit: one does not need to learn all the kinematic or dynamic details of a movement for each new behavior. Instead, new skills can be acquired by using the developed motor control system and a small set of behavioral parameters that define different control strategies. In the current model, the implementation of such control strategies is accomplished using an additional motor component, which changes a reaching trajectory in a way that will allow it to approach an object from different sides.

To endow our computational agent with this flexibility in learning, its motor control system has been designed based on these principles. It consists of (i) an adaptive reaching component, which after an initial training phase can reach towards any given location, and (ii) a higher-order control component that implements different control strategies based on already acquired motor knowledge. Figure 2 shows a graphic illustration of how the system components we described in this section are mapped onto the modules of Figure 1.

4 Model implementation

Each of the separate subsystems in Figure 2 defines a different process that is activated during motor control. To implement these processes computationally, we follow biologically inspired principles. Each component in Figure 2 is decomposed into several regions that contribute to its function. To design the agent we have identified some basic roles for each of these regions, and combined them in order to develop a computational model that can facilitate observational learning. These are shown in Figure 3, where each box corresponds to a different region in the computational agent. The implementation of these regions is described in detail in the current section.

In Figure 3, all regions are labeled based on the corresponding brain areas that perform similar functions, and are grouped according to the pathway they belong to. For each of these regions we derive a neural implementation; exceptions to that are regions MI and Sc, which are discussed in the Motor pathway section below.

More specifically, we have identified five different pathways, each listed with a different color. The motor control pathway is marked in orange, and is responsible for implementing a set of primitive movement patterns that, when combined, can generate any reaching trajectory. In the current implementation, four different patterns are used, namely up, down, left and right. To learn to activate each of these basic movements appropriately, the agent relies on the function of the reward assignment pathway, marked in grey in Figure 3. This pathway implements a circuit that can process rewards from the environment by predicting an upcoming reinforcement, and is used to activate each of the four primitives in order to allow the computational agent to perform a reaching movement.

The reward processing pathway, along with the motor control and higher-order control pathways, is responsible for implementing a motor component able to execute any reaching movement. To accomplish this, we train the computational model during an initial phase in which the agent learns to activate the correct force field primitives in the motor control module based on the position of the robot's hand and the target position that must be reached. In addition, the hand/object distance component (marked in orange in Figure 3) modulates the activation of each individual force field, according to the aforementioned distance, in order to change the magnitude of the forces that are applied at the end-point of the robot's arm. These events result in a reaching force that will always move the hand towards a desired target position. The concept is schematically illustrated in Figure 4A.

Using the function of the reaching component, the higher-order control component is responsible for further modulating the activation of the force-field primitives, in order to implement different strategies of approach, based on the end-point position of the object (as shown in Figure 4B). The latter is identified by the ventral stream of the visual pathway. In addition, the dorsal stream of the same pathway learns to extract a different class label for each object in the environment, which is used in order to associate a given movement with a particular object. Relative parameters of the strategy that must be used to approach an object are extracted, based on the trajectory of the demonstrator, through the function of the visual pathway. Finally, proprioceptive feedback from the agent's movement and visual feedback from the demonstrator's movement are processed using the state estimation pathway, whose role is to learn to extract the appropriate representations, given the movement of the joints and a visual representation of the demonstrated trajectory, and combine them in the symbolic component in order to enable the agent to match its own movement with the movement of the demonstrator in action space.

Figure 2. A schematic layout of how the motor control system can interact with the proprioception and observation streams of the agent.

To implement the reward assignment and planning pathways we use liquid state machines (LSMs; Maass, Natschlaeger, & Markram, 2002), a recently proposed biologically inspired neural network that can process any continuous function without requiring a circuit-dependent construction. The use of LSMs in the current architecture was preferred due to their ability to integrate the temporal domain of an input signal and transform it into a spatio-temporal pattern of spiking neuron activations that preserves recent and past information about the input. This encoding becomes important when processing information related to motor control, because it allows the model to process a behavior by integrating information from previous moments of a movement. For the implementation of the state estimation pathway, we use self-organizing maps and feedforward neural networks. To implement the proprioceptive and visual processing signals of the model we have employed feedforward neural networks, trained with back-propagation, because they are a straightforward and sufficient way to learn the appropriate neural representations in the computational model. Moreover, self-organizing maps (SOMs), through self-organization, were able to discretize the robot's control space adequately, without requiring any training signal. In the current section, we describe the detailed derivation of each of these components.

Figure 3. Layout of the proposed computational model consisting of five pathways, marked in different colors: (i) visual (blue), (ii) state estimation (red), (iii) higher-order control (green), (iv) reward assignment (grey), (v) motor control (orange).

Figure 4. The forces applied by the reaching (A plot) and higher-order control (B plot) components. (A) To move from a starting position (green point) to a target position (red point), the model must activate the correct type of primitives. In the current example, to move towards the target, the model must learn to activate the up and right primitives, and scale them according to the distance in each axis. This results in the force RA being applied at the end-effector of the agent's hand. (B) In addition to the reaching force (RA) that will move the hand towards the object, another force field is activated (e.g. in this example the Fleft force that corresponds to the left force field), causing the hand to change its initial trajectory towards the object. If the effect of the additional forces (e.g. in this example the effect of the Fleft force) is reduced progressively with time, then the reaching force RA will eventually take over the movement of the hand (e.g. in the position marked with yellow), and ensure that the agent's arm will reach the final object.

To evaluate the performance of the model, we have used the two-link simulated robotic arm implemented in the Matlab Robotics Toolbox (Corke, 1996), controlled by two joints, one at the shoulder and one at the elbow. The length of each link is 1 m, while the angles between each individual arm segment are in the range of 0 to 360°.
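As a concrete reference for this embodiment, a plain-Python stand-in for the arm's forward kinematics is sketched below, assuming the stated 1 m link lengths; it is only an illustration, not the Matlab Robotics Toolbox model used in the experiments.

```python
# Forward kinematics of the simulated two-link planar arm (1 m links),
# a stand-in for the Matlab Robotics Toolbox model used in the paper.
import numpy as np

L1 = L2 = 1.0  # link lengths in metres

def end_point(shoulder, elbow):
    """Cartesian hand position for the given shoulder and elbow angles (radians)."""
    x = L1 * np.cos(shoulder) + L2 * np.cos(shoulder + elbow)
    y = L1 * np.sin(shoulder) + L2 * np.sin(shoulder + elbow)
    return np.array([x, y])

print(end_point(np.deg2rad(45), np.deg2rad(90)))  # hand position for one configuration
```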

4.1 Motor pathway

Due to the high nonlinearity and dimensionality that is inherent in controlling the arm, devising an appropriate policy for learning to reach can be quite demanding. In the current paper, this policy is established upon a few primitives, i.e. low-level behaviors that can be synthesized in order to compose a compound movement.

From a mathematical perspective, the method of primitives, or basis functions, is an attractive way to solve the complex nonlinear dynamic equations that are required for motor control. For this reason, several models have been proposed, including the VITE and FLETE models, which consist of parameterized systems that produce basis motor commands (see Degallier & Ijspeert, 2010, for a review). Recent experiments in paralyzed frogs revealed that limb postures are stored as convergent force fields (Giszter, Mussa-Ivaldi, & Bizzi, 1993). In Bizzi, Mussa-Ivaldi, and Giszter (1991), the authors describe how such elementary basis fields can be used to replicate the motor control patterns of a given trajectory.

In order to make the agent generalize motor knowledge to different domains, the primitive model must be consistent with two properties: (i) superposition, i.e. the ability to combine different basis modules together, and (ii) invariance, so that it can be scaled appropriately. Primitives based on force fields satisfy these properties (Bizzi et al., 1991). As a result, by weighting and summing four higher-order primitives, up, down, right and left, we can produce any motor pattern required.

The higher-order primitives are composed from a set of basis torque fields, implemented in the Sc module (Figure 3). By deriving the force fields using basis torque fields, the primitive model creates a direct mapping between the state space of the robot (i.e. joint values and torques) and the Cartesian space in which the trajectory must be planned (i.e. forces and Cartesian positions). We first define each torque field in the workspace of the robot, which, in the current implementation where the embodiment is modeled as a two-link planar arm, refers to the space defined by its elbow and shoulder angles. Each torque field is then transformed to its corresponding force field using a Gaussian multivariate potential function:

$G(q, q_0^i) = -e^{\frac{(q - q_0^i)^T K_i (q - q_0^i)}{2}}$    (5)

where $q_0^i$ is the equilibrium configuration of each torque field, $q$ is the vector of the robot's joint angles and $K_i$ is a stiffness matrix. The torque applied by the field is derived using the gradient of the potential function:

$\tau_i(q) = \nabla G(q, q_0^i) = K_i (q - q_0^i)\, G(q, q_0^i)$    (6)

To ensure good convergence properties we have used nine discrete and nine rotational basis torque fields, spread throughout different locations of the robot's workspace, i.e. the space defined by its shoulder and elbow angles (Figure 5).

Each plot in Figure 5 shows the gradient of each torque field. Since we want the model of primitives to be based on the forces that act on the end point of the limb, we need to derive the appropriate torque-to-force transformation. To accomplish this we convert a torque field to its corresponding force field using the following equation:

$\varphi = J^T \cdot \tau$    (7)

In Equation 7, $\tau$ is the torque produced by a torque field, while $\varphi$ is the corresponding force that will act on the end point of the plant if the torques are applied. $J^T$ is the transpose of the robot's Jacobian. Each higher-order force field in Figure 5 is composed by summing and weighting the basis force fields from Equation 6.

Figure 5. Nine basis discrete (left block) and rotational (right block) fields scattered across the [−π, π] configuration space of the robot. On each subplot the x-axis represents the elbow angle of the robot while the y-axis represents the shoulder angle. The two stiffness matrices used to generate the fields are $K_{disc} = \begin{pmatrix} -0.672 & 0 \\ 0 & -0.908 \end{pmatrix}$ and $K_{rot} = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$.

To find the weight coefficients, we form a system of linear equations by sampling vectors from the robot's operational space. Each force field is formed by summing and scaling the basis force fields with the weight coefficients $a$. The vector $a$ is obtained from the least squares solution to the equation:

$F \cdot a = P$    (8)

Even though the above model addresses 2D hand motion, it can very easily be extended by appending additional dimensions in the equations of the Jacobian and vector fields. In the results section we show the force fields that are produced by solving the system in Equation 8, as well as how the plant moves in response to a force field.
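For illustration, the whole primitive construction of Equations 5–8 could be sketched as follows; only the equations follow the text, while the field centres, the sampling grid and the target "up" force are assumptions made for the example.

```python
# Sketch of the force-field primitive construction of Equations 5-8 for the
# two-link planar arm. Field centres, sampling grid and the target "up" force
# are illustrative assumptions; only the equations follow the text.
import numpy as np

L1 = L2 = 1.0

def jacobian(q):
    """Jacobian of the planar two-link arm at joint angles q = (shoulder, elbow)."""
    s1, s12 = np.sin(q[0]), np.sin(q[0] + q[1])
    c1, c12 = np.cos(q[0]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])

def basis_torque(q, q0, K):
    """Eqs. 5-6: Gaussian potential G and its gradient, the basis torque field."""
    d = q - q0
    G = -np.exp(d @ K @ d / 2.0)
    return K @ d * G

def basis_force(q, q0, K):
    """Eq. 7: end-point force obtained from the torque via the Jacobian transpose."""
    return jacobian(q).T @ basis_torque(q, q0, K)

K_disc = np.array([[-0.672, 0.0], [0.0, -0.908]])
K_rot = np.array([[0.0, 1.0], [-1.0, 0.0]])
centres = [np.array([a, b]) for a in (-2.0, 0.0, 2.0) for b in (-2.0, 0.0, 2.0)]
bases = [(c, K_disc) for c in centres] + [(c, K_rot) for c in centres]

# Eq. 8: sample the workspace and solve F a = P for the weights of one
# higher-order primitive; here P asks for a unit "up" force everywhere.
samples = [np.array([a, b]) for a in np.linspace(-2.5, 2.5, 7)
           for b in np.linspace(-2.5, 2.5, 7)]
F = np.array([[basis_force(q, c, K) for (c, K) in bases] for q in samples])
F = F.transpose(0, 2, 1).reshape(-1, len(bases))   # rows: x/y force at each sample
P = np.tile(np.array([0.0, 1.0]), len(samples))    # desired "up" primitive force
a, *_ = np.linalg.lstsq(F, P, rcond=None)          # weight coefficients of Eq. 8
```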

4.2 Reward pathway

To be able to reach adaptively, the agent must learn to manipulate its primitives using control policies that generalize across different behaviors. In the cerebral cortex, one of the dominant means of learning is by receiving rewards from the environment. Reward in the brain is processed in the dopaminergic neurons of the basal ganglia; one of the properties of these neurons is that they start firing when a reinforcement stimulus is first given to the primate, but suppress their response with repeated presentations of the same event (Schultz, Tremblay, & Hollerman, 2000). At this convergent phase, the neurons start responding to stimuli that predict a reinforcement, i.e. events in the near past that have occurred before the presentation of the reward.

In the early nineties, Barto (1994) suggested an actor-critic architecture that was able to facilitate learning based on the properties of the basal ganglia, which gave inspiration to several models that focused on replicating the properties of the dopamine neurons. In the current paper, we propose an implementation based on LSMs (Figure 6).

The complete architecture is shown in Figure 6a, and consists of three neuronal components: (i) actors, (ii) critics and (iii) liquid columns. In this architecture, the critic neurons are responsible for predicting the rewards from the environment, while the actor neurons learn, using the predicted reward signal, to activate the correct primitive based on the distance between the end-point effector and the position of an object. For the current implementation, where a two-link planar arm was used, this distance is defined in 2D coordinates. Figure 6b shows how these regions can be mapped onto the model (Figure 3, Reward circuit).

The input source to the circuit consists of Poisson spike neurons that fire at an increased firing rate (above 80 Hz) to indicate the presence of a certain event stimulus. Each input projects to a different liquid column, i.e. a group of spiking neurons that is interconnected with feedforward, delayed, dynamic synapses. The role of a liquid column is to transform the rate code from each input source into a spatio-temporal pattern of action potentials in the spiking neuron circuitry (Figure 7). The neuron and synapse models used for the implementation of this circuit are described more thoroughly in Hourdakis and Trahanias (2011a). More specifically, the synapses of the liquid were implemented as simple analog synapses with delay, whereas the synapses for the critic and actor neurons were implemented using the imminence weighting scheme. Transmission of the input spike signals is carried out through the liquid columns, which consist of 10 neuronal layers, each of which introduces a delay of 5 ms. This sort of connectivity facilitates the formation of a temporal representation of the input, and implicitly models the timing of the stimulus events. The occurrence of an event results in the activation of the first layer of neurons in a liquid column, which is subsequently propagated towards the higher layers with a small delay. This temporal representation is important for the implementation of the imminence weighting scheme that is used to train the synapses of the dopaminergic critic neurons discussed below.

Figure 6. (a) The liquid state machine (LSM) implementation of the actor-critic architecture. Each liquid column is implemented using an LSM with feedforward delayed synapses. The critics are linear neurons, while the readouts are implemented using linear regression. (b) The actor-critic architecture mapped onto the model of Figure 3.

The critic neurons (P1, P2, P3) model the dopamine neurons in the basal ganglia. Their role is to learn to predict the reward that will be delivered to the agent in the near future. To accomplish this, the critic neurons use the temporal representation that is encoded in each liquid column and associate it with the occurrence of a reward from the environment. This is accomplished by training the synapses between the liquid columns and the critic neurons in a way that they learn to predict the occurrence of a reward by associating events in the near past.

To implement the synapses between the liquid columns and the P, A neurons, we use the imminence weighting scheme (Barto, 1994). In this setup, the critic must learn to predict the reward of the environment using the weighted sum of upcoming rewards:

$P_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots + \gamma^t r_{\infty}$    (9)

where the factor $\gamma$ represents the weight importance of predictions in the past and $r_t$ is the reward received from the environment at time $t$. To teach the critics to output the prediction of Equation 9, we update their weights using gradient learning, by incorporating the prediction from the previous step:

$v_t^c = v_{t-1}^c + n \left[ r_t + \gamma P_t - P_{t-1} \right] x_{t-1}^c$    (10)

where $v_t^c$ is the weight of the critic at time $t$, $n$ is the learning rate and $x_t^c$ is the activation of the critic at time $t$. The parameters $\gamma$, $P$ and $r$ are as in Equation 9. The weights of the actor are updated according to the prediction signal emitted by the critic:

$v_t^a = v_{t-1}^a + n \left[ r_t - P_{t-1} \right] x_{t-1}^a$    (11)

where $v_t^a$ is the weight of the actor at time $t$, $n$ is the learning rate and $x_{t-1}^a$ is the activation of the actor at time $t-1$.

The basis of the imminence weighting scheme is that it uses the temporal representation of an input stimulus in order to learn to predict an upcoming reward. Consequently, in the proposed architecture, the actors input the response of the neurons in the LSM and are trained using the spatio-temporal dynamics of each liquid column (Figure 7). The use of feedforward synapses within the liquid creates a temporal representation, i.e. activates different neurons, to indicate the occurrence of a certain stimulus event at a specific time interval of the simulation. This temporal representation is used by the imminence weighting scheme in order to learn to predict an upcoming reward.
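A compact numerical sketch of the critic and actor updates of Equations 10 and 11 is given below; the liquid-column activity is abstracted into a plain feature vector per time step, and the reward schedule, feature values and learning parameters are illustrative assumptions rather than the model's actual configuration.

```python
# Sketch of the critic/actor weight updates of Equations 10-11, with the
# liquid-column readouts abstracted into a feature vector x per time step.
import numpy as np

def critic_update(w_c, x_prev, r_t, P_t, P_prev, lr=0.05, gamma=0.9):
    """Eq. 10: w_c <- w_c + n [r_t + gamma P_t - P_{t-1}] x_{t-1}."""
    return w_c + lr * (r_t + gamma * P_t - P_prev) * x_prev

def actor_update(w_a, x_prev, r_t, P_prev, lr=0.05):
    """Eq. 11: w_a <- w_a + n [r_t - P_{t-1}] x_{t-1}."""
    return w_a + lr * (r_t - P_prev) * x_prev

# Toy trial: a binary reward arrives at the last step; with repetition the
# critic's prediction moves toward the events that precede the reward.
rng = np.random.default_rng(0)
features = rng.random((5, 8))           # 5 time steps, 8 readout features
rewards = np.array([0, 0, 0, 0, 1.0])   # reward only at the end of the trial
w_c, w_a = np.zeros(8), np.zeros(8)
for _ in range(200):                    # repeated successful trials
    P_prev = 0.0
    for t in range(1, len(rewards)):
        P_t = float(w_c @ features[t])  # critic prediction at time t
        w_c = critic_update(w_c, features[t - 1], rewards[t], P_t, P_prev)
        w_a = actor_update(w_a, features[t - 1], rewards[t], P_prev)
        P_prev = P_t
print(w_c @ features[3])                # prediction just before the reward
```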

The A1, A2, A3 neurons are trained using the signal emitted by the critic neurons. To model them in the current implementation we use a set of linear neurons. The input to each of these linear neurons consists of a set of readouts that are trained to calculate the average firing rate of each liquid column using linear regression. Each actor neuron is connected to all the readout units, with synapses that are updated using gradient descent. In Section 5, we illustrate how this circuitry can replicate the properties of the dopaminergic neurons in the basal ganglia and help the agent learn new behaviors by processing rewards.

Figure 7. The spatio-temporal dynamics of an event as they are transformed by a liquid column. The plot shows the temporal decay of a certain event by the liquid column's output for four stereotypical columns.

4.3 Visual pathway

The role of the visual observation pathway (Figure 3, Ventral visual stream) is to convert the iconic representation of an object into a discrete class label. This label is used by the motor control and planning pathways in order to associate the object with specific behavioral parameters. To implement this circuitry, LSMs were also employed. To encode the input we first sharpen each image using a Laplacian filter and subsequently convolve it with four different Gabor filters with orientations π, π/2, 2π and −π/2, respectively.

The four convolved images from the input are projected into four neuronal grids of 25 neurons, where each neuron corresponds to a different location in the Gabor output. The four representations that are generated by the neuronal fields are injected into the liquid of an LSM, which creates a higher-order representation that combines the individual encoded output from each Gabor image into an integrative representation in the liquid. Information from the liquid response is classified using a linear regression readout that is trained to output a different class label based on the object type.
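The preprocessing stage could be pictured as sketched below: Laplacian sharpening followed by convolution with four Gabor filters, whose rectified responses are down-sampled to coarse grids such as the 25-neuron fields mentioned above. The kernel sizes, Gabor parameters and the random stand-in image are assumptions made for illustration, not the paper's settings.

```python
# Sketch of the ventral-stream encoding: Laplacian sharpening, four Gabor
# filters, and coarse grids that would feed the LSM. Parameters are illustrative.
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(theta, size=9, sigma=2.0, wavelength=4.0):
    """Real part of a Gabor filter at orientation theta."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)

laplacian = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]])   # sharpening kernel

image = np.random.default_rng(0).random((35, 35))             # stand-in object image
sharpened = image + convolve2d(image, laplacian, mode="same")
orientations = (np.pi, np.pi / 2, 2 * np.pi, -np.pi / 2)
responses = [convolve2d(sharpened, gabor_kernel(th), mode="same") for th in orientations]
grids = [np.abs(r)[::7, ::7] for r in responses]               # four coarse 5x5 grids
```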

In addition, the visual observation pathway (Figure 3, Dorsal visual stream) includes an LSM that is responsible for extracting basic parameters from the demonstrator's movement. Its input is modeled using a 25-neuron grid, which contains neurons that fire at an increased rate when the position of the end-point effector of the demonstrator corresponds to the position defined in the x,y-coordinates of the grid. Information from this liquid is extracted using a readout that is implemented as a feedforward neural network, trained to extract the additional force that must be applied by the higher-order control component in order to produce the demonstrated behavior. The output of this readout is subsequently used in the higher-order control pathway during the observational learning phase, in order to teach the model to produce the correct behavior.

4.4 State estimation pathway

The state estimation pathway consists of two components, a forward and an observation model. The forward model is responsible for keeping track of the executed and imagined state estimates using two different functions. To implement the first function, we have designed the SI network to encode the proprioceptive state of the agent using population codes (Figure 3, SI), inspired by the local receptive fields that exist in this region and the somatotopic organization of SI.

To encode an end-point position we use a population code with 10 neurons for each dimension (i.e. the x,y-coordinates). Thus, for the 2D space, 20 input neurons are used. Neurons in each population code are assigned a tuning value uniformly from the [0, 1] range. The input signal is normalized to the same range. We then generate a random vector $v$ from a Gaussian distribution:

$v = G\left(100 - dev \cdot |I_{in} - T_n|,\; 2\right)$    (12)

where $I_{in}$ is the input signal normalized to the [0, 1] range, $T_n$ is the neuron's tuning value, and $dev$ controls the range of values that each neuron is selective to. The first part in the parenthesis of Equation 12 defines the mean of the Gaussian distribution. The second is the distribution's variance.
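For illustration, the population-code encoding of Equation 12 could be sketched as follows; the 10 neurons per dimension, the firing-rate scale of 100 and the variance of 2 follow the text, while the value of $dev$ is an illustrative assumption.

```python
# Sketch of the population-code encoding of Equation 12: 10 neurons per
# dimension with uniform tuning values, Gaussian mean 100 - dev*|Iin - Tn| and
# variance 2. The value of dev is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)

def encode_dimension(value, n_neurons=10, dev=80.0):
    """Noisy firing rates of one population code for an input in [0, 1]."""
    tuning = np.linspace(0.0, 1.0, n_neurons)      # tuning values T_n
    mean = 100.0 - dev * np.abs(value - tuning)    # Eq. 12: Gaussian mean
    return rng.normal(mean, np.sqrt(2.0))          # Eq. 12: variance of 2

# Encode a 2D end-point position, each coordinate normalised to [0, 1],
# yielding the 20 input neurons of the SI network.
position = np.array([0.3, 0.8])
rates = np.concatenate([encode_dimension(c) for c in position])
```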

Population codes assume a fixed tuning profile of the neuron, and therefore can provide a consistent representation of the encoded variable. To learn the forward transformation, we train a feedforward neural network in the SPL region that learns to transform the state of the plant to a Cartesian x,y-coordinate (Figure 3, Position hand-actor).

For the visual perception of the demonstrator's movement, we have also used a feedforward neural network that inputs a noisy version (i.e. with added Gaussian white noise) of the perceived motion in an allocentric frame of reference (Figure 3, Position hand-observer). In this case, the feedforward NN is trained to transform this input into an egocentric frame of reference that represents the observed state estimate of the agent, using back-propagation as the learning rule.

During observation, the role of the state estimation pathway is to translate the demonstrator's behavior into appropriate motoric representations to use in its own system. In the computational modeling literature, this problem is known as the correspondence problem between one's own and others' behaviors. To create a mapping between the demonstrator and the imitator we use principles of self-organization, where homogeneous patterns develop through competition to form topographic maps. This type of neural network is ideal for developing feature encoders, because it forms clusters of neurons that respond to specific ranges of input stimuli.

The structure of the SOM is formed, through vector quantization, during the execution phase, based on the output of the forward model pathway discussed above. During its training, the network's input consists of the end-point positions that have been estimated by the forward model pathway, and its role is to translate them into discrete labels that identify different position estimates of the agent (Figure 3, Symbolic component of movement).

The symbolic component is responsible for discretizing the output of the forward model pathway, so that for different end-point positions of the robot's hand, a different label will be enabled (Figure 8).

During observation, the same network receives as input the state estimate of the demonstrator's movement, transformed into egocentric coordinates, and outputs the respective labels that correspond to a specific space in its x,y operating environment. In the results section we demonstrate how different configurations of the SOM map affect the perception capabilities of our agent.
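The symbolic component can be pictured with a small self-organizing map that quantizes 2D end-point estimates into discrete labels, as sketched below; the map size, training schedule and stand-in data are assumptions made for the example, not the paper's configuration.

```python
# Sketch of the symbolic component: a small SOM quantizing 2D end-point
# estimates into discrete labels. Map size and training schedule are assumptions.
import numpy as np

rng = np.random.default_rng(0)
grid = np.stack(np.meshgrid(np.arange(6), np.arange(6)), axis=-1).reshape(-1, 2)
weights = rng.random((len(grid), 2))            # one 2D prototype per map node

def train_som(samples, weights, epochs=40, lr0=0.5, sigma0=2.0):
    """Move the winning node and its grid neighbours toward each sample."""
    for e in range(epochs):
        lr = lr0 * (1.0 - e / epochs)
        sigma = 0.5 + sigma0 * (1.0 - e / epochs)
        for s in samples:
            winner = np.argmin(np.linalg.norm(weights - s, axis=1))
            d = np.linalg.norm(grid - grid[winner], axis=1)
            h = np.exp(-d**2 / (2.0 * sigma**2))
            weights = weights + lr * h[:, None] * (s - weights)
    return weights

def label(position, weights):
    """Discrete label for an executed or observed end-point position."""
    return int(np.argmin(np.linalg.norm(weights - position, axis=1)))

# Train on stand-in forward-model estimates, then label an observed position
# of the demonstrator's hand after its egocentric transformation.
weights = train_som(rng.random((500, 2)), weights)
print(label(np.array([0.4, 0.7]), weights))
```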

4.5 Reaching policy

Based on the higher-order primitives and reward subsystems described above, the problem of reaching can be solved by searching for a policy that will produce the appropriate joint torques to reduce the error:

$q_e = \hat{q} - q$    (13)

where $\hat{q}$ is the desired state of the plant and $q$ is its current state. In practice, we do not know the exact value of this error, since the agent only has information regarding the end-point position of its hand and the trajectory that it must follow in Cartesian coordinates. However, because our primitive model is defined in Cartesian space, minimizing this error is equivalent to minimizing the distance between the plant's end-point location and the nearest point in the trajectory:

$d_e = |l - t|$    (14)

where $l$ and $t$ are the Cartesian coordinates of the hand and of the point in the trajectory, respectively. The transformation from Equation 13 to Equation 14 is inherently encoded in the primitives discussed before. The policy is learned based on two elements: (i) activate the correct combination of higher-order primitive force fields (Figure 3, Motor control action), and (ii) set each one's weight (Figure 3, Hand/object distance). The output of the actor neurons described in the previous section resembles the activation of the canonical neurons in the premotor cortex (Rizzolatti & Fadiga, 1988), which are responsible for gating the primitives. In a similar manner, every actor neuron in the reward processing pathway is responsible for activating one of the primitives of the model, which in this setup correspond to the up, down, left and right primitives. Due to the binary output of the actor neurons, when a certain actor is not firing, its corresponding force field will not be activated. In contrast, when an actor is firing, its associated force field is scaled using the output of the hand/object distance component, mentioned above, and added to compose the final force.

To teach the actors the local control law, we use a square trajectory which consists of eight consecutive points $p_1 \ldots p_8$. The model is trained on this trajectory by starting from the final location ($p_8$) in four blocks. Each block contains the whole repertoire of movements up to that point. Whenever it finishes a trial successfully, the synapses of each actor are changed based on a binary reward, and training progresses to the next phase, which includes the movement from the previous block as well as a new one.

Reward is delivered only when all movements in a block have been executed successfully. Therefore, the agent must learn to activate the correct force field primitives using the prediction signal from the critic neurons. The final torque that is applied to each joint is the linear summation of the scaled primitives.
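The composition of the final reaching command can be pictured as in the sketch below: each actor gates one of the up/down/left/right primitives and the hand/object distance component scales it per axis. The hard-coded gating rule here stands in for what the actor neurons learn from the critic's reward prediction, and the gain is an illustrative assumption.

```python
# Sketch of the reaching policy: actors gate the four primitives and the
# hand/object distance scales them; the gating rule here is a hand-written
# stand-in for what the actors learn from reward.
import numpy as np

PRIMITIVES = {"up": np.array([0.0, 1.0]), "down": np.array([0.0, -1.0]),
              "left": np.array([-1.0, 0.0]), "right": np.array([1.0, 0.0])}

def reaching_force(hand, target, gain=1.0):
    """Sum the gated primitives, each scaled by the distance along its axis."""
    err = target - hand
    gates = {"up": err[1] > 0, "down": err[1] < 0,
             "right": err[0] > 0, "left": err[0] < 0}
    scales = {"up": err[1], "down": -err[1], "right": err[0], "left": -err[0]}
    force = np.zeros(2)
    for name, direction in PRIMITIVES.items():
        if gates[name]:                       # binary actor output
            force += gain * scales[name] * direction
    return force

print(reaching_force(np.array([0.2, -0.1]), np.array([1.0, 0.5])))
```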

4.6 Higher-order control pathway

In the current paper, the higher-order control component is designed to inhibit the forces exerted by the motor control component in a way that alters the curvature of approach towards the object (as shown in Figure 4B). This inhibition is realized as a force that is applied at the beginning of the movement and allows the hand to approach the target object with different trajectories. To ensure that the hand will reach the object in all cases, the effect of this force must converge to zero as the hand approaches the target location. This allows the reaching component to progressively take over the movement and ensure that the hand will arrive at the object. As shown in Figure 4B, the magnitude of the force is reduced (e.g. Fleft in Figure 4B), and therefore the reaching force (RA) has a greater effect on the movement.
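Numerically, the idea of Figure 4B could be sketched as a reaching term plus an additional force that decays to zero with time, so that the reaching component progressively takes over. The exponential decay, the gains and the first-order kinematic integration (instead of the arm's full dynamics) are simplifying assumptions made for the illustration.

```python
# Sketch of the higher-order control idea of Figure 4B: a lateral force applied
# at movement onset decays to zero, letting the reaching force take over.
# Gains, decay rate and the kinematic (non-dynamic) integration are assumptions.
import numpy as np

def step(hand, target, t, start_force, gain=2.0, decay=3.0, dt=0.01):
    """One integration step of the hand position under both forces."""
    reach = gain * (target - hand)               # reaching force RA
    extra = start_force * np.exp(-decay * t)     # e.g. F_left, fading with time
    return hand + dt * (reach + extra)

hand, target = np.array([0.0, 0.0]), np.array([1.0, 1.0])
f_left = np.array([-4.0, 0.0])                   # initial leftward modulation
for i in range(400):
    hand = step(hand, target, i * 0.01, f_left)
print(hand)                                      # ends close to the target
```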

To implement this concept computationally we use LSMs. The network consists of two liquids of 125 neurons each, connected with dynamic synapses and local connectivity (the models of neurons and dynamic synapses that were used are the same as in the reward assignment pathway). The first liquid is designed to model a dynamic continuous attractor, which replicates the decreasing aspects of the force, while the second is used to encode the values of the additional force that will be exerted on the agent.

Figure 8. A schematic illustration of the operation of the symbolic component. The end-point positions of the agent's different trajectories (marked with white circles in the left graph) are input into a self-organizing map that quantizes the end-point positions into labels that correspond to specific spaces. The image shows one example in which a label from the self-organizing map (marked in red in the grid of neurons on the right graph) corresponds to a specific space in the x,y Cartesian space in which the agent's hand moves.

To replicate the decreasing aspect of the force on the attractor circuit we use a liquid that inputs two sources: a continuous analog value and a discrete spike train. During training, the former simulates the dynamics of the attractor by encoding the expected rate as an analog value. The latter encodes different starting values based on the perceived starting force of the demonstrator.

The second liquid in the circuit consists of 125 neurons interconnected with local connectivity (Figure 9). The input to the network consists of three sources. The first is the readout trained by the attractor circuit, which outputs the decreasing force. The second is the neuron that inputs the start force of the movement. The third source is a population code that inputs the symbolic labels from the map of the state estimation pathway. After collecting the states for the observer every 100 ms, the liquid states are trained for every label presented to the circuit up to that moment. The output of these liquid states is used to train a feedforward neural network readout that must learn to approximate the start force of the demonstrator and the decreasing rate of the force, based on the simulated liquid states.

5 Results

Here we present experimental results that attest to the effectiveness and appropriateness of the proposed model. We first illustrate the training results for each individual pathway of the model. We then continue to show the ability of the motor control component to perform online reaching, i.e. to reach towards any location with very little training. Finally, we present the results of the observational learning process, i.e. the acquisition of new behaviors only by observation.

5.1 Motor control component training

The first result we consider is the convergence of the least squares solution for the system of linear equations in Equation 8. Figure 10 presents the solution for the "up" higher-order primitive, where the least squares algorithm converged to an error value of 2 and created an accurate approximation of the vector field. The three subplots at the bottom show three snapshots of the hand while moving towards the "up" direction when this force field is active. Similar solutions were obtained for the other three primitives, where the least squares solution converged to error values of seven (left), two (right) and five (down); the error represents the extent to which the directions of the forces in a field deviate from the direction defined by the primitive.
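The sketch below illustrates a least-squares fit of this kind. It does not reproduce Equation 8; it assumes, for illustration, that at sampled joint configurations the primitive's desired Cartesian force is mapped to joint torques through the arm Jacobian, and that the torques are expressed as a linear combination of basis functions whose weights are obtained with `numpy.linalg.lstsq`. Link lengths and basis functions are hypothetical choices.

```python
import numpy as np

L1 = L2 = 0.5                            # assumed link lengths of the planar arm

def jacobian(q):
    q1, q2 = q
    return np.array([[-L1*np.sin(q1) - L2*np.sin(q1+q2), -L2*np.sin(q1+q2)],
                     [ L1*np.cos(q1) + L2*np.cos(q1+q2),  L2*np.cos(q1+q2)]])

def basis(q):                            # simple polynomial/trigonometric features
    q1, q2 = q
    return np.array([1.0, q1, q2, np.sin(q1), np.cos(q1), np.sin(q2), np.cos(q2)])

F_up = np.array([0.0, 1.0])              # Cartesian force of the "up" primitive

# Build the linear system  Phi w = tau_desired  over sampled configurations.
samples = [np.array([a, b]) for a in np.linspace(0.2, 1.2, 8)
                             for b in np.linspace(0.2, 1.2, 8)]
Phi = np.vstack([np.kron(np.eye(2), basis(q)[None, :]) for q in samples])
tau = np.concatenate([jacobian(q).T @ F_up for q in samples])

w, residual, *_ = np.linalg.lstsq(Phi, tau, rcond=None)
print("residual error of the fit:", residual)
```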

5.2 Reward assignment pathway

The policy for reaching was learned during an initial imitation phase where the agent performed the training trajectory and was delivered a binary reinforcement signal upon successful completion of a whole trial. Since the reward signal was only delivered at the end of the trial, the agent relied on the prediction of the reward signal elicited by the critic neurons. In the following sections, we look more thoroughly into the response properties of the simulated dopaminergic critic neurons and how the actors learned to activate each force field accordingly based on this signal.

Figure 11a illustrates how the critic neurons of the model learned to predict the forthcoming reward during training. In the first subplot (first successful trial), when reward is delivered at time block 4, the prediction of the first critic is high, indicating the presence of the reward at that time step. After the first 10 successful trials (Figure 11a, subplot 2), events that precede the presentation of the reward (time block 3) start eliciting a small prediction signal. This effect is more evident in the third and fourth subplots, where the prediction signal is even higher at time block 3 and starts responding at time block 2 as well. The effects of this association are also evident in Figure 11b, where it is shown how, after training, even though rewards are not available in the environment, the neurons start firing because they predict the presence of a reward in the subsequent steps. Using the output of this prediction signal, the actor, i.e. in the case of the model the neurons that activate the force fields in the motor control pathway, forms its weights in order to perform the required reaching actions.
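The backward propagation of the prediction can be reproduced with a standard temporal-difference update, shown in the sketch below as a stand-in for the simulated dopaminergic critic neurons. The number of blocks, learning rate and trial count are illustrative; the same TD error would also drive the actor weights that scale the force fields.

```python
import numpy as np

# Sketch: tabular TD(0) critic with a binary reward delivered only at the
# final 100-ms block of a trial.  Over repeated successful trials the
# prediction spreads back to the blocks preceding the reward.
n_blocks, alpha, gamma = 4, 0.3, 1.0
V = np.zeros(n_blocks + 1)               # value per time block (terminal = 0)

for trial in range(30):
    for t in range(n_blocks):
        reward = 1.0 if t == n_blocks - 1 else 0.0
        td_error = reward + gamma * V[t + 1] - V[t]
        V[t] += alpha * td_error
    if trial in (0, 9, 19, 29):
        print(f"trial {trial + 1:2d}: predictions per block =",
              np.round(V[:n_blocks], 2))
```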

Figure 9. The liquid state machine (LSM) circuit that implements the higher-order control component, showing the two liquids and the input/readout neurons. The readout is trained to output the decreasing force, i.e. which of the up, right, left and down primitives must be modulated in order to change the strategy of approach towards an object. The attractor liquid implements an attractor within the LSM, based on the input of the simulated attractor neuron. The object label is input to the two liquids in order for the network to learn to associate an object from the environment with its corresponding behavior.


5.3 State estimation pathway

In the current section, we present the results from the training of the two feedforward neural networks that were used in order to implement the forward and observation models in the state estimation pathway. In the first case, the network was trained in order to perform the forward transformation from the proprioceptive state of the agent to the end point position of its hand. For this reason, the joint positions of the simulated agent were extracted in every step of the simulation and encoded as population codes in the Sc module. The feedforward neural network consisted of two layers of sigmoidal activation neurons, and was trained for 100 iterations, to output the end point position of the hand (Figure 12).
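The sketch below illustrates such a forward model. It assumes a two-link planar arm with unit-length links and omits the population coding of the Sc module; a small network with sigmoidal hidden units is trained by plain gradient descent to map joint angles to the Cartesian end-point position of the hand.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def end_point(q):                                  # ground-truth kinematics
    return np.array([np.cos(q[0]) + np.cos(q[0] + q[1]),
                     np.sin(q[0]) + np.sin(q[0] + q[1])])

Q = rng.uniform(0.0, np.pi / 2, size=(200, 2))     # sampled joint angles
Y = np.array([end_point(q) for q in Q])

W1 = rng.normal(0, 0.5, (2, 20)); b1 = np.zeros(20)
W2 = rng.normal(0, 0.5, (20, 2)); b2 = np.zeros(2)
lr = 0.5

for it in range(100):                              # 100 training iterations
    H = sigmoid(Q @ W1 + b1)                       # sigmoidal hidden layer
    P = H @ W2 + b2                                # predicted end points
    err = P - Y
    # Backpropagation of the mean squared error.
    dW2 = H.T @ err / len(Q); db2 = err.mean(axis=0)
    dH = err @ W2.T * H * (1 - H)
    dW1 = Q.T @ dH / len(Q); db1 = dH.mean(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print("mean absolute prediction error:", np.abs(P - Y).mean())
```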

The visual observation of the demonstrated movement was also processed by a feedforward neural network. In this second case, the network input consists of a noisy version of the demonstrated movement in allocentric coordinates, and the network was trained to transform it into an egocentric frame of reference. In this context, the noise represents the observer's ability to perceive a behavior correctly. Figure 13 demonstrates the output of the training of the neural network using three different noise levels, 0.001, 0.005 and 0.01, respectively.

Figure 10. The force field (upper left subplot) and torque field (upper right subplot) as converged by the least squares solution for the "up" primitive. The three subplots at the bottom show snapshots of the hand while moving when the primitive is active.

Figure 11. (a) An illustration of how the weights of one of the critic neurons from the reward processing pathway are formed during the initial stages of training (subplot 1), after Nt = 10 trials (subplot 2), after Nt = 20 trials (subplot 3) and after Nt = 30 trials (subplot 4). As the figure shows, the neuron increases its output at time blocks 2 and 3, because it responds to events that precede the presentation of a reward. (b) The neuron learns to predict the actual reward signal given to the robot at the end of a successful trial (upper subplot), by eliciting a signal that predicts the presentation of the reward (bottom subplot). The x-axis represents the 100 ms time blocks of the simulation, while the y-axis shows the values of the reward and prediction signals respectively.

In addition, the state estimation pathway included a self-organizing map whose role was to discretize the end-point positions of the observed or executed movements into symbolic labels. The map was trained during the execution phase, where the agent was taught to perform elementary reaching behaviors.

Each map takes as input the x,y-coordinates of the agent's movement and outputs a symbolic label that represents a specific region of the hand's end-point position space (as shown in Figure 8). A map with 144 labels (Figure 14, third subplot) can represent the output space of a behavior more accurately. All three maps were evaluated during the observational learning stage against their ability to produce the same labels during execution and observation.
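A minimal self-organizing-map sketch of this labeling step is given below, assuming a 9x9 grid (81 labels) whose prototypes quantize the x,y end-point positions into symbolic labels. The grid size, learning schedule and training data are illustrative choices, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(2)
grid = np.array([(i, j) for i in range(9) for j in range(9)], dtype=float)
weights = rng.uniform(0, 1, (81, 2))             # one prototype per label

def train_som(points, epochs=50, lr0=0.5, sigma0=3.0):
    for e in range(epochs):
        lr = lr0 * (1 - e / epochs)
        sigma = sigma0 * (1 - e / epochs) + 0.5
        for p in points:
            bmu = np.argmin(np.linalg.norm(weights - p, axis=1))
            d = np.linalg.norm(grid - grid[bmu], axis=1)
            h = np.exp(-d**2 / (2 * sigma**2))    # neighbourhood function
            weights[:] += lr * h[:, None] * (p - weights)

def label(p):
    """Symbolic label (best-matching unit index) for an end-point position."""
    return int(np.argmin(np.linalg.norm(weights - p, axis=1)))

train_som(rng.uniform(0, 1, (300, 2)))            # end points of training moves
print(label(np.array([0.2, 0.7])), label(np.array([0.8, 0.1])))
```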

5.4 Motor control – hand/object distance component

To complete the implementation of the reaching policy the model must learn to derive the distance of the end-effector location from the current point in the trajectory. This is accomplished by projecting the output from the forward model and the position of the object estimated by the visual pathway into an LSM, and using a readout neuron to calculate their subtraction. In Figure 15, we illustrate two sample signals input to the liquid (top subplot), the output of the readout neuron at the 10-ms resolution (middle subplot) and the output of the readout neuron averaged over 100 ms of simulation time (bottom subplot).
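The sketch below reproduces the idea with a small echo-state style reservoir used as a stand-in for the spiking LSM: two input signals drive the reservoir, a linear readout is trained to output their difference, and the readout is then averaged over a 100-ms window. Reservoir size, input signals and regularization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
T, N = 550, 100                               # 5.5 s at 10-ms steps, 100 units
u = np.vstack([0.5 + 0.3 * np.sin(0.02 * np.arange(T)),     # hand estimate
               0.4 + 0.2 * np.cos(0.015 * np.arange(T))]).T  # object position
target = u[:, 0] - u[:, 1]                    # desired subtraction

W_in = rng.uniform(-0.5, 0.5, (2, N))
W = rng.normal(0, 1, (N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))      # spectral radius < 1

x = np.zeros(N); states = np.zeros((T, N))
for t in range(T):
    x = np.tanh(u[t] @ W_in + x @ W)
    states[t] = x

# Ridge readout trained on the reservoir states.
lam = 1e-4
w_out = np.linalg.solve(states.T @ states + lam * np.eye(N), states.T @ target)
y = states @ w_out                                      # 10-ms resolution output
y_block = y[:T - T % 10].reshape(-1, 10).mean(axis=1)   # 100-ms averaged output
print("mean absolute error of the readout:", np.abs(y - target).mean())
```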

5.5 Online reaching

Having established that the individual pathways/components of the proposed model operate successfully, we now turn our attention to the performance of the model in various reaching tasks. The results presented here are produced by employing the motor control, reward assignment and state estimation pathways only. We note here that the model was not trained to perform any of the given reaching tasks, apart from the initial training/imitation period at the beginning of the experiments. After this stage, the model was only given a set of points in a trajectory and followed them with very good performance.

To evaluate the performance of the model we used two complex trajectories. The first required the robot to reach towards various random locations spread in the robot's workspace (Figure 16a, Trajectory 1), while the second required the robot to perform a circular motion in a cone-shaped trajectory (Figure 16a, Trajectory 2). Figure 16a illustrates how the aforementioned trajectories were followed by the robot.

Figure 12. The output of the forward model neural network. Red crosses mark the actual end-point location of the hand, while blue circles mark the output of the network. The x,y-axes represent the Cartesian coordinates. As the figure shows, the forward model was able to predict the position of the hand accurately on each trial.

Figure 13. The output of the visual observation model of the agent, using three different noise levels. The x,y-axes represent the Cartesian coordinates. As the figure shows, the perception of a demonstrated trajectory depends on the amount of noise that is used.


To evaluate the model performance quantitatively we created 100 random trajectories and tested whether the agent was able to follow them. Each of these random movements was generated by first creating a straight-line trajectory (Figure 16b, left plot) and then randomizing the location of two, three or four of its points; an example is illustrated in Figure 16b, right plot.

Figure 14. Training of three self-organizing maps (SOMs) with different capacities for labels. In the first case the map consisted of 25 labels, in the second of 81 labels and in the third of 144 labels. Red circles illustrate the symbolic labels of the map, while black crosses mark the training positions of the agent's movements.

Figure 15. The output of the hand/object distance liquid state machine (LSM) after training. The top plot illustrates two sample input signals of 5.5 s duration. The bottom two plots show the output of the neural network readout used to learn the subtraction function from the liquid (middle plot), and how this output is averaged using a 100-ms window (bottom plot). From the third subplot it is evident that the LSM can learn to calculate with high accuracy the distance between the end-point position of the hand and the position of the target object.

The error was calculated by summing the overall deviation of the agent's movement from the points in the trajectory for all the entries in the dataset. The results indicate that the agent was able to follow all trajectories with an average error of 2%. This suggests that the motor control component can cope with arbitrary reaching tasks with high accuracy.
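The evaluation procedure can be sketched as follows. The template length, perturbation magnitude and the placeholder "executed" movement are illustrative assumptions; in the experiments the executed points come from the agent's actual movement.

```python
import numpy as np

rng = np.random.default_rng(4)

def random_trajectory(n_points=6, n_perturbed=3, spread=0.3):
    """Straight-line template with a few randomized points (cf. Figure 16b)."""
    traj = np.column_stack([np.linspace(0.0, 1.0, n_points),
                            np.zeros(n_points)])
    idx = rng.choice(n_points, size=n_perturbed, replace=False)
    traj[idx] += rng.uniform(-spread, spread, size=(n_perturbed, 2))
    return traj

def summed_deviation(executed, desired):
    """Total deviation of the executed points from the desired trajectory."""
    return np.sum(np.linalg.norm(executed - desired, axis=1))

dataset = [random_trajectory(n_perturbed=rng.integers(2, 5)) for _ in range(100)]
# A perfect follower is used here as a placeholder for the agent's movement.
total_error = sum(summed_deviation(t, t) for t in dataset)
print("total error over the dataset:", total_error)
```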

5.6 Planning circuit and attractor tuning

As mentioned in Section 4, the higher-order control circuit consists of two liquids, which model the decreasing force that is exerted on the hand of the agent. Here we present the results of the attractor liquid. As Figure 17 illustrates, the readout unit can produce a stable response despite the varying liquid dynamics. The output of the readout is used to directly inhibit the force produced by the motor control component. As the figure shows, the attractor dynamics cause the output of the readout to descend to zero after the first steps of the simulation.

5.7 Observational learning

In this section, we illustrate the results from the observational learning experiments, which involve the function of all the pathways of the model. To test the ability of the agent to learn during observation, we generated sample trajectories by simulating the model using predefined parameters. The agent was shown one trajectory at a time, and its individual pathways were trained for 1500 ms. The same simulation time was used during the execution of the respective motor control behaviors, i.e. 1500 ms. Training of the model in this phase concerned only the convergence of the higher-order control pathway, and consequently it required only a small number of trials (40 trials for the results shown in Figures 18 and 19) for the model to converge to a solution. The role of this phase was to evaluate the agent's ability to learn a demonstrated trajectory only by observation, i.e. without being allowed to move its hand. Subsequently, to evaluate the extent to which the agent learned the demonstrated movement, we ran an execution trial in which the agent was required to replicate the demonstrated behavior.

Figure 18 illustrates a sample trajectory that was demonstrated to the agent (left, red circles), the output of the planning module (Figure 18, bottom right), the corresponding class labels generated by the model, and the state of the agent during observation for a 1500-ms trial (Figure 18, top right). The noise used for the visual observation pathway was 0.05, while the size of the state estimation map was 81 labels. The optimal result that should be accomplished, which is the trajectory demonstrated to the robot, is shown as red circles.

Figure 16. (a) Two complex trajectories shown to the robot (red points) and the trajectories produced by the robot (blue points). Numbers mark the sequence in which the points were presented. (b) The template used to generate the random test set of 100 trajectories (left plot) and a random trajectory generated from this template (right plot).

Figure 17. The output of the trained readout that models the force exerted by the higher-order control pathway (bottom subplot, blue line) and the desired force value (bottom subplot, red line) for the same period. The top subplot illustrates the input to the circuit. The x-axis in all plots represents time in 100-ms intervals.

As the figure illustrates (Figure 18, left subplot), the agent was able to keep track of the demonstrated trajectory with very good accuracy. The top right subplot in Figure 18 illustrates the class labels that were generated by the state estimation pathway using the visual observation pathway's output (in green) and the states (in red) that would have been generated if the agent had been executing the behavior covertly. In blue, we mark the labels generated during the first erroneous trial. As the top-right subplot illustrates, the model was able to match the perceived trajectory (green line) with the one it should execute (red line) with high accuracy. This improvement was accomplished after 100 learning trials, during which the sum of squares error of the deviation of each trajectory was reduced from 9 units to 0.2 units. To verify that the agent learned the new behavior after the observational learning phase, we ran the same simulation using the forward model for the state estimation. Figure 19 illustrates the trajectory that was actually executed by the agent during the execution phase. The optimal result that should be accomplished, which is the trajectory demonstrated to the robot, is shown as red circles.

As discussed in Section 4, the noise levels and the size of the state estimation map had a direct effect on the performance of the model. Larger noise levels altered the perception of the agent and compromised its ability to mentally keep track of the observed movement. Figure 20 illustrates the response of the agent for noise levels of 0.01, 0.05, 0.1 and 0.3, respectively.

As the results of this section demonstrate, the developed computational agent is able to acquire new motor skills only by observation. The quality of learning is correlated with the agent's perceptual abilities, i.e. the extent to which it can perceive an observed action correctly. This skill is facilitated by the design of the model, which allows new motor skills to be learned based on a set of simple parameters that can be derived only by observation.

Figure 18. The observed state of the hand during observational learning. The left subplot illustrates the trajectory demonstrated (red circles) and the trajectory perceived by the agent (blue squares). The top right subplot illustrates the output of the self-organizing map (SOM) in the state estimation pathway, while the bottom right subplot illustrates the output of the linear regression readout in the planning pathway (red circles are the desired values of the force, blue boxes are the output of the readout).

Figure 19. The trajectory executed by the robot during the execution phase. The left subplot illustrates the trajectory demonstrated (red circles) and the trajectory executed by the agent (blue squares).

6 Discussion

In the current paper, we presented a computational implementation of observational learning. We have exploited the fact that, when observing, humans activate the same pathway of regions as when executing. The cognitive interpretation of this fact is that when we observe, we use our motor system, i.e. our own grounded motor experiences, in order to understand what we observe. The developed model was able to learn new motor skills only by observation, by employing the peripheral, higher-order control component of its motor system during observation.

The fact that the brain activates the same pathways to simulate an observed action is an important component of human intelligence and, as has been suggested (Baron-Cohen et al., 1993), a basis for social cognition. In the computational modeling community, most research in this area has focused on the function of mirror neurons. The evidence of activated pathways throughout the cerebral cortex suggests that the cortical overlap of regions extends well beyond the mirror neuron mechanism. More importantly, since action observation activates the same regions as action execution, observational learning can be used to revise our understanding of the content of motor representations. Computational models, such as the one presented, may potentially facilitate our understanding of the basis on which all these processes operate together in order to accomplish a behavioral task.

To evaluate the model's ability to learn during observation we have employed a two-link simulated planar arm, with simplified simulation conditions. To compensate for the uncertainty in the measurements of a real-world environment we have incorporated noise within the perceptual streams of the agent. This simplified condition is sufficient to support our assumption that learning can be implemented during observation using the simulation of the motor control system; however, in order to transfer the model to a real-world embodiment one must take into consideration additional issues regarding the encoding of the proprioceptive information of the agent and the visual streams. In the future, we plan to extend the model in order to address these issues, by looking into the function and structure of the proprioceptive association and visual estimation pathways.

The model was designed to perform two main functions: (i) online reaching, i.e. enabling the agent to reach towards any given location with very little training, and (ii) observational learning, i.e. the acquisition of novel skills only by observation. To implement the reaching component we devised a local reaching policy, where the computational model exerts forces that move the hand towards a desired location. The benefit of this approach is that any errors in the movement can be compensated at the later stages of motor control. Learning during observation was implemented in the higher-order control component based on simple parameters. This design choice was very important for the implementation of the observational learning process, since the agent was not required to restructure its whole motor system in order to acquire new behaviors. In the current implementation, during the execution of a reaching trajectory we used only the proprioceptive feedback of the agent in order to guide the movement. In real-world conditions, primates also have access to additional visual information derived from the perception of their hand. Because the addition of such a component can help produce more stable movements, in the future we plan to consider how this visual feedback can be integrated within the suggested model.

Having established a working model of observational learning, one of the important aspects that we plan to investigate in the future is the cortical underpinnings of motor inhibition during observation. More specifically, what are the reasons that cause the human body to stay immobile during observation? This inhibition is thought to act at the spinal level by preventing the excitation of muscle reflexes (Baldissera, Cavallari, Craighero, & Fadiga, 2001). For this reason, we plan to explore possible implementations of the cerebellum, and how its function can allow us to inhibit specific components of the movement.

Figure 20. The executed trajectory under the impact of noise. The four subplots show the trajectory executed by the agent with different values of noise.


Moreover, we also plan to focus on implementing agency attribution, i.e. the process that allows cortical agents to perceive their body as their own. Both processes are considered very important to observational learning, and their implementation may significantly contribute towards our understanding of the relevant biological mechanisms.

Acknowledgment

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions that helped improve the quality of the manuscript.

Funding

The work presented in this paper has been partly supported by the European Commission funded project MATHESIS, under contract IST-027574.

References

Baldissera, F., Cavallari, P., Craighero, L., & Fadiga, L. (2001). Modulation of spinal excitability during observation of hand actions in humans. European Journal of Neuroscience, 13(1), 90–94.

Baron-Cohen, S., Tager-Flusberg, H., & Cohen, D. J. (1993). Understanding other minds: Perspectives from autism. Oxford: Oxford University Press.

Barto, A. G. (1994). Adaptive critics and the basal ganglia. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia. Cambridge: MIT Press.

Bentivegna, D. C., & Atkeson, C. G. (2002). Learning how to behave from observing others. In Proceedings of the SAB'02 Workshop on Motor Control in Humans and Robots: On the interplay of real brains and artificial devices, Edinburgh, UK, August.

Billard, A., & Hayes, G. (1999). DRAMA, a connectionist architecture for control and learning in autonomous robots. Adaptive Behavior, 7(1), 35–63.

Bizzi, E., Mussa-Ivaldi, F. A., & Giszter, S. F. (1991). Computations underlying the execution of movement: A novel biological perspective. Science, 253, 287–291.

Byrne, R., & Whiten, A. (1989). Machiavellian intelligence: Social expertise and the evolution of intellect in monkeys, apes, and humans. Oxford, UK: Oxford Science Publications.

Caspers, S., Zilles, K., Laird, A. R., & Eickhoff, S. B. (2010). ALE meta-analysis of action observation and imitation in the human brain. NeuroImage, 50, 1148–1167.

Chaminade, T., Meltzoff, A. N., & Decety, J. (2005). An fMRI study of imitation: Action representation and body schema. Neuropsychologia, 43(1), 115–127.

Corke, P. I. (1996). A robotics toolbox for MATLAB. IEEE Robotics and Automation Magazine, 1, 24–32.

Dautenhahn, K., & Nehaniv, C. K. (2002). Imitation in animals and artifacts. Cambridge, MA: MIT Press.

Decety, J., Philippon, B., & Ingvar, D. H. (1988). rCBF landscapes during motor performance and motor ideation of a graphic gesture. European Archives of Psychiatry and Neurological Sciences, 238, 33–38.

Decety, J., Ryding, E., Stenberg, G., & Ingvar, D. H. (1990). The cerebellum participates in mental activity: Tomographic measurements of regional cerebral blood flow. Brain Research, 535(2), 313–317.

Decety, J., Perani, D., Jeannerod, M., Bettinardi, Tadary, B., Woods, R., Mazziotta, J. C., & Fazio, F. (1994). Mapping motor representations with PET. Nature, 371, 600–602.

Degallier, S., & Ijspeert, A. (2010). Modeling discrete and rhythmic movements through motor primitives: A review. Biological Cybernetics, 103(4), 319–338.

Demiris, Y., & Hayes, G. (2002). Imitation as a dual-route process featuring predictive and learning components: A biologically-plausible computational model. In K. Dautenhahn & C. Nehaniv (Eds.), Imitation in animals and artifacts. Cambridge, MA: MIT Press.

Demiris, Y., & Simmons, G. (2006). Perceiving the unusual: Temporal properties of hierarchical motor representations for action perception. Neural Networks, 19(3), 272–284.

Denis, M. (1985). Visual imagery and the use of mental practice in the development of motor skills. Canadian Journal of Applied Sport Science, 10, 4–16.

Elshaw, M., Weber, C., Zochios, A., & Wermter, S. (2004). An associator network approach to robot learning by imitation through vision, motor control and language. International Joint Conference on Neural Networks, Budapest, Hungary.

Fabbri-Destro, M., & Rizzolatti, G. (2008). Mirror neurons and mirror systems in monkeys and humans. Physiology, 23, 171–179.

Fieldman, J. B., Cohen, L. G., Jezzard, P., Pons, T., Sadato, R., Turner, R., et al. (1993). Functional neuroimaging with echo-planar imaging in humans during execution and mental rehearsal of a simple motor task. 12th Annual Meeting of the Society of Magnetic Resonance in Medicine, 1416.

Gallese, V., Fadiga, L., Fogassi, L., & Rizzolatti, G. (1996). Action recognition in the premotor cortex. Brain, 119, 593–609.

Georgopoulos, A., Kalaska, J., Caminiti, R., & Massey, J. (1982). On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. Journal of Neuroscience, 2, 1527–1537.

Giszter, S. F., Mussa-Ivaldi, F. A., & Bizzi, E. (1993). Convergent force fields organized in the frog's spinal cord. Journal of Neuroscience, 13(2), 467.

Haruno, M., Wolpert, D. M., & Kawato, M. (2001). Mosaic model for sensorimotor learning and control. Neural Computation, 13(10), 2201–2220.

Hourdakis, E., & Trahanias, P. (2011a). Computational modeling of online reaching. European Conference on Artificial Life, ECAL11, Paris, France.

Hourdakis, E., & Trahanias, P. (2011b). Observational learning based on overlapping pathways. Second International Conference on Morphological Computation, MORPHCOMP11, Venice, Italy.

Hourdakis, E., Savaki, E., & Trahanias, P. (2011). Computational modeling of cortical pathways involved in action execution and action observation. Neurocomputing, 74(7), 1135–1155.

Iacoboni, M. (2009). Imitation, empathy, and mirror neurons. Annual Review of Psychology, 60, 653–670.

Jeannerod, M. (1994). The representing brain: Neural correlates of motor imagery and intention. Behavioral and Brain Sciences, 17, 187–245.

Maass, W., Natschlaeger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531–2560.

Oztop, E., & Arbib, M. A. (2002). Schema design and implementation of the grasp-related mirror neuron system. Biological Cybernetics, 87, 116–140.

Parsons, L. M., Gabrieli, J. D. E., Phelps, E. A., & Gazzaniga, M. S. (1998). Cerebrally lateralized mental representations of hand shape and movement. Journal of Neuroscience, 18, 6539–6548.

Raos, V., Evangeliou, M. N., & Savaki, H. E. (2004). Observation of action: Grasping with the mind's hand. NeuroImage, 23(1), 193–201.

Raos, V., Evangeliou, M. N., & Savaki, H. E. (2007). Mental simulation of action in the service of action perception. NeuroImage, 27(46), 12675–12683.

Richardson, A. (1967). Mental practice: A review and discussion, Part 1. Research Quarterly, 38, 95–107.

Rizzolatti, G., & Fadiga, L. (1998). Grasping objects and grasping action meanings: The dual role of monkey rostroventral premotor cortex (area F5). New York: Wiley.

Roland, P. E., Skinhoj, E., Lassen, N. A., & Larsen, B. (1980). Different cortical areas in man in organization of voluntary movements in extrapersonal space. Journal of Neurophysiology, 43, 137–150.

Saunders, J., Nehaniv, C. L., & Dautenhahn, K. (2004). An experimental comparison of imitation paradigms used in social robotics. 13th IEEE International Workshop on Robot and Human Interactive Communication, ROMAN 2004, 691–696.

Savaki, E. (2010). How do we understand the actions of others? By mental simulation, not mirroring. Cognitive Critique, 2, 99–140.

Scassellati, B. (1999). Imitation and mechanisms of joint attention: A developmental structure for building social skills. In C. L. Nehaniv (Ed.), Computation for metaphors, analogy and agents, Vol. 1562, 176–195. Springer Lecture Notes in Artificial Intelligence. Berlin, Germany: Springer.

Schaal, S. (1999). Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3, 233–242.

Schaal, S., Ijspeert, A., & Billard, A. (2003). Computational approaches to motor learning by imitation. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 358(1431), 537–547.

Schultz, W., Tremblay, L., & Hollerman, J. R. (2000). Reward processing in primate orbitofrontal cortex and basal ganglia. Cerebral Cortex, 10(3), 272.

Tani, J., Ito, M., & Sugita, Y. (2004). Self-organization of distributed represented multiple behavior schemata in a mirror system: Reviews of robot experiments using RNNPB. Neural Networks, 17(8–9), 1273–1289.

Touwen, B. C. L. (1998). The brain and development of function. Developmental Review, 18(4), 504–526.

About the Authors

Emmanouil Hourdakis received his Ph.D. in Computer Science from the University of Crete, Greece (March 2012). Currently he is a Postdoctoral Researcher at the Computational Vision and Robotics Laboratory of the Foundation for Research and Technology – Hellas (FORTH). His ongoing research is related to computational modeling, biologically inspired systems, artificial neural networks and robotic control.

Panos Trahanias is a Professor with the Department of Computer Science, University of Crete, Greece, and the Institute of Computer Science, Foundation for Research and Technology – Hellas (FORTH). He received his Ph.D. in Computer Science from the National Technical University of Athens, Greece, in 1988. Following that, he held positions as Research Associate at the Institute of Informatics and Telecommunications, National Center for Scientific Research 'Demokritos', Athens, Greece (1989–1991), and at the Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada (1991–1993). He has participated in many research projects in image processing and analysis at the University of Toronto and has been a consultant to SPAR Aerospace Ltd., Toronto. Since 1993, he has been with the University of Crete and FORTH. He has held the position of Director of Graduate Studies at the Department of Computer Science, University of Crete, and currently he chairs the same department. At FORTH he heads the Computational Vision and Robotics Laboratory, where he coordinates research and development activities in human-robot visual interaction, robot navigation, visual tracking, and brain-inspired robotic control. He has coordinated and participated in many research projects funded by the European Commission and Greek funding agencies. He has participated in the programme committees of numerous international conferences and has been General Chair of Eurographics 2008 (EG'08) and the European Conference on Computer Vision 2010 (ECCV'10). He has published over 110 papers in technical journals and conference proceedings.
