Learning and imitation in heterogeneous robot groups


Wilhelm [email protected]

Fakultät für Elektrotechnik, Informatik und Mathematik, Universität Paderborn

22 December 2009



Motivation: Why do we need learning and imitation?

State of the art (Swarmanoid [Dorigo et al., 2006], Symbrion [Baele et al., 2009])

- Off-line learning (mostly population-based)
- Behavior is fixed afterwards

Desired

- On-line learning to react intelligently to unforeseeable events and problems
- Means to benefit from the "redundancy" in group behavior
- Robustness to arbitrary robot groups






The five big challenges in imitation [Dautenhahn and Nehaniv, 2002]

Five big challenges govern successful imitation in multi-robot systems:

- whom → heterogeneous robot groups
- when → concentrate on salient behavior
- what → the results, the actions, or the hidden goals of the imitatee?
- how → correspondence problem
- how to evaluate → what should be counted as successful imitation?



Thesis objectives

Robots in a group shall be able to

1. combine learning with imitation,
2. recognize and learn observed behavior non-obtrusively, and
3. choose potential imitatees wisely, also in heterogeneous robot groups.



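Robot architecture

[Figure: three-layer robot architecture. The motivation layer passes the current motivation to the strategy layer; the strategy layer exchanges request and result with the skill layer; the skill layer connects to perception and action. Two further blocks are labeled "choice of the imitatee" and "imitation".]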



Strategy layer

- Inspired by AMPS [Kochenderfer, 2006]

Processing pipeline:

1. raw perception and motivation: $I$, $\mu_i$
2. perception filtering: $o_t \in I_s$
3. experience: $\langle (o, a, d, \mu_i, f)_{t-N}, \dots, (o, a, d, \mu_i, f)_t \rangle$
4. abstraction $s = \xi(o)$, maintained by heuristics
5. model: $T$, $R$, $\gamma$
6. reinforcement learning
7. policy: $\pi$
8. action selection: $a = \pi(s) \in A$



Strategy layer

- The state abstraction function $\xi$ might use any abstraction method supporting
  - insertion of new state observations,
  - deletion of old state observations, and
  - querying the state observation most similar to a new state observation.
- Experiments use nearest neighbor.

[Figure: state observations ("raw states") grouped into regions ("abstract states")]
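The three operations required of the abstraction method can be illustrated with a minimal nearest-neighbor sketch. The class name, the method names, and the use of Euclidean distance are assumptions made for illustration; the slides only state that nearest neighbor is used in the experiments.

```python
import math


class NearestNeighborAbstraction:
    """Maps raw state observations to abstract states (regions) by nearest neighbor."""

    def __init__(self):
        # Each stored entry is (observation tuple, region label).
        self._entries = []

    def insert(self, observation, region):
        """Insert a new state observation that belongs to the given region."""
        self._entries.append((tuple(observation), region))

    def delete(self, observation):
        """Delete an old state observation."""
        target = tuple(observation)
        self._entries = [e for e in self._entries if e[0] != target]

    def query(self, observation):
        """Return the stored (observation, region) most similar to the new observation."""
        if not self._entries:
            raise LookupError("no state observations stored")
        return min(self._entries, key=lambda e: math.dist(e[0], observation))

    def xi(self, observation):
        """State abstraction: region of the nearest stored observation."""
        return self.query(observation)[1]


abstraction = NearestNeighborAbstraction()
abstraction.insert((0.0, 0.0), "s0")
abstraction.insert((1.0, 1.0), "s1")
print(abstraction.xi((0.9, 0.8)))
```

A linear scan keeps the sketch short; a spatial index would be the obvious replacement once many observations are stored.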




Strategy layer

- Heuristics maintain the model so that the same action "feels" similar in all observations of the same state.
- Heuristics may split or merge regions: transition, failure, reward, simplification, experience.
- Example: transition heuristic




Building a policy

- Reinforcement learning with an SMDP:

  $$Q(s, a) = R(s, a) + \sum_{s' \in S} P(s' \mid s, a)\, \gamma(s, a, s')\, V^{\pi}(s')$$

- Determine the current best policy:

  $$V^{\pi}(s) = \max_{a \in A} Q(s, a)$$

  $$\pi(s) = \arg\max_{a \in A} Q(s, a)$$
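The update rules above can be sketched as a plain value iteration. This is a minimal illustration under assumptions: the function name, the dictionary layout of `P` and `R`, the fixed number of sweeps, and the two-state toy model are all hypothetical and not taken from the thesis.

```python
def solve_smdp(states, actions, P, R, gamma, sweeps=100):
    """Iterate Q(s,a) = R(s,a) + sum_s' P(s'|s,a) * gamma(s,a,s') * V(s').

    P[(s, a)] is a dict {s': probability}, R[(s, a)] a reward,
    gamma(s, a, s') a transition-dependent discount factor.
    """
    V = {s: 0.0 for s in states}
    Q = {}
    for _ in range(sweeps):
        for s in states:
            for a in actions:
                Q[(s, a)] = R[(s, a)] + sum(
                    p * gamma(s, a, s2) * V[s2] for s2, p in P[(s, a)].items()
                )
        # V(s) = max_a Q(s, a)
        V = {s: max(Q[(s, a)] for a in actions) for s in states}
    # pi(s) = argmax_a Q(s, a)
    policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
    return Q, V, policy


# Hypothetical toy model: "go" moves from A to B and pays 1, everything else pays 0.
states = ["A", "B"]
actions = ["go", "stay"]
P = {
    ("A", "go"): {"B": 1.0},
    ("A", "stay"): {"A": 1.0},
    ("B", "go"): {"B": 1.0},
    ("B", "stay"): {"B": 1.0},
}
R = {("A", "go"): 1.0, ("A", "stay"): 0.0, ("B", "go"): 0.0, ("B", "stay"): 0.0}
Q, V, policy = solve_smdp(states, actions, P, R, gamma=lambda s, a, s2: 0.9)
print(policy["A"], V["A"])
```

In an SMDP the discount depends on the transition because actions take different amounts of time, which is why `gamma` is passed as a function rather than a constant.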




Strategy layer

- The strategy layer requests symbolic actions.
- Execution of these actions is up to the skill layer.




Skill layer

Tasks

1. Discover and learn a set of skills that are useful to the strategy layer → ground the symbols $a \in A$
2. Execute them when requested and optimize them at runtime

Skill

- A skill is a tuple $s = (f_e^1, \dots, f_e^N)$, where
- each error function $f_e : I_a \times I_a \to \mathbb{R}^+$ assigns an error value to a pair of perceptions $(I(t_i), I(t_j))$.

Example: "approach the ball and orient towards it"

- $f_e^1(I(t_i), I(t_j)) = d_{ball}(I(t_j))$ → minimize the ball distance
- $f_e^2(I(t_i), I(t_j)) = |\alpha_{ball}(I(t_j))|$ → minimize the ball angle
- $s = (f_e^1, f_e^2)$ → approach the ball and orient towards it



Skill layer: Measuring a skill's progress

- The progress function $f_p : I_a \times I_a \to [0, 1]$ measures a skill's progress.
- For a skill $s = (f_e^1, \dots, f_e^N)$ it is defined as

  $$f_p(I(t_i), I(t_j)) = \begin{cases} 0 & \text{if } C_a \le W(I(t_i), I(t_j)) \\ \dfrac{C_a - W(I(t_i), I(t_j))}{C_a - C_s} & \text{if } C_s < W(I(t_i), I(t_j)) < C_a \\ 1 & \text{if } W(I(t_i), I(t_j)) \le C_s \end{cases}$$

  with $f_e^i$: error function, $I(t_i)$: perception when the skill was started, $I(t_j)$: current perception, and success and abort thresholds $C_s \in \mathbb{R}^+$ and $C_a \in \mathbb{R}^+$ ($C_s < C_a$).

- $W(I(t_i), I(t_j)) = \sum_{k=1}^{N} f_e^k(I(t_i), I(t_j))$
- Example graph: $C_s = 0.15$, $C_a = 0.75$
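The piecewise definition translates directly into a few lines of code. In this sketch the function name and the argument layout are assumptions; the error values are passed in already evaluated, so `errors` stands for the list $f_e^1, \dots, f_e^N$ at the perception pair in question.

```python
def progress(errors, c_s, c_a):
    """Progress f_p in [0, 1] from the summed error W and thresholds C_s < C_a."""
    if not c_s < c_a:
        raise ValueError("success threshold must be below abort threshold")
    w = sum(errors)  # W = sum of all error function values
    if w >= c_a:
        return 0.0  # at or beyond the abort threshold: no progress
    if w <= c_s:
        return 1.0  # at or below the success threshold: skill finished
    return (c_a - w) / (c_a - c_s)  # linear in between


# Thresholds from the example graph on the slide; the error values are made up.
print(progress([0.3, 0.15], c_s=0.15, c_a=0.75))
```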



Imitation: Overview of the approach

- Robots observe each other continuously.
- A moving window of observations and well-being states is kept for each observed robot.
- The imitation process starts when a well-being improvement is detected.

Processing pipeline:

1. observed episode: $\langle (o_1^I, e_1^I), \dots, (o_N^I, e_N^I) \rangle$
2. transform observations
3. subjective observation data: $\langle (o_1^D, e_1), \dots, (o_N^D, e_N) \rangle$
4. interpret behavior
5. recognized episodes: $\langle \dots, ((t, o_D, s), a_t, (t', o'_D, s')), \dots \rangle$
6. estimate rewards
7. observed interpreted experience: $\langle \dots, ((t, o_D, s), a_t, r_t, (t', o'_D, s')), \dots \rangle$
8. integrate into experience, update SMDP



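Imitation: HMM and the Viterbi connection [Viterbi, 1967]

[Figure: hidden Markov model with hidden states $s_a$, $s_b$, $s_c$, observations $o_x$, $o_y$, $o_z$, transition probabilities $P(s_b \mid s_a)$, $P(s_c \mid s_a)$, and emission probabilities $P(o_x \mid s_a)$, $P(o_y \mid s_a)$, $P(o_z \mid s_a)$]

$o_1 o_2 \dots o_T$ → Viterbi → $s_1 s_2 \dots s_T$

$$V(s, t) = P(o_t \mid s_t = s) \max_{s'} \left[ P(s_t = s \mid s_{t-1} = s')\, V(s', t-1) \right]$$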





Imitation: Interpreting observed behavior with the imitator's own knowledge

Knowledge in strategy layer

- The imitator's own transition probabilities are used instead of "foreign" HMM transition probabilities.

[Figure: states $s_0$, $s_1$, $s_2$ connected by the transition probabilities $T(s_0, a_0, s_1)$, $T(s_0, a_1, s_1)$, $T(s_0, a_2, s_1)$ and $T(s_0, a_0, s_2)$, $T(s_0, a_1, s_2)$, $T(s_0, a_2, s_2)$]

Knowledge in skill layer

- Skills vote on perceptual changes → $f_p^a$, plus the following heuristics ...

Example: perceptual changes and the skills attributed to them

                Δo_0             Δo_1             Δo_2
action          a_0              a_1              a_2
skill           approach ball    approach goal    lift ball
ball dist       -1               -0.2             0
goal dist       -0.4             -1               0
ball height     0                0                0.3

[Figure: the change $\Delta o_2$ is scored against every skill: $P(\Delta o_2 \mid a_0)$, $P(\Delta o_2 \mid a_1)$, $P(\Delta o_2 \mid a_2)$]



Recognition

1. Recognize observation changes $o_{t-1} \to o_t$

   a) Prefer nearer goals (ambiguous situation: the robot might drive either to the red or to the yellow goal base)
   b) Ignore skills that "seem to have finished"
   c) Clip votes to $[0, 1]$

   $$P_a(o_t \mid o_{t-1}) = \begin{cases} \min\left(\max\left(\dfrac{f_p^a(o_t) - f_p^a(o_{t-1})}{1 - f_p^a(o_t)},\, 0\right),\, 1\right), & 1 - f_p^a(o_t) > \epsilon \\ 0, & \text{otherwise} \end{cases}$$

2. Recognize actions in the sequence $o_{t_1}^{t_2} = o_{t_1} o_{t_1+\Delta} \dots o_{t_2}$

   $$a_{ml} = \arg\max_a \frac{\sum_{t=t_1}^{t_2} P_a(o_t \mid o_{t-1})}{t_2 - t_1}$$

3. Recognize state transitions

   $$P(s_{t_2} \mid s_{t_1}) = T(s_{t_1}, a_{ml}, s_{t_2})$$
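Steps 1 and 2 can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, each skill's progress values are taken to be sampled at the same equally spaced time points, the guard is read as "the skill has not yet finished" (remaining progress above $\epsilon$), and the normalization by $t_2 - t_1$ is approximated by the number of consecutive observation pairs.

```python
def vote(fp_prev, fp_now, eps=1e-3):
    """Vote P_a(o_t | o_{t-1}) of one skill from its progress before and after."""
    remaining = 1.0 - fp_now
    if remaining <= eps:
        return 0.0  # heuristic b: ignore skills that seem to have finished
    # Dividing by the remaining progress favors nearer goals (heuristic a);
    # min/max clip the vote to [0, 1] (heuristic c).
    return min(max((fp_now - fp_prev) / remaining, 0.0), 1.0)


def most_likely_action(progress_traces, eps=1e-3):
    """Return (a_ml, scores) where scores[a] is the mean vote of skill a.

    progress_traces maps each skill name to its progress values
    f_p^a(o_t1), ..., f_p^a(o_t2) over the observed window.
    """
    scores = {}
    for action, trace in progress_traces.items():
        pairs = list(zip(trace, trace[1:]))
        scores[action] = sum(vote(p, n, eps) for p, n in pairs) / len(pairs)
    return max(scores, key=scores.get), scores


# Made-up progress traces for two skills over three observations.
traces = {
    "approach ball": [0.2, 0.4, 0.6],
    "approach goal": [0.5, 0.45, 0.4],
}
a_ml, scores = most_likely_action(traces)
print(a_ml)
```

The recognized action `a_ml` is then used to look up the imitator's own transition probability $T(s_{t_1}, a_{ml}, s_{t_2})$ in step 3.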









Evaluation: Recognition scenario: description

- The demonstrator (right robot) has to transport the yellow ball onto the base.
- The imitator (left robot) tries to "understand" its observations.
- Two scenarios:
  1. The imitator is only able to drive (and thereby push the ball).
  2. The imitator is also able to lift the ball.

[Figure: demonstrator lifting the ball while the imitator observes]



Evaluation: Recognition scenario: results

1. Without lifting capabilities

   [Figure: distance [m] plot with segments labeled "move to ball", "???", and "move to goal"]

   - Recognized "drive to ball" (B) and "drive to goal" (G) correctly
   - Detected "missing behavior" in between

2. With lifting capabilities

   [Figure: distance [m] plot with segments labeled "move to ball", "lift the ball", and "move to goal"]

   - Recognized "drive to ball" (B), "lift the ball" (L), and "drive to goal" (G) correctly



Evaluation: Multi-robot scenario "three bases"

- Task: transport objects to goal bases
- Reward for reaching an object: 10
- Goal bases provide different rewards
- The state space consists of
  - the distance to the closest object,
  - the distance of the closest object to the closest goal, and
  - the ID of the closest goal.



Conclusion: Objectives achieved in this thesis

1. Combination of learning and imitation
2. Non-obtrusive recognition and learning of observed behavior
3. Support for heterogeneous robot groups

Thank you for your attention!



- Architecture
  - State of the art
  - Overview
  - Layer interaction
- Motivation layer
  - Excitation
  - Prioritizing goals
- Strategy layer
  - State abstraction
  - Heuristics
  - Policy
  - Sample frequency
  - Strategy example
- Skill layer
  - Overview of the approach (explore, exploit)
  - Skill manager
  - Model manager
  - Error minimizer
  - Configuration
  - Skill example
- Imitation in robot groups
  - Overview of the approach
  - Recognizing behavior
    - Viterbi
    - Interpreting observed behavior
    - Recognition example
  - Integrating recognized behavior
  - Evaluation
    - CTF with three bases: performance, state abstraction, group homogeneity
    - CTF with five bases: performance, state abstraction, group homogeneity
- Choice of the imitatee
  - Affordance detection
  - Affordance network generation
  - Comparing ANs
  - Choice of the imitatee
  - Evaluation
    - Parameterization of the environment
    - Robustness experiment
    - Clustering experiment



State of the art

[Takahashi et al., 2008] use imitation to learn robotic soccer behaviors (approaching, shooting a ball)

- combines learning with imitation
- requires the robot group to stop whenever a robot imitates
- needs multiple presentations of the same behavior
- needs sufficient prior knowledge of the task to imitate

[Priesterjahn, 2008] evolves game bots with performance similar to that of the human player

- shows that imitation-based adaptation is able to outperform the purely evolutionary approach
- targeted at computer game scenarios, not stochastic real-world applications
- assumes group homogeneity

[Figure: The Rule-Based Operation Cycle of an Agent]

[Inamura et al., 2003] combine top-down teaching with bottom-up learning from the robot's side

- exclusive approach (cannot be combined with other learning techniques)
- the HMM is learned and then fixed throughout the robot's lifetime

[Figures: motion capturing system (motion for learning data); a result of motion generation on a humanoid robot]



Layer interaction

(1) Strategy step is triggered

- The current motivation and the corresponding next strategy action are determined.
- The strategy layer requires the most current motivation as feedback regarding its last chosen action → both are synchronous.

(2) Skill step is triggered

- The strategy step does not have to be finished yet.
- The skill layer simply executes the action most recently delivered by the strategy layer.

(3), (4) Strategy step has finished

- It signals the next action to execute to the skill layer.
- Subsequent skill steps then perform this action accordingly.

[Figure: sequence diagram with the lifelines clock, motivation layer, strategy layer, skill layer, perception, and action.
(1) Next strategy step event: request $I_m$, processed perception, set next motivation, request $I_s$, processed perception, determine next strategy step, set next skill.
(2), (3), (4) Next skill step event: request $I_a$, processed perception, calculate best actuator command, set next low-level action.]



Motivation layer: Motivation system example

- The current motivation $\mu$ is the vector to the current drive state, dependent on
  - time and
  - perception.
- Each drive measures the status of accomplishing a sub-goal (0 = fully accomplished).
- A drive $i$ is called satisfied (goal achieved) if the corresponding motivation is below its threshold: $\mu_i < \mu_i^{\theta}$

[Figure: drive space spanned by drive 1, drive 2, and drive 3, showing the well-being region, the current drive state, the current motivation, and $\mu_p$, the shortest vector to the desired drive area, which is used for prioritization]



A sub-goal subjected to an excitation

[Figure: drive value between 0 and 1 over time $t$, with the excitation, the well-being region, and the threshold triggering behavior]

- Excitation describes the force to which the current drive state is subjected.
- By specifying it dependent on the perception and on the internal state of the robot, the user is "programming" the final behavior.



Prioritizing goals

- At each time step, the motivation layer provides the current motivation vector to the strategy layer.
- With $\mu_p$ the strategy layer prioritizes which of the sub-goals are to be handled first:

  $$\mu_p = \begin{pmatrix} \max(0, \mu_1 - \mu_1^{\theta}) \\ \max(0, \mu_2 - \mu_2^{\theta}) \\ \vdots \\ \max(0, \mu_n - \mu_n^{\theta}) \end{pmatrix}$$

- Different drives can be prioritized by scaling them accordingly → modeling a hierarchy of needs
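The prioritization vector is a component-wise clipped difference, as the following sketch shows. The function name and the optional `scale` argument (one way to realize the scaling mentioned in the last bullet) are assumptions for illustration.

```python
def priority_vector(mu, mu_theta, scale=None):
    """Compute mu_p: per drive, how far the motivation exceeds its threshold."""
    if len(mu) != len(mu_theta):
        raise ValueError("mu and mu_theta must have the same length")
    if scale is None:
        scale = [1.0] * len(mu)  # no extra prioritization between drives
    return [w * max(0.0, m - th) for m, th, w in zip(mu, mu_theta, scale)]


# Made-up drive values: only the first drive is above its threshold.
mu_p = priority_vector([0.75, 0.125, 0.5], [0.25, 0.25, 0.5])
print(mu_p)
```

A satisfied drive contributes 0, so only unsatisfied sub-goals compete for attention.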



Strategy layer: Sample frequency

A new interaction is recorded when one of the following conditions holds:

- Sufficiently different perception (measured by some scenario-specific distance metric $d$):

  $$d(o_{t_1}, o_{t_2}) > \theta_o$$

- Sufficiently interesting motivation change:

  $$|\mu_{t_2} - \mu_{t_1}| > \theta_r$$

- Enough time has passed:

  $$t_2 - t_1 > \theta_t$$

$\theta_o$, $\theta_r$, and $\theta_t$ are application specific and have to be determined empirically.
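The three conditions combine into a single trigger. In this sketch the function name is hypothetical, and Euclidean distance is assumed both as the default perception metric $d$ and as the norm of the motivation change; the slides leave both scenario specific.

```python
import math


def should_sample(o1, o2, mu1, mu2, t1, t2, theta_o, theta_r, theta_t, d=math.dist):
    """Return True if a new interaction should be recorded at time t2."""
    perception_changed = d(o1, o2) > theta_o          # d(o_t1, o_t2) > theta_o
    motivation_changed = math.dist(mu1, mu2) > theta_r  # |mu_t2 - mu_t1| > theta_r
    time_elapsed = (t2 - t1) > theta_t                # t2 - t1 > theta_t
    return perception_changed or motivation_changed or time_elapsed


# Made-up values: the perception moved by 1.0, which exceeds theta_o = 0.5.
print(should_sample((0.0, 0.0), (1.0, 0.0), (0.5,), (0.5,), 0.0, 0.1,
                    theta_o=0.5, theta_r=0.2, theta_t=1.0))
```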



Strategy example

[Figure: strategy example showing states marked S and G and regions labeled with coordinate pairs such as (1, 1), (4, 1), and (6, 5)]



Skill layer

1. Discover and learn a set of skills that are useful to the strategy layer → ground the symbols $a \in A$
2. Execute them when requested and optimize them at runtime

Skill layer: Data flow in exploration mode

[Figure: skill layer with skill manager, skills, model manager, and error minimizer in training mode. Perception $I_a$ flows in, the layer explores actions $O$, creates models, creates and fetches skills, and notifies the strategy layer of new skills.]

Skill layer: Data flow in exploitation mode

[Figure: skill layer with skill manager, skills, model manager, and error minimizer in execution mode. The strategy layer requests a skill, the skill manager sets the current skill, the current skill is fetched, models are updated from perception $I_a$, and actions $O$ are issued.]



Skill definition

- The extraction function $f_{ext} : I_a \to \mathbb{R}$ extracts information from a perception $I(t) \in I_a$.
- The control function $f_c : \mathbb{R} \times \mathbb{R} \to \mathbb{R}^+$ associates an error value with the tuple $(v_{t_i}, v_{t_j})$:
  - decrease: $f_c(v_{t_i}, v_{t_j}) = |v_{t_j}|$
  - increase: $f_c(v_{t_i}, v_{t_j}) = \dfrac{1}{|v_{t_j}|}$
  - keep value: $f_c(v_{t_i}, v_{t_j}) = |v_{t_i} + \delta - v_{t_j}|$
- The error function $f_e : I_a \times I_a \to \mathbb{R}^+$ assigns an error value to a perception pair.
- The progress function $f_p : I_a \times I_a \to [0, 1]$ measures a skill's progress between two time points.



Skill manager

- Exploration phase
  - generate skills that enable the robot to control the perceived properties
  - assign a priority to each skill dependent on its execution priority
  - determine the skills the robot can reliably perform and announce them as new skills to the strategy layer
- Exploitation phase
  - manage the execution of requested skills



Model manager

- Creating prediction models for each perceived property
  - A prediction model is the tuple $(id_p, S, M, m)$:
    - $id_p \in ID_p$: perception feature to be predicted
    - $S \subseteq ID_o \cup ID_p$: subset of the perceptual features
    - $M \subseteq O$: subset of the actuators to control
    - $m : \mathbb{R}^{|S| + |M|} \to \mathbb{R}$ predicts the value of the perceptual feature $id_p$ at the next input perception, given the values of $S$ and $M$
  - $m$ in experiments: Poly, RBF
- Updating prediction models to reflect new experiences
- Scoring each model dependent on its prediction accuracy:

  $$\mathrm{score}(m) = \frac{n}{\sum_{i=k}^{k+n} \left( m(S(t_i), M(t_i)) - v_{t_{i+1}} \right)^2}$$
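A prediction model with a fit method, a predict method, and an accuracy score can be sketched as below. This is not the thesis implementation: a one-input least-squares line is swapped in for the Poly and RBF models named above, and the class name, the trace layout, and the handling of a perfect fit are assumptions.

```python
class LinearModel:
    """Least-squares line v_next = w * x + b for a single input value x.

    Stand-in for the Poly/RBF regression models; a real model would take
    the values of all features in S and all actuators in M as input.
    """

    def __init__(self, w=0.0, b=0.0):
        self.w = w
        self.b = b

    def fit(self, trace):
        """Fit to an experience trace of (x, v_next) pairs."""
        n = len(trace)
        mean_x = sum(x for x, _ in trace) / n
        mean_v = sum(v for _, v in trace) / n
        var_x = sum((x - mean_x) ** 2 for x, _ in trace)
        cov = sum((x - mean_x) * (v - mean_v) for x, v in trace)
        self.w = cov / var_x if var_x > 0 else 0.0
        self.b = mean_v - self.w * mean_x

    def predict(self, x):
        """Predict the value of the modeled property at the next perception."""
        return self.w * x + self.b


def score(model, trace):
    """n divided by the summed squared prediction error over the trace."""
    squared_error = sum((model.predict(x) - v) ** 2 for x, v in trace)
    if squared_error == 0:
        return float("inf")  # perfect predictions on this trace
    return len(trace) / squared_error


trace = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # made-up experience trace
fitted = LinearModel()
fitted.fit(trace)
print(fitted.predict(3.0), score(LinearModel(), trace))
```

Higher scores mean more accurate models, so the error minimizer can simply pick the model with the maximal score for each feature.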



Error minimizer

1. $I_c(t)$: only those perceptual features on which the error functions of the current skill $s$ depend, at the current time $t$
2. Estimate the next perception $I_c(t+1)$*, dependent on the motor action $M$, as predicted by $m_{best}^j = \arg\max_m(\mathrm{score}(m))$:

   $$I_c^M(t+1) = \{ m_{best}^j(I_c(t), M(t)) \mid p_j \in I_c(t) \}$$

3. For each error function $f_e^k$: calculate the expected next error $e_k^M(t+1)$, with $I_c(t_i)$ being the perception when the skill was started:

   $$e_k^M(t+1) = f_e^k(I_c(t_i), I_c^M(t+1))$$

4. Determine the best actuator command $M(t)$ by finding the one that minimizes the accumulated expected error:

   $$M_{next}(t) = \arg\min_M \sum_{k=1}^{N} e_k^M(t+1)$$

*$t+1$ is the time point of the next interaction after time $t$
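The four steps amount to a one-step lookahead over actuator commands. The sketch below assumes a finite list of candidate commands and a single `predict` function standing in for the best-scoring models; both, like all names and the toy model, are illustrative assumptions.

```python
def best_command(candidates, predict, error_functions, perception_start, perception_now):
    """Pick the actuator command with the smallest accumulated expected error."""

    def accumulated_error(command):
        # Step 2: estimate the next perception for this command.
        predicted = predict(perception_now, command)
        # Step 3 and 4: sum the expected errors of all error functions.
        return sum(f_e(perception_start, predicted) for f_e in error_functions)

    return min(candidates, key=accumulated_error)


# Hypothetical one-feature model: driving with speed v reduces the ball distance by v.
def predict(perception, speed):
    return {"ball_dist": max(perception["ball_dist"] - speed, 0.0)}


def ball_distance_error(start, current):
    return current["ball_dist"]


start = {"ball_dist": 3.0}
now = {"ball_dist": 2.0}
print(best_command([0.0, 0.5, 1.0], predict, [ball_distance_error], start, now))
```

Enumerating candidates keeps the example short; with continuous actuators a numerical optimizer over $M$ would take the place of `min`.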



Skill layer configuration

Greater universality leads to a bigger exploration space. It is wise to limit the exploration space by specifying non-changing parameters beforehand. This can be achieved by configuring the following parameters:

- Degrees of freedom specify the number of actuators the skill layer has to control.
- Extraction functions define the language that can be used to specify the error functions.
- Control functions specify the functions that the error minimizer will minimize by means of the error functions.
- Regression models are used by the model manager to build predictions for the environment interaction. A regression model consists of two methods: one that fits a model to an experience trace and one that predicts the value of the modeled property.



Skill example: "Minimize angle to object" learned with radial basis functions

[Figure: controlling speed dependent on angle and distance to the object]

[Figure: controlling rotational speed dependent on angle and distance to the object]



Imitation: Viterbi [Viterbi, 1967]

Problem description

- Given the observation sequence $o_1^N = \langle o_1, o_2, \dots, o_N \rangle$ ($o_i \in \mathbb{R}^d$)
- Find the most likely hidden state sequence $s_1^N = \langle s_1, s_2, \dots, s_N \rangle$ ($s_i \in S$)

Approach

- Maximize the probability $P(s_1^N \mid o_1^N)$:

  $$s_1^{N*} = \arg\max_{s_1^N} P(s_1^N \mid o_1^N)$$

  by recursively calculating the probability $V(s, t) = \max_{s_1^{t-1}} P(o_1^t, s_1 \dots s_{t-1}, s_t = s)$ that $s \in S$ is the hidden state at time $t$ given the observations $o_1^t$:

  - $V(s, 1) = P(o_1 \mid s_1 = s)\, P(s_1 = s) \quad \forall s \in S$
  - $V(s, t) = P(o_t \mid s_t = s) \max_{s'} \left[ P(s_t = s \mid s_{t-1} = s')\, V(s', t-1) \right]$
  - $\varphi(s, t) = \arg\max_{s'} \left[ P(s_t = s \mid s_{t-1} = s')\, V(s', t-1) \right]$
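The recursion and the backpointers $\varphi$ can be written out as a short, generic Viterbi decoder. The function signature and the two-state toy HMM are assumptions for illustration; the imitation approach later replaces the emission and transition terms with the imitator's own knowledge.

```python
def viterbi(observations, states, prior, transition, emission):
    """Return the most likely hidden state sequence for the observations.

    prior[s] = P(s_1 = s), transition[s_prev][s] = P(s_t = s | s_{t-1} = s_prev),
    emission(o, s) = P(o_t = o | s_t = s).
    """
    # V(s, 1) = P(o_1 | s) * P(s)
    V = [{s: emission(observations[0], s) * prior[s] for s in states}]
    backpointers = []
    for o in observations[1:]:
        previous = V[-1]
        row, phi = {}, {}
        for s in states:
            # phi(s, t): best predecessor of s
            best_prev = max(states, key=lambda sp: transition[sp][s] * previous[sp])
            phi[s] = best_prev
            row[s] = emission(o, s) * transition[best_prev][s] * previous[best_prev]
        V.append(row)
        backpointers.append(phi)
    # Backtrack from the most likely final state.
    path = [max(states, key=lambda s: V[-1][s])]
    for phi in reversed(backpointers):
        path.append(phi[path[-1]])
    path.reverse()
    return path


# Made-up two-state HMM with discrete observations x, y, z.
states = ["a", "b"]
prior = {"a": 0.6, "b": 0.4}
transition = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.4, "b": 0.6}}
table = {"a": {"x": 0.5, "y": 0.4, "z": 0.1}, "b": {"x": 0.1, "y": 0.3, "z": 0.6}}
print(viterbi(["x", "y", "z"], states, prior, transition, lambda o, s: table[s][o]))
```

For long sequences the products underflow, so practical implementations usually work with log probabilities instead.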



Imitation: Recognition

Problem description

- Given the observation sequence $o_1^N = \langle o_1, o_2, \dots, o_N \rangle$ ($o_i \in \mathbb{R}^d$)
- Find the most likely behavior sequence ($t \in \mathbb{R}^+$, $o \in \mathbb{R}^d$, $s \in S$, $a \in A$):

  $$\Gamma = (\dots, ((t_k, o_k, s_k), a_k, (t_{k+1}, o_{k+1}, s_{k+1})), \dots)$$

Approach

- Maximize the probability $P(s_1^n, a_1^{n-1} \mid o_1^N)$, $n \ll N$
- Adapt $V(s, 1)$ and $V(s, t)$:
  - Use the imitator's own state and action space for $S$ and $A$
  - Support bootstrapping of probabilities
  - Let actions recognize themselves → technical realization of the mirror neuron system



Recognition: determining $P(o_t \mid s_t = s)$

Viterbi: $V(s, t) = P(o_t \mid s_t = s) \max_{s'} \left[ P(s_t = s \mid s_{t-1} = s')\, V(s', t-1) \right]$

- Every state gets a chance dependent on the distance of its observations:

  $$P(o_t \mid s_t = s) = \frac{\sum_{o \in N_o^k \cap s} \lVert o_t - o \rVert^{-2}}{\sum_{o \in N_o^k} \lVert o_t - o \rVert^{-2}}$$

- Example: $P(o \mid s = s_2)$, $k = 3$

[Figure: state observations ("raw states") grouped into regions ("abstract states")]
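A sketch of this distance-weighted emission probability follows. It rests on an interpretation: $N_o^k$ is read as the $k$ stored observations nearest to $o_t$, weighted by inverse squared Euclidean distance. The function name, the data layout, and the small constant guarding against zero distance are assumptions.

```python
import math


def emission_probability(o_t, state, stored, k=3, tiny=1e-12):
    """Share of the inverse-squared-distance weight that falls on `state`.

    stored is a list of (observation, state) pairs from the imitator's experience.
    """
    nearest = sorted(stored, key=lambda entry: math.dist(entry[0], o_t))[:k]
    weights = [
        (1.0 / max(math.dist(o, o_t), tiny) ** 2, s)  # ||o_t - o||^-2
        for o, s in nearest
    ]
    total = sum(w for w, _ in weights)
    return sum(w for w, s in weights if s == state) / total


# Made-up stored observations in three regions.
stored = [
    ((0.0, 0.0), "s1"),
    ((1.0, 0.0), "s2"),
    ((0.0, 1.0), "s2"),
    ((5.0, 5.0), "s3"),
]
print(emission_probability((0.5, 0.0), "s2", stored, k=3))
```

States without any observation among the $k$ nearest neighbors receive probability 0, and the probabilities of the remaining states sum to 1.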



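Recognition: determining $P(s_{t_2} \mid s_{t_1})$

Viterbi: $V(s, t) = P(o_t \mid s_t = s) \max_{s'} \left[ P(s_t = s \mid s_{t-1} = s')\, V(s', t-1) \right]$

- Replace $P(s_{t_2} \mid s_{t_1})$ with $T(s_{t_1}, a_{ml}, s_{t_2})$, where

  $$a_{ml} = \arg\max_a \frac{\sum_{t=t_1}^{t_2} P_a(o_t \mid o_{t-1})}{t_2 - t_1}$$

- $P_a(o_t \mid o_{t-1})$ is the vote of skill $a$ by means of its progress function ($f_p^a(o_{t-1}) \to f_p^a(o_t)$), plus the following heuristics ...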



Recognition: determining $P_a(o_t \mid o_{t-1})$

Viterbi: $V(s, t) = P(o_t \mid s_t = s) \max_{s'} \left[ P(s_t = s \mid s_{t-1} = s')\, V(s', t-1) \right]$

1. Prefer nearer goals (ambiguous situation: the robot might drive either to the red or to the yellow goal base)
2. Ignore skills that "seem to have finished"
3. Clip votes to $[0, 1]$

$$P_a(o_t \mid o_{t-1}) = \begin{cases} \min\left(\max\left(\dfrac{f_p^a(o_t) - f_p^a(o_{t-1})}{1 - f_p^a(o_t)},\, 0\right),\, 1\right), & 1 - f_p^a(o_t) > \epsilon \\ 0, & \text{otherwise} \end{cases}$$



Integrating recognized behavior

- Estimate missing information
  - Recognition output: $\Gamma = (\dots, ((t_k, o_k, s_k), a_k, (t_{k+1}, o_{k+1}, s_{k+1})), \dots)$
  - Needed for learning:

    $$I_{t_k}^{t_{k+1}} = (o_{t_k}, a_{t_k}, d_{t_k}, \mu_{t_k}, f_{t_k}, o_{t_{k+1}})$$

    - duration: $d_{t_k} = t_{k+1} - t_k$
    - motivation: $\mu_{t_k}^I = \mu_{t_k}^D \cdot \dfrac{\mu_{max}^I - \mu_{min}^I}{\mu_{max}^D - \mu_{min}^D}$
    - failure: $f_{t_k} = \text{false}$
- Integrate recognized behavior into existing experience
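Filling in the missing fields of each recognized step can be sketched as below. The function names, the dictionary representation of an interaction, the scalar motivation per step, and the sample values are assumptions made for illustration.

```python
def rescale_motivation(mu_demo, demo_min, demo_max, own_min, own_max):
    """Map the demonstrator's motivation onto the imitator's motivation range."""
    return mu_demo * (own_max - own_min) / (demo_max - demo_min)


def to_experience(recognized, mu_demo, demo_range, own_range):
    """Turn recognized steps ((t_k, o_k, s_k), a_k, (t_k1, o_k1, s_k1)) into interactions."""
    interactions = []
    for (t_k, o_k, _s_k), a_k, (t_next, o_next, _s_next) in recognized:
        interactions.append({
            "o": o_k,
            "a": a_k,
            "d": t_next - t_k,  # duration d = t_{k+1} - t_k
            "mu": rescale_motivation(mu_demo[t_k], *demo_range, *own_range),
            "f": False,         # observed actions are assumed not to have failed
            "o_next": o_next,
        })
    return interactions


# Made-up recognized step: "approach ball" took 2.5 time units.
recognized = [((0.0, (1.0,), "s0"), "approach ball", (2.5, (0.2,), "s1"))]
experience = to_experience(recognized, {0.0: 4.0}, demo_range=(0.0, 10.0), own_range=(0.0, 1.0))
print(experience[0]["d"], experience[0]["mu"])
```

The resulting interactions have the same shape as the imitator's own experience, so they can be appended to it before the SMDP is updated.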




Evaluation: Multi-robot scenario "three bases": results

- The time to reach the goal decreases drastically in the beginning.
- With imitation it is slightly above no-imitation in the long run.
- Reason: the robots have learned to prefer the black base, which gives a higher reward but requires a longer way.
- More realistic measure: reward/time
- Result: imitation increases learning speed by up to 50%



Evaluation: Multi-robot scenario "three bases": abstraction

- The experience heuristic cuts off at 2000 interactions.
- Imitation starts with a lower amount of experience:
  - reaching the goals faster → less experience
  - learning faster to drive to the black base → more experience
- Fewer than 6 abstract states with no-imitation
- Fewer than 10 abstract states with imitation



Evaluation: Multi-robot scenario "three bases": group homogeneity

- Percentage of goal bases

  [Figure: two panels, "no imitation" and "imitation", plotting goal base [%] over the number of episodes for the yellow, red, and black base]

- Group homogeneity $G(X)$ is measured by the normalized Shannon entropy $H(X)$ ($H_{max} = \log |X|$):
  - $H(X) = -\sum_{x \in X} p(x) \log p(x)$
  - $G(X) = \dfrac{H_{max} - H(X)}{H_{max}}$

  [Figure: group homogeneity over the number of episodes (0 to 50) for imitation and no-imitation, with homogeneity values between 0.02 and 0.14]

1.0 = all robots prefer the same goal
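The homogeneity measure can be computed directly from the goal each robot chose. In this sketch the function name and the input format (one chosen goal per robot plus the set of all goals) are assumptions.

```python
import math
from collections import Counter


def group_homogeneity(choices, options):
    """G(X) = (H_max - H(X)) / H_max for the distribution of chosen goals."""
    if len(options) < 2:
        raise ValueError("need at least two options for a normalized entropy")
    h_max = math.log(len(options))
    n = len(choices)
    counts = Counter(choices)
    # Shannon entropy of the empirical choice distribution p(x).
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return (h_max - h) / h_max


bases = ["yellow", "red", "black"]
print(group_homogeneity(["black", "black", "black"], bases))
print(group_homogeneity(["yellow", "red", "black"], bases))
```

The first call yields 1.0 because all robots prefer the same goal; the second is close to 0 because the choices are spread evenly.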



Evaluation: Multi-robot scenario "five bases": description

- Robots have to transport objects to goal bases.
- No reward for reaching an object
- Goal bases provide different rewards:
  - black: 10,000 points
  - blue, green, red, yellow: 20 points
- The state space consists of
  - the distance to the closest object,
  - the distance of the closest object to the closest goal, and
  - the ID of the closest goal.



Evaluation: Multi-robot scenario "five bases": results

- Time to reach the goal: lower for imitation in the beginning, higher in the long run
- Reason: needle eye between the green and the blue base
- More realistic measure: reward/time
- Result: imitation increases learning speed in the long run by up to 100%



Evaluation: Multi-robot scenario "five bases": abstraction

- The experience heuristic cuts off at 2000 interactions.
- Fewer than 8 abstract states with no-imitation
- Fewer than 10 abstract states with imitation



Evaluation: Multi-robot scenario "five bases": group homogeneity

- Percentage of goal bases

  [Figure: two panels, "no imitation" and "imitation"]

- Group homogeneity $G(X)$ is measured by the normalized Shannon entropy $H(X)$ ($H_{max} = \log |X|$):
  - $H(X) = -\sum_{x \in X} p(x) \log p(x)$
  - $G(X) = \dfrac{H_{max} - H(X)}{H_{max}}$

  [Figure: group homogeneity]

1.0 = all robots prefer the same goal



Choice of the imitatee: Task

- Find the best imitatee prior to the imitation process itself

Approach

1. Observe behavior capabilities by means of affordances
2. Encode recognized affordances stochastically
3. Compare representation differences

Processing pipeline:

1. raw perception $I$ → affordance detection → accumulated affordances $T$
2. affordance network generation → affordance networks $AN_1, \dots, AN_n$
3. choice of the imitatee:

   $$R_{imitate} = \arg\min_{R_i \in R,\ R_i \ne R_m} D_{AN}(AN_i, AN_m)$$

   The outcome is either "imitate" or "don't imitate".



Affordance detection: Affordance testing conditions modeled as FSAs

Each step either succeeds (ok) and leads to the next step, or fails (fail).

(a) seizable: drive to object → align to object → seize object → finish
(b) liftable: drive to object → align to object → seize object → lift object → finish
(c) pushable: drive to object → align to object → drive forward → finish
(d) pullable: drive to object → align to object → seize object → drive backward → finish
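The four testing conditions are linear state machines, which the following sketch runs step by step. The step sequences are taken from the figure; the function name, the `execute` callback, and the example robot that cannot lift are assumptions for illustration.

```python
AFFORDANCE_STEPS = {
    "seizable": ["drive to object", "align to object", "seize object"],
    "liftable": ["drive to object", "align to object", "seize object", "lift object"],
    "pushable": ["drive to object", "align to object", "drive forward"],
    "pullable": ["drive to object", "align to object", "seize object", "drive backward"],
}


def check_affordance(name, execute):
    """Run the FSA for one affordance; execute(step) returns True on ok, False on fail."""
    for step in AFFORDANCE_STEPS[name]:
        if not execute(step):
            return False  # a failed step ends the test without reaching "finish"
    return True  # all steps succeeded: the object offers this affordance


def cannot_lift(step):
    """Hypothetical robot that succeeds at every step except lifting."""
    return step != "lift object"


results = {name: check_affordance(name, cannot_lift) for name in AFFORDANCE_STEPS}
print(results)
```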



Affordance network generation: Preparing collected data

$\Lambda_1$ = seizable, $\Lambda_2$ = liftable, $\Lambda_3$ = pushable, $\Lambda_4$ = pullable

Data $T^{red}$ collected by the red robot, split into $T^{red}_{red}$ and $T^{red}_{blue}$ (T = true, F = false, ? = unknown):

Λ_j   o_k   Valid_j(I_red(t), o_k, R_red)   Valid_j(I_red(t), o_k, R_blue)
Λ_1   o_1   T                               T
Λ_2   o_1   ?                               ?
Λ_3   o_1   T                               T
Λ_4   o_1   F                               F
Λ_1   o_2   F                               F
Λ_2   o_2   ?                               ?
Λ_3   o_2   F                               F
Λ_4   o_2   F                               F
Λ_1   o_3   T                               T
Λ_2   o_3   T                               T
Λ_3   o_3   T                               T
Λ_4   o_3   T                               F

Resulting table $T^{red}_{red}$:

o_k   Λ_1   Λ_2   Λ_3   Λ_4
o_1   T     ?     T     F
o_2   F     ?     F     F
o_3   T     T     T     T

Resulting table $T^{red}_{blue}$:

o_k   Λ_1   Λ_2   Λ_3   Λ_4
o_1   T     ?     T     F
o_2   F     ?     F     F
o_3   T     T     T     F



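Affordance network generation: Learned Bayesian networks [Friedman, 1997]

Network $AN_{red}$, learned from $T^{red}_{red}$:

- $P(A_2) = 0.60$
- $P(A_3) = 0.60$
- $A_1$ depends on $A_3$: $P(A_1 \mid A_3 = 0) = 0.00$, $P(A_1 \mid A_3 = 1) = 1.00$
- $A_4$ depends on $A_3$: $P(A_4 \mid A_3 = 0) = 0.00$, $P(A_4 \mid A_3 = 1) = 0.33$

Network $AN_{blue}$, learned from $T^{red}_{blue}$:

- $P(A_2) = 0.60$
- $P(A_3) = 0.60$
- $P(A_4) = 0.00$
- $A_1$ depends on $A_3$: $P(A_1 \mid A_3 = 0) = 0.00$, $P(A_1 \mid A_3 = 1) = 1.00$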



Comparing affordance networks: Based on Graph Edit Distance [Dickinson et al., 2003]

[Figure: the networks $AN_{red}$ and $AN_{blue}$ side by side]

- Affordance network difference:

  $$D_{AN}(AN_{red}, AN_{blue}) = \eta \cdot D_{struct}(AN_{red}, AN_{blue}) + (1 - \eta) \cdot D_{param}(AN_{red}, AN_{blue})$$

- Structural difference:

  $$D_{struct}(AN_{red}, AN_{blue}) = |C_{red}| + |C_{blue}| - 2|C_0| = 2 + 1 - 2 = 1$$

- Parameter difference:

  $$D_{param}(AN_{red}, AN_{blue}) = \sum_{i=1}^{4} D_{param}(A_i^{red}, A_i^{blue}) = 0.33$$
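The worked numbers can be reproduced with a few lines. This sketch reads $C_{red}$ and $C_{blue}$ as the edge sets of the two networks and $C_0$ as their common edges, which matches the values 2, 1, and 1 in the example but is an interpretation. The per-node parameter difference is not defined on the slide, so the value 0.33 is passed in as given, and the weight $\eta = 0.5$ is an arbitrary choice.

```python
def structural_difference(edges_a, edges_b):
    """D_struct = |C_a| + |C_b| - 2 |C_0| with C_0 the edges both networks share."""
    common = edges_a & edges_b
    return len(edges_a) + len(edges_b) - 2 * len(common)


def an_difference(edges_a, edges_b, d_param, eta):
    """D_AN = eta * D_struct + (1 - eta) * D_param."""
    return eta * structural_difference(edges_a, edges_b) + (1 - eta) * d_param


# Edges (parent, child) read off the example networks.
an_red = {("A3", "A1"), ("A3", "A4")}
an_blue = {("A3", "A1")}

print(structural_difference(an_red, an_blue))
print(an_difference(an_red, an_blue, d_param=0.33, eta=0.5))
```

Choosing the imitatee then means evaluating this distance against every other robot's network and taking the minimum.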






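Choosing the best imitatee

[Figure: affordance network of the red robot with $P(A_2) = 0.60$, $P(A_3) = 0.60$, $P(A_1 \mid A_3 = 0) = 0.00$, $P(A_1 \mid A_3 = 1) = 1.00$, $P(A_4 \mid A_3 = 0) = 0.00$, $P(A_4 \mid A_3 = 1) = 0.33$]

$$R_{imitate} = \arg\min_{R_i \in R,\ R_i \ne R_{red}} D_{AN}(AN_i, AN_{red})$$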



Evaluation: Experimental setup

- Robots imitate other robots that are performing different skills on the different objects.
- Meanwhile, the imitator's error functions of the involved skills are monitored.
- The number of failure signals received by the strategy layer serves as an indicator of how wise the demonstrator choice was.



Evaluation: Parametrization of the robots and objects

Robots

group     parameter       values              description
motor     power           [0.3, 6.0] kg       maximal weight a robot can pull or push
motor     speed           [0.03, 0.2] m/s     controls the impulse a robot imparts on an object
gripper   length          [0.08, 0.2] m       the longer the gripper, the deeper the objects can be
gripper   span            [0.16, 0.5] m       limits the diameter of objects that can be gripped
gripper   closing force   [1.0, 30.0] kg      controls the contact pressure (to pull heavier objects the closing force must be higher)
gripper   lifting force   [30.0, 80.0] kg     controls the friction (to lift heavier objects the closing force must be higher)
gripper   form            {normal, barb}      see figures below

Objects

parameter   values                      discretization
mass        [1.0, 5.0] kg               0.5 kg
width       [0.04, 0.24] m              0.05 m
height      [0.17, 0.2] m               0.05 m
friction    [50, 100] %                 0.1 %
shape       {sphere, cube, cylinder}



Evaluation: Robustness experiment

Figure: Impact of the demonstrator selection algorithm on the failure rates for different fractions of unknown data (with 95% confidence interval for the 0% case)



Evaluation: Clustering experiment

- Cluster objects by their appearance prior to AN creation
- Generate one AN for each cluster
- The distance is the weighted sum

  $$D_{AN}^{c}(R_a, R_b) = \sum_{l=1}^{n} \frac{k_l}{k} \cdot D_{AN}(AN_{a,l}, AN_{b,l})$$

  $$k = \min \{ (|T_{a,l}^m| + |T_{b,l}^m|) \mid 1 \le l \le n \}$$

  $$k_l = \min ( |T_{a,l}^m|, |T_{b,l}^m| )$$

Figure: Impact of the clustered demonstrator selection algorithm



Baele, G., Bredeche, N., Haasdijk, E., Maere, S., Michiels, N., Van de Peer, Y., Schwarzer, C., and Thenius, R. (2009). Open-ended on-board evolutionary robotics for robot swarms. In Tyrrell, A., editor, 2009 IEEE Congress on Evolutionary Computation, Trondheim, Norway. IEEE Computational Intelligence Society, IEEE Press.

Dautenhahn, K. and Nehaniv, C. (2002). Imitation in Animals and Artifacts, chapter "An agent-based perspective on imitation". MIT Press.

Dickinson, P. J., Bunke, H., Dadej, A., and Kraetzl, M. (2003). On graphs with unique node labels. In Graph Based Representations in Pattern Recognition, volume 2726, pages 409–437, Heidelberg, DE. Springer Berlin.

Dorigo, M., Tuci, E., Trianni, V., Groß, R., Nouyan, S., Ampatzis, C., Labella, T. H., O'Grady, R., Bonani, M., and Mondada, F. (2006). SWARM-BOT: Design and Implementation of Colonies of Self-Assembling Robots. In Computational Intelligence: Principles and Practice. IEEE Computational Intelligence Society, New York.

Friedman, N. (1997). Learning belief networks in the presence of missing values and hidden variables. In Proc. 14th International Conference on Machine Learning, pages 125–133. Morgan Kaufmann.

Inamura, T., Toshima, I., Nakamura, Y., and Saitama, J. (2003). Acquiring Motion Elements for Bidirectional Computation of Motion Recognition and Generation. Experimental Robotics VIII.

Kochenderfer, M. (2006). Adaptive Modelling and Planning for Learning Intelligent Behaviour. PhD thesis, School of Informatics, University of Edinburgh.

Priesterjahn, S. (2008). Online imitation and adaptation in modern computer games. PhD thesis, University of Paderborn.

Takahashi, Y., Tamura, Y., and Asada, M. (2008). Mutual development of behavior acquisition and recognition based on value system. In From Animals to Animats 10, 10th International Conference on Simulation of Adaptive Behavior (SAB 2008), pages 291–300.

Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269.

