
IROS04 (Sendai, Japan), University of Tehran

Amir massoud Farahmand, Majid Nili Ahmadabadi, Babak Najar Araabi

[email protected], {mnili, araabi}@ut.ac.ir

Behavior Hierarchy Learning in a Behavior-based System using Reinforcement Learning

Department of Electrical and Computer Engineering, University of Tehran, Iran


Paper Outline

• Challenges and Requirements of Robotic Systems
• Behavior-based Approach to AI
• How should we design a Behavior-based System (BBS)?!
• Learning in BBS
• Structure Learning in BBS
• Value Function Decomposition
• Experiments: Multi-Robot Object Lifting
• Conclusions, Ongoing Research, and Future Work


Challenges and Requirements of Robotic Systems

Challenges

• Sensor and Effector Uncertainty

• Partial Observability

• Non-Stationarity

Requirements

(among many others)

• Multi-goal

• Robustness

• Multiple Sensors

• Scalability

• Automatic design

• [Learning]


Behavior-based Approach to AI

• Behavior-based approach as a good candidate for low-level intelligence.
• Behavioral (activity) decomposition
  – against functional decomposition
• Behavior: Sensor->Action (direct link between perception and action)
• Situatedness
  – Situatedness motto: The world is its own best model!
• Embodiment
• Intelligence as Emergence
  – (interaction of agent with environment)


Behavioral decomposition

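[Figure: behavioral decomposition. Behavior layers, listed from top to bottom (manipulate the world, build maps, explore, locomote, avoid obstacles), sit in parallel between sensors and actuators.]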


Behavior-based System Design

• Hand Design
  – Common almost everywhere (just ask some people at IROS04)
  – Complicated: may be infeasible in complex problems
  – Even if a working system can be found, it is probably not optimal.
• Evolution
  – Time consuming
  – Good solutions can be found
  – Biologically feasible
• Learning
  – Biologically feasible
  – Learning is essential for the lifetime survival of the agent.

We focus on learning in this presentation.


The Importance of Learning

• Unknown environment/body
  – An [exact] model of the environment/body is not known
• Non-stationary environment/body
  – Changing environment (offices, houses, streets, and almost everywhere)
  – Aging
• The designer may not know how to benefit from every aspect of her agent/environment
  – Let the agent learn it by itself (learning as optimization)

• etc …


Learning in Behavior-based Systems

• There are a few works on behavior-based learning
  – Mataric, Mahadevan, Maes, and ...
• … but there is no deep investigation of it (especially of its mathematical formulation)!



There are different methods of learning with different viewpoints, but we have concentrated on Reinforcement Learning.
  – [Agent] Did I perform it correctly?!
  – [Tutor] Yes/No!



We have divided learning in BBS into these two parts:

• Structure Learning
  – How should we organize behaviors in the architecture, assuming we have a repertoire of working behaviors?
• Behavior Learning
  – How should each behavior behave? (we do not have the necessary toolbox)


Structure Learning Assumptions

• Structure Learning in the Subsumption Architecture as a good sample for BBS
• Purely parallel case
• We know B1, B2, and … but we do not know how to arrange them in the architecture
  – we know how to {avoid obstacles, pick an object, stop, move forward, turn, …} but we don’t know which one is superior to the others.
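To make the arrangement question concrete, here is a minimal sketch of priority-based arbitration in the purely parallel case. It assumes a structure is just an ordered list of behavior names (highest layer first) and that the highest activated behavior takes control; the function name `controlling_behavior` and the example data are hypothetical, not taken from the paper.

```python
from typing import Optional, Sequence, Set


def controlling_behavior(structure: Sequence[str], activated: Set[str]) -> Optional[str]:
    """Return the highest-priority activated behavior, or None if none is activated.

    Assumption: `structure` lists behaviors from the highest layer to the lowest,
    and a higher activated behavior suppresses every behavior below it.
    """
    for behavior in structure:
        if behavior in activated:
            return behavior
    return None


# Hypothetical example: two candidate arrangements of the same behaviors.
activated = {"explore", "avoid obstacles"}
print(controlling_behavior(["explore", "avoid obstacles", "move forward"], activated))
print(controlling_behavior(["avoid obstacles", "explore", "move forward"], activated))
```

The same set of activated behaviors yields a different controlling behavior under each arrangement, which is why the ordering itself is worth learning.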


Structure Learning

[Figure: Behavior Toolbox containing the behaviors manipulate the world, build maps, explore, locomote, and avoid obstacles.]

The agent wants to learn how to arrange these behaviors in order to get maximum reward from its environment (or tutor).



Structure Learning

[Figure: Behavior Toolbox and the current arrangement of behaviors.]

1. explore becomes the controlling behavior and suppresses avoid obstacles.
2. The agent hits a wall!


Structure Learning

[Figure: Behavior Toolbox and the current arrangement of behaviors.]

Tutor (environment) gives explore a punishment for being in that place of the structure.


Structure Learning

[Figure: Behavior Toolbox and the current arrangement of behaviors.]

“explore” is not a very good behavior for the highest position of the structure, so it is replaced by “avoid obstacles”.


Structure Learning Issues

• How should we represent structure?
  – Sufficient (the concept space should be covered by the hypothesis space)
  – Tractable (small hypothesis space)
  – Well-defined credit assignment
• How should we assign credits to the architecture?
  – If the agent receives a reward/punishment, how should we reward/punish the structure of the architecture?


Value Function Decomposition and Structure Learning

Each structure has a value based on the reinforcement signal it receives.

$$V_T = E\left[\, r_t \mid \text{the agent with structure } T \,\right]$$

• The objective is to find a structure T with a high value.
• We have decomposed the value function into simpler components that enable us to benefit from previous experiments.


Value Function Decomposition

• It is possible to decompose the total system’s value into the value of each behavior in each layer.
• We call it the Zero-Order method.

$$V_{ZO}(i, j) = V_{ij} = E\left[\, r_t \mid B_j \text{ is the controlling behavior in the } i^{\text{th}} \text{ layer} \,\right]$$
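The conditional expectation above can be approximated by a sample mean over logged experience. The sketch below assumes a log of (layer index, controlling behavior, reward) tuples; the function name `estimate_zo_values` and the log contents are hypothetical and only illustrate the definition, not the estimator used in the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def estimate_zo_values(log: Iterable[Tuple[int, str, float]]) -> Dict[Tuple[int, str], float]:
    """Estimate V_ZO(i, j) as the mean reward seen while behavior j controlled from layer i."""
    totals: Dict[Tuple[int, str], float] = defaultdict(float)
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for layer, behavior, reward in log:
        totals[(layer, behavior)] += reward
        counts[(layer, behavior)] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical log entries: (layer index, controlling behavior, reward).
log = [
    (1, "explore", -1.0),
    (1, "explore", 1.0),
    (1, "avoid obstacles", 1.0),
    (2, "explore", 1.0),
]
zo = estimate_zo_values(log)
for key in sorted(zo):
    print(key, zo[key])
```

Because each entry depends only on a (layer, behavior) pair, experience gathered under one structure can be reused when evaluating another structure that shares that pair.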


Value Function Decomposition: Zero Order Method

It stores the value of a behavior being in a specific layer.

[Figure: ZO value table in the agent’s mind, spanning higher to lower layers. Entries shown: avoid obstacles (0.8), avoid obstacles (0.6), explore (0.7), explore (0.9), locomote (0.4), locomote (0.4).]


Credit Assignment for Zero Order Method

• The controlling behavior is the only behavior responsible for the current reinforcement signal.
• An appropriate ZO value table updating method is available.


Value Function Decomposition: Another Method (First Order)

It stores the value of the relative order of behaviors
  – How good/bad is it if “B1 is placed higher than B2”?!

• V(avoid obstacles>explore) = 0.8
• V(explore>avoid obstacles) = -0.3
• Sorry! Not that easy (and informative) to show graphically!!
• Credits are assigned to all (controlling, activated) pairs of behaviors.
  – The agent receives reward while B1 is controlling and B3 and B5 are activated
    • (B1>B3): +
    • (B1>B5): +
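The pairwise credit assignment can be sketched as follows. The pairing of the controlling behavior with every other activated behavior follows the slide; the moving-average update with step size `alpha` and the names `update_fo` and `FOTable` are assumptions made for illustration only.

```python
from typing import Dict, Iterable, Tuple

FOTable = Dict[Tuple[str, str], float]


def update_fo(table: FOTable, controlling: str, activated: Iterable[str],
              reward: float, alpha: float = 0.1) -> None:
    """Credit every (controlling > activated) pair with the received reward.

    Assumption: unseen pairs start at 0.0 and alpha is a fixed step size.
    """
    for other in activated:
        if other == controlling:
            continue  # a behavior is not ordered relative to itself
        key = (controlling, other)  # read as "controlling is higher than other"
        old = table.get(key, 0.0)
        table[key] = old + alpha * (reward - old)


# The slide's example: reward arrives while B1 controls and B3, B5 are activated.
fo_table: FOTable = {}
update_fo(fo_table, controlling="B1", activated=["B3", "B5"], reward=1.0, alpha=0.5)
print(fo_table)
```

Both (B1>B3) and (B1>B5) move in the positive direction, while pairs involving behaviors that were not activated keep their previous values.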


Structure Representation

Both of these methods are supported by a lot of probabilistic reasoning, which shows how to
  – decompose the total system value into simple components
  – assign credits
  – update the value table

Check the Proceedings for the mathematical formulation!


Example: Multi-Robot Object Lifting

• A group of three robots wants to lift an object using their own local sensors
  – No central control
  – No communication
  – Local sensors
• Objectives
  – Reaching the prescribed height
  – Keeping the tilt angle small



Behavior Toolbox

Stop

Push More

Hurry Up

Slow Down

Don’t Go Fast

?!



Sample shot of tilt angle of the object after sufficient learning

[Figure: average total reward per episode (y-axis, -40 to 40) versus episodes (x-axis, up to 50). Curves: mean hand-designed performance, zero order, first order.]



Sample shot of height of each robot after sufficient learning

[Figure: height z of robots 1, 2, and 3 (y-axis, 0 to 3.5) versus steps (x-axis, 0 to 90), with the goal height marked.]



Sample shot of tilt angle of the object after sufficient learning

[Figure: tilt angle in degrees (y-axis, 0 to 45) versus steps (x-axis, 0 to 90).]


Conclusions, Ongoing Research, and Future Work

• We have devised two different methods for structure learning in behavior-based systems.
• Good results in two different tasks
  – Multi-robot Object Lifting
  – An Abstract Problem (not reported yet)

• … but where should we find the necessary behaviors?!
  – Behavior Learning
• We have devised some methods for behavior learning, which will be reported soon.

• However, many steps remain before fully automated agent design
  – How should we generate new behaviors without even knowing which sensory information is necessary for the task (feature selection)?
  – Problem of Reinforcement Signal Design
    • Designing a good reinforcement signal is not easy at all.
