Active Inference and Behavior Trees for Reactive Action Planning and Execution in Robotics

Corrado Pezzato, Carlos Hernández Corbato, Stefan Bonhof, and Martijn Wisse

Abstract—We propose a hybrid combination of active inference and behavior trees (BTs) for reactive action planning and execution in dynamic environments, showing how robotic tasks can be formulated as a free-energy minimization problem. The proposed approach makes it possible to handle partially observable initial states and improves the robustness of classical BTs against unexpected contingencies, while at the same time reducing the number of nodes in a tree. In this work, the general nominal behavior is specified offline through BTs, where a new type of leaf node, the prior node, is introduced to specify the desired state to be achieved rather than an action to be executed, as typically done in BTs. The decision of which action to execute to reach the desired state is performed online through active inference. This results in the combination of continual online planning and hierarchical deliberation; that is, an agent is able to follow a predefined offline plan while still being able to locally adapt and take autonomous decisions at runtime. The properties of our algorithm, such as convergence and robustness, are thoroughly analyzed, and the theoretical results are validated on two different mobile manipulators performing similar tasks, both in a simulated and a real retail environment.

Index Terms—Active Inference, Reactive Action Planning, Behavior Trees, Mobile Manipulators, Biologically-Inspired Robots, Free-energy Principle

I. INTRODUCTION

DELIBERATION and reasoning capabilities for acting are crucial parts of online robot control, especially when operating in dynamic environments to complete long-term tasks. Over the years, researchers developed many task planners with various degrees of optimality, but little attention has been paid to actors [1], [2], i.e. algorithms endowed with reasoning and deliberation tools during plan execution. Works such as [1], [2] advocate a change in focus, explaining why this lack of actors could be one of the main causes of the limited spread of automated planning applications. Works such as [3]–[5] proposed the use of Behavior Trees as graphical models for more reactive task execution, showing promising results. Other authors also tried to answer the call for actors [6], [7], but there are still open challenges to be addressed. These challenges have been identified and highlighted by many researchers, and can be summarised as in [1], [4] in two properties that an actor should possess:

• Hierarchical deliberation: each action in a plan may be a task that an actor may need to further refine online.

This research was supported by Ahold Delhaize. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.

Corrado Pezzato, Carlos Hernández Corbato, Stefan Bonhof, and Martijn Wisse are with the Cognitive Robotics Department, TU Delft, 2628 CD Delft, The Netherlands {c.pezzato, c.h.corbato, s.d.bonhof, m.wisse}@tudelft.nl

• Continual online planning and reasoning: an actor should monitor, refine, extend, update, change, and repair its plans throughout the acting process, generating activities dynamically at run-time.

Actors should not be mere action executors, then, but they should be capable of intelligently taking decisions. This is particularly useful for challenging problems such as mobile manipulation in dynamic environments, where actions planned offline are prone to fail. In this paper we consider mobile manipulation tasks in a retail environment with a partially observable initial state, and we propose an actor which is capable of following a task planned offline while still being able to take autonomous decisions at run-time to resolve unexpected situations.

To achieve this, we propose the use of active inference, a neuroscientific theory which has recently shown its potential in control engineering and robotics [8]–[11], in particular in real-world experiments for low-level adaptive control [12], [13]. Active inference describes a biologically plausible algorithm for perception, action, planning, and learning. This theory was initially developed for continuous processes [14]–[16], where the main idea is that the brain's cognition and motor control functions could be described in terms of free-energy minimization [17]. In other words we, as humans, take actions in order to fulfill prior expectations about a desired sensation [18]. Active inference has also been extended to Markov Decision Processes (MDPs) for discrete decision making [19], recently gathering more and more interest [20]–[23]. In this formulation, active inference is proposed as a unified framework to solve the exploitation-exploration dilemma by acting to minimize the free-energy. Agents can solve complicated problems once provided with a context-sensitive prior about preferences. Probabilistic beliefs about the state of the world are built through Bayesian inference, and a finite-horizon policy is selected in order to maximise the evidence for a model that is biased towards the agent's preferences. At the time of writing, the use of discrete active inference for symbolic action planning is limited to low-dimensional and simplified simulations [22], [23]. In addition, current solutions rely on fundamental assumptions such as instantaneous actions without preconditions, which do not hold in real robotic situations.

To tackle these limitations of active inference and to address the two main open challenges of hierarchical deliberation and continual planning, we propose a hybrid combination of active inference and Behavior Trees. We then apply this new idea to mobile manipulation in a dynamic retail environment.



A. Related Work

In this section we mainly focus on related work on reactive action planning and execution, a class of methods that exploits reactive plans, which are stored structures that contain the behaviour of an agent. To begin with, Behavior Trees [3], [24] gathered increasing popularity in robotics to specify reactive behaviors. BTs are graphical representations for action execution. The general advantage of BTs is that they are modular and can be composed into more complex higher-level behaviours, without the need to specify how different BTs relate to each other. They are also an intuitive representation which modularizes other architectures such as finite state machines (FSMs) and decision trees, with proven robustness and safety properties [3]. These advantages and the structure of BTs make them particularly suitable for the class of dynamic problems we are considering in this work, as explained later on in Section II. However, in classical formulations of BTs, the plan reactivity still comes from hard-coded recovery behaviors. This means that highly reactive BTs are usually big and complex, and that adding new robotic skills would require revising a large tree. To partially cope with these problems, Colledanchise et al. [4] proposed a blended reactive task and action planner which dynamically expands a BT at runtime through back-chaining. The solution can compensate for unanticipated scenarios, but cannot handle partially observable environments and uncertain action outcomes. Conflicts due to contingencies are handled by prioritizing (shifting) sub-trees. The authors in [5] extended [4] and showed how to handle uncertainty in the BT formulation as well as planning with non-deterministic outcomes for actions and conditions. Other researchers tried to combine the advantages of behavior trees with the theoretical guarantees on performance of PDDL planning [6]. Even though this results in a more concise problem description with respect to using BTs only, online reactivity is again limited to the scenarios planned offline. For unforeseen contingencies re-planning would be necessary, which is more resource demanding than reacting, as shown in the experimental results in [6].

Goal Oriented Action Planning (GOAP) [25], instead, focuses on online action planning. This technique is used for non-player characters (NPCs) in video games [26]. Goals in GOAP do not contain predetermined plans. Instead, GOAP considers atomic behaviors which contain preconditions and effects. The general behavior of an agent can then be specified through very simple FSMs, because the transition logic is separated from the states themselves. GOAP generates a plan at run-time by searching in the space of available actions for a sequence that will bring the agent from the starting state to the goal state. However, GOAP requires hand-designed heuristics that are scenario specific, and is computationally expensive for long-term tasks.

Hierarchical Planning in the Now (HPN) [27] is an alternate plan-and-execution algorithm where a plan is generated backwards starting from the desired goal, using A*. HPN recursively executes actions and re-plans. To cope with the stochasticity of the real world, HPN has been extended to belief HPN (BHPN) [28]. A follow-up work [29] focused on reducing the computational burden, implementing selective re-planning to repair local poor choices or exploit new opportunities without the need to re-compute the whole plan, which is very costly since the search process is exponential in the length of the plan.

A different method for generating plans in a top-down manner is the Hierarchical Task Network (HTN) [30], [31]. At each step, a high-level task is refined into lower-level tasks. In practice [32], the planner exploits a set of standard operating procedures for accomplishing a given task. The planner decomposes the given task choosing among the available ones, until the chosen primitive is directly applicable to the current state. Tasks are then iteratively replaced with new task networks. An important remark is that reaction to failures, time-outs, and external events is still a challenge for HTN planners. Besides, current interfaces for HTN planning are often cumbersome and not user-friendly [31]. The drawbacks of this approach are that it requires the designer to write and debug potentially complex domain-specific recipes, and that in very dynamic situations re-planning might occur too often.

Finally, active inference is a normative principle underwriting perception, action, planning, decision-making and learning in biological or artificial agents. Active inference on discrete state-spaces [33] is a promising approach to solve the exploitation-exploration dilemma and empowers agents with continuous deliberation capabilities. The application of this theory, however, is still at an early stage for discrete decision making, where current works only focus on simplified simulations as proofs of concept [20]–[23]. In [23], for instance, the authors simulated an artificial agent which had to learn to solve a maze given a set of simple possible actions to move (up, down, left, right, stay). Actions were assumed instantaneous and always executable. In general, current discrete active inference solutions lack a systematic and task-independent way of specifying prior preferences, which is fundamental to achieve a meaningful agent behavior, and they never consider action preconditions, which are crucial in real-world robotics. As a consequence, policies with conflicting actions, which might arise in dynamic environments, are never addressed in the current state-of-the-art.

B. Contributions

In this work, we propose the hybrid combination of BTs and active inference to obtain more reactive actors with hierarchical deliberation and continual online planning capabilities. We introduce a method to include action preconditions and conflict resolution in active inference, as well as a systematic way of providing prior preferences through BTs. The proposed hybrid scheme leverages the advantages of online action selection with active inference and removes the need for complex predefined fallback behaviors while providing convergence guarantees. In addition, with this paper we want to facilitate the use of active inference in robotics by making it more accessible to a plurality of researchers; thus we provide extensive mathematical derivations, examples, and code.

C. Paper structure

The remainder of the paper is organized as follows. In Section II we provide an extensive background on active inference and BTs. Our novel hybrid approach is presented in Section III, and its properties in terms of robustness and stability are analysed in Section IV. In Section V we report the experimental evaluation, showing how our solution can be used with different mobile manipulators for different tasks in a retail environment. Finally, Section VI contains discussion, conclusions, and future work.

II. BACKGROUND ON ACTIVE INFERENCE AND BTS

A. Background on Active Inference

Active inference provides a powerful architecture for the creation of reactive agents which continuously update their beliefs about the world and act on it, to match specific desires. This behavior is automatically generated through free-energy minimization, which provides a metric for meaningful decision making in the environment. Active inference provides a unifying theory for perception, action, decision-making and learning in biological or artificial agents [33]. The core idea is that these processes can be described in terms of the optimization of two complementary objective functions, namely the variational free-energy F and the expected free-energy G. Creatures are assumed to have a set of sensory observations o and an internal generative model to represent the world, which is used to build internal probabilistic representations, or Bayesian beliefs, of the environment's dynamics. Agents can also act on the environment to change its states, and thus the sensory observations. External states are internally represented through the so-called hidden states s. Variational free-energy measures the fit between the internal model and past and current sensory observations, while expected free-energy scores future possible courses of action according to prior preferences. In this subsection we report the necessary theory for planning and decision making with discrete active inference from the vast literature. All the necessary mathematical derivations are added in the appendix.

1) Discrete state-space generative models: Usually, agents do not have access to the true states but can only perceive the world via possibly noisy sensory observations [20], [34], and have an internal representation, or model, of the world. The problem can then be framed as a variant of a Markov decision process where the internal generative model is a joint probability distribution over possible hidden states and sensory observations. This is used to predict sensory data, and to infer the causes of these data [20]:

$$P(o, s, A, B, D, \pi) = P(\pi)P(A)P(B)P(D)\prod_{\tau=1}^{T} P(s_\tau \mid s_{\tau-1}, \pi)\,P(o_\tau \mid s_\tau) \quad (1)$$

This factorization takes into account the Markov property, so the next state and current observation depend only on the current state. In eq. (1), o and s represent sequences of observations and states from time τ = 1 to the current time t. Regarding the parameters of the model: A is the likelihood matrix, i.e. the probability of an outcome being observed given a specific state; B_{a_t}, or simply B, is the transition matrix, i.e. the probability of state s_{t+1} when applying action a_t from state s_t; D is the prior, or initial belief, about the state at τ = 1. Each column of A, B, D is a categorical distribution, Cat(·); finally, π is a policy, defined as a sequence of actions over a time horizon T. The full derivation of the generative model in eq. (1), the assumptions under its factorization, and how this joint probability is used to define the free-energy F as an upper bound on surprise, can be found in Appendix A.

2) Variational (negative) Free-energy: Given the generative model as before, one can show how free-energy becomes an upper bound on the atypicality of observations, defined as surprise in information theory. By minimising variational free-energy an agent can determine the most likely hidden states given sensory information. An expression for F is given by:

$$F(\pi) = \sum_{\tau=1}^{T} s^{\pi}_{\tau} \cdot \left[\ln s^{\pi}_{\tau} - \ln\!\left(B^{\pi}_{\tau-1} s^{\pi}_{\tau-1}\right) - \ln A \cdot o_\tau\right] \quad (2)$$

where F(π) is a policy-specific free-energy. For the derivations please refer to Appendix B.
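To make eq. (2) concrete, the following minimal numpy sketch evaluates a policy-specific free-energy for categorical beliefs. It is a sketch under stated assumptions, not the released implementation: the variable names (s_pi, B_pi, etc.) are illustrative, and the helpers softmax and EPS are reused in the later snippets.

```python
import numpy as np

EPS = 1e-16  # avoid log(0) for categorical distributions

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def free_energy(s_pi, o, A, B_pi, D):
    """Policy-specific variational free-energy, eq. (2).
    s_pi: list of state beliefs under a policy, o: list of one-hot observations (None if unobserved),
    A: likelihood matrix, B_pi: list of transition matrices selected by the policy, D: initial prior."""
    F = 0.0
    for tau, s in enumerate(s_pi):
        prior = D if tau == 0 else B_pi[tau - 1] @ s_pi[tau - 1]   # ln D at tau = 1, ln(B s) afterwards
        F += s @ (np.log(s + EPS) - np.log(prior + EPS))           # complexity term
        if o[tau] is not None:
            F -= s @ (np.log(A + EPS).T @ o[tau])                  # accuracy term, ln A . o
    return F
```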

TABLE I
NOTATION FOR ACTIVE INFERENCE

Symbol | Description
s_τ, s^π_τ | Hidden state at time τ, and expected value of its posterior estimate under a policy
o_τ, o^π_τ | Outcome at time τ, and expected value of its posterior estimate under a policy
A | Likelihood matrix mapping from hidden states to outcomes, P(o_t|s_t) = Cat(A)
B | Transition matrix, P(s_{t+1}|s_t, a_t) = Cat(B)
C | Logarithm of prior preferences over outcomes, ln P(o_τ) = C
D | Prior over initial states, P(s_1|s_0) = Cat(D)
F | Variational free-energy
G | Expected free-energy
π, π | Policy specifying action sequences, and its posterior expectation
σ | Softmax function
a_t | Action at time t

3) Perception: According to active inference, both perception and decision making are based on the minimisation of free-energy. This is achieved through a gradient descent on F. In particular, for state estimation we take partial derivatives with respect to the states and set the gradient to zero. The expected value of the posterior estimate of the state, conditioned on a policy, is given by:

$$s^{\pi}_{\tau} = \sigma\!\left(\ln(B^{\pi}_{\tau-1})\, s^{\pi}_{\tau-1} + \ln(B^{\pi}_{\tau}) \cdot s^{\pi}_{\tau+1} + \ln A \cdot o_\tau\right) \quad (3)$$

where σ is the softmax function. For the complete derivation, please refer to Appendix C. Note that when τ = 1 the first term is simply ln D.
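As an illustration of eq. (3), the single-step update below combines the past, future, and likelihood messages before the softmax. The placement of the transposes follows the usual message-passing form of discrete active inference and should be read as an assumption about the compact notation above; softmax and EPS come from the previous sketch.

```python
def update_state(tau, s_pi, o, A, B_pi, D):
    # Posterior state belief under a policy, eq. (3); the first term becomes ln D when tau = 0
    past = np.log(D + EPS) if tau == 0 else np.log(B_pi[tau - 1] + EPS) @ s_pi[tau - 1]
    future = np.log(B_pi[tau] + EPS).T @ s_pi[tau + 1] if tau + 1 < len(s_pi) else 0.0
    evidence = np.log(A + EPS).T @ o[tau] if o[tau] is not None else 0.0
    return softmax(past + future + evidence)
```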

4) Expected Free-energy: Active inference unifies action selection and perception by assuming that action fulfills predictions based on inferred states. Since the internal model can be biased towards preferred states or outcomes (prior desires), active inference induces actions that will bring the current beliefs towards the preferred states. More specifically, the generative model entails beliefs about future states and policies, where policies that lead to preferred outcomes are more likely. Preferred outcomes are specified in the model parameter C. This enables action to realize the next (proximal) outcome predicted by the policy that leads to (distal) goals. By minimising expected free-energy the agent gives higher plausibility to actions that will achieve those sensations. The expected free-energy for a policy π at time τ is given by (see Appendix D):

$$G(\pi, \tau) = o^{\pi}_{\tau} \cdot \left[\ln o^{\pi}_{\tau} - C\right] + s^{\pi}_{\tau} \cdot A \cdot \ln A \quad (4)$$
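A hedged numpy sketch of eq. (4): risk (divergence of the predicted outcomes from the preferences C) plus ambiguity. Writing the ambiguity term as s · diag(Aᵀ ln A) is an assumption about the compact notation used above, not a claim about the paper's appendix.

```python
def expected_free_energy(s_pi_tau, A, C):
    # Expected free-energy of a policy at time tau, eq. (4)
    o_pred = A @ s_pi_tau                                              # predicted outcomes o^pi_tau
    risk = o_pred @ (np.log(o_pred + EPS) - C)                         # divergence from preferences C
    ambiguity = -s_pi_tau @ np.einsum('os,os->s', A, np.log(A + EPS))  # s . diag(A^T ln A)
    return risk + ambiguity
```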

a) Planning and decision making: Taking a gradient descent of F with respect to policies, and recalling that the generative model specifies P(π) = σ(−G(π)), it holds that the approximate posterior over policies is given by:

$$\boldsymbol{\pi} = \sigma(-G_\pi - F_\pi) \quad (5)$$

See Appendix E for the details. The vector π contains the probability for each policy.

b) Policy-independent state estimation: Given the probability over possible policies, and the policy-dependent states s^π_τ, we can compute the overall probability distribution of the states over time through Bayesian model averaging:

$$s_\tau = \sum_p \pi_p \cdot s^{\pi_p}_{\tau} \quad (6)$$

where s^{π_p}_τ is the probability of a state at time τ under policy p. This is the average prediction for the state s_τ at a certain time, weighted by the probability of each policy. In other words, this is a weighted average over different models. Models with high probability receive more weight, while models with lower probabilities are discounted.

c) Action selection: The action for the agent at the current time t is determined from the most likely policy:

$$a_t = \arg\max_{u} \Big( \sum_p \delta_{u,\pi_p}\, \pi_p \Big) \quad (7)$$

where δ is the Kronecker delta, and π_p is the probability of policy p.

The pseudo-code for action selection and belief updating, once given the model parameters A, B, D, is reported in Algorithm 1.

Algorithm 1 Action selection with active inference
1: Set C                                ▷ prior preferences
2: for τ = 1 : T do
3:   Sample state from B if not specified
4:   Sample outcome from A if not specified
5:   Compute free-energy                ▷ eq. (2)
6:   Update posterior state s^π_τ       ▷ eq. (3)
7:   Compute expected free-energy       ▷ eq. (4)
8:   Bayesian model averaging           ▷ eq. (6)
9:   Action selection                   ▷ eq. (7)
10: end for
11: Return a                            ▷ Preferred policy
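Putting the pieces together, a minimal sketch of Algorithm 1 could look as follows. It reuses the helpers from the previous sketches, assumes a small set of explicitly enumerated policies, and is meant to illustrate the flow of eqs. (2)–(7) rather than reproduce the released implementation.

```python
def action_selection(policies, A, B, D, C, o, T):
    """Evaluate each policy, compute the posterior over policies (eq. 5), average state beliefs
    (eq. 6), and return the most probable next action (eq. 7). B[a] is the transition matrix of
    action a; `policies` is a list of action sequences of length T."""
    F, G, s_all = [], [], []
    for pi in policies:
        B_pi = [B[a] for a in pi]
        s_pi = [np.ones(len(D)) / len(D) for _ in range(T)]        # flat initial beliefs
        for _ in range(4):                                         # a few fixed-point sweeps of eq. (3)
            s_pi = [update_state(tau, s_pi, o, A, B_pi, D) for tau in range(T)]
        F.append(free_energy(s_pi, o, A, B_pi, D))
        G.append(sum(expected_free_energy(s_pi[tau], A, C) for tau in range(T)))
        s_all.append(np.array(s_pi))
    pi_post = softmax(-np.array(G) - np.array(F))                  # eq. (5)
    s_avg = sum(p * s for p, s in zip(pi_post, s_all))             # Bayesian model average, eq. (6)
    first = [pi[0] for pi in policies]                             # eq. (7): pick the action whose
    action = max(set(first),                                       # policies carry the most probability
                 key=lambda u: sum(p for p, a in zip(pi_post, first) if a == u))
    return action, s_avg
```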

B. Background on Behavior Trees

We now describe the high-level concepts at the basis of Behavior Trees according to previous work such as [3], [24]. These concepts will be useful to understand the novel hybrid scheme proposed in the next section. A Behavior Tree is a directed tree composed of nodes and edges that can be seen as a graphical modeling language. It provides a structured representation for the execution of actions which are based on conditions and observations in a system. The nodes in a BT follow the classical definition of parents and children. The root node is the only node without a parent, while the leaf nodes are all the nodes without children. In a BT, the nodes can be divided into control flow nodes (Fallback, Sequence, Parallel, or Decorator) and execution nodes (Action or Condition), which are the leaf nodes of the tree. When executing a given behavior tree in a control loop, the root node sends a tick to its child. A tick is nothing more than a signal that allows the execution of a child. The tick propagates in the tree following the rules dictated by each control node. A node returns a status to the parent, which can be running if its execution has not finished yet, success if the goal is achieved, or failure in the other cases. At this point, the return status is propagated back up the tree, which is traversed again following the same rules. The most important control nodes are:

Fallback nodes: A fallback node ticks its children from left to right. It returns success (or running) as soon as one of the children returns success (or running). When a child returns success or running, the fallback does not tick the next child, if present. If all the children return failure, the fallback returns failure. This node is graphically identified by a gray box with a question mark "?";

Sequence nodes: The sequence node ticks its children from left to right. It returns running (or failure) as soon as a child returns running (or failure). The sequence returns success only if all the children return success. If a child returns running or failure, the sequence does not tick the next child, if present. In the library we used to implement our BTs [35], the sequence node, indicated with [→], keeps ticking a running child, and it restarts only if a child fails. [35] also provides reactive sequences [→R], where every time a sequence is ticked, the entire sequence is restarted from the first child.

The execution nodes are Actions and Conditions:

Action nodes: An Action node performs an action in the environment. While an action is being executed, this node returns running. If the action is completed correctly it returns success, while if the action cannot be completed it returns failure. Actions are represented as red rectangles;

Condition nodes: A Condition node determines if a condition is met or not, returning success or failure accordingly. Conditions never return running and do not change any states or variables. They are represented as orange ovals.

An example BT is given in Fig. 1.
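For readers unfamiliar with these semantics, the toy classes below tick Fallback and Sequence nodes exactly as described above; they are a minimal illustration, not the BT library [35] used in this work.

```python
class Fallback:
    """Ticks children left to right; returns the first SUCCESS/RUNNING, FAILURE if all fail."""
    def __init__(self, children):
        self.children = children
    def tick(self):
        for child in self.children:
            status = child.tick()
            if status in ('SUCCESS', 'RUNNING'):
                return status
        return 'FAILURE'

class Sequence:
    """Ticks children left to right; returns the first FAILURE/RUNNING, SUCCESS if all succeed."""
    def __init__(self, children):
        self.children = children
    def tick(self):
        for child in self.children:
            status = child.tick()
            if status in ('FAILURE', 'RUNNING'):
                return status
        return 'SUCCESS'
```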

III. ACTIVE INFERENCE AND BTS FOR REACTIVE ACTION PLANNING AND EXECUTION

In this section we introduce our novel approach using behavior trees and active inference.


Fig. 1. Example of a BT. The tick traverses the tree starting from the Root. If Condition is true, Action1 is executed. If both of them return success, the Root returns success; otherwise Action2 is executed.

Even though active inference is a very promising theory, computing the expected free-energy for each possible policy that a robot might take is prohibitive from a computational perspective. This curse of dimensionality is due to the combinatorial explosion when looking deep into the future [33]. To solve this problem, we propose to replace deep policies with shallow decision trees that are hierarchically composable. This will allow us to simplify our offline plans, exploit opportunities, and act intelligently to resolve local unforeseen contingencies. Our idea consists of two main intuitions:

• To avoid combinatorial explosion while planning and acting with active inference for long-term tasks, we specify the nominal behavior of an agent through a BT, used as a prior. In doing so, Behavior Trees provide global reactivity to foreseen situations;

• To avoid coding every possible contingency in the BT, we program only desired states offline, and we leave action selection to the online active inference scheme. Active inference then provides local reactivity to unforeseen situations.

To achieve such a hybrid integration, and to be able to deploy this architecture on real robotic platforms, we addressed the following three fundamental problems: 1) how to define the generative models for active inference in robotics, 2) how to use BTs to provide priors as desired states to active inference, and 3) how to handle action preconditions in active inference, and possible conflicts which might arise at run-time in a dynamic environment.

A. Definition of the models for active inference

The world in which an agent operates needs to be abstracted such that the active inference agent can perform its reasoning processes. In this work, we operate in a continuous environment with the ability of sensing and acting through symbolic decision making. We model the skills of a robot as a set of atomic actions a_j ⊂ A with j = 1...k, which have associated parameters, pre- and postconditions:

Action a_j | Preconditions | Postconditions
action_name(par) | prec_{a_j} | post_{a_j}

where prec_{a_j} and post_{a_j} are first-order logic predicates that can be evaluated at run-time. A logical predicate is a boolean-valued function P : X → {true, false}.

We define x ⊂ X as the continuous states of the world; the internal states of the robot are accessible through the symbolic perception system. The role of this perception system is to discretize the continuous states x into symbolic observations o_i ⊂ O with i = 1...n that can be manipulated by the discrete active inference agent. Observations o are used to build a probabilistic belief about the current state, indicated with s_i ⊂ S with i = 1...n. Each belief for a state i is s_i ∈ R^{m_i}, where m_i is the number of mutually exclusive symbolic values that a state can have. Each entry of a particular belief s_i can take any value between 0 and 1 (∈ [0, 1]), while each entry of a particular observation o_i is a binary value ∈ {0, 1}, indicating if a particular state is sensed or not. Finally, we define l ⊂ L as the most probable logical state based on the belief s. Defining a logical state based on the probabilistic belief s built with active inference, instead of directly using the observation of the states o, increases robustness against noisy sensor readings.

Given the model of the world just introduced, we can now define the likelihood matrix A, the transition matrix B and the prior over states C. When an observation is available, A provides information about the corresponding value of a state. Thus, for a particular state s_i, the probability of a state-outcome pair (o_i, s_i) at the current time t is:

$$P(o_{i,t} \mid s_{i,t}) = Cat(A_i), \quad A_i = I_d \in \mathbb{R}^{m_i \times m_i} \quad (8)$$

Note that knowing the mapping between observations and states does not necessarily mean that we can observe all the states at all times. Observations can be present or not, and when they are, the likelihood matrix indicates that a specific state is sensed.

To define the transition matrices, we need to encode in matrix form the effects of each action on the relevant states. The probability of ending up in a state s_{t+1}, given s_t and action a_i, is given by:

$$P(s_{t+1} \mid s_t, a_i) = Cat(B_{a_i}), \quad B_{a_i} \in \mathbb{R}^{m_i \times m_i} \quad (9)$$

In other words, we define B_{a_i} as a square matrix encoding the post-conditions of action a_i.

Example 1. Consider a mobile manipulator in a retail environment. The first skill that we provide to the robot is the ability to navigate to a specified goal location, such as moveTo(goal).

Actions | Preconditions | Postconditions
moveTo(goal) | - | l_g = [1 0]^T

During execution, given the current position in space of the robot (⊂ X), the robot receives an observation o_g which simply indicates if the goal has been reached or not. The agent is then constantly building a probabilistic belief s_g encoding the chances of being at the goal. In case the robot has not yet reached the goal, a possible configuration at time t is the following:

$$o_g = \begin{bmatrix} isAt(goal) \\ !isAt(goal) \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad s_g = \begin{bmatrix} 0.08 \\ 0.92 \end{bmatrix}, \quad l_g = \begin{bmatrix} 0 \\ 1 \end{bmatrix},$$
$$B_g = \begin{bmatrix} 0.95 & 0.9 \\ 0.05 & 0.1 \end{bmatrix}, \quad B_{idle} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \quad (10)$$


The transition matrix B_g encodes the probability of reaching a goal location through the action moveTo(goal), which might fail with a certain probability. We also encode an Idle action, which does not modify any state, but provides information on the outcome of the action selection process, as we will see in the next subsections.

Using the proposed problem formulation for the active inference models, we can abstract away redundant information which is not necessary to make high-level decisions. For instance, in the example above we are not interested in building a probabilistic belief of the current robot position. To decide whether to use the action moveTo(goal) or not, it is sufficient to encode if the goal has been reached or not. One also has to define the vector D_i ∈ R^{m_i} encoding the initial belief about the probability distribution of the states. When no prior information is available, each entry of D_i will be 1/m_i. In this work, we assume the likelihood and transition matrices, i.e. the model parameters, to be known. However, one could use free-energy minimisation to learn them [36]. Finally, the prior preferences over outcomes (or states) need to be encoded in C_i ∈ R^{m_i}. Each entry contains a real value for a particular state. The higher the value of a preference, the more preferred the state is, and vice-versa. Priors are formed according to specific desires and they will be used to interface active inference and Behavior Trees.
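As a concrete illustration, the generative model of Example 1 could be encoded with a few numpy arrays as below; the numerical values mirror eq. (10), while the dictionary layout is only illustrative and not the paper's file format.

```python
import numpy as np

# Likelihood: isAt(goal) is observed directly whenever an observation is available (eq. 8)
A_g = np.eye(2)

# Transitions (eqs. 9-10): moveTo drives the belief towards isAt(goal); idle leaves it unchanged
B = {'moveTo': np.array([[0.95, 0.90],
                         [0.05, 0.10]]),
     'idle':   np.eye(2)}

D_g = np.array([0.5, 0.5])   # no prior information about being at the goal
C_g = np.array([1.0, 0.0])   # prior preference set by the BT: sense isAt(goal)
o_g = np.array([0, 1])       # current observation: not at the goal yet
```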

B. BTs integration: planning preferences, not actions

To achieve a meaningful behavior in the environment through active inference, we need to encode specific desires into the agent's brain, that is, populate C_i appropriately.

A prior as a Behavior Tree: We propose to extend the available BT nodes in order to be able to specify desired states to be achieved as leaf nodes. We introduce a new type of leaf node called the prior node, indicated with a green hexagon. These nodes can be seen as standard action nodes, but instead of commanding an action, they simply set the desired value of a state in C and run active inference for action selection. The prior node is then just a leaf node in the Behavior Tree which returns: Success if a state is achieved, Running while trying to achieve it, or Failure if for some reason it is not possible to reach the desired state. The return statuses are set according to the outcome of our reactive action selection process, as explained in Section III-D.

Sub-goals through BTs: To reach a distal goal state, we plan achievable sub-goals in the form of desired logical states l, according to the available actions that a robot possesses. This idea of using sub-goals was already used in [23], but in our solution with BTs we provide a task-independent way to define sub-goals which is only based on the set of available skills of the robot, such that we can make sure that it can complete the task. At planning time, we define the ideal sequence of states and actions to complete a task such that subsequent sub-goals (or logical desired states) are achievable by means of one action. This can be done manually or through automated planning. At run-time, however, we only provide the sequence of states to the algorithm, as in Fig. 2.

Fig. 2. The path among states is planned offline using the available set of actions, but only the sequence of states is provided at run-time. Actions are chosen online from the available set with active inference.

Example 2. We would like to program the behavior of our mobile manipulator to navigate to a goal location in a retail store. Example 1 provides the necessary models. To induce a goal-oriented behavior with active inference, the BT will set the prior over s_g to C_g = [1, 0]^T, meaning that the robot would like to sense that it is at the goal.

A classical BT and a BT for active inference with prior nodes are reported in Fig. 3. Note that the action is left out of the BT for active inference because actions are selected at runtime. In this particular case, the condition isAt(goal) can be seen as the desired observation to obtain.

Fig. 3. BT to navigate to a location using a classical BT and a BT for active inference. One action moveTo(goal) is available and one condition isAt(goal) provides information on whether the current location is the goal. The prior node for active inference (green hexagon) sets the desired prior and runs the action selection process.

Note that the amount of knowledge (i.e. number of states and actions) which is necessary to code a classical BT or our active inference version in Example 2 is the same. However, we abstract the fallback by planning in the state space and not in the action space. Instead of programming the action moveTo(goal), we only set a prior preference over the state isAt(goal), since the important information is retained in the state to be achieved rather than in the sequence of actions to do so. Action selection through active inference will then select the appropriate skills to match the current state of the world with the desired one, minimizing this discrepancy through free-energy minimization.
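A prior node can thus be sketched as a tiny wrapper that writes into C and delegates to active inference for action selection; the class and method names below are illustrative, not the interface of the released code.

```python
class PriorNode:
    """BT leaf that plans in the state space: set a preference, let active inference act."""
    def __init__(self, agent, state_name, preference):
        self.agent = agent                    # active inference agent holding A, B, C, D
        self.state_name = state_name          # e.g. 'isAt(goal)'
        self.preference = preference          # e.g. [1.0, 0.0] for C_g

    def tick(self):
        self.agent.C[self.state_name] = self.preference
        # Returns 'SUCCESS', 'RUNNING' or 'FAILURE' according to the adaptive action
        # selection of Algorithm 2 (Section III-D)
        return self.agent.adaptive_action_selection()
```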


Fig. 4. Overview of the control architecture for reactive action planning and execution using active inference. Adaptive action selection is performed according to Algorithm 2.

C. Action preconditions and conflicts

Past work on active inference, such as [23], was based on the assumption that actions were always executable and non-conflicting, but these assumptions do not hold in more realistic scenarios.

Action preconditions in active inference: We propose to encode action preconditions as desired logical states that need to hold in order to execute a particular action. This is illustrated in the next example.

Example 3. We add one more action to the set of skills of our mobile manipulator, Pick(obj), and the relative transition matrix B_h. The action templates are then extended as follows:

Actions | Preconditions | Postconditions
moveTo(goal) | - | l_g = [1 0]^T, l_r = [1 0]^T
Pick(obj) | isReachable(obj) | l_h = [1 0]^T

$$o_h = \begin{bmatrix} isHolding(obj) \\ !isHolding(obj) \end{bmatrix}, \quad B_h = \begin{bmatrix} 0.95 & 0.9 \\ 0.05 & 0.1 \end{bmatrix}, \quad o_r = \begin{bmatrix} isReachable(obj) \\ !isReachable(obj) \end{bmatrix} \quad (11)$$

where we added a new logical state l_h, the relative belief s_h and observation o_h, which indicates if the robot is holding the object obj. In the simplest case, we suppose that the only precondition for a successful grasping is that obj is reachable. We then add a logical state l_r, as well as s_r and o_r, to provide active inference with information about this precondition. o_r can be built, for instance, by trying to compute a grasping pose for a given object. The robot can act on the state l_h through Pick, and it can act on l_r through moveTo.
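The action templates of Example 3 can be stored as plain data, so that preconditions can be checked against the current logical state at run-time. The dictionary below is a hypothetical encoding for illustration, not the paper's file format; [1, 0] means the predicate holds.

```python
# Each precondition/postcondition refers to a logical state l_i
ACTIONS = {
    'moveTo': {'preconditions': {},
               'postconditions': {'isAt':        [1, 0],
                                  'isReachable': [1, 0]}},
    'pick':   {'preconditions':  {'isReachable': [1, 0]},
               'postconditions': {'isHolding':   [1, 0]}},
}

def preconditions_hold(action, L):
    """True if every precondition of `action` is satisfied in the current logical state L."""
    return all(L.get(name) == value
               for name, value in ACTIONS[action]['preconditions'].items())
```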

Conflicts in active inference: Given the current belief about the states L and a prior preference C, active inference will select the most appropriate action to execute among the available ones. The preconditions of the selected action are then checked, and if one or more preconditions are not satisfied, the relative logical states are pushed into the corresponding prior with a high preference (i.e. > 1). Action selection is then performed again with the updated prior C, such that actions that will satisfy the missing action preconditions are more likely, as explained in Algorithm 2. This can, however, lead to conflicts with the original BT [4]; that is, the robot might want to simultaneously achieve two conflicting states. With active inference, though, we can represent not only which state is important, but also when. Since missing preconditions are added to the current prior with a higher preference with respect to the offline plan, this will induce a behavior that can initially go against the initial BT, because the new desire is more appealing to satisfy. Conflict resolution is then achieved by locally updating prior desires about a state, giving them higher preference. The convergence analysis of this approach is reported in Section IV.

D. Complete control scheme

Our solution is summarised in Algorithm 2 and Fig. 4. At run-time, the BT for a certain task is ticked at a given frequency. The symbolic perception layer takes the sensory readings and translates these continuous quantities into logical observations. This can be achieved through user-defined models according to the specific environment and sensors available. The logical observations are used to perform belief updating to keep a probabilistic representation of the world in S. Then, the logical state L is formed. Every time a prior node in the BT is ticked, the corresponding priors in C are set.

For both missing preconditions and conflicts, high-priority priors are removed from the preferences C whenever the preconditions are satisfied or the conflicts resolved (lines 6–10), allowing the nominal flow of the BT to resume. Active inference from Algorithm 1 is then run for action selection. If no action is required, since the logical state corresponds to the prior, the algorithm returns Success; otherwise the selected action's preconditions are checked and eventually pushed with higher priority. Then, action selection is performed with the updated prior. This procedure is repeated until either an executable action is found, returning Running, or no action can be executed, returning Failure. The case of Failure is handled through the global reactivity provided by the BT. This creates dynamic and stable regions of attraction, as explained in Section IV-A, by means of sequential controller composition [37] (lines 17–31 of Algorithm 2).

The new idea of using dynamic priors instead of static ones, as in previous active inference schemes, allows for precondition checking and conflict resolution. A robot can follow a long programmed routine while autonomously taking decisions to locally compensate for unexpected events. This considerably reduces the need for hard-coded fallbacks, allowing the BT to be compressed to a minimal number of nodes.


Algorithm 2 Pseudo-code for Adaptive Action Selection
1: Get desired prior and parameters from BT:
2:   C, param ← BT
3: Set current observations, beliefs and logical state:
4:   Set O, S, L
5: Remove preferences with high priority if satisfied:
6: for all high-priority priors C_i ∈ C do
7:   if C_i ⊂ L then
8:     Remove pushed priors;
9:   end if
10: end for
11: Run active inference given O, S, L and C:
12: a_t ← Action_selection(O, S, L, C)      ▷ Alg. 1
13: if a_t == Idle then
14:   return Success;                       ▷ No action required
15: else
16:   Check action preconditions:
17:   while a_t != Idle do
18:     if l_i ≡ prec_{a_t} then
19:       Execute(a_t);
20:       return Running;                   ▷ Executing action
21:     else
22:       Push missing preconditions in C:
23:       C ← prec_{a_t};
24:       Exclude a_t and re-run Alg. 1:
25:       Remove(a_t);
26:       a_t ← Action_selection(O, S, L, C)
27:       if a_t == Idle then
28:         return Failure;                 ▷ No solution
29:       end if
30:     end if
31:   end while
32: end if

IV. THEORETICAL ANALYSIS

A. Analysis of convergence

We provide a theoretical analysis of the proposed control architecture. There are two possible scenarios that might occur at run-time: specifically, the environment might or might not differ from what has been planned offline through BTs. These two cases are analysed in the following to study the convergence of our proposed solution to the desired goal.

1) The dynamic environment IS as planned: In a nominal execution, where the environment in which a robot is operating is the same as the one at planning time, there is a one-to-one equivalence between our approach and a classical BT formulation. This follows directly from the fact that the BT is defined according to Section III-B, so each subsequent state is achievable by means of one single action. At any point of the task the robot finds itself in the planned state and has only one preference over the next state, given by the BT through C. The only action which can minimise the expected free-energy is the one used during offline planning. In the nominal case, then, we maintain all the properties of BTs, which are well explained in [3]. In particular, the behavior will be Finite Time Successful (FTS) [3] if the atomic actions are assumed to return success after finite time. Note that so far we did not consider actions with the same postconditions. However, in this case Algorithm 2 would sequentially try all the alternatives following the given order at design time. This can be improved, for instance, by making use of semantic knowledge at runtime to inform the action selection process about preferences over actions to achieve the same outcome. This information can be stored, for instance, in a knowledge base and can be used to parametrize the generative model for active inference.

2) The dynamic environment IS NOT as planned: The most interesting case is when a subsequent desired state is not reachable as initially planned. As explained before, in such a case we push the missing preconditions of the selected action into the current prior C to locally and temporarily modify the goal. We analyse this idea in terms of sequential controller (or action) composition as in [37], and we show how Algorithm 2 generates a policy that will eventually converge to the initial goal. First of all, we provide some assumptions and definitions that will be useful for the analysis.

Assumption 1: The action templates with pre- and postconditions provided to the agent are correct;

Assumption 2: A given desired goal (or a prior C) is achievable by at least one atomic action;

Definition 1: The domain of attraction (DoA) of an action a_i is defined as the set of its preconditions. The DoA of a_i is indicated as D(a_i);

Definition 2: We say that an action a_1 prepares action a_2 if the postconditions Pc of a_2 lie within the DoA of a_1, so Pc(a_2) ⊆ D(a_1);

Following Algorithm 2, each time a prior leaf node is ticked in the BT, active inference is used to define a sequence of actions to bring the current state towards the goal. It is sufficient to show, then, that the asymptotically stable equilibrium of a generic generated sequence is the initially given goal.

Lemma 1. Let L_c be the current logical state of the world, and A the set of available actions. An action a_i ∈ A can only be executed within its DoA, so when D(a_i) ⊆ L_c. Let us assume that the goal C is a postcondition of an action a_1 such that Pc(a_1) = C, and that L_c ≠ C. If D(a_1) ⊄ L_c, Algorithm 2 generates a sequence π = {a_1, ..., a_N} with domain of attraction D(π) according to the steps below.

1) Let the initial sequence contain a_1 ∈ A, π(1) = {a_1}, D(π) = D(a_1); set N = 1 and Pc(π(N)) = Pc(a_1)
2) Remove a_N from the available actions, and add the unmet preconditions, C = C ∪ D(a_N)
3) Select a_{N+1} through active inference (Algorithm 1). Then, a_{N+1} prepares a_N by construction, π(N+1) = π(N) ∪ {a_{N+1}}, D_{N+1}(π) = D_N(π) ∪ D(a_{N+1}), and N = N + 1
4) Repeat 2, 3 until D(a_N) ⊆ L_c OR a_N == Idle

If D(a_N) ⊆ L_c and D(a_1) ⊆ ⋃ Pc(π\a_1), the sequential composition π stabilizes the system at the given desired state C. If all a_i ∈ A are FTS, then π is FTS.

Proof. Since D(a_N) ⊆ L_c and Pc(a_N) ⊆ D(a_{N−1}), it follows that L_c is moving towards C. Moreover, when D(a_1) ⊆ ⋃ Pc(π\a_1), after action completion L_c ≡ C, since by definition Pc(a_1) = C.


Note that if D(a_N) ⊆ L_c does not hold after sampling all available actions, it means that the algorithm is unable to find a set of actions which can satisfy the preconditions of the initially planned action. This situation is a major failure which needs to be handled by the overall BT. Lemma 1 is a direct consequence of the sequential behavior composition of FTS actions, where each action has effects within the DoA of the action below. The asymptotically stable equilibrium of each controller is either the goal C, or it is within the region of attraction of another action earlier in the sequence; see [3], [37], [38]. One can visualize the idea of sequential composition in Fig. 5.

Fig. 5. DoA of different controllers around the current logical state L_c, as well as their postconditions within the DoA of the controller below.

B. Analysis of robustness

It is not easy to find a common and objective definition of robustness to analyse the characteristics of algorithms for task execution. One possible way is to describe robustness in terms of domains, or regions, of attraction, as in past work [3], [37]. When considering task planning and execution with classical behavior trees, these regions of attraction are often defined offline, leading to a complex and extensive analysis of the possible contingencies that might eventually happen [3], and these are specific to each different task. Alternatively, adapting the region of attraction requires either re-planning [6] or dynamic BT expansion [4]. Our solution does not require designing complex and large regions of attraction offline; rather, these are automatically generated by Algorithm 2, according to the minimisation of free-energy. We then measure robustness according to the size of these regions, such that a robot can achieve a desired goal from a plurality of initial conditions. With our approach we achieve robust behavior by dynamically generating a suitable region of attraction which is biased towards a desired goal. We then cover only the necessary region in order to be able to steer the current state to the desired goal, changing prior preferences at run-time.

Corollary 1. When an executable action a_N is found during task execution such that π = {a_1, ..., a_N}, Algorithm 2 has created a region of attraction towards a given goal that includes the current state L_c. If D(a_1) ⊆ ⋃ Pc(π\a_1), the region is asymptotically stable.

Proof. The corollary follows simply from Lemma 1.

Example 4. Let us assume that Algorithm 2 produced a policy π = {a_1, a_2}, a set of FTS actions, where a_2 is executable, so D(a_2) ⊆ L_c, and its effects are such that Pc(a_2) ≡ D(a_1). Since a_2 is FTS, after a certain running time Pc(a_2) ⊆ L_c. At the next tick after the effects of a_2 have taken place, π = {a_1}, where this time a_1 is executable since D(a_1) ⊆ L_c and Pc(a_1) = C. The overall goal is then achieved in finite time.

Instead of looking for globally asymptotically stable policies from each initial state to each possible goal, which can be unfeasible or at least very hard [37], we define smaller regions of attraction dynamically, according to the current state and goal.

V. EXPERIMENTAL EVALUATION

In this section we evaluate our algorithm in terms of robustness, safety, and conflict resolution in two validation scenarios with two different mobile manipulators and tasks. We also provide a comparison with classical and dynamically expanded BTs.

A. Experimental scenarios

1) Scenario 1: The task is to pick one object from a crate and place it on top of a table. This object might or might not be reachable from the initial robot configuration, and the placing location might or might not be occupied by another movable box. We suppose that other external events, or agents, can interfere with the execution of the task, resulting in either helping or adversarial behavior. The robot used for the first validation scenario is a mobile manipulator consisting of a Clearpath Boxer mobile base, in combination with a Franka Emika Panda arm. The experiment for this scenario was conducted in a Gazebo simulation in a simplified version of a real retail store, see Fig. 6.

Fig. 6. Simulation of the mobile manipulation task.

2) Scenario 2: The task is to fetch a product in a mockup retail store and stock it in a shelf using the real mobile manipulator TIAGo, as in Fig. 7.

Importantly, the behavior tree for completing the task in the real store with TIAGo is the same as the one used in simulation with the Panda arm and the mobile base, just parametrised with a different object and place location. The code developed for the experiments and more theoretical examples are publicly available.¹

¹https://github.com/cpezzato/discrete_active_inference


Fig. 7. Experiments with TIAGo, stocking a product in the shelf.

B. Implementation

In order to program the tasks for Scenarios 1 and 2, we extended the robot skills defined in our theoretical Example 3. We then added two extra states and their relative observations: isPlacedAt(loc, obj), called s_p, and isLocationFree(loc), called s_f. The state s_p indicates whether or not obj is at loc, with associated probability, while s_f indicates whether loc is occupied by another object. Then, we also had to add three more actions, which are 1) Place(obj, loc), 2) Push(obj) to free a placing location, and 3) PlaceOnPlate(obj), to place the object held by the gripper on the robot's plate. We summarise the states and skills of the mobile manipulator in Table II.

TABLE II
NOTATION FOR STATES AND ACTIONS

State, Boolean state | Description
s_g, l_g | Belief about being at the goal location
s_h, l_h | Belief about holding an object
s_r, l_r | Belief about reachability of an object
s_p, l_p | Belief about an object being at a location
s_f, l_f | Belief about a location being free

Actions | Preconditions | Postconditions
moveTo(goal) | - | l_g = [1 0]^T, l_r = [1 0]^T
Pick(obj) | isReachable(obj), !isHolding | l_h = [1 0]^T
Place(obj, loc) | isLocationFree(loc) | l_p = [1 0]^T
Push() | !isHolding | l_f = [1 0]^T
PlaceOnPlate() | - | l_h = [0 1]^T

The likelihood matrices are just the identity, while the transition matrices simply map the postconditions of actions, similarly to Example 3. Note that the design of actions and states is not unique, and other combinations are possible. One can make atomic actions increasingly more complex, or add more preconditions. The plan, specified in a behavior tree, contains the desired sequence of states to complete the task, leaving out of the offline planning other complex fallbacks to cope with contingencies associated with the dynamic nature of the environment. The behavior tree for performing the tasks in both Scenario 1 and Scenario 2 is reported in Fig. 8. Note that the fallback for the action moveTo could be substituted by another prior node as in Fig. 3; however, we opted for this alternative solution to highlight the hybrid combination of classical BTs and active inference. Design principles for choosing when to use prior nodes and when to use normal fallbacks are reported in Sec. V-F.

Fig. 8. BT with prior nodes to complete the mobile manipulation task in the retail store, Scenario 1 and 2. locs, locp are respectively the location in front of the shelf in the store and the desired place location of an item.

The implementation of the algorithm for mobile manipulation described so far was developed for Scenario 1, and then it was entirely reused in Scenario 2, with the only adaptation being the desired object and locations in the BT. In Sec. V-C and Sec. V-D, robustness and run-time conflict resolution are analysed for Scenario 1, but similar considerations can be derived for Scenario 2.

C. Robustness: Dynamic regions of attraction

The initial configuration of the robot is depicted in Fig. 6. The first node which is ticked in the behavior tree is the node containing the prior isHolding(obj). According to the current world's state, Algorithm 2 will select different actions to generate a suitable DoA. An example is reported below.

Example 5. Suppose the initial conditions are such that the object is not reachable. Let sh ∈ [0, 1] be the probabilistic belief of holding an object, and sr ∈ [0, 1] the probabilistic belief of its reachability. The DoA generated by Algorithm 2 at runtime is depicted in Fig. 9 using phase portraits as in [3]. Actions, when performed, increase the probability of their postconditions.

From Fig. 9, we can see that the goal of the active inference agent is to hold obj, so Ch = [1 0]^T. The first selected action is then Pick(obj). However, since the current logical state is not contained in the domain of attraction of the action, the prior preferences are updated with the missing (higher priority) precondition according to the action template provided, that is isReachable, so Cr = [2 0]^T. This results in a sequential composition of controllers with a stable equilibrium corresponding to the postconditions of Pick(obj). On the other hand, to achieve the same domain of attraction with a classical behavior tree, one would require several additional nodes, as explained in Sec. V-F and visualized in Fig. 11. Instead of extensively programming fallback behaviors, Algorithm 2 endows our actor with deliberation capabilities, and allows the agent to reason about the current state of the environment, the desired state, and the available action templates.


Fig. 9. Dynamic DoA generated by Algorithm 2 for Example 5. (a) relates to the action Pick(obj), and (b) is the composition of moveTo(loc) and Pick(obj) after automatically updating the prior preferences.

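The precondition handling of Example 5 can be sketched schematically in Python. The snippet below replaces the probabilistic beliefs and free-energy minimisation of Algorithm 2 with plain Boolean flags, and the class ActionTemplate, the priority values, and the greedy goal selection are our own simplifications, so it should be read as an illustration of the idea rather than the actual algorithm.

```python
from dataclasses import dataclass, field

@dataclass
class ActionTemplate:
    name: str
    preconditions: dict = field(default_factory=dict)   # state -> required value
    postconditions: dict = field(default_factory=dict)  # state -> resulting value

TEMPLATES = [
    ActionTemplate("moveTo", {}, {"atGoal": True, "isReachable": True}),
    ActionTemplate("Pick", {"isReachable": True, "isHolding": False},
                   {"isHolding": True}),
]

def select_action(prior, state, templates):
    """Greedy sketch of the precondition push: pursue the highest-priority
    unmet preference; if the action achieving it has an unmet precondition,
    push that precondition into the prior with a higher priority and retry."""
    prior = dict(prior)                       # e.g. {"isHolding": (True, 1)}
    while True:
        unmet = [(s, v, p) for s, (v, p) in prior.items() if state.get(s) != v]
        if not unmet:
            return None, prior                # all preferences already satisfied
        goal, value, _ = max(unmet, key=lambda x: x[2])
        action = next(t for t in templates if t.postconditions.get(goal) == value)
        missing = [(s, v) for s, v in action.preconditions.items()
                   if state.get(s) != v]
        if not missing:
            return action.name, prior         # current state is inside the DoA
        top = max(p for _, p in prior.values())
        for s, v in missing:                  # dynamic update of the prior
            prior[s] = (v, top + 1)

# Example 5: the object is not yet reachable, the BT prior asks for isHolding
state = {"isHolding": False, "isReachable": False, "atGoal": False}
print(select_action({"isHolding": (True, 1)}, state, TEMPLATES))
# -> ('moveTo', {'isHolding': (True, 1), 'isReachable': (True, 2)})
```

As in Fig. 9(b), pushing isReachable with a higher priority makes moveTo the first selected action, after which Pick becomes applicable.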

D. Resolving run-time conflicts

Perturbations affecting the system during action selection can lead to conflicts with the initial offline plan. This is the case in the manipulation task in Fig. 6 after picking the object and moving in front of the table, and then sensing that the place location is not free.

Example 6. The situation just described can be represented schematically by considering the priors and the logical states of the quantities of interest (see Table II for notation). For completing the task, the robot should hold the object, be at the desired place location, and then have the object placed (12a). On the other hand, there is a mismatch between the current logical belief about the state lp and the desired Cp (12b).

$$C_h = \begin{bmatrix}1\\0\end{bmatrix},\quad C_g = \begin{bmatrix}1\\0\end{bmatrix},\quad C_p = \begin{bmatrix}1\\0\end{bmatrix} \tag{12a}$$

$$l_h = \begin{bmatrix}1\\0\end{bmatrix},\quad l_g = \begin{bmatrix}1\\0\end{bmatrix},\quad l_p = \begin{bmatrix}0\\1\end{bmatrix} \tag{12b}$$

The only remaining state to be reached is the last one, so the action selected by active inference is Place(obj,loc). The missing precondition on the place location to be free is then added to the prior Cf. According to Algorithm 2, active inference is run again. This leads to the next action which can minimise free-energy, that is Push. Again, the missing precondition !isHolding is pushed into the current prior with higher priority.

$$C_h = \begin{bmatrix}1\\2\end{bmatrix},\quad C_f = \begin{bmatrix}2\\0\end{bmatrix} \tag{13}$$

In this situation, a conflict with the offline plan arises, as can be seen in Ch. Even though the desired state specified in the BT is isHolding(obj), at this particular moment there is a higher preference for having the gripper free, due to a missing precondition to proceed with the plan. Algorithm 2 then selects the action that best matches the current prior desires, or equivalently that minimises expected free-energy the most, that is PlaceOnPlate, to obtain lh = [0 1]^T. This then allows the robot to perform the action Push. At this point, there are no more preferences with high priority, thus the prior over the state lh is only set by the BT as Ch = [1 0]^T. Now the object can be picked again and placed on the table since no more conflicts are present.
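As a small numeric check of why PlaceOnPlate is selected here, one can evaluate the expected-cost term of the expected free-energy (Appendix D) for the two candidate outcomes of the holding factor under the conflicting preference Ch of equation (13). The helper below is a deliberate simplification: it treats C as a log preference, assumes deterministic outcomes with the identity likelihood (so the ambiguity term vanishes), and is not the full agent.

```python
import numpy as np

def expected_cost(o, C, eps=1e-16):
    """Expected cost term of eq. (38)/(39): o . (ln o - C)."""
    o = np.clip(o, eps, 1.0)
    return float(o @ (np.log(o) - C))

C_h = np.array([1.0, 2.0])               # BT wants isHolding, but the pushed
                                         # precondition !isHolding has priority 2
still_holding = np.array([1.0, 0.0])     # predicted outcome if the object is kept
gripper_free = np.array([0.0, 1.0])      # predicted outcome after PlaceOnPlate

print(expected_cost(still_holding, C_h)) # ~ -1.0
print(expected_cost(gripper_free, C_h))  # ~ -2.0, lower cost: PlaceOnPlate wins
```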

The simulations of Scenario 1 and the experiments of Scenario 2, as well as the behavior of the mobile manipulator from Examples 5 and 6, can be visualized in the recorded video.2 The behavior trees used to encode priors for active inference are implemented using a standard library [35].

E. Safety

When designing adaptive behaviors for autonomous robots, attention should be paid to safety. The proposed algorithm allows to retain control over the general behavior of the robot, and to force a specific routine in case something goes wrong, leveraging the structure of BTs. In fact, we are able to include adaptation through active inference only in the parts of the task that require it, keeping all the properties of BTs intact. Safety guarantees can easily be added by using a sequence node where the leftmost part is the safety criterion to be satisfied [3], as shown in Fig. 10.

Fig. 10. BT with safety guarantees while allowing runtime adaptation.

In this example, the BT prevents the battery from dropping below a safety-critical value while performing a task. The sub-tree on the right can be any other behavior tree, for instance the one used to solve Scenario 1 and Scenario 2 from Fig. 8.

Since, by construction, a BT is executed from left to right, one can assure that the robot is guaranteed to satisfy the leftmost condition first before proceeding with the rest of the behavior. In our specific case, this allows to easily override the online decision making process with active inference where needed, in favour of safety routines.
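The left-to-right guarantee follows directly from the tick semantics of sequence and fallback nodes, which the toy Python sketch below reproduces with a hypothetical battery condition and charging routine. It is not BehaviorTree.CPP [35]; node classes, threshold, and actions are purely illustrative.

```python
from enum import Enum

class Status(Enum):
    SUCCESS, FAILURE, RUNNING = range(3)

class Sequence:
    """Ticks children left to right and returns the first non-SUCCESS status,
    so the leftmost (safety) child is always evaluated before the task."""
    def __init__(self, children): self.children = children
    def tick(self):
        for child in self.children:
            status = child.tick()
            if status != Status.SUCCESS:
                return status
        return Status.SUCCESS

class Fallback:
    """Ticks children left to right and returns the first non-FAILURE status."""
    def __init__(self, children): self.children = children
    def tick(self):
        for child in self.children:
            status = child.tick()
            if status != Status.FAILURE:
                return status
        return Status.FAILURE

class Condition:
    def __init__(self, fn): self.fn = fn
    def tick(self): return Status.SUCCESS if self.fn() else Status.FAILURE

class Charge:
    def tick(self):
        print("battery low: docking to charge")
        return Status.RUNNING

battery = {"level": 0.15}
safety = Fallback([Condition(lambda: battery["level"] > 0.2), Charge()])
task = Condition(lambda: True)     # stands in for the sub-tree of Fig. 8
root = Sequence([safety, task])
print(root.tick())                 # Status.RUNNING: charging overrides the task
```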

F. Comparison and design principles

1) Comparison with similar approaches: The hybrid scheme of active inference and BTs aims at providing a framework for reactive action planning and execution in robotic systems. For this reason, we compare the properties of our approach with standard BTs [3] and with BTs generated through expansion from goal conditions [4].

2https://youtu.be/dEjXu-sD1SI


Fig. 11. Possible standard BT to perform Scenario 1 and Scenario 2 without prior nodes for active inference. Parts of the behavior that require several fallbacks can be substituted by prior nodes for online adaptation instead.

Scenario 1 and Scenario 2 can be tackled, for instance, by explicitly planning every fallback behavior with classical BTs, as in Fig. 11. Even if this provides the same reactive behavior as the one generated by Fig. 8, far more (planning) effort is needed: to solve the same task one would require 12 control nodes, 8 condition nodes, and 7 actions, for a total of 27 nodes, compared to the 6 needed in our approach, that is a compression of roughly 78%.

Importantly, the development effort of a prior node in a BT is the same as a standard action node. It is true that active inference requires to specify the likelihood and transition matrices encoding actions' pre- and postconditions, but this has to be done only once while defining the available skills of a robot, and it is independent of the task to be solved. Thus, a designer is not concerned with this when adding a prior node in a BT.

Instead of planning several fallbacks offline, [4] dynamically expands a BT from a single goal condition, through backchaining, to blend planning and acting online. To solve Scenario 1 and Scenario 2 with this approach, one needs to define a goal condition isPlacedAt(obj, loc) similarly to our solution, and define the preconditions of the action Place(obj, loc) such that they contain the fact that the robot is holding the object, that the place location is reachable, and that it is free. Then, to solve Scenarios 1 and 2 one needs to define only the final goal condition and run the algorithm proposed in [4]. Even though this allows to complete the tasks similarly to what we propose, [4] comes with a fundamental theoretical limitation: adaptation cannot be selectively added only in specific parts of the tree. The whole behavior is indeed determined at runtime based on preconditions and effects of actions starting from a goal condition. The addition of safety guarantees in specific parts of the tree is not possible explicitly, unlike in our approach, which might be a deal breaker for safety critical applications. To conclude, the hybrid combination of active inference and BTs allows to combine the advantages of both offline design of BTs and online dynamic expansion. In particular: it drastically reduces the number of necessary nodes in a BT planned offline, it can handle partial observability of the initial state, it allows to selectively add adaptation in specific parts of the tree, and it allows to add safety guarantees while being able to adapt to unforeseen contingencies.

Another important difference between our approach and other BT solutions is that we introduced the concept of state in a BT through the prior node, for which a probabilistic belief is built, updated, and used for action planning at runtime with uncertain action outcomes.

Additionally, one may argue that other solutions such as ROSPlan [39] could also be used for planning and execution in robotics in dynamic environments. ROSPlan leverages automated planning with PDDL2.1, but it is not designed for fast reaction in case of dynamic changes of the environment caused by external events. Moreover, it is based on the assumption that the initial state is fully observable, and its reactivity is limited to re-planning of the whole task, which might be a waste of resources.

Table III reports a summary of the comparison with standard BTs and BTs with dynamic expansion for Scenario 1 and 2.

TABLE III
SUMMARY OF COMPARISON

Approach         # Nodes   Unforeseen contingen.   Selective adaptation   Safety guarantees
Standard BT      27        ✗                       ✗                      ✓
Dynamic BT [4]   1         ✓                       ✗                      ✗
Ours             6         ✓                       ✓                      ✓

2) Design principles: We position our work in between two extremes, namely fully offline planning and fully online dynamic expansion of BTs. In our method, a designer can decide whether to lean towards a fully offline approach or a fully online synthesis, depending on the task at hand and the modelling of the actions' pre- and postconditions. Even though the design of behaviors is still an art, we give some design principles which can be useful in the development of robotic applications using this hybrid BT and active inference method. Take for instance Fig. 11 and Fig. 8.


Prior nodes for local adaptation can be included in the behavior when there are several contingencies to consider or action preconditions to be satisfied in order to achieve a sub-goal. A designer can: 1) plan offline where the task is certain, or equivalently where a small number of things can go wrong; 2) use prior nodes implemented with active inference to decide at runtime the actions to be executed whenever the task is uncertain. This is a compromise between a fully defined plan, where the behavior of the robot is predefined in every part of the state space, and a fully dynamic expansion of BTs, which can result in sub-optimal action sequences [4]. This is illustrated in Fig. 8 for instance, where the actions for holding and placing an object are chosen online due to various possible unexpected contingencies, whereas the moveTo action is planned since it is only based on one precondition. Prior nodes should be used whenever capturing the variability of a part of a certain task would require much effort during offline planning.

VI. DISCUSSION AND CONCLUSION

The focus of this paper is on the runtime adaptability and reactivity in dynamic environments with partially observable initial state, and not on the generation of complex offline plans. The core idea in this paper is to enhance offline plans with online decision making in specific parts of the behavior, to be able to quickly repeat, skip, or stop actions. We proposed a combination of BTs and active inference to achieve so, but one should notice that the core principle is independent from the techniques chosen both for the offline planning part and the online adaptation. One could use a different probabilistic search approach in combination with BTs, or substitute BTs with other methods.

Another interesting aspect of this approach is that the online decision making relies on symbolic actions that can be improved over time independently of the action planner. New actions can easily be added, modified, and extended, as long as the correct models are added (only once) in the current active inference model.

The main limitation of this approach, which will be addressed in future work, is that a more powerful reasoning method to choose among different alternatives for achieving the same goal is still missing. At the current stage, the order (or preference) among different alternatives is fixed a priori in the generative model. One could however parametrize this preference to include the likelihood of success of the available alternatives, and then choose the appropriate values at runtime through reasoning, for instance using semantic knowledge, or learning methods. The likelihood of an action could also be decreased if that action is failing repeatedly, such that other alternatives become more appealing.

Also, in the current formulation the behavior trees were manually designed, but future work can enhance the potential of this work, for example by including automated planning with translation to BTs such as [40].

To conclude, in this work we tackled the problem of action planning and execution in real world robotics with partially observable initial state. We addressed two open challenges in the robotics community, namely hierarchical deliberation and continual online planning, by combining behavior trees and active inference. The proposed algorithm and its core idea are general, and independent of the particular robot platform, task, and even the techniques chosen to structure the offline plan or the online action selection process. Our solution provides local reactivity to unforeseen situations while keeping the initial plan intact. In addition, it is possible to easily add safety guarantees to override the online decision making process thanks to the properties of BTs. We showed how robotic tasks can be described in terms of free-energy minimization, and we introduced action preconditions and conflict resolution for active inference by means of dynamic priors. This means that a robot can locally set its own sub-goals to resolve a local inconsistency, and then return to the initial plan specified in the BT. We performed a theoretical analysis of convergence and robustness of the algorithm, and the effectiveness of the approach is demonstrated on two different mobile manipulators and different tasks, both in simulation and real experiments.

ACKNOWLEDGMENT

This research was supported by Ahold Delhaize. All content represents the opinion of the author(s), which is not necessarily shared or endorsed by their respective employers and/or sponsors.

APPENDIX A
GENERATIVE MODELS

In active inference [20] the generative model is chosen to be a Markov process, which allows to infer the states of the environment and to predict the effects of actions as well as future observations. This can very well be expressed as a joint probability distribution P(o, s, η, π), where o is a sequence of observations, s is a sequence of states, η represents model parameters, and π is a policy. Using the chain rule, this joint probability is rewritten as:

$$P(o, s, \eta, \pi) = P(o|s, \eta, \pi)\,P(s|\eta, \pi)\,P(\eta|\pi)\,P(\pi) \tag{14}$$

Note that o is conditionally independent from η and π given s. In addition, under the Markov property, the next state and current observations depend only on the current state:

$$P(o|s, \eta, \pi) = \prod_{\tau=1}^{T} P(o_\tau|s_\tau) \tag{15}$$

The model is further simplified considering that s and η are conditionally independent given π:

$$P(s|\eta, \pi) = \prod_{\tau=1}^{T} P(s_\tau|s_{\tau-1}, \pi) \tag{16}$$

Finally, consider the model parameters explicitly:

$$P(o, s, \eta, \pi) = P(o, s, A, B, D, \pi) = P(\pi)P(A)P(B)P(D)\prod_{\tau=1}^{T} P(s_\tau|s_{\tau-1}, \pi)P(o_\tau|s_\tau) \tag{17}$$

The parameters in the joint distribution include (A, B, D). In detail:


• A is the likelihood matrix. It indicates the probability of outcomes being observed given a specific state. Each column i of A is a categorical distribution P(o_t^i|s_t^i) = Cat(A^i). In general, P(o_t|s_t) = Cat(A).

• B_at is the transition matrix. It indicates the probability of state transition under action a_t. The columns of B_at are categorical distributions which define the probability of being in state s_{t+1} while applying a_t from state s_t. B_at is also indicated simply as B, and P(s_{t+1}|s_t, a_t) = B.

• D defines the probability, or belief, about the initial state at t = 1, so P(s_1|s_0) = Cat(D).

• π contains policies, that is, action sequences over a time horizon T. π is the posterior expectation, a vector holding the probability of policies. These probabilities depend on the expected free-energy in future time steps under policies given the current belief: P(π) = σ(−G(π)). Here, σ indicates the softmax function used to normalise probabilities.
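As a purely illustrative instantiation of these quantities (not taken from the paper's implementation), a toy generative model for one Boolean state factor with two actions could be written down as follows; all names and numbers are arbitrary.

```python
import numpy as np

A = np.eye(2)                                   # likelihood P(o|s): identity
B = {                                           # transition matrices P(s'|s, a)
    "stay": np.eye(2),
    "flip": np.array([[0.0, 1.0],
                      [1.0, 0.0]]),
}
D = np.array([0.5, 0.5])                        # belief about the initial state
C = np.array([1.0, 0.0])                        # (log) preference over outcomes
policies = [("stay", "stay"), ("flip", "stay")] # action sequences over T = 2
```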

Given the generative model above, we are interested in finding the posterior about hidden causes of sensory data. For the sake of these derivations, we consider that the parameters associated with the task are known and do not introduce uncertainty. Using Bayes' rule:

$$P(s, \pi|o) = \frac{P(o|s, \pi)P(s, \pi)}{P(o)} \tag{18}$$

Computing the model evidence P(o) exactly is a well known and often intractable problem in Bayesian statistics. The exact posterior is then approximated by minimising the Kullback-Leibler divergence (D_KL, or KL-divergence) with respect to an approximate posterior distribution Q(s, π). Doing so, we can define the free-energy as a functional of approximate posterior beliefs, which results in an upper bound on surprise. By definition, D_KL is a non-negative quantity given by the expectation of the logarithmic difference between Q(s, π) and P(s, π|o). Applying the KL-divergence:

$$D_{KL}\left[Q(s,\pi)\,||\,P(s,\pi|o)\right] = \mathbb{E}_{Q(s,\pi)}\left[\ln Q(s,\pi) - \ln P(s,\pi|o)\right] \ge 0 \tag{19}$$

D_KL is the information loss when Q is used instead of P. Considering equation (18) and the chain rule, equation (19) can be rewritten as:

$$D_{KL}\left[\cdot\right] = \mathbb{E}_{Q(s,\pi)}\left[\ln Q(s,\pi) - \ln \frac{P(o,s,\pi)}{P(o)}\right] \tag{20}$$

And finally:

$$D_{KL}\left[\cdot\right] = \underbrace{\mathbb{E}_{Q(s,\pi)}\left[\ln Q(s,\pi) - \ln P(o,s,\pi)\right]}_{F[Q(s,\pi)]} + \ln P(o) \tag{21}$$

We have just defined an upper bound on surprise, the free-energy:

$$F[Q(s,\pi)] \ge -\ln P(o) \tag{22}$$

APPENDIX B
VARIATIONAL FREE-ENERGY

To fully characterize the free-energy in equation (21), we need to specify a form for the approximate posterior Q(s, π).

There are different ways to choose a family of probability distributions [41], compromising between complexity and accuracy of the approximation. In this work we choose the mean-field approximation. It holds:

$$Q(s,\pi) = Q(s|\pi)Q(\pi) = Q(\pi)\prod_{\tau=1}^{T} Q(s_\tau|\pi) \tag{23}$$

Under the mean-field approximation, the policy-dependent states at each time step are approximately independent of the states at any other time step. We can now find an expression for the variational free-energy. Considering the mean-field approximation, and the generative model with known task-associated parameters as:

$$P(o,s,\pi) = P(\pi)\prod_{\tau=1}^{T} P(s_\tau|s_{\tau-1},\pi)P(o_\tau|s_\tau) \tag{24}$$

we can write:

$$F[Q(s,\pi)] = \mathbb{E}_{Q(s,\pi)}\left[\ln Q(\pi) + \sum_{\tau=1}^{T}\ln Q(s_\tau|\pi) - \ln P(\pi) - \sum_{\tau=1}^{T}\ln P(s_\tau|s_{\tau-1},\pi) - \sum_{\tau=1}^{T}\ln P(o_\tau|s_\tau)\right] \tag{25}$$

Since Q(s, π) = Q(s|π)Q(π), and since the expectation of a sum is the sum of the expectations, we can write:

$$F[\cdot] = D_{KL}\left[Q(\pi)\,||\,P(\pi)\right] + \mathbb{E}_{Q(\pi)}\left[F(\pi)[Q(s|\pi)]\right] \tag{26}$$

where

$$F(\pi)[Q(s|\pi)] = \mathbb{E}_{Q(s|\pi)}\left[\sum_{\tau=1}^{T}\ln Q(s_\tau|\pi) - \sum_{\tau=1}^{T}\ln P(s_\tau|s_{\tau-1},\pi) - \sum_{\tau=1}^{T}\ln P(o_\tau|s_\tau)\right] \tag{27}$$

One can notice that F(π) is accumulated over time; in other words, it is the sum of free-energies over time and policies:

$$F(\pi) = \sum_{\tau=1}^{T} F(\pi,\tau) \tag{28}$$

Substituting the agent's belief about the current state at time τ given π (i.e. Q(s|π)) with its sufficient statistics s_{πτ}, we obtain a matrix form for F(π, τ) that we can compute given the generative model:

$$F(\pi) = \sum_{\tau=1}^{T} s_{\pi\tau}\cdot\left[\ln s_{\pi\tau} - \ln\left(B_{\pi\tau-1}\,s_{\pi\tau-1}\right) - \ln A\cdot o_\tau\right] \tag{29}$$

Given a policy π, the probability of state transition P(s_τ|s_{τ−1}, π) is given by the transition matrix under policy π at time τ, multiplied by the probability of the state at the previous time step. In the special case of τ = 1, we can write:

$$F(\pi, 1) = s_{\pi 1}\cdot\left[\ln s_{\pi 1} - \ln D - \ln A\cdot o_1\right] \tag{30}$$

Finally, we can compute the expectation of the policy-dependent variational free-energy F(π) as

$$\mathbb{E}_{Q(\pi)}\left[F(\pi)\right] = \pi\cdot F_\pi \tag{31}$$


where we indicate F_π = (F(π_1), F(π_2), ...) for every allowable policy.
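A direct numeric transcription of equations (29)-(31) for a single state factor could look like the sketch below; the function name variational_fe and the toy numbers are ours, and observations that have not yet been received simply contribute nothing, as noted in Appendix C.

```python
import numpy as np

def variational_fe(policy, s_pi, o, A, B, D, eps=1e-16):
    """Accumulated variational free-energy F(pi) of eqs. (29)-(30).

    s_pi[tau]: belief over states at time tau under the policy
    o[tau]:    one-hot observation at tau, or None if not yet received
    A, B, D:   likelihood, per-action transition matrices, initial-state prior"""
    F = 0.0
    for tau, s in enumerate(s_pi):
        predicted = D if tau == 0 else B[policy[tau - 1]] @ s_pi[tau - 1]
        accuracy = np.log(A + eps).T @ o[tau] if o[tau] is not None else 0.0
        F += s @ (np.log(s + eps) - np.log(predicted + eps) - accuracy)
    return F

A = np.eye(2)
B = {"flip": np.array([[0.0, 1.0], [1.0, 0.0]])}
D = np.array([0.5, 0.5])
s_pi = [np.array([0.9, 0.1]), np.array([0.1, 0.9])]  # beliefs under policy ("flip",)
o = [np.array([1.0, 0.0]), None]                     # second outcome still unknown
F_pi = np.array([variational_fe(("flip",), s_pi, o, A, B, D)])
print(F_pi)            # one entry per allowable policy, as in eq. (31)
```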

To derive state and policy updates which minimise free-energy, F in equation (26) is partially differentiated and set to zero, as we will see in the next appendices.

APPENDIX C
STATE ESTIMATION

We differentiate F with respect to the sufficient statistics of the probability distribution of the states. Note that the only part of F dependent on the states is F(π). Then:

$$\frac{\partial F}{\partial s_{\pi\tau}} = \frac{\partial F}{\partial F(\pi)}\frac{\partial F(\pi)}{\partial s_{\pi\tau}} = \pi\cdot\left[1 + \ln s_{\pi\tau} - \ln\left(B_{\pi\tau-1}\,s_{\pi\tau-1}\right) - \ln\left(B_{\pi\tau}\right)\cdot s_{\pi\tau+1} - \ln A\cdot o_\tau\right] \tag{32}$$

Setting the gradient to zero and using the softmax function to normalize the probability distribution over states:

$$\frac{\partial F}{\partial s_{\pi\tau}} = 0 \;\Rightarrow\; s_{\pi\tau} = \sigma\left(\ln\left(B_{\pi\tau-1}\,s_{\pi\tau-1}\right) + \ln\left(B_{\pi\tau}\right)\cdot s_{\pi\tau+1} + \ln A\cdot o_\tau\right) \tag{33}$$

Note that the softmax function is insensitive to the constant 1. Also, for τ = 1 the term ln(B_{πτ−1})s_{πτ−1} is replaced by ln D. Finally, ln A · o_τ contributes only to past and present time steps, so this term is null for t < τ ≤ T, since those observations are still to be received.
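The fixed-point update of equation (33) can be sketched as follows for one factor; the name update_state and the toy values are ours, and the backward message uses the transposed transition matrix, which for this illustration is symmetric anyway.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

def update_state(tau, policy, s_pi, o, A, B, D, eps=1e-16):
    """State update of eq. (33): combine the forward message (ln D at tau = 0),
    the backward message from the next time step, and the likelihood of the
    observation, then normalise with a softmax. Future observations (None)
    do not contribute."""
    past = np.log(D + eps) if tau == 0 else \
        np.log(B[policy[tau - 1]] @ s_pi[tau - 1] + eps)
    future = (np.log(B[policy[tau]] + eps).T @ s_pi[tau + 1]
              if tau + 1 < len(s_pi) else 0.0)
    likelihood = np.log(A + eps).T @ o[tau] if o[tau] is not None else 0.0
    return softmax(past + future + likelihood)

A = np.eye(2)
B = {"flip": np.array([[0.0, 1.0], [1.0, 0.0]])}
D = np.array([0.5, 0.5])
s_pi = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]
o = [np.array([1.0, 0.0]), None]
s_pi[0] = update_state(0, ("flip",), s_pi, o, A, B, D)
s_pi[1] = update_state(1, ("flip",), s_pi, o, A, B, D)
print(s_pi)   # beliefs sharpen on the observed state and its predicted successor
```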

APPENDIX D
EXPECTED FREE-ENERGY

We indicate with G(π) the expected free-energy obtained over future time steps until the time horizon T while following a policy π. Basically, this is the variational free-energy of future trajectories, which measures the plausibility of policies according to future predicted outcomes [21]. To compute it, we take the expectation of the variational free-energy under the posterior predictive distribution P(o_τ|s_τ). Following [21] we can write:

$$G(\pi) = \sum_{\tau=t+1}^{T} G(\pi,\tau) \tag{34}$$

then:

$$\begin{aligned} G(\pi,\tau) &= \mathbb{E}_{Q}\left[\ln Q(s_\tau|\pi) - \ln P(o_\tau, s_\tau|s_{\tau-1})\right] \\ &= \mathbb{E}_{Q}\left[\ln Q(s_\tau|\pi) - \ln P(s_\tau|o_\tau, s_{\tau-1}) - \ln P(o_\tau)\right] \end{aligned} \tag{35}$$

where Q = P(o_τ|s_τ)Q(s_τ|π). The expected free-energy is then:

$$G(\pi,\tau) \ge \mathbb{E}_{Q}\left[\ln Q(s_\tau|\pi) - \ln Q(s_\tau|o_\tau, s_{\tau-1}, \pi) - \ln P(o_\tau)\right] \tag{36}$$

Equivalently, we can express the expected free-energy in terms of preferred outcomes [33]:

$$G(\pi,\tau) = \mathbb{E}_{Q}\left[\ln Q(o_\tau|\pi) - \ln Q(o_\tau|s_\tau, s_{\tau-1}, \pi) - \ln P(o_\tau)\right] \tag{37}$$

Making use of Q(o_τ|s_τ, π) = P(o_τ|s_τ), since the predicted outcomes in the future are only based on A, which is policy independent given s_τ, we have:

$$G(\pi,\tau) = \underbrace{D_{KL}\left[Q(o_\tau|\pi)\,||\,P(o_\tau)\right]}_{\text{Expected cost}} + \underbrace{\mathbb{E}_{Q(s_\tau|\pi)}\left[H(P(o_\tau|s_\tau))\right]}_{\text{Expected ambiguity}} \tag{38}$$

where H[P(o_τ|s_τ)] = E_{P(o_τ|s_τ)}[− ln P(o_τ|s_τ)] is the entropy. We are now ready to express the expected free-energy in matrix form, such that we can compute it. From the previous equation, one can notice that policy selection aims at minimizing the expected cost and ambiguity. The latter relates to the uncertainty about future outcomes given hidden states. In a sense, policies tend to bring the agent to future states that generate unambiguous information over states. On the other hand, the cost is the difference between predicted and prior beliefs about final states. Policies are more likely if they minimize cost, so they lead to outcomes which match prior desires. Minimising G leads to both exploitative (cost minimizing) and explorative (ambiguity minimizing) behavior. This results in a balance between goal oriented and novelty seeking behaviours.

Substituting the sufficient statistics in equation (38), and recalling that the generative model specifies ln P(o_τ) = C:

$$G(\pi,\tau) = o_{\pi\tau}\cdot\left[\ln o_{\pi\tau} - C\right] + s_{\pi\tau}\cdot A\cdot\ln A \tag{39}$$
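For one state factor, equations (38)-(39) can be evaluated numerically as below. The slightly noisy likelihood, the preference values, and the helper name expected_fe are illustrative and not taken from the paper; the ambiguity term is written out explicitly as the expected entropy of the likelihood.

```python
import numpy as np

def expected_fe(s_pred, A, C, eps=1e-16):
    """Expected free-energy G(pi, tau) of eqs. (38)-(39) for one future step:
    expected cost of the predicted outcomes w.r.t. the log preference C, plus
    the expected ambiguity (entropy of P(o|s)) under the predicted state."""
    o_pred = A @ s_pred                          # predicted outcomes Q(o|pi)
    cost = o_pred @ (np.log(o_pred + eps) - C)
    H = -(A * np.log(A + eps)).sum(axis=0)       # entropy of each column of A
    return float(cost + s_pred @ H)

A = np.array([[0.9, 0.1],                        # slightly noisy likelihood
              [0.1, 0.9]])
C = np.array([2.0, 0.0])                         # prefer observing the first outcome
print(expected_fe(np.array([0.8, 0.2]), A, C))   # lower G: preferred and unambiguous
print(expected_fe(np.array([0.2, 0.8]), A, C))   # higher G
```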

APPENDIX E
UPDATING POLICY DISTRIBUTION

The update rule for the distribution over possible policies follows directly from the definition of the variational free-energy:

$$F[\cdot] = D_{KL}\left[Q(\pi)\,||\,P(\pi)\right] + \pi\cdot F_\pi \tag{40}$$

The first term of the equation above can be further specified as:

$$D_{KL}\left[Q(\pi)\,||\,P(\pi)\right] = \mathbb{E}_{Q(\pi)}\left[\ln Q(\pi) - \ln P(\pi)\right] \tag{41}$$

Recalling that the generative model defines P(π) = σ(−G(π)), it results:

$$D_{KL}\left[Q(\pi)\,||\,P(\pi)\right] = \pi\cdot\left(\ln\pi + G_\pi\right) + \mathbb{E}_{Q(\pi)}\left[\ln Q(\pi) - \ln\sum_i e^{G(\pi_i)}\right] \tag{42}$$

Taking the gradient with respect to π:

$$\frac{\partial F}{\partial \pi} = \ln\pi + G_\pi + F_\pi + 1 \tag{43}$$

where F_π = (F(π_1), F(π_2), ...) and G_π = (G(π_1), G(π_2), ...). Finally, setting the gradient to zero and normalizing through the softmax, the distribution over policies is obtained:

$$\pi = \sigma\left(-G_\pi - F_\pi\right) \tag{44}$$

The policy that an agent should pursue is the most likely one.
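In code, equation (44) is a softmax over the negative summed free-energies; the numbers below are arbitrary placeholders for F(π_i) and G(π_i).

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

F_pi = np.array([0.5, 1.2, 0.9])   # variational free-energy of each policy (past fit)
G_pi = np.array([1.0, 0.2, 2.0])   # expected free-energy of each policy (future)
pi = softmax(-G_pi - F_pi)         # eq. (44)
print(pi, "-> most likely policy:", int(pi.argmax()))
```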


REFERENCES

[1] M. Ghallab, D. Nau, and P. Traverso, "The actor's view of automated planning and acting: A position paper," Artificial Intelligence, vol. 208, pp. 1–17, 2014.

[2] D. S. Nau, M. Ghallab, and P. Traverso, "Blended planning and acting: Preliminary approach, research challenges," in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[3] M. Colledanchise and P. Ogren, "How Behavior Trees Modularize Hybrid Control Systems and Generalize Sequential Behavior Compositions, the Subsumption Architecture, and Decision Trees," IEEE Transactions on Robotics, vol. 33, no. 2, pp. 372–389, 2017.

[4] M. Colledanchise, D. Almeida, and P. Ogren, "Towards blended reactive planning and acting using behavior trees," in IEEE International Conference on Robotics and Automation (ICRA), 2019.

[5] E. Safronov, M. Colledanchise, and L. Natale, "Task planning with belief behavior trees," IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.

[6] C. Paxton, N. Ratliff, C. Eppner, and D. Fox, "Representing robot task plans as robust logical-dynamical systems," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 5588–5595.

[7] C. R. Garrett, C. Paxton, T. Lozano-Perez, L. P. Kaelbling, and D. Fox, "Online replanning in belief space for partially observable task and motion problems," in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 5678–5684.

[8] A. Meera and M. Wisse, "Free energy principle based state and input observer design for linear systems with colored noise," in 2020 American Control Conference (ACC), 2020, pp. 5052–5058.

[9] M. Baioumy, P. Duckworth, B. Lacerda, and N. Hawes, "Active inference for integrated state-estimation, control, and learning," in International Conference on Robotics and Automation (ICRA), 2021.

[10] C. Pezzato, M. Baioumy, C. H. Corbato, N. Hawes, M. Wisse, and R. Ferrari, "Active inference for fault tolerant control of robot manipulators with sensory faults," in 1st International Workshop on Active Inference, ECML PKDD, 2020.

[11] M. Baioumy, C. Pezzato, R. Ferrari, C. H. Corbato, and N. Hawes, "Fault-tolerant control of robot manipulators with sensory faults using unbiased active inference," European Control Conference (ECC), 2021.

[12] C. Pezzato, R. Ferrari, and C. H. Corbato, "A novel adaptive controller for robot manipulators based on active inference," IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 2973–2980, 2020.

[13] G. Oliver, P. Lanillos, and G. Cheng, "An empirical study of active inference on a humanoid robot," IEEE Transactions on Cognitive and Developmental Systems, 2021.

[14] K. J. Friston, "The free-energy principle: a unified brain theory?" Nature Reviews Neuroscience, vol. 11(2), pp. 127–138, 2010.

[15] C. Buckley, C. Kim, S. McGregor, and A. Seth, "The free energy principle for action and perception: A mathematical review," Journal of Mathematical Psychology, vol. 81, pp. 55–79, 2017.

[16] R. Bogacz, "A tutorial on the free-energy framework for modelling perception and learning," Journal of Mathematical Psychology, 2015.

[17] K. J. Friston, J. Mattout, and J. Kilner, "Action understanding and active inference," Biological Cybernetics, vol. 104(1-2), 2011.

[18] K. J. Friston, J. Daunizeau, and S. Kiebel, "Action and behavior: a free-energy formulation," Biological Cybernetics, vol. 102(3), 2010.

[19] K. Friston, S. Samothrakis, and R. Montague, "Active inference and agency: optimal control without cost functions," Biological Cybernetics, vol. 106, no. 8-9, pp. 523–541, 2012.

[20] K. Friston, T. FitzGerald, F. Rigoli, P. Schwartenbeck, and G. Pezzulo, "Active inference: a process theory," Neural Computation, vol. 29, no. 1, pp. 1–49, 2017.

[21] N. Sajid, P. J. Ball, T. Parr, and K. J. Friston, "Active inference: demystified and compared," Neural Computation, vol. 33, no. 3, pp. 674–712, 2021.

[22] P. Schwartenbeck, J. Passecker, T. U. Hauser, T. H. FitzGerald, M. Kronbichler, and K. J. Friston, "Computational mechanisms of curiosity and goal-directed exploration," eLife, vol. 8, p. e41703, 2019.

[23] R. Kaplan and K. J. Friston, "Planning and navigation as active inference," Biological Cybernetics, vol. 112, no. 4, pp. 323–343, 2018.

[24] M. Colledanchise and P. Ogren, Behavior Trees in Robotics and AI: An Introduction, ser. Chapman and Hall/CRC Artificial Intelligence and Robotics Series. CRC Press, Taylor & Francis Group, 2018.

[25] J. Orkin, "Applying goal-oriented action planning to games," AI Game Programming Wisdom, vol. 2, pp. 217–228, 2003.

[26] ——, "Three states and a plan: the AI of FEAR," in Game Developers Conference, 2006, pp. 1–18.

[27] L. P. Kaelbling and T. Lozano-Perez, "Hierarchical task and motion planning in the now," Proceedings - IEEE International Conference on Robotics and Automation, pp. 1470–1477, 2011.

[28] L. Pack Kaelbling and T. Lozano-Perez, "Integrated task and motion planning in belief space," International Journal of Robotics Research, pp. 1–60, 2013.

[29] M. Levihn, L. P. Kaelbling, T. Lozano-Perez, and M. Stilman, "Foresight and reconsideration in hierarchical planning and execution," in IEEE International Conference on Intelligent Robots and Systems, 2013, pp. 224–231.

[30] K. Erol, J. Hendler, and D. S. Nau, "HTN planning: Complexity and expressivity," in AAAI, vol. 94, 1994, pp. 1123–1128.

[31] I. Georgievski and M. Aiello, "An Overview of Hierarchical Task Network Planning," arXiv:1403.7426, 2014.

[32] M. Ghallab, D. Nau, and P. Traverso, Automated Planning and Acting. Cambridge University Press, 2016.

[33] L. Da Costa, T. Parr, N. Sajid, S. Veselic, V. Neacsu, and K. Friston, "Active inference on discrete state-spaces: a synthesis," Journal of Mathematical Psychology, vol. 99, p. 102447, 2020.

[34] K. J. Friston, T. Parr, and B. de Vries, "The graphical brain: belief propagation and active inference," Network Neuroscience, vol. 1, no. 4, pp. 381–414, 2017.

[35] D. Faconti, "BehaviorTree.CPP," https://www.behaviortree.dev/.

[36] K. Friston, T. FitzGerald, F. Rigoli, P. Schwartenbeck, J. O'Doherty, and G. Pezzulo, "Active inference and learning," Neuroscience & Biobehavioral Reviews, vol. 68, pp. 862–879, 2016.

[37] R. R. Burridge, A. A. Rizzi, and D. E. Koditschek, "Sequential composition of dynamically dexterous robot behaviors," International Journal of Robotics Research, vol. 18, no. 6, pp. 534–555, 1999.

[38] E. Najafi, R. Babuska, and G. A. Lopes, "An application of sequential composition control to cooperative systems," in 2015 10th International Workshop on Robot Motion and Control (RoMoCo), 2015, pp. 15–20.

[39] M. Cashmore, M. Fox, D. Long, D. Magazzeni, B. Ridder, A. Carrera, N. Palomeras, N. Hurtos, and M. Carreras, "ROSPlan: Planning in the robot operating system," in Proceedings of the International Conference on Automated Planning and Scheduling, vol. 25, no. 1, 2015.

[40] F. Martin, M. Morelli, H. Espinoza, F. J. Lera, and V. Matellan, "Optimized execution of PDDL plans using behavior trees," in 20th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2021, pp. 1596–1598.

[41] S. Schwobel, S. Kiebel, and D. Markovic, "Active inference, belief propagation, and the Bethe approximation," Neural Computation, vol. 30, no. 9, pp. 2530–2567, 2018.