
Incorporating Relational Background Knowledge into Reinforcement Learning via Differentiable Inductive Logic Programming

Ali Payani 1 Faramarz Fekri 1

Abstract

Relational Reinforcement Learning (RRL) offers various desirable features. Most importantly, it allows expert knowledge to be incorporated into the learning, and hence leads to much faster learning and better generalization compared to standard deep reinforcement learning. However, most of the existing RRL approaches are either incapable of incorporating expert background knowledge (e.g., in the form of an explicit predicate language) or are unable to learn directly from non-relational data such as images. In this paper, we propose a novel deep RRL based on a differentiable Inductive Logic Programming (ILP) solver that can effectively learn relational information from images and represent the state of the environment as first-order logic predicates. Additionally, it can take expert background knowledge and incorporate it into the learning problem using appropriate predicates. The differentiable ILP allows end-to-end optimization of the entire framework for learning the policy in RRL. We show the efficacy of this novel RRL framework using environments such as BoxWorld and GridWorld, as well as relational reasoning on the Sort-of-CLEVR dataset.

1. Introduction

Relational Reinforcement Learning (RRL) was investigated in the early 2000s by works such as (Bryant et al., 1999; Džeroski et al., 1998; 2001), among others. The main idea behind RRL is to describe the environment in terms of objects and relations. One of the first practical implementations of this idea was proposed by (Džeroski et al., 1998) and was later improved in (Džeroski et al., 2001) based on a modification of the Q-Learning algorithm (Watkins & Dayan, 1992) via the standard relational tree-learning algorithm TILDE

*Equal contribution. 1Department of Electrical and Computer Engineering, Georgia Institute of Technology. Correspondence to: Ali Payani <[email protected]>.

(Blockeel & De Raedt, 1998). As shown in (Džeroski et al., 2001), the RRL system allows for very natural and human-readable decision making and policy evaluation. More importantly, the use of variables in the ILP system makes it possible to learn generally formed policies and strategies. Since these policies and actions are not directly associated with any particular instance or entity, this approach leads to a generalization capability beyond what is possible in most typical RL systems. Generally speaking, the RRL framework offers several benefits over traditional RL: (i) The learned policy is usually human interpretable, and hence can be viewed, verified and even tweaked by an expert observer. (ii) The learned program can generalize better than its classical RL counterpart. (iii) Since the language for the state representation is chosen by the expert, it is possible to incorporate inductive biases into learning. This can be a significant advantage in complex problems, as it can be used to steer the agent toward certain actions without accessing the reward function. (iv) It allows for the incorporation of higher-level concepts and prior background knowledge.

In recent years, and with the advent of new deep learning techniques, significant progress has been made over the classical Q-learning RL framework. By using algorithms such as deep Q-learning and its variants (Mnih et al., 2013; Van Hasselt et al., 2016), as well as policy learning algorithms such as A2C and A3C (Mnih et al., 2016), more complex problems are now being tackled. However, the classical RRL framework cannot be easily employed to tackle the large-scale and complex scenes that arise in recent RL problems. In particular, none of the inherent benefits of RRL have been materialized in deep learning frameworks thus far. This is because existing RRL frameworks are usually not designed to learn from complex visual scenes and cannot be easily combined with differentiable deep neural networks. In (Payani & Fekri, 2019), a novel ILP solver was introduced which uses the Neural-Logical Network (NLN) (Payani & Fekri, 2018) to construct a differentiable neural-logic ILP solver (dNL-ILP). The key aspect of this dNL-ILP solver is a differentiable deduction engine, which is at the core of the proposed RRL framework.

Figure 1. Connected graph example

As such, the resulting differentiable RRL framework can be used, similar to deep RL, in an end-to-end learning paradigm, trainable via typical gradient optimizers. Further, in contrast to the early RRL frameworks, this framework is flexible and can learn from ambiguous and fuzzy information. Finally, it can be combined with deep learning techniques such as CNNs to extract relational information from visual scenes. In the next section we briefly introduce the differentiable dNL-ILP solver. In Section 3, we show how this framework can be used to design a differentiable RRL framework. Experiments are presented next, followed by the conclusion.

2. Differentiable ILP via neural logic networks

In this section, we briefly present the basic design of the differentiable dNL-ILP which is at the core of the proposed RRL. A more detailed presentation of dNL-ILP can be found in (Payani & Fekri, 2019). Logic programming is a paradigm in which we use formal logic (usually first-order logic) to describe relations between facts and rules of a program domain. In logic programming, rules are usually written as clauses of the form H ← B1, B2, . . . , Bm, where H is called the head of the clause and B1, B2, . . . , Bm is called the body of the clause. A clause of this form expresses that if all the atoms Bi in the body are true, the head is necessarily true. Each of the terms H and Bi is an atom. Each atom is created by applying an n-ary Boolean function called a predicate to some constants or variables. A predicate states the relation between some variables or constants in the logic program. We use lowercase letters for constants (instances) and uppercase letters for variables. To avoid technical details, we consider a simple logic program. Assume that a directed graph is defined using a series of facts of the form edge(X,Y), where for example edge(a,b) states that there is an edge from node a to node b. As an example, the graph in Fig. 1 can be represented as {edge(a,b), edge(b,c), edge(c,d), edge(d,b)}. Assume that our task is to learn the cnt(X,Y) predicate from a series of examples, where cnt(X,Y) is true if there is a directed path from node X to node Y. The set of positive examples in the graph depicted in Fig. 1 is P = {cnt(a,b), cnt(a,c), cnt(a,d), cnt(b,b), cnt(b,c), cnt(b,d), ...}. Similarly,

(a) Fc:
xi  mi  Fc
0   0   1
0   1   0
1   0   1
1   1   1

(b) Fd:
xi  mi  Fd
0   0   0
0   1   0
1   0   0
1   1   1

Figure 2. Truth table of Fc(·) and Fd(·) functions

the set of negative examples N includes atoms such as {cnt(a,a), cnt(b,a), ...}.

It is easy to verify that the predicate cnt defined below satisfies all the provided examples (it entails the positive examples and rejects the negative ones):

cnt(X,Y) ← edge(X,Y)
cnt(X,Y) ← edge(X,Z), cnt(Z,Y)    (1)

In fact, by applying each of the above two rules to the constants in the program, we can produce all the consequences of such a hypothesis. If we allow for formulas with 3 variables (X,Y,Z) as in (1), we can easily enumerate all the possible symbolic atoms that could be used in the body of each clause. In our working example, this corresponds to I_cnt = {edge(X,X), edge(X,Y), edge(X,Z), ..., cnt(Z,Y), cnt(Z,Z)}. As the size of the problem grows, considering all the possibilities becomes infeasible. Consequently, almost all ILP systems use some form of rule template to reduce the possible combinations. For example, the dILP model (Evans & Grefenstette, 2018) allows for at most two atoms in the body of each clause. In (Payani & Fekri, 2019), a novel approach was introduced to alleviate this limitation and to allow for learning arbitrarily complex predicate formulas. The main idea behind this approach is to use multiplicative neurons (Payani & Fekri, 2018) that are capable of learning and representing Boolean logic. Consider the fuzzy notion of Boolean algebra where fuzzy Boolean values are represented as real values in the range [0,1], with True and False represented by 1 and 0, respectively. Let x̄ denote the logical 'NOT' of x, and let x_n ∈ {0,1}^n

be the input vector for a logical neuron. We can associate a trainable Boolean membership weight m_i with each input element x_i of the vector x_n. Consider a Boolean function Fc(x_i, m_i) with the truth table as in Fig. 2a, which is able to include (or exclude) each element x_i in (or out of) the conjunction function f_conj(x_n). This design ensures that each element x_i is incorporated in the conjunction function only when the corresponding membership weight m_i is 1. Consequently, the neural conjunction function f_conj can be defined as:

f_conj(x_n) = ∏_{i=1}^{n} Fc(x_i, m_i),  where  Fc(x_i, m_i) = 1 − m_i (1 − x_i)    (2)

Likewise, a neural disjunction function f_disj(x_n) can be defined using the auxiliary function Fd with the truth table as in Fig. 2b. By cascading a layer of N neural conjunction functions with a layer of N neural disjunction functions, we can construct a differentiable function to be used for representing and learning a Boolean Disjunctive Normal Form (DNF).
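To make the construction concrete, here is a minimal NumPy sketch of these conjunction and disjunction neurons and of a small dNL-DNF function built from them. The array shapes, the hard 0/1 example inputs and the helper names are ours, not the paper's; the actual implementation keeps the memberships as m_i = sigmoid(w_i) over trainable weights w_i inside TensorFlow.

```python
import numpy as np

def f_conj(x, m):
    # F_c(x_i, m_i) = 1 - m_i * (1 - x_i): x_i enters the conjunction only when m_i is close to 1
    return np.prod(1.0 - m * (1.0 - x), axis=-1)

def f_disj(x, m):
    # F_d(x_i, m_i) = x_i * m_i: x_i enters the disjunction only when m_i is close to 1
    return 1.0 - np.prod(1.0 - m * x, axis=-1)

def dnl_dnf(x, m_conj, m_disj):
    # m_conj: (N, n) memberships of N conjunction neurons over the n input atoms
    # m_disj: (N,)   memberships selecting which conjunctions feed the final disjunction
    conjunctions = np.array([f_conj(x, row) for row in m_conj])
    return f_disj(conjunctions, m_disj)

x = np.array([1.0, 0.0, 1.0])                     # fuzzy truth values of three input atoms
m_conj = np.array([[1.0, 0.0, 1.0],               # first rule keeps atoms 1 and 3
                   [0.0, 1.0, 0.0]])              # second rule keeps atom 2
m_disj = np.array([1.0, 1.0])
print(dnl_dnf(x, m_conj, m_disj))                 # 1.0: the first conjunction fires
```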

dNL-ILP employs these differentiable Boolean functions (e.g., dNL-DNF) to represent and learn predicate functions. Each dNL function can be seen as a parameterized symbolic formula in which the (fuzzy) contribution of each symbol (atom) to the learned hypothesis is controlled by trainable membership weights (e.g., w_i, where m_i = sigmoid(w_i)). If we start from the background facts (e.g., all the groundings of the predicate edge(X,Y) in the graph example) and apply the parameterized hypothesis, we arrive at some new consequences (i.e., forward chaining). After repeating this process to obtain all possible consequences, we can update the parameters of dNL by minimizing the cross entropy between the desired outcome (the provided positive and negative examples) and the deduced consequences.
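A toy sketch of this loop, under our own simplifications (a single conjunction per target atom instead of a full dNL-DNF, invented sizes, facts and labels), might look like this:

```python
import tensorflow as tf

num_atoms = 6                                              # toy number of ground atoms (ours)
w = tf.Variable(tf.random.normal([num_atoms, num_atoms]))  # membership logits: one rule body per target atom

def step(x, m):
    # one application of a parameterized hypothesis: each target atom is a fuzzy
    # conjunction over all ground atoms, gated by its membership row
    return tf.reduce_prod(1.0 - m * (1.0 - x), axis=-1)

background = tf.constant([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])   # toy background facts
targets    = tf.constant([1.0, 1.0, 1.0, 1.0, 0.0, 0.0])   # desired valuation after deduction

optimizer = tf.keras.optimizers.Adam(0.002)
for _ in range(200):
    with tf.GradientTape() as tape:
        x = background
        for _ in range(3):                                  # a fixed number of forward-chaining steps
            x = tf.maximum(x, step(x, tf.sigmoid(w)))       # fuzzy OR with the newly deduced atoms
        loss = tf.reduce_mean(
            tf.keras.losses.binary_crossentropy(targets, x))
    optimizer.apply_gradients(zip(tape.gradient(loss, [w]), [w]))
```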

An ILP description of a problem in this framework consists of the following elements:

1. The set of constants in the program. In the example of Fig. 1, this consists of C = {a,b,c,d}.

2. The set of background facts. In the graph example above, this consists of the groundings of the predicate edge(X,Y), i.e., B = {edge(a,b), edge(b,c), edge(c,d), edge(d,b)}.

3. The definition of auxiliary predicates. In the simple example of the graph we did not include any auxiliary predicates. However, in more complex examples they can greatly reduce the complexity of the problem.

4. The signature of the target hypothesis. In the graph example, this signature indicates that the target hypothesis is the 2-ary predicate cnt(X,Y) and that in the symbolic representation of this Boolean function we are allowed to use three variables X, Y, Z.

In addition to the aforementioned elements, some parameters such as the initial values of the membership weights (m_i = sigmoid(w_i)), as well as the number of steps of forward chaining, should be provided. Furthermore, in dNL-ILP the memberships are fuzzy Boolean values between 0 and 1. As shown in (Payani & Fekri, 2019), for ambiguous problems where a definite Boolean hypothesis satisfying all the examples may not exist, there is no guarantee that the membership weights converge to 0 or 1. In applications where our only goal is to find a successful hypothesis, this is satisfactory. However, if the interpretability of the learned hypothesis is itself a goal of learning, we may need to encourage the membership weights to converge to 0 and 1 by adding a penalty term:

interpretability penalty ∝ m_i (1 − m_i)    (3)
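In code this amounts to one extra term on the training loss; the coefficient and the names w and task_loss below are placeholders of ours, continuing the notation of the training sketch above:

```python
def interpretability_penalty(w, coeff=1e-3):
    # w: membership logits; m_i = sigmoid(w_i) are the fuzzy membership weights
    m = tf.sigmoid(w)
    return coeff * tf.reduce_sum(m * (1.0 - m))   # equation (3): minimized when every m_i is 0 or 1

loss = task_loss + interpretability_penalty(w)    # task_loss: the cross-entropy loss from above
```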

3. Relational Reinforcement Learning via dNL-ILP

Early works on RRL (Džeroski et al., 2001; Van Otterlo, 2005) mainly relied on access to an explicit representation of states and actions in terms of a relational predicate language. In the most successful instances of these approaches, a regression tree algorithm is usually used in combination with a modified version of the Q-Learning algorithm. The fundamental limitation of the traditional RRL approaches is that the employed ILP solvers are not differentiable. Therefore, those approaches are typically applicable only to problems for which an explicit relational representation of states and actions is provided. Alternatively, deep RL models, thanks to recent advancements in deep networks, have been successfully applied to much more complex problems. These models are able to learn from raw images without relying on any access to an explicit representation of the scene. However, the existing RRL counterparts are falling behind these desirable developments in deep RL.

In this paper, we establish that the differentiable dNL-ILP provides a platform to combine RRL with deep learning methods, constructing a new RRL framework with the best of both worlds. This new RRL system allows the model to learn from the complex visual information received from the environment and to extract an intermediate explicit relational representation from the raw images by using typical deep learning models such as convolutional networks. Although dNL-ILP can also be used to formulate RL algorithms such as deep Q-learning, we focus only on the deep policy gradient learning algorithm. This formulation is very desirable because it makes the learned policy interpretable by humans. Another advantage of using policy gradients in our RRL framework is that it enables us to restrict actions according to rules obtained either from human preferences or from problem requirements. This in turn makes it possible to account for human preferences or to avoid certain pitfalls, e.g., as in safe AI.

In our RRL framework, although we use the generic formulation of the policy gradient with the ability to learn a stochastic policy, certain key aspects differ from traditional deep policy gradient methods, namely state representation, language bias and action representation. In the following, we explain these concepts in the context of the BoxWorld game. In this game, the agent's task is to learn how to stack boxes on top of each other (in a certain order). For illustration, consider the simplified version of the game shown in

Figure 3. State representation in the form of predicates in the BoxWorld game, before and after an action

Fig. 3, where there are only three boxes labeled a, b, and c. A box can be on top of another box or on the floor. A box can be moved if it is not covered by another box, and it can be placed either on the floor or on top of another uncovered box. For this game, the environment state can be fully explained via the predicate on(X,Y). Fig. 3 shows the state representation of the scene before and after an action (indicated by the predicate move(c,b)). Fig. 4 displays the overall design of our proposed RRL framework. In the following we discuss each of the distinct elements of this framework using the BoxWorld environment.

3.1. State Representation

In previous approaches to RRL (Džeroski et al., 1998; 2001; Jiang & Luo, 2019), the state of the environment is expressed in an explicit relational format in the form of predicate logic. This significantly limits the applicability of RRL in complex environments where such representations are not available. Our goal in this section is to develop a method in which the explicit representation of states can be learned via typical deep learning techniques, in a form that supports policy learning via our differentiable dNL-ILP. As a result, we can utilize the various benefits of the RRL discipline without being restricted to environments with explicitly represented states.

For example, consider the BoxWorld environment explained earlier, where the predicate on(X,Y) is used to represent the state explicitly in relational form (as shown in Fig. 3). Past works in RRL relied on access to the explicit relational representation of states, i.e., all the groundings of the state representation predicates. Since this example has 4 constants, i.e., C = {a,b,c,floor}, these groundings would be the binary values ('true' or 'false') of the atoms on(a,a), on(a,b), on(a,c), on(a,floor), ..., on(floor,floor). In recent years, extracting relational information from visual scenes has been investigated. Fig. 6 shows two types of relational representation extracted from images in (Santoro et al., 2017). The idea is to first process the images through multiple CNN networks. The last layer of the convolutional network chain is treated as the feature vector and is usually augmented with some non-local information such as the

absolute position of each point in the final layer of the CNN network. This feature map is then fed into a relational learning unit which is tasked with extracting non-local features. Various techniques have recently been introduced for learning this non-local information from the local feature maps, namely self-attention network models (Vaswani et al., 2017; Santoro et al., 2017) as well as graph networks (Narayanan et al., 2017; Allamanis et al., 2017). Unfortunately, none of the resulting representations from past works is in the form of predicates needed in ILP.

In our approach, we use networks similar to those discussed above to extract non-local information. However, given the relational nature of the state representation in our RRL model, we consider three strategies in order to facilitate learning the desired relational state from images. Namely:

1. Finding a suitable state representation: In our BoxWorld example, we used the predicate on(X,Y) to represent the state of the environment. However, learning this predicate requires inferring relations among various objects in the scene. As shown by previous works (e.g., (Santoro et al., 2017)), this is a difficult task even in a fully supervised setting (i.e., where all the labels are provided), which is not applicable here. Alternatively, we propose to use lower-level relations for the state representation and to build higher-level representations via the predicate language. In the game of BoxWorld, for example, we can describe states by the respective position of each box. In particular, we define two predicates posH(X,Y) and posV(X,Y) such that variable X is associated with an individual box, whereas Y indicates the horizontal or vertical coordinate of the box, respectively. Fig. 5 shows how this lower-level representation can be transformed into a higher-level description by the appropriate predicate language (a small sketch of evaluating these rules on the learned groundings appears after this list):

on(X,Y) ← posV(X,Z), posV(Y,T), inc(T,Z), sameH(X,Y)
sameH(X,Y) ← posH(X,Z), posH(Y,Z)    (4)

2. State constraints: When applicable, we may incorporate relational constraints in the form of a penalty term in the loss function. For example, in our BoxWorld example we can notice that the vertical position of the floor, posV(floor), should always be 0.

Figure 4. Learning explicit relational information from images in our proposed RRL; images are processed to obtain an explicit representation and the dNL-ILP engine learns and expresses the desired policy (actions)

Figure 5. Transforming low-level state representation to high-level form via auxiliary predicates

In general, the choice of relational language makes it possible to pose constraints based on our knowledge of the scene. Enforcing these constraints does not necessarily speed up the learning, as we will show in the BoxWorld experiment in Section 4.1. However, it ensures that the (learned) state representation, and consequently the learned relational policy, resemble our desired structure of the problem.

3. Semi-supervised setting: While it is not desirable to label every single scene that may occur during learning, in most cases it is possible to provide a few labeled scenes to help the model learn the desired state representation faster. These reference points can then be incorporated into the loss function to encourage the network to learn a representation that matches the provided labeled scenes. We use a similar approach in the Asterix experiment (see Appendix D) to significantly increase the speed of learning.
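As referenced in item 1, a small NumPy sketch of how rules like (4) can be evaluated on the learned fuzzy groundings might look as follows; the tensor layout, the product for conjunction, and the sum-then-clip treatment of the existential variables are our own simplifications:

```python
import numpy as np

def high_level_state(posH, posV, inc):
    # posH, posV: (num_boxes, num_positions) fuzzy groundings learned from the image
    # inc:        (num_positions, num_positions) background facts, inc[t, z] = 1 iff z = t + 1
    sameH = np.einsum('xz,yz->xy', posH, posH)                 # sameH(X,Y) <- posH(X,Z), posH(Y,Z)
    on = np.einsum('xz,yt,tz->xy', posV, posV, inc) * sameH    # on(X,Y) <- posV(X,Z), posV(Y,T), inc(T,Z), sameH(X,Y)
    return np.clip(on, 0.0, 1.0), np.clip(sameH, 0.0, 1.0)
```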

3.2. Action Representation

We formulate the policy gradient in a form that allows the learning of the actions via one (or multiple) target predicates. These predicates exploit the background facts, the state representation predicates, as well as auxiliary predicates to incorporate higher-level concepts. In typical deep policy gradient (DPG) learning, the probability distribution of actions is usually learned by applying a multi-layer perceptron with a softmax activation function in the last layer. In our proposed RRL, the action probability distribution can usually be directly associated with the groundings of an appropriate predicate. For example, in the BoxWorld example in Fig. 3, we define a predicate move(A,B) and associate

the actions of the agent with the groundings of this predicate. In an ideal case, where there is a deterministic solution to the RRL problem, the predicate move(A,B) may be learned in such a way that, at each state, only the grounding corresponding to the correct action evaluates to 1 ('true') and all the other groundings of this predicate become 0. In such a scenario, the agent will follow the learned logic deterministically. Alternatively, we may get more than one grounding with a value equal to 1, or we may get fuzzy values in the range [0,1]. In those cases, we estimate the probability distribution of actions similarly to standard deep policy learning, by applying a softmax function to the valuation vector of the learned predicate move (i.e., the values of move(X,Y) for X,Y ∈ {a,b,c,floor}).
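A hedged sketch of this step follows; the scaling constant c comes from the BoxWorld experiment description in Section 4.1, while move_values and the toy 2x2 valuation are stand-ins of ours for the output of the dNL-ILP engine:

```python
import tensorflow as tf

def action_distribution(move_values, c=10.0):
    # move_values: fuzzy groundings of move(X,Y) for all box pairs, each in [0, 1]
    logits = c * tf.reshape(move_values, [-1])       # scale so near-Boolean groundings dominate the softmax
    return logits, tf.nn.softmax(logits)

move_values = tf.constant([[0.02, 0.91], [0.05, 0.10]])          # toy valuation of move(X,Y)
logits, probs = action_distribution(move_values)
action = tf.random.categorical(logits[None, :], num_samples=1)   # sample one action index
```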

4. Experiments

In this section we explore the features of the proposed RRL framework via several examples. We have implemented1 the models using Tensorflow (Abadi et al., 2016).

4.1. BoxWorld Experiment

The BoxWorld environment has been widely used as a benchmark in past RRL systems (Džeroski et al., 2001; Van Otterlo, 2005; Jiang & Luo, 2019). In these systems the state of the environment is usually given as explicit relational data via groundings of the predicate on(X,Y). While ILP-based systems are usually able to solve variations of this environment, they rely on an explicit representation of the state and cannot infer the state from the image. Here, we

1The Python implementation of the algorithms in this paper is available at https://github.com/dnlRRL2020/RRL

(a) A sample from the CLEVR dataset. (b) A sample from the Sort-of-CLEVR dataset.

Figure 6. Extracting relational information from visual scene (Santoro et al., 2017)

consider the task of stacking boxes on top of each other. We increase the difficulty of the problem compared to the previous examples (Džeroski et al., 2001; Van Otterlo, 2005; Jiang & Luo, 2019) by considering the order of the boxes and requiring that the stack be formed on top of the blue box (the blue box should be on the floor). To make sure the models learn to generalize, we randomly place the boxes on the floor at the beginning of each episode. We consider up to 5 boxes. Hence, the scene constants in our ILP setup are the set {a,b,c,d,e,floor}. The dimension of the observation images is 64x64x3, and no explicit relational information is available to the agents. The action space for the problem involving n boxes is (n+1)×(n+1), corresponding to all possibilities of moving a box (or the floor) onto another box or the floor. Obviously some of the actions are not permitted, e.g., placing the floor on top of a box or moving a box that is already covered by another box.

Comparison to baseline: In the first experiment, we compare the performance of the proposed RRL technique to a baseline. For the baseline we consider standard deep A2C (with up to 10 agents) and we use the implementation in the stable-baselines library (Hill et al., 2018). We considered both MLP and CNN policies for the deep RL, but we report the results for the CNN policy because of its superior performance. For the proposed RRL system, we use two convolutional layers with kernel size 3 and stride 2 with tanh activation functions. We apply two layers of MLP with softmax activation functions to learn the groundings of the predicates posH(X,Y) and posV(X,Y). Our presumed grid is (n+1)×(n+1) and we allow for positional constants {0,1,...,n} to represent the locations in the grid in our ILP setting. As a constraint, we add penalty terms to make sure posV(floor,0) is true. We use vanilla policy gradient learning and, to generate actions, we define a learnable hypothesis predicate move(X,Y). Since we have n+1 box constants (including the floor), the groundings of this hypothesis correspond to the (n+1)×(n+1) possible actions. Since the values of these groundings in

dNL-ILP will be between 0 and 1, we generate softmax logits by multiplying these outputs by a large constant c (e.g., c = 10). For the target predicate move(X,Y), we allow for 6 rules in learning (corresponding to a dNL-DNF function with 6 disjunctions). The complete list of auxiliary predicates and the parameters and weights used in the two models are given in Appendix A. As indicated in Fig. 5 and defined in (4), we introduce the predicate on(X,Y) as a function of the low-level state representation predicates posV(X,Y) and posH(X,Y). We also introduce higher-level concepts using these predicates to define aboveness (i.e., above(X,Y)) as well as isCovered(X,Y). Fig. 7 compares the average success per episode for the two models for the two cases of n = 4 and n = 5. The results show that for the case of n = 4, both models are able to learn a successful policy after around 7000 episodes. For the more difficult case of n = 5, our proposed approach converges after around 20K episodes, whereas it takes more than 130K episodes for the A2C approach to converge, and even then it fluctuates and does not always succeed.

Effect of background knowledge: Contrary to standard deep RL, in an RRL approach we can introduce our prior knowledge into the problem via the powerful predicate language. By defining the structure of the problem via ILP, we can explicitly introduce inductive biases (Battaglia et al., 2018) which restrict the possible form of the solution. We can speed up the learning process, or shape the learnable actions even further, by incorporating background knowledge. To examine the impact of background knowledge on the speed of learning, we consider three cases for the BoxWorld problem involving n = 4 boxes. The baseline model (RRL1) is as described before. In RRL2, we add another auxiliary predicate which defines the movable states as:

moveable(X,Y) ← ¬isCovered(X), ¬isCovered(Y), ¬same(X,Y), ¬isFloor(X), ¬on(X,Y)

Figure 7. Comparing deep A2C and the proposed model on the BoxWorld task

Figure 8. Effect of background knowledge on learning BoxWorld

where ¬ indicates the negation of a term. In the third model (RRL3), we go one step further and force the target predicate move(X,Y) to incorporate the predicate moveable(X,Y) in each of its conjunction terms. Fig. 8 compares the learning performance of these models in terms of the average success rate (between 0 and 1) versus the number of episodes.

Interpretability: In the previous experiments, we did not consider the interpretability of the learned hypothesis. Since all the weights are fuzzy values, even though the learned hypothesis is still a parameterized symbolic function, it does not necessarily represent a valid Boolean formula. To achieve an interpretable result we add the small penalty described in (3). We also add a few more state constraints to make sure the learned representation follows our presumed grid notation (see Appendix A for details). The learned action predicate is found to be:

move(X,Y) ← moveable(X,Y), ¬lower(X,Y)
move(X,Y) ← moveable(X,Y), isBlue(Y)
lower(X,Y) ← posV(X,Z), posV(Y,T), lessthan(Z,T)

Figure 9. GridWorld environment (Zambaldi et al., 2018)

4.2. GridWorld Experiment

We use the GridWorld environment introduced in (Zambaldi et al., 2018) for this experiment. This environment consists of a 12×12 grid with keys and boxes randomly scattered. It also has an agent, represented by a single dark gray square. The boxes are represented by two adjacent colors. The square on the right represents the box's lock type, whose color indicates which key can be used to open that lock. The square on the left indicates the content of the box, which is inaccessible while the box is locked. The agent must collect the key before accessing the box. When the agent has a key and walks over a lock of the same color as its key, it can open that lock, and then it must enter the left square to acquire the new key contained inside. The agent cannot get the new key prior to successfully opening the lock on the right side of the key box. The goal is for the agent to open the gem box, colored white. We consider two difficulty levels. In the simple scenario, there is no (dead-end) branch. In the more difficult version, there can be one dead-end branch. An example of the environment and the branching scenarios is depicted in Fig. 9. This is a very difficult task involving complex reasoning. Indeed, in the original work it was shown that a multi-agent A3C combined with a non-local attention model could only start to learn after processing 5×10^8 episodes. To make this problem easier to tackle, we modify the action space to consist of the locations of points in the grid instead of directional actions. Given this definition of the problem, the agent's task is to give the location of the next move inside the rectangular grid. Hence, the dimension of the action space is 144 = 12×12. For this environment, we define the predicates color(X,Y,C), where X,Y ∈ {1,...,12}, C ∈ {1,...,10}, and hasKey(C) to represent the state. Here, the variables X,Y denote the coordinates, and the variable C is for the color. Similar to the BoxWorld game, we include a few auxiliary predicates such as isBackground(X,Y), isAgent(X,Y)

Figure 10. Effect of background knowledge on learning GridWorld

and isGem(X,Y) as part of the background knowledge. The representational power of ILP allows us to incorporate our prior knowledge about the problem into the model. As such, we can include higher-level auxiliary helper predicates such as:

isItem(X,Y) ← ¬isBackground(X,Y), ¬isAgent(X,Y)
locked(X,Y) ← isItem(X,Y), isItem(X,Z), inc(Y,Z)

where the predicate inc(X,Y) defines increments for integers (i.e., inc(n,n+1) is true for every integer n). The list of all auxiliary predicates used in this experiment, as well as the parameters of the neural networks, are given in Appendix B. Similar to the previous experiments, we consider two models: an A2C agent as the baseline and our proposed RRL model using the ILP language described in Appendix B. We list the number of episodes it takes

Table 1. Number of training episodes required for convergence

model           Without Branch    With Branch
proposed RRL    700               4500
A2C             > 10^8            > 10^8

to converge in each setting in Table 1. As the results suggest, the proposed approach can learn the solution in both settings very quickly. On the contrary, the standard deep A2C was not able to converge after 10^8 episodes. This example restates the fact that incorporating our prior knowledge about the problem can significantly speed up the learning process.

Further, similar to the BoxWorld experiment, we study the importance of background knowledge in the learning. In the first task (RRL1), we evaluate our model on the non-branching task by enforcing the action to include the isItem(X,Y) predicate. In RRL2, we do not enforce this. As shown in Fig. 10, the RRL1 model learns 4 times faster than RRL2. Arguably, this is because enforcing the inclusion of isItem(X,Y) in the action hypothesis reduces the possibility of exploring irrelevant moves (i.e., moving to a location without any item).

4.3. Relational Reasoning

Combining dNL-ILP with standard deep learning techniques is not limited to RRL settings. In fact, the same approach can be used in other areas in which we wish to reason about the relations of objects. To showcase this, we consider the relational reasoning task involving the Sort-of-CLEVR dataset (Santoro et al., 2017). This dataset (see Fig. 6b) consists of 2D images of colored objects. The shape of each object is either a rectangle or a circle, and each image contains up to 6 objects. The questions are hard-coded as fixed-length binary strings. Questions are either non-relational (e.g., "What is the color of the green object?") or relational (e.g., "What is the shape of the nearest object to the green object?"). In (Santoro et al., 2017), the authors combined a CNN-generated feature map with a special type of attention-based non-local network in order to solve the problem. We use the same CNN network and, similar to the GridWorld experiment, we learn the state representation using the predicate color(X,Y,C) (the color of each cell in the grid) as well as isCircle(X,Y), which learns whether the shape of an object is a circle or not. Our proposed approach reaches an accuracy of 99% on this dataset, compared to 94% for the non-local approach presented in (Santoro et al., 2017). The details of the model and the list of predicates in our ILP implementation are given in Appendix C.

5. Conclusion

In this paper, we proposed a novel deep Relational Reinforcement Learning (RRL) model based on a differentiable Inductive Logic Programming (ILP) solver that can effectively learn relational information from images. We showed how this model can take expert background knowledge and incorporate it into the learning problem using appropriate predicates. The differentiable ILP allows end-to-end optimization of the entire framework for learning the policy in RRL. We demonstrated the performance of the proposed RRL framework using environments such as BoxWorld and GridWorld.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: A system for large-scale machine learning. In 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16), pp. 265–283, 2016.

Allamanis, M., Brockschmidt, M., and Khademi, M. Learning to represent programs with graphs. arXiv preprint arXiv:1711.00740, 2017.

Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

Blockeel, H. and De Raedt, L. Top-down induction of first-order logical decision trees. Artificial Intelligence, 101(1-2):285–297, 1998.

Bryant, C., Muggleton, S., Page, C., Sternberg, M., et al. Combining active learning with inductive logic programming to close the loop in machine learning. In AISB99 Symposium on AI and Scientific Creativity, pp. 59–64. Citeseer, 1999.

Džeroski, S., De Raedt, L., and Blockeel, H. Relational reinforcement learning. In International Conference on Inductive Logic Programming, pp. 11–22. Springer, 1998.

Džeroski, S., De Raedt, L., and Driessens, K. Relational reinforcement learning. Machine Learning, 43(1-2):7–52, 2001.

Evans, R. and Grefenstette, E. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research, 61:1–64, 2018.

Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning, 2017.

Hill, A., Raffin, A., Ernestus, M., Gleave, A., Kanervisto, A., Traore, R., Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., and Wu, Y. Stable baselines. https://github.com/hill-a/stable-baselines, 2018.

Jiang, Z. and Luo, S. Neural logic reinforcement learning, 2019.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Narayanan, A., Chandramohan, M., Venkatesan, R., Chen, L., Liu, Y., and Jaiswal, S. graph2vec: Learning distributed representations of graphs. arXiv preprint arXiv:1707.05005, 2017.

Payani, A. and Fekri, F. Decoding LDPC codes on binary erasure channels using deep recurrent neural-logic layers. In Turbo Codes and Iterative Information Processing (ISTC), 2018 International Symposium On. IEEE, 2018.

Payani, A. and Fekri, F. Inductive logic programming via differentiable deep neural logic networks. arXiv preprint arXiv:1906.03523, 2019.

Santoro, A., Raposo, D., Barrett, D. G., Malinowski, M., Pascanu, R., Battaglia, P., and Lillicrap, T. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pp. 4967–4976, 2017.

Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

Van Otterlo, M. A survey of reinforcement learning in relational domains. Centre for Telematics and Information Technology (CTIT), University of Twente, Tech. Rep., 2005.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

Zambaldi, V., Raposo, D., Santoro, A., Bapst, V., Li, Y., Babuschkin, I., Tuyls, K., Reichert, D., Lillicrap, T., Lockhart, E., et al. Relational deep reinforcement learning. arXiv preprint arXiv:1806.01830, 2018.

A. BoxWorld Experiment

For a problem consisting of n boxes, we need n+1 constants of type box (note that we consider the floor as one of the boxes in our problem definition). Additionally, we define numerical constants {0, . . . , n} to represent box coordinates used in the posH and posV predicates. For the numerical constants we define orderings via the extensional predicates lt/2 and inc/2. For the box constants, we define the extensional predicates same/2 and isBlue/1. Here, by p/N we mean predicate p of arity N (i.e., p has N arguments). Since these are extensional predicates, their truth values are fixed at the beginning of the program via the background facts. For example, for the predicate inc/2, which defines the increment by one for the natural numbers, we need to set these background facts at the beginning: {inc(0,1), inc(1,2), ..., inc(n-1,n)}. Similarly, for the predicate lt (short for lessThan) this set includes items such as {lt(0,1), lt(0,2), ..., lt(n-1,n)}. It is worth noting that introducing predicates such as isBlue for boxes does not mean we already know which box is the blue one (the target box that needs to be the first box in the stack). This predicate merely provides a way for the system to distinguish between boxes. Since in our learned predicates we can only use symbolic atoms (predicates with variables), and since in the dNL-ILP implementation no constant is allowed in forming the hypothesis, this method allows for referring to a specific constant in the learned action predicates. Here, for example, we make the assumption that box a in our list of constants corresponds to the blue box via the background fact isBlue(a). Table 3 explains all extensional predicates as well as the helper auxiliary predicates used in our BoxWorld program.
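For illustration, the background facts described above could be generated along the following lines; the tuple encoding of facts is our own convention, not the dNL-ILP input format:

```python
n = 5
boxes = ['a', 'b', 'c', 'd', 'e', 'floor']

background = set()
background |= {('inc', i, i + 1) for i in range(n)}                    # inc(0,1), ..., inc(n-1,n)
background |= {('lt', i, j) for i in range(n + 1) for j in range(i + 1, n + 1)}
background |= {('same', b, b) for b in boxes}                          # same(a,a), ..., same(floor,floor)
background.add(('isBlue', 'a'))                                        # box 'a' plays the role of the blue box
background.add(('isFloor', 'floor'))
```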

So far in our discussions, we have not distinguished between the types of variables. However, in the dNL-ILP implementation, the type of each variable should be specified in the signature of each defined predicate. This allows dNL-ILP to use only valid combinations of variables to generate the symbolic formulas. In the BoxWorld experiment, we have two types of constants, Tb and Tp, referring to the box and numeric constants, respectively. In Table 3, for each defined predicate, the list of variables and their corresponding types is given, where for example X(Tp) states that the type of variable X is Tp.

For clarity, Table 3 is divided into two sections. In the top section the constants are defined. In the other section, 4 groups of predicates are presented: (i) state representation predicates whose groundings are learned from the image, (ii) extensional predicates, which are solely defined by background facts, (iii) auxiliary predicates and (iv) the signature of the target predicate that is used to represent the actions in the policy gradient scheme. To learn the policy, we use a discount factor of 0.7 and the ADAM optimizer with a learning rate of 0.002. We set the maximum number of steps for each episode to 20. To learn the feature map, we use two layers of CNNs with kernel size 5 and stride 3, and we apply fully connected layers with softmax activation functions to learn the groundings of the state representation predicates posH and posV.
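A Keras sketch of this state-representation network, following the appendix description (kernel size 5, stride 3, softmax groundings); the filter counts, the tanh activations and the single dense layer per predicate are our guesses where the paper does not pin them down:

```python
import tensorflow as tf
from tensorflow.keras import layers

n = 5                                                        # number of boxes; positions are 0..n
image = tf.keras.Input(shape=(64, 64, 3))
x = layers.Conv2D(24, kernel_size=5, strides=3, activation='tanh')(image)
x = layers.Conv2D(24, kernel_size=5, strides=3, activation='tanh')(x)
x = layers.Flatten()(x)

def position_head(features, name):
    # one (n+1)-way softmax per box: fuzzy groundings of posH(box, pos) or posV(box, pos)
    logits = layers.Dense((n + 1) * (n + 1))(features)
    grid = layers.Reshape((n + 1, n + 1))(logits)
    return layers.Softmax(axis=-1, name=name)(grid)

posH = position_head(x, 'posH')
posV = position_head(x, 'posV')
state_model = tf.keras.Model(image, [posH, posV])
```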

B. GridWorld Experiment

In the GridWorld experiment, we distinguish between the constants used for vertical and horizontal coordinates, as shown in Table 4. This reduces the number of possible symbolic atoms and makes the convergence faster. We also define 10 constants of type Tc to represent each of the 10 possible colors of the grid cells in the scene. The complete list of all the extensional and auxiliary predicates for this program is given in Table 4. In this task, to incorporate our knowledge about the problem, we define the concept of an item via the predicate isItem(X,Y). Using this definition, an item represents any of the cells that are neither background nor agent. By incorporating this concept, we can further define higher-level concepts via the locked and isLock predicates, which identify two adjacent items as a locked item and its corresponding key, respectively. The dimension of the observation image is 112x112x3. We consider two types of networks to extract grid colors from the image. In the first method, we apply three layers of convolutional networks with kernel sizes of [5,3,3] and strides of 2 for all layers. We use the relu activation function and apply batch normalization after each layer. The number of features in each layer is set to 24. We take the feature map of size 14×14×24 and apply a two-layer MLP with dimensions of 64 and 10 (10 is the number of color constants) and activation functions of relu and softmax to generate the grounding matrix G (the dimension of G is 14x14x10). As can be seen in Fig. 11, the current key is located at the top left border. We extract the grounding of the predicate hasKey from the value of G[0,0,:] and the groundings of the predicate color by discarding the border elements of G (i.e., G[1..12,1..12,:]).
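In code, the slicing described above is straightforward; G stands for the 14x14x10 grounding matrix produced by the network, and the random stand-in is only there so the snippet runs:

```python
import numpy as np

G = np.random.rand(14, 14, 10)     # stand-in for the grounding matrix produced by the CNN + MLP
has_key = G[0, 0, :]               # fuzzy groundings of hasKey(C): the current key is drawn at the top-left border
color = G[1:13, 1:13, :]           # fuzzy groundings of color(X, Y, C) for the 12 x 12 interior cells
```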

Alternatively, and because of the simplicity of the environment, instead of using CNNs we may take the center of each cell to directly create a feature map of size 14×14×3 and then apply the MLP to generate the groundings. In our experiments we tested both scenarios. Using the CNN approach, the convergence of our method was around 2 times slower. For the A2C algorithm, we could not solve the problem using either of these approaches. We set the maximum number of steps in an episode to 50 and use a learning rate of 0.001. For our models, we use a discount factor of 0.9, and for A2C we tested various values in the range of 0.9 to 0.99 to find the best solution.

Figure 11. An example GridWorld scene

Table 2. Flat index for a grid of 4 by 4 used in relational learning task

0   1   2   3
4   5   6   7
8   9   10  11
12  13  14  15

C. Relational Reasoning Experiment

This task was introduced as a benchmark for learning relational information from images in (Santoro et al., 2017). In this task, objects of two types (circle or rectangle) and in 6 different colors are randomly placed in a 4x4 grid, and questions like "Is the blue object a circle?" or "What is the color of the object nearest to the blue one?" should be answered. To formulate this problem via ILP, instead of using grid coordinates (vertical and horizontal positions as in the past experiments), we consider a flat index, as shown in Table 2. As such, we have 16 positional constants of type Tp to represent grid positions as well as 6 constants of type Tc to represent the color of each object. In the original problem in (Santoro et al., 2017), a question is represented as a binary vector of length 11. The first 6 bits represent the one-hot encoding of the color of the object in question. The next 5 bits represent one of the five possible types of questions, e.g., "Is it a circle or a rectangle?" or "What is the color of the nearest object?".
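The flat indexing of Table 2 and distance-based facts such as lt(X,Y,Z) in Table 5 can be generated as in the following sketch; the Euclidean distance and the helper names are our assumptions:

```python
import numpy as np

def coords(i, width=4):
    # flat index -> (row, col), following the layout of Table 2
    return divmod(i, width)

def lt_fact(x, y, z, width=4):
    # lt(X,Y,Z): the distance from cell X to cell Y is less than the distance from X to Z
    dist = lambda a, b: np.hypot(*(np.subtract(coords(a, width), coords(b, width))))
    return dist(x, y) < dist(x, z)

print(coords(9))            # (2, 1)
print(lt_fact(0, 1, 15))    # True: cell 1 is closer to cell 0 than cell 15 is
```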

We define a nullary predicate for each of the 11 possible bits in the question via the predicates isQ0(), . . . , isQ10(). Similar to the original paper, we use 4 layers of CNNs with relu activation functions and batch normalization to obtain a feature map of size 4x4x24. We use a kernel size of 5 and strides of [3,3,2,2] for the layers. By applying fully connected layers to the feature map, we learn the groundings of the predicates color(X,Y,Z), isObject(X,Y) and isCircle(X,Y). We define some auxiliary predicates as shown in Table 5. For each of the 10 possible answers, we create one nullary predicate. The vector of size 10 created by the groundings of these 10 predicates (i.e., isAnswer0(), . . . ) is then used to calculate the cross entropy loss between the network output and the provided ground truth. We should mention that in the definition of the auxiliary predicate qa(X), we exploit our prior knowledge that the first 6 bits of the question vector correspond to the colors of the object. Without introducing this predicate, the convergence of our model is significantly slower. For example, while by incorporating this predicate we need around 30000 training samples for convergence, it would take more than 200000 training samples without it.

D. Asterix Experiment

In the Asterix game (an Atari 2600 game), the player controls a unit called Asterix to collect as many objects as possible while avoiding deadly predators. Even though it is a rather simple game, many standard RL algorithms score significantly below the human level. For example, as reported in (Hessel et al., 2017), DQN and A3C achieve average scores of 3170 and 22140, respectively, even after processing hundreds of millions of frames. Here, our goal is not to outperform the current approaches. Instead, we demonstrate that by using a very simple language for describing the problem, an agent can achieve scores in the range of 30K-40K with only a few thousand training scenes. For this game we consider the active part of the scene as an 8x12 grid. For simplicity, we consider only 4 types of objects: the agent, left-to-right moving predators, right-to-left moving predators and, finally, food objects. The dimension of the input image is 128x144. We use 4 convolutional

Figure 12. Score during training in the Asterix experiment

left moving predator and finally the food objects. The dimension of the input image is 128x144. We use 4 convolutionallayers with strides of [(2,3),(2,1),(2,2),(2,2)] and kernel size of 5 to generate a feature map of size 8x12x48. By applying 4fully connected layers with softmax activation function we learn the groundings of the predicates corresponding to thefour types of objects, i.e., O1(X,Y),. . . ,O4(X,Y). The complete list of predicates for this experiment is listed in Table6. We learn 5 nullary predicates corresponding to the 5 direction of moves (i.e., no move, left,right,up,down) and use thesame policy gradient learning approach as before. The notable auxiliary predicates in the chosen language are the 4 helperpredicates that define bad moves. For example, badMoveUp() states that an upward move is bad when there is a predatorin the close neighborhood of the agent and in the top row. Similarly, badMoveLeft() is defined to state that when apredator is coming from left side of an agent, it is a bad idea to move left.

However, given the complexity of the scene and the existence of some overlap between objects, learning the representation of the state is not as easy as in the previously explored experiments. To help the agent learn the presumed grid representation, we provide a set of labeled scenes (a semi-supervised approach) and we penalize the objective function using the loss calculated between these labels and the learned representation. Fig. 12 shows the learning curve for the two cases of using 20 and 50 randomly generated labeled scenes. In the case of 50 provided labels, the agent eventually learns to score around 30-40K per episode. Please note that we did not include all the possible objects that are encountered during later stages of the game, and we use a simplistic representation and language just to demonstrate the application of the RRL framework in more complex scenarios.
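The semi-supervised term can be sketched as follows; grounding_network, the labeled tensors and the weighting are placeholders of ours, standing in for the object-type groundings O1..O4 over the 8x12 grid:

```python
import tensorflow as tf

def semi_supervised_loss(grounding_network, labeled_images, labeled_grids):
    # labeled_grids: one-hot object types per cell for a handful of hand-labeled scenes, shape (B, 8, 12, 4)
    pred = grounding_network(labeled_images)          # predicted fuzzy groundings, same shape
    return tf.reduce_mean(
        tf.keras.losses.categorical_crossentropy(labeled_grids, pred))

# total_loss = policy_loss + 1.0 * semi_supervised_loss(net, labeled_images, labeled_grids)
```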

Table 3. ILP definition of the BoxWorld

Constants    Description                      Values
Tb           box constants                    {a, b, c, d, floor}
Tp           coordinate position constants    {0, 1, 2, . . . , n}

Predicate        Variables                       Definition
posH(X,Y)        X(Tb), Y(Tp)                    learned from Image
posV(X,Y)        X(Tb), Y(Tp)                    learned from Image

isFloor(X)       X(Tb)                           isFloor(floor)
isBlue(X)        X(Tb)                           isBlue(a)
isV1(X)          X(Tp)                           isV1(1)
inc(X,Y)         X(Tp), Y(Tp)                    inc(0,1), inc(1,2), . . . , inc(n-1,n)
lt(X,Y)          X(Tp), Y(Tp)                    lt(0,1), lt(0,2), . . . , lt(n-1,n)

same(X,Y)        X(Tb), Y(Tb)                    same(a,a), same(b,b), . . . , same(floor,floor)
sameH(X,Y)       X(Tb), Y(Tb), Z(Tp)             posH(X,Z), posH(Y,Z)
sameV(X,Y)       X(Tb), Y(Tb), Z(Tp)             posV(X,Z), posV(Y,Z)
above(X,Y)       X(Tb), Y(Tb), Z(Tp), T(Tp)      sameH(X,Y), posV(X,Z), posV(Y,T), lt(T,Z)
below(X,Y)       X(Tb), Y(Tb), Z(Tp), T(Tp)      sameH(X,Y), posV(X,Z), posV(Y,T), lt(Z,T)
on(X,Y)          X(Tb), Y(Tb), Z(Tp), T(Tp)      sameH(X,Y), posV(X,Z), posV(Y,T), inc(T,Z)
isCovered(X)     X(Tb), Y(Tb)                    on(Y,X), ¬isFloor(X)
moveable(X,Y)    X(Tb), Y(Tb)                    ¬isCovered(X), ¬isCovered(Y), ¬same(X,Y), ¬isFloor(X), ¬on(X,Y)

move(X,Y)        X(Tb), Y(Tb), Z(Tp), T(Tp)      Action predicate that is learned via policy gradient

Table 4. ILP definition of the GridWorld

Constants    Description               Values
Tv           vertical coordinates      {0, 1, 2, . . . , 11}
Th           horizontal coordinates    {0, 1, 2, . . . , 11}
Tc           cell color code           {0, 1, 2, . . . , 9}

Predicate       Variables                  Definition
color(X,Y,Z)    X(Tv), Y(Th), Z(Tc)        learned from Image
hasKey(X)       X(Tc)                      learned from Image

incH(X,Y)       X(Th), Y(Th)               incH(0,1), . . . , incH(10,11)
isC0(X)         X(Tc)                      isC0(0)
isC1(X)         X(Tc)                      isC1(1)
isC2(X)         X(Tc)                      isC2(2)

isBK(X,Y)       X(Tv), Y(Th), Z(Tc)        color(X,Y,Z), isC0(Z)
isAgent(X,Y)    X(Tv), Y(Th), Z(Tc)        color(X,Y,Z), isC1(Z)
isGem(X,Y)      X(Tv), Y(Th), Z(Tc)        color(X,Y,Z), isC2(Z)
isItem(X,Y)     X(Tv), Y(Th)               ¬isBK(X,Y), ¬isAgent(X,Y)
locked(X,Y)     X(Tv), Y(Th), Z(Th)        isItem(X,Y), isItem(X,Z), incH(Y,Z)
isLock(X,Y)     X(Tv), Y(Th), Z(Th)        isItem(X,Y), isItem(X,Z), incH(Z,Y)

move(X,Y)       X(Tv), Y(Th), Z(Tc)        Action predicate that is learned via policy gradient

Table 5. ILP definition of the relational reasoning task

Constants    Description                                  Values
Tp           flat position of items in a 4 by 4 grid      {0, 1, 2, . . . , 15}
Tc           color of an item                             {0, 1, 2, . . . , 5}

Predicate           Variables                 Definition
isQ0()                                        given as a binary value
. . .                                         . . .
isQ10()                                       given as a binary value
color(X,Y)          X(Tp), Y(Tc)              learned from Image
isCircle(X)         X(Tp)                     learned from Image
isObject(X)         X(Tp)                     learned from Image

equal(X,Y)          X(Tp), Y(Tp)              equal(0,0), . . . , equal(15,15)
lt(X,Y,Z)           X(Tp), Y(Tp), Z(Tp)       true if the distance between the grid cells corresponding to X and Y is less than the distance between X and Z (see Table 2)

left(X)             X(Tp)                     left(0), left(1), . . . , left(12), left(13)
right(X)            X(Tp)                     right(2), right(3), . . . , right(14), right(15)
top(X)              X(Tp)                     top(0), top(1), . . . , top(6), top(7)
bottom(X)           X(Tp)                     bottom(8), bottom(9), . . . , bottom(14), bottom(15)

closer(X,Y,Z)       X(Tp), Y(Tp), Z(Tp)       isObject(X), isObject(Y), isObject(Z), lt(X,Y,Z)
farther(X,Y,Z)      X(Tp), Y(Tp), Z(Tp)       isObject(X), isObject(Y), isObject(Z), gt(X,Y,Z)
notClosest(X,Y)     X(Tp), Y(Tp), Z(Tp)       closer(X,Z,Y)
notFarthest(X,Y)    X(Tp), Y(Tp), Z(Tp)       farther(X,Z,Y)

qa(X)               X(Tp), Y(Tc)              isQ0(), color(X,Y), isC0(Y)
                                              isQ1(), color(X,Y), isC1(Y)
                                              isQ2(), color(X,Y), isC2(Y)
                                              isQ3(), color(X,Y), isC3(Y)
                                              isQ4(), color(X,Y), isC4(Y)
                                              isQ5(), color(X,Y), isC5(Y)

isAnswer0()         X(Tp), Y(Tp)              the learned hypothesis: is the answer 0
. . .               . . .                     . . .
isAnswer9()         X(Tp), Y(Tp)              the learned hypothesis: is the answer 9

Table 6. ILP definition of the Asterix experiment

Constants    Description               Values
Tv           vertical coordinates      {0, 1, 2, . . . , 8}
Th           horizontal coordinates    {0, 1, 2, . . . , 12}

Predicate         Variables                         Definition
O1(X,Y)           X(Tv), Y(Th)                      learned from Image: objects of type agent
O2(X,Y)           X(Tv), Y(Th)                      learned from Image: objects of type L2R predator
O3(X,Y)           X(Tv), Y(Th)                      learned from Image: objects of type R2L predator
O4(X,Y)           X(Tv), Y(Th)                      learned from Image: objects of type food

isV0(X)           X(Tv)                             isV0(0)
isV11(X)          X(Tv)                             isV11(11)
isH0(X)           X(Th)                             isH0(0)
isH7(X)           X(Th)                             isH7(7)
incV(X,Y)         X(Tv), Y(Tv)                      incV(0,1), . . . , incV(6,7)
ltH(X,Y)          X(Th), Y(Th)                      ltH(0,1), ltH(0,2), . . . , ltH(11,12)
closeH(X,Y)       X(Th), Y(Th)                      true if |X − Y| ≤ 2

agentH(X)         X(Th), Y(Tv)                      O1(Y,X)
agentV(X)         X(Tv), Y(Th)                      O1(X,Y)
predator(X,Y)     X(Tv), Y(Th)                      O2(X,Y)
                                                    O3(X,Y)
agent()           X(Tv)                             agentV(X)
badMoveUp()       X(Tv), Y(Th), Z(Tv), T(Th)        O1(X,Y), predator(Z,T), incV(Z,X), closeH(Y,T)
badMoveDown()     X(Tv), Y(Th), Z(Tv), T(Th)        O1(X,Y), predator(Z,T), incV(X,Z), closeH(Y,T)
badMoveLeft()     X(Tv), Y(Th), Z(Th)               O1(X,Y), O2(X,Z), ltH(Z,Y), closeH(Z,Y)
badMoveRight()    X(Tv), Y(Th), Z(Th)               O1(X,Y), O3(X,Z), ltH(Y,Z), closeH(Z,Y)

moveUp()          X(Tv), Y(Th), Z(Tv), T(Th)        Action predicate that is learned via policy gradient
moveDown()        X(Tv), Y(Th), Z(Tv), T(Th)        Action predicate that is learned via policy gradient
moveLeft()        X(Tv), Y(Th), Z(Tv), T(Th)        Action predicate that is learned via policy gradient
moveRight()       X(Tv), Y(Th), Z(Tv), T(Th)        Action predicate that is learned via policy gradient
moveNOOP()        X(Tv), Y(Th), Z(Tv), T(Th)        Action predicate that is learned via policy gradient