
ACTRCE: Augmenting Experience via Teacher’s Advice For Multi-Goal Reinforcement Learning

Harris Chan * 1 2 Yuhuai Wu * 1 2 Jamie Kiros 3 Sanja Fidler 1 2 Jimmy Ba 1 2

Abstract

Sparse reward is one of the most challenging problems in reinforcement learning (RL). Hindsight Experience Replay (HER) attempts to address this issue by converting a failed experience into a successful one by relabeling the goals. Despite its effectiveness, HER has limited applicability because it lacks a compact and universal goal representation. We present Augmenting experienCe via TeacheR’s adviCE (ACTRCE), an efficient reinforcement learning technique that extends the HER framework using natural language as the goal representation. We first analyze the differences among goal representations, and show that ACTRCE can efficiently solve difficult reinforcement learning problems in challenging 3D navigation tasks, whereas HER with a non-language goal representation fails to learn. We also show that with language goal representations, the agent can generalize to unseen instructions, and even to instructions with unseen lexicons. We further demonstrate that it is crucial to use hindsight advice to solve challenging tasks, and that even a small amount of advice is sufficient for the agent to achieve good performance.

1. Introduction

Many impressive deep reinforcement learning (deep RL) applications rely on carefully crafted reward functions to encourage the desired behavior. However, designing a good reward function is non-trivial (Ng et al., 1999) and requires significant engineering effort. For example, even for the seemingly simple task of stacking Lego blocks, Popov et al. (2017) needed 5 complicated reward terms with different importance weights. Moreover, handcrafted reward shaping (Gullapalli & Barto, 1992) can lead to biased learning, which may cause the agent to learn unexpected and undesired behaviors (Clark & Amodei, 2016).

*Equal contribution. 1Department of Computer Science, University of Toronto, Toronto, Canada. 2Vector Institute, Toronto, Canada. 3Google Brain, Toronto, Canada. Correspondence to: Harris Chan <[email protected]>, Yuhuai Wu <[email protected]>.

Preprint. Work in Progress.

One approach to avoid defining a complicated reward function is to use a sparse and binary reward function, i.e., give only a positive or negative reward at the terminal state, depending on the success of the task. However, the sparse reward makes learning difficult.

Hindsight Experience Replay (HER) (Andrychowicz et al., 2017) attempts to address this issue. The main idea of HER is to utilize failed experiences by substituting an alternative goal, converting them into successful experiences. For their algorithm to work, Andrychowicz et al. made the non-trivial assumption that for every state in the environment, there exists a goal which is achieved in that state. As the authors pointed out, this assumption can be trivially satisfied by choosing the goal space to be the state space. However, representing the goal using the enormous state space is very inefficient and contains much redundant information.

For example, if we want to ask an agent to avoid collisions while driving, where the state is the raw pixel values from the camera, then there can be many states (i.e., frames) that achieve this goal. It is redundant to represent the goal using each such state.

Therefore, we need a goal representation that is (1) expressive and flexible enough to satisfy the assumption in HER, while also being (2) compact and informative, where similar goals are represented using similar features. Natural language representation of goals satisfies both requirements. First, language can flexibly describe goals across tasks and environments. Second, language is abstract, and hence able to compress away redundant information in the states. Recall the previous driving example, for which we can simply describe "avoid collisions" to represent all states that satisfy this goal. Moreover, the compositional nature of language provides transferable features for generalizing across goals.

In this paper, we combine the HER framework with natural language goal representation, and propose Augmenting experienCe via TeacheR’s adviCE (ACTRCE; pronounced "actress"), an efficient technique applicable to a broad range


of reinforcement learning problems. Our method works as follows. Whenever an agent finishes an episode, a teacher gives advice in natural language based on the episode. The agent uses the advice to form a new experience with a corresponding reward, alleviating the sparse reward problem. For example, the teacher can describe what the agent has achieved in the episode, and the agent can replace the original goal with the advice and a reward of 1. We show the many benefits brought by language goal representations when combined with hindsight advice. The agent can efficiently solve reinforcement learning problems in challenging 2D and 3D environments; it can generalize to unseen instructions, and even to instructions with unseen lexicons. We further demonstrate that it is crucial to use hindsight advice to solve challenging tasks, but we also find that a small amount of hindsight advice is sufficient for the agent's performance to improve significantly, showing the practicality of the method.

We note that our work is also interesting from a language learning perspective. Learning to achieve goals described in natural language is part of a class of problems called language grounding (Harnad, 1990), which has recently received increasing interest, as grounding is believed to be necessary for a more general understanding of natural language. Early attempts to ground language in a simulated physical world (Winograd, 1972; Siskind, 1994) consisted of hard-coded rules which could not scale beyond their original domain. Recent work has used reinforcement learning techniques to address this problem (Hermann et al., 2017; Chaplot et al., 2017b). Our work combines reinforcement learning and rich language advice, providing an efficient technique for language grounding.

2. Background

2.1. Reinforcement Learning and Deep Q-Networks

We first review the traditional reinforcement learning setting, where an agent interacts with an infinite-horizon, discounted Markov Decision Process (S, A, γ, P, r). Here, S is the state space, A is the action space, γ the discount factor, P the transition model, and r : S × A → R is the reward function. We consider a policy π_θ(a|s) parameterized by θ. At time t, the agent chooses an action a_t ∈ A according to its policy π_θ(a|s_t), where s_t ∈ S is the current state. The environment in turn produces a reward r(s_t, a_t) and transitions to the next state s_{t+1} according to the transition probability P(s_{t+1}|s_t, a_t). The goal of the agent is to maximize the expected γ-discounted cumulative return E_π[∑_{t=0}^∞ γ^t r(s_t, a_t)] with respect to the policy parameters θ. The action-value function is denoted as Q^π(s_t, a_t) = E_π[∑_{i=t}^∞ γ^{i−t} r(s_i, a_i) | s = s_t, a = a_t], which is the expected discounted sum of rewards obtained by performing action a at state s and following the policy π thereafter. We denote by π* the optimal policy such that Q^{π*}(s, a) ≥ Q^π(s, a) for every s ∈ S, a ∈ A and policy π. The optimal Q-function Q* = Q^{π*} satisfies the Bellman equation:

Q*(s, a) = E_{s'∼p(·|s,a)}[ r(s, a) + γ max_{a'∈A} Q*(s', a') ]    (1)

Q-learning (Sutton & Barto, 2018) is an off-policy, model-free RL algorithm based on the Bellman equation. The algorithm uses semi-gradient descent (Sutton & Barto, 2018) to minimize the squared Temporal Difference (TD) error L = E[(Q(s_t, a_t) − y_t)^2], where y_t = r_t + γ max_{a'∈A} Q(s_{t+1}, a') is the TD target. Deep Q-Network (Mnih et al., 2013) builds on the Q-learning algorithm and uses a neural network to approximate Q*. It also uses a replay buffer as well as a target network to improve training stability.
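To make the TD target concrete, the following is a minimal PyTorch sketch of the DQN loss; the network handles, tensor shapes, and the `batch` layout are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD error L = E[(Q(s_t, a_t) - y_t)^2] with a frozen target network.

    `batch` is assumed to hold tensors: states (B, ...), actions (B,) as int64,
    rewards (B,), next_states (B, ...), and dones (B,) in {0, 1}.
    """
    s, a, r, s_next, done = batch
    # Q(s_t, a_t) for the actions actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # y_t = r_t + gamma * max_a' Q_target(s_{t+1}, a'); no bootstrap at terminal states
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, y)
```

In practice the target network's weights are copied from the online network only periodically, which is what stabilizes the bootstrapped target.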

2.2. Goal Oriented Reinforcement Learning

We review the goal-oriented reinforcement learning framework following (Schaul et al., 2015). We augment the previously defined infinite-horizon, discounted Markov Decision Process with a goal space G. A goal g is chosen from G and stays fixed for each episode. The goal induces a reward function r_g : S × A → R that assigns reward to a state conditioned on the given goal. At time step t, the agent π(a_t|s_t, g) chooses an action conditioned on the current state s_t as well as the current goal g. The agent's objective is to maximize the expected discounted cumulative return given the goal, i.e., E_π[∑_{t=0}^∞ γ^t r(s_t, a_t, g)].

2.3. Hindsight Experience Replay (HER)

We follow (Andrychowicz et al., 2017) and consider a specific family of goal-conditioned reward functions, where the reward is either 0 or 1, depending on the resulting state: r_g(s_t, a_t) = f_g(s_{t+1}), where f_g : S → {0, 1}. The function f_g acts as a predicate that determines whether the agent is successful or not according to g, and only assigns positive reward at the terminal state. Hence the reward function is very sparse, making it difficult for an agent to learn.

HER proposes a solution to this problem. The agent first collects an episode of experience s_0, a_0, r_1, ..., s_{T−1}, a_{T−1}, r_T, s_T under the goal g, where r_T = r_g(s_{T−1}, a_{T−1}) is the terminal reward. If r_T is zero, one can replace g with a g' ∈ G such that r_{g'}(s_{T−1}, a_{T−1}) = 1. With an off-policy algorithm, one is able to learn from the goal-transformed experience. In order to flexibly relabel with a desirable goal, the authors assume that there exists a representation map from the state space to the goal space, T : S → G, so that for each s ∈ S the corresponding goal representation satisfies the predicate, f_{T(s)}(s) = 1. However, such a mapping is not simple to construct. Although the authors pointed out that
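The relabeling step can be sketched as follows; the episode container, the namedtuple transitions with a `_replace` method, and the helper names `goal_of_state` (playing the role of the mapping T) and `reward_fn` (the predicate f_g applied to the next state) are illustrative assumptions rather than the original implementation.

```python
def her_relabel(episode, goal_of_state, reward_fn):
    """Hindsight relabeling sketch: if the episode failed under its original goal g,
    replace g with a goal g' that is achieved at the terminal state, obtained via
    the mapping T: S -> G (here `goal_of_state`)."""
    transitions, g = episode["transitions"], episode["goal"]
    s_T = transitions[-1].next_state
    if reward_fn(g, s_T) == 1:          # already a success; nothing to relabel
        return []
    g_prime = goal_of_state(s_T)        # a goal satisfied in s_T by construction
    relabeled = []
    for tr in transitions:
        r = reward_fn(g_prime, tr.next_state)   # sparse reward f_{g'}(s_{t+1})
        relabeled.append(tr._replace(goal=g_prime, reward=r))
    return relabeled
```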


(Figure 1 diagram: a language-processing module embeds the natural-language goal G, e.g., "Go to the blue torch", into a goal embedding; a vision-processing module encodes the state S_t into an image representation; gated attention fuses the goal embedding with the image features; a value network outputs the values for the actions.)

Figure 1. The diagram illustrates our model architecture.

a trivial construction can be done by taking G = S and f_g(s) = [g = s], such a goal representation is redundant and uninformative, and largely limits the applicability of the algorithm. Consider an example where the state is the first-person raw pixel observation in a 3D environment, and the goal is to walk towards a particular object. There are many possible directions from which to approach the object, which results in many different possible states that satisfy the goal. However, no subset of the raw pixel observation can abstractly represent the goal.

3. Augmenting Experience via Teacher’s Advice (ACTRCE)

For a goal-oriented MDP {S, G, A, γ, P, r_g} and a parameterized agent π(a_t|s_t, g), our objective is to maximize the expected discounted cumulative return, i.e., E_π[∑_{t=0}^∞ γ^t r_g(s_t, a_t)], for each g ∈ G. Following HER, we assume there exists one or more goal representation mappings T : S → G, which describe what goal is satisfied at each state. For example, in the original HER paper, the mapping T is simply a subspace projection from the state. Such a representation can work well for their robotic tasks, where a subset of the state (i.e., the coordinates of objects and body parts) can represent the goal reasonably well. However, in general, such a mapping cannot abstract meaningful information about the goal and will cause redundancy.

Language provides an abstract representation for a goal and hence reduces redundancy in the representation. This motivates our proposal to use natural language to represent the goal space. Concretely, for each state s ∈ S, we define T as a teacher that gives advice T(s), a natural-language description of the goal that is achieved in the state s. To implement such a teacher, one can ask a human to look at a state and describe the goal in words, or generate a goal description automatically, as part of a narration or accessibility feature in a game. We take the latter approach, using the underlying emulator states to generate the teacher instruction. Given T, we can convert a failed trajectory into a successful one by relabeling the original goal with the teacher's advice. To illustrate, say the original goal for the episode is "Go to the armor", but at this particular time step the agent has reached the blue torch instead. The teacher

then tells the agent that the goal reflecting the current state is "Go to the blue torch". The agent can then take the advice and use it as a positive experience.

Besides positive reward signals, we also find that negative experience is necessary for training. Hence, in addition to a teacher who describes the achieved goal with a positive reward, we also introduce a teacher who gives a negative reward together with a description of a goal that has not been achieved. We further consider the scenario where more than one goal can be satisfied in a state; e.g., a state described as "Reach a blue torch" may also be described as "Reach the largest blue object". Each different description of the goal corresponds to a different teacher (see more discussion and experiments on variations of teachers in Appendix E). Therefore, in general, there is a group of teachers of different kinds, each giving different advice. We then relabel the original goal with each piece of advice and the corresponding positive or negative reward, and augment the replay buffer with these trajectories. Algorithm 1 describes our proposed approach formally.

Note that in the above formulation, we assume an MDP setting and let the teacher give advice based solely on the terminal state, g' = T(s_T). By contrast, in the POMDP setting, we ask the teacher to give advice based on the history of states and actions during the episode, i.e., g' = T({s_0, a_0, ..., s_T}).

Algorithm 1 Augmenting Experience via Teacher’s Advice (ACTRCE)

Given:
• an off-policy RL algorithm A and a replay buffer R (e.g., DQN, DDPG, NAF, SDQN)
• a language goal space G and a desired goal space G_d
• a group of teachers {T}

for episode = 1, M do
    Sample a desirable goal description g ∈ G_d and an initial state s_0.
    for t = 0, T − 1 do
        Sample an action a_t using the behavioral policy from A: a_t ← π_b(s_t || g)    (|| denotes concatenation)
    end for
    for every teacher T do
        Compute advice g' = T(s_T).    (teacher's advice in natural language)
        for t = 0, T − 1 do
            r = r_{g'}(s_t, a_t)
            Store transition (s_t || g', a_t, r, s_{t+1} || g') in R
        end for
    end for
    Perform training with A and a minibatch B from the replay buffer.
end for
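For concreteness, a Python sketch of one ACTRCE episode in the spirit of Algorithm 1 is given below; the `env`, `agent`, `teachers`, `replay_buffer`, and `reward_fn` interfaces are illustrative assumptions rather than the paper's code.

```python
import random

def actrce_episode(env, agent, teachers, replay_buffer, desired_goals, reward_fn):
    """One episode of ACTRCE (Algorithm 1), sketched with hypothetical interfaces.
    `teachers` is a list of callables mapping a terminal state to a natural-language
    advice string; `reward_fn(g, s, a)` is the goal-conditioned reward r_g(s_t, a_t)."""
    g = random.choice(desired_goals)             # desirable goal g in G_d
    s = env.reset()
    trajectory = []
    done = False
    while not done:
        a = agent.act(s, g)                      # behavioural policy pi_b(s || g)
        s_next, done = env.step(a)               # hypothetical environment interface
        trajectory.append((s, a, s_next))
        s = s_next
    s_T = trajectory[-1][2]                      # terminal state of the episode
    for teacher in teachers:                     # each teacher gives different advice
        g_adv = teacher(s_T)                     # e.g. "Go to the blue torch"
        for (s_t, a_t, s_t1) in trajectory:
            r = reward_fn(g_adv, s_t, a_t)       # positive or negative hindsight reward
            replay_buffer.add(s_t, g_adv, a_t, r, s_t1)
```

Because the stored transitions are ordinary tuples in the replay buffer, any off-policy learner (e.g., DQN) can then be trained on minibatches drawn from R as usual.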

3.1. Language representation

There remains a question of how we deal with the natural language goal representation, a sequence of discrete symbols.


In this paper, we explore two standard ways to convert a language goal into a continuous vector representation for neural network training. One way is to represent each word as a one-hot vector and use a recurrent neural network (RNN) to sequentially embed each word into its hidden state. We then obtain some function of its hidden states (e.g., the last hidden state, or an attention over all hidden states) as the representation of the goal. The other way is to use a pre-trained language component obtained from other approaches and datasets (e.g., (Mikolov et al., 2013)) to represent the given language instructions. Since a pre-trained sentence embedding defines a reasonable similarity metric among sentences, one can expect the agent to understand unseen lexicons that are closely related to the language goals seen in training. This allows the agent to gain better natural language understanding, and potentially be more robust to the language advice it gets from teachers (advice from humans can be quite noisy).

To integrate the language representation of goals into the model, we design the architecture as follows. The architecture consists of three modules. The first is a language component that takes an instruction, converts it into a continuous vector, and applies an attention vector. The second component processes the observation into an image representation using convolutional neural networks. We then use gated attention from (Chaplot et al., 2017b) to fuse the goal information and the observation. The third component takes the fused representation and computes the output value. Figure 1 shows the computational graph.
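As an illustration of this three-module design, here is a minimal PyTorch sketch in the style of the gated-attention architecture of (Chaplot et al., 2017b); the layer sizes and the exact convolutional stack are assumptions, not the paper's hyperparameters.

```python
import torch
import torch.nn as nn

class GatedAttentionPolicy(nn.Module):
    """Sketch of the three-module architecture in Figure 1. All sizes are
    illustrative assumptions."""

    def __init__(self, vocab_size, n_actions, emb_dim=256):
        super().__init__()
        # (1) language module: embed tokens, run a GRU, keep the last hidden state
        self.word_emb = nn.Embedding(vocab_size, 32)
        self.gru = nn.GRU(32, emb_dim, batch_first=True)
        # (2) vision module: convolutional encoder of the raw observation
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(64, emb_dim, kernel_size=4, stride=2), nn.ReLU(),
        )
        # (3) value head on the fused representation
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(256), nn.ReLU(),
                                  nn.Linear(256, n_actions))

    def forward(self, image, goal_tokens):
        _, h = self.gru(self.word_emb(goal_tokens))       # h: (1, B, emb_dim)
        attn = torch.sigmoid(h[-1])[:, :, None, None]     # gated-attention vector
        feats = self.conv(image)                          # (B, emb_dim, H, W)
        fused = feats * attn                              # channel-wise gating
        return self.head(fused)                           # values for each action
```

The key design choice is that the sigmoid of the goal embedding acts as a per-channel gate on the image feature map, so the instruction selects which visual features are passed on to the value network.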

4. Experiments

In the following experiments, we demonstrate the effectiveness of the proposed method and discuss qualitative differences from the baseline. We first describe the experimental setup with our two environments, KrazyGrid World and ViZDoom. In the subsequent sections, we investigate:

1. A comprehensive comparison between goal representations in hindsight advice, generalization, and semantic similarities. We compare different goal representations: a naive one-hot vector for each instruction, a Gated Recurrent Unit (GRU) (Chung et al., 2014) that embeds the discrete language tokens, and a pre-trained word embedding. We show that as the number of instructions increases, the one-hot approach does not scale as well as the GRU and pre-trained embeddings. From visualizing the representations, we can see that structure emerges from the GRU and pre-trained embeddings. We also demonstrate that the pre-trained embedding representation can generalize to goals containing out-of-training-vocabulary words.

2. Does the hindsight language advice help with learning? How much advice do we need? We show significant improvement in sample efficiency by using advice from teachers. In many of the challenging tasks, the agent cannot learn at all without teachers' advice. We also show that significant improvement can be achieved even if we provide a limited amount of teachers' advice, showing the low burden of the method in practice.

4.1. Environments and Training

We tested our method in a 2D grid world (KrazyGrid World (Stadie et al., 2018)), as well as a 3D environment (ViZDoom (Chaplot et al., 2017b)). We describe the environments and our modifications below.

KrazyGrid World (KGW) (Stadie et al., 2018): KrazyGrid World is a 2D grid environment. For our experiments we chose to use 4 tile functionalities: Goal, Lava, Normal, and Agent. We added an additional colour attribute: Red, Blue, or Green. The desired goals in this environment are to reach goals of different colours. Appendix A lists all language goals and additional environment details.

ViZDoom (Kempka et al., 2016; Chaplot et al., 2017b): The 3D learning environment is based on the first-person shooter game Doom. The state is a raw pixel image from the first-person perspective of a room containing several random Doom objects of various types, colours, and sizes.

The goal for the episode is a natural language instruction in the form "Go to the [target attribute] [object]", such as "Go to the green torch".

See Appendix B for the list of instructions and a more detailed description of ViZDoom.

Compositions of goals. We refer to the language goals introduced in the KGW and ViZDoom sections as singleton tasks. We further consider a set of more challenging tasks: tasks that are composed of singleton tasks. Given two singleton tasks A and B, we use the composition functions "and" and "or" to form two new tasks, "A and B" and "A or B". The task "A and B" is considered complete when the agent completes both A and B within an episode, and "A or B" is considered complete when either task is achieved. In our experiments, we consider all combinations of singleton tasks with both compositions "and" and "or" for KGW, and "and" for ViZDoom.
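A small sketch of the completion check for such composed goals follows; parsing the instruction by splitting on " and " / " or ", and the set of achieved singleton tasks, are illustrative assumptions.

```python
def composition_satisfied(goal, achieved_singletons):
    """Check completion of a composed goal such as "A and B" or "A or B",
    given the set of singleton tasks achieved so far in the episode."""
    if " and " in goal:
        a, b = goal.split(" and ", 1)
        return a in achieved_singletons and b in achieved_singletons
    if " or " in goal:
        a, b = goal.split(" or ", 1)
        return a in achieved_singletons or b in achieved_singletons
    return goal in achieved_singletons   # singleton goal
```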

Training details. We used the DQN algorithm as the base reinforcement learning algorithm in both environments. A detailed architecture description for all of our experiments can be found in Appendix C. We carried out multi-environment training as follows. First, we sample 16 different environments and collect data from all of them. We then update the agents with an averaged gradient and resample 16 environments. Further training details can be found in Appendix D.


4.2. Investigating different types of goal representation

In this section, we investigate the hypothesis that using language for goal representation helps learning in various aspects. We compare language-based goal representations and a non-language goal representation. In summary, we show that as the difficulty of the tasks increases, language goal representations become more effective in providing learning signals. We also show how one can use language goal representations to generalize to goals unseen in training, whereas non-language goal representations cannot. With a pre-trained sentence embedding, we can even generalize to instructions consisting of unseen lexicons, showing the robustness of the language goal representation. The three goal representations are described as follows.

Language Sentence Representation with GRUs: For each instruction, we represent each word as a one-hot vector and sequentially embed the words with a GRU. We take the last hidden state of the GRU as the representation of the instruction. We use the same GRU architecture as (Chaplot et al., 2017b), with hidden size 256.

Pre-trained Language Sentence Representation: We also consider models where the language component is pre-trained. We use InferLite (Kiros & Chan, 2018) as the pre-trained sentence representation and hold its parameters fixed during training. InferLite is a lightweight generic sentence encoder trained to perform natural language inference (Bowman et al., 2015; Williams et al., 2018), resembling InferSent (Conneau et al., 2017) but without the use of RNNs. The original sentence embedding vector has dimension 4096, and we use a learned linear layer to project it down to 256, keeping every other part of the model the same as for the other two representations.
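A sketch of this setup: a frozen pre-trained sentence encoder followed by a learned linear projection from 4096 to 256 dimensions; the `encoder` interface is an assumption standing in for the InferLite model.

```python
import torch
import torch.nn as nn

class PretrainedGoalEncoder(nn.Module):
    """Frozen pre-trained sentence encoder (assumed to return 4096-d embeddings)
    followed by a learned projection to 256 dims, so the rest of the model is
    unchanged across goal representations."""

    def __init__(self, encoder, in_dim=4096, out_dim=256):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False              # keep the pre-trained weights fixed
        self.proj = nn.Linear(in_dim, out_dim)   # only this layer is trained

    def forward(self, goal_sentences):
        with torch.no_grad():
            z = self.encoder(goal_sentences)     # (B, 4096) sentence embeddings
        return self.proj(z)                      # (B, 256) goal representation
```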

One-Hot Representation: The non-language baseline represents each instruction as a one-hot vector, which does not contain any language structure. We embed this one-hot goal representation into a vector with the same number of dimensions as the GRU representation. This embedding is learned independently for each goal encountered.

4.2.1. HOW DOES EACH GOAL REPRESENTATION PERFORM?

We use all three goal representations with hindsight advice for learning both the singleton tasks and the compositional tasks of the ViZDoom environment. The singleton tasks consist of 7 objects, and the compositional tasks consist of 5 objects. The results of the comparison are plotted in Figure 2. We see that on the easier benchmark, the singleton tasks, agents using one-hot goal representations are still able to learn as quickly as agents using the other two goal representations. However, on the much more challenging compositional benchmark, the one-hot representation can only achieve a 24% success rate, whereas ACTRCE with language representations achieves 97%.

(Figure 2 panels: (a) Representation: ViZDoom (Single); (b) Representation: ViZDoom (Composition). Each panel plots success rate versus frames for ACTRCE with GRU, One-Hot, and InferLite goal representations.)

Figure 2. Performance in average success rate during training, comparing different sentence embedding methods for the single-target and composition tasks on ViZDoom.

Since one-hot representations represent each goal independently, the agent is unable to generalize to any new unseen instructions. With language representations, we can easily generalize to unseen instructions that consist of seen lexicons. We report the generalization results in Table 1, under "ZSL" (zero-shot learning). Remarkably, with the GRU language goal representation, the agent is able to generalize to unseen instructions 83% of the time, showing the strong generalization ability of language goal representations.

4.2.2. VISUALIZATION OF LEARNED EMBEDDINGS

We carried out a visualization analysis to see the statistical relations between the learned embeddings of goals. We took the models trained on the singleton ViZDoom benchmark, where the agent was able to learn with all three goal representations. However, we find there are large differences in the learned embeddings. For each goal instruction, we extract the corresponding learned embedding for all three goal representations, which is the continuous vector before the attention layer (recall Figure 1), all of size 256. We then calculate the correlation matrix for each goal representation. We plot all three correlation matrices in Figure 3. The rows and columns are grouped by object type, then colour and size. We find that the GRU and InferLite embeddings have a very similar block-like structure, due to clustering of colours and shapes, while the one-hot goal embeddings share almost no correlation with each other. We further performed t-SNE (Maaten & Hinton, 2008) embeddings and observe meaningful clustering for the language goal representations and sporadic embeddings for the one-hot goals. See more details in Appendix J.1.

4.2.3. GENERALIZING TO UNSEEN LEXICONS WITH PRE-TRAINED WORD EMBEDDING

We now show how to use pre-trained embeddings to allow our model to generalize to unseen lexicons at test time through representation transfer, similar in spirit to DeViSE


Method               | Single (7 objects)            | Composition (5 objects)
                     | MT             ZSL            | MT              ZSL
DQN (GRU)            | 0.08 ± 0.01    0.065 ± 0.007  | 0.115 ± 0.007   0.07 ± 0.04
DQN (InferLite)      | 0.05 ± 0.07    0.04 ± 0.06    | 0.10 ± 0.02     0.13 ± 0.07
DQN (OneHot)         | 0.035 ± 0.007  -              | 0.02 ± 0.03     -
ACTRCE (GRU)         | 0.80 ± 0.06    0.76 ± 0.03    | 0.97 ± 0.04     0.83 ± 0.09
ACTRCE (InferLite)   | 0.80 ± 0.01    0.76 ± 0.02    | 0.88 ± 0.05     0.62 ± 0.01
ACTRCE (One Hot)     | 0.80 ± 0.03    -              | 0.24 ± 0.05     -

Table 1. The table shows the averaged success rates, calculated over 100 episodes, in the multitask and zero-shot scenarios for the model trained with DQN versus DQN with ACTRCE. The standard deviations are calculated across 2 seeds. In the Multitask (MT) scenario, the goal for the episode is sampled from the training set of instructions but with a random environment initialization. For Zero-Shot (ZSL), the instruction is sampled from the held-out test instruction set, also with a random environment initialization.

Figure 3. Comparison among different sentence embedding methods, (a) GRU, (b) InferLite, and (c) One-hot, showing the pairwise correlation between the sentence embedding vectors for each of the singleton instructions. The darker the colour, the higher the correlation.

(Frome et al., 2013), which was used for image classification with unseen classes. We took the agent trained on singleton tasks with the InferLite goal representation and replaced one word in each original instruction with its synonym¹. We evaluate the performance of the agent on these new instructions containing unseen lexicons. The results are shown in Table 2. We find the agent is able to achieve the tasks more than 66% of the time. The ability to understand sentences with similar meanings becomes useful when implementing the method with advice from humans. Since humans can describe the same meaning in many different ways, understanding synonyms can improve the robustness of learning in general noisy settings.

Method               | MT + Synonym  | ZSL + Synonym
ACTRCE (InferLite)   | 0.66 ± 0.05   | 0.62 ± 0.02

Table 2. Multitask evaluation with synonyms and zero-shot evaluation with synonyms using InferLite.

4.3. Do we need hindsight advice?

In the previous section, we demonstrated the effectiveness of language goal representations from various perspectives. In this section, we show how hindsight advice plays an important role in learning.

¹ See Appendix B.4 for details on the synonyms used.

We compared our method (denoted "ACTRCE") to DQN, the base algorithm without any hindsight language advice. We show that without hindsight advice, DQN is not able to learn many challenging tasks. However, we also found that a little advice (1%) is sufficient for learning to take off, showing the practicality of our method. For the language representation, we chose to use recurrent neural networks to embed the language goals, for both our method and the baseline.

4.3.1. COMPARISON OF ACTRCE TO DQN

Singleton tasks. For the experiments on KGW, we tried two different kinds of grids, one with 3 lavas and the other with 6 lavas, both with 3 goals of different colours. Figures 4 (a) and (b) show the average success rate over all goals for 16-environment training. The shaded area represents the standard deviation over 2 random seeds. The baseline DQN is shown as the blue curve, and it failed to learn at all. Our method (shown in green) quickly learned and achieved good results in both environments (around 80% success rate).

In the ViZDoom experiments, we trained our agent in 3 configurations: 5 objects in (1) easy and (2) hard mode, and (3) 7 objects in hard mode with a 50% larger room size. We ran the A3C baseline from (Chaplot et al., 2017b), using their implementation available online (Chaplot et al., 2017a) with the hyperparameter settings stated in their paper. Table 3 and Figures 5 and 6 summarize the training results for the 5-object easy and hard modes. On the easy task, we were able to solve the task with A3C, but with an order of magnitude lower sample efficiency compared to our DQN/ACTRCE implementation. On the hard task, we were unable to reproduce their results with our limited computational budget. A similar order-of-magnitude difference in sample efficiency is observed between A3C and DQN/ACTRCE on the hard task. We found that in the more difficult 7-object environment, only the agent trained with ACTRCE was able to learn, compared to DQN. Figure 4 (e) illustrates the average training success rate (over 100-episode chunks) versus the number of frames seen, when using ACTRCE versus the DQN baseline.


Table 1 summarizes the agent's multitask and zero-shot generalization performance.

Compositional tasks. In KGW, we carried out the experiments in a grid with 3 goals and 3 lavas. A comparison of the average success rates over all goals of ACTRCE and baseline DQN is shown in Fig. 4 (c). The shaded area represents the standard deviation over 2 random seeds. We observed that the baseline still could not learn the task, while ACTRCE learned efficiently and achieved good performance in this highly challenging environment.

In ViZDoom, we chose to use 5 objects in easy mode for the compositional task, where all 5 objects appear in front of the agent at the beginning of the episode. Figure 4 (g) shows the average success rate when using ACTRCE versus the baseline DQN. Likewise, our agent was able to learn very well with hindsight advice, whereas the baseline DQN failed to learn.

4.3.2. HOW MUCH ADVICE DO WE NEED?

We consider a variant of ACTRCE where the teacher provides hindsight advice only in the first {10%, 1%} of frames during training, and stops giving any advice for the remaining portion of training. We perform this experiment in the ViZDoom single-target mode with 7 objects, using the GRU as the sentence embedding. Figure 4 illustrates the average training success rate over frames, and Table 5 lists the final MT and ZSL performance of the trained agents. We observe that the agent is still able to learn comparably well even given very little (1%) advice, indicating the method's robustness in practical settings.
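The limited-advice variant only changes when the teacher is invoked; a minimal sketch of the schedule follows (function and argument names are illustrative, not from the paper):

```python
def should_give_advice(frames_so_far, total_frames, advice_fraction=0.01):
    """Teacher relabels episodes only during the first `advice_fraction`
    of training frames (1% in the smallest setting studied here)."""
    return frames_so_far < advice_fraction * total_frames
```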

5. Related Work

Several approaches have used natural language in the context of reinforcement learning. In pioneering work, (Maclin & Shavlik, 1994) translated language advice into a short program, implemented as a neural network. This advice network encouraged the policy to take the suggested action. Similarly, (Kuhlmann et al., 2004) exploited natural language advice for an RL agent learning to play a (simulated) soccer game. In (Ling & Fidler, 2017), human feedback in the form of natural language was exploited to shape the reward of an image-captioning RL agent. (Kaplan et al., 2017) introduced an agent that learns to use English instructions to self-monitor its progress in beating Atari games. This was accomplished by implementing a multi-modal embedding of pairs of game frames and instructions to determine whether an instruction is satisfied, and then providing an additional reward to the agent. (Andreas et al., 2017) proposed to use language as the latent parameter space for few-shot learning problems, including policy search.

The reverse has also been studied: using reinforcement learning to learn grounded language, referred to as task-oriented

language grounding. (Misra et al., 2017) mapped language instructions and visual observations to actions for manipulating blocks on a 2D plane. (Yu et al., 2018) grounded language in a 2D maze environment, via sentence-directed navigation and a question answering task. They introduced the notion of extrapolation zero-shot sentences, where new words are transferred from other use cases and models. In their work, the unseen word in the navigation sentence was seen (transferred from) during the question answering task. In contrast, we transfer the unseen synonym words from a pre-trained sentence embedding (InferLite) model. (Bahdanau et al., 2019) proposed a framework, AGILE, to learn a reward model given instruction-state pairs from expert trajectories, and train a policy to perform instructions by maximizing the modeled reward. While our work assumes that we are given a teacher to provide the hindsight goal and reward, we can view AGILE as learning to model the teacher's reward function, which is a discriminator that scores a state-goal pair, r_θ(s, g) ∈ {0, 1}. To fully implement the teacher in ACTRCE, we would also require a generator to produce a goal instruction given the state. (Co-Reyes et al., 2019) considered the case where the language instruction is not fixed for the episode, but can be used interactively to correct policies, in a 2D grid environment. (Hermann et al., 2017) presented an agent that learns to execute written instructions in a 3D environment through reinforcement learning and unsupervised learning. Our work builds on (Chaplot et al., 2017b), who proposed a gated-attention architecture for combining language and image features to learn a policy that executes written instructions in a 3D environment.

6. Conclusion

In this work, we propose the ACTRCE method, which uses natural language as a goal representation for hindsight advice. The main goal of the paper is to show that using language as the goal representation brings many benefits when combined with hindsight advice. We analyzed the differences among goal representations, and showed that ACTRCE can efficiently solve difficult reinforcement learning problems in challenging 3D navigation tasks, whereas HER with non-language goal representations failed to learn. We also showed that with language goal representations, the agent can generalize to unseen instructions. With a pre-trained language component, the agent can even generalize to instructions with unseen lexicons, demonstrating its potential to deal with noisy natural language advice from humans. Although the ACTRCE algorithm crucially relies on hindsight advice, we showed that a small amount of advice is sufficient for the agent to learn, demonstrating its practicality.


(Figure 4 panels, in order: KrazyGrid World with 3 goals 3 lavas, 3 goals 6 lavas, and compositional goals (labeled (a)-(c)); ViZDoom Single (7 objects), ViZDoom Composition (5 objects), and Limit Advice: ViZDoom (Single) GRU, comparing DQN against ACTRCE GRU with advice fractions 1.0, 0.1, and 0.01 (labeled (e)-(g)). All panels plot success rate versus frames for DQN and ACTRCE.)

Figure 4. Performance comparisons on the KGW and ViZDoom environments. The success rates are calculated over all desired goals and 16 different environments. The shaded area represents the standard deviation over 2 random seeds.

                 | MT                                              | ZSL
# of frames      | 8M            16M           150M          N/A   | 16M           150M          N/A
A3C [1]          | -             -             -             0.83  | -             -             0.73
A3C (Reprod.)    | 0.10 ± 0.01   0.09 ± 0.04   0.73 ± 0.01   -     | -             0.71 ± 0.02   -
DQN              | 0.4 ± 0.2     0.73 ± 0.08   -             -     | 0.75 ± 0.05   -             -
ACTRCE           | 0.69 ± 0.04   0.83 ± 0.02   -             -     | 0.77 ± 0.02   -             -

Table 3. The table shows the averaged success rates in the multitask and zero-shot scenarios for the model trained with DQN versus DQN with ACTRCE, compared to the published and reproduced A3C results, on ViZDoom single target with 5 objects in hard mode. The standard deviations are calculated across 2 seeds. MT was evaluated throughout the training process, while ZSL was evaluated at the end of training.

(Figure 5 panels (a) and (b): success rate versus frames, up to 4M and 90M frames respectively, for DQN, ACTRCE, and A3C on ViZDoom Single with 5 objects in easy mode.)

Figure 5. ViZDoom experiment with 5 objects in easy mode for the single-target case.

(Figure 6 panels (a) and (b): success rate versus frames, up to 16M and 150M frames respectively, for DQN, ACTRCE, and A3C on ViZDoom Single with 5 objects in hard mode.)

Figure 6. ViZDoom experiment with 5 objects in hard mode for the single-target case.


ACKNOWLEDGMENTS

HC was supported by an NSERC CGS-M award. YW is supported by the Google PhD Fellowship in Machine Learning and an NSERC CGS-D award. We thank Silviu Pitis, Tingwu Wang, Jesse Bettencourt, and Sheldon Huang for helpful feedback on the draft.

References

Andreas, J., Klein, D., and Levine, S. Learning with latent language. ArXiv e-prints, November 2017.

Andrychowicz, M., Crow, D., Ray, A. K., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., and Zaremba, W. Hindsight experience replay. In NIPS, 2017.

Bahdanau, D., Hill, F., Leike, J., Hughes, E., Kohli, P., and Grefenstette, E. Learning to understand goal specifications by modelling reward. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1xsSjC9Ym.

Bowman, S. R., Angeli, G., Potts, C., and Manning, C. D. A large annotated corpus for learning natural language inference. In EMNLP, 2015.

Chaplot, D. S., Sathyendra, K. M., Pasumarthi, R. K., Rajagopal, D., and Salakhutdinov, R. Gated-attention architectures for task-oriented language grounding (code). https://github.com/devendrachaplot/DeepRL-Grounding, 2017a.

Chaplot, D. S., Sathyendra, K. M., Pasumarthi, R. K., Rajagopal, D., and Salakhutdinov, R. Gated-attention architectures for task-oriented language grounding. arXiv preprint arXiv:1706.07230, 2017b.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. ArXiv e-prints, December 2014.

Clark, J. and Amodei, D. Faulty reward functions in the wild, 2016. URL https://blog.openai.com/faulty-reward-functions/.

Co-Reyes, J. D., Gupta, A., Sanjeev, S., Altieri, N., DeNero, J., Abbeel, P., and Levine, S. Meta-learning language-guided policy learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HkgSEnA5KQ.

Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and Bordes, A. Supervised learning of universal sentence representations from natural language inference data. In EMNLP, 2017.

Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.

Gullapalli, V. and Barto, A. G. Shaping as a method for accelerating reinforcement learning. In Proceedings of the 1992 IEEE International Symposium on Intelligent Control, pp. 554–559, Aug 1992. doi: 10.1109/ISIC.1992.225046.

Harnad, S. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1):335–346, 1990. ISSN 0167-2789. doi: 10.1016/0167-2789(90)90087-6. URL http://www.sciencedirect.com/science/article/pii/0167278990900876.

Hasselt, H. v., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pp. 2094–2100. AAAI Press, 2016. URL http://dl.acm.org/citation.cfm?id=3016100.3016191.

Hermann, K. M., Hill, F., Green, S., Wang, F., Faulkner, R., Soyer, H., Szepesvari, D., Czarnecki, W. M., Jaderberg, M., Teplyashin, D., Wainwright, M., Apps, C., Hassabis, D., and Blunsom, P. Grounded language learning in a simulated 3D world. ArXiv e-prints, June 2017.

Kaplan, R., Sauer, C., and Sosa, A. Beating Atari with natural language guided reinforcement learning. ArXiv e-prints, April 2017.

Kempka, M., Wydmuch, M., Runc, G., Toczek, J., and Jaskowski, W. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In IEEE Conference on Computational Intelligence and Games, pp. 341–348, Santorini, Greece, Sep 2016. IEEE. URL http://arxiv.org/abs/1605.02097.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kiros, J. R. and Chan, W. InferLite: Simple universal sentence representations from natural language inference data. In EMNLP, 2018.

Kuhlmann, G., Stone, P., Mooney, R., and Shavlik, J. Guiding a reinforcement learner with natural language advice: Initial results in RoboCup soccer. In AAAI Workshop on Supervisory Control of Learning and Adaptive Systems, 2004.

Lample, G. Arnold. https://github.com/glample/Arnold, 2017.


Lample, G. and Chaplot, D. S. Playing FPS games with deep reinforcement learning. In Proceedings of AAAI, 2017.

Ling, H. and Fidler, S. Teaching machines to describe images via natural language feedback. In NIPS, 2017.

Luong, T., Pham, H., and Manning, C. D. Effective approaches to attention-based neural machine translation. In EMNLP, 2015.

Maaten, L. v. d. and Hinton, G. Visualizing data using t-SNE. In JMLR, 2008.

Maclin, R. and Shavlik, J. W. Incorporating advice into agents that learn from reinforcements. In National Conference on Artificial Intelligence, pp. 694–699, 1994.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111–3119. Curran Associates, Inc., 2013.

Misra, D., Langford, J., and Artzi, Y. Mapping instructions and visual observations to actions with reinforcement learning. In EMNLP, 2017.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. A. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.

Mnih, V., Puigdomènech Badia, A., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. ArXiv e-prints, February 2016.

Ng, A. Y., Harada, D., and Russell, S. J. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML '99, pp. 278–287, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. ISBN 1-55860-612-2. URL http://dl.acm.org/citation.cfm?id=645528.657613.

Popov, I., Heess, N., Lillicrap, T. P., Hafner, R., Barth-Maron, G., Vecerik, M., Lampe, T., Tassa, Y., Erez, T., and Riedmiller, M. A. Data-efficient deep reinforcement learning for dexterous manipulation. CoRR, abs/1704.03073, 2017.

Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pp. 1312–1320. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045258.

Siskind, J. M. Grounding language in perception. Artificial Intelligence Review, 8(5):371–391, Sep 1994. ISSN 1573-7462. doi: 10.1007/BF00849726. URL https://doi.org/10.1007/BF00849726.

Stadie, B., Yang, G., Houthooft, R., Chen, P., Duan, Y., Wu, Y., Abbeel, P., and Sutskever, I. The importance of sampling in meta-reinforcement learning. In Advances in Neural Information Processing Systems, pp. 9300–9310, 2018.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 2nd edition, 2018. ISBN 9780262193986.

Williams, A., Nangia, N., and Bowman, S. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL, 2018.

Winograd, T. Understanding natural language. Cognitive Psychology, 3(1):1–191, 1972. ISSN 0010-0285. doi: 10.1016/0010-0285(72)90002-3. URL http://www.sciencedirect.com/science/article/pii/0010028572900023.

Yu, H., Zhang, H., and Xu, W. Interactive grounded language acquisition and generalization in a 2D world. ICLR, 2018.


A. KrazyGrid World Language Goals

A.1. More Details on the Environment

KrazyGrid World is a 2D grid environment. For our experiments we chose to use 4 tile functionalities: Goal, Lava, Normal, and Agent. We added an additional colour attribute ranging over 3 colours: Red, Blue, and Green. There are 4 discrete actions: "up", "down", "right", and "left". The desired goals in this environment are to reach goals of different colours. We use a global view of the grid state as the observation to the agent. Each tile in the grid is represented by a concatenation of its functionality and colour attributes, both as one-hot vectors. The grid state also includes a one-hot vector that represents the agent's position. In our experiments, we use an environment of grid size 9 × 9, with 3 goals in the environment, each with a distinct colour. We tried various numbers of lavas in the environment, and made sure there is at least 1 lava of each colour. In the simple task setting, the episode terminates when the agent reaches a lava or a goal, or when the episode exceeds a maximum time step of T = 25. We automatically generate descriptions, given any state of the environment, to serve as the teacher.

Compositional. We enlarged the goal space by considering the task of compositions of goals. As the goal space changed, we also had to make several modifications to the environment. We let the episode terminate when the agent reaches a lava or reaches 2 different goals, or exceeds the maximum time step of 50. Since we still included simple goals (e.g., "Reach blue goal") in the goal space, we added an extra action called "flag". This action can be used by the agent to terminate the episode when it thinks the given goal is accomplished. The episode will terminate if the agent indeed has accomplished the given task, and the environment will stay unchanged otherwise.

A.2. List of Language Instructions

Single goal setting. The entire goal space G consists of 8 goals: {Reach red goal, Reach blue goal, Reach green goal, Reach red lava, Reach blue lava, Reach green lava, Avoid any lava, Avoid any goal}. The desired goal space G_d consists of 3 goals: {Reach red goal, Reach blue goal, Reach green goal}.

Compositional goal setting. The desired goal space G_d consists of 21 goals: {Reach red goal, Reach blue goal, Reach green goal, Reach red goal and Reach blue goal, Reach blue goal and Reach red goal, Reach red goal or Reach blue goal, Reach blue goal or Reach red goal, Reach red goal and Reach green goal, Reach green goal and Reach red goal, Reach red goal or Reach green goal, Reach green goal or Reach red goal, Reach blue goal and Reach green goal, Reach green goal and Reach blue goal, Reach blue goal or Reach green goal, Reach green goal or Reach blue goal, Reach red goal and Avoid any lava, Avoid any lava and Reach red goal, Reach blue goal and Avoid any lava, Avoid any lava and Reach blue goal, Reach green goal and Avoid any lava, Avoid any lava and Reach green goal}.

The entire goal space G has another 17 goals: {Reach red lava or Reach blue lava, Reach blue lava or Reach red lava, Reach red lava or Reach green lava, Reach green lava or Reach red lava, Reach blue lava or Reach green lava, Reach green lava or Reach blue lava, Reach red lava, Reach red lava and Avoid any goal, Avoid any goal and Reach red lava, Reach blue lava, Reach blue lava and Avoid any goal, Avoid any goal and Reach blue lava, Reach green lava, Reach green lava and Avoid any goal, Avoid any goal and Reach green lava, Avoid any lava and Avoid any goal, Avoid any goal and Avoid any lava}.

B. ViZDoom Environment Detail

B.1. Environment Description

ViZDoom (Kempka et al., 2016; Chaplot et al., 2017b): The 3D learning environment is based on the first-person shooter game Doom. At each time step, the state is a raw pixel image from the first-person perspective of the 3D environment. The environment consists of a room containing several random Doom objects of various types, colours, and sizes. The agent can navigate the environment via 3 possible actions: turn left, turn right, and move forward. The goal for the episode is a natural language instruction in the form "Go to the [target attribute] [object]", such as "Go to the green torch". Only one target object is present in each episode. The episode terminates either when the agent reaches an object (regardless of whether it is correct or not), or when the maximum time step is reached (T = 30). A reward of 1 is given if the correct object is reached; otherwise a reward of 0 is given. The environment has several difficulty modes, depending on how the objects and the agent are distributed at the beginning of the episode. In the easy mode, the agent is spawned at a fixed location facing forward. The objects are placed evenly spaced apart along a horizontal row in front of the agent. In hard mode, both the objects


Figure 7. Example of (a) an unambiguous composition of instructions ("Go to the red torch" and "Go to the pillar") and (b) an ambiguous composition of instructions ("Go to the torch" and "Go to the blue object").

Figure 8. Example state input screens seen by the agent in a ViZDoom composition task episode. The composition instruction for this episode is "Go to the green short torch and Go to the green short pillar". (a) The starting state. (b) The state when the agent has reached the green short pillar. (c) The final state once the agent has reached the green short torch.

In the hard mode, both the objects and the agent are randomly spawned, and the objects are not necessarily in view at the start. We focus on the easy and hard modes in our experiments.
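
For concreteness, the reward and termination logic just described can be sketched as follows (names and interface are illustrative assumptions, not the actual ViZDoom wrapper):

MAX_T = 30  # maximum episode length in the single-target ViZDoom task

def episode_outcome(reached_object, target_object, t):
    """Return (reward, done) for the single-target ViZDoom task described above."""
    if reached_object is not None:
        # Touching any object ends the episode; only the correct one is rewarded.
        return (1.0 if reached_object == target_object else 0.0), True
    if t >= MAX_T:
        return 0.0, True   # time out with no reward
    return 0.0, False      # episode continues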

B.2. ViZDoom Composition Instructions

The compositional instructions consist of two single-object instructions from the original training set joined by the word "and", such as "Go to the red torch and Go to the pillar". We did not include any superlative instructions, such as "Go to the largest object".

We ensured that the desirable instructions were unambiguous: given two instructions, the set of objects satisfying the first instruction is mutually exclusive from the set of objects satisfying the second instruction. For example, "Go to the torch and Go to the pillar" has mutually exclusive sets of valid objects. An ambiguous compositional instruction may have objects that satisfy both instructions. For example, "Go to the blue object and Go to the torch" is ambiguous because a blue torch satisfies both instructions. This is illustrated in Figure 7.
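
A minimal sketch of the unambiguity check (the word-level matcher objects_satisfying below is a hypothetical stand-in for however objects are actually matched to instructions):

def objects_satisfying(instruction, objects):
    """Hypothetical matcher: an object (a set of descriptive words, e.g.
    {"blue", "torch"}) satisfies "Go to the blue torch" if every content word
    of the instruction matches one of the object's words ("object" acts as a
    wildcard for the type)."""
    words = [w for w in instruction.split() if w not in ("Go", "to", "the")]
    return {i for i, obj in enumerate(objects)
            if all(w in obj or w == "object" for w in words)}

def is_unambiguous(instr_a, instr_b, objects):
    # Unambiguous composition: the two sets of valid objects are mutually exclusive.
    return objects_satisfying(instr_a, objects).isdisjoint(objects_satisfying(instr_b, objects))

# Example from the text: "Go to the blue object and Go to the torch" is ambiguous
# whenever a blue torch is present.
objs = [{"blue", "torch"}, {"red", "pillar"}]
print(is_unambiguous("Go to the torch", "Go to the pillar", objs))        # True
print(is_unambiguous("Go to the blue object", "Go to the torch", objs))   # False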

B.3. ViZDoom Composition: Modification for State

For the ViZDoom composition task, we modify the raw pixel image of the environment to include a basic head-up display (HUD) consisting of a black rectangle in the bottom left of the screen. When the agent reaches an object, a small thumbnail image of the reached object appears inside the HUD. The HUD can show up to 2 reached objects, and the episode terminates once the agent has reached a second object. Figure 8 illustrates what the agent sees as the input image throughout a composition task episode.

B.4. ViZDoom Synonym Instructions

We generate the synonym instructions by replacing an original word in the instruction with one of the synonyms listed in Table 4.


Original Word    Synonym
Go to            Reach, Approach, Get to
largest          biggest
smallest         littlest
small            tiny
large            big, huge
short            low, little
tall             big, towering
red              crimson
blue             indigo, teal
green            olive, lime
yellow           amber, gold, sunny
torch            lamp, beacon, lantern, flaming object
pillar           column, pedestal
skull            cranium, head
key              opener, unlocker
card             badge, pass
armor            shield
object           thing, item

Table 4. Original vocabulary and corresponding synonyms
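
A sketch of the synonym replacement procedure (the dictionary below shows only a subset of Table 4, and the function name is ours):

import random

# Subset of Table 4; the full mapping follows the table above.
SYNONYMS = {
    "red": ["crimson"],
    "blue": ["indigo", "teal"],
    "torch": ["lamp", "beacon", "lantern", "flaming object"],
    "pillar": ["column", "pedestal"],
    "armor": ["shield"],
}

def make_synonym_instruction(instruction, rng=random):
    """Replace one original word that has synonyms with a randomly chosen synonym."""
    words = instruction.split()
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    if not candidates:
        return instruction
    i = rng.choice(candidates)
    words[i] = rng.choice(SYNONYMS[words[i]])
    return " ".join(words)

print(make_synonym_instruction("Go to the red torch"))  # e.g. "Go to the crimson torch"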

B.5. ViZDoom Teacher Advice Generation

Singleton task. When the agent reaches an object (correct or incorrect), we generate positive teacher advice by sampling from the training set of instructions that describe the reached object. Additionally, we also generate negative feedback by randomly picking one of the objects in the current environment that was not reached, then sampling an instruction that describes that unreached object, while ensuring that it does not also describe the reached object.

When the agent did not reach any object, we simply do not give any positive teacher advice (i.e., the achieved goal). We had originally tried giving the advice “Reach no object” when the agent did not reach any object by the end of the episode, but it did not improve performance. However, we still give the unachieved goal, drawn from any of the objects in the current environment.
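
A rough sketch of the singleton advice generation (describing_instructions, a helper mapping an object to the set of training instructions that describe it, is a hypothetical interface rather than the actual implementation):

import random

def singleton_advice(reached_obj, scene_objects, describing_instructions, rng=random):
    """Return (positive_advice, negative_advice) as described in B.5.

    reached_obj             -- the object the agent ended on, or None
    scene_objects           -- all objects present in the episode
    describing_instructions -- hypothetical helper mapping an object to the set of
                               training instructions that describe it
    """
    positive = None
    if reached_obj is not None:
        positive = rng.choice(sorted(describing_instructions(reached_obj)))

    # Negative advice: an instruction for some unreached object that does NOT
    # also describe the reached object.
    unreached = [o for o in scene_objects if o != reached_obj]
    negative = None
    if unreached:
        other = rng.choice(unreached)
        excluded = describing_instructions(reached_obj) if reached_obj is not None else set()
        options = sorted(describing_instructions(other) - excluded)
        if options:
            negative = rng.choice(options)
    return positive, negative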

Compositional task. We follow a similar process for the compositional task, depending on the number of objects reached. If the agent reached 0 or 1 objects, we generate singleton advice in the same fashion as in the singleton task. We later found that giving no advice when reaching 0 or 1 objects performed equally well and led to slightly faster learning. If the agent reached 2 objects during the episode, for positive advice we sample singleton instructions for each of the two objects, ensuring that they are not the same instruction, then merge them with the "and" keyword. For negative advice, we generate singleton instructions for the unreached objects and combine two of them, ensuring that the instructions do not refer to the reached objects at all.
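
For the compositional case, the positive-advice construction can be sketched as follows (reusing the hypothetical describing_instructions helper from the singleton sketch above):

import random

def compositional_positive_advice(reached_objects, describing_instructions, rng=random):
    """Build 'A and B' positive advice when exactly two objects were reached."""
    if len(reached_objects) != 2:
        return None  # with 0 or 1 reached objects, fall back to singleton-style advice
    a, b = reached_objects
    instr_a = rng.choice(sorted(describing_instructions(a)))
    # Ensure the two singleton instructions are not identical.
    remaining = sorted(describing_instructions(b) - {instr_a})
    instr_b = rng.choice(remaining) if remaining else rng.choice(sorted(describing_instructions(b)))
    return f"{instr_a} and {instr_b}"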

C. Architectures

KrazyGrid World

• Singleton We use a series of convolution layers, each followed by a ReLU activation, to preprocess the grid observation: 32 kernels of size 3x3 with stride 1, followed by 64 kernels of size 3x3 with stride 2, then 64 kernels of size 3x3 with stride 1, and 128 kernels of size 3x3 with stride 2. We input the language sentence as word-level one-hot vectors to an LSTM with hidden size 128. The LSTM's last hidden vector is passed into a fully connected layer with 128 output units and sigmoid activation. This acts as the attention vector that is multiplied channel-wise with the 64 feature maps from the convolutional layer. The gated feature maps are then flattened and passed through a fully connected layer with 256 units and ReLU activation. We then pass the result into a linear layer to predict the 4 action values.

• Compositional We use a series of convolution layers, each followed by a ReLU activation, to preprocess the grid observation: 32 kernels of size 1x1 with stride 1, followed by 32 kernels of size 3x1 with stride 2. We input the language sentence as word-level one-hot vectors to a bidirectional LSTM with hidden size 40. After obtaining the preprocessed observation, we augment it with the history vector provided by the environment and use the result as a query to attend over all the hidden states of the bidirectional LSTM via Luong attention (Luong et al., 2015) (see the sketch after this list). We then keep processing the observation with two more convolution layers, each followed by a ReLU activation: 64 kernels of size 3x3 with stride 2, followed by 64 kernels of size 3x3 with stride 2. At each of these two layers, we use the context vector formed by the language attention to attend back to the observation by gating the 64 feature maps in a channel-wise fashion. The final gated feature maps are then flattened and passed through a fully connected layer with 256 units and ReLU activation. We then pass the result into a linear layer to predict the 5 action values.
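
A minimal PyTorch-style sketch of the Luong-attention language gating used in the compositional model (module names and dimensions are illustrative, not the exact implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class LuongGate(nn.Module):
    """Attend over Bi-LSTM hidden states with a query vector (Luong-style attention),
    then turn the context vector into channel-wise gates for conv feature maps."""
    def __init__(self, query_dim, hidden_dim, num_channels):
        super().__init__()
        self.proj = nn.Linear(query_dim, hidden_dim)   # "general" Luong scoring
        self.gate = nn.Linear(hidden_dim, num_channels)

    def forward(self, query, lstm_states, feature_maps):
        # query:        (B, query_dim)      preprocessed observation + history vector
        # lstm_states:  (B, T, hidden_dim)  Bi-LSTM hidden states over the instruction
        # feature_maps: (B, C, H, W)
        scores = torch.bmm(lstm_states, self.proj(query).unsqueeze(2)).squeeze(2)  # (B, T)
        attn = F.softmax(scores, dim=1)
        context = torch.bmm(attn.unsqueeze(1), lstm_states).squeeze(1)             # (B, hidden_dim)
        gates = torch.sigmoid(self.gate(context))                                  # (B, C)
        return feature_maps * gates.unsqueeze(2).unsqueeze(3)                      # channel-wise gating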

ViZDoom Our architecture is almost identical to (Chaplot et al., 2017b), except that we use a linear output layer for the action values and do not use a dueling architecture. The state input to the network is an RGB image of size 3 × 300 × 168. It is processed by a series of convolution layers, each followed by a ReLU activation: 128 kernels of size 8x8 with stride 4, followed by 64 kernels of size 4x4 with stride 2, and then 64 kernels of size 4x4 with stride 2. To process the language input, we first use an embedding matrix to embed the vocabulary into vectors of size 32. We use a Gated Recurrent Unit (GRU) (Chung et al., 2014) of size 256 as the recurrent cell to process the word embeddings. The GRU's last hidden vector is passed into a fully connected layer with 64 output units and sigmoid activation. This acts as the attention vector that is multiplied channel-wise with the 64 feature maps from the convolutional layer. The gated feature maps are then flattened and passed through a fully connected layer with 256 units and ReLU activation. We then pass the result into an LSTM with 256 hidden units, whose hidden state goes through a linear layer to predict the 3 action values.
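
A compact PyTorch-style sketch of the gated-attention architecture just described (layer sizes follow the text above, but the module itself is an illustrative reconstruction rather than the authors' code; the flattened feature size is inferred lazily):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttentionDQN(nn.Module):
    """Conv image encoder + GRU instruction encoder, fused by channel-wise gating,
    followed by an LSTM cell and a linear head over the 3 action values (sketch)."""
    def __init__(self, vocab_size, num_actions=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(128, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.embed = nn.Embedding(vocab_size, 32)
        self.gru = nn.GRU(32, 256, batch_first=True)
        self.attn = nn.Linear(256, 64)       # sigmoid gates over the 64 feature maps
        self.fc = nn.LazyLinear(256)         # flattened gated maps -> 256 (input size inferred)
        self.lstm = nn.LSTMCell(256, 256)
        self.q = nn.Linear(256, num_actions)

    def forward(self, image, instruction_tokens, lstm_state=None):
        feats = self.conv(image)                                # (B, 64, H', W')
        _, h = self.gru(self.embed(instruction_tokens))         # h: (1, B, 256)
        gates = torch.sigmoid(self.attn(h.squeeze(0)))          # (B, 64)
        gated = feats * gates.unsqueeze(2).unsqueeze(3)         # channel-wise gating
        x = F.relu(self.fc(gated.flatten(start_dim=1)))         # (B, 256)
        h_t, c_t = self.lstm(x, lstm_state)
        return self.q(h_t), (h_t, c_t)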

D. Training Details

KrazyGrid World For the KrazyGrid World experiments, we tune the following hyperparameters within the corresponding ranges: learning rate {0.0003, 0.001, 0.003}, replay buffer size {5000, 10000}, and training the network every {1, 2, 4} frames. We generate episodes with an ε-greedy policy starting at ε = 1.0, decaying linearly to 0.01 by 10000 frames, and remaining at 0.01 thereafter. We use Double DQN (Hasselt et al., 2016) to reduce Q-value overestimation and the Huber loss (δ = 1) for stable gradients.
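
The exploration schedule and the Double DQN update with Huber loss can be summarized by the following sketch (PyTorch-style; the discount factor value and the omission of goal conditioning are simplifications of ours):

import torch
import torch.nn.functional as F

def epsilon(frame, eps_start=1.0, eps_end=0.01, decay_frames=10_000):
    """Linear decay from eps_start to eps_end over the first decay_frames frames."""
    return max(eps_end, eps_start - (eps_start - eps_end) * frame / decay_frames)

def double_dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Double DQN target with a Huber loss (delta = 1), as used in both environments.
    gamma = 0.99 is a placeholder; the paper does not specify the discount factor here."""
    s, a, r, s_next, done = batch                       # tensors; a: (B,), r/done: (B,)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1)            # action chosen by the online network
        q_next = target_net(s_next).gather(1, a_star.unsqueeze(1)).squeeze(1)
        target = r + gamma * (1.0 - done) * q_next
    return F.smooth_l1_loss(q_sa, target)               # Huber loss with delta = 1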

ViZDoom For the ViZDoom environment, we use the same set of 56 training instructions to train the agent and a set of 15 held-out test instructions for zero-shot evaluation, as in (Chaplot et al., 2017b). We also used their code (Chaplot et al., 2017a) for reproducing training with Asynchronous Advantage Actor Critic (A3C) (Mnih et al., 2016). We implemented our version of DQN on top of the implementation of Arnold (Lample & Chaplot, 2017; Lample, 2017). The cyclic replay buffer contains the 10000 and 20000 most recent transitions for the easy and hard mode, respectively, chosen from the range {1000, 10000, 100000} and fine-tuned. We generate episodes with an ε-greedy policy starting at ε = 1.0, decaying linearly to 0.01 by 10000 frames, and remaining at 0.01 thereafter. We found that this performed better than using a large ε = 0.1. Training begins only after the first 1000 frames have been collected, selected from the range {1000, 10000, 100000}. We use Double DQN (Hasselt et al., 2016) to reduce Q-value overestimation and the Huber loss (δ = 1) for stable gradients. We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.0001. The network is updated every 4 frames in the easy mode and every 16 frames in the hard mode. While updating less often reduces sample efficiency, in practice we found that this speeds up wall-clock training time and converges to a better performance. When sampling from the replay buffer, we sample the start of a random episode and select 32 consecutive frames. This approach leads to more accurate estimates of the hidden state of the LSTM. In addition, the sequential mini-batch allows us to perform n-step Q-learning as outlined in (Mnih et al., 2016). The correlation between samples in the mini-batch is alleviated by running 16 parallel threads that send gradient updates to the shared network parameters. We synchronize the training thread's model with the shared network each time before computing the training loss and gradient. The target network is synchronized with the current model once every 500 time steps, and we use one additional thread to evaluate the multi-task success rate throughout the training process.
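
In sketch form, the sequential minibatch sampling just described looks like the following (the episode-list buffer layout and names are our assumptions):

import random

def sample_sequential_batch(episodes, seq_len=32, rng=random):
    """Pick a random stored episode and return its first seq_len consecutive transitions,
    so the LSTM hidden state can be rebuilt from the start of the episode and n-step
    returns can be computed along the sequence. Each of the 16 parallel training threads
    draws its own sequence, which decorrelates the gradient updates."""
    ep = rng.choice(episodes)   # each stored episode is a list of transitions
    return ep[:seq_len]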

E. Different types of teachers

In this section we describe the details of the teacher types. Firstly, note that not all of the language descriptions describe favorable behaviors. For example, in the KGW, a desirable goal is “reach a goal", but there are also undesirable states that correspond to the goal “reach a lava". Hence we consider a subset Gd ⊆ G to denote all desired goals that the agent is expected to accomplish. At each episode, we first sample a goal g ∈ Gd from the set of desired language goals.


We then ask the agent to explore the environment conditioned on the goal. When the episode ends, we obtain advice from a teacher T by observing the terminal state, g∗ = T(sT). Depending on the kind of teacher, we obtain a different kind of advice or no advice. We consider three kinds of teacher, as follows:

• Optimistic A teacher who gives advice only when the agent achieves a desirable goal, i.e., when g∗ ∈ Gd. When the agent performs poorly, there is no advice from this teacher.

• Knowledgeable A teacher who describes what the agent has achieved in all scenarios, including the undesirablebehaviors, as advice to the agent.

• Discouraging A teacher who describes a desired goal g∗ ∈ Gd that the agent has not achieved as advice to the agent. In this case, the trajectory with the goal replaced by the teacher's advice receives a reward of 0.
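
The three teachers can be summarized by the following sketch (goals are represented as language strings, the achieved goal is the description of the terminal state g∗ = T(sT), and the function names are ours):

import random

def optimistic_teacher(achieved_goal, desired_goals):
    """Advice only when the achieved goal is desirable; otherwise no advice."""
    return achieved_goal if achieved_goal in desired_goals else None

def knowledgeable_teacher(achieved_goal, desired_goals):
    """Always describes what was actually achieved, desirable or not."""
    return achieved_goal

def discouraging_teacher(achieved_goal, desired_goals, rng=random):
    """Names a desired goal that was NOT achieved; the relabeled trajectory gets reward 0."""
    unachieved = [g for g in desired_goals if g != achieved_goal]
    return rng.choice(unachieved) if unachieved else None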

E.1. How does each teacher perform?

In this section, we investigated how the different teachers help by giving advice. We compared our method to the DQN algorithm. We denote the method using only optimistic and discouraging teachers as “ACTRCE−". We denote the method using knowledgeable teachers as well as discouraging teachers as ACTRCE (this is the version we use in the main paper). We evaluated the three methods in KrazyGrid World, and the results are as follows.

We tried two different kinds of grids, one with 3 lavas and the other with 6 lavas, both with 3 goals of different colours. Figure 9 (a) and (b) show the average success rate over all goals when training on 16 environments. The shaded area represents the standard deviation over 2 random seeds. The baseline DQN, shown as the blue curve, failed to learn anything at all. “ACTRCE−" (shown in orange) quickly learned and achieved good results on both environments. However, we observed that when the number of lavas increased, the task became harder, and ACTRCE− performed worse after 32 million frames of training (the performance dropped from 83% to 63%). On the other hand, we observed that having knowledgeable teachers always helped speed up learning. In particular, when the number of lavas increased, the task became harder for ACTRCE−. However, with knowledgeable teachers, language advice was provided even when the agent reached lava, and hence the amount of advice per step was kept the same in the more difficult setting. Therefore, we observed ACTRCE learning at a similar rate in the more difficult setting, leaving a big gap over ACTRCE− after 32 million frames of training (80% vs. 63%).

Figure 9. Performance comparisons on KrazyGrid World environments: (a) 3 goals and 3 lavas, (b) 3 goals and 6 lavas, (c) compositional goals. The success rates are calculated over all desired goals and 16 different environments. The shaded area represents the standard deviation over 2 random seeds.

F. Can learning easy goals help learning difficult goals?

As we mentioned, there are desirable tasks as well as undesirable tasks. One might worry that if the desirable tasks are very difficult, the agent would only learn to accomplish the easier, undesirable tasks. Hence, we ask whether learning the easier tasks from hindsight can provide any learning signal for the more difficult tasks. We hypothesize that this is possible in the KGW setting, as learning to “reach lava" can aid in learning controllers, which are also used to “reach goals".


We designed the following transfer learning experiment to test our hypothesis. First, we artificially constructed a pessimistic teacher who only gives advice when the agent achieves an undesirable goal, i.e., when g∗ ∈ G\Gd. We pretrained the agents with only the pessimistic teacher for 5 million frames. We then carried out training using ACTRCE− (with optimistic and discouraging teachers) with the pretrained agent and compared to unpretrained agents. Results are shown in Figure 10. We found that by pretraining the agent with the pessimistic teacher, even though the language goals in the pretraining phase have no overlap with the actual training goals, the agent learned much faster than the unpretrained ones. In particular, in the environment with 3 lavas, the pretrained agent learned the fastest. In the environment with 6 lavas, the pretrained agents learned as fast as ACTRCE, leaving a big gap over ACTRCE−, even though they were given the same amount of advice during training as ACTRCE−. This provides evidence that learning easier goals can sometimes provide learning signals for harder goals, especially when both tasks require similar modules.

Figure 10. Transfer experiments that demonstrate pre-training with easier goals can aid learning more difficult goals. (Left) 3 goals and 3 lavas; (right) 3 goals and 6 lavas.

G. ViZDoom Average Episode Length

We report the average episode length (averaged over 100 episodes) during training for 3 ViZDoom scenarios: single target with 5 and 7 objects in hard mode, and composition target with 5 objects in easy mode. The GRU hidden state representation of the sentence was used in the plots in Figure 11. We observe that the average episode length slowly decreases when using ACTRCE, while the baseline DQN remains fairly flat for the harder 7-object (single target) and 5-object (compositional) settings.

Figure 11. Average training episode lengths in ViZDoom environments. (a) 5 objects in hard mode, single target case. (b) 7 objects in hard mode, single target case. (c) 5 objects in easy mode, composition case.

H. ViZDoom Cumulative Success Rate vs. Episode Length Curve

We take the final trained model and run 100 episodes, noting whether each run was successful or not (a binary value). We construct a cumulative success rate (CSR) versus episode length curve, where the x-axis is the episode length.


The y-axis is the fraction of the total number of episodes that were successful and had episode length less than or equal to the x-axis value:

CSR(x) = (number of successful episodes with episode length ≤ x) / (total number of episodes)    (2)

Therefore, the curve is monotonically increasing, and the y-axis value at the maximum episode length is the overall success rate. The better the model, the larger the area under this curve will be (similar in spirit to the precision-recall curve), because the model accumulates more short successful trajectories early on.
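
The curve in Eq. (2) can be computed with the following sketch:

def cumulative_success_rate(episodes, max_length=30):
    """episodes: list of (success: bool, length: int) over the 100 evaluation runs.
    Returns CSR(x) for x = 1..max_length, following Eq. (2)."""
    n = len(episodes)
    return [sum(1 for success, length in episodes if success and length <= x) / n
            for x in range(1, max_length + 1)]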

Figure 12 shows the Multi-task (MT) cumulative success rate for the 3 ViZDoom environment tasks, using the GRU hidden state language encoding: single target with 5 and 7 objects in hard mode, and composition target with 5 objects in easy mode. In the 5-object hard mode (Figure 12a), we observe that all 3 training algorithms had similar performance up to an episode length of around 20, beyond which ACTRCE has more successful trajectories that are longer. In the 7-object hard mode (Figure 12b), ACTRCE maintains similar behaviour, while the baseline DQN was essentially only able to succeed on very short episodes (i.e., when the target object was very close by). Lastly, in the 5-object composition task (Figure 12c), the curve for ACTRCE indicates that there were 2 groups of trajectories: one requiring less than 10 time steps, and another requiring over 20 time steps. The former group occurs when the two target objects are adjacent to one another, making it easy to reach the second object only a few time steps after the first. The latter group occurs when the target objects are not adjacent to each other, which requires the agent to turn around more carefully and avoid hitting other objects when trying to reach the second target.

Figure 12. Multi-task (MT) cumulative success rate versus episode length in ViZDoom environments, using the GRU hidden state sentence representation. (a) 5 objects in hard mode, single target case. (b) 7 objects in hard mode, single target case. (c) 5 objects in easy mode, composition case.

I. Additional details on Hindsight Advice Annealing

We report the results for hindsight advice annealing when using the InferLite and One-Hot representations, providing hindsight advice only during the first 10% and 1% of training; see Table 5 and Figure 13.
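
The limited-advice setting amounts to gating the hindsight relabeling on training progress, roughly as follows (names are ours; the fractions follow the text above):

def give_hindsight_advice(frame, total_frames, advice_fraction):
    """Return True while hindsight advice is still provided.

    advice_fraction = 1.0 gives advice throughout training (standard ACTRCE);
    0.1 and 0.01 restrict it to the first 10% and 1% of training frames."""
    return frame < advice_fraction * total_frames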

J. Visualization

For each single-target instruction in the training and test sets (70 instructions), we obtain an embedding vector of dimension 256 using either GRU, InferLite, or One-Hot (storing the entire embedding). From this embedding matrix, we can visualize the learned instruction embeddings in several ways.

J.1. Embedding Correlation Comparison

For each pair of instructions i and j, we obtain their embedding vectors vi and vj and compute their correlation distance, cdist(vi, vj). We define the correlation distance between two vectors u and v as:

cdist(u, v) = 1 − ((u − ū) · (v − v̄)) / (‖u − ū‖₂ ‖v − v̄‖₂)    (3)


Method                     MT             ZSL
ACTRCE (GRU) 1.0           0.80 ± 0.06    0.76 ± 0.03
ACTRCE (GRU) 0.1           0.75 ± 0.01    0.74 ± 0.06
ACTRCE (GRU) 0.01          0.73 ± 0.04    0.66 ± 0.05
ACTRCE (InferLite) 1.0     0.80 ± 0.01    0.76 ± 0.02
ACTRCE (InferLite) 0.1     0.78 ± 0.04    0.68 ± 0.01
ACTRCE (InferLite) 0.01    0.76 ± 0.01    0.68 ± 0.01
ACTRCE (OneHot) 1.0        0.80 ± 0.03    −
ACTRCE (OneHot) 0.1        0.74 ± 0.06    −
ACTRCE (OneHot) 0.01       0.70 ± 0.06    −

Table 5. Comparing the Multitask and Zero-shot Generalization when the agent's hindsight advice is limited to only the beginning of training.

Figure 13. Performance comparisons on the ViZDoom single-target environment when limiting hindsight advice, using (a) the InferLite representation and (b) the One-Hot representation.

We divide each matrix by its maximum entry so as to scale the results to [0, 1].
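
The correlation-distance matrix and the max-normalization can be computed with the following NumPy sketch of Eq. (3):

import numpy as np

def correlation_distance_matrix(embeddings):
    """embeddings: (N, D) array of instruction embeddings.
    Returns the N x N matrix of pairwise correlation distances (Eq. 3),
    scaled to [0, 1] by dividing by its maximum entry."""
    centered = embeddings - embeddings.mean(axis=1, keepdims=True)   # subtract each row's mean
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    unit = centered / np.clip(norms, 1e-12, None)
    dist = 1.0 - unit @ unit.T                                       # 1 minus correlation
    return dist / dist.max()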

J.2. t-SNE Plot Comparison

We use t-SNE to visualize the space of instruction embeddings. In the following figures, the shape of the data point indicates the type of object (such as a triangle for armor and a diamond for pillar), and the colour indicates the object's colour, with black used when no colour is specified. Finally, the size of the data point corresponds to the size indicated in the instruction, from 'smallest', 'small', none specified, 'large', to 'largest'. For the instructions with synonym word replacement, we simply leave the colour as black and use the default size.
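
A minimal sketch of producing such a plot with scikit-learn and matplotlib (the library choice and the plain styling are ours; the paper's plots additionally encode object type, colour, and size in the markers):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(embeddings, labels):
    """embeddings: (N, 256) array of instruction embeddings; labels: N instruction strings."""
    points = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
    plt.scatter(points[:, 0], points[:, 1], s=10)
    for (x, y), text in zip(points, labels):
        plt.annotate(text, (x, y), fontsize=5)  # label each point with its instruction
    plt.show()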



Figure 14. Correlation matrix for GRU. Best seen in electronic form.



Figure 15. Correlation matrix for InferLite. Best seen in electronic form.



Figure 16. Correlation matrix for One-Hot. Best seen in electronic form.



Figure 17. t-SNE plots of instruction embeddings for GRU (top left), InferLite (top right), One-Hot (bottom left) and InferLite synonyms (bottom right). Best seen in electronic form.