
Explainable Reinforcement Learning: Visual Policy Rationalizations Using

Grad-CAM

Laurens Weitkamp
11011629

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904

1098 XH Amsterdam

Supervisors
Elise van der Pol

Zeynep Akata

UvA-Bosch Delta Lab
University of Amsterdam

Science Park 904
1098 XH Amsterdam

June 29, 2018


Abstract

Deep reinforcement learning has made large advances through the use of deep neural networks. However, using such deep neural networks makes decision making opaque: it is difficult to find a logical connection between the model's input and output. Explainable Artificial Intelligence is an emerging field dedicated to making such deep neural networks explainable, but little research has been done on Explainable Reinforcement Learning. This thesis proposes a rationalization framework that combines Grad-CAM with the A3C reinforcement learning algorithm to create visual rationalizations that can aid in understanding decision making in such models. It does this by visualizing the evidence on which the agent bases its decision, thus increasing trust in and understanding of how such agents make decisions. The framework has been evaluated on two specific tasks: analyzing an agent at different time steps during training and analyzing situations in which the agent fails at a task. Both tasks have been evaluated using three Atari 2600 environments provided by the OpenAI Gym toolkit. Each environment has been chosen because it has a different level of difficulty or a different long-term reward dependency. This thesis emphasizes the importance of visual rationalizations as a means to increase both trust and understanding in deep reinforcement learning agents.


Acknowledgments

I would like to thank my supervisors, Elise van der Pol and Zeynep Akata, for guiding me through the process, giving me constant feedback and coming up with the project in the first place. Furthermore, I would like to thank Douwe van der Wal and Max Filtenborg for several feedback sessions. Lastly, I would like to thank the UvA Intelligent Robotics Lab for providing me with the computational power needed to train the models used in this thesis.


Contents

1 Introduction
2 Background
   2.1 Reinforcement Learning
      2.1.1 Exploration and Exploitation
      2.1.2 Q-Learning
      2.1.3 Policy Gradients
      2.1.4 Actor-Critic methods
   2.2 Deep Learning
      2.2.1 Convolutional Neural Network
      2.2.2 Long Short Term Memory
   2.3 Deep Reinforcement Learning
      2.3.1 Deep-Q Network
      2.3.2 Asynchronous Advantage Actor Critic
   2.4 Explainable Reinforcement Learning
      2.4.1 Gradient Based Class Activation Map
3 Visual Rationalization Model
4 Experiments
   4.1 Setup
   4.2 Ranking Grad-CAM Outputs
   4.3 Learning a Policy
   4.4 Agent Failures
5 Related Work
6 Conclusion And Discussion
   6.1 Discussion
   6.2 Further Research
Appendices
A Grad-CAM Output for Randomly Initiated Agent
B Hyper-parameters
References


1 Introduction

Reinforcement learning is an area of Machine Learning dealing with how agents should take actions when given an environment to interact with, in such a manner as to maximize the agent's long-term reward signal. The reward itself can be anything that would enforce a particular behaviour.

For instance, consider an agent that is learning how to drive a car. The agent would get a positive reward for driving safely, and a negative reward for driving into objects. Agents follow a policy and learn the value of a state or the policy itself by experiencing states in the environment, from which they can update the policy. This works well in low-dimensional, fully observable environments, but traditional reinforcement learning methods do not scale well to high-dimensional and noisy environments.

The field of Deep Learning is well equipped to deal with such high-dimensional and noisy data, such as images that display multiple objects. The most notable of such Deep Learning methods are Convolutional Neural Networks (CNNs), which can learn spatial information from high-dimensional input.

Through the use of deep CNNs as function approximators, the field of deep reinforcement learning has recently made many advances on hard problems in a wide range of environments (Mnih et al., 2013, 2016; Schulman, Levine, Moritz, Jordan, & Abbeel, 2015). However, replacing the value or policy function in reinforcement learning methods with a deep CNN makes the decision making process a black-box operation.

Recent progress in providing explainability for black-box deep CNN classifiers is focused on improving the user's trust in and understanding of the model. Because deep reinforcement learning methods often use deep CNNs, explainability methods developed for CNNs can perhaps be used in the field of deep reinforcement learning as well. Adding understanding of CNN decision making can come in various forms, such as textual explanations that specify what features of the class were prominent during classification (Hendricks et al., 2016). Another approach is providing visual explanations in the form of attention maps that point out class-specific regions.

This thesis focuses on Grad-CAM, a gradient-based class activation map (Selvaraju et al., 2016). Grad-CAM creates a heat map that shows prominent regions of activation given an input image and class. Using Grad-CAM and the A3C reinforcement learning algorithm described in (Mnih et al., 2016), the following research question will be investigated:

How can visual explainability methods aid in understanding Deep Reinforcement Learning decision making and learning?

The research question itself is divided into two sub-questions:

1. How do policies differ over the course of training?

2. What is the agent focused on before/when it fails at a task?

These questions will be investigated in three Atari 2600 environments provided by the OpenAI Gym toolkit. The games are Pong, BeamRider and Seaquest, chosen because all three differ in complexity, long-term reward behaviour and the number of active opponents per state.

This thesis is structured as follows. Section 2 provides the theoretical background on reinforcement learning required to understand the visualization model. Section 3 presents the visual rationalization model and explains how it is adapted to reinforcement learning tasks. Section 4 provides the setup required for the experiments, including which reinforcement learning model works best, and the results of the rationalization model applied to the three Atari 2600 games on different tasks. The last two sections, 5 and 6, discuss related work, and conclusions and further research respectively.

The codebase required to reproduce this thesis can be found on GitHub1 and interactive gifs showing agents playing the games together with their Grad-CAM outputs can be found on my website2.

1 https://github.com/lweitkamp/rl_rationalizations
2 https://lweitkamp.github.io/thesis2018


2 Background

2.1 Reinforcement Learning

Reinforcement Learning aims to train agents that solve problems through experience. The agent learns to map situations to actions and how to maximize a reward based on these actions (Sutton & Barto, 1998). Many recent advances in the field of reinforcement learning are in the area of model-free decision strategies, which do not explicitly model the agent's environment but let the agent map directly from inputs from the environment to actions. The main framework used by agents in reinforcement learning is the Markov Decision Process (MDP), which can model probabilistic decision problems (Sutton & Barto, 1998). An MDP is defined as a 5-tuple:

• S is a finite set of states

• A is a finite set of actions, with As denoting the actions available in state s.

• P : S ×A× S′ → [0, 1] is a state transition probability function.

• R : S ×A× S′ → R is a reward function.

• γ ∈ [0, 1] is a discount factor which determines the importance of future rewards.

In an MDP, the effects of an action taken in a state depend only on the current state. This is called the Markov Property, and can be expressed using the following equation:

$$P[R_{t+1} = r, S_{t+1} = s' \mid S_0, A_0, R_1, \ldots, S_{t-1}, A_{t-1}, R_t, S_t, A_t] = P[R_{t+1} = r, S_{t+1} = s' \mid S_t, A_t]. \qquad (1)$$

where t denotes the time step. The schema in figure 1 shows an agent interacting with an MDP-based environment.

Figure 1: The agent in an MDP environment (Sutton & Barto, 1998).

In many real-world problems the agent will not have knowledge of the complete environment. A Partially Observable MDP (POMDP) is a generalization of an MDP in which the agent cannot directly observe the current state. It instead maintains a probability distribution over the set of possible states based on observations and observation probabilities. Most modern reinforcement learning methods are based on the idea of having a partially observable state space.

2.1.1 Exploration and Exploitation

Given an environment and a policy, the agent can manoeuvre through this environment. But if the policy is non-optimal the agent still needs to figure out what the best action is to take so as to maximize its reward. Acting by maximizing reward is called a greedy or exploitative approach, and it works well if the agent has an optimal policy. If on the other hand the policy is non-optimal, this can lead to worsened results. Imagine an agent driving a car that gets a small positive reward each step in which it does not hit an object. This agent might decide to stand still and do nothing each step (which is also called reward hacking). On the other hand, picking a uniformly distributed random action guarantees that the agent is explorative, but does not ensure a high return: the agent driving a car could go back and forth each step using a uniform random action each time. Figuring out the correct way to balance exploration and exploitation is important to find an optimal policy.


2.1.2 Q-Learning

Q-Learning is an Off-Policy value iteration algorithm (Watkins & Dayan, 1992), defined as:

$$Q(S_t, A_t) = Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right]. \qquad (2)$$

The action-value function Q directly approximates q*, the optimal action-value function, independent of the policy being followed. All that is required for convergence is that all state-action pairs continue to be updated. Q-Learning is a tabular method, meaning that it holds a table of all state-action pairs and updates this table every iteration. The problem with tabular methods like Q-Learning is that they do not scale well to the high-dimensional data of more realistic environments. A real-world environment can have a massive number of state-action pairs, which breaks the convergence property of Q-Learning, as it becomes impossible to visit every state-action pair multiple times. Another problem is that keeping a table of so many state-action pairs is intractable for computers with finite storage.
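As an illustration, a minimal sketch of the tabular update in equation 2 is given below; the table size, transition and learning rate are made up for the example and are not taken from the thesis.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-Learning step, following equation 2."""
    td_target = r + gamma * np.max(Q[s_next])     # R_{t+1} + gamma * max_a Q(S_{t+1}, a)
    Q[s, a] += alpha * (td_target - Q[s, a])      # move Q(S_t, A_t) toward the target
    return Q

# Illustrative usage on a toy problem with 5 states and 2 actions.
Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```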

2.1.3 Policy Gradients

Policy-based model-free methods parameterize the policy π(a|s; θ) and typically update the parameters θ by performing gradient ascent on an objective function J with respect to the policy's parameters:

$$\theta_{t+1} = \theta_t + \alpha \nabla J(\theta_t) \qquad (3)$$

Taking the objective to be the value of the starting state, J(θ) = v_{π_θ}(s_0), the policy gradient theorem3 establishes that the gradient is proportional to

$$\nabla J(\theta_t) \propto \mathbb{E}_\pi\left[\sum_a \nabla_\theta \pi(a_t \mid s_t, \theta)\, q_\pi(s_t, a_t)\right] \propto \mathbb{E}_\pi\left[\sum_a \nabla_\theta \pi(a_t \mid s_t, \theta)\, R_t\right] \qquad (4)$$

This final expression can be sampled by interacting with the environment, and is used in the gradient update:

$$\theta_{t+1} = \theta_t + \alpha \nabla_{\theta_t} \log \pi(a_t \mid s_t; \theta_t)\, R_t \qquad (5)$$

Update procedures like equation 5 are used in the REINFORCE line of algorithms, which are widely used in policy gradient methods (Williams, 1992).

The Advantage Function. In order to reduce the variance of the policy gradient estimate, a learned function of the state b_t(s_t), known as a baseline function (Williams, 1992), can be subtracted from the return. Subtracting this baseline from the return in equation 5 results in:

$$\theta_{t+1} = \theta_t + \nabla_{\theta_t} \log \pi(a_t \mid s_t; \theta_t)\,(R_t - b_t(s_t)). \qquad (6)$$

A commonly used baseline function is V(s_t; θ_t), the value function indicating the value of being in state s_t. This function can be seen as the long-term reward for being in state s_t, in contrast to the immediate reward for being in state s_t given by the reward function R. The resulting quantity A_t = R_t − V(s_t; θ_t) is called the Advantage Function, because it is an estimate of the advantage of choosing action a_t in state s_t. Using the Advantage Function in equation 6, the policy gradient estimate becomes

$$\theta_{t+1} = \theta_t + \nabla_{\theta_t} \log \pi(a_t \mid s_t; \theta_t)\, A_t. \qquad (7)$$

3Full derivation can be found in (Sutton & Barto, 1998), page 269.
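A minimal PyTorch sketch of an update in the spirit of equation 7 follows; it assumes `log_prob` was produced by the policy network (so gradients flow into its parameters) and that the return and baseline are plain numbers. All names are illustrative, not the thesis implementation.

```python
import torch

def policy_gradient_step(optimizer, log_prob, R_t, baseline):
    """One REINFORCE-with-baseline update (equation 7).

    `log_prob` is log pi(a_t|s_t) for the sampled action and must be connected
    to the policy parameters through autograd; `R_t` and `baseline` are floats,
    so the advantage acts as a fixed weight on the gradient.
    """
    advantage = R_t - baseline          # A_t = R_t - V(s_t)
    loss = -log_prob * advantage        # minimizing -J ascends J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```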


2.1.4 Actor-Critic methods

Actor-Critic methods implement both value and policy estimation, for the value function V(s_t; θ) and the policy π(a_t|s_t; θ) respectively. The main idea in actor-critic methods is that the actor interacts with the environment following a policy π, while the critic assigns a value to each state the actor is in using the value function V; the architecture is shown schematically in figure 2.

Figure 2: Actor-Critic architecture.

In the case of policy gradient Actor-Critic methods, learning is done in two steps (a short sketch in code follows the list):

• the actor updates policy π (possibly using the advantage function) in a direction suggested by the critic.

• the critic updates the current value function V .

2.2 Deep Learning

Deep Learning methods, on the other hand, are made to handle high-dimensional data. The first breakthrough came with AlexNet (Krizhevsky, Sutskever, & Hinton, 2012), a deep neural network consisting of several convolutional layers followed by non-linear activations. This method of stacking convolutional layers in conjunction with other types of layers (max-pooling, non-linear activations) is very successful in image classification and reinforcement learning tasks. Deep Learning has also proven successful on sequentially organized data such as language, sound or video through the use of recurrent neural networks, such as Long Short Term Memory (Hochreiter & Schmidhuber, 1997).

2.2.1 Convolutional Neural Network

The convolutional operation itself (in the case of images) is a two-dimensional filter that passes in a sliding-window fashion over a matrix. At each position the filter sums up the element-wise multiplications, producing a new value for that window position; in total this amounts to a new matrix. A convolutional layer is a series of such filters, each of which produces a new channel based on the original input matrix. One of the more important parameters of a convolutional operation is the stride, which indicates the number of values the window skips. A stride of one, for example, returns a new matrix of the same size, whereas a stride of two effectively downsamples the matrix by a factor of two. The stride is discussed further in section 4, where the input to a convolutional network is a state in a reinforcement learning environment.
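The effect of the stride can be checked directly. A small PyTorch sketch (filter count, padding and input size chosen for illustration, matching the 80x80 frames used later):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 80, 80)                      # one grayscale 80x80 frame

conv_s1 = nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1)
conv_s2 = nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1)

print(conv_s1(x).shape)   # torch.Size([1, 32, 80, 80]) -- same spatial size
print(conv_s2(x).shape)   # torch.Size([1, 32, 40, 40]) -- halved by stride 2
```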

An interesting insight into CNNs is that the first layers act as primitive edge detectors, and subsequent layers slowly combine these primitive edges into abstract representations of the final classes. This effect will be noticeable in the case of Pong later in this thesis, as it is a game that consists only of such primitive edges.


2.2.2 Long Short Term Memory

In classification problems, data is often presumed to be Independent and Identically Distributed (IID): any data point does not give the model more information about a different data point. The presumption that data is IID does not hold in all fields, for example in language processing, where words can depend on previous words. Dependencies can also arise in reinforcement learning when an agent is dealing with enemies, for example when learning their trajectories or velocities. Even if the environment satisfies the Markov property, there can still be long-term dependencies to learn. One way to help with this is increasing the input size, giving the agent n states instead of one. Another way is to use recurrent neural networks such as Long Short Term Memory (LSTM).

Figure 3: Abstract representation of an LSTM module. Each cell outputs the hidden value hi and passes both hi and Ci to the next cell.

Long Short Term Memory (LSTM) is designed to deal with problems of long-term dependency. Each cell in an LSTM uses the inputs xi and Ci−1 to calculate the cell state value Ci and the hidden state value hi (see figure 3). The cell state value Ci is created using only linear operations (although the weights and bias can themselves be non-linear functions) and represents the long-term sequential information. The hidden state value hi is calculated through non-linear functions and is the direct output of the cell to the next layer (Olah, 2015). In the case of reinforcement learning the cell state could encode the velocity or direction of an enemy, and the hidden state how to react to this enemy.
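A minimal PyTorch sketch of carrying the pair (hi, Ci) across time steps with an LSTMCell; the feature size of 256 is illustrative, not taken from the thesis.

```python
import torch
import torch.nn as nn

lstm = nn.LSTMCell(input_size=256, hidden_size=256)

h = torch.zeros(1, 256)     # hidden state h_i, the cell's direct output
c = torch.zeros(1, 256)     # cell state C_i, the long-term memory

for _ in range(10):                     # ten consecutive observations
    x = torch.randn(1, 256)             # features for the current frame
    h, c = lstm(x, (h, c))              # both states are passed to the next step
```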

2.3 Deep Reinforcement Learning

Using both the spatial information provided by a CNN and the memory provided by LSTM cells, the field of deep reinforcement learning has become quite successful in dealing with high-dimensional and noisy data. This is in stark contrast with more traditional methods in reinforcement learning. Several state-of-the-art (SoTA) results have been achieved through deep reinforcement learning in complex environments such as Atari 2600 games and physics simulators such as the MuJoCo locomotion environment (Mnih et al., 2015; Todorov, Erez, & Tassa, 2012).

2.3.1 Deep-Q Network

The Deep-Q Network (DQN) was initially proposed in (Mnih et al., 2013), where Q-Learning uses a CNN as a function approximator for Q-values. DQN set new SoTA records in various settings and was extended by the authors in (Mnih et al., 2015) to reach human-level results on a range of Atari 2600 games. To balance exploration and exploitation, the DQN algorithm uses an ε-greedy policy that performs a uniform random action with probability ε and acts greedily with probability 1 − ε. ε decays over time towards a limit of 0.1 to ensure early exploration and late exploitation during training.
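A minimal sketch of ε-greedy action selection with a linearly decaying ε floored at 0.1; the decay schedule itself is an illustrative assumption, not the one used in DQN.

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """Pick a uniform random action with probability epsilon, else act greedily."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))

def epsilon_at(step, decay_steps=1_000_000, floor=0.1):
    """Linearly decay epsilon from 1.0 to the floor over the first steps."""
    return max(floor, 1.0 - (1.0 - floor) * step / decay_steps)
```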

2.3.2 Asynchronous Advantage Actor Critic

Unlike the Advantage Actor-Critic method, Asynchronous Advantage Actor Critic (A3C) utilizes multiple actors to learn more efficiently. Each actor has its own environment and set of parameters and updates a global model. The algorithm was introduced in (Mnih et al., 2016), where it proved more time-efficient than the DQN algorithm. This efficiency is not only due to the parallelization of agents, but also due to an exploration factor: with multiple agents using different exploration tactics, the training experience becomes more diverse. The agents could, for instance, all have a different ratio of exploration versus exploitation, which results in a wider set of states being visited. This also removes the need for the experience replay used in the DQN model (Juliani, 2016).

However, all these deep reinforcement learning methods have essentially become black-box decision-making processes precisely because they use CNNs. Agents can fail in non-obvious ways, and the deep models can fail to generalize well without the user knowing why. To build understanding of deep reinforcement learning, this thesis applies methods from Explainable Artificial Intelligence to complex reinforcement learning tasks and examines whether they offer guidance or rational explanations during training and in failure cases.

2.4 Explainable Reinforcement Learning

Explainable Artificial Intelligence (XAI) is an emerging field dedicated to developing methods that make deep learning more understandable, and thus build the user's understanding of the model (Hendricks et al., 2016; Park et al., 2018; Zintgraf, Cohen, Adel, & Welling, 2017). Such understanding can be built in many ways, for example visually or textually, but due to the lack of textual data in the Atari 2600 reinforcement learning environments, only a visual approach is discussed in this thesis.

2.4.1 Gradient Based Class Activation Map

A visual approach to building understanding of Deep Learning methods that use a convolutional neural network is the Class Activation Map (CAM) (Zhou, Khosla, Lapedriza, Oliva, & Torralba, 2015). The CAM method requires the chosen network to have a Global Average Pooling layer, and uses this layer's activations, given a predicted class, to create a class-discriminative heat map that highlights class-specific regions. Because the network architecture needs to be altered for CAM to work, it is not an optimal method for class-discriminative heat maps. The method has been extended to a more general framework called Grad-CAM (Selvaraju et al., 2016), which is largely agnostic to the architecture in use, as long as it is a convolutional neural network. The added benefit of Grad-CAM is that it weights class activations with gradients computed during the backward pass. Grad-CAM is discussed further in the next section, where it is also adapted to work with deep reinforcement learning models.


3 Visual Rationalization Model

Because deep reinforcement learning methods use neural networks to estimate either the value function or the policy of an agent, the reasoning as to why a specific action (e.g. go left or go right) is taken becomes opaque. This opaqueness comes from the fact that there is no simple link between the weights in the neural network and the function it is trying to approximate: it is not easy to inspect the weights and draw intuitive conclusions about the network.

In addition to this problem, training an agent is nontrivial and highly susceptible to small changes in hyperparameters or differences in codebase (Islam, Henderson, Gomrokchi, & Precup, 2017). Even the activation functions being used can have a significant impact. For instance, the ReLU activation function commonly used in neural networks can suffer from the 'dying ReLU' effect, causing worsened learning results (Xu, Wang, Chen, & Li, 2015).

Referring back to the self-driving car metaphor, knowing why the agent is making a decision can be very important to build understanding of the agent. On top of this, justifying the agent's predictions with visual elements that are consistent to the user is likely to increase understanding of the agent's decision making (Teach & Shortliffe, 1981). Being able to explain situations in which an agent fails in non-obvious ways could also provide intuitions as to why it fails.

One such explainability method that calculates class-discriminative features is the Gradient-weighted Class Activation Map (Grad-CAM), which generates visual explanations for CNN-based classifiers. Because this method applies to a wide variety of CNNs, it can also be used for the CNNs commonly found in deep reinforcement learning methods. In contrast to the Prediction Difference Analysis method, which can also be applied to a wide variety of CNNs, the Grad-CAM method is computationally feasible4 for a significant number of inputs.

4 Prediction Difference Analysis takes 70 minutes to compute one input for the VGG model; in contrast, the Grad-CAM method can process an input in one second for the VGG model (Zintgraf et al., 2017).


Figure 4: The original Grad-CAM overview modified for RL-A3C specific tasks. The model takes as input a state (in this case from the game BeamRider), calculates the state-action policy (with LEFTFIRE as the highest-valued action in this case) and then produces an activation map, based on LEFTFIRE, that is overlaid on the original state.

Visual rationalization model. Grad-CAM computes a class-discriminative localization map L^c_{Grad-CAM} ∈ R^{u×v} using the gradient of any target class. These gradients are global-average-pooled to obtain the neuron importance weights α^c_k for class c and activation map A^k of the chosen convolutional layer5:

$$\alpha^c_k = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{ij}}. \qquad (8)$$

Adapting this method to the A3C actor output, let h^a be the score for action a before the softmax; α^a_k now represents the importance weight for action a in activation map k:

$$\alpha^a_k = \frac{1}{Z} \sum_i \sum_j \frac{\partial h^a}{\partial A^k_{ij}}, \qquad (9)$$

with |h| = |A|, the total number of actions the agent can take. The forward-pass activations A^k are then weighted by these importance weights and passed through an ELU activation6 to produce a weighted class activation map:

$$L^a_{\mathrm{Grad\text{-}CAM}} = \mathrm{ELU}\left(\sum_{k=1}^{K} \alpha^a_k A^k\right). \qquad (10)$$

This activation map has values in the range [0, 1], with higher weights corresponding to a stronger response to the input state. The same procedure can be applied to the critic output. The resulting activation map can be interpolated to the size of the input state and overlaid on top of it to produce a high-quality heat map that indicates the regions that motivate the agent to take action a. A visual representation of this process is depicted in figure 4.

5 K is usually chosen to be the last convolutional layer in the CNN.
6 The Exponential Linear Unit has been chosen in favor of the ReLU used in the original Grad-CAM paper due to the dying ReLU effect described earlier.
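A minimal PyTorch sketch of equations 8-10 for the actor head is given below. It assumes the last convolutional layer is exposed as `model.conv4` and that the forward pass returns action logits, a value and the LSTM state; these names, the hook mechanism and the final rescaling to [0, 1] are illustrative assumptions rather than the exact implementation used in the thesis.

```python
import torch
import torch.nn.functional as F

def grad_cam_for_action(model, state, action, lstm_state):
    """Grad-CAM map (equations 8-10) for one actor logit h^a.

    Assumes `model.conv4` is the last convolutional layer and that the forward
    pass returns (action_logits, value, lstm_state); adapt names as needed.
    """
    activations, gradients = [], []
    fwd = model.conv4.register_forward_hook(
        lambda m, inp, out: activations.append(out))
    bwd = model.conv4.register_full_backward_hook(
        lambda m, gin, gout: gradients.append(gout[0]))

    logits, value, lstm_state = model(state, lstm_state)
    model.zero_grad()
    logits[0, action].backward()                    # d h^a / d A^k via autograd
    fwd.remove(); bwd.remove()

    A = activations[0]                              # shape [1, K, u, v]
    alpha = gradients[0].mean(dim=(2, 3), keepdim=True)    # eq. 9: GAP of gradients
    cam = F.elu((alpha * A).sum(dim=1, keepdim=True))      # eq. 10: weighted sum + ELU
    cam = F.interpolate(cam, size=state.shape[-2:], mode='bilinear',
                        align_corners=False)        # upsample to the input size
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # rescale to [0, 1]
    return cam[0, 0]
```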


4 Experiments

4.1 Setup

OpenAI Gym. The OpenAI Gym is a toolkit for developing reinforcement learning algorithms. The toolkit has a wide variety of environments that all share an interface, enabling the writing of general algorithms. To use the Atari 2600 video games, the Arcade Learning Environment (ALE) needs to be compiled (Bellemare, Naddaf, Veness, & Bowling, 2012). Each game in the ALE comes in several versions: NoFrameskip (does not skip frames), Deterministic (skips 3-4 frames depending on the game; this is what DeepMind uses), and two variants, v0 and v4, with minor environmental adjustments.

Atari 2600 Actions. The original Atari 2600 had a joystick controller with a single red 'fire' button attached. The joystick itself has eight directions: up, up-right, right, down-right, down, down-left, left and up-left. The button can be pressed by itself but also in combination with the joystick, which brings the total number of actions to 17 (up-left-fire, down-right-fire, etc.). Note that the OpenAI environment adds a 'NOOP' action that does nothing for that frame, and each game has a different selection of these actions to choose from, depending on the environment.

Table 1: Action space of the three Atari 2600 games Pong, BeamRider and Seaquest.

Game        Actions   Action set
Pong        6         NOOP, FIRE, RIGHT, LEFT, RIGHTFIRE, LEFTFIRE
BeamRider   9         NOOP, FIRE, UP, RIGHT, LEFT, UPRIGHT, UPLEFT, RIGHTFIRE, LEFTFIRE
Seaquest    18        all of the above plus DOWN, DOWNRIGHT, DOWNLEFT, UPFIRE, DOWNFIRE, UPRIGHTFIRE, UPLEFTFIRE, DOWNRIGHTFIRE, DOWNLEFTFIRE

Pong. Pong represents a table tennis game in 2D. The agent is represented by the paddle on the right and the opponent by the paddle on the left. PongDeterministic-v4 has 6 actions to choose from, displayed in table 1. Note that FIRE and NOOP are identical in Pong, and the same goes for FIRE/RIGHTFIRE and LEFT/LEFTFIRE. RIGHT represents going up in the game, and LEFT represents going down. The goal in Pong for either side is to reach a score of 21, after which the game ends.

BeamRider. BeamRider is a typical shooter game taking place in outer space. Each level has 15 enemies combined with floating debris, an increasing variation in enemies and a boss that appears after all 15 enemies are destroyed. The boss is optional, disappears after a few seconds of hovering at the top of the screen, and can only be killed with a torpedo, of which the agent has three each level. BeamRider's action set extends Pong's because of the torpedo, which is fired with the UP motion on the joystick, adding the actions UP, UPLEFT and UPRIGHT for a total of 9 actions, seen in table 1. The game has a finite number of levels, and the agent can get hit three times before the game finishes.

Seaquest. In Seaquest, the agent controls a submarine with torpedoes that can be fired at enemy sharks and submarines for points. The submarine can also pick up divers (8 at most) and bring them above the sea to gain additional points. The submarine has a finite amount of oxygen, and must therefore come up for air every once in a while. The oxygen adds a long-term reward mechanism to the game, which makes it interesting for reinforcement learning agents. Seaquest uses the full control of both joystick and button, adding up to the 18 actions displayed in table 1.


(a) The original frame.

(b) The pre-processed frame.

Figure 5: The pre-processing of an Atari frame.

Pre-processing. The default Atari game frame returned by the Gym environment is a 3x210x160 pixel window using RGB color channels. Each frame is cropped to an 80x80 window. The original frame, filled with integers in the range [0, 255], is then normalized to the range [0, 1]. After this, the three color channels are averaged, effectively turning the image to grayscale. The result of pre-processing a frame is pictured in figure 5 for the game Seaquest.
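A minimal sketch of this pre-processing step follows; the exact crop offsets are not given in the thesis, so the values below are assumptions chosen only to produce an 80x80 result.

```python
import numpy as np

def preprocess(frame):
    """Turn a raw 210x160x3 Atari frame into a 1x80x80 grayscale array.

    The crop offsets are illustrative; the intent is to keep the playing field
    before scaling to [0, 1] and averaging the color channels.
    """
    frame = frame[34:194]                        # crop 160 rows (assumed offsets)
    frame = frame[::2, ::2]                      # downsample to 80x80
    frame = frame.astype(np.float32) / 255.0     # normalize to [0, 1]
    frame = frame.mean(axis=2)                   # average RGB channels -> grayscale
    return frame[None, :, :]                     # add a channel dimension
```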

Model Selection. Both algorithms have been trained with two network architectures on the game Pong from the Arcade Learning Environment in the OpenAI Gym toolkit (Brockman et al., 2016). Neural Network 1 is the network proposed by DeepMind in (Mnih et al., 2013) and used to train both the Deep-Q and A3C models in their respective papers. Neural Network 2 is an implementation optimized for quicker convergence proposed in (Kostrikov, 2018), through the use of the ELU activation and weight normalization (Salimans & Kingma, 2016). The specifics of each network can be found in table 2.

Table 2: Difference in neural network architecture for network 1 and network 2.

                     Non-linear activation   Input frames   LSTM cell   Convolutional layers
Neural Network 1     ReLU                    4              No          5
Neural Network 2     ELU                     1              Yes         4

For each architecture, five agents were trained for 1,000,000 frames, after which each agent played 100 episodes7. The returns for each episode were averaged to obtain the four mean returns summarized in table 3. Because of the success of Neural Network 2 with the A3C algorithm, this combination of model and algorithm is used in the remainder of this thesis.

7 All agents were acting greedily with respect to the policy.

Table 3: Mean and variance for DQN and A3C in NN1 and NN2.

        NN1 mean   NN1 variance   NN2 mean   NN2 variance
DQN     20.13      6.60           -21.00     0.00
A3C     5.40       4.30           21.00      0.00

A3C. The A3C implementation used in this thesis is a modified version of the PyTorch implementation written in (Kostrikov, 2018). Most notably, the input image is pre-processed to a size of 80x80 pixels instead of 42x42 and the third convolutional layer has a stride of one instead of two. These modifications have been made to better suit the Grad-CAM output: a stride of two effectively down-samples the image by a factor of two, so that the Grad-CAM output of convolutional layer four would be of size 5x5. Interpolating this to an 80x80 image loses spatial information and results in a blurry attention visualization. In contrast, with the new setup the Grad-CAM output is of size 10x10, which contains more spatial information and is therefore more accurate when interpolated. Figure 6 contrasts the 5x5 and 10x10 outputs. A more detailed description of the network can be found in figure 7.

(a) Grad-CAM 5x5 interpolation to 80x80. (b) Grad-CAM 10x10 interpolation to 80x80.

Figure 6: Interpolation of a 5x5 Grad-CAM output and a 10x10 Grad-CAM output to 80x80. Noticeable is the difference in size of single activations: in 6a the bottom right represents a single activation and in 6b the middle represents a single activation.

Figure 7: The convolutional neural network used by the A3C algorithm. The input images are pre-processed to 80x80 frames and pass through four convolutional layers, all having 32 filters of size 3x3. Convolutional layers one, two and four have a stride of 2, effectively down-sampling the frame as it passes through, but convolutional layer three has a stride of 1 to keep the activation map needed for Grad-CAM large enough for high-quality heat maps. Each convolutional layer is followed by an Exponential Linear Unit. After the convolutions the features pass through an LSTM module to retain direction and velocity information, after which the actor and critic linear outputs are returned. The actor's output size n depends on the environment's action space.
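A sketch of the figure 7 architecture in PyTorch is given below. The flattened feature size follows from the stated strides (80x80 input reduced to 10x10), while the LSTM width of 256 units is an assumption rather than a value given in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class A3CNet(nn.Module):
    """Sketch of the figure 7 architecture; hidden sizes are assumptions."""

    def __init__(self, num_actions):
        super().__init__()
        # Four 3x3 convolutions with 32 filters; strides 2, 2, 1, 2 reduce an
        # 80x80 input to a 10x10 feature map (the layer used for Grad-CAM).
        self.conv1 = nn.Conv2d(1, 32, 3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(32, 32, 3, stride=1, padding=1)
        self.conv4 = nn.Conv2d(32, 32, 3, stride=2, padding=1)
        self.lstm = nn.LSTMCell(32 * 10 * 10, 256)
        self.actor = nn.Linear(256, num_actions)   # policy logits
        self.critic = nn.Linear(256, 1)            # state value

    def forward(self, x, lstm_state):
        x = F.elu(self.conv1(x))
        x = F.elu(self.conv2(x))
        x = F.elu(self.conv3(x))
        x = F.elu(self.conv4(x))
        h, c = self.lstm(x.view(x.size(0), -1), lstm_state)
        return self.actor(h), self.critic(h), (h, c)
```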

Trained Models. For the purpose of this thesis, two models have been trained. The first model, called the Full Agent, has been trained for (at least) 40 million frames. The second, called the Half Agent, has been trained for 20 million frames, except in the case of Pong, where it has been trained for 500,000 frames. All agents have been trained using the same hyper-parameters, described in appendix B. The mean and variance reported in table 4 have been calculated by playing 100 games with a greedy policy. These results have also been contrasted with results found in other literature; a fair comparison is hard because most authors only report a training time in hours, which does not account for the number of threads in use.

Table 4: Trained models compared to results found in the literature.

            Full Agent mean   Full Agent variance   Half Agent mean   Half Agent variance   DeepMind
Pong        21.00             0.00                  14.99             0.09
BeamRider   4659.04           1932.58               1597.40           1202.00
Seaquest    1749.00           11.44                 N/A               N/A


4.2 Ranking Grad-CAM Outputs

The Grad-CAM model outputs a class activation map with values in the range [0.0, 1.0], but the resulting maps are sparsely filled with values towards 1.0. These high activations are usually indicative of a moving object the agent is focusing on (an object that has its attention). To create a clear distinction between high and low activations, the outputs are ranked as follows (a small thresholding sketch follows figure 9):

High — activations ranging from 0.7 to 1.0; these appear as red in Grad-CAM outputs. An example is the red bounding boxes in figure 8.

Medium — activations ranging from 0.4 to 0.7; these appear as light green/yellow in Grad-CAM outputs. An example is the orange bounding boxes in figure 8.

Low — activations ranging from 0.0 to 0.4; these appear as light blue in Grad-CAM outputs. An example is the yellow bounding boxes in figure 8.

Figure 8: Examples of types of attention in Pong, BeamRider and Seaquest respectively. Red indicates a high Grad-CAM activation, orange indicates a medium amount of activation and yellow indicates a low amount of activation.

An example of the different ranks at their thresholds is shown below in figure 9.

Figure 9: A Grad-CAM output thresholded at three stages: High (values higher than 0.7), Medium (values between 0.4 and 0.7) and Low (values between 0.0 and 0.4). The first image is the original Grad-CAM output.
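A minimal sketch of splitting a normalized Grad-CAM map into the three ranks; the random array only stands in for a real Grad-CAM output.

```python
import numpy as np

def rank_activations(cam):
    """Split a [0, 1] Grad-CAM map into low / medium / high boolean masks."""
    high = cam >= 0.7
    medium = (cam >= 0.4) & (cam < 0.7)
    low = cam < 0.4
    return low, medium, high

cam = np.random.rand(80, 80)              # stand-in for a real Grad-CAM output
low, medium, high = rank_activations(cam)
print(high.sum(), "pixels with high activation")
```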

4.3 Learning a Policy

Training an agent to reach human-like or superhuman performance in a complex environment can take millions of frames. Seeing how an agent reacts to different situations at different stages of training might clarify how it is trying to maximize long-term rewards. For each game, states were manually sampled, after which both agents 'played' the state to build up spatio-temporal information in the LSTM cells of the convolutional model. Using the action of each agent, the Grad-CAM outputs can be contrasted to highlight differences in policy. To contrast the resulting Grad-CAM outputs with a model with randomized weights as a baseline, appendix A shows a random selection of frames generated by untrained agents.


Pong. The Full Agent has learned to shoot the ball in such a way that it scores by hitting the ball only once each round. The initial round might differ, but after that all rounds are the same: the Full Agent shoots the ball up high, which makes the ball bounce off the wall all the way down over the opponent's side, at which point the agent retreats to the lower right corner. In contrast, the Half Agent is actively tracking the ball at each step and could potentially lose some rounds because of this. This tracking behaviour of the Half Agent is also demonstrated in figure 10, where it is trying to meet the ball's height in frames 50, 51 and 53. Looking at the Grad-CAM output, the agent probably makes this decision based on the ball itself, because there is a high attention level on the ball in these frames. The Half Agent, on the other hand, has a high attention level on itself and the enemy in all but one frame (53): the action of going down is not based solely on the attention on the ball, but also largely on the attention on itself and its direct surroundings. The Full Agent is calculating where the ball might go next in order to hit it; the Half Agent is trying to keep up with the ball at every step.

Figure 10: Manually sampled states from the game Pong, combined with the Full Agent's and the Half Agent's actions and Grad-CAM outputs for these states.

BeamRider. Both agents have learned to hit enemies, but the Full Agent has a higher average return, which indicates that it has more knowledge about how to play the game. Looking at figure 11, both agents have a measure of attention on the two white enemy saucers, but the rank of attention differs: the Full Agent has high attention on the enemies, whereas the Half Agent has low attention on them. The Half Agent is either going right, which is essentially a NOOP in that area, or it could be shooting at the incoming enemy. More interesting are the last two frames, 175 and 176. The attention of the Full Agent turns from the directly approaching enemy saucer to the enemy saucer on its left, and the agent tries to move in its direction (LEFTFIRE). The Full Agent's attention in frame 176 is placed to a medium degree on the trajectory of its own laser, which will hit the enemy saucer in the next frame. This could indicate that the Full Agent knows it will hit the target and is therefore moving away from it, to focus on the other remaining enemy.

Figure 11: Manually sampled states from the game BeamRider, combined with the Full Agent's and the Half Agent's actions and Grad-CAM outputs for these states.

The analysis of both agents reveals another interesting result: the agents do not learn to 'properly' use the torpedoes. At the beginning of each episode/level both agents fire torpedoes until they are all used up and then continue as usual. Figure 12 depicts a manually sampled configuration played by the Full Agent. The torpedoes have deliberately not been used yet, and there are enemies coming towards the agent at different time steps. Looking at the Grad-CAM attention map, the agent appears to be highly focused on the remaining three torpedoes in the upper right corner. This occurs even when the chosen action is not of the UP variety that would fire a torpedo.


Figure 12: Manually sampled states from the game BeamRider while not firing torpedoes, combined with the Full Agent's actions and Grad-CAM outputs for these states. In the 300 frames played it has chosen an UP variant 219 times, LEFTFIRE 67 times and other actions 14 times.

4.4 Agent Failures

Even well-trained agents make mistakes, and in the case of BeamRider and Seaquest the trained agents still end the episode by dying. This section provides an analysis of where the agents' attention was in the four frames leading up to death. Only BeamRider and Seaquest are considered, because a well-trained Pong agent does not lose the game.

BeamRider. In figure 13 the agent initially has a high amount of attention focused on itself, a medium amount directed at an incoming laser and a low amount focused on the debris right ahead of it. A white enemy saucer is approaching from the left, and the attention of the agent shifts to both this enemy and the incoming piece of debris. In all frames the agent is performing the LEFTFIRE action which, given the Grad-CAM activations and the trajectory of the laser seen in frames 4139 and 4140, is aimed at the piece of debris. Unfortunately, the agent's laser cannot destroy debris, because debris requires a torpedo. This leads to the hypothesis that the agent was too late to focus on the enemy saucer because it was looking at the debris earlier. This is also supported by the Grad-CAM output, where the agent is only looking at the enemy saucer in the three frames leading to its death. If the agent had made the causal relation between debris and torpedoes, or had learned to always avoid debris, it would have been able to focus on the enemy saucer earlier.


Figure 13: Agent dies because it is hit by enemy laser.

In the next situation, pictured in figure 14, the agent is approached by a number of different enemies, one of which only appears after level 7: the green bounce craft. This is another enemy that can only be destroyed by a torpedo; it jumps from beam to beam trying to hit the agent, which is what eventually kills the agent in frame 5664. In all frames, high attention is focused on the nearest three enemies, two of which can only be killed with a torpedo. The agent is shooting using LEFTFIRE, aimed, based on the Grad-CAM activations, at the green bounce craft directly in front of it. As in the previous scenario in figure 13, the agent cannot kill the incoming enemies and is not able to avoid them or use its torpedoes effectively.

Figure 14: Agent dies because it is hit by a green bounce craft.

In the next case, depicted in figure 15, the agent eventually dies because it is hit by a green blocker shield, another enemy it can only kill with a torpedo. The first frame is rather interesting: attention is scattered throughout the frame, focused on almost everything except for the enemies. Comparing frame 2817 with 2818 gives the impression that they target exactly opposite locations in the frame. It is hard to interpret such heat maps with respect to the agent, except to note that it is focusing on everything except enemies. Frames 2818 and 2819 are easier to interpret: the agent is focused on enemies and tries to either hit them or avoid them, but because it is trying to hit them it is too late to maneuver out of the way. It eventually dies because of this.

Figure 15: Agent dies because it is hit by a green blocker shield.

All three cases exhibit similar behaviour when it comes to the Grad-CAM rationalizations: the agent is focusing on and shooting at enemies that it cannot kill with its laser. The attention map explains what the agent is firing at, which leads to the conclusion that the policy the agent is using has not learned to use its torpedoes correctly, a theory supported by the results from the previous section, where the agent cannot start a game without firing off all torpedoes.

Seaquest. In Seaquest, the agent needs to learn that going up for air is beneficial and gives a high long-term return (staying alive). Many factors could prevent the agent from learning this, such as a too small LSTM cell or a lack of explorative actions. A solution could be the use of Fine-Grained Action Repetition, which occasionally selects a random action and repeats it for a decaying number of steps (Sharma, Lakshminarayanan, & Ravindran, 2017) (the authors even use Seaquest as an example). During training, the Seaquest agent in this thesis did not learn to go up and was thus stuck at a maximum of around 1900 points. The agent will go into the water and shoot as many enemies as it can, but it will die when it runs out of oxygen. This section provides a small case study using Grad-CAM to analyze what the agent is looking at when there are enemies on the screen and right before running out of oxygen.

In figure 16 the agent is seen sporadically going left and right, sometimes shooting (there is medium attention focused on shots, but there is no clear target). The agent has high attention focused on itself and enemies, which could indicate some spatial awareness, but there is no attention focused on the oxygen meter pictured at the bottom.


Figure 16: A showcase of Grad-CAM applied to Seaquest in a setting with enemies and other objects.

The situation in which the agent is running out of oxygen is depicted in figure 17. The agent has been shooting at an enemy on the right-hand side, which explains the medium amount of attention focused on that area; there is also a diver below the agent with a medium amount of attention focused on it. Other than these sources, the attention is highly focused on the agent itself. The fact that there is little activation on the oxygen bar supports the theory that the agent has not made the correct correlation between time spent alive and going up for air.

Figure 17: The agent dies due to lack of oxygen (oxygen bar visible in the bottom).


5 Related Work

Visual occlusion or heat maps are often used to introduce explainability into Deep Learning models, such as the Prediction Difference Analysis method proposed in (Zintgraf et al., 2017). This method is, however, quite computationally expensive (the authors note 70 minutes of processing for a single image), which makes it intractable for sequences of over 1000 images, as is often the case in reinforcement learning tasks. Textual approaches to explainability such as (Hendricks et al., 2016) were not discussed in this thesis due to the lack of ground-truth class labels in Atari 2600 games.

In (Harrison, Ehsan, & Riedl, 2017), a model is proposed to explain an autonomous system's behavior as if a human had performed the behavior. The authors use a natural language training dataset collected from human players thinking out loud while playing a game. This dataset is then used to generate explanations for an agent playing the same state-action pairs the humans were playing. The downside to this approach is that the explanations do not come from the agent but from humans. The framework proposed in this thesis, in contrast, generates heat maps based on the agent's own actions, which are then rationalized.

In (Greydanus, Koul, Dodge, & Fern, 2017) the authors propose the use of saliency maps to understand how an agent learns and follows a policy. They also propose a perturbation method that selectively blurs regions to calculate their impact on the policy. Although that method demonstrates important regions for the agent's decision making, the method proposed in this thesis highlights important regions without the need for a perturbation step.

The use of a t-SNE embedding for the Atari 2600 game Space Invaders in (Mnih et al., 2015) adds an understanding of how an agent ranks environments based on the value function in Q-Learning, but this is outside the scope of this work. In general, there is not much literature to be found on Explainable Reinforcement Learning, but with the recent advances and focus on explainability in AI this might change.


6 Conclusion And Discussion

This thesis has presented a novel rationalization framework that uses Grad-CAM in combination with the A3C deep reinforcement learning model to produce an action-specific activation map. This activation map can be combined with the original input frame to produce an interpretable, location-specific heat map indicating the agent's points of attention. This heat map can then be used to rationalize the decision that the agent makes.

We have shown that during the training of an agent, the rationalization framework can help by providing visual distinctions between different stages of training with respect to the policy the agent is using. The rationalizations help in understanding how an agent learns and what it should be focusing its attention on to get a higher reward.

We have also shown that the rationalization framework helps in creating understanding when an agent fails at its task, specifically by looking at where the agent's attention was when it died. In doing so, we found that the agent can fail because of long-term reward mechanisms that it either understands wrongly (BeamRider's torpedoes) or is not focusing its attention on (Seaquest's oxygen bar).

This thesis has argued for and emphasized the usability of visual rationalizations as a means to increase understanding of deep reinforcement learning agents. However, there are some problems when combining reinforcement learning agents with explainability methods.

6.1 Discussion

A problem in explainable reinforcement learning is the lack of ground truth. In image classification tasks, each image usually has a class assigned to it indicating the ground truth. In reinforcement learning it is hard to say which action is correct and which action is wrong; it can be subjective to the strategy being used. The lack of a ground truth makes it impossible to determine whether a specific rationalization is correct, but rationalizations can still enhance the understanding of the agent's decision making.

Grad-CAM Interpretability. As shown in figure 15, Grad-CAM heat maps can sometimes be quite hard to interpret. This behaviour is most prominent at the start and at the end of a round ((re)spawning). A few examples of these noisy Grad-CAM outputs are shown in figure 18. It is important to note that during (re)spawning the agent can act in the same way as normal, but the actions will not be executed. It could be that the agent is aware of this because every enemy is removed from the frame. In the case of BeamRider, the agent even changes color, which could be another indicator to the agent that something is different, and in Pong and Seaquest the agent is removed from the game for a short period. The frames themselves are also not helpful for investigating agent failure, because the agent cannot move or interact properly, but they might provide clues as to which visual objects catch the agent's attention.


Figure 18: Noisy Grad-CAM outputs generated when the agent was in the process of (re)spawning.

As mentioned, this behaviour is most prominent during (re)spawning, but it does not only appear there. Figure 19 depicts some more scenes from a different setting. Even though these images do not show a clear reason for the action being taken, and are thus hard to rationalize, they still depict interesting behaviour: in the first Pong frame, for example, the attention is scattered at low intensity throughout the screen and concentrated on both agents. In the case of Pong and BeamRider, the noise often takes the form of activations spread throughout the image that are not prominent on any moving or changing object, including the agent itself, the scores and the enemies.

Figure 19: Noisy Grad-CAM outputs.



6.2 Further Research

Firstly, further research could compare Grad-CAM outputs for multiple actions in one state. These could be compared to each other to see which activations motivate the agent to, for example, go left and not right. This could be done by first subtracting the critic's Grad-CAM output to produce an action-only attention map.
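A hedged sketch of this comparison is given below. It reuses the hypothetical grad_cam helper from the sketch in the conclusion and assumes that a critic map can be obtained by backpropagating the value output instead of an action logit; all names are illustrative, and subtracting the critic map is only one possible way to isolate action-specific evidence.

import torch
import torch.nn.functional as F

def value_grad_cam(model, state, target_layer):
    # Same procedure as grad_cam(), but the backward pass starts from the value head.
    activations, gradients = [], []
    fwd = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    bwd = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))
    _logits, value = model(state)
    model.zero_grad()
    value.sum().backward()
    fwd.remove(); bwd.remove()
    weights = gradients[0].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * activations[0]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=state.shape[-2:], mode='bilinear', align_corners=False)
    return ((cam - cam.min()) / (cam.max() - cam.min() + 1e-8)).squeeze().detach()

def contrast_actions(model, state, target_layer, actions):
    # Subtract the action-agnostic critic attention from each action's map,
    # keeping only the evidence that is specific to that action.
    critic_map = value_grad_cam(model, state, target_layer)
    return {a: torch.clamp(grad_cam(model, state, a, target_layer) - critic_map, min=0.0)
            for a in actions}

Comparing, for instance, the resulting maps for the left and right actions in the same state would then show which regions push the policy towards one action over the other.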

A second point of further research could focus on finding differences between deep reinforcement learning models with the use of Grad-CAM. Such research could contrast model behaviour in a controlled environment, where each model is given the same state and the resulting Grad-CAM outputs are compared across models.

A third and last point of further research could test the framework proposed in this thesis on different environments, such as MuJoCo's continuous control tasks (Todorov et al., 2012). These offer more challenging tasks with a larger, continuous action space, in contrast to the limited discrete action space of Atari 2600 games.



Appendices

A Grad-CAM Output for a Randomly Initialized Agent

A selection of random actions and the Grad-CAM outputs can be seen in figure 20.

Figure 20: Selection of random actions and the resulting Grad-CAM outputs for untrained agents (frames are not in any particular order). Note that for the game Pong, the Grad-CAM outputs are more location-specific than for BeamRider and Seaquest. This might be because convolutions themselves are good at detecting abstract objects and edges, which Pong is filled with.

B Hyper-parameters

Table 5: Hyperparameters used during training. The hyperparameters are equal to those in (Mnih et al., 2016).

Name                  Value    Description
Learning rate         0.0001   Scale of the gradient update
γ                     0.99     Discount factor for rewards
τ                     1.00     Scale of the advantage function return
Entropy coefficient   0.01     Scales the entropy term in the policy loss estimation
Number of processes   14       Number of parallel agents training
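As an illustration of where these values enter the training procedure, the sketch below shows the losses for one A3C rollout with generalized advantage estimation, loosely following the structure of the cited pytorch-a3c implementation (Kostrikov, 2018). It is a sketch under those assumptions, not the exact training code used for this thesis.

import torch

LEARNING_RATE = 1e-4   # scale of the gradient update
GAMMA = 0.99           # discount factor for rewards
TAU = 1.00             # scales the advantage (GAE) estimate
ENTROPY_COEF = 0.01    # weight of the entropy bonus in the policy loss

def a3c_losses(rewards, values, log_probs, entropies, bootstrap_value):
    # Policy and value losses for one rollout, accumulated backwards in time.
    R = bootstrap_value
    gae = torch.zeros(1)
    policy_loss, value_loss = 0.0, 0.0
    values = values + [bootstrap_value]
    for t in reversed(range(len(rewards))):
        R = rewards[t] + GAMMA * R
        value_loss = value_loss + 0.5 * (R - values[t]).pow(2)
        # Generalized Advantage Estimation, controlled by GAMMA and TAU.
        delta = rewards[t] + GAMMA * values[t + 1] - values[t]
        gae = gae * GAMMA * TAU + delta
        # The entropy bonus (ENTROPY_COEF) discourages a prematurely deterministic policy.
        policy_loss = policy_loss - log_probs[t] * gae.detach() - ENTROPY_COEF * entropies[t]
    return policy_loss, value_loss

An optimizer with the learning rate listed above would then apply the summed losses, with 14 worker processes each running this update asynchronously.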



References

Bellemare, M. G., Naddaf, Y., Veness, J., & Bowling, M. (2012). The arcade learning environment: An evaluation platform for general agents. CoRR, abs/1207.4708. Retrieved from http://arxiv.org/abs/1207.4708

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). OpenAI Gym. CoRR, abs/1606.01540. Retrieved from http://arxiv.org/abs/1606.01540

Greydanus, S., Koul, A., Dodge, J., & Fern, A. (2017). Visualizing and understanding Atari agents. CoRR, abs/1711.00138. Retrieved from http://arxiv.org/abs/1711.00138

Harrison, B., Ehsan, U., & Riedl, M. O. (2017). Rationalization: A neural machine translation approach to generating natural language explanations. CoRR, abs/1702.07826. Retrieved from http://arxiv.org/abs/1702.07826

Hendricks, L. A., Akata, Z., Rohrbach, M., Donahue, J., Schiele, B., & Darrell, T. (2016). Generating visual explanations. In ECCV.

Hochreiter, S., & Schmidhuber, J. (1997, November). Long short-term memory. Neural Computation, 9(8), 1735–1780. Retrieved from http://dx.doi.org/10.1162/neco.1997.9.8.1735 doi: 10.1162/neco.1997.9.8.1735

Islam, R., Henderson, P., Gomrokchi, M., & Precup, D. (2017). Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. CoRR, abs/1708.04133. Retrieved from http://arxiv.org/abs/1708.04133

Juliani, A. (2016, December 17). Simple reinforcement learning with TensorFlow part 8: Asynchronous actor-critic agents (A3C) [Blog post]. https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2

Kostrikov, I. (2018). PyTorch implementations of asynchronous advantage actor critic. GitHub. https://github.com/ikostrikov/pytorch-a3c

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25 (pp. 1097–1105). Curran Associates, Inc. Retrieved from http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., . . . Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. CoRR, abs/1602.01783. Retrieved from http://arxiv.org/abs/1602.01783

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., . . . Hassabis, D. (2015, February). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. Retrieved from http://dx.doi.org/10.1038/nature14236

Olah, C. (2015, August 27). Understanding LSTM networks [Blog post]. http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Park, D. H., Hendricks, L. A., Akata, Z., Rohrbach, A., Schiele, B., Darrell, T., & Rohrbach, M. (2018). Multimodal explanations: Justifying decisions and pointing to the evidence. In IEEE CVPR.

Salimans, T., & Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. CoRR, abs/1602.07868. Retrieved from http://arxiv.org/abs/1602.07868

Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). Trust region policy optimization. CoRR, abs/1502.05477. Retrieved from http://arxiv.org/abs/1502.05477

Selvaraju, R. R., Das, A., Vedantam, R., Cogswell, M., Parikh, D., & Batra, D. (2016). Grad-CAM: Why did you say that? Visual explanations from deep networks via gradient-based localization. CoRR, abs/1610.02391. Retrieved from http://arxiv.org/abs/1610.02391

Sharma, S., Lakshminarayanan, A. S., & Ravindran, B. (2017). Learning to repeat: Fine grained action repetition for deep reinforcement learning. CoRR, abs/1702.06054. Retrieved from http://arxiv.org/abs/1702.06054



Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA, USA: MIT Press. Retrieved from http://www.cs.ualberta.ca/%7Esutton/book/ebook/the-book.html

Teach, R. L., & Shortliffe, E. H. (1981). An analysis of physician attitudes regarding computer-based clinical consultation systems. In Use and impact of computers in clinical medicine (pp. 68–85). Springer.

Todorov, E., Erez, T., & Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. In IROS (pp. 5026–5033). IEEE. Retrieved from http://dblp.uni-trier.de/db/conf/iros/iros2012.html#TodorovET12

Watkins, C. J. C. H., & Dayan, P. (1992, May). Q-learning. Machine Learning, 8(3), 279–292. Retrieved from https://doi.org/10.1007/BF00992698 doi: 10.1007/BF00992698

Williams, R. J. (1992, May). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3), 229–256. Retrieved from https://doi.org/10.1007/BF00992696 doi: 10.1007/BF00992696

Xu, B., Wang, N., Chen, T., & Li, M. (2015). Empirical evaluation of rectified activations in convolutional network. CoRR, abs/1505.00853. Retrieved from http://arxiv.org/abs/1505.00853

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2015). Learning deep features for discriminative localization. CoRR, abs/1512.04150. Retrieved from http://arxiv.org/abs/1512.04150

Zintgraf, L. M., Cohen, T. S., Adel, T., & Welling, M. (2017). Visualizing deep neural network decisions: Prediction difference analysis. CoRR, abs/1702.04595. Retrieved from http://arxiv.org/abs/1702.04595
