Notes on Reinforcement Learning - v0.1

Notes on Reinforcement Learning Joo-Haeng Lee GTA (Game Theory AOC), ETRI, 2017-09-21

Transcript of Notes on Reinforcement Learning - v0.1

Page 1: Notes on Reinforcement Learning - v0.1

Notes on Reinforcement Learning

Joo-Haeng Lee

GTA (Game Theory AOC), ETRI, 2017-09-21

Page 2: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Goal of Today’s Talk

To understand reinforcement learning for games,

- Let's grasp the basic concepts!

  • So that we can follow the story of someone who claims to have understood them.

- Let's look over the research trends!

  • How on earth does Google DeepMind's Deep Q-Network play ATARI 2600 games?

- References

  • http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/

  • Wikipedia, arXiv, Java ATARI 2600, GitHub, Two Minute Papers, …

Page 3: Notes on Reinforcement Learning - v0.1

Background & Motivation

Page 4: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Various types of systems! Each requires a different control policy.

Page 5: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Many computer games can be modeled as dynamic systems with discrete components, though not all of them!

Page 6: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Page 7: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Page 8: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Page 9: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

[Diagram: a state machine with states S0-S8 and actions a0-a2]

A state machine SM defined with 9 states, 3 actions, and 20 transitions.

Page 10: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]


For a given input at at time t, the state machine SM returns its state representation st and reward rt.


Page 11: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]


A human player.

[Diagram: the human player in the agent-environment loop (Action a_t, State s_t, Reward r_t; Perception and Control)]

Page 12: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

The system details and dynamics are unknown, or only partially known.


Page 13: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

The system details and dynamics are unknown, or only partially known.


Page 14: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

A human builds his cognitive model by learning.


Page 15: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

How can a machine learn to perform?


Page 16: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

A classic approach is Reinforcement Learning (RL)!


Page 17: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

One of the RL methods is Q-Learning (QL)!


Page 18: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

A recent advancement in QL is Deep Q-Learning (DQL) by DeepMind!


Page 19: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Can we build Alpha Ma using DQL or its variants?


Page 20: Notes on Reinforcement Learning - v0.1

Key Concepts

Page 21: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Reinforcement Learning

• “Reinforcement learning (RL) is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.” — Wikipedia

• “Reinforcement learning is one of the problems addressed by machine learning: an agent defined within some environment perceives its current state and selects, from the available actions, the action or sequence of actions that maximizes its reward.” — Wikipedia (translated from Korean)

Page 22: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Main Challenges in RL

• Credit Assignment Problem

- “Which action deserves the credit for the points I just scored?”

• Exploration-Exploitation Dilemma

- Exploration vs. exploitation (prospecting vs. mining): “Should I go looking for a bigger vein of gold, or keep mining right here?”

Page 23: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Mathematical Formalism for RL

• Markov Decision Process (MDP): state transitions driven by actions

- Action
- State
- Transition

[Diagram: the state machine S0-S8 with actions a0-a2 and rewards on its transitions (-10, -10, +2, +20, +20, +100, -50)]

Page 24: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Key Concepts

• Encoding long-term strategies: discounted future reward

• Estimate the future reward: table-based Q-learning

• Huge state space: Q-table is replaced with neural network

• Implementation tip: experience replay for stability

• Exploration-exploitation dilemma: ε-greedy exploration

Page 25: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Reinforcement Learning — Breakout

Page 26: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Reinforcement Learning — Breakout

• Problem description
- Input: game screen with scores
- Output: game controls = ( left || right || space )
- Training?

• An expert dataset for supervised learning — how do we get one?
• Self-practice with occasional rewards, as humans do — reinforcement learning

Page 27: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Markov Decision Process (MDP) — Formalism for RL

• Environment: game, system, simulator, …
• Agent: a human user, SW
• State: stochastic transition to another state for an action
• Action

• Reward

• Policy

[Diagram: the agent-environment loop (Action, State, Reward) alongside the state-machine example with rewards on its transitions]
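To make these components concrete, here is a minimal Python sketch of an MDP as plain data, in the spirit of the state-machine example above; the particular states, probabilities, and rewards are made up for illustration and do not reproduce the diagram exactly.

```python
import random

# transitions[state][action] -> list of (probability, next_state, reward)
# Hypothetical values, loosely inspired by the S0..S8 / a0..a2 example above.
transitions = {
    "S0": {"a0": [(0.8, "S1", 0.0), (0.2, "S2", 0.0)],
           "a1": [(1.0, "S3", -10.0)]},
    "S1": {"a2": [(1.0, "S4", 20.0)]},
    "S3": {"a0": [(1.0, "S7", 100.0)]},
}

def step(state, action):
    """Sample a stochastic transition (next state, reward) for the given state and action."""
    outcomes = transitions[state][action]
    u, acc = random.random(), 0.0
    for prob, next_state, reward in outcomes:
        acc += prob
        if u <= acc:
            return next_state, reward
    return outcomes[-1][1], outcomes[-1][2]   # numerical safety fallback
```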

Page 28: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Markov Decision Process (MDP) — Formalism for RL

• Episode

- a sequence of states, actions, and rewards in one game
- s0, a0, r1, s1, a1, r2, s2, a2, r3, s3, a3, …, rn-1, sn = game over = GG

• Markov assumption

- The probability of the next state si+1 depends only on current state si and performed action ai, but not on preceding states or actions.

Page 29: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

• To play well in the long term, we need to consider current and future rewards at once.

• Total reward for an episode: R = r1 + r2 + r3 + … + rn

• Total future reward at time t: Rt = rt + rt+1 + rt+2 + … + rn

• Considering the stochastic nature of the environment, rewards that are farther in the future are discounted.

• Discounted future reward: Rt = rt + γ rt+1 + γ2 rt+2 + … + γn-t rn = rt + γ Rt+1 (see the sketch below)

• NOTE: A good strategy for an agent would be to always choose, at state st, the action at that maximizes the discounted future reward Rt+1.

• BUT, how?

Discounted Future Reward (DFR)
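As a small worked example of the recursion Rt = rt + γ Rt+1, the sketch below computes discounted returns from a list of rewards; the reward values and γ are illustrative only.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return [R_0, R_1, ...] where R_t = r_t + gamma * R_{t+1} and R_n = r_n."""
    returns = [0.0] * len(rewards)
    future = 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future   # R_t = r_t + gamma * R_{t+1}
        returns[t] = future
    return returns

# A reward of 10 arriving four steps in the future contributes 10 * 0.9**4 = 6.561 to R_0.
print(discounted_returns([0, 0, 1, 0, 10]))    # approximately [7.371, 8.19, 9.1, 9.0, 10.0]
```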

Page 30: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Q-Learning

• Q-function: Q(s, a) — the discounted future reward from a sequence of optimal actions

- Q(s, a) = max Rt+1

- Among myriads of possible episodes, the maximum DFR could be earned from a certain sequence of actions after the current action a at the state s.

- The “quality” of the current action affects DFR.

Page 31: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Q-Learning

• Policy: - π(s) = arg maxa’ Q(s, a’) = a

- The action a which results in the maximum DFR: Q(s, a)

Page 32: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Q-Learning

• Bellman equation for the transition < s1, a1, s2, r2 >

- Q(s1, a1) = r2 + γ maxai Q(s2, ai)

[Diagram: from state s1, action a1 leads to s2 with reward r2; actions a2 and a3 lead to s3 and s4 with rewards r3 and r4]

Page 33: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Q-Learning

• Naïve algorithm for Q-table filling (a Python sketch follows below):

- Initialize the Q-table arbitrarily, with #states rows and #actions columns.
- Observe the initial state s
- Repeat

  • Select an action a and input it to the environment E (the action a will be carried out in E)
  • Observe the reward r and the new state s'
  • Update the table: Q(s, a) = (1-α) Q(s, a) + α (r + γ maxa' Q(s’, a’))
  • s = s'

- until terminated

Q-table    a1    a2    a3    …    an

s1        100   130    80    …   121
s2        200    99    99    …     2
s3         50    99   150    …     2
…           …     …     …    …     …
sn        101   124   124    …   199
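The sketch below implements the table-filling loop above in Python, assuming a hypothetical Gym-style environment with reset() and step() methods and integer state indices; the hyperparameters are illustrative, not prescribed by the slides.

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions,
                       n_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Fill a Q-table with Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))."""
    Q = np.zeros((n_states, n_actions))        # arbitrary initialization (here: zeros)
    for _ in range(n_episodes):
        s = env.reset()                        # observe the initial state
        done = False
        while not done:
            if np.random.rand() < epsilon:     # epsilon-greedy selection (see the later slide)
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)      # the action is carried out in the environment E
            bootstrap = 0.0 if done else np.max(Q[s_next])
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * bootstrap)
            s = s_next
    return Q
```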

Page 34: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Q-Learning

• The estimations get more and more accurate with every iteration and it has been shown that, if we perform this update enough times, then the Q-function will converge and represent the true Q-value.

• OK. BUT, how do we generalize a Q-function (or Q-table) to handle many similar problems at once? — e.g., the ATARI 2600 games.

Page 35: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Deep Q Network

• Q-Learning + Deep Neural Network
• DQN

• Google DeepMind (NIPS 2013 Workshop, Nature 2015)

Page 36: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

ATARI 2600

Page 37: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

ATARI 2600

Page 38: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

ATARI 2600

Page 39: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

ATARI 2600

Page 40: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Q-Learning — Breakout

• Modeling for Breakout:
- State: description of all the game elements such as ball, bar, and bricks
- Reward: score
- Output: game controls = ( left || right || space )

• BUT, how to handle all the other ATARI 2600 games?
- The problem of generalization!

(# bricks) * (x, y, on) + (x) for bar + (x, y) for ball

Page 41: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Q-Learning — All ATARI 2600 Games?

• Modeling for any Atari 2600 games:

- State: all the pixels in the game screens

- Reward: score
- Output: all the control actions in the joystick

Page 42: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Q-Learning — All ATARI 2600 Games?

• Modeling for any Atari 2600 games:

- State: 84x84 pixels * 4 frames * 256 gray

- Reward: score
- Output: 18 actions

[Embedded page from Mnih et al., Nature 518, 529–533 (2015): text describing the network architecture and the comparison of DQN with prior methods and a professional human games tester; Figure 1, a schematic of the convolutional network (84x84x4 input, three convolutional layers, two fully connected layers, one output per valid action); and Figure 2, training curves of the average score per episode and the average predicted action-value (Q) on Space Invaders and Seaquest.]

Page 43: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Q-Learning — All ATARI 2600 Games?

• Modeling for any Atari 2600 games:

- State: 84x84 pixels * 4 frames * 256 gray levels = 256^(84x84x4) ≈ 10^67970

- Reward: score
- Output: 18 actions


Page 44: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Q-Learning — All ATARI 2600 Games?

• Modeling for any Atari 2600 games:

- State: 84x84 pixels * 4 frames * 256 gray levels = 256^(84x84x4) ≈ 10^67970

- Reward: score
- Output: 18 actions
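The size of this raw state space can be checked with a couple of lines of arithmetic; the sketch below only reproduces the numbers quoted above.

```python
import math

pixels = 84 * 84 * 4                        # 84x84 screen, stack of 4 frames
gray_levels = 256
digits = pixels * math.log10(gray_levels)   # log10(256 ** pixels)
print(f"about 10^{round(digits)} possible screen states")   # about 10^67970
```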

Page 45: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Deep Q Network — All ATARI 2600 Games!

• We can hardly implement a Q-function as a table: size and sparsity!
• Now, deep learning steps in!

- A deep convolutional neural network (CNN) is especially good at extracting a small set of features from big data.

- We can replace Q-table with a deep neural network — DQN!

Q(s, an)

Page 46: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Deep Q Network — All ATARI 2600 Games!

Layer  Input      Filter size  Stride  Num filters  Activation  Output

conv1  84x84x4    8x8          4       32           ReLU        20x20x32

conv2  20x20x32   4x4          2       64           ReLU        9x9x64

conv3  9x9x64     3x3          1       64           ReLU        7x7x64

fc4    7x7x64     -            -       512          ReLU        512

fc5    512        -            -       18           Linear      18
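A sketch of this architecture in PyTorch (assuming PyTorch is available); the layer sizes follow the table above, and the final linear layer produces one Q-value per action.

```python
import torch.nn as nn

class DQN(nn.Module):
    """Convolutional Q-network: input (batch, 4, 84, 84), output (batch, n_actions) Q-values."""
    def __init__(self, n_actions=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # conv1: -> 32 x 20 x 20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # conv2: -> 64 x 9 x 9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # conv3: -> 64 x 7 x 7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),                  # fc4
            nn.Linear(512, n_actions),                              # fc5: linear Q-value outputs
        )

    def forward(self, x):
        # x is expected to be a float tensor of screen pixels, e.g. scaled to [0, 1].
        return self.net(x)
```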


Q(s, an)

Page 47: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Deep Q Network

• Loss

- To measure how well the neural network is trained: the lower, the better.
- Current Q by prediction: Q(s, a) — forward evaluation of the neural network
- Target Q from the new reward: r + γ maxa' Q(s’, a’) — forward evaluation

- L = 1/2 (current − target)² = 1/2 ( Q(s, a) − ( r + γ maxa' Q(s’, a’) ) )²

- Weights of a neural network are updated to minimize the loss — back propagation
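A sketch of this loss for a minibatch in PyTorch, reusing the DQN module above; the separate target_net argument anticipates the target-network idea from the DeepMind paper, and the constant 1/2 is dropped since it only rescales the gradient.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, s, a, r, s_next, done, gamma=0.99):
    """Squared error between Q(s,a) and the target r + gamma * max_a' Q(s',a')."""
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # current Q by prediction
    with torch.no_grad():                                         # no gradient through the target
        best_next = target_net(s_next).max(dim=1).values          # max_a' Q(s', a')
        target = r + gamma * best_next * (1.0 - done)             # target Q from the new reward
    return F.mse_loss(q_sa, target)
```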

Page 48: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Deep Q Network

• Experience Replay

- Training efficiency: “It takes a long time, almost a week on a single GPU.”
- Experience: <s, a, r, s’>
- The experience memory stores all the recent experiences — actually not all of them, but quite a few.
- Train on adjacent experiences? No! Use random samples from the experience memory to avoid local minima.
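A minimal replay-memory sketch: keep the most recent transitions in a bounded buffer and sample random (non-adjacent) minibatches from it; the capacity and batch size are illustrative.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores recent transitions <s, a, r, s', done> and samples random minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)     # oldest experiences are dropped automatically

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Random sampling breaks the correlation between consecutive frames.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```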

Page 49: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Deep Q Network

• So far, we have mainly focused on the “credit assignment problem,” especially in the context of Q-learning.

• Exploration-Exploitation Dilemma?

- At first, the Q-network selects actions more or less at random because of its random initialization — greedy exploration that finds a first (not necessarily the best) solution.

- However, it converges as training continues — exploitation around a local minimum.

• ε-greedy exploration

- “Maybe there is a better action we have not tried yet”: with probability ε, choose a random action; otherwise choose argmaxa' Q(s’, a’).
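A sketch of ε-greedy action selection with the network above; the choice of n_actions and the way the state tensor is batched are assumptions for illustration.

```python
import random
import torch

def select_action(q_net, state, epsilon, n_actions=18):
    """With probability epsilon explore randomly; otherwise exploit argmax_a Q(state, a)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)                # exploration: try something random
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))              # add a batch dimension
        return int(q_values.argmax(dim=1).item())         # exploitation: best known action
```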

Page 50: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

[Embedded Methods page from Mnih et al., Nature 2015: the Q-network loss and its gradient, experience replay over stored transitions, the separate target network cloned every C updates, clipping of the error term to [-1, 1], and Algorithm 1 (deep Q-learning with experience replay).]

Page 51: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Deep Q Network

• DQN Algorithm (a condensed code sketch follows below):
- Initialize replay memory D.
- Initialize the Q-network with random weights.
- Observe the initial state s
- Repeat

  • Select a random action a with probability ε; otherwise a = argmaxa’ Q(s, a’)

  • Input a to the environment E for a state transition
  • Observe the reward r and the new state s’, and store the transition <s, a, r, s’> in replay memory D
  • Sample random transitions <sd, ad, rd, sd’> from replay memory D
  • Calculate the target t for each mini-batch transition

    - If sd’ is a terminal state, then t = rd
    - Otherwise, t = rd + γ maxa’ Q(sd’, a’)

  • Train the Q-network with the loss L = (t − Q(sd, ad))² — updating the Q-network
  • s = s'
- until terminated
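Putting the pieces together, the sketch below condenses the algorithm above into a training loop, reusing the DQN, ReplayMemory, select_action, and dqn_loss sketches from the earlier slides; the environment API, ε schedule, and update frequencies are assumptions for illustration, not DeepMind's exact settings.

```python
import numpy as np
import torch

def train_dqn(env, q_net, target_net, memory, optimizer,
              n_episodes=1000, gamma=0.99, batch_size=32, sync_every=1000):
    """Condensed DQN loop: epsilon-greedy acting, replay sampling, loss minimization."""
    steps, epsilon = 0, 1.0
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            state_t = torch.as_tensor(s, dtype=torch.float32)
            a = select_action(q_net, state_t, epsilon)            # epsilon-greedy action
            s_next, r, done = env.step(a)                         # state transition in E
            memory.push(s, a, r, s_next, done)                    # store the transition in D
            s = s_next
            epsilon = max(0.1, epsilon * 0.9999)                  # anneal exploration (illustrative)
            if len(memory) >= batch_size:
                batch = memory.sample(batch_size)
                s_b, a_b, r_b, s2_b, d_b = (
                    torch.as_tensor(np.array(x), dtype=torch.float32) for x in zip(*batch))
                loss = dqn_loss(q_net, target_net, s_b, a_b, r_b, s2_b, d_b, gamma=gamma)
                optimizer.zero_grad()
                loss.backward()                                   # back-propagate the loss
                optimizer.step()
            steps += 1
            if steps % sync_every == 0:                           # every C steps, clone Q into the target network
                target_net.load_state_dict(q_net.state_dict())
```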

Page 52: Notes on Reinforcement Learning - v0.1

Examples: Game

Page 53: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Page 54: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Page 55: Notes on Reinforcement Learning - v0.1

Examples: Code

Page 56: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Page 57: Notes on Reinforcement Learning - v0.1

Implementation Notes: Mathematica

Page 58: Notes on Reinforcement Learning - v0.1

[Diagram: the state machine S0-S8 with actions a0-a2 and rewards on its transitions (-10, -10, +2, +20, +20, +100, -50)]

Page 59: Notes on Reinforcement Learning - v0.1

References: Paper

Page 60: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

                 B. Rider  Breakout  Enduro   Pong  Q*bert  Seaquest  S. Invaders

Random                354       1.2       0  -20.4     157       110          179

Sarsa [3]             996       5.2     129    -19      614       665          271

Contingency [4]      1743         6     159    -17      960       723          268

DQN                  4092       168     470     20     1952      1705          581
Human                7456        31     368     -3    18900     28010         3690

HNeat Best [8]       3616        52     106     19     1800       920         1720
HNeat Pixel [8]      1332         4      91    -16     1325       800         1145

DQN Best             5184       225     661     21     4500      1740         1075

Table 1: The upper table compares average total reward for various learning methods by running an ε-greedy policy with ε = 0.05 for a fixed number of steps. The lower table reports results of the single best performing episode for HNeat and DQN. HNeat produces deterministic policies that always get the same score, while DQN used an ε-greedy policy with ε = 0.05.

[Remaining text of the embedded NIPS 2013 workshop paper: discussion of the results in Table 1, the conclusion, the paper's reference list, the abstract, the introduction, Figure 1 (screenshots of Pong, Breakout, Space Invaders, Seaquest, and Beam Rider), and the background section defining the discounted return and the optimal action-value function.]

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. — NIPS 2013 Deep Learning Workshop

Page 61: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533. doi:10.1038/nature14236

The theory of reinforcement learning provides a normative account1,deeply rooted in psychological2 and neuroscientific3 perspectives onanimal behaviour, of how agents may optimize their control of anenvironment. To use reinforcement learning successfully in situationsapproaching real-world complexity, however, agents are confrontedwith a difficult task: they must derive efficient representations of theenvironment from high-dimensional sensory inputs, and use theseto generalize past experience to new situations. Remarkably, humansand other animals seem to solve this problem through a harmoniouscombination of reinforcement learning and hierarchical sensory pro-cessing systems4,5, the former evidenced by a wealth of neural datarevealing notable parallels between the phasic signals emitted by dopa-minergic neurons and temporal difference reinforcement learningalgorithms3. While reinforcement learning agents have achieved somesuccesses in a variety of domains6–8, their applicability has previouslybeen limited to domains in which useful features can be handcrafted,or to domains with fully observed, low-dimensional state spaces.Here we use recent advances in training deep neural networks9–11 todevelop a novel artificial agent, termed a deep Q-network, that canlearn successful policies directly from high-dimensional sensory inputsusing end-to-end reinforcement learning. We tested this agent onthe challenging domain of classic Atari 2600 games12. We demon-strate that the deep Q-network agent, receiving only the pixels andthe game score as inputs, was able to surpass the performance of allprevious algorithms and achieve a level comparable to that of a pro-fessional human games tester across a set of 49 games, using the samealgorithm, network architecture and hyperparameters. This workbridges the divide between high-dimensional sensory inputs andactions, resulting in the first artificial agent that is capable of learn-ing to excel at a diverse array of challenging tasks.

We set out to create a single algorithm that would be able to developa wide range of competencies on a varied range of challenging tasks—acentral goal of general artificial intelligence13 that has eluded previousefforts8,14,15. To achieve this, we developed a novel agent, a deep Q-network(DQN), which is able to combine reinforcement learning with a classof artificial neural network16 known as deep neural networks. Notably,recent advances in deep neural networks9–11, in which several layers ofnodes are used to build up progressively more abstract representationsof the data, have made it possible for artificial neural networks to learnconcepts such as object categories directly from raw sensory data. Weuse one particularly successful architecture, the deep convolutionalnetwork17, which uses hierarchical layers of tiled convolutional filtersto mimic the effects of receptive fields—inspired by Hubel and Wiesel’sseminal work on feedforward processing in early visual cortex18—therebyexploiting the local spatial correlations present in images, and buildingin robustness to natural transformations such as changes of viewpointor scale.

We consider tasks in which the agent interacts with an environmentthrough a sequence of observations, actions and rewards. The goal of the

agent is to select actions in a fashion that maximizes cumulative futurereward. More formally, we use a deep convolutional neural network toapproximate the optimal action-value function

Q! s,að Þ~ maxp

rtzcrtz1zc2rtz2z . . . jst~s, at~a, p! "

,

which is the maximum sum of rewards rt discounted by c at each time-step t, achievable by a behaviour policy p 5 P(ajs), after making anobservation (s) and taking an action (a) (see Methods)19.

Reinforcement learning is known to be unstable or even to divergewhen a nonlinear function approximator such as a neural network isused to represent the action-value (also known as Q) function20. Thisinstability has several causes: the correlations present in the sequenceof observations, the fact that small updates to Q may significantly changethe policy and therefore change the data distribution, and the correlationsbetween the action-values (Q) and the target values rzc max

a0Q s0, a0ð Þ.

We address these instabilities with a novel variant of Q-learning, whichuses two key ideas. First, we used a biologically inspired mechanismtermed experience replay21–23 that randomizes over the data, therebyremoving correlations in the observation sequence and smoothing overchanges in the data distribution (see below for details). Second, we usedan iterative update that adjusts the action-values (Q) towards targetvalues that are only periodically updated, thereby reducing correlationswith the target.

While other stable methods exist for training neural networks in thereinforcement learning setting, such as neural fitted Q-iteration24, thesemethods involve the repeated training of networks de novo on hundredsof iterations. Consequently, these methods, unlike our algorithm, aretoo inefficient to be used successfully with large neural networks. Weparameterize an approximate value function Q(s,a;hi) using the deepconvolutional neural network shown in Fig. 1, in which hi are the param-eters (that is, weights) of the Q-network at iteration i. To performexperience replay we store the agent’s experiences et 5 (st,at,rt,st 1 1)at each time-step t in a data set Dt 5 {e1,…,et}. During learning, weapply Q-learning updates, on samples (or minibatches) of experience(s,a,r,s9) , U(D), drawn uniformly at random from the pool of storedsamples. The Q-learning update at iteration i uses the following lossfunction:

Li hið Þ~ s,a,r,s0ð Þ*U Dð Þ rzc maxa0

Q(s0,a0; h{i ){Q s,a; hið Þ

# $2" #

in which c is the discount factor determining the agent’s horizon, hi arethe parameters of the Q-network at iteration i and h{

i are the networkparameters used to compute the target at iteration i. The target net-work parameters h{

i are only updated with the Q-network parameters(hi) every C steps and are held fixed between individual updates (seeMethods).

To evaluate our DQN agent, we took advantage of the Atari 2600platform, which offers a diverse array of tasks (n 5 49) designed to be

*These authors contributed equally to this work.

1Google DeepMind, 5 New Street Square, London EC4A 3TW, UK.

2 6 F E B R U A R Y 2 0 1 5 | V O L 5 1 8 | N A T U R E | 5 2 9

Macmillan Publishers Limited. All rights reserved©2015

difficult and engaging for human players. We used the same networkarchitecture, hyperparameter values (see Extended Data Table 1) andlearning procedure throughout—taking high-dimensional data (210|160colour video at 60 Hz) as input—to demonstrate that our approachrobustly learns successful policies over a variety of games based solelyon sensory inputs with only very minimal prior knowledge (that is, merelythe input data were visual images, and the number of actions availablein each game, but not their correspondences; see Methods). Notably,our method was able to train large neural networks using a reinforce-ment learning signal and stochastic gradient descent in a stable manner—illustrated by the temporal evolution of two indices of learning (theagent’s average score-per-episode and average predicted Q-values; seeFig. 2 and Supplementary Discussion for details).

We compared DQN with the best performing methods from thereinforcement learning literature on the 49 games where results wereavailable12,15. In addition to the learned agents, we also report scores fora professional human games tester playing under controlled conditionsand a policy that selects actions uniformly at random (Extended DataTable 2 and Fig. 3, denoted by 100% (human) and 0% (random) on yaxis; see Methods). Our DQN method outperforms the best existingreinforcement learning methods on 43 of the games without incorpo-rating any of the additional prior knowledge about Atari 2600 gamesused by other approaches (for example, refs 12, 15). Furthermore, ourDQN agent performed at a level that was comparable to that of a pro-fessional human games tester across the set of 49 games, achieving morethan 75% of the human score on more than half of the games (29 games;

Convolution Convolution Fully connected Fully connected

No input

Figure 1 | Schematic illustration of the convolutional neural network. Thedetails of the architecture are explained in the Methods. The input to the neuralnetwork consists of an 84 3 84 3 4 image produced by the preprocessingmap w, followed by three convolutional layers (note: snaking blue line

symbolizes sliding of each filter across input image) and two fully connectedlayers with a single output for each valid action. Each hidden layer is followedby a rectifier nonlinearity (that is, max 0,xð Þ).

a b

c d

0 200 400 600 800

1,000 1,200 1,400 1,600 1,800 2,000 2,200

0 20 40 60 80 100 120 140 160 180 200

Ave

rage

sco

re p

er e

piso

de

Training epochs

0 1 2 3 4 5 6 7 8 9

10 11

0 20 40 60 80 100 120 140 160 180 200A

vera

ge a

ctio

n va

lue

(Q)

Training epochs

0

1,000

2,000

3,000

4,000

5,000

6,000

0 20 40 60 80 100 120 140 160 180 200

Ave

rage

sco

re p

er e

piso

de

Training epochs

0 1 2 3 4 5 6 7 8 9

10

0 20 40 60 80 100 120 140 160 180 200

Ave

rage

act

ion

valu

e (Q

)

Training epochs

Figure 2 | Training curves tracking the agent’s average score and averagepredicted action-value. a, Each point is the average score achieved per episodeafter the agent is run with e-greedy policy (e 5 0.05) for 520 k frames on SpaceInvaders. b, Average score achieved per episode for Seaquest. c, Averagepredicted action-value on a held-out set of states on Space Invaders. Each point

on the curve is the average of the action-value Q computed over the held-outset of states. Note that Q-values are scaled due to clipping of rewards (seeMethods). d, Average predicted action-value on Seaquest. See SupplementaryDiscussion for details.

RESEARCH LETTER

5 3 0 | N A T U R E | V O L 5 1 8 | 2 6 F E B R U A R Y 2 0 1 5

Macmillan Publishers Limited. All rights reserved©2015

see Fig. 3, Supplementary Discussion and Extended Data Table 2). In additional simulations (see Supplementary Discussion and Extended Data Tables 3 and 4), we demonstrate the importance of the individual core components of the DQN agent—the replay memory, separate target Q-network and deep convolutional network architecture—by disabling them and demonstrating the detrimental effects on performance.

We next examined the representations learned by DQN that underpinned the successful performance of the agent in the context of the game Space Invaders (see Supplementary Video 1 for a demonstration of the performance of DQN), by using a technique developed for the visualization of high-dimensional data called 't-SNE'25 (Fig. 4). As expected, the t-SNE algorithm tends to map the DQN representation of perceptually similar states to nearby points. Interestingly, we also found instances in which the t-SNE algorithm generated similar embeddings for DQN representations of states that are close in terms of expected reward but

perceptually dissimilar (Fig. 4, bottom right, top left and middle), consistent with the notion that the network is able to learn representations that support adaptive behaviour from high-dimensional sensory inputs. Furthermore, we also show that the representations learned by DQN are able to generalize to data generated from policies other than its own—in simulations where we presented as input to the network game states experienced during human and agent play, recorded the representations of the last hidden layer, and visualized the embeddings generated by the t-SNE algorithm (Extended Data Fig. 1 and Supplementary Discussion). Extended Data Fig. 2 provides an additional illustration of how the representations learned by DQN allow it to accurately predict state and action values.

It is worth noting that the games in which DQN excels are extremely varied in their nature, from side-scrolling shooters (River Raid) to boxing games (Boxing) and three-dimensional car-racing games (Enduro).

[Figure 3 bar chart: normalized score of DQN versus the best linear learner for each game, ordered by DQN's normalized score from lowest to highest: Montezuma's Revenge, Private Eye, Gravitar, Frostbite, Asteroids, Ms. Pac-Man, Bowling, Double Dunk, Seaquest, Venture, Alien, Amidar, River Raid, Bank Heist, Zaxxon, Centipede, Chopper Command, Wizard of Wor, Battle Zone, Asterix, H.E.R.O., Q*bert, Ice Hockey, Up and Down, Fishing Derby, Enduro, Time Pilot, Freeway, Kung-Fu Master, Tutankham, Beam Rider, Space Invaders, Pong, James Bond, Tennis, Kangaroo, Road Runner, Assault, Krull, Name This Game, Demon Attack, Gopher, Crazy Climber, Atlantis, Robotank, Star Gunner, Breakout, Boxing, Video Pinball. Games are grouped as "Below human-level" or "At human-level or above"; the numeric axis (0% to roughly 4,500%) is omitted.]

Figure 3 | Comparison of the DQN agent with the best reinforcement learning methods15 in the literature. The performance of DQN is normalized with respect to a professional human games tester (that is, 100% level) and random play (that is, 0% level). Note that the normalized performance of DQN, expressed as a percentage, is calculated as: 100 × (DQN score − random play score)/(human score − random play score). It can be seen that DQN outperforms competing methods (also see Extended Data Table 2) in almost all the games, and performs at a level that is broadly comparable with or superior to a professional human games tester (that is, operationalized as a level of 75% or above) in the majority of games. Audio output was disabled for both human players and agents. Error bars indicate s.d. across the 30 evaluation episodes, starting with different initial conditions.


Page 62: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Indeed, in certain games DQN is able to discover a relatively long-term strategy (for example, Breakout: the agent learns the optimal strategy, which is to first dig a tunnel around the side of the wall allowing the ball to be sent around the back to destroy a large number of blocks; see Supplementary Video 2 for illustration of development of DQN's performance over the course of training). Nevertheless, games demanding more temporally extended planning strategies still constitute a major challenge for all existing agents including DQN (for example, Montezuma's Revenge).

In this work, we demonstrate that a single architecture can successfully learn control policies in a range of different environments with only very minimal prior knowledge, receiving only the pixels and the game score as inputs, and using the same algorithm, network architecture and hyperparameters on each game, privy only to the inputs a human player would have. In contrast to previous work24,26, our approach incorporates 'end-to-end' reinforcement learning that uses reward to continuously shape representations within the convolutional network towards salient features of the environment that facilitate value estimation. This principle draws on neurobiological evidence that reward signals during perceptual learning may influence the characteristics of representations within primate visual cortex27,28. Notably, the successful integration of reinforcement learning with deep network architectures was critically dependent on our incorporation of a replay algorithm21–23 involving the storage and representation of recently experienced transitions. Convergent evidence suggests that the hippocampus may support the physical

realization of such a process in the mammalian brain, with the time-compressed reactivation of recently experienced trajectories during offline periods21,22 (for example, waking rest) providing a putative mechanism by which value functions may be efficiently updated through interactions with the basal ganglia22. In the future, it will be important to explore the potential use of biasing the content of experience replay towards salient events, a phenomenon that characterizes empirically observed hippocampal replay29, and relates to the notion of 'prioritized sweeping'30 in reinforcement learning. Taken together, our work illustrates the power of harnessing state-of-the-art machine learning techniques with biologically inspired mechanisms to create agents that are capable of learning to master a diverse array of challenging tasks.

Online Content Methods, along with any additional Extended Data display items and Source Data, are available in the online version of the paper; references unique to these sections appear only in the online paper.

Received 10 July 2014; accepted 16 January 2015.

1. Sutton, R. & Barto, A. Reinforcement Learning: An Introduction (MIT Press, 1998).
2. Thorndike, E. L. Animal Intelligence: Experimental studies (Macmillan, 1911).
3. Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
4. Serre, T., Wolf, L. & Poggio, T. Object recognition with features inspired by visual cortex. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 994–1000 (2005).
5. Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36, 193–202 (1980).


Figure 4 | Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced while playing Space Invaders. The plot was generated by letting the DQN agent play for 2 h of real game time and running the t-SNE algorithm25 on the last hidden layer representations assigned by DQN to each experienced game state. The points are coloured according to the state values (V, maximum expected reward of a state) predicted by DQN for the corresponding game states (ranging from dark red (highest V) to dark blue (lowest V)). The screenshots corresponding to a selected number of points are shown. The DQN agent predicts high state values for both full (top right screenshots) and nearly complete screens (bottom left screenshots) because it has learned that completing a screen leads to a new screen full of enemy ships. Partially completed screens (bottom screenshots) are assigned lower state values because less immediate reward is available. The screens shown on the bottom right and top left and middle are less perceptually similar than the other examples but are still mapped to nearby representations and similar values because the orange bunkers do not carry great significance near the end of a level. With permission from Square Enix Limited.


METHODS
Preprocessing. Working directly with raw Atari 2600 frames, which are 210 × 160 pixel images with a 128-colour palette, can be demanding in terms of computation and memory requirements. We apply a basic preprocessing step aimed at reducing the input dimensionality and dealing with some artefacts of the Atari 2600 emulator. First, to encode a single frame we take the maximum value for each pixel colour value over the frame being encoded and the previous frame. This was necessary to remove flickering that is present in games where some objects appear only in even frames while other objects appear only in odd frames, an artefact caused by the limited number of sprites Atari 2600 can display at once. Second, we then extract the Y channel, also known as luminance, from the RGB frame and rescale it to 84 × 84. The function φ from algorithm 1 described below applies this preprocessing to the m most recent frames and stacks them to produce the input to the Q-function, in which m = 4, although the algorithm is robust to different values of m (for example, 3 or 5).
Code availability. The source code can be accessed at https://sites.google.com/a/deepmind.com/dqn for non-commercial uses only.
Model architecture. There are several possible ways of parameterizing Q using a neural network. Because Q maps history–action pairs to scalar estimates of their Q-value, the history and the action have been used as inputs to the neural network by some previous approaches24,26. The main drawback of this type of architecture is that a separate forward pass is required to compute the Q-value of each action, resulting in a cost that scales linearly with the number of actions. We instead use an architecture in which there is a separate output unit for each possible action, and only the state representation is an input to the neural network. The outputs correspond to the predicted Q-values of the individual actions for the input state. The main advantage of this type of architecture is the ability to compute Q-values for all possible actions in a given state with only a single forward pass through the network.
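As a concrete illustration of the preprocessing map φ described above, here is a minimal Python sketch. It assumes raw RGB frames as NumPy arrays and uses OpenCV only for resizing; the helper names are our own, not from the paper's released code.

```python
import numpy as np
import cv2

def preprocess(frame, prev_frame):
    """Encode one Atari frame: max over two consecutive frames, luminance, resize to 84x84."""
    # Max over the current and previous frame to remove sprite flickering.
    merged = np.maximum(frame, prev_frame)
    # Extract the Y (luminance) channel from the RGB image.
    y = 0.299 * merged[..., 0] + 0.587 * merged[..., 1] + 0.114 * merged[..., 2]
    # Rescale to 84 x 84.
    return cv2.resize(y.astype(np.float32), (84, 84), interpolation=cv2.INTER_AREA)

def stack_frames(processed_frames):
    """phi: stack the m = 4 most recent preprocessed frames into one 84 x 84 x 4 input."""
    return np.stack(processed_frames[-4:], axis=-1)
```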

The exact architecture, shown schematically in Fig. 1, is as follows. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map φ. The first hidden layer convolves 32 filters of 8 × 8 with stride 4 with the input image and applies a rectifier nonlinearity31,32. The second hidden layer convolves 64 filters of 4 × 4 with stride 2, again followed by a rectifier nonlinearity. This is followed by a third convolutional layer that convolves 64 filters of 3 × 3 with stride 1 followed by a rectifier. The final hidden layer is fully-connected and consists of 512 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action. The number of valid actions varied between 4 and 18 on the games we considered.
Training details. We performed experiments on 49 Atari 2600 games where results were available for all other comparable methods12,15. A different network was trained on each game: the same network architecture, learning algorithm and hyperparameter settings (see Extended Data Table 1) were used across all games, showing that our approach is robust enough to work on a variety of games while incorporating only minimal prior knowledge (see below). While we evaluated our agents on unmodified games, we made one change to the reward structure of the games during training only. As the scale of scores varies greatly from game to game, we clipped all positive rewards at 1 and all negative rewards at −1, leaving 0 rewards unchanged. Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games. At the same time, it could affect the performance of our agent since it cannot differentiate between rewards of different magnitude. For games where there is a life counter, the Atari 2600 emulator also sends the number of lives left in the game, which is then used to mark the end of an episode during training.
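For orientation, the layer sizes listed under "Model architecture" can be transcribed into a short PyTorch sketch (the original DeepMind implementation used different tooling; this is only a faithful-looking reconstruction of the stated shapes):

```python
import torch.nn as nn

class DQN(nn.Module):
    """Convolutional Q-network with the layer sizes described in the Methods text."""
    def __init__(self, n_actions, in_frames=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4),  # 84x84 -> 20x20
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),         # 20x20 -> 9x9
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),         # 9x9 -> 7x7
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),  # fully connected layer of 512 rectifier units
            nn.ReLU(),
            nn.Linear(512, n_actions),   # one linear output per valid action (4 to 18)
        )

    def forward(self, x):
        # x: batch of stacked frames, shape (N, 4, 84, 84)
        return self.head(self.features(x))
```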

In these experiments, we used the RMSProp algorithm (see http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) with minibatches of size 32. The behaviour policy during training was ε-greedy with ε annealed linearly from 1.0 to 0.1 over the first million frames, and fixed at 0.1 thereafter. We trained for a total of 50 million frames (that is, around 38 days of game experience in total) and used a replay memory of 1 million most recent frames.
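The annealing schedule quoted above is a simple linear ramp; a small sketch (constants mirror the values in the text, the function name is our own):

```python
def epsilon_at(frame_idx, eps_start=1.0, eps_end=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon from 1.0 to 0.1 over the first million frames, then hold at 0.1."""
    fraction = min(frame_idx / anneal_frames, 1.0)
    return eps_start + fraction * (eps_end - eps_start)
```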

Following previous approaches to playing Atari 2600 games, we also use a simple frame-skipping technique15. More precisely, the agent sees and selects actions on every kth frame instead of every frame, and its last action is repeated on skipped frames. Because running the emulator forward for one step requires much less computation than having the agent select an action, this technique allows the agent to play roughly k times more games without significantly increasing the runtime. We use k = 4 for all games.

The values of all the hyperparameters and optimization parameters were selected by performing an informal search on the games Pong, Breakout, Seaquest, Space Invaders and Beam Rider. We did not perform a systematic grid search owing to the high computational cost. These parameters were then held fixed across all other games. The values and descriptions of all hyperparameters are provided in Extended Data Table 1.

Our experimental setup amounts to using the following minimal prior knowledge: that the input data consisted of visual images (motivating our use of a convolutional deep network), the game-specific score (with no modification), number of actions, although not their correspondences (for example, specification of the up 'button') and the life count.
Evaluation procedure. The trained agents were evaluated by playing each game 30 times for up to 5 min each time with different initial random conditions ('no-op'; see Extended Data Table 1) and an ε-greedy policy with ε = 0.05. This procedure is adopted to minimize the possibility of overfitting during evaluation. The random agent served as a baseline comparison and chose a random action at 10 Hz, which is every sixth frame, repeating its last action on intervening frames. 10 Hz is about the fastest that a human player can select the 'fire' button, and setting the random agent to this frequency avoids spurious baseline scores in a handful of the games. We did also assess the performance of a random agent that selected an action at 60 Hz (that is, every frame). This had a minimal effect: changing the normalized DQN performance by more than 5% in only six games (Boxing, Breakout, Crazy Climber, Demon Attack, Krull and Robotank), and in all these games DQN outperformed the expert human by a considerable margin.

The professional human tester used the same emulator engine as the agents, and played under controlled conditions. The human tester was not allowed to pause, save or reload games. As in the original Atari 2600 environment, the emulator was run at 60 Hz and the audio output was disabled: as such, the sensory input was equated between human player and agents. The human performance is the average reward achieved from around 20 episodes of each game lasting a maximum of 5 min each, following around 2 h of practice playing each game.
Algorithm. We consider tasks in which an agent interacts with an environment, in this case the Atari emulator, in a sequence of actions, observations and rewards. At each time-step the agent selects an action a_t from the set of legal game actions, A = {1, ..., K}. The action is passed to the emulator and modifies its internal state and the game score. In general the environment may be stochastic. The emulator's internal state is not observed by the agent; instead the agent observes an image x_t ∈ ℝ^d from the emulator, which is a vector of pixel values representing the current screen. In addition it receives a reward r_t representing the change in game score. Note that in general the game score may depend on the whole previous sequence of actions and observations; feedback about an action may only be received after many thousands of time-steps have elapsed.

Because the agent only observes the current screen, the task is partially observed33 and many emulator states are perceptually aliased (that is, it is impossible to fully understand the current situation from only the current screen x_t). Therefore, sequences of actions and observations, s_t = x_1, a_1, x_2, ..., a_{t−1}, x_t, are input to the algorithm, which then learns game strategies depending upon these sequences. All sequences in the emulator are assumed to terminate in a finite number of time-steps. This formalism gives rise to a large but finite Markov decision process (MDP) in which each sequence is a distinct state. As a result, we can apply standard reinforcement learning methods for MDPs, simply by using the complete sequence s_t

as the state representation at time t.

The goal of the agent is to interact with the emulator by selecting actions in a way that maximizes future rewards. We make the standard assumption that future rewards are discounted by a factor of γ per time-step (γ was set to 0.99 throughout), and define the future discounted return at time t as

$$R_t = \sum_{t'=t}^{T} \gamma^{t'-t}\, r_{t'},$$

in which T is the time-step at which the game terminates. We define the optimal action-value function Q*(s, a) as the maximum expected return achievable by following any policy, after seeing some sequence s and then taking some action a, Q*(s, a) = max_π E[R_t | s_t = s, a_t = a, π], in which π is a policy mapping sequences to actions (or distributions over actions).
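For concreteness, the discounted return for a finished episode can be computed by working backwards through the reward sequence (a tiny helper of our own, not from the paper):

```python
def discounted_return(rewards, gamma=0.99):
    """R_t for the first step of `rewards`: r_t + gamma*r_{t+1} + ... up to termination at T."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three rewards of 1 with gamma = 0.99 give 1 + 0.99 + 0.99**2 = 2.9701.
assert abs(discounted_return([1.0, 1.0, 1.0]) - 2.9701) < 1e-9
```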

The optimal action-value function obeys an important identity known as the Bellman equation. This is based on the following intuition: if the optimal value Q*(s', a') of the sequence s' at the next time-step was known for all possible actions a', then the optimal strategy is to select the action a' maximizing the expected value of r + γQ*(s', a'):

$$Q^{*}(s, a) = \mathbb{E}_{s'}\big[\, r + \gamma \max_{a'} Q^{*}(s', a') \mid s, a \,\big].$$

The basic idea behind many reinforcement learning algorithms is to estimate the action-value function by using the Bellman equation as an iterative update, Q_{i+1}(s, a) = E_{s'}[ r + γ max_{a'} Q_i(s', a') | s, a ]. Such value iteration algorithms converge to the optimal action-value function, Q_i → Q* as i → ∞. In practice, this basic approach is impractical, because the action-value function is estimated separately for each sequence, without any generalization. Instead, it is common to use a function approximator to estimate the action-value function, Q(s, a; θ) ≈ Q*(s, a). In the reinforcement learning community this is typically a linear function approximator, but


sometimes a nonlinear function approximator is used instead, such as a neural network. We refer to a neural network function approximator with weights θ as a Q-network. A Q-network can be trained by adjusting the parameters θ_i at iteration i to reduce the mean-squared error in the Bellman equation, where the optimal target values r + γ max_{a'} Q*(s', a') are substituted with approximate target values y = r + γ max_{a'} Q(s', a'; θ_i^-), using parameters θ_i^- from some previous iteration. This leads to a sequence of loss functions L_i(θ_i) that changes at each iteration i,

$$L_i(\theta_i) = \mathbb{E}_{s,a,r}\Big[\big(\mathbb{E}_{s'}[y \mid s,a] - Q(s,a;\theta_i)\big)^2\Big] = \mathbb{E}_{s,a,r,s'}\Big[\big(y - Q(s,a;\theta_i)\big)^2\Big] + \mathbb{E}_{s,a,r}\Big[\mathrm{Var}_{s'}[y]\Big].$$

Note that the targets depend on the network weights; this is in contrast with the targets used for supervised learning, which are fixed before learning begins. At each stage of optimization, we hold the parameters from the previous iteration θ_i^- fixed when optimizing the ith loss function L_i(θ_i), resulting in a sequence of well-defined optimization problems. The final term is the variance of the targets, which does not depend on the parameters θ_i that we are currently optimizing, and may therefore be ignored. Differentiating the loss function with respect to the weights we arrive at the following gradient:

$$\nabla_{\theta_i} L(\theta_i) = \mathbb{E}_{s,a,r,s'}\Big[\big(r + \gamma \max_{a'} Q(s',a';\theta_i^-) - Q(s,a;\theta_i)\big)\,\nabla_{\theta_i} Q(s,a;\theta_i)\Big].$$
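In code, the single-sample form of this gradient is what automatic differentiation produces from the squared TD error. A minimal PyTorch sketch (variable names are ours; online_net and target_net are two copies of the Q-network sketched earlier):

```python
import torch
import torch.nn.functional as F

def td_loss(online_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    """Mean of (y - Q(s, a; theta))^2 over a minibatch, with targets from the frozen theta^-."""
    q_sa = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s, a; theta)
    with torch.no_grad():  # targets use theta^- and carry no gradient
        next_q = target_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * next_q
    return F.mse_loss(q_sa, y)
```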

Rather than computing the full expectations in the above gradient, it is often computationally expedient to optimize the loss function by stochastic gradient descent. The familiar Q-learning algorithm19 can be recovered in this framework by updating the weights after every time step, replacing the expectations using single samples, and setting θ_i^- = θ_{i−1}.

Note that this algorithm is model-free: it solves the reinforcement learning task directly using samples from the emulator, without explicitly estimating the reward and transition dynamics P(r, s' | s, a). It is also off-policy: it learns about the greedy policy a = argmax_{a'} Q(s, a'; θ), while following a behaviour distribution that ensures adequate exploration of the state space. In practice, the behaviour distribution is often selected by an ε-greedy policy that follows the greedy policy with probability 1 − ε and selects a random action with probability ε.
Training algorithm for deep Q-networks. The full algorithm for training deep Q-networks is presented in Algorithm 1. The agent selects and executes actions according to an ε-greedy policy based on Q. Because using histories of arbitrary length as inputs to a neural network can be difficult, our Q-function instead works on a fixed-length representation of histories produced by the function φ described above. The algorithm modifies standard online Q-learning in two ways to make it suitable for training large neural networks without diverging.
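The ε-greedy behaviour policy described above can be sketched as follows, reusing the PyTorch Q-network from earlier (select_action is our own helper name):

```python
import random
import torch

def select_action(q_net, state, eps, n_actions):
    """With probability eps take a random action; otherwise act greedily w.r.t. Q(s, a; theta)."""
    if random.random() < eps:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))  # state: float tensor of shape (4, 84, 84)
        return int(q_values.argmax(dim=1).item())
```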

First, we use a technique known as experience replay23 in which we store the agent's experiences at each time-step, e_t = (s_t, a_t, r_t, s_{t+1}), in a data set D_t = {e_1, ..., e_t}, pooled over many episodes (where the end of an episode occurs when a terminal state is reached) into a replay memory. During the inner loop of the algorithm, we apply Q-learning updates, or minibatch updates, to samples of experience, (s, a, r, s') ∼ U(D), drawn at random from the pool of stored samples. This approach has several advantages over standard online Q-learning. First, each step of experience is potentially used in many weight updates, which allows for greater data efficiency. Second, learning directly from consecutive samples is inefficient, owing to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. Third, when learning on-policy the current parameters determine the next data sample that the parameters are trained on. For example, if the maximizing action is to move left then the training samples will be dominated by samples from the left-hand side; if the maximizing action then switches to the right then the training distribution will also switch. It is easy to see how unwanted feedback loops may arise and the parameters could get stuck in a poor local minimum, or even diverge catastrophically20. By using experience replay the behaviour distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters. Note that when learning by experience replay, it is necessary to learn off-policy (because our current parameters are different to those used to generate the sample), which motivates the choice of Q-learning.
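A uniform replay memory of the kind described here can be sketched in a few lines (a simplification; real implementations store preprocessed frames far more compactly):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of transitions (s, a, r, s', done) with uniform sampling."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are overwritten when full

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)  # (s, a, r, s') ~ U(D)
        return map(list, zip(*batch))  # columns: states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```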

In practice, our algorithm only stores the last N experience tuples in the replay memory, and samples uniformly at random from D when performing updates. This approach is in some respects limited because the memory buffer does not differentiate important transitions and always overwrites with recent transitions owing to the finite memory size N. Similarly, the uniform sampling gives equal importance to all transitions in the replay memory. A more sophisticated sampling strategy might emphasize transitions from which we can learn the most, similar to prioritized sweeping30.

The second modification to online Q-learning aimed at further improving the stability of our method with neural networks is to use a separate network for generating the targets y_j in the Q-learning update. More precisely, every C updates we clone the network Q to obtain a target network Q̂ and use Q̂ for generating the Q-learning targets y_j for the following C updates to Q. This modification makes the algorithm more stable compared to standard online Q-learning, where an update that increases Q(s_t, a_t) often also increases Q(s_{t+1}, a) for all a and hence also increases the target y_j, possibly leading to oscillations or divergence of the policy. Generating the targets using an older set of parameters adds a delay between the time an update to Q is made and the time the update affects the targets y_j, making divergence or oscillations much more unlikely.
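The periodic cloning is a one-line parameter copy. A sketch (C is the update period from the text; the constant shown is only illustrative):

```python
def maybe_sync_target(update_count, online_net, target_net, C=10_000):
    """Every C updates, clone the online Q-network into the frozen target network Q-hat."""
    if update_count % C == 0:
        target_net.load_state_dict(online_net.state_dict())
```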

We also found it helpful to clip the error term from the update, r + γ max_{a'} Q(s', a'; θ_i^-) − Q(s, a; θ_i), to be between −1 and 1. Because the absolute value loss function |x| has a derivative of −1 for all negative values of x and a derivative of 1 for all positive values of x, clipping the squared error to be between −1 and 1 corresponds to using an absolute value loss function for errors outside of the (−1, 1) interval. This form of error clipping further improved the stability of the algorithm.

Algorithm 1: deep Q-learning with experience replay.
Initialize replay memory D to capacity N
Initialize action-value function Q with random weights θ
Initialize target action-value function Q̂ with weights θ⁻ = θ
For episode = 1, M do
    Initialize sequence s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
    For t = 1, T do
        With probability ε select a random action a_t
        otherwise select a_t = argmax_a Q(φ(s_t), a; θ)
        Execute action a_t in emulator and observe reward r_t and image x_{t+1}
        Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
        Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
        Sample random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
        Set y_j = r_j                                        if episode terminates at step j + 1
            y_j = r_j + γ max_{a'} Q̂(φ_{j+1}, a'; θ⁻)        otherwise
        Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))² with respect to the network parameters θ
        Every C steps reset Q̂ = Q
    End For
End For
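Tying the pieces together, one inner-loop update of Algorithm 1 might look like the sketch below. The error clipping described above behaves like a Huber-style loss, which is approximated here with PyTorch's smooth_l1_loss; memory and optimizer refer to the helpers sketched earlier, and states are assumed to be stored as (4, 84, 84) float tensors.

```python
import torch
import torch.nn.functional as F

def dqn_update(online_net, target_net, memory, optimizer, batch_size=32, gamma=0.99):
    """One minibatch step of Algorithm 1 with the clipped (absolute-value-tailed) error term."""
    states, actions, rewards, next_states, dones = memory.sample(batch_size)
    states = torch.stack(states)
    actions = torch.tensor(actions, dtype=torch.long)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    next_states = torch.stack(next_states)
    dones = torch.tensor(dones, dtype=torch.float32)

    q_sa = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        y = rewards + gamma * (1.0 - dones) * target_net(next_states).max(dim=1).values
    loss = F.smooth_l1_loss(q_sa, y)  # quadratic near zero, linear outside (-1, 1)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```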

31. Jarrett, K., Kavukcuoglu, K., Ranzato, M. A. & LeCun, Y. What is the best multi-stage architecture for object recognition? Proc. IEEE Int. Conf. Comput. Vis. 2146–2153 (2009).
32. Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. Proc. Int. Conf. Mach. Learn. 807–814 (2010).
33. Kaelbling, L. P., Littman, M. L. & Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101, 99–134 (1994).


Page 63: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]


Page 64: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Page 65: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Extended Data Figure 1 | Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced during a combination of human and agent play in Space Invaders. The plot was generated by running the t-SNE algorithm25 on the last hidden layer representation assigned by DQN to game states experienced during a combination of human (30 min) and agent (2 h) play. The fact that there is similar structure in the two-dimensional embeddings corresponding to the DQN representation of states experienced during human play (orange points) and DQN play (blue points) suggests that the representations learned by DQN do indeed generalize to data generated from policies other than its own. The presence in the t-SNE embedding of overlapping clusters of points corresponding to the network representation of states experienced during human and agent play shows that the DQN agent also follows sequences of states similar to those found in human play. Screenshots corresponding to selected states are shown (human: orange border; DQN: blue border).


Extended Data Figure 2 | Visualization of learned value functions on two games, Breakout and Pong. a, A visualization of the learned value function on the game Breakout. At time points 1 and 2, the state value is predicted to be ∼17 and the agent is clearing the bricks at the lowest level. Each of the peaks in the value function curve corresponds to a reward obtained by clearing a brick. At time point 3, the agent is about to break through to the top level of bricks and the value increases to ∼21 in anticipation of breaking out and clearing a large set of bricks. At point 4, the value is above 23 and the agent has broken through. After this point, the ball will bounce at the upper part of the bricks clearing many of them by itself. b, A visualization of the learned action-value function on the game Pong. At time point 1, the ball is moving towards the paddle controlled by the agent on the right side of the screen and the values of all actions are around 0.7, reflecting the expected value of this state based on previous experience. At time point 2, the agent starts moving the paddle towards the ball and the value of the 'up' action stays high while the value of the 'down' action falls to −0.9. This reflects the fact that pressing 'down' would lead to the agent losing the ball and incurring a reward of −1. At time point 3, the agent hits the ball by pressing 'up' and the expected reward keeps increasing until time point 4, when the ball reaches the left edge of the screen and the value of all actions reflects that the agent is about to receive a reward of 1. Note, the dashed line shows the past trajectory of the ball purely for illustrative purposes (that is, not shown during the game). With permission from Atari Interactive, Inc.


Extended Data Table 1 | List of hyperparameters and their values

The values of all the hyperparameters were selected by performing an informal search on the games Pong, Breakout, Seaquest, Space Invaders and Beam Rider. We did not perform a systematic grid search owing to the high computational cost, although it is conceivable that even better results could be obtained by systematically tuning the hyperparameter values.


Page 66: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Extended Data Table 2 | Comparison of games scores obtained by DQN agents with methods from the literature12,15 and a professional human games tester

Best Linear Learner is the best result obtained by a linear function approximator on different types of hand-designed features12. Contingency (SARSA) agent figures are the results obtained in ref. 15. Note the figures in the last column indicate the performance of DQN relative to the human games tester, expressed as a percentage, that is, 100 × (DQN score − random play score)/(human score − random play score).


Extended Data Table 3 | The effects of replay and separating the target Q-network

DQN agents were trained for 10 million frames using standard hyperparameters for all possible combinations of turning replay on or off, using or not using a separate target Q-network, and three different learning rates. Each agent was evaluated every 250,000 training frames for 135,000 validation frames and the highest average episode score is reported. Note that these evaluation episodes were not truncated at 5 min, leading to higher scores on Enduro than the ones reported in Extended Data Table 2. Note also that the number of training frames was shorter (10 million frames) as compared to the main results presented in Extended Data Table 2 (50 million frames).


Extended Data Table 4 | Comparison of DQN performance with linear function approximator

The performance of the DQN agent is compared with the performance of a linear function approximator on the 5 validation games (that is, where a single linear layer was used instead of the convolutional network, in combination with replay and separate target network). Agents were trained for 10 million frames using standard hyperparameters, and three different learning rates. Each agent was evaluated every 250,000 training frames for 135,000 validation frames and the highest average episode score is reported. Note that these evaluation episodes were not truncated at 5 min, leading to higher scores on Enduro than the ones reported in Extended Data Table 2. Note also that the number of training frames was shorter (10 million frames) as compared to the main results presented in Extended Data Table 2 (50 million frames).


Page 67: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Asynchronous Methods for Deep Reinforcement Learning

Volodymyr Mnih¹, Adrià Puigdomènech Badia¹, Mehdi Mirza¹,², Alex Graves¹, Tim Harley¹, Timothy P. Lillicrap¹, David Silver¹, Koray Kavukcuoglu¹
¹ Google DeepMind  ² Montreal Institute for Learning Algorithms (MILA), University of Montreal

Abstract
We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training, allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.

1. Introduction
Deep neural networks provide rich representations that can enable reinforcement learning (RL) algorithms to perform effectively. However, it was previously thought that the combination of simple online RL algorithms with deep neural networks was fundamentally unstable. Instead, a variety of solutions have been proposed to stabilize the algorithm (Riedmiller, 2005; Mnih et al., 2013; 2015; Van Hasselt et al., 2015; Schulman et al., 2015a). These approaches share a common idea: the sequence of observed data encountered by an online RL agent is non-stationary, and online


RL updates are strongly correlated. By storing the agent's data in an experience replay memory, the data can be batched (Riedmiller, 2005; Schulman et al., 2015a) or randomly sampled (Mnih et al., 2013; 2015; Van Hasselt et al., 2015) from different time-steps. Aggregating over memory in this way reduces non-stationarity and decorrelates updates, but at the same time limits the methods to off-policy reinforcement learning algorithms.

Deep RL algorithms based on experience replay have achieved unprecedented success in challenging domains such as Atari 2600. However, experience replay has several drawbacks: it uses more memory and computation per real interaction, and it requires off-policy learning algorithms that can update from data generated by an older policy.

In this paper we provide a very different paradigm for deep reinforcement learning. Instead of experience replay, we asynchronously execute multiple agents in parallel, on multiple instances of the environment. This parallelism also decorrelates the agents' data into a more stationary process, since at any given time-step the parallel agents will be experiencing a variety of different states. This simple idea enables a much larger spectrum of fundamental on-policy RL algorithms, such as Sarsa, n-step methods, and actor-critic methods, as well as off-policy RL algorithms such as Q-learning, to be applied robustly and effectively using deep neural networks.

Our parallel reinforcement learning paradigm also offers practical benefits. Whereas previous approaches to deep reinforcement learning rely heavily on specialized hardware such as GPUs (Mnih et al., 2015; Van Hasselt et al., 2015; Schaul et al., 2015) or massively distributed architectures (Nair et al., 2015), our experiments run on a single machine with a standard multi-core CPU. When applied to a variety of Atari 2600 domains, on many games asynchronous reinforcement learning achieves better results, in far less

arXiv:1602.01783v2 [cs.LG] 16 Jun 2016


One way of propagating rewards faster is by using n-step returns (Watkins, 1989; Peng & Williams, 1996). In n-step Q-learning, Q(s, a) is updated toward the n-step return defined as

$$r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} \max_{a} Q(s_{t+n}, a).$$

This results in a single reward r directly affecting the values of n preceding state-action pairs. This makes the process of propagating rewards to relevant state-action pairs potentially much more efficient.
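As a worked example, the n-step return can be accumulated backwards from the bootstrap value (a small helper of our own, with a zero bootstrap for terminal states):

```python
def n_step_return(rewards, bootstrap_q, gamma=0.99):
    """r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * max_a Q(s_{t+n}, a).

    rewards      -- the n observed rewards [r_t, ..., r_{t+n-1}]
    bootstrap_q  -- max_a Q(s_{t+n}, a), or 0.0 if s_{t+n} is terminal
    """
    g = bootstrap_q
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```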

In contrast to value-based methods, policy-based model-free methods directly parameterize the policy π(a|s; θ) and update the parameters θ by performing, typically approximate, gradient ascent on E[R_t]. One example of such a method is the REINFORCE family of algorithms due to Williams (1992). Standard REINFORCE updates the policy parameters θ in the direction ∇_θ log π(a_t|s_t; θ) R_t, which is an unbiased estimate of ∇_θ E[R_t]. It is possible to reduce the variance of this estimate while keeping it unbiased by subtracting a learned function of the state b_t(s_t), known as a baseline (Williams, 1992), from the return. The resulting gradient is ∇_θ log π(a_t|s_t; θ) (R_t − b_t(s_t)).

A learned estimate of the value function is commonly used as the baseline, b_t(s_t) ≈ V^π(s_t), leading to a much lower variance estimate of the policy gradient. When an approximate value function is used as the baseline, the quantity R_t − b_t used to scale the policy gradient can be seen as an estimate of the advantage of action a_t in state s_t, or A(a_t, s_t) = Q(a_t, s_t) − V(s_t), because R_t is an estimate of Q^π(a_t, s_t) and b_t is an estimate of V^π(s_t). This approach can be viewed as an actor-critic architecture where the policy π is the actor and the baseline b_t is the critic (Sutton & Barto, 1998; Degris et al., 2012).
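A minimal sketch of this REINFORCE-with-baseline gradient in PyTorch (our own surrogate-loss formulation; the 0.5 weighting of the critic term is an illustrative choice, not from the paper):

```python
import torch
import torch.nn.functional as F

def policy_gradient_loss(log_probs, returns, values):
    """Surrogate loss whose gradient matches grad log pi(a_t|s_t; theta) * (R_t - b_t(s_t)).

    log_probs -- log pi(a_t | s_t; theta) for each sampled step, shape (T,)
    returns   -- Monte Carlo or n-step returns R_t, shape (T,)
    values    -- critic estimates V(s_t) used as the baseline b_t, shape (T,)
    """
    advantages = returns - values.detach()          # A(a_t, s_t) ~ R_t - V(s_t)
    policy_loss = -(log_probs * advantages).mean()  # minimize the negative of the objective
    value_loss = F.mse_loss(values, returns)        # fit the baseline/critic to the returns
    return policy_loss + 0.5 * value_loss
```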

4. Asynchronous RL Framework
We now present multi-threaded asynchronous variants of one-step Sarsa, one-step Q-learning, n-step Q-learning, and advantage actor-critic. The aim in designing these methods was to find RL algorithms that can train deep neural network policies reliably and without large resource requirements. While the underlying RL methods are quite different, with actor-critic being an on-policy policy search method and Q-learning being an off-policy value-based method, we use two main ideas to make all four algorithms practical given our design goal.

First, we use asynchronous actor-learners, similarly to the Gorila framework (Nair et al., 2015), but instead of using separate machines and a parameter server, we use multiple CPU threads on a single machine. Keeping the learners on a single machine removes the communication costs of sending gradients and parameters and enables us to use Hogwild! (Recht et al., 2011) style updates for training.

Second, we make the observation that multiple actor-

Algorithm 1 Asynchronous one-step Q-learning - pseudocode for each actor-learner thread.

// Assume global shared θ, θ⁻, and counter T = 0.
Initialize thread step counter t ← 0
Initialize target network weights θ⁻ ← θ
Initialize network gradients dθ ← 0
Get initial state s
repeat
    Take action a with ε-greedy policy based on Q(s, a; θ)
    Receive new state s' and reward r
    y = r                                  for terminal s'
        r + γ max_{a'} Q(s', a'; θ⁻)        for non-terminal s'
    Accumulate gradients wrt θ: dθ ← dθ + ∂(y − Q(s, a; θ))² / ∂θ
    s = s'
    T ← T + 1 and t ← t + 1
    if T mod I_target == 0 then
        Update the target network θ⁻ ← θ
    end if
    if t mod I_AsyncUpdate == 0 or s is terminal then
        Perform asynchronous update of θ using dθ.
        Clear gradients dθ ← 0.
    end if
until T > T_max

learners running in parallel are likely to be exploring different parts of the environment. Moreover, one can explicitly use different exploration policies in each actor-learner to maximize this diversity. By running different exploration policies in different threads, the overall changes being made to the parameters by multiple actor-learners applying online updates in parallel are likely to be less correlated in time than a single agent applying online updates. Hence, we do not use a replay memory and rely on parallel actors employing different exploration policies to perform the stabilizing role undertaken by experience replay in the DQN training algorithm.

In addition to stabilizing learning, using multiple parallel actor-learners has multiple practical benefits. First, we obtain a reduction in training time that is roughly linear in the number of parallel actor-learners. Second, since we no longer rely on experience replay for stabilizing learning, we are able to use on-policy reinforcement learning methods such as Sarsa and actor-critic to train neural networks in a stable way. We now describe our variants of one-step Q-learning, one-step Sarsa, n-step Q-learning and advantage actor-critic.

Asynchronous one-step Q-learning: Pseudocode for our variant of Q-learning, which we call Asynchronous one-step Q-learning, is shown in Algorithm 1. Each thread interacts with its own copy of the environment and at each step computes a gradient of the Q-learning loss. We use a shared and slowly changing target network in computing the Q-learning loss, as was proposed in the DQN training method. We also accumulate gradients over multiple timesteps before they are applied, which is similar to us-


Figure 1. Learning speed comparison for DQN and the new asynchronous algorithms on five Atari 2600 games. DQN was trained on a single Nvidia K40 GPU while the asynchronous methods were trained using 16 CPU cores. The plots are averaged over 5 runs. In the case of DQN the runs were for different seeds with fixed hyperparameters. For asynchronous methods we average over the best 5 models from 50 experiments with learning rates sampled from LogUniform(10⁻⁴, 10⁻²) and all other hyperparameters fixed.

two additional domains to evaluate only the A3C algorithm – Mujoco and Labyrinth. MuJoCo (Todorov, 2015) is a physics simulator for evaluating agents on continuous motor control tasks with contact dynamics. Labyrinth is a new 3D environment where the agent must learn to find rewards in randomly generated mazes from a visual input. The precise details of our experimental setup can be found in Supplementary Section 8.

5.1. Atari 2600 Games

We first present results on a subset of Atari 2600 games to demonstrate the training speed of the new methods. Figure 1 compares the learning speed of the DQN algorithm trained on an Nvidia K40 GPU with the asynchronous methods trained using 16 CPU cores on five Atari 2600 games. The results show that all four asynchronous methods we presented can successfully train neural network controllers on the Atari domain. The asynchronous methods tend to learn faster than DQN, with significantly faster learning on some games, while training on only 16 CPU cores. Additionally, the results suggest that n-step methods learn faster than one-step methods on some games. Overall, the policy-based advantage actor-critic method significantly outperforms all three value-based methods.

We then evaluated asynchronous advantage actor-critic on 57 Atari games. In order to compare with the state of the art in Atari game playing, we largely followed the training and evaluation protocol of (Van Hasselt et al., 2015). Specifically, we tuned hyperparameters (learning rate and amount of gradient norm clipping) using a search on six Atari games (Beamrider, Breakout, Pong, Q*bert, Seaquest and Space Invaders) and then fixed all hyperparameters for all 57 games. We trained both a feedforward agent with the same architecture as (Mnih et al., 2015; Nair et al., 2015; Van Hasselt et al., 2015) as well as a recurrent agent with an additional 256 LSTM cells after the final hidden layer. We additionally used the final network weights for evaluation to make the results more comparable to the original results

Method            Training Time           Mean     Median
DQN               8 days on GPU           121.9%    47.5%
Gorila            4 days, 100 machines    215.2%    71.3%
D-DQN             8 days on GPU           332.9%   110.9%
Dueling D-DQN     8 days on GPU           343.8%   117.1%
Prioritized DQN   8 days on GPU           463.6%   127.6%
A3C, FF           1 day on CPU            344.1%    68.2%
A3C, FF           4 days on CPU           496.8%   116.6%
A3C, LSTM         4 days on CPU           623.0%   112.6%

Table 1. Mean and median human-normalized scores on 57 Atari games using the human starts evaluation metric. Supplementary Table S3 shows the raw scores for all games.

from (Bellemare et al., 2012). We trained our agents for four days using 16 CPU cores, while the other agents were trained for 8 to 10 days on Nvidia K40 GPUs. Table 1 shows the average and median human-normalized scores obtained by our agents trained by asynchronous advantage actor-critic (A3C) as well as the current state-of-the-art. Supplementary Table S3 shows the scores on all games. A3C significantly improves on the state-of-the-art average score over 57 games in half the training time of the other methods while using only 16 CPU cores and no GPU. Furthermore, after just one day of training, A3C matches the average human normalized score of Dueling Double DQN and almost reaches the median human normalized score of Gorila. We note that many of the improvements that are presented in Double DQN (Van Hasselt et al., 2015) and Dueling Double DQN (Wang et al., 2015) can be incorporated to 1-step Q and n-step Q methods presented in this work with similar potential improvements.

5.2. TORCS Car Racing Simulator

We also compared the four asynchronous methods on the TORCS 3D car racing game (Wymann et al., 2013). TORCS not only has more realistic graphics than Atari 2600 games, but also requires the agent to learn the dynamics of the car it is controlling. At each step, an agent received only a visual input in the form of an RGB image

Page 68: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Schema Networks: Zero-shot Transfer with a Generative Causal Model of Intuitive Physics

Ken Kansky, Tom Silver, David A. Mely, Mohamed Eldawy, Miguel Lazaro-Gredilla, Xinghua Lou, Nimrod Dorfman, Szymon Sidor, Scott Phoenix, Dileep George

Abstract

The recent adaptation of deep neural network-based methods to reinforcement learning and planning domains has yielded remarkable progress on individual tasks. Nonetheless, progress on task-to-task transfer remains limited. In pursuit of efficient and robust generalization, we introduce the Schema Network, an object-oriented generative physics simulator capable of disentangling multiple causes of events and reasoning backward through causes to achieve goals. The richly structured architecture of the Schema Network can learn the dynamics of an environment directly from data. We compare Schema Networks with Asynchronous Advantage Actor-Critic and Progressive Networks on a suite of Breakout variations, reporting results on training efficiency and zero-shot generalization, consistently demonstrating faster, more robust learning and better transfer. We argue that generalizing from limited data and learning causal relationships are essential abilities on the path toward generally intelligent systems.

1. Introduction
A longstanding ambition of research in artificial intelligence is to efficiently generalize experience in one scenario to other similar scenarios. Such generalization is essential for an embodied agent working to accomplish a variety of goals in a changing world. Despite remarkable progress on individual tasks like Atari 2600 games (Mnih et al., 2015; Van Hasselt et al., 2016; Mnih et al., 2016) and Go (Silver et al., 2016a), the ability of state-of-the-art models to trans-

fer learning from one environment to the next remains lim-

All authors affiliated with Vicarious AI, California, USA. Correspondence to: Ken Kansky <[email protected]>, Tom Silver <[email protected]>.


Figure 1. Variations of Breakout. From top left: standard version, middle wall, half negative bricks, offset paddle, random target, and juggling. After training on the standard version, Schema Networks are able to generalize to the other variations without any additional training.

ited. For instance, consider the variations of Breakout illustrated in Fig. 1. In these environments the positions of objects are perturbed, but the object movements and sources of reward remain the same. While humans have no trouble generalizing experience from the basic Breakout to its variations, deep neural network-based models are easily fooled (Taylor & Stone, 2009; Rusu et al., 2016).

The model-free approach of deep reinforcement learning (Deep RL), such as the Deep Q-Network and its descendants, is inherently hindered by the same feature that makes it desirable for single-scenario tasks: it makes no assumptions about the structure of the domain. Recent work has suggested how to overcome this deficiency by utilizing object-based representations (Diuk et al., 2008; Usunier et al., 2016). Such a representation is motivated by the


still be unable to generalize from biased training data without continuing to learn on the test environment. In contrast, Schema Networks exhibit zero-shot transfer.

Schema Networks are implemented as probabilistic graphical models (PGMs), which provide practical inference and structure learning techniques. Additionally, inference with uncertainty and explaining away are naturally supported by PGMs. We direct the readers to (Koller & Friedman, 2009) and (Jordan, 1998) for a thorough overview of PGMs. In particular, early work on factored MDPs has demonstrated how PGMs can be applied in RL and planning settings (Guestrin et al., 2003b).

3. Schema Networks

3.1. MDPs and Notation

The traditional formalism for the Reinforcement Learning problem is the Markov Decision Process (MDP). An MDP $M$ is a five-tuple $(\mathcal{S}, \mathcal{A}, T, R, \gamma)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $T(s^{(t+1)} \mid s^{(t)}, a^{(t)})$ is the probability of transitioning from state $s^{(t)} \in \mathcal{S}$ to $s^{(t+1)} \in \mathcal{S}$ after action $a^{(t)} \in \mathcal{A}$, $R(r^{(t+1)} \mid s^{(t)}, a^{(t)})$ is the probability of receiving reward $r^{(t+1)} \in \mathbb{R}$ after executing action $a^{(t)}$ while in state $s^{(t)}$, and $\gamma \in [0, 1]$ is the rate at which future rewards are exponentially discounted.
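As a concrete (and simplified) illustration of this five-tuple, a small finite MDP can be written down directly in Python. This is only a sketch: the class and field names are not from the paper, and the reward is reduced to its expectation rather than a full distribution R(r | s, a).

from dataclasses import dataclass
from typing import Dict, Tuple, List

@dataclass
class TabularMDP:
    # Minimal sketch of an MDP (S, A, T, R, gamma) with finite S and A.
    states: List[int]
    actions: List[int]
    # transition[(s, a)] maps each next state s' to its transition probability.
    transition: Dict[Tuple[int, int], Dict[int, float]]
    # Expected immediate reward for taking action a in state s (simplification of R(r | s, a)).
    reward: Dict[Tuple[int, int], float]
    gamma: float = 0.9  # discount rate for future rewards

# Hypothetical two-state example: action 1 in state 0 usually moves to state 1 and pays 1.0.
mdp = TabularMDP(
    states=[0, 1], actions=[0, 1],
    transition={(0, 1): {1: 0.9, 0: 0.1}, (0, 0): {0: 1.0}, (1, 0): {1: 1.0}, (1, 1): {1: 1.0}},
    reward={(0, 1): 1.0, (0, 0): 0.0, (1, 0): 0.0, (1, 1): 0.0},
)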

3.2. Model Definition

A Schema Network is a structured generative model of an MDP. We first describe the architecture of the model informally. An image input is parsed into a list of entities, which may be thought of as instances of objects in the sense of OO-MDPs (Diuk et al., 2008). All entities share the same collection of attributes. We refer to a specific attribute of a specific entity as an entity-attribute, which is represented as a binary variable to indicate the presence of that attribute for an entity. An entity state is an assignment of states to all attributes of the entity, and the complete model state is the set of all entity states.

A grounded schema is a binary variable associated with a particular entity-attribute in the next timestep, whose value depends on the present values of a set of binary entity-attributes. The event that one of these present entity-attributes assumes the value 1 is called a precondition of the grounded schema. When all preconditions of a grounded schema are satisfied, we say that the schema is active, and it predicts the activation of its associated entity-attribute. Grounded schemas may also predict rewards and may be conditioned on actions, both of which are represented as binary variables. For instance, a grounded schema might define a distribution over Entity 1's "position" attribute at time 5, conditioned on Entity 2's "position" attribute at time 4 and the action "UP" at time 4.

Figure 2. Architecture of a Schema Network. An ungrounded schema is a template for a factor that predicts either the value of an entity-attribute (A) or a future reward (B) based on entity states and actions taken in the present. Self-transitions (C) predict that entity-attributes remain in the same state when no schema is active to predict a change. Self-transitions allow continuous or categorical variables to be represented by a set of binary variables (depicted as smaller nodes). The grounded schema factors, instantiated from ungrounded schemas at all positions, times, and entity bindings, are combined with self-transitions to create a Schema Network (D).

Grounded schemas are instantiated from ungrounded schemas, which behave like templates for grounded schemas to be instantiated at different times and in different combinations of entities. For example, an ungrounded schema could predict the "position" attribute of Entity x at time t + 1 conditioned on the "position" of Entity y at time t and the action "UP" at time t; this ungrounded schema could be instantiated at time t = 4 with x = 1 and y = 2 to create the grounded schema described above. In the case of attributes like "position" that are inherently continuous or categorical, several binary variables may be used to discretely approximate the distribution (see the smaller nodes in Figure 2). A Schema Network is a factor graph that contains all grounded instantiations of a set of ungrounded schemas over some window of time, illustrated in Figure 2.
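As a rough illustration of how a grounded schema fires (a sketch, not the paper's implementation; the entity/attribute names and the dictionary encoding are made up), the activation rule is simply a logical AND over the schema's binary preconditions:

# Binary entity-attribute variables keyed by (entity, attribute, time); 1 = present.
def schema_active(preconditions, state):
    """A grounded schema is active iff all of its precondition variables equal 1."""
    return all(state.get(p, 0) == 1 for p in preconditions)

# Hypothetical grounded schema: Entity 1's "position" attribute at t=5 is predicted
# from Entity 2's "position" attribute at t=4 and the action "UP" at t=4.
state = {(2, "position", 4): 1, ("action", "UP", 4): 1}
preconditions = [(2, "position", 4), ("action", "UP", 4)]
if schema_active(preconditions, state):
    state[(1, "position", 5)] = 1  # the schema predicts activation of its entity-attribute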

We now formalize the Schema Network factor graph. For simplicity, suppose the number of entities and the number of attributes are fixed at $N$ and $M$ respectively. Let $E_i$ refer to the $i$-th entity and let $\alpha^{(t)}_{i,j}$ refer to the $j$-th attribute value of the $i$-th entity at time $t$. We use the notation $E^{(t)}_i = (\alpha^{(t)}_{i,1}, \ldots, \alpha^{(t)}_{i,M})$ to refer to the state of the $i$-th entity at time $t$.
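In code, this binary state representation is just an N x M array of attribute indicators per timestep (a sketch; the array name and the use of numpy are assumptions, and M = 53 is borrowed from the Breakout setup described later):

import numpy as np

T, N, M = 10, 4, 53                           # timesteps, entities, attributes
alpha = np.zeros((T, N, M), dtype=np.uint8)   # alpha[t, i, j] is the binary attribute alpha_{i,j}^{(t)}

alpha[4, 2, 0] = 1                            # hypothetical: attribute 0 of entity 2 is active at t = 4
E_2_at_4 = alpha[4, 2]                        # entity state E_2^{(4)} = (alpha_{2,1}^{(4)}, ..., alpha_{2,M}^{(4)})
model_state_at_4 = alpha[4]                   # the complete model state at t = 4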


(a) Mini Breakout Learning Rate (b) Middle Wall Learning Rate

Figure 3. Comparison of learning rates. (a) Schema Networks and A3C were trained for 100k frames in Mini Breakout. The plot shows the average of 5 training attempts for Schema Networks and the best of 5 training attempts for A3C, which did not converge as reliably. (b) PNs and Schema Networks were pretrained on 100k frames of Standard Breakout, and then training continued on 45k additional frames of the Middle Wall variation. We show performance as a function of training frames for both models. Note that Schema Networks ignore all the additional training data, since all the required schemas were learned during pretraining. For Schema Networks, zero-shot transfer learning is happening.

the input to Schema Networks did not treat any object differently. Schema Networks were provided separate entities for each part (pixel) of each object, and each entity contained 53 attributes corresponding to the available part labels (21 for bricks, 30 for the paddle, 1 for walls, and 1 for the ball). Only one of these part attributes was active per entity. Schema Networks had to learn that some attributes, like parts of bricks, were irrelevant for prediction.

5.1. Transfer Learning

This experiment examines how effectively Schema Networks and PNs are able to learn a new Breakout variation after pretraining, i.e., how well the two models can transfer existing knowledge to a new task. Fig. 3a shows the learning rates during 100k frames of training on Mini Breakout. In a second experiment, we pretrained on Large Breakout for 100k frames and continued training on the Middle Wall variation, shown in Fig. 1b. Fig. 3b shows that PNs require significant time to learn in this new environment, while Schema Networks do not learn anything new because the dynamics are the same.

5.2. Zero-Shot Generalization

Many Breakout variations can be constructed that all involve the same dynamics. If a model correctly learns the dynamics from one variation, in theory the others could be played perfectly by planning using the learned model.

Rather than comparing transfer with additional training using PNs, in these variations we can compare zero-shot generalization by training A3C only on Standard Breakout. Fig. 1b-e shows some of these variations with the following modifications from the training environment:

• Offset Paddle (Fig. 1d): The paddle is shifted upward by a few pixels.

• Middle Wall (Fig. 1b): A wall is placed in the middle of the screen, requiring the agent to aim around it to hit the bricks.

• Random Target (Fig. 1e): A group of bricks is destroyed when the ball hits any of them and then reappears in a new random position, requiring the agent to deliberately aim at the group.

• Juggling (Fig. 1f, enlarged from the actual environment to see the balls): Without any bricks, three balls are launched in such a way that a perfect policy could juggle them without dropping any.

Table 1 shows the average scores per episode in each Breakout variation. These results show that A3C has failed to recognize the common dynamics and adapt its policy accordingly. This comes as no surprise, as the policy it has learned for Standard Breakout is no longer applicable in these variations. Simply adding an offset to the paddle is

Page 69: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]

Playing FlappyBird with Deep Reinforcement Learning

Naveen Appiah

Mechanical [email protected]

Sagar Vare

Stanford [email protected]

Abstract

Learning to play games has been one of the popular topics researched in AI today. Solving such problems using game theory / search algorithms requires careful domain-specific feature definitions, making them hard to scale. The goal here is to develop a more general framework to learn game-specific features and solve the problem. The game we are considering for this project is the popular mobile game Flappy Bird. It involves navigating a bird through a bunch of obstacles. Though this problem can be solved using a naive RL implementation, it requires good feature definitions to set up the problem. Our goal is to develop a CNN model to learn features from just snapshots of the game and train the agent to take the right actions at each game instance.

1 INTRODUCTION - PROBLEM DEFINITION

Flappy Bird (Figure 1) is a game in which the player guides the bird, the "hero" of the game, through the space between pairs of pipes. At each instant there are two actions the player can take: press the 'up' key, which makes the bird jump upward, or press no key, which makes it descend at a constant rate.

Today, the recent advances in deep neural networks, in which several layers of nodes are used to build up progressively more abstract representations of the data, have made it possible for machine learning models to learn concepts such as object categories directly from raw sensory data. It has also been observed that deep convolutional networks, which use hierarchical layers of tiled convolutional filters to mimic the effects of receptive fields, produce promising results in solving computer vision problems such as classification and detection. The goal here is to develop a deep neural network to learn game-specific features just from the raw pixels and

decide on what actions to take. Inspired by [1] and [2], we propose a reinforcement learning set-up to learn and play this game.

Reinforcement learning is essential when it is not sufficient to tackle problems by programming the agent with just a few predetermined behaviors. It is a way to teach the agent to make the right decisions under uncertainty and with very high-dimensional input (such as a camera) by making it experience scenarios. In this way, the learning can happen online and the agent can learn to react to even the rarest of scenarios, which brute-force programming would never consider.

Figure 1: Flappybird Game - Schematics

2 RELATED WORK

Google DeepMind's efforts to use deep learning techniques to play games have paved the way for looking at artificial intelligence problems through a completely different lens. Their recent success, AlphaGo [4], the Go agent that has been giving stiff competition to experts in the game, clearly shows the potential of what deep learning is


Figure 2: Schematic Architecture of the Convolutional Neural Network.

capable of. DeepMind's previous venture was to learn and play the Atari 2600 games just from the raw pixel data. Mnih et al. were able to successfully train agents to play these games using reinforcement learning, surpassing human expert level on multiple games [1], [2]. Here, they developed a novel agent, a deep Q-network (DQN), combining reinforcement learning with deep neural networks. The deep neural network acts as the approximate function to represent the Q-value (action-value) in Q-learning. They also discuss a few techniques to improve the efficiency of training and the stability: they use an "experience replay" of previous experiences, from which mini-batches are randomly sampled to update the network so as to de-correlate experiences, and delayed updates for the cloned model from which target values are obtained (explained in detail later) to improve stability. Another advantage of this pipeline is the complete absence of labeled data. The model learns by playing with the game emulator and learns to make good decisions over time. It is this simple learning framework and their stupendous results in playing the Atari games that inspired us to implement a similar algorithm for this project.

3 METHODS

In the following section, we describe how the model is parameterized and the Q-learning algorithm. The task of the AI agent, when the model gets deployed, is to extract images of game instances and output the necessary action to be taken from the set of feasible actions. This is similar to a classification problem. Unlike the common classification problem, however, we don't have labeled data to train the model on. Instead, a reinforcement learning setting tries to evaluate an action at a given state based on the reward observed by executing it.

3.1 MODEL FORMULATION

The actions the bird can take are to flap (a = 1) or do nothing (a = 0). The state at time (frame) t is derived by preprocessing the raw image of the current frame (x_t) together with a finite number of previous frames (x_{t-1}, x_{t-2}, ...). This way, each state will uniquely capture the trajectory the bird has followed to reach that position and thus provide temporal information to the agent. The number of previous frames stored becomes a hyper-parameter. Ideally, s_t should be a function of all frames from t = 1, but to reduce the state space, only a finite number of frames are used.
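A minimal sketch of this state construction (assuming grayscale preprocessing and a stack of the 3 most recent frames, matching the parameters reported in Section 5.1; the function and variable names are made up):

import numpy as np
from collections import deque

K = 3  # number of recent frames kept in a state (a hyper-parameter)

def preprocess(raw_frame):
    # Stand-in preprocessing: collapse RGB to grayscale and scale to [0, 1].
    return raw_frame.mean(axis=2).astype(np.float32) / 255.0

frames = deque(maxlen=K)  # the K most recent preprocessed frames

def make_state(raw_frame):
    # State s_t stacks the current frame with the previous K-1 frames,
    # giving the agent temporal information about the bird's trajectory.
    frames.append(preprocess(raw_frame))
    while len(frames) < K:           # pad at the start of an episode
        frames.append(frames[-1])
    return np.stack(frames, axis=0)  # shape: (K, H, W)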

As we know that the bird dies when it hits a pipe or the edges of the screen, we can associate a negative reward with the bird crashing and a positive reward if it passes through a gap. This is close to what a human player tries to do, i.e. try to avoid dying and score as many points as possible. Therefore, there are two rewards, rewardPass and rewardDie. A discount factor (γ) of 0.9 is used to discount the rewards propagated from the future action-values.

4 Q-LEARNING

The goal of reinforcement learning is to maximize the total pay-off (reward). In Q-learning, which is off-policy, we use the Bellman equation as an iterative update.
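The equation itself did not survive the transcription; the standard Q-learning update the text refers to has the form (with learning rate $\alpha$)

\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Big].
\]

In the DQN variant below, the bootstrapped target y_j = r_j + γ max_{a'} Q(s_{j+1}, a'; θ⁻) is computed with the cloned parameters θ⁻, and the network is trained by gradient descent on the squared difference between y_j and Q(s_j, a_j; θ).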

As described in the previous section, the experiences are stored in a replay memory and, at regular intervals, a random mini-batch of experiences is sampled from the memory and used to perform a gradient descent step on the DQN parameters. Then we update the exploration probability as well as the target network parameters θ⁻ if necessary.

Algorithm 1 Deep Reinforcement Learning

1:  Initialize replay memory D to a certain capacity
2:  Initialize the Q-value function with random weights θ
3:  Initialize θ⁻ = θ
4:  for games = 1 → maxGames do
5:      for snapShots = 1 → T do
6:          With probability ε select a random action a_t
7:          otherwise select a_t = argmax_a Q(s_t, a; θ)
8:          Execute a_t and observe r_t and next state s_{t+1}
9:          Store transition (s_t, a_t, r_t, s_{t+1}) in D
10:         Sample a mini-batch of transitions from D
11:         for j = 1 → size of mini-batch do
12:             if the game terminates at the next state then
13:                 y_j = r_j
14:             else
15:                 y_j = r_j + γ max_{a'} Q(s_{j+1}, a'; θ⁻)
16:             end if
17:         end for
18:         Perform gradient descent on the loss w.r.t. θ
19:         Every C steps reset θ⁻ = θ
20:     end for
21: end for
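For concreteness, the loop above can be sketched in Python as follows. This is an illustration only: env (with reset/step/num_actions), init_params, q_values(state, params) and gradient_step(params, batch, targets) are assumed helper interfaces, not the authors' code.

import random
from collections import deque

def dqn_train(env, init_params, q_values, gradient_step,
              num_games=1000, max_steps=10_000, capacity=1000,
              batch_size=32, gamma=0.9, C=100,
              eps_start=0.6, eps_decay_updates=1000):
    D = deque(maxlen=capacity)      # replay memory
    theta = init_params()           # Q-network weights
    theta_minus = theta             # target-network weights, initially a copy of theta
    eps, updates = eps_start, 0

    for _ in range(num_games):
        s = env.reset()
        for _ in range(max_steps):
            # Epsilon-greedy action selection.
            if random.random() < eps:
                a = random.randrange(env.num_actions)
            else:
                q = q_values(s, theta)
                a = max(range(env.num_actions), key=lambda act: q[act])
            s_next, r, done = env.step(a)
            D.append((s, a, r, s_next, done))
            s = s_next

            if len(D) >= batch_size:
                batch = random.sample(D, batch_size)
                targets = []
                for (sj, aj, rj, sj1, dj) in batch:
                    if dj:   # terminal transition: y_j = r_j
                        targets.append(rj)
                    else:    # bootstrapped target from the cloned parameters theta_minus
                        targets.append(rj + gamma * max(q_values(sj1, theta_minus)))
                theta = gradient_step(theta, batch, targets)   # descend on (y_j - Q(s_j, a_j; theta))^2
                updates += 1
                eps = max(0.0, eps_start * (1 - updates / eps_decay_updates))  # linear decay
                if updates % C == 0:
                    theta_minus = theta   # periodic target-network sync
            if done:
                break
    return theta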

The score of the output game is the sole evaluation metric. To make the results robust, we take an average score over a few games rather than a single one. The ε factor is set to zero during testing, while during training we use a decaying value. This is to model the increasing certainty of our decisions as we train and learn more.

5 EXPERIMENTS AND RESULTS

5.1 TRAINING PARAMETERS

Model parameters: Flappy Bird is played at 10 frames per second, 3 recent frames are processed to generate a state, the discount factor γ is set to 0.9, and the rewards are as follows: rewardPass = +1.0 and rewardDie = −1.0.

DQN parameters: The exploration probability (ε) is linearly decreased from 0.6 to 0 in 1000 updates. The size of the replay memory is set to 1000, and mini-batches are sampled once it has 500 experiences. The parameters of the target model θ⁻ are updated every C = 100 updates. A mini-batch of 32 is randomly sampled every 5 frames to update the DQN parameters.

Training parameters: The gradient descent update rule used to update the DQN parameters is Adam with a learning rate of 1e−6, β1 = 0.9 and β2 = 0.999. These parameters were chosen on a trial-and-error basis by observing the convergence of the loss value. Our convolution weights are initialized from a normal distribution with mean 0 and variance 1e−2.
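Collected in one place, the reported hyper-parameters look like the following configuration sketch (the values come from the text above; the dictionary itself and its key names are only an illustrative convention, not the authors' code):

# Hyper-parameters reported in Section 5.1 (illustrative grouping, hypothetical key names).
CONFIG = {
    "frames_per_second": 10,
    "frames_per_state": 3,
    "gamma": 0.9,              # discount factor
    "reward_pass": +1.0,
    "reward_die": -1.0,
    "eps_start": 0.6,          # exploration probability, linearly decayed to 0
    "eps_decay_updates": 1000,
    "replay_capacity": 1000,
    "replay_min_size": 500,    # start sampling once 500 experiences are stored
    "target_sync_every": 100,  # C
    "batch_size": 32,
    "update_every_frames": 5,
    "adam_lr": 1e-6,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "conv_init_std": 0.1,      # normal init with mean 0 and variance 1e-2
}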

The whole DQN architecture and the Q-learning setup were developed in Python using the numpy and matplotlib libraries. The game emulator is also a python-pygame implementation, found at https://github.com/TimoWilken/flappy-bird-pygame.git

5.2 RESULTS AND ANALYSIS

After training, a few snapshots of the game were tested with the model to see if the results made sense. Figure 3 shows some example snapshots and their corresponding scores, which make perfect sense.

(a) Score: UP = 1.870, DOWN = -1.830

(b) Score: UP = -1.999, DOWN = 1.983

Figure 3: Example snapshots with their corresponding scores. 3a is a scenario where the bird has to jump up and 3b is a scenario where the bird has to go down.

To understand more about the workings of the trained CNN model, test image 3b was visualized after the convolution layers to observe the activations. It could be seen that most activations show clear patches on the edges of

Page 70: Notes on Reinforcement Learning - v0.1

References Web

Page 71: Notes on Reinforcement Learning - v0.1
Page 72: Notes on Reinforcement Learning - v0.1

Joo-Haeng Lee 2017 [email protected]