
A Survey of Deep Reinforcement Learning in Video Games

Kun Shao, Zhentao Tang, Yuanheng Zhu, Member, IEEE, Nannan Li, and Dongbin Zhao, Fellow, IEEE

Abstract—Deep reinforcement learning (DRL) has made great achievements since it was proposed. Generally, DRL agents receive high-dimensional inputs at each step, and take actions according to deep-neural-network-based policies. This learning mechanism updates the policy to maximize the return in an end-to-end manner. In this paper, we survey the progress of DRL methods, including value-based, policy gradient, and model-based algorithms, and compare their main techniques and properties. Besides, DRL plays an important role in game artificial intelligence (AI). We also review the achievements of DRL in various video games, including classical Arcade games, first-person perspective games and multi-agent real-time strategy games, from 2D to 3D, and from single-agent to multi-agent. A large number of video game AIs with DRL have achieved super-human performance, while there are still some challenges in this domain. Therefore, we also discuss some key points when applying DRL methods to this field, including exploration-exploitation, sample efficiency, generalization and transfer, multi-agent learning, imperfect information, and delayed sparse rewards, as well as some research directions.

Index Terms—reinforcement learning, deep learning, deep reinforcement learning, game AI, video games.

I. INTRODUCTION

ARTIFICIAL intelligence (AI) in video games is a long-standing research area. It studies how to use AI technologies to achieve human-level performance when playing games. More generally, it studies the complex interactions between agents and game environments. Various games provide interesting and complex problems for agents to solve, making video games perfect environments for AI research. These virtual environments are safe and controllable. In addition, game environments provide an infinite supply of useful data for machine learning algorithms, and they run much faster than real time. These characteristics make games a unique and favorite domain for AI research. On the other hand, AI has been helping games to become better in the way we play, understand and design them [1].

Broadly speaking, game AI involves perception and decision-making in game environments. With these components come some crucial challenges and proposed solutions. The first challenge is that the state space of the game is very large, especially in strategic games. With the rise of representation learning, large-scale state spaces can now be successfully modeled with deep neural networks. The second challenge is that learning proper policies to make decisions in dynamic, unknown environments is difficult. For this problem, data-driven methods, such as supervised learning and reinforcement learning (RL), are feasible solutions. The third challenge is that the vast majority of game AI is developed for a specific virtual environment. How to transfer an AI's ability among different games is a core challenge, and a more general learning system is also necessary.

K. Shao, Z. Tang, Y. Zhu, N. Li, and D. Zhao are with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China. They are also with the University of Chinese Academy of Sciences, Beijing, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

This work is supported by the National Natural Science Foundation of China (NSFC) under Grants No.61573353, No.61603382, No.6180337, and No.61533017.

For a long time, reinforcement learning has been widely used to tackle these challenges in game AI. In the last few years, deep learning (DL) has achieved remarkable performance in computer vision and natural language processing [2]. The combination, deep reinforcement learning (DRL), teaches agents to make decisions in high-dimensional state spaces in an end-to-end framework, and dramatically improves the generalization and scalability of traditional RL algorithms. In particular, DRL has made great progress in video games, including Atari, ViZDoom, StarCraft, Dota2, and so on. There are some related works that introduce these achievements. Zhao et al. [3] and Tang et al. [4] survey the development of DRL research, focusing on AlphaGo and AlphaGo Zero. Justesen et al. [5] review DL-based methods in video game play, including supervised learning, unsupervised learning, reinforcement learning, evolutionary approaches, and some hybrid approaches. Arulkumaran et al. [6] give a brief introduction to DRL, covering central algorithms and presenting a range of visual RL domains. Li [7] gives an overview of recent achievements of DRL, and discusses core elements, important mechanisms, and various applications. In this paper, we focus on DRL-based game AI, from 2D to 3D, and from single-agent to multi-agent. The main contributions include comprehensive and detailed comparisons of various DRL methods, their techniques and properties, and their impressive and diverse performance in the given video games.

The remainder of this paper is organized as follows. In Section II, we introduce the background of DL and RL. In Section III, we focus on recent DRL methods, including value-based, policy gradient, and model-based DRL methods. After that, we give a brief introduction of research platforms and competitions, and present the performance of DRL methods in classical single-agent Arcade games, first-person perspective games, and multi-agent real-time strategy games. In Section V, we discuss some key points and research directions in this field. In the end, we draw a conclusion of this survey.

arXiv:1912.10944v2 [cs.MA] 26 Dec 2019


Fig. 1. The framework diagram of typical DRL for video games. The deep learning model takes input from the video game API and extracts meaningful features automatically. DRL agents produce actions based on these features, and the environment transitions to the next state.

II. BACKGROUND

Generally speaking, training an agent to make decisions with high-dimensional inputs is difficult. With the development of deep learning, researchers take deep neural networks as function approximators, and use plenty of samples to optimize policies successfully. The framework diagram of typical DRL for video games is depicted in Fig. 1.

A. Deep learning

Deep learning comes from artificial neural networks, and is used to learn data representations. It is inspired by the theory of brain development, and can be applied in supervised, unsupervised and semi-supervised learning. Although the term deep learning was introduced in 1986 [8], the field went through a winter period because of the lack of data and of capable computing hardware. However, with more and more large-scale datasets being released and capable hardware becoming available, a big revolution has happened in DL [9].

Convolutional neural network (CNN) [10] is a class of deep neural networks which is widely applied to computer vision. CNN is inspired by biological processes, and is shift invariant based on a shared-weights architecture. Recurrent neural network (RNN) is another kind of deep neural network, used especially for natural language processing. As a special kind of RNN, Long Short-Term Memory (LSTM) [11] is capable of learning long-term dependencies. Deep learning architectures have been applied in many fields, and have achieved significant successes, such as speech recognition, image classification and segmentation, semantic comprehension, and machine translation [2]. DL-based methods with efficient parallel distributed computing resources can break the limits of traditional machine learning methods, inspiring scientists and researchers to achieve more and more state-of-the-art performance in their respective fields.
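To make the CNN building block mentioned above concrete, the following is a minimal sketch of a convolutional feature extractor of the kind commonly used by DRL agents on raw game frames, assuming PyTorch is available; the layer sizes are illustrative choices, not taken from any particular paper.

```python
import torch
import torch.nn as nn

class ConvFeatureExtractor(nn.Module):
    """Maps a stack of game frames to a flat feature vector."""
    def __init__(self, in_channels=4, feature_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # With the strides above, 84x84 inputs yield a 7x7x64 activation map.
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 7 * 7, feature_dim), nn.ReLU())

    def forward(self, x):
        return self.fc(self.conv(x))

# Example: a batch of 8 observations, each 4 stacked 84x84 grayscale frames.
features = ConvFeatureExtractor()(torch.zeros(8, 4, 84, 84))
print(features.shape)  # torch.Size([8, 512])
```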

B. Reinforcement learning

Reinforcement learning is a class of machine learning methods in which agents learn the optimal policy by trial and error [12]. By interacting with the environment, RL can be successfully applied to sequential decision-making tasks. Considering a discounted episodic Markov decision process (MDP) (S, A, γ, P, r), the agent chooses an action a_t according to the policy π(a_t|s_t) at state s_t. The environment receives the action, produces a reward r_{t+1} and transitions to the next state s_{t+1} according to the transition probability P(s_{t+1}|s_t, a_t). This transition probability is unknown in the RL domain. The process continues until the agent reaches a terminal state or a maximum time step. The objective is to maximize the expected discounted cumulative reward

E_π[R_t] = E_π[ Σ_{i=0}^{∞} γ^i r_{t+i} ],   (1)

where γ ∈ (0, 1] is the discount factor.

Reinforcement learning can be divided into off-policy and on-policy methods. In off-policy RL algorithms, the behavior policy used for selecting actions is different from the learning policy. On the contrary, the behavior policy is the same as the learning policy in on-policy RL algorithms. Besides, reinforcement learning can also be divided into value-based and policy-based methods. In value-based RL, agents update the value function to learn a suitable policy, while policy-based RL agents learn the policy directly.
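As a quick illustration of Eq. (1), the following is a minimal sketch, assuming NumPy, that computes the discounted return of a finite reward sequence; the rewards and discount factor are made-up example values.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Compute sum_i gamma^i * r_i for a finite reward sequence (Eq. (1))."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81
```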

Q-learning is a typical off-policy value-based method. The update rule of Q-learning is

δ_t = r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t),   (2a)

Q(s_t, a_t) ← Q(s_t, a_t) + α δ_t.   (2b)

δ_t is the temporal difference (TD) error, and α is the learning rate.
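A minimal tabular sketch of the update in Eqs. (2a)–(2b), assuming NumPy; the state and action indices are toy values used only for illustration.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning step: Eq. (2a) TD error, then Eq. (2b) table update."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]   # (2a)
    Q[s, a] += alpha * td_error                          # (2b)
    return td_error

# Toy usage with 5 states and 2 actions.
Q = np.zeros((5, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
print(Q[0, 1])  # 0.1
```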


Fig. 2. The network architectures of typical DRL methods, with increased complexity and performance. (a): DQN network; (b): Dueling DQN network; (c): DRQN network; (d): Actor-critic network; (e): Reactor network.

Policy gradient [13] parameterizes the policy and updates the parameters θ. In its general form, the objective function of policy gradient is defined as

J(θ) = E_π[ Σ_{t=0}^{∞} log π_θ(a_t|s_t) R ],   (3)

where R is the total accumulated return.

Actor-critic [12] reinforcement learning improves the policy gradient with a value-based critic

J(θ) = E_π[ Σ_{t=0}^{∞} Ψ_t log π_θ(a_t|s_t) ].   (4)

Ψ_t is the critic, which can be the state-action value function Q^π(s_t, a_t), the advantage function A^π(s_t, a_t) = Q^π(s_t, a_t) − V^π(s_t), or the TD error r_t + γV^π(s_{t+1}) − V^π(s_t).
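To illustrate Eq. (4) for a discrete-action softmax policy, the sketch below computes the analytic gradient of log π_θ(a|s) for a tabular logit parameterization and applies an advantage-weighted update; it assumes NumPy, and the state, action, and advantage values are toy examples.

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def actor_update(theta, s, a, advantage, lr=0.01):
    """One policy-gradient step in the spirit of Eq. (4) for a tabular softmax policy.

    theta: array of shape (num_states, num_actions) holding per-state logits.
    For a softmax policy, d log pi(a|s) / d theta[s] = onehot(a) - pi(.|s).
    """
    probs = softmax(theta[s])
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta[s] += lr * advantage * grad_log_pi
    return theta

theta = np.zeros((3, 4))            # 3 states, 4 actions
actor_update(theta, s=0, a=2, advantage=1.5)
print(softmax(theta[0]))            # action 2 is now slightly more probable
```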

III. DEEP REINFORCEMENT LEARNING

DRL is a combination of DL and RL, and has achieved rapid development since it was proposed. This section will introduce various DRL methods, including value-based methods, policy gradient methods, and model-based methods.

A. Value-based DRL methods

Deep Q-network (DQN) [14] is the most famous DRL model, which learns policies directly from high-dimensional inputs. It receives raw pixels, and outputs a value function to estimate future rewards, as shown in Fig. 2(a). DQN uses the experience replay method to break sample correlations, and stabilizes the learning process with a target Q-network. The loss function at iteration i is

L_i(θ_i) = E_{(s,a,r,s′)∼U(D)}[ (y_i^{DQN} − Q(s, a; θ_i))^2 ],   (5)

with

y_i^{DQN} = r + γ max_{a′} Q(s′, a′; θ_i^−).   (6)
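The following minimal sketch, assuming NumPy, computes the target of Eq. (6) and the squared error of Eq. (5) for a batch of transitions, treating the online and target Q-networks as arrays of precomputed Q-values; the terminal-state mask is standard practice in implementations rather than part of Eq. (6) itself.

```python
import numpy as np

def dqn_loss(q_online_sa, q_target_next, rewards, dones, gamma=0.99):
    """Mean squared TD error of Eqs. (5)-(6) over a batch.

    q_online_sa:   (batch,) Q(s, a; theta) for the taken actions.
    q_target_next: (batch, num_actions) Q(s', .; theta^-) from the target network.
    dones:         (batch,) 1.0 if s' is terminal, else 0.0.
    """
    y = rewards + gamma * (1.0 - dones) * np.max(q_target_next, axis=1)  # Eq. (6)
    return float(np.mean((y - q_online_sa) ** 2))                        # Eq. (5)

print(dqn_loss(np.array([0.5]), np.array([[1.0, 2.0]]),
               rewards=np.array([1.0]), dones=np.array([0.0])))
```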

DQN bridges the gap between high-dimensional visual inputs and actions. After that, researchers have improved DQN in different aspects. Double DQN [15] introduces double Q-learning to reduce observed overestimations, and it leads to much better performance. Prioritized experience replay [16] helps prioritize experience to replay important transitions more frequently. The sampling probability of transition i is P(i) = p_i^α / Σ_k p_k^α, where p_i is the priority of transition i. Dueling DQN [17] uses the dueling neural network architecture for model-free DRL. It includes two separate estimators: one for the state value function V(s; θ, β) and the other for the advantage function A(s, a; θ, α), as shown in Fig. 2(b).

Q(s, a; θ, α, β) = V(s; θ, β) + A(s, a; θ, α).   (7)
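Two of the refinements above can be sketched in a few lines, assuming NumPy: the prioritized-replay sampling probabilities P(i) = p_i^α / Σ_k p_k^α, and the dueling aggregation of Eq. (7); practical dueling implementations usually also subtract the mean advantage for identifiability, which is noted in a comment rather than in Eq. (7) itself.

```python
import numpy as np

def replay_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha for prioritized experience replay."""
    p = np.asarray(priorities, dtype=np.float64) ** alpha
    return p / p.sum()

def dueling_q(values, advantages, subtract_mean=True):
    """Combine V(s) and A(s, a) into Q(s, a) as in Eq. (7).

    values:     (batch, 1) state values.
    advantages: (batch, num_actions) advantages.
    subtract_mean: if True, use the mean-subtracted variant common in practice.
    """
    if subtract_mean:
        advantages = advantages - advantages.mean(axis=1, keepdims=True)
    return values + advantages

print(replay_probabilities([1.0, 2.0, 4.0]))
print(dueling_q(np.array([[1.0]]), np.array([[0.5, -0.5]])))
```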

Pop-Art [18] is proposed to adapt to different and non-stationary target magnitudes, which successfully replaces the clipping of rewards as done in DQN to handle various magnitudes of targets. Fast reward propagation [19] is a novel training algorithm for reinforcement learning, which combines the strength of DQN and exploits longer state transitions in experience replays by tightening the optimization via constraints. This novel technique makes DRL more practical by drastically reducing training time. Gorila [20] is the first massively distributed architecture for DRL. This architecture uses four main components: parallel actors; parallel learners; a distributed neural network to represent the value function or behavior policy; and a distributed store of experience. To address the limited memory and imperfect game information at each decision point, Deep Recurrent Q-Network (DRQN) [21] replaces the first fully-connected layer of DQN with a recurrent neural network, as shown in Fig. 2(c).

Generally, DQN learns rich domain representations and approximates the value function with deep neural networks, while batch RL algorithms with linear representations are more stable and require less hyperparameter tuning. The Least Squares DQN (LS-DQN) [22] combines DQN's rich feature representations with the stability of a linear least squares method. In order to reduce the approximation error variance in DQN's target values, averaged-DQN [23] averages previously learned Q-value estimates, leading to more stable training and improved performance. Deep Q-learning from Demonstrations (DQfD) [24] combines DQN with human demonstrations, which improves the sample efficiency greatly. DQV [25] uses TD learning to train a value neural network, and uses this network to train a second quality-value network to estimate state-action values. DQV learns significantly faster and better than double DQN. Researchers have proposed several improvements to DQN; however, it is unclear which of these are complementary and how much can be gained by combining them. Rainbow [26] combines the main extensions to DQN, and gives each component's contribution to overall performance. RUDDER [27] is a novel reinforcement learning approach for finite MDPs with delayed rewards, based on return decomposition. RUDDER is exponentially faster on tasks with different lengths of reward delays. Ape-X DQfD [28] uses a new transformed Bellman operator to process rewards of varying densities and scales, and applies human demonstrations to ease the exploration problem and guide agents towards rewarding states. Additionally, it proposes an auxiliary temporal consistency loss that allows stable training while extending the effective planning horizon by an order of magnitude. Soft DQN [29] is an entropy-regularized version of Q-learning, with better robustness and generalization.

Distributional DRL learns the value distribution, in contrast to common RL that models the expectation of return, or value. C51 [30] focuses on the distribution of value, and designs a distributional DQN algorithm to learn approximate value distributions. QR-DQN [31] closes a number of gaps between theoretical and algorithmic results, using distributional reinforcement learning with quantile regression, in which the distribution over returns is modeled explicitly instead of only estimating the mean. Implicit Quantile Networks (IQN) [32] is a flexible, applicable, and state-of-the-art distributional DQN. IQN approximates the full quantile function of the return distribution with quantile regression, and provides a fully integrated distributional RL agent without prior assumptions on the parameterization of the return distribution. Furthermore, IQN allows expanding the class of control policies to a wide range of risk-sensitive policies connected to distortion risk measures.

B. Policy gradient DRL methods

Policy gradient DRL optimizes the parameterized policy directly. The actor-critic architecture computes the policy gradient using a value-based critic function to estimate the expected future reward, as shown in Fig. 2(d). Asynchronous DRL is an efficient framework for DRL that uses asynchronous gradient descent to optimize the policy [33]. Asynchronous advantage actor-critic (A3C) trains several agents on multiple environments, showing a stabilizing effect on training. The objective function of the actor is

J(θ) = E_π[ Σ_{t=0}^{∞} A_{θ,θ_v}(s_t, a_t) log π_θ(a_t|s_t) + βH_θ(π(s_t)) ],   (8)

where H_θ(π(s_t)) is an entropy term used to encourage exploration.
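As a small numerical illustration of the entropy bonus in Eq. (8), the sketch below, assuming NumPy, evaluates the per-step actor objective for a discrete policy; the probabilities, advantage, and β are toy values.

```python
import numpy as np

def actor_objective_term(probs, action, advantage, beta=0.01):
    """Per-step term of Eq. (8): A * log pi(a|s) + beta * H(pi(.|s))."""
    probs = np.asarray(probs, dtype=np.float64)
    log_pi_a = np.log(probs[action])
    entropy = -np.sum(probs * np.log(probs))
    return advantage * log_pi_a + beta * entropy

print(actor_objective_term([0.7, 0.2, 0.1], action=0, advantage=0.5))
```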

GA3C [34] is a hybrid CPU/GPU version of A3C, which achieves a significant speed-up compared to the original CPU implementation. UNsupervised REinforcement and Auxiliary Learning (UNREAL) [35] learns separate policies for maximizing many other pseudo-reward functions simultaneously, including value function replay, reward prediction, and pixel control. This agent drastically improves both data efficiency and robustness to hyperparameter settings. PAAC [36] is a novel framework for efficient parallelization of DRL, where multiple actors learn the policy on a single machine. Policy gradient methods are efficient techniques for policy improvement, while they are usually on-policy and unable to take advantage of off-policy data. PGQ [37] addresses this by combining policy gradient with Q-learning, and establishes an equivalency between regularized policy gradient techniques and advantage function learning algorithms. Retrace(λ) [38] takes the best of importance sampling, off-policy Q(λ), and tree-backup(λ), resulting in low variance, safety, and efficiency. Reactor [39] is a sample-efficient and numerically efficient reinforcement learning agent based on a multi-step return off-policy actor-critic architecture. It combines a dueling DRQN architecture with an actor-critic architecture, as shown in Fig. 2(e). The network outputs a target policy, an action-value Q-function, and an estimated behavioral policy. The critic is trained with the off-policy multi-step Retrace method and the actor is trained by a β-leave-one-out policy gradient. Importance-Weighted Actor Learner Architecture (IMPALA) [40] is a new distributed DRL method, which can scale to thousands of machines. IMPALA uses a single reinforcement learning agent with a single set of parameters to solve a large collection of tasks. This method achieves stable learning by combining decoupled acting and learning with a novel V-trace off-policy correction method, which is critical for learning stability.

1) Trust region method: Trust Region Policy Optimization (TRPO) [41] is proposed for optimizing control policies, with guaranteed monotonic improvement. TRPO computes an ascent direction to improve the policy gradient, which ensures only a small change in the policy distribution. The constrained optimization problem of TRPO in each epoch is

maximize_θ  E_{s∼ρ_{θ′}, a∼π_{θ′}}[ (π_θ(a|s) / π_{θ′}(a|s)) A_{θ′}(s, a) ],   (9a)

s.t.  E_{s∼ρ_{θ′}}[ D_KL(π_{θ′}(·|s) ‖ π_θ(·|s)) ] ≤ δ_KL.   (9b)

This algorithm is effective for optimizing large nonlinear policies. Proximal policy optimization (PPO) [42] samples data by interacting with the environment, and optimizes the objective function with stochastic gradient ascent

r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t),   (10a)

L(θ) = E_t[ min(r_t(θ)A_t, clip(r_t(θ), 1 − ε, 1 + ε)A_t) ].   (10b)

r_t(θ) denotes the probability ratio. This objective function clips the probability ratio to modify the surrogate objective. PPO has some benefits over TRPO, and is much simpler to implement, with better sample complexity. Actor-critic with experience replay (ACER) [43] introduces several innovations, including a stochastic dueling network, truncated importance sampling, and a new trust region method, which is stable and sample efficient. Actor-critic using Kronecker-Factored Trust Region (ACKTR) [44] is based on natural policy gradient, and uses Kronecker-factored approximate curvature (K-FAC) with trust region to optimize the actor and the critic. ACKTR is sample efficient compared with other actor-critic methods.
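The clipped surrogate of Eqs. (10a)–(10b) can be sketched directly, assuming NumPy; the probabilities and advantages below are toy values, and gradients are omitted for brevity.

```python
import numpy as np

def ppo_clipped_objective(new_probs, old_probs, advantages, eps=0.2):
    """Empirical estimate of Eq. (10b): mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    ratios = np.asarray(new_probs) / np.asarray(old_probs)          # Eq. (10a)
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(ratios * advantages, clipped * advantages)))

print(ppo_clipped_objective(new_probs=[0.5, 0.1],
                            old_probs=[0.4, 0.2],
                            advantages=[1.0, -1.0]))  # 0.2
```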

2) Deterministic policy: Apart from stochastic policies, deep deterministic policy gradient (DDPG) [45] is a kind of deterministic policy gradient method which adapts the success of DQN to continuous control. The update rule of DDPG is

Q(s_t, a_t) = r(s_t, a_t) + γQ(s_{t+1}, π_θ(s_{t+1})).   (11)

DDPG is an actor-critic, off-policy algorithm, and is able to learn reasonable policies on various tasks. Distributed Distributional DDPG (D4PG) [46] is a distributional update to DDPG, combined with the use of multiple distributed workers all writing into the same replay table. This method achieves much better performance on a number of difficult continuous control problems.
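A minimal sketch of the bootstrapped target in Eq. (11), assuming NumPy; the critic and the deterministic policy are passed in as plain callables, which are toy stand-ins for the target networks used in practice.

```python
import numpy as np

def ddpg_critic_target(reward, next_state, target_policy, target_critic, gamma=0.99):
    """Target value of Eq. (11): r(s, a) + gamma * Q(s', pi(s'))."""
    next_action = target_policy(next_state)
    return reward + gamma * target_critic(next_state, next_action)

# Toy stand-ins: a linear policy and a bilinear critic on 2-D states/actions.
policy = lambda s: -0.5 * s
critic = lambda s, a: float(np.dot(s, a))
print(ddpg_critic_target(1.0, np.array([1.0, 2.0]), policy, critic))
```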

3) Entropy-regularized policy gradient: Soft Actor-Critic (SAC) is an off-policy policy gradient method, which establishes a bridge between DDPG and stochastic policy optimization. SAC incorporates the clipped double-Q trick, and the objective function of maximum entropy DRL is

J(π) = Σ_{t=0}^{T} E_{(s_t,a_t)∼ρ_π}[ r(s_t, a_t) + αH(π(·|s_t)) ].   (12)

SAC uses entropy regularization in its objective function. It trains the policy to maximize a trade-off between entropy and expected return. The entropy is a measure of randomness in the policy. This mechanism is similar to the trade-off between exploration and exploitation. Increasing entropy encourages more exploration and accelerates the learning process. Moreover, it can also prevent the learned policy from converging to a poor local optimum.
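For a discrete policy, the entropy bonus in Eq. (12) is easy to evaluate; the sketch below, assuming NumPy, computes the per-step entropy-regularized reward r + αH(π(·|s)) for toy values.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy H(pi(.|s)) of a discrete action distribution."""
    probs = np.asarray(probs, dtype=np.float64)
    return float(-np.sum(probs * np.log(probs)))

def soft_reward(reward, probs, alpha=0.2):
    """Per-step term of Eq. (12): r(s, a) + alpha * H(pi(.|s))."""
    return reward + alpha * entropy(probs)

print(soft_reward(1.0, [0.25, 0.25, 0.25, 0.25]))  # 1.0 + 0.2 * log(4)
```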

C. Model-based DRL methods

Combining model-free reinforcement learning with on-line planning is a promising approach to solve the sample efficiency problem. TreeQN [47] is proposed to address these challenges. It is a differentiable, recursive, tree-structured model that serves as a drop-in replacement for any value function network in DRL with discrete actions. TreeQN dynamically constructs a tree by recursively applying a transition model in a learned abstract state space and then aggregating predicted rewards and state-values using a tree backup to estimate Q-values. ATreeC is an actor-critic variant that augments TreeQN with a softmax layer to form a stochastic policy network. Both approaches are trained end-to-end, such that the learned model is optimized for its actual use in the planner. TreeQN and ATreeC outperform n-step DQN and value prediction networks on multiple Atari games. Vezhnevets et al. [48] present the STRategic Attentive Writer (STRAW) neural network architecture to build implicit plans. STRAW purely interacts with an environment, and is an end-to-end method. The STRAW model can learn temporally abstracted high-level macro-actions, which enables both economic computation and structured exploration. STRAW employs temporally extended planning strategies and achieves strong improvements on Atari games. The world model [49] uses an unsupervised manner to train a generative recurrent neural network, which can model RL environments through compressed spatiotemporal representations. It feeds extracted features into simple and compact policies, achieving impressive results in several environments. Value propagation (VProp) [50] is based on value iteration, and is an efficient differentiable planning module. It can successfully be trained to learn to plan using reinforcement learning. As a generalization of AlphaZero, MuZero [54] combines MCTS with a learned model, and predicts the reward, the action-selection policy, and the value function for planning. It extends model-based RL to a range of logically complex and visually complex domains, and achieves superhuman performance.
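The tree-backup idea behind TreeQN can be illustrated with a toy depth-limited lookahead, assuming NumPy; the transition, reward, and leaf-value models below are hypothetical callables standing in for learned networks, so this is a conceptual sketch rather than the TreeQN architecture itself.

```python
import numpy as np

def lookahead_q(state, action, trans_model, reward_model, value_model,
                actions, depth, gamma=0.99):
    """Estimate Q(s, a) by recursively unrolling a learned model to a fixed depth."""
    next_state = trans_model(state, action)
    r = reward_model(state, action)
    if depth == 1:
        return r + gamma * value_model(next_state)
    backups = [lookahead_q(next_state, a, trans_model, reward_model,
                           value_model, actions, depth - 1, gamma) for a in actions]
    return r + gamma * max(backups)  # tree backup: max over child estimates

# Toy linear models over a 1-D abstract state.
trans = lambda s, a: s + 0.1 * a
rew = lambda s, a: -abs(s)
val = lambda s: -abs(s)
print(lookahead_q(1.0, -1, trans, rew, val, actions=[-1, 0, 1], depth=2))
```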

A general review of various DRL methods from 2017 to 2019 is presented in Table I.

IV. DRL IN VIDEO GAMES

Playing video games like human experts is challenging for computers. With the development of DRL, agents are able to play various games end-to-end. Here we focus on game research platforms and competitions, and on the impressive progress in various video games, from 2D to 3D, and from single-agent to multi-agent, as shown in Fig. 3.

A. Game research platforms

Platforms and competitions make great contributions to the development of game AI, and help to evaluate agents' intelligence, as presented in Table II. Most platforms can be described by two major categories: General Platforms and Specific Platforms.

General Platforms: The Arcade Learning Environment (ALE) [55] is the pioneering evaluation platform for DRL algorithms, which provides an interface to plenty of Atari 2600 games. ALE presents both game images and signals, such as player scores, which makes it a suitable testbed. To promote the progress of DRL research, OpenAI integrates a collection of reinforcement learning tasks into a platform called Gym [56], which mainly contains Algorithmic, Atari, Classical Control, Board games, and 2D and 3D robots. After that, OpenAI Universe [57] is a platform for measuring and training agents' general intelligence across a large supply of games. Gym Retro [58] is a wrapper for video game emulators with a unified interface like Gym, and makes Gym easy to extend with a large collection of video games, not only Atari but also NEC, Nintendo, and Sega, for RL research. The OpenAI Retro contest aims at exploring the development of DRL agents that can generalize from previous experience. Based on the Sonic the Hedgehog™ video game, OpenAI presents a new DRL benchmark [59]. This benchmark can help to measure the performance of few-shot learning and transfer learning in reinforcement learning. General Video Game Playing [60] is intended to design an agent to play multiple video games without human intervention. The General Video Game AI (GVGAI) [61] competition is proposed to provide an easy-to-use and open-source platform for evaluating AI methods, including DRL. DeepMind Lab [62] is a first-person perspective learning environment, and provides multiple complicated tasks in partially observed, large-scale, and visually diverse worlds.
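For readers unfamiliar with these platforms, the snippet below shows a minimal interaction loop using the classic Gym interface (reset/step); it assumes an older Gym version with the four-value step return, and uses CartPole-v1 only because it needs no extra ROMs — Atari environments such as the Pong ones require the Atari extras to be installed.

```python
import gym

# CartPole-v1 ships with Gym itself; Atari games need additional ROMs/extras.
env = gym.make("CartPole-v1")
obs = env.reset()
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()          # random policy as a placeholder
    obs, reward, done, info = env.step(action)  # classic 4-tuple Gym API
    total_reward += reward
env.close()
print("episode return:", total_reward)
```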


TABLE I
A GENERAL REVIEW OF RECENT DRL METHODS FROM 2017 TO 2019.

DRL Algorithms | Main Techniques | Networks | Category
DQN [14] | experience replay, target Q-network | CNN | value-based, off-policy
Double DQN [15] | double Q-learning | CNN | value-based, off-policy
Dueling DQN [17] | dueling neural network architecture | CNN | value-based, off-policy
Prioritized DQN [16] | prioritized experience replay | CNN | value-based, off-policy
Bootstrapped DQN [51] | combine deep exploration with DNNs | CNN | value-based, off-policy
Gorila [20] | massively distributed architecture | CNN | value-based, off-policy
LS-DQN [22] | combine least-squares updates in DRL | CNN | value-based, off-policy
Averaged-DQN [23] | averaging learned Q-values estimates | CNN | value-based, off-policy
DQfD [24] | learn from the demonstration data | CNN | value-based, off-policy
DQN with Pop-Art [18] | adaptive normalization with Pop-Art | CNN | value-based, off-policy
Soft DQN [29] | KL penalty and entropy bonus | CNN | value-based, off-policy
DQV [25] | training a Quality-value network | CNN | value-based, off-policy
Rainbow [26] | integrate six extensions to DQN | CNN | value-based, off-policy
RUDDER [27] | return decomposition | CNN-LSTM | value-based, off-policy
Ape-X DQfD [28] | transformed Bellman operator, temporal consistency loss | CNN | value-based, off-policy
C51 [30] | distributional Bellman optimality | CNN | value-based, off-policy
QR-DQN [31] | distributional RL with Quantile regression | CNN | value-based, off-policy
IQN [32] | an implicit representation of the return distribution | CNN | value-based, off-policy
A3C [33] | asynchronous gradient descent | CNN-LSTM | policy gradient, on-policy
GA3C [34] | hybrid CPU/GPU version | CNN-LSTM | policy gradient, on-policy
PPO [42] | clipped surrogate objective, adaptive KL penalty coefficient | CNN-LSTM | policy gradient, on-policy
ACER [43] | experience replay, truncated importance sampling | CNN-LSTM | policy gradient, off-policy
ACKTR [44] | K-FAC with trust region | CNN-LSTM | policy gradient, on-policy
Soft Actor-Critic [52] | entropy regularization | CNN | policy gradient, off-policy
UNREAL [35] | unsupervised auxiliary tasks | CNN-LSTM | policy gradient, on-policy
Reactor [39] | Retrace(λ), β-leave-one-out policy gradient estimate | CNN-LSTM | policy gradient, off-policy
PAAC [36] | parallel framework for A3C | CNN | policy gradient, on-policy
DDPG [45] | DQN with deterministic policy gradient | CNN-LSTM | policy gradient, off-policy
TRPO [41] | incorporate a KL divergence constraint | CNN-LSTM | policy gradient, on-policy
D4PG [46] | distributed distributional DDPG | CNN | policy gradient, on-policy
PGQ [37] | combine policy gradient and Q-learning | CNN | policy gradient, off-policy
IMPALA [40] | importance-weighted actor learner architecture | CNN-LSTM | policy gradient, on-policy
FiGAR-A3C [53] | fine grained action repetition | CNN-LSTM | policy gradient, on-policy
TreeQN/ATreeC [47] | on-line planning, tree-structured model | CNN | model-based, on-policy
STRAW [48] | macro-actions, planning strategies | CNN | model-based, on-policy
World model [49] | mixture density network, variational autoencoder | CNN-LSTM | model-based, on-policy
MuZero [54] | representation function, dynamics function, and prediction function | CNN | model-based, off-policy

TABLE II
A LIST OF GAME AI COMPETITIONS SUITABLE FOR DRL RESEARCH.

Competition Name | Time
ViZDoom AI competition | 2016, 2017, 2018
StarCraft AI competitions (AIIDE, CIG, SSCAIT) | 2010 — 2019
microRTS competition | 2017, 2018, 2019
The GVGAI competition – learning track | 2017, 2018
Microsoft Malmo collaborative AI challenge | 2017
The multi-agent RL in Malmo competition | 2018
The OpenAI Retro contest | 2018
NeurIPS Pommerman competition | 2018
Unity Obstacle Tower Challenge | 2019
NeurIPS MineRL competition | 2019

The Unity ML-Agents Toolkit [63] is a new toolkit for creating and interacting with simulation environments. This platform has sensory, physical, cognitive, and social complexity, and enables fast and distributed simulation, and flexible control.

Specific Platforms: Malmo [64] is a research platform for AI experiments, which is built on top of Minecraft. It is a first-person 3D environment, and can be used for multi-agent research, as in the Microsoft Malmo collaborative AI challenge 2017 and the multi-agent RL in Malmo competition 2018. TORCS [65] is a racing car simulator which provides both low-level and visual features for self-driving car research with DRL. ViZDoom [66] is a first-person shooter game platform, and encourages DRL agents to utilize visual information to perform navigation and shooting tasks in a semi-realistic 3D world. The ViZDoom AI competition has attracted plenty of researchers to develop their DRL-based Doom agents since 2016. As far as we know, real-time strategy (RTS) games are very challenging for reinforcement learning methods. Facebook proposes TorchCraft for StarCraft I [67], and DeepMind releases the StarCraft II learning environment [68]. They expect researchers to propose powerful DRL agents to achieve high-level performance in RTS games and annual StarCraft AI competitions. CoinRun [69] provides a metric for an agent's ability to transfer its experience to novel situations. This new training environment strikes a desirable balance in complexity: the environment is much simpler than traditional platform games, but it still poses a worthy generalization challenge for DRL algorithms. Google Research Football is a new environment based on the open-source game Gameplay Football for DRL research.

Fig. 3. The diagram of various video game AIs, from 2D to 3D, and from single-agent to multi-agent.

B. Atari games

ALE is an evaluation platform that aims at building agents with general intelligence across hundreds of Atari 2600 games. As the most popular testbed for DRL research, it has seen a large number of DRL methods achieve outstanding performance consecutively. Machado et al. [70] take a fresh look at the ALE in the DRL research community, proposing diverse evaluation methodologies and highlighting some key concerns. In this section, we will introduce the main achievements in the ALE domain, including the extremely difficult Montezuma's Revenge.

As the milestone in this domain, DQN is able to surpass the performance of previous algorithms, and achieves human-level performance across 49 games [14]. Averaged-DQN examines the source of value function estimation errors, and demonstrates significantly improved stability and performance on the ALE benchmark [23]. UNREAL significantly outperforms the previous best performance on Atari, averaging 880% of expert human performance [35]. PAAC achieves sufficiently good performance on ALE after a few hours of training [36]. DQfD has better initial performance than DQN on most Atari games, and receives more average reward than DQN on 27 of 42 games. In addition, DQfD learns faster than DQN even when given poor demonstration data [24]. Noisy DQN replaces the conventional exploration heuristics with NoisyNet, and yields substantially higher scores in the ALE domain. As a distributional DRL method, C51 obtains a new series of impressive results, and demonstrates the importance of the value distribution in approximated RL [30]. Rainbow provides improvements in terms of sample efficiency and final performance. The authors also show the contribution of each component to overall performance [26]. The QR-DQN algorithm significantly outperforms recent improvements on DQN, including the related C51 [30]. IQN shows substantial gains on the Atari benchmark over QR-DQN, and even halves the distance between QR-DQN and Rainbow [32]. Ape-X DQN substantially improves the performance on the ALE, achieving a better final score in less wall-clock training time [71]. When tested on a set of 42 Atari games, the Ape-X DQfD algorithm exceeds the performance of an average human on 40 games using a common set of hyperparameters. Mean and median scores across multiple Atari games of typical DRL methods that achieve state-of-the-art performance consecutively are presented in Table III.

TABLE III
MEAN AND MEDIAN SCORES ACROSS 57 ATARI GAMES OF TYPICAL DRL METHODS, MEASURED AS PERCENTAGES OF HUMAN BASELINE.

Methods | Mean | Median | Year
DQN [14] | 228% | 79% | 2015
C51 [30] | 701% | 178% | 2017
UNREAL [35] | 880% | 250% | 2017
QR-DQN [31] | 915% | 211% | 2017
IQN [32] | 1019% | 218% | 2018
Rainbow [26] | 1189% | 230% | 2018
Ape-X DQN [71] | 1695% | 434% | 2018
Ape-X DQfD ∗ [28] | 2346% | 702% | 2018

Note: ∗ means this method is measured across 42 Atari games.

Montezuma's Revenge is one of the most difficult Atari video games. It is a goal-directed behavior learning environment with long horizons and sparse reward feedback signals.


Players must navigate through a number of different rooms, avoid obstacles and traps, climb ladders up and down, and then pick up the key to open new rooms. It requires a long sequence of actions before reaching the goal and receiving a reward, and it is difficult to explore an optimal policy for such tasks. Efficient exploration is considered a crucial factor for learning in a delayed-feedback environment. Ostrovski et al. [72] provide an improved version of count-based exploration with PixelCNN as a supplement for pseudo-counts, and also reveal the importance of the Monte Carlo return for effective exploration. In addition to improving exploration efficiency, learning from human data is also a proper method to reach better performance on this problem. Le et al. [73] leverage imitation learning from expert interaction and hierarchical reinforcement learning at different levels. This method learns obviously faster than original hierarchical reinforcement learning, and also significantly more efficiently than conventional imitation learning. Other than gameplay, demonstrations are also a valuable kind of sample for an agent to learn from. DQfD utilizes a small set of demonstration data to speed up the learning process [24]. It combines a prioritized replay mechanism with temporal difference updates and supervised classification, and finally achieves a better and impressive result. Further, Aytar et al. [74] only use YouTube videos as demonstration samples and introduce a transformed Bellman operator for learning from human demonstrations. Interestingly, these two works both claim to be the first to solve the entire first level of Montezuma's Revenge. Go-Explore [75] makes further progress, and achieves scores over 400,000 on average. Go-Explore separates learning into exploration and robustification. It reliably solves the whole game, and generalizes well.

C. First-person perspective games

Different from Atari games, agents in first-person perspective video games can only receive observations from their own perspectives, resulting in imperfect-information inputs. In the RL domain, this is a POMDP problem which requires efficient exploration and memory.

1) ViZDoom: First-person shooter (FPS) games play an important role in game AI research. Doom is a classical FPS game, and ViZDoom is presented as a novel testbed for DRL [66]. Agents learn from visual inputs, and interact with the ViZDoom environment from a first-person perspective. Wu et al. [76] propose a method that combines A3C and curriculum learning. The agent learns to navigate and attack by playing against built-in agents progressively. Parisotto et al. [77] develop Neural Map, a memory system with an adaptable write operator. Neural Map uses a spatially structured 2D memory image to store the environment's information. This method surpasses other DRL memories on several challenging ViZDoom maze tasks and shows a capable generalization ability. Shao et al. [78] show that ACKTR can successfully teach agents to battle in the ViZDoom environment, and outperforms A2C agents by a significant margin.

2) TORCS: TORCS is a racing game where the actions are acceleration, braking and steering. This game has more realistic graphics than Atari games, but also requires agents to learn the dynamics of the car. FiGAR-DDPG can successfully complete the race task and finish 20 laps of the circuit, with a 10× total reward compared to that obtained by DDPG, and much smoother policies [53]. Normalized Actor-Critic (NAC) normalizes the Q-function effectively, learns an initial policy network from demonstrations and refines the policy in a real environment [79]. NAC is robust to suboptimal demonstration data, learns robustly and outperforms existing baselines when evaluated on TORCS. Mazumder et al. [80] incorporate state-action permissibility (SAP) into DDPG, and apply it to tackle the lane keeping problem in TORCS. The proposed method can speed up DRL training remarkably for this task. In [81], a two-stage approach is proposed for the vision-based vehicle lateral control problem, which includes a multi-task learning perception stage and an RL control stage. By exploiting the correlation between multiple learning tasks, the perception module can robustly extract track features. Additionally, the RL agent learns by maximizing a geometry-based reward and performs better than the LQR and MPC controllers. Zhu et al. [82] use DRL to train a CNN to perceive driving data from first-person-view images, and learn a controller to produce driving commands, showing promising performance.

3) Minecraft: Minecraft is a sandbox construction game, where players can build creative creations, structures, and artwork across various game modes. Recently, it has become a popular platform for game AI research, with 3D, infinitely varied data. Project Malmo [83] is an experimentation platform built on Minecraft for AI research. It supports a large number of scenarios, ranging from navigation and problem-solving tasks to survival and collaboration. Xiong et al. [84] propose a novel Q-learning approach with state-action abstraction and warm start using human reasoning to learn effective policies in the Microsoft Malmo collaborative AI challenge. The ability to transfer knowledge from a source task to a target task in Minecraft is one of the major challenges. Tessler et al. [85] provide a DRL agent which can transfer knowledge by learning reusable skills, which are then incorporated into a hierarchical DRL network (H-DRLN). H-DRLN exhibits superior performance and lower learning sample complexity compared to regular DQN in Minecraft, and shows the potential to transfer knowledge between related Minecraft tasks without any additional learning. To solve the problem of partial or non-Markovian observations, Jin et al. [86] propose a new DRL algorithm based on counterfactual regret minimization that iteratively updates an approximation to a cumulative clipped advantage function. On the challenging Minecraft first-person navigation benchmarks, this algorithm can substantially outperform strong baseline methods.

4) DeepMind Lab: DeepMind Lab is a 3D first-person game platform extended from OpenArena, which is based on Quake3. Compared to other first-person game platforms, DeepMind Lab has considerably richer visuals and more realistic physics, making it a significantly more complex platform. On a challenging suite of DeepMind Lab tasks, the UNREAL agent leads to a mean speedup in learning of 10× over A3C and averages 87% of expert human performance. As learning agents become more powerful, continual learning has made quick progress recently. To test continual learning capabilities, Mankowitz et al. [87] consider an implicit sequence of tasks with sparse rewards in DeepMind Lab. The novel agent architecture, called Unicorn, demonstrates strong continual learning and outperforms several baseline agents on the proposed domain. Schmitt et al. [88] present a method which uses teacher agents to kickstart the training of a new student agent. On the multi-task and challenging DMLab-30 suite, kickstarted training improves new agents' sample efficiency to a great extent, and surpasses the final performance by 42%. Jaderberg et al. [89] focus on Quake III Arena Capture the Flag, which is a popular 3D first-person multiplayer video game, and demonstrate that DRL agents can achieve human-level performance with only pixels and game points as input. The agent uses population-based training to optimize the policy. This method trains a large number of agents concurrently from thousands of parallel matches, where agents play cooperatively in teams and against each other on randomly generated environments. In an evaluation, the trained agents exceed the win rate of the self-play baseline and of high-level human players both as teammates and opponents, and prove far stronger than existing DRL agents.

D. Real-time strategy games

Real-time strategy games are very popular among players, and have become popular platforms for AI research.

1) StarCraft: In StarCraft, players need to perform actions according to real-time game states, and defeat their enemies. Generally speaking, designing an AI bot involves many challenges, including multi-agent collaboration, spatial and temporal reasoning, adversarial planning, and opponent modeling. Currently, most bots are based on human experience and replays, with limited flexibility and intelligence. DRL has proved to be a promising direction for StarCraft AI, especially in micromanagement, build orders, mini-games and full games [90].

Recently, micromanagement has been widely studied as the first step to solving StarCraft AI. Usunier et al. [91] introduce the greedy MDP with episodic zero-order optimization (GMEZO) algorithm to tackle micromanagement scenarios, which performs better than DQN and policy gradient. BiCNet [92] is a multi-agent deep reinforcement learning method for playing StarCraft combat games. It is based on actor-critic reinforcement learning, and uses bi-directional neural networks to learn collaboration. BiCNet successfully learns some cooperative strategies, and is adaptable to various tasks, showing better performance than GMEZO. In the aforementioned works, researchers mainly develop centralized methods to play micromanagement. Foerster et al. [93] focus on decentralized control for micromanagement, and propose a multi-agent actor-critic method. To stabilize experience replay and address non-stationarity, they use fingerprints and importance sampling, which improve the final performance. Shao et al. [94] follow the decentralized micromanagement task, and propose the parameter-sharing multi-agent gradient-descent SARSA(λ) (PS-MAGDS) method. To reuse knowledge between various micromanagement scenarios, they also combine curriculum transfer learning with this method. This improves the sample efficiency, and outperforms GMEZO and BiCNet in large-scale scenarios. Kong et al. [95] build on a master-slave architecture, and propose master-slave multi-agent reinforcement learning (MS-MARL). MS-MARL includes composed action representation, independent reasoning, and learnable communication. This method has better performance than other methods in micromanagement tasks. Rashid et al. [96] focus on several challenging StarCraft II micromanagement tasks, and use centralized training and decentralized execution to learn cooperative behaviors. This eventually outperforms state-of-the-art multi-agent deep reinforcement learning methods.

Researchers also use DRL methods to optimize the build order in StarCraft. Tang et al. [97] put forward neural network fitted Q-learning (NNFQ) and convolutional neural network fitted Q-learning (CNNFQ) to build units in simple StarCraft maps. These models are able to find effective production sequences, and eventually defeat enemies. In [68], researchers present baseline results of several main DRL agents in the StarCraft II domain. The fully convolutional advantage actor-critic (FullyConv-A2C) agents achieve beginner level in StarCraft II mini-games. Zambaldi et al. [98] introduce relational DRL to StarCraft, which iteratively reasons about the relations between entities with self-attention, and uses this to guide a model-free RL policy. This method improves the sample efficiency, generalization ability, and interpretability of conventional DRL approaches. The relational DRL agent achieves impressive performance on SC2LE mini-games. Sun et al. [99] develop the DRL-based agent TStarBot, which uses a flat action structure. This agent defeats the built-in AI agents from level 1 to level 10 in a full game for the first time. Lee et al. [100] focus on StarCraft II AI, and present a novel modular architecture, which splits responsibilities between multiple modules. Each module controls one aspect of the game, and two modules are trained with self-play DRL methods. This method defeats the built-in bot at the "Harder" level. Pang et al. [101] investigate a two-level hierarchical RL approach for StarCraft II, where macro-actions are automatically extracted from experts' data and a flexible, scalable hierarchical architecture is used. More recently, DeepMind proposes AlphaStar, which defeats professional players for the first time.

2) MOBA and Dota2: MOBA (Multiplayer Online Battle Arena) games originated from RTS games; they have two teams, and each team consists of five players. To beat the opponent, the five players in a team must cooperate, kill enemies, upgrade heroes, and eventually destroy the opponent's base. Since MOBA research is still at an early stage, there are fewer works than for conventional RTS games. Most works on MOBA concentrate on dataset analysis and case studies. However, due to a series of breakthroughs that DRL has achieved in game AI, researchers have started to pay more attention to MOBA recently. King of Glory (a simplified mobile version of Dota) is the most popular mobile MOBA game in China. Jiang et al. [102] apply Monte-Carlo Tree Search and deep neural networks to this game. The experimental results indicate that the MCTS-based DRL method is efficient and can be used in the 1v1 MOBA scenario. The most impressive works on MOBA are proposed by OpenAI. Their results prove that DRL methods with self-play can be successful not only in 1v1 and 2v2 Dota2 scenarios [103], but also in 5v5 [104] [105]. The model architecture is simple, using an LSTM layer as the core component of the neural network. With the support of massively distributed cloud computing and the PPO optimization algorithm, OpenAI Five can master the critical abilities of team fighting, searching forests, focusing, chasing, and diversion for team victory, and defeated the human champion team OG 2:0. Their work truly opens a new door to MOBA research with DRL methods.

V. CHALLENGES IN GAMES WITH DRL

Since DRL has achieved great progress in some video games, it is considered one of the most promising ways to realize artificial general intelligence. However, there are still some challenges that should be overcome towards this goal. In this section, we discuss some crucial challenges for DRL in video games, such as the tradeoff between exploration and exploitation, low sample efficiency, the dilemma between generalization and overfitting, multi-agent learning, incomplete information and delayed sparse rewards. Though some approaches have been proposed to tackle these problems, as presented in Fig. 4, there are still limitations to be overcome.

A. Exploration-exploitation

Exploration can help to obtain more diverse samples, while exploitation is the way to learn a high-reward policy from valuable samples. The trade-off between exploration and exploitation remains a major challenge for RL. Common methods for exploration require a large amount of data, and cannot tackle temporally-extended exploration. Most model-free RL algorithms are not computationally tractable in complicated environments.

Parametric noise can help exploration to a large extent during training [106] [107]. Besides, randomized value functions have become an effective approach for efficient exploration. Combining such exploration with deep neural networks helps agents learn much faster, which greatly improves the learning speed and final performance in most games [51].
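
A minimal sketch of parameter-space noise in the spirit of [106] [107] follows (the independent Gaussian noise and the initial noise scale are simplifying assumptions): learnable noise scales are attached to the weights of a linear layer, so exploration comes from perturbed parameters rather than from randomized actions.

import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Linear layer whose weights are perturbed by learnable Gaussian noise (sketch)."""
    def __init__(self, in_dim, out_dim, sigma0=0.017):
        super().__init__()
        self.mu_w = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)
        self.sigma_w = nn.Parameter(torch.full((out_dim, in_dim), sigma0))
        self.mu_b = nn.Parameter(torch.zeros(out_dim))
        self.sigma_b = nn.Parameter(torch.full((out_dim,), sigma0))

    def forward(self, x):
        # Fresh noise on every forward pass drives state-dependent exploration.
        weight = self.mu_w + self.sigma_w * torch.randn_like(self.mu_w)
        bias = self.mu_b + self.sigma_b * torch.randn_like(self.mu_b)
        return torch.nn.functional.linear(x, weight, bias)

# Hypothetical usage: a noisy output layer for a Q-network with 6 actions.
q_values = NoisyLinear(128, 6)(torch.randn(4, 128))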

A simple generalization of the popular count-based approach can reach satisfactory performance on various high-dimensional DRL benchmarks [108]. This method maps states to hash codes and counts their occurrences in a hash table. According to the classic count-based method, these counts are then used to compute a reward bonus. On many challenging tasks, such simple hash functions achieve impressive performance. This exploration strategy provides a simple yet powerful baseline for solving MDPs that require considerable exploration.
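
The following is a minimal sketch of such a count-based bonus (the random-projection hash and the bonus coefficient are illustrative assumptions, not the exact scheme of [108]): states are mapped to binary codes, occurrence counts are kept in a table, and the bonus decays as the inverse square root of the count.

import numpy as np
from collections import defaultdict

class HashCountBonus:
    """Count-based exploration bonus r+ = beta / sqrt(n(phi(s))) over hashed states."""
    def __init__(self, dim, code_bits=32, beta=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((code_bits, dim))   # random projection (SimHash-style)
        self.beta = beta
        self.counts = defaultdict(int)

    def bonus(self, state):
        code = tuple((self.A @ np.asarray(state) > 0).astype(np.int8))  # hash code phi(s)
        self.counts[code] += 1
        return self.beta / np.sqrt(self.counts[code])

# Hypothetical usage: add the bonus to the environment reward during training.
counter = HashCountBonus(dim=8)
r_total = 1.0 + counter.bonus(np.random.randn(8))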

B. Sample efficiency

DRL algorithms usually take millions of samples to achieve human-level performance, whereas humans can quickly master highly rewarding actions in an environment. Most model-free DRL algorithms are data inefficient, especially in environments with high dimensionality and a large exploration space. They have to interact with the environment at a large time cost to seek out high-reward experiences in a complex sample space, which limits their applicability to many scenarios. In order to reduce the exploration dimension of the environment and ease the time spent on interaction, several techniques can be used to improve data efficiency, such as hierarchy and demonstration.

Hierarchical reinforcement learning (HRL) allows agents to decompose a task into several simple subtasks, which can speed up training and improve sample efficiency. Temporal abstraction is key to scaling up learning, but creating such abstractions autonomously has remained challenging. The option-critic architecture is able to learn the internal policies and the options' termination conditions without any additional rewards or subgoals [109]. FeUdal Networks (FuNs) include a Manager module and a Worker module [110]. The Manager sets abstract goals at a high level; the Worker receives these goals and generates actions in the environment. FuN dramatically outperforms baseline agents on tasks that involve long-term credit assignment or memorization. Representation learning methods can also be used to guide the option discovery process in the HRL domain [111].
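
A minimal sketch of this Manager-Worker decomposition follows (it is not the FuN architecture itself; the goal dimensionality, the way goals condition the Worker, and all layer sizes are assumptions):

import torch
import torch.nn as nn

class Manager(nn.Module):
    """Produces an abstract goal vector at a low temporal resolution."""
    def __init__(self, obs_dim, goal_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, goal_dim))

    def forward(self, obs):
        g = self.net(obs)
        return g / (g.norm(dim=-1, keepdim=True) + 1e-8)   # directional goal

class Worker(nn.Module):
    """Selects primitive actions conditioned on the observation and the Manager's goal."""
    def __init__(self, obs_dim, goal_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + goal_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs, goal):
        return self.net(torch.cat([obs, goal], dim=-1))     # action logits

# Hypothetical usage: the Manager sets a goal, the Worker acts toward it.
obs = torch.randn(1, 32)
manager, worker = Manager(32), Worker(32, 16, 6)
goal = manager(obs)
action = torch.distributions.Categorical(logits=worker(obs, goal)).sample()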

Demonstration is another proper technique to improve sample efficiency. Current approaches that learn from demonstration typically use supervised learning on expert data together with reinforcement learning to improve performance. It is difficult to jointly optimize these divergent losses, and such methods are very sensitive to noisy demonstrations. Nevertheless, leveraging data from previous control of the system can greatly accelerate the learning process, even with small amounts of demonstration data [24]. Goals defined with human preferences can also effectively solve complicated RL tasks without an explicit reward function, while greatly reducing the cost of human oversight [112].
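
As a rough sketch of how a demonstration loss can be combined with TD learning, in the spirit of deep Q-learning from demonstrations [24] (the margin value and loss weighting are assumptions), a large-margin term pushes the expert's action above all others while a standard one-step TD term is applied to the same batch:

import torch
import torch.nn.functional as F

def demo_augmented_loss(q_values, q_next, expert_action, reward, gamma=0.99,
                        margin=0.8, lambda_margin=1.0):
    """One-step TD loss plus a large-margin loss on demonstration states (sketch)."""
    # TD loss: Q(s, a_E) toward r + gamma * max_a Q_target(s', a).
    td_target = reward + gamma * q_next.max(dim=-1).values.detach()
    q_taken = q_values.gather(-1, expert_action.unsqueeze(-1)).squeeze(-1)
    td_loss = F.smooth_l1_loss(q_taken, td_target)

    # Large-margin loss: Q(s, a_E) should exceed max_a [Q(s, a) + margin(a, a_E)].
    margins = torch.full_like(q_values, margin)
    margins.scatter_(-1, expert_action.unsqueeze(-1), 0.0)
    margin_loss = ((q_values + margins).max(dim=-1).values - q_taken).mean()
    return td_loss + lambda_margin * margin_loss

# Hypothetical usage on a batch of 4 demonstration transitions with 5 actions.
q = torch.randn(4, 5, requires_grad=True)
loss = demo_augmented_loss(q, torch.randn(4, 5), torch.tensor([1, 0, 3, 2]), torch.ones(4))
loss.backward()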

C. Generalization and Transfer

The ability to transfer knowledge across multiple environments is considered a critical aspect of intelligent agents. To promote generalization across multiple environments, multi-task learning and policy distillation have been applied to these situations.

Multi-task learning with shared neural network parameters can address the generalization problem, and efficiency can be improved through transfer across related tasks. The hybrid reward architecture takes a decomposed reward function as input and learns a separate value function for each component [113]. The overall value function is much smoother, can be easily approximated with a low-dimensional representation, and is learned more effectively. IMPALA shows the effectiveness of this direction for multi-task reinforcement learning, using less data and exhibiting positive transfer between tasks [40]. PopArt-IMPALA combines PopArt's adaptive normalization with IMPALA and allows a more efficient use of parallel data generation, showing impressive performance in the multi-task domain [114].
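
A minimal sketch of a hybrid-reward-style value network follows (the particular reward components and layer sizes are assumptions): each head estimates the value of one reward component, and the overall value is the sum of the heads.

import torch
import torch.nn as nn

class HybridRewardValueNet(nn.Module):
    """One value head per reward component; the full value is the sum of the heads."""
    def __init__(self, obs_dim, n_components, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden, 1) for _ in range(n_components))

    def forward(self, obs):
        h = self.trunk(obs)
        component_values = torch.cat([head(h) for head in self.heads], dim=-1)
        return component_values, component_values.sum(dim=-1)  # per-component and total value

# Hypothetical usage: a reward decomposed into, say, score, survival, and progress terms.
net = HybridRewardValueNet(obs_dim=16, n_components=3)
components, total = net(torch.randn(2, 16))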

To successfully learn complex tasks with DRL, we usually need large task-specific networks and extensive training to achieve good performance. Distral shares a distilled policy that captures common knowledge across multiple tasks [115]. Each worker is trained to solve its individual task while staying close to the shared policy, and the shared policy is trained by distillation. This approach shows efficient transfer on complex tasks, with more robust and more stable performance. Mix & Match is a training framework designed to encourage effective and rapid learning in DRL agents [116]. It automatically forms a curriculum over agents and progressively trains more complex agents from simpler ones.
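
The following is a simplified sketch of the distillation idea (the exact regularizers and entropy terms in Distral differ; the coefficients here are assumptions): each task policy is pulled toward the shared policy by a KL term added to its RL loss, while the shared policy is trained to imitate the task policies.

import torch
import torch.nn.functional as F

def distral_style_regularizers(task_logits, shared_logits, alpha=0.5):
    """KL terms used alongside the usual RL loss (simplified sketch)."""
    task_logp = F.log_softmax(task_logits, dim=-1)
    shared_logp = F.log_softmax(shared_logits, dim=-1)
    task_p = task_logp.exp()
    # Added to each worker's RL loss: KL(pi_task || pi_shared), shared policy held fixed.
    task_reg = alpha * (task_p * (task_logp - shared_logp.detach())).sum(-1).mean()
    # Distillation loss for the shared policy: cross-entropy toward the detached task policy.
    distill_loss = -(task_p.detach() * shared_logp).sum(-1).mean()
    return task_reg, distill_loss

# Hypothetical usage with a batch of 8 states and 4 actions.
task_reg, distill_loss = distral_style_regularizers(torch.randn(8, 4), torch.randn(8, 4))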

D. Multi-agent learning

Multi-agent learning is very important in video games such as StarCraft. In a cooperative multi-agent setting, the curse of dimensionality, communication, and credit assignment are major challenges.

Team learning uses a single learner to learn joint solutions in a multi-agent system, while concurrent learning uses a separate learner for each agent. Recently, centralised training of decentralised policies has become a standard paradigm for multi-agent training. Multi-agent DDPG considers other agents' action policies and can successfully learn complex multi-agent coordination behavior [117]. Counterfactual multi-agent policy gradients use a centralized critic to estimate the action-value function and decentralized actors to optimize each agent's policy, with a counterfactual advantage function to address the multi-agent credit assignment problem [118]. In addition, communication protocols are important for sharing information to solve multi-agent tasks. Reinforced Inter-Agent Learning (RIAL) and Differentiable Inter-Agent Learning (DIAL) use deep reinforcement learning to learn end-to-end communication protocols in complex environments. Analogously, CommNet is able to learn continuous communication between multiple agents.
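
A small numerical sketch of the counterfactual advantage, in the spirit of [118], follows (the Q-values and policy are made-up numbers): the advantage compares the value of the joint action with a baseline that marginalizes out one agent's action under its own policy, keeping the other agents' actions fixed.

import numpy as np

def counterfactual_advantage(q_values, policy, taken_action):
    """COMA-style advantage for one agent: A = Q(s, u) - sum_a pi(a|s) Q(s, (a, u_-i)).
    q_values[a] is the centralized critic's value when this agent plays a and the
    other agents keep their chosen actions."""
    baseline = np.dot(policy, q_values)          # counterfactual baseline
    return q_values[taken_action] - baseline

# Hypothetical numbers for one agent with three actions.
q = np.array([1.0, 2.5, 0.5])                    # centralized critic outputs
pi = np.array([0.2, 0.5, 0.3])                   # the agent's current policy
adv = counterfactual_advantage(q, pi, taken_action=1)   # 2.5 - 1.6 = 0.9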

E. Imperfect information

In partially observable and first-person-perspective games, DRL agents need to tackle imperfect information to learn a suitable policy. Making decisions in these environments is challenging for DRL agents.

A critical component of effective learning in such environments is the use of memory. DRL agents have used some simple memory architectures, such as stacking several past frames or adding an LSTM layer, but these architectures can only remember transitory information. Model-free episodic control learns difficult sequential decision-making tasks much faster and achieves a higher overall reward [119]. The differentiable neural computer uses a neural network to read from and write to an external memory matrix [120]. This method can solve complex, structured tasks that are inaccessible to neural networks without such external read-write memory. Neural episodic control inserts recent state representations paired with the corresponding value estimates into an appropriate neural dictionary, and learns significantly faster than other baseline agents [121].
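
A minimal sketch of an episodic memory lookup in the spirit of [121] follows (the inverse-distance kernel, the fixed number of neighbours, and the unbounded capacity are simplifying assumptions): recent state embeddings are stored with value estimates, and the value of a new state is a similarity-weighted average over its nearest stored neighbours.

import numpy as np

class EpisodicMemory:
    """Dictionary-style store: keys are state embeddings, values are return estimates."""
    def __init__(self, k=5):
        self.keys, self.values, self.k = [], [], k

    def write(self, embedding, value):
        self.keys.append(np.asarray(embedding, dtype=float))
        self.values.append(float(value))

    def lookup(self, embedding):
        keys = np.stack(self.keys)
        dists = np.linalg.norm(keys - embedding, axis=1)
        idx = np.argsort(dists)[: self.k]                  # k nearest stored states
        w = 1.0 / (dists[idx] + 1e-3)                      # inverse-distance kernel
        return float(np.dot(w, np.array(self.values)[idx]) / w.sum())

# Hypothetical usage: store a few embeddings and query a new state.
mem = EpisodicMemory(k=2)
for _ in range(10):
    mem.write(np.random.randn(8), np.random.rand())
estimate = mem.lookup(np.random.randn(8))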

F. Delayed sparse rewards

Sparse and delayed rewards are very common in many games and are one of the reasons for low sample efficiency in reinforcement learning.

In many scenarios, researchers use curiosity as an intrinsic reward to encourage agents to explore the environment and learn useful skills. Curiosity can be formulated as the error of the agent's prediction of its own actions' consequences in a learned visual feature space [122]. This formulation scales to high-dimensional continuous state spaces and ignores the aspects of the environment that cannot affect the agent. Curiosity search for DRL encourages intra-life exploration by rewarding agents for visiting as many different states as possible within each episode [123].
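
A minimal sketch of such a curiosity bonus, loosely following [122], is given below (the feature encoder, the absence of an inverse model, and the scaling constant are assumptions): a forward model predicts the next state's features from the current features and the action, and its prediction error serves as the intrinsic reward.

import torch
import torch.nn as nn

class ForwardModelCuriosity(nn.Module):
    """Intrinsic reward = error of a learned forward model in feature space (sketch)."""
    def __init__(self, obs_dim, n_actions, feat_dim=32, eta=0.1):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
        self.forward_model = nn.Sequential(nn.Linear(feat_dim + n_actions, 64), nn.ReLU(),
                                           nn.Linear(64, feat_dim))
        self.n_actions, self.eta = n_actions, eta

    def intrinsic_reward(self, obs, action, next_obs):
        phi, phi_next = self.encode(obs), self.encode(next_obs)
        a_onehot = torch.nn.functional.one_hot(action, self.n_actions).float()
        phi_pred = self.forward_model(torch.cat([phi, a_onehot], dim=-1))
        return self.eta * (phi_pred - phi_next).pow(2).mean(dim=-1)   # per-transition bonus

# Hypothetical usage on a batch of three transitions.
icm = ForwardModelCuriosity(obs_dim=16, n_actions=4)
bonus = icm.intrinsic_reward(torch.randn(3, 16), torch.tensor([0, 2, 1]), torch.randn(3, 16))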

VI. CONCLUSION AND DISCUSSION

Game AI with deep reinforcement learning is a challenging and promising direction. Recent progress in this domain has promoted the development of artificial intelligence research. In this paper, we review the achievements of deep reinforcement learning in video games. Different DRL methods and their successful applications are introduced. These DRL agents achieve human-level or super-human performance in various games, from 2D perfect-information to 3D imperfect-information settings, and from single-agent to multi-agent scenarios. Despite these achievements, there are still some major problems when applying DRL methods to this field, especially in 3D imperfect-information multi-agent video games. High-level game AI requires more efficient and robust DRL techniques, as well as novel frameworks that can be implemented in complex environments. These challenges have not been fully investigated and remain open for further study.

ACKNOWLEDGMENT

The authors would like to thank Qichao Zhang, Dong Li and Weifan Li for the helpful comments and discussions about this work.

REFERENCES

[1] N. Y. Georgios and T. Julian, Artificial Intelligence and Games. New York: Springer, 2018.

[2] Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,no. 7553, pp. 436–444, 2015.

[3] D. Zhao, K. Shao, Y. Zhu, D. Li, Y. Chen, H. Wang, D. Liu, T. Zhou,and C. Wang, “Review of deep reinforcement learning and discussionson the development of computer Go,” Control Theory and Applications,vol. 33, no. 6, pp. 701–717, 2016.

[4] Z. Tang, K. Shao, D. Zhao, and Y. Zhu, “Recent progress of deepreinforcement learning: from AlphaGo to AlphaGo Zero,” ControlTheory and Applications, vol. 34, no. 12, pp. 1529–1546, 2017.

[5] J. Niels, B. Philip, T. Julian, and R. Sebastian, “Deep learning for videogame playing,” CoRR, vol. abs/1708.07902, 2017.

[6] A. Kailash, P. D. Marc, B. Miles, and A. B. Anil, “Deep reinforcementlearning: A brief survey,” IEEE Signal Processing Magazine, vol. 34,pp. 26–38, 2017.

[7] L. Yuxi, “Deep reinforcement learning: An overview,” CoRR, vol.abs/1701.07274, 2017.

[8] R. Dechter, “Learning while searching in constraint-satisfaction-problems,” pp. 178–183, 1986.

[9] J. Schmidhuber, “Deep learning in neural networks,” Neural Networks,vol. 61, pp. 85–117, 2015.

[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classificationwith deep convolutional neural networks,” in International Conferenceon Neural Information Processing Systems, 2012, pp. 1097–1105.

[11] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” NeuralComputation, vol. 9, no. 8, pp. 1735–1780, 1997.

[12] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.MIT Press, 1998.


[13] J. W. Ronald, “Simple statistical gradient-following algorithms forconnectionist reinforcement learning,” Machine Learning, vol. 8, pp.229–256, 1992.

[14] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G.Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Os-trovski, “Human-level control through deep reinforcement learning,”Nature, vol. 518, no. 7540, p. 529, 2015.

[15] v. H. Hado, G. Arthur, and S. David, “Deep reinforcement learningwith double Q-learning,” in AAAI Conference on Artificial Intelligence,2016.

[16] S. Tom, Q. John, A. Ioannis, and S. David, “Prioritized experiencereplay,” in International Conference on Learning Representations,2016.

[17] W. Ziyu, S. Tom, H. Matteo, v. H. Hado, L. Marc, and d. F. Nando,“Dueling network architectures for deep reinforcement learning,” inInternational Conference on Machine Learning, 2016.

[18] P. v. H. Hado, G. Arthur, H. Matteo, M. Volodymyr, and S. David,“Learning values across many orders of magnitude,” in Advances inNeural Information Processing Systems, 2016.

[19] S. H. Frank, L. Yang, G. S. Alexander, and P. Jian, “Learning to playin a day: faster deep reinforcement learning by optimality tightening,”in International Conference on Learning Representations, 2017.

[20] “Massively parallel methods for deep reinforcement learning,” inInternational Conference on Machine Learning Workshop on DeepLearning, 2015.

[21] J. H. Matthew and S. Peter, “Deep recurrent Q-learning for partiallyobservable MDPs,” CoRR, vol. abs/1507.06527, 2015.

[22] L. Nir, Z. Tom, J. M. Daniel, T. Aviv, and M. Shie, “Shallow updatesfor deep reinforcement learning,” in Advances in Neural InformationProcessing Systems, 2017.

[23] A. Oron, B. Nir, and S. Nahum, “Averaged-DQN: variance reductionand stabilization for deep reinforcement learning,” in InternationalConference on Machine Learning, 2017.

[24] “Deep Q-learning from demonstrations,” in AAAI Conference on Arti-ficial Intelligence, 2018.

[25] M. Sabatelli, G. Louppe, P. Geurts, and M. Wiering, “Deep quality-value (DQV) learning,” CoRR, vol. abs/1810.00368, 2018.

[26] “Rainbow: combining improvements in deep reinforcement learning,”in AAAI Conference on Artificial Intelligence, 2018.

[27] A. A.-M. Jose, G. Michael, W. Michael, U. Thomas, and H. Sepp,“RUDDER: return decomposition for delayed rewards,” CoRR, vol.abs/1806.07857, 2018.

[28] “Observe and look further: achieving consistent performance on Atari,”CoRR, vol. abs/1805.11593, 2018.

[29] S. John, A. Pieter, and C. Xi, “Equivalence between policy gradientsand soft Q-learning,” CoRR, vol. abs/1704.06440, 2017.

[30] G. B. Marc, D. Will, and M. Remi, “A distributional perspectiveon reinforcement learning,” in International Conference on MachineLearning, 2017.

[31] D. Will, R. Mark, G. B. Marc, and M. Remi, “Distributional rein-forcement learning with quantile regression,” in AAAI Conference onArtificial Intelligence, 2018.

[32] D. Will, O. Georg, S. David, and M. Remi, “Implicit quantile networksfor distributional reinforcement learning,” CoRR, vol. abs/1806.06923,2018.

[33] “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016.

[34] B. Mohammad, F. Iuri, T. Stephen, C. Jason, and K. Jan, “Reinforce-ment learning through asynchronous advantage actor-critic on a GPU,”in International Conference on Learning Representations, 2017.

[35] J. Max, M. Volodymyr, C. Wojciech, S. Tom, Z. L. Joel, S. David, andK. Koray, “Reinforcement learning with unsupervised auxiliary tasks,”in International Conference on Learning Representations, 2017.

[36] “Efficient parallel methods for deep reinforcement learning,” CoRR,vol. abs/1705.04862, 2017.

[37] O. Brendan, M. Remi, K. Koray, and M. Volodymyr, “PGQ: combin-ing policy gradient and Q-learning,” in International Conference onLearning Representations, 2017.

[38] M. Remi, S. Tom, H. Anna, and G. B. Marc, “Safe and efficientoff-policy reinforcement learning,” in Advances in Neural InformationProcessing Systems, 2016.

[39] G. Audrunas, G. A. Mohammad, G. B. Marc, and R. Munos,“The Reactor: a sample-efficient actor-critic architecture,” CoRR, vol.abs/1704.04651, 2017.

[40] “IMPALA: scalable distributed deep-RL with importance weightedactor-learner architectures,” CoRR, vol. abs/1802.01561, 2018.

[41] S. John, L. Sergey, A. Pieter, I. J. Michael, and M. Philipp, “Trustregion policy optimization,” in International Conference on MachineLearning, 2015.

[42] S. John, W. Filip, D. Prafulla, R. Alec, and K. Oleg, “Proximal policyoptimization algorithms,” CoRR, vol. abs/1707.06347, 2017.

[43] W. Ziyu, B. Victor, H. Nicolas, M. Volodymyr, M. Remi, K. Koray,and d. F. Nando, “Sample efficient actor-critic with experience replay,”in International Conference on Learning Representations, 2017.

[44] W. Yuhuai, M. Elman, L. Shun, B. G. Roger, and B. Jimmy, “Scalabletrust-region method for deep reinforcement learning using Kronecker-factored approximation,” in Advances in Neural Information ProcessingSystems, 2017.

[45] P. L. Timothy, J. H. Jonathan, P. Alexander, H. Nicolas, E. Tom,T. Yuval, S. David, and W. Daan, “Continuous control with deepreinforcement learning,” CoRR, vol. abs/1509.02971, 2015.

[46] “Distributed distributional deterministic policy gradients,” CoRR, vol.abs/1804.08617, 2018.

[47] F. Gregory, R. Tim, I. Maximilian, and W. Shimon, “TreeQN andATreeC: differentiable tree planning for deep reinforcement learning,”in International Conference on Learning Representations, 2018.

[48] V. Alexander, M. Volodymyr, O. Simon, G. Alex, V. Oriol, A. John,and K. Koray, “Strategic attentive writer for learning macro-actions,”in Advances in Neural Information Processing Systems, 2016.

[49] D. Ha and J. Schmidhuber, “Recurrent world models facilitate policyevolution,” Neural Information Processing Systems, 2018.

[50] N. Nantas, S. Gabriel, L. Zeming, K. Pushmeet, H. S. T. Philip, andU. Nicolas, “Value propagation networks,” CoRR, vol. abs/1805.11199,2018.

[51] O. Ian, B. Charles, P. Alexander, and V. R. Benjamin, “Deep ex-ploration via bootstrapped DQN,” in Advances in Neural InformationProcessing Systems, 2016.

[52] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochasticactor,” international conference on machine learning, pp. 1856–1865,2018.

[53] S. Sahil, S. L. Aravind, and R. Balaraman, “Learning to repeat:fine grained action repetition for deep reinforcement learning,” inInternational Conference on Learning Representations, 2017.

[54] S. Julian, A. Ioannis, and H. Thomas, “Mastering Atari, Go, chess and shogi by planning with a learned model,” CoRR, vol. abs/1911.08265, 2019.

[55] G. B. Marc, N. Yavar, V. Joel, and H. B. Michael, “The Arcade learningenvironment: An evaluation platform for general agents,” J. Artif. Intell.Res., vol. 47, pp. 253–279, 2013.

[56] B. Greg, C. Vicki, P. Ludwig, S. Jonas, S. John, T. Jie, and Z. Wojciech,“OpenAI Gym,” CoRR, vol. abs/1606.01540, 2016.

[57] “OpenAI Universe github,” https://github.com/openai/universe, 2016.

[58] “OpenAI Retro github,” https://github.com/openai/retro, 2018.

[59] N. Alex, P. Vicki, H. Christopher, K. Oleg, and S. John, “Gotta learn fast: a new benchmark for generalization in RL,” CoRR, vol. abs/1804.03720, 2018.

[60] “General video game AI: a multi-track framework for evaluat-ing agents, games and content generation algorithms,” CoRR, vol.abs/1802.10363, 2018.

[61] R. T. Ruben, B. Philip, T. Julian, L. Jialin, and P.-L. Diego, “Deepreinforcement learning for general video game AI,” CoRR, vol.abs/1806.02448, 2018.

[62] “DeepMind Lab,” CoRR, vol. abs/1612.03801, 2016.

[63] J. Arthur, B. Vincent-Pierre, V. Esh, G. Yuan, H. Hunter, M. Marwan, and L. Danny, “Unity: a general platform for intelligent agents,” CoRR, vol. abs/1809.02627, 2018.

[64] “OpenAI Malmo github,” https://github.com/Microsoft/malmo, 2017.

[65] B. Wymann, E. Espie, C. Guionneau, C. Dimitrakakis, R. Coulom, and A. Sumner, “TORCS, the open racing car simulator,” Software available at http://torcs.sourceforge.net, vol. 4, p. 6, 2000.

[66] M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jakowski,“ViZDoom: a Doom-based AI research platform for visual reinforce-ment learning,” in IEEE Conference on Computational Intelligence andGames, 2017, pp. 1–8.

[67] S. Gabriel, N. Nantas, A. Alex, C. Soumith, L. Timothee, L. Zeming,R. Florian, and U. Nicolas, “TorchCraft: a library for machine learningresearch on real-time strategy games,” CoRR, vol. abs/1611.00625,2016.

[68] “StarCraft II: a new challenge for reinforcement learning,” CoRR, vol.abs/1708.04782, 2017.

[69] C. Karl, K. Oleg, H. Chris, K. Taehoon, and S. John, “Quantifyinggeneralization in reinforcement learning,” CoRR, vol. abs/1812.02341,2018.


[70] C. M. Marlos, G. B. Marc, T. Erik, V. Joel, J. H. Matthew, andB. Michael, “Revisiting the Arcade learning environment: evaluationprotocols and open problems for general agents,” Journal of ArtificialIntelligence Research, vol. 61, pp. 523–562, 2018.

[71] H. Dan, Q. John, B. David, B.-M. Gabriel, H. Matteo, v. H. Hado, andS. David, “Distributed prioritized experience replay,” in InternationalConference on Learning Representations, 2018.

[72] O. Georg, G. B. Marc, O. Aaron, and M. Remi, “Count-based ex-ploration with neural density models,” in International Conference onMachine Learning, 2017.

[73] M. L. Hoang, J. Nan, A. Alekh, D. Miroslav, Y. Yisong, and D. Hal,“Hierarchical imitation and reinforcement learning,” in InternationalConference on Machine Learning, 2018.

[74] Y. Aytar, T. Pfaff, D. Budden, T. L. Paine, Z. Wang, and N. D. Freitas,“Playing hard exploration games by watching youtube,” 2018.

[75] “Montezuma’s revenge solved by go-explore, a new algorithm for hard-exploration problems,” https://eng.uber.com/go-explore/, 2018.

[76] Y. Wu and Y. Tian, “Training agent for first-person shooter gamewith actor-critic curriculum learning,” in International Conference onLearning Representations, 2017.

[77] P. Emilio and S. Ruslan, “Neural map: structured memory for deepreinforcement learning,” CoRR, vol. abs/1702.08360, 2017.

[78] K. Shao, D. Zhao, N. Li, and Y. Zhu, “Learning battles in ViZDoom viadeep reinforcement learning,” in IEEE Conference on ComputationalIntelligence and Games, 2018.

[79] G. Yang, X. Huazhe, L. Ji, Y. Fisher, L. Sergey, and D. Trevor,“Reinforcement learning from imperfect demonstrations,” CoRR, vol.abs/1802.05313, 2018.

[80] M. Sahisnu, L. Bing, W. Shuai, Z. Yingxuan, L. Lifeng, and L. Jian,“Action permissibility in deep reinforcement learning and applicationto autonomous driving,” in ACM SIGKDD Conference on KnowledgeDiscovery and Data Mining, 2018.

[81] D. Li, D. Zhao, Q. Zhang, and Y. Chen, “Reinforcement learningand deep learning based lateral control for autonomous driving,” IEEEComputational Intelligence Magazine, 2018.

[82] Y. Zhu and D. Zhao, “Driving control with deep and reinforcementlearning in the open racing car simulator,” in International Conferenceon Neural Information Processing, 2018.

[83] J. Matthew, H. Katja, H. Tim, and B. David, “The Malmo platform forartificial intelligence experimentation,” in International Joint Confer-ences on Artificial Intelligence, 2016.

[84] X. Yanhai, C. Haipeng, Z. Mengchen, and A. Bo, “HogRider: cham-pion agent of Microsoft Malmo collaborative AI challenge,” in AAAIConference on Artificial Intelligence, 2018.

[85] T. Chen, G. Shahar, Z. Tom, J. M. Daniel, and M. Shie, “A deephierarchical approach to lifelong learning in Minecraft,” in AAAIConference on Artificial Intelligence, 2017.

[86] H. J. Peter, L. Sergey, and K. Kurt, “Regret minimization for partiallyobservable deep reinforcement learning,” in International Conferenceon Learning Representations, 2018.

[87] “Unicorn: continual learning with a universal, off-policy agent,” CoRR,vol. abs/1802.08294, 2018.

[88] “Kickstarting deep reinforcement learning,” CoRR, vol.abs/1803.03835, 2018.

[89] “Human-level performance in first-person multiplayer gameswith population-based deep reinforcement learning,” CoRR, vol.abs/1807.01281, 2018.

[90] Z. Tang, K. Shao, Y. Zhu, D. Li, D. Zhao, and T. Huang, “A reviewof computational intelligence for StarCraft AI,” in IEEE SymposiumSeries on Computational Intelligence (SSCI), 2018.

[91] N. Usunier, G. Synnaeve, Z. Lin, and S. Chintala, “Episodic ex-ploration for deep deterministic policies: an application to StarCraftmicromanagement tasks,” in International Conference on LearningRepresentations, 2017.

[92] P. Peng, Q. Yuan, Y. Wen, Y. Yang, Z. Tang, H. Long, and J. Wang,“Multiagent bidirectionally-coordinated nets for learning to play Star-Craft combat games,” 2017.

[93] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. S. Torr,P. Kohli, and S. Whiteson, “Stabilising experience replay for deepmulti-agent reinforcement learning,” in International Conference onMachine Learning, 2017.

[94] K. Shao, Y. Zhu, and D. Zhao, “StarCraft micromanagement withreinforcement learning and curriculum transfer learning,” IEEETransactions on Emerging Topics in Computational Intelligence,DOI:10.1109/TETCI.2018.2823329, 2018.

[95] K. Xiangyu, X. Bo, L. Fangchen, and W. Yizhou, “Revisiting themaster-slave architecture in multi-agent deep reinforcement learning,”CoRR, vol. abs/1712.07305, 2017.

[96] R. Tabish, S. Mikayel, S. d. W. Christian, F. Gregory, N. F. Jakob, andW. Shimon, “QMIX: monotonic value function factorisation for deepmulti-agent reinforcement learning,” CoRR, vol. abs/1803.11485, 2018.

[97] T. Zhentao, Z. Dongbin, Z. Yuanheng, and G. Ping, “Reinforcementlearning for build-order production in StarCraft II,” in InternationalConference on Information Science and Technology, 2018.

[98] V. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. Reichert, T. Lillicrap, E. Lockhart et al., “Relational deep reinforcement learning,” CoRR, vol. abs/1806.01830, 2018.

[99] P. Sun, X. Sun, L. Han, J. Xiong, Q. Wang, B. Li, Y. Zheng, J. Liu,Y. Liu, H. Liu, and T. Zhang, “TStarBots: defeating the cheating levelbuiltin AI in StarCraft II in the full game,” CoRR, vol. abs/1809.07193,2018.

[100] L. Dennis, T. Haoran, O. Z. Jeffrey, X. Huazhe, D. Trevor, and A. Pieter, “Modular architecture for StarCraft II with deep reinforcement learning,” CoRR, vol. abs/1811.03555, 2018.

[101] P. Zhen-Jia, L. Ruo-Ze, M. Zhou-Yu, Z. Yi, Y. Yang, and L. Tong, “On reinforcement learning for full-length game of StarCraft,” CoRR, vol. abs/1809.09095, 2018.

[102] R. J. Daniel, E. Emmanuel, and L. Hao, “Feedback-based tree searchfor reinforcement learning,” in International Conference on MachineLearning, 2018.

[103] “OpenAI Dota 1v1,” https://blog.openai.com/dota-2/, 2017.

[104] “OpenAI Dota Five,” https://blog.openai.com/openai-five/, 2018.

[105] B. Christopher, B. Greg, and C. Brooke, “Dota 2 with large scale deep reinforcement learning,” CoRR, vol. abs/1912.06680, 2019.

[106] “Noisy networks for exploration,” CoRR, vol. abs/1706.10295, 2017.

[107] “Parameter space noise for exploration,” CoRR, vol. abs/1706.01905, 2017.

[108] T. Haoran, H. Rein, F. Davis, S. Adam, C. Xi, D. Yan, S. John, D. T. Filip, and A. Pieter, “Exploration: a study of count-based exploration for deep reinforcement learning,” in Advances in Neural Information Processing Systems, 2017.

[109] B. Pierre-Luc, H. Jean, and P. Doina, “The option-critic architecture,”in AAAI Conference on Artificial Intelligence, 2017.

[110] S. V. Alexander, O. Simon, S. Tom, H. Nicolas, J. Max, S. David, andK. Koray, “FeUdal networks for hierarchical reinforcement learning,”in International Conference on Machine Learning, 2017.

[111] C. M. Marlos, R. Clemens, G. Xiaoxiao, L. Miao, T. Gerald, andC. Murray, “Eigenoption discovery through the deep successor rep-resentation,” CoRR, vol. abs/1710.11089, 2017.

[112] F. C. Paul, L. Jan, B. B. Tom, M. Miljan, L. Shane, and A. Dario,“Deep reinforcement learning from human preferences,” in Advancesin Neural Information Processing Systems, 2017.

[113] v. S. Harm, F. Mehdi, L. Romain, R. Joshua, B. Tavian, and T. Jeffrey,“Hybrid reward architecture for reinforcement learning,” in Advancesin Neural Information Processing Systems, 2017.

[114] H. Matteo, S. Hubert, E. Lasse, C. Wojciech, S. Simon, and v. H.Hado, “Multi-task deep reinforcement learning with popart,” CoRR,vol. abs/1809.04474, 2018.

[115] “Distral: robust multitask reinforcement learning,” in Advances inNeural Information Processing Systems, 2017.

[116] “Mix & Match - agent curricula for reinforcement learning,” CoRR, vol. abs/1806.01780, 2018.

[117] L. Ryan, W. Yi, T. Aviv, H. Jean, A. Pieter, and M. Igor, “Multi-agentactor-critic for mixed cooperative-competitive environments,” 2017, pp.6382–6393.

[118] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson,“Counterfactual multi-agent policy gradients,” in AAAI Conference onArtificial Intelligence, 2018.

[119] B. Charles, U. Benigno, P. Alexander, L. Yazhe, R. Avraham, Z. L. Joel,W. R. Jack, W. Daan, and H. Demis, “Model-free episodic control,”CoRR, vol. abs/1606.04460, 2016.

[120] “Hybrid computing using a neural network with dynamic externalmemory,” Nature, vol. 538, pp. 471–476, 2016.

[121] “Neural episodic control,” in International Conference on MachineLearning, 2017.

[122] P. Deepak, A. Pulkit, A. E. Alexei, and D. Trevor, “Curiosity-drivenexploration by self-supervised prediction,” in IEEE Conference onComputer Vision and Pattern Recognition Workshops, 2017, pp. 488–489.

[123] S. Christopher and C. Jeff, “Deep curiosity search: intra-life explorationimproves performance on challenging deep reinforcement learningproblems,” CoRR, vol. abs/1806.00553, 2018.