
ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero

Yuandong Tian¹  Jerry Ma*¹  Qucheng Gong*¹  Shubho Sengupta*¹  Zhuoyuan Chen¹  James Pinkerton¹  C. Lawrence Zitnick¹

Abstract

The AlphaGo, AlphaGo Zero, and AlphaZero series of algorithms are remarkable demonstrations of deep reinforcement learning's capabilities, achieving superhuman performance in the complex game of Go with progressively increasing autonomy. However, many obstacles remain in the research community's understanding and use of these promising approaches. Toward elucidating unresolved mysteries and facilitating future research, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero algorithm. ELF OpenGo is the first open-source Go AI to convincingly demonstrate superhuman performance, with a perfect (20:0) record against global top professionals. We apply ELF OpenGo to conduct extensive ablation studies and to identify and analyze numerous interesting phenomena in both the model training and the gameplay inference procedures. Our code, models, selfplay datasets, and auxiliary data are publicly available.¹

1. Introduction

The game of Go has a storied history spanning over 4,000 years and is viewed as one of the most complex turn-based board games with complete information. The emergence of AlphaGo (Silver et al., 2016) and its descendants AlphaGo Zero (Silver et al., 2017) and AlphaZero (Silver et al., 2018) demonstrated the remarkable result that deep reinforcement learning (deep RL) can achieve superhuman performance even without supervision on human gameplay datasets.

* Equal contribution. ¹ Facebook AI Research, Menlo Park, California, USA. Correspondence to: Yuandong Tian <[email protected]>, Jerry Ma <[email protected]>, Larry Zitnick <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

¹ Resources available at https://facebook.ai/developers/tools/elf-opengo. Additionally, the supplementary appendix for this paper is available at https://arxiv.org/pdf/1902.04522.pdf.

However, these advances in playing ability come at significant computational expense. A single training run requires millions of selfplay games and days of training on thousands of TPUs, which is an unattainable level of compute for the majority of the research community. Combined with the unavailability of code and models, the result is that the approach is very difficult, if not impossible, to reproduce, study, improve upon, and extend.

In this paper, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero (Silver et al., 2018) algorithm for the game of Go. We then apply ELF OpenGo toward the following three additional contributions.

First, we train a superhuman model for ELF OpenGo. After running our AlphaZero-style training software on 2,000 GPUs for 9 days, our 20-block model has achieved superhuman performance that is arguably comparable to the 20-block models described in Silver et al. (2017) and Silver et al. (2018). To aid research in this area, we provide pretrained superhuman models, the code used to train the models, a comprehensive training trajectory dataset featuring 20 million selfplay games over 1.5 million training minibatches, and auxiliary data.² We describe the system and software design in depth, and we relate many practical lessons learned from developing and training our model, in the hope that the community can better understand many of the considerations for large-scale deep RL.

Second, we provide analyses of the model's behavior during training. (1) As training progresses, we observe high variance in the model's strength when compared to other models. This property holds even if the learning rates are reduced. (2) Moves that require significant lookahead to determine whether they should be played, such as "ladder" moves, are learned slowly by the model and are never fully mastered. (3) We explore how quickly the model learns high-quality moves at different stages of the game. In contrast to tabular RL's typical behavior, the rate of progression for learning both mid-game and end-game moves is nearly identical in training ELF OpenGo.

² Auxiliary data comprises a test suite for difficult "ladder" game scenarios, comparative selfplay datasets, and performance validation match logs (both vs. humans and vs. other Go AIs).

Third, we perform extensive ablation experiments to study the properties of AlphaZero-style algorithms. We identify several important parameters that were left ambiguous in Silver et al. (2018) and provide insight into their roles in successful training. We briefly compare the AlphaGo Zero and AlphaZero training processes. Finally, we find that even for the final model, doubling the rollouts in gameplay still boosts its strength by ≈ 200 ELO,³ indicating that the strength of the AI is constrained by the model capacity.

Our ultimate goal is to provide the resources and the exploratory insights necessary to allow both the AI research and Go communities to study, improve upon, and test against these promising state-of-the-art approaches.

2. Related work

In this section, we briefly review early work in AI for Go. We then describe the AlphaGo Zero (Silver et al., 2017) and AlphaZero (Silver et al., 2018) algorithms, and the various contemporary attempts to reproduce them.

The game of Go. Go is a two-player board game traditionally played on a 19-by-19 square grid. The players alternate turns, with the black player taking the first turn and the white player taking the second. Each turn, the player places a stone on the board. A player can capture groups of enemy stones by occupying all of their adjacent locations, or "liberties". Players can choose to "pass" their turn, and the game ends upon consecutive passes or resignation. In our setting, the game is scored at the end using Chinese rules, which award players one point for each position occupied or surrounded by the player. Traditionally, a bonus ("komi") is given to white as compensation for going second. The higher score wins, and komi is typically a half-integer (most commonly 7.5) in order to avoid ties.
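To make the scoring arithmetic concrete, here is a minimal sketch of Chinese (area) scoring with komi; the area counts are hypothetical inputs, and resolving life, death, and territory is out of scope. This code is our own illustration, not part of ELF OpenGo.

```python
# Minimal sketch of Chinese (area) scoring with komi; area counts are
# hypothetical, and a real scorer must first resolve life/death and territory.
def chinese_score(black_area: int, white_area: int, komi: float = 7.5) -> str:
    """Each player scores one point per board position occupied or surrounded."""
    black_points = black_area
    white_points = white_area + komi   # komi compensates white for moving second
    margin = black_points - white_points
    return f"B+{margin}" if margin > 0 else f"W+{-margin}"

# On a 19x19 board the two areas sum to at most 361, so for example:
print(chinese_score(black_area=184, white_area=177))  # -> "W+0.5"
```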

2.1. Early work

Classical search. Before the advent of practical deep learning, classical search methods enjoyed initial success in AI for games. Many older AI players for board games use minimax search over a game tree, typically augmented with alpha-beta pruning (Knuth & Moore, 1975) and game-specific heuristics. A notable early result was Deep Blue (Campbell et al., 2002), a 1997 computer chess program based on alpha-beta pruning that defeated then-world champion Garry Kasparov in a six-game match. Even today, the predominant computer chess engine Stockfish uses alpha-beta pruning as its workhorse, decisively achieving superhuman performance on commodity hardware.

However, the game of Go is typically considered to be quite impervious to these classical search techniques, due to the game tree's high branching factor (up to 19 × 19 + 1 = 362) and high depth (typically hundreds of moves per game).

³ Elo (1978)'s method is a commonly used performance rating system in competitive game communities.

Monte Carlo Tree Search (MCTS). While some early Go AIs, such as GNU Go (GNU Go Team, 2009), rely on classical search techniques, most pre-deep learning AIs adopt a technique called Monte Carlo Tree Search (MCTS; Browne et al., 2012). MCTS treats game tree traversal as an exploitation/exploration tradeoff. At each state, it prioritizes visiting child nodes that provide a high value (estimated utility), or that have not been thoroughly explored. A common exploitation/exploration heuristic is "upper confidence bounds applied to trees" ("UCT"; Kocsis & Szepesvari, 2006); briefly, UCT provides an exploration bonus proportional to the inverse square root of a game state's visit frequency. Go AIs employing MCTS include Leela (Pascutto, 2016), Pachi (Baudis & Gailly, 2008), and Fuego (Enzenberger et al., 2010).
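As an illustration of the UCT heuristic described above, the following sketch scores children by mean value plus an exploration bonus. The code and names are ours, not the ELF OpenGo implementation (which uses PUCT, as described in Section 2.2).

```python
import math

# Generic UCT child selection: mean value plus an exploration bonus that
# shrinks roughly with the inverse square root of the child's visit count.
def uct_score(value_sum: float, visits: int, parent_visits: int,
              c_explore: float = 1.4) -> float:
    if visits == 0:
        return float("inf")                 # always try unvisited children first
    mean_value = value_sum / visits         # exploitation term
    bonus = c_explore * math.sqrt(math.log(parent_visits) / visits)
    return mean_value + bonus               # exploration term

def select_child(children):
    """children: list of dicts with 'value_sum' and 'visits' fields (illustrative)."""
    parent_visits = sum(c["visits"] for c in children) + 1
    return max(children, key=lambda c: uct_score(c["value_sum"], c["visits"], parent_visits))
```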

Early deep learning for Go. Early attempts at applying deep learning to Go introduced neural networks toward understanding individual game states, usually by predicting win probabilities and best actions from a given state. Go's square grid game board and the spatial locality of moves naturally suggest the use of convolutional architectures, trained on historical human games (Clark & Storkey, 2015; Maddison et al., 2015; Tian & Zhu, 2015). AlphaGo (Silver et al., 2016) employs policy networks trained with human games and RL, value networks trained via selfplay, and distributed MCTS, achieving a remarkable 4:1 match victory against professional player Lee Sedol in 2016.

2.2. AlphaGo Zero and AlphaZero

The AlphaGo Zero ("AGZ"; Silver et al., 2017) and AlphaZero ("AZ"; Silver et al., 2018) algorithms train a Go AI using no external information except the rules of the game. We provide a high-level overview of AGZ, then briefly describe the similar AZ algorithm.

Move generation algorithm. The workhorse of AGZ is a residual network model (He et al., 2016). The model accepts as input a spatially encoded representation of the game state. It then produces a scalar value prediction, representing the probability of the current player winning, and a policy prediction, representing the model's priors on available moves given the current board situation.

AGZ combines this network with MCTS, which is used as a policy improvement operator. Initially informed by the network's current-move policy, the MCTS operator explores the game tree, visiting a new game state at each iteration and evaluating the network policy. It uses a variant of PUCT (Rosin, 2011) to balance exploration (i.e. visiting game states suggested by the prior policy) and exploitation (i.e. visiting game states that have a high value), trading off between the two with a cpuct constant.
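A minimal sketch of the PUCT selection score as commonly described for AGZ-style MCTS follows; the code and names are ours, and the default cpuct of 1.5 reflects the choice discussed in Section 5.

```python
import math

# Illustrative AGZ-style PUCT score: Q is a child's mean value, P its prior
# from the policy network, N its visit count, and N_parent the parent's visits.
def puct_score(q_value: float, prior: float, child_visits: int,
               parent_visits: int, c_puct: float = 1.5) -> float:
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + exploration   # exploitation + prior-guided exploration

# During each rollout, the child maximizing puct_score(...) is traversed next.
```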

MCTS terminates after a certain number of iterations and produces a new policy based on the visit frequencies of its children in the MCTS game tree. It selects the next move based on this policy, either proportionally (for early-stage training moves) or greedily (for late-stage training moves and all gameplay inference). MCTS can be multithreaded using the virtual loss technique (Chaslot et al., 2008), and MCTS's performance intuitively improves as the number of iterations ("rollouts") increases.
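The move selection step just described can be sketched as follows (our own illustration): with temperature 1 the move is sampled proportionally to visit counts, and with temperature 0 the most visited move is chosen greedily.

```python
import numpy as np

# Illustrative move selection from MCTS visit counts.
def select_move(visit_counts: np.ndarray, temperature: float = 1.0) -> int:
    if temperature == 0:
        return int(np.argmax(visit_counts))   # greedy: late-stage training and inference
    probs = visit_counts.astype(float) ** (1.0 / temperature)
    probs /= probs.sum()                      # MCTS-augmented policy
    return int(np.random.choice(len(visit_counts), p=probs))
```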

Training. AGZ trains the model using randomly sampled data from a replay buffer (filled with selfplay games). The optimization objective is defined as follows, where V and p are outputs of a neural network with parameters θ, and z and π are respectively the game outcome and the saved MCTS-augmented policy from the game record:

J(\theta) = \big(V(s; \theta) - z\big)^2 - \pi^{\top} \log p(\cdot \mid \theta) + c \,\lVert \theta \rVert^2 \qquad (1)
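For illustration, the objective in Equation (1) maps onto a PyTorch-style loss roughly as follows. This is our own sketch; the tensor names and shapes are assumptions, and the L2 term c‖θ‖² is typically handled via the optimizer's weight decay.

```python
import torch.nn.functional as F

# Illustrative AGZ-style loss: squared value error plus policy cross-entropy.
# `value` is V(s; theta), `policy_logits` the move logits for p(.|theta),
# `z` the game outcome, and `pi` the saved MCTS-augmented visit distribution.
def agz_loss(value, policy_logits, z, pi):
    value_loss = F.mse_loss(value.squeeze(-1), z)                               # (V(s) - z)^2
    policy_loss = -(pi * F.log_softmax(policy_logits, dim=-1)).sum(-1).mean()   # -pi^T log p
    return value_loss + policy_loss   # c * ||theta||^2 handled via optimizer weight_decay
```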

There are four major components of AGZ’s training:

• The replay buffer is a fixed-capacity FIFO queue of game records. Each game record consists of the game outcome, the move history, and the MCTS-augmented policies at each move.

• The selfplay workers continually take the latest model, play out an AI vs. AI (selfplay) game using the model, and send the game record to the replay buffer.

• The training worker continually samples minibatches of moves from the replay buffer and performs stochastic gradient descent (SGD) to fit the model. Every 1,000 minibatches, it sends the model to the evaluator.

• The evaluator receives proposed new models. It plays out 400 AI vs. AI games between the new model and the current model, and accepts any new model with a 55% win rate or higher. Accepted models are published to the selfplay workers as the latest model; a sketch of this gating step follows.
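As a concrete illustration of the evaluator's role above, the gating step can be sketched as follows. This is our own simplified code; `play_match` and the model objects are placeholders, not ELF OpenGo's actual interfaces.

```python
# Simplified sketch of AGZ-style model gating (illustrative only).
# `play_match(a, b)` is assumed to return 1 if model `a` wins a game vs. `b`, else 0.
def evaluate_candidate(candidate, current_best, play_match,
                       num_games: int = 400, gate_win_rate: float = 0.55):
    wins = sum(play_match(candidate, current_best) for _ in range(num_games))
    if wins / num_games >= gate_win_rate:
        return candidate      # published to selfplay workers as the latest model
    return current_best       # otherwise selfplay keeps using the old model
```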

Performance. Using Tensor Processing Units (TPUs) for selfplay and GPUs for training, AGZ is able to train a 256-filter, 20-block residual network model to superhuman performance in 3 days, and a 256-filter, 40-block model to an estimated 5185 ELO in 40 days. The total computational cost is unknown, however, as the paper does not elaborate on the number of selfplay workers, which we conjecture to be the costliest component of AGZ's training.

AlphaZero. While most details of the subsequently proposed AlphaZero (AZ) algorithm (Silver et al., 2018) are similar to those of AGZ, AZ eliminates AGZ's evaluation requirement, thus greatly simplifying the system and allowing new models to be immediately deployed to selfplay workers. AZ provides a remarkable speedup over AGZ, reaching in just eight hours the performance of the aforementioned three-day AGZ model.⁴ AZ uses 5,000 first-generation TPU selfplay workers; however, unlike AGZ, AZ uses TPUs for training as well. Beyond Go, the AZ algorithm can be used to train strong chess and shogi AIs.

2.3. Contemporary implementations

There are a number of contemporary open-source works that aim to reproduce the performance of AGZ and AZ, also with no external data. LeelaZero (Pascutto, 2017) leverages crowd-sourcing to achieve superhuman skill. PhoenixGo (Zeng et al., 2018), AQ (Yamaguchi, 2018), and MiniGo (MiniGo Team, 2018) achieve similar performance to that of LeelaZero. There are also numerous proprietary implementations, including "FineArt", "Golaxy", "DeepZen", "Dolbaram", "Baduki", and of course the original implementations of DeepMind (Silver et al., 2017; 2018).

To our knowledge, ELF OpenGo is the strongest open-source Go AI at the time of writing (under equal hardware constraints), and it has been publicly verified as superhuman via professional evaluation.

2.4. Understanding AlphaGo Zero and AlphaZero

The release of AGZ and AZ has motivated a line of preliminary work (Addanki et al., 2019; Dong et al., 2017; Wang & Rompf, 2018; Wang et al., 2018) which aims to analyze and understand the algorithms, as well as to apply similar algorithms to other domains. We offer ELF OpenGo as an accelerator for such research.

3. ELF OpenGo

Our proposed ELF OpenGo aims to faithfully reimplement AGZ and AZ, modulo certain ambiguities in the original papers and various innovations that enable our system to operate entirely on commodity hardware. For brevity, we thoroughly discuss the system and software design of ELF OpenGo in Appendix A; highlights include (1) the colocation of multiple selfplay workers on GPUs to improve throughput, and (2) an asynchronous selfplay workflow to handle the increased per-game latency. Both our training and inference use NVIDIA Tesla V100 GPUs with 16 GB of memory.⁵

⁴ A caveat here is that hardware details are provided for AZ but not AGZ, making it difficult to compare total resource usage.

⁵ One can expect comparable performance on most NVIDIA GPUs with Tensor Cores (e.g. the RTX 2060 commodity GPU).

Comparison of training details. In Appendix A's Table S1, we provide a comprehensive list of hyperparameters, resource provisioning, and other training details of AGZ, AZ, and ELF OpenGo.

To summarize, we largely adhere to AZ's training details. Instead of 5,000 selfplay TPUs and 64 training TPUs, we use 2,000 selfplay GPUs and 8 training GPUs. Since AZ's replay buffer size is unspecified in Silver et al. (2018), we use the AGZ setting of 500,000 games. We use the AGZ selfplay rollout setting of 1,600 per move. Finally, we use a cpuct constant of 1.5 and a virtual loss constant of 1.0; these settings are unspecified in Silver et al. (2017) and Silver et al. (2018), and we discuss these choices in greater detail in Section 5.
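For convenience, the choices above can be summarized as a small configuration sketch. This is illustrative only; the key names are ours and do not correspond to ELF OpenGo's actual configuration schema.

```python
# Illustrative summary of the training setup described above.
ELF_OPENGO_TRAINING_SETUP = {
    "selfplay_gpus": 2000,            # vs. AZ's 5,000 selfplay TPUs
    "training_gpus": 8,               # vs. AZ's 64 training TPUs
    "replay_buffer_games": 500_000,   # AGZ setting; unspecified for AZ
    "rollouts_per_move": 1600,        # AGZ selfplay setting
    "c_puct": 1.5,                    # unspecified in AGZ/AZ; see Section 5
    "virtual_loss": 1.0,              # unspecified in AGZ/AZ; see Section 5
}
```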

Silver et al. (2017) establish that during selfplay, the MCTS temperature is set to zero after 30 moves; that is, the policy becomes a Dirac (one-hot) distribution that selects the most visited move from MCTS. However, it is left ambiguous whether a similar temperature decrease is used during training. ELF OpenGo's training worker uses an MCTS temperature of 1 for all moves (i.e. the policy is proportional to MCTS visit frequency).

Silver et al. (2018) suggest an alternative, dynamic variant of the PUCT heuristic which we do not explore. We use the same PUCT rule as that of AGZ and the initial December 2017 version of AZ.

Model training specification. Our main training run constructs a 256-filter, 20-block model (starting from random initialization). First, we run our ELF OpenGo training system for 500,000 minibatches at a learning rate of 10⁻². Subsequently, we stop and restart the training system twice (at learning rates of 10⁻³ and 10⁻⁴), each time for an additional 500,000 training minibatches. Thus, the total training process involves 1.5 million training minibatches, or roughly 3 billion game states. We observe a selfplay generation to training minibatch ratio of roughly 13:1 (see Appendix A). Thus, the selfplay workers collectively generate around 20 million selfplay games in total during training. This training run takes around 16 days of wall clock time, achieving superhuman performance in 9 days.
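The quoted totals are consistent with a quick back-of-the-envelope check, assuming a minibatch size of roughly 2,048 game states (the batch size is not stated in this paragraph and is our assumption).

```python
# Back-of-the-envelope check of the totals quoted above (batch size assumed).
minibatches = 1_500_000
batch_size = 2048                     # assumed; implied by "roughly 3 billion game states"
games_per_minibatch = 13              # observed selfplay-to-training ratio of ~13:1

game_states = minibatches * batch_size                 # ~3.1e9 game states
selfplay_games = minibatches * games_per_minibatch     # ~19.5 million selfplay games
print(f"{game_states:.2e} states, {selfplay_games:.2e} games")
```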

Practical lessons. We learned a number of practical lessons in the course of training ELF OpenGo (e.g. staleness concerns with typical implementations of batch normalization). For brevity, we relate these lessons in Appendix C.

4. Validating model strength

We validate ELF OpenGo's performance via (1) direct evaluation against humans and other AIs, and (2) indirect evaluation using various human and computer-generated game datasets. Unless stated otherwise, ELF OpenGo uses a single NVIDIA Tesla V100 GPU in these validation studies, performing 1,600 rollouts per move.

Based on pairwise model comparisons, we extract our "final" model from the main training run after 1.29 million training minibatches. The final model is approximately 150 ELO stronger than the prototype model, which we introduce next. Both the prototype and the final model are publicly available as pretrained models.

4.1. Prototype model

Human benchmarking is essential for determining the strength of a model. For this purpose, we use an early model trained in April 2018 and subsequently evaluated against human professional players. We trained this 224-filter, 20-block model using an experimental hybrid of the AGZ and AZ training processes. We refrain from providing additional details on the model, since the code was modified during training, resulting in the model not being reproducible. However, it provides a valuable benchmark. We refer to this model as the "prototype model".

We evaluate our prototype model against four top-30 professional players.⁶ ELF OpenGo plays under 50 seconds per move (≈ 80,000 rollouts), with no pondering during the opponent's turn, while the humans play under no time limit. These evaluation games typically last 3-4 hours, with the longest game lasting over 6 hours. Using the prototype model, ELF OpenGo won every game, for a final record of 20:0.

Fig. 1 depicts the model's predicted value during eight selected human games; this value indicates the perceived advantage of ELF OpenGo versus the professionals over the course of the game. Sudden drops of the predicted value indicate that ELF OpenGo has various weaknesses (e.g. ladder exploitation); however, the human players ultimately proved unable to maintain the consequent advantages for the remainder of the game.

Evaluation versus other AIs. We further evaluate our prototype model against LeelaZero (Pascutto, 2017), which at the time of evaluation was the strongest open-source Go AI.⁷ Both LeelaZero and ELF OpenGo play under a time limit of 50 seconds per move. ELF OpenGo achieves an overall record of 980:18, corresponding to a win rate of 98.2% and a skill differential of approximately 700 ELO.

⁶ The players (and global ranks as of game date) include Kim Ji-seok (#3), Shin Jin-seo (#5), Park Yeonghun (#23), and Choi Cheolhan (#30). All four players were fairly compensated for their expertise, with additional and significant incentives for winning versus ELF OpenGo. Each player played 5 games, for a total of 20 human evaluation games.

⁷ The 2018 April 25 model with 192 filters and 15 residual blocks; public hash 158603eb.

Figure 1. Predicted values after each move for 8 selected games versus human professional players (2 games per player). A positive value indicates that the model believes it is more likely to win than the human. Per the players' request, we anonymize the games by arbitrarily labeling the players with letters 'A' to 'D'.

Figure 2. Progression of model skill during training. "Selfplay ELO 25,000" and "Selfplay ELO 50,000" refer to the unnormalized selfplay ELO rating, calculated based on consecutive model pairs at intervals of 25,000 and 50,000 training minibatches, respectively. "Nash Averaging 50,000" refers to the Nash averaging rating (Balduzzi et al., 2018), calculated based on a pairwise tournament among the same models as in "Selfplay ELO 50,000".

4.2. Training analysis

Throughout our main training run, we measure the progression of our model against the human-validated prototype model. We also measure the agreement of the model's move predictions with those made by humans. Finally, we explore the rate at which the model learns different stages of the game and complex moves, such as ladders.

Training progression: selfplay metrics. In examining the progression of model training, we consider two selfplay rating metrics: selfplay ELO and Nash averaging. Selfplay ELO uses Elo (1978)'s rating formula to determine the rating difference between consecutive pairs of models, where each pair's win rate is known. Nash averaging (Balduzzi et al., 2018) calculates a logit-based payoff of each model against a mixed Nash equilibrium, represented as a discrete probability distribution over each model.
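For reference, the standard ELO relationship between a head-to-head win rate and the rating gap is logistic in form; a small sketch follows (our own code, using the conventional base-10 scale).

```python
import math

# Rating difference (in ELO points) implied by a head-to-head win rate p.
def elo_gap_from_winrate(p: float) -> float:
    return 400.0 * math.log10(p / (1.0 - p))

# Sanity check against Section 4.1: a 98.2% win rate vs. LeelaZero gives
# 400 * log10(0.982 / 0.018) ≈ 695, i.e. "approximately 700 ELO".
print(round(elo_gap_from_winrate(0.982)))
```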

Intuitively, the selfplay ELO can be viewed as an "inflated" metric for two main reasons. First, the model's rating will increase as long as it can beat the immediately preceding model, without regard to its performance against earlier models. Second, the rating is sensitive to the number of consecutive pairs compared; increasing this number will tend to boost the rating of each model due to the nonlinear logit form of the ELO calculation.

Fig. 2 shows the selfplay ELO rating using every 25,000th model and every 50,000th model, and the Nash averaging rating using every 50,000th model. The ratings are consistent with the above intuitions; the ELO ratings follow a mostly consistent upward trend, and the denser comparisons lead to a more inflated rating. Note that the Nash averaging rating captures model skill degradations (particularly between minibatches 300,000 and 400,000) that selfplay ELO fails to identify.

Training progression: training objective. Fig. 3a shows the progression of the policy and value losses. Note the initial dip in the value loss. This is due to the model overestimating the white win rate, causing the black player to resign prematurely, which reduces the diversity of the games in the replay buffer. This could result in a negative feedback loop and overfitting. ELF OpenGo automatically corrects for this by evenly sampling black-win and white-win games. With this diverse (qualitatively, "healthy") set of replays, the value loss recovers and stays constant throughout training, showing that there is always new information to learn from the replay buffer.
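A minimal sketch of the balanced-sampling correction described above (our own illustration; ELF OpenGo's actual replay buffer is described in Appendix A, and the batch size here is a placeholder):

```python
import random

# Illustrative balanced sampling: draw half of each minibatch from black-win
# games and half from white-win games, keeping the value targets diverse.
def sample_balanced_minibatch(black_win_games, white_win_games, batch_size=2048):
    half = batch_size // 2
    batch = random.choices(black_win_games, k=half) + \
            random.choices(white_win_games, k=batch_size - half)
    random.shuffle(batch)
    return batch
```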

Performance versus prototype model. Fig. 3b shows the progression of the model's win rates against two weaker variants of the prototype model taken from earlier in its training ("prototype-α" and "prototype-β"), as well as against the main human-validated prototype model (simply "prototype").⁸

We observe that the model trends stronger as training progresses, achieving parity with the prototype after roughly 1 million minibatches and a 60% win rate against the prototype at the end of training. Similar trends emerge with prototype-α and prototype-β, demonstrating the robustness of the trend to the choice of opponent. Note that while the model trends stronger, there is significant variance in the model's strength as training progresses, even against the weaker models. Surprisingly, this variance does not decrease even when the learning rate is reduced.

⁸ Prototype-α is roughly at the advanced amateur level, and prototype-β is roughly at a typical professional level.

Figure 3. (a) Model's training loss (value, policy, and sum) during training. (b) Win rates vs. the prototype model during training. (c) Match rate between the trained model and human players on moves from professional games (as described in Appendix B.1). The learning rate was decreased every 500,000 minibatches.

Examining the win rates versus other models in Fig. 3b, the number of minibatches at each learning rate could potentially be reduced to 250,000 while still achieving similar performance.

Comparison with human games. Since our model is significantly stronger than the prototype model, which demonstrated superhuman strength, we hypothesize that our model is also of superhuman strength. In Fig. 3c, we show the agreement of predicted moves with those made by professional human players. The moves are extracted from 1,000 professional games played from 2011 to 2015. The model quickly converges to a human move match rate of ≈ 46% around minibatch 125,000. This indicates that the strengthening of the model beyond this point may not be due to better prediction of human professional moves; as demonstrated by Silver et al. (2016), there may be limitations to supervised training from human games.

Learning game stages. An interesting question is whether the model learns different stages of the game at different rates. During training, is the model initially stronger at opening or endgame moves?

We hypothesize that the endgame should be learned earlier than the opening. With a random model, MCTS behaves randomly except for the last few endgame moves, in which the actual game score is used to determine the winner of the game. Then, after some initial training, the learned endgame signals inform MCTS in the immediately preceding moves as well. As training progresses, these signals flow to earlier and earlier move numbers via MCTS, until finally the opening is learned.

In Fig. 4a and Fig. 4b, we show the agreement of the model's predicted moves during training with those of the prototype model and of humans, respectively. Fig. 4a shows the percentage of moves at three stages of gameplay (moves 1-60, 61-120, and 121-180) from training selfplay games that agree with those predicted by the prototype model. Fig. 4b is similar, but it uses moves from professional human games and measures the match rate between the trained model and professional players. Consistent with our hypothesis, the progression of early-game moves (1-60) lags behind that of the mid-game moves (61-120) and the end-game moves (121-180). Upon further exploration, we observed that this lag resulted from some initial overfitting before eventually recovering. Counter to conventional wisdom, the progressions of the match rates for mid-game and end-game moves are nearly identical. Note that more ambiguity exists for earlier moves than for later moves, so after sufficient training the match rates converge to different values.

Ladder moves. "Ladder" scenarios are among the earliest concepts of Go learned by beginning human players. Ladders create a predictable sequence of moves for each player that can span the entire board, as shown in Fig. 5. In contrast to humans, Silver et al. (2017) observe that deep RL models learn these tactics very late in training. To gain further insight into this phenomenon, we curate a dataset of 100 ladder scenarios (as described in Appendix B.2) and evaluate the model's ladder handling abilities throughout training.

As shown in Fig. 4c, we observe that the ability of the network to correctly handle ladder moves fluctuates significantly over the course of training. In general, ladder play improves with a fixed learning rate and degrades after the learning rate is reduced, before once again recovering. While the network improves on ladder moves, it still makes mistakes even late in training. In general, increasing the number of MCTS rollouts from 1,600 to 6,400 improves ladder play, but mistakes are still made.

5. Ablation study

We employ ELF OpenGo to perform various ablation experiments, toward demystifying the inner workings of MCTS and AZ-style training.

Figure 4. (a) Move agreement rates with the prototype model at different stages of the game during training. Note that the model learns better end-game moves before learning opening moves. The first part of the figure is clipped due to short-game dominance caused by initial overfitting issues. (b) Move agreement rates with human games at different stages of the game during the first stage of training. (c) Model's rate of "mistake moves" on the ladder scenario dataset during training, associated with vulnerability to ladder scenarios.

Figure 5. The beginning (left) and end (right) of a ladder scenario in which OpenGo (black) mistakenly thought it could gain an advantage by playing the ladder. Whether a ladder is advantageous or not is commonly dependent on stones distant from its start.

5.1. PUCT constant

Both AGZ (Silver et al., 2017) and AZ (Silver et al., 2018) leave cpuct unspecified. Recall that cpuct controls the balance of exploration vs. exploitation in the MCTS algorithm. Thus, it is a crucial hyperparameter for the overall behavior and effectiveness of the algorithm. Setting cpuct too low results in insufficient exploration, while setting cpuct too high reduces the effective depth (i.e. long-term planning capacity) of the MCTS algorithm. In preliminary experimentation, we found cpuct = 1.5 to be a suitable choice. Fig. 6 depicts the ELF OpenGo model's training trajectory under various values of cpuct; among the tested values, cpuct = 1.5 is plainly the most performant.

5.2. MCTS virtual loss

Virtual loss constant:    0.1   0.2   0.5   1.0   2.0   3.0   5.0
Win rate vs. default:     22%   37%   49%   50%   32%   36%   30%

Table 1. Win rates of various virtual loss constants versus the default setting of 1.

Figure 6. Win rates vs. the prototype model during an abbreviated training run for three different cpuct values (0.5, 1.5, and 5.0). The learning rate is decayed 10-fold after 290,000 minibatches.

Virtual loss (Chaslot et al., 2008) is a technique used to accelerate multithreaded MCTS. It adds a temporary and fixed amount of loss to a node about to be visited by a thread, to prevent other threads from concurrently visiting the same node and causing congestion. The virtual loss is parameterized by a constant. For AGZ, AZ, and ELF OpenGo's value function, a virtual loss constant of 1 is intuitively interpretable as each thread temporarily assuming a loss for the moves along the current rollout. This motivates our choice of 1 as ELF OpenGo's virtual loss constant.
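A minimal sketch of how virtual loss is typically bookkept during a multithreaded rollout follows; this is our own illustration, not ELF OpenGo's C++ implementation, and the node fields are assumptions.

```python
# Illustrative virtual-loss bookkeeping for multithreaded MCTS. While one
# thread traverses a node, the node temporarily looks worse to other threads,
# steering them toward different branches of the tree.
VIRTUAL_LOSS = 1.0

def apply_virtual_loss(node):
    node.visits += VIRTUAL_LOSS              # pretend the rollout already happened...
    node.value_sum -= VIRTUAL_LOSS           # ...and that it ended in a loss

def revert_virtual_loss(node, actual_value):
    node.visits += 1 - VIRTUAL_LOSS          # net effect across both calls: one real visit
    node.value_sum += VIRTUAL_LOSS + actual_value  # replace the fake loss with the real value
```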

We perform a sweep over the virtual loss constant using the prototype model, comparing each setting against the default setting of 1. The results, presented in Table 1, suggest that a virtual loss constant of 1 is indeed reasonable.

5.3. AlphaGo Zero vs. AlphaZero

We hypothesize that AZ training vastly outperforms AGZ training (given equivalent hardware) due to the former's asynchrony and consequent improved throughput. We train 64-filter, 5-block models from scratch with both AGZ and AZ for 12 hours and compare their final performance. The model trained with AZ wins 100:0 versus the model trained with AGZ, indicating that AZ is indeed much more performant.

Figure 7. Win rate when playing with 2x rollouts against the same model with 1x rollouts, shown separately for doubling the black player's rollouts and doubling the white player's rollouts.

5.4. MCTS rollout count

Intuitively, increasing the number of MCTS iterations (the rollout count) improves the AI's strength by exploring more of the game tree. Toward better understanding the rollout count's impact on strength, we perform a selfplay analysis with the final model, in which one player uses twice as many MCTS rollouts as the other. We perform this analysis across a wide range of rollout counts (800-25,600).

From the results shown in Fig. 7, we find that ELF OpenGo consistently enjoys an 80-90% win rate (≈ 250-400 additional ELO) from doubling the number of rollouts as the white player. On the other hand, as the black player, ELF OpenGo only enjoys a 55-75% win rate (≈ 35-200 additional ELO). Moreover, the incremental benefit for black of doubling rollouts shrinks to nearly 50% as the rollout count is increased, suggesting that our model has a skill ceiling with respect to rollout count as the black player. That this skill ceiling is not present as the white player suggests that a 7.5 komi (white's score bonus) can be quite significant for black.

Since using the same model on both sides could introduce bias (the player with more rollouts sees all the branches explored by the opponent), we also experiment with the prototype and final models and still observe a similar trend: doubling the rollouts gives an ≈ 200 ELO boost.

6. Discussion

ELF OpenGo learns to play the game of Go differently from humans. It requires orders of magnitude more games than professional human players to achieve the same level of performance. Notably, ladder moves, which are easy for human beginners to understand after a few examples, are difficult for the model to learn and are never fully mastered. We suspect that this is because convolutional networks lack the right inductive bias to handle ladders and resort to memorizing specific situations. Finally, we observe significant variance during training, which indicates that the method is still not fully understood and may require much tuning. We hope this paper serves as a starting point for improving upon AGZ/AZ to achieve efficient and stable learning.

Surprisingly, further reduction of the learning rate does not improve training. We trained to 2 million minibatches with a learning rate of 10⁻⁵, but noticed minimal improvement in model strength. Furthermore, the training becomes unstable, with high variance. This may be due to a lack of diversity in the selfplay games: either the model has saturated, or the lower learning rate of 10⁻⁵ resulted in the selfplay games coming from nearly identical models. It may be necessary to increase the selfplay game buffer when using lower learning rates to maintain stability.

Finally, RL methods typically learn from states that are close to the terminal state (the end game), where there is a sparse reward. Knowledge from the reward is then propagated toward the beginning of the game. However, from Fig. 4a and Fig. 4b, such a trend is weak (moves 61-180 are learned only slightly faster than moves 1-60). This raises the question of why AGZ/AZ methods behave differently, and of what role the inductive bias of the model plays during training. It is likely that, due to the inductive bias, the model quickly predicts the correct moves and values for easy situations, and then focuses on the difficult cases.

7. Conclusion

We provide a reimplementation of AlphaZero (Silver et al., 2018) and a resulting Go engine capable of superhuman gameplay. Our code, models, selfplay datasets, and auxiliary data are publicly available. We offer insights into the model's behavior during training. Notably, we examine the variance of the model's strength, its ability to learn ladder moves, and the rate of improvement at different stages of the game. Finally, through a series of ablation studies, we shed light on parameters that were previously ambiguous. Interestingly, we observe significant and sustained improvement with the number of rollouts performed during MCTS when playing white, but diminishing returns when playing black. This indicates that the model's strength could still be significantly improved.

Our goal is to provide the insights, code, and datasets necessary for the research community to explore large-scale deep reinforcement learning. As demonstrated through our experiments, exciting opportunities lie ahead in exploring sample efficiency, reducing training volatility, and numerous additional directions.

References

Addanki, R., Alizadeh, M., Venkatakrishnan, S. B., Shah, D., Xie, Q., and Xu, Z. Understanding & generalizing AlphaGo Zero, 2019. URL https://openreview.net/forum?id=rkxtl3C5YX.

Balduzzi, D., Tuyls, K., Perolat, J., and Graepel, T. Re-evaluating evaluation. CoRR, abs/1806.02643, 2018. URL http://arxiv.org/abs/1806.02643.

Baudis, P. and Gailly, J. Pachi. https://github.com/pasky/pachi, 2008.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.

Campbell, M., Hoane Jr., A. J., and Hsu, F.-h. Deep Blue. Artificial Intelligence, 134(1-2):57–83, 2002.

Chaslot, G. M.-B., Winands, M. H., and van den Herik, H. J. Parallel Monte-Carlo tree search. In International Conference on Computers and Games, pp. 60–71. Springer, 2008.

Clark, C. and Storkey, A. Training deep convolutional neural networks to play Go. In International Conference on Machine Learning, pp. 1766–1774, 2015.

Coleman, C., Narayanan, D., Kang, D., Zhao, T., Zhang, J., Nardi, L., Bailis, P., Olukotun, K., Re, C., and Zaharia, M. DAWNBench: An end-to-end deep learning benchmark and competition. 2017.

Dong, X., Wu, J., and Zhou, L. Demystifying AlphaGo Zero as AlphaGo GAN. arXiv preprint arXiv:1711.09091, 2017.

Elo, A. E. The rating of chess players, past and present. Ishi Press International, 1978.

Enzenberger, M., Muller, M., Arneson, B., and Segal, R. Fuego: an open-source framework for board games and Go engine based on Monte Carlo tree search. IEEE Transactions on Computational Intelligence and AI in Games, 2(4):259–270, 2010.

GNU Go Team. Gnu go. https://www.gnu.org/software/gnugo/, 2009.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 448–456, 2015. URL http://jmlr.org/proceedings/papers/v37/ioffe15.html.

Knuth, D. E. and Moore, R. W. An analysis of alpha-beta pruning. Artificial Intelligence, 6(4):293–326, 1975.

Kocsis, L. and Szepesvari, C. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pp. 282–293. Springer, 2006.

Maddison, C. J., Huang, A., Sutskever, I., and Silver, D. Move evaluation in Go using deep convolutional neural networks. International Conference on Learning Representations, 2015.

MiniGo Team. Minigo. https://github.com/tensorflow/minigo, 2018.

Pascutto, G.-C. Leela. https://www.sjeng.org/leela.html, 2016.

Pascutto, G.-C. Leelazero. https://github.com/gcp/leela-zero, 2017.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In NIPS-W, 2017.

Rosin, C. D. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, 2011.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018. ISSN 0036-8075. doi: 10.1126/science.aar6404. URL http://science.sciencemag.org/content/362/6419/1140.

Tian, Y. and Zhu, Y. Better computer Go player with neural network and long-term prediction. CoRR, abs/1511.06410, 2015. URL http://arxiv.org/abs/1511.06410.

Tian, Y., Gong, Q., Shang, W., Wu, Y., and Zitnick, C. L. ELF: an extensive, lightweight and flexible research platform for real-time strategy games. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 2656–2666, 2017.

Tromp, J. http://tromp.github.io/go.html, 2014.

Wang, F. and Rompf, T. From gameplay to symbolic reasoning: Learning SAT solver heuristics in the style of Alpha(Go) Zero. arXiv preprint arXiv:1802.05340, 2018.

Wang, L., Zhao, Y., and Jinnai, Y. AlphaX: exploring neural architectures with deep neural networks and Monte Carlo tree search. CoRR, abs/1805.07440, 2018. URL http://arxiv.org/abs/1805.07440.

Yamaguchi, Y. AQ. https://github.com/ymgaq/AQ, 2018.

Zeng, Q., Zhang, J., Zeng, Z., Li, Y., Chen, M., and Liu, S. PhoenixGo. https://github.com/Tencent/PhoenixGo, 2018.