Mastering the game of Go with deep neural networks and tree search: Presentation


Transcript of Mastering the game of Go with deep neural networks and tree search: Presentation

Page 1: Mastering the game of Go with deep neural networks and tree search: Presentation

Mastering the game of Go

with deep neural networks and tree search

Karel Ha

article by Google DeepMind

Spring School of Combinatorics 2016

Page 2: Mastering the game of Go with deep neural networks and tree search: Presentation

Why AI?

Page 3: Mastering the game of Go with deep neural networks and tree search: Presentation

Applications of AI

• spam filters
• recommender systems (Netflix, YouTube)
• predictive text (SwiftKey)
• audio recognition (Shazam, SoundHound)
• music generation (DeepHear - Composing and harmonizing music with neural networks)
• self-driving cars


Page 9: Mastering the game of Go with deep neural networks and tree search: Presentation

Auto Reply Feature of Google Inbox

Corrado 2015 2

Page 10: Mastering the game of Go with deep neural networks and tree search: Presentation

Artistic-style Painting

[1] Gatys, Ecker, and Bethge 2015 [2] Li and Wand 2016 3


Page 12: Mastering the game of Go with deep neural networks and tree search: Presentation

Baby Names Generated Character by Character

• Baby Killiel Saddie Char Ahbort With
• Rudi Levette Berice Lussa Hany Mareanne Chrestina Carissy
• Marylen Hammine Janye Marlise Jacacrie Hendred Romand Charienna Nenotto Ette Dorane Wallen Marly Darine Salina Elvyn Ersia Maralena Minoria Ellia Charmin Antley Nerille Chelon Walmor Evena Jeryly Stachon Charisa Allisa Anatha Cathanie Geetra Alexie Jerin Cassen Herbett Cossie Velen Daurenge Robester Shermond Terisa Licia Roselen Ferine Jayn Lusine Charyanne Sales Sanny Resa Wallon Martine Merus Jelen Candica Wallin Tel Rachene Tarine Ozila Ketia Shanne Arnande Karella Roselina Alessia Chasty Deland Berther Geamar Jackein Mellisand Sagdy Nenc Lessie Rasemy Guen

Karpathy 2015 4


Page 15: Mastering the game of Go with deep neural networks and tree search: Presentation

C code Generated Character by Character

Karpathy 2015 5

Page 16: Mastering the game of Go with deep neural networks and tree search: Presentation

Algebraic Geometry Generated Character by Character

Karpathy 2015 6

Page 17: Mastering the game of Go with deep neural networks and tree search: Presentation

DeepDrumpf

https://twitter.com/deepdrumpf = a Twitter bot that has learned the language of Donald Trump from his speeches

Hayes 2016 7


Page 19: Mastering the game of Go with deep neural networks and tree search: Presentation

Atari Player by Google DeepMind

https://youtu.be/0X-NdPtFKq0?t=21m13s

Mnih et al. 2015 8


Page 21: Mastering the game of Go with deep neural networks and tree search: Presentation

Heads-up Limit Hold'em Poker Is Solved!

Cepheus http://poker.srv.ualberta.ca/

Bowling et al. 2015 9


Page 23: Mastering the game of Go with deep neural networks and tree search: Presentation

Basics of Machine learning

Page 24: Mastering the game of Go with deep neural networks and tree search: Presentation

Supervised versus Unsupervised Learning

Supervised learning:
• data set must be labelled
• e.g. which e-mail is regular/spam, which image is duck/face, ...

Unsupervised learning:
• data set is not labelled
• it can try to cluster the data into different groups
• e.g. grouping similar news, ...


Page 32: Mastering the game of Go with deep neural networks and tree search: Presentation

Supervised Learning

1. data collection: Google Search, Facebook “Likes”, Siri, Netflix, YouTube views, LHC collisions, KGS Go Server, ...

2. training on training set

3. testing on testing set

4. deployment

http://www.nickgillian.com/ 11
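To make steps 2 and 3 concrete, here is a minimal sketch in Python of training and testing a toy 1-nearest-neighbour classifier on made-up data (an illustration added here, not part of the original slides; the data, threshold and split sizes are arbitrary):

import random

def nearest_neighbour(train, x):
    # 1-NN: predict the label of the closest training example
    return min(train, key=lambda example: abs(example[0] - x))[1]

# step 1 (toy stand-in): collect labelled data, label = 1 iff x > 5
data = [(x, int(x > 5.0)) for x in (random.uniform(0, 10) for _ in range(200))]
random.shuffle(data)

train, test = data[:150], data[150:]                # steps 2-3: training set / testing set

accuracy = sum(nearest_neighbour(train, x) == y for x, y in test) / len(test)
print("accuracy on the testing set:", accuracy)     # step 4 would be deploying the model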


Page 39: Mastering the game of Go with deep neural networks and tree search: Presentation

Regression



Page 41: Mastering the game of Go with deep neural networks and tree search: Presentation

Mathematical Regression

https://thermanuals.wordpress.com/descriptive-analysis/sampling-and-regression/ 13

Page 42: Mastering the game of Go with deep neural networks and tree search: Presentation

Classification

https://kevinbinz.files.wordpress.com/2014/08/ml-svm-after-comparison.png 14

Page 43: Mastering the game of Go with deep neural networks and tree search: Presentation

Underfitting and Overfitting

Beware of overfitting!

It is like studying for a math exam by memorizing the proofs.

https://www.researchgate.net/post/How_to_Avoid_Overfitting 15


Page 46: Mastering the game of Go with deep neural networks and tree search: Presentation

Reinforcement Learning

Especially: games of self-play

https://youtu.be/0X-NdPtFKq0?t=16m57s 16


Page 48: Mastering the game of Go with deep neural networks and tree search: Presentation

Monte Carlo Tree Search

Page 49: Mastering the game of Go with deep neural networks and tree search: Presentation

Tree Search

Optimal value v∗(s) determines the outcome of the game:
• from every board position or state s
• under perfect play by all players.

It is computed by recursively traversing a search tree containing approximately b^d possible sequences of moves (a minimal recursive sketch follows below), where
• b is the game's breadth (number of legal moves per position)
• d is its depth (game length)

Silver et al. 2016 17
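As a minimal sketch of this recursive definition (not AlphaGo's code; the toy game below, where players alternately take 1 or 2 stones and whoever takes the last stone wins, stands in for Go):

def optimal_value(stones, player):
    # v*(s): outcome under perfect play, found by exhaustively traversing all ~b^d move sequences
    if stones == 0:
        return -player                      # the previous player took the last stone and won
    values = [optimal_value(stones - take, -player)
              for take in (1, 2) if take <= stones]
    return max(values) if player == +1 else min(values)

print(optimal_value(7, +1))                 # +1: the first player can force a win from 7 stones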


Page 56: Mastering the game of Go with deep neural networks and tree search: Presentation

Game tree of Go

Sizes of trees for various games (a rough arithmetic check of these figures follows after this slide):
• chess: b ≈ 35, d ≈ 80
• Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe!

That makes Go a googol times more complex than chess.

https://deepmind.com/alpha-go.html

How to handle the size of the game tree?
• for the breadth: a neural network to select moves
• for the depth: a neural network to evaluate the current position
• for the tree traversal: Monte Carlo tree search (MCTS)

Allis et al. 1994 18
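A back-of-the-envelope check of those b^d figures (simple arithmetic added here for illustration; it counts move sequences, not distinct positions):

from math import log10

chess = 80 * log10(35)          # log10 of 35^80   -> about 124
go    = 150 * log10(250)        # log10 of 250^150 -> about 360
print(f"chess ~ 10^{chess:.0f} sequences, Go ~ 10^{go:.0f} sequences")
print(f"Go / chess ~ 10^{go - chess:.0f}")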


Page 63: Mastering the game of Go with deep neural networks and tree search: Presentation

Monte Carlo tree search


Page 64: Mastering the game of Go with deep neural networks and tree search: Presentation

Neural networks

Page 65: Mastering the game of Go with deep neural networks and tree search: Presentation

Neural Network: Inspiration

• inspired by the neuronal structure of the mammalian cerebral cortex
• but on much smaller scales
• suitable to model systems with a high tolerance to error
• e.g. audio or image recognition

http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 20


Page 70: Mastering the game of Go with deep neural networks and tree search: Presentation

Neural Network: Modes

Two modes:
• feedforward for making predictions
• backpropagation for learning

Dieterle 2003 21


Page 74: Mastering the game of Go with deep neural networks and tree search: Presentation

Neural Network: an example of feedforward

http://stevenmiller888.github.io/mind-how-to-build-a-neural-network/ 22
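A minimal feedforward pass in code, in the spirit of that walkthrough (a tiny fully connected 2-3-1 network with sigmoid units; the weights are made-up numbers, only the mechanics matter):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def feedforward(x, W1, b1, W2, b2):
    # input -> hidden layer -> single output, each unit applying a sigmoid to its weighted sum
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)

W1 = [[0.8, 0.2], [0.4, 0.9], [0.3, 0.5]]   # made-up weights: 3 hidden units, 2 inputs each
b1 = [0.0, 0.0, 0.0]
W2 = [0.3, 0.5, 0.9]                        # output unit weights over the 3 hidden activations
b2 = 0.0
print(feedforward([1.0, 1.0], W1, b1, W2, b2))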

Page 75: Mastering the game of Go with deep neural networks and tree search: Presentation

Gradient Descent in Neural Networks

Motto: “Learn from mistakes!”

However, error functions are not necessarily convex, nor as “smooth” as the one pictured (a one-dimensional sketch follows below).

http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 23
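A one-dimensional sketch of gradient descent on a toy error function (added for illustration; this function is convex, unlike typical neural-network error surfaces):

def gradient(w):
    return 2.0 * (w - 3.0)              # derivative of E(w) = (w - 3)^2

w, learning_rate = 0.0, 0.1
for step in range(50):
    w -= learning_rate * gradient(w)    # move against the gradient: learn from mistakes
print(w)                                # approaches the minimum at w = 3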


Page 78: Mastering the game of Go with deep neural networks and tree search: Presentation

Deep Neural Network: Inspiration

The hierarchy of concepts is captured in the number of layers (the “deep” in “deep learning”).

http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 24


Page 80: Mastering the game of Go with deep neural networks and tree search: Presentation

Convolutional Neural Network

http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 25

Page 81: Mastering the game of Go with deep neural networks and tree search: Presentation

Rules of Go

Page 82: Mastering the game of Go with deep neural networks and tree search: Presentation

Classic games (1/2)

Backgammon: Man vs. Fate

Chess: Man vs. Man



Page 85: Mastering the game of Go with deep neural networks and tree search: Presentation

Classic games (2/2)

Go: Man vs. Self

Robert Samal (White) versus Karel Kral (Black), Spring School of Combinatorics 2016 27

Page 86: Mastering the game of Go with deep neural networks and tree search: Presentation

Rules of Go

Black versus White. Black starts the game.

the rule of liberty

the “ko” rule

Handicap for difference in ranks: Black can place 1 or more stones in advance (compensation for White’s greater strength).


Page 92: Mastering the game of Go with deep neural networks and tree search: Presentation

Scoring Rules: Area Scoring

A player’s score is:
• the number of stones that the player has on the board
• plus the number of empty intersections surrounded by that player’s stones
• plus komi (komidashi) points for the White player, as compensation for the first-move advantage of the Black player (a scoring sketch follows below)

https://en.wikipedia.org/wiki/Go_(game) 29
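A minimal sketch of area scoring on a tiny board (illustrative only; '.' marks an empty intersection, 'B'/'W' mark stones, and the komi value is just an example):

def area_score(board, komi=6.5):
    size = len(board)
    stones = {'B': 0, 'W': 0}
    territory = {'B': 0, 'W': 0}
    seen = set()
    for r in range(size):
        for c in range(size):
            if board[r][c] in stones:
                stones[board[r][c]] += 1
            elif (r, c) not in seen:
                # flood-fill one empty region and note which colours border it
                region, border, stack = 0, set(), [(r, c)]
                seen.add((r, c))
                while stack:
                    y, x = stack.pop()
                    region += 1
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < size and 0 <= nx < size:
                            if board[ny][nx] == '.':
                                if (ny, nx) not in seen:
                                    seen.add((ny, nx))
                                    stack.append((ny, nx))
                            else:
                                border.add(board[ny][nx])
                if len(border) == 1:                 # empty region surrounded by one colour only
                    territory[border.pop()] += region
    return stones['B'] + territory['B'], stones['W'] + territory['W'] + komi

board = ["BB.WW",
         "BB.WW",
         "BB.WW",
         "BB.WW",
         "BB.WW"]
print(area_score(board))   # (10, 16.5): the middle column touches both colours, so it is no one's territory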


Page 96: Mastering the game of Go with deep neural networks and tree search: Presentation

Ranks of Players

Kyu and Dan ranks

or, alternatively, Elo ratings

https://en.wikipedia.org/wiki/Go_(game) 30


Page 98: Mastering the game of Go with deep neural networks and tree search: Presentation

Chocolate micro-break


Page 99: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo: Inside Out

Page 100: Mastering the game of Go with deep neural networks and tree search: Presentation

Policy and Value Networks

Silver et al. 2016 31

Page 101: Mastering the game of Go with deep neural networks and tree search: Presentation

Training the (Deep Convolutional) Neural Networks

Silver et al. 2016 32

Page 102: Mastering the game of Go with deep neural networks and tree search: Presentation

SL Policy Networks (1/3)

• 13-layer deep convolutional neural network
• goal: to predict expert human moves
• task of classification
• trained on 30 million positions from the KGS Go Server
• stochastic gradient ascent (a toy sketch of this update follows below):

∆σ ∝ ∂ log pσ(a|s) / ∂σ

(to maximize the likelihood of the human move a selected in state s)

Results:
• 44.4% accuracy (the state of the art from other groups)
• 55.7% accuracy (raw board position + move history as input)
• 57.0% accuracy (all input features)

Silver et al. 2016 33
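A toy sketch of that gradient-ascent step for a linear softmax policy over a handful of candidate moves (illustrative only: two made-up features and three moves stand in for the 13-layer network and the 19×19 board):

import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sl_update(sigma, features, expert_move, learning_rate=0.1):
    # p_sigma(a|s) = softmax over a of sigma[a] . features(s)
    logits = [sum(w * f for w, f in zip(sigma[a], features)) for a in range(len(sigma))]
    probs = softmax(logits)
    for a in range(len(sigma)):
        grad = (1.0 if a == expert_move else 0.0) - probs[a]   # d log p(expert|s) / d logit_a
        for i, f in enumerate(features):
            sigma[a][i] += learning_rate * grad * f            # ascent: maximise the likelihood
    return sigma

sigma = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # weights for 3 candidate moves, 2 toy features
print(sl_update(sigma, features=[1.0, 0.5], expert_move=2))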


Page 112: Mastering the game of Go with deep neural networks and tree search: Presentation

SL Policy Networks (2/3)

Small improvements in accuracy led to large improvements in playing strength (see the next slide).

Silver et al. 2016 34

Page 113: Mastering the game of Go with deep neural networks and tree search: Presentation

SL Policy Networks (3/3)

Move probabilities taken directly from the SL policy network pσ (reported as a percentage if above 0.1%).

Silver et al. 2016 35

Page 114: Mastering the game of Go with deep neural networks and tree search: Presentation

Training the (Deep Convolutional) Neural Networks

Silver et al. 2016 36

Page 115: Mastering the game of Go with deep neural networks and tree search: Presentation

Rollout Policy

• Rollout policy pπ(a|s) is faster but less accurate than the SL policy network.
• accuracy of 24.2%
• It takes 2 µs to select an action, compared to 3 ms in the case of the SL policy network.

Silver et al. 2016 37


Page 118: Mastering the game of Go with deep neural networks and tree search: Presentation

Training the (Deep Convolutional) Neural Networks

Silver et al. 2016 38

Page 119: Mastering the game of Go with deep neural networks and tree search: Presentation

RL Policy Networks (1/2)

• identical in structure to the SL policy network
• goal: to win in the games of self-play
• task of classification
• weights ρ initialized to the same values, ρ := σ
• games of self-play
  • between the current RL policy network and a randomly selected previous iteration
  • to prevent overfitting to the current policy
• stochastic gradient ascent (a toy sketch of this update follows below):

∆ρ ∝ ∂ log pρ(at|st) / ∂ρ · zt

at time step t, where the reward function zt is +1 for winning and −1 for losing.

Silver et al. 2016 39
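A toy REINFORCE-style sketch of that update, reusing the same linear softmax policy as in the SL sketch above (illustrative only; in AlphaGo the update is applied over full self-play games with the actual network):

import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rl_update(rho, game, z, learning_rate=0.1):
    # game: list of (features, chosen_move) pairs; z: +1 if this player won, -1 if it lost
    for features, move in game:
        logits = [sum(w * f for w, f in zip(rho[a], features)) for a in range(len(rho))]
        probs = softmax(logits)
        for a in range(len(rho)):
            grad = (1.0 if a == move else 0.0) - probs[a]      # d log p(move|s) / d logit_a
            for i, f in enumerate(features):
                rho[a][i] += learning_rate * z * grad * f      # reinforce the moves of won games
    return rho

rho = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(rl_update(rho, game=[([1.0, 0.0], 0), ([0.0, 1.0], 2)], z=+1))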


Page 129: Mastering the game of Go with deep neural networks and tree search: Presentation

RL Policy Networks (2/2)

Results (by sampling each move at ∼ pρ(·|st)):
• 80% win rate against the SL policy network
• 85% win rate against the strongest open-source Go program, Pachi (Baudis and Gailly 2011)
• The previous state of the art, based only on SL of CNN: 11% “win” rate against Pachi

Silver et al. 2016 40


Page 135: Mastering the game of Go with deep neural networks and tree search: Presentation

Training the (Deep Convolutional) Neural Networks

Silver et al. 2016 41

Page 136: Mastering the game of Go with deep neural networks and tree search: Presentation

Value Network (1/2)

• similar architecture to the policy network, but outputs a single prediction instead of a probability distribution
• goal: to estimate a value function

vp(s) = E[zt | st = s, at...T ∼ p]

that predicts the outcome from position s (of games played by using policy pρ)
• Specifically, vθ(s) ≈ vpρ(s) ≈ v∗(s).
• task of regression
• stochastic gradient descent (a toy sketch of this update follows below):

∆θ ∝ ∂vθ(s) / ∂θ · (z − vθ(s))

(to minimize the mean squared error (MSE) between the predicted vθ(s) and the true outcome z)

Silver et al. 2016 42
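A toy sketch of that regression update for a linear value function (illustrative only; the real vθ is a deep convolutional network and z comes from finished self-play games):

def value_update(theta, features, z, learning_rate=0.01):
    # one SGD step on the squared error (z - v_theta(s))^2 for v_theta(s) = theta . features(s)
    v = sum(w * f for w, f in zip(theta, features))
    error = z - v                                   # positive if the position was undervalued
    return [w + learning_rate * error * f for w, f in zip(theta, features)]

theta = [0.0, 0.0, 0.0]
for _ in range(100):
    theta = value_update(theta, features=[1.0, 0.5, -0.2], z=+1.0)   # a game won from this position
print(theta)   # the predicted value theta . features moves towards z = +1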


Page 141: Mastering the game of Go with deep neural networks and tree search: Presentation

Value Network (2/2)

Beware of overfitting!
• Successive positions are strongly correlated.
• The value network memorized the game outcomes, rather than generalizing to new positions.
• Solution: generate 30 million (new) positions, each sampled from a separate game
• almost the accuracy of Monte Carlo rollouts (using pρ), but with 15,000 times less computation!

Silver et al. 2016 43


Page 146: Mastering the game of Go with deep neural networks and tree search: Presentation

Selection of Moves by the Value Network

Evaluation of all successors s′ of the root position s, using vθ(s′).

Silver et al. 2016 44

Page 147: Mastering the game of Go with deep neural networks and tree search: Presentation

Evaluation accuracy in various stages of a game

Move number is the number of moves that had been played in the given position.

Each position evaluated by:

• forward pass of the value network vθ
• 100 rollouts, played out using the corresponding policy

Silver et al. 2016 45


Page 150: Mastering the game of Go with deep neural networks and tree search: Presentation

Training the (Deep Convolutional) Neural Networks

Silver et al. 2016 46

Page 151: Mastering the game of Go with deep neural networks and tree search: Presentation

Elo Ratings for Various Combinations of Networks

Silver et al. 2016 47

Page 152: Mastering the game of Go with deep neural networks and tree search: Presentation

MCTS Algorithm

The next action is selected by lookahead search, using simulation:

1. selection phase
2. expansion phase
3. evaluation phase
4. backup phase (at the end of the simulation)

Each edge (s, a) keeps:
• action value Q(s, a)
• visit count N(s, a)
• prior probability P(s, a) (from the SL policy network pσ)

The tree is traversed by simulation (descending the tree) from the root state (a minimal sketch of the selection rule follows below).

Silver et al. 2016 48
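A minimal sketch of the selection rule used while descending the tree: pick the action maximising Q(s, a) plus an exploration bonus that grows with the prior P(s, a) and shrinks with the visit count N(s, a) (the constant and the exact formula of AlphaGo's variant are simplified here):

import math

def select_action(edges, c_puct=5.0):
    # edges: {action: {'Q': action value, 'N': visit count, 'P': prior from the SL policy network}}
    total_visits = sum(edge['N'] for edge in edges.values())
    def score(edge):
        bonus = c_puct * edge['P'] * math.sqrt(total_visits) / (1 + edge['N'])
        return edge['Q'] + bonus
    return max(edges, key=lambda action: score(edges[action]))

edges = {
    'move A': {'Q': 0.52, 'N': 40, 'P': 0.30},
    'move B': {'Q': 0.48, 'N': 10, 'P': 0.45},   # fewer visits + larger prior -> larger bonus
    'move C': {'Q': 0.10, 'N':  5, 'P': 0.05},
}
print(select_action(edges))                      # 'move B'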

Page 163: Mastering the game of Go with deep neural networks and tree search: Presentation

MCTS Algorithm: Selection

At each time step t, an action at is selected from state st:

at = arg max_a (Q(st, a) + u(st, a))

where the bonus

u(st, a) ∝ P(s, a) / (1 + N(s, a))

Silver et al. 2016 49
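
A small sketch of this selection rule, reusing the `EdgeStats`/`Node` types above; the constant `c_puct` is an assumed proportionality factor, not a value taken from the paper.

```python
def select_action(node, c_puct=5.0):
    """Pick argmax_a ( Q(st, a) + u(st, a) ) with u ∝ P(s, a) / (1 + N(s, a))."""
    def score(item):
        _action, (stats, _child) = item
        u = c_puct * stats.P / (1 + stats.N)   # exploration bonus: high prior, few visits
        return stats.Q + u
    return max(node.edges.items(), key=score)
```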

Page 166: Mastering the game of Go with deep neural networks and tree search: Presentation

MCTS Algorithm: Expansion

A leaf position may be expanded (just once) by the SL policy network pσ.

The output probabilities are stored as priors P(s, a) := pσ(a|s).

Silver et al. 2016 50
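
Illustrative sketch of the expansion step, continuing the types above; `sl_policy`, `legal_actions`, and `apply_move` are hypothetical helpers standing in for pσ, move generation, and the state transition.

```python
def expand(leaf, sl_policy, legal_actions, apply_move):
    """Expand a leaf once: store pσ(a|s) as the prior P(s, a) of each new edge."""
    priors = sl_policy(leaf.state)                 # assumed: dict action -> probability
    for action in legal_actions(leaf.state):
        child = Node(state=apply_move(leaf.state, action))
        leaf.edges[action] = (EdgeStats(P=priors[action]), child)
```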

Page 169: Mastering the game of Go with deep neural networks and tree search: Presentation

MCTS: Evaluation

� evaluation from the value network vθ(s)

� evaluation by the outcome z using the fast rollout policy pπ until the end of the game

Using a mixing parameter λ, the final leaf evaluation V (s) is

V(s) = (1 − λ) vθ(s) + λz

Silver et al. 2016 51
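
A minimal sketch of this mixed leaf evaluation; `value_net` and `rollout_to_end` are assumed stand-ins for vθ and a fast pπ playout, and `lam` is the mixing parameter λ.

```python
def evaluate_leaf(leaf, value_net, rollout_to_end, lam=0.5):
    """V(s) = (1 - λ) * vθ(s) + λ * z, mixing value network and rollout outcome."""
    v = value_net(leaf.state)        # vθ(s): value-network prediction
    z = rollout_to_end(leaf.state)   # z: outcome of a fast pπ rollout played to the end
    return (1 - lam) * v + lam * z
```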

Page 174: Mastering the game of Go with deep neural networks and tree search: Presentation

Tree Evaluation from Value Network

action values Q(s, a) for each tree-edge (s, a) from root position s (averaged over value network evaluations only)

Silver et al. 2016 52

Page 175: Mastering the game of Go with deep neural networks and tree search: Presentation

Tree Evaluation from Rollouts

action values Q(s, a), averaged over rollout evaluations only

Silver et al. 2016 53

Page 176: Mastering the game of Go with deep neural networks and tree search: Presentation

MCTS: Backup

At the end of each simulation, every traversed edge is updated by accumulating:

� the action values Q

� visit counts N

Silver et al. 2016 54
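
Sketch of the backup step over the traversed edges (illustrative, reusing the `EdgeStats` fields above; Q is recovered as the running mean W / N):

```python
def backup(path, value):
    """Accumulate visit counts and action values along every traversed edge."""
    for stats in path:       # path: EdgeStats of the edges visited in this simulation
        stats.N += 1         # visit count N(s, a)
        stats.W += value     # running total; Q(s, a) = W / N is the mean
```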

Page 178: Mastering the game of Go with deep neural networks and tree search: Presentation

Once the search is complete, the algorithm

chooses the most visited move from the root

position.

Silver et al. 2016 54
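
As a one-line illustration with the same sketched types:

```python
def choose_move(root):
    """After the search, play the most-visited action from the root position."""
    return max(root.edges.items(), key=lambda kv: kv[1][0].N)[0]
```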

Page 179: Mastering the game of Go with deep neural networks and tree search: Presentation

Percentage of Simulations

percentage frequency with which actions were selected from the root during simulations

Silver et al. 2016 55

Page 180: Mastering the game of Go with deep neural networks and tree search: Presentation

Principal Variation (Path with Maximum Visit Count)

The moves are presented in a numbered sequence.

� AlphaGo selected the move indicated by the red circle;

� Fan Hui responded with the move indicated by the white square;

� in his post-game commentary, he preferred the move (labelled 1) predicted by AlphaGo.

Silver et al. 2016 56

Page 184: Mastering the game of Go with deep neural networks and tree search: Presentation

Scalability

� asynchronous multi-threaded search

� simulations on CPUs

� computation of neural networks on GPUs

AlphaGo:

� 40 search threads

� 48 CPUs

� 8 GPUs

Distributed version of AlphaGo (on multiple machines):

� 40 search threads

� 1202 CPUs

� 176 GPUs

Silver et al. 2016 57
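
A very rough sketch of this division of labour (purely illustrative, not DeepMind's code): CPU search threads run simulations and queue positions, while a GPU worker evaluates them in batches; `run_simulation` and `network` are assumed callables.

```python
import queue
import threading

eval_requests = queue.Queue()        # positions waiting for a neural-network evaluation

def cpu_search_thread(run_simulation):
    """Search thread: runs MCTS simulations/rollouts on a CPU core."""
    while True:
        run_simulation(eval_requests)          # assumed: enqueues (position, callback) pairs

def gpu_eval_worker(network, batch_size=8):
    """GPU worker: pulls queued positions and evaluates them in batches."""
    while True:
        batch = [eval_requests.get() for _ in range(batch_size)]
        positions, callbacks = zip(*batch)
        for value, callback in zip(network(positions), callbacks):
            callback(value)                     # hand the result back to the search thread

# e.g. 40 asynchronous search threads feeding the GPU worker(s):
# for _ in range(40):
#     threading.Thread(target=cpu_search_thread, args=(run_simulation,), daemon=True).start()
```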

Page 195: Mastering the game of Go with deep neural networks and tree search: Presentation

ELO Ratings for Various Combinations of Threads

Silver et al. 2016 58

Page 196: Mastering the game of Go with deep neural networks and tree search: Presentation

Results: the strength of AlphaGo

Page 197: Mastering the game of Go with deep neural networks and tree search: Presentation

Tournament with Other Go Programs

Silver et al. 2016 59

Page 198: Mastering the game of Go with deep neural networks and tree search: Presentation

Fan Hui

� professional 2 dan

� European Go Champion in 2013, 2014 and 2015

� European Professional Go Champion in 2016

� biological neural network:

� 100 billion neurons

� 100 to 1,000 trillion neuronal connections

https://en.wikipedia.org/wiki/Fan_Hui 60

Page 205: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Fan Hui

AlphaGo won 5-0 in a formal match in October 2015.

[AlphaGo] is very strong and stable, it seems

like a wall. ... I know AlphaGo is a computer,

but if no one told me, maybe I would think

the player was a little strange, but a very

strong player, a real person.

Fan Hui

61

Page 208: Mastering the game of Go with deep neural networks and tree search: Presentation

Lee Sedol “The Strong Stone”

� professional 9 dan

� 2nd in the number of international titles won

� the 5th youngest (12 years 4 months) to become

a professional Go player in South Korean history

� Lee Sedol would win 97 out of 100 games against Fan Hui.

� biological neural network, comparable to Fan Hui’s (in number

of neurons and connections)

https://en.wikipedia.org/wiki/Lee_Sedol 62

Page 214: Mastering the game of Go with deep neural networks and tree search: Presentation

I heard Google DeepMind’s AI is surprisingly

strong and getting stronger, but I am

confident that I can win, at least this time.

Lee Sedol

...even beating AlphaGo by 4-1 may allow

the Google DeepMind team to claim its de

facto victory and the defeat of him

[Lee Sedol], or even humankind.

interview in JTBC

Newsroom

62

Page 217: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Lee Sedol

In March 2016, AlphaGo won 4-1 against the legendary Lee Sedol.

AlphaGo won all but the 4th game; all games were won

by resignation.

The winner of the match was slated to win $1 million.

Since AlphaGo won, Google DeepMind stated that the prize would be

donated to charities, including UNICEF, and to Go organisations.

Lee received $170,000 ($150,000 for participating in all five

games, and an additional $20,000 for each game won).

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol 63

Page 223: Mastering the game of Go with deep neural networks and tree search: Presentation

Conclusion

Page 224: Mastering the game of Go with deep neural networks and tree search: Presentation

Difficulties of Go

� challenging decision-making

� intractable search space

� complex optimal solution

It appears infeasible to approximate the optimal solution directly with a policy or value function!

Silver et al. 2016 64

Page 227: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo: summary

� Monte Carlo tree search

� effective move selection and position evaluation

� through deep convolutional neural networks

� trained by a novel combination of supervised and reinforcement

learning

� new search algorithm combining

� neural network evaluation

� Monte Carlo rollouts

� scalable implementation

� multi-threaded simulations on CPUs

� parallel GPU computations

� distributed version over multiple machines

Silver et al. 2016 65

Page 238: Mastering the game of Go with deep neural networks and tree search: Presentation

Novel approach

During the match against Fan Hui, AlphaGo evaluated thousands

of times fewer positions than Deep Blue did against Kasparov.

It compensated for this by:

� selecting those positions more intelligently (policy network)

� evaluating them more precisely (value network)

Deep Blue relied on a handcrafted evaluation function.

AlphaGo was trained directly and automatically from gameplay.

It used general-purpose learning.

This approach is not specific to the game of Go. The algorithm

can be used for a much wider class of (so far seemingly)

intractable problems in AI!

Silver et al. 2016 66

Page 246: Mastering the game of Go with deep neural networks and tree search: Presentation

Thank you!

Questions?

66

Page 247: Mastering the game of Go with deep neural networks and tree search: Presentation

Backup slides

Page 248: Mastering the game of Go with deep neural networks and tree search: Presentation

Input features for rollout and tree policy

Silver et al. 2016

Page 249: Mastering the game of Go with deep neural networks and tree search: Presentation

Results of a tournament between different Go programs

Silver et al. 2016

Page 250: Mastering the game of Go with deep neural networks and tree search: Presentation

Results of a tournament between AlphaGo and distributed AlphaGo,

testing scalability with hardware

Silver et al. 2016

Page 251: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Fan Hui: Game 1

Silver et al. 2016

Page 252: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Fan Hui: Game 2

Silver et al. 2016

Page 253: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Fan Hui: Game 3

Silver et al. 2016

Page 254: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Fan Hui: Game 4

Silver et al. 2016

Page 255: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Fan Hui: Game 5

Silver et al. 2016

Page 256: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Lee Sedol: Game 1

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol

Page 257: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Lee Sedol: Game 2 (1/2)

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol

Page 258: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Lee Sedol: Game 2 (2/2)

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol

Page 259: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Lee Sedol: Game 3

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol

Page 260: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Lee Sedol: Game 4

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol

Page 261: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Lee Sedol: Game 5 (1/2)

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol

Page 262: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Lee Sedol: Game 5 (2/2)

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol

Page 263: Mastering the game of Go with deep neural networks and tree search: Presentation

Further Reading I

AlphaGo:

� Google Research Blog

http://googleresearch.blogspot.cz/2016/01/alphago-mastering-ancient-game-of-go.html

� an article in Nature

http://www.nature.com/news/google-ai-algorithm-masters-ancient-game-of-go-1.19234

� a reddit article claiming that AlphaGo is even stronger than it appears to be:

“AlphaGo would rather win by less points, but with higher probability.”

https://www.reddit.com/r/baduk/comments/49y17z/the_true_strength_of_alphago/

Articles by Google DeepMind:

� Atari player: a DeepRL system which combines Deep Neural Networks with Reinforcement Learning (Mnih

et al. 2015)

� Neural Turing Machines (Graves, Wayne, and Danihelka 2014)

Artificial Intelligence:

� Artificial Intelligence course at MIT

http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/

6-034-artificial-intelligence-fall-2010/index.htm

� Introduction to Artificial Intelligence at Udacity

https://www.udacity.com/course/intro-to-artificial-intelligence--cs271

Page 264: Mastering the game of Go with deep neural networks and tree search: Presentation

Further Reading II

� General Game Playing course https://www.coursera.org/course/ggp

� Singularity http://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html + Part 2

� The Singularity Is Near (Kurzweil 2005)

Combinatorial Game Theory (founded by John H. Conway to study endgames in Go):

� Combinatorial Game Theory course https://www.coursera.org/learn/combinatorial-game-theory

� On Numbers and Games (Conway 1976)

Machine Learning:

� Machine Learning course

https://www.coursera.org/learn/machine-learning/

� Reinforcement Learning http://reinforcementlearning.ai-depot.com/

� Deep Learning (LeCun, Bengio, and Hinton 2015)

� Deep Learning course https://www.udacity.com/course/deep-learning--ud730

� Two Minute Papers https://www.youtube.com/user/keeroyz

� Applications of Deep Learning https://youtu.be/hPKJBXkyTKM

Neuroscience:

� http://www.brainfacts.org/

Page 265: Mastering the game of Go with deep neural networks and tree search: Presentation

References I

Allis, Louis Victor et al. (1994). Searching for solutions in games and artificial intelligence. Ponsen & Looijen.

Baudis, Petr and Jean-loup Gailly (2011). “Pachi: State of the art open source Go program”. In: Advances in

Computer Games. Springer, pp. 24–38.

Bowling, Michael et al. (2015). “Heads-up limit hold'em poker is solved”. In: Science 347.6218, pp. 145–149. url:

http://poker.cs.ualberta.ca/15science.html.

Conway, John Horton (1976). “On Numbers and Games”. In: London Mathematical Society Monographs 6.

Corrado, Greg (2015). Computer, respond to this email. url:

http://googleresearch.blogspot.cz/2015/11/computer-respond-to-this-email.html#1 (visited on

03/31/2016).

Dieterle, Frank Jochen (2003). “Multianalyte quantifications by means of integration of artificial neural networks,

genetic algorithms and chemometrics for time-resolved analytical data”. PhD thesis. Universität Tübingen.

Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge (2015). “A Neural Algorithm of Artistic Style”. In:

CoRR abs/1508.06576. url: http://arxiv.org/abs/1508.06576.

Graves, Alex, Greg Wayne, and Ivo Danihelka (2014). “Neural turing machines”. In: arXiv preprint

arXiv:1410.5401.

Hayes, Bradley (2016). url: https://twitter.com/deepdrumpf.

Page 266: Mastering the game of Go with deep neural networks and tree search: Presentation

References II

Karpathy, Andrej (2015). The Unreasonable Effectiveness of Recurrent Neural Networks. url:

http://karpathy.github.io/2015/05/21/rnn-effectiveness/ (visited on 04/01/2016).

Kurzweil, Ray (2005). The singularity is near: When humans transcend biology. Penguin.

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton (2015). “Deep learning”. In: Nature 521.7553, pp. 436–444.

Li, Chuan and Michael Wand (2016). “Combining Markov Random Fields and Convolutional Neural Networks for

Image Synthesis”. In: CoRR abs/1601.04589. url: http://arxiv.org/abs/1601.04589.

Mnih, Volodymyr et al. (2015). “Human-level control through deep reinforcement learning”. In: Nature 518.7540,

pp. 529–533. url:

https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf.

Munroe, Randall. Game AIs. url: https://xkcd.com/1002/ (visited on 04/02/2016).

Silver, David et al. (2016). “Mastering the game of Go with deep neural networks and tree search”. In: Nature

529.7587, pp. 484–489.

Sun, Felix. DeepHear - Composing and harmonizing music with neural networks. url:

http://web.mit.edu/felixsun/www/neural-music.html (visited on 04/02/2016).