Mastering the game of Go with deep neural networks and tree search: Presentation


Transcript of Mastering the game of Go with deep neural networks and tree search: Presentation

Page 1: Mastering the game of Go with deep neural networks and tree search: Presentation

Mastering the game of Go

with deep neural networks and tree search

Karel Ha

article by Google DeepMind

Spring School of Combinatorics 2016

Page 2: Mastering the game of Go with deep neural networks and tree search: Presentation

Why AI?

Page 3: Mastering the game of Go with deep neural networks and tree search: Presentation

Applications of AI

• spam filters
• recommender systems (Netflix, YouTube)
• predictive text (SwiftKey)
• audio recognition (Shazam, SoundHound)
• music generation (DeepHear - Composing and harmonizing music with neural networks)
• self-driving cars


Page 9: Mastering the game of Go with deep neural networks and tree search: Presentation

Auto Reply Feature of Google Inbox

Corrado 2015 2

Page 10: Mastering the game of Go with deep neural networks and tree search: Presentation

Artistic-style Painting

[1] Gatys, Ecker, and Bethge 2015 [2] Li and Wand 2016 3


Page 12: Mastering the game of Go with deep neural networks and tree search: Presentation

Baby Names Generated Character by Character

• Baby Killiel Saddie Char Ahbort With
• Rudi Levette Berice Lussa Hany Mareanne Chrestina Carissy
• Marylen Hammine Janye Marlise Jacacrie Hendred Romand Charienna Nenotto Ette Dorane Wallen Marly Darine Salina Elvyn Ersia Maralena Minoria Ellia Charmin Antley Nerille Chelon Walmor Evena Jeryly Stachon Charisa Allisa Anatha Cathanie Geetra Alexie Jerin Cassen Herbett Cossie Velen Daurenge Robester Shermond Terisa Licia Roselen Ferine Jayn Lusine Charyanne Sales Sanny Resa Wallon Martine Merus Jelen Candica Wallin Tel Rachene Tarine Ozila Ketia Shanne Arnande Karella Roselina Alessia Chasty Deland Berther Geamar Jackein Mellisand Sagdy Nenc Lessie Rasemy Guen

Karpathy 2015 4


Page 15: Mastering the game of Go with deep neural networks and tree search: Presentation

C code Generated Character by Character

Karpathy 2015 5

Page 16: Mastering the game of Go with deep neural networks and tree search: Presentation

Algebraic Geometry Generated Character by Character

Karpathy 2015 6

Page 17: Mastering the game of Go with deep neural networks and tree search: Presentation

DeepDrumpf

https://twitter.com/deepdrumpf = a Twitter bot that has learned the language of Donald Trump from his speeches

Hayes 2016 7


Page 19: Mastering the game of Go with deep neural networks and tree search: Presentation

Atari Player by Google DeepMind

https://youtu.be/0X-NdPtFKq0?t=21m13s

Mnih et al. 2015 8


Page 21: Mastering the game of Go with deep neural networks and tree search: Presentation

Heads-up Limit Hold'em Poker Is Solved!

Cepheus http://poker.srv.ualberta.ca/

Bowling et al. 2015 9


Page 23: Mastering the game of Go with deep neural networks and tree search: Presentation

Basics of Machine learning

Page 24: Mastering the game of Go with deep neural networks and tree search: Presentation

Supervised versus Unsupervised Learning

Supervised learning:
• data set must be labelled
• e.g. which e-mail is regular/spam, which image is duck/face, ...

Unsupervised learning:
• data set is not labelled
• it can try to cluster the data into different groups
• e.g. grouping similar news, ...


Page 32: Mastering the game of Go with deep neural networks and tree search: Presentation

Supervised Learning

1. data collection: Google Search, Facebook “Likes”, Siri, Netflix, YouTube views, LHC collisions, KGS Go Server, ...

2. training on training set

3. testing on testing set

4. deployment

http://www.nickgillian.com/ 11
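To make steps 2 and 3 concrete, here is a minimal sketch in Python of training and testing a toy 1-nearest-neighbour classifier on made-up data (an illustration added here, not part of the original slides; the data, threshold and split sizes are arbitrary):

import random

def nearest_neighbour(train, x):
    # 1-NN: predict the label of the closest training example
    return min(train, key=lambda example: abs(example[0] - x))[1]

# step 1 (toy stand-in): collect labelled data, label = 1 iff x > 5
data = [(x, int(x > 5.0)) for x in (random.uniform(0, 10) for _ in range(200))]
random.shuffle(data)

train, test = data[:150], data[150:]                # steps 2-3: training set / testing set

accuracy = sum(nearest_neighbour(train, x) == y for x, y in test) / len(test)
print("accuracy on the testing set:", accuracy)     # step 4 would be deploying the model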


Page 39: Mastering the game of Go with deep neural networks and tree search: Presentation

Regression



Page 41: Mastering the game of Go with deep neural networks and tree search: Presentation

Mathematical Regression

https://thermanuals.wordpress.com/descriptive-analysis/sampling-and-regression/ 13

Page 42: Mastering the game of Go with deep neural networks and tree search: Presentation

Classification

https://kevinbinz.files.wordpress.com/2014/08/ml-svm-after-comparison.png 14

Page 43: Mastering the game of Go with deep neural networks and tree search: Presentation

Underfitting and Overfitting

Beware of overfitting!

It is like studying for a math exam by memorizing the proofs.

https://www.researchgate.net/post/How_to_Avoid_Overfitting 15


Page 46: Mastering the game of Go with deep neural networks and tree search: Presentation

Reinforcement Learning

Especially: games of self-play

https://youtu.be/0X-NdPtFKq0?t=16m57s 16


Page 48: Mastering the game of Go with deep neural networks and tree search: Presentation

Monte Carlo Tree Search

Page 49: Mastering the game of Go with deep neural networks and tree search: Presentation

Tree Search

Optimal value v∗(s) determines the outcome of the game:
• from every board position or state s
• under perfect play by all players.

It is computed by recursively traversing a search tree containing approximately b^d possible sequences of moves (a minimal recursive sketch follows below), where
• b is the game's breadth (number of legal moves per position)
• d is its depth (game length)

Silver et al. 2016 17
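As a minimal sketch of this recursive definition (not AlphaGo's code; the toy game below, where players alternately take 1 or 2 stones and whoever takes the last stone wins, stands in for Go):

def optimal_value(stones, player):
    # v*(s): outcome under perfect play, found by exhaustively traversing all ~b^d move sequences
    if stones == 0:
        return -player                      # the previous player took the last stone and won
    values = [optimal_value(stones - take, -player)
              for take in (1, 2) if take <= stones]
    return max(values) if player == +1 else min(values)

print(optimal_value(7, +1))                 # +1: the first player can force a win from 7 stones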


Page 56: Mastering the game of Go with deep neural networks and tree search: Presentation

Game tree of Go

Sizes of trees for various games (a rough arithmetic check of these figures follows after this slide):
• chess: b ≈ 35, d ≈ 80
• Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the universe!

That makes Go a googol times more complex than chess.

https://deepmind.com/alpha-go.html

How to handle the size of the game tree?
• for the breadth: a neural network to select moves
• for the depth: a neural network to evaluate the current position
• for the tree traversal: Monte Carlo tree search (MCTS)

Allis et al. 1994 18
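A back-of-the-envelope check of those b^d figures (simple arithmetic added here for illustration; it counts move sequences, not distinct positions):

from math import log10

chess = 80 * log10(35)          # log10 of 35^80   -> about 124
go    = 150 * log10(250)        # log10 of 250^150 -> about 360
print(f"chess ~ 10^{chess:.0f} sequences, Go ~ 10^{go:.0f} sequences")
print(f"Go / chess ~ 10^{go - chess:.0f}")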


Page 63: Mastering the game of Go with deep neural networks and tree search: Presentation

Monte Carlo tree search


Page 64: Mastering the game of Go with deep neural networks and tree search: Presentation

Neural networks

Page 65: Mastering the game of Go with deep neural networks and tree search: Presentation

Neural Network: Inspiration

• inspired by the neuronal structure of the mammalian cerebral cortex
• but on much smaller scales
• suitable to model systems with a high tolerance to error
• e.g. audio or image recognition

http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 20


Page 70: Mastering the game of Go with deep neural networks and tree search: Presentation

Neural Network: Modes

Two modes:
• feedforward for making predictions
• backpropagation for learning

Dieterle 2003 21


Page 74: Mastering the game of Go with deep neural networks and tree search: Presentation

Neural Network: an example of feedforward

http://stevenmiller888.github.io/mind-how-to-build-a-neural-network/ 22
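A minimal feedforward pass in code, in the spirit of that walkthrough (a tiny fully connected 2-3-1 network with sigmoid units; the weights are made-up numbers, only the mechanics matter):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def feedforward(x, W1, b1, W2, b2):
    # input -> hidden layer -> single output, each unit applying a sigmoid to its weighted sum
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)

W1 = [[0.8, 0.2], [0.4, 0.9], [0.3, 0.5]]   # made-up weights: 3 hidden units, 2 inputs each
b1 = [0.0, 0.0, 0.0]
W2 = [0.3, 0.5, 0.9]                        # output unit weights over the 3 hidden activations
b2 = 0.0
print(feedforward([1.0, 1.0], W1, b1, W2, b2))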

Page 75: Mastering the game of Go with deep neural networks and tree search: Presentation

Gradient Descent in Neural Networks

Motto: “Learn from mistakes!”

However, error functions are not necessarily convex, nor as “smooth” as the one pictured (a one-dimensional sketch follows below).

http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 23
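A one-dimensional sketch of gradient descent on a toy error function (added for illustration; this function is convex, unlike typical neural-network error surfaces):

def gradient(w):
    return 2.0 * (w - 3.0)              # derivative of E(w) = (w - 3)^2

w, learning_rate = 0.0, 0.1
for step in range(50):
    w -= learning_rate * gradient(w)    # move against the gradient: learn from mistakes
print(w)                                # approaches the minimum at w = 3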


Page 78: Mastering the game of Go with deep neural networks and tree search: Presentation

Deep Neural Network: Inspiration

The hierarchy of concepts is captured in the number of layers (the “deep” in “deep learning”).

http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 24


Page 80: Mastering the game of Go with deep neural networks and tree search: Presentation

Convolutional Neural Network

http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 25

Page 81: Mastering the game of Go with deep neural networks and tree search: Presentation

Rules of Go

Page 82: Mastering the game of Go with deep neural networks and tree search: Presentation

Classic games (1/2)

Backgammon: Man vs. Fate

Chess: Man vs. Man



Page 85: Mastering the game of Go with deep neural networks and tree search: Presentation

Classic games (2/2)

Go: Man vs. Self

Robert Samal (White) versus Karel Kral (Black), Spring School of Combinatorics 2016 27

Page 86: Mastering the game of Go with deep neural networks and tree search: Presentation

Rules of Go

Black versus White. Black starts the game.

the rule of liberty

the “ko” rule

Handicap for difference in ranks: Black can place 1 or more stones in advance (compensation for White’s greater strength).


Page 92: Mastering the game of Go with deep neural networks and tree search: Presentation

Scoring Rules: Area Scoring

A player’s score is:
• the number of stones that the player has on the board
• plus the number of empty intersections surrounded by that player’s stones
• plus komi (komidashi) points for the White player, as compensation for the first-move advantage of the Black player (a scoring sketch follows below)

https://en.wikipedia.org/wiki/Go_(game) 29
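A minimal sketch of area scoring on a tiny board (illustrative only; '.' marks an empty intersection, 'B'/'W' mark stones, and the komi value is just an example):

def area_score(board, komi=6.5):
    size = len(board)
    stones = {'B': 0, 'W': 0}
    territory = {'B': 0, 'W': 0}
    seen = set()
    for r in range(size):
        for c in range(size):
            if board[r][c] in stones:
                stones[board[r][c]] += 1
            elif (r, c) not in seen:
                # flood-fill one empty region and note which colours border it
                region, border, stack = 0, set(), [(r, c)]
                seen.add((r, c))
                while stack:
                    y, x = stack.pop()
                    region += 1
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < size and 0 <= nx < size:
                            if board[ny][nx] == '.':
                                if (ny, nx) not in seen:
                                    seen.add((ny, nx))
                                    stack.append((ny, nx))
                            else:
                                border.add(board[ny][nx])
                if len(border) == 1:                 # empty region surrounded by one colour only
                    territory[border.pop()] += region
    return stones['B'] + territory['B'], stones['W'] + territory['W'] + komi

board = ["BB.WW",
         "BB.WW",
         "BB.WW",
         "BB.WW",
         "BB.WW"]
print(area_score(board))   # (10, 16.5): the middle column touches both colours, so it is no one's territory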


Page 96: Mastering the game of Go with deep neural networks and tree search: Presentation

Ranks of Players

Kyu and Dan ranks

or, alternatively, Elo ratings

https://en.wikipedia.org/wiki/Go_(game) 30


Page 98: Mastering the game of Go with deep neural networks and tree search: Presentation

Chocolate micro-break


Page 99: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo: Inside Out

Page 100: Mastering the game of Go with deep neural networks and tree search: Presentation

Policy and Value Networks

Silver et al. 2016 31

Page 101: Mastering the game of Go with deep neural networks and tree search: Presentation

Training the (Deep Convolutional) Neural Networks

Silver et al. 2016 32

Page 102: Mastering the game of Go with deep neural networks and tree search: Presentation

SL Policy Networks (1/3)

• 13-layer deep convolutional neural network
• goal: to predict expert human moves
• task of classification
• trained on 30 million positions from the KGS Go Server
• stochastic gradient ascent (a toy sketch of this update follows below):

∆σ ∝ ∂ log pσ(a|s) / ∂σ

(to maximize the likelihood of the human move a selected in state s)

Results:
• 44.4% accuracy (the state of the art from other groups)
• 55.7% accuracy (raw board position + move history as input)
• 57.0% accuracy (all input features)

Silver et al. 2016 33
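A toy sketch of that gradient-ascent step for a linear softmax policy over a handful of candidate moves (illustrative only: two made-up features and three moves stand in for the 13-layer network and the 19×19 board):

import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sl_update(sigma, features, expert_move, learning_rate=0.1):
    # p_sigma(a|s) = softmax over a of sigma[a] . features(s)
    logits = [sum(w * f for w, f in zip(sigma[a], features)) for a in range(len(sigma))]
    probs = softmax(logits)
    for a in range(len(sigma)):
        grad = (1.0 if a == expert_move else 0.0) - probs[a]   # d log p(expert|s) / d logit_a
        for i, f in enumerate(features):
            sigma[a][i] += learning_rate * grad * f            # ascent: maximise the likelihood
    return sigma

sigma = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # weights for 3 candidate moves, 2 toy features
print(sl_update(sigma, features=[1.0, 0.5], expert_move=2))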


Page 112: Mastering the game of Go with deep neural networks and tree search: Presentation

SL Policy Networks (2/3)

Small improvements in accuracy led to large improvements in playing strength (see the next slide).

Silver et al. 2016 34

Page 113: Mastering the game of Go with deep neural networks and tree search: Presentation

SL Policy Networks (3/3)

Move probabilities taken directly from the SL policy network pσ (reported as a percentage if above 0.1%).

Silver et al. 2016 35

Page 114: Mastering the game of Go with deep neural networks and tree search: Presentation

Training the (Deep Convolutional) Neural Networks

Silver et al. 2016 36

Page 115: Mastering the game of Go with deep neural networks and tree search: Presentation

Rollout Policy

• Rollout policy pπ(a|s) is faster but less accurate than the SL policy network.
• accuracy of 24.2%
• It takes 2 µs to select an action, compared to 3 ms in the case of the SL policy network.

Silver et al. 2016 37


Page 118: Mastering the game of Go with deep neural networks and tree search: Presentation

Training the (Deep Convolutional) Neural Networks

Silver et al. 2016 38

Page 119: Mastering the game of Go with deep neural networks and tree search: Presentation

RL Policy Networks (1/2)

• identical in structure to the SL policy network
• goal: to win in the games of self-play
• task of classification
• weights ρ initialized to the same values, ρ := σ
• games of self-play
  • between the current RL policy network and a randomly selected previous iteration
  • to prevent overfitting to the current policy
• stochastic gradient ascent (a toy sketch of this update follows below):

∆ρ ∝ ∂ log pρ(at|st) / ∂ρ · zt

at time step t, where the reward function zt is +1 for winning and −1 for losing.

Silver et al. 2016 39
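A toy REINFORCE-style sketch of that update, reusing the same linear softmax policy as in the SL sketch above (illustrative only; in AlphaGo the update is applied over full self-play games with the actual network):

import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rl_update(rho, game, z, learning_rate=0.1):
    # game: list of (features, chosen_move) pairs; z: +1 if this player won, -1 if it lost
    for features, move in game:
        logits = [sum(w * f for w, f in zip(rho[a], features)) for a in range(len(rho))]
        probs = softmax(logits)
        for a in range(len(rho)):
            grad = (1.0 if a == move else 0.0) - probs[a]      # d log p(move|s) / d logit_a
            for i, f in enumerate(features):
                rho[a][i] += learning_rate * z * grad * f      # reinforce the moves of won games
    return rho

rho = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(rl_update(rho, game=[([1.0, 0.0], 0), ([0.0, 1.0], 2)], z=+1))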


Page 129: Mastering the game of Go with deep neural networks and tree search: Presentation

RL Policy Networks (2/2)

Results (by sampling each move at ∼ pρ(·|st)):
• 80% win rate against the SL policy network
• 85% win rate against the strongest open-source Go program, Pachi (Baudis and Gailly 2011)
• The previous state of the art, based only on SL of CNN: 11% “win” rate against Pachi

Silver et al. 2016 40


Page 135: Mastering the game of Go with deep neural networks and tree search: Presentation

Training the (Deep Convolutional) Neural Networks

Silver et al. 2016 41

Page 136: Mastering the game of Go with deep neural networks and tree search: Presentation

Value Network (1/2)

• similar architecture to the policy network, but outputs a single prediction instead of a probability distribution
• goal: to estimate a value function

vp(s) = E[zt | st = s, at...T ∼ p]

that predicts the outcome from position s (of games played by using policy pρ)
• Specifically, vθ(s) ≈ vpρ(s) ≈ v∗(s).
• task of regression
• stochastic gradient descent (a toy sketch of this update follows below):

∆θ ∝ ∂vθ(s) / ∂θ · (z − vθ(s))

(to minimize the mean squared error (MSE) between the predicted vθ(s) and the true outcome z)

Silver et al. 2016 42
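A toy sketch of that regression update for a linear value function (illustrative only; the real vθ is a deep convolutional network and z comes from finished self-play games):

def value_update(theta, features, z, learning_rate=0.01):
    # one SGD step on the squared error (z - v_theta(s))^2 for v_theta(s) = theta . features(s)
    v = sum(w * f for w, f in zip(theta, features))
    error = z - v                                   # positive if the position was undervalued
    return [w + learning_rate * error * f for w, f in zip(theta, features)]

theta = [0.0, 0.0, 0.0]
for _ in range(100):
    theta = value_update(theta, features=[1.0, 0.5, -0.2], z=+1.0)   # a game won from this position
print(theta)   # the predicted value theta . features moves towards z = +1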


Page 141: Mastering the game of Go with deep neural networks and tree search: Presentation

Value Network (2/2)

Beware of overfitting!
• Successive positions are strongly correlated.
• The value network memorized the game outcomes, rather than generalizing to new positions.
• Solution: generate 30 million (new) positions, each sampled from a separate game
• almost the accuracy of Monte Carlo rollouts (using pρ), but with 15,000 times less computation!

Silver et al. 2016 43


Page 146: Mastering the game of Go with deep neural networks and tree search: Presentation

Selection of Moves by the Value Network

Evaluation of all successors s′ of the root position s, using vθ(s′).

Silver et al. 2016 44

Page 147: Mastering the game of Go with deep neural networks and tree search: Presentation

Evaluation accuracy in various stages of a game

Move number is the number of moves that had been played in the given position.

Each position evaluated by:

• forward pass of the value network vθ
• 100 rollouts, played out using the corresponding policy

Silver et al. 2016 45


Page 150: Mastering the game of Go with deep neural networks and tree search: Presentation

Training the (Deep Convolutional) Neural Networks

Silver et al. 2016 46

Page 151: Mastering the game of Go with deep neural networks and tree search: Presentation

Elo Ratings for Various Combinations of Networks

Silver et al. 2016 47

Page 152: Mastering the game of Go with deep neural networks and tree search: Presentation

MCTS Algorithm

The next action is selected by lookahead search, using simulation:

1. selection phase
2. expansion phase
3. evaluation phase
4. backup phase (at the end of the simulation)

Each edge (s, a) keeps:
• action value Q(s, a)
• visit count N(s, a)
• prior probability P(s, a) (from the SL policy network pσ)

The tree is traversed by simulation (descending the tree) from the root state (a minimal sketch of the selection rule follows below).

Silver et al. 2016 48
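A minimal sketch of the selection rule used while descending the tree: pick the action maximising Q(s, a) plus an exploration bonus that grows with the prior P(s, a) and shrinks with the visit count N(s, a) (the constant and the exact formula of AlphaGo's variant are simplified here):

import math

def select_action(edges, c_puct=5.0):
    # edges: {action: {'Q': action value, 'N': visit count, 'P': prior from the SL policy network}}
    total_visits = sum(edge['N'] for edge in edges.values())
    def score(edge):
        bonus = c_puct * edge['P'] * math.sqrt(total_visits) / (1 + edge['N'])
        return edge['Q'] + bonus
    return max(edges, key=lambda action: score(edges[action]))

edges = {
    'move A': {'Q': 0.52, 'N': 40, 'P': 0.30},
    'move B': {'Q': 0.48, 'N': 10, 'P': 0.45},   # fewer visits + larger prior -> larger bonus
    'move C': {'Q': 0.10, 'N':  5, 'P': 0.05},
}
print(select_action(edges))                      # 'move B'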

Page 163: Mastering the game of Go with deep neural networks and tree search: Presentation

MCTS Algorithm: Selection

At each time step t, an action at is selected from state st:

at = arg max_a (Q(st, a) + u(st, a))

where the bonus

u(st, a) ∝ P(s, a) / (1 + N(s, a))

Silver et al. 2016 49
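
A small sketch of this selection rule, reusing the `EdgeStats`/`Node` types above; the constant `c_puct` is an assumed proportionality factor, not a value taken from the paper.

```python
def select_action(node, c_puct=5.0):
    """Pick argmax_a ( Q(st, a) + u(st, a) ) with u ∝ P(s, a) / (1 + N(s, a))."""
    def score(item):
        _action, (stats, _child) = item
        u = c_puct * stats.P / (1 + stats.N)   # exploration bonus: high prior, few visits
        return stats.Q + u
    return max(node.edges.items(), key=score)
```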

Page 166: Mastering the game of Go with deep neural networks and tree search: Presentation

MCTS Algorithm: Expansion

A leaf position may be expanded (just once) by the SL policy network pσ.

The output probabilities are stored as priors P(s, a) := pσ(a|s).

Silver et al. 2016 50
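
Illustrative sketch of the expansion step, continuing the types above; `sl_policy`, `legal_actions`, and `apply_move` are hypothetical helpers standing in for pσ, move generation, and the state transition.

```python
def expand(leaf, sl_policy, legal_actions, apply_move):
    """Expand a leaf once: store pσ(a|s) as the prior P(s, a) of each new edge."""
    priors = sl_policy(leaf.state)                 # assumed: dict action -> probability
    for action in legal_actions(leaf.state):
        child = Node(state=apply_move(leaf.state, action))
        leaf.edges[action] = (EdgeStats(P=priors[action]), child)
```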

Page 169: Mastering the game of Go with deep neural networks and tree search: Presentation

MCTS: Evaluation

� evaluation from the value network vθ(s)

� evaluation by the outcome z using the fast rollout policy pπ until the end of the game

Using a mixing parameter λ, the final leaf evaluation V (s) is

V(s) = (1 − λ) vθ(s) + λz

Silver et al. 2016 51
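
A minimal sketch of this mixed leaf evaluation; `value_net` and `rollout_to_end` are assumed stand-ins for vθ and a fast pπ playout, and `lam` is the mixing parameter λ.

```python
def evaluate_leaf(leaf, value_net, rollout_to_end, lam=0.5):
    """V(s) = (1 - λ) * vθ(s) + λ * z, mixing value network and rollout outcome."""
    v = value_net(leaf.state)        # vθ(s): value-network prediction
    z = rollout_to_end(leaf.state)   # z: outcome of a fast pπ rollout played to the end
    return (1 - lam) * v + lam * z
```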

Page 174: Mastering the game of Go with deep neural networks and tree search: Presentation

Tree Evaluation from Value Network

action values Q(s, a) for each tree-edge (s, a) from root position s (averaged over value network evaluations only)

Silver et al. 2016 52

Page 175: Mastering the game of Go with deep neural networks and tree search: Presentation

Tree Evaluation from Rollouts

action values Q(s, a), averaged over rollout evaluations only

Silver et al. 2016 53

Page 176: Mastering the game of Go with deep neural networks and tree search: Presentation

MCTS: Backup

At the end of each simulation, every traversed edge is updated by accumulating:

� the action values Q

� visit counts N

Silver et al. 2016 54
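
Sketch of the backup step over the traversed edges (illustrative, reusing the `EdgeStats` fields above; Q is recovered as the running mean W / N):

```python
def backup(path, value):
    """Accumulate visit counts and action values along every traversed edge."""
    for stats in path:       # path: EdgeStats of the edges visited in this simulation
        stats.N += 1         # visit count N(s, a)
        stats.W += value     # running total; Q(s, a) = W / N is the mean
```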

Page 178: Mastering the game of Go with deep neural networks and tree search: Presentation

Once the search is complete, the algorithm

chooses the most visited move from the root

position.

Silver et al. 2016 54
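
As a one-line illustration with the same sketched types:

```python
def choose_move(root):
    """After the search, play the most-visited action from the root position."""
    return max(root.edges.items(), key=lambda kv: kv[1][0].N)[0]
```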

Page 179: Mastering the game of Go with deep neural networks and tree search: Presentation

Percentage of Simulations

percentage frequency with which actions were selected from the root during simulations

Silver et al. 2016 55

Page 180: Mastering the game of Go with deep neural networks and tree search: Presentation

Principal Variation (Path with Maximum Visit Count)

The moves are presented in a numbered sequence.

� AlphaGo selected the move indicated by the red circle;

� Fan Hui responded with the move indicated by the white square;

� in his post-game commentary, he preferred the move (labelled 1) predicted by AlphaGo.

Silver et al. 2016 56

Page 184: Mastering the game of Go with deep neural networks and tree search: Presentation

Scalability

� asynchronous multi-threaded search

� simulations on CPUs

� computation of neural networks on GPUs

AlphaGo:

� 40 search threads

� 48 CPUs

� 8 GPUs

Distributed version of AlphaGo (on multiple machines):

� 40 search threads

� 1202 CPUs

� 176 GPUs

Silver et al. 2016 57
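
A very rough sketch of this division of labour (purely illustrative, not DeepMind's code): CPU search threads run simulations and queue positions, while a GPU worker evaluates them in batches; `run_simulation` and `network` are assumed callables.

```python
import queue
import threading

eval_requests = queue.Queue()        # positions waiting for a neural-network evaluation

def cpu_search_thread(run_simulation):
    """Search thread: runs MCTS simulations/rollouts on a CPU core."""
    while True:
        run_simulation(eval_requests)          # assumed: enqueues (position, callback) pairs

def gpu_eval_worker(network, batch_size=8):
    """GPU worker: pulls queued positions and evaluates them in batches."""
    while True:
        batch = [eval_requests.get() for _ in range(batch_size)]
        positions, callbacks = zip(*batch)
        for value, callback in zip(network(positions), callbacks):
            callback(value)                     # hand the result back to the search thread

# e.g. 40 asynchronous search threads feeding the GPU worker(s):
# for _ in range(40):
#     threading.Thread(target=cpu_search_thread, args=(run_simulation,), daemon=True).start()
```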

Page 195: Mastering the game of Go with deep neural networks and tree search: Presentation

ELO Ratings for Various Combinations of Threads

Silver et al. 2016 58

Page 196: Mastering the game of Go with deep neural networks and tree search: Presentation

Results: the strength of AlphaGo

Page 197: Mastering the game of Go with deep neural networks and tree search: Presentation

Tournament with Other Go Programs

Silver et al. 2016 59

Page 198: Mastering the game of Go with deep neural networks and tree search: Presentation

Fan Hui

� professional 2 dan

� European Go Champion in 2013, 2014 and 2015

� European Professional Go Champion in 2016

� biological neural network:

� 100 billion neurons

� 100 to 1,000 trillion neuronal connections

https://en.wikipedia.org/wiki/Fan_Hui 60

Page 205: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Fan Hui

AlphaGo won 5-0 in a formal match in October 2015.

[AlphaGo] is very strong and stable, it seems

like a wall. ... I know AlphaGo is a computer,

but if no one told me, maybe I would think

the player was a little strange, but a very

strong player, a real person.

Fan Hui

61

Page 208: Mastering the game of Go with deep neural networks and tree search: Presentation

Lee Sedol “The Strong Stone”

� professional 9 dan

� 2nd in the number of international titles won

� the 5th youngest (12 years 4 months) to become

a professional Go player in South Korean history

� Lee Sedol would win 97 out of 100 games against Fan Hui.

� biological neural network, comparable to Fan Hui’s (in number

of neurons and connections)

https://en.wikipedia.org/wiki/Lee_Sedol 62

Page 214: Mastering the game of Go with deep neural networks and tree search: Presentation

I heard Google DeepMind’s AI is surprisingly

strong and getting stronger, but I am

confident that I can win, at least this time.

Lee Sedol

...even beating AlphaGo by 4-1 may allow

the Google DeepMind team to claim its de

facto victory and the defeat of him

[Lee Sedol], or even humankind.

interview in JTBC

Newsroom

62

Page 217: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Lee Sedol

In March 2016, AlphaGo won 4-1 against the legendary Lee Sedol.

AlphaGo won all but the 4th game; all games were won

by resignation.

The winner of the match was slated to win $1 million.

Since AlphaGo won, Google DeepMind stated that the prize would be

donated to charities, including UNICEF, and to Go organisations.

Lee received $170,000 ($150,000 for participating in all five

games, and an additional $20,000 for each game won).

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol 63

Page 223: Mastering the game of Go with deep neural networks and tree search: Presentation

Conclusion

Page 224: Mastering the game of Go with deep neural networks and tree search: Presentation

Difficulties of Go

� challenging decision-making

� intractable search space

� complex optimal solution

It appears infeasible to approximate the optimal solution directly with a policy or value function!

Silver et al. 2016 64

Page 227: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo: summary

� Monte Carlo tree search

� effective move selection and position evaluation

� through deep convolutional neural networks

� trained by a novel combination of supervised and reinforcement

learning

� new search algorithm combining

� neural network evaluation

� Monte Carlo rollouts

� scalable implementation

� multi-threaded simulations on CPUs

� parallel GPU computations

� distributed version over multiple machines

Silver et al. 2016 65

Page 238: Mastering the game of Go with deep neural networks and tree search: Presentation

Novel approach

During the match against Fan Hui, AlphaGo evaluated thousands

of times fewer positions than Deep Blue did against Kasparov.

It compensated for this by:

� selecting those positions more intelligently (policy network)

� evaluating them more precisely (value network)

Deep Blue relied on a handcrafted evaluation function.

AlphaGo was trained directly and automatically from gameplay.

It used general-purpose learning.

This approach is not specific to the game of Go. The algorithm

can be used for a much wider class of (so far seemingly)

intractable problems in AI!

Silver et al. 2016 66

Page 246: Mastering the game of Go with deep neural networks and tree search: Presentation

Thank you!

Questions?

66

Page 247: Mastering the game of Go with deep neural networks and tree search: Presentation

Backup slides

Page 248: Mastering the game of Go with deep neural networks and tree search: Presentation

Input features for rollout and tree policy

Silver et al. 2016

Page 249: Mastering the game of Go with deep neural networks and tree search: Presentation

Results of a tournament between different Go programs

Silver et al. 2016

Page 250: Mastering the game of Go with deep neural networks and tree search: Presentation

Results of a tournament between AlphaGo and distributed AlphaGo,

testing scalability with hardware

Silver et al. 2016

Page 251: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Fan Hui: Game 1

Silver et al. 2016

Page 252: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Fan Hui: Game 2

Silver et al. 2016

Page 253: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Fan Hui: Game 3

Silver et al. 2016

Page 254: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Fan Hui: Game 4

Silver et al. 2016

Page 255: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Fan Hui: Game 5

Silver et al. 2016

Page 256: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Lee Sedol: Game 1

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol

Page 257: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Lee Sedol: Game 2 (1/2)

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol

Page 258: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Lee Sedol: Game 2 (2/2)

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol

Page 259: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Lee Sedol: Game 3

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol

Page 260: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Lee Sedol: Game 4

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol

Page 261: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Lee Sedol: Game 5 (1/2)

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol

Page 262: Mastering the game of Go with deep neural networks and tree search: Presentation

AlphaGo versus Lee Sedol: Game 5 (2/2)

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol

Page 263: Mastering the game of Go with deep neural networks and tree search: Presentation

Further Reading I

AlphaGo:

� Google Research Blog

http://googleresearch.blogspot.cz/2016/01/alphago-mastering-ancient-game-of-go.html

� an article in Nature

http://www.nature.com/news/google-ai-algorithm-masters-ancient-game-of-go-1.19234

� a reddit article claiming that AlphaGo is even stronger than it appears to be:

“AlphaGo would rather win by less points, but with higher probability.”

https://www.reddit.com/r/baduk/comments/49y17z/the_true_strength_of_alphago/

Articles by Google DeepMind:

� Atari player: a DeepRL system which combines Deep Neural Networks with Reinforcement Learning (Mnih

et al. 2015)

� Neural Turing Machines (Graves, Wayne, and Danihelka 2014)

Artificial Intelligence:

� Artificial Intelligence course at MIT

http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/

6-034-artificial-intelligence-fall-2010/index.htm

� Introduction to Artificial Intelligence at Udacity

https://www.udacity.com/course/intro-to-artificial-intelligence--cs271

Page 264: Mastering the game of Go with deep neural networks and tree search: Presentation

Further Reading II

� General Game Playing course https://www.coursera.org/course/ggp

� Singularity http://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html + Part 2

� The Singularity Is Near (Kurzweil 2005)

Combinatorial Game Theory (founded by John H. Conway to study endgames in Go):

� Combinatorial Game Theory course https://www.coursera.org/learn/combinatorial-game-theory

� On Numbers and Games (Conway 1976)

Machine Learning:

� Machine Learning course

https://www.coursera.org/learn/machine-learning/

� Reinforcement Learning http://reinforcementlearning.ai-depot.com/

� Deep Learning (LeCun, Bengio, and Hinton 2015)

� Deep Learning course https://www.udacity.com/course/deep-learning--ud730

� Two Minute Papers https://www.youtube.com/user/keeroyz

� Applications of Deep Learning https://youtu.be/hPKJBXkyTKM

Neuroscience:

� http://www.brainfacts.org/

Page 265: Mastering the game of Go with deep neural networks and tree search: Presentation

References I

Allis, Louis Victor et al. (1994). Searching for solutions in games and artificial intelligence. Ponsen & Looijen.

Baudis, Petr and Jean-loup Gailly (2011). “Pachi: State of the art open source Go program”. In: Advances in

Computer Games. Springer, pp. 24–38.

Bowling, Michael et al. (2015). “Heads-up limit hold'em poker is solved”. In: Science 347.6218, pp. 145–149. url:

http://poker.cs.ualberta.ca/15science.html.

Conway, John Horton (1976). “On Numbers and Games”. In: London Mathematical Society Monographs 6.

Corrado, Greg (2015). Computer, respond to this email. url:

http://googleresearch.blogspot.cz/2015/11/computer-respond-to-this-email.html#1 (visited on

03/31/2016).

Dieterle, Frank Jochen (2003). “Multianalyte quantifications by means of integration of artificial neural networks,

genetic algorithms and chemometrics for time-resolved analytical data”. PhD thesis. Universität Tübingen.

Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge (2015). “A Neural Algorithm of Artistic Style”. In:

CoRR abs/1508.06576. url: http://arxiv.org/abs/1508.06576.

Graves, Alex, Greg Wayne, and Ivo Danihelka (2014). “Neural turing machines”. In: arXiv preprint

arXiv:1410.5401.

Hayes, Bradley (2016). url: https://twitter.com/deepdrumpf.

Page 266: Mastering the game of Go with deep neural networks and tree search: Presentation

References II

Karpathy, Andrej (2015). The Unreasonable Effectiveness of Recurrent Neural Networks. url:

http://karpathy.github.io/2015/05/21/rnn-effectiveness/ (visited on 04/01/2016).

Kurzweil, Ray (2005). The singularity is near: When humans transcend biology. Penguin.

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton (2015). “Deep learning”. In: Nature 521.7553, pp. 436–444.

Li, Chuan and Michael Wand (2016). “Combining Markov Random Fields and Convolutional Neural Networks for

Image Synthesis”. In: CoRR abs/1601.04589. url: http://arxiv.org/abs/1601.04589.

Mnih, Volodymyr et al. (2015). “Human-level control through deep reinforcement learning”. In: Nature 518.7540,

pp. 529–533. url:

https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf.

Munroe, Randall. Game AIs. url: https://xkcd.com/1002/ (visited on 04/02/2016).

Silver, David et al. (2016). “Mastering the game of Go with deep neural networks and tree search”. In: Nature

529.7587, pp. 484–489.

Sun, Felix. DeepHear - Composing and harmonizing music with neural networks. url:

http://web.mit.edu/felixsun/www/neural-music.html (visited on 04/02/2016).