Stochastic Planning in Games: an AlphaGo Case-study

Tommaso Macrì | 23.03.20



AlphaGo beats 18-time world champion Lee Sedol 4 games to 1


The Game of Go


The Game of Go: Rules

The aim of the game is to surround more territory than the opponent


Comparison with Chess

Go:
• ~250 legal moves per position
• ~150 moves per game

Chess:
• ~35 legal moves per position
• ~80 moves per game


Artificial Intelligence perspective

The game of Go has challenged artificial intelligence researchers for many decades


Mastering the Game of Go: a Major Breakthrough for AI

2016


Planning and RL


Planning vs Learning

In learning, we use real experience generated by the environment (not simulated)

In planning, we use simulated experience from a model to update the value function and policy


Planning vs Learning: the Dyna-Q example

[Figure: the Dyna-Q architecture, combining direct RL updates from real experience with planning updates from the learned model]

Reinforcement Learning: An Introduction, Sutton and Barto
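The Dyna-Q loop interleaves a direct Q-learning update from each real transition with several planning updates replayed from a learned model. A minimal tabular sketch, assuming a deterministic toy environment; all function and parameter names here are illustrative, not from the slides:

```python
import random
from collections import defaultdict

def dyna_q(env_step, actions, episodes=50, n_planning=10,
           alpha=0.1, gamma=0.95, eps=0.1, start_state=0):
    """Tabular Dyna-Q: direct RL from real experience plus planning
    updates replayed from a learned (deterministic) model."""
    Q = defaultdict(float)   # Q[(state, action)] -> estimated value
    model = {}               # model[(state, action)] -> (reward, next_state, done)

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            # epsilon-greedy action selection in the real environment
            a = random.choice(actions) if random.random() < eps else greedy(s)
            r, s2, done = env_step(s, a)               # real experience
            # direct RL: one-step Q-learning update
            target = r if done else r + gamma * Q[(s2, greedy(s2))]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            model[(s, a)] = (r, s2, done)              # model learning
            # planning: n simulated updates from remembered transitions
            for _ in range(n_planning):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr if pdone else pr + gamma * Q[(ps2, greedy(ps2))]
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q
```

With `n_planning = 0` this reduces to plain Q-learning; the planning loop is what lets the value function keep improving between real environment steps.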


Planning vs Learning: pros and cons

Direct vs Indirect Learning

Reinforcement Learning: An Introduction, Sutton and Barto


MCTS (Monte Carlo Tree Search)


Monte Carlo Tree Search

Reinforcement Learning: An Introduction, Sutton and Barto


Monte Carlo Tree Search

https://en.wikipedia.org/wiki/Monte_Carlo_tree_search
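The four MCTS phases (selection, expansion, simulation, backup) can be sketched as plain UCT for a generic game. The `Game` interface (`moves`, `play`, `terminal`, `reward`) is an assumed illustration, and the sketch maximizes a single reward for brevity; a two-player version would negate values at alternating depths:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                 # move -> child Node
        self.visits, self.value = 0, 0.0

def uct_search(game, root_state, n_iter=300, c=1.4):
    """Plain UCT: repeat selection, expansion, simulation, backup."""
    root = Node(root_state)
    for _ in range(n_iter):
        node = root
        # 1. Selection: descend through fully expanded nodes via UCB1.
        while node.children and len(node.children) == len(game.moves(node.state)):
            parent = node
            node = max(node.children.values(),
                       key=lambda ch: ch.value / ch.visits
                       + c * math.sqrt(math.log(parent.visits) / ch.visits))
        # 2. Expansion: add one untried move, if any remain.
        untried = [m for m in game.moves(node.state) if m not in node.children]
        if untried:
            m = random.choice(untried)
            node.children[m] = Node(game.play(node.state, m), parent=node)
            node = node.children[m]
        # 3. Simulation: random rollout to a terminal state.
        state = node.state
        while not game.terminal(state):
            state = game.play(state, random.choice(game.moves(state)))
        reward = game.reward(state)
        # 4. Backup: propagate the rollout result up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Final move choice: the most visited child of the root.
    return max(root.children, key=lambda m: root.children[m].visits)
```

On a toy game where adding up to exactly 3 wins, the search quickly concentrates its visits on the immediately winning move.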


The AlphaGo Breakthrough


AlphaGo

2016


AlphaGo

• Go is a perfect information game
• Could we compute the optimal value function?
• Tree search cost: ~b^d, with b ≈ 250 and d ≈ 150
• Exhaustive search is infeasible
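The infeasibility is easy to quantify: b^d with b = 250 and d = 150 is astronomically large. A quick order-of-magnitude check:

```python
import math

# Approximate game-tree sizes from branching factor b and game length d.
b_go, d_go = 250, 150
b_chess, d_chess = 35, 80

log_go = d_go * math.log10(b_go)          # log10 of b**d for Go
log_chess = d_chess * math.log10(b_chess) # log10 of b**d for chess

print(f"Go:    ~10^{log_go:.0f} positions")     # ~10^360
print(f"Chess: ~10^{log_chess:.0f} positions")  # ~10^124
```

Both numbers dwarf the ~10^80 atoms in the observable universe, which is why brute-force search is hopeless and value-function approximation plus guided search is needed.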


AlphaGo: the 4 Neural Networks

Fast rollout policy: forward propagation 2 µs, move-prediction accuracy 24%
Policy network: forward propagation 3 ms, move-prediction accuracy 57%

Mastering the game of go with deep neural networks and tree search, David Silver et al.


AlphaGo: multiple outputs for the policy network, a single output for the value network

Mastering the game of go with deep neural networks and tree search, David Silver et al.
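The output shapes are the key difference: the policy head emits a probability distribution over all 361 board points, while the value head emits one scalar in [-1, 1]. A stdlib-only sketch with random stand-in activations (the numbers are assumptions, not the networks' real outputs):

```python
import math
import random

random.seed(0)

# Policy head: one logit per 19x19 board point, softmax -> 361 move probabilities.
logits = [random.gauss(0, 1) for _ in range(19 * 19)]
peak = max(logits)                          # subtract max for numerical stability
exps = [math.exp(x - peak) for x in logits]
total = sum(exps)
policy = [e / total for e in exps]          # sums to 1 over all board points

# Value head: a single scalar squashed into [-1, 1] (a win-probability proxy).
value = math.tanh(random.gauss(0, 1))
```

The policy output steers which moves the search explores; the value output replaces (part of) the rollout when evaluating a leaf position.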


AlphaGo: combining Policy Network and Value Network with MCTS

Mastering the game of go with deep neural networks and tree search, David Silver et al.
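In AlphaGo's search, each simulation picks the action maximizing Q(s,a) + u(s,a), where the exploration bonus u is proportional to the policy network's prior P(s,a) and decays with the visit count N(s,a). A sketch of that selection rule, using the AlphaGo Zero PUCT form of the bonus; the statistics dictionary and c_puct value are illustrative:

```python
import math

def select_action(stats, c_puct=5.0):
    """Return argmax_a Q(s,a) + u(s,a), where u(s,a) is proportional to
    the prior P(s,a) and decays with the visit count N(s,a)."""
    total_n = sum(n for _, n, _ in stats.values())
    def score(a):
        p, n, w = stats[a]                        # prior, visits, total value
        q = w / n if n else 0.0                   # mean evaluation so far
        u = c_puct * p * math.sqrt(total_n) / (1 + n)
        return q + u
    return max(stats, key=score)

# Hypothetical root statistics per move: (P prior, N visits, W total value).
stats = {"A": (0.6, 10, 8.0),   # Q = 0.8, well explored
         "B": (0.3, 2, 1.4),    # Q = 0.7, barely explored
         "C": (0.1, 0, 0.0)}    # unexplored, score is pure bonus
```

With the bonus on, the under-explored move "B" is tried next; with c_puct = 0 the rule degenerates to pure exploitation and picks the highest-Q move.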


AlphaGo Zero

2017


AlphaGo Zero: Only one Neural Network for Policy and Value


AlphaGo Zero: Using MCTS to select moves throughout self-play

Mastering the game of go without human knowledge, David Silver et al.
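During self-play, AlphaGo Zero plays move a with probability proportional to N(s,a)^(1/τ), where N is the root visit count produced by MCTS and τ a temperature. A sketch of that conversion (the visit counts are illustrative):

```python
def play_distribution(visit_counts, tau=1.0):
    """pi(a) proportional to N(s,a)**(1/tau): tau = 1 follows the visit
    counts directly; tau -> 0 approaches greedy (most-visited) play."""
    weights = {a: n ** (1.0 / tau) for a, n in visit_counts.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# Hypothetical root visit counts after an MCTS from the current position.
counts = {"A": 80, "B": 15, "C": 5}
```

Early in a game a high temperature keeps self-play diverse; later a low temperature makes play nearly deterministic, and the resulting distributions also serve as training targets for the policy head.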


Alpha Zero

2018


Alpha Zero: Learning and MCTS do not assume symmetry

AlphaGo and AlphaGo Zero assumed board symmetry for:
• Training data augmentation
• Bias removal in Monte Carlo evaluations

Alpha Zero also considers the “draw” outcome

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, David Silver et al.


Alpha Zero: The Algorithm Architecture

AlphaGo Zero and Alpha Zero share:
• One neural network (policy + value)
• Monte Carlo Tree Search

AlphaGo Zero:
• Waits for an iteration to conclude before updating the NN
• Compares the new policy against the current best

Alpha Zero:
• Updates the NN continuously
• Always generates self-play with the latest NN


Results: AlphaGo beats the European Go Champion

Mastering the game of go with deep neural networks and tree search, David Silver et al.


AlphaGo Zero: better performance than AlphaGo

Mastering the game of go without human knowledge, David Silver et al.


AlphaGo Zero: Learns human expert moves and beyond

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, David Silver et al.


Alpha Zero: program applied to Chess and Shogi

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, David Silver et al.


Conclusions

https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go
