Transcript of "Stochastic Planning in Games: an AlphaGo Case-study"
Tommaso Macrì | 23.03.20

Page 1:

Stochastic Planning in Games: an AlphaGo Case-study

Tommaso Macrì

Page 2:

AlphaGo beats 18-time world champion Lee Sedol 4 games to 1

Page 3:

The Game of Go

Page 4:

The Game of Go: Rules

The aim of the game is to surround more territory than the opponent.

Page 5:

Comparison with Chess

Go:
• ~250 legal moves per position
• ~150 moves per game

Chess:
• ~35 legal moves per position
• ~80 moves per game

Page 6:

Artificial Intelligence perspective

The game of Go has challenged artificial intelligence researchers for many decades.

Page 7:

Mastering the Game of Go: a Major Breakthrough for AI

2016

Page 8:

Planning and RL

Page 9:

Planning vs Learning

In learning, we use real experience generated by the environment (not simulated).

In planning, we use simulated experience to update the value function and policy.

Page 10:

Planning vs Learning: the Dyna-Q example

Direct RL

Planning

Reinforcement Learning: An Introduction, Sutton and Barto.
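The Dyna-Q loop combining direct RL and planning can be sketched as follows (a minimal tabular version; the `env_step` interface, the tiny chain environment, and all hyperparameters are illustrative choices, not from the slides):

```python
import random
from collections import defaultdict

def dyna_q(env_step, actions, episodes=30, n_planning=10,
           alpha=0.1, gamma=0.95, eps=0.1, start_state=0):
    """Tabular Dyna-Q: each real step triggers (a) a direct Q-learning
    update, (b) a model update, and (c) n_planning simulated updates."""
    Q = defaultdict(float)  # action values Q[(s, a)]
    model = {}              # learned deterministic model: (s, a) -> (r, s', done)
    seen = []               # state-action pairs observed so far

    def greedy(s):
        # break ties randomly so untried actions still get explored
        best = max(Q[(s, b)] for b in actions)
        return random.choice([b for b in actions if Q[(s, b)] == best])

    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            a = random.choice(actions) if random.random() < eps else greedy(s)
            r, s2, done = env_step(s, a)
            # (a) direct RL: Q-learning update from real experience
            target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (b) model learning: remember the observed transition
            if (s, a) not in model:
                seen.append((s, a))
            model[(s, a)] = (r, s2, done)
            # (c) planning: replay simulated transitions from the model
            for _ in range(n_planning):
                ps, pa = random.choice(seen)
                pr, ps2, pdone = model[(ps, pa)]
                ptarget = pr + (0.0 if pdone else gamma * max(Q[(ps2, b)] for b in actions))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q
```

The planning step is what distinguishes Dyna-Q from plain Q-learning: the same update rule is applied many extra times per real step, using transitions replayed from the learned model.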

Page 11:

Planning vs Learning: pros and cons

Direct vs Indirect Learning

Reinforcement Learning: An Introduction, Sutton and Barto.

Page 12:

MCTS (Monte Carlo Tree Search)

Page 13:

Monte Carlo Tree Search

Reinforcement Learning: An Introduction, Sutton and Barto.

Page 14:

Monte Carlo Tree Search

https://en.wikipedia.org/wiki/Monte_Carlo_tree_search
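The four MCTS phases (selection, expansion, simulation, backpropagation) can be sketched generically. This is a minimal single-player UCT variant; the game interface (`legal_actions`, `step`, `is_terminal`, `reward`) is a hypothetical abstraction, not from the slides:

```python
import math
import random

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}       # action -> Node
        self.visits = 0
        self.total_reward = 0.0

def mcts(root_state, legal_actions, step, is_terminal, reward,
         n_iters=500, c=1.4):
    """Run n_iters iterations of UCT-style MCTS and return the
    most-visited action at the root."""
    root = Node()

    def select(node, state):
        # 1. Selection: descend via UCB1 while nodes are fully expanded
        while not is_terminal(state) and len(node.children) == len(legal_actions(state)):
            a, child = max(
                node.children.items(),
                key=lambda kv: (kv[1].total_reward / kv[1].visits
                                + c * math.sqrt(math.log(node.visits) / kv[1].visits)))
            node, state = child, step(state, a)
        return node, state

    for _ in range(n_iters):
        node, state = select(root, root_state)
        # 2. Expansion: add one untried action as a new child
        if not is_terminal(state):
            a = random.choice([a for a in legal_actions(state)
                               if a not in node.children])
            child = Node(parent=node)
            node.children[a] = child
            node, state = child, step(state, a)
        # 3. Simulation: random rollout to a terminal state
        while not is_terminal(state):
            state = step(state, random.choice(legal_actions(state)))
        r = reward(state)
        # 4. Backpropagation: update statistics along the path to the root
        while node is not None:
            node.visits += 1
            node.total_reward += r
            node = node.parent

    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```

The UCB1 term balances exploitation (average reward so far) against exploration (a bonus that shrinks as a child accumulates visits).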

Page 15:

The AlphaGo Breakthrough

Page 16:

AlphaGo

2016

Page 17:

AlphaGo

• Go is a perfect-information game
• Compute the optimal value function?
• Tree search complexity ≈ b^d
• b ≈ 250, d ≈ 150
• Exhaustive search is infeasible
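To see why exhaustive search is infeasible, a quick back-of-the-envelope check on the size of Go's game tree:

```python
import math

b, d = 250, 150              # Go's approximate branching factor and game length
# Order of magnitude of b^d, the number of leaves in a full game tree:
magnitude = d * math.log10(b)
print(f"b^d ≈ 10^{magnitude:.0f}")   # roughly 10^360
```

For comparison, the number of atoms in the observable universe is usually estimated at around 10^80, so no amount of hardware makes brute-force search viable.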

Page 18:

AlphaGo: the 4 Neural Networks

Forward propagation: 2 µs (rollout policy) vs 3 ms (policy network)

Accuracy: 24% (rollout policy) vs 57% (policy network)

Mastering the game of Go with deep neural networks and tree search, David Silver et al.

Page 19:

AlphaGo: multiple outputs for the policy, a single output for the value

Mastering the game of Go with deep neural networks and tree search, David Silver et al.

Page 20:

AlphaGo: combining Policy Network and Value Network with MCTS

Mastering the game of Go with deep neural networks and tree search, David Silver et al.
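Inside the search tree, AlphaGo picks actions by combining value estimates Q(s, a) with an exploration bonus u(s, a) scaled by the policy network's prior P(s, a). A minimal sketch of this PUCT-style rule (the input dictionaries and constants are illustrative):

```python
import math

def puct_select(priors, visits, q, c_puct=1.0):
    """Pick argmax_a Q(s,a) + u(s,a), where
    u(s,a) = c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)).
    P comes from the policy network, Q from search-backed value
    estimates; c_puct trades off exploration against exploitation."""
    total = sum(visits.values())
    def score(a):
        u = c_puct * priors[a] * math.sqrt(total) / (1 + visits[a])
        return q[a] + u
    return max(priors, key=score)
```

The bonus decays as an action accumulates visits, so search effort is initially steered by the policy network and progressively shifts toward actions whose backed-up values look best.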

Page 21:

AlphaGo Zero

2017

Page 22:

AlphaGo Zero: Only one Neural Network for Policy and Value
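The single-network idea can be illustrated with a toy two-headed forward pass: one shared trunk feeding a policy head (a distribution over moves) and a value head (a scalar evaluation). Layer sizes here are tiny and illustrative; the real AlphaGo Zero network is a deep residual CNN:

```python
import math
import random

random.seed(0)
N_IN, N_HID, N_MOVES = 8, 6, 4   # illustrative sizes, not AlphaGo Zero's

def mat(rows, cols):
    """Random weight matrix as a list of rows."""
    return [[random.gauss(0, 0.3) for _ in range(cols)] for _ in range(rows)]

W_trunk = mat(N_HID, N_IN)
W_policy = mat(N_MOVES, N_HID)
W_value = mat(1, N_HID)

def forward(x):
    # shared trunk: one hidden representation used by both heads
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_trunk]
    # policy head: softmax over moves
    logits = [sum(w * hi for w, hi in zip(row, h)) for row in W_policy]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    policy = [e / z for e in exps]
    # value head: scalar position evaluation in (-1, 1)
    value = math.tanh(sum(w * hi for w, hi in zip(W_value[0], h)))
    return policy, value
```

Sharing the trunk means both heads are trained from the same self-play data and the same features, which is part of what made the single-network design simpler and stronger than AlphaGo's separate networks.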

Page 23:

AlphaGo Zero: Using MCTS to select moves throughout self-play

Mastering the game of Go without human knowledge, David Silver et al.

Page 24:

Alpha Zero

2018

Page 25:

Alpha Zero: Learning and MCTS do not assume symmetry

AlphaGo and AlphaGo Zero assumed symmetry for:
• Training data augmentation
• Bias removal in Monte Carlo evaluations

Alpha Zero also considers the "draw" outcome.

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, David Silver et al.

Page 26:

Alpha Zero: The Algorithm Architecture

AlphaGo Zero and Alpha Zero:
• One neural network (policy + value)
• Monte Carlo Tree Search

AlphaGo Zero:
• Waits for a training iteration to conclude before updating the NN
• Compares the new policy against the current best

Alpha Zero:
• Updates the NN continuously
• Always generates self-play games with the latest NN
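The difference between the two training regimes can be sketched with stub components. `train_step`, `generate_games`, and `evaluate` are hypothetical stand-ins for the real pipeline; the 55% win-rate gate is the threshold reported in the AlphaGo Zero paper:

```python
def alphago_zero_loop(train_step, evaluate, generate_games, iterations=3):
    """AlphaGo Zero style: train a candidate, then gate it behind an
    evaluation match against the current best network."""
    best = "net_0"
    for _ in range(iterations):
        data = generate_games(best)            # self-play with the BEST net
        candidate = train_step(best, data)
        if evaluate(candidate, best) > 0.55:   # 55% win-rate gate
            best = candidate
    return best

def alphazero_loop(train_step, generate_games, iterations=3):
    """Alpha Zero style: no gating; always self-play with the latest
    network and update it continuously."""
    net = "net_0"
    for _ in range(iterations):
        data = generate_games(net)             # self-play with the LATEST net
        net = train_step(net, data)
    return net
```

Dropping the evaluation gate simplifies the pipeline: Alpha Zero maintains a single network that is always the one generating self-play data, instead of a best-so-far checkpoint plus candidates.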

Page 27:

Results: AlphaGo beats the European Go Champion

Mastering the game of Go with deep neural networks and tree search, David Silver et al.

Page 28:

AlphaGo Zero: better performance than AlphaGo

Mastering the game of Go without human knowledge, David Silver et al.

Page 29:

AlphaGo Zero: Learns human expert moves and beyond

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, David Silver et al.

Page 30:

Alpha Zero: the same program applied to Chess and Shogi

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, David Silver et al.

Page 31:

Conclusions

https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go

Page 32:

Stochastic Planning in Games: an AlphaGo Case-study

Tommaso Macrì