
Mastering the game of Go with deep neural networks and tree search

David Silver et al. from Google DeepMind

Article overview by Ilya Kuzovkin

Reinforcement Learning Seminar, University of Tartu, 2016

THE GAME OF GO

BOARD · STONES · LIBERTIES · GROUPS · CAPTURE · KO · TWO EYES · FINAL COUNT · EXAMPLES

TRAINING

Supervised (SL) policy network p_σ(a|s)

Reinforcement (RL) policy network p_ρ(a|s)

Rollout policy network p_π(a|s)

Value network v_θ(s)

Tree policy network p_τ(a|s)

TRAINING THE BUILDING BLOCKS

SUPERVISED CLASSIFICATION · REINFORCEMENT · SUPERVISED REGRESSION

Supervised (SL) policy network p_σ(a|s)

Architecture:
• 19 x 19 x 48 input
• 1 convolutional layer 5x5 with k=192 filters, ReLU
• 11 convolutional layers 3x3 with k=192 filters, ReLU
• 1 convolutional layer 1x1, ReLU
• Softmax

Training:
• 29.4M positions from games between 6 and 9 dan players
• Augmented with 8 reflections/rotations
• Stochastic gradient ascent, learning rate α = 0.003, halved every 80M steps
• Batch size m = 16
• 3 weeks on 50 GPUs to make 340M steps
• Test set (1M positions) accuracy: 57.0%
• 3 ms to select an action
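To make the layer list concrete, here is a minimal sketch of that 13-layer stack, assuming PyTorch and zero padding that preserves the 19x19 board (the framework and padding choice are assumptions, not details from the slides):

```python
import torch
import torch.nn as nn

class SLPolicyNet(nn.Module):
    """Sketch of the SL policy network layer stack listed above."""
    def __init__(self, in_planes=48, k=192):
        super().__init__()
        layers = [nn.Conv2d(in_planes, k, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(11):
            layers += [nn.Conv2d(k, k, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(k, 1, kernel_size=1)]  # 1x1 conv -> one logit per board point
        self.trunk = nn.Sequential(*layers)

    def forward(self, x):                    # x: (batch, 48, 19, 19) feature planes
        logits = self.trunk(x).flatten(1)    # (batch, 361)
        return torch.softmax(logits, dim=1)  # p_sigma(a|s) over board points

# e.g. probs = SLPolicyNet()(torch.zeros(16, 48, 19, 19))  # batch size m = 16
```

Illegal moves would still need to be masked out before sampling; that bookkeeping is omitted here.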

Rollout policy p_π(a|s)
• Supervised: same data as p_σ(a|s)
• Less accurate: 24.2% (vs. 57.0%)
• Faster: 2 μs per action (1,500 times faster)
• Just a linear model with softmax

Tree policy p_τ(a|s)
• "similar to the rollout policy but with more features"
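Since the rollout policy is "just a linear model with softmax", choosing a move reduces to one dot product per candidate. A sketch under that assumption (the actual pattern features and weights are not given on the slides):

```python
import numpy as np

def rollout_policy(features, weights):
    """Linear softmax over legal moves, i.e. the form of the fast rollout policy.

    `features`: (num_legal_moves, num_features) pattern features per candidate move.
    `weights`:  learned weight vector of length num_features.
    """
    scores = features @ weights           # one linear score per legal move
    scores -= scores.max()                # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()            # softmax distribution over moves

# Usage with made-up shapes: pick one move during a rollout.
rng = np.random.default_rng(0)
feats = rng.integers(0, 2, (12, 100)).astype(float)
w = rng.normal(size=100)
move = rng.choice(12, p=rollout_policy(feats, w))
```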

Reinforcement (RL) policy network p_ρ(a|s)

• Same architecture as p_σ(a|s); weights are initialized with ρ = σ
• Self-play: current network vs. a randomized pool of previous versions
• Play a game until the end, get the reward z_t = ±r(s_T) = ±1
• Set z^i_t = z_t and play the same game again, this time updating the network parameters at each time step t
• The baseline v(s^i_t) = …
  ‣ 0 "on the first pass through the training pipeline"
  ‣ v_θ(s^i_t) "on the second pass"
• Batch size n = 128 games, 10,000 batches
• One day on 50 GPUs
• 80% wins against the supervised network
• 85% wins against Pachi (no search yet!)
• 3 ms to select an action
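The update behind this procedure is the REINFORCE policy gradient from the paper, Δρ ∝ Σ_t ∂log p_ρ(a_t|s_t)/∂ρ · (z_t − v(s_t)). A minimal sketch of one such step, assuming PyTorch; the data-handling names are illustrative, not the authors' code:

```python
import torch

def reinforce_step(optimizer, games):
    """One policy-gradient step over a minibatch of replayed self-play games.

    Each game is a (log_probs, outcomes, baselines) triple of tensors holding
    log p_rho(a_t|s_t) for the moves played, the result z_t = +/-1 from the
    perspective of the player to move at step t, and the baseline v(s_t)
    (zeros on the first pass, value-network estimates on the second).
    """
    optimizer.zero_grad()
    loss = torch.zeros(())
    for log_probs, outcomes, baselines in games:
        advantage = outcomes - baselines          # (z_t - v(s_t)) per time step
        # Ascent on sum_t log p(a_t|s_t) * advantage == descent on its negative.
        loss = loss - (log_probs * advantage).sum()
    (loss / len(games)).backward()
    optimizer.step()
```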

Value network v_θ(s)

Architecture:
• 19 x 19 x 49 input
• 1 convolutional layer 5x5 with k=192 filters, ReLU
• 11 convolutional layers 3x3 with k=192 filters, ReLU
• 1 convolutional layer 1x1, ReLU
• Fully connected layer, 256 ReLU units
• Fully connected layer, 1 tanh unit

Training:
• Evaluate the value of position s under policy p: v^p(s) = E[z_t | s_t = s, a_{t…T} ~ p]
• Double approximation: v_θ(s) ≈ v^{p_ρ}(s) ≈ v*(s)
• Stochastic gradient descent to minimize the MSE between v_θ(s) and the outcome z
• Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play:
  ‣ choose a random time step u
  ‣ sample moves t = 1…u−1 from the SL policy
  ‣ make a random move u
  ‣ sample t = u+1…T from the RL policy and get the game outcome z
  ‣ add the pair (s_u, z_u) to the training set
• One week on 50 GPUs to train on 50M batches of size m = 32
• MSE on the test set: 0.234
• Close to MC estimation from the RL policy, but 15,000 times faster
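The value network reuses the policy network's convolutional trunk but ends in a scalar; in the paper the extra 49th input plane encodes the colour to play. A sketch of this architecture, again assuming PyTorch and size-preserving padding:

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Sketch of the value network layer stack listed above."""
    def __init__(self, in_planes=49, k=192):
        super().__init__()
        convs = [nn.Conv2d(in_planes, k, 5, padding=2), nn.ReLU()]
        for _ in range(11):
            convs += [nn.Conv2d(k, k, 3, padding=1), nn.ReLU()]
        convs += [nn.Conv2d(k, 1, 1), nn.ReLU()]
        self.trunk = nn.Sequential(*convs)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(19 * 19, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Tanh(),        # v_theta(s) in [-1, 1]
        )

    def forward(self, x):                        # x: (batch, 49, 19, 19)
        return self.head(self.trunk(x))

# Objective from the slide: minimize the MSE between the prediction and the
# self-play outcome z, e.g. nn.functional.mse_loss(ValueNet()(states), z).
```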

PLAYING

APV-MCTS: ASYNCHRONOUS POLICY AND VALUE MCTS

Each node s has edges (s, a) for all legal actions, and each edge stores statistics:
• Prior P(s, a)
• Number of value-network evaluations N_v(s, a)
• Number of rollouts N_r(s, a)
• MC value estimate W_v(s, a)
• Rollout value estimate W_r(s, a)
• Combined mean action value Q(s, a)
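A compact way to picture this bookkeeping (the symbols follow the paper; the data structure itself is an illustrative sketch):

```python
from dataclasses import dataclass

@dataclass
class EdgeStats:
    """Statistics stored for one edge (s, a) in APV-MCTS."""
    P: float = 0.0    # prior probability of the move
    Nv: int = 0       # number of value-network evaluations
    Nr: int = 0       # number of rollouts
    Wv: float = 0.0   # accumulated value-network estimates
    Wr: float = 0.0   # accumulated rollout outcomes
    Q: float = 0.0    # combined mean action value

# A node is then just a map from legal moves to their edge statistics, e.g.
# node = {move: EdgeStats(P=prior) for move, prior in priors.items()}
```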

Selection: each simulation starts at the root and stops at time L, when a leaf (unexplored state) is found. Position s_L is added to the evaluation queue. A bunch of nodes is selected for evaluation…
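The slides do not reproduce the selection rule itself; in the paper each in-tree step picks argmax_a [Q(s, a) + u(s, a)], with a bonus u(s, a) ∝ P(s, a)·√(Σ_b N_r(s, b)) / (1 + N_r(s, a)). A sketch of that rule over the EdgeStats above (c_puct and the helper name are illustrative):

```python
import math

def select_move(edges, c_puct=5.0):
    """Pick the in-tree move maximizing Q(s,a) + u(s,a) at one node.

    `edges` maps each legal move to its EdgeStats; the exploration bonus
    decays as the edge's rollout count N_r grows, so the search gradually
    concentrates on moves with a high combined value Q.
    """
    total = sum(e.Nr for e in edges.values())
    def score(e):
        u = c_puct * e.P * math.sqrt(max(total, 1)) / (1 + e.Nr)
        return e.Q + u
    return max(edges, key=lambda a: score(edges[a]))
```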

Evaluation: node s_L is evaluated using the value network to obtain v_θ(s_L), and using rollout simulation with policy p_π(a|s) until the end of each simulated game to get the final game score. Once each leaf is evaluated, we are ready to propagate updates.

Backup: statistics along the path of each simulation are updated during the backward pass through t < L; visit counts are updated as well. Finally, the overall evaluation of each visited state-action edge is updated, and the current tree is updated.
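In code form, the backward pass for one simulation might look like the sketch below; λ = 0.5 is the mixing constant reported in the paper, and sign handling (whose perspective the outcome is scored from at each ply) is left out for brevity:

```python
def backup(path, v_leaf, z_rollout, lam=0.5):
    """Update every edge on one simulation path after its leaf is evaluated.

    `path` is the list of EdgeStats traversed from the root to s_L, `v_leaf`
    is the value-network estimate v_theta(s_L), and `z_rollout` is the final
    score of the fast-policy rollout played out from s_L.
    """
    for edge in path:
        edge.Nv += 1
        edge.Wv += v_leaf        # accumulate the value-network signal
        edge.Nr += 1
        edge.Wr += z_rollout     # accumulate the rollout signal
        # Combined mean action value mixes the two evaluations.
        edge.Q = (1 - lam) * edge.Wv / edge.Nv + lam * edge.Wr / edge.Nr
```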

Expansion: once an edge (s, a) has been visited enough times (n_thr), the successor position s' is included in the tree. Its priors are initialized using the tree policy p_τ(a|s') and later updated with the SL policy p_σ(a|s'). The tree is expanded, fully updated, and ready for the next move!
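A sketch of that expansion step, reusing the EdgeStats above (the threshold value, the tree container, and the asynchronous prior replacement are not spelled out on the slides, so they appear only as parameters and comments):

```python
def maybe_expand(tree, edge_key, s_prime, tree_policy, n_thr):
    """Add the successor position s' to the tree once (s, a) is visited enough.

    `tree.edges[edge_key]` is the EdgeStats of the traversed edge, and
    `tree_policy(s_prime)` returns p_tau(a|s') over the legal moves of s'.
    """
    edge = tree.edges[edge_key]
    if edge.Nr >= n_thr and s_prime not in tree.nodes:
        priors = tree_policy(s_prime)
        tree.nodes[s_prime] = {a: EdgeStats(P=p) for a, p in priors.items()}
        # Later, once the SL network has evaluated s' on the GPU, these priors
        # are overwritten with p_sigma(a|s').
```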


WINNING

https://www.youtube.com/watch?v=oRvlyEpOQ-8