Transcript of "Mastering the game of Go with deep neural networks and tree search", an article overview by Ilya Kuzovkin (ikuz.eu)

Page 1:

Article overview by Ilya Kuzovkin

David Silver et al. from Google DeepMind

Reinforcement Learning Seminar University of Tartu, 2016

Mastering the game of Go with deep neural networks and tree search

Page 2:

THE GAME OF GO

Pages 3-13:

The rules of Go, built up one concept per slide:

BOARD

STONES

GROUPS

LIBERTIES

CAPTURE

KO

EXAMPLES

TWO EYES

FINAL COUNT

Page 14:

TRAINING

Page 15:

TRAINING THE BUILDING BLOCKS

Supervised (SL) policy network p_σ(a|s): supervised classification

Reinforcement (RL) policy network p_ρ(a|s): reinforcement learning

Rollout policy p_π(a|s): supervised classification

Tree policy p_τ(a|s): supervised classification

Value network v_θ(s): supervised regression

Pages 16-20:

Supervised (SL) policy network p_σ(a|s)

Architecture:
• 19 x 19 x 48 input
• 1 convolutional layer 5x5 with k=192 filters, ReLU
• 11 convolutional layers 3x3 with k=192 filters, ReLU
• 1 convolutional layer 1x1, ReLU
• Softmax over the 19 x 19 board points

Training:
• 29.4M positions from games between 6 and 9 dan players
• Augmented with 8 reflections/rotations
• Stochastic gradient ascent, learning rate α = 0.003, halved every 80M steps
• Batch size m = 16
• 3 weeks on 50 GPUs to make 340M steps
• Test set (1M positions) accuracy: 57.0%
• 3 ms to select an action
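A minimal sketch of this architecture, assuming PyTorch (not used in the article); the padding values are chosen to keep the 19x19 spatial size, and the ReLU after the final 1x1 convolution is kept exactly as listed on the slide.

```python
import torch
import torch.nn as nn

class SLPolicyNet(nn.Module):
    """Sketch of the SL policy network p_sigma(a|s): 19x19x48 input,
    one 5x5 conv, eleven 3x3 convs (k=192 filters each), a 1x1 conv,
    and a softmax over the 361 board points."""

    def __init__(self, planes: int = 48, k: int = 192):
        super().__init__()
        layers = [nn.Conv2d(planes, k, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(11):
            layers += [nn.Conv2d(k, k, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(k, 1, kernel_size=1), nn.ReLU()]  # ReLU as on the slide
        self.body = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 48, 19, 19) -> move probabilities over the 361 points
        logits = self.body(x).flatten(start_dim=1)
        return torch.softmax(logits, dim=1)

# Example: probabilities for one random position sum to 1.
probs = SLPolicyNet()(torch.randn(1, 48, 19, 19))
```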

Pages 21-24:

19 x 19 x 48 INPUT

[Slides illustrating the 48 input feature planes that encode the board position at each of the 19 x 19 intersections.]

Pages 25-28:

Rollout policy p_π(a|s)
• Supervised, trained on the same data as p_σ(a|s)
• Less accurate: 24.2% (vs. 57.0%)
• Much faster: 2 μs per action (about 1500 times faster)
• Just a linear model with a softmax

Tree policy p_τ(a|s)
• "similar to the rollout policy but with more features"
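A minimal sketch of what a linear-softmax rollout policy looks like, using NumPy; the feature matrix stands in for the hand-crafted pattern features described in the paper and is purely illustrative.

```python
import numpy as np

def rollout_policy(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Linear softmax over legal moves.

    features: (num_legal_moves, num_features) hand-crafted features per move
              (illustrative only; the real feature set is defined in the paper).
    weights:  (num_features,) learned linear weights.
    """
    scores = features @ weights          # one linear score per legal move
    scores -= scores.max()               # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

# Example: sample a move among 10 legal moves described by 100 features.
rng = np.random.default_rng(0)
p = rollout_policy(rng.normal(size=(10, 100)), rng.normal(size=100))
move = rng.choice(10, p=p)
```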

Pages 29-35:

Reinforcement (RL) policy network p_ρ(a|s)

• Same architecture as the SL policy network; the weights ρ are initialized with σ
• Self-play: the current network plays against a randomized pool of its previous versions
• Play a game until the end and get the reward z_t = ±r(s_T) = ±1
• Set z^i_t = z_t and play the same game again, this time updating the network parameters at each time step t
• v(s^i_t) = …
  ‣ 0 "on the first pass through the training pipeline"
  ‣ v_θ(s^i_t) "on the second pass"
• Batch size n = 128 games
• 10,000 batches
• One day on 50 GPUs
• 80% wins against the SL network
• 85% wins against Pachi (no search yet!)
• 3 ms to select an action
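A hedged sketch of the self-play update described above, assuming PyTorch and a policy module shaped like the SL network; the tensor shapes and the name `reinforce_update` are illustrative, not from the article.

```python
import torch

def reinforce_update(policy, optimizer, states, actions, z, baselines):
    """One policy-gradient step over a finished self-play game.

    states:    (T, 48, 19, 19) board features for every time step t
    actions:   (T,) long tensor with the moves actually played
    z:         scalar game outcome, +1 or -1, shared by all steps of the game
    baselines: (T,) tensor: 0 on the first pass, v_theta(s_t) on the second pass
    """
    probs = policy(states)                                   # (T, 361)
    logp = torch.log(probs.gather(1, actions.view(-1, 1)).squeeze(1) + 1e-8)
    advantage = (z - baselines).detach()                     # z_t - v(s_t)
    loss = -(advantage * logp).mean()                        # ascent on the expected return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```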

Page 36:

Value network v_θ(s)

Pages 37-43:

Value network v_θ(s)

Architecture:
• 19 x 19 x 49 input (the 48 policy-network planes plus one plane for the colour to play)
• 1 convolutional layer 5x5 with k=192 filters, ReLU
• 11 convolutional layers 3x3 with k=192 filters, ReLU
• 1 convolutional layer 1x1, ReLU
• Fully connected layer, 256 ReLU units
• Fully connected layer, 1 tanh unit

Training:
• Evaluate the value of position s under policy p: v^p(s) = E[z_t | s_t = s, a_t…T ~ p]
• Double approximation: v_θ(s) ≈ v^{p_ρ}(s) ≈ v*(s)
• Stochastic gradient descent to minimize the MSE between v_θ(s) and the game outcome z
• Train on 30M state-outcome pairs (s, z), each from a unique game generated by self-play:
  ‣ choose a random time step u
  ‣ sample moves t = 1…u−1 from the SL policy p_σ
  ‣ make a random legal move at step u
  ‣ sample moves t = u+1…T from the RL policy p_ρ and get the game outcome z
  ‣ add the pair (s_u, z_u) to the training set
• One week on 50 GPUs to train on 50M batches of size m = 32
• MSE on the test set: 0.234
• Close to Monte Carlo estimation with the RL policy, but 15,000 times faster
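A control-flow sketch of the self-play data-generation recipe above; `sl_policy_move`, `rl_policy_move`, `random_legal_move` and the `game` object are hypothetical helpers, not names from the article.

```python
import random

def generate_value_example(game, sl_policy_move, rl_policy_move,
                           random_legal_move, max_len: int = 450):
    """Produce one (state, outcome) training pair for the value network."""
    u = random.randint(1, max_len)
    t = 1
    while not game.finished() and t < u:
        game.play(sl_policy_move(game.state()))     # moves 1 .. u-1 from p_sigma
        t += 1
    if game.finished():
        return None                                 # game ended before step u, discard
    game.play(random_legal_move(game.state()))      # one random legal move at step u
    state = game.state()                            # the recorded position s_u
    while not game.finished():
        game.play(rl_policy_move(game.state()))     # moves u+1 .. T from p_rho
    return state, game.outcome()                    # outcome z in {-1, +1}
```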

Page 44:

Page 45:

PLAYING

Page 46:

APV-MCTS: ASYNCHRONOUS POLICY AND VALUE MCTS

Page 47:

APV-MCTS

Each node s has edges (s, a) for all legal actions, and each edge stores the statistics:
• Prior P(s, a)
• Number of evaluations N_v(s, a)
• Number of rollouts N_r(s, a)
• MC value estimate W_v(s, a)
• Rollout value estimate W_r(s, a)
• Combined mean action value Q(s, a)

Pages 48-50: Selection

Simulation starts at the root and stops at time L, when a leaf (an unexplored state) is found. The position s_L is added to the evaluation queue. A batch of leaf nodes is selected for evaluation in this way…
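A minimal sketch of the per-edge statistics and a selection rule of the form argmax_a [Q(s, a) + u(s, a)]; the exploration bonus u(s, a) ∝ P(s, a) / (1 + N_r(s, a)) follows the paper, while the constant c_puct used here is an assumption.

```python
from dataclasses import dataclass
import math

@dataclass
class Edge:
    """Statistics stored for every edge (s, a) of the search tree."""
    P: float = 0.0   # prior probability from the SL policy
    Nv: int = 0      # number of value-network evaluations
    Nr: int = 0      # number of rollouts
    Wv: float = 0.0  # accumulated value-network estimates
    Wr: float = 0.0  # accumulated rollout outcomes
    Q: float = 0.0   # combined mean action value

def select_action(edges: dict, c_puct: float = 5.0):
    """During the descent, pick the action maximizing Q(s, a) + u(s, a)."""
    total = sum(e.Nr for e in edges.values()) + 1
    def score(e: Edge) -> float:
        u = c_puct * e.P * math.sqrt(total) / (1 + e.Nr)   # exploration bonus
        return e.Q + u
    return max(edges, key=lambda a: score(edges[a]))
```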

Pages 51-54: Evaluation

Each leaf node s_L is evaluated in two ways: with the value network, to obtain v_θ(s_L), and with a rollout simulation using the fast rollout policy p_π(a|s) played until the end of the simulated game, to get the final game score. Once each leaf is evaluated, we are ready to propagate the updates.
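A one-line sketch of how the two evaluations can be mixed into a single leaf value; the mixing weight λ is not given on these slides (the paper reports λ = 0.5), so it is an assumption here.

```python
def leaf_value(v_theta: float, rollout_outcome: float, lam: float = 0.5) -> float:
    """V(s_L) = (1 - lambda) * v_theta(s_L) + lambda * z_L."""
    return (1.0 - lam) * v_theta + lam * rollout_outcome
```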

Pages 55-59: Backup

Statistics along the path of each simulation are updated during the backward pass through t < L: the accumulated value estimates are updated, the visit counts are updated as well, and finally the overall evaluation Q(s, a) of each visited state-action edge is updated. The current tree is now up to date.
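A sketch of the backward pass, reusing the Edge dataclass from the selection sketch above; the combined Q mixes the two running means with the same weight λ as the leaf evaluation.

```python
def backup(path, v_theta: float, z: float, lam: float = 0.5) -> None:
    """Update every edge visited on the way to the leaf (t < L).

    path:    list of Edge objects along the simulated path
    v_theta: value-network evaluation of the leaf
    z:       outcome of the rollout started from the leaf
    """
    for edge in path:
        edge.Nv += 1
        edge.Wv += v_theta   # accumulate the value-network estimate
        edge.Nr += 1
        edge.Wr += z         # accumulate the rollout outcome
        edge.Q = (1 - lam) * edge.Wv / edge.Nv + lam * edge.Wr / edge.Nr
```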

Pages 60-63: Expansion

Once an edge (s, a) has been visited enough times (more than the threshold n_thr), its successor state s' is added to the tree. The new node is initialized using the tree policy, giving p_τ(a|s'), and later updated with the SL policy p_σ(a|s'). The tree is expanded, fully updated, and ready for the next move!

Page 64:

Pages 65-66:

WINNING

https://www.youtube.com/watch?v=oRvlyEpOQ-8