Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge...

19
Mastering the game of Go without human knowledge David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel , Demis Hassabis DeepMind London Nature / Presenter: Ji Gao David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthe Mastering the game of Go without human knowledge Nature / Presenter: Ji Gao 1 / 19

Transcript of Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge...

Page 1: Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Mastering the game of Go without human knowledge

David Silver, Julian Schrittwieser, Karen Simonyan, IoannisAntonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker,

Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui,Laurent Sifre, George van den Driessche, Thore Graepel , Demis

Hassabis

DeepMind London

Nature / Presenter: Ji Gao

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel , Demis Hassabis ( DeepMind London)Mastering the game of Go without human knowledgeNature / Presenter: Ji Gao 1 / 19

Page 2: Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Outline

1 OverviewAlphaGoRules of go

2 Method OverviewPlayTrain

3 MethodNetworkMonte Carlo Tree SearchReinforcement Learning

4 Experiment

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel , Demis Hassabis ( DeepMind London)Mastering the game of Go without human knowledgeNature / Presenter: Ji Gao 2 / 19

Page 3: Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Alphago

AlphaGo 4:1 World Champion Lee Sedol 9-Dan

March 9 - March 15, 2016

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel , Demis Hassabis ( DeepMind London)Mastering the game of Go without human knowledgeNature / Presenter: Ji Gao 3 / 19

Page 4: Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Alphago Master

Won 60 pros online

3-0 World Champion Ke Jie May 23, 2017 - May 27, 2017

Ke Jie: “ Alphago is like the god of go” “Future belongs to AI”

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel , Demis Hassabis ( DeepMind London)Mastering the game of Go without human knowledgeNature / Presenter: Ji Gao 4 / 19

Page 5: Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Alphago Zero (This paper)

The second Alphago paper

Mastering the game of Go without human knowledge

100 - 0 Alphago Lee

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel , Demis Hassabis ( DeepMind London)Mastering the game of Go without human knowledgeNature / Presenter: Ji Gao 5 / 19

Page 6: Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Go

Start with 19x19 empty board

One player take black stones and the other take white stones

Two players take turns to put stones on the board

Rules:

If one connected part is completely surrounded by the opponentsstones, remove it from the board.Ko rule: Forbids a board play to repeat a board position

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel , Demis Hassabis ( DeepMind London)Mastering the game of Go without human knowledgeNature / Presenter: Ji Gao 6 / 19

Page 7: Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Go

End when theres no valuable moves on the board.

Count the territory of both players.

Add 7.5 points to whites points (called Komi).

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel , Demis Hassabis ( DeepMind London)Mastering the game of Go without human knowledgeNature / Presenter: Ji Gao 7 / 19

Page 8: Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Method Overview - Play

How do AlphaGo play go?

The answer: CNN + MCTS

Deep Convolutional Neural Network: (p, v) = fθ(s). p is theprobability of selecting a move(including pass), and v is the winningprobability.

Previous version: Two networks, one for p and one for v. Thisversion: One network for both.

Monte Carlo Tree Search: Further improve the move based on theneural network.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel , Demis Hassabis ( DeepMind London)Mastering the game of Go without human knowledgeNature / Presenter: Ji Gao 8 / 19

Page 9: Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Method Overview - Train

Training of the network: Reinforcement Learning

At iteration i:

Generate a bunch games.At time-step t of a game, an MCTS search πt = αθi−1(st) is executedusing the parameter of fθi−1 , and a move is played by sampling.The game is played until the end time T, record a final rewardrT = {−1,+1}.Experience replay: store (st , πt , zt) in the library, while zt = ±rTSample datas from the experience library, and train the network θ.Loss:

l = (z − v)2 − πT log p + c ||θ||2

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel , Demis Hassabis ( DeepMind London)Mastering the game of Go without human knowledgeNature / Presenter: Ji Gao 9 / 19

Page 10: Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Network structure

Input: 19x19x17. 17 binary channels instead of previous version 48channels.16 channels for the condition of board on last 16 moves (1 for havinga stone of the color, and 0 for not). 1 channel for current colorplaying.Use ResNet. Either 20 blocks and 40 blocks.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel , Demis Hassabis ( DeepMind London)Mastering the game of Go without human knowledgeNature / Presenter: Ji Gao 10 / 19

Page 11: Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Monte Carlo Tree Search

Each edge (s, a) in the search tree store a prior probability p(s, a), avisit count N(s, a) and action value Q(s, a).

Each step, find a leaf node by maximizing an upper confidence bound.

Expand the leaf node using only the network information.

Update N and Q for every edge on the path.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel , Demis Hassabis ( DeepMind London)Mastering the game of Go without human knowledgeNature / Presenter: Ji Gao 11 / 19

Page 12: Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Monte Carlo Tree Search

3+1 steps:

Select: At every node st , choose at = arg maxa(Q(st , a) + U(st , a)).

Where U(s, a) = cpuctP(s, a)

√∑b N(s,b)

1+N(s,a)

Expand and evaluate: Select a random i in [1..8], make a rotation orreflection of the board, evaluate the current board using the network.The leaf node sL is expanded, and each edge (sL, a) is initialized as[N(sL, a) = 0,Q(sL, a) = 0,W (sL, a) = 0,P(sL, a) = pa]Backup: Every edge on the path N(s, a) = N(s, a) + 1,

W (s, a) = W (s, a) + v , Q(s, a) = W (s,a)N(s,a)

Play: In training, π(a|s0) = N(s0,a)1/τ∑b N(s0,b)1/τ

. τ → 0 in real play.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel , Demis Hassabis ( DeepMind London)Mastering the game of Go without human knowledgeNature / Presenter: Ji Gao 12 / 19

Page 13: Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Self play

In each iteration, play 25000 games.

Each move, use 1600 MCTS simulations (around 0.4s).

For the first 30 moves, τ = 1 to encourage exploration. For thereminder of the game τ → 0.

Add a Dirichlet noise to the prior to further encourage exploration:P(s, a) = (1− ε)pa + εηa, where η ∼ Dir(0.03) and ε = 0.25.

To save calculation, define a resign rate. Automatically ensure thefalse positives are below 5% (run a subset with 10% data with noresign to decide the threshold).

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel , Demis Hassabis ( DeepMind London)Mastering the game of Go without human knowledgeNature / Presenter: Ji Gao 13 / 19

Page 14: Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Checkpoint

On each checkpoint, evaluate the current network

Compare the current network with previous network, using thewinning rate of 400 games.

Update the network only if the winning rate is larger than 55%

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel , Demis Hassabis ( DeepMind London)Mastering the game of Go without human knowledgeNature / Presenter: Ji Gao 14 / 19

Page 15: Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Summary

Play by RNN and MCTS

Train the RNN by MCTS based RL

Major difference to previous versions:

No human knowledge. Previous version contains a policy networklearned from human plays.Combine 2 networks into a single one.Greatly reduce the number of feature channels.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel , Demis Hassabis ( DeepMind London)Mastering the game of Go without human knowledgeNature / Presenter: Ji Gao 15 / 19

Page 16: Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Go learned

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel , Demis Hassabis ( DeepMind London)Mastering the game of Go without human knowledgeNature / Presenter: Ji Gao 16 / 19

Page 17: Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Different structures

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel , Demis Hassabis ( DeepMind London)Mastering the game of Go without human knowledgeNature / Presenter: Ji Gao 17 / 19

Page 18: Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Cost

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel , Demis Hassabis ( DeepMind London)Mastering the game of Go without human knowledgeNature / Presenter: Ji Gao 18 / 19

Page 19: Mastering the game of Go without human knowledgeMastering the game of Go without human knowledge David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,

Supervised Training

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel , Demis Hassabis ( DeepMind London)Mastering the game of Go without human knowledgeNature / Presenter: Ji Gao 19 / 19