Mastering the game of go with deep neural networks and tree search
05/03/2023 1
Mastering the game of Go with deep neural networks and tree search
Speaker: San-Feng Chang
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D.
Nature, 529(7587):484–489, 2016.
Outline
• AI in Game Playing
• Previous Work of Go Research
• Architecture of AlphaGo
• AlphaGo's methods
• The playing strength of AlphaGo
• Conclusion
AI in Game Playing(1/3)
• Game-playing is a specific problem to measure the performance of an AI.
• One classification for the outcomes of an AI test is:
  – Optimal: it is not possible to perform better
  – Strong super-human: performs better than all humans
  – Super-human: performs better than most humans
  – Sub-human: performs worse than most humans
AI in Game Playing(2/3)
Game  | Players                      | Branching factor | Depth (length) | Complexity
Chess | Deep Blue vs Kasparov (1997) | 35               | 80             | 35^80 ≈ 10^123
Go    | AlphaGo vs Lee Sedol (2016)  | 250              | 150            | 250^150 ≈ 10^360
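The two complexity entries can be verified with one line of arithmetic. A tiny helper (written for this note, not from the paper):

```python
import math

# Quick check of the complexity column: a game tree with branching factor b
# and depth d has roughly b^d positions; this computes the base-10 exponent.

def tree_size_exponent(branching, depth):
    """Return x such that branching**depth == 10**x."""
    return depth * math.log10(branching)
```

This gives about 123.5 for chess (35^80) and 359.7 for Go (250^150), matching the rounded 10^123 and 10^360 in the table.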
Evolution of game-tree search:
Brute force → Minimax & alpha-beta → MCTS → AlphaGo's method
AI in Game Playing(3/3)
• Minimax & Alpha-Beta Pruning
  – Even with pruning, the complexity is still too high for Go.
https://upload.wikimedia.org/wikipedia/commons/thumb/9/91/AB_pruning.svg/1280px-AB_pruning.svg.png?1458451165542
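The pruning idea can be sketched in a few lines. This is a minimal, self-contained Python sketch of minimax with alpha-beta pruning; `TreeState` is a toy stand-in for a real game, not anything from AlphaGo.

```python
# Minimal sketch of minimax with alpha-beta pruning. `TreeState` is a toy
# stand-in for a real game state (a nested-list game tree).

class TreeState:
    """Internal nodes are lists of children; leaves are numeric scores."""
    def __init__(self, node):
        self.node = node

    def legal_moves(self):
        return list(range(len(self.node))) if isinstance(self.node, list) else []

    def play(self, move):
        return TreeState(self.node[move])

    def evaluate(self):
        return self.node

def alphabeta(state, depth, alpha=float("-inf"), beta=float("inf"),
              maximizing=True):
    """Return the minimax value of `state`, skipping branches that cannot
    change the final decision."""
    moves = state.legal_moves()
    if depth == 0 or not moves:
        return state.evaluate()
    if maximizing:
        value = float("-inf")
        for m in moves:
            value = max(value, alphabeta(state.play(m), depth - 1,
                                         alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:   # beta cut-off: the minimizer avoids this line
                break
        return value
    else:
        value = float("inf")
        for m in moves:
            value = min(value, alphabeta(state.play(m), depth - 1,
                                         alpha, beta, True))
            beta = min(beta, value)
            if beta <= alpha:   # alpha cut-off
                break
        return value
```

On the tree [[3, 5], [2, 9]] the maximizer's value is 3, and the leaf 9 is never evaluated. In the best case pruning cuts the effective branching factor roughly to its square root, still far too little for Go's 250^150 tree.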
Previous Work of Go Research (1/4)
• Monte Carlo rollouts search to maximum depth without branching at all, by sampling long sequences of actions for both players from a policy p.
• Monte Carlo tree search (MCTS) uses Monte Carlo rollouts to estimate the value of each state in a search tree.
Previous Work of Go Research (2/4)
• Monte Carlo Tree Search:
  (Diagram: Selection walks down the tree of win/visit counts, e.g. 2/3, then 1/2, then 0/1; Expansion then adds a new 0/0 child node for an untried move. Tree levels alternate between Player 1 and Player 2.)
Previous Work of Go Research (3/4)
• Monte Carlo Tree Search:
  (Diagram: Simulation plays a random game out from the newly expanded node; Back-Propagation then updates the win/visit counts along the path, e.g. 2/3 becomes 3/4 and 1/2 becomes 2/3. Tree levels alternate between Player 1 and Player 2.)
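The four phases (selection, expansion, simulation, back-propagation) can be sketched end to end. This is a generic UCT sketch on a toy single-pile Nim game (take 1 or 2 stones, taking the last stone wins); the game and all names are illustrative, not from AlphaGo.

```python
import math
import random

def moves(pile):
    """Legal moves in toy Nim: take 1 or 2 stones."""
    return [m for m in (1, 2) if m <= pile]

def rollout(pile, rng):
    """Simulation: random playout; True if the player who just created
    this position wins (taking the last stone wins)."""
    creator_wins = True          # pile == 0: the creator took the last stone
    while pile > 0:
        pile -= rng.choice(moves(pile))
        creator_wins = not creator_wins
    return creator_wins

class Node:
    def __init__(self, pile, parent=None):
        self.pile, self.parent = pile, parent
        self.children = {}       # move -> Node
        self.wins, self.visits = 0, 0

    def ucb1(self, c=1.4):
        return (self.wins / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(root_pile, iterations=3000, seed=1):
    rng = random.Random(seed)
    root = Node(root_pile)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes via UCB1
        while node.children and len(node.children) == len(moves(node.pile)):
            node = max(node.children.values(), key=Node.ucb1)
        # 2. Expansion: add one untried child
        untried = [m for m in moves(node.pile) if m not in node.children]
        if untried:
            m = rng.choice(untried)
            node.children[m] = Node(node.pile - m, parent=node)
            node = node.children[m]
        # 3. Simulation: random playout from the new node
        result = rollout(node.pile, rng)
        # 4. Back-propagation: update win/visit counts along the path,
        # flipping the result at each level (players alternate)
        while node is not None:
            node.visits += 1
            node.wins += result
            result = not result
            node = node.parent
    # Final move choice: the most-visited child of the root
    return max(root.children, key=lambda m: root.children[m].visits)
```

With a pile of 4 the winning strategy is to leave a multiple of 3, and the search converges on taking 1 stone; the win/visit bookkeeping here mirrors the 2/3 → 3/4 updates in the diagram.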
Previous Work of Go Research (4/4)
• The strongest current Go programs are based on MCTS, enhanced by policies that are trained to predict human expert moves.
• However, prior work has been limited to shallow policies or value functions based on a linear combination of input features.
Architecture of AlphaGo
Neural Network Training Pipeline
• s: board position; a: legal moves
• p(a|s): probability distribution over moves; v(s): scalar value
Two Brains
• Human expert dataset: KGS server, ~160,000 games, 29.4 million positions
Convolution Neural Network(1/2)
A regular 3-layer neural network vs. a convolutional neural network
• Input volume of size W1 x H1 x D1
• Requires four hyperparameters:
  1. Number of filters K (depth)
  2. Spatial extent F (kernel size)
  3. The stride S
  4. The amount of zero padding P
• Output volume of size W2 x H2 x D2:
  – W2 = (W1 – F + 2P)/S + 1
  – H2 = (H1 – F + 2P)/S + 1
  – D2 = K
• Parameter sharing: total weights = (F * F * D1) * K
http://cs231n.github.io/convolutional-networks/
Convolution Neural Network(2/2)
http://cs231n.github.io/convolutional-networks/
• Number of filters K: 2
• Spatial extent F: 3 x 3
• Stride S: 2
• Zero padding P: 1
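These settings can be checked against the output-size formulas with a small helper (written for this note, not a library call), assuming the 5 x 5 x 3 input volume used in the cs231n demo:

```python
# Check the cs231n demo numbers against the output-size formulas:
# W2 = (W1 - F + 2P)/S + 1, H2 = (H1 - F + 2P)/S + 1, D2 = K.

def conv_output_shape(w1, h1, d1, k, f, s, p):
    """Output volume (W2, H2, D2) of one convolutional layer."""
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    return (w2, h2, k)

def conv_weights(d1, k, f):
    """Total weights with parameter sharing: (F * F * D1) * K, biases excluded."""
    return f * f * d1 * k
```

With the assumed 5 x 5 x 3 input and K=2, F=3, S=2, P=1, each spatial axis gives (5 - 3 + 2)/2 + 1 = 3, so the output volume is 3 x 3 x 2, using only 3 * 3 * 3 * 2 = 54 shared weights.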
AlphaGo’s methods – Trained by Human Expert (1/6)
• Rollout Policy pπ:
  – Takes 2 μs to select an action, but achieves only 24.2% accuracy in predicting expert moves
  – Uses a linear softmax of small pattern features, with weights π
  – Softmax output: n_out,1 = e^(n_in,1) / (e^(n_in,1) + e^(n_in,2) + e^(n_in,3))
https://qph.fs.quoracdn.net/main-qimg-9e2d012ef7cb8b29d2bed14d2975c986
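The linear-softmax output can be written out directly. A minimal sketch; the small-pattern feature extraction that produces the scores is omitted.

```python
import math

# The rollout policy ends in a plain softmax over linear feature scores.

def softmax(scores):
    """p_i = e^{s_i} / sum_j e^{s_j}, shifted by max(scores) for
    numerical stability (the shift cancels in the ratio)."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The probabilities sum to 1 and preserve the ordering of the scores, which is all the rollout policy needs to sample moves quickly.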
AlphaGo’s methods – Trained by Human Expert (2/6)
• SL policy pσ:
  – Takes 3 ms to select an action, with 57.0% accuracy in predicting expert moves
  – A 13-layer convolutional neural network with weights σ
  – Input: 19 x 19 board, 48 feature planes
  – 1st layer: Conv + ReLU, kernel size 5 x 5
  – 2nd–12th layers: Conv + ReLU, kernel size 3 x 3
  – 13th layer: kernel size 1 x 1, 1 filter, softmax
AlphaGo’s methods – Reinforcement Learning pρ (3/6)
• Initialize the RL policy from the SL policy pσ: ρ = ρ⁻ = σ
• Play games between the current RL policy pρ and an opponent pρ⁻ sampled from an opponent pool, until the game ends with reward r
• Update ρ with the policy gradient method, using the reward
• Add the updated pρ to the opponent pool
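The policy-gradient step can be illustrated on a deliberately tiny problem. This REINFORCE-style sketch uses a single logistic parameter and a two-action "game" where action 1 always wins (z = +1) and action 0 always loses (z = -1); it shows only the direction of the update, not AlphaGo's network.

```python
import math
import random

# REINFORCE-style policy-gradient sketch on a toy two-action problem.
# The policy is a single logistic parameter theta; illustrative only.

def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        p1 = 1.0 / (1.0 + math.exp(-theta))  # probability of action 1
        a = 1 if rng.random() < p1 else 0    # sample an action
        z = 1.0 if a == 1 else -1.0          # game outcome as the reward
        # d/dtheta log pi(a) = (a - p1) for a logistic policy; move theta
        # in the direction that makes winning actions more likely
        theta += lr * (a - p1) * z
    return theta
```

Every update pushes theta toward the winning action, so after training the policy picks action 1 almost surely; in AlphaGo the same idea is applied to the network weights ρ, with z the self-play game outcome.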
AlphaGo’s methods – Value Network vθ (4/6)
• Supervised learning (regression):
  – Used to estimate the winning rate of the current position
  – A 15-layer CNN
  – Input: 19 x 19 board, 48 feature planes + 1 plane for the current color
  – 1st–13th layers: the same as the RL policy network
  – 14th layer: fully connected, 256 ReLU units
  – 15th layer: fully connected, 1 tanh unit
AlphaGo’s methods – Value Network vθ (5/6)
• Randomly sample an integer U in 1 ~ 450:
  – t = 1 ~ U-1: played by the SL policy network pσ
  – t = U: a uniformly random action
  – t = U+1 ~ end of game: played by the RL policy network pρ
• Reward: z_t = r(s_T), the final game outcome
• Only a single training example (s_{U+1}, z_{U+1}) is added to the data set from each game (successive positions within a game are strongly correlated)
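The sampling scheme can be made concrete with a stub "game" that only records which policy chose each move; the names here are illustrative, and only the U-sampling structure follows the slide.

```python
import random

# Stub of the value-network data-generation scheme: record which policy
# chose each move, then keep a single example per game.

def generate_example(game_length, rng):
    U = rng.randint(1, 450)
    movers = []
    for t in range(1, game_length + 1):
        if t < U:
            movers.append("sl")        # SL policy p_sigma
        elif t == U:
            movers.append("random")    # one uniformly random action
        else:
            movers.append("rl")        # RL policy p_rho
    # a single example (s_{U+1}, z_{U+1}) is kept from the whole game
    sample_t = U + 1 if U + 1 <= game_length else None
    return U, movers, sample_t
```

Because only one position per game enters the data set, the 30 million value-network targets come from 30 million distinct self-play games rather than correlated positions of a few games.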
AlphaGo’s methods – Searching (6/6)
• Q: action value (estimated winning score)
• u(P): upper confidence bound term (exploration vs. exploitation)
• P: prior probability from pσ (the SL policy performed better here than RL)
The playing strength of AlphaGo
Conclusion
• Reaching a milestone is the beginning of the next milestone.
• Stay hungry, stay foolish!
References(1/2)
• Nature:
  – Mastering the game of Go with deep neural networks and tree search
• Mark Chang:
  – http://www.slideshare.net/ckmarkohchang/alphago-in-depth
• CNN:
  – http://cs231n.github.io/convolutional-networks/
References(2/2)
• 陳鍾誠:
  – http://www.slideshare.net/ccckmit/30alphago
• Monte Carlo Tree Search:
  – https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/
• How AlphaGo Works:
  – http://www.slideshare.net/ShaneSeungwhanMoon/how-alphago-works
End
Thank You
Formula(1/2)
• Policy network (classification), weights σ:
  Δσ ∝ (1/m) Σ_{k=1..m} ∂log pσ(a^k | s^k) / ∂σ
• Policy network (reinforcement learning), weights ρ:
  Δρ ∝ (1/n) Σ_{i=1..n} Σ_{t=1..T^i} ∂log pρ(a_t^i | s_t^i) / ∂ρ · (z_t^i - v(s_t^i))
• Value network (regression), weights θ:
  Δθ ∝ (1/m) Σ_{k=1..m} (z^k - vθ(s^k)) · ∂vθ(s^k) / ∂θ
Formula(2/2)
• Searching:
  – a_t = argmax_a [ Q(s_t, a) + u(s_t, a) ]
  – u(s, a) ∝ P(s, a) / (1 + N(s, a))
  – N(s, a) = Σ_{i=1..n} 1(s, a, i)
  – Q(s, a) = (1 / N(s, a)) Σ_{i=1..n} 1(s, a, i) · V(s_L^i)
  – V(s_L) = (1 - λ) · vθ(s_L) + λ · z_L
  – Full form: u(s, a) = c_puct · P(s, a) · √(Σ_b N(s, b)) / (1 + N(s, a))
• 1(s, a, i) indicates whether the edge (s, a) was traversed during the ith simulation; s_L^i is the leaf node reached in the ith simulation.
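The selection rule transcribes almost line for line into code; `Edge` is a hypothetical container for the per-edge statistics N, W (sum of leaf values) and P, not a structure named in the paper.

```python
import math

# Direct transcription of the selection rule: a_t = argmax_a [Q + u].

class Edge:
    def __init__(self, prior):
        self.P = prior   # prior probability from the SL policy
        self.N = 0       # visit count
        self.W = 0.0     # sum of leaf evaluations V(s_L)

    def Q(self):
        return self.W / self.N if self.N else 0.0

def select_move(edges, c_puct=5.0):
    """Pick a = argmax [ Q(s,a) + c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)) ]."""
    total_n = sum(e.N for e in edges.values())
    def score(a):
        e = edges[a]
        return e.Q() + c_puct * e.P * math.sqrt(total_n) / (1 + e.N)
    return max(edges, key=score)
```

A much-visited edge has its u term shrink like 1/(1 + N), so the search gradually shifts from the prior P toward the measured value Q, i.e. from exploration to exploitation.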
How AlphaGo selected its move
The playing strength of AlphaGo(Bonus 1)
The playing strength of AlphaGo(Bonus 2)