
Transcript of "Accelerating Parallel Monte Carlo Tree Search using CUDA" (GTC poster, on-demand.gputechconf.com/gtc/2009/posters/P0227...)

Accelerating Parallel Monte Carlo Tree Search using CUDA

Kamil Rocki and Reiji Suda

Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo

This work was partially supported by the Core Research of Evolutional Science and Technology (CREST) project "ULP-HPC: Ultra Low-Power, High-Performance Computing via Modeling and Optimization of Next Generation HPC Technologies" of the Japan Science and Technology Agency (JST) and a Grant-in-Aid for Scientific Research of MEXT, Japan.

Monte Carlo Tree Search (MCTS) is a method for making optimal decisions in artificial intelligence (AI) problems, typically move planning in combinatorial games. It combines the generality of random simulation with the precision of tree search. It can theoretically be applied to any domain that can be described in terms of state-action pairs, with simulation used to forecast outcomes - for example decision support, control, delayed-reward problems, or complex optimization.

The motivation for this work comes from emerging GPU-based systems and their high computational potential combined with relatively low power usage compared to CPUs. As the problem to be solved, we chose to develop a GPU-based AI agent for the game of Reversi (Othello), which provides a sufficiently complex problem for tree searching, with a non-uniform structure and an average branching factor of over 8.

We present an efficient parallel GPU MCTS implementation based on the introduced 'block-parallelism' scheme, which combines GPU SIMD thread groups and performs independent searches without any need for intra-GPU or inter-GPU communication. The obtained results show that, with our GPU MCTS implementation on the TSUBAME 2.0 system, one GPU is comparable to 100-200 CPU threads in terms of obtained results, depending on factors such as the search time and other MCTS parameters. We propose and analyze simultaneous CPU/GPU execution, which improves the overall result.

Introduction

•The basic MCTS algorithm is simple - four steps, repeated:

•1. Selection

•2. Expansion

•3. Simulation

•4. Backpropagation

Selection uses the standard UCB formula: the mean value of the node (i.e., its success/loss ratio) plus an exploration term weighted by C, a tunable exploitation/exploration ratio factor.
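The formula itself did not survive extraction; the labels above refer to the standard UCB selection rule, which can be reconstructed as (the exact constant used on the poster is not shown):

\[
UCB_j = \bar{X}_j + C \sqrt{\frac{2 \ln n}{n_j}}
\]

where \(\bar{X}_j\) is the mean value of child \(j\) (its success/loss ratio), \(n_j\) is the number of times child \(j\) has been visited, \(n\) is the parent's visit count, and \(C\) is the tunable exploitation/exploration factor.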

MCTS - Coulom (2006); UCB - Kocsis and Szepesvári (2006)

Parallel MCTS schemes - Chaslot et al. (2008):

a. Leaf parallelism - easy: n simulations run at once from a single leaf

b. Root parallelism - efficient: n independent trees

(Tree parallelism - complex, not efficient)

Our approach - parallel MCTS on GPU = block parallelism (c):

c. Block parallelism: n = blocks (trees) × threads (simulations at once)

Advantage: works well with SIMD hardware; improves the overall result on 2 levels of parallelization

Weakness: the sequential CPU tree-management part (proportional to the number of trees)

[Figure: an example MCTS tree with win/visit counts at each node - 3/6 at the root, children such as 1/3, 3/5, 2/3, 1/3, 0.]

The search has 2 parts:

1. Tree building - the tree is stored in the CPU memory

2. Simulating - temporary, not remembered; done by the CPU or the GPU; the results (0 or 1) are used to affect the tree's expansion strategy
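The four steps and the tree-building/simulating split above can be sketched as follows; this is a minimal illustrative implementation, not the poster's code, and the `moves`/`step`/`playout` callbacks are assumed interfaces:

```python
import math

# Minimal sequential MCTS sketch. The tree lives in host memory; playouts
# are temporary and only their 0/1 result flows back via backpropagation.

class Node:
    def __init__(self, state, untried, parent=None):
        self.state = state
        self.untried = list(untried)   # moves not yet expanded
        self.children = []
        self.parent = parent
        self.wins = 0
        self.visits = 0

def ucb(parent, child, c=1.4):
    # mean value of the node plus C-weighted exploration term
    return (child.wins / child.visits
            + c * math.sqrt(2.0 * math.log(parent.visits) / child.visits))

def mcts_iteration(root, moves, step, playout):
    node = root
    # 1. Selection: descend through fully expanded nodes by UCB
    while not node.untried and node.children:
        node = max(node.children, key=lambda ch: ucb(node, ch))
    # 2. Expansion: add one untried child
    if node.untried:
        m = node.untried.pop()
        child_state = step(node.state, m)  # step() assumed side-effect free
        child = Node(child_state, moves(child_state), node)
        node.children.append(child)
        node = child
    # 3. Simulation: temporary playout, result is 0 or 1
    result = playout(node.state)
    # 4. Backpropagation: update statistics up to the root
    while node is not None:
        node.visits += 1
        node.wins += result
        node = node.parent
```

A toy game (two plies deep, every playout a win) shows the statistics accumulating in the root exactly once per iteration.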

•MCTS has many applications already

•New ones are appearing

•The architecture is likely to follow the trend in the future

•Programming GPUs is likely to become easier, rather than harder


TSUBAME 2.0

•CPUs - Intel(R) Xeon(R) X5670 @ 2.93 GHz, ~1400 nodes of 12 cores

•GPUs - NVIDIA Tesla C2050: 14 (MPs) × 32 (cores/MP) = 448 cores @ 1.15 GHz, ~1400 nodes of 3 GPUs each (around 515 GFLOPS max capability per GPU)

•If not specified otherwise, the MCTS search time = 500 ms, and GPU block size = 128

Sequential/leaf parallel MCTS - seen as an optimization problem

Root parallel MCTS - many starting points, hence a greater chance of reaching the global solution

[Figure: state-space sketches of sequential, root parallel, leaf parallel, and block (root-leaf) parallel MCTS.]

Starting point - with root parallelism there is more chance of finding the global solution rather than only a local one (extremum)

Search scope - with leaf parallelism the search is broader/more accurate (more samples)

Problem statement

Parallel tree search is one of the basic problems in computer science and is used to solve many kinds of problems. Effective parallelization is hard, especially for more than hundreds of threads. SIMD hardware (e.g. a GPU) is fast, but hard to utilize. How can GPUs/CUDA be utilized?

Mapping MCTS trees to blocks

[Figure: the GPU program's blocks (Block 0-7) are scheduled onto the hardware's multiprocessors; within each block, threads (Thread 0, 1, 2, ...) are grouped into SIMD warps of 32 threads (fixed, for current hardware).]

Number of MPs - fixed

Number of blocks - configurable → root parallelism

Number of threads - configurable → leaf parallelism

Together → block parallelism
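The counting scheme behind the mapping above can be sketched on the host side; in this illustrative model (names and the coin-flip playout are assumptions, not the poster's code), each block owns an independent tree (root parallelism) and all threads in a block run one playout from the same leaf at once (leaf parallelism), so each round contributes blocks × threads simulations:

```python
import random

def block_parallel_round(num_blocks, threads_per_block, playout, rng):
    """One synchronous round of block parallelism.

    Returns one (wins, samples) tally per block, i.e. per independent tree;
    a tally is the single aggregated update that tree receives.
    """
    tallies = []
    for _block in range(num_blocks):               # independent trees
        wins = sum(playout(rng) for _ in range(threads_per_block))
        tallies.append((wins, threads_per_block))  # threads act in lockstep
    return tallies

rng = random.Random(0)
# 112 blocks x 64 threads = 7168 simulations per round
tallies = block_parallel_round(112, 64, lambda r: r.randint(0, 1), rng)
```

The appeal for SIMD hardware is that the inner `threads_per_block` loop has no divergent control flow at the tree level - every thread in a block plays out from the same position.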

Scalability - MPI Parallel Scheme

[Figure: MPI scheme - a root process (id = 0) and n-1 worker processes (N processes in total), possibly on different machines (e.g. a Core i7 running Fedora, a Phenom running Ubuntu) connected over a network; after init, the input data is broadcast, all processes simulate, and the output data is collected with a reduction.]

Send the current state of the game to all processes

Think (simulate)

Accumulate results

Choose the best move and send it to the opponent

Receive the opponent's move

All simulations are independent; process number 0 controls the game.
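The control flow above can be sketched with plain functions standing in for the MPI collectives (broadcast of the state, reduction of the tallies); the worker's coin-flip playout and all names here are illustrative assumptions:

```python
import random

def worker_simulate(state, n_playouts, seed):
    # Each worker runs independent simulations; only the (wins, visits)
    # tally travels back over the network, never the playouts themselves.
    rng = random.Random(seed)
    wins = sum(rng.randint(0, 1) for _ in range(n_playouts))
    return wins, n_playouts

def root_think(state, n_processes, n_playouts):
    # Process 0: "broadcast" the game state, then "reduce" by summation.
    results = [worker_simulate(state, n_playouts, seed=i)
               for i in range(n_processes)]
    wins = sum(w for w, _ in results)
    visits = sum(v for _, v in results)
    return wins, visits

wins, visits = root_think(state="<board>", n_processes=8, n_playouts=1000)
```

Because the reduction is a plain sum of independent tallies, adding processes adds no coordination cost beyond the broadcast and reduce themselves - consistent with the "no communication bottleneck" finding below.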

Results and findings

Simultaneous CPU/GPU simulating

[Figure: scaling from 1 to 32 GPUs (112 blocks × 64 threads each) - simulations per second reach ~20 million with 229,376 threads, and the average point difference grows from about 26.5 to 29.5; there is no communication bottleneck, although the per-GPU improvement gets worse.]

[Figure: win ratio against 1 CPU thread versus the number of GPU threads (1 to 14,336), averaged over 2000 games - leaf parallelism (block size = 64) and block parallelism (block size = 32, i.e. 448 trees; block size = 128, i.e. 112 trees).]

[Figure: 1 GPU vs 128 CPUs, 500 ms search time - average point difference (score) and average tree depth per game step, for GPU-only and simultaneous GPU + CPU search.]

[Figure: average score per game step - 256 GPUs (3,670,016 threads) and 2048 CPU threads vs sequential MCTS.]

•Findings:

•Weak scaling of the algorithm - the problem's complexity affects the scalability

•Exploitation/exploration ratio - higher exploitation needed for more trees

•No communication bottleneck

•Much more efficient than the CPU version

Exploration/exploitation in parallel MCTS

[Figure: per-tree result tallies (trees 1-5 and their sum) under high exploitation versus high exploration.]

[Figure: simulations per second (up to ~9 × 10^5) versus the number of GPU threads (1 to 14,336) - leaf parallelism (block size = 64) and block parallelism (block size = 32, 448 trees; block size = 128, 112 trees), all vs 1 CPU thread. 1 CPU achieves around 10,000 sim/s - the GPU is much faster!]

•More trees = higher score

•More simulations = higher score

•More trees = fewer simulations

•Block size needs to be adjusted

•1 GPU ~ 64-128 CPUs (AI power)

•While the GPU runs a kernel, the CPU can work too

•This increases the tree depth and improves the overall result

Hybrid CPU/GPU search

[Figure: timeline of one hybrid step - after the kernel execution call, the kernel is processed by the GPU for its execution time while the CPU retains control and can work (the tree is expanded by the CPU in the meantime), until the GPU-ready event returns control.]
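The overlap in the timeline above can be sketched with a worker thread standing in for the asynchronous kernel launch; `gpu_kernel` and `cpu_expand` are illustrative stand-ins, not the poster's actual routines:

```python
import threading

def hybrid_step(gpu_kernel, cpu_expand):
    """Overlap a 'GPU' kernel with CPU-side tree expansion."""
    result = {}

    def run():                       # kernel execution call (asynchronous)
        result["gpu"] = gpu_kernel()

    t = threading.Thread(target=run)
    t.start()
    expanded = cpu_expand()          # CPU can work here, in the meantime
    t.join()                         # gpu-ready event: wait for the kernel
    return result["gpu"], expanded

# e.g. the GPU returns simulation tallies while the CPU expanded 32 nodes
gpu_sims, cpu_nodes = hybrid_step(lambda: 7168, lambda: 32)
```

The CPU work is free in wall-clock terms as long as it finishes before the kernel does, which is why the hybrid variant deepens the tree without lengthening the 500 ms search budget.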