
Transcript of "Accelerating Parallel Monte Carlo Tree Search using CUDA" (GTC poster, on-demand.gputechconf.com/gtc/2009/posters/P0227...)

Accelerating Parallel Monte Carlo Tree Search using CUDA

Kamil Rocki and Reiji Suda

Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo

This work was partially supported by the Core Research of Evolutional Science and Technology (CREST) project "ULP-HPC: Ultra Low-Power, High-Performance Computing via Modeling and Optimization of Next Generation HPC Technologies" of the Japan Science and Technology Agency (JST) and a Grant-in-Aid for Scientific Research of MEXT, Japan.

Monte Carlo Tree Search (MCTS) is a method for making optimal decisions in artificial intelligence (AI) problems, typically move planning in combinatorial games. It combines the generality of random simulation with the precision of tree search. It can theoretically be applied to any domain that can be described in terms of state-action pairs, with simulation used to forecast outcomes - for example decision support, control, delayed-reward problems, or complex optimization.

The motivation for this work comes from emerging GPU-based systems and their high computational potential combined with relatively low power usage compared to CPUs. As the problem to be solved, we chose to develop a GPU-based AI agent for the game of Reversi (Othello), which provides a sufficiently complex problem for tree searching, with a non-uniform structure and an average branching factor of over 8.

We present an efficient parallel GPU MCTS implementation based on the introduced 'block-parallelism' scheme, which combines GPU SIMD thread groups and performs independent searches without any need for intra-GPU or inter-GPU communication. The obtained results show that, with our GPU MCTS implementation on the TSUBAME 2.0 system, one GPU is comparable to 100-200 CPU threads in terms of obtained results, depending on factors such as the search time and other MCTS parameters. We propose and analyze simultaneous CPU/GPU execution, which improves the overall result.

Introduction

•The basic MCTS algorithm is simple - four steps, repeated:

•1. Selection

•2. Expansion

•3. Simulation

•4. Backpropagation

Selection uses the standard UCB formula: the mean value of the node (i.e., its success/loss ratio) plus an exploration term weighted by C, a tunable exploitation/exploration ratio factor.
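The formula itself did not survive extraction; the labels above refer to the standard UCB selection rule, which can be reconstructed as (the exact constant used on the poster is not shown):

\[
UCB_j = \bar{X}_j + C \sqrt{\frac{2 \ln n}{n_j}}
\]

where \(\bar{X}_j\) is the mean value of child \(j\) (its success/loss ratio), \(n_j\) is the number of times child \(j\) has been visited, \(n\) is the parent's visit count, and \(C\) is the tunable exploitation/exploration factor.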

MCTS - Coulom (2006); UCB - Kocsis and Szepesvári (2006)

Parallel MCTS schemes - Chaslot et al. (2008):

a. Leaf parallelism - easy: n simulations run at once from a single leaf

b. Root parallelism - efficient: n independent trees

(Tree parallelism - complex, not efficient)

Our approach - parallel MCTS on GPU = block parallelism (c):

c. Block parallelism: n = blocks (trees) × threads (simulations at once)

Advantage: works well with SIMD hardware; improves the overall result on 2 levels of parallelization

Weakness: the sequential CPU tree-management part (proportional to the number of trees)

[Figure: an example MCTS tree with win/visit counts at each node - 3/6 at the root, children such as 1/3, 3/5, 2/3, 1/3, 0.]

The search has 2 parts:

1. Tree building - the tree is stored in the CPU memory

2. Simulating - temporary, not remembered; done by the CPU or the GPU; the results (0 or 1) are used to affect the tree's expansion strategy
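The four steps and the tree-building/simulating split above can be sketched as follows; this is a minimal illustrative implementation, not the poster's code, and the `moves`/`step`/`playout` callbacks are assumed interfaces:

```python
import math

# Minimal sequential MCTS sketch. The tree lives in host memory; playouts
# are temporary and only their 0/1 result flows back via backpropagation.

class Node:
    def __init__(self, state, untried, parent=None):
        self.state = state
        self.untried = list(untried)   # moves not yet expanded
        self.children = []
        self.parent = parent
        self.wins = 0
        self.visits = 0

def ucb(parent, child, c=1.4):
    # mean value of the node plus C-weighted exploration term
    return (child.wins / child.visits
            + c * math.sqrt(2.0 * math.log(parent.visits) / child.visits))

def mcts_iteration(root, moves, step, playout):
    node = root
    # 1. Selection: descend through fully expanded nodes by UCB
    while not node.untried and node.children:
        node = max(node.children, key=lambda ch: ucb(node, ch))
    # 2. Expansion: add one untried child
    if node.untried:
        m = node.untried.pop()
        child_state = step(node.state, m)  # step() assumed side-effect free
        child = Node(child_state, moves(child_state), node)
        node.children.append(child)
        node = child
    # 3. Simulation: temporary playout, result is 0 or 1
    result = playout(node.state)
    # 4. Backpropagation: update statistics up to the root
    while node is not None:
        node.visits += 1
        node.wins += result
        node = node.parent
```

A toy game (two plies deep, every playout a win) shows the statistics accumulating in the root exactly once per iteration.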

•MCTS has many applications already

•New ones are appearing

•The architecture is likely to follow the trend in the future

•Programming GPUs is likely to become easier, rather than harder


TSUBAME 2.0

•CPUs - Intel(R) Xeon(R) X5670 @ 2.93 GHz, ~1400 nodes of 12 cores

•GPUs - NVIDIA Tesla C2050: 14 (MPs) × 32 (cores/MP) = 448 cores @ 1.15 GHz, ~1400 nodes of 3 GPUs each (around 515 GFLOPS max capability per GPU)

•If not specified otherwise, the MCTS search time = 500 ms, and GPU block size = 128

Sequential/leaf parallel MCTS - seen as an optimization problem

Root parallel MCTS - many starting points, hence a greater chance of reaching the global solution

[Figure: state-space sketches of sequential, root parallel, leaf parallel, and block (root-leaf) parallel MCTS.]

Starting point - with root parallelism there is more chance of finding the global solution rather than only a local one (extremum)

Search scope - with leaf parallelism the search is broader/more accurate (more samples)

Problem statement

Parallel tree search is one of the basic problems in computer science and is used to solve many kinds of problems. Effective parallelization is hard, especially for more than hundreds of threads. SIMD hardware (e.g. a GPU) is fast, but hard to utilize. How can GPUs/CUDA be utilized?

Mapping MCTS trees to blocks

[Figure: the GPU program's blocks (Block 0-7) are scheduled onto the hardware's multiprocessors; within each block, threads (Thread 0, 1, 2, ...) are grouped into SIMD warps of 32 threads (fixed, for current hardware).]

Number of MPs - fixed

Number of blocks - configurable → root parallelism

Number of threads - configurable → leaf parallelism

Together → block parallelism
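The counting scheme behind the mapping above can be sketched on the host side; in this illustrative model (names and the coin-flip playout are assumptions, not the poster's code), each block owns an independent tree (root parallelism) and all threads in a block run one playout from the same leaf at once (leaf parallelism), so each round contributes blocks × threads simulations:

```python
import random

def block_parallel_round(num_blocks, threads_per_block, playout, rng):
    """One synchronous round of block parallelism.

    Returns one (wins, samples) tally per block, i.e. per independent tree;
    a tally is the single aggregated update that tree receives.
    """
    tallies = []
    for _block in range(num_blocks):               # independent trees
        wins = sum(playout(rng) for _ in range(threads_per_block))
        tallies.append((wins, threads_per_block))  # threads act in lockstep
    return tallies

rng = random.Random(0)
# 112 blocks x 64 threads = 7168 simulations per round
tallies = block_parallel_round(112, 64, lambda r: r.randint(0, 1), rng)
```

The appeal for SIMD hardware is that the inner `threads_per_block` loop has no divergent control flow at the tree level - every thread in a block plays out from the same position.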

Scalability - MPI Parallel Scheme

[Figure: MPI scheme - a root process (id = 0) and n-1 worker processes (N processes in total), possibly on different machines (e.g. a Core i7 running Fedora, a Phenom running Ubuntu) connected over a network; after init, the input data is broadcast, all processes simulate, and the output data is collected with a reduction.]

Send the current state of the game to all processes

Think (simulate)

Accumulate results

Choose the best move and send it to the opponent

Receive the opponent's move

All simulations are independent; process number 0 controls the game.
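The control flow above can be sketched with plain functions standing in for the MPI collectives (broadcast of the state, reduction of the tallies); the worker's coin-flip playout and all names here are illustrative assumptions:

```python
import random

def worker_simulate(state, n_playouts, seed):
    # Each worker runs independent simulations; only the (wins, visits)
    # tally travels back over the network, never the playouts themselves.
    rng = random.Random(seed)
    wins = sum(rng.randint(0, 1) for _ in range(n_playouts))
    return wins, n_playouts

def root_think(state, n_processes, n_playouts):
    # Process 0: "broadcast" the game state, then "reduce" by summation.
    results = [worker_simulate(state, n_playouts, seed=i)
               for i in range(n_processes)]
    wins = sum(w for w, _ in results)
    visits = sum(v for _, v in results)
    return wins, visits

wins, visits = root_think(state="<board>", n_processes=8, n_playouts=1000)
```

Because the reduction is a plain sum of independent tallies, adding processes adds no coordination cost beyond the broadcast and reduce themselves - consistent with the "no communication bottleneck" finding below.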

Results and findings

Simultaneous CPU/GPU simulating

[Figure: scaling from 1 to 32 GPUs (112 blocks × 64 threads each) - simulations per second reach ~20 million with 229,376 threads, and the average point difference grows from about 26.5 to 29.5; there is no communication bottleneck, although the per-GPU improvement gets worse.]

[Figure: win ratio against 1 CPU thread versus the number of GPU threads (1 to 14,336), averaged over 2000 games - leaf parallelism (block size = 64) and block parallelism (block size = 32, i.e. 448 trees; block size = 128, i.e. 112 trees).]

[Figure: 1 GPU vs 128 CPUs, 500 ms search time - average point difference (score) and average tree depth per game step, for GPU-only and simultaneous GPU + CPU search.]

[Figure: average score per game step - 256 GPUs (3,670,016 threads) and 2048 CPU threads vs sequential MCTS.]

•Findings:

•Weak scaling of the algorithm - the problem's complexity affects the scalability

•Exploitation/exploration ratio - higher exploitation needed for more trees

•No communication bottleneck

•Much more efficient than the CPU version

Exploration/exploitation in parallel MCTS

[Figure: per-tree result tallies (trees 1-5 and their sum) under high exploitation versus high exploration.]

[Figure: simulations per second (up to ~9 × 10^5) versus the number of GPU threads (1 to 14,336) - leaf parallelism (block size = 64) and block parallelism (block size = 32, 448 trees; block size = 128, 112 trees), all vs 1 CPU thread. 1 CPU achieves around 10,000 sim/s - the GPU is much faster!]

•More trees = higher score

•More simulations = higher score

•More trees = fewer simulations

•Block size needs to be adjusted

•1 GPU ~ 64-128 CPUs (AI power)

•While the GPU runs a kernel, the CPU can work too

•This increases the tree depth and improves the overall result

Hybrid CPU/GPU search

[Figure: timeline of one hybrid step - after the kernel execution call, the kernel is processed by the GPU for its execution time while the CPU retains control and can work (the tree is expanded by the CPU in the meantime), until the GPU-ready event returns control.]
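The overlap in the timeline above can be sketched with a worker thread standing in for the asynchronous kernel launch; `gpu_kernel` and `cpu_expand` are illustrative stand-ins, not the poster's actual routines:

```python
import threading

def hybrid_step(gpu_kernel, cpu_expand):
    """Overlap a 'GPU' kernel with CPU-side tree expansion."""
    result = {}

    def run():                       # kernel execution call (asynchronous)
        result["gpu"] = gpu_kernel()

    t = threading.Thread(target=run)
    t.start()
    expanded = cpu_expand()          # CPU can work here, in the meantime
    t.join()                         # gpu-ready event: wait for the kernel
    return result["gpu"], expanded

# e.g. the GPU returns simulation tallies while the CPU expanded 32 nodes
gpu_sims, cpu_nodes = hybrid_step(lambda: 7168, lambda: 32)
```

The CPU work is free in wall-clock terms as long as it finishes before the kernel does, which is why the hybrid variant deepens the tree without lengthening the 500 ms search budget.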