
Monte-Carlo Tree Search: An introduction

Jérémie Decock
Inria Saclay - LRI
May 2012


Introduction


Monte-Carlo Tree Search (MCTS)

- MCTS is a recent algorithm for sequential decision making
- It applies to Markov Decision Processes (MDP):
  - discrete time t with finite horizon T
  - state s_t ∈ S
  - action a_t ∈ A
  - transition function s_{t+1} = P(s_t, a_t)
  - reward function r_t = R(s_t)
  - cumulative reward R = ∑_{t=0}^{T} r_t
  - policy function a_t = π(s_t)
- We look for the policy π* that maximizes the expected cumulative reward R
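To make the notation concrete, here is a minimal sketch of this MDP interface in Python; the class and method names are illustrative assumptions, not part of the slides.

```python
# A minimal sketch of the MDP interface above; the class and method
# names are illustrative assumptions, not part of the slides.

class MDP:
    def __init__(self, horizon):
        self.horizon = horizon          # finite horizon T

    def actions(self, state):
        """Legal actions A available in `state`."""
        raise NotImplementedError

    def transition(self, state, action):
        """Sample the next state s_{t+1} = P(s_t, a_t)."""
        raise NotImplementedError

    def reward(self, state):
        """Immediate reward r_t = R(s_t)."""
        raise NotImplementedError


def cumulative_reward(mdp, policy, state):
    """R = sum_{t=0}^{T} r_t, following the policy a_t = pi(s_t)."""
    total = 0.0
    for _ in range(mdp.horizon + 1):
        total += mdp.reward(state)
        state = mdp.transition(state, policy(state))
    return total
```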


MCTS strengths

- MCTS is a versatile algorithm: it does not require expert knowledge about the problem
- In particular, it does not require any knowledge of the Bellman value function
- It is stable on high-dimensional problems
- It outperforms all other algorithms on some problems (difficult games like Go, general game playing, ...)


MCTS

Problems are represented as a tree structure:

- blue circles = states
- plain edges + red squares = decisions
- dashed edges = stochastic transitions between two states

[Figure: search tree rooted at the current state, showing explored decisions, explored states, and stochastic transitions across successive time steps.]


Monte-Carlo Tree Search


Main steps of MCTS

[Figure: one MCTS iteration between time steps t_n and t_{n+1}: 1. selection, 2. expansion, 3. simulation until the horizon, 4. propagation of the simulation reward and +1 visit counts back up the tree.]

Starting from an initial state:

1. select the state we want to expand from
2. add the generated state to memory
3. evaluate the new state with a default policy until the horizon is reached
4. back-propagate some information:
   - n(s, a): the number of times decision a has been simulated in s
   - n(s): the number of times s has been visited in simulations
   - Q(s, a): the mean reward of simulations where a was chosen in s
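The four steps map onto a short Python sketch, reusing the hypothetical MDP interface from the introduction. The Node fields mirror the statistics of step 4; everything else (names, the random expansion choice) is an illustrative assumption, not the exact implementation behind the slides.

```python
import random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action   # decision that led here from the parent
        self.children = []     # explored decisions
        self.n = 0             # n(s): visits of this state
        self.n_a = 0           # n(parent, a): simulations of this decision
        self.q = 0.0           # Q(parent, a): mean reward of this decision

def mcts_iteration(mdp, root, select_child, default_policy):
    # 1. selection: descend to the state we want to expand from
    node = root
    while node.children:
        node = select_child(node)               # e.g. UCB, defined below
    # 2. expansion: generate one new state and add it to memory
    action = random.choice(mdp.actions(node.state))
    child = Node(mdp.transition(node.state, action),
                 parent=node, action=action)
    node.children.append(child)
    # 3. simulation: roll out a default policy until the horizon
    # (simplified: only the rollout reward is counted here)
    state, reward = child.state, 0.0
    for _ in range(mdp.horizon):
        reward += mdp.reward(state)
        state = mdp.transition(state, default_policy(state))
    # 4. propagation: update n(s), n(s, a) and Q(s, a) along the path
    while child is not None:
        child.n += 1
        child.n_a += 1
        child.q += (reward - child.q) / child.n_a   # incremental mean
        child = child.parent
```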


The selected decision: a_{t_n} is the most visited decision from the current state (the root node).
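On the hypothetical Node structure from the sketch above, this decision rule is a one-liner:

```python
def final_decision(root):
    # a_{t_n} = the most visited decision from the current state (root)
    return max(root.children, key=lambda child: child.n_a).action
```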


Selection and expansion


Selection step

How to select the state to expand?


The selection phase is driven by the Upper Confidence Bound (UCB):

score_UCB(s, a) = Q(s, a) + √( log(2 + n(s)) / (2 + n(s, a)) )

where:

1. Q(s, a) is the mean reward of simulations including action a in state s
2. the square-root term is the uncertainty on this estimate of the action's value


The selected action:

a* = argmax_a score_UCB(s, a)
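The rule translates directly to Python on the same Node statistics; select_child fills the hook left open in the iteration sketch above (the helper names are assumptions).

```python
import math

def ucb_score(q, n_s, n_sa):
    # score_UCB(s, a) = Q(s, a) + sqrt(log(2 + n(s)) / (2 + n(s, a)))
    return q + math.sqrt(math.log(2 + n_s) / (2 + n_sa))

def select_child(node):
    # a* = argmax_a score_UCB(s, a), over the decisions explored so far
    return max(node.children,
               key=lambda c: ucb_score(c.q, node.n, c.n_a))
```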


When should we expand?

One standard way of tackling the exploration/exploitation dilemma is Progressive Widening (PW).

A new parameter α ∈ [0, 1] is introduced to choose between exploration (add a new decision to the tree) and exploitation (go to an existing node).


The Progressive Widening rule:

- if |A′_s| < n(s)^α, then we explore a new decision
- else we simulate an already known decision

where |A′_s| is the number of decisions already explored (tried at least once) in state s; see the sketch below.
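A minimal sketch of the rule as a predicate on the same Node structure, following the slide's condition verbatim:

```python
def should_expand(node, alpha):
    # Progressive Widening: explore a new decision while
    # |A'_s| < n(s)**alpha, with alpha in [0, 1].
    # A fresh node (n(s) = 0) is expanded unconditionally by the
    # iteration sketch above, so that edge case is not handled here.
    return len(node.children) < node.n ** alpha
```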


[Figure: trees grown with Progressive Widening for α = 0.2 and α = 0.8, each shown after t = 10, 50, and 500 iterations; a larger α admits more new decisions per node, producing wider trees.]
