Monte-Carlo Tree Search
Alan Fern
Introduction
- Rollout does not guarantee optimality or near-optimality
  - It only guarantees policy improvement (under certain conditions)
- Theoretical question: Can we develop Monte-Carlo methods that give us near-optimal policies?
  - With computation that does NOT depend on the number of states!
  - This was an open theoretical question until the late 90's.
- Practical question: Can we develop Monte-Carlo methods that improve smoothly and quickly with more computation time?
  - Multi-stage rollout can improve arbitrarily, but each stage adds a big increase in time per move.
Look-Ahead Trees
- Rollout can be viewed as performing one level of search on top of the base policy
  - Maybe we should search for multiple levels.
- In deterministic games and search problems it is common to build a look-ahead tree at a state to select the best action
  - Can we generalize this to general stochastic MDPs?
- Sparse Sampling is one such algorithm
  - Strong theoretical guarantees of near-optimality

[Figure: a look-ahead tree rooted at the current state, branching on actions a1, a2, ..., ak and expanded for multiple levels.]
Online Planning with Look-Ahead Trees
- At each state we encounter in the environment we build a look-ahead tree of depth h and use it to estimate optimal Q-values of each action
  - Select the action with the highest Q-value estimate
- s = current state
- Repeat (a Python sketch follows below)
  - T = BuildLookAheadTree(s)   ;; sparse sampling or UCT
                                ;; tree provides Q-value estimates for the root actions
  - a = BestRootAction(T)       ;; action with best Q-value
  - Execute action a in the environment
  - s is the resulting state
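A minimal Python sketch of this loop, with the planner and environment interfaces left as placeholders (build_lookahead_tree, best_root_action, and env.execute are illustrative names, not from the slides):

def act_online(env, s, build_lookahead_tree, best_root_action, num_steps):
    for _ in range(num_steps):              # Repeat
        tree = build_lookahead_tree(s)      # sparse sampling or UCT
        a = best_root_action(tree)          # root action with best Q-value estimate
        s = env.execute(a)                  # execute a; s becomes the resulting state
    return s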
Planning with Look-Ahead Trees

[Figure: a real-world state/action sequence; at each state s encountered, a fresh look-ahead tree is built. Each tree branches on actions a1 and a2, samples next states s11, ..., s1w and s21, ..., s2w, and evaluates leaf rewards such as R(s11,a1), ..., R(s2w,a2).]
Expectimax Tree (depth 1)
Alternate max & expectation.

[Figure: the root s is a max node (max over actions); below each action a and b is an expectation node (weighted average over states); its children s1, s2, ..., sn-1, sn are the next states, weighted by their probability of occurring after taking the action in s.]

- After taking each action there is a distribution over next states (nature's moves)
- The value of an action depends on the immediate reward and the weighted value of those states
Expectimax Tree (depth 2)
Alternate max & expectation.

[Figure: the root s is a max node (max over actions); each action a and b leads to an expectation node (weighted average over states); each sampled next state is again a max node over actions a and b, and so on.]

- Max nodes: value equals the max of the values of the child expectation nodes
- Expectation nodes: value is the weighted average of the values of its children
- Compute values from the bottom up (leaf values = 0). Select the root action with the best value.
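Written as a recursion, this bottom-up computation is the standard finite-horizon expectimax backup (β is the discount factor that also appears in the sparse sampling pseudocode later; the depth-1 and depth-2 trees are the cases h = 1, 2):

    Q_h(s,a) = R(s,a) + β * Σ_{s'} P(s'|s,a) * V_{h-1}(s'),
    V_h(s)   = max_a Q_h(s,a),
    V_0(s)   = 0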
Expectimax Tree (depth H)

[Figure: the expectimax tree rooted at s, alternating action branches (a, b) and sampled-state branches, expanded to depth H.]

- In general we can grow the tree to any horizon H
- With k = # actions and n = # states, the size of the tree is O((kn)^H)
- The size depends on the size of the state space. Bad!
Sparse Sampling Tree (depth H)

[Figure: the same tree shape expanded to depth H, but each action node has only w sampled children.]

- Replace the expectation with an average over w samples
- With k = # actions and w = # samples per action, the size of the tree is O((kw)^H)
- w will typically be much smaller than n.
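To see the gap concretely (illustrative numbers, not from the slides): with k = 2 actions, horizon H = 3, n = 10^6 states, and w = 5 samples per action,

    (kn)^H = (2 * 10^6)^3 = 8 * 10^18    vs.    (kw)^H = (2 * 5)^3 = 10^3.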
Sparse Sampling [Kearns et al. 2002]

SparseSampleTree(s, h, w)
  If h == 0 Then Return [0, null]
  For each action a in s
    Q*(s,a,h) = 0
    For i = 1 to w
      Simulate taking a in s, resulting in state s_i and reward r_i
      [V*(s_i,h-1), a*] = SparseSampleTree(s_i, h-1, w)
      Q*(s,a,h) = Q*(s,a,h) + r_i + β V*(s_i,h-1)
    Q*(s,a,h) = Q*(s,a,h) / w        ;; estimate of Q*(s,a,h)
  V*(s,h) = max_a Q*(s,a,h)          ;; estimate of V*(s,h)
  a* = argmax_a Q*(s,a,h)
  Return [V*(s,h), a*]

The Sparse Sampling algorithm computes the root value via depth-first expansion.
It returns a value estimate V*(s,h) of state s and an estimated optimal action a*.
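A compact Python rendering of the pseudocode above, assuming a generative model simulate(s, a) -> (next_state, reward) and an actions(s) function (both illustrative placeholders):

def sparse_sample_tree(s, h, w, actions, simulate, beta=1.0):
    """Depth-first sparse sampling: returns (value estimate V*(s,h), greedy action a*)."""
    if h == 0:
        return 0.0, None
    best_q, best_a = float("-inf"), None
    for a in actions(s):
        q = 0.0
        for _ in range(w):                           # w sampled successors per action
            s_i, r_i = simulate(s, a)                # one call to the generative model
            v_i, _ = sparse_sample_tree(s_i, h - 1, w, actions, simulate, beta)
            q += r_i + beta * v_i
        q /= w                                       # Monte-Carlo estimate of Q*(s,a,h)
        if q > best_q:
            best_q, best_a = q, a
    return best_q, best_a

The number of simulator calls grows on the order of (kw)^H, matching the tree-size bound above.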
SparseSample(h=2, w=2): example
Alternate max & averaging.
- Max nodes: value equals the max of the values of the child expectation nodes
- Average nodes: value is the average of the values of the w sampled children
- Compute values from the bottom up (leaf values = 0). Select the root action with the best value.

[Figure sequence: a worked run of SparseSample(h=2, w=2) from state s. Under action a, the two recursive SparseSample(h=1, w=2) calls return 10 and 0, so a's value is their average, 5. Under action b, the two calls return 4 and 2, so b's value is 3. Select action 'a' since its value is larger than b's.]
Sparse Sampling (Cont'd)
- For a given desired accuracy, how large should the sampling width and depth be?
  - Answered by Kearns, Mansour, and Ng (1999)
- Good news: they give values for w and H that achieve a policy arbitrarily close to optimal
  - The values are independent of the state-space size!
  - First near-optimal general MDP planning algorithm whose runtime didn't depend on the size of the state space
- Bad news: the theoretical values are typically still intractably large --- also exponential in H
  - Exponential in H is the best we can do in general
  - In practice: use a small H and a heuristic at the leaves
Sparse Sampling (Cont'd)
- In practice: use a small H and evaluate the leaves with a heuristic
  - For example, if we have a policy, leaves could be evaluated by estimating the policy value (i.e., the average reward across runs of the policy)

[Figure: a depth-limited sparse sampling tree over actions a1 and a2, with policy simulations run from the leaf nodes.]
Uniform vs. Adaptive Tree Search
- Sparse sampling wastes time on bad parts of the tree
  - It devotes equal resources to each state encountered in the tree
  - We would like to focus on the most promising parts of the tree
- But how do we control exploring new parts of the tree vs. exploiting the promising parts?
- We need an adaptive bandit algorithm that explores more effectively
What now?
- Adaptive Monte-Carlo Tree Search
  - UCB Bandit Algorithm
  - UCT Monte-Carlo Tree Search
Bandits: Cumulative Regret Objective

[Figure: a single state s with k arms/actions a1, a2, ..., ak.]

- Problem: find an arm-pulling strategy such that the expected total reward at time n is close to the best possible (one pull per time step)
  - The optimal strategy (in expectation) is to pull the optimal arm n times
  - UniformBandit is a poor choice --- it wastes time on bad arms
  - We must balance exploring machines to find good payoffs against exploiting current knowledge
Cumulative Regret Objective
- Protocol: at time step n the algorithm picks an arm a_n based on what it has seen so far and receives reward r_n (a_n and r_n are random variables).
- Expected cumulative regret: the difference between the optimal expected cumulative reward and the expected cumulative reward of our strategy at time n
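Writing μ* for the expected reward of the best arm and r_i for the (random) reward received at step i, this is the standard quantity

    E[Reg_n] = n * μ*  -  E[ Σ_{i=1..n} r_i ].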
UCB Algorithm for Minimizing Cumulative Regret
- Q(a): average reward for trying action a (in our single state s) so far
- n(a): number of pulls of arm a so far
- Action choice by UCB after n pulls:

    a_n = argmax_a [ Q(a) + sqrt( 2 ln(n) / n(a) ) ]

- Assumes rewards are in [0,1]. We can always normalize if we know the max value.
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2), 235-256.
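A small Python sketch of this choice rule (the function and argument names are illustrative; it assumes every arm has already been pulled at least once):

import math

def ucb_choose(q, counts, n):
    """UCB1 arm choice after n total pulls.

    q[a]      -- average reward observed for arm a so far (rewards in [0, 1])
    counts[a] -- number of pulls of arm a so far (assumed >= 1)
    n         -- total number of pulls so far
    """
    return max(range(len(q)),
               key=lambda a: q[a] + math.sqrt(2.0 * math.log(n) / counts[a]))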
UCB: Bounded Sub-Optimality

    a_n = argmax_a [ Q(a) + sqrt( 2 ln(n) / n(a) ) ]

- Value term Q(a): favors actions that looked good historically
- Exploration term sqrt( 2 ln(n) / n(a) ): actions get an exploration bonus that grows with ln(n)
- Expected number of pulls of a sub-optimal arm a is bounded by

    (8 / Δ_a^2) ln(n)

  where Δ_a is the sub-optimality of arm a
- UCB doesn't waste much time on sub-optimal arms, unlike uniform!
UCB Performance Guarantee [Auer, Cesa-Bianchi, & Fischer, 2002]

Theorem: The expected cumulative regret of UCB after n arm pulls is bounded by O(log n).

- Is this good? Yes. The average per-step regret is O((log n)/n).

Theorem: No algorithm can achieve a better expected regret (up to constant factors).
What now?
- Adaptive Monte-Carlo Tree Search
  - UCB Bandit Algorithm
  - UCT Monte-Carlo Tree Search
- UCT is an instance of Monte-Carlo Tree Search
  - Applies the principle of UCB
  - Similar theoretical properties to sparse sampling
  - Much better anytime behavior than sparse sampling
- Famous for yielding a major advance in computer Go
- A growing number of success stories
UCT Algorithm [Kocsis & Szepesvari, 2006]
- Builds a sparse look-ahead tree rooted at the current state by repeated Monte-Carlo simulation of a "rollout policy"
- During construction each tree node s stores:
  - state-visitation count n(s)
  - action counts n(s,a)
  - action values Q(s,a)
- Repeat until time is up (see the sketch after this list)
  1. Execute the rollout policy starting from the root until the horizon (this generates a state-action-reward trajectory)
  2. Add the first node not in the current tree to the tree
  3. Update the statistics of each tree node s on the trajectory
     - Increment n(s) and n(s,a) for the selected action a
     - Update Q(s,a) by the total reward observed after the node
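One possible Python skeleton of a single iteration (the node layout and helper names -- simulate, rollout_policy -- are assumptions for illustration, not the authors' code):

def uct_iteration(root_state, horizon, tree, simulate, rollout_policy):
    """One UCT iteration: roll out a trajectory, add one node, back up returns.

    tree maps a state to {"n": visits, "n_sa": {a: count}, "Q": {a: value}};
    simulate(s, a) returns (next_state, reward); rollout_policy(s, tree) uses
    the tree policy inside the tree and the default policy outside it.
    """
    trajectory, s = [], root_state
    for _ in range(horizon):                     # 1. execute rollout policy to horizon
        a = rollout_policy(s, tree)
        s_next, r = simulate(s, a)
        trajectory.append((s, a, r))
        s = s_next
    for s_t, _, _ in trajectory:                 # 2. add first node not yet in the tree
        if s_t not in tree:
            tree[s_t] = {"n": 0, "n_sa": {}, "Q": {}}
            break
    ret = 0.0                                    # 3. back up total reward observed after each node
    for s_t, a_t, r_t in reversed(trajectory):
        ret += r_t
        if s_t in tree:
            node = tree[s_t]
            node["n"] += 1
            node["n_sa"][a_t] = node["n_sa"].get(a_t, 0) + 1
            node["Q"][a_t] = (node["Q"].get(a_t, 0.0)
                              + (ret - node["Q"].get(a_t, 0.0)) / node["n_sa"][a_t])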
Monte-Carlo Tree Search
- For illustration purposes we will assume the MDP is:
  - Deterministic
  - Only non-zero rewards are at terminal/leaf nodes
- The algorithm is well-defined without these assumptions
Example MCTS

[Figure sequence (Iterations 1-4): MCTS run from the current world state, assuming all non-zero reward occurs at terminal nodes. Initially the tree is a single leaf. In each iteration the rollout policy is executed from the root to a terminal state (reward 1 or 0), the first node not already in the tree is added as a new tree node, and the value estimates of the nodes on the trajectory are updated as running averages of the observed returns (e.g. an estimate of 1 after one reward-1 rollout becomes 1/2 and then 1/3 as reward-0 rollouts follow).]
Rollout Policies
What should the rollout policy be?
- Monte-Carlo Tree Search algorithms mainly differ in their choice of rollout policy
- Rollout policies have two distinct phases
  - Tree policy: selects actions at nodes already in the tree (each action must be selected at least once)
  - Default policy: selects actions after leaving the tree
- Key idea: the tree policy can use statistics collected from previous trajectories to intelligently expand the tree in the most promising direction
  - Rather than uniformly exploring actions at each node
- Basic UCT uses a random default policy
  - In practice we often use a hand-coded or learned policy
- The tree policy is based on UCB:
  - Q(s,a): average reward received in trajectories so far after taking action a in state s
  - n(s,a): number of times action a has been taken in s
  - n(s): number of times state s has been encountered

    π_UCT(s) = argmax_a [ Q(s,a) + K * sqrt( ln n(s) / n(s,a) ) ]

  where K is a theoretical constant that must be selected empirically in practice
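The same rule as a small Python helper, using the node layout from the earlier sketch (K is the exploration constant; untried actions would be selected first, before this rule applies):

import math

def uct_tree_policy(node, K):
    """UCT action choice at a tree node whose actions have all been tried once."""
    return max(node["Q"],
               key=lambda a: node["Q"][a]
                             + K * math.sqrt(math.log(node["n"]) / node["n_sa"][a]))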
UCT Algorithm [Kocsis & Szepesvari, 2006]: example

[Figure sequence: the earlier example continued with the UCT tree policy. Once all of a node's actions have been tried at least once, actions at that node are selected according to the tree policy π_UCT(s) = argmax_a [ Q(s,a) + K * sqrt( ln n(s) / n(s,a) ) ]; below the tree the default policy is used. Each iteration again adds one new tree node and updates the counts and value estimates (e.g. 1, 1/2, 1/3, ...) of the nodes on the trajectory.]
UCT Recap
- To select an action at a state s
  - Build a tree using N iterations of Monte-Carlo tree search
    - Default policy is uniform random
    - Tree policy is based on the UCB rule
  - Select the action that maximizes Q(s,a) (note that this final action selection does not take the exploration term into account, just the Q-value estimate)
- The more simulations, the more accurate the estimates
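In code, the final root choice is just the greedy one over the estimates (same node layout as the earlier sketches):

def best_root_action(root_node):
    # No exploration bonus here: pick the action with the highest Q-value estimate.
    return max(root_node["Q"], key=root_node["Q"].get)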
Some Successes
- Computer Go
- Klondike Solitaire (wins 40% of games)
- General Game Playing Competition
- Real-Time Strategy Games
- Combinatorial Optimization
- Crowdsourcing
- The list is growing
- These usually extend UCT in some way
Practical Issues: Selecting K
- There is no fixed rule
- Experiment with different values in your domain
- Rule of thumb: try values of K that are of the same order of magnitude as the reward signal you expect
- The best value of K may depend on the number of iterations N
- Usually a single value of K will work well for values of N that are large enough
Practical Issues
- UCT can have trouble building deep trees when actions can result in a large # of possible next states
- Each time we try action 'a' we get a new state (new leaf) and very rarely resample an existing leaf

[Figure: a root state s with actions a and b, where every sample of action a produces a different child state.]
Practical Issues
- The search then degenerates into a depth-1 tree search
- Solution: introduce a "sparseness parameter" that controls how many new children of an action will be generated (see the sketch below)
- Set the sparseness to something like 5 or 10, then see whether smaller values hurt performance or not
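One way such a sparseness parameter could work (an illustrative sketch, not necessarily the exact mechanism intended by the slides): cap the number of distinct children stored per (state, action) pair and resample among them once the cap is reached, so the tree can grow deeper.

import random

def sample_child(children, s, a, simulate, sparseness=5):
    """Sample a successor of (s, a), keeping at most `sparseness` distinct children.

    children[(s, a)] is a list of (next_state, reward) pairs generated so far.
    Below the cap we call the simulator and remember the new child; at the cap
    we resample uniformly among the stored children instead of adding new leaves.
    """
    kids = children.setdefault((s, a), [])
    if len(kids) < sparseness:
        kids.append(simulate(s, a))
        return kids[-1]
    return random.choice(kids)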
Some Improvements
- Use domain knowledge to handcraft a more intelligent default policy than random
  - E.g. don't choose obviously stupid actions
  - In Go a hand-coded default policy is used
- Learn a heuristic function to evaluate positions
  - Use the heuristic function to initialize leaf nodes (otherwise they are initialized to zero)
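A sketch of the second idea, using the node layout from the earlier UCT sketches (the heuristic function itself is assumed to be supplied, e.g. a learned position evaluator):

def make_leaf_node(state, actions, heuristic=None):
    """Create a new UCT tree leaf; seed Q-values with a heuristic when available."""
    v0 = heuristic(state) if heuristic is not None else 0.0   # otherwise init to zero
    return {"n": 0,
            "n_sa": {a: 0 for a in actions},
            "Q": {a: v0 for a in actions}}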