Monte-Carlo Tree Search
Alan Fern
Introduction
- Rollout does not guarantee optimality or near-optimality
  - It only guarantees policy improvement (under certain conditions)
- Theoretical question: Can we develop Monte-Carlo methods that give us near-optimal policies?
  - With computation that does NOT depend on the number of states!
  - This was an open theoretical question until the late 90's.
- Practical question: Can we develop Monte-Carlo methods that improve smoothly and quickly with more computation time?
  - Multi-stage rollout can improve arbitrarily, but each stage adds a big increase in time per move.
Look-Ahead Trees
- Rollout can be viewed as performing one level of search on top of the base policy
  - Maybe we should search for multiple levels.
- In deterministic games and search problems it is common to build a look-ahead tree at a state to select the best action
  - Can we generalize this to general stochastic MDPs?
- Sparse Sampling is one such algorithm
  - Strong theoretical guarantees of near-optimality

[Figure: a look-ahead tree rooted at the current state, branching on actions a1, a2, ..., ak and expanded for multiple levels.]
Online Planning with Look-Ahead Trees
- At each state we encounter in the environment we build a look-ahead tree of depth h and use it to estimate optimal Q-values of each action
  - Select the action with the highest Q-value estimate
- s = current state
- Repeat (a Python sketch follows below)
  - T = BuildLookAheadTree(s)   ;; sparse sampling or UCT
                                ;; tree provides Q-value estimates for the root actions
  - a = BestRootAction(T)       ;; action with best Q-value
  - Execute action a in the environment
  - s is the resulting state
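A minimal Python sketch of this loop, with the planner and environment interfaces left as placeholders (build_lookahead_tree, best_root_action, and env.execute are illustrative names, not from the slides):

def act_online(env, s, build_lookahead_tree, best_root_action, num_steps):
    for _ in range(num_steps):              # Repeat
        tree = build_lookahead_tree(s)      # sparse sampling or UCT
        a = best_root_action(tree)          # root action with best Q-value estimate
        s = env.execute(a)                  # execute a; s becomes the resulting state
    return s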
Planning with Look-Ahead Trees

[Figure: a real-world state/action sequence; at each state s encountered, a fresh look-ahead tree is built. Each tree branches on actions a1 and a2, samples next states s11, ..., s1w and s21, ..., s2w, and evaluates leaf rewards such as R(s11,a1), ..., R(s2w,a2).]
Expectimax Tree (depth 1)
Alternate max & expectation.

[Figure: the root s is a max node (max over actions); below each action a and b is an expectation node (weighted average over states); its children s1, s2, ..., sn-1, sn are the next states, weighted by their probability of occurring after taking the action in s.]

- After taking each action there is a distribution over next states (nature's moves)
- The value of an action depends on the immediate reward and the weighted value of those states
Expectimax Tree (depth 2)
Alternate max & expectation.

[Figure: the root s is a max node (max over actions); each action a and b leads to an expectation node (weighted average over states); each sampled next state is again a max node over actions a and b, and so on.]

- Max nodes: value equals the max of the values of the child expectation nodes
- Expectation nodes: value is the weighted average of the values of its children
- Compute values from the bottom up (leaf values = 0). Select the root action with the best value.
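Written as a recursion, this bottom-up computation is the standard finite-horizon expectimax backup (β is the discount factor that also appears in the sparse sampling pseudocode later; the depth-1 and depth-2 trees are the cases h = 1, 2):

    Q_h(s,a) = R(s,a) + β * Σ_{s'} P(s'|s,a) * V_{h-1}(s'),
    V_h(s)   = max_a Q_h(s,a),
    V_0(s)   = 0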
Expectimax Tree (depth H)

[Figure: the expectimax tree rooted at s, alternating action branches (a, b) and sampled-state branches, expanded to depth H.]

- In general we can grow the tree to any horizon H
- With k = # actions and n = # states, the size of the tree is O((kn)^H)
- The size depends on the size of the state space. Bad!
Sparse Sampling Tree (depth H)

[Figure: the same tree shape expanded to depth H, but each action node has only w sampled children.]

- Replace the expectation with an average over w samples
- With k = # actions and w = # samples per action, the size of the tree is O((kw)^H)
- w will typically be much smaller than n.
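To see the gap concretely (illustrative numbers, not from the slides): with k = 2 actions, horizon H = 3, n = 10^6 states, and w = 5 samples per action,

    (kn)^H = (2 * 10^6)^3 = 8 * 10^18    vs.    (kw)^H = (2 * 5)^3 = 10^3.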
Sparse Sampling [Kearns et al. 2002]

SparseSampleTree(s, h, w)
  If h == 0 Then Return [0, null]
  For each action a in s
    Q*(s,a,h) = 0
    For i = 1 to w
      Simulate taking a in s, resulting in state s_i and reward r_i
      [V*(s_i,h-1), a*] = SparseSampleTree(s_i, h-1, w)
      Q*(s,a,h) = Q*(s,a,h) + r_i + β V*(s_i,h-1)
    Q*(s,a,h) = Q*(s,a,h) / w        ;; estimate of Q*(s,a,h)
  V*(s,h) = max_a Q*(s,a,h)          ;; estimate of V*(s,h)
  a* = argmax_a Q*(s,a,h)
  Return [V*(s,h), a*]

The Sparse Sampling algorithm computes the root value via depth-first expansion.
It returns a value estimate V*(s,h) of state s and an estimated optimal action a*.
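A compact Python rendering of the pseudocode above, assuming a generative model simulate(s, a) -> (next_state, reward) and an actions(s) function (both illustrative placeholders):

def sparse_sample_tree(s, h, w, actions, simulate, beta=1.0):
    """Depth-first sparse sampling: returns (value estimate V*(s,h), greedy action a*)."""
    if h == 0:
        return 0.0, None
    best_q, best_a = float("-inf"), None
    for a in actions(s):
        q = 0.0
        for _ in range(w):                           # w sampled successors per action
            s_i, r_i = simulate(s, a)                # one call to the generative model
            v_i, _ = sparse_sample_tree(s_i, h - 1, w, actions, simulate, beta)
            q += r_i + beta * v_i
        q /= w                                       # Monte-Carlo estimate of Q*(s,a,h)
        if q > best_q:
            best_q, best_a = q, a
    return best_q, best_a

The number of simulator calls grows on the order of (kw)^H, matching the tree-size bound above.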
SparseSample(h=2, w=2): example
Alternate max & averaging.
- Max nodes: value equals the max of the values of the child expectation nodes
- Average nodes: value is the average of the values of the w sampled children
- Compute values from the bottom up (leaf values = 0). Select the root action with the best value.

[Figure sequence: a worked run of SparseSample(h=2, w=2) from state s. Under action a, the two recursive SparseSample(h=1, w=2) calls return 10 and 0, so a's value is their average, 5. Under action b, the two calls return 4 and 2, so b's value is 3. Select action 'a' since its value is larger than b's.]
Sparse Sampling (Cont'd)
- For a given desired accuracy, how large should the sampling width and depth be?
  - Answered by Kearns, Mansour, and Ng (1999)
- Good news: they give values for w and H that achieve a policy arbitrarily close to optimal
  - The values are independent of the state-space size!
  - First near-optimal general MDP planning algorithm whose runtime didn't depend on the size of the state space
- Bad news: the theoretical values are typically still intractably large --- also exponential in H
  - Exponential in H is the best we can do in general
  - In practice: use a small H and a heuristic at the leaves
Sparse Sampling (Cont'd)
- In practice: use a small H and evaluate the leaves with a heuristic
  - For example, if we have a policy, leaves could be evaluated by estimating the policy value (i.e., the average reward across runs of the policy)

[Figure: a depth-limited sparse sampling tree over actions a1 and a2, with policy simulations run from the leaf nodes.]
Uniform vs. Adaptive Tree Search
- Sparse sampling wastes time on bad parts of the tree
  - It devotes equal resources to each state encountered in the tree
  - We would like to focus on the most promising parts of the tree
- But how do we control exploring new parts of the tree vs. exploiting the promising parts?
- We need an adaptive bandit algorithm that explores more effectively
What now?
- Adaptive Monte-Carlo Tree Search
  - UCB Bandit Algorithm
  - UCT Monte-Carlo Tree Search
Bandits: Cumulative Regret Objective

[Figure: a single state s with k arms/actions a1, a2, ..., ak.]

- Problem: find an arm-pulling strategy such that the expected total reward at time n is close to the best possible (one pull per time step)
  - The optimal strategy (in expectation) is to pull the optimal arm n times
  - UniformBandit is a poor choice --- it wastes time on bad arms
  - We must balance exploring machines to find good payoffs against exploiting current knowledge
Cumulative Regret Objective
- Protocol: at time step n the algorithm picks an arm a_n based on what it has seen so far and receives reward r_n (a_n and r_n are random variables).
- Expected cumulative regret: the difference between the optimal expected cumulative reward and the expected cumulative reward of our strategy at time n
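Writing μ* for the expected reward of the best arm and r_i for the (random) reward received at step i, this is the standard quantity

    E[Reg_n] = n * μ*  -  E[ Σ_{i=1..n} r_i ].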
UCB Algorithm for Minimizing Cumulative Regret
- Q(a): average reward for trying action a (in our single state s) so far
- n(a): number of pulls of arm a so far
- Action choice by UCB after n pulls:

    a_n = argmax_a [ Q(a) + sqrt( 2 ln(n) / n(a) ) ]

- Assumes rewards are in [0,1]. We can always normalize if we know the max value.
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2), 235-256.
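A small Python sketch of this choice rule (the function and argument names are illustrative; it assumes every arm has already been pulled at least once):

import math

def ucb_choose(q, counts, n):
    """UCB1 arm choice after n total pulls.

    q[a]      -- average reward observed for arm a so far (rewards in [0, 1])
    counts[a] -- number of pulls of arm a so far (assumed >= 1)
    n         -- total number of pulls so far
    """
    return max(range(len(q)),
               key=lambda a: q[a] + math.sqrt(2.0 * math.log(n) / counts[a]))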
UCB: Bounded Sub-Optimality

    a_n = argmax_a [ Q(a) + sqrt( 2 ln(n) / n(a) ) ]

- Value term Q(a): favors actions that looked good historically
- Exploration term sqrt( 2 ln(n) / n(a) ): actions get an exploration bonus that grows with ln(n)
- Expected number of pulls of a sub-optimal arm a is bounded by

    (8 / Δ_a^2) ln(n)

  where Δ_a is the sub-optimality of arm a
- UCB doesn't waste much time on sub-optimal arms, unlike uniform!
UCB Performance Guarantee [Auer, Cesa-Bianchi, & Fischer, 2002]

Theorem: The expected cumulative regret of UCB after n arm pulls is bounded by O(log n).

- Is this good? Yes. The average per-step regret is O((log n)/n).

Theorem: No algorithm can achieve a better expected regret (up to constant factors).
What now?
- Adaptive Monte-Carlo Tree Search
  - UCB Bandit Algorithm
  - UCT Monte-Carlo Tree Search
- UCT is an instance of Monte-Carlo Tree Search
  - Applies the principle of UCB
  - Similar theoretical properties to sparse sampling
  - Much better anytime behavior than sparse sampling
- Famous for yielding a major advance in computer Go
- A growing number of success stories
UCT Algorithm [Kocsis & Szepesvari, 2006]
- Builds a sparse look-ahead tree rooted at the current state by repeated Monte-Carlo simulation of a "rollout policy"
- During construction each tree node s stores:
  - state-visitation count n(s)
  - action counts n(s,a)
  - action values Q(s,a)
- Repeat until time is up (see the sketch after this list)
  1. Execute the rollout policy starting from the root until the horizon (this generates a state-action-reward trajectory)
  2. Add the first node not in the current tree to the tree
  3. Update the statistics of each tree node s on the trajectory
     - Increment n(s) and n(s,a) for the selected action a
     - Update Q(s,a) by the total reward observed after the node
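One possible Python skeleton of a single iteration (the node layout and helper names -- simulate, rollout_policy -- are assumptions for illustration, not the authors' code):

def uct_iteration(root_state, horizon, tree, simulate, rollout_policy):
    """One UCT iteration: roll out a trajectory, add one node, back up returns.

    tree maps a state to {"n": visits, "n_sa": {a: count}, "Q": {a: value}};
    simulate(s, a) returns (next_state, reward); rollout_policy(s, tree) uses
    the tree policy inside the tree and the default policy outside it.
    """
    trajectory, s = [], root_state
    for _ in range(horizon):                     # 1. execute rollout policy to horizon
        a = rollout_policy(s, tree)
        s_next, r = simulate(s, a)
        trajectory.append((s, a, r))
        s = s_next
    for s_t, _, _ in trajectory:                 # 2. add first node not yet in the tree
        if s_t not in tree:
            tree[s_t] = {"n": 0, "n_sa": {}, "Q": {}}
            break
    ret = 0.0                                    # 3. back up total reward observed after each node
    for s_t, a_t, r_t in reversed(trajectory):
        ret += r_t
        if s_t in tree:
            node = tree[s_t]
            node["n"] += 1
            node["n_sa"][a_t] = node["n_sa"].get(a_t, 0) + 1
            node["Q"][a_t] = (node["Q"].get(a_t, 0.0)
                              + (ret - node["Q"].get(a_t, 0.0)) / node["n_sa"][a_t])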
Monte-Carlo Tree Search
- For illustration purposes we will assume the MDP is:
  - Deterministic
  - Only non-zero rewards are at terminal/leaf nodes
- The algorithm is well-defined without these assumptions
Example MCTS

[Figure sequence (Iterations 1-4): MCTS run from the current world state, assuming all non-zero reward occurs at terminal nodes. Initially the tree is a single leaf. In each iteration the rollout policy is executed from the root to a terminal state (reward 1 or 0), the first node not already in the tree is added as a new tree node, and the value estimates of the nodes on the trajectory are updated as running averages of the observed returns (e.g. an estimate of 1 after one reward-1 rollout becomes 1/2 and then 1/3 as reward-0 rollouts follow).]
Rollout Policies
What should the rollout policy be?
- Monte-Carlo Tree Search algorithms mainly differ in their choice of rollout policy
- Rollout policies have two distinct phases
  - Tree policy: selects actions at nodes already in the tree (each action must be selected at least once)
  - Default policy: selects actions after leaving the tree
- Key idea: the tree policy can use statistics collected from previous trajectories to intelligently expand the tree in the most promising direction
  - Rather than uniformly exploring actions at each node
- Basic UCT uses a random default policy
  - In practice we often use a hand-coded or learned policy
- The tree policy is based on UCB:
  - Q(s,a): average reward received in trajectories so far after taking action a in state s
  - n(s,a): number of times action a has been taken in s
  - n(s): number of times state s has been encountered

    π_UCT(s) = argmax_a [ Q(s,a) + K * sqrt( ln n(s) / n(s,a) ) ]

  where K is a theoretical constant that must be selected empirically in practice
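The same rule as a small Python helper, using the node layout from the earlier sketch (K is the exploration constant; untried actions would be selected first, before this rule applies):

import math

def uct_tree_policy(node, K):
    """UCT action choice at a tree node whose actions have all been tried once."""
    return max(node["Q"],
               key=lambda a: node["Q"][a]
                             + K * math.sqrt(math.log(node["n"]) / node["n_sa"][a]))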
UCT Algorithm [Kocsis & Szepesvari, 2006]: example

[Figure sequence: the earlier example continued with the UCT tree policy. Once all of a node's actions have been tried at least once, actions at that node are selected according to the tree policy π_UCT(s) = argmax_a [ Q(s,a) + K * sqrt( ln n(s) / n(s,a) ) ]; below the tree the default policy is used. Each iteration again adds one new tree node and updates the counts and value estimates (e.g. 1, 1/2, 1/3, ...) of the nodes on the trajectory.]
UCT Recap
- To select an action at a state s
  - Build a tree using N iterations of Monte-Carlo tree search
    - Default policy is uniform random
    - Tree policy is based on the UCB rule
  - Select the action that maximizes Q(s,a) (note that this final action selection does not take the exploration term into account, just the Q-value estimate)
- The more simulations, the more accurate the estimates
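In code, the final root choice is just the greedy one over the estimates (same node layout as the earlier sketches):

def best_root_action(root_node):
    # No exploration bonus here: pick the action with the highest Q-value estimate.
    return max(root_node["Q"], key=root_node["Q"].get)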
Some Successes
- Computer Go
- Klondike Solitaire (wins 40% of games)
- General Game Playing Competition
- Real-Time Strategy Games
- Combinatorial Optimization
- Crowdsourcing
- The list is growing
- These usually extend UCT in some way
Practical Issues: Selecting K
- There is no fixed rule
- Experiment with different values in your domain
- Rule of thumb: try values of K that are of the same order of magnitude as the reward signal you expect
- The best value of K may depend on the number of iterations N
- Usually a single value of K will work well for values of N that are large enough
Practical Issues
- UCT can have trouble building deep trees when actions can result in a large # of possible next states
- Each time we try action 'a' we get a new state (new leaf) and very rarely resample an existing leaf

[Figure: a root state s with actions a and b, where every sample of action a produces a different child state.]
Practical Issues
- The search then degenerates into a depth-1 tree search
- Solution: introduce a "sparseness parameter" that controls how many new children of an action will be generated (see the sketch below)
- Set the sparseness to something like 5 or 10, then see whether smaller values hurt performance or not
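One way such a sparseness parameter could work (an illustrative sketch, not necessarily the exact mechanism intended by the slides): cap the number of distinct children stored per (state, action) pair and resample among them once the cap is reached, so the tree can grow deeper.

import random

def sample_child(children, s, a, simulate, sparseness=5):
    """Sample a successor of (s, a), keeping at most `sparseness` distinct children.

    children[(s, a)] is a list of (next_state, reward) pairs generated so far.
    Below the cap we call the simulator and remember the new child; at the cap
    we resample uniformly among the stored children instead of adding new leaves.
    """
    kids = children.setdefault((s, a), [])
    if len(kids) < sparseness:
        kids.append(simulate(s, a))
        return kids[-1]
    return random.choice(kids)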
Some Improvements
- Use domain knowledge to handcraft a more intelligent default policy than random
  - E.g. don't choose obviously stupid actions
  - In Go a hand-coded default policy is used
- Learn a heuristic function to evaluate positions
  - Use the heuristic function to initialize leaf nodes (otherwise they are initialized to zero)
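A sketch of the second idea, using the node layout from the earlier UCT sketches (the heuristic function itself is assumed to be supplied, e.g. a learned position evaluator):

def make_leaf_node(state, actions, heuristic=None):
    """Create a new UCT tree leaf; seed Q-values with a heuristic when available."""
    v0 = heuristic(state) if heuristic is not None else 0.0   # otherwise init to zero
    return {"n": 0,
            "n_sa": {a: 0 for a in actions},
            "Q": {a: v0 for a in actions}}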