
Static and Dynamic Values of Computation in MCTS

Eren Sezener
DeepMind

Peter Dayan
Max Planck Institute for Biological Cybernetics, Tübingen
University of Tübingen

Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR volume 124, 2020.

Abstract

Monte-Carlo Tree Search (MCTS) is one of the most widely used methods for planning, and has powered many recent advances in artificial intelligence. In MCTS, one typically performs computations (i.e., simulations) to collect statistics about the possible future consequences of actions, and then chooses accordingly. Many popular MCTS methods such as UCT and its variants decide which computations to perform by trading off exploration and exploitation. In this work, we take a more direct approach, and explicitly quantify the value of a computation based on its expected impact on the quality of the action eventually chosen. Our approach goes beyond the myopic limitations of existing computation-value-based methods in two senses: (I) we are able to account for the impact of non-immediate (i.e., future) computations (II) on non-immediate actions. We show that policies that greedily optimize computation values are optimal under certain assumptions and obtain results that are competitive with the state-of-the-art.

1 INTRODUCTION

Monte Carlo tree search (MCTS) is a widely used approximate planning method that has been successfully applied to many challenging domains such as computer Go [4, 21]. In MCTS, one estimates values of actions by stochastically expanding a search tree—capturing potential future states and actions with their respective values. Most MCTS methods rely on rules concerning how to expand the search tree, typically trading off exploration and exploitation, as in UCT [11]. However, since no "real" reward accrues during internal search, UCT can be viewed as a heuristic [7, 8, 23]. In this paper, we propose a more direct approach by calculating the values of MCTS computations (i.e., tree expansions/simulations).

In the same way that the value of an action in a Markov decision process (MDP) depends on subsequent actions, the value of a computation in MCTS should reflect subsequent computations. However, computing the optimal computation values—the value of a computation under an optimal computation policy—is known to be generally intractable [14, 18]. Therefore, one often resorts to "myopic" approximations of computation values, such as considering the impact of only the immediate computation in isolation from the subsequent computations. For instance, it has been shown that simple modifications to UCT, where myopic computation values inform the tree policy at the root node, can yield significant improvements [8, 23].

In this work, we propose tractable yet non-myopic methods for calculating computation values, going beyond the limitations of existing methods. To this end, we introduce static and dynamic value functions that form lower and upper bounds for state-action values in MCTS, independent of any future computations. We then utilize these functions to define static and dynamic values of computation, capturing the expected change in state values resulting from a computation. We show that the existing myopic computation value definitions in MCTS can be seen as specific instances of static computation values. The dynamic value function, on the other hand, is a novel measure, and it enables non-myopic ways of selecting MCTS computations. We prove that policies that greedily maximize static/dynamic computation values are asymptotically optimal under certain assumptions. Furthermore, we also show that they outperform various MCTS baselines empirically.


2 BACKGROUND

In this section, we cover some of the relevant literature and introduce the notation.

2.1 MONTE CARLO TREE SEARCH

MCTS algorithms function by incrementally and stochastically building a search tree (given an environment model) to approximate state-action values. This incremental growth prioritizes the promising regions of the search space by directing the growth of the tree towards high-value states. To elaborate, a tree policy is used to traverse the search tree and select a node which is not fully expanded—meaning it has immediate successors that aren't included in the tree. Then, the node is expanded once by adding one of its unexplored children to the tree, from which a trajectory is simulated for a fixed number of steps or until a terminal state is reached. Such trajectories are generated using a rollout policy, which is typically fast to compute—for instance, random and uniform. The outcome of this trajectory—i.e., the cumulative discounted rewards along the trajectory—is used to update the value estimates of the nodes in the tree that lie along the path from the root to the expanded node.

Upper Confidence Bounds applied to trees (UCT) [11] adapts a multi-armed bandit algorithm called UCB1 [1] to MCTS. More specifically, UCT's tree policy applies the UCB1 algorithm recursively down the tree starting from the root node. At each level, UCT selects the most promising action at state $s$ via

$$\arg\max_{a \in \mathcal{A}_s} \; Q(s, a) + c \sqrt{\frac{2 \log N(s)}{N(s, a)}},$$

where $\mathcal{A}_s$ is the set of actions available at $s$, $N(s, a)$ is the number of times $(s, a)$ has been visited, $N(s) := \sum_{a \in \mathcal{A}_s} N(s, a)$, $Q(s, a)$ is the average reward obtained by performing rollouts from $(s, a)$ or one of its descendants, and $c$ is a positive constant, which is typically selected empirically. The second term of the UCT rule assigns higher scores to nodes that are visited less frequently; as such, it can be thought of as an exploration bonus.

UCT is simple and has successfully been utilized for many applications. However, it has also been noted [8, 23] that UCT's goal is different from that of approximate planning. UCT attempts to ensure that the agent experiences little regret associated with the actions that are taken during the Monte Carlo simulations that comprise planning. However, since these simulations do not involve taking actions in the environment, the agent actually experiences no true regret at all. Thus, failing to explore actions based on this consideration could slow down the discovery of their superior or inferior quality.

2.2 METAREASONING & VALUE OF INFORMATION

Howard [9] was the first to quantify mathematically the economic gain from obtaining a piece of information. Russell and Wefald [18] formulated the rational metareasoning framework, which is concerned with how one should assign values to meta-level actions (i.e., computations). Hay et al. [8] and Tolpin and Shimony [23] applied the principles of this framework to MCTS by modifying the tree policy of UCT at the root node such that the selected child node maximizes the value of information. They showed empirically that such a simple modification can yield significant improvements.

The field of Bayesian optimization has evolved in parallel. For instance, what are known as knowledge gradients [19, 24] are equivalent to information/computation value formulations for flat/stateless problems such as multi-armed bandits.

Computation values have also been used to explain human and animal behavior. For example, it has been suggested that humans might leverage computation values to solve planning tasks in a resource-efficient manner [13, 20], and animals might improve their policies by "replaying" memories with large computation values [17].

2.3 NOTATION

A finite Markov decision process (MDP) is a 5-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $P$ is the transition function such that $P^a_{ss'} = P(s_{t+1} = s' \mid s_t = s, a_t = a)$, where $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$, $R$ is the expected immediate reward function such that $R^a_{ss'} = \mathbb{E}[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s']$, where again $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$, and $\gamma$ is the discount factor such that $\gamma \in [0, 1)$.

We assume an agent interacts with the environment via a (potentially stochastic) policy $\pi$, such that $\pi(s, a) = P(a_t = a \mid s_t = s)$. These probabilities typically depend on parameters; these are omitted from the notation. The value of an action $a$ at state $s$ is defined as the expected cumulative discounted reward obtained by following policy $\pi$, that is, $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i} \mid s_t = s, a_t = a\right]$.

The optimal action value function is defined as $Q^*(s, a) = \max_\pi Q^\pi(s, a)$ for all state-action pairs, and satisfies the Bellman optimality recursion:

$$Q^*(s, a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right].$$

We use $\mathcal{N}(\mu, \Sigma)$ and $\mathcal{N}(\mu, \sigma^2)$ to denote a multivariate and a univariate Normal distribution, respectively, with mean vector/value $\mu$ and covariance matrix $\Sigma$ or scale $\sigma$.


3 STATE-ACTION VALUES IN MCTS

To motivate the issues underlying this paper, consider the following example (Figure 1). Here, there are two rooms: one containing two boxes and the other containing five boxes. Each box contains an unknown but i.i.d. amount of money; and you are ultimately allowed to open only one box. However, you do so in stages. First you must choose a room, then you can open one of the boxes and collect the money. Which room should you choose? What if you know ahead of time that you could peek inside the boxes after choosing the room?

Figure 1: Illustration of the example. There are two rooms, one containing two boxes, another one containing five boxes. There is an unknown amount of money in each box.

In the first case, it doesn't matter which room one chooses, as all the boxes are equally valuable in expectation in the absence of any further information. By contrast, in the second case, choosing the room with five boxes is the better option. This is because one can obtain further information by peeking inside the boxes—and more boxes mean more money in expectation, as one has the option to choose the best one.

Formally, let $X = \{x_i\}_{i=1}^{n_x}$ and $Y = \{y_i\}_{i=1}^{n_y}$ be sets of random variables denoting the rewards in the boxes of the first and the second room respectively. Assume all the rewards are sampled i.i.d. and $n_x < n_y$. Then we have $\max_{x \in X} \mathbb{E}[x] = \max_{y \in Y} \mathbb{E}[y]$, which is why the two rooms are equally valuable if one has to choose a box blindly. On the other hand, $\mathbb{E}[\max_{x \in X} x] < \mathbb{E}[\max_{y \in Y} y]$, which is analogous to the case where boxes can be peeked in first.
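As a quick numerical illustration of this gap (with hypothetical standard-Normal rewards, not a setting from the paper): the maximum of the expectations is zero for both rooms, while a Monte Carlo estimate of the expectation of the maximum grows with the number of boxes.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_max(n_boxes, n_samples=100_000):
    # Rewards are i.i.d. standard Normal (a hypothetical choice for illustration).
    rewards = rng.normal(size=(n_samples, n_boxes))
    return rewards.max(axis=1).mean()

# max_x E[x] = 0 for both rooms, since every box has mean 0, but:
print("E[max] with 2 boxes:", expected_max(2))   # ~0.56
print("E[max] with 5 boxes:", expected_max(5))   # ~1.16
```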

If we consider MCTS with this example in mind, when we want to value an action at the root of the tree, backing up the estimated mean values of the actions lower in the tree may be insufficient. This is because the value of a root action is a convex combination of the "downstream" (e.g., leaf) actions; and, as such, uncertainty in the values of the leaves contributes to the expected value at the root due to Jensen's inequality. We formalize this notion of value as the dynamic value in the following section and utilize it to define computation values later on.

3.1 STATIC AND DYNAMIC VALUES

We assume a planning setting where the environment dynamics (i.e., $P$ and $R$) are known. We could then compute $Q^*$ in principle; however, this is typically computationally intractable. Therefore, we estimate $Q^*$ by performing computations such as random environment simulations (e.g., MCTS rollouts). Note that our uncertainty about $Q^*$ is not epistemic—the environment dynamics are known—but computational. In other words, if we do not know $Q^*$, it is because we haven't performed the necessary computations. In this subsection, we introduce static and dynamic value functions, which are "posterior" estimates of $Q^*$ conditioned on computations.

Let us unroll the Bellman optimality equation for $n$ steps, obtaining what is sometimes referred to as the $n$-step Bellman equation. For a given "root" state $s_\rho$, let $\mathcal{L}_n(s_\rho)$ be the set of leaf state-actions—that is, the state-actions that can be transitioned to from $s_\rho$ in exactly $n$ steps. Let $Q^*_0$ be a random function denoting our prior beliefs about the optimal value function $Q^*$ over $\mathcal{L}_n(s_\rho)$. We then use $Q^*_n(s, a)$ to denote the (Bayes-)optimal state-action value, which we define as a function of $Q^*_0$:

$$Q^*_n(s, a) = \begin{cases} Q^*_0(s, a) & \text{if } n = 0 \\ \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a' \in \mathcal{A}_{s'}} Q^*_{n-1}(s', a') \right] & \text{else,} \end{cases}$$

where $\mathcal{A}_{s'}$ is the set of actions available at $s'$.

We assume it is possible to obtain noisy evaluations of $Q^*$ for leaf state-actions by performing computations such as trajectory simulations. We further assume that the process by which a state-action value is sampled is given, and we are interested in determining which state-action to sample from. Therefore, we associate each computation with a single state-action in $\mathcal{L}_n(s_\rho)$; but the outcome of a computation might be informative for multiple leaf values if they are dependent. Let $\omega := (s, a) \in \mathcal{L}_n(s_\rho)$ be a candidate computation. We denote the unknown outcome of this computation at time $t$ with the random variable $O_{\omega t}$ (or equivalently, $O_{sat}$), which we assume to be $O_{\omega t} = Q^*_0(\omega) + \epsilon_t$, where $\epsilon_t$ is an unknown noise term with a known distribution, sampled i.i.d. for each $t$. If we associate a candidate computation $\omega$ with its unknown outcome $O_{\omega t}$ at time $t$, we refer to the resulting tuple as a closure and denote it as $\bar{\omega}_t := (\omega, O_{\omega t})$. Finally, we denote a performed computation at time $t$, by dropping the bar, as $\omega_t := (\omega, o_{\omega t})$, where $o_{\omega t}$ (or equivalently, $o_{sat}$) is the observed outcome of the computation, which we assume to satisfy $o_{\omega t} \sim O_{\omega t}$, and thus $\omega_t \sim \bar{\omega}_t$. In the context of MCTS, $o_{sat}$ will be the cumulative discounted reward of a trajectory simulated from $(s, a)$ at time $t$. We will obtain these trajectories using an adaptive, asymptotically optimal sampler/simulator (e.g., UCT), such that $\lim_{t \to \infty} \mathbb{E}[O_{sat}] = Q^*(s, a)$. This means that $\{O_{sat}\}_t$ is a non-stationary stochastic process in practice; yet, we will treat it as a stationary process, as reflected in our i.i.d. assumption.

Let $\omega_{1:t}$ be a sequence of $t$ performed computations concerning arbitrary state-actions in $\mathcal{L}_n(s_\rho)$, and let $s_\rho$ be the current state of the agent, on which we can condition $Q^*_0$. Because $\omega_{1:t}$ contains the necessary statistics to compute the posterior leaf values, we will sometimes refer to it as the knowledge state. We denote the resulting posterior values for a $(s, a) \in \mathcal{L}_n(s_\rho)$ as $Q^*_0(s, a) \mid \omega_{1:t}$, and the joint values of the leaves as $Q^*_0 \mid \omega_{1:t} = \left(Q^*_0(s, a) \mid \omega_{1:t} : (s, a) \in \mathcal{L}_n(s_\rho)\right)$.

We define the dynamic value function as the expected value the agent should assign to an action at $s_\rho$ given $\omega_{1:t}$, assuming it could resolve all of the remaining uncertainty about the posterior leaf state-action values $Q^*_0 \mid \omega_{1:t}$.

Definition 1. The dynamic value function is defined as $\psi_n(s, a \mid \omega_{1:t}) := \mathbb{E}_{Q^*_0 \mid \omega_{1:t}}\left[\Upsilon_n(s, a \mid \omega_{1:t})\right]$, where

$$\Upsilon_n(s, a \mid \omega_{1:t}) := \begin{cases} Q^*_0(s, a) \mid \omega_{1:t} & \text{if } n = 0 \\ \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a' \in \mathcal{A}_{s'}} \Upsilon_{n-1}(s', a' \mid \omega_{1:t}) \right] & \text{else,} \end{cases}$$

where $\mathcal{A}_{s'}$ is the set of actions available at $s'$.

The 'dynamic' in the term reflects the fact that the agent may change its mind about the best actions available at each state within $n$ steps; yet, this is reflected and accounted for in $\psi_n$. A useful property of $\psi_n$ is that it is time-consistent, in the sense that it does not change with further computations in expectation. Let $\bar{\omega}_{1:k} := \{(\omega_i, O_{\omega_i i})\}_{i=1}^k$ be a sequence of $k$ closures. Then the following equality holds for any $\bar{\omega}_{1:k}$:

$$\psi_n(s_\rho, a \mid \omega_{1:t}) = \mathbb{E}_{\bar{\omega}_{1:k}}\left[\psi_n(s_\rho, a \mid \omega_{1:t}\bar{\omega}_{1:k})\right], \qquad (1)$$

due to the law of total expectation, where $\omega_{1:t}\bar{\omega}_{1:k}$ denotes concatenation. This might seem paradoxical: why perform computations if action values do not change in expectation? The reason is that we care about the maximum of dynamic values over actions at $s_\rho$, which increases in expectation as long as computations resolve some uncertainty. Formally,

$$\max_{a \in \mathcal{A}_{s_\rho}} \psi_n(s_\rho, a \mid \omega_{1:t}) \le \mathbb{E}_{\bar{\omega}_{1:k}}\left[\max_{a \in \mathcal{A}_{s_\rho}} \psi_n(s_\rho, a \mid \omega_{1:t}\bar{\omega}_{1:k})\right],$$

due to Jensen's inequality, just as in the example of the boxes.

Dynamic values capture one extreme: valuation of actions assuming perfect information in the future. Next, we consider the other extreme, valuation under zero information in the future, which is given by the static value function.

Definition 2. We define the static value function as

$$\phi_n(s, a \mid \omega_{1:t}) := \begin{cases} \mathbb{E}\left[Q^*_0(s, a) \mid \omega_{1:t}\right] & \text{if } n = 0 \\ \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a' \in \mathcal{A}_{s'}} \phi_{n-1}(s', a' \mid \omega_{1:t}) \right] & \text{else,} \end{cases}$$

where $\mathcal{A}_{s'}$ is the set of actions available at $s'$.

In other words, $\phi_n(s_\rho, a)$ captures how valuable $(s_\rho, a)$ would be if the agent were to take $n$ actions before running any new computations. In Figure 2, we graphically contrast dynamic and static values, where the difference is the stage at which the expectation is taken. For the former, it is done at the level of the root actions; for the latter, at the level of the leaves.
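To illustrate Definitions 1 and 2, the sketch below computes $\phi_1$ and $\psi_1$ (the latter by Monte Carlo) for a toy problem with deterministic transitions, zero immediate rewards, no discounting, and independent Normal posterior beliefs over the leaf values; the specific layout and numbers are assumptions for illustration only, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy problem: root action a leads deterministically to a child state whose
# leaf actions carry independent Normal posterior beliefs Q*_0 ~ N(mu, sigma^2).
leaf_beliefs = {
    "a0": [(0.5, 1.0), (0.4, 1.0)],                                       # two leaves under a0
    "a1": [(0.3, 1.0), (0.2, 1.0), (0.1, 1.0), (0.0, 1.0), (0.3, 1.0)],   # five leaves under a1
}

def static_value(leaves):
    # phi_1: maximum over leaf actions of the posterior means (expectation taken at the leaves).
    return max(mu for mu, _ in leaves)

def dynamic_value(leaves, n_samples=200_000):
    # psi_1: expected maximum of the leaf values (expectation taken at the root), by Monte Carlo.
    mus = np.array([mu for mu, _ in leaves])
    sds = np.array([sd for _, sd in leaves])
    samples = rng.normal(mus, sds, size=(n_samples, len(mus)))
    return samples.max(axis=1).mean()

for a, leaves in leaf_beliefs.items():
    print(a, "static:", round(static_value(leaves), 3),
             "dynamic:", round(float(dynamic_value(leaves)), 3))
# Static values favor a0, while dynamic values favor a1 (more leaves, more upside),
# mirroring the two-boxes-versus-five-boxes example.
```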

Going back to our example with the boxes, the dynamic value of a room assumes that you open all the boxes after entering the room, whereas the static value assumes you do not open any boxes. What can we say about the in-between cases: action values under a finite number of future computations? Assume we know that the agent will perform $k$ computations before taking an action at $s_\rho$. The optimal allocation of these $k$ computations to leaf nodes is known to be intractable even in a simpler bandit setting [16]. That said, for any allocation (and for any finite $k$), static and dynamic values will nevertheless form lower and upper bounds on expected action values. We formalize this for a special case below.

Proposition 1. Assume an agent at state $s_\rho$ and knowledge state $\omega_{1:t}$ decides to perform $\omega_{1:k}$, a sequence of $k$ candidate computations, before taking $n$ actions. Then the expected future value of $a \in \mathcal{A}_{s_\rho}$ prior to observing any of the $k$ computation outcomes is equal to $\mathbb{E}_{\bar{\omega}_{1:k}}\left[\phi_n(s_\rho, a \mid \omega_{1:t}\bar{\omega}_{1:k})\right]$, where $\bar{\omega}_{1:k} = \{(\omega_i, O_{\omega_i i})\}_{i=1}^k$. Then,

$$\psi_n(s_\rho, a \mid \omega_{1:t}) \;\ge\; \mathbb{E}_{\bar{\omega}_{1:k}}\left[\phi_n(s_\rho, a \mid \omega_{1:t}\bar{\omega}_{1:k})\right] \;\ge\; \phi_n(s_\rho, a \mid \omega_{1:t}),$$

where both bounds are tight.

The proof is provided in the Appendix.

4 VALUE OF COMPUTATION

In this section, we use the static and dynamic value functions to define computation values. We show that these computation values have desirable properties and that policies greedily maximizing these values are optimal in certain senses. Lastly, we compare our definitions to the existing computation value definitions.

Figure 2: Graphical illustration of the dynamic ($\psi_n$) and static ($\phi_n$) value functions for $n = 2$. We ignore immediate rewards and the discounting for simplicity. In Panel A, dynamic values (given by $\psi_2$) are obtained by calculating the expected maximum of all state-action values (given by $Q^*_0$) lying 2 steps away, whereas the static values (given by $\phi_2$) are obtained by calculating the maximum of expectations of state-action values, as shown in Panel B.

Definition 3. We define the value of computation at state $s_\rho$ for a sequence of candidate computations $\omega_{1:k}$, given a static or dynamic value function $f \in \{\phi_n, \psi_n\}$ and a knowledge state $\omega_{1:t}$, as

$$\mathrm{VOC}_f(s_\rho, \omega_{1:k} \mid \omega_{1:t}) = \mathbb{E}_{\bar{\omega}_{1:k}}\left[\max_{a \in \mathcal{A}_{s_\rho}} f(s_\rho, a \mid \omega_{1:t}\bar{\omega}_{1:k})\right] - \max_{a \in \mathcal{A}_{s_\rho}} f(s_\rho, a \mid \omega_{1:t}),$$

where $\omega_{1:k}$ specifies the state-actions in $\bar{\omega}_{1:k}$, that is, $\bar{\omega}_{1:k} = \{(\omega_i, O_{\omega_i i})\}_{i=1}^k$, where $\omega_i$ is the $i$th element of $\omega_{1:k}$.

We refer to computation policies that choose computations based on greedy maximization of VOC one by one (i.e., $k = 1$) as VOC($\phi_n$)-greedy and VOC($\psi_n$)-greedy, depending on which value function is utilized. We assume these policies stop if and only if $\forall \omega : \mathrm{VOC}_f(s_\rho, \omega \mid \omega_{1:t}) = 0$. Our greedy policies consider and select computations one by one. Alternatively, one could perform a forward search over future computation sequences, similar to the search over future action sequences in Guez et al. [6]. However, this adds another meta-level to our metareasoning problem, further increasing the computational burden.

We analyze these greedy policies in terms of the Bayesian simple regret, which we define as the difference between two values. The first is the maximum value the agent could expect to reap assuming it can perform infinitely many computations, thus resolving all the uncertainty, before committing to an immediate action. Given our formulation, this is identical to $\mathbb{E}_{Q^*_0 \mid \omega_{1:t}}\left[\max_{a \in \mathcal{A}_{s_\rho}} \Upsilon_n(s_\rho, a \mid \omega_{1:t})\right]$, and thus is independent of the agent's action policy and the (future) computation policy for a given knowledge state $\omega_{1:t}$. Furthermore, it remains constant in expectation as the knowledge state expands. The second term in the regret is the maximum static/dynamic action value assuming the agent cannot expand its knowledge state before taking an action.

Definition 4. Given a knowledge state $\omega_{1:t}$, we define the Bayesian simple regret at state $s_\rho$ as

$$R_f(s_\rho, \omega_{1:t}) = \mathbb{E}\left[\max_{a} \Upsilon_n(s_\rho, a \mid \omega_{1:t})\right] - \max_{a} f(s_\rho, a \mid \omega_{1:t}),$$

where $f \in \{\phi_n, \psi_n\}$.

Based on this definition, we have the following.

Proposition 2. VOC($\phi_n$)-greedy and VOC($\psi_n$)-greedy choose the computation that maximizes the expected decrease in $R_{\phi_n}(s_\rho, \omega_{1:t})$ and $R_{\psi_n}(s_\rho, \omega_{1:t})$, respectively.

The proof is provided in the Appendix.

We refer to policies that choose the regret-minimizing computation as being one-step optimal. Note that this result is different from what is typically referred to as myopic optimality. Myopia refers to considering the impact of a single computation only, whereas the VOC($\psi$)-greedy policy accounts for the impact of possible future computations that succeed the immediate action.

Proposition 3. Given an infinite computation budget, VOC($\phi_n$)-greedy and VOC($\psi_n$)-greedy policies will find the optimal action at the root state.

Proof sketch. Both policies will perform all computations infinitely many times, as shown for the flat setting in Ryzhov et al. [19]. Thus, the dynamic and static values at the leaves (i.e., for $\mathcal{L}_n(s_\rho)$) will converge to the true optimal values (given by $Q^*$), and so will the downstream values.


We refer to such policies as being asymptotically optimal.

4.1 ALTERNATIVE VOC DEFINITIONS

A common formulation [8, 10, 18, 23] for the value of computation is

$$\mathrm{VOC}'_f(s_\rho, \omega_{1:k} \mid \omega_{1:t}) = \mathbb{E}_{\bar{\omega}_{1:k}}\left[\max_{a \in \mathcal{A}_{s_\rho}} f(s_\rho, a \mid \omega_{1:t}\bar{\omega}_{1:k}) - f(s_\rho, \alpha \mid \omega_{1:t}\bar{\omega}_{1:k})\right], \qquad (2)$$

where $\alpha := \arg\max_a f(s_\rho, a \mid \omega_{1:t})$ and $f$ is a value function as before.

The difference between this and Definition 3 is that the second term in $\mathrm{VOC}'$ conditions $f$ also on $\bar{\omega}_{1:k}$. This might seem intuitively correct: $\mathrm{VOC}'$ is positive if and only if the policy at $s_\rho$ changes with some probability, that is, $P\left(\arg\max_{a \in \mathcal{A}_{s_\rho}} f(s_\rho, a \mid \omega_{1:t}\bar{\omega}_{1:k}) \ne \alpha\right) > 0$. However, this approach can be too myopic, as it often takes multiple computations for the policy to change [8]. Note that this is particularly troublesome for static values ($f = \phi_n$), which commonly arise in methods such as UCT that estimate mean returns of rollouts.

Proposition 4. VOC'($\phi_n$)-greedy is neither one-step optimal nor asymptotically optimal.

By contrast, dynamic value functions escape this problem.

Proposition 5. For any $\omega_{1:k}$ and $\omega_{1:t}$, we have $\mathrm{VOC}'_{\psi_n}(s_\rho, \omega_{1:k} \mid \omega_{1:t}) = \mathrm{VOC}_{\psi_n}(s_\rho, \omega_{1:k} \mid \omega_{1:t})$.

Both propositions are proved in the Appendix.

5 VALUE OF COMPUTATION IN MCTS

We now introduce an MCTS method based on the VOC-greedy policies we have introduced. For this, as done in other information/computation-value-based MCTS methods [8, 23], we utilize UCT as a "base" policy—meaning we call UCT as a subroutine to draw samples from leaf nodes. Because UCT is adaptive, these samples will be drawn from a non-stationary stochastic process in practice; yet, we will treat them as being i.i.d.

We introduce the model informally here, and provide the exact formulas and pseudocode in the Appendix. For simplicity, we assume no discounting, i.e., $\gamma = 1$, and zero immediate rewards within $n$ steps of the root node, though as we show in the Appendix, the results generalize trivially.

We assume $Q^*_0 \sim \mathcal{N}(\mu_0, \Sigma_0)$, where $\mu_0$ is a prior mean vector and $\Sigma_0$ is a prior covariance matrix. We assume both of these quantities are known—but it is also possible to assume a Wishart prior over $\Sigma_0$ or to employ optimization methods from the Gaussian process literature (e.g., maximizing the likelihood function via gradient descent). We assume computations return evaluations of $Q^*_0$ with added Normal noise with known parameters. Then, the posterior value function $Q^*_0 \mid \omega_{1:t}$ can be computed in $O(t)$ for an isotropic prior covariance, in $O(tm^2)$ using recursive update rules for multivariate Normal priors, where $m = |Q^*_0|$ is the number of leaf nodes, or in $O(t^3)$ using Gaussian process priors. We omit the exact form of the posterior distribution here as it is a standard result.
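For the isotropic case, the posterior over each leaf value reduces to the standard Normal-Normal conjugate update applied to that leaf's rollout returns. A minimal sketch with hypothetical prior and noise parameters (the paper's exact formulas are in its Appendix):

```python
import numpy as np

def leaf_posterior(returns, mu0=0.0, var0=1.0, noise_var=1.0):
    """Posterior N(mu, var) over a single leaf value Q*_0(s, a) after observing i.i.d.
    rollout returns o_1, ..., o_t ~ N(Q*_0(s, a), noise_var), with prior N(mu0, var0)."""
    t = len(returns)
    precision = 1.0 / var0 + t / noise_var
    var = 1.0 / precision
    mu = var * (mu0 / var0 + np.sum(returns) / noise_var)
    return mu, var

# Example: three rollout returns observed from one leaf state-action.
print(leaf_posterior([0.8, 1.2, 1.0]))
```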

For computing the VOC($\phi_n$)-greedy policy, we need to evaluate how the expected values at the leaves change with a candidate computation $\omega = (s, a)$, i.e., $\mathbb{E}_{\bar{\omega}}\left[\mathbb{E}_{Q^*_0 \mid \omega_{1:t}\bar{\omega}}[Q^*_0 \mid \omega_{1:t}\bar{\omega}]\right]$, where $\bar{\omega} = (\omega, O_{\omega\,t+1})$. Note that $O_{\omega\,t+1}$, conditioned on $\omega_{1:t}$, gives the posterior predictive distribution for rollout returns from $(s, a)$, and is Normally distributed. Thus, $\mathbb{E}_{Q^*_0 \mid \omega_{1:t}\bar{\omega}}[Q^*_0 \mid \omega_{1:t}\bar{\omega}]$ is a multivariate Normal random variable of dimension $m$. The maximum of this variable, i.e., $\max \mathbb{E}_{Q^*_0 \mid \omega_{1:t}\bar{\omega}}[Q^*_0 \mid \omega_{1:t}\bar{\omega}]$, is a piecewise linear function of $O_{\omega\,t+1}$, and thus its expectation can be computed exactly in $O(m^2 \log m)$, as shown in Frazier et al. [5]. If an isotropic prior covariance is assumed, the computations simplify greatly, as $\mathrm{VOC}_{\phi_n}$ reduces to the expectation of a truncated univariate Normal distribution, which can be computed in $O(1)$ given the posterior distributions and the leaf node with the highest expected value. If the transitions are stochastic, then the same method can be utilized whether the covariance is isotropic or anisotropic, with an extra averaging step over transition probabilities at each node, increasing the computational cost.
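For intuition, the following sketch covers the isotropic, deterministic-transition case with a single candidate computation ($k = 1$): sampling a leaf shifts only that leaf's posterior mean, by a Normally distributed amount, so $\mathrm{VOC}_{\phi_n}$ becomes the expectation of a truncated univariate Normal, in the spirit of the knowledge-gradient formula of Frazier et al. [5]. The variable names and the one-sample look-ahead are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np
from scipy.stats import norm

def voc_phi_isotropic(mu, sigma2, noise_var, i):
    """One-sample VOC(phi_n) for leaf i under independent Normal beliefs N(mu, sigma2)
    and Normal observation noise, assuming deterministic transitions, zero rewards, gamma=1.

    One more sample from leaf i moves its posterior mean by a zero-mean Normal increment
    whose std is s_tilde = sigma2_i / sqrt(sigma2_i + noise_var); other means are unchanged."""
    mu = np.asarray(mu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    s_tilde = sigma2[i] / np.sqrt(sigma2[i] + noise_var)
    c = np.max(np.delete(mu, i))                 # best competing leaf mean
    d = (c - mu[i]) / s_tilde
    # E[max(c, mu_i + s_tilde * Z)] via the truncated-Normal expectation:
    expected_new_max = c + (mu[i] - c) * (1.0 - norm.cdf(d)) + s_tilde * norm.pdf(d)
    return expected_new_max - max(c, mu[i])

# Example: three leaves; the VOC is largest for leaves that could plausibly overtake the best.
mu, sigma2 = [0.5, 0.45, -1.0], [0.2, 0.2, 0.2]
print([round(voc_phi_isotropic(mu, sigma2, noise_var=1.0, i=i), 4) for i in range(3)])
```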

Computing the VOC($\psi_n$)-greedy policy is much harder, on the other hand, even for a deterministic state-transition function, because we need to calculate the expected maximum of possibly correlated random variables, $\mathbb{E}[\max Q^*_0 \mid \omega_{1:t}]$. One could resort to Monte Carlo sampling. Alternatively, assuming an isotropic prior over leaf values, we can obtain the following by adapting a bound on the expected maximum of random variables [12, 15]:

$$\mathbb{E}[\max Q^*_0 \mid \omega_{1:t}] \;\le\; \beta^{s_\rho}_t := c + \sum_{(s',a') \in \mathcal{L}_n(s_\rho)} (\sigma_{s'a't})^2 f_{s'a't}(c) + (\mu_{s'a't} - c)\left[1 - F_{s'a't}(c)\right],$$

where $\mu_{s'a't}$ and $\sigma_{s'a't}$ are the posterior means and standard deviations, that is, $Q^*_0(s', a') \mid \omega_{1:t} \sim \mathcal{N}(\mu_{s'a't}, (\sigma_{s'a't})^2)$, $f_{s'a't}$ and $F_{s'a't}$ are the pdf and the CDF of $Q^*_0(s', a') \mid \omega_{1:t}$, and $c$ is a real number. The tightest bound is realized for a $c$ that satisfies $\sum_{(s',a') \in \mathcal{L}_n(s_\rho)} \left[1 - F_{s'a't}(c)\right] = 1$, which can be found by root-finding methods.
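A sketch of how this bound can be evaluated numerically for independent Normal leaf posteriors: find $c$ by root finding and plug it into the bound. The bracketing interval and parameter names are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def expected_max_upper_bound(mu, sigma):
    """Upper bound beta on E[max_i X_i] for independent X_i ~ N(mu_i, sigma_i^2):
    beta(c) = c + sum_i [ sigma_i^2 * pdf_i(c) + (mu_i - c) * (1 - cdf_i(c)) ],
    evaluated at the c satisfying sum_i (1 - cdf_i(c)) = 1, which minimizes the bound."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)

    def tail_sum_minus_one(c):
        return np.sum(1.0 - norm.cdf(c, loc=mu, scale=sigma)) - 1.0

    # Bracket the root: the tail sum decreases from len(mu) to 0 as c grows.
    lo, hi = mu.min() - 10 * sigma.max(), mu.max() + 10 * sigma.max()
    c = brentq(tail_sum_minus_one, lo, hi)

    pdf = norm.pdf(c, loc=mu, scale=sigma)
    cdf = norm.cdf(c, loc=mu, scale=sigma)
    return c + np.sum(sigma**2 * pdf + (mu - c) * (1.0 - cdf))

# Example with a handful of leaf posteriors.
print(expected_max_upper_bound(mu=[0.1, 0.3, 0.0], sigma=[0.5, 0.4, 0.6]))
```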

The critical question is then how $\beta^{s_\rho}_t$ changes with an additional sample from $(s', a')$. For this, we use the local sensitivity, $\partial \beta^{s_\rho}_t / \partial n_{s'a't}$, as a proxy, where $n_{s'a't}$ is the number of samples drawn from $(s', a')$ until time $t$. We give the closed-form equation for this partial derivative, along with some of its additional properties, in the Appendix. We can then approximately compute the VOC($\psi_n$)-greedy policy by choosing the computation that maximizes the magnitude of $\partial \beta^{s_\rho}_t / \partial n_{s'a't}$. This approach only works if state transitions are deterministic, as it enables us to collapse the root action values into a single max over leaf values. If the state transitions are stochastic, this is no longer possible, as averaging over state transition probabilities is required. Alternatively, one can sample deterministic transition functions and average $\partial \beta^{s_\rho}_t / \partial n_{s'a't}$ over the samples as an approximation.

Our VOC-greedy MCTS methods address important limitations of VOC'-based methods [8, 23]. VOC-greedy does not suffer from the early-stopping problem that afflicts VOC'-based methods. It is also less myopic, in the sense that it can incorporate the impact of computations that may be performed in the future if dynamic value functions are utilized. Lastly, our proposal extends VOC calculations to non-root actions, as determined by $n$.

6 EXPERIMENTS

We compare the VOC-greedy policies against UCT [11], VOI-based [8], Bayes UCT [22], and Thompson sampling for MCTS (DNG-MCTS) [2] in two different environments, bandit-trees and peg solitaire, where the environment dynamics are provided to each method.

Bayes UCT computes approximate posterior action values and uses a rule similar to UCT to select child nodes. DNG-MCTS also estimates the posterior action values but instead utilizes Thompson sampling recursively down the tree. We use the same conjugate Normal prior structure for the Bayesian algorithms: VOC-greedy, Bayes UCT, and DNG-MCTS (in the original DNG-MCTS paper [2], the authors use Dirichlet-Normal-Gamma priors, but we resort to Normal priors to preserve consistency among all the Bayesian policies). The prior and the noise parameters are tuned for each method via grid search using the same number of evaluations, as are the exploration parameters of UCT and VOI-based.

VOI-based, VOC-greedy, and Bayes UCT are hybrid methods, using one set of rules for the top of the search tree and UCT for the rest. We refer to this top part of the tree as the partial search tree (PST). By construction, VOI-based utilizes a PST of height 1. We implement the latter two methods using PSTs of height 4 in bandit-trees and of height 2 in peg solitaire. These heights are determined based on the branching factors of the environments and the total computation budgets, such that each leaf node is sampled a few (5-8) times on average. For the experiments we explain next, we tune the hyperparameters of all the policies using grid search.

6.1 BANDIT-TREES

The first environment in which we evaluate the MCTS policies is an MDP composed of a complete binary tree of height $d$, similar to the setting presented in Tolpin and Shimony [23] but with a deeper tree structure and stochastic transitions. The leaves of the tree are noisy "bandit arms" with unknown distributions. Agents perform "computations" to draw samples from the arms, which is analogous to performing rollouts for evaluating leaf values in MCTS. At each state, the agents select an action from $A = \{\text{LEFT}, \text{RIGHT}\}$ (denoting the desired subtree of height $d-1$) and transition there with probability 0.75, and to the other subtree with probability 0.25. In Figure 3, we illustrate a bandit tree of height 3.
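A minimal, hypothetical sketch of such an environment (class and parameter names are ours; only the noisy leaf arms and the 0.75/0.25 transition rule follow the description above):

```python
import numpy as np

class BanditTree:
    """Complete binary tree of height d whose leaves are noisy bandit arms.
    Actions at internal states: 0 = LEFT, 1 = RIGHT; the agent moves to the
    chosen subtree with probability 0.75 and to the other one with 0.25."""

    def __init__(self, depth: int, arm_means: np.ndarray, noise_std: float, rng=None):
        assert len(arm_means) == 2 ** depth
        self.depth, self.arm_means, self.noise_std = depth, arm_means, noise_std
        self.rng = rng or np.random.default_rng()

    def sample_arm(self, leaf_index: int) -> float:
        """A 'computation': draw one noisy sample from a leaf arm."""
        return self.arm_means[leaf_index] + self.noise_std * self.rng.normal()

    def step(self, node: int, action: int) -> int:
        """Stochastic transition from an internal node (heap indexing, root = 0)."""
        intended = 2 * node + 1 + action       # chosen subtree
        other = 2 * node + 1 + (1 - action)    # the other subtree
        return intended if self.rng.random() < 0.75 else other
```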

Figure 3: A bandit tree of height 3. Circles denote states, and squares denote bandit arms.

At each time step $t$, agents sample one of the arms and update their value estimates at the root state $s_\rho$. We measure the simple objective³ regret at state $s_\rho$ at time $t$, which we define as $\max_{a \in A} Q^*(s_\rho, a) - Q^*(s_\rho, \pi_t(s_\rho))$, for a deterministic policy $\pi_t : s_\rho \to A$ that depends on the knowledge state acquired by performing $t$ computations.
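In code form, with `q_star_root` holding the ground-truth values $Q^*(s_\rho, \cdot)$ (names ours, for illustration):

```python
def simple_regret(q_star_root, chosen_action):
    """Objective simple regret at the root: the gap between the best
    ground-truth action value and the value of the chosen action."""
    return max(q_star_root) - q_star_root[chosen_action]
```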

We sample the rewards of the bandit arms from a multivariate Normal distribution, where the covariance is obtained either from a radial basis function (RBF) kernel or from a white-noise kernel. The noise of arms/computations follows an i.i.d. Normal distribution. We provide the exact environment parameters in the Appendix.
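A sketch of how correlated expected arm rewards can be drawn with an RBF covariance over arm indices; the kernel parameters below are placeholders rather than the values given in the Appendix.

```python
import numpy as np

def sample_arm_means(n_arms, length_scale=2.0, signal_var=1.0, rng=None):
    """Draw expected arm rewards from a multivariate Normal whose covariance
    comes from an RBF kernel over arm indices; a white-noise kernel would
    correspond to a diagonal covariance (uncorrelated arms) instead."""
    rng = rng or np.random.default_rng()
    idx = np.arange(n_arms)
    cov = signal_var * np.exp(-0.5 * (idx[:, None] - idx[None, :]) ** 2 / length_scale ** 2)
    return rng.multivariate_normal(np.zeros(n_arms), cov)
```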

³We call it objective regret because it is based on the ground truth ($Q^*$), as opposed to Bayesian regret, which is based on estimates of the ground truth.


Figure 4a shows the results for the case with correlated bandit arms. These correlations are exploited in our implementation of VOC($\lambda_n$)-greedy (via an anisotropic Normal prior over the leaf values of the PST). Note that we are not able to incorporate this extra assumption into the other Bayesian methods: Bayes UCT utilizes a specific approximation for propagating values up the tree, and Thompson sampling would require a prior over all state-actions in the environment, which is complicated due to the parent-child dependencies among the nodes, as well as computationally prohibitive. Because computing the VOC($\psi_n$)-greedy policy is very expensive if state transitions are stochastic, we only implement VOC($\lambda_n$)-greedy for this environment, but implement both for the next environment.

We see that VOC($\lambda_n$)-greedy outperforms all other methods. Note that this is a low-sample-density setting: there are $2^7 = 128$ bandit arms, and each arm gets sampled on average once, as the maximum budget (see the x-axis) is also 128. This is why many of the policies do not seem to have converged to the optimal solution. The outstanding performance of VOC($\lambda_n$)-greedy is partially due to its ability to exploit correlations. In order to control for this, Figure 4b shows the results for a case in which the bandit rewards are uncorrelated (i.e., sampled from an isotropic Normal distribution). As we can see, VOC($\lambda_n$)-greedy and Bayes UCT perform equally well, and better than the other policies. This implies that the good performance of VOC($\lambda_n$)-greedy does not depend wholly on its ability to exploit the correlational structure.

6.2 PEG SOLITAIRE

Peg solitaire (also known as Solitaire, Solo, or Solo Noble) is a single-player board game, where the objective for our purposes is to remove as many pegs as possible from the board by making valid moves. We use a $4 \times 4$ board, with 9 pegs randomly placed.
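For reference, a sketch of what counts as a valid move under the standard peg-solitaire rule (a peg jumps over an orthogonally adjacent peg into an empty hole, removing the jumped peg); the board encoding is ours.

```python
import numpy as np

def valid_moves(board):
    """Yield (source, over, target) triples of legal jumps on a 2D boolean
    board, where True marks a peg and False an empty hole."""
    rows, cols = board.shape
    for r in range(rows):
        for c in range(cols):
            if not board[r, c]:
                continue  # no peg to move from here
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                over, target = (r + dr, c + dc), (r + 2 * dr, c + 2 * dc)
                if (0 <= target[0] < rows and 0 <= target[1] < cols
                        and board[over] and not board[target]):
                    yield (r, c), over, target
```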

In the implementation of the VOC-greedy policies, we assume an anisotropic prior over the leaf nodes. As shown in Figure 5, VOC($\lambda_n$)-greedy has the best performance for small budget ranges, which is in line with our intuition, as $\lambda_n$ is a more accurate valuation of action values for small computation budgets. For large budgets, we see that VOC($\psi_n$)-greedy performs as well as Thompson sampling, and better than the rest.

7 DISCUSSION

This paper offers principled ways of assigning values to actions and computations in MCTS. We address important limitations of existing methods by extending computation values to non-immediate actions while accounting for the impact of non-immediate future computations.

Figure 4: Mean regret as a function of the computation budget for bandit-trees with correlated (panel a) and uncorrelated (panel b) expected bandit rewards, averaged over 10k and 5k random bandit-reward seeds, respectively. Policies shown: UCT, VOI, Thompson, BayesUCT, and VOC($\lambda_n$).

We show that MCTS methods that greedily maximize computation values have desirable properties and are more sample-efficient in practice than many popular existing methods. The major drawback of our proposal is that computing VOC-greedy policies can be expensive, and may only be worth it if rollouts (i.e., environment simulations) are themselves computationally expensive. That said, we believe efficient derivatives of our methods are possible, for instance by using graph neural networks to directly learn static/dynamic action or computation values.

Practical applications aside, the study of computation values might provide tools for a better understanding of MCTS policies, for instance by providing notions of regret and optimality for computations, similar to what already exists for actions (i.e., $Q^*$).


Figure 5: The average number of pegs remaining on the board as a function of the computation budget, averaged over 200 random seeds. The bars denote the mean squared errors. Policies shown: UCT, VOI, BayesUCT 1, BayesUCT 2, Thompson, VOC($\lambda_n$), and VOC($\psi_n$).

Acknowledgements. We are grateful to Marcus Hutter, whose feedback significantly improved our presentation of this work. We also thank the anonymous reviewers for their valuable feedback and suggestions. ES conducted this work during a research visit to the Gatsby Unit at University College London and during his studies at the Bernstein Center for Computational Neuroscience Berlin. PD was funded by the Gatsby Charitable Foundation, the Max Planck Society, and the Alexander von Humboldt Foundation.

References

[1] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[2] Aijun Bai, Feng Wu, and Xiaoping Chen. Bayesian mixture modelling and inference based Thompson sampling in Monte-Carlo tree search. In Proceedings of Advances in Neural Information Processing Systems (NIPS-13), pages 1646–1654, Lake Tahoe, United States, 2013.

[3] David Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, New York, NY, USA, 2012.

[4] Remi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In Proceedings of the 5th International Conference on Computers and Games, CG'06, pages 72–83, Berlin, Heidelberg, 2007. Springer-Verlag.

[5] Peter Frazier, Warren Powell, and Savas Dayanik. The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing, 21(4):599–613, 2009.

[6] Arthur Guez, David Silver, and Peter Dayan. Scalable and efficient Bayes-adaptive reinforcement learning based on Monte-Carlo tree search. Journal of Artificial Intelligence Research, 48(1):841–883, October 2013.

[7] Nicholas Hay and Stuart J. Russell. Metareasoning for Monte Carlo tree search. Technical Report UCB/EECS-2011-119, EECS Department, University of California, Berkeley, November 2011.

[8] Nicholas Hay, Stuart Russell, David Tolpin, and Solomon Eyal Shimony. Selecting computations: Theory and applications. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, UAI'12, pages 346–355, Arlington, Virginia, United States, 2012. AUAI Press.

[9] Ronald A. Howard. Information value theory. IEEE Transactions on Systems Science and Cybernetics, SSC-2:22–26, 1966.

[10] Mehdi Keramati, Amir Dezfouli, and Payam Piray. Speed/accuracy trade-off between the habitual and the goal-directed processes. PLOS Computational Biology, 7(5):1–21, 2011.

[11] Levente Kocsis and Csaba Szepesvari. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning, ECML'06, pages 282–293, Berlin, Heidelberg, 2006. Springer-Verlag.

[12] T. L. Lai and Herbert Robbins. Maximally dependent random variables. Proceedings of the National Academy of Sciences of the United States of America, 73(2):286–288, 1976.

[13] Falk Lieder, Dillon Plunkett, Jessica B. Hamrick, Stuart J. Russell, Nicholas J. Hay, and Thomas L. Griffiths. Algorithm selection by rational metareasoning as a model of human strategy selection. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2870–2878, Cambridge, MA, USA, 2014. MIT Press.

[14] Christopher H. Lin, Andrey Kolobov, Ece Kamar, and Eric Horvitz. Metareasoning for planning under uncertainty. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pages 1601–1609. AAAI Press, 2015.

[15] Andrew M. Ross. Computing bounds on the expected maximum of correlated normal variables. Methodology and Computing in Applied Probability, 12:111–138, 2010.

[16] Omid Madani, Daniel J. Lizotte, and Russell Greiner. The budgeted multi-armed bandit problem. In John Shawe-Taylor and Yoram Singer, editors, Learning Theory, pages 643–645, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg.

[17] Marcelo G. Mattar and Nathaniel D. Daw. Prioritized memory access explains planning and hippocampal replay. Nature Neuroscience, 21(11):1609–1617, 2018.

[18] S. Russell and E. Wefald. Do the Right Thing: Studies in Limited Rationality. MIT Press, 1991.

[19] Ilya O. Ryzhov, Warren B. Powell, and Peter I. Frazier. The knowledge gradient algorithm for a general class of online learning problems. Operations Research, 60(1):180–195, 2012.

[20] Can Eren Sezener, Amir Dezfouli, and Mehdi Keramati. Optimizing the depth and the direction of prospective planning using information values. PLOS Computational Biology, 15(3):1–21, 2019.

[21] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, January 2016.

[22] Gerald Tesauro, V T Rajan, and Richard Segal. Bayesian inference in Monte-Carlo tree search. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI'10, pages 580–588, Arlington, Virginia, United States, 2010. AUAI Press.

[23] David Tolpin and Solomon Eyal Shimony. MCTS based on simple regret. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, Toronto, Ontario, Canada, 2012.

[24] Jian Wu, Matthias Poloczek, Andrew G. Wilson, and Peter Frazier. Bayesian optimization with gradients. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5267–5278. Curran Associates, Inc., 2017.