
A VALUE OF COMPUTATION IN MCTS

If the state transition function is deterministic, then static and dynamic value computations simplify greatly:

$$\psi_n(s, a \mid \omega_{1:t}) = \mathbb{E}\left[\max_{(s',a') \in \Gamma_n(s)} Z(s', a' \mid \omega_{1:t})\right] \tag{3}$$

$$\phi_n(s, a \mid \omega_{1:t}) = \max_{(s',a') \in \Gamma_n(s)} \mathbb{E}\left[Z(s', a' \mid \omega_{1:t})\right], \tag{4}$$

where $Z(s', a' \mid \omega_{1:t}) := r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots + \gamma^n \left(Q^*_0(s', a') \mid \omega_{1:t}\right)$ is the posterior leaf value scaled by $\gamma^n$ and shifted by the discounted immediate rewards $(r_i)$ along the path from $s$ to $s'$.

In this ‘flat’ case, the VOC($\phi_n$)-greedy policy is equivalent to a knowledge gradient policy, details of which can be found in Frazier et al. [5] and Ryzhov et al. [19] for either isotropic or anisotropic Normal $Q^*_0$. On the other hand, the VOC($\psi_n$)-greedy policy has not been studied, to the best of our knowledge. Computing $\psi_n$ requires the expected maximum of random variables, which is generally hard. Below, we offer a novel approximation to remedy this problem.
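The gap between the two quantities can be illustrated numerically. The following is a minimal sketch under assumptions that are not from the paper: three independent Normal leaf posteriors with made-up parameters, zero rewards along the path, and $\gamma = 1$. The static value is then the largest posterior mean, while the dynamic value is estimated by plain Monte Carlo sampling.

```python
import random
from statistics import NormalDist

def static_value(leaves):
    """phi: maximum of the posterior means (max of expectations)."""
    return max(leaf.mean for leaf in leaves)

def dynamic_value_mc(leaves, num_samples=20000, seed=0):
    """psi: Monte Carlo estimate of the expected maximum.

    Independence of the leaves is an assumption made only for this sketch.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += max(rng.gauss(leaf.mean, leaf.stdev) for leaf in leaves)
    return total / num_samples

# Hypothetical leaf posteriors, chosen only for illustration.
leaves = [NormalDist(0.0, 1.0), NormalDist(0.2, 0.5), NormalDist(-0.1, 2.0)]
phi = static_value(leaves)
psi = dynamic_value_mc(leaves)
print(f"static value phi = {phi:.3f}, dynamic value psi (MC) = {psi:.3f}")
```

Up to sampling error, the estimate of the dynamic value should not fall below the static value, since the expectation of a maximum is at least the maximum of the expectations.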

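A.1 COMPUTING VOC($\psi_n$)

We utilize a bound [12] that enables us to get a handle on $\psi_n$. This asserts,

$$\psi_n(s, a \mid \omega_{1:t}) \le c + \sum_{(s',a') \in \Gamma_n(s)} \int_c^{\infty} \left[1 - F_{s'a't}(x)\right] dx$$

for any $c \in \mathbb{R}$, where $F_{s'a't}$ is the CDF of $Z(s', a' \mid \omega_{1:t})$.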

This bound does not assume independence and holds for any correlation structure by assuming the worst case. Furthermore, the inequality is true for all $c$. However, the tightest bound is obtained by differentiating the RHS with respect to $c$ and setting its derivative to zero, which in turn yields

$$\sum_{(s',a') \in \Gamma_n(s)} \left[1 - F_{s'a't}(c)\right] = 1.$$

Thus, the optimizing $c$ can be obtained via line search methods.
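One possible line search is bisection, because the sum of upper-tail probabilities is monotonically decreasing in $c$. The sketch below assumes Normal leaf posteriors with made-up parameters and at least two leaves (with a single leaf the condition has no finite solution); the function names are hypothetical and not taken from the paper.

```python
from statistics import NormalDist

def excess_tail_mass(c, leaves):
    """Sum over leaves of the upper-tail probability 1 - F(c), minus one."""
    return sum(1.0 - leaf.cdf(c) for leaf in leaves) - 1.0

def optimal_c(leaves, tol=1e-10):
    """Bisection on excess_tail_mass, which decreases monotonically in c."""
    if len(leaves) < 2:
        raise ValueError("a finite root requires at least two leaves")
    # Bracket the root: far left the sum is close to len(leaves), far right close to 0.
    lo = min(leaf.mean - 10.0 * leaf.stdev for leaf in leaves)
    hi = max(leaf.mean + 10.0 * leaf.stdev for leaf in leaves)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess_tail_mass(mid, leaves) > 0.0:
            lo = mid  # too much tail mass: move c to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical leaf posteriors, chosen only for illustration.
leaves = [NormalDist(0.0, 1.0), NormalDist(0.5, 2.0), NormalDist(-1.0, 0.5)]
c_star = optimal_c(leaves)
print(f"c* = {c_star:.6f}, residual = {excess_tail_mass(c_star, leaves):.2e}")
```

For two identical leaves the condition reduces to $F(c) = 1/2$, so the search should return their common median.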

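If $Z(\cdot, \cdot \mid \omega_{1:t})$ is distributed according to a multivariate (isotropic or anisotropic) Normal distribution, then we can eliminate the integral [15]:

$$\psi_n(s, a \mid \omega_{1:t}) \le \lambda_{sat} := c + \sum_{(s',a') \in \Gamma_n(s)} \left\{ (\sigma_{s'a't})^2 f_{s'a't}(c) + (\mu_{s'a't} - c)\left[1 - F_{s'a't}(c)\right] \right\},$$

where $f_{s'a't}$ is the PDF corresponding to $F_{s'a't}$, and $\mu_{s'a't}$ and $\sigma^2_{s'a't}$ are the posterior mean and variance, that is, $Z(s', a' \mid \omega_{1:t}) \sim \mathcal{N}(\mu_{s'a't}, (\sigma_{s'a't})^2)$.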
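The closed form can be sanity-checked numerically. The sketch below relies on the identity $\int_c^\infty [1 - F(x)]\,dx = \sigma^2 f(c) + (\mu - c)[1 - F(c)]$ for a Normal variable with density $f$, and compares the resulting bound with a Monte Carlo estimate of the expected maximum. The two standard Normal leaves, their independence, and all function names are assumptions made for illustration only.

```python
import random
from statistics import NormalDist

def upper_bound(c, leaves):
    """c + sum over leaves of sigma^2 * f(c) + (mu - c) * (1 - F(c))."""
    total = c
    for leaf in leaves:
        total += leaf.variance * leaf.pdf(c) + (leaf.mean - c) * (1.0 - leaf.cdf(c))
    return total

def expected_max_mc(leaves, num_samples=20000, seed=0):
    """Monte Carlo estimate of E[max], assuming independent leaves."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += max(rng.gauss(leaf.mean, leaf.stdev) for leaf in leaves)
    return total / num_samples

# Two standard Normal leaves: the optimality condition 2 * (1 - F(c)) = 1 gives c = 0.
leaves = [NormalDist(0.0, 1.0), NormalDist(0.0, 1.0)]
bound_at_zero = upper_bound(0.0, leaves)
mc_estimate = expected_max_mc(leaves)
print(f"bound at c = 0: {bound_at_zero:.4f}, Monte Carlo E[max]: {mc_estimate:.4f}")
```

Because the bound holds for any correlation structure, it is not expected to be tight for independent leaves; any other choice of $c$, such as `upper_bound(0.3, leaves)`, should give a value at least as large as the one at the optimizing $c$.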

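If we further assume an isotropic Normal prior with mean $\mu_{s'a'0}$ and scale $\sigma_{s'a'0}$, and observation noise $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ i.i.d. for $i = 1, 2, \ldots, t$, then we get the posterior mean and scale as

$$\mu_{s'a't} = \frac{n_{s'a't}\,\hat{o}_{s'a't}/\sigma^2 + \mu_{s'a'0}/\sigma^2_{s'a'0}}{n_{s'a't}/\sigma^2 + 1/\sigma^2_{s'a'0}}, \qquad \sigma_{s'a't} = \left(n_{s'a't}/\sigma^2 + 1/\sigma^2_{s'a'0}\right)^{-1/2},$$

where $\hat{o}_{s'a't}$ is the mean of the trajectory rewards obtained from $(s', a')$ and $n_{s'a't}$ is the number of times a sample is drawn from $(s', a')$. Then, keeping $c$ fixed, we can estimate the “sensitivity” of $\lambda_{sat}$ with respect to an additional sample from $(s', a')$ with

$$\frac{\partial \lambda_{sat}}{\partial n_{s'a't}} = \frac{\partial \lambda_{sat}}{\partial \sigma_{s'a't}} \frac{d\sigma_{s'a't}}{dn_{s'a't}} + \frac{\partial \lambda_{sat}}{\partial \mu_{s'a't}} \frac{d\mu_{s'a't}}{dn_{s'a't}},$$

where

$$\begin{aligned}
\frac{\partial \lambda_{sat}}{\partial \sigma_{s'a't}} &= \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(\mu_{s'a't} - c)^2}{2(\sigma_{s'a't})^2}\right), \\
\frac{d\sigma_{s'a't}}{dn_{s'a't}} &= -\frac{\sigma\,\sigma^3_{s'a'0}}{2\left(n_{s'a't}\sigma^2_{s'a'0} + \sigma^2\right)^{3/2}}, \\
\frac{\partial \lambda_{sat}}{\partial \mu_{s'a't}} &= \frac{1}{2}\left(1 + \operatorname{erf}\left(\frac{\sqrt{2}\,(\mu_{s'a't} - c)}{2\sigma_{s'a't}}\right)\right), \\
\frac{d\mu_{s'a't}}{dn_{s'a't}} &= \frac{\sigma^2\sigma^2_{s'a'0}\left(-\mu_{s'a'0} + \hat{o}_{s'a't}\right)}{(n_{s'a't})^2\sigma^4_{s'a'0} + 2n_{s'a't}\sigma^2\sigma^2_{s'a'0} + \sigma^4}.
\end{aligned}$$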

We can then compute and utilize $\partial\lambda_{sat}/\partial n_{s'a't}$ as a proxy for the expected change in $\lambda_{sat}$. Because $\lambda_{sat}$ is an upper bound, we find that this scheme works best when the priors are optimistic, that is, when $\mu_{s'a'0}$ is large. In fact, as long as the prior mean is larger than the empirical mean, $\mu_{s'a'0} > \hat{o}_{s'a't}$, we have $\partial\lambda_{sat}/\partial n_{s'a't} < 0$. Then we can safely choose the best leaf to sample from via $\arg\min_{(s',a') \in \Gamma_n(s)} \left[\max_{a \in \mathcal{A}_s} \lambda_{sat}\right]$. We used this scheme when implementing VOC($\psi_n$)-greedy in peg solitaire and confirmed that the results are nearly indistinguishable, in terms of the resulting regret curves, from calculating VOC($\psi_n$)-greedy by drawing Monte Carlo samples.
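The chain-rule expression can be checked against a finite difference. The sketch below assumes the standard conjugate Normal update with known noise variance, that is, posterior scale $(n/\sigma^2 + 1/\sigma_0^2)^{-1/2}$, treats the sample count $n$ as a continuous variable, and uses made-up parameter values; the function names are hypothetical.

```python
import math

def posterior(n, sample_mean, prior_mean, prior_scale, noise_scale):
    """Conjugate Normal posterior mean and scale after n samples."""
    denom = n * prior_scale**2 + noise_scale**2
    mean = (n * sample_mean * prior_scale**2 + prior_mean * noise_scale**2) / denom
    scale = noise_scale * prior_scale / math.sqrt(denom)
    return mean, scale

def leaf_term(n, c, sample_mean, prior_mean, prior_scale, noise_scale):
    """One leaf's summand in the bound: sigma^2 f(c) + (mu - c) [1 - F(c)]."""
    mu, sigma = posterior(n, sample_mean, prior_mean, prior_scale, noise_scale)
    z = (c - mu) / sigma
    density = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2.0))
    return sigma**2 * density + (mu - c) * upper_tail

def sensitivity(n, c, sample_mean, prior_mean, prior_scale, noise_scale):
    """Chain-rule derivative of the leaf's summand with respect to n, c held fixed."""
    mu, sigma = posterior(n, sample_mean, prior_mean, prior_scale, noise_scale)
    denom = n * prior_scale**2 + noise_scale**2
    dlam_dsigma = math.exp(-(mu - c) ** 2 / (2.0 * sigma**2)) / math.sqrt(2.0 * math.pi)
    dsigma_dn = -noise_scale * prior_scale**3 / (2.0 * denom**1.5)
    dlam_dmu = 0.5 * (1.0 + math.erf(math.sqrt(2.0) * (mu - c) / (2.0 * sigma)))
    dmu_dn = noise_scale**2 * prior_scale**2 * (sample_mean - prior_mean) / denom**2
    return dlam_dsigma * dsigma_dn + dlam_dmu * dmu_dn

# Hypothetical values with an optimistic prior (prior mean above the empirical mean).
args = dict(c=0.5, sample_mean=0.2, prior_mean=1.0, prior_scale=1.0, noise_scale=0.5)
analytic = sensitivity(3.0, **args)
step = 1e-5
numeric = (leaf_term(3.0 + step, **args) - leaf_term(3.0 - step, **args)) / (2.0 * step)
print(f"analytic = {analytic:.8f}, finite difference = {numeric:.8f}")
```

With the prior mean above the empirical mean, both factors of the first product have opposite signs and the second product is negative as well, so the sensitivity should come out negative, in line with the statement above.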

B VOC-GREEDY ALGORITHMS

We provide the pseudocode for the VOC-greedy MCTS policy in Algorithm 1. Throughout our analysis of this policy, we assume an infinite computation budget $B$.

Algorithm 1: VOC($\phi_n$/$\psi_n$)-greedy for MCTS
Input: Current state $s_\rho$
Input: Maximum computation budget $B$
Output: Selected action to perform
Create a partial search graph/tree by expanding state $s_\rho$ for $n$ steps;
Initialize the leaf set $\Gamma_n(s_\rho)$;
Initialize a partial function $U$ that maps states to UCT trees;
$t \leftarrow 0$;
$\omega_{1:t} \leftarrow \epsilon$;  /* empty sequence */
repeat
    $(s^*, a^*) = \arg\max_{\omega} \mathrm{VOC}_{\phi_n/\psi_n}(s_\rho, \omega \mid \omega_{1:t})$;
    $s^\dagger \sim P^{a^*}_{s^* \cdot}$;
    if $U(s^\dagger)$ is not defined then
        Initialize a UCT-tree rooted at $s^\dagger$;
        Define $U(s^\dagger)$, which maps to the UCT-tree from the previous step;
    end
    Obtain sample $o_{s^*a^*t}$ by expanding $U(s^\dagger)$ and performing a roll-out;
    $\omega_{t+1} \leftarrow (s^*, a^*, o_{s^*a^*t})$;
    $\omega_{1:t+1} \leftarrow \omega_{1:t}\,\omega_{t+1}$;
    $t \leftarrow t + 1$;
until $\max_{\omega} \mathrm{VOC}_{\phi_n/\psi_n}(s_\rho, \omega \mid \omega_{1:t}) < 0$ or $t \ge B$;
return $\arg\max_{a \in \mathcal{A}_{s_\rho}} \phi_n/\psi_n(s_\rho, a \mid \omega_{1:t})$;

B.1 TIME COMPLEXITIES

The computational complexity of VOC-greedy methods depends on a variety of factors, including the prior distribution of the leaf values, stochasticity, and the use of static vs. dynamic values. Here, we discuss the time complexity of computing the VOC($\phi$)-greedy policy with a conjugate Normal prior (with known variance) in MDPs with deterministic transitions.

The posterior values can be updated incrementally in constant time if the prior is isotropic Normal and in $O(m^2)$ if it is anisotropic, where $m$ is the number of leaf nodes [3]. Given the posterior distributions, the value of a computation can be computed in $O(m)$ in the isotropic case and in $O(m^2 \log m)$ in the anisotropic case. We refer the reader to [5] for further details, as the analysis done for bandits with correlated Normal arms applies directly.
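To make the loop of Algorithm 1 concrete, here is a minimal sketch of a flat, deterministic special case with independent (isotropic) Normal leaf beliefs and static values, where the VOC reduces to the standard knowledge gradient closed form. Everything specific in it is an assumption for illustration: the hypothetical `true_values` replace UCT roll-outs with noisy draws, path rewards are zero, $\gamma = 1$, and the loop stops when the largest VOC is not positive rather than strictly negative.

```python
import math
import random

def kg_factor(i, means, scales, noise_scale):
    """Expected one-sample increase of the largest posterior mean from sampling leaf i."""
    # Standard deviation of the change in leaf i's posterior mean after one sample.
    change_scale = scales[i] ** 2 / math.sqrt(scales[i] ** 2 + noise_scale ** 2)
    best_other = max(m for j, m in enumerate(means) if j != i)
    z = -abs(means[i] - best_other) / change_scale
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * math.erfc(-z / math.sqrt(2.0))
    return change_scale * (z * cdf + pdf)

def voc_greedy(true_values, prior_mean, prior_scale, noise_scale, budget, seed=0):
    """Sample the leaf with the largest VOC until the budget runs out or no VOC is positive."""
    rng = random.Random(seed)
    num_leaves = len(true_values)
    means = [prior_mean] * num_leaves
    scales = [prior_scale] * num_leaves
    for _ in range(budget):
        vocs = [kg_factor(i, means, scales, noise_scale) for i in range(num_leaves)]
        best = max(range(num_leaves), key=vocs.__getitem__)
        if vocs[best] <= 0.0:
            break
        # Stand-in for a roll-out: a noisy observation of the hypothetical true value.
        obs = rng.gauss(true_values[best], noise_scale)
        # Constant-time conjugate update of the sampled leaf only.
        precision = 1.0 / scales[best] ** 2 + 1.0 / noise_scale ** 2
        means[best] = (means[best] / scales[best] ** 2 + obs / noise_scale ** 2) / precision
        scales[best] = 1.0 / math.sqrt(precision)
    return max(range(num_leaves), key=means.__getitem__)

# With almost noiseless observations, one sample per leaf nearly reveals its value.
chosen = voc_greedy([0.0, 1.0, 0.2], prior_mean=0.5, prior_scale=1.0,
                    noise_scale=1e-3, budget=50)
print("selected leaf:", chosen)
```

Each posterior update touches a single leaf and so takes constant time. This sketch recomputes the best competing mean separately for every leaf, which is simple but not the most efficient way to evaluate all VOCs.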

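C PROOFS

C.1 PROOF OF PROPOSITION 1

Let us consider the “base case” of $n = 1$ and define a higher-order function $g$, capturing the 1-step Bellman optimality equation for a state-action $(s, a)$:

$$g(h) := \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} h(s', a') \right].$$

Then $\phi_1(s, a \mid \omega_{1:t}) = g(\mathbb{E}[Q^*_0 \mid \omega_{1:t}])$ and $\psi_1(s, a \mid \omega_{1:t}) = \mathbb{E}[g(Q^*_0 \mid \omega_{1:t})]$. Because $g$ is a convex function, we have $\psi_1(s, a \mid \omega_{1:t}) \ge \phi_1(s, a \mid \omega_{1:t})$ by Jensen’s inequality. We use this to prove the upper bound in Proposition 1:

$$\psi_1(s, a \mid \omega_{1:t}) = \mathbb{E}_{\Omega_{1:k}}\left[\psi_1(s, a \mid \omega_{1:t}\Omega_{1:k})\right] \ge \mathbb{E}_{\Omega_{1:k}}\left[\phi_1(s, a \mid \omega_{1:t}\Omega_{1:k})\right],$$

where the first equality is due to Equation 1. For the lower bound, we have

$$\begin{aligned}
\mathbb{E}_{\Omega_{1:k}}\left[\phi_1(s, a \mid \omega_{1:t}\Omega_{1:k})\right] &= \mathbb{E}_{\Omega_{1:k}}\left[g(\mathbb{E}[Q^*_0 \mid \omega_{1:t}\Omega_{1:k}])\right] \\
&\ge g\left(\mathbb{E}_{\Omega_{1:k}}\left[\mathbb{E}[Q^*_0 \mid \omega_{1:t}\Omega_{1:k}]\right]\right) \\
&= g(\mathbb{E}[Q^*_0 \mid \omega_{1:t}]) \\
&= \phi_1(s, a \mid \omega_{1:t}).
\end{aligned}$$

These inequalities also hold for $n > 1$ for the same reasons. We omit the proof.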
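The convexity of $g$ used above can be verified in one step. The following short derivation is a sketch of that step, using only the definition of $g$ together with $P^a_{ss'} \ge 0$ and $\gamma \ge 0$; the symbols $h_1$, $h_2$ and $\theta$ are introduced here for illustration.

```latex
% For any two functions h_1, h_2 and any \theta \in [0, 1], the maximum of a
% convex combination is at most the convex combination of the maxima:
\max_{a'} \left[\theta h_1(s', a') + (1 - \theta) h_2(s', a')\right]
  \le \theta \max_{a'} h_1(s', a') + (1 - \theta) \max_{a'} h_2(s', a').
% Multiplying by \gamma P^a_{ss'} \ge 0, adding P^a_{ss'} R^a_{ss'}
% (written as \theta R + (1 - \theta) R), and summing over s' gives
g\left(\theta h_1 + (1 - \theta) h_2\right) \le \theta\, g(h_1) + (1 - \theta)\, g(h_2).
```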

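C.2 PROOF OF PROPOSITION 2

Let $\omega^*$ denote the optimal candidate computation (of length 1), which minimizes Bayesian simple regret in expectation in one step. That is,

$$\omega^* := \arg\min_{\omega} \mathbb{E}_{\Omega}\left[R_f(s_\rho, \omega_{1:t}\Omega)\right],$$

where $\Omega := (\omega, O_{\omega t+1})$ is the closure corresponding to $\omega$. Then, subtracting $R_f(s_\rho, \omega_{1:t})$, we get

$$\omega^* = \arg\min_{\omega} \left[\mathbb{E}_{\Omega}\left[R_f(s_\rho, \omega_{1:t}\Omega)\right] - R_f(s_\rho, \omega_{1:t})\right].$$

The first terms of the regrets cancel out, as $\mathbb{E}_{Q^*_0 \mid \omega_{1:t}}\left[\max_{a \in \mathcal{A}_{s_\rho}} \Upsilon_n(s_\rho, a \mid \omega_{1:t})\right] = \mathbb{E}_{\Omega}\,\mathbb{E}_{Q^*_0 \mid \omega_{1:t}\Omega}\left[\max_{a \in \mathcal{A}_{s_\rho}} \Upsilon_n(s_\rho, a \mid \omega_{1:t}\Omega)\right]$. Thus, we end up with

$$\omega^* = \arg\min_{\omega} \left\{ -\mathbb{E}_{\Omega}\left[\max_a f(s_\rho, a \mid \omega_{1:t}\Omega)\right] + \max_a f(s_\rho, a \mid \omega_{1:t}) \right\},$$

or equivalently $\omega^* = \arg\max_{\omega} \mathrm{VOC}_f(s_\rho, \Omega \mid \omega_{1:t})$.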

C.3 PROOF OF PROPOSITION 4

Consider the 2-step (i.e., $n = 2$) search tree shown in Figure 6. The deterministic transitions are shown with arrows, each corresponding to an action in $\mathcal{A} = \{L, R\}$. The leaves are denoted with filled circles, whose posterior values are given by $Q^*_0(\cdot, \cdot) \mid \omega_{1:t}$. The root state is shown as $s_\rho$, with its immediate successors as $s_0$ and $s_1$. Assume $\gamma = 1$ and that all shown actions yield 0 immediate rewards. Finally, assume the posterior distributions of the leaf values are as in Figure 6 and that they are pairwise independent.

[Figure 6 graphic: search tree with root $s_\rho$, successor states $s_0$ and $s_1$, and edges labeled L and R. Leaf posteriors: $Q^*_0(s_0, L) \mid \omega_{1:t} \sim \mathcal{N}(0, 1)$, $Q^*_0(s_0, R) \mid \omega_{1:t} \sim \mathcal{N}(0, 1)$, $Q^*_0(s_1, R) \mid \omega_{1:t} = -1$.]

Figure 6: A search graph, where VOC$'$($\phi_n$)-greedy stops early.

In this case, we can see that no single sample from the leaves can result in a policy change at $s_\rho$, since we would need to sample both of the leaves of the left subtree at least once for the policy at the root to change from L to R. Therefore, $\mathrm{VOC}'_{\phi_2}$ is zero for all possible computations here, and the corresponding greedy policy thus stops early, achieving neither one-step nor asymptotic optimality. In contrast, $\mathrm{VOC}_{\phi_2}$ is greater than zero for computations concerning the left subtree.
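The argument can be reproduced numerically for a single sample of the leaf $(s_0, L)$. The sketch below rests on several assumptions that are not spelled out in this section: observation noise with unit scale, a value of $-1$ for the leaf under $s_1$, and a reading of the two definitions in which $\mathrm{VOC}'$ compares the expected new best value against the expected new value of the previously best root action, while $\mathrm{VOC}$ compares it against the current best value.

```python
import math
import random

def one_sample_vocs(noise_scale=1.0, num_samples=10000, seed=0):
    """Return (VOC', VOC) under static values for one sample of leaf (s0, L)."""
    rng = random.Random(seed)
    left_means = [0.0, 0.0]   # posterior means of leaves (s0, L) and (s0, R)
    left_scale = 1.0          # posterior scale of the sampled leaf
    right_value = -1.0        # assumed value of the leaf reached via s1
    # Scale of the change in the sampled leaf's posterior mean after one observation.
    change_scale = left_scale**2 / math.sqrt(left_scale**2 + noise_scale**2)
    value_now = max(max(left_means), right_value)
    sum_new_best, sum_old_best = 0.0, 0.0
    for _ in range(num_samples):
        new_mean = left_means[0] + change_scale * rng.gauss(0.0, 1.0)
        phi_left = max(new_mean, left_means[1])   # static value of root action L
        phi_right = right_value                   # static value of root action R
        sum_new_best += max(phi_left, phi_right)
        sum_old_best += phi_left                  # the previously best action is L
    expected_best = sum_new_best / num_samples
    voc_prime = expected_best - sum_old_best / num_samples
    voc = expected_best - value_now
    return voc_prime, voc

voc_prime, voc = one_sample_vocs()
print(f"VOC' = {voc_prime}, VOC = {voc:.4f}")
```

Because the unsampled left leaf keeps its mean of 0, the static value of L never drops below the value of R after one sample, so the first quantity is exactly zero while the second is positive.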

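C.4 PROOF OF PROPOSITION 5

We need to show the equality of the second term in Equation 2 to the second term of VOC as we defined in Definition 3. First observe that $\mathbb{E}_{\Omega_{1:k}}\left[\psi_n(s_\rho, \alpha \mid \omega_{1:t}\Omega_{1:k})\right] = \psi_n(s_\rho, \alpha \mid \omega_{1:t})$. Then, we can take the $\alpha$ out as $\psi_n(s_\rho, \alpha \mid \omega_{1:t}) = \max_{a \in \mathcal{A}_{s_\rho}} \psi_n(s_\rho, a \mid \omega_{1:t})$, which is identical to the second term of our VOC definition in Equation 2.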

D BANDIT TREE DETAILS

We utilize trees of depth 7, where the agent transitions to the desired sub-tree with probability 0.75. In the correlated bandit arms case, the expected rewards of the arms are sampled from $\mathcal{N}(1/2, \Sigma)$ i.i.d. at each trial, where $\Sigma$ is the covariance matrix given by an RBF kernel with a scale parameter of 1, and the observation noise is sampled from $\mathcal{N}(0, 0.1)$ i.i.d. at each time step. In the uncorrelated case, the expected rewards are sampled from $U(0.45, 0.55)$ and the observation noise is from $\mathcal{N}(0, 0.01)$. The former setting is designed to be noisier to compensate for the extra information provided by the correlations.
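As a rough illustration of the correlated setting, the sketch below draws expected arm rewards from a multivariate Normal whose covariance is an RBF kernel. Several details are assumptions, since the text does not specify them: the kernel is evaluated on integer arm indices, the scale parameter is read as a length scale, a small jitter is added for numerical stability, and only 8 arms are used. The helper names are hypothetical.

```python
import math
import random

def rbf_covariance(num_arms, length_scale=1.0):
    """RBF kernel matrix over integer arm indices (an assumed input space)."""
    return [[math.exp(-0.5 * ((i - j) / length_scale) ** 2) for j in range(num_arms)]
            for i in range(num_arms)]

def cholesky(matrix, jitter=1e-9):
    """Lower-triangular Cholesky factor of matrix + jitter * I."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] + jitter - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower

def sample_arm_means(num_arms, rng, mean=0.5):
    """Draw correlated expected rewards as mean + L z with z standard Normal."""
    lower = cholesky(rbf_covariance(num_arms))
    z = [rng.gauss(0.0, 1.0) for _ in range(num_arms)]
    return [mean + sum(lower[i][k] * z[k] for k in range(i + 1))
            for i in range(num_arms)]

rng = random.Random(0)
arm_means = sample_arm_means(8, rng)
print([round(m, 3) for m in arm_means])
```

Neighboring arms receive similar expected rewards under this kernel, which is the extra information that the correlated setting provides to the agent.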