
Contextual-MDPs for PAC-Reinforcement Learning with Rich Observations

    Akshay Krishnamurthy  ∗1, Alekh Agarwal  †1, and John Langford  ‡1

    1Microsoft Research, New York, NY 10011

    March 2, 2016

    Abstract

We propose and study a new tractable model for reinforcement learning with high-dimensional observations called Contextual-MDPs, generalizing contextual bandits to a sequential decision making setting. These models require an agent to take actions based on high-dimensional observations (features) with the goal of achieving long-term performance competitive with a large set of policies. Since the size of the observation space is a primary obstacle to sample-efficient learning, Contextual-MDPs are assumed to be summarizable by a small number of hidden states. In this setting, we design a new reinforcement learning algorithm that engages in global exploration while using a function class to approximate future performance. We also establish a sample complexity guarantee for this algorithm, proving that it learns near-optimal behavior after a number of episodes that is polynomial in all relevant parameters, logarithmic in the number of policies, and independent of the size of the observation space. This represents an exponential improvement on the sample complexity of all existing alternative approaches and provides theoretical justification for reinforcement learning with function approximation.

    1 Introduction

The Atari Reinforcement Learning research program [20] has highlighted a critical deficiency of reinforcement learning algorithms: they cannot effectively solve problems that require systematic exploration. How can we construct Reinforcement Learning (RL) algorithms which effectively plan and plan to explore? In RL theory, this is an effectively solved problem for Markov Decision Processes (MDPs) [11, 4, 24]. Why do these results not apply?

An easy response is, "because the hard games are not MDPs." This may be true for some of the hard games, but it is misleading: the algorithms used do not even engage in minimal planning and global exploration¹ as is required to solve MDPs efficiently. MDP-optimized global exploration has also been avoided because of a polynomial dependence on the number of unique observations, which is intractably large with observations from a visual sensor.

∗[email protected]  †[email protected]  ‡[email protected]

¹We use "global exploration" to distinguish the structural exploration strategies required to solve an MDP efficiently from exponentially less efficient alternatives such as ε-greedy.


arXiv:1602.02722v2 [cs.LG] 1 Mar 2016


In contrast, supervised and contextual bandit learning algorithms have no dependence on the number of observations and at most a logarithmic dependence on the size of the underlying policy set. Approaches to RL with a weak dependence on these quantities exist [13], but suffer from an exponential dependence on the time horizon: with K actions and a horizon of H, they require K^H samples. Examples show this dependence is necessary, although such examples require a large number of states. Can we find an RL algorithm with no dependence on the number of unique observations and a polynomial dependence on the number of actions K, the number of necessary states M, the horizon H, and the policy complexity log(|Π|)?

To begin answering this question we consider a simplified setting by assuming:

    1. episodic reinforcement learning.

    2. the policy space can represent the exact-best solution.

    3. state transition dynamics are deterministic.

These simplifications make the problem significantly more tractable without trivializing the core goal of designing a Poly(K, M, H, log(|Π|)) algorithm. To this end, our contributions are:

1. A new class of models (Contextual-MDPs) for the design and analysis of reinforcement learning algorithms. Contextual-MDPs generalize both contextual bandits and MDPs, but, unlike Partially Observable MDPs (POMDPs), the optimal policy in a Contextual-MDP depends only on the most recent observation rather than the entire trajectory.

2. A new reinforcement learning algorithm and a guarantee that it PAC-learns Contextual-MDPs (with the above assumptions) using O(MK²H³ log(|Π|)) samples. This is done by combining ideas from contextual bandits with a novel state equality test and an on-demand exploration technique, yielding the first Poly(K, M, H, log(|Π|)) reinforcement learning algorithm with no dependence on the number of unique observations. Like initial contextual bandit approaches, the algorithm is computationally inefficient since it requires enumeration of the policy class, an aspect we hope to address in future work.

    Our algorithm uses a function class to approximate future rewards, and thus lends theoretical backing for

    reinforcement learning with function approximation, which is the empirical state-of-the-art.

    2 The Model

In this section, we introduce the model we study throughout the paper, which we call episodic Contextual-MDPs. We first set up basic notation. Let H ∈ N be a time horizon, X denote a high-dimensional observation space, A a finite set of actions, and let S denote a finite set of latent states. Let K = |A|. We partition S into H disjoint groups S_1, ..., S_H, each of size at most M. For a set P, Δ(P) denotes the set of distributions over P.

    2.1 Basic Definitions

An episodic Contextual-MDP is defined by the tuple (Γ_H, Γ, D) where H ∈ N is the episode length, Γ_H ∈ Δ(S_H) denotes a starting state distribution, Γ : (S × A) → Δ(S) denotes the transition dynamics, and D : S → Δ(X × [0, 1]^K) associates a distribution over (observation, reward) pairs with each state. We use D_s ∈ Δ(X × [0, 1]^K) to denote the (observation, reward) distribution associated with state s and also the marginal distribution over observations (usage will be clear from context). We use D_{s|x} to denote the conditional distribution of the reward given the observation x in state s.


Figure 1: Snippet of a trajectory induced by an optimal agent in a Contextual-MDP. Black text denotes unobserved quantities, blue denotes observed quantities, and red denotes quantities chosen by the optimal agent. The optimal action a* is a function solely of the current observation x, in contrast with more general POMDPs.

The marginal and conditional distributions are referred to as D_s(x) and D_{s|x}(r).

We assume that the process is layered (also known as loop-free or acyclic in the literature) so that for a state s_h ∈ S_h and for any action a ∈ A, Γ(s_h, a) ∈ Δ(S_{h−1}). Since the state space is partitioned into disjoint sets, each state is available only at a single time point, and the environment transitions from the state space S_H down to S_1 via a sequence of actions. Layered structure allows us to avoid indexing policies and Q-functions with time, which enables more concise notation but is mathematically equivalent to an alternative reformulation without layered structure.

An episode proceeds as follows. The environment chooses s_H ∼ Γ_H and (x_H, r_H) ∼ D_{s_H}, and x_H is revealed to the learner, who chooses an action a_H. The learner observes r_H(a_H) and the environment transitions to state s_{H−1} ∼ Γ(s_H, a_H), draws (x_{H−1}, r_{H−1}) ∼ D_{s_{H−1}}, and reveals x_{H−1} to the learner. The learner chooses an action a_{H−1} and the process continues for a total of H rounds of interaction, at which point the episode ends.
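To make this interaction protocol concrete, here is a minimal Python sketch of one episode; the callables Gamma_H, Gamma, D, and policy are illustrative stand-ins for Γ_H, Γ, D, and a learner's policy (assumptions made for this sketch, not objects defined in the paper).

```python
def run_episode(Gamma_H, Gamma, D, policy, H):
    """Simulate one episode of an episodic Contextual-MDP.

    Gamma_H: () -> initial state s_H drawn from the starting distribution
    Gamma:   (state, action) -> next state drawn from the transition dynamics
    D:       state -> (observation x, reward vector r) drawn from D_s
    policy:  observation -> action
    Returns the learner's record of interaction [(x_h, a_h, r_h(a_h)), ...] for h = H, ..., 1.
    """
    s = Gamma_H()                     # environment draws s_H ~ Gamma_H
    record = []
    for h in range(H, 0, -1):         # levels H down to 1
        x, r = D(s)                   # (x_h, r_h) ~ D_{s_h}; only x_h is revealed
        a = policy(x)                 # the learner acts on the observation alone
        record.append((x, a, r[a]))   # only the reward of the chosen action is observed
        if h > 1:
            s = Gamma(s, a)           # environment transitions to s_{h-1}
    return record
```

Note that the hidden states s_h never appear in the returned record, mirroring the observed record of interaction described in the next paragraph.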

Over the course of an episode, the reward obtained by the learner is Σ_{h=1}^H r_h(a_h), and the goal is to maximize the expected cumulative reward,

R = E[ Σ_{h=1}^H r_h(a_h) ],   (1)

where the expectation accounts for all randomness in the model and the learner. We assume that almost surely Σ_{h=1}^H r_h(a_h) ∈ [0, 1] for any action sequence.

The record of interaction observed by the learner is (x_H, a_H, r_H(a_H), ..., x_1, a_1, r_1(a_1)). The full record of interaction for a single episode is the tuple (s_H, x_H, r_H, a_H, ..., s_1, x_1, r_1, a_1) where s_H ∼ Γ_H, s_h ∼ Γ(s_{h+1}, a_{h+1}), (x_h, r_h) ∼ D_{s_h}, and all actions a_h are chosen by the learner. Notice that all state information and rewards for alternative actions are unobserved by the learning algorithm. Figure 1 illustrates the observed and unobserved quantities over one round of interaction.

A policy π : X → A is a strategy for navigating the search space by taking actions π(x) given observation x. A policy generates a sequence of interactions (x_H, π(x_H), r_H(π(x_H)), ..., x_1, π(x_1), r_1(π(x_1))) with expected reward defined recursively through

V(π) = E_{s∼Γ_H}[ V(s, π) ]   and
V(s, π) = E_{(x,r)∼D_s}[ r(π(x)) + E_{s'∼Γ(s,π(x))} V(s', π) ].

    3

  • 8/17/2019 Contextual MDPs for PAC Reinforceme t Learning With Rich Observations

    4/30

As the base case, we assume that for states s ∈ S_1, all actions transition deterministically to a terminal state s_0 with V(s_0, π) = 0 for all π.

The optimal expected reward achievable can be similarly computed recursively as

V* = E_{s∼Γ_H}[ V*(s) ]   and   (2)
V*(s) = E_{x∼D_s} max_a E_{r∼D_{s|x}}[ r(a) + E_{s'∼Γ(s,a)} V*(s') ].

For each (s, x) pair such that D_s(x) > 0 we can also define a Q* function as

Q*_s(x, a) = E_{r∼D_{s|x}}[ r(a) + E_{s'∼Γ(s,a)} V*(s') ].   (3)

    This function captures the optimal choice of action given this (state, observation) pair and therefore encodes

    optimal behavior in the model. With no further assumptions, the above model is a  layered episodic Partially

    Observable Markov Decision Process (POMDP).
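For concreteness, the recursions (2) and (3) can be evaluated exactly on a toy instance. In the sketch below, a state's (observation, reward) distribution is represented as a finite list of (observation, expected-reward-vector, probability) triples and transitions are deterministic and stored in a dictionary; both representational choices are assumptions made only for this example.

```python
def V_pi(s, policy, Gamma, D, terminal="s0"):
    """V(s, pi) = E_{(x,r)~D_s}[ r(pi(x)) + V(Gamma(s, pi(x)), pi) ], with V(s0, pi) = 0."""
    if s == terminal:
        return 0.0
    return sum(prob * (r[policy(x)] + V_pi(Gamma[(s, policy(x))], policy, Gamma, D, terminal))
               for x, r, prob in D[s])

def V_star(s, Gamma, D, terminal="s0"):
    """V*(s) = E_{x~D_s} max_a [ E_r[r(a)] + V*(Gamma(s, a)) ]."""
    if s == terminal:
        return 0.0
    return sum(prob * max(r[a] + V_star(Gamma[(s, a)], Gamma, D, terminal) for a in range(len(r)))
               for x, r, prob in D[s])

def Q_star(s, x, a, Gamma, D, terminal="s0"):
    """Q*_s(x, a) = E_{r~D_{s|x}}[r(a)] + V*(Gamma(s, a)), with r the expected reward vector at x."""
    r = next(rv for obs, rv, prob in D[s] if obs == x)
    return r[a] + V_star(Gamma[(s, a)], Gamma, D, terminal)
```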

    2.2 The Contextual-MDP Model

The Contextual-MDP is as described above, but with an important restriction on the structure of the Q* function defined in Equation (3).

Definition 1 (Contextual-MDP). Let (S, A, X, Γ_H, Γ, D) be a layered episodic POMDP. Let Q* be correspondingly defined as in Equation (3) and a*(s, x) = argmax_{a∈A} Q*_s(x, a). The POMDP is called a Contextual-MDP if for any two states s, s' such that D_s(x), D_{s'}(x) > 0 we have a*(s, x) = a*(s', x).

Restated, a Contextual-MDP requires the optimal action for maximizing long-term reward to be dependent solely on the observation x, irrespective of the state. This is depicted in Figure 1, where the optimal action a* depends only on the current observation. In the following section, we describe how this condition relates to other reinforcement learning models in the literature. However, we first describe some examples where the condition holds.

Example 1 (Disjoint contexts). The simplest example for a Contextual-MDP is one where each state s can be identified with a subset X_s so that D_s(x) > 0 only for x ∈ X_s and where X_s ∩ X_{s'} = ∅ when s ≠ s'. In this case, a realized context x uniquely identifies the underlying state s, so that the function Q*_s(x, a) need not explicitly depend on the state s. On the other hand, this underlying mapping from s to X_s is unknown to the learning agent, so the problem cannot be easily reduced to a classical tabular MDP with a small number of states. Our algorithm will not try to explicitly learn this mapping, as the sample complexity could be prohibitive, but resolves it implicitly using the learner's policy class. The setting naturally extends to scenarios where any tuple of τ successive contexts has non-overlapping support for some τ ≥ 1. In this case, we can define a new state space that consists of all concatenations of τ states in the underlying state space and new observation distributions that concatenate the corresponding observations.
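A minimal sketch of the disjoint-context construction, assuming for illustration that observations are integer ids and that each state is assigned its own pool of ids (the helper and its arguments are hypothetical, not from the paper):

```python
import random

def make_disjoint_context_emitter(states, contexts_per_state, K, seed=0):
    """Give each latent state s its own pool X_s of context ids, with the pools pairwise
    disjoint, and return a sampler D(s) -> (x, r) supported only on X_s."""
    rng = random.Random(seed)
    pools, next_id = {}, 0
    for s in states:
        pools[s] = list(range(next_id, next_id + contexts_per_state))  # X_s, disjoint by construction
        next_id += contexts_per_state
    # Arbitrary expected-reward vectors, one per context, so rewards depend only on x.
    rewards = {x: [rng.random() for _ in range(K)] for s in states for x in pools[s]}

    def D(s):
        x = rng.choice(pools[s])     # D_s(x) > 0 only for x in X_s
        return x, rewards[x]
    return D

D = make_disjoint_context_emitter(states=["s_a", "s_b"], contexts_per_state=3, K=2)
x, r = D("s_b")   # x uniquely identifies s_b, though the learner is never told the mapping
```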

Example 2 (Path-augmented contexts). If the transition function Γ is deterministic, one can associate each state s with a sequence of actions (a path) that ends at state s and augment the context with some featurization of this sequence, including a featurization of the states visited along this path. As paths uniquely identify states in this case, this is an instance of the disjoint context scenario in Example 1.

More generally, Contextual-MDPs provide a convenient framework to reason about reinforcement learning with function approximation, as we will see. This is highly desirable as such approaches are the empirical state-of-the-art, but the limited supporting theory provides little advice on systematic global exploration.


    2.3 Connections to Other Models

    Our model is closely related to several well-studied models in the literature, namely:

Contextual Bandits:  If H = 1, then Contextual-MDPs reduce to stochastic contextual bandits [14, 6], a well-studied simplification of the general reinforcement learning problem. In contextual bandits, the learning

    algorithm repeatedly takes an action on the basis of a context (or observation), and accumulates reward for

    the chosen action. The main difference is that the choice of action does not  influence the future observations;

    in the stochastic contextual bandits problem all (observation, reward) pairs are drawn independently and

    identically from some distribution. Thus Contextual-MDPs force the learning algorithm to use  long-term

    decision making, which is not required for contextual bandits.

Markov Decision Processes:  If X = S and the distribution over observations for each state s is concentrated on s, then Contextual-MDPs reduce to Markov Decision Processes (MDPs). MDPs with small state spaces can be efficiently solved by tabular approaches that maintain and update statistics about each state [11, 4, 24]. The main difference in our setting is that the observation space X is extremely large or infinite and the underlying state is unobserved, so tabular approaches are not viable. Thus, Contextual-MDPs force the learning algorithm to generalize across observations, which is not required in MDPs.

Sample complexity bounds for reinforcement learning with large state or observation spaces do exist, but the results require unrealistic assumptions and/or the bounds are rather weak. One example is the metric-E³ algorithm of Kakade et al. [10] (see also [8]) that has sample complexity independent of the number of unique states, but assumes the ability to cover the state space in a metric a priori known to the learner. Moreover, the sample complexity scales linearly in the cover size as opposed to a more typical logarithmic dependence as in supervised learning. The sparse sampling planner of Kearns et al. [13] also implies a sample complexity bound that is independent of the observation space size for episodic MDPs, but grows as O(K^H). More recently, Abbasi-Yadkori and Neu [1] propose a model for MDPs with side-information, but this model requires mapping side-information to a small-state MDP, rather than mapping an observation to an action as in Contextual-MDPs.

    Policy gradient methods that apply (stochastic) optimization methods to find a parameterized policy

    with high value can also be applied to large-state MDPs and to Contextual-MDPs. However, these methods

    use local search techniques and consequently do not achieve global optimality [25, 9] in theory as well as

    empirically, unlike our algorithm which is guaranteed to find the globally optimal policy.

POMDPs:  By definition a Contextual-MDP is a Partially Observable Markov Decision Process (POMDP) where the optimal action at any state depends only on the current observation. Thus in Contextual-MDPs, the learning algorithm does not have to reason over belief states as is required in POMDPs.

    Borrowing terminology, a Contextual-MDP is precisely a POMDP where a  reactive policy, which uses

    only the current observation, is optimal. While there are POMDP methods for learning reactive policies,

    or more generally policies with bounded memory [19], they are based on policy gradient techniques, which

    suffer both theoretical and empirical drawbacks as we mentioned.

There are some sample complexity guarantees for learning in arbitrarily complex POMDPs, but the bounds we are aware of are quite weak as they scale linearly with |Π| [12, 18].

Predictive State Representations (PSRs):  PSRs [17] encode states as a collection of tests, a test being a sequence of (a, x) pairs observed in the history. Representationally, PSRs are even more powerful than POMDPs [23], which makes them also more general than Contextual-MDPs. However, we are not aware of finite sample bounds for learning PSRs.


    2.4 Connections to Other Techniques

    State Abstraction:  Our work is closely related to the literature on state abstraction (See [16] for a survey),

    which primarily focuses on understanding what optimality properties are preserved in an MDP after the

    state space is compressed. However, Contextual-MDPs do not necessarily admit non-trivial state abstraction

    functions that are easy to discover (i.e. that do not amount to learning the optimal behavior) as the optimal

    behavior can depend on the observation in an arbitrary manner. Moreover, while there are finite sample

    results for learning state abstractions, they all make strong assumptions that limit the scope of application.

A recent example is the work of Jiang et al. [7], which finds a good abstraction from a set of successively finer ones, but cannot search over the exponentially many abstraction functions.

Function Approximation: Our solution uses function approximation to address the generalization problem implicit in Contextual-MDPs. Function approximation is the empirical state-of-the-art in reinforcement

    learning [20], but theoretical analysis has been quite limited. Several authors have studied linear function

    approximation (See [26,  21]) but none of these results give finite sample bounds, as they do not address

    the exploration question. Baird [3] analyzes more general function approximation for predicting the value

    function in a Markov Chain, but does not show convergence when the agent is also selecting actions. More

closely related to our work, Li and Littman [15] do give finite sample bounds for RL with function approximation, but they assume access to a particular “Knows-what-it-knows” oracle, which cannot exist even for simple problems. We are not aware of finite sample results for approximating Q* with a function class, which is precisely what we do here.

    3 Our Approach

In this paper, we consider the task of probably approximately correct (PAC) learning Contextual-MDPs. Given a policy class Π, we say that an algorithm PAC learns a Contextual-MDP if for any ε, δ ∈ (0, 1), the algorithm outputs a policy π̂ with V(π̂) ≥ max_{π∈Π} V(π) − ε with probability at least 1 − δ. The sample complexity of the algorithm is the number of episodes of the Contextual-MDP that the algorithm executes before returning an ε-suboptimal policy. Formally, the sample complexity is a function n : (0, 1)² → N such that for any ε, δ ∈ (0, 1), the algorithm returns an ε-suboptimal policy with probability at least 1 − δ using only n(ε, δ) episodes.

    3.1 Additional Assumptions for the Result

Our algorithm operates on Contextual-MDPs with two additional assumptions. The first assumption posits the ability to approximate the Q* function (3) well and seems essential for a function approximation based approach.

Assumption 1 (Realizability). We identify our set of policies Π with a set of regression functions F ⊂ (X × A) → [0, 1]. Specifically, we set Π = {π_f : f ∈ F} where π_f(x) = argmax_a f(x, a). We assume that F is available to the learner and make a realizability assumption, meaning that there exists a function f* ∈ F such that for every x ∈ X and a ∈ A, f*(x, a) = Q*_s(x, a) for any state s such that D_s(x) > 0. We use N to denote |F| = |Π|.

Note that the above assumption tacitly forces the function Q*_s(x, a) to be consistent across all states s with D_s(x) > 0. This is stronger than only assuming the consistency of the argmax of Q* as in Definition 1, but Q* may still be a complex function of the observation x and action a.
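The identification Π = {π_f : f ∈ F} is just the greedy construction π_f(x) = argmax_a f(x, a). A minimal sketch, representing a regressor simply as a Python callable (an assumption made for illustration):

```python
def induced_policy(f, K):
    """Return the greedy policy pi_f(x) = argmax_a f(x, a) for a regressor f: (x, a) -> [0, 1]."""
    return lambda x: max(range(K), key=lambda a: f(x, a))

# Hypothetical regressor over K = 3 actions; its induced policy picks the highest-scoring action.
f = lambda x, a: 1.0 / (1.0 + abs(x - a))
pi_f = induced_policy(f, K=3)
assert pi_f(2) == 2
```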


Algorithm 1 cMDPLearn(F, ε, δ)
  F ← DFS-Learn(∅, F, ε, δ/2).
  Let V̂ = V̂_f(∅) for any f ∈ F.
  f ← Explore-on-Demand(F, V̂, ε, δ/2).
  Return π_f.

The regressor class induces a family of value functions defined for each f ∈ F, s ∈ S, and for any policy π : X → A,

V_f(s, π) = E_{x∼D_s}[ f(x, π(x)) ].

Working through definitions, it is easy to see that

V*(s) = V_{f*}(s, π_{f*})

for all s, so that V* = E_{s∼Γ_H}[ V_{f*}(s, π_{f*}) ]. Recalling the earlier definition, PAC-learning in the realizable setting requires finding a policy π̂ with V(π̂) ≥ V* − ε.
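The subroutines below repeatedly form Monte-Carlo estimates of these induced values, V̂_f(p, π_f) = (1/n) Σ_i f(x_i, π_f(x_i)) with x_i ∼ D_p. A sketch, where sample_observation(p) stands in for rolling in along path p and drawing an observation (a hypothetical helper, not an API from the paper):

```python
def estimate_value(f, pi_f, p, n, sample_observation):
    """Monte-Carlo estimate of V_f(p, pi_f) = E_{x ~ D_p}[ f(x, pi_f(x)) ]."""
    xs = [sample_observation(p) for _ in range(n)]   # roll in to p and observe x_i ~ D_p
    return sum(f(x, pi_f(x)) for x in xs) / n        # f's own prediction of the future value
```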

Assumption 2 (Deterministic Transitions). We further assume that the transition model is deterministic. This means that the starting distribution Γ_H is a point-mass on some state s_H and the transition dynamics map state-action pairs deterministically to future states, i.e. Γ : (S × A) → S, preserving the layered structure.

    Even with deterministic transitions, PAC-learning Contextual-MDPs requires systematic exploration that

    is unaddressed in previous work.

    3.2 Algorithm

We seek an algorithm that can PAC-learn realizable deterministic-transition Contextual-MDPs with Poly(M, K, H, 1/ε, log(N), log(1/δ)) sample complexity, and we refer to such a sample complexity bound as polynomial in all relevant parameters. Notably, the algorithm should have no dependence on |X|, which may be infinite. We develop such an algorithm in this section, and we prove the sample complexity bound in Section 4. Our focus is on statistical efficiency, so we ignore computational considerations here.

Before turning to the algorithm, it is worth clarifying some additional notation. Since we are focused on the deterministic transition setting, it is natural to think about the Contextual-MDP as an exponentially large search tree with fan-out K and depth H. Each node in the search tree is labeled with a state s ∈ S, and each edge is labeled with an action a ∈ A, both of which are consistent with the transition model. A path p corresponds to a sequence of actions from the root of the search tree, and we also use p to denote the state reached after executing the corresponding sequence of actions from the root. We often call such a path a roll-in, in line with existing terminology. For a roll-in p, we use p ∘ a to denote the path formed by executing all actions in p and then executing action a. Let ∅ denote the empty path, which corresponds to the root of the search tree.
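Because transitions are deterministic, a path can be stored simply as the tuple of actions taken from the root, and the state it reaches recovered by a table lookup; the dictionary Gamma below is a hypothetical representation of the transition function.

```python
def state_of_path(p, s_root, Gamma):
    """Follow the action sequence p from the root state through the deterministic
    transition table Gamma[(state, action)] -> state, returning the state p reaches."""
    s = s_root
    for a in p:
        s = Gamma[(s, a)]
    return s

empty = ()              # the empty path, i.e. the root of the search tree
p = (0, 1)              # a roll-in: play action 0, then action 1
p_then_a = p + (1,)     # the path p o a obtained by appending action a = 1
```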

Pseudocode for our algorithm is displayed in Algorithm 1 with subroutines displayed as Algorithms 2, 3, 4, and 5. The algorithm should be invoked as cMDPLearn(F, ε, δ) where F is the given class of regression functions, ε is the target accuracy, and δ is the target failure probability. The two main components of the algorithm are the DFS-Learn and Explore-on-Demand routines. DFS-Learn ensures proper invocation of the training step, TD-Elim, by verifying a number of preconditions, while Explore-on-Demand finds regions of the search tree for which training must be performed.


Algorithm 2 DFS-Learn(p, F, ε, δ)
  Set φ = ε/(320H²√K) and ε_test = 20(H − |p| − 5/4)√K φ.
  for a ∈ A do
    if Not Consensus(p ∘ a, F, ε_test, φ, δ/(2MKH)) then
      F ← DFS-Learn(p ∘ a, F, ε, δ). # Recurse
    end if
  end for
  F̂ ← TD-Elim(p, F, φ, δ/(2MH)). # Learn in state p.
  Return F̂.

Algorithm 3 Consensus(p, F, ε_test, φ, δ)
  Set n_test = 2 log(2N/δ)/φ².
  Collect n_test observations x_i ∼ D_p.
  Compute Monte-Carlo estimates for each value function,
    V̂_f(p, π_f) = (1/n_test) Σ_{i=1}^{n_test} f(x_i, π_f(x_i))   ∀f ∈ F
  if |V̂_f(p, π_f) − V̂_g(p, π_g)| ≤ ε_test for all f, g ∈ F then
    return true
  end if
  Return false.
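A direct Python transcription of Algorithm 3, with sample_observation(p) again a hypothetical roll-in helper and regressors represented as callables; checking the maximum minus the minimum of the estimates is equivalent to the pairwise test in the pseudocode.

```python
import math

def consensus(p, regressors, eps_test, phi, delta, sample_observation, K):
    """Sketch of Algorithm 3 (Consensus): do all surviving regressors agree, up to
    eps_test, on the value of the state reached by path p?"""
    n_test = math.ceil(2 * math.log(2 * len(regressors) / delta) / phi ** 2)
    xs = [sample_observation(p) for _ in range(n_test)]              # shared samples x_i ~ D_p
    v_hat = []
    for f in regressors:
        pi_f = lambda x, f=f: max(range(K), key=lambda a: f(x, a))   # greedy policy pi_f
        v_hat.append(sum(f(x, pi_f(x)) for x in xs) / n_test)        # V_hat_f(p, pi_f)
    return max(v_hat) - min(v_hat) <= eps_test
```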

To convey the intuition of the algorithm, it is best to start from the subroutines.

The Elimination Component:  At a high level, the algorithm aims to maintain only the regressors that approximate the Q* function well, and it makes progress by discarding regressors that have a poor fit to the Q* function. At path p, we train by retaining only the regressors that have low excess risk on a carefully constructed regression problem (See TD-Elim, displayed in Algorithm 4). The regression problem in TD-Elim is motivated by Assumption 1 and the definition of Q* in Eq. (3), which imply that for any state s generating observation x,

f*(x, a) = E_{r∼D_{s|x}}[ r(a) ] + V(Γ(s, a), π_{f*})
         = E_{r∼D_{s|x}}[ r(a) ] + E_{x'∼D_{Γ(s,a)}}[ f*(x', π_{f*}(x')) ].   (4)

Thus f* is consistent between its estimate at the current state s and the future state s' = Γ(s, a).

The regression problem we create is essentially a finite sample version of this identity. However, some care must be taken as the target for each regression function f, V_f(s', π_f), is the value of the future as predicted by f. This target differs for each function but can be estimated from samples. To ensure correct behavior of the regression problem, we must obtain high-quality estimates of these future value predictions. Nevertheless, if constructed carefully, these regression problems ensure that the algorithm retains only good regressors, which induce good policies.

TD-Elim is inspired by the RegressorElimination algorithm of Agarwal et al. [2] for contextual bandit learning in the realizable setting. Apart from the differences in the regression problem, motivated by the discussion above, the other main difference between the algorithms is the choice of action-selection distribution.


Algorithm 4 TD-Elim(p, F, φ, δ)
  Require estimates V̂_f(p ∘ a, π_f), ∀f ∈ F, a ∈ A.
  Set n_train = 24 log(2N/δ)/φ².
  Collect n_train observations (x_i, a_i, r_i) where x_i ∼ D_p, a_i is chosen uniformly at random, and r_i = r_i(a_i).
  Update F to { f ∈ F : R̃(f) ≤ min_{f'∈F} R̃(f') + 2φ² + 22 log(2N/δ)/n_train },
    with R̃(f) = (1/n_train) Σ_{i=1}^{n_train} (f(x_i, a_i) − r_i − V̂_f(p ∘ a_i, π_f))².
  Return F.
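A sketch of the elimination step in Python. Here sample_transition(p) is a hypothetical helper that rolls in to p, plays a uniformly random action, and returns (x, a, r(a)), and V_hat[i][a] holds the previously computed estimate V̂_f(p ∘ a, π_f) for the i-th regressor; only the thresholded squared-loss filter is meant to mirror Algorithm 4.

```python
import math

def td_elim(p, regressors, V_hat, phi, delta, sample_transition):
    """Sketch of Algorithm 4 (TD-Elim): keep the regressors whose temporal-difference
    squared risk is within the stated threshold of the best risk."""
    N = len(regressors)
    n_train = math.ceil(24 * math.log(2 * N / delta) / phi ** 2)
    data = [sample_transition(p) for _ in range(n_train)]        # (x_i, a_i, r_i), a_i uniform

    def risk(i):
        f = regressors[i]
        # R_tilde(f) = (1/n) * sum_i ( f(x_i, a_i) - r_i - V_hat_f(p o a_i, pi_f) )^2
        return sum((f(x, a) - r - V_hat[i][a]) ** 2 for x, a, r in data) / n_train

    risks = [risk(i) for i in range(N)]
    threshold = min(risks) + 2 * phi ** 2 + 22 * math.log(2 * N / delta) / n_train
    return [regressors[i] for i in range(N) if risks[i] <= threshold]
```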

Algorithm 5 Explore-on-Demand(F, V̂, ε, δ)
  Set ε_demand = ε/2, n_demand,1 = 32 log(6MH/δ)/ε², and n_demand,2 = 8 log(3MH/δ)/ε.
  while true do
    Fix a regressor f ∈ F.
    Collect n_demand,1 trajectories according to π_f and estimate V̂(∅, π_f) via Monte-Carlo estimate.
    If |V̂(∅, π_f) − V̂| ≤ ε_demand, return π_f.
    Otherwise update F by calling DFS-Learn(p, F, ε, δ/(3MH n_demand,2)) on each of the H−1 prefixes p of each of the first n_demand,2 paths collected for the Monte-Carlo estimate.
  end while

RegressorElimination must carefully choose actions to balance exploration and exploitation,

    which leads to an optimal regret bound. In contrast, we are pursuing a PAC-guarantee here, for which it

suffices to focus exclusively on exploration.

The Consensus Test:  The other component of the DFS-Learn routine, which is crucial for obtaining

    polynomial sample complexity, is a global exploration technique (See Consensus in Algorithm 3). This is

    based on testing for consensus among the surviving regression functions, which can be done by estimating

    the value predictions for all surviving regressors. Specifically, if the consensus test returns  true, then all

    the surviving regressors agree on the value of the current state. As shown below, this condition is sufficient

    for successfully running   TD-Elim   at the parent state. Furthermore, if we have already trained on the

(observation, reward) distribution induced by the path p, then this test returns true with high probability, thus implicitly performing a state equality test at a level needed by the class F. The first property ensures that we invoke the training mechanism properly, while the second property implies that we avoid exploring

    the entire search tree.

    This test is performed at each path visited by the algorithm  before making the recursive call on that path,

    and if the surviving functions are in agreement, the algorithm does not visit the descendant paths. Thus, the

    algorithm does not traverse a large fraction the entire search tree provided that the consensus test succeeds

    often enough. The number of times the test does not report  true  at each level is upper bounded by the

    number of states M  leading to a polynomial sample complexity bound.
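Combining the two tests, the recursion of Algorithm 2 looks as follows in Python; consensus and td_elim are assumed to behave like the sketches above (their exact signatures here are simplified and hypothetical), and the leaf guard is an assumption added only to keep the sketch well founded.

```python
import math

def dfs_learn(p, regressors, eps, delta, M, K, H, consensus, td_elim):
    """Sketch of Algorithm 2 (DFS-Learn): recurse only beneath the children of p on which
    the surviving regressors disagree, then run the elimination step at p itself."""
    phi = eps / (320 * H ** 2 * math.sqrt(K))
    eps_test = 20 * (H - len(p) - 5 / 4) * math.sqrt(K) * phi
    if len(p) < H:                        # guard for this sketch: leaves have no children to test
        for a in range(K):
            child = p + (a,)              # the path p o a
            if not consensus(child, regressors, eps_test, phi, delta / (2 * M * K * H)):
                regressors = dfs_learn(child, regressors, eps, delta, M, K, H, consensus, td_elim)
    return td_elim(p, regressors, phi, delta / (2 * M * H))
```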


On-Demand Exploration:  Apart from the first call to DFS-Learn, which is simply used to estimate the optimal value V*, the bulk of the computations occur inside the loop of Explore-on-Demand. This algorithm is another exploration technique that only invokes the learning mechanism on regions of the search space that are visited by the surviving policies. The specification is quite straightforward: it iteratively selects a surviving policy π_f, estimates its value V(∅, π_f) at the root, and if the policy has highly sub-optimal value, it invokes DFS-Learn on many of the paths visited by π_f before repeating. If the policy has near-optimal value, it simply returns the policy.

This subroutine is motivated by the following high-level argument, which we formalize in our analysis. As we will show, running the elimination step at some path p ensures that all surviving regressors take good actions at p, in the sense that taking one action according to any surviving policy and then behaving optimally thereafter achieves the near-optimal reward for path p. Unfortunately this does not ensure that all surviving policies achieve near-optimal reward, because they may take highly sub-optimal actions after the first one. On the other hand, if a surviving policy π_f visits only states for which TD-Elim has been invoked, then it must have near-optimal reward.

The contrapositive of this statement is that if a surviving policy π_f has highly sub-optimal reward, then it must visit some state that TD-Elim has not been invoked on with substantial probability. By calling DFS-Learn on the paths visited by the policy, we ensure that we call TD-Elim on this "unlearned" state. Since there are only MH distinct states in the search tree, and each non-terminal iteration ensures training on an unlearned state, this algorithm is guaranteed to terminate and output a near-optimal policy.

    4 Theoretical Analysis

    In this section we prove a PAC-learning guarantee for  cMDPLearn on Contextual-MDPs.

Theorem 1 (PAC bound). For any ε, δ ∈ (0, 1) and any Contextual-MDP (Definition 1) with deterministic transitions for which Q* ∈ F, with probability at least 1 − δ, the policy π returned by cMDPLearn(F, ε, δ) is at most ε-suboptimal. Moreover, cMDPLearn(F, ε, δ) requires at most

Õ( (MH⁶K²/ε³) log(N/δ) log(1/δ) )

episodes.

This result uses the Õ notation to suppress logarithmic dependence in all parameters except for N and δ. The precise dependence on all parameters can be recovered by examination of our proof and is shortened here simply for clarity.

This theorem states that cMDPLearn learns a policy that is at most ε-suboptimal for a Contextual-MDP using a number of episodes that is polynomial in all relevant parameters. Looking more closely into the result, our overall sample complexity is shown to scale with n_demand,2 (n_train + K n_test), ignoring some factors of M and H and logarithmic dependencies. Since n_train and n_test are set to be of the same order, this reveals that a factor of K more samples are accumulated for the Consensus routine. Consequently, the sample complexity can be improved by a factor of K whenever we do not need to test for state equality, or if collecting several exploration observations x for a path p without observing reward signal is cheap. Moreover, since n_train and n_test scale with 1/ε² while n_demand,2 scales with log(1/δ)/ε, this gives the 1/ε³ dependence in Theorem 1. This setting of n_demand,2 as Õ(log(1/δ)/ε) is required to ensure that we encounter at least one "unlearned" state on which TD-Elim needs to be called. Consequently, the sample complexity can also be


improved by a factor of log(1/δ)/ε, if identifying an unlearned state can be done easily, for example in a tabular MDP.
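As a bookkeeping aid (not part of the algorithm), the parameter settings quoted above can be evaluated numerically to see how the per-call sample sizes combine into the overall bound; the formulas are exactly those set inside Algorithms 2-5.

```python
import math

def sample_sizes(eps, delta, M, K, H, N):
    """Evaluate the per-call sample sizes used by cMDPLearn's subroutines."""
    phi = eps / (320 * H ** 2 * math.sqrt(K))                 # set in DFS-Learn
    eps_test_root = 20 * (H - 5 / 4) * math.sqrt(K) * phi     # eps_test at the empty path (|p| = 0)
    return {
        "phi": phi,
        "eps_test_root": eps_test_root,
        "n_test": 2 * math.log(2 * N / delta) / phi ** 2,     # per Consensus call
        "n_train": 24 * math.log(2 * N / delta) / phi ** 2,   # per TD-Elim call
        "n_demand_1": 32 * math.log(6 * M * H / delta) / eps ** 2,
        "n_demand_2": 8 * math.log(3 * M * H / delta) / eps,
    }

# e.g. sample_sizes(eps=0.1, delta=0.05, M=10, K=4, H=5, N=10**6)
```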

    Since Contextual-MDPs generalize both contextual bandits and MDPs, it is worth comparing the results.

1. In contextual bandits, we have M = H = 1 so that the sample complexity of cMDPLearn is Õ((K²/ε³) log(N/δ) log(1/δ)), in contrast with the optimal Õ((K/ε²) log(N/δ)) sample complexity for contextual bandit learning. As discussed above, the main gap is due to the K n_test factor, which goes away in contextual bandits since there is only one state, and the additional log(1/δ)/ε factor in Explore-on-Demand, which need not be invoked at all in the contextual bandit case. Thus with minor modification, cMDPLearn matches the optimal sample complexity for contextual bandits.

2. Assumptions vary with the paper, but broadly prior results establish that the sample complexity for learning layered episodic MDPs with deterministic transitions is Õ((MK poly(H)/ε²) log(1/δ)) [5, 22]. Again the discrepancy is the additional factors of K and log(1/δ)/ε present in Theorem 1, both of which can be avoided given that the states are known in an MDP. In this setting, cMDPLearn can easily be modified to have Õ((MKH⁵/ε²) log(N/δ)) sample complexity for layered episodic MDPs.

4.1 Preliminaries for the Proof

The proof of the theorem hinges on analysis of the subroutines. We turn first to the TD-Elim routine, for which we show the following guarantee.

Theorem 2 (Guarantee for TD-Elim). Consider running TD-Elim at path p with regressors F, parameters φ, δ, and with n_train = 24 log(2N/δ)/φ². Suppose that the following are true:

1. Estimation Precondition: We have access to estimates V̂_f(p ∘ a, π_f) for all f ∈ F, a ∈ A such that |V̂_f(p ∘ a, π_f) − V_f(p ∘ a, π_f)| ≤ φ.

2. Bias Precondition: For all f, g ∈ F and for all a ∈ A, |V_f(p ∘ a, π_f) − V_g(p ∘ a, π_g)| ≤ τ_1.

Then the following hold simultaneously with probability at least 1 − δ:

1. f* is retained by the algorithm.

2. Bias Bound: |V_f(p, π_f) − V_g(p, π_g)| ≤ 8φ√K + 2φ + τ_1   (5)

3. Instantaneous Risk Bound: V*(p) − V_f(p, π_f) ≤ 4φ√(2K) + 2φ + 2τ_1   (6)

4. Estimation Bound: Regardless of whether the preconditions hold, we have estimates V̂_f(p, π_f) with |V̂_f(p, π_f) − V_f(p, π_f)| ≤ φ/√12.   (7)

The last three bounds hold for all surviving f, g ∈ F.


For the bias precondition, the proof proceeds by induction on the number of actions to-go h. The inductive claim is that for all paths p with h actions to-go that the subroutine accesses (i.e., calls TD-Elim or Consensus returns true) and all surviving f, g ∈ F, we have

|V_f(p, π_f) − V_g(p, π_g)| ≤ 20h√K φ.

This claim is verified by applying Theorem 2 on paths for which the algorithm calls TD-Elim with the choice τ_1 = 20(h − 1)√K φ (due to the inductive hypothesis), and by applying Theorem 3 on the other accessed paths with ε_test as prescribed in DFS-Learn. Thus both preconditions for Theorem 2 are satisfied on all calls to the subroutine.

It remains to bound the number of calls. The main insight here is that if TD-Elim has been called on a state s, then Consensus returns true on any path p that leads to s. This follows by the bias bound in Theorem 2 and the setting of ε_test. Since there are only M states per level, this means that the number of calls to TD-Elim is bounded by MH and the number of calls to Consensus is at most MKH. This suggests a setting for the failure probability parameter in the calls to the subroutines, so that a union bound reveals that the total failure probability is at most δ.

Lastly, each call to TD-Elim requires K calls to Consensus, so if T calls to TD-Elim are performed the total sample complexity is

T(n_train + K n_test) = O( (TH⁴K²/ε²) log(NMKH/δ) ).

    Finally we turn to the  Explore-on-Demand routine.

Theorem 5 (Guarantee for Explore-on-Demand). Consider running Explore-on-Demand with regressors F, estimate V̂, and parameters ε, δ, and assume that |V̂ − V*| ≤ ε/8. Then with probability at least 1 − δ, Explore-on-Demand terminates after at most

Õ( (MH⁶K²/ε³) log(N/δ) log(1/δ) )

trajectories and it returns a policy π_f with V* − V(∅, π_f) ≤ ε.

We provide a sketch here. See Appendix E for details.

    Proof Sketch.  First, by standard concentration-of-measure arguments, it is apparent that if the algorithm

    selects a policy with near-optimal performance, then it terminates, and conversely, if it terminates then it

    produces a policy with near-optimal performance. Thus the main challenge is in bounding the number of 

    iterations of the loop until the algorithm terminates.

Our proof first shows that if a policy π_f has poor performance, then it must visit states that have not been trained on with substantial probability. Specifically, let L denote the set of states for which we have called TD-Elim and let L̄ be its complement, that is L̄ = S \ L. Then we can bound the sub-optimality of a surviving policy π_f by

V* − V(∅, π_f) ≤ O(H²√K φ) + P[π_f visits L̄].

This bound is obtained by another inductive argument that uses the instantaneous risk bound in Theorem 2

    on states in L. Since the instantaneous risk bound is itself obtained inductively as in the proof of Theorem  4


(i.e., τ_1 = 20(h − 1)√K φ for states at level h), the recurrence here grows as h². Our setting of φ ensures that if a surviving policy visits only states in L, then its suboptimality is at most ε/8.

Thus if a surviving policy is very suboptimal, it must visit L̄ with substantial probability. Applying a Chernoff Bound, we see that at least one of the n_demand,2 trajectories that we train on must visit L̄. This ensures that at every iteration of the loop the set L grows by at least one state, and since there are at most MH states in the Contextual-MDP, the number of iterations is bounded by MH.

To bound the sample complexity, we perform at most MH iterations of the loop and each iteration makes H n_demand,2 calls to DFS-Learn. Each call to DFS-Learn makes at least one call to TD-Elim but the total number of additional calls can be at most MH, since by the argument in the proof of Theorem 4, once a state has been trained on, Consensus always returns true. Thus the sample complexity is

(MH × H n_demand,2 + MH) × (n_train + K n_test) = Õ( (MH⁶K²/ε³) log(N/δ) log(1/δ) ).

This gives the sample complexity bound in Theorem 5.

Proof of Theorem 1:  The proof of the main theorem follows from straightforward application of Theorems 4 and 5. First, since we run DFS-Learn at the root ∅, the bias and estimation bounds in Theorem 2 apply at ∅, so we guarantee accurate estimation of the value V* (See Corollary 1 in Appendix A). This is required by the Explore-on-Demand routine, but at this point, we can simply apply Theorem 5, which is guaranteed to find an ε-suboptimal policy and also terminate in MH iterations. Combining these two results, appropriately allocating the failure probability δ, and accumulating the sample complexity bounds establishes Theorem 1.

    5 Discussion

This paper introduces a new model, Contextual-MDPs, in which it is possible to design and analyze principled reinforcement learning algorithms that engage in global exploration. As a first step, we develop cMDPLearn and show that it learns near-optimal behavior in Contextual-MDPs with polynomial sample complexity. To our knowledge, this is the first polynomial sample complexity bound for reinforcement learning with general function approximation.

    However, there are many avenues for future work:

    1.   cMDPLearn has two main undesirable properties. Firstly, it requires a deterministic transition model

    which is unrealistic in some practical settings. Secondly, the algorithm involves enumerating the class

    of regression functions, so while its sample complexity is logarithmic in the function class size, its

    running time is linear, which is typically intractably slow. Resolving both of these deficiencies may

    lead to a new practical reinforcement learning algorithm.

    2. Our algorithm also crucially relies on the realizability assumption, which on one hand is implicitly

    assumed by state-of-the-art reinforcement learning algorithms, but is known to be unnecessary in

    the contextual bandit setting. Is it possible to design completely agnostic algorithms for learning in

    Contextual-MDPs?

    We look forward to pursuing these directions.


    Acknowledgements

    We thank Akshay Balsubramani and Hal Daumé III for formative discussions, and we thank Tzu-Kuo Huang

    for a careful reading of an early draft of this paper.

    References

[1] Yasin Abbasi-Yadkori and Gergely Neu. Online learning in MDPs with side information.

    arXiv:1406.6812, 2014.

    [2] Alekh Agarwal, Miroslav Dudı́k, Satyen Kale, John Langford, and Robert E Schapire. Contextual

    bandit learning with predictable rewards. In   International Conference on Artificial Intelligence and 

    Statistics (AISTATS), 2012.

    [3] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In  Inter-

    national Conference on Machine Learning (ICML), 1995.

[4] Ronen I Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 2003.

    [5] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement

    learning. In Advances in Neural Information Processing Systems (NIPS), 2015.

    [6] Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and

    Tong Zhang. Efficient optimal learning for contextual bandits. In Uncertainty in Artificial Intelligence

    (UAI), 2011.

    [7] Nan Jiang, Alex Kulesza, and Satinder Singh. Abstraction selection in model-based reinforcement

    learning. In International Conference on Machine Learning (ICML), 2015.

    [8] Nicholas K Jong and Peter Stone. Model-based exploration in continuous state spaces. In  Abstraction,

     Reformulation, and Approximation, 2007.

    [9] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In

     International Conference on Machine Learning (ICML), 2002.

    [10] Sham Kakade, Michael Kearns, and John Langford. Exploration in metric state spaces. In International

    Conference on Machine Learning (ICML), 2003.

    [11] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine

     Learning, 2002.

[12] Michael J Kearns, Yishay Mansour, and Andrew Y Ng. Approximate planning in large POMDPs via

    reusable trajectories. In  Advances in Neural Information Processing Systems (NIPS), 1999.

    [13] Michael J. Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal

planning in large Markov decision processes. Machine Learning, 2002.

    [14] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side infor-

    mation. In Advances in Neural Information Processing Systems (NIPS), 2008.


    [15] Lihong Li and Michael L Littman. Reducing reinforcement learning to kwik online regression. Annals

    of Mathematics and Artificial Intelligence, 2010.

    [16] Lihong Li, Thomas J Walsh, and Michael L Littman. Towards a unified theory of state abstraction for

MDPs. In International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2006.

    [17] Michael L Littman, Richard S Sutton, and Satinder P Singh. Predictive representations of state. In

     Advances in Neural Information Processing Systems (NIPS), 2001.

    [18] Yishay Mansour. Reinforcement learning and mistake bounded algorithms. In Conference on Compu-

    tational Learning Theory (COLT), 1999.

    [19] Nicolas Meuleau, Leonid Peshkin, Kee-Eung Kim, and Leslie Pack Kaelbling. Learning finite-state

    controllers for partially observable environments. In Uncertainty in Artificial Intelligence (UAI), 1999.

    [20] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare,

    Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control

    through deep reinforcement learning.   Nature, 2015.

[21] Theodore J Perkins and Doina Precup. A convergent form of approximate policy iteration. In Advances in Neural Information Processing Systems (NIPS), 2002.

    [22] Spyros Reveliotis and Theologos Bountourelis. Efficient pac learning for episodic tasks with acyclic

    state spaces.  Discrete Event Dynamic Systems, 2007.

    [23] Satinder Singh, Michael R James, and Matthew R Rudary. Predictive state representations: A new

    theory for modeling dynamical systems. In   Uncertainty in Artificial Intelligence (UAI). AUAI Press,

    2004.

[24] Alexander L Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L Littman. PAC model-free

    reinforcement learning. In International Conference on Machine Learning (ICML), 2006.

    [25] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods

for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS), 1999.

    [26] John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function

    approximation. IEEE Transactions on Automatic Control, 1997.


    A An Additional Corollary

A simple consequence of Theorem 4 is that we can estimate V* accurately once we have called DFS-Learn on ∅.

Corollary 1 (Estimating V*). Consider running DFS-Learn at ∅ with regressors F and parameters ε, δ. Then with probability at least 1 − δ, the estimate V̂ satisfies |V̂ − V*| ≤ ε/8. Moreover the algorithm uses at most

O( (MH⁵K²/ε²) log(NMHK/δ) )

trajectories.

Proof. Since we ran DFS-Learn at ∅, we may apply Theorem 4. By specification of the algorithm, we certainly ran TD-Elim at ∅, so we apply the conclusions in Theorem 2. In particular, we know that f* ∈ F and that for any surviving f ∈ F,

|V̂_f(p, π_f) − V*| = |V̂_f(p, π_f) − V_f(p, π_f) + V_f(p, π_f) − V_{f*}(p, π_{f*})|
                   ≤ φ/√12 + 8φ√K + 2φ + 20(H − 1)√K φ ≤ ε/8.

The last bound follows from the setting of φ. Since our estimate V̂ is V̂_f(p, π_f) for some surviving f, we guarantee estimation error at most ε/8.

As for the sample complexity, Theorem 4 shows that the total number of executions of TD-Elim can be at most MH, which is our setting of T.

    B Proof of Theorem 2

The proof of Theorem 2 is quite technical, and we compartmentalize it into several components. We begin with

    several technical lemmas. Throughout we will use the preconditions of the theorem, which we reproduce

    here.

Condition 1. For all f ∈ F and a ∈ A, we have estimates V̂_f(p ∘ a, π_f) such that

|V̂_f(p ∘ a, π_f) − V_f(p ∘ a, π_f)| ≤ φ.

Condition 2. For all f, g ∈ F and a ∈ A we have

|V_f(p ∘ a, π_f) − V_g(p ∘ a, π_g)| ≤ τ_1.

We will make frequent use of the parameters φ and τ_1, which are specified by these two conditions and explicit in the theorem statement.

Recall the notation

V_f(p, π_g) = E_{x∼D_p}[ f(x, π_g(x)) ],


which will be used heavily throughout the proof.

As a notational convenience, we will suppress dependence on the distribution D_p, since we are considering one invocation of TD-Elim and we always roll into path p. This means that all (observation, reward) tuples will be drawn from D_p. Secondly, it will be convenient to introduce the shorthand V_f(p) = V_f(p, π_f), and similarly for the estimates. Finally, we will further shorten the value functions for paths p ∘ a by defining

V_f^a = E_{x∼D_{p∘a}}[ f(x, π_f(x)) ] = V_f(p ∘ a, π_f).

We will also use V̂_f^a to denote the estimated versions which we have access to according to Condition 1. Lastly, our proof makes extensive use of the following random variable, which is defined for a particular regressor f ∈ F:

Y(f) = (f(x, a) − r(a) − V̂_f(p ∘ a))² − (f*(x, a) − r(a) − V̂_{f*}(p ∘ a))².

Here (x, r) ∼ D_p and a ∈ A is drawn uniformly at random as prescribed by Algorithm 4. We use Y(f) to denote the random variable associated with regressor f, but sometimes drop the dependence on f when it is clear from context.

    To proceed, we first compute the expectation and variance of this random variable.

Lemma 1 (Properties of TD Squared Loss). Assume Condition 1 holds. Then for any f ∈ F, the random variable Y satisfies

E_{x,a,r}[Y] = E_{x,a}[ (f(x, a) − V̂_f(p ∘ a) − f*(x, a) + V_{f*}(p ∘ a))² ] − E_{x,a}[ (V̂_{f*}(p ∘ a) − V_{f*}(p ∘ a))² ],
Var_{x,a,r}[Y] ≤ 32 E_{x,a}[Y] + 64φ².

Proof. For further shorthand, denote f = f(x, a), f* = f*(x, a), and recall the definitions of V_f^a and V̂_f^a.

E_{x,a,r}[Y] = E_{x,a,r}[ (f − V̂_f^a − r(a))² − (f* − V̂_{f*}^a − r(a))² ]
            = E_{x,a,r}[ (f − V̂_f^a)² − 2r(a)(f − V̂_f^a − f* + V̂_{f*}^a) − (f* − V̂_{f*}^a)² ].

Now recall that E[r(a) | x, a] = f*(x, a) − V_{f*}^a by the definition of f*, which allows us to further obtain

E_{x,a,r}[Y] = E_{x,a}[ (f − V̂_f^a)² − 2(f* − V_{f*}^a)(f − V̂_f^a) + 2(f* − V̂_{f*}^a + V̂_{f*}^a − V_{f*}^a)(f* − V̂_{f*}^a) − (f* − V̂_{f*}^a)² ]
            = E_{x,a}[ (f − V̂_f^a)² − 2(f* − V_{f*}^a)(f − V̂_f^a) + (f* − V̂_{f*}^a)² + 2(V̂_{f*}^a − V_{f*}^a)(f* − V̂_{f*}^a) ]
            = E_{x,a}[ (f − V̂_f^a)² − 2(f* − V_{f*}^a)(f − V̂_f^a) + (f* − V_{f*}^a + V_{f*}^a − V̂_{f*}^a)² + 2(V̂_{f*}^a − V_{f*}^a)(f* − V̂_{f*}^a) ]
            = E_{x,a}[ (f − V̂_f^a − f* + V_{f*}^a)² + 2(V_{f*}^a − V̂_{f*}^a)(f* − V_{f*}^a) + (V_{f*}^a − V̂_{f*}^a)² + 2(V̂_{f*}^a − V_{f*}^a)(f* − V̂_{f*}^a) ]
            = E_{x,a}[ (f − V̂_f^a − f* + V_{f*}^a)² − (V_{f*}^a − V̂_{f*}^a)² ].

For the second claim, notice that we can write

Y = (f − V̂_f^a − f* + V̂_{f*}^a)(f − V̂_f^a + f* − V̂_{f*}^a − 2r(a)),


so that

Y² ≤ 16(f − V̂_f^a − f* + V̂_{f*}^a)².

This holds because all quantities in the second term are bounded in [0, 1]. Therefore,

Var(Y) ≤ E[Y²]
       ≤ 16 E_{x,a}[ (f(x, a) − V̂_f^a − f*(x, a) + V̂_{f*}^a)² ]
       = 16 E_{x,a}[ (f(x, a) − V̂_f^a − f*(x, a) + V_{f*}^a + V̂_{f*}^a − V_{f*}^a)² ]
       ≤ 32 E_{x,a}[ (f(x, a) − V̂_f^a − f*(x, a) + V_{f*}^a)² ] + 32φ²
       ≤ 32 E_{x,a}[Y] + 64φ².

The first inequality is straightforward, while the second inequality is from the argument above. The third inequality uses the fact that (a + b)² ≤ 2a² + 2b² and the fact that for each a, the estimate V̂_{f*}^a has absolute error at most φ (by Condition 1). The last inequality adds and subtracts the term involving (V_{f*}^a − V̂_{f*}^a)² to obtain E_{x,a}[Y].

    The next step is to relate the empirical squared loss to the population squared loss, which is done by

    application of Bernstein’s inequality.

Lemma 2 (Squared Loss Deviation Bounds). Assume Condition 1 holds. With probability at least 1 − δ/2, where δ is a parameter of the algorithm, f* survives the filtering step of Algorithm 4 and moreover, any surviving f satisfies

E[Y(f)] ≤ 10φ² + 120 log(2N/δ)/n_train.

Proof. We will apply Bernstein's inequality on the centered random variable

Σ_{i=1}^{n_train} ( Y_i(f) − E[Y_i(f)] ),

and then take a union bound over all f ∈ F. Here the expectation is over the n_train samples (x_i, a_i, r_i) where (x_i, r) ∼ D_p, a_i is chosen uniformly at random, and r_i = r(a_i). Notice that since actions are chosen uniformly at random, all terms in the sum are identically distributed, so that E[Y_i(f)] = E[Y(f)].

To that end, fix one f ∈ F and notice that |Y − E[Y]| ≤ 8 almost surely, as each quantity in the definition of Y is bounded in [0, 1], so each of the four terms can be at most 4, but two are non-positive and two are non-negative in Y − E[Y]. We will use Lemma 1 to control the variance. Bernstein's inequality implies that, with probability at least 1 − δ,

Σ_{i=1}^{n_train} ( E[Y_i] − Y_i ) ≤ √( 2 Σ_i Var(Y_i) log(1/δ) ) + 16 log(1/δ)/3
                                  ≤ √( 64 Σ_i (E[Y_i] + 2φ²) log(1/δ) ) + 16 log(1/δ)/3.


    The first inequality here is Bernstein’s inequality while the second is based on the variance bound in

    Lemma 1.

Now letting X = √( Σ_i (E[Y_i] + 2φ²) ), Z = Σ_i Y_i, and C = √(log(1/δ)), the inequality above is equivalent to

X² − 2 n_train φ² − Z ≤ 8XC + (16/3)C²
⇒ X² − 8XC + 16C² − Z ≤ 2 n_train φ² + 22C²
⇒ (X − 4C)² − Z ≤ 2 n_train φ² + 22C²
⇒ −Z ≤ 2 n_train φ² + 22C².

Using the definition of −Z, this last inequality implies that

Σ_{i=1}^{n_train} (f*(x_i, a_i) − r_i(a_i) − V̂_{f*}(p ∘ a_i))² ≤ Σ_{i=1}^{n_train} (f(x_i, a_i) − r_i(a_i) − V̂_f(p ∘ a_i))² + 2 n_train φ² + 22 log(1/δ).

    Via a union bound over all  f 

     ∈ F , rebinding δ 

     ←δ/(2N ), and dividing through by ntrain, we have,

    R̃(f ) ≤ minf ∈F 

    R̃(f ) + 2φ2 + 22 log(2N/δ )

    ntrain

Since this is precisely the threshold used in filtering regressors, we ensure that f* survives. Now for any other surviving regressor f, we are ensured that Z is upper bounded. Specifically, we have,

(X − 4C)² ≤ Z + 2 n_train φ² + 22C² ≤ 4 n_train φ² + 44C²
⇒ X² ≤ ( √(4 n_train φ² + 44C²) + 4C )²
≤ 8 n_train φ² + 120C²

This proves the claim since X² = n_train E[Y(f)] + 2 n_train φ² (recall that the Y_i are identically distributed).
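The algebra in the last two displays can be checked mechanically; here is a minimal numerical sketch (the parameter values are arbitrary and only exercise the inequality (a + b)² ≤ 2a² + 2b²):

    import numpy as np

    rng = np.random.default_rng(1)
    for _ in range(1000):
        n = int(rng.integers(1, 10_000))
        phi = rng.uniform(0.0, 1.0)
        C = rng.uniform(0.0, 10.0)
        bound = 4 * n * phi ** 2 + 44 * C ** 2
        X = np.sqrt(bound) + 4 * C   # largest X consistent with (X - 4C)^2 <= bound
        assert X ** 2 <= 8 * n * phi ** 2 + 120 * C ** 2 + 1e-9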

This deviation bound allows us to establish the three claims in Theorem 2. We start with the estimation error claim, which is straightforward.

Lemma 3 (Estimation Error). Let δ ∈ (0, 1). Then with probability at least 1 − δ, for all f ∈ F that are retained by Algorithm 4, we have estimates V̂^f(p, π_f) with,

|V̂^f(p, π_f) − V^f(p, π_f)| ≤ √( 2 log(N/δ) / n_train ).

    Proof.  The proof is a consequence of Hoeffding’s inequality and a union bound. Clearly the Monte Carlo

    estimate,

V̂^f(p, π_f) = (1/n_train) Σ_{i=1}^{n_train} f(x_i, π_f(x_i)),

is unbiased for V^f(p, π_f), and the centered quantity is bounded in [−1, 1]. Thus Hoeffding's inequality gives precisely the bound in the lemma.
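For concreteness, a minimal sketch of this Monte Carlo estimate and of the deviation radius in Lemma 3 is given below; sample_context, f, and pi_f are hypothetical placeholders standing in for rolling in to the path p, the regressor, and its greedy policy.

    import numpy as np

    def mc_value_estimate(sample_context, f, pi_f, n_train):
        """Average f(x, pi_f(x)) over n_train contexts obtained by rolling in to p."""
        xs = [sample_context() for _ in range(n_train)]
        return float(np.mean([f(x, pi_f(x)) for x in xs]))

    def estimation_radius(n_train, N, delta):
        """Deviation bound of Lemma 3 after the union bound over the N regressors."""
        return np.sqrt(2 * np.log(N / delta) / n_train)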


    Next we turn to the claim regarding bias.

Lemma 4 (Bias Accumulation). Assume Conditions 1 and 2 hold. In the same 1 − δ/2 event in Lemma 2, for any pair f, g ∈ F retained by Algorithm 4, we have,

V^f(p, π_f) − V^g(p, π_g) ≤ 2√( K (11φ² + 120 log(2N/δ)/n_train) ) + 2φ + τ_1.

    Proof.  We start by expanding definitions,

V^f(p, π_f) − V^g(p, π_g) = E_{x∼D_p}[ f(x, π_f(x)) − g(x, π_g(x)) ]

Now, since g prefers π_g(x) to π_f(x), it must be the case that g(x, π_g(x)) ≥ g(x, π_f(x)), so that,

V^f(p, π_f) − V^g(p, π_g) ≤ E_{x∼D_p}[ f(x, π_f(x)) − g(x, π_f(x)) ]
= E_{x∼D_p}[ f(x, π_f(x)) − V̂^f(p ∘ π_f(x), π_f) − f*(x, π_f(x)) + V^{f*}(p ∘ π_f(x), π_{f*}) ]
  − E_{x∼D_p}[ g(x, π_f(x)) − V̂^g(p ∘ π_f(x), π_g) − f*(x, π_f(x)) + V^{f*}(p ∘ π_f(x), π_{f*}) ]
  + E_{x∼D_p}[ V̂^f(p ∘ π_f(x), π_f) − V̂^g(p ∘ π_f(x), π_g) ]

    This last equality is just based on adding and subtracting several terms. The first two terms look similar, and

    we will relate them to the squared loss. For the first, by Lemma  1, we have that for each x ∈ X ,

E_{r,a|x}[Y(f)] + E_{a|x}[ (V̂^{f*}(p ∘ a, π_{f*}) − V^{f*}(p ∘ a, π_{f*}))² ] = E_{a|x}[ (f(x, a) − V̂^f(p ∘ a, π_f) − f*(x, a) + V^{f*}(p ∘ a, π_{f*}))² ]
≥ (1/K) (f(x, π_f(x)) − V̂^f(p ∘ π_f(x), π_f) − f*(x, π_f(x)) + V^{f*}(p ∘ π_f(x), π_{f*}))²

The equality is Lemma 1, while the inequality follows from the fact that each action, in particular π_f(x), is played with probability 1/K and the quantity inside the expectation is non-negative. Now by Jensen's inequality the first term can be upper bounded as,

E_{x∼D_p}[ f(x, π_f(x)) − V̂^f(p ∘ π_f(x), π_f) − f*(x, π_f(x)) + V^{f*}(p ∘ π_f(x), π_{f*}) ]
≤ √( E_{x∼D_p}[ (f(x, π_f(x)) − V̂^f(p ∘ π_f(x), π_f) − f*(x, π_f(x)) + V^{f*}(p ∘ π_f(x), π_{f*}))² ] )
= √( K E_{x∼D_p}[ (1/K) (f(x, π_f(x)) − V̂^f(p ∘ π_f(x), π_f) − f*(x, π_f(x)) + V^{f*}(p ∘ π_f(x), π_{f*}))² ] )
≤ √( K ( E_{x,a,r}[Y(f)] + E_{x,a}[ (V̂^{f*}(p ∘ a, π_{f*}) − V^{f*}(p ∘ a, π_{f*}))² ] ) )
≤ √( K ( E[Y(f)] + φ² ) )
≤ √( K ( 11φ² + 120 log(2N/δ)/n_train ) ),


where the last step follows from Lemma 2. This bounds the first term in the expansion of V^f(p, π_f) − V^g(p, π_g). Now for the term involving g, we can apply essentially the same argument,

− E_{x∼D_p}[ g(x, π_f(x)) − V̂^g(p ∘ π_f(x), π_g) − f*(x, π_f(x)) + V^{f*}(p ∘ π_f(x), π_{f*}) ]
≤ √( E_{x∼D_p}[ (g(x, π_f(x)) − V̂^g(p ∘ π_f(x), π_g) − f*(x, π_f(x)) + V^{f*}(p ∘ π_f(x), π_{f*}))² ] )
≤ √( K ( 11φ² + 120 log(2N/δ)/n_train ) )

    Summarizing, the current bound we have is,

V^f(p, π_f) − V^g(p, π_g) ≤ 2√( K (11φ² + 120 log(2N/δ)/n_train) ) + E_{x∼D_p}[ V̂^f(p ∘ π_f(x), π_f) − V̂^g(p ∘ π_f(x), π_g) ]    (8)

    The last term is easily bounded by the preconditions in the statement of Theorem 2.  For each a, we have,

V̂^f(p ∘ a, π_f) − V̂^g(p ∘ a, π_g)
≤ |V̂^f(p ∘ a, π_f) − V^f(p ∘ a, π_f)| + |V^f(p ∘ a, π_f) − V^g(p ∘ a, π_g)| + |V^g(p ∘ a, π_g) − V̂^g(p ∘ a, π_g)|
≤ 2φ + τ_1,

    from Conditions 1 and 2.  Consequently

E_{x∼D_p}[ V̂^f(p ∘ π_f(x), π_f) − V̂^g(p ∘ π_f(x), π_g) ]
= Σ_{a∈A} E_x[ 1[π_f(x) = a] (V̂^f(p ∘ a, π_f) − V̂^g(p ∘ a, π_g)) ]
≤ 2φ + τ_1

This proves the claim.

Lastly, we must show how the squared loss relates to the risk, which helps establish the last claim of the theorem. The proof is similar to that of the bias bound but has subtle differences that require reproducing the argument.

Lemma 5 (Instantaneous Risk Bound). Assume Conditions 1 and 2 hold. In the same 1 − δ/2 event in Lemma 2, for any regressor f ∈ F retained by Algorithm 4, we have,

V^{f*}(p, π_{f*}) − V^{f*}(p, π_f) ≤ √( 2K (11φ² + 120 log(2N/δ)/n_train) ) + 2(φ + τ_1).

Proof.

V^{f*}(p, π_{f*}) − V^{f*}(p, π_f) = E_x[ f*(x, π_{f*}(x)) − f*(x, π_f(x)) ]
≤ E_x[ f*(x, π_{f*}(x)) − f(x, π_{f*}(x)) + f(x, π_f(x)) − f*(x, π_f(x)) ]


This follows since f prefers its own action to that of f*, so that f(x, π_f(x)) ≥ f(x, π_{f*}(x)). For any observation x ∈ X and action a ∈ A, define,

Δ_{x,a} = f(x, a) − V̂^f(p ∘ a, π_f) − f*(x, a) + V^{f*}(p ∘ a, π_{f*}),

so that we can write,

V^{f*}(p, π_{f*}) − V^{f*}(p, π_f)
≤ E_x[ Δ_{x,π_f(x)} − Δ_{x,π_{f*}(x)} + (V̂^f(p ∘ π_f(x)) − V^{f*}(p ∘ π_f(x)) − V̂^f(p ∘ π_{f*}(x)) + V^{f*}(p ∘ π_{f*}(x))) ]

The term involving both Δs can be bounded as in the proof of Lemma 4. For any x ∈ X,

E_{r,a|x}[Y(f)] + E_{a|x}[ (V̂^{f*}(p ∘ a) − V^{f*}(p ∘ a))² ] = E_{a|x}[ (f(x, a) − V̂^f(p ∘ a) − f*(x, a) + V^{f*}(p ∘ a))² ]
≥ (Δ²_{x,π_f(x)} + Δ²_{x,π_{f*}(x)}) / K ≥ (Δ_{x,π_f(x)} − Δ_{x,π_{f*}(x)})² / (2K)

Thus,

E_x[ Δ_{x,π_f(x)} − Δ_{x,π_{f*}(x)} ] ≤ √( 2K E_x[ (Δ_{x,π_f(x)} − Δ_{x,π_{f*}(x)})² / (2K) ] )
≤ √( 2K (E[Y(f)] + φ²) ) ≤ √( 2K (11φ² + 120 log(2N/δ)/n_train) )

We are left to bound the residual term,

V̂^f(p ∘ π_f(x)) − V^{f*}(p ∘ π_f(x)) − V̂^f(p ∘ π_{f*}(x)) + V^{f*}(p ∘ π_{f*}(x))
≤ V^f(p ∘ π_f(x)) − V^{f*}(p ∘ π_f(x)) − V^f(p ∘ π_{f*}(x)) + V^{f*}(p ∘ π_{f*}(x)) + 2φ
≤ 2(φ + τ_1)

Notice that Lemma 5 above controls the quantity V^{f*}(p, π_{f*}) − V^{f*}(p, π_f), which is the difference in values between the optimal behavior from p and the policy that first acts according to π_f and then behaves optimally thereafter. This is not the same as acting according to π_f for all subsequent actions. We will control this cumulative risk, V*(p) − V(p, π_f), in the second phase of the algorithm.

Proof of Theorem 2: Equipped with the above lemmas, we can proceed to prove the theorem. By assumption of the theorem, Conditions 1 and 2 hold, so all lemmas are applicable. Apply Lemma 3 with failure probability δ/2, where δ is the parameter in the algorithm, and apply Lemma 2, which also fails with probability at most δ/2. A union bound over these two events implies that the failure probability of the algorithm is at most δ.

Outside of this failure event, all three of Lemmas 3, 4, and 5 hold. If we set n_train = 24 log(2N/δ)/φ², then these bounds give,

|V̂^f(p, π_f) − V^f(p, π_f)| ≤ φ/√12
|V^f(p, π_f) − V^g(p, π_g)| ≤ 8φ√K + 2φ + τ_1
V^{f*}(p, π_{f*}) − V^{f*}(p, π_f) ≤ 4φ√(2K) + 2φ + 2τ_1.


These bounds hold for all f, g ∈ F that are retained by the algorithm. Of course, by Lemma 2, we are also ensured that f* is retained by the algorithm.

    We are easily able to prove the pairwise disagreement bound,

|V^f(p, π_f) − V^f(p, π_g)| = |E_{x∼D_p}[ f(x, π_f(x)) − f(x, π_g(x)) ]|

The quantity inside the absolute value is non-negative since f prefers π_f to π_g pointwise at each x.

|V^f(p, π_f) − V^f(p, π_g)| = V^f(p, π_f) − V^f(p, π_g)
≤ V^f(p, π_f) − V^g(p, π_f) + V^g(p, π_g) − V^f(p, π_g)
≤ 2(8φ√K + 2φ + τ_1)
= 16φ√K + 4φ + 2τ_1.

The first inequality follows since g prefers π_g to π_f, while the second one is based on the inequality (8) in the proof of Lemma 4, which actually bounds V^f(p, π_f) − V^g(p, π_f) for any pair f, g ∈ F.

    This proves the five claims in the theorem.
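As a quick arithmetic sanity check of the constants in the displays above, one can verify (for arbitrary positive φ and K of our choosing) that substituting n_train = 24 log(2N/δ)/φ², so that 120 log(2N/δ)/n_train = 5φ², recovers the stated 8φ√K and 4φ√(2K) terms:

    import math

    phi, K = 0.01, 7                   # arbitrary positive values
    log_term = (120 / 24) * phi ** 2   # 120 log(2N/delta) / n_train = 5 phi^2
    assert abs(2 * math.sqrt(K * (11 * phi ** 2 + log_term)) - 8 * phi * math.sqrt(K)) < 1e-12
    assert abs(math.sqrt(2 * K * (11 * phi ** 2 + log_term)) - 4 * phi * math.sqrt(2 * K)) < 1e-12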

    C Proof of Theorem 3

This result is a straightforward application of Hoeffding's inequality. We collect n_test observations x_i ∼ D_p by rolling in to p and use the Monte Carlo estimates,

V̂^f(p, π_f) = (1/n_test) Σ_{i=1}^{n_test} f(x_i, π_f(x_i))

By Hoeffding's inequality, via a union bound over all f ∈ F, we have that with probability at least 1 − δ,

|V̂^f(p, π_f) − V^f(p, π_f)| ≤ √( 2 log(2N/δ) / n_test )

Setting n_test = 2 log(2N/δ)/φ² gives that our empirical estimates are at most φ away from the population versions.

Now for the first claim, if the population versions are already within τ_2 of each other, then the empirical versions are at most 2φ + τ_2 apart by the triangle inequality,

|V̂^f(p, π_f) − V̂^g(p, π_g)| ≤ |V̂^f(p, π_f) − V^f(p, π_f)| + |V^f(p, π_f) − V^g(p, π_g)| + |V^g(p, π_g) − V̂^g(p, π_g)| ≤ 2φ + τ_2.

This applies for any pair f, g ∈ F whose population value predictions are within τ_2 of each other. Since we set ε_test ≥ 2φ + τ_2 in Theorem 3, this implies that the procedure returns true.

For the second claim, if the procedure returns true, then all empirical value predictions are at most ε_test apart, so the population versions are at most 2φ + ε_test apart, again by the triangle inequality. Specifically, for any pair f, g ∈ F we have,

|V^f(p, π_f) − V^g(p, π_g)| ≤ |V^f(p, π_f) − V̂^f(p, π_f)| + |V̂^f(p, π_f) − V̂^g(p, π_g)| + |V̂^g(p, π_g) − V^g(p, π_g)| ≤ 2φ + ε_test.

    Both arguments apply for all pairs  f, g ∈ F , which proves the claim.
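A minimal sketch of the test analyzed here, assuming the surviving regressors' Monte Carlo estimates have already been collected in a dictionary (the function name and signature are ours, not the paper's pseudocode): the procedure returns true exactly when all pairwise gaps are at most ε_test, which is equivalent to comparing the largest and smallest estimates.

    def consensus(value_estimates, eps_test):
        """value_estimates maps each surviving regressor f to its estimate of V^f(p, pi_f)."""
        vals = list(value_estimates.values())
        return max(vals) - min(vals) <= eps_test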


    D Proof of Theorem 4

Assume that all calls to TD-Elim and Consensus operate successfully, i.e., we can apply Theorems 2 and 3 on any path p for which the appropriate subroutine has been invoked. We will bound the number of calls and hence the total failure probability.

Recall that ε is the error parameter passed to DFS-Learn and that we set φ = ε/(320H²√K).

We first argue that in all calls to TD-Elim, the estimation precondition is satisfied. To see this, notice that by design, the algorithm only calls TD-Elim at path p after the recursive step, which means that for each a, we either ran TD-Elim on p ∘ a or Consensus returned true on p ∘ a. Since both Theorems 2 and 3 guarantee estimation error of order φ, the estimation precondition for path p holds. This argument applies to all paths p for which we call TD-Elim, so that the estimation precondition is always satisfied.

We next analyze the bias term, for which we proceed by induction. To state the inductive claim, we define the notion of an accessed path. We say that a path p is accessed if either (a) we called TD-Elim on path p, or (b) we called Consensus on p and it returned true.

Inductive Claim: For all accessed paths p with h actions remaining and any pair f, g ∈ F of surviving regressors,

|V^f(p, π_f) − V^g(p, π_g)| ≤ 20h√K φ

Base Case: The claim clearly holds with 0 actions remaining, since all regressors estimate future reward as zero.

Inductive Step: Assume that the inductive claim holds for all accessed paths with h − 1 actions remaining. Consider any accessed path p with h actions remaining. Since we access the path p, either we call TD-Elim or Consensus returns true. If we call TD-Elim, then we access the paths p ∘ a for all a ∈ A. Therefore, by the inductive hypothesis, we have already filtered the regressor class so that for all a ∈ A and f, g ∈ F, we have,

|V^f(p ∘ a, π_f) − V^g(p ∘ a, π_g)| ≤ 20(h − 1)√K φ.

We will therefore instantiate τ_1 = 20(h − 1)√K φ in the bias precondition of Theorem 2. We also know that the estimation precondition is satisfied with parameter φ. Therefore, the bias bound of Theorem 2 shows that, for all f, g ∈ F retained by the algorithm,

|V^f(p, π_f) − V^g(p, π_g)| ≤ 8φ√K + 2φ + τ_1 ≤ 10φ√K + 20(h − 1)φ√K ≤ 20(h − 1/2)φ√K    (9)

    Thus the inductive step holds in this case.

The other case we must consider is if Consensus returns true. Notice that for a path p with h actions to go, we call Consensus with parameter ε_test = 20(h − 1/4)√K φ. We actually invoke the routine on path p when we are currently processing a path p′ with h + 1 actions to go (i.e., p = p′ ∘ a for some a ∈ A), so we set ε_test in terms of H − |p′| − 5/4 = H − |p′ ∘ a| − 1/4 = h − 1/4. Then, by Theorem 3, we have the bias bound,

|V^f(p, π_f) − V^g(p, π_g)| ≤ 2φ + 20(h − 1/4)√K φ ≤ 20h√K φ

    Thus we have established the inductive claim.
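Both numerical steps in the induction (Eq. (9) and the Consensus case) reduce to the facts that 2 ≤ 2√K and 2φ ≤ 5√K φ for K ≥ 1; a small illustrative check over arbitrary values:

    import numpy as np

    rng = np.random.default_rng(2)
    for _ in range(1000):
        h = int(rng.integers(1, 50))
        K = int(rng.integers(1, 50))
        phi = rng.uniform(1e-6, 1.0)
        rootK = np.sqrt(K)
        # TD-Elim case, Eq. (9)
        assert 8 * phi * rootK + 2 * phi + 20 * (h - 1) * phi * rootK <= 20 * (h - 0.5) * phi * rootK + 1e-12
        # Consensus case
        assert 2 * phi + 20 * (h - 0.25) * rootK * phi <= 20 * h * rootK * phi + 1e-12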


Verifying preconditions for Theorem 2: To apply the conclusions of Theorem 2 at some state s, we must verify that the preconditions hold, with the appropriate parameter settings, before we execute the algorithm. We saw above that the estimation precondition always holds with parameter φ, assuming successful execution of all subroutines. The inductive argument also shows that the bias precondition holds with τ_1 = 20(h − 1)√K φ for a state s ∈ S_h that we called TD-Elim on. Thus, both preconditions are satisfied at each execution of TD-Elim, so the conclusions of Theorem 2 apply at any state s for which we have executed the subroutine. Note that the precondition parameters that we use here, specifically τ_1, depend on the level h.

    Sample Complexity:  We now bound the number of calls to each subroutine, which reveals how to

    allocate the failure probability and gives the sample complexity bound. Again assume that all calls succeed.

First notice that if we call Consensus on some state s ∈ S_h for which we have already called TD-Elim, then Consensus returns true (assuming all calls to subroutines succeed). This follows because TD-Elim guarantees that the population predicted values are at most 20(h − 1/2)√K φ apart (Eq. 9), which becomes the choice of τ_2 in the application of Theorem 3. This is legal since,

2φ + 20(h − 1/2)√K φ ≤ 20(h − 1/4)√K φ = ε_test,

so that the precondition for Theorem 3 holds. Therefore, at any level h, we can call TD-Elim at most one time per state s ∈ S_h. In total, this yields MH calls to TD-Elim.

Next, since we only make recursive calls when we execute TD-Elim, we expand at most M paths per level. This means that we call Consensus on at most MK paths per level, since the fan-out of the tree is K. Thus, the number of calls to Consensus is at most MKH.

By our setting of δ in the subroutine calls (i.e., δ/(2MKH) in calls to Consensus and δ/(2MH) in calls to TD-Elim), and by Theorems 2 and 3, the total failure probability is therefore at most δ.

Each execution of TD-Elim requires n_train trajectories while each execution of Consensus requires n_test trajectories. Since before each execution of TD-Elim we always perform K executions of Consensus, if we perform T executions of TD-Elim, the total sample complexity is bounded by,

T(n_train + K n_test) ≤ (3 × 10⁶) T H⁴K log(4NMH/δ)/ε² + (3 × 10⁵) T H⁴K² log(4NMKH/δ)/ε²
= O( (T H⁴K²/ε²) log(NMHK/δ) ).

    Recall that the total number of executions of  TD-Elim can be no more than  M H , by the argument above.
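The constants above come from plugging the parameter choices of this section into n_train and n_test; the sketch below (purely illustrative, with the δ-allocations δ/(2MH) and δ/(2MKH) as in the text) makes the bookkeeping explicit.

    import math

    def sample_counts(eps, H, K, N, M, delta):
        phi = eps / (320 * H ** 2 * math.sqrt(K))
        n_train = 24 * math.log(4 * N * M * H / delta) / phi ** 2      # log(2N / (delta/(2MH)))
        n_test = 2 * math.log(4 * N * M * K * H / delta) / phi ** 2    # log(2N / (delta/(2MKH)))
        per_td_elim = n_train + K * n_test
        return n_train, n_test, M * H * per_td_elim   # at most MH executions of TD-Elim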

    E Analysis for Explore-on-Demand 

The first part of the algorithm essentially computes the value V* at the root of the search tree, but does not ensure good performance of the retained policies. To do the latter, and to establish a PAC-guarantee, we run the Explore-on-Demand procedure.

Throughout the proof, we assume that |V̂ − V*| ≤ ε/8. We will ensure that the first half of the algorithm execution guarantees this. Let E denote the event that all Monte Carlo estimates V̂(∅, π_f) are accurate and all calls to DFS-Learn succeed (so that we may apply Theorem 4). By accurate, we mean,

|V̂(∅, π_f) − V(∅, π_f)| ≤ ε/8.

Formally, E is the intersection over all executions of DFS-Learn of the event that the conclusions of Theorem 4 apply for this execution, and the intersection over all iterations of the loop in Explore-on-Demand


of the event that the Monte Carlo estimate V̂(∅, π_f) is within ε/8 of V(∅, π_f). We will bound this failure probability, i.e., P[Ē], toward the end of the proof.

Lemma 6 (Risk bound upon termination). If E holds, then when Explore-on-Demand terminates, it outputs a policy π_f with V* − V(π_f) ≤ ε.

    Proof.  The proof is straightforward,

V* − V(π_f) ≤ |V* − V̂| + |V̂ − V̂(π_f)| + |V̂(π_f) − V(π_f)| ≤ ε/8 + ε/2 + ε/8 = 3ε/4 ≤ ε

The first bound follows by the assumption on V̂, while the second comes from the definition of ε_demand and the third holds under event E.

Lemma 7 (Termination Guarantee). If E holds, then when Explore-on-Demand selects a policy that is at most ε/4-suboptimal, it terminates.

    Proof.  We must show that the test succeeds, for which we will apply the triangle inequality,

|V̂ − V̂(π_f)| ≤ |V̂ − V*| + |V* − V(π_f)| + |V(π_f) − V̂(π_f)| ≤ ε/8 + ε/4 + ε/8 ≤ ε/2 = ε_demand,

And therefore the test is guaranteed to succeed. Again, the last bound here holds under event E.

At a current time in the execution of the algorithm, let L denote the set of learned states. Learned states are ones on which we have successfully called TD-Elim, so that we may apply Theorem 2. Since we only ever call TD-Elim through DFS-Learn, the fact that these calls to TD-Elim succeeded is implied by the event E. A slightly tighter definition of L, which is sufficient for our purposes, is

L(F) = ∪_h { s ∈ S_h : max_{f∈F} V*(s) − V^{f*}(s, π_f) ≤ 4φ√(2K) + 2φ + 40(h − 1)√K φ }.

The only property we will use from Theorem 2 is the instantaneous risk bound, which is what this alternative definition of L provides.

For a policy π_f, let q^{π_f}[s → L̄] denote the probability that, when behaving according to π_f starting from state s, we visit an unlearned state. We now show that q^{π_f}[∅ → L̄] is related to the risk of the policy π_f.

Lemma 8 (Policy Risk). Define L to be the set of states that have had TD-Elim called on them and define q^{π_f}[s → L̄] accordingly. Assume that E holds and let f be a surviving regressor, so that π_f is a surviving policy. Then,

V* − V(∅, π_f) ≤ q^{π_f}[∅ → L̄] + 40√K φ H².

Proof. Recall that under event E, we can apply the conclusions of Theorem 2 with φ = ε/(320H²√K) and τ_1 = 20(h − 1)√K φ for any h and state s ∈ S_h for which we have called TD-Elim. Our proof proceeds by creating a recurrence relation through application of Theorem 2 and then solving the relation. Specifically, we want to prove the following inductive claim.

Inductive Claim: For a state s ∈ L with h actions to go,

V*(s) − V(s, π_f) ≤ 40φ√K h² + q^{π_f}[s → L̄]


Base Case: With zero actions to go, all policies achieve zero reward and no policies visit L̄ from this point, so the inductive claim trivially holds.

Inductive Step: For the inductive hypothesis, consider some state s at level h for which TD-Elim has successfully been called. By Theorem 4, we know that,

V*(s) − V^{f*}(s, π_f) ≤ 4φ√(2K) + 2φ + 2τ_1,

with τ_1 = 20(h − 1)φ√K. This bound is clearly at most 40hφ√K. Now,

V*(s) − V(s, π_f) = V*(s) − V^{f*}(s, π_f) + V^{f*}(s, π_f) − V(s, π_f)
≤ 40hφ√K + E_{(x,r)∼D_s}[ r(π_f(x)) + V*(s ∘ π_f(x)) − r(π_f(x)) − V(s ∘ π_f(x), π_f) ].

Let us focus on just the second term, which is equal to,

E_{x∼D_s}[ (V*(s ∘ π_f(x)) − V(s ∘ π_f(x), π_f)) (1[Γ(s, π_f(x)) ∈ L] + 1[Γ(s, π_f(x)) ∉ L]) ]
≤ Σ_{s′∈L} P_{x∼D_s}[Γ(s, π_f(x)) = s′] (V*(s′) − V(s′, π_f)) + P_{x∼D_s}[Γ(s, π_f(x)) ∉ L]

Since all of the recursive terms above correspond only to states s′ ∈ L, we may apply the inductive hypothesis to obtain the bound,

40hφ√K + Σ_{s′∈L} P_{x∼D_s}[Γ(s, π_f(x)) = s′] ( 40(h − 1)²φ√K + q^{π_f}[s′ → L̄] ) + P_{x∼D_s}[Γ(s, π_f(x)) ∉ L]
≤ 40hφ√K + 40(h − 1)²φ√K + q^{π_f}[s → L̄]
≤ 40φ√K h² + q^{π_f}[s → L̄]

Thus we have proved the inductive claim. Applying it at the root of the tree gives the result.

Recall that we set φ = ε/(320H²√K) in DFS-Learn. This ensures that 40H²φ√K ≤ ε/8, which means that if q^{π_f}[∅ → L̄] = 0, then we ensure V* − V(∅, π_f) ≤ ε/8.

Lemma 9 (Each non-terminal iteration makes progress). Assume that E holds. If π_f is selected but fails the test, then with probability at least 1 − exp(−n_demand,2 ε/8), at least one of the n_demand,2 trajectories collected visits a state s ∉ L.

Proof. First, if π_f fails the test, we know that,

ε_demand < |V̂(∅, π_f) − V̂| ≤ ε/4 + |V(∅, π_f) − V*|

which implies that,

ε/4 < V* − V(∅, π_f)

On the other hand, Lemma 8 shows that,

V* − V(∅, π_f) ≤ q^{π_f}[∅ → L̄] + 40H²√K φ

Using our setting of φ and combining the two bounds gives,

ε/4 < q^{π_f}[∅ → L̄] + ε/8 ⇒ q^{π_f}[∅ → L̄] > ε/8


Thus, the probability that all n_demand,2 trajectories miss L̄ is,

P[all trajectories miss L̄] = (1 − q^{π_f}[∅ → L̄])^{n_demand,2} ≤ (1 − ε/8)^{n_demand,2} ≤ exp(−n_demand,2 ε/8).

    Thus we must hit  L̄ with substantial probability.
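The last step uses only 1 − x ≤ e^{−x}; a quick numerical confirmation with arbitrary values of ε and n_demand,2:

    import math

    for eps in (0.01, 0.1, 0.5):
        for n in (10, 100, 1000):
            assert (1 - eps / 8) ** n <= math.exp(-n * eps / 8)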

    E.1 Proof of Theorem 5

Again, for now assume that E holds. First of all, by Lemma 6, we argued that if Explore-on-Demand terminates, then it outputs a policy that satisfies the PAC-guarantee. Moreover, by Lemma 7, we also argued that if Explore-on-Demand selects a policy that is at most ε/4 suboptimal, then it terminates. Thus the goal of the proof is to show that it quickly finds a policy that is at most ε/4 suboptimal.

Every execution of the loop in Explore-on-Demand either passes the test or fails the test at level ε_demand. If the test succeeds, then Lemma 6 certifies that we have found an ε-suboptimal policy, thus establishing the PAC-guarantee. If the test fails, then Lemma 9 guarantees that we call DFS-Learn on a state that was not previously trained on. Thus at each non-terminal iteration of the loop, we call DFS-Learn and hence TD-Elim on at least one state s ∉ L, so that the set of learned states grows by at least one. By Lemma 8 and our setting of φ, if we have called TD-Elim on all states at all levels, then we guarantee that all surviving policies have risk at most ε/8. Thus the number of iterations of the loop is bounded by at most MH, since that is the number of unique states in the Contextual-MDP.

Bounding P[Ē]: Since we have bounded the total number of iterations, we are now in a position to assign failure probabilities and bound P[Ē]. Actually, we must consider not only the event E but also the event that all iterations where the test fails visit some state s ∉ L. Call this new event E′, which is the intersection of E with the event that all unsuccessful iterations visit L̄.

We have δ probability to allocate, and we perform at most MH iterations. Thus in each iteration we may allocate δ/(MH) probability. There are three types of events required: (1) the initial Monte Carlo estimates V̂(∅, π_f) must be close to V(∅, π_f), (2) the failure probability in Lemma 9 must be small, and (3) all H n_demand,2 calls to DFS-Learn at this iteration must succeed. Naively, we allocate 1/3 of the available failure probability to each.

For the initial Monte Carlo estimate, by Hoeffding's inequality, we know that,

|V̂(∅, π_f) − V(∅, π_f)| ≤ √( log(6MH/δ) / (2 n_demand,1) ).

We want this bound to be at most ε/8, which requires:

    ndemand,1