
Contextual-MDPs for PAC-Reinforcement Learning with Rich Observations

    Akshay Krishnamurthy  ∗1, Alekh Agarwal  †1, and John Langford  ‡1

    1Microsoft Research, New York, NY 10011

    March 2, 2016

    Abstract

We propose and study a new tractable model for reinforcement learning with high-dimensional observations called Contextual-MDPs, generalizing contextual bandits to a sequential decision making setting. These models require an agent to take actions based on high-dimensional observations (features) with the goal of achieving long-term performance competitive with a large set of policies. Since the size of the observation space is a primary obstacle to sample-efficient learning, Contextual-MDPs are assumed to be summarizable by a small number of hidden states. In this setting, we design a new reinforcement learning algorithm that engages in global exploration while using a function class to approximate future performance. We also establish a sample complexity guarantee for this algorithm, proving that it learns near-optimal behavior after a number of episodes that is polynomial in all relevant parameters, logarithmic in the number of policies, and independent of the size of the observation space. This represents an exponential improvement on the sample complexity of all existing alternative approaches and provides theoretical justification for reinforcement learning with function approximation.

    1 Introduction

The Atari Reinforcement Learning research program [20] has highlighted a critical deficiency of reinforcement learning algorithms: they cannot effectively solve problems that require systematic exploration. How can we construct Reinforcement Learning (RL) algorithms which effectively plan and plan to explore? In RL theory, this is an effectively solved problem for Markov Decision Processes (MDPs) [11, 4, 24]. Why do these results not apply?

An easy response is, "because the hard games are not MDPs." This may be true for some of the hard games, but it is misleading: the algorithms used do not even engage in minimal planning and global exploration¹ as is required to solve MDPs efficiently. MDP-optimized global exploration has also been avoided because of a polynomial dependence on the number of unique observations, which is intractably large with observations from a visual sensor.

∗[email protected]  †[email protected]  ‡[email protected]

¹We use "global exploration" to distinguish the structural exploration strategies required to solve an MDP efficiently from exponentially less efficient alternatives such as ε-greedy.


arXiv:1602.02722v2 [cs.LG] 1 Mar 2016


In contrast, supervised and contextual bandit learning algorithms have no dependence on the number of observations and at most a logarithmic dependence on the size of the underlying policy set. Approaches to RL with a weak dependence on these quantities exist [13], but suffer from an exponential dependence on the time horizon: with K actions and a horizon of H, they require K^H samples. Examples show this dependence is necessary, although such examples require a large number of states. Can we find an RL algorithm with no dependence on the number of unique observations and a polynomial dependence on the number of actions K, the number of necessary states M, the horizon H, and the policy complexity log(|Π|)?

To begin answering this question we consider a simplified setting by assuming:

    1. episodic reinforcement learning.

    2. the policy space can represent the exact-best solution.

    3. state transition dynamics are deterministic.

These simplifications make the problem significantly more tractable without trivializing the core goal of designing a Poly(K, M, H, log(|Π|)) algorithm. To this end, our contributions are:

1. A new class of models (Contextual-MDPs) for the design and analysis of reinforcement learning algorithms. Contextual-MDPs generalize both contextual bandits and MDPs, but, unlike Partially Observable MDPs (POMDPs), the optimal policy in a Contextual-MDP depends only on the most recent observation rather than the entire trajectory.

2. A new reinforcement learning algorithm and a guarantee that it PAC-learns Contextual-MDPs (with the above assumptions) using O(MK²H³ log(|Π|)) samples. This is done by combining ideas from contextual bandits with a novel state equality test and an on-demand exploration technique, yielding the first Poly(K, M, H, log(|Π|)) reinforcement learning algorithm with no dependence on the number of unique observations. Like initial contextual bandit approaches, the algorithm is computationally inefficient since it requires enumeration of the policy class, an aspect we hope to address in future work.

    Our algorithm uses a function class to approximate future rewards, and thus lends theoretical backing for

    reinforcement learning with function approximation, which is the empirical state-of-the-art.

    2 The Model

In this section, we introduce the model we study throughout the paper, which we call episodic Contextual-MDPs. We first set up basic notation. Let H ∈ N be a time horizon, X denote a high-dimensional observation space, A a finite set of actions, and let S denote a finite set of latent states. Let K = |A|. We partition S into H disjoint groups S_1, ..., S_H, each of size at most M. For a set P, Δ(P) denotes the set of distributions over P.

    2.1 Basic Definitions

An episodic Contextual-MDP is defined by the tuple (Γ_H, Γ, D) where H ∈ N is the episode length, Γ_H ∈ Δ(S_H) denotes a starting state distribution, Γ : (S × A) → Δ(S) denotes the transition dynamics, and D : S → Δ(X × [0, 1]^K) associates a distribution over (observation, reward) pairs with each state. We use D_s ∈ Δ(X × [0, 1]^K) to denote the (observation, reward) distribution associated with state s and also the marginal distribution over observations (usage will be clear from context). We use D_{s|x} to denote the conditional distribution of the reward given the observation x in state s.


Figure 1: Snippet of a trajectory induced by an optimal agent in a Contextual-MDP. Black text denotes unobserved quantities, blue denotes observed quantities, and red denotes quantities chosen by the optimal agent. The optimal action a* is a function solely of the current observation x, in contrast with more general POMDPs.

The marginal and conditional distributions are referred to as D_s(x) and D_{s|x}(r).

We assume that the process is layered (also known as loop-free or acyclic in the literature) so that for a state s_h ∈ S_h and for any action a ∈ A, Γ(s_h, a) ∈ Δ(S_{h−1}). Since the state space is partitioned into disjoint sets, each state is available only at a single time point, and the environment transitions from the state space S_H down to S_1 via a sequence of actions. Layered structure allows us to avoid indexing policies and Q-functions with time, which enables more concise notation but is mathematically equivalent to an alternative reformulation without layered structure.

An episode proceeds as follows. The environment chooses s_H ∼ Γ_H and (x_H, r_H) ∼ D_{s_H}, and x_H is revealed to the learner, who chooses an action a_H. The learner observes r_H(a_H) and the environment transitions to state s_{H−1} ∼ Γ(s_H, a_H), draws (x_{H−1}, r_{H−1}) ∼ D_{s_{H−1}}, and reveals x_{H−1} to the learner. The learner chooses an action a_{H−1} and the process continues for a total of H rounds of interaction, at which point the episode ends.
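To make this interaction protocol concrete, here is a minimal Python sketch of one episode; the callables Gamma_H, Gamma, D, and policy are illustrative stand-ins for Γ_H, Γ, D, and a learner's policy (assumptions made for this sketch, not objects defined in the paper).

```python
def run_episode(Gamma_H, Gamma, D, policy, H):
    """Simulate one episode of an episodic Contextual-MDP.

    Gamma_H: () -> initial state s_H drawn from the starting distribution
    Gamma:   (state, action) -> next state drawn from the transition dynamics
    D:       state -> (observation x, reward vector r) drawn from D_s
    policy:  observation -> action
    Returns the learner's record of interaction [(x_h, a_h, r_h(a_h)), ...] for h = H, ..., 1.
    """
    s = Gamma_H()                     # environment draws s_H ~ Gamma_H
    record = []
    for h in range(H, 0, -1):         # levels H down to 1
        x, r = D(s)                   # (x_h, r_h) ~ D_{s_h}; only x_h is revealed
        a = policy(x)                 # the learner acts on the observation alone
        record.append((x, a, r[a]))   # only the reward of the chosen action is observed
        if h > 1:
            s = Gamma(s, a)           # environment transitions to s_{h-1}
    return record
```

Note that the hidden states s_h never appear in the returned record, mirroring the observed record of interaction described in the next paragraph.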

Over the course of an episode, the reward obtained by the learner is Σ_{h=1}^H r_h(a_h), and the goal is to maximize the expected cumulative reward,

R = E[ Σ_{h=1}^H r_h(a_h) ],   (1)

where the expectation accounts for all randomness in the model and the learner. We assume that almost surely Σ_{h=1}^H r_h(a_h) ∈ [0, 1] for any action sequence.

The record of interaction observed by the learner is (x_H, a_H, r_H(a_H), ..., x_1, a_1, r_1(a_1)). The full record of interaction for a single episode is the tuple (s_H, x_H, r_H, a_H, ..., s_1, x_1, r_1, a_1) where s_H ∼ Γ_H, s_h ∼ Γ(s_{h+1}, a_{h+1}), (x_h, r_h) ∼ D_{s_h}, and all actions a_h are chosen by the learner. Notice that all state information and rewards for alternative actions are unobserved by the learning algorithm. Figure 1 illustrates the observed and unobserved quantities over one round of interaction.

A policy π : X → A is a strategy for navigating the search space by taking actions π(x) given observation x. A policy generates a sequence of interactions (x_H, π(x_H), r_H(π(x_H)), ..., x_1, π(x_1), r_1(π(x_1))) with expected reward defined recursively through

V(π) = E_{s∼Γ_H}[ V(s, π) ]   and
V(s, π) = E_{(x,r)∼D_s}[ r(π(x)) + E_{s'∼Γ(s,π(x))} V(s', π) ].

    3

  • 8/17/2019 Contextual MDPs for PAC Reinforceme t Learning With Rich Observations

    4/30

As the base case, we assume that for states s ∈ S_1, all actions transition deterministically to a terminal state s_0 with V(s_0, π) = 0 for all π.

The optimal expected reward achievable can be similarly computed recursively as

V* = E_{s∼Γ_H}[ V*(s) ]   and   (2)
V*(s) = E_{x∼D_s} max_a E_{r∼D_{s|x}}[ r(a) + E_{s'∼Γ(s,a)} V*(s') ].

For each (s, x) pair such that D_s(x) > 0 we can also define a Q* function as

Q*_s(x, a) = E_{r∼D_{s|x}}[ r(a) + E_{s'∼Γ(s,a)} V*(s') ].   (3)

    This function captures the optimal choice of action given this (state, observation) pair and therefore encodes

    optimal behavior in the model. With no further assumptions, the above model is a  layered episodic Partially

    Observable Markov Decision Process (POMDP).
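For concreteness, the recursions (2) and (3) can be evaluated exactly on a toy instance. In the sketch below, a state's (observation, reward) distribution is represented as a finite list of (observation, expected-reward-vector, probability) triples and transitions are deterministic and stored in a dictionary; both representational choices are assumptions made only for this example.

```python
def V_pi(s, policy, Gamma, D, terminal="s0"):
    """V(s, pi) = E_{(x,r)~D_s}[ r(pi(x)) + V(Gamma(s, pi(x)), pi) ], with V(s0, pi) = 0."""
    if s == terminal:
        return 0.0
    return sum(prob * (r[policy(x)] + V_pi(Gamma[(s, policy(x))], policy, Gamma, D, terminal))
               for x, r, prob in D[s])

def V_star(s, Gamma, D, terminal="s0"):
    """V*(s) = E_{x~D_s} max_a [ E_r[r(a)] + V*(Gamma(s, a)) ]."""
    if s == terminal:
        return 0.0
    return sum(prob * max(r[a] + V_star(Gamma[(s, a)], Gamma, D, terminal) for a in range(len(r)))
               for x, r, prob in D[s])

def Q_star(s, x, a, Gamma, D, terminal="s0"):
    """Q*_s(x, a) = E_{r~D_{s|x}}[r(a)] + V*(Gamma(s, a)), with r the expected reward vector at x."""
    r = next(rv for obs, rv, prob in D[s] if obs == x)
    return r[a] + V_star(Gamma[(s, a)], Gamma, D, terminal)
```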

    2.2 The Contextual-MDP Model

The Contextual-MDP is as described above, but with an important restriction on the structure of the Q* function defined in Equation (3).

Definition 1 (Contextual-MDP). Let (S, A, X, Γ_H, Γ, D) be a layered episodic POMDP. Let Q* be correspondingly defined as in Equation (3) and a*(s, x) = argmax_{a∈A} Q*_s(x, a). The POMDP is called a Contextual-MDP if for any two states s, s' such that D_s(x), D_{s'}(x) > 0 we have a*(s, x) = a*(s', x).

Restated, a Contextual-MDP requires the optimal action for maximizing long-term reward to be dependent solely on the observation x, irrespective of the state. This is depicted in Figure 1, where the optimal action a* depends only on the current observation. In the following section, we describe how this condition relates to other reinforcement learning models in the literature. However, we first describe some examples where the condition holds.

Example 1 (Disjoint contexts). The simplest example for a Contextual-MDP is one where each state s can be identified with a subset X_s so that D_s(x) > 0 only for x ∈ X_s and where X_s ∩ X_{s'} = ∅ when s ≠ s'. In this case, a realized context x uniquely identifies the underlying state s, so that the function Q*_s(x, a) need not explicitly depend on the state s. On the other hand, this underlying mapping from s to X_s is unknown to the learning agent, so the problem cannot be easily reduced to a classical tabular MDP with a small number of states. Our algorithm will not try to explicitly learn this mapping, as the sample complexity could be prohibitive, but resolves it implicitly using the learner's policy class. The setting naturally extends to scenarios where any tuple of τ successive contexts has non-overlapping support for some τ ≥ 1. In this case, we can define a new state space that consists of all concatenations of τ states in the underlying state space and new observation distributions that concatenate the corresponding observations.
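A minimal sketch of the disjoint-context construction, assuming for illustration that observations are integer ids and that each state is assigned its own pool of ids (the helper and its arguments are hypothetical, not from the paper):

```python
import random

def make_disjoint_context_emitter(states, contexts_per_state, K, seed=0):
    """Give each latent state s its own pool X_s of context ids, with the pools pairwise
    disjoint, and return a sampler D(s) -> (x, r) supported only on X_s."""
    rng = random.Random(seed)
    pools, next_id = {}, 0
    for s in states:
        pools[s] = list(range(next_id, next_id + contexts_per_state))  # X_s, disjoint by construction
        next_id += contexts_per_state
    # Arbitrary expected-reward vectors, one per context, so rewards depend only on x.
    rewards = {x: [rng.random() for _ in range(K)] for s in states for x in pools[s]}

    def D(s):
        x = rng.choice(pools[s])     # D_s(x) > 0 only for x in X_s
        return x, rewards[x]
    return D

D = make_disjoint_context_emitter(states=["s_a", "s_b"], contexts_per_state=3, K=2)
x, r = D("s_b")   # x uniquely identifies s_b, though the learner is never told the mapping
```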

Example 2 (Path-augmented contexts). If the transition function Γ is deterministic, one can associate each state s with a sequence of actions (a path) that ends at state s and augment the context with some featurization of this sequence, including a featurization of the states visited along this path. As paths uniquely identify states in this case, this is an instance of the disjoint context scenario in Example 1.

More generally, Contextual-MDPs provide a convenient framework to reason about reinforcement learning with function approximation, as we will see. This is highly desirable as such approaches are the empirical state-of-the-art, but the limited supporting theory provides little advice on systematic global exploration.


    2.3 Connections to Other Models

    Our model is closely related to several well-studied models in the literature, namely:

Contextual Bandits:  If H = 1, then Contextual-MDPs reduce to stochastic contextual bandits [14, 6], a well-studied simplification of the general reinforcement learning problem. In contextual bandits, the learning

    algorithm repeatedly takes an action on the basis of a context (or observation), and accumulates reward for

    the chosen action. The main difference is that the choice of action does not  influence the future observations;

    in the stochastic contextual bandits problem all (observation, reward) pairs are drawn independently and

    identically from some distribution. Thus Contextual-MDPs force the learning algorithm to use  long-term

    decision making, which is not required for contextual bandits.

Markov Decision Processes:  If X = S and the distribution over observations for each state s is concentrated on s, then Contextual-MDPs reduce to Markov Decision Processes (MDPs). MDPs with small state spaces can be efficiently solved by tabular approaches that maintain and update statistics about each state [11, 4, 24]. The main difference in our setting is that the observation space X is extremely large or infinite and the underlying state is unobserved, so tabular approaches are not viable. Thus, Contextual-MDPs force the learning algorithm to generalize across observations, which is not required in MDPs.

Sample complexity bounds for reinforcement learning with large state or observation spaces do exist, but the results require unrealistic assumptions and/or the bounds are rather weak. One example is the metric-E³ algorithm of Kakade et al. [10] (see also [8]) that has sample complexity independent of the number of unique states, but assumes the ability to cover the state space in a metric a priori known to the learner. Moreover, the sample complexity scales linearly in the cover size as opposed to a more typical logarithmic dependence as in supervised learning. The sparse sampling planner of Kearns et al. [13] also implies a sample complexity bound that is independent of the observation space size for episodic MDPs, but grows as O(K^H). More recently, Abbasi-Yadkori and Neu [1] propose a model for MDPs with side-information, but this model requires mapping side-information to a small-state MDP, rather than mapping an observation to an action as in Contextual-MDPs.

    Policy gradient methods that apply (stochastic) optimization methods to find a parameterized policy

    with high value can also be applied to large-state MDPs and to Contextual-MDPs. However, these methods

    use local search techniques and consequently do not achieve global optimality [25, 9] in theory as well as

    empirically, unlike our algorithm which is guaranteed to find the globally optimal policy.

POMDPs:  By definition a Contextual-MDP is a Partially Observable Markov Decision Process (POMDP) where the optimal action at any state depends only on the current observation. Thus in Contextual-MDPs, the learning algorithm does not have to reason over belief states as is required in POMDPs.

    Borrowing terminology, a Contextual-MDP is precisely a POMDP where a  reactive policy, which uses

    only the current observation, is optimal. While there are POMDP methods for learning reactive policies,

    or more generally policies with bounded memory [19], they are based on policy gradient techniques, which

    suffer both theoretical and empirical drawbacks as we mentioned.

There are some sample complexity guarantees for learning in arbitrarily complex POMDPs, but the bounds we are aware of are quite weak as they scale linearly with |Π| [12, 18].

Predictive State Representations (PSRs):  PSRs [17] encode states as a collection of tests, a test being a sequence of (a, x) pairs observed in the history. Representationally, PSRs are even more powerful than POMDPs [23], which makes them also more general than Contextual-MDPs. However, we are not aware of finite sample bounds for learning PSRs.


    2.4 Connections to Other Techniques

    State Abstraction:  Our work is closely related to the literature on state abstraction (See [16] for a survey),

    which primarily focuses on understanding what optimality properties are preserved in an MDP after the

    state space is compressed. However, Contextual-MDPs do not necessarily admit non-trivial state abstraction

    functions that are easy to discover (i.e. that do not amount to learning the optimal behavior) as the optimal

    behavior can depend on the observation in an arbitrary manner. Moreover, while there are finite sample

    results for learning state abstractions, they all make strong assumptions that limit the scope of application.

A recent example is the work of Jiang et al. [7], which finds a good abstraction from a set of successively finer ones, but cannot search over the exponentially many abstraction functions.

Function Approximation: Our solution uses function approximation to address the generalization problem implicit in Contextual-MDPs. Function approximation is the empirical state-of-the-art in reinforcement

    learning [20], but theoretical analysis has been quite limited. Several authors have studied linear function

    approximation (See [26,  21]) but none of these results give finite sample bounds, as they do not address

    the exploration question. Baird [3] analyzes more general function approximation for predicting the value

    function in a Markov Chain, but does not show convergence when the agent is also selecting actions. More

closely related to our work, Li and Littman [15] do give finite sample bounds for RL with function approximation, but they assume access to a particular “Knows-what-it-knows” oracle, which cannot exist even for simple problems. We are not aware of finite sample results for approximating Q* with a function class, which is precisely what we do here.

    3 Our Approach

In this paper, we consider the task of probably approximately correct (PAC) learning Contextual-MDPs. Given a policy class Π, we say that an algorithm PAC learns a Contextual-MDP if for any ε, δ ∈ (0, 1), the algorithm outputs a policy π̂ with V(π̂) ≥ max_{π∈Π} V(π) − ε with probability at least 1 − δ. The sample complexity of the algorithm is the number of episodes of the Contextual-MDP that the algorithm executes before returning an ε-suboptimal policy. Formally, the sample complexity is a function n : (0, 1)² → N such that for any ε, δ ∈ (0, 1), the algorithm returns an ε-suboptimal policy with probability at least 1 − δ using only n(ε, δ) episodes.

    3.1 Additional Assumptions for the Result

Our algorithm operates on Contextual-MDPs with two additional assumptions. The first assumption posits the ability to approximate the Q* function (3) well and seems essential for a function approximation based approach.

Assumption 1 (Realizability). We identify our set of policies Π with a set of regression functions F ⊂ (X × A) → [0, 1]. Specifically, we set Π = {π_f : f ∈ F} where π_f(x) = argmax_a f(x, a). We assume that F is available to the learner and make a realizability assumption, meaning that there exists a function f* ∈ F such that for every x ∈ X and a ∈ A, f*(x, a) = Q*_s(x, a) for any state s such that D_s(x) > 0. We use N to denote |F| = |Π|.

Note that the above assumption tacitly forces the function Q*_s(x, a) to be consistent across all states s with D_s(x) > 0. This is stronger than only assuming the consistency of the argmax of Q* as in Definition 1, but Q* may still be a complex function of the observation x and action a.
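The identification Π = {π_f : f ∈ F} is just the greedy construction π_f(x) = argmax_a f(x, a). A minimal sketch, representing a regressor simply as a Python callable (an assumption made for illustration):

```python
def induced_policy(f, K):
    """Return the greedy policy pi_f(x) = argmax_a f(x, a) for a regressor f: (x, a) -> [0, 1]."""
    return lambda x: max(range(K), key=lambda a: f(x, a))

# Hypothetical regressor over K = 3 actions; its induced policy picks the highest-scoring action.
f = lambda x, a: 1.0 / (1.0 + abs(x - a))
pi_f = induced_policy(f, K=3)
assert pi_f(2) == 2
```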


Algorithm 1 cMDPLearn(F, ε, δ)
  F ← DFS-Learn(∅, F, ε, δ/2).
  Let V̂ = V̂_f(∅) for any f ∈ F.
  f ← Explore-on-Demand(F, V̂, ε, δ/2).
  Return π_f.

The regressor class induces a family of value functions defined for each f ∈ F, s ∈ S, and for any policy π : X → A,

V_f(s, π) = E_{x∼D_s}[ f(x, π(x)) ].

Working through definitions, it is easy to see that

V*(s) = V_{f*}(s, π_{f*})

for all s, so that V* = E_{s∼Γ_H}[ V_{f*}(s, π_{f*}) ]. Recalling the earlier definition, PAC-learning in the realizable setting requires finding a policy π̂ with V(π̂) ≥ V* − ε.
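The subroutines below repeatedly form Monte-Carlo estimates of these induced values, V̂_f(p, π_f) = (1/n) Σ_i f(x_i, π_f(x_i)) with x_i ∼ D_p. A sketch, where sample_observation(p) stands in for rolling in along path p and drawing an observation (a hypothetical helper, not an API from the paper):

```python
def estimate_value(f, pi_f, p, n, sample_observation):
    """Monte-Carlo estimate of V_f(p, pi_f) = E_{x ~ D_p}[ f(x, pi_f(x)) ]."""
    xs = [sample_observation(p) for _ in range(n)]   # roll in to p and observe x_i ~ D_p
    return sum(f(x, pi_f(x)) for x in xs) / n        # f's own prediction of the future value
```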

Assumption 2 (Deterministic Transitions). We further assume that the transition model is deterministic. This means that the starting distribution Γ_H is a point-mass on some state s_H and the transition dynamics map state-action pairs deterministically to future states, i.e. Γ : (S × A) → S, preserving the layered structure.

    Even with deterministic transitions, PAC-learning Contextual-MDPs requires systematic exploration that

    is unaddressed in previous work.

    3.2 Algorithm

We seek an algorithm that can PAC-learn realizable deterministic-transition Contextual-MDPs with Poly(M, K, H, 1/ε, log(N), log(1/δ)) sample complexity, and we refer to such a sample complexity bound as polynomial in all relevant parameters. Notably, the algorithm should have no dependence on |X|, which may be infinite. We develop such an algorithm in this section, and we prove the sample complexity bound in Section 4. Our focus is on statistical efficiency, so we ignore computational considerations here.

Before turning to the algorithm, it is worth clarifying some additional notation. Since we are focused on the deterministic transition setting, it is natural to think about the Contextual-MDP as an exponentially large search tree with fan-out K and depth H. Each node in the search tree is labeled with a state s ∈ S, and each edge is labeled with an action a ∈ A, both of which are consistent with the transition model. A path p corresponds to a sequence of actions from the root of the search tree, and we also use p to denote the state reached after executing the corresponding sequence of actions from the root. We often call such a path a roll-in, in line with existing terminology. For a roll-in p, we use p ∘ a to denote the path formed by executing all actions in p and then executing action a. Let ∅ denote the empty path, which corresponds to the root of the search tree.
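Because transitions are deterministic, a path can be stored simply as the tuple of actions taken from the root, and the state it reaches recovered by a table lookup; the dictionary Gamma below is a hypothetical representation of the transition function.

```python
def state_of_path(p, s_root, Gamma):
    """Follow the action sequence p from the root state through the deterministic
    transition table Gamma[(state, action)] -> state, returning the state p reaches."""
    s = s_root
    for a in p:
        s = Gamma[(s, a)]
    return s

empty = ()              # the empty path, i.e. the root of the search tree
p = (0, 1)              # a roll-in: play action 0, then action 1
p_then_a = p + (1,)     # the path p o a obtained by appending action a = 1
```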

Pseudocode for our algorithm is displayed in Algorithm 1 with subroutines displayed as Algorithms 2, 3, 4, and 5. The algorithm should be invoked as cMDPLearn(F, ε, δ) where F is the given class of regression functions, ε is the target accuracy, and δ is the target failure probability. The two main components of the algorithm are the DFS-Learn and Explore-on-Demand routines. DFS-Learn ensures proper invocation of the training step, TD-Elim, by verifying a number of preconditions, while Explore-on-Demand finds regions of the search tree for which training must be performed.


Algorithm 2 DFS-Learn(p, F, ε, δ)
  Set φ = ε/(320H²√K) and ε_test = 20(H − |p| − 5/4)√K φ.
  for a ∈ A do
    if Not Consensus(p ∘ a, F, ε_test, φ, δ/(2MKH)) then
      F ← DFS-Learn(p ∘ a, F, ε, δ). # Recurse
    end if
  end for
  F̂ ← TD-Elim(p, F, φ, δ/(2MH)). # Learn in state p.
  Return F̂.

Algorithm 3 Consensus(p, F, ε_test, φ, δ)
  Set n_test = 2 log(2N/δ)/φ².
  Collect n_test observations x_i ∼ D_p.
  Compute Monte-Carlo estimates for each value function,
    V̂_f(p, π_f) = (1/n_test) Σ_{i=1}^{n_test} f(x_i, π_f(x_i))   ∀f ∈ F
  if |V̂_f(p, π_f) − V̂_g(p, π_g)| ≤ ε_test for all f, g ∈ F then
    return true
  end if
  Return false.
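A direct Python transcription of Algorithm 3, with sample_observation(p) again a hypothetical roll-in helper and regressors represented as callables; checking the maximum minus the minimum of the estimates is equivalent to the pairwise test in the pseudocode.

```python
import math

def consensus(p, regressors, eps_test, phi, delta, sample_observation, K):
    """Sketch of Algorithm 3 (Consensus): do all surviving regressors agree, up to
    eps_test, on the value of the state reached by path p?"""
    n_test = math.ceil(2 * math.log(2 * len(regressors) / delta) / phi ** 2)
    xs = [sample_observation(p) for _ in range(n_test)]              # shared samples x_i ~ D_p
    v_hat = []
    for f in regressors:
        pi_f = lambda x, f=f: max(range(K), key=lambda a: f(x, a))   # greedy policy pi_f
        v_hat.append(sum(f(x, pi_f(x)) for x in xs) / n_test)        # V_hat_f(p, pi_f)
    return max(v_hat) - min(v_hat) <= eps_test
```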

To convey the intuition of the algorithm, it is best to start from the subroutines.

The Elimination Component:  At a high level, the algorithm aims to maintain only the regressors that approximate the Q* function well, and it makes progress by discarding regressors that have a poor fit to the Q* function. At path p, we train by retaining only the regressors that have low excess risk on a carefully constructed regression problem (See TD-Elim, displayed in Algorithm 4). The regression problem in TD-Elim is motivated by Assumption 1 and the definition of Q* in Eq. (3), which imply that for any state s generating observation x,

f*(x, a) = E_{r∼D_{s|x}}[ r(a) ] + V(Γ(s, a), π_{f*})
         = E_{r∼D_{s|x}}[ r(a) ] + E_{x'∼D_{Γ(s,a)}}[ f*(x', π_{f*}(x')) ].   (4)

Thus f* is consistent between its estimate at the current state s and the future state s' = Γ(s, a).

The regression problem we create is essentially a finite sample version of this identity. However, some care must be taken as the target for each regression function f, V_f(s', π_f), is the value of the future as predicted by f. This target differs for each function but can be estimated from samples. To ensure correct behavior of the regression problem, we must obtain high-quality estimates of these future value predictions. Nevertheless, if constructed carefully, these regression problems ensure that the algorithm retains only good regressors, which induce good policies.

TD-Elim is inspired by the RegressorElimination algorithm of Agarwal et al. [2] for contextual bandit learning in the realizable setting. Apart from the differences in the regression problem, motivated by the discussion above, the other main difference between the algorithms is the choice of action-selection distribution.


Algorithm 4 TD-Elim(p, F, φ, δ)
  Require estimates V̂_f(p ∘ a, π_f), ∀f ∈ F, a ∈ A.
  Set n_train = 24 log(2N/δ)/φ².
  Collect n_train observations (x_i, a_i, r_i) where x_i ∼ D_p, a_i is chosen uniformly at random, and r_i = r_i(a_i).
  Update F to { f ∈ F : R̃(f) ≤ min_{f'∈F} R̃(f') + 2φ² + 22 log(2N/δ)/n_train },
    with R̃(f) = (1/n_train) Σ_{i=1}^{n_train} (f(x_i, a_i) − r_i − V̂_f(p ∘ a_i, π_f))².
  Return F.
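A sketch of the elimination step in Python. Here sample_transition(p) is a hypothetical helper that rolls in to p, plays a uniformly random action, and returns (x, a, r(a)), and V_hat[i][a] holds the previously computed estimate V̂_f(p ∘ a, π_f) for the i-th regressor; only the thresholded squared-loss filter is meant to mirror Algorithm 4.

```python
import math

def td_elim(p, regressors, V_hat, phi, delta, sample_transition):
    """Sketch of Algorithm 4 (TD-Elim): keep the regressors whose temporal-difference
    squared risk is within the stated threshold of the best risk."""
    N = len(regressors)
    n_train = math.ceil(24 * math.log(2 * N / delta) / phi ** 2)
    data = [sample_transition(p) for _ in range(n_train)]        # (x_i, a_i, r_i), a_i uniform

    def risk(i):
        f = regressors[i]
        # R_tilde(f) = (1/n) * sum_i ( f(x_i, a_i) - r_i - V_hat_f(p o a_i, pi_f) )^2
        return sum((f(x, a) - r - V_hat[i][a]) ** 2 for x, a, r in data) / n_train

    risks = [risk(i) for i in range(N)]
    threshold = min(risks) + 2 * phi ** 2 + 22 * math.log(2 * N / delta) / n_train
    return [regressors[i] for i in range(N) if risks[i] <= threshold]
```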

Algorithm 5 Explore-on-Demand(F, V̂, ε, δ)
  Set ε_demand = ε/2, n_demand,1 = 32 log(6MH/δ)/ε², and n_demand,2 = 8 log(3MH/δ)/ε.
  while true do
    Fix a regressor f ∈ F.
    Collect n_demand,1 trajectories according to π_f and estimate V̂(∅, π_f) via Monte-Carlo estimate.
    If |V̂(∅, π_f) − V̂| ≤ ε_demand, return π_f.
    Otherwise update F by calling DFS-Learn(p, F, ε, δ/(3MH n_demand,2)) on each of the H−1 prefixes p of each of the first n_demand,2 paths collected for the Monte-Carlo estimate.
  end while

RegressorElimination must carefully choose actions to balance exploration and exploitation,

    which leads to an optimal regret bound. In contrast, we are pursuing a PAC-guarantee here, for which it

suffices to focus exclusively on exploration.

The Consensus Test:  The other component of the DFS-Learn routine, which is crucial for obtaining

    polynomial sample complexity, is a global exploration technique (See Consensus in Algorithm 3). This is

    based on testing for consensus among the surviving regression functions, which can be done by estimating

    the value predictions for all surviving regressors. Specifically, if the consensus test returns  true, then all

    the surviving regressors agree on the value of the current state. As shown below, this condition is sufficient

    for successfully running   TD-Elim   at the parent state. Furthermore, if we have already trained on the

(observation, reward) distribution induced by the path p, then this test returns true with high probability, thus implicitly performing a state equality test at a level needed by the class F. The first property ensures that we invoke the training mechanism properly, while the second property implies that we avoid exploring

    the entire search tree.

    This test is performed at each path visited by the algorithm  before making the recursive call on that path,

    and if the surviving functions are in agreement, the algorithm does not visit the descendant paths. Thus, the

    algorithm does not traverse a large fraction the entire search tree provided that the consensus test succeeds

    often enough. The number of times the test does not report  true  at each level is upper bounded by the

    number of states M  leading to a polynomial sample complexity bound.
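Combining the two tests, the recursion of Algorithm 2 looks as follows in Python; consensus and td_elim are assumed to behave like the sketches above (their exact signatures here are simplified and hypothetical), and the leaf guard is an assumption added only to keep the sketch well founded.

```python
import math

def dfs_learn(p, regressors, eps, delta, M, K, H, consensus, td_elim):
    """Sketch of Algorithm 2 (DFS-Learn): recurse only beneath the children of p on which
    the surviving regressors disagree, then run the elimination step at p itself."""
    phi = eps / (320 * H ** 2 * math.sqrt(K))
    eps_test = 20 * (H - len(p) - 5 / 4) * math.sqrt(K) * phi
    if len(p) < H:                        # guard for this sketch: leaves have no children to test
        for a in range(K):
            child = p + (a,)              # the path p o a
            if not consensus(child, regressors, eps_test, phi, delta / (2 * M * K * H)):
                regressors = dfs_learn(child, regressors, eps, delta, M, K, H, consensus, td_elim)
    return td_elim(p, regressors, phi, delta / (2 * M * H))
```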


On-Demand Exploration:  Apart from the first call to DFS-Learn, which is simply used to estimate the optimal value V*, the bulk of the computations occur inside the loop of Explore-on-Demand. This algorithm is another exploration technique that only invokes the learning mechanism on regions of the search space that are visited by the surviving policies. The specification is quite straightforward: it iteratively selects a surviving policy π_f, estimates its value V(∅, π_f) at the root, and if the policy has highly sub-optimal value, it invokes DFS-Learn on many of the paths visited by π_f before repeating. If the policy has near-optimal value, it simply returns the policy.

This subroutine is motivated by the following high-level argument, which we formalize in our analysis. As we will show, running the elimination step at some path p ensures that all surviving regressors take good actions at p, in the sense that taking one action according to any surviving policy and then behaving optimally thereafter achieves the near-optimal reward for path p. Unfortunately this does not ensure that all surviving policies achieve near-optimal reward, because they may take highly sub-optimal actions after the first one. On the other hand, if a surviving policy π_f visits only states for which TD-Elim has been invoked, then it must have near-optimal reward.

The contrapositive of this statement is that if a surviving policy π_f has highly sub-optimal reward, then it must visit some state that TD-Elim has not been invoked on with substantial probability. By calling DFS-Learn on the paths visited by the policy, we ensure that we call TD-Elim on this "unlearned" state. Since there are only MH distinct states in the search tree, and each non-terminal iteration ensures training on an unlearned state, this algorithm is guaranteed to terminate and output a near-optimal policy.

    4 Theoretical Analysis

    In this section we prove a PAC-learning guarantee for  cMDPLearn on Contextual-MDPs.

Theorem 1 (PAC bound). For any ε, δ ∈ (0, 1) and any Contextual-MDP (Definition 1) with deterministic transitions for which Q* ∈ F, with probability at least 1 − δ, the policy π returned by cMDPLearn(F, ε, δ) is at most ε-suboptimal. Moreover, cMDPLearn(F, ε, δ) requires at most

Õ( (MH⁶K²/ε³) log(N/δ) log(1/δ) )

episodes.

This result uses the Õ notation to suppress logarithmic dependence in all parameters except for N and δ. The precise dependence on all parameters can be recovered by examination of our proof and is shortened here simply for clarity.

This theorem states that cMDPLearn learns a policy that is at most ε-suboptimal for a Contextual-MDP using a number of episodes that is polynomial in all relevant parameters. Looking more closely into the result, our overall sample complexity is shown to scale with n_demand,2 (n_train + K n_test), ignoring some factors of M and H and logarithmic dependencies. Since n_train and n_test are set to be of the same order, this reveals that a factor of K more samples are accumulated for the Consensus routine. Consequently, the sample complexity can be improved by a factor of K whenever we do not need to test for state equality, or if collecting several exploration observations x for a path p without observing reward signal is cheap. Moreover, since n_train and n_test scale with 1/ε² while n_demand,2 scales with log(1/δ)/ε, this gives the 1/ε³ dependence in Theorem 1. This setting of n_demand,2 as Õ(log(1/δ)/ε) is required to ensure that we encounter at least one "unlearned" state on which TD-Elim needs to be called. Consequently, the sample complexity can also be


improved by a factor of log(1/δ)/ε, if identifying an unlearned state can be done easily, for example in a tabular MDP.
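As a bookkeeping aid (not part of the algorithm), the parameter settings quoted above can be evaluated numerically to see how the per-call sample sizes combine into the overall bound; the formulas are exactly those set inside Algorithms 2-5.

```python
import math

def sample_sizes(eps, delta, M, K, H, N):
    """Evaluate the per-call sample sizes used by cMDPLearn's subroutines."""
    phi = eps / (320 * H ** 2 * math.sqrt(K))                 # set in DFS-Learn
    eps_test_root = 20 * (H - 5 / 4) * math.sqrt(K) * phi     # eps_test at the empty path (|p| = 0)
    return {
        "phi": phi,
        "eps_test_root": eps_test_root,
        "n_test": 2 * math.log(2 * N / delta) / phi ** 2,     # per Consensus call
        "n_train": 24 * math.log(2 * N / delta) / phi ** 2,   # per TD-Elim call
        "n_demand_1": 32 * math.log(6 * M * H / delta) / eps ** 2,
        "n_demand_2": 8 * math.log(3 * M * H / delta) / eps,
    }

# e.g. sample_sizes(eps=0.1, delta=0.05, M=10, K=4, H=5, N=10**6)
```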

    Since Contextual-MDPs generalize both contextual bandits and MDPs, it is worth comparing the results.

1. In contextual bandits, we have M = H = 1 so that the sample complexity of cMDPLearn is Õ((K²/ε³) log(N/δ) log(1/δ)), in contrast with the optimal Õ((K/ε²) log(N/δ)) sample complexity for contextual bandit learning. As discussed above, the main gap is due to the K n_test factor, which goes away in contextual bandits since there is only one state, and the additional log(1/δ)/ε factor in Explore-on-Demand, which need not be invoked at all in the contextual bandit case. Thus with minor modification, cMDPLearn matches the optimal sample complexity for contextual bandits.

2. Assumptions vary with the paper, but broadly prior results establish that the sample complexity for learning layered episodic MDPs with deterministic transitions is Õ((MK poly(H)/ε²) log(1/δ)) [5, 22]. Again the discrepancy is the additional factors of K and log(1/δ)/ε present in Theorem 1, both of which can be avoided given that the states are known in an MDP. In this setting, cMDPLearn can easily be modified to have Õ((MKH⁵/ε²) log(N/δ)) sample complexity for layered episodic MDPs.

4.1 Preliminaries for the Proof

The proof of the theorem hinges on analysis of the subroutines. We turn first to the TD-Elim routine, for which we show the following guarantee.

Theorem 2 (Guarantee for TD-Elim). Consider running TD-Elim at path p with regressors F, parameters φ, δ, and with n_train = 24 log(2N/δ)/φ². Suppose that the following are true:

1. Estimation Precondition: We have access to estimates V̂_f(p ∘ a, π_f) for all f ∈ F, a ∈ A such that |V̂_f(p ∘ a, π_f) − V_f(p ∘ a, π_f)| ≤ φ.

2. Bias Precondition: For all f, g ∈ F and for all a ∈ A, |V_f(p ∘ a, π_f) − V_g(p ∘ a, π_g)| ≤ τ_1.

Then the following hold simultaneously with probability at least 1 − δ:

1. f* is retained by the algorithm.

2. Bias Bound: |V_f(p, π_f) − V_g(p, π_g)| ≤ 8φ√K + 2φ + τ_1   (5)

3. Instantaneous Risk Bound: V*(p) − V_f(p, π_f) ≤ 4φ√(2K) + 2φ + 2τ_1   (6)

4. Estimation Bound: Regardless of whether the preconditions hold, we have estimates V̂_f(p, π_f) with |V̂_f(p, π_f) − V_f(p, π_f)| ≤ φ/√12.   (7)

The last three bounds hold for all surviving f, g ∈ F.


For the bias precondition, the proof proceeds by induction on the number of actions to-go h. The inductive claim is that for all paths p with h actions to-go that the subroutine accesses (i.e., calls TD-Elim or Consensus returns true) and all surviving f, g ∈ F, we have

|V_f(p, π_f) − V_g(p, π_g)| ≤ 20h√K φ.

This claim is verified by applying Theorem 2 on paths for which the algorithm calls TD-Elim with the choice τ_1 = 20(h − 1)√K φ (due to the inductive hypothesis), and by applying Theorem 3 on the other accessed paths with ε_test as prescribed in DFS-Learn. Thus both preconditions for Theorem 2 are satisfied on all calls to the subroutine.

It remains to bound the number of calls. The main insight here is that if TD-Elim has been called on a state s, then Consensus returns true on any path p that leads to s. This follows by the bias bound in Theorem 2 and the setting of ε_test. Since there are only M states per level, this means that the number of calls to TD-Elim is bounded by MH and the number of calls to Consensus is at most MKH. This suggests a setting for the failure probability parameter in the calls to the subroutines, so that a union bound reveals that the total failure probability is at most δ.

Lastly, each call to TD-Elim requires K calls to Consensus, so if T calls to TD-Elim are performed the total sample complexity is

T(n_train + K n_test) = O( (TH⁴K²/ε²) log(NMKH/δ) ).

    Finally we turn to the  Explore-on-Demand routine.

Theorem 5 (Guarantee for Explore-on-Demand). Consider running Explore-on-Demand with regressors F, estimate V̂, and parameters ε, δ, and assume that |V̂ − V*| ≤ ε/8. Then with probability at least 1 − δ, Explore-on-Demand terminates after at most

Õ( (MH⁶K²/ε³) log(N/δ) log(1/δ) )

trajectories and it returns a policy π_f with V* − V(∅, π_f) ≤ ε.

We provide a sketch here. See Appendix E for details.

    Proof Sketch.  First, by standard concentration-of-measure arguments, it is apparent that if the algorithm

    selects a policy with near-optimal performance, then it terminates, and conversely, if it terminates then it

    produces a policy with near-optimal performance. Thus the main challenge is in bounding the number of 

    iterations of the loop until the algorithm terminates.

Our proof first shows that if a policy π_f has poor performance, then it must visit states that have not been trained on with substantial probability. Specifically, let L denote the set of states for which we have called TD-Elim and let L̄ be its complement, that is L̄ = S \ L. Then we can bound the sub-optimality of a surviving policy π_f by

V* − V(∅, π_f) ≤ O(H²√K φ) + P[π_f visits L̄].

This bound is obtained by another inductive argument that uses the instantaneous risk bound in Theorem 2

    on states in L. Since the instantaneous risk bound is itself obtained inductively as in the proof of Theorem  4


(i.e., τ_1 = 20(h − 1)√K φ for states at level h), the recurrence here grows as h². Our setting of φ ensures that if a surviving policy visits only states in L, then its suboptimality is at most ε/8.

Thus if a surviving policy is very suboptimal, it must visit L̄ with substantial probability. Applying a Chernoff Bound, we see that at least one of the n_demand,2 trajectories that we train on must visit L̄. This ensures that at every iteration of the loop the set L grows by at least one state, and since there are at most MH states in the Contextual-MDP, the number of iterations is bounded by MH.

To bound the sample complexity, we perform at most MH iterations of the loop and each iteration makes H n_demand,2 calls to DFS-Learn. Each call to DFS-Learn makes at least one call to TD-Elim but the total number of additional calls can be at most MH, since by the argument in the proof of Theorem 4, once a state has been trained on, Consensus always returns true. Thus the sample complexity is

(MH × H n_demand,2 + MH) × (n_train + K n_test) = Õ( (MH⁶K²/ε³) log(N/δ) log(1/δ) ).

This gives the sample complexity bound in Theorem 5.

Proof of Theorem 1:  The proof of the main theorem follows from straightforward application of Theorems 4 and 5. First, since we run DFS-Learn at the root ∅, the bias and estimation bounds in Theorem 2 apply at ∅, so we guarantee accurate estimation of the value V* (See Corollary 1 in Appendix A). This is required by the Explore-on-Demand routine, but at this point, we can simply apply Theorem 5, which is guaranteed to find an ε-suboptimal policy and also terminate in MH iterations. Combining these two results, appropriately allocating the failure probability δ, and accumulating the sample complexity bounds establishes Theorem 1.

    5 Discussion

This paper introduces a new model, Contextual-MDPs, in which it is possible to design and analyze principled reinforcement learning algorithms that engage in global exploration. As a first step, we develop cMDPLearn and show that it learns near-optimal behavior in Contextual-MDPs with polynomial sample complexity. To our knowledge, this is the first polynomial sample complexity bound for reinforcement learning with general function approximation.

    However, there are many avenues for future work:

    1.   cMDPLearn has two main undesirable properties. Firstly, it requires a deterministic transition model

    which is unrealistic in some practical settings. Secondly, the algorithm involves enumerating the class

    of regression functions, so while its sample complexity is logarithmic in the function class size, its

    running time is linear, which is typically intractably slow. Resolving both of these deficiencies may

    lead to a new practical reinforcement learning algorithm.

    2. Our algorithm also crucially relies on the realizability assumption, which on one hand is implicitly

    assumed by state-of-the-art reinforcement learning algorithms, but is known to be unnecessary in

    the contextual bandit setting. Is it possible to design completely agnostic algorithms for learning in

    Contextual-MDPs?

    We look forward to pursuing these directions.


    Acknowledgements

    We thank Akshay Balsubramani and Hal Daumé III for formative discussions, and we thank Tzu-Kuo Huang

    for a careful reading of an early draft of this paper.

    References

[1] Yasin Abbasi-Yadkori and Gergely Neu. Online learning in MDPs with side information.

    arXiv:1406.6812, 2014.

    [2] Alekh Agarwal, Miroslav Dudı́k, Satyen Kale, John Langford, and Robert E Schapire. Contextual

    bandit learning with predictable rewards. In   International Conference on Artificial Intelligence and 

    Statistics (AISTATS), 2012.

    [3] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In  Inter-

    national Conference on Machine Learning (ICML), 1995.

[4] Ronen I Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 2003.

    [5] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement

    learning. In Advances in Neural Information Processing Systems (NIPS), 2015.

    [6] Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and

    Tong Zhang. Efficient optimal learning for contextual bandits. In Uncertainty in Artificial Intelligence

    (UAI), 2011.

    [7] Nan Jiang, Alex Kulesza, and Satinder Singh. Abstraction selection in model-based reinforcement

    learning. In International Conference on Machine Learning (ICML), 2015.

    [8] Nicholas K Jong and Peter Stone. Model-based exploration in continuous state spaces. In  Abstraction,

     Reformulation, and Approximation, 2007.

    [9] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In

     International Conference on Machine Learning (ICML), 2002.

    [10] Sham Kakade, Michael Kearns, and John Langford. Exploration in metric state spaces. In International

    Conference on Machine Learning (ICML), 2003.

    [11] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine

     Learning, 2002.

[12] Michael J Kearns, Yishay Mansour, and Andrew Y Ng. Approximate planning in large POMDPs via

    reusable trajectories. In  Advances in Neural Information Processing Systems (NIPS), 1999.

    [13] Michael J. Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal

planning in large Markov decision processes. Machine Learning, 2002.

    [14] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side infor-

    mation. In Advances in Neural Information Processing Systems (NIPS), 2008.


    [15] Lihong Li and Michael L Littman. Reducing reinforcement learning to kwik online regression. Annals

    of Mathematics and Artificial Intelligence, 2010.

    [16] Lihong Li, Thomas J Walsh, and Michael L Littman. Towards a unified theory of state abstraction for

MDPs. In International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2006.

    [17] Michael L Littman, Richard S Sutton, and Satinder P Singh. Predictive representations of state. In

     Advances in Neural Information Processing Systems (NIPS), 2001.

    [18] Yishay Mansour. Reinforcement learning and mistake bounded algorithms. In Conference on Compu-

    tational Learning Theory (COLT), 1999.

    [19] Nicolas Meuleau, Leonid Peshkin, Kee-Eung Kim, and Leslie Pack Kaelbling. Learning finite-state

    controllers for partially observable environments. In Uncertainty in Artificial Intelligence (UAI), 1999.

    [20] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare,

    Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control

    through deep reinforcement learning.   Nature, 2015.

[21] Theodore J Perkins and Doina Precup. A convergent form of approximate policy iteration. In Advances in Neural Information Processing Systems (NIPS), 2002.

    [22] Spyros Reveliotis and Theologos Bountourelis. Efficient pac learning for episodic tasks with acyclic

    state spaces.  Discrete Event Dynamic Systems, 2007.

    [23] Satinder Singh, Michael R James, and Matthew R Rudary. Predictive state representations: A new

    theory for modeling dynamical systems. In   Uncertainty in Artificial Intelligence (UAI). AUAI Press,

    2004.

[24] Alexander L Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L Littman. PAC model-free

    reinforcement learning. In International Conference on Machine Learning (ICML), 2006.

    [25] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods

for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS), 1999.

    [26] John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function

    approximation. IEEE Transactions on Automatic Control, 1997.


    A An Additional Corollary

A simple consequence of Theorem 4 is that we can estimate V* accurately once we have called DFS-Learn on ∅.

Corollary 1 (Estimating V*). Consider running DFS-Learn at ∅ with regressors F and parameters ε, δ. Then with probability at least 1 − δ, the estimate V̂ satisfies |V̂ − V*| ≤ ε/8. Moreover the algorithm uses at most

O( (MH⁵K²/ε²) log(NMHK/δ) )

trajectories.

Proof. Since we ran DFS-Learn at ∅, we may apply Theorem 4. By specification of the algorithm, we certainly ran TD-Elim at ∅, so we apply the conclusions in Theorem 2. In particular, we know that f* ∈ F and that for any surviving f ∈ F,

|V̂_f(p, π_f) − V*| = |V̂_f(p, π_f) − V_f(p, π_f) + V_f(p, π_f) − V_{f*}(p, π_{f*})|
                   ≤ φ/√12 + 8φ√K + 2φ + 20(H − 1)√K φ ≤ ε/8.

The last bound follows from the setting of φ. Since our estimate V̂ is V̂_f(p, π_f) for some surviving f, we guarantee estimation error at most ε/8.

As for the sample complexity, Theorem 4 shows that the total number of executions of TD-Elim can be at most MH, which is our setting of T.

    B Proof of Theorem 2

The proof of Theorem 2 is quite technical, and we compartmentalize it into several components. We begin with

    several technical lemmas. Throughout we will use the preconditions of the theorem, which we reproduce

    here.

Condition 1. For all f ∈ F and a ∈ A, we have estimates V̂_f(p ∘ a, π_f) such that

|V̂_f(p ∘ a, π_f) − V_f(p ∘ a, π_f)| ≤ φ.

Condition 2. For all f, g ∈ F and a ∈ A we have

|V_f(p ∘ a, π_f) − V_g(p ∘ a, π_g)| ≤ τ_1.

We will make frequent use of the parameters φ and τ_1, which are specified by these two conditions and explicit in the theorem statement.

Recall the notation

V_f(p, π_g) = E_{x∼D_p}[ f(x, π_g(x)) ],


which will be used heavily throughout the proof.

As a notational convenience, we will suppress dependence on the distribution D_p, since we are considering one invocation of TD-Elim and we always roll into path p. This means that all (observation, reward) tuples will be drawn from D_p. Secondly, it will be convenient to introduce the shorthand V_f(p) = V_f(p, π_f), and similarly for the estimates. Finally, we will further shorten the value functions for paths p ∘ a by defining

V_f^a = E_{x∼D_{p∘a}}[ f(x, π_f(x)) ] = V_f(p ∘ a, π_f).

We will also use V̂_f^a to denote the estimated versions which we have access to according to Condition 1. Lastly, our proof makes extensive use of the following random variable, which is defined for a particular regressor f ∈ F:

Y(f) = (f(x, a) − r(a) − V̂_f(p ∘ a))² − (f*(x, a) − r(a) − V̂_{f*}(p ∘ a))².

Here (x, r) ∼ D_p and a ∈ A is drawn uniformly at random as prescribed by Algorithm 4. We use Y(f) to denote the random variable associated with regressor f, but sometimes drop the dependence on f when it is clear from context.

    To proceed, we first compute the expectation and variance of this random variable.

Lemma 1 (Properties of TD Squared Loss). Assume Condition 1 holds. Then for any f ∈ F, the random variable Y satisfies

E_{x,a,r}[Y] = E_{x,a}[ (f(x, a) − V̂_f(p ∘ a) − f*(x, a) + V_{f*}(p ∘ a))² ] − E_{x,a}[ (V̂_{f*}(p ∘ a) − V_{f*}(p ∘ a))² ],
Var_{x,a,r}[Y] ≤ 32 E_{x,a}[Y] + 64φ².

Proof. For further shorthand, denote f = f(x, a), f* = f*(x, a), and recall the definitions of V_f^a and V̂_f^a.

E_{x,a,r}[Y] = E_{x,a,r}[ (f − V̂_f^a − r(a))² − (f* − V̂_{f*}^a − r(a))² ]
            = E_{x,a,r}[ (f − V̂_f^a)² − 2r(a)(f − V̂_f^a − f* + V̂_{f*}^a) − (f* − V̂_{f*}^a)² ].

Now recall that E[r(a) | x, a] = f*(x, a) − V_{f*}^a by the definition of f*, which allows us to further obtain

E_{x,a,r}[Y] = E_{x,a}[ (f − V̂_f^a)² − 2(f* − V_{f*}^a)(f − V̂_f^a) + 2(f* − V̂_{f*}^a + V̂_{f*}^a − V_{f*}^a)(f* − V̂_{f*}^a) − (f* − V̂_{f*}^a)² ]
            = E_{x,a}[ (f − V̂_f^a)² − 2(f* − V_{f*}^a)(f − V̂_f^a) + (f* − V̂_{f*}^a)² + 2(V̂_{f*}^a − V_{f*}^a)(f* − V̂_{f*}^a) ]
            = E_{x,a}[ (f − V̂_f^a)² − 2(f* − V_{f*}^a)(f − V̂_f^a) + (f* − V_{f*}^a + V_{f*}^a − V̂_{f*}^a)² + 2(V̂_{f*}^a − V_{f*}^a)(f* − V̂_{f*}^a) ]
            = E_{x,a}[ (f − V̂_f^a − f* + V_{f*}^a)² + 2(V_{f*}^a − V̂_{f*}^a)(f* − V_{f*}^a) + (V_{f*}^a − V̂_{f*}^a)² + 2(V̂_{f*}^a − V_{f*}^a)(f* − V̂_{f*}^a) ]
            = E_{x,a}[ (f − V̂_f^a − f* + V_{f*}^a)² − (V_{f*}^a − V̂_{f*}^a)² ].

For the second claim, notice that we can write

Y = (f − V̂_f^a − f* + V̂_{f*}^a)(f − V̂_f^a + f* − V̂_{f*}^a − 2r(a)),


so that

Y² ≤ 16(f − V̂_f^a − f* + V̂_{f*}^a)².

This holds because all quantities in the second term are bounded in [0, 1]. Therefore,

Var(Y) ≤ E[Y²]
       ≤ 16 E_{x,a}[ (f(x, a) − V̂_f^a − f*(x, a) + V̂_{f*}^a)² ]
       = 16 E_{x,a}[ (f(x, a) − V̂_f^a − f*(x, a) + V_{f*}^a + V̂_{f*}^a − V_{f*}^a)² ]
       ≤ 32 E_{x,a}[ (f(x, a) − V̂_f^a − f*(x, a) + V_{f*}^a)² ] + 32φ²
       ≤ 32 E_{x,a}[Y] + 64φ².

The first inequality is straightforward, while the second inequality is from the argument above. The third inequality uses the fact that (a + b)² ≤ 2a² + 2b² and the fact that for each a, the estimate V̂_{f*}^a has absolute error at most φ (by Condition 1). The last inequality adds and subtracts the term involving (V_{f*}^a − V̂_{f*}^a)² to obtain E_{x,a}[Y].

    The next step is to relate the empirical squared loss to the population squared loss, which is done by

    application of Bernstein’s inequality.

Lemma 2 (Squared Loss Deviation Bounds). Assume Condition 1 holds. With probability at least 1 − δ/2, where δ is a parameter of the algorithm, f* survives the filtering step of Algorithm 4 and moreover, any surviving f satisfies

E[Y(f)] ≤ 10φ² + 120 log(2N/δ)/n_train.

Proof. We will apply Bernstein's inequality on the centered random variable

Σ_{i=1}^{n_train} ( Y_i(f) − E[Y_i(f)] ),

and then take a union bound over all f ∈ F. Here the expectation is over the n_train samples (x_i, a_i, r_i) where (x_i, r) ∼ D_p, a_i is chosen uniformly at random, and r_i = r(a_i). Notice that since actions are chosen uniformly at random, all terms in the sum are identically distributed, so that E[Y_i(f)] = E[Y(f)].

To that end, fix one f ∈ F and notice that |Y − E[Y]| ≤ 8 almost surely, as each quantity in the definition of Y is bounded in [0, 1], so each of the four terms can be at most 4, but two are non-positive and two are non-negative in Y − E[Y]. We will use Lemma 1 to control the variance. Bernstein's inequality implies that, with probability at least 1 − δ,

Σ_{i=1}^{n_train} ( E[Y_i] − Y_i ) ≤ √( 2 Σ_i Var(Y_i) log(1/δ) ) + 16 log(1/δ)/3
                                  ≤ √( 64 Σ_i (E[Y_i] + 2φ²) log(1/δ) ) + 16 log(1/δ)/3.


    The first inequality here is Bernstein’s inequality while the second is based on the variance bound in

    Lemma 1.

Now letting X = √( Σ_i (E[Y_i] + 2φ²) ), Z = Σ_i Y_i, and C = √(log(1/δ)), the inequality above is equivalent to

X² − 2 n_train φ² − Z ≤ 8XC + (16/3)C²
⇒ X² − 8XC + 16C² − Z ≤ 2 n_train φ² + 22C²
⇒ (X − 4C)² − Z ≤ 2 n_train φ² + 22C²
⇒ −Z ≤ 2 n_train φ² + 22C².

Using the definition of −Z, this last inequality implies that

Σ_{i=1}^{n_train} (f*(x_i, a_i) − r_i(a_i) − V̂_{f*}(p ∘ a_i))² ≤ Σ_{i=1}^{n_train} (f(x_i, a_i) − r_i(a_i) − V̂_f(p ∘ a_i))² + 2 n_train φ² + 22 log(1/δ).

    Via a union bound over all  f 

     ∈ F , rebinding δ 

     ←δ/(2N ), and dividing through by ntrain, we have,

    R̃(f ) ≤ minf ∈F 

    R̃(f ) + 2φ2 + 22 log(2N/δ )

    ntrain

Since this is precisely the threshold used in filtering regressors, we ensure that f* survives. Now for any other surviving regressor f, we are ensured that Z is upper bounded. Specifically, we have,

(X − 4C)² ≤ Z + 2 n_train φ² + 22C² ≤ 4 n_train φ² + 44C²
⇒ X² ≤ ( √(4 n_train φ² + 44C²) + 4C )²
≤ 8 n_train φ² + 120C²

This proves the claim since X² = n_train E[Y(f)] + 2 n_train φ² (recall that the Y_i are identically distributed).
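The algebra in the last two displays can be checked mechanically; here is a minimal numerical sketch (the parameter values are arbitrary and only exercise the inequality (a + b)² ≤ 2a² + 2b²):

    import numpy as np

    rng = np.random.default_rng(1)
    for _ in range(1000):
        n = int(rng.integers(1, 10_000))
        phi = rng.uniform(0.0, 1.0)
        C = rng.uniform(0.0, 10.0)
        bound = 4 * n * phi ** 2 + 44 * C ** 2
        X = np.sqrt(bound) + 4 * C   # largest X consistent with (X - 4C)^2 <= bound
        assert X ** 2 <= 8 * n * phi ** 2 + 120 * C ** 2 + 1e-9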

This deviation bound allows us to establish the three claims in Theorem 2. We start with the estimation error claim, which is straightforward.

Lemma 3 (Estimation Error). Let δ ∈ (0, 1). Then with probability at least 1 − δ, for all f ∈ F that are retained by Algorithm 4, we have estimates V̂^f(p, π_f) with,

|V̂^f(p, π_f) − V^f(p, π_f)| ≤ √( 2 log(N/δ) / n_train ).

    Proof.  The proof is a consequence of Hoeffding’s inequality and a union bound. Clearly the Monte Carlo

    estimate,

V̂^f(p, π_f) = (1/n_train) Σ_{i=1}^{n_train} f(x_i, π_f(x_i)),

is unbiased for V^f(p, π_f), and the centered quantity is bounded in [−1, 1]. Thus Hoeffding's inequality gives precisely the bound in the lemma.
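For concreteness, a minimal sketch of this Monte Carlo estimate and of the deviation radius in Lemma 3 is given below; sample_context, f, and pi_f are hypothetical placeholders standing in for rolling in to the path p, the regressor, and its greedy policy.

    import numpy as np

    def mc_value_estimate(sample_context, f, pi_f, n_train):
        """Average f(x, pi_f(x)) over n_train contexts obtained by rolling in to p."""
        xs = [sample_context() for _ in range(n_train)]
        return float(np.mean([f(x, pi_f(x)) for x in xs]))

    def estimation_radius(n_train, N, delta):
        """Deviation bound of Lemma 3 after the union bound over the N regressors."""
        return np.sqrt(2 * np.log(N / delta) / n_train)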


    Next we turn to the claim regarding bias.

Lemma 4 (Bias Accumulation). Assume Conditions 1 and 2 hold. In the same 1 − δ/2 event in Lemma 2, for any pair f, g ∈ F retained by Algorithm 4, we have,

V^f(p, π_f) − V^g(p, π_g) ≤ 2√( K (11φ² + 120 log(2N/δ)/n_train) ) + 2φ + τ_1.

    Proof.  We start by expanding definitions,

V^f(p, π_f) − V^g(p, π_g) = E_{x∼D_p}[ f(x, π_f(x)) − g(x, π_g(x)) ]

Now, since g prefers π_g(x) to π_f(x), it must be the case that g(x, π_g(x)) ≥ g(x, π_f(x)), so that,

V^f(p, π_f) − V^g(p, π_g) ≤ E_{x∼D_p}[ f(x, π_f(x)) − g(x, π_f(x)) ]
= E_{x∼D_p}[ f(x, π_f(x)) − V̂^f(p ∘ π_f(x), π_f) − f*(x, π_f(x)) + V^{f*}(p ∘ π_f(x), π_{f*}) ]
  − E_{x∼D_p}[ g(x, π_f(x)) − V̂^g(p ∘ π_f(x), π_g) − f*(x, π_f(x)) + V^{f*}(p ∘ π_f(x), π_{f*}) ]
  + E_{x∼D_p}[ V̂^f(p ∘ π_f(x), π_f) − V̂^g(p ∘ π_f(x), π_g) ]

    This last equality is just based on adding and subtracting several terms. The first two terms look similar, and

    we will relate them to the squared loss. For the first, by Lemma  1, we have that for each x ∈ X ,

E_{r,a|x}[Y(f)] + E_{a|x}[ (V̂^{f*}(p ∘ a, π_{f*}) − V^{f*}(p ∘ a, π_{f*}))² ] = E_{a|x}[ (f(x, a) − V̂^f(p ∘ a, π_f) − f*(x, a) + V^{f*}(p ∘ a, π_{f*}))² ]
≥ (1/K) (f(x, π_f(x)) − V̂^f(p ∘ π_f(x), π_f) − f*(x, π_f(x)) + V^{f*}(p ∘ π_f(x), π_{f*}))²

The equality is Lemma 1, while the inequality follows from the fact that each action, in particular π_f(x), is played with probability 1/K and the quantity inside the expectation is non-negative. Now by Jensen's inequality the first term can be upper bounded as,

E_{x∼D_p}[ f(x, π_f(x)) − V̂^f(p ∘ π_f(x), π_f) − f*(x, π_f(x)) + V^{f*}(p ∘ π_f(x), π_{f*}) ]
≤ √( E_{x∼D_p}[ (f(x, π_f(x)) − V̂^f(p ∘ π_f(x), π_f) − f*(x, π_f(x)) + V^{f*}(p ∘ π_f(x), π_{f*}))² ] )
= √( K E_{x∼D_p}[ (1/K) (f(x, π_f(x)) − V̂^f(p ∘ π_f(x), π_f) − f*(x, π_f(x)) + V^{f*}(p ∘ π_f(x), π_{f*}))² ] )
≤ √( K ( E_{x,a,r}[Y(f)] + E_{x,a}[ (V̂^{f*}(p ∘ a, π_{f*}) − V^{f*}(p ∘ a, π_{f*}))² ] ) )
≤ √( K ( E[Y(f)] + φ² ) )
≤ √( K ( 11φ² + 120 log(2N/δ)/n_train ) ),


where the last step follows from Lemma 2. This bounds the first term in the expansion of V^f(p, π_f) − V^g(p, π_g). Now for the term involving g, we can apply essentially the same argument,

− E_{x∼D_p}[ g(x, π_f(x)) − V̂^g(p ∘ π_f(x), π_g) − f*(x, π_f(x)) + V^{f*}(p ∘ π_f(x), π_{f*}) ]
≤ √( E_{x∼D_p}[ (g(x, π_f(x)) − V̂^g(p ∘ π_f(x), π_g) − f*(x, π_f(x)) + V^{f*}(p ∘ π_f(x), π_{f*}))² ] )
≤ √( K ( 11φ² + 120 log(2N/δ)/n_train ) )

    Summarizing, the current bound we have is,

V^f(p, π_f) − V^g(p, π_g) ≤ 2√( K (11φ² + 120 log(2N/δ)/n_train) ) + E_{x∼D_p}[ V̂^f(p ∘ π_f(x), π_f) − V̂^g(p ∘ π_f(x), π_g) ]    (8)

    The last term is easily bounded by the preconditions in the statement of Theorem 2.  For each a, we have,

V̂^f(p ∘ a, π_f) − V̂^g(p ∘ a, π_g)
≤ |V̂^f(p ∘ a, π_f) − V^f(p ∘ a, π_f)| + |V^f(p ∘ a, π_f) − V^g(p ∘ a, π_g)| + |V^g(p ∘ a, π_g) − V̂^g(p ∘ a, π_g)|
≤ 2φ + τ_1,

    from Conditions 1 and 2.  Consequently

E_{x∼D_p}[ V̂^f(p ∘ π_f(x), π_f) − V̂^g(p ∘ π_f(x), π_g) ]
= Σ_{a∈A} E_x[ 1[π_f(x) = a] (V̂^f(p ∘ a, π_f) − V̂^g(p ∘ a, π_g)) ]
≤ 2φ + τ_1

This proves the claim.

Lastly, we must show how the squared loss relates to the risk, which helps establish the last claim of the theorem. The proof is similar to that of the bias bound but has subtle differences that require reproducing the argument.

Lemma 5 (Instantaneous Risk Bound). Assume Conditions 1 and 2 hold. In the same 1 − δ/2 event in Lemma 2, for any regressor f ∈ F retained by Algorithm 4, we have,

V^{f*}(p, π_{f*}) − V^{f*}(p, π_f) ≤ √( 2K (11φ² + 120 log(2N/δ)/n_train) ) + 2(φ + τ_1).

Proof.

V^{f*}(p, π_{f*}) − V^{f*}(p, π_f) = E_x[ f*(x, π_{f*}(x)) − f*(x, π_f(x)) ]
≤ E_x[ f*(x, π_{f*}(x)) − f(x, π_{f*}(x)) + f(x, π_f(x)) − f*(x, π_f(x)) ]


This follows since f prefers its own action to that of f*, so that f(x, π_f(x)) ≥ f(x, π_{f*}(x)). For any observation x ∈ X and action a ∈ A, define,

Δ_{x,a} = f(x, a) − V̂^f(p ∘ a, π_f) − f*(x, a) + V^{f*}(p ∘ a, π_{f*}),

so that we can write,

V^{f*}(p, π_{f*}) − V^{f*}(p, π_f)
≤ E_x[ Δ_{x,π_f(x)} − Δ_{x,π_{f*}(x)} + (V̂^f(p ∘ π_f(x)) − V^{f*}(p ∘ π_f(x)) − V̂^f(p ∘ π_{f*}(x)) + V^{f*}(p ∘ π_{f*}(x))) ]

The term involving both Δs can be bounded as in the proof of Lemma 4. For any x ∈ X,

E_{r,a|x}[Y(f)] + E_{a|x}[ (V̂^{f*}(p ∘ a) − V^{f*}(p ∘ a))² ] = E_{a|x}[ (f(x, a) − V̂^f(p ∘ a) − f*(x, a) + V^{f*}(p ∘ a))² ]
≥ (Δ²_{x,π_f(x)} + Δ²_{x,π_{f*}(x)}) / K ≥ (Δ_{x,π_f(x)} − Δ_{x,π_{f*}(x)})² / (2K)

Thus,

E_x[ Δ_{x,π_f(x)} − Δ_{x,π_{f*}(x)} ] ≤ √( 2K E_x[ (Δ_{x,π_f(x)} − Δ_{x,π_{f*}(x)})² / (2K) ] )
≤ √( 2K (E[Y(f)] + φ²) ) ≤ √( 2K (11φ² + 120 log(2N/δ)/n_train) )

We are left to bound the residual term,

V̂^f(p ∘ π_f(x)) − V^{f*}(p ∘ π_f(x)) − V̂^f(p ∘ π_{f*}(x)) + V^{f*}(p ∘ π_{f*}(x))
≤ V^f(p ∘ π_f(x)) − V^{f*}(p ∘ π_f(x)) − V^f(p ∘ π_{f*}(x)) + V^{f*}(p ∘ π_{f*}(x)) + 2φ
≤ 2(φ + τ_1)

Notice that Lemma 5 above controls the quantity V^{f*}(p, π_{f*}) − V^{f*}(p, π_f), which is the difference in values between the optimal behavior from p and the policy that first acts according to π_f and then behaves optimally thereafter. This is not the same as acting according to π_f for all subsequent actions. We will control this cumulative risk, V*(p) − V(p, π_f), in the second phase of the algorithm.

Proof of Theorem 2: Equipped with the above lemmas, we can proceed to prove the theorem. By assumption of the theorem, Conditions 1 and 2 hold, so all lemmas are applicable. Apply Lemma 3 with failure probability δ/2, where δ is the parameter in the algorithm, and apply Lemma 2, which also fails with probability at most δ/2. A union bound over these two events implies that the failure probability of the algorithm is at most δ.

Outside of this failure event, all three of Lemmas 3, 4, and 5 hold. If we set n_train = 24 log(2N/δ)/φ², then these bounds give,

|V̂^f(p, π_f) − V^f(p, π_f)| ≤ φ/√12
|V^f(p, π_f) − V^g(p, π_g)| ≤ 8φ√K + 2φ + τ_1
V^{f*}(p, π_{f*}) − V^{f*}(p, π_f) ≤ 4φ√(2K) + 2φ + 2τ_1.


These bounds hold for all f, g ∈ F that are retained by the algorithm. Of course, by Lemma 2, we are also ensured that f* is retained by the algorithm.

    We are easily able to prove the pairwise disagreement bound,

|V^f(p, π_f) − V^f(p, π_g)| = |E_{x∼D_p}[ f(x, π_f(x)) − f(x, π_g(x)) ]|

The quantity inside the absolute value is non-negative since f prefers π_f to π_g pointwise at each x.

|V^f(p, π_f) − V^f(p, π_g)| = V^f(p, π_f) − V^f(p, π_g)
≤ V^f(p, π_f) − V^g(p, π_f) + V^g(p, π_g) − V^f(p, π_g)
≤ 2(8φ√K + 2φ + τ_1)
= 16φ√K + 4φ + 2τ_1.

The first inequality follows since g prefers π_g to π_f, while the second one is based on the inequality (8) in the proof of Lemma 4, which actually bounds V^f(p, π_f) − V^g(p, π_f) for any pair f, g ∈ F.

    This proves the five claims in the theorem.
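As a quick arithmetic sanity check of the constants in the displays above, one can verify (for arbitrary positive φ and K of our choosing) that substituting n_train = 24 log(2N/δ)/φ², so that 120 log(2N/δ)/n_train = 5φ², recovers the stated 8φ√K and 4φ√(2K) terms:

    import math

    phi, K = 0.01, 7                   # arbitrary positive values
    log_term = (120 / 24) * phi ** 2   # 120 log(2N/delta) / n_train = 5 phi^2
    assert abs(2 * math.sqrt(K * (11 * phi ** 2 + log_term)) - 8 * phi * math.sqrt(K)) < 1e-12
    assert abs(math.sqrt(2 * K * (11 * phi ** 2 + log_term)) - 4 * phi * math.sqrt(2 * K)) < 1e-12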

    C Proof of Theorem 3

This result is a straightforward application of Hoeffding's inequality. We collect n_test observations x_i ∼ D_p by rolling in to p and use the Monte Carlo estimates,

V̂^f(p, π_f) = (1/n_test) Σ_{i=1}^{n_test} f(x_i, π_f(x_i))

By Hoeffding's inequality, via a union bound over all f ∈ F, we have that with probability at least 1 − δ,

|V̂^f(p, π_f) − V^f(p, π_f)| ≤ √( 2 log(2N/δ) / n_test )

Setting n_test = 2 log(2N/δ)/φ² gives that our empirical estimates are at most φ away from the population versions.

Now for the first claim, if the population versions are already within τ_2 of each other, then the empirical versions are at most 2φ + τ_2 apart by the triangle inequality,

|V̂^f(p, π_f) − V̂^g(p, π_g)| ≤ |V̂^f(p, π_f) − V^f(p, π_f)| + |V^f(p, π_f) − V^g(p, π_g)| + |V^g(p, π_g) − V̂^g(p, π_g)| ≤ 2φ + τ_2.

This applies for any pair f, g ∈ F whose population value predictions are within τ_2 of each other. Since we set ε_test ≥ 2φ + τ_2 in Theorem 3, this implies that the procedure returns true.

For the second claim, if the procedure returns true, then all empirical value predictions are at most ε_test apart, so the population versions are at most 2φ + ε_test apart, again by the triangle inequality. Specifically, for any pair f, g ∈ F we have,

|V^f(p, π_f) − V^g(p, π_g)| ≤ |V^f(p, π_f) − V̂^f(p, π_f)| + |V̂^f(p, π_f) − V̂^g(p, π_g)| + |V̂^g(p, π_g) − V^g(p, π_g)| ≤ 2φ + ε_test.

    Both arguments apply for all pairs  f, g ∈ F , which proves the claim.
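A minimal sketch of the test analyzed here, assuming the surviving regressors' Monte Carlo estimates have already been collected in a dictionary (the function name and signature are ours, not the paper's pseudocode): the procedure returns true exactly when all pairwise gaps are at most ε_test, which is equivalent to comparing the largest and smallest estimates.

    def consensus(value_estimates, eps_test):
        """value_estimates maps each surviving regressor f to its estimate of V^f(p, pi_f)."""
        vals = list(value_estimates.values())
        return max(vals) - min(vals) <= eps_test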


    D Proof of Theorem 4

Assume that all calls to TD-Elim and Consensus operate successfully, i.e., we can apply Theorems 2 and 3 on any path p for which the appropriate subroutine has been invoked. We will bound the number of calls and hence the total failure probability.

Recall that ε is the error parameter passed to DFS-Learn and that we set φ = ε/(320H²√K).

We first argue that in all calls to TD-Elim, the estimation precondition is satisfied. To see this, notice that by design, the algorithm only calls TD-Elim at path p after the recursive step, which means that for each a, we either ran TD-Elim on p ∘ a or Consensus returned true on p ∘ a. Since both Theorems 2 and 3 guarantee estimation error of order φ, the estimation precondition for path p holds. This argument applies to all paths p for which we call TD-Elim, so that the estimation precondition is always satisfied.

We next analyze the bias term, for which we proceed by induction. To state the inductive claim, we define the notion of an accessed path. We say that a path p is accessed if either (a) we called TD-Elim on path p, or (b) we called Consensus on p and it returned true.

Inductive Claim: For all accessed paths p with h actions remaining and any pair f, g ∈ F of surviving regressors,

|V^f(p, π_f) − V^g(p, π_g)| ≤ 20h√K φ

Base Case: The claim clearly holds with 0 actions remaining, since all regressors estimate future reward as zero.

Inductive Step: Assume that the inductive claim holds for all accessed paths with h − 1 actions remaining. Consider any accessed path p with h actions remaining. Since we access the path p, either we call TD-Elim or Consensus returns true. If we call TD-Elim, then we access the paths p ∘ a for all a ∈ A. Therefore, by the inductive hypothesis, we have already filtered the regressor class so that for all a ∈ A and f, g ∈ F, we have,

|V^f(p ∘ a, π_f) − V^g(p ∘ a, π_g)| ≤ 20(h − 1)√K φ.

We will therefore instantiate τ_1 = 20(h − 1)√K φ in the bias precondition of Theorem 2. We also know that the estimation precondition is satisfied with parameter φ. Therefore, the bias bound of Theorem 2 shows that, for all f, g ∈ F retained by the algorithm,

|V^f(p, π_f) − V^g(p, π_g)| ≤ 8φ√K + 2φ + τ_1 ≤ 10φ√K + 20(h − 1)φ√K ≤ 20(h − 1/2)φ√K    (9)

    Thus the inductive step holds in this case.

The other case we must consider is if Consensus returns true. Notice that for a path p with h actions to go, we call Consensus with parameter ε_test = 20(h − 1/4)√K φ. We actually invoke the routine on path p when we are currently processing a path p′ with h + 1 actions to go (i.e., p = p′ ∘ a for some a ∈ A), so we set ε_test in terms of H − |p′| − 5/4 = H − |p′ ∘ a| − 1/4 = h − 1/4. Then, by Theorem 3, we have the bias bound,

|V^f(p, π_f) − V^g(p, π_g)| ≤ 2φ + 20(h − 1/4)√K φ ≤ 20h√K φ

    Thus we have established the inductive claim.
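Both numerical steps in the induction (Eq. (9) and the Consensus case) reduce to the facts that 2 ≤ 2√K and 2φ ≤ 5√K φ for K ≥ 1; a small illustrative check over arbitrary values:

    import numpy as np

    rng = np.random.default_rng(2)
    for _ in range(1000):
        h = int(rng.integers(1, 50))
        K = int(rng.integers(1, 50))
        phi = rng.uniform(1e-6, 1.0)
        rootK = np.sqrt(K)
        # TD-Elim case, Eq. (9)
        assert 8 * phi * rootK + 2 * phi + 20 * (h - 1) * phi * rootK <= 20 * (h - 0.5) * phi * rootK + 1e-12
        # Consensus case
        assert 2 * phi + 20 * (h - 0.25) * rootK * phi <= 20 * h * rootK * phi + 1e-12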


Verifying preconditions for Theorem 2: To apply the conclusions of Theorem 2 at some state s, we must verify that the preconditions hold, with the appropriate parameter settings, before we execute the algorithm. We saw above that the estimation precondition always holds with parameter φ, assuming successful execution of all subroutines. The inductive argument also shows that the bias precondition holds with τ_1 = 20(h − 1)√K φ for a state s ∈ S_h that we called TD-Elim on. Thus, both preconditions are satisfied at each execution of TD-Elim, so the conclusions of Theorem 2 apply at any state s for which we have executed the subroutine. Note that the precondition parameters that we use here, specifically τ_1, depend on the level h.

    Sample Complexity:  We now bound the number of calls to each subroutine, which reveals how to

    allocate the failure probability and gives the sample complexity bound. Again assume that all calls succeed.

First notice that if we call Consensus on some state s ∈ S_h for which we have already called TD-Elim, then Consensus returns true (assuming all calls to subroutines succeed). This follows because TD-Elim guarantees that the population predicted values are at most 20(h − 1/2)√K φ apart (Eq. 9), which becomes the choice of τ_2 in the application of Theorem 3. This is legal since,

2φ + 20(h − 1/2)√K φ ≤ 20(h − 1/4)√K φ = ε_test,

so that the precondition for Theorem 3 holds. Therefore, at any level h, we can call TD-Elim at most one time per state s ∈ S_h. In total, this yields MH calls to TD-Elim.

Next, since we only make recursive calls when we execute TD-Elim, we expand at most M paths per level. This means that we call Consensus on at most MK paths per level, since the fan-out of the tree is K. Thus, the number of calls to Consensus is at most MKH.

By our setting of δ in the subroutine calls (i.e., δ/(2MKH) in calls to Consensus and δ/(2MH) in calls to TD-Elim), and by Theorems 2 and 3, the total failure probability is therefore at most δ.

Each execution of TD-Elim requires n_train trajectories while each execution of Consensus requires n_test trajectories. Since before each execution of TD-Elim we always perform K executions of Consensus, if we perform T executions of TD-Elim, the total sample complexity is bounded by,

T(n_train + K n_test) ≤ (3 × 10⁶) T H⁴K log(4NMH/δ)/ε² + (3 × 10⁵) T H⁴K² log(4NMKH/δ)/ε²
= O( (T H⁴K²/ε²) log(NMHK/δ) ).

    Recall that the total number of executions of  TD-Elim can be no more than  M H , by the argument above.
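The constants above come from plugging the parameter choices of this section into n_train and n_test; the sketch below (purely illustrative, with the δ-allocations δ/(2MH) and δ/(2MKH) as in the text) makes the bookkeeping explicit.

    import math

    def sample_counts(eps, H, K, N, M, delta):
        phi = eps / (320 * H ** 2 * math.sqrt(K))
        n_train = 24 * math.log(4 * N * M * H / delta) / phi ** 2      # log(2N / (delta/(2MH)))
        n_test = 2 * math.log(4 * N * M * K * H / delta) / phi ** 2    # log(2N / (delta/(2MKH)))
        per_td_elim = n_train + K * n_test
        return n_train, n_test, M * H * per_td_elim   # at most MH executions of TD-Elim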

    E Analysis for Explore-on-Demand 

The first part of the algorithm essentially computes the value V* at the root of the search tree, but does not ensure good performance of the retained policies. To do the latter, and to establish a PAC-guarantee, we run the Explore-on-Demand procedure.

Throughout the proof, we assume that |V̂ − V*| ≤ ε/8. We will ensure that the first half of the algorithm execution guarantees this. Let E denote the event that all Monte Carlo estimates V̂(∅, π_f) are accurate and all calls to DFS-Learn succeed (so that we may apply Theorem 4). By accurate, we mean,

|V̂(∅, π_f) − V(∅, π_f)| ≤ ε/8.

Formally, E is the intersection over all executions of DFS-Learn of the event that the conclusions of Theorem 4 apply for this execution, and the intersection over all iterations of the loop in Explore-on-Demand


of the event that the Monte Carlo estimate V̂(∅, π_f) is within ε/8 of V(∅, π_f). We will bound this failure probability, i.e., P[Ē], toward the end of the proof.

Lemma 6 (Risk bound upon termination). If E holds, then when Explore-on-Demand terminates, it outputs a policy π_f with V* − V(π_f) ≤ ε.

    Proof.  The proof is straightforward,

V* − V(π_f) ≤ |V* − V̂| + |V̂ − V̂(π_f)| + |V̂(π_f) − V(π_f)| ≤ ε/8 + ε/2 + ε/8 = 3ε/4 ≤ ε

The first bound follows by the assumption on V̂, while the second comes from the definition of ε_demand and the third holds under event E.

Lemma 7 (Termination Guarantee). If E holds, then when Explore-on-Demand selects a policy that is at most ε/4-suboptimal, it terminates.

    Proof.  We must show that the test succeeds, for which we will apply the triangle inequality,

|V̂ − V̂(π_f)| ≤ |V̂ − V*| + |V* − V(π_f)| + |V(π_f) − V̂(π_f)| ≤ ε/8 + ε/4 + ε/8 ≤ ε/2 = ε_demand,

And therefore the test is guaranteed to succeed. Again, the last bound here holds under event E.

At a current time in the execution of the algorithm, let L denote the set of learned states. Learned states are ones on which we have successfully called TD-Elim, so that we may apply Theorem 2. Since we only ever call TD-Elim through DFS-Learn, the fact that these calls to TD-Elim succeeded is implied by the event E. A slightly tighter definition of L, which is sufficient for our purposes, is

L(F) = ∪_h { s ∈ S_h : max_{f∈F} V*(s) − V^{f*}(s, π_f) ≤ 4φ√(2K) + 2φ + 40(h − 1)√K φ }.

The only property we will use from Theorem 2 is the instantaneous risk bound, which is what this alternative definition of L provides.

For a policy π_f, let q^{π_f}[s → L̄] denote the probability that, when behaving according to π_f starting from state s, we visit an unlearned state. We now show that q^{π_f}[∅ → L̄] is related to the risk of the policy π_f.

Lemma 8 (Policy Risk). Define L to be the set of states that have had TD-Elim called on them and define q^{π_f}[s → L̄] accordingly. Assume that E holds and let f be a surviving regressor, so that π_f is a surviving policy. Then,

V* − V(∅, π_f) ≤ q^{π_f}[∅ → L̄] + 40√K φ H².

Proof. Recall that under event E, we can apply the conclusions of Theorem 2 with φ = ε/(320H²√K) and τ_1 = 20(h − 1)√K φ for any h and state s ∈ S_h for which we have called TD-Elim. Our proof proceeds by creating a recurrence relation through application of Theorem 2 and then solving the relation. Specifically, we want to prove the following inductive claim.

Inductive Claim: For a state s ∈ L with h actions to go,

V*(s) − V(s, π_f) ≤ 40φ√K h² + q^{π_f}[s → L̄]


Base Case: With zero actions to go, all policies achieve zero reward and no policies visit L̄ from this point, so the inductive claim trivially holds.

Inductive Step: For the inductive hypothesis, consider some state s at level h for which TD-Elim has successfully been called. By Theorem 4, we know that,

V*(s) − V^{f*}(s, π_f) ≤ 4φ√(2K) + 2φ + 2τ_1,

with τ_1 = 20(h − 1)φ√K. This bound is clearly at most 40hφ√K. Now,

V*(s) − V(s, π_f) = V*(s) − V^{f*}(s, π_f) + V^{f*}(s, π_f) − V(s, π_f)
≤ 40hφ√K + E_{(x,r)∼D_s}[ r(π_f(x)) + V*(s ∘ π_f(x)) − r(π_f(x)) − V(s ∘ π_f(x), π_f) ].

Let us focus on just the second term, which is equal to,

E_{x∼D_s}[ (V*(s ∘ π_f(x)) − V(s ∘ π_f(x), π_f)) (1[Γ(s, π_f(x)) ∈ L] + 1[Γ(s, π_f(x)) ∉ L]) ]
≤ Σ_{s′∈L} P_{x∼D_s}[Γ(s, π_f(x)) = s′] (V*(s′) − V(s′, π_f)) + P_{x∼D_s}[Γ(s, π_f(x)) ∉ L]

Since all of the recursive terms above correspond only to states s′ ∈ L, we may apply the inductive hypothesis to obtain the bound,

40hφ√K + Σ_{s′∈L} P_{x∼D_s}[Γ(s, π_f(x)) = s′] ( 40(h − 1)²φ√K + q^{π_f}[s′ → L̄] ) + P_{x∼D_s}[Γ(s, π_f(x)) ∉ L]
≤ 40hφ√K + 40(h − 1)²φ√K + q^{π_f}[s → L̄]
≤ 40φ√K h² + q^{π_f}[s → L̄]

Thus we have proved the inductive claim. Applying it at the root of the tree gives the result.

Recall that we set φ = ε/(320H²√K) in DFS-Learn. This ensures that 40H²φ√K ≤ ε/8, which means that if q^{π_f}[∅ → L̄] = 0, then we ensure V* − V(∅, π_f) ≤ ε/8.

Lemma 9 (Each non-terminal iteration makes progress). Assume that E holds. If π_f is selected but fails the test, then with probability at least 1 − exp(−n_demand,2 ε/8), at least one of the n_demand,2 trajectories collected visits a state s ∉ L.

Proof. First, if π_f fails the test, we know that,

ε_demand < |V̂(∅, π_f) − V̂| ≤ ε/4 + |V(∅, π_f) − V*|

which implies that,

ε/4 < V* − V(∅, π_f)

On the other hand, Lemma 8 shows that,

V* − V(∅, π_f) ≤ q^{π_f}[∅ → L̄] + 40H²√K φ

Using our setting of φ and combining the two bounds gives,

ε/4 < q^{π_f}[∅ → L̄] + ε/8 ⇒ q^{π_f}[∅ → L̄] > ε/8


Thus, the probability that all n_demand,2 trajectories miss L̄ is,

P[all trajectories miss L̄] = (1 − q^{π_f}[∅ → L̄])^{n_demand,2} ≤ (1 − ε/8)^{n_demand,2} ≤ exp(−n_demand,2 ε/8).

    Thus we must hit  L̄ with substantial probability.
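The last step uses only 1 − x ≤ e^{−x}; a quick numerical confirmation with arbitrary values of ε and n_demand,2:

    import math

    for eps in (0.01, 0.1, 0.5):
        for n in (10, 100, 1000):
            assert (1 - eps / 8) ** n <= math.exp(-n * eps / 8)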

    E.1 Proof of Theorem 5

Again, for now assume that E holds. First of all, by Lemma 6, we argued that if Explore-on-Demand terminates, then it outputs a policy that satisfies the PAC-guarantee. Moreover, by Lemma 7, we also argued that if Explore-on-Demand selects a policy that is at most ε/4 suboptimal, then it terminates. Thus the goal of the proof is to show that it quickly finds a policy that is at most ε/4 suboptimal.

Every execution of the loop in Explore-on-Demand either passes the test or fails the test at level ε_demand. If the test succeeds, then Lemma 6 certifies that we have found an ε-suboptimal policy, thus establishing the PAC-guarantee. If the test fails, then Lemma 9 guarantees that we call DFS-Learn on a state that was not previously trained on. Thus at each non-terminal iteration of the loop, we call DFS-Learn and hence TD-Elim on at least one state s ∉ L, so that the set of learned states grows by at least one. By Lemma 8 and our setting of φ, if we have called TD-Elim on all states at all levels, then we guarantee that all surviving policies have risk at most ε/8. Thus the number of iterations of the loop is bounded by at most MH, since that is the number of unique states in the Contextual-MDP.

Bounding P[Ē]: Since we have bounded the total number of iterations, we are now in a position to assign failure probabilities and bound P[Ē]. Actually, we must consider not only the event E but also the event that all iterations where the test fails visit some state s ∉ L. Call this new event E′, which is the intersection of E with the event that all unsuccessful iterations visit L̄.

We have δ probability to allocate, and we perform at most MH iterations. Thus in each iteration we may allocate δ/(MH) probability. There are three types of events required: (1) the initial Monte Carlo estimates V̂(∅, π_f) must be close to V(∅, π_f), (2) the failure probability in Lemma 9 must be small, and (3) all H n_demand,2 calls to DFS-Learn at this iteration must succeed. Naively, we allocate 1/3 of the available failure probability to each.

For the initial Monte Carlo estimate, by Hoeffding's inequality, we know that,

|V̂(∅, π_f) − V(∅, π_f)| ≤ √( log(6MH/δ) / (2 n_demand,1) ).

We want this bound to be at most ε/8, which requires:

    ndemand,1