
Minimax Weight and Q-Function Learning for Off-Policy Evaluation

Masatoshi Uehara 1 Jiawei Huang 2 Nan Jiang 2

Abstract

We provide theoretical investigations into off-policy evaluation in reinforcement learning using function approximators for (marginalized) importance weights and value functions. Our contributions include:
(1) A new estimator, MWL, that directly estimates importance ratios over the state-action distributions, removing the reliance on knowledge of the behavior policy as in prior work (Liu et al., 2018).
(2) Another new estimator, MQL, obtained by swapping the roles of importance weights and value-functions in MWL. MQL has an intuitive interpretation of minimizing average Bellman errors and can be combined with MWL in a doubly robust manner.
(3) Several additional results that offer further insights, including the sample complexities of MWL and MQL, their asymptotic optimality in the tabular setting, how the learned importance weights depend on the choice of the discriminator class, and how our methods provide a unified view of some old and new algorithms in RL.

1. Introduction

In reinforcement learning (RL), off-policy evaluation (OPE) refers to the problem of estimating the performance of a new policy using historical data collected from a different policy, which is of crucial importance to the real-world applications of RL. The problem is genuinely hard as any unbiased estimator has to suffer a variance exponential in horizon in the worst case (Li et al., 2015; Jiang and Li, 2016), known as the curse of horizon.

Recently, a new family of estimators based on marginalized importance sampling (MIS) has received significant attention from the community (Liu et al., 2018; Xie et al., 2019), as they overcome the curse of horizon with relatively mild representation assumptions. The basic idea is to learn the marginalized importance weight that converts the state distribution in the data to that induced by the target policy, which sometimes has much smaller variance than the importance weight on action sequences used by standard sequential IS. Among these works, Liu et al. (2018) learn the importance weights by solving a minimax optimization problem defined with the help of a discriminator value-function class.

1 Harvard University. 2 University of Illinois at Urbana-Champaign. Correspondence to: Masatoshi Uehara <[email protected]>.

Preliminary work.

In this work, we investigate more deeply the space of algorithms that utilize a value-function class and an importance weight class for OPE. Our main contributions are:

• (Section 4) A new estimator, MWL, that directly estimates importance ratios over the state-action distributions, removing the reliance on knowledge of the behavior policy as in prior work (Liu et al., 2018).

• (Section 5) By swapping the roles of importance weights and Q-functions in MWL, we obtain a new estimator that learns a Q-function using importance weights as discriminators. The procedure and the guarantees of MQL exhibit an interesting symmetry w.r.t. MWL. We also combine MWL and MQL in a doubly robust manner and provide their sample complexity guarantees (Section 6).

• (Section 7) We examine the statistical efficiency of MWL and MQL, and show that by modeling state-action functions, MWL and MQL are able to achieve the semiparametric lower bound of OPE in the tabular setting while their state-function variants fail to do so.

• Our work provides a unified view of many old and new algorithms in RL. For example, when both importance weights and value functions are modeled using the same linear class, we recover LSTDQ (Lagoudakis and Parr, 2004) and off-policy LSTD (Bertsekas and Yu, 2009; Dann et al., 2014) as special cases of MWL/MQL and their state-function variants. This gives LSTD algorithms a novel interpretation that is very different from the standard TD intuition. As another example, (tabular) model-based OPE and step-wise importance sampling, two algorithms that are so different that we seldom connect them to each other, are both special cases of MWL.



2. Preliminaries

An infinite-horizon discounted MDP is often specified by a tuple (S, A, P, R, γ), where S is the state space, A is the action space, P : S × A → ∆(S) is the transition function, R : S × A → ∆([0, Rmax]) is the reward function, and γ ∈ [0, 1) is the discount factor. We also use X := S × A to denote the space of state-action pairs. Given an MDP, a (stochastic) policy π : S → ∆(A) and a starting state distribution d0 ∈ ∆(S) together determine a distribution over trajectories of the form s0, a0, r0, s1, a1, r1, . . ., where s0 ∼ d0, at ∼ π(st), rt ∼ R(st, at), and st+1 ∼ P(st, at) for t ≥ 0. The ultimate measure of a policy's performance is the (normalized) expected discounted return:

Rπ := (1 − γ) E_{d0,π}[ Σ_{t=0}^∞ γ^t r_t ],   (1)

where the expectation is taken over the randomness of the trajectory determined by the initial distribution and the policy on the subscript, and (1 − γ) is the normalization factor.

A concept central to this paper is the notion of (normalized) discounted occupancy:

dπ,γ := (1 − γ) Σ_{t=0}^∞ γ^t dπ,t,

where dπ,t ∈ ∆(X) is the distribution of (st, at) under policy π. (The dependence on d0 is made implicit.) We will sometimes also write s ∼ dπ,γ for sampling from its marginal distribution over states. An important property of discounted occupancy, which we will make heavy use of, is

Rπ = E_{(s,a)∼dπ,γ, r∼R(s,a)}[r].   (2)
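As a quick numerical illustration of the equivalence between Eq.(1) and Eq.(2), the sketch below estimates Rπ both ways on a small hypothetical tabular MDP. The transition, reward, and policy tables are made up for illustration only, and rewards are taken as their means for simplicity.

```python
# A toy numerical check of Eq.(1) vs Eq.(2) on a hypothetical 2-state, 2-action MDP.
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],   # P[s, a, s']: transition probabilities
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])     # R[s, a]: mean reward
pi = np.array([[0.6, 0.4], [0.2, 0.8]])    # pi[s, a]: target policy
d0 = np.array([0.7, 0.3])                  # initial state distribution

def rollout(T):
    """Generate a length-T trajectory [(s, a, r), ...] under pi starting from d0."""
    s, traj = rng.choice(2, p=d0), []
    for _ in range(T):
        a = rng.choice(2, p=pi[s])
        traj.append((s, a, R[s, a]))
        s = rng.choice(2, p=P[s, a])
    return traj

# Eq.(1): normalized discounted return along (truncated) trajectories.
est1 = np.mean([(1 - gamma) * sum(gamma**t * r for t, (_, _, r) in enumerate(rollout(200)))
                for _ in range(3000)])

# Eq.(2): draw a time step T with P(T = t) = (1 - gamma) * gamma**t, so that (s_T, a_T) ~ d_{pi,gamma},
# and read off the reward at that step.
def reward_from_occupancy():
    t = rng.geometric(1 - gamma) - 1
    return rollout(t + 1)[t][2]

est2 = np.mean([reward_from_occupancy() for _ in range(3000)])
print(est1, est2)   # the two Monte Carlo estimates agree up to noise
```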

It will be useful to define the policy-specific Q-function:

Qπ(s, a) := E[ Σ_{t=0}^∞ γ^t r_t | s0 = s, a0 = a; at ∼ π(st) ∀t > 0 ].

The corresponding state-value function is V π(s) := Qπ(s, π), where for any function f, f(s, π) is the shorthand for E_{a∼π(s)}[f(s, a)].

Off-Policy Evaluation (OPE) We are concerned with estimating the expected discounted return of an evaluation policy πe under a given initial distribution d0, using data collected from a different behavior policy πb. For our methods, we will consider the following data generation protocol, where we have a dataset consisting of n i.i.d. tuples (s, a, r, s′) generated according to the distribution:

s ∼ dπb, a ∼ πb(s), r ∼ R(s, a), s′ ∼ P(s, a).

Here dπb is some exploratory state distribution that well covers the state space, and the technical assumptions required on this distribution will be discussed in later sections. With a slight abuse of notation we will also refer to the joint distribution over (s, a, r, s′) or its marginal on (s, a) as dπb, e.g., whenever we write (s, a, r, s′) ∼ dπb or (s, a) ∼ dπb, the variables are always distributed according to the above generative process. We will use E[·] to denote the exact expectation, and use En[·] as its empirical approximation using the n data points.

On the i.i.d. assumption Although we assume i.i.d. data for concreteness and the ease of exposition, the actual requirement on the data is much milder: our method works as long as the empirical expectation (over n data points) concentrates around the exact expectation w.r.t. (s, a, r, s′) ∼ dπb. This holds, for example, when the Markov chain induced by πb is ergodic, and our data is a single long trajectory generated by πb without resetting. As long as the induced chain mixes nicely, it is well known that the empirical expectation over the single trajectory will concentrate, and in this case dπb(s) corresponds to the stationary distribution of the Markov chain.1

3. Overview of OPE Methods

Direct Methods A straightforward approach to OPE is to estimate an MDP model from data, and then compute the quantity of interest from the estimated model. An alternative but closely related approach is to fit Qπe directly from data using standard approximate dynamic programming (ADP) techniques, e.g., the policy evaluation analog of Fitted Q-Iteration (Ernst et al., 2005; Le et al., 2019). While these methods overcome the curse of dimensionality and are agnostic to the knowledge of πb, they often require very strong representation assumptions to succeed: for example, in the case of fitting a Q-value function from data, not only does one need to assume realizability, that the Q-function class (approximately) captures Qπe, but the class also needs to be closed under the Bellman update Bπe (Antos et al., 2008); otherwise ADP can diverge in discounted problems (Tsitsiklis and Van Roy, 1997) or suffer exponential sample complexity in finite-horizon problems (Dann et al., 2018, Theorem 45). We refer the readers to Chen and Jiang (2019) for further discussions on this condition. When the function approximator fails to satisfy these strong assumptions, the estimator can potentially incur a high bias.

Importance Sampling (IS) IS forms an unbiased estimate of the expected return by collecting full-trajectory behavioral data and reweighting each trajectory according to its likelihood under πe over πb (Precup et al., 2000). Such a ratio can be computed as the cumulative product of the importance weight over actions (πe(a|s)/πb(a|s)) for each time step, which is the cause of high variance in IS: even if πe and πb only have constant divergence per step, the divergence will be amplified over the horizon, causing the cumulative importance weight to have exponential variance, thus the "curse of horizon". Although techniques that combine IS and direct methods can partially reduce the variance, the exponential variance of IS simply cannot be improved when the MDP has significant stochasticity (Jiang and Li, 2016).

1 We consider precisely this setting in Appendix C.1 to solidify the claim that we do not really need i.i.d.ness.

Method | πb known? | Target object | Func. approx. | Tabular optimality
MSWL (Liu et al., 2018) | Yes | wSπe/πb (Eq.(3)) | wSπe/πb ∈ WS, V πe ∈ FS (*) | No
MWL (Sec 4) | No | wπe/πb (Eq.(4)) | wπe/πb ∈ W, Qπe ∈ conv(F) | Yes
MQL (Sec 5) | No | Qπe | Qπe ∈ Q, wπe/πb ∈ conv(G) | Yes
Fitted-Q | No | Qπe | Q closed under Bπe | Yes

Table 1. Summary of some of the OPE methods. For methods that require knowledge of πb, the policy can be estimated from data to form a "plug-in" estimator (e.g., Hanna et al., 2019). In the function approximation column, we use blue color to mark the conditions for the discriminator classes for minimax-style methods. For Liu et al. (2018), we use WS and FS for the function classes to emphasize that their functions are over the state space (ours are over the state-action space). Although they assumed V πe ∈ FS (*), this assumption can also be relaxed to V πe ∈ conv(FS) as in our analyses. Also note that the assumption for the main function classes (WS, W, and Q) can be relaxed as discussed in Examples 1 and 3, and we put realizability conditions here only for simplicity.

Marginalized Importance Sampling (MIS) MIS improves over IS by observing that, if πb and πe induce marginal distributions over states that have substantial overlap (which is often the case in many practical scenarios), then reweighting the reward r in each data point (s, a, r, s′) with the following ratio

wSπe/πb(s) · πe(a|s)/πb(a|s),   where wSπe/πb(s) := dπe,γ(s) / dπb(s),   (3)

can potentially have much lower variance than reweighting the entire trajectory (Liu et al., 2018). The difference between IS and MIS is essentially performing importance sampling using Eq.(1) vs. Eq.(2). However, the weight wSπe/πb is not directly available and has to be estimated from data. Liu et al. (2018) propose an estimation procedure that requires two function approximators, one for modeling the weighting function wSπe/πb(s), and the other for modeling V πe, which is used as a discriminator class for distribution learning. Compared to the direct methods, MIS only requires standard realizability conditions for the two function classes, though it also needs the knowledge of πb. A related method for finite-horizon problems has been developed by Xie et al. (2019).
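To make the contrast concrete, here is a schematic sketch of the two reweighting schemes. The trajectory/transition data, the policy interfaces pi_e and pi_b, and the oracle state weight w_S are hypothetical placeholders; in practice w_S must itself be estimated, which is exactly what the methods discussed below address.

```python
# Schematic estimators: step-wise IS over whole trajectories vs. the marginalized reweighting of Eq.(3).
# `trajectories` is a list of [(s, a, r), ...] generated by pi_b; `transitions` is a list of (s, a, r, s');
# pi_e(s) / pi_b(s) return action-probability vectors; w_S(s) returns d_{pi_e,gamma}(s)/d_{pi_b}(s).
import numpy as np

def stepwise_is(trajectories, pi_e, pi_b, gamma):
    """Reweight by the cumulative product of per-step action ratios (the source of the 'curse of horizon')."""
    estimates = []
    for traj in trajectories:
        rho, total = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e(s)[a] / pi_b(s)[a]             # cumulative importance weight up to time t
            total += (1 - gamma) * gamma**t * rho * r  # contribution to the normalized return, Eq.(1)
        estimates.append(total)
    return float(np.mean(estimates))

def marginalized_is(transitions, w_S, pi_e, pi_b):
    """Reweight single transitions by w_S(s) * pi_e(a|s)/pi_b(a|s), i.e., importance sampling via Eq.(2)."""
    vals = [w_S(s) * pi_e(s)[a] / pi_b(s)[a] * r for (s, a, r, _) in transitions]
    return float(np.mean(vals))
```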

4. Minimax Weight Learning (MWL)

In this section we propose a simple extension to Liu et al. (2018) that is agnostic to the knowledge of πb. The estimator in the prior work uses a discriminator class that contains V πe to learn the marginalized importance weight on state distributions (see Eq.(3)). We show that as long as the discriminator class is slightly more powerful, in particular a Q-function class that realizes Qπe, we are able to learn the importance weight over state-action pairs directly:

wπe/πb(s, a) := dπe,γ(s, a) / dπb(s, a).   (4)

We can use it to directly re-weight the rewards without having to know πb, as Rπe = Rw[wπe/πb] := E_{dπb}[wπe/πb(s, a) · r]. It will also be useful to define Rw,n[w] := En[w(s, a) · r] as the empirical approximation of Rw[·] based on n data points.

Before giving the estimator and its theoretical properties, we start with two assumptions that we will use throughout the paper, most notably that the state-action distribution in the data well covers the discounted occupancy induced by πe.

Assumption 1. Assume X = S × A is a compact space. Let ν be its Lebesgue measure.2

Assumption 2. There exists Cw < +∞ such that wπe/πb(s, a) ≤ Cw ∀(s, a) ∈ X.

In the rest of this section, we derive the new estimator and provide its theoretical guarantee. Our derivation (Eqs.(5)–(7)) provides the high-level intuitions for the method while only invoking basic and familiar concepts in MDPs (essentially, just Bellman equations). The estimator of Liu et al. (2018) can also be derived in a similar manner.

Derivation Recall that it suffices to learn w : X → R such that Rw[w] = Rπe. This is equivalent to

E_{dπb}[w(s, a) · r] = (1 − γ) E_{s∼d0}[Qπe(s, πe)].   (5)

By the Bellman equation, we have E[r|s, a] = E[Qπe(s, a) − γ Qπe(s′, πe) | s, a]. We use the RHS to replace r in Eq.(5):

E_{dπb}[w(s, a) · (Qπe(s, a) − γ Qπe(s′, πe))] = (1 − γ) E_{s∼d0}[Qπe(s, πe)].   (6)

To recap, it suffices to find any w that satisfies the above equation. Since we do not know Qπe, we will use a function class F that (hopefully) captures Qπe, and find w that minimizes (the absolute value of) the following objective function, which measures the violation of Eq.(6) over all f ∈ F:

Lw(w, f) := E_{(s,a,r,s′)∼dπb}[ γ w(s, a) f(s′, πe) − w(s, a) f(s, a) ] + (1 − γ) E_{s∼d0}[f(s, πe)].   (7)

The loss is always zero when w = wπe/πb, so adding functions to F does not hurt its validity. Such a solution is also unique if we require Lw(w, f) = 0 for a rich set of functions and dπb is supported on the entire X, formalized as the following lemma:3

Lemma 1. Lw(wπe/πb, f) = 0 ∀f ∈ L2(X, ν) := {f : ∫ f(s, a)^2 dν < ∞}. Moreover, under additional technical assumptions,4 wπe/πb is the only function that satisfies this.

2 When ν is the counting measure for finite X, all the results hold with minor modifications.

This motivates the following estimator, which uses two function classes: a class W : X → R to model the wπe/πb function, and another class F : X → R to serve as the discriminators:

ŵ(s, a) = arg min_{w∈W} max_{f∈F} Lw(w, f)^2.   (9)

Note that this is the ideal estimator that assumes exact expectations (or equivalently, an infinite amount of data). In reality, we will only have access to a finite sample, and the real estimator replaces Lw(w, f) with its sample-based estimation, defined as

Lw,n(w, f) := En[ γ w(s, a) f(s′, πe) − w(s, a) f(s, a) ] + (1 − γ) E_{d0}[f(s, πe)].   (10)

So the sample-based estimator is ŵn(s, a) := arg min_{w∈W} max_{f∈F} Lw,n(w, f)^2. We call this estimation procedure MWL (minimax weight learning), and provide its main theorem below.
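To make the procedure concrete before stating the main theorem, here is a minimal, purely illustrative sketch of Eq.(10) and the minimax search, done by brute force over finite candidate lists. The data arrays, initial-state draws, policy interface, and candidate classes W_class / F_class are hypothetical placeholders, not the optimization used in our experiments.

```python
# Illustrative-only implementation of Eq.(10) and a brute-force minimax over finite candidate lists.
# data = (s, a, r, s_next) as numpy arrays; init_states are draws from d0; pi_e(s) returns action probs.
import numpy as np

def f_pi(f, states, pi_e):
    """E_{a ~ pi_e(s)}[f(s, a)] for each state in `states`."""
    return np.array([sum(p * f(s, a) for a, p in enumerate(pi_e(s))) for s in states])

def L_w_n(w, f, data, init_states, pi_e, gamma):
    """Empirical loss of Eq.(10)."""
    s, a, r, s_next = data
    w_sa = np.array([w(si, ai) for si, ai in zip(s, a)])
    f_sa = np.array([f(si, ai) for si, ai in zip(s, a)])
    td_part = np.mean(gamma * w_sa * f_pi(f, s_next, pi_e) - w_sa * f_sa)
    return td_part + (1 - gamma) * np.mean(f_pi(f, init_states, pi_e))

def mwl_estimate(W_class, F_class, data, init_states, pi_e, gamma):
    """w_n = argmin_w max_f L_{w,n}(w,f)^2, then reweight rewards: R_{w,n}[w_n] = E_n[w_n(s,a) * r]."""
    w_n = min(W_class, key=lambda w: max(L_w_n(w, f, data, init_states, pi_e, gamma)**2
                                         for f in F_class))
    s, a, r, _ = data
    return float(np.mean([w_n(si, ai) * ri for si, ai, ri in zip(s, a, r)]))
```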

Theorem 2. For any given w : X → R, define Rw[w] := E_{dπb}[w(s, a) · r]. If Qπe ∈ conv(F), where conv(·) denotes the convex hull of a function class, then

|Rπe − Rw[w]| ≤ max_{f∈F} |Lw(w, f)|,
|Rπe − Rw[ŵ]| ≤ min_{w∈W} max_{f∈F} |Lw(w, f)|.

A few comments are in order:

1. To guarantee that the estimation is accurate, all we need is Qπe ∈ conv(F) and that min_w max_f |Lw(w, f)| is small. While the latter can be guaranteed by realizability of W, i.e., wπe/πb ∈ W, we show in an example below that realizability is sufficient but not always necessary: in the extreme case where F only contains Qπe, even a constant w function can satisfy max_f |Lw(w, f)| = 0 and hence provide accurate OPE estimation.

Example 1 (Realizability of W can be relaxed). When F = {Qπe}, as long as w0 ∈ W, where w0 is a constant function that always evaluates to Rπe/Rπb, we have Rw[ŵ] = Rπe. See Appendix A.1 for a detailed proof.

3 All proofs of this paper can be found in the appendices.
4 As we will see, the identifiability of wπe/πb is not crucial to the OPE goal, so we defer the technical assumptions to a formal version of the lemma in Appendix A, where we also show that the same statement holds when F is an ISPD kernel; see Theorem 15.

2. For the condition that Qπe ∈ conv(F), we can further relax the convex hull to the linear span, though we will need to pay the ℓ1 norm of the combination coefficients (which is 1 for convex combinations) in the later sample complexity analysis, and we do not consider this relaxation for simplicity. Also note that relaxing F to conv(F) is not useful when F itself is linear, but can provide improper learning benefits when F is a non-linear class. We will see an intuitive example in an analogous situation for the MQL estimator in the next section.

3. Although Eq.(9) uses Lw(w, f)^2 in the objective function, the square is mostly for optimization convenience and is not vital in determining the statistical properties of the estimator. In the later sample complexity analysis, it will be much more convenient to work with the equivalent objective function that uses |Lw(w, f)| instead.

4. When the behavior policy πb is known, we can incorporate this knowledge by setting W = {(s, a) 7→ w(s) πe(a|s)/πb(a|s) : w ∈ WS}, where WS is some function class over the state space. The resulting estimator is still different from (Liu et al., 2018), since our discriminator class is still over the state-action space.

4.1. Case Studies

The estimator in Eq.(10) requires solving a minimax optimization problem, which can be computationally challenging. Following Liu et al. (2018), we show that the inner maximization has a closed-form solution when we choose F to be (the unit ball of) a reproducing kernel Hilbert space (RKHS) HK associated with a kernel K(·, ·). We include an informal statement below and defer the detailed expression to Appendix A.2 due to the space limit.

Lemma 3 (Informal). When F = {f ∈ (X → R) : 〈f, f〉HK ≤ 1}, the term max_{f∈F} Lw(w, f)^2 has a closed-form expression.

As a further special case, when both W and F are linear classes under the same state-action features φ : X → R^d, the resulting algorithm has a close connection to LSTDQ (Lagoudakis and Parr, 2004), which we will discuss more explicitly later in Section 5, Example 7.


Example 2. Let w(s, a; α) = φ(s, a)^T α, where φ(s, a) ∈ R^d is some basis function and α is the parameter vector. If we use the same linear function space as F, i.e., F = {(s, a) 7→ φ(s, a)^T β : ‖β‖2 ≤ 1}, then the estimate of α given by MWL is (assuming the matrix being inverted is full-rank)

α̂ = En[ −γ φ(s′, πe) φ(s, a)^T + φ(s, a) φ(s, a)^T ]^{-1} (1 − γ) E_{s∼d0}[φ(s, πe)].   (11)

The sample-based estimator for the OPE problem is therefore Rw,n[ŵn] = En[r φ(s, a)^T] α̂; see Appendix A.3 for a full derivation.
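A minimal numpy sketch of the closed form in Eq.(11) is given below; the feature map phi, the data arrays, the initial-state draws, and the policy interface are hypothetical, and we assume discrete actions so that φ(s, πe) can be computed exactly.

```python
# Illustrative numpy version of Eq.(11).
import numpy as np

def phi_pi(phi, states, pi_e):
    """E_{a ~ pi_e(s)}[phi(s, a)] for each state, stacked into an (n, d) matrix."""
    return np.stack([sum(p * phi(s, a) for a, p in enumerate(pi_e(s))) for s in states])

def linear_mwl_ope(phi, data, init_states, pi_e, gamma):
    s, a, r, s_next = data
    Phi = np.stack([phi(si, ai) for si, ai in zip(s, a)])          # phi(s, a), shape (n, d)
    Phi_next = phi_pi(phi, s_next, pi_e)                           # phi(s', pi_e), shape (n, d)
    A = (Phi - gamma * Phi_next).T @ Phi / len(r)                  # E_n[(phi(s,a) - gamma*phi(s',pi_e)) phi(s,a)^T]
    b = (1 - gamma) * phi_pi(phi, init_states, pi_e).mean(axis=0)  # (1 - gamma) E_{d0}[phi(s, pi_e)]
    alpha = np.linalg.solve(A, b)                                  # Eq.(11); assumes full rank
    return float(np.mean(r * (Phi @ alpha)))                       # R_{w,n}[w_n] = E_n[r * phi(s,a)^T alpha]
```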

Just as our method corresponds to LSTDQ in the linear setting, it is worth pointing out that the method of Liu et al. (2018), which we will call MSWL (minimax state weight learning) for distinction and easy reference, corresponds to off-policy LSTD (Bertsekas and Yu, 2009; Dann et al., 2014); see Appendix A.5 for details.

4.2. Connections to related work

Nachum et al. (2019a) have recently proposed a version of MIS with a similar goal of being agnostic to the knowledge of πb. In fact, their estimator and ours have an interesting connection, as our Lemma 1 can be obtained by taking the functional derivative of their loss function; we refer interested readers to Appendix A.4 for details. That said, there are also important differences between our methods. First, our loss function can be reduced to single-stage optimization when using an RKHS discriminator, just as in Liu et al. (2018). In comparison, the estimator of Nachum et al. (2019a) cannot avoid two-stage optimization. Second, they do not directly estimate wπe/πb(s, a), and instead estimate ν*(s, a) such that ν*(s, a) − γ E_{s′∼P(s,a), a′∼πe(s′)}[ν*(s′, a′)] = wπe/πb(s, a), which is more indirect.

In the special case of γ = 0, i.e., when the problem is a contextual bandit, our method essentially becomes kernel mean matching when using an RKHS discriminator (Gretton et al., 2012), so MWL can be viewed as a natural extension of kernel mean matching in MDPs.

5. Minimax Q-Function Learning (MQL)

In Section 4, we show how to use a value-function class as discriminators to learn the importance weight function. In this section, by swapping the roles of w and f, we derive a new estimator that learns Qπe from data using importance weights as discriminators. The resulting objective function has an intuitive interpretation of average Bellman errors, which has many nice properties and interesting connections to prior works in other areas of RL.

Setup We assume that we have a class of state-action importance weighting functions G ⊂ (X → R) and a class of state-action value functions Q ⊂ (X → R). To avoid confusion we do not reuse the symbols W and F from Section 4, but when we apply both estimators on the same dataset (and possibly combine them via doubly robust estimation), it can be reasonable to choose Q = F and G = W. For now it will be instructive to assume that Qπe is captured by Q (we will relax this assumption later), and the goal is to find q ∈ Q such that

Rq[q] := (1 − γ) E_{s∼d0}[q(s, πe)]   (12)

(i.e., the estimate of Rπe as if q were Qπe) is an accurate estimate of Rπe.5

Loss Function The loss function of MQL is

Lq(q, g) := E_{dπb}[ g(s, a) (r + γ q(s′, πe) − q(s, a)) ].

As we alluded to earlier, if g is the importance weight that converts the data distribution (over (s, a)) dπb to some other distribution µ, then the loss becomes Eµ[r + γ q(s′, πe) − q(s, a)], which is essentially the average Bellman error defined by Jiang et al. (2017). An important property of this quantity is that, if µ = dπe,γ, then by (a variant of) Lemma 1 of Jiang et al. (2017), we immediately have

Rπe − Rq[q] = E_{dπe,γ}[ r + γ q(s′, πe) − q(s, a) ]  ( = Lq(q, wπe/πb) ).

Similar to the situation of MWL, we can use a rich function class G to model wπe/πb, and find q that minimizes the RHS of the above equation for all g ∈ G, which gives rise to the following estimator:

q̂ = arg min_{q∈Q} max_{g∈G} Lq(q, g)^2.

We call this method MQL (minimax Q-function learning). Similar to Section 4, we use q̂n to denote the estimator based on a finite sample of size n (which replaces Lq(q, g) with its empirical approximation Lq,n(q, g)), and develop formal results that parallel those in Section 4 for MWL. All proofs and additional results can be found in Appendix B.
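For illustration, the following sketch mirrors the MWL sketch in Section 4: it computes the empirical loss Lq,n(q, g) and solves the minimax problem by brute force over finite candidate lists (hypothetical stand-ins for Q and G), then returns Rq[q̂n] as in Eq.(12).

```python
# Illustrative-only implementation of the empirical MQL loss and a brute-force minimax.
# data = (s, a, r, s_next) as numpy arrays; init_states are draws from d0; pi_e(s) returns action probs.
import numpy as np

def q_pi(q, states, pi_e):
    """E_{a ~ pi_e(s)}[q(s, a)] for each state in `states`."""
    return np.array([sum(p * q(s, a) for a, p in enumerate(pi_e(s))) for s in states])

def L_q_n(q, g, data, pi_e, gamma):
    """Empirical version of L_q(q, g) = E[g(s,a) * (r + gamma*q(s',pi_e) - q(s,a))]."""
    s, a, r, s_next = data
    g_sa = np.array([g(si, ai) for si, ai in zip(s, a)])
    residual = r + gamma * q_pi(q, s_next, pi_e) - np.array([q(si, ai) for si, ai in zip(s, a)])
    return np.mean(g_sa * residual)

def mql_estimate(Q_class, G_class, data, init_states, pi_e, gamma):
    """q_n = argmin_q max_g L_{q,n}(q,g)^2, then R_q[q_n] = (1 - gamma) E_{d0}[q_n(s, pi_e)], Eq.(12)."""
    q_n = min(Q_class, key=lambda q: max(L_q_n(q, g, data, pi_e, gamma)**2 for g in G_class))
    return float((1 - gamma) * np.mean(q_pi(q_n, init_states, pi_e)))
```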

Lemma 4. Lq(Qπe, g) = 0 ∀g ∈ L2(X, ν). Moreover, if we further assume that dπb(s, a) > 0 ∀(s, a), then Qπe is the only function that satisfies such a property.

Similar to the case of MWL, we show that under certain representation conditions, the estimator will provide an accurate estimate of Rπe.

5 Note that Rq[·] only requires knowledge of d0 and can be computed directly. This is different from the situation in MWL, where Rw[·] still requires knowledge of dπb even if the importance weights are known, and the actual estimator needs to use the empirical approximation Rw,n[·].


Theorem 5. The following holds if wπe/πb ∈ conv(G):

|Rπe − Rq[q]| ≤ max_{g∈G} |Lq(q, g)|,
|Rπe − Rq[q̂]| ≤ min_{q∈Q} max_{g∈G} |Lq(q, g)|.

5.1. Case Studies

We proceed to give several special cases of this estimator corresponding to different choices of G to illustrate its properties. In the first example, we show the analog of Example 1 for MWL, which demonstrates that requiring min_q max_g Lq(q, g) = 0 is weaker than realizability Qπe ∈ Q:

Example 3 (Realizability of Q can be relaxed). When G = {wπe/πb}, as long as q0 ∈ Q, where q0 is a constant function that always evaluates to Rπe/(1 − γ), we have Rq[q̂] = Rπe. See Appendix B.1 for details.

Next, we show a simple and intuitive example where wπe/πb ∉ G but wπe/πb ∈ conv(G), i.e., there are cases where relaxing G to its convex hull yields stronger representation power and the corresponding theoretical results provide a better description of the algorithm's behavior.

Example 4. Suppose X is finite. Let Q be the tabular function class, and let G be the set of state-action indicator functions.6 Then wπe/πb ∉ G but wπe/πb ∈ conv(G), and Rq[q̂] = Rπe. Furthermore, the sample-based estimator q̂n coincides with the model-based solution, as Lq,n(q, g) = 0 for each g is essentially the Bellman equation on the corresponding state-action pair in the estimated MDP model. (In fact, the solution remains the same if we replace G with the tabular function class.)

In the next example, we choose G to be a rich L2-class with bounded norm, and recover the usual (squared) Bellman error as a special case. A similar example has been given by Feng et al. (2019).

Example 5 (L2-class). When G = {g : E_{dπb}[g^2] ≤ 1},

max_{g∈G} Lq(q, g)^2 = E_{dπb}[ ((Bπe q)(s, a) − q(s, a))^2 ],

where Bπ is the Bellman update operator: (Bπ q)(s, a) := E_{r∼R(s,a), s′∼P(s,a)}[r + γ q(s′, π)].

Note that the standard Bellman error cannot be directly estimated from data when the state space is large, even if the Q class is realizable (Szepesvari and Munos, 2005; Sutton and Barto, 2018; Chen and Jiang, 2019). From our perspective, this difficulty can be explained by the fact that the squared Bellman error corresponds to an overly rich discriminator class that demands an unaffordable sample complexity.

6 Strictly speaking, we need to multiply these indicator functions by Cw to guarantee wπe/πb ∈ conv(G); see the comment on linear span after Theorem 2.

The next example is the RKHS class, which yields a closed-form solution to the inner maximization as usual.

Example 6 (RKHS class). When G = {g(s, a) : 〈g, g〉HK ≤ 1}, we have the following:

Lemma 6. Let G = {g(s, a) : 〈g, g〉HK ≤ 1}. Then

max_{g∈G} Lq(q, g)^2 = E_{dπb}[ ∆^q(q; s, a, r, s′) ∆^q(q; s̃, ã, r̃, s̃′) K((s, a), (s̃, ã)) ],

where (s, a, r, s′) and (s̃, ã, r̃, s̃′) are independent draws from dπb and ∆^q(q; s, a, r, s′) := r + γ q(s′, πe) − q(s, a).

Finally, the linear case.

Example 7. Let q(s, a; α) = φ(s, a)^T α, where φ(s, a) ∈ R^d is some basis function and α is the parameter vector. If we use the same linear function space as G, i.e., G = {(s, a) 7→ φ(s, a)^T β : ‖β‖2 ≤ 1}, then MQL yields

α̂ = En[ −γ φ(s, a) φ(s′, πe)^T + φ(s, a) φ(s, a)^T ]^{-1} En[r φ(s, a)].

The resulting q(s, a; α̂) as an estimate of Qπe is precisely LSTDQ (Lagoudakis and Parr, 2004). In addition, the final OPE estimator Rq[q̂n] = (1 − γ) E_{s∼d0}[q(s, πe; α̂)] is the same as Rw,n[ŵn] when W and F are the same linear class (Example 2).
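The corresponding numpy sketch of this closed form, under the same hypothetical inputs as the linear MWL sketch in Section 4.1, is given below.

```python
# Illustrative numpy version of the LSTDQ closed form in Example 7.
import numpy as np

def phi_pi(phi, states, pi_e):
    """E_{a ~ pi_e(s)}[phi(s, a)] for each state, stacked into an (n, d) matrix."""
    return np.stack([sum(p * phi(s, a) for a, p in enumerate(pi_e(s))) for s in states])

def linear_mql_ope(phi, data, init_states, pi_e, gamma):
    s, a, r, s_next = data
    Phi = np.stack([phi(si, ai) for si, ai in zip(s, a)])      # phi(s, a)
    Phi_next = phi_pi(phi, s_next, pi_e)                       # phi(s', pi_e)
    A = Phi.T @ (Phi - gamma * Phi_next) / len(r)              # E_n[phi(s,a)(phi(s,a) - gamma*phi(s',pi_e))^T]
    b = Phi.T @ np.asarray(r) / len(r)                         # E_n[r * phi(s,a)]
    alpha = np.linalg.solve(A, b)                              # assumes invertibility
    return float((1 - gamma) * np.mean(phi_pi(phi, init_states, pi_e) @ alpha))  # R_q[q_n]
```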

5.2. Connection to Kernel Loss (Feng et al., 2019)

Feng et al. (2019) have recently proposed a method for value-based RL. By some transformations, we may rewrite their loss over a state-value function v : S → R as

max_{g∈GS} ( E_{πe}[ (r + γ v(s′) − v(s)) g(s) ] )^2,   (13)

where GS is an RKHS over the state space. While their method is very similar to MQL when written as the above expression, they focus on learning a state-value function and need to be on-policy for policy evaluation. In contrast, our goal is OPE (i.e., estimating the expected return instead of the value function), and we learn a Q-function as an intermediate object and hence are able to learn from off-policy data. More importantly, the importance weight interpretation of g has eluded their paper and they interpret this loss purely from a kernel perspective. In contrast, by leveraging the importance weight interpretation, we are able to establish approximation error bounds based on representation assumptions that are fully expressed in quantities directly defined in the MDP. We also note that their loss for policy optimization can be similarly interpreted as minimizing average Bellman errors under a set of distributions.

Furthermore, it is easy to extend their estimator to the OPE task using knowledge of πb, which we call MVL; see Appendix B.2 for details. Again, just as we discussed in Appendix A.5 on MSWL, when we use linear classes for both value functions and importance weights, these two estimators become two variants of off-policy LSTD (Dann et al., 2014; Bertsekas and Yu, 2009) and coincide with MSWL and its variant.


6. Doubly Robust Extension and Sample Complexity of MWL & MQL

In the previous sections we have seen two different ways of using a value-function class and an importance-weight class for OPE. Which one should we choose?

In this section we show that there is no need to make a choice. In fact, we can combine the two estimates naturally through the doubly robust trick (Kallus and Uehara, 2019b) (see also Tang et al., 2020), whose population version is:

R[w, q] := (1 − γ) E_{d0}[q(s, πe)] + E_{dπb}[ w(s, a) (r + γ q(s′, πe) − q(s, a)) ].   (14)

As before, we write Rn[w, q] for the empirical analogue of R[w, q]. While w and q are supposed to be the MWL and MQL estimators in practice, in this section we will sometimes treat w and q as arbitrary functions from the W and Q classes to keep our results general. By combining the two estimators, we obtain the usual doubly robust property: when either w = wπe/πb or q = Qπe, we have R[w, q] = Rπe; that is, as long as either one of the models works well, the final estimator behaves well.7
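A sketch of the empirical combination Rn[w, q] is given below; w and q would typically be the MWL and MQL outputs (or any candidate functions), and the data arrays, initial-state draws, and policy interface are hypothetical placeholders.

```python
# Illustrative empirical version of Eq.(14).
import numpy as np

def q_pi(q, states, pi_e):
    return np.array([sum(p * q(s, a) for a, p in enumerate(pi_e(s))) for s in states])

def dr_estimate(w, q, data, init_states, pi_e, gamma):
    s, a, r, s_next = data
    direct = (1 - gamma) * np.mean(q_pi(q, init_states, pi_e))        # (1 - gamma) E_{d0}[q(s, pi_e)]
    w_sa = np.array([w(si, ai) for si, ai in zip(s, a)])
    q_sa = np.array([q(si, ai) for si, ai in zip(s, a)])
    correction = np.mean(w_sa * (r + gamma * q_pi(q, s_next, pi_e) - q_sa))
    return float(direct + correction)                                 # R_n[w, q]
```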

Besides being useful as an estimator, Eq.(14) also provides a unified framework to analyze the previous estimators, which are all its special cases: note that R[w, 0] = Rw[w] and R[0, q] = Rq[q], where 0 denotes the constant function that always evaluates to 0. Below we first prove a set of results that unify and generalize the results in Sections 4 and 5, and then state the sample complexity guarantees for the proposed estimators.

Lemma 7. R[w, q] − Rπe = E_{dπb}[ {w(s, a) − wπe/πb(s, a)} {γ q(s′, πe) − γ V πe(s′) + Qπe(s, a) − q(s, a)} ].

Theorem 8. Fixing any q′ ∈ Q, if [Qπe − q′] ∈ conv(F), then

|R[w, q′] − Rπe| ≤ max_{f∈F} |Lw(w, f)|,
|R[ŵ, q′] − Rπe| ≤ min_{w∈W} max_{f∈F} |Lw(w, f)|.

Similarly, fixing any w′ ∈ W, if [wπe/πb − w′] ∈ conv(G), then

|R[w′, q] − Rπe| ≤ max_{g∈G} |Lq(q, g)|,
|R[w′, q̂] − Rπe| ≤ min_{q∈Q} max_{g∈G} |Lq(q, g)|.

Remark 1. When q′ = 0, the first statement reduces to Theorem 2. When w′ = 0, the second statement reduces to Theorem 5.

Theorem 9 (Doubly robust inequality for discriminators (i.i.d. case)). Recall that

ŵn = arg min_{w∈W} max_{f∈F} Lw,n(w, f)^2,   q̂n = arg min_{q∈Q} max_{g∈G} Lq,n(q, g)^2,

where Lw,n and Lq,n are the empirical losses based on a set of n i.i.d. samples. We have the following two statements.

(1) Assume [Qπe − q′] ∈ conv(F) for some q′, and ‖f‖∞ < Cf ∀f ∈ F. Then, with probability at least 1 − δ,

|R[ŵn, q′] − Rπe| ≲ min_{w∈W} max_{f∈F} |Lw(w, f)| + Rn(W, F) + Cf Cw √(log(1/δ)/n),

where Rn(W, F) is the Rademacher complexity8 of the function class {(s, a, s′) 7→ w(s, a)(γ f(s′, πe) − f(s, a)) : w ∈ W, f ∈ F}.

(2) Assume [wπe/πb − w′] ∈ conv(G) for some w′, and ‖g‖∞ < Cg ∀g ∈ G. Then, with probability at least 1 − δ,

|R[w′, q̂n] − Rπe| ≲ min_{q∈Q} max_{g∈G} |Lq(q, g)| + Rn(Q, G) + (Cg Rmax / (1 − γ)) √(log(1/δ)/n),

where Rn(Q, G) is the Rademacher complexity of the function class {(s, a, r, s′) 7→ g(s, a)(r + γ q(s′, πe) − q(s, a)) : q ∈ Q, g ∈ G}.

Here A ≲ B means inequality without an (absolute) constant. Note that we can immediately extract the sample complexity guarantees for the MWL and the MQL estimators as corollaries of this general guarantee by letting q′ = 0 and w′ = 0.9 In Appendix C.1 we also extend the analysis to the non-i.i.d. case and show that similar results can be established for β-mixing data.

8 See Bartlett and Mendelson (2003) for the definition.
9 Strictly speaking, when q′ = 0, R[ŵn, q′] = Rw[ŵn] is very close to but slightly different from the sample-based MWL estimator Rw,n[ŵn], but their difference can be bounded by a uniform deviation bound over the W class in a straightforward manner. The MQL analysis does not have this issue as Rq[·] does not require empirical approximation.

7. Statistical Efficiency in the Tabular Setting

As we have discussed earlier, both MWL and MQL are equivalent to LSTDQ when we use the same linear class for all function approximators. Here we show that in the tabular setting, which is a special case of the linear setting, MWL and MQL can achieve the semiparametric lower bound of OPE (Kallus and Uehara, 2019a), because they coincide with the model-based solution. This is a desired property that many OPE estimators fail to obtain, including MSWL and MVL.

Theorem 10. Assume the whole data set {(s, a, r, s′)} is geometrically ergodic.10 Then, in the tabular setting, √n (Rw,n[ŵn] − Rπe) and √n (Rq[q̂n] − Rπe) weakly converge to the normal distribution with mean 0 and variance

E_{dπb}[ wπe/πb(s, a)^2 (r + γ V πe(s′) − Qπe(s, a))^2 ].

This variance matches the semiparametric lower bound for OPE given by Kallus and Uehara (2019a, Theorem 5).

The details of this theorem and further discussions can be found in Appendix D, where we also show that MSWL and MVL have an asymptotic variance greater than this lower bound. To back up this theoretical finding, we also conduct experiments in the Taxi environment (Dietterich, 2000) following Liu et al. (2018, Section 5), and show that MWL performs significantly better than MSWL in the tabular setting; see Appendix D.3 for details.
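For illustration, a plug-in estimate of this asymptotic variance can be computed from tabular quantities as sketched below; the arrays w_hat, Q_hat, and V_hat are hypothetical tabular estimates (e.g., computed from the empirical MDP model), and s, a are assumed to be integer indices.

```python
# Illustrative plug-in estimate of the variance in Theorem 10.
import numpy as np

def tabular_asymptotic_variance(data, w_hat, Q_hat, V_hat, gamma):
    s, a, r, s_next = data                                   # integer-indexed numpy arrays
    td = r + gamma * V_hat[s_next] - Q_hat[s, a]             # r + gamma*V^{pi_e}(s') - Q^{pi_e}(s, a)
    return float(np.mean(w_hat[s, a] ** 2 * td ** 2))        # E_n[w_{pi_e/pi_b}^2 * td^2]

# A CLT-style 95% interval around an OPE point estimate R_hat from n transitions would then be
#   R_hat +/- 1.96 * np.sqrt(tabular_asymptotic_variance(...) / n).
```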

8. Experiments

We empirically demonstrate the effectiveness of our methods and compare them to baseline algorithms in CartPole with function approximation. We compare MWL & MQL to MSWL (Liu et al., 2018, with estimated behavior policy) and DualDICE (Nachum et al., 2019a). We use neural networks with 2 hidden layers as function approximators for the main function classes for all methods, and use an RBF kernel for the discriminator classes (except for DualDICE); due to the space limit we defer the detailed settings to Appendix E. Figure 1 shows the log MSE of relative errors of different methods, where MQL appears to be the best among all methods. Although these methods require different function approximation capabilities and it is difficult to compare them apples-to-apples, the results still show that MWL/MQL can achieve similar performance to related algorithms and sometimes outperform them significantly.

9. Discussions

We conclude the paper with further discussions.

On the dependence of w on F In Example 1, we have shown that with some special choice of the discriminator class, MWL can pick up very simple weighting functions, such as constant functions, that are very different from the "true" wπe/πb and nevertheless produce accurate OPE estimates with very low variances. Therefore, the function w that satisfies Lw(w, f) = 0 ∀f ∈ F may not be unique, and the set of feasible w highly depends on the choice of F. This leads to several further questions, such as how to choose F to allow for these simple solutions, and how regularization can yield simple functions for a better bias-variance trade-off.11 In case it is not clear that the set of feasible w generally depends on F, we provide two additional examples which are also of independent interest themselves. The first example shows that standard step-wise IS can be viewed as a special case of MWL, when we choose a very rich discriminator class of history-dependent functions.

Example 8 (Step-wise IS as a Special Case of MWL). Every episodic MDP can be viewed as an equivalent MDP whose state is the history of the original MDP. The marginal density ratio in this history-based MDP is essentially the cumulative product of importance weights used in step-wise IS, and from Lemma 11 we know that such a function is the unique minimizer of MWL's population loss if F is chosen to be a sufficiently rich class of functions over histories. See Appendix F for more details on this example.

Example 9 (Bisimulation). Let φ be a bisimulation state abstraction (see Li et al. (2006) for the definition). If πe and πb only depend on s through φ(s), and F only contains functions that are piece-wise constant under φ, then w(s, a) = dπe,γ(φ(s), a) / dπb(φ(s), a) also satisfies Lw(w, f) = 0 ∀f ∈ F.

Figure 1. Accuracy of OPE methods as a function of sample size. Error bars show 95% confidence intervals.

10 Regarding the definition, refer to Meyn and Tweedie (2009).

Duality between MWL and MQL From Sections 4 and 5, one can observe an obvious symmetry between MWL and MQL, from the estimation procedures to the guarantees, which reminds us a lot of the duality between value functions and distributions in linear programming for MDPs. Formalizing this intuition is an interesting direction.12

11 Such a trade-off is relatively well understood in contextual bandits (Kallus, 2020; Hirshberg and Wager, 2017), though the extension to sequential decision-making is not obvious.
12 See the parallel work by Nachum et al. (2019) for some intriguing discussions on this matter.


References

Antos, A., C. Szepesvari, and R. Munos (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning 71, 89–129.

Bartlett, P. L. and S. Mendelson (2003). Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research 3, 463–482.

Bertsekas, D. P. (2012). Dynamic programming and optimal control (4th ed.). Athena Scientific optimization and computation series. Belmont, Mass: Athena Scientific.

Bertsekas, D. P. and H. Yu (2009). Projected equation methods for approximate solution of large linear systems. Journal of Computational and Applied Mathematics 227, 27–50.

Bickel, P. J., C. A. J. Klaassen, Y. Ritov, and J. A. Wellner (1998). Efficient and Adaptive Estimation for Semiparametric Models. Springer.

Brockman, G., V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.

Chen, J. and N. Jiang (2019). Information-theoretic considerations in batch reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, Volume 97, pp. 1042–1051.

Dann, C., N. Jiang, A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire (2018). On oracle-efficient PAC RL with rich observations. In Advances in Neural Information Processing Systems 31, pp. 1422–1432.

Dann, C., G. Neumann, and J. Peters (2014). Policy evaluation with temporal differences: A survey and comparison. The Journal of Machine Learning Research 15, 809–883.

Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13, 227–303.

Ernst, D., P. Geurts, and L. Wehenkel (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research 6, 503–556.

Feng, Y., L. Li, and Q. Liu (2019). A kernel loss for solving the Bellman equation. In Advances in Neural Information Processing Systems 32, pp. 15430–15441.

Gretton, A., K. M. Borgwardt, M. J. Rasch, B. Scholkopf, and A. Smola (2012). A kernel two-sample test. Journal of Machine Learning Research 13, 723–773.

Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66, 315–331.

Hanna, J., S. Niekum, and P. Stone (2019). Importance sampling policy evaluation with an estimated behavior policy. In Proceedings of the 36th International Conference on Machine Learning, pp. 2605–2613.

Henmi, M. and S. Eguchi (2004). A paradox concerning nuisance parameters and projected estimating functions. Biometrika 91, 929–941.

Hirano, K., G. Imbens, and G. Ridder (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71, 1161–1189.

Hirshberg, D. A. and S. Wager (2017). Augmented minimax linear estimation. arXiv preprint arXiv:1712.00038.

Jiang, N. (2019). On value functions and the agent-environment boundary. arXiv preprint arXiv:1905.13341.

Jiang, N., A. Krishnamurthy, A. Agarwal, J. Langford, and R. E. Schapire (2017). Contextual Decision Processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1704–1713.

Jiang, N. and L. Li (2016). Doubly Robust Off-policy Value Evaluation for Reinforcement Learning. In Proceedings of The 33rd International Conference on Machine Learning, Volume 48, pp. 652–661.

Kallus, N. (2020). Generalized optimal matching methods for causal inference. JMLR (to appear).

Kallus, N. and M. Uehara (2019a). Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. arXiv preprint arXiv:1908.08526.

Kallus, N. and M. Uehara (2019b). Efficiently breaking the curse of horizon: Double reinforcement learning in infinite-horizon processes. arXiv preprint arXiv:1909.05850.

Lagoudakis, M. and R. Parr (2004). Least-squares policy iteration. Journal of Machine Learning Research 4, 1107–1149.

Le, H., C. Voloshin, and Y. Yue (2019). Batch policy learning under constraints. In International Conference on Machine Learning, pp. 3703–3712.

Li, L., R. Munos, and C. Szepesvari (2015). Toward minimax off-policy value estimation. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pp. 608–616.

Li, L., T. J. Walsh, and M. L. Littman (2006). Towards a unified theory of state abstraction for MDPs. In Proceedings of the 9th International Symposium on Artificial Intelligence and Mathematics, pp. 531–539.

Liu, Q., L. Li, Z. Tang, and D. Zhou (2018). Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems, pp. 5361–5371.

Meyn, S. and R. L. Tweedie (2009). Markov Chains and Stochastic Stability (2nd ed.). New York: Cambridge University Press.

Mohri, M. and A. Rostamizadeh (2009). Rademacher complexity bounds for non-i.i.d. processes. In Advances in Neural Information Processing Systems 21, pp. 1097–1104.

Mohri, M., A. Rostamizadeh, and A. Talwalkar (2012). Foundations of machine learning. MIT Press.

Nachum, O., Y. Chow, B. Dai, and L. Li (2019a). DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems 32.

Nachum, O., Y. Chow, B. Dai, and L. Li (2019b). DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems 32, pp. 2315–2325.

Nachum, O., B. Dai, I. Kostrikov, Y. Chow, L. Li, and D. Schuurmans (2019). AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074.

Newey, W. K. and D. L. McFadden (1994). Large sample estimation and hypothesis testing. Handbook of Econometrics IV, 2113–2245.

Precup, D., R. S. Sutton, and S. P. Singh (2000). Eligibility Traces for Off-Policy Policy Evaluation. In Proceedings of the 17th International Conference on Machine Learning, pp. 759–766.

Sriperumbudur, B. K., A. Gretton, K. Fukumizu, B. Scholkopf, and G. R. Lanckriet (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research 11, 1517–1561.

Sutton, R. S. and A. G. Barto (2018). Reinforcement learning: An introduction. MIT Press.

Szepesvari, C. and R. Munos (2005). Finite time bounds for sampling based fitted value iteration. In Proceedings of the 22nd International Conference on Machine Learning, pp. 880–887.

Tang, Z., Y. Feng, L. Li, D. Zhou, and Q. Liu (2020). Harnessing infinite-horizon off-policy evaluation: Double robustness via duality. ICLR 2020 (to appear).

Tsitsiklis, J. N. and B. Van Roy (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42.

van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge, UK: Cambridge University Press.

Xie, T., Y. Ma, and Y.-X. Wang (2019). Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. In Advances in Neural Information Processing Systems 32, pp. 9665–9675.


Table 2. Table of Notations

Notation | Meaning
πe, πb | Evaluation policy, behavior policy
{(si, ai, ri, s′i)}_{i=1}^n | Finite sample of data
(S, A, P, R, γ, d0), X | MDP, X = S × A
dπb | Data distribution over (s, a, r, s′) or its marginals
dπe,γ | Normalized discounted occupancy induced by πe
(s, a) ∼ d0 × πe | s ∼ d0, a ∼ πe(s)
βπe/πb(a, s) | Importance weight on action: πe(a|s)/πb(a|s)
wπe/πb(s, a) | dπe,γ(s, a)/dπb(s, a)
Cw | Bound on ‖wπe/πb‖∞
V πe | Value function
Qπe | Q-value function
Rπe | Expected discounted return of πe
Rw[·] | OPE estimator using (·) as the weighting function (population version)
Rw,n[·] | OPE estimator using (·) as the weighting function (sample-based version)
Rq[·] | OPE estimator using (·) as the approximate Q-function
En | Empirical approximation
W, F | Function classes for MWL
Q, G | Function classes for MQL
〈·, ·〉HK | Inner product of the RKHS with kernel K
conv(·) | Convex hull
ν | Uniform measure over the compact space X
L2(X, ν) | L2-space on X with respect to measure ν
Rn(·) | Rademacher complexity
≲ | Inequality without constant

A. Proofs and Additional Results of Section 4 (MWL)

We first give the formal version of Lemma 1, which is Lemmas 11 and 12 below, and then provide the proof.

Lemma 11. For any function g(s, a), define the map g 7→ δ(g, s′, a′) by

δ(g, s′, a′) := γ ∫ P(s′|s, a) πe(a′|s′) g(s, a) dν(s, a) − g(s′, a′) + (1 − γ) d0(s′) πe(a′|s′).

Then δ(dπe,γ, s′, a′) = 0 ∀(s′, a′).

Proof of Lemma 11. We have

dπe,γ(s′, a′) = (1 − γ) Σ_{t=0}^∞ γ^t dπe,t(s′, a′)
= (1 − γ) { d0(s′) πe(a′|s′) + Σ_{t=1}^∞ γ^t dπe,t(s′, a′) }
= (1 − γ) { d0(s′) πe(a′|s′) + Σ_{t=0}^∞ γ^{t+1} dπe,t+1(s′, a′) }
= (1 − γ) { d0(s′) πe(a′|s′) + γ Σ_{t=0}^∞ ∫ P(s′|s, a) πe(a′|s′) γ^t dπe,t(s, a) dν(s, a) }
= (1 − γ) d0(s′) πe(a′|s′) + γ ∫ P(s′|s, a) πe(a′|s′) dπe,γ(s, a) dν(s, a).

This concludes δ(dπe,γ, s′, a′) = 0 ∀(s′, a′).


Lemma 12. Lw(wπe/πb, f) = 0 ∀f ∈ L2(X, ν) := {f : ∫ f(s, a)^2 dν < ∞}. Moreover, if we further assume that (a) dπb(s, a) > 0 ∀(s, a), and (b) g(s, a) = dπe,γ(s, a) if and only if δ(g, s′, a′) = 0 ∀(s′, a′), then wπe/πb is the only function that satisfies such a property.

Proof of Lemma 12. Here, we denote βπe/πb(s, a) = πe(a|s)/πb(a|s). Then, we have

Lw(w, f)
= E_{dπb}[ γ w(s, a) f(s′, πe) − w(s, a) f(s, a) ] + (1 − γ) E_{d0×πe}[f(s, a)]
= E_{dπb}[ γ w(s, a) βπe/πb(a′, s′) f(s′, a′) ] − E_{dπb}[ w(s, a) f(s, a) ] + (1 − γ) E_{d0×πe}[f(s, a)]
= γ ∫ f(s′, a′) P(s′|s, a) πe(a′|s′) dπb(s, a) w(s, a) dν(s, a, s′, a′)
  − ∫ w(s′, a′) f(s′, a′) dπb(s′, a′) dν(s′, a′) + ∫ (1 − γ) d0(s′) πe(a′|s′) f(s′, a′) dν(a′, s′)
= ∫ δ(g, s′, a′) f(s′, a′) dν(s′, a′),

where g(s, a) = dπb(s, a) w(s, a). Note that E_{dπb}[·] denotes the expectation with respect to dπb(s, a) P(s′|s, a) πb(a′|s′).

First statement. We prove that Lw(wπe/πb, f) = 0 ∀f ∈ L2(X, ν). This follows because

Lw(wπe/πb, f) = ∫ δ(dπe,γ, s′, a′) f(s′, a′) dν(s′, a′) = 0.

Here, we have used Lemma 11: δ(dπe,γ, s′, a′) = 0 ∀(s′, a′).

Second statement. We prove the uniqueness part. Assume Lw(w, f) = 0 ∀f ∈ L2(X, ν) holds. Noting that

Lw(w, f) = 〈δ(g, s′, a′), f(s′, a′)〉,

where the inner product is that of the Hilbert space L2(X, ν), the Riesz representative of the functional f 7→ Lw(w, f) is δ(g, s′, a′). From the Riesz representation theorem and the assumption Lw(w, f) = 0 ∀f ∈ L2(X, ν), the Riesz representative is uniquely determined as 0, that is, δ(g, s′, a′) = 0.

From assumption (b), this implies g = dπe,γ. From assumption (a) and the definition of g, this implies w = wπe/πb. This concludes the proof.

Proof of Theorem 2. We first prove two lemmas, then prove the statement of the theorem.

Lemma 13.

Lw(w, f) = E_{dπb}[ {wπe/πb(s, a) − w(s, a)} ∏f(s, a) ],

where ∏f(s, a) := f(s, a) − γ E_{s′∼P(s,a), a′∼πe(s′)}[f(s′, a′)].

Proof of Lemma 13.

Lw(w, f) = Lw(w, f) − Lw(wπe/πb, f)
= E_{dπb}[ γ {w(s, a) − wπe/πb(s, a)} βπe/πb(a′, s′) f(s′, a′) ] − E_{dπb}[ {w(s, a) − wπe/πb(s, a)} f(s, a) ]
= E_{dπb}[ {wπe/πb(s, a) − w(s, a)} ∏f(s, a) ].


Lemma 14. Define

fg(s, a) := Eπe[ Σ_{t=0}^∞ γ^t g(st, at) | s0 = s, a0 = a ].

Here, the expectation is taken with respect to the density P(s1|s0, a0) πe(a1|s1) P(s2|s1, a1) · · ·. Then, f = fg is a solution to g = ∏f.

Proof of Lemma 14.

∏fg(s, a) = fg(s, a) − γ E_{s′∼P(s,a), a′∼πe(s′)}[fg(s′, a′)]
= Eπe[ Σ_{t=0}^∞ γ^t g(st, at) | s0 = s, a0 = a ] − Eπe[ Σ_{t=0}^∞ γ^{t+1} g(st+1, at+1) | s0 = s, a0 = a ]
= Eπe[ g(s0, a0) | s0 = s, a0 = a ] = g(s, a).

We go back to the main proof. Here, we have Lw(w, Qπe) = Rπe − Rw[w] since

Lw(w, Qπe) = E_{dπb}[ {wπe/πb(s, a) − w(s, a)} ∏Qπe(s, a) ]
= E_{dπb}[ {wπe/πb(s, a) − w(s, a)} E[r|s, a] ]
= E_{dπb}[ {wπe/πb(s, a) − w(s, a)} r ] = Rπe − Rw[w].

In the first line, we have used Lemma 13. From the first line to the second line, we have used Lemma 14.

Therefore, if Qπe ∈ conv(F), for any w,

|Rπe − Rw[w]| = |Lw(w, Qπe)| ≤ max_{f∈conv(F)} |Lw(w, f)| = max_{f∈F} |Lw(w, f)|.

Here, we have used the fact that max_{f∈conv(F)} |Lw(w, f)| = max_{f∈F} |Lw(w, f)|. This is proved as follows. First, max_{f∈conv(F)} |Lw(w, f)| is equal to |Lw(w, f̄)| for some f̄ = Σ λi fi with Σ λi = 1 and fi ∈ F. Since |Lw(w, f̄)| ≤ Σ λi |Lw(w, fi)| ≤ max_{f∈F} |Lw(w, f)|, we have

max_{f∈conv(F)} |Lw(w, f)| ≤ max_{f∈F} |Lw(w, f)|.

The reverse direction is obvious.

Finally, from the definition of ŵ,

|Rπe − Rw[ŵ]| ≤ min_{w∈W} max_{f∈F} |Lw(w, f)|.

Proof of Lemma 17. We have

Lw(w, f)2

={

Edπb [γw(s, a)Ea′∼πe(s′)[f(s′, a′)]− w(s, a)f(s, a)] + (1− γ)Ed0×πe [f(s, a)]}2

(15)

= {Edπb [γw(s, a)Ea′∼πe(s′)[〈f,K((s′, a′), ·)〉HK ]− w(s, a)〈f,K((s, a), ·)〉HK ] (16)

+ (1− γ)Ed0×πe [〈f,K((s, a), ·)〉HK ]}2

= 〈f, f∗〉2HK , (17)

where

f∗(·) = Edπb [γw(s, a)Ea′∼πe(s′)[K((s′, a′), ·)]− w(s, a)K((s, a), ·)] + (1− γ)Ed0×πe [K((s, a), ·)].


Here, from (15) to (16), we have used the reproducing property of the RKHS, f(s, a) = 〈f(·), K((s, a), ·)〉HK. From (16) to (17), we have used the linearity of the inner product.

Therefore,

max_{f∈F} Lw(w, f)² = max_{f∈F} 〈f, f∗〉²HK = 〈f∗, f∗〉HK

from the Cauchy–Schwarz inequality. This is equal to

max_{f∈F} Lw(w, f)² = Edπb[γ²w(s, a)w(s̃, ã)Ea′∼πe(s′), ã′∼πe(s̃′)[K((s′, a′), (s̃′, ã′))]]

+ Edπb[w(s, a)w(s̃, ã)K((s, a), (s̃, ã))]

+ (1 − γ)²Ed0×πe[K((s, a), (s̃, ã))]

− 2Edπb[γw(s, a)w(s̃, ã)Ea′∼πe(s′)[K((s′, a′), (s̃, ã))]]

+ 2γ(1 − γ)E(s,a)∼dπb, (s̃,ã)∼d0×πe[w(s, a)Ea′∼πe(s′)[K((s′, a′), (s̃, ã))]]

− 2(1 − γ)E(s,a)∼dπb, (s̃,ã)∼d0×πe[w(s, a)K((s, a), (s̃, ã))],

where (s̃, ã, s̃′) denotes an independent copy of (s, a, s′), and the first expectation is taken with respect to the density dπb(s, a, s′)dπb(s̃, ã, s̃′).

For example, the term (1 − γ)²Ed0×πe[K((s, a), (s̃, ã))] is derived by

〈(1 − γ)Ed0×πe[K((s, a), ·)], (1 − γ)Ed0×πe[K((s̃, ã), ·)]〉HK

= (1 − γ)² 〈 ∫ K((s, a), ·)d0(s)πe(a|s) dν(s, a), ∫ K((s̃, ã), ·)d0(s̃)πe(ã|s̃) dν(s̃, ã) 〉HK

= (1 − γ)² ∫ 〈K((s, a), ·), K((s̃, ã), ·)〉HK d0(s)πe(a|s)d0(s̃)πe(ã|s̃) dν(s, a, s̃, ã)

= (1 − γ)² ∫ K((s, a), (s̃, ã)) d0(s)πe(a|s)d0(s̃)πe(ã|s̃) dν(s, a, s̃, ã).

The other terms are derived in a similar manner. Here, we have used the kernel property

〈K((s, a), ·), K((s̃, ã), ·)〉HK = K((s, a), (s̃, ã)).

Next we show the result mentioned in the main text, that the minimizer of max_{f∈F} Lw(w, f) is unique when F corresponds to an ISPD kernel.

Theorem 15. Assume W is realizable, i.e., wπe/πb ∈ W, and conditions (a), (b) in Lemma 12. Then for F = L2(X, ν), w(s, a) = wπe/πb(s, a) is the unique minimizer of max_{f∈F} Lw(w, f). The same result holds when F is an RKHS associated with an integrally strictly positive definite (ISPD) kernel K(·, ·) (Sriperumbudur et al., 2010).

Proof of Theorem 15. The first statement follows directly from Lemma 12 and its proof is omitted; here we prove the second statement on the ISPD kernel case. If we can prove Lemma 12 with L2(X, ν) replaced by the RKHS associated with an ISPD kernel K(·, ·), the statement follows. More specifically, what we have to prove is

Lemma 16. Assume the conditions in Theorem 15. Then, Lw(w, f) = 0 for all f ∈ F holds if and only if w(s, a) = wπe/πb(s, a).

This is proved by Mercer's theorem (Mohri et al., 2012). From Mercer's theorem, there exists an orthonormal basis (φj)_{j=1}^{∞} of L2(X, ν) such that the RKHS is represented as

F = { f = ∑_{j=1}^{∞} bj φj ; (bj)_{j=1}^{∞} ∈ l2(N) with ∑_{j=1}^{∞} bj²/μj < ∞ },

where each μj is a positive value since the kernel is ISPD. Suppose there exists w(s, a) ≠ wπe/πb(s, a) in W satisfying Lw(w, f) = 0 for all f ∈ F. Then, by taking bj = 1 (j = j′) and bj = 0 (j ≠ j′) for any j′ ∈ N, we have Lw(w, φj′) = 0. This implies Lw(w, f) = 0 for all f ∈ L2(X, ν), which contradicts the original Lemma 12. Then, Lemma 16 is concluded.


A.1. Proof for Example 1

Suppose w0(s, a) = C for all (s, a). Then, for f = Qπe we have

Lw(w0, f) = CEdπb[γV πe(s′) − Qπe(s, a)] + (1 − γ)Ed0×πe[Qπe(s, a)]

= Rπe − CRπb.

Hence C = Rπe/Rπb satisfies Lw(w0, f) = 0 for all f ∈ F. From Theorem 2, |Rπe − Rw[w]| ≤ min_w max_f |Lw(w, f)| ≤ max_f |Lw(w0, f)| = 0.

A.2. The RKHS Result

Lemma 17 (Formal version of Lemma 3). Let 〈·, ·〉HK be the inner product of HK, which satisfies the reproducing property f(x) = 〈f, K(·, x)〉HK. When F = {f ∈ (X → R) : 〈f, f〉HK ≤ 1}, the term max_{f∈F} Lw(w, f)² has a closed-form expression:

max_{f∈F} Lw(w, f)²

= Edπb[γ²w(s, a)w(s̃, ã)Ea′∼πe(s′), ã′∼πe(s̃′)[K((s′, a′), (s̃′, ã′))]]

+ Edπb[w(s, a)w(s̃, ã)K((s, a), (s̃, ã))] + (1 − γ)²Ed0×πe[K((s, a), (s̃, ã))]

− 2Edπb[γw(s, a)w(s̃, ã)Ea′∼πe(s′)[K((s′, a′), (s̃, ã))]]

+ 2γ(1 − γ)E(s,a)∼dπb, (s̃,ã)∼d0×πe[w(s, a)Ea′∼πe(s′)[K((s′, a′), (s̃, ã))]]

− 2(1 − γ)E(s,a)∼dπb, (s̃,ã)∼d0×πe[w(s, a)K((s, a), (s̃, ã))].

In the above expression, (s, a) ∼ d0 × πe means s ∼ d0, a ∼ πe(s). All a′ and ã′ terms are marginalized out in the inner expectations, and when they appear together they are always independent. Similarly, in the first three lines, when (s, a, s′) and (s̃, ã, s̃′) appear together in the outer expectation, they are i.i.d. following the distribution specified in the subscript.
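As a concrete illustration of how this closed form can be used, the following is a minimal numpy sketch of an empirical version of the Lemma 17 objective, in which the population expectations are replaced by sample averages over behavior transitions and initial-state draws. The helper names (rbf_kernel, mwl_rkhs_loss) and the toy data are illustrative assumptions rather than the paper's implementation, and the diagonal terms of the resulting V-statistic are kept for simplicity.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||X[i] - Y[j]||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mwl_rkhs_loss(w, sa, sa_next, sa_init, gamma, sigma=1.0):
    """Empirical analogue of max_{||f||_HK <= 1} L_w(w, f)^2 (Lemma 17).

    w        : (n,)   candidate weights w(s_i, a_i) on the behavior data
    sa       : (n, d) concatenated (s_i, a_i) from the behavior data
    sa_next  : (n, d) concatenated (s'_i, a'_i) with a'_i ~ pi_e(s'_i)
    sa_init  : (m, d) concatenated (s, a) with s ~ d0, a ~ pi_e(s)
    """
    n, m = len(sa), len(sa_init)
    # Signed coefficients attached to each kernel feature K(x, .):
    #   +gamma*w_i/n on (s'_i, a'_i),  -w_i/n on (s_i, a_i),  +(1-gamma)/m on initial samples.
    X = np.vstack([sa_next, sa, sa_init])
    c = np.concatenate([gamma * w / n, -w / n, np.full(m, (1.0 - gamma) / m)])
    K = rbf_kernel(X, X, sigma)
    return float(c @ K @ c)  # squared RKHS norm of the empirical Riesz representer

# Toy usage with random placeholder data:
rng = np.random.default_rng(0)
sa, sa_next, sa_init = rng.normal(size=(3, 50, 4))
print(mwl_rkhs_loss(np.ones(50), sa, sa_next, sa_init, gamma=0.99))
```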

A.3. Details of Example 2

Here we provide the detailed derivation of Eq.(11), that is, the closed-form solution of MWL when both W and F are set to the same linear class. We assume that the matrix being inverted in Eq.(11) is non-singular.

We show that Eq.(11) is a solution to the objective function of MWL by showing that Lw,n(w(·; α), f) = 0 for any f in the linear class. This suffices because the loss Lw,n(w, f)² is non-negative, so any w that achieves 0 for all f ∈ F is an argmin of the loss.

Consider w ∈ W whose parameter is α and f ∈ F whose parameter is β. Then

Lw,n(w, f) = En[γα⊤φ(s, a)φ(s′, πe)⊤β − α⊤φ(s, a)φ(s, a)⊤β] + (1 − γ)Ed0[φ(s, πe)⊤β]

= (α⊤En[γφ(s, a)φ(s′, πe)⊤ − φ(s, a)φ(s, a)⊤] + (1 − γ)Ed0[φ(s, πe)⊤])β.

Since Lw,n(w, f) is linear in β, to achieve maxf Lw,n(w, f)2 = 0 it suffices to satisfy

α⊤En[γφ(s, a)φ(s′, πe)⊤ − φ(s, a)φ(s, a)⊤] + (1 − γ)Ed0[φ(s, πe)⊤] = 0.   (18)

Note that this is just a set of linear equations where the number of unknowns is the same as the number of equations, and the α in Eq.(11) is precisely the solution to Eq.(18) when the matrix multiplied by α⊤ is non-singular.
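For concreteness, here is a small numpy sketch of this closed-form solution: it forms the empirical matrix and vector appearing in Eq.(18), solves the linear system for α, and plugs the resulting linear weight into the MWL value estimate Rw,n[w] = En[w(s, a)r]. The function names and the feature inputs (e.g., phi_next_pie for φ(s′, πe)) are illustrative assumptions.

```python
import numpy as np

def mwl_linear_alpha(phi_sa, phi_next_pie, phi_init_pie, gamma):
    """Solve the empirical linear system (18) for alpha (assumes the matrix is non-singular).

    phi_sa       : (n, d) features phi(s_i, a_i)
    phi_next_pie : (n, d) features phi(s'_i, pi_e) = E_{a'~pi_e(s'_i)}[phi(s'_i, a')]
    phi_init_pie : (m, d) features phi(s_j, pi_e) with s_j ~ d0
    """
    n = len(phi_sa)
    # M = E_n[gamma*phi(s,a)phi(s',pi_e)^T - phi(s,a)phi(s,a)^T]
    M = (gamma * phi_sa.T @ phi_next_pie - phi_sa.T @ phi_sa) / n
    b = (1.0 - gamma) * phi_init_pie.mean(axis=0)
    # Eq.(18): alpha^T M + b^T = 0  <=>  M^T alpha = -b
    return np.linalg.solve(M.T, -b)

def mwl_linear_estimate(alpha, phi_sa, rewards):
    """MWL value estimate R_{w,n}[w] = E_n[w(s, a) r] with w(s, a) = alpha^T phi(s, a)."""
    return float(((phi_sa @ alpha) * rewards).mean())
```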

A.4. Connection to Dual DICE (Nachum et al., 2019a)

Nachum et al. (2019a) propose an extension of Liu et al. (2018) that does not require knowledge of the behavior policy, which shares the same goal as our MWL in Section 4. In fact, there is an interesting connection between our work and theirs, as our key Lemma 12 can be obtained by taking the functional gradient of their loss function. (Ideal) DualDICE with the chi-squared divergence f(x) = 0.5x² is described as follows:


• Estimate ν(s, a):

min_ν 0.5Edπb[{ν(s, a) − (Bν)(s, a)}²] − (1 − γ)Ed0×πe[ν(s, a)],   (19)

where (Bν)(s, a) = γEs′∼P(s,a), a′∼πe(s′)[ν(s′, a′)].

• Estimate the ratio as ν(s, a) − (Bν)(s, a).

Because this objective function includes an integral inside (Bν)(s, a), a Monte-Carlo approximation is required. However, even if we take a Monte-Carlo sample for the approximation, the resulting plug-in objective is biased. Therefore, they further modify this objective function into a more complex minimax form; see (11) in Nachum et al. (2019a).

Here, we take a functional (Gateaux) derivative of (19) with respect to ν. The functional derivative at ν(s, a) is

f(s, a) → Edπb[{ν(s, a) − (Bν)(s, a)}{f(s, a) − (Bf)(s, a)}] − (1 − γ)Ed0×πe[f(s, a)].

The first-order condition exactly corresponds to our Lemma 12:

−Lw(w, f) = Edπb[w(s, a){f(s, a) − γEs′∼P(s,a), a′∼πe(s′)[f(s′, a′)]}] − (1 − γ)Ed0×πe[f(s, a)] = 0

⇐⇒ Edπb[w(s, a){f(s, a) − γEa′∼πe(s′)[f(s′, a′)]}] − (1 − γ)Ed0×πe[f(s, a)] = 0,

where w(s, a) = ν(s, a) − (Bν)(s, a). Our proposed method with an RKHS enables us to directly estimate w(s, a) in one step; in contrast, their approach requires additional steps: estimating (Bν)(s, a) inside the loss function, estimating ν(s, a) by minimizing the loss function, and taking the difference ν(s, a) − (Bν)(s, a).

A.5. Connection between MSWL (Liu et al., 2018) and Off-policy LSTD

Just as our methods become LSTDQ when using linear function classes (Examples 2 and 7), here we show that the method of Liu et al. (2018) corresponds to off-policy LSTD. Under our notation, their method is

arg min_{w∈WS} max_{f∈FS} { Edπb[ γw(s)f(s′) πe(a|s)/πb(a|s) − w(s)f(s) ] + (1 − γ)Ed0[f(s)] }².   (20)

We call this method MSWL (minimax state weight learning) for easy reference. A slightly different but closely related estimator is

arg min_{w∈WS} max_{f∈FS} { Edπb[ πe(a|s)/πb(a|s) (γw(s)f(s′) − w(s)f(s)) ] + (1 − γ)Ed0[f(s)] }².   (21)

Although the two objectives are equal in expectation, under empirical approximations the two estimators are different. In fact, Eq.(21) corresponds to the most common form of off-policy LSTD (Bertsekas and Yu, 2009) when both WS and FS are linear (similar to Example 2). In the same linear setting, Eq.(20) corresponds to another type of off-policy LSTD discussed by Dann et al. (2014). In the tabular setting, we show in Appendix D that they cannot achieve the semiparametric lower bound.

B. Proofs and Additional Results of Section 5 (MQL)

Proof for Lemma 4. First statement

We prove Lq(Qπe, g) = 0 for all g ∈ L2(X, ν). The function Qπe satisfies the following Bellman equation:

Er∼R(s,a), s′∼P (s,a)[r + γQπe(s′, πe)−Qπe(s, a)] = 0. (22)

Then, ∀g ∈ L2(X , ν),

0 = ∫ {Er∼R(s,a), s′∼P(s,a)[r + γQπe(s′, πe) − Qπe(s, a)]} g(s, a)dπb(s, a) dν(s, a) = Lq(Qπe, g).


Second statement

We prove the uniqueness part. Recall that Qπe is uniquely characterized as (Bertsekas, 2012): ∀(s, a),

Er∼R(s,a), s′∼P (s,a)[r + γq(s′, πe)− q(s, a)] = 0. (23)

Assume

E(s,a,r,s′)∼dπb[{r + γq(s′, πe) − q(s, a)}g(s, a)] = 0 for all g(s, a) ∈ L2(X, ν).   (24)

Note that the left-hand side can be seen as

Lq(q, g) = 〈{Er∼R(s,a)[r] + Es′∼P(s,a)[γq(s′, πe)] − q(s, a)}dπb(s, a), g(s, a)〉,

where the inner product 〈·, ·〉 is that of the Hilbert space L2(X, ν) and the Riesz representative of the functional g → Lq(q, g) is {Er∼R(s,a)[r] + Es′∼P(s,a)[γq(s′, πe)] − q(s, a)}dπb(s, a). From the Riesz representation theorem and the assumption (24), the representative of the bounded linear functional g → Lq(q, g) is uniquely determined as 0. Since we also assume dπb(s, a) > 0 for all (s, a), we have, for all (s, a),

Er∼R(s,a), s′∼P(s,a)[r + γq(s′, πe) − q(s, a)] = 0.   (25)

From (23), such a q is Qπe.

Proof of Theorem 5. We prove the first statement. For any fixed q, we have

|Rπe − Rq[q]| = |(1 − γ)Ed0×πe[q(s, a)] − Rπe|
= |Edπb[−γwπe/πb(s, a)q(s′, πe) + wπe/πb(s, a)q(s, a)] − Edπb[wπe/πb(s, a)r]|
= |Edπb[wπe/πb(s, a){r + γq(s′, πe) − q(s, a)}]|
≤ max_{g∈conv(G)} |Lq(q, g)| = max_{g∈G} |Lq(q, g)|.

From the first line to the second line, we use Lemma 12 with f(s, a) = q(s, a). The last inequality uses wπe/πb ∈ conv(G).

Then, the second statement follows immediately based on the definition of q.

Proof of Lemma 6. We have

max_{g∈G} Lq(q, g)² = max_{g∈G} Edπb[g(s, a)(r + γq(s′, πe) − q(s, a))]²

= max_{g∈G} Edπb[〈g, K((s, a), ·)〉HK (r + γq(s′, πe) − q(s, a))]²

= max_{g∈G} 〈g, Edπb[K((s, a), ·)(r + γq(s′, πe) − q(s, a))]〉²HK

= max_{g∈G} 〈g, g∗〉²HK = 〈g∗, g∗〉HK,

where

g∗(·) = Edπb[K((s, a), ·)(r + γq(s′, πe) − q(s, a))].

From the first line to the second line, we use the reproducing property of the RKHS, g(s, a) = 〈g(·), K((s, a), ·)〉HK. From the second line to the third line, we use the linearity of the inner product. From the third line to the fourth line, we use the Cauchy–Schwarz inequality since G = {g : 〈g, g〉HK ≤ 1}.

Then, the last expression 〈g∗, g∗〉HK is equal to

〈Edπb[K((s, a), ·)(r + γq(s′, πe) − q(s, a))], Edπb[K((s̃, ã), ·)(r̃ + γq(s̃′, πe) − q(s̃, ã))]〉HK

= 〈Edπb[K((s, a), ·)(Er∼R(s,a)[r] + Es′∼P(s,a)[γq(s′, πe)] − q(s, a))], Edπb[K((s̃, ã), ·)(Er̃∼R(s̃,ã)[r̃] + Es̃′∼P(s̃,ã)[γq(s̃′, πe)] − q(s̃, ã))]〉HK

= Edπb[∆q(q; s, a, r, s′)∆q(q; s̃, ã, r̃, s̃′)K((s, a), (s̃, ã))],


where ∆q(q; s, a, r, s′) = r + γq(s′, πe) − q(s, a), (s̃, ã, r̃, s̃′) denotes an independent copy of (s, a, r, s′), and the expectation is taken with respect to the density dπb(s, a, r, s′)dπb(s̃, ã, r̃, s̃′). Here, we have used the kernel property

〈K((s, a), ·), K((s̃, ã), ·)〉HK = K((s, a), (s̃, ã)).
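A minimal numpy sketch of the resulting empirical MQL objective follows; it forms the Bellman residuals ∆q on the data and evaluates the double sample average of ∆q ∆̃q K, with the diagonal terms kept for simplicity. The argument names (q_next_pie for q(s′, πe), etc.) and the RBF choice of K are illustrative assumptions.

```python
import numpy as np

def mql_rkhs_loss(q_sa, q_next_pie, rewards, sa, gamma, sigma=1.0):
    """Empirical analogue of max_{||g||_HK <= 1} L_q(q, g)^2 (Lemma 6).

    q_sa       : (n,)   q(s_i, a_i)
    q_next_pie : (n,)   q(s'_i, pi_e) = E_{a'~pi_e(s'_i)}[q(s'_i, a')]
    rewards    : (n,)   observed rewards r_i
    sa         : (n, d) concatenated (s_i, a_i) vectors
    """
    delta = rewards + gamma * q_next_pie - q_sa          # Bellman residuals Delta_q
    d2 = ((sa[:, None, :] - sa[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma ** 2))                   # RBF Gram matrix on (s, a)
    n = len(delta)
    return float(delta @ K @ delta) / n ** 2             # double sample average of Delta*Delta~*K
```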

Theorem 18. Assume Qπe is included in Q and dπb(s, a) > 0 for all (s, a). Then, if G is L2(X, ν), q = Qπe. Also, if G is an RKHS associated with an ISPD kernel, q = Qπe.

Proof of Theorem 18. The first statement is obvious from Lemma 4. The second statement is proved in the same way as Theorem 15.

B.1. Proof for Example 3

Suppose q0(s, a) = C. Then, for g = wπe/πb, we have

Lq(q0, g) = Edπb[wπe/πb(s, a)(r + γC − C)]

= Edπb[wπe/πb(s, a)r] − (1 − γ)C = Rπe − (1 − γ)C.

Therefore C = Rπe/(1 − γ) satisfies Lq(q0, g) = 0 for all g ∈ G. From Theorem 5 we further have Rq[q] = Rπe.

B.2. Minimax Value Learning (MVL)

We can extend the loss function in Eq.(13) to the off-policy setting as follows: for a state-value function v : S → R, define the following two losses:

max_{g∈GS} Eπb[ πe(a|s)/πb(a|s) {r + γv(s′) − v(s)} g(s) ]²,   (26)

max_{g∈GS} Eπb[ ( πe(a|s)/πb(a|s) {r + γv(s′)} − v(s) ) g(s) ]².   (27)

Again, similar to the situation of MSWL discussed in Appendix A.5, these two losses are equal in expectation but exhibit different finite-sample behaviors. When we use linear classes for both the value functions and the importance weights, these two estimators become two variants of off-policy LSTD (Dann et al., 2014; Bertsekas and Yu, 2009) and coincide with MSWL and its variant.

C. Proofs and Additional Results of Section 6 (DR and Sample Complexity)

Proof of Lemma 7. We have

R[w, q] − Rπe = R[w, q] − R[wπe/πb(s, a), Qπe(s, a)]   (28)

= Edπb[{w(s, a) − wπe/πb(s, a)}{r − Qπe(s, a) + γV πe(s′)}]
+ Edπb[wπe/πb(s, a){Qπe(s, a) − q(s, a) + γq(s′, πe) − γV πe(s′)}] + (1 − γ)Ed0[q(s, πe) − V πe(s)]
+ Edπb[{w(s, a) − wπe/πb(s, a)}{Qπe(s, a) − q(s, a) + γq(s′, πe) − γV πe(s′)}]   (29)

= Edπb[{w(s, a) − wπe/πb(s, a)}{Qπe(s, a) − q(s, a) + γq(s′, πe) − γV πe(s′)}].   (30)

From (28) to (29), this is just by algebra following the definition of R[·, ·]. From (29) to (30), we use the following lemma.

Lemma 19.

0 = Edπb[wπe/πb(s, a){Qπe(s, a) − q(s, a) + γq(s′, πe) − γV πe(s′)}] + (1 − γ)Ed0[q(s, πe) − V πe(s)],

0 = Edπb[{w(s, a) − wπe/πb(s, a)}{r − Qπe(s, a) + γV πe(s′)}].

Proof. The first equation comes from Lemma 12 with f(s, a) = q(s, a) − Qπe(s, a). The second equation comes from Lemma 4 with g(s, a) = w(s, a) − wπe/πb(s, a).


This concludes the proof of Lemma 7.

Proof of Theorem 8. We begin with the second statement, which is easier to prove from Lemma 7:

R[w′, q] − Rπe = Edπb[{w′(s, a) − wπe/πb(s, a)}{Qπe(s, a) − q(s, a) + γq(s′, πe) − γV πe(s′)}]
= Lq(q, w′ − wπe/πb) − Lq(Qπe, w′ − wπe/πb)
= −Lq(q, wπe/πb − w′) − 0.   (Lemma 4)

Thus, if (wπe/πb − w′) ∈ conv(G),

|R[w′, q] − Rπe| ≤ max_{g∈conv(G)} |Lq(q, g)| = max_{g∈G} |Lq(q, g)|.

Next, we prove the first statement. From Lemma 7,

R[w, q′] − Rπe = Edπb[{w(s, a) − wπe/πb(s, a)}{Qπe(s, a) − q′(s, a) + γq′(s′, πe) − γV πe(s′)}]
= Lw(w, q′ − Qπe) − Lw(wπe/πb, q′ − Qπe)
= −Lw(w, Qπe − q′) − 0.   (Lemma 12)

Then, if (Qπe − q′) ∈ conv(F),

|R[w, q′] − Rπe| ≤ max_{f∈conv(F)} |Lw(w, f)| = max_{f∈F} |Lw(w, f)|.

Finally, from the definition of w and q, we also have

|R[w, q′] − Rπe| ≤ min_{w∈W} max_{f∈F} |Lw(w, f)|,   |R[w′, q] − Rπe| ≤ min_{q∈Q} max_{g∈G} |Lq(q, g)|.

Proof of Theorem 9. We prove the first statement. The second statement is proved in the same way. We have

|R[wn, q′] − Rπe|

≤ max_{f∈F} |Lw(wn, f)|

= {max_{f∈F} |Lw(wn, f)| − max_{f∈F} |Lw,n(wn, f)|} + {max_{f∈F} |Lw,n(wn, f)| − max_{f∈F} |Lw(w, f)|} + max_{f∈F} |Lw(w, f)|

≤ {max_{f∈F} |Lw(wn, f)| − max_{f∈F} |Lw,n(wn, f)|} + {max_{f∈F} |Lw,n(w, f)| − max_{f∈F} |Lw(w, f)|} + max_{f∈F} |Lw(w, f)|

≤ 2 max_{f∈F, w∈W} | |Lw,n(w, f)| − |Lw(w, f)| | + min_{w∈W} max_{f∈F} |Lw(w, f)|.   (31)

Here the chain holds for any w ∈ W: the second inequality uses max_{f∈F} |Lw,n(wn, f)| ≤ max_{f∈F} |Lw,n(w, f)|, which follows from the definition of wn, and the last step takes the minimum over w ∈ W.

The remaining task is to bound the term max_{f∈F, w∈W} | |Lw,n(w, f)| − |Lw(w, f)| |. This is bounded as follows:

max_{f∈F, w∈W} | |Lw,n(w, f)| − |Lw(w, f)| | ≲ R′n(F, W) + CfCw√(log(1/δ)/n),   (32)

where R′n(F, W) is the Rademacher complexity of the function class

{(s, a, s′) 7→ |w(s, a)(γf(s′, πe) − f(s, a))| : w ∈ W, f ∈ F}.

Here, we just used a uniform law of large numbers based on the Rademacher complexity, noting that |w(s, a)(γf(s′, πe) − f(s, a))| is uniformly bounded by CfCw up to some constant (Bartlett and Mendelson, 2003, Theorem 8). From the contraction property of the Rademacher complexity (Bartlett and Mendelson, 2003, Theorem 12),

R′n(F ,W) ≤ 2Rn(F ,W)

where Rn(F, W) is the Rademacher complexity of the function class

{(s, a, s′) 7→ w(s, a)(γf(s′, πe) − f(s, a)) : w ∈ W, f ∈ F}.   (33)

Finally, combining (31), (32) and (33), the proof is concluded.


C.1. Relaxing the i.i.d. data assumption

Although the sample complexity results in Section 6 are established under i.i.d. data, we show that under standard assumptions we can also handle dependent data and obtain almost the same results. For simplicity, we only include the result for wn.

In particular, we consider the setting mentioned in Section 2, that our data is a single long trajectory generated by policy πb:

s0, a0, r0, s1, a1, r1, . . . , sT−1, aT−1, rT−1, sT .

We assume that the Markov chain induced by πb is ergodic, and s0 is sampled from its stationary distribution so that the chain is stationary. In this case, dπb corresponds to this stationary distribution, which is also the marginal distribution of any st. We convert this trajectory into a set of transition tuples {(si, ai, ri, s′i)}_{i=0}^{n−1} with n = T and s′i = si+1, and then apply our estimator on this data. Under the standard β-mixing condition13 (see e.g., Antos et al., 2008), we can prove a similar sample complexity result:

Corollary 20. Assume {si, ai, ri, s′i}_{i=1}^{n} follows a stationary β-mixing distribution with β-mixing coefficients β(k) for k = 0, 1, · · · . For any a1, a2 > 0 with 2a1a2 = n and δ > 4(a1 − 1)β(a2), with probability at least 1 − δ, we have (all other assumptions are the same as in Theorem 9(1))

|R[wn, q] − Rπe| ≲ min_{w∈W} max_{f∈F} |Lw(w, f)| + Ra1(F, W) + CfCw√(log(1/δ′)/a1),

where Ra1(F, W) is the empirical Rademacher complexity of the function class {(s, a, s′) 7→ w(s, a)(γf(s′, πe) − f(s, a)) : w ∈ W, f ∈ F} based on a selected subsample of size a1 from the original data (see Mohri and Rostamizadeh (2009, Section 3.1) for details), and δ′ = δ − 4(a1 − 1)β(a2).

Proof Sketch of Corollary 20. We can prove this in the same way as Theorem 9. The only difference is that we use Theorem 2 of Mohri and Rostamizadeh (2009) to bound the term sup_{f∈F, w∈W} | |Lw,n(w, f)| − |Lw(w, f)| |.

D. Statistical Efficiency in the Tabular Setting

D.1. Statistical efficiency of MWL and MQL

As we have already seen in Example 7, when W, F, Q, G are the same linear class, MWL, MQL, and LSTDQ give the same OPE estimator. These methods are also equivalent in the tabular setting (as tabular is a special case of linear representation with indicator features), which also coincides with the model-based (or certainty-equivalent) solution. Below we prove that this tabular estimator can achieve the semiparametric lower bound for infinite-horizon OPE (Kallus and Uehara, 2019a).14 Although there are many estimators for OPE, most of the existing methods do not satisfy this property.

Here, we have the following theorem; the proof is deferred to Appendix D.4.

Theorem 21 (Restatement of Theorem 10). Assume the whole data set {(s, a, r, s′)} is geometrically ergodic.15 Then, in the tabular setting, √n(Rw,n[wn] − Rπe) and √n(Rq[qn] − Rπe) weakly converge to the normal distribution with mean 0 and variance

Edπb[w²πe/πb(s, a)(r + γV πe(s′) − Qπe(s, a))²].

This variance matches the semiparametric lower bound for OPE given by Kallus and Uehara (2019a, Theorem 5).

Two remarks are in order:

1. Theorem 21 could also be extended to the continuous sample space case in a nonparametric manner, i.e., replacing φ(s, a) with basis functions for an L2-space whose dimension grows at some rate related to n, and assuming that the data-generating process satisfies some smoothness conditions (Newey and McFadden, 1994). The proof is not obvious and we leave it to future work.

13 Refer to Meyn and Tweedie (2009) regarding the definition.
14 The semiparametric lower bound is the nonparametric extension of the Cramer–Rao lower bound (Bickel et al., 1998). It is the lower bound of the asymptotic MSE among regular estimators (van der Vaart, 1998).
15 Regarding the definition, refer to Meyn and Tweedie (2009).


2. In the contextual bandit setting, it is widely known that the importance sampling estimator with a plug-in weight from the empirical distribution and the model-based approach can achieve the semiparametric lower bound (Hahn, 1998; Hirano et al., 2003). Our findings are consistent with this fact and are, to the best of our knowledge, novel in the MDP setting.

Table 3. Summary of the connections between several OPE methods and LSTD, and their optimality in the tabular setting.

                         MWL     MQL     MSWL              MVL
Definition               Sec 4   Sec 5   Appendix A.5      Appendix B.2
Linear case              LSTDQ   LSTDQ   Off-policy LSTD   Off-policy LSTD
Optimality in tabular    Yes     Yes     No                No

D.2. Statistical inefficiency of MSWL and MVL for OPE

Here, we compare the statistical efficiency of MWL and MQL with that of MSWL and MVL in the tabular setting. First, we show that MSWL and MVL with linear classes are the same as off-policy LSTD (Bertsekas and Yu, 2009; Dann et al., 2014). Then, we calculate the asymptotic MSE of these estimators in the tabular case and show that it is larger than that of MWL and MQL.

Equivalence of MSWL, MVL with linear models and off-policy LSTD   By slightly modifying Liu et al. (2018, Theorem 4), MSWL is introduced based on the following relation:

Edπb[ πe(a|s)/πb(a|s) ( γ dπe,γ(s)/dπb(s) f(s′) − dπe,γ(s)/dπb(s) f(s) ) ] + (1 − γ)Ed0[f(s)] = 0 for all f ∈ L2(S, ν).

Then, the estimator for dπe,γ(s)/dπb(s) is given as

min_{w∈WS} max_{f∈FS} { Edπb[ πe(a|s)/πb(a|s) (γw(s)f(s′) − w(s)f(s)) ] + (1 − γ)Ed0[f(s)] }².   (34)

As in Example 2, in the linear model case, let w(s) = φ(s)⊤α, where φ(s) ∈ Rd is some basis function and α is the parameter vector. Then, the resulting estimator for dπe,γ(s)/dπb(s) is

α = (1 − γ)En[ πe(a|s)/πb(a|s) {−γφ(s′) + φ(s)}φ⊤(s) ]⁻¹ Ed0[φ(s)].

Then, the final estimator for Rπe is

(Dv1)⊤ Dv2⁻¹ Dv3,

where

Dv1 = (1 − γ)Ed0[φ(s)],

Dv2 = En[ πe(a|s)/πb(a|s) φ(s){−γφ⊤(s′) + φ⊤(s)} ],

Dv3 = En[ r πe(a|s)/πb(a|s) φ(s) ].

In MVL, the estimator for V πe(s) is constructed based on the relation:

Edπb[ πe(a|s)/πb(a|s) {r + γV πe(s′) − V πe(s)} g(s) ] = 0 for all g ∈ L2(S, ν).


Then, the estimator for V πe(s) is given by

min_{v∈V} max_{g∈GS} Edπb[ πe(a|s)/πb(a|s) {r + γv(s′) − v(s)} g(s) ]².

As in Example 7, in the linear model case, let v(s) = φ(s)⊤β, where φ(s) ∈ Rd is some basis function and β is the parameter vector. Then, the resulting estimator for V πe(s) is

β = En[ πe(a|s)/πb(a|s) φ(s){−γφ⊤(s′) + φ⊤(s)} ]⁻¹ En[ r πe(a|s)/πb(a|s) φ(s) ].

Then, the final estimator for Rπe is still (Dv1)⊤ Dv2⁻¹ Dv3. This is exactly the same as the estimator obtained by off-policy LSTD (Bertsekas and Yu, 2009).
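As an illustration, the following numpy sketch assembles Dv1, Dv2 and Dv3 from transition tuples and importance ratios and returns the off-policy LSTD estimate (Dv1)⊤ Dv2⁻¹ Dv3 discussed above; the function signature and variable names are our own illustrative choices.

```python
import numpy as np

def off_policy_lstd_value(phi_s, phi_next, rho, rewards, phi_d0, gamma):
    """Off-policy LSTD estimate (D_v1)^T D_v2^{-1} D_v3 from transition tuples.

    phi_s    : (n, d) features phi(s_i)
    phi_next : (n, d) features phi(s'_i)
    rho      : (n,)   importance ratios pi_e(a_i|s_i) / pi_b(a_i|s_i)
    rewards  : (n,)   rewards r_i
    phi_d0   : (m, d) features phi(s) for initial states s ~ d0
    """
    n = len(phi_s)
    Dv1 = (1.0 - gamma) * phi_d0.mean(axis=0)
    Dv2 = (rho[:, None] * phi_s).T @ (phi_s - gamma * phi_next) / n
    Dv3 = (rho * rewards) @ phi_s / n
    beta = np.linalg.solve(Dv2, Dv3)   # linear coefficients of the estimated V^{pi_e}
    return float(Dv1 @ beta)           # estimate of R^{pi_e}
```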

Another formulation of MSWL and MVL   According to Liu et al. (2018, Theorem 4), we have

Edπb[ ( γ dπe,γ(s)/dπb(s) πe(a|s)/πb(a|s) − dπe,γ(s′)/dπb(s′) ) f(s′) ] + (1 − γ)Ed0[f(s)] = 0 for all f ∈ L2(S, ν).

They construct an estimator for dπe,γ(s)/dπb(s) as

min_{w∈WS} max_{f∈FS} { Edπb[ γw(s)f(s′) πe(a|s)/πb(a|s) − w(s)f(s) ] + (1 − γ)Ed0[f(s)] }².

Note that compared with the previous case (34), the position of the importance weight πe/πb is different. In the same way, MVL is constructed based on the relation:

Edπb[ { πe(a|s)/πb(a|s) (r + γV πe(s′)) − V πe(s) } g(s) ] = 0 for all g ∈ L2(S, ν).

The estimator for V πe(s) is given by

min_{v∈V} max_{g∈GS} Eπb[ ( πe(a|s)/πb(a|s) {r + γv(s′)} − v(s) ) g(s) ]².

When positing linear models, in both cases, the final estimator for Rπe is

Dv1⊤ {Dv4}⁻¹ Dv3,

where

Dv4 = En[ φ(s){ −γ πe(a|s)/πb(a|s) φ⊤(s′) + φ⊤(s) } ].

This is exactly the same as the other type of off-policy LSTD (Dann et al., 2014).

Statistical inefficiency of MSWL, MVL and off-policy LSTD   Next, we calculate the asymptotic variance of Dv1⊤ Dv2⁻¹ Dv3 and Dv1⊤ Dv4⁻¹ Dv3 in the tabular setting. It is shown that these methods cannot achieve the semiparametric lower bound (Kallus and Uehara, 2019a); that is, these methods are statistically inefficient. Note that this implication also carries over to the general continuous sample space case, since the asymptotic MSE is generally the same in the continuous case under some smoothness conditions.

Theorem 22. Assume the whole data is geometrically ergodic. In the tabular setting, √n(Dv1⊤ Dv2⁻¹ Dv3 − Rπe) weakly converges to the normal distribution with mean 0 and variance

Edπb[ {dπe,γ(s)/dπb(s)}² vardπb[ πe(a|s)/πb(a|s) {r + γV πe(s′) − V πe(s)} | s ] ].

This is larger than the semiparametric lower bound.


Theorem 23. Assume the whole data is geometrically ergodic. In the tabular setting, √n(Dv1⊤ Dv4⁻¹ Dv3 − Rπe) weakly converges to the normal distribution with mean 0 and variance

Edπb[ {dπe,γ(s)/dπb(s)}² vardπb[ πe(a|s)/πb(a|s) {r + γV πe(s′)} | s ] ].

This is larger than the semiparametric lower bound.

Figure 2. MSE as a function of sample size. (a) α = 0.2; (b) α = 0.4. α controls the difference between πb and πe; see Appendix D.3 for details.

D.3. Experiments

We show some empirical results that back up the theoretical discussions in this section. We conduct experiments in the Taxi environment (Dietterich, 2000), which has 20000 states and 6 actions; see Liu et al. (2018, Section 5) for more details. We compare three methods, all using the tabular representation: MSWL with the exact behavior policy πb, MSWL with an estimated πb ("plug-in"), and MWL (same as MQL). As we have mentioned earlier, this comparison is essentially among off-policy LSTD, plug-in off-policy LSTD, and LSTDQ.

We choose the target policy πe to be the one obtained after running Q-learning for 1000 iterations, and choose another policy π+ obtained after 150 iterations. The behavior policy is πb = απe + (1 − α)π+. We report the results for α ∈ {0.2, 0.4}. The discount factor is γ = 0.98.

We use a single trajectory and vary the truncation size T over {5, 10, 20, 40} × 10⁴. For each case, over 200 replications, we report the Monte Carlo MSE of each estimator with its 95% interval in Figure 2.

It is observed that MWL is significantly better than MSWL, and MWL is slightly better than plug-in MSWL. This is because MWL is statistically efficient while MSWL is statistically inefficient, as we have shown earlier in this section. The reason why plug-in MSWL is superior to the original MSWL is that the plug-in based on MLE with a well-specified model can be viewed as a form of control variates (Henmi and Eguchi, 2004; Hanna et al., 2019). Whether the plug-in MSWL can achieve the semiparametric lower bound remains future work.

D.4. Proofs of Theorems 21, 22, and 23

Proof of Theorem 21. Recall that the estimator is written as Dq1⊤ Dq2⁻¹ Dq3, where

Dq1 = (1 − γ)Ed0×πe[φ(s, a)],

Dq2 = En[−γφ(s, a)φ⊤(s′, πe) + φ(s, a)φ(s, a)⊤],

Dq3 = En[rφ(s, a)].
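For concreteness, a minimal numpy sketch of this estimator is given below; it builds Dq1, Dq2 and Dq3 from data (with indicator features in the tabular case) and returns Dq1⊤ Dq2⁻¹ Dq3. The function and argument names are illustrative assumptions.

```python
import numpy as np

def tabular_lstdq_estimate(phi_sa, phi_next_pie, rewards, phi_init_pie, gamma):
    """LSTDQ / tabular MWL-MQL estimate D_q1^T D_q2^{-1} D_q3.

    phi_sa       : (n, d) features phi(s_i, a_i) (indicator features in the tabular case)
    phi_next_pie : (n, d) features phi(s'_i, pi_e) = sum_a' pi_e(a'|s'_i) phi(s'_i, a')
    rewards      : (n,)   rewards r_i
    phi_init_pie : (m, d) features phi(s, pi_e) with s ~ d0
    """
    n = len(phi_sa)
    Dq1 = (1.0 - gamma) * phi_init_pie.mean(axis=0)
    Dq2 = phi_sa.T @ (phi_sa - gamma * phi_next_pie) / n
    Dq3 = rewards @ phi_sa / n
    return float(Dq1 @ np.linalg.solve(Dq2, Dq3))
```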


Recall that Dq2⁻¹ Dq3 = β can be seen as a Z-estimator with the parametric model q(s, a; β) = β⊤φ(s, a). More specifically, the estimator β is given as a solution to

En[{r + γq(s′, πe; β) − q(s, a; β)}φ(s, a)] = 0.

Following the standard theory of Z-estimators (van der Vaart, 1998), the asymptotic MSE of β is given by the sandwich formula:

√n(β − β0) →d N(0, D1⁻¹ D2 (D1⁻¹)⊤),

where β0⊤φ(s, a) = Qπe(s, a) and

D1 = Edπb[φ(s, a){−γφ(s′, πe) + φ(s, a)}⊤],

D2 = vardπb[{r + γq(s′, πe; β) − q(s, a; β)}φ(s, a)]|β0   (35)

= Edπb[vardπb[r + γV πe(s′) − Qπe(s, a)|s, a]φ(s, a)φ⊤(s, a)] + vardπb[Edπb[r + γV πe(s′) − Qπe(s, a)|s, a]φ(s, a)]   (36)

= Edπb[vardπb[r + γV πe(s′) − Qπe(s, a)|s, a]φ(s, a)φ⊤(s, a)].   (37)

Here, we use a variance decomposition to simplify D2 from (35) to (36). We use the relation Edπb[r + γV πe(s′) − Qπe(s, a)|s, a] = 0 from (36) to (37). Then, by the delta method,

√n(Dq1⊤ Dq2⁻¹ Dq3 − Rπe) →d N(0, Dq1⊤ D1⁻¹ D2 (D1⁻¹)⊤ Dq1).

From now on, we simplify the expression Dq1⊤ D1⁻¹ D2 (D1⁻¹)⊤ Dq1. First, we observe

[Dq1]_{|S|i1+i2} = (1 − γ)d0(Si1)πe(Ai2|Si1),

where [Dq1]_{|S|i1+i2} is the element of Dq1 corresponding to (Si1, Ai2). In addition,

D1⁻¹ = Edπb[φ(s, a){−γφ(s′, πe) + φ(s, a)}⊤]⁻¹

= Edπb[φ(s, a)φ⊤(s, a)(−γPπe + I)⊤]⁻¹

= {(−γPπe + I)⁻¹}⊤ Edπb[φ(s, a)φ⊤(s, a)]⁻¹,

where Pπe is the transition matrix between (s, a) and (s′, a′), and I is the identity matrix.

Therefore, by defining g(s, a) = vardπb[r + γV πe(s′) − Qπe(s, a)|s, a] and D3 = (I − γPπe)⁻¹Dq1, the asymptotic variance is

D3⊤ Edπb[φ(s, a)φ⊤(s, a)]⁻¹ Edπb[g(s, a)φ(s, a)φ⊤(s, a)] Edπb[φ(s, a)φ⊤(s, a)]⁻¹ D3

= ∑_{s∈S, a∈A} {dπb(s, a)}⁻¹ g(s, a) {D3⊤ Is,a}²,

where Is,a is the |S||A|-dimensional vector whose element corresponding to (s, a) is 1 and whose other elements are 0. Noting D3⊤ Is,a = dπe,γ(s, a), the asymptotic variance is

∑_{s∈S, a∈A} {dπb(s, a)}⁻¹ g(s, a) d²πe,γ(s, a)

= Edπb[w²πe/πb(s, a)vardπb[r + γV πe(s′) − Qπe(s, a)|s, a]]

= Edπb[w²πe/πb(s, a)(r + γV πe(s′) − Qπe(s, a))²].

This concludes the proof.


Proof of Theorem 22. Recall that Dv2⁻¹ Dv3 = β can be seen as a Z-estimator with the parametric model v(s; β) = β⊤φ(s). More specifically, the estimator β is given as a solution to

En[ πe(a|s)/πb(a|s) {r + γv(s′; β) − v(s; β)}φ(s) ] = 0.

Following the standard theory of Z-estimators (van der Vaart, 1998), the asymptotic variance of β is given by the sandwich formula:

√n(β − β0) →d N(0, D1⁻¹ D2 (D1⁻¹)⊤),

where β0⊤φ(s) = V πe(s) and

D1 = Edπb[ πe(a|s)/πb(a|s) φ(s){−γφ(s′) + φ(s)}⊤ ],

D2 = vardπb[ πe(a|s)/πb(a|s) {r + γv(s′; β) − v(s; β)}φ(s) ]|β0

= Edπb[ vardπb[ πe(a|s)/πb(a|s) {r + γV πe(s′) − V πe(s)} | s ] φ(s)φ⊤(s) ].

Then, by the delta method,

√n(Dv1⊤ Dv2⁻¹ Dv3 − Rπe) →d N(0, Dv1⊤ D1⁻¹ D2 (D1⁻¹)⊤ Dv1).

From now on, we simplify the expression Dv1⊤ D1⁻¹ D2 (D1⁻¹)⊤ Dv1. First, we observe

[Dv1]i = (1 − γ)d0(Si),

where [Dv1]i is the i-th element. In addition,

D1⁻¹ = Edπb[ πe(a|s)/πb(a|s) φ(s){−γφ(s′) + φ(s)}⊤ ]⁻¹

= Edπb[φ(s)φ⊤(s){−γPπe + I}⊤]⁻¹

= ({−γPπe + I}⊤)⁻¹ Edπb[φ(s)φ⊤(s)]⁻¹,

where Pπe is the state transition matrix under πe from the current state to the next state.

Therefore, by defining g(s) = vardπb[ πe(a|s)/πb(a|s) {r + γV πe(s′) − V πe(s)} | s ] and D3 = {−γPπe + I}⁻¹Dv1, the asymptotic variance is

D3⊤ Edπb[φ(s)φ⊤(s)]⁻¹ Edπb[g(s)φ(s)φ⊤(s)] Edπb[φ(s)φ⊤(s)]⁻¹ D3

= ∑_{s∈S} d⁻¹πb(s) g(s) {D3⊤ Is}²,

where Is is the |S|-dimensional vector whose element corresponding to s is 1 and whose other elements are 0. Noting D3⊤ Is = dπe,γ(s), the asymptotic variance is

∑_{s∈S} d⁻¹πb(s) g(s) d²πe,γ(s) = Edπb[ {dπe,γ(s)/dπb(s)}² vardπb[ πe(a|s)/πb(a|s) {r + γV πe(s′) − V πe(s)} | s ] ].

Finally, we show this is larger than the semiparametric lower bound. This is seen as

Edπb[ {dπe,γ(s)/dπb(s)}² vardπb[ πe(a|s)/πb(a|s) {r + γV πe(s′) − V πe(s)} | s ] ]

≥ Edπb[ {dπe,γ(s)/dπb(s)}² Eπb(a|s)[ vardπb[ πe(a|s)/πb(a|s) {r + γV πe(s′) − V πe(s)} | s, a ] ] ]

= Edπb[ w²πe/πb(s, a) var[r + γV πe(s′) − Qπe(s, a)|s, a] ].

Here, from the first line to the second line, we use the general inequality var[x] = var[E[x|y]] + E[var[x|y]] ≥ E[var[x|y]].


Proof of Theorem 23. By redefining g(s) = vardπb[ πe(a|s)/πb(a|s) {r + γV πe(s′)} | s ] in the proof of Theorem 22, the asymptotic variance is

∑_{s∈S} d⁻¹πb(s) g(s) d²πe,γ(s) = Edπb[ {dπe,γ(s)/dπb(s)}² vardπb[ πe(a|s)/πb(a|s) {r + γV πe(s′)} | s ] ].

Then, we show this is larger than the semiparametric lower bound. This is seen as

∑_{s∈S} d⁻¹πb(s) g(s) d²πe,γ(s) = Edπb[ {dπe,γ(s)/dπb(s)}² vardπb[ πe(a|s)/πb(a|s) {r + γV πe(s′)} | s ] ]

≥ Edπb[ {dπe,γ(s)/dπb(s)}² Eπb(a|s)[ vardπb[ πe(a|s)/πb(a|s) {r + γV πe(s′)} | s, a ] ] ]

= Edπb[ w²πe/πb(s, a) var[r + γV πe(s′) − Qπe(s, a)|s, a] ].

E. Experiments in the Function Approximation Setting

In this section, we empirically evaluate our new algorithms MWL and MQL, and compare them with MSWL (Liu et al., 2018) and DualDICE (Nachum et al., 2019b) in the function approximation setting.

E.1. Setup

We consider the infinite-horizon discounted setting with γ = 0.999, and test all the algorithms on CartPole, a control task with continuous state space and discrete action space. Based on the implementation of OpenAI Gym (Brockman et al., 2016), we define a new state-action-dependent reward function and add small zero-mean Gaussian noise to the transition dynamics.

To obtain the behavior and target policies, we first use the open source code16 of DQN to get a near-optimal Q function, and then apply softmax to the Q values divided by an adjustable temperature τ:

π(a|s) ∝ exp(Q(s, a)/τ).   (38)

We choose τ = 1.0 as the behavior policy and τ = 0.25, 0.5, 1.5, 2.0 as the target policies. The training datasets are generated by collecting trajectories of the behavior policy with fixed horizon length 1000. If the agent visits the terminal states within 1000 steps, the trajectory is completed by repeating the last state and continuing to sample actions.
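The following short Python sketch shows the softmax construction in Eq.(38); the q_values input and the helper names are illustrative assumptions rather than the DQN code we actually used.

```python
import numpy as np

def softmax_policy(q_values, tau):
    """pi(a|s) proportional to exp(Q(s, a) / tau), as in Eq.(38).

    q_values : (num_actions,) Q-values for a single state
    tau      : temperature; smaller tau concentrates the policy on the greedy action
    """
    logits = q_values / tau
    logits = logits - logits.max()     # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sample_action(q_values, tau, rng):
    """Draw one action from the softmax policy."""
    return rng.choice(len(q_values), p=softmax_policy(q_values, tau))
```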

E.2. Algorithms Implementation

We use the loss functions and the OPE estimators derived in this paper and the previous literature (Liu et al., 2018; Nachum et al., 2019b), except for MWL, for which we find that another, equivalent loss function works better in practice:

L = Edπ0[ γ( w(s, a)f(s′, π) − w(s′, a′)f(s′, a′) ) ] − (1 − γ)Ed0×π0[w(s, a)f(s, a)] + (1 − γ)Ed0×π[f(s, a)].   (39)

In all algorithms, we keep the structure of the neural networks the same: two hidden layers with 32 units each and ReLU as the activation function. Besides, the observations are normalized to zero mean and unit variance, and the batch size is fixed to 500. In MSWL, the normalized states are the only input, while in the others, the states and the actions are concatenated and fed together into the neural networks.

As for DualDICE, we conduct the evaluation based on the open source implementation17. The learning rates of the ν-network and ζ-network are changed to ην = 0.0005 and ηζ = 0.005, respectively, after a grid search over ην × ηζ ∈ {0.0001, 0.0005, 0.001, 0.0015, 0.002} × {0.001, 0.005, 0.01, 0.015, 0.02}.

16 https://github.com/openai/baselines
17 https://github.com/google-research/google-research/tree/master/dual_dice


In MSWL, we use the estimated policy distribution instead of the true one to compute the policy ratio. To do this, we train a 64x64 MLP with the cross-entropy loss until convergence to approximate the distribution of the behavior policy. The learning rate is set to 0.0005.

Besides, we implement MQL, MWL and MSWL with F corresponding to an RKHS associated with a kernel K(·, ·). The new loss function of MWL (39) can be written as:

L = γ²E(s,a,s′),(s̃,ã,s̃′)∼dπ0[w(s, a)w(s̃, ã)Ea′,ã′∼π[K((s′, a′), (s̃′, ã′))]]

+ γ²E(s′,a′),(s̃′,ã′)∼dπ0[w(s′, a′)w(s̃′, ã′)K((s′, a′), (s̃′, ã′))]

+ (1 − γ)²E(s,a),(s̃,ã)∼d0×π0[w(s, a)w(s̃, ã)K((s, a), (s̃, ã))]

+ (1 − γ)²E(s,a),(s̃,ã)∼d0×π[K((s, a), (s̃, ã))]

− 2γ²E(s,a,s′),(s̃′,ã′)∼dπ0[w(s, a)w(s̃′, ã′)Ea′∼π[K((s′, a′), (s̃′, ã′))]]

− 2γ(1 − γ)E(s,a,s′)∼dπ0,(s̃,ã)∼d0×π0[w(s, a)w(s̃, ã)Ea′∼π[K((s′, a′), (s̃, ã))]]

+ 2γ(1 − γ)E(s,a,s′)∼dπ0,(s̃,ã)∼d0×π[w(s, a)Ea′∼π[K((s′, a′), (s̃, ã))]]

+ 2γ(1 − γ)E(s′,a′)∼dπ0,(s̃,ã)∼d0×π0[w(s′, a′)w(s̃, ã)K((s′, a′), (s̃, ã))]

− 2γ(1 − γ)E(s′,a′)∼dπ0,(s̃,ã)∼d0×π[w(s′, a′)K((s′, a′), (s̃, ã))]

− 2(1 − γ)²E(s,a)∼d0×π0,(s̃,ã)∼d0×π[w(s, a)K((s, a), (s̃, ã))],   (40)

where tilded variables denote independent copies drawn from the distribution indicated in the corresponding subscript.

We choose the RBF kernel, defined as

K(xi, xj) = exp( −‖xi − xj‖²₂ / (2σ²) ),   (41)

where xi, xj correspond to state vectors in MSWL, and to the vectors formed by concatenating state and action in MWL and MQL. Denoting by h the median of the pairwise distances between the xi, we set σ equal to h, h/3 and h/15 in MSWL, MWL and MQL, respectively. The learning rates are fixed to 0.005 in these three methods.
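A small numpy sketch of the RBF kernel in Eq.(41) together with the median-distance heuristic for choosing σ is given below; the helper names are illustrative assumptions.

```python
import numpy as np

def median_bandwidth(X):
    """Median of the pairwise Euclidean distances among the rows of X."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    return np.median(d[np.triu_indices(len(X), k=1)])

def rbf_gram(X, Y, sigma):
    """K(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 sigma^2)), as in Eq.(41)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# e.g. sigma = median_bandwidth(X), median_bandwidth(X) / 3, or median_bandwidth(X) / 15
# for MSWL, MWL and MQL respectively, as described above.
```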

In MSWL and MWL, to ensure that the predicted density ratio is non-negative, we apply log(1 + exp(·)) as the activation function in the last layer of the neural networks. Moreover, we normalize the ratio to have unit mean value in each batch, which works better in practice.

E.3. Results

We generate N datasets with different random seeds. For each dataset, we run all four algorithms from the beginning until convergence, and consider this as one trial. Every 100 training iterations, the estimate of the value is logged, and the average over the last five logged estimates is recorded as the result of the trial. For each algorithm, we report the normalized MSE, defined by

(1/N) ∑_{i=1}^{N} (R(i)πe − Rπe)² / (Rπb − Rπe)²,   (42)

where R(i)πe is the estimated return in the i-th trial, and Rπe and Rπb are the true expected returns of the target and the behavior policies, respectively, estimated by 500 on-policy Monte-Carlo trajectories (truncated at H = 10000 steps to make sure γ^H is sufficiently small). This normalization is very informative, as a naive baseline that treats Rπe ≈ Rπb will get 0.0 (after taking the logarithm), so any method that beats this simple baseline should get a negative score. We conduct N = 25 trials and plot the results in Figure 3.

E.4. Error bars

We denote xi = (R(i)πe − Rπe)² / (Rπb − Rπe)². The quantity we plot (Eq.(42)) is just the average of the N i.i.d. random variables {xi}_{i=1}^{N}. We plot twice the standard error of the estimate, which corresponds to a 95% confidence interval, under the logarithmic transformation. That is, the upper bound of the error bar is

log( (1/N) ∑_{i=1}^{N} xi + 2σ/√N )


Figure 3. A comparison of four algorithms: MQL (red circle), MWL (blue square), DualDICE (green triangle) and MSWL (yellow diamond). Left: we fix the number of trajectories to 200 and change the target policies. Right: we keep the target policy at τ = 1.5 and vary the number of samples.

and the lower bound of the error bar is

log( (1/N) ∑_{i=1}^{N} xi − 2σ/√N ),

where σ is the sample standard deviation of xi.
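For reference, a minimal numpy sketch of the normalized MSE in Eq.(42) and of the logarithmic error bars described above is given below; the function name and inputs are illustrative assumptions.

```python
import numpy as np

def log_normalized_mse(R_hat, R_pie, R_pib):
    """Log of the normalized MSE (Eq.(42)) with the error bars of Appendix E.4.

    R_hat : (N,) estimated returns, one per trial
    R_pie : true expected return of the target policy
    R_pib : true expected return of the behavior policy
    """
    x = (R_hat - R_pie) ** 2 / (R_pib - R_pie) ** 2
    mean = x.mean()
    half_width = 2 * x.std(ddof=1) / np.sqrt(len(x))   # twice the standard error
    # The lower bound is only meaningful when mean > half_width.
    return np.log(mean), np.log(mean + half_width), np.log(mean - half_width)
```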

F. Step-wise IS as a Special Case of MWL

We show that step-wise IS (Precup et al., 2000) in discounted episodic problems can be viewed as a special case of MWL, and sketch the proof as follows. In addition to the setup in Section 2, we also assume that the MDP always reaches the absorbing state within H steps from any starting state drawn from d0. The data are trajectories generated by πb. We first convert the MDP into an equivalent history-based MDP, i.e., a new MDP where the state is the history of the original MDP (absorbing states are still treated specially). We use ht to denote a history of length t, i.e., ht = (s0, a0, r0, s1, . . . , st). Since the history-based MDP still fits our framework, we can apply MWL as-is to the history-based MDP. In this case, each data trajectory will be converted into H tuples in the form of (ht, at, rt, ht+1).

We choose the following F class for MWL, which is the space of all functions over histories (of various lengths up to H). Assuming all histories have non-zero density under πe (this assumption can be removed), from Lemma 1 we know that the only w that satisfies Lw(w, f) = 0 for all f ∈ F is

dπe,γ(ht, at) / dπb(ht, at) = ( (1 − γ)γ^t / (1/H) ) ∏_{t′=0}^{t} πe(at′|st′) / πb(at′|st′).18

Note that Rw[w] with such a w is precisely the step-wise IS estimator in discounted episodic problems. Furthermore, the true marginalized importance weight in the original MDP, dπe,γ(s, a)/dπb(s, a), is not feasible under this "overly rich" history-dependent discriminator class (see also related discussions in Jiang (2019)).

18 Here the term (1 − γ)γ^t/(1/H) appears because a state at time step t is discounted in the evaluation objective but its empirical frequency in the data is not. Other than that, the proof of this equation is precisely how one derives sequential IS, i.e., the density ratio between histories is equal to the cumulative product of importance weights on the actions.