
Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions

Omer Gottesman 1 Joseph Futoma 1 Yao Liu 2 Sonali Parbhoo 1 Leo Anthony-Celi 3 4 Emma Brunskill 2

Finale Doshi-Velez 1

Abstract

Off-policy evaluation in reinforcement learning offers the chance of using observational data to improve future outcomes in domains such as healthcare and education, but safe deployment in high-stakes settings requires ways of assessing its validity. Traditional measures such as confidence intervals may be insufficient due to noise, limited data, and confounding. In this paper we develop a method that could serve as a hybrid human-AI system, to enable human experts to analyze the validity of policy evaluation estimates. This is accomplished by highlighting observations in the data whose removal will have a large effect on the OPE estimate, and formulating a set of rules for choosing which ones to present to domain experts for validation. We develop methods to compute exactly the influence functions for fitted Q-evaluation with two different function classes: kernel-based and linear least squares. Experiments on medical simulations and real-world intensive care unit data demonstrate that our method can be used to identify limitations in the evaluation process and make evaluation more robust.

1. Introduction

Within reinforcement learning (RL), off-policy evaluation (OPE) is the task of estimating the value of a given evaluation policy, using data collected by interaction with the environment under a different behavior policy (Sutton & Barto, 2018; Precup, 2000). OPE is particularly valuable when interaction and experimentation with the environment is expensive, risky, or unethical (for example, in healthcare or with self-driving cars). However, despite recent interest and progress, state-of-the-art OPE methods still often fail to differentiate between obviously good and obviously bad policies, e.g. in healthcare (Gottesman et al., 2018).

1 Harvard University, 2 Stanford University, 3 Massachusetts Institute of Technology, 4 Beth Israel Deaconess Medical Center. Correspondence to: Omer Gottesman <[email protected]>.

Most of the OPE literature focuses on sub-problems such as improving asymptotic sample efficiency or bounding the error on OPE estimators for the value of a policy. However, while these bounds are theoretically sound, they are often too conservative to be useful in practice (though see e.g. Thomas et al. (2019) for an exception). This is not surprising, as there is a theoretical limit to the statistical information contained in a given dataset, no matter which estimation technique is used. Furthermore, many of the common assumptions underlying these theoretical guarantees are usually not met in practice: observational healthcare data, for example, often contains many unobserved confounders (Gottesman et al., 2019a).

Given the limitations of OPE, we argue that in high-stakes scenarios domain experts should be integrated into the evaluation process in order to provide useful, actionable results. For example, senior clinicians may be able to provide insights that reduce the uncertainty in our value estimates. In this light, the explicit integration of expert knowledge into the OPE pipeline is a natural way for researchers to receive feedback and continually update their policies until one can make a responsible decision about whether to pursue gathering prospective data.

The question is then: what information can humans provide that might help assess and potentially improve our confidence in an OPE estimate? In this work, we consider how human input could improve our confidence in the recently proposed OPE estimator, fitted Q-evaluation (FQE) (Le et al., 2019). We develop an efficient approach to identify the most influential transitions in the batch of observational data, that is, transitions whose removal would have large effects on the OPE estimate. By presenting these influential transitions to a domain expert and verifying that they are indeed representative of the data, we can increase our confidence that our estimated evaluation policy value is not dependent on outliers, confounded observations, or measurement errors. The main contributions of this work are:

• Conceptual: We develop a framework for using influence functions to interpret OPE, and discuss the types of questions which can be shared with domain experts to use their expertise in debugging OPE.

• Technical: We develop computationally efficient algorithms to compute the exact influence functions for two broad function classes in FQE: kernel-based functions and linear functions.

• Empirical: We demonstrate the potential benefits of influence analysis for interpreting OPE on a cancer simulator, and present results of an analysis, conducted together with practicing clinicians, of OPE for management of acute hypotension from a real intensive care unit (ICU) dataset.

2. Related work

The OPE problem in RL has been studied extensively. Works fall into three main categories: importance sampling (IS) (e.g. Precup (2000); Jiang & Li (2015)), model-based (often referred to as the direct method) (e.g. Hanna et al. (2017); Gottesman et al. (2019b)), and value-based (e.g. Le et al. (2019)). Some of these works provide bounds on the estimation errors (e.g. Thomas et al. (2015); Dann et al. (2018)). We emphasize, however, that for most real-world applications these bounds are either too conservative to be useful or rely on assumptions which are usually violated.

While there has been considerable recent progress in interpretable machine learning and machine learning with humans in the loop (e.g. Tamuz et al. (2011); Lage et al. (2018)), to our knowledge there has been little work that considers human interaction in the context of OPE. Oberst & Sontag (2019) proposed framing the OPE problem as a structural causal model, which enabled them to identify trajectories where the predicted counterfactual trajectories under an evaluation policy differ substantially from the observed data collected under the behavior policy. However, that work does not give guidance on what part of the trajectory might require closer scrutiny, nor can it use human input for additional refinement.

Finally, the notion of influence that we use throughout this work has a long history in statistics as a technique for evaluating the robustness of estimators (Cook & Weisberg, 1980). Recently, an approximate version of influence for complex black-box models was presented by Koh & Liang (2017), who demonstrated how influence functions can make machine learning methods more interpretable. In the context of optimal control and RL, influence functions were first introduced by Munos & Moore (2002) to aid in online optimization of policies. However, their definition of influence as a change in the value function caused by perturbations of the reward at a specific state is quite different from ours.

3. Background

3.1. Notation

A Markov Decision Process (MDP) is a tuple $\langle \mathcal{X}, \mathcal{A}, P_T, P_R, P_0, \gamma \rangle$, where $\mathcal{X}$, $\mathcal{A}$, and $\gamma$ are the state space, action space, and discount factor, respectively. The next-state transition and reward distributions are given by $P_T(\cdot \mid x, a)$ and $P_R(\cdot \mid x, a)$, respectively, and $P_0(x)$ is the initial state distribution. The state and action spaces may be either discrete or continuous, and the transition and reward functions may be either stochastic or deterministic.

A dataset is composed of a set of $N$ observed transitions $D = \{(x^{(n)}, a^{(n)}, r^{(n)}, x'^{(n)})\}_{n=1}^N$, and we use $\tau^{(n)}$ to denote a single transition. The subset $D_0 \subseteq D$ denotes initial transitions from which $P_0$ can be estimated. Note that although we treat all data points as observed transitions, in most practical applications data is collected in the form of trajectories rather than individual transitions.

A policy is a function $\pi : \mathcal{X} \times \mathcal{A} \rightarrow [0, 1]$ that gives the probability of taking each action at a given state ($\sum_{a \in \mathcal{A}} \pi(a \mid x) = 1$). The value of a policy is the expected return collected by following the policy, $v^\pi := \mathbb{E}[g_T \mid a_t \sim \pi]$, where actions are chosen according to $\pi$, expectations are taken with respect to the MDP, and $g_T := \sum_{t=0}^T \gamma^t r_t$ denotes the total return, or sum of discounted rewards. The state-action value function $q^\pi(x, a)$ is the expected return for taking action $a$ at state $x$, and afterwards following $\pi$ in selecting future actions. The goal of off-policy evaluation is to estimate the value of an evaluation policy, $\pi_e$, using data collected under a different behavior policy, $\pi_b$. In this work we are only interested in estimating $v^{\pi_e}$ and $q^{\pi_e}$, and will therefore drop the superscript for brevity. We will also limit ourselves to deterministic evaluation policies.

For the purpose of kernel-based value function approximation, we define a distance metric $d((x^{(i)}, a^{(i)}), (x^{(j)}, a^{(j)}))$ over $\mathcal{X} \times \mathcal{A}$. In this work, for discrete action spaces, we will assume $d((x^{(i)}, a^{(i)}), (x^{(j)}, a^{(j)})) = \infty$ when $a^{(i)} \neq a^{(j)}$, but this is not required for any of the derivations.

3.2. Fitted Q-Evaluation

Fitted Q-Evaluation (Le et al., 2019) can be thought of as dynamic programming on an observational dataset to compute the value of a given evaluation policy. It is similar to the better-known fitted Q-iteration method (FQI) (Ernst et al., 2005), except that it is performed offline on observational data, and the target is used for evaluation of a given policy rather than for optimization. FQE performs a sequence of supervised learning steps where the inputs are state-action pairs, and the targets at each iteration are given by $y_i(x, a) = r + \gamma q_{i-1}(x', \pi_e(x'))$, where $q_{i-1}(x, a)$ is the estimator (from a function class $\mathcal{F}$) that best estimates $y_{i-1}(x, a)$. For more information, see Le et al. (2019).
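To make the procedure concrete, the following is a minimal Python sketch of FQE, assuming transitions are stored as $(x, a, r, x')$ tuples and that $\pi_e$ is deterministic; the regressor standing in for the function class $\mathcal{F}$ (scikit-learn's KNeighborsRegressor) and all names are illustrative rather than taken from Le et al. (2019).

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor  # any regressor from the class F would do

def fitted_q_evaluation(transitions, pi_e, gamma, n_iters, make_regressor=KNeighborsRegressor):
    """Minimal FQE sketch: iterate supervised regressions of the Bellman target.

    transitions: list of (x, a, r, x_next) tuples with array-valued states.
    pi_e: deterministic evaluation policy, pi_e(x) -> action.
    """
    X = np.array([np.append(x, a) for x, a, _, _ in transitions])            # (x, a) features
    Xp = np.array([np.append(xp, pi_e(xp)) for _, _, _, xp in transitions])  # (x', pi_e(x')) features
    r = np.array([t[2] for t in transitions])

    q = None
    for _ in range(n_iters):
        # Bellman target y_i = r + gamma * q_{i-1}(x', pi_e(x')); the first target is just r
        y = r if q is None else r + gamma * q.predict(Xp)
        q = make_regressor()
        q.fit(X, y)
    return q  # estimator of q(x, a) under the evaluation policy
```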

4. OPE diagnostics using influence functions

4.1. Definition of the influence

We aim to make OPE interpretable and easy to debug by identifying transitions in the data which are highly influential on the estimated policy value. We define the total influence of transition $\tau^{(j)}$ as the change in the value estimate if $\tau^{(j)}$ were removed:

$$I_j \equiv v_{-j} - v, \qquad (1)$$

where $v_{-j}$ is the value estimate using the same dataset after removal of $\tau^{(j)}$. In general, for any function of the data $f(D)$ we will use $f(D_{-j}) \equiv f_{-j}$ to denote the value of $f$ computed for the dataset after removal of $\tau^{(j)}$.

Another quantity of interest is the change in the estimated value of $q(x^{(i)}, a^{(i)})$ as a result of removing $\tau^{(j)}$, which we call the individual influence:

$$I_{i,j} \equiv q_{-j}(x^{(i)}, a^{(i)}) - q(x^{(i)}, a^{(i)}). \qquad (2)$$

The total influence of $\tau^{(j)}$ can be computed by averaging its individual influences over the set $D_0^*$ of all initial state-action transitions in which $a = \pi_e(x)$:

$$I_j = \frac{1}{|D_0^*|} \sum_{i \in D_0^*} I_{i,j}. \qquad (3)$$

As we are interested in the robustness of our evaluation, we can normalize the absolute value of the influence of $\tau^{(j)}$ by the estimated value of the policy to provide a more intuitive notion of overall importance:

$$\bar{I}_j \equiv \frac{|I_j|}{|v|}. \qquad (4)$$
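For reference, Eqs. (1)-(4) can be computed by brute force as in the sketch below, where `run_ope` is a placeholder for any OPE procedure (e.g., FQE) that maps a dataset to a scalar value estimate; this is the $N$-refit baseline that the efficient computations of Section 5 are designed to avoid.

```python
import numpy as np

def brute_force_influences(transitions, run_ope):
    """Total and normalized influences per Eqs. (1) and (4), by re-running OPE N times.

    transitions: a list of transitions tau^(j); run_ope(transitions) -> scalar value estimate v.
    """
    v = run_ope(transitions)
    total = np.array([run_ope(transitions[:j] + transitions[j + 1:]) - v   # I_j = v_{-j} - v
                      for j in range(len(transitions))])
    normalized = np.abs(total) / abs(v)                                    # |I_j| / |v|
    return total, normalized
```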

4.2. Diagnosing OPE estimation

With the above definitions of influence functions, we now formulate and discuss guidelines for diagnosing the OPE process for potential problems.

No influential transitions: OPE appears reliable. As a first diagnostic, we check that none of the transitions influence the OPE estimate by more than a specified influence threshold $I_C$, i.e. for all $j$ we have $\bar{I}_j \leq I_C$. In such a case we would output that, to the extent that low influences suggest the OPE is stable, the evaluation appears reliable. That said, we emphasize that our proposed method for evaluating OPE methods is not exhaustive, and there could be many other ways in which OPE could fail.

Influential transitions: a human can help. When there are several influential transitions in the data (defined as transitions whose influence is larger than $I_C$), we present them to domain experts to determine whether they are representative, that is, whether taking action $a$ in state $x$ is likely to result in a transition to $x'$. If the domain experts can validate all influential transitions, we can still have some confidence in the validity of the OPE. If any influential transitions are flagged as unrepresentative or artefacts, we have several options: (1) declare the OPE as unreliable; (2) remove the suspect influential transitions from the data and recompute the OPE; (3) caveat the OPE results as valid only for a subset of initial states that do not rely on that problematic transition.

In situations where there is a large number of influential transitions, manual review by experts may be infeasible. As such, it is necessary to present as few transitions as possible while still presenting enough to ensure that any potential artefacts in the data and/or the OPE process are accounted for. In practice, we find it is common to observe a sequence of influential transitions where removing any single transition has the same effect as removing the entire sequence. An example of this is shown schematically in Figure 1. The entire sequence marked in blue and red leads to a region of high reward, and so all transitions in that sequence will have high influence. The whole influential sequence appears very different from the rest of the data, and a domain expert might flag it as an outlier to be removed. However, we can present the expert with only the red transition and capture the influence of the blue transitions as well, reducing the number of suspect examples to be manually reviewed.

Influential transitions: policy is unevaluatable. When an influential transition $\tau^{(j)}$ has no nearest neighbors to $(x'^{(j)}, \pi_e(x'^{(j)}))$, we can determine that the evaluation policy cannot be evaluated, even without review by a domain expert. This is because such a situation indicates that the OPE relies on transitions for which there is no overlap between the actions observed in the data and those chosen by the evaluation policy. However, while the evaluation policy is not evaluatable, the influential "dead-end" transitions may still inform experts of what data is required for evaluation to be feasible.
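The three rules above can be strung together as in the following sketch; the 0.05 threshold mirrors the value used in Section 7.1, and the indicator of whether $(x'^{(j)}, \pi_e(x'^{(j)}))$ has any neighbors is assumed to be precomputed.

```python
from enum import Enum

class Diagnosis(Enum):
    RELIABLE = "no influential transitions: OPE appears reliable"
    NEEDS_REVIEW = "influential transitions: present them to a domain expert"
    UNEVALUATABLE = "influential dead-end transitions: policy cannot be evaluated"

def diagnose(normalized_influences, has_next_state_neighbors, influence_threshold=0.05):
    """Sketch of the Section 4.2 decision rules.

    normalized_influences[j]    -- |I_j| / |v| for transition j
    has_next_state_neighbors[j] -- True if (x'^(j), pi_e(x'^(j))) has at least one neighbor
    """
    flagged = [j for j, infl in enumerate(normalized_influences) if infl > influence_threshold]
    if not flagged:
        return Diagnosis.RELIABLE, []
    if any(not has_next_state_neighbors[j] for j in flagged):
        return Diagnosis.UNEVALUATABLE, flagged
    return Diagnosis.NEEDS_REVIEW, flagged
```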

It should be noted that the applicability of the diagnostic methods discussed above may change depending on whether the FQE function class is parametric or nonparametric. All function classes lend themselves to highlighting of highly influential transitions. However, the notion of stringing together sequences of neighbors, or looking for red flags in the form of influential transitions with no neighbors to their $(x', \pi_e(x'))$ state-action pairs, only makes sense for nonparametric models. In the case of parametric models, the notion of neighbors is less important, as the influence of removing a transition manifests as a change to the learned parameters which affects the value estimates for the entire domain simultaneously. In contrast, for nonparametric methods, removing a transition locally changes the value of neighboring transitions and propagates through the entire domain through the sequential nature of the environment. While we derive efficient ways to compute the influence for both parametric and nonparametric function classes, in the empirical section of this paper we present results for nonparametric kernel-based estimators to demonstrate all diagnostics.

[Figure 1. Schematic of an influential sequence. All transitions in the sequence leading to a high reward have high influence, but flagging just the red transition for inspection will capture the influence of the blue ones as well.]

5. Efficient computation of influence functions

A key technical challenge in performing the proposed influence analysis in OPE is computing the influences efficiently. The brute-force approach of removing a transition and recomputing the OPE estimate is clearly infeasible for all but tiny problems, as it requires refitting $N$ models. The computation of influences in RL is also significantly more challenging than in static one-step prediction tasks, as a change in the value of one state has a ripple effect on all other states that are possible to reach from it. We describe computationally efficient methods to compute the influence functions in two classes of FQE: kernel-based, and linear least squares. Unlike previous works (e.g. Koh & Liang (2017)) that approximate the influence function for a broad class of black-box functions, we provide closed-form, analytic solutions for the exact influence function in two widely used white-box function classes.

5.1. Kernel-Based FQE

In kernel-based FQE, the function class we choose for estimating the value function of $\pi_e$ at a point in state-action space is based on similar observations within that space. For simplicity, in the main body of this work we estimate the value function as an average of all its neighbors within a ball of radius $R$, i.e.

$$q(x, a) = \frac{1}{N_{(x,a)}} \sum_i q(x^{(i)}, a^{(i)}) \qquad (5)$$

where the summation is performed over all $(x^{(i)}, a^{(i)})$ such that $d((x^{(i)}, a^{(i)}), (x, a)) < R$ and $N_{(x,a)}$ is the number of such points. Extension to general kernel functions is straightforward. We introduce a matrix formulation for performing FQE which allows for efficient computation of the influence functions.

Matrix formulation of nearest-neighbors based FQE. We define $\Delta_{ij}$ as the event that the starting state-action of $\tau^{(j)}$ is a neighbor of the starting state-action of $\tau^{(i)}$, i.e. $d((x^{(i)}, a^{(i)}), (x^{(j)}, a^{(j)})) < R$. Similarly, we define $\Delta_{i'j}$ as the event that the starting state-action of $\tau^{(j)}$ is a neighbor of the next-state and corresponding $\pi_e$ action of $\tau^{(i)}$, i.e. $d((x'^{(i)}, \pi_e(x'^{(i)})), (x^{(j)}, a^{(j)})) < R$. We also define the neighbor counts as $N_i = \sum_{j=1}^N \mathbb{I}(\Delta_{ij})$ and $N_{i'} = \sum_{j=1}^N \mathbb{I}(\Delta_{i'j})$, where $\mathbb{I}(e)$ is the indicator function.

To perform nearest-neighbors FQE using matrix multiplications, we first construct two nearest-neighbors matrices: one for the neighbors of all state-action pairs, and one for the neighbors of all state-action pairs with pairs of next-states and subsequent actions under $\pi_e$. Formally:

$$M_{ij} = \frac{\mathbb{I}(\Delta_{ij})}{N_i}; \qquad M'_{ij} = \frac{\mathbb{I}(\Delta_{i'j})}{N_{i'}}. \qquad (6)$$

The $N \times N$ matrices $\mathbf{M}$ and $\mathbf{M}'$ can be easily computed from the data, and are used to compute the value function for all state-action pairs using the following proposition, the proof of which is given in Appendix A.1.

Proposition 1. For all transitions in the dataset, the values for corresponding state-action pairs are given by

$$\mathbf{q}'_t = \left( \sum_{t'=1}^{t} \gamma^{t'-1} \mathbf{M}'^{t'} \right) \mathbf{r} \equiv \mathbf{\Phi}'_t \mathbf{r} \qquad (7)$$

$$\mathbf{q}_t = \mathbf{M} \left( \sum_{t'=1}^{t} (\gamma \mathbf{M}')^{t'-1} \right) \mathbf{r} \equiv \mathbf{\Phi}_t \mathbf{r}, \qquad (8)$$

where $q'_{t,i}$ and $q_{t,i}$ are the estimated policy values at $(x'^{(i)}, \pi_e(x'^{(i)}))$ and $(x^{(i)}, a^{(i)})$, respectively, for $\tau^{(i)}$.

In future derivations, we will drop the time dependence of $\mathbf{\Phi}$ and $\mathbf{q}$ on $t$. This is justified when there are well-defined ends of trajectories with no nearest neighbors (or equivalently, trajectories end in an absorbing state), and the number of iterations in the FQE is larger than the longest trajectory.
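A sketch of this matrix formulation is given below, assuming the $(x, a)$ and $(x', \pi_e(x'))$ pairs have already been embedded as feature arrays whose distances respect the Section 3.1 convention (infinite distance across different actions, e.g. by splitting the data per action); the function and variable names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def matrix_fqe(sa, sa_next, rewards, gamma, radius, n_iters):
    """Nearest-neighbors FQE in matrix form (Eq. 6 and Proposition 1).

    sa      -- (N, d) features of (x^(i), a^(i))
    sa_next -- (N, d) features of (x'^(i), pi_e(x'^(i)))
    Returns q, q_next, and the matrices M, M_prime, Phi.
    """
    # Row-normalized neighbor matrices from Eq. (6)
    M = (cdist(sa, sa) < radius).astype(float)
    M /= M.sum(axis=1, keepdims=True)                 # every point is its own neighbor, so N_i >= 1
    M_prime = (cdist(sa_next, sa) < radius).astype(float)
    counts = M_prime.sum(axis=1, keepdims=True)
    M_prime = np.divide(M_prime, counts, out=np.zeros_like(M_prime), where=counts > 0)

    # Phi'_t = sum_{t'=1}^t gamma^{t'-1} M'^{t'}, accumulated iteratively
    Phi_prime = np.zeros_like(M_prime)
    power = np.eye(len(sa))
    for t in range(n_iters):
        power = power @ M_prime                        # M'^{t+1}
        Phi_prime += (gamma ** t) * power
    q_next = Phi_prime @ rewards                       # Eq. (7)
    q = M @ (rewards + gamma * q_next)                 # q = M (r + gamma q'), cf. Eq. (8)
    Phi = M @ (np.eye(len(sa)) + gamma * Phi_prime)    # so that q = Phi r
    return q, q_next, M, M_prime, Phi
```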

Influence function computation. Removal of a transition $\tau^{(j)}$ from the dataset can affect $q_i$ in two ways. First, $q_i$ is a mean over all of its neighbors, indexed by $k$, of $r^{(k)} + \gamma q'_k$. Thus, if $(x^{(j)}, a^{(j)})$ is one of the $M_{ij}^{-1}$ neighbors of $(x^{(i)}, a^{(i)})$, removing it from the dataset will change the value of $q_i$ by $\frac{q_i - (r^{(j)} + \gamma q'_j)}{M_{ij}^{-1} - 1}$. The special case of $M_{ij}^{-1} = 1$ does not pose a problem in the denominator: given that $i \neq j$ and every transition is a neighbor of itself, if $(x^{(j)}, a^{(j)})$ is a neighbor of $(x^{(i)}, a^{(i)})$, then $M_{ij}^{-1} \geq 2$.

The second way in which removing $\tau^{(j)}$ influences $q_i$ is through its effect on intermediary transitions. Removal of $\tau^{(j)}$ changes the estimated value $q'_k$ of all $(x'^{(k)}, \pi_e(x'^{(k)}))$ that $(x^{(j)}, a^{(j)})$ is a neighbor of by $\frac{q'_k - (r^{(j)} + \gamma q'_j)}{M'^{-1}_{kj} - 1}$. Multiplying this difference by $\gamma$ yields the difference in $q_k$ due to removal of $\tau^{(j)}$. A change in the value of $q_k$ is identical in its effect on the value estimation to changing $r^{(k)}$, a change which is mediated to $q_i$ through $\Phi_{ik}$. In the special case that $(x^{(j)}, a^{(j)})$ is the only neighbor of $(x'^{(k)}, \pi_e(x'^{(k)}))$, the value estimate $q'_k$ changes from $q_j$ to zero.

Combining the two ways in which removal of $\tau^{(j)}$ changes the estimated value $q_i$ yields the individual influence:

$$I_{i,j} = \mathbb{I}(\Delta_{ij}) \frac{q_i - \left( r^{(j)} + \gamma q'_j \right)}{M_{ij}^{-1} - 1} + \sum_{k: \Delta_{k'j}} I^{(k)}_{(i,j)}, \qquad (9)$$

where we define

$$I^{(k)}_{i,j} = \begin{cases} \gamma \Phi_{ik} \dfrac{q'_k - (r^{(j)} + \gamma q'_j)}{M'^{-1}_{kj} - 1} & M'^{-1}_{kj} > 1 \\[1ex] \gamma \Phi_{ik} \, q_j & M'^{-1}_{kj} = 1. \end{cases} \qquad (10)$$

Computational complexity. The matrix formulation of kernel-based FQE allows us to compute an individual influence in constant time, making influence analysis of the entire dataset possible in $O(N |D_0^*|)$ time. Furthermore, the sparsity of $\mathbf{M}$ and $\mathbf{M}'$ allows the FQE itself to be done in $O(N^2 T)$. See Appendix A.2 for a full discussion.
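Using the matrices returned by the sketch above, Eqs. (9)-(10) can be evaluated roughly as follows (an illustrative transcription, including the dead-end case exactly as written in Eq. 10):

```python
import numpy as np

def individual_influence(i, j, M, M_prime, Phi, q, q_next, r, gamma):
    """Influence of removing transition j on the value estimate at transition i (Eqs. 9-10)."""
    target_j = r[j] + gamma * q_next[j]

    # First term of Eq. (9): j leaves the neighbor average of (x^(i), a^(i))
    influence = 0.0
    if M[i, j] > 0 and i != j:
        n_i = 1.0 / M[i, j]                            # M_ij^{-1}, the number of neighbors of i
        influence += (q[i] - target_j) / (n_i - 1.0)

    # Second term: j leaves the neighbor average of each (x'^(k), pi_e(x'^(k))) it neighbors
    for k in np.nonzero(M_prime[:, j])[0]:
        n_k = 1.0 / M_prime[k, j]                      # M'_kj^{-1}
        if n_k > 1:
            influence += gamma * Phi[i, k] * (q_next[k] - target_j) / (n_k - 1.0)
        else:                                          # j was the only neighbor of (x'^(k), pi_e(x'^(k)))
            influence += gamma * Phi[i, k] * q[j]
    return influence
```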

5.2. Linear Least Squares FQE

In linear least squares FQE, the policy value function $q(x, a)$ is approximated by a linear function $q(x, a) = \boldsymbol{\psi}(x, a)^\top \mathbf{w}$, where $\boldsymbol{\psi}(x, a)$ is a $D$-dimensional feature vector for a state-action pair. Let $\mathbf{\Psi} \in \mathbb{R}^{N \times D}$ be the sample matrix of $\boldsymbol{\psi}(x, a)$. Define the vector $\boldsymbol{\psi}_\pi(x) = \boldsymbol{\psi}(x, \pi_e(x))$ and let $\mathbf{\Psi}_p \in \mathbb{R}^{N \times D}$ be the sample matrix of $\boldsymbol{\psi}_\pi(x')$. The least-squares solution for $\mathbf{w}$ is $(\mathbf{\Psi}^\top \mathbf{\Psi} - \gamma \mathbf{\Psi}^\top \mathbf{\Psi}_p)^{-1} \mathbf{\Psi}^\top \mathbf{r}$ (see Appendix B for the full derivation).

Let $\mathbf{w}_{-j}$ be the solution of linear least squares FQE after removing $\tau^{(j)}$, and $\mathbf{\Psi}_{-j}$, $\mathbf{r}_{-j}$, and $\mathbf{\Psi}_{p,-j}$ be the corresponding matrices and vectors without $\tau^{(j)}$. Then $\mathbf{w}_{-j} = (\mathbf{\Psi}_{-j}^\top \mathbf{\Psi}_{-j} - \gamma \mathbf{\Psi}_{-j}^\top \mathbf{\Psi}_{p,-j})^{-1} \mathbf{\Psi}_{-j}^\top \mathbf{r}_{-j}$. The key challenge in computing the influence function is computing $\mathbf{w}_{-j}$ in an efficient manner that avoids recomputing a costly matrix inverse for each $j$. Let $\mathbf{C}_{-j} = (\mathbf{\Psi}_{-j}^\top \mathbf{\Psi}_{-j} - \gamma \mathbf{\Psi}_{-j}^\top \mathbf{\Psi}_{p,-j})$ and $\mathbf{C} = (\mathbf{\Psi}^\top \mathbf{\Psi} - \gamma \mathbf{\Psi}^\top \mathbf{\Psi}_p)$. We compute $\mathbf{w}_{-j}$ as follows:

$$\mathbf{B}_j \leftarrow \mathbf{C}^{-1} + \frac{\mathbf{C}^{-1} \boldsymbol{\psi}_j \boldsymbol{\psi}_j^\top \mathbf{C}^{-1}}{1 - \boldsymbol{\psi}_j^\top \mathbf{C}^{-1} \boldsymbol{\psi}_j} \qquad (11)$$

$$(\mathbf{C}_{-j})^{-1} \leftarrow \mathbf{B}_j - \frac{\gamma \mathbf{B}_j \boldsymbol{\psi}_j \boldsymbol{\psi}_{\pi,j}^\top \mathbf{B}_j}{1 + \gamma \boldsymbol{\psi}_{p,j}^\top \mathbf{B}_j \boldsymbol{\psi}_j} \qquad (12)$$

$$\mathbf{w}_{-j} \leftarrow (\mathbf{C}_{-j})^{-1} \left( \mathbf{\Psi}^\top \mathbf{r} - r^{(j)} \boldsymbol{\psi}_j \right) \qquad (13)$$

The proof of correctness is given in Proposition 3 in Appendix B. The individual influence function is then simply:

$$I_{i,j} = \boldsymbol{\psi}(x^{(i)}, a^{(i)})^\top (\mathbf{w}_{-j} - \mathbf{w}). \qquad (14)$$

Computational complexity. The bottleneck of computing $\mathbf{w}_{-j}$ is the multiplication of $D \times D$ matrices, which takes at most $O(D^3)$. All the other matrix multiplications involving size $N$, e.g. $\mathbf{\Psi}^\top \mathbf{r}$, do not depend on $j$ and can be cached from the original OPE. Thus, the overall complexity for computing $I_{i,j}$ for all $i$ and $j$ is $O(N D^3)$. Assuming $N > D$, the complexity of the original OPE algorithm is $O(N D^2)$, where the bottleneck is computing $\mathbf{\Psi}^\top \mathbf{\Psi}$.
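A sketch of the full leave-one-out loop, combining Eqs. (11)-(14) with the caching described above, is given below; `Psi`, `Psi_p`, and `initial_idx` are assumed to be precomputed NumPy arrays, and both $\boldsymbol{\psi}_{\pi,j}$ and $\boldsymbol{\psi}_{p,j}$ are taken to be row $j$ of $\mathbf{\Psi}_p$.

```python
import numpy as np

def linear_fqe_influences(Psi, Psi_p, r, gamma, initial_idx):
    """Exact leave-one-out influences for linear least-squares FQE (Eqs. 11-14).

    Psi, Psi_p  -- (N, D) feature matrices for (x, a) and (x', pi_e(x'))
    initial_idx -- indices of the initial transitions D_0* over which I_j is averaged (Eq. 3)
    """
    N, D = Psi.shape
    C = Psi.T @ Psi - gamma * Psi.T @ Psi_p
    C_inv = np.linalg.inv(C)
    Psi_r = Psi.T @ r                                  # cached; does not depend on j
    w = C_inv @ Psi_r

    influences = np.zeros((N, N))                      # influences[i, j] = I_{i,j}
    for j in range(N):
        psi_j, psi_pj = Psi[j], Psi_p[j]
        # Two rank-1 (Sherman-Morrison) updates instead of a fresh D x D inverse
        B = C_inv + (C_inv @ np.outer(psi_j, psi_j) @ C_inv) / (1.0 - psi_j @ C_inv @ psi_j)
        C_mj_inv = B - gamma * (B @ np.outer(psi_j, psi_pj) @ B) / (1.0 + gamma * psi_pj @ B @ psi_j)
        w_mj = C_mj_inv @ (Psi_r - r[j] * psi_j)       # Eq. (13)
        influences[:, j] = Psi @ (w_mj - w)            # Eq. (14) for every i at once
    total = influences[initial_idx].mean(axis=0)       # Eq. (3): average over D_0*
    return influences, total
```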

6. Illustration of influence functions in a sequential setting

We now demonstrate and give intuition for how the influence behaves in an RL setting. For the demonstrations and experiments presented throughout the rest of the paper we use the kernel-based FQE method.

Several factors determine the influence of a transition. For transitions to be influential, they must have actions which are possible under the evaluation policy and form links in sequences which result in returns different from the expected value. Furthermore, transitions are more influential the fewer neighbors they have.

To demonstrate this intuition, we present in Figure 2 trajectories from a 2D continuous navigational domain. The agent starts at the origin and takes noisy steps of length 1 at 45° to the axes. The reward for a given transition is a function of the state and has the shape of a Gaussian centered along the approximate path of the agent, represented as the background heat map in Figure 2 (top), where observed transitions are drawn as black line segments. Because distances for the FQE are computed in the state-action space, in this example all actions in the data are the same to allow for distances to be visualized in 2D.

To illustrate how influence is larger for transitions with few neighbors, we removed most of the transitions in two regions (denoted II and III), and compared the distribution of influences in these regions with influences in a data-dense region (denoted I). Figure 2 (bottom) shows the distribution over 200 experiments (in each experiment, new data is generated) of the influences of transitions in the different regions. The influence is much higher for transitions in sparse regions with few neighbors, as can be seen by comparing the distributions in regions I and II. This is a desired property, as in analysis of the OPE process we would like to be able to present domain experts with transitions that have few neighbors, where the sampling variance of a particular transition could have a large effect on evaluation.

[Figure 2. Conceptual demonstration on a 2D domain. For transitions in the data to have high influence, they must agree with the evaluation policy and lead to rewarding regions in the state-action space. Additionally, the influence of transitions decreases with the number of their close neighbors.]

In region III, despite the fact that the observations examined also have very few neighbors, their influence is extremely low, as they do not lead to any regions where rewards are gained by the agent.

7. Experiments

7.1. Medical simulators

To demonstrate the different ways in which influence analysis can allow domain experts to either increase our confidence in the validity of OPE or identify instances where it is invalid, we first present results on a simulator of cancer dynamics. The 4-dimensional states of the simulator approximate the dynamics of tumor growth, with actions consisting of administration of chemotherapy at each timestep, which represents one month. For more details, see Ribba et al. (2012).

[Figure 3. Influence analysis for simulated cancer data. Analysis of synthetic cancer simulations demonstrates how influence analysis can differentiate between different diagnostics of the OPE process. Each panel plots the state variables C and Qp against time: (a) No influential transitions; (b) Dead end sequence; (c) Highlighted reliable transitions; (d) Highlighted problematic transitions.]

In Figure 3 we present four cases in which we attempt to evaluate the policy of treating a patient for 15 months and then discontinuing chemotherapy until the end of treatment at 30 months. Each subplot in Figure 3 shows two of the four state variables as a function of time, under different conditions which might make evaluation more difficult, such as differences in the behavior policy or stochasticity in the environment. The heavy black line represents the expectation of each state dimension at each time-step under the evaluation policy, while the grey lines represent observed transitions under the behavior policy, which is $\epsilon$-greedy with respect to the evaluation policy. In all figures, we highlight in red all influential transitions our method would have flagged for review by domain experts ($I_C = 0.05$).

Case 1: OPE seems reliable. Figure 3(a) represents a typical example where the OPE can easily be trusted. Despite the large difference between the evaluation and behavior policy ($\epsilon = 0.3$), enough trajectories have been observed in the data to allow for proper evaluation, and no transition is flagged as being too influential. The value estimation error in this example is less than 1%, and our method correctly labels this dataset as reliable.

Case 2: Unevaluatable. Figure 3(b) is similar in experimental conditions to (a) ($\epsilon = 0.3$ and deterministic transitions), but with less collected data, so that the observations needed to properly estimate the dynamics are not in the data. This can be seen by the lack of overlap between the observed transitions and the expected trajectory, and results in a 38% value estimation error. In real life we will not know what the expected trajectory under the evaluation policy looks like, and therefore will not be able to make the comparison and detect the lack of overlap between transitions under the evaluation and behavior policies. However, our method highlights a very influential sequence which terminates at a dead end, and thus will correctly flag this dataset as not sufficient for evaluation. Our method in this case is confident enough to dismiss the results of evaluation without need for domain experts, but can still inform experts on what type of data is lacking in order for evaluation to be feasible.

Case 3: Humans might help. In Figures 3(c-d), $\epsilon = 0.3$, but the dynamics have different levels of stochasticity. The less stochastic dynamics in 3(c) allow for relatively accurate evaluation (8% error), but our method identifies several influential transitions which must be presented to a domain expert. These transitions lie on the expected trajectory, and thus a clinician would verify that they represent a typical response of a patient to treatment. This is an example in which our method would allow a domain expert to verify the validity of the evaluation by examining the flagged influential transitions.

Conversely, in 3(d) some extreme outliers lead to a large estimation error (23% error). The influential transitions identified by our method are exactly those which start close to the expected trajectory but deviate significantly from the expected dynamics. A domain expert presented with these transitions would easily be able to note that the OPE heavily relies on atypical patients and rightly dismiss the validity of evaluation.

To summarize this section, we demonstrated that analysis of influences can both validate or invalidate the evaluation without need for domain experts, and in intermediate cases present domain experts with the correct queries required to gain confidence in the evaluation results or dismiss them.

7.2. Analysis of real ICU data - MIMIC-III

To show how influence analysis can help debug OPE for a challenging healthcare task, we consider the management of acutely hypotensive patients in the ICU. Hypotension is associated with high morbidity and mortality (Jones et al., 2006), but management of these patients is not standardized, as ICU patients are heterogeneous. Within critical care, there is scant high-quality evidence from randomized controlled trials to inform treatment guidelines (de Grooth et al., 2018; Girbes & de Grooth, 2019), which provides an opportunity for RL to help learn better treatment strategies. In collaboration with an intensivist, we use influence analysis to identify potential artefacts when performing OPE on a clinical dataset of acutely hypotensive patients.

Data and evaluation policy. Our data source is a subset of the publicly available MIMIC-III dataset (Johnson et al., 2016). See Appendix C for full details of the data preprocessing. Our final dataset consists of 346 patient trajectories (6777 transitions) for learning a policy and another 346 trajectories (6863 transitions) for evaluation of the policy via OPE and influence analysis.

Our state space consists of 29 relevant clinical variables, summarizing current physiological condition and past actions. The two main treatments for hypotension are administration of an intravenous (IV) fluid bolus or initiation of vasopressors. We bin doses of each treatment into 4 categories for "none", "low", "medium", and "high", so that the full action space consists of 16 discrete actions. Each reward is a function of the next blood pressure (MAP) and takes values in $[-1, 0]$. As an evaluation policy, we use the most common action of a state's 50 nearest neighbors. This setup is equivalent to constructing a decision assistance tool for clinicians that recommends the common practice action for patients, and using OPE combined with influence analysis to estimate the efficacy of such a tool. See Appendix C for more details on how we set up the RL problem formulation, and for the kernel function used to compute nearest neighbors.
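A sketch of such an evaluation policy is below, simplified to a plain Euclidean nearest-neighbor search with scikit-learn rather than the kernel described in Appendix C; the function name is illustrative, while the use of 50 neighbors follows the description above.

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import NearestNeighbors

def make_majority_nn_policy(states, actions, k=50):
    """Evaluation policy sketch: recommend the most common action among a state's
    k nearest neighbors in the training data.

    states  -- (N, 29) array of state features; actions -- (N,) array of discrete action ids.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(states)

    def pi_e(x):
        _, idx = nn.kneighbors(np.atleast_2d(x))
        return Counter(actions[idx[0]]).most_common(1)[0][0]

    return pi_e
```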

Presenting queries to a practicing intensivist. Running influence analysis flags 6 transitions that have high influence, which we define as a change of 5% or more in the final value estimate. We show 2 of these transitions in Figure 4 and the rest in Appendix D. While this analysis highlights individual transitions, our results figures display additional context before and after the suspect transition to help the clinician understand what might be going on.

In Figure 4, each column shows a transition flagged by influence analysis. The top two rows show actions taken (actual treatments in the top row and binned actions in the second row). The remaining three rows show the most important state variables that inform the clinicians' decisions: blood pressure (MAP), urine output, and level of consciousness (GCS). For these three variables, the abnormal range is shaded in red, where the blood pressure shading is darker, highlighting its direct relationship with the reward. Vertical grey lines represent timesteps, and the highlighted influential transition is shaded in grey.

[Figure 4. Influence analysis on our real-world dataset discovered six transitions in the evaluation dataset that were especially influential on our OPE. We display two of them in this figure (influence of flagged transition: 28% on the left, 12% on the right); see Appendix D for the remaining four. Rows show observed treatment amounts (fluid bolus size in ml; vasopressor rate in mcg/kg/hr), binned fluid and vasopressor actions (none/low/medium/high), mean arterial pressure (mmHg), urine output (ml), and Glasgow Coma Score, against action times in hours after ICU admission.]

Outcome: Identifying and removing an influential, buggy measurement. The two transitions in Figure 4 highlight potential problems in the dataset that have a large effect on our final OPE estimate. In the first transition (left), a large drop in blood pressure is observed at the starting time of this transition, potentially leaving the patient in a dangerous hypotensive state. Surprisingly, the patient received no treatment, and this unusual transition has a 29% influence on the OPE estimate. Given additional context just before and after the transition, though, it was clear to the clinician that this was due to a single low measurement in a sequence that was previously stable. Coupled with a stable GCS (the patient was conscious and alert) and a normal urine output, the intensivist easily determined the single low MAP was likely either a measurement error or a clinically insignificant transient episode of hypotension. After correcting the outlier MAP measurement to its most recent normal value (80 mmHg) and then rerunning FQE and the influence analysis, the transition no longer had high influence and was not flagged.

Outcome: Identifying and correcting a temporal misalignment. The second highlighted transition (right) features a sudden drop in GCS and worsening MAP values, indicating a sudden deterioration of the patient's state, but treatment is not administered until the next timestep. The intensivist attributed this finding to a time stamp recording error. Again, influence analysis identified an inconsistency in the original data which had undue impact on evaluation. After correcting the inconsistency by shifting the two fluid treatments back by one timestep each, we found that the transition no longer had high influence and was not flagged.

8. Discussion

A key aim of this paper is to formulate a framework for using domain expertise to help in evaluating the trustworthiness of OPE methods for noisy and confounded observational data. The motivation for this research direction is the intersection of two realities: for messy real-world applications, the data itself might never be enough; and domain experts will always need to be involved in the integration of decision support tools in the wild, so we should incorporate their expertise into the evaluation process. We showcased influence analysis as one way of performing this task for value-based OPE, but emphasize that such measures can and should be incorporated into other OPE methods as well. For example, importance sampling weights offer a straightforward way of highlighting important entire trajectories for IS-based techniques, and the dynamics learned by models in model-based OPE can be tested for their agreement with expert intuition.

We stress that research to integrate human input into OPE methods to increase their reliability complements, and does not replace, approaches for estimating error bounds and uncertainties over the errors of OPE estimates. The fact that traditional theoretical error bounds rely so heavily on assumptions which are generally impossible to verify from the data alone highlights the need for other techniques for gauging to what extent these assumptions hold.

References

Agniel, D., Kohane, I. S., and Weber, G. M. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. BMJ, 361:k1479, 2018.

Asfar, P., Meziani, F., Hamel, J.-F., Grelon, F., Megarbane, B., Anguel, N., Mira, J.-P., Dequin, P.-F., Gergaud, S., Weiss, N., et al. High versus low blood-pressure target in patients with septic shock. N Engl J Med, 370:1583–1593, 2014.

Cook, R. D. and Weisberg, S. Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics, 22(4):495–508, 1980.

Dann, C., Li, L., Wei, W., and Brunskill, E. Policy certificates: Towards accountable reinforcement learning. arXiv preprint arXiv:1811.03056, 2018.

de Grooth, H.-J., Postema, J., Loer, S. A., Parienti, J.-J., Oudemans-van Straaten, H. M., and Girbes, A. R. Unexplained mortality differences between septic shock trials: a systematic analysis of population characteristics and control-group mortality rates. Intensive Care Medicine, 44(3):311–322, 2018.

Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.

Girbes, A. R. J. and de Grooth, H.-J. Time to stop randomized and large pragmatic trials for intensive care medicine syndromes: the case of sepsis and acute respiratory distress syndrome. Journal of Thoracic Disease, 12(S1), 2019. ISSN 2077-6624. URL http://jtd.amegroups.com/article/view/33636.

Gottesman, O., Johansson, F., Meier, J., Dent, J., Lee, D., Srinivasan, S., Zhang, L., Ding, Y., Wihl, D., Peng, X., et al. Evaluating reinforcement learning algorithms in observational health settings. arXiv preprint arXiv:1805.12298, 2018.

Gottesman, O., Johansson, F., Komorowski, M., Faisal, A., Sontag, D., Doshi-Velez, F., and Celi, L. A. Guidelines for reinforcement learning in healthcare. Nature Medicine, 25(1):16–18, 2019a.

Gottesman, O., Liu, Y., Sussex, S., Brunskill, E., and Doshi-Velez, F. Combining parametric and nonparametric models for off-policy evaluation. In International Conference on Machine Learning, pp. 2366–2375, 2019b.

Hanna, J. P., Stone, P., and Niekum, S. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722, 2015.

Johnson, A. E., Pollard, T. J., Shen, L., Li-wei, H. L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., and Mark, R. G. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.

Jones, A. E., Yiannibas, V., Johnson, C., and Kline, J. A. Emergency department hypotension predicts sudden unexpected in-hospital mortality: a prospective cohort study. Chest, 130(4):941–946, 2006.

Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, pp. 1885–1894, 2017.

Komorowski, M., Celi, L. A., Badawi, O., Gordon, A. C., and Faisal, A. A. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24(11):1716–1720, 2018.

Lage, I., Ross, A., Gershman, S. J., Kim, B., and Doshi-Velez, F. Human-in-the-loop interpretability prior. In Advances in Neural Information Processing Systems, pp. 10159–10168, 2018.

Le, H. M., Voloshin, C., and Yue, Y. Batch policy learning under constraints. arXiv preprint arXiv:1903.08738, 2019.

Munos, R. and Moore, A. Variable resolution discretization in optimal control. Machine Learning, 49(2-3):291–323, 2002.

Oberst, M. and Sontag, D. Counterfactual off-policy evaluation with gumbel-max structural causal models. In International Conference on Machine Learning, pp. 4881–4890, 2019.

Precup, D. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80, 2000.

Ribba, B., Kaloshi, G., Peyre, M., Ricard, D., Calvez, V., Tod, M., Cajavec-Bernard, B., Idbaih, A., Psimaras, D., Dainese, L., et al. A tumor growth inhibition model for low-grade glioma treated with chemotherapy or radiotherapy. Clinical Cancer Research, 18(18):5071–5080, 2012.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Tamuz, O., Liu, C., Belongie, S., Shamir, O., and Kalai, A. T. Adaptively learning the crowd kernel. arXiv preprint arXiv:1105.1033, 2011.

Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

Thomas, P. S., da Silva, B. C., Barto, A. G., Giguere, S., Brun, Y., and Brunskill, E. Preventing undesirable behavior of intelligent machines. Science, 366(6468):999–1004, 2019.


A. Derivations for Kernel-Based FQE

A.1. Proof of Proposition 1

Proposition 1. For all transitions in the dataset, the values for corresponding state-action pairs are given by

$$\mathbf{q}'_t = \left( \sum_{t'=1}^{t} \gamma^{t'-1} \mathbf{M}'^{t'} \right) \mathbf{r} \equiv \mathbf{\Phi}'_t \mathbf{r} \qquad (15)$$

$$\mathbf{q}_t = \mathbf{M} \left( \sum_{t'=1}^{t} (\gamma \mathbf{M}')^{t'-1} \right) \mathbf{r} \equiv \mathbf{\Phi}_t \mathbf{r}, \qquad (16)$$

where $q'_{t,i}$ and $q_{t,i}$ are the estimated policy values at $(x'^{(i)}, \pi_e(x'^{(i)}))$ and $(x^{(i)}, a^{(i)})$, respectively, for the observed transition $i$.

Proof. We first prove Eq. (15) by induction. We start by noting that for a given observed transition $i$, averaging over all observations $j$ such that $\Delta_{ij}$ holds can be written as $\frac{1}{N_i}\sum_{j:\Delta_{ij}} = \sum_j \frac{\mathbb{I}(\Delta_{ij})}{N_i} = \sum_j M_{ij}$. Similarly, averaging over all $j$ such that $\Delta_{i'j}$ holds can be written as $\sum_j M'_{ij}$. Therefore, if $u(x, a)$ is some function over the state-action space and $\mathbf{u}$ is a vector containing the quantity $u_i = u(x^{(i)}, a^{(i)})$ for every $(x^{(i)}, a^{(i)})$, then the nearest-neighbors estimate of $u(x'^{(i)}, \pi_e(x'^{(i)}))$ is given by $[\mathbf{M}'\mathbf{u}]_i$.

Given the formulation above, for $t = 1$, $q'_{t,i}$ estimates the reward at $(x'^{(i)}, \pi_e(x'^{(i)}))$, and can be written as:

$$\mathbf{q}'_1 = \mathbf{M}'\mathbf{r}. \qquad (17)$$

For $t > 1$, assume $\mathbf{q}'_{t-1} = \left( \sum_{t'=1}^{t-1} \gamma^{t'-1} \mathbf{M}'^{t'} \right) \mathbf{r}$. Then

$$\begin{aligned}
\mathbf{q}'_t &= \mathbf{M}' \left( \mathbf{r} + \gamma \mathbf{q}'_{t-1} \right) \qquad (18) \\
&= \mathbf{M}' \left[ \mathbf{r} + \gamma \left( \sum_{t'=1}^{t-1} \gamma^{t'-1} \mathbf{M}'^{t'} \right) \mathbf{r} \right] \\
&= \left( \mathbf{M}' + \sum_{t'=1}^{t-1} \gamma^{t'} \mathbf{M}'^{t'+1} \right) \mathbf{r} \\
&= \left( \sum_{t'=0}^{t-1} \gamma^{t'} \mathbf{M}'^{t'+1} \right) \mathbf{r} \\
&= \left( \sum_{t'=1}^{t} \gamma^{t'-1} \mathbf{M}'^{t'} \right) \mathbf{r} \equiv \mathbf{\Phi}'_t \mathbf{r},
\end{aligned}$$

completing the proof of Eq. (15). To estimate $\mathbf{q}_t$, we write $q_{t,i} = \frac{1}{N_i} \sum_{j:\Delta_{ij}} \left( r^{(j)} + \gamma q'_{t-1,j} \right)$, or in matrix notation,

$$\begin{aligned}
\mathbf{q}_t &= \mathbf{M} \left( \mathbf{r} + \gamma \mathbf{q}'_{t-1} \right) \qquad (19) \\
&= \mathbf{M} \left( \mathbf{I} + \gamma \sum_{t'=1}^{t-1} \gamma^{t'-1} \mathbf{M}'^{t'} \right) \mathbf{r} \\
&= \mathbf{M} \left( \sum_{t'=1}^{t} (\gamma \mathbf{M}')^{t'-1} \right) \mathbf{r} \equiv \mathbf{\Phi}_t \mathbf{r}.
\end{aligned}$$

A.2. Computational complexity.

Computation of a single influence value $I_{i,j}$ requires summation over all transitions $k$ that satisfy $\Delta_{k'j}$. Denote the number of such neighbors by $N^*_{j'}$. (Note that $N^*_{j'}$, which counts all $k$ that satisfy $\Delta_{k'j}$, is subtly different from the quantity $N_{j'}$ introduced in Section 5.1, which counts all $k$ that satisfy $\Delta_{j'k}$.) We expect $N^*_{j'}$ to be small and not to scale with the size of the dataset, and $I_{i,j}$ is also inversely proportional to $N^*_{j'}$. Thus, if we only compute the influence of transitions such that $N^*_{j'} < \frac{v_{\max}}{v I_C} \equiv N^*_{j',c}$, where $v_{\max}$ is the maximum possible value, we are guaranteed not to miss any transitions with influence larger than our threshold $I_C$. Since $N^*_{j',c}$ does not scale with the size of the data, computation of a single individual influence can effectively be done in constant time. Performing influence analysis on a full dataset requires computing the influences of all transitions on all initial transitions, and therefore takes $O(N |D_0^*|)$ time.

In our matrix formulation, the FQE evaluation itself is bottlenecked by computing the matrix $\mathbf{\Phi}$, which includes computation of powers of $\mathbf{M}'$. Because $\mathbf{M}'$ is a sparse matrix (each row $i$ only has $N_i$ nonzero elements), the matrix multiplication itself can be done in $O(N^2)$ rather than $O(N^3)$ time, and the entire evaluation is done in $O(N^2 T)$ time. Importantly, the influence analysis analyzing all transitions has lower complexity than the OPE, and should not significantly increase the computational cost of the evaluation pipeline.

B. Derivations for Linear Least-Squares FQE

Proposition 2. The linear least squares solution of fitted Q-evaluation is $(\mathbf{\Psi}^\top \mathbf{\Psi} - \gamma \mathbf{\Psi}^\top \mathbf{\Psi}_p)^{-1} \mathbf{\Psi}^\top \mathbf{r}$.

Proof. The least-squares solution for the parameter vector $\mathbf{w}$ can be found by minimizing the following squared error of the Bellman equation for all $(x, a)$ in the dataset:

$$\left( q(x, a) - r(x, a) - \gamma q(x', \pi_e(x')) \right)^2 \qquad (20)$$

Plugging in $q(x, a) = \boldsymbol{\psi}(x, a)^\top \mathbf{w}$, the squared error is

$$\left( \boldsymbol{\psi}(x, a)^\top \mathbf{w} - r(x, a) - \gamma \boldsymbol{\psi}_\pi(x')^\top \mathbf{w} \right)^2 \qquad (21)$$

By definition of $\mathbf{\Psi}$ and $\mathbf{\Psi}_p$, the mean squared error over the $N$ samples is:

$$\| \mathbf{\Psi} \mathbf{w} - \mathbf{r} - \gamma \mathbf{\Psi}_p \mathbf{w} \|_2^2 \qquad (22)$$

The least-squares solution is:

$$\begin{aligned}
\mathbf{w} &= (\mathbf{\Psi}^\top \mathbf{\Psi})^{-1} \mathbf{\Psi}^\top (\mathbf{r} + \gamma \mathbf{\Psi}_p \mathbf{w}) \qquad &(23) \\
(\mathbf{\Psi}^\top \mathbf{\Psi}) \mathbf{w} &= \mathbf{\Psi}^\top (\mathbf{r} + \gamma \mathbf{\Psi}_p \mathbf{w}) \qquad &(24) \\
(\mathbf{\Psi}^\top \mathbf{\Psi} - \gamma \mathbf{\Psi}^\top \mathbf{\Psi}_p) \mathbf{w} &= \mathbf{\Psi}^\top \mathbf{r} \qquad &(25) \\
\mathbf{w} &= (\mathbf{\Psi}^\top \mathbf{\Psi} - \gamma \mathbf{\Psi}^\top \mathbf{\Psi}_p)^{-1} \mathbf{\Psi}^\top \mathbf{r} \qquad &(26)
\end{aligned}$$

Proposition 3. Let $\mathbf{C}_{-j} = (\mathbf{\Psi}_{-j}^\top \mathbf{\Psi}_{-j} - \gamma \mathbf{\Psi}_{-j}^\top \mathbf{\Psi}_{p,-j})$ and $\mathbf{C} = (\mathbf{\Psi}^\top \mathbf{\Psi} - \gamma \mathbf{\Psi}^\top \mathbf{\Psi}_p)$. Then

$$\mathbf{w}_{-j} = (\mathbf{C}_{-j})^{-1} \left( \mathbf{\Psi}^\top \mathbf{r} - r^{(j)} \boldsymbol{\psi}_j \right)$$

$$(\mathbf{C}_{-j})^{-1} = \mathbf{B}_j - \frac{\gamma \mathbf{B}_j \boldsymbol{\psi}_j \boldsymbol{\psi}_{\pi,j}^\top \mathbf{B}_j}{1 + \gamma \boldsymbol{\psi}_{p,j}^\top \mathbf{B}_j \boldsymbol{\psi}_j},$$

where

$$\mathbf{B}_j = \mathbf{C}^{-1} + \frac{\mathbf{C}^{-1} \boldsymbol{\psi}_j \boldsymbol{\psi}_j^\top \mathbf{C}^{-1}}{1 - \boldsymbol{\psi}_j^\top \mathbf{C}^{-1} \boldsymbol{\psi}_j}.$$

Proof. By the least-squares solution of FQE, $\mathbf{w}_{-j}$ equals $(\mathbf{\Psi}_{-j}^\top \mathbf{\Psi}_{-j} - \gamma \mathbf{\Psi}_{-j}^\top \mathbf{\Psi}_{p,-j})^{-1} \mathbf{\Psi}_{-j}^\top \mathbf{r}_{-j}$. Since $\mathbf{\Psi}_{-j}^\top \mathbf{r}_{-j} = \mathbf{\Psi}^\top \mathbf{r} - r^{(j)} \boldsymbol{\psi}_j$, we have that $\mathbf{w}_{-j} = (\mathbf{C}_{-j})^{-1} \left( \mathbf{\Psi}^\top \mathbf{r} - r^{(j)} \boldsymbol{\psi}_j \right)$. Then

$$\mathbf{C}_{-j} = \mathbf{C} - \boldsymbol{\psi}_j \boldsymbol{\psi}_j^\top + \gamma \boldsymbol{\psi}_j \boldsymbol{\psi}_{\pi,j}^\top, \qquad (27)$$

because

$$\mathbf{\Psi}_{-j}^\top \mathbf{\Psi}_{-j} = \mathbf{\Psi}^\top \mathbf{\Psi} - \boldsymbol{\psi}_j \boldsymbol{\psi}_j^\top \qquad (28)$$

$$\mathbf{\Psi}_{-j}^\top \mathbf{\Psi}_{p,-j} = \mathbf{\Psi}^\top \mathbf{\Psi}_p - \boldsymbol{\psi}_j \boldsymbol{\psi}_{\pi,j}^\top \qquad (29)$$

This indicates that $\mathbf{C}_{-j}$ equals $\mathbf{C}$ plus two rank-1 matrices. Fortunately, we can store $\mathbf{C}^{-1}$ when we compute $\mathbf{w}$ and $\mathbf{q}$. The following result, known as the Sherman–Morrison formula, allows us to compute $\mathbf{C}_{-j}^{-1}$ from $\mathbf{C}^{-1}$ in an efficient way. For any invertible matrix $\mathbf{A} \in \mathbb{R}^{d \times d}$ and vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$:

$$(\mathbf{A} + \mathbf{u}\mathbf{v}^\top)^{-1} = \mathbf{A}^{-1} - \frac{\mathbf{A}^{-1} \mathbf{u} \mathbf{v}^\top \mathbf{A}^{-1}}{1 + \mathbf{v}^\top \mathbf{A}^{-1} \mathbf{u}} \qquad (30)$$

Then, if we define $\mathbf{B}_j \equiv \left( \mathbf{C} - \boldsymbol{\psi}_j \boldsymbol{\psi}_j^\top \right)^{-1}$, we have that

$$\mathbf{B}_j = \mathbf{C}^{-1} + \frac{\mathbf{C}^{-1} \boldsymbol{\psi}_j \boldsymbol{\psi}_j^\top \mathbf{C}^{-1}}{1 - \boldsymbol{\psi}_j^\top \mathbf{C}^{-1} \boldsymbol{\psi}_j} \qquad (31)$$

$$(\mathbf{C}_{-j})^{-1} = \mathbf{B}_j - \frac{\gamma \mathbf{B}_j \boldsymbol{\psi}_j \boldsymbol{\psi}_{\pi,j}^\top \mathbf{B}_j}{1 + \gamma \boldsymbol{\psi}_{p,j}^\top \mathbf{B}_j \boldsymbol{\psi}_j} \qquad (32)$$
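The two rank-1 updates in Eqs. (31)-(32) can be checked numerically against a direct inverse on synthetic data; the snippet below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, gamma, j = 50, 5, 0.9, 7
Psi, Psi_p = rng.normal(size=(N, D)), rng.normal(size=(N, D))
C = Psi.T @ Psi - gamma * Psi.T @ Psi_p
C_inv = np.linalg.inv(C)

psi_j, psi_pj = Psi[j], Psi_p[j]
B = C_inv + (C_inv @ np.outer(psi_j, psi_j) @ C_inv) / (1 - psi_j @ C_inv @ psi_j)           # Eq. (31)
C_mj_inv = B - gamma * (B @ np.outer(psi_j, psi_pj) @ B) / (1 + gamma * psi_pj @ B @ psi_j)  # Eq. (32)

# Direct computation of C_{-j} for comparison
Psi_mj, Psi_p_mj = np.delete(Psi, j, axis=0), np.delete(Psi_p, j, axis=0)
C_mj = Psi_mj.T @ Psi_mj - gamma * Psi_mj.T @ Psi_p_mj
assert np.allclose(C_mj_inv, np.linalg.inv(C_mj))
```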

C. Preprocessing and experimental details for the MIMIC-III acute hypotension dataset

In this section, we describe the preprocessing we performed on the raw MIMIC-III database to convert it into a dataset amenable to modeling with RL. This preprocessing procedure was done in close consultation with the intensivist collaborator on our team.

C.1. Cohort Selection

We use MIMIC-III v1.4 (Johnson et al., 2016), which contains information from about 60,000 intensive care unit (ICU) admissions to Beth Israel Deaconess Medical Center. We filter the initial database on the following features: admissions where data was collected using the Metavision clinical information system; admissions to a medical ICU (MICU); adults (age ≥ 18 years); initial ICU stays for hospital admissions with multiple ICU stays; ICU stays with a total length of stay of at least 24 hours; and ICU stays where there are 7 or more mean arterial pressure (MAP) values of 65 mmHg or less, indicating probable acute hypotension. For long ICU stays, we limit to only using information captured during the initial 48 hours after admission, as our intensivist advised that care for hypotension during later periods of an ICU stay often looks very different. After this filtering, we have a final cohort consisting of 1733 distinct ICU admissions. For computational convenience, we further down-sample this cohort, using 20% (346) of the ICU stays to learn a policy, and another 20% (346) to evaluate the policy via FQE and our proposed influence analysis.

C.2. Clinical Variables Considered

Given our final cohort of patients admitted to the ICU, we next discuss the different clinical variables that we extract that are relevant to our task of acute hypotension management.

The two first-line treatments are intravenous (IV) fluid bolus therapy and vasopressor therapy. We construct fluid bolus variables in the following way:

1. We filter all fluid administration events to only include NaCl 0.9%, lactated ringers, or blood transfusions (packed red blood cells, fresh frozen plasma, or platelets).

2. Since a fluid bolus should be a nontrivial amount of fluid administered over a brief period of time, we further filter to only fluid administrations with a volume of at least 250 mL and over a period of 60 minutes or shorter.

Each fluid bolus has an associated volume and a starting time (since a bolus is given quickly / near-instantaneously, we ignore the end-time of the administration). To construct vasopressors, we first normalize vasopressor infusion rates across different drug types as follows, using the same normalization as in Komorowski et al. (2018); a short code sketch of these rules follows the list:

1. Norepinephrine: this is our "base" drug, as it is the most commonly administered. We normalize all other drugs in terms of this drug. Units for vasopressor rates are in mcg per kg body weight per minute for all drugs except vasopressin.

2. Vasopressin: the original units are in units/min. We first clip any values above 0.2 units/min, and then multiply the final rates by 5.

3. Phenylephrine: we multiply the original rate by 0.45.

4. Dopamine: we multiply the original rate by 0.01.

5. Epinephrine: this drug is on the same scale as norepinephrine and is not rescaled.
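The sketch below implements this per-drug normalization. The function name and the assumption that all rates (except vasopressin) are already in mcg/kg/min are illustrative, not the exact code used for the paper.

```python
def normalize_vasopressor_rate(drug: str, rate: float) -> float:
    """Convert a raw infusion rate to a norepinephrine-equivalent mcg/kg/min.

    Follows the per-drug rules above (after Komorowski et al., 2018). Rates are
    assumed to already be in mcg/kg/min, except vasopressin (units/min).
    """
    drug = drug.lower()
    if drug in ("norepinephrine", "epinephrine"):
        return rate                      # base drug / same scale, no rescaling
    if drug == "vasopressin":
        return min(rate, 0.2) * 5.0      # clip at 0.2 units/min, then multiply by 5
    if drug == "phenylephrine":
        return rate * 0.45
    if drug == "dopamine":
        return rate * 0.01
    raise ValueError(f"Unknown vasopressor: {drug}")
```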

As vasopressors are given as a continuous infusion, each administration has both a start time and a stop time, as well as potentially many intermediate times at which the rate is changed. More than one vasopressor may also be administered at once.

We also use 11 other clinical variables as part of the state space in our application: serum creatinine, FiO2, lactate, urine output, ALT, AST, diastolic/systolic blood pressure, mean arterial pressure (MAP; the main blood pressure variable of interest), PO2, and the Glasgow Coma Score (GCS).

C.3. Selecting Action Times

Given a final cohort, clinical variables, and treatment variables, we still must determine how to discretize time and at which specific time points actions should be chosen. To arrive at a final set of "action times" for a specific ICU stay, we use the following heuristic-based algorithm (a sketch of these steps follows the list):

1. We start by including all times at which a treatment is started, stopped, or modified.

2. Next, we remove consecutive treatment times if there are no MAP measurements between treatments. We do this because without at least one MAP measurement in between treatments, we would not be able to assess what effect the treatment had on blood pressure. This leaves us with a set of time points when treatments were started or modified.

3. At many time points, the clinician consciously chooses not to take an action. Unfortunately, this information is not generally recorded (although, on occasion, it may exist in clinical notes). As a proxy, we successively add to our existing set of "action times" any time point at which an abnormally low MAP is observed (< 60mmHg) and there are no other "action times" within a 1 hour window either before or after. This captures the relatively fine granularity with which a physician may choose not to treat despite some degree of hypotension.

4. Last, we add additional time points to fill in any large gaps where no "action times" exist. We do this by adding time points between existing "action times" until there are no longer any gaps greater than 4 hours between actions. This makes some clinical sense, as patients in the ICU are being monitored relatively closely, but if they are more stable, their treatment decisions will be made on a coarser time scale.
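A minimal sketch of these four steps is given below. The 60mmHg threshold and the 1-hour / 4-hour windows come from the text; the argument names, data layout, and tie-breaking details are illustrative assumptions.

```python
import numpy as np

def select_action_times(treatment_times, map_times, map_values,
                        low_map=60.0, isolation_window=1.0, max_gap=4.0):
    """Heuristic action-time selection; all times in hours since ICU admission.

    treatment_times       : times at which any treatment was started, stopped, or modified.
    map_times, map_values : times and values of the MAP measurements (sorted by time).
    """
    map_times = np.asarray(map_times, dtype=float)
    map_values = np.asarray(map_values, dtype=float)

    # Steps 1-2: start from all treatment change times, but keep a time only
    # if at least one MAP was measured since the previously kept time.
    times = []
    for t in sorted(treatment_times):
        if not times or np.any((map_times > times[-1]) & (map_times < t)):
            times.append(float(t))

    # Step 3: add isolated low-MAP time points (< 60mmHg, no action time within 1 hour).
    for t, v in zip(map_times, map_values):
        if v < low_map and all(abs(t - s) > isolation_window for s in times):
            times.append(float(t))
    times = sorted(times)
    if not times:
        return []

    # Step 4: fill gaps larger than 4 hours with evenly spaced additional time points.
    filled = [times[0]]
    for t in times[1:]:
        prev, gap = filled[-1], t - filled[-1]
        if gap > max_gap:
            n_extra = int(np.ceil(gap / max_gap)) - 1
            filled += [prev + gap * (k + 1) / (n_extra + 1) for k in range(n_extra)]
        filled.append(t)
    return filled
```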

Now that we have a set of action times for each trajectory, we can count the total number of transitions in our training and evaluation datasets (both of which consist of 346 trajectories). The training trajectories contain a total of 6777 transitions, while there are 6863 total transitions in the evaluation data. Trajectories vary in length from a minimum of 7 transitions to a maximum of 49, with 16, 18, and 23 transitions comprising the 25%, 50%, and 75% quantiles, respectively.

C.4. Action Space Construction

Given treatment timings, doses, and manually identified "action times" at which we want to assess what type of clinical decision was made, we can now construct our action space. We choose to operate in a discrete action space, which means we need to decide how to bin each of the continuous-valued treatment amounts.

Binning of IV fluids is more natural and easier, as fluid boluses are generally given in discrete amounts. The most common bolus sizes are 500mL and 1000mL, so we bin fluid bolus volumes into the following 4 bins, which correspond to "none"/"low"/"medium"/"high" (in mL): {0, [250, 500), [500, 1000), [1000, ∞)}, although in practice very few boluses of more than 2L are ever given. Given this binning scheme, we simply add up the total amount of fluid administered between any two adjacent action times to determine which discrete fluid amount the action should be coded as.

Binning of vasopressors is slightly more complex. These drugs are dosed at a specific rate, there may be many rate changes made between action times, and sometimes several vasopressors are given at once. We chose to first add up the cumulative amount of (normalized) vasopressor drug administered between action times, and then normalize this amount by the size of the time window between action times to account for the irregular spacing. Finally, we also bin vasopressors into 4 discrete bins corresponding to "none"/"low"/"medium"/"high" amounts: {0, (0, 8.1), [8.1, 21.58), [21.58, ∞)}. The relevant units here are total mcg of drug given per hour, per kg body weight. Since the distribution of values for vasopressors is not as naturally discrete, we chose our bin edges using the 33.3% and 66.7% quantiles of dose amounts.

In the end, we have an action space with 16 possible discrete actions, considering all combinations of the 4 vasopressor amounts and the 4 fluid bolus amounts; a sketch of this discretization is given below.
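As an illustration, the following sketch bins the aggregated per-window treatment amounts and combines them into a single discrete action index. The bin edges come from the text; the particular encoding (fluid bin times 4 plus vasopressor bin) is our own illustrative choice, not necessarily the ordering used in the paper.

```python
import numpy as np

# Bin edges from the text: fluids in mL per action window,
# vasopressors in normalized mcg/kg per hour.
FLUID_EDGES = [250.0, 500.0, 1000.0]     # -> none / low / medium / high
PRESSOR_EDGES = [1e-9, 8.1, 21.58]       # tiny first edge so exactly zero maps to "none"

def discretize_action(fluid_ml: float, pressor_mcg_kg_hr: float) -> int:
    """Map continuous treatment amounts to one of 16 discrete actions."""
    fluid_bin = int(np.digitize(fluid_ml, FLUID_EDGES))               # 0..3
    pressor_bin = int(np.digitize(pressor_mcg_kg_hr, PRESSOR_EDGES))  # 0..3
    return 4 * fluid_bin + pressor_bin                                # 0..15

# Example: a 500mL bolus with no vasopressor -> fluid bin 2 ("medium"), action index 8.
assert discretize_action(500.0, 0.0) == 8
```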

C.5. State Construction

Given a patient cohort, decision/action times, and discrete actions, we are now ready to construct a state space. For simplicity in this initial work, we start with the 11 clinical time series variables previously listed. If a variable has never been measured, we use the population median as a placeholder. If a variable has been measured before, we use the most recent measurement. The sole exception to this is the 3 blood pressure variables: for these, we instead use the minimum (i.e., worst) value observed since the last action.

We add to these a number of indicator variables that denote whether a particular variable was recently measured or not. Due to the strongly missing-not-at-random nature of clinical time series, there is often considerable signal in knowing that certain types of measurements were recently taken, irrespective of the measurement values (Agniel et al., 2018). We choose to construct indicator variables denoting whether or not a urine output was taken since the last action time, and whether a GCS was recorded since the last action. We also include state features denoting whether the following labs/vitals were ever ordered: creatinine, FiO2, lactate, ALT, AST, PO2. We do not include these indicators for all 11 clinical variables, as most of the vitals are recorded at least once an hour, and sometimes even more frequently. In total, 8 indicators comprise part of our state space.

Last, we include 10 additional variables that summarize past treatments administered, if any. We first include 6 indicator variables (3 for each treatment type) denoting which dose of fluid and vasopressor, if any, was chosen at the last action time. Then, for each treatment type, we include two final features that summarize actual past treatment amounts: the total amount of this treatment administered up until the current time, and the total amount administered within the last 8 actions.

In total, our final state space has 29 dimensions (a sketch of how the clinical portion is assembled is given below). In future work we plan to explore richer state representations.
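A minimal sketch of assembling the 11 clinical features at a given action time, under the rules described above. The variable names and data layout (measurement lists assumed sorted by time) are assumptions for illustration only.

```python
import numpy as np

CLINICAL_VARS = ["creatinine", "fio2", "lactate", "urine_output", "alt", "ast",
                 "diastolic_bp", "systolic_bp", "map", "po2", "gcs"]
BP_VARS = {"diastolic_bp", "systolic_bp", "map"}

def clinical_state(history, medians, t_prev, t_now):
    """Build the 11 clinical features of the state at action time t_now.

    history : dict mapping variable name -> time-sorted list of (time, value) measurements.
    medians : dict of population medians, used when a variable was never measured.
    t_prev, t_now : previous and current action times.
    """
    state = []
    for var in CLINICAL_VARS:
        measured = [(t, v) for t, v in history.get(var, []) if t <= t_now]
        if not measured:
            state.append(medians[var])            # never measured: population median
        elif var in BP_VARS:
            # blood pressures: worst (minimum) value since the last action,
            # falling back to the most recent value if none was taken in the window
            in_window = [v for t, v in measured if t > t_prev]
            state.append(min(in_window) if in_window else measured[-1][1])
        else:
            state.append(measured[-1][1])         # most recent measurement
    return np.array(state)
```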

C.6. Reward Function Construction

In this preliminary work, we use a simple reward that is a piecewise linear function of the MAP in the next state. In particular, the reward takes on a value of −1 at 40mmHg, the lowest attainable MAP in the data. It increases linearly to −0.15 at 55mmHg, then linearly to −0.05 at 60mmHg, and achieves a maximum value of 0 at 65mmHg, a commonly used target for blood pressure in the ICU (Asfar et al., 2014). However, if a patient has a urine output of 30mL/hour or higher, then the reward for any MAP value of 55mmHg or higher is reset to 0. This attempts to mimic the fact that a clinician will not be too concerned if a patient is slightly hypotensive but otherwise stable, since a modest urine output indicates that the modest hypotension is not a real problem. A sketch of this reward function is given below.
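The following sketch linearly interpolates between the anchor points described above; the function name and the clipping behavior outside the [40, 65] mmHg range are assumptions.

```python
import numpy as np

# Anchor points of the piecewise linear reward: (MAP in mmHg, reward).
MAP_POINTS = [40.0, 55.0, 60.0, 65.0]
REWARD_POINTS = [-1.0, -0.15, -0.05, 0.0]

def reward(next_map: float, urine_output_ml_per_hr: float) -> float:
    """Piecewise linear reward on next-state MAP, with a urine-output override."""
    if urine_output_ml_per_hr >= 30.0 and next_map >= 55.0:
        return 0.0                       # adequate urine output: mild hypotension is tolerated
    # np.interp clips to the endpoint values outside [40, 65] mmHg.
    return float(np.interp(next_map, MAP_POINTS, REWARD_POINTS))
```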

C.7. Choice of Kernel Function

In order to use kernel-based FQE, we need a kernel that captures similarity between states. In consultation with our intensivist collaborator, we chose a simple weighted Euclidean distance, where each state variable receives a different weight based on its estimated importance to the clinical problem. We show all weights in Table 1.

Since technically we need a kernel over state-action pairs for FQE and the influence analysis, we augment our kernel with extremely large weights on the action so that effectively the kernel only compares pairs (s, a) and (s′, a′) for which a = a′. Other choices should be made for continuous action spaces. A sketch of one way to implement this weighting is given below.
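The sketch below implements a weighted state-action distance of this form. The ordering of the weight vector is assumed to match Table 1, and the additive action penalty is one illustrative way to realize the "extremely large weights" on the action; the paper only specifies the weighted Euclidean distance and the effective action matching.

```python
import numpy as np

# Per-dimension weights from Table 1 (order assumed to match the state vector layout).
KERNEL_WEIGHTS = np.array([
    3, 15, 10, 15, 5, 5, 5, 5, 15, 3, 5, 15, 5,   # clinical values + recency indicators
    3, 15, 10, 5, 5, 3,                           # "ever taken?" indicators
    15, 15, 15, 15, 15, 15,                       # last-action dose indicators
    15, 15, 15, 15,                               # treatment totals (so far / last 8 actions)
], dtype=float)

def weighted_distance(s1, a1, s2, a2, action_penalty=1e6):
    """Weighted Euclidean distance over (state, action) pairs.

    State dimensions are weighted per Table 1; a huge penalty is added when the
    discrete actions differ, so in practice only pairs with a1 == a2 are neighbors.
    """
    d2 = np.sum(KERNEL_WEIGHTS * (np.asarray(s1) - np.asarray(s2)) ** 2)
    if a1 != a2:
        d2 += action_penalty
    return np.sqrt(d2)
```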

C.8. Hyperparameters

We use the training set of 346 trajectories (6777 transitions) to learn a policy, which we then evaluate using FQE and influence analysis. In particular, we learn a deterministic policy by taking the most common action among the 50 nearest neighbors of a given state, with respect to the kernel in Table 1. We use a discount of γ = 1 so that all time steps are treated equally, and use a neighborhood radius of 7 for finding nearest neighbors in FQE. Lastly, for the influence analysis, we use a threshold of 0.05, or 5%, so that transitions which affect the FQE value estimate by more than 5% are flagged for expert review. A sketch of the nearest-neighbor policy is given below.
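A minimal sketch of such a nearest-neighbor policy, taking the weighted distance function as an argument so the block is self-contained; this is illustrative rather than the exact implementation used in the paper.

```python
import numpy as np
from collections import Counter

def knn_policy(state, dataset_states, dataset_actions, distance_fn, k=50):
    """Deterministic policy: most common action among the k nearest training neighbors.

    dataset_states / dataset_actions : states and discrete actions observed in the
    training trajectories; distance_fn(s1, s2) is the weighted state distance.
    """
    dists = np.array([distance_fn(state, s) for s in dataset_states])
    neighbors = np.argsort(dists)[:k]
    return Counter(np.asarray(dataset_actions)[neighbors]).most_common(1)[0][0]
```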

D. Additional results from MIMIC-III acute hypotension dataset

In the main body of the paper, we showed two qualitative results figures displaying 2 of the 6 highly influential transitions flagged by influence analysis. In this section, we show the remaining 4 influential transitions.


Table 1. Weights for each state variable in our kernel function.

State Variable                        | Kernel Weight
--------------------------------------|--------------
Creatinine                            | 3
FiO2                                  | 15
Lactate                               | 10
Urine Output                          | 15
Urine Output since last action?       | 5
ALT                                   | 5
AST                                   | 5
Diastolic BP                          | 5
MAP                                   | 15
PO2                                   | 3
Systolic BP                           | 5
GCS                                   | 15
GCS since last action?                | 5
Creatinine ever taken?                | 3
FiO2 ever taken?                      | 15
Lactate ever taken?                   | 10
ALT ever taken?                       | 5
AST ever taken?                       | 5
PO2 ever taken?                       | 3
Low vasopressor done last time?       | 15
Medium vasopressor done last time?    | 15
High vasopressor done last time?      | 15
Low fluid done last time?             | 15
Medium fluid done last time?          | 15
High fluid done last time?            | 15
Total vasopressors so far             | 15
Total fluids so far                   | 15
Total vasopressors, last 8 actions    | 15
Total fluids, last 8 actions          | 15

[Figure 5: panels show observed treatment amounts (fluid bolus size, mL; vasopressor rate, mcg/kg/hr), binned fluid and vasopressor actions (none/low/medium/high), mean arterial pressure (mmHg), urine output (mL), and Glasgow Coma Score, plotted against action times (hours after ICU admission). Influence of flagged transition: 6%.]

Figure 5. An additional example identified by our influence analysis as having an especially high effect on the OPE value estimate.


[Figure 6: panels show observed treatment amounts (fluid bolus size, mL; vasopressor rate, mcg/kg/hr), binned fluid and vasopressor actions (none/low/medium/high), mean arterial pressure (mmHg), urine output (mL), and Glasgow Coma Score, plotted against action times (hours after ICU admission). Influence of flagged transition: 5%.]

Figure 6. An additional example identified by our influence analysis as having an especially high effect on the OPE value estimate. Note that this transition is from the same trajectory as the influential transition highlighted in Figure 7.

[Figure 7: panels show observed treatment amounts (fluid bolus size, mL; vasopressor rate, mcg/kg/hr), binned fluid and vasopressor actions (none/low/medium/high), mean arterial pressure (mmHg), urine output (mL), and Glasgow Coma Score, plotted against action times (hours after ICU admission). Influence of flagged transition: 6%.]

Figure 7. An additional example identified by our influence analysis as having an especially high effect on the OPE value estimate. Note that this transition is from the same trajectory as the influential transition highlighted in Figure 6.


[Figure 8: panels show observed treatment amounts (fluid bolus size, mL; vasopressor rate, mcg/kg/hr), binned fluid and vasopressor actions (none/low/medium/high), mean arterial pressure (mmHg), urine output (mL), and Glasgow Coma Score, plotted against action times (hours after ICU admission). Influence of flagged transition: 19%.]

Figure 8. An additional example identified by our influence analysis as having an especially high effect on the OPE value estimate.