Methods for Reinforcement Learning in
Clinical Decision Support
Niranjani Prasad
A Dissertation
Presented to the Faculty
of Princeton University
in Candidacy for the Degree
of Doctor of Philosophy
Recommended for Acceptance
by the Department of
Computer Science
Adviser: Professor Barbara E. Engelhardt
September 2020
© Copyright by Niranjani Prasad, 2020.
All rights reserved.
Abstract
The administration of routine interventions, from breathing support to pain manage-
ment, constitutes a major part of inpatient care. Thoughtful treatment is crucial to
improving patient outcomes and minimizing costs, but these interventions are often
poorly understood, and clinical opinion on best protocols can vary significantly.
Through a series of case studies of key critical care interventions, this thesis devel-
ops a framework for clinician-in-loop decision support. The first of these explores the
weaning of patients from mechanical ventilation: admissions are modelled as Markov
decision processes (MDPs), and model-free batch reinforcement learning algorithms
are employed to learn personalized regimes of sedation and ventilator support that
show promise in improving outcomes when assessed against current clinical practice.
The second part of this thesis is directed towards effective reward design when for-
mulating clinical decisions as a reinforcement learning task. In tackling the problem
of redundant testing in critical care, methods for Pareto-optimal reinforcement learn-
ing are integrated with known procedural constraints in order to consolidate multiple,
often conflicting, clinical goals and produce a flexible optimized ordering policy.
The challenges here are probed further to examine how decisions by care providers,
as observed in available data, can be used to restrict the possible convex combinations
of objectives in the reward function, to those that yield policies reflecting what we
implicitly know from the data about reasonable behaviour for a task, and that allow
for high-confidence off-policy evaluation. The proposed approach to reward design is
demonstrated through synthetic domains as well as in planning in critical care.
The final case study considers the task of electrolyte repletion, describing how
this task can be optimized using the MDP framework and analysing current clinical
behaviour through the lens of reinforcement learning, before going on to outline the
steps necessary in enabling the adoption of these tools in current healthcare systems.
Acknowledgements
This thesis owes its existence to a number of incredible people. First and foremost,
thank you to my advisor Barbara Engelhardt, for her mentorship and her commitment
to tackling meaningful problems in machine learning for healthcare. She is a source
of inspiration to me as both a scientist and as a pillar of support within the research
community. I also want to thank Finale Doshi-Velez for her guidance and optimism
in encouraging me to pursue my ideas, at crucial junctures of this dissertation.
I thank Ryan Adams, Sebastian Seung and Mengdi Wang for agreeing to serve
on my thesis committee, as well as Warren Powell for his early feedback. I am also
thankful to Kai Li for forging our collaborations with the Hospital of the University
of Pennsylvania. I feel incredibly fortunate to have worked alongside Corey Chivers,
Michael Draugelis and the rest of the data science team at Penn Medicine; this re-
search would not have been possible without their consistent backing. I am also
grateful for my discussions with physicians at Penn, in particular Krzysztof Lau-
danski, Gary Weissman, Heather Giannini, and Daniel Herman. Their tirelessness
and confidence in the potential of machine learning to improve patient care has been
heartening, and their insights have moulded my own perspectives. I also owe a great
deal to my labmate and co-author, Li-Fang Cheng. Working with her in our efforts
towards optimal laboratory testing in acute patient care was a wonderful experience;
her steadfast and systematic approach helped me grow as a researcher.
Thank you to my officemates over the years, and the whole BEEhive, for making
the lab a welcoming and engaging place to discuss anything from statistics to politics.
At the same time, I am lucky to have amazing housemates, Sumegha Garg—a con-
stant source of support and humour, and my co-conspirator in so much of our life in
Princeton—and Sravya Jangareddy, who have made dissertation writing in quaran-
tine rather more fun. Thanks also to the rest of the Musketeers for all the potlucks,
hikes and game nights, keeping me from any real danger of working too hard.
Thank you to my fiancé, Cormac O’Neill, who I have found at my side in every
adventure the past five years have brought my way. I am so grateful for his immea-
surable patience and positivity, not to mention his role as my personal guide to the
mysterious world of medical parlance. I cannot imagine this experience without him.
Above all, thank you to my family—my sister Nivedita, my parents Prasad and
Vasumathi, and my paternal and maternal grandparents—for all their love and sup-
port. Their winding paths have carried me to this point, and their total conviction
in my capabilities has been a constant source of strength to me. It was through that
faith that I began this PhD, and it is with them that I complete it.
To Amma, Appa, Nivi, and Cormac.
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1 Introduction 1
1.1 Learning from Electronic Health Records . . . . . . . . . . . . . . . . 2
1.2 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Reinforcement Learning: Preliminaries 7
2.1 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 Solving an MDP . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Off-Policy Policy Evaluation . . . . . . . . . . . . . . . . . . . . . . . 16
3 An RL Framework for Weaning from Mechanical Ventilation 19
3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.1 MIMIC III Dataset . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.2 Resampling using Multi-Output Gaussian Processes . . . . . . 26
3.2.3 MDP Formulation . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.4 Learning the Optimal Policy . . . . . . . . . . . . . . . . . . . 34
3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4 Optimizing Laboratory Tests with Multi-objective RL 45
4.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1.1 MIMIC Cohort Selection and Preprocessing . . . . . . . . . . 48
4.1.2 Designing a Multi-Objective MDP . . . . . . . . . . . . . . . . 50
4.1.3 Solving for Deterministic Optimal Policy . . . . . . . . . . . . 53
4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.1 Off-Policy Evaluation . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5 Constrained Reward Design for Batch RL 62
5.1 Preliminaries and Notation . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2 Admissible Reward Sets . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2.1 Consistent Reward Polytope . . . . . . . . . . . . . . . . . . . 66
5.2.2 Evaluable Reward Polytope . . . . . . . . . . . . . . . . . . . 70
5.2.3 Querying Admissible Reward Polytope . . . . . . . . . . . . . 73
5.2.4 Finding the Nearest Admissible Reward . . . . . . . . . . . . 74
5.3 Experiment Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3.1 Benchmark Tasks . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3.2 Mechanical Ventilation in ICU . . . . . . . . . . . . . . . . . . 77
5.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4.1 Benchmark Control Tasks . . . . . . . . . . . . . . . . . . . . 79
5.4.2 Mechanical Ventilation in ICU . . . . . . . . . . . . . . . . . . 82
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6 Guiding Electrolyte Repletion in Critical Care using RL 86
6.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.1.1 UPHS Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.1.2 Formulating the MDP . . . . . . . . . . . . . . . . . . . . . . 89
6.1.3 Fitted Q-Iteration with Gradient-boosted Trees . . . . . . . . 95
6.1.4 Reward Inference using IRL with Batch Data . . . . . . . . . 96
6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2.1 Understanding Behaviour in UPHS . . . . . . . . . . . . . . . 98
6.2.2 Analysing Policies from FQI . . . . . . . . . . . . . . . . . . . 100
6.2.3 Off-policy Policy Evaluation . . . . . . . . . . . . . . . . . . . 103
6.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7 Conclusion 106
7.1 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . 107
7.1.1 Translation to Clinical Practice . . . . . . . . . . . . . . . . . 109
Bibliography 112
List of Tables
3.1 Extubation guidelines at Hospital of University of Pennsylvania . . . 26
4.1 Summary statistics of key labs and vitals within selected cohort . . . 49
5.1 MDP state features for ventilation management in the ICU . . . . . . 77
5.2 Top admitted weights for benchmark control tasks . . . . . . . . . . . 79
5.3 Effective sample size for ventilation policies with FPL . . . . . . . . . 84
6.1 Patient state features for electrolyte repletion policies . . . . . . . . . 91
6.2 Discretized dosage levels for K, Mg and P. . . . . . . . . . . . . . . . 92
6.3 Reward weights inferred from IRL vs. chosen for FQI policies . . . . 100
List of Figures
2.1 Agent-Environment interaction in an MDP . . . . . . . . . . . . . . . 9
2.2 Deep Q-Network architecture . . . . . . . . . . . . . . . . . . . . . . 15
3.1 Example ICU admission with invasive mechanical ventilation . . . . . 25
3.2 Multi-output Gaussian process imputation of key vitals . . . . . . . . 31
3.3 Shaping penalties for abnormal or fluctuating vitals . . . . . . . . . . 35
3.4 Convergence of Q(s, a) using Q-learning . . . . . . . . . . . . . . . . 39
3.5 Convergence of Q(s, a) using Fitted Q-iteration . . . . . . . . . . . . 40
3.6 Gini feature importances for optimal policies . . . . . . . . . . . . . . 41
3.7 Evaluating mean reward and reintubations of learnt policies . . . . . 42
4.1 Measurement frequency of key labs and vitals . . . . . . . . . . . . . 50
4.2 Feature importance scores for optimal lab ordering policies . . . . . . 56
4.3 Recommended orders for lactate in example ICU admission . . . . . 58
4.4 Evaluating component-wise PDWIS estimates across ordering policies 59
4.5 Comparing information gain from lab tests (Clinician vs MO-FQI) . . 60
4.6 Evaluating lab time to treatment onset (Clinician vs MO-FQI) . . . . 61
5.1 Admissibility polytopes with varying thresholds for 2D map . . . . . 69
5.2 Polytope size and distribution of admitted weights in benchmark tasks 80
5.3 Admissible weights for ventilation weaning task . . . . . . . . . . . . 83
6.1 Distribution of electrolyte levels at repletion events for K, Mg and P. 89
6.2 Example hospital admission with potassium supplementation . . . . . 90
6.3 Penalizing abnormal electrolyte levels in reward function . . . . . . . 94
6.4 UPHS vs FQI-recommended potassium repletion for sample admission 100
6.5 Distribution of recommended actions for K, Mg and P . . . . . . . . 101
6.6 Shapley values of top features for PO and IV potassium repletion . . 102
6.7 Evaluation of policies for K, Mg and P using Fitted-Q Evaluation . . 104
Chapter 1
Introduction
“Medicine is a science of uncertainty and an art of probability.”
——————————————————————————
- Sir William Osler (1849-1919)
Clinical decision-making is the process of collecting and contextualizing evidence,
within an evolving landscape of medical knowledge, with the intent of advancing
patient health. In current practice, this requires care providers to sift through large
volumes of fragmented, multi-modal data, evaluate these in the face of conflicting
pressures—to minimize patient risk, manage uncertainty, and rein in costs—so as to
formulate an understanding of a patient’s underlying state and decide what additional
information is necessary in order to make diagnostic and therapeutic decisions.
These pressures are heightened when examining clinical decision-making for crit-
ically ill patients, that is, those in resource and data-intensive settings such as the
ICU (intensive care unit), often requiring rapid judgements with high stakes. Timely
and proportionate interventions are crucial to ensuring the best possible patient out-
come in these cases. However, there is a severe lack of conclusive evidence on best
practices for many routine interventions, particularly when serving heterogeneous pa-
tient populations [79]: multiple systematic reviews of randomized control trials of
common ICU interventions have found that less than one in seven of these were of
measurable value to patient outcome [93, 106]. Coupled with human biases arising
from, for example, skewed (or simply a lack of) clinical experience, fatigue, or legal and
procedural burdens, this often necessitates an over-reliance by clinicians on intuition
or heuristics. While such heuristic decision-making may seem most practical, it can
result in compounding errors with increasingly complex cases. It is estimated that
more than 250,000 deaths per year in the US can be attributed to medical error [77];
within the ICU, observational studies suggest that around 45% of adverse events are
preventable, with the majority of serious medical errors occurring in the ordering
or execution of treatments [126]. This motivates the adoption of a more quantitative,
data-centric approach to patient care that systematically evaluates the space of
possible interventions in order to determine an optimal course of action.
In this thesis, we develop a framework for clinician-in-loop decision support tools
that makes use of large-scale data from electronic health records to aid the manage-
ment of routine interventions in critically ill patients. We do so by considering these
sequential decision-making problems through the lens of reinforcement learning (RL).
We go on to demonstrate how our methods can be applied to and evaluated in critical
care settings; we do so by learning policies for the management of an array of routine
interventions, from the control of mechanical ventilation to the ordering of routine
laboratory tests or administration of electrolyte repletion therapy.
1.1 Learning from Electronic Health Records
Electronic health records (EHRs) are digitized collections of patient data, comprising
demographic information and personal statistics, medical histories, as well as data
from individual hospital visits such as vitals, laboratory tests, administered drugs and
procedures, radiology images, nursing notes and billing information. In the space of a
decade, the rate of adoption of EHRs in the US increased 10-fold, from just 9% in 2008
to 94% in 2017 [40, 96], driven by the need to facilitate clinical practice, streamline
workflows and slow the inflation of healthcare costs. This sudden availability of rich
healthcare data at scale has resulted in the proliferation of efforts to leverage state-
of-the-art machine learning methods towards the analysis of this data.
The majority of these efforts have been directed towards predictive modelling, for
example, forecasting the likelihood of patient mortality, length of ICU stay, onset of
sepsis, and numerous other adverse clinical events [27, 39, 45]. A recurring challenge
in the analysis of these forecasting methods is in justifying whether they are action-
able, that is, whether they can provide clinical insights to inform early interventions
and prevent patient deterioration or improve outcome. This question of actionability
is addressed more directly by approaches that instead aim to directly optimize for in-
terventions, which are then presented to clinicians. There have been numerous efforts
to learn personalized treatment recommendations, from direct action-learning using
simple predictive models or rule-based approaches, to those drawing on literature in
control theory, contextual bandits or reinforcement learning [36, 82, 84, 124].
Learning treatment policies from observational data in the form of EHRs poses a
number of challenges in practice. Firstly, much of the data present in health records
is collected to facilitate the billing of hospital admissions, rather than with the view
of being pedagogical for sequential decision making tasks. For instance, nursing notes
and ICD-9 codes indicating diagnosed comorbidities and administered procedures are
typically entered post hoc, so cannot be relied upon to be available at the time of
decision-making. The timestamped data that is available is often sporadically sam-
pled and error-ridden; time series recordings can have widely differing measurement
frequencies for different physiological parameters, and are rarely missing at random.
Furthermore, “normal” or reference ranges of variables hold little meaning when iden-
tifying outliers in the data, as abnormal values are likely to be omnipresent, and are
crucial to identifying patient deterioration. Great care must therefore be taken when
processing the data prior to learning treatment policies.
Next, the space of possible predictors available in practice for reasoning about
diagnosis or intervention is large. Certain factors may be easily observable by clin-
icians, but inadequately captured or difficult to infer from chart data in the EHR.
These can include, for instance, informal assessments of a patient’s pallor, muscle
weakness, or breathing difficulty. Where standardized severity tests do exist, such as for
the monitoring of cognition or pain levels, these are often time-consuming to ad-
minister and subject to bias. Clinicians also have complex and seemingly arbitrary
choices in intervention, and it can be difficult to discern treatment options that are
systematic and evidence-based from those driven instead by available resources, local
policies or physician preferences. For example, intravenous drug delivery or fluid
repletion is often restricted to fixed preparations (combinations) of drugs rather than
tailored to individual patient needs, increasing risk of overdose or overcorrection of
other physiological parameters in the process of treating a certain target condition.
Lastly, the inference of counterfactual outcomes given the interventions observed
in patient EHRs is an intrinsic challenge to learning and evaluating policies from
observational data. Counterfactual treatment effect estimation has been explored in
depth in literature on causal inference. This problem is amplified when planning
interventions over multiple time steps: the set of all possible treatment sequences
grows exponentially with the number of decision points in the patient admission.
Therefore, even with the availability of a large database of patient histories, the
effective sample size for evaluating a specific treatment policy—that is, the number of
histories in the dataset for which decisions by the clinician match this policy—rapidly
shrinks [33], demanding caution when assessing its potential value.
1.2 Thesis Contributions
The key contributions of this thesis are two-fold: firstly, the work here is fundamen-
tally interdisciplinary, bridging the gap between the decision-making process in the
hospital setting, and planning as a machine learning problem. Within the paradigm
of model-free reinforcement learning, I develop definitions of state, action and reward
for actors in a critical care environment that are underpinned by clinical reasoning. In
doing so, I draw on canonical models in time series representation, such as Gaussian
processes, and in prediction, from tree-based ensemble methods to feed-forward neu-
ral networks. I apply these methods to multiple distinct facets of inpatient care; our
work on the management of invasive mechanical ventilation [98], for example, helped
lay a foundation for learning and evaluating treatment policies in this paradigm.
The second area of work this thesis endeavours to push forward is methodology
for reward design in practical applications. While the reward function is considered
to be the most robust definition of a reinforcement learning task, the use of sparse,
overarching objectives can be challenging—and sometimes misleading—to learn from.
On the other hand, in a reward function that incorporates several sub-objectives that
present more immediate, relevant feedback, it is often unclear how these multiple
objectives should be weighted. To this end, I introduce various approaches to drawing
from domain experts when deciding this trade-off: both explicitly, by developing a
framework for multi-objective reinforcement learning that can be applied to extended
horizon tasks, and guides clinicians in ultimately prioritizing objectives to obtain
a deterministic policy [13], and implicitly, by examining available historical data to
understand current clinical priorities, and using this where appropriate as an anchor
in treatment policy optimization [99].
1.2.1 Outline
The remainder of this thesis is organized as follows: in Chapter 2, I introduce the
fundamentals of the reinforcement learning framework for sequential decision-making.
Chapter 3 describes my efforts in the development of clinician-in-loop decision
support for weaning from mechanical ventilation, outlining the formulation of this task
in the reinforcement learning frame, and the use of off-policy reinforcement learning
algorithms to learn an optimal sequence of actions, in terms of sedation, intubation
or extubation, from sub-optimal behaviour in historical intensive care data.
Chapter 4 turns to the problem of effective reward design given multiple ob-
jectives when applying reinforcement learning to clinical decision-making tasks. It
combines work in Pareto optimality in reinforcement learning with known clinical
and procedural imperatives to present a flexible recommendation system for optimiz-
ing the ordering of laboratory tests for patients in critical care.
In Chapter 5, I then consider how available clinical data can be used to inform
and restrict the possible convex combinations of these multiple objectives in the re-
ward function to those that yield a scalar reward which reflects what we implicitly
know from the data about reasonable behaviour for a task, and allows for robust
off-policy evaluation; I apply this approach to reward design in synthetic domains as
well as in the critical care context.
Finally, Chapter 6 explores the problem of electrolyte repletion in critically ill
patients, and adapts the framework introduced in previous chapters to demonstrate
how, given data from a particular healthcare system, we can understand current
behavioural patterns in repletion therapy through the lens of reinforcement learning,
and model this task to learn optimal repletion policies in the same way.
In my conclusion, I summarize findings from these works, and consider the steps
necessary for successful adoption of data-driven decision support in clinical practice.
Chapter 2
Reinforcement Learning:
Preliminaries
The reinforcement learning paradigm is characterized by an agent continuously inter-
acting with and learning from a stochastic environment to achieve a certain goal. This
mirrors one of the fundamental ways in which we as humans learn: not with a formal
teacher, but by observing cause and effect through direct sensorimotor connections
to our surroundings [116]. In supervised machine learning—which dominates much
of machine learning in practice—one is given data in the form of featurized inputs
along with the true labels to be predicted. Unsupervised learning provides no labelled
data at all, and instead aims to learn the underlying structure of observations. Rein-
forcement learning is distinct from either of these modalities in that we receive only
feedback in the form of a reward signal, to enforce certain actions over others (rather
than labels of the “correct” or best possible actions). Additionally, for most tasks
with some degree of complexity, this feedback is delayed: the effects of a given action
may not present themselves until several time steps into the future, or may emerge
gradually over an extended period of time.
Reinforcement learning is also often distinguished by the exploration-exploitation
dilemma it poses. When repeatedly interacting online with the environment, an
agent can at each time step choose either the best action given existing information
(exploit), or choose to gather more information (explore) that may enable better de-
cisions in the future. While this has been a subject of intense study in reinforcement
learning literature, when running RL offline—that is, with previously collected ob-
servational data, as is the case for many real-world tasks—this trade-off is to a large
extent predetermined by the behaviour of the domain actors from whom the data was
collected. Despite this, the ability of the reinforcement learning framework to take a
holistic, goal-directed approach to learning, and to inherently capture uncertainty in
observations and outcomes, makes it an attractive approach for planning in practice.
2.1 Markov Decision Processes
The simplest and most common model underlying reinforcement learning methods is
the Markov decision process (MDP). Consider the setting where an agent (the learner)
interacts with the environment over a series of discrete time steps, t = 0, 1, 2, 3.... At
each time t, the agent observes the environment in some state st, takes action at
accordingly, and in turn receives some feedback rt+1 together with the next state st+1
(Figure 2.1 [116]). An MDP is then defined by the tuple M = {S,A, P0, P, R, γ},
where S is a finite state space such that the environment is in some state st ∈ S
at each time step t, and A is the space of all possible actions that can be taken by
the agent, at ∈ A. P0 defines the probability distribution of initial states, s0 ∼ P0,
while the transition function P (st+1|st, at) defines the probability of the next state
given the current state-action pair. This essentially summarizes the dynamics of the
system, and is unknown for most real-world tasks. The reward function R is the
immediate feedback received at each state transition, and is typically described as a
Figure 2.1: Agent-Environment interaction in an MDP
function of the current state, action and observed next state: rt+1 = R(st, at, st+1).
Finally, the discount factor γ determines the relative importance of immediate and
future rewards for the task in question.
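As a concrete illustration, the tuple above can be carried directly in code; the encoding below (finite spaces as index sequences, with P0, P and R as sampling functions) is one possible choice assumed purely for illustration, not an implementation from this thesis.
```python
from typing import Callable, NamedTuple, Sequence

class MDP(NamedTuple):
    """Container mirroring the tuple M = {S, A, P0, P, R, gamma} defined above."""
    states: Sequence[int]                          # finite state space S
    actions: Sequence[int]                         # finite action space A
    sample_initial: Callable[[], int]              # draw s0 ~ P0
    sample_transition: Callable[[int, int], int]   # draw s' ~ P(. | s, a)
    reward: Callable[[int, int, int], float]       # R(s, a, s')
    gamma: float                                   # discount factor
```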
The Markov assumption here posits that given the full history of state transitions,
h = s0, a0, r1, s1, a1, r2, ..., st, information on future states and rewards—and hence all
information relevant in planning for the MDP—is encapsulated by the current state.
This assumption of perfect information can often be unrealistic in practice. One pop-
ular generalization of the MDP framework that looks to relax this assumption is the
partially observable MDP, or POMDP. In a POMDP, observations are treated as noisy
measurements of the true underlying state of the environment, and used to model the
probability distribution over the state space given an observation; the inferred belief
state is then used to learn optimal policies for this environment. However, this infer-
ence problem is often challenging and computationally infeasible for large problems.
Instead, careful design of state representation in an MDP, in a way that incorporates
relevant high-level information from transition histories in order to bridge the gap to
complete observability, is often more effective in practice.
It is worth noting that the full reinforcement learning problem as modelled by an
MDP can be thought of as an extension to contextual multi-armed bandits in online
learning [62]: whereas the observation at each time step is independent and identically
distributed in the bandit setting, the observed state at each step in an MDP depends
on the previous state-action pair, as dictated by the transition function P .
2.1.1 Solving an MDP
The goal of the agent in reinforcement learning is to learn a policy function π : S → A
mapping from the state space to the action space of the MDP, such that this policy
maximizes expected return, that is, the expected cumulative sum of rewards received
by the agent over time. Denoting this optimal policy π∗,
π∗ = argmax_{π∈Π} E_{s_0∼P_0} [ lim_{T→∞} ∑_{t=0}^{T−1} γ^t r_{t+1} | π ]        (2.1)
for an infinite horizon MDP, where 0 ≤ γ ≤ 1. Setting discount factor γ = 0 models
a myopic agent, looking only to maximize immediate rewards; as γ approaches 1,
future rewards increasingly contribute to the expected return.
The use of discounted sum of rewards as the objective when solving an MDP
for optimal policies is both mathematically convenient—ensuring finite returns with
γ < 1—and a reasonable model for most tasks in practice, in which immediate feedback
tends to be most reflective of the action taken at the current state, and there is
increasing uncertainty about distant rewards. It can also be seen as a softening of
finite horizon or episodic MDPs with fixed horizons T , as the contribution of rewards
far into the future to the objective function is negligible.
The Bellman Optimality Equation
Given an infinite horizon MDP with finite state and action spaces, bounded reward
R and discount γ < 1, the value V π(s) of state s is defined simply as the expected
return when starting from s and following policy π from that point onwards:
V^π(s) = E[ lim_{T→∞} ∑_{t=0}^{T−1} γ^t r_{t+1} | π, s_0 = s ]        (2.2)
A fundamental property of value functions in reinforcement learning is that they can
be written recursively, such that given current state s, action a and next state s′,
the value of the current state can be written as the sum of the immediate reward
r = R(s, a, s′) and the discounted expected value of the next state:
V π(s) = E[r + γV π(s′)] (2.3)
This is the Bellman recursive equation for value function V π(s). An optimal policy
π∗ is therefore one for which V ∗(s) = maxπ Vπ(s) ∀s ∈ S. Substituting the recursive
definition above gives us the Bellman optimality equation:
V∗(s) = max_π E[r + γ V^π(s′)]
       ≤ max_{a∈A} ∑_{s′∈S} P(s, a, s′) [R(s, a, s′) + γ V^π(s′)]
       = max_{a∈A} ∑_{s′∈S} P(s, a, s′) [R(s, a, s′) + γ V∗(s′)]        (2.4)
We can interpret this as stating that the value of a given state under an optimal
policy is necessarily the expected discounted return when taking the best possible
action from that state [116]. It has been shown that V ∗(s) is in fact a unique solution
of Equation 2.4 [102]; it follows that the deterministic policy
π∗(s) = argmax_{a∈A} ∑_{s′∈S} P(s, a, s′) [R(s, a, s′) + γ V∗(s′)]        (2.5)
is optimal. Here, a deterministic policy is one which maps from any given state
to a single action; a randomized or stochastic policy on the other hand maps from
states to a probability distribution over the action space. These can be useful in
adversarial settings or when tackling the exploration-exploitation trade-off in online
reinforcement learning, but are of less interest in the case of human-in-loop decision
support, where we wish to recommend a single, optimal action to the user.
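To make Equations 2.4 and 2.5 concrete, the sketch below runs value iteration on a small synthetic tabular MDP; the transition and reward arrays are randomly generated stand-ins assumed here purely for illustration, since in the clinical settings of later chapters the dynamics are unknown.
```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'], rows sum to 1
R = rng.normal(size=(n_states, n_actions, n_states))               # R(s, a, s')

V = np.zeros(n_states)
for _ in range(1000):
    # Q[s, a] = sum_s' P(s, a, s') * (R(s, a, s') + gamma * V(s'))
    Q = np.einsum("sak,sak->sa", P, R + gamma * V[None, None, :])
    V_new = Q.max(axis=1)                 # Bellman optimality backup (Equation 2.4)
    if np.max(np.abs(V_new - V)) < 1e-8:  # stop once the fixed point is reached
        V = V_new
        break
    V = V_new

pi_star = Q.argmax(axis=1)                # greedy deterministic policy (Equation 2.5)
```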
Value Function Approximation in Off-policy RL
Optimal policies also share the same action-value function: Q∗(s, a) = maxπ Qπ(s, a)
where Qπ(s, a) is the expected return when taking action a at state s, and following
policy π thereafter, such that V ∗(s) = maxa(Q∗(s, a)). The Q-function in effect
caches the result of one-step lookahead searches for the value of each action at any
given state, simplifying the process of choosing optimal actions. The corresponding
Bellman optimality equation for Q∗(s, a) is given by:
Q∗(s, a) = ∑_{s′∈S} P(s, a, s′) [R(s, a, s′) + γ max_{a′∈A} Q∗(s′, a′)]        (2.6)
This forms the basis of one of the most popular classes of reinforcement learning algo-
rithms, namely value-based methods such as Q-learning and its variants. Q-learning
[125] is a reinforcement learning algorithm that uses one-step temporal differences to
successively bootstrap on current estimates for the value of each state-action pair.
Starting from some initial state and an arbitrary approximation Q(s, a), we perform
an update using the observed immediate reward at each state transition according to
the following update rule, based on the Bellman equation:
Q(s, a) ← α (r + γ max_{a′∈A} Q(s′, a′)) + (1 − α) Q(s, a)        (2.7)
Our new estimate of Q is a convex sum of the previous estimate, and the expected
return given the reward received at the current transition. The learning rate α de-
termines the relative weights of the new and old estimates in this update. We repeat
this over a fixed number of iterations, or until the LHS and the RHS of the above
equation are approximately equal. It has been shown that this procedure provides
guaranteed convergence to the true value of Q in the tabular setting—that is, with
discrete state and action spaces—given that all state-action pairs in this space are
repeatedly sampled and updated.
Now that we have our estimate of Q, the optimal policy π∗ is simply the action
maximizing Q at each state:
π∗(s) = argmax_{a∈A} Q(s, a)   ∀ s ∈ S        (2.8)
The Q-learning algorithm is both model-free, requiring no prior knowledge of the
transition or reward dynamics of the system, and off-policy : it learns an optimal
policy from experience collected whilst following a different behaviour policy.
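As a minimal illustration of the update in Equation 2.7, the sketch below runs tabular Q-learning on synthetic transitions; the stand-in environment (random transitions and a toy reward) is an assumption made purely to exercise the update rule.
```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 5, 3
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

for _ in range(20_000):
    s = rng.integers(n_states)
    a = rng.integers(n_actions)          # behaviour policy: uniform over actions
    s_next = rng.integers(n_states)      # stand-in transition (illustrative only)
    r = float(s_next == 0)               # stand-in reward (illustrative only)
    # Q(s,a) <- alpha * (r + gamma * max_a' Q(s',a')) + (1 - alpha) * Q(s,a)
    Q[s, a] = alpha * (r + gamma * Q[s_next].max()) + (1 - alpha) * Q[s, a]

pi_star = Q.argmax(axis=1)               # greedy policy of Equation 2.8
```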
In order to extend from the tabular case to large or continuous state spaces,
we must combine Q-learning with some form of parametric function approximation,
such as linear models or neural networks. These take as input a vector representation
of state and action, and learn the mapping to the corresponding action-value. In
practice, updating the function approximator with each new state transition can cause
significant instability in learning the Q-function: sampled transitions are typically
sparse in comparison with the full state space, and an update based on a single
observation in a certain part of the space can disproportionately affect our estimate
of Q in a very different region, and in turn lead to extremely slow convergence.
Fitted Q-iteration (FQI) is a batch-mode reinforcement learning algorithm that ad-
dresses this instability by treating Q-function estimation in an infinite-horizon MDP
as a sequence of supervised learning problems, where each iteration extends the op-
timization horizon by one time step. Given a dataset of transition tuples of the form
D = {⟨s_n, a_n, r_n, s′_n⟩}_{n=1:|D|} and initializing Q_0(s, a) = 0 for all s, a, the training set for the
kth iteration of FQI is given by:
input: ⟨s_n, a_n⟩,    target: r_n + γ max_{a′∈A} Q_{k−1}(s′_n, a′)        (2.9)
We can see that at the first iteration, solving this regression problem yields an ap-
proximation for the immediate reward given a state-action pair, that is, we solve for
the 1-step optimization problem. It follows that running this over k iterations gives
us the expected return over a k-step optimization horizon; the number of iterations
required for convergence in an infinite horizon MDP is effectively determined by the
discount factor γ. The algorithm uses all available experience at each iteration to
learn the action-value function, and in turn the optimal policy. This efficient use of
information makes it popular in settings with limited data, or where additional ex-
perience is expensive to collect, as is the case in healthcare. It can also be applied in
conjunction with any function approximator, from tree-based methods or kernel func-
tions [21] to neural networks [105], and provides convergence guarantees for several
common regressors.
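A sketch of this iteration with a tree-based regressor is given below; the array shapes, the choice of extremely randomized trees and the fixed number of iterations are illustrative assumptions rather than the exact implementation used later in this thesis.
```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(S, A, R, S_next, n_actions, gamma=0.99, n_iters=60):
    """Batch FQI: each iteration fits a regressor to the targets of Equation 2.9."""
    X = np.hstack([S, A.reshape(-1, 1)])            # inputs: state-action pairs
    q_reg = None
    for _ in range(n_iters):
        if q_reg is None:
            targets = R                              # first iteration: one-step reward
        else:
            # max_a' Q_{k-1}(s', a'), evaluated with one prediction per candidate action
            q_next = np.column_stack([
                q_reg.predict(np.hstack([S_next, np.full((len(S_next), 1), a)]))
                for a in range(n_actions)
            ])
            targets = R + gamma * q_next.max(axis=1)
        q_reg = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, targets)
    return q_reg
```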
Many of the recent successes of reinforcement learning have been achieved through
the adaptation of existing action-value approximation methods in the form of deep
Q-networks (DQNs) [81]. Rather than updating estimates following each observed
transition (as in Q-learning) or training function approximators from scratch at each
iteration using the entire set of collected experience, DQNs take a mini-batch ap-
proach, with several key deviations from prior methods in order to stabilize and
speed up training. The first is the use of experience replay: the agent maintains a
data buffer D of randomized, decorrelated recent experience and draws a mini-batch
of tuples ⟨s, a, r, s′⟩ ∼ U(D) uniformly from this buffer. The Q-learning update is
then applied to this mini-batch by running gradient descent with the following loss
function:
L_i(θ_i) = E_{e∼U(D)} [ ( r + γ max_{a′} Q(s′, a′; θ^t_i) − Q(s, a; θ_i) )^2 ]        (2.10)
The above definition highlights another key variation: DQN maintains two sep-
arate networks, a target network with parameters θt, and the actual Q-network
parametrized by θ. These parameters are only copied over to the target network
periodically, in order to reduce temporal correlations between the Q-value used in
action evaluation and in the target.
Figure 2.2: Basic Deep Q-Network architecture
Finally, while traditional Q-function approximators take the state and action as
input and output a single Q-value, the DQN takes just the state as input and outputs
a vector of Q-values, one per action (Figure 2.2)—necessitating, in the case of continuous
state spaces, access to the data-generating process in order to simulate all possible
state-action pairs. This speeds up training by allowing us to estimate Q for all actions
with a single forward pass through the network.
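The loss in Equation 2.10 can be sketched with a small fully connected network standing in for the convolutional architecture; the network sizes, batch format and use of PyTorch below are assumptions made for illustration only.
```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """State in, one Q-value per action out (the layout of Figure 2.2)."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return self.net(s)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD error (Equation 2.10) on a mini-batch drawn from the replay buffer."""
    s, a, r, s_next = batch                                  # tensors; a holds integer action indices
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s, a; theta)
    with torch.no_grad():                                    # target network held fixed
        target = r + gamma * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)
```
Here target_net is a periodically refreshed copy of q_net (for instance via target_net.load_state_dict(q_net.state_dict())), so that the Q-values used in the target are decoupled from those being updated.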
Much of the performance gain afforded by DQNs comes from the convolutional
neural network architecture used to learn state representations in settings with
unstructured, high-dimensional observations such as raw image inputs in Atari [81],
and is likely to have limited impact when handling unstructured EHR data. This,
combined with issues of sample efficiency and dependence on the availability of a
simulator, makes DQNs less suited to reinforcement learning in clinical settings.
2.2 Off-Policy Policy Evaluation
A fundamental challenge of reinforcement learning using batch data, in settings where
it is infeasible to either build a functional simulator of system dynamics or to collect
additional experience, is in evaluating the efficacy of a proposed policy. This can be
viewed as a problem of counterfactual inference: given observed outcomes following
a certain behaviour policy, we wish to estimate what would have happened had we
instead followed a different policy.
Observational data in practice is rarely generated with pedagogical intent, and
the distribution of states and actions represented in these datasets can be starkly
different from the policies we want to evaluate. The majority of approaches to off-
policy evaluation (OPE) in these settings are founded on either importance sampling
or the training of approximate models, or a combination of the two. Importance
sampling based approaches draw from methods in classical statistics for handling
mismatch between target and sampling distributions: given a dataset of trajectories
D = {h(i)}i=1:N sampled from some behaviour policy πb(a|s), and a policy πe(a|s)
that we wish to evaluate, importance sampling re-weights each trajectory h(i) =
{s0, a0, r1, s1, ...}(i) according to its relative likelihood under the new policy. We define
importance weights ρt,
ρ_T = ∏_{t=0}^{T−1} π_e(a^h_t | s^h_t) / π_b(a^h_t | s^h_t)        (2.11)
as the probability ratio of T steps of trajectory h under policy πe versus πb [100]. It
follows that the value of the new policy πe can be estimated by:
V_IS(πe) = (1/N) ∑_{i=1}^{N} ρ^{(i)}_{T−1} ∑_{t=0}^{T−1} γ^t r_{t+1}        (2.12)
This yields a consistent, unbiased estimate of the value of a given policy, but can have
incredibly high variance in practice, as a result of the product term in ρT. This is
amplified in tasks with extended horizons. Two common extensions that attempt to
mitigate this explosion of variance are the per-decision importance sampling (PDIS)
and the per-decision weighted (PDWIS) estimators [100], defined as follows:
V_PDIS(πe) = (1/N) ∑_{i=1}^{N} ∑_{t=0}^{T} γ^t ρ^{(i)}_t r_{t+1}        (2.13)
V_PDWIS(πe) = (1/N) ∑_{i=1}^{N} ∑_{t=0}^{T} γ^t ( ρ^{(i)}_t / ∑_{j=1}^{N} ρ^{(j)}_t ) r_{t+1}        (2.14)
The intuition behind the per-decision estimator is to weight each reward along a
trajectory according to the likelihood of the trajectory only up to that time step,
rather than the relative probability of the complete trajectory. However, the variance
of the PDIS estimator from importance weights ρ can still often be unacceptably
high. To address this, the weighted variant normalizes ρ, dividing by the sum of the
importance weights across all trajectories at each time step. While this introduces bias in our
estimated policy value, it still yields a consistent and lower variance estimator, in
comparison with alternative approaches.
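These estimators can be computed directly from logged trajectories. The sketch below assumes each trajectory is a list of (state, action, reward) tuples and that the evaluation and behaviour policies expose action probabilities; both interfaces are assumptions made for illustration. In the weighted variant, the self-normalization of the weights takes the place of the 1/N average.
```python
import numpy as np

def per_decision_is(trajectories, pi_e, pi_b, gamma=0.99, weighted=True):
    """Per-decision (weighted) importance sampling, in the spirit of Equations 2.13-2.14."""
    N = len(trajectories)
    T = max(len(h) for h in trajectories)
    rho = np.zeros((N, T))                    # cumulative importance ratio up to step t
    rew = np.zeros((N, T))
    for i, h in enumerate(trajectories):
        ratio = 1.0
        for t, (s, a, r) in enumerate(h):
            ratio *= pi_e(s, a) / pi_b(s, a)  # running product of likelihood ratios
            rho[i, t] = ratio
            rew[i, t] = r
    disc = gamma ** np.arange(T)
    if weighted:
        norm = np.where(rho.sum(axis=0) > 0, rho.sum(axis=0), 1.0)
        return float((disc * (rho / norm) * rew).sum())      # self-normalized (PDWIS)
    return float((disc * rho * rew).sum() / N)                # unnormalized (PDIS)
```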
The second class of approaches to off-policy evaluation rely on directly learning
regressors for the expected return, by first fitting a model M for the MDP using
available transition data, and then taking the estimated parameters P and R for the
transition and reward function respectively, substituting these into the Bellman equa-
tion (Equations 2.3-2.5) in order to estimate the value V πe of the policy in question.
However, it is challenging to train models that can generalize well in most real-world
problems, composed of large or continuous state spaces and many combinations of
state-action pairs that are never observed in the data. Function approximation in
these settings can introduce significant bias in the estimated parameters of the MDP,
limiting the credibility of the policy value estimates returned.
Doubly robust estimators for off-policy evaluation in sequential decision-making
problems [48] look to leverage both the low bias of importance sampling and the low
variance of model-based approaches to achieve the best possible estimates for the
value of a given policy:
V^0_DR = 0;    V^{T−t+1}_DR(πe) = V_AM(s_t) + ρ_T ( r_t + γ V^{T−t}_DR − Q_AM(s_t, a_t) )        (2.15)
where V_AM and Q_AM are the state and action value estimates respectively according to
the approximate model of the MDP, and ρT is the importance weight given available
trajectories (Equation 2.11). The quality of the doubly robust estimator VDR is then
dependent on the robustness of the better of the IS and AM estimates.
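The recursion in Equation 2.15 unrolls backwards over a trajectory; the sketch below uses per-step importance ratios, following Jiang and Li [48], and assumes the model-based value estimates have been computed in advance (all array names are assumptions for illustration).
```python
def doubly_robust_value(rewards, rho, v_am, q_am, gamma=0.99):
    """Backward unrolling of the doubly robust recursion for one trajectory.

    rewards[t], rho[t] : observed reward and per-step importance ratio at step t
    v_am[t], q_am[t]   : approximate-model estimates of V(s_t) and Q(s_t, a_t)
    """
    v_dr = 0.0
    for t in reversed(range(len(rewards))):
        v_dr = v_am[t] + rho[t] * (rewards[t] + gamma * v_dr - q_am[t])
    return v_dr
```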
In recent years, several extensions to both importance sampling and model-based
methods have been introduced for off-policy evaluation in reinforcement learning.
These include importance sampling applied to state visitation distributions rather
than state transition sequences to tackle exploding variance in tasks with extended
horizons [70], efforts to draw from treatment effect estimation in causal reasoning to
estimate individual policy values [71], and variations of model-based or supervised
learning approaches [34, 47]. In particular, fitted Q-evaluation (FQE) [64] for batch
reinforcement learning, which adapts the iterative supervised learning approach of
FQI to the evaluation of learnt policies, has been shown to be data-efficient and
outperform prior approaches in high-dimensional reinforcement learning settings.
Chapter 3
An RL Framework for Weaning
from Mechanical Ventilation
Mechanical ventilation is one of the most widely used interventions in admissions to
the intensive care unit (ICU): around 40% of patients in the ICU are supported on
invasive mechanical ventilation at any given hour, accounting for 12% of total hospital
costs in the United States [3, 130]. These are typically patients with acute respiratory
failure or compromised lung function caused by some underlying condition such as
pneumonia, sepsis or heart disease, or cases in which breathing support is necessitated
by neurological disorders, impaired consciousness or weakness following major surgery.
As advances in healthcare enable more patients to survive critical illness or surgery,
the need for mechanical ventilation during recovery has risen.
Closely coupled with ventilation in the care of these patients is sedation and
analgesia, which are crucial to maintaining physiological stability and controlling
pain levels of patients while intubated. The underlying condition of the patient, as
well as factors such as obesity or genetic variations, can have a significant effect on the
pharmacology of drugs, and cause high inter-patient variability in response to a given
sedative [97], lending motivation to a personalized approach to sedation strategies.
Weaning refers to the process of liberating patients from mechanical ventilation.
The primary diagnostic tests for determining whether a patient is ready to be extu-
bated involve screening for resolution of the underlying disease, monitoring haemo-
dynamic stability, assessment of current ventilator settings and level of conscious-
ness, and finally a series of spontaneous breathing trials (SBTs) ascertaining that
the patient is able to cope with reduced support. Prolonged ventilation—and in
turn over-sedation—is associated with post-extubation delirium, drug dependence,
ventilator-induced pneumonia and higher patient mortality rates [44], in addition to
inflating costs and straining hospital resources. Physicians are often conservative
in recognizing patient suitability for extubation, however, as failed breathing trials
or premature extubations that necessitate reintubation within the space of 48 to 72
hours can cause severe patient discomfort, and result in even longer ICU stays [59].
Efficient weaning of sedation levels and ventilation is therefore a priority both for
improving patient outcomes and reducing costs, but a lack of comprehensive evidence
and the variability in outcomes between individuals and across subpopulations means
there is little agreement in clinical literature on the best weaning protocol [18, 32].
In this work, we aim to develop a decision support tool that leverages available
information in the data-rich ICU setting to alert clinicians when a patient is ready
for initiation of weaning, and recommend a personalized treatment protocol. We ex-
plore the use of off-policy reinforcement learning algorithms, namely fitted Q-iteration
(FQI) with different regressors, to determine the optimal treatment at each patient
state from sub-optimal historical patient treatment profiles. The setting fits natu-
rally into the framework of reinforcement learning as it is fundamentally a sequential
decision making problem rather than purely a prediction task: we wish to choose the
best possible action at each time—in terms of sedation drug and dosage, ventilator
settings, initiation of a spontaneous breathing trial, or extubation—while capturing
the stochasticity of the underlying process, the delayed effects of actions and the
uncertainty in state transitions and outcomes.
The problem poses a number of key challenges: firstly, there are a multitude of
factors that can potentially influence patient readiness for extubation, including some
not directly observed in ICU chart data, such as a patient’s inability to protect their
airway due to muscle weakness. The data that is recorded can itself be sparse, noisy
and irregularly sampled. In addition, there is potentially an extremely large space of
possible combinations of sedatives (in terms of drug, dosage and delivery method) and
ventilator settings, such as oxygen concentration, tidal volume and system pressure,
that can be manipulated during weaning. We are also posed with the problem of
interval censoring, as in other intervention data: given past treatment and vitals
trajectories, observing a successful extubation at time t provides us only with an
upper bound on the true time to extubation readiness, te ≤ t; on the other hand, if a
breathing trial was unsuccessful, there is uncertainty about how premature the intervention
was. This presents difficulties both during learning and when evaluating policies.
The rest of this chapter is organized as follows: Section 3.1 explores prior ef-
forts towards the use of reinforcement learning in clinical settings. In Section 3.2
we describe the data and methods applied here, and Section 3.3 presents the results
achieved. Finally, conclusions and possible directions for further work are discussed
in Section 3.4.
Prior Publication: Niranjani Prasad, Li-Fang Cheng, Corey Chivers, Michael
Draugelis, and Barbara E. Engelhardt. A reinforcement learning approach to weaning
of mechanical ventilation in intensive care units. Proceedings of 33rd Conference on
Uncertainty in Artificial Intelligence, (UAI) 2017 [98].
3.1 Related Work
The widespread adoption of electronic health records paved the way for a data-driven
approach to healthcare, and recent years have seen a number of efforts towards per-
sonalized, dynamic treatment regimes. Reinforcement learning in particular has been
explored across various settings, particularly in the management of chronic illness.
These range from determining the sequence of drugs to be administered in HIV ther-
apy or cancer treatment, to minimizing risk of anaemia in haemodialysis patients and
insulin regulation in diabetics.
These efforts are typically based on estimating the value, in terms of clinical out-
comes, of different treatment decisions given the state of the patient. For example,
Ernst et al. [22] apply fitted Q-iteration with a tree-based ensemble method to learn
the optimal HIV treatment in the form of structured treatment interruption strate-
gies, in which patients are cycled on and off drug therapy over several months. The
observed reward here is defined in terms of the equilibrium point between healthy and
unhealthy blood cells in the patient as well as the time spent on drug therapy, such
that the RL agent learns a policy that minimizes viral load (the fraction of unhealthy
cells) as well as drug-induced side effects.
Zhao et al. [133] use Q-learning to learn optimal individualized treatment regimens
for non-small cell lung cancer. The objective is to choose the optimal first and second
lines of therapy and optimal initiation time for the second line treatment such that the
overall survival time is maximized. The Q-function with time-indexed parameters is
approximated using a modification of support vector regression (SVR) that explicitly
handles right-censored data. In this setting, right-censoring arises in measuring the
time of death from start of therapy: given that a patient is still alive at the time of
the last follow-up, we merely have a lower bound on the exact survival time.
Escandell-Montero et al. [23] compare the performance of both Q-learning and
fitted Q-iteration with current clinical protocol for informing the administration of
erythropoiesis-stimulating agents (ESAs) for treating anaemia. The drug administra-
tion strategy is modeled as an MDP, with the state space expressed by current and
change in haemoglobin levels, the most recent ESA dosages, and the patient subpop-
ulation group. The action space here comprises a set of four discretized ESA dosages,
and the reward function is designed to maintain haemoglobin levels within a healthy
range, while avoiding abrupt changes.
On the problem of administering anaesthesia in the acute care setting, Moore
et al. [82] apply Q-learning with eligibility traces to the administration of intravenous
propofol, modeling patient dynamics according to an established pharmacokinetic
model, with the aim of maintaining some level of sedation or consciousness. Padman-
abhan et al. [94] also use Q-learning, for the regulation of both sedation level and
arterial pressure (as an indicator of physiological stability) using propofol infusion
rate. All of the aforementioned works rely on model-based approaches to reinforce-
ment learning, and develop treatment policies on simulated patient data. More re-
cently however, Nemati et al. [86] consider the problem of heparin dosing to maintain
blood coagulation levels within some well-defined therapeutic range, modeling the
task as a partially observable MDP, using a dynamic Bayesian network trained on
real ICU data, and learning a dosing policy with neural fitted Q-iteration (NFQ).
There exists some literature on machine learning methods for the problem of
ventilator weaning: Mueller et al. [83] and Kuo et al. [60] look at prediction of weaning
outcomes using supervised learning methods, and suggest that classifiers based on
neural networks, logistic regression, or naive Bayes, trained on patient ventilator
and blood gas data, show promise in predicting successful extubation. Gao et al.
[28] develop association rule networks for naive Bayes classifiers, in order to analyze
the discriminative power of different feature categories toward each decision outcome
class, to help inform clinical decision making.
The approach described in this chapter is novel in its use of reinforcement learn-
ing methods to directly provide actionable recommendations for the management of
ventilation weaning, the incorporation of a larger number of possible predictors of
wean readiness in the patient state representation compared with previous work—
which limit features for classification to a few key vitals—and the design of a reward
function informed by current clinical protocols.
3.2 Methods
3.2.1 MIMIC III Dataset
We use the Multi-parameter Intelligent Monitoring in Intensive Care (MIMIC III)
database [49], a freely available source of de-identified critical care data for 53,423
adult admissions and 7,870 neonates. The data includes patient demographics, record-
ings from bedside monitoring of vital signs, administration of fluids and medications,
results of laboratory tests, observations and notes charted by care providers, as well
as information on diagnoses, procedures and prescriptions for billing.
We extract from this database a set of 8,860 admissions from 8,182 unique adult
patients undergoing invasive ventilation. In order to train and test our weaning pol-
icy, we further filter this dataset to include only those admissions in which the patient
was kept under ventilator support for more than 24 hours. This allows us to exclude
the majority of episodes of routine ventilation following surgery, which are at minimal
risk of adverse extubation outcomes. We also filter out admissions in which the pa-
tient is not successfully discharged from the hospital by the end of the admission, as
in cases where the patient expires in the ICU, this is largely due to factors beyond the
scope of ventilator weaning, and again, a more informed weaning policy is unlikely to
have a significant influence on outcomes. Failure in our problem setting is instead de-
fined as prolonged ventilation, administration of unsuccessful spontaneous breathing
[Figure 3.1 timeline: mechanical ventilation, SBT initiation and completion, propofol and fentanyl IV drips, propofol/fentanyl/hydromorphone boluses, with panels for the Richmond-RAS scale, PEEP set, O2 saturation (pulse oximetry), inspired O2 fraction, heart rate, respiratory rate and arterial pH over the admission.]
Figure 3.1: Example ICU admission comprising mechanical ventilation and accompanying sedation, with time-stamped measurements of key vitals.
Table 3.1: Core extubation guidelines at Hospital of University of Pennsylvania
Physiological Stability              Oxygenation Criteria
Respiratory Rate   ≤ 30              PEEP (cm H2O)     ≤ 8
Heart Rate         ≤ 130             SpO2 (%)          ≥ 88
Arterial pH        ≥ 7.3             Inspired O2 (%)   ≤ 50
trials, or reintubation within the same admission—all of which are associated with
adverse outcomes for the patient. A typical patient admission episode is illustrated in
Figure 3.1: we can see ventilation times, a number of administered sedatives, both as
continuous IV drips and discrete boli, as well as nurse-verified recordings of patient
physiological parameters, measured at widely varying sampling intervals.
Preliminary guidelines for the weaning protocol, in terms of the desired ranges
of major physiological parameters (heart rate, respiratory rate and arterial pH) as
well as approximate constraints at time of extubation on the inspired O2 fraction
(FiO2), oxygen saturation from pulse oximetry (SpO2) and the setting of positive end-expiratory
pressure (PEEP), were obtained by referencing criteria in current practice at the
Hospital of the University of Pennsylvania, and are summarized in Table 3.1. These
ranges are used in designing the feedback received by our reinforcement learning
agent, to facilitate the learning of an optimal weaning policy.
3.2.2 Resampling using Multi-Output Gaussian Processes
Measurements of vitals and lab results in the ICU data can be irregular, sparse and
error-ridden. Non-invasive measurements such as heart rate or respiratory rate are
taken several times an hour, while tests for arterial pH or oxygen pressure, which
involve more distress to the patient, may only be carried out every few hours as
needed. This wide discrepancy in measurement frequency is typically handled by
resampling with means in hourly intervals, and using sample-and-hold interpolation
where data is missing. However, patient state—and therefore the need to update
management of sedation or ventilation—can change within the space of an hour, and
naive methods for interpolation are unlikely to provide the necessary accuracy at
higher temporal resolutions. We therefore explore methods that can enable further
fine-tuning of policy estimation. One of the most commonly used techniques to resolve
missing data and irregular sampling is the Gaussian process (GP; [20, 30, 115]), a
function-based method well-suited to medical time series data. GPs can be thought
of as distributions over arbitrary functions; a collection of random variables is said
to form a Gaussian process if for any finite subset of these random variables there
is a joint Gaussian distribution. In the case of time-series modeling, given a dataset
with inputs denoted by a set of T time steps t = [t_1 ... t_T]^T and corresponding
observations of some vital sign v = [v_1 ... v_T]^T, we can model
v = f(t) + ε, (3.1)
where ε represents i.i.d. Gaussian noise, and f(t) are the latent noise-free values
we would like to estimate. Equivalently, this can be thought of as placing a GP prior on
the latent function f(t):
f(t) ∼ GP(m(t), κ(t, t′)), (3.2)
where m(t) is the mean function and κ(t, t′) is the covariance function or kernel:
m(t) = E[f(t)] (3.3)
κ(t, t′) = E[(f(t) − m(t))(f(t′) − m(t′))] (3.4)
Together, the mean and kernel functions fully describe the Gaussian process. Prop-
erties such as smoothness and periodicity of f(t) are dependent on the kernel used.
Prior approaches to modeling physiological time series typically rely on univariate
Gaussian processes, treating each signal as independent. However, this assumption
may result in considerable loss of information—there are known correlations between
several common vitals [87]—and limit the accuracy of imputation for more sparsely
sampled vitals. In this work, we instead learn a multi-output GP (MOGP) to ac-
count for temporal correlations between physiological parameters during interpola-
tion; MOGPs have shown improvements over the univariate case in medical time
series for both imputation and forecasting [20, 30]. We adapt the framework in [12]
to impute the physiological signals jointly by exploring covariance structures between
them, excluding the sparse prior settings: for the ith patient in our dataset, the
time series of the dth covariate (that is, vital sign or laboratory test) is denoted by a
vector of time points t_{i,d} and corresponding values v_{i,d}. The time series data for this
patient over all D covariates can then be written as:
t_i = [ t_{i,1}^T, t_{i,2}^T, ..., t_{i,D}^T ]^T (3.5)
v_i = [ v_{i,1}^T, v_{i,2}^T, ..., v_{i,D}^T ]^T (3.6)
where t_i, v_i ∈ R^{T_i×1}, and T_i = Σ_d T_{i,d}. Denoting F_i as a multi-output time series
function for patient i, we now have:
v_i = F_i(t_i) + ε_i (3.7)
where F_i(t_i) is drawn from a patient-specific Gaussian process GP_i such that
F_i(t_i) ∼ GP_i(µ_i(t), κ_i(t, t′)) (3.8)
Here, we set µ_i(t) = 0 without loss of generality [104], so the Gaussian process is
completely defined by second-order statistics alone. In designing a kernel function κ(t, t′)
that captures covariance structure in clinical time series, we adapt the linear model
of coregionalization (LMC) framework, originally applied to prediction over vector-
valued data in geostatistics [51]. In the linear model of coregionalization, outputs
are modeled as a weighted combination of independent random functions, which we
refer to as basis kernels. We denote this set of Q basis kernels used to model our D
covariates as {κ_q(t, t′)}_{q=1}^Q, such that the full joint kernel for a given patient i can be
written as a structured linear mixture of these Q kernels:
κ_i(t_i, t_i′) = Σ_{q=1}^Q [ b_{q,(1,1)} κ_q(t_{i,1}, t_{i,1}′)   · · ·   b_{q,(1,D)} κ_q(t_{i,1}, t_{i,D}′)
                                     ...                . . .                 ...
                             b_{q,(D,1)} κ_q(t_{i,D}, t_{i,1}′)   · · ·   b_{q,(D,D)} κ_q(t_{i,D}, t_{i,D}′) ]  ∈ R^{T_i×T_i} (3.9)
where weights b_{q,(d,d′)} scale the covariance (as described by κ_q) between covariates d
and d′. These weights can be rewritten as a set of matrices {B_q}_{q=1}^Q, where each B_q
is a symmetric positive definite matrix defined by:
B_q = [ b_{q,(1,1)}   · · ·   b_{q,(1,D)}
            ...       . . .       ...
        b_{q,(D,1)}   · · ·   b_{q,(D,D)} ]  ∈ R^{D×D}. (3.10)
In cases where we have the same input time vector for each of our covariates, the LMC
in Equation 3.9 can be further simplified using the Kronecker product (⊗), such that:
κ_i(t_i, t_i′) = Σ_{q=1}^Q B_q ⊗ κ_q(t_{i,∗}, t_{i,∗}′) (3.11)
where t_{i,∗} represents the common time vector of the covariates. Note that this does not
hold for the irregularly sampled vitals and lab tests in the clinical time series modeled here. In
practice, we compute each sub-block κ_q(t_{i,d}, t_{i,d′}′) for any pair of input time vectors t_{i,d} and
t_{i,d′}′ from two signals, indexed by d and d′.
For our setting, we parametrize the basis kernel as a spectral mixture kernel [128]:
κ_q(t, t′) = exp(−2π²τ²v_q) cos(2πτµ_q) (3.12)
where τ = |t−t′|, allowing us to model smooth transitions in time or circadian rhythm
of these vital signs and lab results. The use of this model for GP regression requires
that our covariance matrix κ(t, t′) is positive definite for all t, t′; as each basis kernel
is positive definite, we simply need to ensure that every Bq is also positive definite.
We do so by parametrizing:
B_q = A_q A_q^T + [ λ_{q,1}    0     · · ·    0
                      0     λ_{q,2}  · · ·    0
                     ...      ...    . . .   ...
                      0       0      · · ·  λ_{q,D} ]  = A_q A_q^T + diag(λ_q) (3.13)
where A_q ∈ R^{D×R_q} and λ_q ∈ R^{D×1}; R_q is therefore the rank of B_q when λ_q = 0.
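To make this construction concrete, the following Python sketch (an illustration only: the function names, toy dimensions and randomly drawn parameters are assumptions, not the values estimated in this work) assembles the joint LMC covariance of Equation 3.9 from spectral mixture basis kernels (Equation 3.12) and coregionalization matrices parametrized as in Equation 3.13:

import numpy as np

def spectral_mixture(t, t_prime, v, mu):
    # Basis kernel of Eq. 3.12: exp(-2*pi^2*tau^2*v) * cos(2*pi*tau*mu)
    tau = np.abs(t[:, None] - t_prime[None, :])
    return np.exp(-2 * np.pi**2 * tau**2 * v) * np.cos(2 * np.pi * tau * mu)

def lmc_kernel(times, A, lam, v, mu):
    # Joint LMC covariance (Eq. 3.9) over D irregularly sampled covariates.
    #   times : list of D arrays of time stamps, one per covariate
    #   A     : (Q, D, R) factors;  lam : (Q, D) diagonals, so B_q = A_q A_q^T + diag(lam_q)
    #   v, mu : (Q,) spectral mixture parameters for each basis kernel
    Q, D = A.shape[0], len(times)
    B = [A[q] @ A[q].T + np.diag(lam[q]) for q in range(Q)]   # positive definite B_q (Eq. 3.13)
    blocks = [[sum(B[q][d, dp] * spectral_mixture(times[d], times[dp], v[q], mu[q])
                   for q in range(Q))
               for dp in range(D)] for d in range(D)]
    return np.block(blocks)                                    # (T_i x T_i) covariance matrix

# Toy example: D = 3 covariates with different sampling times, Q = 2 basis kernels, rank R = 2
rng = np.random.default_rng(0)
times = [np.sort(rng.uniform(0, 24, n)) for n in (20, 8, 5)]
A, lam = rng.normal(size=(2, 3, 2)), rng.uniform(0.1, 0.5, size=(2, 3))
v, mu = np.array([0.05, 0.01]), np.array([0.0, 1 / 24])        # smooth + circadian components
K = lmc_kernel(times, A, lam, v, mu)
print(K.shape)  # (33, 33)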
In this work, we set the number of basis kernels Q = 2 and R_q = 5 for all q, to jointly
model 12 selected physiological signals (D = 12). In choosing these signals, we exclude
vitals that take discrete values, such as ventilator mode or the RASS sedation scale.
For each patient, one structured GP kernel is estimated using the implementation
in [12]. We then impute the time series with the estimated posterior mean given
all the observations across all chosen physiological signals within that patient. For
those vitals that are not imputed this way, we simply resample with means and apply
sample-and-hold interpolation. After preprocessing, we obtain complete data for each
patient, at a temporal resolution of 10 minutes, from admission time to discharge time.
Imputation in the training set uses all known measurements, while for the test set we
use only those measurements before the current time step; our forecast values converge
Figure 3.2: Sample trajectories of 8 vitals in an ICU admission, with Gaussian process imputation. A total of 12 vital signs are jointly modeled by the MOGP.
towards the population mean with increasing time since the last known measurement.
An example of imputed vital signs for a single patient is shown in Figure 3.2.
3.2.3 MDP Formulation
A Markov Decision Process or MDP is defined by the following key components:
(i) A finite state space S such that at each time t, the environment (here, the
patient as observed through the EHR) is in state st ∈ S,
(ii) An action space A: at each time t, the reinforcement learning agent chooses
some action at ∈ A, which influences the next state, st+1,
(iii) A transition function P(s_{t+1} | s_t, a_t), which defines the dynamics of the system
and is typically unknown, and
(iv) A reward rt+1 = R(st, at, st+1) observed at each time step, which defines the
immediate feedback received following a state transition.
The goal of the reinforcement learning agent is to learn a policy, or mapping
π(s) → a from states to actions, that maximizes the value V^π(s), defined as the expected
accumulated reward over horizon length T with discount factor γ:
V^π(s_t) = lim_{T→∞} E_π [ Σ_{t}^{T−1} γ^t R(s_t, a_t, s_{t+1}) ] (3.14)
where γ determines the relative weight of immediate and long-term rewards.
Patient response to sedation and readiness for extubation can depend on a num-
ber of different factors, from demographic characteristics, pre-existing conditions and
comorbidities to specific time-varying vitals measurements, and there is considerable
variability in clinical opinion on the extent of influence of different factors. Here, in
defining each patient state within an MDP, we look to incorporate as many reliable
and frequently monitored features as possible, and allow the algorithm to determine
the relevant features. The state at each time t is a 32-dimensional feature vector
that includes fixed demographic information (patient age, weight, gender, admit type,
ethnicity) as well as any relevant physiological measurements, ventilator settings, level
of consciousness (given by the Richmond Agitation Sedation Scale, or RASS), cur-
rent dosages of different sedatives or analgesic agents, time into ventilation, and the
number of intubations so far in the admission. For simplicity, categorical variables
admit type and ethnicity are binarized according to emergency/non-emergency and
white/non-white admissions respectively.
In designing the action space, we develop an approximate mapping of a set of six
commonly used sedatives into a single dosage scale, and choose to discretize this scale
to four different levels of sedation. The action at ∈ A at each time step is chosen from
a finite two-dimensional set of eight actions, where at[0] ∈ {0, 1} indicates having the
patient off or on the ventilator respectively, and at[1] ∈ {0, 1, 2, 3} corresponds to the
level of sedation to be administered over the next 10-minute interval:
A = { [0, 0]^T, [0, 1]^T, [0, 2]^T, [0, 3]^T, [1, 0]^T, [1, 1]^T, [1, 2]^T, [1, 3]^T } (3.15)
Finally, we associate a reward rt+1 with each state transition—defined by the
tuple ⟨s_t, a_t, s_{t+1}⟩—to encompass (i) the effective cost of time spent under invasive ven-
tilation, r^intub_{t+1}, (ii) feedback from failed SBTs or the need for reintubation, r^extub_{t+1}, and (iii)
penalties for physiological instability, i.e. when vitals are highly fluctuating or outside
reference ranges, r^vitals_{t+1}. The feedback at each timestep is defined by a combination of
sigmoid, piecewise-linear and threshold functions that reward closely regulated vitals
and successful extubation while penalizing adverse events:
r_{t+1} = r^intub_{t+1} + r^extub_{t+1} + r^vitals_{t+1} (3.16)
where each component in the above summation is defined as follows:
r^intub_{t+1} = 1[a_t[0]=1] [ −C_1 · 1[a_{t−1}[0]=0] − C_2 · 1[a_{t−1}[0]=1] ] (3.17)
r^extub_{t+1} = 1[a_t[0]=0] [ C_3 · 1[a_{t−1}[0]=1] + C_4 · 1[a_{t−1}[0]=0] − C_5 Σ_{v^ext} 1[v^ext_t > v^ext_max or v^ext_t < v^ext_min] ] (3.18)
r^vitals_{t+1} = Σ_v [ C_6 / (1 + e^{−(v_t − v_min)}) − C_6 / (1 + e^{−(v_t − v_max)}) + C_6/2 − C_7 · max(0, |v_{t+1} − v_t| / v_t − 1/5) ] (3.19)
where positive constants C1 to C7 determine the relative importance of these reward
signals. The system therefore accumulates negative rewards C1 at intubation, and
C2 for each additional time step spent on the ventilator. A large positive reward
C3 is observed at the time of extubation, along with additional positive feedback
C4 while remaining off the ventilator. Vitals v^ext_t comprise the subset of parameters
directly associated with readiness for extubation (FiO2, SpO2 and PEEP set), with
weaning criteria defined by the ranges [v^ext_min, v^ext_max]. A fixed penalty C5 is applied for
each criterion not met when off invasive support.
Finally, values vt are the measurements of those vitals v (included in the state
representation st) believed to be indicative of physiological stability at time t, with
desired ranges [vmin, vmax]. The penalty for exceeding these ranges at each time step
is given by a truncated sigmoid function, illustrated in Figure 3.3a. The system also
receives negative rewards when consecutive measurements see a change greater than
20% (positive or negative) in value, as shown in Figure 3.3b. These two sources of
feedback are scaled by constants C6 and C7 respectively.
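As a minimal sketch of how the per-vital feedback of Equation 3.19 can be computed (illustrative only: the constants C6 = C7 = 1 and the heart-rate range below are assumptions, not the values used in our experiments):

import numpy as np

def vitals_reward(v_t, v_next, v_min, v_max, C6=1.0, C7=1.0):
    # Per-vital feedback of Eq. 3.19: truncated-sigmoid term for staying within the
    # reference range, plus a penalty when consecutive values change by more than 20%.
    range_term = (C6 / (1 + np.exp(-(v_t - v_min)))
                  - C6 / (1 + np.exp(-(v_t - v_max)))
                  + C6 / 2)
    fluctuation_penalty = C7 * max(0.0, abs(v_next - v_t) / v_t - 0.2)
    return range_term - fluctuation_penalty

# Illustrative values for heart rate with desired range [60, 130]
print(vitals_reward(85, 88, 60, 130))    # well-regulated vital: close to 1.5 * C6
print(vitals_reward(85, 140, 60, 130))   # a >20% jump between steps incurs a penalty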
3.2.4 Learning the Optimal Policy
The majority of reinforcement learning algorithms are based on estimation of the
Q-function, that is, the expected value of state-action pairs Qπ(s, a) : S × A → R,
(a) Exceeding threshold values (b) High fluctuation in values
Figure 3.3: Shape of reward function penalising instability in vitals, r^vitals_t(v_t)
to determine the optimal policy π. Of these, the most widely used is Q-learning, an
off-policy reinforcement learning algorithm in which we start with some initial state
and arbitrary approximation of the Q-function, and update this estimate using the
reward from the next transition using the Bellman recursion for Q-values:
Q(s_t, a_t) = Q(s_t, a_t) + α ( r_{t+1} + γ max_{a∈A} Q(s_{t+1}, a) − Q(s_t, a_t) ) (3.20)
where the learning rate α determines the weight given to each new transition seen,
and γ is the discount factor.
Fitted Q-iteration (FQI), on the other hand, is a form of off-policy batch-mode
reinforcement learning that uses a set of one-step transition tuples:
F = {(⟨s_t^n, a_t^n, s_{t+1}^n⟩, r_{t+1}^n), n = 1, ..., |F|} (3.21)
to learn a sequence of function approximators Q1, Q2...QK of the value of state-action
pairs, by iteratively solving supervised learning problems. Both FQI and Q-learning
belong to the class of model-free reinforcement learning methods, which assumes no
knowledge of the dynamics of the system. In the case of FQI, there are also no
assumptions made on the ordering of tuples; these could correspond to a sequence
of transitions from a single admission, or randomly ordered transitions from multiple
histories. FQI is therefore more data-efficient, with the full set of samples used by the
algorithm at every iteration, and typically converges much faster than Q-learning.
The training set for the kth supervised learning problem is given by TS =
{(⟨s_t^n, a_t^n⟩, Q_k(s_t^n, a_t^n)), n = 1, ..., |F|}. As before, the Q-function is updated at each
iteration according to the Bellman equation:
Q_k(s_t, a_t) ← r_{t+1} + γ max_{a∈A} Q_{k−1}(s_{t+1}, a) (3.22)
where Q_1(s_t, a_t) = r_{t+1}. The optimal policy after K iterations is then given by:
π*(s) = argmax_{a∈A} Q_K(s, a) (3.23)
A variant of this procedure is outlined in Algorithm 1 for Fitted Q-iteration with
sampling, where a batch of transitions is sampled from the full dataset (uniformly,
or by prioritizing certain experience) without replacement at each iteration. This
allows us to speed up training of the Q-function given very large datasets, assigning
greater weight to more informative transitions as necessary.
FQI guarantees convergence for many commonly used regressors, including kernel-
based methods [92] and decision trees. In particular, fitted-Q with extremely random-
ized trees or extra-trees (FQIT) [21, 29], a tree-based ensemble method that extends
on random forests by introducing randomness in the thresholds chosen at each split,
has been applied in the past to learning large or continuous Q-functions in clinical
settings [22, 23]. Neural Fitted-Q (NFQ) [105] on the other hand, looks to lever-
age the representational power of neural networks as regressors to fitted Q-iteration.
Nemati et al. [86] use NFQ to learn optimal heparin dosages, mapping the patient
Algorithm 1 Fitted Q-iteration with sampling
Input: One-step transitions F = {⟨s_t^n, a_t^n, s_{t+1}^n⟩, r_{t+1}^n}_{n=1:|F|};
       Regression parameters θ; Action space A; subset size N
Initialize Q_0(s_t, a_t) = 0 ∀ s_t ∈ F, a_t ∈ A
for iteration k = 1 → K do
    subset_N ∼ F
    S ← []
    for i ∈ subset_N do
        Q_k(s_i, a_i) ← r_{i+1} + γ max_{a′∈A} (predict(⟨s_{i+1}, a′⟩, θ))
        S ← append(S, ⟨(s_i, a_i), Q_k(s_i, a_i)⟩)
    end
    θ ← regress(S)
end
Result: θ
π ← classify(⟨s_t^n, a_t^n⟩)
hidden state to expected return. Neural networks hold an advantage over tree-based
methods in iterative settings in that it is possible to simply update weights in the
network at each iteration, rather than rebuilding entirely.
3.3 Results
After extracting relevant ventilation episodes from ICU admissions in the MIMIC III
database as described in Section 3.2.1, and splitting these into training and test data,
we obtain a total of 1,800 distinct admissions in our training set and 664 admissions
in our test set. We interpolate a set of 12 key time-varying vitals measurements using
Gaussian processes, sampling at 10-minute intervals; missing values in the remaining
components of the state space are imputed using sample-and-hold interpolation. This
yields of the order of 1.5 million one-step transition tuples of the form ⟨st, at, st+1, rt+1⟩
in the training set and 0.5 million in the test set respectively, where each state in the
tuple is a 32-dimensional continuous representation of patient physiology, each action
is two-dimensional and can take one of eight discrete values, and the scalar rewards
indicate the “goodness” of each transition with respect to patient outcome. In our
policy optimization, we use discount factor γ = 0.9, such that rewards observed
24 hours in the future then have approximately one tenth the weight of immediate
rewards, when determining the optimal action at a given state.
As an initial baseline, we look to apply Q-learning on the training data to learn
the mapping of continuous states to Q-values, with function approximation using a
simple three-layer feedforward neural network. The network is trained using Adam,
an efficient stochastic gradient-based optimizer [54], and l2 regularization of weights.
Each patient admission k is treated as a distinct episode, with on the order of
thousands of state transitions in each, and the network weights are incrementally updated
following each transition. The change between successive episodes in the predicted Q-
values for all state-action pairs in the training set is plotted in Figure 3.4—it is unclear
whether the algorithm succeeds in converging within the 1,800 training episodes.
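A sketch of this kind of function approximator—a three-layer feedforward network updated incrementally with Adam and l2 weight regularization—is shown below, written here in PyTorch; the hidden width and learning rates are illustrative assumptions rather than the settings used in our experiments:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Three-layer feedforward network mapping a 32-dim state to Q-values for 8 actions.
    def __init__(self, state_dim=32, n_actions=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, s):
        return self.net(s)

q_net = QNetwork()
# Adam with weight decay plays the role of l2 regularization of the network weights
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3, weight_decay=1e-4)
gamma = 0.9

def q_learning_update(s, a, r, s_next):
    # One incremental Q-learning step on a single observed transition (Eq. 3.20).
    q_sa = q_net(s)[a]
    with torch.no_grad():
        target = r + gamma * q_net(s_next).max()
    loss = (q_sa - target) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy transition
s, s_next = torch.randn(32), torch.randn(32)
q_learning_update(s, a=3, r=0.5, s_next=s_next)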
We then explore the use of Fitted Q-iteration instead to learn our Q-function,
first running with an Extra-Trees regressor. In our implementation, each iteration of
FQI is performed on a sampled subset of 10% of all transitions in the training set,
as described in Algorithm 1, such that on average, each sample is seen in a tenth of
all iterations. Though sampling increases the total number of iterations required for
convergence, it yields significant speed-ups in building trees at each iteration, and
hence in total training time. The ensemble regressor learns 50 trees, with regulariza-
tion in the form of a minimum leaf node size of 20 samples. We present here results
with FQI performed for a fixed number of 100 iterations, though it is possible to use
a convergence criterion of the form ∆(Qk, Qk−1) ≤ ε for early stopping, in order to
speed up training further.
Figure 3.4: Convergence of Q(s, a) using Q-learning
The same methods are then used to run FQI with neural networks (NFQ) in place
of tree-based regression: we train a feedforward network with architecture and tech-
niques identical to those applied in function approximation with Q-learning. Conver-
gence of the estimated Q-function for both regressors, measured by the mean change
in the estimate Q for transitions in the training set, is plotted in Figure 3.5; we can
see that the algorithm takes roughly 60 iterations to converge in both cases. How-
ever, NFQ yields approximately a four-fold gain in runtime speed, as expected, since
with neural networks we can incrementally update weights rather than retraining the
network with a cold start at each iteration.
The estimated Q-functions from FQI with Extra-Trees (FQIT) and from NFQ are
then used to evaluate the optimal action, i.e. that which maximizes the value of the
state-action pair, for each state in the training set. We can then train policy functions
π(s) mapping a given patient state to the corresponding optimal action a ∈ A. To
Figure 3.5: Convergence of Q(s, a) using Fitted Q-iteration
allow for clinical interpretation of the final policy, we choose to train an Extra-Trees
classifier comprising an ensemble of 100 trees to represent the policy function.
Figure 3.6 gives the relative weight assigned to the top 24 features in the state
space for the policy trees learnt, when training on optimal actions from both FQIT
and NFQ. Feature importances are obtained using the Gini or mean decrease in
impurity importance score. The five vitals ranking highest in importance across the
two policies are arterial O2 pressure, arterial pH, FiO2, O2 flow and PEEP set. These
are as expected—arterial pH, FiO2 and PEEP all feature in our preliminary HUP
guidelines for extubation criteria, and there is considerable literature suggesting blood
gases are an important indicator of readiness for weaning [42]. On the other hand,
oxygen saturation from pulse oximetry (SpO2), which is also included in HUP's current
extubation criteria, is fairly low in ranking. This may be because these measurements
are highly correlated with other factors in the state space, for example arterial O2
Figure 3.6: Gini feature importances for optimal policies following FQIT or NFQ. Oxygenation criteria used in typical weaning guidelines tend to be highly weighted.
pressure [17], that account for its influence on weaning more directly. The limited
importance assigned to heart rate and respiratory rate is also likely to be explained
by this dependence between vitals. In terms of demographics, weight and age play a
significant role in the weaning policy learnt: weight is likely to influence our sedation
policy specifically, as dosages are typically adjusted for patient weight, while age can
be strongly correlated with a patient’s speed of recovery, and hence the time needed
on ventilator support.
In order to evaluate the performance of the policies learnt, we compare the algo-
rithm’s recommendations against the true policy implemented by the hospital. Con-
sidering ventilation and sedation separately, the policies learnt with FQIT and NFQ
achieve similar accuracies in recommending ventilation (both matching the true policy
in approximately 85% of transitions), while FQIT far outperforms NFQ in the case
of sedation policy (achieving 58% accuracy compared with just 28%, which barely out-
performs a random choice of dosage level), perhaps due to overfitting of the neural network on this
(a) Ventilation Policy: Reintubations (b) Ventilation Policy: Accumulated Reward
(c) Sedation Policy: Reintubations (d) Sedation Policy: Accumulated Reward
Figure 3.7: Evaluating policy in terms of reward and number of reintubations suggests admissions where actions match our policy more closely are generally associated with better patient outcomes, both in terms of number of reintubations and accumulated reward, which reflects in part the regulation of vitals.
dataset—it is likely that more data is necessary to develop a meaningful sedation pol-
icy with NFQ. We therefore concentrate further analysis of policy recommendations
to those produced by FQIT.
Given the long horizons of MDPs in this task, and the size of the action space, tra-
ditional off-policy evaluation estimators such as importance sampling yield incredibly
high variance estimates of performance. Instead, we consider applying a variant of
simpler rejection-sampling approaches, detailed here. We divide the 664 test admis-
sions into six groups according to the fraction of FQI policy actions that differ from
the hospital’s policy: ∆0 comprises admissions in which the true and recommended
policies agree perfectly, while those in ∆5 show the greatest deviation. Figure 3.7a and
3.7b plot the distribution of the number of reintubations and the mean accumulated
reward over patient admissions respectively, for all patients in each set; we can see
that those admissions in set ∆0 undergo no reintubation, and in general the average
number of reintubations increases with deviation from the FQIT policy, with up to
seven distinct intubations observed in admissions in ∆5. This effect is emphasised
by the trend in mean rewards across the six admission groups, which serve primarily
as an indicator of the regulation of vitals within desired ranges and whether certain
criteria were met at extubation: we can see that mean reward over a set is highest
(and the range lowest) for admissions in which the policies match perfectly, and de-
creases with increasing divergence of the two policies. A less distinct but very much
comparable pattern is seen when grouping admissions instead by similarity of the
sedation policy to the true dosage levels administered by the hospital; Figure 3.7c
and 3.7d illustrate the trends in the number of reintubations and in mean rewards
respectively.
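One plausible way to compute this grouping is sketched below (the dataframe column names and the exact binning of non-zero deviations are assumptions; the precise grouping rule is not reproduced here):

import numpy as np
import pandas as pd

def deviation_groups(df, n_groups=6):
    # Per-admission fraction of time steps where the recommended action differs from
    # the clinician's, mapped to groups 0 (perfect agreement) .. n_groups-1.
    mismatch = (df["action_clinician"] != df["action_policy"]).astype(float)
    frac = mismatch.groupby(df["admission_id"]).mean()
    # One plausible binning: group 0 for exact agreement, remaining admissions split
    # evenly over (0, max deviation].
    edges = np.linspace(0.0, frac.max(), n_groups)
    groups = np.searchsorted(edges, frac.values, side="left")
    return pd.Series(np.clip(groups, 0, n_groups - 1), index=frac.index)

# Example summary of outcomes per deviation group, given per-admission outcome columns:
# outcomes.groupby(deviation_groups(df))[["n_reintubations", "mean_reward"]].mean()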
3.4 Conclusion
In this chapter, we propose a data-driven approach to the optimization of weaning
from mechanical ventilation of patients in the ICU. We model patient admissions
as Markov decision processes, developing novel representations of the problem state,
action space and reward function in this framework. Reinforcement learning with
fitted Q-iteration using different regressors is then used to learn a simple ventilator
weaning policy from examples in historical ICU data. We demonstrate that the
algorithm is capable of extracting meaningful indicators for patient readiness and
shows promise in recommending extubation time and sedation levels, on average
outperforming clinical practice in terms of regulation of vitals and reintubations.
There are a number of challenges that must be overcome before these methods can
be meaningfully implemented in a clinical setting, however: firstly, in order to generate
robust treatment recommendations, it is important to ensure policy invariance to
reward shaping: the current methods display considerable sensitivity to the relative
weighting of various components of the feedback received after each transition. A
more principled approach to the design of the reward function can help tackle this
sensitivity. In addition, addressing the question of censoring in sub-optimal historical
data and explicitly correcting for the bias that arises from the timing of interventions
is crucial to fair evaluation of learnt policies, particularly where they deviate from the
actions taken by the clinician. Finally, effective communication of the best action,
expected reward, and the associated uncertainty, calls for a probabilistic approach to
estimation of the Q-function, which can perhaps be addressed by pairing regressors
such as Gaussian processes with Fitted Q-iteration.
Possible avenues for future work also include increasing the sophistication of the
state space, for example by handling long term effects more explicitly using second-
order statistics of vitals, applying techniques in inverse reinforcement learning to
feature engineering [67], or modeling patient admissions as a partially observable
MDP, in which raw observations of the patient physiology are drawn from some true
underlying state. Extending the action space to include continuous dosages of specific
drug types and explicit settings such as the inspired oxygen fraction or the value of
PEEP set can also facilitate directly executable policy recommendations, and enable
better informed decisions in critical care.
Chapter 4
Optimizing Laboratory Tests with
Multi-objective RL
Precise, targeted patient monitoring is central to improving treatment in an ICU, al-
lowing clinicians to detect changes in patient state and to intervene promptly and only
when necessary. While basic physiological parameters that can be monitored bedside
(e.g., heart rate) are recorded continually, those that require invasive or expensive lab-
oratory tests (e.g., white blood cell counts) are more intermittently sampled. These
lab tests are estimated to influence up to 70% of diagnoses or treatment deci-
sions, and are often cited as the motivation for more costly downstream care [7, 134].
Recent medical reviews raise several concerns about the over-ordering of lab tests
in the ICU [74]. Redundant testing can occur when labs are ordered by multiple
clinicians treating the same patient or when recurring orders are placed without re-
assessment of clinical necessity. Many of these orders occur at time intervals that are
unlikely to include a clinically relevant change or when large panel testing is repeated
to detect a change in a small subset of analytes [58]. This leads to inflation in costs
of care and in the likelihood of false positives in diagnostics, and also causes un-
necessary discomfort to the patient. Moreover, excessive phlebotomies (blood tests)
can contribute to risk of hospital-acquired anaemia; around 95% of patients in the
ICU have below normal haemoglobin levels by day 3 of admission and are in need of
blood transfusions. It has been shown that phlebotomy accounts for almost half the
variation in the amount of blood transfused [46].
With the disproportionate rise in lab costs relative to medical activity in recent
years, there is a pressing need for a sustainable approach to test ordering. A variety of
approaches have been considered to this end, including restrictions on the minimum
time interval between tests or the total number of tests ordered per week. More
data-driven approaches include an information theoretic framework to analyze the
amount of novel information in each ICU lab test by computing conditional entropy
and quantifying the decrease in novel information of a test over the first three days
of an admission [65]. In a similar vein, a binary classifier was trained using fuzzy
modeling to determine whether or not a given lab test contributes to information
gain in the clinical management of patients with gastrointestinal bleeding [15]. An
“informative” lab test is one in which there is significant change in the value of the
tested parameter, or where values were beyond certain clinically defined thresholds;
the results suggest a 50% reduction in lab tests compared with observed behaviour.
More recent work looked at predicting the results of ferritin testing for iron deficiency
from information in other labs performed concurrently [76]. The predictability of the
measurement is inversely proportional to the novel information in the test. These
past approaches underscore the high levels of redundancy that arise from current
practice. However, there are many key clinical factors that have not been previously
accounted for, such as the low-cost predictive information available from vital signs,
causal connection of clinical interventions with test results, and the relative costs or
feasibility constraints associated with ordering various tests.
In this chapter, we introduce a reinforcement learning (RL) based method to tackle
the problem of developing a policy to perform actionable lab testing in ICU patients.
Our approach is two-fold: first, we build an interpretable model to forecast future
patient states based on past observations, including uncertainty quantification. We
adapt multi-output Gaussian processes (MOGPs; [12, 30]) to learn the patient state
transition dynamics from a patient cohort including sparse and irregularly sampled
medical time series data, and to predict future states of a given patient trajectory.
Second, we model patient trajectories as a Markov decision process (MDP). In doing
so, we draw from the framework introduced in Chapter 3 to efficiently wean patients
from mechanical ventilation [98], as well as other work on recommendation of treat-
ment strategies for critical care patients in a variety of different settings [88, 103].
We design the state and reward functions of the MDP to incorporate relevant clinical
information, such as the expected information gain, subsequent administered inter-
ventions, and the costs of actions (namely, requesting and performing a lab test).
A major challenge is designing a reward function that can trade off multiple, often
opposing, objectives. There has been initial work on extending the MDP framework
to composite reward functions [85]. Specifically, fitted Q-iteration (FQI) has been
used to learn policies for multi-objective MDPs with vector-valued rewards, for the
sequence of interventions in two-stage clinical antipsychotic trials [72]. A variation
of Pareto domination was then used to generate a partial ordering of policies and
extract all policies that are optimal for some scalarization function, leaving the choice
of parameters of the scalarization function to decision makers.
Here, we look to translate these principles to the problem of lab test ordering.
Specifically, we focus on blood tests relevant in the diagnosis of sepsis or acute renal
failure, two conditions with high prevalence in the ICU and high associated
mortality risk. These tests are white blood cell count (WBC), blood lac-
tate level, serum creatinine, and blood urea nitrogen (BUN); abnormalities in the
first two markers are commonly used in diagnosis of severe sepsis, while the latter
are associated with compromised kidney function. We present our methods within a
flexible framework that can in principle be adapted to a patient cohort with different
diagnoses or treatment objectives, influenced by a distinct set of lab results. Our
proposed framework integrates prior work on off-policy RL and Pareto learning with
practical clinical constraints to yield policies that are close to intuition demonstrated
in historical data. Again, we demonstrate our approach using a publicly available
database of ICU admissions, evaluating the estimated policy against the policy fol-
lowed by clinicians using both importance sampling based estimators for off-policy
policy evaluation and by comparing against multiple clinically inspired objectives,
including onset of clinical treatment that was motivated by the lab results.
Prior publication: Li-Fang Cheng*, Niranjani Prasad*, and Barbara E. Engel-
hardt. An Optimal Policy for Patient Laboratory Tests in Intensive Care Units.
Proceedings of Pacific Symposium on Biocomputing (PSB) 2019 [13].
*Much of the work detailed in this chapter was developed jointly with Li-Fang Cheng.
I sincerely thank her for her contribution.
4.1 Methods
4.1.1 MIMIC Cohort Selection and Preprocessing
We extract our cohort of interest from the MIMIC III database [49], which includes de-
identified critical care data from over 58,000 hospital admissions. From this database,
we first select adult patients with at least one recorded measure for each of 20 vital
signs and lab tests commonly ordered and reviewed by clinicians (for instance, results
reported in a complete blood count or basic metabolic panel). We further filter
patients by their length-of-stay, keeping only those in the ICU for between one and
twenty days, to obtain a final set of 6,060 patients. Table 4.1 summarizes key statistics
for patient physiological parameters in this filtered cohort.
Table 4.1: Total number of nurse-verified recordings, measurement mean and standard deviation (SD) for covariates in selected cohort.
Covariate Count Mean SD
Respiratory Rate (RR) 1,046,364 20.1 5.7
Heart Rate (HR) 964,804 87.5 18.2
Mean Blood Pressure (Mean BP) 969,062 77.9 15.3
Temperature, ◦F 209,499 98.5 1.4
Creatinine 67,565 1.5 1.2
Blood Urea Nitrogen (BUN) 66,746 31.0 21.1
White Blood Cell Count (WBC) 59,777 11.6 6.2
Lactate 39,667 2.4 1.8
Included in the 20 physiological traits we filter for are eight that are particularly
predictive of the onset of severe sepsis, septic shock, or acute kidney failure. These
traits are included in the SIRS (Systemic Inflammatory Response Syndrome) and SOFA
(Sequential Organ Failure Assessment) criteria [78]. The average number of daily measurements
or lab test orders across the chosen cohort for these eight traits is highly variable
(Figure 4.1). Of these eight traits, the first three are vitals measured using bedside
monitoring systems for which approximately hourly measurements are recorded; the
latter four are labs requiring phlebotomy and are typically measured just 2–3 times
each day. We find the frequency of orders also varies across different labs, possibly
due in part to differences in cost; for example, WBC (which is relatively inexpensive
to test) is on average sampled slightly more often than lactate.
In order to apply our proposed RL algorithm to this sparse, irregularly sampled
dataset, we adapt the multi-output Gaussian process (MOGP) framework [12] to
obtain hourly predictions of patient state with uncertainty quantified, on 17 of the 20
clinical traits. For three of the vitals, namely the components of the Glasgow Coma
Scale, we impute with the last recorded measurement.
Figure 4.1: Mean recorded measurements per day, of eight key vitals and lab tests.
4.1.2 Designing a Multi-Objective MDP
Each patient admission is modelled as a Markov decision process defined by: (i) state
space S, where st ∈ S is patient physiological state at time t; (ii) action space A from
which the clinician’s action at is chosen; (iii) unknown transition function P(s, a)
that determines the patient dynamics; and (iv) reward function rt+1 = r(st, at) which
determines observed clinical feedback for this action. The objective of the RL agent
is to learn an optimal policy π∗ : S → A that maximizes the expected discounted
(with some factor γ) accumulated reward over the course of an admission:
π* = argmax_π E [ Σ_{t=0}^∞ γ^t r_t | π ]
We start by describing the state space of our MDP for ordering lab tests. We first re-
sample the raw time series using a multi-output Gaussian process with a sampling
period of one hour. The patient state at time t is defined by:
s_t = [ m_t^SOFA, m_t^vitals, m_t^labs, σ_t^labs, y_t^labs, ∆_t^labs ]^T (4.1)
Here, m_t and σ_t denote the predictive means and standard deviations, respectively, of
the vitals and lab tests. For the predictive SOFA score m_t^SOFA, we compute the
value using its clinical definition, from the predictive means on five traits—mean BP,
bilirubin, platelet, creatinine, FiO2—along with GCS and related medication history
(e.g., dopamine). Vitals include any time-varying physiological traits that we consider
when determining whether to order a lab test. Here, we look at four key physiological
traits—heart rate, respiratory rate, temperature, and mean blood pressure—and four
lab tests—creatinine, BUN, WBC, and lactate. The values yt are the last known
measurements of each of the four labs, and ∆t denotes the elapsed time since each
was last ordered. This formulation results in a 21-dimensional state space. Depending
on the labs that we wish to learn recommendations for testing, the action space A
is a set of binary vectors whose 0/1 elements indicate whether or not to place an
order for a specific lab. These actions can be written as a_t ∈ A = {0, 1}^L, where
L is the number of labs. In our experiments, we learn policies for each of the four
labs independently, such that L = 1, but this framework could be easily extended to
jointly learning recommendations for multiple labs.
In order for our RL agent to learn a meaningful policy, we need to design a reward
function that provides positive feedback for the ordering of tests where necessary,
while penalizing the over- or under-ordering of any given lab test. In particular, the
agent should be encouraged to order labs when the physiological state of the patient is
abnormal with high probability, based on estimates from the MOGP, or when a lab is
predicted to be informative (in that the forecasted value is significantly different from
the last known measurement) due to a sudden change in disease state. In addition,
the agent should incur some penalty whenever a lab test is taken, decaying with
elapsed time since the last measurement, to reflect the effective cost (both economic
and in terms of discomfort to the patient) of the test. We formulate these ideas into
a vector-valued reward function r_t ∈ R^d of the state and action at time t, as follows:
r_t = [ r_t^SOFA, r_t^treat, r_t^info, −r_t^cost ]^T (4.2)
Patient state: The first element, rSOFA, uses the recently introduced SOFA score
for sepsis [112] which assesses severity of organ dysfunction in a potentially septic
patient. Our use of SOFA is motivated by the fact that, in practice, sepsis is more
often recognized from the associated organ failure than from direct detection of the
infection itself [122]. The raw SOFA score ranges from 0 to 24, with a maximum of four
points assigned for failure in each of the respiratory system, nervous system, cardiovascular
system, liver, kidneys, and blood coagulation. A change in SOFA score ≥ 2 is considered a
critical index for sepsis [112]. This rule of thumb is used to define the first reward
term:
r_t^SOFA = 1[a_t ≠ 0] · 1[f(·) ≥ 2] , where f(·) = m_t^SOFA − m_{t−1}^SOFA . (4.3)
The raw score m_t^SOFA at each t is evaluated using current patient labs and vitals [122].
Treatment onset: The second term is an indicator variable for rewards capturing
whether or not there is some treatment or intervention initiated at the next time step:
r_t^treat = 1[a_t ≠ 0] · Σ_{i∈M} 1[treatment i was given at s_{t+1}], (4.4)
where M denotes the set of disease-specific interventions of interest. Again, the
reward term is positive if a lab is ordered; this is based on the rationale that, if a
lab test is ordered and immediately followed by an intervention, the test is likely
to have provided actionable information. Possible interventions include antibiotics,
vasopressors, dialysis or ventilation.
Lab redundancy: The term rtinfo denotes the feedback from taking one or more lab
tests with novel information. We quantify this by using the mean absolute difference
between the last observed value and predictive mean from the MOGP as a proxy for
the information available:
r_t^info = Σ_{ℓ=1}^L max(0, g(·) − c_ℓ) · 1[a_t[ℓ]=1] , where g(·) = | (m_t^(ℓ) − y_t^(ℓ)) / σ_t^(ℓ) | , (4.5)
where σ_t^(ℓ) is the normalization coefficient for lab ℓ, and the parameter c_ℓ determines
the minimum prediction error necessary to trigger a reward; in our experiments, this
is set to the median prediction error for labs ordered in the training data. The larger
the deviation from current forecasts, the higher the potential information gain, and
in turn the reward if the lab is taken.
Lab cost: The last term in the reward function, r_t^cost, adds a penalty whenever any
test is ordered, to reflect the effective “cost” of taking the lab at time t:
r_t^cost = Σ_{ℓ=1}^L exp( −∆_t^(ℓ) / Γ_ℓ ) · 1[a_t[ℓ]=1], (4.6)
where Γ_ℓ is a decay factor that controls how fast the cost decays with the time
∆_t elapsed since the last measurement. In our experiments, we set Γ_ℓ = 6 for all ℓ.
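Putting the four components together, a minimal sketch of the per-step reward vector is given below (the variable names are illustrative, and the indicator 1[a_t ≠ 0] is read here as "at least one lab ordered"):

import numpy as np

def lab_reward(a_t, sofa_now, sofa_prev, treatments_next,
               m, y, sigma, delta, c, Gamma=6.0):
    # Vector-valued reward of Eq. 4.2 for a single time step and L labs.
    #   a_t             : binary array of length L (1 = lab ordered)
    #   sofa_now/prev   : predictive SOFA scores at t and t-1
    #   treatments_next : number of disease-specific interventions initiated at t+1
    #   m, y, sigma     : MOGP predictive mean, last observed value, normalization (length L)
    #   delta           : hours since each lab was last ordered (length L)
    #   c               : minimum normalized prediction error that triggers r_info (length L)
    ordered_any = a_t.any()
    r_sofa = float(ordered_any and (sofa_now - sofa_prev) >= 2)        # Eq. 4.3
    r_treat = treatments_next * float(ordered_any)                     # Eq. 4.4
    g = np.abs((m - y) / sigma)
    r_info = np.sum(np.maximum(0.0, g - c) * a_t)                      # Eq. 4.5
    r_cost = np.sum(np.exp(-delta / Gamma) * a_t)                      # Eq. 4.6
    return np.array([r_sofa, r_treat, r_info, -r_cost])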
4.1.3 Solving for Deterministic Optimal Policy
Once we extract sequences of states, actions, and rewards from the ICU data, we can
generate a dataset of one-step transition tuples of the form D = {⟨s_t^n, a_t^n, s_{t+1}^n⟩, r_t^n},
n = 1, ..., |D|. These tuples can then be used to learn an estimate of the Q-function,
Q : S ×A → Rd —where d = 4 is the dimensionality of the reward function—to map
a given state-action pair to a vector of expected cumulative rewards. Each element
in the Q-vector represents the estimated value of that state-action pair according to
a different objective. We learn this Q-function using a variant of Fitted Q-iteration
(FQI) with extremely randomized trees [21, 98]. FQI is a batch off-policy reinforce-
ment learning algorithm that is well-suited to clinical applications where we have
limited data and challenging state dynamics. The algorithm adapted here to handle
vector-valued rewards is based on Pareto-optimal Fitted-Q [72].
In order to scale from the two-stage decision problem originally tackled to the much
longer admission sequences here (≥ 24 time steps), we define a stricter pruning of
actions: at each iteration we eliminate any dominated actions for a given state—those
actions that are outperformed by alternatives for all elements of the Q-function—and
retain only the set Π(s) = {a : ∄ a′ such that Q_d(s, a) < Q_d(s, a′) ∀ d} for each s. Actions are
further filtered for consistency : we might consider feature consistency to be defined
as rewards being linear in each feature space [72]. Here, we relax this idea to filter out
only those actions from policies that cannot be expressed by our non-linear tree-based
classifier. The function will still yield a non-deterministic policy (NDP) as, in most
cases, there will not be a strictly optimal action that achieves the highest Qd for all d.
We suggest one possible approach for reducing the NDP to give a single best action
for any given state based on practical considerations in the next section.
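The strict pruning step can be illustrated as follows (a sketch over a single state, with toy Q-vectors rather than learnt values):

import numpy as np

def nondominated_actions(q_values):
    # Strict pruning: keep actions not dominated across all d objectives.
    #   q_values : (n_actions, d) array of Q-vectors for a single state
    # Returns indices of actions a for which no a' satisfies Q_d(s,a) < Q_d(s,a') for all d.
    keep = []
    for a in range(len(q_values)):
        dominated = any(np.all(q_values[a] < q_values[b])
                        for b in range(len(q_values)) if b != a)
        if not dominated:
            keep.append(a)
    return keep

# For a binary ordering action with Q-vectors over d = 4 objectives:
q = np.array([[0.2, 0.0, 0.1, -0.05],    # a = 0: do not order
              [0.5, 0.3, 0.4, -0.20]])   # a = 1: order the lab
print(nondominated_actions(q))            # -> [0, 1]: neither action dominates the other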
4.2 Results
Following the extraction of our 6,060 admissions and resampling in hourly intervals
using the forecasting MOGP, we partitioned the cohort into training and test sets of
3,636 and 2,424 admissions respectively. This gave approximately 500,000 one-step
transition tuples of the form ⟨st, at, st+1, rt⟩ in the training set, and over 350,000 in
the test set. We then ran batched FQI with these samples for 200 iterations with
discount factor γ = 0.9. Each iteration took 100,000 transitions, sampled from the
training set, with probability inversely proportional to the frequency of the action
in the tuple. The vector-valued outputs of estimated Q-function were then used to
obtain a non-deterministic policy for each lab considered (Section 4.1.3). We chose
Algorithm 2 Multi-objective Fitted Q-iteration with strict pruning
Input: One-step transitions F = {⟨s_t^n, a_t^n, s_{t+1}^n⟩, r_{t+1}^n}_{n=1:|F|};
       Regression parameters θ; action space A; subset size N
Initialize Q^(0)(s_t, a_t) = 0 ∈ R^d ∀ s_t ∈ F, a_t ∈ A
for iteration k = 1 → K do
    Sample subset_N ∼ F; initialize S ← []
    for i ∈ subset_N do
        Generate set Π(s_i) using Q^(k−1)
        Initialize classification parameters ϕ
        ϕ ← classify(s_i, a_i)
        for π_i ∈ Π do
            a′ ← π_i(s_{i+1}) ∩ predict(s_{i+1}, ϕ)
            Q^(k)(s_i, a_i) ← r_{i+1} + γ Q^(k−1)(s_{i+1}, a′)
        end
        S ← append(S, ⟨(s_i, a_i), Q^(k)(s_i, a_i)⟩)
    end
    θ ← regress(S)
end
Result: θ
to collapse this set to a practical deterministic policy as follows:
Π(s) = { 1,  if Q_d(s, a = 0) < Q_d(s, a = 1) + ε_d  ∀ d
         0,  otherwise. (4.7)
In particular, a lab should be taken (Π(s) = 1) only if the action is optimal, or
estimated to outperform the alternative for all objectives in the Q-function. This
strong condition for ordering a lab is motivated by the fact that one of our primary
objectives here is to minimize unnecessary ordering; the variable εd allows us to
relax this for certain objectives if desired. For example, if cost is a softer constraint,
setting εcost > 0 is an intuitive way to specify this preference in the policy. In our
Figure 4.2: Gini feature importance scores over the 21-dimensional state space for each of our four optimal ordering policies.
experiments, we tuned εcost such that the total number of recommended orders of
each lab approximates the number of actual orders in the training set.
With a deterministic set of optimal actions, we could train our final policy func-
tion π : S → A; again, we used extremely randomized trees. The estimated Gini
feature importances of the policies learnt show that in the case of lactate the most
important features are the mean and measured lactate, the time since last lactate
measurement (∆) and the SOFA score (Figure 4.2). These relative importance scores
are expected: a change in SOFA score may indicate the onset of sepsis, and in turn
warrant a lactate test to confirm a source of infection, fitting typical clinical proto-
col. For the other three policies (WBC, creatinine, BUN) again the time since last
measurement of the respective lab tends be prominent in the policy, along with the ∆
terms for the other two labs. This suggests an overlap in information in these tests:
For example, abnormally high white blood cell count is a key criteria for sepsis; severe
sepsis often cascades into renal failure, which is typically diagnosed by elevated BUN
and creatinine levels [16].
Once we have trained our policy functions, an additional component is added
to our final recommendations: we introduce a budget that suggests taking a lab at
the end of every 24 hour period for which our policy recommends no orders. This
allows us to handle regions of very sparse recommendations by the policy function,
and reflects clinical protocols that require minimum daily monitoring of key labs. In
the policy for lactate orders in a typical patient admission, looking at the timing of
the actual clinician orders, recommendations from our policy, and suggested orders
from the budget framework, the actions are concentrated where lactate values are
increasingly abnormal, or at sharp rises in SOFA score (Figure 4.3).
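One simple reading of this budget rule, assuming hourly time steps, is sketched below (an illustration rather than the exact mechanism used in our experiments):

import numpy as np

def apply_daily_budget(recommended, steps_per_day=24):
    # Insert a suggested order at the end of any 24-hour window in which the learnt
    # policy recommends no orders, reflecting minimum daily monitoring of key labs.
    orders = np.asarray(recommended, dtype=bool).copy()
    last = -1
    for t in range(len(orders)):
        if orders[t]:
            last = t
        elif t - last >= steps_per_day:
            orders[t] = True
            last = t
    return orders

# Example: no recommendations for 30 hours -> one budget order is added at hour 23
print(np.flatnonzero(apply_daily_budget(np.zeros(30))))   # [23]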
4.2.1 Off-Policy Evaluation
We evaluated the quality of our final policy recommendations in a number of ways.
First, we implemented the per-step or per decision weighted importance sampling
(PDWIS) estimator to calculate the value of the policy πe to be evaluated:
VPDWIS(πe) =n∑
i=1
T−1∑t=0
γtWIS
[ρ(i)t∑n
i=1 ρ(i)t
]r(i)t , where ρt =
t−1∏j=0
πe(sj|aj)πb(sj|aj)
,
given data collected from behaviour policy πb [100]. The behaviour policy was found
by training a regressor on real state-action pairs observed in the dataset. The discount
factor was set to γWIS = 1.0, so all time steps contribute equally to the value of a
trajectory.
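A sketch of the PDWIS computation over logged trajectories is given below; the policy-probability functions pi_e and pi_b are assumed to be available (for instance from the trained policy classifier and the behaviour-policy regressor), and the data layout is illustrative:

import numpy as np

def pdwis(trajectories, pi_e, pi_b, gamma=1.0):
    # Per-decision weighted importance sampling (PDWIS) estimate of V(pi_e).
    #   trajectories : list of lists of (state, action, reward) tuples
    #   pi_e, pi_b   : functions (state, action) -> probability under the evaluated
    #                  and behaviour policies respectively
    T = max(len(traj) for traj in trajectories)
    rho = np.zeros((len(trajectories), T))      # cumulative importance ratios rho_t^(i)
    rewards = np.zeros((len(trajectories), T))
    for i, traj in enumerate(trajectories):
        w = 1.0
        for t, (s, a, r) in enumerate(traj):
            rho[i, t] = w                       # rho_t uses ratios up to step t-1
            rewards[i, t] = r
            w *= pi_e(s, a) / pi_b(s, a)
    value = 0.0
    for t in range(T):
        norm = rho[:, t].sum()
        if norm > 0:
            value += (gamma ** t) * np.sum(rho[:, t] / norm * rewards[:, t])
    return value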
We then compared estimates for our policy (MO-FQI) against the behaviour policy
and a set of randomized policies as baselines. These randomized policies were designed
to generate random decisions to order a lab, with probabilities p = {0.01, pemp, 0.5},
where pemp is the empirical probability of an order in the behaviour policy. For each p,
we evaluated ten randomly generated policies and averaged performance over these.
We observed that MO-FQI outperforms the behaviour policy across all reward com-
Figure 4.3: Demonstration of recommended lactate ordering policy for example admission; shaded green region denotes normal lactate range (0.5–2 mmol/L).
ponents, for all four labs (Figure 4.4). Our policy also consistently approximately
matches or outperforms the other policies in terms of cost—note that for absolute
cost, the best policy corresponds to that with the lowest estimated value—even with
the inclusion of the slack variable εcost and the budget framework. Across the re-
maining objectives, MO-FQI outperforms the random policy in at least two of three
components for all but lactate. This may be due in part to the relatively sparse orders
for lactate resulting in higher variance value estimates.
In addition to evaluating using the per-step WIS estimator, we looked for more
intuitive measures of how the final policy influences clinical practice. We computed
three metrics here: (i) estimated reduction in total number of orders, (ii) mean in-
formation gain of orders taken, and (iii) time intervals between labs and subsequent
treatment onsets.
In evaluating the total number of recommended orders, we first filter a sequence of
recommended orders to just the first (onset) recommendation if there are no
clinician orders between them. We argue that this is a fair comparison as subsequent
recommendations are made without counterfactual state estimation, i.e., without as-
suming that the first recommendation was followed by the clinician. Empirically, we find
that the total number of recommendations is considerably reduced. For instance, in
the case of recommending WBC orders, our final policy reports 12,358 orders in the
Figure 4.4: Evaluating V_d(π_e) for each reward component d, across policies for four labs. The (⋆) indicates the best performing policy for each reward component. Error bars for randomized policies show standard deviations across 10 trials.
test set, achieving a reduction of 44% from the number of true orders (22,172). In
the case of lactate, for which clinicians’ orders are the least frequent (14,558), we still
achieved a reduction of 27%.
We also compared the approximate information gain of the actions taken by the
estimated policy, in comparison with the policy used in the collected data. To do this,
we defined the information gain at a given time by looking at the difference between
the approximated true value of the target lab, which we impute using the MOGP
model given all the observed values, and the forecasted value, computed using only
the values observed before the current time. The distribution of aggregate information
gain for orders recommended by our policy and actual clinician’s orders in the test set
shows consistently higher expected mean information gain following ordering policies
learnt from MO-FQI, across all four labs (Figure 4.5).
Figure 4.5: Evaluating information gain of clinician actions against MO-FQI across all labs: the mean information in labs ordered by clinicians is consistently outperformed by MO-FQI: 0.69 vs 1.53 for WBC; 0.09 vs 0.18 for creatinine; 1.63 vs 3.39 for BUN; 0.19 vs 0.38 for lactate.
Lastly, we considered the time to onset of critical interventions, which we define
to include initiation of vasopressors, antibiotics, mechanical ventilation or dialysis.
We first obtained a sequence of treatment onset times for each test patient; for each
of these time points, we traced back to the earliest observed or recommended order
taking place within the past 48 hours, and computed the time between these: ∆t =
ttreatment − torder . The distribution of time-to-treatment for labs taken by the clinician
in the true trajectory against that for recommendations from our policy, for all four
labs, shows that the recommended orders tend to happen earlier than the actual time
of an order by the clinician—on average over an hour in advance for lactate, and more
than four hours in advance for WBC, creatinine, and BUN (Figure 4.6).
4.3 Conclusion
In this work, we propose a reinforcement learning framework for decision support in
the ICU that learns a compositional optimal treatment policy for the ordering of lab
tests from sub-optimal histories. We do this by designing a multi-objective reward
function that reflects clinical considerations when ordering labs, and adapting meth-
Figure 4.6: Evaluating time to treatment onset for lab orders by the clinician against MO-FQI, across all labs. The mean time intervals (in hours) are as follows: 9.1 vs 13.2 for WBC; 7.9 vs 12.5 for creatinine; 8.0 vs 12.5 for BUN; 14.4 vs 15.9 for lactate.
ods for multi-objective batch RL to learning extended sequences of Pareto-optimal
actions. Our final policies are evaluated using importance-sampling-based estimators
for off-policy evaluation, along with metrics for improvements in cost and reductions in
redundancy of orders. Our results suggest that there is considerable room for improvement on
current ordering practices, and the framework introduced here can help recommend
best practices and be used to evaluate deviations from these across care providers,
driving us towards more efficient health care. Furthermore, the low risk of these types
of interventions in patient health care reduces the barrier to testing and deploying
clinician-in-the-loop machine learning-assisted patient care in ICU settings.
Chapter 5
Constrained Reward Design for
Batch RL
One fundamental challenge of reinforcement learning (RL) in practice is specifying
the agent’s reward. Reward functions implicitly define policy, and misspecified re-
wards can introduce severe, unexpected effects, from reward gaming to irreversible
changes in parts of the environment we do not want to influence [66]. However, it
can be difficult for domain experts to distil multiple (and often implicit) requisites for
desired behaviour into a single scalar feedback signal. This is exemplified by efforts
towards the application of reinforcement learning to decision-making in healthcare; in
RL, an agent aims to choose the best action within a stochastic process given inherent
time delay in feedback from a decision, making it an attractive framework for learning
clinical treatment policies [131]. However, this feedback can be received over various
time scales and represent clinical implications—such as treatment efficacy, side ef-
fects or patient discomfort—with widely different, and uncertain, priorities. Existing
approaches to representing this scalar feedback in healthcare tasks range from taking
reward to be a sparse, high-level signal such as mortality [57] or rewards based on a
single physiological variable or severity score of interest [88, 111] to relatively ad hoc
weighting of clinically derived objectives, as in Chapter 3.
Much work in reward design [113, 114] or inference using inverse reinforcement
learning [1, 9, 37] focuses on online, interactive settings in which the agent has access
either to human feedback [14, 73] or to a simulator with which to evaluate policies and
compare against human performance. Here, we focus on reward design for batch RL:
we assume access only to a set of past trajectories collected from sub-optimal experts,
with which to train our policies. This is common in many real-world scenarios where
the risks of deploying an agent are high but logging current practice is relatively easy,
as in healthcare, as well as education or finance [6, 19].
Batch RL is distinguished by two key preconditions when performing reward de-
sign. First, as we assume that data are expensive to acquire, we must ensure that
policies found using the reward function can be evaluated given existing data. Re-
gardless of the true objectives of the designer, there exist fundamental limitations on
reward functions that can be optimized and that also provide guarantees on perfor-
mance. There have been a number of methods presented in the literature for safe,
high-confidence policy improvement from batch data given some reward function,
treating behaviour seen in the data as a baseline [31, 63, 107, 118]. In this work,
we turn this question around to ask: What is the class of reward functions for which
high-confidence policy improvement is possible?
Second, we typically assume that batch data are not random but produced by
domain experts pursuing biased but reasonable policies. Thus if an expert-specified
reward function results in behaviour that diverges substantially from past trajecto-
ries, we must ask whether that divergence was intentional or, as is more likely, simply
because the designer omitted an important constraint, causing the agent to learn
unintentional behaviour. This assumption can be formalized by treating the batch
data as ε-optimal with respect to the true reward function, and searching for rewards
that are consistent with this assumption [43]. Here, we extend these ideas to incor-
porate the uncertainty present when evaluating a policy in the batch setting, where
trajectories from the estimated policy cannot be collected.
We note that these two constraints are not equivalent; the extent of overlap in
reward functions satisfying these criteria depends, for example, on the homogeneity
of behaviour in the batch data: if consistency is measured with respect to average
behaviour in the data, and agents deviate substantially from this average (as may be the case across clinical care providers), then the space of policies that can be evaluated given
the batch data may be larger than the space consistent with the average expert.
In this chapter, we combine these two conditions to construct tests for admissible
functions in reward design using available data. This yields a novel approach to
the challenge of high-confidence policy evaluation given high-variance importance
sampling-based value estimates over extended decision horizons—typical of batch RL
problems—and encourages safe, incremental policy improvement. We illustrate our
approach on several benchmark control tasks with continuous state spaces, and in
reward design for the task of weaning a patient from a mechanical ventilator.
Prior Publication: Niranjani Prasad, Barbara E. Engelhardt, and Finale Doshi-
Velez. Defining admissible rewards for high-confidence policy evaluation in batch re-
inforcement learning. Proceedings of the ACM Conference on Health, Inference, and
Learning (CHIL) 2020 [98].
5.1 Preliminaries and Notation
A Markov decision process (MDP) is a tuple of the form M = {S,A, P0, P, R, γ},
where S is the set of all possible states, and A are the available actions. P0(s) is the
distribution over the initial state s ∈ S; P (s′|s, a) gives the probability of transition
to s′ given current state s and action a ∈ A. The function R(s, a, s′) defines the
reward for performing action a in state s, and observing new state s′. Lastly, the
discount factor γ ≤ 1 determines the relative importance of immediate and longer-
term rewards received by the reinforcement learning agent.
Our objective is to learn a policy function π* : S → A mapping states to actions that maximizes the expected cumulative discounted reward, that is, π* = argmax_π E_{s∼P0}[V^π(s) | M], where the value function V^π(s) is defined as:

V^\pi = \mathbb{E}_{P_0,P,\pi}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})\Big]. \tag{5.1}
In batch RL, we have a collection of trajectories of the form h = {s0, a0, r0, . . . , sT , aT , rT}.
We do not have access to the transition function P or the initial state distribution
P0. Without loss of generality, we express the reward as a linear combination of some
arbitrary function of the observed state transition: R = wTϕ(s, a, s′), where ϕ ∈ Rk
is a vector function of state-action features relevant to learning an optimal policy,
and ||w||1 = 1, to induce invariance to scaling factors in reward specification [9]. The
value V^π of a policy π with reward weight w can then be written as:

V^\pi = \mathbb{E}_{P_0,P,\pi}\Big[\sum_{t=0}^{\infty} \gamma^t w^T\phi(\cdot)\Big] = w^T\mu^\pi, \qquad \mu^\pi = \mathbb{E}_{P_0,P,\pi}\Big[\sum_{t=0}^{\infty} \gamma^t \phi(\cdot)\Big], \tag{5.2}

and the vector µ^π denotes the feature expectations [1] of policy π, that is, the total
expected discounted time an agent spends in each feature state. Thus, µπ provides a
representation of the state dynamics of a policy that is entirely decoupled from the
reward function of the MDP.
To quantify confidence in the estimated value V π of policy π, we adapt the em-
pirical Bernstein concentration inequality [80] to get a probabilistic lower bound Vlb
on the estimated value [119]: consider a set of trajectories h_n, n ∈ 1...N, and let V_n be the value estimate for trajectory n. Then, with probability at least 1 − δ:

V_{lb} = \frac{1}{N}\sum_{n=1}^{N} V_n \;-\; \frac{1}{N}\sqrt{\frac{\ln(2/\delta)}{N-1}\sum_{n,n'=1}^{N}\big(V_n - V_{n'}\big)^2} \;-\; \frac{7b\ln(2/\delta)}{3(N-1)}, \tag{5.3}

where b is the maximum achievable value of V(π).
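A minimal sketch of how the lower bound in Equation 5.3 might be computed from per-trajectory value estimates; the function name and arguments are illustrative, not from the thesis code:

```python
import numpy as np

def empirical_bernstein_lower_bound(values, b, delta=0.05):
    """values: per-trajectory value estimates V_n; b: maximum achievable value."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    mean = values.mean()
    # Sum of squared pairwise differences, sum over n, n' of (V_n - V_n')^2.
    pairwise_sq = np.sum((values[:, None] - values[None, :]) ** 2)
    spread = np.sqrt(np.log(2.0 / delta) / (n - 1) * pairwise_sq) / n
    bias = 7.0 * b * np.log(2.0 / delta) / (3.0 * (n - 1))
    return mean - spread - bias
```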
5.2 Admissible Reward Sets
We now turn to our task of identifying admissible reward sets – that is, defining
the space of reward functions that yield policies that are consistent in performance
with available observational data, as well as possible to evaluate off-policy for high-
confidence performance lower bounds. In Sections 5.2.1 and 5.2.2, we define two
sets of weights PC and PE to be the consistent and evaluable sets, respectively, show
that they are closed and convex, and define their intersection PC ∩ PE as the set
of admissible reward weights. In Sections 5.2.3 and 5.2.4, we describe how to test
whether a given reward lies in the intersection of these polytopes, and, if not, how
to find the closest points within this space of admissible reward functions given some
initial reward proposed by the designer of the RL agent.
5.2.1 Consistent Reward Polytope
Given near-optimal expert demonstrations, the polytope of consistent rewards [43]
may be defined as the set of all weight vectors w defining reward function R = wTϕ(s),
that are consistent with the agent’s existing knowledge. In the setting of learning
from demonstrations, this knowledge is the assumption that demonstrations achieve
ε-optimal performance with respect to the “true” reward. We denote the behaviour
policy of experts as πb with policy feature expectations µb , where V (πb) = wTµb. The
consistent weight vectors for this expert demonstration setting are then all w such
that wTµ ≤ wTµb + ε, µ ∈ PF , where PF is the space of all possible policy feature
representations. It has been shown that this set is convex, given access to an exact
MDP solver [43].
Translating this to the batch reinforcement learning setting, with a fixed set of
sub-optimal trajectories, requires adaptations to both the constraints and their com-
putation. First, we choose to constrain the relative rather than absolute difference
in performance of the observed trajectories and that of the learnt optimal policy, in
order to better handle high variance in the magnitudes of estimated values. Second,
we make our constraint symmetric such that the value of the learnt policy can deviate
equally above or below the value of the observed behaviour. This reflects the use of
this constraint as a way to place metaphorical guardrails on the deviation of the be-
haviour of the learnt policy from the policy in the batch trajectories—rather than to
impose optimality assumptions that only bound performance from above. That is, we
want a reward that results in performance similar to the observed batch trajectories,
where performance that is some factor ∆c greater than or less than this established baseline
should be equally admissible. Our new polytope PC for the space of weights satisfying
this is then:
P_C = \Big\{ w : \frac{1}{\Delta_c} \le \frac{w^T\mu_b}{w^T\mu} \le \Delta_c \Big\}, \tag{5.4}
where µ are the feature expectations of the optimal policy when solving an MDP
with reward weights w, and value estimates are constrained to be positive, wTµ >
0 ∀µ ∈ PF . The parameter ∆c ≥ 1 that determines the threshold on the consistency
polytope is tuned according to our confidence in the batch data; trajectories from
severely biased experts may warrant larger ∆c.
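As a concrete illustration, the membership test for P_C reduces to two ratio inequalities; a minimal sketch, assuming µ and µ_b are already estimated as NumPy vectors (names are illustrative, not the thesis code):

```python
import numpy as np

def in_consistent_polytope(w, mu_opt, mu_b, delta_c):
    """Check the constraint of Equation 5.4: 1/delta_c <= (w.mu_b)/(w.mu_opt) <= delta_c.
    Assumes both value estimates are positive, as required in the text."""
    v_opt, v_b = float(np.dot(w, mu_opt)), float(np.dot(w, mu_b))
    if v_opt <= 0 or v_b <= 0:
        return False  # positivity assumption violated
    ratio = v_b / v_opt
    return (1.0 / delta_c) <= ratio <= delta_c
```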
The batch setting also requires changes to the computation of these constraints,
as we do not have access to a simulator to calculate exact feature expectations µ;
we must instead estimate them from available data. We do so by adapting off-
policy evaluation methods to estimate the representation of a policy in feature space.
Specifically, we use per-decision importance sampling (PDIS [100]) to get a consistent,
unbiased estimator of µ:
\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N}\sum_{t=0}^{T}\gamma^t \rho^{(n)}_t \phi\big(s^{(n)}_t\big), \tag{5.5}

where the importance weights are \rho^{(n)}_t = \prod_{i=0}^{t}\pi(a^n_i|s^n_i)/\pi_b(a^n_i|s^n_i). Together with the feature
expectations of the observed experts (obtained by simple averaging across trajecto-
ries), we can evaluate the constraint in Equation 5.4.
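A rough sketch of the PDIS estimator in Equation 5.5; a trajectory is assumed to be a list of (state, action) pairs, and the policies are assumed to expose a prob(action, state) method. These names are assumptions for illustration, not the thesis API:

```python
import numpy as np

def pdis_feature_expectations(trajectories, pi, pi_b, phi, gamma, k):
    """Estimate the feature expectations of policy pi from batch trajectories."""
    mu_hat = np.zeros(k)
    for traj in trajectories:
        rho = 1.0  # running per-decision importance weight
        for t, (s, a) in enumerate(traj):
            rho *= pi.prob(a, s) / pi_b.prob(a, s)
            mu_hat += (gamma ** t) * rho * phi(s)
    return mu_hat / len(trajectories)
```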
Proposition 1. The set of weights P_C defines a closed convex set, given access to an exact MDP solver.

Proof. The redefined constraints in Equation 5.4 can be rewritten as w^T(µ − ∆_c µ_b) ≤ 0 and w^T(µ_b − ∆_c µ) ≤ 0, where µ = argmax_{µ′∈P_F} w^Tµ′ gives the feature expectations of the optimal policy obtained from the exact MDP solver. As these constraints are still linear in w, that is, of the form w^TA ≤ b, the convexity argument in [43] holds.
In Section 5.2.3, we discuss how this assumption of convexity changes given the
presence of approximation error in the MDP solver and in estimated feature expec-
tations.
Illustration. We first construct a simple, synthetic task to visualize a polytope of
consistent rewards. Consider an agent in a two-dimensional continuous environment,
with state defined by position st = [xt, yt] for bounded xt and yt. At each time
t, available actions are steps in one of four directions, with random step size δt ∼
N (0.4, 0.1). The reward is rt = [0.5, 0.5]T st: the agent’s goal is to reach the top-right
corner of the 2D map. We use fitted-Q iteration with tree-based approximation [21]
to learn a deterministic policy πb that optimizes the reward, then we sample 1000
trajectories from a biased policy (move left with probability ϵ, and πb otherwise) to
obtain batch data.
We then train policies πw optimizing for reward functions rt = wTϕ(s) on a set of
candidate weights w ∈ R2 on the unit ℓ1-norm ball. For each policy, a PDIS estimate
of the feature expectations µw is obtained using the collected batch data. The con-
sistency constraint (Equation 5.4) is then evaluated for each candidate weight vector,
with different thresholds ∆c (Figure 5.1). Prior to evaluating constraints, we ensure
our estimates wTµ for discounted cumulative reward are positive, by augmenting w
and ϕ(s) with a constant positive bias term: w′ = [w, 1], ϕ′(s) = [ϕ(s), B] where
B = 14.0 for this task. For large ∆c (∆c ≥ 17), the set of consistent w includes
approximately half of all test weights: given these thresholds, all w for which at least
one dimension of the state vector was assigned a significant positive weight (greater
than 0.5) in the reward function were determined to yield policies sufficiently close
to the batch data, while vectors with large negative weights on either coordinate are
rejected. When ∆c is reduced to 3.0, only the reward originally optimized for the
batch data, w = [0.5, 0.5], is admitted by P_C.
(a) ∆c = 17.0, ∆e = 0.7 (b) ∆c = 10.0, ∆e = 0.4 (c) ∆c = 3.0, ∆e = 0.1
Figure 5.1: Consistency and evaluability polytopes with different thresholds ∆c > 1.0 and ∆e < 1.0 respectively, given true reward r_t = [0.5, 0.5]^T [x_t, y_t]. Increasing ∆c corresponds to relaxing constraints and expanding the satisfying set of weights w.
5.2.2 Evaluable Reward Polytope
Our second set of constraints on reward design stem from the need to be able to
confidently evaluate a policy in settings when further data collection is expensive
or infeasible. We interpret this as a condition on confidence in the estimated policy
performance: given an estimate of the expected value E[V^π] = w^T µ̂ of a policy π and a corresponding probabilistic lower bound V^π_lb, we constrain the ratio of these values to lie within some threshold ∆e ≥ 0. A reward function with weights w lies within the polytope of evaluable rewards if V^π_lb ≥ (1 − ∆e) w^T µ̂, where µ̂ ∈ P_F is our PDIS estimate of the feature expectations. To formulate this as a linear constraint in the space of reward weights w, the value lower bound V^π_lb must be rewritten in terms of w.
This is done by constructing a combination of upper and lower confidence bounds on
the policy feature expectations, denoted µlb. Starting from the empirical Bernstein
concentration inequality (Equation 5.3), and writing c_1 = \ln(2/\delta)/(N-1) and c_2 = 7b\ln(2/\delta)/(3(N-1)):

\begin{aligned}
V_{lb} &= \frac{1}{N}\sum_{n=1}^{N} V_n - \frac{1}{N}\sqrt{c_1\sum_{n,n'=1}^{N}\big(V_n - V_{n'}\big)^2} - c_2 \\
&= \frac{1}{N}\sum_{n=1}^{N} \mu^{(n)}w - \mathrm{sgn}(w)\cdot\frac{1}{N}\sqrt{c_1\sum_{n,n'=1}^{N}\big(\mu^{(n)}w - \mu^{(n')}w\big)^2} - c_2 \tag{5.6}\\
&= \frac{1}{N}\,w\cdot\sum_{n=1}^{N} \mu^{(n)} - \mathrm{sgn}(w)\cdot\frac{1}{N}\,w\sqrt{c_1\sum_{n,n'=1}^{N}\big(\mu^{(n)} - \mu^{(n')}\big)^2} - c_2 \\
&= w^T \mu_{lb} - c_2, \tag{5.7}
\end{aligned}
where the kth element of µlb—that is, the value of the kth feature that yields the
lower bound in the value of the policy—is dependent on the sign of the corresponding
element of the weights, w[k]:
\mu_{lb}[k] =
\begin{cases}
\dfrac{1}{N}\displaystyle\sum_{n=1}^{N}\mu^{(n)}[k] \;-\; \sqrt{c_1\displaystyle\sum_{n,n'=1}^{N}\big(\mu^{(n)}[k]-\mu^{(n')}[k]\big)^2} & w[k]\ge 0\\[2ex]
\dfrac{1}{N}\displaystyle\sum_{n=1}^{N}\mu^{(n)}[k] \;+\; \sqrt{c_1\displaystyle\sum_{n,n'=1}^{N}\big(\mu^{(n)}[k]-\mu^{(n')}[k]\big)^2} & w[k]<0
\end{cases}
\tag{5.8}
The definition in Equation 5.8 allows us to incorporate uncertainty in µ when evalu-
ating our confidence in a given policy: a lower bound for our value estimate requires
the lower bound of µ if the weight is positive, and the upper bound if the weight is
negative. Thus, the evaluable reward polytope can be written as:
P_E = \big\{\, w : w^T\mu_{lb} \ge (1-\Delta_e)\, w^T\mu \,\big\}, \tag{5.9}

where µ = argmax_{µ′∈P_F} w^Tµ′ is the expectation of state features for the optimal policy obtained from solving the MDP with reward weights w, and µ_lb is the corresponding lower bound. The constant c_2 in the performance lower bound (Equation 5.7) is absorbed by the threshold parameter ∆e on the tightness of the lower bound.
Proposition 2. The set of weights PE defines a closed convex set, given access to an
exact MDP solver.
Proof. The set PE contains all weights w that satisfy constraints linear in w:
wT ((1−∆e)µ− µlb) ≤ 0. As in the case of PC, it follows from [43] that the set
described by these constraints is convex.
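A minimal sketch of how µ_lb and the membership test for P_E might be computed from per-trajectory PDIS feature expectations; the array layout, helper names, and default δ are illustrative assumptions rather than the thesis implementation:

```python
import numpy as np

def mu_lower_bound(mu_per_traj, w, delta=0.05):
    """mu_per_traj: (N, k) array of per-trajectory PDIS feature expectations.
    Returns the sign-dependent confidence bound of Equation 5.8."""
    n, _ = mu_per_traj.shape
    c1 = np.log(2.0 / delta) / (n - 1)
    mean = mu_per_traj.mean(axis=0)
    # Per-feature sum of squared pairwise differences.
    diffs = mu_per_traj[:, None, :] - mu_per_traj[None, :, :]
    spread = np.sqrt(c1 * np.sum(diffs ** 2, axis=(0, 1)))
    # Lower bound of a feature if its weight is positive, upper bound otherwise.
    return np.where(np.asarray(w) >= 0, mean - spread, mean + spread)

def in_evaluable_polytope(w, mu_opt, mu_per_traj, delta_e, delta=0.05):
    """Check the constraint of Equation 5.9."""
    mu_lb = mu_lower_bound(mu_per_traj, w, delta)
    return np.dot(w, mu_lb) >= (1.0 - delta_e) * np.dot(w, mu_opt)
```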
Illustration. In order to visualize an example polytope for evaluable rewards
(Equation 5.9), we return to the two-dimensional map described in Section 5.2.1. As
before, we begin with a batch of trajectories collected by a biased ϵ-greedy expert
policy trained on the true reward. We use these trajectories to obtain PDIS estimates
Algorithm 3 Separation oracle SO_adm for admissible w
Input: Proposed weights w ∈ R^k; behaviour policy µ_b; threshold parameters ∆c, ∆e
1. Solve MDP with weights w for optimal policy features µ = argmax_{µ′} w^Tµ′
2. Evaluate lower bound µ_lb for estimated policy features
if w^T(µ − ∆c µ_b) > 0 then
    w ∉ P_C ⇒ Reject w; Output: Halfspace {w^T(µ − ∆c µ_b) ≤ 0}
else if w^T(µ_b − ∆c µ) > 0 then
    w ∉ P_C ⇒ Reject w; Output: Halfspace {w^T(µ_b − ∆c µ) ≤ 0}
else if w^T((1 − ∆e)µ − µ_lb) > 0 then
    w ∉ P_E ⇒ Reject w; Output: Halfspace {w^T((1 − ∆e)µ − µ_lb) ≤ 0}
else
    w ∈ P_C ∩ P_E = P_adm ⇒ Accept w
end
µ for policies trained with a range of reward weights w on the ℓ1-norm ball. We
then evaluate µlb, and in turn the hyperplanes defining the intersecting half-spaces
of the evaluable reward polytope, for each w. Plotting the set of evaluable reward
vectors for different thresholds ∆e, we see substantial overlap with the consistent
reward polytope in this environment, though neither polytope is a subset of the other
(Figure 5.1b). We also find that in this setting, the value of the evaluability constraint
is asymmetric about the true reward—more so than the consistency metric—such
that policies trained on penalizing xt (w[0] < 0), hence favouring movement left, can
be evaluated to obtain a tighter lower bound than weights that learn policies with
movement down, which is rarely seen in the biased demonstration data (Figure 5.1b).
Finally, tightening the threshold further to ∆e = 0.1 (Figure 5.1c), the set of accepted
weights is again just the true reward, as for the consistency polytope.
5.2.3 Querying Admissible Reward Polytope
Given our criteria for consistency and evaluability of reward functions, we need a
way to access the sets satisfying these constraints. These sets cannot be explicitly
described as there are infinite policies with corresponding representations µ, and so
infinite possible constraints; instead, we construct a separation oracle to access points
in this set in polynomial time (Algorithm 3). A separation oracle tests whether a given
point w′ lies in the polytope of interest P, and if not, outputs a separating hyperplane
defining some half-space wTA ≤ b, such that P lies inside this half-space and w′ lies
outside of it. The separation oracle for the polytope of admissible rewards evaluates
both consistency and evaluability to determine whether w′ lies in the intersection of
the two polytopes, which we define as our admissible polytope Padm. If a constraint
is not met, it outputs a new hyperplane accordingly.
It should be noted that the RL problems of interest to us are typically large
MDPs with continuous state spaces, as in the clinical setting of managing mechanical
ventilation in the ICU, and moreover, because we are optimizing policies given only
batch data, we know we can only expect to find approximately optimal policies. The
use of PDIS estimates µ of the true feature expectations in the batch setting introduces
an additional source of approximation error. It has been shown that Algorithm 3 with
an approximate MDP solver produces a weird separation oracle [43], one that does
not necessarily define a convex set. However, it does still accept all points in the
queried polytope, and can thus still be used to test whether a proposed weight vector
w lies within this set.
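Putting the two membership tests together, a rough sketch of the oracle in Algorithm 3, reusing the hypothetical helper mu_lower_bound from the earlier snippet; solve_mdp is assumed to return PDIS feature expectations of the policy optimised for reward w (all names are illustrative):

```python
import numpy as np

def separation_oracle_admissible(w, solve_mdp, mu_b, mu_per_traj, delta_c, delta_e, delta=0.05):
    """Return (accepted, a), where a is the normal of a violated halfspace
    {w : a.w <= 0}, or None if w is accepted into the admissible polytope."""
    mu_opt = solve_mdp(w)
    if np.dot(w, mu_opt - delta_c * mu_b) > 0:          # consistency, upper ratio bound
        return False, mu_opt - delta_c * mu_b
    if np.dot(w, mu_b - delta_c * mu_opt) > 0:          # consistency, lower ratio bound
        return False, mu_b - delta_c * mu_opt
    mu_lb = mu_lower_bound(mu_per_traj, w, delta)
    if np.dot(w, (1 - delta_e) * mu_opt - mu_lb) > 0:   # evaluability bound
        return False, (1 - delta_e) * mu_opt - mu_lb
    return True, None
```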
Returning to our 2D map (Figure 5.1), the admissible reward polytope Padm is
the set of weights accepted by both the consistent and evaluable polytopes. The
choice of thresholds ∆c and ∆e respectively is important in obtaining a meaningfully
restricted, non-empty set to choose rewards from. These thresholds will depend on
the extent of exploratory or sub-optimal behaviour in the batch data, and the level
of uncertainty acceptable when deploying a new policy. We find that in this toy
2D map setting, there is considerable overlap between the two polytopes defining
the admissible set, though this is not always the case; from our earlier intuition,
as the behaviour policy from which trajectories were generated is the same for all
trajectories, there is limited “exploration”, or deviation from average behaviour across
trajectories, and therefore the evaluability constraints admit reward weights that
largely overlap with those consistent with average behaviour.
5.2.4 Finding the Nearest Admissible Reward
With a separation oracle SOadm for querying whether a given w lies in the admissible
reward polytope, we optimize linear functions over this set using, e.g., the ellipsoid
method for exact solutions or—as considered here—the iterative follow-perturbed-
leader (FPL) algorithm for computationally efficient approximate solutions [52]. To
achieve our goal of aiding reward specification for a designer with existing but im-
perfectly known goals, we pose our optimization problem as follows (Algorithm 4):
given initial reward weights w0 proposed by the agent designer, we first test whether
w0, with some small perturbation, lies in the admissible polytope Padm, which we
define by training a policy π0 approximately optimizing this reward. If it does not
lie in Padm, we return new weights w ∈ Padm that minimize distance ∥w − winit∥2
from the proposed weights. This solution is then perturbed and tested in turn. We
note that constraints posed based on the behaviour µb observed in the available batch
trajectories are encapsulated by this minimization over weights in set Padm, that is,
solving a constrained linear optimization defined by the linear constraints on w from
Equations 5.4 and 5.9. The constraints at each iteration do not fully specify Padm,
but instead give us a half-space to optimize over, at each step.
The constrained linear program solved at each iteration scales in constant time
with the dimensionality of w; although we only present results with functions ϕ(s)
Algorithm 4 Follow-perturbed-leader for admissible w
Input: Initial weights w_0 ∈ R^k, number of iterations T, perturbation δ = 1/(k√T)
t = 0
while t ≤ T do
    1. Let r_t = Σ_{i=1}^{t−1}(w_i + p_t) · ϕ(·), where p_t ∼ U[0, 1/δ]^k
    2. Solve for π_t = argmax_π V_π | r_t
    3. Let µ_t = µ(π_t) + q_t, where q_t ∼ U[0, 1/δ]^k
    4. Evaluate constraints defining P_adm
    5. Solve for w_t := argmin_{w ∈ P_adm} ∥w − w_init∥_2
    6. t := t + 1
end
Output: π_final = (1/T) Σ_{t=1}^T π_t;  w = (1/T) Σ_{t=1}^T w_t
of dimensionality at most 3, for the sake of visualization, the iterative algorithm
presented can be scaled to higher dimensional ϕ(s), as the complexity of the linear
program solved at each iteration is dependent only on the number of constraints.
Our final reward weights and a randomized policy are the average across the approx-
imate solutions in each iteration. This policy optimizes a reward that is the closest
admissible reward to the original goals of the designer of the RL agent.
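For step 5 of Algorithm 4, when the oracle rejects w and returns a single violated halfspace {w : aᵀw ≤ 0}, the closest point to the designer's proposed weights satisfying that one constraint has a closed form (Euclidean projection onto a halfspace); a minimal sketch, with names assumed rather than taken from the thesis code:

```python
import numpy as np

def project_onto_halfspace(w_init, a):
    """Return the closest point to w_init satisfying a.w <= 0."""
    a = np.asarray(a, dtype=float)
    violation = float(np.dot(a, w_init))
    if violation <= 0:
        return np.asarray(w_init, dtype=float)  # already feasible
    return w_init - (violation / np.dot(a, a)) * a
```

With several accumulated hyperplanes, the same step would instead be a small constrained least-squares problem over all constraints gathered so far.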
5.3 Experiment Design
5.3.1 Benchmark Tasks
We illustrate our approach to determining admissible reward functions on three bench-
mark domains with well-defined objectives: classical control tasks Mountain Car and
Acrobot, and a simulation-based treatment task for HIV patients. The control tasks,
implemented using OpenAI Gym [8], both have a continuous state space and discrete
action space, and the objective is to reach a terminal goal state. To explore how
the constrained polytopes inform reward design for these tasks, an expert behaviour
policy is first trained with data collected from an exploratory policy receiving a re-
ward of −1 at each time step, and 0 once the goal state is reached. A batch of 1000
trajectories is collected by following this expert policy with Boltzmann exploration,
mimicking a sub-optimal expert. Given these trajectories, our task is to choose a re-
ward function that allows us to efficiently learn an optimal policy that is i) consistent
with the expert behaviour in the trajectories, and ii) evaluable with acceptably tight
lower bounds on performance. We limit the reward function rt = wTϕ(st) in each task
to a weighted sum of three features, ϕ(s) ∈ R3, chosen to include sufficient informa-
tion to learn a meaningful policy while allowing for visualization. For Mountain Car,
we use quantile-transformed position, velocity, and an indicator ±1 of whether the
goal state has been reached. For Acrobot, ϕ(s) comprises the quantile-transformed
cosine of the angle of the first link, angular velocity of the link, and an indicator
±1 of whether the goal link height is satisfied. We sweep over weight vectors on the
3D ℓ1-norm ball, training policies with the corresponding rewards, and filtering for
admissible w.
The characterization of a good policy is more complex in our third benchmark task,
namely treatment recommendation for HIV patients, modeled by a linear dynamical
system [22]. Again, we have a continuous state space and four discrete actions to
choose from: no treatment, one of two possible drugs, or both in conjunction. The true
reward in this domain is given by: R = −0.1V + 10^3 E − 2·10^4 (0.7d_1)^2 − 2·10^3 (0.3d_2)^2,
where V is the viral count, E is the count of white blood cells (WBC) targeting the
virus, and d1 and d2 are indicators for drugs 1 and 2 respectively. We can rewrite
this function as r = wTϕ(s), where ϕ(s) = [V, c0E, c1d1 + c2d2] ∈ R3, with constants
c0, c1 and c2 set such that weights w = [−0.1, 0.5, 0.4] reproduce the original function.
Again, the low dimensionality of ϕ(s) is simply for the sake of interpretability. An
expert policy is trained using this true reward, and a set of sub-optimal trajectories
Table 5.1: MDP state features taken as input for learning an optimal policy formanagement of mechanical ventilation in ICU.
State Features
Demographics Age, Gender, Ethnicity, Admission Weight, First ICU Unit
Vent Settings Ventilator mode, Inspired O2 fraction (FiO2), O2 Flow
Positive End-Expiratory Pressure (PEEP) set
Measured Vitals Heart Rate, Respiratory Rate, Arterial pH,
O2 saturation pulseoxymetry (SpO2), Richmond-RAS Scale,
Non Invasive Blood Pressure (systolic, diastolic, mean),
Mean Airway Pressure, Tidal Volume, Peak Insp. Pressure,
Plateau Pressure, Arterial CO2 Pressure, Arterial O2 pressure
Input Sedation Propofol, Fentanyl, Midazolam, Dexmedetomidine,
Morphine Sulfate, Hydromorphone, Lorazepam
Other Consecutive duration into ventilation (D),
Number of reintubations (N)
are collected by following this policy with Boltzmann exploration. Policies are then
trained over weights w, ||w||1 = 1 to determine the set of admissible rewards.
5.3.2 Mechanical Ventilation in ICU
We use our methods to aid reward design for the task of managing invasive me-
chanical ventilation in critically ill patients, as described in Chapter 3. Mechanical
ventilation refers to the use of external breathing support to replace spontaneous
breathing in patients with compromised lung function. It is one of the most common,
as well as most costly, interventions in the ICU [108]. Timely weaning, or removal of
breathing support, is crucial to minimizing risks of ventilator-associated infection or
over-sedation, while avoiding failed breathing tests or reintubation due to premature
weaning. Expert opinion varies on how best to trade off these risks, and clinicians
tend to err towards conservative estimates of patient wean readiness, resulting in
extended ICU stays and inflated costs.
We look to design a reward function for a weaning policy that penalizes prolonged
ventilation, while weighing the relative risks of premature weaning such that the opti-
mal policy does not recommend strategies starkly different from clinician behaviour,
and the policies can be evaluated for acceptably robust bounds on performance using
existing trajectories. We train and test our policies on data filtered from the MIMIC
III data [49] with 6,883 ICU admissions from successfully discharged patients follow-
ing mechanical ventilation, preprocessed and resampled in hourly intervals. The
MDP for this task is adapted from that introduced in Section 3.2.3: the patient state
st at time t is a 32-dimensional vector that includes demographic data, ventilator
settings, and relevant vitals (Table 5.1). We learn a policy with binary action space a_t ∈ {0, 1}, for keeping the patient off or on the ventilator, respectively. The reward
function rt = wTϕ(st, at) with ϕ(s, a) ∈ R3 includes (i) a penalty for more than 48
hours on the ventilator, (ii) a penalty for reintubation due to unsuccessful weaning,
and (iii) a penalty on physiological instability when the patient is off the ventilator
based on abnormal vitals:
\phi =
\begin{bmatrix}
-\min\big(0,\ \tanh 0.1(D_t - 48)\big)\cdot \mathbb{1}[a_t = 1]\\[1ex]
-\mathbb{1}\big[\exists\, t' > t \text{ such that } N_{t'} > N_t\big]\cdot \mathbb{1}[a_t = 0]\\[1ex]
-\tfrac{1}{|V|}\sum_{v \in V} \mathbb{1}\big[v < v_{min} \,\lor\, v > v_{max}\big]\cdot \mathbb{1}[a_t = 0]
\end{bmatrix}
\tag{5.10}
where Dt is duration into ventilation at time t in an admission, Nt is the number of
reintubations, v ∈ V are physiological parameters each with normal range [vmin, vmax],
and V = {Ventilator settings, Measured vitals}. The three terms in ϕ(·) represent
penalties on duration of ventilation, reintubation, and abnormal vitals, respectively.
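A minimal sketch of how the feature vector in Equation 5.10 might be computed for a single transition, as written above; the argument names and the normal-range dictionary are illustrative assumptions, not the thesis code:

```python
import numpy as np

def vent_reward_features(action, duration_hours, reintubated_later, vitals, normal_ranges):
    """vitals: dict of current measurements; normal_ranges: dict of (vmin, vmax)."""
    on_vent = 1.0 if action == 1 else 0.0
    off_vent = 1.0 - on_vent
    # (i) duration term, as written in Equation 5.10
    duration_term = -min(0.0, np.tanh(0.1 * (duration_hours - 48.0))) * on_vent
    # (ii) penalty if the patient is later reintubated after being taken off support
    reintubation_term = -float(reintubated_later) * off_vent
    # (iii) fraction of monitored vitals outside their normal range, while off the ventilator
    abnormal = [not (lo <= vitals[v] <= hi) for v, (lo, hi) in normal_ranges.items()]
    vitals_term = -np.mean(abnormal) * off_vent
    return np.array([duration_term, reintubation_term, vitals_term])
```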
Our goal is to learn the relative weights of these feedback signals to produce a
Table 5.2: Analysing the top three admitted weights w for each of the three benchmark control environments. Admissibility polytope thresholds are set by choosing a small ∆c and the required corresponding threshold ∆e for an admissible set of size |P_adm| = 3.

Task            Top 3 admissible weights                                        ∆c (P_C)   ∆e (P_E)
Mountain Car    [0.0, 0.2, 0.8]^T, [0.2, 0.2, 0.6]^T, [0.4, −0.4, 0.2]^T        1.10       0.27
Acrobot         [−0.2, 0.0, 0.8]^T, [−0.8, −0.2, 0.0]^T, [−0.2, −0.2, 0.6]^T    1.10       0.29
HIV Simulator   [0.0, 0.4, 0.6]^T, [−0.6, 0.2, 0.2]^T, [−0.2, 0.4, −0.4]^T      1.20       0.28
consistent, evaluable reward function and learn a policy optimizing this reward. As before,
we train our optimal policies using Fitted Q-iteration (FQI) with function approxi-
mation using extremely randomized trees [21]. We partition our dataset into 3,000
training episodes and 3,883 test episodes, and run FQI over 100 iterations on the
training set, with discount factor γ = 0.9. We then use the learnt Q-function to train
our binary treatment policy.
5.4 Results and Discussion
5.4.1 Benchmark Control Tasks
Admissible w are clustered near true rewards.
We analyze reward functions from the sweep over weight vectors on the ℓ1-norm
unit ball for each benchmark task (Section 5.3.1) by first visualizing how the space
of weights accepted by the consistency and evaluability polytopes—and therefore
the space Padm at the intersection of these polytopes—changes with the values of
thresholds ∆c and ∆e. Alongside this, we plot the set of admitted weights produced by
Figure 5.2: Admissible polytope size for varying thresholds on consistency (∆c) and evaluability (∆e), and distribution of admitted weights for fixed ∆c, ∆e, in: (a) Mountain Car, (b) Acrobot, (c) HIV Simulator. Note that admitted rewards for each task typically correspond to positive weights on the goal state.
arbitrarily chosen thresholds (Figure 5.2). In all three tasks, we find that the admitted
weights form distinct clusters; these are typically at positive weights on goal states in
the classic control tasks, and at positive weights on WBC count for the HIV simulator,
in keeping with the rewards optimized by the batch data in each case. We could
therefore use this naive sweep over weights to choose a vector within the admitted
cluster that is closest to our initial proposed function, or to our overall objective. For
instance, if in the HIV task we want a policy that prioritizes minimization of side
effects from administered drugs, we can choose specifically from admissible rewards
with negative weight on the treatment term.
Analysis of admissible w can lend insight into reward shaping for faster
policy learning.
We may wish to shortlist candidate weights by setting more stringent thresholds for
admissibility. We mimic this design process as follows: prioritizing evaluability in
each of our benchmark environments, we choose the smallest possible ∆e and large
∆c for an admissible set of exactly three weights (Table 5.2). This reflects a typical
batch setting, in which we want high-confidence performance guarantees; we also want
to allow our policy to deviate when necessary from the available sub-optimal expert
trajectories. For Mountain Car, our results show that two of the three vectors assign
large positive weights to reaching the goal state; all assign zero or positive weight to
the position of the car. The third, w = [0.4,−0.4, 0.2] is dominated by a significant
positive weight on position and a significant negative weight on velocity; this may
be interpreted as a kind of reward shaping: the agent is encouraged to first move in
reverse to achieve a negative velocity, as is necessary to reach the goal state in the
under-powered mountain car problem. The top three w for Acrobot also place either
positive weights on the goal state, or negative weights on the position of the first link.
Again, the latter reward definition likely plays a shaping role in policy optimization
by rewarding link displacement.
FPL can be used to correct biased reward specification in the direction of
true reward.
We use the HIV treatment task to explore how iterative solutions for admissible
reward (Algorithm 4) can improve a partial or flawed reward specified by a designer.
For instance, a simple first attempt by the designer at a reward function may place
equal weights on each component of ϕ(s), with the polarity of weights—whether
each component should elicit positive feedback or incur a penalty—decided by the
designer’s domain knowledge; here, the designer may suggest w0 = (1/3)[−1, 1, −1]^T.
We run Algorithm 4 for twenty iterations with this initial vector and thresholds
∆c = 2.0,∆e = 0.8 and average over the weights from each iteration. This yields
weights w = [−0.11, 0.57,−0.32]T , redistributed to be closer to the reward function
being optimized in the batch data. This pattern is observed with more extreme initial
rewards functions too; if e.g., the reward proposed depends solely on WBC count,
w0 = [0, 1, 0], then we obtain weights w = [−0.14, 0.83,−0.04] after twenty iterations
of this algorithm such that appropriate penalties are introduced on viral load and
administered drugs.
5.4.2 Mechanical Ventilation in ICU
Admissible w may highlight bias in expert behaviour.
We apply our methods to choose a reward function for a ventilator weaning pol-
icy in the ICU, given that we have access only to historical ICU trajectories with
which to train and validate our policies. When visualizing the admissible set, with
∆c = 1.8,∆e = 0.4, we find substantial intersection in the consistent and evaluable
polytopes (Figure 5.3). Admitted weights are clustered at large negative weights on
Figure 5.3: Mechanical Ventilation in the ICU: Admitted reward weights for fixed polytope thresholds ∆c, ∆e.
the duration penalty term favouring policies that are conservative in weaning patients
(that is, those that keep patients longer on the ventilator), which is the direction of
bias we expect in the past clinical behaviour. We can tether a naive reward that
instead penalizes duration on the ventilator, w = [1, 0, 0], to the space of rewards that
are consistent with this conservative behaviour as follows: using FPL to search for a
reward within the admissible set given this initial vector yields w = [0.72, 0.14, 0.14],
introducing non-zero penalties on reintubation and physiological instability when off
ventilation. This allows us to learn behaviour that is averse to premature extubation
(consistent with historical clinical behaviour) without simply rewarding long dura-
tions on the ventilator.
FPL improves effective sample size for learnt policies.
To verify whether weights from the admissible polytope enable higher confidence
policy evaluation, we explore a simple proxy for variance of an importance sampling-
Table 5.3: Mechanical ventilation in the ICU: Influence of the FPL algorithm on the Kish effective sample size of learnt policies.

Initial w        N_eff    Final w                  N_eff
[1., 0., 0.]     8        [0.72, 0.14, 0.14]       14
[0., 1., 0.]     304      [-0.07, 0.77, 0.16]      352
[0., 0., 1.]     32       [0.15, -0.21, 0.66]      3713
[1., 1., 1.]     16       [0.24, 0.51, 0.25]       33
based estimate of performance: the effective sample size N_eff = (Σ_n ρ_n)² / Σ_n ρ_n² of the batch data [55], where ρ_n is the importance weight of trajectory n for a given policy.
In order to evaluate the Kish effective sample size Neff for a given policy, we subsample
admissions in our test data to obtain trajectories of approximately 20 timesteps in
length, and calculate importance weights ρn for the policy considered using these
subsampled trajectories. Testing a number of naive initializations of w, we find that
effective sample size is consistently higher for weights following FPL (Table 5.3). This
indicates that the final weights induce an optimal policy that is better represented in
the batch data than the policy from the original weights.
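For reference, the effective sample size used above is a one-line computation over the per-trajectory importance weights; a small sketch with assumed variable names:

```python
import numpy as np

def kish_effective_sample_size(rho):
    """rho: array of per-trajectory importance weights for the evaluated policy."""
    rho = np.asarray(rho, dtype=float)
    return float(rho.sum() ** 2 / np.sum(rho ** 2))

# Near-uniform weights give N_eff close to the number of trajectories,
# while a few dominant weights shrink it.
print(kish_effective_sample_size([1.0, 1.1, 0.9, 1.0]))   # ~4
print(kish_effective_sample_size([10.0, 0.1, 0.1, 0.1]))  # ~1
```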
5.5 Conclusion
In this work, we present a method for reward design in reinforcement learning using
batch data collected from sub-optimal experts. We do this by constraining rewards
to those yielding policies within some distance of the policies of domain experts;
the policies inferred from the admissible rewards also provide reasonable bounds on
performance. Our experiments show how rewards can be chosen in practice from the
space of functions satisfying these constraints, and illustrate this on the problem of
weaning clinical patients from mechanical ventilation.
Effective reward design for RL in safety-critical settings is necessarily an iterative
process of deployment and evaluation, to push the space of observed behaviour incre-
mentally towards policies consistent and evaluable with respect to our ideal reward.
There are, however, a number of ways in which the methods here could be extended
to better use the information available in existing data on what constitutes a safe
policy, and in turn what reward function can ensure this. For instance, different care
providers in clinical settings likely follow policies with different levels of precision,
or perhaps even optimize for different reward functions; modeling this heterogeneity
in behaviour and weighting experts appropriately can enable learnt behaviour closer
to the best, rather than the average, expert. In addition, going beyond the use of
summary statistics provided by policy feature expectations to explore more complex
representations of behaviour that are still decoupled from rewards, and in turn better
metrics for similarity in behaviour, can aid in more meaningful choices in reward.
Chapter 6
Guiding Electrolyte Repletion in
Critical Care using RL
The replacement of electrolytes is a ubiquitous part of healthcare delivery
in hospitalized and critically ill patients. Electrolytes are charged minerals found in
the blood, such as potassium, sodium, magnesium, calcium or phosphate, that are
essential in supporting the normal function of cells and tissues. They play a key role
in electrical conduction in the heart, muscle and nervous system, and in intracellular
signalling; it follows that electrolyte insufficiency is associated with higher morbidity
and mortality rates in critical care.
Disturbances in electrolyte levels can arise from a range of underlying causes,
including reduced kidney or liver function, endocrine disorders, or concurrently ad-
ministered drugs such as diuretics. Although standardized institutional protocols
are typically in place to guide electrolyte replacement, adherence to published guide-
lines is poor, and the repletion process is instead largely driven by individual care
providers. There is evidence that experiential bias from this provider-directed ap-
proach is prone to significant errors, both in terms of more missed episodes of low
electrolyte levels [41] and—increasingly—high rates of superfluous replacements, con-
tributing to unnecessary expenditure by way of prescription of medications, ordering
of laboratory tests, as well as clinician and nursing time spent [50, 123].
There have been several studies in recent literature that highlight the prevalence
of ineffectual electrolyte repletion therapy. Considering the regulation of potassium
in particular, as many as 20% of hospitalized patients experience episodes of hy-
pokalaemia, where blood serum levels of potassium are below the reference normal
range. The majority of patients receiving (predominantly non-potassium-sparing) diuretics go on to become hypokalaemic [121]; however, this has been found to be clinically significant in only 4-5% of patients [2]. In investigating rule-of-thumb potassium
repletions, Hammond et al. [38] found that just over a third of repletions achieved
potassium levels within reference range. Lancaster et al. [61] demonstrate that potas-
sium supplementation is not effective as a preventative measure against atrial fibril-
lation, while magnesium supplementation can in fact increase risk.
In this chapter, we aim to develop a clinician-in-loop decision support tool for elec-
trolyte repletion, focusing on the management of potassium, magnesium and phos-
phate levels in hospitalized patients. While there have been few efforts to take a
personalized, data-driven approach to electrolyte repletion, machine learning meth-
ods have been applied to the closely related problem of fluid resuscitation, in order
to manage hypotension in critically ill patients. For example, Celi et al. [10] con-
sider a Bayesian network to predict need for fluid replacement based on historical
data, while Komorowski et al. [57] describe a reinforcement learning approach to the
administration of fluids and vasopressors in patients with sepsis, using Q-learning
with discretized state and action spaces to learn a policy minimizing the risk of
patient mortality. Here, we translate the reinforcement learning framework intro-
duced in Chapter 3—using batch reinforcement learning methods with continuous
patient state representations—to learning policies for targeted electrolyte repletion.
We seek to understand the clinical priorities that shape current provider behaviour
through methods based on inverse reinforcement learning, and adapt these priorities
to learn policies that minimize the costs associated with repletion while maintaining
electrolyte levels within their reference ranges.
6.1 Methods
6.1.1 UPHS Dataset
The data used in this work is drawn from a set of over 450,000 acute care admissions
between 2010 and 2015, across three centres within the University of Pennsylvania
health system (UPHS). For each admission, we have de-identified electronic health
records comprising demographics, details of the hospital and unit the patients are ad-
mitted to, identification numbers of their care providers, nurse-verified physiological
parameters, administered medications and procedures, and patient outcomes. From
this rich dataset, we select all adult patients for whom we have high-level infor-
mation on the admission (including age, gender and admission weight), a minimum
of one lab test result for each of potassium, magnesium and phosphate levels (often
available jointly as part of a basic electrolyte panel) as well as recorded measurements
for other key vitals and lab tests, including all commonly tested electrolytes.
This yields a cohort of 13,164 unique patient visits, of which 7,870 are administered
potassium at least once, 8,342 are administered magnesium, and 1,768 are adminis-
tered phosphates. Figure 6.1 plots the distribution of measured serum electrolyte
levels both prior to and post repletion events, along with the target (normal) range
in each case. We can see that while the majority of phosphate repletions occur when
measurements fall below the reference range, this is not true for potassium or magne-
sium; in the case of potassium, 4% of all repletions are ordered while the last known
measurement is above the target range, which appears to lend support to claims in
the literature regarding unnecessary potassium supplementation.
Figure 6.1: Distribution of electrolyte levels at repletion events for K, Mg and P.
Each patient hospital visit in our chosen cohort is resampled into 6-hour intervals.
This relatively large window is chosen as it reflects the minimum frequency with which
lab tests for electrolyte levels are generally ordered, and in turn the duration between
reassessment of the need for electrolyte supplements; in standard practice, electrolyte
repletion is typically reviewed three times a day. In sampling patient vitals and lab
tests, outliers (recorded measurements that are not clinically viable) are filtered out,
and the mean of remaining measures is taken as representative of the value at each 6
hour interval. Missing values are imputed using simple feed-forward imputation, up
to a maximum of 48 hours since the last known measurement, and otherwise imputed
with the value of the population mean.
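A rough sketch of this resampling and imputation pipeline for a single admission, assuming the measurements sit in a pandas DataFrame indexed by timestamp; the column layout and the population-mean series are illustrative assumptions:

```python
import pandas as pd

def resample_admission(df, population_means, freq="6H", ffill_limit=8):
    """df: raw time-indexed measurements for one admission.
    Resample to 6-hour bins, forward-fill up to 48 h (8 bins), then fall back
    to population means for anything still missing."""
    binned = df.resample(freq).mean()          # mean of measurements per 6-hour interval
    binned = binned.ffill(limit=ffill_limit)   # carry last value forward, at most 48 hours
    return binned.fillna(population_means)     # population mean for remaining gaps
```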
6.1.2 Formulating the MDP
As in previous chapters, we frame the clinical decision-making problem here as a
Markov decision process (MDP), M = {S, A, P, P0, R, γ}, parametrized by some
Figure 6.2: Example hospital admission with potassium supplementation
finite state space S, st ∈ S, finite set of actions A, at ∈ A, an unknown transition
function P (st+1|st, at), distribution P0 over initial states s0 ∈ S, a reward function
R(st, at, st+1), and a scalar discount factor γ defining the relative importance of im-
mediate and long-term rewards. Our objective is to learn an optimal policy function π*(s) : S → A that maximizes the discounted cumulative reward:

\pi^*(s_t) = \arg\max_{a_t \in A} \mathbb{E}_{P, P_0}\Big[\sum_{t}\gamma^t R(s_t, a_t, s_{t+1})\Big] \tag{6.1}
State representation The relative risk posed by electrolyte levels outside the ref-
erence range, and the initiation of strict regulation of potassium, magnesium and
phosphate levels, is influenced by a number of different factors. These include de-
mographic characteristics, patient physiological stability, and interaction with con-
currently administered drugs. For example, Figure 6.2 illustrates repletion events for
a single hospital visit, along with available measurements of serum potassium level,
and administration events of non-potassium sparing diuretics. We can see that oral
potassium repletion is routinely ordered as a prophylaxis—that is, as a preventive
measure against hypokalaemia—in conjunction with diuretics, even when potassium
levels are within the target range.
In defining our state space, we therefore include static features defining patient
admissions such as age, weight, gender and whether the admission is to the ICU or
to a regular inpatient ward on the hospital floor (as a proxy for patient severity
of illness). We also incorporate imputed measurements at each 6-hour interval for
Table 6.1: Selected 52 features for patient state representation in electrolyte repletion.
Features
Static Age, Gender, Weight, Floor/ICU
Vitals Heart rate, Respiratory rate, Temperature,
O2 saturation pulseoxymetry (SpO2), Urine output
Non-invasive blood pressure (systolic, diastolic)
Labs K, Mg, P, Na, Ca (Ionized), Chloride, Anion gap, Creatinine,
Hemoglobin, Glucose, BUN, WBC, CPK, LDH, ALT, AST, PTH
Drugs K-IV, K-PO, Mg-IV, Mg-PO, P-IV, P-PO, Ca-IV, Ca-PO,
Loop diuretics, Thiazides, Acetazolamide, Spironolactone,
Fluids, Vasopressors, Beta Blockers, Ca Blockers,
Dextrose, Insulin, Kayazelate, TPN, PN, PO Nutrition
Procedures Packed cell transfusion, Dialysis
seven common vitals (including urine output) and eleven labs. In addition, seven
rarer labs are represented with an indicator of whether a measurement is available
from the past 24 hours, as the ordering of these lab tests by clinicians can in itself be
informative. In terms of medication, the patient state includes the administered dose
of both intravenous (IV) and oral (PO) potassium, magnesium or phosphate. Several
additional key classes of drugs are represented as indicator variables, taking a value of
1 if the drug is administered over the 6-hour interval, and 0 if not. Fluids, diuretics,
parenteral nutrition, etc. fall within this category. Finally, we include indicators of
whether the patient is administered packed cell transfusions (as these can increase risk of
hyperkalaemia) or dialysis, which aims to correct electrolyte imbalances resulting from
kidney failure. This yields a 52-dimensional state space (Table 6.1) encompassing all
available information relevant in learning an optimal repletion policy.
Table 6.2: Discretized dosage levels for K, Mg and P.

K:   Oral (PO): 0, 20, 40, 60 mEq.   Intravenous (IV): 20 mEq over 2 h; 40 mEq over 4 h; 60 mEq over 6 h; 20 mEq over 1 h; 40 mEq over 2 h; 60 mEq over 3 h.
Mg:  Oral (PO): 0, 400, 800, 1200.   Intravenous (IV): 0.5 mEq over 1 h; 1 mEq over 1 h; 1 mEq over 2 h; 1 mEq over 3 h.
P:   Oral (PO): 0, 250, 500, 750.    Intravenous (IV): 1 mEq over 1 h; 2 mEq over 3 h; 2 mEq over 6 h.
Action representation For each of the three electrolytes considered here, supple-
ments are administered either via fast-acting intravenous drugs (at various rates and
infusion times) or with tablets at different doses. In designing our action space, we
discretize the dosage rates such that the set of actions we choose from are in line with
most common practice in the UPHS data. In the case of potassium, this yields the
following options: no repletion, PO repletion at one of three discretized levels: 0-20,
20-40 or 40-60 mEq, IV repletion at one of six possible rates: 0-10 mEq/hr infused
over 1, 2 or 3 hours; 10-20 mEq/hr over 2, 4 or 6 hours, or some combination of both
intravenous and oral supplements.
In order to effectively learn treatment recommendations over this space of actions,
we choose to learn three independent policies for each electrolyte, each with a distinct
action space. Specifically, we first learn optimal policy recommendations for the route
of electrolyte administration, πroute : S → Aroute where
A^{route} = \left\{ \begin{bmatrix}0\\0\end{bmatrix},\ \begin{bmatrix}0\\1\end{bmatrix},\ \begin{bmatrix}1\\0\end{bmatrix},\ \begin{bmatrix}1\\1\end{bmatrix} \right\} \tag{6.2}
such that a^{route}_t[0] = 1 indicates an IV repletion event at time t, and a^{route}_t[1] = 1 indicates administration of PO repletion. We then learn separate policies for PO and
IV dosage that map patient state to action spaces APO and AIV respectively, where
the actions in each set are represented by one-hot encodings of each dosage level in
that category. The size of the action space in each case is therefore given by the total
number of possible dosage rates, plus an additional action representing no repletion.
For potassium, this yields action spaces of size |AIV | = 7 and |APO| = 4 respectively.
The set of action spaces for magnesium and phosphate repletion are defined in the
same way; the complete list of discretized dosage levels is summarized in Table 6.2.
This subdivision of action spaces mimics the likely decision-making process in clinical
practice: the provider must first decide whether to administer a supplement, and if
so, by what route, before choosing the most appropriate dosage of this supplement.
Reward function The objective of our electrolyte repletion policy is to optimize
patient clinical outcome while minimizing unnecessary repletion events. To this end,
we look to incorporate the following elements in our reward function: (i) the effective
cost of an IV repletion, which can be thought of as encompassing the prescription
cost, care-provider time, and the cost of the drug itself, (ii) the effective cost of PO
repletion, given these same considerations, (iii) a penalty for electrolyte levels above
the reference range, and (iv) a penalty for electrolyte levels below this range. The
reward function for potassium can then be written as: rt+1 = w ·ϕt(st, at, st+1), where
ϕ is a four-dimensional vector function such that:
\phi_t(\cdot) =
\begin{bmatrix}
-\,a^{route}_t[0]\\[1ex]
-\,a^{route}_t[1]\\[1ex]
-\,\mathbb{1}_{s_{t+1}[K]>K_{max}} \cdot 10\big(1 + e^{-\sigma(K - K_{max} - 1)}\big)^{-1}\\[1ex]
-\,\mathbb{1}_{s_{t+1}[K]<K_{min}} \cdot 10\Big[1 - \big(1 + e^{-\sigma(K - K_{min} + 1)}\big)^{-1}\Big]
\end{bmatrix}
\in
\begin{bmatrix}
\{0, -1\}\\[1ex]
\{0, -1\}\\[1ex]
(-10, 0)\\[1ex]
(-10, 0)
\end{bmatrix}
\tag{6.3}
Figure 6.3: Penalizing abnormal potassium levels in the reward function, σ = 3.5 (x-axis: Potassium (K), mEq/L).
and w, ||w||1 = 1 determines the relative weighting of each penalty. In the above
equation, K is the last known measurement of potassium; Kmax and Kmin define the
upper and lower bounds respectively of the target potassium value. Penalties for
values above and below the reference range are applied independently in order to
allow for asymmetric weighting of the risks posed by hypokalaemia when compared
with hyperkalaemia. The sigmoid function used to model penalties on abnormal
vitals (Figure 6.3) can be justified as follows: the maximum and minimum thresholds
for electrolyte reference ranges are fairly arbitrary, and can vary considerably across
hospitals. While patients with abnormal values near these thresholds are likely to
be asymptomatic or experience few adverse effects, more severe electrolyte imbalance
becomes increasingly harmful to patient outcome, until irrevocable. The parameter σ
in the definition of this function effectively determines the sharpness of this threshold,
and can be set according to the width of the reference range and our confidence in
the threshold value. The maximum value of the sigmoid penalties in ϕ are scaled the
lie between 0 and -10, in order that the mean non-zero penalties of all four terms lie
within approximately the same order of magnitude, to aid subsequent analysis and
the choice of weights w for the final reward function.
The vector function ϕ for both magnesium and phosphate is defined in much the same way, with elements corresponding to IV repletion cost, PO repletion cost, and abnormally high and abnormally low electrolyte levels, respectively.
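A small sketch of the out-of-range penalty terms in Equation 6.3, useful for checking the shape plotted in Figure 6.3; the function name, default reference range, and σ are illustrative assumptions:

```python
import numpy as np

def abnormal_level_penalties(k_level, k_min=3.5, k_max=5.0, sigma=3.5):
    """Return (high_penalty, low_penalty) for a serum potassium measurement.
    Each penalty lies in (-10, 0) and is zero when the level is within range."""
    high = 0.0
    low = 0.0
    if k_level > k_max:
        high = -10.0 / (1.0 + np.exp(-sigma * (k_level - k_max - 1.0)))
    if k_level < k_min:
        low = -10.0 * (1.0 - 1.0 / (1.0 + np.exp(-sigma * (k_level - k_min + 1.0))))
    return high, low

# Example: a mildly low value incurs a small penalty, a severely low value a large one.
print(abnormal_level_penalties(3.3))  # assumed reference range 3.5-5.0 mEq/L
print(abnormal_level_penalties(2.0))
```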
6.1.3 Fitted Q-Iteration with Gradient-boosted Trees
Now that we have defined our Markov decision processes, we can extract a sequence
of one-step transition tuples from each hospital admission to produce a dataset D = {⟨s^n_t, a^n_t, s^n_{t+1}, ϕ^n_{t+1}⟩_{t=0:T_n−1}}_{n=1:N}, where N is the number of distinct hospital visits and |D| = Σ_n T_n, and solve for optimal treatment policies using batch reinforcement learning methods, namely Fitted Q-iteration (FQI) [21]. FQI is a data-efficient value function approximation algorithm that learns an estimator for the value Q(s, a) of each
state-action pair in the MDP—that is, the expected discounted cumulative reward
starting from ⟨s, a⟩—through a sequence of supervised learning problems. FQI also
offers flexibility in the use of any regression method to solve the supervised problems
at each iteration.
Here, we fit our estimate of Q at each iteration of FQI using gradient boost-
ing machines (GBMs) [26]. This is an ensemble method in which weaker predictive
models, such as decision trees, are built sequentially by training on residual errors,
rather than building all trees concurrently, as is the case for random forests or extremely
randomized trees. This allows models to learn higher-order terms and more complex
interactions amongst features [132]. Gradient boosted trees have been increasingly
used for function approximation in clinical supervised learning tasks, and have been
demonstrated to have strong predictive performance [45, 129]. Boosting has been
previously explored in conjunction with FQI by Tosatto et al. [120], who propose an
additive model of the Q-function in which a weak learner is built at each iteration
from the Bellman residual error in the previous estimate of Q. In this work, we
instead output a fully fitted GBM at the end of each iteration of FQI.
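A condensed sketch of one way such an FQI loop with a gradient-boosted regressor could look, using scikit-learn; the data layout (2-D state arrays, integer actions) and hyperparameters are illustrative assumptions rather than the thesis implementation:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fitted_q_iteration(S, A, R, S_next, n_actions, gamma=0.9, n_iters=50):
    """S, S_next: (m, d) arrays of states; A: (m,) integer actions; R: (m,) rewards.
    Returns a regressor mapping [state, one-hot action] to an estimate of Q(s, a)."""
    def one_hot(actions):
        z = np.zeros((len(actions), n_actions))
        z[np.arange(len(actions)), actions] = 1.0
        return z

    X = np.hstack([S, one_hot(A)])
    q = None
    for _ in range(n_iters):
        if q is None:
            targets = R  # first iteration: fit the immediate reward
        else:
            # Bellman backup: r + gamma * max over a' of Q(s', a')
            q_next = np.column_stack([
                q.predict(np.hstack([S_next, one_hot(np.full(len(S_next), a, dtype=int))]))
                for a in range(n_actions)
            ])
            targets = R + gamma * q_next.max(axis=1)
        q = GradientBoostingRegressor(n_estimators=100, max_depth=3).fit(X, targets)
    return q
```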
Treating repletion strategy as a hierarchical decision-making task, we estimate the
following three Q-functions for each electrolyte: Q^{route}(s, a) : S × A^{route} → R, which gives the action-value estimates for each repletion route, and Q^{PO}(s, a) : S × A^{PO} → R and Q^{IV}(s, a) : S × A^{IV} → R, corresponding to value estimates of different doses of oral
and IV supplements respectively. For each Q, we train an optimal policy such that:
\pi^{route}(s) = \arg\max_{a \in A^{route}} Q^{route}(s, a) \tag{6.4}

\pi^{PO}(s) = \arg\max_{a \in \{A^{PO} \setminus A^{PO}_0\}} Q^{PO}(s, a) \tag{6.5}

\pi^{IV}(s) = \arg\max_{a \in \{A^{IV} \setminus A^{IV}_0\}} Q^{IV}(s, a) \tag{6.6}
where A^{PO}_0 and A^{IV}_0 denote the elements in these action spaces corresponding to a dose of 0, that is, no repletion. In order to obtain our final treatment recommendations,
we first query πroute(s) for current state s. If this policy recommends one or both
modes of repletion, we query policies πPO and πIV accordingly to select the most
appropriate non-zero dosage level for the corresponding repletion route.
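A small sketch of this hierarchical querying logic; the policy objects are assumed to expose a recommend(state) method returning the argmax action of the corresponding Q-function (names are illustrative):

```python
def recommend_repletion(state, pi_route, pi_iv, pi_po):
    """Return a dict with the recommended IV and PO doses (None = no repletion)."""
    give_iv, give_po = pi_route.recommend(state)  # binary route vector, as in Eq. 6.2
    return {
        "IV": pi_iv.recommend(state) if give_iv else None,   # non-zero IV dose level
        "PO": pi_po.recommend(state) if give_po else None,   # non-zero PO dose level
    }
```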
6.1.4 Reward Inference using IRL with Batch Data
In order to better understand current clinical practice, and in turn determine how
to set priorities in clinical objectives for our optimal policy, we first apply an inverse
reinforcement learning (IRL)-based approach to inference of the weights in the reward
function r = w · ϕ, as optimized by clinicians in the historical UPHS data. The
fundamental strategy of most algorithms for inverse reinforcement learning is as
follows [5]: we first initialize a reward function parametrized by some weights w, and
solve the MDP given this reward function for an optimal policy. We then estimate
some representation of the dynamics of the RL agent when following this optimal
policy, such as the state visitation frequency [89, 135]. Finally, we compare this
estimate with the dynamics observed in the available batch data, and update w to
shift learnt policies towards to this behaviour, iterating until the learnt optimal policy
96
is sufficiently close. Here, we use policy feature expectations µπ, where:
µπ = Eπ
[∞∑t=0
γtϕt(·)
](6.7)
to obtain a representation of agent dynamics that is decoupled from the reward func-
tion. For the behaviour policy, this is evaluated by simple averaging over patient
trajectories in the dataset. However, in order to estimate the feature expectations
for the learnt optimal policy given only this batch data, we turn to estimators for
off-policy evaluation (as in Chapter 5), specifically per-decision importance sampling:
μ^π_PDIS = (1/N) ∑_{n=1}^{N} ∑_{t=0}^{T} γ^t ρ^{(n)}_t ϕ(s^{(n)}_t),  where  ρ^{(n)}_t = ∏_{i=0}^{t} π(a^n_i | s^n_i) / π_b(a^n_i | s^n_i)    (6.8)
At each epoch of IRL, we use the difference in the ℓ1-normalized feature expectations
of the behaviour and evaluation policy in order to update the reward weights w, as our
objective is to infer the relative values of the elements in w, ||w||1 = 1, and optimal
policies are invariant to scaling in the reward function. The complete procedure is
outlined in Algorithm 5.
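A minimal sketch of this procedure is given below, covering the empirical behaviour-policy feature expectations, the per-decision importance sampling estimate of Equation 6.8, and the weight-update loop of Algorithm 5. It assumes trajectories stored as lists of (state, action, next-step feature vector) tuples, policies exposed as action-probability functions, and a placeholder fqi_solve routine, so it illustrates the logic rather than the exact implementation used here.

import numpy as np

def behaviour_feature_expectations(trajectories, gamma):
    """Empirical mu_b: discounted sum of per-step feature vectors phi,
    averaged over observed patient trajectories."""
    mu = np.zeros_like(trajectories[0][0][2], dtype=float)
    for traj in trajectories:
        for t, (_, _, phi) in enumerate(traj):
            mu += (gamma ** t) * phi
    return mu / len(trajectories)

def pdis_feature_expectations(trajectories, pi_eval, pi_b, gamma):
    """Per-decision importance sampling estimate of mu_pi (Equation 6.8).
    pi_eval(a, s) and pi_b(a, s) return the probability of action a in state s."""
    mu = np.zeros_like(trajectories[0][0][2], dtype=float)
    for traj in trajectories:
        rho = 1.0
        for t, (s, a, phi) in enumerate(traj):
            rho *= pi_eval(a, s) / pi_b(a, s)   # cumulative per-decision weight
            mu += (gamma ** t) * rho * phi
    return mu / len(trajectories)

def linear_irl(trajectories, phi_dim, fqi_solve, pi_b, gamma, epochs=20, lr=0.2):
    """Weight-update loop of Algorithm 5; the placeholder fqi_solve returns the learnt
    policy's action-probability function for the current reward weights."""
    w = np.ones(phi_dim) / phi_dim                     # w_0: equal priority, ||w||_1 = 1
    mu_b = behaviour_feature_expectations(trajectories, gamma)
    for _ in range(epochs):
        pi_w = fqi_solve(trajectories, w, gamma)       # optimal policy for r = w . phi
        mu_w = pdis_feature_expectations(trajectories, pi_w, pi_b, gamma)
        grad = mu_b / np.abs(mu_b).sum() - mu_w / np.abs(mu_w).sum()
        w = w + lr * grad                              # gradient step on reward weights
        w = w / np.abs(w).sum()                        # re-normalize so ||w||_1 = 1
    return w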
6.2 Results
For the experiments described in this section, we take the full cohort of 13,164 patient
visits described in Section 6.1.1 and divide these into a training set of 7,000 and a
test set of 6,134 visits. Of those in the training set, the number of hospital visits
comprising electrolyte repletion events is 4,109, 4,430 and 867 for potassium (K),
magnesium (Mg) and phosphate (P) respectively, where the mean length of each visit is approximately four days. Sampling at 6-hour intervals, these admissions yield one-
step transition tuple sets of size 54,228, 59,775, and 15,863; these make up the training
sets for the treatment policies of each electrolyte. Learnt policies are then evaluated
Algorithm 5 Linear IRL with Batch Data
Input: D = {⟨s^n_t, a^n_t, s^n_{t+1}, ϕ^n_{t+1}⟩_{t=0:T_n−1}}_{n=1:N} where ϕ ∈ ℝ^d; behaviour policy π_b;
       reward weights w_0 ∈ ℝ^d, ||w_0||_1 = 1; discount γ; epochs E; learning rate α
  Initialize w = w_0
  μ_b = (1/N) ∑_n ∑_t γ^t ϕ^n_{t+1}          ▷ Evaluate feature expectations of behaviour policy
  for epoch i = 1 → E do
      r^n_{t+1} = w · ϕ^n_{t+1}  ∀ t, n ∈ D
      π_w ← FQI(D, w, γ)                      ▷ Solve for optimal policy with reward r = w · ϕ
      μ_w ← PDIS(π_w, π_b, D, γ)              ▷ Get OPE estimate of policy feature expectations
      ∇ = μ_b / ||μ_b||_1 − μ_w / ||μ_w||_1   ▷ Run gradient update for new weights w
      w ← w + α · ∇
      w = w / ||w||_1
  end
Result: w                                     ▷ Return final weights for reward function r(·) = w · ϕ(·)
using cohorts of size 3,440, 4,233, and 901 for K, Mg and P, selected in the same way
from the test partition.
6.2.1 Understanding Behaviour in UPHS
Our first set of experiments aims to infer the reward function optimized by providers
in the UPHS dataset, in order to gain insight into incentives underlying existing
patterns of behaviour. For each electrolyte, we initialize our reward function with
weights w_0 = (1/4)[ 1 1 1 1 ]^T, assigning equal priority to each of the four elements of
ϕ(·), and run the procedure described in Algorithm 5 over twenty epochs with a
learning rate of 0.2. At each epoch, the current weights w are used to learn a policy
for the route of electrolyte supplementation πroute(s), given the transitions in the
training set. The first column in Table 6.3 summarizes the final weights, averaging
over three independent runs of IRL, for cohorts K, Mg and P. We find that in the
case of potassium, we obtain small negative weights on both the cost of IV and the cost of PO repletion; that is, agents in the UPHS data appear to be rewarding the
administration of potassium supplements. This provides some evidence in support of
concerns that care providers either tend to order potassium supplements reflexively
(without fully considering cost or clinical necessity) or are unnecessarily conservative
in avoiding potassium deficiency. This is emphasised by the fact that the penalty
incurred for hypokalaemia (low potassium levels) is significantly higher relative to that
for hyperkalaemia. In order to try to correct for this, we train our optimal policies
with FQI using rewards with small positive penalties on repletion, while maintaining
the relative weights of the remaining penalties at approximately the same value.
On the other hand, inferred weights for both magnesium and phosphate repletions
suggest that a cost is incurred by the agent for both intravenous and oral repletion
in these cases, with a larger penalty on IV. This is more in line with what we would
expect. In particular, a higher effective cost on IV repletion can be justified in a num-
ber of ways: in the cost of the prescription itself in the necessary form for intravenous
delivery, in the provider time taken to initiate and monitor delivery of the drug, and in the increased risk of overcorrection when setting the infusion rate, as well as bruising,
clotting or infection at the infusion site.
Additionally, for both Mg and P, greater penalties are placed on above normal
values. This may be because the risks posed to patients by excess magnesium or
phosphate levels are considered to be more critical, or simply due to the fact these
electrolytes are less likely to be over-corrected; both hypermagnesemia and hyper-
phosphatemia are rare in the dataset. While this may be reasonable, we attempt to
avoid reinforcing this behaviour in our learnt optimal policies by shifting weights to
penalize abnormally high and low values approximately equally, and doing the same
with IV and oral repletion, when running FQI.
Table 6.3: Inferred behavioural priorities from IRL versus chosen reward weights for optimal policy, for electrolyte repletion route recommendations for K, Mg and P.

       w (UPHS policy π_b^route)        w (FQI policy π^route)
K      [−0.05 −0.08 0.20 0.67]^T        [0.07 0.04 0.15 0.74]^T
Mg     [ 0.09  0.08 0.56 0.25]^T        [0.29 0.29 0.21 0.21]^T
P      [ 0.31  0.11 0.32 0.26]^T        [0.17 0.17 0.33 0.33]^T
Figure 6.4: UPHS vs FQI-recommended potassium repletion for a sample admission: measured potassium (K) and K-PO/K-IV repletion events under the UPHS and FQI policies, over hours into admission.
6.2.2 Analysing Policies from FQI
With our chosen reward functions (as parametrized by the FQI policy weights in
Table 6.3), we learn policies for repletion route, oral and IV dosage for each of the
three electrolytes. Figure 6.4 illustrates the recommended repletion for potassium
for a single hospital visit, obtained through construction of a hierarchical treatment
recommendation strategy as outlined in Section 6.1.3. This is plotted for comparison
along with the true repletion events in the UPHS data, and the measured potassium
values. We can see from this example that, while the UPHS data contain multiple instances of IV repletion when potassium levels drop but are still well within the reference range, the FQI policy shows a preference for oral repletion: all repletion recommendations occur when potassium is below the reference range, and IV repletion is only recommended when the patient is significantly hypokalaemic.
Figure 6.5: Distribution of recommended actions (IV and PO dose levels) under the UPHS and FQI policies, for K, Mg and P.
Figure 6.5 compares the distribution of actions taken in the UPHS histories with
those recommended by our policies from FQI. We find that for potassium, the learnt
policy recommends 75% fewer intravenous supplements and 50% fewer oral supple-
ments. Where repletion is recommended, our policy tends to favour a higher effective dosage of either oral or IV potassium. This strategy can be justified by the fact that in current practice, repletion events often fail to bring potassium levels into the target range (Figure 6.1), suggesting that smaller doses are often either unnecessary or not cost effective. The total number of recommended repletions is also reduced for phosphate, though the distribution of recommended doses approximately matches that in the historical data. For magnesium, as in the case of potassium, we see a shift in preference towards higher IV doses, as well as more frequent oral repletion, while the total number of recommended repletions remains roughly unchanged. The
lack of significant reduction in repletions may again be partly attributed to the fact
that over-correction of magnesium is highly unlikely.
In order to investigate the factors influencing recommendations in our output
policy, we train a pair of classification policies for PO and IV repletion respectively,
mapping from patient states to binary actions. Figure 6.6 plots the Shapley values
[75] for these classifiers in the case of potassium.

Figure 6.6: Shapley values (SHAP value, impact on model output) of the top 10 features for (a) K-PO and (b) K-IV repletion.

Shapley values evaluate the contribution of each feature in the state representation in pushing the predicted probability of repletion away from the population mean prediction, along with the direction of influence. A high Shapley value associated with a feature can therefore be interpreted as that feature pushing towards a higher probability of recommended repletion. In addition to population-level feature importances, Shapley values allow for individual-level explanations of predictions.
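As a rough illustration, Shapley values of this kind can be computed efficiently for tree ensembles with the shap library [75]; the synthetic data, classifier, and feature construction below are placeholders standing in for the patient states and repletion classifiers analysed here.

import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder data standing in for patient-state features and observed repletion decisions
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                  # 500 states, 10 illustrative features
y = (X[:, 0] < -0.5).astype(int)                # e.g. replete when the first feature is low

clf = GradientBoostingClassifier().fit(X, y)    # binary PO (or IV) repletion classifier

explainer = shap.TreeExplainer(clf)             # efficient Shapley values for tree models
shap_values = explainer.shap_values(X)          # one value per feature, per patient state

# Population-level importances plus per-state (individual) explanations
shap.summary_plot(shap_values, X)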
As expected, current potassium values are among the most influential features in
both cases, with low potassium levels associated with higher probability of repletion.
We also find that oral repletion is more likely to be recommended at high levels
of sodium (which is typically inversely correlated with potassium) and at high urine
output. The latter is likely the direct result of the administration of diuretics, causing
increased rates of potassium loss and necessitating repletion (as we noted in Figure
6.2). Creatinine also features highly in policies for both oral and IV potassium,
with high levels of creatinine associated with increased probability of recommended
IV repletion. A possible mechanism that may explain this is that accumulation of
creatinine is commonly used to diagnose kidney failure, and typically necessitates
dialysis. Serum potassium can drop significantly following dialysis, with roughly
45% of patients presenting with post-dialysis hypokalaemia [91]. The optimal range
of potassium levels for patients undergoing dialysis is therefore often higher [101],
motivating urgent repletion.
6.2.3 Off-policy Policy Evaluation
Finally, we can produce a quantitative estimate of the value of our learnt policies
offline using Fitted Q evaluation (FQE) [64]. FQE adapts the iterative Q-value es-
timation problem solved in FQI through a series of supervised learning problems,
to the task of off-policy policy evaluation. Given our dataset of one-step transi-
tions, and policy πe to be evaluated, each iteration k of FQE takes as input all
state-action pairs of the form ⟨s^i_t, a^i_t⟩ in the dataset. The targets are then given by Q_k^{π_e}(s^i_t, a^i_t) = r^i_{t+1} + γ Q_{k−1}(s^i_{t+1}, π_e(s^i_{t+1})) ∀ t, i, such that the value of a given state-action pair is given by an estimate of the immediate reward plus the expected discounted value of following policy π_e from this point onwards. Solving this regression task at each iteration yields a sequence of estimates Q_k.
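A minimal sketch of FQE is given below, mirroring the FQI snippet in Section 6.1.3 but bootstrapping on the action chosen by the evaluation policy rather than the greedy maximum; as before, the data layout, helper names, and hyperparameters are illustrative assumptions.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fitted_q_evaluation(S, A, R, S_next, pi_e, gamma=0.99, n_iters=50):
    """FQE sketch: estimate Q^{pi_e} from batch transitions.
    pi_e(s) returns the (integer-coded) action chosen by the policy being evaluated."""
    n = len(S)
    X = np.hstack([S, A.reshape(n, 1)])
    targets = R.copy()                                 # Q_0(s, a) = immediate reward
    q = None
    for _ in range(n_iters):
        q = GradientBoostingRegressor(n_estimators=100, max_depth=3)
        q.fit(X, targets)
        # Bootstrap on the evaluation policy's action: r + gamma * Q_{k-1}(s', pi_e(s'))
        a_next = np.array([pi_e(s) for s in S_next]).reshape(n, 1)
        targets = R + gamma * q.predict(np.hstack([S_next, a_next]))
    # Estimated value of the evaluation policy at each state in the batch: Q(s, pi_e(s))
    a_eval = np.array([pi_e(s) for s in S]).reshape(n, 1)
    return q.predict(np.hstack([S, a_eval]))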
Figure 6.7 plots the distribution over all state-action pairs ⟨s, πe(s)⟩ of Q-values es-
timated in this way for the optimal repletion route policies of potassium, magnesium
and phosphate, over their corresponding test cohorts. Note these values represent
estimates of the discounted cumulative weighted repletion cost and penalties for elec-
trolyte imbalance, starting from each ⟨s, π_e(s)⟩. We find that for all three electrolytes, the
mean value of the FQI policy is greater than that of the estimated behaviour policy
followed by clinicians in the historical data. While gains are marginal in the case of
magnesium (in line with the fact that the percentage reduction in recommended Mg
repletion is relatively modest), estimates for both potassium and phosphate suggest
significant improvement over current practice.
Figure 6.7: Evaluation of policies for K, Mg and P using Fitted-Q Evaluation (distribution of estimated Q-values under the UPHS and FQI policies).
6.3 Conclusion
This chapter presents a data-centric approach to electrolyte repletion therapy in hospi-
tal. Patient admissions in a multi-centre dataset from the University of Pennsylvania
health system are modelled as Markov decision processes, and used to learn strategies
for efficient repletion of potassium, magnesium and phosphate levels through batch
reinforcement learning methods. The proposed policies suggest fewer repletion events
overall, with a shift towards oral supplements at higher doses. These recommenda-
tions have the potential to ease the burden presented by the ordering of prescriptions
for the pharmacist, reduce demands on the clinician in terms of periodic re-evaluation
of electrolyte levels and in the administration of supplements, and lower costs for the
hospital, without compromising on patient outcome.
This chapter outlines the first phase of ongoing work. As a next step, we look
to verify the robustness of learnt policies to temporal drift—the work here draws on
data collected between 2010 and 2015, and it is important to ensure that the quality
of our recommendations is invariant to any shifts in behavioural patterns in the last
five years—as well as generalizability across datasets from different health systems,
such as the MIMIC database [49].
Having evaluated our policies offline under these conditions to the extent possible,
we can then run comparisons with actions chosen by clinical experts post hoc; these
are likely to significantly differ from actions observed in historical data, as they are
unconstrained by any procedural bottlenecks. We envision an experimental design
similar to that in Li et al. [68] for example, presenting clinicians with single slices
of patient trajectories, encompassing all relevant patient information available up to
that time. In choosing time slices that would be most informative in evaluating and
honing our current optimal policy, we can draw from work by Gottesman et al. [35] on
identifying influential transitions. Finally, we hope to develop a way to operationalize these tools within the current clinical workflow, either through
reminders to clinicians when repletion is deemed necessary, or via a background sys-
tem that presents the best route forward given an active request for repletion by care
providers. While there remain a number of fundamental questions to be answered
before adoption by clinicians is possible, if implemented in a scalable and sustainable
way, we believe these tools can be transformative to the current healthcare system.
Chapter 7
Conclusion
In this thesis, we introduce a generalizable framework for the management of routine
interventions in the care of critically ill patients. We motivate the use of reinforcement
learning in the development of clinician-in-loop decision support systems and describe
how we can model planning problems in the acute care setting as Markov decision
processes, using clinically motivated definitions of state, action and reward function.
In choosing these problems, we target an array of common diagnostic and therapeutic
interventions: the management of mechanical ventilation and sedation, the ordering
of laboratory tests requiring invasive procedures, and the administration of effective
electrolyte repletion therapy. We explore how we can better understand objectives
and biases driving current clinical behaviour with respect to these interventions, and
use this where applicable to guide the intervention strategies learnt. Finally, we
present various methods by which to evaluate the optimal policies learnt through
offline, off-policy reinforcement learning using only past clinical histories—through
qualitative assessment of produced recommendations, the application of state-of-the-
art off-policy policy evaluation methods, and through comparisons and analyses with
domain experts—demonstrating that this approach shows promise in re-evaluating
and streamlining current clinical practice.
7.1 Future Research Directions
There are several avenues to be explored in building on the methodology described in
this thesis. These may be broadly encompassed by the following three prongs of work:
(i) advancing representation learning for clinical time series data, (ii) developing ex-
isting batch reinforcement learning methods, to improve sample-efficiency and speed
up learning from biased observational datasets with limited observability of certain
regions of the state-action space, and (iii) creating a robust framework for off-policy
evaluation using these observational datasets.
With respect to state and action representation, the use of Gaussian processes
(GPs) in this work was restricted to either the imputation of missing values or the
estimation of uncertainty in time series forecasting. However, this could in principle
be extended to learning a complete state transition model for the MDP, or alterna-
tively, to infer latent representations in the form of Gaussian process latent variable
models (GPLVMs) of the physiological state of the patient. Recurrent neural networks have also been widely used to model clinical time series [27], while other deep architectures such as auto-encoders have been applied to learning latent state
representations [103]. However, it is possible that there is a fundamental limitation
in the quality of the representation we can learn given the nature of the data at our
disposal, with no prior information. I believe an important direction for future in-
vestigations is in mechanisms by which we can incorporate domain knowledge more
explicitly to restrict the model class we must search over, both for modelling patient
dynamics and in learning a policy function.
At the opposite end of the spectrum, it may also be worthwhile to revisit the
use of discrete state representations in reinforcement learning, for example through
clustering methods [57] or self-organizing maps [25], and explore whether we can
provide guarantees on the quality of clustering and minimize loss of information,
while taking advantage of the interpretability and sample efficiency of tabular RL.
In terms of the action space in clinical decision-making, the options framework
[117] in hierarchical reinforcement learning—where an option can be thought of as
a ‘macro-action’, defined by some policy, an initiation set of states and a termina-
tion condition based on sub-goals—naturally fits into the way in which interventions
in acute care are typically administered. Each option may have a different set of
available actions, such as for a patient prior to mechanical ventilation, immediately
following intubation, or after the initiation of weaning protocol; policies are likely to
be consistent within an option, while extremely dissimilar across options.
Reward design is central to efficient learning in RL and remains an incredibly
challenging problem. While a number of heuristics and preliminary approaches to
systematic reward design are presented in this thesis, work is still needed in design-
ing mechanisms to, for example, explicitly learn the prioritization of frequent reward
signals against sparse feedback, as well as immediate versus long-term objectives in
problems with extended horizons. One possible approach to tackling these questions
is by modelling discount factors. The degree of long-term impact can vary accord-
ing to the reward objectives we consider. For example, while the negative impact of
sudden spikes in pain levels or transient physiological instability may be relatively
brief, the need for reintubation or the onset of organ failure can have a much more
prolonged impact on patient outcome. This motivates the optimization of different
objectives with tailored discount functions or, equivalently, over different treatment horizons [24]. Improving our understanding of current clinical practice is also in-
credibly important. A well-calibrated model of clinical reasoning in diagnostic and
therapeutic decisions can be used to bootstrap the learning of optimal policies. In
doing so, we can account for how this reasoning varies across individuals and is sub-
ject to procedural limitations, and design incentives to shift behaviour towards more
efficient care.
Finally, robust off-policy evaluation is a fundamental roadblock in the use of reinforcement learning in practice and continues to be an important, active area of research. The approaches to evaluation in this work are restricted to model-free methods, which inherently limits the quality of both the learnt policies and of off-policy evaluation in data-poor settings. Leveraging recent work that looks to enable robust evaluation in continuous, high-dimensional environments [64], and drawing on progress in modelling transition dynamics in clinical data to aid the development of model-based or hybrid OPE methods that allow for deeper analysis of policy performance offline, is necessary in engendering confidence in these methods and facilitating the next steps towards implementation in practice.
7.1.1 Translation to Clinical Practice
Beyond the modelling questions inherent to machine learning algorithms for clinical
decision support, there are several hurdles to be overcome before their adoption in
standard clinical practice [109, 127]. In this thesis, we touch upon the importance
of careful consideration of the choice of problem, along with endorsement by rele-
vant organizational stakeholders. Once a useful solution has been developed using
retrospective data as proof of concept, a necessary next step is to clearly quantify
the estimated value addition of the tool—in terms of clinical outcome as well as cost
and time saved—in order to justify the launch of prospective studies [110]. These
prospective studies typically involve ‘silent’ implementation of the decision support
tool, evaluated by clinicians post hoc, rather than immediately influencing patient
care [56]. This is followed by peer-reviewed randomized control trials evaluating
the statistical validity of estimated benefits, and ensuring that the tool is providing
novel, substantive insights rather than simply fitting to confounders. Additionally,
it is crucial to verify that the recommendations provided are consistent in accuracy
and utility when accounting for sociocultural factors in the healthcare delivery en-
vironment (such as clinician expertise, attitudes, and existing care patterns) and to
determine how the added benefit of the system may be influenced by characteristics
of the patient population [4]. This requires exploring the quality of learnt policies for
minority subgroups, and testing their robustness to dataset shift over time.
Even where the performance of the system is acceptable, a thoughtful approach
to diffusion of the technology and the reporting of recommendations is necessary for
sustained adoption by clinicians [53]. This includes providing some transparency and
explainability in output policies, both to foster trust in the machine learning system,
and to ensure continued confidence between patients and physicians [90]. Incorporat-
ing factors such as compliance, efficacy, and constraints in personnel or equipment,
as well as providing a degree of flexibility in recommendations that accounts for the
experience or expertise of the clinician, can help with this. It is also critical that the
necessary logistical infrastructure is in place to implement the recommended poli-
cies, through well-integrated EHR, pathology and prescribing systems for example,
and that clear regulation is in place regarding where responsibility lies for the de-
cisions made. This is essential in countering the legal and economic incentives that
perpetuate current modes of practice [95].
Finally, strategies are needed for the continual monitoring and maintenance of
these systems. It is important to be able to model any downstream effects of policies:
the potential impact of interventions on immediate as well as long-term patient health,
and whether these interventions may cause a shift in pressures to other stages of the
healthcare pipeline. For instance, policies that push for increased testing can in
turn increase the rate of false positives, and cause heightened demand for certain
unnecessary treatments. Policies favouring prolonged life support may overburden
the ICU, or palliative care facilities, while adding limited value in terms of quality
of life. It is also possible that recommendations reinforce biases that already exist
in the system, as these tools are used in both intended and unintended ways. These
are questions that are just beginning to arise in other fields, such as insurance or
loan approval systems [11, 69], but are still under-explored in the healthcare context.
Understanding how learning systems can be adapted to account for these issues, how
frequently systems should be adapted—continual updates may be subject to drift,
and are difficult to validate—and how these updates can incorporate changes in the
clinical landscape (from evolving definitions of disease progression to the inclusion of
new procedures or therapeutics) poses a significant challenge to future work.
Ultimately, data-driven decision support systems have the potential to be enor-
mously impactful to patient management in critical care. Recent events have drawn
sharp focus to the precarious state of current health infrastructure and the pressures
faced by healthcare workers; this is true on a global scale. Building robust data-driven
systems with careful consideration is one way in which we can ease these pressures,
streamline care, and help ensure that we are better prepared to tackle future crises.
Bibliography
[1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforce-ment learning. In Proceedings of the 21st International Conference on MachineLearning, page 1. ACM, 2004.
[2] Annette VM Alfonzo, Chris Isles, Colin Geddes, and Chris Deighan. Potassiumdisorders—clinical spectrum and emergency management. Resuscitation, 70(1):10–25, 2006.
[3] Nicolino Ambrosino and Luciano Gabbrielli. The difficult-to-wean patient. Ex-pert Review of Respiratory Medicine, 4(5):685–692, 2010.
[4] Derek C Angus. Randomized clinical trials of artificial intelligence. JAMA, 323(11):1043–1045, 2020.
[5] Saurabh Arora and Prashant Doshi. A survey of inverse reinforcement learning:Challenges, methods and progress. arXiv preprint arXiv:1806.06877, 2018.
[6] Onur Atan, William R Zame, and Mihaela van der Schaar. Learning optimalpolicies from observational data. arXiv preprint arXiv:1802.08679, 2018.
[7] Tony Badrick. Evidence-based laboratory medicine. The Clinical BiochemistReviews, 34(2):43, 2013.
[8] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, JohnSchulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprintarXiv:1606.01540, 2016.
[9] Daniel S Brown and Scott Niekum. Efficient probabilistic performance boundsfor inverse reinforcement learning. In Thirty-Second AAAI Conference on Ar-tificial Intelligence, 2018.
[10] Leo Anthony Celi, L Hinske Christian, Gil Alterovitz, and Peter Szolovits. Anartificial intelligence tool to predict fluid requirement in the intensive care unit:a proof-of-concept study. Critical Care, 12(6):R151, 2008.
[11] Allison JB Chaney, Brandon M Stewart, and Barbara E Engelhardt. Howalgorithmic confounding in recommendation systems increases homogeneity anddecreases utility. In Proceedings of the 12th ACM Conference on RecommenderSystems, pages 224–232, 2018.
[12] Li-Fang Cheng, Gregory Darnell, Bianca Dumitrascu, Corey Chivers, Michael EDraugelis, Kai Li, and Barbara E Engelhardt. Sparse multi-output gaussianprocesses for medical time series prediction. arXiv preprint arXiv:1703.09112,2017.
[13] Li-Fang Cheng, Niranjani Prasad, and Barbara E Engelhardt. An optimalpolicy for patient laboratory tests in intensive care units. In Pacific Symposiumon Biocomputing. Pacific Symposium on Biocomputing, volume 24, page 320.NIH Public Access, 2019.
[14] Paul F Christiano, Jan Leike, Tom B Brown, Miljan Martic, Shane Legg, andDario Amodei. Deep reinforcement learning from human preferences. In Pro-ceedings of the 31st International Conference on Neural Information ProcessingSystems, pages 4302–4310. Curran Associates Inc., 2017.
[15] Federico Cismondi, Leo A Celi, Andre S Fialho, Susana M Vieira, Shane R Reti,Joao MC Sousa, and Stan N Finkelstein. Reducing unnecessary lab testing inthe icu with artificial intelligence. International journal of medical informatics,82(5):345–358, 2013.
[16] Michael R Clarkson, Barry M Brenner, and Ciara Magee. Pocket Companionto Brenner and Rector’s The Kidney E-Book. Elsevier Health Sciences, 2010.
[17] Julie-Ann Collins, Aram Rudenski, John Gibson, Luke Howard, and RonanODriscoll. Relating oxygen partial pressure, saturation and content: thehaemoglobin–oxygen dissociation curve. Breathe, 11(3):194, 2015.
[18] Giorgio Conti, Jean Mantz, Dan Longrois, and Peter Tonner. Sedation andweaning from mechanical ventilation: Time for ‘best practice’ to catch up withnew realities? Multidisciplinary respiratory medicine, 9(1):45, 2014.
[19] Shayan Doroudi, Kenneth Holstein, Vincent Aleven, and Emma Brunskill. Se-quence matters, but how exactly? A method for evaluating activity sequencesfrom data. International Educational Data Mining Society, 2016.
[20] Robert Durichen, Marco A. F. Pimentel, Lei Clifton, Achim Schweikard, andDavid A. Clifton. Multitask Gaussian processes for multivariate physiologicaltime-series analysis. IEEE Transactions on Biomedical Engineering, 62(1):314–322, 2015.
[21] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch modereinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
[22] Damien Ernst, Guy-Bart Stan, Jorge Goncalves, and Louis Wehenkel. Clinicaldata based optimal STI strategies for HIV: A reinforcement learning approach.In 2006 45th IEEE Conference on Decision and Control, pages 667–672. IEEE,2006.
[23] Pablo Escandell-Montero, Milena Chermisi, Jos M. Martnez-Martnez, JuanGmez-Sanchis, Carlo Barbieri, Emilio Soria-Olivas, Flavio Mari, Joan Vila-Francs, Andrea Stopper, Emanuele Gatti, and Jos D. Martn-Guerrero. Opti-mization of anemia treatment in hemodialysis patients via reinforcement learn-ing. Artificial Intelligence in Medicine, 62(1):47 – 60, 2014. ISSN 0933-3657.
[24] William Fedus, Carles Gelada, Yoshua Bengio, Marc G Bellemare, and HugoLarochelle. Hyperbolic discounting and learning over multiple horizons. arXivpreprint arXiv:1902.06865, 2019.
[25] Vincent Fortuin, Matthias Huser, Francesco Locatello, Heiko Strathmann, andGunnar Ratsch. SOM-VAE: Interpretable discrete representation learning ontime series. arXiv preprint arXiv:1806.02199, 2018.
[26] Jerome H Friedman. Greedy function approximation: a gradient boosting ma-chine. Annals of statistics, pages 1189–1232, 2001.
[27] Joseph Futoma, Sanjay Hariharan, Mark Sendak, Nathan Brajer, MeredithClement, Armando Bedoya, Cara O’Brien, and Katherine Heller. An improvedmulti-output gaussian process rnn with real-time validation for early sepsisdetection. arXiv preprint arXiv:1708.05894, 2017.
[28] Yuanyuan Gao, Anqi Xu, Paul Jen-Hwa Hu, and Tsang-Hsiang Cheng. In-corporating association rule networks in feature category-weighted naive bayesmodel to support weaning decision making. Decision Support Systems, 2017.
[29] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomizedtrees. Machine learning, 63(1):3–42, 2006.
[30] Marzyeh Ghassemi, Marco A. F. Pimentel, Tristan Naumann, Thomas Brennan,David A. Clifton, Peter Szolovits, and Mengling Feng. A multivariate timeseriesmodeling approach to severity of illness assessment and forecasting in ICU withsparse, heterogeneous clinical data. In Proceedings of the Twenty-Ninth AAAIConference on Artificial Intelligence, pages 446–453, 2015.
[31] Mohammad Ghavamzadeh, Marek Petrik, and Yinlam Chow. Safe policy im-provement by minimizing robust baseline regret. In Advances in Neural Infor-mation Processing Systems, pages 2298–2306, 2016.
[32] J Goldstone. The pulmonary physician in critical care: Difficult weaning. Tho-rax, 57(11):986–991, 2002. ISSN 0040-6376.
[33] Omer Gottesman, Fredrik Johansson, Matthieu Komorowski, Aldo Faisal,David Sontag, Finale Doshi-Velez, and Leo Anthony Celi. Guidelines for re-inforcement learning in healthcare. Nature Medicine, 25(1):16–19, 2019.
[34] Omer Gottesman, Yao Liu, Scott Sussex, Emma Brunskill, and Finale Doshi-Velez. Combining parametric and nonparametric models for off-policy eval-uation. In International Conference on Machine Learning, pages 2366–2375,2019.
[35] Omer Gottesman, Joseph Futoma, Yao Liu, Soanli Parbhoo, Emma Brun-skill, Finale Doshi-Velez, et al. Interpretable off-policy evaluation in re-inforcement learning by highlighting influential transitions. arXiv preprintarXiv:2002.03478, 2020.
[36] Felix Graßer, Stefanie Beckert, Denise Kuster, Jochen Schmitt, Susanne Abra-ham, Hagen Malberg, and Sebastian Zaunseder. Therapy decision support basedon recommender system methods. Journal of healthcare engineering, 2017, 2017.
[37] Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart J Russell, and AncaDragan. Inverse reward design. In Advances in Neural Information ProcessingSystems, pages 6765–6774, 2017.
[38] Drayton A Hammond, Jarrod King, Niranjan Kathe, Kristina Erbach, JelenaStojakovic, Julie Tran, and Oktawia A Clem. Effectiveness and safety of potas-sium replacement in critically ill patients: A retrospective cohort study. Criticalcare nurse, 39(1):e13–e18, 2019.
[39] Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, Greg Ver Steeg, andAram Galstyan. Multitask learning and benchmarking with clinical time seriesdata. arXiv preprint arXiv:1703.07771, 2017.
[40] J Henry, Yuriy Pylypchuk, Talisha Searcy, and Vaishali Patel. Adoption ofelectronic health record systems among us non-federal acute care hospitals:2008–2015. ONC data brief, 35:1–9, 2016.
[41] Mohammed Hijazi and Mariam Al-Ansari. Protocol-driven vs. physician-drivenelectrolyte replacement in adult critically ill patients. Annals of Saudi medicine,25(2):105–110, 2005.
[42] Guy W Soo Hoo. Blood gases, weaning, and extubation. Respiratory Care, 48(11):1019–1021, 2012. ISSN 0020-1324.
[43] Jessie Huang, Fa Wu, Doina Precup, and Yang Cai. Learning safe policies withexpert guidance. In Advances in Neural Information Processing Systems, pages9105–9114, 2018.
[44] Christopher G Hughes, Stuart McGrane, and Pratik P Pandharipande. Sedationin the intensive care setting. Clin Pharmacol, 4:53–63, 2012.
[45] Stephanie L Hyland, Martin Faltys, Matthias Huser, Xinrui Lyu, ThomasGumbsch, Cristobal Esteban, Christian Bock, Max Horn, Michael Moor, Bas-tian Rieck, et al. Machine learning for early prediction of circulatory failure inthe intensive care unit. arXiv preprint arXiv:1904.07990, 2019.
[46] ICUMedical Inc. Reducing the risk of iatrogenic anemia and catheter-relatedbloodstream infections using closed blood sampling. 2015.
[47] Alexander Irpan, Kanishka Rao, Konstantinos Bousmalis, Chris Harris, JulianIbarz, and Sergey Levine. Off-policy evaluation via off-policy classification. InAdvances in Neural Information Processing Systems, pages 5438–5449, 2019.
[48] Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for rein-forcement learning. In Proceedings of the 33rd International Conference on In-ternational Conference on Machine Learning-Volume 48, pages 652–661, 2016.
[49] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, MenglingFeng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo AnthonyCeli, and Roger G Mark. MIMIC-III, a freely accessible critical care database.Scientific data, 3, 2016.
[50] Thomas T Joseph, Matthew DiMeglio, Annmarie Huffenberger, and KrzysztofLaudanski. Behavioural patterns of electrolyte repletion in intensive care units:lessons from a large electronic dataset. Scientific reports, 8(1):1–9, 2018.
[51] Andre G Journel and Charles J Huijbregts. Mining geostatistics, volume 600.Academic press London, 1978.
[52] Adam Kalai and Santosh Vempala. Efficient algorithms for on-line optimization.Journal of Computer and System Sciences, 71, 2016.
[53] Christopher J Kelly, Alan Karthikesalingam, Mustafa Suleyman, Greg Corrado,and Dominic King. Key challenges for delivering clinical impact with artificialintelligence. BMC medicine, 17(1):195, 2019.
[54] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014.
[55] L Kish. Survey sampling. John Wiley & Sons, Inc., New York, London 1965,IX+ 643 S., 31 Abb., 56 Tab., Preis 83 s. Biometrische Zeitschrift, 10(1):88–89,1968.
[56] M Komorowski. Clinical management of sepsis can be improved by artificialintelligence: yes. Intensive care medicine, 2019.
[57] Matthieu Komorowski, Leo A Celi, Omar Badawi, Anthony C Gordon, andA Aldo Faisal. The artificial intelligence clinician learns optimal treatmentstrategies for sepsis in intensive care. Nature Medicine, 24(11):1716, 2018.
[58] Raymond L Konger, Paul Ndekwe, Genea Jones, Ronald P Schmidt, MartyTrey, Eric J Baty, Denise Wilhite, Imtiaz A Munshi, Bradley M Sutter, Mad-damsetti Rao, et al. Reduction in unnecessary clinical laboratory testingthrough utilization management at a us government veterans affairs hospital.American journal of clinical pathology, 145(3):355–364, 2016.
[59] James S Krinsley, Praveen K Reddy, and Abid Iqbal. What is the optimal rateof failed extubation? Critical Care, 16(1):111, 2012.
[60] Hung-Ju Kuo, Hung-Wen Chiu, Chun-Nin Lee, Tzu-Tao Chen, Chih-ChengChang, and Mauo-Ying Bien. Improvement in the prediction of ventilator wean-ing outcomes by an artificial neural network in a medical icu. Respiratory care,60(11):1560–1569, 2015.
[61] Timothy S Lancaster, Matthew R Schill, Jason W Greenberg, Marc R Moon,Richard B Schuessler, Ralph J Damiano Jr, and Spencer J Melby. Potassiumand magnesium supplementation do not protect against atrial fibrillation aftercardiac operation: a time-matched analysis. The Annals of thoracic surgery,102(4):1181–1188, 2016.
[62] John Langford and Tong Zhang. The epoch-greedy algorithm for contextualmulti-armed bandits. In Proceedings of the 20th International Conference onNeural Information Processing Systems, pages 817–824. Citeseer, 2007.
[63] Romain Laroche, Paul Trichelair, and Remi Tachet Des Combes. Safe pol-icy improvement with baseline bootstrapping. In International Conference onMachine Learning, pages 3652–3661, 2019.
[64] Hoang Le, Cameron Voloshin, and Yisong Yue. Batch policy learning underconstraints. In International Conference on Machine Learning, pages 3703–3712, 2019.
[65] Joon Lee and David M Maslove. Using information theory to identify redun-dancy in common laboratory tests in the intensive care unit. BMC medicalinformatics and decision making, 15(1):59, 2015.
[66] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A Ortega, Tom Everitt,Andrew Lefrancq, Laurent Orseau, and Shane Legg. Ai safety gridworlds. arXivpreprint arXiv:1711.09883, 2017.
[67] Sergey Levine, Zoran Popovic, and Vladlen Koltun. Feature construction forinverse reinforcement learning. In Advances in Neural Information ProcessingSystems, pages 1342–1350, 2010.
[68] Luchen Li, Ignacio Albert-Smet, and Aldo A Faisal. Optimizing medical treat-ment for sepsis in intensive care: from reinforcement learning to pre-trial eval-uation. arXiv preprint arXiv:2003.06474, 2020.
[69] Lydia T Liu, Sarah Dean, Esther Rolf, Max Simchowitz, and Moritz Hardt.Delayed impact of fair machine learning. In Proceedings of the 28th InternationalJoint Conference on Artificial Intelligence, pages 6196–6200. AAAI Press, 2019.
[70] Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curseof horizon: Infinite-horizon off-policy estimation. In Advances in Neural Infor-mation Processing Systems, pages 5361–5371, 2018.
[71] Yao Liu, Omer Gottesman, Aniruddh Raghu, Matthieu Komorowski, Aldo AFaisal, Finale Doshi-Velez, and Emma Brunskill. Representation balancingmdps for off-policy policy evaluation. In Advances in Neural Information Pro-cessing Systems, pages 2644–2653, 2018.
[72] Daniel J Lizotte and Eric B Laber. Multi-objective markov decision processesfor data-driven decision support. Journal of Machine Learning Research, 17(210):1–28, 2016.
[73] Robert Tyler Loftin, James MacGlashan, Bei Peng, Matthew E Taylor,Michael L Littman, Jeff Huang, and David L Roberts. A strategy-aware tech-nique for learning behaviors from discrete human feedback. In Twenty-EighthAAAI Conference on Artificial Intelligence, 2014.
[74] TO Loftsgard and R Kashyap. Clinicians role in reducing lab order frequencyin icu settings. Journal of Perioperative and Critical Intensive Care Nursing, 2(112):2, 2016.
[75] Scott M Lundberg and Su-In Lee. A unified approach to interpreting modelpredictions. In Advances in neural information processing systems, pages 4765–4774, 2017.
[76] Yuan Luo, Peter Szolovits, Anand S. Dighe, and Jason M. Baron. Using ma-chine learning to predict laboratory test results. American Journal of ClinicalPathology, 145(6):778–788, 2016.
[77] Martin A Makary and Michael Daniel. Medical error - the third leading causeof death in the us. British Medical Journal (Online), 353, 2016.
[78] Paul E Marik and Abdalsamih M Taeb. SIRS, qSOFA and new sepsis definition.Journal of thoracic disease, 9(4):943, 2017.
[79] David M Maslove, Francois Lamontagne, John C Marshall, and Daren K Hey-land. A path to precision in the icu. Critical Care, 21(1):79, 2017.
[80] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds andsample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
[81] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, IoannisAntonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deepreinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[82] Brett L Moore, Eric D Sinzinger, Todd M Quasny, and Larry D Pyeatt. In-telligent control of closed-loop sedation in simulated ICU patients. In FLAIRSConference, pages 109–114, 2004.
[83] Martina Mueller, Jonas S Almeida, Romesh Stanislaus, and Carol L Wagner.Can machine learning methods predict extubation outcome in premature infantsas well as clinicians? Journal of neonatal biology, 2, 2013.
[84] Thomas A Murray, Ying Yuan, and Peter F Thall. A bayesian machine learningapproach for optimizing dynamic treatment regimes. Journal of the AmericanStatistical Association, 113(523):1255–1267, 2018.
[85] Sriraam Natarajan and Prasad Tadepalli. Dynamic preferences in multi-criteriareinforcement learning. In Proceedings of the 22nd International Conference onMachine learning, pages 601–608. ACM, 2005.
[86] S. Nemati, M. M. Ghassemi, and G. D. Clifford. Optimal medication dosingfrom suboptimal clinical examples: A deep reinforcement learning approach.In 2016 38th Annual International Conference of the IEEE Engineering inMedicine and Biology Society (EMBC), pages 2978–2981, Aug 2016.
[87] Shamim Nemati, Li Wei H Lehman, Ryan P Adams, and Atul Malhotra. Discov-ering shared cardiovascular dynamics within a patient cohort. In 34th AnnualInternational Conference of the IEEE Engineering in Medicine and Biology So-ciety, EMBS 2012, pages 6526–6529, 2012.
[88] Shamim Nemati, Mohammad M Ghassemi, and Gari D Clifford. Optimal medi-cation dosing from suboptimal clinical examples: A deep reinforcement learningapproach. In Engineering in Medicine and Biology Society (EMBC), 2016 IEEE38th Annual International Conference of the, pages 2978–2981. IEEE, 2016.
[89] Andrew Y Ng and Stuart J Russell. Algorithms for Inverse ReinforcementLearning. In Proceedings of the Seventeenth International Conference on Ma-chine Learning, pages 663–670. Morgan Kaufmann Publishers Inc., 2000.
[90] Shantanu Nundy, Tara Montgomery, and Robert M Wachter. Promoting trustbetween patients and physicians in the era of artificial intelligence. Jama, 322(6):497–498, 2019.
[91] Tsuyoshi Ohnishi, Miho Kimachi, Shingo Fukuma, Tadao Akizawa, and Shu-nichi Fukuhara. Postdialysis hypokalemia and all-cause mortality in patientsundergoing maintenance hemodialysis. Clinical Journal of the American Societyof Nephrology, 14(6):873–881, 2019.
[92] Dirk Ormoneit and Saunak Sen. Kernel-based reinforcement learning. Machinelearning, 49(2-3):161–178, 2002.
[93] Gustavo A Ospina-Tascon, Gustavo Luiz Buchele, and Jean-Louis Vincent.Multicenter, randomized, controlled trials evaluating mortality in intensive care:doomed to fail? Critical care medicine, 36(4):1311–1322, 2008.
[94] R. Padmanabhan, N. Meskin, and W. M. Haddad. Closed-loop control of anes-thesia and mean arterial pressure using reinforcement learning. In 2014 IEEESymposium on Adaptive Dynamic Programming and Reinforcement Learning(ADPRL), pages 1–8, Dec 2014.
[95] Trishan Panch, Heather Mattie, and Leo Anthony Celi. The inconvenient truthabout AI in healthcare. Npj Digital Medicine, 2(1):1–3, 2019.
[96] Sonal Parasrampuria and Jawanna Henry. Hospitals use of electronic healthrecords data, 2015-2017. 2019.
[97] Shruti B Patel and John P Kress. Sedation and analgesia in the mechanicallyventilated patient. American journal of respiratory and critical care medicine,185(5):486–497, 2012.
[98] Niranjani Prasad, Li Fang Cheng, Corey Chivers, Michael Draugelis, and Bar-bara E. Engelhardt. A reinforcement learning approach to weaning of mechan-ical ventilation in intensive care units. In Uncertainty in Artificial Intelligence2017, 1 2017.
[99] Niranjani Prasad, Barbara Engelhardt, and Finale Doshi-Velez. Defining ad-missible rewards for high-confidence policy evaluation in batch reinforcementlearning. In Proceedings of the ACM Conference on Health, Inference, andLearning, pages 1–9, 2020.
[100] Doina Precup, Richard S. Sutton, and Satinder P. Singh. Eligibility traces foroff-policy policy evaluation. In Proceedings of the Seventeenth InternationalConference on Machine Learning, ICML ’00, pages 759–766, San Francisco,CA, USA, 2000. Morgan Kaufmann Publishers Inc. ISBN 1-55860-707-2.
[101] Patrick H Pun and John P Middleton. Dialysate potassium, dialysate magne-sium, and hemodialysis risk. Journal of the American Society of Nephrology,28(12):3441–3451, 2017.
[102] Martin L Puterman. Markov Decision Processes: Discrete Stochastic DynamicProgramming. John Wiley & Sons, Inc., 1994.
[103] Aniruddh Raghu, Matthieu Komorowski, Leo Anthony Celi, Peter Szolovits,and Marzyeh Ghassemi. Continuous state-space models for optimal sepsis treat-ment: a deep reinforcement learning approach. In Proceedings of the MachineLearning for Health Care, MLHC 2017, Boston, Massachusetts, USA, 18-19August 2017, pages 147–163, 2017.
[104] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processesfor Machine Learning. The MIT Press, 2006.
[105] Martin Riedmiller. Neural fitted q iteration–first experiences with a data effi-cient neural reinforcement learning method. In European Conference on Ma-chine Learning, pages 317–328. Springer, 2005.
[106] Carlos A Santacruz, Adriano J Pereira, Edgar Celis, and Jean-Louis Vincent.Which multicenter randomized controlled trials in critical care medicine haveshown reduced mortality? a systematic review. Critical Care Medicine, 47(12):1680–1691, 2019.
[107] Elad Sarafian, Aviv Tamar, and Sarit Kraus. Safe policy learning from obser-vations. arXiv preprint arXiv:1805.07805, 2018.
[108] Rhodri Saunders and Dimitris Geogopoulos. Evaluating the cost-effectivenessof proportional-assist ventilation plus vs. pressure support ventilation in theintensive care unit in two countries. Frontiers in public health, 6, 2018.
[109] Mark P Sendak, Joshua DArcy, Sehj Kashyap, Michael Gao, Marshall Nichols,Kristin Corey, William Ratliff, and Suresh Balu. A path for translation ofmachine learning products into healthcare delivery. European Medical JournalInnovations, 2020.
[110] Nigam H Shah, Arnold Milstein, and Steven C Bagley. Making machine learningmodels clinically useful. Jama, 322(14):1351–1352, 2019.
[111] Susan M Shortreed, Eric Laber, Daniel J Lizotte, T Scott Stroup, Joelle Pineau,and Susan A Murphy. Informing sequential clinical decision-making throughreinforcement learning: an empirical study. Machine learning, 84(1-2):109–136,2011.
[112] Mervyn Singer, Clifford S Deutschman, Christopher Warren Seymour, ManuShankar-Hari, Djillali Annane, Michael Bauer, Rinaldo Bellomo, Gordon RBernard, Jean-Daniel Chiche, Craig M Coopersmith, et al. The third inter-national consensus definitions for sepsis and septic shock (sepsis-3). Jama, 315(8):801–810, 2016.
[113] Jonathan Sorg, Satinder P Singh, and Richard L Lewis. Internal rewards miti-gate agent boundedness. In Proceedings of the 27th international conference onmachine learning (ICML-10), pages 1007–1014, 2010.
[114] Jonathan Daniel Sorg. The optimal reward problem: Designing effective rewardfor bounded agents. University of Michigan, 2011.
[115] Oliver Stegle, Sebastian V. Fallert, David J. C. MacKay, and Søren Brage.Gaussian process robust regression for noisy heart rate data. IEEE Transactionson Biomedical Engineering, 55(9):2143–2151, 2008.
[116] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduc-tion. MIT press, 2018.
[117] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs andsemi-MDPs: A framework for temporal abstraction in reinforcement learning.Artificial intelligence, 112(1-2):181–211, 1999.
[118] Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. Highconfidence policy improvement. In International Conference on Machine Learn-ing, pages 2380–2388, 2015.
[119] Philip S Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artifi-cial Intelligence, 2015.
[120] Samuele Tosatto, Matteo Pirotta, Carlo d’Eramo, and Marcello Restelli.Boosted fitted q-iteration. In Proceedings of the 34th International Conferenceon Machine Learning-Volume 70, pages 3434–3443. JMLR. org, 2017.
[121] Udensi K Udensi and Paul B Tchounwou. Potassium homeostasis, oxidativestress, and human disease. International journal of clinical and experimentalphysiology, 4(3):111, 2017.
[122] Jean-Louis Vincent, Greg S Martin, and Mitchell M Levy. qSOFA does notreplace SIRS in the definition of sepsis. Critical care, 20(1):210, 2016.
[123] Andrew S Wang, Navpreet K Dhillon, Nikhil T Linaval, Nicholas Rottler, Au-drey R Yang, Daniel R Margulies, Eric J Ley, and Galinos Barmparas. Theimpact of iv electrolyte replacement on the fluid balance of critically ill surgicalpatients. The American Surgeon, 85(10):1171–1174, 2019.
[124] Yingfei Wang and Warren Powell. An optimal learning method for developingpersonalized treatment regimes. arXiv preprint arXiv:1607.01462, 2016.
[125] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
[126] Colin P West, Mashele M Huschka, Paul J Novotny, Jeff A Sloan, Joseph CKolars, Thomas M Habermann, and Tait D Shanafelt. Association of perceivedmedical errors with resident distress and empathy: a prospective longitudinalstudy. Jama, 296(9):1071–1078, 2006.
[127] Jenna Wiens, Suchi Saria, Mark Sendak, Marzyeh Ghassemi, Vincent X Liu,Finale Doshi-Velez, Kenneth Jung, Katherine Heller, David Kale, MohammedSaeed, et al. Do no harm: a roadmap for responsible machine learning for healthcare. Nature medicine, 25(9):1337–1340, 2019.
[128] Andrew Wilson and Ryan Adams. Gaussian process kernels for pattern discov-ery and extrapolation. In International conference on machine learning, pages1067–1075, 2013.
[129] Andrew Wong, Albert T Young, April S Liang, Ralph Gonzales, Vanja C Dou-glas, and Dexter Hadley. Development and validation of an electronic healthrecord–based machine learning model to estimate delirium risk in newly hospi-talized patients without known cognitive impairment. JAMA network open, 1(4):e181018–e181018, 2018.
[130] Hannah Wunsch, Jason Wagner, Maximilian Herlim, David Chong, AndrewKramer, and Scott D Halpern. ICU occupancy and mechanical ventilator usein the united states. Critical care medicine, 41(12), 2013.
[131] Chao Yu, Jiming Liu, and Shamim Nemati. Reinforcement learning in health-care: a survey. arXiv preprint arXiv:1908.08796, 2019.
[132] Zhongheng Zhang, Yiming Zhao, Aran Canes, Dan Steinberg, Olga Lyashevska,et al. Predictive analytics with gradient boosting in clinical medicine. Annalsof translational medicine, 7(7), 2019.
[133] Yufan Zhao, Donglin Zeng, Mark A. Socinski, and Michael R. Kosorok. Re-inforcement learning strategies for clinical trials in nonsmall cell lung cancer.Biometrics, 67(4):1422–1433, 2011. ISSN 1541-0420.
[134] Ming Zhi, Eric L Ding, Jesse Theisen-Toupal, Julia Whelan, and Ramy Ar-naout. The landscape of inappropriate laboratory testing: a 15-year meta-analysis. PloS one, 8(11):e78962, 2013.
[135] Brian D Ziebart, AndrewMaas, J Andrew Bagnell, and Anind K Dey. Maximumentropy inverse reinforcement learning. In Proceedings of the 23rd nationalconference on Artificial intelligence-Volume 3, pages 1433–1438, 2008.