Towards Improving Health Decisions with Reinforcement Learning
Finale Doshi-Velez
Collaborators: Sonali Parbhoo, Maurizio Zazzi, Volker Roth, Xuefeng Peng, David Wihl, Yi Ding, Omer Gottesman, Liwei Lehman, Matthieu
Komorowski, Aldo Faisal, David Sontag, Fredrik Johansson, Leo Celi, Aniruddh Raghu, Yao Liu, Emma Brunskill, and the CS282 2017 Course
Our Lab: ML Towards Effective, Interpretable Health Interventions
Predicting and Optimizing Interventions in ICU (Wu et al. 2015; Ghassemi et al. 2017; Peng 2018; Raghu 2018; Gottesman 2018; Gottesman 2019)
HIV Management Optimization (Parbhoo et al. 2017; Parbhoo et al. 2018)
Depression Treatment Optimization (Hughes et al. 2016; Hughes et al. 2017; Hughes et al. 2018; Pradier 2019)
Today: How can reinforcement learning help solve problems in healthcare?
Focus: situations that require a sequence of decisions
Challenges in the Health Space
● The data are typically available only in batch
  ● No control over the clinician policy!
● The data give very partial views of the process
  ● Measurements, confounds missing
  ● Intents missing
● Success is not always easy to quantify
BUT: We still want to extract as much from these data as we can!
Problem Set-Up
[Diagram: agent–world loop — the agent sends an action, the world returns an observation and reward; the world's latent state is s_t.]
Solutions: Train Model/Value Function
Solves the long-term problem (e.g. Ernst 2005; Parbhoo 2014; Marivate 2015), often in simulation/simplified settings.
[Figure: graphical model unrolled over three time steps — each state comprises CD4, viral load, and mutations; drug choices act on each state.]

Rewards:
If V_t > 40:  r_t = −0.7 log V_t + 0.6 log T_t − 0.2 |M_t|
Else:         r_t = 5 + 0.6 log T_t − 0.2 |M_t|
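A minimal Python sketch of this reward (function and variable names are mine; V_t = viral load, T_t = CD4 count, |M_t| = number of mutations):

```python
import math

def hiv_reward(viral_load, cd4, n_mutations):
    """Per-step reward from the slide: while the virus is unsuppressed
    (V_t > 40 copies/mL), penalize log viral load; once suppressed,
    replace that term with a flat bonus of 5. CD4 count is always
    rewarded and each resistance mutation is penalized."""
    if viral_load > 40:
        return -0.7 * math.log(viral_load) + 0.6 * math.log(cd4) - 0.2 * n_mutations
    return 5.0 + 0.6 * math.log(cd4) - 0.2 * n_mutations
```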
Solutions: Nonparametric
Use the full patient history to predict immediate outcomes (e.g. Bogojeska 2012), but often ignore long term effects.
f(CD4, etc., drug) = I(v < 400 in 21 days)

[Figure: patient space — the prediction for a new patient is read off from nearby patients.]
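A nonparametric predictor of this kind can be sketched as a k-nearest-neighbour estimate (the feature encoding and all names here are illustrative assumptions, not the cited papers' exact method):

```python
import numpy as np

def knn_success_prob(query, drug, histories, drugs, outcomes, k=5):
    """Estimate P(viral load < 400 within 21 days) for a new patient
    `query` under `drug` by averaging the observed outcomes of the k
    most similar patients who received the same drug."""
    mask = drugs == drug
    X, y = histories[mask], outcomes[mask]
    dists = np.linalg.norm(X - query, axis=1)  # similarity in patient space
    nearest = np.argsort(dists)[:k]
    return float(y[nearest].mean())
```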
Our insight: These approaches have complementary strengths!
Patient Space
Patient Space
Patients in clusters may be best modeled by their neighbors
Patient Space
Patients without neighbors may be better modeled with a parametric model
New Solution: Ensemble the Predictors
h(POMDP action, kernel action, patient statistics) = actual action
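One way to realize h(·) is a learned gate over the two experts. A minimal sketch, where the logistic gate and its inputs are illustrative assumptions:

```python
import numpy as np

def ensemble_action(pomdp_action, kernel_action, patient_stats, gate_weights):
    """Mixture-of-experts gate: a logistic score over patient statistics
    (e.g. distance to nearest neighbours, history length) decides whether
    to return the kernel expert's action or the POMDP expert's action."""
    z = float(np.dot(gate_weights, patient_stats))
    p_kernel = 1.0 / (1.0 + np.exp(-z))  # probability of trusting the kernel
    return kernel_action if p_kernel >= 0.5 else pomdp_action
```

In practice the gate weights would be fit so that the ensemble best reproduces the actual (clinician) action.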
Application to HIV Management
● 32,960 patients from the EuResist database; hold out 3,000 for testing.
● Observations: CD4s, viral loads, mutations
● Actions: 312 common drug combinations (from 20 drugs)
| Approach | DR Reward |
| --- | --- |
| Random Policy | −7.31 ± 3.72 |
| Neighbor Policy | 9.35 ± 2.61 |
| Model-Based Policy | 3.37 ± 2.15 |
| Policy-Mixture Policy | 11.52 ± 1.31 |
| Model-Mixture Policy | 12.47 ± 1.38 |
Parbhoo et al., AMIA 2017; Parbhoo et al., PLoS Medicine 2018
Extension: Putting the mixing in the model.
And: our hypothesis was correct! The model is used when neighbors are far.
[Figures: kernel vs. POMDP usage as a function of 2nd-quantile neighbor distance and of history length.]
Application to Sepsis Management
● Cohort of 15,415 patients with sepsis from the MIMIC dataset (same as Raghu et al. 2017); contains vitals and some lab tests.
● Actions: focus on vasopressors and fluids, used to manage circulation.
● Goal: reduce 30-day mortality; rewards based on probability of 30-day mortality:
Peng et al., AMIA 2018
r(o, a, o′) = −log [ f(o′) / (1 − f(o′)) ] + log [ f(o) / (1 − f(o)) ]

where f(·) is the predicted probability of 30-day mortality.
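Read as a change in the predicted log-odds of 30-day mortality, the reward can be sketched as follows (assuming f is a fitted mortality-probability model; function names are mine):

```python
import math

def log_odds(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def sepsis_reward(f_before, f_after):
    """Reward a transition o -> o' by how much it reduced the predicted
    log-odds of 30-day mortality: r = log-odds(f(o)) - log-odds(f(o'))."""
    return log_odds(f_before) - log_odds(f_after)
```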
Minor Adjustment: Values, not Models
h(LSTM+DDQN action, kernel action, patient statistics) = actual action
Generative models hard to build → LSTM+DDQN
LSTM+DDQN suggests never-taken actions → hard cap.
Application to Sepsis Management
Just the start: Statistical Methods have high variance
Gottesman et al. 2018, Raghu et al. 2018
And select non-representative cohorts
More issues arise with poor representation choices and poor reward functions.
How can we increase confidence in our results?
[Diagram: agent–world loop, annotated with the topics ahead: Reward, Representation, Off-policy Eval, Checks++.]
Off-Policy Evaluation
Core question: Given data collected under some behavior policy π_b, can we estimate the value of some other evaluation policy π_e?

Three main kinds of approaches:
• Importance sampling: reweight current data (high variance)
• Model-based: build a model with current data, simulate (high bias)
• Value-based: apply value evaluation to current data (high bias)

The importance weights:

ρ_n = ∏_t π_e(a_t^n | s_t^n) / π_b(a_t^n | s_t^n)
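The importance-sampling branch can be sketched in a few lines (ordinary, non-weighted IS with undiscounted returns; all names are mine):

```python
def is_value_estimate(trajectories, pi_e, pi_b):
    """Ordinary importance-sampling estimate of the value of pi_e from
    trajectories logged under pi_b. Each trajectory is a list of
    (state, action, reward); pi_e(a, s) and pi_b(a, s) return the
    probability of taking action a in state s."""
    estimates = []
    for traj in trajectories:
        rho, ret = 1.0, 0.0
        for s, a, r in traj:
            rho *= pi_e(a, s) / pi_b(a, s)  # cumulative importance weight
            ret += r                        # undiscounted return, for brevity
        estimates.append(rho * ret)
    return sum(estimates) / len(estimates)
```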
Stitching to Increase Sample Sizes
Importance-sampling estimators suffer because most importance weights get small very fast:

ρ_n = ∏_t π_e(a_t^n | s_t^n) / π_b(a_t^n | s_t^n)

One way to ameliorate the issue: "stitch" pieces of zero-weight trajectories together to get more nonzero-weight trajectories.

[Figure: states s1–s4 — the sequence desired under π_e is assembled by stitching together segments of real weight-0 sequences.]
(Sussex et al, ICMLWS 2018)
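A minimal sketch of the stitching idea in a discrete-state setting (this is my simplification, not Sussex et al.'s exact algorithm: splice a π_e-consistent prefix onto another trajectory's suffix that starts in the matching state):

```python
def stitch_trajectories(trajs, pi_e):
    """For each trajectory whose prefix follows pi_e but then deviates,
    look for another trajectory that passes through the deviation state
    taking pi_e's action, and splice its suffix on. Trajectories are
    lists of (state, action, reward); pi_e(s) is deterministic."""
    stitched = []
    for t1 in trajs:
        k = 0  # length of the pi_e-consistent prefix of t1
        while k < len(t1) and t1[k][1] == pi_e(t1[k][0]):
            k += 1
        if k == len(t1):
            continue  # t1 already has nonzero weight under pi_e
        for t2 in trajs:
            if t2 is t1:
                continue
            for j, (s, a, _) in enumerate(t2):
                if s == t1[k][0] and a == pi_e(s):
                    stitched.append(t1[:k] + t2[j:])
                    break
    return stitched
```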
Better Models: Mixtures help again!
(Gottesman et al, ICML 2019)
[Figure: a simulated trajectory passes through a well-modeled area (accurate simulation) and a poorly modeled area (inaccurate simulation), compared against the real transitions.]
We use RL to bound the long-term accuracy of the value estimate.
Bound on the Quality
|g_T − ĝ_T| ≤ L_r ∑_{t=0}^{T} γ^t ∑_{t′=0}^{t−1} (L_T)^{t′} ε_T(t − t′ − 1) + ∑_{t=0}^{T} γ^t ε_r(t)

|g_T − ĝ_T| is the total return error; the first term collects the error due to state estimation (ε_T), the second the error due to reward estimation (ε_r).
Closely related to bound in - Asadi, Misra, Littman. “Lipschitz Continuity in Model-based Reinforcement Learning.” (ICML 2018).
Estimating Errors
Parametric Nonparametric
ε̂_{t,p} ≈ max_{t′} Δ(x_{t′+1}, f_t(x_{t′}, a))

[Figure: the parametric error is estimated from the worst one-step prediction residual over the data; the nonparametric error ε̂_{t′}(x_{t′}, a) from the distance to the nearest datapoint x*_{t′}.]
Toy Example
[Figure: toy example — a parametric model's predictions alongside the possible actions.]
Example with HIV Simulator

We use RL to bound the long-term accuracy of the value estimate.
Better Models: Designed for Evaluation
Main objective: find a model that will minimize error in individual treatment effects,

E_{s0}[ (V^π(s0) − V̂^π(s0))² ],

not just the error in the population average, (E_{s0}[V^π(s0)] − E_{s0}[V̂^π(s0)])²,

where the value function is estimated via trajectories from an approximated model M. Question: Can we do better than just optimizing M for p(M|data)?

We show this can be optimized via a transfer-learning-type objective:

L(M) = ∑_{n,t} ℓ(M, n, t) + ∑_{n,t} ρ_{nt} ℓ(M, n, t) + ...

("on-policy" loss + "reweighted for π_e" loss)
(Liu et al, NIPS 2018)
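With per-step model losses and cumulative importance ratios precomputed, the combined objective can be sketched as follows (the mixing coefficient `alpha` and all names are my assumptions):

```python
import numpy as np

def reweighted_model_loss(step_losses, rhos, alpha=1.0):
    """Transfer-learning-style objective from the slide:
    L(M) = sum_{n,t} loss(M,n,t) + alpha * sum_{n,t} rho_{nt} * loss(M,n,t),
    i.e. the ordinary "on-policy" loss plus the same per-step losses
    reweighted toward pi_e by cumulative importance ratios rho_{nt}."""
    step_losses = np.asarray(step_losses, dtype=float)
    rhos = np.asarray(rhos, dtype=float)
    return float(step_losses.sum() + alpha * (rhos * step_losses).sum())
```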
Checking the reasonableness of our policies
Some Basic Digging
Positive Evidence: Reproducing across sites (robust to covariate shift)
Our HIV results hold across two distinct cohorts.
[Figure: results replicate on both the EuResist database and the Swiss HIV Cohort Study.]
Positive Evidence: Check importance weights, variances
Sepsis: results hold with different control variates
Ask the Experts
Asking the Doctors
● HIV: Checking against standard of care:
● As well as three expert clinicians:
What’s the best way to “ask the doctors”?
Detour: Summarizing a Treatment Policy
How can we best communicate a treatment policy to a clinical expert? Formalize as the following game:
Us: Present expert with some state-action pairs
Expert: Predict the agent’s action in a new state, s’
Our goal: choose the state-action pairs so that the expert's predictions are as accurate as possible.
(Lage et al, IJCAI 2019)
Example 1: Gridworld
Given: [a set of the agent's state–action pairs]
What happens in states like: [a new state]?
Example 2: HIV Simulator
Given: [a set of the agent's state–action pairs]
What happens in states like: [a new state]?
Finding: Humans use different methods in different scenarios
HIV Gridworld
…and it’s important to account for it!
Gridworld / HIV
Predicted by Model
Random
Offering Options
In Progress: Displaying Diverse Alternatives
(Masood et al, IJCAI 2019)
If policies can’t be statistically differentiated, share all the options.
Start
Goal
Applied to Hypotension Management
Example for a single decision point
Reward Design
In Progress: IRL to Identify Rewards
(Lee and Srinivasan et al, IJCAI 2019; Srinivasan, in submission)
Going Forward
RL in the health space is tricky, but has potential in several settings. Let’s
● Think holistically about how RL can provide value in a human-agent system.
● Be careful with analyses but not turn away from messy problems!
Collaborators: Sonali Parbhoo, Maurizio Zazzi, Volker Roth, Xuefeng Peng, David Wihl, Yi Ding, Omer Gottesman, Liwei Lehman, Matthieu Komorowski, Aldo Faisal, David Sontag, Fredrik Johansson, Leo Celi, Aniruddh Raghu, Yao Liu, Emma Brunskill, and the CS282 2017 Course
Modeling Improvement #2: Personalizing to Patient Dynamics

Assume that there exists some small latent vector that would allow us to personalize to the patient's dynamics (HiP-MDP).
Killian et al. NIPS 2017; Yao et al. ICML LLARLA workshop 2018
[Diagram: two patients share the transition structure s → s′ under action a, but with different hidden parameters: T(s′|s,a,θ) vs. T(s′|s,a,θ′).]
Consider two planning approaches:
1. Plan given T(s′|s,a,θ)
2. Directly learn a policy a = π(s,θ)
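Approach 2 can be sketched as a policy that simply takes the inferred latent θ as an extra input (the linear scoring below is an illustrative stand-in for the papers' function approximators):

```python
import numpy as np

def hip_mdp_policy(state, theta, weights):
    """a = pi(s, theta): score each action with a linear function of the
    state concatenated with the patient-specific latent parameters theta,
    and act greedily. A different theta can mean a different treatment
    for the same observed state."""
    features = np.concatenate([state, theta])
    scores = weights @ features  # one row of weights per action
    return int(np.argmax(scores))
```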
Results with a (simple) HIV simulator
Off-policy Evaluation Challenges: Sensitive to Algorithm Choices
Sepsis: Neural networks definitely not calibrated...
kNN is more calibrated
Calibration helps
In Progress: Displaying Diverse Alternatives
(Masood et al, IJCAI 2019)
If policies can’t be statistically differentiated, give plausible alternatives.
SODA-RL Applied to Hypotension Management
Quantitative Results: Safety and quality are important to consider