
Safe Reinforcement Learning

Philip S. Thomas

Stanford CS234: Reinforcement Learning, Guest Lecture

May 24, 2017

Lecture overview

• What makes a reinforcement learning algorithm safe?

• Notation

• Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)

• Empirical results

• Research directions

What does it mean for a reinforcement learning algorithm to be safe?

Changing the objective

[Figure: rewards of −50, 0, and +20 along two paths through a grid, comparing Policy 1 and Policy 2.]

Changing the objective

• Policy 1:
  • Reward = 0 with probability 0.999999
  • Reward = 10⁹ with probability 1 − 0.999999
  • Expected reward approximately 1000 (worked out below)
• Policy 2:
  • Reward = 999 with probability 0.5
  • Reward = 1000 with probability 0.5
  • Expected reward 999.5
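Working out the two expectations:

E[R(Policy 1)] = 0.999999 × 0 + 10⁻⁶ × 10⁹ = 1000,  E[R(Policy 2)] = 0.5 × 999 + 0.5 × 1000 = 999.5

So an agent maximizing expected reward prefers Policy 1, even though Policy 2 pays roughly 1000 almost every time; a risk-sensitive ("changed") objective would prefer Policy 2.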

Another notion of safety

Another notion of safety (Munos et al.)

The Problem

• If you apply an existing method, do you have confidence that it will work?

Reinforcement learning successes

A property of many real applications

• Deploying “bad” policies can be costly or dangerous.

Deploying bad policies can be costly

Deploying bad policies can be dangerous

What property should a safe algorithm have?

• Guaranteed to work on the first try
  • "I guarantee that with probability at least 1 − 𝛿, I will not change your policy to one that is worse than the current policy."
  • You get to choose 𝛿
  • This guarantee is not contingent on the tuning of any hyperparameters

Lecture overview

• What makes a reinforcement learning algorithm safe?

• Notation

• Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)

• Empirical results

• Research directions

Notation

• Policy, 𝜋: $\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$

• History: $H = (s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_L, a_L, r_L)$

• Historical data: $D = \{H_1, H_2, \ldots, H_n\}$

• Historical data from behavior policy, $\pi_b$

• Objective:

$$J(\pi) = \mathbf{E}\!\left[\sum_{t=1}^{L} \gamma^t R_t \,\middle|\, \pi\right]$$

[Diagram: the agent–environment loop; the agent sends an action, 𝑎, to the environment, which returns a state, 𝑠, and a reward, 𝑟.]

Safe reinforcement learning algorithm

• Reinforcement learning algorithm, 𝑎

• Historical data, 𝐷, which is a random variable

• Policy produced by the algorithm, 𝑎(𝐷), which is a random variable

• A safe reinforcement learning algorithm, $a$, satisfies:

$$\Pr\big(J(a(D)) \ge J(\pi_b)\big) \ge 1 - \delta$$

or, in general:

$$\Pr\big(J(a(D)) \ge J_{\min}\big) \ge 1 - \delta$$

Lecture overview

• What makes a reinforcement learning algorithm safe?

• Notation

• Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)

• Empirical results

• Research directions

Creating a safe reinforcement learning algorithm

• Off-policy policy evaluation (OPE)
  • For any evaluation policy, 𝜋e, convert historical data, 𝐷, into 𝑛 independent and unbiased estimates of 𝐽(𝜋e)
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the 𝑛 independent and unbiased estimates of 𝐽(𝜋e) into a 1 − 𝛿 confidence lower bound on 𝐽(𝜋e)
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, 𝑎

Off-policy policy evaluation (OPE)

[Diagram: OPE takes historical data, 𝐷, and a proposed policy, 𝜋e, and produces an estimate of 𝐽(𝜋e).]

Importance Sampling (Intuition)

[Figure: the probability of each history under the evaluation policy, 𝜋e, versus under the behavior policy, 𝜋b.]

A naive Monte Carlo average of the observed returns estimates the behavior policy's performance rather than 𝐽(𝜋e):

$$\hat{J}(\pi_b) = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{L}\gamma^t R_t^i$$

Importance sampling reweights each history by how much more likely it is under 𝜋e than under 𝜋b:

$$\hat{J}(\pi_e) = \frac{1}{n}\sum_{i=1}^{n} w_i \sum_{t=1}^{L}\gamma^t R_t^i$$

Math Slide 2/3

• Reminder:
  • History, $H = (s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_L, a_L, r_L)$
  • Objective, $J(\pi_e) = \mathbf{E}\!\left[\sum_{t=1}^{L}\gamma^t R_t \,\middle|\, \pi_e\right]$

$$w_i = \frac{\Pr(H_i \mid \pi_e)}{\Pr(H_i \mid \pi_b)} = \prod_{t=1}^{L}\frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}$$
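To make the estimator concrete, here is a minimal Python sketch (my own illustration, not from the lecture). It assumes each history is stored as a list of (pi_e_prob, pi_b_prob, reward) tuples, i.e., the two action probabilities and the reward at each step; the function name is made up.

```python
import numpy as np

def importance_sampling_estimate(histories, gamma=1.0):
    """Ordinary importance sampling estimate of J(pi_e).

    Each history is assumed to be a list of (pi_e_prob, pi_b_prob, reward)
    tuples for time steps t = 1, ..., L.
    Returns the IS estimate and the n per-history weighted returns.
    """
    estimates = []
    for history in histories:
        weight = 1.0
        discounted_return = 0.0
        for t, (pi_e_prob, pi_b_prob, reward) in enumerate(history, start=1):
            weight *= pi_e_prob / pi_b_prob           # w_i = prod_t pi_e / pi_b
            discounted_return += gamma ** t * reward  # sum_t gamma^t R_t
        estimates.append(weight * discounted_return)
    estimates = np.array(estimates)
    return float(estimates.mean()), estimates
```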

Importance sampling (History)

• Kahn, H., Marshall, A. W. (1953). Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America, 1(5):263–278.
  • Let 𝑋 = 0 with probability 1 − 10⁻¹⁰ and 𝑋 = 10¹⁰ with probability 10⁻¹⁰
  • 𝐄[𝑋] = 1
  • A Monte Carlo estimate from 𝑛 ≪ 10¹⁰ samples of 𝑋 is almost always zero
  • Idea: sample 𝑋 from some other distribution and use importance sampling to "correct" the estimate
• Can produce lower-variance estimates.
  • Josiah Hanna et al., ICML 2017 (to appear).
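A small simulation (my own illustration, not from the slides) of the Kahn and Marshall example: naive Monte Carlo with 𝑛 ≪ 10¹⁰ samples essentially never sees the rare outcome, while sampling from a proposal that picks it half the time and reweighting by p(x)/q(x) recovers 𝐄[𝑋] ≈ 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
p_rare = 1e-10          # P(X = 1e10) under the target distribution p
big = 1e10

# Naive Monte Carlo: with n << 1e10 samples the rare value is
# essentially never drawn, so the estimate is almost always 0.
naive = np.where(rng.random(n) < p_rare, big, 0.0).mean()

# Importance sampling: draw from a proposal q that picks the rare
# value half the time, then reweight each sample by p(x)/q(x).
q_rare = 0.5
x = np.where(rng.random(n) < q_rare, big, 0.0)
w = np.where(x == big, p_rare / q_rare, (1 - p_rare) / (1 - q_rare))
is_estimate = (w * x).mean()

print(naive)        # almost surely 0.0
print(is_estimate)  # close to E[X] = 1
```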

Importance sampling (History, continued)

• Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann

Importance sampling (Proof)

• Estimate $\mathbf{E}_p[f(X)]$ given a sample of $X \sim q$
• Let $P = \operatorname{supp}(p)$, $Q = \operatorname{supp}(q)$, and $F = \operatorname{supp}(f)$
• Importance sampling estimate: $\dfrac{p(X)}{q(X)}\,f(X)$

$$\mathbf{E}_q\!\left[\frac{p(X)}{q(X)}f(X)\right]
= \sum_{x\in Q} q(x)\,\frac{p(x)}{q(x)}\,f(x)
= \sum_{x\in P\cap Q} p(x)\,f(x)
= \sum_{x\in P} p(x)\,f(x) \;-\; \sum_{x\in P\cap \bar{Q}} p(x)\,f(x)$$

Importance sampling (Proof)

• Assume $P \subseteq Q$ (the assumption can be relaxed to $P \subseteq Q \cup \bar{F}$)
• Importance sampling is an unbiased estimator of $\mathbf{E}_p[f(X)]$:

$$\mathbf{E}_q\!\left[\frac{p(X)}{q(X)}f(X)\right]
= \sum_{x\in P} p(x)\,f(x) \;-\; \underbrace{\sum_{x\in P\cap \bar{Q}} p(x)\,f(x)}_{=\,0}
= \sum_{x\in P} p(x)\,f(x)
= \mathbf{E}_p[f(X)]$$

Importance sampling (Proof)

• Assume $f(x) \ge 0$ for all $x$
• Importance sampling is a negative-bias estimator of $\mathbf{E}_p[f(X)]$:

$$\mathbf{E}_q\!\left[\frac{p(X)}{q(X)}f(X)\right]
= \sum_{x\in P} p(x)\,f(x) \;-\; \sum_{x\in P\cap \bar{Q}} p(x)\,f(x)
\;\le\; \sum_{x\in P} p(x)\,f(x)
= \mathbf{E}_p[f(X)]$$

Importance sampling (reminder)

$$\text{IS}(D) = \frac{1}{n}\sum_{i=1}^{n}\left(\prod_{t=1}^{L}\frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)}\right)\sum_{t=1}^{L}\gamma^t R_t^i$$

$$\mathbf{E}[\text{IS}(D)] = J(\pi_e)$$

Creating a safe reinforcement learning algorithm

• Off-policy policy evaluation (OPE)
  • For any evaluation policy, 𝜋e, convert historical data, 𝐷, into 𝑛 independent and unbiased estimates of 𝐽(𝜋e)
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the 𝑛 independent and unbiased estimates of 𝐽(𝜋e) into a 1 − 𝛿 confidence lower bound on 𝐽(𝜋e)
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, 𝑎

High confidence off-policy policy evaluation (HCOPE)

[Diagram: HCOPE takes historical data, 𝐷, a proposed policy, 𝜋e, and a probability, 1 − 𝛿, and produces a 1 − 𝛿 confidence lower bound on 𝐽(𝜋e).]

Hoeffding's inequality (Math Slide 3/3)

• Let $X_1, \ldots, X_n$ be $n$ independent, identically distributed random variables such that $X_i \in [0, b]$
• Then with probability at least $1 - \delta$:

$$\mathbf{E}[X_i] \;\ge\; \frac{1}{n}\sum_{i=1}^{n} X_i \;-\; b\sqrt{\frac{\ln(1/\delta)}{2n}}$$

• For HCOPE, each $X_i$ is an importance-weighted return, so the bound is applied to

$$\frac{1}{n}\sum_{i=1}^{n} w_i \sum_{t=1}^{L}\gamma^t R_t^i$$
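A minimal sketch of the HCOPE step under these assumptions: feed the 𝑛 importance-weighted returns (e.g., the per-history estimates from the IS sketch earlier) and an upper bound 𝑏 into Hoeffding's inequality. The function name and interface are illustrative, not the lecture's code.

```python
import numpy as np

def hoeffding_lower_bound(X, b, delta=0.05):
    """1 - delta confidence lower bound on E[X_i] via Hoeffding's
    inequality, for i.i.d. samples X_i in [0, b]."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    return X.mean() - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n))

# Example usage (hypothetical), reusing the earlier IS sketch:
#   _, weighted_returns = importance_sampling_estimate(histories, gamma)
#   lb = hoeffding_lower_bound(weighted_returns, b=some_upper_bound, delta=0.05)
```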

Creating a safe reinforcement learning algorithm

• Off-policy policy evaluation (OPE)
  • For any evaluation policy, 𝜋e, convert historical data, 𝐷, into 𝑛 independent and unbiased estimates of 𝐽(𝜋e)
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the 𝑛 independent and unbiased estimates of 𝐽(𝜋e) into a 1 − 𝛿 confidence lower bound on 𝐽(𝜋e)
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, 𝑎

Safe policy improvement (SPI)

[Diagram: SPI takes historical data, 𝐷, and a probability, 1 − 𝛿, and returns either a new policy, 𝜋, or "No Solution Found".]

[Diagram: the historical data is split into a training set (20%), used to select a candidate policy, 𝜋, and a testing set (80%), used for the safety test.]

Safe policy improvement (SPI)

Is the 1 − 𝛿 confidence lower bound on 𝐽(𝜋) larger than 𝐽(𝜋cur)? (A sketch of this procedure follows below.)
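A hedged sketch of that SPI template: split the data, pick a candidate policy on the training split, then run a high-confidence safety test on the held-out split. Here `propose_candidate` (candidate policy search) and `evaluate_lower_bound` (an HCOPE routine such as the Hoeffding sketch earlier) are placeholders supplied by the user, not the lecture's actual implementation.

```python
import numpy as np

NO_SOLUTION_FOUND = None

def safe_policy_improvement(histories, j_current, propose_candidate,
                            evaluate_lower_bound, delta=0.05, train_frac=0.2):
    """Return a candidate policy only if its 1 - delta confidence lower
    bound (computed on held-out data) beats the current policy's
    performance; otherwise report No Solution Found."""
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(histories))
    n_train = int(train_frac * len(histories))
    train = [histories[i] for i in idx[:n_train]]   # 20%: pick the candidate
    test = [histories[i] for i in idx[n_train:]]    # 80%: safety test

    candidate = propose_candidate(train)
    lower_bound = evaluate_lower_bound(candidate, test, delta)

    return candidate if lower_bound >= j_current else NO_SOLUTION_FOUND
```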

Creating a safe reinforcement learning algorithm

• Off-policy policy evaluation (OPE)
  • For any evaluation policy, 𝜋e, convert historical data, 𝐷, into 𝑛 independent and unbiased estimates of 𝐽(𝜋e)
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the 𝑛 independent and unbiased estimates of 𝐽(𝜋e) into a 1 − 𝛿 confidence lower bound on 𝐽(𝜋e)
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, 𝑎

WON'T WORK (as stated: the estimators and the concentration inequality both need to be improved, which the following sections revisit)

Off-policy policy evaluation (revisited)

• Importance sampling (IS):

$$\text{IS}(D) = \frac{1}{n}\sum_{i=1}^{n}\left(\prod_{t=1}^{L}\frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)}\right)\sum_{t=1}^{L}\gamma^t R_t^i$$

• Per-decision importance sampling (PDIS):

$$\text{PDIS}(D) = \sum_{t=1}^{L}\gamma^t\,\frac{1}{n}\sum_{i=1}^{n}\left(\prod_{\tau=1}^{t}\frac{\pi_e(a_\tau^i \mid s_\tau^i)}{\pi_b(a_\tau^i \mid s_\tau^i)}\right) R_t^i$$
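For comparison with the IS sketch earlier, a per-decision variant (same assumed history format; illustrative code, not the lecture's): each reward is weighted only by the importance ratios of the actions taken up to that time step.

```python
import numpy as np

def per_decision_is_estimate(histories, gamma=1.0):
    """Per-decision importance sampling (PDIS) estimate of J(pi_e).

    Histories use the same (pi_e_prob, pi_b_prob, reward) format
    as the IS sketch above.
    """
    estimates = []
    for history in histories:
        weight = 1.0
        total = 0.0
        for t, (pi_e_prob, pi_b_prob, reward) in enumerate(history, start=1):
            weight *= pi_e_prob / pi_b_prob       # prod_{tau <= t} pi_e / pi_b
            total += (gamma ** t) * weight * reward
        estimates.append(total)
    return float(np.mean(estimates))
```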

Off-policy policy evaluation (revisited)

• Importance sampling (IS):

$$\text{IS}(D) = \frac{1}{n}\sum_{i=1}^{n} w_i \sum_{t=1}^{L}\gamma^t R_t^i$$

• Weighted importance sampling (WIS):

$$\text{WIS}(D) = \frac{1}{\sum_{i=1}^{n} w_i}\sum_{i=1}^{n} w_i \sum_{t=1}^{L}\gamma^t R_t^i$$

Off-policy policy evaluation (revisited)

• Weighted importance sampling (WIS):

$$\text{WIS}(D) = \frac{1}{\sum_{i=1}^{n} w_i}\sum_{i=1}^{n} w_i \sum_{t=1}^{L}\gamma^t R_t^i$$

• Not unbiased. When 𝑛 = 1, 𝐄[WIS(𝐷)] = 𝐽(𝜋b)
• Strongly consistent estimator of 𝐽(𝜋e), i.e.,

$$\Pr\!\left(\lim_{n\to\infty}\text{WIS}(D) = J(\pi_e)\right) = 1$$

  • if: finite horizon, one behavior policy, and bounded rewards
• (A WIS code sketch follows below.)
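A corresponding WIS sketch under the same assumed history format (illustrative, not the lecture's code); note the normalization by the summed weights rather than by 𝑛.

```python
import numpy as np

def weighted_is_estimate(histories, gamma=1.0):
    """Weighted importance sampling (WIS): normalize by the sum of the
    per-history weights instead of by n. Biased, but strongly consistent
    and typically much lower variance than ordinary IS."""
    weights, returns = [], []
    for history in histories:
        w, g = 1.0, 0.0
        for t, (pi_e_prob, pi_b_prob, reward) in enumerate(history, start=1):
            w *= pi_e_prob / pi_b_prob
            g += gamma ** t * reward
        weights.append(w)
        returns.append(g)
    weights, returns = np.array(weights), np.array(returns)
    return float(np.sum(weights * returns) / np.sum(weights))
```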

Off-policy policy evaluation (revisited)

• Weighted per-decision importance sampling
  • Also called consistent weighted per-decision importance sampling (CWPDIS)
  • Deriving it is a fun exercise!

Control variates

• Given: $X$
• Estimate: $\mu = \mathbf{E}[X]$
• Estimator: $\hat{\mu} = X$
• Unbiased: $\mathbf{E}[\hat{\mu}] = \mathbf{E}[X] = \mu$
• Variance: $\operatorname{Var}(\hat{\mu}) = \operatorname{Var}(X)$

Control variates

• Given: $X$, $Y$, and $\mathbf{E}[Y]$
• Estimate: $\mu = \mathbf{E}[X]$
• Estimator: $\hat{\mu} = X - Y + \mathbf{E}[Y]$
• Unbiased: $\mathbf{E}[\hat{\mu}] = \mathbf{E}\big[X - Y + \mathbf{E}[Y]\big] = \mathbf{E}[X] - \mathbf{E}[Y] + \mathbf{E}[Y] = \mathbf{E}[X] = \mu$
• Variance: $\operatorname{Var}(\hat{\mu}) = \operatorname{Var}\big(X - Y + \mathbf{E}[Y]\big) = \operatorname{Var}(X - Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) - 2\operatorname{Cov}(X, Y)$
• Lower variance than $\hat{\mu} = X$ if $2\operatorname{Cov}(X, Y) > \operatorname{Var}(Y)$
• We call $Y$ a control variate.
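A tiny simulation (my own toy example, not from the slides) showing the variance reduction: 𝑋 and 𝑌 are correlated Gaussians with 𝐄[𝑌] known, and the control-variate estimator has the same mean but a much smaller variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy setup: Y ~ N(0, 1) with known E[Y] = 0, and X = Y + noise,
# so X and Y are strongly correlated and E[X] = 0.5.
Y = rng.normal(0.0, 1.0, n)
X = Y + rng.normal(0.5, 0.5, n)

plain = X                 # mu_hat = X
cv = X - Y + 0.0          # mu_hat = X - Y + E[Y], with E[Y] = 0

print(plain.mean(), cv.mean())  # both close to 0.5 (unbiased)
print(plain.var(), cv.var())    # control variate: much smaller variance
```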

Off-policy policy evaluation (revisited)

• Idea: add a control variate to importance sampling estimators
  • $X$ is the importance sampling estimator
  • $Y$ is a control variate built from an approximate model of the MDP
  • $\mathbf{E}[Y] = 0$ in this case
• PDISCV(𝐷) = PDIS(𝐷) − CV(𝐷)
• Called the doubly robust (DR) estimator (Jiang and Li, 2015)
  • Robust to (1) a poor approximate model and (2) error in estimates of 𝜋b
  • If the model is poor, the estimates are still unbiased
  • If the sampling policy is unknown, but the model is good, MSE will still be low
• DR(𝐷) = PDISCV(𝐷)
• Non-recursive and weighted forms, as well as the control-variate view, are provided by Thomas and Brunskill (2016)

Off-policy policy evaluation (revisited)

$$\text{DR}(\pi_e \mid \mathcal{D}) = \frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{L}\gamma^t\Big(w_t^i\big(R_t^i - \hat q^{\pi_e}(S_t^i, A_t^i)\big) + w_{t-1}^i\,\hat v^{\pi_e}(S_t^i)\Big)$$

• Recall: we want the control variate, $Y$, to cancel with $X$: $R - \hat q(S, A) + \gamma \hat v(S')$
• Here $w_t^i$ is the per-decision importance sampling (PDIS) weight, with $w_0^i = 1$:

$$w_t^i = \prod_{\tau=1}^{t}\frac{\pi_e(a_\tau^i \mid s_\tau^i)}{\pi_b(a_\tau^i \mid s_\tau^i)}$$
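A sketch of a DR-style estimator matching the formula above, assuming histories are stored as (state, action, reward, pi_e_prob, pi_b_prob) tuples and that `q_hat(s, a)` and `v_hat(s)` come from an approximate model of 𝜋e's value functions. This is an illustrative reading of the estimator, not the authors' reference implementation.

```python
import numpy as np

def doubly_robust_estimate(histories, q_hat, v_hat, gamma=1.0):
    """Doubly robust (DR) sketch: per-decision importance sampling plus a
    control variate built from approximate value functions q_hat and v_hat."""
    estimates = []
    for history in histories:
        prev_weight = 1.0   # w_{t-1}, initialized to w_0 = 1
        total = 0.0
        for t, (s, a, r, pi_e_prob, pi_b_prob) in enumerate(history, start=1):
            weight = prev_weight * pi_e_prob / pi_b_prob   # w_t
            # gamma^t * ( w_t (R_t - q_hat(S_t, A_t)) + w_{t-1} v_hat(S_t) )
            total += (gamma ** t) * (weight * (r - q_hat(s, a))
                                     + prev_weight * v_hat(s))
            prev_weight = weight
        estimates.append(total)
    return float(np.mean(estimates))
```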

Empirical Results (Gridworld)

[Plot: mean squared error (log scale, 0.001 to 10,000) vs. number of episodes, n (2 to 2,000), comparing IS and AM. AM = approximate model: direct method (Dudík, 2011); indirect method (Sutton and Barto, 1998).]

Empirical Results (Gridworld)

[Same plot, adding PDIS.]

Empirical Results (Gridworld)

[Same plot, adding DR.]

Empirical Results (Gridworld)

[Same plot, adding WIS and CWPDIS.]

Empirical Results (Gridworld)

[Same plot, adding WDR.]

Off-policy policy evaluation (revisited)

• What if supp(𝜋e) ⊂ supp(𝜋b)?
  • That is, there is a state–action pair, (𝑠, 𝑎), such that 𝜋e(𝑎|𝑠) = 0 but 𝜋b(𝑎|𝑠) ≠ 0
• If we see a history where (𝑠, 𝑎) occurs, what weight should we give it?

$$\text{IS}(D) = \frac{1}{n}\sum_{i=1}^{n}\left(\prod_{t=1}^{L}\frac{\pi_e(a_t^i \mid s_t^i)}{\pi_b(a_t^i \mid s_t^i)}\right)\sum_{t=1}^{L}\gamma^t R_t^i$$

Off-policy policy evaluation (revisited)

• What if there are zero samples (𝑛 = 0)?
  • The importance sampling estimate is undefined
• What if no samples are in supp(𝜋e) (or supp(𝑝) in general)?
  • Importance sampling says: the estimate is zero
  • Alternate approach: undefined
• The importance sampling estimator is unbiased if 𝑛 > 0
• The alternate approach is unbiased given that at least one sample is in the support of 𝑝
• The alternate approach is detailed in Importance Sampling with Unequal Support (Thomas and Brunskill, AAAI 2017)

Off-policy policy evaluation (revisited)

• Thomas et al., Predictive Off-Policy Policy Evaluation for Nonstationary Decision Problems, with Applications to Digital Marketing (AAAI 2017)

Off-policy policy evaluation (revisited)

• Thomas and Brunskill. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning (ICML 2016)

[Plot: gridworld mean squared error (log scale, 0.001 to 10,000) vs. number of episodes, n (2 to 2,000), for IS, PDIS, WIS, CWPDIS, DR, AM, WDR, and MAGIC.]

[Plot: mean squared error (log scale, 0.01 to 10) vs. number of episodes, n (1 to 10,000), for IS, DR, AM, WDR, and MAGIC.]

Creating a safe reinforcement learning algorithm

• Off-policy policy evaluation (OPE)
  • For any evaluation policy, 𝜋e, convert historical data, 𝐷, into 𝑛 independent and unbiased estimates of 𝐽(𝜋e)
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the 𝑛 independent and unbiased estimates of 𝐽(𝜋e) into a 1 − 𝛿 confidence lower bound on 𝐽(𝜋e)
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, 𝑎

High-confidence off-policy policy evaluation (revisited) • Consider using IS + Hoeffding’s inequality for HCOPE on mountain car

Natural Temporal Difference Learning, Dabney and Thomas, 2014

High-confidence off-policy policy evaluation (revisited)

• Using 100,000 trajectories
• The evaluation policy's true performance is 0.19 ∈ [0, 1]
• We get a 95% confidence lower bound of: −5,831,000

What went wrong?

$$w_i = \prod_{t=1}^{L}\frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}$$

• The importance weights are products of many ratios, so the range of the importance-weighted returns can be astronomically large.

What went wrong?

$$\mathbf{E}[X_i] \;\ge\; \frac{1}{n}\sum_{i=1}^{n} X_i \;-\; b\sqrt{\frac{\ln(1/\delta)}{2n}}$$

• $b \approx 10^{9.4}$
• Largest observed importance-weighted return: 316.

High-confidence off-policy policy evaluation (revisited)

• Removing the upper tail only decreases the expected value.

High-confidence off-policy policy evaluation (revisited)

• Thomas et al., High Confidence Off-Policy Evaluation, AAAI 2015

High-confidence off-policy policy evaluation (revisited)

High-confidence off-policy policy evaluation (revisited)

• Use 20% of the data to optimize 𝑐.
• Use 80% to compute the lower bound with the optimized 𝑐.
• Mountain car results: 95% confidence lower bound on the mean
  • CUT: 0.145
  • Chernoff-Hoeffding: −5,831,000
  • Maurer: −129,703
  • Anderson: 0.055
  • Bubeck et al.: −0.046

High-confidence off-policy policy evaluation (revisited)

• Digital Marketing:

High-confidence off-policy policy evaluation (revisited)

• Cognitive dissonance:

$$\mathbf{E}[X_i] \;\ge\; \frac{1}{n}\sum_{i=1}^{n} X_i \;-\; b\sqrt{\frac{\ln(1/\delta)}{2n}}$$

High-confidence off-policy policy evaluation (revisited)

• Student's 𝑡-test (sketched below)
  • Assumes that IS(𝐷) is normally distributed
  • By the central limit theorem, it is (in the limit as 𝑛 → ∞)
• Efron's bootstrap methods (e.g., BCa)
  • Also, without importance sampling: Hanna, Stone, and Niekum, AAMAS 2017

High-confidence off-policy policy evaluation (revisited)

P. S. Thomas. Safe reinforcement learning (PhD Thesis, 2015)

Creating a safe reinforcement learning algorithm

• Off-policy policy evaluation (OPE)
  • For any evaluation policy, 𝜋e, convert historical data, 𝐷, into 𝑛 independent and unbiased estimates of 𝐽(𝜋e)
• High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the 𝑛 independent and unbiased estimates of 𝐽(𝜋e) into a 1 − 𝛿 confidence lower bound on 𝐽(𝜋e)
• Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe reinforcement learning algorithm, 𝑎

Safe policy improvement (revisited)

[Diagram: the historical data is split into a training set (20%), used to select a candidate policy, 𝜋, and a testing set (80%), used for the safety test.]

Is the 1 − 𝛿 confidence lower bound on 𝐽(𝜋) larger than 𝐽(𝜋cur)?

• Thomas et al., ICML 2015

Lecture overview

• What makes a reinforcement learning algorithm safe?

• Notation

• Creating a safe reinforcement learning algorithm
  • Off-policy policy evaluation (OPE)
  • High-confidence off-policy policy evaluation (HCOPE)
  • Safe policy improvement (SPI)

• Empirical results

• Research directions

Empirical Results

• What to look for:
  • Data efficiency
  • Error rates

Empirical Results: Mountain Car

Empirical Results: Digital Marketing

[Diagram: the agent–environment loop for the digital marketing domain (action 𝑎, state 𝑠, reward 𝑟).]

[Bar chart: expected normalized return (roughly 0.002715 to 0.003832) for n = 10,000; 30,000; 60,000; and 100,000 episodes, comparing four variants: (None, CUT), (None, BCa), (k-fold, CUT), and (k-fold, BCa).]

Example Results: Diabetes Treatment

[Diagram: blood glucose (sugar) rises when the person eats carbohydrates and falls when insulin is released; hyperglycemia is blood glucose that is too high, and hypoglycemia is blood glucose that is too low.]

Example Results: Diabetes Treatment

$$\text{injection} = \frac{\text{blood glucose} - \text{target blood glucose}}{CF} + \frac{\text{meal size}}{CR}$$

(𝐶𝐹 is the correction factor and 𝐶𝑅 is the carbohydrate-to-insulin ratio; a code sketch of this policy follows below.)
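A direct transcription of that injection policy as a function (illustrative; 𝐶𝐹 and 𝐶𝑅 are the per-patient parameters, treated here as plain numbers).

```python
def insulin_injection(blood_glucose, target_blood_glucose, meal_size, CF, CR):
    """Parameterized injection policy from the slide: a correction term
    plus a meal term. CF and CR are the parameters to be tuned safely."""
    return (blood_glucose - target_blood_glucose) / CF + meal_size / CR
```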

Example Results: Diabetes Treatment

Intelligent Diabetes Management

Example Results: Diabetes Treatment

[Plots: the probability that the policy was changed, and the probability that the policy was made worse.]

Future Directions

• How to deal with long horizons?

• How to deal with importance sampling being “unfair”?

• What to do when the behavior policy is not known?

• What to do when the behavior policy is deterministic?

Summary

• Safe reinforcement learning can mean:
  • Risk-sensitive objectives
  • Learning from demonstration
  • Asymptotic convergence even if data is off-policy
  • Guaranteed (with probability 1 − 𝛿) not to make the policy worse (the notion used here)
• Designing a safe reinforcement learning algorithm:
  • Off-policy policy evaluation (OPE)
    • IS, PDIS, WIS, WPDIS, DR, WDR, US, TSP, MAGIC
  • High-confidence off-policy policy evaluation (HCOPE)
    • Hoeffding, CUT inequality, Student's 𝑡-test, BCa
  • Safe policy improvement (SPI)
    • Selecting the candidate policy

Takeaway

• Safe reinforcement learning is tractable!
  • Not just polynomial amounts of data, but practical amounts of data