GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values

Shangtong Zhang¹, Bo Liu², Shimon Whiteson¹

¹ University of Oxford, ² Auburn University

Preview

• Off-policy evaluation with density ratio learning

• Use the Perron-Frobenius theorem to reduce the number of constraints from three to two by removing the positiveness constraint, which makes the problem convex in both the tabular and linear settings

• A special weighted norm

• Improvements over DualDICE and GenDICE in tabular, linear and neural network settings

Off-policy evaluation estimates the performance of a policy from off-policy data

• The target policy $\pi$

• A data set $\{s_i, a_i, r_i, s'_i\}_{i=1,\dots,N}$ with $s_i, a_i \sim d_\mu(s, a)$, $r_i = r(s_i, a_i)$, $s'_i \sim p(\cdot \mid s_i, a_i)$

• The performance metric $\rho_\gamma(\pi) \doteq \sum_{s,a} d_\gamma(s, a) r(s, a)$, where

$d_\gamma(s, a) \doteq (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(S_t = s, A_t = a \mid \pi, p)$ ($\gamma < 1$)

$d_\gamma(s, a) \doteq \lim_{t \to \infty} \Pr(S_t = s, A_t = a \mid \pi, p)$ ($\gamma = 1$)
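For a small tabular problem, both definitions of the state-action distribution can be evaluated directly. The sketch below is illustrative only: the random MDP, the uniform initial state distribution, and the helper names `make_mdp` and `d_gamma` are assumptions, not part of the slides. For γ < 1 it sums the discounted series in closed form using the state-action transition matrix of the target policy, and for γ = 1 it solves for the stationary distribution.

```python
import numpy as np

def make_mdp(n_s=3, n_a=2, seed=0):
    """Random tabular MDP (hypothetical example, not from the slides)."""
    rng = np.random.default_rng(seed)
    p = rng.random((n_s, n_a, n_s)) + 0.1            # p[s, a, s'] > 0
    p /= p.sum(axis=2, keepdims=True)
    pi = rng.random((n_s, n_a)) + 0.1                # pi[s, a] > 0
    pi /= pi.sum(axis=1, keepdims=True)
    r = rng.random(n_s * n_a)                        # r[(s, a)] in [0, 1)
    # State-action transitions: P[(s,a),(s',a')] = p(s'|s,a) * pi(a'|s')
    P = (p[:, :, :, None] * pi[None, None, :, :]).reshape(n_s * n_a, n_s * n_a)
    # Initial state-action distribution, assuming a uniform initial state
    mu0 = (np.full((n_s, 1), 1.0 / n_s) * pi).reshape(-1)
    return P, mu0, r

def d_gamma(P, mu0, gamma):
    n = len(mu0)
    if gamma < 1:
        # (1 - gamma) * sum_t gamma^t (P^T)^t mu0, summed in closed form
        return (1 - gamma) * np.linalg.solve(np.eye(n) - gamma * P.T, mu0)
    # gamma = 1: stationary distribution, d = P^T d with sum(d) = 1
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.append(np.zeros(n), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

P, mu0, r = make_mdp()
for gamma in (0.9, 1.0):
    d = d_gamma(P, mu0, gamma)
    print(gamma, d @ r)   # rho_gamma(pi) = sum_{s,a} d_gamma(s,a) r(s,a)
```

The γ = 1 branch relies on the chain being ergodic, which holds here because every transition probability in the toy MDP is strictly positive.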

Density ratio learning is promising for off-policy evaluation (Liu et al, 2018)

• Learn $\tau_*(s, a) \doteq \frac{d_\gamma(s, a)}{d_\mu(s, a)}$ with function approximation

$\rho_\gamma(\pi) = \sum_{s,a} d_\mu(s, a) \tau_*(s, a) r(s, a) \approx \frac{1}{N} \sum_{i=1}^{N} \tau_*(s_i, a_i) r_i$
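The reweighting identity above can be checked numerically. This is a minimal sketch with made-up distributions and rewards over four state-action pairs (all values are assumptions chosen for illustration): the exact weighted sum equals the target performance, and the sample average over data drawn from the data distribution approximates it.

```python
import numpy as np

# Hypothetical 4-element state-action space (values chosen for illustration)
d_gamma = np.array([0.1, 0.2, 0.3, 0.4])    # target distribution d_gamma(s, a)
d_mu = np.array([0.25, 0.25, 0.25, 0.25])   # data distribution d_mu(s, a)
r = np.array([1.0, 0.0, 2.0, 1.0])          # reward r(s, a)

tau_star = d_gamma / d_mu                    # density ratio tau_*(s, a)
rho_exact = np.sum(d_mu * tau_star * r)      # equals sum_{s,a} d_gamma * r

# Draw N state-action pairs from d_mu and reweight their rewards by tau_*
rng = np.random.default_rng(0)
N = 100_000
idx = rng.choice(len(d_mu), size=N, p=d_mu)
rho_hat = np.mean(tau_star[idx] * r[idx])

print(rho_exact, rho_hat)
```

In practice the ratio is unknown and has to be learned, which is what the remainder of the talk addresses.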

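Density ratio satisfies a Bellman-like equation (Zhang et al, 2020)

$D\tau_* = (1 - \gamma)\mu_0 + \gamma P_\pi^\top D\tau_*$

• $D \in \mathbb{R}^{N_{sa} \times N_{sa}}$, $D \doteq \text{diag}(d_\mu)$

• $\tau_* \in \mathbb{R}^{N_{sa}}$

• $\mu_0 \in \mathbb{R}^{N_{sa}}$, $\mu_0(s, a) \doteq \mu_0(s)\pi(a \mid s)$

• $P_\pi \in \mathbb{R}^{N_{sa} \times N_{sa}}$, $P_\pi((s, a), (s', a')) \doteq p(s' \mid s, a)\pi(a' \mid s')$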

$\gamma < 1$ is easy as it implies a unique solution

$D\tau = (1 - \gamma)\mu_0 + \gamma P_\pi^\top D\tau$

• $(I - \gamma P_\pi^\top)^{-1}$ exists
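Because the inverse exists for γ < 1, the Bellman-like equation can be solved in one linear solve when the model is known. The sketch below assumes a hypothetical random MDP, target policy, and data distribution (none of them from the slides) and checks that the recovered ratio satisfies the equation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a, gamma = 3, 2, 0.9
n = n_s * n_a

# Hypothetical random MDP, target policy, and data distribution
p = rng.random((n_s, n_a, n_s)); p /= p.sum(axis=2, keepdims=True)
pi = rng.random((n_s, n_a)); pi /= pi.sum(axis=1, keepdims=True)
P_pi = (p[:, :, :, None] * pi[None, None]).reshape(n, n)
mu0 = (np.full((n_s, 1), 1.0 / n_s) * pi).reshape(-1)
d_mu = rng.random(n) + 0.1; d_mu /= d_mu.sum()
D = np.diag(d_mu)

# D tau = (1 - gamma) mu0 + gamma P_pi^T D tau
# => D tau = (1 - gamma) (I - gamma P_pi^T)^{-1} mu0
D_tau = (1 - gamma) * np.linalg.solve(np.eye(n) - gamma * P_pi.T, mu0)
tau = D_tau / d_mu

residual = D @ tau - ((1 - gamma) * mu0 + gamma * P_pi.T @ D @ tau)
print(np.abs(residual).max(), D_tau.sum())
```

The solution is automatically normalized: the entries of `D_tau` sum to one because the columns of the transposed transition matrix each sum to one.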

Previous work requires three constraints for $\gamma = 1$

1. $D\tau = P_\pi^\top D\tau$

2. $D\tau \succ 0$

3. $\mathbf{1}^\top D\tau = 1$

GenDICE (Zhang et al, 2020) considers 1 & 3 explicitly

$L(\tau) \doteq \text{divergence}(D\tau, P_\pi^\top D\tau) + (1 - \mathbf{1}^\top D\tau)^2$

and implements 2 with positive function approximation (e.g. $\tau^2$, $e^\tau$), projected SGD, or stochastic mirror descent

Mousavi et al. (2020) implement 3 with self-normalization over all state-action pairs

The objective becomes non-convex with positive function approximation or self-normalization, even in the tabular or linear setting.

Projected SGD is computationally infeasible.

Stochastic mirror descent significantly reduces the capacity of the (linear) function class.

We actually need only two constraints!

1. $D\tau = P_\pi^\top D\tau$

2. $D\tau \succ 0$

3. $\mathbf{1}^\top D\tau = 1$

Perron-Frobenius theorem: the solution space of 1 is one-dimensional

Either 2 or 3 is enough to guarantee a unique solution
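The one-dimensional solution space can be observed numerically. The following sketch assumes a hypothetical MDP with strictly positive transition probabilities (so the state-action chain is irreducible) and an arbitrary positive data distribution. It counts the near-zero singular values of the linear system in constraint 1, then applies only the normalization of constraint 3 and inspects the sign of the result.

```python
import numpy as np

rng = np.random.default_rng(2)
n_s, n_a = 3, 2
n = n_s * n_a

# Hypothetical MDP with strictly positive transitions, so P_pi is irreducible
p = rng.random((n_s, n_a, n_s)) + 0.1; p /= p.sum(axis=2, keepdims=True)
pi = rng.random((n_s, n_a)) + 0.1; pi /= pi.sum(axis=1, keepdims=True)
P_pi = (p[:, :, :, None] * pi[None, None]).reshape(n, n)
d_mu = rng.random(n) + 0.1; d_mu /= d_mu.sum()

# Constraint 1: (P_pi^T - I) x = 0 with x = D tau
_, sing_vals, Vt = np.linalg.svd(P_pi.T - np.eye(n))
null_dim = int(np.sum(sing_vals < 1e-10))   # dimension of the solution space
x = Vt[-1]                                  # right singular vector spanning it

# Constraint 3: rescale so that 1^T D tau = 1
x = x / x.sum()
tau = x / d_mu

print(null_dim, (x > 0).all())
```

In this toy example the normalized solution comes out entrywise positive without constraint 2 being imposed, which is the behavior the slide relies on.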

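GradientDICE considers a special $L_2$ norm for the loss

• GenDICE: $L(\tau) \doteq \text{divergence}((1 - \gamma)\mu_0 + \gamma P_\pi^\top D\tau, D\tau) + (1 - \mathbf{1}^\top D\tau)^2$ subject to $D\tau \succ 0$

• GradientDICE: $L(\tau) \doteq \| (1 - \gamma)\mu_0 + \gamma P_\pi^\top D\tau - D\tau \|_{D^{-1}} + (1 - \mathbf{1}^\top D\tau)^2$

• GradientTD loss: $\| \dots \|_D$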
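To make the convexity claim concrete, the sketch below evaluates a tabular version of the GradientDICE loss and minimizes it without any positivity constraint. Several things here are assumptions rather than the authors' method: the weighted norm is taken in its squared form so that the loss is a sum of squares, the penalty weight `lam` is a hypothetical parameter, the MDP is random, and the minimizer is found with ordinary least squares on a known model instead of a stochastic algorithm on sampled data.

```python
import numpy as np

def gradient_dice_loss(tau, P_pi, mu0, d_mu, gamma, lam=1.0):
    """Squared D^{-1}-weighted residual plus normalization penalty (assumed form)."""
    delta = (1 - gamma) * mu0 + gamma * P_pi.T @ (d_mu * tau) - d_mu * tau
    return delta @ (delta / d_mu) + lam * (1 - d_mu @ tau) ** 2

def minimize_loss(P_pi, mu0, d_mu, gamma, lam=1.0):
    """The loss is a sum of squares that is linear in tau, so least squares minimizes it."""
    n = len(d_mu)
    w = 1.0 / np.sqrt(d_mu)                                   # diagonal of D^{-1/2}
    A = w[:, None] * ((np.eye(n) - gamma * P_pi.T) * d_mu[None, :])
    b = w * (1 - gamma) * mu0
    A = np.vstack([A, np.sqrt(lam) * d_mu[None, :]])          # normalization row
    b = np.append(b, np.sqrt(lam))
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Hypothetical random MDP, target policy, and data distribution
rng = np.random.default_rng(3)
n_s, n_a = 3, 2
n = n_s * n_a
p = rng.random((n_s, n_a, n_s)) + 0.1; p /= p.sum(axis=2, keepdims=True)
pi = rng.random((n_s, n_a)) + 0.1; pi /= pi.sum(axis=1, keepdims=True)
P_pi = (p[:, :, :, None] * pi[None, None]).reshape(n, n)
mu0 = (np.full((n_s, 1), 1.0 / n_s) * pi).reshape(-1)
d_mu = rng.random(n) + 0.1; d_mu /= d_mu.sum()

for gamma in (0.9, 1.0):
    tau = minimize_loss(P_pi, mu0, d_mu, gamma)
    print(gamma, gradient_dice_loss(tau, P_pi, mu0, d_mu, gamma), tau.min())
```

The same code path handles γ < 1 and γ = 1, since the normalization row supplies the missing rank when γ = 1.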

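GradientDICE considers a special $L_2$ norm for the loss

• Convergence in both tabular and linear setting with $\gamma \in [0, 1]$

$\min_{\tau \in \mathbb{R}^{N_{sa}}} \max_{f \in \mathbb{R}^{N_{sa}}, \eta \in \mathbb{R}} L(\tau, \eta, f) \doteq (1 - \gamma)\mathbb{E}_{\mu_0}[f(s, a)] + \gamma\mathbb{E}_p[\tau(s, a) f(s', a')] - \mathbb{E}_{d_\mu}[\tau(s, a) f(s, a)] - \frac{1}{2}\mathbb{E}_{d_\mu}[f(s, a)^2] + \lambda\left(\mathbb{E}_{d_\mu}[\eta\tau(s, a) - \eta] - \frac{\eta^2}{2}\right)$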
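The link between the saddle-point objective and the weighted-norm loss can be verified in the tabular case. In the sketch below the expectations are written as sums over a known, hypothetical random MDP, and the values of `tau`, `gamma`, and `lam` are arbitrary assumptions. For a fixed ratio, the inner maximization has the closed-form solution f = D⁻¹δ and η = 1ᵀDτ − 1, where δ is the residual of the Bellman-like equation, and the maximum equals half the squared D⁻¹-weighted norm of δ plus λ/2 times the squared normalization error.

```python
import numpy as np

def saddle_objective(tau, eta, f, P_pi, mu0, d_mu, gamma, lam):
    """L(tau, eta, f) with the expectations written as tabular sums."""
    return ((1 - gamma) * mu0 @ f
            + gamma * (d_mu * tau) @ (P_pi @ f)   # E_p[tau(s,a) f(s',a')]
            - d_mu @ (tau * f)                    # E_dmu[tau(s,a) f(s,a)]
            - 0.5 * d_mu @ f ** 2                 # (1/2) E_dmu[f(s,a)^2]
            + lam * (eta * (d_mu @ tau) - eta - 0.5 * eta ** 2))

# Hypothetical random MDP, target policy, data distribution, and ratio
rng = np.random.default_rng(4)
n_s, n_a, gamma, lam = 3, 2, 0.9, 1.0
n = n_s * n_a
p = rng.random((n_s, n_a, n_s)); p /= p.sum(axis=2, keepdims=True)
pi = rng.random((n_s, n_a)); pi /= pi.sum(axis=1, keepdims=True)
P_pi = (p[:, :, :, None] * pi[None, None]).reshape(n, n)
mu0 = (np.full((n_s, 1), 1.0 / n_s) * pi).reshape(-1)
d_mu = rng.random(n) + 0.1; d_mu /= d_mu.sum()
tau = rng.random(n) + 0.5                         # arbitrary, not the solution

# Closed-form inner maximizers for this fixed tau
delta = (1 - gamma) * mu0 + gamma * P_pi.T @ (d_mu * tau) - d_mu * tau
f_star = delta / d_mu
eta_star = d_mu @ tau - 1

inner_max = saddle_objective(tau, eta_star, f_star, P_pi, mu0, d_mu, gamma, lam)
primal = 0.5 * delta @ (delta / d_mu) + 0.5 * lam * (1 - d_mu @ tau) ** 2
print(inner_max, primal)
```

Since the objective is linear in the ratio and strictly concave in f and η, the problem is convex-concave with no constraint on the sign of the ratio.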

GradientDICE outperforms baselines in Boyan’s Chain (Tabular)

• 30 runs (mean + standard errors)

• Grid search for hyperparameters, e.g., learning rates from $\{4^{-6}, 4^{-5}, \dots, 4^{-1}\}$

• Tuned to minimize final prediction error

GradientDICE outperforms baselines in Boyan’s Chain (Linear)

• Grid search for hyperparameters, e.g., learning rates from $\{4^{-6}, 4^{-5}, \dots, 4^{-1}\}$

GradientDICE outperforms baselines in Reacher-v2 (Network)

• Grid search for hyperparameters, e.g., learning rates from $\{0.01, 0.005, 0.001\}$ and penalty from $\{0.1, 1\}$

Thanks

• Code and Dockerfile are available at https://github.com/ShangtongZhang/DeepRL
