
Trust Region Policy Optimization (TRPO)

Value Iteration

• This is similar to what Q-Learning does, the main difference being that we might not know the actual expected reward, and instead explore the world and use discounted rewards to model our value function.

(Diagram: model-based vs. model-free approaches)


• Once we have Q(s, a), we can find the optimal policy π* using:

  π*(s) = argmax_a Q(s, a)
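As a concrete illustration of the bullets above, here is a minimal sketch, assuming a small, randomly generated MDP; the array names P and R and all sizes are illustrative. It runs Q-value iteration and then extracts the greedy policy π*(s) = argmax_a Q(s, a).

import numpy as np

# Minimal sketch: Q-value iteration on a small, randomly generated MDP (illustrative only).
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] transition probabilities
R = rng.normal(size=(n_states, n_actions))                        # R[s, a] expected immediate reward

Q = np.zeros((n_states, n_actions))
for _ in range(500):
    V = Q.max(axis=1)            # V(s) = max_a Q(s, a)
    Q = R + gamma * (P @ V)      # Bellman optimality backup

pi_star = Q.argmax(axis=1)       # greedy policy: pi*(s) = argmax_a Q(s, a)
print(pi_star)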

Policy Iteration

• We can directly optimize in the policy space.

• The policy space is smaller than the Q-function space.
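To make "optimizing directly in the policy space" concrete, here is a minimal sketch (not the TRPO update itself) of a tabular softmax policy parameterized by θ, updated with a single REINFORCE-style gradient ascent step; the sampled transition and all names are illustrative.

import numpy as np

# Minimal sketch of direct policy-space optimization (illustrative, not the TRPO update).
n_states, n_actions, lr = 4, 2, 0.1
theta = np.zeros((n_states, n_actions))           # policy parameters

def pi(theta, s):
    z = np.exp(theta[s] - theta[s].max())         # softmax policy pi_theta(a|s)
    return z / z.sum()

s, a, G = 1, 0, 2.5                               # pretend we observed (state, action, return) from a rollout
probs = pi(theta, s)
grad_log = -probs                                 # gradient of log pi_theta(a|s) w.r.t. theta[s] (softmax)
grad_log[a] += 1.0
theta[s] += lr * G * grad_log                     # gradient ascent on the expected return
print(pi(theta, s))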

Preliminaries

• The following identity expresses the expected return of another policy π̃ in terms of the advantage over π, accumulated over timesteps:

  η(π̃) = η(π) + E_{s_0, a_0, …∼π̃} [ Σ_t γ^t A_π(s_t, a_t) ]
        = η(π) + Σ_s ρ_π̃(s) Σ_a π̃(a|s) A_π(s, a)

Where A_π is the advantage function:

  A_π(s, a) = Q_π(s, a) − V_π(s)

And ρ_π is the (discounted) visitation frequency of states under policy π:

  ρ_π(s) = P(s_0 = s) + γ P(s_1 = s) + γ^2 P(s_2 = s) + …

Preliminaries

• To remove the complexity due to ρ_π̃ (the dependence on the new policy's state visitation frequencies), the following local approximation is introduced:

  L_π(π̃) = η(π) + Σ_s ρ_π(s) Σ_a π̃(a|s) A_π(s, a)

• If we have a parameterized policy π_θ, where π_θ(a|s) is a differentiable function of the parameter vector θ, then L_{π_θ} matches η to first order, i.e.:

  L_{π_θ0}(π_θ0) = η(π_θ0),   ∇_θ L_{π_θ0}(π_θ) |_{θ=θ0} = ∇_θ η(π_θ) |_{θ=θ0}

• This implies that a sufficiently small step that improves L_{π_θ0} will also improve η, but it does not give us any guidance on how big of a step to take.

• To address this issue, Kakade & Langford (2002) proposed conservative policy iteration:

  π_new(a|s) = (1 − α) π_old(a|s) + α π′(a|s)

where,

  π′ = argmax_{π′} L_{π_old}(π′)

• They derived the following lower bound:

  η(π_new) ≥ L_{π_old}(π_new) − (2εγ / (1 − γ)^2) α^2,   where ε = max_s | E_{a∼π′(·|s)} [ A_π(s, a) ] |
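A tiny numeric sketch of the mixture update at a single state, with made-up action probabilities, just to show that the conservative step is still a valid distribution and moves only slightly toward π′:

import numpy as np

# Conservative (mixture) policy update at one state (made-up probabilities).
alpha = 0.1
pi_old   = np.array([0.7, 0.2, 0.1])   # pi_old(.|s)
pi_prime = np.array([0.1, 0.1, 0.8])   # pi'(.|s), the maximizer of L_{pi_old}
pi_new = (1 - alpha) * pi_old + alpha * pi_prime
print(pi_new, pi_new.sum())            # still sums to 1; only a small shift toward pi'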

Preliminaries

• Computationally, this α-coupling means that if we randomly choose a seed for our random number generator, and then we sample from each of π and π_new after setting that seed, the results will agree for at least a fraction 1 − α of seeds.

• Thus α can be considered as a measure of disagreement between π and π_new.
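Here π and π_new form an α-coupled pair, i.e. they can be sampled jointly so that P(a ≠ a_new | s) ≤ α at every state. Below is a small sketch of the shared-seed interpretation, using inverse-CDF sampling driven by the same uniform draw for both policies; the action probabilities are made up.

import numpy as np

# Shared-seed sampling from two action distributions at one state (made-up probabilities).
pi_old = np.array([0.7, 0.2, 0.1])
pi_new = np.array([0.6, 0.3, 0.1])

def sample(p, u):
    # Inverse-CDF sampling: the same uniform draw u plays the role of the shared seed.
    return int(np.searchsorted(np.cumsum(p), u))

us = np.random.default_rng(0).random(100_000)
disagree = np.mean([sample(pi_old, u) != sample(pi_new, u) for u in us])
print(disagree)   # ~0.1 here: the two policies pick different actions for about 10% of seeds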

Theorem 1

• The previous result was applicable to mixture policies only. Schulman showed that it can be extended to general stochastic policies by using a distance measure called Total Variation divergence between π and π̃:

  D_TV(p ‖ q) = (1/2) Σ_i | p_i − q_i |,   for discrete probability distributions p, q

• Let D_TV^max(π, π̃) = max_s D_TV( π(·|s) ‖ π̃(·|s) )

• They proved that for α = D_TV^max(π_old, π_new), the following result holds:

  η(π_new) ≥ L_{π_old}(π_new) − (4εγ / (1 − γ)^2) α^2,   where ε = max_{s,a} | A_π(s, a) |

• Note the following relation between Total Variation and Kullback–Leibler divergence:

  D_TV(p ‖ q)^2 ≤ D_KL(p ‖ q)
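A quick numeric check of this relation on a pair of made-up discrete distributions:

import numpy as np

# Check D_TV(p||q)^2 <= D_KL(p||q) for two made-up discrete distributions.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
d_tv = 0.5 * np.abs(p - q).sum()
d_kl = np.sum(p * np.log(p / q))
print(d_tv**2, d_kl, d_tv**2 <= d_kl)   # 0.04 <= ~0.085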

• Thus the bounding condition becomes:

  η(π̃) ≥ L_π(π̃) − C · D_KL^max(π, π̃),   where C = 4εγ / (1 − γ)^2

Algorithm 1
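For reference, the approximate policy iteration scheme this slide refers to (Algorithm 1 of the TRPO paper), which guarantees a non-decreasing expected return η, can be summarized as:

• Initialize π_0.
• For i = 0, 1, 2, … until convergence: compute all advantage values A_{π_i}(s, a), then solve
  π_{i+1} = argmax_π [ L_{π_i}(π) − C · D_KL^max(π_i, π) ],   where C = 4εγ / (1 − γ)^2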

Trust Region Policy Optimization

• For parameterized policies with parameter vector θ, we are guaranteed to improve the true objective by performing the following maximization:

  maximize_θ [ L_{θ_old}(θ) − C · D_KL^max(θ_old, θ) ]

• However, using the penalty coefficient C as above results in very small step sizes. One way to take larger steps in a robust way is to use a constraint on the KL divergence between the new policy and the old policy, i.e., a trust region constraint:

  maximize_θ L_{θ_old}(θ)   subject to   D_KL^max(θ_old, θ) ≤ δ

Trust Region Policy Optimization

• The constraint above bounds the KL divergence at every point in state space, which is not practical. We can use the following heuristic approximation, the average KL divergence under the old policy's state visitation distribution:

  D_KL^avg(θ_old, θ) = E_{s∼ρ_{θ_old}} [ D_KL( π_{θ_old}(·|s) ‖ π_θ(·|s) ) ]

• Thus, the optimization problem becomes:

  maximize_θ L_{θ_old}(θ)   subject to   D_KL^avg(θ_old, θ) ≤ δ

Trust Region Policy Optimization

• In terms of expectations, the previous optimization problem can be written as:

  maximize_θ E_{s∼ρ_{θ_old}, a∼q} [ (π_θ(a|s) / q(a|s)) · Q_{θ_old}(s, a) ]
  subject to   E_{s∼ρ_{θ_old}} [ D_KL( π_{θ_old}(·|s) ‖ π_θ(·|s) ) ] ≤ δ

where q denotes the sampling distribution.

• This sampling distribution can be obtained in two ways (a sketch of the sampled estimates follows the list below):

Ø a) Single Path Method
Ø b) Vine Method
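Here is an illustrative sketch of how the sampled objective and the average-KL constraint from the expectation form above could be estimated from a batch of collected state-action pairs; all arrays are made-up stand-ins for quantities gathered by either procedure.

import numpy as np

# Sample-based estimates of the surrogate objective and mean-KL constraint (made-up data).
rng = np.random.default_rng(0)
N, n_actions, delta = 5, 3, 0.01
q_prob      = rng.uniform(0.1, 0.9, N)          # q(a_i | s_i): sampling distribution
pi_new_prob = rng.uniform(0.1, 0.9, N)          # pi_theta(a_i | s_i): candidate new policy
Q_mc        = rng.normal(size=N)                # Monte Carlo estimates of Q_{theta_old}(s_i, a_i)

surrogate = np.mean(pi_new_prob / q_prob * Q_mc)            # importance-sampled objective estimate

pi_old_full = rng.dirichlet(np.ones(n_actions), size=N)     # pi_{theta_old}(.|s_i)
pi_new_full = rng.dirichlet(np.ones(n_actions), size=N)     # pi_theta(.|s_i)
mean_kl = np.mean(np.sum(pi_old_full * np.log(pi_old_full / pi_new_full), axis=1))

print(surrogate, mean_kl, mean_kl <= delta)                 # a step is acceptable only if mean KL <= delta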

Final Algorithm

• Step 1: Use the single path or vine procedures to collect a set of state-action pairs along with Monte Carlo estimates of their Q-values.

• Step 2: By averaging over samples, construct the estimated objective and constraint in Equation (14) (the expectation form above).

• Step 3: Approximately solve this constrained optimization problem to update the policy's parameter vector θ.