Statistical Reinforcement Learning: Modern Machine Learning Approaches
Transcript of Masashi Sugiyama, "Statistical Reinforcement Learning: Modern Machine Learning Approaches" (Chapman and Hall/CRC, 2015).
STATISTICAL REINFORCEMENT LEARNING
Modern Machine Learning Approaches

Chapman & Hall/CRC
Machine Learning & Pattern Recognition Series

SERIES EDITORS
Ralf Herbrich, Amazon Development Center, Berlin, Germany
Thore Graepel, Microsoft Research Ltd., Cambridge, UK
AIMS AND SCOPE
This series reflects the latest advances and applications in machine learning and pattern recognition through the publication of a broad range of reference works, textbooks, and handbooks. The inclusion of concrete examples, applications, and methods is highly encouraged. The scope of the series includes, but is not limited to, titles in the areas of machine learning, pattern recognition, computational intelligence, robotics, computational/statistical learning theory, natural language processing, computer vision, game AI, game theory, neural networks, computational neuroscience, and other relevant topics, such as machine learning applied to bioinformatics or cognitive science, which might be proposed by potential contributors.
PUBLISHED TITLES

BAYESIAN PROGRAMMING
Pierre Bessière, Emmanuel Mazer, Juan-Manuel Ahuactzin, and Kamel Mekhnacha

UTILITY-BASED LEARNING FROM DATA
Craig Friedman and Sven Sandow

HANDBOOK OF NATURAL LANGUAGE PROCESSING, SECOND EDITION
Nitin Indurkhya and Fred J. Damerau

COST-SENSITIVE MACHINE LEARNING
Balaji Krishnapuram, Shipeng Yu, and Bharat Rao

COMPUTATIONAL TRUST MODELS AND MACHINE LEARNING
Xin Liu, Anwitaman Datta, and Ee-Peng Lim

MULTILINEAR SUBSPACE LEARNING: DIMENSIONALITY REDUCTION OF MULTIDIMENSIONAL DATA
Haiping Lu, Konstantinos N. Plataniotis, and Anastasios N. Venetsanopoulos

MACHINE LEARNING: An Algorithmic Perspective, Second Edition
Stephen Marsland

SPARSE MODELING: THEORY, ALGORITHMS, AND APPLICATIONS
Irina Rish and Genady Ya. Grabarnik

A FIRST COURSE IN MACHINE LEARNING
Simon Rogers and Mark Girolami

STATISTICAL REINFORCEMENT LEARNING: MODERN MACHINE LEARNING APPROACHES
Masashi Sugiyama

MULTI-LABEL DIMENSIONALITY REDUCTION
Liang Sun, Shuiwang Ji, and Jieping Ye

REGULARIZATION, OPTIMIZATION, KERNELS, AND SUPPORT VECTOR MACHINES
Johan A. K. Suykens, Marco Signoretto, and Andreas Argyriou

ENSEMBLE METHODS: FOUNDATIONS AND ALGORITHMS
Zhi-Hua Zhou
Chapman & Hall/CRC
Machine Learning & Pattern Recognition Series

STATISTICAL REINFORCEMENT LEARNING
Modern Machine Learning Approaches

Masashi Sugiyama
University of Tokyo
Tokyo, Japan

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2015 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Version Date: 20150128
International Standard Book Number-13: 978-1-4398-5690-1 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com
Contents

Foreword
Preface
Author

Part I: Introduction

1 Introduction to Reinforcement Learning
  1.1 Reinforcement Learning
  1.2 Mathematical Formulation
  1.3 Structure of the Book
      1.3.1 Model-Free Policy Iteration
      1.3.2 Model-Free Policy Search
      1.3.3 Model-Based Reinforcement Learning

Part II: Model-Free Policy Iteration

2 Policy Iteration with Value Function Approximation
  2.1 Value Functions
      2.1.1 State Value Functions
      2.1.2 State-Action Value Functions
  2.2 Least-Squares Policy Iteration
      2.2.1 Immediate-Reward Regression
      2.2.2 Algorithm
      2.2.3 Regularization
      2.2.4 Model Selection
  2.3 Remarks

3 Basis Design for Value Function Approximation
  3.1 Gaussian Kernels on Graphs
      3.1.1 MDP-Induced Graph
      3.1.2 Ordinary Gaussian Kernels
      3.1.3 Geodesic Gaussian Kernels
      3.1.4 Extension to Continuous State Spaces
  3.2 Illustration
      3.2.1 Setup
      3.2.2 Geodesic Gaussian Kernels
      3.2.3 Ordinary Gaussian Kernels
      3.2.4 Graph-Laplacian Eigenbases
      3.2.5 Diffusion Wavelets
  3.3 Numerical Examples
      3.3.1 Robot-Arm Control
      3.3.2 Robot-Agent Navigation
  3.4 Remarks

4 Sample Reuse in Policy Iteration
  4.1 Formulation
  4.2 Off-Policy Value Function Approximation
      4.2.1 Episodic Importance Weighting
      4.2.2 Per-Decision Importance Weighting
      4.2.3 Adaptive Per-Decision Importance Weighting
      4.2.4 Illustration
  4.3 Automatic Selection of Flattening Parameter
      4.3.1 Importance-Weighted Cross-Validation
      4.3.2 Illustration
  4.4 Sample-Reuse Policy Iteration
      4.4.1 Algorithm
      4.4.2 Illustration
  4.5 Numerical Examples
      4.5.1 Inverted Pendulum
      4.5.2 Mountain Car
  4.6 Remarks

5 Active Learning in Policy Iteration
  5.1 Efficient Exploration with Active Learning
      5.1.1 Problem Setup
      5.1.2 Decomposition of Generalization Error
      5.1.3 Estimation of Generalization Error
      5.1.4 Designing Sampling Policies
      5.1.5 Illustration
  5.2 Active Policy Iteration
      5.2.1 Sample-Reuse Policy Iteration with Active Learning
      5.2.2 Illustration
  5.3 Numerical Examples
  5.4 Remarks

6 Robust Policy Iteration
  6.1 Robustness and Reliability in Policy Iteration
      6.1.1 Robustness
      6.1.2 Reliability
  6.2 Least Absolute Policy Iteration
      6.2.1 Algorithm
      6.2.2 Illustration
      6.2.3 Properties
  6.3 Numerical Examples
  6.4 Possible Extensions
      6.4.1 Huber Loss
      6.4.2 Pinball Loss
      6.4.3 Deadzone-Linear Loss
      6.4.4 Chebyshev Approximation
      6.4.5 Conditional Value-at-Risk
  6.5 Remarks

Part III: Model-Free Policy Search

7 Direct Policy Search by Gradient Ascent
  7.1 Formulation
  7.2 Gradient Approach
      7.2.1 Gradient Ascent
      7.2.2 Baseline Subtraction for Variance Reduction
      7.2.3 Variance Analysis of Gradient Estimators
  7.3 Natural Gradient Approach
      7.3.1 Natural Gradient Ascent
      7.3.2 Illustration
  7.4 Application in Computer Graphics: Artist Agent
      7.4.1 Sumie Painting
      7.4.2 Design of States, Actions, and Immediate Rewards
      7.4.3 Experimental Results
  7.5 Remarks

8 Direct Policy Search by Expectation-Maximization
  8.1 Expectation-Maximization Approach
  8.2 Sample Reuse
      8.2.1 Episodic Importance Weighting
      8.2.2 Per-Decision Importance Weighting
      8.2.3 Adaptive Per-Decision Importance Weighting
      8.2.4 Automatic Selection of Flattening Parameter
      8.2.5 Reward-Weighted Regression with Sample Reuse
  8.3 Numerical Examples
  8.4 Remarks

9 Policy-Prior Search
  9.1 Formulation
  9.2 Policy Gradients with Parameter-Based Exploration
      9.2.1 Policy-Prior Gradient Ascent
      9.2.2 Baseline Subtraction for Variance Reduction
      9.2.3 Variance Analysis of Gradient Estimators
      9.2.4 Numerical Examples
  9.3 Sample Reuse in Policy-Prior Search
      9.3.1 Importance Weighting
      9.3.2 Variance Reduction by Baseline Subtraction
      9.3.3 Numerical Examples
  9.4 Remarks

Part IV: Model-Based Reinforcement Learning

10 Transition Model Estimation
   10.1 Conditional Density Estimation
        10.1.1 Regression-Based Approach
        10.1.2 ε-Neighbor Kernel Density Estimation
        10.1.3 Least-Squares Conditional Density Estimation
   10.2 Model-Based Reinforcement Learning
   10.3 Numerical Examples
        10.3.1 Continuous Chain Walk
        10.3.2 Humanoid Robot Control
   10.4 Remarks

11 Dimensionality Reduction for Transition Model Estimation
   11.1 Sufficient Dimensionality Reduction
   11.2 Squared-Loss Conditional Entropy
        11.2.1 Conditional Independence
        11.2.2 Dimensionality Reduction with SCE
        11.2.3 Relation to Squared-Loss Mutual Information
   11.3 Numerical Examples
        11.3.1 Artificial and Benchmark Datasets
        11.3.2 Humanoid Robot
   11.4 Remarks

References
Index
Foreword

How can agents learn from experience without an omniscient teacher explicitly telling them what to do? Reinforcement learning is the area within machine learning that investigates how an agent can learn an optimal behavior by correlating generic reward signals with its past actions. The discipline draws upon and connects key ideas from behavioral psychology, economics, control theory, operations research, and other disparate fields to model the learning process. In reinforcement learning, the environment is typically modeled as a Markov decision process that provides immediate reward and state information to the agent. However, the agent does not have access to the transition structure of the environment and needs to learn how to choose appropriate actions to maximize its overall reward over time.

This book by Prof. Masashi Sugiyama covers the range of reinforcement learning algorithms from a fresh, modern perspective. With a focus on the statistical properties of estimating parameters for reinforcement learning, the book relates a number of different approaches across the gamut of learning scenarios. The algorithms are divided into model-free approaches that do not explicitly model the dynamics of the environment, and model-based approaches that construct descriptive process models for the environment. Within each of these categories, there are policy iteration algorithms which estimate value functions, and policy search algorithms which directly manipulate policy parameters.

For each of these different reinforcement learning scenarios, the book meticulously lays out the associated optimization problems. A careful analysis is given for each of these cases, with an emphasis on understanding the statistical properties of the resulting estimators and learned parameters. Each chapter contains illustrative examples of applications of these algorithms, with quantitative comparisons between the different techniques. These examples are drawn from a variety of practical problems, including robot motion control and Asian brush painting.

In summary, the book provides a thought-provoking statistical treatment of reinforcement learning algorithms, reflecting the author's work and sustained research in this area. It is a contemporary and welcome addition to the rapidly growing machine learning literature. Both beginner students and experienced researchers will find it to be an important source for understanding the latest reinforcement learning techniques.

Daniel D. Lee
GRASP Laboratory
School of Engineering and Applied Science
University of Pennsylvania, Philadelphia, PA, USA
Preface

In the coming big data era, statistics and machine learning are becoming indispensable tools for data mining. Depending on the type of data analysis, machine learning methods are categorized into three groups:

• Supervised learning: Given input-output paired data, the objective of supervised learning is to analyze the input-output relation behind the data. Typical tasks of supervised learning include regression (predicting the real value), classification (predicting the category), and ranking (predicting the order). Supervised learning is the most common form of data analysis and has been extensively studied in the statistics community for a long time. A recent trend of supervised learning research in the machine learning community is to utilize side information in addition to the input-output paired data to further improve the prediction accuracy. For example, semi-supervised learning utilizes additional input-only data, transfer learning borrows data from other similar learning tasks, and multi-task learning solves multiple related learning tasks simultaneously.

• Unsupervised learning: Given input-only data, the objective of unsupervised learning is to find something useful in the data. Due to this ambiguous definition, unsupervised learning research tends to be more ad hoc than supervised learning. Nevertheless, unsupervised learning is regarded as one of the most important tools in data mining because of its automatic and inexpensive nature. Typical tasks of unsupervised learning include clustering (grouping the data based on their similarity), density estimation (estimating the probability distribution behind the data), anomaly detection (removing outliers from the data), data visualization (reducing the dimensionality of the data to 1–3 dimensions), and blind source separation (extracting the original source signals from their mixtures). Also, unsupervised learning methods are sometimes used as data pre-processing tools in supervised learning.

• Reinforcement learning: Supervised learning is a sound approach, but collecting input-output paired data is often too expensive. Unsupervised learning is inexpensive to perform, but it tends to be ad hoc. Reinforcement learning is placed between supervised learning and unsupervised learning: no explicit supervision (output data) is provided, but we still want to learn the input-output relation behind the data. Instead of output data, reinforcement learning utilizes rewards, which evaluate the validity of predicted outputs. Giving implicit supervision such as rewards is usually much easier and less costly than giving explicit supervision, and therefore reinforcement learning can be a vital approach in modern data analysis. Various supervised and unsupervised learning techniques are also utilized in the framework of reinforcement learning.

This book is devoted to introducing fundamental concepts and practical algorithms of statistical reinforcement learning from the modern machine learning viewpoint. Various illustrative examples, mainly in robotics, are also provided to help understand the intuition and usefulness of reinforcement learning techniques. Target readers are graduate-level students in computer science and applied statistics as well as researchers and engineers in related fields. Basic knowledge of probability and statistics, linear algebra, and elementary calculus is assumed.

Machine learning is a rapidly developing area of science, and the author hopes that this book helps the reader grasp various exciting topics in reinforcement learning and stimulates readers' interest in machine learning. Please visit our Web site at http://www.ms.k.u-tokyo.ac.jp.

Masashi Sugiyama
University of Tokyo, Japan
Author

Masashi Sugiyama was born in Osaka, Japan, in 1974. He received Bachelor, Master, and Doctor of Engineering degrees in Computer Science from Tokyo Institute of Technology, Japan, in 1997, 1999, and 2001, respectively. In 2001, he was appointed Assistant Professor in the same institute, and he was promoted to Associate Professor in 2003. He moved to the University of Tokyo as Professor in 2014.

He received an Alexander von Humboldt Foundation Research Fellowship and researched at Fraunhofer Institute, Berlin, Germany, from 2003 to 2004. In 2006, he received a European Commission Program Erasmus Mundus Scholarship and researched at the University of Edinburgh, Scotland. He received the Faculty Award from IBM in 2007 for his contribution to machine learning under non-stationarity, the Nagao Special Researcher Award from the Information Processing Society of Japan in 2011, and the Young Scientists' Prize from the Commendation for Science and Technology by the Minister of Education, Culture, Sports, Science and Technology for his contribution to the density-ratio paradigm of machine learning.

His research interests include theories and algorithms of machine learning and data mining, and a wide range of applications such as signal processing, image processing, and robot control. He published Density Ratio Estimation in Machine Learning (Cambridge University Press, 2012) and Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation (MIT Press, 2012).

The author thanks his collaborators, Hirotaka Hachiya, Sethu Vijayakumar, Jan Peters, Jun Morimoto, Zhao Tingting, Ning Xie, Voot Tangkaratt, Tetsuro Morimura, and Norikazu Sugimoto, for exciting and creative discussions. He acknowledges support from MEXT KAKENHI 17700142, 18300057, 20680007, 23120004, 23300069, 25700022, and 26280054, the Okawa Foundation, EU Erasmus Mundus Fellowship, AOARD, SCAT, the JST PRESTO program, and the FIRST program.
Part I
Introduction
Chapter 1
Introduction to Reinforcement Learning

Reinforcement learning is aimed at controlling a computer agent so that a target task is achieved in an unknown environment.

In this chapter, we first give an informal overview of reinforcement learning in Section 1.1. Then we provide a more formal formulation of reinforcement learning in Section 1.2. Finally, the book is summarized in Section 1.3.
1.1 Reinforcement Learning

A schematic of reinforcement learning is given in Figure 1.1. In an unknown environment (e.g., in a maze), a computer agent (e.g., a robot) takes an action (e.g., to walk) based on its own control policy. Then its state is updated (e.g., by moving forward) and an evaluation of that action is given as a "reward" (e.g., praise, neutral, or scolding). Through such interaction with the environment, the agent is trained to achieve a certain task (e.g., getting out of the maze) without explicit guidance. A crucial advantage of reinforcement learning is its non-greedy nature. That is, the agent is trained not to improve performance in the short term (e.g., greedily approaching an exit of the maze), but to optimize the long-term achievement (e.g., successfully getting out of the maze).

A reinforcement learning problem contains various technical components such as states, actions, transitions, rewards, policies, and values. Before going into mathematical details (which will be provided in Section 1.2), we intuitively explain these concepts through illustrative reinforcement learning problems here.
Let us consider a maze problem (Figure 1.2), where a robot agent is located in a maze and we want to guide him to the goal without explicit supervision about which direction to go. States are positions in the maze which the robot agent can visit. In the example illustrated in Figure 1.3, there are 21 states in the maze. Actions are possible directions along which the robot agent can move. In the example illustrated in Figure 1.4, there are 4 actions, which correspond to movement toward the north, south, east, and west directions. States and actions are fundamental elements that define a reinforcement learning problem.

FIGURE 1.1: Reinforcement learning. [Schematic: the agent takes an action in the environment, and the environment returns a reward and an updated state.]
Transitions specify how states are connected to each other through actions (Figure 1.5). Thus, knowing the transitions intuitively means knowing the map of the maze. Rewards specify the incomes/costs that the robot agent receives when making a transition from one state to another by a certain action. In the case of the maze example, the robot agent receives a positive reward when it reaches the goal. More specifically, a positive reward is provided when making a transition from state 12 to state 17 by action "east" or from state 18 to state 17 by action "north" (Figure 1.6). Thus, knowing the rewards intuitively means knowing the location of the goal state. To emphasize the fact that a reward is given to the robot agent right after taking an action and making a transition to the next state, it is also referred to as an immediate reward.

Under the above setup, the goal of reinforcement learning is to find the policy for controlling the robot agent that allows it to receive the maximum amount of rewards in the long run. Here, a policy specifies an action the robot agent takes at each state (Figure 1.7). Through a policy, a series of states and actions that the robot agent takes from a start state to an end state is specified. Such a series is called a trajectory (see Figure 1.7 again). The sum of immediate rewards along a trajectory is called the return. In practice, rewards that can be obtained in the distant future are often discounted because receiving rewards earlier is regarded as more preferable. In the maze task, such a discounting strategy urges the robot agent to reach the goal as quickly as possible.

To find the optimal policy efficiently, it is useful to view the return as a function of the initial state. This is called the (state-)value. The values can be efficiently obtained via dynamic programming, which is a general method for solving a complex optimization problem by breaking it down into simpler subproblems recursively. With the hope that many subproblems are actually the same, dynamic programming solves such overlapped subproblems only once and reuses the solutions to reduce the computation costs.
In the maze problem, the value of a state can be computed from the values of neighboring states. For example, let us compute the value of state 7 (see Figure 1.5 again). From state 7, the robot agent can reach state 2, state 6, and state 8 by a single step. If the robot agent knows the values of these neighboring states, the best action the robot agent should take is to visit the neighboring state with the largest value, because this allows the robot agent to earn the largest amount of rewards in the long run. However, the values of neighboring states are unknown in practice and thus they should also be computed.

FIGURE 1.2: A maze problem. We want to guide the robot agent to the goal.

FIGURE 1.3: States are visitable positions in the maze (numbered 1 through 21).

FIGURE 1.4: Actions are possible movements of the robot agent (north, south, east, and west).

FIGURE 1.5: Transitions specify connections between states via actions. Thus, knowing the transitions means knowing the map of the maze.

FIGURE 1.6: A positive reward is given when the robot agent reaches the goal. Thus, the reward specifies the goal location.

FIGURE 1.7: A policy specifies an action the robot agent takes at each state. Thus, a policy also specifies a trajectory, which is a series of states and actions that the robot agent takes from a start state to an end state.

FIGURE 1.8: Values of each state when reward +1 is given at the goal state and the reward is discounted at the rate of 0.9 according to the number of steps.
Now, we need to solve 3 subproblems of computing the values of state 2, state 6, and state 8. Then, in the same way, these subproblems are further decomposed as follows:

• The problem of computing the value of state 2 is decomposed into 3 subproblems of computing the values of state 1, state 3, and state 7.

• The problem of computing the value of state 6 is decomposed into 2 subproblems of computing the values of state 1 and state 7.

• The problem of computing the value of state 8 is decomposed into 3 subproblems of computing the values of state 3, state 7, and state 9.

Thus, by removing overlaps, the original problem of computing the value of state 7 has been decomposed into 6 unique subproblems: computing the values of state 1, state 2, state 3, state 6, state 8, and state 9.

If we further continue this problem decomposition, we encounter the problem of computing the value of state 17, where the robot agent can receive reward +1. Then the values of state 12 and state 18 can be explicitly computed. Indeed, if the discounting factor (a multiplicative penalty for delayed rewards) is 0.9, the values of state 12 and state 18 are (0.9)^1 = 0.9. Then we can further know that the values of state 13 and state 19 are (0.9)^2 = 0.81. By repeating this procedure, we can compute the values of all states (as illustrated in Figure 1.8). Based on these values, we can know the optimal action the robot agent should take, i.e., an action that leads the robot agent to the neighboring state with the largest value.
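This backward propagation of values is easy to express in code. Below is a minimal sketch of the dynamic-programming computation described above; the `neighbors` adjacency map is a hypothetical fragment of the maze in Figure 1.5, not the full 21-state map.

```python
# Sketch of backward value propagation from the goal (discount rate 0.9,
# as in Figure 1.8). The adjacency map here is an illustrative fragment.
GAMMA = 0.9

neighbors = {          # neighbors[s]: states reachable from s in one step
    7:  [2, 6, 8, 13],
    12: [6, 13, 17],
    13: [7, 12, 14, 18],
    18: [13, 17, 19],
    17: [12, 18],      # goal state
}
GOAL = 17

def compute_values(neighbors, goal, n_sweeps=100):
    """Propagate values backward from the goal state."""
    V = {s: 0.0 for s in neighbors}
    V[goal] = 1.0                      # reward +1 is obtained at the goal
    for _ in range(n_sweeps):
        for s in neighbors:
            if s == goal:
                continue
            # value of s = discounted value of the best reachable neighbor
            V[s] = GAMMA * max(V.get(s2, 0.0) for s2 in neighbors[s])
    return V

V = compute_values(neighbors, GOAL)
print(V[12], V[18])  # both 0.9, as in the text
print(V[13])         # 0.81
```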
Note that, in real-world reinforcement learning tasks, transitions are often not deterministic but stochastic because of some external disturbance; in the case of the above maze example, the floor may be slippery and thus the robot agent cannot move as perfectly as it desires. Also, stochastic policies, in which the mapping from a state to an action is not deterministic, are often employed in many reinforcement learning formulations. In these cases, the formulation becomes slightly more complicated, but essentially the same idea can still be used for solving the problem.
To further highlight the notable advantage of reinforcement learning that not the immediate rewards but the long-term accumulation of rewards is maximized, let us consider a mountain-car problem (Figure 1.9). There are two mountains and a car is located in a valley between the mountains. The goal is to guide the car to the top of the right-hand hill. However, the engine of the car is not powerful enough to directly run up the right-hand hill and reach the goal. The optimal policy in this problem is to first climb the left-hand hill and then go down the slope to the right with full acceleration to get to the goal (Figure 1.10).

Suppose we define the immediate reward such that moving the car to the right gives a positive reward +1 and moving the car to the left gives a negative reward -1. Then, a greedy solution that maximizes the immediate reward moves the car to the right, which does not allow the car to get to the goal due to lack of engine power. On the other hand, reinforcement learning seeks a solution that maximizes the return, i.e., the discounted sum of immediate rewards that the agent can collect over the entire trajectory. This means that the reinforcement learning solution will first move the car to the left even though negative rewards are given for a while, to receive more positive rewards in the future. Thus, the notion of "prior investment" can be naturally incorporated in the reinforcement learning framework.
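A toy computation makes the contrast concrete. The reward sequences and the goal bonus below are made-up numbers for illustration; they are not taken from the book's mountain-car setup.

```python
# Toy illustration of the "prior investment" effect with made-up rewards:
# +1 per step to the right, -1 per step to the left, big bonus at the goal.
GAMMA = 0.99
bonus_at_goal = 100.0

# Greedy: always move right, never reach the goal, collect +1 for 20 steps.
greedy = [1.0] * 20
# Investment: move left for 5 steps (-1 each), then right for 14 steps
# (+1 each) and finally reach the goal, which yields the large bonus.
invest = [-1.0] * 5 + [1.0] * 14 + [bonus_at_goal]

def discounted_return(rewards):
    return sum(GAMMA**t * r for t, r in enumerate(rewards))

# The investment policy yields a much larger return than the greedy one.
print(discounted_return(greedy), discounted_return(invest))
```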
1.2 Mathematical Formulation

In this section, the reinforcement learning problem is mathematically formulated as the problem of controlling a computer agent under a Markov decision process.

We consider the problem of controlling a computer agent under a discrete-time Markov decision process (MDP). That is, at each discrete time step $t$, the agent observes a state $s_t \in \mathcal{S}$, selects an action $a_t \in \mathcal{A}$, makes a transition to $s_{t+1} \in \mathcal{S}$, and receives an immediate reward
$$r_t = r(s_t, a_t, s_{t+1}) \in \mathbb{R}.$$
$\mathcal{S}$ and $\mathcal{A}$ are called the state space and the action space, respectively. $r(s, a, s')$ is called the immediate reward function.

FIGURE 1.9: A mountain-car problem. We want to guide the car to the goal. However, the engine of the car is not powerful enough to directly run up the right-hand hill.

FIGURE 1.10: The optimal policy to reach the goal is to first climb the left-hand hill and then head for the right-hand hill with full acceleration.
The initial position of the agent, $s_1$, is drawn from the initial probability distribution. If the state space $\mathcal{S}$ is discrete, the initial probability distribution is specified by the probability mass function $P(s)$ such that
$$0 \le P(s) \le 1, \ \forall s \in \mathcal{S}, \qquad \sum_{s \in \mathcal{S}} P(s) = 1.$$
If the state space $\mathcal{S}$ is continuous, the initial probability distribution is specified by the probability density function $p(s)$ such that
$$p(s) \ge 0, \ \forall s \in \mathcal{S}, \qquad \int_{s \in \mathcal{S}} p(s) \, \mathrm{d}s = 1.$$
Because the probability mass function $P(s)$ can be expressed as a probability density function $p(s)$ by using the Dirac delta function¹ $\delta(s)$ as
$$p(s) = \sum_{s' \in \mathcal{S}} \delta(s' - s) P(s'),$$
we focus only on the continuous state space below.

The dynamics of the environment, which represent the transition probability from state $s$ to state $s'$ when action $a$ is taken, are characterized by the transition probability distribution with conditional probability density $p(s' \mid s, a)$:
$$p(s' \mid s, a) \ge 0, \ \forall s, s' \in \mathcal{S}, \forall a \in \mathcal{A}, \qquad \int_{s' \in \mathcal{S}} p(s' \mid s, a) \, \mathrm{d}s' = 1, \ \forall s \in \mathcal{S}, \forall a \in \mathcal{A}.$$
The agent's decision is determined by a policy $\pi$. When we consider a deterministic policy, where the action to take at each state is uniquely determined, we regard the policy as a function of states:
$$\pi(s) \in \mathcal{A}, \ \forall s \in \mathcal{S}.$$
Action $a$ can be either discrete or continuous. On the other hand, when developing more sophisticated reinforcement learning algorithms, it is often more convenient to consider a stochastic policy, where an action to take at a state is probabilistically determined. Mathematically, a stochastic policy is a conditional probability density of taking action $a$ at state $s$:
$$\pi(a \mid s) \ge 0, \ \forall s \in \mathcal{S}, \forall a \in \mathcal{A}, \qquad \int_{a \in \mathcal{A}} \pi(a \mid s) \, \mathrm{d}a = 1, \ \forall s \in \mathcal{S}.$$
By introducing stochasticity in action selection, we can more actively explore the entire state space. Note that when action $a$ is discrete, the stochastic policy is expressed using Dirac's delta function, as in the case of the state densities.

A sequence of states and actions obtained by the procedure described in Figure 1.11 is called a trajectory.

¹The Dirac delta function $\delta(\cdot)$ allows us to obtain the value of a function $f$ at a point $\tau$ via the convolution with $f$:
$$\int_{-\infty}^{\infty} f(s) \delta(s - \tau) \, \mathrm{d}s = f(\tau).$$
Dirac's delta function $\delta(\cdot)$ can be expressed as the Gaussian density with standard deviation $\sigma \to 0$:
$$\delta(a) = \lim_{\sigma \to 0} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{a^2}{2\sigma^2}\right).$$
FIGURE 1.11: Generation of a trajectory sample.
1. The initial state $s_1$ is chosen following the initial probability $p(s)$.
2. For $t = 1, \ldots, T$,
   (a) The action $a_t$ is chosen following the policy $\pi(a_t \mid s_t)$.
   (b) The next state $s_{t+1}$ is determined according to the transition probability $p(s_{t+1} \mid s_t, a_t)$.

When the number of steps, $T$, is finite or infinite, the situation is called the finite horizon or infinite horizon, respectively. Below, we focus on the finite-horizon case because the trajectory length is always finite in practice. We denote a trajectory by $h$ (which stands for a "history"):
$$h = [s_1, a_1, \ldots, s_T, a_T, s_{T+1}].$$
The discounted sum of immediate rewards along the trajectory $h$ is called the return:
$$R(h) = \sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t, s_{t+1}),$$
where $\gamma \in [0, 1)$ is called the discount factor for future rewards. The goal of reinforcement learning is to learn the optimal policy $\pi^*$ that maximizes the expected return:
$$\pi^* = \mathop{\mathrm{argmax}}_{\pi} \ \mathbb{E}_{p_\pi(h)}\big[R(h)\big],$$
where $\mathbb{E}_{p_\pi(h)}$ denotes the expectation over trajectory $h$ drawn from $p_\pi(h)$, and $p_\pi(h)$ denotes the probability density of observing trajectory $h$ under policy $\pi$:
$$p_\pi(h) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t) \pi(a_t \mid s_t).$$
"argmax" gives the maximizer of a function (Figure 1.12).
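As a minimal sketch, the trajectory-generation procedure of Figure 1.11 and the return $R(h)$ can be written as follows; the two-state MDP here (`P_init`, `P_trans`, `reward`, `policy`) is a made-up toy example, not from the book.

```python
import numpy as np

# Toy MDP: two states, two actions; all numbers are illustrative.
rng = np.random.default_rng(0)
GAMMA = 0.9
T = 10

P_init = np.array([0.5, 0.5])              # initial state distribution p(s)
P_trans = np.array([[[0.9, 0.1],           # p(s' | s=0, a=0)
                     [0.1, 0.9]],          # p(s' | s=0, a=1)
                    [[0.8, 0.2],           # p(s' | s=1, a=0)
                     [0.2, 0.8]]])         # p(s' | s=1, a=1)

def reward(s, a, s_next):                  # immediate reward r(s, a, s')
    return 1.0 if s_next == 1 else 0.0

def policy(s):                             # stochastic policy pi(a | s)
    return rng.choice(2, p=[0.5, 0.5])

def sample_trajectory():
    """Generate h = [s1, a1, ..., sT, aT, s_{T+1}] as in Figure 1.11."""
    s = rng.choice(2, p=P_init)
    h = []
    for t in range(T):
        a = policy(s)
        s_next = rng.choice(2, p=P_trans[s, a])
        h.append((s, a, s_next))
        s = s_next
    return h

def discounted_return(h):
    """R(h) = sum_t gamma^(t-1) r(s_t, a_t, s_{t+1})."""
    return sum(GAMMA**t * reward(s, a, sn) for t, (s, a, sn) in enumerate(h))

print(discounted_return(sample_trajectory()))
```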
For policy learning, various methods have been developed so far. These methods can be classified into model-based reinforcement learning and model-free reinforcement learning. The term "model" indicates a model of the transition probability $p(s' \mid s, a)$. In the model-based reinforcement learning approach, the transition probability is learned in advance and the learned transition model is explicitly used for policy learning. On the other hand, in the model-free reinforcement learning approach, policies are learned without explicitly estimating the transition probability. If strong prior knowledge of the transition model is available, the model-based approach would be more favorable. On the other hand, learning the transition model without prior knowledge itself is a hard statistical estimation problem. Thus, if good prior knowledge of the transition model is not available, the model-free approach would be more promising.

FIGURE 1.12: "argmax" gives the maximizer of a function, while "max" gives the maximum value of a function.
1.3 Structure of the Book

In this section, we explain the structure of this book, which covers major reinforcement learning approaches.
1.3.1 Model-Free Policy Iteration

Policy iteration is a popular and well-studied approach to reinforcement learning. The key idea of policy iteration is to determine policies based on the value function.

Let us first introduce the state-action value function $Q^\pi(s, a) \in \mathbb{R}$ for policy $\pi$, which is defined as the expected return the agent will receive when taking action $a$ at state $s$ and following policy $\pi$ thereafter:
$$Q^\pi(s, a) = \mathbb{E}_{p_\pi(h)}\big[R(h) \,\big|\, s_1 = s, a_1 = a\big],$$
where "$\mid s_1 = s, a_1 = a$" means that the initial state $s_1$ and the first action $a_1$ are fixed at $s_1 = s$ and $a_1 = a$, respectively. That is, the right-hand side of the above equation denotes the conditional expectation of $R(h)$ given $s_1 = s$ and $a_1 = a$.

Let $Q^*(s, a)$ be the optimal state-action value at state $s$ for action $a$ defined as
$$Q^*(s, a) = \max_\pi Q^\pi(s, a).$$
Based on the optimal state-action value function, the optimal action the agent should take at state $s$ is deterministically given as the maximizer of $Q^*(s, a)$ with respect to $a$. Thus, the optimal policy $\pi^*(a \mid s)$ is given by
$$\pi^*(a \mid s) = \delta\Big(a - \mathop{\mathrm{argmax}}_{a'} Q^*(s, a')\Big),$$
where $\delta(\cdot)$ denotes Dirac's delta function.

Because the optimal state-action value $Q^*$ is unknown in practice, the policy iteration algorithm alternately evaluates the value $Q^\pi$ for the current policy $\pi$ and updates the policy $\pi$ based on the current value $Q^\pi$ (Figure 1.13).

FIGURE 1.13: Algorithm of policy iteration.
1. Initialize policy $\pi(a \mid s)$.
2. Repeat the following two steps until the policy $\pi(a \mid s)$ converges.
   (a) Policy evaluation: Compute the state-action value function $Q^\pi(s, a)$ for the current policy $\pi(a \mid s)$.
   (b) Policy improvement: Update the policy as
   $$\pi(a \mid s) \longleftarrow \delta\Big(a - \mathop{\mathrm{argmax}}_{a'} Q^\pi(s, a')\Big).$$
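A minimal sketch of this alternation on a small tabular MDP is given below; for simplicity it uses a deterministic policy (Figure 1.13 states the update with Dirac's delta over a stochastic policy), and `P` and `R` are illustrative placeholders, not from the book.

```python
import numpy as np

# Tabular policy iteration: exact policy evaluation by solving the linear
# Bellman equations, then greedy policy improvement. All numbers are made up.
n_states, n_actions, GAMMA = 3, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p(s'|s,a)
R = rng.standard_normal((n_states, n_actions))                    # r(s,a)

def evaluate(policy):
    """Policy evaluation: solve V = r_pi + gamma * P_pi V, then form Q."""
    P_pi = P[np.arange(n_states), policy]        # p(s'|s, pi(s))
    r_pi = R[np.arange(n_states), policy]        # r(s, pi(s))
    V = np.linalg.solve(np.eye(n_states) - GAMMA * P_pi, r_pi)
    return R + GAMMA * P @ V                     # Q(s, a) for all (s, a)

policy = np.zeros(n_states, dtype=int)           # initial deterministic policy
while True:
    Q = evaluate(policy)                         # policy evaluation
    new_policy = Q.argmax(axis=1)                # policy improvement
    if np.array_equal(new_policy, policy):
        break                                    # converged to the optimum
    policy = new_policy
print(policy)
```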
The performance of the above policy iteration algorithm depends on the quality of policy evaluation; i.e., how to learn the state-action value function from data is the key issue. Value function approximation corresponds to a regression problem in statistics and machine learning. Thus, various statistical machine learning techniques can be utilized for better value function approximation. Part II of this book addresses this issue, including least-squares estimation and model selection (Chapter 2), basis function design (Chapter 3), efficient sample reuse (Chapter 4), active learning (Chapter 5), and robust learning (Chapter 6).
1.3.2 Model-Free Policy Search

One of the potential weaknesses of policy iteration is that policies are learned via value functions. Thus, improving the quality of value function approximation does not necessarily contribute to improving the quality of resulting policies. Furthermore, a small change in value functions can cause a big difference in policies, which is problematic in, e.g., robot control because such instability can damage the robot's physical system. Another weakness of policy iteration is that policy improvement, i.e., finding the maximizer of $Q^\pi(s, a)$ with respect to $a$, is computationally expensive or difficult when the action space $\mathcal{A}$ is continuous.

Policy search, which directly learns policy functions without estimating value functions, can overcome the above limitations. The basic idea of policy search is to find the policy that maximizes the expected return:
$$\pi^* = \mathop{\mathrm{argmax}}_{\pi} \ \mathbb{E}_{p_\pi(h)}\big[R(h)\big].$$
In policy search, how to find a good policy function in a vast function space is the key issue to be addressed. Part III of this book focuses on policy search and introduces gradient-based methods and the expectation-maximization method in Chapter 7 and Chapter 8, respectively. However, a potential weakness of these direct policy search methods is their instability due to the stochasticity of policies. To overcome the instability problem, an alternative approach called policy-prior search, which learns the policy-prior distribution for deterministic policies, is introduced in Chapter 9. Efficient sample reuse in policy-prior search is also discussed there.
1.3.3 Model-Based Reinforcement Learning

In the above model-free approaches, policies are learned without explicitly modeling the unknown environment (i.e., the transition probability of the agent in the environment, $p(s' \mid s, a)$). On the other hand, the model-based approach explicitly learns the environment in advance and uses the learned environment model for policy learning.

No additional sampling cost is necessary to generate artificial samples from the learned environment model. Thus, the model-based approach is particularly useful when data collection is expensive (e.g., robot control). However, accurately estimating the transition model from a limited amount of trajectory data in multi-dimensional continuous state and action spaces is highly challenging. Part IV of this book focuses on model-based reinforcement learning. In Chapter 10, a non-parametric transition model estimator that possesses the optimal convergence rate with high computational efficiency is introduced. However, even with the optimal convergence rate, estimating the transition model in high-dimensional state and action spaces is still challenging. In Chapter 11, a dimensionality reduction method that can be efficiently embedded into the transition model estimation procedure is introduced and its usefulness is demonstrated through experiments.
Part II
Model-Free Policy Iteration

In Part II, we introduce a reinforcement learning approach based on value functions called policy iteration.

The key issue in the policy iteration framework is how to accurately approximate the value function from a small number of data samples. In Chapter 2, a fundamental framework of value function approximation based on least squares is explained. In this least-squares formulation, how to design good basis functions is critical for better value function approximation. A practical basis design method based on manifold-based smoothing (Chapelle et al., 2006) is explained in Chapter 3.

In real-world reinforcement learning tasks, gathering data is often costly. In Chapter 4, we describe a method for efficiently reusing previously collected samples in the framework of covariate shift adaptation (Sugiyama & Kawanabe, 2012). In Chapter 5, we apply a statistical active learning technique (Sugiyama & Kawanabe, 2012) to optimizing data collection strategies for reducing the sampling cost.

Finally, in Chapter 6, an outlier-robust extension of the least-squares method based on robust regression (Huber, 1981) is introduced. Such a robust method is highly useful in handling noisy real-world data.
Chapter 2
Policy Iteration with Value Function Approximation

In this chapter, we introduce the framework of least-squares policy iteration. In Section 2.1, we first explain the framework of policy iteration, which iteratively executes the policy evaluation and policy improvement steps for finding better policies. Then, in Section 2.2, we show how value function approximation in the policy evaluation step can be formulated as a regression problem and introduce a least-squares algorithm called least-squares policy iteration (Lagoudakis & Parr, 2003). Finally, this chapter is concluded in Section 2.3.

2.1 Value Functions

A traditional way to learn the optimal policy is based on value functions. In this section, we introduce two types of value functions, the state value function and the state-action value function, and explain how they can be used for finding better policies.
2.1.1 State Value Functions

The state value function $V^\pi(s) \in \mathbb{R}$ for policy $\pi$ measures the "value" of state $s$, which is defined as the expected return the agent will receive when following policy $\pi$ from state $s$:
$$V^\pi(s) = \mathbb{E}_{p_\pi(h)}\big[R(h) \,\big|\, s_1 = s\big],$$
where "$\mid s_1 = s$" means that the initial state $s_1$ is fixed at $s_1 = s$. That is, the right-hand side of the above equation denotes the conditional expectation of return $R(h)$ given $s_1 = s$.

By recursion, $V^\pi(s)$ can be expressed as
$$V^\pi(s) = \mathbb{E}_{p(s'|s,a)\pi(a|s)}\big[r(s, a, s') + \gamma V^\pi(s')\big],$$
where $\mathbb{E}_{p(s'|s,a)\pi(a|s)}$ denotes the conditional expectation over $a$ and $s'$ drawn from $p(s' \mid s, a)\pi(a \mid s)$ given $s$. This recursive expression is called the Bellman equation for state values. $V^\pi(s)$ may be obtained by repeating the following update from some initial estimate:
$$V^\pi(s) \longleftarrow \mathbb{E}_{p(s'|s,a)\pi(a|s)}\big[r(s, a, s') + \gamma V^\pi(s')\big].$$

The optimal state value at state $s$, $V^*(s)$, is defined as the maximizer of the state value $V^\pi(s)$ with respect to policy $\pi$:
$$V^*(s) = \max_\pi V^\pi(s).$$
Based on the optimal state value $V^*(s)$, the optimal policy $\pi^*$, which is deterministic, can be obtained as
$$\pi^*(a \mid s) = \delta(a - a^*(s)),$$
where $\delta(\cdot)$ denotes Dirac's delta function and
$$a^*(s) = \mathop{\mathrm{argmax}}_{a \in \mathcal{A}} \ \mathbb{E}_{p(s'|s,a)}\big[r(s, a, s') + \gamma V^*(s')\big].$$
$\mathbb{E}_{p(s'|s,a)}$ denotes the conditional expectation over $s'$ drawn from $p(s' \mid s, a)$ given $s$ and $a$. This algorithm, first computing the optimal value function and then obtaining the optimal policy based on the optimal value function, is called value iteration.

A possible variation is to iteratively perform policy evaluation and improvement as
$$\text{Policy evaluation: } V^\pi(s) \longleftarrow \mathbb{E}_{p(s'|s,a)\pi(a|s)}\big[r(s, a, s') + \gamma V^\pi(s')\big],$$
$$\text{Policy improvement: } \pi(a \mid s) \longleftarrow \delta(a - a^\pi(s)),$$
where
$$a^\pi(s) = \mathop{\mathrm{argmax}}_{a \in \mathcal{A}} \ \mathbb{E}_{p(s'|s,a)}\big[r(s, a, s') + \gamma V^\pi(s')\big].$$
These two steps may be iterated either for all states at once or in a state-by-state manner. This iterative algorithm is called the policy iteration (based on state value functions).
2.1.2 State-Action Value Functions

In the above policy improvement step, the action to take is optimized based on the state value function V^π(s). A more direct way to handle this action optimization is to consider the state-action value function Q^π(s,a) for policy π:

    Q^\pi(s,a) = \mathbb{E}_{p_\pi(h)}\bigl[ R(h) \mid s_1 = s, a_1 = a \bigr],

where "| s_1 = s, a_1 = a" means that the initial state s_1 and the first action a_1 are fixed at s_1 = s and a_1 = a, respectively. That is, the right-hand side of the above equation denotes the conditional expectation of the return R(h) given s_1 = s and a_1 = a.

Let r(s,a) be the expected immediate reward when action a is taken at state s:

    r(s,a) = \mathbb{E}_{p(s'|s,a)}\bigl[ r(s,a,s') \bigr].

Then, in the same way as V^π(s), Q^π(s,a) can be expressed by recursion as

    Q^\pi(s,a) = r(s,a) + \gamma\, \mathbb{E}_{\pi(a'|s')p(s'|s,a)}\bigl[ Q^\pi(s',a') \bigr],     (2.1)

where E_{π(a'|s')p(s'|s,a)} denotes the conditional expectation over s' and a' drawn from π(a'|s')p(s'|s,a) given s and a. This recursive expression is called the Bellman equation for state-action values.
Based on the Bellman equation, the optimal policy may be obtained by iterating the following two steps:

    Policy evaluation:   Q^\pi(s,a) \longleftarrow r(s,a) + \gamma\, \mathbb{E}_{\pi(a'|s')p(s'|s,a)}\bigl[ Q^\pi(s',a') \bigr].
    Policy improvement:  \pi(a|s) \longleftarrow \delta\Bigl( a - \arg\max_{a' \in \mathcal{A}} Q^\pi(s,a') \Bigr).

In practice, it is sometimes preferable to use an explorative policy. For example, Gibbs policy improvement is given by

    \pi(a|s) \longleftarrow \frac{\exp\bigl(Q^\pi(s,a)/\tau\bigr)}{\int_{\mathcal{A}} \exp\bigl(Q^\pi(s,a')/\tau\bigr)\, \mathrm{d}a'},

where τ > 0 determines the degree of exploration. When the action space A is discrete, ε-greedy policy improvement is also used:

    \pi(a|s) \longleftarrow \begin{cases} 1 - \epsilon + \epsilon/|\mathcal{A}| & \text{if } a = \arg\max_{a' \in \mathcal{A}} Q^\pi(s,a'), \\ \epsilon/|\mathcal{A}| & \text{otherwise}, \end{cases}

where ε ∈ (0,1] determines the randomness of the new policy.

The above policy improvement step based on Q^π(s,a) is essentially the same as the one based on V^π(s) explained in Section 2.1.1. However, the policy improvement step based on Q^π(s,a) does not contain the expectation operator, and thus policy improvement can be carried out more directly. For this reason, we focus on the above formulation, called policy iteration based on state-action value functions.
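For a discrete action set, the two explorative improvement rules above can be written down directly. The following is a minimal sketch (hypothetical function names, not taken from the book) that turns a row of Q-values for one state into Gibbs and ε-greedy action probabilities.

```python
import numpy as np

def gibbs_policy(Q_row, tau=1.0):
    """Gibbs (softmax) policy improvement for one state.
    Q_row contains Q(s, a) for all discrete actions a; tau > 0 controls exploration."""
    z = (Q_row - Q_row.max()) / tau        # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()                     # pi(a|s) proportional to exp(Q(s,a)/tau)

def epsilon_greedy_policy(Q_row, epsilon=0.1):
    """Epsilon-greedy policy improvement for one state."""
    n_actions = len(Q_row)
    p = np.full(n_actions, epsilon / n_actions)
    p[np.argmax(Q_row)] += 1.0 - epsilon   # extra mass on the greedy action
    return p
```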
2.2 Least-Squares Policy Iteration

As explained in the previous section, the optimal policy may be learned via the state-action value function Q^π(s,a). However, learning the state-action value function from data is a challenging task for continuous state s and action a.

Learning the state-action value function from data can actually be regarded as a regression problem in statistics and machine learning. In this section, we explain how the least-squares regression technique can be employed in value function approximation, which is called least-squares policy iteration (Lagoudakis & Parr, 2003).
2.2.1 Immediate-Reward Regression

Let us approximate the state-action value function Q^π(s,a) by the following linear-in-parameter model:

    \sum_{b=1}^{B} \theta_b \phi_b(s,a),

where {φ_b(s,a)}_{b=1}^{B} are basis functions, B denotes the number of basis functions, and {θ_b}_{b=1}^{B} are parameters. Specific designs of basis functions will be discussed in Chapter 3. Below, we use the following vector representation for compactly expressing the parameters and basis functions:

    \theta^\top \phi(s,a),

where ⊤ denotes the transpose and

    \theta = (\theta_1, \ldots, \theta_B)^\top \in \mathbb{R}^B,
    \phi(s,a) = \bigl(\phi_1(s,a), \ldots, \phi_B(s,a)\bigr)^\top \in \mathbb{R}^B.
From the Bellman equation for state-action values (2.1), we can express the expected immediate reward r(s,a) as

    r(s,a) = Q^\pi(s,a) - \gamma\, \mathbb{E}_{\pi(a'|s')p(s'|s,a)}\bigl[ Q^\pi(s',a') \bigr].

By substituting the value function model θ^⊤φ(s,a) in the above equation, the expected immediate reward r(s,a) may be approximated as

    r(s,a) \approx \theta^\top \phi(s,a) - \gamma\, \mathbb{E}_{\pi(a'|s')p(s'|s,a)}\bigl[ \theta^\top \phi(s',a') \bigr].

Now let us define a new basis function vector ψ(s,a):

    \psi(s,a) = \phi(s,a) - \gamma\, \mathbb{E}_{\pi(a'|s')p(s'|s,a)}\bigl[ \phi(s',a') \bigr].
FIGURE 2.1: Linear approximation of the state-action value function Q^π(s,a) as linear regression of the expected immediate reward r(s,a).

Then the expected immediate reward r(s,a) may be approximated as

    r(s,a) \approx \theta^\top \psi(s,a).

As explained above, the linear approximation problem of the state-action value function Q^π(s,a) can be reformulated as the linear regression problem of the expected immediate reward r(s,a) (see Figure 2.1). The key trick was to push the recursive nature of the state-action value function Q^π(s,a) into the composite basis function ψ(s,a).
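When the policy π(a'|s') has finitely many actions, the composite basis ψ(s,a) is straightforward to form: the expectation over a' becomes a weighted sum over actions, and the expectation over s' can be approximated by averaging over observed successor states. A minimal sketch under these assumptions follows; all names are hypothetical.

```python
import numpy as np

def psi_features(phi, s, a, next_states, pi_probs, gamma=0.9):
    """Composite basis psi(s, a) = phi(s, a) - gamma * E[ phi(s', a') ].

    phi:         callable (state, action) -> feature vector of length B
    next_states: list of sampled successor states s' observed after (s, a)
    pi_probs:    callable s' -> array of probabilities pi(a'|s') over all actions
    """
    expected = 0.0
    for s_next in next_states:                  # empirical average over s'
        probs = pi_probs(s_next)
        for a_next, p in enumerate(probs):      # exact expectation over a'
            expected = expected + p * phi(s_next, a_next)
    expected = expected / max(len(next_states), 1)
    return phi(s, a) - gamma * expected
```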
2.2.2 Algorithm

Now we explain how the parameters θ are learned in the least-squares framework. That is, the model θ^⊤ψ(s,a) is fitted to the expected immediate reward r(s,a) under the squared loss:

    \min_\theta\; \mathbb{E}_{p_\pi(h)}\left[ \frac{1}{T} \sum_{t=1}^{T} \Bigl( \theta^\top \psi(s_t, a_t) - r(s_t, a_t) \Bigr)^2 \right],

where h denotes a history sample following the current policy π:

    h = [s_1, a_1, \ldots, s_T, a_T, s_{T+1}].

For history samples H = \{h_1, \ldots, h_N\}, where

    h_n = [s_{1,n}, a_{1,n}, \ldots, s_{T,n}, a_{T,n}, s_{T+1,n}],

an empirical version of the above least-squares problem is given as

    \min_\theta\; \frac{1}{NT} \sum_{n=1}^{N} \sum_{t=1}^{T} \Bigl( \theta^\top \widehat{\psi}(s_{t,n}, a_{t,n}; H) - r(s_{t,n}, a_{t,n}, s_{t+1,n}) \Bigr)^2.
FIGURE 2.2: Gradient descent.

Here, \widehat{\psi}(s,a;H) is an empirical estimator of ψ(s,a) given by

    \widehat{\psi}(s,a;H) = \phi(s,a) - \frac{1}{|H_{(s,a)}|} \sum_{s' \in H_{(s,a)}} \mathbb{E}_{\pi(a'|s')}\bigl[ \gamma\, \phi(s',a') \bigr],

where H_{(s,a)} denotes the subset of H that consists of all transition samples from state s by action a, |H_{(s,a)}| denotes the number of elements in the set H_{(s,a)}, and the summation is taken over all destination states s' in the set H_{(s,a)}.
Let \widehat{\Psi} be the NT × B matrix and r be the NT-dimensional vector defined as

    \widehat{\Psi}_{N(t-1)+n,\, b} = \widehat{\psi}_b(s_{t,n}, a_{t,n}),
    r_{N(t-1)+n} = r(s_{t,n}, a_{t,n}, s_{t+1,n}).

\widehat{\Psi} is sometimes called the design matrix. Then the above least-squares problem can be compactly expressed as

    \min_\theta\; \frac{1}{NT} \bigl\| \widehat{\Psi}\theta - r \bigr\|^2,

where ‖·‖ denotes the ℓ2-norm. Because this is a quadratic function with respect to θ, its global minimizer \widehat{\theta} can be analytically obtained by setting its derivative to zero as

    \widehat{\theta} = \bigl(\widehat{\Psi}^\top \widehat{\Psi}\bigr)^{-1} \widehat{\Psi}^\top r.     (2.2)
If B is too large and computing the inverse of \widehat{\Psi}^\top \widehat{\Psi} is intractable, we may use a gradient descent method. That is, starting from some initial estimate θ, the solution is updated until convergence, as follows (see Figure 2.2):

    \theta \longleftarrow \theta - \varepsilon \bigl( \widehat{\Psi}^\top \widehat{\Psi}\theta - \widehat{\Psi}^\top r \bigr),

where \widehat{\Psi}^\top \widehat{\Psi}\theta - \widehat{\Psi}^\top r corresponds to the gradient of the objective function \| \widehat{\Psi}\theta - r \|^2 and ε is a small positive constant representing the step size of gradient descent.
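Given the design matrix and reward vector, both the closed-form solution (2.2) and the gradient-descent variant reduce to a few lines of linear algebra. The following sketch assumes `Psi_hat` (shape NT × B) and `r` (length NT) have already been built from the data; the names are hypothetical.

```python
import numpy as np

def lspi_weights_closed_form(Psi_hat, r):
    """theta_hat = (Psi^T Psi)^{-1} Psi^T r, computed via a linear solve."""
    A = Psi_hat.T @ Psi_hat
    b = Psi_hat.T @ r
    return np.linalg.solve(A, b)          # more stable than forming the inverse explicitly

def lspi_weights_gradient(Psi_hat, r, step=1e-4, n_iter=10000):
    """Plain gradient descent on ||Psi theta - r||^2, useful when B is large."""
    theta = np.zeros(Psi_hat.shape[1])
    for _ in range(n_iter):
        grad = Psi_hat.T @ (Psi_hat @ theta) - Psi_hat.T @ r
        theta = theta - step * grad
    return theta
```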
A notable variation of the above least-squares method is to compute the solution by

    \widetilde{\theta} = \bigl(\Phi^\top \widehat{\Psi}\bigr)^{-1} \Phi^\top r,

where Φ is the NT × B matrix defined as

    \Phi_{N(t-1)+n,\, b} = \phi_b(s_{t,n}, a_{t,n}).

This variation is called the least-squares fixed-point approximation (Lagoudakis & Parr, 2003) and is shown to handle the estimation error included in the basis function \widehat{\psi} in a sound way (Bradtke & Barto, 1996). However, for simplicity, we focus on Eq. (2.2) below.
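In code, the fixed-point variant only changes which matrix multiplies from the left. A one-line sketch, reusing the matrices above and the unshifted basis matrix `Phi` (hypothetical names):

```python
import numpy as np

def lspi_weights_fixed_point(Phi, Psi_hat, r):
    """theta_tilde = (Phi^T Psi_hat)^{-1} Phi^T r (least-squares fixed-point approximation)."""
    return np.linalg.solve(Phi.T @ Psi_hat, Phi.T @ r)
```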
2.2.3 Regularization

Regression techniques in machine learning are generally formulated as minimization of a goodness-of-fit term and a regularization term. In the above least-squares framework, the goodness-of-fit of our model is measured by the squared loss. In the following chapters, we discuss how other loss functions can be utilized in the policy iteration framework, e.g., sample reuse in Chapter 4 and outlier-robust learning in Chapter 6. Here we focus on the regularization term and introduce practically useful regularization techniques.

The ℓ2-regularizer is the most standard regularizer in statistics and machine learning; the resulting method is also called ridge regression (Hoerl & Kennard, 1970):

    \min_\theta\; \frac{1}{NT} \bigl\| \widehat{\Psi}\theta - r \bigr\|^2 + \lambda \|\theta\|^2,

where λ ≥ 0 is the regularization parameter. The role of the ℓ2-regularizer ‖θ‖² is to penalize the growth of the parameter vector θ to avoid overfitting to noisy samples. A practical advantage of the use of the ℓ2-regularizer is that the minimizer \widehat{\theta} can still be obtained analytically:

    \widehat{\theta} = \bigl(\widehat{\Psi}^\top \widehat{\Psi} + \lambda I_B\bigr)^{-1} \widehat{\Psi}^\top r,

where I_B denotes the B × B identity matrix. Because of the addition of λI_B, the matrix to be inverted has a better numerical condition, and thus the solution tends to be more stable than the solution obtained by plain least squares without regularization.
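A sketch of the ℓ2-regularized solution; the only change from the plain least-squares solver sketched earlier is the λI_B term added before the linear solve (names hypothetical):

```python
import numpy as np

def lspi_weights_ridge(Psi_hat, r, lam=1e-3):
    """theta_hat = (Psi^T Psi + lambda * I_B)^{-1} Psi^T r (ridge-regularized LSPI weights)."""
    B = Psi_hat.shape[1]
    A = Psi_hat.T @ Psi_hat + lam * np.eye(B)
    return np.linalg.solve(A, Psi_hat.T @ r)
```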
Note that the same solution as the above ℓ2-penalized least-squares problem can be obtained by solving the following ℓ2-constrained least-squares problem:

    \min_\theta\; \frac{1}{NT} \bigl\| \widehat{\Psi}\theta - r \bigr\|^2 \quad \text{subject to} \quad \|\theta\|^2 \le C,

where C is determined from λ. Note that the larger the value of λ is (i.e., the stronger the effect of regularization), the smaller the value of C is (i.e., the smaller the feasible region). The feasible region (i.e., the region where the constraint ‖θ‖² ≤ C is satisfied) is illustrated in Figure 2.3(a).

FIGURE 2.3: Feasible regions (i.e., regions where the constraint is satisfied). The least-squares (LS) solution is the bottom of the elliptical hyperboloid, whereas the solution of constrained least squares (CLS) is located at the point where the hyperboloid touches the feasible region. (a) ℓ2-constraint; (b) ℓ1-constraint.
Another popular choice of regularization in statistics and machine learning is the ℓ1-regularizer, whose use is also called the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996):

    \min_\theta\; \frac{1}{NT} \bigl\| \widehat{\Psi}\theta - r \bigr\|^2 + \lambda \|\theta\|_1,

where ‖·‖₁ denotes the ℓ1-norm defined as the absolute sum of the elements:

    \|\theta\|_1 = \sum_{b=1}^{B} |\theta_b|.

In the same way as the ℓ2-regularization case, the same solution as the above ℓ1-penalized least-squares problem can be obtained by solving the following constrained least-squares problem:

    \min_\theta\; \frac{1}{NT} \bigl\| \widehat{\Psi}\theta - r \bigr\|^2 \quad \text{subject to} \quad \|\theta\|_1 \le C,
where C is determined from λ. The feasible region is illustrated in Figure 2.3(b).

FIGURE 2.4: Cross validation. The training set is divided into K subsets; K − 1 subsets are used for estimation and the remaining subset is held out for validation.
A notable property of ℓ1-regularization is that the solution tends to be sparse, i.e., many of the elements {θ_b}_{b=1}^{B} become exactly zero. The reason why the solution becomes sparse can be intuitively understood from Figure 2.3(b): the solution tends to be on one of the corners of the feasible region, where the solution is sparse. On the other hand, in the ℓ2-constraint case (see Figure 2.3(a) again), the solution is similar to the ℓ1-constraint case, but it is not generally on an axis and thus the solution is not sparse. Such a sparse solution has various computational advantages. For example, the solution for large-scale problems can be computed efficiently because not all parameters have to be explicitly handled; see, e.g., Tomioka et al. (2011). Furthermore, the solutions for all different regularization parameters can be computed efficiently (Efron et al., 2004), and the output of the learned model can be computed efficiently.
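The ℓ1-penalized problem has no closed-form solution, but a simple iterative soft-thresholding scheme (ISTA) solves it. The sketch below is one of many possible solvers, shown purely for illustration and not as the method used in the book; the 1/(NT) factor of the text is absorbed into λ here.

```python
import numpy as np

def soft_threshold(x, kappa):
    """Elementwise soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - kappa, 0.0)

def lasso_ista(Psi_hat, r, lam=1e-3, n_iter=5000):
    """Iterative soft-thresholding for min ||Psi theta - r||^2 + lam * ||theta||_1."""
    L = np.linalg.norm(Psi_hat, 2) ** 2          # largest eigenvalue of Psi^T Psi
    theta = np.zeros(Psi_hat.shape[1])
    for _ in range(n_iter):
        grad = Psi_hat.T @ (Psi_hat @ theta - r)  # half the gradient of the smooth term
        theta = soft_threshold(theta - grad / L, lam / (2.0 * L))
    return theta
```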
2.2.4 Model Selection

In regression, tuning parameters are often included in the algorithm, such as basis parameters and the regularization parameter. Such tuning parameters can be objectively and systematically optimized based on cross-validation (Wahba, 1990) as follows (see Figure 2.4).

First, the training dataset H is divided into K disjoint subsets of approximately the same size, {H_k}_{k=1}^{K}. Then the regression solution \widehat{\theta}_k is obtained using H\H_k (i.e., all samples except H_k), and its squared error for the hold-out samples H_k is computed. This procedure is repeated for k = 1, …, K, and the model (such as the basis parameter and the regularization parameter) that minimizes the average error is chosen as the most suitable one.

One may think that the ordinary squared error could be used directly for model selection, instead of its cross-validation estimator. However, the ordinary squared error is heavily biased (in other words, over-fitted) since the same training samples are used twice, for learning parameters and for estimating the generalization error (i.e., the out-of-sample prediction error). On the other hand, the cross-validation estimator of the squared error is almost unbiased, where "almost" comes from the fact that the number of training samples is reduced due to data splitting in the cross-validation procedure.
In general, cross-validation is computationally expensive because the squared error needs to be estimated many times. For example, when performing 5-fold cross-validation for 10 model candidates, the learning procedure has to be repeated 5 × 10 = 50 times. However, this is often acceptable in practice because sensible model selection gives an accurate solution even with a small number of samples; thus, in total, the computation time may not grow that much. Furthermore, cross-validation is suitable for parallel computing since error estimation for different models and different folds is independent. For instance, when performing 5-fold cross-validation for 10 model candidates, the use of 50 computing units allows us to compute everything at once.
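As a concrete illustration of the procedure described above, the following sketch performs K-fold cross-validation over candidate regularization parameters, reusing the hypothetical ridge solver sketched earlier; all names are assumptions, not the book's own code.

```python
import numpy as np

def cv_choose_lambda(Psi_hat, r, lambdas, K=5, seed=0):
    """Choose the regularization parameter by K-fold cross-validation."""
    n = len(r)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)                 # K disjoint subsets
    scores = []
    for lam in lambdas:
        errs = []
        for k in range(K):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            theta = lspi_weights_ridge(Psi_hat[train], r[train], lam)
            resid = Psi_hat[test] @ theta - r[test]
            errs.append(np.mean(resid ** 2))       # hold-out squared error
        scores.append(np.mean(errs))
    return lambdas[int(np.argmin(scores))]         # model with the smallest average error
```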
2.3 Remarks

Reinforcement learning via regression of state-action value functions is a highly powerful and flexible approach, because we can utilize various regression techniques developed in statistics and machine learning such as least squares, regularization, and cross-validation.

In the following chapters, we introduce more sophisticated regression techniques such as manifold-based smoothing (Chapelle et al., 2006) in Chapter 3, covariate shift adaptation (Sugiyama & Kawanabe, 2012) in Chapter 4, active learning (Sugiyama & Kawanabe, 2012) in Chapter 5, and robust regression (Huber, 1981) in Chapter 6.
Chapter 3
Basis Design for Value Function Approximation

Least-squares policy iteration explained in Chapter 2 works well, given appropriate basis functions for value function approximation. Because of its smoothness, the Gaussian kernel is a popular and useful choice as a basis function. However, it does not allow for discontinuity, which is conceivable in many reinforcement learning tasks. In this chapter, we introduce an alternative basis function based on geodesic Gaussian kernels (GGKs), which exploit the non-linear manifold structure induced by Markov decision processes (MDPs). The details of GGKs are explained in Section 3.1, and their relation to other basis function designs is discussed in Section 3.2. Then, experimental performance is numerically evaluated in Section 3.3, and this chapter is concluded in Section 3.4.
3.1 Gaussian Kernels on Graphs

In least-squares policy iteration, the choice of the basis functions {φ_b(s,a)}_{b=1}^{B} is an open design issue (see Chapter 2). Traditionally, Gaussian kernels have been a popular choice (Lagoudakis & Parr, 2003; Engel et al., 2005), but they cannot approximate discontinuous functions well. To cope with this problem, more sophisticated methods of constructing suitable basis functions have been proposed which effectively make use of the graph structure induced by MDPs (Mahadevan, 2005). In this section, we introduce an alternative way of constructing basis functions by incorporating the graph structure of the state space.
3.1.1 MDP-Induced Graph

Let G be a graph induced by an MDP, where the states S are nodes of the graph and the transitions with non-zero transition probabilities from one node to another are edges. The edges may have weights determined, e.g., based on the transition probabilities or the distance between nodes. The graph structure corresponding to an example grid world shown in Figure 3.1(a) is illustrated in Figure 3.1(c). In practice, such graph structure (including the connection weights) is estimated from samples of a finite length. We assume that the graph G is connected. Typically, the graph is sparse in reinforcement learning tasks, i.e.,

    \ell \ll n(n-1)/2,

where ℓ is the number of edges and n is the number of nodes.

FIGURE 3.1: An illustrative example of a reinforcement learning task of guiding an agent to a goal in the grid world. (a) Black areas are walls over which the agent cannot move, while the goal is represented in gray. Arrows on the grids represent one of the optimal policies. (b) Optimal state value function (in log-scale). (c) Graph induced by the MDP and a random policy.
3.1.2 Ordinary Gaussian Kernels

Ordinary Gaussian kernels (OGKs) on the Euclidean space are defined as

    K(s, s') = \exp\left( -\frac{\mathrm{ED}(s, s')^2}{2\sigma^2} \right),

where ED(s,s') is the Euclidean distance between states s and s'; for example,

    \mathrm{ED}(s, s') = \| x - x' \|,

when the Cartesian positions of s and s' in the state space are given by x and x', respectively. σ² is the variance parameter of the Gaussian kernel.

The above Gaussian function is defined on the state space S, where s' is treated as a center of the kernel. In order to employ the Gaussian kernel in least-squares policy iteration, it needs to be extended over the state-action space S × A. This is usually carried out by simply "copying" the Gaussian function over the action space (Lagoudakis & Parr, 2003; Mahadevan, 2005). More precisely, let the total number k of basis functions be mp, where m is the number of possible actions and p is the number of Gaussian centers. For the i-th action a^{(i)} (∈ A) (i = 1, 2, …, m) and for the j-th Gaussian center c^{(j)} (∈ S) (j = 1, 2, …, p), the (i + (j−1)m)-th basis function is defined as

    \phi_{i+(j-1)m}(s, a) = I(a = a^{(i)})\, K(s, c^{(j)}),     (3.1)

where I(·) is the indicator function:

    I(a = a^{(i)}) = \begin{cases} 1 & \text{if } a = a^{(i)}, \\ 0 & \text{otherwise}. \end{cases}
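In code, the OGK state-action basis of Eq. (3.1) is an action indicator multiplied by a Gaussian of the Euclidean distance to a kernel center. A minimal sketch with hypothetical names:

```python
import numpy as np

def ogk_basis(s, a, centers, n_actions, sigma=1.0):
    """Ordinary Gaussian kernel basis phi(s, a) of Eq. (3.1).

    s:         state as a coordinate vector
    a:         index of the taken action (0-based)
    centers:   array of shape (p, dim) with Gaussian centers c^(j)
    n_actions: number of possible actions m
    Returns a vector of length m * p.
    """
    p = len(centers)
    phi = np.zeros(n_actions * p)
    dists = np.linalg.norm(centers - s, axis=1)      # ED(s, c^(j)) for all centers
    K = np.exp(-dists ** 2 / (2.0 * sigma ** 2))     # Gaussian kernel values
    for j in range(p):
        phi[a + j * n_actions] = K[j]                # copy over the action space
    return phi
```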
3.1.3 Geodesic Gaussian Kernels

On graphs, a natural definition of the distance would be the shortest path. The Gaussian kernel based on the shortest path is given by

    K(s, s') = \exp\left( -\frac{\mathrm{SP}(s, s')^2}{2\sigma^2} \right),     (3.2)

where SP(s,s') denotes the shortest path from state s to state s'. The shortest path on a graph can be interpreted as a discrete approximation to the geodesic distance on a non-linear manifold (Chung, 1997). For this reason, we call Eq. (3.2) a geodesic Gaussian kernel (GGK) (Sugiyama et al., 2008).

Shortest paths on graphs can be efficiently computed using the Dijkstra algorithm (Dijkstra, 1959). With its naive implementation, the computational complexity for computing the shortest paths from a single node to all other nodes is O(n²), where n is the number of nodes. If the Fibonacci heap is employed, the computational complexity can be reduced to O(n log n + ℓ) (Fredman & Tarjan, 1987), where ℓ is the number of edges. Since the graph in value function approximation problems is typically sparse (i.e., ℓ ≪ n²), using the Fibonacci heap provides significant computational gains. Furthermore, there exist various approximation algorithms which are computationally very efficient (see Goldberg & Harrelson, 2005, and references therein).
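For sparse MDP-induced graphs, the single-source shortest paths needed by GGKs can be computed with a heap-based Dijkstra implementation. The following sketch uses only the Python standard library and assumes the graph is stored as an adjacency list `graph[u] = [(v, weight), ...]` (a hypothetical representation, not prescribed by the book).

```python
import heapq

def dijkstra(graph, source):
    """Single-source shortest-path distances SP(source, .) on a weighted graph.

    graph: dict mapping node -> list of (neighbor, edge_weight) pairs
    Returns a dict of shortest-path distances from source.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue                              # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```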
Analogously to OGKs, we need to extend GGKs to the state-action space to use them in least-squares policy iteration. A naive way is to just employ Eq. (3.1), but this can cause a shift in the Gaussian centers since the state usually changes when some action is taken. To incorporate this transition, the basis functions are defined as the expectation of Gaussian functions after the transition:

    \phi_{i+(j-1)m}(s, a) = I(a = a^{(i)}) \sum_{s' \in \mathcal{S}} P(s'|s, a)\, K(s', c^{(j)}).     (3.3)

This shifting scheme is shown to work very well when the transition is predominantly deterministic (Sugiyama et al., 2008).
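Combining the shortest-path distances with Eq. (3.2) and the shifting scheme of Eq. (3.3) gives the GGK state-action basis. The sketch below assumes transitions are stored as lists of observed successor nodes whose empirical frequencies approximate P(s'|s,a); all names are hypothetical.

```python
import numpy as np

def ggk_basis(s, a, centers, sp, transitions, n_actions, sigma=1.0):
    """Geodesic Gaussian kernel basis with the shifting scheme of Eq. (3.3).

    centers:     list of p center nodes c^(j)
    sp:          sp[u][v] = shortest-path distance SP(u, v) between graph nodes
    transitions: dict (s, a) -> list of observed successor nodes s'
    """
    p = len(centers)
    phi = np.zeros(n_actions * p)
    succ = transitions.get((s, a), [s])           # fall back to s itself if unseen
    for j, c in enumerate(centers):
        # Expected kernel value after the transition: E_{P(s'|s,a)}[ K(s', c^(j)) ]
        vals = [np.exp(-sp[s_next][c] ** 2 / (2.0 * sigma ** 2)) for s_next in succ]
        phi[a + j * n_actions] = np.mean(vals)
    return phi
```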
3.1.4 Extension to Continuous State Spaces

So far, we focused on discrete state spaces. However, the concept of GGKs can be naturally extended to continuous state spaces, which is explained here. First, the continuous state space is discretized, which gives a graph as a discrete approximation to the non-linear manifold structure of the continuous state space. Based on the graph, GGKs can be constructed in the same way as in the discrete case. Finally, the discrete GGKs are interpolated, e.g., using a linear method, to give continuous GGKs.

Although this procedure discretizes the continuous state space, it must be noted that the discretization is only for the purpose of obtaining the graph as a discrete approximation of the continuous non-linear manifold; the resulting basis functions themselves are continuously interpolated, and hence the state space is still treated as continuous, as opposed to conventional discretization procedures.
3.2 Illustration

In this section, the characteristics of GGKs are discussed in comparison to existing basis functions.
3.2.1 Setup

Let us consider a toy reinforcement learning task of guiding an agent to a goal in a deterministic grid world (see Figure 3.1(a)). The agent can take 4 actions: up, down, left, and right. Note that actions which make the agent collide with the wall are disallowed. A positive immediate reward +1 is given if the agent reaches a goal state; otherwise it receives no immediate reward. The discount factor is set at γ = 0.9.

In this task, a state s corresponds to a two-dimensional Cartesian grid position x of the agent. For illustration purposes, let us display the state value function,

    V^\pi(s) : \mathcal{S} \to \mathbb{R},

which is the expected long-term discounted sum of rewards the agent receives when taking actions following policy π from state s. From the definition, it can be confirmed that V^π(s) is expressed in terms of Q^π(s,a) as

    V^\pi(s) = Q^\pi\bigl(s, \pi(s)\bigr).

The optimal state value function V^*(s) (in log-scale) is illustrated in Figure 3.1(b). An MDP-induced graph structure estimated from 20 series of random walk samples¹ of length 500 is illustrated in Figure 3.1(c). Here, the edge weights in the graph are set at 1 (which is equivalent to the Euclidean distance between two nodes).
3.2.2 Geodesic Gaussian Kernels

An example of GGKs for this graph is depicted in Figure 3.2(a), where the variance of the kernel is set at a large value (σ² = 30) for illustration purposes. The graph shows that GGKs have a nice smooth surface along the maze, but not across the partition between the two rooms. Since GGKs have "centers," they are extremely useful for adaptively choosing a subset of bases, e.g., using a uniform allocation strategy, a sample-dependent allocation strategy, or a maze-dependent allocation strategy of the centers. This is a practical advantage over non-ordered basis functions. Moreover, since GGKs are local by nature, the ill effects of local noise are constrained locally, which is another useful property in practice.

The approximated value functions obtained by 40 GGKs² are depicted in Figure 3.3(a), where one GGK center is put at the goal state and the remaining 39 centers are chosen randomly. For GGKs, kernel functions are extended over the action space using the shifting scheme (see Eq. (3.3)) since the transition is deterministic (see Section 3.1.3).

¹ More precisely, in each random walk, an initial state is chosen randomly. Then, an action is chosen randomly and a transition is made; this is repeated 500 times. This entire procedure is independently repeated 20 times to generate the training set.
² Note that the total number k of basis functions is 160 since each GGK is copied over the action space as per Eq. (3.3).
FIGURE 3.2: Examples of basis functions. (a) Geodesic Gaussian kernels; (b) ordinary Gaussian kernels; (c) graph-Laplacian eigenbases; (d) diffusion wavelets.
FIGURE 3.3: Approximated value functions in log-scale. The errors are computed with respect to the optimal value function illustrated in Figure 3.1(b). (a) Geodesic Gaussian kernels (MSE = 1.03 × 10⁻²); (b) ordinary Gaussian kernels (MSE = 1.19 × 10⁻²); (c) graph-Laplacian eigenbases (MSE = 4.73 × 10⁻⁴); (d) diffusion wavelets (MSE = 5.00 × 10⁻⁴).
The GGK-based method produces a nice smooth function along the maze while the discontinuity around the partition between the two rooms is sharply maintained (cf. Figure 3.1(b)). As a result, for this particular case, GGKs give the optimal policy (see Figure 3.4(a)).

As discussed in Section 3.1.3, the sparsity of the state transition matrix allows efficient and fast computation of shortest paths on the graph. Therefore, least-squares policy iteration with GGK-based bases is still computationally attractive.
3.2.3 Ordinary Gaussian Kernels

OGKs share some of the preferable properties of GGKs described above. However, as illustrated in Figure 3.2(b), the tails of OGKs extend beyond the partition between the two rooms. Therefore, OGKs tend to undesirably smooth out the discontinuity of the value function around the barrier wall (see Figure 3.3(b)). This causes an error in the policy around the partition (see x = 10, y = 2, 3, …, 9 of Figure 3.4(b)).
FIGURE 3.4: Obtained policies. (a) Geodesic Gaussian kernels; (b) ordinary Gaussian kernels; (c) graph-Laplacian eigenbases; (d) diffusion wavelets.
3.2.4 Graph-Laplacian Eigenbases

Mahadevan (2005) proposed employing the smoothest vectors on graphs as bases in value function approximation. According to spectral graph theory (Chung, 1997), such smooth bases are given by the minor eigenvectors of the graph-Laplacian matrix, which are called graph-Laplacian eigenbases (GLEs). GLEs may be regarded as a natural extension of Fourier bases to graphs.

Examples of GLEs are illustrated in Figure 3.2(c), showing that they have a Fourier-like structure on the graph. It should be noted that GLEs are rather global in nature, implying that noise in a local region can potentially degrade the global quality of approximation. An advantage of GLEs is that they have a natural ordering of the basis functions according to smoothness, which is practically very helpful in choosing a subset of basis functions. Figure 3.3(c) depicts the approximated value function in log-scale, where the top 40 smoothest GLEs out of 326 GLEs are used (note that the actual number of bases is 160 because of the duplication over the action space). It shows that GLEs globally give a very good approximation, although the small local fluctuation is significantly emphasized since the graph is in log-scale. Indeed, the mean squared error (MSE) between the approximated and optimal value functions described in the captions of Figure 3.3 shows that GLEs give a much smaller MSE than GGKs and OGKs. However, the obtained value function contains systematic local fluctuations, and this results in an inappropriate policy (see Figure 3.4(c)).

MDP-induced graphs are typically sparse. In such cases, the resultant graph-Laplacian matrix is also sparse and GLEs can be obtained just by solving a sparse eigenvalue problem, which is computationally efficient. However, finding minor eigenvectors could be numerically unstable.
3.2.5 Diffusion Wavelets

Coifman and Maggioni (2006) proposed diffusion wavelets (DWs), which are a natural extension of wavelets to the graph. The construction is based on a symmetrized random walk on a graph, which is diffused on the graph up to a desired level, resulting in a multi-resolution structure. A detailed algorithm for constructing DWs and their mathematical properties are described in Coifman and Maggioni (2006).

When constructing DWs, the maximum nest level of wavelets and the tolerance used in the construction algorithm need to be specified by users. Here, the maximum nest level is set at 10 and the tolerance is set at 10⁻¹⁰, as suggested by the authors. Examples of DWs are illustrated in Figure 3.2(d), showing a nice multi-resolution structure on the graph. DWs are over-complete bases, so one has to appropriately choose a subset of bases for better approximation. Figure 3.3(d) depicts the approximated value function obtained by DWs, where the most global 40 DWs are chosen from 1626 over-complete DWs (note that the actual number of bases is 160 because of the duplication over the action space). The choice of the subset of bases could possibly be enhanced using multiple heuristics. However, the current choice is reasonable since Figure 3.3(d) shows that DWs give a much smaller MSE than Gaussian kernels. Nevertheless, similarly to GLEs, the obtained value function contains a lot of small fluctuations (see Figure 3.3(d)), and this results in an erroneous policy (see Figure 3.4(d)).

Thanks to the multi-resolution structure, the computation of diffusion wavelets can be carried out recursively. However, due to the over-completeness, it is still rather demanding in computation time. Furthermore, appropriately determining the tuning parameters as well as choosing an appropriate basis subset is not straightforward in practice.
3.3 Numerical Examples

As discussed in the previous section, GGKs bring a number of preferable properties for making value function approximation effective. In this section, the behavior of GGKs is illustrated numerically.

3.3.1 Robot-Arm Control

Here, a simulator of a two-joint robot arm (moving in a plane), illustrated in Figure 3.5(a), is employed. The task is to lead the end-effector ("hand") of the arm to an object while avoiding the obstacles. Possible actions are to increase or decrease the angle of each joint ("shoulder" and "elbow") by 5 degrees in the plane, simulating coarse stepper-motor joints. Thus, the state space S is the 2-dimensional discrete space consisting of the two joint angles, as illustrated in Figure 3.5(b). The black area in the middle corresponds to the obstacle in the joint-angle state space. The action space A involves 4 actions: increase or decrease one of the joint angles. A positive immediate reward +1 is given when the robot's end-effector touches the object; otherwise the robot receives no immediate reward. Note that actions which make the arm collide with obstacles are disallowed. The discount factor is set at γ = 0.9. In this environment, the robot can change the joint angle exactly by 5 degrees, and therefore the environment is deterministic. However, because of the obstacles, it is difficult to explicitly compute an inverse kinematic model. Furthermore, the obstacles introduce discontinuity in value functions. Therefore, this robot-arm control task is an interesting testbed for investigating the behavior of GGKs.
Training samples from 50 series of 1000 random arm movements are collected, where the start state is chosen randomly in each trial. The graph induced by the above MDP consists of 1605 nodes, and uniform weights are assigned to the edges. Since there are 16 goal states in this environment (see Figure 3.5(b)), the first 16 Gaussian centers are put at the goals and the remaining centers are chosen randomly in the state space. For GGKs, kernel functions are extended over the action space using the shifting scheme (see Eq. (3.3)) since the transition is deterministic in this experiment.

Figure 3.6 illustrates the value functions approximated using GGKs and OGKs. The graphs show that GGKs give a nice smooth surface with the obstacle-induced discontinuity sharply preserved, while OGKs tend to smooth out the discontinuity. This makes a significant difference in avoiding the obstacle. From "A" to "B" in Figure 3.5(b), the GGK-based value function results in a trajectory that avoids the obstacle (see Figure 3.6(a)). On the other hand, the OGK-based value function yields a trajectory that tries to move the arm through the obstacle by following the gradient upward (see Figure 3.6(b)), causing the arm to get stuck behind the obstacle.
FIGURE 3.5: A two-joint robot arm. (a) A schematic; (b) state space. In this experiment, GGKs are put at all the goal states and the remaining kernels are distributed uniformly over the maze; the shifting scheme is used in GGKs.
Figure 3.7 summarizes the performance of GGKs and OGKs measured by the percentage of successful trials (i.e., the end-effector reaches the object) over 30 independent runs. More precisely, in each run, 50,000 training samples are collected using a different random seed, a policy is then computed by the GGK- or OGK-based least-squares policy iteration, and finally the obtained policy is tested. This graph shows that GGKs remarkably outperform OGKs since the arm can successfully avoid the obstacle. The performance of OGKs does not go beyond 0.6 even when the number of kernels is increased. This is caused by the tail effect of OGKs. As a result, the OGK-based policy cannot lead the end-effector to the object if it starts from the bottom-left half of the state space.
andOGKsgetsworseataroundk=20.Thisiscausedbythekernelalloca-
![Page 134: Masashi Sugiyama-Statistical Reinforcement Learning_ Modern Machine Learning Approaches-Chapman and Hall_CRC (2015)](https://reader034.fdocuments.net/reader034/viewer/2022042703/563db7ac550346aa9a8ceb0d/html5/thumbnails/134.jpg)
![Page 135: Masashi Sugiyama-Statistical Reinforcement Learning_ Modern Machine Learning Approaches-Chapman and Hall_CRC (2015)](https://reader034.fdocuments.net/reader034/viewer/2022042703/563db7ac550346aa9a8ceb0d/html5/thumbnails/135.jpg)
38
StatisticalReinforcementLearning
3
1
2
0.5
1
0
0
180
180
100
100
0
0
0
0
Joint2(degree)
Joint2(degree)
−180
−100
Joint1(degree)
−180
−100
Joint1(degree)
(a)GeodesicGaussiankernels
(b)OrdinaryGaussiankernels
FIGURE3.6:Approximatedvaluefunctionswith10kernels(theactual
numberofbasesis40becauseoftheduplicationovertheactionspace).
1
0.9
0.8
![Page 136: Masashi Sugiyama-Statistical Reinforcement Learning_ Modern Machine Learning Approaches-Chapman and Hall_CRC (2015)](https://reader034.fdocuments.net/reader034/viewer/2022042703/563db7ac550346aa9a8ceb0d/html5/thumbnails/136.jpg)
0.7
0.6
0.5
0.4
Fractionofsuccessfultrials0.3
0.2
GGK(5)
GGK(9)
0.1
OGK(5)
OGK(9)
00
20
40
60
80
100
Numberofkernels
FIGURE3.7:Fractionofsuccessfultrials.
When the number of kernels is increased, the performance of both GGKs and OGKs gets worse at around k = 20. This is caused by the kernel allocation strategy: the first 16 kernels are put at the goal states and the remaining kernel centers are chosen randomly. When k is less than or equal to 16, the approximated value function tends to have a unimodal profile since all kernels are put at the goal states. However, when k is larger than 16, this unimodality is broken and the surface of the approximated value function has slight fluctuations, causing an error in policies and degrading performance at around k = 20. This performance degradation tends to recover as the number of kernels is further increased.
Motion examples of the robot arm trained with GGK and OGK are illustrated in Figure 3.8 and Figure 3.9, respectively.

Overall, the above results show that when GGKs are combined with the above-mentioned kernel-center allocation strategy, almost perfect policies can be obtained with a small number of kernels. Therefore, the GGK method is computationally highly advantageous.
3.3.2 Robot-Agent Navigation

The above simple robot-arm control simulation shows that GGKs are promising. Here, GGKs are applied to a more challenging task of mobile-robot navigation, which involves a high-dimensional and very large state space.

A Khepera robot, illustrated in Figure 3.10(a), is employed for the navigation task. The Khepera robot is equipped with 8 infrared sensors ("s1" to "s8" in the figure), each of which gives a measure of the distance from the surrounding obstacles. Each sensor produces a scalar value between 0 and 1023: the sensor obtains the maximum value 1023 if an obstacle is just in front of the sensor, and the value decreases as the obstacle gets farther, until it reaches the minimum value 0. Therefore, the state space S is 8-dimensional. The Khepera robot has two wheels and takes the following four actions: forward, left rotation, right rotation, and backward (i.e., the action space A contains 4 actions). The speed of the left and right wheels for each action is described in Figure 3.10(a) in brackets (the unit is pulses per 10 milliseconds). Note that the sensor values and the wheel speed are highly stochastic due to crosstalk, sensor noise, slip, etc. Furthermore, perceptual aliasing occurs due to the limited range and resolution of the sensors. Therefore, the state transition is also highly stochastic. The discount factor is set at γ = 0.9.

The goal of the navigation task is to make the Khepera robot explore the environment as much as possible. To this end, a positive reward +1 is given when the Khepera robot moves forward and a negative reward −2 is given when the Khepera robot collides with an obstacle. No reward is given to the left rotation, right rotation, and backward actions. This reward design encourages the Khepera robot to go forward without hitting obstacles, through which extensive exploration of the environment can be achieved.

Training samples are collected from 200 series of 100 random movements in a fixed environment with several obstacles (see Figure 3.11(a)). Then, a graph is constructed from the gathered samples by discretizing the continuous state space using a self-organizing map (SOM) (Kohonen, 1995). A SOM consists of neurons located on a regular grid. Each neuron corresponds to a cluster, and neurons are connected to adjacent ones by a neighborhood relation. The SOM is similar to the k-means clustering algorithm, but it is different in that the topological structure of the entire map is taken into account. Thanks to this, the entire space tends to be covered by the SOM.
FIGURE 3.8: A motion example of the robot arm trained with GGK (from left to right and top to bottom).

FIGURE 3.9: A motion example of the robot arm trained with OGK (from left to right and top to bottom).
FIGURE 3.10: Khepera robot. (a) A schematic; (b) state space projected onto a 2-dimensional subspace for visualization. In this experiment, GGKs are distributed uniformly over the maze without the shifting scheme.
The number of nodes (states) in the graph is set at 696 (equivalent to the SOM map size of 24 × 29). This value is computed by the standard rule-of-thumb formula 5√n (Vesanto et al., 2000), where n is the number of samples. The connectivity of the graph is determined by the state transitions occurring in the samples. More specifically, if there is a state transition from one node to another in the samples, an edge is established between these two nodes and the edge weight is set according to the Euclidean distance between them.

Figure 3.10(b) illustrates an example of the obtained graph structure. For visualization purposes, the 8-dimensional state space is projected onto a 2-dimensional subspace spanned by

    (-1, -1, 0, 0, 1, 1, 0, 0),
    (0, 0, 1, 1, 0, 0, -1, -1).
FIGURE 3.11: Simulation environment. (a) Training; (b) test.

Note that this projection is performed only for the purpose of visualization. All the computations are carried out using the entire 8-dimensional data. The i-th element in the above bases corresponds to the output of the i-th sensor (see Figure 3.10(a)). The projection onto this subspace roughly means that the horizontal axis corresponds to the distance to the left and right obstacles, while the vertical axis corresponds to the distance to the front and back obstacles. For clear visibility, only the edges whose weight is less than 250 are plotted. Representative local poses of the Khepera robot with respect to the obstacles are illustrated in Figure 3.10(b). This graph has a notable feature: the nodes around the region "B" in the figure are directly connected to the nodes at "A," but are very sparsely connected to the nodes at "C," "D," and "E." This implies that the geodesic distance from "B" to "C," "B" to "D," or "B" to "E" is typically larger than the Euclidean distance.
Since the transition from one state to another is highly stochastic in the current experiment, the GGK function is simply duplicated over the action space (see Eq. (3.1)). For obtaining continuous GGKs, GGK functions need to be interpolated (see Section 3.1.4). A simple linear interpolation method may be employed in general, but the current experiment has a unique characteristic: at least one of the sensor values is always zero since the Khepera robot is never completely surrounded by obstacles. Therefore, samples are always on the surface of the 8-dimensional hypercube-shaped state space. On the other hand, the node centers determined by the SOM are not generally on the surface. This means that a sample is not included in the convex hull of its nearest nodes and the function value needs to be extrapolated. Here, the Euclidean distance between the sample and its nearest node is simply added when computing kernel values. More precisely, for a state s that is not generally located on a node center, the GGK-based basis function is defined as

    \phi_{i+(j-1)m}(s, a) = I(a = a^{(i)}) \exp\left( -\frac{\bigl(\mathrm{ED}(s, \tilde{s}) + \mathrm{SP}(\tilde{s}, c^{(j)})\bigr)^2}{2\sigma^2} \right),

where \tilde{s} is the node closest to s in the Euclidean distance.
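The extrapolation rule above only adds the Euclidean distance from the sample to its nearest SOM node before applying the Gaussian. A sketch with hypothetical names, where `node_coords` holds the SOM node centers and `sp` the graph shortest-path distances:

```python
import numpy as np

def ggk_basis_continuous(s, a, node_coords, centers, sp, n_actions, sigma=1000.0):
    """GGK basis for a continuous state s via its nearest graph node s_tilde:
    K = exp(-(ED(s, s_tilde) + SP(s_tilde, c^(j)))^2 / (2 sigma^2))."""
    dists = np.linalg.norm(node_coords - s, axis=1)
    s_tilde = int(np.argmin(dists))               # nearest node in Euclidean distance
    ed = dists[s_tilde]
    p = len(centers)
    phi = np.zeros(n_actions * p)
    for j, c in enumerate(centers):
        phi[a + j * n_actions] = np.exp(-(ed + sp[s_tilde][c]) ** 2 / (2.0 * sigma ** 2))
    return phi
```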
Figure3.12illustratesanexampleofactionsselectedateachnodebythe
GGK-basedandOGK-basedpolicies.Onehundredkernelsareusedandthe
widthissetat1000.Thesymbols↑,↓,⊂,and⊃inthefigureindicateforward,backward,leftrotation,andrightrotationactions.Thisshowsthatthereisa
cleardifferenceintheobtainedpoliciesatthestate“C.”Thebackwardaction
ismostlikelytobetakenbytheOGK-basedpolicy,whiletheleftrotation
andrightrotationaremostlikelytobetakenbytheGGK-basedpolicy.This
causesasignificantdifferenceintheperformance.Toexplainthis,supposethat
theKheperarobotisatthestate“C,”i.e.,itfacesawall.TheGGK-based
policyguidestheKheperarobotfrom“C”to“A”via“D”or“E”bytaking
theleftandrightrotationactionsanditcanavoidtheobstaclesuccessfully.
![Page 153: Masashi Sugiyama-Statistical Reinforcement Learning_ Modern Machine Learning Approaches-Chapman and Hall_CRC (2015)](https://reader034.fdocuments.net/reader034/viewer/2022042703/563db7ac550346aa9a8ceb0d/html5/thumbnails/153.jpg)
On the other hand, the OGK-based policy tries to plan a path from "C" to "A" via "B" by activating the backward action. As a result, the forward action is taken at "B." For this reason, the Khepera robot returns to "C" again and ends up moving back and forth between "C" and "B."

For the performance evaluation, a more complicated environment than the one used for gathering training samples (see Figure 3.11) is used. This means that how well the obtained policies can be generalized to an unknown environment is evaluated here. In this test environment, the Khepera robot runs from a fixed starting position (see Figure 3.11(b)) and takes 150 steps following the obtained policy, with the sum of rewards (+1 for the forward action) computed. If the Khepera robot collides with an obstacle before 150 steps, the evaluation is stopped. The mean test performance over 30 independent runs is depicted in Figure 3.13 as a function of the number of kernels. More precisely, in each run, a graph is constructed based on the training samples taken from the training environment and the specified number of kernels is put randomly on the graph. Then, a policy is learned by the GGK- or OGK-based least-squares policy iteration using the training samples. Note that the actual number of bases is four times larger because of the extension of basis functions over the action space. The test performance is measured 5 times for each policy and the average is output. Figure 3.13 shows that GGKs significantly outperform OGKs, demonstrating that GGKs are promising even in the challenging setting with a high-dimensional large state space.

Figure 3.14 depicts the computation time of each method as a function of the number of kernels. This shows that the computation time monotonically increases as the number of kernels increases and the GGK-based and OGK-based methods have comparable computation time. However, given that the GGK-based method works much better than the OGK-based method with a smaller number of kernels (see Figure 3.13), the GGK-based method could be regarded as a computationally efficient alternative to the standard OGK-based method.

Finally, the trained Khepera robot is applied to map building.
FIGURE 3.12: Examples of obtained policies. (a) Geodesic Gaussian kernels. (b) Ordinary Gaussian kernels. The symbols ↑, ↓, ⊂, and ⊃ indicate forward, backward, left rotation, and right rotation actions.

Starting from an initial position (indicated by a square in Figure 3.15), the Khepera
robot takes an action 2000 times following the learned policy. Eighty kernels with Gaussian width σ = 1000 are used for value function approximation. The results of GGKs and OGKs are depicted in Figure 3.15. The graphs show that the GGK result gives a broader profile of the environment, while the OGK result only reveals a local area around the initial position.

Motion examples of the Khepera robot trained with GGK and OGK are illustrated in Figure 3.16 and Figure 3.17, respectively.
FIGURE 3.13: Average amount of exploration (averaged total rewards as a function of the number of kernels; GGK and OGK with Gaussian widths 200 and 1000).

FIGURE 3.14: Computation time [sec] as a function of the number of kernels.

FIGURE 3.15: Results of map building (cf. Figure 3.11(b)). (a) Geodesic Gaussian kernels. (b) Ordinary Gaussian kernels.
3.4 Remarks

The performance of least-squares policy iteration depends heavily on the choice of basis functions for value function approximation. In this chapter, the geodesic Gaussian kernel (GGK) was introduced and shown to possess several preferable properties such as smoothness along the graph and easy computability. It was also demonstrated that the policies obtained by GGKs are not as sensitive to the choice of the Gaussian kernel width, which is a useful property in practice. Also, the heuristic of putting Gaussian centers on goal states was shown to work well.
However, when the transition is highly stochastic (i.e., the transition probability has a wide support), the graph constructed based on the transition samples could be noisy. When an erroneous transition results in a short-cut over obstacles, the graph-based approach may not work well since the topology of the state space changes significantly.

FIGURE 3.16: A motion example of the Khepera robot trained with GGK (from left to right and top to bottom).

FIGURE 3.17: A motion example of the Khepera robot trained with OGK (from left to right and top to bottom).
Chapter 4

Sample Reuse in Policy Iteration

Off-policy reinforcement learning is aimed at efficiently using data samples gathered from a policy that is different from the currently optimized policy. A common approach is to use importance sampling techniques for compensating for the bias caused by the difference between the data-sampling policy and the target policy. In this chapter, we explain how importance sampling can be utilized to efficiently reuse previously collected data samples in policy iteration. After formulating the problem of off-policy value function approximation in Section 4.1, representative off-policy value function approximation techniques including adaptive importance sampling are reviewed in Section 4.2. Then, in Section 4.3, how the adaptivity of importance sampling can be optimally controlled is explained. In Section 4.4, off-policy value function approximation techniques are integrated in the framework of least-squares policy iteration for efficient sample reuse. Experimental results are shown in Section 4.5, and finally this chapter is concluded in Section 4.6.
4.1 Formulation

As explained in Section 2.2, least-squares policy iteration models the state-action value function Q^π(s, a) by a linear architecture,

\theta^\top \phi(s, a),

and learns the parameter θ so that the generalization error G is minimized:

G(\theta) = \mathbb{E}_{p^{\pi}(h)}\left[\frac{1}{T}\sum_{t=1}^{T}\big(\theta^{\top}\psi(s_t, a_t) - r(s_t, a_t)\big)^{2}\right].    (4.1)

Here, \mathbb{E}_{p^{\pi}(h)} denotes the expectation over history

h = [s_1, a_1, \ldots, s_T, a_T, s_{T+1}]

following the target policy π, and

\psi(s, a) = \phi(s, a) - \gamma\,\mathbb{E}_{\pi(a'|s')\,p(s'|s,a)}\big[\phi(s', a')\big].
When history samples following the target policy π are available, the situation is called on-policy reinforcement learning. In this case, just replacing the expectation contained in the generalization error G by sample averages gives a statistically consistent estimator (i.e., the estimated parameter converges to the optimal value as the number of samples goes to infinity).

Here, we consider the situation called off-policy reinforcement learning, where the sampling policy π̃ for collecting data samples is generally different from the target policy π. Let us denote the history samples following π̃ by

H^{\tilde{\pi}} = \{h^{\tilde{\pi}}_1, \ldots, h^{\tilde{\pi}}_N\},

where each episodic sample h^{π̃}_n is given as

h^{\tilde{\pi}}_n = [s^{\tilde{\pi}}_{1,n}, a^{\tilde{\pi}}_{1,n}, \ldots, s^{\tilde{\pi}}_{T,n}, a^{\tilde{\pi}}_{T,n}, s^{\tilde{\pi}}_{T+1,n}].
Under the off-policy setup, naive learning by minimizing the sample-approximated generalization error Ĝ_NIW leads to an inconsistent estimator:

\hat{G}_{\mathrm{NIW}}(\theta) = \frac{1}{NT}\sum_{n=1}^{N}\sum_{t=1}^{T}\Big(\theta^{\top}\hat{\psi}(s^{\tilde{\pi}}_{t,n}, a^{\tilde{\pi}}_{t,n}; H^{\tilde{\pi}}) - r(s^{\tilde{\pi}}_{t,n}, a^{\tilde{\pi}}_{t,n}, s^{\tilde{\pi}}_{t+1,n})\Big)^{2},

where

\hat{\psi}(s, a; H) = \phi(s, a) - \frac{\gamma}{|H_{(s,a)}|}\sum_{s'\in H_{(s,a)}}\mathbb{E}_{\pi(a'|s')}\big[\phi(s', a')\big].

H_{(s,a)} denotes a subset of H that consists of all transition samples from state s by action a, |H_{(s,a)}| denotes the number of elements in the set H_{(s,a)}, and the sum over s' ∈ H_{(s,a)} denotes the summation over all destination states s' in the set H_{(s,a)}. NIW stands for "No Importance Weight," which will be explained later.
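For a finite action set, the expectation over the next action in ψ̂ is just a weighted sum, so the sample-based construction can be sketched as below. This is a minimal illustration, assuming discrete and comparable states, with hypothetical `phi` and `target_policy` callables and H given as a list of transitions (s, a, r, s').

```python
import numpy as np

def psi_hat(s, a, H, phi, target_policy, actions, gamma):
    """Sample-based psi_hat(s, a; H) for a discrete action set.

    phi(s, a)          : feature vector of the linear architecture
    target_policy(a, s): probability pi(a|s) of the target policy
    H                  : list of transitions (s, a, r, s_next)
    """
    # H_(s,a): all transition samples from state s by action a
    next_states = [sn for (ss, aa, rr, sn) in H if ss == s and aa == a]
    if not next_states:
        return phi(s, a)
    expected_next = np.zeros_like(phi(s, a), dtype=float)
    for s_next in next_states:
        # E over the next action for each destination state s'
        expected_next += sum(target_policy(a_next, s_next) * phi(s_next, a_next)
                             for a_next in actions)
    return phi(s, a) - gamma * expected_next / len(next_states)
```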
This inconsistency problem can be avoided by gathering new samples following the target policy π, i.e., when the current policy is updated, new samples are gathered following the updated policy and the new samples are used for policy evaluation. However, when the data sampling cost is high, this is too expensive. It would be more cost efficient if previously gathered samples could be reused effectively.
4.2 Off-Policy Value Function Approximation

Importance sampling is a general technique for dealing with the off-policy situation. Suppose we have i.i.d. (independent and identically distributed) samples {x_n}_{n=1}^{N} from a strictly positive probability density function p̃(x). Using these samples, we would like to compute the expectation of a function g(x) over another probability density function p(x). A consistent approximation of the expectation is given by the importance-weighted average as

\frac{1}{N}\sum_{n=1}^{N} g(x_n)\frac{p(x_n)}{\tilde{p}(x_n)} \;\xrightarrow{N\to\infty}\; \mathbb{E}_{\tilde{p}(x)}\Big[g(x)\frac{p(x)}{\tilde{p}(x)}\Big] = \int g(x)\frac{p(x)}{\tilde{p}(x)}\tilde{p}(x)\,\mathrm{d}x = \int g(x)p(x)\,\mathrm{d}x = \mathbb{E}_{p(x)}[g(x)].

However, applying the importance sampling technique in off-policy reinforcement learning is not straightforward since our training samples of state s and action a are not i.i.d. due to the sequential nature of Markov decision processes (MDPs). In this section, representative importance-weighting techniques for MDPs are reviewed.
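As a quick sanity check of the identity above, the following sketch estimates E_p[g(x)] using samples drawn only from p̃. The densities here (a shifted Gaussian p and a standard Gaussian p̃) are assumed purely for illustration.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100_000)                    # x_n ~ p~ = N(0, 1)
w = gauss_pdf(x, 0.5, 1.0) / gauss_pdf(x, 0.0, 1.0)       # importance weights p(x_n)/p~(x_n)

g = lambda t: t ** 2
print(np.mean(g(x) * w))   # close to E_p[g(x)] = 0.5**2 + 1 = 1.25 for p = N(0.5, 1)
```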
4.2.1 Episodic Importance Weighting

Based on the independence between episodes,

p(h, h') = p(h)\,p(h') = p(s_1, a_1, \ldots, s_T, a_T, s_{T+1})\,p(s'_1, a'_1, \ldots, s'_T, a'_T, s'_{T+1}),

the generalization error G can be rewritten as

G(\theta) = \mathbb{E}_{p^{\tilde{\pi}}(h)}\left[\frac{1}{T}\sum_{t=1}^{T}\big(\theta^{\top}\psi(s_t, a_t) - r(s_t, a_t)\big)^{2}\, w_T\right],

where w_T is the episodic importance weight (EIW):

w_T = \frac{p^{\pi}(h)}{p^{\tilde{\pi}}(h)}.

p^π(h) and p^{π̃}(h) are the probability densities of observing episodic data h under policy π and π̃:

p^{\pi}(h) = p(s_1)\prod_{t=1}^{T}\pi(a_t|s_t)\,p(s_{t+1}|s_t, a_t),

p^{\tilde{\pi}}(h) = p(s_1)\prod_{t=1}^{T}\tilde{\pi}(a_t|s_t)\,p(s_{t+1}|s_t, a_t).

Note that the importance weights can be computed without explicitly knowing p(s_1) and p(s_{t+1}|s_t, a_t), since they are canceled out:

w_T = \frac{\prod_{t=1}^{T}\pi(a_t|s_t)}{\prod_{t=1}^{T}\tilde{\pi}(a_t|s_t)}.
Using the training data H^{π̃}, we can construct a consistent estimator of G as

\hat{G}_{\mathrm{EIW}}(\theta) = \frac{1}{NT}\sum_{n=1}^{N}\sum_{t=1}^{T}\Big(\theta^{\top}\hat{\psi}(s^{\tilde{\pi}}_{t,n}, a^{\tilde{\pi}}_{t,n}; H^{\tilde{\pi}}) - r(s^{\tilde{\pi}}_{t,n}, a^{\tilde{\pi}}_{t,n}, s^{\tilde{\pi}}_{t+1,n})\Big)^{2}\, \hat{w}_{T,n},    (4.2)

where

\hat{w}_{T,n} = \frac{\prod_{t=1}^{T}\pi(a^{\tilde{\pi}}_{t,n}|s^{\tilde{\pi}}_{t,n})}{\prod_{t=1}^{T}\tilde{\pi}(a^{\tilde{\pi}}_{t,n}|s^{\tilde{\pi}}_{t,n})}.
4.2.2 Per-Decision Importance Weighting

A crucial observation in EIW is that the error at the t-th step does not depend on the samples after the t-th step (Precup et al., 2000). Thus, the generalization error G can be rewritten as

G(\theta) = \mathbb{E}_{p^{\tilde{\pi}}(h)}\left[\frac{1}{T}\sum_{t=1}^{T}\big(\theta^{\top}\psi(s_t, a_t) - r(s_t, a_t)\big)^{2}\, w_t\right],

where w_t is the per-decision importance weight (PIW):

w_t = \frac{p(s_1)\prod_{t'=1}^{t}\pi(a_{t'}|s_{t'})\,p(s_{t'+1}|s_{t'}, a_{t'})}{p(s_1)\prod_{t'=1}^{t}\tilde{\pi}(a_{t'}|s_{t'})\,p(s_{t'+1}|s_{t'}, a_{t'})} = \frac{\prod_{t'=1}^{t}\pi(a_{t'}|s_{t'})}{\prod_{t'=1}^{t}\tilde{\pi}(a_{t'}|s_{t'})}.
Using the training data H^{π̃}, we can construct a consistent estimator as follows (cf. Eq. (4.2)):

\hat{G}_{\mathrm{PIW}}(\theta) = \frac{1}{NT}\sum_{n=1}^{N}\sum_{t=1}^{T}\Big(\theta^{\top}\hat{\psi}(s^{\tilde{\pi}}_{t,n}, a^{\tilde{\pi}}_{t,n}; H^{\tilde{\pi}}) - r(s^{\tilde{\pi}}_{t,n}, a^{\tilde{\pi}}_{t,n}, s^{\tilde{\pi}}_{t+1,n})\Big)^{2}\, \hat{w}_{t,n},

where

\hat{w}_{t,n} = \frac{\prod_{t'=1}^{t}\pi(a^{\tilde{\pi}}_{t',n}|s^{\tilde{\pi}}_{t',n})}{\prod_{t'=1}^{t}\tilde{\pi}(a^{\tilde{\pi}}_{t',n}|s^{\tilde{\pi}}_{t',n})}.

ŵ_{t,n} only contains the relevant terms up to the t-th step, while ŵ_{T,n} includes all the terms until the end of the episode.
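Both weights only require the two policies' action probabilities along a trajectory, as in the following sketch. The `pi` and `pi_tilde` callables returning π(a|s) and π̃(a|s) are assumed; the last entry of the returned vector is the episodic weight, the earlier entries are the per-decision weights.

```python
import numpy as np

def per_decision_weights(states, actions, pi, pi_tilde):
    """Return (w_1, ..., w_T); the final entry equals the episodic weight w_T."""
    ratios = np.array([pi(a, s) / pi_tilde(a, s) for s, a in zip(states, actions)])
    return np.cumprod(ratios)   # w_t = prod_{t'<=t} pi(a_t'|s_t') / pi~(a_t'|s_t')
```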
4.2.3 Adaptive Per-Decision Importance Weighting

The PIW estimator is guaranteed to be consistent. However, neither the EIW nor the PIW estimator is efficient in the statistical sense (Shimodaira, 2000), i.e., they do not have the smallest admissible variance. For this reason, the PIW estimator can have large variance in finite sample cases and therefore learning with PIW tends to be unstable in practice.

To improve the stability, it is important to control the trade-off between consistency and efficiency (or similarly bias and variance) based on training data. Here, the flattening parameter ν (∈ [0, 1]) is introduced to control the trade-off by slightly "flattening" the importance weights (Shimodaira, 2000; Sugiyama et al., 2007):

\hat{G}_{\mathrm{AIW}}(\theta) = \frac{1}{NT}\sum_{n=1}^{N}\sum_{t=1}^{T}\Big(\theta^{\top}\hat{\psi}(s^{\tilde{\pi}}_{t,n}, a^{\tilde{\pi}}_{t,n}; H^{\tilde{\pi}}) - r(s^{\tilde{\pi}}_{t,n}, a^{\tilde{\pi}}_{t,n}, s^{\tilde{\pi}}_{t+1,n})\Big)^{2}\, (\hat{w}_{t,n})^{\nu},

where AIW stands for the adaptive per-decision importance weight. When ν = 0, AIW is reduced to NIW and therefore it has large bias but has relatively small variance. On the other hand, when ν = 1, AIW is reduced to PIW. Thus, it has small bias but has relatively large variance. In practice, an intermediate value of ν will yield the best performance.
Let Ψ̂ be the NT × B matrix, Ŵ be the NT × NT diagonal matrix, and r be the NT-dimensional vector defined as

\hat{\Psi}_{N(t-1)+n,\,b} = \hat{\psi}_b(s_{t,n}, a_{t,n}),

\hat{W}_{N(t-1)+n,\,N(t-1)+n} = \hat{w}_{t,n},

r_{N(t-1)+n} = r(s_{t,n}, a_{t,n}, s_{t+1,n}).

Then Ĝ_AIW can be compactly expressed as

\hat{G}_{\mathrm{AIW}}(\theta) = \frac{1}{NT}(\hat{\Psi}\theta - r)^{\top}\hat{W}^{\nu}(\hat{\Psi}\theta - r).

Because this is a convex quadratic function with respect to θ, its global minimizer θ̂_AIW can be analytically obtained by setting its derivative to zero as

\hat{\theta}_{\mathrm{AIW}} = (\hat{\Psi}^{\top}\hat{W}^{\nu}\hat{\Psi})^{-1}\hat{\Psi}^{\top}\hat{W}^{\nu} r.

This means that the cost for computing θ̂_AIW is essentially the same as that of θ̂_NIW, which is given as follows (see Section 2.2.2):

\hat{\theta}_{\mathrm{NIW}} = (\hat{\Psi}^{\top}\hat{\Psi})^{-1}\hat{\Psi}^{\top} r.
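Given the design matrix Ψ̂, the per-decision weights, and the reward vector, the flattened weighted least-squares solution can be sketched as follows. This is a minimal illustration of the closed form above; numerical safeguards such as a small ridge term, which the text does not discuss, are omitted.

```python
import numpy as np

def theta_aiw(Psi, w, r, nu):
    """theta_AIW = (Psi^T W^nu Psi)^{-1} Psi^T W^nu r, with W = diag(w).

    Psi : (NT, B) matrix of psi_hat values
    w   : (NT,) per-decision importance weights
    r   : (NT,) observed rewards
    nu  : flattening parameter in [0, 1]; nu = 0 gives NIW, nu = 1 gives PIW
    """
    wv = w ** nu                        # flattened weights on the diagonal of W^nu
    A = Psi.T @ (wv[:, None] * Psi)     # Psi^T W^nu Psi without forming the NT x NT matrix
    b = Psi.T @ (wv * r)                # Psi^T W^nu r
    return np.linalg.solve(A, b)
```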
4.2.4 Illustration

Here, the influence of the flattening parameter ν on the estimator θ̂_AIW is illustrated using the chain-walk MDP illustrated in Figure 4.1.

FIGURE 4.1: Ten-state chain-walk MDP.

The MDP consists of 10 states

S = \{s^{(1)}, \ldots, s^{(10)}\}

and two actions

A = \{a^{(1)}, a^{(2)}\} = \{\text{"L"}, \text{"R"}\}.

The reward +1 is given when visiting s^{(1)} and s^{(10)}. The transition probability p is indicated by the numbers attached to the arrows in the figure. For example, p(s^{(2)}|s^{(1)}, a = "R") = 0.9 and p(s^{(1)}|s^{(1)}, a = "R") = 0.1 mean that the agent can successfully move to the right node with probability 0.9 (indicated by solid arrows in the figure) and the action fails with probability 0.1 (indicated by dashed arrows in the figure). Six Gaussian kernels with standard deviation σ = 10 are used as basis functions, and kernel centers are located at s^{(1)}, s^{(5)}, and s^{(10)}. More specifically, the basis functions φ(s, a) = (φ_1(s, a), ..., φ_6(s, a)) are defined as

\phi_{3(i-1)+j}(s, a) = I(a = a^{(i)}) \exp\left(-\frac{(s - c_j)^2}{2\sigma^2}\right),

for i = 1, 2 and j = 1, 2, 3, where

c_1 = 1,\quad c_2 = 5,\quad c_3 = 10,

and

I(x) = \begin{cases} 1 & \text{if } x \text{ is true},\\ 0 & \text{if } x \text{ is not true}.\end{cases}
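For this small example the basis vector can be written down directly. The sketch below is an illustration only; it evaluates φ(s, a) for a numeric state s ∈ {1, ..., 10} and an action index i ∈ {1, 2}, with the parameters stated above as defaults.

```python
import numpy as np

def chain_walk_phi(s, action_idx, centers=(1.0, 5.0, 10.0), sigma=10.0, n_actions=2):
    """phi_{3(i-1)+j}(s, a) = I(a = a^(i)) * exp(-(s - c_j)^2 / (2 sigma^2))."""
    kernels = np.exp(-(s - np.asarray(centers)) ** 2 / (2.0 * sigma ** 2))
    phi = np.zeros(n_actions * len(centers))
    j = action_idx - 1                   # action_idx in {1, 2}
    phi[3 * j:3 * j + 3] = kernels       # the indicator I(a = a^(i)) selects one block
    return phi
```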
The experiments are repeated 50 times, where the sampling policy π̃(a|s) and the current policy π(a|s) are chosen randomly in each trial such that π̃ ≠ π. The discount factor is set at γ = 0.9. The model parameter θ̂_AIW is learned from the training samples H^{π̃} and its generalization error is computed from the test samples H^π.
FIGURE 4.2: Left: True generalization error G averaged over 50 trials as a function of the flattening parameter ν in the 10-state chain-walk problem, for (a) N = 50, (b) N = 30, and (c) N = 10. The number of steps is fixed at T = 10. The trend of G differs depending on the number N of episodic samples. Right: Generalization error estimated by 5-fold importance-weighted cross-validation (IWCV), Ĝ_IWCV, averaged over 50 trials as a function of the flattening parameter ν in the same problem. IWCV nicely captures the trend of the true generalization error G.

The left column of Figure 4.2 depicts the true generalization error G averaged over 50 trials as a function of the flattening parameter ν for N = 10, 30, and 50. Figure 4.2(a) shows that when the number of episodes is large (N = 50), the generalization error tends to decrease as the flattening parameter increases. This would be a natural result due to the consistency of θ̂_AIW when ν = 1. On the other hand, Figure 4.2(b) shows that when the number of episodes is not large (N = 30), ν = 1 performs rather poorly. This implies that the consistent estimator tends to be unstable when the number of episodes is not large enough; ν = 0.7 works the best in this case. Figure 4.2(c) shows the results when the number of episodes is further reduced (N = 10). This illustrates that the consistent estimator with ν = 1 is even worse than the ordinary estimator (ν = 0) because the bias is dominated by large variance. In this case, the best ν is even smaller and is achieved at ν = 0.4.

The above results show that AIW can outperform PIW, particularly when only a small number of training samples are available, provided that the flattening parameter ν is chosen appropriately.
4.3 Automatic Selection of Flattening Parameter

In this section, the problem of selecting the flattening parameter in AIW is addressed.

4.3.1 Importance-Weighted Cross-Validation

Generally, the best ν tends to be large (small) when the number of training samples is large (small). However, this general trend is not sufficient to fine-tune the flattening parameter since the best value of ν depends on training samples, policies, the model of value functions, etc. In this section, we discuss how model selection is performed to choose the best flattening parameter ν automatically from the training data and policies.
Ideally, the value of ν should be set so that the generalization error G is minimized, but the true G is not accessible in practice. To cope with this problem, we can use cross-validation (see Section 2.2.4) for estimating the generalization error G. However, in the off-policy scenario where the sampling policy π̃ and the target policy π are different, ordinary cross-validation gives a biased estimate of G. In the off-policy scenario, importance-weighted cross-validation (IWCV) (Sugiyama et al., 2007) is more useful, where the cross-validation estimate of the generalization error is obtained with importance weighting.

More specifically, let us divide a training dataset H^{π̃} containing N episodes into K subsets {H^{π̃}_k}_{k=1}^{K} of approximately the same size. For simplicity, we assume that N is divisible by K. Let θ̂^k_AIW be the parameter learned from H \ H_k (i.e., all samples without H_k). Then, the generalization error is estimated with importance weighting as

\hat{G}_{\mathrm{IWCV}} = \frac{1}{K}\sum_{k=1}^{K}\hat{G}^{k}_{\mathrm{IWCV}},

where

\hat{G}^{k}_{\mathrm{IWCV}} = \frac{K}{NT}\sum_{h\in H^{\tilde{\pi}}_k}\sum_{t=1}^{T}\Big(\hat{\theta}^{k\top}_{\mathrm{AIW}}\hat{\psi}(s_t, a_t; H^{\tilde{\pi}}_k) - r(s_t, a_t, s_{t+1})\Big)^{2}\, \hat{w}_t.

The generalization error estimate Ĝ_IWCV is computed for all candidate models (in the current setting, a candidate model corresponds to a different value of the flattening parameter ν) and the one that minimizes the estimated generalization error is chosen:

\hat{\nu}_{\mathrm{IWCV}} = \mathop{\mathrm{argmin}}_{\nu}\, \hat{G}_{\mathrm{IWCV}}.

FIGURE 4.3: True generalization error G averaged over 50 trials obtained by NIW (ν = 0), PIW (ν = 1), and AIW+IWCV (ν is chosen by IWCV) in the 10-state chain-walk MDP, as a function of the number of episodes.
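Flattening-parameter selection by K-fold IWCV can be organized as in the following structural sketch. The helpers `fit_theta` (the AIW fit on the K − 1 training folds) and `iw_error` (the importance-weighted squared error of Ĝ^k_IWCV on the held-out fold) are hypothetical stand-ins for the computations defined above.

```python
import numpy as np

def select_nu_by_iwcv(episodes, fit_theta, iw_error,
                      candidates=np.linspace(0.0, 1.0, 11), K=5):
    """Choose the flattening parameter nu minimizing the IWCV error estimate."""
    folds = np.array_split(np.arange(len(episodes)), K)
    best_nu, best_err = None, np.inf
    for nu in candidates:
        errs = []
        for k in range(K):
            held_out = [episodes[i] for i in folds[k]]
            train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
            theta_k = fit_theta([episodes[i] for i in train_idx], nu)
            errs.append(iw_error(theta_k, held_out))   # G^k_IWCV on the held-out fold
        err = float(np.mean(errs))                      # G_IWCV: average over the K folds
        if err < best_err:
            best_nu, best_err = nu, err
    return best_nu
```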
4.3.2 Illustration

To illustrate how IWCV works, let us use the same numerical examples as Section 4.2.4. The right column of Figure 4.2 depicts the generalization error estimated by 5-fold IWCV averaged over 50 trials as a function of the flattening parameter ν. The graphs show that IWCV nicely captures the trend of the true generalization error for all three cases.

Figure 4.3 describes, as a function of the number N of episodes, the average true generalization error obtained by NIW (AIW with ν = 0), PIW (AIW with ν = 1), and AIW+IWCV (ν ∈ {0.0, 0.1, ..., 0.9, 1.0} is selected in each trial using 5-fold IWCV). This result shows that the improvement of the performance by NIW saturates when N ≥ 30, implying that the bias caused by NIW is not negligible. The performance of PIW is worse than NIW when N ≤ 20, which is caused by the large variance of PIW. On the other hand, AIW+IWCV consistently gives good performance for all N, illustrating the strong adaptation ability of AIW+IWCV.
4.4 Sample-Reuse Policy Iteration

In this section, AIW+IWCV is extended from single-step policy evaluation to full policy iteration. This method is called sample-reuse policy iteration (SRPI).

4.4.1 Algorithm

Let us denote the policy at the L-th iteration by π_L. In on-policy policy iteration, new data samples H^{π_L} are collected following the new policy π_L during the policy evaluation step. Thus, previously collected data samples H^{π_1}, ..., H^{π_{L−1}} are not used:

π_1 --(E: H^{π_1})--> Q̂^{π_1} --(I)--> π_2 --(E: H^{π_2})--> Q̂^{π_2} --(I)--> π_3 --(E: H^{π_3})--> ··· --(I)--> π_L,

where "E: H" indicates the policy evaluation step using the data sample H and "I" indicates the policy improvement step. It would be more cost efficient if all previously collected data samples were reused in policy evaluation:

π_1 --(E: H^{π_1})--> Q̂^{π_1} --(I)--> π_2 --(E: H^{π_1}, H^{π_2})--> Q̂^{π_2} --(I)--> π_3 --(E: H^{π_1}, H^{π_2}, H^{π_3})--> ··· --(I)--> π_L.

Since the previous policies and the current policy are different in general, an off-policy scenario needs to be explicitly considered to reuse previously collected data samples. Here, we explain how AIW+IWCV can be used in this situation. For this purpose, the definition of Ĝ_AIW is extended so that multiple sampling policies π_1, ..., π_L are taken into account:

\hat{G}^{L}_{\mathrm{AIW}} = \frac{1}{LNT}\sum_{l=1}^{L}\sum_{n=1}^{N}\sum_{t=1}^{T}\Big(\theta^{\top}\hat{\psi}(s^{\pi_l}_{t,n}, a^{\pi_l}_{t,n}; \{H^{\pi_l}\}_{l=1}^{L}) - r(s^{\pi_l}_{t,n}, a^{\pi_l}_{t,n}, s^{\pi_l}_{t+1,n})\Big)^{2}\left(\frac{\prod_{t'=1}^{t}\pi_L(a^{\pi_l}_{t',n}|s^{\pi_l}_{t',n})}{\prod_{t'=1}^{t}\pi_l(a^{\pi_l}_{t',n}|s^{\pi_l}_{t',n})}\right)^{\nu_L},    (4.3)

where Ĝ^L_AIW is the generalization error estimated at the L-th policy evaluation using AIW. The flattening parameter ν_L is chosen based on IWCV before performing policy evaluation.
FIGURE 4.4: The performance of policies learned in three scenarios: ν = 0, ν = 1, and SRPI (ν is chosen by IWCV) in the 10-state chain-walk problem, for (a) N = 5 and (b) N = 10. The performance is measured by the average return computed from test samples over 30 trials. The agent collects training sample H^{π_L} (N = 5 or 10 with T = 10) at every iteration and performs policy evaluation using all collected samples H^{π_1}, ..., H^{π_L}. The total number of episodes means the number of training episodes (N × L) collected by the agent in policy iteration.
4.4.2 Illustration

Here, the behavior of SRPI is illustrated under the same experimental setup as Section 4.3.2. Let us consider three scenarios: ν is fixed at 0, ν is fixed at 1, and ν is chosen by IWCV (i.e., SRPI). The agent collects samples H^{π_L} in each policy iteration following the current policy π_L and computes θ̂^L_AIW from all collected samples H^{π_1}, ..., H^{π_L} using Eq. (4.3). Three Gaussian kernels are used as basis functions, where kernel centers are randomly selected from the state space S in each trial. The initial policy π_1 is chosen randomly and Gibbs policy improvement,

\pi(a|s) \longleftarrow \frac{\exp(Q^{\pi}(s, a)/\tau)}{\sum_{a'\in\mathcal{A}}\exp(Q^{\pi}(s, a')/\tau)},    (4.4)

is performed with τ = 2L.
Figure 4.4 depicts the average return over 30 trials when N = 5 and 10 with a fixed number of steps (T = 10). The graphs show that SRPI provides stable and fast learning of policies, while the performance improvement of policies learned with ν = 0 saturates in early iterations. The method with ν = 1 can improve policies well, but its progress tends to be behind SRPI.
Figure 4.5 depicts the average value of the flattening parameter used in SRPI as a function of the total number of episodic samples. The graphs show that the value of the flattening parameter chosen by IWCV tends to rise in the beginning and go down later. At first sight, this does not agree with the general trend of preferring a low-variance estimator in early stages and preferring a low-bias estimator later. However, this result is still consistent with the general trend: when the return increases rapidly (the total number of episodic samples is up to 15 when N = 5 and 30 when N = 10 in Figure 4.5), the value of the flattening parameter increases (see Figure 4.4). After that, the return does not increase anymore (see Figure 4.4) since the policy iteration has already converged. Then, it is natural to prefer a small flattening parameter (Figure 4.5) since the sample selection bias becomes mild after convergence.

These results show that SRPI can effectively reuse previously collected samples by appropriately tuning the flattening parameter according to the condition of data samples, policies, etc.

FIGURE 4.5: Flattening parameter values used by SRPI averaged over 30 trials as a function of the total number of episodic samples in the 10-state chain-walk problem, for (a) N = 5 and (b) N = 10.
4.5 Numerical Examples

In this section, the performance of SRPI is numerically investigated in more complex tasks.

4.5.1 Inverted Pendulum

First, we consider the task of the swing-up inverted pendulum illustrated in Figure 4.6, which consists of a pole hinged at the top of a cart. The goal of the task is to swing the pole up by moving the cart. There are three actions: applying positive force +50 (kg·m/s²) to the cart to move right, negative force −50 to move left, and zero force to just coast. That is, the action space A is discrete and described by

A = \{50, -50, 0\}\ \text{kg}\cdot\text{m/s}^2.

FIGURE 4.6: Illustration of the inverted pendulum task.

Note that the force itself is not strong enough to swing the pole up. Thus the cart needs to be moved back and forth several times to swing the pole up. The state space S is continuous and consists of the angle ϕ [rad] (∈ [0, 2π]) and the angular velocity ϕ̇ [rad/s] (∈ [−π, π]). Thus, a state s is described by a two-dimensional vector s = (ϕ, ϕ̇)^⊤. The angle ϕ and angular velocity ϕ̇ are updated as follows:

\varphi_{t+1} = \varphi_t + \dot{\varphi}_{t+1}\Delta t,

\dot{\varphi}_{t+1} = \dot{\varphi}_t + \frac{9.8\sin(\varphi_t) - \alpha w d (\dot{\varphi}_t)^2\sin(2\varphi_t)/2 + \alpha\cos(\varphi_t)a_t}{4l/3 - \alpha w d \cos^2(\varphi_t)}\Delta t,

where α = 1/(W + w) and a_t (∈ A) is the action chosen at time t. The reward function r(s, a, s') is defined as

r(s, a, s') = \cos(\varphi_{s'}),

where ϕ_{s'} denotes the angle ϕ of state s'. The problem parameters are set as follows: the mass of the cart W is 8 [kg], the mass of the pole w is 2 [kg], the length of the pole d is 0.5 [m], and the simulation time step Δt is 0.1 [s].

Forty-eight Gaussian kernels with standard deviation σ = π are used as basis functions, and kernel centers are located over the following grid points:

\{0, 2\pi/3, 4\pi/3, 2\pi\} \times \{-3\pi, -\pi, \pi, 3\pi\}.

That is, the basis functions φ(s, a) = (φ_1(s, a), ..., φ_48(s, a)) are set as

\phi_{16(i-1)+j}(s, a) = I(a = a^{(i)}) \exp\left(-\frac{\|s - c_j\|^2}{2\sigma^2}\right),

for i = 1, 2, 3 and j = 1, ..., 16, where

c_1 = (0, -3\pi)^{\top},\ c_2 = (0, -\pi)^{\top},\ \ldots,\ c_{16} = (2\pi, 3\pi)^{\top}.
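The update rule above translates directly into a simulation step. The sketch below uses the stated parameters (W = 8, w = 2, d = 0.5, Δt = 0.1) and keeps the symbols as written; the `l` in the denominator is taken here to be the pole length, which is an assumption about the notation rather than something spelled out in the text.

```python
import numpy as np

W, w, d, dt = 8.0, 2.0, 0.5, 0.1     # cart mass, pole mass, pole length, time step
alpha = 1.0 / (W + w)
l = d                                 # 'l' in the denominator assumed to be the pole length

def pendulum_step(phi, phi_dot, a):
    """One Euler step of the swing-up inverted pendulum; a is in {+50, 0, -50}."""
    num = (9.8 * np.sin(phi)
           - alpha * w * d * phi_dot ** 2 * np.sin(2 * phi) / 2
           + alpha * np.cos(phi) * a)
    den = 4.0 * l / 3.0 - alpha * w * d * np.cos(phi) ** 2
    phi_dot_next = phi_dot + (num / den) * dt
    phi_next = phi + phi_dot_next * dt        # the angle update uses the new velocity
    return phi_next, phi_dot_next

def pendulum_reward(phi_next):
    return np.cos(phi_next)                   # r(s, a, s') = cos(phi_{s'})
```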
FIGURE 4.7: Results of SRPI in the inverted pendulum task. The agent collects training sample H^{π_L} (N = 10 and T = 100) in each iteration and policy evaluation is performed using all collected samples H^{π_1}, ..., H^{π_L}. (a) The performance of policies learned with ν = 0, ν = 1, and SRPI. The performance is measured by the average return computed from test samples over 20 trials. The total number of episodes means the number of training episodes (N × L) collected by the agent in policy iteration. (b) Average flattening parameter values chosen by IWCV in SRPI over 20 trials.
The initial policy π_1(a|s) is chosen randomly, and the initial-state probability density p(s) is set to be uniform. The agent collects data samples H^{π_L} (N = 10 and T = 100) at each policy iteration following the current policy π_L. The discount factor is set at γ = 0.95 and the policy is updated by Gibbs policy improvement (4.4) with τ = L.

Figure 4.7(a) describes the performance of learned policies. The graph shows that SRPI nicely improves the performance throughout the entire policy iteration. On the other hand, the performance when the flattening parameter is fixed at ν = 0 or ν = 1 is not properly improved after the middle of iterations. The average flattening parameter values depicted in Figure 4.7(b) show that the flattening parameter tends to increase quickly in the beginning and then is kept at medium values. Motion examples of the inverted pendulum by SRPI with ν chosen by IWCV and ν = 1 are illustrated in Figure 4.8 and Figure 4.9, respectively.

These results indicate that the flattening parameter is well adjusted to reuse the previously collected samples effectively for policy evaluation, and thus SRPI can outperform the other methods.
FIGURE 4.8: Motion examples of the inverted pendulum by SRPI with ν chosen by IWCV (from left to right and top to bottom).

FIGURE 4.9: Motion examples of the inverted pendulum by SRPI with ν = 1 (from left to right and top to bottom).

4.5.2 Mountain Car

Next, we consider the mountain car task illustrated in Figure 4.10. The task consists of a car and two hills whose landscape is described by sin(3x). The top of the right hill is the goal to which we want to guide the car. There are three actions,

\{+0.2, -0.2, 0\},

which are the values of the force applied to the car. Note that the force of the car is not strong enough to climb up the slope to reach the goal. The state space S is described by the horizontal position x [m] (∈ [−1.2, 0.5]) and the velocity ẋ [m/s] (∈ [−1.5, 1.5]): s = (x, ẋ)^⊤.

FIGURE 4.10: Illustration of the mountain car task.

The position x and velocity ẋ are updated by

x_{t+1} = x_t + \dot{x}_{t+1}\Delta t,

\dot{x}_{t+1} = \dot{x}_t + \Big(-9.8\,w\cos(3x_t) + \frac{a_t}{w} - k\dot{x}_t\Big)\Delta t,

where a_t (∈ A) is the action chosen at the time t. The reward function R(s, a, s') is defined as

R(s, a, s') = \begin{cases} 1 & \text{if } x_{s'} \ge 0.5,\\ -0.01 & \text{otherwise},\end{cases}

where x_{s'} denotes the horizontal position x of state s'. The problem parameters are set as follows: the mass of the car w is 0.2 [kg], the friction coefficient k is 0.3, and the simulation time step Δt is 0.1 [s].

The same experimental setup as the swing-up inverted pendulum task in Section 4.5.1 is used, except that the number of Gaussian kernels is 36, the kernel standard deviation is set at σ = 1, and the kernel centers are allocated over the following grid points:

\{-1.2, -0.35, 0.5\} \times \{-1.5, -0.5, 0.5, 1.5\}.
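The same kind of simulation step applies to the mountain car; the sketch below uses the stated parameters (w = 0.2, k = 0.3, Δt = 0.1) and the reward definition above.

```python
import numpy as np

w, k, dt = 0.2, 0.3, 0.1     # car mass, friction coefficient, time step

def mountain_car_step(x, x_dot, a):
    """One Euler step; a is the applied force in {+0.2, -0.2, 0}."""
    x_dot_next = x_dot + (-9.8 * w * np.cos(3 * x) + a / w - k * x_dot) * dt
    x_next = x + x_dot_next * dt                 # position update uses the new velocity
    return x_next, x_dot_next

def mountain_car_reward(x_next):
    return 1.0 if x_next >= 0.5 else -0.01       # R(s, a, s') from the definition above
```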
FIGURE 4.11: Results of sample-reuse policy iteration in the mountain-car task. The agent collects training sample H^{π_L} (N = 10 and T = 100) at every iteration and policy evaluation is performed using all collected samples H^{π_1}, ..., H^{π_L}. (a) The performance is measured by the average return computed from test samples over 20 trials. The total number of episodes means the number of training episodes (N × L) collected by the agent in policy iteration. (b) Average flattening parameter values used by SRPI over 20 trials.

Figure 4.11(a) shows the performance of learned policies measured by the
average return computed from the test samples. The graph shows similar tendencies to the swing-up inverted pendulum task for SRPI and ν = 1, while the method with ν = 0 performs relatively well this time. This implies that the bias in the previously collected samples does not affect the estimation of the value functions that strongly, because the function approximator is better suited to represent the value function for this problem. The average flattening parameter values (cf. Figure 4.11(b)) show that the flattening parameter decreases soon after the increase in the beginning, and then the smaller values tend to be chosen. This indicates that SRPI tends to use low-variance estimators in this task. Motion examples by SRPI with ν chosen by IWCV are illustrated in Figure 4.12.

These results show that SRPI can perform stable and fast learning by effectively reusing previously collected data.
FIGURE 4.12: Motion examples of the mountain car by SRPI with ν chosen by IWCV (from left to right and top to bottom).

4.6 Remarks

Instability has been one of the critical limitations of importance-sampling techniques, which often makes off-policy methods impractical. To overcome this weakness, an adaptive importance-sampling technique was introduced for controlling the trade-off between consistency and stability in value function approximation. Furthermore, importance-weighted cross-validation was introduced for automatically choosing the trade-off parameter.

The range of application of importance sampling is not limited to policy iteration. We will explain how importance sampling can be utilized for sample reuse in the policy search frameworks in Chapter 8 and Chapter 9.
Chapter 5

Active Learning in Policy Iteration

In Chapter 4, we considered the off-policy situation where a data-collecting policy and the target policy are different. In the framework of sample-reuse policy iteration, new samples are always chosen following the target policy. However, a clever choice of sampling policies can actually further improve the performance. The topic of choosing sampling policies is called active learning in statistics and machine learning. In this chapter, we address the problem of choosing sampling policies in sample-reuse policy iteration. In Section 5.1, we explain how a statistical active learning method can be employed for optimizing the sampling policy in value function approximation. In Section 5.2, we introduce active policy iteration, which incorporates the active learning idea into the framework of sample-reuse policy iteration. The effectiveness of active policy iteration is numerically investigated in Section 5.3, and finally this chapter is concluded in Section 5.4.
5.1 Efficient Exploration with Active Learning

The accuracy of estimated value functions depends on training samples collected following the sampling policy π̃(a|s). In this section, we explain how a statistical active learning method (Sugiyama, 2006) can be employed for value function approximation.

5.1.1 Problem Setup

Let us consider a situation where collecting state-action trajectory samples is easy and cheap, but gathering immediate reward samples is hard and expensive. For example, consider a robot-arm control task of hitting a ball with a bat and driving the ball as far away as possible (see Figure 5.6). Let us adopt the carry of the ball as the immediate reward. In this setting, obtaining state-action trajectory samples of the robot arm is easy and relatively cheap since we just need to control the robot arm and record its state-action trajectories over time. However, explicitly computing the carry of the ball from the state-action samples is hard due to friction and elasticity of links, air resistance, air currents, and so on. For this reason, in practice, we may have to put the robot in open space, let the robot really hit the ball, and measure the carry of the ball manually. Thus, gathering immediate reward samples is much more expensive than the state-action trajectory samples. In such a situation, immediate reward samples are too expensive to be used for designing the sampling policy. Only state-action trajectory samples may be used for designing sampling policies.

The goal of active learning in the current setup is to determine the sampling policy so that the expected generalization error is minimized. However, since the generalization error is not accessible in practice, it needs to be estimated from samples for performing active learning. A difficulty of estimating the generalization error in the context of active learning is that its estimation needs to be carried out only from state-action trajectory samples without using immediate reward samples. This means that standard generalization error estimation techniques such as cross-validation cannot be employed. Below, we explain how the generalization error can be estimated without the reward samples.
5.1.2 Decomposition of Generalization Error

The information we are allowed to use for estimating the generalization error is a set of roll-out samples without immediate rewards:

H^{\tilde{\pi}} = \{h^{\tilde{\pi}}_1, \ldots, h^{\tilde{\pi}}_N\},

where each episodic sample h^{π̃}_n is given as

h^{\tilde{\pi}}_n = [s^{\tilde{\pi}}_{1,n}, a^{\tilde{\pi}}_{1,n}, \ldots, s^{\tilde{\pi}}_{T,n}, a^{\tilde{\pi}}_{T,n}, s^{\tilde{\pi}}_{T+1,n}].

Let us define the deviation of an observed immediate reward r^{π̃}_{t,n} from its expectation r(s^{π̃}_{t,n}, a^{π̃}_{t,n}) as

\epsilon^{\tilde{\pi}}_{t,n} = r^{\tilde{\pi}}_{t,n} - r(s^{\tilde{\pi}}_{t,n}, a^{\tilde{\pi}}_{t,n}).

Note that ε^{π̃}_{t,n} could be regarded as additive noise in the context of least-squares function fitting. By definition, ε^{π̃}_{t,n} has mean zero and its variance generally depends on s^{π̃}_{t,n} and a^{π̃}_{t,n}, i.e., heteroscedastic noise (Bishop, 2006). However, since estimating the variance of ε^{π̃}_{t,n} without using reward samples is not generally possible, we ignore the dependence of the variance on s^{π̃}_{t,n} and a^{π̃}_{t,n}. Let us denote the input-independent common variance by σ².
We would like to estimate the generalization error,

G(\hat{\theta}) = \mathbb{E}_{p^{\tilde{\pi}}(h)}\left[\frac{1}{T}\sum_{t=1}^{T}\big(\hat{\theta}^{\top}\hat{\psi}(s_t, a_t; H^{\tilde{\pi}}) - r(s_t, a_t)\big)^{2}\right],

from H^{π̃}. Its expectation over "noise" can be decomposed as follows (Sugiyama, 2006):

\mathbb{E}_{\epsilon^{\tilde{\pi}}}\big[G(\hat{\theta})\big] = \mathrm{Bias} + \mathrm{Variance} + \mathrm{ModelError},

where \mathbb{E}_{\epsilon^{\tilde{\pi}}} denotes the expectation over "noise" \{\epsilon^{\tilde{\pi}}_{t,n}\}_{t=1,n=1}^{T,N}. "Bias," "Variance," and "ModelError" are the bias term, the variance term, and the model error term defined by
”
#
T
1Xn
hi
o2
Bias=E
b
pe
π(h)
(E
θ−θ∗)⊤b
ψ(s
,
T
ǫe
π
t,at;He
π)
t=1
”
#
T
1Xn
hi
o2
Variance=E
b
pe
π(h)
(b
![Page 240: Masashi Sugiyama-Statistical Reinforcement Learning_ Modern Machine Learning Approaches-Chapman and Hall_CRC (2015)](https://reader034.fdocuments.net/reader034/viewer/2022042703/563db7ac550346aa9a8ceb0d/html5/thumbnails/240.jpg)
θ−E
θ)⊤b
ψ(s
,
T
ǫe
π
t,at;He
π)
t=1
”
#
T
1X
ModelError=Epeπ(h)
(θ∗⊤b
ψ(s
.
T
t,at;He
π)−r(st,at))2
t=1
θ∗denotestheoptimalparameterinthemodel:”
#
T
1X
θ∗=argminEpeπ(h)(θ⊤ψ(st,at)−r(st,at))2.
θ
Tt=1
Note that, for a linear estimator θ̂ such that

\hat{\theta} = \hat{L} r,

where L̂ is some matrix and r is the NT-dimensional vector defined as

r_{N(t-1)+n} = r(s_{t,n}, a_{t,n}, s_{t+1,n}),

the variance term can be expressed in a compact form as

\mathrm{Variance} = \sigma^{2}\,\mathrm{tr}(U\hat{L}\hat{L}^{\top}),

where the matrix U is defined as

U = \mathbb{E}_{p^{\tilde{\pi}}(h)}\left[\frac{1}{T}\sum_{t=1}^{T}\hat{\psi}(s_t, a_t; H^{\tilde{\pi}})\hat{\psi}(s_t, a_t; H^{\tilde{\pi}})^{\top}\right].    (5.1)
5.1.3 Estimation of Generalization Error

Since we are interested in finding a minimizer of the generalization error with respect to π̃, the model error, which is constant, can be safely ignored in generalization error estimation. On the other hand, the bias term includes the unknown optimal parameter θ*. Thus, it may not be possible to estimate the bias term without using reward samples. Similarly, it may not be possible to estimate the "noise" variance σ² included in the variance term without using reward samples.

It is known that the bias term is small enough to be neglected when the model is approximately correct (Sugiyama, 2006), i.e., θ*⊤ ψ̂(s, a) approximately agrees with the true function r(s, a). Then we have

E_{ǫ^π̃}[ G(θ̂) ] − Model Error − Bias ∝ tr(U L̂ L̂⊤),   (5.2)

which does not require immediate reward samples for its computation. Since E_{p^π̃(h)} included in U is not accessible (see Eq. (5.1)), U is replaced by its consistent estimator Û:

Û = (1/NT) Σ_{n=1}^N Σ_{t=1}^T ψ̂(s^π̃_{t,n}, a^π̃_{t,n}; H^π̃) ψ̂(s^π̃_{t,n}, a^π̃_{t,n}; H^π̃)⊤ ŵ_{t,n}.

Consequently, the following generalization error estimator is obtained:

J = tr(Û L̂ L̂⊤),

which can be computed only from H^π̃ and thus can be employed in the active learning scenarios. If it is possible to gather H^π̃ multiple times, the above J may be computed multiple times and their average is used as a generalization error estimator.

Note that the values of the generalization error estimator J and the true generalization error G are not directly comparable, since irrelevant additive and multiplicative constants are ignored (see Eq. (5.2)). However, this is no problem as long as the estimator J has a similar profile to the true error G as a function of the sampling policy π̃, since the purpose of deriving a generalization error estimator in active learning is not to approximate the true generalization error itself, but to approximate the minimizer of the true generalization error with respect to the sampling policy π̃.
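As a rough illustration of how J might be computed from reward-less roll-out samples, here is a minimal NumPy sketch for a single sampling-policy candidate (not from the book; the feature matrix `Psi`, the importance-weight vector `w`, and the small ridge constant `reg` are hypothetical placeholders):

```python
import numpy as np

def generalization_error_estimate(Psi, w, reg=1e-3):
    """J = tr(U_hat L_hat L_hat^T) for an importance-weighted least-squares fit.

    Psi : (NT, B) matrix whose rows are psi_hat(s, a) evaluated on the roll-outs
    w   : (NT,)  importance weights for the sampling policy
    """
    NT, B = Psi.shape
    # U_hat: importance-weighted second moment of the basis functions
    U_hat = (Psi * w[:, None]).T @ Psi / NT
    # L_hat maps rewards to parameters: theta_hat = L_hat r
    W = np.diag(w)
    A = Psi.T @ W @ Psi + reg * np.eye(B)   # small ridge term to avoid degeneracy
    L_hat = np.linalg.solve(A, Psi.T @ W)
    return np.trace(U_hat @ L_hat @ L_hat.T)
```

Note that this sketch builds Û from one candidate's samples only; the procedure in Section 5.1.4 pools the samples of all candidates when forming Û.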
5.1.4 Designing Sampling Policies

Based on the generalization error estimator derived above, a sampling policy is designed as follows:

1. Prepare K candidates of the sampling policy: {π̃_k}_{k=1}^K.

2. Collect episodic samples without immediate rewards for each sampling-policy candidate: {H^π̃k}_{k=1}^K.
3. Estimate U using all samples {H^π̃k}_{k=1}^K:

Û = (1/KNT) Σ_{k=1}^K Σ_{n=1}^N Σ_{t=1}^T ψ̂(s^π̃k_{t,n}, a^π̃k_{t,n}; {H^π̃k}_{k=1}^K) ψ̂(s^π̃k_{t,n}, a^π̃k_{t,n}; {H^π̃k}_{k=1}^K)⊤ ŵ^π̃k_{t,n},

where ŵ^π̃k_{t,n} denotes the importance weight for the k-th sampling policy π̃k:

ŵ^π̃k_{t,n} = ∏_{t′=1}^t π(a^π̃k_{t′,n} | s^π̃k_{t′,n}) / ∏_{t′=1}^t π̃k(a^π̃k_{t′,n} | s^π̃k_{t′,n}).
4. Estimate the generalization error for each k:

J_k = tr(Û L̂^π̃k (L̂^π̃k)⊤),

where L̂^π̃k is defined as

L̂^π̃k = ( (Ψ̂^π̃k)⊤ Ŵ^π̃k Ψ̂^π̃k )^{−1} (Ψ̂^π̃k)⊤ Ŵ^π̃k.

Ψ̂^π̃k is the NT × B matrix and Ŵ^π̃k is the NT × NT diagonal matrix defined as

Ψ̂^π̃k_{N(t−1)+n, b} = ψ̂_b(s^π̃k_{t,n}, a^π̃k_{t,n}),
Ŵ^π̃k_{N(t−1)+n, N(t−1)+n} = ŵ^π̃k_{t,n}.
5. (If possible) repeat Steps 2 to 4 several times and calculate the average for each k.

6. Determine the sampling policy as

π̃_AL = argmin_{k=1,...,K} J_k.

7. Collect training samples with immediate rewards following π̃_AL.

8. Learn the value function by least-squares policy iteration using the collected samples.
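The selection procedure above could be organized in code roughly as follows. This is only a schematic sketch, simplified so that Û is built per candidate rather than pooled over all candidates as in Step 3; `collect_rollouts`, `build_features`, and `importance_weights` are hypothetical helpers standing in for the simulator and for the quantities defined in Steps 3 and 4, and the single-policy estimator from the earlier sketch is reused:

```python
import numpy as np

def choose_sampling_policy(candidates, evaluation_policy, n_episodes, n_steps,
                           collect_rollouts, build_features, importance_weights,
                           n_repeats=5):
    """Return the candidate sampling policy minimizing the averaged criterion J_k."""
    J_avg = np.zeros(len(candidates))
    for _ in range(n_repeats):                                # Step 5: average runs
        rollouts = [collect_rollouts(pi, n_episodes, n_steps)  # Step 2 (no rewards)
                    for pi in candidates]
        for k, pi_k in enumerate(candidates):
            Psi_k = build_features(rollouts[k])                # NT x B basis matrix
            w_k = importance_weights(evaluation_policy, pi_k, rollouts[k])
            J_avg[k] += generalization_error_estimate(Psi_k, w_k)  # Steps 3-4
    J_avg /= n_repeats
    return candidates[int(np.argmin(J_avg))]                   # Step 6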
5.1.5 Illustration

Here, the behavior of the active learning method is illustrated on a toy 10-state chain-walk environment shown in Figure 5.1. The MDP consists of 10 states,

S = {s^(i)}_{i=1}^{10} = {1, 2, ..., 10},

and 2 actions,

A = {a^(i)}_{i=1}^2 = {"L", "R"}.
FIGURE 5.1: Ten-state chain walk. Filled and unfilled arrows indicate the transitions when taking actions "R" and "L," and solid and dashed lines indicate the successful and failed transitions.
The immediate reward function is defined as

r(s, a, s′) = f(s′),

where the profile of the function f(s′) is illustrated in Figure 5.2.

The transition probability p(s′|s, a) is indicated by the numbers attached to the arrows in Figure 5.1. For example, p(s^(2)|s^(1), a="R") = 0.8 and p(s^(1)|s^(1), a="R") = 0.2. Thus, the agent can successfully move in the intended direction with probability 0.8 (indicated by solid arrows in the figure) and the action fails with probability 0.2 (indicated by dashed arrows in the figure). The discount factor γ is set at 0.9. The following 12 Gaussian basis functions φ(s, a) are used:

φ_{2(i−1)+j}(s, a) = I(a = a^(j)) exp( −(s − c_i)² / (2τ²) )  for i = 1, ..., 5 and j = 1, 2,
φ_{2(i−1)+j}(s, a) = I(a = a^(j))  for i = 6 and j = 1, 2,

where c_1 = 1, c_2 = 3, c_3 = 5, c_4 = 7, c_5 = 9, and τ = 1.5. I(a = a′) denotes the indicator function:

I(a = a′) = 1 if a = a′,  and 0 if a ≠ a′.
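The 12 basis functions above are straightforward to write down. A minimal sketch, assuming states are the integers 1 to 10 and actions are indexed 0 for "L" and 1 for "R" (an indexing convention of this illustration, not the book's):

```python
import numpy as np

CENTERS = np.array([1, 3, 5, 7, 9])
TAU = 1.5

def phi(s, a):
    """12-dimensional basis vector for the 10-state chain walk.

    s : state in {1, ..., 10};  a : action index in {0, 1} for ("L", "R")
    """
    feat = np.zeros(12)
    gauss = np.exp(-(s - CENTERS) ** 2 / (2 * TAU ** 2))   # 5 Gaussian bumps
    feat[2 * np.arange(5) + a] = gauss                      # phi_{2(i-1)+j}
    feat[10 + a] = 1.0                                      # constant term (i = 6)
    return feat
```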
FIGURE 5.2: Profile of the function f(s′).

Sampling policies and evaluation policies are constructed as follows. First, a deterministic "base" policy π is prepared, for example, "LLLLLRRRRR," where the i-th letter denotes the action taken at s^(i). Let π^ǫ be the "ǫ-greedy" version of the base policy π, i.e., the intended action is chosen with probability 1 − ǫ/2 and the other action is chosen with probability ǫ/2.

Experiments are performed for three different evaluation policies:

π_1: "RRRRRRRRRR,"
π_2: "RRLLLLLRRR,"
π_3: "LLLLLRRRRR,"

with ǫ = 0.1. For each evaluation policy π_i^{0.1} (i = 1, 2, 3), 10 candidates of the sampling policy {π̃_i^(k)}_{k=1}^{10} are prepared, where π̃_i^(k) = π_i^{k/10}. Note that π̃_i^(1) is equivalent to the evaluation policy π_i^{0.1}.
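For concreteness, the ǫ-greedy sampling-policy candidates could be generated as in the following sketch (an illustration under the indexing assumptions above, not code from the book); `base` is a string such as "LLLLLRRRRR" and the returned table gives π̃(a|s):

```python
import numpy as np

def epsilon_greedy_table(base, eps):
    """10 x 2 table of action probabilities for an epsilon-greedy version
    of a deterministic base policy over the chain walk."""
    table = np.full((10, 2), eps / 2)            # non-intended action: eps/2
    for i, letter in enumerate(base):
        intended = 0 if letter == "L" else 1
        table[i, intended] = 1 - eps / 2         # intended action: 1 - eps/2
    return table

# Ten sampling-policy candidates for one evaluation policy (eps = k/10)
candidates = [epsilon_greedy_table("RRRRRRRRRR", k / 10) for k in range(1, 11)]
```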
For each sampling policy, the active learning criterion J is computed 5 times and their average is taken. The numbers of episodes and steps are set at N = 10 and T = 10, respectively. The initial-state probability p(s) is set to be uniform. When the matrix inverse is computed, 10^{−3} is added to the diagonal elements to avoid degeneracy. This experiment is repeated 100 times with different random seeds, and the mean and standard deviation of the true generalization error and its estimate are evaluated.

The results are depicted in Figure 5.3 as functions of the index k of the sampling policies. The graphs show that the generalization error estimator overall captures the trend of the true generalization error well for all three cases.

Next, the values of the obtained generalization error G are evaluated when k is chosen so that J is minimized (active learning, AL), when the evaluation policy (k = 1) is used for sampling (passive learning, PL), and when k is chosen optimally so that the true generalization error is minimized (optimal, OPT). Figure 5.4 shows that the active learning method compares favorably with passive learning and performs well for reducing the generalization error.
5.2 Active Policy Iteration

In Section 5.1, the unknown generalization error was shown to be accurately estimated without using immediate reward samples in one-step policy evaluation. In this section, this one-step active learning idea is extended to the framework of sample-reuse policy iteration introduced in Chapter 4, which is called active policy iteration. Let us denote the evaluation policy at the L-th iteration by π_L.
FIGURE 5.3: The mean and standard deviation of the true generalization error G (left) and the estimated generalization error J (right) over 100 trials, shown as functions of the sampling policy index k for (a) π_1^{0.1}, (b) π_2^{0.1}, and (c) π_3^{0.1}.
5.2.1 Sample-Reuse Policy Iteration with Active Learning

In the original sample-reuse policy iteration, new data samples H^{π_l} are collected following the new target policy π_l for the next policy evaluation step:

π_1 --E:{H^{π1}}--> Q̂^{π1} --I--> π_2 --E:{H^{π1}, H^{π2}}--> Q̂^{π2} --I--> π_3 --E:{H^{π1}, H^{π2}, H^{π3}}--> ··· --I--> π_{L+1},
FIGURE 5.4: Box-plots of the values of the obtained generalization error G over 100 trials when k is chosen so that J is minimized (active learning, AL), when the evaluation policy (k = 1) is used for sampling (passive learning, PL), and when k is chosen optimally so that the true generalization error is minimized (optimal, OPT), for (a) π_1^{0.1}, (b) π_2^{0.1}, and (c) π_3^{0.1}. The box-plot notation indicates the 5% quantile, 25% quantile, 50% quantile (i.e., median), 75% quantile, and 95% quantile from bottom to top.
where "E: H" indicates policy evaluation using the data sample H and "I" denotes policy improvement. On the other hand, in active policy iteration, the optimized sampling policy π̃_l is used at each iteration:

π_1 --E:{H^{π̃1}}--> Q̂^{π1} --I--> π_2 --E:{H^{π̃1}, H^{π̃2}}--> Q̂^{π2} --I--> π_3 --E:{H^{π̃1}, H^{π̃2}, H^{π̃3}}--> ··· --I--> π_{L+1}.

Note that, in active policy iteration, the previously collected samples are used not only for value function approximation, but also for active learning. Thus, active policy iteration makes full use of the samples.
5.2.2 Illustration

Here, the behavior of active policy iteration is illustrated using the same 10-state chain-walk problem as in Section 5.1.5 (see Figure 5.1).
The initial evaluation policy π_1 is set as

π_1(a|s) = 0.15 p_u(a) + 0.85 I(a = argmax_{a′} Q̂_0(s, a′)),

where p_u(a) denotes the probability mass function of the uniform distribution and

Q̂_0(s, a) = Σ_{b=1}^{12} φ_b(s, a).
Policies are updated in the l-th iteration using the ǫ-greedy rule with ǫ = 0.15/l. In the sampling-policy selection step of the l-th iteration, the following four sampling-policy candidates are prepared:

π̃_l^(1) = π_l^{0.15/l},  π̃_l^(2) = π_l^{0.15/l + 0.15},  π̃_l^(3) = π_l^{0.15/l + 0.5},  π̃_l^(4) = π_l^{0.15/l + 0.85},

where π_l denotes the policy obtained by greedy update using Q̂^{π_{l−1}}.
The number of iterations to learn the policy is set at 7, the number of steps is set at T = 10, and the number N of episodes differs in each iteration and is defined as {N_1, ..., N_7}, where N_l (l = 1, ..., 7) denotes the number of episodes collected in the l-th iteration. In this experiment, two types of scheduling are compared: {5, 5, 3, 3, 3, 1, 1} and {3, 3, 3, 3, 3, 3, 3}, which are referred to as the "decreasing N" strategy and the "fixed N" strategy, respectively. The J-value calculation is repeated 5 times for active learning. The performance of the finally obtained policy π_8 is measured by the return for test samples {r^{π_8}_{t,n}}_{t,n=1}^{T,N} (50 episodes with 50 steps collected following π_8):

Performance = (1/N) Σ_{n=1}^N Σ_{t=1}^T γ^{t−1} r^{π_8}_{t,n},

where the discount factor γ is set at 0.9.
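The performance measure above is just the average of discounted returns over the test episodes. A minimal sketch, assuming `rewards` is an N x T array of test-episode rewards (a hypothetical placeholder for the collected test samples):

```python
import numpy as np

def average_discounted_return(rewards, gamma=0.9):
    """rewards: (N, T) array with rewards[n, t-1] = r_{t,n} under pi_8."""
    N, T = rewards.shape
    discounts = gamma ** np.arange(T)            # gamma^{t-1} for t = 1, ..., T
    return np.mean(rewards @ discounts)          # (1/N) sum_n sum_t gamma^{t-1} r_{t,n}
```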
The performance of passive learning (PL; the current policy is used as the sampling policy in each iteration) and active learning (AL; the best sampling policy is chosen from the policy candidates prepared in each iteration) is compared. The experiments are repeated 1000 times with different random seeds and the average performance of PL and AL is evaluated. The results are depicted in Figure 5.5, showing that AL works better than PL for both types of episode scheduling, with statistical significance by the t-test at the significance level 1% (Henkel, 1976) applied to the error values obtained after the 7th iteration. Furthermore, the "decreasing N" strategy outperforms the "fixed N" strategy for both PL and AL, showing the usefulness of the "decreasing N" strategy.
FIGURE 5.5: The mean performance over 1000 trials in the 10-state chain-walk experiment. The dotted lines denote the performance of passive learning (PL) and the solid lines denote the performance of the proposed active learning (AL) method. The error bars are omitted for clear visibility. For both the "decreasing N" and "fixed N" strategies, the performance of AL after the 7th iteration is significantly better than that of PL according to the t-test at the significance level 1% applied to the error values at the 7th iteration.
5.3 Numerical Examples

In this section, the performance of active policy iteration is evaluated using a ball-batting robot illustrated in Figure 5.6, which consists of two links and two joints. The goal of the ball-batting task is to control the robot arm so that it drives the ball as far away as possible. The state space S is continuous and consists of the angles ϕ_1 [rad] (∈ [0, π/4]) and ϕ_2 [rad] (∈ [−π/4, π/4]) and the angular velocities ϕ̇_1 [rad/s] and ϕ̇_2 [rad/s]. Thus, a state s (∈ S) is described by a 4-dimensional vector s = (ϕ_1, ϕ̇_1, ϕ_2, ϕ̇_2)⊤. The action space A is discrete and contains two elements:

A = {a^(i)}_{i=1}^2 = {(50, −35)⊤, (−50, 10)⊤},

where the i-th element (i = 1, 2) of each vector corresponds to the torque [N·m] added to joint i.

The Open Dynamics Engine (http://ode.org/) is used for physical calculations, including the update of the angles and angular velocities, and collision detection between the robot arm, ball, and pin. The simulation time step is set at 7.5 [ms] and the next state is observed after 10 time steps. The action chosen in the current state is taken for 10 time steps. To make the experiments realistic, noise is added to actions: if action (f_1, f_2)⊤ is taken, the actual
FIGURE 5.6: A ball-batting robot.

torques applied to the joints are f_1 + ε_1 and f_2 + ε_2, where ε_1 and ε_2 are drawn independently from the Gaussian distribution with mean 0 and variance 3.

The immediate reward is defined as the carry of the ball. This reward is given only when the robot arm collides with the ball for the first time at state s′ after taking action a at the current state s. For value function approximation, the following 110 basis functions are used:

φ_{2(i−1)+j}(s, a) = I(a = a^(j)) exp( −‖s − c_i‖² / (2τ²) )  for i = 1, ..., 54 and j = 1, 2,
φ_{2(i−1)+j}(s, a) = I(a = a^(j))  for i = 55 and j = 1, 2,

where τ is set at 3π/2 and the Gaussian centers c_i (i = 1, ..., 54) are located on the regular grid {0, π/4} × {−π, 0, π} × {−π/4, 0, π/4} × {−π, 0, π}.
For L = 7 and T = 10, the "decreasing N" strategy and the "fixed N" strategy are compared. The "decreasing N" strategy is defined by {10, 10, 7, 7, 7, 4, 4} and the "fixed N" strategy is defined by {7, 7, 7, 7, 7, 7, 7}. The initial state is always set at s = (π/4, 0, 0, 0)⊤, and the J-calculations are repeated 5 times in the active learning method. The initial evaluation policy π_1 is set at the ǫ-greedy policy defined as

π_1(a|s) = 0.15 p_u(a) + 0.85 I(a = argmax_{a′} Q̂_0(s, a′)),

Q̂_0(s, a) = Σ_{b=1}^{110} φ_b(s, a).

Policies are updated in the l-th iteration using the ǫ-greedy rule with ǫ = 0.15/l. Sampling-policy candidates are prepared in the same way as in the chain-walk experiment in Section 5.2.2.

The discount factor γ is set at 1, and the performance of the learned policy π_8
FIGURE 5.7: The mean performance over 500 trials in the ball-batting experiment. The dotted lines denote the performance of passive learning (PL) and the solid lines denote the performance of the proposed active learning (AL) method. The error bars are omitted for clear visibility. For the "decreasing N" strategy, the performance of AL after the 7th iteration is significantly better than that of PL according to the t-test at the significance level 1% for the error values at the 7th iteration.
is measured by the return for test samples {r^{π_8}_{t,n}}_{t,n=1}^{10,20} (20 episodes with 10 steps collected following π_8):

Σ_{n=1}^N Σ_{t=1}^T r^{π_8}_{t,n}.
The experiment is repeated 500 times with different random seeds and the average performance of each learning method is evaluated. The results, depicted in Figure 5.7, show that active learning outperforms passive learning. For the "decreasing N" strategy, the performance difference is statistically significant by the t-test at the significance level 1% for the error values after the 7th iteration.

Motion examples of the ball-batting robot trained with active learning and passive learning are illustrated in Figure 5.8 and Figure 5.9, respectively.
5.4 Remarks

When we cannot afford to collect many training samples due to high sampling costs, it is crucial to choose the most informative samples for efficiently learning the value function. In this chapter, an active learning method for optimizing data sampling strategies was introduced in the framework of sample-reuse policy iteration, and the resulting active policy iteration was demonstrated to be promising.
FIGURE 5.8: A motion example of the ball-batting robot trained with active learning (from left to right and top to bottom).

FIGURE 5.9: A motion example of the ball-batting robot trained with passive learning (from left to right and top to bottom).
Chapter 6

Robust Policy Iteration

The framework of least-squares policy iteration (LSPI) introduced in Chapter 2 is useful, thanks to its computational efficiency and analytical tractability. However, due to the squared loss, it tends to be sensitive to outliers in observed rewards. In this chapter, we introduce an alternative policy iteration method that employs the absolute loss for enhancing robustness and reliability. In Section 6.1, the robustness and reliability brought by the use of the absolute loss are discussed. In Section 6.2, the policy iteration framework with the absolute loss, called least-absolute policy iteration (LAPI), is introduced. In Section 6.3, the usefulness of LAPI is illustrated through experiments. Variations of LAPI are considered in Section 6.4, and finally this chapter is concluded in Section 6.5.
6.1 Robustness and Reliability in Policy Iteration

The basic idea of LSPI is to fit a linear model to immediate rewards under the squared loss, while the absolute loss is used in this chapter (see Figure 6.1). This is just a replacement of loss functions, but this modification highly enhances robustness and reliability.
6.1.1 Robustness

In many robotics applications, immediate rewards are obtained through measurement such as distance sensors or computer vision. Due to intrinsic measurement noise or recognition error, the obtained rewards often deviate from the true value. In particular, the rewards occasionally contain outliers, which are significantly different from regular values.

Residual minimization under the squared loss amounts to obtaining the mean of samples {x_i}_{i=1}^m:

argmin_c Σ_{i=1}^m (x_i − c)² = mean({x_i}_{i=1}^m) = (1/m) Σ_{i=1}^m x_i.
If one of the values is an outlier having a very large or small value, the mean would be strongly affected by this outlier. This means that all the values {x_i}_{i=1}^m are responsible for the mean, and therefore even a single outlier observation can significantly damage the learned result.

FIGURE 6.1: The absolute and squared loss functions for reducing the temporal-difference error.
On the other hand, residual minimization under the absolute loss amounts to obtaining the median:

argmin_c Σ_{i=1}^{2n+1} |x_i − c| = median({x_i}_{i=1}^{2n+1}) = x_{n+1},

where x_1 ≤ x_2 ≤ ··· ≤ x_{2n+1}. The median is influenced not by the magnitude of the values {x_i}_{i=1}^{2n+1} but only by their order. Thus, as long as the order is kept unchanged, the median is not affected by outliers. In fact, the median is known to be the most robust estimator in light of breakdown-point analysis (Huber, 1981; Rousseeuw & Leroy, 1987).

Therefore, the use of the absolute loss would remedy the problem of robustness in policy iteration.
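The contrast between the two minimizers is easy to see numerically. The following small sketch (illustrative numbers only, not data from the book) shows how a single outlier drags the mean while leaving the median essentially untouched:

```python
import numpy as np

x = np.array([0.9, 1.0, 1.1, 1.0, 0.95])
x_outlier = np.append(x, 100.0)                  # one corrupted observation

# Squared loss -> mean, absolute loss -> median
print(np.mean(x), np.median(x))                  # approx 0.99 and 1.0
print(np.mean(x_outlier), np.median(x_outlier))  # mean jumps to approx 17.5, median stays 1.0
```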
6.1.2 Reliability

In practical robot-control tasks, we often want to attain a stable performance, rather than to achieve a "dream" performance with little chance of success. For example, in the acquisition of a humanoid gait, we may want the robot to walk forward in a stable manner with a high probability of success, rather than to rush very fast at a chance level.

On the other hand, we do not want to be too conservative when training robots. If we are overly concerned with unrealistic failure, no practically useful control policy can be obtained. For example, any robot can be broken in principle if it is activated for a long time. However, if we fear this fact too much, we may end up praising a control policy that does not move the robot at all, which is obviously nonsense.

Since the squared-loss solution is not robust against outliers, it is sensitive to rare events with either positive or negative very large immediate rewards.
Consequently, the squared loss prefers an extraordinarily successful motion even if the success probability is very low. Similarly, it dislikes an unrealistic trouble even if such a terrible event may not happen in reality. On the other hand, the absolute-loss solution is not easily affected by such rare events due to its robustness. Therefore, the use of the absolute loss would produce a reliable control policy even in the presence of such extreme events.
6.2 Least Absolute Policy Iteration

In this section, a policy iteration method with the absolute loss is introduced.

6.2.1 Algorithm

Instead of the squared loss, a linear model is fitted to immediate rewards under the absolute loss as

min_θ Σ_{t=1}^T | θ⊤ ψ̂(s_t, a_t) − r_t |.

This minimization problem looks cumbersome due to the absolute value operator, which is non-differentiable, but it can be reduced to the following linear program (Boyd & Vandenberghe, 2004):

min_{θ, {b_t}_{t=1}^T} Σ_{t=1}^T b_t
subject to −b_t ≤ θ⊤ ψ̂(s_t, a_t) − r_t ≤ b_t,  t = 1, ..., T.

The number of constraints is T in the above linear program. When T is large, we may employ sophisticated optimization techniques such as column generation (Demiriz et al., 2002) for efficiently solving the linear programming problem. Alternatively, an approximate solution can be obtained by gradient descent or (quasi-)Newton methods if the absolute loss is approximated by a smooth loss (see, e.g., Section 6.4.1).

The policy iteration method based on the absolute loss is called least absolute policy iteration (LAPI).
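The linear program above can be handed to any LP solver. A minimal sketch using SciPy, assuming `Psi` is the T x B matrix of ψ̂(s_t, a_t) values and `r` the reward vector (not the book's implementation, just one way to set up the variables [θ; b]):

```python
import numpy as np
from scipy.optimize import linprog

def least_absolute_fit(Psi, r):
    """Minimize sum_t |theta^T psi_t - r_t| via the LP with slack variables b_t."""
    T, B = Psi.shape
    c = np.concatenate([np.zeros(B), np.ones(T)])            # objective: sum of b_t
    # Constraints:  Psi theta - b <= r   and   -Psi theta - b <= -r
    A_ub = np.block([[Psi, -np.eye(T)], [-Psi, -np.eye(T)]])
    b_ub = np.concatenate([r, -r])
    bounds = [(None, None)] * B + [(0, None)] * T             # theta free, b_t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:B]                                           # fitted theta
```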
6.2.2 Illustration

For illustration purposes, let us consider the 4-state MDP problem described in Figure 6.2.

FIGURE 6.2: Illustrative MDP problem.

The agent is initially located at state s^(0), and the actions the agent is allowed to take are moving to the left or right state. If the left-movement action is chosen, the agent always receives a small positive reward +0.1 at s^(L). On the other hand, if the right-movement action is chosen, the agent receives a negative reward −1 with probability 0.9999 at s^(R1) or it receives a very large positive reward +20,000 with probability 0.0001 at s^(R2). The mean and median rewards for left movement are both +0.1, while the mean and median rewards for right movement are +1.0001 and −1, respectively.

If Q(s^(0), "Left") and Q(s^(0), "Right") are approximated by the least-squares method, it returns the mean rewards, i.e., +0.1 and +1.0001, respectively. Thus, the least-squares method prefers right movement, which is a "gambling" policy: the negative reward −1 is almost always obtained at s^(R1), but it is possible to obtain the very high reward +20,000 with a very small probability at s^(R2). On the other hand, if Q(s^(0), "Left") and Q(s^(0), "Right") are approximated by the least absolute method, it returns the median rewards, i.e., +0.1 and −1, respectively. Thus, the least absolute method prefers left movement, which is a stable policy: the agent can always receive the small positive reward +0.1 at s^(L).

If all the rewards in Figure 6.2 are negated, the value functions are also negated and a different interpretation can be obtained: the least-squares method is afraid of the risk of receiving the very large negative reward −20,000 at s^(R2) with a very low probability, and consequently it ends up with a very conservative policy in which the agent always receives the negative reward −0.1 at s^(L). On the other hand, the least absolute method tries to receive the positive reward +1 at s^(R1) without being afraid of visiting s^(R2) too much.

As illustrated above, the least absolute method tends to provide qualitatively different solutions from the least-squares method.
6.2.3 Properties

Here, properties of the least absolute method are investigated when the model Q̂(s, a) is correctly specified, i.e., there exists a parameter θ* such that Q̂(s, a) = Q(s, a) for all s and a.

Under the correct model assumption, when the number of samples T tends to infinity, the least absolute solution θ̂ satisfies the following equation (Koenker, 2005):

θ̂⊤ ψ(s, a) = M_{p(s′|s,a)}[ r(s, a, s′) ]  for all s and a,   (6.1)

where M_{p(s′|s,a)} denotes the conditional median of s′ over p(s′|s, a) given s and a. ψ(s, a) is defined by

ψ(s, a) = φ(s, a) − γ E_{p(s′|s,a)} E_{π(a′|s′)}[ φ(s′, a′) ],

where E_{p(s′|s,a)} denotes the conditional expectation of s′ over p(s′|s, a) given s and a, and E_{π(a′|s′)} denotes the conditional expectation of a′ over π(a′|s′) given s′.

From Eq. (6.1), we can obtain the following Bellman-like recursive expression:

Q̂(s, a) = M_{p(s′|s,a)}[ r(s, a, s′) ] + γ E_{p(s′|s,a)} E_{π(a′|s′)}[ Q̂(s′, a′) ].   (6.2)

Note that in the case of the least-squares method, where

θ̂⊤ ψ(s, a) = E_{p(s′|s,a)}[ r(s, a, s′) ]

is satisfied in the limit under the correct model assumption, we have

Q̂(s, a) = E_{p(s′|s,a)}[ r(s, a, s′) ] + γ E_{p(s′|s,a)} E_{π(a′|s′)}[ Q̂(s′, a′) ].   (6.3)

This is the ordinary Bellman equation, and thus Eq. (6.2) could be regarded as an extension of the Bellman equation to the absolute loss.

From the ordinary Bellman equation (6.3), we can recover the original definition of the state-action value function Q(s, a):

Q^π(s, a) = E_{p^π(h)} [ Σ_{t=1}^T γ^{t−1} r(s_t, a_t, s_{t+1}) | s_1 = s, a_1 = a ],

where E_{p^π(h)} denotes the expectation over trajectory h = [s_1, a_1, ..., s_T, a_T, s_{T+1}] and "| s_1 = s, a_1 = a" means that the initial state s_1 and the first action a_1 are fixed at s_1 = s and a_1 = a, respectively. In contrast, from the absolute-loss Bellman equation (6.2), we have

Q′(s, a) = E_{p^π(h)} [ Σ_{t=1}^T γ^{t−1} M_{p(s_{t+1}|s_t,a_t)}[ r(s_t, a_t, s_{t+1}) ] | s_1 = s, a_1 = a ].
FIGURE 6.3: Illustration of the acrobot. The goal is to swing up the end effector by only controlling the second joint.

This is the value function that the least absolute method is trying to approximate, which is different from the ordinary value function. Since the discounted sum of median rewards, not the expected rewards, is maximized, the least absolute method is expected to be less sensitive to outliers than the least-squares method.
6.3 Numerical Examples

In this section, the behavior of LAPI is illustrated through experiments using the acrobot shown in Figure 6.3. The acrobot is an under-actuated system and consists of two links, two joints, and an end effector. The length of each link is 0.3 [m], and the diameter of each joint is 0.15 [m]. The diameter of the end effector is 0.10 [m], and the height of the horizontal bar is 1.2 [m]. The first joint connects the first link to the horizontal bar and is not controllable. The second joint connects the first link to the second link and is controllable. The end effector is attached to the tip of the second link. The control command (action) we can choose is to apply positive torque +50 [N·m], no torque 0 [N·m], or negative torque −50 [N·m] to the second joint. Note that the acrobot moves only within a plane orthogonal to the horizontal bar.

The goal is to acquire a control policy such that the end effector is swung up as high as possible. The state space consists of the angle θ_i [rad] and angular velocity θ̇_i [rad/s] of the first and second joints (i = 1, 2). The immediate reward is given according to the height y of the center of the end effector as

r(s, a, s′) = 10  if y > 1.75,
r(s, a, s′) = exp( −(y − 1.85)² / (2(0.2)²) )  if 1.5 < y ≤ 1.75,
r(s, a, s′) = 0.001  otherwise.

Note that 0.55 ≤ y ≤ 1.85 in the current setting.
Here, suppose that the length of the links is unknown. Thus, the height y cannot be directly computed from state information. The height of the end effector is supposed to be estimated from an image taken by a camera: the end effector is detected in the image and then its vertical coordinate is computed. Due to recognition error, the estimated height is highly noisy and could contain outliers.

In each policy iteration step, 20 episodic training samples of length 150 are gathered. The performance of the obtained policy is evaluated using 50 episodic test samples of length 300. Note that the test samples are not used for learning policies; they are used only for evaluating learned policies. The policies are updated in a soft-max manner:

π(a|s) ← exp(Q(s, a)/η) / Σ_{a′∈A} exp(Q(s, a′)/η),

where η = 10^{−l+1} with l being the iteration number. The discount factor is set at γ = 1, i.e., no discount. As basis functions for value function approximation, the Gaussian kernel with standard deviation π is used, where the Gaussian centers are located at

(θ_1, θ_2, θ̇_1, θ̇_2) ∈ {−π, −π/2, 0, π/2, π} × {−π, 0, π} × {−π, 0, π} × {−π, 0, π}.

The above 135 (= 5 × 3 × 3 × 3) Gaussian kernels are defined for each of the three actions. Thus, 405 (= 135 × 3) kernels are used in total.
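For reference, the soft-max policy update used above is a one-liner. A minimal sketch, assuming `Q` is an array of action values for a given state:

```python
import numpy as np

def softmax_policy(Q, eta):
    """pi(a|s) proportional to exp(Q(s,a)/eta); eta = 10**(-l+1) at iteration l."""
    z = Q / eta
    z -= z.max()                      # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```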
Let us consider two noise environments: one is the case where no noise is added to the rewards, and the other is the case where Laplacian noise with mean zero and standard deviation 2 is added to the rewards with probability 0.1. Note that the tail of the Laplacian density is heavier than that of the Gaussian density (see Figure 6.4), implying that a small number of outliers tend to be included in the Laplacian noise environment. An example of the noisy training samples is shown in Figure 6.5. For each noise environment, the experiment is repeated 50 times with different random seeds, and the averages of the sum of rewards obtained by LAPI and LSPI are summarized in Figure 6.6. The best method in terms of the mean value and comparable methods according to the t-test (Henkel, 1976) at the significance level 5% are indicated in the figure.

In the noiseless case (see Figure 6.6(a)), both LAPI and LSPI improve the performance over iterations in a comparable way. On the other hand, in the noisy case (see Figure 6.6(b)), the performance of LSPI is not improved much due to outliers, while LAPI still produces a good control policy.
FIGURE 6.4: Probability density functions of Gaussian and Laplacian distributions.

FIGURE 6.5: Example of training samples with Laplacian noise. The horizontal axis is the height of the end effector. The solid line denotes the noiseless immediate reward and the markers denote noisy training samples.
FIGURE 6.6: Average and standard deviation of the sum of rewards over 50 runs for the acrobot swing-up simulation for (a) no noise and (b) Laplacian noise. The best method in terms of the mean value and comparable methods according to the t-test at the significance level 5% are indicated.
Figure 6.7 and Figure 6.8 depict motion examples of the acrobot learned by LAPI and LSPI in the Laplacian-noise environment. When LSPI is used (Figure 6.7), the second joint is swung hard in order to lift the end effector. However, the end effector tends to stay below the horizontal bar, and therefore only a small amount of reward can be obtained by LSPI. This would be due to the existence of outliers. On the other hand, when LAPI is used (Figure 6.8), the end effector goes beyond the bar, and therefore a large amount of reward can be obtained even in the presence of outliers.
FIGURE 6.7: A motion example of the acrobot learned by LSPI in the Laplacian-noise environment (from left to right and top to bottom).

FIGURE 6.8: A motion example of the acrobot learned by LAPI in the Laplacian-noise environment (from left to right and top to bottom).
6.4 Possible Extensions

In this section, possible variations of LAPI are considered.

6.4.1 Huber Loss

Use of the Huber loss corresponds to making a compromise between the squared and absolute loss functions (Huber, 1981):

argmin_θ Σ_{t=1}^T ρ^HB_κ( θ⊤ ψ̂(s_t, a_t) − r_t ),

where κ (≥ 0) is a threshold parameter and ρ^HB_κ is the Huber loss defined as follows (see Figure 6.9):

ρ^HB_κ(x) = (1/2) x²  if |x| ≤ κ,
ρ^HB_κ(x) = κ|x| − (1/2) κ²  if |x| > κ.

The Huber loss converges to the absolute loss as κ tends to zero, and it converges to the squared loss as κ tends to infinity.

The Huber loss function is rather intricate, but the solution can be obtained by solving the following convex quadratic program (Mangasarian & Musicant, 2000):

min_{θ, {b_t}_{t=1}^T, {c_t}_{t=1}^T}  (1/2) Σ_{t=1}^T b_t² + κ Σ_{t=1}^T c_t
subject to −c_t ≤ θ⊤ ψ̂(s_t, a_t) − r_t − b_t ≤ c_t,  t = 1, ..., T.
Another way to obtain the solution is to use a gradient descent method, where the parameter θ is updated as follows until convergence:

θ ← θ − ε Σ_{t=1}^T ∆ρ^HB_κ( θ⊤ ψ̂(s_t, a_t) − r_t ) ψ̂(s_t, a_t).

ε (> 0) is the learning rate and ∆ρ^HB_κ is the derivative of ρ^HB_κ given by

∆ρ^HB_κ(x) = x  if |x| ≤ κ,
∆ρ^HB_κ(x) = κ  if x > κ,
∆ρ^HB_κ(x) = −κ  if x < −κ.
In practice, the following stochastic gradient method (Amari, 1967) would be more convenient. For a randomly chosen index t ∈ {1, ..., T} in each iteration, repeat the following update until convergence:

θ ← θ − ε ∆ρ^HB_κ( θ⊤ ψ̂(s_t, a_t) − r_t ) ψ̂(s_t, a_t).

FIGURE 6.9: The Huber loss function (with κ = 1), the pinball loss function (with τ = 0.3), and the deadzone-linear loss function (with ǫ = 1).

The plain/stochastic gradient methods also come in handy when approximating the least absolute solution, since the Huber loss function with small κ can be regarded as a smooth approximation to the absolute loss.
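A minimal batch gradient-descent sketch for the Huber-loss fit is given below; the learning rate, iteration count, and the data arrays `Psi` and `r` are illustrative placeholders, not values from the book:

```python
import numpy as np

def huber_fit(Psi, r, kappa=1.0, lr=1e-3, n_iter=5000):
    """Fit theta by batch gradient descent on the Huber loss of Psi theta - r."""
    T, B = Psi.shape
    theta = np.zeros(B)
    for _ in range(n_iter):
        residual = Psi @ theta - r
        grad_loss = np.clip(residual, -kappa, kappa)   # derivative of the Huber loss
        theta -= lr * Psi.T @ grad_loss                # sum of grad_loss_t * psi_t
    return theta
```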
6.4.2 Pinball Loss

The absolute loss induces the median, which corresponds to the 50-percentile point. A similar discussion is also possible for an arbitrary percentile 100τ (0 ≤ τ ≤ 1) based on the pinball loss (Koenker, 2005):

min_θ Σ_{t=1}^T ρ^PB_τ( θ⊤ ψ̂(s_t, a_t) − r_t ),

where ρ^PB_τ(x) is the pinball loss defined by

ρ^PB_τ(x) = 2τx  if x ≥ 0,
ρ^PB_τ(x) = 2(τ − 1)x  if x < 0.

The profile of the pinball loss is depicted in Figure 6.9. When τ = 0.5, the pinball loss is reduced to the absolute loss.
The solution can be obtained by solving the following linear program:

min_{θ, {b_t}_{t=1}^T} Σ_{t=1}^T b_t
subject to b_t / (2(τ − 1)) ≤ θ⊤ ψ̂(s_t, a_t) − r_t ≤ b_t / (2τ),  t = 1, ..., T.
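The pinball loss itself is simple to evaluate; a minimal sketch of the loss function follows (the solver-side linear program is analogous to the least-absolute one sketched in Section 6.2.1):

```python
import numpy as np

def pinball_loss(x, tau):
    """rho_tau(x) = 2*tau*x for x >= 0 and 2*(tau-1)*x for x < 0 (tau = 0.5 gives |x|)."""
    return np.where(x >= 0, 2 * tau * x, 2 * (tau - 1) * x)
```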
6.4.3 Deadzone-Linear Loss

Another variant of the absolute loss is the deadzone-linear loss (see Figure 6.9):

min_θ Σ_{t=1}^T ρ^DL_ǫ( θ⊤ ψ̂(s_t, a_t) − r_t ),

where ρ^DL_ǫ(x) is the deadzone-linear loss defined by

ρ^DL_ǫ(x) = 0  if |x| ≤ ǫ,
ρ^DL_ǫ(x) = |x| − ǫ  if |x| > ǫ.

That is, if the magnitude of the error is less than ǫ, no error is assessed. This loss is also called the ǫ-insensitive loss and is used in support vector regression (Vapnik, 1998).

When ǫ = 0, the deadzone-linear loss is reduced to the absolute loss. Thus, the deadzone-linear loss and the absolute loss are related to each other. However, the effect of the deadzone-linear loss is completely opposite to the absolute loss when ǫ > 0. The influence of "good" samples (with small error) is deemphasized in the deadzone-linear loss, while the absolute loss tends to suppress the influence of "bad" samples (with large error) compared with the squared loss.

The solution can be obtained by solving the following linear program (Boyd & Vandenberghe, 2004):

min_{θ, {b_t}_{t=1}^T} Σ_{t=1}^T b_t
subject to −b_t − ǫ ≤ θ⊤ ψ̂(s_t, a_t) − r_t ≤ b_t + ǫ,
b_t ≥ 0,  t = 1, ..., T.
6.4.4 Chebyshev Approximation

The Chebyshev approximation minimizes the error for the "worst" sample:

min_θ max_{t=1,...,T} | θ⊤ ψ̂(s_t, a_t) − r_t |.

This is also called the minimax approximation.

The solution can be obtained by solving the following linear program (Boyd & Vandenberghe, 2004):

min_{θ, b} b
subject to −b ≤ θ⊤ ψ̂(s_t, a_t) − r_t ≤ b,  t = 1, ..., T.
FIGURE 6.10: The conditional value-at-risk (CVaR).
6.4.5 Conditional Value-At-Risk

In the area of finance, the conditional value-at-risk (CVaR) is a popular risk measure (Rockafellar & Uryasev, 2002). The CVaR corresponds to the mean of the error for a set of "bad" samples (see Figure 6.10).

More specifically, let us consider the distribution of the absolute error over all training samples {(s_t, a_t, r_t)}_{t=1}^T:

Φ(α|θ) = P{ (s_t, a_t, r_t) : |θ⊤ ψ̂(s_t, a_t) − r_t| ≤ α }.

For β ∈ [0, 1), let α_β(θ) be the 100β percentile of the absolute error distribution:

α_β(θ) = min{ α | Φ(α|θ) ≥ β }.

Thus, only the fraction (1 − β) of the absolute errors |θ⊤ ψ̂(s_t, a_t) − r_t| exceeds the threshold α_β(θ). α_β(θ) is also referred to as the value-at-risk (VaR).

Let us consider the β-tail distribution of the absolute error:

Φ_β(α|θ) = 0  if α < α_β(θ),
Φ_β(α|θ) = (Φ(α|θ) − β) / (1 − β)  if α ≥ α_β(θ).

Let φ_β(θ) be the mean of the β-tail distribution of the absolute temporal difference (TD) error:

φ_β(θ) = E_{Φ_β}[ |θ⊤ ψ̂(s_t, a_t) − r_t| ],

where E_{Φ_β} denotes the expectation over the distribution Φ_β. φ_β(θ) is called the CVaR. By definition, the CVaR of the absolute error is reduced to the mean absolute error if β = 0, and it converges to the worst absolute error as β tends to 1. Thus, the CVaR smoothly bridges the least absolute and Chebyshev approximation methods. CVaR is also referred to as the expected shortfall.

The CVaR minimization problem in the current context is formulated as

min_θ E_{Φ_β}[ |θ⊤ ψ̂(s_t, a_t) − r_t| ].

This optimization problem looks complicated, but the solution θ̂_CV can be obtained by solving the following linear program (Rockafellar & Uryasev, 2002):

min_{θ, {b_t}_{t=1}^T, {c_t}_{t=1}^T, α}  T(1 − β)α + Σ_{t=1}^T c_t
subject to −b_t ≤ θ⊤ ψ̂(s_t, a_t) − r_t ≤ b_t,
c_t ≥ b_t − α,
c_t ≥ 0,  t = 1, ..., T.

Note that if the definition of the absolute error is slightly changed, the CVaR minimization method amounts to minimizing the deadzone-linear loss (Takeda, 2007).
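As a quick aside on what the objective measures, an empirical CVaR of the absolute errors can be approximated directly as the mean of the worst (1 − β) fraction of the errors; a minimal sketch of the risk measure itself (not the linear program above):

```python
import numpy as np

def empirical_cvar(abs_errors, beta):
    """Mean of the (1 - beta) largest absolute errors (beta = 0: mean, beta -> 1: max)."""
    alpha = np.quantile(abs_errors, beta)        # empirical VaR (100*beta percentile)
    tail = abs_errors[abs_errors >= alpha]
    return tail.mean()
```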
6.5 Remarks

LSPI can be regarded as regression of immediate rewards under the squared loss. In this chapter, the absolute loss was used for regression, which contributes to enhancing robustness and reliability. The least absolute method is formulated as a linear program, and it can be solved efficiently by standard optimization software.

LSPI maximizes the state-action value function Q(s, a), which is the expectation of returns. Another way to address robustness and reliability is to maximize other quantities such as the median or a quantile of returns. Although Bellman-like simple recursive expressions are not available for quantiles of rewards, a Bellman-like recursive equation holds for the distribution of the discounted sum of rewards (Morimura et al., 2010a; Morimura et al., 2010b). Developing robust reinforcement learning algorithms along this line of research would be a promising future direction.
Part III

Model-Free Policy Search

In the policy iteration approach explained in Part II, the value function is first estimated and then the policy is determined based on the learned value function. Policy iteration was demonstrated to work well in many real-world applications, especially in problems with discrete states and actions (Tesauro, 1994; Williams & Young, 2007; Abe et al., 2010). Although policy iteration can also handle continuous states by function approximation (Lagoudakis & Parr, 2003), continuous actions are hard to deal with due to the difficulty of finding a maximizer of the value function with respect to actions. Moreover, since policies are indirectly determined via value function approximation, misspecification of value function models can lead to an inappropriate policy even in very simple problems (Weaver & Baxter, 1999; Baxter et al., 2001). Another limitation of policy iteration, especially in physical control tasks, is that control policies can vary drastically in each iteration. This causes severe instability in the physical system and thus is not favorable in practice.

Policy search is an alternative approach to reinforcement learning that can overcome the limitations of policy iteration (Williams, 1992; Dayan & Hinton, 1997; Kakade, 2002). In the policy search approach, policies are directly learned so that the return (i.e., the discounted sum of future rewards),

Σ_{t=1}^T γ^{t−1} r(s_t, a_t, s_{t+1}),

is maximized.

In Part III, we focus on the framework of policy search. First, direct policy search methods are introduced, which try to find the policy that achieves the maximum return via gradient ascent (Chapter 7) or expectation-maximization (Chapter 8). A potential weakness of the direct policy search approach is its instability due to the randomness of stochastic policies. To overcome the instability problem, an alternative approach called policy-prior search is introduced in Chapter 9.
Chapter 7

Direct Policy Search by Gradient Ascent

The direct policy search approach tries to find the policy that maximizes the expected return. In this chapter, we introduce gradient-based algorithms for direct policy search. After the problem formulation in Section 7.1, the gradient ascent algorithm is introduced in Section 7.2. Then, in Section 7.3, its extension using natural gradients is described. In Section 7.4, an application to computer graphics is shown. Finally, this chapter is concluded in Section 7.5.

7.1 Formulation

In this section, the problem of direct policy search is mathematically formulated.

Let us consider a Markov decision process specified by

(S, A, p(s′|s, a), p(s), r, γ),

where S is a set of continuous states, A is a set of continuous actions, p(s′|s, a) is the transition probability density from current state s to next state s′ when action a is taken, p(s) is the probability density of initial states, r(s, a, s′) is an immediate reward for the transition from s to s′ by taking action a, and 0 < γ ≤ 1 is the discount factor for future rewards.

Let π(a|s, θ) be a stochastic policy parameterized by θ, which represents the conditional probability density of taking action a in state s. Let h be a trajectory of length T:

h = [s_1, a_1, ..., s_T, a_T, s_{T+1}].

The return (i.e., the discounted sum of future rewards) along h is defined as

R(h) = Σ_{t=1}^T γ^{t−1} r(s_t, a_t, s_{t+1}),

and the expected return for policy parameter θ is defined as

J(θ) = E_{p(h|θ)}[ R(h) ] = ∫ p(h|θ) R(h) dh,
FIGURE 7.1: Gradient ascent for direct policy search.
where E_{p(h|θ)} is the expectation over trajectory h drawn from p(h|θ), and p(h|θ) denotes the probability density of observing trajectory h under policy parameter θ:

p(h|θ) = p(s_1) ∏_{t=1}^T p(s_{t+1}|s_t, a_t) π(a_t|s_t, θ).

The goal of direct policy search is to find the optimal policy parameter θ* that maximizes the expected return J(θ):

θ* = argmax_θ J(θ).

However, directly maximizing J(θ) is hard since J(θ) usually involves high non-linearity with respect to θ. Below, a gradient-based algorithm is introduced to find a local maximizer of J(θ). An alternative approach based on the expectation-maximization algorithm is provided in Chapter 8.
7.2 Gradient Approach

In this section, a gradient ascent method for direct policy search is introduced (Figure 7.1).

7.2.1 Gradient Ascent

The simplest approach to finding a local maximizer of the expected return is gradient ascent (Williams, 1992):

θ ← θ + ε ∇_θ J(θ),
where ε is a small positive constant and ∇_θ J(θ) denotes the gradient of the expected return J(θ) with respect to the policy parameter θ. The gradient ∇_θ J(θ) is given by

∇_θ J(θ) = ∫ ∇_θ p(h|θ) R(h) dh
         = ∫ p(h|θ) ∇_θ log p(h|θ) R(h) dh
         = ∫ p(h|θ) Σ_{t=1}^T ∇_θ log π(a_t|s_t, θ) R(h) dh,

where the so-called "log trick" is used:

∇_θ p(h|θ) = p(h|θ) ∇_θ log p(h|θ).

This expression means that the gradient ∇_θ J(θ) is given as the expectation over p(h|θ):

∇_θ J(θ) = E_{p(h|θ)} [ Σ_{t=1}^T ∇_θ log π(a_t|s_t, θ) R(h) ].

Since p(h|θ) is unknown, the expectation is approximated by the empirical average as

∇_θ Ĵ(θ) = (1/N) Σ_{n=1}^N Σ_{t=1}^T ∇_θ log π(a_{t,n}|s_{t,n}, θ) R(h_n),

where

h_n = [s_{1,n}, a_{1,n}, ..., s_{T,n}, a_{T,n}, s_{T+1,n}]

is an independent sample from p(h|θ). This algorithm is called REINFORCE (Williams, 1992), which is an acronym for "REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility."
A popular choice for the policy model π(a|s, θ) is the Gaussian policy model, where policy parameter θ consists of mean vector µ and standard deviation σ:

$$\pi(a|s, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(a - \mu^\top \phi(s))^2}{2\sigma^2} \right). \qquad (7.1)$$

Here, φ(s) denotes the basis function. For this Gaussian policy model, the policy gradients are explicitly computed as

$$\nabla_\mu \log \pi(a|s, \mu, \sigma) = \frac{a - \mu^\top \phi(s)}{\sigma^2}\, \phi(s),$$

$$\nabla_\sigma \log \pi(a|s, \mu, \sigma) = \frac{(a - \mu^\top \phi(s))^2 - \sigma^2}{\sigma^3}.$$
As shown above, the gradient ascent algorithm for direct policy search is very simple to implement. Furthermore, the property that policy parameters are gradually updated in the gradient ascent algorithm is preferable when reinforcement learning is applied to the control of a vulnerable physical system such as a humanoid robot, because a sudden policy change can damage the system. However, the variance of policy gradients tends to be large in practice (Peters & Schaal, 2006; Sehnke et al., 2010), which can result in slow and unstable convergence.
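To make the estimator above concrete, the following is a minimal sketch (in Python/NumPy, not taken from the book) of the empirical REINFORCE gradient for the Gaussian policy (7.1). The trajectory format, the basis function `phi`, and the step size `eps` are illustrative assumptions.

```python
import numpy as np

def reinforce_gradient(trajectories, mu, sigma, phi, gamma):
    """Estimate grad_mu J and grad_sigma J from N sampled trajectories.

    trajectories: list of trajectories, each a list of (s, a, r) triplets of length T.
    """
    grad_mu = np.zeros_like(mu)
    grad_sigma = 0.0
    N = len(trajectories)
    for traj in trajectories:
        # Return R(h): discounted sum of rewards along the trajectory.
        R = sum(gamma**t * r for t, (_, _, r) in enumerate(traj))
        for s, a, _ in traj:
            f = phi(s)                          # feature vector phi(s)
            delta = a - mu @ f                  # a - mu^T phi(s)
            grad_mu += (delta / sigma**2) * f * R
            grad_sigma += ((delta**2 - sigma**2) / sigma**3) * R
    return grad_mu / N, grad_sigma / N

# Gradient ascent step with a small step size eps (illustrative value):
# mu, sigma = mu + eps * g_mu, sigma + eps * g_sigma
```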
7.2.2 Baseline Subtraction for Variance Reduction
Baseline subtraction is a useful technique to reduce the variance of gradient estimators. Technically, baseline subtraction can be viewed as the method of control variates (Fishman, 1996), which is an effective approach to reducing the variance of Monte Carlo integral estimators.

The basic idea of baseline subtraction is that an unbiased estimator η̂ is still unbiased if a zero-mean random variable m multiplied by a constant ξ is subtracted:

$$\widehat{\eta}_\xi = \widehat{\eta} - \xi m.$$

The constant ξ, which is called a baseline, may be chosen so that the variance of η̂_ξ is minimized. By baseline subtraction, a more stable estimator than the original η̂ can be obtained.
A policy gradient estimator with baseline ξ subtracted is given by

$$\widehat{\nabla}_\theta J_\xi(\theta) = \frac{1}{N} \sum_{n=1}^{N} \bigl( R(h_n) - \xi \bigr) \sum_{t=1}^{T} \nabla_\theta \log \pi(a_{t,n}|s_{t,n}, \theta),$$

where the expectation of ∇_θ log π(a|s, θ) is zero:

$$\mathbb{E}[\nabla_\theta \log \pi(a|s, \theta)] = \int \pi(a|s, \theta)\, \nabla_\theta \log \pi(a|s, \theta)\, \mathrm{d}a = \int \nabla_\theta \pi(a|s, \theta)\, \mathrm{d}a = \nabla_\theta \int \pi(a|s, \theta)\, \mathrm{d}a = \nabla_\theta 1 = 0.$$

The optimal baseline is defined as the minimizer of the variance of the gradient estimator with respect to the baseline (Greensmith et al., 2004; Weaver & Tao, 2001):

$$\xi^* = \arg\min_\xi \mathrm{Var}_{p(h|\theta)}\bigl[\widehat{\nabla}_\theta J_\xi(\theta)\bigr],$$
where Var_{p(h|θ)} denotes the trace of the covariance matrix:

$$\mathrm{Var}_{p(h|\theta)}[\zeta] = \mathrm{tr}\Bigl( \mathbb{E}_{p(h|\theta)}\bigl[ (\zeta - \mathbb{E}_{p(h|\theta)}[\zeta])(\zeta - \mathbb{E}_{p(h|\theta)}[\zeta])^\top \bigr] \Bigr) = \mathbb{E}_{p(h|\theta)}\bigl[ \|\zeta - \mathbb{E}_{p(h|\theta)}[\zeta]\|^2 \bigr].$$

It was shown in Peters and Schaal (2006) that the optimal baseline ξ* is given as

$$\xi^* = \frac{\mathbb{E}_{p(h|\theta)}\Bigl[ R(h) \bigl\| \sum_{t=1}^{T} \nabla_\theta \log \pi(a_t|s_t, \theta) \bigr\|^2 \Bigr]}{\mathbb{E}_{p(h|\theta)}\Bigl[ \bigl\| \sum_{t=1}^{T} \nabla_\theta \log \pi(a_t|s_t, \theta) \bigr\|^2 \Bigr]}.$$

In practice, the expectations are approximated by sample averages.
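As a rough illustration, the sample-based optimal baseline and the corresponding baseline-subtracted gradient could be computed as follows. This is a sketch under the same assumptions as the previous snippet; stacking the µ- and σ-gradients into a single vector is an illustrative choice, not the book's notation.

```python
import numpy as np

def baseline_subtracted_gradient(trajectories, mu, sigma, phi, gamma):
    """REINFORCE gradient with the estimated optimal baseline subtracted."""
    returns, score_sums = [], []
    for traj in trajectories:
        R = sum(gamma**t * r for t, (_, _, r) in enumerate(traj))
        # Sum over t of the score vector (gradients wrt mu and sigma stacked).
        score = np.zeros(len(mu) + 1)
        for s, a, _ in traj:
            f = phi(s)
            delta = a - mu @ f
            score[:-1] += (delta / sigma**2) * f
            score[-1] += (delta**2 - sigma**2) / sigma**3
        returns.append(R)
        score_sums.append(score)
    returns, score_sums = np.array(returns), np.array(score_sums)
    norms2 = np.sum(score_sums**2, axis=1)
    xi = np.sum(returns * norms2) / np.sum(norms2)       # sample estimate of xi*
    grad = np.mean((returns - xi)[:, None] * score_sums, axis=0)
    return grad   # first len(mu) entries: gradient wrt mu, last entry: wrt sigma
```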
7.2.3 Variance Analysis of Gradient Estimators
Here, the variance of gradient estimators is theoretically investigated for the Gaussian policy model (7.1) with φ(s) = s. See Zhao et al. (2012) for technical details.

In the theoretical analysis, subsets of the following assumptions are considered:

Assumption (A): r(s, a, s') ∈ [−β, β] for β > 0.
Assumption (B): r(s, a, s') ∈ [α, β] for 0 < α < β.
Assumption (C): For δ > 0, there exist two series {c_t}_{t=1}^{T} and {d_t}_{t=1}^{T} such that ‖s_t‖ ≥ c_t and ‖s_t‖ ≤ d_t hold with probability at least 1 − δ/(2N), respectively, over the choice of sample paths.

Note that Assumption (B) is stronger than Assumption (A). Let

$$\zeta(T) = C_T \alpha^2 - D_T \beta^2 / (2\pi),$$

where

$$C_T = \sum_{t=1}^{T} c_t^2 \quad \text{and} \quad D_T = \sum_{t=1}^{T} d_t^2.$$
First, the variance of gradient estimators is analyzed.

Theorem 7.1 Under Assumptions (A) and (C), the following upper bound holds with probability at least 1 − δ/2:

$$\mathrm{Var}_{p(h|\theta)}\bigl[\widehat{\nabla}_\mu J(\mu, \sigma)\bigr] \le \frac{D_T \beta^2 (1 - \gamma^T)^2}{N \sigma^2 (1 - \gamma)^2}.$$

Under Assumption (A), it holds that

$$\mathrm{Var}_{p(h|\theta)}\bigl[\widehat{\nabla}_\sigma J(\mu, \sigma)\bigr] \le \frac{2 T \beta^2 (1 - \gamma^T)^2}{N \sigma^2 (1 - \gamma)^2}.$$
The above upper bounds are monotone increasing with respect to the trajectory length T.

For the variance of ∇̂_µ J(µ, σ), the following lower bound holds (its upper bound has not been derived yet):

Theorem 7.2 Under Assumptions (B) and (C), the following lower bound holds with probability at least 1 − δ:

$$\mathrm{Var}_{p(h|\theta)}\bigl[\widehat{\nabla}_\mu J(\mu, \sigma)\bigr] \ge \frac{(1 - \gamma^T)^2}{N \sigma^2 (1 - \gamma)^2}\, \zeta(T).$$

This lower bound is non-trivial if ζ(T) > 0, which can be fulfilled, e.g., if α and β satisfy

$$2\pi C_T \alpha^2 > D_T \beta^2.$$
Next, the contribution of the optimal baseline is investigated. It was shown (Greensmith et al., 2004; Weaver & Tao, 2001) that the excess variance for an arbitrary baseline ξ is given by

$$\mathrm{Var}_{p(h|\theta)}\bigl[\widehat{\nabla}_\theta J_\xi(\theta)\bigr] - \mathrm{Var}_{p(h|\theta)}\bigl[\widehat{\nabla}_\theta J_{\xi^*}(\theta)\bigr] = \frac{(\xi - \xi^*)^2}{N}\, \mathbb{E}_{p(h|\theta)}\left[ \left\| \sum_{t=1}^{T} \nabla_\theta \log \pi(a_t|s_t, \theta) \right\|^2 \right].$$

Based on this expression, the following theorem can be obtained.
Theorem 7.3 Under Assumptions (B) and (C), the following bounds hold with probability at least 1 − δ:

$$\frac{C_T \alpha^2 (1 - \gamma^T)^2}{N \sigma^2 (1 - \gamma)^2} \le \mathrm{Var}_{p(h|\theta)}\bigl[\widehat{\nabla}_\mu J(\mu, \sigma)\bigr] - \mathrm{Var}_{p(h|\theta)}\bigl[\widehat{\nabla}_\mu J_{\xi^*}(\mu, \sigma)\bigr] \le \frac{D_T \beta^2 (1 - \gamma^T)^2}{N \sigma^2 (1 - \gamma)^2}.$$
This theorem shows that the lower bound of the excess variance is positive and monotone increasing with respect to the trajectory length T. This means that the variance is always reduced by optimal baseline subtraction and the amount of variance reduction is monotone increasing with respect to the trajectory length T. Note that the upper bound is also monotone increasing with respect to the trajectory length T.

Finally, the variance of gradient estimators with the optimal baseline is investigated:

Theorem 7.4 Under Assumptions (B) and (C), it holds that

$$\mathrm{Var}_{p(h|\theta)}\bigl[\widehat{\nabla}_\mu J_{\xi^*}(\mu, \sigma)\bigr] \le \frac{(1 - \gamma^T)^2}{N \sigma^2 (1 - \gamma)^2}\, (\beta^2 D_T - \alpha^2 C_T),$$

where the inequality holds with probability at least 1 − δ.
FIGURE 7.2: Ordinary gradients (a) and natural gradients (b). Ordinary gradients treat all dimensions equally, while natural gradients take the Riemannian structure into account.
This theorem shows that the upper bound of the variance of the gradient estimators with the optimal baseline is still monotone increasing with respect to the trajectory length T. Thus, when the trajectory length T is large, the variance of the gradient estimators can still be large even with the optimal baseline.

In Chapter 9, another gradient approach will be introduced for overcoming this large-variance problem.
7.3 Natural Gradient Approach
The gradient-based policy parameter update used in the REINFORCE algorithm is performed under the Euclidean metric. In this section, we show another useful choice of the metric for gradient-based policy search.
7.3.1 Natural Gradient Ascent
Use of the Euclidean metric implies that all dimensions of the policy parameter vector θ are treated equally (Figure 7.2(a)). However, since a policy parameter θ specifies a conditional probability density π(a|s, θ), use of the Euclidean metric in the parameter space does not necessarily mean all dimensions are treated equally in the space of conditional probability densities. Thus, a small change in the policy parameter θ can cause a big change in the conditional probability density π(a|s, θ) (Kakade, 2002).
Figure 7.3 describes the Gaussian densities with mean µ = −5, 0, 5 and standard deviation σ = 1, 2. This shows that if the standard deviation is doubled, the difference in mean should also be doubled to maintain the same overlapping level. Thus, it is "natural" to compute the distance between two Gaussian densities parameterized with (µ, σ) and (µ + ∆µ, σ) not by ∆µ, but by ∆µ/σ.

FIGURE 7.3: Gaussian densities with different means and standard deviations. If the standard deviation is doubled (from the solid lines to dashed lines), the difference in mean should also be doubled to maintain the same overlapping level.
Gradients that treat all dimensions equally in the space of probability densities are called natural gradients (Amari, 1998; Amari & Nagaoka, 2000). The ordinary gradient is defined as the steepest ascent direction under the Euclidean metric (Figure 7.2(a)):

$$\nabla_\theta J(\theta) = \arg\max_{\Delta\theta} J(\theta + \Delta\theta) \quad \text{subject to} \quad \Delta\theta^\top \Delta\theta \le \epsilon,$$

where ε is a small positive number. On the other hand, the natural gradient is defined as the steepest ascent direction under the Riemannian metric (Figure 7.2(b)):

$$\widetilde{\nabla}_\theta J(\theta) = \arg\max_{\Delta\theta} J(\theta + \Delta\theta) \quad \text{subject to} \quad \Delta\theta^\top R_\theta \Delta\theta \le \epsilon,$$

where R_θ is the Riemannian metric, which is a positive definite matrix. The solution of the above optimization problem is given by

$$\widetilde{\nabla}_\theta J(\theta) = R_\theta^{-1} \nabla_\theta J(\theta).$$

Thus, the ordinary gradient ∇_θ J(θ) is modified by the inverse Riemannian metric R_θ^{-1} in the natural gradient.
A standard distance metric in the space of probability densities is the Kullback–Leibler (KL) divergence (Kullback & Leibler, 1951). The KL divergence from density p to density q is defined as

$$\mathrm{KL}(p \| q) = \int p(\theta) \log \frac{p(\theta)}{q(\theta)}\, \mathrm{d}\theta.$$
KL(p‖q) is always non-negative and zero if and only if p = q. Thus, smaller KL(p‖q) means that p and q are "closer." However, note that the KL divergence is not symmetric, i.e., KL(p‖q) ≠ KL(q‖p) in general.

For small ∆θ, the KL divergence from p(h|θ) to p(h|θ + ∆θ) can be approximated by

$$\Delta\theta^\top F_\theta \Delta\theta,$$

where F_θ is the Fisher information matrix:

$$F_\theta = \mathbb{E}_{p(h|\theta)}\bigl[ \nabla_\theta \log p(h|\theta)\, \nabla_\theta \log p(h|\theta)^\top \bigr].$$

Thus, F_θ is the Riemannian metric induced by the KL divergence.
Then the update rule of the policy parameter θ based on the natural gradient is given by

$$\theta \longleftarrow \theta + \varepsilon \widehat{F}_\theta^{-1} \nabla_\theta J(\theta),$$

where ε is a small positive constant and F̂_θ is a sample approximation of F_θ:

$$\widehat{F}_\theta = \frac{1}{N} \sum_{n=1}^{N} \nabla_\theta \log p(h_n|\theta)\, \nabla_\theta \log p(h_n|\theta)^\top.$$
Under mild regularity conditions, the Fisher information matrix F_θ can be expressed as

$$F_\theta = -\mathbb{E}_{p(h|\theta)}\bigl[ \nabla_\theta^2 \log p(h|\theta) \bigr],$$

where ∇²_θ log p(h|θ) denotes the Hessian matrix of log p(h|θ). That is, the (b, b')-th element of ∇²_θ log p(h|θ) is given by ∂² log p(h|θ) / (∂θ_b ∂θ_{b'}). This means that the natural gradient takes the curvature into account, by which the convergence behavior at flat plateaus and steep ridges tends to be improved. On the other hand, a potential weakness of natural gradients is that computation of the inverse Riemannian metric tends to be numerically unstable (Deisenroth et al., 2013).
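The following is a minimal sketch of the natural gradient update with the sample-approximated Fisher matrix; `score_sums` holds the per-trajectory sums of ∇_θ log π(a_t|s_t, θ) (which equal ∇_θ log p(h_n|θ)), and the small ridge term is an assumption added here purely to illustrate the numerical-stability issue mentioned above.

```python
import numpy as np

def natural_gradient_step(theta, grad, score_sums, eps=0.1, ridge=1e-6):
    """One natural gradient ascent step.

    theta, grad : (d,) arrays (current parameter and ordinary gradient estimate).
    score_sums  : (N, d) array; row n is sum_t grad_theta log pi for trajectory n.
    """
    N, d = score_sums.shape
    # Sample Fisher matrix: average of outer products of the trajectory scores.
    F = score_sums.T @ score_sums / N
    # Ridge regularization (an assumption, not from the book) because inverting
    # the estimated Riemannian metric can be ill-conditioned in practice.
    nat_grad = np.linalg.solve(F + ridge * np.eye(d), grad)
    return theta + eps * nat_grad
```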
7.3.2 Illustration
Let us illustrate the difference between ordinary and natural gradients numerically.

Consider a one-dimensional real-valued state space S = R and a one-dimensional real-valued action space A = R. The transition dynamics is linear and deterministic as s' = s + a, and the reward function is quadratic as r = 0.5s² − 0.05a. The discount factor is set at γ = 0.95. The Gaussian policy model,

$$\pi(a|s, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(a - \mu s)^2}{2\sigma^2} \right),$$

is employed, which contains the mean parameter µ and the standard deviation parameter σ. The optimal policy parameters in this setup are given by (µ*, σ*) ≈ (−0.912, 0).
FIGURE 7.4: Numerical illustrations of ordinary and natural gradients. (a) Ordinary gradients; (b) natural gradients. Both panels are plotted over the (µ, σ) parameter space.
Figure 7.4 shows a numerical comparison of ordinary and natural gradients for the Gaussian policy. The contour lines and the arrows indicate the expected return surface and the gradient directions, respectively. The graphs show that the ordinary gradients tend to strongly reduce the standard deviation parameter σ without really updating the mean parameter µ. This means that the stochasticity of the policy is lost quickly and thus the agent becomes less exploratory. Consequently, once σ gets closer to zero, the solution is at a flat plateau along the direction of µ and thus policy updates in µ are very slow. On the other hand, the natural gradients reduce both the mean parameter µ and the standard deviation parameter σ in a balanced way. As a result, convergence gets much faster than with the ordinary gradient method.
7.4 Application in Computer Graphics: Artist Agent
Oriental ink painting, which is also called sumie, is one of the most distinctive painting styles and has attracted artists around the world. Major challenges in sumie simulation are to abstract complex scene information and reproduce smooth and natural brush strokes. Reinforcement learning is useful to automatically generate such smooth and natural strokes (Xie et al., 2013). In this section, the REINFORCE algorithm explained in Section 7.2 is applied to sumie agent training.
7.4.1 Sumie Painting
Among various techniques of non-photorealistic rendering (Gooch & Gooch, 2001), stroke-based painterly rendering synthesizes an image from a source image in a desired painting style by placing discrete strokes (Hertzmann, 2003). Such an algorithm simulates the common practice of human painters who create paintings with brush strokes.

Western painting styles such as water-color, pastel, and oil painting overlay strokes onto multiple layers, while oriental ink painting uses a few expressive strokes produced by soft brush tufts to convey significant information about a target scene. The appearance of the stroke in oriental ink painting is therefore determined by the shape of the object to paint, the path and posture of the brush, and the distribution of pigments in the brush.

Drawing smooth and natural strokes in arbitrary shapes is challenging since an optimal brush trajectory and the posture of a brush footprint are different for each shape. Existing methods can efficiently map brush texture by deformation onto a user-given trajectory line or the shape of a target stroke (Hertzmann, 1998; Guo & Kunii, 2003). However, the geometrical process of morphing the entire texture of a brush stroke into the target shape leads to undesirable effects such as unnatural foldings and creased appearances at corners or curves.
Here, a soft-tuft brush is treated as a reinforcement learning agent, and the REINFORCE algorithm is used to automatically draw artistic strokes. More specifically, given any closed contour that represents the shape of a desired single stroke without overlap, the agent moves the brush on the canvas to fill the given shape from a start point to an end point with stable poses along a smooth continuous movement trajectory (see Figure 7.5).

In oriental ink painting, there are several different brush styles that characterize the paintings. Below, two representative styles called the upright brush style and the oblique brush style are considered (see Figure 7.6). In the upright brush style, the tip of the brush should be located on the medial axis of the expected stroke shape, and the bottom of the brush should be tangent to both sides of the boundary. On the other hand, in the oblique brush style, the tip of the brush should touch one side of the boundary and the bottom of the brush should be tangent to the other side of the boundary. The choice of the upright brush style and the oblique brush style is exclusive and a user is asked to choose one of the styles in advance.
7.4.2 Design of States, Actions, and Immediate Rewards

Here, the specific design of states, actions, and immediate rewards tailored to the sumie agent is described.
FIGURE 7.5: Illustration of the brush agent and its path. (a) Brush model: a stroke is generated by moving the brush with the following 3 actions: Action 1 is regulating the direction of the brush movement, Action 2 is pushing down/lifting up the brush, and Action 3 is rotating the brush handle. Only Action 1 is determined by reinforcement learning; Action 2 and Action 3 are determined based on Action 1. (b) Footprints: the top symbol illustrates the brush agent, which consists of a tip Q and a circle with center C and radius r; the others illustrate footprints of a real brush with different ink quantities. (c) Basic stroke styles: there are 6 basic stroke styles: full ink, dry ink, first-half hollow, hollow, middle hollow, and both-end hollow. Small footprints on the top of each stroke show the interpolation order.
7.4.2.1 States
The global measurement (i.e., the pose configuration of a footprint under the global Cartesian coordinate) and the local measurement (i.e., the pose and the locomotion information of the brush agent relative to the surrounding environment) are used as states. Here, only the local measurement is used to calculate a reward and a policy, by which the agent can learn a drawing policy that is generalizable to new shapes. Below, the local measurement is regarded as states and the global measurement is dealt with only implicitly.
FIGURE 7.6: Upright brush style (left) and oblique brush style (right).
The local state-space design consists of two components: a current surrounding shape and an upcoming shape. More specifically, the state vector s consists of the following six features:

$$s = (\omega, \varphi, d, \kappa_1, \kappa_2, l)^\top.$$
Each feature is defined as follows (see Figure 7.7):

• ω ∈ (−π, π]: The angle of the velocity vector of the brush agent relative to the medial axis.

• φ ∈ (−π, π]: The heading direction of the brush agent relative to the medial axis.

• d ∈ [−2, 2]: The ratio of the offset distance δ from the center C of the brush agent to the nearest point P on the medial axis M over the radius r of the brush agent (|d| = δ/r). d takes a positive/negative value when the center of the brush agent is on the left-/right-hand side of the medial axis:
  – d takes the value 0 when the center of the brush agent is on the medial axis.
  – d takes a value in [−1, 1] when the brush agent is inside the boundaries.
  – The value of d is in [−2, −1) or in (1, 2] when the brush agent goes over the boundary of one side.
FIGURE 7.7: Illustration of the design of states. Left: The brush agent consists of a tip Q and a circle with center C and radius r. Right: The ratio d of the offset distance δ over the radius r. Footprint f_{t−1} is inside the drawing area, and the circle with center C_{t−1} and the tip Q_{t−1} touch the boundary on each side; in this case, δ_{t−1} ≤ r_{t−1} and d_{t−1} ∈ [0, 1]. On the other hand, f_t goes over the boundary, and then δ_t > r_t and d_t > 1. Note that d is restricted to be in [−2, 2], and P is the nearest point on the medial axis M to C.
Note that the center of the agent is restricted to be within the shape. Therefore, the extreme values of d are ±2 when the center of the agent is on the boundary.
• κ1, κ2 ∈ (−1, 1): κ1 provides the current surrounding information on the point P_t, whereas κ2 provides the upcoming shape information on point P_{t+1}:

$$\kappa_i = \frac{2}{\pi} \arctan\bigl( 0.05 / r'_i \bigr),$$

where r'_i is the radius of the curve. More specifically, the value takes 0/negative/positive when the shape is straight/left-curved/right-curved, and the larger its absolute value is, the tighter the curve is.

• l ∈ {0, 1}: A binary label that indicates whether the agent moves to a region covered by the previous footprints or not. l = 0 means that the agent moves to a region covered by the previous footprint. Otherwise, l = 1 means that it moves to an uncovered region.
7.4.2.2 Actions
To generate elegant brush strokes, the brush agent should move inside given boundaries properly. Here, the following actions are considered to control the brush (see Figure 7.5(a)):

• Action 1: Movement of the brush on the canvas paper.
• Action 2: Scaling up/down of the footprint.
• Action 3: Rotation of the heading direction of the brush.
Since properly covering the whole desired region is the most important factor in terms of the visual quality, the movement of the brush (Action 1) is regarded as the primary action. More specifically, Action 1 takes a value in (−π, π] that indicates the offset turning angle of the motion direction relative to the medial axis of an expected stroke shape. In practical applications, the agent should be able to deal with arbitrary strokes in various scales. To achieve stable performance in different scales, the velocity is adaptively changed as r/3, where r is the radius of the current footprint.

Action 1 is determined by the Gaussian policy function trained by the REINFORCE algorithm, and Action 2 and Action 3 are determined as follows.

• Oblique brush stroke style: The tip of the agent is set to touch one side of the boundary, and the bottom of the agent is set to be tangent to the other side of the boundary.

• Upright brush stroke style: The tip of the agent is chosen to travel along the medial axis of the shape.

If it is not possible to satisfy the above constraints by adjusting Action 2 and Action 3, the new footprint simply keeps the same posture as the previous one.
7.4.2.3 Immediate Rewards
The immediate reward function measures the quality of the brush agent's movement after taking an action at each time step. The reward is designed to reflect the following two aspects:

• The distance between the center of the brush agent and the nearest point on the medial axis of the shape at the current time step: This detects whether the agent moves out of the region or travels backward from the correct direction.

• Change of the local configuration of the brush agent after executing an action: This detects whether the agent moves smoothly.

These two aspects are formalized by defining the reward function as follows:

$$r(s_t, a_t, s_{t+1}) = \begin{cases} 0 & \text{if } f_t = f_{t+1} \text{ or } l_{t+1} = 0, \\[4pt] \dfrac{2 + |\kappa_1(t)| + |\kappa_2(t)|}{E^{(t)}_{\mathrm{location}} + E^{(t)}_{\mathrm{posture}}} & \text{otherwise}, \end{cases}$$
where f_t and f_{t+1} are the footprints at time steps t and t+1, respectively. This reward design implies that the immediate reward is zero when the brush is blocked by a boundary (f_t = f_{t+1}) or the brush is going backward to a region that has already been covered by previous footprints. κ1(t) and κ2(t) are the values of κ1 and κ2 at time step t. |κ1(t)| + |κ2(t)| adaptively increases the immediate reward depending on the curvatures κ1(t) and κ2(t) of the medial axis.
E^{(t)}_location measures the quality of the location of the brush agent with respect to the medial axis, defined by

$$E^{(t)}_{\mathrm{location}} = \begin{cases} \tau_1 |\omega_t| + \tau_2 (|d_t| + 5) & d_t \in [-2, -1) \cup (1, 2], \\ \tau_1 |\omega_t| + \tau_2 |d_t| & d_t \in [-1, 1], \end{cases}$$

where d_t is the value of d at time step t. τ1 and τ2 are weight parameters, which are chosen depending on the brush style: τ1 = τ2 = 0.5 for the upright brush style and τ1 = 0.1 and τ2 = 0.9 for the oblique brush style. Since d_t contains information about whether the agent goes over the boundary or not, as illustrated in Figure 7.7, the penalty +5 is added to E_location when the agent goes over the boundary of the shape.
E^{(t)}_posture measures the quality of the posture of the brush agent based on neighboring footprints, defined by

$$E^{(t)}_{\mathrm{posture}} = \Delta\omega_t / 3 + \Delta\varphi_t / 3 + \Delta d_t / 3,$$

where ∆ω_t, ∆φ_t, and ∆d_t are changes in the angle ω of the velocity vector, the heading direction φ, and the ratio d of the offset distance, respectively. The notation ∆x_t denotes the normalized squared change between x_{t−1} and x_t defined by

$$\Delta x_t = \begin{cases} 1 & \text{if } x_t = x_{t-1} = 0, \\[4pt] \dfrac{(x_t - x_{t-1})^2}{(|x_t| + |x_{t-1}|)^2} & \text{otherwise}. \end{cases}$$
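For illustration, the following sketch shows how this immediate reward could be evaluated (it is not the authors' implementation; the dictionary-based state representation and the `same_footprint` flag are assumptions introduced here for readability).

```python
# Sketch of the sumie immediate reward; states are dicts with keys
# omega, phi, d, kappa1, kappa2, l, assumed to be computed elsewhere.

def normalized_sq_change(x_now, x_prev):
    if x_now == 0 and x_prev == 0:
        return 1.0
    return (x_now - x_prev) ** 2 / (abs(x_now) + abs(x_prev)) ** 2

def immediate_reward(prev, cur, same_footprint, tau1=0.5, tau2=0.5):
    # Zero reward when the brush is blocked or moves back to a covered region.
    if same_footprint or cur["l"] == 0:
        return 0.0
    # Location term: the +5 penalty applies when the agent goes over the boundary.
    if abs(cur["d"]) > 1:
        e_location = tau1 * abs(cur["omega"]) + tau2 * (abs(cur["d"]) + 5)
    else:
        e_location = tau1 * abs(cur["omega"]) + tau2 * abs(cur["d"])
    # Posture term: normalized squared changes of omega, phi, and d.
    e_posture = (normalized_sq_change(cur["omega"], prev["omega"])
                 + normalized_sq_change(cur["phi"], prev["phi"])
                 + normalized_sq_change(cur["d"], prev["d"])) / 3
    return (2 + abs(cur["kappa1"]) + abs(cur["kappa2"])) / (e_location + e_posture)
```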
7.4.2.4 Training and Test Sessions
A naive way to train an agent is to use an entire stroke shape as a training sample. However, this has several drawbacks, e.g., collecting many training samples is costly and generalization to new shapes is hard. To overcome these limitations, the agent is trained based on partial shapes, not the entire shapes (Figure 7.8(a)). This allows us to generate various partial shapes from a single entire shape, which significantly increases the number and variation of training samples. Another merit is that the generalization ability to new shapes can be enhanced, because even when the entire profile of a new shape is quite different from that of the training data, the new shape may contain similar partial shapes. Figure 7.8(c) illustrates 8 examples of 80 digitized real single brush strokes that are commonly used in oriental ink painting. Boundaries are extracted as the shape information and are arranged in a queue for training (see Figure 7.8(b)).
FIGURE 7.8: Policy training scheme. (a) Each entire shape is composed of one of the upper regions U_i, the common region Ω, and one of the lower regions L_j. (b) Boundaries are extracted as the shape information and are arranged in a queue for training. (c) Eight examples of 80 digitized real single brush strokes that are commonly used in oriental ink painting.

In the training session, the initial position of the first episode is chosen to be the start point of the medial axis, and the direction to move is chosen to be the goal point, as illustrated in Figure 7.8(b). In the first episode, the initial footprint is set at the start point of the shape. Then, in the following episodes, the initial footprint is set at either the last footprint in the previous episode or the start point of the shape, depending on whether the agent moved well or was blocked by the boundary in the previous episode.
After learning a drawing policy, the brush agent applies the learned policy to covering given boundaries with smooth strokes. The location of the agent is initialized at the start point of a new shape. The agent then sequentially selects actions based on the learned policy and makes transitions until it reaches the goal point.

FIGURE 7.9: Average and standard deviation of returns obtained by the reinforcement learning (RL) method over 10 trials and the upper limit of the return value. (a) Upright brush style; (b) oblique brush style.
7.4.3 Experimental Results
First, the performance of the reinforcement learning (RL) method is investigated. Policies are separately trained by the REINFORCE algorithm for the upright brush style and the oblique brush style using 80 single strokes as training data (see Figure 7.8(c)). The parameters of the initial policy are set at

$$\theta = (\mu^\top, \sigma)^\top = (0, 0, 0, 0, 0, 0, 2)^\top,$$

where the first six elements correspond to the Gaussian mean and the last element is the Gaussian standard deviation. The agent collects N = 300 episodic samples with trajectory length T = 32. The discount factor is set at γ = 0.99.

The average and standard deviations of the return for 300 training episodic samples over 10 trials are plotted in Figure 7.9. The graphs show that the average returns sharply increase in an early stage and approach the optimal values (i.e., receiving the maximum immediate reward, +1, for all steps).
Next, the performance of the RL method is compared with that of the dynamic programming (DP) method (Xie et al., 2011), which involves discretization of the continuous state space. In Figure 7.10, the experimental results obtained by DP with different numbers of footprint candidates in each step of the DP search are plotted together with the result obtained by RL. This shows that the execution time of the DP method increases significantly as the number of footprint candidates increases. In the DP method, the best return value 26.27 is achieved when the number of footprint candidates is set at 180. Although this maximum value is comparable to the return obtained by the RL method (26.44), RL is about 50 times faster than the DP method. Figure 7.11 shows some exemplary strokes generated by RL (the top two rows) and DP (the bottom two rows). This shows that the agent trained by RL is able to draw nice strokes with stable poses after the 30th policy update iteration (see also Figure 7.9). On the other hand, as illustrated in Figure 7.11, the DP results for 5, 60, and 100 footprint candidates are unacceptably poor. Given that the DP method requires manual tuning of the number of footprint candidates at each step for each input shape, the RL method is demonstrated to be promising.

FIGURE 7.10: Average return and computation time for reinforcement learning (RL) and dynamic programming (DP) as functions of the number of footprint candidates. (a) Average return; (b) computation time.
The RL method is further applied to more realistic shapes, illustrated in Figure 7.12. Although the shapes are not included in the training samples, the RL method can produce smooth and natural brush strokes for various unlearned shapes. More results are illustrated in Figure 7.13, showing that the RL method is promising for photo conversion into the sumie style.
7.5 Remarks
In this chapter, gradient-based algorithms for direct policy search were introduced. These gradient-based methods are suitable for controlling vulnerable physical systems such as humanoid robots, thanks to the nature of gradient methods that parameters are updated gradually. Furthermore, direct policy search can handle continuous actions in a straightforward way, which is an advantage over policy iteration, explained in Part II.
FIGURE 7.11: Examples of strokes generated by RL and DP. The top two rows show the RL results over policy update iterations (1st, 10th, 20th, 30th, and 40th iterations), while the bottom two rows show the DP results for different numbers of footprint candidates (5, 60, 100, 140, and 180 candidates). The line segment connects the center and the tip of a footprint, and the circle denotes the bottom circle of the footprint.
The gradient-based method was successfully applied to automatic sumie painting generation. Considering local measurements in the state design was shown to be useful, which allowed a brush agent to learn a general drawing policy that is independent of a specific entire shape. Another important factor was to train the brush agent on partial shapes, not the entire shapes. This contributed highly to enhancing the generalization ability to new shapes, because even when a new shape is quite different from the training data as a whole, it often contains similar partial shapes. In this kind of real-world application, manually designing immediate reward functions is often time consuming and difficult. The use of inverse reinforcement learning (Abbeel & Ng, 2004) would be a promising approach for this purpose. In particular, in the context of sumie drawing, such data-driven design of reward functions would allow automatic learning of the style of a particular artist from his/her drawings.

FIGURE 7.12: Results on new shapes. (a) Real photo; (b) user input boundaries; (c) trajectories estimated by RL; (d) rendering results.
A practical weakness of the gradient-based approach is that the step size of gradient ascent is often difficult to choose. In Chapter 8, a step-size-free method of direct policy search based on the expectation-maximization algorithm will be introduced. Another critical problem of direct policy search is that the policy update is rather unstable due to the stochasticity of policies. Although variance reduction by baseline subtraction can mitigate this problem to some extent, the instability problem is still critical in practice. The natural gradient method could be an alternative, but computing the inverse Riemannian metric tends to be unstable. In Chapter 9, another gradient approach that can address the instability problem will be introduced.
FIGURE 7.13: Photo conversion into the sumie style.
Chapter 8
Direct Policy Search by Expectation-Maximization
Gradient-based direct policy search methods introduced in Chapter 7 are useful particularly in controlling continuous systems. However, appropriately choosing the step size of gradient ascent is often difficult in practice. In this chapter, we introduce another direct policy search method based on the expectation-maximization (EM) algorithm that does not contain the step size parameter. In Section 8.1, the main idea of the EM-based method is described, which is expected to converge faster because policies are more aggressively updated than in the gradient-based approach. In practice, however, direct policy search often requires a large number of samples to obtain a stable policy update estimator. To improve the stability when the sample size is small, reusing previously collected samples is a promising approach. In Section 8.2, the sample-reuse technique that has been successfully used to improve the performance of policy iteration (see Chapter 4) is applied to the EM-based method. Then its experimental performance is evaluated in Section 8.3 and this chapter is concluded in Section 8.4.
8.1 Expectation-Maximization Approach
The gradient-based optimization algorithms introduced in Section 7.2 gradually update policy parameters over iterations. Although this is advantageous when controlling a physical system, it requires many iterations until convergence. In this section, the expectation-maximization (EM) algorithm (Dempster et al., 1977) is used to cope with this problem.

The basic idea of EM-based policy search is to iteratively update the policy parameter θ by maximizing a lower bound of the expected return J(θ):

$$J(\theta) = \int p(h|\theta) R(h)\, \mathrm{d}h.$$
To derive a lower bound of J(θ), Jensen's inequality (Bishop, 2006) is utilized:

$$\int q(h) f(g(h))\, \mathrm{d}h \ge f\left( \int q(h) g(h)\, \mathrm{d}h \right),$$

where q is a probability density, f is a convex function, and g is a non-negative function. For f(t) = −log t, Jensen's inequality yields

$$\int q(h) \log g(h)\, \mathrm{d}h \le \log \int q(h) g(h)\, \mathrm{d}h. \qquad (8.1)$$
Assume that the return R(h) is non-negative. Let θ̃ be the current policy parameter during the optimization procedure, and q and g in Eq. (8.1) are set as

$$q(h) = \frac{p(h|\widetilde{\theta}) R(h)}{J(\widetilde{\theta})} \quad \text{and} \quad g(h) = \frac{p(h|\theta)}{p(h|\widetilde{\theta})}.$$
Then the following lower bound holds for all θ:

$$\log \frac{J(\theta)}{J(\widetilde{\theta})} = \log \int \frac{p(h|\theta) R(h)}{J(\widetilde{\theta})}\, \mathrm{d}h = \log \int \frac{p(h|\widetilde{\theta}) R(h)}{J(\widetilde{\theta})}\, \frac{p(h|\theta)}{p(h|\widetilde{\theta})}\, \mathrm{d}h \ge \int \frac{p(h|\widetilde{\theta}) R(h)}{J(\widetilde{\theta})} \log \frac{p(h|\theta)}{p(h|\widetilde{\theta})}\, \mathrm{d}h.$$

This yields

$$\log J(\theta) \ge \log \widetilde{J}(\theta),$$

where

$$\log \widetilde{J}(\theta) = \int \frac{R(h)\, p(h|\widetilde{\theta})}{J(\widetilde{\theta})} \log \frac{p(h|\theta)}{p(h|\widetilde{\theta})}\, \mathrm{d}h + \log J(\widetilde{\theta}).$$
In the EM approach, the parameter θ is iteratively updated by maximizing the lower bound J̃(θ):

$$\widehat{\theta} = \arg\max_\theta \widetilde{J}(\theta).$$

Since log J̃(θ̃) = log J(θ̃), the lower bound J̃ touches the target function J at the current solution θ̃:

$$\widetilde{J}(\widetilde{\theta}) = J(\widetilde{\theta}).$$

Thus, monotone non-decrease of the expected return is guaranteed:

$$J(\widehat{\theta}) \ge J(\widetilde{\theta}).$$

This update is iterated until convergence (see Figure 8.1).
FIGURE 8.1: Policy parameter update in the EM-based policy search. The policy parameter θ is updated iteratively by maximizing the lower bound J̃(θ), which touches the true expected return J(θ) at the current solution θ̃.

Let us employ the Gaussian policy model defined as

$$\pi(a|s, \theta) = \pi(a|s, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(a - \mu^\top \phi(s))^2}{2\sigma^2} \right),$$

where θ = (µ^⊤, σ)^⊤ and φ(s) denotes the basis function.
The maximizer θ̂ = (µ̂^⊤, σ̂)^⊤ of the lower bound J̃(θ) can be analytically obtained as

$$\widehat{\mu} = \left( \int p(h|\widetilde{\theta}) R(h) \sum_{t=1}^{T} \phi(s_t)\phi(s_t)^\top \mathrm{d}h \right)^{-1} \int p(h|\widetilde{\theta}) R(h) \sum_{t=1}^{T} a_t \phi(s_t)\, \mathrm{d}h$$
$$\approx \left( \sum_{n=1}^{N} R(h_n) \sum_{t=1}^{T} \phi(s_{t,n})\phi(s_{t,n})^\top \right)^{-1} \sum_{n=1}^{N} R(h_n) \sum_{t=1}^{T} a_{t,n} \phi(s_{t,n}),$$

$$\widehat{\sigma}^2 = \left( \int p(h|\widetilde{\theta}) R(h)\, \mathrm{d}h \right)^{-1} \int p(h|\widetilde{\theta}) R(h) \frac{1}{T} \sum_{t=1}^{T} \bigl(a_t - \widehat{\mu}^\top \phi(s_t)\bigr)^2 \mathrm{d}h$$
$$\approx \left( \sum_{n=1}^{N} R(h_n) \right)^{-1} \sum_{n=1}^{N} R(h_n) \frac{1}{T} \sum_{t=1}^{T} \bigl(a_{t,n} - \widehat{\mu}^\top \phi(s_{t,n})\bigr)^2,$$

where the expectation over h is approximated by the average over roll-out samples H = {h_n}_{n=1}^{N} from the current policy θ̃:

$$h_n = [s_{1,n}, a_{1,n}, \ldots, s_{T,n}, a_{T,n}].$$

Note that EM-based policy search for Gaussian models is called reward-weighted regression (RWR) (Peters & Schaal, 2007).
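A minimal sketch of one RWR update for the Gaussian policy is shown below (assumptions: trajectories given as lists of (s, a, r) triplets, `phi` the basis function, and non-negative returns as required above).

```python
import numpy as np

def rwr_update(trajectories, phi, gamma):
    """One reward-weighted regression update of (mu, sigma) from roll-out samples."""
    # Returns R(h_n) weight every regression term; they must be non-negative.
    returns = [sum(gamma**t * r for t, (_, _, r) in enumerate(traj))
               for traj in trajectories]
    d = len(phi(trajectories[0][0][0]))
    A, b = np.zeros((d, d)), np.zeros(d)
    for R, traj in zip(returns, trajectories):
        for s, a, _ in traj:
            f = phi(s)
            A += R * np.outer(f, f)
            b += R * a * f
    mu = np.linalg.solve(A, b)                 # reward-weighted least squares for the mean
    # Reward-weighted residual variance for the standard deviation.
    num = sum(R * np.mean([(a - mu @ phi(s)) ** 2 for s, a, _ in traj])
              for R, traj in zip(returns, trajectories))
    sigma = np.sqrt(num / sum(returns))
    return mu, sigma
```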
8.2 Sample Reuse
In practice, a large number of samples is needed to obtain a stable policy update estimator in the EM-based policy search. In this section, the sample-reuse technique is applied to the EM method to cope with the instability problem.
8.2.1 Episodic Importance Weighting
The original RWR method is an on-policy algorithm that uses data drawn from the current policy. On the other hand, the situation called off-policy reinforcement learning is considered here, where the sampling policy for collecting data samples is different from the target policy. More specifically, N trajectory samples are gathered following the policy π_ℓ in the ℓ-th policy update iteration:

$$\mathcal{H}^{\pi_\ell} = \{ h^{\pi_\ell}_1, \ldots, h^{\pi_\ell}_N \},$$

where each trajectory sample h^{π_ℓ}_n is given as

$$h^{\pi_\ell}_n = [s^{\pi_\ell}_{1,n}, a^{\pi_\ell}_{1,n}, \ldots, s^{\pi_\ell}_{T,n}, a^{\pi_\ell}_{T,n}, s^{\pi_\ell}_{T+1,n}].$$
We want to utilize all these samples to improve the current policy.

Suppose that we are currently at the L-th policy update iteration. If the policies {π_ℓ}_{ℓ=1}^{L} remained unchanged over the RWR updates, just using the plain update rules provided in Section 8.1 would give a consistent estimator θ̂^{NIW}_{L+1} = (µ̂^{NIW⊤}_{L+1}, σ̂^{NIW}_{L+1})^⊤, where

$$\widehat{\mu}^{\mathrm{NIW}}_{L+1} = \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} R(h^{\pi_\ell}_n) \sum_{t=1}^{T} \phi(s^{\pi_\ell}_{t,n})\phi(s^{\pi_\ell}_{t,n})^\top \right)^{-1} \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} R(h^{\pi_\ell}_n) \sum_{t=1}^{T} a^{\pi_\ell}_{t,n} \phi(s^{\pi_\ell}_{t,n}) \right),$$

$$(\widehat{\sigma}^{\mathrm{NIW}}_{L+1})^2 = \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} R(h^{\pi_\ell}_n) \right)^{-1} \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} R(h^{\pi_\ell}_n) \frac{1}{T} \sum_{t=1}^{T} \bigl(a^{\pi_\ell}_{t,n} - \widehat{\mu}^{\mathrm{NIW}\top}_{L+1} \phi(s^{\pi_\ell}_{t,n})\bigr)^2 \right).$$

The superscript "NIW" stands for "no importance weight." However, since policies are updated in each RWR iteration, the data samples {H^{π_ℓ}}_{ℓ=1}^{L} collected over iterations generally follow different probability distributions induced by different policies. Therefore, naive use of the above update rules will result in an inconsistent estimator.
In the same way as the discussion in Chapter 4, importance sampling can be used to cope with this problem. The basic idea of importance sampling is to weight the samples drawn from a different distribution to match the target distribution. More specifically, from i.i.d. (independent and identically distributed) samples {h^{π_ℓ}_n}_{n=1}^{N} following p(h|θ_ℓ), the expectation of a function g(h) over another probability density function p(h|θ_L) can be estimated in a consistent manner by the importance-weighted average:

$$\frac{1}{N} \sum_{n=1}^{N} g(h^{\pi_\ell}_n) \frac{p(h^{\pi_\ell}_n|\theta_L)}{p(h^{\pi_\ell}_n|\theta_\ell)} \;\xrightarrow{N\to\infty}\; \mathbb{E}_{p(h|\theta_\ell)}\left[ g(h) \frac{p(h|\theta_L)}{p(h|\theta_\ell)} \right] = \int g(h) \frac{p(h|\theta_L)}{p(h|\theta_\ell)} p(h|\theta_\ell)\, \mathrm{d}h = \int g(h) p(h|\theta_L)\, \mathrm{d}h = \mathbb{E}_{p(h|\theta_L)}[g(h)].$$

The ratio of the two densities p(h|θ_L)/p(h|θ_ℓ) is called the importance weight for trajectory h.
This importance sampling technique can be employed in RWR to obtain a consistent estimator θ̂^{EIW}_{L+1} = (µ̂^{EIW⊤}_{L+1}, σ̂^{EIW}_{L+1})^⊤, where

$$\widehat{\mu}^{\mathrm{EIW}}_{L+1} = \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} R(h^{\pi_\ell}_n)\, w^{(L,\ell)}(h^{\pi_\ell}_n) \sum_{t=1}^{T} \phi(s^{\pi_\ell}_{t,n})\phi(s^{\pi_\ell}_{t,n})^\top \right)^{-1} \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} R(h^{\pi_\ell}_n)\, w^{(L,\ell)}(h^{\pi_\ell}_n) \sum_{t=1}^{T} a^{\pi_\ell}_{t,n} \phi(s^{\pi_\ell}_{t,n}) \right),$$

$$(\widehat{\sigma}^{\mathrm{EIW}}_{L+1})^2 = \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} R(h^{\pi_\ell}_n)\, w^{(L,\ell)}(h^{\pi_\ell}_n) \right)^{-1} \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} R(h^{\pi_\ell}_n)\, w^{(L,\ell)}(h^{\pi_\ell}_n) \frac{1}{T} \sum_{t=1}^{T} \bigl(a^{\pi_\ell}_{t,n} - \widehat{\mu}^{\mathrm{EIW}\top}_{L+1} \phi(s^{\pi_\ell}_{t,n})\bigr)^2 \right).$$
Here, w^{(L,ℓ)}(h) denotes the importance weight defined by

$$w^{(L,\ell)}(h) = \frac{p(h|\theta_L)}{p(h|\theta_\ell)}.$$

The superscript "EIW" stands for "episodic importance weight." p(h|θ_L) and p(h|θ_ℓ) denote the probability densities of observing trajectory

$$h = [s_1, a_1, \ldots, s_T, a_T, s_{T+1}]$$

under policy parameters θ_L and θ_ℓ, which can be explicitly written as

$$p(h|\theta_L) = p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t, \theta_L),$$

$$p(h|\theta_\ell) = p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, a_t)\, \pi(a_t|s_t, \theta_\ell).$$

The two probability densities p(h|θ_L) and p(h|θ_ℓ) both contain the unknown probability densities p(s_1) and {p(s_{t+1}|s_t, a_t)}_{t=1}^{T}. However, since they cancel out in the importance weight, it can be computed without the knowledge of p(s) and p(s'|s, a) as

$$w^{(L,\ell)}(h) = \frac{\prod_{t=1}^{T} \pi(a_t|s_t, \theta_L)}{\prod_{t=1}^{T} \pi(a_t|s_t, \theta_\ell)}.$$
Although the importance-weighted estimator θ̂^{EIW}_{L+1} is guaranteed to be consistent, it tends to have large variance (Shimodaira, 2000; Sugiyama & Kawanabe, 2012). Therefore, the importance-weighted estimator tends to be unstable when the number of episodes N is rather small.
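Since the importance weight reduces to a ratio of policy densities, it can be computed directly from the action log-densities, as in the following sketch (Gaussian policies with parameters (mu_L, sigma_L) for the target policy and (mu_l, sigma_l) for the sampling policy are assumed; computing in log space is an implementation choice to avoid numerical under/overflow).

```python
import numpy as np

def gaussian_log_pi(a, s, mu, sigma, phi):
    """Log density of the Gaussian policy pi(a|s, mu, sigma)."""
    d = a - mu @ phi(s)
    return -0.5 * np.log(2 * np.pi * sigma**2) - d**2 / (2 * sigma**2)

def episodic_importance_weight(traj, mu_L, sigma_L, mu_l, sigma_l, phi):
    """w^(L,l)(h): product over t of pi(a_t|s_t, theta_L) / pi(a_t|s_t, theta_l)."""
    log_w = 0.0
    for s, a, _ in traj:
        log_w += gaussian_log_pi(a, s, mu_L, sigma_L, phi)
        log_w -= gaussian_log_pi(a, s, mu_l, sigma_l, phi)
    return np.exp(log_w)
```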
8.2.2 Per-Decision Importance Weight
Since the reward at the t-th step does not depend on future state-action transitions after the t-th step, an episodic importance weight can be decomposed into stepwise importance weights (Precup et al., 2000). For instance, the expected return J(θ_L) can be expressed as

$$J(\theta_L) = \int R(h)\, p(h|\theta_L)\, \mathrm{d}h = \int \sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t, s_{t+1})\, w^{(L,\ell)}(h)\, p(h|\theta_\ell)\, \mathrm{d}h = \int \sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t, s_{t+1})\, w^{(L,\ell)}_t(h)\, p(h|\theta_\ell)\, \mathrm{d}h,$$

where w^{(L,ℓ)}_t(h) is the t-step importance weight, called the per-decision importance weight (PIW), defined as

$$w^{(L,\ell)}_t(h) = \frac{\prod_{t'=1}^{t} \pi(a_{t'}|s_{t'}, \theta_L)}{\prod_{t'=1}^{t} \pi(a_{t'}|s_{t'}, \theta_\ell)}.$$
Here, the PIW idea is applied to RWR and a more stable algorithm is developed. A slight complication is that the policy update formulas given in Section 8.2.1 contain double sums over T steps, e.g.,

$$R(h) \sum_{t'=1}^{T} \phi(s_{t'})\phi(s_{t'}) = \sum_{t,t'=1}^{T} \gamma^{t-1} r(s_t, a_t, s_{t+1})\, \phi(s_{t'})\phi(s_{t'}).$$

In this case, the summand

$$\gamma^{t-1} r(s_t, a_t, s_{t+1})\, \phi(s_{t'})\phi(s_{t'})$$

does not depend on future state-action pairs after the max(t, t')-th step. Thus, the episodic importance weight for this summand can be simplified to the per-decision importance weight w^{(L,ℓ)}_{max(t,t')}. Consequently, the PIW-based policy update rules are given as
$$\widehat{\mu}^{\mathrm{PIW}}_{L+1} = \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} \sum_{t,t'=1}^{T} \gamma^{t-1} r_{t,n}\, \phi(s^{\pi_\ell}_{t',n})\phi(s^{\pi_\ell}_{t',n})^\top\, w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right)^{-1} \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} \sum_{t,t'=1}^{T} \gamma^{t-1} r_{t,n}\, a^{\pi_\ell}_{t',n} \phi(s^{\pi_\ell}_{t',n})\, w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right),$$

$$(\widehat{\sigma}^{\mathrm{PIW}}_{L+1})^2 = \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} \sum_{t=1}^{T} \gamma^{t-1} r_{t,n}\, w^{(L,\ell)}_t(h^{\pi_\ell}_n) \right)^{-1} \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} \frac{1}{T} \sum_{t,t'=1}^{T} \gamma^{t-1} r_{t,n}\, \bigl(a^{\pi_\ell}_{t',n} - \widehat{\mu}^{\mathrm{PIW}\top}_{L+1} \phi(s^{\pi_\ell}_{t',n})\bigr)^2\, w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \right),$$

where

$$r_{t,n} = r(s_{t,n}, a_{t,n}, s_{t+1,n}).$$

This PIW estimator θ̂^{PIW}_{L+1} = (µ̂^{PIW⊤}_{L+1}, σ̂^{PIW}_{L+1})^⊤ is consistent and potentially more stable than the plain EIW estimator θ̂^{EIW}_{L+1}.
8.2.3 Adaptive Per-Decision Importance Weighting
To more actively control the stability of the PIW estimator, the adaptive per-decision importance weight (AIW) is employed. More specifically, an importance weight w^{(L,ℓ)}_{max(t,t')}(h) is "flattened" by a flattening parameter ν ∈ [0, 1] as (w^{(L,ℓ)}_{max(t,t')}(h))^ν, i.e., the ν-th power of the per-decision importance weight. Then we have θ̂^{AIW}_{L+1} = (µ̂^{AIW⊤}_{L+1}, σ̂^{AIW}_{L+1})^⊤, where

$$\widehat{\mu}^{\mathrm{AIW}}_{L+1} = \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} \sum_{t,t'=1}^{T} \gamma^{t-1} r_{t,n}\, \phi(s^{\pi_\ell}_{t',n})\phi(s^{\pi_\ell}_{t',n})^\top \bigl( w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \bigr)^{\nu} \right)^{-1} \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} \sum_{t,t'=1}^{T} \gamma^{t-1} r_{t,n}\, a^{\pi_\ell}_{t',n} \phi(s^{\pi_\ell}_{t',n}) \bigl( w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \bigr)^{\nu} \right),$$

$$(\widehat{\sigma}^{\mathrm{AIW}}_{L+1})^2 = \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} \sum_{t=1}^{T} \gamma^{t-1} r_{t,n} \bigl( w^{(L,\ell)}_t(h^{\pi_\ell}_n) \bigr)^{\nu} \right)^{-1} \left( \sum_{\ell=1}^{L} \sum_{n=1}^{N} \frac{1}{T} \sum_{t,t'=1}^{T} \gamma^{t-1} r_{t,n}\, \bigl(a^{\pi_\ell}_{t',n} - \widehat{\mu}^{\mathrm{AIW}\top}_{L+1} \phi(s^{\pi_\ell}_{t',n})\bigr)^2 \bigl( w^{(L,\ell)}_{\max(t,t')}(h^{\pi_\ell}_n) \bigr)^{\nu} \right).$$
When ν = 0, AIW is reduced to NIW. Therefore, it is relatively stable, but not consistent. On the other hand, when ν = 1, AIW is reduced to PIW. Therefore, it is consistent, but rather unstable. In practice, an intermediate ν often produces a better estimator. Note that the value of the flattening parameter can be different in each iteration, i.e., ν may be replaced by ν_ℓ. However, for simplicity, a single common value ν is considered here.
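The per-decision and flattened (AIW) weights are simple to compute from per-step policy ratios, as in the following sketch (the per-step log-densities under the target and sampling policies are assumed to be precomputed arrays of length T).

```python
import numpy as np

def per_decision_weights(log_pi_target, log_pi_sample):
    """Return w_t^(L,l)(h) for t = 1, ..., T as an array of length T."""
    log_ratios = np.asarray(log_pi_target) - np.asarray(log_pi_sample)
    return np.exp(np.cumsum(log_ratios))   # cumulative products over t' = 1..t

def flattened_weights(piw, nu):
    """AIW: the nu-th power of the per-decision weights, with nu in [0, 1]."""
    # nu = 0 recovers NIW (all weights equal to one); nu = 1 recovers PIW.
    return piw ** nu
```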
8.2.4 Automatic Selection of the Flattening Parameter
The flattening parameter allows us to control the trade-off between consistency and stability. Here, we show how the value of the flattening parameter can be optimally chosen using data samples.

The goal of policy search is to find the optimal policy that maximizes the expected return J(θ). Therefore, the optimal flattening parameter value ν*_L at the L-th iteration is given by

$$\nu^*_L = \arg\max_\nu J\bigl(\widehat{\theta}^{\mathrm{AIW}}_{L+1}(\nu)\bigr).$$

Directly obtaining ν*_L requires the computation of the expected return J(θ̂^{AIW}_{L+1}(ν)) for each candidate of ν. To this end, data samples following π(a|s; θ̂^{AIW}_{L+1}(ν)) are needed for each ν, which is prohibitively expensive. To reuse samples generated by previous policies, a variation of cross-validation called importance-weighted cross-validation (IWCV) (Sugiyama et al., 2007) is employed.
The basic idea of IWCV is to split the training dataset H^{π_{1:L}} = {H^{π_ℓ}}_{ℓ=1}^{L} into an "estimation part" and a "validation part." Then the policy parameter θ̂^{AIW}_{L+1}(ν) is learned from the estimation part and its expected return J(θ̂^{AIW}_{L+1}(ν)) is approximated using the importance-weighted loss for the validation part. As pointed out in Section 8.2.1, importance weighting tends to be unstable when the number N of episodes is small. For this reason, per-decision importance weighting is used for cross-validation. Below, how IWCV is applied to the selection of the flattening parameter ν in the current context is explained in more detail.

Let us divide the training dataset H^{π_{1:L}} = {H^{π_ℓ}}_{ℓ=1}^{L} into K disjoint subsets {H^{π_{1:L}}_k}_{k=1}^{K} of the same size, where each H^{π_{1:L}}_k contains N/K episodic samples from every H^{π_ℓ}. For simplicity, we assume that N is divisible by K, i.e., N/K is an integer. K = 5 will be used in the experiments later.
Let θ̂^{AIW}_{L+1,k}(ν) be the policy parameter learned from {H^{π_{1:L}}_{k'}}_{k'≠k} (i.e., all data without H^{π_{1:L}}_k) by AIW estimation. The expected return of θ̂^{AIW}_{L+1,k}(ν) is estimated using the PIW estimator from H^{π_{1:L}}_k as

$$\widehat{J}^{\,k}_{\mathrm{IWCV}}\bigl(\widehat{\theta}^{\mathrm{AIW}}_{L+1,k}(\nu)\bigr) = \frac{1}{\eta} \sum_{h \in \mathcal{H}^{\pi_{1:L}}_k} \sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t, s_{t+1})\, w^{(L,\ell)}_t(h),$$
where η is a normalization constant. An ordinary choice is η = LN/K, but a more stable variant given by

$$
\eta = \sum_{h \in H^{\pi_{1:L}}_{k}} w^{(L,\ell)}_{t}(h)
$$

is often preferred in practice (Precup et al., 2000).
The above procedure is repeated for all k = 1, …, K, and the average score,

$$
\hat{J}_{\mathrm{IWCV}}(\hat{\theta}^{\mathrm{AIW}}_{L+1}(\nu)) = \frac{1}{K} \sum_{k=1}^{K} \hat{J}^{k}_{\mathrm{IWCV}}(\hat{\theta}^{\mathrm{AIW}}_{L+1,k}(\nu)),
$$

is computed. This is the K-fold IWCV estimator of $J(\hat{\theta}^{\mathrm{AIW}}_{L+1}(\nu))$, which was shown to be almost unbiased (Sugiyama et al., 2007).
This K-fold IWCV score is computed for each candidate value of the flattening parameter ν, and the one that maximizes the IWCV score is chosen:

$$
\hat{\nu}_{\mathrm{IWCV}} = \mathop{\mathrm{argmax}}_{\nu} \hat{J}_{\mathrm{IWCV}}(\hat{\theta}^{\mathrm{AIW}}_{L+1}(\nu)).
$$

This IWCV scheme can also be used for choosing the basis functions φ(s) in the Gaussian policy model.
Note that when the importance weights $w^{(L,\ell)}_{\max(t,t')}$ are all one (i.e., no importance weighting), the above IWCV procedure is reduced to the ordinary CV procedure. The use of IWCV is essential here since the target policy $\pi(a|s, \hat{\theta}^{\mathrm{AIW}}_{L+1}(\nu))$ is usually different from the previous policies used for collecting the data samples $H^{\pi_{1:L}}$. Therefore, the expected return estimated using ordinary CV, $\hat{J}_{\mathrm{CV}}(\hat{\theta}^{\mathrm{AIW}}_{L+1}(\nu))$, would be heavily biased.
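The K-fold IWCV selection of ν can be sketched as follows (a minimal illustration under simplifying assumptions, not the book's implementation; `estimate_policy` and `per_decision_weights` are hypothetical helpers standing in for the AIW update and the per-decision importance weights, and episodes are assumed to expose a `rewards` array):

```python
import numpy as np

def iwcv_select_nu(episodes, candidates=(0.0, 0.25, 0.5, 0.75, 1.0), K=5, gamma=0.99):
    """Choose the flattening parameter nu by K-fold IWCV over pooled episodes."""
    folds = np.array_split(np.random.permutation(len(episodes)), K)
    scores = {}
    for nu in candidates:
        fold_scores = []
        for k in range(K):
            train = [episodes[i] for i in range(len(episodes)) if i not in folds[k]]
            valid = [episodes[i] for i in folds[k]]
            theta = estimate_policy(train, nu)         # hypothetical AIW update
            num, eta = 0.0, 0.0
            for h in valid:
                w = per_decision_weights(h, theta)     # hypothetical w_t^{(L,l)}(h)
                disc = gamma ** np.arange(len(h.rewards))
                num += np.sum(disc * h.rewards * w)
                eta += np.sum(w)                       # stable normalizer (Precup et al., 2000)
            fold_scores.append(num / eta)
        scores[nu] = np.mean(fold_scores)
    return max(scores, key=scores.get)
```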
8.2.5 Reward-Weighted Regression with Sample Reuse
So far, we have introduced AIW to control the stability of the policy-parameter update and IWCV to automatically choose the flattening parameter based on the estimated expected return. The policy search algorithm that combines these two methods is called reward-weighted regression with sample reuse (RRR).

In each iteration (L = 1, 2, …) of RRR, episodic data samples $H^{\pi_L}$ are collected following the current policy $\pi(a|s, \theta^{\mathrm{AIW}}_{L})$, the flattening parameter ν is chosen so as to maximize the expected return $\hat{J}_{\mathrm{IWCV}}(\nu)$ estimated by IWCV using $\{H^{\pi_\ell}\}_{\ell=1}^{L}$, and then the policy parameter is updated to $\theta^{\mathrm{AIW}}_{L+1}$ using $\{H^{\pi_\ell}\}_{\ell=1}^{L}$.
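Schematically, this iteration can be written as the following loop (a sketch only; `collect_episodes`, `iwcv_select_nu`, and `estimate_policy` are the hypothetical helpers introduced above):

```python
def rrr(env, theta_init, num_iterations=10, N=10):
    """Reward-weighted regression with sample reuse (schematic loop)."""
    theta = theta_init
    history = []                                     # pooled samples {H^{pi_l}}
    for L in range(1, num_iterations + 1):
        history += collect_episodes(env, theta, N)   # data from the current policy
        nu = iwcv_select_nu(history)                 # flattening parameter by IWCV
        theta = estimate_policy(history, nu)         # AIW policy-parameter update
    return theta
```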
FIGURE 8.2: Ball balancing using a robot arm simulator. Two joints of the robot are controlled to keep the ball in the middle of the tray. (Figure labels: elbow, wrist.)
8.3 Numerical Examples
The performance of RRR is experimentally evaluated on a ball-balancing task using a robot arm simulator (Schaal, 2009).

As illustrated in Figure 8.2, a 7-degree-of-freedom arm is mounted on the ceiling upside down, which is equipped with a circular tray of radius 0.24 [m] at the end effector. The goal is to control the joints of the robot so that the ball is brought to the middle of the tray. However, the difficulty is that the angle of the tray cannot be controlled directly, which is a typical restriction in real-world joint-motion planning based on feedback from the environment (e.g., the state of the ball).

To simplify the problem, only two joints are controlled here: the wrist angle α_roll and the elbow angle α_pitch. All the remaining joints are fixed. Control of the wrist and elbow angles would roughly correspond to changing the roll and pitch angles of the tray, but not directly.

Two separate control subsystems are designed here, each of which is in charge of controlling the roll and pitch angles. Each subsystem has its own policy parameter θ, state space S, and action space A. The state space S is continuous and consists of (x, ẋ), where x [m] is the position of the ball on the tray along each axis and ẋ [m/s] is the velocity of the ball. The action space A is continuous and corresponds to the target angle a [rad] of the joint. The reward function is defined as

$$
r(s, a, s') = \exp\left( -\frac{5(x')^2 + (\dot{x}')^2 + a^2}{2(0.24/2)^2} \right),
$$

where the number 0.24 in the denominator comes from the radius of the tray. Below, how the control system is designed is explained in more detail.
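In code, this reward is a direct transcription of the formula above (variable names are mine):

```python
import numpy as np

def reward(x_next, xdot_next, a, tray_radius=0.24):
    """Reward of one control subsystem in the ball-balancing task.

    x_next    : next ball position x' [m] along the axis
    xdot_next : next ball velocity [m/s]
    a         : target joint angle [rad]
    """
    scale = 2.0 * (tray_radius / 2.0) ** 2
    return np.exp(-(5.0 * x_next**2 + xdot_next**2 + a**2) / scale)
```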
FIGURE 8.3: The block diagram of the robot-arm control system for ball balancing. The control system has two feedback loops, i.e., joint-trajectory planning by RRR and trajectory tracking by a high-gain proportional-derivative (PD) controller.
As illustrated in Figure 8.3, the control system has two feedback loops for trajectory planning using an RRR controller and trajectory tracking using a high-gain proportional-derivative (PD) controller (Siciliano & Khatib, 2008). The RRR controller outputs the target joint angle obtained by the current policy at every 0.2 [s]. Nine Gaussian kernels are used as basis functions φ(s), with the kernel centers $\{c_b\}_{b=1}^{9}$ located over the state space at

(x, ẋ) ∈ {(−0.2, −0.4), (−0.2, 0), (−0.1, 0.4), (0, −0.4), (0, 0), (0, 0.4), (0.1, −0.4), (0.2, 0), (0.2, 0.4)}.

The Gaussian width is set at σ_basis = 0.1. Based on the discrete-time target angles obtained by RRR, the desired joint trajectory in the continuous time domain is linearly interpolated as

$$
a_{t,u} = a_t + u\, \dot{a}_t,
$$
where u is the time from the last output $a_t$ of RRR at the t-th step, and $\dot{a}_t$ is the angular velocity computed by

$$
\dot{a}_t = \frac{a_t - a_{t-1}}{0.2},
$$

where $a_0$ is the initial angle of a joint. The angular velocity is assumed to be constant during the 0.2 [s] cycle of trajectory planning.
On the other hand, the PD controller converts desired joint trajectories to motor torques as

$$
\tau_{t,u} = \mu_p * (a_{t,u} - \alpha_{t,u}) + \mu_d * (\dot{a}_t - \dot{\alpha}_{t,u}),
$$

where τ is the 2-dimensional vector consisting of the torques applied to the wrist and elbow joints. $a = (a_{\mathrm{pitch}}, a_{\mathrm{roll}})^\top$ and $\dot{a} = (\dot{a}_{\mathrm{pitch}}, \dot{a}_{\mathrm{roll}})^\top$ are the 2-dimensional vectors consisting of the desired angles and velocities. $\alpha = (\alpha_{\mathrm{pitch}}, \alpha_{\mathrm{roll}})^\top$ and $\dot{\alpha} = (\dot{\alpha}_{\mathrm{pitch}}, \dot{\alpha}_{\mathrm{roll}})^\top$ are the 2-dimensional vectors consisting of the current joint angles and velocities. $\mu_p$ and $\mu_d$ are the 2-dimensional vectors consisting of the proportional and derivative gains. "∗" denotes the element-wise product. Since the control cycle of the robot arm is 0.002 [s], the PD controller is applied 100 times (i.e., at t = 0.002, 0.004, …, 0.198, 0.2) in each RRR cycle.
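A minimal sketch of this inner tracking loop (illustrative only; in the real system the joint state α, α̇ would be re-measured at every 0.002 [s] step, which is omitted here for brevity):

```python
import numpy as np

def pd_torques(a_prev, a_target, alpha, alpha_dot,
               mu_p=np.array([2000.0, 2000.0]),
               mu_d=np.array([100.0, 100.0]),
               cycle=0.2, dt=0.002):
    """Track one linearly interpolated RRR command with a PD controller.

    a_prev, a_target : previous and current target joint angles (2-dim)
    alpha, alpha_dot : current joint angles and velocities (2-dim)
    Yields the torque vector at each 0.002 [s] control step.
    """
    a_dot = (a_target - a_prev) / cycle        # constant angular velocity
    for u in np.arange(dt, cycle + dt / 2, dt):
        a_tu = a_prev + u * a_dot              # linear interpolation
        yield mu_p * (a_tu - alpha) + mu_d * (a_dot - alpha_dot)
```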
Figure 8.4 depicts a desired trajectory of the wrist joint generated by a random policy and an actual trajectory obtained using the high-gain PD controller described above. The graphs show that the desired trajectory is followed by the robot arm reasonably well.

The policy parameter $\theta_L$ is learned through the RRR iterations. The initial policy parameters $\theta_1 = (\mu_1^\top, \sigma_1)^\top$ are set manually as

$$
\mu_1 = (-0.5, -0.5, 0, -0.5, 0, 0, 0, 0, 0)^\top \quad \text{and} \quad \sigma_1 = 0.1,
$$

so that a wide range of states and actions can be safely explored in the first iteration. The initial position of the ball is randomly selected as x ∈ [−0.05, 0.05]. The dataset collected in each iteration consists of 10 episodes with 20 steps. The duration of an episode is 4 [s] and the sampling cycle by RRR is 0.2 [s].
Three scenarios are considered here:

• NIW: Sample reuse with ν = 0.
• PIW: Sample reuse with ν = 1.
• RRR: Sample reuse with ν chosen by IWCV from {0, 0.25, 0.5, 0.75, 1} in each iteration.
The discount factor is set at γ = 0.99. Figure 8.5 depicts the averaged expected return over 10 trials as a function of the number of policy update iterations. The expected return in each trial is computed from 20 test episodic samples that have not been used for training. The graph shows that RRR nicely improves the performance over iterations. On the other hand, the performance for ν = 0 is saturated after the 3rd iteration, and the performance for ν = 1 is improved in the beginning but suddenly goes down at the 5th iteration. The result for ν = 1 indicates that a large change in policies causes severe instability in sample reuse.

Figure 8.6 and Figure 8.7 depict examples of trajectories of the wrist angle α_roll, the elbow angle α_pitch, the resulting ball movement x, and the reward r for policies obtained by NIW (ν = 0) and RRR (ν is chosen by IWCV) after the 10th iteration. Under the policy obtained by NIW, the ball goes through the middle of the tray, i.e., (x_roll, x_pitch) = (0, 0), but does not stop there. On the other hand, the policy obtained by RRR successfully guides the ball to the middle of the tray along the roll axis, although the movement along the pitch axis looks similar to that by NIW. Motion examples by RRR with ν chosen by IWCV are illustrated in Figure 8.8.
FIGURE 8.4: An example of desired and actual trajectories of the wrist joint in the realistic ball-balancing task. The target joint angle is determined by a random policy at every 0.2 [s], and then a linearly interpolated angle and constant velocity are tracked using the proportional-derivative (PD) controller in the cycle of 0.002 [s]. (Panels: (a) trajectory in angles [rad]; (b) trajectory in angular velocities [rad/s]; both over time [s].)
FIGURE 8.5: The performance of learned policies when ν = 0 (NIW), ν = 1 (PIW), and ν is chosen by IWCV (RRR) in ball balancing using a simulated robot-arm system. The performance is measured by the return averaged over 10 trials. The symbol "" indicates that the method is the best or comparable to the best one in terms of the expected return by the t-test at the significance level 5%, performed at each iteration. The error bars indicate 1/10 of a standard deviation.
FIGURE 8.6: Typical examples of trajectories of wrist angle α_roll, elbow angle α_pitch, resulting ball movement x, and reward r for policies obtained by NIW (ν = 0) at the 10th iteration in the ball-balancing task.
FIGURE 8.7: Typical examples of trajectories of wrist angle α_roll, elbow angle α_pitch, resulting ball movement x, and reward r for policies obtained by RRR (ν is chosen by IWCV) at the 10th iteration in the ball-balancing task.
FIGURE 8.8: Motion examples of ball balancing by RRR (from left to right and top to bottom).
8.4 Remarks
A direct policy search algorithm based on expectation-maximization (EM) iteratively maximizes the lower bound of the expected return. The EM-based approach does not include the step-size parameter, which is an advantage over the gradient-based approach introduced in Chapter 7. A sample-reuse variant of the EM-based method was also provided, which contributes to improving the stability of the algorithm in small-sample scenarios.

In practice, however, the EM-based approach is still rather unstable even if it is combined with the sample-reuse technique. In Chapter 9, another policy search approach will be introduced to further improve the stability of policy updates.
Chapter 9
Policy-Prior Search
The direct policy search methods explained in Chapter 7 and Chapter 8 are useful in solving problems with continuous actions such as robot control. However, they tend to suffer from instability of policy update. In this chapter, we introduce an alternative policy search method called policy-prior search, which is adopted in the PGPE (policy gradients with parameter-based exploration) method (Sehnke et al., 2010). The basic idea is to use deterministic policies to remove excessive randomness and introduce useful stochasticity by considering a prior distribution for policy parameters.
After formulating the problem of policy-prior search in Section 9.1, a gradient-based algorithm is introduced in Section 9.2, including its improvement using baseline subtraction, theoretical analysis, and experimental evaluation. Then, in Section 9.3, a sample-reuse variant is described and its performance is theoretically analyzed and experimentally investigated using a humanoid robot. Finally, this chapter is concluded in Section 9.4.
9.1 Formulation
In this section, the policy search problem is formulated based on policy priors.

The basic idea is to use a deterministic policy and introduce stochasticity by drawing policy parameters from a prior distribution. More specifically, policy parameters are randomly determined following the prior distribution at the beginning of each trajectory, and thereafter action selection is deterministic (Figure 9.1). Note that transitions are generally stochastic, and thus trajectories are also stochastic even though the policy is deterministic. Thanks to this per-trajectory formulation, the variance of gradient estimators in policy-prior search does not increase with respect to the trajectory length, which allows us to overcome the critical drawback of direct policy search.

Policy-prior search uses a deterministic policy with typically a linear architecture:

$$
\pi(a|s, \theta) = \delta(a = \theta^\top \phi(s)),
$$

where δ(·) is the Dirac delta function and φ(s) is the basis function. The policy parameter θ is drawn from a prior distribution p(θ|ρ) with hyper-parameter ρ.
FIGURE 9.1: Illustration of the stochastic policy and the deterministic policy with a prior under deterministic transition. The number of possible trajectories is exponential with respect to the trajectory length when stochastic policies are used, while it does not grow when deterministic policies drawn from a prior distribution are used. (Panels: (a) stochastic policy; (b) deterministic policy with prior.)
The expected return in policy-prior search is defined in terms of the expectations over both trajectory h and policy parameter θ, as a function of hyper-parameter ρ:

$$
J(\rho) = \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[R(h)] = \iint p(h|\theta)\, p(\theta|\rho)\, R(h)\, \mathrm{d}h\, \mathrm{d}\theta,
$$

where $\mathbb{E}_{p(h|\theta)p(\theta|\rho)}$ denotes the expectation over trajectory h and policy parameter θ drawn from p(h|θ)p(θ|ρ). In policy-prior search, the hyper-parameter ρ is optimized so that the expected return J(ρ) is maximized. Thus, the optimal hyper-parameter ρ* is given by

$$
\rho^* = \mathop{\mathrm{argmax}}_{\rho} J(\rho).
$$
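Concretely, one Monte Carlo sample of the expectation above is obtained by drawing θ once per trajectory and then acting deterministically. The sketch below (illustrative only; the `env` interface and `phi` are hypothetical) makes this per-trajectory sampling explicit for the independent Gaussian prior used later:

```python
import numpy as np

def rollout(env, eta, tau, phi, T, gamma):
    """One policy-prior-search sample: draw theta once, then act deterministically.

    eta, tau : prior means and standard deviations (hyper-parameter rho)
    Returns the drawn theta and the discounted return R(h).
    """
    theta = eta + tau * np.random.randn(len(eta))  # theta ~ p(theta|rho)
    s = env.reset()
    R = 0.0
    for t in range(T):
        a = theta @ phi(s)                         # deterministic linear policy
        s, r = env.step(a)
        R += gamma**t * r
    return theta, R
```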
9.2 Policy Gradients with Parameter-Based Exploration

In this section, a gradient-based algorithm for policy-prior search is given.
9.2.1 Policy-Prior Gradient Ascent
Here, a gradient method is used to find a local maximizer of the expected return J with respect to hyper-parameter ρ:

$$
\rho \longleftarrow \rho + \varepsilon \nabla_\rho J(\rho),
$$

where ε is a small positive constant and $\nabla_\rho J(\rho)$ is the derivative of J with respect to ρ:

$$
\nabla_\rho J(\rho) = \iint p(h|\theta)\, \nabla_\rho p(\theta|\rho)\, R(h)\, \mathrm{d}h\, \mathrm{d}\theta = \iint p(h|\theta)\, p(\theta|\rho)\, \nabla_\rho \log p(\theta|\rho)\, R(h)\, \mathrm{d}h\, \mathrm{d}\theta = \mathbb{E}_{p(h|\theta)p(\theta|\rho)}\big[\nabla_\rho \log p(\theta|\rho)\, R(h)\big],
$$

where the logarithmic derivative,

$$
\nabla_\rho \log p(\theta|\rho) = \frac{\nabla_\rho p(\theta|\rho)}{p(\theta|\rho)},
$$

was used in the derivation. The expectations over h and θ are approximated by the empirical averages:

$$
\widehat{\nabla_\rho J}(\rho) = \frac{1}{N} \sum_{n=1}^{N} \nabla_\rho \log p(\theta_n|\rho)\, R(h_n), \qquad (9.1)
$$
where each trajectory sample $h_n$ is drawn independently from $p(h|\theta_n)$ and parameter $\theta_n$ is drawn from p(θ|ρ). Thus, in policy-prior search, samples are pairs of θ and h:

$$
H = \{(\theta_1, h_1), \ldots, (\theta_N, h_N)\}.
$$
As the prior distribution for policy parameter θ = (θ_1, …, θ_B)^⊤, where B is the dimensionality of the basis vector φ(s), the independent Gaussian distribution is a standard choice. For this Gaussian prior, the hyper-parameter ρ consists of prior means η = (η_1, …, η_B)^⊤ and prior standard deviations τ = (τ_1, …, τ_B)^⊤:

$$
p(\theta|\eta, \tau) = \prod_{b=1}^{B} \frac{1}{\tau_b \sqrt{2\pi}} \exp\left( -\frac{(\theta_b - \eta_b)^2}{2\tau_b^2} \right). \qquad (9.2)
$$
Then the derivatives of the log-prior log p(θ|η, τ) with respect to $\eta_b$ and $\tau_b$ are given as

$$
\nabla_{\eta_b} \log p(\theta|\eta, \tau) = \frac{\theta_b - \eta_b}{\tau_b^2}, \qquad \nabla_{\tau_b} \log p(\theta|\eta, \tau) = \frac{(\theta_b - \eta_b)^2 - \tau_b^2}{\tau_b^3}.
$$

By substituting these derivatives into Eq. (9.1), the policy-prior gradients with respect to η and τ can be approximated.
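Putting Eq. (9.1) and the closed-form log-derivatives together gives the following PGPE gradient estimate (a sketch only; it reuses the hypothetical `rollout` helper above):

```python
import numpy as np

def pgpe_gradients(env, eta, tau, phi, N=100, T=10, gamma=0.9):
    """Estimate the policy-prior gradients of J w.r.t. eta and tau via Eq. (9.1)."""
    g_eta = np.zeros_like(eta)
    g_tau = np.zeros_like(tau)
    for _ in range(N):
        theta, R = rollout(env, eta, tau, phi, T, gamma)
        d = theta - eta
        g_eta += (d / tau**2) * R                  # grad_eta log p(theta|rho) * R(h)
        g_tau += ((d**2 - tau**2) / tau**3) * R    # grad_tau log p(theta|rho) * R(h)
    return g_eta / N, g_tau / N
```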
9.2.2 Baseline Subtraction for Variance Reduction
As explained in Section 7.2.2, subtraction of a baseline can reduce the variance of gradient estimators. Here, a baseline subtraction method for policy-prior search is described.

For a baseline ξ, a modified gradient estimator is given by

$$
\widehat{\nabla_\rho J_\xi}(\rho) = \frac{1}{N} \sum_{n=1}^{N} (R(h_n) - \xi)\, \nabla_\rho \log p(\theta_n|\rho).
$$

Let ξ* be the optimal baseline that minimizes the variance of the gradient:

$$
\xi^* = \mathop{\mathrm{argmin}}_{\xi} \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[\widehat{\nabla_\rho J_\xi}(\rho)\big],
$$

where $\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}$ denotes the trace of the covariance matrix:

$$
\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}[\zeta] = \mathrm{tr}\Big( \mathbb{E}_{p(h|\theta)p(\theta|\rho)}\big[ (\zeta - \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[\zeta])(\zeta - \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[\zeta])^\top \big] \Big) = \mathbb{E}_{p(h|\theta)p(\theta|\rho)}\big[ \|\zeta - \mathbb{E}_{p(h|\theta)p(\theta|\rho)}[\zeta]\|^2 \big].
$$

It was shown in Zhao et al. (2012) that the optimal baseline for policy-prior search is given by

$$
\xi^* = \frac{\mathbb{E}_{p(h|\theta)p(\theta|\rho)}\big[ R(h)\, \|\nabla_\rho \log p(\theta|\rho)\|^2 \big]}{\mathbb{E}_{p(\theta|\rho)}\big[ \|\nabla_\rho \log p(\theta|\rho)\|^2 \big]},
$$

where $\mathbb{E}_{p(\theta|\rho)}$ denotes the expectation over policy parameter θ drawn from p(θ|ρ). In practice, the expectations are approximated by the sample averages.
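In sample form, ξ* is a ratio of two averages, which can be computed as in the following sketch (illustrative only; the log-derivative stacking follows the Gaussian-prior formulas above):

```python
import numpy as np

def optimal_baseline(thetas, returns, eta, tau):
    """Sample approximation of the optimal baseline xi* for PGPE."""
    def grad_log_prior(theta):
        d = theta - eta
        return np.concatenate([d / tau**2, (d**2 - tau**2) / tau**3])

    norms = np.array([np.sum(grad_log_prior(th)**2) for th in thetas])
    return np.sum(np.asarray(returns) * norms) / np.sum(norms)  # E[R ||g||^2] / E[||g||^2]
```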
9.2.3 Variance Analysis of Gradient Estimators
Here the variance of gradient estimators is theoretically investigated for the independent Gaussian prior (9.2) with φ(s) = s. See Zhao et al. (2012) for technical details.

Below, subsets of the following assumptions are considered (which are the same as the ones used in Section 7.2.3):

Assumption (A): r(s, a, s′) ∈ [−β, β] for β > 0.

Assumption (B): r(s, a, s′) ∈ [α, β] for 0 < α < β.

Assumption (C): For δ > 0, there exist two series $\{c_t\}_{t=1}^{T}$ and $\{d_t\}_{t=1}^{T}$ such that $\|s_t\| \ge c_t$ and $\|s_t\| \le d_t$ hold with probability at least $1 - \frac{\delta}{2N}$, respectively, over the choice of sample paths.
Note that Assumption (B) is stronger than Assumption (A). Let

$$
G = \sum_{b=1}^{B} \tau_b^{-2}.
$$
First, the variance of gradient estimators in policy-prior search is analyzed:

Theorem 9.1 Under Assumption (A), the following upper bounds hold:

$$
\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[\widehat{\nabla_\eta J}(\eta,\tau)\big] \le \frac{\beta^2 (1-\gamma^T)^2 G}{N(1-\gamma)^2} \le \frac{\beta^2 G}{N(1-\gamma)^2},
$$

$$
\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[\widehat{\nabla_\tau J}(\eta,\tau)\big] \le \frac{2\beta^2 (1-\gamma^T)^2 G}{N(1-\gamma)^2} \le \frac{2\beta^2 G}{N(1-\gamma)^2}.
$$
The second upper bounds are independent of the trajectory length T, while the upper bounds for direct policy search (Theorem 7.1 in Section 7.2.3) are monotone increasing with respect to the trajectory length T. Thus, gradient estimation in policy-prior search is expected to be more reliable than that in direct policy search when the trajectory length T is large.
The following theorem more explicitly compares the variance of gradient estimators in direct policy search and policy-prior search:

Theorem 9.2 In addition to Assumptions (B) and (C), assume that

$$
\zeta(T) = C_T \alpha^2 - D_T \beta^2 / (2\pi)
$$

is positive and monotone increasing with respect to T, where

$$
C_T = \sum_{t=1}^{T} c_t^2 \quad \text{and} \quad D_T = \sum_{t=1}^{T} d_t^2.
$$

If there exists $T_0$ such that

$$
\zeta(T_0) \ge \beta^2 G \sigma^2,
$$

then it holds that

$$
\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[\widehat{\nabla_\mu J}(\theta)\big] > \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[\widehat{\nabla_\eta J}(\eta,\tau)\big]
$$

for all $T > T_0$, with probability at least 1 − δ.
The above theorem means that policy-prior search is more favorable than direct policy search in terms of the variance of gradient estimators of the mean, if trajectory length T is large.

Next, the contribution of the optimal baseline to the variance of the gradient estimator with respect to mean parameter η is investigated. It was shown in Zhao et al. (2012) that the excess variance for a baseline ξ is given by

$$
\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[\widehat{\nabla_\rho J_\xi}(\rho)\big] - \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[\widehat{\nabla_\rho J_{\xi^*}}(\rho)\big] = \frac{(\xi - \xi^*)^2}{N}\, \mathbb{E}_{p(h|\theta)p(\theta|\rho)}\big[ \|\nabla_\rho \log p(\theta|\rho)\|^2 \big].
$$
Based on this expression, the following theorem holds.

Theorem 9.3 If r(s, a, s′) ≥ α > 0, the following lower bound holds:

$$
\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[\widehat{\nabla_\eta J}(\eta,\tau)\big] - \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[\widehat{\nabla_\eta J_{\xi^*}}(\eta,\tau)\big] \ge \frac{\alpha^2 (1-\gamma^T)^2 G}{N(1-\gamma)^2}.
$$

Under Assumption (A), the following upper bound holds:

$$
\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[\widehat{\nabla_\eta J}(\eta,\tau)\big] - \mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[\widehat{\nabla_\eta J_{\xi^*}}(\eta,\tau)\big] \le \frac{\beta^2 (1-\gamma^T)^2 G}{N(1-\gamma)^2}.
$$
The above theorem shows that the lower bound of the excess variance is positive and monotone increasing with respect to the trajectory length T. This means that the variance is always reduced by subtracting the optimal baseline and that the amount of variance reduction is monotone increasing with respect to the trajectory length T. Note that the upper bound is also monotone increasing with respect to the trajectory length T.
Finally, the variance of the gradient estimator with the optimal baseline is investigated:

Theorem 9.4 Under Assumptions (B) and (C), the following upper bound holds with probability at least 1 − δ:

$$
\mathrm{Var}_{p(h|\theta)p(\theta|\rho)}\big[\widehat{\nabla_\eta J_{\xi^*}}(\eta,\tau)\big] \le \frac{(1-\gamma^T)^2 (\beta^2 - \alpha^2) G}{N(1-\gamma)^2} \le \frac{(\beta^2 - \alpha^2) G}{N(1-\gamma)^2}.
$$
The second upper bound is independent of the trajectory length T, while Theorem 7.4 in Section 7.2.3 showed that the upper bound of the variance of gradient estimators with the optimal baseline in direct policy search is monotone increasing with respect to trajectory length T. Thus, when trajectory length T is large, policy-prior search is more favorable than direct policy search in terms of the variance of the gradient estimator with respect to the mean, even when optimal baseline subtraction is applied.
9.2.4 Numerical Examples

Here, the performance of the direct policy search and policy-prior search algorithms is experimentally compared.
9.2.4.1 Setup

Let the state space S be one-dimensional and continuous; the initial state is randomly chosen following the standard normal distribution. The action space A is also set to be one-dimensional and continuous. The transition dynamics of the environment is set at

$$
s_{t+1} = s_t + a_t + \varepsilon,
$$
TABLE 9.1: Variance and bias of estimated parameters.

(a) Trajectory length T = 10

                 Variance            Bias
Method           µ, η      σ, τ      µ, η      σ, τ
REINFORCE        13.257    26.917    -0.310    -1.510
REINFORCE-OB     0.091     0.120     0.067     0.129
PGPE             0.971     1.686     -0.069    0.132
PGPE-OB          0.037     0.069     -0.016    0.051

(b) Trajectory length T = 50

                 Variance            Bias
Method           µ, η      σ, τ      µ, η      σ, τ
REINFORCE        188.386   278.310   -1.813    -5.175
REINFORCE-OB     0.545     0.900     -0.299    -0.201
PGPE             1.657     3.372     -0.105    -0.329
PGPE-OB          0.085     0.182     0.048     -0.078
where ε ∼ N(0, 0.5²) is stochastic noise and N(µ, σ²) denotes the normal distribution with mean µ and variance σ². The immediate reward is defined as

$$
r = \exp(-s^2/2 - a^2/2) + 1,
$$

which is bounded as 1 < r ≤ 2. The length of the trajectory is set at T = 10 or 50, the discount factor is set at γ = 0.9, and the number of episodic samples is set at N = 100.
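This benchmark is simple enough to write down directly; the following sketch (variable names are mine, and I assume the reward is evaluated at the state where the action is taken) implements the transition and reward so that the `rollout` helper above can be used with phi = lambda s: np.array([s]):

```python
import numpy as np

class Benchmark1D:
    """1D benchmark: s' = s + a + eps, r = exp(-s^2/2 - a^2/2) + 1."""

    def reset(self):
        self.s = np.random.randn()          # initial state ~ N(0, 1)
        return self.s

    def step(self, a):
        # assumption: reward depends on the current state s, not on s'
        r = np.exp(-self.s**2 / 2 - a**2 / 2) + 1
        self.s = self.s + a + 0.5 * np.random.randn()  # eps ~ N(0, 0.5^2)
        return self.s, r
```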
9.2.4.2 Variance and Bias

First, the variance and the bias of gradient estimators of the following methods are investigated:

• REINFORCE: REINFORCE (gradient-based direct policy search) without a baseline (Williams, 1992).
• REINFORCE-OB: REINFORCE with optimal baseline subtraction (Peters & Schaal, 2006).
• PGPE: PGPE (gradient-based policy-prior search) without a baseline (Sehnke et al., 2010).
• PGPE-OB: PGPE with optimal baseline subtraction (Zhao et al., 2012).
Table 9.1 summarizes the variance of gradient estimators over 100 runs, showing that the variance of REINFORCE is overall larger than that of PGPE. A notable difference between REINFORCE and PGPE is that the variance of REINFORCE significantly grows as the trajectory length T increases, whereas that of PGPE is not influenced that much by T. This agrees well with the theoretical analyses given in Section 7.2.3 and Section 9.2.3. Optimal baseline subtraction (REINFORCE-OB and PGPE-OB) is shown to contribute highly to reducing the variance, especially when trajectory length T is large, which also agrees well with the theoretical analysis.

The bias of the gradient estimator of each method is also investigated. Here, gradients estimated with N = 1000 are regarded as true gradients, and the bias of gradient estimators is computed. The results are also included in Table 9.1, showing that introduction of baselines does not increase the bias; rather, it tends to reduce the bias.
9.2.4.3 Variance and Policy Hyper-Parameter Change through Entire Policy-Update Process

Next, the variance of gradient estimators is investigated when policy hyper-parameters are updated over iterations. If the deviation parameter σ takes a negative value during the policy-update process, it is set at 0.05. In this experiment, the variance is computed from 50 runs for T = 20 and N = 10, and policies are updated over 50 iterations. In order to evaluate the variance in a stable manner, the above experiments are repeated 20 times with random choice of the initial mean parameter µ from [−3.0, −0.1], and the average variance of gradient estimators is investigated with respect to mean parameter µ over 20 trials. The results are plotted in Figure 9.2. Figure 9.2(a) compares the variance of REINFORCE with/without baselines, whereas Figure 9.2(b) compares the variance of PGPE with/without baselines. These graphs show that introduction of baselines contributes highly to the reduction of the variance over iterations.

Let us illustrate how parameters are updated by PGPE-OB over 50 iterations for N = 10 and T = 10. The initial mean parameter is set at η = −1.6, −0.8, or −0.1, and the initial deviation parameter is set at τ = 1. Figure 9.3 depicts the contour of the expected return and illustrates trajectories of parameter updates over iterations by PGPE-OB. In the graph, the maximum of the return surface is located at the middle bottom, and PGPE-OB leads the solutions to a maximum point rapidly.
9.2.4.4 Performance of Learned Policies

Finally, the return obtained by each method is evaluated. The trajectory length is fixed at T = 20, and the maximum number of policy-update iterations is set at 50. Average returns over 20 runs are investigated as functions of the number of episodic samples N. Figure 9.4(a) shows the results when the initial mean parameter µ is chosen randomly from [−1.6, −0.1], which tends to perform well. The graph shows that PGPE-OB performs the best, especially when N < 5; then REINFORCE-OB follows with a small margin.
FIGURE 9.2: Mean and standard error of the variance of gradient estimators with respect to the mean parameter through policy-update iterations. (Panels: (a) REINFORCE and REINFORCE-OB; (b) PGPE and PGPE-OB. Axes: variance in log10 scale vs. iteration.)
FIGURE 9.3: Trajectories of policy-prior parameter updates by PGPE. (Axes: policy-prior mean η vs. policy-prior standard deviation τ, over contours of the expected return.)
FIGURE 9.4: Average and standard error of returns over 20 runs as functions of the number of episodic samples N. (Panels: (a) good initial policy; (b) poor initial policy.)
The plain PGPE also works reasonably well, although it is slightly unstable due to larger variance. The plain REINFORCE is highly unstable, which is caused by the huge variance of gradient estimators (see Figure 9.2 again). Figure 9.4(b) describes the results when the initial mean parameter µ is chosen randomly from [−3.0, −0.1], which tends to result in poorer performance. In this setup, the difference among the compared methods is more significant than in the case with good initial policies, meaning that REINFORCE is sensitive to the choice of initial policies. Overall, the PGPE methods tend to outperform the REINFORCE methods, and among the PGPE methods, PGPE-OB works very well and converges quickly.
9.3 Sample Reuse in Policy-Prior Search
Although PGPE was shown to outperform REINFORCE, its behavior is still rather unstable if the number of data samples used for estimating the gradient is small. In this section, the sample-reuse idea is applied to PGPE. Technically, the original PGPE is categorized as an on-policy algorithm where data drawn from the current target policy is used to estimate policy-prior gradients. On the other hand, off-policy algorithms are more flexible in the sense that a data-collecting policy and the current target policy can be different. Here, PGPE is extended to the off-policy scenario using the importance-weighting technique.
9.3.1 Importance Weighting
Let us consider an off-policy scenario where a data-collecting policy and the current target policy are different in general. In the context of PGPE, two hyper-parameters are considered: ρ as the target policy to learn and ρ′ as a policy for data collection. Let us denote the data samples collected with hyper-parameter ρ′ by H′:

$$
H' = \{(\theta'_n, h'_n)\}_{n=1}^{N'} \overset{\mathrm{i.i.d.}}{\sim} p(h|\theta)\, p(\theta|\rho').
$$

If the data H′ is naively used to estimate policy-prior gradients by Eq. (9.1), we suffer an inconsistency problem:

$$
\frac{1}{N'} \sum_{n=1}^{N'} \nabla_\rho \log p(\theta'_n|\rho)\, R(h'_n) \underset{N' \to \infty}{\nrightarrow} \nabla_\rho J(\rho),
$$

where

$$
\nabla_\rho J(\rho) = \iint p(h|\theta)\, p(\theta|\rho)\, \nabla_\rho \log p(\theta|\rho)\, R(h)\, \mathrm{d}h\, \mathrm{d}\theta
$$

is the gradient of the expected return,

$$
J(\rho) = \iint p(h|\theta)\, p(\theta|\rho)\, R(h)\, \mathrm{d}h\, \mathrm{d}\theta,
$$

with respect to the policy hyper-parameter ρ. Below, this naive method is referred to as non-importance-weighted PGPE (NIW-PGPE).
This inconsistency problem can be systematically resolved by importance weighting:

$$
\widehat{\nabla_\rho J_{\mathrm{IW}}}(\rho) = \frac{1}{N'} \sum_{n=1}^{N'} w(\theta'_n)\, \nabla_\rho \log p(\theta'_n|\rho)\, R(h'_n) \underset{N' \to \infty}{\longrightarrow} \nabla_\rho J(\rho),
$$

where $w(\theta) = p(\theta|\rho)/p(\theta|\rho')$ is the importance weight. This extended method is called importance-weighted PGPE (IW-PGPE).
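For the Gaussian prior, the importance weight is a ratio of two Gaussian densities, so the IW-PGPE gradient estimate can be sketched as follows (illustrative only; the θ′ samples and returns are assumed to have been collected under the prior ρ′ = (η′, τ′)):

```python
import numpy as np

def gaussian_log_pdf(theta, eta, tau):
    """Log-density of the independent Gaussian prior p(theta | eta, tau)."""
    return np.sum(-0.5 * np.log(2 * np.pi * tau**2) - (theta - eta)**2 / (2 * tau**2))

def iw_pgpe_gradient_eta(thetas, returns, eta, tau, eta_c, tau_c):
    """IW-PGPE estimate of grad_eta J; (eta_c, tau_c) is the collecting prior rho'."""
    g = np.zeros_like(eta)
    for theta, R in zip(thetas, returns):
        w = np.exp(gaussian_log_pdf(theta, eta, tau)
                   - gaussian_log_pdf(theta, eta_c, tau_c))  # p(theta|rho)/p(theta|rho')
        g += w * ((theta - eta) / tau**2) * R
    return g / len(thetas)
```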
Below, the variance of gradient estimators in IW-PGPE is theoretically analyzed. See Zhao et al. (2013) for technical details. As described in Section 9.2.1, the deterministic linear policy model is used here:

$$
\pi(a|s, \theta) = \delta(a = \theta^\top \phi(s)), \qquad (9.3)
$$

where δ(·) is the Dirac delta function and φ(s) is the B-dimensional basis function. Policy parameter θ = (θ_1, …, θ_B)^⊤ is drawn from the independent Gaussian prior, where policy hyper-parameter ρ consists of prior means η = (η_1, …, η_B)^⊤ and prior standard deviations τ = (τ_1, …, τ_B)^⊤:

$$
p(\theta|\eta, \tau) = \prod_{b=1}^{B} \frac{1}{\tau_b \sqrt{2\pi}} \exp\left( -\frac{(\theta_b - \eta_b)^2}{2\tau_b^2} \right). \qquad (9.4)
$$

Let

$$
G = \sum_{b=1}^{B} \tau_b^{-2},
$$

and let $\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}$ denote the trace of the covariance matrix:

$$
\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}[\zeta] = \mathrm{tr}\Big( \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}\big[ (\zeta - \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}[\zeta])(\zeta - \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}[\zeta])^\top \big] \Big) = \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}\big[ \|\zeta - \mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}[\zeta]\|^2 \big],
$$

where $\mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}$ denotes the expectation over trajectory h′ and policy parameter θ′ drawn from p(h′|θ′)p(θ′|ρ′). Then the following theorem holds:
Theorem 9.5 Assume that for all s, a, and s′ there exists β > 0 such that r(s, a, s′) ∈ [−β, β], and, for all θ, there exists $0 < w_{\max} < \infty$ such that $0 < w(\theta) \le w_{\max}$. Then, the following upper bounds hold:

$$
\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[\widehat{\nabla_\eta J_{\mathrm{IW}}}(\eta,\tau)\big] \le \frac{\beta^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2}\, w_{\max},
$$

$$
\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[\widehat{\nabla_\tau J_{\mathrm{IW}}}(\eta,\tau)\big] \le \frac{2\beta^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2}\, w_{\max}.
$$
It is interesting to note that the upper bounds are the same as the ones for the plain PGPE (Theorem 9.1 in Section 9.2.3) except for the factor $w_{\max}$. When $w_{\max} = 1$, the bounds are reduced to those of the plain PGPE method. However, if the sampling distribution is significantly different from the target distribution, $w_{\max}$ can take a large value and thus IW-PGPE can produce a gradient estimator with large variance. Therefore, IW-PGPE may not be a reliable approach as it is.

Below, a variance reduction technique for IW-PGPE is introduced which leads to a practically useful algorithm.
9.3.2 Variance Reduction by Baseline Subtraction
Here, a baseline is introduced for IW-PGPE to reduce the variance of gradient estimators, in the same way as for the plain PGPE explained in Section 9.2.2.

A policy-prior gradient estimator with a baseline ξ ∈ ℝ is defined as

$$
\widehat{\nabla_\rho J^{\xi}_{\mathrm{IW}}}(\rho) = \frac{1}{N'} \sum_{n=1}^{N'} (R(h'_n) - \xi)\, w(\theta'_n)\, \nabla_\rho \log p(\theta'_n|\rho).
$$

Here, the baseline ξ is determined so that the variance is minimized. Let ξ* be the optimal baseline for IW-PGPE that minimizes the variance:

$$
\xi^* = \mathop{\mathrm{argmin}}_{\xi} \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[\widehat{\nabla_\rho J^{\xi}_{\mathrm{IW}}}(\rho)\big].
$$

Then the optimal baseline for IW-PGPE is given as follows (Zhao et al., 2013):

$$
\xi^* = \frac{\mathbb{E}_{p(h'|\theta')p(\theta'|\rho')}\big[ R(h')\, w^2(\theta')\, \|\nabla_\rho \log p(\theta'|\rho)\|^2 \big]}{\mathbb{E}_{p(\theta'|\rho')}\big[ w^2(\theta')\, \|\nabla_\rho \log p(\theta'|\rho)\|^2 \big]},
$$

where $\mathbb{E}_{p(\theta'|\rho')}$ denotes the expectation over policy parameter θ′ drawn from p(θ′|ρ′). In practice, the expectations are approximated by the sample averages. The excess variance for a baseline ξ is given as

$$
\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[\widehat{\nabla_\rho J^{\xi}_{\mathrm{IW}}}(\rho)\big] - \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[\widehat{\nabla_\rho J^{\xi^*}_{\mathrm{IW}}}(\rho)\big] = \frac{(\xi - \xi^*)^2}{N'}\, \mathbb{E}_{p(\theta'|\rho')}\big[ w^2(\theta')\, \|\nabla_\rho \log p(\theta'|\rho)\|^2 \big].
$$

Next, contributions of the optimal baseline to variance reduction in IW-PGPE are analyzed for the deterministic linear policy model (9.3) and the independent Gaussian prior (9.4). See Zhao et al. (2013) for technical details.
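In sample form, ξ* for IW-PGPE is again a ratio of two averages, now weighted by the squared importance weights; a minimal sketch (illustrative only; `weights` holds the $w(\theta'_n)$ computed as in the IW-PGPE sketch above):

```python
import numpy as np

def iw_optimal_baseline(thetas, returns, weights, eta, tau):
    """Sample estimate of xi* for IW-PGPE: E[R w^2 ||g||^2] / E[w^2 ||g||^2]."""
    def grad_norm_sq(theta):
        d = theta - eta
        return np.sum((d / tau**2)**2) + np.sum(((d**2 - tau**2) / tau**3)**2)

    v = np.array([w**2 * grad_norm_sq(th) for th, w in zip(thetas, weights)])
    return np.sum(np.asarray(returns) * v) / np.sum(v)
```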
Theorem 9.6 Assume that for all s, a, and s′ there exists α > 0 such that r(s, a, s′) ≥ α, and, for all θ, there exists $w_{\min} > 0$ such that $w(\theta) \ge w_{\min}$. Then, the following lower bounds hold:

$$
\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[\widehat{\nabla_\eta J_{\mathrm{IW}}}(\eta,\tau)\big] - \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[\widehat{\nabla_\eta J^{\xi^*}_{\mathrm{IW}}}(\eta,\tau)\big] \ge \frac{\alpha^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2}\, w_{\min},
$$

$$
\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[\widehat{\nabla_\tau J_{\mathrm{IW}}}(\eta,\tau)\big] - \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[\widehat{\nabla_\tau J^{\xi^*}_{\mathrm{IW}}}(\eta,\tau)\big] \ge \frac{2\alpha^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2}\, w_{\min}.
$$

Assume that for all s, a, and s′ there exists β > 0 such that r(s, a, s′) ∈ [−β, β], and, for all θ, there exists $0 < w_{\max} < \infty$ such that $0 < w(\theta) \le w_{\max}$. Then, the following upper bounds hold:

$$
\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[\widehat{\nabla_\eta J_{\mathrm{IW}}}(\eta,\tau)\big] - \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[\widehat{\nabla_\eta J^{\xi^*}_{\mathrm{IW}}}(\eta,\tau)\big] \le \frac{\beta^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2}\, w_{\max},
$$

$$
\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[\widehat{\nabla_\tau J_{\mathrm{IW}}}(\eta,\tau)\big] - \mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[\widehat{\nabla_\tau J^{\xi^*}_{\mathrm{IW}}}(\eta,\tau)\big] \le \frac{2\beta^2 (1-\gamma^T)^2 G}{N'(1-\gamma)^2}\, w_{\max}.
$$
This theorem shows that the bounds of the variance reduction in IW-PGPE brought by the optimal baseline depend on the bounds of the importance weight, $w_{\min}$ and $w_{\max}$: the larger the upper bound $w_{\max}$ is, the more optimal baseline subtraction can reduce the variance.

From Theorem 9.5 and Theorem 9.6, the following corollary can be immediately obtained:
Corollary 9.7 Assume that for all s, a, and s′ there exists 0 < α < β such that r(s, a, s′) ∈ [α, β], and, for all θ, there exists $0 < w_{\min} < w_{\max} < \infty$ such that $w_{\min} \le w(\theta) \le w_{\max}$. Then, the following upper bounds hold:

$$
\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[\widehat{\nabla_\eta J^{\xi^*}_{\mathrm{IW}}}(\eta,\tau)\big] \le \frac{(1-\gamma^T)^2 G}{N'(1-\gamma)^2}\, (\beta^2 w_{\max} - \alpha^2 w_{\min}),
$$

$$
\mathrm{Var}_{p(h'|\theta')p(\theta'|\rho')}\big[\widehat{\nabla_\tau J^{\xi^*}_{\mathrm{IW}}}(\eta,\tau)\big] \le \frac{2(1-\gamma^T)^2 G}{N'(1-\gamma)^2}\, (\beta^2 w_{\max} - \alpha^2 w_{\min}).
$$
From Theorem 9.5 and this corollary, we can confirm that the upper bounds for the baseline-subtracted IW-PGPE are smaller than those for the plain IW-PGPE without baseline subtraction, because $\alpha^2 w_{\min} > 0$. In particular, if $w_{\min}$ is large, the upper bounds for the baseline-subtracted IW-PGPE can be much smaller than those for the plain IW-PGPE without baseline subtraction.
9.3.3 Numerical Examples
Here, we consider the task of controlling the humanoid robot CB-i (Cheng et al., 2007) shown in Figure 9.5(a). The goal is to lead the end effector of the right arm (the right hand) to a target object. First, its simulated upper-body model, illustrated in Figure 9.5(b), is used to investigate the performance of the IW-PGPE-OB method. Then the IW-PGPE-OB method is applied to the real robot.
9.3.3.1 Setup

The performance of the following 4 methods is compared:

• IW-REINFORCE-OB: Importance-weighted REINFORCE with the optimal baseline.
• NIW-PGPE-OB: Data-reuse PGPE-OB without importance weighting.
• PGPE-OB: Plain PGPE-OB without data reuse.
• IW-PGPE-OB: Importance-weighted PGPE with the optimal baseline.

FIGURE 9.5: Humanoid robot CB-i and its upper-body model ((a) CB-i; (b) simulated upper-body model). The humanoid robot CB-i was developed by the JST-ICORP Computational Brain Project and ATR Computational Neuroscience Labs (Cheng et al., 2007).
The upper body of CB-i has 9 degrees of freedom: the shoulder pitch, shoulder roll, and elbow pitch of the right arm; the shoulder pitch, shoulder roll, and elbow pitch of the left arm; the waist yaw; the torso roll; and the torso pitch (Figure 9.5(b)). At each time step, the controller receives states from the system and sends out actions. The state space is 18-dimensional, which corresponds to the current angle and angular velocity of each joint. The action space is 9-dimensional, which corresponds to the target angle of each joint. Both states and actions are continuous.

Given the state and action in each time step, the physical control system calculates the torques at each joint by using a proportional-derivative (PD) controller as

$$
\tau_i = K_{p_i}(a_i - s_i) - K_{d_i}\dot{s}_i,
$$
where $s_i$, $\dot{s}_i$, and $a_i$ denote the current angle, the current angular velocity, and the target angle of the i-th joint, respectively. $K_{p_i}$ and $K_{d_i}$ denote the position and velocity gains for the i-th joint, respectively. These parameters are set at

$$
K_{p_i} = 200 \quad \text{and} \quad K_{d_i} = 10
$$

for the elbow pitch joints, and

$$
K_{p_i} = 2000 \quad \text{and} \quad K_{d_i} = 100
$$

for the other joints.
The initial position of the robot is fixed at the standing-up-straight pose with the arms down. The immediate reward $r_t$ at the time step t is defined as

$$
r_t = \exp(-10 d_t) - 0.0005 \min(c_t, 10{,}000),
$$

where $d_t$ is the distance between the right hand of the robot and the target object, and $c_t$ is the sum of control costs for each joint. The linear deterministic policy is used for the PGPE methods, and the Gaussian policy is used for IW-REINFORCE-OB. In both cases, the linear basis function φ(s) = s is used. For PGPE, the initial prior mean η is randomly chosen from the standard normal distribution, and the initial prior standard deviation τ is set at 1.

To evaluate the usefulness of data-reuse methods with a small number of samples, the agent collects only N = 3 on-policy samples with trajectory length T = 100 at each iteration. All previous data samples are reused to estimate the gradients in the data-reuse methods, while only on-policy samples are used to estimate the gradients in the plain PGPE-OB method. The discount factor is set at γ = 0.9.
9.3.3.2 Simulation with 2 Degrees of Freedom

First, the performance on the reaching task with only 2 degrees of freedom is investigated. The body of the robot is fixed and only the right shoulder pitch and right elbow pitch are used. Figure 9.6 depicts the averaged expected return over 10 trials as a function of the number of iterations. The expected return at each trial is computed from 50 newly drawn test episodic data that are not used for policy learning. The graph shows that IW-PGPE-OB nicely improves the performance over iterations with only a small number of on-policy samples. The plain PGPE-OB method can also improve the performance over iterations, but slowly. NIW-PGPE-OB is not as good as IW-PGPE-OB, especially at the later iterations, because of the inconsistency of the NIW estimator.
The distance from the right hand to the object and the control costs along the trajectory are also investigated for three policies: the initial policy, the policy obtained at the 20th iteration by IW-PGPE-OB, and the policy obtained at the 50th iteration by IW-PGPE-OB. Figure 9.7(a) plots the distance to the target object as a function of the time step. This shows that the policy obtained at the 50th iteration decreases the distance rapidly compared with the initial policy and the policy obtained at the 20th iteration, which means that the robot can reach the object quickly by using the learned policy.
FIGURE 9.6: Average and standard error of returns over 10 runs as functions of the number of iterations for the reaching task with 2 degrees of freedom (right shoulder pitch and right elbow pitch).
FIGURE 9.7: Distance and control costs of arm reaching with 2 degrees of freedom using the policy learned by IW-PGPE-OB. (Panels: (a) distance [m] vs. time step; (b) control costs vs. time step.)
FIGURE 9.8: Typical example of arm reaching with 2 degrees of freedom using the policy obtained by IW-PGPE-OB at the 50th iteration (from left to right and top to bottom).
Figure 9.7(b) plots the control cost as a function of the time step. This shows that the policy obtained at the 50th iteration decreases the control cost steadily until the reaching task is completed. This is because the robot mainly adjusts the shoulder pitch in the beginning, which consumes a larger amount of energy than the energy required for controlling the elbow pitch. Then, once the right hand gets closer to the target object, the robot starts adjusting the elbow pitch to reach the target object. The policy obtained at the 20th iteration actually consumes less control cost, but it cannot lead the arm to the target object.

Figure 9.8 illustrates a typical solution of the reaching task with 2 degrees of freedom by the policy obtained by IW-PGPE-OB at the 50th iteration. The images show that the right hand is successfully led to the target object within only 10 time steps.
9.3.3.3 Simulation with All 9 Degrees of Freedom

Finally, the same experiment is carried out using all 9 degrees of freedom. The position of the target object is more distant from the robot so that it cannot be reached by only using the right arm.
FIGURE 9.9: Average and standard error of returns over 10 runs as functions of the number of iterations for the reaching task with all 9 degrees of freedom.
Because all 9 joints are used, the dimensionality of the state space is much increased, and this grows the values of importance weights exponentially. In order to mitigate the large values of importance weights, we decided not to reuse all previously collected samples, but only samples collected in the last 5 iterations. This allows us to keep the difference between the sampling distribution and the target distribution reasonably small, and thus the values of importance weights can be suppressed to some extent. Furthermore, following Wawrzynski (2009), we consider a version of IW-PGPE-OB, denoted as "truncated IW-PGPE-OB" below, where the importance weight is truncated as w = min(w, 2).
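Truncation is a one-line change to the weight computation (a sketch; the threshold 2 follows the text above):

```python
import numpy as np

def truncated_weight(w, c=2.0):
    """Truncate an importance weight at threshold c, i.e., w = min(w, c)."""
    return np.minimum(w, c)
```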
The results plotted in Figure 9.9 show that the performance of the truncated IW-PGPE-OB is the best. This implies that the truncation of importance weights is helpful when applying IW-PGPE-OB to high-dimensional problems.

Figure 9.10 illustrates a typical solution of the reaching task with all 9 degrees of freedom by the policy obtained by the truncated IW-PGPE-OB at the 400th iteration. The images show that the policy learned by our proposed method successfully leads the right hand to the target object, and the irrelevant parts are kept at the initial position for reducing the control costs.
9.3.3.4 Real Robot Control

Finally, the IW-PGPE-OB method is applied to the real CB-i robot shown in Figure 9.11 (Sugimoto et al., 2014).

FIGURE 9.10: Typical example of arm reaching with all 9 degrees of freedom using the policy obtained by the truncated IW-PGPE-OB at the 400th iteration (from left to right and top to bottom).

FIGURE 9.11: Reaching task by the real CB-i robot (Sugimoto et al., 2014).

The experimental setting is essentially the same as the above simulation studies with 9 joints, but policies are updated only every 5 trials and samples taken from the last 10 trials are reused for stabilization purposes. Figure 9.12 plots the obtained rewards cumulated over policy update iterations, showing that rewards are steadily increased over iterations. Figure 9.13 exhibits the acquired reaching motion based on the policy obtained at the 120th iteration, showing that the end effector of the robot can successfully reach the target object.
FIGURE 9.12: Obtained reward cumulated over policy update iterations. (Axes: cumulative rewards vs. number of updates.)
9.4 Remarks
When the trajectory length is large, direct policy search tends to produce gradient estimators with large variance, due to the randomness of stochastic policies. Policy-prior search can avoid this problem by using deterministic policies and introducing stochasticity by considering a prior distribution over policy parameters. Both theoretically and experimentally, advantages of policy-prior search over direct policy search were shown.

A sample-reuse framework for policy-prior search was also introduced, which is highly useful in real-world reinforcement learning problems with high sampling costs. Following the same line as the sample-reuse methods for policy iteration described in Chapter 4 and direct policy search introduced in Chapter 8, importance weighting plays an essential role in sample-reuse policy-prior search. When the dimensionality of the state-action space is high, however, importance weights tend to take extremely large values, which causes instability of the importance-weighting methods. To mitigate this problem, truncation of the importance weights is useful in practice.
FIGURE 9.13: Typical example of arm reaching using the policy obtained by the IW-PGPE-OB method (from left to right and top to bottom).
Part IV

Model-Based Reinforcement Learning
The reinforcement learning methods explained in Part II and Part III are categorized into the model-free approach, meaning that policies are learned without explicitly modeling the unknown environment (i.e., the transition probability of the agent). On the other hand, in Part IV, we introduce an alternative approach called the model-based approach, which explicitly models the environment in advance and uses the learned environment model for policy learning.

In the model-based approach, no additional sampling cost is necessary to generate artificial samples from the learned environment model. Thus, the model-based approach is useful when data collection is expensive (e.g., robot control). However, accurately estimating the transition model from a limited amount of trajectory data in multi-dimensional continuous state and action spaces is highly challenging.

In Chapter 10, we introduce a non-parametric model estimator that possesses the optimal convergence rate with high computational efficiency, and demonstrate its usefulness through experiments. Then, in Chapter 11, we combine dimensionality reduction with model estimation to cope with high dimensionality of state and action spaces.
Chapter 10

Transition Model Estimation
In this chapter, we introduce transition probability estimation methods for model-based reinforcement learning (Wang & Dietterich, 2003; Deisenroth & Rasmussen, 2011). Among the methods described in Section 10.1, a non-parametric transition model estimator called least-squares conditional density estimation (LSCDE) (Sugiyama et al., 2010) is shown to be the most promising approach (Tangkaratt et al., 2014a). Then in Section 10.2, we describe how the transition model estimator can be utilized in model-based reinforcement learning. In Section 10.3, experimental performance of a model-based policy-prior search method is evaluated. Finally, in Section 10.4, this chapter is concluded.
10.1 Conditional Density Estimation
In this section, the problem of approximating the transition probability $p(s'|s,a)$ from independent transition samples $\{(s_m, a_m, s'_m)\}_{m=1}^{M}$ is addressed.
10.1.1 Regression-Based Approach
In the regression-based approach, the problem of transition probability estimation is formulated as a function approximation problem of predicting output $s'$ given input $s$ and $a$ under Gaussian noise:

$$s' = f(s, a) + \epsilon,$$

where $f$ is an unknown regression function to be learned, $\epsilon$ is an independent Gaussian noise vector with mean zero and covariance matrix $\sigma^2 I$, and $I$ denotes the identity matrix.

Let us approximate $f$ by the following linear-in-parameter model:

$$f(s, a; \Gamma) = \Gamma^\top \phi(s, a),$$

where $\Gamma$ is the $B \times \dim(s)$ parameter matrix and $\phi(s, a)$ is the $B$-dimensional basis vector. A typical choice of the basis vector is the Gaussian kernel, which is defined for $B = M$ as

$$\phi_b(s, a) = \exp\left(-\frac{\|s - s_b\|^2 + (a - a_b)^2}{2\kappa^2}\right),$$

where $\kappa > 0$ denotes the Gaussian kernel width. If $B$ is too large, the number of basis functions may be reduced by only using a subset of samples as Gaussian centers. Different Gaussian widths for $s$ and $a$ may be used if necessary.
The parameter matrix $\Gamma$ is learned so that the regularized squared error is minimized:

$$\widehat{\Gamma} = \mathop{\mathrm{argmin}}_{\Gamma} \left[ \sum_{m=1}^{M} \left\| f(s_m, a_m; \Gamma) - s'_m \right\|^2 + \mathrm{tr}\left(\Gamma^\top R \Gamma\right) \right],$$

where $R$ is the $B \times B$ positive semi-definite matrix called the regularization matrix. The solution $\widehat{\Gamma}$ is given analytically as

$$\widehat{\Gamma} = (\Phi^\top \Phi + R)^{-1} \Phi^\top (s'_1, \ldots, s'_M)^\top,$$

where $\Phi$ is the $M \times B$ design matrix defined as

$$\Phi_{m,b} = \phi_b(s_m, a_m).$$

We can confirm that the predicted output vector $\widehat{s}' = f(s, a; \widehat{\Gamma})$ actually follows the Gaussian distribution with mean

$$(s'_1, \ldots, s'_M)\, \Phi\, (\Phi^\top \Phi + R)^{-1} \phi(s, a)$$

and covariance matrix $\widehat{\delta}^2 I$, where

$$\widehat{\delta}^2 = \sigma^2\, \mathrm{tr}\left( (\Phi^\top \Phi + R)^{-2} \Phi^\top \Phi \right).$$

The tuning parameters such as the Gaussian kernel width $\kappa$ and the regularization matrix $R$ can be determined either by cross-validation or evidence maximization if the above method is regarded as Gaussian process regression in the Bayesian framework (Rasmussen & Williams, 2006).
This is the regression-based estimator of the transition probability density $p(s'|s,a)$ for an arbitrary test input $s$ and $a$. Thus, by the use of kernel regression models, the regression function $f$ (which is the conditional mean of output $s'$) is approximated in a non-parametric way. However, the conditional distribution of output $s'$ itself is restricted to be Gaussian, which is highly restrictive in real-world reinforcement learning.
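As a concrete illustration, the following minimal NumPy sketch (an assumption of this text, not code from the book) fits the Gaussian-kernel regression model above with $R = \lambda I$ and returns the predictive Gaussian mean for a test input:

```python
import numpy as np

def fit_regression_model(S, A, S_next, kappa=1.0, lam=0.1):
    """Kernel ridge regression transition model s' = Gamma^T phi(s, a) + noise.

    S: (M, ds) states, A: (M,) scalar actions, S_next: (M, ds) next states.
    Returns a function mapping a test (s, a) to the predictive mean of s'.
    """
    X = np.hstack([S, A[:, None]])                         # (M, ds+1) inputs
    sqd = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    Phi = np.exp(-sqd / (2 * kappa ** 2))                  # (M, M) design matrix, B = M
    Gamma = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(X)), Phi.T @ S_next)

    def predict_mean(s, a):
        x = np.append(s, a)
        phi = np.exp(-((X - x) ** 2).sum(-1) / (2 * kappa ** 2))
        return Gamma.T @ phi                                # predictive Gaussian mean

    return predict_mean
```

Because the predictive distribution is Gaussian by construction, this sketch inherits exactly the limitation discussed above: it cannot represent multi-modal transition densities.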
10.1.2 ε-Neighbor Kernel Density Estimation
When the conditioning variables $(s, a)$ are discrete, the conditional density $p(s'|s,a)$ can be easily estimated by standard density estimators such as kernel density estimation (KDE), by only using samples $s'_i$ such that $(s_i, a_i)$ agrees with the target values $(s, a)$. ε-neighbor KDE (εKDE) extends this idea to the continuous case such that $(s_i, a_i)$ are close to the target values $(s, a)$.

More specifically, εKDE with the Gaussian kernel is given by

$$\widehat{p}(s'|s,a) = \frac{1}{|I_{(s,a),\epsilon}|} \sum_{i \in I_{(s,a),\epsilon}} N(s';\, s'_i,\, \sigma^2 I),$$

where $I_{(s,a),\epsilon}$ is the set of sample indices such that $\|(s,a) - (s_i,a_i)\| \le \epsilon$, and $N(s'; s'_i, \sigma^2 I)$ denotes the Gaussian density with mean $s'_i$ and covariance matrix $\sigma^2 I$. The Gaussian width $\sigma$ and the distance threshold $\epsilon$ may be chosen by cross-validation.

εKDE is a useful non-parametric density estimator that is easy to implement. However, it is unreliable in high-dimensional problems due to the distance-based construction.
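The estimator is simple enough to state in a few lines of code; the sketch below (illustrative only, with arbitrary default values for ε and σ) evaluates the εKDE density at a single query point:

```python
import numpy as np

def eps_kde(s_query, a_query, s_prime_query, S, A, S_next, eps=0.5, sigma=0.3):
    """epsilon-neighbor KDE estimate of p(s'|s,a) at a single query point."""
    X = np.hstack([S, A[:, None]])
    x = np.append(s_query, a_query)
    neighbors = np.where(np.linalg.norm(X - x, axis=1) <= eps)[0]
    if len(neighbors) == 0:
        return 0.0  # no neighbors within eps: return 0 in this sketch
    d = S_next.shape[1]
    # Gaussian densities N(s'; s'_i, sigma^2 I) averaged over the neighbors
    diffs = S_next[neighbors] - s_prime_query
    log_dens = -0.5 * (diffs ** 2).sum(axis=1) / sigma ** 2 \
               - 0.5 * d * np.log(2 * np.pi * sigma ** 2)
    return np.exp(log_dens).mean()
```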
10.1.3 Least-Squares Conditional Density Estimation
A non-parametric conditional density estimator called least-squares conditional density estimation (LSCDE) (Sugiyama et al., 2010) possesses various useful properties:

• It can directly handle multi-dimensional multi-modal inputs and outputs.

• It was proved to achieve the optimal convergence rate (Kanamori et al., 2012).

• It has high numerical stability (Kanamori et al., 2013).

• It is robust against outliers (Sugiyama et al., 2010).

• Its solution can be computed analytically and efficiently just by solving a system of linear equations (Kanamori et al., 2009).

• Generating samples from the learned transition model is straightforward.
Let us model the transition probability $p(s'|s,a)$ by the following linear-in-parameter model:

$$\alpha^\top \phi(s, a, s'), \qquad (10.1)$$

where $\alpha$ is the $B$-dimensional parameter vector and $\phi(s,a,s')$ is the $B$-dimensional basis function vector. A typical choice of the basis function is the Gaussian kernel, which is defined for $B = M$ as

$$\phi_b(s, a, s') = \exp\left(-\frac{\|s - s_b\|^2 + (a - a_b)^2 + \|s' - s'_b\|^2}{2\kappa^2}\right),$$

where $\kappa > 0$ denotes the Gaussian kernel width. If $B$ is too large, the number of basis functions may be reduced by only using a subset of samples as Gaussian centers. Different Gaussian widths for $s$, $a$, and $s'$ may be used if necessary.
The parameter $\alpha$ is learned so that the following squared error is minimized:

$$J_0(\alpha) = \frac{1}{2} \iiint \left( \alpha^\top \phi(s,a,s') - p(s'|s,a) \right)^2 p(s,a)\, \mathrm{d}s\, \mathrm{d}a\, \mathrm{d}s'$$
$$= \frac{1}{2} \iiint \left( \alpha^\top \phi(s,a,s') \right)^2 p(s,a)\, \mathrm{d}s\, \mathrm{d}a\, \mathrm{d}s' - \iiint \alpha^\top \phi(s,a,s')\, p(s,a,s')\, \mathrm{d}s\, \mathrm{d}a\, \mathrm{d}s' + C,$$

where the identity $p(s'|s,a) = p(s,a,s')/p(s,a)$ is used in the second term and

$$C = \frac{1}{2} \iiint p(s'|s,a)\, p(s,a,s')\, \mathrm{d}s\, \mathrm{d}a\, \mathrm{d}s'.$$

Because $C$ is a constant independent of $\alpha$, only the first two terms will be considered from here on:

$$J(\alpha) = J_0(\alpha) - C = \frac{1}{2} \alpha^\top U \alpha - \alpha^\top v,$$

where $U$ is the $B \times B$ matrix and $v$ is the $B$-dimensional vector defined as

$$U = \iint \Phi(s,a)\, p(s,a)\, \mathrm{d}s\, \mathrm{d}a, \qquad
v = \iiint \phi(s,a,s')\, p(s,a,s')\, \mathrm{d}s\, \mathrm{d}a\, \mathrm{d}s', \qquad
\Phi(s,a) = \int \phi(s,a,s')\, \phi(s,a,s')^\top\, \mathrm{d}s'.$$

Note that, for the Gaussian model (10.1), the $(b, b')$-th element of matrix $\Phi(s,a)$ can be computed analytically as

$$\Phi_{b,b'}(s,a) = (\sqrt{\pi}\kappa)^{\dim(s')} \exp\left(-\frac{\|s'_b - s'_{b'}\|^2}{4\kappa^2}\right) \exp\left(-\frac{\|s - s_b\|^2 + \|s - s_{b'}\|^2 + (a - a_b)^2 + (a - a_{b'})^2}{2\kappa^2}\right).$$
Because $U$ and $v$ included in $J(\alpha)$ contain the expectations over unknown densities $p(s,a)$ and $p(s,a,s')$, they are approximated by sample averages. Then we have

$$\widehat{J}(\alpha) = \frac{1}{2} \alpha^\top \widehat{U} \alpha - \widehat{v}^\top \alpha,$$

where

$$\widehat{U} = \frac{1}{M} \sum_{m=1}^{M} \Phi(s_m, a_m) \quad\text{and}\quad \widehat{v} = \frac{1}{M} \sum_{m=1}^{M} \phi(s_m, a_m, s'_m).$$
By adding an ℓ2-regularizer to $\widehat{J}(\alpha)$ to avoid overfitting, the LSCDE optimization criterion is given as

$$\widetilde{\alpha} = \mathop{\mathrm{argmin}}_{\alpha \in \mathbb{R}^{B}} \left[ \widehat{J}(\alpha) + \frac{\lambda}{2} \|\alpha\|^2 \right],$$

where $\lambda \ge 0$ is the regularization parameter. The solution $\widetilde{\alpha}$ is given analytically as

$$\widetilde{\alpha} = (\widehat{U} + \lambda I)^{-1} \widehat{v},$$

where $I$ denotes the identity matrix. Because conditional probability densities are non-negative by definition, the solution $\widetilde{\alpha}$ is modified as

$$\widehat{\alpha}_b = \max(0, \widetilde{\alpha}_b).$$

Finally, the solution is normalized in the test phase. More specifically, given a test input point $(s, a)$, the final LSCDE solution is given as

$$\widehat{p}(s'|s,a) = \frac{\widehat{\alpha}^\top \phi(s,a,s')}{\int \widehat{\alpha}^\top \phi(s,a,s'')\, \mathrm{d}s''},$$

where, for the Gaussian model (10.1), the denominator can be analytically computed as

$$\int \widehat{\alpha}^\top \phi(s,a,s'')\, \mathrm{d}s'' = (\sqrt{2\pi}\kappa)^{\dim(s')} \sum_{b=1}^{B} \widehat{\alpha}_b \exp\left(-\frac{\|s - s_b\|^2 + (a - a_b)^2}{2\kappa^2}\right).$$

Model selection of the Gaussian width $\kappa$ and the regularization parameter $\lambda$ is possible by cross-validation (Sugiyama et al., 2010).
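To make the estimator concrete, here is a compact NumPy sketch (an illustration under the B = M Gaussian-kernel choice, not the authors' code) that builds the analytic $\widehat{U}$ and $\widehat{v}$, computes $\widetilde{\alpha}$, clips it at zero, and evaluates the normalized conditional density:

```python
import numpy as np

def fit_lscde(S, A, S_next, kappa=1.0, lam=0.1):
    """LSCDE with Gaussian kernels centered on all M samples (B = M)."""
    M, d_next = S_next.shape
    X = np.hstack([S, A[:, None]])                            # inputs (s, a)
    dxx = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)      # ||x_m - x_b||^2
    dyy = ((S_next[:, None, :] - S_next[None, :, :]) ** 2).sum(-1)  # ||s'_b - s'_b'||^2

    # U_hat[b,b'] = (sqrt(pi)k)^d * exp(-dyy/(4k^2)) * mean_m exp(-(dxx[m,b]+dxx[m,b'])/(2k^2))
    E = np.exp(-dxx / (2 * kappa ** 2))                        # E[m, b]
    U_hat = (np.sqrt(np.pi) * kappa) ** d_next \
            * np.exp(-dyy / (4 * kappa ** 2)) * (E.T @ E) / M
    v_hat = (E * np.exp(-dyy / (2 * kappa ** 2))).mean(axis=0)
    alpha = np.maximum(0, np.linalg.solve(U_hat + lam * np.eye(M), v_hat))

    def p_hat(s, a, s_prime):
        """Normalized conditional density estimate p_hat(s'|s,a)."""
        x = np.append(s, a)
        phi_in = np.exp(-((X - x) ** 2).sum(-1) / (2 * kappa ** 2))
        phi = phi_in * np.exp(-((S_next - s_prime) ** 2).sum(-1) / (2 * kappa ** 2))
        Z = (np.sqrt(2 * np.pi) * kappa) ** d_next * (alpha * phi_in).sum()
        return (alpha @ phi) / Z if Z > 0 else 0.0

    return p_hat
```

In practice $\kappa$ and $\lambda$ would be chosen by cross-validation as described above, and a subset of samples would be used as kernel centers when M is large.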
10.2 Model-Based Reinforcement Learning
Model-based reinforcement learning is simply carried out as follows.

1. Collect transition samples $\{(s_m, a_m, s'_m)\}_{m=1}^{M}$.

2. Obtain a transition model estimate $\widehat{p}(s'|s,a)$ from $\{(s_m, a_m, s'_m)\}_{m=1}^{M}$.

3. Run a model-free reinforcement learning method using trajectory samples $\{\widetilde{h}_t\}_{t=1}^{T}$ artificially generated from the estimated transition model $\widehat{p}(s'|s,a)$ and the current policy $\pi(a|s,\theta)$.
Model-based reinforcement learning is particularly advantageous when the sampling cost is limited. More specifically, in model-free methods, we need to fix the sampling schedule in advance, for example, whether many samples are gathered in the beginning or only a small batch of samples is collected for a longer period. However, optimizing the sampling schedule in advance is not possible without strong prior knowledge. Thus, we need to just blindly design the sampling schedule in practice, which can cause significant performance degradation. On the other hand, model-based methods do not suffer from this problem, because we can draw as many trajectory samples as we want from the learned transition model without additional sampling costs.
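The three steps above translate almost directly into code. The sketch below is schematic: `policy_update`, `sample_action`, `sample_next_state`, and `sample_initial_state` are hypothetical stand-ins for whatever policy-search method, learned model, and task are used.

```python
def rollout_from_model(sample_next_state, sample_action, sample_initial_state, T=10):
    """Generate one artificial trajectory [(s, a, s'), ...] from a learned model."""
    s = sample_initial_state()
    trajectory = []
    for _ in range(T):
        a = sample_action(s)                  # current policy pi(a|s, theta)
        s_next = sample_next_state(s, a)      # draw s' from learned p_hat(s'|s, a)
        trajectory.append((s, a, s_next))
        s = s_next
    return trajectory

def model_based_rl(sample_next_state, sample_action, sample_initial_state,
                   policy_update, n_iterations=100, n_trajectories=1000):
    """Repeatedly generate artificial trajectories and update the policy on them."""
    for _ in range(n_iterations):
        trajs = [rollout_from_model(sample_next_state, sample_action,
                                    sample_initial_state)
                 for _ in range(n_trajectories)]
        policy_update(trajs)                  # e.g., a PGPE-style gradient step
```

The key point is that the inner loop never touches the real environment: only step 1 consumes the sampling budget.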
10.3 Numerical Examples
In this section, the experimental performance of the model-free and model-based versions of PGPE (policy gradients with parameter-based exploration) is evaluated:

M-PGPE(LSCDE): The model-based PGPE method with the transition model estimated by LSCDE.

M-PGPE(GP): The model-based PGPE method with the transition model estimated by Gaussian process (GP) regression.

IW-PGPE: The model-free PGPE method with sample reuse by importance weighting (the method introduced in Chapter 9).
10.3.1 Continuous Chain Walk

Let us first consider a simple continuous chain walk task, described in Figure 10.1.
10.3.1.1 Setup

Let

$$s \in \mathcal{S} = [0, 10], \qquad a \in \mathcal{A} = [-5, 5], \qquad r(s, a, s') = \begin{cases} 1 & (4 < s' < 6), \\ 0 & (\text{otherwise}). \end{cases}$$

That is, the agent receives positive reward +1 at the center of the state space.

FIGURE 10.1: Illustration of continuous chain walk (state space [0, 10] with the rewarded region between 4 and 6).

The trajectory length is set at T = 10 and the discount factor is set at γ = 0.99. The following linear-in-parameter policy model is used in both the M-PGPE and IW-PGPE methods:

$$a = \sum_{i=1}^{6} \theta_i \exp\left(-\frac{(s - c_i)^2}{2}\right),$$

where $(c_1, \ldots, c_6) = (0, 2, 4, 6, 8, 10)$. If an action determined by the above policy is out of the action space, it is pulled back to be confined in the domain.
As transition dynamics, the following two scenarios are considered (a small simulator for both is sketched after this list):

Gaussian: The true transition dynamics is given by

$$s_{t+1} = s_t + a_t + \varepsilon_t,$$

where $\varepsilon_t$ is Gaussian noise with mean 0 and standard deviation 0.3.

Bimodal: The true transition dynamics is given by

$$s_{t+1} = s_t \pm a_t + \varepsilon_t,$$

where $\varepsilon_t$ is Gaussian noise with mean 0 and standard deviation 0.3, and the sign of $a_t$ is randomly chosen with probability 1/2.

If the next state is out of the state space, it is projected back to the domain. Below, the budget for data collection is assumed to be limited to N = 20 trajectory samples.
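A minimal simulator of these two dynamics (an illustrative sketch, not the authors' experimental code) looks as follows:

```python
import numpy as np

def chain_walk_step(s, a, bimodal=False, rng=np.random):
    """One transition of the continuous chain walk on S = [0, 10]."""
    sign = rng.choice([-1.0, 1.0]) if bimodal else 1.0   # bimodal: flip a with prob 1/2
    s_next = s + sign * a + rng.normal(0.0, 0.3)          # Gaussian noise, std 0.3
    return np.clip(s_next, 0.0, 10.0)                     # project back into the domain

def reward(s_next):
    """Reward +1 when the next state lies in the center region (4, 6)."""
    return 1.0 if 4.0 < s_next < 6.0 else 0.0
```

The bimodal case is exactly the situation where a conditional density over $s'$ has two modes, which a Gaussian predictive model cannot represent.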
10.3.1.2 Comparison of Model Estimators

When the transition model is learned in the M-PGPE methods, all N = 20 trajectory samples are gathered randomly in the beginning at once. More specifically, the initial state $s_1$ and the action $a_1$ are chosen from the uniform distributions over S and A, respectively. Then the next state $s_2$ and the immediate reward $r_1$ are obtained. After that, the action $a_2$ is chosen from the uniform distribution over A, and the next state $s_3$ and the immediate reward $r_2$ are obtained. This process is repeated until $r_T$ is obtained, by which a trajectory sample is obtained. This data generation process is repeated N times to obtain N trajectory samples.
Figure 10.2 and Figure 10.3 illustrate the true transition dynamics and their estimates obtained by LSCDE and GP in the Gaussian and bimodal cases, respectively. Figure 10.2 shows that both LSCDE and GP can learn the entire profile of the true transition dynamics well in the Gaussian case. On the other hand, Figure 10.3 shows that LSCDE can still successfully capture the entire profile of the true transition dynamics well even in the bimodal case, but GP fails to capture the bimodal structure.

FIGURE 10.2: Gaussian transition dynamics and its estimates by LSCDE and GP: (a) true transition, (b) transition estimated by LSCDE, (c) transition estimated by GP.

Based on the estimated transition models, policies are learned by the M-PGPE method. More specifically, from the learned transition model, 1000 artificial trajectory samples are generated for gradient estimation and another 1000 artificial trajectory samples are used for baseline estimation. Then policies are updated based on these artificial trajectory samples. This policy update step is repeated 100 times. For evaluating the return of a learned policy, 100 additional test trajectory samples are used which are not employed for policy learning. Figure 10.4 and Figure 10.5 depict the averages and standard errors of returns over 100 runs for the Gaussian and bimodal cases, respectively. The results show that, in the Gaussian case, the GP-based method performs very well and LSCDE also exhibits reasonable performance. In the bimodal case, on the other hand, GP performs poorly and LSCDE gives much better results than GP. This illustrates the high flexibility of LSCDE.
FIGURE 10.3: Bimodal transition dynamics and its estimates by LSCDE and GP: (a) true transition, (b) transition estimated by LSCDE, (c) transition estimated by GP.
FIGURE 10.4: Averages and standard errors of returns of the policies over 100 runs obtained by M-PGPE with LSCDE, M-PGPE with GP, and IW-PGPE for the Gaussian transition.

FIGURE 10.5: Averages and standard errors of returns of the policies over 100 runs obtained by M-PGPE with LSCDE, M-PGPE with GP, and IW-PGPE for the bimodal transition.
FIGURE 10.6: Averages and standard errors of returns obtained by IW-PGPE over 100 runs for the Gaussian transition with different sampling schedules (e.g., 5×4 means gathering k = 5 trajectory samples 4 times).

FIGURE 10.7: Averages and standard errors of returns obtained by IW-PGPE over 100 runs for the bimodal transition with different sampling schedules (e.g., 5×4 means gathering k = 5 trajectory samples 4 times).
10.3.1.3 Comparison of Model-Based and Model-Free Methods

Next, the performance of the model-based and model-free PGPE methods is compared.

Under the fixed budget scenario, the schedule of collecting 20 trajectory samples needs to be determined for the IW-PGPE method. First, the influence of the choice of sampling schedules is illustrated. Figure 10.6 and Figure 10.7 show expected returns averaged over 100 runs under the sampling schedule that a batch of k trajectory samples is gathered 20/k times, for different values of k. Here, policy update is performed 100 times after observing each batch of k trajectory samples, because this performed better than the usual scheme of updating the policy only once. Figure 10.6 shows that the performance of IW-PGPE depends heavily on the sampling schedule, and gathering k = 20 trajectory samples at once is shown to be the best choice in the Gaussian case. Figure 10.7 shows that gathering k = 20 trajectory samples at once is also the best choice in the bimodal case.

Although the best sampling schedule is not accessible in practice, the optimal sampling schedule is used for evaluating the performance of IW-PGPE. Figure 10.4 and Figure 10.5 show the averages and standard errors of returns obtained by IW-PGPE over 100 runs as functions of the sampling steps. These graphs show that IW-PGPE can improve the policies only in the beginning, because all trajectory samples are gathered at once in the beginning. The performance of IW-PGPE might be further improved if it were possible to gather more trajectory samples; however, this is prohibited under the fixed budget scenario. On the other hand, returns of M-PGPE keep increasing over iterations, because artificial trajectory samples can keep being generated without additional sampling costs. This illustrates a potential advantage of model-based reinforcement learning (RL) methods.
10.3.2 Humanoid Robot Control

Finally, the performance of M-PGPE is evaluated on a practical control problem of a simulated upper-body model of the humanoid robot CB-i (Cheng et al., 2007), which was also used in Section 9.3.3; see Figure 9.5 for illustrations of CB-i and its simulator.
10.3.2.1 Setup

The simulator is based on the upper body of the CB-i humanoid robot, which has 9 joints: shoulder pitch, shoulder roll, and elbow pitch of the right arm; shoulder pitch, shoulder roll, and elbow pitch of the left arm; waist yaw; torso roll; and torso pitch. The state vector is 18-dimensional and real-valued, which corresponds to the current angle in degrees and the current angular velocity for each joint. The action vector is 9-dimensional and real-valued, which corresponds to the target angle of each joint in degrees. The goal of the control problem is to lead the end effector of the right arm (right hand) to the target object. A noisy control system is simulated by perturbing action vectors with independent bimodal Gaussian noise. More specifically, for each element of the action vector, Gaussian noise with mean 0 and standard deviation 3 is added with probability 0.6, and Gaussian noise with mean −5 and standard deviation 3 is added with probability 0.4.

The initial posture of the robot is fixed to standing up straight with arms down. The target object is located in front of and above the right hand, which is reachable by using the controllable joints. The reward function at each time step is defined as

$$r_t = \exp(-10 d_t) - 0.000005 \min\{c_t,\, 1{,}000{,}000\},$$

where $d_t$ is the distance between the right hand and the target object at time step $t$, and $c_t$ is the sum of control costs for each joint. The deterministic policy model used in M-PGPE and IW-PGPE is defined as $a = \theta^\top \phi(s)$ with the basis function $\phi(s) = s$. The trajectory length is set at T = 100 and the discount factor is set at γ = 0.9.
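As a small illustration, the per-step reward above translates directly into code (a sketch, with d_t and c_t supplied by the simulator):

```python
import numpy as np

def step_reward(d_t, c_t):
    """Distance term minus a capped control-cost penalty."""
    return np.exp(-10.0 * d_t) - 0.000005 * min(c_t, 1_000_000)
```

The cap on the control-cost term keeps the penalty bounded even when the summed control costs become very large.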
10.3.2.2 Experiment with 2 Joints

First, we consider using only 2 joints among the 9 joints, i.e., only the right shoulder pitch and right elbow pitch are allowed to be controlled, while the other joints remain still at each time step (no control signal is sent to these joints). Therefore, the dimensionalities of the state vector s and action vector a are 4 and 2, respectively.

We suppose that the budget for data collection is limited to N = 50 trajectory samples. For the M-PGPE methods, all trajectory samples are collected at first using uniformly random initial states and policy. More specifically, the initial state is chosen from the uniform distribution over S. At each time step, the action $a_i$ of the i-th joint is first drawn from the uniform distribution on $[s_i - 5, s_i + 5]$, where $s_i$ denotes the state for the i-th joint. In total, 5000 transition samples are collected for model estimation. Then, from the learned transition model, 1000 artificial trajectory samples are generated for gradient estimation and another 1000 artificial trajectory samples are generated for baseline estimation in each iteration. The sampling schedule of the IW-PGPE method is chosen to collect k = 5 trajectory samples 50/k times, which performs well, as shown in Figure 10.8. The average and standard error of the return obtained by each method over 10 runs are plotted in Figure 10.9, showing that M-PGPE(LSCDE) tends to outperform both M-PGPE(GP) and IW-PGPE.

Figure 10.10 illustrates an example of the reaching motion with 2 joints obtained by M-PGPE(LSCDE) at the 60th iteration. This shows that the learned policy successfully leads the right hand to the target object within only 13 steps in this noisy control system.
10.3.2.3 Experiment with 9 Joints

Finally, the performance of M-PGPE(LSCDE) and IW-PGPE is evaluated on the reaching task with all 9 joints.

The experimental setup is essentially the same as the 2-joint case, but a budget of N = 1000 trajectory samples is given to this complex and high-dimensional task. The position of the target object is moved to the far left, which is not reachable by using only 2 joints. Thus, the robot is required to move other joints to reach the object with the right hand. Five thousand randomly chosen transition samples are used as Gaussian centers for M-PGPE(LSCDE). The sampling schedule for IW-PGPE is set at gathering 1000 trajectory samples at once, which is the best sampling schedule according to Figure 10.11. The averages and standard errors of returns obtained by M-PGPE(LSCDE) and IW-PGPE over 30 runs are plotted in Figure 10.12, showing that M-PGPE(LSCDE) tends to outperform IW-PGPE.

Figure 10.13 exhibits a typical reaching motion with 9 joints obtained by M-PGPE(LSCDE) at the 1000th iteration. This shows that the right hand is led to the distant object successfully within 14 steps.
FIGURE 10.8: Averages and standard errors of returns obtained by IW-PGPE over 10 runs for the 2-joint humanoid robot simulator for different sampling schedules (e.g., 5×10 means gathering k = 5 trajectory samples 10 times).
FIGURE 10.9: Averages and standard errors of obtained returns over 10 runs for the 2-joint humanoid robot simulator. All methods use 50 trajectory samples for policy learning. In M-PGPE(LSCDE) and M-PGPE(GP), all 50 trajectory samples are gathered in the beginning and the environment model is learned; then 2000 artificial trajectory samples are generated in each update iteration. In IW-PGPE, a batch of 5 trajectory samples is gathered for 10 iterations, which was shown to be the best sampling scheduling (see Figure 10.8). Note that policy update is performed 100 times after observing each batch of trajectory samples, which we confirmed to perform well. The bottom horizontal axis is for the M-PGPE methods, while the top horizontal axis is for the IW-PGPE method.
FIGURE 10.10: Example of arm reaching with 2 joints using a policy obtained by M-PGPE(LSCDE) at the 60th iteration (from left to right and top to bottom).
FIGURE 10.11: Averages and standard errors of returns obtained by IW-PGPE over 30 runs for the 9-joint humanoid robot simulator for different sampling schedules (e.g., 100×10 means gathering k = 100 trajectory samples 10 times).
FIGURE 10.12: Averages and standard errors of obtained returns over 30 runs for the humanoid robot simulator with 9 joints. Both methods use 1000 trajectory samples for policy learning. In M-PGPE(LSCDE), all 1000 trajectory samples are gathered in the beginning and the environment model is learned; then 2000 artificial trajectory samples are generated in each update iteration. In IW-PGPE, a batch of 1000 trajectory samples is gathered at once, which was shown to be the best scheduling (see Figure 10.11). Note that policy update is performed 100 times after observing each batch of trajectory samples. The bottom horizontal axis is for the M-PGPE method, while the top horizontal axis is for the IW-PGPE method.
FIGURE 10.13: Example of arm reaching with 9 joints using a policy obtained by M-PGPE(LSCDE) at the 1000th iteration (from left to right and top to bottom).
10.4 Remarks
Model-based reinforcement learning is a promising approach, given that the transition model can be estimated accurately. However, estimating the high-dimensional conditional density is challenging. In this chapter, a non-parametric conditional density estimator called least-squares conditional density estimation (LSCDE) was introduced, and model-based PGPE with LSCDE was shown to work excellently in experiments.

Under a fixed sampling budget, the model-free approach requires us to design the sampling schedule appropriately in advance. However, this is practically very hard unless strong prior knowledge is available. On the other hand, model-based methods do not suffer from this problem, which is an excellent practical advantage over the model-free approach.

In robotics, the model-free approach seems to be preferred because accurately learning the transition dynamics of complex robots is challenging (Deisenroth et al., 2013). Furthermore, model-free methods can utilize prior knowledge in the form of policy demonstration (Kober & Peters, 2011). On the other hand, the model-based approach is advantageous in that no interaction with the real robot is required once the transition model has been learned, and the learned transition model can be utilized for further simulation.

Actually, the choice of model-free or model-based methods is not only an ongoing research topic in machine learning, but also a big debatable issue in neuroscience. Therefore, further discussion would be necessary to more deeply understand the pros and cons of the model-based and model-free approaches. Combining or switching the model-free and model-based approaches would also be an interesting direction to be further investigated.
Chapter 11

Dimensionality Reduction for Transition Model Estimation
Least-squares conditional density estimation (LSCDE), introduced in Chapter 10, is a practical transition model estimator. However, transition model estimation is still challenging when the dimensionality of state and action spaces is high. In this chapter, a dimensionality reduction method is introduced to LSCDE which finds a low-dimensional expression of the original state and action vector that is relevant to predicting the next state. After mathematically formulating the problem of dimensionality reduction in Section 11.1, a detailed description of the dimensionality reduction algorithm based on squared-loss conditional entropy is provided in Section 11.2. Then numerical examples are given in Section 11.3, and this chapter is concluded in Section 11.4.
11.1 Sufficient Dimensionality Reduction
Sufficient dimensionality reduction (Li, 1991; Cook & Ni, 2005) is a framework of dimensionality reduction in a supervised learning setting of analyzing an input-output relation; in our case, the input is the state-action pair $(s, a)$ and the output is the next state $s'$. Sufficient dimensionality reduction is aimed at finding a low-dimensional expression $z$ of input $(s, a)$ that contains "sufficient" information about output $s'$.

Let $z$ be a linear projection of input $(s, a)$. More specifically, using a matrix $W$ such that $WW^\top = I$, where $I$ denotes the identity matrix, $z$ is given by

$$z = W \begin{pmatrix} s \\ a \end{pmatrix}.$$

The goal of sufficient dimensionality reduction is, from independent transition samples $\{(s_m, a_m, s'_m)\}_{m=1}^{M}$, to find $W$ such that $s'$ and $(s, a)$ are conditionally independent given $z$. This conditional independence means that $z$ contains all information about $s'$ and is equivalently expressed as

$$p(s'|s,a) = p(s'|z). \qquad (11.1)$$
11.2 Squared-Loss Conditional Entropy

In this section, the dimensionality reduction method based on the squared-loss conditional entropy (SCE) is introduced.
11.2.1 Conditional Independence
SCE is defined and expressed as

$$\mathrm{SCE}(s'|z) = -\frac{1}{2} \iint p(s'|z)\, p(s',z)\, \mathrm{d}z\, \mathrm{d}s'
= -\frac{1}{2} \iint \left( p(s'|z) - 1 \right)^2 p(z)\, \mathrm{d}z\, \mathrm{d}s' - 1 + \frac{1}{2} \int \mathrm{d}s'.$$

It was shown in Tangkaratt et al. (2015) that

$$\mathrm{SCE}(s'|z) \ge \mathrm{SCE}(s'|s,a),$$

and the equality holds if and only if Eq. (11.1) holds. Thus, sufficient dimensionality reduction can be performed by minimizing $\mathrm{SCE}(s'|z)$ with respect to $W$:

$$W^* = \mathop{\mathrm{argmin}}_{W \in \mathbb{G}} \mathrm{SCE}(s'|z).$$

Here, $\mathbb{G}$ denotes the Grassmann manifold, which is the set of matrices $W$ such that $WW^\top = I$ without redundancy in terms of the span.
Since SCE contains the unknown densities $p(s'|z)$ and $p(s',z)$, it cannot be directly computed. Here, let us employ the LSCDE method introduced in Chapter 10 to obtain an estimator $\widehat{p}(s'|z)$ of the conditional density $p(s'|z)$. Then, by replacing the expectation over $p(s',z)$ with the sample average, SCE can be approximated as

$$\widehat{\mathrm{SCE}}(s'|z) = -\frac{1}{2M} \sum_{m=1}^{M} \widehat{p}(s'_m|z_m) = -\frac{1}{2} \widetilde{\alpha}^\top \widehat{v},$$

where

$$z_m = W \begin{pmatrix} s_m \\ a_m \end{pmatrix} \quad\text{and}\quad \widehat{v} = \frac{1}{M} \sum_{m=1}^{M} \phi(z_m, s'_m).$$

$\phi(z, s')$ is the basis function vector used in LSCDE, given by

$$\phi_b(z, s') = \exp\left(-\frac{\|z - z_b\|^2 + \|s' - s'_b\|^2}{2\kappa^2}\right),$$
where $\kappa > 0$ denotes the Gaussian kernel width. $\widetilde{\alpha}$ is the LSCDE solution given by

$$\widetilde{\alpha} = (\widehat{U} + \lambda I)^{-1} \widehat{v},$$

where $\lambda \ge 0$ is the regularization parameter and

$$\widehat{U}_{b,b'} = \frac{(\sqrt{\pi}\kappa)^{\dim(s')}}{M} \exp\left(-\frac{\|s'_b - s'_{b'}\|^2}{4\kappa^2}\right) \sum_{m=1}^{M} \exp\left(-\frac{\|z_m - z_b\|^2 + \|z_m - z_{b'}\|^2}{2\kappa^2}\right).$$
11.2.2 Dimensionality Reduction with SCE
With the above SCE estimator, a practical formulation for sufficient dimensionality reduction is given by

$$\widehat{W} = \mathop{\mathrm{argmax}}_{W \in \mathbb{G}} S(W), \quad\text{where}\quad S(W) = \widetilde{\alpha}^\top \widehat{v}.$$

The gradient of $S(W)$ with respect to $W_{\ell,\ell'}$ is given by

$$\frac{\partial S}{\partial W_{\ell,\ell'}} = -\widetilde{\alpha}^\top \frac{\partial \widehat{U}}{\partial W_{\ell,\ell'}} \widetilde{\alpha} + 2 \frac{\partial \widehat{v}^\top}{\partial W_{\ell,\ell'}} \widetilde{\alpha}.$$

In the Euclidean space, the above gradient gives the steepest direction (see also Section 7.3.1). However, on the Grassmann manifold, the natural gradient (Amari, 1998) gives the steepest direction. The natural gradient at $W$ is the projection of the ordinary gradient to the tangent space of the Grassmann manifold. If the tangent space is equipped with the canonical metric $\langle W, W' \rangle = \frac{1}{2} \mathrm{tr}(W^\top W')$, the natural gradient at $W$ is given as follows (Edelman et al., 1998):

$$\frac{\partial S}{\partial W} W_\perp^\top W_\perp,$$

where $W_\perp$ is the matrix such that $(W^\top, W_\perp^\top)$ is an orthogonal matrix. The geodesic from $W$ to the direction of the natural gradient over the Grassmann manifold can be expressed using $t \in \mathbb{R}$ as

$$W_t = \begin{pmatrix} I & O \end{pmatrix} \exp\left( -t \begin{pmatrix} O & \dfrac{\partial S}{\partial W} W_\perp^\top \\ -W_\perp \dfrac{\partial S}{\partial W}^\top & O \end{pmatrix} \right) \begin{pmatrix} W \\ W_\perp \end{pmatrix},$$

where "exp" for a matrix denotes the matrix exponential and $O$ denotes the zero matrix. Then line search along the geodesic in the natural gradient direction is performed by finding the maximizer from $\{W_t \mid t \ge 0\}$ (Edelman et al., 1998).
Once $W$ is updated by the natural gradient method, SCE is re-estimated for the new $W$ and natural gradient ascent is performed again. This entire procedure is repeated until $W$ converges, and the final solution is given by

$$\widehat{p}(s'|z) = \frac{\widehat{\alpha}^\top \phi(z, s')}{\int \widehat{\alpha}^\top \phi(z, s'')\, \mathrm{d}s''},$$

where $\widehat{\alpha}_b = \max(0, \widetilde{\alpha}_b)$, and the denominator can be analytically computed as

$$\int \widehat{\alpha}^\top \phi(z, s'')\, \mathrm{d}s'' = (\sqrt{2\pi}\kappa)^{\dim(s')} \sum_{b=1}^{B} \widehat{\alpha}_b \exp\left(-\frac{\|z - z_b\|^2}{2\kappa^2}\right).$$

When SCE is re-estimated, performing cross-validation for LSCDE in every step is computationally expensive. In practice, cross-validation may be performed only once every several gradient updates. Furthermore, to find a better local optimal solution, this gradient ascent procedure may be executed multiple times with randomly chosen initial solutions, and the one achieving the largest objective value is chosen.
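To illustrate how the pieces fit together, the following sketch evaluates S(W) = α̃⊤v̂ for a candidate projection matrix and takes one crude ascent step. It is an assumption-laden illustration rather than the reference implementation: it uses finite-difference gradients instead of the analytic derivative, and a QR re-orthonormalization instead of the geodesic line search described above.

```python
import numpy as np

def sce_objective(W, X, S_next, kappa=1.0, lam=0.1):
    """Evaluate S(W) = alpha_tilde^T v_hat for a projection matrix W (k x (ds+da))."""
    M, d_out = S_next.shape
    Z = X @ W.T                                            # projected inputs z_m
    dzz = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # ||z_m - z_b||^2
    dyy = ((S_next[:, None, :] - S_next[None, :, :]) ** 2).sum(-1)
    E = np.exp(-dzz / (2 * kappa ** 2))
    U_hat = (np.sqrt(np.pi) * kappa) ** d_out \
            * np.exp(-dyy / (4 * kappa ** 2)) * (E.T @ E) / M
    v_hat = (E * np.exp(-dyy / (2 * kappa ** 2))).mean(axis=0)
    alpha = np.linalg.solve(U_hat + lam * np.eye(M), v_hat)
    return alpha @ v_hat                                    # larger value = smaller SCE

def ascent_step(W, X, S_next, step=1e-3, eps=1e-4):
    """One finite-difference ascent step on W, re-orthonormalized by QR."""
    grad = np.zeros_like(W)
    base = sce_objective(W, X, S_next)
    for idx in np.ndindex(*W.shape):
        W_pert = W.copy()
        W_pert[idx] += eps
        grad[idx] = (sce_objective(W_pert, X, S_next) - base) / eps
    Q, _ = np.linalg.qr((W + step * grad).T)                # keep W W^T = I
    return Q.T[:W.shape[0]]
```

The QR step is only a cheap surrogate for the geodesic update; the natural-gradient line search above stays exactly on the Grassmann manifold and typically converges in fewer iterations.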
11.2.3 Relation to Squared-Loss Mutual Information
The above dimensionality reduction method minimizes SCE:

$$\mathrm{SCE}(s'|z) = -\frac{1}{2} \iint \frac{p(z, s')^2}{p(z)}\, \mathrm{d}z\, \mathrm{d}s'.$$

On the other hand, the dimensionality reduction method proposed in Suzuki and Sugiyama (2013) maximizes squared-loss mutual information (SMI):

$$\mathrm{SMI}(z, s') = \frac{1}{2} \iint \frac{p(z, s')^2}{p(z)\, p(s')}\, \mathrm{d}z\, \mathrm{d}s'.$$

Note that SMI can be approximated almost in the same way as SCE by the least-squares method (Suzuki & Sugiyama, 2013). The above equations show that the essential difference between SCE and SMI is whether $p(s')$ is included in the denominator of the density ratio, and SCE is reduced to the negative SMI if $p(s')$ is uniform. However, if $p(s')$ is not uniform, the density-ratio function $p(z,s')/(p(z)p(s'))$ included in SMI may be more fluctuated than $p(z,s')/p(z)$ included in SCE. Since a smoother function can be more accurately estimated from a small number of samples in general (Vapnik, 1998), SCE-based dimensionality reduction is expected to work better than SMI-based dimensionality reduction.
11.3 Numerical Examples

In this section, the experimental behavior of the SCE-based dimensionality reduction method is illustrated.
11.3.1 Artificial and Benchmark Datasets

The following dimensionality reduction schemes are compared:

• None: No dimensionality reduction is performed.

• SCE (Section 11.2): Dimensionality reduction is performed by minimizing the least-squares SCE approximator using natural gradients over the Grassmann manifold (Tangkaratt et al., 2015).

• SMI (Section 11.2.3): Dimensionality reduction is performed by maximizing the least-squares SMI approximator using natural gradients over the Grassmann manifold (Suzuki & Sugiyama, 2013).

• True: The "true" subspace is used (only for artificial datasets).

After dimensionality reduction, the following conditional density estimators are run:

• LSCDE (Section 10.1.3): Least-squares conditional density estimation (Sugiyama et al., 2010).

• εKDE (Section 10.1.2): ε-neighbor kernel density estimation, where ε is chosen by least-squares cross-validation.
First, the behavior of SCE-LSCDE is compared with plain LSCDE with no dimensionality reduction. The datasets have 5-dimensional input $x = (x^{(1)}, \ldots, x^{(5)})^\top$ and 1-dimensional output $y$. Among the 5 dimensions of $x$, only the first dimension $x^{(1)}$ is relevant to predicting the output $y$, and the other 4 dimensions $x^{(2)}, \ldots, x^{(5)}$ are just standard Gaussian noise. Figure 11.1 plots the first dimension of input and output of the samples in the datasets and the conditional density estimation results. The graphs show that plain LSCDE does not perform well due to the irrelevant noise dimensions in input, while SCE-LSCDE gives much better estimates.

Next, artificial datasets with 5-dimensional input $x = (x^{(1)}, \ldots, x^{(5)})^\top$ and 1-dimensional output $y$ are used. Each element of $x$ follows the standard Gaussian distribution and $y$ is given by

(a) $y = x^{(1)} + (x^{(1)})^2 + (x^{(1)})^3 + \varepsilon$,

(b) $y = (x^{(1)})^2 + (x^{(2)})^2 + \varepsilon$,
FIGURE 11.1: Examples of conditional density estimation by plain LSCDE and SCE-LSCDE: (a) bone mineral density, (b) Old Faithful geyser (each panel shows the samples together with the plain-LSCDE and SCE-LSCDE estimates).
where $\varepsilon$ is Gaussian noise with mean zero and standard deviation 1/4.

The top row of Figure 11.2 shows the dimensionality reduction error between the true $W^*$ and its estimate $\widehat{W}$ for different sample sizes $n$, measured by

$$\mathrm{Error}_{\mathrm{DR}} = \left\| \widehat{W}^\top \widehat{W} - W^{*\top} W^* \right\|_{\mathrm{Frobenius}},$$

where $\|\cdot\|_{\mathrm{Frobenius}}$ denotes the Frobenius norm. The SMI-based and SCE-based dimensionality reduction methods both perform similarly for dataset (a), while the SCE-based method clearly outperforms the SMI-based method for dataset (b). The histograms of $\{y_i\}_{i=1}^{400}$ plotted in the 2nd row of Figure 11.2 show that the profile of the histogram (which is a sample approximation of $p(y)$) in dataset (b) is much sharper than that in dataset (a). As explained in Section 11.2.3, the density-ratio function used in SMI contains $p(y)$ in the denominator. Therefore, it would be highly non-smooth and thus is hard to approximate. On the other hand, the density-ratio function used in SCE does not contain $p(y)$. Therefore, it would be smoother than the one used in SMI and thus is easier to approximate.

The 3rd and 4th rows of Figure 11.2 plot the conditional density estimation error between the true $p(y|x)$ and its estimate $\widehat{p}(y|x)$, evaluated by the squared loss (without a constant):

$$\mathrm{Error}_{\mathrm{CDE}} = \frac{1}{2n'} \sum_{i=1}^{n'} \int \widehat{p}(y|\tilde{x}_i)^2\, \mathrm{d}y - \frac{1}{n'} \sum_{i=1}^{n'} \widehat{p}(\tilde{y}_i|\tilde{x}_i),$$

where $\{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^{n'}$ is a set of test samples that have not been used for conditional density estimation. We set $n' = 1000$. The graphs show that LSCDE overall outperforms εKDE for both datasets. For dataset (a), SMI-LSCDE and SCE-LSCDE perform equally well, and are much better than
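Both error measures are easy to compute once the estimators are available; below is a small sketch (illustrative only, with `p_hat` standing for any fitted conditional density estimator and a Monte Carlo grid approximating the integral over y):

```python
import numpy as np

def error_dr(W_hat, W_star):
    """Dimensionality reduction error ||W_hat^T W_hat - W_star^T W_star||_Frobenius."""
    return np.linalg.norm(W_hat.T @ W_hat - W_star.T @ W_star, ord='fro')

def error_cde(p_hat, X_test, y_test, y_grid):
    """Squared-loss conditional density estimation error (up to a constant).

    p_hat(x, y) returns the estimated density; y_grid is an equispaced grid
    used to approximate the integral over y.
    """
    dy = y_grid[1] - y_grid[0]
    sq_term = np.mean([sum(p_hat(x, y) ** 2 for y in y_grid) * dy for x in X_test])
    fit_term = np.mean([p_hat(x, y) for x, y in zip(X_test, y_test)])
    return 0.5 * sq_term - fit_term
```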
FIGURE 11.2: Top row: the mean and standard error of the dimensionality reduction error over 20 runs on the artificial datasets. 2nd row: histograms of output $\{y_i\}_{i=1}^{400}$. 3rd and 4th rows: the mean and standard error of the conditional density estimation error over 20 runs.
plain LSCDE with no dimensionality reduction (LSCDE) and comparable to LSCDE with the true subspace (LSCDE*). For dataset (b), SCE-LSCDE outperforms SMI-LSCDE and LSCDE and is comparable to LSCDE*.

Next, the UCI benchmark datasets (Bache & Lichman, 2013) are used for performance evaluation. n samples are selected randomly from each dataset for conditional density estimation, and the rest of the samples are used to measure the conditional density estimation error. Since the dimensionality of z is unknown for the benchmark datasets, it was determined by cross-validation. The results are summarized in Table 11.1, showing that SCE-LSCDE works well overall. Table 11.2 describes the dimensionalities selected by cross-validation, showing that both the SCE-based and SMI-based methods reduce the dimensionality significantly.
11.3.2 Humanoid Robot

Finally, SCE-LSCDE is applied to transition estimation of a humanoid robot. We use a simulator of the upper-body part of the humanoid robot CB-i (Cheng et al., 2007) (see Figure 9.5).
The robot has 9 controllable joints: shoulder pitch, shoulder roll, and elbow pitch of the right arm; shoulder pitch, shoulder roll, and elbow pitch of the left arm; and the waist yaw, torso roll, and torso pitch joints. The posture of the robot is described by 18-dimensional real-valued state vectors, which correspond to the angle and angular velocity of each joint in radians and radians-per-second, respectively. The robot is controlled by sending an action command a to the system. The action command a is a 9-dimensional real-valued vector, which corresponds to the target angle of each joint. When the robot is currently at state s and receives action a, the physical control system of the simulator calculates the amount of torque to be applied to each joint (see Section 9.3.3 for details).

In the experiment, the action vector a is randomly chosen and a noisy control system is simulated by adding a bimodal Gaussian noise vector. More specifically, the action $a_i$ of the i-th joint is first drawn from the uniform distribution on $[s_i - 0.087, s_i + 0.087]$, where $s_i$ denotes the state for the i-th joint. The drawn action is then contaminated by Gaussian noise with mean 0 and standard deviation 0.034 with probability 0.6, and Gaussian noise with mean −0.087 and standard deviation 0.034 with probability 0.4. By repeatedly controlling the robot M times, transition samples $\{(s_m, a_m, s'_m)\}_{m=1}^{M}$ are obtained. Our goal is to learn the system dynamics as a state transition probability $p(s'|s,a)$ from these samples.

The following three scenarios are considered: using only 2 joints (right shoulder pitch and right elbow pitch), only 4 joints (in addition, right shoulder roll and waist yaw), and all 9 joints. These setups correspond to 6-dimensional input and 4-dimensional output in the 2-joint case, 12-dimensional input and 8-dimensional output in the 4-joint case, and 27-dimensional input and 18-dimensional output in the 9-joint case. Five hundred, 1000, and 1500 transition
TABLE 11.1: Mean and standard error of the conditional density estimation error over 10 runs for the benchmark and robot transition datasets (smaller is better; the best method and methods comparable to it by the t-test at the 5% significance level are indicated for each dataset).
TABLE 11.2: Mean and standard error of the chosen subspace dimensionality over 10 runs for benchmark and robot transition datasets.

                            SCE-based                  SMI-based
Dataset        (dx, dy)     LSCDE        ǫKDE          LSCDE        ǫKDE
Housing        (13, 1)      3.9 (0.74)   2.0 (0.79)    2.0 (0.39)   1.3 (0.15)
AutoMPG        (7, 1)       3.2 (0.66)   1.3 (0.15)    2.1 (0.67)   1.1 (0.10)
Servo          (4, 1)       1.9 (0.35)   2.4 (0.40)    2.2 (0.33)   1.6 (0.31)
Yacht          (6, 1)       1.0 (0.00)   1.0 (0.00)    1.0 (0.00)   1.0 (0.00)
Physicochem    (9, 1)       6.5 (0.58)   1.9 (0.28)    6.6 (0.58)   2.6 (0.86)
WhiteWine      (11, 1)      1.2 (0.13)   1.0 (0.00)    1.4 (0.31)   1.0 (0.00)
RedWine        (11, 1)      1.0 (0.00)   1.3 (0.15)    1.2 (0.20)   1.0 (0.00)
ForestFires    (12, 1)      1.2 (0.20)   4.9 (0.99)    1.4 (0.22)   6.8 (1.23)
Concrete       (8, 1)       1.0 (0.00)   1.0 (0.00)    1.2 (0.13)   1.0 (0.00)
Energy         (8, 2)       5.9 (0.10)   3.9 (0.80)    2.1 (0.10)   2.0 (0.30)
Stock          (7, 2)       3.2 (0.83)   2.1 (0.59)    2.1 (0.60)   2.7 (0.67)
2Joints        (6, 4)       2.9 (0.31)   2.7 (0.21)    2.5 (0.31)   2.0 (0.00)
4Joints        (12, 8)      5.2 (0.68)   6.2 (0.63)    5.4 (0.67)   4.6 (0.43)
9Joints        (27, 18)     13.8 (1.28)  15.3 (0.94)   11.4 (0.75)  13.2 (1.02)
Samples are generated for the 2-joint, 4-joint, and 9-joint cases, and randomly chosen subsets of n = 100, 200, and 500 samples, respectively, are used for conditional density estimation; the rest is used for evaluating the test error. The results are summarized in Table 11.1, showing that SCE-LSCDE performs well in all three cases. Table 11.2 describes the dimensionalities selected by cross-validation. This shows that the dimensionalities are much reduced, implying that the transition of the humanoid robot is highly redundant.
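The split-and-cross-validate protocol described above can be summarized in a short sketch. The Python snippet below is only an illustration, not the code used in the chapter: the conditional density estimator is a plain Nadaraya-Watson kernel estimator rather than LSCDE or ǫKDE, candidate subspaces are taken from PCA instead of being optimized, the selection score is held-out log-likelihood instead of the squared-loss criterion, and all function names, bandwidths, and fold counts are hypothetical choices.

```python
# Minimal sketch (illustrative assumptions only) of choosing the subspace
# dimensionality by cross-validation for conditional density estimation.
import numpy as np

def nw_conditional_density(Z_tr, y_tr, Z_te, y_te, bw=0.5):
    """p(y|z) estimated by a Gaussian Nadaraya-Watson estimator (scalar y)."""
    d2_z = ((Z_te[:, None, :] - Z_tr[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2_z / (2 * bw ** 2))                       # kernel weights in z
    k_y = np.exp(-(y_te[:, None] - y_tr[None, :]) ** 2 / (2 * bw ** 2))
    k_y /= np.sqrt(2 * np.pi) * bw                          # normalized kernel in y
    return (w * k_y).sum(axis=1) / (w.sum(axis=1) + 1e-12)  # p_hat(y_te | z_te)

def choose_dimensionality(X, y, dims, n_folds=5, seed=0):
    """Pick the projection dimensionality with the best cross-validation score.
    Candidate subspaces come from PCA of x for illustration only; the chapter
    optimizes the projection jointly with the density estimate."""
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = {}
    for d in dims:
        Z = Xc @ Vt[:d].T                                   # d-dimensional projection
        ll = 0.0
        for k in range(n_folds):
            te = folds[k]
            tr = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            p = nw_conditional_density(Z[tr], y[tr], Z[te], y[te])
            ll += np.log(p + 1e-12).sum()                   # held-out log-likelihood
        scores[d] = ll / len(X)
    return max(scores, key=scores.get), scores

# Toy usage: y depends on only one direction of a 5-dimensional x,
# so a small subspace dimensionality should be preferred.
rng = np.random.RandomState(1)
X = rng.randn(300, 5)
y = np.sin(X[:, 0]) + 0.1 * rng.randn(300)
best_d, scores = choose_dimensionality(X, y, dims=[1, 2, 3, 4, 5])
print(best_d, scores)
```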
11.4 Remarks

Coping with the high dimensionality of the state and action spaces is one of the most important challenges in model-based reinforcement learning. In this chapter, a dimensionality reduction method for conditional density estimation was introduced. The key idea was to use the squared-loss conditional entropy (SCE) for dimensionality reduction, which can be estimated by least-squares conditional density estimation. This allowed us to perform dimensionality reduction and conditional density estimation simultaneously in an integrated manner. In contrast, dimensionality reduction based on squared-loss mutual information (SMI) yields a two-step procedure of first reducing the dimensionality and then estimating the conditional density. SCE-based dimensionality reduction was shown to outperform the SMI-based method, particularly when the output follows a skewed distribution.
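As a concrete illustration of this integrated scheme, the following minimal sketch (in Python, not the book's implementation) fits a Gaussian-kernel LSCDE model for a scalar output, reads off an SCE estimate from the minimized empirical squared-loss objective, and searches over random orthonormal projections for the one minimizing the estimated SCE. The bandwidth, regularization parameter, number of kernel centers, and the crude random search over projections (the chapter instead optimizes the projection by gradient descent) are all illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions only): Gaussian-kernel LSCDE for a
# scalar output, an SCE estimate taken from its minimized objective, and a
# crude random search for the projection W minimizing the estimated SCE.
import numpy as np

def lscde_sce(Z, y, sigma=0.5, lam=0.1, n_centers=50, seed=0):
    """Fit LSCDE with Gaussian kernels and return (alpha, centers, SCE estimate).

    Model: p_hat(y|z) is proportional to
        sum_l alpha_l * exp(-||z - u_l||^2 / (2 sigma^2)) * exp(-(y - v_l)^2 / (2 sigma^2)).
    The minimized empirical squared-loss objective serves as the SCE estimate.
    """
    n = len(y)
    rng = np.random.RandomState(seed)
    c = rng.choice(n, size=min(n_centers, n), replace=False)
    U, V = Z[c], y[c]                                     # kernel centers (u_l, v_l)
    Phi_z = np.exp(-((Z[:, None, :] - U[None, :, :]) ** 2).sum(2) / (2 * sigma ** 2))
    Phi_y = np.exp(-(y[:, None] - V[None, :]) ** 2 / (2 * sigma ** 2))
    h = (Phi_z * Phi_y).mean(axis=0)                      # h_l = mean_i phi_l(z_i, y_i)
    # H_{ll'} = sqrt(pi) * sigma * exp(-(v_l - v_l')^2 / (4 sigma^2))
    #           * mean_i phi_l(z_i) phi_l'(z_i)   (Gaussian integral over y)
    cross = Phi_z.T @ Phi_z / n
    H = np.sqrt(np.pi) * sigma * np.exp(-(V[:, None] - V[None, :]) ** 2 / (4 * sigma ** 2)) * cross
    alpha = np.linalg.solve(H + lam * np.eye(len(c)), h)  # ridge-regularized solution
    sce = 0.5 * alpha @ H @ alpha - h @ alpha             # empirical SCE estimate
    return alpha, (U, V), sce

def sce_projection_search(X, y, d=1, n_trials=200, seed=0):
    """Crude random search for a d-dimensional projection minimizing the SCE estimate."""
    rng = np.random.RandomState(seed)
    best_W, best_sce = None, np.inf
    for _ in range(n_trials):
        W, _ = np.linalg.qr(rng.randn(X.shape[1], d))     # random orthonormal projection
        _, _, sce = lscde_sce(X @ W, y)
        if sce < best_sce:
            best_W, best_sce = W, sce
    return best_W, best_sce

# Toy usage: only one direction of x is relevant to y, so the search should
# recover a projection close to the first coordinate axis.
rng = np.random.RandomState(1)
X = rng.randn(300, 5)
y = np.sin(2 * X[:, 0]) + 0.1 * rng.randn(300)
W, sce = sce_projection_search(X, y, d=1)
print("estimated SCE:", sce)
print("recovered direction:", W.ravel())
```

The point of the sketch is that the same LSCDE fit yields both the conditional density estimate and the SCE value used to score the projection, so no separate dimensionality reduction step is needed.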
References

Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. Proceedings of International Conference on Machine Learning (pp. 1–8).

Abe, N., Melville, P., Pendus, C., Reddy, C. K., Jensen, D. L., Thomas, V. P., Bennett, J. J., Anderson, G. F., Cooley, B. R., Kowalczyk, M., Domick, M., & Gardinier, T. (2010). Optimizing debt collections using constrained reinforcement learning. Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 75–84).

Amari, S. (1967). Theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, EC-16, 299–307.

Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.

Amari, S., & Nagaoka, H. (2000). Methods of information geometry. Providence, RI, USA: Oxford University Press.

Bache, K., & Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml/

Baxter, J., Bartlett, P., & Weaver, L. (2001). Experiments with infinite-horizon, policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 351–381.

Bishop, C. M. (2006). Pattern recognition and machine learning. New York, NY, USA: Springer.

Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge, UK: Cambridge University Press.

Bradtke, S. J., & Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22, 33–57.

Chapelle, O., Schölkopf, B., & Zien, A. (Eds.). (2006). Semi-supervised learning. Cambridge, MA, USA: MIT Press.

Cheng, G., Hyon, S., Morimoto, J., Ude, A., Joshua, G. H., Colvin, G., Scroggin, W., & Stephen, C. J. (2007). CB: A humanoid research platform for exploring neuroscience. Advanced Robotics, 21, 1097–1114.

Chung, F. R. K. (1997). Spectral graph theory. Providence, RI, USA: American Mathematical Society.

Coifman, R., & Maggioni, M. (2006). Diffusion wavelets. Applied and Computational Harmonic Analysis, 21, 53–94.

Cook, R. D., & Ni, L. (2005). Sufficient dimension reduction via inverse regression. Journal of the American Statistical Association, 100, 410–428.

Dayan, P., & Hinton, G. E. (1997). Using expectation-maximization for reinforcement learning. Neural Computation, 9, 271–278.

Deisenroth, M. P., Neumann, G., & Peters, J. (2013). A survey on policy search for robotics. Foundations and Trends in Robotics, 2, 1–142.

Deisenroth, M. P., & Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. Proceedings of International Conference on Machine Learning (pp. 465–473).

Demiriz, A., Bennett, K. P., & Shawe-Taylor, J. (2002). Linear programming boosting via column generation. Machine Learning, 46, 225–254.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.

Dijkstra, E. W. (1959). A note on two problems in connexion with graphs. Numerische Mathematik, 1, 269–271.

Edelman, A., Arias, T. A., & Smith, S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20, 303–353.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32, 407–499.

Engel, Y., Mannor, S., & Meir, R. (2005). Reinforcement learning with Gaussian processes. Proceedings of International Conference on Machine Learning (pp. 201–208).

Fishman, G. S. (1996). Monte Carlo: Concepts, algorithms, and applications. Berlin, Germany: Springer-Verlag.

Fredman, M. L., & Tarjan, R. E. (1987). Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the ACM, 34, 569–615.

Goldberg, A. V., & Harrelson, C. (2005). Computing the shortest path: A* search meets graph theory. Proceedings of Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 156–165).
Gooch, B., & Gooch, A. (2001). Non-photorealistic rendering. Natick, MA, USA: A. K. Peters Ltd.

Greensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5, 1471–1530.

Guo, Q., & Kunii, T. L. (2003). "Nijimi" rendering algorithm for creating quality black ink paintings. Proceedings of Computer Graphics International (pp. 152–159).

Henkel, R. E. (1976). Tests of significance. Beverly Hills, CA, USA: SAGE Publications.

Hertzmann, A. (1998). Painterly rendering with curved brush strokes of multiple sizes. Proceedings of Annual Conference on Computer Graphics and Interactive Techniques (pp. 453–460).

Hertzmann, A. (2003). A survey of stroke-based rendering. IEEE Computer Graphics and Applications, 23, 70–81.

Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.

Huber, P. J. (1981). Robust statistics. New York, NY, USA: Wiley.

Kakade, S. (2002). A natural policy gradient. Advances in Neural Information Processing Systems 14 (pp. 1531–1538).

Kanamori, T., Hido, S., & Sugiyama, M. (2009). A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10, 1391–1445.

Kanamori, T., Suzuki, T., & Sugiyama, M. (2012). Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86, 335–367.

Kanamori, T., Suzuki, T., & Sugiyama, M. (2013). Computational complexity of kernel-based density-ratio estimation: A condition number analysis. Machine Learning, 90, 431–460.

Kober, J., & Peters, J. (2011). Policy search for motor primitives in robotics. Machine Learning, 84, 171–203.

Koenker, R. (2005). Quantile regression. Cambridge, MA, USA: Cambridge University Press.

Kohonen, T. (1995). Self-organizing maps. Berlin, Germany: Springer.

Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86.

Lagoudakis, M. G., & Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4, 1107–1149.

Li, K. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316–342.

Mahadevan, S. (2005). Proto-value functions: Developmental reinforcement learning. Proceedings of International Conference on Machine Learning (pp. 553–560).

Mangasarian, O. L., & Musicant, D. R. (2000). Robust linear and support vector regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 950–955.

Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., & Tanaka, T. (2010a). Nonparametric return distribution approximation for reinforcement learning. Proceedings of International Conference on Machine Learning (pp. 799–806).

Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., & Tanaka, T. (2010b). Parametric return density estimation for reinforcement learning. Conference on Uncertainty in Artificial Intelligence (pp. 368–375).
Peters, J., & Schaal, S. (2006). Policy gradient methods for robotics. Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 2219–2225).

Peters, J., & Schaal, S. (2007). Reinforcement learning by reward-weighted regression for operational space control. Proceedings of International Conference on Machine Learning (pp. 745–750). Corvallis, Oregon, USA.

Precup, D., Sutton, R. S., & Singh, S. (2000). Eligibility traces for off-policy policy evaluation. Proceedings of International Conference on Machine Learning (pp. 759–766).
Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. Cambridge, MA, USA: MIT Press.

Rockafellar, R. T., & Uryasev, S. (2002). Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26, 1443–1471.

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. New York, NY, USA: Wiley.

Schaal, S. (2009). The SL simulation and real-time control software package (Technical Report). Computer Science and Neuroscience, University of Southern California.

Sehnke, F., Osendorfer, C., Rückstiess, T., Graves, A., Peters, J., & Schmidhuber, J. (2010). Parameter-exploring policy gradients. Neural Networks, 23, 551–559.

Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90, 227–244.

Siciliano, B., & Khatib, O. (Eds.). (2008). Springer handbook of robotics. Berlin, Germany: Springer-Verlag.

Sugimoto, N., Tangkaratt, V., Wensveen, T., Zhao, T., Sugiyama, M., & Morimoto, J. (2014). Efficient reuse of previous experiences in humanoid motor learning. Proceedings of IEEE-RAS International Conference on Humanoid Robots (pp. 554–559).

Sugiyama, M. (2006). Active learning in approximately linear regression based on conditional expectation of generalization error. Journal of Machine Learning Research, 7, 141–166.

Sugiyama, M., Hachiya, H., Towell, C., & Vijayakumar, S. (2008). Geodesic Gaussian kernels for value function approximation. Autonomous Robots, 25, 287–304.

Sugiyama, M., & Kawanabe, M. (2012). Machine learning in non-stationary environments: Introduction to covariate shift adaptation. Cambridge, MA, USA: MIT Press.

Sugiyama, M., Krauledat, M., & Müller, K.-R. (2007). Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8, 985–1005.

Sugiyama, M., Suzuki, T., & Kanamori, T. (2012). Density ratio matching under the Bregman divergence: A unified framework of density ratio estimation. Annals of the Institute of Statistical Mathematics, 64, 1009–1044.

Sugiyama, M., Takeuchi, I., Suzuki, T., Kanamori, T., Hachiya, H., & Okanohara, D. (2010). Least-squares conditional density estimation. IEICE Transactions on Information and Systems, E93-D, 583–594.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA, USA: MIT Press.

Suzuki, T., & Sugiyama, M. (2013). Sufficient dimension reduction via squared-loss mutual information estimation. Neural Computation, 25, 725–758.
Takeda, A. (2007). Support vector machine based on conditional value-at-risk minimization (Technical Report B-439). Department of Mathematical and Computing Sciences, Tokyo Institute of Technology.

Tangkaratt, V., Mori, S., Zhao, T., Morimoto, J., & Sugiyama, M. (2014). Model-based policy gradients with parameter-based exploration by least-squares conditional density estimation. Neural Networks, 57, 128–140.

Tangkaratt, V., Xie, N., & Sugiyama, M. (2015). Conditional density estimation with dimensionality reduction via squared-loss conditional entropy minimization. Neural Computation, 27, 228–254.

Tesauro, G. (1994). TD-gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6, 215–219.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
Tomioka, R., Suzuki, T., & Sugiyama, M. (2011). Super-linear convergence of dual augmented Lagrangian algorithm for sparsity regularized estimation. Journal of Machine Learning Research, 12, 1537–1586.

Vapnik, V. N. (1998). Statistical learning theory. New York, NY, USA: Wiley.

Vesanto, J., Himberg, J., Alhoniemi, E., & Parhankangas, J. (2000). SOM toolbox for Matlab 5 (Technical Report A57). Helsinki University of Technology.

Wahba, G. (1990). Spline models for observational data. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics.

Wang, X., & Dietterich, T. G. (2003). Model-based policy gradient reinforcement learning. Proceedings of International Conference on Machine Learning (pp. 776–783).

Wawrzynski, P. (2009). Real-time reinforcement learning by sequential actor-critics and experience replay. Neural Networks, 22, 1484–1497.

Weaver, L., & Baxter, J. (1999). Reinforcement learning from state and temporal differences (Technical Report). Department of Computer Science, Australian National University.

Weaver, L., & Tao, N. (2001). The optimal reward baseline for gradient-based reinforcement learning. Proceedings of Conference on Uncertainty in Artificial Intelligence (pp. 538–545).

Williams, J. D., & Young, S. J. (2007). Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21, 393–422.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.

Xie, N., Hachiya, H., & Sugiyama, M. (2013). Artist agent: A reinforcement learning approach to automatic stroke generation in oriental ink painting. IEICE Transactions on Information and Systems, E95-D, 1134–1144.

Xie, N., Laga, H., Saito, S., & Nakajima, M. (2011). Contour-driven Sumi-e rendering of real photos. Computers & Graphics, 35, 122–134.

Zhao, T., Hachiya, H., Niu, G., & Sugiyama, M. (2012). Analysis and improvement of policy gradient estimation. Neural Networks, 26, 118–129.

Zhao, T., Hachiya, H., Tangkaratt, V., Morimoto, J., & Sugiyama, M. (2013). Efficient sample reuse in policy gradients with parameter-based exploration. Neural Computation, 25, 1512–1547.
Document Outline
Cover
Contents
Foreword
Preface
Author
Part I: Introduction
  Chapter 1: Introduction to Reinforcement Learning
Part II: Model-Free Policy Iteration
  Chapter 2: Policy Iteration with Value Function Approximation
  Chapter 3: Basis Design for Value Function Approximation
  Chapter 4: Sample Reuse in Policy Iteration
  Chapter 5: Active Learning in Policy Iteration
  Chapter 6: Robust Policy Iteration
Part III: Model-Free Policy Search
  Chapter 7: Direct Policy Search by Gradient Ascent
  Chapter 8: Direct Policy Search by Expectation-Maximization
  Chapter 9: Policy-Prior Search
Part IV: Model-Based Reinforcement Learning
  Chapter 10: Transition Model Estimation
  Chapter 11: Dimensionality Reduction for Transition Model Estimation
References