Reinforcement Learning (CS4641, Fall 2018, Lecture 20: 20_RL1.pdf)
![Page 1: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/1.jpg)
Reinforcement Learning
Robot Image Credit: Viktoriya Sukhanova © 123RF.com
Based on slides by Dan Klein
![Page 2: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/2.jpg)
Reinforcement Learning
§ Basic idea:
  § Receive feedback in the form of rewards
  § Agent’s utility is defined by the reward function
  § Must (learn to) act so as to maximize expected rewards
  § All learning is based on observed samples of outcomes!
[Figure: agent-environment loop. The agent sends actions a to the environment; the environment returns state s and reward r.]
![Page 3: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/3.jpg)
Reinforcement Learning
§ Still assume a Markov decision process (MDP):
  § A set of states s ∈ S
  § A set of actions (per state) A
  § A model T(s,a,s’)
  § A reward function R(s,a,s’)
§ Still looking for a policy π(s)
§ New twist: don’t know T or R
  § I.e. we don’t know which states are good or what the actions do
  § Must actually try actions and states out to learn
![Page 4: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/4.jpg)
Offline (MDPs) vs. Online (RL)
[Figure: two panels, "Offline Solution" and "Online Learning"]
![Page 5: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/5.jpg)
Model-Based Learning
![Page 6: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/6.jpg)
Model-Based Learning
§ Model-Based Idea:
  § Learn an approximate model based on experiences
  § Solve for values as if the learned model were correct
§ Step 1: Learn empirical MDP model
  § Count outcomes s’ for each s, a
  § Normalize to give an estimate of $\hat{T}(s,a,s')$
  § Discover each $\hat{R}(s,a,s')$ when we experience (s, a, s’)
§ Step 2: Solve the learned MDP
  § For example, use value iteration, as before
![Page 7: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/7.jpg)
Example: Model-Based Learning
Assume: γ = 1
Input Policy π (state layout):
    A
  B C D
    E
Observed Episodes (Training):
  Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10
Learned Model:
  T(s,a,s’): T(B, east, C) = 1.00, T(C, east, D) = 0.75, T(C, east, A) = 0.25, …
  R(s,a,s’): R(B, east, C) = -1, R(C, east, D) = -1, R(D, exit, x) = +10, …
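The count-and-normalize step can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `estimate_model` and the tuple encoding `(s, a, s', r)` are assumptions, and averaging the observed rewards per transition is one simple way to "discover" R.

```python
from collections import Counter, defaultdict

# The four training episodes from the slide, encoded as (s, a, s', r) tuples.
EPISODES = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", 10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", 10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", 10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]


def estimate_model(episodes):
    """Return (T_hat, R_hat), both dicts keyed by (s, a, s')."""
    outcome_counts = defaultdict(Counter)  # (s, a) -> counts of each s'
    reward_sums = defaultdict(float)       # (s, a, s') -> summed rewards
    for episode in episodes:
        for s, a, s_next, r in episode:
            outcome_counts[(s, a)][s_next] += 1
            reward_sums[(s, a, s_next)] += r

    T_hat, R_hat = {}, {}
    for (s, a), counts in outcome_counts.items():
        total = sum(counts.values())
        for s_next, n in counts.items():
            # Normalize counts into transition probabilities.
            T_hat[(s, a, s_next)] = n / total
            # Assumption: use the mean observed reward for each transition.
            R_hat[(s, a, s_next)] = reward_sums[(s, a, s_next)] / n
    return T_hat, R_hat


T_hat, R_hat = estimate_model(EPISODES)
print(T_hat[("C", "east", "D")], T_hat[("C", "east", "A")])  # 0.75 0.25
```

C was left by action east four times, three of them ending in D, which is where the 0.75 and 0.25 estimates come from.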
![Page 8: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/8.jpg)
Model-Free Learning
![Page 9: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/9.jpg)
Passive Reinforcement Learning
![Page 10: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/10.jpg)
Passive Reinforcement Learning
§ Simplified task: policy evaluation
  § Input: a fixed policy π(s)
  § You don’t know the transitions T(s,a,s’)
  § You don’t know the rewards R(s,a,s’)
  § Goal: learn the state values
§ In this case:
  § Learner is “along for the ride”
  § No choice about what actions to take
  § Just execute the policy and learn from experience
  § This is NOT offline planning! You actually take actions in the world.
![Page 11: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/11.jpg)
Direct Evaluation
§ Goal: Compute values for each state under π
§ Idea: Average together observed sample values
  § Act according to π
  § Every time you visit a state, write down what the sum of discounted rewards turned out to be
  § Average those samples
§ This is called direct evaluation
![Page 12: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/12.jpg)
Example: Direct Evaluation
Assume: γ = 1
Input Policy π (state layout):
    A
  B C D
    E
Observed Episodes (Training):
  Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 3: E, north, C, -1; C, east, D, -1; D, exit, x, +10
  Episode 4: E, north, C, -1; C, east, A, -1; A, exit, x, -10
Output Values: A = -10, B = +8, C = +4, D = +10, E = -2
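As a rough sketch of direct evaluation on these episodes, the Python below records the return observed from every state visit and averages the samples per state. The function name `direct_evaluation` and the `(s, a, s', r)` tuple encoding are illustrative assumptions, not part of the slides.

```python
from collections import defaultdict

# The four training episodes from the slide, encoded as (s, a, s', r) tuples.
EPISODES = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", 10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", 10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", 10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]


def direct_evaluation(episodes, gamma=1.0):
    """Average the observed discounted return for every state visit."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk backward so G is the discounted sum of rewards from each step on.
        for s, _a, _s_next, r in reversed(episode):
            G = r + gamma * G
            returns[s].append(G)
    return {s: sum(samples) / len(samples) for s, samples in returns.items()}


values = direct_evaluation(EPISODES)
for state in sorted(values):
    print(state, values[state])
```

Under these assumptions the sketch reproduces the slide's output values: E averages one return of +8 and one of -12, giving -2, while B only ever saw +8.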
![Page 13: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/13.jpg)
Problems with Direct Evaluation
§ What’s good about direct evaluation?
  § It’s easy to understand
  § It doesn’t require any knowledge of T, R
  § It eventually computes the correct average values, using just sample transitions
§ What’s bad about it?
  § Each state must be learned separately
  § So, it takes a long time to learn
Output Values: A = -10, B = +8, C = +4, D = +10, E = -2
If B and E both go to C under this policy, how can their values be different?
![Page 14: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/14.jpg)
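Why Not Use Policy Evaluation?
§ Simplified Bellman updates calculate V for a fixed policy:
  § Each round, replace V with a one-step-look-ahead layer over V
$$V_0^{\pi}(s) = 0$$
$$V_{k+1}^{\pi}(s) \leftarrow \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V_k^{\pi}(s') \right]$$
  § This approach fully exploited the connections between the states
  § Unfortunately, we need T and R to do it!
§ Key question: how can we do this update to V without knowing T and R?
  § In other words, how do we take a weighted average without knowing the weights?
[Figure: one-step look-ahead tree from state s, through q-state (s, π(s)) and transition (s, π(s), s’), to successor s’]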
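To see how a Bellman-style update exploits the connections between states, here is a minimal sketch that runs fixed-policy evaluation on the model learned in the earlier model-based example. The dictionary layout and the name `policy_evaluation` are assumptions made for illustration; the entries the slide elides with "…" are filled in from the observed episodes, and the terminal state x is given value 0.

```python
# Learned model under the fixed policy: state -> list of (probability, s', reward).
# Entries for A and E come from the observed episodes, not from the slide's table.
MODEL = {
    "A": [(1.0, "x", -10)],
    "B": [(1.0, "C", -1)],
    "C": [(0.75, "D", -1), (0.25, "A", -1)],
    "D": [(1.0, "x", 10)],
    "E": [(1.0, "C", -1)],
}


def policy_evaluation(model, gamma=1.0, sweeps=50):
    """Repeat V_{k+1}(s) = sum_{s'} T * (R + gamma * V_k(s')) from V_0 = 0."""
    V = {s: 0.0 for s in model}
    for _ in range(sweeps):
        # The comprehension reads the old V throughout, then rebinds V.
        V = {
            s: sum(p * (r + gamma * V.get(s_next, 0.0)) for p, s_next, r in outcomes)
            for s, outcomes in model.items()
        }
    return V


V = policy_evaluation(MODEL)
print(V["B"], V["E"])  # 3.0 3.0
```

Because B and E both lead to C with reward -1, this update gives them the same value, unlike direct evaluation. The catch remains that it needs T and R.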
![Page 15: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/15.jpg)
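Sample-Based Policy Evaluation?
§ We want to improve our estimate of V by computing these averages:
$$V_{k+1}^{\pi}(s) \leftarrow \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V_k^{\pi}(s') \right]$$
§ Idea: Take samples of outcomes s’ (by doing the action!) and average
$$\text{sample}_i = R(s, \pi(s), s'_i) + \gamma V_k^{\pi}(s'_i), \quad i = 1, \dots, n$$
$$V_{k+1}^{\pi}(s) \leftarrow \frac{1}{n} \sum_{i} \text{sample}_i$$
[Figure: look-ahead tree from state s through (s, π(s)) to sampled successors s’_1, s’_2, s’_3]
Almost! But we can’t rewind time to get sample after sample from state s.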
![Page 16: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/16.jpg)
Temporal Difference Learning
![Page 17: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/17.jpg)
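Temporal Difference Learning
§ Big idea: learn from every experience!
  § Update V(s) each time we experience a transition (s, a, s’, r)
  § Likely outcomes s’ will contribute updates more often
§ Temporal difference learning of values
  § Policy still fixed, still doing evaluation!
  § Move values toward value of whatever successor occurs: running average
[Figure: state s, q-state (s, π(s)), and the observed successor s’]
Sample of V(s):
$$\text{sample} = R(s, \pi(s), s') + \gamma V^{\pi}(s')$$
Update to V(s):
$$V^{\pi}(s) \leftarrow (1 - \alpha) V^{\pi}(s) + \alpha \cdot \text{sample}$$
Same update:
$$V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha \left( \text{sample} - V^{\pi}(s) \right)$$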
![Page 18: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/18.jpg)
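Exponential Moving Average
§ Exponential moving average
  § The running interpolation update:
$$\bar{x}_n = (1 - \alpha) \cdot \bar{x}_{n-1} + \alpha \cdot x_n$$
  § Makes recent samples more important:
$$\bar{x}_n = \frac{x_n + (1 - \alpha) \cdot x_{n-1} + (1 - \alpha)^2 \cdot x_{n-2} + \dots}{1 + (1 - \alpha) + (1 - \alpha)^2 + \dots}$$
  § Forgets about the past (distant past values were wrong anyway)
§ Decreasing learning rate (alpha) can give converging averages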
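A small sketch of the running interpolation update, with two hypothetical helper functions. Starting the average at the first sample, and using the schedule alpha = 1/n as the decreasing learning rate, are assumptions chosen for illustration.

```python
def exponential_moving_average(samples, alpha):
    """Apply x_bar_n = (1 - alpha) * x_bar_{n-1} + alpha * x_n over the samples."""
    average = samples[0]  # assumption: initialize with the first sample
    for x in samples[1:]:
        average = (1 - alpha) * average + alpha * x
    return average


def running_mean(samples):
    """Same update with the decreasing rate alpha_n = 1 / n (an ordinary mean)."""
    average = 0.0
    for n, x in enumerate(samples, start=1):
        alpha = 1.0 / n
        average = (1 - alpha) * average + alpha * x
    return average


# The same four samples in two orders: a recent 8 counts for more than an old 8.
print(exponential_moving_average([0, 0, 0, 8], alpha=0.5))  # 4.0
print(exponential_moving_average([8, 0, 0, 0], alpha=0.5))  # 1.0
```

With a fixed alpha the estimate keeps tracking recent samples. With alpha = 1/n every sample ends up weighted equally, so the estimate settles to the plain average.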
![Page 19: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/19.jpg)
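Example: Temporal Difference Learning
Assume: γ = 1, α = 1/2
States (layout):
    A
  B C D
    E
Observed Transitions and resulting values:
  Initial values: A = 0, B = 0, C = 0, D = 8, E = 0
  After B, east, C, -2: A = 0, B = -1, C = 0, D = 8, E = 0
  After C, east, D, -2: A = 0, B = -1, C = 3, D = 8, E = 0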
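The two updates in this example can be traced with a short sketch. The helper name `td_update` and the dictionary representation of V are illustrative assumptions; the numbers are the ones on the slide.

```python
def td_update(V, s, s_next, r, alpha=0.5, gamma=1.0):
    """Move V[s] toward the one-step sample r + gamma * V[s_next]."""
    sample = r + gamma * V[s_next]
    V[s] = (1 - alpha) * V[s] + alpha * sample
    return V[s]


# Initial values from the slide.
V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": 0.0}

print(td_update(V, "B", "C", -2))  # -1.0, from 0.5 * 0 + 0.5 * (-2 + 0)
print(td_update(V, "C", "D", -2))  # 3.0, from 0.5 * 0 + 0.5 * (-2 + 8)
```

Only the state that was just left is updated, so B does not yet reflect C's new value of 3. That would take another visit to B.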
![Page 20: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/20.jpg)
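Problems with TD Value Learning
§ TD value learning is a model-free way to do policy evaluation, mimicking Bellman updates with running sample averages
§ However, if we want to turn values into a (new) policy, we’re sunk:
$$\pi(s) = \arg\max_a Q(s, a)$$
$$Q(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V(s') \right]$$
§ Idea: learn Q-values, not values
  § Makes action selection model-free too!
[Figure: look-ahead tree from state s, through action a and q-state (s, a), via transition (s, a, s’) to successor s’]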
![Page 21: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/21.jpg)
Active Reinforcement Learning
![Page 22: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/22.jpg)
Active Reinforcement Learning
§ Full reinforcement learning: optimal policies (like value iteration)
  § You don’t know the transitions T(s,a,s’)
  § You don’t know the rewards R(s,a,s’)
  § You choose the actions now
  § Goal: learn the optimal policy / values
§ In this case:
  § Learner makes choices!
  § Fundamental tradeoff: exploration vs. exploitation
  § This is NOT offline planning! You actually take actions in the world and find out what happens…
![Page 23: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/23.jpg)
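Detour: Q-Value Iteration
§ Value iteration: find successive (depth-limited) values
  § Start with V0(s) = 0, which we know is right
  § Given Vk, calculate the depth k+1 values for all states:
$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]$$
§ But Q-values are more useful, so compute them instead
  § Start with Q0(s,a) = 0, which we know is right
  § Given Qk, calculate the depth k+1 q-values for all q-states:
$$Q_{k+1}(s, a) \leftarrow \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]$$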
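A minimal sketch of Q-value iteration when T and R are known. The three-transition MDP below is hypothetical (it does not appear in the slides), as are the names `MDP` and `q_value_iteration`; terminal states with no actions are treated as having value 0.

```python
# Hypothetical MDP: (s, a) -> list of (probability, s', reward); "x" is terminal.
MDP = {
    ("start", "go"): [(1.0, "mid", 0.0)],
    ("start", "quit"): [(1.0, "x", 1.0)],
    ("mid", "exit"): [(0.5, "x", 10.0), (0.5, "x", 0.0)],
}


def q_value_iteration(mdp, gamma, iterations):
    """Compute Q_k for k = iterations, starting from Q_0(s, a) = 0."""
    Q = {sa: 0.0 for sa in mdp}
    for _ in range(iterations):

        def best(s_next):
            # max over a' of Q_k(s', a'); 0 if s' has no actions (terminal).
            return max((q for (s, _a), q in Q.items() if s == s_next), default=0.0)

        # Build Q_{k+1} entirely from Q_k, then rebind Q.
        Q = {
            sa: sum(p * (r + gamma * best(s_next)) for p, s_next, r in outcomes)
            for sa, outcomes in mdp.items()
        }
    return Q


Q = q_value_iteration(MDP, gamma=0.5, iterations=10)
print(Q[("start", "go")], Q[("start", "quit")], Q[("mid", "exit")])  # 2.5 1.0 5.0
```

Here "go" is worth 0.5 * 5 = 2.5 and beats "quit" at 1.0, so the greedy action can be read directly off the Q-values without consulting T or R again.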
![Page 24: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/24.jpg)
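Q-Learning
§ Q-Learning: sample-based Q-value iteration
§ Learn Q(s,a) values as you go
  § Receive a sample (s, a, s’, r)
  § Consider your old estimate: $Q(s, a)$
  § Consider your new sample estimate:
$$\text{sample} = R(s, a, s') + \gamma \max_{a'} Q(s', a')$$
  § Incorporate the new estimate into a running average:
$$Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha \cdot \text{sample}$$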
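One Q-learning step might be sketched as follows. The function name `q_learning_update`, the `actions` callback, and the two sample transitions are assumptions for illustration. Q is a `defaultdict`, so unseen q-states start at 0.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, s_next, r, actions, alpha=0.5, gamma=1.0):
    """Blend the sample r + gamma * max_a' Q(s', a') into Q(s, a)."""
    # actions(s') lists the actions available in s'; terminal states have none.
    best_next = max((Q[(s_next, a2)] for a2 in actions(s_next)), default=0.0)
    sample = r + gamma * best_next
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
    return Q[(s, a)]


def actions(state):
    # Hypothetical action sets: every non-terminal state offers the same two actions.
    return [] if state == "x" else ["east", "exit"]


Q = defaultdict(float)
print(q_learning_update(Q, "D", "exit", "x", 10, actions))  # 5.0
print(q_learning_update(Q, "C", "east", "D", -1, actions))  # 2.0
```

The second update uses max(Q(D, east), Q(D, exit)) = 5, so the sample is -1 + 5 = 4 and the running average moves from 0 to 2.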
![Page 25: Reinforcement Learning › ~bboots3 › CS4641-Fall2018 › Lecture20 › 20_RL1.pdfReinforcement Learning §Basic idea: §Receive feedback in the form of rewards §Agent’s utility](https://reader033.fdocuments.net/reader033/viewer/2022060500/5f1ad21de0d9a173f741ba12/html5/thumbnails/25.jpg)
Q-Learning Properties
§ Amazing result: Q-learning converges to optimal policy, even if you’re acting suboptimally!
§ This is called off-policy learning
§ Caveats:
  § You have to explore enough
  § You have to eventually make the learning rate small enough
  § … but not decrease it too quickly
  § Basically, in the limit, it doesn’t matter how you select actions (!)
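The off-policy property can be illustrated with a small experiment in which the agent acts uniformly at random yet still learns Q-values whose greedy policy is optimal. Everything here is a hypothetical setup: the deterministic environment `STEP`, the action table `ACTIONS`, the fixed seed, and the constant learning rate. A constant rate is enough only because this toy environment is deterministic; noisy environments need the decreasing rate mentioned in the caveats.

```python
import random

# Hypothetical deterministic environment: (s, a) -> (s', reward); "x" is terminal.
STEP = {
    ("start", "go"): ("mid", 0.0),
    ("start", "quit"): ("x", 1.0),
    ("mid", "exit"): ("x", 10.0),
    ("mid", "stay"): ("mid", -1.0),
}
ACTIONS = {"start": ["go", "quit"], "mid": ["exit", "stay"], "x": []}


def q_learning_random_behavior(episodes=500, alpha=0.5, gamma=0.5, seed=0):
    """Run Q-learning while choosing every action uniformly at random."""
    rng = random.Random(seed)
    Q = {sa: 0.0 for sa in STEP}
    for _ in range(episodes):
        s = "start"
        while s != "x":
            a = rng.choice(ACTIONS[s])  # behavior policy ignores Q entirely
            s_next, r = STEP[(s, a)]
            best_next = max((Q[(s_next, a2)] for a2 in ACTIONS[s_next]), default=0.0)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
            s = s_next
    return Q


Q = q_learning_random_behavior()
# Greedy policy read off the learned Q-values.
greedy = {s: max(acts, key=lambda a: Q[(s, a)]) for s, acts in ACTIONS.items() if acts}
print(greedy)
```

With gamma = 0.5 the optimal Q-values are Q(mid, exit) = 10, Q(mid, stay) = 4, Q(start, go) = 5 and Q(start, quit) = 1, so the greedy policy picks "go" and then "exit" even though the behavior that generated the data was random.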