Transcript of Lecture 22
Introduction to Machine Learning
Lecture 22: Reinforcement Learning
Albert Orriols i Puig
http://www.albertorriols.net
Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull
Recap of Lecture 21: Value Functions

Vπ(s): long-term reward estimate from state s, following policy π
Qπ(s,a): long-term reward estimate from state s, executing action a and then following policy π
The long-term reward is a recency-weighted average of the received rewards:

st →(at, rt) st+1 →(at+1, rt+1) st+2 →(at+2, rt+2) st+3 →(at+3, rt+3) …
Recap of Lecture 21: Policy
A policy, π, is a mapping from states, s∈S, and actions, a∈A(s), to the probability π(s, a) of taking action a when in state s.
Today’s Agenda

Bellman equations for value functions
Optimal policy
Learning the optimal policy
Q-learning
Let’s Estimate the Future Reward

We want to estimate the expected future reward from a given state under a policy π.

For the state-value function Vπ(s):
Vπ(s) = Eπ[ rt+1 + γ rt+2 + γ² rt+3 + … | st = s ]

For the action-value function Qπ(s,a):
Qπ(s,a) = Eπ[ rt+1 + γ rt+2 + γ² rt+3 + … | st = s, at = a ]
Bellman Equation for a Policy π

Playing a little with the equations, starting from the definition:
Vπ(s) = Eπ[ rt+1 + γ rt+2 + γ² rt+3 + … | st = s ]

Therefore
Vπ(s) = Eπ[ rt+1 + γ Vπ(st+1) | st = s ]

Finally, expanding the expectation over actions and successor states:
Vπ(s) = Σa π(s,a) Σs' P(s'|s,a) [ R(s,a,s') + γ Vπ(s') ]
Calculation of Value Functions
How to calculate the value functions for a given policy g p y1. Solve a set of linear equations
Bellman equation for VπBellman equation for Vπ
This is a system of |S| linear equations
2. Iterative method (convergence proved)Calculate the value by sweeping through the states
Slide 8
3. Greedy methods
Artificial Intelligence Machine Learning
Example: The Gridworld

Rewards:
-1 if the agent goes off the grid
0 for all other states, except for states A and B
From A, all four actions yield a reward of +10 and take the agent to A’
From B, all four actions yield a reward of +5 and take the agent to B’

[Figure: the 5×5 grid and its state values; (b) obtained by solving the Bellman equation for the policy that assigns equal probability to each movement, with γ = 0.9]
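The iterative method above can be sketched on this gridworld. This is a minimal sketch, assuming the classic 5×5 layout with A at (0,1), A’ at (4,1), B at (0,3), B’ at (2,3); the convergence threshold is also an assumption.

```python
import numpy as np

# Iterative policy evaluation for the 5x5 gridworld under the
# equiprobable random policy, gamma = 0.9. Grid coordinates and the
# stopping threshold are assumptions chosen to match the classic example.
N, gamma = 5, 0.9
A, A_prime = (0, 1), (4, 1)   # from A: reward +10, teleport to A'
B, B_prime = (0, 3), (2, 3)   # from B: reward +5, teleport to B'
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, a):
    """Deterministic dynamics: (next_state, reward)."""
    if s == A:
        return A_prime, 10.0
    if s == B:
        return B_prime, 5.0
    r, c = s[0] + a[0], s[1] + a[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return s, -1.0            # bounce off the wall, reward -1

V = np.zeros((N, N))
while True:
    V_new = np.zeros_like(V)
    for r in range(N):
        for c in range(N):
            # Bellman backup under the equiprobable policy (prob 1/4 each)
            V_new[r, c] = sum(0.25 * (rew + gamma * V[s2])
                              for s2, rew in (step((r, c), a) for a in actions))
    delta = np.max(np.abs(V_new - V))
    V = V_new
    if delta < 1e-6:
        break

print(np.round(V, 1))   # state A should be worth about 8.8, state B about 5.3
```

Note that A is worth less than its immediate reward of +10 (the agent is dragged to A’, near the edge), while B is worth slightly more than +5 (B’ is in the middle of the grid).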
Optimal Policy

We search for a policy that achieves a lot of reward over the long run.
Value functions enable us to define a partial order over policies:
A policy π is better than or equal to π’ if its expected return is greater than or equal to that of π’ for all states.
All optimal policies π* share the optimal state-value function V*:
V*(s) = maxπ Vπ(s)
Which can be written as the Bellman optimality equation:
V*(s) = maxa Σs' P(s'|s,a) [ R(s,a,s') + γ V*(s') ]
Focusing on the Objective

We want to find the optimal policy. There are many methods for this purpose:

Dynamic programming
Policy iteration
Value iteration
[Asynchronous versions]

RL algorithms
Q-learning
Sarsa
TD-learning

We are going to see Q-learning.
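Before turning to Q-learning, value iteration (one of the dynamic-programming methods listed above) can be sketched on the same gridworld. The backup is the Bellman optimality equation: a max over actions instead of an average under the policy. Grid layout and threshold are the same assumptions as before.

```python
import numpy as np

# Value iteration on the 5x5 gridworld, gamma = 0.9. Repeatedly applies
# the Bellman optimality backup V(s) <- max_a [r + gamma * V(s')].
N, gamma = 5, 0.9
A, A_prime = (0, 1), (4, 1)
B, B_prime = (0, 3), (2, 3)
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, a):
    """Deterministic dynamics: (next_state, reward)."""
    if s == A:
        return A_prime, 10.0
    if s == B:
        return B_prime, 5.0
    r, c = s[0] + a[0], s[1] + a[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return s, -1.0

V = np.zeros((N, N))
while True:
    # max over actions: the Bellman optimality backup
    V_new = np.array([[max(rew + gamma * V[s2]
                           for s2, rew in (step((r, c), a) for a in actions))
                       for c in range(N)] for r in range(N)])
    delta = np.max(np.abs(V_new - V))
    V = V_new
    if delta < 1e-6:
        break

print(np.round(V, 1))   # V*(A) should be about 24.4
```

Here V*(A) = 10 / (1 − γ⁵) ≈ 24.4, since the optimal agent collects +10 from A, walks the four steps from A’ back up to A, and repeats.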
Q-learning

An RL algorithm: learning by doing.
A temporal-difference method: it learns directly from raw experience, without a model of the environment’s dynamics.

Advantages:
No model of the world needed
Good policies are obtained before learning the optimal policy
Reacts to changes in the environment
Dynamic Programming in Brief
Needs a model of the environment to compute true expected values
[Backup diagram: a very informative backup]
Temporal Difference Learning
No model of the world needed
[Backup diagram: the most incremental backup]
Q-learning: Based on Q-backups

The learned action-value function Q directly approximates Q*, independently of the policy being followed:
Q(st,at) ← Q(st,at) + α [ rt+1 + γ maxa' Q(st+1,a') − Q(st,at) ]
Q-learning: Pseudocode
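A minimal sketch of the Q-learning loop, applied to a toy 4-state corridor MDP (the environment, ε, α, and episode count are illustrative assumptions, not from the lecture):

```python
import random

# Toy corridor MDP: states 0..3, actions 0 = left / 1 = right,
# reward +1 on reaching state 3, which is terminal. All hyperparameters
# below are illustrative choices.
N_STATES, ACTIONS = 4, (0, 1)
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.1

def step(s, a):
    """Deterministic corridor dynamics: (next_state, reward, done)."""
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    reward = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, reward, s2 == N_STATES - 1

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(500):                       # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection (greedy ties broken at random)
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            best = max(Q[s])
            a = random.choice([x for x in ACTIONS if Q[s][x] == best])
        s2, r, done = step(s, a)
        # Q-learning backup: off-policy, uses the max over next actions
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2

print(Q[0])   # Q(0, right) converges to gamma^2 = 0.81
```

Because the backup uses maxa' Q(s',a') rather than the action actually taken, the learned Q approximates Q* even though behavior is ε-greedy, which is exactly the off-policy property stated above.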
Some Last Remarks

Exploration regime
Explore vs. exploit
ε-greedy action selection
Soft-max action selection
Initialization of Q-values: be optimistic

Learning rate α
In stationary environments: α(s) = 1 / (number of visits to state s)
In non-stationary environments: α takes a constant value; the higher the value, the higher the influence of recent experiences
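The two exploration schemes above can be sketched as follows (the ε and temperature values are illustrative assumptions):

```python
import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax(q_values, tau=1.0):
    """Sample an action with probability proportional to exp(Q/tau).

    High temperature tau -> nearly uniform; low tau -> nearly greedy.
    """
    prefs = [math.exp(q / tau) for q in q_values]
    total = sum(prefs)
    r, acc = random.random() * total, 0.0
    for a, p in enumerate(prefs):
        acc += p
        if r < acc:
            return a
    return len(q_values) - 1
```

Unlike ε-greedy, which explores all non-greedy actions with equal probability, soft-max selection explores promising actions more often than clearly bad ones.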
Next Class
Reinforcement learning with LCSs