Transcript of Lecture 22
Introduction to Machine Learning
Lecture 22: Reinforcement Learning
Albert Orriols i Puig
http://www.albertorriols.net
Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull
Recap of Lecture 21: Value Functions

Vπ(s): long-term reward estimate from state s, following policy π
Qπ(s,a): long-term reward estimate from state s, executing action a and then following policy π
The long-term reward is a recency-weighted average of the received rewards:

st →(at, rt) st+1 →(at+1, rt+1) st+2 →(at+2, rt+2) st+3 →(at+3, rt+3) …
Recap of Lecture 21: Policy
A policy, π, is a mapping from states, s∈S, and actions, a∈A(s), to the probability π(s, a) of taking action a when in state s.
Today’s Agenda

Bellman equations for value functions
Optimal policy
Learning the optimal policy
Q-learning
Let’s Estimate the Future Reward

We want to estimate the expected future reward from a given state under a policy π.

For the state-value function Vπ(s):
Vπ(s) = Eπ[ rt+1 + γ rt+2 + γ² rt+3 + … | st = s ]

For the action-value function Qπ(s,a):
Qπ(s,a) = Eπ[ rt+1 + γ rt+2 + γ² rt+3 + … | st = s, at = a ]
Bellman Equation for a Policy π

Playing a little with the equations, starting from the definition:
Vπ(s) = Eπ[ rt+1 + γ rt+2 + γ² rt+3 + … | st = s ]

Therefore
Vπ(s) = Eπ[ rt+1 + γ Vπ(st+1) | st = s ]

Finally, expanding the expectation over actions and successor states:
Vπ(s) = Σa π(s,a) Σs' P(s'|s,a) [ R(s,a,s') + γ Vπ(s') ]
Calculation of Value Functions
How to calculate the value functions for a given policy g p y1. Solve a set of linear equations
Bellman equation for VπBellman equation for Vπ
This is a system of |S| linear equations
2. Iterative method (convergence proved)Calculate the value by sweeping through the states
Slide 8
3. Greedy methods
Artificial Intelligence Machine Learning
Example: The Gridworld

Rewards:
-1 if the agent goes off the grid
0 for all other states, except for states A and B
From A, all four actions yield a reward of +10 and take the agent to A’
From B, all four actions yield a reward of +5 and take the agent to B’

[Figure: the 5×5 grid and its state values; (b) obtained by solving the Bellman equation for the policy that assigns equal probability to each movement, with γ = 0.9]
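The iterative method above can be sketched on this gridworld. This is a minimal sketch, assuming the classic 5×5 layout with A at (0,1), A’ at (4,1), B at (0,3), B’ at (2,3); the convergence threshold is also an assumption.

```python
import numpy as np

# Iterative policy evaluation for the 5x5 gridworld under the
# equiprobable random policy, gamma = 0.9. Grid coordinates and the
# stopping threshold are assumptions chosen to match the classic example.
N, gamma = 5, 0.9
A, A_prime = (0, 1), (4, 1)   # from A: reward +10, teleport to A'
B, B_prime = (0, 3), (2, 3)   # from B: reward +5, teleport to B'
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, a):
    """Deterministic dynamics: (next_state, reward)."""
    if s == A:
        return A_prime, 10.0
    if s == B:
        return B_prime, 5.0
    r, c = s[0] + a[0], s[1] + a[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return s, -1.0            # bounce off the wall, reward -1

V = np.zeros((N, N))
while True:
    V_new = np.zeros_like(V)
    for r in range(N):
        for c in range(N):
            # Bellman backup under the equiprobable policy (prob 1/4 each)
            V_new[r, c] = sum(0.25 * (rew + gamma * V[s2])
                              for s2, rew in (step((r, c), a) for a in actions))
    delta = np.max(np.abs(V_new - V))
    V = V_new
    if delta < 1e-6:
        break

print(np.round(V, 1))   # state A should be worth about 8.8, state B about 5.3
```

Note that A is worth less than its immediate reward of +10 (the agent is dragged to A’, near the edge), while B is worth slightly more than +5 (B’ is in the middle of the grid).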
Optimal Policy

We search for a policy that achieves a lot of reward over the long run.
Value functions enable us to define a partial order over policies:
A policy π is better than or equal to π’ if its expected return is greater than or equal to that of π’ for all states.
All optimal policies π* share the optimal state-value function V*:
V*(s) = maxπ Vπ(s)
Which can be written as the Bellman optimality equation:
V*(s) = maxa Σs' P(s'|s,a) [ R(s,a,s') + γ V*(s') ]
Focusing on the Objective

We want to find the optimal policy. There are many methods for this purpose:

Dynamic programming
Policy iteration
Value iteration
[Asynchronous versions]

RL algorithms
Q-learning
Sarsa
TD-learning

We are going to see Q-learning.
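Before turning to Q-learning, value iteration (one of the dynamic-programming methods listed above) can be sketched on the same gridworld. The backup is the Bellman optimality equation: a max over actions instead of an average under the policy. Grid layout and threshold are the same assumptions as before.

```python
import numpy as np

# Value iteration on the 5x5 gridworld, gamma = 0.9. Repeatedly applies
# the Bellman optimality backup V(s) <- max_a [r + gamma * V(s')].
N, gamma = 5, 0.9
A, A_prime = (0, 1), (4, 1)
B, B_prime = (0, 3), (2, 3)
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(s, a):
    """Deterministic dynamics: (next_state, reward)."""
    if s == A:
        return A_prime, 10.0
    if s == B:
        return B_prime, 5.0
    r, c = s[0] + a[0], s[1] + a[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return s, -1.0

V = np.zeros((N, N))
while True:
    # max over actions: the Bellman optimality backup
    V_new = np.array([[max(rew + gamma * V[s2]
                           for s2, rew in (step((r, c), a) for a in actions))
                       for c in range(N)] for r in range(N)])
    delta = np.max(np.abs(V_new - V))
    V = V_new
    if delta < 1e-6:
        break

print(np.round(V, 1))   # V*(A) should be about 24.4
```

Here V*(A) = 10 / (1 − γ⁵) ≈ 24.4, since the optimal agent collects +10 from A, walks the four steps from A’ back up to A, and repeats.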
Q-learning

An RL algorithm: learning by doing.
A temporal-difference method: it learns directly from raw experience, without a model of the environment’s dynamics.

Advantages:
No model of the world needed
Good policies are obtained before learning the optimal policy
Reacts to changes in the environment
Dynamic Programming in Brief
Needs a model of the environment to compute true expected values
[Backup diagram: a very informative backup]
Temporal Difference Learning
No model of the world needed
[Backup diagram: the most incremental backup]
Q-learning: Based on Q-backups

The learned action-value function Q directly approximates Q*, independently of the policy being followed:
Q(st,at) ← Q(st,at) + α [ rt+1 + γ maxa' Q(st+1,a') − Q(st,at) ]
Q-learning: Pseudocode
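A minimal sketch of the Q-learning loop, applied to a toy 4-state corridor MDP (the environment, ε, α, and episode count are illustrative assumptions, not from the lecture):

```python
import random

# Toy corridor MDP: states 0..3, actions 0 = left / 1 = right,
# reward +1 on reaching state 3, which is terminal. All hyperparameters
# below are illustrative choices.
N_STATES, ACTIONS = 4, (0, 1)
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.1

def step(s, a):
    """Deterministic corridor dynamics: (next_state, reward, done)."""
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    reward = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, reward, s2 == N_STATES - 1

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(500):                       # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection (greedy ties broken at random)
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            best = max(Q[s])
            a = random.choice([x for x in ACTIONS if Q[s][x] == best])
        s2, r, done = step(s, a)
        # Q-learning backup: off-policy, uses the max over next actions
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2

print(Q[0])   # Q(0, right) converges to gamma^2 = 0.81
```

Because the backup uses maxa' Q(s',a') rather than the action actually taken, the learned Q approximates Q* even though behavior is ε-greedy, which is exactly the off-policy property stated above.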
Some Last Remarks

Exploration regime
Explore vs. exploit
ε-greedy action selection
Soft-max action selection
Initialization of Q-values: be optimistic

Learning rate α
In stationary environments: α(s) = 1 / (number of visits to state s)
In non-stationary environments: α takes a constant value; the higher the value, the higher the influence of recent experiences
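The two exploration schemes above can be sketched as follows (the ε and temperature values are illustrative assumptions):

```python
import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax(q_values, tau=1.0):
    """Sample an action with probability proportional to exp(Q/tau).

    High temperature tau -> nearly uniform; low tau -> nearly greedy.
    """
    prefs = [math.exp(q / tau) for q in q_values]
    total = sum(prefs)
    r, acc = random.random() * total, 0.0
    for a, p in enumerate(prefs):
        acc += p
        if r < acc:
            return a
    return len(q_values) - 1
```

Unlike ε-greedy, which explores all non-greedy actions with equal probability, soft-max selection explores promising actions more often than clearly bad ones.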
Next Class
Reinforcement learning with LCSs