
Transcript of Lecture 22

Introduction to Machine Learning

Lecture 22: Reinforcement Learning

Albert Orriols i Puig
http://www.albertorriols.net

[email protected]

Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle

Universitat Ramon Llull

Recap of Lecture 21

Value functions

Vπ(s): long-term reward estimate from state s, following policy π

Qπ(s,a): long-term reward estimate from state s, executing action a and then following policy π

The long-term reward is a recency-weighted average of the received rewards

Trajectory: … s_t, a_t, r_t → s_{t+1}, a_{t+1}, r_{t+1} → s_{t+2}, a_{t+2}, r_{t+2} → s_{t+3}, a_{t+3}, r_{t+3} → …
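For reference, the long-term reward that both value functions estimate is the discounted return. The slide's formula did not survive the transcript; this is the standard definition, using the indexing of the trajectory above, where r_t is the reward received after taking a_t:

$$ R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}, \qquad 0 \le \gamma < 1 $$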


Recap of Lecture 21

Policy

A policy, π, is a mapping from states, s∈S, and actions, a∈A(s), to the probability π(s, a) of taking action a when in state s.


Today’s Agenda

Bellman equations for value functions
Optimal policy
Learning the optimal policy
Q-learning


Let’s Estimate the Future Reward

I want to estimate what my reward will be, given a certain state and a policy π

For the state-value function Vπ(s)

For the action-value function Qπ(s,a)
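The expectations themselves are missing from the transcript; in standard notation they read:

$$ V^\pi(s) = E_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s \right] $$

$$ Q^\pi(s,a) = E_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s,\ a_t = a \right] $$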


Bellman Equation for a Policy π

Playing a little with the equations, we can unroll the return one step, rewrite it recursively, and finally express it in terms of the transition model (the equation chain is reconstructed below).
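The slide's equations were lost in transcription; this is the standard derivation they follow, with P^a_{ss'} the transition probability and R^a_{ss'} the expected reward (these symbols are assumed, following Sutton and Barto):

$$ V^\pi(s) = E_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s \right] = E_\pi\!\left[ r_t + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s \right] $$

Therefore

$$ V^\pi(s) = E_\pi\!\left[ r_t + \gamma V^\pi(s_{t+1}) \,\middle|\, s_t = s \right] $$

Finally

$$ V^\pi(s) = \sum_{a} \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right] $$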


Q-value Bellman Equation

If we estimate the Q-value
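The equation is missing from the transcript; the analogous Bellman equation for Qπ, in the same assumed notation, is:

$$ Q^\pi(s,a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \sum_{a'} \pi(s',a')\, Q^\pi(s',a') \right] $$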


Calculation of Value Functions

How to calculate the value functions for a given policy:

1. Solve a set of linear equations

Bellman equation for Vπ

This is a system of |S| linear equations

2. Iterative method (convergence is proven): calculate the values by sweeping repeatedly through the states (a sketch follows this list)


3. Greedy methods

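A minimal sketch of the iterative method (option 2), assuming the MDP is given as nested lists where P[s][a] holds (probability, next_state, reward) triples and pi[s][a] is the policy's action probability; this data layout is an illustrative assumption, not something from the slides:

import numpy as np

def policy_evaluation(P, pi, gamma=0.9, theta=1e-8):
    """Iteratively evaluate V^pi by sweeping through the states.

    P[s][a]  -- list of (prob, next_state, reward) triples (assumed layout)
    pi[s][a] -- probability of choosing action a in state s
    """
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):
            # Bellman backup: V(s) = sum_a pi(s,a) sum_s' p [r + gamma V(s')]
            v = sum(pi[s][a] * sum(p * (r + gamma * V[s2])
                                   for p, s2, r in P[s][a])
                    for a in range(len(P[s])))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:   # stop when a full sweep barely changes any value
            return V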

Example: The Gridworld

Rewards:

-1 if the agent tries to move off the grid

0 for all other states, except states A and B

From A, all four actions yield a reward of 10 and take the agent to A’

From B, all four actions yield a reward of 5 and take the agent to B’

Policy: equal probability for each movement; γ = 0.9

[Figure: panel (b) shows the state-value function obtained by solving the Bellman equations under this policy]
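A sketch of method 1 (solving the |S| linear Bellman equations) for this example, assuming the classic 5×5 layout from Sutton and Barto; the coordinates of A, A', B, B' are assumptions made for illustration, since the slide does not give them:

import numpy as np

# Hypothetical 5x5 layout; A/B positions assumed, not stated on the slide.
N = 5
A, A2 = (0, 1), (4, 1)   # state A and its target A'
B, B2 = (0, 3), (2, 3)   # state B and its target B'
gamma = 0.9
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def idx(r, c):
    return r * N + c

P = np.zeros((N * N, N * N))   # state-transition matrix under the policy
R = np.zeros(N * N)            # expected immediate reward in each state

for r in range(N):
    for c in range(N):
        s = idx(r, c)
        for dr, dc in moves:
            if (r, c) == A:
                nr, nc, rew = A2[0], A2[1], 10.0
            elif (r, c) == B:
                nr, nc, rew = B2[0], B2[1], 5.0
            elif 0 <= r + dr < N and 0 <= c + dc < N:
                nr, nc, rew = r + dr, c + dc, 0.0
            else:                          # the move would leave the grid
                nr, nc, rew = r, c, -1.0
            P[s, idx(nr, nc)] += 0.25      # equiprobable random policy
            R[s] += 0.25 * rew

# Solve the |S| linear Bellman equations: (I - gamma P) V = R
V = np.linalg.solve(np.eye(N * N) - gamma * P, R)
print(V.reshape(N, N).round(1))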

Looking for the Optimal Policy


Optimal Policy

We search for a policy that achieves a lot of reward over the long run

Value functions enable us to define a partial order over policies

A policy π is better than or equal to π' if its expected return is greater than or equal to that of π' for all states

Optimal policies π* share the optimal state-value function V*

Which can be written as
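Reconstructed in standard notation (the slide's own equations did not survive the transcript):

$$ V^*(s) = \max_{\pi} V^\pi(s) \qquad \forall s \in S $$

and, via the Bellman optimality equation,

$$ V^*(s) = \max_{a} \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right] $$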


Learning Optimal Policies


Focusing on the Objective

We want to find the optimal policy

There are many methods for this purpose:

Dynamic programming

Policy iteration
Value iteration
[Asynchronous versions]

RL algorithms
Q-learning
Sarsa
TD-learning

We are going to see Q-learning


Q-learning

An RL algorithm

Learning by doing

A temporal difference method
Learns directly from raw experience, without a model of the environment's dynamics

Advantages:
No model of the world needed

Good policies before learning the optimal policy

Reacts to changes in the environment


Dynamic Programming in Brief

Needs a model of the environment to compute true expected values


A very informative backup: each update uses the full distribution over successor states


Temporal Difference Learning

No model of the world needed


The most incremental backup: each update uses a single sampled transition


Q-learning

Based on Q-backups

The learned action-value function Q directly approximates Q*, independent of the policy being followed
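The backup formula itself is missing from the transcript; the standard Q-learning update, in the indexing convention used above, is:

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] $$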


Q-learning: Pseudocode
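The pseudocode image is not reproduced in the transcript; below is a minimal tabular Q-learning sketch, assuming a simple environment interface (env.n_actions, env.reset() returning a state, env.step(a) returning (next_state, reward, done)) that is an illustrative assumption, not taken from the slides:

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.65, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration."""
    Q = defaultdict(float)                 # Q[(state, action)], starts at 0
    actions = list(range(env.n_actions))   # assumed environment attribute
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy: explore with probability epsilon, else exploit
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)  # assumed interface
            # Q-backup: bootstrap from the greedy action in the next state
            target = r if done else r + gamma * max(Q[(s_next, a_)]
                                                    for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q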


Q-learning in Action

15×15 maze world; R(goal) = 1; R(other) = 0

γ=0.9

α=0.65


Q-learning in Action: a sequence of policy snapshots showing the initial policy and the policy after 20, 30, 100, 150, 200, 250, 300, 350, and 400 episodes. [The policy figures themselves are not reproduced in the transcript.]

Some Last Remarks

Exploration regime

Explore vs. exploit:
ε-greedy action selection
Soft-max action selection (see the formula after these remarks)

Initialization of Q-values: be optimistic

Learning rate α:
In stationary environments:

α(s) = 1 / (number of visits to state s)

In non-stationary environments:
α takes a constant value
The higher the value, the higher the influence of recent experiences
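For reference (not shown on the slide), soft-max action selection typically draws actions from a Boltzmann distribution over Q-values, with the temperature τ playing a role analogous to ε:

$$ P(a \mid s) = \frac{e^{Q(s,a)/\tau}}{\sum_{b} e^{Q(s,b)/\tau}} $$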


Next Class

Reinforcement learning with LCSs

