Transcript of Lecture21

Page 1: Lecture21

Introduction to Machine Learning

Lecture 21: Reinforcement Learning

Albert Orriols i Puig
http://www.albertorriols.net

[email protected]

Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle

Universitat Ramon Llull

Page 2: Lecture21

Recap of Lectures 5-18

Supervised learning

Data classification

Labeled data

Build a model that covers all the space

Unsupervised learning

Clustering

Unlabeled data

Group similar objects

Association rule analysis

Unlabeled data

Get the most frequent/important associations


Page 3: Lecture21

Today’s Agenda

Introduction

Reinforcement Learning

Some examples before going farther


Page 4: Lecture21

Introduction

What does reinforcement learning aim at?

Learning from interaction (with the environment)

Goal-directed learning

[Diagram: the agent receives a state from the environment, takes an action on it, and pursues a GOAL]

Learning what to do and its effect


Trial-and-error search and delayed reward


Page 5: Lecture21

Introduction

Learn reactive behaviors

Behaviors as a mapping between perceptions and actions

The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future.

Dilemma: neither exploitation nor exploration can be pursued exclusively without failing at the task.
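A standard way to act on this dilemma (not covered on this slide, but common in the RL literature) is ε-greedy action selection. A minimal Python sketch, where the value-estimate table `q` and the exploration rate `epsilon` are illustrative assumptions:

```python
import random

def epsilon_greedy(q, state, actions, epsilon=0.1):
    """With probability epsilon take a random action (explore);
    otherwise take the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)                     # explore
    return max(actions, key=lambda a: q[(state, a)])      # exploit

# Example usage with made-up values:
q = {("s", "left"): 0.2, ("s", "right"): 0.5}
print(epsilon_greedy(q, "s", ["left", "right"]))
```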


Page 6: Lecture21

How Can We Learn It?

1. Look-up tables

Perception    Action
State 1       Action 1
State 2       Action 2
State 3       Action 3
…             …

2. Neural networks

3. Rules

4. Finite automata
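To make option 1 concrete, a look-up table policy can literally be a dictionary from perceived states to actions. A minimal sketch with made-up state and action names:

```python
# A look-up table policy: each perceived state maps directly to an action.
policy = {
    "state_1": "action_1",
    "state_2": "action_2",
    "state_3": "action_3",
}

def act(state):
    """Acting is just a table look-up."""
    return policy[state]
```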


Page 7: Lecture21

Reinforcement Learning


Page 8: Lecture21

Reinforcement Learning

Reward function

r: S → R   or   r: S × A → R

[Diagram: at each step the agent receives state s_t and reward r_t from the environment and responds with action a_t]

Agent and environment interact at discrete time steps t = 0, 1, 2, …

The agent

observes the state at step t: s_t ∈ S

produces action a_t at step t: a_t ∈ A(s_t)

gets the resulting reward: r_t+1 ∈ R


goes to the next state, s_t+1
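The interaction protocol above can be sketched as a loop. The toy chain-world environment below is a made-up example, not from the slides; only the loop structure (observe s_t, emit a_t, receive r_t+1 and s_t+1) is the point:

```python
import random

def step(state, action):
    """Toy environment: states 0..4 on a line; reward 1 for reaching state 4."""
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward

state = 0
for t in range(10):
    action = random.choice([-1, +1])           # a_t ∈ A(s_t); random policy here
    next_state, reward = step(state, action)   # environment yields r_t+1 and s_t+1
    print(t, state, action, reward, next_state)
    state = next_state                         # go to the next state
```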

Page 9: Lecture21

Reinforcement Learning

[Diagram: the agent observes state s_t and reward r_t and sends action a_t to the environment]

Trace of a trial

… s_t (r_t) → a_t → s_t+1 (r_t+1) → a_t+1 → s_t+2 (r_t+2) → a_t+2 → s_t+3 (r_t+3) → …

Agent goal:

Maximize the total amount of reward it receives

Therefore, that means maximizing not only the immediate reward, but the cumulative reward in the long run

Page 10: Lecture21

Example of RL

Example: Recycling robot

State

charge level of battery

Actions

look for cans, wait for can, go recharge

Reward

positive for finding cans, negative for running out of battery


Page 11: Lecture21

More precisely…

Restricting to Markovian Decision Processes (MDP)

Finite set of situations

Finite set of actions

Transition probabilities

Reward probabilities

This means that

The agent needs to have complete information of the world

State s_t+1 only depends on state s_t and action a_t


Page 12: Lecture21

Recycling Robot Example

[Transition diagram: states high and low. From high: wait keeps high (prob. 1, reward R_wait); search stays in high with prob. α or drops to low with prob. 1−α (reward R_search in both cases). From low: wait keeps low (prob. 1, reward R_wait); search stays in low with prob. β (reward R_search) or depletes the battery with prob. 1−β (reward −3); recharge returns to high (prob. 1, reward 0).]


Page 13: Lecture21

Recycling Robot Example

S = {high, low}

A(high) = {wait, search}

A(low) = {wait, search, recharge}

R_search: expected # cans while searching

R_wait: expected # cans while waiting

R_search > R_wait
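As a sketch, this MDP can be written down as a transition table mapping each (state, action) pair to (probability, next state, reward) triples. The numeric values for α, β, R_search, and R_wait below are assumptions chosen only so the example runs; they are not from the slides:

```python
# Recycling robot MDP as a table: (state, action) -> [(prob, next_state, reward)].
ALPHA, BETA = 0.8, 0.6          # assumed transition probabilities
R_SEARCH, R_WAIT = 2.0, 0.5     # assumed expected cans, with R_SEARCH > R_WAIT

T = {
    ("high", "search"):   [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
    ("high", "wait"):     [(1.0, "high", R_WAIT)],
    ("low",  "search"):   [(BETA, "low", R_SEARCH), (1 - BETA, "high", -3.0)],
    ("low",  "wait"):     [(1.0, "low", R_WAIT)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}

# Sanity check: outgoing probabilities sum to 1 for every (state, action).
assert all(abs(sum(p for p, _, _ in v) - 1.0) < 1e-9 for v in T.values())
```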


Page 14: Lecture21

Breaking the Markovian Property

Possible problems that do not satisfy the MDP assumptions

When actions and states are not finite

Solution: discretize the set of actions and states

When transition probabilities do not depend only on the current state

Possible solution: represent states as structures built up over time from sequences of sensations

This is a POMDP (Partially Observable MDP)

Use POMDP algorithms to solve these problems


Page 15: Lecture21

Elements of Reinforcement Learning


Page 16: Lecture21

Elements of RL

Policy: what to do

Reward: what’s good

Value: what’s good because it predicts reward

Model: What follows what


Page 17: Lecture21

Components of an RL Agent

Policy (behavior)

Mapping from states to actions

π*: S → A

Reward

Local reward in state t: r_t

Model

Probability of transition from state s to s' by executing action a

T(s, a, s')

The transition probabilities depend only on these parameters

This is not known by the agent

Page 18: Lecture21

Components of an RL Agent

Value functions

Vπ(s): long-term reward estimation from state s following policy π

Qπ(s,a): long-term reward estimation from state s, executing action a and then following policy π

A simple example: a maze

Note that the agent does not know its own position. It can only perceive what it has in the surrounding states
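Written out explicitly, these are the standard discounted-return definitions from the RL literature (a sketch consistent with the slide's wording; the discount factor γ is introduced a few slides later):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s \right]
\qquad
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s,\; a_t = a \right]
```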



Page 20: Lecture21

Pursuing the goal: Maximize long-term reward


Page 21: Lecture21

Goals and Rewards

Ok, but I need to maximize my long-term reward. How do I get the long-term reward?

Long-term reward defined in terms of the goal of the agent

The agent receives the local reward at each time step

How?

Intuitive idea: sum all the rewards obtained so far

Problem: it can grow without bound in non-ending tasks
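In symbols, this intuitive idea is the undiscounted return (the formula is not spelled out on the slide; this is the standard notation):

```latex
R_t = r_{t+1} + r_{t+2} + r_{t+3} + \dots + r_T
```

For a non-ending task T → ∞, so this sum can diverge, which is exactly the problem noted above.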


Page 22: Lecture21

Goals and Rewards

How can we deal with non-ending tasks?

Weighted addition of local rewards

The γ parameter (0 < γ < 1) is the discounting factor
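Concretely, this weighted addition is the standard discounted return (the slide's formula appears to have been an image; this is the usual definition):

```latex
R_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^{2}\, r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}
```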

… s_t (r_t) → a_t → s_t+1 (r_t+1) → a_t+1 → s_t+2 (r_t+2) → a_t+2 → s_t+3 (r_t+3) → …

Note the bias for immediate rewards

If you want to avoid it, set γ close to 1


Page 23: Lecture21

Some examples


Page 24: Lecture21

Pole balancing

Balance the pole

The cart can move forward and backward

Avoid failure:

the pole falling beyond a certain critical angle

the cart hitting the end of the track

Reward

−1 upon failure

−γ^k, for k steps before failure
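A quick worked instance of the second scheme, assuming γ = 0.9 (an illustrative value, not from the slide): if the pole falls after k = 10 steps,

```latex
R = -\gamma^{k} = -(0.9)^{10} \approx -0.35
```

so later failures yield returns closer to 0, which pushes the agent toward balancing longer.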


Page 25: Lecture21

Mountain Car Problem

Objective

Get to the top of the hill as quickly as possible

State definition:

Car position and speed

Actions

Forward, reverse, none

Reward

−1 for each step that is not on the top of the hill

(equivalently, the return is −(number of steps) before reaching the top of the hill)


Page 26: Lecture21

Next Class

How to learn the policies


Page 27: Lecture21

Introduction to Machine Learning

Lecture 21: Reinforcement Learning

Albert Orriols i Puig
http://www.albertorriols.net

[email protected]

Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle

Universitat Ramon Llull