Transcript of Lecture21

Page 1: Lecture21

Introduction to Machine Learning

Lecture 21: Reinforcement Learning

Albert Orriols i Puig
http://www.albertorriols.net

[email protected]

Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle

Universitat Ramon Llull

Page 2: Lecture21

Recap of Lectures 5-18

Supervised learning

Data classification

Labeled data

Build a model that covers all the space

Unsupervised learning

Clustering

Unlabeled data

Group similar objects

Association rule analysis

Unlabeled data

Get the most frequent/important associations


Page 3: Lecture21

Today’s Agenda

Introduction

Reinforcement Learning

Some examples before going farther


Page 4: Lecture21

Introduction

What does reinforcement learning aim at?

Learning from interaction (with the environment)

Goal-directed learning

[Diagram: the agent receives a state from the environment, takes an action on it, and pursues a GOAL]

Learning what to do and its effect


Trial-and-error search and delayed reward


Page 5: Lecture21

Introduction

Learn reactive behaviors

Behaviors as a mapping between perceptions and actions

The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future.

Dilemma: neither exploitation nor exploration can be pursued exclusively without failing at the task.
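A standard way to act on this dilemma (not covered on this slide, but common in the RL literature) is ε-greedy action selection. A minimal Python sketch, where the value-estimate table `q` and the exploration rate `epsilon` are illustrative assumptions:

```python
import random

def epsilon_greedy(q, state, actions, epsilon=0.1):
    """With probability epsilon take a random action (explore);
    otherwise take the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)                     # explore
    return max(actions, key=lambda a: q[(state, a)])      # exploit

# Example usage with made-up values:
q = {("s", "left"): 0.2, ("s", "right"): 0.5}
print(epsilon_greedy(q, "s", ["left", "right"]))
```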


Page 6: Lecture21

How Can We Learn It?

1. Look-up tables

Perception    Action
State 1       Action 1
State 2       Action 2
State 3       Action 3
…             …

2. Neural networks

3. Rules

4. Finite automata
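To make option 1 concrete, a look-up table policy can literally be a dictionary from perceived states to actions. A minimal sketch with made-up state and action names:

```python
# A look-up table policy: each perceived state maps directly to an action.
policy = {
    "state_1": "action_1",
    "state_2": "action_2",
    "state_3": "action_3",
}

def act(state):
    """Acting is just a table look-up."""
    return policy[state]
```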


Page 7: Lecture21

Reinforcement Learning


Page 8: Lecture21

Reinforcement Learning

Reward function

r: S → R   or   r: S × A → R

[Diagram: at each step the agent receives state s_t and reward r_t from the environment and responds with action a_t]

Agent and environment interact at discrete time steps t = 0, 1, 2, …

The agent

observes the state at step t: s_t ∈ S

produces action a_t at step t: a_t ∈ A(s_t)

gets the resulting reward: r_t+1 ∈ R


goes to the next state, s_t+1
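The interaction protocol above can be sketched as a loop. The toy chain-world environment below is a made-up example, not from the slides; only the loop structure (observe s_t, emit a_t, receive r_t+1 and s_t+1) is the point:

```python
import random

def step(state, action):
    """Toy environment: states 0..4 on a line; reward 1 for reaching state 4."""
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward

state = 0
for t in range(10):
    action = random.choice([-1, +1])           # a_t ∈ A(s_t); random policy here
    next_state, reward = step(state, action)   # environment yields r_t+1 and s_t+1
    print(t, state, action, reward, next_state)
    state = next_state                         # go to the next state
```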

Page 9: Lecture21

Reinforcement Learning

[Diagram: the agent observes state s_t and reward r_t and sends action a_t to the environment]

Trace of a trial

… s_t (r_t) → a_t → s_t+1 (r_t+1) → a_t+1 → s_t+2 (r_t+2) → a_t+2 → s_t+3 (r_t+3) → …

Agent goal:

Maximize the total amount of reward it receives

Therefore, that means maximizing not only the immediate reward, but the cumulative reward in the long run

Page 10: Lecture21

Example of RL

Example: Recycling robot

State

charge level of battery

Actions

look for cans, wait for can, go recharge

Reward

positive for finding cans, negative for running out of battery


Page 11: Lecture21

More precisely…

Restricting to Markovian Decision Processes (MDP)

Finite set of situations

Finite set of actions

Transition probabilities

Reward probabilities

This means that

The agent needs to have complete information of the world

State s_t+1 only depends on state s_t and action a_t


Page 12: Lecture21

Recycling Robot Example

[Transition diagram: states high and low. From high: wait keeps high (prob. 1, reward R_wait); search stays in high with prob. α or drops to low with prob. 1−α (reward R_search in both cases). From low: wait keeps low (prob. 1, reward R_wait); search stays in low with prob. β (reward R_search) or depletes the battery with prob. 1−β (reward −3); recharge returns to high (prob. 1, reward 0).]


Page 13: Lecture21

Recycling Robot Example

S = {high, low}

A(high) = {wait, search}

A(low) = {wait, search, recharge}

R_search: expected # cans while searching

R_wait: expected # cans while waiting

R_search > R_wait
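As a sketch, this MDP can be written down as a transition table mapping each (state, action) pair to (probability, next state, reward) triples. The numeric values for α, β, R_search, and R_wait below are assumptions chosen only so the example runs; they are not from the slides:

```python
# Recycling robot MDP as a table: (state, action) -> [(prob, next_state, reward)].
ALPHA, BETA = 0.8, 0.6          # assumed transition probabilities
R_SEARCH, R_WAIT = 2.0, 0.5     # assumed expected cans, with R_SEARCH > R_WAIT

T = {
    ("high", "search"):   [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
    ("high", "wait"):     [(1.0, "high", R_WAIT)],
    ("low",  "search"):   [(BETA, "low", R_SEARCH), (1 - BETA, "high", -3.0)],
    ("low",  "wait"):     [(1.0, "low", R_WAIT)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}

# Sanity check: outgoing probabilities sum to 1 for every (state, action).
assert all(abs(sum(p for p, _, _ in v) - 1.0) < 1e-9 for v in T.values())
```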


Page 14: Lecture21

Breaking the Markovian Property

Possible problems that do not satisfy the MDP assumptions

When actions and states are not finite

Solution: discretize the set of actions and states

When transition probabilities do not depend only on the current state

Possible solution: represent states as structures built up over time from sequences of sensations

This is a POMDP (Partially Observable MDP)

Use POMDP algorithms to solve these problems


Page 15: Lecture21

Elements of Reinforcement Learning


Page 16: Lecture21

Elements of RL

Policy: what to do

Reward: what’s good

Value: what’s good because it predicts reward

Model: What follows what


Page 17: Lecture21

Components of an RL Agent

Policy (behavior)

Mapping from states to actions

π*: S → A

Reward

Local reward in state t: r_t

Model

Probability of transition from state s to s' by executing action a

T(s, a, s')

The transition probabilities depend only on these parameters

This is not known by the agent

Page 18: Lecture21

Components of an RL Agent

Value functions

Vπ(s): long-term reward estimation from state s following policy π

Qπ(s,a): long-term reward estimation from state s, executing action a and then following policy π

A simple example: a maze

Note that the agent does not know its own position. It can only perceive what it has in the surrounding states
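Written out explicitly, these are the standard discounted-return definitions from the RL literature (a sketch consistent with the slide's wording; the discount factor γ is introduced a few slides later):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s \right]
\qquad
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \;\middle|\; s_t = s,\; a_t = a \right]
```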



Page 20: Lecture21

Pursuing the goal: Maximize long-term reward


Page 21: Lecture21

Goals and Rewards

Ok, but I need to maximize my long-term reward. How do I get the long-term reward?

Long-term reward defined in terms of the goal of the agent

The agent receives the local reward at each time step

How?

Intuitive idea: sum all the rewards obtained so far

Problem: it can grow without bound in non-ending tasks
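In symbols, this intuitive idea is the undiscounted return (the formula is not spelled out on the slide; this is the standard notation):

```latex
R_t = r_{t+1} + r_{t+2} + r_{t+3} + \dots + r_T
```

For a non-ending task T → ∞, so this sum can diverge, which is exactly the problem noted above.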


Page 22: Lecture21

Goals and Rewards

How can we deal with non-ending tasks?

Weighted addition of local rewards

The γ parameter (0 < γ < 1) is the discounting factor
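Concretely, this weighted addition is the standard discounted return (the slide's formula appears to have been an image; this is the usual definition):

```latex
R_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^{2}\, r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}
```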

… s_t (r_t) → a_t → s_t+1 (r_t+1) → a_t+1 → s_t+2 (r_t+2) → a_t+2 → s_t+3 (r_t+3) → …

Note the bias for immediate rewards

If you want to avoid it, set γ close to 1


Page 23: Lecture21

Some examples


Page 24: Lecture21

Pole balancing

Balance the pole

The cart can move forward and backward

Avoid failure:

the pole falling beyond a certain critical angle

the cart hitting the end of the track

Reward

−1 upon failure

−γ^k, for k steps before failure
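A quick worked instance of the second scheme, assuming γ = 0.9 (an illustrative value, not from the slide): if the pole falls after k = 10 steps,

```latex
R = -\gamma^{k} = -(0.9)^{10} \approx -0.35
```

so later failures yield returns closer to 0, which pushes the agent toward balancing longer.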


Page 25: Lecture21

Mountain Car Problem

Objective

Get to the top of the hill as quickly as possible

State definition:

Car position and speed

Actions

Forward, reverse, none

Reward

−1 for each step that is not on the top of the hill

(equivalently, the return is −(number of steps) before reaching the top of the hill)


Page 26: Lecture21

Next Class

How to learn the policies


Page 27: Lecture21

Introduction to Machine Learning

Lecture 21: Reinforcement Learning

Albert Orriols i Puig
http://www.albertorriols.net

[email protected]

Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle

Universitat Ramon Llull