
Page 1:

Reinforcement Learning
Chapter 21, AIMA3e

Ryan Robinson

Page 2:

Chapter Outline

• 21.1: Introduction

• 21.2: Passive Reinforcement Learning

• 21.3: Active Reinforcement Learning

• 21.4: Generalization in Reinforcement Learning

• 21.5: Policy Search

• 21.6: Applications (not included here)

Page 3:

21.1 Introduction

• "In which we examine how an agent can learn from success and failure, from reward and punishment"

• Reward, or reinforcement, is feedback about performance, whether positive or negative
  • E.g. the score in a ping-pong game, or a win or loss in a chess game

• The task of reinforcement learning is to learn an optimal policy: one that maximizes the expected total reward (formalized below)

• Assumptions:
  • No prior knowledge of the environment
  • No knowledge of the reward function
  • Imagine playing an entire chess game and only then being told "you lose"
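The learning target can be stated compactly in the standard notation (this formulation is not from the slides): with discount factor γ, an optimal policy maximizes the expected discounted sum of rewards,

$$\pi^* = \arg\max_\pi \; E\!\left[\,\sum_{t=0}^{\infty} \gamma^t R(s_t) \;\middle|\; \pi\right]$$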

Page 4:

3 Types of Agents

• Utility-based agents
  • Learn a utility function on states and use it to select actions that maximize the expected outcome utility

• Q-learning agents
  • Learn an action-utility function, or Q-function, giving the expected utility of taking a given action in a given state

• Reflex agents
  • Learn a policy that maps directly from states to actions

• The distinction between the first two comes up repeatedly in this chapter; it separates the main families of reinforcement learning algorithms
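The three representations are linked by standard relations (stated here for orientation, not quoted from the slides): the utility of a state is the value of its best action, and the greedy policy picks that action,

$$U(s) = \max_a Q(s, a), \qquad \pi(s) = \arg\max_a Q(s, a)$$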

Page 5:

21.2 Passive Learning
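In the passive setting the policy π is fixed; the agent's job is to learn how good that policy is, i.e. the expected utility of each state when π is followed. In the chapter's standard formulation (not preserved on this slide):

$$U^\pi(s) = E\!\left[\,\sum_{t=0}^{\infty} \gamma^t R(s_t) \;\middle|\; \pi,\, s_0 = s\right]$$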

Page 6:

Direct Utility Estimation
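Direct utility estimation treats the observed reward-to-go from each state as a direct training sample for that state's utility and averages the samples over completed trials. A minimal sketch of the idea (the function name and the trial format are illustrative assumptions, not from the slides):

    from collections import defaultdict

    def direct_utility_estimation(trials, gamma=1.0):
        """Estimate U(s) as the average observed reward-to-go, one sample
        per visit, over completed trials. Each trial is a list of
        (state, reward) pairs in the order they were experienced."""
        totals = defaultdict(float)
        counts = defaultdict(int)
        for trial in trials:
            reward_to_go = 0.0
            # Walk the trial backwards so the discounted reward-to-go
            # from each step can be accumulated in a single pass.
            for state, reward in reversed(trial):
                reward_to_go = reward + gamma * reward_to_go
                totals[state] += reward_to_go
                counts[state] += 1
        return {s: totals[s] / counts[s] for s in totals}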

Page 7:

Adaptive Dynamic Programming

• Learns the transition model between states, P(s'|s, a), so it can exploit the constraints the model places on state utilities, which Direct Utility Estimation cannot do

• Uses dynamic programming to solve the corresponding Markov decision process; the algorithm is given in Figure 21.2

• The Bellman equation is still used to calculate utilities
  • Solved as in Chapter 17, with a linear algebra package or modified policy iteration
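These are the fixed-policy Bellman equations from Chapter 17, which ADP re-solves as its learned model changes:

$$U^\pi(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U^\pi(s')$$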

Page 8:

ADP Continued

• Learning the model itself is easy because the environment is fully observable
  • A supervised learning task, with the state-action pair as input and the resulting state as output
  • Simplest way: probability tables recording which s' values have been experienced for each (s, a) pair

• Learning is fast, but solving the resulting model is intractable for large state spaces
  • E.g. backgammon would involve solving 10^50 equations in 10^50 unknowns
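A minimal sketch of the probability-table idea: maximum-likelihood estimates of P(s'|s, a) built from counts of experienced transitions (the class and method names are illustrative):

    from collections import defaultdict

    class TransitionModel:
        """Maximum-likelihood estimate of P(s'|s,a), kept as count tables."""
        def __init__(self):
            # (s, a) -> {s': number of times s' was the outcome}
            self.counts = defaultdict(lambda: defaultdict(int))

        def observe(self, s, a, s_next):
            self.counts[(s, a)][s_next] += 1

        def prob(self, s, a, s_next):
            total = sum(self.counts[(s, a)].values())
            return self.counts[(s, a)][s_next] / total if total else 0.0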

Page 9:

Another ADP Problem

Page 10:

Robust Control Theory

Page 11:

Temporal-Difference Learning
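The heart of TD learning is its update rule, in the chapter's standard form: after each observed transition from s to s', with learning rate α, nudge U(s) toward what the successor state suggests it should be,

$$U^\pi(s) \leftarrow U^\pi(s) + \alpha\left(R(s) + \gamma\, U^\pi(s') - U^\pi(s)\right)$$

No transition model is required, which is the key contrast with ADP.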

Page 12:

21.3 Active Reinforcement Learning

• An active agent must decide what actions to take, unlike the passive agents, which followed a fixed policy

• Greedy agent: an agent that always acts optimally according to its current learned model tends to settle on a policy after very little variation, and that policy is very seldom the optimal one
  • How does that happen? The learned model isn't the same as the true environment, so what is optimal in the learned model rarely matches what is optimal in the environment

• The greedy agent misses the fact that actions do more than provide rewards according to the current model; they also help improve the model for the future

• Agents must therefore trade off exploitation against exploration
  • E.g. the n-armed bandit problem

Page 13:

Exploration

• It is extremely difficult to obtain an optimal exploration method, but possible to come up with a reasonable one

• GLIE = Greedy in the Limit of Infinite Exploration: a scheme must try each action in each state an unbounded number of times, so there is no finite probability that an optimal action is missed because of an unusually bad series of outcomes

• Lots of GLIE schemes are available
  • Choose a random action with probability 1/t, otherwise take the greedy action (a sketch follows this list)
  • Always converges to the optimal policy, but very slowly

• More sophisticated and effective schemes use exploration functions to make unexplored state-action pairs more attractive

• See pp. 841-842 if you want the details
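A minimal sketch of the 1/t scheme above (the function name and the Q-table layout, a dict keyed by (state, action), are illustrative assumptions):

    import random

    def glie_action(state, Q, actions, t):
        """GLIE scheme from the slide: with probability 1/t pick a random
        action, otherwise act greedily w.r.t. the current Q estimates.
        t is the step count, starting at 1."""
        if random.random() < 1.0 / t:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((state, a), 0.0))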

Page 14:

Learning an Action-Utility Function

• When the TD learning agent is adapted to be active, the update rule remains the same, but the problem of learning the model is identical to that of the active ADP agent just discussed

• An alternative active TD method is Q-learning, which learns an action-utility representation instead of utilities
  • Model-free, both for learning and for action selection!

• In the 4×3 world example, Q-learning is much slower to converge than ADP learning
  • That advantage of knowledge-based learning is amplified in complex environments such as chess, checkers, and backgammon, where the model-based ADP approaches are far superior
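The Q-learning backup itself takes the standard form: after taking action a in state s, receiving reward r, and arriving in s', move Q(s, a) toward r + γ max over a' of Q(s', a'). A minimal sketch (the helper name and defaults are illustrative; Q is a dict keyed by (state, action)):

    def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
        """One Q-learning backup after observing (s, a, r, s_next).
        No transition model is consulted -- hence 'model-free'."""
        best_next = max(Q.get((s_next, b), 0.0) for b in actions)
        q_sa = Q.get((s, a), 0.0)
        # Move Q(s,a) toward the one-step target r + gamma * max_a' Q(s',a').
        Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)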

Page 15:

21.4 Generalization in Reinforcement Learning

• So far we have assumed that the utility functions and Q-functions learned by the agents are represented in tabular form, with one output value for each input tuple

• In small state spaces such as our 4×3 world it is not hard to visit every state multiple times, but realistic state spaces can never be covered
  • Even something as "simple" as chess has far too many states to visit

• One way to handle this is function approximation: representing the function compactly in terms of the state's defining features, such as its (x, y) coordinates (see the sketch on the next slide)

• The compression achieved is enormous; the approximator cannot represent every possible utility function, but it allows the learned function to generalize to states the agent has never visited

Page 16:

Function Approximation
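A minimal sketch of the idea from the previous slide, assuming a linear approximator U(s) = θ · φ(s) trained with the gradient form of the TD update (the feature function φ and all names are illustrative assumptions):

    def td_update_linear(theta, phi, s, r, s_next, alpha=0.01, gamma=0.9):
        """One TD update for a linear approximator U(s) = theta . phi(s).
        phi maps a state to a list of feature values, e.g. [1, x, y].
        Each weight moves in proportion to its feature and the TD error."""
        f, f_next = phi(s), phi(s_next)
        u = sum(t * x for t, x in zip(theta, f))
        u_next = sum(t * x for t, x in zip(theta, f_next))
        error = r + gamma * u_next - u          # TD error
        return [t + alpha * error * x for t, x in zip(theta, f)]

The table of one value per state is replaced by a handful of weights, which is what lets the learned function say something about states it has never seen.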

Page 17:

21.5 Policy Search
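In outline (a standard summary of the section, not quoted from the slides): policy search represents the policy itself as a parameterized function π with parameters θ, and adjusts θ as long as performance improves, for example by hill climbing or by following an estimate of the gradient of the policy value ρ(θ), the expected reward from executing the policy:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta\, \rho(\theta)$$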

Page 18:

Summary

• 21.1 Introduction
  • Reinforcement learning: the task of learning the optimal policy from reward and punishment
  • 3 types of agents

• 21.2 Passive Reinforcement Learning
  • Direct Utility Estimation
  • Adaptive Dynamic Programming
  • Temporal-Difference Learning

• 21.3 Active Reinforcement Learning
  • Trade-off between Exploration and Exploitation
  • Learning the action-utility function (Q-learning)

• 21.4 Generalization
  • Function Approximation

• 21.5 Policy Search