
Page 1: CS 188: Artificial Intelligence Spring 2007

CS 188: Artificial Intelligence, Spring 2007

Lecture 21: Reinforcement Learning II (MDPs)

4/12/2007

Srini Narayanan – ICSI and UC Berkeley

Page 2: CS 188: Artificial Intelligence Spring 2007

Announcements

Othello tournament signup: please send email to [email protected]

HW on classification out: due 4/23; can work in pairs

Page 3: CS 188: Artificial Intelligence Spring 2007

Reinforcement Learning

Basic idea:
Receive feedback in the form of rewards
Agent's utility is defined by the reward function
Must learn to act so as to maximize expected utility
Change the rewards, change the behavior

Examples:
Learning optimal paths
Playing a game, reward at the end for winning / losing
Vacuuming a house, reward for each piece of dirt picked up
Automated taxi, reward for each passenger delivered

Page 4: CS 188: Artificial Intelligence Spring 2007

Recap: MDPs

Markov decision processes:
States S
Actions A
Transitions P(s'|s,a) (or T(s,a,s'))
Rewards R(s,a,s')
Start state s0

Examples: Gridworld, High-Low, N-Armed Bandit
Any process where the result of your action is stochastic

Goal: find the "best" policy
Policies are maps from states to actions
What do we mean by "best"?
This is like search – it's planning using a model, not actually interacting with the environment
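To make these ingredients concrete, here is a minimal sketch of one way an MDP of this form might be written down in Python. The tiny two-state machine (states "cool"/"overheated", actions "slow"/"fast") and all of its numbers are invented for illustration; it is not one of the lecture's examples.

```python
# Minimal MDP representation: states, actions, T, R, start state, and a policy.
# The example MDP itself is invented for illustration.

states = ["cool", "overheated"]
actions = ["slow", "fast"]

# Transitions T(s, a, s') = P(s' | s, a), stored as {(s, a): [(s', prob), ...]}
T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("overheated", 0.5)],
    ("overheated", "slow"): [("cool", 0.5), ("overheated", 0.5)],
    ("overheated", "fast"): [("overheated", 1.0)],
}

# Rewards R(s, a, s'), one number per possible transition
R = {
    ("cool", "slow", "cool"): 1.0,
    ("cool", "fast", "cool"): 2.0,
    ("cool", "fast", "overheated"): 2.0,
    ("overheated", "slow", "cool"): 0.0,
    ("overheated", "slow", "overheated"): 0.0,
    ("overheated", "fast", "overheated"): -10.0,
}

start_state = "cool"   # s0

# A policy is just a map from states to actions
policy = {"cool": "fast", "overheated": "slow"}
```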

Page 5: CS 188: Artificial Intelligence Spring 2007

MDP Solutions

In deterministic single-agent search, we want an optimal sequence of actions from start to a goal
In an MDP, like expectimax, we want an optimal policy π*(s)
A policy gives an action for each state
The optimal policy maximizes expected utility (i.e. expected rewards) if followed
It defines a reflex agent

Optimal policy when R(s, a, s') = -0.04 for all non-terminals s

Page 6: CS 188: Artificial Intelligence Spring 2007

Example Optimal Policies

[Figure: four gridworld policies, one for each living reward R(s) = -2.0, R(s) = -0.4, R(s) = -0.03, R(s) = -0.01]

Page 7: CS 188: Artificial Intelligence Spring 2007

Stationarity

In order to formalize optimality of a policy, we need to understand utilities of reward sequences
Typically consider stationary preferences: if [r, r1, r2, ...] is preferred to [r, r1', r2', ...], then [r1, r2, ...] is preferred to [r1', r2', ...]

Theorem: there are only two ways to define stationary utilities
Additive utility: U([r0, r1, r2, ...]) = r0 + r1 + r2 + ...
Discounted utility: U([r0, r1, r2, ...]) = r0 + γ r1 + γ² r2 + ...

Page 8: CS 188: Artificial Intelligence Spring 2007

Infinite Utilities?!

Problem: infinite state sequences with infinite rewards

Solutions:
Finite horizon:
Terminate after a fixed T steps
Gives a nonstationary policy (π depends on the time left)
Absorbing state(s): guarantee that for every policy, the agent will eventually "die" (like "done" for High-Low)
Discounting: for 0 < γ < 1,
U([r0, ..., r∞]) = Σt γ^t rt ≤ Rmax / (1 − γ)
Smaller γ means a smaller horizon

Page 9: CS 188: Artificial Intelligence Spring 2007

How (Not) to Solve an MDP

The inefficient way:
Enumerate policies
For each one, calculate the expected utility (discounted rewards) from the start state, e.g. by simulating a bunch of runs
Choose the best policy

We'll return to a (better) idea like this later
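To illustrate why this is the inefficient way, here is a sketch of enumerate-and-simulate on a made-up two-state MDP (all tables and constants below are invented for the example): it scores every deterministic policy by Monte Carlo rollouts, which costs |A|^|S| policies times many simulated runs each.

```python
import itertools
import random

# Tiny invented MDP in the same dict layout as the earlier sketch
states = ["A", "B"]
actions = ["stay", "go"]
T = {("A", "stay"): [("A", 1.0)], ("A", "go"): [("A", 0.5), ("B", 0.5)],
     ("B", "stay"): [("B", 1.0)], ("B", "go"): [("A", 1.0)]}
R = {("A", "stay", "A"): 0.0, ("A", "go", "A"): 1.0, ("A", "go", "B"): 5.0,
     ("B", "stay", "B"): 1.0, ("B", "go", "A"): 0.0}
gamma, start, horizon, runs = 0.9, "A", 60, 200

def estimate_utility(policy):
    """Average discounted reward of a policy over many simulated runs."""
    total = 0.0
    for _ in range(runs):
        s, discount = start, 1.0
        for _ in range(horizon):
            a = policy[s]
            next_states, probs = zip(*T[(s, a)])
            s2 = random.choices(next_states, weights=probs)[0]
            total += discount * R[(s, a, s2)]
            discount *= gamma
            s = s2
    return total / runs

# Enumerate all |A|^|S| deterministic policies and keep the best-scoring one
all_policies = [dict(zip(states, choice))
                for choice in itertools.product(actions, repeat=len(states))]
best = max(all_policies, key=estimate_utility)
print(best)
```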

Page 10: CS 188: Artificial Intelligence Spring 2007

Utility of a State

Define the utility of a state under a policy π:
Vπ(s) = expected total (discounted) rewards starting in s and following π

Recursive definition (one-step look-ahead):
Vπ(s) = Σs' T(s, π(s), s') [ R(s, π(s), s') + γ Vπ(s') ]

Page 11: CS 188: Artificial Intelligence Spring 2007

Policy Evaluation

Idea one: turn the recursive equation into updates:
Vπ_{i+1}(s) ← Σs' T(s, π(s), s') [ R(s, π(s), s') + γ Vπ_i(s') ]

Idea two: it's just a linear system, solve with Matlab (or Mosek, or Cplex)
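A minimal sketch of both ideas in Python, using NumPy in place of Matlab. T_pi and R_pi below are the transition matrix and expected one-step rewards under some fixed policy π; the numbers are invented for illustration.

```python
import numpy as np

# Invented fixed-policy model: T_pi[s][s'] = P(s' | s, pi(s)),
# R_pi[s] = expected one-step reward of taking pi(s) in s.
T_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])
R_pi = np.array([1.0, 0.0])
gamma = 0.9

# Idea one: turn the recursive equation into repeated updates
V = np.zeros(2)
for _ in range(1000):
    V = R_pi + gamma * T_pi @ V

# Idea two: it is a linear system, (I - gamma T_pi) V = R_pi
V_exact = np.linalg.solve(np.eye(2) - gamma * T_pi, R_pi)

print(V)        # the iterates converge to ...
print(V_exact)  # ... the exact solution of the linear system
```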

Page 12: CS 188: Artificial Intelligence Spring 2007

Example: High-Low

Policy: always say "high"
Iterative updates: [table of successive value estimates omitted]

Page 13: CS 188: Artificial Intelligence Spring 2007

Optimal Utilities

Goal: calculate the optimal utility of each state
V*(s) = expected (discounted) rewards with optimal actions

Why: given the optimal utilities, MEU (maximum expected utility) tells us the optimal policy:
π*(s) = argmax_a Σs' T(s, a, s') [ R(s, a, s') + γ V*(s') ]

Page 14: CS 188: Artificial Intelligence Spring 2007

Bellman's Equation for Selecting Actions

The definition of utility leads to a simple relationship amongst optimal utility values:

Optimal rewards = maximize over the first action and then follow the optimal policy

Formally, Bellman's equation:
V*(s) = max_a Σs' T(s, a, s') [ R(s, a, s') + γ V*(s') ]

That’s my equation!

Page 15: CS 188: Artificial Intelligence Spring 2007

Example: GridWorld

Page 16: CS 188: Artificial Intelligence Spring 2007

Value Iteration

Idea:
Start with bad guesses at all utility values (e.g. V0(s) = 0)
Update all values simultaneously using the Bellman equation (called a value update or Bellman update):
V_{i+1}(s) ← max_a Σs' T(s, a, s') [ R(s, a, s') + γ V_i(s') ]
Repeat until convergence

Theorem: value iteration will converge to the unique optimal values
Basic idea: bad guesses get refined towards optimal values
The policy may converge long before the values do
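Here is a minimal value iteration sketch over dict-style T and R tables (same layout as the earlier MDP sketch; the tiny two-state MDP is again invented). It repeats the Bellman update until the largest change in any value drops below a threshold, then reads off a policy by one-step look-ahead.

```python
# Invented two-state MDP in the dict layout used earlier
states = ["A", "B"]
actions = ["stay", "go"]
T = {("A", "stay"): [("A", 1.0)], ("A", "go"): [("A", 0.5), ("B", 0.5)],
     ("B", "stay"): [("B", 1.0)], ("B", "go"): [("A", 1.0)]}
R = {("A", "stay", "A"): 0.0, ("A", "go", "A"): 1.0, ("A", "go", "B"): 5.0,
     ("B", "stay", "B"): 1.0, ("B", "go", "A"): 0.0}
gamma, epsilon = 0.9, 1e-6

def q_value(V, s, a):
    """Expected value of taking a in s and then following V's estimates."""
    return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])

V = {s: 0.0 for s in states}            # start with bad guesses: V0(s) = 0
while True:
    # Bellman (value) update, applied to all states simultaneously
    V_new = {s: max(q_value(V, s, a) for a in actions) for s in states}
    delta = max(abs(V_new[s] - V[s]) for s in states)
    V = V_new
    if delta < epsilon:                 # repeat until convergence
        break

# Read the policy off the converged values by one-step look-ahead
policy = {s: max(actions, key=lambda a: q_value(V, s, a)) for s in states}
print(V, policy)
```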

Page 17: CS 188: Artificial Intelligence Spring 2007

Example: Bellman Updates

Page 18: CS 188: Artificial Intelligence Spring 2007

Example: Value Iteration

Information propagates outward from terminal states and eventually all states have correct value estimates

[DEMO]

Page 19: CS 188: Artificial Intelligence Spring 2007

Convergence*

Define the max-norm: || U || = max_s |U(s)|

Theorem: for any two approximations U_i and V_i (any two utility vectors),
|| U_{i+1} − V_{i+1} || ≤ γ || U_i − V_i ||
I.e. any two distinct approximations must get closer to each other after a Bellman update, so, in particular, any approximation must get closer to the true U (the true U is a fixed point of the Bellman update), and value iteration converges to a unique, stable, optimal solution

Theorem: if || U_{i+1} − U_i || < ε, then || U_{i+1} − U || < εγ / (1 − γ)
I.e. once the change in our approximation is small, it must also be close to correct

Page 20: CS 188: Artificial Intelligence Spring 2007

Policy Iteration

Alternate approach:
Policy evaluation: calculate utilities for a fixed policy until convergence (remember the beginning of lecture)
Policy improvement: update the policy based on the resulting converged utilities
Repeat until the policy converges

This is policy iteration
It can converge faster than value iteration under some conditions

Page 21: CS 188: Artificial Intelligence Spring 2007

Policy Iteration

If we have a fixed policy π, use the simplified Bellman equation to calculate its utilities:
Vπ(s) = Σs' T(s, π(s), s') [ R(s, π(s), s') + γ Vπ(s') ]

For fixed utilities, it is easy to find the best action according to a one-step look-ahead:
π_new(s) = argmax_a Σs' T(s, a, s') [ R(s, a, s') + γ Vπ(s') ]
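A minimal policy iteration sketch on the same kind of invented two-state MDP: evaluate the current fixed policy with the simplified Bellman update, improve it by one-step look-ahead on the resulting utilities, and stop when the policy no longer changes.

```python
# Invented two-state MDP, same dict layout as before
states = ["A", "B"]
actions = ["stay", "go"]
T = {("A", "stay"): [("A", 1.0)], ("A", "go"): [("A", 0.5), ("B", 0.5)],
     ("B", "stay"): [("B", 1.0)], ("B", "go"): [("A", 1.0)]}
R = {("A", "stay", "A"): 0.0, ("A", "go", "A"): 1.0, ("A", "go", "B"): 5.0,
     ("B", "stay", "B"): 1.0, ("B", "go", "A"): 0.0}
gamma = 0.9

def q_value(V, s, a):
    return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])

def evaluate(policy, sweeps=500):
    """Policy evaluation: simplified Bellman updates with the action fixed by the policy."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: q_value(V, s, policy[s]) for s in states}
    return V

policy = {s: "stay" for s in states}            # arbitrary starting policy
while True:
    V = evaluate(policy)                        # policy evaluation
    improved = {s: max(actions, key=lambda a: q_value(V, s, a))
                for s in states}                # policy improvement (one-step look-ahead)
    if improved == policy:                      # repeat until the policy converges
        break
    policy = improved

print(policy, V)
```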

Page 22: CS 188: Artificial Intelligence Spring 2007

Comparison

In value iteration:
Every pass (or "backup") updates both utilities (explicitly, based on current utilities) and the policy (possibly implicitly, based on the current policy)

In policy iteration:
Several passes update utilities with the policy frozen
Occasional passes update the policy

Hybrid approaches (asynchronous policy iteration):
Any sequence of partial updates to either policy entries or utilities will converge if every state is visited infinitely often

Page 23: CS 188: Artificial Intelligence Spring 2007

Reinforcement Learning

Reinforcement learning:
Still have an MDP:
A set of states s ∈ S
A model T(s,a,s')
A reward function R(s)
Still looking for a policy π(s)

New twist: we don't know T or R
I.e. we don't know which states are good or what the actions do
Must actually try out actions and states to learn

Page 24: CS 188: Artificial Intelligence Spring 2007

Example: Animal Learning

RL studied experimentally for more than 60 years in psychology
Rewards: food, pain, hunger, drugs, etc.
Mechanisms and sophistication debated

Example: foraging
Bees learn a near-optimal foraging plan in a field of artificial flowers with controlled nectar supplies
Bees have a direct neural connection from nectar intake measurement to the motor planning area

Page 25: CS 188: Artificial Intelligence Spring 2007

Example: Backgammon

Reward only for win / loss in terminal states, zero otherwise

TD-Gammon learns a function approximation to U(s) using a neural network

Combined with depth 3 search, one of the top 3 players in the world

Page 26: CS 188: Artificial Intelligence Spring 2007

Passive Learning

Simplified task:
You don't know the transitions T(s,a,s')
You don't know the rewards R(s,a,s')
You are given a policy π(s)
Goal: learn the state values (and maybe the model)

In this case:
No choice about what actions to take
Just execute the policy and learn from experience
We'll get to the general case soon

Page 27: CS 188: Artificial Intelligence Spring 2007

Example: Direct Estimation

Episodes:


(1,1) up -1

(1,2) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) up -1

(3,3) right +100

(done)

(1,1) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) right -1

(4,2) right -100

(done)

U(1,1) ≈ (93 + -106) / 2 = -6.5

U(3,3) ≈ (98 + 100 + -102) / 3 = 32

γ = 1, R = -1 per step; grid terminal rewards: +100 and -100
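A sketch of direct estimation on the two episodes listed above, with γ = 1: each visit to a state contributes the sum of rewards observed from that visit to the end of its episode, and the estimate U(s) is the average over all visits.

```python
from collections import defaultdict

# The two episodes above as (state, action, reward) triples
episode1 = [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 2), "up", -1),
            ((1, 3), "right", -1), ((2, 3), "right", -1), ((3, 3), "right", -1),
            ((3, 2), "up", -1), ((3, 3), "right", 100)]
episode2 = [((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 3), "right", -1),
            ((2, 3), "right", -1), ((3, 3), "right", -1), ((3, 2), "right", -1),
            ((4, 2), "right", -100)]

returns = defaultdict(list)
for episode in (episode1, episode2):
    rewards = [r for _, _, r in episode]
    for i, (s, _, _) in enumerate(episode):
        returns[s].append(sum(rewards[i:]))   # return from this visit onward (gamma = 1)

U = {s: sum(g) / len(g) for s, g in returns.items()}
print(U[(1, 1)])   # (93 + -106) / 2 = -6.5
print(U[(3, 3)])   # (98 + 100 + -102) / 3 = 32.0
```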

Page 28: CS 188: Artificial Intelligence Spring 2007

Model-Based Learning

Idea:
Learn the model empirically (rather than values)
Solve the MDP as if the learned model were correct

Empirical model learning, simplest case:
Count outcomes for each (s, a)
Normalize to give an estimate of T(s,a,s')
Discover R(s,a,s') the first time we experience (s,a,s')

More complex learners are possible (e.g. if we know that all squares have related action outcomes, e.g. "stationary noise")
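The simplest case described above might look like the sketch below: count the observed outcome of every (s, a), normalize the counts into an estimate of T(s,a,s'), and record R(s,a,s') the first time each transition is experienced. The helper names (observe, T_hat, R_hat) are invented for this example.

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = times observed
R_hat = {}                                       # learned rewards R(s, a, s')

def observe(s, a, s2, r):
    """Record one experienced transition (s, a) -> s' with reward r."""
    counts[(s, a)][s2] += 1
    R_hat.setdefault((s, a, s2), r)              # first experience fixes the estimate

def T_hat(s, a, s2):
    """Normalized outcome counts as the transition estimate."""
    total = sum(counts[(s, a)].values())
    return counts[(s, a)][s2] / total if total else 0.0

# Feeding in the episodes from the next slide reproduces, e.g.,
#   T_hat((3, 3), "right", (4, 3)) = 1/3
#   T_hat((2, 3), "right", (3, 3)) = 2/2 = 1.0
```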

Page 29: CS 188: Artificial Intelligence Spring 2007

Example: Model-Based Learning

Episodes:


T(<3,3>, right, <4,3>) = 1 / 3

T(<2,3>, right, <3,3>) = 2 / 2

γ = 1; grid terminal rewards: +100 and -100

(1,1) up -1

(1,2) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) up -1

(3,3) right +100

(done)

(1,1) up -1

(1,2) up -1

(1,3) right -1

(2,3) right -1

(3,3) right -1

(3,2) right -1

(4,2) right -100

(done)

Page 30: CS 188: Artificial Intelligence Spring 2007

Model-Based Learning

In general, we want to learn the optimal policy, not evaluate a fixed policy

Idea: adaptive dynamic programming
Learn an initial model of the environment (estimates of T and R)
Solve for the optimal policy for this model (value or policy iteration)
Refine the model through experience and repeat
Crucial: we have to make sure we actually learn about all of the model
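To tie the pieces together, here is a rough sketch of that loop on an invented two-state environment (TRUE_T / TRUE_R stand in for the unknown world and are not from the lecture): the agent acts, updates its empirical model after every transition, and periodically re-solves the learned MDP with value iteration. Taking a random action 10% of the time is an ad hoc way of making sure we eventually learn about all of the model.

```python
import random
from collections import defaultdict

# Hidden true environment (invented); the agent never reads these tables directly
TRUE_T = {("A", "stay"): [("A", 1.0)], ("A", "go"): [("A", 0.5), ("B", 0.5)],
          ("B", "stay"): [("B", 1.0)], ("B", "go"): [("A", 1.0)]}
TRUE_R = {("A", "stay", "A"): 0.0, ("A", "go", "A"): 1.0, ("A", "go", "B"): 5.0,
          ("B", "stay", "B"): 1.0, ("B", "go", "A"): 0.0}
states, actions, gamma = ["A", "B"], ["stay", "go"], 0.9

counts = defaultdict(lambda: defaultdict(int))   # empirical transition counts
R_hat = {}                                       # empirical rewards
policy = {s: random.choice(actions) for s in states}

def solve(sweeps=200):
    """Value iteration on the learned model; unseen (s, a) pairs are valued at 0."""
    V = {s: 0.0 for s in states}
    def q(s, a):
        total = sum(counts[(s, a)].values())
        if total == 0:
            return 0.0
        return sum(n / total * (R_hat[(s, a, s2)] + gamma * V[s2])
                   for s2, n in counts[(s, a)].items())
    for _ in range(sweeps):
        V = {s: max(q(s, a) for a in actions) for s in states}
    return {s: max(actions, key=lambda a: q(s, a)) for s in states}

s = "A"
for step in range(2000):
    # Act (mostly) greedily, but explore occasionally so all of the model gets visited
    a = random.choice(actions) if random.random() < 0.1 else policy[s]
    next_states, probs = zip(*TRUE_T[(s, a)])
    s2 = random.choices(next_states, weights=probs)[0]
    counts[(s, a)][s2] += 1                      # refine the model through experience
    R_hat.setdefault((s, a, s2), TRUE_R[(s, a, s2)])
    if step % 100 == 0:
        policy = solve()                         # re-solve the learned MDP
    s = s2

print(policy)
```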