
6.246 Reinforcement Learning: Foundations and Methods Feb 23, 2021

Lecture 3: Markov Decision Processes
Instructor: Cathy Wu    Scribe: Athul Paul Jacob

Note: the lecture notes have not been thoroughly checked for errors and are not at the level of publication.

1 Markov Decision Processes

Last lecture, we talked about deterministic decision problems as well as a little bit of stochasticity by introducing the stochastic variant of LQR. In this lecture, we will introduce a formulation called Markov decision processes to study decision problems that have stochasticity.

1.1 Why stochastic problems?

The reasons can be roughly split into two categories:

• Stochastic Environment: This is the case where there is stochasticity in the environment itself. As such, a stochastic framework is necessary to model the problem. Various components of the environment can be a source of stochasticity:

– Uncertainty in Reward/Objective: Examples of this include problems like multi-armed bandits and contextual bandits.

– Uncertainty in Dynamics: The next state is not a deterministic function of the current state and action.

– Uncertainty in horizon: Uncertainty in the length of the problem. Stochastic Shortest Path, which will be introduced later, is an example of this.

• Stochastic Policies: Another source of stochasticity is in the policy itself. Stochastic policies are usually adopted for technical reasons, such as:

– Helping trade off exploration and exploitation

– Enabling off-policy learning

– Compatibility with maximum likelihood estimation (MLE)

For the reasons mentioned above, dynamic programming in the deterministic setting is insufficient.


Definition 1. A Markov Decision Process (MDP) is defined as a tuple M = (S, A, P, r, γ) where

• S is the state space,

• A is the action space,

• P(s′|s, a) is the transition probability, with:

P (s′|s, a) = P (st+1 = s′|st = s, at = a)

• r(s, a, s′) is the immediate reward for taking action a in state s and transitioning to s′,

• γ ∈ [0, 1) is the discount factor.

The MDP generates trajectories τt = (s0, a0, ..., st−1, at−1, st) with st+1 ∼ P (·|st, at).
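To make this concrete, here is a minimal Python sketch (not from the original lecture) of a finite MDP stored as arrays, together with trajectory sampling st+1 ∼ P(·|st, at). The class name TabularMDP and the array layout are illustrative assumptions.

```python
import numpy as np

class TabularMDP:
    """A finite MDP M = (S, A, P, r, gamma) stored as NumPy arrays (illustrative sketch)."""

    def __init__(self, P, r, gamma):
        # P[s, a, s'] = P(s' | s, a);  r[s, a, s'] = immediate reward;  0 <= gamma < 1
        self.P, self.r, self.gamma = P, r, gamma
        self.num_states, self.num_actions, _ = P.shape

    def sample_trajectory(self, s0, policy, horizon, rng=None):
        """Roll out tau_t = (s_0, a_0, ..., s_{t-1}, a_{t-1}, s_t) with s_{t+1} ~ P(.|s_t, a_t)."""
        rng = rng or np.random.default_rng()
        trajectory, s = [], s0
        for _ in range(horizon):
            a = policy(s)                                         # decision rule (see Section 3)
            s_next = rng.choice(self.num_states, p=self.P[s, a])  # s' ~ P(.|s, a)
            trajectory.extend([s, a])
            s = s_next
        trajectory.append(s)
        return trajectory
```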

Note: Two key ingredients that will be discussed later are:

• Policy: How actions are selected.

• Value function: What determines which actions (and states) are good.

The state and action spaces are generally simplified to be finite, but they can also be infinite (countably infinite or continuous). In general, a non-Markovian decision process' transitions could depend on much more information:

P(st+1 = s′ | st = s, at = a, st−1, at−1, ..., s0, a0)

such as the whole history.

1.2 Example: The Amazing Goods Company (Supply Chain)

Consider an example of a supply chain problem which can be formulated as a Markov Decision Process.

Description: In each month t, a warehouse contains st items of a specific good, and the demand for that good is D (stochastic). At the end of each month, the manager of the warehouse can order at more items from the supplier.

• The cost of maintaining an inventory s is h(s)

• The cost to order a items is C(a)

• The income for selling q items is f(q)

• If the demand d ∼ D is bigger than the available inventory s, customers that cannot be served leave.


• The value of the remaining inventory at the end of the year is g(s)

• Constraint: The store has a maximum capacity M

We can formulate the problem as an MDP as follows:

• State space: s ∈ S = {0, 1, ...,M} which is the number of goods.

• Action space: For a state s, a ∈ A(s) = {0, 1, ...,M − s}. As it is not possible to order more items than the capacity of the store, the action space depends on the current state s.

• Dynamics: st+1 = [st + at − dt]^+. The demand dt is stochastic and time-independent. Formally, dt ∼ D, i.i.d. across months.

• Reward: rt = −C(at) − h(st + at) + f([st + at − st+1]^+). This corresponds to a purchasing cost, a cost for excess stock (storage, maintenance), and a reward for fulfilling orders.

• Discount: γ = 0.95. The discount factor essentially encodes the sentiment that a dollar today is worth more than a dollar tomorrow.

Infinite horizon objective: V(s0; a0, ...) = Σ_{t=0}^∞ γ^t rt, which corresponds to the cumulative reward, plus the value of the remaining inventory.
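As a sanity check on this formulation, the sketch below simulates one month of the inventory dynamics and reward in Python. The specific capacity, cost/revenue functions, and Poisson demand distribution are placeholder assumptions; the notes only name M, C(a), h(s), f(q), and D.

```python
import numpy as np

M = 20           # maximum warehouse capacity (assumed value)
GAMMA = 0.95     # discount factor from the formulation above

# Placeholder cost/revenue functions -- the notes only name C(a), h(s), f(q).
order_cost = lambda a: 2.0 * a      # C(a)
holding_cost = lambda s: 0.5 * s    # h(s)
revenue = lambda q: 5.0 * q         # f(q)

def step(s, a, rng=None):
    """One month of the supply-chain MDP: returns (s_{t+1}, r_t)."""
    rng = rng or np.random.default_rng()
    assert 0 <= a <= M - s, "cannot order beyond remaining capacity"
    d = rng.poisson(5.0)                 # d_t ~ D (Poisson demand is an assumption)
    s_next = max(s + a - d, 0)           # s_{t+1} = [s_t + a_t - d_t]^+
    sold = s + a - s_next                # [s_t + a_t - s_{t+1}]^+
    r = -order_cost(a) - holding_cost(s + a) + revenue(sold)
    return s_next, r
```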

1.3 Example: Freeway Atari game (David Crane, 1981)

FREEWAY is an Atari 2600 video game, released in 1981. In FREEWAY, the agent must navigate a chicken (think: jaywalker) across a busy road of ten lanes of oncoming traffic. The top of the screen lists the score. After a successful crossing, the chicken is teleported back to the bottom of the screen. If hit by a car, the chicken is either forced back slightly or pushed back to the bottom of the screen, depending on which difficulty the switch is set to. One or two players can play at once.

Figure 1: Atari 2600 video game FREEWAY

Discussion: How can we devise a successful strategy for jaywalking across this busy road? We can formulate the problem as an MDP as follows:

• State space:


– Option 1: Whether there is a car, the chicken, or nothing in each location on each road lane and road shoulder, where the road is discretized by lane (10) and car length. There is also a game-over state, reached when enough damage has been done to the chicken. The velocity of the cars could also be added.

– Option 2: Another option is to consider multiple consecutive image frames of the game.

– Option 3: A fixed-size representation of the coordinates of the cars, assuming a maximum number of cars.

• Action space: up, down, left, right, or no action (movement of chicken)

• Transitions: The chicken and vehicles move based on the action selected; transitions may include new cars (randomly) entering different lanes.

• Reward: An indicator of whether the chicken is at the top of the screen (i.e., a reward for each successful crossing).

• Discount: γ = 0.999. Choose a high discount factor as we care about maximizing the overall score over time.

Infinite horizon objective: Σ_{t=0}^∞ γ^t rt, indicating the number of times the chicken crossed the road.

Figure 2: Deep reinforcement learning vs. human player. Freeway is one of the games where a DQN agent is able to exceed human performance. Mnih et al. (2013)

Some related applications:

• Self-driving cars (input from LIDAR, radar, cameras)

• Traffic signal control (input from cameras)

• Crowd navigation robot


2 MDP Assumptions

Several assumptions are made when developing MDPs, and one needs to be careful about them when modeling a problem as an MDP. Consider the Atari Breakout game:

Figure 3: Non-markovian dynamics as more information would help. Mnih et al. (2013)

Figure 4: Markovian dynamics. Mnih et al. (2013)

Fact 2. An MDP satisfies the Markov property if:

P(st+1 = s′ | τt, at) = P(st+1 = s′ | st, at, st−1, at−1, ..., s0, a0) = P(st+1 = s′ | st = s, at = a)

i.e., the current state st and action at are sufficient for predicting the next state s′.

As discussed previously, game states can be encoded as frames. However, the formulation in figure 3 is non-Markovian because we do not know which direction the ball is travelling in from just one image. In Mnih et al. (2013), the authors use multiple frames (see figure 4) to encode such information, which could possibly make the state Markovian.
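As a concrete illustration of this frame-stacking idea (a minimal sketch, loosely inspired by Mnih et al. (2013) but not their exact pipeline), the observation passed to the agent can be the last k frames, from which velocity information is recoverable:

```python
from collections import deque

import numpy as np

class FrameStack:
    """Keep the k most recent frames so the stacked observation is (approximately) Markovian."""

    def __init__(self, k=4, frame_shape=(84, 84)):
        # The frame count and resolution are assumptions for the sketch.
        self.frames = deque([np.zeros(frame_shape)] * k, maxlen=k)

    def observe(self, new_frame):
        self.frames.append(new_frame)           # oldest frame is dropped automatically
        return np.stack(self.frames, axis=0)    # shape (k, H, W); one frame alone loses velocity
```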

Assumption 3. Time assumption: Time is discrete

t→ t+ 1

Possible relaxation:

• Identify proper time granularity

• Most of the MDP literature extends to continuous time


Figure 5: Too fine-grained resolution

Figure 6: Too coarse-grained resolution

Identifying the proper time granularity is important, as one can observe from the examples in figure 5, where it is too fine-grained, and in figure 6, where it is too coarse-grained. Having too fine-grained a time resolution would give a very long horizon learning problem, which can be challenging.

Assumption 4. Reward assumption: The reward is uniquely defined by a transition (or part of it)

r(s, a, s′)

Possible relaxation:

• Various notions of rewards: global or local reward function.

• Move to inverse reinforcement learning (IRL) to induce the reward function from desired behaviours.

Assumption 5. Stationarity assumption: The dynamics and reward do not change over time; that is,

p(s′|s, a) = P(st+1 = s′ | st = s, at = a)

r(s, a, s′)

This is often the biggest assumption, especially in real-world contexts. Some types of non-stationarities can be handled.

Possible relaxation:

• Identify and add/remove the non-stationary components (e.g. cyclo-stationary dynamics as seen in traffic).

• Identify the time-scale of the changes

• Work on finite horizon problems


3 Policy

Definition 6. A decision rule d can be:

• Deterministic: d : S → A,

• Stochastic: d : S → ∆(A),

• History-dependent: d : Ht → A,

• Markov: d : S → ∆(A),

In essence, a decision rule maps the available information (the current state, or the full history) to an action or to a probability distribution over actions.

Definition 7. A policy (strategy, plan) π can be:

• Non-stationary: π = (d0, d1, d2, ...),

• Stationary: π = (d, d, d, ...),

A policy is a sequence of decision rules. You have as many decision rules as there are time-steps in the horizon.

Fact 8. An MDP M combined with a stationary policy π = (d, d, d, ...) yields a Markov chain with state space S and transition probability p(s′|s) = p(s′|s, d(s)).

For simplicity, π will be used instead of d for stationary policies, and πt instead of dt for non-stationary policies.
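Fact 8 can be checked mechanically: under a stationary deterministic policy, the induced chain's transition matrix simply selects, for each state, the row of P corresponding to the chosen action. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def induced_markov_chain(P, d):
    """Given P[s, a, s'] = P(s'|s, a) and a deterministic decision rule d (an integer
    array with d[s] in A), return the |S| x |S| matrix with entries p(s'|s) = P(s'|s, d(s))."""
    num_states = P.shape[0]
    return P[np.arange(num_states), d, :]
```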

3.1 The Amazing Goods Company (Supply Chain) Example

In this section, we look at what different types of policies and decision rules would look like in this example.

• Stationary policy composed of deterministic Markov decision rules

π(s) = M − s if s ≤ M/4, and 0 otherwise

• Stationary policy composed of stochastic history-dependent decision rules

π(st) = U(M − st, M − st + 10) if st ≤ st−1/2, and 0 otherwise

• Non-stationary policy composed of deterministic Markov decision rules

πt(s) = M − s if t ≤ 6, and ⌊(M − s)/5⌋ otherwise

As one can see, any combination of different types of decision rules and policies can be constructed.
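For concreteness, the three example policies above could be written as follows (a sketch; the capacity M is an assumed value, and the uniform draw is interpreted here as an integer-valued one):

```python
import numpy as np

M = 20  # warehouse capacity (assumed value)

def stationary_deterministic(s):
    # pi(s) = M - s if s <= M/4, and 0 otherwise
    return M - s if s <= M / 4 else 0

def stationary_stochastic_history_dependent(s_t, s_prev, rng=None):
    # pi(s_t) ~ U(M - s_t, M - s_t + 10) if s_t <= s_{t-1}/2, and 0 otherwise
    rng = rng or np.random.default_rng()
    return int(rng.integers(M - s_t, M - s_t + 11)) if s_t <= s_prev / 2 else 0

def nonstationary_deterministic(s, t):
    # pi_t(s) = M - s if t <= 6, and floor((M - s) / 5) otherwise
    return M - s if t <= 6 else (M - s) // 5
```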


4 State Value Function

The state value function is what we are optimizing for. Although there are several things that we can optimize for, we are generally trying to maximize the cumulative rewards.

Given a policy π = (d1, d2, ...) (deterministic to simplify notation)

• Infinite time horizon with discount: The problem never terminates, but rewards which are closer in time receive higher importance

V^π(s) = E[ Σ_{t=0}^∞ γ^t r(st, πt(ht)) | s0 = s; π ]

with discount factor 0 ≤ γ ≤ 1:

– A small γ emphasizes short-term rewards; a large γ emphasizes long-term rewards

– For any γ ∈ [0, 1) the series always converges (for bounded rewards)

This is the most popular formulation that is used in practice. It is used when there is uncertainty about the deadline and/or an intrinsic definition of discount.

• Finite time horizon T : Deadline at time T , the agent focuses on the sum of the rewards up to T .

V^π(s, t) = E[ Σ_{τ=t}^{T−1} r(sτ, πτ(hτ)) + R(sT) | st = s; π = (πt, ..., πT) ]

where R(sT) is a value function for the final state. It is used when there is an intrinsic deadline to meet, e.g., this course has a fixed deadline.

• Stochastic shortest path T: The problem never terminates, but the agent will eventually reach a termination state.

V^π(s) = E[ Σ_{t=0}^T r(st, πt(ht)) | s0 = s; π ]

where T is the first random time when the termination state is achieved. These are less discussed but are pertinent to many applications that we will discuss. This is often used when there is a specific goal condition, such as a car reaching a destination.

• Infinite time horizon with average reward: The problem never terminates, but the agent only focuses on the (expected) average of the rewards.

V^π(s) = lim_{T→∞} E[ (1/T) Σ_{t=0}^{T−1} r(st, πt(ht)) | s0 = s; π ]

The 1/T factor is essential for the limit to be finite. This is often used when the system needs to be constantly controlled over time, e.g., a medical implant.

Note: The expectations refer to all possible stochastic trajectories. A (possibly non-stationary, stochastic) policy π applied from state s0 returns

(s0, r0, s1, r1, s2, r2, ...)

where rt = r(st, πt(ht)) and st ∼ p(·|st−1, at−1 = πt−1(ht−1)) are random realizations. More generally, for stochastic policies:

V^π(s) = E_{a0,s1,a1,s2,...}[ Σ_{t=0}^∞ γ^t r(st, πt(ht)) | s0 = s; π ]

From now on we will mostly work in the discounted infinite horizon setting.
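Since we will work mostly with the discounted criterion, the following hedged sketch estimates V^π(s) by Monte Carlo, truncating the infinite sum at a horizon where γ^T is negligible; the `step(s, a) -> (s_next, r)` and `policy(s) -> a` interfaces are assumptions, not part of the notes.

```python
import numpy as np

def mc_value_estimate(step, policy, s0, gamma, num_rollouts=1000, eps=1e-3):
    """Monte Carlo estimate of V^pi(s0) = E[ sum_t gamma^t r_t | s_0 = s0; pi ].

    The infinite sum is truncated at the first T with gamma^T < eps, which bounds the
    truncation error for bounded rewards.
    """
    horizon = int(np.ceil(np.log(eps) / np.log(gamma)))
    returns = []
    for _ in range(num_rollouts):
        s, total, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r = step(s, a)
            total += discount * r
            discount *= gamma
        returns.append(total)
    return float(np.mean(returns))
```

With γ = 0.95 and eps = 10^-3, for instance, the truncation horizon works out to about 135 steps.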


5 Optimization Problem

Definition 9. Optimal policy and optimal value function: The solution to an MDP is an optimal policy π* satisfying

π* ∈ arg max_{π∈Π} V^π(s)

in all states s ∈ S, where Π is some policy set of interest. The corresponding value function is the optimal value function

V* = V^{π*}

The optimal policy maximizes the value for every state.

Limitations

• Average case: All previous value functions define an objective in expectation.

• Imperfect information (partial observations)

• Time delays

• Correlated disturbances

6 Dynamic Programming for MDPs

Consider the dynamic programming algorithm for the deterministic problems in Figure 7.

Figure 7: Dynamic programming for deterministic problems

We will shortly show that an algorithm with a similar form can be used for MDPs too.

• Finite horizon deterministic (e.g. shortest path routing, travelling salesperson)

V*_T(sT) = rT(sT)   ∀ sT

V*_t(st) = max_{at∈A} [ rt(st, at) + V*_{t+1}(st+1) ]   ∀ st,  t = T−1, ..., 0

where st+1 is the next state determined by st and at.

• Finite horizon stochastic and Markov problems (e.g. driving, robotics, games)

V*_T(sT) = rT(sT)   ∀ sT

V*_t(st) = max_{at∈A} [ rt(st, at) + E_{st+1∼P(·|st,at)} V*_{t+1}(st+1) ]   ∀ st,  t = T−1, ..., 0

(A code sketch of this backward induction follows this list.)

• For discounted infinite horizon problems (e.g. package delivery over months or years, long-term customer satisfaction, control of autonomous vehicles), we have the following optimal value function.

V*(s) = max_{a∈A} [ r(s, a) + γ E_{s′∼P(·|s,a)} V*(s′) ]   ∀ s

This is known as the optimal Bellman equation. From this, the optimal policy can be extracted as:

π*(s) = arg max_{a∈A} [ r(s, a) + γ E_{s′∼P(·|s,a)} V*(s′) ]   ∀ s
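As promised above, here is a minimal backward-induction sketch for the finite horizon stochastic case, assuming (unlike the time-indexed rt in the notes, and purely for simplicity) a time-homogeneous transition array P[s, a, s'] and reward array r[s, a]:

```python
import numpy as np

def backward_induction(P, r, terminal_reward, T):
    """Finite-horizon stochastic DP. Returns V[t, s] = V*_t(s) and the optimal
    non-stationary policy pi[t, s]. Time-homogeneous P and r are a simplifying assumption."""
    num_states, num_actions, _ = P.shape
    V = np.zeros((T + 1, num_states))
    pi = np.zeros((T, num_states), dtype=int)
    V[T] = terminal_reward                      # V*_T(s_T) = r_T(s_T)
    for t in range(T - 1, -1, -1):
        # Q[s, a] = r(s, a) + E_{s' ~ P(.|s, a)}[ V*_{t+1}(s') ]
        Q = r + P @ V[t + 1]
        V[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)
    return V, pi
```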


Question: Any difficulties with this new "algorithm"? This is not an algorithm yet, since V* is defined in terms of itself.

7 Value Iteration Algorithm

With this, we can construct the value iteration algorithm as follows:

1. Let V0(s) be any function V0 : S → R [Note: not stage 0, but iteration 0]

2. Apply the principle of optimality so that given Vi at iteration i, we compute

Vi+1(s) = max_{a∈A} [ r(s, a) + γ E_{s′∼P(·|s,a)} Vi(s′) ]   ∀ s

3. Terminate when Vi stops improving, e.g. when maxs |Vi+1(s)− Vi(s)| is small.

4. Return the greedy policy:

πK(s) = arg max_{a∈A} [ r(s, a) + γ E_{s′∼P(·|s,a)} VK(s′) ]   ∀ s

where K denotes the final iteration.

Definition 10. Optimal Bellman Operator: For any W ∈ R^|S|, the optimal Bellman operator is defined as:

(TW)(s) = max_{a∈A} [ r(s, a) + γ E_{s′∼P(·|s,a)} W(s′) ]   ∀ s

With this, the value iteration algorithm above can be written concisely as:

Vi+1(s) = TVi(s) ∀s
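Combining Definition 10 with the algorithm above, a minimal tabular sketch (again assuming arrays P[s, a, s'] and r[s, a], which the notes do not prescribe) might look like:

```python
import numpy as np

def bellman_operator(V, P, r, gamma):
    """(T V)(s) = max_a [ r(s, a) + gamma * E_{s' ~ P(.|s, a)} V(s') ]."""
    return (r + gamma * (P @ V)).max(axis=1)

def value_iteration(P, r, gamma, tol=1e-6):
    """Iterate V_{i+1} = T V_i until max_s |V_{i+1}(s) - V_i(s)| is small."""
    V = np.zeros(P.shape[0])                    # any V_0 : S -> R works
    while True:
        V_next = bellman_operator(V, P, r, gamma)
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next

def greedy_policy(V, P, r, gamma):
    """pi_K(s) = argmax_a [ r(s, a) + gamma * E_{s' ~ P(.|s, a)} V(s') ]."""
    return (r + gamma * (P @ V)).argmax(axis=1)
```

For instance, value_iteration could be run on a tabulated version of the supply-chain MDP from Section 1.2 (with states s ∈ {0, ..., M}), and greedy_policy would then recover an optimal ordering rule.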

The proof of the optimal Bellman equation leverages the definition of the value function, as well as the Markov and change-of-time properties.

Proof: The Optimal Bellman Equation

V*(s) = max_π E[ Σ_{t=0}^∞ γ^t r(st, πt(ht)) | s0 = s; π ]

= max_{a,π′} [ r(s, a) + γ Σ_{s′} p(s′|s, a) V^{π′}(s′) ]

= max_a [ r(s, a) + γ Σ_{s′} p(s′|s, a) max_{π′} V^{π′}(s′) ]

= max_a [ r(s, a) + γ Σ_{s′} p(s′|s, a) V*(s′) ]


8 Summary

• Stochastic problems are needed to represent uncertainty in the environment.

• Markov Decision Processes (MDPs) represent a general class of stochastic sequential decision problems, for which reinforcement learning methods are commonly designed. MDPs enable a discussion of model-free learning.

• The Markovian property means that the next state is fully determined by the current state and action.

• Although quite general, MDPs bake in numerous assumptions. Care should be taken when modeling a problem as an MDP.

• Similarly, care should be taken to select an appropriate type of policy and value function, depending on the use case.

• Finally, dynamic programming for the deterministic setting can also be extended to MDPs. In particular, we introduce the optimal Bellman operator and the value iteration algorithm.

9 Contributions

Athul Paul Jacob contributed to this draft of the lecture. TA Sirui Li reviewed the draft.

References

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
