
Machine Learning for Robotics
Intelligent Systems Series

Lecture 7

Georg Martius

MPI for Intelligent Systems, Tübingen, Germany

June 12, 2017


Markov Decision Processes


Markov Process

A Markov process is a memoryless random process, i.e. a sequence of random states S_1, S_2, . . . with the Markov property.

Reminder: Markov property
A state S_t is Markov if and only if

P(S_{t+1} | S_t) = P(S_{t+1} | S_1, . . . , S_t)

Definition (Markov Process / Markov Chain)
A Markov Process (or Markov Chain) is a tuple (S, P)
• S is a (finite) set of states
• P is a state transition probability matrix,

P_{ss′} = P(S_{t+1} = s′ | S_t = s)


Example: Student Markov Chain

(figure: state-transition diagram of the student Markov chain)

Example: Student Markov Chain Episodes

Sample episodes starting from S_1 = C1:

S_1, S_2, . . . , S_T

• C1 C2 C3 Pass Sleep
• C1 FB FB C1 C2 Sleep
• C1 C2 C3 Pub C2 C3 Pass Sleep
• C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep


Example: Student Markov Chain Transition Matrix

(figure: transition matrix of the student Markov chain)
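
As a concrete illustration, here is a minimal NumPy sketch of the student Markov chain that samples episodes like the ones listed above. The transition probabilities follow David Silver's student-chain example; since the figure is not reproduced in this transcript, treat the exact numbers as assumptions.

```python
import numpy as np

# States of the student Markov chain; Sleep is the terminal state.
states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]

# Transition matrix P[s, s'] = P(S_{t+1} = s' | S_t = s).
# Values follow David Silver's example (assumed here, figure not shown).
P = np.array([
    # C1   C2   C3   Pass Pub  FB   Sleep
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep, modelled as absorbing
])
assert np.allclose(P.sum(axis=1), 1.0)  # every row is a probability distribution

def sample_episode(start="C1", rng=np.random.default_rng(0)):
    """Sample S_1, S_2, ..., S_T until the terminal state Sleep is reached."""
    s = states.index(start)
    episode = [states[s]]
    while states[s] != "Sleep":
        s = rng.choice(len(states), p=P[s])
        episode.append(states[s])
    return episode

print(sample_episode())  # e.g. ['C1', 'C2', 'C3', 'Pass', 'Sleep']
```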

Markov Reward Process

A Markov reward process is a Markov chain with values.

Definition (MRP)
A Markov Reward Process is a tuple (S, P, R, γ)
• S is a finite set of states
• P is a state transition probability matrix,

P_{ss′} = P(S_{t+1} = s′ | S_t = s)

• R is a reward function, R_s = E[R_{t+1} | S_t = s]
• γ is a discount factor, γ ∈ [0, 1]

Note that the reward can be stochastic (R_s is its expectation).

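Continuing the sketch above (reusing `states` and `P`), an MRP just adds a reward vector and a discount factor to the chain. The reward values again follow David Silver's student example and should be read as assumptions.

```python
# Markov Reward Process: the student chain plus rewards and a discount factor.
# R[s] = E[R_{t+1} | S_t = s]; values assumed from David Silver's example.
#               C1    C2    C3   Pass   Pub    FB  Sleep
R = np.array([-2.0, -2.0, -2.0, 10.0,  1.0, -1.0,  0.0])
gamma = 0.9  # discount factor, any value in [0, 1]

student_mrp = (states, P, R, gamma)  # the tuple (S, P, R, γ)
```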

Example: Student MRP

(figure: the student MRP, i.e. the student Markov chain annotated with rewards)

Return

Definition
The return G_t is the total discounted reward from time-step t:

G_t = R_{t+1} + γ R_{t+2} + · · · = ∑_{k=0}^{∞} γ^k R_{t+k+1}

• The discount γ ∈ [0, 1] devalues future rewards: a reward R received after k + 1 time-steps is counted as γ^k R.
• Extreme cases:
  γ close to 0 leads to maximizing immediate rewards only
  γ close to 1 leads to far-sighted evaluation

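To make the definition concrete, the following sketch (reusing `states`, `P`, `R`, `gamma`, and `sample_episode` from the snippets above) computes the discounted return G_1 of one sampled episode.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + ... for a finite reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

episode = sample_episode()  # e.g. ['C1', 'C2', 'C3', 'Pass', 'Sleep']
# With R_s = E[R_{t+1} | S_t = s], the reward R_{t+1} belongs to the state S_t
# that is being left, so the terminal state contributes nothing.
reward_seq = [R[states.index(s)] for s in episode[:-1]]
print(discounted_return(reward_seq, gamma))
```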

Discussion on discounting

Why is discounting often used?
• It is mathematically convenient to discount rewards (keeps returns finite)
• It is a way to model uncertainty about the future (since the model may not be exact)
• Animal/human behaviour shows a preference for immediate rewards

In some cases undiscounted Markov reward processes (i.e. γ = 1) are considered, e.g. if all sequences terminate.


Value Function

The value function describes the value of a state (in the stationary state)

Definition
The state-value function v(s) of an MRP is the expected return starting from state s:

v(s) = E[G_t | S_t = s]


Example: Value Function for Student MRP

(figure from David Silver)


Bellman Equation (MRP) I

Idea: Make the value computation recursive by separating the contributions from:
• the immediate reward
• and the discounted future rewards

v(s) = E[G_t | S_t = s]
     = E[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + · · · | S_t = s]
     = E[R_{t+1} + γ G_{t+1} | S_t = s]
     = E[R_{t+1} + γ v(S_{t+1}) | S_t = s]

Hmm... we still need the expectation over S_{t+1}.
Use the transition matrix to get the probabilities of the successor states:

v(s) = R_s + γ ∑_{s′∈S} P_{ss′} v(s′)


Example: Bellman Equation for Student MRP

(figure from David Silver)

Bellman Equation (MRP) II

Bellman equations in matrix form:

v = R + γ P v

where v ∈ ℝ^{|S|} and R ∈ ℝ^{|S|} are vectors.

The Bellman equation can be solved directly:

v = (I − γP)^{−1} R

• computational complexity is O(|S|³)

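A minimal sketch of the direct solution, reusing `P`, `R`, and `gamma` from the student MRP above. It uses `np.linalg.solve` instead of forming the inverse explicitly, which is the usual numerical practice; the O(|S|³) cost is the same.

```python
import numpy as np

def solve_mrp(P, R, gamma):
    """Solve v = R + gamma * P v exactly, i.e. v = (I - gamma*P)^{-1} R."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)

v = solve_mrp(P, R, gamma)  # only practical for small state spaces
for name, value in zip(states, v):
    print(f"v({name}) = {value:+.2f}")
```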

Markov Decision Process

A Markov reward process has no agent; there is no way to influence the system.
An MRP with an active agent forms a Markov Decision Process.
• The agent takes decisions by executing actions
• The state is Markovian

Definition (MDP)
A Markov Decision Process is a tuple (S, A, P, R, γ)
• S is a finite set of states
• A is a finite set of actions
• P is a state transition probability matrix,

P^a_{ss′} = P(S_{t+1} = s′ | S_t = s, A_t = a)

• R is a reward function, R^a_s = E[R_{t+1} | S_t = s, A_t = a]
• γ is a discount factor, γ ∈ [0, 1]

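In code, a finite MDP of this kind can be stored as two arrays indexed by action and state. The container below is my own generic sketch (the class and field names are not from the lecture); the later snippets reuse it.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class FiniteMDP:
    """A finite MDP (S, A, P, R, gamma) stored as NumPy arrays."""
    states: list       # labels for S
    actions: list      # labels for A
    P: np.ndarray      # P[a, s, s'] = P(S_{t+1} = s' | S_t = s, A_t = a)
    R: np.ndarray      # R[a, s]     = E[R_{t+1} | S_t = s, A_t = a]
    gamma: float       # discount factor in [0, 1]

    def __post_init__(self):
        n_a, n_s = len(self.actions), len(self.states)
        assert self.P.shape == (n_a, n_s, n_s) and self.R.shape == (n_a, n_s)
        assert np.allclose(self.P.sum(axis=2), 1.0), "each P[a, s, :] must sum to 1"
```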

Example: Student MDP

(figure from David Silver)

How to model decision taking?

The agent has an action-selection function called a policy.

Definition
A policy π is a distribution over actions given states,

π(a|s) = P(A_t = a | S_t = s)

• Since it is a Markov process, the policy depends only on the current state
• Implication: policies are stationary (independent of time)

An MDP with a given policy turns into an MRP:

P^π_{ss′} = ∑_{a∈A} π(a|s) P^a_{ss′}

R^π_s = ∑_{a∈A} π(a|s) R^a_s

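A small sketch of this reduction using the hypothetical `FiniteMDP` container from above: given a policy stored as a matrix `pi[s, a] = π(a|s)`, it returns the induced MRP quantities P^π and R^π.

```python
import numpy as np

def induced_mrp(mdp: FiniteMDP, pi: np.ndarray):
    """Collapse an MDP under a fixed policy pi[s, a] = pi(a|s) into an MRP.

    P_pi[s, s'] = sum_a pi(a|s) * P[a, s, s']
    R_pi[s]     = sum_a pi(a|s) * R[a, s]
    """
    P_pi = np.einsum("sa,ast->st", pi, mdp.P)
    R_pi = np.einsum("sa,as->s", pi, mdp.R)
    return P_pi, R_pi

# Example: a uniform random policy, pi(a|s) = 1/|A| for all s, a.
# pi_uniform = np.full((len(mdp.states), len(mdp.actions)), 1.0 / len(mdp.actions))
```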

Modelling expected returns in MDP

How good is each state when we follow the policy π?

Definition
The state-value function v_π(s) of an MDP is the expected return when starting from state s and then following policy π:

v_π(s) = E_π[G_t | S_t = s]

Should we change the policy?
How much does choosing a different action change the value?

Definition
The action-value function q_π(s, a) of an MDP is the expected return when starting from state s, taking action a, and then following policy π:

q_π(s, a) = E_π[G_t | S_t = s, A_t = a]


Example: State-Value function for Student MDP

(figure from David Silver)

Bellman Expectation Equation

Recall the Bellman equation: decompose the expected return into the immediate reward plus the discounted value of the successor state,

v_π(s) = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]

The action-value function can be decomposed similarly,

q_π(s, a) = E_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]


Bellman Equation: joint update of v_π and q_π

The value function can be derived from q_π:

v_π(s) = ∑_{a∈A} π(a|s) q_π(s, a)

. . . and q_π can be computed from the transition model:

q_π(s, a) = R^a_s + γ ∑_{s′∈S} P^a_{ss′} v_π(s′)

Substituting q in v:

v_π(s) = ∑_{a∈A} π(a|s) ( R^a_s + γ ∑_{s′∈S} P^a_{ss′} v_π(s′) )

Substituting v in q:

q_π(s, a) = R^a_s + γ ∑_{s′∈S} P^a_{ss′} ∑_{a′∈A} π(a′|s′) q_π(s′, a′)

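As a sketch of the second equation, given a state-value vector v_π the action values follow from the model in one vectorised backup (again assuming the `FiniteMDP` layout from the earlier snippets).

```python
def q_from_v(mdp: FiniteMDP, v: np.ndarray) -> np.ndarray:
    """q_pi[a, s] = R[a, s] + gamma * sum_{s'} P[a, s, s'] * v[s']."""
    return mdp.R + mdp.gamma * (mdp.P @ v)  # P @ v broadcasts over the action axis
```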

Example: Bellman update for v in Student MDP

(figure from David Silver; uniform random policy π(a|s) = 0.5)

Explicit solution for vπ

Since a policy induces an MRP, v_π can be computed directly (as before):

v_π = (I − γ P^π)^{−1} R^π

But do we want vπ?

We want to find the optimal policy and its value function!

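A sketch of this exact policy evaluation, combining the `induced_mrp` helper with a linear solve; as before, `FiniteMDP` and the helper names are my own illustrations rather than code from the lecture.

```python
def evaluate_policy(mdp: FiniteMDP, pi: np.ndarray) -> np.ndarray:
    """Exact policy evaluation: v_pi = (I - gamma * P_pi)^{-1} R_pi."""
    P_pi, R_pi = induced_mrp(mdp, pi)
    n = len(mdp.states)
    return np.linalg.solve(np.eye(n) - mdp.gamma * P_pi, R_pi)
```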

Optimal Value Function

Definition
The optimal state-value function v∗(s) is the maximum value function over all policies:

v∗(s) = max_π v_π(s)

Definition
The optimal action-value function q∗(s, a) is the maximum action-value function over all policies:

q∗(s, a) = max_π q_π(s, a)

What does it mean?
• v∗ specifies the best possible performance in an MDP
• Knowing v∗ solves the MDP (how? we will see. . . )


Example: Optimal Value Function v∗ in Student MDP

(figure from David Silver)

Example: Optimal Action-Value Function q∗ in Student MDP

(figure from David Silver)

Optimal Policy

Actually solving the MDP means we also have the optimal policy.

Define a partial ordering over policies:

π ≥ π′ if v_π(s) ≥ v_π′(s), ∀s

Theorem
For any Markov Decision Process
• There exists an optimal policy π∗ that is better than or equal to all other policies, π∗ ≥ π, ∀π
• All optimal policies achieve the optimal state-value function, v_{π∗}(s) = v∗(s)
• All optimal policies achieve the optimal action-value function, q_{π∗}(s, a) = q∗(s, a)


Finding an Optimal Policy

Given the optimal action-value function q∗, the optimal policy is obtained by maximizing it:

π∗(a|s) = ⟦a = argmax_{a∈A} q∗(s, a)⟧

⟦·⟧ is the Iverson bracket: 1 if the condition is true, otherwise 0.
• There is always a deterministic optimal policy for any MDP
• If we know q∗(s, a), we immediately have the optimal policy (greedy)

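A sketch of the greedy extraction, assuming q∗ is available as an array `q_star[a, s]` in the layout used by the earlier snippets; the one-hot row plays the role of the Iverson bracket.

```python
import numpy as np

def greedy_policy(q_star: np.ndarray) -> np.ndarray:
    """Deterministic policy pi[s, a] = 1 if a = argmax_a' q*(s, a'), else 0."""
    n_actions, n_states = q_star.shape
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), q_star.argmax(axis=0)] = 1.0
    return pi
```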

Bellman Equation for optimal value functions

For the optimal value functions we can also use Bellman's optimality equations:

v∗(s) = max_{a∈A} q∗(s, a)

q∗(s, a) = R^a_s + γ ∑_{s′∈S} P^a_{ss′} v∗(s′)

Substituting q in v:

v∗(s) = max_{a∈A} ( R^a_s + γ ∑_{s′∈S} P^a_{ss′} v∗(s′) )

Substituting v in q:

q∗(s, a) = R^a_s + γ ∑_{s′∈S} P^a_{ss′} max_{a′∈A} q∗(s′, a′)


Solving the Bellman Optimality Equation

The Bellman optimality equation is non-linear
• No closed-form solution (in general)
• Many iterative solution methods:
  Value Iteration (a minimal sketch follows below)
  Policy Iteration
  Q-learning
  SARSA

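As one example of such an iterative method, here is a minimal value-iteration sketch built on the Bellman optimality backup, reusing the hypothetical `FiniteMDP` container and the `greedy_policy` helper from above. It is an illustration of the idea, not the lecture's reference implementation.

```python
import numpy as np

def value_iteration(mdp: FiniteMDP, tol: float = 1e-8, max_iters: int = 10_000):
    """Iterate v <- max_a [ R^a_s + gamma * sum_{s'} P^a_{ss'} v(s') ] to convergence."""
    v = np.zeros(len(mdp.states))
    for _ in range(max_iters):
        q = mdp.R + mdp.gamma * (mdp.P @ v)  # Bellman optimality backup, q[a, s]
        v_new = q.max(axis=0)                # v(s) = max_a q(s, a)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    return q.max(axis=0), greedy_policy(q)   # approximately v*, and a greedy policy
```

Once v has converged, acting greedily with respect to the corresponding q is optimal, which is exactly the sense in which knowing the optimal value function solves the MDP.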