Undiscounted infinite-horizon DP
Undiscounted infinite-horizon DP: Stochastic shortest path & average reward DP
Cathy Wu. 6.246 Reinforcement Learning: Foundations and Methods
Mar 2, 2021
References
1. Dimitri Bertsekas. Dynamic Programming and Optimal Control (DPOC), Vol. 1, Chapter 5.
2. Dimitri Bertsekas. MIT 6.231 Dynamic Programming and Stochastic Control. Fall 2015, Lectures 10-12 & 17-18.
3. Daniela Pucci De Farias. MIT 2.997 Decision-Making in Large-Scale Systems. Spring 2004, Lectures 4-5.
4. Dimitrios Katselis. UIUC ECE586 MDPs and Reinforcement Learning. Spring 2019, Lecture 12. Acknowledgement: R. Srikant.
Outline
1. Undiscounted problems
2. Stochastic shortest path
3. Average reward dynamic programming
Undiscounted Problems

§ System: $x_{k+1} = f(x_k, u_k, w_k)$
§ Value of a policy $\pi = \{\mu_0, \mu_1, \dots\}$:
$$J_\pi(x_0) = \limsup_{N\to\infty} \; \mathop{\mathbb{E}}_{w_k,\, k=0,1,\dots} \left[ \sum_{k=0}^{N-1} g\big(x_k, \mu_k(x_k), w_k\big) \right]$$
§ Note that $J_\pi(x_0)$ and $J^*(x_0)$ can be $+\infty$ or $-\infty$.
§ Shorthand notation for DP mappings:
$$(TV)(x) = \max_{u \in U(x)} \mathbb{E}_w\Big[ g(x,u,w) + V\big(f(x,u,w)\big) \Big], \quad \forall x$$
$$(T_\mu V)(x) = \mathbb{E}_w\Big[ g\big(x,\mu(x),w\big) + V\big(f(x,\mu(x),w)\big) \Big], \quad \forall x$$
§ $T$ and $T_\mu$ need not be contractions in general, but their monotonicity is helpful (see DPOC Vol. 2, Ch. 4).
§ Stochastic shortest path (SSP) problems provide a “soft boundary” between the easy finite-state discounted problems and the hard undiscounted problems.
• They share features of both.
• Some nice theory is recovered thanks to the termination state and special conditions.
“Easy” and “Difficult” Problems
§ Easy problems
• All of them are finite-state, finite-control
• Bellman’s equation has a unique solution
• Optimal policies are obtained from Bellman’s equation
• Value and policy iteration algorithms apply
§ Somewhat complicated problems [last week, today]
• Infinite state, discounted, bounded $g$ (contractive structure)
• Finite-state SSP with “nearly” contractive structure
• Bellman’s equation has a unique solution; value and policy iteration work
§ Difficult problems (with additional structure) [today]
• Infinite state, $g \ge 0$ or $g \le 0$ for all $(x, u, w)$; deterministic problems
• SSP without contractive structure
• Average reward
§ Hugely large and/or model-free problems [next lectures]
• Big state space and/or simulation model
• Approximate DP methods
§ Continuous, measure theoretic formulations (not in this course)
Outline
1. Undiscounted problems
2. Stochastic shortest path
   a. Results overview
   b. Connection to discounted problems
   c. Analysis sketch
   d. Significance of proper policies
   e. Analysis sketch (sequel)
3. Average reward dynamic programming
Stochastic Shortest Path Problems

§ Assume a finite-state system: states $1, \dots, n$ and a special cost-free termination state $t$
• Transition probabilities $p_{ij}(u)$
• Action/control constraints $u \in U(i)$ (finite set)
• Value of policy $\pi = \{\mu_0, \mu_1, \dots\}$ is:
$$J_\pi(i) = \lim_{N\to\infty} \mathbb{E}\left[ \sum_{k=0}^{N-1} g\big(x_k, \mu_k(x_k)\big) \;\Big|\; x_0 = i \right]$$
• Bounded $g$
• Optimal policy if $J_\pi(i) = J^*(i)$ for all $i$.
§ Assumption (Termination inevitable): There exists an integer $m$ such that for every policy and initial state, there is positive probability that the termination state will be reached after no more than $m$ steps; for all $\pi$, we have
$$\rho_\pi = \max_{i=1,\dots,n} \; \mathbb{P}\big(x_m \neq t \mid x_0 = i, \pi\big) < 1$$
Assumption

§ Assumption (Termination inevitable): There exists an integer $m$ such that for every policy and initial state, there is positive probability that the termination state $t$ will be reached after no more than $m$ steps; for all $\pi$, we have
$$\rho_\pi = \max_{i=1,\dots,n} \; \mathbb{P}\big(x_m \neq t \mid x_0 = i, \pi\big) < 1$$
§ Note: We have $\rho = \max_\pi \rho_\pi < 1$, which is “tractable” since $\rho_\pi$ depends only on the first $m$ components of $\pi$.
§ Shortest path routing examples:
• acyclic (assumption is satisfied)
• nonacyclic (assumption is not satisfied)
Discounted Problems

§ Assume a discount factor $\alpha < 1$.
§ Conversion to an SSP problem: scale the transition probabilities by $\alpha$ and send the remaining probability $1-\alpha$ to a cost-free termination state.
§ The $k$th stage cost is the same for both problems.
§ Value iteration converges to $V^*$ for all initial $V_0$:
$$V_{k+1}(i) = \max_{u \in U(i)} \left[ g(i,u) + \alpha \sum_{j=1}^{n} p_{ij}(u)\, V_k(j) \right], \quad \forall i$$
§ $V^*$ is the unique solution of Bellman’s equation:
$$V^*(i) = \max_{u \in U(i)} \left[ g(i,u) + \alpha \sum_{j=1}^{n} p_{ij}(u)\, V^*(j) \right], \quad \forall i$$
§ Policy iteration terminates with an optimal policy, and linear programming works.
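Below is a minimal sketch (not from the slides) of the conversion just described: the stage rewards are kept as-is, the original transition probabilities are scaled by $\alpha$, and the leftover probability $1-\alpha$ is routed to an absorbing, reward-free termination state. The function name and the toy numbers are illustrative only.

```python
import numpy as np

def discounted_to_ssp(P, alpha):
    """Embed an alpha-discounted MDP into an equivalent SSP.

    P: array of shape (num_actions, n, n) with transition probabilities.
    Returns transition probabilities of shape (num_actions, n, n + 1),
    where state index n is an absorbing, reward-free termination state
    that is reached with probability 1 - alpha at every step.
    """
    num_actions, n, _ = P.shape
    P_ssp = np.zeros((num_actions, n, n + 1))
    P_ssp[:, :, :n] = alpha * P       # with probability alpha, follow the original dynamics
    P_ssp[:, :, n] = 1.0 - alpha      # with probability 1 - alpha, terminate
    return P_ssp

if __name__ == "__main__":
    # Toy 2-state, 2-action example (made-up numbers).
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.6, 0.4]]])
    P_ssp = discounted_to_ssp(P, alpha=0.95)
    assert np.allclose(P_ssp.sum(axis=2), 1.0)   # rows are still distributions
```

Under this embedding, the termination-inevitability assumption holds with $m = 1$ and $\rho_\pi = \alpha$ for every policy, which is why the discounted theory is a special case of the SSP theory.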
Main Results

§ Given any initial conditions $V_0(1), \dots, V_0(n)$, the sequence $V_k(i)$ generated by value iteration,
$$V_{k+1}(i) = \max_{u \in U(i)} \left[ g(i,u) + \sum_{j=1}^{n} p_{ij}(u)\, V_k(j) \right], \quad \forall i,$$
converges to the optimal cost $V^*(i)$ for each $i$.
§ Bellman’s equation has $V^*(i)$ as its unique solution:
$$V^*(i) = \max_{u \in U(i)} \left[ g(i,u) + \sum_{j=1}^{n} p_{ij}(u)\, V^*(j) \right], \quad \forall i, \qquad V^*(t) = 0$$
§ A stationary policy $\mu$ is optimal if and only if for every state $i$, $\mu(i)$ attains the maximum in Bellman’s equation.
§ Key proof idea: the “tail” of the cost series,
$$\sum_{k=mK}^{\infty} \mathbb{E}\big[ g\big(x_k, \mu_k(x_k)\big) \big],$$
vanishes as $K \to \infty$.
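As a concrete illustration of the VI recursion above, here is a minimal sketch (my own, not from the lecture) for a finite SSP given as arrays. The encoding is an assumption: rows of `P` may sum to less than one, with the missing mass interpreted as the probability of moving to the termination state, whose value is fixed at zero.

```python
import numpy as np

def ssp_value_iteration(P, g, num_iters=10000, tol=1e-10):
    """Value iteration for a finite stochastic shortest path problem.

    P: shape (num_actions, n, n); row sums may be < 1, and the missing mass
       is the probability of moving to the reward-free termination state.
    g: shape (num_actions, n); expected one-stage reward g(i, u).
    Returns the value vector V and a greedy stationary policy.
    """
    num_actions, n, _ = P.shape
    V = np.zeros(n)                      # any initial condition works under the assumptions
    for _ in range(num_iters):
        Q = g + P @ V                    # Q[u, i] = g(i, u) + sum_j p_ij(u) V(j)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, (g + P @ V).argmax(axis=0)
```

By the result above, a stationary policy that is greedy with respect to the returned $V$ attains the maximum in Bellman’s equation and is therefore optimal.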
Finiteness of Policy Rewards

§ View
$$\rho = \max_\pi \rho_\pi < 1$$
as an upper bound on the non-termination probability during the first $m$ steps, regardless of the policy used.
§ For any $\pi$ and any initial state $i$,
$$\mathbb{P}(x_{2m} \neq t \mid x_0 = i, \pi) = \mathbb{P}(x_{2m} \neq t \mid x_m \neq t, x_0 = i, \pi) \times \mathbb{P}(x_m \neq t \mid x_0 = i, \pi) \le \rho^2,$$
§ and similarly,
$$\mathbb{P}(x_{km} \neq t \mid x_0 = i, \pi) \le \rho^k, \quad i = 1, \dots, n$$
§ So
$$\mathbb{E}\{\text{Reward between times } km \text{ and } (k+1)m - 1\} \le m \rho^k \max_{i=1,\dots,n,\; u \in U(i)} |g(i,u)|$$
§ and
$$|J_\pi(i)| \le \sum_{k=0}^{\infty} m \rho^k \max_{i=1,\dots,n,\; u \in U(i)} |g(i,u)| = \frac{m}{1-\rho} \max_{i=1,\dots,n,\; u \in U(i)} |g(i,u)|$$
Proof: $V_k \to V^*$ (sketch)

§ Assume for simplicity that $V_0(i) = 0$, $\forall i$. For any $K \ge 1$, write the cost of any policy $\pi$ as
$$J_\pi(x_0) = \sum_{k=0}^{mK-1} \mathbb{E}\big[ g(x_k, \mu_k(x_k)) \big] + \sum_{k=mK}^{\infty} \mathbb{E}\big[ g(x_k, \mu_k(x_k)) \big] \le \sum_{k=0}^{mK-1} \mathbb{E}\big[ g(x_k, \mu_k(x_k)) \big] + \sum_{k=K}^{\infty} \rho^k m \max_{i,u} |g(i,u)|$$
§ Take the maximum of both sides over $\pi$ to obtain
$$V^*(x_0) \le V_{mK}(x_0) + \frac{\rho^K}{1-\rho}\, m \max_{i,u} |g(i,u)|$$
§ Similarly, we have
$$V_{mK}(x_0) - \frac{\rho^K}{1-\rho}\, m \max_{i,u} |g(i,u)| \le V^*(x_0)$$
§ It follows that $\lim_{K\to\infty} V_{mK}(x_0) = V^*(x_0)$.
§ $V_{mK}(x_0)$ and $V_{mK+\ell}(x_0)$ converge to the same limit for $\ell < m$ (since $\ell$ extra steps far into the future don’t matter), so $V_k(x_0) \to V^*(x_0)$.
§ Similarly, $V_0 \neq 0$ does not matter.
Example: Minimizing the Expected Time to Termination

§ Let $g(i,u) = 1$, $\forall i = 1,\dots,n$, $u \in U(i)$.
§ Under our assumptions, the costs $V^*(i)$ uniquely solve Bellman’s equation, which has the form
$$V^*(i) = \min_{u \in U(i)} \left[ 1 + \sum_{j=1}^{n} p_{ij}(u)\, V^*(j) \right], \quad i = 1, \dots, n$$
§ In the special case where there is only one control at each state, $V^*(i)$ is the mean first passage time from $i$ to $t$. These times, denoted $m_i$, are the unique solution of the classical equations
$$m_i = 1 + \sum_{j=1}^{n} p_{ij} m_j, \quad i = 1, \dots, n,$$
which are seen to be a form of Bellman’s equation.
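For the single-control case, these classical equations are just a linear system, so the mean first passage times can be computed directly. A small sketch (toy numbers, my own illustration):

```python
import numpy as np

# Mean first passage times to the termination state t for a fixed policy.
# P holds transition probabilities among the non-termination states only;
# the remaining mass in each row is the probability of jumping straight to t.
# The classical equations m = 1 + P m become (I - P) m = 1.
P = np.array([[0.5, 0.3],     # from state 1: probability 0.2 of terminating
              [0.1, 0.6]])    # from state 2: probability 0.3 of terminating
m = np.linalg.solve(np.eye(2) - P, np.ones(2))
print(m)                      # expected number of steps to reach t from each state
```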
Proper policies

§ Definition: A stationary policy $\mu$ is called proper if, under $\mu$, from every state $i$ there is a positive-probability path that leads to $t$.
§ Important fact: If $\mu$ is proper, $T_\mu$ is a contraction with respect to some weighted sup-norm:
$$\max_i \frac{1}{v_i} \big| (T_\mu V)(i) - (T_\mu V')(i) \big| \le \rho_\mu \max_i \frac{1}{v_i} \big| V(i) - V'(i) \big|$$
§ $T$ is similarly a contraction if all $\mu$ are proper (the case we just analyzed).
SSP Theory: the sequel

§ The theory can be pushed one step further. Instead of all policies being proper, assume that:
1) There exists at least one proper policy
2) For each improper $\mu$, $J_\mu(i) = -\infty$ for some $i$
§ Example: Deterministic shortest path problem with a single destination $t$.
• States ⟺ nodes; controls ⟺ arcs
• Termination state ⟺ the destination
• Assumption (1) ⟺ every node is connected to the destination
• Assumption (2) ⟺ all cycle costs > 0
§ Note that $T$ is not necessarily a contraction (since not all policies may be proper).
§ The theory in summary is as follows:
• $J^*$ is the unique solution of Bellman’s equation
• $\mu^*$ is optimal if and only if $T_{\mu^*} J^* = T J^*$
• VI converges: $T^k V \to J^*$ for all $V \in \Re^n$
• PI terminates with an optimal policy, if started with a proper policy
SSP Algorithms

§ All the basic algorithms have counterparts under our assumptions; see DPOC Vol. 2, Ch. 3.
§ “Easy” case: all policies proper, in which case the mappings $T$ and $T_\mu$ are contractions.
§ Even with improper (infinite-cost) policies, all the basic algorithms have satisfactory counterparts:
• VI and PI
• Optimistic PI
• Asynchronous VI
• Asynchronous PI
• Q-learning analogs
§ ** THE BOUNDARY OF NICE THEORY **
§ Serious complications arise under any one of the following:
• There is no proper policy
• There is an improper policy with finite cost for all $i$
• The state space is infinite and/or the control space is infinite [infinite but compact $U(i)$ can be dealt with]
Pathologies I: Deterministic Shortest Paths

§ Two policies, one proper (apply $u$), one improper (apply $u'$).
[Figure: node 1 and the destination; $u$ moves from node 1 to the destination at cost $b$, while $u'$ self-loops at node 1 at cost 0.]
§ Bellman’s equation is
$$V(1) = \min\big( V(1),\; b \big)$$
The set of solutions is $(-\infty, b]$.
§ Case $b > 0$, $V^* = 0$: VI does not converge to $V^*$ except if started from $V^*$. PI may get stuck starting from the inferior proper policy.
§ Case $b < 0$, $V^* = b$: VI converges to $V^*$ if started above $V^*$, but not if started below $V^*$. PI can oscillate (if started with $u'$ it generates $u$, and if started with $u$ it can generate $u'$).
§ Discuss: Why doesn’t this issue arise in the discounted setting?
(Warning: min)
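The pathology is easy to reproduce numerically; the following tiny sketch (my own illustration) just iterates the scalar Bellman update $V(1) \leftarrow \min(V(1), b)$ from different starting points.

```python
# Bellman's equation at node 1 reads V(1) = min(V(1), b): every value <= b is a
# fixed point, and the VI update can only clip downward at b, never move up.
def run_vi(v0, b, iters=20):
    v = v0
    for _ in range(iters):
        v = min(v, b)          # stay at node 1 for cost 0, or go to the destination for cost b
    return v

# Case b > 0: V*(1) = 0, but VI started at 2.0 stalls at b and VI started at -1.0 stays put.
print(run_vi(2.0, b=1.0), run_vi(-1.0, b=1.0), run_vi(0.0, b=1.0))   # -> 1.0 -1.0 0.0
# Case b < 0: V*(1) = b; VI converges from above but not from below.
print(run_vi(2.0, b=-1.0), run_vi(-5.0, b=-1.0))                     # -> -1.0 -5.0
```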
SSP Analysis I

§ For a proper policy $\mu$, $J_\mu$ is the unique fixed point of $T_\mu$, and $T_\mu^k V \to J_\mu$ for all $V$ (holds by the theory of DPOC Vol. I, Section 5.2).
§ Key fact: a $\mu$ satisfying $V \le T_\mu V$ for some $V \in \Re^n$ must be proper – true because
$$V \le T_\mu^k V = P_\mu^k V + \sum_{m=0}^{k-1} P_\mu^m g_\mu,$$
since $J_\mu = \sum_{m=0}^{\infty} P_\mu^m g_\mu$ and some component of the term on the right goes to $-\infty$ as $k \to \infty$ if $\mu$ is improper (by our assumptions).
§ Consequence: $T$ can have at most one fixed point within $\Re^n$.
§ Proof: If $V$ and $V'$ are two fixed points, select $\mu$ and $\mu'$ such that $V = TV = T_\mu V$ and $V' = TV' = T_{\mu'} V'$. By the preceding assertion, $\mu$ and $\mu'$ must be proper, and $V = J_\mu$ and $V' = J_{\mu'}$. Also
$$V = T^k V \ge T_{\mu'}^k V \to J_{\mu'} = V'$$
Similarly, $V' \ge V$, so $V = V'$.
SSP Analysis II

§ We first show that $T$ has a fixed point, and that PI converges to it.
§ Use PI. Generate a sequence of proper policies $\{\mu^k\}$ starting from a proper policy $\mu^0$.
§ $\mu^1$ is proper and $J_{\mu^0} \le J_{\mu^1}$ since
$$J_{\mu^0} = T_{\mu^0} J_{\mu^0} \le T J_{\mu^0} = T_{\mu^1} J_{\mu^0} \le T_{\mu^1}^k J_{\mu^0} \le J_{\mu^1}$$
§ Thus $J_{\mu^k}$ is non-decreasing, some policy $\bar{\mu}$ is repeated, and $J_{\bar{\mu}} = T J_{\bar{\mu}}$. So $J_{\bar{\mu}}$ is a fixed point of $T$.
§ Next show that $T^k V \to J_{\bar{\mu}}$ for all $V$, i.e., VI converges to the same limit as PI. (Sketch: true if $V = J_{\bar{\mu}}$. Argue using the properness of $\bar{\mu}$ to show that the terminal cost difference $V - J_{\bar{\mu}}$ does not matter.)
§ To show $J_{\bar{\mu}} = J^*$: for any $\pi = \{\mu_0, \mu_1, \dots\}$,
$$T_{\mu_0} \cdots T_{\mu_{k-1}} V_0 \le T^k V_0,$$
where $V_0 \equiv 0$. Take the limsup as $k \to \infty$ to obtain $J_\pi \le J_{\bar{\mu}}$, so $\bar{\mu}$ is optimal and $J_{\bar{\mu}} = J^*$.
Outline
1. Undiscounted problems
2. Stochastic shortest path
3. Average reward dynamic programming
   a. Connections with finite horizon DP
   b. Connections with stochastic shortest path
   c. Bellman’s equation
   d. Algorithms: value & policy iteration
   e. Connections with discounted MDPs
   f. Blackwell optimal policies
Average Reward Problems

§ In average reward problems, we aim at finding a policy $\pi$ which maximizes:
$$\lambda_\pi(x) = \limsup_{N\to\infty} \frac{1}{N} \, \mathbb{E}\left[ \sum_{t=0}^{N-1} g_{\pi_t}(x_t) \;\Big|\; x_0 = x \right] \quad (1)$$
§ In the average-reward problem, $\lambda_\pi(x)$ does not offer enough information for an optimal policy to be found.
§ In most cases of interest, we will have $\lambda_\pi(x) = \lambda_\pi$ for some scalar $\lambda_\pi$, for all $x$, so it does not allow us to distinguish the value of being in each state.
§ Footnotes:
• For any fixed $t$, the reward accrued up to time $t$ does not matter (only the state we are in at time $t$ matters).
• Setting: stationary dynamics, finite states and actions.
Intuition: constant value

Definition (Communicate)
We say that two states $x, y$ communicate under a policy $u$ if there are $n, n' \in \{1, 2, \dots\}$ such that $P_u^n(x, y) > 0$ and $P_u^{n'}(y, x) > 0$.
§ If all states communicate, the optimal average reward is independent of the initial state [if we can go from $x$ to $y$ in finite expected time, we must have $\lambda^*(x) \ge \lambda^*(y)$, and symmetrically $\lambda^*(y) \ge \lambda^*(x)$]. So $\lambda^*(x) \equiv \lambda^*$, $\forall x$.
§ Because communication issues are so important, the methodology relies heavily on Markov chain theory.
§ The theory depends a lot on whether the chains corresponding to policies have a single or multiple “recurrent classes.” We will focus on the simplest version, using SSP theory.
More definitions…

Definition (Unichain Policy)
We say that a policy $u$ is unichain if all of its recurrent states communicate.
Definition (Transient State)
We say that a state $x$ is transient under policy $u$ if it is only visited finitely many times, regardless of the initial condition of the system.
§ In the figure [a four-state chain in which states 1, 2, 3 form a recurrent class and state 4 is transient], states 1, 2, and 3 all communicate with each other, but state 4 doesn’t communicate with any state.
§ States 1, 2, and 3 are recurrent, while state 4 is transient.
§ This MDP is thus unichain.
Assumption

Assumption
One of the states, $x^*$, is such that for some integer $m > 0$, and for all initial states and all policies, $x^*$ is visited with positive probability at least once within the first $m$ steps.
§ Equivalently: the special state $x^*$ is recurrent in the Markov chain corresponding to each stationary policy.
§ Equivalently (the previous SSP assumption, termination inevitable): there exists an integer $m$ such that for every policy and initial state, there is positive probability that the termination state $t$ will be reached after no more than $m$ steps; for all $\pi$, we have
$$\rho_\pi = \max_{i=1,\dots,n} \; \mathbb{P}\big(x_m \neq t \mid x_0 = i, \pi\big) < 1$$
Definition (Recurrent State)
We say that a state $x$ is recurrent under policy $\pi$ if, conditioned on the fact that it is visited at least once, it is visited infinitely many times.
More intuition: constant value

§ Consider a set of states $S = \{x_1, x_2, \dots, x^*, \dots, x_n\}$.
§ The states are visited in a sequence with some initial state $x$, say
$$x, \dots, x^*, \dots, x^*, \dots, x^*, \dots$$
§ Let $t_m(x)$, $m = 1, 2, \dots$ be the stages corresponding to the $m$th visit to state $x^*$, starting at state $x$. Let
$$\hat{\lambda}_m(x) = \frac{\mathbb{E}\left[ \sum_{t = t_m(x)}^{t_{m+1}(x) - 1} g_{\pi_t}(x_t) \right]}{t_{m+1}(x) - t_m(x)}$$
§ Intuitively, we have the same transition probabilities whenever we start a new trajectory in state $x^*$. Thus, $\hat{\lambda}_m(x)$ is independent of the initial state $x$, and $\hat{\lambda}_m(x) = \hat{\lambda}_m(x')$.
§ Then expect $\lambda^*(x) \equiv$ some $\lambda^*$.
[Figure: a trajectory split at successive visits to $x^*$; the initial segment contributes a residual term $h(x)$ and the subsequent cycles contribute $\hat{\lambda}_1, \hat{\lambda}_2, \dots$]
Connection to finite-horizon problems

§ Going back to the definition of the function
$$J^*(x, N) = \max_\pi \; \mathbb{E}\left[ \sum_{t=0}^{N-1} g_{\pi_t}(x_t) \;\Big|\; x_0 = x \right] \quad (2)$$
§ We conjecture that the function can be approximated as follows:
$$J^*(x, N) \approx \lambda^*(x)\, N + h^*(x) + o(N), \quad \text{as } N \to \infty$$
§ Note that, since $\lambda^*(x)$ is independent of the initial state, we can rewrite the approximation as:
$$J^*(x, N) \approx \lambda^* N + h^*(x) + o(N), \quad \text{as } N \to \infty \quad (3)$$
§ The term $h^*(x)$ can be interpreted as a residual reward that depends on the initial state $x$; it will be referred to as the differential cost (reward) function.
§ It can be shown that
$$h^*(x) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \big( g_{\pi^*}(x_t) - \lambda^* \big) \;\Big|\; x_0 = x \right]$$
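A quick numerical sanity check of conjecture (3) (my own toy example, not from the lecture): run the finite-horizon recursion (2) by backward DP on a small made-up MDP, then look at $V_N(x)/N$ and at the differences $V_N(x) - V_N(y)$.

```python
import numpy as np

# Toy 2-state, 2-action MDP with made-up numbers; every stationary policy is unichain here.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # transition probabilities under action 0
              [[0.1, 0.9], [0.7, 0.3]]])   # transition probabilities under action 1
g = np.array([[1.0, 0.0],                  # rewards g_u(x) under action 0
              [0.0, 2.0]])                 # rewards g_u(x) under action 1

N = 2000
V = np.zeros(2)
for _ in range(N):
    V = (g + P @ V).max(axis=0)            # finite-horizon DP recursion for J*(x, N)

print(V / N)       # both entries approach the same constant, an estimate of lambda*
print(V - V[0])    # the differences stabilize, estimating h*(x) - h*(0)
```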
Bellman’s equation

§ We can now speculate about a version of Bellman’s equation for computing $\lambda^*$ and $h^*$.
§ Approximating $J^*(x, N)$ as in (3), we have
$$J^*(x, N+1) = \max_u \left[ g_u(x) + \sum_y P_u(x, y)\, J^*(y, N) \right]$$
$$\lambda^*(N+1) + h^*(x) + o(N) = \max_u \left[ g_u(x) + \sum_y P_u(x, y) \big( \lambda^* N + h^*(y) + o(N) \big) \right]$$
§ Therefore, we have:
$$\lambda^* + h^*(x) = \max_u \left[ g_u(x) + \sum_y P_u(x, y)\, h^*(y) \right] \quad (4)$$
Connection with SSP

§ Divide the sequence of generated states into cycles marked by successive visits to $x^*$.
[Figure: a trajectory with the visits to $x^*$ marking the cycle boundaries.]
§ Let’s focus on a single cycle: it can be viewed as a state trajectory of an SSP problem with $x^*$ as the termination state.
• Let the cost (reward) at $x$ of the SSP be $h(x) = g(x) - \lambda^*$.
• We will argue (informally) that:
Average reward problem ≡ a minimum cost (maximum reward) cycle problem ≡ SSP problem.
(Warning: min)
Connection with SSP (continued)

§ Consider a minimum cycle cost problem: find a stationary policy $\mu$ that minimizes the expected cost per transition within a cycle:
$$\lambda_\mu = \frac{C_{x^*}(\mu)}{T_{x^*}(\mu)}$$
where, for a fixed $\mu$:
$C_{x^*}(\mu)$: expected cost from $x^*$ up to the first return to $x^*$
$T_{x^*}(\mu)$: expected time from $x^*$ up to the first return to $x^*$
§ Intuitively, $C_{x^*}(\mu) / T_{x^*}(\mu)$ = average cost of $\mu$, and the optimal cycle cost $= \lambda^*$, so
$$C_{x^*}(\mu) - T_{x^*}(\mu)\, \lambda^* \ge 0$$
§ Consider the SSP with stage costs $g(x, u) - \lambda^*$. The cost of $\mu$ starting from $x^*$ is $C_{x^*}(\mu) - T_{x^*}(\mu)\, \lambda^*$, so the optimal/min-cycle $\mu$ is also optimal for the SSP.
§ Also: the optimal SSP cost starting from $x^*$ is 0.
(Warning: min)
Bellman’s Equation

§ Let $h^*(x)$ be the optimal cost of this SSP problem when starting at the non-termination states $x = 1, \dots, n$. Then $h^*(1), \dots, h^*(n)$ solve uniquely the corresponding Bellman’s equation:
$$h^*(x) = \min_{u \in U(x)} \left[ g(x, u) - \lambda^* + \sum_{y \neq x^*} p_{xy}(u)\, h^*(y) \right], \quad \forall x$$
§ If $\mu^*$ is an optimal stationary policy for the SSP problem, we have:
$$h^*(x^*) = C_{x^*}(\mu^*) - T_{x^*}(\mu^*)\, \lambda^* = 0$$
§ Combining these equations, we have:
$$\lambda^* + h^*(x) = \min_{u \in U(x)} \left[ g(x, u) + \sum_{y} p_{xy}(u)\, h^*(y) \right], \quad \forall x, \qquad h^*(x^*) = 0$$
§ If $\mu^*(x)$ attains the min for each $x$, $\mu^*$ is optimal.
§ There is also a Bellman equation for a single policy $\mu$.
§ Finally, flip all the signs for rewards (vs. costs).
§ Discuss: Any issues with solving the above Bellman equation?
(Warning: min)
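The single-policy version mentioned above is just a linear system, and one relevant observation for the discussion question is that the unknowns $(\lambda, h)$ are only pinned down once a normalization such as $h(x^*) = 0$ is added. A minimal sketch (my own, with made-up numbers), written in the reward convention for a fixed unichain policy:

```python
import numpy as np

def evaluate_average_reward(P, g, ref=0):
    """Solve lam + h(x) = g(x) + sum_y P[x, y] h(y) with h(ref) = 0.

    P: n x n transition matrix of the chain induced by a fixed policy mu.
    g: length-n vector of one-stage rewards g(x, mu(x)).
    Returns (lam, h). A sketch; assumes the induced chain is unichain.
    """
    n = len(g)
    # Unknowns z = (lam, h(0), ..., h(n-1)): n Bellman equations plus one normalization.
    A = np.zeros((n + 1, n + 1))
    b = np.zeros(n + 1)
    A[:n, 0] = 1.0                     # the lam column
    A[:n, 1:] = np.eye(n) - P          # h(x) - sum_y P[x, y] h(y)
    b[:n] = g
    A[n, 1 + ref] = 1.0                # normalization h(ref) = 0
    z = np.linalg.solve(A, b)
    return z[0], z[1:]

# Made-up 2-state chain: stationary distribution (1/3, 2/3), so lam = 5/3.
P = np.array([[0.5, 0.5], [0.25, 0.75]])
g = np.array([1.0, 2.0])
lam, h = evaluate_average_reward(P, g)
print(lam, h)
```

For the optimal pair $(\lambda^*, h^*)$ one would wrap this evaluation inside policy iteration, or use relative value iteration (both on the slides that follow).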
Bellman operators

§ Define the Bellman operators as follows:
$$T_\mu h = g_\mu + P_\mu h, \qquad T h = \max_\mu T_\mu h$$
§ Then:
Lemma 1 (Monotonicity)
Let $h \le \bar{h}$ be arbitrary. Then $T h \le T \bar{h}$ and $T_\mu h \le T_\mu \bar{h}$.
Lemma 2 (Offset)
For all $h$ and all scalars $\delta \in \mathbb{R}$, we have $T(h + \delta \mathbf{1}) = T h + \delta \mathbf{1}$, where $\mathbf{1}$ is the all-ones vector.
§ The contraction principle does not hold for $T h = \max_\mu T_\mu h$.
§ Bellman’s equation:
$$\lambda \mathbf{1} + h = T h$$
Bellman’s Equation

Theorem
Suppose that $\lambda^*$ and $h^*$ satisfy Bellman’s equation. Let $\mu^*$ be greedy with respect to $h^*$, i.e., $T h^* \equiv T_{\mu^*} h^*$. Then
$$\lambda_{\mu^*}(x) = \lambda^*, \quad \forall x$$
$$\lambda_{\mu^*}(x) \ge \lambda_\mu(x), \quad \forall \mu$$
§ Bellman’s equation:
$$\lambda \mathbf{1} + h = T h \quad (5)$$
Proof: Bellman’s Equation

§ Let $\pi = \{\pi_0, \pi_1, \dots\}$. Let $N$ be arbitrary. Then
$$T_{\pi_{N-1}} h^* \le T h^* = \lambda^* \mathbf{1} + h^*$$
$$T_{\pi_{N-2}} T_{\pi_{N-1}} h^* \le T_{\pi_{N-2}} (h^* + \lambda^* \mathbf{1}) = T_{\pi_{N-2}} h^* + \lambda^* \mathbf{1} \le T h^* + \lambda^* \mathbf{1} = h^* + 2 \lambda^* \mathbf{1}$$
§ Then
$$T_{\pi_0} T_{\pi_1} \cdots T_{\pi_{N-1}} h^* \le N \lambda^* \mathbf{1} + h^*$$
§ Thus, we have
$$\mathbb{E}\left[ \sum_{t=0}^{N-1} g_{\pi_t}(x_t) + h^*(x_N) \right] \le N \lambda^* \mathbf{1} + h^*$$
Proof: Bellman’s Equation (continued)

§ Dividing both sides by $N$ and taking the limit as $N \to \infty$, we have
$$\lambda_\pi \le \lambda^* \mathbf{1}$$
§ Take $\pi = \{\pi^*, \pi^*, \pi^*, \dots\}$; then all the inequalities above become equalities. Thus
$$\lambda^* \mathbf{1} = \lambda_{\pi^*}$$
Remarks: Bellman’s Equation
§ If $(\lambda^*, h^*)$ is a solution to the Bellman equation, then $(\lambda^*, h^* + \delta \mathbf{1})$ is also a solution, for every scalar $\delta$.
§ However, unlike the case of discounted-reward and finite-horizon problems, the average-reward Bellman’s equation does not necessarily have a solution.
§ Discuss: Are there examples in which the average reward should not be the same for all initial states?
Value Iteration

§ Natural VI method: generate optimal $k$-stage rewards by the DP algorithm, starting with any $V_0$:
$$V_{k+1}(x) = \max_{u \in U(x)} \left[ g(x, u) + \sum_{y=1}^{n} p_{xy}(u)\, V_k(y) \right], \quad \forall x$$
§ Convergence: $\lim_{k\to\infty} V_k(x) / k = \lambda^*$, $\forall x$
§ Proof outline: Let $V_k^*$ be so generated starting from the optimal differential cost (reward), i.e., the initial condition $V_0^* = h^*$. Then, by induction:
$$V_k^*(x) = k \lambda^* + h^*(x), \quad \forall x, \forall k$$
§ On the other hand:
$$\big| V_k(x) - V_k^*(x) \big| \le \max_{y=1,\dots,n} \big| V_0(y) - h^*(y) \big|, \quad \forall x,$$
since $V_k(x)$ and $V_k^*(x)$ are optimal rewards for two $k$-stage problems that differ only in the terminal reward functions, which are $V_0$ and $h^*$.
Relative Value Iteration

§ The VI method just described has two drawbacks:
• Since typically some components of $V_k$ diverge to $\infty$ or $-\infty$, calculating $\lim_{k\to\infty} V_k(x)/k$ is numerically cumbersome.
• The method will not compute a corresponding differential reward vector $h^*$.
§ We can bypass both difficulties by subtracting a constant from all components of the vector $V_k$, so that the difference, call it $h_k$, remains bounded.
§ Relative VI algorithm: pick any state $s$, and iterate according to
$$h_{k+1}(x) = \max_{u \in U(x)} \left[ g(x, u) + \sum_{y=1}^{n} p_{xy}(u)\, h_k(y) \right] - \max_{u \in U(s)} \left[ g(s, u) + \sum_{y=1}^{n} p_{sy}(u)\, h_k(y) \right], \quad \forall x$$
§ Convergence: we can show $h_k \to h^*$ (under an extra assumption; see DPOC Vol. II).
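A minimal sketch of the relative VI update above (my own illustration; the array layout is an assumption). The quantity subtracted at the reference state also converges to $\lambda^*$, so it is returned as the average-reward estimate.

```python
import numpy as np

def relative_value_iteration(P, g, s=0, iters=500):
    """Relative value iteration for the average-reward problem.

    P: (num_actions, n, n) transition probabilities; g: (num_actions, n) rewards.
    s: reference state whose backed-up value is subtracted at every step.
    Returns (lam, h), estimates of the optimal average reward and differential rewards.
    Convergence needs the extra assumption mentioned on the slide.
    """
    n = P.shape[1]
    h = np.zeros(n)
    lam = 0.0
    for _ in range(iters):
        Th = (g + P @ h).max(axis=0)   # one Bellman (max) backup
        lam = Th[s]                    # offset removed at the reference state
        h = Th - lam                   # keeps h bounded, with h(s) = 0
    return lam, h
```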
Policy Iteration

§ At iteration $k$, we have a stationary policy $\mu^k$.
§ Policy evaluation: compute $\lambda^k$ and $h^k(x)$ of $\mu^k$, using the $n+1$ equations $h^k(n) = 0$ and
$$\lambda^k + h^k(x) = g\big(x, \mu^k(x)\big) + \sum_{y=1}^{n} p_{xy}\big(\mu^k(x)\big)\, h^k(y), \quad \forall x$$
§ Policy improvement: find
$$\mu^{k+1}(x) = \arg\max_{u \in U(x)} \left[ g(x, u) + \sum_{y=1}^{n} p_{xy}(u)\, h^k(y) \right], \quad \forall x$$
§ If $\lambda^{k+1} = \lambda^k$ and $h^{k+1}(x) = h^k(x)$, $\forall x$, stop; otherwise, repeat with $\mu^{k+1}$ replacing $\mu^k$.
§ Result: for each $k$, we either have $\lambda^{k+1} > \lambda^k$, or we have policy improvement:
$$\lambda^{k+1} = \lambda^k, \qquad h^{k+1}(x) \ge h^k(x), \quad x = 1, \dots, n$$
§ The algorithm terminates with an optimal policy.
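Putting the two steps together, here is a compact sketch of the whole loop (my own illustration, with the stopping rule simplified to "the policy repeats"; the evaluation step pins the last state's differential reward to zero, as on the slide, just zero-indexed):

```python
import numpy as np

def average_reward_policy_iteration(P, g, max_iters=100):
    """Policy iteration for the average-reward problem (unichain case).

    P: (num_actions, n, n) transition probabilities; g: (num_actions, n) rewards.
    Returns (mu, lam, h). A sketch assuming every induced chain is unichain.
    """
    num_actions, n, _ = P.shape
    mu = np.zeros(n, dtype=int)                     # arbitrary initial policy
    for _ in range(max_iters):
        # Policy evaluation: solve lam + h = g_mu + P_mu h with h(n-1) = 0.
        P_mu = P[mu, np.arange(n)]                  # n x n chain induced by mu
        g_mu = g[mu, np.arange(n)]
        A = np.zeros((n + 1, n + 1))
        b = np.zeros(n + 1)
        A[:n, 0] = 1.0
        A[:n, 1:] = np.eye(n) - P_mu
        A[n, n] = 1.0                               # normalization h(n-1) = 0
        b[:n] = g_mu
        z = np.linalg.solve(A, b)
        lam, h = z[0], z[1:]
        # Policy improvement.
        mu_new = (g + P @ h).argmax(axis=0)
        if np.array_equal(mu_new, mu):
            break
        mu = mu_new
    return mu, lam, h
```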
Remarks
1. Unlike discounted reward problems, average reward problems are full of technicalities.
2. Depending on the structure of the transition probability matrix $P(a)$ given action $a$, an optimal stationary policy may not exist.
3. Such existence problems are highly technical, especially for infinite-state spaces.
4. There are general sufficient conditions for the existence of optimal stationary policies for average reward problems.
Average reward results

Theorem (Average reward)
Any of the following conditions is sufficient for the optimal average reward to be the same regardless of the initial state:
1. (Unichain condition) Every stationary policy $u$ yields a Markov chain with a single recurrent class (i.e., one communicating class) and a possibly empty set of transient states.
2. There exists a unichain optimal policy.
3. For every pair of states $x$ and $y$, there is a policy $u$ such that $x$ and $y$ communicate.
Connection with Discounted Reward MDPs

§ “Vanishing discount factor” idea.
§ Let $V_\alpha^*$ be the value of a discounted reward MDP with discount factor $\alpha$. Then $\lambda^*$, $h$ of the average-reward problem are obtained by:
$$\lambda^* = \lim_{\alpha \to 1} (1 - \alpha)\, V_\alpha^*(x)$$
$$h(x) = \lim_{\alpha \to 1} \big[ V_\alpha^*(x) - V_\alpha^*(y) \big] \quad \text{for any } y$$
§ The following (informal) chain of equations could relate average-reward MDPs and $\alpha$-discounted MDPs:
$$\lambda^* = \lambda_{u^*}(x) = \lim_{N\to\infty} \max_u \frac{1}{N+1} \, \mathbb{E}_u\left[ \sum_{t=0}^{N} g(x_t, u_t) \;\Big|\; x_0 = x \right]$$
$$= \lim_{N\to\infty} \max_u \lim_{\alpha \to 1} \frac{\mathbb{E}_u\left[ \sum_{t=0}^{N} \alpha^t g(x_t, u_t) \mid x_0 = x \right]}{\sum_{t=0}^{N} \alpha^t} = \lim_{\alpha \to 1} \lim_{N\to\infty} \max_u \frac{\mathbb{E}_u\left[ \sum_{t=0}^{N} \alpha^t g(x_t, u_t) \mid x_0 = x \right]}{\sum_{t=0}^{N} \alpha^t}$$
$$= \lim_{\alpha \to 1} \max_u \frac{\lim_{N\to\infty} \mathbb{E}_u\left[ \sum_{t=0}^{N} \alpha^t g(x_t, u_t) \mid x_0 = x \right]}{\lim_{N\to\infty} \sum_{t=0}^{N} \alpha^t} = \lim_{\alpha \to 1} (1 - \alpha)\, V_\alpha^*(x)$$
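A quick numerical check of the first relation (my own toy example, reusing the made-up two-state MDP from the finite-horizon sketch earlier): compute $V_\alpha^*$ by discounted value iteration and watch $(1-\alpha)\, V_\alpha^*(x)$ approach a common constant as $\alpha \to 1$.

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # made-up transition probabilities
              [[0.1, 0.9], [0.7, 0.3]]])
g = np.array([[1.0, 0.0],
              [0.0, 2.0]])                # made-up rewards g_u(x)

def discounted_value(alpha, iters=20000):
    """Plain discounted value iteration; iters is large enough for alpha up to 0.999."""
    V = np.zeros(2)
    for _ in range(iters):
        V = (g + alpha * (P @ V)).max(axis=0)
    return V

for alpha in (0.9, 0.99, 0.999):
    V = discounted_value(alpha)
    print(alpha, (1 - alpha) * V)         # both components approach the same lambda*
```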
Blackwell Optimal Policies

§ Let $\mu_\alpha^*$ denote the optimal policy of a discounted reward MDP with discount factor $\alpha$.
§ Let $\mu^*$ be the optimal policy of an average reward MDP.
§ Under the unichain condition (and finite states), given a sequence $\{\alpha_k\}$ such that $\alpha_k \to 1$, it turns out that $\mu_{\alpha_k}^* \to \mu^*$.
§ Since there are only a finite number of policies, the convergence $\mu_{\alpha_k}^* \to \mu^*$ implies that there exists an $\bar{\alpha} < 1$ such that for all $\alpha \in (\bar{\alpha}, 1)$, the optimal policy of the discounted reward MDP coincides with that of the average reward MDP (i.e., $\mu^*$).
§ Such policies $\mu_\alpha^*$ are called Blackwell optimal policies.