JOURNAL OF MATHEMATICAL ANALYSIS AND APPLICATIONS 158, 396-406 (1991)

Average Cost Markov Decision Processes: Optimality Conditions*

O. HERNÁNDEZ-LERMA

Depto. de Matemáticas, CINVESTAV-IPN, Apdo. Postal 14-740, México, D.F. 07000, México

AND

J. C. HENNET AND J. B. LASSERRE

LAAS-CNRS, 7, Avenue du Colonel Roche, 31077 Toulouse Cedex, France

Submitted by E. Stanley Lee

Received October 24, 1989

1. INTRODUCTION

We are concerned in this paper with discrete-time Markov Decision Processes (MDPs) with Borel state and action spaces X and A, respectively, and the long-run expected average cost criterion. When X is a denumerable set, many necessary and/or sufficient conditions for the existence of optimal control policies are known. However, when X is a Borel space (i.e., a Borel subset of a complete separable metric space), most of the available results impose on the MDP very restrictive topological conditions (e.g., compactness) and/or strong recurrence assumptions (such as Doeblin's condition); see, e.g., [4, 9, 12] and their references. Another related work is [7], where we have studied MDPs from the viewpoint of the recurrence (or ergodicity) properties of the state process. In the present paper, however, we are concerned with the existence of average optimal policies by looking at (static) optimization problems (see condition C5 in Section 3) related, and in some cases equivalent, to the existence of a bounded solution to the so-called Optimality Equation (see C4 in Section 3). These optimization problems are "dual" in the sense that, under appropriate conditions, the existence of an optimal solution to one of the problems implies existence of an optimal solution to the other(s) and, moreover, the corresponding optimal values of the problems are equal. More generally, feasible solutions to one of the problems provide bounds for the other. This approach is more or less standard when X and A are both finite sets, as in [1] and references therein, but in a more general setting it has been followed only by Yamada [12], who assumes that X is a compact subset of $\mathbb{R}^n$ and that the transition law has a density which satisfies a certain "positivity" condition (see (A1) in Remark 3.2 below). Here, we obtain results similar to those in [1, 12] in the setting of general Borel spaces, and furthermore, our "static" problems have a formally simpler form. Also, using the concept of "opportunity cost" introduced by Flynn [2, 3], we show that a stationary policy determined from the optimality equation is strong average optimal (see Definition 2.2).

* This work was partially supported by a joint CONACYT (México)-CNRS (France) research program. The research of the first author was also supported in part by the TWAS under Grant RG-MP 898-152.

Our main results are presented in Section 4; they roughly consist of relations between several ergodicity and optimality conditions introduced in Section 3. We begin in Section 2 by presenting the Markov decision model and the optimality criteria we are interested in.

2. PRELIMINARIES

We will use the following notation. A Borel space X (i.e., a Borel subset of a complete separable metric space) is always endowed with the Borel sigma-algebra $\mathcal{B}(X)$. $P(X)$ and $B(X)$ denote the space of probability measures on X and the space of real-valued measurable bounded functions on X, respectively. If $v \in B(X)$, $\|v\|$ denotes its supremum norm, whereas if $\mu$ is a finite signed measure on X, $\|\mu\|$ stands for the total variation norm. Given two Borel spaces X and Y, $P(Y \mid X)$ stands for the set of all stochastic kernels $\psi(dy \mid x)$ on Y given X; that is, $\psi(dy \mid x) \in P(Y \mid X)$ if $\psi(\cdot \mid x)$ is a probability measure on Y for each $x \in X$, and $\psi(B \mid \cdot)$ is a measurable function on X for each $B \in \mathcal{B}(Y)$.

The Decision Model. We consider the standard (stationary) Markov decision model (X, A, q, c), with state space X, action set A, transition law q, and one-stage cost function c. Both X and A are assumed to be Borel spaces. To each state $x \in X$ we associate a nonempty measurable subset A(x) of A, whose elements are the admissible actions when the system is in state x, and we assume that the set $K := \{(x, a) : x \in X,\ a \in A(x)\}$ of feasible state-action pairs is a measurable subset of $X \times A$. We will also assume that A(x), c(x, a), and $q(dy \mid x, a)$ satisfy the following:

Assumption 2.1. (a) A(x) is a compact set for every $x \in X$.

(b) $c(x, a) \in B(K)$ and, for each $x \in X$, $c(x, a)$ is a lower semicontinuous (l.s.c.) function in $a \in A(x)$.


(c) The transition law $q \in P(X \mid K)$ is such that $\int_X v(y)\, q(dy \mid x, a)$ is l.s.c. in $a \in A(x)$ for each $x \in X$ and $v \in B(X)$.
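When X and A are finite, Assumption 2.1 holds automatically. The following minimal sketch (all numerical data hypothetical) records such a model in array form; the same toy model is reused in the numerical sketches below.

```python
import numpy as np

X, A = range(2), range(2)                 # finite state and action spaces
q = np.array([[[0.9, 0.1], [0.2, 0.8]],   # q[a, x, y] = q(y | x, a)
              [[0.5, 0.5], [0.6, 0.4]]])
c = np.array([[1.0, 2.0],                 # c[x, a] = one-stage cost
              [0.5, 3.0]])

# Each q(. | x, a) is a probability measure on X; A(x) = A is compact
# (finite), and c is bounded, so Assumption 2.1 holds trivially.
assert np.allclose(q.sum(axis=2), 1.0)
```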

Control Policies. A policy is a sequence $\delta = \{\delta_t\}$ such that, for each $t = 0, 1, \ldots$, $\delta_t$ is a stochastic kernel on A given the set $H_t$ of histories $h_t := (x_0, a_0, \ldots, x_{t-1}, a_{t-1}, x_t)$ with $(x_n, a_n) \in K$ for all n. Here, $x_n$ and $a_n$ denote the state and action at time n, respectively, and it is assumed that $\delta_t$ satisfies the constraint $\delta_t(A(x_t) \mid h_t) = 1$. The class of all policies is denoted by $\Delta$.

Let $\Phi$ be the set of all stochastic kernels $\phi \in P(A \mid X)$ such that $\phi(A(x) \mid x) = 1$ for all $x \in X$, and let F be the set of all measurable functions $f: X \to A$ such that $f(x) \in A(x)$ for all $x \in X$.

A policy $\delta = \{\delta_t\}$ is said to be a randomized stationary policy if there exists $\phi \in \Phi$ such that $\delta_t(\cdot \mid h_t) = \phi(\cdot \mid x_t)$ for every history $h_t = (x_0, a_0, \ldots, x_t) \in H_t$ and $t = 0, 1, \ldots$. In this case we identify $\delta$ with $\phi \in \Phi$; in other words, we identify $\Phi$ with the set of randomized stationary policies.

Finally, a randomized stationary policy $\phi \in \Phi$ is called (pure or deterministic) stationary if there exists $f \in F$ such that $\phi(\{f(x)\} \mid x) = 1$ for all $x \in X$. In such a case, we identify $\phi$ with $f \in F$, so that F becomes the set of (pure or deterministic) stationary policies.

Notation. Given a randomized stationary policy $\phi \in \Phi$, we write, for $x \in X$,

$$q(\cdot \mid x, \phi) := \int_A q(\cdot \mid x, a)\, \phi(da \mid x) \quad \text{and} \quad c(x, \phi) := \int_A c(x, a)\, \phi(da \mid x). \tag{1}$$

For a stationary policy $f \in F$, these expressions reduce to

$$q(\cdot \mid x, f) = q(\cdot \mid x, f(x)) \quad \text{and} \quad c(x, f) = c(x, f(x)),$$

respectively. As is well known, when using a policy $\phi \in \Phi$, the state process $\{x_t\}$ is a Markov chain with stationary transition kernel $q(\cdot \mid x, \phi)$.

Performance Criteria. Let $P_x^\delta$ be the induced probability measure when using the policy $\delta \in \Delta$ given the initial state $x_0 = x$ (see, e.g., Hinderer [8, p. 80], for a construction of $P_x^\delta$); the corresponding expectation operator is denoted by $E_x^\delta$.

For any positive integer n, $\delta \in \Delta$, and $x \in X$, let

$$V_n(\delta, x) := \sum_{t=0}^{n-1} E_x^\delta\, c(x_t, a_t), \qquad n = 1, 2, \ldots \quad (V_0(\cdot, \cdot) := 0),$$

be the expected total n-stage cost under $\delta$ when the initial state is x.


The corresponding optimal n-stage cost is $v_n(x) := \inf_\delta V_n(\delta, x)$. Following Flynn [2, 3], we define the opportunity cost of $\delta$ at x as

$$O(\delta, x) := \limsup_{n \to \infty}\, [V_n(\delta, x) - v_n(x)], \tag{2}$$

and $\delta$ is said to have finite opportunity cost if $O(\delta, \cdot)$ is finite-valued. We also define the usual long-run expected average cost per unit time as

$$J(\delta, x) := \limsup_{n \to \infty}\, n^{-1} V_n(\delta, x),$$

and the optimal average cost $J(x) := \inf_\delta J(\delta, x)$, $x \in X$.

DEFINITION 2.2. A policy $\delta^*$ is said to be

• average optimal (AO) if $J(\delta^*, x) = J(x)$ for all $x \in X$;

• strong average optimal (strong AO) if $\limsup_n n^{-1}[V_n(\delta^*, x) - v_n(x)] = 0$ for all $x \in X$.

In this paper, we are specifically interested in the concept of average optimality in the sense of Definition 2.2. As already noted by Flynn [2, 3], a policy $\delta$ is AO if it is strong AO, and the latter in turn holds if $\delta$ has finite opportunity cost. The converse implications, however, do not hold in general, and one of our objectives is to see how strong optimality and finiteness of the opportunity cost relate to the conditions to be stated in Section 3.
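For a finite model these criteria are directly computable. The sketch below (hypothetical data, an arbitrary stationary policy f) approximates $J(f, \cdot)$ and the differences $V_n(f, \cdot) - v_n(\cdot)$ whose limsup is the opportunity cost (2); when f is not AO the differences grow linearly in n, so $O(f, \cdot) = \infty$.

```python
import numpy as np

q = np.array([[[0.9, 0.1], [0.2, 0.8]],   # q[a, x, y] = q(y | x, a)
              [[0.5, 0.5], [0.6, 0.4]]])
c = np.array([[1.0, 2.0],
              [0.5, 3.0]])

f = np.array([0, 1])                       # f(x): action chosen at state x
Qf = q[f, np.arange(2)]                    # Qf[x, y] = q(y | x, f(x))
cf = c[np.arange(2), f]                    # cf[x]   = c(x, f(x))

V = np.zeros(2)                            # V_n(f, .): n-stage cost of f
v = np.zeros(2)                            # v_n(.):    optimal n-stage cost
N = 200
for _ in range(N):
    V = cf + Qf @ V                        # V_{n+1}(f, x) = c(x, f) + sum_y q(y|x,f) V_n(f, y)
    v = np.min(c + np.einsum('axy,y->xa', q, v), axis=1)

print("J(f, .) ~", V / N)                  # long-run expected average cost of f
print("V_N(f, .) - v_N(.) =", V - v)       # grows ~ n*(J(f,.) - j*) when f is not AO
```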

3. ERGODICITY AND OPTIMALITY CONDITIONS

In this section we introduce some ergodicity and optimality conditions, and in Section 4 we study some relations between them. A subscript d (d for deterministic) will be used to indicate that a given condition is restricted to the set of (pure or deterministic) stationary policies F.

Ergodicity Conditions

C1. There exists a scalar $\alpha \in (0, 1)$ such that $\|q(\cdot \mid x, \phi) - q(\cdot \mid x', \phi')\| \le 2\alpha$ for all $x, x' \in X$ and $\phi, \phi' \in \Phi$.

C2 (Geometric ergodicity). There exist scalars $\alpha \in (0, 1)$ and $b > 0$ for which the following holds: For each $\phi \in \Phi$ there is a probability measure $p_\phi$ on X such that

$$\|q^t(\cdot \mid x, \phi) - p_\phi(\cdot)\| \le b\alpha^t \qquad \text{for all } x \in X \text{ and } t = 0, 1, \ldots,$$


where $q^t(B \mid x, \phi) = P_x^\phi(x_t \in B)$, $B \in \mathcal{B}(X)$, denotes the t-step transition measure when using the policy $\phi \in \Phi$; cf. (1).

C3 (Positive recurrence). For each $\phi \in \Phi$, there exists an invariant probability measure $p_\phi$ for $q(\cdot \mid \cdot, \phi)$; that is, $p_\phi(B) = \int_X q(B \mid x, \phi)\, p_\phi(dx)$ for all $B \in \mathcal{B}(X)$.

Remark 3.1. C1 implies C2 (with b = 2), C3, and also the optimality condition C4 below; see, e.g., [4; 6; 5, p. 57]. Some sufficient conditions for C1 are given in the latter references; they are easily verified in some inventory/production systems as well as in some water reservoir control problems [11, 12].
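In the finite case, the implication C1 ⇒ C2 with b = 2 can be observed directly: the constant $\alpha$ in C1 is (at most) the Dobrushin contraction coefficient of the kernel, i.e., half the total-variation diameter of its rows. A hedged numerical sketch for one fixed $\phi$ (kernel data hypothetical):

```python
import numpy as np
from itertools import product

# P[x, y] plays the role of q(y | x, phi) for one fixed phi in Phi.
P = np.array([[0.5, 0.3, 0.2],
              [0.4, 0.4, 0.2],
              [0.3, 0.3, 0.4]])

# C1: ||q(.|x) - q(.|x')|| <= 2*alpha, so alpha = half the L1 diameter of the rows.
alpha = max(0.5 * np.abs(P[i] - P[j]).sum() for i, j in product(range(3), repeat=2))
assert 0 < alpha < 1

# Invariant probability measure p_phi (left eigenvector for eigenvalue 1).
w, V = np.linalg.eig(P.T)
p = np.real(V[:, np.argmin(np.abs(w - 1.0))])
p /= p.sum()

Pt = np.eye(3)                              # q^0
for t in range(12):
    tv = np.abs(Pt - p).sum(axis=1).max()   # sup_x ||q^t(.|x, phi) - p_phi||
    assert tv <= 2.0 * alpha**t + 1e-12     # C2 with b = 2
    Pt = Pt @ P
```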

Remark 3.2. C1 can be written in several equivalent forms when the state space X is a countable set or $X = \mathbb{R}^n$. For instance, suppose that $X = \mathbb{R}^n$ and the transition law $q(B \mid x, a)$ has a density $p(y \mid x, a)$ with respect to Lebesgue measure $m(\cdot)$; that is, $q(B \mid x, a) = \int_B p(y \mid x, a)\, dy$ for all $B \in \mathcal{B}(X)$ and $(x, a) \in K$. Then, by Scheffé's Theorem (see, e.g., [5, p. 125]) and using that $|s - t| = s + t - 2\min[s, t]$, we can write

$$\|q(\cdot \mid x, a) - q(\cdot \mid x', a')\| = \int |p(y \mid x, a) - p(y \mid x', a')|\, dy = 2 - 2 \int \min[p(y \mid x, a), p(y \mid x', a')]\, dy. \tag{3}$$

(This relation also holds when X is a countable set: replace integrals by sums.) As an example, we can show that Yamada's [12] condition (A1) implies C1. Indeed, consider [12]:

(A1) $X = \mathbb{R}^n$, $A = \mathbb{R}^m$, and there exist a scalar $\varepsilon > 0$ and a Borel set $C \in \mathcal{B}(X)$ such that $p(y \mid x, a) \ge \varepsilon$ for all $y \in C$, $(x, a) \in K$, and $0 < \varepsilon \cdot m(C) < 1$.

Under (A1), $q(\cdot \mid x, \phi)$, with $\phi \in \Phi$, has a density $p(y \mid x, \phi) = \int_A p(y \mid x, a)\, \phi(da \mid x)$ (cf. (1)) satisfying $p(y \mid x, \phi) \ge \varepsilon$ for all $y \in C$ and $x \in X$, and (3) yields, for any $\phi$ and $\phi' \in \Phi$,

$$\|q(\cdot \mid x, \phi) - q(\cdot \mid x', \phi')\| = 2 - 2 \int \min[p(y \mid x, \phi), p(y \mid x', \phi')]\, dy \le 2 - 2 \int_C \min[p(y \mid x, \phi), p(y \mid x', \phi')]\, dy \le 2(1 - \varepsilon \cdot m(C)),$$

so that C1 holds with $\alpha = 1 - \varepsilon \cdot m(C)$.
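Identity (3) and the resulting bound under (A1) are easy to check on a discrete example, with sums replacing integrals and counting measure in place of m; all numbers below are hypothetical.

```python
import numpy as np

# Hypothetical discrete densities p(. | x, a) and p(. | x', a') on a 3-point space.
p1 = np.array([0.2, 0.5, 0.3])
p2 = np.array([0.4, 0.4, 0.2])

tv = np.abs(p1 - p2).sum()                                # total variation norm
assert np.isclose(tv, 2 - 2 * np.minimum(p1, p2).sum())   # identity (3)

# (A1)-style bound: both densities are >= eps on C (here C is the whole space,
# m = counting measure), so ||q - q'|| <= 2(1 - eps * m(C)).
eps, mC = min(p1.min(), p2.min()), 3
assert 0 < eps * mC < 1
assert tv <= 2 * (1 - eps * mC)
```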


Remark 3.3. For the results in Section 4, the geometric ergodicity condition C2 can be replaced by the following: For each $\phi \in \Phi$, there exists a probability measure $p_\phi$ on X such that

$$\|q^t(\cdot \mid x, \phi) - p_\phi(\cdot)\| \le \beta_t \qquad \text{for all } x \in X,\ t = 0, 1, \ldots, \tag{4}$$

where $\{\beta_t\}$ is a sequence of constants independent of x and $\phi$, and such that $\sum_t \beta_t < \infty$. Sufficient conditions for (4), as well as for C2 and C3, are given, e.g., in [7, 10].

Optimality Conditions

C4. There is a constant $j^*$ and a function $v^* \in B(X)$ such that $(j^*, v^*(\cdot))$ is a solution to the Optimality Equation

$$j^* + v^*(x) = \min_{a \in A(x)} \left\{ c(x, a) + \int v^*(y)\, q(dy \mid x, a) \right\}, \qquad x \in X. \tag{5}$$

Equivalently, there is a constant $j^*$ and a function $v^* \in B(X)$ such that $(j^*, v^*(\cdot))$ is an optimal solution to the problem (P):

Maximize $\lambda$ s.t.

$$\lambda + v(x) - \int v(y)\, q(dy \mid x, a) \le c(x, a) \qquad \forall (x, a) \in K, \tag{6}$$

where $\lambda \in \mathbb{R}$ and $v \in B(X)$.

C5. There exist $\phi^* \in \Phi$ and $p^* \in P(X)$ such that $(\phi^*, p^*)$ is an optimal solution to the dual problem (D):

Minimize $\int_X \int_A c(x, a)\, \phi(da \mid x)\, p(dx)$ s.t.

$$\int_X \int_A q(B \mid x, a)\, \phi(da \mid x)\, p(dx) = p(B) \qquad \forall B \in \mathcal{B}(X), \tag{7}$$

where $\phi \in \Phi$, $p \in P(X)$.

If we restrict problem (D) to (deterministic) stationary policies $f \in F$, the corresponding "deterministic" version of problem (D) is problem (Dd):

Minimize $\int_X c(x, f)\, p(dx)$ s.t.

$$\int_X q(B \mid x, f)\, p(dx) = p(B) \qquad \forall B \in \mathcal{B}(X), \tag{8}$$

where $f \in F$, $p \in P(X)$.

C6. There is a policy $\delta \in \Delta$ with finite opportunity cost.


Notice that problem (P) is "linear" in $(\lambda, v(\cdot))$, whereas (D) (or (Dd)) is nonlinear in $(\phi, p)$. However, if the transition law q is absolutely continuous with respect to some sigma-finite measure $\mu$ on X (e.g., $\mu = m$ = Lebesgue measure if $X = \mathbb{R}^n$, cf. Remark 3.2 or [12], or $\mu$ = counting measure if X is a denumerable set, cf. [1]), then (D) can be written as the standard dual linear problem for (P), as in Linear Programming.
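In particular, when X and A are finite and $\mu$ is the counting measure, (P) is an ordinary finite-dimensional linear program and can be handed to any LP solver. A hedged sketch on the hypothetical toy model from Section 2 (scipy's linprog minimizes, so the objective is negated):

```python
import numpy as np
from scipy.optimize import linprog

q = np.array([[[0.9, 0.1], [0.2, 0.8]],   # q[a, x, y] = q(y | x, a)
              [[0.5, 0.5], [0.6, 0.4]]])
c = np.array([[1.0, 2.0],
              [0.5, 3.0]])
nX, nA = 2, 2

# Decision vector z = (lambda, v(0), ..., v(nX-1)); one constraint (6) per (x, a):
#   lambda + v(x) - sum_y q(y|x,a) v(y) <= c(x, a).
A_ub, b_ub = [], []
for x in range(nX):
    for a in range(nA):
        row = np.zeros(1 + nX)
        row[0] = 1.0                      # coefficient of lambda
        row[1 + x] += 1.0                 # + v(x)
        row[1:] -= q[a, x]                # - sum_y q(y | x, a) v(y)
        A_ub.append(row)
        b_ub.append(c[x, a])

obj = np.zeros(1 + nX); obj[0] = -1.0     # maximize lambda
res = linprog(obj, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * (1 + nX))
assert res.status == 0
print("j* =", res.x[0], "   v* =", res.x[1:])
```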

Remark 3.4. We can also write the optimality equation (5) as $\min_{a \in A(x)} D(x, a) = 0$, where

$$D(x, a) := c(x, a) + \int v^*(y)\, q(dy \mid x, a) - j^* - v^*(x), \qquad (x, a) \in K,$$

is the so-called "discrepancy" function. Let $F^* := \{f \in F : D(x, f(x)) = 0\ \forall x \in X\}$; that is, $f \in F^*$ if $f(x) \in A(x)$ minimizes the right hand side (r.h.s.) of (5) for all $x \in X$. Under Assumption 2.1, well-known Measurable Selection theorems imply that $F^*$ is nonempty. On the other hand, if C4 holds, then $j^*$ is the optimal cost function, i.e., $j^* = J(x)$ for all $x \in X$, and moreover, $j^* = J(f, x)$ if $f \in F^*$, so that $f \in F^*$ is AO. We will show in Theorem 4.2 that a stationary policy $f \in F^*$ is in fact strong AO (Definition 2.2).
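Numerically, once an (approximate) bounded solution $(j^*, v^*(\cdot))$ of (5) is at hand, D yields a policy in $F^*$ by pointwise minimization over a. A sketch on the same hypothetical finite model, approximating $(j^*, v^*)$ by relative value iteration (a standard scheme, not taken from this paper):

```python
import numpy as np

q = np.array([[[0.9, 0.1], [0.2, 0.8]],   # q[a, x, y] = q(y | x, a)
              [[0.5, 0.5], [0.6, 0.4]]])
c = np.array([[1.0, 2.0],
              [0.5, 3.0]])

def T(v):                                  # dynamic programming operator, cf. (5)
    return np.min(c + np.einsum('axy,y->xa', q, v), axis=1)

v = np.zeros(2)                            # relative value iteration:
for _ in range(500):                       # v -> v* (normalized so v*(0) = 0),
    Tv = T(v)                              # j -> j*
    j, v = Tv[0], Tv - Tv[0]

# Discrepancy function D(x, a) >= 0, with min_a D(x, a) = 0 at every x.
D = c + np.einsum('axy,y->xa', q, v) - j - v[:, None]
f_star = D.argmin(axis=1)                  # a policy in F*, cf. Remark 3.4
assert np.allclose(D[np.arange(2), f_star], 0.0, atol=1e-8)
print("f* =", f_star, "  j* =", j)
```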

4. THEOREMS

The objective in this section is to prove some results connecting conditions C4, C5, and C6. Theorem 4.1 is a "duality theorem": it gives conditions under which the existence of an optimal solution to the "primal" problem (P) in C4 yields an optimal solution to the "dual" problem (D), or to the deterministic version (Dd), in C5, and conversely. Theorem 4.2 shows that C4 implies C6, which extends to our present Borel-space setting a result of Flynn [2] for the case in which X is a denumerable set and A is finite.

THEOREM 4.1. (a) Suppose the ergodicity condition C3 holds. Then:

(i) the problems (P) and (D) [and (Dd)] in C4 and C5, respectively, are feasible;

(ii) for any feasible solutions $(\lambda, v(\cdot))$ of (P) and $(\phi, p)$ [respectively $(f, p)$] of (D) [respectively (Dd)],

$$\lambda \le \int_X \int_A c(x, a)\, \phi(da \mid x)\, p(dx) \quad \left[\text{respectively } \lambda \le \int_X c(x, f)\, p(dx)\right]; \tag{9}$$

(iii) if (P) has an optimal (bounded) solution, then so do (D) and (Dd), and the optimal values of the corresponding objective functions are equal.


In fact, an optimal solution to (D) can be chosen from the set of optimal solutions to (Dd). (See also Remark 4.3.)

(b) If C2 holds and (D) [or (Dd)] has an optimal solution, then so does (P), and the corresponding optimal values of (P) and (D) [or (Dd)] are equal.

Proof. (a) (cf. [12]) (i) To see that (P) is feasible it suffices to take $v(\cdot) = 0$ and $\lambda$ sufficiently small. Feasibility of (D) or (Dd) follows from C3: if $\phi \in \Phi$ and $p_\phi$ is an invariant probability measure for $q(\cdot \mid \cdot, \phi)$, then the pair $(\phi, p_\phi)$ satisfies (7). Similarly, if $f \in F$, then $(f, p_f)$ satisfies (8).

(ii) Now suppose that $(\lambda, v(\cdot))$ satisfies (6) and $(\phi, p)$ satisfies (7). Then, integrate (6) with respect to $\phi(da \mid x)$ and then with respect to $p(dx)$ to obtain

$$\lambda + \int_X v\, dp - \int_X \int_A \left[ \int_X v(y)\, q(dy \mid x, a) \right] \phi(da \mid x)\, p(dx) \le \int_X \int_A c(x, a)\, \phi(da \mid x)\, p(dx).$$

Finally, using Fubini's theorem and (7), the third term reduces to $\int_X v\, dp$, so that the latter inequality reduces to (9). The proof for $(f, p)$ satisfying (8) is similar.

(iii) Let $(j^*, v^*(\cdot))$ be a (bounded) solution to (5) and take $f \in F^*$; that is,

$$j^* + v^*(x) = c(x, f) + \int v^*(y)\, q(dy \mid x, f), \qquad x \in X. \tag{10}$$

Now let $\phi^* \in \Phi$ be such that $\phi^*(\cdot \mid x)$ is the probability measure concentrated at $f(x)$ for all $x \in X$, and let $p^* = p_f$ be a corresponding invariant probability measure. Then (10) can be written as

$$j^* + v^*(x) = c(x, \phi^*) + \int v^*(y)\, q(dy \mid x, \phi^*),$$

and integration with respect to $p^*(dx)$ yields $j^* = \int\!\int c(x, a)\, \phi^*(da \mid x)\, p^*(dx)$, which yields the desired conclusion (cf. Remark 3.4).

(b) Let $(\phi^*, p^*)$ be an optimal solution to (D), and define

$$j^* := \int_X \int_A c(x, a)\, \phi^*(da \mid x)\, p^*(dx) = \int_X c(x, \phi^*)\, p^*(dx)$$

and

$$v^*(x) := \sum_{t=0}^{\infty} E_x^{\phi^*}[c(x_t, \phi^*) - j^*]. \tag{11}$$


Under condition C2,

$$|E_x^{\phi^*} c(x_t, \phi^*) - j^*| = \left| \int c(y, \phi^*)\, [q^t(dy \mid x, \phi^*) - p^*(dy)] \right| \le \|c\| \cdot \|q^t(\cdot \mid x, \phi^*) - p^*(\cdot)\| \le b\|c\|\alpha^t,$$

and therefore v* is uniformly bounded in $x \in X$: $|v^*(x)| \le b\|c\|/(1 - \alpha)$. On the other hand, by definition of v* and the Markov property,

$$v^*(x) = c(x, \phi^*) - j^* + \sum_{t=1}^{\infty} E_x^{\phi^*}[c(x_t, \phi^*) - j^*] = c(x, \phi^*) - j^* + \int v^*(y)\, q(dy \mid x, \phi^*);$$

that is,

$$j^* + v^*(x) = c(x, \phi^*) + \int v^*(y)\, q(dy \mid x, \phi^*) \qquad \forall x \in X. \tag{12}$$

To show that $(j^*, v^*(\cdot))$ is feasible and optimal for (P), first note that, from (12),

$$j^* + v^*(x) \ge \min_{a \in A(x)} \left\{ c(x, a) + \int v^*(y)\, q(dy \mid x, a) \right\} \qquad \forall x \in X. \tag{13}$$

Now let $f \in F$ be a minimizer of the r.h.s. of (13), so that

$$j^* + v^*(x) \ge c(x, f) + \int v^*(y)\, q(dy \mid x, f), \qquad x \in X.$$

Iteration of this inequality yields

$$nj^* + v^*(x) \ge \sum_{t=0}^{n-1} E_x^f[c(x_t, f)] + E_x^f v^*(x_n),$$

so that, dividing by n, taking the limit as $n \to \infty$, and using the boundedness of v*, $j^* \ge J(f)$, where $J(f) = \int c(y, f)\, p_f(dy) = J(f, x)$ $\forall x \in X$, by C2. On the other hand, the optimality of $(\phi^*, p^*)$ and the definition (11) of $j^*$ imply that $j^* \le J(f)$. Hence $j^* = J(f)$ and equality holds in (13); i.e., $(j^*, v^*)$ satisfies the optimality equation (5), so that $(j^*, v^*(\cdot))$ is feasible for (P), and by (9) it is optimal. Clearly, the above arguments still work if, instead of an optimal solution to (D), we take an optimal solution $(f^*, p^*)$ to (Dd).

In the proof of Theorem 4.2 we will use the fact that the optimal n-stage cost functions $v_n$, $n = 1, 2, \ldots$, can be written iteratively as

$$v_n(x) = \inf_{a \in A(x)} \left\{ c(x, a) + \int v_{n-1}(y)\, q(dy \mid x, a) \right\}, \qquad x \in X, \tag{14}$$


with $v_0 := 0$; see, e.g., [5, 8]. Also recall the definitions of $D(x, a)$ and $F^*$ in Remark 3.4.
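As a numerical illustration of the estimate (15) established in the proof below, one can compute $e_n := v_n - v^* - nj^*$ on the hypothetical finite model used earlier, via (14), and watch the sup norms decrease:

```python
import numpy as np

q = np.array([[[0.9, 0.1], [0.2, 0.8]],   # q[a, x, y] = q(y | x, a)
              [[0.5, 0.5], [0.6, 0.4]]])
c = np.array([[1.0, 2.0],
              [0.5, 3.0]])

def T(v):                                  # the operator on the r.h.s. of (14)
    return np.min(c + np.einsum('axy,y->xa', q, v), axis=1)

# Approximate (j*, v*) first, as in the sketch after Remark 3.4.
v_star = np.zeros(2)
for _ in range(500):
    Tv = T(v_star)
    j_star, v_star = Tv[0], Tv - Tv[0]

v_n = np.zeros(2)                          # v_0 := 0
norms = [np.abs(-v_star).max()]            # ||e_0|| = ||v*||
for n in range(1, 30):
    v_n = T(v_n)                           # v_n from (14)
    norms.append(np.abs(v_n - v_star - n * j_star).max())
assert all(a >= b - 1e-10 for a, b in zip(norms, norms[1:]))   # (15)
```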

THEOREM 4.2. C4 implies C6; more precisely, if C4 holds and $f^* \in F^*$, then f* has finite opportunity cost, and therefore, f* is strong AO.

Proof. Let $(j^*, v^*(\cdot))$ be a bounded solution to (5) and let $f^* \in F^*$. We wish to estimate the opportunity cost $O(f^*, x)$ in (2).

Let us define $e_n(x) := v_n(x) - v^*(x) - nj^*$, for $x \in X$, $n = 0, 1, \ldots$. Notice that $e_0(x) = -v^*(x)$. We will first show that $\|e_n\|$ is non-increasing, i.e.,

$$\|e_{n+1}\| \le \|e_n\| \qquad \forall n = 0, 1, \ldots, \tag{15}$$

so that $\|e_n\| \le \|e_0\| = \|v^*\| < \infty$ for all n. To begin, a direct calculation using (14) yields

$$e_{n+1}(x) = \min_{a \in A(x)} \left\{ D(x, a) + \int e_n(y)\, q(dy \mid x, a) \right\}, \qquad x \in X, \tag{16}$$

where D is the "discrepancy" function in Remark 3.4. On the other hand, $D(x, a) \ge 0$ implies

$$e_{n+1}(x) \ge -\|e_n\| \qquad \forall x \in X.$$

Thus, if we take $f \in F^*$, then $D(x, f(x)) = 0$ for all $x \in X$, and from (16),

$$e_{n+1}(x) \le \int e_n(y)\, q(dy \mid x, f(x)) \le \|e_n\| \qquad \forall x \in X.$$

Thus $|e_{n+1}(x)| \le \|e_n\|$ $\forall x \in X$, and (15) follows.

Now if $f^* \in F^*$, (5) becomes

$$j^* + v^*(x) = c(x, f^*) + \int v^*(y)\, q(dy \mid x, f^*) \qquad \forall x \in X,$$

and iterating, we obtain

$$V_n(f^*, x) := \sum_{t=0}^{n-1} E_x^{f^*} c(x_t, f^*) = v^*(x) + nj^* - E_x^{f^*} v^*(x_n).$$

Therefore,

$$V_n(f^*, x) - v_n(x) = -E_x^{f^*} v^*(x_n) - e_n(x) \qquad \forall x \in X,$$


and by (2), $O(f^*, x) \le 2\|v^*\|$ $\forall x \in X$. This completes the proof of Theorem 4.2.

Combining Remark 3.1 and Theorems 4.1 and 4.2, we obtain other sufficient conditions for C6:

COROLLARY. (a) C1 implies C6.

(b) C2 and C5 [or C5d, i.e., replacing (D) by (Dd)] together imply C6.

More generally, C6 is implied by any set of sufficient conditions for C4, which in turn can be obtained in a number of ways [2, 4, 5, 7, 9].

Remark 4.3. As can be seen in the proof of Theorem 4.1, the conclusion in part (a)(iii) of that theorem still holds if C3 is replaced by the following weaker condition: [(P) has an optimal bounded solution and] there exists a stationary policy $f \in F^*$ such that $q(\cdot \mid \cdot, f)$ has an invariant probability measure.

REFERENCES

1. J. A. FILAR AND T. A. SCHULTZ, Communicating MDPs: Equivalence and LP properties, Oper. Res. Lett. 7 (1988), 303-307.

2. J. FLYNN, On optimality criteria for dynamic programs with long finite horizons, J. Math. Anal. Appl. 76 (1980), 202-208.

3. J. FLYNN, Optimal steady states, excessive functions, and deterministic dynamic programs, J. Math. Anal. Appl. 144 (1989), 586-594.

4. J. P. GEORGIN, Contrôle de chaînes de Markov sur des espaces arbitraires, Ann. Inst. H. Poincaré 14 (1978), 255-277.

5. O. HERNÁNDEZ-LERMA, "Adaptive Markov Control Processes," Springer-Verlag, New York, 1989.

6. O. HERNÁNDEZ-LERMA AND J. B. LASSERRE, A forecast horizon and a stopping rule for general Markov decision processes, J. Math. Anal. Appl. 132 (1988), 388-400.

7. O. HERNÁNDEZ-LERMA, R. MONTES-DE-OCA, AND R. CAVAZOS-CADENA, Recurrence conditions for Markov decision processes with Borel state space, Ann. Oper. Res. (1991), to appear.

8. K. HINDERER, "Foundations of Non-Stationary Dynamic Programming with Discrete Time Parameter," Lecture Notes Oper. Res., Vol. 33, Springer-Verlag, New York, 1970.

9. M. KURANO, The existence of a minimum pair of state and policy for Markov decision processes under the hypothesis of Doeblin, SIAM J. Control Optim. 27 (1989), 296-307.

10. R. L. TWEEDIE, Criteria for rates of convergence of Markov chains, with applications to queueing and storage theory, in "Papers in Probability, Statistics and Analysis" (J. F. C. Kingman and G. E. H. Reuter, Eds.), pp. 260-276, Cambridge Univ. Press, Cambridge, 1983.

11. S. YAKOWITZ, Dynamic programming applications in water resources, Water Resources Res. 18 (1982), 673-696.

12. K. YAMADA, Duality theorem in Markovian decision problems, J. Math. Anal. Appl. 50 (1975), 579-595.
