Online Fractional Programming for Markov Decision Systems


Transcript of Online Fractional Programming for Markov Decision Systems

Page 1: Online Fractional Programming for Markov Decision Systems

Online Fractional Programming for Markov Decision Systems

Michael J. Neely, University of Southern California

Proc. Allerton Conference on Communication, Control, and Computing, September 2011

[Figure: timeline of frames T[0], T[1], T[2] with system states (state 1, state 4, state 2); diagram of an Energy-Aware, 4-State Processor with states 1-4.]

Page 2: Online Fractional Programming for Markov Decision Systems

General System Model

[Figure: timeline of frames T[0], T[1], T[2] with system states 1, 4, 2.]

• Frames r in {0, 1, 2, …}.
• k[r] = system state during frame r. k[r] in {1,…,K}.
• ω[r] = random observation at start of frame r. ω[r] in Ω.
• α[r] = control action on frame r. α[r] in A(k[r], ω[r]).

Page 3: Online Fractional Programming for Markov Decision Systems

General System Model

[Figure: timeline of frames T[0], T[1], T[2] with system states 1, 4, 2.]

• Frames r in {0, 1, 2, …}.
• k[r] = system state during frame r. k[r] in {1,…,K}.
• ω[r] = random observation at start of frame r. ω[r] in Ω.
• α[r] = control action on frame r. α[r] in A(k[r], ω[r]).

Control action affects Frame Size, Penalty Vector, Next State:

• T[r] = T(k[r], ω[r], α[r]).
• [y1[r], …, yL[r]] = y(k[r], ω[r], α[r]).
• [Pij(ω[r], α[r])] = P(ω[r], α[r]).

Page 4: Online Fractional Programming for Markov Decision Systems

Example 1: Discrete Time MDP

[Figure: 4-state Markov chain with states 1-4.]

Minimize: E{y0}
Subject to: E{y1} ≤ 0, …, E{yL} ≤ 0

• All frames have unit size: T[r] = 1 for all r.
• Control action α[r] affects the Penalty Vector and Transition Probs.

Additionally, we can treat problems with…
• ω[r] = random observation at start of frame r:
  • ω[r] is i.i.d. over frames r.
  • ω[r] in Ω (arbitrary cardinality set).
  • Pr[ω[r] = ω] (unknown probability distribution).

Page 5: Online Fractional Programming for Markov Decision Systems

Example 2: Processor with Several Energy-Saving Modes

[Figure: diagram of a processor with 4 states (modes) 1-4.]

• Random Job Arrivals, L different classes.
• k[r] = processing mode (4 different modes).
• Action α[r]: Choose which job to serve, and next mode.
• k[r] and α[r] affect:

  • Processing Time
  • Switching Time
  • Energy Expenditure

Energy-Aware, 4-State Processor

Page 6: Online Fractional Programming for Markov Decision Systems

Relation between Averages

Define the frame-average for y0[r]:

The time-average for y0[r] is then:
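The defining equations on this slide are images in the original deck; a standard reconstruction from the frame-based framework (my notation, not verbatim from the slide) is:

Frame-average of y0 over the first R frames:   (1/R) ∑_{r=0}^{R-1} y0[r]
Frame-average of T over the first R frames:    (1/R) ∑_{r=0}^{R-1} T[r]

Time-average of y0 (penalty per unit time):
[ ∑_{r=0}^{R-1} y0[r] ] / [ ∑_{r=0}^{R-1} T[r] ]  =  (frame-average of y0) / (frame-average of T)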

Page 7: Online Fractional Programming for Markov Decision Systems

The General Problem

[Figure: timeline of frames T[0], T[1], T[2] with system states 1, 4, 2.]
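The problem statement on this slide is an image; based on the averages defined on the previous slide, the general fractional problem presumably has the form (a hedged reconstruction, not copied from the slide):

Minimize:    (frame-average of y0) / (frame-average of T)
Subject to:  (frame-average of yl) / (frame-average of T) ≤ 0    for l = 1, …, L
             α[r] in A(k[r], ω[r]) for all frames r
             k[r] evolves according to the transition probabilities P(ω[r], α[r])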

Page 8: Online Fractional Programming for Markov Decision Systems

Prior Methods for “typical” MDPs

• Offline Linear Programming Methods (known probabilities).
• Q-learning, Neurodynamic programming (unconstrained).
  • [Bertsekas, Tsitsiklis 1996]
• 2-timescale/fluid models for constrained MDPs.
  • [Borkar 2005] [Salodkar, Bhorkar, Karandikar, Borkar 2008]
  • [Djonin, Krishnamurthy 2007]
  • [Vazquez Abad, Krishnamurthy 2003]
  • [Fu, van der Schaar 2010]
• The above works typically require:
  • Finite action space
  • No ω[r] process
  • Fixed slot length problems (does not treat fractional problems).

Page 9: Online Fractional Programming for Markov Decision Systems

The Linear Fractional Program

For this slide, assume no ω[r] process and that the sets A(k) are finite:

Variables
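The program itself is an image on this slide; a standard linear fractional program over state-action frequencies f(k, α) (a reconstruction consistent with the interpretation on the next slide, not copied from the slide) is:

Variables:   f(k, α) ≥ 0 for k in {1, …, K}, α in A(k)
Minimize:    ∑k ∑α f(k, α) y0(k, α)  /  ∑k ∑α f(k, α) T(k, α)
Subject to:  ∑k ∑α f(k, α) yl(k, α) ≤ 0              for l = 1, …, L
             ∑α f(j, α) = ∑k ∑α f(k, α) Pkj(α)        for all j   (global balance)
             ∑k ∑α f(k, α) = 1                        (normalization)

where y0(k, α), yl(k, α), T(k, α) denote the expected penalties and frame size under state k and action α.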

Page 10: Online Fractional Programming for Markov Decision Systems

The Linear Fractional Program

For this slide, assume no ω[r] process and that the sets A(k) are finite:

where f(k, α) is interpreted as the steady-state probability of being in state k[r] = k and using action α[r] = α, and the policy is then:
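The policy formula itself is an image on the slide; the standard randomized policy induced by such frequencies (an assumption consistent with the sentence above) is:

Pr[ α[r] = α | k[r] = k ]  =  f(k, α) / ∑α' f(k, α')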

Note: See “Additional Slides 2” for the Linear Fractional Program with the ω[r] process.

Page 11: Online Fractional Programming for Markov Decision Systems

Paper Outline

• The Linear Fractional Program involves many variables and would require full knowledge of the probabilities p(ω).
• We develop:

Algorithm 1:
• Online policy for solving Linear Fractional Programs.
• Allows infinite sets Ω and A(k,ω).
• Does not require knowledge of p(ω).
• Does not operate on the actual Markov dynamics.
• Solves for the values (Pij*), {yl*(k)} associated with the optimal policy.

Algorithm 2:
• Given target values (Pij*), {yl*(k)}, we develop an online system implementation (with the actual Markov state dynamics) that achieves them.

We can also run these algorithms in parallel, continuously refining our target values for Alg 2 based on the running estimates from Alg 1.

Page 12: Online Fractional Programming for Markov Decision Systems

ALG 1: Solving for Optimal Values

• Define a new stochastic system with the same ω[r] process.
• Define a decision variable k[r]:
  • k[r] is chosen in {1, …, K} every frame.
  • It does not evolve according to the MDP.

• Define a new penalty process qij[r]:

qij[r] = 1{k[r]=i} Pij (ω[r], α[r])

• The time average of qij[r] = fraction of time we transition from “state” i to “state” j.

Page 13: Online Fractional Programming for Markov Decision Systems

Treat as a Stoch Network Optimization:

Page 14: Online Fractional Programming for Markov Decision Systems

Treat as a Stoch Network Optimization:

“Global Balance Equation”
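The equations on this slide and the previous one are images; written in terms of the time averages of qij[r], the global balance constraint presumably reads (a reconstruction, not verbatim):

∑j (time average of qkj[r])  =  ∑i (time average of qik[r])    for all k in {1, …, K},

i.e., the long-run rate of transitions out of each “state” k equals the long-run rate of transitions into it.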

Page 15: Online Fractional Programming for Markov Decision Systems

Treat as a Stoch Network Optimization:

Use a “Virtual Queue” for each constraint k:

[Queue diagram: virtual queue Hk[r] with arrivals ∑j qkj[r] and departures ∑i qik[r].]
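The update equation is not in the transcript; a virtual-queue update matching the diagram (an assumption about the exact form, e.g. whether a nonnegativity projection is applied) is:

Hk[r+1] = Hk[r] + ∑j qkj[r] − ∑i qik[r]

Keeping Hk[r] stable then enforces the global balance constraint for state k on average.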

Page 16: Online Fractional Programming for Markov Decision Systems

Lyapunov Optimization: Drift-Plus-Penalty Ratio

• L[r] = sum of squares of all virtual queues for frame r.
• Δ[r] = L[r+1] – L[r].
• Every frame, observe ω[r], queues[r].
• Then choose k[r] in {1,…, K}, α[r] in A(k[r],ω[r]) to greedily minimize a bound on the drift-plus-penalty ratio:

E{Δ[r] + V y0[r] | queues[r]} / E{T[r] | queues[r]}

• Can be done “greedily” via a max-weight rule generalization.

Page 17: Online Fractional Programming for Markov Decision Systems

Alg 1: Special Case of no ω[r] process

The drift-plus-penalty rule:
• Every frame r, observe queues[r] = {Zl[r], Hk[r]}.
• Then choose k[r] in {1,…,K}, and α[r] in A(k[r]) to greedily minimize:
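The expression being minimized is an image on the slide; expanding the drift-plus-penalty-ratio bound with qij[r] = 1{k[r]=i} Pij(α) gives, as a sketch of the kind of rule involved (my reconstruction, not verbatim from the slide):

Choose k in {1, …, K} and α in A(k) to minimize

[ V y0(k, α) + ∑l Zl[r] yl(k, α) + Hk[r] − ∑j Hj[r] Pkj(α) ] / T(k, α)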

Page 18: Online Fractional Programming for Markov Decision Systems

Alg 1 Performance Theorem

Theorem: If the problem is feasible, then:

(a) All virtual queues are rate stable.
(b) Averages of our variables satisfy all desired constraints.
(c) We find a value within O(1/V) of the optimal objective ratio* = y0*/T*.
(d) Convergence time = O(V^3).
(e) We get an efficient set of parameters to plug into ALG 2:

(Pij*), {yl*(k)}, {T*(k)}

ALG 2 does not need the (huge number of) individual probabilities for each ω[r] and each control option α[r].

Page 19: Online Fractional Programming for Markov Decision Systems

Alg 2: Targeting the MDP

• Given targets: (Pij*), {yl*(k)}, {T*(k)}.
• Let k[r] = Actual Markov State.
• Define qij[r] as before.
• 1i = time average of the indicator function 1{k[r]=i}.
• MDP structure: This is not a standard stochastic net opt.

Page 20: Online Fractional Programming for Markov Decision Systems

Lyapunov Optimization Solution

• While not a standard problem, we can still solve it greedily:

[Queue diagram: virtual queue Hij[r] with arrivals qij[r] and departures 1{k[r]=i} Pij*.]

Uncontrollable variable with “memory”
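The update equation is again an image; a virtual-queue update consistent with the diagram (an assumption on the exact form) is:

Hij[r+1] = Hij[r] + qij[r] − 1{k[r]=i} Pij*

Stabilizing Hij[r] pushes the time average of qij[r] toward the target 1i Pij*.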

Page 21: Online Fractional Programming for Markov Decision Systems

Greedy Algorithm Idea and Theorem

• L[r] = sum of squares of all virtual queues for frame r.
• Δ[r] = L[r+1] – L[r].
• Every frame, observe ω[r], queues[r], and actual k[r].
• Then take action α[r] in A(k[r], ω[r]) to greedily minimize a bound on:

Δ[r] + V y0[r]

Theorem: If the targets are feasible, then this satisfies all constraints and gives performance objective within O(1/V) of optimality.

Page 22: Online Fractional Programming for Markov Decision Systems

Conclusions

• General MDP with variable length frames.
• ω[r] process can have an infinite number of outcomes.
• Control space can be infinite.
• Combines Lyapunov optimization and MDP theory for “max-weight” rules.
• Online algorithms for linear fractional programming.

[Figure: timeline of frames T[0], T[1], T[2] with system states; diagram of the Energy-Aware, 4-State Processor.]

Page 23: Online Fractional Programming for Markov Decision Systems

Additional Slides -- 1

Example “Degenerate” MDP:
• Minimize: E{y0}
• Subject to: E{y1} ≤ 1/2

[Figure: 3-state Markov chain with per-state penalties (y0=0, y1=0), (y0=0, y1=1), (y0=1, y1=0) and transition probabilities p1, p2.]

• Can solve in an expected sense: Optimal E{y0} = ½. Optimal fraction of time in lower left = fraction in lower right = ½.
• But it is impossible for pure time averages to achieve the constraints.
• Our ALG 1 would find the optimal solution in the expected sense.


Page 24: Online Fractional Programming for Markov Decision Systems

Additional Slides -- 2

The linear fractional program with the ω[r] process, but with finite sets Ω and A(k[r], ω[r]). Let y(k, w, a) represent the probability of being in state k[r] = k, seeing ω[r] = w, and using α[r] = a.

GBE

Normalization

Independence of k[r] and ω[r].
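The program itself is an image on this slide; a reconstruction of the three labeled constraints in terms of the variables y(k, w, a) (consistent with the note on the next slide that y(k, w, a) = p(w) f(k, w, a), but not copied from the slide) is:

GBE:            ∑w ∑a y(j, w, a) = ∑k ∑w ∑a y(k, w, a) Pkj(w, a)    for all j
Normalization:  ∑k ∑w ∑a y(k, w, a) = 1,  with y(k, w, a) ≥ 0
Independence:   ∑a y(k, w, a) = p(w) ∑w' ∑a y(k, w', a)             for all k, w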

Page 25: Online Fractional Programming for Markov Decision Systems

Additional Slides -- 2

The linear fractional program with the ω[r] process, but with finite sets Ω and A(k[r], ω[r]). Let y(k, w, a) represent the probability of being in state k[r] = k, seeing ω[r] = w, and using α[r] = a.

Note: An early draft (on my webpage for 1 week) omitted the normalization and independence constraints for this linear fractional program example. It also used more complex notation: y(k, w, a) = p(w) f(k, w, a).

The online solution given in the paper (for the more general problem) enforces these constraints in a different (online) way.