Online Fractional Programming for Markov Decision Systems
Michael J. Neely, University of Southern California
Proc. Allerton Conference on Communication, Control, and Computing, September 2011
[Figure: timeline of frames with sizes T[0], T[1], T[2], the system state changing at frame boundaries; diagram of an energy-aware, 4-state processor]
General System Model
[Figure: frame timeline with sizes T[0], T[1], T[2] and states 1, 4, 2]
• Frames r in {0, 1, 2, …}.
• k[r] = system state during frame r. k[r] in {1,…,K}.
• ω[r] = random observation at start of frame r. ω[r] in Ω.
• α[r] = control action on frame r. α[r] in A(k[r], ω[r]).
Control action affects Frame Size, Penalty Vector, Next State:
• T[r] = T(k[r], ω[r], α[r]).
• [y1[r], …, yL[r]] = y(k[r], ω[r], α[r]).
• [Pij(ω[r], α[r])] = P(ω[r], α[r]).
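To make the model concrete, here is a minimal simulation sketch of this frame-based system. It is illustrative only: the callbacks sample_omega, choose_action, T, y, and P are hypothetical stand-ins for the model primitives above, not code from the paper.

```python
import random

def simulate_frames(R, k0, sample_omega, choose_action, T, y, P):
    """Simulate R frames of the general model. The callbacks mirror the
    primitives above: T(k,w,a) = frame size, y(k,w,a) = penalty vector,
    P(w,a) = K-by-K transition matrix [Pij(w,a)]."""
    k, history = k0, []
    for r in range(R):
        w = sample_omega()        # random observation, i.i.d. over frames
        a = choose_action(k, w)   # any action in A(k, w)
        frame_len = T(k, w, a)    # frame size T[r]
        penalties = y(k, w, a)    # [y1[r], ..., yL[r]]
        row = P(w, a)[k]          # row k of the transition matrix
        k = random.choices(range(len(row)), weights=row)[0]  # next state
        history.append((frame_len, penalties))
    return history
```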
Example 1: Discrete Time MDP
[Figure: 4-state MDP transition diagram]
Minimize: E{y0}
Subject to: E{y1} ≤ 0, …, E{yL} ≤ 0
• All frames have unit size: T[r] = 1 for all r.
• Control action α[r] affects the Penalty Vector and the Transition Probabilities.
Additionally, we can treat problems with…
• ω[r] = random observation at start of frame r:
    • ω[r] is i.i.d. over frames r.
    • ω[r] in Ω (arbitrary cardinality set).
    • Pr[ω[r] = ω] (unknown probability distribution).
Example 2: Processor with Several Energy-Saving Modes
[Figure: 4-state mode-transition diagram]
• Random Job Arrivals, L different classes.
• k[r] = processing mode (4 different modes).
• Action α[r]: Choose which job to serve, and next mode.
• k[r] and α[r] affect:
    • Processing Time
    • Switching Time
    • Energy Expenditure
Energy-Aware, 4-State Processor
Relation between Averages
Define the frame averages for y0[r] and T[r]:
    ȳ0 = lim{R→∞} (1/R) Σ{r=0}^{R−1} y0[r],    T̄ = lim{R→∞} (1/R) Σ{r=0}^{R−1} T[r]
The time-average for y0[r] is then the ratio of frame averages:
    lim{R→∞} [Σ{r=0}^{R−1} y0[r]] / [Σ{r=0}^{R−1} T[r]] = ȳ0 / T̄
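A tiny numeric check of this ratio-of-averages identity, with illustrative numbers only: the time average is the ratio of frame averages, which generally differs from the average of per-frame ratios.

```python
y0 = [2.0, 0.0, 3.0]   # per-frame penalties y0[r]
T  = [1.0, 2.0, 3.0]   # per-frame sizes T[r]

frame_avg_y0 = sum(y0) / len(y0)   # ybar0 = 5/3
frame_avg_T  = sum(T) / len(T)     # Tbar  = 2
time_avg     = sum(y0) / sum(T)    # 5/6

assert abs(time_avg - frame_avg_y0 / frame_avg_T) < 1e-12

per_frame_ratio_avg = sum(a / b for a, b in zip(y0, T)) / len(T)  # = 1.0, differs
```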
The General Problem
[Figure: frame timeline with variable frame sizes T[0], T[1], T[2] and states 1, 4, 2]
Minimize: ȳ0 / T̄
Subject to: ȳl / T̄ ≤ 0 for all l in {1, …, L}
Prior Methods for “typical” MDPs
• Offline Linear Programming Methods (known probabilities).
• Q-learning, Neuro-Dynamic Programming (unconstrained).
    • [Bertsekas, Tsitsiklis 1996]
• 2-timescale/fluid models for constrained MDPs.
    • [Borkar 2005]
    • [Salodkar, Bhorkar, Karandikar, Borkar 2008]
    • [Djonin, Krishnamurthy 2007]
    • [Vazquez Abad, Krishnamurthy 2003]
    • [Fu, van der Schaar 2010]
• The above works typically require:
    • Finite action space
    • No ω[r] process
    • Fixed slot lengths (they do not treat fractional problems)
The Linear Fractional Program
For this slide, assume no ω[r] process and that the sets A(k) are finite:
Variables: f(k, α) for k in {1,…,K}, α in A(k).
Minimize: [Σk Σα f(k,α) y0(k,α)] / [Σk Σα f(k,α) T(k,α)]
Subject to: Σk Σα f(k,α) yl(k,α) ≤ 0 for l in {1,…,L}
    Σα f(j,α) = Σk Σα f(k,α) Pkj(α) for all j (global balance)
    Σk Σα f(k,α) = 1,  f(k,α) ≥ 0
where f(k, α) is interpreted as the steady-state probability of being in state k[r] = k and using action α[r] = α. The policy is then:
    Pr[α[r] = α | k[r] = k] = f(k, α) / Σα′ f(k, α′)
Note: See “Additional Slides 2” for the Linear Fractional Program with the ω[r] process.
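For comparison with the online algorithms below: when all probabilities are known and the problem is small, this offline linear fractional program can be solved directly via the standard Charnes-Cooper substitution g(k,α) = f(k,α) / Σ f·T, which turns the ratio objective into a plain LP. This is only a sketch, not the paper's method; it assumes SciPy's linprog, and the function name solve_lfp and the array-shaped inputs are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def solve_lfp(y0, yl, T, P):
    """y0, T: (K, A) arrays; yl: (L, K, A); P: (K, A, K) with P[k, a, j].
    Returns the occupation probabilities f(k, a) and the optimal ratio."""
    K, A = y0.shape
    n = K * A
    c = y0.ravel()                       # minimize sum_g g * y0
    A_ub = yl.reshape(len(yl), n)        # constraints: sum_g g * yl <= 0
    b_ub = np.zeros(len(yl))
    rows, rhs = [T.ravel()], [1.0]       # Charnes-Cooper: sum_g g * T = 1
    for j in range(K):                   # global balance at each state j
        row = -P[:, :, j].ravel()        # minus inflow into j
        row.reshape(K, A)[j, :] += 1.0   # plus outflow from j
        rows.append(row)
        rhs.append(0.0)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  A_eq=np.array(rows), b_eq=np.array(rhs),
                  bounds=[(0, None)] * n)
    g = res.x.reshape(K, A)
    return g / g.sum(), res.fun          # f(k, a) and optimal ratio y0*/T*
```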
Paper Outline
• The Linear Fractional Program involves many variables and would require full knowledge of the probabilities p(ω).
• We develop:

Algorithm 1:
• Online policy for solving Linear Fractional Programs.
• Allows infinite sets Ω and A(k, ω).
• Does not require knowledge of p(ω).
• Does not operate on the actual Markov dynamics.
• Solves for the values (Pij*), {yl*(k)} associated with the optimal policy.

Algorithm 2:
• Given target values (Pij*), {yl*(k)}, we develop an online system implementation (with the actual Markov state dynamics) that achieves them.
We can also run these algorithms in parallel, continuously refining our target values for Alg 2 based on the running estimates from Alg 1.
ALG 1: Solving for Optimal Values
• Define a new stochastic system with the same ω[r] process.
• Define a decision variable k[r]:
    • k[r] is chosen in {1, …, K} every frame.
    • It does not evolve according to the MDP.
• Define a new penalty process qij[r]:
    qij[r] = 1{k[r]=i} Pij(ω[r], α[r])
• qij = Fraction of time we transition from “state” i to “state” j.
Treat as a Stochastic Network Optimization problem:
• “Global Balance Equation” constraint for each k:
    Σj q̄kj = Σi q̄ik
• Use a “Virtual Queue” Hk[r] for each constraint k:
    Hk[r+1] = Hk[r] + Σj qkj[r] − Σi qik[r]
Lyapunov Optimization: Drift-Plus-Penalty Ratio
• L[r] = sum of squares of all virtual queues for frame r.
• Δ[r] = L[r+1] − L[r].
• Every frame, observe ω[r], queues[r].
• Then choose k[r] in {1,…,K}, α[r] in A(k[r], ω[r]) to greedily minimize a bound on the drift-plus-penalty ratio:
    E{Δ[r] + V y0[r] | queues[r]} / E{T[r] | queues[r]}
• Can be done “greedily” via a max-weight rule generalization (sketched below).
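A hedged sketch of this max-weight rule, under my reading of the queue definitions above (Zl[r] for the penalty constraints, Hk[r] for global balance), assuming finite K and finite action sets so the minimization can be done by enumeration. The callbacks y, T, P, actions are hypothetical stand-ins, not from the paper.

```python
def alg1_frame(w, Z, H, V, K, actions, y, T, P):
    """One frame of Alg 1: pick (k, a) minimizing the ratio bound,
    then update the virtual queues. Z has L entries; H has K (signed)."""
    best_ka, best_val = None, float("inf")
    for k in range(K):
        for a in actions(k, w):
            pen = y(k, w, a)          # [y0, y1, ..., yL]
            row = P(w, a)[k]          # [Pkj(w, a) for j in range(K)]
            num = (V * pen[0]
                   + sum(Zl * yl for Zl, yl in zip(Z, pen[1:]))
                   + H[k] - sum(Hj * pj for Hj, pj in zip(H, row)))
            val = num / T(k, w, a)    # drift-plus-penalty ratio bound
            if val < best_val:
                best_ka, best_val = (k, a), val
    k, a = best_ka
    pen, row = y(k, w, a), P(w, a)[k]
    Z = [max(Zl + yl, 0.0) for Zl, yl in zip(Z, pen[1:])]   # Zl[r+1]
    H = [Hm + (1.0 if m == k else 0.0) - row[m]             # Hk[r+1]
         for m, Hm in enumerate(H)]
    return (k, a), Z, H
```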
Alg 1: Special Case of no ω[r] process
The drift-plus-penalty rule:
• Every frame r, observe queues[r] = {Zl[r], Hk[r]}.
• Then choose k[r] in {1,…,K} and α[r] in A(k[r]) to greedily minimize:
    [V y0(k, α) + Σl Zl[r] yl(k, α) + Hk[r] − Σj Hj[r] Pkj(α)] / T(k, α)
Alg 1 Performance Theorem
Theorem: If the problem is feasible, then:
(a) All virtual queues are rate stable.
(b) Averages of our variables satisfy all desired constraints.
(c) We find a value within O(1/V) of the optimal objective: ratio* = y0*/T*.
(d) Convergence time = O(V³).
(e) We get an efficient set of parameters to plug into ALG 2: (Pij*), {yl*(k)}, {T*(k)}
ALG 2 does not need the (huge number of) individual probabilities for each ω[r] and each control option α[r].
Alg 2: Targeting the MDP
• Given targets: (Pij*), {yl*(k)}, {T*(k)}.
• Let k[r] = actual Markov state.
• Define qij[r] as before.
• 1̄i = time average of the indicator function 1{k[r]=i}.
• MDP structure: this is not a standard stochastic network optimization.
Lyapunov Optimization Solution:
• While not a standard problem, we can still solve it greedily.
• Virtual queue Hij[r] for each target constraint q̄ij = 1̄i Pij*:
    Hij[r+1] = Hij[r] + qij[r] − 1{k[r]=i} Pij*
• The indicator 1{k[r]=i} is an uncontrollable variable with “memory.”
Greedy Algorithm Idea and Theorem
• L[r] = sum of squares of all virtual queues for frame r.
• Δ[r] = L[r+1] − L[r].
• Every frame, observe ω[r], queues[r], and the actual k[r].
• Then take action α[r] in A(k[r], ω[r]) to greedily minimize a bound on (see the sketch below):
    Δ[r] + V y0[r]
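A hedged sketch of this greedy step for the Hij queues alone (my reconstruction; it omits any penalty-constraint queues for brevity, and the callbacks are hypothetical). Only row i = k[r] of the queue matrix can change on frame r, since qij[r] and the indicator 1{k[r]=i} both vanish for i ≠ k[r].

```python
def alg2_frame(k, w, H, V, actions, y, P, P_star):
    """One frame of Alg 2. k = actual Markov state; H = K-by-K queue
    matrix tracking qij[r] against the targets 1{k[r]=i} * Pij*."""
    K = len(H)
    best_a, best_val = None, float("inf")
    for a in actions(k, w):
        row = P(w, a)[k]              # realized transition row Pkj(w, a)
        drift = sum(H[k][j] * (row[j] - P_star[k][j]) for j in range(K))
        val = drift + V * y(k, w, a)[0]   # bound on Delta[r] + V y0[r]
        if val < best_val:
            best_a, best_val = a, val
    row = P(w, best_a)[k]
    for j in range(K):                # Hij[r+1] = Hij[r] + qij[r] - 1{k=i} Pij*
        H[k][j] += row[j] - P_star[k][j]
    return best_a, H
```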
Theorem: If the targets are feasible, then this satisfies all constraints and gives a performance objective within O(1/V) of optimality.
Conclusions
• General MDP with variable-length frames.
• ω[r] process can have an infinite number of outcomes.
• Control space can be infinite.
• Combines Lyapunov optimization and MDP theory via “max-weight” rules.
• Online algorithms for linear fractional programming.
Additional Slides -- 1
Example “Degenerate” MDP:
• Minimize: E{y0}
• Subject to: E{y1} ≤ 1/2
[Figure: MDP diagram; states have penalties (y0, y1) = (0, 0), (0, 1), (1, 0), with transition probabilities p1, p2]
• Can solve in an expected sense: optimal E{y0} = ½. Optimal fraction of time in the lower-left state = fraction in the lower-right state = ½.
• But it is impossible for pure time averages to achieve the constraints.
• Our ALG 1 would find the optimal solution in the expected sense.
Additional Slides -- 2
The linear fractional program with the ω[r] process, but with finite sets Ω and A(k[r], ω[r]). Let y(k, w, a) represent the probability of being in state k[r] = k, seeing ω[r] = w, and using α[r] = a.
• GBE: Σw Σa y(j, w, a) = Σk Σw Σa y(k, w, a) Pkj(w, a) for all j.
• Normalization: Σk Σw Σa y(k, w, a) = 1.
• Independence of k[r] and ω[r]: Σa y(k, w, a) = p(w) Σw′ Σa′ y(k, w′, a′) for all k, w.
Note: An early draft (on my webpage for 1 week) omitted the normalization and independence constraints for this linear fractional program example. It also used more complex notation: y(k, w, a) = p(w) f(k, w, a).
The online solution given in the paper (for the more general problem) enforces these constraints in a different (online) way.