Online Fractional Programming for Markov Decision Systems


Transcript of Online Fractional Programming for Markov Decision Systems

Page 1: Online Fractional Programming for Markov Decision Systems

Online Fractional Programming for Markov Decision Systems

Michael J. Neely, University of Southern California

Proc. Allerton Conference on Communication, Control, and Computing, September 2011

[Figure: timeline of frames T[0], T[1], T[2] with system states (state 1, state 4, state 2); diagram of an Energy-Aware, 4-State Processor with states 1-4.]

Page 2: Online Fractional Programming for Markov Decision Systems

General System Model

[Figure: timeline of frames T[0], T[1], T[2] with system states 1, 4, 2.]

• Frames r in {0, 1, 2, …}.
• k[r] = system state during frame r. k[r] in {1,…,K}.
• ω[r] = random observation at start of frame r. ω[r] in Ω.
• α[r] = control action on frame r. α[r] in A(k[r], ω[r]).

Page 3: Online Fractional Programming for Markov Decision Systems

General System Model

[Figure: timeline of frames T[0], T[1], T[2] with system states 1, 4, 2.]

• Frames r in {0, 1, 2, …}.
• k[r] = system state during frame r. k[r] in {1,…,K}.
• ω[r] = random observation at start of frame r. ω[r] in Ω.
• α[r] = control action on frame r. α[r] in A(k[r], ω[r]).

Control action affects Frame Size, Penalty Vector, Next State:

• T[r] = T(k[r], ω[r], α[r]).
• [y1[r], …, yL[r]] = y(k[r], ω[r], α[r]).
• [Pij(ω[r], α[r])] = P(ω[r], α[r]).

Page 4: Online Fractional Programming for Markov Decision Systems

Example 1: Discrete Time MDP

[Figure: 4-state Markov chain with states 1-4.]

Minimize: E{y0}
Subject to: E{y1} ≤ 0, …, E{yL} ≤ 0

• All frames have unit size: T[r] = 1 for all r.
• Control action α[r] affects the Penalty Vector and Transition Probs.

Additionally, we can treat problems with…
• ω[r] = random observation at start of frame r:
  • ω[r] is i.i.d. over frames r.
  • ω[r] in Ω (arbitrary cardinality set).
  • Pr[ω[r] = ω] (unknown probability distribution).

Page 5: Online Fractional Programming for Markov Decision Systems

Example 2: Processor with Several Energy-Saving Modes

[Figure: diagram of a processor with 4 states (modes) 1-4.]

• Random Job Arrivals, L different classes.
• k[r] = processing mode (4 different modes).
• Action α[r]: Choose which job to serve, and next mode.
• k[r] and α[r] affect:

  • Processing Time
  • Switching Time
  • Energy Expenditure

Energy-Aware, 4-State Processor

Page 6: Online Fractional Programming for Markov Decision Systems

Relation between Averages

Define the frame-average for y0[r]:

The time-average for y0[r] is then:
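The defining equations on this slide are images in the original deck; a standard reconstruction from the frame-based framework (my notation, not verbatim from the slide) is:

Frame-average of y0 over the first R frames:   (1/R) ∑_{r=0}^{R-1} y0[r]
Frame-average of T over the first R frames:    (1/R) ∑_{r=0}^{R-1} T[r]

Time-average of y0 (penalty per unit time):
[ ∑_{r=0}^{R-1} y0[r] ] / [ ∑_{r=0}^{R-1} T[r] ]  =  (frame-average of y0) / (frame-average of T)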

Page 7: Online Fractional Programming for Markov Decision Systems

The General Problem

[Figure: timeline of frames T[0], T[1], T[2] with system states 1, 4, 2.]
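The problem statement on this slide is an image; based on the averages defined on the previous slide, the general fractional problem presumably has the form (a hedged reconstruction, not copied from the slide):

Minimize:    (frame-average of y0) / (frame-average of T)
Subject to:  (frame-average of yl) / (frame-average of T) ≤ 0    for l = 1, …, L
             α[r] in A(k[r], ω[r]) for all frames r
             k[r] evolves according to the transition probabilities P(ω[r], α[r])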

Page 8: Online Fractional Programming for Markov Decision Systems

Prior Methods for “typical” MDPs

• Offline Linear Programming Methods (known probabilities).
• Q-learning, Neurodynamic programming (unconstrained).
  • [Bertsekas, Tsitsiklis 1996]
• 2-timescale/fluid models for constrained MDPs.
  • [Borkar 2005] [Salodkar, Bhorkar, Karandikar, Borkar 2008]
  • [Djonin, Krishnamurthy 2007]
  • [Vazquez Abad, Krishnamurthy 2003]
  • [Fu, van der Schaar 2010]
• The above works typically require:
  • Finite action space
  • No ω[r] process
  • Fixed slot length problems (does not treat fractional problems).

Page 9: Online Fractional Programming for Markov Decision Systems

The Linear Fractional Program

For this slide, assume no ω[r] process and that the sets A(k) are finite:

Variables
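The program itself is an image on this slide; a standard linear fractional program over state-action frequencies f(k, α) (a reconstruction consistent with the interpretation on the next slide, not copied from the slide) is:

Variables:   f(k, α) ≥ 0 for k in {1, …, K}, α in A(k)
Minimize:    ∑k ∑α f(k, α) y0(k, α)  /  ∑k ∑α f(k, α) T(k, α)
Subject to:  ∑k ∑α f(k, α) yl(k, α) ≤ 0              for l = 1, …, L
             ∑α f(j, α) = ∑k ∑α f(k, α) Pkj(α)        for all j   (global balance)
             ∑k ∑α f(k, α) = 1                        (normalization)

where y0(k, α), yl(k, α), T(k, α) denote the expected penalties and frame size under state k and action α.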

Page 10: Online Fractional Programming for Markov Decision Systems

The Linear Fractional Program

For this slide, assume no ω[r] process and that the sets A(k) are finite:

where f(k, α) is interpreted as the steady-state probability of being in state k[r] = k and using action α[r] = α, and the policy is then:
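The policy formula itself is an image on the slide; the standard randomized policy induced by such frequencies (an assumption consistent with the sentence above) is:

Pr[ α[r] = α | k[r] = k ]  =  f(k, α) / ∑α' f(k, α')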

Note: See “Additional Slides 2” for the Linear Fractional Program with the ω[r] process.

Page 11: Online Fractional Programming for Markov Decision Systems

Paper Outline

• The Linear Fractional Program involves many variables and would require full knowledge of the probabilities p(ω).
• We develop:

Algorithm 1:
• Online policy for solving Linear Fractional Programs.
• Allows infinite sets Ω and A(k,ω).
• Does not require knowledge of p(ω).
• Does not operate on the actual Markov dynamics.
• Solves for the values (Pij*), {yl*(k)} associated with the optimal policy.

Algorithm 2:
• Given target values (Pij*), {yl*(k)}, we develop an online system implementation (with the actual Markov state dynamics) that achieves them.

We can also run these algorithms in parallel, continuously refining our target values for Alg 2 based on the running estimates from Alg 1.

Page 12: Online Fractional Programming for Markov Decision Systems

ALG 1: Solving for Optimal Values

• Define a new stochastic system with the same ω[r] process.
• Define a decision variable k[r]:
  • k[r] is chosen in {1, …, K} every frame.
  • It does not evolve according to the MDP.

• Define a new penalty process qij[r]:

qij[r] = 1{k[r]=i} Pij (ω[r], α[r])

• The time average of qij[r] = fraction of time we transition from “state” i to “state” j.

Page 13: Online Fractional Programming for Markov Decision Systems

Treat as a Stoch Network Optimization:

Page 14: Online Fractional Programming for Markov Decision Systems

Treat as a Stoch Network Optimization:

“Global Balance Equation”
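The equations on this slide and the previous one are images; written in terms of the time averages of qij[r], the global balance constraint presumably reads (a reconstruction, not verbatim):

∑j (time average of qkj[r])  =  ∑i (time average of qik[r])    for all k in {1, …, K},

i.e., the long-run rate of transitions out of each “state” k equals the long-run rate of transitions into it.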

Page 15: Online Fractional Programming for Markov Decision Systems

Treat as a Stoch Network Optimization:

Use a “Virtual Queue” for each constraint k:

[Queue diagram: virtual queue Hk[r] with arrivals ∑j qkj[r] and departures ∑i qik[r].]
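The update equation is not in the transcript; a virtual-queue update matching the diagram (an assumption about the exact form, e.g. whether a nonnegativity projection is applied) is:

Hk[r+1] = Hk[r] + ∑j qkj[r] − ∑i qik[r]

Keeping Hk[r] stable then enforces the global balance constraint for state k on average.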

Page 16: Online Fractional Programming for Markov Decision Systems

Lyapunov Optimization: Drift-Plus-Penalty Ratio

• L[r] = sum of squares of all virtual queues for frame r.
• Δ[r] = L[r+1] – L[r].
• Every frame, observe ω[r], queues[r].
• Then choose k[r] in {1,…, K}, α[r] in A(k[r],ω[r]) to greedily minimize a bound on the drift-plus-penalty ratio:

E{Δ[r] + V y0[r] | queues[r]} / E{T[r] | queues[r]}

• Can be done “greedily” via a max-weight rule generalization.

Page 17: Online Fractional Programming for Markov Decision Systems

Alg 1: Special Case of no ω[r] process

The drift-plus-penalty rule:
• Every frame r, observe queues[r] = {Zl[r], Hk[r]}.
• Then choose k[r] in {1,…,K}, and α[r] in A(k[r]) to greedily minimize:
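The expression being minimized is an image on the slide; expanding the drift-plus-penalty-ratio bound with qij[r] = 1{k[r]=i} Pij(α) gives, as a sketch of the kind of rule involved (my reconstruction, not verbatim from the slide):

Choose k in {1, …, K} and α in A(k) to minimize

[ V y0(k, α) + ∑l Zl[r] yl(k, α) + Hk[r] − ∑j Hj[r] Pkj(α) ] / T(k, α)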

Page 18: Online Fractional Programming for Markov Decision Systems

Alg 1 Performance Theorem

Theorem: If the problem is feasible, then:

(a) All virtual queues are rate stable.
(b) Averages of our variables satisfy all desired constraints.
(c) We find a value within O(1/V) of the optimal objective ratio* = y0*/T*.
(d) Convergence time = O(V^3).
(e) We get an efficient set of parameters to plug into ALG 2:

(Pij*), {yl*(k)}, {T*(k)}

ALG 2 does not need the (huge number of) individual probabilities for each ω[r] and each control option α[r].

Page 19: Online Fractional Programming for Markov Decision Systems

Alg 2: Targeting the MDP

• Given targets: (Pij*), {yl*(k)}, {T*(k)}.
• Let k[r] = Actual Markov State.
• Define qij[r] as before.
• 1i = time average of the indicator function 1{k[r]=i}.
• MDP structure: This is not a standard stochastic net opt.

Page 20: Online Fractional Programming for Markov Decision Systems

Lyapunov Optimization Solution

• While not a standard problem, we can still solve it greedily:

[Queue diagram: virtual queue Hij[r] with arrivals qij[r] and departures 1{k[r]=i} Pij*.]

Uncontrollable variable with “memory”
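The update equation is again an image; a virtual-queue update consistent with the diagram (an assumption on the exact form) is:

Hij[r+1] = Hij[r] + qij[r] − 1{k[r]=i} Pij*

Stabilizing Hij[r] pushes the time average of qij[r] toward the target 1i Pij*.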

Page 21: Online Fractional Programming for Markov Decision Systems

Greedy Algorithm Idea and Theorem

• L[r] = sum of squares of all virtual queues for frame r.
• Δ[r] = L[r+1] – L[r].
• Every frame, observe ω[r], queues[r], and actual k[r].
• Then take action α[r] in A(k[r], ω[r]) to greedily minimize a bound on:

Δ[r] + V y0[r]

Theorem: If the targets are feasible, then this satisfies all constraints and gives performance objective within O(1/V) of optimality.

Page 22: Online Fractional Programming for Markov Decision Systems

Conclusions

• General MDP with variable length frames.
• ω[r] process can have an infinite number of outcomes.
• Control space can be infinite.
• Combines Lyapunov optimization and MDP theory for “max-weight” rules.
• Online algorithms for linear fractional programming.

[Figure: timeline of frames T[0], T[1], T[2] with system states; diagram of the Energy-Aware, 4-State Processor.]

Page 23: Online Fractional Programming for Markov Decision Systems

Additional Slides -- 1

Example “Degenerate” MDP:
• Minimize: E{y0}
• Subject to: E{y1} ≤ 1/2

[Figure: 3-state Markov chain with per-state penalties (y0=0, y1=0), (y0=0, y1=1), (y0=1, y1=0) and transition probabilities p1, p2.]

• Can solve in an expected sense: Optimal E{y0} = ½. Optimal fraction of time in lower left = fraction in lower right = ½.
• But it is impossible for pure time averages to achieve the constraints.
• Our ALG 1 would find the optimal solution in the expected sense.


Page 24: Online Fractional Programming for Markov Decision Systems

Additional Slides -- 2

The linear fractional program with the ω[r] process, but with finite sets Ω and A(k[r], ω[r]). Let y(k, w, a) represent the probability of being in state k[r] = k, seeing ω[r] = w, and using α[r] = a.

GBE

Normalization

Independence of k[r] and ω[r].
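The program itself is an image on this slide; a reconstruction of the three labeled constraints in terms of the variables y(k, w, a) (consistent with the note on the next slide that y(k, w, a) = p(w) f(k, w, a), but not copied from the slide) is:

GBE:            ∑w ∑a y(j, w, a) = ∑k ∑w ∑a y(k, w, a) Pkj(w, a)    for all j
Normalization:  ∑k ∑w ∑a y(k, w, a) = 1,  with y(k, w, a) ≥ 0
Independence:   ∑a y(k, w, a) = p(w) ∑w' ∑a y(k, w', a)             for all k, w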

Page 25: Online Fractional Programming for Markov Decision Systems

Additional Slides -- 2

The linear fractional program with the ω[r] process, but with finite sets Ω and A(k[r], ω[r]). Let y(k, w, a) represent the probability of being in state k[r] = k, seeing ω[r] = w, and using α[r] = a.

Note: An early draft (on my webpage for 1 week) omitted the normalization and independence constraints for this linear fractional program example. It also used more complex notation: y(k, w, a) = p(w) f(k, w, a).

The online solution given in the paper (for the more general problem) enforces these constraints in a different (online) way.