Transcript of: adityam/talks/queens-2011.pdf
Separation result for delayedsharing information structures
Aditya Mahajan
McGill University
Joint work with: Ashutosh Nayyar and Demos Teneketzis, University of Michigan
Queen's University, July 12, 2011
Common theme: multi-stage multi-agent decision making under uncertainty
Outline of this talk
What is the conceptual difficulty with multi-agent decision making? How to resolve it?
Modeling decision making under uncertainty
Overview of single-agent decision making
Delayed sharing information structure: A “simple” model for multi-agent decision making
History of the problem
Our approach
Main results
Conclusion
Modeling multi-stage decision making under uncertainty

Model of uncertainty
Stochastic dynamics: X_{t+1} = f_t(X_t, U_t, W_t)
Noisy observations: Y_t = h_t(X_t, V_t)
The state disturbance {W_t} and observation noise {V_t} are i.i.d. stochastic processes with known distributions.
The system dynamics f_t and observation function h_t are known.

Model of information
Single DM with perfect recall: U_t = g_t(Y_{1:t}, U_{1:t-1})

Model of objective
Cost at time t: c_t(X_t, U_t)
Objective: minimize the expected total cost E[ ∑_{t=1}^T c_t(X_t, U_t) ]
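The model above can be made concrete with a small simulation. This is a minimal sketch under assumed dynamics: a hypothetical two-state system with Bernoulli disturbance and noise; the functions f and h, the noise probabilities, and the cost are all illustrative, not from the talk.

```python
import random

# Hypothetical two-state instance of the model above (all numbers assumed):
# dynamics X_{t+1} = f(X_t, U_t, W_t), noisy observations Y_t = h(X_t, V_t),
# with {W_t}, {V_t} i.i.d. Bernoulli, and per-stage cost c(x, u) = 1{x != 0}.

def f(x, u, w):
    # Action u = 1 resets the state to 0; the disturbance w then flips it.
    x_next = 0 if u == 1 else x
    return 1 - x_next if w else x_next

def h(x, v):
    # The observation is the state, corrupted by the noise v.
    return 1 - x if v else x

def simulate(policy, horizon, seed=0):
    rng = random.Random(seed)
    x, total_cost = 0, 0.0
    for _ in range(horizon):
        y = h(x, rng.random() < 0.1)      # V_t ~ Bernoulli(0.1)
        u = policy(y)
        total_cost += 1.0 if x != 0 else 0.0
        x = f(x, u, rng.random() < 0.05)  # W_t ~ Bernoulli(0.05)
    return total_cost

cost = simulate(lambda y: y, horizon=100)  # myopic policy: act on current observation
```

The myopic policy here ignores the history; the single-agent theory in the talk justifies replacing the full history by a belief over the state.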
More general setups

Model of uncertainty
Non-i.i.d. dynamics (Markov, ergodic, etc.)
Unknown distribution
Unknown model, unknown cost, etc.

Model of information
Fixed memory/complexity at the decision maker
More than one decision maker

Model of objective
Worst-case performance (instead of expected performance)
Minimize regret (instead of minimizing total cost)
Remain in a desirable set (rather than minimize total cost)
Outline of this talk
What is the conceptual difficulty with multi-agent decision making? How to resolve it?
Modeling decision making under uncertainty
Overview of single-agent decision making
Delayed sharing information structure: A “simple” model for multi-agent decision making
History of the problem
Our approach
Main results
Conclusion
Single-agent decision making

Design difficulties
U_t = g_t(Y_{1:t}, U_{1:t-1}): the domain of the control laws increases with time.
min_{g_1, ..., g_T} E[ ∑_{t=1}^T c_t(X_t, U_t) ]: the search for an optimal control policy is a functional optimization problem.

Structural results (estimation)
Define the information state: π_t = ℙ(X_t | Y_{1:t}, U_{1:t-1})
Then there is no loss of optimality in restricting attention to control laws of the form U_t = g_t(π_t).

Dynamic programming (control)
The following recursive equations provide an optimal control policy:
V_t(π_t) = min_{u_t} E[ c_t(X_t, u_t) + V_{t+1}(π_{t+1}) | π_t, u_t ]

π_t is policy independent. Each step of the DP is a parameter optimization.
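The information-state recursion and the dynamic program above can be sketched for a two-state, two-observation, two-action example. Everything numeric here (transition matrices P, observation kernel Q, cost c, horizon) is an illustrative assumption, not from the talk.

```python
# Hypothetical finite model: P[u][x][x2] = transition probability,
# Q[x][y] = observation probability, per-stage cost c(x) = 1{x = 1}.
P = {0: [[0.95, 0.05], [0.05, 0.95]],
     1: [[0.05, 0.95], [0.95, 0.05]]}
Q = [[0.9, 0.1], [0.1, 0.9]]
c = [0.0, 1.0]

def belief_update(pi, u, y):
    # pi_{t+1} = P(X_{t+1} | y_{1:t+1}, u_{1:t}): predict, then Bayes-correct.
    pred = [sum(pi[x] * P[u][x][x2] for x in (0, 1)) for x2 in (0, 1)]
    post = [pred[x2] * Q[x2][y] for x2 in (0, 1)]
    z = sum(post)
    return [p / z for p in post]

def dp_value(pi, t, horizon):
    # V_t(pi) = min_u E[c(X_t) + V_{t+1}(pi_{t+1}) | pi, u]: each step is a
    # minimization over the parameter u, not over a function of the history.
    exp_cost = sum(pi[x] * c[x] for x in (0, 1))
    if t == horizon:                      # terminal cost E[c(X_T) | pi]
        return exp_cost, None
    best_val, best_u = float("inf"), None
    for u in (0, 1):
        val = exp_cost
        for y in (0, 1):
            pred = [sum(pi[x] * P[u][x][x2] for x in (0, 1)) for x2 in (0, 1)]
            prob_y = sum(pred[x2] * Q[x2][y] for x2 in (0, 1))
            if prob_y > 0:
                val += prob_y * dp_value(belief_update(pi, u, y), t + 1, horizon)[0]
        if val < best_val:
            best_val, best_u = val, u
    return best_val, best_u

v_opt, u_opt = dp_value([0.5, 0.5], 0, 3)
```

Here belief_update is the policy-independent filter for π_t, and dp_value carries out the per-stage parameter optimization that the separation result makes possible.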
(One-way) separation between estimation and control

In single-agent decision making, estimation is separated from control. This separation is critical for decomposing the search for an optimal control policy into a sequence of parameter optimization problems.
Does this separation extendto multi-agent decision making?
Outline of this talk
What is the conceptual difficulty with multi-agent decision making? How to resolve it?
Modeling decision making under uncertainty
Overview of single-agent decision making
Delayed sharing information structure: A “simple” model for multi-agent decision making
History of the problem
Our approach
Main results
Conclusion
Delayed-sharing information structure

Model of uncertainty
Stochastic dynamics: X_{t+1} = f_t(X_t, U_t^{1:2}, W_t)
Noisy observations: Y_t^i = h_t^i(X_t, V_t^i)

Model of information
Each DM has perfect recall.
Each DM observes the n-step delayed information (observations and actions) of the other DMs:
U_t^i = g_t^i( Y^i_{1:t}, U^i_{1:t-1}, Y^j_{1:t-n}, U^j_{1:t-n} ), j ≠ i

Model of objective
Cost at time t: c_t(X_t, U_t^{1:2})
Objective: minimize E[ ∑_{t=1}^T c_t(X_t, U_t^{1:2}) ]
Delayed-sharing information structure

Some notation
U_t^i = g_t^i( Y^i_{1:t}, U^i_{1:t-1}, Y^j_{1:t-n}, U^j_{1:t-n} )
      = g_t^i( Y^i_{t-n+1:t}, U^i_{t-n+1:t-1}, Y^{1:2}_{1:t-n}, U^{1:2}_{1:t-n} )
Thus, U_t^i = g_t^i( C_t, L_t^i ), where
common info C_t = ( Y^{1:2}_{1:t-n}, U^{1:2}_{1:t-n} )
local info L_t^i = ( Y^i_{t-n+1:t}, U^i_{t-n+1:t-1} )
Same design difficulties as in the single-agent case
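The common/local split above is just bookkeeping over time indices, which a small helper can make explicit. The function name, argument names, and dictionary keys below are illustrative, not from the talk.

```python
def split_information(t, n):
    """Index sets for the n-step delayed-sharing split at time t.

    Common info C_t: observations and actions of both DMs up to time t - n.
    Local info L_t^i: DM i's own recent observations Y^i_{t-n+1:t} and
    actions U^i_{t-n+1:t-1}.
    """
    common = {"Y^1:2": list(range(1, t - n + 1)),
              "U^1:2": list(range(1, t - n + 1))}
    local = {"Y^i": list(range(max(1, t - n + 1), t + 1)),
             "U^i": list(range(max(1, t - n + 1), t))}
    return common, local

common, local = split_information(t=5, n=2)
```

For t = 5 and n = 2, both DMs share everything up to time 3, while DM i privately holds its observations at times 4 and 5 and its action at time 4.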
Literature overview

(Witsenhausen, 1971): Proposed the delayed-sharing information structure. Asserted a structure of the optimal control law (without proof).

(Varaiya and Walrand, 1978): Proved Witsenhausen's assertion for delay n = 1. Showed via a counterexample that the assertion is false for delay n > 1.

(Nayyar, Mahajan, and Teneketzis, 2011): Prove two alternative structures of the optimal control law. Also obtain a recursive algorithm to find optimal control laws; at each step, a functional optimization problem must be solved.
Structure of optimal control law

Original setup: U_t^i = g_t^i( C_t, L_t^i )

W71 assertion: U_t^i = g_t^i( ℙ(X_{t-n} | C_t), L_t^i )

NMT11 first result: U_t^i = g_t^i( ℙ(X_{t-n}, L_t^{1:2} | C_t), L_t^i )

NMT11 second result: U_t^i = g_t^i( ℙ(X_{t-n} | C_t), L_t^i, λ^{1:2}_{t-n+1:t-1} )

Contrast the dependence on the policy across the different results.
Importance of the problem

Applications (of one-step delayed sharing)
Power systems: Altman et al., 2009
Queueing theory: Kuri and Kumar, 1995
Communication networks: Grizzle et al., 1982
Stochastic games: Papavassilopoulos, 1982; Chang and Cruz, 1983
Economics: Li and Wu, 1991

Conceptual significance
Understanding the design of networked control systems
A bridge between centralized and decentralized systems
Insights for the design of general decentralized systems
Orthogonal search does not work due to the presence of signaling: how does controller 1 figure out how controller 2 will interpret his (controller 1's) actions?
Proof outline
1. Construct a coordinated system
2. Show that any policy of the coordinated system is implementable in the original system and vice versa; hence, the two systems are equivalent.
3. The optimal design of the coordinated system is a single-agent multi-stage decision problem. Find a solution for the coordinated system.
4. Translate this solution back to the original system.
Step 1: The coordinated system

[Block diagram: a coordinator observes the common information C_t and sends prescriptions to controllers 1 and 2, which also see their local information L_t^1, L_t^2.]

Define the partially evaluated control law: λ_t^i = g_t^i( C_t, · )
The coordinator prescribes ( λ_t^1, λ_t^2 ) to the controllers as
( λ_t^1, λ_t^2 ) = ψ_t( C_t, λ^1_{1:t-1}, λ^2_{1:t-1} )
Step 2: Equivalence

For any policy (g_1, ..., g_T) of the original system, we can construct a policy (ψ_1, ..., ψ_T) of the coordinated system such that the system variables { X_t, Y_t^{1:2}, U_t^{1:2}, t = 1, ..., T } have the same realization along all sample paths in both cases:
ψ_t( C_t, λ^{1:2}_{1:t-1} ) = ( λ_t^1, λ_t^2 ) = ( g_t^1( C_t, · ), g_t^2( C_t, · ) )

For any policy (ψ_1, ..., ψ_T) of the coordinated system, we can construct a policy (g_1, ..., g_T) of the original system such that the system variables { X_t, Y_t^{1:2}, U_t^{1:2}, t = 1, ..., T } have the same realization along all sample paths in both cases:
At time 1, both controllers know C_1. Choose g_1^i( C_1, · ) = ψ_1^i( C_1 ).
At time 2, both controllers know C_2, λ_1^1, and λ_1^2. Choose g_2^i( C_2, · ) = ψ_2^i( C_2, λ_1^1, λ_1^2 ). Continue in this manner.
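Both directions of this equivalence rest on partial evaluation: fixing the common-information argument of a control law g_t^i(C_t, ·) yields exactly the prescription λ_t^i that the coordinator sends. A minimal sketch; the control law g and its arguments are hypothetical, chosen only to show that the two systems compute the same action.

```python
from functools import partial

# Hypothetical control law g_t^i(C_t, L_t^i); the rule itself is arbitrary.
def g(common, local):
    return sum(common) + sum(local)

C_t, L_t = (1, 0, 1), (0, 1)  # illustrative realizations of C_t and L_t^i

# Original system: controller i evaluates g on (C_t, L_t^i) directly.
u_direct = g(C_t, L_t)

# Coordinated system: the coordinator, who sees only C_t, partially
# evaluates g to the prescription lambda_t^i = g(C_t, .); controller i
# then applies the prescription to its local information.
prescription = partial(g, C_t)
u_coordinated = prescription(L_t)
```

The coordinator merely refactors who evaluates which argument of the control law, which is why the system variables have the same realization in both systems.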
Step 3: Solve the coordinated system

By construction, the coordinated system has a single decision maker with perfect recall.

Use the result for single-agent decision making. Define
π_t = ℙ( “current state” | past history )
Then there is no loss of optimality in restricting attention to control laws of the form
control action = Fn( π_t )

What is the state (for the input-output mapping) of the system?
State for the coordinated system

The state (for the input-output mapping) of the coordinated system is ( X_{t-n}, L_t^1, L_t^2 ).
Structure of optimal control law

Define π_t = ℙ( X_{t-n}, L_t^1, L_t^2 | C_t, λ^1_{1:t-1}, λ^2_{1:t-1} )
Then there is no loss of optimality in restricting attention to coordination laws of the form
( λ_t^1, λ_t^2 ) = ψ_t( π_t )
The following recursive equations provide an optimal coordination policy:
V_t( π_t ) = min_{λ_t^1, λ_t^2} E[ c_t(X_t, U_t^{1:2}) + V_{t+1}( π_{t+1} ) | π_t, λ_t^1, λ_t^2 ]
Step 4: Translate the solution

For a system with the delayed-sharing information structure, there is no loss of optimality in restricting attention to control laws of the form
U_t^i = g_t^i( π_t, L_t^i )
Optimal control laws can be obtained from the solution of the following recursive equations:
V_t( π_t ) = min_{g_t^1(π_t, ·), g_t^2(π_t, ·)} E[ c_t(X_t, U_t^{1:2}) + V_{t+1}( π_{t+1} ) | π_t, g_t^1(π_t, ·), g_t^2(π_t, ·) ]
Features of the solution
The space of realizations of π_t = ℙ( X_{t-n}, L_t^1, L_t^2 | C_t, λ^1_{1:t-1}, λ^2_{1:t-1} ) is time-invariant. Thus, the domain of the control laws g_t^i( π_t, L_t^i ) is time-invariant.

π_t is not policy independent! Estimation is not separated from control. This is always the case when signaling is present.

In each step of the dynamic program, we choose the partially evaluated control laws g_t^1( π_t, · ), g_t^2( π_t, · ). Choosing partially evaluated functions (instead of values) allows us to write a dynamic program even in the presence of signaling.
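A single step of this dynamic program can be brute-forced for a tiny instance to make the point concrete: the minimization ranges over prescription pairs, i.e., over maps from local information to actions. The belief, cost, and alphabets below are illustrative assumptions, not from the talk.

```python
from itertools import product

local_obs = (0, 1)   # possible realizations of L_t^i (illustrative)
actions = (0, 1)

# A prescription for one controller is a map local_obs -> actions,
# encoded as the tuple (lam(0), lam(1)); there are 4 per controller.
prescriptions = list(product(actions, repeat=len(local_obs)))

# Hypothetical coordinator belief Pi_t over (X_{t-n}, L^1, L^2),
# flattened as {(x, l1, l2): probability}.
Pi = {(0, 0, 0): 0.4, (0, 1, 1): 0.1, (1, 0, 1): 0.2, (1, 1, 0): 0.3}

def stage_cost(x, u1, u2):
    # Illustrative cost: each controller pays 1 for mismatching the state.
    return (u1 != x) + (u2 != x)

def expected_cost(lam1, lam2):
    # E[c(X, U^1, U^2) | Pi, lam1, lam2] with U^i = lam_i(L^i).
    return sum(p * stage_cost(x, lam1[l1], lam2[l2])
               for (x, l1, l2), p in Pi.items())

# One DP step: minimize over prescription PAIRS (functions), not actions.
best = min(product(prescriptions, prescriptions),
           key=lambda pair: expected_cost(*pair))
best_cost = expected_cost(*best)
```

With a continuation value V_{t+1} added inside expected_cost, this minimization becomes one full stage of the coordinator's recursion; the essential feature is that its decision variables are functions.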
Outline of this talk
What is the conceptual difficulty with multi-agent decision making? How to resolve it?
Modeling decision making under uncertainty
Overview of single-agent decision making
Delayed sharing information structure: A “simple” model for multi-agent decision making
History of the problem
Our approach
Main results
Conclusion
Summary
A simple methodology to resolve a 40-year-old open question:

Find the common information at each time.

Look at the problem from the point of view of a coordinator that observes this common information and chooses partially evaluated functions.

Find an information state for the problem at the coordinator:
ℙ( state for input-output mapping | common information )
= ( ℙ( past state | common information ), past partial control laws )

This methodology is also applicable to systems with more general information structures (Mahajan, Nayyar, Teneketzis, 2008).
Salient Features
The size of the information state is time-invariant
The methodology is also applicable to infinite horizon problems
Each step of DP is a functional optimization problem
The form of the DP is similar to that of a POMDP
Can borrow numerical approaches from the POMDP literature