Regret Bounds for the Adaptive Control of Linear Quadratic Systems
COLT 2011
Yasin Abbasi-Yadkori, Csaba Szepesvari
University of Alberta
E-mail: [email protected]
Budapest, July 10, 2011
COLT 2011 (Budapest) Partial monitoring July 10, 2011 1 / 27
Contributions
[Figure: probability density functions over the action space — the behavior policy, the target policy with recognizer, and the target policy w/o recognizer.]
Off-policy learning with options and recognizers
Doina Precup, Richard S. Sutton, Cosmin Paduraru, Anna J. Koop, Satinder Singh
McGill University, University of Alberta, University of Michigan
Ideas and Motivation · Background · Recognizers · Off-policy algorithm for options · Learning w/o the Behavior Policy
Options
• A way of behaving for a period of time
Models of options
• A predictive model of the outcome of following the option
  – What state will you be in?
  – Will you still control the ball?
  – What will be the value of some feature?
  – Will your teammate receive the pass?
  – What will be the expected total reward along the way?
  – How long can you keep control of the ball?
Options for soccer players could be: Dribble, Keepaway, Pass
Options in a 2D world
[Figure: a 2D world with walls, a distinguished region, and an experienced trajectory.]
The red and blue options are mostly executed. Surely we should be able to learn about them from this experience!
Off-policy learning
• Learning about one policy while behaving according to another
• Needed for RL w/ exploration (as in Q-learning)
• Needed for learning abstract models of dynamical systems (representing world knowledge)
• Enables efficient exploration
• Enables learning about many ways of behaving at the same time (learning models of options)
  – a policy
  – a stopping condition
Non-sequential example
(Problem formulation w/o and with recognizers)
• One state
• Continuous action a ∈ [0, 1]
• Outcomes z_i = a_i
• Given samples from policy b : [0, 1] → ℝ⁺
• Would like to estimate the mean outcome for a sub-region of the action space, here a ∈ [0.7, 0.9]
Problem formulation w/o recognizers: the target policy π : [0, 1] → ℝ⁺ is uniform within the region of interest (see dashed line in figure below). The estimator is:

m̂ = (1/n) ∑_{i=1}^n [π(a_i) / b(a_i)] z_i .
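The estimator above can be checked numerically. This is a minimal sketch (not from the poster), assuming the behavior policy b is uniform on [0, 1]; the target π is then 5 on [0.7, 0.9] and 0 elsewhere, and the true mean outcome under π is 0.8.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Behavior policy b: uniform on [0, 1]; outcomes z_i = a_i.
a = rng.uniform(0.0, 1.0, size=n)
z = a

# Target policy pi: uniform on the region of interest [0.7, 0.9],
# so pi(a) = 1/0.2 = 5 on the region and 0 elsewhere; b(a) = 1.
pi = np.where((a >= 0.7) & (a <= 0.9), 5.0, 0.0)
b = np.ones(n)

m_hat = np.mean(pi / b * z)  # importance-sampling estimate of the mean outcome
```

With the uniform target, m̂ should be close to E_π[z] = 0.8.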
Theorem 1 Let A = {a₁, ..., a_k} ⊆ 𝒜 be a subset of all the possible actions. Consider a fixed behavior policy b and let Π_A be the class of policies that only choose actions from A, i.e., if π(a) > 0 then a ∈ A. Then the policy induced by b and the binary recognizer c_A is the policy with minimum-variance one-step importance sampling corrections, among those in Π_A:

π as given by (1) = arg min_{p ∈ Π_A} E_b[ (p(a_i) / b(a_i))² ]    (2)

Proof: Using Lagrange multipliers.
Theorem 2 Consider two binary recognizers c1 and c2, such that
µ1 > µ2. Then the importance sampling corrections for c1 have
lower variance than the importance sampling corrections for c2.
Off-policy learning
Let the importance sampling ratio at time step t be:

ρ_t = π(s_t, a_t) / b(s_t, a_t)

The truncated n-step return, R_t^(n), satisfies:

R_t^(n) = ρ_t [r_{t+1} + (1 − β_{t+1}) R_{t+1}^(n−1)] .

The update to the parameter vector is proportional to:

Δθ_t = [R_t^λ − y_t] ∇_θ y_t ρ_0 (1 − β_1) ··· ρ_{t−1} (1 − β_t) .
Theorem 3 For every time step t ≥ 0 and any initial state s,

E_b[Δθ_t | s] = E_π[Δθ_t | s] .

Proof: By induction on n we show that E_b{R_t^(n) | s} = E_π{R_t^(n) | s}, which implies that E_b{R_t^λ | s} = E_π{R_t^λ | s}. The rest of the proof is algebraic manipulations (see paper).
Implementation of off-policy learning for options
In order to avoid Δθ_t → 0 (the product of corrections shrinks over time), we use a restart function g : S → [0, 1] (as in the PSD algorithm). The forward algorithm becomes:

Δθ_t = (R_t^λ − y_t) ∇_θ y_t ∑_{i=0}^t g_i ρ_i ··· ρ_{t−1} (1 − β_{i+1}) ··· (1 − β_t) ,

where g_t is the extent of restarting in state s_t.
The incremental learning algorithm is the following:
• Initialize α_0 = g_0, e_0 = α_0 ∇_θ y_0
• At every time step t:
  δ_t = ρ_t (r_{t+1} + (1 − β_{t+1}) y_{t+1}) − y_t
  θ_{t+1} = θ_t + η δ_t e_t
  α_{t+1} = ρ_t α_t (1 − β_{t+1}) + g_{t+1}
  e_{t+1} = λ ρ_t (1 − β_{t+1}) e_t + α_{t+1} ∇_θ y_{t+1}
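The incremental updates above transcribe directly into code. This is a sketch, assuming linear features (y = θ·φ, so ∇_θ y = φ); the step-size symbol η and all input values are assumptions for illustration, not taken from the poster.

```python
import numpy as np

def off_policy_option_update(theta, alpha_w, e, phi_t, phi_t1, r_t1,
                             rho_t, beta_t1, g_t1, eta=0.1, lam=0.9):
    """One incremental off-policy update for an option's reward model
    with linear features y = theta . phi (sketch)."""
    y_t, y_t1 = theta @ phi_t, theta @ phi_t1
    delta = rho_t * (r_t1 + (1 - beta_t1) * y_t1) - y_t    # corrected TD error
    theta = theta + eta * delta * e                        # parameter update
    alpha_w = rho_t * alpha_w * (1 - beta_t1) + g_t1       # restart weight
    e = lam * rho_t * (1 - beta_t1) * e + alpha_w * phi_t1 # eligibility trace
    return theta, alpha_w, e
```

Initialization follows the poster: α_0 = g_0 and e_0 = α_0 φ_0, then the function is called once per time step.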
Off-policy learning is tricky
• The Bermuda triangle
  – Temporal-difference learning
  – Function approximation (e.g., linear)
  – Off-policy
• Leads to divergence of iterative algorithms
  – Q-learning diverges with linear FA
  – Dynamic programming diverges with linear FA
Baird's Counterexample
[Figure: Baird's star problem — states with V_k(s) = θ(7) + 2θ(i) for i = 1, ..., 5 and one state with V_k(s) = 2θ(7) + θ(6); transitions with probabilities 99%/1%/100% to a terminal state. Plot: parameter values θ_k(i) (log scale, broken at ±1) over 5000 iterations, with θ_k(1)–θ_k(5), θ_k(6), and θ_k(7) all diverging.]
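The divergence in the figure can be reproduced in a few lines. This sketch uses the standard seven-state, eight-weight star formulation of Baird's counterexample (Sutton & Barto's version, which differs in detail from the variant pictured here); even expected (DP-style) semi-gradient TD(0) sweeps under the target policy make the weights blow up. The constants are assumptions chosen to match that standard setup.

```python
import numpy as np

gamma, alpha = 0.99, 0.01

# Feature matrix: 7 states x 8 weights (star problem).
# Upper states i: v(s_i) = 2 w_i + w_8; lower state: v(s_7) = w_7 + 2 w_8.
Phi = np.zeros((7, 8))
for i in range(6):
    Phi[i, i] = 2.0
    Phi[i, 7] = 1.0
Phi[6, 6] = 1.0
Phi[6, 7] = 2.0

w = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 10.0, 1.0])
norm0 = np.linalg.norm(w)

# Expected semi-gradient TD(0)/DP sweeps: under the target policy every
# state transitions to the lower state with reward 0.
for _ in range(1000):
    v = Phi @ w
    delta = gamma * v[6] - v        # expected TD error at each state
    w = w + alpha * Phi.T @ delta   # synchronous semi-gradient update

# The parameter vector grows without bound: the classic divergence.
```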
Precup, Sutton & Dasgupta (PSD) algorithm
• Uses importance sampling to convert the off-policy case to the on-policy case
• Convergence assured by theorem of Tsitsiklis & Van Roy (1997)
• Survives the Bermuda triangle!

BUT!
• Variance can be high, even infinite (slow learning)
• Difficult to use with continuous or large action spaces
• Requires explicit representation of the behavior policy (probability distribution)
Option formalism
An option is defined as a triple o = ⟨I, π, β⟩
• I ⊆ S is the set of states in which the option can be initiated
• π is the internal policy of the option
• β : S → [0, 1] is a stochastic termination condition
We want to compute the reward model of option o:

E_o{R(s)} = E{r_1 + r_2 + ... + r_T | s_0 = s, π, β}

We assume that linear function approximation is used to represent the model:

E_o{R(s)} ≈ θᵀ φ_s = y
References
Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of ICML.
Precup, D., Sutton, R. S. and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of ICML.
Sutton, R. S., Precup, D. and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, vol. 112, pp. 181–211.
Sutton, R. S. and Tanner, B. (2005). Temporal-difference networks. In Proceedings of NIPS-17.
Sutton, R. S., Rafols, E. and Koop, A. (2006). Temporal abstraction in temporal-difference networks. In Proceedings of NIPS-18.
Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, vol. 42.
Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42.
Theorem 4 If the following assumptions hold:
• The function approximator used to represent the model is a state aggregator
• The recognizer behaves consistently with the function approximator, i.e., c(s, a) = c(p, a), ∀s ∈ p
• The recognition probability for each partition, µ̂(p), is estimated using maximum likelihood:

µ̂(p) = N(p, c = 1) / N(p)

Then there exists a policy π such that the off-policy learning algorithm converges to the same model as the on-policy algorithm using π.

Proof: In the limit, w.p. 1, µ̂ converges to ∑_s d_b(s|p) ∑_a c(p, a) b(s, a), where d_b(s|p) is the probability of visiting state s from partition p under the stationary distribution of b. Let π be defined to be the same for all states in a partition p:

π(p, a) = ρ(p, a) ∑_s d_b(s|p) b(s, a)

π is well-defined, in the sense that ∑_a π(s, a) = 1. Using Theorem 3, off-policy updates using importance sampling corrections ρ will have the same expected value as on-policy updates using π.
Acknowledgements
The authors gratefully acknowledge the ideas and encouragement they have received in this work from Eddie Rafols, Mark Ring, Lihong Li and other members of the rlai.net group. We thank Csaba Szepesvari and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI.
Problem formulation with recognizers: the target policy π is induced by a recognizer function c : [0, 1] → ℝ⁺:

π(a) = c(a) b(a) / ∑_x c(x) b(x) = c(a) b(a) / µ    (1)

(see blue line below). The estimator is:

m̂ = (1/n) ∑_{i=1}^n z_i π(a_i)/b(a_i)
  = (1/n) ∑_{i=1}^n z_i [c(a_i) b(a_i)/µ] [1/b(a_i)]
  = (1/n) ∑_{i=1}^n z_i c(a_i)/µ
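The variance advantage of the recognizer-induced target (Theorem 2, and the figure below) can be illustrated numerically. This sketch assumes a hypothetical skewed behavior density b(a) = 3a² (not from the poster): the recognizer makes the corrections constant on the region of interest, whereas a uniform target over the same region gives corrections that vary with a, so its second moment (hence variance, since both have mean 1) is larger.

```python
import numpy as np

# Hypothetical skewed behavior density b(a) = 3 a^2 on [0, 1] (assumed);
# recognizer c(a) = 1 on the region of interest [0.7, 0.9].
a = np.linspace(0.0, 1.0, 1_000_001)
da = a[1] - a[0]
b = 3 * a**2
c = ((a >= 0.7) & (a <= 0.9)).astype(float)

mu = np.sum(c * b) * da          # recognition probability under b
rho_rec = c / mu                 # corrections with recognizer: constant on region
pi_unif = c / 0.2                # uniform target over the region
rho_unif = np.divide(pi_unif, b, out=np.zeros_like(b), where=b > 0)

# Both corrections integrate to 1 under b; compare second moments.
m2_rec = np.sum(rho_rec**2 * b) * da
m2_unif = np.sum(rho_unif**2 * b) * da
```

Here m2_rec < m2_unif, matching Theorem 2's claim that the recognizer-induced policy minimizes the variance of the corrections.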
[Figure: empirical variance (average of 200 sample variances) vs. number of sample actions, with and without the recognizer; the recognizer yields lower variance.]
The importance sampling corrections are:

ρ(s, a) = π(s, a)/b(s, a) = c(s, a)/µ(s)

where µ(s) depends on the behavior policy b. If b is unknown, instead of µ we will use a maximum likelihood estimate µ̂ : S → [0, 1], and importance sampling corrections will be defined as:

ρ̂(s, a) = c(s, a)/µ̂(s)
On-policy learning
If π is used to generate behavior, then the reward model of an option can be learned using TD-learning.
The n-step truncated return is:

R_t^(n) = r_{t+1} + (1 − β_{t+1}) R_{t+1}^(n−1) .

The λ-return is defined as usual:

R_t^λ = (1 − λ) ∑_{n=1}^∞ λ^{n−1} R_t^(n) .

The parameters of the function approximator are updated on every step proportionally to:

Δθ_t = [R_t^λ − y_t] ∇_θ y_t (1 − β_1) ··· (1 − β_t) .
• Recognizers reduce variance
• First off-policy learning algorithm for option models
• Off-policy learning without knowledge of the behavior distribution
• Observations
  – Options are a natural way to reduce the variance of importance sampling algorithms (because of the termination condition)
  – Recognizers are a natural way to define options, especially for large or continuous action spaces.
Outline
1 Problem formulation
2 LQR
  Main ideas
  Some details (and the algorithm)
3 Proof sketch (and the result)
4 Conclusions and Open Problems
5 Bibliography
Control problems
[Diagram: the agent observes the state x_t from the environment and responds with the control u_t.] The environment evolves as

x_{t+1} = f(x_t, u_t, w_t)

and the objective is ∑_t c(x_t, u_t) → min.
Learning to control
Same setting: x_{t+1} = f(x_t, u_t, w_t), ∑_t c(x_t, u_t) → min, but now f is unknown — yet the goal is to control the environment almost as well as if it were known.
Measure of performance of the learner
Does the average cost converge to the optimal average cost?

(1/T) ∑_{t=1}^T c(x_t, u_t) → J* ?

How fast is the convergence? Compare the total losses ⇒ regret:

R_T = ∑_{t=1}^T c(x_t, u_t) − T J* .

Hannan consistency: R_T / T → 0 as T → ∞.
Typical result: for some γ ∈ (0, 1), R_T = O(T^γ).
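These definitions are easy to see on a toy example. The sketch below assumes a hypothetical learner whose per-step cost exceeds J* by t^{−1/2} (an assumption, purely for illustration): then R_T ≈ 2√T, i.e., γ = 1/2, and R_T / T → 0, so the learner is Hannan consistent.

```python
import numpy as np

J_star = 1.0
T = 1_000_000
t = np.arange(1, T + 1)

# Hypothetical per-step costs: optimal cost plus excess decaying as 1/sqrt(t).
costs = J_star + t**-0.5

R = np.cumsum(costs - J_star)   # regret R_T for every horizon T
# R[-1] is close to 2*sqrt(T): sub-linear, so R_T / T -> 0.
```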
This talk: Linear Quadratic Regulation
Linear dynamics: x_t ∈ ℝ^n, u_t ∈ ℝ^d,

x_{t+1} = f(x_t, u_t, w_{t+1}) = A* x_t + B* u_t + w_{t+1} .

Quadratic cost: Q, R ≻ 0,

c(x_t, u_t) = x_tᵀ Q x_t + u_tᵀ R u_t .

Noise (w_t)_t: subgaussian martingale noise, E[w_{t+1} w_{t+1}ᵀ | F_t] = I_n.
LQR problem: given A*, B*, Q, R, find an optimal controller.
LQR learning problem: given Q, R but not knowing A*, B*, learn to control the system.
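For the known-parameters LQR problem, the optimal controller is linear feedback u_t = −K x_t obtained from the Riccati equation, and with unit noise covariance the optimal average cost is J* = trace(P). The sketch below (not from the talk) uses assumed values of A*, B* and solves the Riccati equation by value iteration, then checks J* by simulation.

```python
import numpy as np

# Hypothetical system (assumed A*, B*): n = 2 states, d = 1 control.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

# Riccati value iteration: P <- Q + A^T P (A - B K).
P = Q.copy()
for _ in range(10_000):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # optimal gain
    P = Q + A.T @ P @ (A - B @ K)

# With E[w w^T] = I_n, the optimal average cost is J* = trace(P).
rng = np.random.default_rng(0)
x, total, T = np.zeros(2), 0.0, 200_000
for _ in range(T):
    u = -K @ x
    total += x @ Q @ x + u @ R @ u
    x = A @ x + B @ u + rng.standard_normal(2)
avg_cost = total / T
```

The simulated average cost matches trace(P) up to the Monte Carlo error, and the closed loop A − BK is stable.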
The goal and why should we care?
Goal: Design a controller which achieves low regret for a reasonably large class of LQR problems.
• Simple ≡ beautiful, nice structures!
• Continuous states and controls!
• LQR control is actually useful! (even when no learning is involved)
• Unsolved problem!?
Previous works
• Bartlett and Tewari (2009); Auer et al. (2010) — regret analysis of finite MDPs
• Fiechter (1997) — discounted LQ, "PAC-learning"
• Control people:
  – Lai and Wei (1982b, 1987); Chen and Guo (1987); Chen and Zhang (1990); Lai and Ying (2006) — consistency, forced exploration (like ε-greedy)
  – Campi and Kumar (1998); Bittanti and Campi (2006) — consistency, basis of the present work
• Lai and Robbins (1985) — optimism in the face of uncertainty for bandits
• Lai and Wei (1982a); Dani et al. (2008); Rusmevichientong and Tsitsiklis (2010) — linear estimation, tail inequalities
The main ideas of the algorithm
• Estimate the system dynamics
• Be optimistic in selecting the controls
• Avoid frequent changes to the policy
Estimation

x_{t+1} = A* x_t + B* u_t + w_{t+1}
        = Θ* (x_tᵀ, u_tᵀ)ᵀ + w_{t+1}
        = Θ* z_t + w_{t+1}

Data: (z_0, x_1), (z_1, x_2), ..., (z_{t−1}, x_t), i.e.,

x_{i+1} = Θ* z_i + w_{i+1} .

Linear regression with correlated covariates, martingale noise
⇒ use ridge regression (least squares with ℓ2-penalties).
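A minimal sketch of this estimation step, with assumed true parameters and i.i.d. exploratory controls (the algorithm's actual controls come from the optimistic policy; the values below are assumptions for illustration): ridge regression on the pairs (z_i, x_{i+1}) recovers Θ* = [A* B*].

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, T, lam = 2, 1, 20_000, 1.0

# True (unknown to the learner) parameters Theta_* = [A_* B_*] (assumed).
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
B_true = np.array([[0.0], [1.0]])
Theta_true = np.hstack([A_true, B_true])        # n x (n + d)

Z, X1 = [], []
x = np.zeros(n)
for _ in range(T):
    u = rng.standard_normal(d)                  # exploratory controls
    z = np.concatenate([x, u])                  # z_t = (x_t, u_t)
    x = A_true @ x + B_true @ u + rng.standard_normal(n)
    Z.append(z)
    X1.append(x)
Z, X1 = np.array(Z), np.array(X1)

# Ridge regression: Theta_hat = X1^T Z (Z^T Z + lam I)^{-1}.
G = Z.T @ Z + lam * np.eye(n + d)
Theta_hat = np.linalg.solve(G, Z.T @ X1).T
```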
Estimation
xt+1 = A∗xt + B∗ut + wt+1
= Θ∗
(xt
ut
)+ wt+1
= Θ∗zt + wt+1
Data: (z0, x1), (z2, x2), . . . , (zt−1, xt)
xi+1 = Θ∗zi + wi+1
Linear regression with correlated covariates, martingale noise⇒ Use ridge-regression (least-squares, with `2-penalties)
COLT 2011 (Budapest) Partial monitoring July 10, 2011 11 / 27
Estimation
xt+1 = A∗xt + B∗ut + wt+1
= Θ∗
(xt
ut
)+ wt+1
= Θ∗zt + wt+1
Data: (z0, x1), (z2, x2), . . . , (zt−1, xt)
xi+1 = Θ∗zi + wi+1
Linear regression with correlated covariates, martingale noise⇒ Use ridge-regression (least-squares, with `2-penalties)
COLT 2011 (Budapest) Partial monitoring July 10, 2011 11 / 27
Estimation
xt+1 = A∗xt + B∗ut + wt+1
= Θ∗
(xt
ut
)+ wt+1
= Θ∗zt + wt+1
Data: (z0, x1), (z2, x2), . . . , (zt−1, xt)
xi+1 = Θ∗zi + wi+1
Linear regression with correlated covariates, martingale noise⇒ Use ridge-regression (least-squares, with `2-penalties)
COLT 2011 (Budapest) Partial monitoring July 10, 2011 11 / 27
Estimation
xt+1 = A∗xt + B∗ut + wt+1
= Θ∗
(xt
ut
)+ wt+1
= Θ∗zt + wt+1
Data: (z0, x1), (z2, x2), . . . , (zt−1, xt)
xi+1 = Θ∗zi + wi+1
Linear regression with correlated covariates, martingale noise⇒ Use ridge-regression (least-squares, with `2-penalties)
COLT 2011 (Budapest) Partial monitoring July 10, 2011 11 / 27
Estimation
xt+1 = A∗xt + B∗ut + wt+1
= Θ∗
(xt
ut
)+ wt+1
= Θ∗zt + wt+1
Data: (z0, x1), (z2, x2), . . . , (zt−1, xt)
xi+1 = Θ∗zi + wi+1
Linear regression with correlated covariates, martingale noise⇒ Use ridge-regression (least-squares, with `2-penalties)
COLT 2011 (Budapest) Partial monitoring July 10, 2011 11 / 27
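As a concrete illustration, here is a minimal numpy sketch of the ridge-regression step: regress the next state x_{i+1} on z_i = (x_i, u_i). The system matrices, noise level, and random exploratory controls below are illustrative stand-ins, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2, 1                       # state and control dimensions
A = np.array([[0.9, 0.1], [0.0, 0.8]])   # illustrative stable system
B = np.array([[0.0], [1.0]])
Theta_star = np.hstack([A, B])    # x_{t+1} = Theta_* z_t + w_{t+1},  n x (n+d)

lam, T = 1.0, 5000
Z, X1 = [], []
x = np.zeros(n)
for t in range(T):
    u = rng.normal(size=d)                    # exploratory (random) controls
    z = np.concatenate([x, u])
    x = Theta_star @ z + 0.1 * rng.normal(size=n)
    Z.append(z)
    X1.append(x)
Z, X1 = np.array(Z), np.array(X1)

# Ridge regression: Theta_hat^T = (lam I + Z^T Z)^{-1} Z^T X1
V = lam * np.eye(n + d) + Z.T @ Z             # the covariance matrix V_t
Theta_hat = np.linalg.solve(V, Z.T @ X1).T
print(np.max(np.abs(Theta_hat - Theta_star))) # small for large T
```

The `\ell_2` penalty keeps `V` invertible even before the data are exciting enough on their own.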
Optimism principle

Optimism Principle: Let C_t(\delta) be a confidence set for the unknown parameters. Choose the control that gives rise to the best performance.

For the linear system with parameter \Theta, let J(\Theta) be the optimal average cost and \pi_\Theta the corresponding optimal policy. Choose
\[
\tilde\Theta_t = \arg\min_{\Theta \in C_t(\delta)} J(\Theta)
\quad \text{and} \quad
u_t = \pi_{\tilde\Theta_t}(x_t) \,.
\]

Caveats:
J(\Theta), \pi_\Theta can be ill-defined
Need a restriction on the allowed set of parameters
Finding \tilde\Theta_t is a potentially difficult optimization problem
Avoiding frequent changes

Frequent changes are unnecessary
Saving computation ⇒ going green!?
Frequent changes might be a problem (avoiding frequent changes helps with the proof)
Outline

1 Problem formulation
2 LQR
    Main ideas
    Some details (and the algorithm)
3 Proof sketch (and the result)
4 Conclusions and Open Problems
5 Bibliography
How to choose the confidence set?

Adventures in self-normalized processes, "method of mixtures" ⇒ de la Pena et al. (2009)

Theorem
Let z_t^\top = (x_t^\top, u_t^\top) \in \mathbb{R}^{n+d}. Let \hat\Theta_t be the ridge-regression parameter estimate with regularization coefficient \lambda > 0. Let V_t = \lambda I + \sum_{i=0}^{t-1} z_i z_i^\top be the covariance matrix. Then, for any 0 < \delta < 1, with probability at least 1 - \delta,
\[
\operatorname{trace}\!\left( (\hat\Theta_t - \Theta_*)^\top V_t (\hat\Theta_t - \Theta_*) \right)
\le \left( d \sqrt{ 2 \log\!\left( \frac{\det(V_t)^{1/2} \det(\lambda I)^{-1/2}}{\delta} \right) } + \lambda^{1/2} S \right)^{\!2}.
\]
Construction of confidence sets

An ellipsoid centred at \hat\Theta_t:
\[
\operatorname{trace}\!\left\{ (\Theta - \hat\Theta_t)^\top V_t (\Theta - \hat\Theta_t) \right\} \le \beta_t \,.
\]
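A small numpy sketch of the membership test for such an ellipsoid. With the row convention \Theta \in \mathbb{R}^{n \times (n+d)}, the quantity \operatorname{trace}((\Theta - \hat\Theta_t)^\top V_t (\Theta - \hat\Theta_t)) becomes trace(D V D^T) for D = \Theta - \hat\Theta_t; all matrices below are illustrative.

```python
import numpy as np

def in_confidence_set(Theta, Theta_hat, V, beta):
    """Ellipsoid test: trace((Theta - Theta_hat)^T V (Theta - Theta_hat)) <= beta.
    With Theta stored as an n x (n+d) array, this equals trace(D @ V @ D.T)."""
    D = Theta - Theta_hat
    return float(np.trace(D @ V @ D.T)) <= beta

rng = np.random.default_rng(0)
n, d = 2, 1
Z = rng.normal(size=(100, n + d))
V = 1.0 * np.eye(n + d) + Z.T @ Z          # V_t = lam*I + sum_i z_i z_i^T
Theta_hat = rng.normal(size=(n, n + d))

# The centre is always inside; a parameter far from the centre (measured in
# the V-norm) falls outside once beta is small.
print(in_confidence_set(Theta_hat, Theta_hat, V, beta=1e-9))        # True
print(in_confidence_set(Theta_hat + 1.0, Theta_hat, V, beta=1e-3))  # False
```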
The algorithm

Inputs: T, S > 0, \delta > 0, Q, L.
Set V_0 = I and \hat\Theta_0 = 0, (A_0, B_0) = \tilde\Theta_0 = \arg\min_{\Theta \in C_0(\delta)} J(\Theta).
for t := 0, 1, 2, \dots do
    Calculate \hat\Theta_t.
    \tilde\Theta_t = \arg\min_{\Theta \in C_t(\delta)} J(\Theta).
    Calculate u_t based on the current parameters, u_t = K(\tilde\Theta_t) x_t.
    Execute the control, observe the new state x_{t+1}.
    V_{t+1} := V_t + z_t z_t^\top, where z_t^\top = (x_t^\top, u_t^\top).
end for
Proof sketch

Fix T > 0.
With high probability, the state stays O(log T). ⇐ most of the work is here.
Decompose the regret.
Analyze each term.
Regret decomposition

Dynamic programming equations, \mathbb{E}[w_{t+1} \mid \mathcal{F}_t] = 0, algebra:
\[
R_1 = \sum_{t=0}^{T} \left\{ x_t^\top P(\tilde\Theta_t) x_t - \mathbb{E}\left[ x_{t+1}^\top P(\tilde\Theta_{t+1}) x_{t+1} \,\middle|\, \mathcal{F}_t \right] \right\}
\]
\[
R_2 = \sum_{t=0}^{T} \mathbb{E}\left[ x_{t+1}^\top \left\{ P(\tilde\Theta_{t+1}) - P(\tilde\Theta_t) \right\} x_{t+1} \,\middle|\, \mathcal{F}_t \right]
\]
\[
R_3 = \sum_{t=0}^{T} z_t^\top \left( \Theta_*^\top P(\tilde\Theta_t) \Theta_* - \tilde\Theta_t^\top P(\tilde\Theta_t) \tilde\Theta_t \right) z_t \,.
\]
\[
\sum_{t=0}^{T} \left( x_t^\top Q x_t + u_t^\top R u_t \right)
= \sum_{t=0}^{T} J(\tilde\Theta_t) + R_1 + R_2 + R_3
\le T\, J(\Theta_*) + R_1 + R_2 + R_3 \,.
\]
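The step hiding behind "algebra" is, in essence, the average-cost Bellman equation for the optimistic parameter; a sketch of that step, in the same notation:

```latex
% Bellman equation for the system with parameter \tilde\Theta_t,
% evaluated at x_t with the chosen action u_t = K(\tilde\Theta_t) x_t:
J(\tilde\Theta_t) + x_t^\top P(\tilde\Theta_t) x_t
  = x_t^\top Q x_t + u_t^\top R u_t
  + \mathbb{E}\!\left[ (\tilde\Theta_t z_t + w_{t+1})^\top
      P(\tilde\Theta_t) (\tilde\Theta_t z_t + w_{t+1}) \,\middle|\, \mathcal{F}_t \right].
% Since the true next state is x_{t+1} = \Theta_* z_t + w_{t+1}, adding and
% subtracting \mathbb{E}[ x_{t+1}^\top P(\tilde\Theta_{t+1}) x_{t+1} \mid \mathcal{F}_t ]
% and summing over t = 0, \dots, T yields the three terms R_1, R_2, R_3.
```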
Term R1

\[
R_1 = \sum_{t=0}^{T} \left\{ x_t^\top P(\tilde\Theta_t) x_t - \mathbb{E}\left[ x_{t+1}^\top P(\tilde\Theta_{t+1}) x_{t+1} \,\middle|\, \mathcal{F}_t \right] \right\}
\]

Regrouping
Martingale difference sequence
The state does not explode
Term R3

\[
R_3 = \sum_{t=0}^{T} z_t^\top \left( \Theta_*^\top P(\tilde\Theta_t) \Theta_* - \tilde\Theta_t^\top P(\tilde\Theta_t) \tilde\Theta_t \right) z_t \,.
\]

Algebra: reduce to
\[
O(\sqrt{T}) + \left( \sum_t \left\| P(\tilde\Theta_t) (\tilde\Theta_t - \Theta_*)^\top z_t \right\|^2 \right)^{1/2}
\]

More algebra
Choice of the confidence set
The state does not explode
Term R2

\[
R_2 = \sum_{t=0}^{T} \mathbb{E}\left[ x_{t+1}^\top \left\{ P(\tilde\Theta_{t+1}) - P(\tilde\Theta_t) \right\} x_{t+1} \,\middle|\, \mathcal{F}_t \right]
\]

We cannot analyze this algorithm! What if we change the policies rarely?
Change the policy only when the determinant of the covariance matrix V_t doubles.

\tau_s: time of the s-th policy change.
There are O(log T) policy changes up to time T, and
\[
\sum_{t=0}^{T} \mathbb{E}\left[ x_{t+1}^\top \left( P(\tilde\Theta_{t+1}) - P(\tilde\Theta_t) \right) x_{t+1} \,\middle|\, \mathcal{F}_t \right] \le O(\log T) \,.
\]
The algorithm

Inputs: T, S > 0, \delta > 0, Q, L.
Set V_0 = I and \hat\Theta_0 = 0, (A_0, B_0) = \tilde\Theta_0 = \arg\min_{\Theta \in C_0(\delta)} J(\Theta).
for t := 0, 1, 2, \dots do
    if det(V_t) > 2 det(V_0) then
        Calculate \hat\Theta_t.
        \tilde\Theta_t = \arg\min_{\Theta \in C_t(\delta)} J(\Theta).
        Let V_0 = V_t.
    else
        \tilde\Theta_t = \tilde\Theta_{t-1}.
    end if
    Calculate u_t based on the current parameters, u_t = K(\tilde\Theta_t) x_t.
    Execute the control, observe the new state x_{t+1}.
    V_{t+1} := V_t + z_t z_t^\top, where z_t^\top = (x_t^\top, u_t^\top).
end for
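A runnable numpy sketch of this lazy-update loop, with two labelled simplifications: the optimistic arg min over C_t(\delta) is replaced by plain certainty equivalence (control computed from the ridge estimate), and K(\Theta) is obtained by iterating the discrete-time Riccati recursion. The system matrices, noise levels, and dither are illustrative.

```python
import numpy as np

def riccati_gain(A, B, Q, R, iters=500):
    # Value iteration for the discrete-time Riccati equation; returns the
    # gain K such that u = -K x is optimal for the model (A, B).
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

rng = np.random.default_rng(0)
n, d = 2, 1
A_star = np.array([[0.9, 0.1], [0.0, 0.8]])   # illustrative true system
B_star = np.array([[0.0], [1.0]])
Q, R, lam = np.eye(n), np.eye(d), 1.0

V = lam * np.eye(n + d)                 # covariance matrix V_t
V_last = V.copy()                       # V at the last policy change
S = np.zeros((n + d, n))                # running sum of z_i x_{i+1}^T
Theta_hat = np.zeros((n, n + d))
K = np.zeros((d, n))
x, changes, T = np.ones(n), 0, 2000
for t in range(T):
    if np.linalg.det(V) > 2 * np.linalg.det(V_last):   # lazy update rule
        A_h, B_h = Theta_hat[:, :n], Theta_hat[:, n:]
        K = riccati_gain(A_h, B_h, Q, R)
        V_last = V.copy()
        changes += 1
    u = -K @ x + 0.1 * rng.normal(size=d)   # small dither for excitation
    z = np.concatenate([x, u])
    x = A_star @ x + B_star @ u + 0.01 * rng.normal(size=n)
    V += np.outer(z, z)
    S += np.outer(z, x)
    Theta_hat = np.linalg.solve(V, S).T     # ridge estimate
print(changes)   # grows like log det(V_T) = O(log T), far fewer than T steps
```

Because a recomputation is triggered only when det(V_t) doubles, the number of policy changes scales with log det(V_T) rather than with T, which is exactly the property the analysis of R_2 needs.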
Theorem
With probability at least 1 - \delta, the regret of the algorithm is bounded as follows:
\[
R(T) = O\!\left( \sqrt{T \log(1/\delta)} \right).
\]
Conclusions

First regret result for the problem of linear optimal control
The algorithm is too expensive! Does there exist a cheaper alternative with similar guarantees?
Relaxing the martingale noise assumption? (kth-order Markov noise? ARMA?)
Extension to linearly parameterized systems, x_{t+1} = \theta^\top \varphi(x_t, u_t) + w_{t+1}? Planning? Learning? The unrealizable case?
References

Auer, P., Jaksch, T., and Ortner, R. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563-1600.
Bartlett, P. L. and Tewari, A. (2009). REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In UAI 2009.
Bittanti, S. and Campi, M. C. (2006). Adaptive control of linear time invariant systems: the "bet on the best" principle. Communications in Information and Systems, 6(4):299-320.
Campi, M. and Kumar, P. (1998). Adaptive linear quadratic Gaussian control: the cost-biased approach revisited. SIAM J. Control and Optim., 36(6):1890-1907.
Chen, H.-F. and Guo, L. (1987). Optimal adaptive control and consistent parameter estimates for ARMAX model with quadratic cost. SIAM Journal on Control and Optimization, 25(4):845-867.
Chen, H.-F. and Zhang, J.-F. (1990). Identification and adaptive control for systems with unknown orders, delay, and coefficients. Automatic Control, IEEE Transactions on, 35(8):866-877.
Dani, V., Hayes, T., and Kakade, S. (2008). Stochastic linear optimization under bandit feedback. COLT 2008, pages 355-366.
de la Pena, V., Lai, T., and Shao, Q.-M. (2009). Self-Normalized Processes: Limit Theory and Statistical Applications. Springer.
Fiechter, C.-N. (1997). PAC adaptive control of linear systems. In Proceedings of the 10th Annual Conference on Computational Learning Theory, ACM, pages 72-80.
Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22.
Lai, T. L. and Wei, C. Z. (1982). Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, 10(1):154-166.
Lai, T. L. and Wei, C. Z. (1987). Asymptotically efficient self-tuning regulators. SIAM J. Control Optim., 25:466-481.
Lai, T. L. and Ying, Z. (2006). Efficient recursive estimation and adaptive control in stochastic regression and ARMAX models. Statistica Sinica, 16:741-772.
Rusmevichientong, P. and Tsitsiklis, J. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395-411.