
Sampling-based Approximate Optimal Control Under Temporal Logic Constraints

Jie Fu
Department of Electrical and Computer Engineering
Robotics Engineering Program
Worcester Polytechnic Institute
Worcester, MA, 01604
[email protected]

Ivan Papusha
Institute for Computational Engineering and Sciences
University of Texas at Austin
Austin, TX, USA
[email protected]

Ufuk Topcu
Aerospace Engineering and Engineering Mechanics
University of Texas at Austin
Austin, TX, USA
[email protected]

ABSTRACT
We investigate a sampling-based method for optimal control of continuous-time and continuous-state (possibly nonlinear) systems under co-safe linear temporal logic specifications. We express the temporal logic specification as a deterministic finite automaton (the specification automaton), and link the automaton's discrete transitions to the continuous system state as it passes through specified regions. The optimal hybrid controller is characterized by a set of coupled partial differential equations. Because these equations are difficult to solve exactly in practice, we propose instead a sampling-based technique to solve for an approximate controller through approximate value iteration. We adopt model reference adaptive search, an importance sampling optimization algorithm, to determine the mixing weights of the approximate value function expressed in a finite basis. Under mild technical assumptions, the algorithm converges, with probability one, to an optimal weight that ensures the satisfaction of the temporal logic constraints while minimizing an upper bound for the optimal cost. We demonstrate the correctness and efficiency of the method through numerical experiments, including temporal logic planning for a linear system and for a nonlinear mobile robot.

Keywords
Approximate optimal control, formal methods, importance sampling, hybrid systems.

1. INTRODUCTION

In this work, we propose a novel sampling-based optimal control method for continuous-time and continuous-state nonlinear systems subject to a subclass of temporal logic constraints, namely co-safe Linear Temporal Logic (LTL) [15]. Co-safe LTL formulas are LTL formulas whose satisfying traces can be recognized by a deterministic finite automaton. Co-safe LTL is an expressive formal language that allows one to specify a variety of finite-time behaviors, including traditional reaching-a-goal, stability, obstacle avoidance, sequentially visiting regions of interest, and conditional reactive behaviors [21].

Typically, trajectory generation and control of continuous systems with formal specifications is performed using abstraction-based synthesis, for example, by computing a discrete transition system that abstracts or simulates the system dynamics, and then planning in the discrete state space. Abstraction-based synthesis methods have been studied extensively for continuous linear and nonlinear systems [12, 3, 5, 18, 2, 24, 27].

However, abstraction-based synthesis has several important limitations. Among these, we highlight three: first, because they rely on discretization, abstraction-based methods scale poorly with the dimension of the continuous state space; second, optimality is no longer guaranteed when a controller is synthesized with the discrete abstracted system; and third, depending on the specific abstraction method, it is possible that a control policy exists for the underlying continuous control system even when one does not appear to exist in the abstraction.

To address these concerns, the work [20] exploited the idea that continuous-time and continuous-state systems constrained by co-safe LTL specifications can be viewed as hybrid dynamical systems. The continuous state space is augmented with the discrete states of the specification automaton, derived from the co-safe LTL formula. Then, a hybrid feedback controller is obtained by solving or approximately solving the corresponding Hamilton–Jacobi–Bellman (HJB) equations for the hybrid value function. For linear systems, the approximate hybrid value function can be obtained by semidefinite programming. But for general nonlinear systems, the resulting optimization problem is semi-infinite [22] and difficult to solve.

We tackle the computational difficulty of this semi-infinite program in the optimal control synthesis for this class of hybrid systems by developing an importance sampling algorithm. The key idea is to regard the control problem as a problem of inference, from sampled trajectories, of a weight vector that defines an approximate value function in a given basis. For sample-efficient search through the weight parameter space, we use model reference adaptive search (MRAS) [9]. In our MRAS-inspired algorithm, specific policies are computed from the value function approximations. These weights are then ranked by the performance of their corresponding controllers, and the best performing weights (elite samples) are used to improve the weight sampling distribution. The aim is to find a value function approximation that minimizes an upper bound on the total cost. Similar cross-entropy (CE) methods have been used for trajectory planning in the past [13, 19], with the goal of finding a sequence of motion primitives or a sequence of states for interpolation-based planning. In stochastic policy optimization [14], CE is also useful for computing linear feedback policies.

Under certain regularity assumptions, we show that our sampling method is probabilistically complete, i.e., it converges to the optimal value function approximation in a given basis as the sample size goes to infinity. To improve the sampling efficiency, we employ rank functions on the specification to guide the iterative update of the sampling distribution. Based on experimental studies, we show that our method can generate an initial feasible hybrid controller that satisfies the co-safe LTL specification, which is then continually improved as more computation time is permitted. However, when a limited number of samples is used at each iteration, the method may converge to local optima. In the last section, we discuss future extensions to address these problems and conclude the work.

2. PROBLEM DESCRIPTION

We consider a continuous-time and continuous-state dynamical system on $\mathbb{R}^n$. This system is given by
\[
\dot{x} = f(x, u), \quad x(0) = x_0, \qquad (1)
\]
where $x(t) \in \mathcal{X} \subseteq \mathbb{R}^n$ and $u(t) \in U \subseteq \mathbb{R}^m$ are the state and control signals at time $t$. For simplicity, we restrict $f$ to be a Lipschitz continuous function of $(x, u)$, and the control input $u$ to be a piecewise right-continuous function of time, with finitely many discontinuities on any finite time interval. These conditions ensure the existence and uniqueness of solutions, and are meant to prevent Zeno behavior.

2.1 Co-safe Linear Temporal Logic

The system (1) is constrained to satisfy a specification on the discrete behavior obtained from its continuous trajectory. First, let AP be a finite set of atomic propositions, which are logical predicates that hold true when $x(t)$ is in a particular region of the state space $\mathcal{X}$. Then, define a labeling function $L : \mathcal{X} \to \Sigma$, which maps a continuous state $x \in \mathcal{X}$ to the set of atomic propositions that evaluate to true at $x$, an element of the finite alphabet $\Sigma = 2^{AP}$. This function partitions the continuous space $\mathcal{X}$ into regions that share the same truth values in AP. The labeling function also links the continuous system with its discrete behavior. In the following definition, $\xi(x_0, [0, T], u)$ refers to the trajectory of the continuous system with initial condition $x_0$ under the control input $u(t)$ over the time interval $[0, T]$.

Definition 1. Let $t_0, t_1, \ldots, t_N$ be times such that

• $0 = t_0 < t_1 < \cdots < t_N = T$,

• $L(x(t)) = L(x(t_k))$ for $t_k \le t < t_{k+1}$, $k = 0, \ldots, N$,

• $L(x(t_k^-)) \ne L(x(t_k^+))$, $k = 0, \ldots, N$.

The discrete behavior, denoted $\mathcal{B}(\xi(x_0, [0, T], u))$, is the discrete word $\sigma_0 \sigma_1 \ldots \sigma_{N-1} \in \Sigma^*$, where $\sigma_k = L(x(t_k))$.

In the scope of this paper, we consider a subclass of linear temporal logic (LTL) specifications called co-safe LTL. A co-safe LTL formula is an LTL formula such that every satisfying word has a finite good prefix¹ [15]. This subclass of LTL formulas can be used to express tasks that can be completed in a finite time horizon. Given a co-safe LTL specification $\varphi$ over the set of atomic propositions AP, there exists a corresponding deterministic finite-state automaton (DFA) $\mathcal{A}_\varphi = \langle Q, \Sigma, \delta, q_0, F \rangle$, where $Q$ is a finite set of states (modes), $\Sigma = 2^{AP}$ is a finite alphabet, $\delta : Q \times \Sigma \to Q$ is a deterministic transition function such that when the symbol $\sigma \in \Sigma$ is read at state $q$, the automaton makes a deterministic transition to state $\delta(q, \sigma) = q'$, $q_0 \in Q$ is the initial state, and $F \subseteq Q$ is the set of final, or accepting, states. The transition function is extended to a sequence of symbols, or a word $w = \sigma_0 \sigma_1 \ldots \in \Sigma^*$, in the usual way: $\delta(q, \sigma_0 v) = \delta(\delta(q, \sigma_0), v)$ for $\sigma_0 \in \Sigma$ and $v \in \Sigma^*$. We say that the finite word $w$ satisfies $\varphi$ if and only if $\delta(q_0, w) \in F$. The set of words satisfying $\varphi$ is the language of the automaton $\mathcal{A}_\varphi$, denoted $\mathcal{L}(\mathcal{A}_\varphi)$.

¹For two given words $v, w \in \Sigma^*$, the word $v$ is a prefix of $w$ if and only if $w = vu$ for some $u \in \Sigma^*$; the word $u$ is called the suffix of $w$.

The discrete behavior encodes the sequence of labels visited by the state as it moves along its continuous trajectory. Specifically, the atomic propositions are evaluated only at the times when the evaluation of an atomic proposition changes value, indicated by the sequence of discrete time stamps $t_0, t_1, \ldots, t_N$. Thus a trajectory $\xi(x_0, [0, T], u)$ satisfies an LTL specification $\varphi$ if and only if its discrete behavior is in the language $\mathcal{L}(\mathcal{A}_\varphi)$. The optimal control problem is formulated as follows.

Problem 1. Consider the system (1), a co-safe LTL specification $\varphi$, and a final state $x_f \in \mathcal{X}$. Design a control law $u$ that minimizes the cost function
\[
J(x_0, u) = \int_0^T \ell(x(\tau), u(\tau))\, d\tau + \sum_{k=0}^{N} s\big(x(t_k), q(t_k^-), q(t_k^+)\big) \qquad (2)
\]
subject to the constraints that $\mathcal{B}(\xi(x_0, [0, T], u)) \in \mathcal{L}(\mathcal{A}_\varphi)$ and $x(T) = x_f$.

Here, $\ell : \mathcal{X} \times U \to \mathbb{R}$ is a continuous loss function, and $s : \mathcal{X} \times Q \times Q \to \mathbb{R}$ is the cost to transition between two states of the automaton whenever such a transition is allowed. The final state $x(T) = x_f$ is also specified. Similar problems have been studied in prior work [7, 8, 28, 5, 27, 26, 11]. Recently, the work [20] showed that determining an optimal controller for continuous-time systems under co-safe LTL behavior constraints can be translated into an optimal hybrid control problem. For linear and polynomial systems, an approximate optimal controller can be obtained by a special formulation of semi-infinite programming problems. The novelty in this paper is the development of a sampling-based algorithm that handles both nonlinear dynamical constraints and co-safe LTL specifications. The algorithm is based on MRAS, which is briefly described next.

2.2 Model Reference Adaptive Search (MRAS)

Page 3: Sampling-based Approximate Optimal Control Under ...jfu2/files/papers/Fu_Papusha...Approximate optimal control, formal methods, importance sampling, hybrid systems. 1. INTRODUCTION

MRAS [9] is a general sampling-based method for global optimization that aims to solve the following problem:
\[
z^\star = \operatorname*{argmax}_{z \in Z} H(z),
\]
where $Z \subseteq \mathbb{R}^n$ is the solution space and $H : \mathbb{R}^n \to \mathbb{R}$ is a real-valued function that is bounded from below. It is assumed that the optimization problem has a unique maximizer, i.e., $z^\star \in Z$ and $H(z) < H(z^\star)$ for all $z \ne z^\star$.

The key idea of MRAS is similar to cross-entropy (CE) optimization methods. First, we sample the solution space $Z$ with a parameterized distribution. Then, we select the samples with the highest objective values, termed "elite samples," and update the parameters of the sampling distribution to assign a higher probability mass to the elite samples. Assuming the neighborhood of the optimal solution $z^\star$ has a positive probability of being sampled, and $H$ satisfies certain regularity conditions [9, §2], the parameterized distribution converges to a distribution concentrated around $z^\star$. The MRAS algorithm consists of the following key steps:

• Define a sequence of reference distributions $\{g_k(\cdot)\}$ that converges to a distribution centered on the optimal solution $z^\star$, for example,
\[
g_k(z) = \frac{H(z)\, g_{k-1}(z)}{\int_Z H(z')\, g_{k-1}(z')\, \nu(dz')}, \quad k = 1, 2, \ldots,
\]
where $\nu(\cdot)$ is the Lebesgue measure defined over $Z$.

• Define a parameterized family of distributions $\{p(\cdot, \theta) \mid \theta \in \Theta\}$ over $Z$, onto which the exact reference distributions $\{g_k(\cdot)\}$ will be projected.

• Generate a sequence of parameters $\{\theta_k\}$ by minimizing (over $\theta \in \Theta$) the Kullback–Leibler (KL) divergence between $g_k(\cdot)$ and $p(\cdot, \theta)$,
\[
D_{\mathrm{KL}}\big(g_k, p(\cdot, \theta)\big) := \int_Z \ln \frac{g_k(z)}{p(z, \theta)}\, g_k(z)\, \nu(dz),
\]
at each step $k = 1, 2, \ldots$.

The sample distributions $\{p(\cdot, \theta_k)\}$ are meant to be compact approximations of the reference distributions $\{g_k(\cdot)\}$ that converge to the same optimal solution. Note that the reference distributions $\{g_k(\cdot)\}$ are unknown beforehand, as the optimal solution is unknown. In order to represent the reference distributions $\{g_k(\cdot)\}$, MRAS uses estimation of distribution methods [6] on elite samples (similar to CE). Then, the MRAS algorithm computes the parameter that minimizes the KL divergence between the sampling distribution and the reference distribution. As shown in [9], MRAS has better convergence rates and stronger guarantees than CE over several benchmark examples.
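For the multivariate Gaussian family used later in Section 4.1, the KL projection step has a convenient closed form. The following is a sketch of that standard argument, not part of the original text:
\[
\operatorname*{argmin}_{\theta \in \Theta} D_{\mathrm{KL}}\big(g_k, p(\cdot, \theta)\big)
= \operatorname*{argmax}_{\theta \in \Theta} \int_Z g_k(z)\, \ln p(z, \theta)\, \nu(dz),
\]
and for $p(\cdot, \theta) = \mathcal{N}(\mu, \Sigma)$ the first-order optimality conditions give
\[
\mu = \mathbb{E}_{g_k}[z], \qquad \Sigma = \mathbb{E}_{g_k}\big[(z - \mu)(z - \mu)^T\big],
\]
so projecting onto the Gaussian family amounts to matching the mean and covariance of the reference distribution. Replacing these expectations with importance-weighted averages over elite samples yields updates of the form (9)–(10) in Section 4.1.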

3. HYBRID SYSTEM FORMULATION OF CONTROL SYSTEMS UNDER TEMPORAL LOGIC CONSTRAINTS

To apply MRAS in solving Problem 1, we follow the setting and notation of [20] to define a hybrid system from the control system dynamics and the temporal logic constraints. The state space of the hybrid system is the product of the continuous state space $\mathcal{X}$ and the discrete state space $Q$ of the specification automaton $\mathcal{A}_\varphi$, which is obtained from the co-safe LTL specification $\varphi$ with existing tools [4, 16]. This produces a hybrid system where each mode is governed by the same continuous vector field (1). Switching between different discrete modes occurs when the continuous state crosses a boundary between two labeled regions. Specifically, we consider the following product hybrid system:

Definition 2. The product system $\mathcal{H} = \langle Q, \mathcal{X}, \Sigma, E, f, R, G \rangle$ is an internally forced hybrid system, where

• $Q$ is the set of discrete states (modes) of $\mathcal{A}_\varphi$,

• $\mathcal{X} \subseteq \mathbb{R}^n$ is the set of continuous states,

• $\Sigma = 2^{AP}$ is the power set of atomic propositions,

• $E \subseteq Q \times \Sigma \times Q$ is a set of discrete transitions, where $e = (q, \sigma, q') \in E$ if and only if $\delta(q, \sigma) = q'$,

• $f : \mathcal{X} \times U \to \mathbb{R}^n$ is the continuous vector field given by (1),

• $R = \{R_q \mid q \in Q\}$ is a collection of regions, where
\[
R_{q,\sigma} = \{x \in \mathcal{X} \mid (q, \sigma, q) \in E \text{ and } L(x) = \sigma\}, \quad q \in Q,\ \sigma \in \Sigma, \qquad
R_q = \bigcup_{\sigma \in \Sigma} R_{q,\sigma}, \quad q \in Q,
\]

• $G = \{G_e \mid e \in E\}$ is a collection of guards, where
\[
G_e = \{x \in \partial R_{q,\sigma} \mid \delta(q, L(x)) = q'\}, \quad \text{for each } e = (q, \sigma, q') \in E.
\]

Each invariant region $R_q$ consists of the continuous states $x \in \mathcal{X}$ that are reachable while the automaton is in mode $q$. For each discrete mode $q$, the continuous state evolves inside $R_q$ until it enters a guard region $G_{(q,\sigma,q')}$ and a discrete transition to mode $q'$ is made.
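To make the semantics of the product system concrete, the following sketch simulates the hybrid closed loop by forward Euler integration: the state flows inside $R_q$ until its label changes, at which point the automaton takes the corresponding transition. The functions `f`, `label`, `delta`, and `policy` are illustrative placeholders, not the paper's implementation.

```python
# Sketch: rolling out the product hybrid system of Definition 2.
import numpy as np

def simulate_hybrid(f, label, delta, policy, x0, q0, T, dt=1e-2):
    """Forward-Euler rollout of the closed loop (x, q)."""
    x, q = np.asarray(x0, dtype=float), q0
    sigma = label(x)                       # current label L(x)
    word = [sigma]                         # discrete behavior of the run
    traj = [(0.0, x.copy(), q)]
    for step in range(int(T / dt)):
        u = policy(x, q)                   # hybrid feedback u(x, q)
        x = x + dt * np.asarray(f(x, u))   # continuous flow inside R_q
        new_sigma = label(x)
        if new_sigma != sigma:             # guard crossed: label changed
            q = delta(q, new_sigma)        # discrete jump of the automaton
            sigma = new_sigma
            word.append(sigma)
        traj.append(((step + 1) * dt, x.copy(), q))
    return traj, word
```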

We can solve the optimal control problem with dynamic programming by ensuring that the optimal value function is zero at every accepting state of the automaton. Let $V^\star : \mathcal{X} \times Q \to \mathbb{R}$ be the optimal cost-to-go in (2), with $V^\star(x_0, q_0)$ denoting the optimal objective value when starting at initial condition $(x_0, q_0)$, subject to the discrete behavior specification and the final condition $x(T) = x_f$ for a free terminal time $T$. The existence of such a controller is assumed.

In this setting, the cost-to-go satisfies a collection of mixed continuous–discrete Hamilton–Jacobi–Bellman (HJB) equations,
\[
0 = \min_{u \in U}\left\{ \frac{\partial V^\star(x, q)}{\partial x} \cdot f(x, u) + \ell(x, u) \right\}, \quad \forall x \in R_q,\ \forall q \in Q, \qquad (3)
\]
\[
V^\star(x, q) = \min_{q'}\left\{ V^\star(x, q') + s(x, q, q') \right\}, \quad \forall x \in G_e,\ \forall e = (q, \sigma, q') \in E, \qquad (4)
\]
\[
0 = V^\star(x_f, q_f), \quad \forall q_f \in F. \qquad (5)
\]
Equation (3) says that $V^\star(x, q)$ is an optimal cost-to-go inside the regions where the label remains constant. The next equation (4) is a shortest-path equality that must hold at every continuous state $x$ where a discrete transition to a different label can happen. Finally, the boundary equation (5) fixes the value function at the accepting states.
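For intuition, consider the unconstrained linear-quadratic special case inside a single region; this worked example is not part of the original text. With $f(x, u) = Ax + Bu$, $\ell(x, u) = x^T Q x + u^T R u$, and the quadratic ansatz $V^\star(x, q) = x^T P_q x$, the minimizer in (3) is $u = -R^{-1} B^T P_q x$, and substituting back turns (3) into
\[
0 = x^T\big(A^T P_q + P_q A - P_q B R^{-1} B^T P_q + Q\big)\, x \quad \text{for all } x \in R_q,
\]
i.e., an algebraic Riccati equation. What distinguishes the hybrid problem is the coupling (4)–(5): the matrices $P_q$ of different modes are linked through the guard and boundary conditions, which is why even the linear example of Section 5.1 requires a mode-dependent value function.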


4. APPROXIMATE OPTIMAL CONTROL UNDER TEMPORAL LOGIC CONSTRAINTS

In this section, we show that by approximating the optimal value function $V^\star(x, q)$ at each state $q \in Q$ with a linear combination $V(x, q)$ of pre-defined basis functions, we can formulate a semi-infinite programming problem over the weight parameters. This value function candidate minimizes an upper bound of the optimal total cost.

Assumption 1. For each mode $q \in Q$, the optimal value function $V^\star(\cdot, q)$ is $C^1$-differentiable. Moreover, the optimal value function $V^\star(\cdot, q)$ is positive semidefinite. This is guaranteed by the conditions $\ell(x, u) > 0$ for all $x \in \mathcal{X} \setminus \{x_f\}$ and $u \in U$, and $s(x, q, q') > 0$ for all $x \in \mathcal{X}$, $q, q' \in Q$.

Recall from the Weierstrass higher-order approximation theorem [25] that there exists a complete independent set of basis functions $\{\phi_{i,q}(x) \mid i = 1, 2, \ldots\}$ such that the function $V^\star(x, q)$ is uniformly approximated as
\[
V^\star(x, q) = \sum_{i=1}^{N_q} w_{i,q}\, \phi_{i,q}(x) + \sum_{i=N_q+1}^{\infty} w_{i,q}\, \phi_{i,q}(x), \quad \text{and}
\]
\[
\frac{\partial V^\star(x, q)}{\partial x} = \sum_{i=1}^{N_q} w_{i,q}\, \frac{\partial \phi_{i,q}(x)}{\partial x} + \sum_{i=N_q+1}^{\infty} w_{i,q}\, \frac{\partial \phi_{i,q}(x)}{\partial x},
\]
for some weights $\{w_{i,q}\}$, where in both expressions the last term converges uniformly to zero as $N_q \to \infty$. Thus, it is justified to assume that for any state $q$, the optimal value function approximation $V(x, q)$ in a given set $\{\phi_{i,q}(x)\}$ of bases is
\[
V(x, q) = \sum_{i=1}^{N_q} w_{i,q}\, \phi_{i,q}(x).
\]

We write $V(x, q)$ compactly as $\vec{w}_q^T \vec{\phi}_q(x)$, where
\[
\vec{w}_q = [w_{1,q}, \ldots, w_{N_q,q}]^T, \qquad \vec{\phi}_q(x) = [\phi_{1,q}(x), \ldots, \phi_{N_q,q}(x)]^T.
\]
With a slight abuse of notation, the overall piecewise continuous value function approximation is denoted $V = \langle W, \Phi \rangle$ with $W = [\vec{w}_q]_{q \in Q}$, $\Phi = [\vec{\phi}_q]_{q \in Q}$, and $V(x, q) = \vec{w}_q^T \vec{\phi}_q(x)$ for all $x \in R_q$.

Given an approximate value function $V$, we define a hybrid feedback control law $u : \mathcal{X} \times Q \to U$ as
\[
u(x, q) = \operatorname*{argmin}_{a \in U} \left\{ \frac{\partial V(x, q)}{\partial x} \cdot f(x, a) + \ell(x, a) \right\}. \qquad (6)
\]
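When the control set $U$ is compact, one simple way to evaluate (6) in simulation is to minimize the Hamiltonian over a finite discretization of $U$. The following sketch assumes such a discretization and user-supplied gradients of the basis functions; it is an illustration, not the paper's implementation.

```python
# Sketch: greedy hybrid feedback (6) over a finite control grid.
# w[q] holds the weight vector for mode q; grad_phi(x, q) returns the
# Jacobian of the basis vector (shape N_q x n). All names are illustrative.
import numpy as np

def greedy_control(x, q, w, grad_phi, f, running_cost, control_grid):
    dVdx = w[q] @ grad_phi(x, q)          # dV/dx = w_q^T dphi_q/dx
    best_a, best_h = None, np.inf
    for a in control_grid:
        h = dVdx @ np.asarray(f(x, a)) + running_cost(x, a)
        if h < best_h:
            best_a, best_h = a, h
    return best_a
```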

For simulation-based optimization, we minimize an upper bound for the optimal total cost, and define the optimal value function approximation as follows.

Definition 3. The projected optimal value function approximation is given by an optimal $W^\star \in \mathcal{W}$ that solves
\[
\begin{aligned}
\underset{W \in \mathcal{W}}{\text{minimize}} \quad & J(x_0, u) + \bar{J} \times \big(1 - 1_F(q_f)\big) && (7a)\\
\text{subject to} \quad & V = \langle W, \Phi \rangle, && (7b)\\
& u(x, q) = \operatorname*{argmin}_{a \in U} \Big\{ \tfrac{\partial V(x, q)}{\partial x} \cdot f(x, a) + \ell(x, a) \Big\}, \quad \forall x \in R_q,\ \forall q \in Q, && (7c)\\
& V(x, q) = \min_{q'} \big\{ V(x, q') + s(x, q, q') \big\}, \quad \forall x \in G_e,\ \forall e = (q, \sigma, q') \in E, && (7d)\\
& V(x_f, q_f) = 0, \quad \forall q_f \in F, && (7e)\\
& \dot{x} = f(x, u),\ x(0) = x_0, \qquad q(t_k^+) = \delta\big(q(t_k^-), L(x(t_k^+))\big) \ \ \forall t_k \text{ such that } L(x(t_k^-)) \ne L(x(t_k^+)), && (7f)
\end{aligned}
\]
where $\bar{J} \in \mathbb{R}_+$ is a pre-defined large penalty for violating the specification.

The choice of $\bar{J}$ will be discussed in more detail in Section 4.2. Note that the decision variable $W$ is implicit in the objective function and explicit in the controller $u$, on which the cost function depends.

We only consider value function approximations of the form $V = \langle W, \Phi \rangle$ for a given set of bases $\Phi$. For each value function approximation, the corresponding controller is fixed under constraint (7c). The optimal value function approximation is the one that minimizes the actual cost $J(x_0, u)$ under that fixed controller within the constrained policy space. Slightly abusing notation, we denote by $J(x_0; W)$ the cost generated by applying the controller $u$ computed from the value function approximation $\langle W, \Phi \rangle$, and by $V(x_0; W)$ the value function approximation $\langle W, \Phi \rangle$ evaluated at the initial state $x_0$.

4.1 Sampling-based control design

Previously, variants of gradient descent [17, 1] have been used to implement approximate value iteration. Local optima can be problematic when the value function of a hybrid system is discontinuous. Moreover, finding a reasonable starting point in the weight space is critical for gradient descent to succeed. We instead employ MRAS to solve (7). The idea is to sample the weight space $\mathcal{W}$ according to a parameterized distribution $p(\cdot, \theta)$ with parameter $\theta \in \Theta$. For each sampled weight, we simulate the run of the corresponding controller using a model of the system for a finite time $T$. The finite time $T$ is an estimated upper bound on the time it takes for the system to stabilize to $x_f$ with a final bounded error, for any controller with which the closed-loop system satisfies the specification. By evaluating the sampled trajectory, we update $\theta$ to bias the sampling distribution towards promising weight parameters. With probability one, the distribution converges to a parameter $\theta^\star$, and the distribution $p(\cdot, \theta^\star)$ concentrates on the solution $W^\star$ of the optimization problem (7).

First, we select a multivariate Gaussian distribution as the sample distribution. Recall that the probability density of a multivariate Gaussian distribution is given by
\[
p(W; \theta) = \frac{1}{\sqrt{(2\pi)^N |\Sigma|}} \exp\Big(-\tfrac{1}{2}(W - \mu)^T \Sigma^{-1} (W - \mu)\Big), \qquad \theta = (\mu, \Sigma), \quad \forall W \in \mathcal{W},
\]
where $\mu$ is the mean vector, $\Sigma$ is the covariance matrix, $N$ is the dimension of the weight vector $W \in \mathcal{W}$, and $|\Sigma|$ is the determinant of $\Sigma$.

Next, we iteratively update the mean and covariance of the sampling distribution until $p(\cdot, \theta_k)$ converges to a degenerate distribution with a vanishing covariance.

1) Initialization: Select an initial distribution over weight vectors $\mathcal{W}$, denoted $p(\cdot, \theta_0)$, for some $\theta_0 \in \Theta$. We specify a quantile parameter $\rho \in (0, 1]$, a small real $\varepsilon \in \mathbb{R}_+$ called the improvement parameter, an initial sample size $N_0$, a sample-size increase coefficient $\alpha \in (0, 1]$, and a strictly decreasing, positive function $S : \mathbb{R} \to \mathbb{R}_+$. Possible choices are $S(x) = e^{-x}$ or $S(x) = 1/x$ for $x$ strictly positive (which is the case here because the total cost is always positive). Set $k := 1$ and go to step 2).

2) Sampling: At each iteration $k$, given the current distribution $p(\cdot, \theta_k)$, generate a set $\mathcal{SW}$ of $N_k$ samples. For each $W \in \mathcal{SW}$ in the sample set, evaluate $J(x_0; W)$ by simulating the system model (1) with the approximate controller (6), where $V = \langle W, \Phi \rangle$.

3) Reject unsatisfying policies: Reject all weights for which the generated controllers do not satisfy the LTL formula. The remaining set of weights is denoted $\mathcal{SW}^{\mathrm{sat}}$.

4) Select elite samples and threshold: Order the set $\{J(x_0; W) \mid W \in \mathcal{SW}^{\mathrm{sat}}\}$ from largest (worst) to smallest (best) among the given samples,
\[
J_{k,(0)} \ge \ldots \ge J_{k,(N_k)}.
\]
Let $\gamma$ be the estimated $(1 - \rho)$-quantile of the costs $J(\cdot; W)$, i.e., $\gamma = J_{k, \lceil (1-\rho) N_k \rceil}$.

• If this is the first iteration, introduce a threshold $\bar{\gamma} := \gamma$.

• Otherwise, the following cases are distinguished:

  – If $\gamma \le \bar{\gamma} - \varepsilon$, i.e., the estimated $(1-\rho)$-quantile of the cost has been reduced by the amount $\varepsilon$ from the last iteration, then update $\bar{\gamma} := \gamma$, set $N_{k+1} := N_k$, and go to step 5).

  – Otherwise, if $\gamma > \bar{\gamma} - \varepsilon$, find the largest $\rho'$, if it exists, such that the estimated $(1 - \rho')$-quantile of the cost, $\gamma' = J_{k, \lceil (1-\rho') N_k \rceil}$, satisfies $\gamma' \le \bar{\gamma} - \varepsilon$. Then, update $\bar{\gamma} := \gamma'$ and the quantile $\rho := \rho'$. Set $N_{k+1} := N_0$ and go to step 5). If no such $\rho'$ exists, increase the sample size by a factor of $(1 + \alpha)$, $N_{k+1} = \lceil (1 + \alpha) N_k \rceil$. Set $\theta_{k+1} := \theta_k$, $k := k + 1$, and go to step 2).

5) Parameter update: Update the parameters $\theta_{k+1}$ for iteration $k + 1$ as follows. First, define the set $\mathcal{EW} = \{W \in \mathcal{SW}^{\mathrm{sat}} \mid J(x_0; W) \le \bar{\gamma}\}$ of elite samples. The samples in $\mathcal{EW}$ are accepted and samples not in $\mathcal{EW}$ are rejected during the parameter update defined next. Select the next parameter $\theta_{k+1}^\star$ to maximize the weighted sum of probabilities of elite samples according to the weighting
\[
\theta_{k+1}^\star = \operatorname*{argmax}_{\theta \in \Theta} \sum_{W \in \mathcal{EW}} \frac{S\big(J(x_0; W)\big)^k}{p(W, \theta_k)}\, p(W, \theta), \qquad (8)
\]
which puts a higher weight on those weight vectors with the lowest costs. Since the optimal parameter $\theta_{k+1}^\star$ cannot be determined analytically, we use a maximum likelihood estimate of the elite sample distribution, $\theta_{k+1}^\star \approx (\mu_{k+1}, \Sigma_{k+1})$, where
\[
\mu_{k+1} = \frac{\mathbb{E}_{\theta_k}\!\Big[\frac{S(J(x_0; W))^k}{p(W, \theta_k)}\, I_{\mathcal{EW}}(W)\, W\Big]}{\mathbb{E}_{\theta_k}\!\Big[\frac{S(J(x_0; W))^k}{p(W, \theta_k)}\, I_{\mathcal{EW}}(W)\Big]}
\approx \frac{\sum_{W \in \mathcal{SW}} \frac{S(J(x_0; W))^k}{p(W, \theta_k)}\, I_{\mathcal{EW}}(W)\, W}{\sum_{W \in \mathcal{SW}} \frac{S(J(x_0; W))^k}{p(W, \theta_k)}\, I_{\mathcal{EW}}(W)}, \qquad (9)
\]
and
\[
\Sigma_{k+1} = \frac{\mathbb{E}_{\theta_k}\!\Big[\frac{S(J(x_0; W))^k}{p(W, \theta_k)}\, I_{\mathcal{EW}}(W)\, (W - \mu_{k+1})(W - \mu_{k+1})^T\Big]}{\mathbb{E}_{\theta_k}\!\Big[\frac{S(J(x_0; W))^k}{p(W, \theta_k)}\, I_{\mathcal{EW}}(W)\Big]}
\approx \frac{\sum_{W \in \mathcal{SW}} \frac{S(J(x_0; W))^k}{p(W, \theta_k)}\, I_{\mathcal{EW}}(W)\, (W - \mu_{k+1})(W - \mu_{k+1})^T}{\sum_{W \in \mathcal{SW}} \frac{S(J(x_0; W))^k}{p(W, \theta_k)}\, I_{\mathcal{EW}}(W)}. \qquad (10)
\]
Note that we approximate $\mathbb{E}_{\theta_k}[h(r)]$ with its sample estimate $\frac{1}{N_k}\sum_{W \in \mathcal{SW}} h(W)$ for $r \sim p(\cdot, \theta_k)$, and $I_{\mathcal{EW}} : \mathcal{W} \to \{0, 1\}$ is the indicator function that equals 1 if $W \in \mathcal{EW}$ and 0 otherwise. (A compact sketch of this update is given after step 7.)

6) Smoothing update: The actual parameter is smoothed with parameter $\lambda \in (0, 1)$ as
\[
\theta_{k+1} := \lambda \theta_k + (1 - \lambda)\theta_{k+1}^\star. \qquad (11)
\]

7) Stopping criterion: Stop the iteration if the covariance matrix $\Sigma_k$ becomes nearly singular, that is, if the determinant of $\Sigma_k$ approaches 0. The reason for this choice of stopping criterion is given next.
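The following sketch illustrates one pass of the parameter update in steps 5) and 6) on a finite sample set, assuming the costs, the elite set, and the current Gaussian parameters are already available. The use of SciPy and the variable names are assumptions for illustration, not the paper's implementation.

```python
# Sketch of the importance-weighted MLE update (9)-(10) and smoothing (11).
import numpy as np
from scipy.stats import multivariate_normal

def mras_update(samples, costs, elite_mask, mu, cov, k, lam=0.2, S=None):
    """samples: (N, d) weight vectors; costs: (N,) values J(x0; W); elite_mask: (N,) bool."""
    if S is None:
        S = lambda c: np.exp(-c)                 # strictly decreasing, positive
    dens = multivariate_normal.pdf(samples, mean=mu, cov=cov)   # p(W, theta_k)
    w = (S(costs) ** k / dens) * elite_mask      # S(J)^k / p, zero outside the elite set
    w = w / w.sum()
    mu_new = w @ samples                         # weighted mean, cf. (9)
    centered = samples - mu_new
    cov_new = (w[:, None] * centered).T @ centered   # weighted covariance, cf. (10)
    # smoothing update, cf. (11)
    return lam * mu + (1 - lam) * mu_new, lam * cov + (1 - lam) * cov_new
```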

In the algorithm, elite samples can be reused by including in the current sample set $\mathcal{SW}$ at iteration $k$, denoted $\mathcal{SW}_k$, the set of elite samples from the previous iteration, denoted $\mathcal{EW}_{k-1}$, for $k > 1$. The current set of elite samples is then obtained from the $\rho$-quantile of $\mathcal{SW}_k \cup \mathcal{EW}_{k-1}$.

We show that the global convergence of the algorithm is ensured by the properties of MRAS under the following additional assumptions. Let $W^\star$ be the optimal weight parameterizing the projected optimal value function approximation.

Assumption 2. For any given constant $\xi > J(x_0; W^\star)$, the set $\{W \mid J(x_0; W) \le \xi\} \cap \mathcal{W}$ has a strictly positive Lebesgue measure.

This condition ensures that any neighborhood of the optimal solution $W^\star$ will be sampled with a positive probability.

Assumption 3. For any $\delta > 0$, we have $\inf_{W \in \mathcal{W}_\delta} J(x_0; W) > J(x_0; W^\star)$, where $\mathcal{W}_\delta := \{W \mid \|W - W^\star\| \ge \delta\} \cap \mathcal{W}$.

Theorem 1 (Adapted from Theorem 1 of [9]). Under Assumptions 1–3, the algorithm converges with probability one to
\[
\lim_{k \to \infty} \mu_k = W^\star \quad \text{and} \quad \lim_{k \to \infty} \Sigma_k = 0_{n \times n},
\]
provided that (8) is solved exactly.


Since our algorithm directly uses the multivariate Gaussian distribution, it would converge to the global optimal solution after finitely many iterations if it could be provided with an infinite number of samples. However, in practice, only finitely many samples can be used. In addition, the parameter update (8) is not solved exactly based on the entire set of elite samples. Instead, we provide a maximum likelihood estimate of $\theta$ for the next iteration using a finite number of elite samples. Similar to the CE method, the resulting algorithm may converge to local optimal solutions if the sample size is not sufficient. In general, the number of samples for each iteration is polynomial in the number of decision variables [23], which, in this context, is the number of unknown weights.

It should be noted that the value function approximation computed with this sampling-based algorithm is not necessarily a control Lyapunov function. In fact, there is no guarantee ahead of time that the space of value function approximations in a given fixed basis contains a control Lyapunov function. In that case, the method performs approximate optimal feedback planning in the hybrid system rather than control design that stabilizes the system.

4.2 Rank-guided sample weighting

A direct implementation of the algorithm enables convergence to a globally optimal weight for a chosen basis. However, for a high-dimensional weight vector space, it is unlikely that many samples will satisfy the specification in the first few iterations. As a consequence, we are left with limited information to update the sampling distribution and the resulting control policy.

To address this issue, we propose a rank-guided adaptive policy search, which incorporates an additional terminal cost associated with the rank of the specification state. This allows us to remove the rejection step, so that all sampled weight vectors are associated with costs.

Definition 4. The rank function $\mathrm{rank} : Q \to \mathbb{Z}_{\ge 0}$ maps each specification state $q \in Q$ to a nonnegative integer and is defined recursively as
\[
\mathrm{rank}(q) = \begin{cases} 0, & \text{if } q \in F, \\ \min_{\sigma \in 2^{AP}} \{1 + \mathrm{rank}(\delta(q, \sigma))\}, & \text{otherwise.} \end{cases}
\]

It can be shown that, given the set $Q_k$ of states with rank $k$, the set $Q_{k+1}$ of states with rank $k + 1$ can be computed as
\[
Q_{k+1} = \{q \in Q \mid q \notin Q_i \text{ for all } 0 \le i \le k, \text{ and } \exists \sigma \in 2^{AP} \text{ such that } \delta(q, \sigma) \in Q_k\}.
\]
In other words, $Q_{k+1}$ contains the states that are not included in any $Q_i$, $i \le k$, and that can take a labeled transition to a state $q'$ in $Q_k$.

Next, we define the terminal cost function on the specification state, $h : Q \to \mathbb{R}$, by
\[
h(q) = \begin{cases} 0, & \text{if } q \in F, \\ \mathrm{rank}(q) \cdot c, & \text{otherwise,} \end{cases}
\]
where $c$ is a constant. We must select the constant $c$ to be at least as large as the cost incurred when the system trajectory triggers a transition in the specification automaton. In practice, this cost is chosen based on an estimated upper bound and does not need to be precise.
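The rank function can be computed offline by a backward breadth-first sweep over the DFA, layer by layer as in the expression for $Q_{k+1}$ above. The following sketch assumes an explicit automaton given by its state set, alphabet, transition function, and accepting set; the names are illustrative.

```python
# Sketch: rank(q) = minimum number of transitions from q to an accepting state.
from math import inf

def compute_ranks(states, alphabet, delta, accepting):
    rank = {q: (0 if q in accepting else inf) for q in states}
    frontier = set(accepting)                  # Q_0 = F
    k = 0
    while frontier:
        nxt = {q for q in states if rank[q] == inf
               and any(delta(q, s) in frontier for s in alphabet)}  # Q_{k+1}
        for q in nxt:
            rank[q] = k + 1
        frontier, k = nxt, k + 1
    return rank   # states left at inf cannot reach an accepting state

# Terminal cost from the rank, with c an estimated upper bound on one transition cost:
# h = lambda q: 0 if q in accepting else c * rank[q]
```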

In summary, we developed a sampling-based method to search for an optimal weight defining a value function approximation. The sample distribution is iteratively updated based on simulated runs, assigning a higher probability to those sampled weights that produce controllers with the best performance. The method is probabilistically complete: with an infinite number of samples, the algorithm is guaranteed to converge to the optimal solution of (7). Furthermore, to address the sample scarcity caused by rejecting unsatisfying weights, we introduced a meta-heuristic using the rank function of the specification automaton. Instead of rejecting weights, the penalty associated with a nonzero rank allows us to assign lower probabilities to those weights that do not generate correct controllers.

5. EXAMPLES

In this section, we illustrate the correctness and efficiency of the proposed method with two case studies. The first is an example linear system, and the second is a nonlinear system, a Dubins car. The experiments are carried out in Matlab on an Intel(R) Core(TM) i7 CPU with 16 GB RAM.

5.1 Linear system with halfspace labels

We consider the linear quadratic system on $\mathcal{X} = \mathbb{R}^2$ with the specific parameters
\[
A = \begin{bmatrix} 2 & -2 \\ 1 & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad Q = I, \quad R = 1, \quad \xi = 1, \qquad x_0 = (0.5, 0.5), \quad x_f = (0, 0).
\]

Let AP = $\{a, b, c\}$ consist of atomic propositions that are true whenever the continuous state enters a specific region,
\[
a : (x_1 \le -1), \qquad b : (-1 < x_1 \le 1), \qquad c : (x_1 > 1).
\]
Note that each atomic proposition corresponds to a halfspace. Using AP, we partition the state space into three regions $R_A = \{x \in \mathbb{R}^2 \mid x_1 \le -1\}$, $R_B = \{x \in \mathbb{R}^2 \mid -1 < x_1 \le 1\}$, and $R_C = \{x \in \mathbb{R}^2 \mid x_1 > 1\}$, with the following LTL specification:
\[
\varphi_1 = (A \to \Diamond B) \wedge (C \to \Diamond B) \wedge (\Diamond A \vee \Diamond C).
\]
This specification ensures that either $R_A$ or $R_C$ must be reached, after which the system must eventually visit $R_B$. The automaton for this specification is shown in Fig. 1.

[Figure 1: Automaton $\mathcal{A}_{\varphi_1}$ for $\varphi_1 = (A \to \Diamond B) \wedge (C \to \Diamond B) \wedge (\Diamond A \vee \Diamond C)$; states $q_0, \ldots, q_4$ with $q_0$ initial. Diagram not reproduced.]

We consider a value function approximation with polynomial bases. The set of bases is $\{x_1^2, x_2^2, x_1 x_2, x_1, x_2, 1\}$ for every specification state $q \in Q$. Thus, the total number of unknown weights is 30. Next, we employ the proposed algorithm to search for the approximate optimal weight vector $W^\star$. The following parameters are used: initial sample size $N_0 = 100$, improvement parameter $\varepsilon = 0.1$, smoothing parameter $\lambda = 0.2$, sample increment percentage $\alpha = 0.1$, and quantile parameter $\rho = 0.1$. The algorithm took 11 iterations to converge to the approximate optimal value function under the stopping criterion $\|\Sigma\| \le 0.001$. The cost of the corresponding controller is 9.8384 with a terminal cost $100 \times \|x\|^2$ and a finite time horizon $T = 10$.

We test the controller for different initial states. As shown in Fig. 2b, a trajectory starting at $(0.5, 0.5)$ (near region $R_C$) visits $R_C$, then $R_B$, and stabilizes to the origin. A trajectory starting at $(-0.5, -0.5)$ (near region $R_A$) visits $R_A$, then $R_B$, in this order.
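For this example, the labeling function and the per-mode polynomial feature vector can be written down explicitly; the sketch below follows the region definitions and basis set given above, with illustrative function names.

```python
# Sketch: halfspace labels and the polynomial basis of the linear example.
import numpy as np

def label(x):
    """Map x in R^2 to its proposition: A (x1 <= -1), B (-1 < x1 <= 1), C (x1 > 1)."""
    if x[0] <= -1:
        return "A"
    elif x[0] <= 1:
        return "B"
    return "C"

def phi(x):
    """Basis {x1^2, x2^2, x1*x2, x1, x2, 1}; V(x, q) = w_q . phi(x), one w_q per mode."""
    x1, x2 = x
    return np.array([x1**2, x2**2, x1 * x2, x1, x2, 1.0])
```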

Remark. When no terminal cost is considered, the approximate optimal controller can be obtained by convex optimization [20]. Using the proposed sampling-based method with sample size 500 at each iteration, the cost of the computed optimal controller is 3.723. The optimality can be improved by increasing the sample size, which requires more computation. Overall, for linear quadratic systems with co-safe LTL specifications, the direct solution [20] is much more efficient. However, this solution does not apply to nonlinear systems, which prompts the development of the sampling-based method.

5.2 Dubins car system with obstacle

Given the Dubins car dynamics $\dot{x} = u\cos(\theta)$, $\dot{y} = u\sin(\theta)$, and $\dot{\theta} = v$, where $\mathbf{x} = (x, y, \theta) \in SE(2)$ is the state and $u$ and $v$ are the inputs to the system (linear and angular velocities), we consider the LTL formula $\varphi_2 = \Diamond(\Diamond(A \wedge \Diamond B) \vee \Diamond(B \wedge \Diamond A) \wedge \Diamond C) \wedge \Box\neg\mathrm{obs}$. The corresponding specification automaton is shown in Fig. 3. In this case, the system needs to traverse regions $A$ and $B$ in any order, and eventually reach $C$ while avoiding collisions with a static obstacle. The running cost $\ell$ is the same as in the previous example, with terminal cost $100 \times \|(x, y) - (x_f, y_f)\|$. We define an additional cost $h(q) = 2 \times 10^3\, \mathrm{rank}(q)$, where $\mathrm{rank}(q)$ is the minimum number of transitions required to reach an accepting state from $q(T)$, with terminal time $T = 50$.

In the Dubins car case, we select radial basis function (RBF) bases. An RBF is defined by
\[
\phi(\mathbf{x}) = \exp\big(-\|\mathbf{x} - \mathbf{x}_c\|^2 / 2\sigma^2\big),
\]
where $\mathbf{x}_c$ is the center of the basis element and $\sigma$ is a free parameter. We define $\phi_{\mathrm{rbf}} = [\phi_1, \ldots, \phi_N]^T$, where the $\phi_i$ are RBFs centered on a uniform grid in $x$-$y$ coordinates with step sizes $\Delta x = 5$ and $\Delta y = 5$. The workspace is bounded by $0 \le x \le 30$ and $0 \le y \le 30$. The free parameter $\sigma$ of each RBF is 5. We also define the bases $\phi_\theta = [\sin(\theta), \cos(\theta)]^T$, such that for each state $q \in Q$, the vector of basic bases is $\phi_{\mathrm{basic}} = [\phi_{\mathrm{rbf}}^T, \phi_\theta^T]^T$.

Additional RBF bases are determined by placing their centers at the centers of the disk regions $A$, $B$, $C$, and obs. For example, in the state $q_2$, since region $B$ has already been visited, the center of region $A$ is taken into consideration, so the basis vector with respect to the state $q_2$ is $\phi_2 = [\phi_{\mathrm{basic}}^T, \phi_A^T, \phi_{\mathrm{obs}}^T]^T$, where $\phi_A$ (resp. $\phi_{\mathrm{obs}}$) is an RBF with its center at the center of disk $A$ (resp. obs) and RBF parameter $\sigma = 5$. Finally, the total number of basis functions (weight parameters) is 530. For the sampling-based algorithm, the same set of parameters is used, except that the initial sample size is chosen to be 500. A smaller sample size leads to faster updates, but requires more iterations.
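The mode-dependent feature vector for this example can be assembled as follows; the grid spacing, $\sigma = 5$, and the per-mode extra centers follow the text, while the function names are illustrative.

```python
# Sketch: RBF feature vector for the Dubins car example.
import numpy as np

SIGMA = 5.0
GRID = [(gx, gy) for gx in range(0, 31, 5) for gy in range(0, 31, 5)]  # uniform x-y grid

def rbf(xy, center, sigma=SIGMA):
    d = np.asarray(xy, dtype=float) - np.asarray(center, dtype=float)
    return np.exp(-(d @ d) / (2.0 * sigma**2))

def features(state, extra_centers):
    """state = (x, y, theta); extra_centers are the region centers added in mode q."""
    x, y, theta = state
    phi_rbf = [rbf((x, y), c) for c in GRID]       # grid RBFs
    phi_theta = [np.sin(theta), np.cos(theta)]     # heading bases
    phi_extra = [rbf((x, y), c) for c in extra_centers]
    return np.array(phi_rbf + phi_theta + phi_extra)
```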

[Figure 2: (a) The state trajectory for the linear system with $x_0 = (0.5, 0.5)$ calculated by the computed controller after the algorithm converges. (b) Approximate minimal-cost trajectories satisfying the specification; the trajectories start from different initial states, $x_0 = (-0.5, -0.5)$ and $x_0 = (0.5, 0.5)$, while the same controller is applied. (c) The cost over iterations. (d) The mean of the Gaussian distribution over iterations for the first three components of the mean vector.]


[Figure 3: Automaton $\mathcal{A}_{\varphi_2}$ for $\varphi_2 = \Diamond(\Diamond(A \wedge \Diamond B) \vee \Diamond(B \wedge \Diamond A) \wedge \Diamond C) \wedge \Box\neg\mathrm{obs}$. The self-loops with labels other than $A$, $B$, $C$, and obs are omitted. Diagram not reproduced.]

Fig. 4a shows the system trajectories computed using the value function approximation parameterized by $\mu$ over 179 iterations. Each iteration takes 20 to 30 seconds, and the algorithm converges to an optimal controller after 73 iterations, with the same stopping criterion as in the linear example. Interestingly, the algorithm quickly finds a control policy that satisfies the specification, after 15 iterations. Figs. 4b and 4c show the $x$ and $y$ state trajectories with and without initial error. The solid lines are the state trajectories with small initial errors. Even with small initial errors, the controller ensures satisfaction of the temporal logic constraints. Finally, Fig. 4d shows the convergence of the total cost at each iteration. The $x$-axis is the iteration number $k$, while the $y$-axis is the total cost evaluated at the mean $\mu_k$. Note that the mean of the distribution upon convergence may not be optimal due to the finite sample size. As a consequence, when the algorithm terminates under the terminal condition $\|\Sigma_k\| \le 0.005$, the distribution may converge to a sub-optimal solution.

6. CONCLUSION

In this work, we presented a sampling-based method for optimal control with co-safe LTL constraints. Through a hybrid system formulation, the objective is to solve a sequence of value functions over a hybrid state space, where the continuous component comes from the continuous-time and continuous-state dynamics of the system, and the discrete component comes from the specification automaton. By approximating the value functions with weighted combinations of pre-defined bases, we employ model reference adaptive search (MRAS), a general sampling-based optimization method, to search directly over the weight parameter space. The method often works in a near-anytime fashion: it quickly finds a hybrid controller that satisfies the temporal logic constraint, and it improves the value function approximation, which minimizes an upper bound on the actual value function, as more computation time is permitted. The algorithm converges, with probability one, to an optimal value function approximation. This procedure does not rely on discretizing the time or state space.

Building on this result, we consider several further developments. At this stage, the approach is limited to the subset of LTL specifications that admit deterministic finite (rather than Büchi) automaton representations. Extensions to the general class of LTL specifications that admit deterministic Büchi automaton representations are subjects of current work. We will also consider decomposition-based distributed sampling so that the algorithm scales to problems with a high-dimensional weight parameter space.

[Figure 4: (a) Trajectories for the Dubins car with the controller $u(x, q)$ computed from $V(x; \mu)$, where $\mu$ is the mean of the Gaussian distribution over iterations; dashed lines represent trajectories that do not satisfy the temporal logic constraints. (b–c) State trajectories of the Dubins car under the converged controller, with and without errors in the initial states. (d) Cost over iterations.]

We will further extend the proposed method as a building block in anytime optimal and provably correct decision-making systems for nonlinear robotic systems. Finally, the sampling algorithm is highly parallelizable, suggesting the use of GPU-accelerated computing to speed up computations.

Acknowledgments
This work was supported in part by grants from AFRL #FA8650-15-C-2546, ONR #N00014-15-IP-00052, and NSF #1550212.

7. REFERENCES
[1] M. Abu-Khalaf and F. L. Lewis. Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica, 41(5):779–791, 2005.
[2] R. Alur, T. A. Henzinger, G. Lafferriere, and G. J. Pappas. Discrete abstractions of hybrid systems. Proceedings of the IEEE, 88(7):971–984, July 2000.
[3] G. E. Fainekos, A. Girard, H. Kress-Gazit, and G. J. Pappas. Temporal logic motion planning for dynamic robots. Automatica, 45(2):343–352, 2009.
[4] P. Gastin and D. Oddoux. Fast LTL to Büchi automata translation. In G. Berry, H. Comon, and A. Finkel, editors, International Conference on Computer Aided Verification (CAV'01), volume 2102 of Lecture Notes in Computer Science, pages 53–65, Paris, France, July 2001. Springer.
[5] L. C. G. J. M. Habets and C. Belta. Temporal logic control for piecewise-affine hybrid systems on polytopes. In Proceedings of the 19th International Symposium on Mathematical Theory of Networks and Systems (MTNS), pages 195–202, July 2010.
[6] M. Hauschild and M. Pelikan. An introduction and survey of estimation of distribution algorithms. Swarm and Evolutionary Computation, 1(3):111–128, 2011.
[7] S. Hedlund and A. Rantzer. Optimal control of hybrid systems. In IEEE Conference on Decision and Control (CDC), volume 4, pages 3972–3977, 1999.
[8] S. Hedlund and A. Rantzer. Convex dynamic programming for hybrid systems. IEEE Transactions on Automatic Control, 47(9):1536–1540, 2002.
[9] J. Hu, M. C. Fu, and S. I. Marcus. A model reference adaptive search method for global optimization. Operations Research, 55(3):549–568, 2007.
[10] M. Johansson and A. Rantzer. Computation of piecewise quadratic Lyapunov functions for hybrid systems. IEEE Transactions on Automatic Control, 43(4):555–559, Apr. 1998.
[11] N. Kariotoglou, S. Summers, T. Summers, M. Kamgarpour, and J. Lygeros. Approximate dynamic programming for stochastic reachability. In European Control Conference (ECC), pages 584–589, July 2013.
[12] M. Kloetzer and C. Belta. A fully automated framework for control of linear systems from temporal logic specifications. IEEE Transactions on Automatic Control, 53(1):287–297, 2008.
[13] M. Kobilarov. Cross-entropy motion planning. The International Journal of Robotics Research, 31(7):855–871, 2012.
[14] M. Kobilarov. Sample complexity bounds for iterative stochastic policy optimization. In Advances in Neural Information Processing Systems, pages 3114–3122, 2015.
[15] O. Kupferman and M. Y. Vardi. Model checking of safety properties. Formal Methods in System Design, 19(3):291–314, Nov. 2001.
[16] T. Latvala. Efficient model checking of safety properties. In T. Ball and S. K. Rajamani, editors, International SPIN Workshop on Model Checking of Software, volume 2648 of Lecture Notes in Computer Science, pages 74–88. Springer, 2003.
[17] F. Lewis, S. Jagannathan, and A. Yesildirak. Neural Network Control of Robot Manipulators and Non-Linear Systems. CRC Press, 1998.
[18] J. Liu and N. Ozay. Abstraction, discretization, and robustness in temporal logic control of dynamical systems. In International Conference on Hybrid Systems: Computation and Control (HSCC), pages 293–302. ACM, 2014.
[19] S. C. Livingston, E. M. Wolff, and R. M. Murray. Cross-entropy temporal logic motion planning. In International Conference on Hybrid Systems: Computation and Control (HSCC), pages 269–278. ACM, 2015.
[20] I. Papusha, J. Fu, U. Topcu, and R. M. Murray. Automata theory meets approximate dynamic programming: Optimal control with temporal logic constraints. In IEEE Conference on Decision and Control (CDC), Dec. 2016.
[21] N. Piterman, A. Pnueli, and Y. Sa'ar. Synthesis of reactive(1) designs. In International Workshop on Verification, Model Checking, and Abstract Interpretation, pages 364–380. Springer, 2006.
[22] R. Reemtsen and J.-J. Rückmann. Semi-Infinite Programming, volume 25. Springer Science & Business Media, 1998.
[23] R. Y. Rubinstein and D. P. Kroese. The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning. Springer Science & Business Media, 2013.
[24] S. L. Smith, J. Tumova, C. Belta, and D. Rus. Optimal path planning for surveillance with temporal-logic constraints. International Journal of Robotics Research, 30(14):1695–1708, Dec. 2011.
[25] K. Weierstrass. Über die analytische Darstellbarkeit sogenannter willkürlicher Functionen einer reellen Veränderlichen. Berliner Berichte, 1885.
[26] E. M. Wolff, U. Topcu, and R. M. Murray. Automaton-guided controller synthesis for nonlinear systems with temporal logic. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4332–4339, Nov. 2013.
[27] T. Wongpiromsarn, U. Topcu, and R. M. Murray. Receding horizon temporal logic planning. IEEE Transactions on Automatic Control, 57(11):2817–2830, Nov. 2012.
[28] X. Xu and P. J. Antsaklis. Optimal control of switched systems based on parameterization of the switching instants. IEEE Transactions on Automatic Control, 49(1):2–16, 2004.
