Sampling-Based Robust Control of Autonomous Systems with Non-Gaussian Noise*

Thom S. Badings1, Alessandro Abate2, Nils Jansen1

David Parker3, Hasan A. Poonawala4, Marielle Stoelinga1

1 Radboud University, Nijmegen, the Netherlands   2 University of Oxford, Oxford, United Kingdom

3 University of Birmingham, Birmingham, United Kingdom   4 University of Kentucky, Kentucky, USA

Abstract

Controllers for autonomous systems that operate in safety-critical settings must account for stochastic disturbances. Such disturbances are often modeled as process noise, and common assumptions are that the underlying distributions are known and/or Gaussian. In practice, however, these assumptions may be unrealistic and can lead to poor approximations of the true noise distribution. We present a novel planning method that does not rely on any explicit representation of the noise distributions. In particular, we address the problem of computing a controller that provides probabilistic guarantees on safely reaching a target. First, we abstract the continuous system into a discrete-state model that captures noise by probabilistic transitions between states. As a key contribution, we adapt tools from the scenario approach to compute probably approximately correct (PAC) bounds on these transition probabilities, based on a finite number of samples of the noise. We capture these bounds in the transition probability intervals of a so-called interval Markov decision process (iMDP). This iMDP is robust against uncertainty in the transition probabilities, and the tightness of the probability intervals can be controlled through the number of samples. We use state-of-the-art verification techniques to provide guarantees on the iMDP, and compute a controller for which these guarantees carry over to the autonomous system. Realistic benchmarks show the practical applicability of our method, even when the iMDP has millions of states or transitions.

1 Introduction

Consider a so-called reach-avoid problem for an unmanned aerial vehicle (UAV), where the goal is to reach a desirable region within a given time horizon, while avoiding certain unsafe regions (Baier and Katoen 2008; Clarke, Emerson, and Sistla 1986). A natural formal model for such an autonomous system is a dynamical system. The state of the system reflects the position and velocity of the UAV, and the control inputs reflect choices that may change the state over time (Kulakowski, Gardner, and Shearer 2007). The dynamical system is linear if the state transition is linear in the current state and control input. Our problem is to compute a controller, such

*This work was funded by NWO grant NWA.1160.18.238 (PrimaVera), and ERC Advanced Grant 834115 (FUN2MODEL). Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

that the state of the UAV progresses safely, without entering unsafe regions, to its goal (Åström and Murray 2010).

However, factors like turbulence and wind gusts cause uncertainty in the outcome of control inputs (Blackmore et al. 2010). We model such uncertainty as process noise, which is an additive random variable (with possibly infinitely many outcomes) in the dynamical system that affects the transition of the state. Controllers for autonomous systems that operate in safety-critical settings must account for such uncertainty.

A common assumption to achieve computational tractability of the problem is that the process noise follows a Gaussian distribution (Park, Serpedin, and Qaraqe 2013), for example in linear-quadratic-Gaussian control (Anderson and Moore 2007). However, in realistic problems, such as the UAV operating under turbulence, this assumption yields a poor approximation of the uncertainty (Blackmore et al. 2010). Distributions may even be unknown, meaning that one cannot derive a set-bounded or stochastic representation of the noise. In this case, it is generally hard or even impossible to derive hard guarantees on the probability that a given controller ensures a safe progression of the system's state to the objective.

In this work, we do not require that the process noise is known. Specifically, we provide probably approximately correct (PAC) guarantees on the performance of a controller for the reach-avoid problem, where the distribution of the noise is unknown. As such, we solve the following problem:

Given a linear dynamical system perturbed by additive noise of unknown distribution, compute a controller under which, with high confidence, the probability to satisfy a reach-avoid problem is above a given threshold value.

Finite-state abstraction. The fundamental concept of our approach is to compute a finite-state abstraction of the dynamical system. We obtain such an abstract model from a partition of the continuous state space into a set of disjoint convex regions. Actions in this abstraction correspond to control inputs that induce transitions between these regions. Due to the process noise, the outcome of an action is stochastic, and every transition has a certain probability.

Probability intervals. Since the distribution of the noise is unknown, it is not possible to compute the transition probabilities exactly. Instead, we estimate the probabilities based on a finite number of samples (also called scenarios) of the noise, which may be obtained from a high-fidelity (black-box) simulator or from experiments. To be robust against estimation errors in these probabilities, we adapt tools from the scenario approach (also called scenario optimization), which is a methodology to deal with stochastic convex optimization in a data-driven fashion (Campi and Garatti 2008; Garatti and Campi 2019). We compute upper and lower bounds on the transition probabilities with a desired confidence level, which we choose up front. These bounds are PAC, as they contain the true probabilities with at least this confidence level.

Interval MDPs. We formalize our abstractions with the PAC probability bounds using so-called interval Markov decision processes (iMDPs). While regular MDPs require precise transition probabilities, iMDPs exhibit probability intervals (Givan, Leach, and Dean 2000). Policies for iMDPs have to robustly account for all possible probabilities within the intervals, and one usually provides upper and lower bounds on maximal or minimal reachability probabilities or expected rewards (Hahn et al. 2017; Puggelli et al. 2013; Wolff, Topcu, and Murray 2012). For MDPs with precise probabilities, mature tool support exists, for instance, via PRISM (Kwiatkowska, Norman, and Parker 2011). In this work, we extend the support of PRISM to iMDPs.

Iterative abstraction scheme. The tightness of the probability intervals depends on the number of noise samples. Hence, we propose an iterative abstraction scheme, to iteratively improve these intervals by using increasing sample sizes. For the resulting iMDP, we compute a robust policy that maximizes the probability to safely reach the goal states. Based on a pre-defined threshold, we decide whether this probability is unsatisfactory or satisfactory. In the former case, we collect additional samples to reduce the uncertainty in the probability intervals. If the probability is satisfactory, we use the policy to compute a controller for the dynamical system. The specified confidence level reflects the likelihood that the optimal reachability probability on the iMDP is a lower bound for the probability that the dynamical system satisfies the reach-avoid problem under this derived controller.

Contributions. Our contributions are threefold: (1) We propose a novel method to compute safe controllers for dynamical systems with unknown noise distributions. Specifically, the probability of safely reaching a target is guaranteed, with high confidence, to exceed a pre-defined threshold. (2) We propose a scalable refinement scheme that incrementally improves the iMDP abstraction by iteratively increasing the number of samples. (3) We apply our method to multiple realistic control problems, and benchmark against two other tools: StocHy and SReachTools. We demonstrate that the guarantees obtained for the iMDP abstraction carry over to the dynamical system of interest. Moreover, we show that using probability intervals instead of point estimates of probabilities yields significantly more robust results.

Related work

Reachability analysis. Verification and controller synthesis for reachability in stochastic systems is an active field of research in safety-critical engineering (Abate et al. 2008; Lavaei et al. 2021). Most approaches are based on formal abstractions (Alur et al. 2000; Lahijanian, Andersson, and Belta 2015; Soudjani and Abate 2013) or work in the continuous domain directly, e.g., using Hamilton-Jacobi reachability analysis (Bansal et al. 2017; Herbert et al. 2017) or optimization (Rosolia, Singletary, and Ames 2020). Several tools exist, such as StocHy (Cauchi and Abate 2019), ProbReach (Shmarov and Zuliani 2015) and SReachTools (Vinod, Gleason, and Oishi 2019). However, the majority of these methods require full knowledge of the models.

We break away from this literature in putting forward abstractions that do not require any knowledge of the noise distribution, via the scenario approach. It has been used for the verification of Markov decision processes (MDPs) with uncertain parameters (Cubuktepe et al. 2020), albeit only for finite-state systems. SReachTools also offers a sampling-based method, but relies on Hoeffding's inequality to obtain confidence guarantees (Sartipizadeh et al. 2019), so the noise is still assumed to be sub-Gaussian (Boucheron, Lugosi, and Massart 2013). By contrast, the scenario approach is completely distribution-free (Campi and Garatti 2018). Moreover, SReachTools is limited to problems with convex safe sets (a restrictive assumption in many problems) and its sampling-based methods can only synthesize open-loop controllers. Further related are sampling-based feedback motion planning algorithms, such as LQR-Trees. However, sampling in LQR-Trees relates to random exploration of the state space, and not to stochastic noise affecting the dynamics as in our setting (Reist, Preiswerk, and Tedrake 2016; Tedrake 2009).

Alternatives to the scenario approach. Monte Carlo methods (e.g. particle methods) can also solve stochastic reach-avoid problems (Blackmore et al. 2010; Lesser, Oishi, and Erwin 2013). These methods simulate the system via many samples of the uncertain variable (Smith 2013). Monte Carlo methods approximate stochastic problems, while our approach provides bounds with a desired confidence level.

In distributionally robust optimization (DRO), decisions are robust with respect to ambiguity sets of distributions (Esfahani and Kuhn 2018; Goh and Sim 2010; Wiesemann, Kuhn, and Sim 2014). While the scenario approach uses samples of the uncertain variable, DRO works on the domain of uncertainty directly, thus involving potentially complex ambiguity sets (Garatti and Campi 2019). Designing robust policies for iMDPs with known uncertainty sets was studied by Puggelli et al. (2013), and Wolff, Topcu, and Murray (2012). Hybrid methods between the scenario approach and robust optimization also exist (Margellos, Goulart, and Lygeros 2014).

PAC literature. The term PAC refers to obtaining, with high probability, a hypothesis that is a good approximation of some unknown phenomenon (Haussler 1990). PAC learning methods for discrete-state MDPs are developed in Brafman and Tennenholtz (2002), Fu and Topcu (2014), and Kearns and Singh (2002), and PAC statistical model checking for MDPs in Ashok, Křetínský, and Weininger (2019).

Safe learning methods. We only briefly discuss the emerging field of safe learning (Brunke et al. 2021; García and Fernández 2015). Recent works use Gaussian processes for learning-based model predictive control (Hewing, Kabzan, and Zeilinger 2020; Koller et al. 2018) or reinforcement learning with safe exploration (Berkenkamp et al. 2017), and control barrier functions to reduce model uncertainty (Taylor et al. 2020). Safe learning control concerns learning unknown, deterministic system dynamics, while imposing strong assumptions on stochasticity (Fisac et al. 2019). By contrast, our problem setting is fundamentally different: we reason about stochastic noise of a completely unknown distribution.

2 Foundations and Outline

A discrete probability distribution over a finite set X is a function prob : X → [0, 1] with \sum_{x ∈ X} prob(x) = 1. The set of all distributions over X is Dist(X), and the cardinality of a set X is |X|. A probability density function over a random variable x conditioned on y is written as p(x|y). All vectors x ∈ R^n, n ∈ N, are column vectors and denoted by bold letters. We use the term controller when referring to dynamical systems, while we use policy for (i)MDPs.

2.1 Linear dynamical systems

We consider discrete-time, continuous-state systems, where the progression of the n-dimensional state x ∈ R^n depends linearly on the current state, a control input, and a process noise term. Given a state x_k at discrete time step k ∈ N, the successor state at time k + 1 is computed as

x_{k+1} = A x_k + B u_k + q_k + w_k,    (1)

where u_k ∈ U ⊂ R^p is the control input at time k, A ∈ R^{n×n} and B ∈ R^{n×p} are appropriate matrices, q_k ∈ R^n is a deterministic disturbance, and w_k ∈ R^n is an arbitrary additive process noise term. We consider piecewise-linear feedback controllers of the form φ : R^n × N → U, which map a state x_k ∈ R^n and a time step k ∈ N to a control input u_k ∈ U. The controller may be time-dependent, because we consider control objectives with a finite time horizon.
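To make this setup concrete, the following minimal Python sketch (with illustrative names only) rolls out Eq. (1) under a given feedback controller and an externally supplied sequence of noise samples; the noise distribution itself never needs to be known, only sampled.

import numpy as np

# Minimal sketch of Eq. (1): simulate the closed-loop system under a feedback
# controller phi(x, k) and pre-drawn noise samples w_0, ..., w_{K-1}.
# The deterministic disturbance q is assumed constant here for simplicity.
def simulate(A, B, q, phi, x0, noise_samples):
    xs = [np.asarray(x0, dtype=float)]
    for k, w_k in enumerate(noise_samples):
        u_k = phi(xs[-1], k)                        # piecewise-linear feedback
        xs.append(A @ xs[-1] + B @ u_k + q + w_k)   # Eq. (1)
    return np.array(xs)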

The random variable w_k ∈ ∆ is defined on a probability space (∆, D, P), with σ-algebra D and probability measure P defined over D. We do not require the sample space ∆ and probability measure P to be known explicitly. Instead, we employ a sampling-based approach, for which it suffices to have a finite number of N independent and identically distributed (i.i.d.) samples of the random variable, and to assume that its distribution is independent of time. Due to the process noise w_k, the successor state x_{k+1} is a random variable at time k. We denote the probability density function over successor states as p_{w_k}(x_{k+1} | \hat{x}_{k+1}), where \hat{x}_{k+1} = A x_k + B u_k + q_k is its noiseless value.

Remark 1 (Restriction to linear systems). Our methods are theoretically amenable to nonlinear systems, albeit requiring more advanced 1-step reachability computations unrelated to our main contributions. Hence, we restrict ourselves to the linear system in Eq. (1). We discuss extensions to nonlinear systems in Sect. 6.

Problem statement. We consider control objectives that are expressed as (step-bounded) reach-avoid properties. A reach-avoid property ϕ^K_{x_0} is satisfied if, starting from state x_0 at time k = 0, the system reaches a desired goal region X_G ⊂ R^n within a finite time horizon of K ∈ N steps, while avoiding a critical region X_C ⊂ R^n. We write the probability of satisfying a reach-avoid property ϕ^K_{x_0} under a controller φ as Pr^φ(ϕ^K_{x_0}). We state the formal problem as follows.

Compute a controller φ for the system in Eq. (1) that, with high confidence, guarantees that Pr^φ(ϕ^K_{x_0}) ≥ η, where η ∈ [0, 1] is a pre-defined probability threshold.

2.2 Markov decision processes

A Markov decision process (MDP) is a tuple M = (S, Act, s_I, P) where S is a finite set of states, Act is a finite set of actions, s_I is the initial state, and P : S × Act ⇀ Dist(S) is the (partial) probabilistic transition function. We call (s, a, s′) with probability P(s, a)(s′) > 0 a transition. A deterministic (or pure) policy (Baier and Katoen 2008) for an MDP M is a function π : S* → Act, where S* is a sequence of states. The set of all possible policies for M is denoted by Π_M. Note that we leave out rewards for brevity, but our approach is directly amenable to expected reward properties (Baier and Katoen 2008).

A probabilistic reach-avoid property Pr^π(ϕ^K_{s_I}) for an MDP describes the probability of reaching a set of goal states S_G ⊂ S within K ∈ N steps under policy π ∈ Π, while avoiding a set of critical states S_C ⊂ S, where S_G ∩ S_C = ∅. An optimal policy π* ∈ Π_M for MDP M maximizes the reachability probability:

π* = arg max_{π ∈ Π_M} Pr^π(ϕ^K_{s_I}).    (2)

We now relax the assumption that probabilities are precisely given. An interval Markov decision process (iMDP) is a tuple M_I = (S, Act, s_I, P) where the uncertain (partial) probabilistic transition function P : S × Act × S ⇀ I is defined over intervals I = {[a, b] | a, b ∈ (0, 1] and a ≤ b}. iMDPs define sets of MDPs that vary only in their transition function. In particular, for an MDP transition function P, we write P ∈ P if for all s, s′ ∈ S and a ∈ Act we have P(s, a)(s′) ∈ P(s, a)(s′) and P(s, a) ∈ Dist(S). For iMDPs, a policy needs to be robust against all P ∈ P. We employ value iteration to compute a policy π* ∈ Π_{M_I} for iMDP M_I that maximizes the lower bound on the reachability probability Pr^π(ϕ^K_{s_I}) within horizon K:

π* = arg max_{π ∈ Π_{M_I}} Pr^π(ϕ^K_{s_I}) = arg max_{π ∈ Π_{M_I}} min_{P ∈ P} Pr^π(ϕ^K_{s_I}).    (3)

Note that deterministic policies suffice to obtain optimal values for (i)MDPs (Puterman 1994; Puggelli et al. 2013).
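For intuition, the Python sketch below shows one standard way to evaluate the maximin recursion in Eq. (3); it is a simplified stand-in, not the PRISM implementation used later, and all data structures and names (states and actions as integer indices, successor lists, interval arrays) are illustrative. The inner minimization over P ∈ P is solved greedily per state-action pair by pushing the remaining probability mass toward the lowest-valued successors, which is valid whenever the lower bounds sum to at most one and the upper bounds to at least one.

import numpy as np

def worst_case_value(values, p_low, p_high):
    # Inner minimization of Eq. (3) for one state-action pair: pick a
    # distribution within [p_low, p_high] (summing to 1) that minimizes the
    # expected successor value. Start from the lower bounds and assign the
    # remaining mass to the lowest-valued successors first.
    order = np.argsort(values)
    p = p_low.astype(float).copy()
    remaining = 1.0 - p.sum()
    for j in order:
        add = min(p_high[j] - p[j], remaining)
        p[j] += add
        remaining -= add
        if remaining <= 1e-12:
            break
    return float(p @ values)

def robust_value_iteration(n_states, enabled, succ, p_low, p_high, goal, critical, K):
    # Bounded robust value iteration: V[s] lower-bounds the probability of
    # reaching a goal state within the remaining steps, for every P in P.
    V = np.zeros(n_states)
    V[list(goal)] = 1.0
    policy = []
    for _ in range(K):
        V_new, pi_k = V.copy(), {}
        for s in range(n_states):
            if s in goal or s in critical:
                continue
            options = [(worst_case_value(V[succ[s][a]], p_low[s][a], p_high[s][a]), a)
                       for a in enabled[s]]
            if options:
                V_new[s], pi_k[s] = max(options)
        V, policy = V_new, [pi_k] + policy
    return V, policy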

2.3 Our iterative abstraction scheme

Our proposed approach is shown in Fig. 1. We choose a fixed state-space partition and confidence level up front, and select an initial number of samples N (note that extensions to variable confidence levels or partitions are straightforward). As explained in Sect. 3 and 4, we then abstract the dynamical system as an iMDP using N samples of the noise.


[Figure 1 (flowchart): linear dynamical system x_{k+1} = A x_k + B u_k + q_k + w_k → (sample and abstract, given partition R, confidence level 1 − β, and property ϕ^K_{s_I}) → abstract iMDP M_I = (S, Act, s_I, P) → (compute robust optimal policy) → guarantees on the iMDP: policy π* and Pr^{π*}(ϕ^K_{s_I}). If Pr^{π*}(ϕ^K_{s_I}) < η, increase the sample size N ← γN and repeat; if Pr^{π*}(ϕ^K_{s_I}) ≥ η, extract π*, obtain the continuous controller φ : X × N → U, and apply the controller (and terminate).]

Figure 1: Our iterative approach between abstraction and verification, where N is the number of samples used for the abstraction, and η is the threshold reachability probability.

For the iMDP, we compute an optimal policy π* that maximizes the probability of satisfying the given property, as per Eq. (3). If the maximum reachability probability under this policy is above the required threshold η, we compute the corresponding controller φ for the dynamical system, and terminate the scheme. If the maximum probability is unsatisfactory (i.e. below η), we obtain additional samples by increasing N by a fixed factor γ > 1. The updated iMDP has tighter probability intervals, but may also have more transitions. Since the states and actions of the iMDP are independent of N, they are only computed once, in the first iteration. In general we cannot guarantee a priori that the property is satisfiable up to the given value of η, so we also terminate the scheme after a fixed number of iterations, in which case no output is returned.

3 Finite-State MDP Abstraction

First, we describe how we partition the state space into a set of discrete convex regions. We then use this partition to build a finite-state abstraction of the dynamical system in Eq. (1).

3.1 State space discretization

We choose a partition R of the continuous state space R^n into a set of disjoint regions that represent a bounded portion X ⊂ R^n. In addition, we define a single absorbing region r_a, representing R^n \ X. We number the regions in R from 1 to |R|, and define a function T : R^n → {1, 2, . . . , |R|, |R| + 1} that maps a continuous state x ∈ R^n to one of the regions in partition R through the index of that region, or to |R| + 1 if x ∈ r_a = R^n \ X. Thus, the absorbing region r_a captures the event that the continuous state leaves the bounded portion of the state space over which we plan. For convenience, we also define the inverse mapping as R_i = T^{-1}(i).

We consider the regions in R to be n-dimensional bounded, convex polytopes. In particular, convex polytope R_i is the solution set of m linear inequalities parameterized by M_i ∈ R^{m×n} and b_i ∈ R^m, yielding R_i = {x ∈ R^n | M_i x ≤ b_i}.

In addition, the following assumption allows us to translate properties for the dynamical system to properties on the iMDP abstraction:

Assumption 1. The continuous goal region X_G and critical region X_C are aligned with the union of a subset of regions in R, i.e. X_G = ∪_{i∈I} R_i and X_C = ∪_{j∈J} R_j for index sets I, J ⊂ {1, 2, . . . , |R|}.
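As an illustration, the following Python sketch (hypothetical helper names) implements T for the common special case of a uniform hyper-rectangular grid over X, together with the polytopic membership test M_i x ≤ b_i.

import numpy as np

# Map a continuous state to a region index for a uniform rectangular grid
# partition of X = [lo, hi); returns |R| + 1 for the absorbing region r_a.
def region_index_T(x, lo, hi, cells_per_dim):
    x, lo, hi, cells = map(np.asarray, (x, lo, hi, cells_per_dim))
    if np.any(x < lo) or np.any(x >= hi):
        return int(np.prod(cells)) + 1                   # x ∈ r_a = R^n \ X
    idx = np.floor((x - lo) / (hi - lo) * cells).astype(int)
    return int(np.ravel_multi_index(idx, cells)) + 1     # 1-based region index

def in_region(x, M_i, b_i):
    # Membership test for a general convex polytopic region R_i = {x | M_i x <= b_i}.
    return bool(np.all(M_i @ x <= b_i))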

3.2 MDP abstraction

We formalize the dynamical system discretized under partition R as an MDP M = (S, Act, s_I, P), by defining its states, actions, and transition probabilities (cf. ensuing paragraphs). We assume the initial state s_I ∈ S is known, and we capture time constraints by the bounded reach-avoid property.

States. The set of states is S = {s_i | i = 1, . . . , |R|} ∪ {s_a}, where discrete state s_i represents all continuous states x_k for which T(x_k) = i. Then, the MDP consists of |S| = |R| + 1 states: one for every region in partition R, plus one state s_a corresponding to the absorbing region r_a. State s_a is a deadlock, meaning the only transition leads back to s_a.

Actions. Discrete actions correspond to the execution of a control input u_k ∈ U in the dynamical system in Eq. (1). We define q ∈ N MDP actions, so Act = {a_1, . . . , a_q}. Recall that the noiseless successor state of x_k is \hat{x}_{k+1} = A x_k + B u_k + q_k. Every action a_j is associated with a fixed continuous target point d_j ∈ R^n, and is defined such that its noiseless successor state \hat{x}_{k+1} = d_j. While not a restriction of our approach, we define one action for every MDP state, and choose the target point to be the center of its region.

The MDP must form a correct abstraction of the dynamical system. Thus, action a_j only exists in an MDP state s_i if, for every continuous state x_k ∈ R_i, there exists a control u_k ∈ U, such that \hat{x}_{k+1} = d_j. To impose this constraint, we define the one-step backward reachable set G(d_j):

G(d_j) = {x ∈ R^n | d_j = A x + B u_k + q_k, u_k ∈ U}.    (4)

Then, action a_j exists in state s_i if and only if R_i ⊆ G(d_j). Note that the existence of an action in an MDP state merely implies that for every continuous state in the associated region, there exists a feasible control input that induces this transition. The following assumption asserts that the regions in R can indeed be contained in the backward reachable set.

Assumption 2. The backward reachable set G(d_j) has a non-empty interior, which implies that matrix B is full row rank, i.e., rank(B) = n, where n = dim(x) in Eq. (1).
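Because G(d_j) is convex, the containment R_i ⊆ G(d_j) can be certified by checking every vertex of R_i, and each check is a small feasibility LP. The sketch below is illustrative only and assumes that U is a box u_min ≤ u ≤ u_max, that q_k is constant, and that the vertices of R_i are available.

import numpy as np
from scipy.optimize import linprog

def action_enabled(A, B, q, d_j, vertices_Ri, u_min, u_max):
    # Action a_j (target point d_j) is enabled in state s_i iff every vertex v
    # of R_i admits some u in U with d_j = A v + B u + q, i.e., R_i ⊆ G(d_j).
    for v in vertices_Ri:
        res = linprog(c=np.zeros(B.shape[1]),          # pure feasibility problem
                      A_eq=B, b_eq=d_j - A @ v - q,
                      bounds=list(zip(u_min, u_max)))
        if not res.success:
            return False
    return True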

For many systems, we may group together multiple discrete time steps in Eq. (1), such that Assumption 2 holds (see Badings et al. (2021, Sect. 6) for more details). To compute the actual control u_k in state x_k at time k, we replace x_{k+1} by d_j in Eq. (1) and solve for u_k, yielding:

u_k = B^+ (d_j − q_k − A x_k),    (5)

with B^+ the pseudoinverse of B. It is easily verified that for every state where action a_j is enabled, there exists a u_k such that Eq. (5) holds (depending on B, it may not be unique).
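In code, Eq. (5) is a single line; the sketch below uses illustrative names, and whether the resulting u_k lies in U still has to be verified separately, e.g., with the feasibility check above.

import numpy as np

def control_input(A, B, q_k, x_k, d_j):
    # Eq. (5): input that steers the noiseless successor state to d_j.
    return np.linalg.pinv(B) @ (d_j - q_k - A @ x_k)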

Transition probability intervals. We want to determine the probability P(s_i, a_l)(s_j) to transition from state s_i to state s_j upon choosing action a_l. In the abstraction, this is equivalent to computing the cumulative density function of the distribution over the successor state x_{k+1} under the polytope R_j associated with state s_j. The probability density function p_{w_k}(x_{k+1} | \hat{x}_{k+1} = d_j) captures the distribution over successor states x_{k+1}, which depends on the process noise w_k. By denoting P_{w_k}(x_{k+1} ∈ R_j) as the probability that x_{k+1} takes a value in discrete region R_j, we write:

P(s_i, a_l)(s_j) = P_{w_k}(x_{k+1} ∈ R_j) = \int_{R_j} p_{w_k}(x_{k+1} | \hat{x}_{k+1} = d_j) dx_{k+1}.    (6)

Recall that the probability density function p_{w_k}(·) is unknown, making a direct evaluation of Eq. (6) impossible. Instead, we use a sampling-based approach to compute probability intervals as explained in Sect. 4.

4 Sampling-Based Probability Intervals

We introduce a sampling-based method to estimate the transition probabilities in Eq. (6), based on a finite set of N observations w_k^{(i)} ∈ ∆, i = 1, . . . , N of the process noise. Each sample has a unique index i = 1, . . . , N and is associated with a possible successor state x_{k+1} = \hat{x}_{k+1} + w_k^{(i)}. We assume that these samples are available from experimental data or simulations, and are thus obtained at a low cost.

Assumption 3. The noise samples w_k^{(i)} ∈ ∆, i = 1, . . . , N are i.i.d. elements from (∆, P), and are independent of time.

Due to the samples being i.i.d., the set {w_k^{(1)}, . . . , w_k^{(N)}} of N samples is a random element from the probability space ∆^N equipped with the product probability P^N.

As an example, we want to evaluate the probability P(s_i, a_l)(s_j) that state-action pair (s_i, a_l) induces a transition to state s_j. A naive frequentist approach to approximate the probability would be to determine the fraction of the samples leading to this transition, using the following definition.

Definition 1. The cardinality N_j^{in} ∈ {0, . . . , N} of the index set of the samples leading to x_{k+1} ∈ R_j is defined as

N_j^{in} = |{i ∈ {1, . . . , N} | (\hat{x}_{k+1} + w_k^{(i)}) ∈ R_j}|.    (7)

Similarly, we define N_j^{out} = N − N_j^{in} as the number of samples for which \hat{x}_{k+1} + w_k^{(i)} is not contained in R_j.

Note that N_j^{in} and N_j^{out} depend on both the sample set and the action. The frequentist approach is simple, but may lead to estimates that deviate critically from their true values if the number of samples is limited (we illustrate this issue in practice in Sect. 5.1). In what follows, we introduce our method to be robust against such estimation errors.
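Computing N_j^{in} and N_j^{out} from Def. 1 is a simple vectorized membership test; a minimal sketch (illustrative names) for a polytopic region R_j = {x | M_j x ≤ b_j}:

import numpy as np

def count_samples(x_hat, noise_samples, M_j, b_j):
    # Def. 1: count sampled successor states x_hat + w^{(i)} inside R_j.
    successors = x_hat + noise_samples               # shape (N, n)
    inside = np.all(successors @ M_j.T <= b_j, axis=1)
    N_in = int(inside.sum())
    return N_in, len(noise_samples) - N_in           # (N_j^in, N_j^out)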

4.1 Bounds for the transition probabilities

We adapt methods from the scenario approach (Campi and Garatti 2018) to compute intervals of probabilities instead of precise estimates. For every probability P(s_i, a_l)(s_j), we compute an upper and lower bound (i.e. an interval) that contains the true probability in Eq. (6) with a high confidence. We formalize the resulting abstraction as an iMDP, where these probability intervals enter the uncertain transition function P : S × Act × S ⇀ I. As the intervals are PAC, this iMDP is a robust abstraction of the dynamical system.

First, we introduce the concept of risk (or violation probability), which is a measure of the probability that a successor state is not in a given region (Campi and Garatti 2008).

Definition 2. The risk P_{w_k}(x_{k+1} ∉ R_j) that a successor state x_{k+1} is not in region R_j is

P_{w_k}(x_{k+1} ∉ R_j) = P{w_k ∈ ∆ : \hat{x}_{k+1} + w_k ∉ R_j} = 1 − P_{w_k}(x_{k+1} ∈ R_j).    (8)

Crucially for our approach, note that P(s_i, a_l)(s_j) = P_{w_k}(x_{k+1} ∈ R_j). The scenario approach enables us to bound the risk that the optimal point of a so-called scenario optimization problem does not belong to a feasible set R defined by a set of constraints when we are only able to sample a subset of those constraints. By formulating this optimization problem such that R is closely related to a region R_j, we obtain upper and lower bounds on the risk over R_j, and thus also on the corresponding transition probability (see Appendix A for details). Importantly, this result means that we can adapt the theory from the scenario approach to compute transition probability intervals for our abstractions.

Based on this intuition, we state the main contribution of this section, as a non-trivial variant of Romao, Margellos, and Papachristodoulou (2020, Theorem 5), adapted for our context. Specifically, for a given transition (s_i, a_l, s_j) and the corresponding number of samples N_j^{out} outside of region R_j (as per Def. 1), Theorem 1 returns an interval that contains P(s_i, a_l)(s_j) with at least a pre-defined confidence level.

Theorem 1 (PAC probability intervals). For N ∈ N samples of the noise, fix a confidence parameter β ∈ (0, 1). Given N_j^{out}, the transition probability P(s_i, a_l)(s_j) is bounded by

P^N { \underline{p} ≤ P(s_i, a_l)(s_j) ≤ \overline{p} } ≥ 1 − β,    (9)

where \underline{p} = 0 if N_j^{out} = N, and otherwise \underline{p} is the solution of

β / (2N) = \sum_{i=0}^{N_j^{out}} \binom{N}{i} (1 − \underline{p})^i \underline{p}^{N−i},    (10)

and \overline{p} = 1 if N_j^{out} = 0, and otherwise \overline{p} is the solution of

β / (2N) = 1 − \sum_{i=0}^{N_j^{out} − 1} \binom{N}{i} (1 − \overline{p})^i \overline{p}^{N−i}.    (11)

For brevity, we postpone the proof and technical details to Appendix A. Theorem 1 states that with a probability of at least 1 − β, the probability P(s_i, a_l)(s_j) is bounded by the obtained interval. Importantly, this claim holds for any ∆ and P, meaning that we can bound the probability in Eq. (6), even when the probability distribution of the noise is unknown.

4.2 Practical use of Theorem 1

We describe how we apply Theorem 1 to compute probability intervals of the iMDPs. For every state-action pair (s_i, a_l), we obtain N samples of the noise, and we determine N_j^{out} for every j ∈ {1, . . . , |R|, |R| + 1}. Then, we invoke Theorem 1 for every possible successor state s_j ∈ S, to compute the bounds on P(s_i, a_l)(s_j). Fig. 2 shows this process, where every tick is a successor state x_{k+1} under a sample of the noise. This figure also shows point estimates of the probabilities, derived using the frequentist approach. If no samples are observed in a region, we assume that P(s_i, a_l)(s_j) = 0.


[Figure 2 (plot of the density p_{w_k}(x_{k+1} | \hat{x}_{k+1} = d) over three regions R_1, R_2, R_3, with per-region counts and intervals: N_j^{in} = 34, 18, 42; N_j^{out} = 66, 82, 58; [\underline{p}, \overline{p}] = [0.174, 0.538], [0.063, 0.363], [0.239, 0.617]; N_j^{in}/N = 0.34, 0.18, 0.42.]

Figure 2: Bounds [\underline{p}, \overline{p}] on the probabilities P(s, a)(s_j) for 3 regions j ∈ {1, 2, 3}, using N = 100 samples (black ticks) and β = 0.01. The distribution over successor states is p_{w_k}(·). Point estimate probabilities are computed as N_j^{in}/N.

Algorithm 1: Sampling-based iMDP abstraction.
Input: Linear dynamical system; property ϕ^K_{s_I} (threshold η)
Params: Partition R; confidence level β; increment factor γ
Output: Controller φ

1:  Define iMDP states S and set of enabled actions Act
2:  Let initial number of samples N = N_0
3:  Let iteration limit z_max and z = 0
4:  while Pr^{π*}(ϕ^K_{s_I}) < η and z < z_max do
5:    for all actions a in Act do
6:      for all successor states s in S do
7:        Compute PAC interval on probability P(·, a)(s)
8:      end for
9:    end for
10:   Generate iMDP M_I = (S, Act, s_I, P) for N samples
11:   Compute π* and Pr^{π*}(ϕ^K_{s_I}) on M_I using PRISM
12:   Let N = γN and z = z + 1
13: end while
14: return piecewise-linear controller φ based on π*

Interestingly, our method is in practice almost as simple as the frequentist approach, but has the notable advantage that we obtain robust intervals of probabilities. Note that Eq. (10) and (11) are cumulative distribution functions of a beta distribution with parameters N_j^{out} + 1 (or N_j^{out}) and N − N_j^{out} (or N − N_j^{out} − 1), respectively (Campi and Garatti 2018), which can directly be solved numerically for \underline{p} (or \overline{p}). To speed up the computations at run-time, we apply a tabular approach to compute the intervals for all relevant values of N, β, and k up front. In Appendix A.2, we exemplify how the number of samples controls the tightness of the intervals.
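One way to solve Eqs. (10) and (11) numerically is through the inverse CDF of the corresponding beta distributions. The Python sketch below (using SciPy, with illustrative names) is a minimal stand-in for the tabular pre-computation described above, and reproduces the intervals of Fig. 2 up to rounding.

from scipy.stats import beta

def pac_interval(N, N_out, beta_conf):
    # PAC bounds of Theorem 1 for one transition, given N noise samples,
    # N_out samples outside the region, and confidence parameter beta_conf.
    if N_out == N:
        p_low = 0.0
    else:
        p_low = beta.ppf(beta_conf / (2 * N), N - N_out, N_out + 1)      # Eq. (10)
    if N_out == 0:
        p_upp = 1.0
    else:
        p_upp = beta.ppf(1 - beta_conf / (2 * N), N - N_out + 1, N_out)  # Eq. (11)
    return p_low, p_upp

# Example from Fig. 2: pac_interval(100, 66, 0.01) gives roughly (0.17, 0.54).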

5 Numerical Examples

We implement our iterative abstraction method in Python, and tailor the model checker PRISM (Kwiatkowska, Norman, and Parker 2011) for iMDPs to compute robust optimal policies. We present pseudocode of our method in Algorithm 1.

Figure 3: UAV problem (goal in green; obstacles in red), plus trajectories under the optimal iMDP-based controller from x_0 = [−14, 0, 6, 0, −6, 0]^⊤, under high and low turbulence.

At every iteration, the obtained iMDP is fed to PRISM, which computes the optimal policy associated with the maximum reachability probability, as per Eq. (3). Our code is available via https://gitlab.science.ru.nl/tbadings/sample-abstract, and all experiments are run on a computer with 32 cores at 3.7 GHz and 64 GB of RAM. We report the performance of our method on: (1) a UAV motion control problem, (2) a building temperature regulation problem, and (3) a spacecraft rendezvous problem. In all benchmarks, we use Theorem 1 with β = 0.01, and apply the iterative scheme with γ = 2, starting at N = 25, with an upper bound of 12,800 samples.

5.1 UAV motion planning

We consider the reach-avoid problem for a UAV operating under turbulence, which was introduced in Sect. 1. Our goal is to compute a controller that guarantees (with high confidence) that the probability to reach a goal area, while also avoiding unsafe regions, is above a performance threshold of η = 0.75. We consider a horizon of 64 time steps, and the problem layout is displayed in Fig. 3, with goal and critical regions shown in green and red, respectively. We model the UAV as a system of 3 double integrators (see Appendix B for details). The state x_k ∈ R^6 encodes the position and velocity components, and control inputs u_k ∈ R^3 model actuators that change the velocity. The effect of turbulence on the state causes (non-Gaussian) process noise, which we model using a Dryden gust model (Bøhn et al. 2019; Dryden 1943). We compare two cases: (1) a low turbulence case, and (2) a high turbulence case. We partition the state space into 25,515 regions.
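For illustration only (the exact matrices are specified in Appendix B, not here), such a model of three independent double integrators could be written as follows, assuming a sampling time τ and a per-axis state ordering [position, velocity]:

import numpy as np
from scipy.linalg import block_diag

tau = 1.0                                   # illustrative sampling time
A_axis = np.array([[1.0, tau],
                   [0.0, 1.0]])             # double integrator per axis
B_axis = np.array([[0.5 * tau ** 2],
                   [tau]])
A = block_diag(A_axis, A_axis, A_axis)      # 6 x 6 state matrix
B = block_diag(B_axis, B_axis, B_axis)      # 6 x 3 input matrix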

Scalability. We report the model sizes and run times in Appendix B.2. The number of iMDP states equals the size of the partition. Depending on the number of samples N, the iMDP has 9–24 million transitions. The mean time to compute the set of iMDP actions (which is only done in the first iteration) is 15 min. Computing the probabilities plus the verification in PRISM takes 1–8 min, depending on the number of samples N.

[Figure 4 (plot): reachability probability (0–1) versus number of samples N (25 to 2,500); curves: Guarantees iMDP, Empirical iMDP, Guarantees MDP, Empirical MDP.]

Figure 4: Reachability guarantees on the iMDPs (blue) and MDPs (orange) for their respective policies, versus the resulting empirical (simulated) performance (dashed lines) on the dynamical system. Shaded areas show the standard deviation across 10 iterations. The empirical performance of the MDPs violates the guarantees; that of the iMDPs does not.

Accounting for noise matters. In Fig. 3, we show state trajectories under the optimal iMDP-based controller, under high and low turbulence (noise). Under low noise, the controller prefers the short but narrow path; under high noise, the longer but safer path is preferred. Thus, accounting for process noise is important to obtain controllers that are safe.

iMDPs yield safer guarantees than MDPs. To show the importance of using robust abstractions, we compare, under high turbulence, our robust iMDP approach against a naive MDP abstraction. This MDP has the same states and actions as the iMDP, but uses precise (frequentist) probabilities. The maximum reachability probabilities (guarantees) for both methods are shown in Fig. 4. For every value of N, we apply the resulting controllers to the dynamical system in Monte Carlo simulations with 10,000 iterations, to determine the empirical reachability probability. Fig. 4 shows that the non-robust MDPs yield poor and unsafe performance guarantees: the actual reachability of the controller is much lower than the reachability guarantees obtained from PRISM. By contrast, our robust iMDP-based approach consistently yields safe lower bound guarantees on the actual performance of controllers. The performance threshold of Pr^{π*}(ϕ^K_{s_I}) ≥ 0.75 is guaranteed for N = 3,200 samples and higher.

5.2 Building temperature regulation

Inspired by Cauchi and Abate (2018), we consider a temperature control problem for a building with two rooms, both having their own radiator and air supply. Our goal is to maximize the probability to reach a temperature of 20 °C in both zones within 32 steps of 15 minutes. The state x_k ∈ R^4 of the system (see Appendix C for details) reflects the temperatures of both zones and radiators, and control inputs u_k ∈ R^4 change the air supply and boiler temperatures in both zones. The deterministic heat gain through zone walls is modeled by the disturbance q_k ∈ R^4. The noise w_k ∈ R^4 has a Gaussian distribution (but this assumption is not required for our approach). We partition the state space into 35,721 regions: 21 values for zone temperatures and 9 for radiator temperatures.

[Figure 5 (two heat maps, N = 50 and N = 800): temp. zone 1 (°C) versus temp. zone 2 (°C), both from 18 to 22 °C, with a color scale from 0.0 to 1.0.]

Figure 5: Cross section (for radiator temp. of 38 °C) of the maximum lower bound probabilities to reach the goal of 20 °C from any initial state, for either 50 or 800 samples.

[Figure 6 (two heat maps: our approach with N = 12,800, and StocHy): zone temp. (°C, 19.2–22.8) versus radiator temp. (°C, 36.5–39.5), with a color scale from 0.0 to 1.0.]

Figure 6: Maximum lower bound probabilities to reach the goal zone temperature of 21 °C from any initial state within 64 steps, for our approach (N = 12,800) and StocHy.

More samples means less uncertainty. In Fig. 5, we show (for fixed radiator temperatures) the maximum lower bound probabilities obtained from PRISM, to reach the goal from any initial state. The results clearly show that a higher value of N reduces the uncertainty in the probability intervals, leading to better reachability guarantees. Notably, the largest iMDP has around 200 million transitions, as reported in Appendix C.2.

5.3 Benchmarks to other control synthesis tools

StocHy. We benchmark our method on a building temperature problem against StocHy (Cauchi and Abate 2019), a verification and synthesis tool based on formal abstractions (see Appendix D.1 for details on the setup and results). Similar to our approach, StocHy also derives robust iMDP abstractions. However, StocHy requires precise knowledge of the noise distribution, and it discretizes the control input space of the dynamical system, to obtain a finite action space. The maximum probabilities to reach the goal zone temperature from any initial state obtained for both methods are presented in Fig. 6.


[Figure 7 (two panels: low and high noise covariance, N = 1,600): simulated state trajectories in the plane, with axes roughly from −0.8 to 0.2.]

Figure 7: Simulated state trajectories for the spacecraft rendezvous problem, under low and high noise covariance. Our feedback controllers are more robust, as shown by the smaller error in the state trajectories over time (the Voronoi method under high covariance failed to generate a solution).

The obtained results are qualitatively similar; close to the goal temperature, our lower bound reachability guarantees are slightly higher than those obtained from StocHy. However, when starting at temperatures close to the boundary (e.g. at both low radiator and zone temperature), the guarantees obtained from our approach are slightly more conservative. This is due to the fact that our approach relies on PAC guarantees on the transition probabilities, while StocHy gives direct probabilistic outcomes. While both methods yield results that are qualitatively similar, our approach is an order of magnitude faster (45 min for StocHy vs. 3–9 s for our approach; see Appendix D.1 and Table 1 for details).

SReachTools. We apply our method to the spacecraft rendezvous benchmark (see Fig. 7) of SReachTools (Vinod, Gleason, and Oishi 2019), an optimization-based toolbox for probabilistic reachability problems (see Appendix D.2 for details). While we use samples to generate a model abstraction, SReachTools employs sample-based methods over the properties directly. Distinctively, SReachTools does not create abstractions (as in our case) and is thus generally faster than our method. However, its complexity is exponential in the number of samples (versus linear complexity for our method). Importantly, we derive feedback controllers, while the sampling-based methods of SReachTools compute open-loop controllers. Feedback controllers respond to state observations over time and are, therefore, more robust against strong disturbances from noise, as also shown in Fig. 7.

6 Concluding Remarks and Future Work

We have presented a novel sampling-based method for robust control of autonomous systems with process noise of unknown distribution. Based on a finite-state abstraction, we have shown how to compute controllers with PAC guarantees on the performance of the continuous system. Our experiments have shown that our method effectively solves realistic problems and provides safe lower bound guarantees on the performance of controllers.

Nonlinear systems. While we have focused on linear dynamical systems, as discussed in Remark 1, we wish to develop extensions to nonlinear systems. Such extensions are non-trivial and may require more involved reachability computations (Bansal et al. 2017; Chen, Ábrahám, and Sankaranarayanan 2013). Specifically, the main challenge is to compute the enabled iMDP actions via the backward reachable set defined in Eq. (4), which may become non-convex under nonlinear dynamics. Note that computing the PAC probability intervals remains unchanged, as the scenario approach relies on the convexity of the target set only, and not on that of the backward reachable set. Alternatively, we may apply our method on a linearized version of the nonlinear system. However, in order to preserve guarantees, one must then account for any linearization error.

State space discretization. The discretization of the state space influences the quality of the reachability guarantees: a more fine-grained partition yields an abstraction that is a more accurate representation of the dynamical system, but also increases the computational complexity. In the future, we plan to employ adaptive discretization schemes to automatically balance this trade-off, such as in Soudjani and Abate (2013).

Safe exploration. Finally, we wish to incorporate other uncertainties in Eq. (1), such as state/control-dependent process noise, or measurement noise. Moreover, we may drop the assumption that the system matrices are precisely known, such that we must simultaneously learn about the unknown deterministic dynamics and the stochastic noise. Learning deterministic dynamics is common in safe learning control (Brunke et al. 2021), but enabling safe exploration requires strong assumptions on stochastic uncertainty. This is a challenging goal, as it conflicts with our assumption that the distribution of the process noise is completely unknown.

Acknowledgments

We would like to thank Licio Romao for his helpful discussions related to the scenario approach and our main theorem.

ReferencesAbate, A.; Prandini, M.; Lygeros, J.; and Sastry, S. 2008.Probabilistic reachability and safety for controlled discretetime stochastic hybrid systems. Automatica, 44(11): 2724 –2734.Alur, R.; Henzinger, T. A.; Lafferriere, G.; and Pappas, G. J.2000. Discrete abstractions of hybrid systems. Proceedingsof the IEEE, 88(7): 971–984.Anderson, B. D.; and Moore, J. B. 2007. Optimal control:linear quadratic methods. Courier Corporation.Ashok, P.; Kretınsky, J.; and Weininger, M. 2019. PACStatistical Model Checking for Markov Decision Processesand Stochastic Games. In CAV (1), volume 11561 of LectureNotes in Computer Science, 497–519. Springer.Astrom, K. J.; and Murray, R. M. 2010. Feedback systems: anintroduction for scientists and engineers. Princeton universitypress.

Page 9: Sampling-Based Robust Control of Autonomous Systems with ...

Badings, T. S.; Jansen, N.; Poonawala, H. A.; and Stoelinga,M. 2021. Filter-Based Abstractions with Correctness Guaran-tees for Planning under Uncertainty. CoRR, abs/2103.02398.Baier, C.; and Katoen, J. 2008. Principles of model checking.MIT Press.Bansal, S.; Chen, M.; Herbert, S. L.; and Tomlin, C. J. 2017.Hamilton-Jacobi reachability: A brief overview and recentadvances. In CDC, 2242–2253. IEEE.Berkenkamp, F.; Turchetta, M.; Schoellig, A. P.; and Krause,A. 2017. Safe Model-based Reinforcement Learning withStability Guarantees. In NIPS, 908–918.Blackmore, L.; Ono, M.; Bektassov, A.; and Williams, B. C.2010. A Probabilistic Particle-Control Approximation ofChance-Constrained Stochastic Predictive Control. IEEETrans. Robotics, 26(3): 502–517.Bøhn, E.; Coates, E. M.; Moe, S.; and Johansen, T. A. 2019.Deep Reinforcement Learning Attitude Control of Fixed-Wing UAVs Using Proximal Policy optimization. In ICUAS,523–533. IEEE.Boucheron, S.; Lugosi, G.; and Massart, P. 2013. Concentra-tion Inequalities - A Nonasymptotic Theory of Independence.Oxford University Press.Boyd, S. P.; and Vandenberghe, L. 2014. Convex Optimiza-tion. Cambridge University Press.Brafman, R. I.; and Tennenholtz, M. 2002. R-MAX - AGeneral Polynomial Time Algorithm for Near-Optimal Rein-forcement Learning. J. Mach. Learn. Res., 3: 213–231.Brunke, L.; Greeff, M.; Hall, A. W.; Yuan, Z.; Zhou, S.;Panerati, J.; and Schoellig, A. P. 2021. Safe Learning inRobotics: From Learning-Based Control to Safe Reinforce-ment Learning. CoRR, abs/2108.06266.Campi, M. C.; and Garatti, S. 2008. The Exact Feasibility ofRandomized Solutions of Uncertain Convex Programs. SIAMJ. Optim., 19(3): 1211–1230.Campi, M. C.; and Garatti, S. 2018. Introduction to thescenario approach. SIAM.Cauchi, N.; and Abate, A. 2018. Benchmarks for cyber-physical systems: A modular model library for build-ing automation systems (Extended version). CoRR,abs/1803.06315.Cauchi, N.; and Abate, A. 2019. StocHy: Automated Ver-ification and Synthesis of Stochastic Processes. In TACAS(2), volume 11428 of Lecture Notes in Computer Science,247–264. Springer.Chen, X.; Abraham, E.; and Sankaranarayanan, S. 2013.Flow*: An Analyzer for Non-linear Hybrid Systems. InCAV, volume 8044 of Lecture Notes in Computer Science,258–263. Springer.Clarke, E. M.; Emerson, E. A.; and Sistla, A. P. 1986. Auto-matic Verification of Finite-State Concurrent Systems UsingTemporal Logic Specifications. ACM Trans. Program. Lang.Syst., 8(2): 244–263.Cubuktepe, M.; Jansen, N.; Junges, S.; Katoen, J.; and Topcu,U. 2020. Scenario-Based Verification of Uncertain MDPs.In TACAS (1), volume 12078 of Lecture Notes in ComputerScience, 287–305. Springer.

Dryden, H. L. 1943. A review of the statistical theory ofturbulence. Quarterly of Applied Mathematics, 1(1): 7–42.Esfahani, P. M.; and Kuhn, D. 2018. Data-driven distribu-tionally robust optimization using the Wasserstein metric:performance guarantees and tractable reformulations. Math.Program., 171(1-2): 115–166.Fisac, J. F.; Akametalu, A. K.; Zeilinger, M. N.; Kaynama,S.; Gillula, J. H.; and Tomlin, C. J. 2019. A General SafetyFramework for Learning-Based Control in Uncertain RoboticSystems. IEEE Trans. Autom. Control., 64(7): 2737–2752.Fu, J.; and Topcu, U. 2014. Probably Approximately CorrectMDP Learning and Control With Temporal Logic Constraints.In Robotics: Science and Systems.Garatti, S.; and Campi, M. 2019. Risk and complexity inscenario optimization. Mathematical Programming, 1–37.Garcıa, J.; and Fernandez, F. 2015. A comprehensive surveyon safe reinforcement learning. J. Mach. Learn. Res., 16:1437–1480.Givan, R.; Leach, S. M.; and Dean, T. L. 2000. Bounded-parameter Markov decision processes. Artif. Intell., 122(1-2):71–109.Goh, J.; and Sim, M. 2010. Distributionally robust optimiza-tion and its tractable approximations. Operations research,58(4-part-1): 902–917.Hahn, E. M.; Hashemi, V.; Hermanns, H.; Lahijanian, M.;and Turrini, A. 2017. Multi-objective Robust Strategy Syn-thesis for Interval Markov Decision Processes. In QEST,volume 10503 of Lecture Notes in Computer Science, 207–223. Springer.Haussler, D. 1990. Probably Approximately Correct Learn-ing. In AAAI, 1101–1108. AAAI Press / The MIT Press.Herbert, S. L.; Chen, M.; Han, S.; Bansal, S.; Fisac, J. F.; andTomlin, C. J. 2017. FaSTrack: A modular framework for fastand guaranteed safe motion planning. In CDC, 1517–1522.IEEE.Hewing, L.; Kabzan, J.; and Zeilinger, M. N. 2020. CautiousModel Predictive Control Using Gaussian Process Regres-sion. IEEE Trans. Control. Syst. Technol., 28(6): 2736–2743.Kearns, M. J.; and Singh, S. P. 2002. Near-Optimal Reinforce-ment Learning in Polynomial Time. Mach. Learn., 49(2-3):209–232.Koller, T.; Berkenkamp, F.; Turchetta, M.; and Krause, A.2018. Learning-Based Model Predictive Control for SafeExploration. In CDC, 6059–6066. IEEE.Kulakowski, B. T.; Gardner, J. F.; and Shearer, J. L. 2007.Dynamic modeling and control of engineering systems. Cam-bridge University Press.Kwiatkowska, M. Z.; Norman, G.; and Parker, D. 2011.PRISM 4.0: Verification of Probabilistic Real-Time Systems.In CAV, volume 6806 of Lecture Notes in Computer Science,585–591. Springer.Lahijanian, M.; Andersson, S. B.; and Belta, C. 2015. For-mal Verification and Synthesis for Discrete-Time StochasticSystems. IEEE Trans. Autom. Control., 60(8): 2031–2045.

Page 10: Sampling-Based Robust Control of Autonomous Systems with ...

Lavaei, A.; Soudjani, S.; Abate, A.; and Zamani, M. 2021.Automated verification and synthesis of stochastic hybridsystems: A survey. arXiv preprint arXiv:2101.07491.Lesser, K.; Oishi, M. M. K.; and Erwin, R. S. 2013. Stochasticreachability for control of spacecraft relative motion. In CDC,4705–4712. IEEE.Margellos, K.; Goulart, P.; and Lygeros, J. 2014. On the roadbetween robust optimization and the scenario approach forchance constrained optimization problems. IEEE Transac-tions on Automatic Control, 59(8): 2258–2263.Park, S.; Serpedin, E.; and Qaraqe, K. A. 2013. Gaussian As-sumption: The Least Favorable but the Most Useful [LectureNotes]. IEEE Signal Process. Mag., 30(3): 183–186.Puggelli, A.; Li, W.; Sangiovanni-Vincentelli, A. L.; andSeshia, S. A. 2013. Polynomial-Time Verification of PCTLProperties of MDPs with Convex Uncertainties. In CAV,volume 8044 of Lecture Notes in Computer Science, 527–542. Springer.Puterman, M. L. 1994. Markov Decision Processes: Dis-crete Stochastic Dynamic Programming. Wiley Series inProbability and Statistics. Wiley.Reist, P.; Preiswerk, P.; and Tedrake, R. 2016. Feedback-motion-planning with simulation-based LQR-trees. Int. J.Robotics Res., 35(11): 1393–1416.Romao, L.; Margellos, K.; and Papachristodoulou, A. 2020.Tight generalization guarantees for the sampling and discard-ing approach to scenario optimization. In CDC, 2228–2233.IEEE.Rosolia, U.; Singletary, A.; and Ames, A. D. 2020. UnifiedMulti-Rate Control: from Low Level Actuation to High LevelPlanning. CoRR, abs/2012.06558.Sartipizadeh, H.; Vinod, A. P.; Acikmese, B.; and Oishi, M.2019. Voronoi Partition-based Scenario Reduction for FastSampling-based Stochastic Reachability Computation of Lin-ear Systems. In ACC, 37–44. IEEE.Shmarov, F.; and Zuliani, P. 2015. ProbReach: verified prob-abilistic delta-reachability for stochastic hybrid systems. InHSCC, 134–139. ACM.Smith, A. 2013. Sequential Monte Carlo methods in practice.Springer Science & Business Media.Soudjani, S. E. Z.; and Abate, A. 2013. Adaptive and Sequen-tial Gridding Procedures for the Abstraction and Verificationof Stochastic Processes. SIAM J. Appl. Dyn. Syst., 12(2):921–956.Taylor, A. J.; Singletary, A.; Yue, Y.; and Ames, A. D. 2020.Learning for Safety-Critical Control with Control BarrierFunctions. In L4DC, volume 120 of Proceedings of MachineLearning Research, 708–717. PMLR.Tedrake, R. 2009. LQR-trees: Feedback motion planning onsparse randomized trees. In Robotics: Science and Systems.The MIT Press.Vinod, A. P.; Gleason, J. D.; and Oishi, M. M. K. 2019.SReachTools: a MATLAB stochastic reachability toolbox. InHSCC, 33–38. ACM.

Wiesemann, W.; Kuhn, D.; and Sim, M. 2014. Distributionally Robust Convex Optimization. Oper. Res., 62(6): 1358–1376.
Wolff, E. M.; Topcu, U.; and Murray, R. M. 2012. Robust control of uncertain Markov Decision Processes with temporal logic specifications. In CDC, 3372–3379. IEEE.


Figure 8: Polytope R_j has a Chebyshev center h_j (note that it is not unique, since the circle can be shifted while remaining within the polytope). Polytope R_j(1.2) is scaled by a factor λ = 1.2 and is computed using Eq. (12).

A Proofs and Examples

In this appendix, we provide the proof of Theorem 1, and we exemplify how the number of noise samples affects the tightness of the obtained intervals. Before stating the actual proof, we introduce a number of technical details.

Scaled polytopes. First, let us define R_j(λ) as a version of polytope R_j that is scaled by a factor λ ≥ 0 relative to a so-called Chebyshev center h_j ∈ R^n (Boyd and Vandenberghe 2014). The resulting scaled polytope is

R_j(\lambda) = \{ x \in \mathbb{R}^n \mid M_j x \leq \lambda (b_j - M_j h_j) + M_j h_j \}. (12)

Note that R_j(1) = R_j, and that scaling relative to h_j ensures that R_j(λ1) ⊂ R_j(λ2) for every 0 ≤ λ1 < λ2. A visualization of the scaling for an arbitrary polytope R_j is shown in Fig. 8 (in this example, the Chebyshev center is not unique, since the circle can be shifted while remaining within R_j).

Scenario optimization problem. We use the scenario approach to estimate the risk associated with the solution of the following convex scenario optimization problem with discarded samples, which has a scalar decision variable λ:

\min_{\lambda \in \mathbb{R}_+} \ \lambda \quad \text{subject to} \quad x_{k+1} + w_k^{(i)} \in R_j(\lambda) \quad \forall i \in \{1, \ldots, N\} \setminus Q, (13)

whose optimal solution λ*_{|Q|} parameterizes polytope R_j(λ*_{|Q|}) through Eq. (12). We explicitly write the dependency of the solution on the cardinality |Q| of the index set Q, which accounts for a subset of samples whose constraints have been discarded based on the following rule:

Lemma 1. The sample removal set Q ⊆ {1, . . . , N} is obtained by iteratively removing the active constraints from Eq. (13). Thus, given N samples and any two removal sets with cardinalities |Q1| < |Q2|, it holds that Q1 ⊂ Q2. Moreover, any discarded sample i ∈ Q violates the solution λ*_{|Q|} to Eq. (13), i.e., x_{k+1} + w_k^{(i)} ∉ R_j(λ*_{|Q|}), with probability one.

Figure 9: Bounding region R_j = [−1, 1] using N = 10 successor state samples (N^out_j = 5). Discarding N^out_j = 5 samples defines the red region R_j(λ*_{N^out_j}) ⊆ R_j, while discarding one less sample defines the blue region R_j(λ*_{N^out_j − 1}) ⊃ R_j.

Intuitively, the successor state of a sample associated with an active constraint is on the boundary of the optimal solution R_j(λ*_{|Q|}) of Eq. (13). Under the following non-accumulation assumption, an active constraint exists and is unique, as long as |Q| < N (i.e., not all samples have been discarded yet).

Assumption 4. Given a noise sample w_k ∈ ∆, the probability that the successor state x_{k+1} is on the boundary of any polytope R_j(λ), for any λ ∈ R_+, is zero.

Assumption 4 implies that the solution to Eq. (13) is unique, and that the number of active constraints equals one (given N > 0 and |Q| < N), as samples accumulate on the boundary of the polytope with probability zero.
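Because the decision variable λ is scalar, Eq. (13) can be solved without a generic convex solver: every sample induces a minimal scaling factor, and the optimum after discarding |Q| samples is the largest factor among the samples that remain. The sketch below is our own illustration of this idea under the parameterization of Eq. (12); the function name and interface are hypothetical and not part of the authors' implementation.

```python
import numpy as np

def solve_scenario_program(successors, M, b, h, n_discard):
    """Illustrative solver for Eq. (13) with a scalar decision variable.

    successors: (N, n) array of successor-state samples x_{k+1} + w_k^(i).
    M, b, h:    polytope R_j = {x | M x <= b} and a Chebyshev center h of R_j.
    n_discard:  number |Q| of samples to discard as per Lemma 1.

    Assumes h lies strictly inside R_j, so that b - M h > 0 componentwise.
    """
    slack = b - M @ h
    # Minimal lambda >= 0 for which each individual sample lies in R_j(lambda),
    # following the scaling of Eq. (12).
    lam_per_sample = np.maximum(0.0, np.max((successors @ M.T - M @ h) / slack, axis=1))
    if n_discard >= len(lam_per_sample):
        return 0.0  # all samples discarded (special case |Q| = N)
    # Iteratively removing the active constraint (Lemma 1) amounts to dropping
    # the n_discard largest per-sample scaling factors.
    kept = np.sort(lam_per_sample)[: len(lam_per_sample) - n_discard]
    return kept[-1]  # optimal lambda*_{|Q|}
```

With n_discard = N^out_j this returns λ*_{N^out_j} ≤ 1 as in Lemma 2 below, while n_discard = N^out_j − 1 returns λ*_{N^out_j − 1} > 1.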

Under/over-approximating regions. Recall from Def. 1 that N^out_j is the number of samples leading to a successor state outside of region R_j. The following lemma uses N^out_j to define an under- or over-approximation of region R_j.

Lemma 2. When discarding |Q| = N^out_j samples as per Lemma 1, it holds that R_j(λ*_{N^out_j}) ⊆ R_j. When discarding |Q| = N^out_j − 1 samples, it holds that R_j(λ*_{N^out_j − 1}) ⊃ R_j.

Proof. Lemma 1 states that at every step, we may only discard the sample of the active constraint. By construction, after discarding |Q| = N^out_j samples, all remaining successor state samples lie within R_j, so λ*_{N^out_j} ≤ 1, and thus R_j(λ*_{N^out_j}) ⊆ R_j. When we discard one less sample, i.e., |Q| = N^out_j − 1, we must have one sample outside of R_j, so λ*_{N^out_j − 1} > 1, meaning that R_j(λ*_{N^out_j − 1}) ⊃ R_j. This concludes the proof.

Intuitively, R_j(λ*_{N^out_j}) contains exactly those samples in R_j, while R_j(λ*_{N^out_j − 1}) additionally contains the sample closest outside of R_j, as visualized in Fig. 9 for a 1D example.

A.1 Proof of Theorem 1

The proof of Theorem 1 is adapted from Romao, Margellos, and Papachristodoulou (2020, Theorem 5), which is based on three key assumptions: (1) the considered scenario problem belongs to the class of so-called fully-supported problems (see Campi and Garatti (2008) for a definition), (2) its solution is unique, and (3) discarded samples violate the optimal solution with probability one.


In our case, requirement (2) is implied by Assumption 4, requirement (3) is satisfied by Lemma 1, and the problem is fully-supported because the number of decision variables is one. Under these requirements, Romao, Margellos, and Papachristodoulou (2020, Theorem 5) states that the risk associated with an optimal solution λ*_{|Q|} to Eq. (13) for |Q| discarded samples satisfies the following expression:

P^N \Big\{ P\big( x \notin R_j(\lambda^*_{|Q|}) \big) \leq \varepsilon \Big\} = 1 - \sum_{i=0}^{|Q|} \binom{N}{i} \varepsilon^i (1-\varepsilon)^{N-i}, (14)

where we omit the subscripts in P_{w_k}(·) and x_{k+1} for brevity.

The right-hand side of Eq. (14) is the cumulative distribution function (CDF) of a beta distribution with parameters |Q| + 1 and N − |Q|, evaluated at ε; this follows from the standard identity relating binomial tail probabilities to the regularized incomplete beta function. We denote this CDF by F_{|Q|}(ε), where we explicitly write the dependency on |Q|. Hence, we obtain

F_{|Q|}(\varepsilon) = P^N \Big\{ P\big( x \notin R_j(\lambda^*_{|Q|}) \big) \leq \varepsilon \Big\} = \beta. (15)

Thus, if we discard |Q| samples, Eq. (15) returns the confidence probability β ∈ (0, 1) by which the probability of x_{k+1} ∉ R_j(λ*_{|Q|}) is upper bounded by ε. Conversely, we can compute the value of ε needed to obtain a confidence probability of β, using the percent point function (PPF) G_{|Q|}(β):

P^N \Big\{ P\big( x \notin R_j(\lambda^*_{|Q|}) \big) \leq G_{|Q|}(\beta) \Big\} = \beta. (16)

The PPF is the inverse of the CDF, so by definition, we have

\varepsilon = G_{|Q|}(\beta) = G_{|Q|}\big( F_{|Q|}(\varepsilon) \big). (17)

Note that P(x ∈ R) + P(x ∉ R) = 1, so Eq. (15) equals

F_{|Q|}(\varepsilon) = P^N \Big\{ P\big( x \in R_j(\lambda^*_{|Q|}) \big) \geq 1 - \varepsilon \Big\} = \beta. (18)

By defining p = 1 − ε, Eq. (17) and (18) are combined as

P^N \Big\{ 1 - G_{|Q|}(\beta) \leq P\big( x \in R_j(\lambda^*_{|Q|}) \big) \Big\} = 1 - \sum_{i=0}^{|Q|} \binom{N}{i} (1-p)^i p^{N-i} = \beta. (19)

In what follows, we separately use Eq. (19) to prove the lower and upper bound of the probability interval in Theorem 1.

Lower bound. There are N possible values for |Q|, ranging from 0 to N − 1. The case |Q| = N (i.e., all samples are discarded) is treated as a special case in Theorem 1. We fix β = 1 − β/(2N) in Eq. (19), yielding the series of equations

P^N \Big\{ 1 - G_0\big(1 - \tfrac{\beta}{2N}\big) \leq P\big( x \in R_j(\lambda^*_0) \big) \Big\} = 1 - \tfrac{\beta}{2N}
  \vdots   (20)
P^N \Big\{ 1 - G_{N-1}\big(1 - \tfrac{\beta}{2N}\big) \leq P\big( x \in R_j(\lambda^*_{N-1}) \big) \Big\} = 1 - \tfrac{\beta}{2N}.

Denote by A_n the event that 1 − G_n(1 − β/(2N)) ≤ P(x ∈ R_j(λ*_n)), for n = 0, . . . , N − 1. Regardless of n, this event has a probability of P^N{A_n} = 1 − β/(2N), and its complement A'_n has probability P^N{A'_n} = β/(2N). Via Boole's inequality, we know that

P^N \Big\{ \bigcup_{n=0}^{N-1} A'_n \Big\} \leq \sum_{n=0}^{N-1} P^N \{ A'_n \} = \frac{\beta}{2N} \, N = \frac{\beta}{2}. (21)

Thus, for the intersection of all events in Eq. (20), we have

P^N \Big\{ \bigcap_{n=0}^{N-1} A_n \Big\} = 1 - P^N \Big\{ \bigcup_{n=0}^{N-1} A'_n \Big\} \geq 1 - \frac{\beta}{2}. (22)

After observing the samples at hand, we replace |Q| by the actual value of N^out_j (as per Def. 1), giving one of the expressions in Eq. (20). The probability that this expression holds cannot be smaller than that of the intersection of all events in Eq. (22). Thus, we obtain

P^N \Big\{ \underline{p} \leq P\big( x \in R_j(\lambda^*_{N^{out}_j}) \big) \Big\} \geq 1 - \frac{\beta}{2}, (23)

where \underline{p} = 0 if N^out_j = N (in which case the bound trivially holds with probability one), and otherwise \underline{p} = 1 − G_{N^out_j}(1 − β/(2N)) is the solution for p to Eq. (19), with |Q| = N^out_j and β = 1 − β/(2N):

1 - \frac{\beta}{2N} = 1 - \sum_{i=0}^{N^{out}_j} \binom{N}{i} (1-p)^i p^{N-i}, (24)

which is equivalent to Eq. (10).

Upper bound. Eq. (19) is rewritten as an upper bound as

P^N \Big\{ P\big( x \in R_j(\lambda^*_{|Q|}) \big) < 1 - G_{|Q|}(\beta) \Big\} = 1 - \beta, (25)

where again, |Q| can range from 0 to N − 1. However, to obtain high-confidence guarantees on the upper bound, we now fix β = β/(2N), which yields the series of equations

P^N \Big\{ P\big( x \in R_j(\lambda^*_0) \big) < 1 - G_0\big(\tfrac{\beta}{2N}\big) \Big\} = 1 - \tfrac{\beta}{2N}
  \vdots   (26)
P^N \Big\{ P\big( x \in R_j(\lambda^*_{N-1}) \big) < 1 - G_{N-1}\big(\tfrac{\beta}{2N}\big) \Big\} = 1 - \tfrac{\beta}{2N}.

Analogous to the lower bound case, Boole's inequality implies that the intersection of all expressions in Eq. (26) has a probability of at least 1 − β/2. After observing the samples at hand, we replace |Q| by N^out_j − 1, yielding one of the expressions in Eq. (26). For this expression, it holds that

P^N \Big\{ P\big( x \in R_j(\lambda^*_{N^{out}_j - 1}) \big) \leq \overline{p} \Big\} \geq 1 - \frac{\beta}{2}, (27)

where \overline{p} = 1 if N^out_j = 0 (in which case the bound trivially holds with probability one), and otherwise \overline{p} = 1 − G_{N^out_j − 1}(β/(2N)) is the solution for p to Eq. (19), with |Q| = N^out_j − 1 and β = β/(2N):

\frac{\beta}{2N} = 1 - \sum_{i=0}^{N^{out}_j - 1} \binom{N}{i} (1-p)^i p^{N-i}, (28)

which is equivalent to Eq. (11).


Figure 10: Probability intervals (with standard deviations), obtained from Theorem 1 with β = 0.01 or β = 0.1, on a transition with a true probability of 0.25 (the horizontal axis shows the number of samples N on a log scale).

Probability interval. We invoke Lemma 2, which states that R_j(λ*_{N^out_j}) ⊆ R_j ⊂ R_j(λ*_{N^out_j − 1}), so we have

P\big( x \in R_j(\lambda^*_{N^{out}_j}) \big) \leq P(x \in R_j) = P(s_i, a_l)(s_j) < P\big( x \in R_j(\lambda^*_{N^{out}_j - 1}) \big). (29)

We use Eq. (29) to write Eq. (23) and (27) in terms of the transition probability P(s_i, a_l)(s_j). Finally, by applying Boole's inequality, we combine Eq. (23) and (27) as follows:

P^N \Big\{ \underline{p} \leq P(s_i, a_l)(s_j) \ \cap \ P(s_i, a_l)(s_j) \leq \overline{p} \Big\} = P^N \Big\{ \underline{p} \leq P(s_i, a_l)(s_j) \leq \overline{p} \Big\} \geq 1 - \beta, (30)

which is equal to Eq. (9), so we conclude the proof.

A.2 Example use of Theorem 1

We provide an example to illustrate how the number of samples N affects the tightness of the probability intervals returned by Theorem 1. Consider a system with a 1-dimensional state x_k ∈ R, where the probability density function of the successor state x_{k+1} under a specific action a_l with target point d_l is a uniform distribution over the domain [−4, 4]. For a given region R_j = [−1, 1] (which is also shown in Fig. 9), we want to evaluate the probability that x_{k+1} ∈ R_j, which is clearly 0.25. To this end, we apply Theorem 1 for different numbers of samples N ∈ [25, 12800], and a confidence level of β = 0.1 or β = 0.01. Since the obtained probability bounds are random variables through their dependence on the set of N samples of the noise, we repeat the experiment 100,000 times for every value of N. The resulting bounds on the transition probability are shown in Fig. 10. The plot shows that the uncertainty in the transition probability is reduced by increasing the number of samples.
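For reference, the interval of Theorem 1 can be evaluated directly via the beta distribution, since Eq. (24) and (28) are binomial tail conditions whose solutions in p are beta quantiles. The sketch below is our own illustration of this example (not the authors' code); it assumes SciPy's scipy.stats.beta, and the function name pac_interval is hypothetical.

```python
import numpy as np
from scipy.stats import beta as beta_dist

def pac_interval(N, N_out, conf_beta):
    """PAC interval [p_low, p_up] on a transition probability (Theorem 1),
    given N noise samples of which N_out lead outside the region R_j."""
    # Lower bound, Eq. (10)/(24): trivially 0 if all samples fall outside R_j.
    p_low = 0.0 if N_out == N else beta_dist.ppf(conf_beta / (2 * N), N - N_out, N_out + 1)
    # Upper bound, Eq. (11)/(28): trivially 1 if no sample falls outside R_j.
    p_up = 1.0 if N_out == 0 else beta_dist.ppf(1 - conf_beta / (2 * N), N - N_out + 1, N_out)
    return p_low, p_up

# Setting of this example: uniform noise on [-4, 4] and region R_j = [-1, 1].
rng = np.random.default_rng(0)
N, conf_beta = 1600, 0.01
successors = rng.uniform(-4, 4, size=N)
N_out = int(np.sum(np.abs(successors) > 1))   # samples outside R_j
print(pac_interval(N, N_out, conf_beta))      # interval around the true value 0.25
```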

B Details on UAV Benchmark

B.1 Explicit model formulation

The 6-dimensional state vector of the UAV model is x = [p_x, v_x, p_y, v_y, p_z, v_z]^⊤ ∈ R^6, where p_i and v_i are the position and velocity in direction i. Every element of the vector of control inputs u = [f_x, f_y, f_z]^⊤ ∈ R^3 is constrained to the interval [−4, 4]. The discrete-time dynamics are an extension of the lower-dimensional case in Badings et al. (2021), and are written in the form of Eq. (1) as follows:

x_{k+1} = \begin{bmatrix} 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix} x_k + \begin{bmatrix} 0.5 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 0.5 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0.5 \\ 0 & 0 & 1 \end{bmatrix} u_k + w_k, (31)

where w_k is the effect of turbulence on the continuous state, which we model using a Dryden gust model; we use a Python implementation by Bøhn et al. (2019). Note that the drift term q_k = 0, and is thus omitted from Eq. (31). Since the model in Eq. (31) is not fully actuated (it has only 3 controls, versus a state of dimension 6), we group every two discrete time steps together and rewrite the model as follows:

x_{k+2} = A x_k + U u_{k,k+1} + w_{k,k+1}, (32)

where A ∈ R^{6×6} and U ∈ R^{6×6} are properly redefined matrices, and u_{k,k+1} ∈ R^6 and w_{k,k+1} ∈ R^6 collect the control inputs and process noise at both time steps k and k + 1 in one vector.
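One way to obtain the redefined matrices of Eq. (32) is to compose Eq. (31) over two steps, i.e., x_{k+2} = A(A x_k + B u_k + w_k) + B u_{k+1} + w_{k+1}. A minimal NumPy sketch of this composition follows; the variable names are ours.

```python
import numpy as np

# Double-integrator block per axis, matching the structure of Eq. (31)
A_blk = np.array([[1.0, 1.0],
                  [0.0, 1.0]])
B_blk = np.array([[0.5],
                  [1.0]])
A = np.kron(np.eye(3), A_blk)        # 6x6 state matrix of Eq. (31)
B = np.kron(np.eye(3), B_blk)        # 6x3 input matrix of Eq. (31)

# Composing two time steps gives the grouped model of Eq. (32):
A_grouped = A @ A                    # redefined 6x6 state matrix
U_grouped = np.hstack([A @ B, B])    # redefined 6x6 input matrix for [u_k; u_{k+1}]
# The combined noise term is then w_{k,k+1} = A w_k + w_{k+1}.
```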

The objective of this problem is to reach the goal region, which is also depicted in Fig. 3, within 64 steps (i.e., 32 steps with the model in Eq. (32)). This goal region is written as the following set of continuous states (for brevity, we omit an explicit definition of the critical regions):

X_G = \{ x \in \mathbb{R}^6 \mid 11 \leq p_x \leq 15,\ 1 \leq p_y \leq 5,\ -7 \leq p_z \leq -3 \}. (33)

B.2 Run times and size of model abstraction

The average run times and iMDP model sizes (over 10 runs) for all iterations of the UAV benchmark are presented in Table 1. Computing the states and actions took 15 min, but since this step is only performed in the first iteration, we omit this value from the table. The number of states (which equals the number of regions in the partition) and choices (the total number of state-action pairs) of the iMDP are independent of the value of N. By contrast, the number of transitions increases with N, because the additional noise samples may reveal more possible outcomes of a state-action pair.

C Details on Building Temperature Regulation Benchmark

C.1 Explicit model formulation

This model is a variant of the two-zone building automation system benchmark with stochastic dynamics in Cauchi and Abate (2018). The state vector of the model is x = [T^z_1, T^z_2, T^r_1, T^r_2]^⊤ ∈ R^4, reflecting both room (zone) temperatures (T^z) and both radiator temperatures (T^r). The control inputs u = [T^ac_1, T^ac_2, T^boil_1, T^boil_2]^⊤ ∈ R^4 are the air conditioning (ac) and boiler temperatures, respectively, which are constrained within 14 ≤ T^ac ≤ 26 and 65 ≤ T^boil ≤ 85.


Table 1: Run times per iteration (which excludes the computation of states and actions) and model sizes for the UAV (under high turbulence), and 2-zone and 1-zone building temperature regulation (BAS) benchmarks.

Benchmark      Samples N   Update intervals (s)   Write iMDP (s)   Model checking (s)   States    Choices      Transitions
UAV            25          11.4                    29.0             39.2                 25,516    1,228,749    9,477,795
UAV            50          15.4                    34.4             46.0                 25,516    1,228,749    11,350,692
UAV            100         18.6                    39.3             51.8                 25,516    1,228,749    13,108,282
UAV            200         24.3                    44.4             59.1                 25,516    1,228,749    14,921,925
UAV            400         35.4                    50.6             68.5                 25,516    1,228,749    17,203,829
UAV            800         55.4                    58.3             80.2                 25,516    1,228,749    19,968,663
UAV            1600        92.2                    64.9             90.6                 25,516    1,228,749    22,419,161
UAV            3200        164.1                   68.6             94.4                 25,516    1,228,749    23,691,107
UAV            6400        312.1                   69.8             95.5                 25,516    1,228,749    23,958,183
UAV            12800       611.2                   69.8             95.1                 25,516    1,228,749    23,969,028
BAS (2-zone)   25          13.9                    73.6             105.9                35,722    1,611,162    24,929,617
BAS (2-zone)   50          43.2                    107.7            160.5                35,722    1,611,162    37,692,421
BAS (2-zone)   100         50.8                    153.2            230.2                35,722    1,611,162    53,815,543
BAS (2-zone)   200         59.1                    205.8            315.2                35,722    1,611,162    72,588,298
BAS (2-zone)   400         72.3                    260.6            423.2                35,722    1,611,162    91,960,970
BAS (2-zone)   800         96.2                    316.8            504.3                35,722    1,611,162    112,028,449
BAS (2-zone)   1600        132.6                   370.0            614.8                35,722    1,611,162    131,844,146
BAS (2-zone)   3200        200.1                   432.9            789.4                35,722    1,611,162    154,155,445
BAS (2-zone)   6400        331.8                   513.2            982.2                35,722    1,611,162    182,175,229
BAS (2-zone)   12800       545.5                   601.8            1177.8               35,722    1,611,162    218,031,384
BAS (1-zone)   25          1.62                    0.06             1.39                 381       1,511        20,494
BAS (1-zone)   50          1.81                    0.08             1.39                 381       1,511        27,327
BAS (1-zone)   100         1.87                    0.09             1.40                 381       1,511        33,871
BAS (1-zone)   200         1.96                    0.11             1.40                 381       1,511        40,368
BAS (1-zone)   400         2.06                    0.13             1.42                 381       1,511        47,333
BAS (1-zone)   800         2.26                    0.14             1.40                 381       1,511        54,155
BAS (1-zone)   1600        2.59                    0.16             1.39                 381       1,511        59,686
BAS (1-zone)   3200        3.24                    0.17             1.40                 381       1,511        65,122
BAS (1-zone)   6400        4.50                    0.18             1.40                 381       1,511        70,245
BAS (1-zone)   12800       7.01                    0.20             1.38                 381       1,511        76,076

The changes in the temperature of both zones are governed by the following thermodynamics:

\dot{T}^z_1 = \frac{1}{C_1} \Big[ \frac{T^z_2 - T^z_1}{R_{1,2}} + \frac{T_{wall} - T^z_1}{R_{1,wall}} + m C_{pa} (T^{ac}_1 - T^z_1) + P_{out} (T^r_1 - T^z_1) \Big], (34a)

\dot{T}^z_2 = \frac{1}{C_2} \Big[ \frac{T^z_1 - T^z_2}{R_{1,2}} + \frac{T_{wall} - T^z_2}{R_{2,wall}} + m C_{pa} (T^{ac}_2 - T^z_2) + P_{out} (T^r_2 - T^z_2) \Big], (34b)

where C_i is the thermal capacitance of zone i, R_{i,j} is the resistance between zones i and j, m is the air mass flow, C_{pa} is the specific heat capacity of air, and P_{out} is the rated output of the radiator. Similarly, the dynamics of the radiator in room i are governed by the following equation:

\dot{T}^r_i = k_1 (T^z_i - T^r_i) + k_0 w (T^{boil}_i - T^r_i), (35)

where k_0 and k_1 are constant parameters, and w is the water mass flow from the boiler. For the precise parameter values used, we refer to our code, which is provided in the supplementary material. By discretizing Eq. (34) and (35) with a forward Euler method at a time resolution of 15 min, we obtain the following model in explicit form:

x_{k+1} = \begin{bmatrix} 0.8425 & 0.0537 & -0.0084 & 0.0000 \\ 0.0515 & 0.8435 & 0.0000 & -0.0064 \\ 0.0668 & 0.0000 & 0.8971 & 0.0000 \\ 0.0000 & 0.0668 & 0.0000 & 0.8971 \end{bmatrix} x_k + \begin{bmatrix} 0.0584 & 0.0000 & 0.0000 & 0.0000 \\ 0.0000 & 0.0599 & 0.0000 & 0.0000 \\ 0.0000 & 0.0000 & 0.0362 & 0.0000 \\ 0.0000 & 0.0000 & 0.0000 & 0.0362 \end{bmatrix} u_k + \begin{bmatrix} 1.2291 \\ 1.0749 \\ 0.0000 \\ 0.0000 \end{bmatrix} + \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0.01 & 0 & 0 & 0 \\ 0 & 0.01 & 0 & 0 \\ 0 & 0 & 0.01 & 0 \\ 0 & 0 & 0 & 0.01 \end{bmatrix} \right), (36)

where \mathcal{N}(μ, Σ) ∈ R^n denotes a multivariate Gaussian random variable with mean μ ∈ R^n and covariance matrix Σ.
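As an illustration of the forward Euler step used to obtain Eq. (36) from the continuous-time thermodynamics, the sketch below assumes linear dynamics of the form ẋ = A_c x + B_c u + q_c; the function name is ours, and the continuous-time matrices are placeholders rather than the benchmark's exact parameter values.

```python
import numpy as np

def forward_euler(A_c, B_c, q_c, dt):
    """Forward Euler discretization of xdot = A_c x + B_c u + q_c:
    x_{k+1} = x_k + dt * (A_c x_k + B_c u_k + q_c) = A_d x_k + B_d u_k + q_d,
    where dt corresponds to the 15-minute time resolution used above."""
    n = A_c.shape[0]
    A_d = np.eye(n) + dt * A_c
    B_d = dt * B_c
    q_d = dt * q_c
    return A_d, B_d, q_d
```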

The goal in this problem is to reach a temperature between 19.9 and 20.1 °C in both zones within 32 discrete time steps of 15 min. As such, the goal region X_G and critical region X_C are:

X_G = \{ x \in \mathbb{R}^4 \mid 19.9 \leq T^z_1 \leq 20.1,\ 19.9 \leq T^z_2 \leq 20.1 \}, \quad X_C = \emptyset. (37)


C.2 Run times and size of model abstraction

The run times and iMDP model sizes for all iterations of the 2-zone building temperature regulation benchmark are presented in Table 1. Computing the states and actions took 5:25 min, but we omit this value from the table, as this step is only performed in the first iteration. The discussion of model sizes is analogous to Appendix B.2 on the UAV benchmark.

D Benchmarks against other tools

D.1 Benchmark against StocHy

We provide a comparison between our approach and StocHy (Cauchi and Abate 2019) on a reduced version of the temperature regulation problem in Appendix C.1 with only one zone. Thus, the state vector becomes x = [T^z, T^r]^⊤, and the control inputs are u = [T^ac, T^boil]^⊤, which are constrained to 14 ≤ T^ac ≤ 28 and −10 ≤ T^boil ≤ 10 (note that T^boil is now relative to the nominal boiler temperature of 75 °C). Using the same dynamics as in Eq. (34) and (35), we obtain the following model in explicit form:

x_{k+1} = \begin{bmatrix} 0.8820 & 0.0058 \\ 0.0134 & 0.9625 \end{bmatrix} x_k + \begin{bmatrix} 0.0584 & 0.0000 \\ 0.0000 & 0.0241 \end{bmatrix} u_k + \begin{bmatrix} 0.9604 \\ 1.3269 \end{bmatrix} + \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0.02 & 0 \\ 0 & 0.1 \end{bmatrix} \right). (38)

The objective in this problem is to reach a zone temperature between 20.9 and 21.1 °C within 64 discrete time steps of 15 min. As such, the goal region X_G and critical region X_C are:

X_G = \{ x \in \mathbb{R}^2 \mid 20.9 \leq T^z \leq 21.1 \}, \quad X_C = \emptyset. (39)

We partition the continuous state space into a rectangular grid of 19 × 20 regions of width 0.2 °C, centered at T^z = 21 °C and T^r = 38 °C. As such, the partition covers zone temperatures between 19.1 and 22.9 °C, and radiator temperatures between 36 and 40 °C.
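For concreteness, this grid can be constructed as follows (an illustrative NumPy sketch with variable names of our own choosing, not the authors' code):

```python
import numpy as np

# 19 x 20 rectangular partition with regions of width 0.2 degC,
# centered at (T^z, T^r) = (21, 38) degC.
n_regions = np.array([19, 20])
width = np.array([0.2, 0.2])
center = np.array([21.0, 38.0])
lower = center - n_regions * width / 2.0
# Region edges along each dimension (zone and radiator temperature).
edges = [lower[d] + width[d] * np.arange(n_regions[d] + 1) for d in range(2)]
print(edges[0][0], edges[0][-1])   # 19.1 and 22.9 degC, as stated above
print(edges[1][0], edges[1][-1])   # 36.0 and 40.0 degC
```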

Abstraction method. Similar to our approach, StocHy generates robust iMDP abstractions of the model above. However, while our approach also works when the distribution of the noise is unknown, StocHy requires precise knowledge of the noise distribution. Another difference is that StocHy discretizes the control inputs of the dynamical system to obtain a finite action space. An iMDP action is then associated with the execution of a specific value of the control input u_k. In our approach, an iMDP action instead reflects an attempted transition to a given target point, and the actual control input u_k (calculated using the control law in Eq. (5)) depends on the precise continuous state x_k.

Our approach is faster. The run times and model sizes of our approach are reported in Table 1 under BAS (1-zone). For our approach, the time to compute the states and actions (needed only in the first iteration) is 0.1 s, and run times per iteration (excluding verification times) vary from 1.7 to 7.2 s, depending on the value of N. Notably, the StocHy run times are orders of magnitude higher: the abstraction procedure of StocHy (excluding verification time) took 45:28 min, compared to 7.2 s for our approach with the maximum of N = 12800 samples (as reported in Table 1). We omit a comparison of the verification times, since StocHy does not leverage PRISM like our method does.

These results provide a clear message: our approach is more general and significantly faster, and (as shown earlier in Fig. 6) generates results that are qualitatively similar to those obtained from StocHy.

D.2 Benchmark against SReachTools

We benchmark our approach against the spacecraft rendezvous problem supplied with the MATLAB toolbox SReachTools (Vinod, Gleason, and Oishi 2019). In this problem, one spacecraft must dock with another within 8 time steps, while remaining in a target tube. The goal is to compute a controller that maximizes the probability to achieve this objective. The 4-dimensional state x = [p_x, p_y, v_x, v_y]^⊤ ∈ R^4 describes the position and velocity in both directions. We adopt the same discrete-time dynamics as used in Vinod, Gleason, and Oishi (2019) and assume the same Gaussian noise (see the reference for details).

This problem is a finite-horizon reach-avoid problem with a rectangular goal region (target set), while the safe set is a convex tube, as also shown in Fig. 7. For our approach, we partition the state space into 3,200 rectangular regions. It is fair to note that the safe set in SReachTools is smooth, while ours is not (due to our partition-based approach). While our current implementation is limited to regions defined as hyper-rectangles and parallelepipeds, our method can in theory be used with any convex polytopic regions.

Reachability guarantees. We compare our method (with N = 25, . . . , 1600 samples of the noise) to the results obtained via the different methods in SReachTools. We only consider the particle and Voronoi methods of SReachTools, because only these methods are sampling-based like our approach (although only the Voronoi method can give confidence guarantees on the results). The reachability guarantees and run times are presented in Table 2, and simulated trajectories in Fig. 7. As expected, our reachability guarantees, as well as those of the Voronoi method, are a lower bound on the average simulated performance (and are thus safe), while this is not the case for the particle method of SReachTools.

Complexity and run time. As shown in Table 2, the grid-free methods of SReachTools are generally faster than our abstraction-based approach. However, we note that our method was designed for non-convex problems, which cannot be solved by SReachTools. Interestingly, the complexity of the particle (Lesser, Oishi, and Erwin 2013) and Voronoi partition methods (Sartipizadeh et al. 2019) increases exponentially with the number of samples, because their optimization problems have a binary variable for every particle. By contrast, the complexity of our method increases only linearly with the number of samples, because it suffices to count the number of samples in every region, as discussed in Sect. 4.2.
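To illustrate the linear dependence on the sample count, the sketch below counts how many successor-state samples fall into each cell of a rectangular partition in a single pass over the samples. It is our own illustration with hypothetical names, assuming an axis-aligned grid partition.

```python
import numpy as np

def count_samples_per_region(successors, lower, width, shape):
    """Count successor-state samples per grid region in O(N) time.

    successors: (N, d) array of sampled successor states.
    lower, width, shape: lower corner, region widths, and regions per dimension (tuple).
    """
    idx = np.floor((successors - lower) / width).astype(int)
    inside = np.all((idx >= 0) & (idx < shape), axis=1)      # samples that hit the partition
    flat = np.ravel_multi_index(idx[inside].T, shape)        # flatten d-dimensional indices
    counts = np.bincount(flat, minlength=int(np.prod(shape)))
    # Per-region counts (so N^out_j = N - counts[j]) and the number of samples
    # that fall outside the partition altogether.
    return counts.reshape(shape), int(np.sum(~inside))
```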

Controller type. The particle and Voronoi methods ofSReachTools synthesize open-loop controllers, which meansthat they cannot act in response to observed disturbances ofthe state. Open-loop controllers do not consider any feedback,making such controllers unsafe in settings with significant


Table 2: Reachability probability, the average simulated reachability probability, and run times for our method and SReachTools, under the default (x1) and increased (x10) noise covariance.

                             Our method                                                      SReachTools
                             Initial abstraction   N = 25   N = 100   N = 400   N = 1600     Particle   Voronoi
Noise covariance x1
  Reachability probability   —                     0.44     0.81      0.95      0.98         0.88       0.85
  Simulated probability      —                     1.00     1.00      1.00      1.00         0.83       0.86
  Run time (s)               3.420                 6.966    9.214     11.016    17.294       0.703      3.462
Noise covariance x10
  Reachability probability   —                     0.19     0.36      0.52      0.62         0.40       NaN
  Simulated probability      —                     0.59     0.63      0.59      0.65         0.19       NaN
  Run time (s)               3.372                 11.433   15.909    18.671    25.452       2.495      NaN

noise levels (Åström and Murray 2010). By contrast, we derive piecewise linear feedback controllers. This structure is obtained because our controllers are based on a state-based control policy that maps every continuous state within the same region to the same abstract action.

Robustness against disturbances. To demonstrate the importance of feedback control, we test both methods under stronger disturbances, by increasing the covariance of the noise by a factor of 10. As shown in Table 2, the particle method yields a low reachability probability (with the simulated performance being even lower), and the Voronoi method was not able to generate a solution at all. By contrast, our method still provides safe lower bound guarantees, which become tighter for increasing sample sizes. Our state-based controller is robust against the stronger noise because it is able to act in response to observed disturbances, unlike the open-loop methods of SReachTools, which result in larger errors in the simulated state trajectories, as also shown in Fig. 7.