Aryan Mokhtari Suvrit Sra Ali Jadbabaie [email protected] … · 2018-11-29 · Direct Runge-Kutta...

Direct Runge-Kutta Discretization Achieves Acceleration

Jingzhao Zhang [email protected]

Aryan Mokhtari [email protected]

Suvrit Sra [email protected]

Ali Jadbabaie [email protected]

Laboratory for Information and Decision SystemsInstitute for Data, Systems, and SocietyMassachusetts Institute of Technology

Abstract

We study gradient-based optimization methods obtained by directly discretizinga second-order ordinary differential equation (ODE) related to the continuouslimit of Nesterov’s accelerated gradient method. When the function is smoothenough, we show that acceleration can be achieved by a stable discretization ofthis ODE using standard Runge-Kutta integrators. Specifically, we prove that un-der Lipschitz-gradient, convexity and order-(s+ 2) differentiability assumptions,the sequence of iterates generated by discretizing the proposed second-order ODEconverges to the optimal solution at a rate of O(N−2

ss+1 ), where s is the order

of the Runge-Kutta numerical integrator. Furthermore, we introduce a new localflatness condition on the objective, under which rates even faster than O(N−2)can be achieved with low-order integrators and only gradient information. No-tably, this flatness condition is satisfied by several standard loss functions used inmachine learning. We provide numerical experiments that verify the theoreticalrates predicted by our results.

1 Introduction

In this paper, we study accelerated first-order optimization algorithms for the problem

minx∈Rd

f(x), (1)

where f is convex and sufficiently smooth. A classical method for solving (1) is gradient descent(GD), which displays a sub-optimal convergence rate of O(N−1)—i.e., the gap f(xN )− f(x∗) be-tween GD and the optimal value f(x∗) decreases to zero at the rate ofO(N−1). Nesterov’s seminalaccelerated gradient method [Nesterov, 1983] matches the oracle lower bound of O(N−2) [Ne-mirovskii et al., 1983], and is thus a central result in the theory of convex optimization.

However, ever since its introduction, acceleration has remained somewhat mysterious, especiallybecause Nesterov’s original derivation relies on elegant but unintuitive algebraic arguments. Thislack of understanding has spurred a variety of recent attempts to uncover the rationale behind thephenomenon of acceleration [Allen-Zhu and Orecchia, 2014, Bubeck et al., 2015, Lessard et al.,2016, Hu and Lessard, 2017, Scieur et al., 2016, Fazlyab et al., 2017].

We pursue instead an approach to NAG (and accelerated methods in general) via a continuous-timeperspective. This view was recently studied by Su et al. [2014], who showed that the continuous limitof NAG is a second order ODE describing a physical system with vanishing friction; Wibisono et al.[2016] generalized this idea and proposed a class of ODEs by minimizing Bregman Lagrangians.

Although these works succeed in providing a richer understanding of Nesterov’s scheme via itscontinuous time ODE, they fail to provide a general discretization procedure that generates provablyconvergent accelerated methods. In contrast, we introduce a second-order ODE that generates anaccelerated first-order method for smooth functions if we simply discretize it using any Runge-Kuttanumerical integrator and choose a suitable step size.

1.1 Summary of results

Assuming that the objective function is convex and sufficiently smooth, we establish the following:

1

arX

iv:1

805.

0052

1v5

[m

ath.

OC

] 2

8 N

ov 2

018

� We propose a second-order ODE, and show that the sequence of iterates generated by discretizingusing a Runge-Kutta integrator converges to the optimal solution at the rate O(N

−2ss+1 ), where s

is the order of the integrator. By using a more precise numerical integrator, (i.e., a larger s), thisrate approaches the optimal rate O(N−2).

� We introduce a new local flatness condition for the objective function (Assumption 1), underwhich Runge-Kutta discretization obtains convergence rates even faster than O(N−2), withoutrequiring high-order integrators. In particular, we show that if the objective is locally flat arounda minimum, by using only gradient information we can obtain a convergence rate of O(N−p),where p quantifies the degree of local flatness. Acceleration due to local flatness may seemcounterintuitive at first, but our analysis reveals why it helps.

To the best of our knowledge, this work presents the first direct1 discretization of an ODE that yieldsaccelerated gradient methods. Unlike Betancourt et al. [2018] who study symplecticity and considervariational integrators, and Scieur et al. [2017] who study consistency of integrators, we focus on theorder of integrators (see §2.1). We argue that the stability inherent to the ODE and order conditionson the integrators suffice to achieve acceleration.

1.2 Additional related workSeveral works [Alvarez, 2000, Attouch et al., 2000, Bruck Jr, 1975, Attouch and Cominetti, 1996]have studied the asymptotic behavior of solutions to dissipative dynamical systems. However, theseworks retain a theoretical focus as they remain in the continuous time domain and do not discussthe key issue, namely, stability of discretization. Other works such as [Krichene et al., 2015], studythe counterpart of Su et al. [2014]’s work for mirror descent algorithms and achieve accelerationvia Nesterov’s technique. Diakonikolas and Orecchia [2017] proposes a framework to analyze thefirst order mirror descent algorithms by studying ODEs derived from duality gaps. Also, Raginskyand Bouvrie [2012] obtain nonasymptotic rates for continuous time mirror descent in a stochasticsetting.

A textbook treatment of numerical integration is given in [Hairer et al., 2006]; some of our proofsbuild on material from Chapters 3 and 9. [Isaacson and Keller, 1994] and [West, 2004] also providenice introductions to numerical analysis.

2 Problem setup and background

Throughout the paper we assume that the objective f is convex and sufficiently smooth. Our keyresult rests on two key assumptions introduced below. The first assumption is a local flatness con-dition on f around a minimum; our second assumption requires f to have bounded higher orderderivatives. These assumptions are sufficient to achieve acceleration simply by discretizing suitableODEs without either resorting to reverse engineering to obtain discretizations or resorting to othermore involved integration mechanisms.

We will require our assumptions to hold on a suitable subset of Rd. Let x0 be the initial point to ourproposed iterative algorithm. First consider the sublevel set

S := {x ∈ Rd | f(x) ≤ exp(1)((f(x0)− f(x∗) + ‖x0 − x∗‖2) + 1}, (2)where x∗ is a minimum of (1). Later we will show that the sequence of iterates obtained fromdiscretizing a suitable ODE never escapes this sublevel set. Thus, the assumptions that we introduceneed to hold only within a subset of Rd. Let this subset be defined as

A := {x ∈ Rd | ∃x′ ∈ S, ‖x− x′‖ ≤ 1}, (3)that is, the set of points at unit distance to the initial sublevel set (2). The choice of unit distance isarbitrary, and one can scale that to any desired constant.Assumption 1. There exists an integer p ≥ 2 and a positive constant L such that for any pointx ∈ A, and for all indices i ∈ {1, ..., p− 1}, we have the lower-bound

f(x)− f(x∗) ≥ 1L‖∇

(i)f(x)‖p

p−i , (4)

where x∗ minimizes f and ‖∇(i)f(x)‖ denotes the operator norm of the tensor∇(i)f(x).1That is, discretize the ODE with known numerical integration schemes without resorting to reverse engi-

neering NAG’s updates.

2

Assumption 1 bounds high order derivatives by function suboptimality, so that these derivativesvanish as the suboptimality converges to 0. Thus, it quantifies the flatness of the objective arounda minimum.2 When p = 2, Assumption 1 is slightly weaker than the usual Lipschitz-continuity ofgradients (see Example 1) typically assumed in the analysis of first-order methods, including NAG.If we further know that the objectives Taylor expansion around an optimum does not have low orderterms, p would be the degree of the first nonzero term.

Example 1. Let f be convex with L2 -Lipschitz continuous gradients, i.e., ‖∇f(x) − ∇f(y)‖ ≤

L2 ‖x− y‖. Then, for any x, y ∈ Rd we have

f(x) ≥ f(y) + 〈∇f(y), x− y〉+ 1L‖∇f(x)−∇f(y)‖2.

In particular, for y = x∗, an optimum point, we have∇f(y) = 0, and thus we have f(x)−f(x∗) ≥1L‖∇f(x)‖2, which is nothing but inequality (4) for p = 2 and i = 1.

Example 2. Consider the `p-norm regression problem: minx f(x) = ‖Ax − b‖pp, for even integerp ≥ 2. If ∃x∗, Ax∗ = b, then f satisfies inequality (4) for p, and L depends on p and the operatornorm of A.

Logistic loss satisfies a slightly different version of Assumption 1 because its minimum can be atinfinity. We will explain this point in more detail in Section 3.1.

Next, we introduce our second assumption that adds additional restrictions on differentiability andbounds the growth of derivatives.

Assumption 2. There exists an integer s ≥ p and a constantM ≥ 0, such that f(x) is order (s+2)differentiable. Furthermore, for any x ∈ A, the following operator norm bounds hold:

‖∇(i)f(x)‖ ≤M, for i = p, p+ 1, . . . , s, s+ 1, s+ 2. (5)

When the sublevel sets of f are compact and hence the set A is also compact; as a result, thebound (5) on high order derivatives is implied by continuity. In addition, an Lp loss of the form‖Ax− b‖pp also satisfy (5) with M = p!‖A‖p2.

2.1 Runge-Kutta integrators

Before moving onto our new results (§3) let us briefly recall explicit Runge-Kutta (RK) integratorsused in our work. For a more in depth discussion please see the textbook [Hairer et al., 2006].

Definition 1. Given a dynamical system y = F (y), let the current point be y0 and the step size beh. An explicit S stage Runge-Kutta method generates the next step via the following update:

gi = y0 + h

i−1∑j=1

aijF (gj), Φh(y0) = y0 + h

S∑i=1

biF (gi), (6)

where aij and bi are suitable coefficients defined by the integrator; Φh(y0) is the estimation of thestate after time step h, while gi (for i = 1, . . . , S) are a few neighboring points where the gradientinformation F (gi) is evaluated.

By combining the gradients at several evaluation points, the integrator can achieve higher precisionby matching up Taylor expansion coefficients. Let ϕh(y0) be the true solution to the ODE withinitial condition y0; we say that an integrator Φh(y0) has order s if its discretization error shrinks as

‖Φh(y0)− ϕh(y0)‖ = O(hs+1), as h→ 0. (7)

In general, RK methods offer a powerful class of numerical integrators, encompassing several basicschemes. The explicit Euler’s method defined by Φh(y0) = y0 + hF (y0) is an explicit RK methodof order 1, while the midpoint method Φh(y0) = y0 + hF (y0 + h

2F (y0)) is of order 2. Some high-order RK methods are summarized in [Verner, 1996]. An order 4 RK method requires 4 stages, i.e.,4 gradient evaluations, while an order 9 method requires 16 stages.

2One could view this as an error bound condition that reverses the gradient-based upper bounds on subop-timality stipulated by the Polyak-Łojasiewicz condition [Lojasiewicz, 1965, Attouch et al., 2010].

3

Algorithm 1: Input(f, x0, p, L,M, s,N ) . Constants p, L,M are the same as in Assumptions1: Set the initial state y0 = [~0;x0; 1] ∈ R2d+1

2: Set step size h = C/N1

s+1 . C is determined by p, L,M, s, x03: xN ← Order-s-Runge-Kutta-Integrator(F, y0, N, h) . F is defined in equation 124: return xN

3 Main results

In this section, we introduce a second-order ODE and use explicit RK integrators to generate iteratesthat converge to the optimal solution at a rate faster than O(1/t) (where t denotes the time variablein the ODE). A central outcome of our result is that, at least for objective functions that are smoothenough, it is not the integrator type that is the key ingredient of acceleration, but a careful analysisof the dynamics with a more powerful Lyapunov function that achieves the desired result. Morespecifically, we will show that by carefully exploiting boundedness of higher order derivatives, wecan achieve both stability and acceleration at the same time.

We start with Nesterov’s accelerated gradient (NAG) method that is defined according to the updates

xk = yk−1 − h∇f(yk−1), yk = xk + k−1k+2 (xk − xk−1). (8)

Su et al. [2014] showed that the iteration (8) in the limit is equivalent to the following ODE

x(t) + 3t x(t) +∇f(x(t)) = 0, where x = dx

dt (9)

when one drives the step size h to zero. It can be further shown that in the continuous domainthe function value f(x(t)) decreases at the rate of O(1/t2) along the trajectories of the ODE. Thisconvergence rate can be accelerated to an arbitrary rate in continuous time via time dilation as in[Wibisono et al., 2016]. In particular, the solution to

x(t) + p+1t x(t) + p2tp−2∇f(x(t)) = 0, (10)

has a convergence rate O(1/tp). When p > 2, Wibisono et al. [2016] proposed rate matchingalgorithms via utilizing higher order derivatives (e.g., Hessians). In this work, we focus purely onfirst-order methods and study the stability of discretizing the ODE directly when p ≥ 2.

Though deriving the ODE from the algorithm is solved, deriving the update of NAG or any otheraccelerated method by directly discretizing an ODE is not. As stated in [Wibisono et al., 2016], ex-plicit Euler discretization of the ODE in (9) may not lead to a stable algorithm. Recently, Betancourtet al. [2018] observed empirically that Verlet integration is stable and suggested that the stability re-lates to the symplectic property of the Verlet integration. However, in our proof, we found that theorder condition of Verlet integration would suffice to achieve acceleration. Though symplecticintegrators are known to preserve modified Hamiltonians for dynamical systems, we weren’t able toleverage this property for dissipative systems such as (11).

This principal point of departure from previous works underlies Algorithm 1, which solves (1) bydiscretizing the following ODE with an order-s integrator:

x(t) +2p+ 1

tx(t) + p2tp−2∇f(x(t)) = 0. (11)

where we have augmented the state with time, to turn the non-autonomous dynamical system intoan autonomous one. The solution to (11) exists and is unique when t > 0. This claim follows bylocal Lipschitzness of f and is discussed in more details in Appendix A.2 of Wibisono et al. [2016].

We further highlight that the ODE in (11) can also be written as the dynamical system

y = F (y) =

− 2p+1t v − p2tp−2∇f(x)

v1

, where y = [v;x; t]. (12)

We have augmented the state with time to obtain an autonomous system, which can be readilysolved numerically with a Runge-Kutta integrator as in Algorithm 1. To avoid singularity at t = 0,

4

Algorithm 1 discretizes the ODE starting from t = 1 with initial condition y(1) = y0 = [0;x0; 1].The choice of 1 can be replaced by any arbitrary positive constant.

Notice that the ODE in (11) is slightly different from the one in (10); it has a coefficient 2p+1t for

x(t) instead of p+1t . This modification is crucial for our analysis via Lyapunov functions (more

details in Section 4 and Appendix A).

The parameter p in the ODE (11) is set to be the same as the constant in Assumption 1 to achievethe best theoretical upper bound by balancing stability and acceleration. Particularly, the largerp is, the faster the system evolves. Hence, the numerical integrator requires smaller step sizes tostabilize the process, but a smaller step size increases the number of iterations to achieve a targetaccuracy. This tension is alleviated by Assumption 1. The larger p is, the flatter the function f isaround its stationary points. In other words, Assumption 1 implies that as the iterates approach aminimum, the high order derivatives of the function f , in addition to the gradient, also converge tozero. Consequently, the trajectory slows down around the optimum and we can stably discretize theprocess with a large enough step size. This intuition ultimately translates into our main result.Theorem 1. (Main Result) Consider the second-order ODE in (11). Suppose that the function fis convex and Assumptions 1 and 2 are satisfied. Further, let s be the order of the Runge-Kuttaintegrator used in Algorithm 1, N be the total number of iterations, and x0 be the initial point.Also, let E0 := f(x0) − f(x∗) + ‖x0 − x∗‖2 + 1. Then, there exists a constant C1 such that if weset the step size as h = C1N

−1/(s+1)(L + M + 1)−1E−10 , the iterate xN generated after runningAlgorithm 1 for N iterations satisfies the inequality

f(xN )− f(x∗) ≤ C2E0[(L+M+1)E0

Ns

s+1

]p= O

(N−p

ss+1), (13)

where the constants C1 and C2 only depend on s, p, and the Runge-Kutta integrator. Since eachiteration consumes S gradient, f(xN )−f(x∗) will converge asO(S

pss+1N−

pss+1 ) with respect to the

number of gradient evaluations. Note that for commonly used Runge-Kutta integrators, S ≤ 8.

The proof of this theorem is quite involved; we provide a sketch in Section 4, deferring the detailedtechnical steps to the appendix. We do not need to know the constant C1 exactly in order to set thestep size h. Replacing C1 by any smaller positive constant leads to the same polynomial rate.

Theorem 1 indicates that if the objective has bounded high order derivatives and satisfies the flatnesscondition in Assumption 1 with p > 0, then discretizing the ODE in (11) with a high order integratorresults in an algorithm that converges to the optimal solution at a rate that is close to O(N−p). Inthe following corollaries, we highlight two special instances of Theorem 1.Corollary 2. If the function f is convex with L-Lipschitz gradients and is 4th order differentiable,then simulating the ODE (11) for p = 2 with a numerical integrator of order s = 2 for N iterationsresults in the suboptimality bound

f(xN )− f(x∗) ≤ C2(f(x0)− f(x∗) + ‖x0 − x∗‖2 + 1)3(L+M + 1)2

N4/3.

Note that higher order differentiability allows one to use a higher order integrator, which leads to theoptimal O(N−2) rate in the limit. The next example is based on high order polynomial or `p norm.

Corollary 3. Consider the objective function f(x) = ‖Ax + b‖44. Simulating the ODE (11) forp = 4 with a numerical integrator of order s = 4 for N iterations results in the suboptimality bound

f(xN )− f(x∗) ≤ C2(f(x0)− f(x∗) + ‖x0 − x∗‖2 + 1)5(L+M + 1)4

N16/5.

3.1 Logistic loss

Discretizing logistic loss f(x) = log(1 + e−wT x) does not fit exactly into the setting of Theorem

1 due to nonexistence of x∗. This potentially causes two problems. First, Assumption 1 is not welldefined. Second, the constant E0 in Theorem 1 is not well defined. We explain in this section howwe can modify our analysis to admit logistic loss by utilizing its structure of high order derivatives.

The first problem can be resolved by replacing f(x∗) by infx∈Rdf(x) in Assumption 1; then, thelogistic loss satisfies Assumption 1 with arbitrary integer p > 0. To approach the second problem,

5

we replace x∗ by x that satisfies the following relaxed inequalities. For some ε1, ε2, ε3 < 1 we have

〈x− x,∇f(x)〉 ≥ f(x)− f(x)− ε1, (14)

f(x)− f(x) ≥ 1L‖∇

(i)f(x)‖p

p−i − ε2, f(x)− infx∈Rd

f(x) ≤ ε3. (15)

As the inequalities are relaxed, there exists a vector x ∈ Rd that satisfies the above conditions. If wefollow the original proof and balance the additional error terms by picking x carefully, we obtain

Corollary 4. (Informal) If the objective is f(x) = log(1 + e−wT x), then discretizing the ODE (11)

with an order s numerical integrator for N iterations with step size h = O(N−1/(s+1)) results in aconvergence rate of O(Sp

ss+1N−p

ss+1 ).

4 Proof of Theorem 1

We prove Theorem 1 as follows. First(Proposition 5), we show that the suboptimality f(x(t)) −f(x∗) along the continuous trajectory of the ODE (11) converges to zero sufficiently fast. Sec-ond(Proposition 6), we bound the discretization error ‖Φh(yk) − ϕh(yk)‖, which measures thedistance between the point generated by discretizing the ODE and the true continuous solution.Finally(Proposition 7), a bound on this error along with continuity of the Lyapunov function (16)implies that the suboptimality of the discretized sequence of points also converges to zero quickly.

Central to our proof is the choice of a Lyapunov function used to quantify progress. We propose inparticular the Lyapunov function E : R2d+1 → R+ defined as

E([v;x; t]) :=t2

4p2‖v‖2 +

∥∥∥x+t

2pv − x∗

∥∥∥2 + tp(f(x)− f(x∗)). (16)

The Lyapunov function (16) is similar to the ones used by Wibisono et al. [2016], Su et al. [2014],except for the extra term t2

4p2 ‖v‖2. This term allows us to bound ‖v‖ by O(Et ). This dependency is

crucial for us to achieve the O(N−2) bound(see Lemma 11 for more details).

We begin our analysis with Proposition 5, which shows that the function E is non-increasing withtime, i.e., E(y) ≤ 0. This monotonicity then implies that both tp(f(x) − f(x∗)) and t2

4p2 ‖v‖2 are

bounded above by some constants. The bound on tp(f(x)− f(x∗)) provides a convergence rate ofO(1/tp) on the sub-optimality f(x(t))−f(x∗). It further leads to an upper-bound on the derivativesof the function f(x) in conjunction with Assumption 1.Proposition 5 (Monotonicity of E). Consider the vector y = [v;x; t] ∈ R2d+1 as a trajectory ofthe dynamical system (12). Let the Lyapunov function E be defined by (16). Then, for any trajectoryy = [v;x; t], the time derivative E(y) is non-positive and bounded above; more precisely,

E(y) ≤ − tp‖v‖2. (17)

The proof of this proposition follows from convexity and (11); we defer the details to Appendix A.

Next, to bound the Lyapunov function for numerical solutions, we need to bound the distance be-tween points in the discretized and continuous trajectories. As in Section 2.1, for the dynamicalsystem y = F (y), let Φh(y0) denote the solution generated by a numerical integrator starting atpoint y0 with step size h. Similarly, let ϕh(y0) be the corresponding true solution to the ODE.An ideal numerical integrator would satisfy Φh(y0) = ϕh(y0); however, due to discretization errorthere is always a difference between Φh(y0) and ϕh(y0) determined by the order of the integra-tor as in (7). Let {yk}Ni=0 be the sequence of points generated by the numerical integrator, that is,yk+1 = Φh(yk). In the following proposition, we derive an upper bound on the resulting discretiza-tion error ‖Φh(yk)− ϕh(yk)‖.Proposition 6 (Discretization error). Let yk = [vk;xk; tk] be the current state of the dynamicalsystem y = F (y) defined in (12). Suppose xk ∈ S defined in (2). If we use a Runge-Kuttaintegrator of order s to discretize the ODE for a single step with a step size h such that h ≤min{0.2, 1

(1+κ)C(1+E(yk))(M+L+1)}, then

‖Φh(yk)− ϕh(yk)‖ ≤ C ′hs+1(M+L+1)

[[(1 + E(yk))]s+1

tk+ h

[(1 + E(yk))]s+2

tk

], (18)

6

where the constants C, κ, and C ′ only depend on p, s, and the integrator.

The proof of Proposition 6 is the most challenging part in proving Theorem 1. Details may be foundin Appendix B. The key step is to bound ‖ ∂

s+1

∂hs+1 [Φh(yk) − ϕh(yk)]‖. To do so, we first boundthe high order derivative tensor ‖∇(i)f‖ using Assumption 1 and Proposition 5 within a region ofradius R. By carefully selecting R, we can show that for a reasonably small h, Φh(yk) and ϕh(yk)is constrained in the region. Second, we need to compute the high order derivatives of y = F (y)as a function of ∇(i)f which is bounded in the region of radius R. As shown in Appendix E, theexpressions for higher derivatives become quite complicated as the order increases. We approach thiscomplexity by using the notation for elementary differentials (see Appendix E) adopted from [Haireret al., 2006]; we then induct on the order of the derivatives to bound the higher order derivatives. Theflatness assumption (Assumption 1) provides bounds on the operator norm of high order derivativesrelative to the objective function suboptimality, and hence proves crucial in completing the inductivestep.

By the conclusion in Proposition 6 and continuity of the Lyapunov function E , we conclude that thevalue of E at a discretized point is close to its continuous counterpart. Using this observation, weexpect that the Lyapunov function values for the points generated by the discretized ODE do notincrease significantly. We formally prove this key claim in the following proposition.Proposition 7. Consider the dynamical system y = F (y) defined in (12) and the Lyapunov func-tion E defined in (16). Let y0 be the initial state of the dynamical system and yN be the final pointgenerated by a Runge-Kutta integrator of order s after N iterations. Further, suppose that Assump-tions 1 and 2 are satisfied. Then, there exists a constant C determined by p, s and the numericalintegrator, such that if the step size h satsfies h = C N−1/(s+1)

(L+M+1)(eE(y0)+1) , then we have

E(yN ) ≤ exp(1) E(y0) + 1. (19)

Please see Appendix C for a proof of this claim.

Proposition 7 shows that the value of the Lyapunov function E at the point yN is bounded aboveby a constant that depends on the initial value E(y0). Hence, if the step size h satisfies the requiredcondition in Proposition 7, we can see that

f(xN )− f(x∗) ≤ E(yN )tpN≤ eE(y0)+1

(1+Nh)p . (20)

The first inequality in (20) follows from the definition of the E (16). Replacing the step size h in(20) by the choice used in Proposition 7 yields

f(xN )− f(x∗) ≤ (L+M + 1)p(eE(y0) + 1)p+1

CNp ss+1

, (21)

and the claim of Theorem 1 follows.

Note: The dependency of the step size h on the degree of the integrator s suggests that an integratorof higher order allows for larger step size and therefore faster convergence rate.

5 Numerical experiments

In this section, we perform a series of numerical experiments to study the performance of the pro-posed scheme for minimizing convex functions through the direct discretization (DD) of the ODE in(11) and compare it with gradient descent (GD) as well as Nesterov’s accelerated gradient (NAG).All figures in this section are on log-log scale. For each method tested, we empirically choose thelargest step size among {10−k|k ∈ Z} subject to the algorithm remaining stable in the first 1000iterations.

5.1 Quadratic functions

We now verify our theoretical results by minimizing a quadratic convex function of the form f(x) =‖Ax− b‖2 by simulating the ODE in (11) for the case that p = 2, i.e.,

x(t) +5

tx(t) + 4∇f(x(t)) = 0,

7

(a) Quadratic objective (b) Objective as in (22)

Figure 1: Convergence paths of GD, NAG, and the proposed simulated dynamical system withintegrators of degree s = 1, s = 2, and s = 4. The objectives satisfy Assumption 1 with p=2.

where A ∈ R10×10 and b ∈ R10. The first five entries of b = [b1; . . . ; b10] are valued 0 and therest are 1. Rows Ai in A are generated by an i.i.d multivariate Gaussian distribution conditioned onbi. The data is linearly separable. Note that the quadratic objective f(x) = ‖Ax− b‖2 satisfies thecondition in Assumption 1 with p = 2. It is also clear that it satisfies the condition in Assumption 2regarding the bounds on higher order derivatives.

Convergence paths of GD, NAG, and the proposed DD procedure for minimizing the quadraticfunction f(x) = ‖Ax− b‖2 are demonstrated in Figure 1(a). For the proposed method we considerintegrators with different degrees, i.e., s ∈ {1, 2, 4}. Observe that GD eventually attains linear ratesince the function is strongly convex around the optimal solution. NAG displays local accelerationclose to the optimal point as mentioned in [Su et al., 2014, Attouch and Peypouquet, 2016]. ForDD, if we simulate the ODE with an integrator of order s = 1, the algorithm is eventually unstable.This result is consistent with the claim in [Wibisono et al., 2016] and our theorem that requiresthe step size to scale with O(N−0.5). Notice that using a higher order integrator leads to a stablealgorithm. Our theoretical results suggest that the convergence rate for s ∈ {1, 2, 4} should be worsethan O(N−2) and one can approach such rate by making s sufficiently large. However, as shown inFigure 1(a), in practice with an integrator of order s = 4, the DD algorithm achieves a convergencerate of O(N−2). Hence, our theoretical convergence rate in Theorem 1 might be conservative.

We also compare the performances of these algorithms when they are used to minimize

f([x1, x2]) = ‖Ax1 − b‖2 + ‖Cx2 − d‖44. (22)

Matrix C and vector d are generated similarly as A and b. The result is shown in Figure 1(b). Asexpected, we note that GD no longer converges linearly, but the other methods converge at the samerate as in Figure 1(a).

5.2 Decoupling ODE coefficients with the objective

Throughout this paper, we assumed that the constant p in (11) is the same as the one in Assumption 1to attain the best theoretical upper bounds. In this experiment, however, we empirically explore theconvergence rate of discretizing the ODE

x(t) +2q + 1

tx(t) + q2tq−2∇f(x(t)) = 0,

when q 6= p. In particular, we use the same quadratic objective f(x) = ‖Ax− b‖2 as in the previoussection. This objective satisfies Assumption 1 with p = 2. We simulate the ODE with differentvalues of q using the same numerical integrator with the same step size. Figure 2 summarizesthe experimental results. We observe that when q > 2, the algorithm diverges. Even though thesuboptimality along the continuous trajectory will converge at a rate of O(t−p) = O(t−2), thediscretized sequence cannot achieve the lower bound which is of O(N−2).

8

Figure 2: Minimizing quadratic objective by simulating different ODEs with the RK44 integrator(4th order). In the case when p = 2, the optimal choice for q is 2.

(a) The objective is an `4 norm. (b) The objective is a logistic loss.

Figure 3: Experiment results for the cases that Assumption 1 holds for p > 2.

5.3 Beyond Nesterov’s acceleration

In this section, we discretize ODEs with objective functions that satisfy Assumption 1 with p > 2.For all ODE discretization algorithms, we use an order-2 RK integrator that calls the gradient oracletwice per iteration. Then we run all algorithms for 106 iterations and show the results in Figure 3.As shown in Example 2, the objective

f(x) = ‖Ax− b‖44 (23)

satisfies Assumption 1 for p = 4. By Theorem 1 if we set q = 4, we can achieve a convergencerate close to the rate O(N−4). We run the experiments with different values of q and summarizethe results in Figure 3(a). Note that when q > 2, the convergence of direct discretization methodsis faster than NAG. Interestingly, when q = 6 > p = 4, the discretization is still stable withconvergence rate roughly O(N−5). This suggests that our theorem may be conservative.

We then simulate the ODE for the objective function

f(x) =

10∑i=1

log(1 + e−wTi x),

for a dataset of linearly separable points. The data points are generated in the same way as inSection 5.1. As shown in Section 3.1, it satisfies Assumption 1 for arbitrary p > 0. As shown inFigure 3(b), the objective decreases faster for larger q; this verifies Corollary 4.

9

6 Discussion

This paper specifies sufficient conditions for stably discretizing an ODE to obtain accelerated first-order (i.e., purely gradient based) methods. Our analysis allows for the design of optimizationmethods via direct discretization using Runge-Kutta integrators based on the flatness of objectivefunctions. complementing the existing studies that derive ODEs from optimization methods, weshow that one can prove convergence rates of a optimization algorithms by leveraging properties ofits ODE representation. We hope that this perspective will lead to more general results.

In addition, we identified a new condition in Assumption 1 that quantifies the local flatness of con-vex functions. At first, this condition may appear counterintuitive, because gradient descent actuallyconverges fast when the objective is not flat and the progress slows down if the gradient vanishesclose to the minimum. However, when we discretize the ODE, the trajectories with vanishing gra-dients oscillate slowly, and hence allow stable discretization with large step sizes, which ultimatelyallows us to achieve acceleration. We think this high-level idea, possibly as embodied by Assump-tion 1 could be more broadly used in analyzing and designing other optimization methods.

Based on the above two points, this paper contains both positive and negative message for the recenttrend in ODE interpretation of optimization methods. On one hand, it shows that with carefulanalysis, discretizing ODE can preserve some of its trajectories properties. On the other hand, ourproof suggests that nontrivial additional conditions might be required to ensure stable discretization.Hence, designing an ODE with nice properties in the continuous domain doesn’t guarantee theexistence of a practical optimization algorithm.

Although our paper answers a fundamental question regarding the possibility of obtaining acceler-ated gradient methods by directly discretizing second order ODEs (instead of reverse engineeringNesterov-like constructions), it does not fully explain acceleration. First, unlike Nesterov’s acceler-ated gradient method that only requires first order differentiability, our results require the objectivefunction to be (s + 2)-times differentiable (where s is the order of the integrator). Indeed, theprecision of numerical integrators only increases with their order when the function is sufficientlydifferentiable. This property inherently limits our analysis. Second, while we achieve the O(N−2)convergence rate, some of the constants in our bound are loose (e.g., for squared loss and logisticregression they are quadratic in L versus linear in L for NAG). Achieving the optimal dependenceon initial errors f(x0) − f(x∗), the diameter ‖x0 − x∗‖, as well as constants L and M requiresfurther investigation.

Acknowledgement

AJ and SS acknowledge support in part from DARPA FunLoL, DARPA Lagrange; AJ also acknowl-edges support from an ONR Basic Research Challenge Program, and SS acknowledges support fromNSF-IIS-1409802.

References

Z. Allen-Zhu and L. Orecchia. Linear coupling: An ultimate unification of gradient and mirrordescent. arXiv preprint arXiv:1407.1537, 2014.

F. Alvarez. On the minimizing property of a second order dissipative system in hilbert spaces. SIAMJournal on Control and Optimization, 38(4):1102–1119, 2000.

H. Attouch and R. Cominetti. A dynamical approach to convex minimization coupling approxima-tion with the steepest descent method. Journal of Differential Equations, 128(2):519–540, 1996.

H. Attouch and J. Peypouquet. The rate of convergence of nesterov’s accelerated forward-backwardmethod is actually faster than 1/kˆ2. SIAM Journal on Optimization, 26(3):1824–1834, 2016.

H. Attouch, X. Goudou, and P. Redont. The heavy ball with friction method, i. the continuousdynamical system: global exploration of the local minima of a real-valued function by asymptoticanalysis of a dissipative dynamical system. Communications in Contemporary Mathematics, 2(01):1–34, 2000.

10

H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projec-tion methods for nonconvex problems: An approach based on the kurdyka-łojasiewicz inequality.Mathematics of Operations Research, 35(2):438–457, 2010.

M. Betancourt, M. I. Jordan, and A. C. Wilson. On symplectic optimization. arXiv preprintarXiv:1802.03653, 2018.

R. E. Bruck Jr. Asymptotic convergence of nonlinear contraction semigroups in hilbert space. Jour-nal of Functional Analysis, 18(1):15–26, 1975.

S. Bubeck, Y. T. Lee, and M. Singh. A geometric alternative to nesterov’s accelerated gradientdescent. arXiv preprint arXiv:1506.08187, 2015.

J. Diakonikolas and L. Orecchia. The approximate duality gap technique: A unified theory of first-order methods. arXiv preprint arXiv:1712.02485, 2017.

M. Fazlyab, A. Ribeiro, M. Morari, and V. M. Preciado. Analysis of optimization algorithms viaintegral quadratic constraints: Nonstrongly convex problems. arXiv preprint arXiv:1705.03615,2017.

E. Hairer, C. Lubich, and G. Wanner. Geometric numerical integration: structure-preserving al-gorithms for ordinary differential equations, volume 31. Springer Science & Business Media,2006.

B. Hu and L. Lessard. Dissipativity theory for nesterov’s accelerated method. arXiv preprintarXiv:1706.04381, 2017.

E. Isaacson and H. B. Keller. Analysis of numerical methods. Courier Corporation, 1994.W. Krichene, A. Bayen, and P. L. Bartlett. Accelerated mirror descent in continuous and discrete

time. In Advances in neural information processing systems, pages 2845–2853, 2015.L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral

quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.S. Lojasiewicz. Ensembles semi-analytiques. Lectures Notes IHES (Bures-sur-Yvette), 1965.A. Nemirovskii, D. B. Yudin, and E. R. Dawson. Problem complexity and method efficiency in

optimization. 1983.Y. Nesterov. A method of solving a convex programming problem with convergence rate o (1/k2).

In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.M. Raginsky and J. Bouvrie. Continuous-time stochastic mirror descent on a network: Variance

reduction, consensus, convergence. In Decision and Control (CDC), 2012 IEEE 51st AnnualConference on, pages 6793–6800. IEEE, 2012.

D. Scieur, A. d’Aspremont, and F. Bach. Regularized nonlinear acceleration. In Advances In NeuralInformation Processing Systems, pages 712–720, 2016.

D. Scieur, V. Roulet, F. Bach, and A. d’Aspremont. Integration methods and accelerated optimiza-tion algorithms. arXiv preprint arXiv:1702.06751, 2017.

W. Su, S. Boyd, and E. Candes. A differential equation for modeling nesterovs accelerated gradientmethod: Theory and insights. In Advances in Neural Information Processing Systems, pages2510–2518, 2014.

J. Verner. High-order explicit runge-kutta pairs with low stage order. Applied numerical mathemat-ics, 22(1-3):345–357, 1996.

M. West. Variational integrators. PhD thesis, California Institute of Technology, 2004.A. Wibisono, A. C. Wilson, and M. I. Jordan. A variational perspective on accelerated methods in

optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.

11

A Proof of Proposition 5

According to the dynamical system in (12) we can write

x = v, x = v = −2p+ 1

tv − p2tp−2∇f(x). (24)

Using these definitions we can show that

E =t2

4p2〈2v, v〉+

2t

4p2〈v, v〉+ 2〈x+

t

2pv − x∗, x+

x

2p+

t

2px〉+ tp〈∇f(x), x〉

+ ptp−1(f(x)− f(x∗))

=2t2

4p2〈x, x+

2p+ 1

tx〉 − 2t2

4p2〈x, 2p

tx〉+ 2

t

2p〈x+

t

2px− x∗, x+

2p+ 1

tx〉

+ tp〈∇f(x), x〉+ ptp−1(f(x)− f(x∗))

=t2

2p2〈x,−p2tp−2∇f〉 − t

p‖x‖2 +

t

p〈x+

t

2px− x∗,−p2tp−2∇f〉

+ tp〈∇f(x), x〉+ ptp−1(f(x)− f(x∗))

=− t

p‖x‖2 + ptp−1(f(x)− f(x∗))− ptp−1〈x− x∗,∇f〉

≤ − t

p‖x‖2. (25)

The equalities follows from rearrangement and (11). The last inequality holds due to convexity.

B Proof of Proposition 6 (Discretization Error)

In this section, we aim to bound the difference between the true solution defined by the ODE andthe point generated by the integrator, i.e., ‖Φh(yc) − ϕh(yc)‖. Since the integrator has order s, thedifference ∆(h) := ‖Φh(yc)−ϕh(yc)‖ should be proportional to hs+1. Here, we intend to formallyderive an upper bound of O(hs+1) on ∆(h).

We start by introducing some notations. Given a vector y = [v;x; t] ∈ R2d+1, we define thefollowing projection operators

πx(y) = x ∈ Rd, πv(y) = v ∈ Rd, πt(y) = t ∈ R, πv,x(y) =

[vx

]∈ R2d. (26)

We also define the set B(xc, R) which is a ball with center xc and radius R as

B(xc, R) = {x ∈ Rd|‖x− xc‖ ≤ R}, (27)

and define the set UR,0.2(yc) as

UR,0.2(yc) = {y = [v;x; t]|‖v − vc‖ ≤ R, ‖x− xc‖ ≤ R, |t− tc| ≤ 0.2}. (28)

In the following Lemma, we show that if we start from the point yc and choose a sufficiently smallstepsize, the true solution defined by the ODE ϕh(y0) and the point generated by the integratorΦh(yc) remain in the set UR,0.2(yc).

Lemma 8. Let y ∈ UR,0.2(yc) where yc = [vc;xc; tc], tc ≥ 1, and R = 1tc

. Supposethat B(xc, R) ⊆ A (defined in (3)) and hence Assumptions 1 and 2 are satisfied. If h ≤min{0.2, 1

(1+κ)C(E(yc)+1)(L+M+1)}, the true solution defined by the ODE ϕh(y0) and the pointgenerated by the integrator Φh(yc) remain in the set UR,0.2(yc), i.e.,

ϕh(yc) ∈ UR,0.2(yc), Φh(yc) ∈ UR,0.2(yc), (29)

where κ is a constant determined by the Runge-Kutta integrator. In addition, the intermediate pointsgi defined in Definition 1 also belong to the set UR,0.2(yc).

12

Proof. Note that ∀y ∈ R2d+1, ‖πtF (y)‖ = 1. Clearly when h ≤ 0.2,

πtϕh(yc)− yc = h ≤ 0.2. (30)

Similarly, for any integrator that is at least order 1,

πtΦh(yc)− yc = h ≤ 0.2. (31)

Therefore, we only need to focus on bounding the remaining coordinates.

By Lemma 10, we have that when y ∈ UR,0.2(yc),

‖πv,xF (y)‖ ≤ C(E(yc) + 1)(L+M + 1)

tc. (32)

By definition 1,

gi = yk + h

i−1∑j=1

aijF (gj) Φh(yk) = yk + h

s−1∑i=0

biF (gi).

Let κ = max{∑j |aij |,

∑|bi|}, we have that when h ≤ min{0.2, R/[κC(E(yc)+1)(L+M)

tc]},

gi ∈ UR,0.2(yc) Φh(yc) ∈ UR,0.2(yc). (33)

By fundamental theorem of calculus, we have that

ϕh(yc) = yc +

∫ h

0

F (ϕt(yc))dt ∈ UR,0.2(yc). (34)

Rearrange and apply Cauchy-Schwarz, we get

‖πv,x[ϕh(yc)− yc]‖ ≤∫ h

0

‖πv,xF (ϕt(yc))‖dt ∈ UR,0.2(yc). (35)

By mean value theorem and proof of contradiction, we can show that when h ≤min{0.2, R/C(E(yc)+1)(L+M)

tc}, ∫ h

0

‖πv,xF (ϕt(yc))‖dt ≤ R. (36)

In particular, if∫ h0‖πv,xF (ϕt(yc))‖dt ≥ R, then exists y1 and h0 < h such that ‖y1 − yc‖ = R

and y1 = yc +∫ h0

0F (ϕt(yc))dt. By mean value theorem, this implies that exist y ∈ UR,0.2(yc)

such that ‖πv,xF (y)‖ > C(E(yc)+1)(L+M+1)tc

, which contradicts Lemma 10.

Therefore we proved thatϕh(yc) ∈ UR,0.2(yc). (37)

The result in Lemma 8 shows that ϕh(yc) and Φh(yc) remain in the set UR,0.2(yc). In addition, wecan bound the operator norm of ∇(i)f in B(xc, R) by Lemma 9. Since ∂qϕh(yc)

∂hq is a function of∇(i)f , we can show in Lemma 11 that the (s + 1)th derivative of ϕh(yc) and Φh(yc) are boundedabove by ∥∥∥∥∂qϕh(yc)

∂hq

∥∥∥∥ ≤ C0[E(yc) + 1]q(L+M + 1)q

tc, (38)

and∥∥∥∥∂qΦh(yc)

∂hq

∥∥∥∥ ≤ C1[1 + E(yc)]q(L+M + 1)q + C2h[1 + E(yc)]

q+1(L+M + 1)p+1

tc. (39)

Since the integrator has order s, we can write

∂i

∂hi[Φh(yk)− ϕh(yk)] = 0 for i = 1, ..., s. (40)

13

Therefore, the difference between the true solution ϕh(yc) defined by the ODE and the point Φh(yc)generated by the integrator can be upper bounded by

‖Φh(yc)− ϕh(yc)‖ ≤(∥∥∥∥∂s+1ϕh(yk)

∂hs+1

∥∥∥∥+

∥∥∥∥∂s+1Φh(yk)

∂hs+1

∥∥∥∥)hs+1 (41)

Replacing the norms on the right hand side of (41) by their upper bounds in (38) and (39) impliesthat

‖Φh(yc)− ϕh(yc)‖ ≤ hs+1

[(C0 + C1)[E(yc) + 1]s+1(M + L+ 1)s+1

tc

]+ hs+2

[C2[1 + E(yc)]

s+2(M + L+ 1)s+2

tc

]. (42)

By replacing yc = [vc;xc; tc] in (42) by yk = [vk;xk; tk] the claim in (18) follows.

C Proof of Proposition 7 (Analysis of discrete Lyapunov functions)

As defined earlier in Section 4, Φh(yk) is the solution generated by the numerical integrator, andϕh(yk) is a point on the trajectory of the ODE. yc = [~0;xc; 1] is the initial point of the ODE. Recallthat {yk}Ni=0 is the sequence of points produced by the numerical integrator, i.e., yk+1 = Φh(yk).

To simplify the notation, we let Ek = E(yk), Ek+1 = E(Φh(yk)), y = ϕh(yk) = [v; x; t + h],y = Φh(yk) = [v; x; t+ h].

We want to prove by induction on k = 0, 1, ..., N that

Ek ≤ (1 +1

N)kE0 +

k

N. (43)

The base case E0 ≤ E0 is trivial. Now let’s assume by induction that the inequality in (43) holdsfor k = j, i.e.,

Ej ≤ (1 +1

N)jE0 +

j

N. (44)

By this assumption, we know that f(xk) ≤ eE0+1tpk

≤ eE0 + 1 and hence xk ∈ S defined in (2).

Note that R = 1tk≤ 1. We then have

B(xk, R) ⊆ B(xk, 1) ∈ A (45)

for A defined in (3). By assumption in Proposition 5,

h ≤ 0.2, h ≤ 1

(1 + κ)C(eE0 + 2)(L+M + 1). (46)

By utilizing the bound on ‖Φh(yk) − ϕh(yk)‖ and the continuity of E(y), we show in Lemma 13that the discretization error of ‖E(y)− E(y)‖ is upper bounded by

‖E(Φh(yk))− E(ϕh(yk))‖ (47)

≤ C ′hs+1[(1 + Ek)s+1(L+M + 1)s+1+ h(1 + Ek)s+2(L+M + 1)s+2](Ek + Ek+1 + 1),

under conditions in (45) and (46). C ′ only depends on p, s and the numerical integrator.

We proceed to prove the inductive step. Start by writing Ek+1 = E(Φh(yk)) as

E(Φh(yk)) = E(yk) + E(ϕh(yk))− E(yk) + E(Φh(yk))− E(ϕh(yk)). (48)

According to Proposition 5, E(ϕh(yk))− E(yk) ≤ 0. Therefore,

Ek+1 ≤ Ek + ‖E(y)− E(y)‖. (49)

Replace the norm ‖E(y)− E(y)‖ = ‖E(Φh(yk))− E(ϕh(yk))‖ by its upper bound (47) to obtain

Ej+1 ≤ Ej+Chs+1[(1+Ej)s+1(L+M+1)s+1+h(1+Ej)

s+2(L+M+1)s+2](Ej+Ej+1+1).(50)

14

Before proving the inductive step, we need to ensure that the step size h is sufficiently small. Here,we further add two more j-independent conditions on the choice of step size h. In particular, weassume that

h ≤ 1

eE0 + 2, hs+1 ≤ 1

3(1 + C−1)C ′N(eE0 + 2)s+1(L+M + 1)s+1. (51)

Note that since we want show the claim in (43) for k = 1, . . . , N , in inductive assumptions we havethat j ≤ N − 1. Now we proceed to show that if the inequality in (43) holds for k = j it also holdsfor k = j + 1. By setting k = j in (50) we obtain that

Ej+1 ≤ Ej+C ′hs+1(1+Ej)s+1(L+M+1)s+1[1+h(1+Ej)(L+M+1)](Ej+Ej+1+1). (52)

Using the assumption of induction in (44) we can obtain that Ej ≤ eE0 + 1 by setting j = n in theright hand side. Using this inequality and the second condition in (46), we can write

h ≤ 1

C(eE0 + 2)(L+M + 1)≤ 1

C(Ej + 1)(L+M + 1)(53)

Using this expression we can simplify (52) toEj+1 ≤ Ej + (1 + C−1)C ′hs+1(1 + Ej)

s+1(L+M + 1)s+1(Ej + Ej+1 + 1). (54)We can further show that

(1 + C−1)C ′hs+1(1 + Ej)s+1(L+M + 1)s+1

≤ (1 + C−1)C ′hs+1(2 + eE0)s+1(L+M + 1)s+1 ≤ 1

3N, (55)

where the first inequality holds since Ej ≤ eE0 + 1 and the second inequality holds due to thesecond condition in (51). Simplifying the right hand side of (54) using the upper bound (55) leadsto

Ej+1 ≤ Ej +1

3N(Ej + Ej+1 + 1). (56)

Regroup the terms in (56) to obtain that Ej+1 is upper bounded by

Ej+1 ≤(

1 + 13N

1− 13N

)Ej +

1

3N − 1(57)

Now replace Ej by its upper bound in (44) to obtain

Ej+1 ≤(

1 + 13N

1− 13N

)((1 +

1

N

)jE0 +

j

N

)+

1

3N − 1

=

(1 + 1

3N

1− 13N

)(1 +

1

N

)jE0 +

(1 + 1

3N

1− 13N

)j

N+

1

3N − 1

=

(3N + 1

3N − 1

)(1 +

1

N

)jE0 +

(3N + 1

3N − 1

)j

N+

1

3N − 1

≤(

1 +1

N

)j+1

E0 +

(3N + 1

3N − 1

)j

N+

1

3N − 1, (58)

where the first inequality holds since 3N+13N−1 ≤

N+1N and the last inequality follows from 1+ 2

3N−1 ≥1 + 1

N . Further, we can show that(3N + 1

3N − 1

)j

N+

1

3N − 1=

(1 +

2

3N − 1

)j

N+

1

3N − 1

=j

N+

(2

3N − 1

)j

N+

1

3N − 1

≤ j

N+

(2

3N − 1

)N − 1

N+

1

3N − 1

=j

N+

1

N

(3N − 2

3N − 1

)≤ j + 1

N, (59)

15

where in the first inequality we use the fact that j ≤ N − 1. Using the inequalities in (58) and (59)we can conclude that

Ej+1 ≤(

1 +1

N

)j+1

E0 +j + 1

N, (60)

Therefore, the inequality in (43) is also true for k = j + 1. The proof is complete by induction andwe can write

EN ≤ eE0 + 1. (61)Now if we reconsider the conditions on h in (46) and (51), we can conclude that there exists aconstant C that is determined by p, s and the numerical integrator, such that

h ≤ C N−1/(s+1)

(L+M + 1)(eE0 + 1), (62)

satisfies all the inequalities in (46) and (51).

D Bounding operator norms of derivatives and discretization errors ofLyapunov functions

Lemma 9. Given state yc = [vc;xc; tc] with tc ≥ 1 and the radius R = 1tc

, if B(xc, R) ⊆ A(defined in (3)) and hence Assumptions 1,2 hold, then for all y ∈ UR,0.2(yc) we can write

‖∇(i)f(x)‖ ≤ p(M + L+ 1)E(yc) + 1

tp−ic

. (63)

Proof. Based on Assumption 2, we know that

‖∇(p)f(x)‖ ≤M. (64)

We further can show that the norm ‖∇(p−1)f(x)‖ is upper hounded by

‖∇(p−1)f(x)‖ = ‖∇(p−1)f(xc) +∇(p−1)f(x)−∇(p−1)f(xc)‖≤ ‖∇(p−1)f(xc)‖+ ‖∇(p−1)f(x)−∇(p−1)f(xc)‖ (65)

Using the bound in (64) and the mean value theorem we can show that ‖∇(p−1)f(x) −∇(p−1)f(xc)‖ ≤ M‖x − xc‖ ≤ MR, where the last inequality follows from y ∈ UR,0.2(yc).Applying this substitution into (65) implies that

‖∇(p−1)f(x)‖ ≤ ‖∇(p−1)f(xc)‖+MR

≤ [L(f(xc)− f(x∗))]1p +MR, (66)

where the first inequality holds due to definition of operator norms and the last inequality holds dueto the condition in Assumption 1. By following the same steps one can show that

‖∇(p−2)f(x)‖ ≤ [L(f(xc)− f(x∗))]2p +R[[L(f(xc)− f(x∗))]

1p +MR] (67)

By iteratively applying this procedure we obtain that if y = [x; v; t] ∈ R2d+1 belongs to the setUR,0.2(yc), then we have

‖∇(i)f(x)‖ ≤MRp−i +

p−1∑j=i

[L(f(xc)− f(x∗))]p−jp Rj−i. (68)

Notice that since p−jp ≤ 1 for j = 1, . . . , p − 1, it follows that we can write L

p−jp ≤ 1 + L.

Moreover, the definition of the Lyapunov function E in (16) implies that

[f(xc)− f(x∗)]p−jp ≤ E(yc)

p−jp

tp−jc

≤ 1 + E(yc)

tp−jc

(69)

16

where the last inequality follows from the fact that E(yc)p−jp ≤ 1 + E(yc) for j = 1, . . . , p − 1.

Therefore, we can simplify the upper bound in (68) by

‖∇(i)f(x)‖ ≤MRp−i +

p∑j=i

(1 + L)(1 + E(yc))

tp−jc

Rj−i. (70)

By replacing the radius R with 1/tc we obtain that

‖∇(i)f(x)‖ ≤ M

tp−ic

+

p∑j=i

(1 + L)(1 + E(yc))

tp−ic

=M + p(1 + L)(1 + E(yc))

tp−ic

(71)

As the Lyapunov function E(yc) is always non-negative, we can write M ≤ Mp(1 + E(yc)). Ap-plying this substitution into (71) yields

‖∇(i)f(x)‖ ≤ p(L+M + 1)(1 + E(yc))

tp−ic

, (72)

and the claim in (63) follows.

Lemma 10. If B(xc, R) ⊆ A (defined in (3)) and hence Assumptions 1 and 2 hold, there exists aconstant C determined by p such that, ∀y ∈ UR,0.2(yc) where yc = [vc;xc; tc], tc ≥ 1 and R = 1

tc,

we have

‖πx,vF (y)‖ =≤ C(E(yc) + 1)(L+M + 1)

tc. (73)

Proof. According to Lemma 9, we can write that

‖∇f(x)‖ ≤ p(M + L+ 1)E(yc) + 1

tp−1c

. (74)

Further, the definition of the Lyapunov function in (16) implies that

‖vc‖ ≤2pE(yc)

0.5

tc. (75)

Since y ∈ UR,0.2(yc), we have that|t− tc| ≤ 0.2, ‖v − vc‖ ≤ R, ‖x− xc‖ ≤ R. (76)

Further, based on the dynamical system in (12), we can write

‖πx,vF (y)‖ =

∥∥∥∥[− 2p+1t v − p2tp−2∇f(x)

v

]∥∥∥∥≤ 2p+ 1

t‖v‖+ ‖p2tp−2∇f(x)‖+ ‖v‖

≤(

2p+ 1

t+ 1

)(‖vc‖+ ‖vc − v‖) + p2tp−2‖∇f(x)‖, (77)

where the first inequality is obtained by using the property of norm, and in the last one we use thetriangle inequality. Note that according to (76) we have t ≥ tc − 0.2. Since tc ≥ 1 it implies thatt ≥ 0.8tc. In addition we can also show that t ≤ tc + 0.2 ≤ 1.2tc. Applying these bounds into (77)yields

‖πx,vF (y)‖ ≤(p+ 1

0.8tc+ 1

)(‖vc‖+ ‖vc − v‖) + (1.2)p−2p2tp−2c ‖∇f(x)‖ (78)

Replace ‖∇f(x)‖, ‖vc‖, and ‖vc − v‖ in (78) by their upper bounds in (74), (75), and (76), respec-tively, to obtain

‖πx,vF (y)‖ ≤(p+ 1

0.8tc+ 1

)(2pE(yc)

0.5

tc+R

)+ (1.2)p−2p3(M + L+ 1)

E(yc) + 1

tc

≤(p+ 1

0.8tc+ 1

)(2p(E(yc) + 1) + 1

tc

)+ (1.2)p−2p3(M + L+ 1)

E(yc) + 1

tc, (79)

17

where in the second inequality we replace R by 1/tc and E(yc)0.5 by its upper bound E(yc) + 1.

Considering that tc ≥ 1 and the result in (79) w obtain that there exists a constant C such that

‖πx,vF (y)‖ ≤ C(E(yc) + 1)(L+M + 1)

tc, (80)

where C only depends on p.

Lemma 11. Given state yc = [vc, xc, tc] with tc ≥ 1, let R = 1tc

. If B(xc, R) ⊆ A (defined in (3))and hence Assumptions 1,2 hold, then when h ≤ min{0.2, 1

(1+κ)C(E(yc)+1)(L+M+1)}, we have

∥∥∥∥∂qϕh(yc)

∂hq

∥∥∥∥ ≤ C0[E(yc) + 1]q(L+M + 1)q

tc, (81)

and∥∥∥∥∂qΦh(yc)

∂hq

∥∥∥∥ ≤ C1[1 + E(yc)]q(L+M + 1)q + C2h[1 + E(yc)]

q+1(L+M + 1)p+1

tc, (82)

where C and κ are the same constants as in Lemma 10. Further, the constants C1, C2, C3 aredetermined by p, q, and the integrator.

Remark 12. In the proof below, we reuse variants of symbol C(e.g.C1, C2, C) to hide constants de-termined by p, q and the integrator. We recommend readers to focus on the degree of the polynomialsin (L+M + 1), E(yc), h, tc, and check that the rest can be upper-bounded by variants of symbol C.We frequently use two tricks in this section. First, for a ∈ (0, 1), we can bound

ca ≤ c+ 1 (83)

Second, note that given tc ≥ 1,for any n > 0, there exist constants C1, C2, C3 determined by n suchthat for all t subject to |t− tc| ≤ 0.2,

1

tn≤ C1

tnc≤ C2t

n ≤ C3tnc (84)

Proof. Notice that the system dynamic function F : R2d+1 → R2d+1 in Equation (12) is a vec-tor valued multivariate function. We denote its ith order derivatives by ∇(i)F (y), which is a(2d+ 1)× . . .×(2d+ 1)︸︷︷︸

i+1 times

tensor. The tensor is symmetric by continuity and Schwartz theorem. As

a shorthand, we use ∇(i)F to denote ∇(i)F (y). We know that y(i) = F (i−1)(y) = ∂iy∂ti . Notice that

F (i−1)(y) is a vector. As an example, we can write

y(1) =F

y(2) =F (1) = ∇F (F )

y(3) =F (2) = ∇(2)F (F, F ) +∇F (∇F (F )). (85)

The derivative∇(i)F (y) can be interpreted as a linear map: ∇(i)F : R2d+1× . . .×R2d+1︸︷︷︸i times

→ R2d+1.

∇(2)F (F1, F2) maps F1, F2 to some element in R2d+1. Enumerating the expressions will soon getvery complicated. However, we can express them compactly with elementary differentials summa-rized in Appendix E (see Chapter 3.1 in Hairer et al. [2006] for details).

18

First we bound ∇(i)F by explicitly computing its entries. Let a(t) = p2tp−2 and b(t) = 2p+1t .

Based on the definition in (12), we obtain that

∂k+1F

∂v∂tk=

−b(k)(t)II(k)

0

, ∂kF

∂tk=

−b(k)(t)v − a(k)(t)∇f(x)00

,∂i+kF

∂xi∂tk=

−a(k)(t)∇i+1f(x)00

, ∂iF

∂xi=

−a(t)∇i+1f(x)00

,∂i+jF

∂vj∂xi=0,

∂F

∂v=

2p+1t II0

, ∂jF

∂vj= 0, j ≥ 2. (86)

(87)

For any vector y = [v;x; t] ∈ UR,0.2(yc), we can show that the norm of∇(n)F is upper bounded by

‖∇(n)F (F1, F2, ..., Fn)‖ ≤ ‖a(t)∇(n+1)f(x)‖∏i∈[n]

‖πxFi‖

+ ‖b(n)(t)v + a(n)(t)∇f(x)‖∏i∈[n]

‖πtFi‖

+

n−1∑k≥1

∑S⊂[n]|S|=k

‖a(k)(t)∇(n−k+1)f(x)‖

[∏s∈S‖πtFs‖

] ∏s′∈[n]/S

‖πxFs′‖

+∑i∈[n]

‖b(n−1)(t) + 1‖‖πvFi‖∏j 6=i

‖πtFj‖. (88)

Using the definition of the Lyapunov function E and the definition of the set UR,0.2(yc) it can beshown that

‖vc‖ ≤E(yc)

0.5

tc≤ E(yc) + 1

tc, tc ≥ 1, |t− tc| ≤ 0.2, ‖v − vc‖ ≤ R. (89)

Further, the result in Lemma 9 implies that

‖∇(i)f(x)‖ ≤ p(M + L+ 1)E(yc) + 1

tp−ic

. (90)

Substituting the upper bounds in (89) and (90) into (88) implies that for n = 1, . . . , p we can write

‖∇(n)F (F1, F2, ..., Fn)‖

≤ C1(M + L+ 1)[E(yc) + 1]tn−1c

∏i∈[n]

‖πxFi‖

+ C2(M + L+ 1) [E(yc) + 1] t−n−1c

∏i∈[n]

‖πtFi‖

+ C3(M + L+ 1)

p−1∑k≥1

[E(Fc) + 1] tn−2k−1c

∑S⊂[n]|S|=k


] ∏s′∈[n]/S

‖πxFs′‖

+ C4

∑i∈[n]

[1 +

1

tnc

]‖πvFi‖

∏j 6=i

‖πtFj‖, (91)

where C1, C2, C3, and C4 only depend on n and p.

19

For n = p, p + 1, ..., s, we can get similar bounds. To do so, not only we use the result in (90), butalso we use the bounds guaranteed by Assumption 2. Hence, for n = p, p+ 1, ..., s it holds

‖∇(n)F (F1, F2, ..., Fn)‖

≤ C1Mtp−2c

∏i∈[n]

‖πxFi‖

+ C2(M + L+ 1)[E(yc) + 1]t−n−1c

∏i∈[n]

‖πtFi‖

+ C3

p−1∑k≥1

(M + L+ 1)[E(yc) + 1]tp−k−2c

∑S⊂[n]|S|=k


] ∏s′∈[n]/S

‖πxFs′‖

+ C4

∑i∈[n]

[1 +

1

tnc

]‖πvFi‖

∏j 6=i

‖πtFj‖. (92)

Finally we are ready to bound the time derivatives. We first bound the elementary differentials F (τ)defined in Section E Definition 2. Let F (τ) = F (τ)(y) for convenience. We claim that when|τ | ≤ q, then ∀y ∈ UR,0.2(yc)

‖πtF (τ)‖ ≤ 1, ‖πv,xF (τ)‖ ≤ C|τ |(L+M + 1)|τ |[E(yc) + 1]|τ |

tc, (93)

where the constant Cq only depends on p and q. We use induction to prove the claims in (93).The base case is trivial as we have shown in Lemma 10 that ‖πx,vF (•)(y)‖ = ‖πx,vF (y)‖ ≤C(E(yc)+1)(L+M)

tc, and ‖πtF (•)(y)‖ = ‖πtF (y)‖ = 1. Since the last coordinate grows linearly

with rate 1 no matter what x, v are, it can be shown that

πtF (τ)(y) = 0,∀|τ | ≥ 2. (94)

We hence focus on proving the upper bound for the norm ‖πx,vF (τ)(y)‖ in (93).

Now assume |τ | = q and it has m subtrees attached to the root, τ = [τ1, ..., τm] with∑mi=1 |τi| =

q − 1. When m ≤ p− 1, by (91) we obtain

‖∇(m)F (F (τ1), ..., F (τm))‖

≤ C1[(M + L+ 1)(E(yc) + 1)]tm−1c

∏i∈[m]

‖πxF (τi)‖

+ C2(M + L+ 1)[E(yc) + 1]t−m−1c

∏i∈[m]

‖πtF (τi)‖

+ C3

m−1∑k≥1

[(M + L+ 1)(E(yc) + 1)1]tm−2k−1c

∑S⊂[m]|S|=k

[∏s∈S‖πtF (τs)‖

] ∏s′∈[m]/S

‖πxF (τs′)‖

+ C4

∑i∈[m]

[1 +

1

tnc

]‖πvF (τi)‖

∏j 6=i

‖πtF (τj)‖. (95)

Notice that |τi| ≤ q − 1. By inductive assumption in (93) we can write

‖πtF (τi)‖ ≤ 1 for all i = 1 . . . ,m (96)∏i∈S‖πv,xF (τi)‖ ≤ Cn(L+M + 1)n

[E(yc) + 1]n

t|S|c

, where n =∑i

|τi|. (97)

Apply these substitutions into (95) to and use the inequality∑i |τi| ≤ q − 1 to obtain that

‖∇(m)F (F (τ1), ..., F (τm))‖ ≤ Cq[E(yc) + 1]q(M + L+ 1)q

tc. (98)

20

Hence, since ‖πx,vF (τ)‖ ≤ ‖∇(m)F (F (τ1), ..., F (τm))‖ we obtain that

‖πx,vF (τ)‖ ≤ Cq[E(yc) + 1]q(M + L+ 1)q

tc. (99)

Similarly, for m ≥ p, by (92) we can write

‖∇(m)F (F (τ1), ..., F (τm))‖

≤ C1Mtp−2c

∏i∈[m]

‖πxF (τi)‖

+ C2(M + L+ 1)[E(yc) + 1]t−n−1∏i∈[n]

‖πtF (τi)‖

+ C3

m−1∑k≥1

[(M + L+ 1)(E(yc) + 1)1]tp−k−2c

∑S⊂[m]|S|=k

[∏s∈S‖πtF (τs)‖

] ∏s′∈[m]/S

‖πxF (τs′)‖

+ C4

∑i∈[m]

[1 +

1

tnc

]‖πvF (τi)‖

∏j 6=i

‖πtF (τj)‖. (100)

Plug in the induction assumption in (93) into (100) to obtain

‖πx,vF (τ)‖ ≤ ‖∇(m)F (F (τ1), ..., F (τm))‖ ≤ Cq[E(yc) + 1]q(M + L+ 1)q

tc. (101)

Hence, the proof is complete by induction.

Now we proceed to derive an upper bound for higher order time derivatives. By Lemma 14 we canwrite

‖∂qϕh(yc)

∂hq‖ = ‖F (q−1)(ϕh(yc))‖ = ‖

∑|τ |=q

α(τ)F (τ)(ϕh(yc))‖.

By Lemma 10, we know that when h ≤ min{0.2, 1(1+κ)C(E(yc)+1)(M+L)}, y ∈ UR,0.2(yc). There-

fore, (101) holds. Hence, there exists a constant C determined by p, q such that

‖∂qϕh(yc)

∂hq‖ ≤ C[E(yc) + 1]q(M + L+ 1)q

tc.

Similarly by Lemma 15, we have the following equation

∂qΦh(yc)

∂hq=∑i≤S

bi[h∂qF (gi)

∂hq+ q

∂q−1F (gi)

∂hq]

Here, ∂qF (gi)∂hq has the same recursive tree structure as F (q)(y), except that we need to replace all F

in the expression by ∂gi∂h and all∇(n)F (y) by∇(n)F (gi). By Definition 1 and Lemma 10, we know

that

‖πx,v∂gi∂h

‖ ≤∑j≤i−1

|aij |C(E(yc) + 1)(M + L+ 1)

tc, ‖πt∂gi

∂h‖ = |

∑j≤i−1

aij |.

We also know by lemma 10 that ∀i, gi ∈ UR,0.2(yc). Hence the bounds for ‖∇(n)F (y)‖ alsoholds for ∇(n)F (gi). Therefore, by the same argument as for bounding ‖∂

qϕh(yc)∂hq ‖, we will get

same bounds for ‖∂qF (gi)∂hq ‖ up to a constant factor determined by the integrator. Based on this, we

conclude that

‖∂qΦh(yc)

∂hq‖ ≤ C[(L+M + 1)(1 + E(yc))]

q + C ′h[(L+M + 1)(1 + E(yc))](q+1)

tc,

where the constants are determined by p, q and the integrator.

21

Lemma 13. Suppose the conditions in Proposition 6 hold. Then, we have that‖E(Φh(yk))− E(ϕh(yk))‖≤ Chs+1[(1 + Ek)s+1(L+M + 1)s+1 + h(1 + Ek)s+2(L+M + 1)s+2](Ek + Ek+1 + 1),

(102)where C only depends on p, s and the numerical integrator.

Proof. Denote y = Φh(yk), y = ϕh(yk). Notice that t = t = tk + h. In fact, because we start thesimulation at tc = 1 and we require that h ≤ 0.2, we have

tk

t=

tktk + h

∈[

5

6, 1

]. (103)

Now using the definition of the Lyapunov function E we can show that

‖E(y)− E(y)‖ ≤ t2

4p2∣∣‖v‖2 − ‖v‖2∣∣+

∣∣∣∣‖x+t

2pv − x∗‖2 − ‖x+

t

2pv − x∗‖2

∣∣∣∣+ tp(|f(x)− f(x)|)

≤ 2t2

4p2(‖v − v‖‖v + v‖) + tp(‖x− x‖)(‖∇f(x)‖+ ‖∇f(x)‖)

+ 2

∥∥∥∥x− x+t

2p(v − v)

∥∥∥∥∥∥∥∥x+t

2pv − x∗ + x+

t

2pv − x∗

∥∥∥∥ , (104)

where to derive the second inequality we used the convexity of the function f which implies

〈y − x,∇f(y)〉 ≤ f(x)− f(y) ≤ 〈x− y,∇f(x)〉. (105)

Recall that Ek = E(yk), Ek+1 = E(y) = E(Φh(yk)), Ek+1 = E(y) = E(ϕh(yk)). According toProposition 5 we know that Ek+1 ≤ Ek, and therefore Ek+1 is upper bounded by Ek. Therefore,we can write

‖v‖ ≤

√Ek+1

t≤√Ek

t≤ Ek + 1

t, ‖v‖ ≤ Ek+1 + 1

t,∥∥∥∥x+

t

2pv − x∗

∥∥∥∥ ≤√Ek ≤ Ek + 1,

∥∥∥∥x+t

2pv − x∗

∥∥∥∥ ≤ Ek+1 + 1. (106)

Further, by Assumption 1, we have that

‖∇f(x)‖ ≤ L(Ek + 1)

tp−1, ‖∇f(x)‖ ≤ L(f(x)− f(x∗))

p−1p ≤ L(

Ek+1

tp)

p−1p ≤ L(Ek+1 + 1)

tp−1.

(107)In addition, by Proposition 6, we know that for some constant C determined by p, s, L,M and theintegrator, it holds

max{‖v − v‖, ‖x− x‖}

≤ Chs+1

[[1 + E(yk)]s+1(L+M + 1)s+1

tk+ h

[1 + E(yk)]s+2(L+M + 1)s+2

tk

]. (108)

Define M := [ [1+E(yk)]s+1(L+M+1)s+1

tk+ h [1+E(yk)]s+2(L+M+1)s+2

tk]. Use the upper bounds in

(106)-(108) and the definition ofM to simplify the right hand side of (104) to

‖E(y)− E(y)‖ ≤ 2t2

4p2Chs+1MEk + Ek+1 + 2

t+ tpChs+1ML(Ek+1 + Ek + 2)

tp−1

+ 2

(1 +

tk2p

)Chs+1M(Ek + Ek+1 + 2). (109)

Now use the fact that tkt

is bounded by a constant as shown (103). Further, upper bound all theconstants determined by s, p and the numerical integrator, we obtain that‖E(y)− E(y)‖≤ C ′hs+1[(1 + Ek)s+1(L+M + 1)s+1 + h(1 + Ek)s+2(L+M + 1)s+2](Ek + Ek+1 + 1),

(110)and the claim in (102) follows.

22

Figure 4: A figure adapted from Hairer et al. [2006]. Example tree structures and correspondingfunction derivatives.

E Elementary differentials

We briefly summarize some key results on elementary differentials from Hairer et al. [2006]. Formore details, please refer to chapter 3 of the book. Given a dynamical system

y = F (y)

we want to find a convenient way to express and compute its higher order derivatives. To do this,let τ denote a tree structure as illustrated in Figure 4. |τ | is the number of nodes in τ . Then we canadopt the following notations as in Hairer et al. [2006]Definition 2. For a tree τ , the elementary differential is a mapping F (τ) : Rd → Rd, definedrecursively by F (•)(y) = F (y) and

F (τ) = ∇(m)F (y)(F (τ1)(y), ..., F (τm)(y))

for τ = [τ1, ..., τm]. Notice that∑mi=1 |τi| = |τ | − 1.

Some examples are shown in Figure 4. With this notation, the following results from Hairer et al.[2006] Chapter 3.1 hold. The proof follows by recursively applying the product rule.Lemma 14. The qth order derivative of the exact solution to y = F (y) is given by

y(q)(tc) = F (q−1)(yc) =∑|τ |=q

α(τ)F (τ)(yc)

for y(tc) = yc. α(τ) is a positive integer determined by τ and counts the number of occurrences ofthe tree pattern τ .

The next result is obtained by Leibniz rule. The expression for ∂qF (gi)∂hq can be calculated the same

way as in Lemma 14.Lemma 15. For a Runge-Kutta method defined in definition 1, if F is qth differentiable, then

∂qΦh(yc)

∂hq=∑i≤S

bi[h∂qF (gi)

∂hq+ q

∂q−1F (gi)

∂hq] (111)

where ∂qF (gi)∂hq has the same structure as F (q)(y) in lemma 14, except that we need to replace all F

in the expression by ∂gi∂h and all ∇(n)F (y) by∇(n)F (gi).

23

Aryan Mokhtari Suvrit Sra Ali Jadbabaie [email protected] … · 2018-11-29 · Direct Runge-Kutta...

Documents

Transcript of Aryan Mokhtari Suvrit Sra Ali Jadbabaie [email protected] … · 2018-11-29 · Direct Runge-Kutta...