# Direct Runge-Kutta Discretization Achieves Acceleration

Jingzhao Zhang (LIDS, Massachusetts Institute of Technology, Cambridge, MA 02139, jzhzhang@mit.edu) · Aryan Mokhtari (LIDS, Massachusetts Institute of Technology, Cambridge, MA 02139, aryanm@mit.edu) · Suvrit Sra (LIDS, IDSS, Massachusetts Institute of Technology, Cambridge, MA 02139, suvrit@mit.edu) · Ali Jadbabaie (LIDS, IDSS, Massachusetts Institute of Technology, Cambridge, MA 02139, jadbabai@mit.edu)

We study gradient-based optimization methods obtained by directly discretizing a second-order ordinary differential equation (ODE) related to the continuous limit of Nesterov's accelerated gradient method. When the function is smooth enough, we show that acceleration can be achieved by a stable discretization of this ODE using standard Runge-Kutta integrators. Specifically, we prove that under Lipschitz-gradient, convexity, and order-$(s+2)$ differentiability assumptions, the sequence of iterates generated by discretizing the proposed second-order ODE converges to the optimal solution at a rate of $\mathcal{O}(N^{-2s/(s+1)})$, where $s$ is the order of the Runge-Kutta numerical integrator. Furthermore, we introduce a new local flatness condition on the objective, under which rates even faster than $\mathcal{O}(N^{-2})$ can be achieved with low-order integrators and only gradient information. Notably, this flatness condition is satisfied by several standard loss functions used in machine learning. We provide numerical experiments that verify the theoretical rates predicted by our results.

1 Introduction

We study accelerated first-order optimization algorithms for the problem
$$\min_{x \in \mathbb{R}^d} f(x), \qquad (1)$$
where $f$ is convex and sufficiently smooth. A classical method for solving (1) is gradient descent (GD), which displays a sub-optimal convergence rate of $\mathcal{O}(N^{-1})$; i.e., the gap $f(x_N) - f(x^*)$ between the value of the GD iterate and the optimal value $f(x^*)$ decreases to zero at the rate $\mathcal{O}(N^{-1})$. Nesterov's seminal accelerated gradient method [19] matches the oracle lower bound of $\mathcal{O}(N^{-2})$ [18], and is thus a central result in the theory of convex optimization. However, ever since its introduction, acceleration has remained somewhat mysterious, especially because Nesterov's original derivation relies on elegant but unintuitive algebraic arguments. This lack of understanding has spurred a variety of recent attempts to uncover the rationale behind the phenomenon of acceleration [1, 9, 11, 13, 16, 21].

We pursue instead an approach to NAG (and accelerated methods in general) via a continuous-time perspective. This view was recently studied by Su et al. [23], who showed that the continuous limit of NAG is a second-order ODE describing a physical system with vanishing friction; Wibisono et al. [26] generalized this idea and proposed a class of ODEs obtained by minimizing Bregman Lagrangians. Although these works succeed in providing a richer understanding of Nesterov's scheme via its continuous-time ODE, they fail to provide a general discretization procedure that generates provably convergent accelerated methods. In contrast, we introduce a second-order ODE that generates an accelerated first-order method for smooth functions if we simply discretize it using any Runge-Kutta numerical integrator and choose a suitable step size.
1.1 Summary of results

Assuming that the objective function is convex and sufficiently smooth, we establish the following.

We propose a second-order ODE and show that the sequence of iterates generated by discretizing it with a Runge-Kutta integrator converges to the optimal solution at the rate $\mathcal{O}(N^{-2s/(s+1)})$, where $s$ is the order of the integrator. By using a more precise numerical integrator (i.e., a larger $s$), this rate approaches the optimal rate $\mathcal{O}(N^{-2})$.

We introduce a new local flatness condition for the objective function (Assumption 1), under which Runge-Kutta discretization obtains convergence rates even faster than $\mathcal{O}(N^{-2})$ without requiring high-order integrators. In particular, we show that if the objective is locally flat around a minimum, then using only gradient information we can obtain a convergence rate of $\mathcal{O}(N^{-p})$, where $p$ quantifies the degree of local flatness. Acceleration due to local flatness may seem counterintuitive at first, but our analysis reveals why it helps.

To the best of our knowledge, this work presents the first direct discretization of an ODE that yields accelerated gradient methods; by direct we mean that the ODE is discretized with known numerical integration schemes, without reverse engineering NAG's updates. Unlike Betancourt et al. [7], who study symplecticity and consider variational integrators, and Scieur et al. [22], who study consistency of integrators, we focus on the order of integrators (see Section 2.1). We argue that the stability inherent in the ODE together with order conditions on the integrators suffices to achieve acceleration.

1.2 Additional related work

Several works [2, 3, 5, 8] have studied the asymptotic behavior of solutions to dissipative dynamical systems. However, these works retain a theoretical focus, as they remain in the continuous-time domain and do not discuss the key issue, namely the stability of discretization. Other works, such as [15], study the counterpart of Su et al. [23]'s analysis for mirror descent algorithms and achieve acceleration via Nesterov's technique. Diakonikolas and Orecchia [10] propose a framework for analyzing first-order mirror descent algorithms by studying ODEs derived from duality gaps. Raginsky and Bouvrie [20] obtain nonasymptotic rates for continuous-time mirror descent in a stochastic setting. A textbook treatment of numerical integration is given in [12]; some of our proofs build on material from Chapters 3 and 9. [14] and [25] also provide nice introductions to numerical analysis.

2 Problem setup and background

Throughout the paper we assume that the objective $f$ is convex and sufficiently smooth. Our key result rests on two assumptions introduced below. The first is a local flatness condition on $f$ around a minimum; the second requires $f$ to have bounded higher-order derivatives. These assumptions suffice to achieve acceleration simply by discretizing suitable ODEs, without resorting either to reverse engineering of discretizations or to other, more involved integration mechanisms.

We will require our assumptions to hold on a suitable subset of $\mathbb{R}^d$. Let $x_0$ be the initial point of our proposed iterative algorithm. First consider the sublevel set
$$\mathcal{S} := \{x \in \mathbb{R}^d \mid f(x) \le \exp(1)\,(f(x_0) - f(x^*) + \|x_0 - x^*\|^2) + 1\}, \qquad (2)$$
where $x^*$ is a minimum of (1). Later we will show that the sequence of iterates obtained from discretizing a suitable ODE never escapes this sublevel set. Thus, the assumptions that we introduce need to hold only within a subset of $\mathbb{R}^d$.
Let this subset be defined as
$$\mathcal{A} := \{x \in \mathbb{R}^d \mid \exists\, x' \in \mathcal{S} \text{ such that } \|x - x'\| \le 1\}, \qquad (3)$$
that is, the set of points within unit distance of the initial sublevel set (2). The choice of unit distance is arbitrary and can be scaled to any desired constant.

Assumption 1. There exist an integer $p \ge 2$ and a positive constant $L$ such that for any point $x \in \mathcal{A}$ and all indices $i \in \{1, \ldots, p-1\}$, we have the bound
$$\|\nabla^{(i)} f(x)\| \le L\,(f(x) - f(x^*))^{\frac{p-i}{p}}, \qquad (4)$$
where $x^*$ minimizes $f$ and $\|\nabla^{(i)} f(x)\|$ denotes the operator norm of the tensor $\nabla^{(i)} f(x)$.

Assumption 1 bounds high-order derivatives by the function suboptimality, so that these derivatives vanish as the suboptimality converges to 0. Thus, it quantifies the flatness of the objective around a minimum. (One could view this as an error bound condition that reverses the gradient-based upper bounds on suboptimality stipulated by the Polyak-Łojasiewicz condition [6, 17].) When $p = 2$, Assumption 1 is slightly weaker than the usual Lipschitz continuity of gradients (see Example 1) typically assumed in the analysis of first-order methods, including NAG. If we further know that the objective's Taylor expansion around an optimum has no low-order terms, $p$ is the degree of the first nonzero term.

Example 1. Let $f$ be convex with $\frac{L}{2}$-Lipschitz continuous gradients, i.e., $\|\nabla f(x) - \nabla f(y)\| \le \frac{L}{2}\|x - y\|$. Then, for any $x, y \in \mathbb{R}^d$ we have $f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2$. In particular, for $y = x^*$, an optimum, we have $\nabla f(y) = 0$, and thus $f(x) - f(x^*) \ge \frac{1}{L}\|\nabla f(x)\|^2$, which is nothing but inequality (4) for $p = 2$ and $i = 1$.

Example 2. Consider the $\ell_p$-norm regression problem $\min_x f(x) = \|Ax - b\|_p^p$, for an even integer $p \ge 2$. If there exists $x^*$ with $Ax^* = b$, then $f$ satisfies inequality (4) for this $p$, with $L$ depending on $p$ and the operator norm of $A$. The logistic loss satisfies a slightly different version of Assumption 1 because its minimum can be at infinity; we explain this point in more detail in Section 3.1.
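To make Example 2 concrete, the following sketch numerically checks inequality (4) (for $i = 1$) on a small random instance of the $\ell_4$ problem; the dimensions, seed, and test points are illustrative assumptions, and the point is only that the printed ratio, which plays the role of $L$, stays bounded as $x \to x^*$ while the gradient norm itself vanishes.

```python
import numpy as np

# Numerical sanity check of Assumption 1 (inequality (4)) for
# f(x) = ||Ax - b||_4^4, which satisfies (4) with p = 4 (Example 2).
rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
x_star = rng.standard_normal(d)
b = A @ x_star                        # ensures A x* = b, so f(x*) = 0

def f(x):
    return np.sum((A @ x - b) ** 4)

def grad_f(x):
    return 4 * A.T @ (A @ x - b) ** 3

p = 4
for scale in [1.0, 0.1, 0.01]:        # approach the minimum
    x = x_star + scale * rng.standard_normal(d)
    subopt = f(x)                     # equals f(x) - f(x*) since f(x*) = 0
    # For i = 1, inequality (4) reads ||grad f(x)|| <= L * subopt^((p-1)/p).
    ratio = np.linalg.norm(grad_f(x)) / subopt ** ((p - 1) / p)
    print(f"scale={scale:5.2f}  ratio={ratio:.3f}")  # bounded as x -> x*
```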
Next, we introduce our second assumption, which adds differentiability restrictions and bounds the growth of derivatives.

Assumption 2. There exist an integer $s \ge p$ and a constant $M \ge 0$ such that $f$ is order-$(s+2)$ differentiable. Furthermore, for any $x \in \mathcal{A}$, the following operator norm bounds hold:
$$\|\nabla^{(i)} f(x)\| \le M, \quad \text{for } i = p, p+1, \ldots, s+2. \qquad (5)$$

When the sublevel sets of $f$ are compact, the set $\mathcal{A}$ is also compact; as a result, the bound (5) on high-order derivatives is implied by continuity. In addition, an $\ell_p$ loss of the form $\|Ax - b\|_p^p$ satisfies (5) with $M = p!\,\|A\|_2^p$.

2.1 Runge-Kutta integrators

Before moving on to our new results (Section 3), let us briefly recall the explicit Runge-Kutta (RK) integrators used in our work. For a more in-depth discussion, please see the textbook [12].

Definition 1. Given a dynamical system $\dot{y} = F(y)$, let the current point be $y_0$ and the step size be $h$. An explicit $S$-stage Runge-Kutta method generates the next step via the update
$$g_i = y_0 + h \sum_{j=1}^{i-1} a_{ij} F(g_j), \qquad \Phi_h(y_0) = y_0 + h \sum_{i=1}^{S} b_i F(g_i), \qquad (6)$$
where $a_{ij}$ and $b_i$ are suitable coefficients defined by the integrator; $\Phi_h(y_0)$ is the estimate of the state after time step $h$, while the $g_i$ (for $i = 1, \ldots, S$) are a few neighboring points at which the gradient information $F(g_i)$ is evaluated.

By combining the gradients at several evaluation points, the integrator achieves higher precision by matching Taylor expansion coefficients. Let $\varphi_h(y_0)$ be the true solution of the ODE with initial condition $y_0$; we say that an integrator $\Phi_h(y_0)$ has order $s$ if its discretization error shrinks as
$$\|\Phi_h(y_0) - \varphi_h(y_0)\| = \mathcal{O}(h^{s+1}), \quad \text{as } h \to 0. \qquad (7)$$

In general, RK methods form a powerful class of numerical integrators that encompasses several basic schemes. The explicit Euler method, defined by $\Phi_h(y_0) = y_0 + h F(y_0)$, is an explicit RK method of order 1, while the midpoint method $\Phi_h(y_0) = y_0 + h F\big(y_0 + \frac{h}{2} F(y_0)\big)$ is of order 2. Some high-order RK methods are summarized in [24].
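As an illustration of Definition 1, the following minimal sketch implements the generic explicit update (6) from a Butcher tableau $(a, b)$; the two tableaus shown recover exactly the Euler and midpoint methods above. It is a sketch for exposition, not a production integrator.

```python
import numpy as np

def rk_step(F, y0, h, a, b):
    """One explicit Runge-Kutta step (update (6)) for y' = F(y).

    a: S x S strictly lower-triangular coefficient matrix (a_ij),
    b: length-S weight vector (b_i)."""
    S = len(b)
    g = [None] * S
    for i in range(S):
        # g_i = y0 + h * sum_{j < i} a_ij F(g_j)   (explicit: j < i only)
        g[i] = y0 + h * sum(a[i][j] * F(g[j]) for j in range(i))
    # Phi_h(y0) = y0 + h * sum_i b_i F(g_i)
    return y0 + h * sum(bi * F(gi) for bi, gi in zip(b, g))

# Euler's method: order s = 1.
euler = (np.zeros((1, 1)), np.array([1.0]))
# Midpoint method: order s = 2.
midpoint = (np.array([[0.0, 0.0], [0.5, 0.0]]), np.array([0.0, 1.0]))
```

Each stage value $g_i$ depends only on earlier stages ($j < i$), which is what makes the method explicit; higher-order tableaus, such as the classical fourth-order scheme, plug into the same interface.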
3 Main results

In this section, we introduce a second-order ODE and use explicit RK integrators to generate iterates that converge to the optimal solution at a rate faster than $\mathcal{O}(1/t)$ (where $t$ denotes the time variable in the ODE). A central outcome of our analysis is that, at least for objective functions that are smooth enough, the key ingredient of acceleration is not the integrator type but a careful analysis of the dynamics with a more powerful Lyapunov function. More specifically, we show that by carefully exploiting the boundedness of higher-order derivatives, we can achieve both stability and acceleration at the same time.

We start with Nesterov's accelerated gradient (NAG) method, defined by the updates
$$x_k = y_{k-1} - h \nabla f(y_{k-1}), \qquad y_k = x_k + \frac{k-1}{k+2}(x_k - x_{k-1}). \qquad (8)$$
Su et al. [23] showed that, as the step size $h$ is driven to zero, the iteration (8) is equivalent to the ODE
$$\ddot{x}(t) + \frac{3}{t}\dot{x}(t) + \nabla f(x(t)) = 0, \qquad (9)$$
where $\dot{x} = \frac{dx}{dt}$. It can further be shown that in the continuous domain the function value $f(x(t))$ decreases at the rate $\mathcal{O}(1/t^2)$ along the trajectories of this ODE. This convergence rate can be accelerated to an arbitrary rate in continuous time via time dilation as in [26]. In particular, the solution to
$$\ddot{x}(t) + \frac{p+1}{t}\dot{x}(t) + p^2 t^{p-2} \nabla f(x(t)) = 0 \qquad (10)$$
has a convergence rate of $\mathcal{O}(1/t^p)$. For $p > 2$, Wibisono et al. [26] proposed rate-matching algorithms that utilize higher-order derivatives (e.g., Hessians). In this work, we focus purely on first-order methods and study the stability of discretizing the ODE directly when $p \ge 2$.

Though deriving the ODE from the algorithm is a solved problem, deriving the update of NAG or any other accelerated method by directly discretizing an ODE is not. As stated in [26], explicit Euler discretization of the ODE in (9) may not lead to a stable algorithm. Recently, Betancourt et al. [7] observed empirically that Verlet integration is stable and suggested that this stability relates to the symplectic property of Verlet integration. In our proof, however, we found that the order condition of Verlet integration suffices to achieve acceleration; though symplectic integrators are known to be stable, we were not able to leverage symplecticity for the dissipative system (11). This principal point of departure from previous works underlies Algorithm 1, which solves (1) by discretizing the following ODE with an order-$s$ integrator:
$$\ddot{x}(t) + \frac{2p+1}{t}\dot{x}(t) + p^2 t^{p-2} \nabla f(x(t)) = 0. \qquad (11)$$

Algorithm 1: Input($f, x_0, p, L, M, s, N$)   (constants $p, L, M$ as in Assumptions 1 and 2)
1: Set the initial state $y_0 = [0; x_0; 1] \in \mathbb{R}^{2d+1}$
2: Set the step size $h = C\,N^{-1/(s+1)}$   ($C$ is determined by $p, L, M, s, x_0$)
3: $x_N \leftarrow$ Order-$s$-Runge-Kutta-Integrator($F, y_0, N, h$)   ($F$ is defined in equation (12))
4: return $x_N$

The solution to (11) exists and is unique when $t > 0$; this follows from local Lipschitzness of $\nabla f$ and is discussed in more detail in Appendix A.2 of [26]. We further highlight that the ODE in (11) can also be written as the first-order dynamical system
$$\dot{y} = F(y) = \begin{bmatrix} -\frac{2p+1}{t} v - p^2 t^{p-2} \nabla f(x) \\ v \\ 1 \end{bmatrix}, \qquad y = [v; x; t]. \qquad (12)$$
We have augmented the state with time to obtain an autonomous system, which can be readily solved numerically with a Runge-Kutta integrator as in Algorithm 1. To avoid the singularity at $t = 0$, Algorithm 1 discretizes the ODE starting from $t = 1$ with initial condition $y(1) = y_0 = [0; x_0; 1]$. The choice of 1 can be replaced by any arbitrary positive constant.
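A minimal sketch of Algorithm 1 applied to the system (12) follows. The quadratic objective, the choice $p = 2$, the constant $C = 0.1$, and the use of the classical fourth-order (RK44) tableau in place of a generic order-$s$ integrator are all illustrative assumptions, not the exact setup of the paper's experiments.

```python
import numpy as np

def make_F(grad_f, p):
    """Vector field F of the augmented system (12), with y = [v; x; t]."""
    def F(y):
        d = (len(y) - 1) // 2
        v, x, t = y[:d], y[d:2 * d], y[-1]
        dv = -(2 * p + 1) / t * v - p ** 2 * t ** (p - 2) * grad_f(x)
        return np.concatenate([dv, v, [1.0]])
    return F

def rk44_step(F, y, h):
    """Classical fourth-order Runge-Kutta step (stands in for order s)."""
    k1 = F(y)
    k2 = F(y + h / 2 * k1)
    k3 = F(y + h / 2 * k2)
    k4 = F(y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def direct_discretization(grad_f, x0, p=2, s=4, N=1000, C=0.1):
    """Algorithm 1: integrate (12) from t = 1 with h = C * N^(-1/(s+1))."""
    y = np.concatenate([np.zeros_like(x0), x0, [1.0]])   # y(1) = [0; x0; 1]
    h = C * N ** (-1.0 / (s + 1))
    F = make_F(grad_f, p)
    for _ in range(N):
        y = rk44_step(F, y, h)
    return y[len(x0):2 * len(x0)]                        # return x_N

# Example: quadratic objective f(x) = ||Ax - b||^2 / 2 (an assumption).
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 5))
b = A @ rng.standard_normal(5)
x_N = direct_discretization(lambda x: A.T @ (A @ x - b), np.zeros(5))
print(np.linalg.norm(A @ x_N - b))                       # residual at x_N
```

Varying $N$ (and, with a matching tableau, the order $s$) in this sketch reproduces the trade-off quantified in the theorem below.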
Notice that the ODE in (11) is slightly different from the one in (10): the coefficient of $\dot{x}(t)$ is $\frac{2p+1}{t}$ instead of $\frac{p+1}{t}$. This modification is crucial for our analysis via Lyapunov functions (more details in Section 4 and Appendix A). The parameter $p$ in the ODE (11) is set to be the same as the constant in Assumption 1 to achieve the best theoretical upper bound by balancing stability and acceleration. In particular, the larger $p$ is, the faster the system evolves; hence the numerical integrator requires smaller step sizes to stabilize the process, but a smaller step size increases the number of iterations needed to reach a target accuracy. This tension is alleviated by Assumption 1: the larger $p$ is, the flatter the function $f$ is around its stationary points. In other words, Assumption 1 implies that as the iterates approach a minimum, the high-order derivatives of $f$, in addition to the gradient, also converge to zero. Consequently, the trajectory slows down around the optimum, and we can stably discretize the process with a large enough step size. This intuition ultimately translates into our main result.

Theorem 1 (Main Result). Consider the second-order ODE in (11). Suppose that the function $f$ is convex and Assumptions 1 and 2 are satisfied. Further, let $s$ be the order of the Runge-Kutta integrator used in Algorithm 1, $N$ be the total number of iterations, and $x_0$ be the initial point. Also, let $E_0 := f(x_0) - f(x^*) + \|x_0 - x^*\|^2 + 1$. Then there exists a constant $C_1$ such that, if we set the step size as $h = C_1 N^{-1/(s+1)} (L + M + 1)^{-1} E_0^{-1}$, the iterate $x_N$ generated by running Algorithm 1 for $N$ iterations satisfies
$$f(x_N) - f(x^*) \le C_2\, \frac{E_0^{\,p+1} (L + M + 1)^p}{N^{\frac{ps}{s+1}}}, \qquad (13)$$
where the constants $C_1$ and $C_2$ depend only on $s$, $p$, and the Runge-Kutta integrator. Here $S$ is the number of stages, as in Definition 1. Since each iteration consumes $S$ gradient evaluations, $f(x_N) - f(x^*)$ converges as $\mathcal{O}\big(S^{\frac{ps}{s+1}} N_g^{-\frac{ps}{s+1}}\big)$ with respect to the number $N_g$ of gradient evaluations. Note that for commonly used Runge-Kutta integrators, $S \le 8$.

The proof of this theorem is quite involved; we provide a sketch in Section 4, deferring the detailed technical steps to the appendix. We do not need to know the constant $C_1$ exactly in order to set the step size $h$: replacing $C_1$ by any smaller positive constant leads to the same polynomial rate. Theorem 1 indicates that if the objective has bounded high-order derivatives and satisfies the flatness condition in Assumption 1 with parameter $p$, then discretizing the ODE in (11) with a high-order integrator results in an algorithm that converges to the optimal solution at a rate close to $\mathcal{O}(N^{-p})$. In the following corollaries, we highlight two special instances of Theorem 1.

Corollary 2. If the function $f$ is convex with $L$-Lipschitz gradients and is 4th-order differentiable, then simulating the ODE (11) for $p = 2$ with a numerical integrator of order $s = 2$ for $N$ iterations results in the suboptimality bound
$$f(x_N) - f(x^*) \le C_2\, \frac{(f(x_0) - f(x^*) + \|x_0 - x^*\|^2 + 1)^3 (L + M + 1)^2}{N^{4/3}}.$$

Note that higher-order differentiability allows one to use a higher-order integrator, which leads to the optimal $\mathcal{O}(N^{-2})$ rate in the limit. The next example is based on high-order polynomials, i.e., $\ell_p$ norms.

Corollary 3. Consider the objective function $f(x) = \|Ax - b\|_4^4$, and assume there exists $x^*$ such that $Ax^* = b$. Simulating the ODE (11) for $p = 4$ with a numerical integrator of order $s = 4$ for $N$ iterations results in the suboptimality bound
$$f(x_N) - f(x^*) \le C_2\, \frac{(f(x_0) - f(x^*) + \|x_0 - x^*\|^2 + 1)^5 (L + M + 1)^4}{N^{16/5}}.$$

3.1 Logistic loss

The logistic loss $f(x) = \log(1 + e^{-w^\top x})$ does not fit exactly into the setting of Theorem 1, because $x^*$ does not exist. This potentially causes two problems: first, Assumption 1 is not well defined; second, the constant $E_0$ in Theorem 1 is not well defined. In this section we explain how our analysis can be modified to admit the logistic loss by utilizing the structure of its high-order derivatives. The first problem is resolved by replacing $f(x^*)$ with $\inf_{x \in \mathbb{R}^d} f(x)$ in Assumption 1; the logistic loss then satisfies Assumption 1 for an arbitrary integer $p > 0$. To approach the second problem, we replace $x^*$ by a point $\tilde{x}$ that satisfies the following relaxed inequalities: for some $\epsilon_1, \epsilon_2, \epsilon_3 < 1$ and all $x \in \mathcal{A}$,
$$\langle x - \tilde{x}, \nabla f(x) \rangle \ge f(x) - f(\tilde{x}) - \epsilon_1, \qquad (14)$$
$$\|\nabla^{(i)} f(x)\| \le L\,(f(x) - f(\tilde{x}))^{\frac{p-i}{p}} + \epsilon_2, \qquad f(\tilde{x}) - \inf_{x \in \mathbb{R}^d} f(x) \le \epsilon_3. \qquad (15)$$
Because these inequalities are relaxations, there exists a vector $\tilde{x} \in \mathbb{R}^d$ that satisfies them. If we follow the original proof and balance the additional error terms by picking $\tilde{x}$ carefully, we obtain:

Corollary 4 (Informal). If the objective is $f(x) = \log(1 + e^{-w^\top x})$, then discretizing the ODE (11) with an order-$s$ numerical integrator for $N$ iterations with step size $h = \mathcal{O}(N^{-1/(s+1)})$ results in a convergence rate of $\mathcal{O}\big(S^{\frac{ps}{s+1}} N^{-\frac{ps}{s+1}}\big)$.

4 Proof of Theorem 1

We prove Theorem 1 as follows. First (Proposition 5), we show that the suboptimality $f(x(t)) - f(x^*)$ along the continuous trajectory of the ODE (11) converges to zero sufficiently fast. Second (Proposition 6), we bound the discretization error $\|\Phi_h(y_k) - \varphi_h(y_k)\|$, which measures the distance between the point generated by discretizing the ODE and the true continuous solution. Finally (Proposition 7), a bound on this error, along with continuity of the Lyapunov function (16), implies that the suboptimality of the discretized sequence of points also converges to zero quickly.

Central to our proof is the choice of a Lyapunov function used to quantify progress. We propose the Lyapunov function $E : \mathbb{R}^{2d+1} \to \mathbb{R}_+$ defined as
$$E([v; x; t]) := \frac{t^2}{4p^2}\|v\|^2 + \Big\|x + \frac{t}{2p} v - x^*\Big\|^2 + t^p \big(f(x) - f(x^*)\big). \qquad (16)$$
The Lyapunov function (16) is similar to the ones used by Su et al. [23] and Wibisono et al. [26], except for the extra term $\frac{t^2}{4p^2}\|v\|^2$. This term allows us to bound $\|v\|$ by $\mathcal{O}(\sqrt{E}/t)$, a dependency that is crucial for achieving the $\mathcal{O}(N^{-2})$ bound (see Lemma 11 for more details).
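As a quick numerical sanity check of the monotonicity established next, one can evaluate (16) along a finely discretized trajectory of (12); the quadratic objective, $p = 2$, and all numerical constants below are illustrative assumptions.

```python
import numpy as np

# Evaluate the Lyapunov function (16) along a trajectory of (12).
p = 2
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 5))
x_star = rng.standard_normal(5)
b = A @ x_star                                  # so f(x*) = 0

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad_f = lambda x: A.T @ (A @ x - b)

def E(v, x, t):
    """Lyapunov function (16), with f(x*) = 0 by construction."""
    return (t ** 2 / (4 * p ** 2) * (v @ v)
            + np.sum((x + t / (2 * p) * v - x_star) ** 2)
            + t ** p * f(x))

v, x, t, h = np.zeros(5), np.zeros(5), 1.0, 5e-4
vals = []
for _ in range(20000):                          # integrate from t = 1 to t = 11
    vals.append(E(v, x, t))
    dv = -(2 * p + 1) / t * v - p ** 2 * t ** (p - 2) * grad_f(x)
    v, x, t = v + h * dv, x + h * v, t + h      # one explicit Euler step
print(max(np.diff(vals)))   # <= ~0 up to discretization error: E never grows
```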
We begin our analysis with Proposition 5, which shows that the function $E$ is non-increasing with time, i.e., $\dot{E}(y) \le 0$. This monotonicity implies that both $t^p(f(x) - f(x^*))$ and $\frac{t^2}{4p^2}\|v\|^2$ are bounded above by constants. The bound on $t^p(f(x) - f(x^*))$ provides a convergence rate of $\mathcal{O}(1/t^p)$ for the suboptimality $f(x(t)) - f(x^*)$; in conjunction with Assumption 1, it further yields an upper bound on the derivatives of $f$.

Proposition 5 (Monotonicity of $E$). Consider the vector $y = [v; x; t] \in \mathbb{R}^{2d+1}$ as a trajectory of the dynamical system (12), and let the Lyapunov function $E$ be defined by (16). Then, along any trajectory $y = [v; x; t]$, the time derivative $\dot{E}(y)$ is non-positive; more precisely,
$$\dot{E}(y) \le -\frac{t}{p}\|v\|^2. \qquad (17)$$

The proof of this proposition follows from convexity and (11); we defer the details to Appendix A.
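For intuition, the computation behind (17) is short; the following sketch, consistent with the full argument deferred to Appendix A, differentiates the three terms of (16) along the flow (12), where $\dot{x} = v$ and $\dot{v} = -\frac{2p+1}{t} v - p^2 t^{p-2} \nabla f(x)$:
$$\frac{d}{dt}\Big[\frac{t^2}{4p^2}\|v\|^2\Big] = \frac{t}{2p^2}\|v\|^2 + \frac{t^2}{2p^2}\langle v, \dot{v}\rangle = -\frac{t}{p}\|v\|^2 - \frac{t^p}{2}\langle v, \nabla f(x)\rangle,$$
$$\frac{d}{dt}\Big[\big\|x + \tfrac{t}{2p}v - x^*\big\|^2\Big] = 2\Big\langle x + \frac{t}{2p}v - x^*,\ -\frac{p}{2} t^{p-1}\nabla f(x)\Big\rangle = -p\,t^{p-1}\langle x - x^*, \nabla f(x)\rangle - \frac{t^p}{2}\langle v, \nabla f(x)\rangle,$$
where the second identity uses the fact that the velocity terms in $\frac{d}{dt}\big[x + \frac{t}{2p}v\big] = v + \frac{1}{2p}v + \frac{t}{2p}\dot{v}$ cancel exactly, and
$$\frac{d}{dt}\Big[t^p\big(f(x) - f(x^*)\big)\Big] = p\,t^{p-1}\big(f(x) - f(x^*)\big) + t^p\langle \nabla f(x), v\rangle.$$
Summing the three identities, the $\langle v, \nabla f(x)\rangle$ terms cancel and
$$\dot{E}(y) = -\frac{t}{p}\|v\|^2 + p\,t^{p-1}\big[f(x) - f(x^*) - \langle \nabla f(x), x - x^*\rangle\big] \le -\frac{t}{p}\|v\|^2,$$
where the inequality uses convexity: $f(x^*) \ge f(x) + \langle \nabla f(x), x^* - x\rangle$.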
Next, to bound the Lyapunov function for numerical solutions, we need to bound the distance between points in the discretized and continuous trajectories. As in Section 2.1, for the dynamical system $\dot{y} = F(y)$, let $\Phi_h(y_0)$ denote the solution generated by a numerical integrator starting at the point $y_0$ with step size $h$, and let $\varphi_h(y_0)$ be the corresponding true solution of the ODE. An ideal numerical integrator would satisfy $\Phi_h(y_0) = \varphi_h(y_0)$; due to discretization error, however, there is always a difference between $\Phi_h(y_0)$ and $\varphi_h(y_0)$, determined by the order of the integrator as in (7). Let $\{y_k\}_{k=0}^{N}$ be the sequence of points generated by the numerical integrator, that is, $y_{k+1} = \Phi_h(y_k)$. In the following proposition, we derive an upper bound on the resulting discretization error $\|\Phi_h(y_k) - \varphi_h(y_k)\|$.

Proposition 6 (Discretization error). Let $y_k = [v_k; x_k; t_k]$ be the current state of the dynamical system $\dot{y} = F(y)$ defined in (12), and suppose $x_k \in \mathcal{S}$, with $\mathcal{S}$ defined in (2). If we use a Runge-Kutta integrator of order $s$ to discretize the ODE for a single step with a step size $h$ satisfying $h \le \min\big\{0.2,\ \frac{1}{(1+\kappa)\, C\, (1 + E(y_k))(M + L + 1)}\big\}$, then
$$\|\Phi_h(y_k) - \varphi_h(y_k)\| \le C' h^{s+1} (M + L + 1)\big[(1 + E(y_k))^{s+1} t_k + h\,(1 + E(y_k))^{s+2}\big], \qquad (18)$$
where the constants $C$, $\kappa$, and $C'$ depend only on $p$, $s$, and the integrator.

The proof of Proposition 6 is the most challenging part of proving Theorem 1; details may be found in Appendix B. The key step is to bound $\big\|\frac{\partial^{s+1}}{\partial h^{s+1}}[\Phi_h(y_k) - \varphi_h(y_k)]\big\|$. To do so, we first bound the high-order derivative tensors $\nabla^{(i)} f$ using Assumption 1 and Proposition 5 within a region of radius $R$. By carefully selecting $R$, we can show that for a reasonably small $h$, both $\Phi_h(y_k)$ and $\varphi_h(y_k)$ are confined to this region. Second, we need to compute the high-order derivatives of $\dot{y} = F(y)$ as functions of the tensors $\nabla^{(i)} f$, which are bounded in the region of radius $R$. As shown in Appendix E, the expressions for higher derivatives become quite complicated as the order increases. We approach this complexity by using the notation of elementary differentials (see Appendix E), adopted from [12]; we then induct on the order of the derivatives to bound them. The flatness assumption (Assumption 1) provides bounds on the operator norms of high-order derivatives relative to the objective suboptimality, and hence proves crucial in completing the inductive step.

By the conclusion of Proposition 6 and continuity of the Lyapunov function $E$, the value of $E$ at a discretized point is close to its continuous counterpart. Using this observation, we expect that the Lyapunov function values of the points generated by the discretized ODE do not increase significantly. We formally prove this key claim in the following proposition.

Proposition 7. Consider the dynamical system $\dot{y} = F(y)$ defined in (12) and the Lyapunov function $E$ defined in (16). Let $y_0$ be the initial state of the dynamical system and $y_N$ the final point generated by a Runge-Kutta integrator of order $s$ after $N$ iterations. Further, suppose that Assumptions 1 and 2 are satisfied. Then there exists a constant $C$, determined by $p$, $s$, and the numerical integrator, such that if the step size satisfies $h = \frac{C\, N^{-1/(s+1)}}{(L+M+1)(e E(y_0) + 1)}$, then
$$E(y_N) \le \exp(1)\, E(y_0) + 1. \qquad (19)$$

Please see Appendix C for a proof of this claim. Proposition 7 shows that the value of the Lyapunov function $E$ at the point $y_N$ is bounded above by a constant that depends on the initial value $E(y_0)$. Hence, if the step size $h$ satisfies the required condition in Proposition 7, we obtain
$$f(x_N) - f(x^*) \le \frac{E(y_N)}{t_N^p} \le \frac{e\, E(y_0) + 1}{(1 + Nh)^p}. \qquad (20)$$
The first inequality in (20) follows from the definition of $E$ in (16). Replacing the step size $h$ in (20) by the choice used in Proposition 7 yields
$$f(x_N) - f(x^*) \le \frac{(L + M + 1)^p (e\, E(y_0) + 1)^{p+1}}{C\, N^{\frac{ps}{s+1}}}, \qquad (21)$$
and the claim of Theorem 1 follows.

Note: the dependency of the step size $h$ on the order $s$ of the integrator shows that a higher-order integrator allows a larger step size, and therefore a faster convergence rate.

5 Numerical experiments

We perform numerical experiments to verify Theorem 1 and compare the direct discretization (DD) method described in Algorithm 1 against gradient descent (GD) and Nesterov's accelerated gradient (NAG) method. All figures in this section are on a log-log scale. For each optimization method, we empirically choose the largest step size among $\{10^{-k} \mid k \in \mathbb{Z}\}$ subject to the algorithm remaining stable over the first 1000 iterations.

Figure 1: (a) Convergence paths of GD, NAG, and the proposed algorithm with integrators of order $s = 1$, $s = 2$, and $s = 4$; the objective is quadratic. (b) Minimizing a quadratic objective by discretizing different ODEs (different choices of $q$ in (22)) with the RK44 integrator (4th order). (c/d) Minimizing the $\ell_4$/logistic loss by discretizing different ODEs with a second-order integrator.

In Figure 1a, we generate a synthetic, linearly separable dataset and fit the linear model $Ax = b$; $A$ is entry-wise Gaussian, and feasibility is ensured by increasing the data dimension. We then minimize the $\ell_2$ loss $f(x) = \|Ax - b\|_2^2$. In particular, we discretize the ODE (11) for $p = 2$ with integrators of different orders, i.e., $s \in \{1, 2, 4\}$, and compare them against GD and NAG. Observe that GD eventually attains a linear rate, and NAG achieves local acceleration close to the optimal point, as noted in [4]. For DD, simulating the ODE with an integrator of order $s = 1$ eventually becomes unstable; using a higher-order integrator leads to stable accelerated algorithms.

Throughout this paper, we have assumed that the constant $p$ in (11) matches the one in Assumption 1 so as to attain the best theoretical upper bounds. In Figure 1b, we empirically explore the convergence rate of discretizing the ODE
$$\ddot{x}(t) + \frac{2q+1}{t}\dot{x}(t) + q^2 t^{q-2} \nabla f(x(t)) = 0 \qquad (22)$$
when $q \ne p$. We minimize the same $\ell_2$ loss with different values of $q$ using a fourth-order integrator with the same step size, and observe that the algorithm eventually diverges when $q > 2$. We then discretize ODEs with different parameters $q$ for the $\ell_4$ loss and the logistic loss on the same set of data points, using a second-order RK integrator. As shown in Figure 1c, the objective decreases faster for larger $q$ up to $q = 6$ and diverges when $q = 8$. Given that the $\ell_4$ loss has $p = 4$, this result suggests that our analysis might be conservative. Finally, Figure 1d summarizes the experiments on the logistic loss: we notice that the algorithm is stable even when $q = 8$. This result verifies Corollary 4.
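A minimal sketch of the step-size selection rule used above follows; the boundedness test and the GD example are our own illustrative reading of the procedure, not the exact harness used for Figure 1.

```python
import numpy as np

def largest_stable_step(run_iters, k_range=range(0, 12), T=1000, bound=1e8):
    """Return the largest h in {10^-k} whose first T iterates stay bounded.

    run_iters(h, T) should return the iterates produced with step size h
    (e.g., T steps of GD, NAG, or the DD method sketched earlier)."""
    for k in k_range:                            # tries h = 1, 0.1, 0.01, ...
        h = 10.0 ** (-k)
        xs = run_iters(h, T)
        if all(np.all(np.isfinite(x)) and np.linalg.norm(x) < bound for x in xs):
            return h
    raise RuntimeError("no stable step size found in the given range")

# Example: gradient descent on f(x) = ||Ax - b||^2 / 2 (illustrative).
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 10)), rng.standard_normal(20)

def gd_iters(h, T):
    x, xs = np.zeros(10), []
    for _ in range(T):
        x = x - h * (A.T @ (A @ x - b))
        xs.append(x)
    return xs

print(largest_stable_step(gd_iters))             # largest h keeping GD bounded
```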
6 Discussion

Our paper obtains accelerated gradient methods by directly discretizing second-order ODEs (instead of reverse engineering Nesterov-like constructions), yet it does not fully explain acceleration. First, unlike Nesterov's accelerated gradient method, which requires only first-order differentiability, our results require the objective function to be $(s+2)$-times differentiable (where $s$ is the order of the integrator). The precision of numerical integrators increases with their order only when the function is sufficiently differentiable, and this property inherently limits our analysis. Second, while we achieve the $\mathcal{O}(N^{-2})$ convergence rate, some of the constants in our bound are loose (e.g., for the squared loss and logistic regression they are quadratic in $L$, versus linear in $L$ for NAG). Achieving the optimal dependence on the initial error $f(x_0) - f(x^*)$, the diameter $\|x_0 - x^*\|$, and the constants $L$ and $M$ requires further investigation.

In addition, we identified a new condition, Assumption 1, that quantifies the local flatness of convex functions. At first, this condition may appear counterintuitive: gradient descent converges fast when the objective is not flat, and its progress slows down when the gradient vanishes close to the minimum. However, when we discretize the ODE, trajectories with vanishing gradients oscillate slowly, and hence allow stable discretization with large step sizes, which ultimately lets us achieve acceleration. We believe this high-level idea, possibly as embodied by Assumption 1, could be used more broadly in analyzing and designing optimization methods.

Based on the above two points, this paper carries both positive and negative messages for the recent trend of interpreting optimization methods through ODEs. On one hand, it shows that with careful analysis, discretizing an ODE can preserve some of the properties of its trajectories. On the other hand, our proof suggests that nontrivial additional conditions might be required to ensure stable discretization. Hence, designing an ODE with nice properties in the continuous domain does not guarantee the existence of a practical optimization algorithm.

Acknowledgments

AJ and SS acknowledge support in part from DARPA FunLoL and DARPA Lagrange; AJ also acknowledges support from an ONR Basic Research Challenge Program, and SS acknowledges support from NSF-IIS-1409802.

References

[1] Z. Allen-Zhu and L. Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. arXiv preprint arXiv:1407.1537, 2014.
[2] F. Alvarez. On the minimizing property of a second order dissipative system in Hilbert spaces. SIAM Journal on Control and Optimization, 38(4):1102–1119, 2000.
[3] H. Attouch and R. Cominetti. A dynamical approach to convex minimization coupling approximation with the steepest descent method. Journal of Differential Equations, 128(2):519–540, 1996.
[4] H. Attouch and J. Peypouquet. The rate of convergence of Nesterov's accelerated forward-backward method is actually faster than 1/k^2. SIAM Journal on Optimization, 26(3):1824–1834, 2016.
[5] H. Attouch, X. Goudou, and P. Redont. The heavy ball with friction method, I. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system. Communications in Contemporary Mathematics, 2(01):1–34, 2000.
[6] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
[7] M. Betancourt, M. I. Jordan, and A. C. Wilson. On symplectic optimization. arXiv preprint arXiv:1802.03653, 2018.
[8] R. E. Bruck Jr. Asymptotic convergence of nonlinear contraction semigroups in Hilbert space. Journal of Functional Analysis, 18(1):15–26, 1975.
[9] S. Bubeck, Y. T. Lee, and M. Singh. A geometric alternative to Nesterov's accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.
[10] J. Diakonikolas and L. Orecchia. The approximate duality gap technique: A unified theory of first-order methods. arXiv preprint arXiv:1712.02485, 2017.
[11] M. Fazlyab, A. Ribeiro, M. Morari, and V. M. Preciado. Analysis of optimization algorithms via integral quadratic constraints: Nonstrongly convex problems. arXiv preprint arXiv:1705.03615, 2017.
[12] E. Hairer, C. Lubich, and G. Wanner. Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations, volume 31. Springer Science & Business Media, 2006.
[13] B. Hu and L. Lessard. Dissipativity theory for Nesterov's accelerated method. arXiv preprint arXiv:1706.04381, 2017.
[14] E. Isaacson and H. B. Keller. Analysis of Numerical Methods. Courier Corporation, 1994.
[15] W. Krichene, A. Bayen, and P. L. Bartlett. Accelerated mirror descent in continuous and discrete time. In Advances in Neural Information Processing Systems, pages 2845–2853, 2015.
[16] L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.
[17] S. Lojasiewicz. Ensembles semi-analytiques. Lecture Notes, IHES (Bures-sur-Yvette), 1965.
[18] A. Nemirovskii, D. B. Yudin, and E. R. Dawson. Problem Complexity and Method Efficiency in Optimization. 1983.
[19] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.
[20] M. Raginsky and J. Bouvrie. Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence. In Decision and Control (CDC), 2012 IEEE 51st Annual Conference on, pages 6793–6800. IEEE, 2012.
[21] D. Scieur, A. d'Aspremont, and F. Bach. Regularized nonlinear acceleration. In Advances in Neural Information Processing Systems, pages 712–720, 2016.
[22] D. Scieur, V. Roulet, F. Bach, and A. d'Aspremont. Integration methods and accelerated optimization algorithms. arXiv preprint arXiv:1702.06751, 2017.
[23] W. Su, S. Boyd, and E. Candes. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.
[24] J. Verner. High-order explicit Runge-Kutta pairs with low stage order. Applied Numerical Mathematics, 22(1-3):345–357, 1996.
[25] M. West. Variational Integrators. PhD thesis, California Institute of Technology, 2004.
[26] A. Wibisono, A. C. Wilson, and M. I. Jordan. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358, 2016.