# Adaptive Averaging in Accelerated Descent Dynamics

Walid Krichene (UC Berkeley, walid@eecs.berkeley.edu) · Alexandre M. Bayen (UC Berkeley, bayen@berkeley.edu) · Peter L. Bartlett (UC Berkeley and QUT, bartlett@cs.berkeley.edu)

Walid Krichene is currently affiliated with Google (walidk@google.com).

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

**Abstract.** We study accelerated descent dynamics for constrained convex optimization. This dynamics can be described naturally as a coupling of a dual variable accumulating gradients at a given rate $\eta(t)$, and a primal variable obtained as the weighted average of the mirrored dual trajectory, with weights $w(t)$. Using a Lyapunov argument, we give sufficient conditions on $\eta$ and $w$ to achieve a desired convergence rate. As an example, we show that the replicator dynamics (an example of mirror descent on the simplex) can be accelerated using a simple averaging scheme. We then propose an adaptive averaging heuristic which adaptively computes the weights to speed up the decrease of the Lyapunov function. We provide guarantees on adaptive averaging in continuous time, prove that it preserves the quadratic convergence rate of accelerated first-order methods in discrete time, and give numerical experiments to compare it with existing heuristics, such as adaptive restarting. The experiments indicate that adaptive averaging performs at least as well as adaptive restarting, with significant improvements in some cases.

## 1 Introduction

We study the problem of minimizing a convex function $f$ over a feasible set $\mathcal{X}$, a closed convex subset of $E = \mathbb{R}^n$. We assume that $f$ is differentiable, that its gradient $\nabla f$ is Lipschitz with constant $L$, and that the set of minimizers $S = \arg\min_{x \in \mathcal{X}} f(x)$ is non-empty.

We focus on the study of continuous-time, first-order dynamics for optimization. First-order methods have seen a resurgence of interest due to the significant increase in both size and dimensionality of the data sets typically encountered in machine learning and other applications, which makes higher-order methods computationally intractable in most cases. Continuous-time dynamics for optimization have been studied for a long time, e.g. [6, 9, 5], and more recently in [20, 2, 1, 3, 11, 23], in which a connection is made between Nesterov's accelerated methods [14, 15] and a family of continuous-time ODEs. Many optimization algorithms can be interpreted as discretizations of a continuous-time process, and studying the continuous-time dynamics is useful for several reasons: the analysis is often simpler in continuous time, it can help guide the design and analysis of new algorithms, and it provides intuition and insight into the discrete process. For example, Su et al. show in [20] that Nesterov's original method [14] is a discretization of a second-order ODE, and use this interpretation to propose a restarting heuristic which empirically speeds up convergence. In [11], we generalize this approach to the proximal version of Nesterov's method [15], which applies to constrained convex problems, and show that the continuous-time ODE can be interpreted as coupled dynamics of a dual variable $Z(t)$, which evolves in the dual space $E^*$, and a primal variable $X(t)$, which is obtained as the weighted average of a non-linear transformation of the dual trajectory. More precisely,

$$X(t) = \frac{\int_0^t \tau^{r-1}\, \nabla\psi^*(Z(\tau))\, d\tau}{\int_0^t \tau^{r-1}\, d\tau}, \qquad X(0) = \nabla\psi^*(Z(0)) = x_0,$$
where $r \ge 2$ is a fixed parameter, the initial condition $x_0$ is a point in the feasible set $\mathcal{X}$, and $\nabla\psi^*$ is a Lipschitz function that maps from the dual space $E^*$ to the feasible set $\mathcal{X}$, which we refer to as the mirror map (such a function can be constructed using standard results from convex analysis, by taking the convex conjugate of a strongly convex function $\psi$ with domain $\mathcal{X}$; see the supplementary material for a brief review of the definition and basic properties of mirror maps). Using a Lyapunov argument, we show that the solution trajectories of this ODE exhibit a quadratic convergence rate: if $f^\star$ is the minimum of $f$ over the feasible set, then $f(X(t)) - f^\star \le C/t^2$ for a constant $C$ which depends on the initial conditions. This formalized an interesting connection between acceleration and averaging, which had been observed in [8] in the special case of unconstrained quadratic minimization.

A natural question that arises is whether different averaging schemes can be used to achieve the same rate, or perhaps faster rates. In this article, we provide a positive answer. We study a broad family of accelerated mirror descent dynamics, given by

$$\text{AMD}_{w,\eta} \quad \begin{cases} \dot Z(t) = -\eta(t)\, \nabla f(X(t)) \\[4pt] X(t) = \dfrac{X(t_0)\, W(t_0) + \int_{t_0}^t w(\tau)\, \nabla\psi^*(Z(\tau))\, d\tau}{W(t)}, \quad \text{with } W(t) = \int_0^t w(\tau)\, d\tau \\[4pt] X(t_0) = \nabla\psi^*(Z(t_0)) = x_0, \end{cases}$$

parameterized by two positive, continuous weight functions $w$ and $\eta$, where $w$ is used in the averaging and $\eta$ determines the rate at which $Z$ accumulates gradients. This is illustrated in Figure 1. In our formulation we choose to initialize the ODE at $t_0 > 0$ instead of $0$ (to guarantee existence and uniqueness of a solution, as discussed in Section 2). We give a unified study of this ODE using an appropriate Lyapunov function, given by

$$\mathcal{L}_r(X, Z, t) = r(t)\,(f(X) - f^\star) + D_{\psi^*}(Z, z^\star), \tag{2}$$

where $D_{\psi^*}$ is the Bregman divergence associated with $\psi^*$ (a non-negative function defined on $E^* \times E^*$), and $r(t)$ is a desired convergence rate (a non-negative function defined on $\mathbb{R}_+$). By construction, $\mathcal{L}_r$ is a non-negative function on $\mathcal{X} \times E^* \times \mathbb{R}_+$. If $t \mapsto \mathcal{L}_r(X(t), Z(t), t)$ is a non-increasing function for all solution trajectories $(X(t), Z(t))$, then $\mathcal{L}_r$ is said to be a Lyapunov function for the ODE, in reference to Aleksandr Mikhailovich Lyapunov [12]. We give in Theorem 2 a sufficient condition on $\eta$, $w$ and $r$ for $\mathcal{L}_r$ to be a Lyapunov function for $\text{AMD}_{w,\eta}$, and show that under these conditions, $f(X(t))$ converges to $f^\star$ at the rate $1/r(t)$.

Figure 1: Illustration of $\text{AMD}_{w,\eta}$. The dual variable $Z$ evolves in the dual space $E^*$ and accumulates negative gradients at a rate $\eta(t)$; the primal variable $X(t)$ (green solid line) is obtained by averaging the mirrored trajectory $\{\nabla\psi^*(Z(\tau)),\ \tau \in [t_0, t]\}$ (green dashed line) with weights $w(\tau)$.

In Section 3, we give an equivalent formulation of $\text{AMD}_{w,\eta}$ written purely in the primal space, and give several examples of these dynamics for simple constraint sets. In particular, when the feasible set is the probability simplex, we derive an accelerated version of the replicator dynamics, an ODE that plays an important role in evolutionary game theory [22] and viability theory [4].

Many heuristics have been developed to empirically speed up the convergence of accelerated methods. Most of these heuristics consist in restarting the ODE (or the algorithm in discrete time) whenever a simple condition is met.
For example, a gradient restart heuristic is proposed in [17], in which the algorithm is restarted whenever the trajectory forms an acute angle with the gradient (which intuitively indicates that the trajectory is not making progress), and a speed restart heuristic is proposed in [20], in which the ODE is restarted whenever the speed $\|\dot X(t)\|$ decreases (which intuitively indicates that progress is slowing). These heuristics are known to empirically improve the speed of convergence, but provide few guarantees: the gradient restart in [17] is only studied for unconstrained quadratic problems, and the speed restart in [20] is only studied for unconstrained strongly convex problems. In particular, it is not guaranteed (to our knowledge) that these heuristics preserve the original convergence rate of the non-restarted method when the objective function is not strongly convex. In Section 4, we propose a new heuristic that provides such guarantees, based on a simple idea for adaptively computing the weights $w(t)$ along the solution trajectories. The heuristic simply decreases the time derivative of the Lyapunov function $\mathcal{L}_r(X(t), Z(t), t)$ whenever possible; thus it preserves the $1/r(t)$ convergence rate.

Other adaptive methods have been applied to convex optimization, such as AdaGrad [7] and Adam [10], which adapt the learning rate in first-order methods by maintaining moment estimates of the observed gradients. They are particularly well suited to problems with sparse gradients. While these methods are similar in spirit to adaptive averaging, they are not designed for accelerated methods.

In Section 5, we give numerical experiments in which we compare the performance of adaptive averaging and restarting. The experiments indicate that adaptive averaging compares favorably in all of the examples, and gives a significant improvement in some cases. We conclude with a brief discussion in Section 6.

## 2 Accelerated mirror descent with generalized averaging

We start by giving an equivalent form of $\text{AMD}_{w,\eta}$, which we use to briefly discuss existence and uniqueness of a solution. Writing the second equation as $X(t)W(t) - X(t_0)W(t_0) = \int_{t_0}^t w(\tau)\, \nabla\psi^*(Z(\tau))\, d\tau$, then taking the time derivative, we have $\dot X(t) W(t) + X(t) w(t) = w(t)\, \nabla\psi^*(Z(t))$. Thus the ODE is equivalent to

$$\text{AMD}'_{w,\eta} \quad \begin{cases} \dot Z(t) = -\eta(t)\, \nabla f(X(t)) \\[4pt] \dot X(t) = \dfrac{w(t)}{W(t)} \left( \nabla\psi^*(Z(t)) - X(t) \right) \\[4pt] X(t_0) = \nabla\psi^*(Z(t_0)) = x_0. \end{cases}$$

The following theorem guarantees existence and uniqueness of the solution.

**Theorem 1.** Suppose that $W(t_0) > 0$. Then $\text{AMD}_{w,\eta}$ has a unique maximal (i.e. defined on a maximal interval) solution $(X(t), Z(t))$ in $C^1([t_0, +\infty))$. Furthermore, for all $t \ge t_0$, $X(t)$ belongs to the feasible set $\mathcal{X}$.

*Proof.* Recall that, by assumption, $\nabla f$ and $\nabla\psi^*$ are both Lipschitz, and $w$, $\eta$ are continuous. Furthermore, $W(t)$ is non-decreasing and continuous, as the integral of a non-negative function, thus $w(t)/W(t) \le w(t)/W(t_0)$. This guarantees that on any finite interval $[t_0, T)$, the functions $\eta(t)$ and $w(t)/W(t)$ are bounded. Therefore, $-\eta(t)\,\nabla f(X)$ and $\frac{w(t)}{W(t)}(\nabla\psi^*(Z) - X)$ are Lipschitz functions of $(X, Z)$, uniformly in $t \in [t_0, T)$. By the Cauchy–Lipschitz theorem (e.g. Theorem 2.5 in [21]), there exists a unique $C^1$ solution defined on $[t_0, T)$. Since $T$ is arbitrary, this defines a unique solution on all of $[t_0, +\infty)$: indeed, any two solutions defined on $[t_0, T_1)$ and $[t_0, T_2)$ with $T_2 > T_1$ coincide on $[t_0, T_1)$. Finally, feasibility of the solution follows from the fact that $\mathcal{X}$ is convex and $X(t)$ is the weighted average of points in $\mathcal{X}$, namely $x_0$ and the set $\{\nabla\psi^*(Z(\tau)),\ \tau \in [t_0, t]\}$. ∎
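To build intuition for the derivative form $\text{AMD}'_{w,\eta}$, it can be integrated directly with forward Euler. The following is a minimal sketch of ours (not code from the paper): it takes the unconstrained case $\mathcal{X} = E = \mathbb{R}^n$ with $\psi^*(z) = \frac{1}{2}\|z\|^2$, so that $\nabla\psi^*$ is the identity, and uses the weights $a(t) = w(t)/W(t) = \beta/t$, $\eta(t) = \beta t$ of Example 1 below; all function and variable names are our own.

```python
import numpy as np

def amd_flow(grad_f, x0, t0=1.0, T=20.0, dt=1e-3, beta=2.0):
    """Forward-Euler integration of the equivalent form AMD'_{w,eta}:
        Z'(t) = -eta(t) * grad_f(X(t)),
        X'(t) = (w(t)/W(t)) * (grad_psi_star(Z(t)) - X(t)),
    in the unconstrained case psi*(z) = ||z||^2 / 2 (grad_psi_star = identity),
    with a(t) = w(t)/W(t) = beta/t and eta(t) = beta*t, i.e. rate r(t) = t^2."""
    x = np.asarray(x0, dtype=float).copy()
    z = x.copy()                           # X(t0) = grad_psi_star(Z(t0)) = x0
    t = t0
    while t < T:
        g = grad_f(x)
        z = z - dt * beta * t * g          # Z' = -eta(t) grad f(X)
        x = x + dt * (beta / t) * (z - x)  # X' = a(t) (grad_psi_star(Z) - X)
        t += dt
    return x

# Example: f(x) = ||x||^2 / 2, whose unique minimizer is the origin.
print(amd_flow(lambda x: x, [1.0, -2.0]))  # approaches [0, 0]
```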
Note that in general, it is important to initialize the ODE at $t_0 > 0$ and not at $0$, since $W(0) = 0$ and $w(t)/W(t)$ can diverge at $0$, in which case one cannot apply the Cauchy–Lipschitz theorem. It is possible, however, to prove existence and uniqueness with $t_0 = 0$ for some choices of $w$, by taking a sequence of Lipschitz ODEs that approximate the original one, as is done in [20], but this is a technicality and does not matter for practical purposes.

We now move to our main result for this section. Suppose that $r$ is an increasing, positive, differentiable function on $[t_0, +\infty)$, and consider the candidate Lyapunov function $\mathcal{L}_r$ defined in (2), where the Bregman divergence term is given by $D_{\psi^*}(z, y) := \psi^*(z) - \psi^*(y) - \langle \nabla\psi^*(y), z - y \rangle$, and $z^\star$ is a point in the dual space such that $\nabla\psi^*(z^\star) = x^\star$ belongs to the set of minimizers $S$. Let $(X(t), Z(t))$ be the unique maximal solution trajectory of $\text{AMD}_{w,\eta}$. Taking the derivative of $t \mapsto \mathcal{L}_r(X(t), Z(t), t) = r(t)(f(X(t)) - f^\star) + D_{\psi^*}(Z(t), z^\star)$, we have

$$\begin{aligned}
\frac{d}{dt} \mathcal{L}_r(X(t), Z(t), t)
&= r'(t)\,(f(X(t)) - f^\star) + r(t) \left\langle \nabla f(X(t)), \dot X(t) \right\rangle + \left\langle \dot Z(t),\ \nabla\psi^*(Z(t)) - \nabla\psi^*(z^\star) \right\rangle \\
&= r'(t)\,(f(X(t)) - f^\star) + r(t) \left\langle \nabla f(X(t)), \dot X(t) \right\rangle - \eta(t) \left\langle \nabla f(X(t)),\ X(t) + \tfrac{W(t)}{w(t)} \dot X(t) - x^\star \right\rangle \\
&\le (f(X(t)) - f^\star)\,(r'(t) - \eta(t)) + \left\langle \nabla f(X(t)), \dot X(t) \right\rangle \left( r(t) - \eta(t)\,\tfrac{W(t)}{w(t)} \right), \tag{3}
\end{aligned}$$

where we used the expressions for $\dot Z$ and $\nabla\psi^*(Z)$ from $\text{AMD}'_{w,\eta}$ in the second equality, and convexity of $f$ in the last inequality. Equipped with this bound, it becomes straightforward to give sufficient conditions for $\mathcal{L}_r$ to be a Lyapunov function.

**Theorem 2.** Suppose that for all $t \in [t_0, +\infty)$,
1. $\eta(t) \ge r'(t)$, and
2. $\left\langle \nabla f(X(t)), \dot X(t) \right\rangle \left( r(t) - \eta(t) \frac{W(t)}{w(t)} \right) \le 0$.

Then $\mathcal{L}_r$ is a Lyapunov function for $\text{AMD}_{w,\eta}$, and for all $t \ge t_0$, $f(X(t)) - f^\star \le \frac{\mathcal{L}_r(X(t_0), Z(t_0), t_0)}{r(t)}$.

*Proof.* The two conditions, combined with inequality (3), imply that $\frac{d}{dt} \mathcal{L}_r(X(t), Z(t), t) \le 0$, thus $\mathcal{L}_r$ is a Lyapunov function. Finally, since $D_{\psi^*}$ is non-negative and $\mathcal{L}_r$ is decreasing, we have

$$f(X(t)) - f^\star \le \frac{\mathcal{L}_r(X(t), Z(t), t)}{r(t)} \le \frac{\mathcal{L}_r(X(t_0), Z(t_0), t_0)}{r(t)},$$

which proves the claim. ∎

Note that the second condition depends on the solution trajectory $X(t)$, and may be hard to check a priori. However, we give one special case in which the condition trivially holds.

**Corollary 1.** Suppose that for all $t \in [t_0, +\infty)$, $\eta(t) = \frac{w(t)\, r(t)}{W(t)}$ and $\frac{w(t)}{W(t)} \ge \frac{r'(t)}{r(t)}$. Then $\mathcal{L}_r$ is a Lyapunov function for $\text{AMD}_{w,\eta}$, and for all $t \ge t_0$, $f(X(t)) - f^\star \le \frac{\mathcal{L}_r(X(t_0), Z(t_0), t_0)}{r(t)}$.

Next, we describe a method to construct weight functions $w$, $\eta$ that satisfy the conditions of Corollary 1, given a desired rate $r$. Of course, it suffices to construct $w$ that satisfies $\frac{w(t)}{W(t)} \ge \frac{r'(t)}{r(t)}$, then to set $\eta(t) = \frac{w(t)\, r(t)}{W(t)}$. We can reparameterize the weight function by writing $\frac{w(t)}{W(t)} = a(t)$. Then integrating from $t_0$ to $t$, we have $\frac{W(t)}{W(t_0)} = e^{\int_{t_0}^t a(\tau)\, d\tau}$, and

$$w(t) = w(t_0)\, \frac{a(t)}{a(t_0)}\, e^{\int_{t_0}^t a(\tau)\, d\tau}. \tag{4}$$

Therefore the conditions of the corollary are satisfied whenever $w(t)$ is of the form (4) and $a : \mathbb{R}_+ \to \mathbb{R}_+$ is a continuous, positive function with $a(t) \ge \frac{r'(t)}{r(t)}$. Note that the expression of $w$ is defined up to the constant $w(t_0)$, which reflects the fact that the condition of the corollary is scale-invariant (if the condition holds for a function $w$, then it holds for $\alpha w$ for all $\alpha > 0$).

**Example 1.** Let $r(t) = t^2$. Then $r'(t)/r(t) = 2/t$, and we can take $a(t) = \frac{\beta}{t}$ with $\beta \ge 2$. Then $w(t) = \frac{a(t)}{a(t_0)}\, e^{\int_{t_0}^t a(\tau)\, d\tau} = \frac{\beta/t}{\beta/t_0}\, e^{\beta \ln(t/t_0)} = (t/t_0)^{\beta - 1}$ and $\eta(t) = \frac{w(t)\, r(t)}{W(t)} = \beta t$, and we recover the weighting scheme used in [11].

**Example 2.** More generally, if $r(t) = t^p$, $p \ge 1$, then $r'(t)/r(t) = p/t$, and we can take $a(t) = \frac{\beta}{t}$ with $\beta \ge p$. Then $w(t) = (t/t_0)^{\beta - 1}$, and $\eta(t) = \frac{w(t)\, r(t)}{W(t)} = \beta t^{p-1}$.
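The algebra behind Examples 1 and 2 can be checked symbolically. The sketch below is ours (using sympy; not part of the paper): it recovers $w(t) = (t/t_0)^{\beta-1}$ and $\eta(t) = \beta t^{p-1}$ from $r(t) = t^p$ and $a(t) = \beta/t$ via (4) and Corollary 1, under the normalization $w(t_0) = 1$.

```python
import sympy as sp

t, tau, t0, beta, p = sp.symbols('t tau t_0 beta p', positive=True)

r = t**p
a = beta / t  # a(t) = w(t)/W(t); Corollary 1 needs a(t) >= r'(t)/r(t) = p/t, i.e. beta >= p

# Equation (4) with w(t0) = 1: w(t) = (a(t)/a(t0)) * exp(int_{t0}^t a(tau) dtau)
w = (a / a.subs(t, t0)) * sp.exp(sp.integrate(a.subs(t, tau), (tau, t0, t)))
w = sp.simplify(w)  # equals (t/t0)**(beta - 1)

# W(t) = W(t0) + int_{t0}^t w(tau) dtau, with W(t0) = w(t0)/a(t0) = t0/beta
W = t0 / beta + sp.integrate(w.subs(t, tau), (tau, t0, t))
eta = sp.simplify(w * r / W)  # Corollary 1: eta = w r / W, equals beta * t**(p - 1)

print(w, eta)
```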
We also exhibit a second energy function that is guaranteed to decrease under the same conditions. This energy function, unlike the Lyapunov function $\mathcal{L}_r$, does not guarantee a specific convergence rate; however, it captures a natural measure of energy in the system. To define it, we use the following characterization of the inverse mirror map: by duality of the subdifferentials (e.g. Theorem 23.5 in [18]), we have for a pair of convex conjugate functions $\psi$ and $\psi^*$ that $x \in \partial\psi^*(x^*)$ if and only if $x^* \in \partial\psi(x)$. To simplify the discussion, we will assume that $\psi$ is also differentiable, so that $(\nabla\psi^*)^{-1} = \nabla\psi$ (this assumption can be relaxed). In what follows, we denote $\check X = \nabla\psi(X)$ and $\check Z = \nabla\psi^*(Z)$.

**Theorem 3.** Let $(X(t), Z(t))$ be the unique maximal solution of $\text{AMD}_{w,\eta}$, and let $\check X = \nabla\psi(X)$. Consider the energy function

$$E_r(t) = f(X(t)) + \frac{1}{r(t)}\, D_{\psi^*}(Z(t), \check X(t)). \tag{5}$$

Then if $w$, $\eta$ satisfy condition (2) of Theorem 2, $E_r$ is a decreasing function of time.

*Proof.* To make the notation more concise, we omit the explicit dependence on time in this proof. We have $D_{\psi^*}(Z, \check X) = \psi^*(Z) - \psi^*(\check X) - \langle X, Z - \check X \rangle$. Taking the time derivative,

$$\frac{d}{dt} D_{\psi^*}(Z, \check X) = \left\langle \nabla\psi^*(Z), \dot Z \right\rangle - \left\langle \nabla\psi^*(\check X), \dot{\check X} \right\rangle - \left\langle \dot X, Z - \check X \right\rangle - \left\langle X, \dot Z - \dot{\check X} \right\rangle = \left\langle \nabla\psi^*(Z) - X, \dot Z \right\rangle - \left\langle \dot X, Z - \check X \right\rangle,$$

where the middle terms cancel because $\nabla\psi^*(\check X) = X$. Using the second equation in $\text{AMD}'_{w,\eta}$, we have $\nabla\psi^*(Z) - X = \frac{1}{a} \dot X$, and $\langle \dot X, Z - \check X \rangle = a \langle \nabla\psi^*(Z) - \nabla\psi^*(\check X), Z - \check X \rangle \ge 0$ by monotonicity of $\nabla\psi^*$. Combining, we have $\frac{d}{dt} D_{\psi^*}(Z, \check X) \le -\frac{\eta}{a} \langle \dot X, \nabla f(X) \rangle$, and we can finally bound the derivative of $E_r$ (using that $r$ is increasing and $D_{\psi^*} \ge 0$):

$$\frac{d}{dt} E_r(t) = \left\langle \nabla f(X), \dot X \right\rangle + \frac{1}{r} \frac{d}{dt} D_{\psi^*}(Z, \check X) - \frac{r'}{r^2}\, D_{\psi^*}(Z, \check X) \le \left\langle \nabla f(X), \dot X \right\rangle \left( 1 - \frac{\eta}{a\, r} \right).$$

Therefore condition (2) of Theorem 2 implies that $\frac{d}{dt} E_r(t) \le 0$. ∎

This energy function can be interpreted, loosely speaking, as the sum of a potential energy given by $f(X)$ and a kinetic energy given by $\frac{1}{r(t)} D_{\psi^*}(Z, \check X)$: indeed, when the problem is unconstrained, one can take $\psi^*(z) = \frac{1}{2}\|z\|^2$, in which case $\nabla\psi^* = \nabla\psi = I$, the identity, and $D_{\psi^*}(Z, \check X) = \frac{1}{2}\|\check Z - X\|^2 = \frac{1}{2}\left\|\frac{\dot X}{a}\right\|^2$, a quantity proportional to the kinetic energy.

## 3 Primal Representation and Example Dynamics

An equivalent primal representation can be obtained by rewriting the equations in terms of $\check Z = \nabla\psi^*(Z)$ and its derivatives ($\check Z$ is a primal variable that remains in $\mathcal{X}$, since $\nabla\psi^*$ maps into $\mathcal{X}$). In this section, we assume that $\psi^*$ is twice differentiable on $E^*$. Taking the time derivative of $\check Z(t) = \nabla\psi^*(Z(t))$, we have

$$\dot{\check Z}(t) = \nabla^2\psi^*(Z(t))\, \dot Z(t) = -\eta(t)\, \nabla^2\psi^*\!\left(\nabla\psi(\check Z(t))\right) \nabla f(X(t)),$$

where $\nabla^2\psi^*(z)$ is the Hessian of $\psi^*$ at $z$, defined by $\nabla^2\psi^*(z)_{ij} = \frac{\partial^2 \psi^*(z)}{\partial z_j\, \partial z_i}$. Then, using the averaging expression for $X$, we can write $\text{AMD}_{w,\eta}$ in the following primal form:

$$\text{AMD}^p_{w,\eta} \quad \begin{cases} \dot{\check Z}(t) = -\eta(t)\, \nabla^2\psi^*\!\left(\nabla\psi(\check Z(t))\right) \nabla f\!\left( \dfrac{x_0\, W(t_0) + \int_{t_0}^t w(\tau)\, \check Z(\tau)\, d\tau}{W(t)} \right) \\[4pt] \check Z(t_0) = x_0. \end{cases} \tag{6}$$

A similar derivation can be made for the mirror descent ODE without acceleration, which can be written as follows [11] (see also the original derivation of Nemirovski and Yudin in Chapter 3 of [13]):

$$\begin{cases} \dot Z(t) = -\nabla f(X(t)) \\ X(t) = \nabla\psi^*(Z(t)) \\ X(t_0) = x_0. \end{cases}$$

Note that this can be interpreted as a limit case of $\text{AMD}_{w,\eta}$ with $\eta(t) \equiv 1$ and $w(t)$ a Dirac function at $t$. Taking the time derivative of $X(t) = \nabla\psi^*(Z(t))$, we have $\dot X(t) = \nabla^2\psi^*(Z(t))\, \dot Z(t)$, which leads to the primal form of the mirror descent ODE:

$$\text{MD}^p \quad \begin{cases} \dot X(t) = -\nabla^2\psi^*\!\left(\nabla\psi(X(t))\right) \nabla f(X(t)) \\ X(t_0) = x_0. \end{cases} \tag{7}$$

The operator $\nabla^2\psi^* \circ \nabla\psi$ appears in both primal representations (6) and (7), and multiplies the gradient of $f$. It can be thought of as a transformation of the gradient which ensures that the primal trajectory remains in the feasible set; this is illustrated in the supplementary material. For some choices of $\psi$, $\nabla^2\psi^* \circ \nabla\psi$ has a simple expression. We give two examples below.
We also observe that in its primal form, $\text{AMD}^p_{w,\eta}$ is a generalization of the ODE family studied in [23], which can be written as

$$\frac{d}{dt}\, \nabla\psi\!\left( X(t) + e^{-\alpha(t)} \dot X(t) \right) = -e^{\alpha(t) + \beta(t)}\, \nabla f(X(t)),$$

for which they prove the convergence rate $O(e^{-\beta(t)})$. This corresponds to setting, in our notation, $a(t) = e^{\alpha(t)}$, $r(t) = e^{\beta(t)}$, and taking $\eta(t) = a(t)\, r(t)$ (which corresponds to the condition of Corollary 1).

**Positive-orthant-constrained dynamics.** Suppose that $\mathcal{X}$ is the positive orthant $\mathbb{R}^n_+$, and consider the negative entropy function $\psi(x) = \sum_i x_i \ln x_i$. Then its dual is $\psi^*(z) = \sum_i e^{z_i - 1}$, and we have $\nabla\psi(x)_i = 1 + \ln x_i$ and $\nabla^2\psi^*(z)_{ij} = \delta_i^j\, e^{z_i - 1}$, where $\delta_i^j$ is $1$ if $i = j$ and $0$ otherwise. Thus for all $x \in \mathbb{R}^n_+$, $\nabla^2\psi^*(\nabla\psi(x)) = \operatorname{diag}(x)$. Therefore, the primal forms (7) and (6) reduce to, respectively,

$$\begin{cases} \forall i,\ \dot X_i = -X_i\, \nabla f(X)_i \\ X(0) = x_0 \end{cases} \qquad\qquad \begin{cases} \forall i,\ \dot{\check Z}_i = -\eta(t)\, \check Z_i\, \nabla f(\bar X)_i \\ \check Z(t_0) = x_0, \end{cases}$$

where in the second ODE we write $\bar X$ compactly to denote the weighted average given by the second equation of $\text{AMD}_{w,\eta}$. When $f$ is affine, the mirror descent ODE leads to the Lotka–Volterra equation, which has applications in economics and ecology. For the mirror descent ODE, one can verify that the solution remains in the positive orthant, since $\dot X_i$ tends to $0$ as $X_i$ approaches the boundary of the feasible set. Similarly, for the accelerated version, $\dot{\check Z}_i$ tends to $0$ as $\check Z_i$ approaches the boundary, thus $\check Z$ remains feasible, and so does $X$ by convexity.

**Simplex-constrained dynamics: the replicator equation.** Now suppose that $\mathcal{X}$ is the $n$-simplex, $\mathcal{X} = \Delta = \{x \in \mathbb{R}^n_+ : \sum_{i=1}^n x_i = 1\}$. Consider the distance-generating function $\psi(x) = \sum_{i=1}^n x_i \ln x_i + \delta_{\mathcal{X}}(x)$, where $\delta_{\mathcal{X}}(\cdot)$ is the convex indicator function of the feasible set. Then its conjugate is $\psi^*(z) = \ln\left( \sum_{i=1}^n e^{z_i} \right)$, defined on $E^*$, and we have

$$\nabla\psi(x)_i = 1 + \ln x_i, \qquad \nabla\psi^*(z)_i = \frac{e^{z_i}}{\sum_k e^{z_k}}, \qquad \nabla^2\psi^*(z)_{ij} = \delta_i^j\, \frac{e^{z_i}}{\sum_k e^{z_k}} - \frac{e^{z_i}\, e^{z_j}}{\left( \sum_k e^{z_k} \right)^2}.$$

Then it is simple to calculate

$$\nabla^2\psi^*(\nabla\psi(x))_{ij} = \delta_i^j\, \frac{x_i}{\sum_k x_k} - \frac{x_i\, x_j}{\left( \sum_k x_k \right)^2} = \delta_i^j\, x_i - x_i x_j \quad \text{on } \Delta.$$

Therefore, the primal forms (7) and (6) reduce to, respectively,

$$\begin{cases} \forall i,\ \dot X_i + X_i \left( \nabla f(X)_i - \langle X, \nabla f(X) \rangle \right) = 0 \\ X(0) = x_0 \end{cases} \qquad \begin{cases} \forall i,\ \dot{\check Z}_i + \eta(t)\, \check Z_i \left( \nabla f(\bar X)_i - \langle \check Z, \nabla f(\bar X) \rangle \right) = 0 \\ \check Z(t_0) = x_0. \end{cases}$$

The first ODE is known as the replicator dynamics [19], and has many applications in evolutionary game theory [22] and viability theory [4], among others. See the supplementary material for additional discussion on the interpretation and applications of the replicator dynamics. This example shows that the replicator dynamics can be accelerated simply by performing the original replicator update on the variable $\check Z$, in which (i) the gradient of the objective function is scaled by $\eta(t)$ at time $t$, and (ii) the gradient is evaluated at $\bar X(t)$, the weighted average of the $\check Z$ trajectory.

## 4 Adaptive Averaging Heuristic

In this section, we propose an adaptive averaging heuristic for adaptively computing the weights $w$. Note that in Corollary 1, we simply set $a(t) = \frac{\eta(t)}{r(t)}$, so that $\left\langle \nabla f(X(t)), \dot X(t) \right\rangle \left( r(t) - \frac{\eta(t)}{a(t)} \right)$ is identically zero (thus trivially satisfying condition (2) of Theorem 2). However, from the bound (3), if this term is negative, then it helps further decrease the Lyapunov function $\mathcal{L}_r$ (as well as the energy function $E_r$). A simple strategy is then to choose $a(t)$ adaptively:

$$a(t) = \frac{\eta(t)}{r(t)} \ \text{ whenever } \left\langle \nabla f(X(t)), \dot X(t) \right\rangle > 0, \qquad \dot a(t) = 0 \ \text{ otherwise,} \tag{8}$$

which keeps $a(t) \ge \eta(t)/r(t)$ whenever $\eta(t)/r(t)$ is non-increasing, so the term remains non-positive. If we further have $\eta(t) \ge r'(t)$, then the conditions of Theorem 2 and Theorem 3 are satisfied, which guarantees that $\mathcal{L}_r$ is a Lyapunov function and that the energy $E_r$ decreases. In particular, such a heuristic preserves the convergence rate $r(t)$ by Theorem 2. We now propose a discrete version of the heuristic when $r(t) = t^2$.
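To make the continuous-time heuristic concrete before moving to the discrete version, here is a minimal forward-Euler sketch of ours (not the paper's Algorithm 1, which follows in the next section): it integrates the accelerated replicator dynamics of Section 3 with $r(t) = t^2$ and $\eta(t) = \beta t$, resetting $a$ to $\eta(t)/r(t) = \beta/t$ whenever $\langle \nabla f(X), \dot X \rangle > 0$ and freezing it otherwise, as in (8). All names are our own.

```python
import numpy as np

def adaptive_replicator(grad_f, x0, t0=1.0, T=50.0, dt=1e-3, beta=2.0):
    """Forward-Euler sketch of the accelerated replicator dynamics on the
    simplex with adaptive averaging (8): r(t) = t^2, eta(t) = beta*t, and
    a(t) = w(t)/W(t) reset to eta/r = beta/t on bad steps, frozen otherwise."""
    x = np.asarray(x0, dtype=float).copy()  # averaged primal variable X(t)
    zc = x.copy()                           # mirrored dual variable Z_check(t)
    t, a = t0, beta / t0                    # start with a(t0) = eta(t0)/r(t0)
    while t < T:
        g = grad_f(x)
        x_dot = a * (zc - x)                # X' = a(t) (Z_check - X)
        if g.dot(x_dot) > 0:                # <grad f(X), X'> > 0: reset a
            a = beta / t
            x_dot = a * (zc - x)
        # accelerated replicator step: Z_check' = -eta(t) Z_check * (g - <Z_check, g>)
        zc = zc + dt * (-beta * t) * zc * (g - zc.dot(g))
        x = x + dt * x_dot
        t += dt
    return x

# Example: linear objective f(x) = <c, x> over the simplex; the minimizer
# puts all mass on the smallest coordinate of c.
c = np.array([0.5, 0.2, 0.9])
print(adaptive_replicator(lambda x: c, np.ones(3) / 3))  # approaches [0, 1, 0]
```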
For the discrete version, we consider the quadratic rate $r(t) = t^2$ in particular, since in this case the discretization proposed in [11] preserves the quadratic rate and corresponds to a first-order accelerated method², for which many heuristics have been developed, such as the restarting heuristics [17, 20] discussed in the introduction. To satisfy condition (1) of Theorem 2, we choose $\eta(t) = \beta t$ with $\beta \ge 2$; note that in this case $\eta(t) \ge r'(t) = 2t$. In the supplementary material, we propose a discretization of the heuristic (8), using the correspondence $t = k\sqrt{s}$ for a step size $s$. The resulting method is summarized in Algorithm 1, where $\psi$ is a smooth distance-generating function and $R$ is a regularizer assumed to be strongly convex and smooth. We give a bound on the convergence rate of Algorithm 1 in the supplementary material; the proof relies on a discrete counterpart of the Lyapunov function $\mathcal{L}_r$. The algorithm keeps $a_k = a_{k-1}$ whenever $f(\tilde x^{(k+1)}) \le f(\tilde x^{(k)})$, and sets $a_k$ to $\frac{\beta}{k\sqrt{s}}$ otherwise; this results in a non-increasing sequence $a_k$. It is worth observing that in continuous time, from the expression (4), a constant $a(t)$ over an interval $[t_1, t_2]$ corresponds to an exponential increase in the weight $w(t)$ over that interval, while $a(t) = \frac{\beta}{t}$ corresponds to a polynomial increase $w(t) = (t/t_0)^{\beta - 1}$. Intuitively, adaptive averaging increases the weights $w(t)$ on portions of the trajectory which make progress.

**Algorithm 1** Accelerated mirror descent with adaptive averaging
1: Initialize $\tilde x^{(0)} = x_0$, $\check z^{(0)} = x_0$, $a_1 = \frac{\beta}{\sqrt{s}}$
2: for $k \in \mathbb{N}$ do
3: $\quad \check z^{(k+1)} = \arg\min_{\check z \in \mathcal{X}} \left\{ \beta k s \left\langle \nabla f(x^{(k)}), \check z \right\rangle + D_\psi(\check z, \check z^{(k)}) \right\}$
4: $\quad \tilde x^{(k+1)} = \arg\min_{\tilde x \in \mathcal{X}} \left\{ \gamma s \left\langle \nabla f(x^{(k)}), \tilde x \right\rangle + R(\tilde x, x^{(k)}) \right\}$
5: $\quad x^{(k+1)} = \lambda_{k+1}\, \check z^{(k+1)} + (1 - \lambda_{k+1})\, \tilde x^{(k+1)}$, with $\lambda_k = \frac{\sqrt{s}\, a_k}{1 + \sqrt{s}\, a_k}$
6: $\quad a_k = \min\left( a_{k-1},\ \frac{\beta_{\max}}{k\sqrt{s}} \right)$
7: $\quad$ if $f(\tilde x^{(k+1)}) - f(\tilde x^{(k)}) > 0$ then
8: $\qquad a_k = \frac{\beta}{k\sqrt{s}}$

## 5 Numerical Experiments

In this section, we compare our adaptive averaging heuristic (in its discrete version given in Algorithm 1) to existing restarting heuristics. We consider simplex-constrained problems and take the distance-generating function $\psi$ to be the entropy function, so that the resulting algorithm is a discretization of the accelerated replicator ODE studied in Section 3. We perform the experiments in $\mathbb{R}^3$ so that we can visualize the solution trajectories (the supplementary material contains additional experiments in higher dimension). We consider different objective functions: a strongly convex quadratic $f(x) = (x - s)^T A (x - s)$ for a positive definite matrix $A$, a weakly convex quadratic, a linear function $f(x) = c^T x$, and the Kullback–Leibler divergence $f(x) = D_{\mathrm{KL}}(x^\star, x)$. We compare the following methods:
1. The original accelerated mirror descent method, in which the weights follow the predetermined schedule $a_k = \frac{\beta}{k\sqrt{s}}$;
2. Our adaptive averaging, in which $a_k$ is computed adaptively following Algorithm 1;
3. The gradient restarting heuristic of [17], in which the algorithm is restarted from the current point whenever $\left\langle \nabla f(x^{(k)}),\ x^{(k+1)} - x^{(k)} \right\rangle > 0$;
4. The speed restarting heuristic of [20], in which the algorithm is restarted from the current point whenever $\|x^{(k+1)} - x^{(k)}\| < \|x^{(k)} - x^{(k-1)}\|$.

The results are shown in Figure 2. Each subfigure is divided into four plots: clockwise from the top left, we show the value of the objective function, the trajectory on the simplex, the value of the energy function $E_r$, and the value of the Lyapunov function $\mathcal{L}_r$.

² For faster rates $r(t) = t^p$, $p > 2$, it is possible to discretize the ODE and preserve the convergence rate, as proposed by Wibisono et al.
[23]; however, this discretization results in a higher-order method, such as Nesterov's accelerated cubic-regularized Newton method [16].

The experiments show that adaptive averaging compares favorably to the restarting heuristics on all these examples, with a significant improvement in the strongly convex case. Additionally, the experiments confirm that under the adaptive averaging heuristic, the Lyapunov function is decreasing; this is not the case for the restarting heuristics, as can be seen in the weakly convex example. It is interesting to observe, however, that the energy function $E_r$ is non-increasing for all the methods in our experiments. If we interpret the energy as the sum of a potential and a kinetic term, this can be explained intuitively by the fact that restarting keeps the potential energy constant and decreases the kinetic energy (since the velocity is reset to zero). It is also worth observing that even though the Lyapunov function $\mathcal{L}_r$ is non-increasing, it will not necessarily converge to $0$ when there is more than one minimizer (its limit depends on the choice of $z^\star$ in the definition of $\mathcal{L}_r$).

Finally, we observe that the methods have a different qualitative behavior: the original accelerated method typically exhibits oscillations around the set of minimizers. The heuristics alleviate these oscillations in different ways: intuitively, adaptive averaging acts by increasing the weights on portions of the trajectory which make the most progress, while the restarting heuristics reset the velocity to zero whenever the algorithm detects that the trajectory is moving in a bad direction. The speed restarting heuristic seems to be more conservative, in that it restarts more frequently.

Figure 2: Examples of accelerated descent with adaptive averaging and restarting. (a) Strongly convex quadratic. (b) Weakly convex function. (c) Linear function. (d) KL divergence.

## 6 Conclusion

Motivated by the averaging formulation of accelerated mirror descent, we studied a family of ODEs with a generalized averaging scheme, and gave simple sufficient conditions on the weight functions to guarantee a given convergence rate in continuous time. We showed as an example how the replicator ODE can be accelerated by averaging. Our adaptive averaging heuristic preserves the convergence rate (since it preserves the Lyapunov function), and it seems to perform at least as well as other heuristics for first-order accelerated methods, and in some cases considerably better. This encourages further investigation into the performance of adaptive averaging, both theoretically (by attempting to prove faster rates, e.g. for strongly convex functions) and numerically, by testing it on other methods, such as the higher-order accelerated methods proposed in [23].

## References

[1] H. Attouch and J. Peypouquet. The rate of convergence of Nesterov's accelerated forward-backward method is actually faster than 1/k². SIAM Journal on Optimization, 26(3):1824–1834, 2016.
[2] H. Attouch, J. Peypouquet, and P. Redont. Fast convergence of an inertial gradient-like system with vanishing viscosity. CoRR, abs/1507.04782, 2015.
[3] H. Attouch, J. Peypouquet, and P. Redont. Fast convex optimization via inertial dynamics with Hessian driven damping. CoRR, abs/1601.07113, 2016.
[4] J.-P. Aubin. Viability Theory. Birkhäuser Boston Inc., Cambridge, MA, USA, 1991.
[5] A. Bloch, editor. Hamiltonian and Gradient Flows, Algorithms, and Control. American Mathematical Society, 1994.
[6] A. A. Brown and M. C. Bartholomew-Biggs.
Some effective methods for unconstrained optimization based on the solution of systems of ordinary differential equations. Journal of Optimization Theory and Applications, 62(2):211–224, 1989.
[7] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, July 2011.
[8] N. Flammarion and F. R. Bach. From averaging to acceleration, there is only a step-size. In 28th Conference on Learning Theory (COLT), pages 658–695, 2015.
[9] U. Helmke and J. Moore. Optimization and Dynamical Systems. Communications and Control Engineering Series. Springer-Verlag, 1994.
[10] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2014.
[11] W. Krichene, A. Bayen, and P. Bartlett. Accelerated mirror descent in continuous and discrete time. In NIPS, 2015.
[12] A. Lyapunov. General Problem of the Stability of Motion. Control Theory and Applications Series. Taylor & Francis, 1992.
[13] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics. Wiley, 1983.
[14] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.
[15] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.
[16] Y. Nesterov. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159–181, 2008.
[17] B. O'Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, 2015.
[18] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
[19] K. Sigmund. A survey of replicator equations. In Complexity, Language, and Life: Mathematical Approaches, pages 88–104. Springer, Berlin, Heidelberg, 1986.
[20] W. Su, S. Boyd, and E. Candès. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In NIPS, 2014.
[21] G. Teschl. Ordinary Differential Equations and Dynamical Systems, volume 140. American Mathematical Society, 2012.
[22] J. W. Weibull. Evolutionary Game Theory. MIT Press, 1997.
[23] A. Wibisono, A. C. Wilson, and M. I. Jordan. A variational perspective on accelerated methods in optimization. CoRR, abs/1603.04245, 2016.