# Acceleration and Averaging in Stochastic Descent Dynamics

Walid Krichene, Google, Inc. (walidk@google.com) · Peter Bartlett, U.C. Berkeley (bartlett@cs.berkeley.edu)

**Abstract.** We formulate and study a general family of (continuous-time) stochastic dynamics for accelerated first-order minimization of smooth convex functions. Building on an averaging formulation of accelerated mirror descent, we propose a stochastic variant in which the gradient is contaminated by noise, and study the resulting stochastic differential equation. We prove a bound on the rate of change of an energy function associated with the problem, then use it to derive estimates of convergence rates of the function values (almost surely and in expectation), both for persistent and asymptotically vanishing noise. We discuss the interaction between the parameters of the dynamics (learning rate and averaging rates) and the covariation of the noise process. In particular, we show how the asymptotic rate of covariation affects the choice of parameters and, ultimately, the convergence rate.

## 1 Introduction

We consider the constrained convex minimization problem

$$\min_{x \in X} f(x),$$

where $X$ is a closed, convex, compact subset of $\mathbb{R}^n$, and $f$ is a proper closed convex function, assumed to be differentiable with Lipschitz gradient; we denote by $X^\star \subseteq X$ the set of its minimizers. First-order methods play an important role in minimizing such functions, in particular in large-scale machine learning applications, in which the dimensionality (number of features) and size (number of samples) of typical datasets make higher-order methods intractable.

Many such algorithms can be viewed as a discretization of continuous-time dynamics. The simplest example is gradient descent, which can be viewed as the discretization of the gradient flow dynamics $\dot x(t) = -\nabla f(x(t))$, where $\dot x(t)$ denotes the time derivative of a $C^1$ trajectory $x(t)$. An important generalization of gradient descent was developed by Nemirovsky and Yudin [1983], and termed mirror descent: it couples a dual variable $z(t)$ and its mirror primal variable $x(t)$. More specifically, the dynamics are given by

$$\text{(MD)} \qquad \begin{cases} \dot z(t) = -\nabla f(x(t)) \\ x(t) = \nabla \psi^*(z(t)), \end{cases} \tag{1}$$

where $\nabla\psi^* : \mathbb{R}^n \to X$ is a Lipschitz function defined on the entire dual space $\mathbb{R}^n$, with values in the feasible set $X$; it is often referred to as a mirror map, and we will recall its definition and properties in Section 2. Mirror descent can be viewed as a generalization of projected gradient descent, where the Euclidean projection is replaced by the mirror map $\nabla\psi^*$ [Beck and Teboulle, 2003]. This makes it possible to adapt the choice of the mirror map to the geometry of the problem, leading to better dependence on the dimension $n$; see [Ben-Tal and Nemirovski, 2001], [Ben-Tal et al., 2001].
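To make the dynamics (1) concrete, here is a minimal numerical sketch of ours (not from the paper): a forward-Euler discretization of (MD) with the negative-entropy mirror map on the probability simplex, for which $\nabla\psi^*$ is the softmax map. The test objective and all names are illustrative.

```python
import numpy as np

def softmax(z):
    """Mirror map for negative entropy on the simplex: grad psi*(z)."""
    w = np.exp(z - z.max())
    return w / w.sum()

# Illustrative objective: f(x) = 0.5 * ||x - p||^2, minimized over the simplex at p.
p = np.array([0.2, 0.3, 0.5])
grad_f = lambda x: x - p

# Forward-Euler discretization of (MD): z' = -grad f(x), x = grad psi*(z).
dt, T = 0.01, 10.0
z = np.zeros(3)
for _ in range(int(T / dt)):
    x = softmax(z)
    z -= dt * grad_f(x)

print("x(T) ≈", softmax(z))  # approaches the minimizer p
```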
**Continuous-time dynamics.** Although optimization methods are inherently discrete, the continuous-time point of view can help in their design and analysis, since it can leverage the rich literature on dynamical systems, control theory, and mechanics; see [Helmke and Moore, 1994], [Bloch, 1994], and the references therein. Continuous-time models are also commonly used in financial applications, such as option pricing [Black and Scholes, 1973], even though the actions are taken in discrete time.

In convex optimization, beyond simplifying the analysis, continuous-time models have also motivated new algorithms: mirror descent is one such example, since it was originally motivated in continuous time (Chapter 3 in [Nemirovsky and Yudin, 1983]). In a more recent line of work ([Su et al., 2014], [Krichene et al., 2015], [Wibisono et al., 2016]), Nesterov's accelerated method [Nesterov, 1983] was shown to be the discretization of a second-order ordinary differential equation (ODE), which, in the unconstrained case, can be interpreted as a damped non-linear oscillator [Cabot et al., 2009, Attouch et al., 2015]. This motivated a restarting heuristic [O'Donoghue and Candès, 2015], which aims at further dissipating the energy. Krichene et al. [2015] generalized this ODE to mirror descent, and gave an averaging interpretation of accelerated dynamics by writing it as two coupled first-order ODEs. This is the starting point of this paper, in which we introduce and study a stochastic variant of accelerated mirror descent.

**Stochastic dynamics and related work.** The dynamics that we have discussed so far are deterministic first-order dynamics, since they use the exact gradient $\nabla f$. However, in many machine learning applications, evaluating the exact gradient can be prohibitively expensive, e.g. in regularized empirical risk minimization problems, where the objective function involves a sum of loss functions over a training set, of the form $f(x) = \frac{1}{|I|}\sum_{i \in I} f_i(x) + g(x)$, where $I$ indexes the training samples, and $g$ is a regularization function.¹ Instead of computing the exact gradient $\nabla f(x) = \frac{1}{|I|}\sum_{i \in I} \nabla f_i(x) + \nabla g(x)$, a common approach is to compute an unbiased, stochastic estimate of the gradient, given by $\frac{1}{|\tilde I|}\sum_{i \in \tilde I} \nabla f_i(x) + \nabla g(x)$, where $\tilde I$ is a uniformly random subset of $I$, indexing a random batch of samples from the training set.

¹ In statistical learning, one seeks to minimize the expected risk (with respect to the true, unknown data distribution). A common approach is to minimize the empirical risk (observed on a given training set), then bound the distance between empirical and expected risk. Here we only focus on the optimization part.

This approach motivates the study of stochastic dynamics for convex optimization. But despite an extensive literature on stochastic gradient and mirror descent in discrete time, e.g. [Nemirovski et al., 2009], [Duchi et al., 2010], [Lan, 2012], [Johnson and Zhang, 2013], [Xiao and Zhang, 2014], and many others, few results are known for stochastic mirror descent in continuous time. To the best of our knowledge, the only published results are by Raginsky and Bouvrie [2012] and Mertikopoulos and Staudigl [2016].

In its simplest form, the stochastic gradient dynamics can be described by the (overdamped) Langevin equation $dX(t) = -\nabla f(X(t))\,dt + \sigma\,dB(t)$, where $B(t)$ denotes a standard Wiener process (Brownian motion). It has a long history in optimization [Chiang et al., 1987], dating back to simulated annealing, and it is known to have a unique invariant measure with density proportional to the Gibbs distribution $e^{-2f(x)/\sigma^2}$ (see, e.g., [Pavliotis, 2014]). Langevin dynamics have recently played an important role in the analysis of sampling methods [Dalalyan, 2017, Bubeck et al., 2015, Durmus and Moulines, 2016, Cheng and Bartlett, 2017, Eberle et al., 2017, Cheng et al., 2017], where $f$ is taken to be proportional to the logarithm of a target density. They have also been used to derive convergence rates for smooth, non-convex optimization where the objective is dissipative [Raginsky et al., 2017].

For mirror descent dynamics, Raginsky and Bouvrie [2012] were the first to propose a stochastic variant of the mirror descent ODE (1), given by the SDE

$$\text{(SMD)} \qquad \begin{cases} dZ(t) = -\nabla f(X(t))\,dt + \sigma\,dB(t) \\ X(t) = \nabla\psi^*(Z(t)), \end{cases} \tag{2}$$

where $\sigma$ is a constant volatility.
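As an aside (our sketch, not part of the paper), (SMD) is straightforward to simulate with an Euler–Maruyama scheme. The snippet below reuses the simplex setup from before, with a hypothetical constant volatility `sigma`.

```python
import numpy as np

def softmax(z):
    """grad psi*(z) for the negative-entropy mirror map on the simplex."""
    w = np.exp(z - z.max())
    return w / w.sum()

p = np.array([0.2, 0.3, 0.5])   # illustrative objective f(x) = 0.5*||x - p||^2
grad_f = lambda x: x - p

rng = np.random.default_rng(0)
dt, T, sigma = 0.01, 10.0, 0.1
z = np.zeros(3)
for _ in range(int(T / dt)):
    x = softmax(z)
    # Euler-Maruyama step for dZ = -grad f(X) dt + sigma dB
    z += -dt * grad_f(x) + sigma * np.sqrt(dt) * rng.standard_normal(3)

# With persistent noise, f(X(t)) hovers near the optimum with an O(sigma^2) gap.
print("f(X(T)) - f(p) ≈", 0.5 * np.sum((softmax(z) - p) ** 2))
```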
Raginsky and Bouvrie [2012] argued that the function values $f(X(t))$ along sample trajectories do not converge to the minimum value of $f$ due to the persistent noise, but the optimality gap is bounded by a quantity proportional to $\sigma^2$. They also proposed a method to reduce the variance by simultaneously sampling multiple trajectories and linearly coupling them. Mertikopoulos and Staudigl [2016] extended the analysis in some important directions: they replaced the constant $\sigma$ with a general volatility matrix $\sigma(x, t)$, which can be space- and time-dependent, and studied two regimes: the small noise limit ($\sigma(x, t)$ vanishes at a $O(1/\sqrt{\log t})$ rate), in which case they prove almost sure convergence; and the persistent noise regime ($\sigma(x, t)$ is uniformly bounded), in which case they define a rectified variant of SMD, obtained by replacing the second equation by $X(t) = \nabla\psi^*(Z(t)/s(t))$, where $1/s(t)$ is a sensitivity parameter (intuitively, decreasing the sensitivity reduces the impact of accumulated noise). In particular, they prove that with $s(t) = \sqrt{t}$, the expected function values converge at a $O(1/\sqrt{t})$ rate.

While these recent results paint a broad picture of mirror descent dynamics, they leave many questions open: in particular, they do not provide estimates of convergence rates in the vanishing noise limit, which is an important regime in machine learning applications, since one can often control the variance of the gradient estimate, for example by gradually increasing the batch size, as done by Xiao and Zhang [2014]. Besides, they do not study accelerated dynamics, and the interaction between acceleration and noise remains unexplored in continuous time.

**Our contributions.** In this paper, we answer many of the questions left open in previous works. We formulate and study a family of stochastic accelerated mirror descent dynamics, and we characterize the interaction between its different parameters: the volatility of the noise, the (primal and dual) learning rates, and the sensitivity of the mirror map. More specifically:

- In Theorem 1, we give sufficient conditions for almost sure convergence of solution trajectories to the set of minimizers $X^\star$. In particular, we show that it is possible to guarantee almost sure convergence even when the volatility grows unbounded asymptotically.
- In Theorem 2, we derive a bound on the expected function values. In particular, we prove that in the vanishing noise regime, acceleration (with appropriate averaging) achieves a faster rate; see Corollary 2 and the discussion in Remark 3.
- In Theorem 3, we provide estimates of sample-trajectory convergence rates.

The rest of the paper is organized as follows: we review the building blocks of our construction in Section 2, then formulate the stochastic dynamics in Section 3, and prove two instrumental lemmas. Section 4 is dedicated to the convergence results. We conclude with a brief discussion in Section 5.

## 2 Accelerated Mirror Descent Dynamics

### 2.1 Smooth mirror map

We start by reviewing some definitions and preliminaries. Let $(E, \|\cdot\|)$ be a normed vector space, and $(E^*, \|\cdot\|_*)$ be its dual space equipped with the dual norm, and denote by $\langle x, z\rangle$ the pairing between $x \in E$ and $z \in E^*$.
To simplify, both $E$ and $E^*$ can be identified with $\mathbb{R}^n$, but we make the distinction for clarity. We say that a map $F : E \to E'$ between normed spaces is Lipschitz continuous on $X \subseteq E$ with constant $L$ if for all $x, x' \in X$, $\|F(x) - F(x')\| \le L\|x - x'\|$.

Let $\psi : E \to \mathbb{R} \cup \{+\infty\}$ be a convex function with effective domain $X$ (i.e. $X = \{x \in E : \psi(x) < \infty\}$). Its convex conjugate $\psi^*$ is defined on $E^*$ by $\psi^*(z) = \sup_{x \in X} \langle z, x\rangle - \psi(x)$. One can show that if $\psi$ is strongly convex, then $\psi^*$ is differentiable on all of $E^*$, and its gradient $\nabla\psi^*$ is a Lipschitz function that maps $E^*$ to $X$ (see the supplementary material). This map is often called a mirror map [Nemirovsky and Yudin, 1983]. To give a concrete example, take $\psi$ to be the squared Euclidean norm restricted to $X$, $\psi(x) = \frac{1}{2}\|x\|_2^2$. Then one can show that $\nabla\psi^*(z) = \arg\min_{x \in X}\|z - x\|_2^2$, and the mirror map reduces to the Euclidean projection onto $X$. For additional examples, see e.g. Banerjee et al. [2005].

We make the following assumptions throughout the paper:

**Assumption 1.** $X$ is closed, convex and compact; the set of minimizers $X^\star$ is contained in the relative interior of $X$; $\psi$ is non-negative (without loss of generality); $\psi^*$ is twice differentiable with a Lipschitz gradient; and $f$ is differentiable with a Lipschitz gradient. We denote by $L_{\psi^*}$ the Lipschitz constant of $\nabla\psi^*$, and by $L_f$ the Lipschitz constant of $\nabla f$.

### 2.2 Averaging formulation of accelerated mirror descent

We start from the averaging formulation of Krichene et al. [2015], and include a sensitivity parameter similar to Mertikopoulos and Staudigl [2016]. This results in the following ODE system:

$$\text{(AMD}_{\eta,a,s}) \qquad \begin{cases} \dot z(t) = -\eta(t)\nabla f(x(t)) \\ \dot x(t) = a(t)\big(\nabla\psi^*(z(t)/s(t)) - x(t)\big), \end{cases} \tag{3}$$

with initial conditions² $(x(t_0), z(t_0)) = (x_0, z_0)$. The ODE system is parameterized by the following functions, all assumed to be positive and continuous on $[t_0, \infty)$ (see Figure 1 for an illustration):

- $s(t)$ is a non-decreasing, inverse sensitivity parameter. As we will see, $s(t)$ will be helpful in the stochastic case in scaling the noise term, in order to reduce its impact.
- $\eta(t)$ is a learning rate in the dual space.
- $a(t)$ is an averaging rate in the primal space. Indeed, the second ODE in (3) can be written in integral form as a weighted average of the mirror trajectory as follows: let $w(t) = e^{\int_{t_0}^t a(\tau)d\tau}$ (equivalently, $a(t) = \dot w(t)/w(t)$); then the ODE is equivalent to $\dot w(t)x(t) + w(t)\dot x(t) = \dot w(t)\nabla\psi^*(z(t)/s(t))$, and integrating and rearranging,
$$x(t) = \frac{x(t_0)w(t_0) + \int_{t_0}^t \dot w(\tau)\,\nabla\psi^*(z(\tau)/s(\tau))\,d\tau}{w(t)}.$$

² The initial conditions typically satisfy $\nabla\psi^*(z_0) = x_0$, which ensures that the trajectory starts with zero velocity, but this is not necessary in general.

There are other ways of formulating the accelerated dynamics: instead of two first-order ODEs, one can write a single second-order ODE (as in Su et al. [2014], Wibisono et al. [2016]), which has interesting interpretations related to Lagrangian dynamics. The averaging formulation given in Equation (3) is better suited to our analysis.

### 2.3 Energy function

The analysis of continuous-time dynamics often relies on a Lyapunov argument (in reference to Lyapunov [1892]): one starts by defining a non-negative energy function, then bounds its rate of change along solution trajectories. This bound can then be used to prove convergence to the set of minimizers $X^\star$. We will consider a modified version of the energy function used in Krichene et al. [2016]: given a positive, $C^1$ function $r(t)$, and a pair of optimal primal-dual points $(x^\star, z^\star)$ such that³ $x^\star \in X^\star$ and $\nabla\psi^*(z^\star) = x^\star$, let

$$L(x, z, t) = r(t)\big(f(x) - f(x^\star)\big) + s(t)\,D_{\psi^*}\big(z/s(t),\, z^\star\big). \tag{4}$$

Here, $D_{\psi^*}$ is the Bregman divergence associated with $\psi^*$, defined by $D_{\psi^*}(z', z) = \psi^*(z') - \psi^*(z) - \langle\nabla\psi^*(z), z' - z\rangle$, for all $z, z' \in E^*$.

³ Such a $z^\star$ exists whenever $x^\star$ is in the relative interior of $X$ (hence the condition $X^\star \subseteq \operatorname{relint} X$ in Assumption 1). The analysis can be extended to minimizers on the relative boundary by replacing the Bregman divergence term in $L$ with the Fenchel coupling defined by Mertikopoulos and Staudigl [2016].
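To connect (3) and (4) numerically, here is a small sketch of ours (not from the paper): it integrates AMD with forward Euler for the Euclidean mirror map on a box (where $\nabla\psi^*$ is coordinate-wise clipping) and tracks the energy (4); the lemma proved next implies the energy is non-increasing when $\eta = \dot r$ and $s$ is constant. All parameter choices are illustrative.

```python
import numpy as np

lo, hi = -1.0, 1.0
mirror = lambda z: np.clip(z, lo, hi)            # grad psi*(z) for psi = 0.5||.||^2 on X = [lo,hi]^n
psi_star = lambda z: np.sum(z * mirror(z) - 0.5 * mirror(z) ** 2)
bregman = lambda z1, z2: psi_star(z1) - psi_star(z2) - np.dot(mirror(z2), z1 - z2)

xstar = np.array([0.3, -0.5])                    # interior minimizer of f(x) = 0.5||x - xstar||^2
f = lambda x: 0.5 * np.sum((x - xstar) ** 2)     # so f(xstar) = 0
zstar = xstar                                    # grad psi*(zstar) = xstar at interior points

r, eta, a = (lambda t: t ** 2), (lambda t: 2 * t), (lambda t: 2.0 / t)  # eta = r', a = eta/r, s = 1
energy = lambda x, z, t: r(t) * f(x) + bregman(z, zstar)

dt, t = 1e-3, 1.0
x = z = np.array([0.9, 0.9])                     # grad psi*(z0) = x0: zero initial velocity
prev, max_increase = energy(x, z, t), 0.0
for _ in range(5000):
    z = z - dt * eta(t) * (x - xstar)            # z' = -eta(t) grad f(x)
    x = x + dt * a(t) * (mirror(z) - x)          # x' = a(t)(grad psi*(z) - x)
    t += dt
    cur = energy(x, z, t)
    max_increase = max(max_increase, cur - prev)  # should be ~0 up to discretization error
    prev = cur
print("f(x(T)) - f* =", f(x), " max energy increase per step:", max_increase)
```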
We can now bound the time derivative of $L$ along solution trajectories of AMD$_{\eta,a,s}$, as given in the following lemma. To keep the equations compact, we will occasionally omit the explicit dependence on time, and write, e.g., $\eta/r$ instead of $\eta(t)/r(t)$.

**Lemma 1.** Suppose that $a = \eta/r$. Then under AMD$_{\eta,\eta/r,s}$, for all $t \ge t_0$,

$$\frac{d}{dt} L(x(t), z(t), t) \le \big(f(x(t)) - f(x^\star)\big)\big(\dot r(t) - \eta(t)\big) + \psi(x^\star)\,\dot s(t). \tag{5}$$

*Proof.* We start by bounding the rate of change of the Bregman divergence term:

$$\begin{aligned}
\frac{d}{dt}\, s(t)\,D_{\psi^*}(z(t)/s(t), z^\star)
&= \dot s\, D_{\psi^*}(z/s, z^\star) + s\big\langle \nabla\psi^*(z/s) - \nabla\psi^*(z^\star),\ \dot z/s - \dot s z/s^2 \big\rangle \\
&= \big\langle \nabla\psi^*(z/s) - x^\star,\ \dot z\big\rangle + \dot s\big(D_{\psi^*}(z/s, z^\star) - \langle\nabla\psi^*(z/s) - \nabla\psi^*(z^\star),\ z/s\rangle\big) \\
&= \big\langle \nabla\psi^*(z/s) - x^\star,\ \dot z\big\rangle + \dot s\big(\psi(x^\star) - \psi(\nabla\psi^*(z/s))\big) \\
&\le \big\langle \nabla\psi^*(z/s) - x^\star,\ \dot z\big\rangle + \dot s\,\psi(x^\star),
\end{aligned}$$

where the third equality can be proved using the fact that $\psi(x) + \psi^*(z) = \langle x, z\rangle \Leftrightarrow x \in \partial\psi^*(z) \Leftrightarrow z \in \partial\psi(x)$ (Theorem 23.5 in Rockafellar [1970]), and the last inequality follows from the assumption that $s$ is non-decreasing and $\psi$ is non-negative. Using this expression, we can then compute

$$\begin{aligned}
\frac{d}{dt} L(x(t), z(t), t)
&\le \dot r\big(f(x) - f(x^\star)\big) + r\langle\nabla f(x), \dot x\rangle + \langle\nabla\psi^*(z/s) - x^\star, \dot z\rangle + \psi(x^\star)\dot s \\
&= \dot r\big(f(x) - f(x^\star)\big) + r\langle\nabla f(x), \dot x\rangle + \langle \dot x/a + x - x^\star,\ -\eta\nabla f(x)\rangle + \psi(x^\star)\dot s \\
&\le \big(f(x) - f(x^\star)\big)(\dot r - \eta) + \langle\nabla f(x), \dot x\rangle(r - \eta/a) + \psi(x^\star)\dot s,
\end{aligned}$$

where we plugged in the expressions of $\dot z$ and $\nabla\psi^*(z/s)$ from AMD$_{\eta,a,s}$ in the second equality, and used convexity of $f$ in the last inequality. The assumption $a = \eta/r$ ensures that the middle term vanishes, which concludes the proof. ∎

As a consequence, we can prove the following convergence rate:

**Corollary 1.** Suppose that $a = \eta/r$ and that $\eta \ge \dot r$. Then under AMD$_{\eta,\eta/r,s}$, for all $t \ge t_0$,

$$f(x(t)) - f(x^\star) \le \frac{\psi(x^\star)\big(s(t) - s(t_0)\big) + L(x_0, z_0, t_0)}{r(t)}.$$

*Proof.* Starting from the bound (5), the first term is non-positive by the assumption that $\eta \ge \dot r$. Integrating, we have $L(x(t), z(t), t) \le L(x_0, z_0, t_0) + \psi(x^\star)(s(t) - s(t_0))$, thus

$$f(x(t)) - f(x^\star) \le \frac{L(x(t), z(t), t)}{r(t)} \le \frac{\psi(x^\star)(s(t) - s(t_0)) + L(x_0, z_0, t_0)}{r(t)}. \qquad ∎$$

**Remark 1.** Corollary 1 can be interpreted as follows: given a desired convergence rate $r(t)$, one can choose parameters $a, \eta, s$ that satisfy the conditions of the corollary (e.g. by first setting $\eta = \dot r$, then choosing $a = \eta/r$). This defines an ODE, the solutions of which are guaranteed to converge at the rate $r(t)$. While the convergence rate can seemingly be made arbitrarily fast for continuous-time dynamics, discretizing the ODE does not always preserve the convergence rate. Wibisono et al. [2016], Wilson et al. [2016] give sufficient conditions on the discretization scheme to preserve polynomial rates; for example, a first-order discretization can preserve quadratic rates, and a higher-order discretization (using cubic-regularized Newton updates) can preserve cubic rates.

**Remark 2.** As a special case, one can recover Nesterov's ODE by taking $r(t) = t^2$, $\eta(t) = \beta t$, $a(t) = \beta/t$ (i.e. $w(t) = w(t_0)(t/t_0)^\beta$), and $s(t) = 1$ (see the supplement for additional details). It is worth observing that in this case, both the dual learning rate $\eta(t)$ and the primal weight $w(t)$ are increasing. A different choice of parameters leads to dynamics similar to Nesterov's but with different weights.
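As a quick sanity check of the recipe in Remark 1 (our own sketch, not from the paper), one can verify symbolically that the Remark 2 parameters satisfy the conditions of Corollary 1, and that $a(t) = \beta/t$ integrates to the weight $w(t) = (t/t_0)^\beta$:

```python
import sympy as sp

t, t0, beta = sp.symbols('t t0 beta', positive=True)

# Remark 2 parameters: r = t^2, eta = beta*t, a = beta/t, s = 1.
r = t ** 2
eta = beta * t
a = sp.simplify(eta / r)                  # a = eta/r, as Corollary 1 requires
print(a)                                  # beta/t

# The condition eta >= r' holds iff beta >= 2 (since r' = 2t):
print(sp.simplify(eta - sp.diff(r, t)))   # (beta - 2)*t

# Weight w(t) = exp( integral of a from t0 to t ), which equals (t/t0)**beta:
tau = sp.symbols('tau', positive=True)
w = sp.exp(sp.integrate(a.subs(t, tau), (tau, t0, t)))
print(sp.simplify(w))                     # t**beta / t0**beta
```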
## 3 Stochastic dynamics

We now formulate the stochastic variant of the accelerated mirror descent dynamics (SAMD). Intuitively, we would like to replace the gradient term $\nabla f(x)$ in AMD$_{\eta,a,s}$ by a noisy gradient. Writing the noisy dynamics as an Itô SDE [Øksendal, 2003], we consider the system

$$\text{(SAMD}_{\eta,a,s}) \qquad \begin{cases} dZ(t) = \eta(t)\big[-\nabla f(X(t))\,dt + \sigma(X(t), t)\,dB(t)\big] \\ dX(t) = a(t)\big[\nabla\psi^*(Z(t)/s(t)) - X(t)\big]dt, \end{cases} \tag{6}$$

with initial condition $(X(t_0), Z(t_0)) = (x_0, z_0)$ (we assume deterministic initial conditions for simplicity). Here, $B(t) \in \mathbb{R}^n$ is a standard Wiener process with respect to a given filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \ge t_0}, \mathbb{P})$, and $\sigma : (x, t) \mapsto \sigma(x, t) \in \mathbb{R}^{n \times n}$ is a volatility matrix, assumed measurable, Lipschitz in $x$ (uniformly in $t$), and continuous in $t$ for all $x$.

The drift term in SAMD$_{\eta,a,s}$ is identical to the deterministic case, and the volatility term $\eta(t)\sigma(X(t), t)\,dB(t)$ represents the noise in the gradient. In particular, we note that the learning rate $\eta(t)$ multiplies $\sigma(X(t), t)\,dB(t)$, to capture the fact that the gradient noise is scaled by the learning rate $\eta$. This formulation is fairly general, and does not assume, in particular, that the different components of the noise are independent, as we can see in the quadratic covariation of the dual process $Z(t)$:

$$d[Z_i, Z_j](t) = \eta(t)^2\big(\sigma(X(t), t)\sigma(X(t), t)^T\big)_{ij}\,dt = \eta(t)^2\,\Sigma_{ij}(X(t), t)\,dt, \tag{7}$$

where we defined the infinitesimal covariance matrix $\Sigma(x, t) = \sigma(x, t)\sigma(x, t)^T \in \mathbb{R}^{n \times n}$. In our analysis, we will focus on different noise regimes, which can be characterized using⁴

$$\sigma_*^2(t) = \sup_{x \in X} \|\Sigma(x, t)\|_i, \tag{8}$$

where $\|\Sigma\|_i = \sup_{\|z\| \le 1}\|\Sigma z\|$ is the induced matrix norm. Since $\Sigma(x, t)$ is Lipschitz in $x$ and continuous in $t$, and $X$ is compact, $\sigma_*(t)$ is finite for all $t$, and continuous. Contrary to [Raginsky and Bouvrie, 2012, Mertikopoulos and Staudigl, 2016], we do not assume that $\sigma_*(t)$ is uniformly bounded in $t$. We give an illustration of the stochastic dynamics in Figure 1 (see the supplement for details).

⁴ In our model, we focus on the time dependence of the volatility. Note that in some settings, the variance of the gradient estimates scales with the squared norm of the gradient; see [Bottou et al., 2016] in the discrete case. Thus one can consider a model where $\sigma(x, t)$ scales with $\|\nabla f(x)\|$, which may lead to different rates.

*Figure 1: Illustration of the SAMD dynamics. The dual variable $Z(t)$ accumulates gradients. It is scaled by the sensitivity $1/s(t)$, then mapped to the primal space via the mirror map, resulting in $\nabla\psi^*(Z/s)$ (dotted line). The primal variable is then a weighted average of the mirror trajectory.*

**Existence and uniqueness.** First, we give the following existence and uniqueness result:

**Proposition 1.** For all $T > t_0$, SAMD$_{\eta,a,s}$ has a unique (up to redefinition on a $\mathbb{P}$-null set) solution $(X(t), Z(t))$, continuous on $[t_0, T]$, with the property that $(X(t), Z(t))$ is adapted to the filtration $\{\mathcal{F}_t\}$, and $\int_{t_0}^T \|X(t)\|^2 dt$ and $\int_{t_0}^T \|Z(t)\|_*^2\, dt$ have finite expectations.

*Proof.* By assumption, $\nabla\psi^*$ and $\nabla f$ are Lipschitz continuous, thus the function $(x, z) \mapsto \big(-\eta(t)\nabla f(x),\ a(t)[\nabla\psi^*(z/s(t)) - x]\big)$ is Lipschitz on $[t_0, T]$ (since $a, \eta, s$ are positive continuous). Additionally, the function $x \mapsto \sigma(x, t)$ is also Lipschitz. Therefore, we can invoke the existence and uniqueness theorem for stochastic differential equations [Øksendal, 2003, Theorem 5.2.1]. Since $T$ is arbitrary, we conclude that there exists a unique continuous solution on $[t_0, \infty)$. ∎
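The SDE (6) is easy to simulate. Below is a minimal Euler–Maruyama sketch of ours (not from the paper), again with the Euclidean mirror map on a box; the schedules passed in the example are the persistent-noise parameters discussed in Section 4, and all names are illustrative.

```python
import numpy as np

lo, hi = -1.0, 1.0
mirror = lambda z: np.clip(z, lo, hi)        # grad psi* for the Euclidean mirror map on a box
xstar = np.array([0.3, -0.5])
grad_f = lambda x: x - xstar                 # illustrative f(x) = 0.5*||x - xstar||^2

def samd(eta, a, s, sigma, t0=1.0, T=50.0, dt=1e-3, seed=0):
    """Euler-Maruyama discretization of SAMD (6):
       dZ = eta(t) * (-grad_f(X) dt + sigma(t) dB),  dX = a(t) * (mirror(Z/s(t)) - X) dt."""
    rng = np.random.default_rng(seed)
    x = z = np.array([0.9, 0.9])
    t = t0
    while t < T:
        dB = np.sqrt(dt) * rng.standard_normal(2)
        z = z + eta(t) * (-grad_f(x) * dt + sigma(t) * dB)
        x = x + dt * a(t) * (mirror(z / s(t)) - x)
        t += dt
    return x

# Persistent noise (alpha = 0) with eta = 1, a = 1/t, s = sqrt(t) (cf. Corollary 2 below).
x_T = samd(eta=lambda t: 1.0, a=lambda t: 1.0 / t, s=np.sqrt, sigma=lambda t: 0.5)
print("f(X(T)) - f* =", 0.5 * np.sum((x_T - xstar) ** 2))
```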
**Energy decay.** Next, in order to analyze the convergence properties of the solution trajectories $(X(t), Z(t))$, we will need to bound the time derivative of the energy function $L$.

**Lemma 2.** Suppose that the primal rate $a = \eta/r$, and let $(X(t), Z(t))$ be the unique solution to SAMD$_{\eta,\eta/r,s}$. Then for all $t \ge t_0$,

$$dL(X(t), Z(t), t) \le \left[\big(f(X(t)) - f(x^\star)\big)\big(\dot r(t) - \eta(t)\big) + \psi(x^\star)\dot s(t) + \frac{n L_{\psi^*}}{2}\,\frac{\eta^2(t)\sigma_*^2(t)}{s(t)}\right]dt + \langle V(t), dB(t)\rangle,$$

where $V(t)$ is the continuous process given by

$$V(t) = \eta(t)\,\sigma(X(t), t)^T\big(\nabla\psi^*(Z(t)/s(t)) - \nabla\psi^*(z^\star)\big). \tag{9}$$

*Proof.* By definition of the energy function $L$, $\nabla_x L(x, z, t) = r(t)\nabla f(x)$ and $\nabla_z L(x, z, t) = \nabla\psi^*(z/s(t)) - \nabla\psi^*(z^\star)$, which are Lipschitz continuous in $(x, z)$ (uniformly in $t$ on any bounded interval, since $s, r$ are continuous positive functions of $t$). Thus, by the Itô formula for functions with Lipschitz continuous gradients [Errami et al., 2002], we have

$$\begin{aligned}
dL &= \partial_t L\,dt + \langle\nabla_x L, dX\rangle + \langle\nabla_z L, dZ\rangle + \tfrac{1}{2}\operatorname{tr}\big(\eta\sigma^T \nabla^2_{zz}L\, \sigma\eta\big)dt \\
&= \partial_t L\,dt + \langle\nabla_x L, dX\rangle + \langle\nabla_z L, -\eta\nabla f(X)\rangle dt + \langle\nabla_z L, \eta\sigma\, dB\rangle + \tfrac{\eta^2}{2}\operatorname{tr}\big(\Sigma\,\nabla^2_{zz}L\big)dt.
\end{aligned}$$

The first three terms correspond exactly to the deterministic case, and we can bound them by (5) from Lemma 1. The last two terms are due to the stochastic noise, and consist of a volatility term

$$\eta\,\langle\nabla_z L(X, Z, t),\ \sigma\,dB\rangle = \eta\,\langle\nabla\psi^*(Z/s) - \nabla\psi^*(z^\star),\ \sigma\,dB\rangle = \langle V, dB\rangle,$$

and the Itô correction term

$$\frac{\eta^2}{2}\operatorname{tr}\big(\Sigma(X, t)\,\nabla^2_{zz}L(X, Z, t)\big)dt = \frac{\eta^2}{2s}\operatorname{tr}\big(\Sigma(X, t)\,\nabla^2\psi^*(Z/s)\big)dt.$$

We can bound the last term using the fact that $\nabla\psi^*$ is, by assumption, $L_{\psi^*}$-Lipschitz, and the definition (8) of $\sigma_*$: for all $x \in E$, $z \in E^*$, and $t \ge t_0$, $\operatorname{tr}\big(\Sigma(x, t)\nabla^2\psi^*(z)\big) \le n L_{\psi^*}\sigma_*^2(t)$. Combining the previous inequalities, we obtain the desired bound. ∎

Integrating the bound of Lemma 2 will allow us to bound changes in energy. This bound will involve the Itô martingale term $\int_{t_0}^t \langle V(\tau), dB(\tau)\rangle$, and in order to control this term, we give, in the following lemma, an asymptotic envelope (a consequence of the law of the iterated logarithm).

**Lemma 3.** Let $b(t) = \int_{t_0}^t \eta^2(\tau)\sigma_*^2(\tau)\,d\tau$. Then

$$\int_{t_0}^t \langle V(\tau), dB(\tau)\rangle = O\Big(\sqrt{b(t)\log\log b(t)}\Big) \quad \text{a.s. as } t \to \infty. \tag{10}$$

*Proof.* Let us denote the Itô martingale by $\mathcal{V}(t) = \int_{t_0}^t \langle V(\tau), dB(\tau)\rangle = \sum_{i=1}^n \int_{t_0}^t V_i(\tau)\,dB_i(\tau)$, and its quadratic variation by $\beta(t) = [\mathcal{V}, \mathcal{V}](t)$. By definition of $\mathcal{V}$, and since the components of $B$ are independent ($d[B_i, B_j] = \delta_{ij}\,dt$), we have

$$d\beta = \sum_{i,j=1}^n V_i V_j\, d[B_i, B_j] = \sum_{i=1}^n V_i^2\, dt = \langle V, V\rangle\, dt.$$

By the Dambis–Dubins–Schwartz time change theorem (e.g. Corollary 8.5.4 in [Øksendal, 2003]), there exists a Wiener process $\hat B$ such that

$$\mathcal{V}(t) = \hat B(\beta(t)). \tag{11}$$

We now proceed to bound $\beta(t)$. Using the expression (9) of $V$, we have $\langle V, V\rangle = \eta^2(t)\,\Delta(t)^T\Sigma(X, t)\,\Delta(t)$, where $\Delta(t) = \nabla\psi^*(Z(t)/s(t)) - \nabla\psi^*(z^\star)$. Since the mirror map has values in $X$ and $X$ is assumed compact, the diameter $D = \sup_{x, x' \in X}\|x - x'\|$ is finite, and $\|\Delta(t)\| \le D$ for all $t$. Thus $d\beta(t) \le D^2\eta(t)^2\sigma_*^2(t)\,dt$, and integrating,

$$\beta(t) \le D^2 b(t) \quad \text{a.s.} \tag{12}$$

Since $\beta(t)$ is a non-decreasing process, two cases are possible: if $\lim_{t\to\infty}\beta(t)$ is finite, then $\limsup_t |\mathcal{V}(t)|$ is a.s. finite and the result follows immediately. If $\lim_{t\to\infty}\beta(t) = \infty$, then

$$\limsup_{t\to\infty} \frac{\mathcal{V}(t)}{\sqrt{b(t)\log\log b(t)}} \le \limsup_{t\to\infty} \frac{\hat B(\beta(t))}{\sqrt{(\beta(t)/D^2)\log\log(\beta(t)/D^2)}} = \sqrt{2D^2},$$

where the inequality combines (11) and (12), and the equality is by the law of the iterated logarithm. ∎
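To visualize the envelope in Lemma 3 (an illustration of ours, not from the paper), one can simulate a scalar Brownian path and compare $|B(t)|$ against $\sqrt{2t\log\log t}$, the law-of-the-iterated-logarithm envelope that the time-changed martingale $\hat B(\beta(t))$ inherits:

```python
import numpy as np

rng = np.random.default_rng(1)
dt, n_steps = 0.01, 1_000_000
t = dt * np.arange(1, n_steps + 1)
B = np.cumsum(np.sqrt(dt) * rng.standard_normal(n_steps))  # standard Brownian path

mask = t > 10.0                                  # log log t requires t > e
envelope = np.sqrt(2 * t[mask] * np.log(np.log(t[mask])))
ratio = np.abs(B[mask]) / envelope
print("max |B(t)| / sqrt(2 t loglog t) over the path:", ratio.max())
# By the law of the iterated logarithm, the limsup of this ratio is 1 a.s.
```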
## 4 Convergence results

Equipped with Lemma 2 and Lemma 3, which bound, respectively, the rate of change of the energy and the asymptotic growth of the martingale term, we are now ready to prove our convergence results.

**Theorem 1.** Suppose that $\eta(t)\sigma_*(t) = o(1/\sqrt{\log t})$, and that $\int_{t_0}^t \eta(\tau)\,d\tau$ dominates $b(t)$ and $\sqrt{b(t)\log\log b(t)}$ (where $b(t) = \int_{t_0}^t \eta^2(\tau)\sigma_*^2(\tau)\,d\tau$, as defined in Lemma 3). Consider the SAMD dynamics with $r = s = 1$, and let $(X(t), Z(t))$ be the unique continuous solution of SAMD$_{\eta,\eta,1}$. Then

$$\lim_{t\to\infty} f(X(t)) - f(x^\star) = 0 \quad \text{a.s.}$$

*Proof sketch.* We give a sketch of the proof here (the full argument is deferred to the supplement).

(i) The first step is to prove that under the conditions of the theorem, the continuous solution $(X(t), Z(t))$ of SAMD$_{\eta,\eta,1}$ is an asymptotic pseudo-trajectory (a notion defined and studied by Benaïm and Hirsch [1996] and Benaïm [1999]) of the deterministic flow AMD$_{\eta,\eta,1}$. The rigorous definition is given in the supplementary material, but intuitively, this means that for large enough times, the sample paths of the process $(X(t), Z(t))$ get arbitrarily close to $(x(t), z(t))$, the solution trajectories of the deterministic dynamics.

(ii) The second step is to show that under the deterministic flow, the energy $L$ decreases enough for large enough times.

(iii) The third step is to prove that under the stochastic process, $f(X(t))$ cannot stay bounded away from $f(x^\star)$ for all $t$. Note that under the conditions of the theorem, integrating the bound of Lemma 2, and using the asymptotic envelope of Lemma 3, gives

$$L(X(t), Z(t), t) \le L(x_0, z_0, t_0) - \int_{t_0}^t \big(f(X(\tau)) - f(x^\star)\big)\eta(\tau)\,d\tau + O(b(t)) + O\Big(\sqrt{b(t)\log\log b(t)}\Big),$$

and if, say, $f(X(t)) - f(x^\star) \ge c > 0$ for all $t$, then the integral term dominates the bound, and the energy would decrease to $-\infty$, a contradiction.

Combining these steps, we argue that $f(X(t))$ eventually becomes close to $f(x^\star)$ by (iii), then stays close by virtue of (i) and (ii). ∎

The result of Theorem 1 makes it possible to guarantee almost sure convergence (albeit without guaranteeing a convergence rate) when the noise is persistent ($\sigma_*(t)$ is constant, or even increasing). To give a concrete example, suppose $\sigma_*(t) = O(t^\alpha)$ (with $\alpha < \frac{1}{2}$, but possibly positive), and let $\eta(t) = t^{-\alpha-\frac{1}{2}}$. Then $\eta(t)\sigma_*(t) = O(t^{-\frac{1}{2}})$, $\int_{t_0}^t \eta(\tau)\,d\tau = \Omega(t^{-\alpha+\frac{1}{2}})$, $b(t) = O(\log t)$, and $\sqrt{b(t)\log\log b(t)} = O(\sqrt{\log t \log\log\log t})$, so the conditions of the theorem are satisfied. Therefore, with the appropriate choice of learning rate $\eta(t)$ (and the corresponding averaging in the primal space given by $a(t) = \eta(t)$), one can guarantee almost sure convergence.

Next, we derive explicit bounds on convergence rates. We start by bounding expected function values.

**Theorem 2.** Suppose that $a = \eta/r$ and $\eta \ge \dot r$. Let $(X(t), Z(t))$ be the unique continuous solution to SAMD$_{\eta,\eta/r,s}$. Then for all $t \ge t_0$,

$$\mathbb{E}[f(X(t))] - f(x^\star) \le \frac{L(x_0, z_0, t_0) + \psi(x^\star)\big(s(t) - s(t_0)\big) + \frac{n L_{\psi^*}}{2}\int_{t_0}^t \frac{\eta^2(\tau)\sigma_*^2(\tau)}{s(\tau)}\,d\tau}{r(t)}.$$

*Proof.* Integrating the bound of Lemma 2, and using the fact that $(f(X(t)) - f(x^\star))(\dot r - \eta) \le 0$ by assumption on $\eta$, we have

$$L(X(t), Z(t), t) \le L(x_0, z_0, t_0) + \psi(x^\star)\big(s(t) - s(t_0)\big) + \frac{n L_{\psi^*}}{2}\int_{t_0}^t \frac{\eta^2(\tau)\sigma_*^2(\tau)}{s(\tau)}\,d\tau + \int_{t_0}^t \langle V(\tau), dB(\tau)\rangle. \tag{13}$$

Taking expectations, the last term vanishes since it is an Itô martingale, and we conclude by observing that $\mathbb{E}[f(X(t))] - f(x^\star) \le \mathbb{E}[L(X(t), Z(t), t)]/r(t)$. ∎

To give a concrete example, suppose that $\sigma_*(t) = O(t^\alpha)$ is given, and let $r(t) = t^\beta$ and $s(t) = t^\gamma$, with $\beta, \gamma > 0$. To simplify, we take $\eta(t) = \dot r(t) = \beta t^{\beta-1}$. Then the bound of Theorem 2 shows that

$$\mathbb{E}[f(X(t))] - f(x^\star) = O\big(t^{\gamma-\beta} + t^{\beta+2\alpha-\gamma-1}\big).$$

To minimize the asymptotic rate, we can choose $\gamma - \beta = \beta + 2\alpha - \gamma - 1$, i.e. $\beta + \alpha - \gamma - \frac{1}{2} = 0$, and the resulting rate is $O(t^{\alpha-\frac{1}{2}})$. In particular, we have:

**Corollary 2.** Suppose that $\sigma_*(t) = O(t^\alpha)$, $\alpha < \frac{1}{2}$. Then with $\eta(t) = (1-\alpha)t^{-\alpha}$, $a(t) = \frac{1-\alpha}{t}$, and $s(t) = t^{\frac{1}{2}}$, we have $\mathbb{E}[f(X(t))] - f(x^\star) = O(t^{\alpha-\frac{1}{2}})$.
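The recipe behind Corollary 2 (choose $\beta = 1 - \alpha$ and $\gamma = \frac{1}{2}$) is mechanical. Here is a small helper of ours that returns the schedules and the predicted rate for a given noise exponent `alpha`; all names are illustrative.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SAMDSchedule:
    eta: Callable[[float], float]   # dual learning rate eta(t)
    a: Callable[[float], float]     # primal averaging rate a(t)
    s: Callable[[float], float]     # inverse sensitivity s(t)
    rate_exponent: float            # E[f(X(t))] - f* = O(t**rate_exponent)

def corollary2_schedule(alpha: float) -> SAMDSchedule:
    """Parameter choice of Corollary 2 for sigma*(t) = O(t**alpha), alpha < 1/2:
    r(t) = t**(1 - alpha), eta = r', a = eta/r, s(t) = sqrt(t)."""
    assert alpha < 0.5, "Corollary 2 requires alpha < 1/2"
    beta = 1.0 - alpha
    return SAMDSchedule(
        eta=lambda t: beta * t ** (-alpha),
        a=lambda t: beta / t,
        s=lambda t: t ** 0.5,
        rate_exponent=alpha - 0.5,
    )

sched = corollary2_schedule(alpha=0.0)   # persistent noise
print(sched.rate_exponent)               # -0.5, i.e. an O(1/sqrt(t)) rate
```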
**Remark 3.** Corollary 2 can be interpreted as follows: given a polynomial bound $\sigma_*(t) = O(t^\alpha)$ on the volatility of the noise process, one can adapt the choice of primal and dual averaging rates ($a(t)$ and $\eta(t)$), which leads to an $O(t^{\alpha-\frac{1}{2}})$ convergence rate.

- In the persistent noise regime ($\alpha = 0$), the dynamics use a constant $\eta$, and result in a $O(1/\sqrt{t})$ rate. This rate is similar to that of the rectified dynamics proposed by Mertikopoulos and Staudigl [2016], but while they show convergence of the ergodic average $\bar X(t) = \frac{1}{t}\int_0^t X(\tau)\,d\tau$, we show convergence of the original process $X(t)$ under acceleration.
- In the vanishing noise regime ($\alpha < 0$), we can take advantage of the decreasing volatility by making $\eta(t)$ increasing. With the appropriate averaging rate $a(t)$, this leads to the improved rate $O(t^{\alpha-\frac{1}{2}})$. It is worth observing that when $\alpha \ge -\frac{1}{2}$, the same rate can be obtained for the ergodic average, without acceleration: one can show that the rectified SMD with $s(t) = t^{\max(0,\,\alpha+\frac{1}{2})}$ achieves a $O(t^{\max(\alpha-\frac{1}{2},\,-1)})$ rate. However, for $\alpha < -\frac{1}{2}$, acceleration improves the rate from $O(t^{-1})$ to $O(t^{\alpha-\frac{1}{2}})$.
- In the increasing noise regime ($\alpha > 0$), as long as the volatility does not increase too fast ($\alpha < \frac{1}{2}$), one can still guarantee convergence by decreasing $\eta(t)$ at the appropriate rate.

Finally, we give an estimate of the asymptotic convergence rate along solution trajectories.

**Theorem 3.** Suppose that $a = \eta/r$ and $\eta \ge \dot r$. Let $(X(t), Z(t))$ be the unique continuous solution to SAMD$_{\eta,\eta/r,s}$. Then

$$f(X(t)) - f(x^\star) = O\left(\frac{s(t) + n\int_{t_0}^t \frac{\eta^2(\tau)\sigma_*^2(\tau)}{s(\tau)}\,d\tau + \sqrt{b(t)\log\log b(t)}}{r(t)}\right) \quad \text{a.s. as } t \to \infty,$$

where $b(t) = \int_{t_0}^t \eta^2(\tau)\sigma_*^2(\tau)\,d\tau$.

*Proof.* Integrating the bound of Lemma 2 once again, we get inequality (13), in which we can bound the Itô martingale term $\int_{t_0}^t \langle V(\tau), dB(\tau)\rangle$ using Lemma 3. This concludes the proof. ∎

Comparing the last bound to that of Theorem 2, we have the additional $\sqrt{b(t)\log\log b(t)}/r(t)$ term due to the envelope of the martingale term. This results in a slower a.s. convergence rate. Suppose again that $\sigma_*(t) = O(t^\alpha)$, and that $r(t) = t^\beta$ and $\eta(t) = \dot r(t) = \beta t^{\beta-1}$ to simplify. Then $b(t) = \int_{t_0}^t \eta^2(\tau)\sigma_*^2(\tau)\,d\tau = O(t^{2\beta+2\alpha-1})$, and the martingale term becomes $\sqrt{b(t)\log\log b(t)}/r(t) = O(t^{\alpha-\frac{1}{2}}\sqrt{\log\log t})$. Remarkably, the asymptotic rate of sample trajectories is, up to a $\sqrt{\log\log t}$ factor, the same as the asymptotic rate in expectation; one should observe, however, that the constant in the $O$ notation is trajectory-dependent.

**Corollary 3.** Suppose that $\sigma_*(t) = O(t^\alpha)$, $\alpha < \frac{1}{2}$. Then with $\eta(t) = (1-\alpha)t^{-\alpha}$, $a(t) = \frac{1-\alpha}{t}$, and $s(t) = t^{\frac{1}{2}}$, we have $f(X(t)) - f(x^\star) = O(t^{\alpha-\frac{1}{2}}\sqrt{\log\log t})$ a.s.
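As an empirical illustration of the Corollary 2/3 rates (our sketch, under the same toy setup as above, not from the paper), the following run prints the optimality gap along a single trajectory against the $t^{\alpha-\frac{1}{2}}$ envelope, in a vanishing-noise regime:

```python
import numpy as np

lo, hi = -1.0, 1.0
mirror = lambda z: np.clip(z, lo, hi)            # Euclidean mirror map on the box [lo, hi]^n
xstar = np.array([0.3, -0.5])
f = lambda x: 0.5 * np.sum((x - xstar) ** 2)     # f(xstar) = 0

alpha = -0.25                                    # vanishing noise: sigma*(t) = t**alpha
eta = lambda t: (1 - alpha) * t ** (-alpha)      # Corollary 2 parameters
a = lambda t: (1 - alpha) / t
s = lambda t: t ** 0.5

rng = np.random.default_rng(2)
dt, t = 1e-3, 1.0
x = z = np.array([0.9, 0.9])
for step in range(200_000):
    dB = np.sqrt(dt) * rng.standard_normal(2)
    z = z + eta(t) * (-(x - xstar) * dt + t ** alpha * dB)
    x = x + dt * a(t) * (mirror(z / s(t)) - x)
    t += dt
    if step % 50_000 == 0:
        print(f"t={t:7.2f}  gap={f(x):.2e}  envelope t^(alpha-1/2)={t ** (alpha - 0.5):.2e}")
```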
## 5 Conclusion

Starting from the averaging formulation of accelerated mirror descent in continuous time, and motivated by stochastic optimization, we formulated a stochastic variant and studied the resulting SDE. We discussed the role played by each parameter: the dual learning rate $\eta(t)$, the inverse sensitivity parameter $s(t)$, and the noise covariation bound $\sigma_*(t)$. Our results show that in the persistent noise regime, thanks to averaging, it is possible to guarantee a.s. convergence, remarkably even when $\sigma_*(t)$ is increasing (as long as $\sigma_*(t) = o(\sqrt{t})$). In the vanishing noise regime, adapting the choice of $\eta(t)$ to the rate of $\sigma_*(t)$ (with the appropriate averaging) leads to improved convergence rates, e.g. to $O(t^{\alpha-\frac{1}{2}})$ in expectation and $O(t^{\alpha-\frac{1}{2}}\sqrt{\log\log t})$ almost surely, when $\sigma_*(t) = O(t^\alpha)$. These asymptotic bounds in continuous time can provide guidelines for setting the different parameters of accelerated stochastic mirror descent.

It is also worth observing that in the deterministic case, one can theoretically obtain arbitrarily fast convergence through a time change, as observed by Wibisono et al. [2016]: a time change would simply result in using different weights $\eta(t)$ and $a(t)$; this can also be seen in Corollary 1, where the rate $r(t)$ can be arbitrarily fast. In the stochastic dynamics, such a time change would also rescale the noise covariation, and does not lead to a faster rate. To some extent, the noise prevents us from artificially accelerating convergence using a simple time change. Finally, we believe this continuous-time analysis can be extended in several directions. For instance, it would be interesting to carry out a similar analysis for strongly convex functions, for which we expect faster convergence rates.

**Acknowledgments.** We gratefully acknowledge the support of the NSF through grant IIS-1619362 and of the Australian Research Council through an Australian Laureate Fellowship (FL110100281) and through the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS). We thank the anonymous reviewers for their insightful comments and suggestions.

## References

H. Attouch, J. Peypouquet, and P. Redont. Fast convergence of an inertial gradient-like system with vanishing viscosity. CoRR, abs/1507.04782, 2015.

A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. J. Mach. Learn. Res., 6:1705–1749, Dec. 2005.

A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett., 31(3):167–175, May 2003.

A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization. SIAM, 2001.

A. Ben-Tal, T. Margalit, and A. Nemirovski. The ordered subsets mirror descent optimization method with applications to tomography. SIAM J. on Optimization, 12(1):79–108, Jan. 2001.

M. Benaïm. Dynamics of stochastic approximation algorithms. In Séminaire de Probabilités XXXIII, pages 1–68. Springer, 1999.

M. Benaïm and M. W. Hirsch. Asymptotic pseudotrajectories and chain recurrent flows, with applications. Journal of Dynamics and Differential Equations, 8(1):141–176, 1996.

F. Black and M. Scholes. The pricing of options and corporate liabilities. Journal of Political Economy, 81(3):637–654, 1973.

A. Bloch, editor. Hamiltonian and Gradient Flows, Algorithms, and Control. American Mathematical Society, 1994.

L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. CoRR, abs/1606.04838, 2016.

S. Bubeck, R. Eldan, and J. Lehec. Finite-time analysis of projected Langevin Monte Carlo. In Advances in Neural Information Processing Systems (NIPS) 28, pages 1243–1251, 2015.

A. Cabot, H. Engler, and S. Gadat. On the long time behavior of second order differential equations with asymptotically small dissipation. Transactions of the American Mathematical Society, 361:5983–6017, 2009.

X. Cheng and P. Bartlett. Convergence of Langevin MCMC in KL-divergence. CoRR, abs/1705.09048, 2017.

X. Cheng, N. S. Chatterji, P. L. Bartlett, and M. I. Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. CoRR, abs/1707.03663, 2017.

T.-S. Chiang, C.-R. Hwang, and S. J. Sheu. Diffusion for global optimization in R^n. SIAM Journal on Control and Optimization, 25(3):737–753, 1987.

A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.

J. C. Duchi, A. Agarwal, M. Johansson, and M. Jordan. Ergodic mirror descent. SIAM Journal on Optimization (SIOPT), 22(4):1549–1578, 2010.

A. Durmus and E. Moulines. Sampling from strongly log-concave distributions with the unadjusted Langevin algorithm. CoRR, 2016.

A. Eberle, A. Guillin, and R. Zimmer. Quantitative contraction rates for Langevin dynamics. CoRR, 2017.
M. Errami, F. Russo, and P. Vallois. Itô's formula for C^{1,λ}-functions of a càdlàg process and related calculus. Probability Theory and Related Fields, 122(2):191–221, 2002.

U. Helmke and J. Moore. Optimization and Dynamical Systems. Communications and Control Engineering Series. Springer-Verlag, 1994.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS), pages 315–323, 2013.

W. Krichene, A. Bayen, and P. Bartlett. Accelerated mirror descent in continuous and discrete time. In NIPS, 2015.

W. Krichene, A. Bayen, and P. Bartlett. Adaptive averaging in accelerated descent dynamics. In NIPS, 2016.

G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365–397, 2012.

A. Lyapunov. General Problem of the Stability of Motion. Doctoral thesis, 1892.

P. Mertikopoulos and M. Staudigl. On the convergence of gradient-like flows with noisy gradient input. CoRR, abs/1611.06730, 2016.

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics. Wiley, 1983.

Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.

B. O'Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, 2015.

B. Øksendal. Stochastic Differential Equations: An Introduction with Applications. Universitext. Springer, 2003.

G. Pavliotis. Stochastic Processes and Applications: Diffusion Processes, the Fokker–Planck and Langevin Equations. Texts in Applied Mathematics. Springer New York, 2014.

M. Raginsky and J. Bouvrie. Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence. In CDC 2012, pages 6793–6800, 2012.

M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. CoRR, abs/1702.03849, 2017.

R. Rockafellar. Convex Analysis. Princeton University Press, 1970.

W. Su, S. Boyd, and E. Candès. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In NIPS, 2014.

A. Wibisono, A. C. Wilson, and M. I. Jordan. A variational perspective on accelerated methods in optimization. CoRR, abs/1603.04245, 2016.

A. C. Wilson, B. Recht, and M. I. Jordan. A Lyapunov analysis of momentum methods in optimization. CoRR, abs/1611.02635, 2016.

L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.