Algorithmic Instabilities of Accelerated Gradient Descent

Amit Attia, Blavatnik School of Computer Science, Tel Aviv University. amitattia@mail.tau.ac.il
Tomer Koren, Blavatnik School of Computer Science, Tel Aviv University, and Google Research. tkoren@tauex.tau.ac.il

Abstract. We study the algorithmic stability of Nesterov's accelerated gradient method. For convex quadratic objectives, Chen et al. [10] proved that the uniform stability of the method grows quadratically with the number of optimization steps, and conjectured that the same is true for the general convex and smooth case. We disprove this conjecture and show, for two notions of algorithmic stability (including uniform stability), that the stability of Nesterov's accelerated method in fact deteriorates exponentially fast with the number of gradient steps. This stands in sharp contrast to the bounds in the quadratic case, but also to known results for non-accelerated gradient methods, where stability typically grows linearly with the number of steps.

1 Introduction

Algorithmic stability has emerged over the last two decades as a central tool for generalization analysis of learning algorithms. While the classical approach in generalization theory, originating in the PAC learning framework, appeals to uniform convergence arguments, more recent progress on stochastic convex optimization models, starting with the pioneering works of Bousquet and Elisseeff [6] and Shalev-Shwartz et al. [24], has relied on stability analysis for deriving tight generalization results for convex risk-minimizing algorithms.

Perhaps the most common form of algorithmic stability is so-called uniform stability [6]. Roughly, the uniform stability of a learning algorithm is the worst-case change in its output model, in terms of its loss on an arbitrary example, when a single sample in the training set is replaced. Bousquet and Elisseeff [6] initially used uniform stability to argue about the generalization of empirical risk minimization with strongly convex losses. Shalev-Shwartz et al. [24] revisited this concept and studied the stability effect of regularization on the generalization of convex models. Their bounds were recently improved in a variety of ways [13, 14, 7], and their approach has been influential in a variety of settings (e.g., [18, 16, 9]). In fact, to this day, algorithmic stability is essentially the only general approach for obtaining tight (dimension-free) generalization bounds for convex optimization algorithms applied to the empirical risk (see [24, 12]).

Significant focus has recently been put on studying the stability properties of iterative optimization algorithms. Hardt et al. [17] considered stochastic gradient descent (SGD) and gave the first bounds on its uniform stability for convex and smooth loss functions; these bounds grow linearly with the number of optimization steps. As observed by Feldman and Vondrak [13] and Chen et al. [10], their arguments also apply, with minor modifications, to full-batch gradient descent (GD). Bassily et al. [5] exhibited a significant gap in stability between the smooth and non-smooth cases, showing that non-smooth GD and SGD are inherently less stable than their smooth counterparts.
Even further, algorithmic stability has been used as an analysis technique in stochastic mini-batched iterative optimization (e.g., [25, 1]), and has proved crucial to the design and analysis of differentially private optimization algorithms [26, 4, 15], both lines of work focusing primarily on smooth optimization.

Having identified smoothness as key to the algorithmic stability of iterative optimization methods, the following fundamental question emerges: how stable are optimal methods for smooth convex optimization? In particular, what is the algorithmic stability of the celebrated Nesterov accelerated gradient (NAG) method [22], a cornerstone of optimal methods in convex optimization? Besides being a basic and natural question in its own right, its resolution could have important implications for the design and analysis of optimization algorithms, as well as deepen our understanding of the generalization properties of iterative gradient methods. Chen et al. [10] addressed this question in the case of convex quadratic objectives and derived bounds on the uniform stability of NAG that grow quadratically with the number of gradient steps (as opposed to the linear growth known for GD). They conjectured that similar bounds hold true more broadly, but fell short of proving this for general convex and smooth objectives. Our work is aimed at filling this gap.

1.1 Our Results

We establish tight algorithmic stability bounds for the Nesterov accelerated gradient method (NAG). We show that, somewhat surprisingly, the uniform stability of NAG grows exponentially fast with the number of steps in the general convex and smooth setting. Namely, the uniform stability of $T$-step NAG with respect to a dataset of $n$ examples is in general $\exp(\Omega(T))/n$; in particular, after merely $T = O(\log n)$ steps the stability becomes the trivial $\Omega(1)$. This result demonstrates a sharp contrast between the stability of NAG in the quadratic case and in the general convex case, and disproves the conjecture of Chen et al. [10] that the uniform stability of NAG in the general convex setting is $O(T^2/n)$, as in the case of a quadratic objective.

Our results in fact apply to a simpler notion of stability, one that is arguably more fundamental in the context of iterative optimization methods, which we term initialization stability. The initialization stability of an algorithm $A$ (formally defined in Section 2 below) measures the sensitivity of $A$'s output to an $\epsilon$-perturbation of its initialization point. For this notion, we demonstrate a construction of a smooth and convex objective function such that, for sufficiently small $\epsilon$, the stability of $T$-step NAG is lower bounded by $\exp(\Omega(T))\epsilon$. Here again, we exhibit a dramatic gap between the quadratic and general convex cases: for quadratic objectives, we show that the initialization stability of NAG is upper bounded by $O(T\epsilon)$.

For completeness, we also prove initialization stability upper bounds in a few relevant convex optimization settings: for GD, we analyze both the smooth and non-smooth cases; for NAG, we give bounds for quadratic objectives as well as for general smooth ones. Table 1 summarizes the stability bounds we establish compared to existing bounds in the literature. Note in particular the remarkable exponential gap between the stability bounds for GD and NAG in the general smooth case, with respect to both stability definitions.
Stability lower bounds for NAG are discussed in Sections 3 and 4; initialization stability upper bounds for the various settings and additional uniform stability bounds are detailed in the full version of the paper [3].

| Method | Setting | Init. Stability | Unif. Stability | Reference |
|---|---|---|---|---|
| GD | convex, smooth | $\Theta(\epsilon)$ | $\Theta(T/n)$ | Hardt et al. [17] |
| GD | convex, non-smooth | $\Theta(\epsilon + \eta\sqrt{T} + \eta T/n)$ | | Bassily et al. [5] |
| NAG | convex, quadratic | **$O(T\epsilon)$** | $\Theta(T^2/n)$ | Chen et al. [10] |
| NAG | convex, smooth | **$\exp(\Theta(T))\,\epsilon$** | **$\exp(\Theta(T))/n$** | (this paper) |

Table 1: Stability bounds introduced in this work (in bold) compared to existing bounds. For simplicity, all bounds in the smooth case are for $\eta = \Theta(1/\beta)$. The lower bounds for NAG are presented here in a simplified form; the actual bounds exhibit a fluctuation in the increase of stability; see also Fig. 2 and the precise results in Sections 3 and 4.

Finally, we remark that our focus here is on the general convex (and smooth) case, and we do not provide formal results for the strongly convex case. However, we argue that stability analysis in the latter case is not as compelling as in the general case. Indeed, a strongly convex objective admits a unique minimum, and so NAG will converge to an $\epsilon$-neighborhood of this minimum in $O(\log(1/\epsilon))$ steps from any initialization, at which point its stability becomes $O(\epsilon)$; thus, with strong convexity, perturbations in initialization get quickly washed away as the algorithm rapidly converges to the unique optimum. (A similar reasoning also applies to uniform stability with strongly convex losses.)

1.2 Overview of Main Ideas and Techniques

We now provide some intuition for our constructions and highlight some of the key ideas leading to our results. We start by revisiting the analysis of the quadratic case, which is simpler and better understood.

Why NAG is stable for quadratics: Consider a quadratic function $f$ with Hessian matrix $H \succeq 0$. For analyzing the initialization stability of NAG, let us consider two runs of the method initialized at $x_0$ and $\tilde x_0$ respectively, and let $(x_t, y_t)$, $(\tilde x_t, \tilde y_t)$ denote the corresponding NAG iterates at step $t$. Further, let us denote by $x^\Delta_t \doteq x_t - \tilde x_t$ the difference between the two sequences of iterates. Using the update rule of NAG (see Eqs. (1) and (2) below) and the fact that for a quadratic $f$, differences between gradients can be expressed as $\nabla f(x) - \nabla f(x') = H(x - x')$ for any $x, x' \in \mathbb{R}^d$, it is straightforward to show that the distance $x^\Delta_t$ evolves according to

$$x^\Delta_{t+1} = (I - \eta H)\big((1+\gamma_t)\,x^\Delta_t - \gamma_t\, x^\Delta_{t-1}\big).$$

This recursion can be naturally put in matrix form, leading to

$$\begin{pmatrix} x^\Delta_{t+1} \\ x^\Delta_t \end{pmatrix} = \begin{pmatrix} (1+\gamma_t)A & -\gamma_t A \\ I & 0 \end{pmatrix} \begin{pmatrix} x^\Delta_t \\ x^\Delta_{t-1} \end{pmatrix},$$

where here $A = I - \eta H$. Thus, for a quadratic $f$, bounding the divergence $x^\Delta_t$ between the two NAG sequences reduces to controlling the operator norm of the matrix product

$$\left\| \prod_{k=1}^{t-1} \begin{pmatrix} (1+\gamma_k)A & -\gamma_k A \\ I & 0 \end{pmatrix} \right\|.$$

Remarkably, it can be shown that this norm is $O(t)$ for any $0 \preceq A \preceq I$ and any choice of $\gamma_1, \ldots, \gamma_{t-1} \le 1$. (This can be seen by writing the Schur decomposition of the involved matrices, as we show in the full version of the paper [3]; Chen et al. [10] give an alternative argument based on Chebyshev polynomials.) As a consequence, the initialization stability of NAG for a quadratic objective $f$ is shown to grow only linearly with the number of steps $t$.

What breaks down in the general convex case: For a general convex (twice-differentiable and smooth) $f$, the Hessian matrix is of course no longer fixed across the execution. Assuming for simplicity the one-dimensional case, similar arguments show that the relevant operator norm is of the form

$$\left\| \prod_{k=1}^{t-1} \begin{pmatrix} (1+\gamma_k)A_k & -\gamma_k A_k \\ 1 & 0 \end{pmatrix} \right\|,$$
where $0 \le A_1, \ldots, A_{t-1} \le 1$ are related to Hessians of $f$ taken at suitable points along the optimization trajectory. However, if the $A_k$ are allowed to vary arbitrarily between steps, the matrix product above might explode exponentially fast, even in the one-dimensional case! Indeed, fix $\gamma_k = 0.9$ for all $k$, and set $A_k = 0$ whenever $k \bmod 3 = 0$ and $A_k = 1$ otherwise; then, using simple linear algebra, the three-step blocks of the product satisfy

$$\begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 1.9 & -0.9 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 1.9 & -0.9 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 2.71 & -1.71 \end{pmatrix},$$

a matrix whose nonzero eigenvalue is $-1.71$; repeating this block therefore inflates the operator norm of interest by a factor of $1.71$ every three steps, so that it grows as $\Omega(1.71^{\,t/3})$.

What a hard function should look like: The exponential blowup we exhibited above hinged on a worst-case sequence $A_1, \ldots, A_t$ that varies significantly between consecutive steps. It remains unclear, however, what this implies for the actual optimization setup we care about, and whether such a sequence can be realized by Hessians of a convex and smooth function $f$. Our main results essentially answer the latter question in the affirmative, and build upon a construction of such a function $f$ that directly imitates such a bad sequence. Concretely, we generate a hard function inductively based on a running execution of NAG, where in each step we amend the construction with a gadget function having a piecewise-constant Hessian (that equals either $0$ or the maximal $\beta$); see Fig. 1 for an illustration of this construct. The interval pieces are carefully chosen, based on the NAG iterates computed so far, in a way that a slightly perturbed execution would traverse intervals with an appropriate pattern of Hessians, inducing a behaviour similar to the one exhibited by the matrix products above and leading to an exponential blowup in the stability terms.

Figure 1: A function (left) constructed from four instantiations of our gadget (right) at increasing sizes. During an interval with zero Hessian, the trajectory with the larger momentum gains distance. When reaching an interval with maximal Hessian (depicted here as iteration $t$), the slow trajectory experiences a larger gradient, which gives it larger momentum and makes it become the faster one.

Fig. 2 shows a simulation of the divergence between the two trajectories of NAG on the objective function we construct, illustrating how the divergence fluctuates between positive and negative values, with its absolute value growing exponentially with time. More technical details on this construction can be found in Section 3.

Figure 2: Divergence between trajectories (in log-scale) along the optimization process, shown for (a) the construction for $\eta = 1/(2\beta)$ and (b) the construction for $\eta = 1/(10\beta)$. At steps $t = \Theta(i/\eta\beta)$ ($i = 1, 2, \ldots$), NAG experiences an exponential growth in the divergence, as it reaches an interval with maximal Hessian. The dashed lines depict our theoretical exponential lower bound.
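To see the blowup of the matrix product numerically, the following short script (our own illustration, not part of the paper) multiplies the $2\times 2$ transfer matrices above with $\gamma_k = 0.9$ and the periodic Hessian pattern $A_k \in \{0, 1\}$, and prints the operator norm of the running product:

```python
import numpy as np

# Transfer matrices of the one-dimensional divergence recursion:
# M_k = [[(1 + g) * A_k, -g * A_k], [1, 0]] with fixed momentum g = 0.9.
g = 0.9
P = np.eye(2)
for k in range(1, 31):
    A = 0.0 if k % 3 == 0 else 1.0   # Hessian pattern: zero every third step
    M = np.array([[(1 + g) * A, -g * A],
                  [1.0,          0.0]])
    P = M @ P                        # newest step multiplies on the left
    if k % 3 == 0:
        print(k, np.linalg.norm(P, 2))
```

Every three steps the printed spectral norm grows by a factor of roughly $1.71$, matching the eigenvalue computed above.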
From initialization stability to uniform stability: Finally, we employ a simple reduction to translate our results on initialization stability into results on uniform stability in the context of empirical risk optimization, where one uses (full-batch) NAG to minimize the (convex, smooth) empirical risk induced by a sample $S$ of $n$ examples. Concretely, we show that by replacing a single example in $S$, we can arrive at a scenario where, after one step of NAG on the original and modified samples, the respective iterates are $\epsilon = \Theta(1/n)$ away from each other, whereas in the remaining steps both empirical risks simulate our bad function from before. Thus, we again end up with an exponential increase in the divergence between the two executions, which leads to a similar increase in the algorithmic (uniform) stability: the latter becomes as large as $\Omega(1)$ after merely $T = O(\log n)$ steps of full-batch NAG. The formal details of this reduction can be found in Section 4.

1.3 Discussion and Additional Related Work

It is interesting to contrast our results with what is known for the closely related heavy ball method [23]. Classic results show that while for convex quadratic objectives the (properly tuned) heavy ball method attains the accelerated $O(1/T^2)$ convergence rate, for general convex and smooth functions it might fail to converge at all (see [20]). More specifically, it is known that there exist objectives on which heavy ball follows a cyclic trajectory that never gets close to the optimum; it is then not hard to turn such a construction into an instability result for heavy ball, as a slight perturbation of the cyclic pattern can be shown to make the method converge to the optimum.

Also related to our work is Devolder et al. [11], who analyzed GD and NAG with inexact first-order information, namely, in a setting where each gradient update is contaminated with a bounded yet arbitrary perturbation. Interestingly, they showed that, in contrast to GD, NAG suffers from an accumulation of errors, which appears analogous to the linear increase in initialization stability the latter experiences in the quadratic case. At the same time, in the general convex case their results might seem to be at odds with ours, as we show that even a single perturbation at initialization suffices for extreme instabilities. However, note that they analyze the impact of perturbations on the convergence rate of NAG (in terms of objective value), whereas algorithmic stability is concerned with their effect on the actual iterates: specifically, initialization stability captures the extent to which the iterates of the algorithm might stray from their original positions as a result of a small perturbation of the initialization point.

Various forms of stability are well studied in the literature on dynamical systems, with the notion of Lyapunov stability being perhaps the most notable one. While related to our study, Lyapunov stability concerns the stability of a solution under a perturbation near a point of equilibrium of the system; the solution is then said to be stable if the system remains in equilibrium despite this perturbation. In contrast, we are concerned with an arbitrary perturbation of the algorithm's initialization point, which is the case of interest for initialization stability.

Our work leaves a few intriguing open problems for future investigation. Most importantly, it remains unclear whether there exists a different accelerated method (one with the optimal $O(1/T^2)$ rate for smooth and convex objectives) that is also $\mathrm{poly}(T)$-stable. Bubeck et al. [8] suggested a geometric alternative to NAG that comes to mind, and it could be interesting to check whether this method, or a variant thereof, is more stable than NAG. Another open question is to resolve the gap between our stability lower and upper bounds for NAG in the regime $\eta \ll 1/\beta$: while our lower bounds have an exponential dependence on $\eta$, the upper bounds do not.
Finally, it could be interesting to determine whether the $O(T\epsilon)$ initialization stability bound we have for NAG in the quadratic case is tight (the corresponding uniform stability result is actually tight even for linear losses, but this may not be the case for initialization stability).

2 Preliminaries

In this work we are interested in optimization of convex and smooth functions over the $d$-dimensional Euclidean space $\mathbb{R}^d$. A function $f$ is said to be $\beta$-smooth (for $\beta > 0$) if its gradient is $\beta$-Lipschitz, namely, if for all $u, v \in \mathbb{R}^d$ it holds that $\|\nabla f(u) - \nabla f(v)\| \le \beta\,\|u - v\|$.

2.1 Nesterov Accelerated Gradient Method

The Nesterov Accelerated Gradient (NAG) method [22] we consider in this paper takes the following form. Starting with $x_0$ and $y_0 = x_0$, it iterates for $t = 1, 2, \ldots$:

$$x_t = y_{t-1} - \eta\,\nabla f(y_{t-1}); \tag{1}$$
$$y_t = x_t + \gamma_t\,(x_t - x_{t-1}), \tag{2}$$

where $\gamma_t = \frac{t-1}{t+2}$ and $\eta > 0$ is a step-size parameter. (For concreteness, we focus here on the most common choice of momentum parameters $\gamma_t$, but our results can be adapted to similar settings.) For a $\beta$-smooth convex objective $f$ and $0 < \eta \le 1/\beta$, this method exhibits the convergence rate $O(1/\eta T^2)$; for $\eta = 1/\beta$, this gives the optimal convergence rate for the class of $\beta$-smooth convex functions (see [21]). We remark that while NAG appears in several other forms in the literature, many of these are in fact equivalent to the one given in Eqs. (1) and (2); for more details, see the full version of the paper [3].

Throughout, we use the notation $\mathrm{NAG}(f, x_0, t, \eta)$ to refer to the iterates $(x_t, y_t)$ at step $t$ of NAG on $f$ initialized at $x_0$ with step size $\eta$. We sometimes drop the step-size argument and use the shorter notation $\mathrm{NAG}(f, x_0, t)$ when $\eta$ is clear from the context.

We will use the following definitions and relations throughout. We introduce the following notation for the momentum term of NAG, for all $t > 0$:

$$m_t \doteq \gamma_t\,(x_t - x_{t-1}). \tag{3}$$

Using this notation, we have that

$$x_t = y_{t-1} - \eta\,\nabla f(y_{t-1}), \tag{4}$$
$$y_t = x_t + m_t = y_{t-1} - \eta\,\nabla f(y_{t-1}) + m_t, \tag{5}$$
$$m_t = \gamma_t\,(m_{t-1} - \eta\,\nabla f(y_{t-1})). \tag{6}$$

Here, Eq. (6) follows from Eqs. (3) to (5) via

$$m_t = \gamma_t(x_t - x_{t-1}) = \gamma_t\big(y_{t-1} - \eta\nabla f(y_{t-1}) - x_{t-1}\big) = \gamma_t\big(x_{t-1} + m_{t-1} - \eta\nabla f(y_{t-1}) - x_{t-1}\big) = \gamma_t\big(m_{t-1} - \eta\nabla f(y_{t-1})\big),$$

where the first equality is Eq. (3), the second uses Eq. (4), and the third uses Eq. (5) applied at step $t - 1$.
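For concreteness, here is a minimal sketch of the iteration in Eqs. (1) and (2) (our own reference implementation, not the paper's code; `grad_f` is an assumed gradient oracle supplied by the caller):

```python
import numpy as np

def nag(grad_f, x0, T, eta):
    """Run T steps of NAG (Eqs. (1)-(2)) from x0 with step size eta.

    Returns the lists of iterates (x_t, y_t) for t = 0, ..., T.
    """
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()                      # y_0 = x_0
    xs, ys = [x_prev.copy()], [y.copy()]
    for t in range(1, T + 1):
        gamma = (t - 1) / (t + 2)          # momentum parameter gamma_t
        x = y - eta * grad_f(y)            # Eq. (1): gradient step from y_{t-1}
        y = x + gamma * (x - x_prev)       # Eq. (2): momentum extrapolation
        x_prev = x
        xs.append(x.copy()); ys.append(y.copy())
    return xs, ys
```

For instance, `nag(lambda y: 2 * y, 1.0, 100, 0.1)` runs the method on the $\beta$-smooth quadratic $f(x) = x^2$ (here $\beta = 2$ and $\eta \le 1/\beta$).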
2.2 Algorithmic Stability

We consider two forms of algorithmic stability. The first is the well-known uniform stability [6]; the second is initialization stability, which we define here.

Uniform stability. Consider the following general setting of supervised learning. There is a sample space $Z$ of examples and an unknown distribution $D$ over $Z$. We receive a training set $S = (z_1, \ldots, z_n)$ of $n$ samples drawn i.i.d. from $D$. The goal is to find a model $w$ with a small population risk $R(w) \doteq \mathbb{E}_{z \sim D}[\ell(w; z)]$, where $\ell(w; z)$ is the loss of the model described by $w$ on an example $z$. However, as we cannot evaluate the population risk directly, learning algorithms are applied to the empirical risk with respect to the sample $S$, given by $R_S(w) \doteq \frac{1}{n}\sum_{i=1}^{n} \ell(w; z_i)$. In this paper, our algorithm of interest in this context is full-batch NAG, namely, NAG applied to the empirical risk $R_S$. We use the following notion of uniform stability. (We give here a definition suitable for deterministic algorithms, which suffices for the context of this paper; similar definitions exist for randomized algorithms, see for example [17, 13].)

Definition 1 (uniform stability). Algorithm $A$ is $\epsilon$-uniformly stable if for all $S, S' \in Z^n$ such that $S, S'$ differ in at most one example, the corresponding outputs $A(S)$ and $A(S')$ satisfy
$$\sup_{z \in Z}\, |\ell(A(S); z) - \ell(A(S'); z)| \le \epsilon.$$
We use $\delta^{\mathrm{unif}}_{A,\ell}(n)$ to denote the infimum over all $\epsilon > 0$ for which this inequality holds.

Initialization stability. A second notion of algorithmic stability that we define and discuss in this paper, natural in the context of iterative optimization methods, pertains to the stability of the optimization algorithm with respect to its initialization point. Initialization stability measures the sensitivity of the algorithm's output to a small perturbation of its initial point; formally:

Definition 2 (initialization stability). Let $A$ be an algorithm that, when initialized at a point $x \in \mathbb{R}^d$, produces $A(x) \in \mathbb{R}^d$ as output. Then for $\epsilon > 0$, the initialization stability of $A$ at $x_0 \in \mathbb{R}^d$ is given by
$$\delta^{\mathrm{init}}_A(x_0, \epsilon) = \sup\big\{ \|A(\tilde x_0) - A(x_0)\| : \tilde x_0 \in \mathbb{R}^d,\ \|\tilde x_0 - x_0\| \le \epsilon \big\}.$$

3 Initialization Stability of NAG

In this section we prove our first main result, regarding the initialization stability of NAG:

Theorem 3. Let $\epsilon, G, \beta > 0$ and $0 < \eta \le 1/\beta$. Consider the two initialization points $x_0 = 0$, $\tilde x_0 = \epsilon$. Then there exist a convex, $\beta$-smooth, $G$-Lipschitz function $f$ that attains a minimum over $\mathbb{R}$, and universal constants $c_1, c_2 > 0$, such that the sequences $(x_t, y_t) = \mathrm{NAG}(f, x_0, t, \eta)$ and $(\tilde x_t, \tilde y_t) = \mathrm{NAG}(f, \tilde x_0, t, \eta)$ satisfy
$$\delta^{\mathrm{init}}_{\mathrm{NAG}_t}(x_0, \epsilon) \ge |x_t - \tilde x_t| \ge \min\left\{ \frac{G}{3\beta},\ c_2\, e^{c_1 \eta\beta t}\,\epsilon \right\}, \qquad \forall\, t \in \left\{ \left\lceil \tfrac{10}{\eta\beta} \right\rceil (i+2) : i = 1, 2, \ldots \right\}.$$
Furthermore, for all $t > \left\lceil \tfrac{10}{\eta\beta} \right\rceil \big( \big\lceil \ln \tfrac{3G}{2\beta\epsilon} \big\rceil + 3 \big)$ it holds that $\delta^{\mathrm{init}}_{\mathrm{NAG}_t}(x_0, \epsilon) \ge \tfrac{G}{3\beta}$.

In words, the theorem establishes an exponential blowup of the distance between the two trajectories $x_t$ and $\tilde x_t$ during the initial $O(\log(1/\epsilon))$ steps, after which the (lower bound on the) distance reaches a constant and stops increasing. Notice that in the blowup phase an increase in distance happens roughly every $\Theta(1/\eta\beta)$ steps; indeed, the actual behaviour of NAG on the function we construct exhibits fluctuations in the difference $x_t - \tilde x_t$, as illustrated in Fig. 2. We remark that a similar bound holds also for the $y_t$ sequence produced by NAG.

Construction. Throughout this section, we assume without loss of generality that $0 < \epsilon < G/2\beta$. (When $\epsilon \ge G/2\beta$, our result holds simply for a constant function.) To lower bound the initialization stability and prove the theorem, we rely on the following construction of functions $f_0, f_1, \ldots : \mathbb{R} \to \mathbb{R}$. Let the parameters $G, \beta, \eta, \epsilon > 0$ be given, and for all $i \ge 0$ define $n_i \doteq \lceil 10/\eta\beta \rceil (i+2)$. The construction proceeds as follows:

(i) Let $f_0(x) \doteq -Gx$;
(ii) For $i \ge 1$: let $(x_{n_i}, y_{n_i}) = \mathrm{NAG}(f_{i-1}, 0, n_i, \eta)$ and $(\tilde x_{n_i}, \tilde y_{n_i}) = \mathrm{NAG}(f_{i-1}, \epsilon, n_i, \eta)$, and define $f_i : \mathbb{R} \to \mathbb{R}$ as
$$f_i(x) \doteq -Gx + \beta \int_0^x\!\!\int_0^y \mathbf{1}\big[\exists\, j \le i \ \text{s.t.}\ z \in [y^{\min}_{n_j}, y^{\max}_{n_j}]\big]\, dz\, dy,$$
where $y^{\min}_{n_i} \doteq \min\{y_{n_i}, \tilde y_{n_i}\}$ and $y^{\max}_{n_i} \doteq \max\{y_{n_i}, \tilde y_{n_i}\}$;
(iii) Let $M = \sup\big\{ i \ge 0 : \max_x f_j'(x) < -\tfrac12 G \ \text{for all}\ 0 \le j \le i \big\}$.

Note that the above recursion defines an infinite sequence of functions $f_0, f_1, f_2, \ldots : \mathbb{R} \to \mathbb{R}$. Ultimately, we will be interested in the functions $\{f_i\}_{i \le M}$, which we will analyze in order to prove the instability result. Further, note that $\max_x f_0'(x) < -\tfrac12 G$, thus $M$ itself is well defined, possibly $\infty$.

The functions constructed above are not lower bounded (and thus do not admit a minimum). Below we define a lower-bounded adaptation of $f_M$ (assuming $1 \le M < \infty$, which is proved in Lemma 7 later on). Our modified version of $f_M$, termed $\tilde f$, is defined by a quadratic continuation of $f_M$ to the right of $p \doteq y^{\max}_{n_M}$, up to a plateau. This construction is defined formally as:

$$\tilde f(x) \doteq \begin{cases} f_M(x) & x \le p; \\ f_M(p) + f_M'(p)\,(x - p) + \frac{\beta}{2}(x - p)^2 & p < x \le p - \frac{1}{\beta} f_M'(p); \\ f_M(p) - \frac{1}{2\beta} f_M'(p)^2 & \text{otherwise}. \end{cases}$$
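To illustrate, the derivative of $f_i$ has a simple closed form, $f_i'(x) = -G + \beta\,\big|[0, x] \cap \bigcup_{j \le i} [y^{\min}_{n_j}, y^{\max}_{n_j}]\big|$, which the following sketch (ours, not the paper's; it assumes $x \ge 0$ and that the recorded intervals are disjoint, as they correspond to distinct epochs of the construction) evaluates:

```python
def f_i_grad(x, intervals, G, beta):
    """Derivative of the construction's f_i at a point x >= 0.

    f_i'(x) = -G + beta * |[0, x] ∩ (union of the intervals)|, where each
    (lo, hi) in `intervals` marks a region of maximal Hessian beta.
    Assumes the intervals are disjoint and contained in [0, infinity).
    """
    covered = sum(max(0.0, min(x, hi) - lo) for lo, hi in intervals)
    return -G + beta * covered
```

The quadratic continuation defining $\tilde f$ then simply caps this derivative at $0$ to the right of $p$, which is what makes $\tilde f$ attain a minimum.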
Analysis. We start by stating a few lemmas that we will use in the proof of our main theorem. Our focus is on the functions $f_i$ for $0 \le i \le M$, deferring the analysis of $\tilde f$ to after we establish that $M$ is finite. First, we show that the functions we constructed are indeed convex, smooth and Lipschitz.

Lemma 4. For all $0 \le i \le M$, the function $f_i$ is convex, $\beta$-smooth and $G$-Lipschitz.

Proof. The second derivative of $f_i$ is
$$f_i''(x) = \beta\,\mathbf{1}\big[\exists\, j \le i \ \text{s.t.}\ x \in [y^{\min}_{n_j}, y^{\max}_{n_j}]\big] \in \{0, \beta\}.$$
Thus, $f_i$ is convex and $\beta$-smooth. We lower bound the first derivative by
$$f_i'(x) = -G + \beta \int_0^x \mathbf{1}\big[\exists\, j \le i \ \text{s.t.}\ z \in [y^{\min}_{n_j}, y^{\max}_{n_j}]\big]\, dz \ \ge\ -G,$$
and by the definition of $M$, $f_i'(x) < -G/2$. Hence for all $x \in \mathbb{R}$ we have $f_i'(x) \in [-G, -G/2)$, so $f_i$ is $G$-Lipschitz over $\mathbb{R}$. ∎

Next, we analyze the iterates of NAG on $f_i$ for any given $0 \le i \le M$. Fix such an index $i$ and consider $(x_t, y_t) = \mathrm{NAG}(f_i, 0, t)$ and $(\tilde x_t, \tilde y_t) = \mathrm{NAG}(f_i, \epsilon, t)$ for all $t \le T$, for some $T \le n_{i+1}$. We introduce the following compact notation for differences between the NAG terms of the two sequences:
$$x^\Delta_t \doteq x_t - \tilde x_t, \qquad y^\Delta_t \doteq y_t - \tilde y_t, \qquad \nabla^\Delta_t \doteq \nabla f_i(y_t) - \nabla f_i(\tilde y_t), \qquad m^\Delta_t \doteq m_t - \tilde m_t.$$
From the update rules of NAG (Eqs. (4) to (6)), we have that
$$x^\Delta_t = y^\Delta_{t-1} - \eta\,\nabla^\Delta_{t-1}, \tag{7}$$
$$y^\Delta_t = x^\Delta_t + m^\Delta_t = y^\Delta_{t-1} - \eta\,\nabla^\Delta_{t-1} + m^\Delta_t, \tag{8}$$
$$m^\Delta_t = \gamma_t\,(m^\Delta_{t-1} - \eta\,\nabla^\Delta_{t-1}). \tag{9}$$

Our next lemma describes the evolution of the differences $\nabla^\Delta_t$ and $m^\Delta_t$ in terms of $y^\Delta_t$.

Lemma 5. For all $t \le T$,
$$\nabla^\Delta_t = \begin{cases} \beta\, y^\Delta_t & \text{if } t \in \{n_j\}_{j=1}^{i}; \\ 0 & \text{otherwise}, \end{cases} \qquad\text{and}\qquad m^\Delta_t = \begin{cases} \gamma_t\,(m^\Delta_{t-1} - \eta\beta\, y^\Delta_{t-1}) & \text{if } t \in \{n_j + 1\}_{j=1}^{i}; \\ \gamma_t\, m^\Delta_{t-1} & \text{otherwise}. \end{cases}$$

The following lemma summarizes the evolution of the distance between the sequences $y_t$ and $\tilde y_t$ at steps $t \in \{n_j\}_{1 \le j \le i+1}$ and for $t > n_{i+1}$. The exponential growth is achieved by a balance between the difference in momentum terms and the difference between the sequences.

Lemma 6. For the difference terms $y^\Delta_t$, we have the following:
(i) For all $1 \le j \le i$, it holds that $\tfrac{2}{3}\eta\beta\,|y^\Delta_{n_j}| \le |m^\Delta_{n_j+1}| \le \tfrac{1}{5}\eta\beta\,|y^\Delta_{n_{j+1}}|$.
(ii) For all $0 \le j \le i$, it holds that $|y^\Delta_{n_{j+1}}| = y^{\max}_{n_{j+1}} - y^{\min}_{n_{j+1}} \ge 3^j\,\epsilon$.
(iii) For all $t > n_{i+1}$, it holds that $|y^\Delta_t| \ge |y^\Delta_{n_{i+1}}|$.

Finally, we can show that $\tilde f$ is well defined by proving, in the following lemma, that $M$ is finite. The bound on $M$ also indicates that after $O(\log\tfrac{1}{\epsilon})$ steps the two trajectories $y_t, \tilde y_t$ reach a constant distance.

Lemma 7. It holds that $1 \le M \le \big\lceil \ln \tfrac{3G}{2\beta\epsilon} \big\rceil$ (in particular, $M$ is finite), and $y^{\max}_{n_{M+1}} - y^{\min}_{n_{M+1}} \ge \tfrac{G}{3\beta}$.

Now we can return to our $\tilde f$. First, we show that it indeed possesses the basic properties required by Theorem 3.

Lemma 8. The function $\tilde f$ is convex, $\beta$-smooth, $G$-Lipschitz, and attains a minimum $x^\star \in \arg\min_x \tilde f(x)$ such that $|x_0 - x^\star| = O\big( (G/\eta\beta^2)\, \log^2(G/\beta\epsilon) \big)$ for $x_0 \in \{0, \epsilon\}$.

The final lemma we require shows that the distance between the two trajectories is the same for $\tilde f$ and $f_M$. This holds true since the two functions coincide for $x \le p$, after which the iterates reach a plateau that induces the same stability dynamics as the linear part of $f_M$ at $x > p$.

Lemma 9. Let $(x_t, y_t) = \mathrm{NAG}(\tilde f, 0, t, \eta)$ and $(\tilde x_t, \tilde y_t) = \mathrm{NAG}(\tilde f, \epsilon, t, \eta)$ be the iterates of NAG on $\tilde f$ from our two initialization points. Similarly, for $f_M$, let $(\hat x_t, \hat y_t) = \mathrm{NAG}(f_M, 0, t, \eta)$ and $(\bar x_t, \bar y_t) = \mathrm{NAG}(f_M, \epsilon, t, \eta)$. Then for all $t$, we have that $x_t - \tilde x_t = \hat x_t - \bar x_t$ and $y_t - \tilde y_t = \hat y_t - \bar y_t$.

We defer the proofs of Lemmas 5 to 9 to the full version of the paper [3].
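Before turning to the proof, the dynamics captured by Lemmas 5 and 6 can be simulated directly: the following sketch (our own illustration, not the paper's code) iterates the difference recursion of Eqs. (8) and (9), with $\nabla^\Delta_t = \beta y^\Delta_t$ exactly at the kick steps $t \in \{n_j\}$ and $0$ elsewhere, and prints $y^\Delta_{n_j}$, which flips sign from epoch to epoch while growing geometrically, in line with Lemma 6(ii) and Fig. 2:

```python
import numpy as np

eta, beta, eps = 0.5, 1.0, 1e-9          # eta = 1/(2*beta), as in Fig. 2(a)
n_gap = int(np.ceil(10 / (eta * beta)))  # epoch length ceil(10/(eta*beta))
kicks = {n_gap * (i + 2) for i in range(1, 15)}  # the steps n_i, i = 1, 2, ...

y_d, m_d = eps, 0.0                      # y^Δ_0 = eps, m^Δ_0 = 0
for t in range(1, 400):
    gamma = (t - 1) / (t + 2)
    g_d = beta * y_d if (t - 1) in kicks else 0.0  # Lemma 5: gradient-difference pattern
    m_d = gamma * (m_d - eta * g_d)      # Eq. (9)
    y_d = y_d - eta * g_d + m_d          # Eq. (8)
    if t in kicks:
        print(t, y_d)                    # |y^Δ_{n_j}| grows roughly geometrically
```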
We now proceed to prove our main result.

Proof of Theorem 3. By Lemma 9, it suffices to show that the lower bound holds for the function $f_M$. Let $c_1 = \tfrac{1}{11}\ln 3$ and $c_2 = \tfrac{4}{5}\cdot 3^{-3}$, and let $t = n_i$ for some $i \ge 1$. The first case we deal with is $i \le M + 1$. We already established in Lemma 6 that $|y^\Delta_{n_i}| \ge 3^{i-1}\epsilon$. Since $t = n_i = (i+2)\lceil 10/\eta\beta\rceil$ and $\lceil 10/\eta\beta\rceil \le 11/\eta\beta$ (as $\eta\beta \le 1$), we have $i \ge \eta\beta t/11 - 2$. Hence,
$$|y^\Delta_{n_i}| \ \ge\ 3^{\eta\beta t/11 - 3}\,\epsilon \ =\ \tfrac{5}{4}\, c_2\, e^{c_1 \eta\beta t}\,\epsilon.$$
To relate this to $|x_t - \tilde x_t|$:
$$|x_t - \tilde x_t| = |x^\Delta_{n_i}| \ \ge\ |y^\Delta_{n_i}| - |m^\Delta_{n_i}| \qquad \text{(Eq. (8))}$$
$$\ge\ |y^\Delta_{n_i}| - |m^\Delta_{n_{i-1}+1}| \prod_{t' = n_{i-1}+2}^{n_i} \gamma_{t'} \qquad \text{(Lemma 5 for } t' = n_{i-1}+2, \ldots, n_i\text{)}$$
$$\ge\ |y^\Delta_{n_i}| - |m^\Delta_{n_{i-1}+1}| \qquad \text{(since } 0 \le \gamma_{t'} \le 1\text{)}$$
$$\overset{(\star)}{\ge}\ |y^\Delta_{n_i}|\Big(1 - \frac{\eta\beta}{5}\Big) \ \ge\ \tfrac{4}{5}\,|y^\Delta_{n_i}| \ \ge\ c_2\, e^{c_1 \eta\beta t}\,\epsilon. \qquad (\eta \le 1/\beta)$$
Here, $(\star)$ follows from Lemma 6 if $i > 1$; the case $i = 1$ follows by combining Lemma 5 with $m^\Delta_1 = 0$, which implies $m^\Delta_{n_0+1} = 0$.

If $t > n_{M+1}$ (which includes the case $i > M + 1$, as well as $t > \lceil 10/\eta\beta\rceil\big(\lceil\ln\tfrac{3G}{2\beta\epsilon}\rceil + 3\big)$ by Lemma 7): since $t - 1 \ge n_{M+1}$,
$$|x_t - \tilde x_t| = |y^\Delta_{t-1} - \eta\,\nabla^\Delta_{t-1}| \overset{\text{(Eq. (7))}}{=} |y^\Delta_{t-1}| \overset{\text{(Lemma 5)}}{\ge} |y^\Delta_{n_{M+1}}|, \qquad \text{(Lemma 6)}$$
and using Lemma 7 we conclude that $|x_t - \tilde x_t| \ge \tfrac{G}{3\beta}$. Hence, together with Lemma 4, $f_M$ satisfies all properties of Theorem 3 except attaining a minimum, and by Lemmas 8 and 9, $\tilde f$ possesses all the properties needed for Theorem 3. ∎

4 Uniform Stability of NAG

In this section we present our second main result, regarding the uniform stability of (full-batch) NAG. This is given formally in the following theorem.

Theorem 10. For any $G, \beta, \eta$ and $n \ge 4$ such that $0 < \eta \le 1/\beta$, there exist a loss function $\ell(w; z)$ that is convex, $\beta$-smooth and $G$-Lipschitz in $w$ (for every $z \in Z$), and universal constants $c_3, c_4 > 0$, such that the uniform stability of $T$-step full-batch NAG with step size $\eta$ satisfies
$$\delta^{\mathrm{unif}}_{\mathrm{NAG}_T,\ell}(n) \ \ge\ \min\left\{ \frac{G^2}{3\beta},\ c_4\, e^{c_3 \eta\beta T}\, \frac{\beta\eta^2 G^2}{n} \right\}, \qquad \forall\, T \in \left\{ \left\lceil \tfrac{10}{\eta\beta}\cdot\tfrac{n}{n-3} \right\rceil (i+2) : i = 1, 2, \ldots \right\}.$$
Furthermore, for all $T > \big\lceil \tfrac{40}{\eta\beta} \big\rceil \big( \big\lceil \ln \tfrac{3G}{2\beta\epsilon} \big\rceil + 3 \big)$ (with $\epsilon$ as set in the construction below) it holds that $\delta^{\mathrm{unif}}_{\mathrm{NAG}_T,\ell}(n) \ge \tfrac{G^2}{3\beta}$.

The comments following Theorem 3 regarding the exponential blowup and the fluctuating behaviour apply here as well. Note also the perhaps surprising inverse dependence on $\beta$ (observe that $\beta\eta^2$ is also $O(1/\beta)$). This dependence can be explained by the fact that smooth optimization of a highly non-smooth yet still $G$-Lipschitz function must use a small step size (with $\eta \le 1/\beta$), which improves stability.

Construction. We denote the parameters given in the theorem by $\hat G, \hat\beta, \hat\eta, n$. We will use the construction from Section 3, with the properties of Theorem 3, in order to create a loss function and samples that induce the same optimization dynamics for $t \ge 1$. For the construction, we set the parameters $G, \beta, \eta, \epsilon$ as follows:
$$G = \hat G, \qquad \beta = \hat\beta, \qquad \eta = \frac{n-3}{n}\,\hat\eta, \qquad \epsilon = \frac{\beta\eta^2 G}{n}.$$
Using these parameters, we obtain $\tilde f$ from the construction of Section 3; as we proved in the previous section, this is a function for which Theorem 3 holds. Note that for $T < n_1$, the lower bound already holds even for quadratics, as we show in the full version of the paper [3]. We also define the following functions:
$$\ell(w; 1) \doteq 0, \tag{10}$$
$$\ell(w; 2) \doteq -\beta\eta G\, w + \beta \int_0^w\!\!\int_0^y \mathbf{1}\big[z \in [0, \eta G]\big]\, dz\, dy, \tag{11}$$
and further let $\ell(w; 3) = -Gw$, $\ell(w; 4) = Gw$, and $\ell(w; 5) = \tilde f(w)$. Note that all of these are convex, $\beta$-smooth and $G$-Lipschitz. Let $Z = \{1, 2, 3, 4, 5\}$ be our sample space; we consider the loss function $\ell(w; z)$ over $\mathbb{R} \times Z$. Our samples of interest are $S = (1, 3, 4, 5, \ldots, 5) \in Z^n$ and $S' = (2, 3, 4, 5, \ldots, 5) \in Z^n$. Thus, since the contributions of the examples $z = 3$ and $z = 4$ cancel, the empirical risks $R_S, R_{S'}$ corresponding to $S, S'$ are given by
$$R_S(w) = \frac{1}{n}\,\ell(w; 1) + \frac{n-3}{n}\,\tilde f(w) = \frac{n-3}{n}\,\tilde f(w), \tag{12}$$
$$R_{S'}(w) = \frac{1}{n}\,\ell(w; 2) + \frac{n-3}{n}\,\tilde f(w). \tag{13}$$

Analysis. The key lemma below shows that we have constructed a scenario that reduces the problem of analyzing the uniform stability of full-batch NAG to analyzing its initialization stability on the function $\tilde f$ constructed in Section 3.

Lemma 11. For all $t = 1, 2, \ldots$, we have that
$$\mathrm{NAG}(\tilde f, 0, t, \eta) = \mathrm{NAG}(R_S, 0, t, \hat\eta), \qquad \mathrm{NAG}(\tilde f, \epsilon, t, \eta) = \mathrm{NAG}(R_{S'}, 0, t, \hat\eta).$$

Using this key lemma, Theorem 10 follows immediately by setting $c_3 = c_1/4$ and $c_4 = c_2/4$, applying Theorem 3 with the samples $S, S'$ we defined, and choosing $z = 3$: as $\ell(w; 3) = -Gw$, we have $|\ell(x_T; 3) - \ell(\tilde x_T; 3)| = G\,|x_T - \tilde x_T|$, and by substituting the definitions of $G, \beta, \eta, \epsilon$ we obtain the two cases of Theorem 3 that lower bound $|x_T - \tilde x_T|$, as needed. The full proof of Lemma 11 and Theorem 10 can be found in the full version of the paper [3].
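For intuition, a one-step calculation, under our reconstruction of Eqs. (11) to (13) above (the signs and constants here are indicative, a sketch rather than the exact values of the full version [3]), shows where a perturbation of size $\epsilon = \Theta(\beta\eta^2 G/n)$ comes from: both runs start at $x_0 = y_0 = 0$, so the first NAG iteration is a plain gradient step, and
$$x_1(S) = -\hat\eta\, R_S'(0) = -\hat\eta\,\tfrac{n-3}{n}\,\tilde f'(0), \qquad x_1(S') = -\hat\eta\Big(\tfrac{1}{n}\,\ell'(0; 2) + \tfrac{n-3}{n}\,\tilde f'(0)\Big),$$
$$|x_1(S) - x_1(S')| = \frac{\hat\eta}{n}\,|\ell'(0; 2)| = \frac{\hat\eta\,\beta\eta G}{n} = \Theta\Big(\frac{\beta\eta^2 G}{n}\Big).$$
From step $2$ onward, both empirical risks behave like rescaled copies of $\tilde f$, which is the content of Lemma 11.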
Acknowledgments and Disclosure of Funding

We thank Naman Agarwal, Yair Carmon and Roi Livni for valuable discussions. This work was partially supported by the Israeli Science Foundation (ISF) grant no. 2549/19, by the Len Blavatnik and the Blavatnik Family Foundation, and by the Yandex Initiative in Machine Learning.

References

[1] N. Agarwal, R. Anil, T. Koren, K. Talwar, and C. Zhang. Stochastic optimization with laggard data pipelines. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020.
[2] Z. Allen-Zhu and L. Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
[3] A. Attia and T. Koren. Algorithmic instabilities of accelerated gradient descent. arXiv preprint arXiv:2102.02167, 2021. Available at https://arxiv.org/abs/2102.02167.
[4] R. Bassily, V. Feldman, K. Talwar, and A. Guha Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, 2019.
[5] R. Bassily, V. Feldman, C. Guzmán, and K. Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. Advances in Neural Information Processing Systems, 33, 2020.
[6] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.
[7] O. Bousquet, Y. Klochkov, and N. Zhivotovskiy. Sharper bounds for uniformly stable algorithms. In Conference on Learning Theory, pages 610–626. PMLR, 2020.
[8] S. Bubeck, Y. T. Lee, and M. Singh. A geometric alternative to Nesterov's accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.
[9] Z. Charles and D. Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. In International Conference on Machine Learning, pages 745–754. PMLR, 2018.
[10] Y. Chen, C. Jin, and B. Yu. Stability and convergence trade-off of iterative optimization algorithms. arXiv preprint arXiv:1804.01619, 2018.
[11] O. Devolder, F. Glineur, and Y. Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75, 2014.
[12] V. Feldman. Generalization of ERM in stochastic convex optimization: The dimension strikes back. Advances in Neural Information Processing Systems, 29:3576–3584, 2016.
[13] V. Feldman and J. Vondrak. Generalization bounds for uniformly stable algorithms. In Advances in Neural Information Processing Systems, pages 9747–9757, 2018.
[14] V. Feldman and J. Vondrak. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Conference on Learning Theory, pages 1270–1279, 2019.
[15] V. Feldman, T. Koren, and K. Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 439–449, 2020.
[16] A. Gonen and S. Shalev-Shwartz. Fast rates for empirical risk minimization of strict saddle problems. In Conference on Learning Theory, pages 1043–1063. PMLR, 2017.
[17] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234. PMLR, 2016.
[18] T. Koren and K. Y. Levy. Fast rates for exp-concave empirical risk minimization.
In Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 1, pages 1477–1485, 2015.
[19] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.
[20] L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.
[21] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2003.
[22] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
[23] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[24] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In COLT, 2009.
[25] J. Wang, W. Wang, and N. Srebro. Memory and communication efficient distributed stochastic optimization with minibatch prox. In Conference on Learning Theory, pages 1882–1919. PMLR, 2017.
[26] X. Wu, F. Li, A. Kumar, K. Chaudhuri, S. Jha, and J. Naughton. Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1307–1322, 2017.