# Benign Underfitting of Stochastic Gradient Descent

Tomer Koren, Blavatnik School of Computer Science, Tel Aviv University and Google Research. tkoren@tauex.tau.ac.il
Roi Livni, Department of Electrical Engineering, Tel Aviv University. rlivni@tauex.tau.ac.il
Yishay Mansour, Blavatnik School of Computer Science, Tel Aviv University and Google Research. mansour.yishay@gmail.com
Uri Sherman, Blavatnik School of Computer Science, Tel Aviv University. urisherman@mail.tau.ac.il

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

**Abstract.** We study to what extent stochastic gradient descent (SGD) may be understood as a conventional learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one-pass, without-replacement) SGD is classically known to minimize the population risk at rate $O(1/\sqrt{n})$, and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both empirical risk and generalization gap of $\Omega(1)$. Consequently, it turns out that SGD is not algorithmically stable in any sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the closely related with-replacement SGD, for which we show that an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate. Finally, we interpret our main results in the context of without-replacement SGD for finite-sum convex optimization problems, and derive upper and lower bounds for the multi-epoch regime that significantly improve upon previously known results.

## 1 Introduction

Conventional wisdom in statistical learning revolves around what is traditionally known as the bias-variance dilemma; the classical theory stipulates that the quality of fit to the training data be in a trade-off with model complexity, aiming for a sweet spot where the training error is small yet still representative of performance on independent test data. This perspective is reflected in the vast majority of generalization bound techniques offered by contemporary learning theory. Uniform convergence approaches [36, 4] seek capacity control over the model function class, and employ uniform laws of large numbers to argue convergence of sample averages to their respective expectations. Algorithmic stability [9, 32], on the other hand, builds on controlling the sensitivity of the learning algorithm to small changes in its input, and provides algorithm-dependent bounds. Nevertheless, despite the conceptual and technical differences between these two methods, both ultimately produce risk bounds by controlling the training error and the generalization gap. The same is true for many other techniques, including sample compression [17, 2], PAC-Bayes [18, 12], and information-theoretic generalization bounds [29, 37, 24], to name a few.

In recent years it has become clear there are other, substantially different, ways to manage the fit vs. complexity trade-off, that are in a sense incompatible with traditional generalization bound techniques. Evidently, heavily over-parameterized deep neural networks may be trained to perfectly fit training data and generalize well nonetheless [38, 25, 26], thus seemingly disobeying conventional statistical wisdom.
This phenomenon has garnered significant attention, with a flurry of research works dedicated to developing new techniques that would be able to explain the strong generalization performance of algorithms in this so-called interpolation regime (see [6, 8] and references therein). Notably, while these algorithms do not strike a balance between model complexity and fit to the data in the traditional sense, fundamentally, they still minimize the empirical risk as a proxy to test performance. To summarize, in the classical and modern regimes alike, learning methods are thought of as minimizing some combination of the training error and generalization gap, with reasoning that relies in one way or another on the following trivial, yet arguably most profound, bound:

$$\text{test error} \;\le\; \text{train error} + |\text{generalization gap}|. \tag{1}$$

In this work, we focus on stochastic gradient descent (SGD), the canonical algorithm for training machine learning models nowadays, and ask whether its generalization performance can be understood through a similar lens. We consider the fundamental stochastic convex optimization (SCO) framework, in which it is well known that SGD minimizes the population risk at a rate of $O(1/\sqrt{n})$ [23]. Remarkably, the classical analysis targets the population risk directly, and in contrast with other generalization arguments, at least seemingly does not rely on the above bound. This highlights an intriguing question:

*Are these quantities, so fundamental to learning theory, relevant to the way that SGD works? Put differently, is it possible to provide a more "conventional" analysis of SGD that conforms with (1)?*

Our main result shows that, perhaps surprisingly, there exist convex learning problems where the above bound becomes vacuous for SGD: namely, SGD minimizes the population risk, but at the same time, it does not minimize the empirical risk and thus exhibits a constant generalization gap. This accords neither with the traditional viewpoint nor with that of interpolation, as both recognize the empirical risk as the principal minimization objective. We refer to this phenomenon as benign underfitting: evidently, SGD underfits the training data, but its classical analysis affirms this underfitting to be benign, in the sense that test performance is never compromised as a result.

Our construction presents a learning problem where the output of SGD with step size $\eta$ over $n$ i.i.d. training examples is $\Omega(\eta\sqrt{n})$ sub-optimal w.r.t. the best fit possible, and consequently has a generalization gap of the same order. Notably, with the standard step size choice of $1/\sqrt{n}$, necessary to ensure the population risk converges at the optimal rate, this lower bound amounts to a constant. Many previously plausible explanations for the generalization properties of this algorithm are thereby rendered inadequate, at least in the elementary convex setup we consider here. First, it is clear that SGD cannot be framed as any reasonable regularized empirical risk minimization procedure, for the simple reason that it does not minimize the empirical risk, which challenges the implicit regularization viewpoint to the generalization of SGD. Second, any attempt to explain generalization of SGD by uniform convergence over any (possibly data-dependent) hypotheses set cannot hold, simply because the sample average associated with the very same training set SGD was trained on is not necessarily close to its respective expectation.
Finally, as it turns out, SGD provides a strikingly natural example of an algorithm that generalizes well but is not stable in any sense, as the most general notion of algorithmic stability is entirely equivalent to the generalization gap [32].

We then move on to study the generalization gap and empirical risk guarantees of SGD in a broader context. We study the case of non-convex and strongly convex component functions, and present natural extensions of our basic result. In addition, we analyze the variant of SGD where datapoints are sampled with replacement from the training set, in which case the train error is of course low, but, perhaps surprisingly, the population risk is well behaved. Finally, we make the natural connection to the study of without-replacement SGD for empirical risk minimization, and derive upper and lower bounds for the multi-epoch regime. These last two points are discussed in further detail in the following.

**With vs. without-replacement SGD.** We may view one-pass SGD as processing the data via without-replacement sampling from the training set, as randomly reshuffling the examples does not change their unconditional distribution. Thus, it is interesting to consider the generalization gap of the closely related algorithm given by running SGD over examples sampled with replacement from the training set. Considering the instability of SGD for non-smooth losses (see the supplementary for a detailed discussion) and the fact that this variant targets the empirical objective, a priori it would seem this algorithm would overfit the training set and not provide strong population risk guarantees. Surprisingly, our analysis presented in Section 4 reveals this is not the case, and that with a certain iterate averaging scheme the population risk converges at the optimal rate. Consequently, it turns out the generalization gap is well bounded, and therefore this variant constitutes a natural learning rule that is not stable in any sense but the most general one.

**Without-replacement SGD for empirical risk minimization.** The example featured in our main construction implies a lower bound of $\Omega(n^{-1/4})$ on the convergence rate of a single epoch of without-replacement SGD for finite-sum optimization problems. In this setting, we have a set of $n$ convex losses and we wish to minimize their sum by running SGD over random shufflings of the losses. While the smooth case has been studied extensively (e.g., [28, 27, 20, 31]), the non-smooth case has hardly received much attention. In Section 5 we extend our basic construction to a lower bound for the multi-epoch regime, and complement it with nearly matching upper bounds.

**Our techniques.** Fundamentally, we exploit the fact that dimension-independent uniform convergence does not hold in SCO [32]. This is a prerequisite to any attempt at separating train and test losses of any hypothesis vector, let alone that produced by SGD. Another essential condition is the instability of SGD for non-smooth losses, as any form of stability would immediately imply a generalization gap upper bound, regardless of uniform convergence. Our main lower bound draws inspiration from constructions presented in the works of [7] and [1], both of which rely on instability, the latter also exploiting the failure of uniform convergence. However, neither of these contains the main ideas necessary to provoke the optimization dynamics required in our example.
A crucial ingredient in our construction consists of encoding into the SGD iterate information about previous training examples. This, combined with a careful design of the loss function, gradient oracle and population distribution, allows correlating sub-gradients of independent training examples, and in turn guiding the SGD iterates to ascend the empirical risk.

### 1.1 Summary of main contributions

To summarize, the main contributions of the paper are as follows:

- **One-pass SGD in SCO.** In Section 3, we study the basic SCO setup where the component losses are assumed to be individually convex, and present a construction where the expected empirical risk, and therefore the generalization gap, are both $\Omega(\eta\sqrt{n})$. We also provide extensions of our main construction demonstrating: SCO with non-convex component functions may exhibit cases of benign overfitting, where $\mathbb{E}\big[F(\widehat w) - \widehat F(\widehat w)\big] = \Omega(\eta^2 n)$; in SCO with $\lambda$-strongly convex losses the worst-case generalization gap is $\Omega\big(1/(\lambda\sqrt{n})\big)$ for the standard step size choice.
- **With vs. without-replacement SGD in SCO.** In Section 4, we prove that the variant of SGD where the training examples are processed via sampling with replacement from the training set minimizes the population risk at the optimal rate, and thus enjoys a generalization gap upper bound of $O(1/\sqrt{n})$.
- **Multi-epoch without-replacement SGD.** In Section 5, we study convergence rates of without-replacement SGD for finite-sum convex optimization problems. We prove a lower bound of $\Omega\big(n^{-1/4}K^{-3/4}\big)$ on the optimization error after $K$ epochs over $n$ convex losses, and complement it with upper bounds of $O\big(n^{-1/4}K^{-1/2}\big)$ and $O\big(n^{-1/4}K^{-1/4}\big)$ for the multi-shuffle and single-shuffle SGD variants, respectively.

### 1.2 Additional related work

**Gradient descent, algorithmic stability and generalization.** Closely related to our work is the study of stability properties of SGD. For smooth losses, [14] provide upper bounds on the generalization gap by appealing to uniform stability, yielding an $O(1/\sqrt{n})$ rate for a single epoch of $n$ convex losses and the standard step size choice. In a later work, [7] prove tight rates for uniform stability of SGD in the setting of non-smooth losses, establishing that these scale substantially worse: $\Theta(\eta\sqrt{n})$ for step size $\eta$ and $n$ training examples. Our work shows that in fact the worst-case rate of the generalization gap completely coincides with the uniform stability rate of SGD. A number of works prior to ours studied the extent to which SGD can be explained by implicit regularization in SCO. [16] study the setup where losses are smooth but only required to be convex in expectation, and show SGD may successfully learn when regularized ERM does not. Prior to their work, [11] also rule out a wide range of implicit-regularization-based explanations of SGD in the basic SCO setup with convex losses. On a more general level, our work is related to the study of stability and generalization in modern learning theory, pioneered by [9, 32]. In particular, the failure of (dimension-independent) uniform convergence in SCO was established in [32]. The work of [13] improves the dimension dependence in the construction of [32] from exponential to linear in the number of training examples. Notably, the construction featured in our main result requires the dimension to be exponential in the sample size; however, the techniques of [13] do not readily extend to our setting. Thus, the optimal dimension dependence for a generalization gap lower bound is left for future work.

**Without-replacement SGD for empirical risk minimization.**
A relatively long line of work studies convergence properties of without-replacement SGD from a pure optimization perspective (e.g., [28, 20, 30, 27, 19, 31]). Nearly all the papers in this line of work adopt the smoothness assumption, with near-optimal bounds established by [20]. An exception is the paper of [33], where an $O(1/\sqrt{nK})$ upper bound is obtained for $n$ datapoints and $K$ epochs, albeit only for generalized linear models over a bounded domain; notably, a setting where uniform convergence holds. Prior to this thread of research, [22] prove a convergence rate of $O(n/\sqrt{K})$ for non-smooth loss functions that applies to any ordering of the losses. To the best of our knowledge, this is also the state-of-the-art result for without-replacement SGD in the non-smooth setting without further assumptions on the loss functions.

**Benign overfitting vs. benign underfitting.** While both benign underfitting and benign overfitting challenge traditional generalization techniques, which postulate the training error to represent the test error, as we discuss above these two phenomena point to very different regimes of learning. In particular, [34] shows that benign overfitting requires distributional assumptions for the interpolating algorithm to succeed. In contrast, we show that benign underfitting happens for SGD in a setting where it provably learns (namely, SCO), without any distributional assumptions. We also point out that Corollary 1 shows benign overfitting cannot happen in the setup we consider, hence the two phenomena seem to arise in different setups.

**Explaining generalization of interpolators.** As already discussed, there is a large recent body of work dedicated to understanding why over-parameterized models trained by SGD to zero training error generalize well [6, 8, and references therein]. In particular, the work of [5] aims at explaining this phenomenon for high-dimensional linear models. Some recent papers investigate limitations of certain techniques in explaining generalization of interpolating algorithms: [21] show uniform convergence fails to explain generalization of SGD in a setup where the generalization gap is in fact well bounded, thus in sharp contrast to our work; [3] rule out the possibility of a large class of excess risk bounds to explain generalization of minimum-norm interpolants. Unlike our work, they study properties of possible risk bounds when benign overfitting occurs, and thus do not pertain to SGD, which never benignly overfits in SCO.

## 2 Preliminaries

We consider stochastic convex optimization (SCO) specified by a population distribution $\mathcal{Z}$ over a datapoint set $Z$, and a loss function $f : W \times Z \to \mathbb{R}$, where $W \subseteq \mathbb{R}^d$ is convex and compact. We denote

$$F(w) \coloneqq \mathbb{E}_{z\sim\mathcal{Z}}\big[f(w; z)\big] \quad \text{(population loss)}, \qquad \widehat F(w) \coloneqq \frac{1}{n}\sum_{i=1}^{n} f(w; z_i) \quad \text{(empirical loss)},$$

where $\{z_1, \ldots, z_n\} \subseteq Z$ stands for the training set, which we regularly denote by $S$. We let $w^\star \coloneqq \arg\min_{w \in W} F(w)$ denote the population minimizer, and $\widehat w^\star_S \coloneqq \arg\min_{w \in W} \widehat F(w)$ denote the empirical risk minimizer (ERM). The diameter of $W$ is defined by $\max_{x,y \in W}\{\|x - y\|\}$, where $\|\cdot\|$ denotes the Euclidean norm, and $\mathcal{B}^d_0(1) \coloneqq \{x \in \mathbb{R}^d : \|x\| \le 1\}$ denotes the $L_2$ unit ball in $\mathbb{R}^d$. Given a training set $S = \{z_1, \ldots, z_n\} \sim \mathcal{Z}^n$ and a learning algorithm that outputs a hypothesis $\widehat w_S$, we define the generalization gap to be the absolute value of the expected difference between test and train losses:

$$\Big|\, \mathbb{E}_{S \sim \mathcal{Z}^n}\big[ F(\widehat w_S) - \widehat F(\widehat w_S) \big] \,\Big|. \quad \text{(generalization gap)}$$

Throughout most of the paper, we consider one-pass projected SGD over $S$: initialize at $w_1 \in W$; for $t = 1, 2, \ldots, n$:

$$w_{t+1} = \Pi_W\big(w_t - \eta g_t\big), \quad \text{with } g_t \in \partial f(w_t; z_t),$$

where $\partial f(w; z)$ denotes the set of sub-gradients of $f(\,\cdot\,; z)$ at the point $w \in W$, and $\Pi_W : \mathbb{R}^d \to W$ denotes the projection onto $W$.
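To fix ideas, here is a minimal sketch of the setup and of one-pass projected SGD in Python/NumPy. It is illustrative only: the loss $f(w; z) = |w \cdot z|$, the data distribution, and all function names are placeholder assumptions, not the constructions studied in this paper.

```python
import numpy as np

def project_ball(w, radius=1.0):
    # Euclidean projection onto the ball B_0^d(radius).
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def one_pass_sgd(S, subgrad, eta, d):
    """One-pass projected SGD: each training example z_t is used exactly once.

    S       -- list of n datapoints z_1, ..., z_n (i.i.d. from the population)
    subgrad -- oracle returning some g in the sub-differential of f(.; z) at w
    eta     -- step size (the standard choice is eta = 1/sqrt(n))
    """
    w = np.zeros(d)                        # w_1 = 0
    iterates = [w]
    for z in S:                            # t = 1, ..., n
        g = subgrad(w, z)
        w = project_ball(w - eta * g)
        iterates.append(w)
    return np.mean(iterates, axis=0)       # average (or a suffix average) of iterates

# Toy instance with a placeholder convex loss f(w; z) = |w . z|:
if __name__ == "__main__":
    d, n = 10, 100
    rng = np.random.default_rng(0)
    f = lambda w, z: abs(w @ z)
    subgrad = lambda w, z: np.sign(w @ z) * z   # valid sub-gradient of |w . z|
    S = [rng.standard_normal(d) for _ in range(n)]
    w_hat = one_pass_sgd(S, subgrad, eta=1.0 / np.sqrt(n), d=d)
    train = np.mean([f(w_hat, z) for z in S])                    # empirical risk
    test = np.mean([f(w_hat, rng.standard_normal(d)) for _ in range(10_000)])
    print(f"train risk: {train:.4f}  test risk (estimate): {test:.4f}")
```

The gap between the two printed quantities is exactly the (empirical counterpart of the) generalization gap defined above; the point of Section 3 is that for a carefully constructed $f$ and $\mathcal{Z}$ this gap can be $\Omega(1)$ even while the test risk is near optimal.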
## 3 A generalization gap lower bound for SGD

In this section, we establish our main result: that there exist convex learning problems where SGD incurs a large optimization error, and therefore also a large generalization gap. When losses are convex these two quantities are closely related, since in expectation the empirical risk minimizer cannot significantly outperform the population minimizer (a claim that will be made rigorous shortly after our main theorem). Our construction builds on losses that are highly non-smooth, leading to SGD taking gradient steps that actually ascend the empirical objective.

**Theorem 1.** Let $n \in \mathbb{N}$, $n \ge 4$, $d \ge 2^{4n}\log n$, and $W = \mathcal{B}^{2d}_0(1)$. Then there exists a distribution over an instance set $Z$ and a 4-Lipschitz convex loss function $f : W \times Z \to \mathbb{R}$ such that running SGD initialized at $w_1 = 0$, with step size $\eta > 0$ over $S \sim \mathcal{Z}^n$, yields:

(i) a large optimization error: $\mathbb{E}\big[\widehat F(\widehat w_S) - \widehat F(\widehat w^\star_S)\big] = \Omega\big(\min\{\eta\sqrt{n},\ 1/(\eta\sqrt{n})\}\big)$;

(ii) a large generalization gap: $\mathbb{E}\big[\widehat F(\widehat w_S) - F(\widehat w_S)\big] = \Omega\big(\min\{\eta\sqrt{n},\ 1/(\eta\sqrt{n})\}\big)$;

where $\widehat w_S$ is any suffix average of the iterates. In particular, for $\eta = \Theta(1/\sqrt{n})$, the population risk is $\mathbb{E}[F(\widehat w_S) - F(w^\star)] = O(1/\sqrt{n})$, while the generalization gap and training error are both $\Omega(1)$.

A detailed proof of Theorem 1 is deferred to the supplementary; in the following we provide an informal overview containing its principal ingredients.

**Proof sketch.** Let $Z \coloneqq \{0, 1\}^d$, and consider a population distribution $\mathcal{Z}$ such that $z(i) = 1$ with probability $\delta$, independently for each coordinate. We will use a loss function of the form

$$f(w; z) \coloneqq \|z \odot w\| + \varphi(w; z),$$

where $\odot$ denotes the element-wise product. The high-level idea is that the norm component penalizes $w$'s that correlate with the given sample point $z$, and the $\varphi$ function (the details of which are left for the supplementary) is tailored so that it drives the SGD iterates precisely to those areas in the $L_2$ ball where they correlate with the training set $\{z_1, \ldots, z_n\}$. In addition, the choice of parameters is such that the population loss is approximately zero over the entire domain. Taking $d$ sufficiently large compared to $\delta^{-1}$, we ensure that w.h.p., for every round $t \in [n]$ there exist many coordinates $i \in [d]$ with a prefix of ones: $z_1(i) = \cdots = z_{t-1}(i) = 1$. With $\delta$ chosen sufficiently small compared to $n$, we ensure that as long as $i \in [d]$ is any coordinate chosen independently of $\{z_{t+1}, \ldots, z_n\}$, w.h.p. this coordinate will have a suffix of zeros: $z_{t+1}(i) = \cdots = z_n(i) = 0$.

Our goal is to make SGD take steps $w_{t+1} = w_t - \eta e_{i_t}$ (where $e_i$ denotes the $i$-th standard basis vector), where $i_t \in [d]$ is a coordinate with the aforementioned property of having a prefix of ones followed by a suffix of zeros. Note that since these steps are taken after the prefix of ones has ended, they will inflict large empirical loss from the norm component, but will not be corrected by future steps, owing to the suffix of zeros. To achieve this, we design $\varphi$ so that it encodes the relevant information into the SGD iterates. Specifically, $\varphi$ flags (using some extra dimensions) all coordinates $i \in [d]$ where a prefix of ones has been encountered. In addition, using another max component in $\varphi$ we have that for all such coordinates $i$, $e_i \in \partial f(w_t; z)$ for any example $z$ (as this component in the loss depends only on the iterate $w_t$). In particular, we get that $e_i \in \partial f(w_t; z_t)$. Then, our gradient oracle just returns a subgradient pointing towards one of these coordinates (for convenience, we use the minimal one), which we denote by $i_t$, and SGD makes the desired step. Notably, the coordinate $i_t$ chosen by the subgradient oracle is independent of future examples, and therefore will have a suffix of zeros w.h.p. Hence, as mentioned, this ensures no gradient signal after round $t$ will be able to correct the empirical risk ascent on $i_t$.

Concluding, we have that for the final iterate $\widehat w \coloneqq w_{n+1}$, we get $\widehat w(i_t) = -\eta$ for all $t \in [n]$, and therefore

$$\frac{1}{n}\sum_{i=1}^{n} f(\widehat w; z_i) \;\ge\; \frac{1}{n}\sum_{i=1}^{n} \big\|z_i \odot \widehat w\big\| \;=\; \Omega(\eta\sqrt{n}).$$

A similar argument, requiring a few more technical steps, shows the same is true for any suffix average $\widehat w$. Noting that $\widehat F(0) = 0$, we get that the optimization error is $\Omega(\eta\sqrt{n})$. The implication for the generalization gap follows immediately with the standard step size choice of $\eta = 1/\sqrt{n}$, owing to SGD's population risk convergence guarantee. For an arbitrary step size, the result follows from a simple computation, and the proof is concluded.
The magnitude of the generalization gap featured in Theorem 1 stems from the large optimization error, which results in the empirical risk over-estimating the population risk by a large margin. Evidently, for convex losses the converse is always false: the empirical risk will never significantly under-estimate the population risk (a fact that turns out to be false when losses are only required to be convex in expectation; see Section 3.1). Indeed, stability of the regularized ERM solution implies the ERM does not perform significantly better on the training set compared to the population minimizer $w^\star$.

**Lemma 1.** Let $W \subseteq \mathbb{R}^d$ with diameter $D$, $\mathcal{Z}$ any distribution over $Z$, and $f : W \times Z \to \mathbb{R}$ convex and $G$-Lipschitz in the first argument. Then

$$\mathbb{E}\big[\widehat F(w^\star) - \widehat F(\widehat w^\star_S)\big] \le \frac{4GD}{\sqrt{n}}.$$

*Proof.* Denote the regularized ERM by $\widehat w^\lambda_S \coloneqq \arg\min_{w \in W}\big\{\frac{1}{n}\sum_{i=1}^{n} f(w; z_i) + \frac{\lambda}{2}\|w\|^2\big\}$. Observe,

$$F(w^\star) \;\le\; \mathbb{E}\, F(\widehat w^\lambda_S) \;\le\; \mathbb{E}\,\widehat F(\widehat w^\lambda_S) + \frac{4G^2}{\lambda n} \;\le\; \mathbb{E}\,\widehat F(\widehat w^\star_S) + \frac{\lambda D^2}{2} + \frac{4G^2}{\lambda n},$$

where the second inequality follows from stability of the regularized ERM (see Lemma 13). Choosing $\lambda \coloneqq 2G/(D\sqrt{n})$, we get that $\mathbb{E}\big[\widehat F(w^\star) - \widehat F(\widehat w^\star_S)\big] = F(w^\star) - \mathbb{E}\,\widehat F(\widehat w^\star_S) \le 4GD/\sqrt{n}$, as claimed. ∎

Since the optimization error is always nonnegative, we see that the upper bound given by Lemma 1 implies an upper bound on the difference between the population and empirical risks.

**Corollary 1.** For any distribution $\mathcal{Z}$ over $Z$ and Lipschitz loss function $f : W \times Z \to \mathbb{R}$ convex in the first argument, running SGD with step size $\eta \coloneqq 1/\sqrt{n}$ guarantees $\mathbb{E}\big[F(\widehat w_S) - \widehat F(\widehat w_S)\big] \le O(1/\sqrt{n})$.

*Proof.* We have

$$\mathbb{E}\big[F(\widehat w_S) - \widehat F(\widehat w_S)\big] = \mathbb{E}\big[F(\widehat w_S) - F(w^\star)\big] + \mathbb{E}\big[\widehat F(w^\star) - \widehat F(\widehat w_S)\big].$$

The population error term on the RHS is $O(1/\sqrt{n})$ by the classical analysis of SGD. The second term is bounded via Lemma 1: $\mathbb{E}\big[\widehat F(w^\star) - \widehat F(\widehat w_S)\big] \le \mathbb{E}\big[\widehat F(w^\star) - \widehat F(\widehat w^\star_S)\big] \le 4GD/\sqrt{n}$, and the result follows. ∎

In the subsections that follow we continue to study the generalization gap in the context of common variants of the basic SCO setup.

### 3.1 SCO with non-convex components

When we relax the convexity assumption and only require the losses to be convex in expectation, we can construct a learning problem where SGD exhibits a case of benign overfitting. In contrast to Theorem 1, here we actually drive the SGD iterates towards an ERM solution, thus achieving a low optimization error and an empirical risk that under-estimates the population risk.

**Theorem 2.** Let $n \in \mathbb{N}$, $n \ge 4$, $d \ge 2^{4n}\log n$, $W = \mathcal{B}^{2d}_0(1)$, and $\eta \le 1/\sqrt{n}$.
Then there exists a distribution $\mathcal{Z}$ over $Z$ and a 4-Lipschitz loss $f : W \times Z \to \mathbb{R}$, where $\mathbb{E}_{z\sim\mathcal{Z}}\, f(w; z)$ is convex in $w$, such that for any suffix average $\widehat w$ of SGD initialized at $w_1 = 0$ with step size $\eta$:

$$\mathbb{E}\big[F(\widehat w_S) - \widehat F(\widehat w_S)\big] = \Omega(\eta^2 n).$$

The construction and proof of Theorem 2, given in the supplementary, follow a methodology similar to that of Theorem 1. Here, however, we exploit non-convex losses to form an empirical loss landscape where the ERM solution significantly outperforms the population minimizer $w^\star$ (notably, a feat not possible when losses are individually convex, by Corollary 1). Our loss function is defined by $f(w; z) \coloneqq \sum_{i=1}^{d} z(i)\,w(i)^2 + \varphi(w; z)$, with each component playing a similar role as before. We work with the distribution over $z \in \{0, \pm 1\}^d$ where $z(i) = 1$ w.p. $\delta$, $z(i) = -1$ w.p. $\delta$, and $z(i) = 0$ w.p. $1 - 2\delta$. The intuition is that coordinates accumulating many $-1$'s offer regions in the $L_2$ ball where the empirical risk is too good compared to the population risk. We tailor the extra dimensions and $\varphi$ in coordination with the $-1$ values so that the sub-gradients guide the SGD iterates towards these regions, in exactly the same manner the construction of Theorem 1 drives the iterates to high-loss regions. We note that while the statement of Theorem 2 is specialized to step sizes smaller than $1/\sqrt{n}$, it may be extended to any step size using arguments similar to those given in the proof of Theorem 1.

### 3.2 SCO with strongly convex components

Our basic construction extends to the strongly convex case with only technical modifications to Theorem 1. The theorem below concerns the standard step size choice for strongly convex objectives. We provide its proof in the supplementary.

**Theorem 3.** Let $n \in \mathbb{N}$, $n \ge 10$, $d \ge 2^{4n}\log n$, $W = \mathcal{B}^{2d}_0(1)$, and $\lambda \ge 1/\sqrt{n}$. Then there exists a distribution over an instance set $Z$ and a 4-Lipschitz, $\lambda$-strongly convex loss function $f : W \times Z \to \mathbb{R}$ such that:

(i) the optimization error is large: $\mathbb{E}_{S\sim\mathcal{Z}^n}\big[\widehat F(\widehat w_S) - \widehat F(\widehat w^\star_S)\big] = \Omega\big(\frac{1}{\lambda\sqrt{n}}\big)$;

(ii) the generalization gap is large: $\mathbb{E}_{S\sim\mathcal{Z}^n}\big[\widehat F(\widehat w_S) - F(\widehat w_S)\big] = \Omega\big(\frac{1}{\lambda\sqrt{n}}\big)$;

where $\widehat w_S$ is any suffix average of SGD initialized at $w_1 = 0$, with step size schedule $\eta_t = 1/(\lambda t)$. Furthermore, the problem instance where this occurs is precisely the $\lambda$-regularized version of the example featured in Theorem 1.

We note that an immediate implication of the above theorem is that if we seek a generalization gap upper bound for a weakly convex problem by means of regularization (meaning, by running SGD on a regularized problem), we would have to take $\lambda \ge 1$ to guarantee a gap of $O(1/\sqrt{n})$. To see this, note that the generalization gap (of any hypothesis) on the regularized problem is the same as that on the original. On the other hand, taking $\lambda \ge 1$ will of course be detrimental to the population error guarantee. Hence, one cannot circumvent the generalization gap lower bound by regularization without compromising the population error.

We conclude this section with a note regarding stability rates of SGD in non-smooth SCO. Implicit in Theorem 1 is that the average stability of SGD coincides with the tight uniform stability rate of $\Theta(\eta\sqrt{n})$ established by [7]. This is because Theorem 1 provides an $\Omega(\eta\sqrt{n})$ lower bound on the most general stability notion, which is precisely the generalization gap [32]. We refer the reader to the supplementary for a more elaborate discussion.

## 4 SGD with vs. without replacement

In this section, we consider a different algorithm in the context of the basic SCO setup: SGD over examples drawn with replacement from the training set.
This is not to be confused with one-pass SGD discussed in Section 3, which corresponds to without-replacement SGD on the training set, or alternatively with-replacement SGD over the population distribution. Given a training set $S = \{z_1, \ldots, z_n\} \sim \mathcal{Z}^n$, we define with-replacement projected SGD initialized at $w_1 \in W$ by

$$w_{t+1} = \Pi_W\big(w_t - \eta\,\widehat g_t\big), \quad \text{where } \widehat g_t \in \partial f(w_t; \widehat z_t) \text{ and } \widehat z_t \sim \mathrm{Unif}(S).$$

Perhaps surprisingly, this version of SGD does not overfit the training data; our theorem below establishes that with proper iterate averaging, the population risk converges at the optimal rate.

**Theorem 4.** Let $W \subseteq \mathbb{R}^d$ with diameter $D$, $\mathcal{Z}$ be any distribution over $Z$, and $f : W \times Z \to \mathbb{R}$ be convex and $G$-Lipschitz in the first argument. Let $S \sim \mathcal{Z}^n$ be a training set of $n \in \mathbb{N}$ datapoints drawn i.i.d. from $\mathcal{Z}$, and consider running SGD over training examples sampled with replacement, uniformly and independently from $S$. Then, for step size $\eta = D/(G\sqrt{n})$ and $\bar w \coloneqq \frac{2}{n+1}\sum_{t=1}^{n} \frac{n-t+1}{n}\, w_t$, the following upper bound holds:

$$\mathbb{E}\big[F(\bar w)\big] - F(w^\star) \le \frac{10GD}{\sqrt{n}}.$$

*Proof.* Fix a time-step $t \in [n]$, and observe that if we don't condition on $S$, we may view the random datapoint $\widehat z_t$ as a mixture between a fresh i.i.d. sample from the population and a uniformly distributed sample from the previously processed datapoints $\widehat S_{t-1} \coloneqq \{\widehat z_1, \ldots, \widehat z_{t-1}\}$:

$$\widehat z_t \mid \widehat S_{t-1} \;=\; \begin{cases} z \sim \mathcal{Z} & \text{w.p. } 1 - \frac{t-1}{n}, \\[2pt] z \sim \mathrm{Unif}(\widehat S_{t-1}) & \text{w.p. } \frac{t-1}{n}. \end{cases}$$

With this in mind, denote $\widehat f_t(w) \coloneqq f(w; \widehat z_t)$, fix $\widehat S_{t-1}$, and observe:

$$\mathbb{E}_{\widehat z_t}\big[\widehat f_t(w_t) - \widehat f_t(w^\star) \mid \widehat S_{t-1}\big] = \Big(1 - \frac{t-1}{n}\Big)\,\mathbb{E}_{z\sim\mathcal{Z}}\big[f(w_t; z) - f(w^\star; z)\big] + \frac{1}{n}\sum_{i=1}^{t-1}\big(\widehat f_i(w_t) - \widehat f_i(w^\star)\big).$$

Rearranging and taking expectation with respect to $\widehat S_{t-1}$, we obtain

$$\Big(1 - \frac{t-1}{n}\Big)\,\mathbb{E}\big[f(w_t; z) - f(w^\star; z)\big] = \mathbb{E}\big[\widehat f_t(w_t) - \widehat f_t(w^\star)\big] + \frac{t-1}{n}\,\mathbb{E}\Big[\frac{1}{t-1}\sum_{i=1}^{t-1}\big(\widehat f_i(w^\star) - \widehat f_i(w_t)\big)\Big] \le \mathbb{E}\big[\widehat f_t(w_t) - \widehat f_t(w^\star)\big] + \frac{4GD\sqrt{t-1}}{n}, \tag{2}$$

where the inequality follows from Lemma 1. Now, by a direct computation we have $\sum_{t=1}^{n}\big(1 - \frac{t-1}{n}\big) = \frac{n+1}{2}$, which motivates setting $\bar w \coloneqq \frac{2}{n+1}\sum_{t=1}^{n} \frac{n-t+1}{n}\, w_t$. By convexity of $F$, Eq. (2), and the standard regret analysis of gradient descent [e.g., 15], we now have

$$\mathbb{E}\big[F(\bar w) - F(w^\star)\big] \le \frac{2}{n+1}\sum_{t=1}^{n}\Big(1 - \frac{t-1}{n}\Big)\,\mathbb{E}\big[F(w_t) - F(w^\star)\big] \le \frac{2}{n+1}\sum_{t=1}^{n}\mathbb{E}\big[\widehat f_t(w_t) - \widehat f_t(w^\star)\big] + \frac{2}{n+1}\sum_{t=1}^{n}\frac{4GD\sqrt{t-1}}{n} \le \frac{10GD}{\sqrt{n}},$$

where the last inequality follows from our choice of $\eta = D/(G\sqrt{n})$. ∎

Evidently, the averaging scheme dictated by Theorem 4 does little to hurt the empirical risk convergence guarantee, which follows from the standard analysis with minor modifications (for completeness, we provide a formal statement and proof in the supplementary). Combined with Lemma 1, this immediately implies a generalization gap upper bound for with-replacement SGD. Notably, this shows with-replacement SGD provides an example of a (natural) algorithm in the SCO learning setup that is not even stable on average, but nonetheless has a well-bounded generalization gap. We refer the reader to the discussion in the supplementary for more details.

**Corollary 2.** For any distribution $\mathcal{Z}$ and loss function $f : W \times Z \to \mathbb{R}$ convex and Lipschitz in the first argument, running SGD with step size and averaging as specified in Theorem 4 ensures $\mathbb{E}\big[F(\bar w) - \widehat F(\bar w)\big] \le O(1/\sqrt{n})$.

*Proof.* We have

$$\mathbb{E}\big[F(\bar w) - \widehat F(\bar w)\big] \le \mathbb{E}\big[F(\bar w) - F(w^\star)\big] + \mathbb{E}\big[\widehat F(w^\star) - \widehat F(\widehat w^\star_S)\big] + \mathbb{E}\big[\widehat F(\widehat w^\star_S) - \widehat F(\bar w)\big].$$

The first term is upper bounded by the convergence of the population risk provided by Theorem 4, the second by Lemma 1, and the third by the standard analysis of SGD (see the supplementary). ∎
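For concreteness, the following is a minimal sketch (ours, not the paper's) of with-replacement SGD with the tilted averaging scheme of Theorem 4. The name `with_replacement_sgd` and the sub-gradient oracle are illustrative placeholders; `project_ball` is the same helper as in the earlier sketch, repeated for self-containment.

```python
import numpy as np

def project_ball(w, radius=1.0):
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def with_replacement_sgd(S, subgrad, eta, d, seed=0):
    """With-replacement SGD over a fixed training set S, returning the tilted
    average of Theorem 4: w_bar = (2/(n+1)) * sum_t ((n - t + 1)/n) * w_t.
    """
    rng = np.random.default_rng(seed)
    n = len(S)
    w = np.zeros(d)                       # w_1 = 0 (any point of W works)
    w_bar = np.zeros(d)
    for t in range(1, n + 1):
        w_bar += (2.0 / (n + 1)) * ((n - t + 1) / n) * w  # weight of iterate w_t
        z = S[rng.integers(n)]            # z_t ~ Unif(S): sampled with replacement
        g = subgrad(w, z)
        w = project_ball(w - eta * g)
    return w_bar
```

The design choice mirrors the proof: the weight of $w_t$ is proportional to $1 - \frac{t-1}{n}$, the probability that step $t$ sees an effectively fresh sample, so iterates that increasingly reuse training points are down-weighted; the factor $2/(n+1)$ simply normalizes the weights to sum to one, since $\sum_{t=1}^{n}\frac{n-t+1}{n} = \frac{n+1}{2}$.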
## 5 Multi-epoch SGD for empirical risk minimization

In this section, we forgo the existence of a population distribution and discuss convergence properties of without-replacement SGD (wor-SGD) for finite-sum optimization problems. A relatively long line of work, discussed in the introduction, studies this problem in the smooth case. The work of [20] noted smoothness is a necessary assumption to obtain rates that are strictly better than the $O(1/\sqrt{nK})$ guaranteed by with-replacement SGD for $n$ losses and $K$ epochs, due to a lower bound that follows from the deterministic case (e.g., [10]). Here we establish that smoothness is in fact necessary to obtain rates that are not strictly worse than with-replacement SGD. We consider running multiple passes of wor-SGD to solve the finite-sum optimization problem given by the objective

$$F(w) = \frac{1}{n}\sum_{t=1}^{n} f(w; t), \tag{3}$$

where $\{f(w; t)\}_{t=1}^{n}$ is a set of $n$ convex, $G$-Lipschitz losses defined over a convex and compact domain $W \subseteq \mathbb{R}^d$. Throughout this section we let $w^\star \coloneqq \arg\min_{w \in W} F(w)$ denote the minimizer of the objective in Eq. (3). In every epoch $k \in [K]$ we process the losses in the order specified by a permutation $\pi_k : [n] \to [n]$ sampled uniformly at random, either once at the beginning of the algorithm (single-shuffle), or at the onset of every epoch (multi-shuffle). Multi-epoch wor-SGD initialized at $w^1_1 \in W$ is specified by the following equations:

$$w^k_{t+1} = \Pi_W\big(w^k_t - \eta g^k_t\big), \quad \text{where } g^k_t \in \partial f^k_t(w^k_t); \qquad w^{k+1}_1 \coloneqq w^k_{n+1},$$

where we denote $f^k_t(w) \coloneqq f(w; \pi_k(t))$.

A near-immediate implication of Theorem 1 is that there exists a set of convex losses on which a single epoch of wor-SGD cannot converge at a rate faster than $1/n^{1/4}$. Theorem 5 presented below extends our basic construction from Theorem 1 to accommodate multiple epochs. The main challenge here is in devising a mechanism that will allow fresh bad gradient steps to take place in every new epoch.

**Theorem 5.** Let $n, K \in \mathbb{N}$, $K \ge 4$, $n \ge 4$, $c \coloneqq 4/(2^{1/K} - 1)$, $d \ge 2^{6n}\log(cnK)$, and $W = \mathcal{B}^{d'}_0(1)$ where $d' = (nK+1)d$. Then there exists a set of $n$ convex, 4-Lipschitz losses such that after $K$ epochs of either multi-shuffle or single-shuffle SGD initialized at $w^1_1 = 0$ with step size $\eta \le 1/\sqrt{2nK}$, it holds that

$$\mathbb{E}\big[F(\widehat w) - F(w^\star)\big] = \Omega\bigg(\min\Big\{1,\ \eta\sqrt{n/J}\Big\} + \frac{1}{\eta n K} + \eta\bigg),$$

where $\widehat w$ is any suffix average of the last $J$ epochs. In particular, we obtain a bound of $\Omega\big(n^{-1/4}K^{-3/4}\big)$ for any suffix average and any choice of $\eta$.

The proof of Theorem 5 is provided in the supplementary. The construction in the proof takes the idea that the training set can be encoded in the SGD iterate to the extreme. The loss function and gradient oracle are designed in such a way as to record the training examples in their full form and order into the iterate. We then exploit this encoded information with an adversarial gradient oracle that returns the bad sub-gradients on each gradient step in every new epoch.

Next, we complement Theorem 5 with an upper bound that builds on stability arguments similar to those of the smooth case [20]. Importantly though, the lack of smoothness means worse stability rates, and necessitates extra care in the technical arguments. Below, we prove the multi-shuffle case, and defer the full details of the single-shuffle case to the supplementary.

**Theorem 6.** Let $S = \{f(w; t)\}_{t=1}^{n}$ be a set of $n$ convex, $G$-Lipschitz losses over a convex and compact domain $W \subseteq \mathbb{R}^d$ of diameter $D$, and consider running $K \ge 1$ epochs of wor-SGD over $S$. Then, we have the following guarantees:

(i) For multi-shuffle, with step size $\eta = D/(G n^{3/4} K^{1/2})$, we have $\mathbb{E}\big[F(\widehat w) - F(w^\star)\big] \le \frac{3GD}{n^{1/4}K^{1/2}}$.

(ii) For single-shuffle, with step size $\eta = D/(2G n^{3/4} K^{3/4})$ and assuming $K \le n$, we have $\mathbb{E}\big[F(\widehat w) - F(w^\star)\big] \le \frac{10GD}{n^{1/4}K^{1/4}}$.

In both of the above bounds, $\widehat w = \frac{1}{nK}\sum_{k\in[K],\, t\in[n]} w^k_t$, and the expectation is over the random permutations of the losses.
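Before turning to the proof, the following sketch (again illustrative, with a placeholder sub-gradient oracle `losses_subgrad` and the projection helper from the earlier sketches) spells out the multi-shuffle vs. single-shuffle distinction and the averaging used in Theorem 6.

```python
import numpy as np

def project_ball(w, radius=1.0):
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def wor_sgd(losses_subgrad, n, d, K, eta, multi_shuffle=True, seed=0):
    """K epochs of without-replacement SGD over n losses, as in Eq. (3).

    losses_subgrad -- oracle: losses_subgrad(w, t) returns a sub-gradient of
                      f(.; t) at w (placeholder for a concrete finite sum)
    multi_shuffle  -- True: fresh permutation each epoch; False: single-shuffle
    Returns the average iterate w_hat of Theorem 6.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)             # single-shuffle: sampled once
    w = np.zeros(d)                       # w^1_1 = 0
    w_sum = np.zeros(d)
    for k in range(K):
        if multi_shuffle:
            perm = rng.permutation(n)     # multi-shuffle: resampled every epoch
        for t in perm:
            w_sum += w                    # accumulate w^k_t for the average
            g = losses_subgrad(w, t)
            w = project_ball(w - eta * g)
    return w_sum / (n * K)

# Step sizes from Theorem 6 (taking D = G = 1 for illustration):
#   multi-shuffle:  eta = 1 / (n**0.75 * K**0.5)
#   single-shuffle: eta = 1 / (2 * n**0.75 * K**0.75)
```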
*Proof (multi-shuffle case).* Observe:

$$F(\widehat w) - F(w^\star) \le \frac{1}{nK}\sum_{k=1}^{K}\sum_{t=1}^{n}\big(F(w^k_t) - F(w^\star)\big) = \frac{1}{nK}\sum_{k=1}^{K}\sum_{t=1}^{n}\big(F(w^k_t) - f^k_t(w^\star)\big)$$
$$= \frac{1}{nK}\sum_{k=1}^{K}\sum_{t=1}^{n}\big(F(w^k_t) - f^k_t(w^k_t)\big) + \frac{1}{nK}\sum_{k=1}^{K}\sum_{t=1}^{n}\big(f^k_t(w^k_t) - f^k_t(w^\star)\big) \le \frac{1}{nK}\sum_{k=1}^{K}\sum_{t=1}^{n}\big(F(w^k_t) - f^k_t(w^k_t)\big) + \frac{D^2}{2\eta nK} + \frac{\eta G^2}{2},$$

with the last inequality following from the standard $nK$-round regret bound for gradient descent [see e.g., 15]. To bound the other term, using Lemma 10 we relate the difference between the without-replacement loss distribution and the full-batch objective to the uniform stability rate of SGD, which may then be bounded by applying Lemma 11:

$$\mathbb{E}\big[F(w^k_t) - f^k_t(w^k_t)\big] = \mathbb{E}_{\pi_1,\ldots,\pi_{k-1}}\Big[\mathbb{E}_{\pi_k}\big[F(w^k_t) - f^k_t(w^k_t) \mid w^k_1\big]\Big] \le \mathbb{E}_{\pi_1,\ldots,\pi_{k-1}}\big[G\,\epsilon^{\mathrm{SGD}}_{\mathrm{stab}}(t-1)\big] = G\,\epsilon^{\mathrm{SGD}}_{\mathrm{stab}}(t-1).$$

Concluding, we have that

$$\mathbb{E}\big[F(\widehat w) - F(w^\star)\big] \le \frac{1}{nK}\sum_{k=1}^{K}\sum_{t=1}^{n}\mathbb{E}\big[F(w^k_t) - f^k_t(w^k_t)\big] + \frac{D^2}{2\eta nK} + \frac{\eta G^2}{2} \le \frac{3GD}{n^{1/4}K^{1/2}},$$

where the last inequality follows from our choice of $\eta = D/(G n^{3/4} K^{1/2})$. ∎
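For intuition on where the step size in part (i) comes from, here is the balancing computation, sketched up to constants under the assumption that the non-smooth uniform stability rate of [7] behaves as $\epsilon^{\mathrm{SGD}}_{\mathrm{stab}}(t) = O(\eta G\sqrt{t})$ (the precise statement is Lemma 11 of the supplementary):

$$\mathbb{E}\big[F(\widehat w) - F(w^\star)\big] \;\lesssim\; \frac{1}{n}\sum_{t=1}^{n} \eta G^2\sqrt{t} \;+\; \frac{D^2}{\eta n K} \;\lesssim\; \eta G^2\sqrt{n} + \frac{D^2}{\eta n K}.$$

Balancing the two terms at $\eta^2 = D^2/(G^2 n^{3/2} K)$ gives $\eta = D/(G n^{3/4} K^{1/2})$ and a rate of $O\big(GD/(n^{1/4} K^{1/2})\big)$, matching the theorem.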
## Acknowledgements and funding disclosure

This work was supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 882396), by the Israel Science Foundation (grants 993/17, 2549/19, 2188/20), by the Len Blavatnik and the Blavatnik Family foundation, by the Yandex Initiative in Machine Learning at Tel Aviv University, by a grant from the Tel Aviv University Center for AI and Data Science (TAD), and by an unrestricted gift from Google. Any opinions, findings, and conclusions or recommendations expressed in this work are those of the author(s) and do not necessarily reflect the views of Google.

## References

[1] I. Amir, T. Koren, and R. Livni. SGD generalizes better than GD (and regularization doesn't help). In Conference on Learning Theory, COLT 2021, volume 134 of Proceedings of Machine Learning Research, pages 63–92. PMLR, 2021.
[2] S. Arora, R. Ge, B. Neyshabur, and Y. Zhang. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning, pages 254–263. PMLR, 2018.
[3] P. L. Bartlett and P. M. Long. Failures of model-dependent generalization bounds for least-norm interpolation. Journal of Machine Learning Research, 22(204):1–15, 2021.
[4] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
[5] P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.
[6] P. L. Bartlett, A. Montanari, and A. Rakhlin. Deep learning: a statistical viewpoint. arXiv preprint arXiv:2103.09177, 2021.
[7] R. Bassily, V. Feldman, C. Guzmán, and K. Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. Advances in Neural Information Processing Systems, 33, 2020.
[8] M. Belkin. Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. arXiv preprint arXiv:2105.14368, 2021.
[9] O. Bousquet and A. Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.
[10] S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
[11] A. Dauber, M. Feder, T. Koren, and R. Livni. Can implicit bias explain generalization? Stochastic convex optimization as a case study. Advances in Neural Information Processing Systems, 33, 2020.
[12] G. K. Dziugaite and D. M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, 2017.
[13] V. Feldman. Generalization of ERM in stochastic convex optimization: The dimension strikes back. In Advances in Neural Information Processing Systems, volume 29, 2016.
[14] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234. PMLR, 2016.
[15] E. Hazan. Introduction to online convex optimization. arXiv preprint arXiv:1909.05207, 2019.
[16] S. Kale, A. Sekhari, and K. Sridharan. SGD: The role of implicit regularization, batch-size and multiple-epochs. arXiv preprint arXiv:2107.05074, 2021.
[17] N. Littlestone and M. Warmuth. Relating data compression and learnability, 1986.
[18] D. A. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 164–170, 1999.
[19] K. Mishchenko, A. Khaled Ragab Bayoumi, and P. Richtárik. Random reshuffling: Simple analysis with vast improvements. Advances in Neural Information Processing Systems, 33, 2020.
[20] D. Nagaraj, P. Jain, and P. Netrapalli. SGD without replacement: Sharper rates for general smooth convex functions. In International Conference on Machine Learning, pages 4703–4711. PMLR, 2019.
[21] V. Nagarajan and J. Z. Kolter. Uniform convergence may be unable to explain generalization in deep learning. Advances in Neural Information Processing Systems, 32, 2019.
[22] A. Nedić and D. Bertsekas. Convergence rate of incremental subgradient algorithms. In Stochastic Optimization: Algorithms and Applications, pages 223–264. Springer, 2001.
[23] A. S. Nemirovskij and D. B. Yudin. Problem complexity and method efficiency in optimization, 1983.
[24] G. Neu. Information-theoretic generalization bounds for stochastic gradient descent. arXiv preprint arXiv:2102.00931, 2021.
[25] B. Neyshabur, R. Tomioka, and N. Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.
[26] B. Neyshabur, Z. Li, S. Bhojanapalli, Y. LeCun, and N. Srebro. The role of over-parametrization in generalization of neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[27] S. Rajput, A. Gupta, and D. Papailiopoulos. Closing the convergence gap of SGD without replacement. In International Conference on Machine Learning, pages 7964–7973. PMLR, 2020.
[28] B. Recht and C. Ré. Toward a noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences. In Conference on Learning Theory. JMLR Workshop and Conference Proceedings, 2012.
[29] D. Russo and J. Zou. Controlling bias in adaptive data analysis using information theory. In Artificial Intelligence and Statistics, pages 1232–1240. PMLR, 2016.
[30] I. Safran and O. Shamir. How good is SGD with random shuffling? In Conference on Learning Theory, pages 3250–3284. PMLR, 2020.
[31] I. Safran and O. Shamir. Random shuffling beats SGD only after many epochs on ill-conditioned problems. arXiv preprint arXiv:2106.06880, 2021.
[32] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635–2670, 2010.
[33] O. Shamir. Without-replacement sampling for stochastic gradient methods. Advances in Neural Information Processing Systems, 29:46–54, 2016.
[34] O. Shamir. The implicit bias of benign overfitting. arXiv preprint arXiv:2201.11489, 2022.
[35] U. Sherman, T. Koren, and Y. Mansour. Optimal rates for random order online optimization. Advances in Neural Information Processing Systems, 34, 2021.
[36] V. Vapnik. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–281, 1971.
[37] A. Xu and M. Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 2521–2530, 2017.
[38] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, 2017.