Journal of Machine Learning Research 22 (2021) 1-20. Submitted 9/20; Revised 2/21; Published 2/21.

Gradient Methods Never Overfit On Separable Data

Ohad Shamir ohad.shamir@weizmann.ac.il
Department of Computer Science and Applied Mathematics
Weizmann Institute of Science
Rehovot, Israel

Editor: Lorenzo Rosasco

Abstract

A line of recent works established that when training linear predictors over separable data, using gradient methods and exponentially-tailed losses, the predictors asymptotically converge in direction to the max-margin predictor. As a consequence, the predictors asymptotically do not overfit. However, this does not address the question of whether overfitting might occur non-asymptotically, after some bounded number of iterations. In this paper, we formally show that standard gradient methods (in particular, gradient flow, gradient descent and stochastic gradient descent) never overfit on separable data: If we run these methods for T iterations on a dataset of size m, both the empirical risk and the generalization error decrease at an essentially optimal rate of O(1/(γ²T)) up until T ≈ m, at which point the generalization error remains fixed at an essentially optimal level of O(1/(γ²m)), regardless of how large T is. Along the way, we present non-asymptotic bounds on the number of margin violations over the dataset, and prove their tightness.

Keywords: Gradient Methods, Margin-based Generalization Bounds, Implicit Bias, Linear Predictors, Linearly Separable Data

1. Introduction

Motivated by empirical observations in the context of neural networks, there is considerable interest nowadays in studying the implicit bias of learning algorithms. This refers to the fact that even without any explicit regularization or other techniques to avoid overfitting, the dynamics of the learning algorithm itself biases its output towards simple predictors that generalize well.
In this paper, we consider the implicit bias in a well-known and simple setting, namely learning linear predictors (x ↦ x^⊤ w) for binary classification with respect to linearly separable data. In a recent line of works (Soudry et al., 2018; Ji and Telgarsky, 2018b; Nacson et al., 2019a; Ji and Telgarsky, 2019b; Dudik et al., 2020), it was shown that if we attempt to do this by minimizing the empirical risk (average loss) over a dataset, using gradient descent and any exponentially-tailed loss (such as the logistic loss), then the predictor asymptotically converges in direction to the max-margin predictor with respect to the Euclidean norm¹. Since there are standard generalization bounds for predictors which achieve a large margin over the dataset, we get that asymptotically, gradient descent does not overfit, even if we just run it on the empirical risk function without any explicit regularization, and even if the number of iterations T diverges to infinity. In follow-up works, similar results were also obtained for other gradient methods such as stochastic gradient descent and mirror descent (Nacson et al., 2019b; Gunasekar, 2018), and for more complicated predictors such as linear networks, shallow ReLU networks, and linear convolutional networks (Ji and Telgarsky, 2018a, 2019a; Gunasekar et al., 2018). However, in practice the number of iterations T is some bounded finite number. Thus, the asymptotic results above leave open the possibility that for a wide range of values of T, gradient methods do not achieve a good margin, and possibly overfit.

1. Namely, argmax_{w : ‖w‖=1} min_i x_i^⊤ w for a given dataset x_1, x_2, . . . , x_m.

(c) 2021 Ohad Shamir. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/20-997.html.
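As a minimal numerical sketch of this implicit bias (the tiny 2-D dataset below is a hypothetical example, with binary labels folded into the points), one can run plain gradient descent on the exponential loss and watch the margin of the normalized iterate grow towards the max-margin value, without any explicit regularization:

```python
import numpy as np

# Hypothetical toy dataset (labels folded into the x_i); all norms are <= 1.
X = np.array([[0.8, 0.3], [0.7, -0.4], [0.9, 0.1]])

def emp_risk_grad(w):
    # Gradient of (1/m) * sum_i exp(-x_i . w).
    z = X @ w
    return -(np.exp(-z)[:, None] * X).mean(axis=0)

w = np.zeros(2)
eta = 0.5
margins = []
for t in range(1, 5001):
    w -= eta * emp_risk_grad(w)
    if t in (10, 100, 1000, 5000):
        w_hat = w / np.linalg.norm(w)
        margins.append((X @ w_hat).min())  # margin of the *normalized* iterate
```

On this dataset the recorded margins stay positive and slowly increase towards the max-margin value (roughly 0.75 here), illustrating the slow O(1/log T) directional convergence discussed below.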
Admittedly, many of these papers do provide finite-time guarantees for linear predictors, which all tend to have the following form: After T iterations, the output w_T of gradient descent (normalized to have unit norm, and with bounded step sizes) generally satisfies²

‖w̄_T − w*‖ = O(1/log(T)) ,  (1)

where w* is the max-margin predictor, and the O(·) notation hides dependencies on the dataset size and the margin attained by w*. However, such bounds do not satisfactorily address the problem above, since they decay extremely slowly with T. For example, suppose we ask how many iterations T are needed until we get a predictor which achieves some positive margin on all the data points, assuming there exists a unit-norm predictor w* achieving a margin of γ (namely, min_i x_i^⊤ w* ≥ γ). If all we know is the bound in Eq. (1), we must require ‖w̄_T − w*‖ ≤ γ, which holds if T > exp(Ω(1/γ)) (and in fact, the actual required bound is much larger due to the hidden dependencies in the O(·) notation). For realistically small values of γ, this bound on T is unacceptably large. Could it be that gradient methods do not overfit only after so many iterations? We note that in Ji and Telgarsky (2019b), it is shown that Eq. (1) is essentially tight, but this does not preclude the possibility that w_T does not overfit even before getting very close to w*. In this paper, we show that this is indeed the case, and in fact, for any number of iterations T, gradient methods do not overfit and essentially behave in the best manner we can hope for. Specifically, if the underlying data distribution is separable with margin γ, and we attempt to minimize the average of an exponential or logistic loss over a training set of size m, using standard gradient methods (gradient flow, gradient descent, or stochastic gradient descent), then the generalization error of the resulting predictor (with respect to the 0-1 loss) is at most O(1/(γ²T) + 1/(γ²m)), up to constants and logarithmic factors.
For T ≲ m, this bound is O(1/(γ²T)), which is essentially the same as the optimal upper bound on the empirical risk of the algorithm's output after T iterations. In other words, both the generalization error and the empirical risk provably go down at the same (essentially optimal) rate. Once T ≳ m, the empirical risk may further decrease to 0, but the generalization error remains at O(1/(γ²m)), which is well-known to be essentially optimal for any learning algorithm in this setting (see for example Hanneke and Kontorovich (2019)). To prove these results, we also establish more refined, non-asymptotic bounds on the margins attained on the dataset, which are also applicable to other losses. In general, these bounds imply that for any α ∈ [0, 1], after T iterations, the resulting predictor achieves a margin of Ω((1 − α)γ) on all but O(1/(γ²T)^α) of the data points. These bounds guarantee that as T increases, for larger and larger portions of the dataset, the predictors achieve a margin of Ω(γ), implying good generalization properties. Furthermore, we prove a lower bound showing that such guarantees are essentially optimal. Finally, although we focus mostly on the exponential and logistic loss, we also discuss the applicability of our results to polynomially-tailed losses in the appendix.

2. A bit more precisely, these results imply that for all datasets linearly separable with margin γ, and for exponentially-tailed losses such as the logistic loss, w_T = w* log(T) + ρ(T), where ρ(T) is O(1) (as a function of T) for datasets in general position, and O(log log(T)) always. See for example Gunasekar et al. (2018); Ji and Telgarsky (2019b) for more details.
Before continuing, we emphasize that the techniques we use in our upper bounds are not fundamentally new, and similar ideas were employed in previous analyses on the convergence to the max-margin predictor, such as in Ji and Telgarsky (2018b, 2019a) (in fact, some of our results build on these analyses). However, we apply these techniques to a conceptually different question, about the non-asymptotic ability of gradient methods to attain some significant margin. In addition, since we only care about convergence to some large-margin predictor (as opposed to the max-margin predictor), our analysis can be shorter and simpler. Finally, we note that polynomial-time, non-asymptotic guarantees on the generalization error of unregularized gradient methods were also obtained in Ji and Telgarsky (2019a) and version 2 of Ji and Telgarsky (2018b). However, the former is for nonlinear predictors, and the latter is for one-pass stochastic gradient descent, which is different from the algorithms considered here, and where necessarily T = m. Moreover, the bounds in both papers have a worse polynomial dependence on the margin γ, compared to our results.

The paper is structured as follows. In the next section, we define some useful notation, and formally describe our setting. In Sec. 3, we provide our positive results about the margin behavior and the generalization error, focusing on gradient flow (for which our analysis is the simplest and completely self-contained). In Sec. 4, we show how similar results can be obtained for gradient descent and stochastic gradient descent. In Sec. 5, focusing for concreteness on gradient descent, we show that our positive result on the margin behavior is essentially tight. In Sec. 6, we discuss how our results can be applied to polynomially-tailed losses, and their implications. A numerical illustration of our results is provided in Sec. 7. Some technical proofs are provided in Appendix A.

2.
Preliminaries

We generally let boldfaced letters denote vectors. Given a positive integer n, we let [n] be a shorthand for {1, . . . , n}. Given a nonzero vector w, we let w̄ := w/‖w‖ denote its normalization to unit norm. We use the standard O(·) and Ω(·) notation to hide constants, and Õ(·), Ω̃(·) to hide constants and factors polylogarithmic in the problem parameters. log(·) refers to the natural logarithm, and ‖·‖ refers to the Euclidean norm. We consider datasets defined by a set of vectors x_1, . . . , x_m ∈ R^d, and algorithms which attempt to minimize the empirical risk function, namely

ˆL(w) := (1/m) Σ_{i=1}^m ℓ(x_i^⊤ w) ,

where ℓ : R → R is some loss function³. We will utilize the following two assumptions about the dataset and the loss:

Assumption 1 max_i ‖x_i‖ ≤ 1, and the dataset is separable with margin γ ∈ (0, 1]: Namely, there exists a unit vector w* s.t. min_i x_i^⊤ w* ≥ γ.

Assumption 2 ℓ is convex, monotonically decreasing, and has an inverse function⁴ ℓ⁻¹ on the interval (0, ℓ(0)].

Remark 1 Given a dataset, the results in our paper hold for any margin γ which satisfies Assumption 1, and in particular the maximum margin (i.e., the largest γ for which Assumption 1 holds).

We note that Assumption 1 is without much loss of generality (it simply sets the scaling of the problem). Assumption 2 implies that ˆL is convex, and is satisfied for most classification losses. When instantiating our general results, we will focus for concreteness on the logistic loss ℓ(z) = log(1 + exp(−z)) and the exponential loss ℓ(z) = exp(−z). However, our results can be applied to other losses as well. In our proofs, we will make use of the following well-known facts (see for example Nesterov (2018)): For a convex function f on R^d, it holds for any vectors u, v that f(v) − f(u) ≤ ∇f(v)^⊤(v − u). Also, if f is a function with µ-Lipschitz gradients, then for any u, v, f(v) ≤ f(u) + ∇f(u)^⊤(v − u) + (µ/2)‖v − u‖².

3. Gradient Flow

In this section, we present positive results on the margin behavior and generalization error of gradient flow.
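Before turning to gradient flow, the setup of the previous section can be checked numerically; the toy dataset and the reference unit vector w_star below are illustrative assumptions, not data from the paper:

```python
import numpy as np

# A toy dataset of m = 3 points in R^2, chosen so that Assumption 1 holds.
X = np.array([[0.9, 0.2], [0.9, -0.2], [0.6, 0.1]])
w_star = np.array([1.0, 0.0])        # a unit-norm separator (an assumption)
gamma = (X @ w_star).min()           # margin certified by w_star

assert np.linalg.norm(X, axis=1).max() <= 1.0   # Assumption 1: norms <= 1
assert gamma > 0                                 # Assumption 1: separability

def emp_risk(w, loss):
    """Empirical risk (1/m) * sum_i loss(x_i . w)."""
    return loss(X @ w).mean()

exp_loss = lambda z: np.exp(-z)
logistic_loss = lambda z: np.log1p(np.exp(-z))

# At w = 0 the empirical risk equals ell(0): 1 for the exponential loss,
# log(2) for the logistic loss.
risk_exp0 = emp_risk(np.zeros(2), exp_loss)
risk_log0 = emp_risk(np.zeros(2), logistic_loss)
```

Per Remark 1, any γ certified by some unit vector (here γ = 0.6 via w_star) is a valid margin parameter for the results below.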
Gradient flow is the standard continuous-time analogue of gradient descent. Although it cannot be implemented precisely in practice, it is a useful idealization of gradient descent, and the method for which it is easiest to present our analysis (the analysis for gradient descent in the next section is just a slight variation). Gradient flow produces a continuous trajectory of vectors w(t), indexed by a time t ≥ 0. We define it starting from 0, in which case it is specified by the following differential equation:

w(0) = 0 ,  (d/dt) w(t) = −∇ˆL(w(t)) .  (2)

We recall that for linearly separable data, and classification losses which asymptote to zero (such as the exponential and logistic losses), the gradient flow w(t) generally diverges. However, for classification we are only interested in the direction of the vector (or equivalently, its normalization to a unit vector), and the margin this normalized vector attains on the data. We now present a general result about the margin behavior of gradient flow, which is applicable to general losses and any vector reached by gradient flow at any time point:

3. In the context of binary classification, it is customary to consider labeled data points (x_1, y_1), . . . , (x_m, y_m) and losses of the form w ↦ ℓ(y_i x_i^⊤ w). However, for our purposes we can fold the binary label y_i inside x_i, and treat this as a single vector.
4. Namely, for any z ∈ (0, ℓ(0)], there is a unique p = ℓ⁻¹(z) such that ℓ(p) = z.

Theorem 2 Under Assumptions 1 and 2, let w ≠ 0 be some point reached by gradient flow (as specified in Eq. (2)), such that ˆL(w) = ϵ for some ϵ ∈ (0, ℓ(0)]. Then for any p ∈ [ϵ/ℓ(0), 1], for at least (1 − p)m of the indices i ∈ [m],

x_i^⊤ w̄ ≥ (γ/2) · ℓ⁻¹(ϵ/p)/ℓ⁻¹(ϵ) .

Intuitively, we expect the empirical risk ˆL(w(t)) to decay with t (as we instantiate for specific losses later on). Computing the corresponding bounds on ϵ and plugging into the above, we can get guarantees on how many points in our dataset achieve a certain margin.
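Although gradient flow cannot be implemented exactly, a forward-Euler discretization of Eq. (2) with a small step size approximates it well and makes the trajectory easy to inspect (a sketch; the toy dataset is an assumption). Along the flow, (d/dt)ˆL(w(t)) = −‖∇ˆL(w(t))‖² ≤ 0, so the empirical risk should be non-increasing:

```python
import numpy as np

X = np.array([[0.9, 0.2], [0.9, -0.2], [0.6, 0.1]])  # toy data, norms <= 1

def risk(w):
    # Empirical risk for the exponential loss.
    return np.exp(-(X @ w)).mean()

def grad(w):
    z = X @ w
    return -(np.exp(-z)[:, None] * X).mean(axis=0)

w, dt = np.zeros(2), 1e-2
risks = [risk(w)]
for _ in range(20000):       # forward Euler: simulates w(t) up to time t = 200
    w -= dt * grad(w)
    risks.append(risk(w))
```

With dt small enough, the recorded risks decrease monotonically, mirroring the continuous-time identity above.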
Note that since ℓ⁻¹ is monotonically decreasing, the margin lower bound in the theorem is always at most γ/2. Since under Assumption 1 the maximal margin is at least γ, this result cannot be used to recover the asymptotic convergence to the max-margin predictor, shown by previous results. However, as discussed in the introduction, this is not our focus here: We ask about the time to converge to some large-margin predictor which generalizes well, not necessarily the max-margin predictor. For that purpose, as we will see later, a margin lower bound of Ω(γ) is perfectly adequate.

To prove Thm. 2, we will need the following key lemma, which bounds the norm of the points along the trajectory of gradient flow, in terms of the value of ˆL. The proof is short and relies only on the convexity of ˆL:

Lemma 3 Fix some T ≥ 0, and let w* be any vector such that ˆL(w*) ≤ ˆL(w(T)). Then sup_{t∈[0,T]} ‖w(t)‖ ≤ 2‖w*‖.

Proof By definition of gradient flow and the chain rule, we have (d/dt) ˆL(w(t)) = −‖∇ˆL(w(t))‖² ≤ 0, so ˆL(w(t)) is monotonically decreasing in t. As a result, for any t ≤ T, ˆL(w(t)) ≥ ˆL(w(T)) ≥ ˆL(w*). By convexity of ˆL, it follows that

∇ˆL(w(t))^⊤ (w(t) − w*) ≥ ˆL(w(t)) − ˆL(w*) ≥ 0 .

Using this inequality, the definition of gradient flow and the chain rule, it follows that

(d/dt) ‖w(t) − w*‖² = −2 ∇ˆL(w(t))^⊤ (w(t) − w*) ≤ 0 .

Therefore, ‖w(t) − w*‖ is monotonically decreasing in t ∈ [0, T], so it is at most ‖w(0) − w*‖ = ‖w*‖. Thus, by the triangle inequality, ‖w(t)‖ ≤ ‖w(t) − w*‖ + ‖w*‖ ≤ 2‖w*‖.

Proof [Proof of Thm. 2] Let w₀, ‖w₀‖ = 1 be a max-margin separator, so that min_i x_i^⊤ w₀ ≥ γ. Define w* := (ℓ⁻¹(ϵ)/γ) w₀, which has norm ℓ⁻¹(ϵ)/γ, and note that since ℓ⁻¹(ϵ) is non-negative and min_i x_i^⊤ w* ≥ ℓ⁻¹(ϵ),

max_i ℓ(x_i^⊤ w*) = ℓ(min_i x_i^⊤ w*) ≤ ℓ(ℓ⁻¹(ϵ)) = ϵ .

This implies ˆL(w*) = (1/m) Σ_{i=1}^m ℓ(x_i^⊤ w*) ≤ ϵ. Combined with Lemma 3, we get that

‖w‖ ≤ 2‖w*‖ = 2ℓ⁻¹(ϵ)/γ .  (3)

Since ˆL(w) = ϵ, we get that E_i[ℓ(x_i^⊤ w)] = ϵ, where i is uniformly distributed on [m]. By Markov's inequality, Pr_i(ℓ(x_i^⊤ w) ≥ ϵ/p) ≤ p, hence

Pr_i(x_i^⊤ w ≥ ℓ⁻¹(ϵ/p)) ≥ 1 − p , and by Eq. (3), Pr_i(x_i^⊤ w̄ ≥ γ ℓ⁻¹(ϵ/p)/(2ℓ⁻¹(ϵ))) ≥ 1 − p .

For concreteness, let us now apply Thm.
2 to the case of the exponential loss, ℓ(z) = exp(−z). In order to get an interesting guarantee, we will utilize the following simple non-asymptotic guarantee on the decay of ˆL(w(T)) as a function of T:

Lemma 4 Under Assumption 1, if ℓ is the exponential loss, then ˆL(w(T)) ≤ 1/(γ²T) for any T > 0.

Proof Let w₀ be a max-margin unit vector, so that min_i x_i^⊤ w₀ ≥ γ. By definition of gradient flow, the chain rule and Cauchy-Schwarz,

(d/dt) ˆL(w(t)) = −‖∇ˆL(w(t))‖² ≤ −(∇ˆL(w(t))^⊤ w₀)² = −((1/m) Σ_{i=1}^m ℓ'(x_i^⊤ w(t)) x_i^⊤ w₀)² ≤ −((1/m) Σ_{i=1}^m ℓ'(x_i^⊤ w(t)) γ)² =(*) −(γ (1/m) Σ_{i=1}^m ℓ(x_i^⊤ w(t)))² = −γ² ˆL(w(t))² ,

where in (*) we used the fact that for the exponential loss, ℓ(z) = −ℓ'(z) for all z. Consider now the function f(t) := γ²t ˆL(w(t)). We clearly have f(0) = 0, and by differentiation and the inequality above, it is easily verified that f'(t) ≤ 0 whenever f(t) ≥ 1. Since f(t) is continuous, we must have f(t) ≤ 1 for all t, hence ˆL(w(t)) ≤ 1/(γ²t).

Using this lemma, we get the following corollary of Thm. 2 for the exponential loss:

Corollary 5 Under Assumption 1, if ℓ is the exponential loss, then for any T > 1/γ² and any α ∈ [0, 1], it holds that x_i^⊤ w̄(T) > ((1 − α)/2) γ for at least (1 − (γ²T)^{−α})m of the indices i ∈ [m].

Proof By Lemma 4, we know that at time T, ˆL(w(T)) = ϵ for some ϵ ≤ 1/(γ²T) < 1 = ℓ(0) (which implies w(T) ≠ 0). Applying Thm. 2, and noting that ℓ⁻¹(z) = log(1/z) and ℓ(0) = 1, we get that for any p ∈ [1/(γ²T), 1], for at least (1 − p)m of the indices i,

x_i^⊤ w̄(T) ≥ (γ/2) · log(p/ϵ)/log(1/ϵ) = (γ/2) (1 − log(1/p)/log(1/ϵ)) ≥ (γ/2) (1 − log(1/p)/log(γ²T)) .

In particular, picking p = (γ²T)^{−α} for some α ∈ [0, 1], the result follows.

Note that if T > m^{1/α}/γ², then (γ²T)^{−α} < 1/m, which implies that for all i ∈ [m], x_i^⊤ w̄(T) is at least ((1 − α)/2) γ. However, even for smaller values of T, the theorem provides guarantees on the margin attained on most points in the dataset.
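Corollary 5 can be sanity-checked numerically (a hedged illustration, not a proof: forward-Euler steps with a small step size stand in for gradient flow, and the random dataset below is an assumption). The fraction of points with margin at most (1 − α)γ/2 should not exceed (γ²T)^{−α}:

```python
import numpy as np

rng = np.random.default_rng(0)
m, gamma = 50, 0.4
# Points in the unit ball whose margin w.r.t. e_1 is at least gamma:
a = rng.uniform(gamma, 0.9, size=m)
b = rng.uniform(-0.3, 0.3, size=m)
X = np.column_stack([a, b])        # all norms <= sqrt(0.81 + 0.09) < 1

def grad(w):                       # gradient of the exponential-loss risk
    z = X @ w
    return -(np.exp(-z)[:, None] * X).mean(axis=0)

dt, steps = 1e-2, 10000            # approximates w(t) up to time T = 100
w = np.zeros(2)
for _ in range(steps):
    w -= dt * grad(w)

T = dt * steps
alpha = 0.5
w_hat = w / np.linalg.norm(w)
violations = np.mean(X @ w_hat <= (1 - alpha) * gamma / 2)
bound = (gamma**2 * T) ** (-alpha)  # Corollary 5's bound on the violating fraction
```

In this run the violating fraction is far below the bound, as the corollary only gives a worst-case guarantee.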
Moreover, using standard margin-based generalization bounds for binary classification, this theorem implies that the predictors returned by gradient flow achieve low generalization error, uniformly over all sufficiently large T:

Theorem 6 Let D be some distribution over {x : ‖x‖ ≤ 1} × {−1, +1}, such that there exists a unit vector w₀ and γ > 0 satisfying Pr_{(x,y)~D}(y x^⊤ w₀ ≥ γ) = 1. If we sample m points (x_1, y_1), . . . , (x_m, y_m) i.i.d. from D, and run gradient flow on w ↦ (1/m) Σ_{i=1}^m ℓ(y_i x_i^⊤ w), where ℓ is the exponential loss, then for any δ ∈ (0, 1], with probability at least 1 − δ over the sample, it holds for any time T that

Pr_{(x,y)~D}(sign(x^⊤ w(T)) ≠ y) ≤ O(1/(γ²T) + 1/(γ²m)) ,

where the O notation hides universal constants and factors polylogarithmic in γ²m and 1/δ.

As discussed in the introduction, this is essentially the best behavior we can hope for (up to log factors): For T ≲ m, the generalization error decays as O(1/(γ²T)), which is also the bound on the empirical risk (see Lemma 4). Once T ≳ m, the generalization error becomes O(1/(γ²m)), and stays there regardless of how large T is.

Proof [Proof of Thm. 6] The bound in the theorem is vacuous when γ²m ≤ 1 or γ²T ≤ 1, so we will assume without loss of generality that both quantities are larger than 1. Standard margin-based generalization bounds (e.g. McAllester (2003)) imply that in our setting, if we pick m points i.i.d., then with probability at least 1 − δ, any unit vector w̄ for which y_i x_i^⊤ w̄ ≥ ˆγ for all but pm of the points satisfies

Pr_{(x,y)~D}(sign(w̄^⊤ x) ≠ y) ≤ p + O(√(p/(ˆγ²m)) + 1/(ˆγ²m)) ≤ O(p + 1/(ˆγ²m)) ,

where the O hides universal constants and factors polylogarithmic in 1/δ and γ²m. In particular, this can be applied uniformly to w̄(T) for any T. By Corollary 5, we can substitute p = (γ²T)^{−α} and ˆγ = (1 − α)γ/2, to get that with probability at least 1 − δ,

Pr_{(x,y)~D}(sign(x^⊤ w(T)) ≠ y) ≤ O(1/(γ²T)^α + 1/((1 − α)²γ²m)) .  (4)

This holds for any α ∈ [0, 1]. In particular, if γ²T ≥ (γ²m)², pick α = 1/2, in which case Eq.
(4) is at most

O(1/(γ²T)^{1/2} + 1/(γ²m)) ,

which (since γ²T ≥ (γ²m)² implies (γ²T)^{1/2} ≥ γ²m) is at most O(1/(γ²m)); and if γ²T < (γ²m)², pick α = 1 − 1/(2 log(γ²m)), in which case Eq. (4) is at most

O((γ²T)^{1/(2 log(γ²m))}/(γ²T) + 1/(γ²m)) ≤ O((γ²m)^{1/log(γ²m)}/(γ²T) + 1/(γ²m)) = O(1/(γ²T) + 1/(γ²m)) ,

where we used the fact that z^{1/log(z)} = exp(log(z)/log(z)) = exp(1). Combining the two cases, the result follows.

4. Gradient Descent and Stochastic Gradient Descent

Having discussed gradient flow, we show in this section how essentially identical results can be obtained for gradient descent and stochastic gradient descent.

4.1 Gradient Descent

Gradient descent, which is probably the simplest and most well-known gradient method, optimizes ˆL by initializing at some point w_1 (which we will take to be 0), and performing iterations of the form w_{t+1} = w_t − η_t ∇ˆL(w_t), where η_1, η_2, . . . are step size parameters. We will utilize the following standard assumption:

Assumption 3 The derivative of ℓ is µ-Lipschitz, and 0 < η_t ≤ 1/µ for all t.

Inspecting the analysis for gradient flow from the previous section, we note that we relied on the algorithm's structure only at two points: In Lemma 3, to bound the norm of the points along the trajectory, and in Lemma 4, to upper bound the values of ˆL. Fortunately, we can provide analogues of these two lemmas for gradient descent:

Lemma 7 Under Assumptions 1, 2 and 3, fix some index T ≥ 1, and let w* be any vector such that ˆL(w*) ≤ ˆL(w_T). Then max_{t∈[T]} ‖w_t‖ ≤ 2‖w*‖.

Lemma 8 (Ji and Telgarsky (2018b, Theorem 3.1)) If ℓ is the logistic loss, then under Assumption 3, gradient descent with step size η = 1 satisfies ˆL(w_T) ≤ 1/T + log²(T)/(2γ²T) for any T.

The proof of Lemma 7 (which is a slight variation on the proof of Lemma 3) appears in Appendix A. With these lemmas, we can prove analogues of the theorems from the previous section, this time for gradient descent and for the logistic loss.
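The gradient descent setting just described can be sketched numerically (the toy dataset is an assumption). The derivative of the logistic loss is (1/4)-Lipschitz, so the fixed step size η = 1 satisfies Assumption 3 with µ = 1/4, and the empirical risk decreases monotonically:

```python
import numpy as np

X = np.array([[0.9, 0.2], [0.9, -0.2], [0.5, 0.3], [0.6, -0.1]])  # norms <= 1

def risk(w):
    # Empirical risk for the logistic loss.
    return np.log1p(np.exp(-(X @ w))).mean()

def grad(w):
    s = -1.0 / (1.0 + np.exp(X @ w))   # ell'(x_i . w) for the logistic loss
    return (s[:, None] * X).mean(axis=0)

eta, T = 1.0, 500                      # fixed step size eta = 1, as in Lemma 8
w = np.zeros(2)
risks = [risk(w)]
for _ in range(T):
    w -= eta * grad(w)
    risks.append(risk(w))
```

On separable data like this, the recorded risks start at log(2) and decrease at every iteration, in line with the 1/T-up-to-log-factors decay of Lemma 8.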
The resulting bounds are identical up to constants and logarithmic factors:

Theorem 9 Under Assumptions 1, 2 and 3, let w ≠ 0 be some point reached by gradient descent, such that ˆL(w) = ϵ for some ϵ ∈ (0, ℓ(0)]. Then for any p ∈ [ϵ/ℓ(0), 1], for at least (1 − p)m of the indices i ∈ [m],

x_i^⊤ w̄ ≥ (γ/2) · ℓ⁻¹(ϵ/p)/ℓ⁻¹(ϵ) .

Corollary 10 Under Assumption 1, if ℓ is the logistic loss, and we use a fixed step size of η_t = 1 for all t, then for any T > 4 such that 2 log²(T)/(γ²T) < ℓ(0), and any α ∈ [0, 1], the gradient descent iterates satisfy x_i^⊤ w̄_T > ((1 − α)/2) γ for at least (1 − (2 log²(T)/(γ²T))^α)m of the indices i ∈ [m].

Theorem 11 Let D be some distribution over {x : ‖x‖ ≤ 1} × {−1, +1}, such that there exists a unit vector w₀ and γ > 0 satisfying Pr_{(x,y)~D}(y x^⊤ w₀ ≥ γ) = 1. If we sample m points (x_1, y_1), . . . , (x_m, y_m) i.i.d. from D, and run gradient descent with fixed step sizes η_t = 1 on w ↦ (1/m) Σ_{i=1}^m ℓ(y_i x_i^⊤ w), where ℓ is the logistic loss, then for any δ ∈ (0, 1], with probability at least 1 − δ over the sample, it holds for any iteration T that

Pr_{(x,y)~D}(sign(x^⊤ w_T) ≠ y) ≤ O(1/(γ²T) + 1/(γ²m)) ,

where the O notation hides universal constants and factors polylogarithmic in γ²m, T and 1/δ.

The results can be easily generalized to other step size strategies. The proofs are essentially identical to the proofs from the previous section, except that we use Lemmas 7 and 8 instead of Lemmas 3 and 4. In particular, the proof of Thm. 9 is identical to the proof of Thm. 2; the proof of Corollary 10 is nearly identical to the proof of Corollary 5 (and is provided in Appendix A for completeness); and the proof of Thm. 11 is identical to the proof of Thm. 6, except that we plug in Corollary 10 instead of Corollary 5, and thus have some additional log(T) factors which get absorbed into the O(·) notation.

4.2 Stochastic Gradient Descent

We now turn to discuss the stochastic gradient descent (SGD) algorithm, perhaps the main workhorse of modern machine learning methods.
We consider the simplest version of SGD for minimizing ˆL(w) = (1/m) Σ_{i=1}^m ℓ(x_i^⊤ w): We initialize w_1 at the origin 0, and for any t ≥ 1, define w_{t+1} := w_t − η ℓ'(x_{i_t}^⊤ w_t) x_{i_t}, where i_t ∈ [m] is chosen independently and uniformly at random (so that in expectation, E_{i_t}[w_{t+1}] = w_t − η ∇ˆL(w_t), similar to the gradient descent update). We assume that at the end of T iterations, the algorithm returns the average of the iterates obtained so far, v̄_T := (1/T) Σ_{t=1}^T w_t. To avoid yet another (and more complicated) repetition of the analysis from the previous section, we will take a somewhat different route, focusing on the logistic loss and fixed step sizes η_t = 1, which allows us to directly utilize some existing results in the literature to get a bound on the margin behavior of SGD (analogous to Corollary 5 for gradient flow and Corollary 10 for gradient descent). We note that the analysis can be easily generalized to other constant step sizes.

Theorem 12 Under Assumption 1, if ℓ is the logistic loss, and we use step sizes η_t = 1, then for any T ≥ 2/γ² and any α ∈ [0, 1], the SGD iterates satisfy x_i^⊤ v̄_T > ((1 − α)/5) γ for at least (1 − δ)m of the indices i ∈ [m], where δ is a nonnegative random variable (dependent on the randomness of SGD), whose expectation is at most (8 + 4 log²(γ²T))/(3γ²T^α).

Proof We will utilize the following easily-verified facts about the logistic loss: It is nonnegative, its gradient is (1/4)-Lipschitz, and its inverse is ℓ⁻¹(z) = log(1/(exp(z) − 1)), which is between log(1/z) and log(1/(2z)) for all z ∈ (0, 1]. Let w₀, ‖w₀‖ = 1 be a max-margin separator, so that min_i x_i^⊤ w₀ ≥ γ, and define (similarly to the proof of Thm. 6) w* := (ℓ⁻¹(ϵ)/γ) w₀, where ϵ ∈ (0, 1/2] will be chosen later (note that ℓ⁻¹(ϵ) > 0 in this regime). It is easily verified that max_i ℓ(x_i^⊤ w*) ≤ ϵ, and therefore ˆL(w*) ≤ ϵ.
Since ℓ is non-negative and has a (1/4)-Lipschitz derivative, we can use Theorem 14.13 from Shalev-Shwartz and Ben-David (2014) on the convergence of SGD for such losses to get

E[ˆL(v̄_T)] ≤ (1/(1 − η/4)) (ˆL(w*) + ‖w*‖²/(2ηT)) ≤ (4/3) (ϵ + (ℓ⁻¹(ϵ))²/(2γ²T)) .

In particular, picking ϵ = 1/(γ²T) (which is in (0, 1/2] by our assumption T ≥ 2/γ²), we get

E[ˆL(v̄_T)] ≤ (4/(3γ²T)) (1 + (ℓ⁻¹(1/(γ²T)))²/2) ≤ (4 + 2 log²(γ²T))/(3γ²T) ,

using ℓ⁻¹(z) ≤ log(1/z). To simplify notation, let Δ(T, γ) denote the expression on the right-hand side above. We also note that the left-hand side equals E_{v̄_T, i}[ℓ(x_i^⊤ v̄_T)], where i is uniformly distributed in [m]. Thus, by Markov's inequality, for any p > 0, we have

Pr_{i, v̄_T}(ℓ(x_i^⊤ v̄_T) ≥ Δ(T, γ)/p) ≤ p , hence Pr_{i, v̄_T}(x_i^⊤ v̄_T ≥ ℓ⁻¹(Δ(T, γ)/p)) ≥ 1 − p .

According to Theorem 2.1 in version 2 of Ji and Telgarsky (2018b), the iterates of SGD on the logistic loss satisfy deterministically max_{t∈[T]} ‖w_t‖ ≤ 2 log(T)/γ + 2, so by Jensen's inequality, ‖v̄_T‖ ≤ (1/T) Σ_{t=1}^T ‖w_t‖ ≤ 2 log(T)/γ + 2. Combining this with the displayed equation above, we get that with probability at least 1 − p,

x_i^⊤ v̄_T ≥ (1/(2 + 2 log(T)/γ)) ℓ⁻¹(Δ(T, γ)/p) .

Choosing p = 2 T^{1−α} Δ(T, γ) = (8 + 4 log²(γ²T))/(3γ²T^α) for some α ∈ [0, 1], so that Δ(T, γ)/p = 1/(2T^{1−α}), and using ℓ⁻¹(z) ≥ log(1/(2z)), we get that

(8 + 4 log²(γ²T))/(3γ²T^α) ≥ Pr_{i, v̄_T}(x_i^⊤ v̄_T ≤ (1/(2 + 2 log(T)/γ)) ℓ⁻¹(1/(2T^{1−α}))) ≥ Pr_{i, v̄_T}(x_i^⊤ v̄_T ≤ (1/(2 + 2 log(T)/γ)) log(T^{1−α})) = Pr_{i, v̄_T}(x_i^⊤ v̄_T ≤ ((1 − α)γ/2) · log(T)/(γ + log(T))) .

Since T ≥ 2/γ² ≥ 2, we have log(T)/(γ + log(T)) ≥ log(T)/(1 + log(T)) ≥ log(2)/(1 + log(2)) > 2/5. Plugging into the above, we get

(8 + 4 log²(γ²T))/(3γ²T^α) ≥ Pr_{i, v̄_T}(x_i^⊤ v̄_T ≤ (1 − α)γ/5) .

To complete the proof, we note that if we let A_i denote the event that x_i^⊤ v̄_T ≤ (1 − α)γ/5, and 1_{A_i} the indicator function of the event A_i, then the above implies

(8 + 4 log²(γ²T))/(3γ²T^α) ≥ E_{i, v̄_T}[1_{A_i}] = E_{v̄_T}[E_i[1_{A_i}]] = E_{v̄_T}[(1/m) Σ_{i=1}^m 1_{A_i}] .

Thus, letting δ := (1/m) Σ_{i=1}^m 1_{A_i}, the theorem follows.
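The SGD variant analyzed above is easy to sketch in code (the toy separable dataset and the random seed are illustrative assumptions): one uniformly-sampled point per step, step size η = 1, and the averaged iterate v̄_T returned at the end:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0.9, 0.1], [0.8, -0.1], [0.7, 0.2], [0.6, -0.2], [0.5, 0.1]])
m = len(X)

def ell_prime(z):                  # derivative of the logistic loss
    return -1.0 / (1.0 + np.exp(z))

T, eta = 2000, 1.0
w = np.zeros(2)
w_sum = np.zeros(2)
for _ in range(T):
    w_sum += w                     # v_T averages w_1, ..., w_T (with w_1 = 0)
    i = rng.integers(m)            # uniform random index, as in the text
    w = w - eta * ell_prime(X[i] @ w) * X[i]
v_bar = w_sum / T

frac_positive = np.mean(X @ v_bar > 0)   # fraction classified positively by v_T
```

Despite the noise of single-sample updates, the averaged iterate attains a positive margin on (nearly) all points here, in the spirit of Theorem 12.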
Using this theorem, we get the following generalization error bound for SGD, which is a direct analogue of the error bounds we obtained for gradient flow and gradient descent:

Theorem 13 Let D be some distribution over {x : ‖x‖ ≤ 1} × {−1, +1}, such that there exists a unit vector w₀ and γ > 0 satisfying Pr_{(x,y)~D}(y x^⊤ w₀ ≥ γ) = 1. Suppose we sample m points (x_1, y_1), . . . , (x_m, y_m) i.i.d. from D, and run SGD (with step sizes η_t = 1) on w ↦ (1/m) Σ_{i=1}^m ℓ(y_i x_i^⊤ w), where ℓ is the logistic loss. Then for any T,

E[Pr_{(x,y)~D}(sign(x^⊤ v̄_T) ≠ y)] ≤ O(1/(γ²T) + 1/(γ²m)) ,

where the O notation hides universal constants and factors polylogarithmic in T, m, 1/γ, and the expectation is over the randomness of the SGD algorithm.

Proof Using identical arguments as in the proof of Thm. 6 (except using Thm. 12 instead of Corollary 5, and the fact that high-probability bounds imply a bound on the expectation), we get that

E[Pr_{(x,y)~D}(sign(x^⊤ v̄_T) ≠ y)] ≤ O(1/(γ²T^α) + 1/((1 − α)²γ²m)) .  (5)

This holds for any α ∈ [0, 1]. In particular, if T ≥ m², pick α = 1/2, in which case Eq. (5) is at most

O(1/(γ²T^{1/2}) + 1/(γ²m)) ,

and if T < m², pick α = 1 − 1/log(m²), in which case Eq. (5) is at most

O(T^{1/log(m²)}/(γ²T) + 1/(γ²m)) ≤ O((m²)^{1/log(m²)}/(γ²T) + 1/(γ²m)) = O(1/(γ²T) + 1/(γ²m)) .

Combining the two cases, the result follows.

Finally, we remark that Thm. 13 only bounds the expectation of the error probability of v̄_T. It is likely that using more sophisticated concentration tools, one can obtain a high-probability bound. However, this would require a more involved analysis, and is left for future work.

5. Tightness

In the previous sections, we provided bounds on the margin behavior of gradient flow, gradient descent and SGD, which all have the following form (ignoring constants and log factors): After T iterations, we get a predictor which achieves Ω(γ) margin on at least (1 − (γ²T)^{−α})m of the m data points, where α ∈ (0, 1) is a constant arbitrarily close to one. It is natural to ask whether this result can be improved.
In this section, we show that this result is essentially tight: After T iterations, it is impossible to guarantee any positive margin on more than (1 − p)m of the points, when p is much less than (γ²T)^{−1}. For concreteness, we prove this for gradient descent and the logistic loss, although the analysis can be extended to other gradient methods and losses:

Theorem 14 For any positive integers m, T and any γ ∈ (0, 1/8], there exists a dataset of m points in R² satisfying Assumption 1, such that gradient descent using the logistic loss and any step size η ≤ 1 must satisfy x_i^⊤ w_T ≤ 0 for at least m/(26 max{1, γ²T}) of the data points.

Since x_i^⊤ w_T ≤ 0 translates to ℓ(x_i^⊤ w_T) ≥ Ω(1), the theorem also implies that the O(1/(γ²T)) bounds on the empirical risk shown earlier are tight up to logarithmic factors. It also implies that the percentage of misclassified points (x_i^⊤ w_T ≤ 0) cannot decrease at a better rate. In addition, the lower bound implies that if we want to get any positive margin on all m data points, the number of iterations T must be at least Ω(m/γ²) in the worst case. A lower bound related to ours appears in Ji and Telgarsky (2019b), where the authors show that if w* is the max-margin unit-norm predictor, then ‖w̄_T − w*‖ ≥ Ω(log(m)/log(T)). This translates to a T ≥ Ω(m) requirement to get a non-trivial guarantee on the direction of w_T. However, this is a somewhat different objective than ours, and moreover, their lower bound does not specify the dependence on the margin parameter γ.

Proof [Proof of Thm. 14] We will assume without loss of generality that T ≥ 1/γ² (or equivalently, γ²T ≥ 1), in which case the lower bound we need to prove is m/(26γ²T). Otherwise, if T < 1/γ², we can simply apply the construction below with the larger margin parameter ˆγ := 1/√T, which by definition satisfies T = 1/ˆγ², and uses a dataset separable with margin ˆγ ≥ γ, so Assumption 1 still holds with margin parameter γ.
We will also assume without loss of generality that ⌊m/(26γ²T)⌋ > 0 (otherwise the theorem trivially holds), and fix ϵ := (1/m)⌊m/(26γ²T)⌋ (which is positive and at most 1/(26γ²T) ≤ 1/26).

Consider a dataset consisting of a majority group of (1 − ϵ)m points of the form (1, 0), and a minority group of ϵm points of the form (−1/2, 3γ) (see Figure 1 for a sketch of the construction and proof idea). It is easily verified that all these points are contained in the unit ball, and that the vector w* = (γ, 1/2) satisfies x^⊤ w̄* ≥ x^⊤ w* ≥ γ for any x in the dataset (using ‖w*‖ ≤ 1). Thus, Assumption 1 holds. Moreover, the empirical risk function equals

ˆL(w) = ˆL(w(1), w(2)) = (1 − ϵ) ℓ(w(1)) + ϵ ℓ(−(1/2) w(1) + 3γ w(2)) ,

where ℓ(z) = log(1 + exp(−z)) is the logistic loss. Suppose by contradiction that x^⊤ w_T ≤ 0 for less than ϵm of the points x in the dataset. Since there are ϵm points of the form x = (−1/2, 3γ), this implies that all these points are positively correlated with w_T, that is, x^⊤ w_T = −(1/2) w_T(1) + 3γ w_T(2) > 0. The lemma below implies that this cannot happen unless T ≥ 19/(480γ²ϵ). But since 19/(480γ²ϵ) ≥ 19/(480γ² · (1/(26γ²T))) = (19 · 26/480) T > T, we get T > T, a contradiction.

Figure 1: Illustration of the proof of Thm. 14, for γ = 1/5. The dataset consists of 90% points at (1, 0) (thick black arrow) and 10% points at (−1/2, 3γ) (thin black arrow). The gradient descent trajectory (with η = 1, starting from the origin) is the dotted red line, and the shaded blue region is the set of vectors which achieve a positive margin on more than 90% of the data points (or equivalently for this construction, a positive margin on all data points). Initially, the gradient descent trajectory is mostly influenced by the points at (1, 0), and only once a sufficiently large margin is achieved on them does the influence of the few points at (−1/2, 3γ) begin to manifest, and the trajectory curves towards the blue region. Best viewed in color.
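The dynamics behind Thm. 14's construction can be sketched numerically (the concrete values γ = 1/8, ϵ = 0.1 and η = 1 are illustrative assumptions). Gradient descent is run directly on the weighted two-point risk, and the minority points' margin stays non-positive for many iterations before the trajectory finally curves towards them:

```python
import numpy as np

# L(r, s) = (1 - eps) * ell(r) + eps * ell(-r/2 + 3*gamma*s), logistic ell.
gamma, eps, eta = 1 / 8, 0.1, 1.0

def ell_prime(z):                 # derivative of the logistic loss
    return -1.0 / (1.0 + np.exp(z))

r = s = 0.0
minority_margin = {}
for t in range(2, 5002):          # iterate index t, with (r_1, s_1) = (0, 0)
    q = -r / 2 + 3 * gamma * s    # margin of the minority points
    g_r = (1 - eps) * ell_prime(r) - 0.5 * eps * ell_prime(q)
    g_s = 3 * gamma * eps * ell_prime(q)
    r, s = r - eta * g_r, s - eta * g_s
    if t in (11, 5001):
        minority_margin[t] = -r / 2 + 3 * gamma * s
```

Early on, the majority points at (1, 0) dominate the gradient and the minority margin is negative; only much later does it turn positive, in line with the t ≥ Ω(1/(γ²ϵ)) requirement established below.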
Lemma 15 Fix some ϵ and γ in (0, 1/8]. Consider gradient descent on the bivariate function

L̂(r, s) := (1 − ϵ)·ℓ(r) + ϵ·ℓ(−(1/2)·r + 3γ·s),

where ℓ is the logistic loss, using any step size η ≤ 1, starting from r_1 = s_1 = 0 and producing a sequence of iterates (r_2, s_2), (r_3, s_3), . . .. Then if −(1/2)·r_t + 3γ·s_t > 0, we must have t ≥ 19/(480γ²ϵ).

The proof of the lemma is based on a technical calculation, and appears in Appendix A.

6. Polynomially-Tailed Losses

In our paper, we focused mostly on the exponential and logistic losses, which both have exponentially decaying tails. Nacson et al. (2019a) showed that such a tail is important to get asymptotic convergence to a max-margin solution: If ℓ(z) decays polynomially with z, then the convergence may not be to a max-margin solution. Despite this, in the spirit of our previous results, one may still conjecture that we converge to some large-margin solution (though not a max-margin one). Recently, Dudik et al. (2020, Remark 8) showed this holds, albeit in a weak sense: If the loss ℓ(z) decays as z⁻ᵇ for some b > 0, then the asymptotic margin over the dataset is Ω(m^(−1/(b+1))), and this is tight in general. Using our techniques, we can provide the following more refined margin bound:

Theorem 16 Under Assumptions 1, 2, 3, let ℓ be a loss satisfying ℓ(z) = z⁻ᵇ for all z ≥ 1. Let w be some point reached by gradient descent, such that L̂(w) = ϵ for some ϵ ∈ (0, ℓ(0)). Then for any p ∈ [ϵ, 1], it holds for at least (1 − p)m of the indices i ∈ [m] that

x_i^⊤ w/‖w‖ ≥ (γ/2)·p^(1/b).

Proof By definition, ℓ⁻¹(z) = z^(−1/b) for any z ∈ (0, 1]. Plugging this into Thm. 9, and noting that ℓ(0) ≥ ℓ(1) = 1, it follows that for any p ∈ [ϵ, 1], for at least (1 − p)m of the indices i ∈ [m],

x_i^⊤ w/‖w‖ ≥ (γ/2)·ℓ⁻¹(ϵ/p)/ℓ⁻¹(ϵ) = (γ/2)·(ϵ/p)^(−1/b)·ϵ^(1/b) = (γ/2)·p^(1/b).

Assuming gradient descent converges to a globally minimal value, we have ϵ → 0.
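The refined bound can be probed numerically. The sketch below (our own illustration, not the paper's code) runs gradient descent with a polynomially-tailed loss on a synthetic dataset whose margin γ is known by construction, and compares attained margin quantiles against (γ/2)·p^(1/b). The quadratic extension of the loss below z = 1, the dataset, and all parameter choices are assumptions made for the sketch; it checks plausibility rather than verifying the theorem's exact step-size conditions (Assumptions 2, 3).

```python
import numpy as np

b = 2.0  # tail exponent: ell(z) = z**(-b) for z >= 1

def ell(z):
    # Convex, differentiable loss with the required polynomial tail; below
    # z = 1 we use a quadratic extension (our assumption, matching value,
    # first and second derivatives at z = 1).
    z = np.asarray(z, dtype=float)
    head = 1.0 - b * (z - 1.0) + 0.5 * b * (b + 1.0) * (z - 1.0) ** 2
    tail = np.maximum(z, 1.0) ** (-b)
    return np.where(z >= 1.0, tail, head)

def ell_prime(z):
    z = np.asarray(z, dtype=float)
    head = -b + b * (b + 1.0) * (z - 1.0)
    tail = -b * np.maximum(z, 1.0) ** (-b - 1.0)
    return np.where(z >= 1.0, tail, head)

# Separable dataset with known margin: every point has first coordinate gamma
# and the second coordinates are symmetric around zero, so the unit vector
# (1, 0) attains margin exactly gamma.
m, gamma = 200, 0.1
X = np.column_stack([np.full(m, gamma), np.linspace(-0.5, 0.5, m)])

w, eta = np.zeros(2), 0.5
for _ in range(100000):
    w -= eta * (ell_prime(X @ w)[:, None] * X).mean(axis=0)

eps = ell(X @ w).mean()                         # empirical risk at the point reached
normalized = np.sort(X @ w / np.linalg.norm(w))  # normalized margins, ascending
for p in (0.5, 0.1):
    attained = normalized[int(np.ceil(p * m))]   # (1-p)m points have at least this margin
    bound = 0.5 * gamma * p ** (1.0 / b)
    print(f"p={p}: attained margin {attained:.4f}, bound {bound:.4f}")
```

On this toy dataset the attained margins comfortably exceed the bound, as expected from a worst-case guarantee.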
As a result, after sufficiently many iterations, we can pick p slightly below 1/m (say 1/(2m)), and get a margin bound of the form Ω(γ·m^(−1/b)) on all of the training examples (since getting a margin violation on less than a 1/m fraction of m examples is equivalent to no margin violations). This essentially recovers the result of Dudik et al. (2020) (up to the small difference of having b in the exponent instead of b + 1). However, our theorem covers a wider regime, since we can pick other values of p. For example, by picking p to be any constant, the theorem implies that we attain an Ω(γ) margin on a constant portion of the training data (which can be arbitrarily close to 1). Finally, we note that by upper bounding ϵ as a function of the number of iterations, the theorem above can be used to quantify how many iterations are required to achieve a certain margin on a certain percentage of the data points, as well as to derive margin-based generalization bounds, in a manner completely analogous to the results in Sec. 4. Since polynomially-tailed losses are not commonly used in practice, we do not pursue this further here.

7. Numerical Experiment

We conclude our paper by providing a numerical experiment supporting our theoretical results on a real-world dataset.

Figure 2: Running gradient descent on a linearly-separable subset of MNIST. Left figure: Train and test error (the train 0–1 error, train (γ/2)-margin error, and test 0–1 error), as a function of the number of iterations t. Right figure: Distance from the max-margin predictor w̄, as a function of t. Best viewed in color.

Specifically, we used a linearly-separable subset of the well-known MNIST digit recognition dataset, focusing on the binary classification task of distinguishing between the digits 4 and 7 using a linear predictor.
This dataset was constructed as follows: We took all 14117 examples of {4, 7} digits in the MNIST dataset, centered and rescaled to have norm at most 1. Then, we trained an unbiased linear classifier using gradient descent (with step size 1 and using the logistic loss) until all but 212 examples were correctly classified. After removing these 212 examples, we were left with a dataset of 13905 examples, which by construction is linearly separable. Using a quadratic programming solver, we computed the exact max-margin predictor w̄, with a margin of γ := 0.0187... Given this linearly-separable dataset, we randomly split it into a training set of size 10⁴, and a test set. We again ran gradient descent (with the logistic loss and step size of 1) on the training set. For each iterate, we computed the average 0–1 loss and the average (γ/2)-margin error over the training set (that is, the percentage of training examples (x, y) where y·x^⊤w_t ≤ 0 and y·x^⊤w_t/‖w_t‖ ≤ γ/2, respectively). We also computed the average 0–1 loss over the test set, as well as the distance ‖w_t/‖w_t‖ − w̄‖ to the max-margin predictor. The results are presented in Figure 2, from which several conclusions may be drawn: First, the train error (both the 0–1 and margin variants) goes to zero with the number of iterations t, as should be expected (cf. Corollary 10). Second, we see that the test 0–1 error decays with the training error, and actually remains quite close to it, appearing to converge to a small positive value. This is in complete agreement with the qualitative behavior predicted by our Thm. 11, where the test error remains comparable to the training error up to an unavoidable positive statistical term. In other words, at no point does gradient descent significantly overfit, regardless of the number of iterations. Third, the difference between the 0–1 training error and the (γ/2)-margin error is negligible.
This provides an indication that once an example is classified correctly, achieving a margin of Ω(γ) on it does not require significantly more iterations, on average (at least for this dataset, whose margin is rather small, as is typical of real-world datasets). This lends support to our argument in Sec. 5 that the α parameter in Corollary 10 can be taken to be close to 1. Fourth, our results substantiate the claim in the introduction that the overfitting behavior of gradient descent is dramatically better than what can be predicted solely based on convergence to the max-margin predictor w̄: Indeed, the normalized iterates w_t/‖w_t‖ converge to w̄ very slowly, and even after 10⁵ iterations on this simple dataset, we have ‖w_t/‖w_t‖ − w̄‖ ≥ 0.56. In contrast, to guarantee small training error based on distance from w̄ alone, we must have ‖w_t/‖w_t‖ − w̄‖ < γ = 0.0187... Finally, we note that since our theoretical bounds are worst-case over all datasets, they do not necessarily match the error curves quantitatively on this dataset. However, there exist datasets where they are quite tight up to constants, as evidenced by the constructions formally analyzed in Thm. 14 (see also Figure 1 for a graphical illustration).

Acknowledgments

This research is supported in part by European Research Council (ERC) grant 754705. We thank Matus Telgarsky and Ronen Eldan for very helpful discussions, and the anonymous reviewers for their insightful comments.

Appendix A. Additional Proofs

A.1 Proof of Lemma 7

We first argue that L̂(w_t) is monotonically decreasing in t. To see this, note that by the assumption that the derivative of ℓ (and hence the gradient of L̂) is µ-Lipschitz, we have

L̂(w_{t+1}) = L̂(w_t − η_t·∇L̂(w_t)) ≤ L̂(w_t) − η_t·‖∇L̂(w_t)‖² + (µ/2)·η_t²·‖∇L̂(w_t)‖² = L̂(w_t) − η_t·‖∇L̂(w_t)‖²·(1 − µη_t/2),

which implies that L̂(w_{t+1}) ≤ L̂(w_t) whenever η_t ≤ 2/µ, which is indeed assumed. This also implies that L̂(w*) ≤ L̂(w_T) ≤ L̂(w_t) for all t. Next, we argue that ‖w_t − w*‖ is monotonically decreasing in t.
To see this, note that by smoothness of L̂ and the monotonicity property above, for any t < T,

L̂(w*) − L̂(w_t) ≤ L̂(w_T) − L̂(w_t) ≤ L̂(w_{t+1}) − L̂(w_t) = L̂(w_t − η_t·∇L̂(w_t)) − L̂(w_t) ≤ −η_t·‖∇L̂(w_t)‖² + (µ/2)·η_t²·‖∇L̂(w_t)‖² = −η_t·(1 − η_tµ/2)·‖∇L̂(w_t)‖² ≤ −(η_t/2)·‖∇L̂(w_t)‖²,

where in the last step we used the fact that η_t ≤ 1/µ, hence 1 − η_tµ/2 ≥ 1/2. Overall, we get that

‖∇L̂(w_t)‖² ≤ (2/η_t)·(L̂(w_t) − L̂(w*)).

Employing this inequality, together with L̂(w_t) − L̂(w*) ≤ ∇L̂(w_t)^⊤(w_t − w*) (which follows from convexity of L̂), we have the following:

‖w_{t+1} − w*‖² − ‖w_t − w*‖² = ‖w_t − η_t·∇L̂(w_t) − w*‖² − ‖w_t − w*‖² = −2η_t·∇L̂(w_t)^⊤(w_t − w*) + η_t²·‖∇L̂(w_t)‖² ≤ −2η_t·(L̂(w_t) − L̂(w*)) + η_t²·(2/η_t)·(L̂(w_t) − L̂(w*)) = 0.

To complete the proof, note that since ‖w_t − w*‖ is monotonically decreasing, then for any t, it is at most ‖w_1 − w*‖ = ‖w*‖. Therefore, ‖w_t‖ ≤ ‖w_t − w*‖ + ‖w*‖ ≤ 2‖w*‖.

A.2 Proof of Corollary 10

By Lemma 8, we know that at iteration T, L̂(w_T) = ϵ for some

ϵ ≤ (1/T)·(1/γ + log₂(T))² ≤ (1/T)·(2·log₂(T)/γ)² = 4·log₂²(T)/(γ²T),

where we used the facts that γ must be at most 1 and 1 ≤ log₂(T) (since T > 4). By the theorem assumptions, it follows that ϵ < ℓ(0), so we must have w_T ≠ 0. Applying Thm. 9 (noting that ℓ has (1/4)-Lipschitz gradients and that ℓ⁻¹(z) = log(1/(exp(z) − 1)), which is between log(1/(2z)) and log(1/z) for all z ∈ (0, 1]), we get that for any p ∈ [ϵ/ℓ(0), 1], for at least (1 − p)m of the indices i,

x_i^⊤ w_T/‖w_T‖ > (γ/2)·log(p/(2ϵ))/log(1/ϵ) = (γ/2)·(1 − log(2/p)/log(1/ϵ)).

In particular, picking p = 2ϵ^α for some α ∈ [0, 1] (which satisfies p ≥ ϵ/ℓ(0) = ϵ/log(2)), and noting that p ≤ 2·(4·log₂²(T)/(γ²T))^α, the result follows.

A.3 Proof of Lemma 15

We will utilize the following easily-verified facts about the logistic loss ℓ(z) = log(1 + exp(−z)):

−1 ≤ ℓ′(z) < 0 for all z ∈ ℝ, and ℓ′(z) < −1/4 for all z ≤ 1.

Also, by definition of gradient descent, and denoting u_t := −(1/2)·r_t + 3γ·s_t, we have the following:

r_1 = s_1 = 0,
r_{t+1} = r_t − (1 − ϵ)·η·ℓ′(r_t) + (ϵη/2)·ℓ′(u_t),    (6)
s_{t+1} = s_t − 3γϵη·ℓ′(u_t).    (7)

The proof of Lemma 15 relies on the following key lemma:

Lemma 17 There exists an iteration index T̃ ≤ 32/(5η) + 1 for which the following hold: (i) r_t ≥ 15/16 for all t ≥ T̃; (ii) s_T̃ ≤ 3/10; (iii) u_t = −(1/2)·r_t + 3γ·s_t ≤ 0 for all t ≤ T̃.

Proof We start with the first item. If r_t ≤ 1, then by Eq. (6), Eq.
(7) and the assumption on ϵ,

r_{t+1} ≥ r_t + (1 − ϵ)·η·(1/4) − ϵη/2 = r_t + (η/4)·(1 − 3ϵ) ≥ r_t + 5η/32.

Since we start at r_1 = 0, we get that as long as r_t ≤ 1, r_t increases in increments of at least 5η/32. Thus, there is some index T̃ ≤ 32/(5η) + 1 at which r_T̃ ≥ 1 for the first time, and r_t is monotonically increasing for all t ≤ T̃. Once r_t ≥ 1, it cannot decrease to 15/16 or below in the following iteration, since by the update equation, r_{t+1} ≥ r_t + 0 − ϵη/2 ≥ r_t − 1/16, and if r_{t+1} ≤ 1, it must monotonically increase again as shown earlier. This establishes the first item in the lemma.

Turning to the second item, we note that s_1 = 0 and for any t, s_{t+1} ≤ s_t + 3γϵη ≤ s_t + 3·(1/8)·(1/8)·η = s_t + 3η/64. Since T̃ ≤ 32/(5η) + 1, it follows that s_T̃ ≤ (3η/64)·(32/(5η)) = 3/10.

Turning to the third item in the lemma, recall that u_t := −(1/2)·r_t + 3γ·s_t. Thus, we need to show u_t ≤ 0 for all t ≤ T̃. This trivially holds for t = 1. For any t < T̃, by the update equations for r_t, s_t,

u_{t+1} = u_t + ((1 − ϵ)η/2)·ℓ′(r_t) − (1/4 + 9γ²)·ϵη·ℓ′(u_t) ≤(∗) u_t − η·((1 − ϵ)/8 − (1/4 + 9γ²)·ϵ) <(∗∗) u_t,

where in (∗) we used the facts that −1 ≤ ℓ′(z) < 0 for any z, and that since t < T̃, r_t < 1 (see the definition of T̃ above), hence ℓ′(r_t) < −1/4; and in (∗∗) we used the assumptions ϵ ≤ 1/8 and γ ≤ 1/8. Overall, we get that u_t is monotonically decreasing for all t ≤ T̃. But we have u_1 = −(1/2)·r_1 + 3γ·s_1 = 0, hence u_t must be non-positive for all t ≤ T̃ as required.

With this lemma at hand, we turn to prove Lemma 15. According to the lemma, −(1/2)·r_t + 3γ·s_t > 0 can only occur for t > T̃ (in which case, r_t ≥ 15/16). This requires that s_t > r_t/(6γ) ≥ 15/(96γ). However, by the lemma, we have s_T̃ ≤ 3/10, and by the update rule for s_t, s_{t+1} ≤ s_t + 3γϵη ≤ s_t + 3γϵ. Thus, the number of additional iterations (after iteration T̃) required to make s_t > 15/(96γ) is at least

(15/(96γ) − 3/10)/(3γϵ) = 5/(96γ²ϵ) − 1/(10γϵ) ≥ 5/(96γ²ϵ) − 1/(80γ²ϵ) = 19/(480γ²ϵ),

where the last inequality used γ ≤ 1/8, as required.

References

Miroslav Dudik, Ziwei Ji, Robert Schapire, and Matus Telgarsky. Gradient descent follows the regularization path for general losses. In Conference on Learning Theory, 2020.

Suriya Gunasekar.
Characterizing implicit bias in terms of optimization geometry. In International Conference on Machine Learning, 2018.

Suriya Gunasekar, Jason D. Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pages 9461–9471, 2018.

Steve Hanneke and Aryeh Kontorovich. Optimality of SVM: Novel proofs and tighter bounds. Theoretical Computer Science, 796:99–113, 2019.

Ziwei Ji and Matus Telgarsky. Gradient descent aligns the layers of deep linear networks. In International Conference on Learning Representations, 2018a.

Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018b.

Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks. In International Conference on Learning Representations, 2019a.

Ziwei Ji and Matus Telgarsky. A refined primal-dual analysis of the implicit bias. arXiv preprint arXiv:1906.04540, 2019b.

David McAllester. Simplified PAC-Bayesian margin bounds. In Learning Theory and Kernel Machines, pages 203–215. Springer, 2003.

Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Pedro Henrique Pamplona Savarese, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3420–3428, 2019a.

Mor Shpigel Nacson, Nathan Srebro, and Daniel Soudry. Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3051–3059, 2019b.

Yurii Nesterov. Lectures on Convex Optimization, volume 137. Springer, 2018.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.