# Data-Dependent Stability of Stochastic Gradient Descent

Ilja Kuzborskij¹, Christoph H. Lampert²
¹University of Milan, Italy. ²IST Austria. Correspondence to: Ilja Kuzborskij.

**Abstract.** We establish a data-dependent notion of algorithmic stability for Stochastic Gradient Descent (SGD), and employ it to develop novel generalization bounds. This is in contrast to previous distribution-free algorithmic stability results for SGD, which depend on worst-case constants. By virtue of the data-dependent argument, our bounds provide new insights into learning with SGD on convex and non-convex problems. In the convex case, we show that the bound on the generalization error depends on the risk at the initialization point. In the non-convex case, we prove that the expected curvature of the objective function around the initialization point has a crucial influence on the generalization error. In both cases, our results suggest a simple data-driven strategy to stabilize SGD by pre-screening its initialization. As a corollary, our results allow us to show optimistic generalization bounds that exhibit fast convergence rates for SGD subject to a vanishing empirical risk and low noise of the stochastic gradient.

## 1. Introduction

Stochastic gradient descent (SGD) has become one of the workhorses of modern machine learning. In particular, it is the optimization method of choice for training highly complex and non-convex models, such as neural networks. When it was observed that these models generalize better (suffer less from overfitting) than classical machine learning theory suggests, a large theoretical interest emerged to explain this phenomenon. Given that SGD at best finds a local minimum of the non-convex objective function, it has been argued that all such minima might be equally good. However, at the same time, a large body of empirical work and tricks of the trade, such as early stopping, suggests that in practice one might not even reach a minimum, yet nevertheless observe excellent performance.

In this work we follow an alternative route that aims to directly analyze the generalization ability of SGD by studying how sensitive it is to small perturbations in the training set. This is known as the algorithmic stability approach (Bousquet & Elisseeff, 2002) and was used recently (Hardt et al., 2016) to establish generalization bounds for both convex and non-convex learning settings. To do so, they employed a rather restrictive notion of stability that does not depend on the data, but captures only intrinsic characteristics of the learning algorithm and global properties of the objective function. Consequently, their analysis results in worst-case guarantees that in some cases tend to be too pessimistic. As recently pointed out in (Zhang et al., 2017), deep learning might indeed be such a case, as this notion of stability is insufficient to give deeper theoretical insights, and a less restrictive one is desirable.

As our main contribution in this work, we establish that a data-dependent notion of algorithmic stability, very similar to the On-Average Stability of (Shalev-Shwartz et al., 2010), holds for SGD when applied to convex as well as non-convex learning problems. As a consequence, we obtain new generalization bounds that depend on the data-generating distribution and the initialization point of the algorithm.
For convex loss functions, the bound on the generalization error is essentially multiplicative in the risk at the initialization point whenever the noise of the stochastic gradient is not too high. For non-convex loss functions, besides the risk, it is also critically controlled by the expected second-order information about the objective function at the initialization point. We further corroborate our findings empirically and show that, indeed, the data-dependent generalization bound is tighter than its worst-case counterpart on non-convex objective functions. Finally, the nature of the data-dependent bounds allows us to state optimistic bounds that switch to a faster rate of convergence subject to a vanishing empirical risk.

In particular, our findings justify the intuition that SGD is more stable in less curved areas of the objective function and link this to its generalization ability. This also backs up numerous empirical findings in the deep learning literature that solutions with low generalization error occur in less curved regions. At the same time, in pessimistic scenarios, our bounds are no worse than those of (Hardt et al., 2016). Finally, we exemplify an application of our bounds and propose a simple yet principled transfer learning scheme for the convex and non-convex case, which is guaranteed to transfer from the best source of information. In addition, this approach can also be used to select a good initialization given a number of random starting positions. This is a theoretically sound alternative to the purely random initialization commonly used in non-convex learning.

The rest of the paper is organized as follows. We revisit the connection between stability and generalization of SGD in Section 3 and introduce a data-dependent notion of stability in Section 4. We state the main results in Section 5, in particular Theorem 3 for the convex case and Theorem 4 for the non-convex one. Next, we demonstrate empirically in Section 5.2.1 that the bound shown in Theorem 4 is tighter than the worst-case one. Finally, we suggest an application of these bounds by showcasing principled transfer learning approaches in Section 5.3, and we conclude in Section 7.

## 2. Related Work

Algorithmic stability has been a topic of interest in learning theory for a long time; however, the modern approach to the relationship between stability and generalization goes back to the milestone work of (Bousquet & Elisseeff, 2002). They analyzed several notions of stability, which fall into two categories: distribution-free and distribution-dependent ones. The first category is usually called uniform stability and focuses on the intrinsic stability properties of an algorithm without regard to the data-generating distribution. Uniform stability was used to analyze many algorithms, including regularized Empirical Risk Minimization (ERM) (Bousquet & Elisseeff, 2002), randomized aggregation schemes (Elisseeff et al., 2005), and recently SGD (Hardt et al., 2016; London, 2016; Poggio et al., 2011). Despite the fact that uniform stability has been shown to be sufficient to guarantee learnability, it can be too pessimistic, resulting in worst-case rates.

In this work we are interested in the data-dependent behavior of SGD, so the emphasis falls on the distribution-dependent notion of stability, known as on-average stability, explored thoroughly in (Shalev-Shwartz et al., 2010).
The attractive quality of this less restrictive type of stability is that the resulting bounds are controlled by how stable the algorithm is under the data-generating distribution. For instance, in (Bousquet & Elisseeff, 2002) and (Devroye & Wagner, 1979), on-average stability is related to the variance of an estimator. In (Shalev-Shwartz & Ben-David, 2014, Sec. 13), the authors show risk bounds that depend on the expected empirical risk of a solution to the regularized ERM. In turn, one can exploit this fact to state improved optimistic risk bounds, for instance ones that exhibit fast-rate regimes (Koren & Levy, 2015; Gonen & Shalev-Shwartz, 2017), or even to design enhanced algorithms that minimize these bounds in a data-driven way, e.g. by exploiting side information as in transfer learning (Kuzborskij & Orabona, 2013; Ben-David & Urner, 2013) and metric learning (Perrot & Habrard, 2015). Here, we mainly focus on the latter direction in the context of SGD: how stable is SGD under the data-generating distribution given an initialization point? We also touch on the former direction by taking advantage of our data-driven analysis and show optimistic bounds as a corollary.

We study the on-average stability of SGD for both convex and non-convex loss functions. In the convex setting, we relate stability to the risk at the initialization point, while previous data-driven stability arguments usually consider minimizers of convex ERM rather than a stochastic approximation (Shalev-Shwartz & Ben-David, 2014; Koren & Levy, 2015). Besides convex problems, our work also covers the generalization ability of SGD on non-convex problems. Here, we borrow techniques of (Hardt et al., 2016) and extend them to the distribution-dependent setting. That said, while the bounds of (Hardt et al., 2016) are stated in terms of worst-case quantities, ours reveal new connections to data-dependent second-order information. These new insights also partially justify empirical observations in deep learning about the link between the curvature and the generalization error (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017; Chaudhari et al., 2017). At the same time, our work is an alternative to theoretical studies of neural network objective functions (Choromanska et al., 2015; Kawaguchi, 2016), as we focus on the direct connection between generalization and curvature.

In this light, our work is also related to non-convex optimization by SGD. Literature on this subject typically studies rates of convergence to stationary points (Ghadimi & Lan, 2013; Allen-Zhu & Hazan, 2016; Reddi et al., 2016) and ways to avoid saddle points (Ge et al., 2015; Lee et al., 2016). However, unlike these works, and similarly to (Hardt et al., 2016), we are interested in the generalization ability of SGD, and thanks to the stability approach, stationary points need not enter our analysis.

Finally, we propose an example application of our findings in Transfer Learning (TL). For instance, by controlling the stability bound in a data-driven way, one can choose an initialization that leads to improved generalization. This is related to TL where one transfers from pre-trained models (Kuzborskij & Orabona, 2016; Tommasi et al., 2014; Pentina & Lampert, 2014; Ben-David & Urner, 2013), which is especially popular in deep learning due to its data-demanding nature (Galanti et al., 2016). Literature on this topic is mostly focused on the ERM setting and PAC bounds, while our analysis of SGD yields such guarantees as a corollary.
## 3. Stability of Stochastic Gradient Descent

First, we introduce the definitions used in the rest of the paper.

### 3.1. Definitions

We denote by small and capital bold letters, respectively, column vectors and matrices, e.g. $\mathbf{a} = [a_1, a_2, \ldots, a_d]^\top \in \mathbb{R}^d$ and $\mathbf{A} \in \mathbb{R}^{d_1 \times d_2}$; $\|\mathbf{a}\|$ is understood as the Euclidean norm and $\|\mathbf{A}\|_2$ as the spectral norm. We denote enumeration by $[n] = \{1, \ldots, n\}$ for $n \in \mathbb{N}$. We indicate an example space by $\mathcal{Z}$ and its members by $z \in \mathcal{Z}$. For instance, in a supervised setting $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the input and $\mathcal{Y}$ the output space of a learning problem. We assume that training and testing examples are drawn i.i.d. from a probability distribution $\mathcal{D}$ over $\mathcal{Z}$. In particular, we denote the training set by $S = \{z_i\}_{i=1}^m \sim \mathcal{D}^m$. For a parameter space $\mathcal{H}$, we define a learning algorithm as a map $A : \mathcal{Z}^m \mapsto \mathcal{H}$, and for brevity we write $A_S = A(S)$. In the following we assume $\mathcal{H} \subseteq \mathbb{R}^d$.

To measure the accuracy of a learning algorithm $A$, we use a loss function $f(\mathbf{w}, z)$, which measures the cost incurred by predicting with parameters $\mathbf{w} \in \mathcal{H}$ on an example $z$. The risk of $\mathbf{w}$ with respect to the distribution $\mathcal{D}$, and the empirical risk given a training set $S$, are defined as

$$R(\mathbf{w}) := \mathbb{E}_{z \sim \mathcal{D}}\big[f(\mathbf{w}, z)\big]\,, \qquad \widehat{R}_S(\mathbf{w}) := \frac{1}{m} \sum_{i=1}^m f(\mathbf{w}, z_i)\,.$$

Finally, define $R^\star := \inf_{\mathbf{w} \in \mathcal{H}} R(\mathbf{w})$.

### 3.2. Uniform Stability and Generalization

On an intuitive level, a learning algorithm is said to be stable whenever a small perturbation in the training set does not affect its outcome too much. Of course, there are a number of ways to formalize the perturbation and the extent of the change in the outcome, and we discuss some of them below. The most important consequence of a stable algorithm is that it generalizes from the training set to unseen data sampled from the same distribution. In other words, the difference between the risk $R(A_S)$ and the empirical risk $\widehat{R}_S(A_S)$ of the algorithm's output is controlled by a quantity that captures how stable the algorithm is. So, to observe good performance, i.e. a decreasing true risk, we must have a stable algorithm and a decreasing empirical risk (training error), the latter of which usually comes by design of the algorithm. In this work we focus on the stability of the Stochastic Gradient Descent (SGD) algorithm, and thus, as a consequence, we study its generalization ability.

Recently, (Hardt et al., 2016) used a stability argument to prove generalization bounds for learning with SGD. Specifically, the authors extended the notion of uniform stability, originally proposed by (Bousquet & Elisseeff, 2002), to accommodate randomized algorithms.

**Definition 1** (Uniform stability). A randomized algorithm $A$ is $\epsilon$-uniformly stable if for all datasets $S, S^{(i)} \in \mathcal{Z}^m$ such that $S$ and $S^{(i)}$ differ in the $i$-th example, we have

$$\sup_{z \in \mathcal{Z},\, i \in [m]} \Big\{ \mathbb{E}_A\big[f(A_S, z) - f(A_{S^{(i)}}, z)\big] \Big\} \le \epsilon\,.$$

Since SGD is a randomized algorithm, we have to cope with two sources of randomness: the data-generating process and the randomization of the algorithm $A$ itself; hence we have statements in expectation. The following theorem of (Hardt et al., 2016) shows that uniform stability implies generalization in expectation.

**Theorem 1.** Let $A$ be $\epsilon$-uniformly stable. Then,

$$\Big| \mathbb{E}_{S,A}\big[\widehat{R}_S(A_S) - R(A_S)\big] \Big| \le \epsilon\,.$$

Thus it suffices to characterize the uniform stability of an algorithm to state a generalization bound. In particular, (Hardt et al., 2016) showed generalization bounds for SGD under different assumptions on the loss function $f$.
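To make these quantities concrete, the following minimal sketch computes the empirical risk and a held-out estimate of the generalization gap; the loss signature `f(w, z)` and the use of a held-out sample as a proxy for the risk $R(\mathbf{w})$ are illustrative assumptions, not part of the formal setup.

```python
import numpy as np

def empirical_risk(f, w, S):
    """Empirical risk R_hat_S(w): average loss of parameters w over the set S."""
    return float(np.mean([f(w, z) for z in S]))

def generalization_gap_estimate(f, w, S_train, S_heldout):
    """Monte Carlo estimate of R(w) - R_hat_S(w): the (unobservable) risk R(w)
    is approximated by the average loss on held-out examples drawn from D."""
    return empirical_risk(f, w, S_heldout) - empirical_risk(f, w, S_train)
```

This mirrors the empirical estimate used later in Section 5.2.1, where the generalization error is measured as the absolute difference between validation and training average losses.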
Although these results hold in expectation, other forms of generalization bounds, such as high-probability ones, can be derived from the above (Shalev-Shwartz et al., 2010). Apart from SGD, uniform stability has been used before to prove generalization bounds for many learning algorithms (Bousquet & Elisseeff, 2002). However, these bounds typically suggest worst-case generalization rates and rather reflect intrinsic stability properties of an algorithm. In other words, uniform stability is oblivious to the data-generating process and any other side information, which might reveal scenarios where generalization occurs at a faster rate. In turn, these insights could motivate the design of improved learning algorithms. In the following we address some limitations of the analysis through uniform stability by using a less restrictive notion of stability. We extend the setting of (Hardt et al., 2016) by proving data-dependent stability bounds for convex and non-convex loss functions. In addition, we also take into account the initialization point of the algorithm as a form of supplementary information, and we dedicate special attention to its interplay with the data-generating distribution. Finally, we discuss situations where one can explicitly control the stability of SGD in a data-dependent way.

## 4. Data-Dependent Stability Bounds for SGD

In this section we describe a notion of data-dependent algorithmic stability, which allows us to state generalization bounds that depend not only on the properties of the learning algorithm, but also on additional parameters of the algorithm. We indicate such additional parameters by $\theta$, and therefore denote stability as a function $\epsilon(\theta)$. In particular, in the following we are interested in scenarios where $\theta$ describes the data-generating distribution and the initialization point of SGD.

**Definition 2** (On-average stability). A randomized algorithm $A$ is $\epsilon(\theta)$-on-average stable if

$$\sup_{i \in [m]} \Big\{ \mathbb{E}_A\Big[\mathbb{E}_{S,z}\big[f(A_S, z) - f(A_{S^{(i)}}, z)\big]\Big] \Big\} \le \epsilon(\theta)\,,$$

where $S \sim \mathcal{D}^m$ i.i.d. and $S^{(i)}$ is its copy with the $i$-th example replaced by $z \sim \mathcal{D}$.

Our definition of on-average stability resembles the notion introduced by (Shalev-Shwartz et al., 2010). The difference lies in the fact that we take the supremum over the index of the replaced example. A similar notion was also used by (Bousquet & Elisseeff, 2002) and later by (Elisseeff et al., 2005) for the analysis of randomized aggregation schemes; however, their definition involves the absolute difference of losses. The dependence on $\theta$ also bears similarity to the recent work of (London, 2016); however, there it is used in the context of uniform stability. The following theorem shows that an on-average stable randomized algorithm is guaranteed to generalize in expectation.

**Theorem 2.** Let an algorithm $A$ be $\epsilon(\theta)$-on-average stable. Then,

$$\mathbb{E}_{S}\,\mathbb{E}_A\big[R(A_S) - \widehat{R}_S(A_S)\big] \le \epsilon(\theta)\,.$$

## 5. Main Results

Before presenting our main results in this section, we discuss algorithmic details and assumptions. We study the following variant of SGD: given a training set $S = \{z_i\}_{i=1}^m \sim \mathcal{D}^m$ i.i.d., step sizes $\{\alpha_t\}_{t=1}^T$, random indices $I = \{j_t\}_{t=1}^T$, and an initialization point $\mathbf{w}_1$, perform updates

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha_t \nabla f(\mathbf{w}_t, z_{j_t})$$

for $T \le m$ steps. Moreover, we use the notation $\mathbf{w}_{S,t}$ to indicate the output of SGD run on a training set $S$ at step $t$. We assume that the indices in $I$ are sampled from the uniform distribution over $[m]$ without replacement, and that this is the only source of randomness for SGD.
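For concreteness, here is a minimal sketch of this SGD variant, assuming a gradient oracle `grad_f(w, z)` (a hypothetical helper, since the gradient is treated abstractly here); the step-size schedules correspond to those used in Theorems 3 and 4 below.

```python
import numpy as np

def sgd_single_pass(grad_f, S, w1, c, T, rng, schedule="sqrt"):
    """The SGD variant analyzed here: T <= m steps over indices sampled
    uniformly without replacement (a random permutation of the training set),
    starting from the initialization point w1."""
    assert T <= len(S)
    w = np.array(w1, dtype=float)
    indices = rng.permutation(len(S))[:T]  # the only source of randomness of SGD
    for t, j in enumerate(indices, start=1):
        # alpha_t = c / sqrt(t) for the convex case (Theorem 3);
        # alpha_t = c / t for the non-convex case (Theorem 4)
        alpha_t = c / np.sqrt(t) if schedule == "sqrt" else c / t
        w = w - alpha_t * grad_f(w, S[j])
    return w

# Example usage with an assumed gradient oracle:
# rng = np.random.default_rng(0)
# w_T = sgd_single_pass(grad_f, S, w1, c=1e-3, T=len(S), rng=rng)
```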
In practice this corresponds to permuting the training set before making a pass through it, as is commonly done in practical applications. We also assume that the variance of the stochastic gradients obeys

$$\mathbb{E}_z\Big[\big\|\nabla f(\mathbf{w}_{S,t}, z) - \nabla R(\mathbf{w}_{S,t})\big\|^2\Big] \le \sigma^2 \qquad \forall t \in [T]\,.$$

Next, we introduce the conditions on the loss function $f$ used in the following.

**Definition 3** (Lipschitz $f$). A loss function $f$ is $L$-Lipschitz if $\|\nabla f(\mathbf{w}, z)\| \le L$ for all $\mathbf{w} \in \mathcal{H}$ and $z \in \mathcal{Z}$. Note that this also implies $|f(\mathbf{w}, z) - f(\mathbf{v}, z)| \le L \|\mathbf{w} - \mathbf{v}\|$.

**Definition 4** (Smooth $f$). A loss function is $\beta$-smooth if for all $\mathbf{w}, \mathbf{v} \in \mathcal{H}$ and $z \in \mathcal{Z}$,

$$\|\nabla f(\mathbf{w}, z) - \nabla f(\mathbf{v}, z)\| \le \beta \|\mathbf{w} - \mathbf{v}\|\,,$$

which also implies

$$f(\mathbf{w}, z) - f(\mathbf{v}, z) \le \nabla f(\mathbf{v}, z)^\top (\mathbf{w} - \mathbf{v}) + \frac{\beta}{2}\,\|\mathbf{w} - \mathbf{v}\|^2\,.$$

**Definition 5** (Lipschitz Hessian). A loss function $f$ has a $\rho$-Lipschitz Hessian if for all $\mathbf{w}, \mathbf{v} \in \mathcal{H}$ and $z \in \mathcal{Z}$,

$$\big\|\nabla^2 f(\mathbf{w}, z) - \nabla^2 f(\mathbf{v}, z)\big\|_2 \le \rho\,\|\mathbf{w} - \mathbf{v}\|\,.$$

The last condition is occasionally used in the analysis of SGD (Ge et al., 2015) and holds whenever $f$ has a bounded third derivative. All presented theorems assume that the loss function is non-negative, Lipschitz, and $\beta$-smooth. Examples of such commonly used loss functions are the logistic/softmax losses and neural networks with sigmoid activations. Convexity of the loss function or Lipschitzness of the Hessian will only be required for some results, and we will note it where necessary. Proofs for all statements in this section are given in the arXiv version of the paper (https://arxiv.org/abs/1703.01678).

### 5.1. Convex Losses

First, we present a new, data-dependent stability result for convex losses.

**Theorem 3.** Assume that $f$ is convex, and that SGD's step sizes satisfy $\alpha_t = \frac{c}{\sqrt{t}} \le \frac{1}{\beta}$ for all $t \in [T]$. Then SGD is $\epsilon(\mathcal{D}, \mathbf{w}_1)$-on-average stable with

$$\epsilon(\mathcal{D}, \mathbf{w}_1) = O\!\left(\Big(\sqrt{c\,(R(\mathbf{w}_1) - R^\star)} + c\,\sigma\Big)\,\frac{\sqrt{T}}{m}\right).$$

Under the same assumptions, taking a step size of order $O(1/\sqrt{t})$, Theorem 3.7 of (Hardt et al., 2016) implies a uniform stability bound $\epsilon = O(\sqrt{T}/m)$. Our bound differs since it involves a multiplicative risk at the initialization point. Thus, our bound corroborates the intuition that whenever we start at a good location of the objective function, the algorithm is more stable and therefore generalizes better. However, this is only the case whenever the variance of the stochastic gradient $\sigma^2$ is not too large. In the deterministic case, and if $R(\mathbf{w}_1) = R^\star$, the theorem confirms that SGD, in expectation, does not need to make any updates and is therefore perfectly stable. On the other hand, when the variance $\sigma^2$ is large enough to make the second summand in Theorem 3 dominant, the bound does not offer an improvement over (Hardt et al., 2016). Note that a result of this type cannot be obtained through the more restrictive uniform stability, precisely because such bounds on the stability must hold even for a worst-case choice of data distribution and initialization. In contrast, the notion of stability we employ depends on the data-generating distribution, which allows us to introduce a dependency on the risk. Furthermore, suppose we start at an arbitrary location $\mathbf{w}_1$: assuming that the loss function is bounded for a concrete $\mathcal{H}$ and $\mathcal{Z}$, the rate of our bound, up to a constant, is no worse than that of (Hardt et al., 2016). Finally, one can always tighten this result by taking the minimum of the two bounds.

### 5.2. Non-convex Losses

Now we state a new stability result for non-convex losses.

**Theorem 4.** Assume that $f(\cdot, z) \in [0, 1]$ has a $\rho$-Lipschitz Hessian, and that step sizes of the form $\alpha_t = \frac{c}{t}$ satisfy $c \le \min\!\big\{\frac{1}{\beta},\ \frac{1}{4(2\beta \ln T)^2}\big\}$.
Then SGD is $\epsilon(\mathcal{D}, \mathbf{w}_1)$-on-average stable with

$$\epsilon(\mathcal{D}, \mathbf{w}_1) \le \frac{1 + \frac{1}{c\gamma}}{m}\,\big(2 c L^2\big)^{\frac{1}{1 + c\gamma}}\,\Big(\mathbb{E}_{S,A}\big[R(A_S)\big]\, T\Big)^{\frac{c\gamma}{1 + c\gamma}}\,, \tag{1}$$

where

$$\gamma := O\!\left(\min\Big\{\beta,\ \mathbb{E}_z\big\|\nabla^2 f(\mathbf{w}_1, z)\big\|_2 + \Delta_{1,\sigma^2}\Big\}\right), \qquad \Delta_{1,\sigma^2} := \rho\Big(c\sigma + \sqrt{c\,(R(\mathbf{w}_1) - R^\star)}\Big)\,. \tag{2}$$

In particular, $\gamma$ characterizes how the curvature at the initialization point affects stability, and hence the generalization error of SGD. Since $\gamma$ heavily affects the rate of convergence in (1), and in most situations a smaller $\gamma$ yields higher stability, we now look at a few cases of its behavior. Consider a regime such that $\gamma$ is of the order

$$\Theta\Big(\mathbb{E}_z\big[\|\nabla^2 f(\mathbf{w}_1, z)\|_2\big] + \sqrt{R(\mathbf{w}_1)} + \sigma\Big)\,,$$

or in other words, that stability is controlled by the curvature, the risk at the initialization point $\mathbf{w}_1$, and the variance of the stochastic gradient $\sigma^2$. This suggests that starting from a point in a less curved region with low risk should yield higher stability, and therefore, as predicted by our theory, allow for faster generalization. In addition, we observe that this stability regime offers a principled way to pre-screen a good initialization point in practice, by choosing the one that minimizes the spectral norm of the Hessian and the risk.

Next, we focus on a more specific case. Suppose that we choose a step size $\alpha_t = \frac{c}{t}$ such that $\gamma = \Theta\big(\mathbb{E}_z[\|\nabla^2 f(\mathbf{w}_1, z)\|_2]\big)$, yet not too small, so that the empirical risk can still be decreased. Then, stability is dominated by the curvature around $\mathbf{w}_1$. Indeed, lower generalization errors on non-convex problems, such as training deep neural networks, have been observed empirically when SGD is actively guided (Hochreiter & Schmidhuber, 1997; Goodfellow et al., 2016; Chaudhari et al., 2017) or converges to solutions with low curvature (Keskar et al., 2017). However, to the best of our knowledge, Theorem 4 is the first to establish a theoretical link between the curvature of the loss function and the generalization ability of SGD in a data-dependent sense.

Theorem 4 immediately implies the following statement, which further reinforces the effect of the initialization point on the generalization error, assuming that $\mathbb{E}_{S,A}[R(A_S)] \le R(\mathbf{w}_1)$.

**Corollary 1.** Under the conditions of Theorem 4, SGD is $\epsilon(\mathcal{D}, \mathbf{w}_1)$-on-average stable with

$$\epsilon(\mathcal{D}, \mathbf{w}_1) = O\!\left(\frac{1 + \frac{1}{c\gamma}}{m}\,\big(R(\mathbf{w}_1)\, T\big)^{\frac{c\gamma}{1 + c\gamma}}\right). \tag{3}$$

We take a moment to discuss the role of the risk term in $(R(\mathbf{w}_1)\, T)^{\frac{c\gamma}{1+c\gamma}}$. Observe that $\epsilon(\mathcal{D}, \mathbf{w}_1) \to 0$ as $R(\mathbf{w}_1) \to 0$; in other words, the generalization error approaches zero as the risk of the initialization point vanishes. This is an intuitive behavior; however, uniform stability does not capture it due to its distribution-free nature.

Finally, we note that (Hardt et al., 2016, Theorem 3.8) showed a bound similar to (1); however, in place of $\gamma$ their bound has the Lipschitz constant of the gradient. The crucial difference lies in the term $\gamma$, which is not merely a Lipschitz constant, but depends on the data-generating distribution and the initialization point of SGD. We compare to their bound by considering the worst-case scenario, namely, that SGD is initialized at a point with high curvature, or altogether, that the objective function is highly curved everywhere. Then, at least, our bound is no worse than the one of (Hardt et al., 2016), since $\gamma \le \beta$. Finally, it should be noted that our bound can be compared to the one of (Hardt et al., 2016) only in the single-pass setting. In a multiple-pass case, the data-dependent analysis in its current form would not hold, since the output of SGD would not be independent of a newly observed example after the first pass. On the other hand, our results focus on the gains due to data-dependent initialization.
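As a sketch of this pre-screening strategy, one can score each candidate initialization by an empirical surrogate of $\gamma$ in the regime above, i.e. estimated curvature plus $\sqrt{\text{risk}}$ (the $\sigma$ term is common to all candidates and can be dropped). Here `risk_estimate` and `curvature_estimate` are hypothetical helpers, e.g. a validation-set average loss and the power-iteration sketch following Section 5.2.1 below.

```python
import numpy as np

def prescreen_initialization(candidates, risk_estimate, curvature_estimate):
    """Pick the candidate initialization w1 minimizing an empirical surrogate of
    gamma ~ E||Hessian f(w1, z)||_2 + sqrt(R(w1)) + sigma from Theorem 4; the
    sigma term is independent of w1 and is therefore omitted from the score."""
    scores = [curvature_estimate(w) + np.sqrt(risk_estimate(w)) for w in candidates]
    return candidates[int(np.argmin(scores))]
```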
Theorem 4 also allows us to prove an optimistic generalization bound for learning with SGD on non-convex objectives.

**Corollary 2.** Under the conditions of Theorem 4, the output of SGD obeys

$$\mathbb{E}_{S,A}\big[R(A_S) - \widehat{R}_S(A_S)\big] \le O\!\left(\frac{1 + \frac{1}{c\gamma}}{m}\,\max\bigg\{\Big(\mathbb{E}_{S,A}\big[\widehat{R}_S(A_S)\big]\, T\Big)^{\frac{c\gamma}{1 + c\gamma}},\ \Big(\frac{T}{m}\Big)^{c\gamma}\bigg\}\right).$$

An important consequence of Corollary 2 is that for a vanishing expected empirical risk, in particular for $\mathbb{E}_{S,A}[\widehat{R}_S(A_S)] = O\big(T^{c\gamma} / m^{1+c\gamma}\big)$, the generalization error behaves as $O\big(T^{c\gamma} / m^{1+c\gamma}\big)$. Considering a full pass, that is $m = O(T)$, we obtain an optimistic generalization error of order $O(1/m)$ instead of $O\big(m^{-\frac{1}{1+c\gamma}}\big)$. We note that PAC bounds with a similar optimistic message (although not directly comparable), but without curvature information, can also be obtained through empirical Bernstein bounds as in (Maurer & Pontil, 2009). However, a PAC bound does not suggest a way to minimize the non-convex empirical risk in general, where SGD is known to work reasonably well.

#### 5.2.1. Tightness of Non-convex Bounds

Next, we empirically assess the tightness of our non-convex generalization bounds on real data. In the following experiment we train a neural network with three convolutional layers interlaced with max-pooling, followed by a fully connected layer with 16 units, on the MNIST dataset. This results in a model with 18K parameters. Figure 1 compares our data-dependent bound (1) to the distribution-free one of (Hardt et al., 2016, Theorem 3.8).

*Figure 1. Comparison of data-dependent and uniform generalization bounds evaluated by training a neural network.*

As a reference, we also include an empirical estimate of the generalization error, taken as the absolute difference of the validation and training average losses. Since our bound also depends on the initialization point, we plot (1) for multiple "warm starts", i.e. with SGD initialized from a pre-trained position. We consider 7 such warm starts, one every 200 steps, and report the data-dependent quantities used to compute (1) just beneath the graph. Our first observation is that, clearly, the data-dependent bound gives a tighter estimate, by roughly one order of magnitude. Second, simulating a start from a pre-trained position suggests even tighter estimates: we suspect that this is due to the decreasing validation error, which is used as an empirical estimate for $R(\mathbf{w}_1)$ and thus affects bound (1). We compute an empirical estimate of the expected Hessian spectral norm by the power iteration method, using an efficient Hessian-vector multiplication method (Pearlmutter, 1994). Since the bounds depend on the constants $L$, $\beta$, and $\rho$, we estimate them by tracking the maximal values of the gradient and Hessian norms throughout optimization. We compute the bounds with the estimates $\hat{L} = 78.72$, $\hat{\beta} = 1692.28$, $\hat{\rho} = 3823.73$, and $c = 10^{-3}$.
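A minimal sketch of this estimator follows. It replaces Pearlmutter's exact Hessian-vector product with a finite-difference approximation $\nabla^2 f(\mathbf{w}, z)\,\mathbf{v} \approx \big(\nabla f(\mathbf{w} + \varepsilon \mathbf{v}, z) - \nabla f(\mathbf{w}, z)\big)/\varepsilon$, so that only a gradient oracle `grad_f` is needed; in an autodiff framework one would use the exact product instead. The oracle and the iteration count are illustrative assumptions.

```python
import numpy as np

def expected_hessian_spectral_norm(grad_f, w, S, n_iter=20, eps=1e-5, seed=0):
    """Estimate E_z ||Hessian f(w, z)||_2 by averaging per-example power-iteration
    estimates; each Hessian-vector product is a finite difference of gradients."""
    rng = np.random.default_rng(seed)
    estimates = []
    for z in S:
        g0 = grad_f(w, z)
        v = rng.standard_normal(w.shape)
        v /= np.linalg.norm(v)
        lam = 0.0
        for _ in range(n_iter):
            hv = (grad_f(w + eps * v, z) - g0) / eps  # ~ Hessian(w, z) @ v
            lam = np.linalg.norm(hv)                  # current spectral-norm estimate
            if lam == 0.0:
                break
            v = hv / lam                              # power-iteration step
        estimates.append(lam)
    return float(np.mean(estimates))
```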
### 5.3. Application to Transfer Learning

One example application of the data-dependent bounds presented above lies in Transfer Learning (TL), where we are interested in achieving faster generalization on a target task by exploiting side information that originates from different but related source tasks. The literature on TL has explored many ways to do so, and here we focus on the one that is most compatible with our bounds. More formally, suppose that the target task at hand is characterized by a joint probability distribution $\mathcal{D}$, and as before we have a training set $S \sim \mathcal{D}^m$ i.i.d. Some TL approaches also assume access to data sampled from the distributions associated with the source tasks. Here we follow a conservative approach: instead of the source data, we receive a set of source hypotheses $\{\mathbf{w}^{\mathrm{src}}_k\}_{k=1}^K \subset \mathcal{H}$, trained on the source tasks. The goal of the learner is to come up with a target hypothesis which, in the optimistic scenario, generalizes better by relying on the source hypotheses. In the TL literature this is known as Hypothesis Transfer Learning (HTL) (Kuzborskij & Orabona, 2016): we transfer from the source hypotheses, which act as a proxy to the source tasks, and the risk $R(\mathbf{w}^{\mathrm{src}}_k)$ quantifies how related the source and target tasks are. In the following we consider SGD for HTL, where the source hypotheses act as initialization points.

First, consider learning with convex losses: Theorem 3 depends on $R(\mathbf{w}_1)$, so it immediately quantifies the relatedness of the source and target tasks, and it is enough to pick the point that minimizes the stability bound to transfer from the most related source. Then, bounding $R(\mathbf{w}^{\mathrm{src}}_k)$ by $\widehat{R}_S(\mathbf{w}^{\mathrm{src}}_k)$ through the Hoeffding bound, together with a union bound, gives, with high probability and up to factors not depending on $k$,

$$\min_{k \in [K]} \epsilon(\mathcal{D}, \mathbf{w}^{\mathrm{src}}_k) \le \min_{k \in [K]} O\big(\widehat{R}_S(\mathbf{w}^{\mathrm{src}}_k)\big)\,.$$

Hence, the most related source is the one that simply minimizes the empirical risk. Similar conclusions were drawn in the HTL literature, albeit in the context of ERM.

Matters are slightly more complicated in the non-convex case. We take a similar approach; however, now we minimize the stability bound (3), and for the sake of simplicity assume that we make a full pass over the data, so $T = m$. Minimizing the following empirical upper bound selects the best source.

**Proposition 1.** Let

$$\hat{\gamma}_k = \Theta\!\left(\frac{1}{m}\sum_{i=1}^m \big\|\nabla^2 f(\mathbf{w}^{\mathrm{src}}_k, z_i)\big\|_2 + \sqrt{\widehat{R}_S(\mathbf{w}^{\mathrm{src}}_k)} + \sqrt[4]{\log(K)/m}\right).$$

Then, with high probability, the generalization error of SGD initialized at the selected $\mathbf{w}^{\mathrm{src}}_k$ is bounded by

$$\min_{k \in [K]} O\!\left(\frac{1 + \frac{1}{c\hat{\gamma}_k}}{m}\,\Big(\widehat{R}_S(\mathbf{w}^{\mathrm{src}}_k)\, m\Big)^{\frac{c\hat{\gamma}_k}{1 + c\hat{\gamma}_k}}\right).$$

Note that $\hat{\gamma}_k$ involves an estimate of the spectral norm of the Hessian, which is computationally cheaper to evaluate than the complete Hessian matrix (Pearlmutter, 1994). This is particularly relevant for deep learning, where computation of the full Hessian matrix can be prohibitively expensive.
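A sketch of the resulting source-selection rule might look as follows, plugging the empirical quantities of Proposition 1 into bound (3) with $T = m$; `empirical_risk` and `expected_hessian_spectral_norm` refer to the earlier sketches, and the $\Theta(\cdot)$ constant is set to one for illustration.

```python
import numpy as np

def select_source_hypothesis(sources, f, grad_f, S, c):
    """Choose the source hypothesis minimizing the empirical bound of
    Proposition 1 (full pass, T = m); returns the index of the best source."""
    m, K = len(S), len(sources)
    best_k, best_bound = 0, np.inf
    for k, w_src in enumerate(sources):
        risk = empirical_risk(f, w_src, S)
        gamma_hat = (expected_hessian_spectral_norm(grad_f, w_src, S)
                     + np.sqrt(risk) + (np.log(K) / m) ** 0.25)
        q = c * gamma_hat
        bound = (1.0 + 1.0 / q) / m * (risk * m) ** (q / (1.0 + q))
        if bound < best_bound:
            best_k, best_bound = k, bound
    return best_k
```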
## 6. Proof Outline of Theorem 4

In this section we discuss the proof of Theorem 4, which bounds the data-dependent stability for non-convex losses. We say that the SGD gradient update rule is an operator $G_t : \mathcal{H} \mapsto \mathcal{H}$ such that

$$G_t(\mathbf{w}) := \mathbf{w} - \alpha_t \nabla f(\mathbf{w}, z_{j_t})\,,$$

which is also a function of the training set $S$ and the random index set $I$. Then $\mathbf{w}_{t+1} = G_t(\mathbf{w}_t)$ for $t = 1, \ldots, T$. Recall the notation $\mathbf{w}_{S,t}$ for the output of SGD run on a training set $S$ at step $t$, and define

$$\delta_t(S, z) := \|\mathbf{w}_{S,t} - \mathbf{w}_{S^{(i)},t}\|\,.$$

The following property of $G_t$ is central to our proof.

**Definition 6** (Expansiveness). A gradient update rule is $\eta$-expansive if for all $\mathbf{w}, \mathbf{v}$, $\|G_t(\mathbf{w}) - G_t(\mathbf{v})\| \le \eta \|\mathbf{w} - \mathbf{v}\|$.

The following lemma characterizes the expansiveness of the gradient update rule under different assumptions on $f$.

**Lemma 1** ((Hardt et al., 2016)). Assume that $f$ is $\beta$-smooth. Then $G_t$ is $(1 + \alpha_t \beta)$-expansive.

The next lemma is similar to Lemma 3.11 of (Hardt et al., 2016) and is instrumental in bounding the stability of SGD. However, we make an adjustment and state it in expectation over the data. Note that it does not require convexity of the loss function.

**Lemma 2.** Assume that the loss function $f(\cdot, z) \in [0,1]$ is $L$-Lipschitz for all $z$. Then, for every $t_0 \in \{0, 1, 2, \ldots, m\}$, we have

$$\mathbb{E}_{S,z}\,\mathbb{E}_A\big[f(\mathbf{w}_{S,T}, z) - f(\mathbf{w}_{S^{(i)},T}, z)\big] \tag{4}$$

$$\le\ L\,\mathbb{E}_{S,z}\Big[\mathbb{E}_A\big[\delta_T(S, z)\,\big|\,\delta_{t_0}(S, z) = 0\big]\Big] + \mathbb{E}_{S,A}\big[R(A_S)\big]\,\frac{t_0}{m}\,. \tag{5}$$

We spend a moment to highlight the role of the conditional expectation. Observe that we could naively bound (4) by the Lipschitzness of $f$, but Lemma 2 follows a more careful argument. First note that $t_0$ is a free parameter. The expected distance in (5) between the SGD outputs $\mathbf{w}_{S,t}$ and $\mathbf{w}_{S^{(i)},t}$ is conditioned on the fact that at step $t_0$ the outputs of SGD are still the same. This means that the perturbed point is encountered after step $t_0$. The conditional expectation should then be a decreasing function of $t_0$: the later the perturbation occurs, the smaller the deviation between $\mathbf{w}_{S,T}$ and $\mathbf{w}_{S^{(i)},T}$ we should expect. Eventually we minimize (5) over $t_0$.

### 6.1. Non-convex Losses

Our proof of the stability bound for non-convex loss functions, Theorem 4, follows the general outline of (Hardt et al., 2016, Theorem 3.8). Namely, the outputs of SGD run on a training set $S$ and its perturbed version $S^{(i)}$ will not differ too much, because by the time a perturbation is encountered, the step size has already decayed enough. So, on the one hand, stabilization is enforced by the diminishing step size, and on the other hand, by how much the updates expand the distance between the iterates after the perturbation. Since (Hardt et al., 2016) work with uniform stability, they capture the expansiveness of the post-perturbation updates by the Lipschitz constant of the gradient. In combination with a recursive argument, their bound has an exponential dependency on this Lipschitz constant. We argue that the Lipschitz continuity of the gradient can be too pessimistic in general. Instead, we rely on a local, data-driven argument: considering that we initialize SGD at a point $\mathbf{w}_1$, how much do the updates expand the distance between iterates under the distribution of interest? The following crucial lemma characterizes this behavior in terms of the curvature at $\mathbf{w}_1$.

**Lemma 3.** Assume that the loss function $f(\cdot, z)$ is $\beta$-smooth and that its Hessian is $\rho$-Lipschitz. Then,

$$\big\|G_t(\mathbf{w}_{S,t}) - G_t(\mathbf{w}_{S^{(i)},t})\big\| \le \big(1 + \alpha_t\,\xi_t(S, z)\big)\,\delta_t(S, z)\,.$$

Furthermore, for any $t \in [T]$,

$$\mathbb{E}_{S,z}\big[\xi_t(S, z)\big] = O\!\left(\mathbb{E}_{S,z}\big\|\nabla^2 f(\mathbf{w}_1, z_t)\big\|_2 + \Delta_{1,\sigma^2}\right), \qquad \Delta_{1,\sigma^2} := \rho\Big(c\sigma + \sqrt{c\,(R(\mathbf{w}_1) - R^\star)}\Big)\,.$$

Next, we need the following statement.

**Proposition 2** (Bernstein-type inequality). Let $Z$ be a zero-mean real-valued random variable such that $|Z| \le b$ and $\mathbb{E}[Z^2] \le \sigma^2$. Then, for all $|c| \le \frac{1}{2b}$, we have $\mathbb{E}\big[e^{cZ}\big] \le e^{c^2 \sigma^2}$.
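Proposition 2 follows from a standard moment-generating-function argument; for instance, using $e^x \le 1 + x + x^2$ for $x \le \tfrac{1}{2}$:

$$|cZ| \le \tfrac{1}{2} \ \Longrightarrow\ \mathbb{E}\big[e^{cZ}\big] \le 1 + c\,\mathbb{E}[Z] + c^2\,\mathbb{E}[Z^2] \le 1 + c^2\sigma^2 \le e^{c^2\sigma^2}\,.$$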
Now we are ready to prove Theorem 4.

*Sketch proof of Theorem 4.* We start from Lemma 2. Most of the proof is dedicated to bounding the first term in (5). We deal with this similarly to (Hardt et al., 2016). Specifically, we state the bound on $\mathbb{E}_A[\delta_T(S,z) \mid \delta_{t_0}(S,z)=0]$ by using a recursion. In our case, however, we also have an expectation w.r.t. the data, and to avoid complications with dependencies, we first unroll the recursion for the random quantities, and only then take the expectation. At this point the proof crucially relies on the product of exponentials arising from the recursion, and all relevant random quantities end up inside them. We handle this by Proposition 2. Finally, we conclude by minimizing (5) w.r.t. $t_0$. Thus we have three steps: 1) recursion, 2) bounding $\mathbb{E}[\exp(\cdot)]$, and 3) tuning of $t_0$.

**1) Recursion.** We begin by stating the bound on $\mathbb{E}_A[\delta_T(S,z) \mid \delta_{t_0}(S,z)=0]$ by recursion. We first bound $\mathbb{E}_A[\delta_{t+1}(S,z) \mid \delta_{t_0}(S,z)=0]$ in terms of $\mathbb{E}_A[\delta_t(S,z) \mid \delta_{t_0}(S,z)=0]$ and other relevant quantities, and then unravel the recursion. We distinguish two cases: 1) SGD encounters the perturbed example at step $t$, that is $j_t = i$, and 2) the current example is the same in $S$ and $S^{(i)}$, that is $j_t \ne i$. In the first case, we use the worst-case boundedness of $G_t$, observing that $\|G_t(\mathbf{w}_{S,t}) - G_t(\mathbf{w}_{S^{(i)},t})\| \le \delta_t(S,z) + 2\alpha_t L$. In the second case we use Lemma 3, namely $\|G_t(\mathbf{w}_{S,t}) - G_t(\mathbf{w}_{S^{(i)},t})\| \le (1 + \alpha_t \xi_t(S,z))\,\delta_t(S,z)$. In addition, as a safety measure, we also take into account that the gradient update rule is at most $(1 + \alpha_t \beta)$-expansive by Lemma 1, so we work with the function $\psi_t(S,z) := \min\{\xi_t(S,z), \beta\}$ instead of $\xi_t(S,z)$. Now, introduce $\Delta_t(S,z) := \mathbb{E}_A[\delta_t(S,z) \mid \delta_{t_0}(S,z)=0]$ and decompose the expectation w.r.t. $A$ at step $t$. Noting that SGD encounters the perturbed example with probability $\frac{1}{m}$,

$$\Delta_{t+1}(S,z) \le \Big(1 - \frac{1}{m}\Big)\big(1 + \alpha_t \psi_t(S,z)\big)\,\Delta_t(S,z) + \frac{1}{m}\big(2\alpha_t L + \Delta_t(S,z)\big)$$

$$\le \big(1 + \alpha_t \psi_t(S,z)\big)\,\Delta_t(S,z) + \frac{2\alpha_t L}{m} \le \exp\big(\alpha_t \psi_t(S,z)\big)\,\Delta_t(S,z) + \frac{2\alpha_t L}{m}\,, \tag{6}$$

where the last inequality follows from $1 + x \le \exp(x)$. This inequality is not overly loose for $x \in [0,1]$, and in our case it becomes instrumental in handling the recursion. Now, observe that the relation $x_{t+1} \le a_t x_t + b_t$ with $x_{t_0} = 0$ unwinds from $T$ to $t_0$ as $x_T \le \sum_{t=t_0+1}^T b_t \prod_{k=t+1}^T a_k$. Consequently, having $\Delta_{t_0}(S,z) = 0$ and $\alpha_t = \frac{c}{t}$, we unwind (6) to get

$$\Delta_T(S,z) \le \sum_{t=t_0+1}^{T} \frac{2cL}{tm} \prod_{k=t+1}^{T} \exp\!\Big(\frac{c\,\psi_k(S,z)}{k}\Big)\,. \tag{7}$$

**2) Bounding $\mathbb{E}[\exp(\cdot)]$.** We take the expectation w.r.t. $S$ and $z$ on both sides and focus on the expectation of the exponentials in (7). First, introduce $\mu_k := \mathbb{E}_{S,z}[\psi_k(S,z)]$, and observe that the zero-mean version of $\psi_k(S,z)$ is bounded as $\sum_{k=t+1}^T \frac{1}{k}\,|\psi_k(S,z) - \mu_k| \le 2\beta \ln T$. In the setting $c \le \frac{1}{2(2\beta \ln T)^2}$, we apply Proposition 2 and get

$$\mathbb{E}_{S,z}\Bigg[\prod_{k=t+1}^T \exp\!\Big(\frac{c\,\psi_k(S,z)}{k}\Big)\Bigg] \le \exp\!\Bigg(2c \sum_{k=t+1}^T \frac{\mu_k}{k}\Bigg)\,, \tag{8}$$

where we bounded the variance by $\mu_k$ thanks to the setting of $c$. Next, we give an upper bound on $\mu_k$, namely $\mu_k \le \min\{\beta, \mathbb{E}_{S,z}[\xi_k(S,z)]\}$. Finally, we bound $\mathbb{E}_{S,z}[\xi_k(S,z)]$ using the second result of Lemma 3, which holds for any $k \in [T]$, to get $\mu_k \le \gamma$, with $\gamma$ defined in (2).

**3) Tuning of $t_0$.** Now we turn our attention back to (7). Having taken the expectation w.r.t. the data, we use (8) and the fact that $\mu_k \le \gamma$ to get

$$\mathbb{E}_{S,z}\big[\Delta_T(S,z)\big] \le \sum_{t=t_0+1}^{T} \frac{2cL}{tm}\,\exp\!\Big(2c\gamma \ln\frac{T}{t}\Big)\,. \tag{9}$$

Plugging the above into (5) yields a bound on $\mathbb{E}_{S,z}\,\mathbb{E}_A\big[f(\mathbf{w}_{S,T}, z) - f(\mathbf{w}_{S^{(i)},T}, z)\big]$ as a function of $t_0$. Let $q = 2c\gamma$. Then, setting $t_0 = (2cL^2)^{\frac{1}{1+q}}\, T^{\frac{q}{1+q}}$ minimizes (9). Plugging $t_0$ back, we get that (9) equals

$$\frac{1 + \frac{1}{q}}{m}\,\big(2cL^2\big)^{\frac{1}{1+q}}\, T^{\frac{q}{1+q}}\,.$$

This completes the proof. ∎

## 7. Conclusions and Future Work

In this work we proved data-dependent stability bounds for SGD and revisited its generalization ability. We presented novel bounds for convex and non-convex smooth loss functions, partially controlled by data-dependent quantities, while previous stability bounds for SGD were derived through a worst-case analysis. In particular, for non-convex learning, we demonstrated theoretically that the generalization of SGD is heavily affected by the expected curvature around the initialization point. We demonstrated empirically that our bound is indeed tighter than the uniform one. In addition, our data-dependent analysis also allowed us to show optimistic bounds on the generalization error of SGD, which exhibit fast rates subject to a vanishing empirical risk of the algorithm's output. In future work we intend to explore our theoretical findings further experimentally and to evaluate the feasibility of transfer learning based on second-order information. Another direction lies in making our bounds adaptive. So far we have presented bounds that have data-dependent components; however, the step size cannot yet be adjusted depending on the data,
e.g. as in (Zhao & Zhang, 2015). This was partially addressed by (London, 2016), albeit in the context of uniform stability, and we plan to extend this idea to the context of data-dependent stability.

## Acknowledgments

This work was in part funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 637076), and in part by the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement no. 308036.

## References

Allen-Zhu, Z. and Hazan, E. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning (ICML), pp. 699-707, 2016.

Ben-David, S. and Urner, R. Domain adaptation as learning with auxiliary information. NIPS Workshop on New Directions in Transfer and Multi-Task, 2013.

Bousquet, O. and Elisseeff, A. Stability and generalization. Journal of Machine Learning Research, 2:499-526, 2002.

Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. Entropy-SGD: Biasing gradient descent into wide valleys. In International Conference on Learning Representations (ICLR), 2017.

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. The loss surfaces of multilayer networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.

Devroye, L. and Wagner, T. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601-604, 1979.

Elisseeff, A., Evgeniou, T., and Pontil, M. Stability of randomized learning algorithms. Journal of Machine Learning Research, 6(Jan):55-79, 2005.

Galanti, T., Wolf, L., and Hazan, T. A theoretical framework for deep transfer learning. Information and Inference, pp. iaw008, 2016.

Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Conference on Learning Theory (COLT), pp. 797-842, 2015.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.

Gonen, A. and Shalev-Shwartz, S. Fast rates for empirical risk minimization of strict saddle problems. arXiv preprint arXiv:1701.04271, 2017.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. The MIT Press, 2016.

Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning (ICML), 2016.

Hochreiter, S. and Schmidhuber, J. Flat minima. Neural Computation, 9(1):1-42, 1997.

Kawaguchi, K. Deep learning without poor local minima. In Conference on Neural Information Processing Systems (NIPS), pp. 586-594, 2016.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations (ICLR), 2017.

Koren, T. and Levy, K. Fast rates for exp-concave empirical risk minimization. In Conference on Neural Information Processing Systems (NIPS), pp. 1477-1485, 2015.

Kuzborskij, I. and Orabona, F. Stability and hypothesis transfer learning. In International Conference on Machine Learning (ICML), pp. 942-950, 2013.

Kuzborskij, I. and Orabona, F. Fast rates by transferring from auxiliary hypotheses. Machine Learning, pp. 1-25, 2016.
Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent only converges to minimizers. In Conference on Learning Theory (COLT), pp. 1246-1257, 2016.

London, B. Generalization bounds for randomized learning with application to stochastic gradient descent. In NIPS Workshop on Optimizing the Optimizers, 2016.

Maurer, A. and Pontil, M. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.

Pearlmutter, B. A. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147-160, 1994.

Pentina, A. and Lampert, C. H. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning (ICML), 2014.

Perrot, M. and Habrard, A. A theoretical analysis of metric hypothesis transfer learning. In International Conference on Machine Learning (ICML), pp. 1708-1717, 2015.

Poggio, T., Voinea, S., and Rosasco, L. Online learning, stability, and stochastic gradient descent. arXiv preprint arXiv:1105.4701, 2011.

Reddi, S. J., Hefny, A., Sra, S., Poczos, B., and Smola, A. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning (ICML), pp. 314-323, 2016.

Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11(Oct):2635-2670, 2010.

Tommasi, T., Orabona, F., and Caputo, B. Learning categories from few examples with multi model knowledge transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5):928-941, 2014.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017.

Zhao, P. and Zhang, T. Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning (ICML), pp. 1-9, 2015.