# Algorithmic Stability and Hypothesis Complexity

Tongliang Liu 1, Gábor Lugosi 2 3 4, Gergely Neu 5, Dacheng Tao 1

1 UBTech Sydney AI Institute, School of IT, FEIT, The University of Sydney, Australia; 2 Department of Economics and Business, Pompeu Fabra University, Barcelona, Spain; 3 ICREA, Pg. Lluís Companys 23, 08010 Barcelona, Spain; 4 Barcelona Graduate School of Economics; 5 AI group, DTIC, Universitat Pompeu Fabra, Barcelona, Spain. Correspondence to: Tongliang Liu, Gábor Lugosi, Gergely Neu, Dacheng Tao.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017. JMLR: W&CP. Copyright 2017 by the author(s).

## Abstract

We introduce a notion of algorithmic stability of learning algorithms, which we term argument stability, that captures the stability of the hypothesis output by the learning algorithm in the normed space of functions from which hypotheses are selected. The main result of the paper bounds the generalization error of any learning algorithm in terms of its argument stability. The bounds are based on martingale inequalities in the Banach space to which the hypotheses belong. We apply the general bounds to bound the performance of some learning algorithms based on empirical risk minimization and stochastic gradient descent.

## 1. Introduction

Many efforts have been made to analyze various notions of algorithmic stability and to prove that a broad spectrum of learning algorithms are stable in some sense. Intuitively, a learning algorithm is said to be stable if slight perturbations in the training data result in small changes in the output of the algorithm, and these changes vanish as the data set grows bigger and bigger (Bonnans & Shapiro, 2013). For example, Devroye & Wagner (1979), Lugosi & Pawlak (1994), and Zhang (2003) showed that several non-parametric learning algorithms are stable; Bousquet & Elisseeff (2002) proved that $\ell_2$-regularized learning algorithms are uniformly stable; Wibisono et al. (2009) generalized Bousquet and Elisseeff's results and proved that regularized learning algorithms with strongly convex penalty functions on bounded domains, e.g., $\ell_p$-regularized learning algorithms for $1 < p \le 2$, are also uniformly stable; Hardt et al. (2015) showed that parametric models trained by stochastic gradient descent are uniformly stable; and Liu et al. (2017) proved that tasks in multi-task learning can act as regularizers and that multi-task learning in a very general setting is therefore uniformly stable under mild assumptions.

The notion of algorithmic stability has been an important tool for deriving theoretical guarantees on the generalization abilities of learning algorithms. Various notions of stability have been introduced and exploited to derive generalization bounds. For example, Mukherjee et al. (2006) proved that a statistical form of leave-one-out stability is a sufficient and necessary condition for the generalization and learnability of empirical risk minimization learning algorithms; Shalev-Shwartz et al. (2010) defined a weaker notion, the so-called on-average-replace-one-example stability, and showed that this condition is both sufficient and necessary for the generalization and learnability of a general learning setting.

In this paper we study learning algorithms that select a hypothesis (i.e., a function used for prediction) from a certain fixed class of functions belonging to a separable Banach space.
We introduce a notion of argument stability which measures the impact of changing a single training example on the hypothesis selected by the learning algorithm. This notion of stability is stronger than the uniform algorithmic stability of Bousquet & Elisseeff (2002), which is only concerned with changes in the loss and not in the hypothesis itself. However, as we will show, the new notion is still quite natural and holds for a variety of learning algorithms. On the other hand, it allows one to exploit martingale inequalities (Boucheron et al., 2013) in the Banach space of the hypotheses. Indeed, the performance bounds we derive for stable algorithms depend on characteristics related to the martingale type of the Banach space.

Generalization bounds typically depend on the complexity of a class of hypotheses that can be chosen by the learning algorithm. Exploiting local estimates of the complexity of the predefined hypothesis class is a promising way to obtain sharp bounds. Building on martingale inequalities in the Banach space of the hypotheses, we define the algorithmic hypothesis class: a subset of the predefined hypothesis class whose elements will (with high probability) be output by the learning algorithm. We then study the complexity of the algorithmic hypothesis class of argument-stable learning algorithms. We show that if the hypotheses belong to a Hilbert space, the upper bound on the Rademacher complexity of the algorithmic hypothesis class converges at the fast rate of order $O(1/n)$, where $n$ is the sample size.

The rest of the paper is organized as follows. Section 2 introduces the mathematical framework and the proposed notion of algorithmic stability. Section 3 presents the main results of this study, namely the generalization bounds in terms of argument stability. Section 4 specializes the results to some learning algorithms, including empirical risk minimization and stochastic gradient descent. Section 5 concludes the paper.

## 2. Algorithmic Stability and Hypothesis Class

We consider the classical statistical learning problem, where the value of a real random variable $Y$ is to be predicted based on the observation of another random variable $X$. Let $S$ be a training sample of $n$ i.i.d. pairs of random variables $Z_1 = (X_1, Y_1), \ldots, Z_n = (X_n, Y_n)$ drawn from a fixed distribution $P$ on a set $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the so-called feature space. A learning algorithm $A : S \in \mathcal{Z}^n \mapsto h_S \in \mathcal{H}$ is a mapping from $\mathcal{Z}^n$ to a hypothesis class $\mathcal{H}$ that we assume to be a subset of a separable Banach space $(\mathcal{B}, \|\cdot\|)$. We focus on linear prediction problems, that is, when $h(x)$ is a linear functional of $x$; we write $h(x) = \langle h, x \rangle$. In other words, we assume that the feature space $\mathcal{X}$ is the algebraic dual of the Banach space $\mathcal{B}$. We denote the norm in $\mathcal{X}$ by $\|\cdot\|_*$. The output $h_S$ of the learning algorithm is a hypothesis used for predicting the value of $Y$.

An important special case is when $\mathcal{B}$ is a Hilbert space. In that case we may assume that $\mathcal{X} = \mathcal{B}$ and that $\langle h, x \rangle$ is the inner product in $\mathcal{B}$.

The quality of the predictions made by any hypothesis is measured by a loss function $\ell : \mathcal{B} \times \mathcal{Z} \to \mathbb{R}_+$ (where $\mathbb{R}_+$ denotes the set of positive reals). Specifically, $\ell(h, Z)$ measures the loss of predicting the example $Z$ using the hypothesis $h$. The risk of $h \in \mathcal{H}$ is defined by

$$R(h) = \mathbb{E}\,\ell(h, Z),$$

while the empirical risk is

$$R_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h, Z_i).$$

For the output $h_S$ of a learning algorithm $A$, the generalization error is defined as

$$R(h_S) - R_S(h_S). \tag{1}$$
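To make these objects concrete, the following minimal sketch computes the empirical risk and a Monte Carlo estimate of the true risk, and hence of the generalization error (1), for a linear hypothesis. Everything in it (the squared loss, the Gaussian data, the least-squares learner, the names `sample` and `loss`) is an illustrative assumption rather than part of the paper's formal setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
w_true = rng.normal(size=d)                 # hypothetical target generating P

def sample(m):
    X = rng.normal(size=(m, d))
    y = X @ w_true + 0.1 * rng.normal(size=m)
    return X, y

def loss(h, X, y):
    # squared loss of the linear predictor <h, x>; any suitable loss fits here
    return (X @ h - y) ** 2

X, y = sample(n)                            # training sample S
h_S = np.linalg.lstsq(X, y, rcond=None)[0]  # some learning algorithm A(S)

R_S = loss(h_S, X, y).mean()                # empirical risk R_S(h_S)
X_t, y_t = sample(100_000)                  # fresh data approximating the true risk
R = loss(h_S, X_t, y_t).mean()              # Monte Carlo estimate of R(h_S)
print("generalization error estimate:", R - R_S)
```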
The notion of algorithmic stability was proposed to measure how the output of a learning algorithm changes when the input is changed. Various ways of measuring algorithmic stability have been introduced. Here we recall the notion of uniform stability defined by Bousquet & Elisseeff (2002) for comparison purposes. This notion of stability relies on the altered sample $S^i = \{Z_1, \ldots, Z_{i-1}, Z_i', Z_{i+1}, \ldots, Z_n\}$, the sample $S$ with the $i$-th example replaced by an independent copy of $Z_i$.

Definition 1 (Uniform Stability). A learning algorithm $A$ is $\beta(n)$-uniformly stable with respect to the loss function $\ell$ if for all $i \in \{1, \ldots, n\}$,

$$|\ell(h_S, Z) - \ell(h_{S^i}, Z)| \le \beta(n)$$

with probability one, where $\beta(n) \in \mathbb{R}_+$.

We propose the following, similar, notion that acts on the hypotheses directly, as opposed to the losses.

Definition 2 (Uniform Argument Stability). A learning algorithm $A$ is $\alpha(n)$-uniformly argument stable if for all $i \in \{1, \ldots, n\}$,

$$\|h_S - h_{S^i}\| \le \alpha(n)$$

with probability one, where $\alpha(n) \in \mathbb{R}_+$.

The two notions of stability are closely related: intuitively, if the loss $\ell(h, z)$ is a sufficiently smooth function of $h$, then uniform argument stability should imply uniform stability. To make this intuition precise, we define the notion of Lipschitz-continuous loss functions below.

Definition 3 ($L$-Lipschitz Loss Function). The loss function $\ell : \mathcal{B} \times \mathcal{Z} \to \mathbb{R}_+$ is $L$-Lipschitz for an $L > 0$ if

$$|\ell(h, z) - \ell(h', z)| \le L \,|\langle h, x \rangle - \langle h', x \rangle|$$

holds for all $z \in \mathcal{Z}$ and $h, h' \in \mathcal{H}$.

Additionally assuming that $\|X\|_*$ is bounded by some $B > 0$ with probability one, it is easy to see that an $\alpha(n)$-uniformly argument stable learning algorithm is uniformly stable with $\beta(n) = LB\alpha(n)$, since

$$\|h_S - h_{S^i}\| = \sup_{x \in \mathcal{X} : \|x\|_* \le 1} \big( \langle h_S, x \rangle - \langle h_{S^i}, x \rangle \big).$$

However, the reverse implication need not hold, and hence uniform argument stability is a stronger notion.

In the rest of the paper we focus on $L$-Lipschitz loss functions and assume that $\|X\|_* \le B$ holds almost surely. These assumptions are arguably stronger than those made by Bousquet & Elisseeff (2002), who only require that the loss function be bounded. In contrast, our results require that the loss $\ell(h, z)$ be Lipschitz in the linear form $\langle h, x \rangle$, which is only slightly more general than assuming generalized linear loss functions. Nevertheless, these stronger assumptions will enable us to prove stronger generalization bounds.

The relationship between argument stability and generalization performance hinges on a property of the Banach space $\mathcal{B}$ that is closely related to the martingale type of the space; see Pisier (2011) for a comprehensive account. For concreteness we assume that the Banach space $\mathcal{B}$ is $(2, D)$-smooth (or of martingale type 2) for some $D > 0$. This means that for all $h, h' \in \mathcal{B}$,

$$\|h + h'\|^2 + \|h - h'\|^2 \le 2\|h\|^2 + 2D^2\|h'\|^2.$$

Note that Hilbert spaces are $(2, 1)$-smooth. The property we need is described in the following result of Pinelis (1994):

Proposition 1. Let $D_1, \ldots, D_n$ be a martingale difference sequence taking values in a separable $(2, D)$-smooth Banach space $\mathcal{B}$. Then for any $\epsilon > 0$,

$$P\left( \sup_{j \le n} \Big\| \sum_{t=1}^{j} D_t \Big\| \ge \epsilon \right) \le 2 \exp\left( -\frac{\epsilon^2}{2 D^2 c^2} \right),$$

where $c$ is a constant satisfying $\sum_{t=1}^{n} \|D_t\|_\infty^2 \le c^2$ (and $\|D_t\|_\infty$ denotes the essential supremum of the random variable $\|D_t\|$).

Our arguments extend, in a straightforward manner, to more general Banach spaces whenever exponential tail inequalities for bounded martingale sequences similar to Proposition 1 are available. We stay with the assumption of $(2, D)$-smoothness for convenience and because it applies to the perhaps most important special case in which $\mathcal{B}$ is a Hilbert space.
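As a concrete illustration of Definition 2, the following sketch empirically probes the uniform argument stability of one algorithm, ridge regression in $\mathbb{R}^d$, by replacing one training example at a time with an independent copy and recording $\|h_S - h_{S^i}\|$. The synthetic data, the regularization weight `lam`, and the closed-form solver are assumptions made only for this example; Section 4.1 shows that for the $\ell_2$ penalty the argument stability is of order $O(1/(\lambda n))$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 100, 5, 1.0

def ridge(X, y):
    # h_S = argmin_h (1/n) sum_i (<h, x_i> - y_i)^2 + lam * ||h||^2  (closed form)
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

X, y = rng.normal(size=(n, d)), rng.normal(size=n)     # training sample S
h_S = ridge(X, y)

dists = []
for i in range(n):
    Xi, yi = X.copy(), y.copy()
    Xi[i], yi[i] = rng.normal(size=d), rng.normal()    # S^i: replace Z_i by a copy Z'_i
    dists.append(np.linalg.norm(h_S - ridge(Xi, yi)))  # ||h_S - h_{S^i}||

print("max_i ||h_S - h_{S^i}|| =", max(dists))         # empirical proxy for alpha(n)
```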
We refer to Rakhlin & Sridharan (2015) for more information on martingale inequalities of this kind.

A key property of stable algorithms, implied by the martingale inequality, is that the hypothesis $h_S$ output by the algorithm concentrates in the Banach space $\mathcal{B}$ around its expectation $\mathbb{E} h_S$. This is established in the next simple lemma.

Lemma 1. Let the Banach space $\mathcal{B}$ be $(2, D)$-smooth. If a learning algorithm $A$ is $\alpha(n)$-uniformly argument stable, then for any $\delta > 0$,

$$P\left( \|h_S - \mathbb{E} h_S\| \le D\alpha(n)\sqrt{2n \log(2/\delta)} \right) \ge 1 - \delta.$$

Proof. Introduce the martingale differences

$$D_t = \mathbb{E}(h_S \mid Z_1, \ldots, Z_t) - \mathbb{E}(h_S \mid Z_1, \ldots, Z_{t-1}),$$

so that $h_S - \mathbb{E} h_S = \sum_{t=1}^n D_t$. Since $\mathbb{E}(h_S \mid Z_1, \ldots, Z_{t-1}) = \mathbb{E}(h_{S^t} \mid Z_1, \ldots, Z_t)$ (replacing $Z_t$ by an independent copy and integrating it out is the same as not conditioning on $Z_t$), we have

$$\sum_{t=1}^n \|D_t\|_\infty^2 = \sum_{t=1}^n \big\| \mathbb{E}(h_S - h_{S^t} \mid Z_1, \ldots, Z_t) \big\|_\infty^2 \le \sum_{t=1}^n \big\| \mathbb{E}\big( \|h_S - h_{S^t}\| \,\big|\, Z_1, \ldots, Z_t \big) \big\|_\infty^2 \le n\,\alpha(n)^2.$$

Thus, by Proposition 1,

$$P\left( \|h_S - \mathbb{E} h_S\| \ge \alpha(n) D \sqrt{2n \log(2/\delta)} \right) \le \delta$$

for $\delta = 2 \exp\big(-\epsilon^2 / (2 D^2 n \alpha(n)^2)\big)$.

## 3. Algorithmic Rademacher Complexity and Generalization Bound

The concentration result of Lemma 1 justifies the following definition of the algorithmic hypothesis class: since with high probability $h_S$ concentrates around its expectation $\mathbb{E} h_S$, what matters for the generalization performance of the algorithm is the complexity of the ball centered at $\mathbb{E} h_S$, and not that of the entire hypothesis class $\mathcal{H}$. This observation may lead to significantly improved performance guarantees.

Definition 4 (Algorithmic Hypothesis Class). For a sample size $n$ and confidence parameter $\delta > 0$, let $r = r(n, \delta) = D\alpha(n)\sqrt{2n \log(2/\delta)}$, and define the algorithmic hypothesis class of a stable learning algorithm by

$$B_r = \{ h \in \mathcal{H} : \|h - \mathbb{E} h_S\| \le r(n, \delta) \}.$$

Note that, by Lemma 1, $h_S \in B_r$ with probability at least $1 - \delta$.

We bound the generalization error (1) in terms of the Rademacher complexity (Bartlett & Mendelson, 2003) of the algorithmic hypothesis class. The Rademacher complexity of a hypothesis class $\mathcal{H}$ on the feature space $\mathcal{X}$ is defined as

$$\mathfrak{R}(\mathcal{H}) = \mathbb{E} \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i \langle h, X_i \rangle,$$

where $\sigma_1, \ldots, \sigma_n$ are i.i.d. Rademacher variables, uniformly distributed in $\{-1, +1\}$.

The next theorem shows how the Rademacher complexity of the algorithmic hypothesis class can be bounded. The bound depends on the type of the feature space $\mathcal{X}$. Recall that the Banach space $(\mathcal{X}, \|\cdot\|_*)$ is of type $p \ge 1$ if there exists a constant $C_p$ such that for all $x_1, \ldots, x_n \in \mathcal{X}$,

$$\mathbb{E} \Big\| \sum_{i=1}^n \sigma_i x_i \Big\|_* \le C_p \Big( \sum_{i=1}^n \|x_i\|_*^p \Big)^{1/p}.$$

In the important special case when $\mathcal{X}$ is a Hilbert space, the space is of type 2 with constant $C_2 = 1$.

Theorem 1. Assume that $\mathcal{B}$ is a $(2, D)$-smooth Banach space and that its dual $\mathcal{X}$ is of type $p$. Suppose that the marginal distribution of the $X_i$ is such that $\|X_i\|_* \le B$ with probability one, for some $B > 0$. If a learning algorithm is $\alpha(n)$-uniformly argument stable, then the Rademacher complexity of the algorithmic hypothesis class $B_r$ on the feature space satisfies

$$\mathfrak{R}(B_r) \le D C_p B \sqrt{2 \log(2/\delta)}\; \alpha(n)\, n^{-1/2 + 1/p}.$$

In particular, when $\mathcal{B}$ is a Hilbert space, the bound simplifies to

$$\mathfrak{R}(B_r) \le B \sqrt{2 \log(2/\delta)}\; \alpha(n).$$

Proof. We have

$$\begin{aligned}
\mathfrak{R}(B_r) &= \mathbb{E} \sup_{h \in B_r} \frac{1}{n} \sum_{i=1}^n \sigma_i \langle h, X_i \rangle \\
&= \mathbb{E} \sup_{h \in B_r} \frac{1}{n} \sum_{i=1}^n \big( \sigma_i \langle h, X_i \rangle - \sigma_i \langle \mathbb{E} h_S, X_i \rangle + \sigma_i \langle \mathbb{E} h_S, X_i \rangle \big) \\
&= \mathbb{E} \sup_{h \in B_r} \frac{1}{n} \sum_{i=1}^n \sigma_i \langle h - \mathbb{E} h_S, X_i \rangle \quad \text{(the term $\sigma_i \langle \mathbb{E} h_S, X_i \rangle$ does not depend on $h$ and has zero mean)} \\
&\le \frac{r}{n}\, \mathbb{E} \Big\| \sum_{i=1}^n \sigma_i X_i \Big\|_* \le \frac{r}{n}\, C_p \Big( \sum_{i=1}^n \|X_i\|_*^p \Big)^{1/p} \\
&\le D\alpha(n)\sqrt{2n \log(2/\delta)}\; C_p B\, n^{1/p - 1} = D C_p B \sqrt{2 \log(2/\delta)}\; \alpha(n)\, n^{-1/2 + 1/p},
\end{aligned}$$

concluding the proof.

The theorem above may easily be used to bound the performance of an $\alpha(n)$-uniformly argument stable learning algorithm. For simplicity, we state the result for Hilbert spaces only; the extension to $(2, D)$-smooth Banach spaces with a type-$p$ dual is straightforward.
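The Hilbert-space case of Theorem 1 can be checked numerically: in $\mathbb{R}^d$ the supremum over the ball $B_r$ has a closed form, so $\mathfrak{R}(B_r)$ can be estimated by Monte Carlo over the Rademacher signs. A minimal sketch, with the sample, the center (standing in for $\mathbb{E} h_S$), and the radius chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
X = rng.normal(size=(n, d))       # features (bounded norm assumed in the theorem)
center = rng.normal(size=d)       # stands in for E h_S
r = 0.1                           # radius r(n, delta) of B_r

def rademacher_ball(X, center, r, n_mc=2000):
    # For the ball {h : ||h - center|| <= r} in R^d, the supremum of
    # (1/n) sum_i sigma_i <h, X_i> is attained in closed form:
    #   (1/n) * (<center, s> + r * ||s||),  with s = sum_i sigma_i X_i.
    n = X.shape[0]
    vals = np.empty(n_mc)
    for k in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=n)
        s = sigma @ X
        vals[k] = (center @ s + r * np.linalg.norm(s)) / n
    return vals.mean()             # Monte Carlo estimate of R(B_r)

print("estimated R(B_r):", rademacher_ball(X, center, r))
```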
Corollary 1. Assume that $\mathcal{B}$ is a separable Hilbert space. Suppose that the marginal distribution of the $X_i$ is such that $\|X_i\| \le B$ with probability one, for some $B > 0$, and that the loss function is bounded and Lipschitz, that is, $\ell(h, Z) \le M$ with probability one for some $M > 0$ and

$$|\ell(h, z) - \ell(h', z)| \le L \,|\langle h, x \rangle - \langle h', x \rangle|$$

for all $z \in \mathcal{Z}$ and $h, h' \in \mathcal{H}$. If a learning algorithm is $\alpha(n)$-uniformly argument stable, then its generalization error is bounded as follows: with probability at least $1 - 2\delta$,

$$R(h_S) - R_S(h_S) \le 2LB\sqrt{2 \log(2/\delta)}\; \alpha(n) + M \sqrt{\frac{\log(1/\delta)}{2n}}.$$

Proof. Note first that, by Lemma 1, with probability at least $1 - \delta$,

$$R(h_S) - R_S(h_S) \le \sup_{h \in B_r} \big( R(h) - R_S(h) \big).$$

On the other hand, by the boundedness of the loss function and the bounded differences inequality, with probability at least $1 - \delta$,

$$\sup_{h \in B_r} \big( R(h) - R_S(h) \big) \le \mathbb{E} \sup_{h \in B_r} \big( R(h) - R_S(h) \big) + M \sqrt{\frac{\log(1/\delta)}{2n}} \le 2\, \mathfrak{R}(\ell \circ B_r) + M \sqrt{\frac{\log(1/\delta)}{2n}},$$

where $\ell \circ \mathcal{H}$ denotes the set of compositions of the loss $\ell$ with hypotheses $h \in \mathcal{H}$. By the Lipschitz property of the loss function and a standard contraction argument, i.e., the Talagrand contraction lemma (Ledoux & Talagrand, 2013), we have

$$\mathfrak{R}(\ell \circ B_r) \le L\, \mathfrak{R}(B_r) \le LB \sqrt{2 \log(2/\delta)}\; \alpha(n).$$

Note that for many stable algorithms $\alpha(n)$ is of order $O(1/n)$. For the notion of uniform stability, such bounds appear in Lugosi & Pawlak (1994); Bousquet & Elisseeff (2002); Wibisono et al. (2009); Hardt et al. (2015); Liu et al. (2017). As the examples below show, many of these learning algorithms even have uniform argument stability of order $O(1/n)$. In such cases the bound of Corollary 1 is essentially equivalent to the earlier results cited above. The bound is dominated by the term $M\sqrt{\log(1/\delta)/(2n)}$ coming from the bounded differences inequality. Fluctuations of order $O(n^{-1/2})$ are often inevitable, especially when $R(h_S)$ is not typically small. When a small risk is reasonable to expect, one may use more advanced concentration inequalities involving second-moment information, at the price of replacing the generalization error by the so-called deformed generalization error $R(h_S) - \frac{a}{a-1} R_S(h_S)$ for some $a > 1$. The next theorem derives such a bound, relying on techniques developed by Bartlett et al. (2005). This result improves essentially on earlier stability-based bounds.

Theorem 2. Assume that $\mathcal{B}$ is a separable Hilbert space. Suppose that the marginal distribution of the $X_i$ is such that $\|X_i\| \le B$ with probability one, for some $B > 0$, and that the loss function is bounded and Lipschitz, that is, $\ell(h, Z) \le M$ with probability one for some $M > 0$ and $|\ell(h, z) - \ell(h', z)| \le L\,|\langle h, x \rangle - \langle h', x \rangle|$ for all $z \in \mathcal{Z}$ and $h, h' \in \mathcal{H}$. Let $a > 1$. If a learning algorithm is $\alpha(n)$-uniformly argument stable, then, with probability at least $1 - 2\delta$,

$$R(h_S) - \frac{a}{a-1} R_S(h_S) \le 8LB\sqrt{2 \log(2/\delta)}\; \alpha(n) + (6a + 8)\, \frac{M \log(1/\delta)}{n}.$$

The proof of Theorem 2 relies on techniques developed by Bartlett et al. (2005). In particular, we make use of the following result.

Proposition 2 (Bartlett et al., 2005, Theorem 2.1). Let $F$ be a class of functions that map $\mathcal{X}$ into $[0, M]$. Assume that there is some $\rho > 0$ such that for every $f \in F$, $\mathrm{var}(f(X)) \le \rho$. Then, with probability at least $1 - \delta$,

$$\sup_{f \in F} \Big( \mathbb{E} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i) \Big) \le 4\, \mathfrak{R}(F) + \sqrt{\frac{2\rho \log(1/\delta)}{n}} + \frac{4M \log(1/\delta)}{3n}.$$

To prove the theorem, we also need the following auxiliary lemma. Define the class

$$G_r = \left\{ \frac{r}{\max\{r, \mathbb{E}\ell(h, Z)\}}\; \ell(h, \cdot) \;\middle|\; h \in B_r \right\}.$$

It is evident that $G_r \subseteq \{ \alpha\, \ell \circ h \mid h \in B_r,\ \alpha \in [0, 1] \}$. The following lemma is proven in Bartlett et al. (2005).

Lemma 2. Define

$$V_r = \sup_{g \in G_r} \Big( \mathbb{E} g(Z) - \frac{1}{n} \sum_{i=1}^n g(Z_i) \Big).$$

For any $r > 0$ and $a > 1$, if $V_r \le r/a$, then every $h \in B_r$ satisfies

$$\mathbb{E} \ell(h, Z) \le \frac{a}{a-1} \cdot \frac{1}{n} \sum_{i=1}^n \ell(h, Z_i) + V_r.$$

Now we are ready to prove Theorem 2.
Proof of Theorem 2. First, we introduce an inequality to build the connection between algorithmic stability and hypothesis complexity. According to Lemma 1, for any $a > 1$ and $\delta > 0$, with probability at least $1 - \delta$, we have

$$R(h_S) - \frac{a}{a-1} R_S(h_S) \le \sup_{h \in B_r} \Big( R(h) - \frac{a}{a-1} R_S(h) \Big). \tag{2}$$

Second, we upper bound the term $\sup_{h \in B_r} \big( R(h) - \frac{a}{a-1} R_S(h) \big)$ with high probability. It is easy to check that for any $g \in G_r$, $\mathbb{E} g(Z) \le r$ and $g(Z) \in [0, M]$. Then

$$\mathrm{var}(g(Z)) \le \mathbb{E} (g(Z))^2 \le M\, \mathbb{E} g(Z) \le Mr.$$

Applying Proposition 2 with $\rho = Mr$, with probability at least $1 - \delta$,

$$V_r \le 4\,\mathfrak{R}(G_r) + \sqrt{\frac{2Mr \log(1/\delta)}{n}} + \frac{4M \log(1/\delta)}{3n}.$$

Since $\sqrt{2Mr\log(1/\delta)/n} \le r/(2a) + aM\log(1/\delta)/n$, this means that there exists an

$$r^* \le \frac{2Ma^2 \log(1/\delta)}{n} + 8a\, \mathfrak{R}(G_{r^*}) + \frac{8a M \log(1/\delta)}{3n}$$

such that $V_{r^*} \le r^*/a$ holds. According to Lemma 2, for any $h \in B_{r^*}$, with probability at least $1 - \delta$, we have

$$\mathbb{E} \ell(h, Z) \le \frac{a}{a-1} \cdot \frac{1}{n}\sum_{i=1}^n \ell(h, Z_i) + V_{r^*} \le \frac{a}{a-1} \cdot \frac{1}{n}\sum_{i=1}^n \ell(h, Z_i) + \frac{2Ma \log(1/\delta)}{n} + 8\,\mathfrak{R}(G_{r^*}) + \frac{8M \log(1/\delta)}{3n}.$$

It is easy to verify that $G_r \subseteq \{ \alpha\,\ell \circ h \mid h \in B_r,\ \alpha \in [0,1] \} \subseteq \mathrm{conv}(\{0\} \cup \ell \circ B_r)$. By elementary properties of the Rademacher complexity (see, e.g., Bartlett & Mendelson, 2003), $\mathcal{H}' \subseteq \mathcal{H}$ implies $\mathfrak{R}(\mathcal{H}') \le \mathfrak{R}(\mathcal{H})$, and taking convex hulls does not increase the Rademacher complexity. Then, with probability at least $1 - \delta$, we have

$$\mathbb{E}\ell(h, Z) - \frac{a}{a-1} \cdot \frac{1}{n}\sum_{i=1}^n \ell(h, Z_i) \le \frac{2Ma\log(1/\delta)}{n} + 8\,\mathfrak{R}(\ell \circ B_r) + \frac{8M\log(1/\delta)}{3n}.$$

The proof of Theorem 2 is completed by combining the above inequality with inequality (2), the Talagrand contraction lemma, and Theorem 1.

In the next section, we specialize the above results to some learning algorithms by proving their uniform argument stability.

## 4. Applications

Various learning algorithms have been proved to possess some kind of stability. We refer the reader to Devroye & Wagner (1979); Lugosi & Pawlak (1994); Bousquet & Elisseeff (2002); Zhang (2003); Wibisono et al. (2009); Hardt et al. (2015); Liu et al. (2017) for such examples, including stochastic gradient descent methods, empirical risk minimization, and non-parametric learning algorithms such as $k$-nearest neighbor rules and kernel regression.

### 4.1. Empirical Risk Minimization

Regularized empirical risk minimization is known to be uniformly stable (Bousquet & Elisseeff, 2002). Here we consider regularized empirical risk minimization (RERM) algorithms of the following form. The objective function of RERM is

$$R_{S,\lambda}(h) = \frac{1}{n} \sum_{i=1}^n \ell(h, Z_i) + \lambda N(h),$$

where $N : h \in \mathcal{H} \mapsto N(h) \in \mathbb{R}_+$ is a convex penalty function. Its expected counterpart is defined as

$$R_\lambda(h) = \mathbb{E}\, \ell(h, Z) + \lambda N(h).$$

Bousquet & Elisseeff (2002) proved that $\ell_2$-regularized learning algorithms are $\beta(n)$-uniformly stable. Wibisono et al. (2009) extended this result and studied sufficient conditions on the penalty term $N(h)$ that ensure uniform $\beta(n)$-stability. As we now show, both of their proof methods are applicable to the analysis of uniform argument stability. By exploiting their results, we show that stable RERM algorithms have strong generalization properties.

Theorem 3. Assume that $\mathcal{B}$ is a separable Hilbert space. Suppose that the marginal distribution of the $X_i$ is such that $\|X_i\| \le B$ with probability one, for some $B > 0$, and that the loss function is convex in $h$, bounded by $M$, and $L$-Lipschitz. Suppose that for some constants $C$ and $\xi > 1$ the penalty function $N(h)$ satisfies

$$N(h_S) + N(h_{S^i}) - 2N\Big( \frac{h_S + h_{S^i}}{2} \Big) \ge C \,\|h_S - h_{S^i}\|^\xi. \tag{3}$$

Then, for any $\delta > 0$ and $a > 1$, if $h_S$ is the output of RERM, with probability at least $1 - 2\delta$, we have

$$R(h_S) - \frac{a}{a-1} R_S(h_S) \le 8LB \sqrt{2 \log(2/\delta)} \left( \frac{2LB}{C \lambda n} \right)^{\frac{1}{\xi - 1}} + (6a + 8)\, \frac{M \log(1/\delta)}{n}.$$

Specifically, when $N(h) = \|h\|^2$, condition (3) holds with $\xi = 2$ and $C = 1/2$.
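Before turning to the proof, a small sketch of the $\ell_2$ case. It first checks numerically that $N(h) = \|h\|^2$ satisfies condition (3) with $\xi = 2$ and $C = 1/2$ (for this penalty, $N(h) + N(h') - 2N((h+h')/2) = \|h - h'\|^2/2$ holds exactly), and then solves a ridge-regularized RERM instance in closed form. The squared loss and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)

# Check condition (3) for N(h) = ||h||^2: the identity
#   N(h) + N(h') - 2 * N((h + h') / 2) = ||h - h'||^2 / 2
# holds exactly, giving xi = 2 and C = 1/2.
h, hp = rng.normal(size=5), rng.normal(size=5)
N = lambda v: v @ v
lhs = N(h) + N(hp) - 2 * N((h + hp) / 2)
assert np.isclose(lhs, 0.5 * np.linalg.norm(h - hp) ** 2)

# A minimal RERM instance: squared loss with the l2 penalty (ridge), i.e.
#   R_{S,lam}(h) = (1/n) sum_i (<h, x_i> - y_i)^2 + lam * ||h||^2,
# minimized in closed form.
n, d, lam = 100, 5, 1.0
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
h_S = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
print("RERM output h_S:", h_S)
```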
Proof. The proof of Theorem 3 relies on the following result implied by Wibisono et al. (2009).

Proposition 3. Assume the conditions of Theorem 3. Then the RERM learning algorithm is $\alpha(n)$-uniformly argument stable with

$$\alpha(n) \le \left( \frac{2LB}{C\lambda n} \right)^{\frac{1}{\xi-1}},$$

and hence $\beta(n)$-uniformly stable with $\beta(n) = LB\,\alpha(n)$. Specifically, when $N(h) = \|h\|_p^p$ and $1 < p \le 2$, condition (3) on the penalty function holds with $\xi = 2$ and a constant $C$ depending only on $p$, where $\|h\|_p^p = \sum_r |h_r|^p$ and $r$ indexes the coordinates.

Theorem 3 follows by combining Theorem 2 and Proposition 3.

### 4.2. Stochastic Gradient Descent

Stochastic gradient descent (SGD) is one of the most widely used optimization methods in machine learning. Hardt et al. (2015) showed that parametric models trained by SGD are uniformly stable. Their results apply to both convex and non-convex learning problems and provide insights into why SGD performs well in practice, in particular for deep learning algorithms. Their results rest on the assumptions that the loss function is both Lipschitz and smooth.

To avoid the technicalities of defining derivatives in general Hilbert spaces, in this section we assume that $\mathcal{B} = \mathcal{X} = \mathbb{R}^d$, the $d$-dimensional Euclidean space.

Definition 5 (Smooth). A differentiable loss function $\ell(h, \cdot)$ is $s$-smooth if for all $h, h' \in \mathcal{H}$, we have

$$\| \nabla_h \ell(h, \cdot) - \nabla_{h'} \ell(h', \cdot) \| \le s \,\|h - h'\|,$$

where $\nabla_x f(x)$ denotes the derivative of $f(x)$ with respect to $x$ and $s > 0$.

Definition 6 (Strongly Convex). A differentiable loss function $\ell(h, \cdot)$ is $\gamma$-strongly convex with respect to $\|\cdot\|$ if for all $h, h' \in \mathcal{H}$, we have

$$\big( \nabla_h \ell(h, \cdot) - \nabla_{h'} \ell(h', \cdot) \big)^\top (h - h') \ge \gamma \,\|h - h'\|^2,$$

where $\gamma > 0$.

Theorem 2 applies to SGD when the loss function $\ell(h, x)$ is $L$-Lipschitz and $s$-smooth and $h$ acts linearly on $x$. Note that our definition of $L$-Lipschitzness requires the loss function to be Lipschitz in the linear form $\langle h, x \rangle$.

Theorem 4. Let the stochastic gradient update rule be

$$h_{t+1} = h_t - \alpha_t \nabla_h \ell(h_t, Z_{i_t}),$$

where $\alpha_t > 0$ is the learning rate and $i_t$ is the index of the example chosen for the $t$-th update. Let $h_T$ and $h^i_T$ denote the outputs of SGD run on the samples $S$ and $S^i$, respectively. Assume that $\|X\| \le B$ with probability one. Suppose that the loss function is $L$-Lipschitz, $s$-smooth, and upper bounded by $M$, and that SGD is run with a monotonically non-increasing step size $\alpha_t \le c/t$, where $c$ is a universal constant, for $T$ steps. Then, for any $\delta > 0$ and $a > 1$, with probability at least $1 - 2\delta$, we have

$$R(h_T) - \frac{a}{a-1} R_S(h_T) \le 8LB\sqrt{2\log(2/\delta)}\; \frac{1 + 1/(sc)}{n - 1}\, (2cBL)^{\frac{1}{sc+1}}\, T^{\frac{sc}{sc+1}} + (6a+8)\,\frac{M\log(1/\delta)}{n}.$$

When the loss function $\ell$ is convex, $L$-Lipschitz, $s$-smooth, and upper bounded by $M$, suppose that SGD is run with step sizes $\alpha_t \le 2/s$ for $T$ steps. Then, for any $\delta > 0$ and $a > 1$, with probability at least $1 - 2\delta$,

$$R(h_T) - \frac{a}{a-1} R_S(h_T) \le 8LB\sqrt{2\log(2/\delta)}\; \frac{2BL}{n} \sum_{t=1}^T \alpha_t + (6a+8)\,\frac{M\log(1/\delta)}{n}.$$

Moreover, when the loss function $\ell$ is $\gamma$-strongly convex, $s$-smooth, and upper bounded by $M$, let the stochastic gradient update be given by

$$h_{t+1} = \Pi_\Omega\big( h_t - \alpha_t \nabla_h \ell(h_t, Z_{i_t}) \big),$$

where $\Omega$ is a compact, convex set over which we wish to optimize and $\Pi_\Omega(\cdot)$ is the projection $\Pi_\Omega(f) = \arg\min_{h \in \Omega} \|h - f\|$. If the loss function is furthermore $L$-Lipschitz over the set $\Omega$ and the projected SGD is run with a constant step size $\alpha \le 1/s$ for $T$ steps, then, for any $\delta > 0$ and $a > 1$, with probability at least $1 - 2\delta$, the projected SGD satisfies

$$R(h_T) - \frac{a}{a-1} R_S(h_T) \le \frac{16 L^2 B^2}{\gamma n} \sqrt{2\log(2/\delta)} + (6a+8)\,\frac{M\log(1/\delta)}{n}.$$

Note that any $\ell_2$-regularized convex loss function is strongly convex. Bousquet & Elisseeff (2002) studied the stability of batch methods; when the loss function is strongly convex, the stability of SGD is consistent with their result.
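The argument stability driving Theorem 4 can be observed directly: run SGD twice with the same index sequence $i_1, \ldots, i_T$, once on $S$ and once on a perturbed sample $S^i$, and measure $\|h_T - h^i_T\|$. A minimal sketch with a smooth, convex squared loss (not globally Lipschitz, but effectively so on this bounded synthetic data) and step size $\alpha_t = c/t$; all data and constants below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, T, c = 200, 5, 1000, 0.1
X, y = rng.normal(size=(n, d)), rng.normal(size=n)   # sample S

def grad(h, x, yi):
    # gradient in h of the convex, smooth squared loss (<h, x> - y)^2
    return 2 * (x @ h - yi) * x

def sgd(X, y, idx):
    h = np.zeros(d)
    for t, i in enumerate(idx, start=1):
        h -= (c / t) * grad(h, X[i], y[i])   # step size alpha_t = c / t
    return h

idx = rng.integers(n, size=T)    # shared index sequence i_1, ..., i_T
h_T = sgd(X, y, idx)

Xi, yi = X.copy(), y.copy()
Xi[0], yi[0] = rng.normal(size=d), rng.normal()      # S^1: replace the first example
h_Ti = sgd(Xi, yi, idx)

print("||h_T - h^1_T|| =", np.linalg.norm(h_T - h_Ti))   # shrinks as n grows
```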
While the above result only applies to $L$-Lipschitz loss functions as defined in Definition 3, it does explain some generalization properties of layer-wise training of neural networks by stochastic gradient descent. In this once-common training scheme (see, e.g., Bengio et al., 2007), one freezes the parameters of the network before/after a certain layer and performs SGD on that single layer. It is easy to see that, as long as the activation function and the loss function of the network are Lipschitz-continuous in their inputs, the overall loss satisfies the continuity conditions of Theorem 4. This implies that the parameters in each layer may generalize well in a certain sense if SGD is employed with early stopping.

The proof of Theorem 4 follows immediately from Theorem 2, combined with the following result implied by Hardt et al. (2015) (a collection of the results of Theorems 3.8, 3.9, and 3.12 therein).

Proposition 4. Let the stochastic gradient update be given by $h_{t+1} = h_t - \alpha_t \nabla_h \ell(h_t, Z_{i_t})$, where $\alpha_t > 0$ is the learning rate and $i_t$ is the index of the example chosen for the $t$-th update. Let $h_T$ and $h^i_T$ denote the outputs of SGD run on the samples $S$ and $S^i$, respectively.

When the loss function is $L$-Lipschitz and $s$-smooth, and SGD is run with a monotonically non-increasing step size $\alpha_t \le c/t$, where $c$ is a universal constant, for $T$ steps, then

$$\|h_T - h^i_T\| \le \frac{1 + 1/(sc)}{n-1}\, (2cBL)^{\frac{1}{sc+1}}\, T^{\frac{sc}{sc+1}}.$$

When the loss function $\ell$ is convex, $L$-Lipschitz, and $s$-smooth, and SGD is run with step sizes $\alpha_t \le 2/s$ for $T$ steps, then

$$\|h_T - h^i_T\| \le \frac{2BL}{n} \sum_{t=1}^T \alpha_t.$$

Moreover, when the loss function $\ell$ is $\gamma$-strongly convex and $s$-smooth, let the stochastic gradient update be given by $h_{t+1} = \Pi_\Omega(h_t - \alpha_t \nabla_h \ell(h_t, Z_{i_t}))$, where $\Omega$ is a compact, convex set over which we wish to optimize and $\Pi_\Omega(f) = \arg\min_{h \in \Omega} \|h - f\|$. If the loss function is $L$-Lipschitz over the set $\Omega$ and the projected SGD is run with a constant step size $\alpha \le 1/s$ for $T$ steps, then the projected SGD satisfies

$$\|h_T - h^i_T\| \le \frac{2BL}{\gamma n}.$$

## 5. Conclusion

We introduced the concepts of uniform argument stability and the algorithmic hypothesis class, defined as the class of hypotheses that are likely to be output by the learning algorithm. We proposed a general probabilistic framework that exploits local estimates of the complexity of the hypothesis class to obtain fast convergence rates for stable learning algorithms. Specifically, we defined the algorithmic hypothesis class by observing that the output of a stable learning algorithm concentrates around $\mathbb{E} h_S$. The Rademacher complexity of the algorithmic hypothesis class then converges at the same rate as the uniform argument stability, which in a Hilbert space is of order $O(1/n)$ for various learning algorithms, such as empirical risk minimization and stochastic gradient descent. We derived fast convergence rates of order $O(1/n)$ for their deformed generalization errors. Unlike previously published guarantees of a similar flavor, our bounds hold with high probability rather than only in expectation.

Our study leaves some open problems and allows several possible extensions. First, the algorithmic hypothesis class defined in this study depends mainly on properties of the learning algorithm and only a little on the data distribution. It would be interesting to investigate ways of defining an algorithmic hypothesis class that take both the algorithmic properties and the data distribution into account.
Second, it would be interesting to explore whether algorithmic properties other than stability could also result in a small algorithmic hypothesis class.

## Acknowledgments

Liu and Tao were partially supported by Australian Research Council Projects FT-130101457, DP-140102164, and LP-150100671. Lugosi was partially supported by the Spanish Ministry of Economy and Competitiveness, Grant MTM2015-67304-P, and FEDER, EU. Neu was partially supported by the UPFellows Fellowship (Marie Curie COFUND program 600387).

## References

Bartlett, Peter L. and Mendelson, Shahar. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2003.

Bartlett, Peter L., Bousquet, Olivier, and Mendelson, Shahar. Local Rademacher complexities. Annals of Statistics, 33(4):1497-1537, 2005.

Bengio, Yoshua, Lamblin, Pascal, Popovici, Dan, and Larochelle, Hugo. Greedy layer-wise training of deep networks. In NIPS, 2007.

Bonnans, J. Frédéric and Shapiro, Alexander. Perturbation Analysis of Optimization Problems. Springer Science & Business Media, 2013.

Boucheron, Stéphane, Lugosi, Gábor, and Massart, Pascal. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Bousquet, Olivier and Elisseeff, André. Stability and generalization. Journal of Machine Learning Research, 2:499-526, 2002.

Devroye, Luc and Wagner, Terry J. Distribution-free inequalities for the deleted and holdout error estimates. IEEE Transactions on Information Theory, 25(2):202-207, 1979.

Hardt, Moritz, Recht, Benjamin, and Singer, Yoram. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

Ledoux, Michel and Talagrand, Michel. Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.

Liu, Tongliang, Tao, Dacheng, Song, Mingli, and Maybank, Stephen J. Algorithm-dependent generalization bounds for multi-task learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(2):227-241, February 2017.

Lugosi, Gábor and Pawlak, Miroslaw. On the posterior-probability estimate of the error rate of nonparametric classification rules. IEEE Transactions on Information Theory, 40(2):475-481, 1994.

Mukherjee, Sayan, Niyogi, Partha, Poggio, Tomaso, and Rifkin, Ryan. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25:161-193, 2006.

Pinelis, Iosif. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, 22(4):1679-1706, 1994.

Pisier, Gilles. Martingales in Banach spaces (in connection with type and cotype). IHP course notes, 2011.

Rakhlin, Alexander and Sridharan, Karthik. On equivalence of martingale tail bounds and deterministic regret inequalities. arXiv preprint arXiv:1510.03925, 2015.

Shalev-Shwartz, Shai, Shamir, Ohad, Srebro, Nathan, and Sridharan, Karthik. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11:2635-2670, 2010.

Wibisono, Andre, Rosasco, Lorenzo, and Poggio, Tomaso. Sufficient conditions for uniform stability of regularization algorithms. Technical Report MIT-CSAIL-TR-2009-060, 2009.

Zhang, Tong. Leave-one-out bounds for kernel methods. Neural Computation, 15(6):1397-1437, 2003.