# Algorithmic Stability and Hypothesis Complexity

Tongliang Liu 1, Gábor Lugosi 2 3 4, Gergely Neu 5, Dacheng Tao 1

1 UBTech Sydney AI Institute, School of IT, FEIT, The University of Sydney, Australia; 2 Department of Economics and Business, Pompeu Fabra University, Barcelona, Spain; 3 ICREA, Pg. Lluís Companys 23, 08010 Barcelona, Spain; 4 Barcelona Graduate School of Economics; 5 AI group, DTIC, Universitat Pompeu Fabra, Barcelona, Spain. Correspondence to: Tongliang Liu, Gábor Lugosi, Gergely Neu, Dacheng Tao.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017. JMLR: W&CP. Copyright 2017 by the author(s).

## Abstract

We introduce a notion of algorithmic stability of learning algorithms, which we term argument stability, that captures the stability of the hypothesis output by the learning algorithm in the normed space of functions from which hypotheses are selected. The main result of the paper bounds the generalization error of any learning algorithm in terms of its argument stability. The bounds are based on martingale inequalities in the Banach space to which the hypotheses belong. We apply the general bounds to bound the performance of some learning algorithms based on empirical risk minimization and stochastic gradient descent.

## 1. Introduction

Many efforts have been made to analyze various notions of algorithmic stability and to prove that a broad spectrum of learning algorithms are stable in some sense. Intuitively, a learning algorithm is said to be stable if slight perturbations in the training data result in small changes in the output of the algorithm, and these changes vanish as the data set grows bigger and bigger (Bonnans & Shapiro, 2013). For example, Devroye & Wagner (1979), Lugosi & Pawlak (1994), and Zhang (2003) showed that several non-parametric learning algorithms are stable; Bousquet & Elisseeff (2002) proved that $\ell_2$-regularized learning algorithms are uniformly stable; Wibisono et al. (2009) generalized Bousquet and Elisseeff's results and proved that regularized learning algorithms with strongly convex penalty functions on bounded domains, e.g., $\ell_p$-regularized learning algorithms for $1 < p \le 2$, are also uniformly stable; Hardt et al. (2015) showed that parametric models trained by stochastic gradient descent are uniformly stable; and Liu et al. (2017) proved that tasks in multi-task learning can act as regularizers and that multi-task learning in a very general setting is therefore uniformly stable under mild assumptions.

The notion of algorithmic stability has been an important tool for deriving theoretical guarantees on the generalization abilities of learning algorithms. Various notions of stability have been introduced and exploited to derive generalization bounds. For example, Mukherjee et al. (2006) proved that a statistical form of leave-one-out stability is a sufficient and necessary condition for the generalization and learnability of empirical risk minimization learning algorithms; Shalev-Shwartz et al. (2010) defined a weaker notion, the so-called on-average-replace-one-example stability, and showed that this condition is both sufficient and necessary for the generalization and learnability of a general learning setting.

In this paper we study learning algorithms that select a hypothesis (i.e., a function used for prediction) from a certain fixed class of functions belonging to a separable Banach space.
We introduce a notion of argument stability which measures the impact of changing a single training example on the hypothesis selected by the learning algorithm. This notion of stability is stronger than the uniform algorithmic stability of Bousquet & Elisseeff (2002), which is only concerned with changes in the loss and not in the hypothesis itself. However, as we will show, the new notion is still quite natural and holds for a variety of learning algorithms. On the other hand, it allows one to exploit martingale inequalities (Boucheron et al., 2013) in the Banach space of the hypotheses. Indeed, the performance bounds we derive for stable algorithms depend on characteristics related to the martingale type of the Banach space.

Generalization bounds typically depend on the complexity of a class of hypotheses that can be chosen by the learning algorithm. Exploiting local estimates of the complexity of the predefined hypothesis class is a promising way to obtain sharp bounds. Building on martingale inequalities in the Banach space of the hypotheses, we define the algorithmic hypothesis class: a subset of the predefined hypothesis class whose elements will (with high probability) be output by the learning algorithm. We then study the complexity of the algorithmic hypothesis class of argument-stable learning algorithms. We show that if the hypotheses belong to a Hilbert space, the upper bound on the Rademacher complexity of the algorithmic hypothesis class converges at the fast rate of order $O(1/n)$, where $n$ is the sample size.

The rest of the paper is organized as follows. Section 2 introduces the mathematical framework and the proposed notion of algorithmic stability. Section 3 presents the main results of this study, namely the generalization bounds in terms of argument stability. Section 4 specializes the results to some learning algorithms, including empirical risk minimization and stochastic gradient descent. Section 5 concludes the paper.

## 2. Algorithmic Stability and Hypothesis Class

We consider the classical statistical learning problem, where the value of a real random variable $Y$ is to be predicted based on the observation of another random variable $X$. Let $S$ be a training sample of $n$ i.i.d. pairs of random variables $Z_1 = (X_1, Y_1), \ldots, Z_n = (X_n, Y_n)$ drawn from a fixed distribution $P$ on a set $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the so-called feature space. A learning algorithm $A : S \in \mathcal{Z}^n \mapsto h_S \in \mathcal{H}$ is a mapping from $\mathcal{Z}^n$ to a hypothesis class $\mathcal{H}$ that we assume to be a subset of a separable Banach space $(\mathcal{B}, \|\cdot\|)$. We focus on linear prediction problems, that is, when $h(x)$ is a linear functional of $x$; we write $h(x) = \langle h, x \rangle$. In other words, we assume that the feature space $\mathcal{X}$ is the algebraic dual of the Banach space $\mathcal{B}$. We denote the norm in $\mathcal{X}$ by $\|\cdot\|_*$. The output $h_S$ of the learning algorithm is a hypothesis used for predicting the value of $Y$.

An important special case is when $\mathcal{B}$ is a Hilbert space. In that case we may assume that $\mathcal{X} = \mathcal{B}$ and that $\langle h, x \rangle$ is the inner product in $\mathcal{B}$.

The quality of the predictions made by any hypothesis is measured by a loss function $\ell : \mathcal{B} \times \mathcal{Z} \to \mathbb{R}_+$ (where $\mathbb{R}_+$ denotes the set of positive reals). Specifically, $\ell(h, Z)$ measures the loss of predicting the example $Z$ using the hypothesis $h$. The risk of $h \in \mathcal{H}$ is defined by

$$R(h) = \mathbb{E}\,\ell(h, Z),$$

while the empirical risk is

$$R_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h, Z_i).$$

For the output $h_S$ of a learning algorithm $A$, the generalization error is defined as

$$R(h_S) - R_S(h_S). \tag{1}$$
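To make these objects concrete, the following minimal sketch computes the empirical risk and a Monte Carlo estimate of the true risk, and hence of the generalization error (1), for a linear hypothesis. Everything in it (the squared loss, the Gaussian data, the least-squares learner, the names `sample` and `loss`) is an illustrative assumption rather than part of the paper's formal setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
w_true = rng.normal(size=d)                 # hypothetical target generating P

def sample(m):
    X = rng.normal(size=(m, d))
    y = X @ w_true + 0.1 * rng.normal(size=m)
    return X, y

def loss(h, X, y):
    # squared loss of the linear predictor <h, x>; any suitable loss fits here
    return (X @ h - y) ** 2

X, y = sample(n)                            # training sample S
h_S = np.linalg.lstsq(X, y, rcond=None)[0]  # some learning algorithm A(S)

R_S = loss(h_S, X, y).mean()                # empirical risk R_S(h_S)
X_t, y_t = sample(100_000)                  # fresh data approximating the true risk
R = loss(h_S, X_t, y_t).mean()              # Monte Carlo estimate of R(h_S)
print("generalization error estimate:", R - R_S)
```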
The notion of algorithmic stability was proposed to measure how the output of a learning algorithm changes when the input is changed. Various ways of measuring algorithmic stability have been introduced. Here we recall the notion of uniform stability defined by Bousquet & Elisseeff (2002) for comparison purposes. This notion of stability relies on the altered sample $S^i = \{Z_1, \ldots, Z_{i-1}, Z_i', Z_{i+1}, \ldots, Z_n\}$, the sample $S$ with the $i$-th example replaced by an independent copy of $Z_i$.

Definition 1 (Uniform Stability). A learning algorithm $A$ is $\beta(n)$-uniformly stable with respect to the loss function $\ell$ if for all $i \in \{1, \ldots, n\}$,

$$|\ell(h_S, Z) - \ell(h_{S^i}, Z)| \le \beta(n)$$

with probability one, where $\beta(n) \in \mathbb{R}_+$.

We propose the following, similar, notion that acts on the hypotheses directly, as opposed to the losses.

Definition 2 (Uniform Argument Stability). A learning algorithm $A$ is $\alpha(n)$-uniformly argument stable if for all $i \in \{1, \ldots, n\}$,

$$\|h_S - h_{S^i}\| \le \alpha(n)$$

with probability one, where $\alpha(n) \in \mathbb{R}_+$.

The two notions of stability are closely related: intuitively, if the loss $\ell(h, z)$ is a sufficiently smooth function of $h$, then uniform argument stability should imply uniform stability. To make this intuition precise, we define the notion of Lipschitz-continuous loss functions below.

Definition 3 ($L$-Lipschitz Loss Function). The loss function $\ell : \mathcal{B} \times \mathcal{Z} \to \mathbb{R}_+$ is $L$-Lipschitz for an $L > 0$ if

$$|\ell(h, z) - \ell(h', z)| \le L \,|\langle h, x \rangle - \langle h', x \rangle|$$

holds for all $z \in \mathcal{Z}$ and $h, h' \in \mathcal{H}$.

Additionally assuming that $\|X\|_*$ is bounded by some $B > 0$ with probability one, it is easy to see that an $\alpha(n)$-uniformly argument stable learning algorithm is uniformly stable with $\beta(n) = LB\alpha(n)$, since

$$\|h_S - h_{S^i}\| = \sup_{x \in \mathcal{X} : \|x\|_* \le 1} \big( \langle h_S, x \rangle - \langle h_{S^i}, x \rangle \big).$$

However, the reverse implication need not hold, and hence uniform argument stability is a stronger notion.

In the rest of the paper we focus on $L$-Lipschitz loss functions and assume that $\|X\|_* \le B$ holds almost surely. These assumptions are arguably stronger than those made by Bousquet & Elisseeff (2002), who only require that the loss function be bounded. In contrast, our results require that the loss $\ell(h, z)$ be Lipschitz in the linear form $\langle h, x \rangle$, which is only slightly more general than assuming generalized linear loss functions. Nevertheless, these stronger assumptions will enable us to prove stronger generalization bounds.

The relationship between argument stability and generalization performance hinges on a property of the Banach space $\mathcal{B}$ that is closely related to the martingale type of the space; see Pisier (2011) for a comprehensive account. For concreteness we assume that the Banach space $\mathcal{B}$ is $(2, D)$-smooth (or of martingale type 2) for some $D > 0$. This means that for all $h, h' \in \mathcal{B}$,

$$\|h + h'\|^2 + \|h - h'\|^2 \le 2\|h\|^2 + 2D^2\|h'\|^2.$$

Note that Hilbert spaces are $(2, 1)$-smooth. The property we need is described in the following result of Pinelis (1994):

Proposition 1. Let $D_1, \ldots, D_n$ be a martingale difference sequence taking values in a separable $(2, D)$-smooth Banach space $\mathcal{B}$. Then for any $\epsilon > 0$,

$$P\left( \sup_{j \le n} \Big\| \sum_{t=1}^{j} D_t \Big\| \ge \epsilon \right) \le 2 \exp\left( -\frac{\epsilon^2}{2 D^2 c^2} \right),$$

where $c$ is a constant satisfying $\sum_{t=1}^{n} \|D_t\|_\infty^2 \le c^2$ (and $\|D_t\|_\infty$ denotes the essential supremum of the random variable $\|D_t\|$).

Our arguments extend, in a straightforward manner, to more general Banach spaces whenever exponential tail inequalities for bounded martingale sequences similar to Proposition 1 are available. We stay with the assumption of $(2, D)$-smoothness for convenience and because it applies to the perhaps most important special case in which $\mathcal{B}$ is a Hilbert space.
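As a concrete illustration of Definition 2, the following sketch empirically probes the uniform argument stability of one algorithm, ridge regression in $\mathbb{R}^d$, by replacing one training example at a time with an independent copy and recording $\|h_S - h_{S^i}\|$. The synthetic data, the regularization weight `lam`, and the closed-form solver are assumptions made only for this example; Section 4.1 shows that for the $\ell_2$ penalty the argument stability is of order $O(1/(\lambda n))$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 100, 5, 1.0

def ridge(X, y):
    # h_S = argmin_h (1/n) sum_i (<h, x_i> - y_i)^2 + lam * ||h||^2  (closed form)
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

X, y = rng.normal(size=(n, d)), rng.normal(size=n)     # training sample S
h_S = ridge(X, y)

dists = []
for i in range(n):
    Xi, yi = X.copy(), y.copy()
    Xi[i], yi[i] = rng.normal(size=d), rng.normal()    # S^i: replace Z_i by a copy Z'_i
    dists.append(np.linalg.norm(h_S - ridge(Xi, yi)))  # ||h_S - h_{S^i}||

print("max_i ||h_S - h_{S^i}|| =", max(dists))         # empirical proxy for alpha(n)
```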
We refer to Rakhlin & Sridharan (2015) for more information on martingale inequalities of this kind.

A key property of stable algorithms, implied by the martingale inequality, is that the hypothesis $h_S$ output by the algorithm concentrates in the Banach space $\mathcal{B}$ around its expectation $\mathbb{E} h_S$. This is established in the next simple lemma.

Lemma 1. Let the Banach space $\mathcal{B}$ be $(2, D)$-smooth. If a learning algorithm $A$ is $\alpha(n)$-uniformly argument stable, then for any $\delta > 0$,

$$P\left( \|h_S - \mathbb{E} h_S\| \le D\alpha(n)\sqrt{2n \log(2/\delta)} \right) \ge 1 - \delta.$$

Proof. Introduce the martingale differences

$$D_t = \mathbb{E}(h_S \mid Z_1, \ldots, Z_t) - \mathbb{E}(h_S \mid Z_1, \ldots, Z_{t-1}),$$

so that $h_S - \mathbb{E} h_S = \sum_{t=1}^n D_t$. Since $\mathbb{E}(h_S \mid Z_1, \ldots, Z_{t-1}) = \mathbb{E}(h_{S^t} \mid Z_1, \ldots, Z_t)$ (replacing $Z_t$ by an independent copy and integrating it out is the same as not conditioning on $Z_t$), we have

$$\sum_{t=1}^n \|D_t\|_\infty^2 = \sum_{t=1}^n \big\| \mathbb{E}(h_S - h_{S^t} \mid Z_1, \ldots, Z_t) \big\|_\infty^2 \le \sum_{t=1}^n \big\| \mathbb{E}\big( \|h_S - h_{S^t}\| \,\big|\, Z_1, \ldots, Z_t \big) \big\|_\infty^2 \le n\,\alpha(n)^2.$$

Thus, by Proposition 1,

$$P\left( \|h_S - \mathbb{E} h_S\| \ge \alpha(n) D \sqrt{2n \log(2/\delta)} \right) \le \delta$$

for $\delta = 2 \exp\big(-\epsilon^2 / (2 D^2 n \alpha(n)^2)\big)$.

## 3. Algorithmic Rademacher Complexity and Generalization Bound

The concentration result of Lemma 1 justifies the following definition of the algorithmic hypothesis class: since with high probability $h_S$ concentrates around its expectation $\mathbb{E} h_S$, what matters for the generalization performance of the algorithm is the complexity of the ball centered at $\mathbb{E} h_S$, and not that of the entire hypothesis class $\mathcal{H}$. This observation may lead to significantly improved performance guarantees.

Definition 4 (Algorithmic Hypothesis Class). For a sample size $n$ and confidence parameter $\delta > 0$, let $r = r(n, \delta) = D\alpha(n)\sqrt{2n \log(2/\delta)}$, and define the algorithmic hypothesis class of a stable learning algorithm by

$$B_r = \{ h \in \mathcal{H} : \|h - \mathbb{E} h_S\| \le r(n, \delta) \}.$$

Note that, by Lemma 1, $h_S \in B_r$ with probability at least $1 - \delta$.

We bound the generalization error (1) in terms of the Rademacher complexity (Bartlett & Mendelson, 2003) of the algorithmic hypothesis class. The Rademacher complexity of a hypothesis class $\mathcal{H}$ on the feature space $\mathcal{X}$ is defined as

$$\mathfrak{R}(\mathcal{H}) = \mathbb{E} \sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i \langle h, X_i \rangle,$$

where $\sigma_1, \ldots, \sigma_n$ are i.i.d. Rademacher variables, uniformly distributed in $\{-1, +1\}$.

The next theorem shows how the Rademacher complexity of the algorithmic hypothesis class can be bounded. The bound depends on the type of the feature space $\mathcal{X}$. Recall that the Banach space $(\mathcal{X}, \|\cdot\|_*)$ is of type $p \ge 1$ if there exists a constant $C_p$ such that for all $x_1, \ldots, x_n \in \mathcal{X}$,

$$\mathbb{E} \Big\| \sum_{i=1}^n \sigma_i x_i \Big\|_* \le C_p \Big( \sum_{i=1}^n \|x_i\|_*^p \Big)^{1/p}.$$

In the important special case when $\mathcal{X}$ is a Hilbert space, the space is of type 2 with constant $C_2 = 1$.

Theorem 1. Assume that $\mathcal{B}$ is a $(2, D)$-smooth Banach space and that its dual $\mathcal{X}$ is of type $p$. Suppose that the marginal distribution of the $X_i$ is such that $\|X_i\|_* \le B$ with probability one, for some $B > 0$. If a learning algorithm is $\alpha(n)$-uniformly argument stable, then the Rademacher complexity of the algorithmic hypothesis class $B_r$ on the feature space satisfies

$$\mathfrak{R}(B_r) \le D C_p B \sqrt{2 \log(2/\delta)}\; \alpha(n)\, n^{-1/2 + 1/p}.$$

In particular, when $\mathcal{B}$ is a Hilbert space, the bound simplifies to

$$\mathfrak{R}(B_r) \le B \sqrt{2 \log(2/\delta)}\; \alpha(n).$$

Proof. We have

$$\begin{aligned}
\mathfrak{R}(B_r) &= \mathbb{E} \sup_{h \in B_r} \frac{1}{n} \sum_{i=1}^n \sigma_i \langle h, X_i \rangle \\
&= \mathbb{E} \sup_{h \in B_r} \frac{1}{n} \sum_{i=1}^n \big( \sigma_i \langle h, X_i \rangle - \sigma_i \langle \mathbb{E} h_S, X_i \rangle + \sigma_i \langle \mathbb{E} h_S, X_i \rangle \big) \\
&= \mathbb{E} \sup_{h \in B_r} \frac{1}{n} \sum_{i=1}^n \sigma_i \langle h - \mathbb{E} h_S, X_i \rangle \quad \text{(the term $\sigma_i \langle \mathbb{E} h_S, X_i \rangle$ does not depend on $h$ and has zero mean)} \\
&\le \frac{r}{n}\, \mathbb{E} \Big\| \sum_{i=1}^n \sigma_i X_i \Big\|_* \le \frac{r}{n}\, C_p \Big( \sum_{i=1}^n \|X_i\|_*^p \Big)^{1/p} \\
&\le D\alpha(n)\sqrt{2n \log(2/\delta)}\; C_p B\, n^{1/p - 1} = D C_p B \sqrt{2 \log(2/\delta)}\; \alpha(n)\, n^{-1/2 + 1/p},
\end{aligned}$$

concluding the proof.

The theorem above may easily be used to bound the performance of an $\alpha(n)$-uniformly argument stable learning algorithm. For simplicity, we state the result for Hilbert spaces only; the extension to $(2, D)$-smooth Banach spaces with a type-$p$ dual is straightforward.
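The Hilbert-space case of Theorem 1 can be checked numerically: in $\mathbb{R}^d$ the supremum over the ball $B_r$ has a closed form, so $\mathfrak{R}(B_r)$ can be estimated by Monte Carlo over the Rademacher signs. A minimal sketch, with the sample, the center (standing in for $\mathbb{E} h_S$), and the radius chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
X = rng.normal(size=(n, d))       # features (bounded norm assumed in the theorem)
center = rng.normal(size=d)       # stands in for E h_S
r = 0.1                           # radius r(n, delta) of B_r

def rademacher_ball(X, center, r, n_mc=2000):
    # For the ball {h : ||h - center|| <= r} in R^d, the supremum of
    # (1/n) sum_i sigma_i <h, X_i> is attained in closed form:
    #   (1/n) * (<center, s> + r * ||s||),  with s = sum_i sigma_i X_i.
    n = X.shape[0]
    vals = np.empty(n_mc)
    for k in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=n)
        s = sigma @ X
        vals[k] = (center @ s + r * np.linalg.norm(s)) / n
    return vals.mean()             # Monte Carlo estimate of R(B_r)

print("estimated R(B_r):", rademacher_ball(X, center, r))
```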
Corollary 1. Assume that $\mathcal{B}$ is a separable Hilbert space. Suppose that the marginal distribution of the $X_i$ is such that $\|X_i\| \le B$ with probability one, for some $B > 0$, and that the loss function is bounded and Lipschitz, that is, $\ell(h, Z) \le M$ with probability one for some $M > 0$ and

$$|\ell(h, z) - \ell(h', z)| \le L \,|\langle h, x \rangle - \langle h', x \rangle|$$

for all $z \in \mathcal{Z}$ and $h, h' \in \mathcal{H}$. If a learning algorithm is $\alpha(n)$-uniformly argument stable, then its generalization error is bounded as follows: with probability at least $1 - 2\delta$,

$$R(h_S) - R_S(h_S) \le 2LB\sqrt{2 \log(2/\delta)}\; \alpha(n) + M \sqrt{\frac{\log(1/\delta)}{2n}}.$$

Proof. Note first that, by Lemma 1, with probability at least $1 - \delta$,

$$R(h_S) - R_S(h_S) \le \sup_{h \in B_r} \big( R(h) - R_S(h) \big).$$

On the other hand, by the boundedness of the loss function and the bounded differences inequality, with probability at least $1 - \delta$,

$$\sup_{h \in B_r} \big( R(h) - R_S(h) \big) \le \mathbb{E} \sup_{h \in B_r} \big( R(h) - R_S(h) \big) + M \sqrt{\frac{\log(1/\delta)}{2n}} \le 2\, \mathfrak{R}(\ell \circ B_r) + M \sqrt{\frac{\log(1/\delta)}{2n}},$$

where $\ell \circ \mathcal{H}$ denotes the set of compositions of the loss $\ell$ with hypotheses $h \in \mathcal{H}$. By the Lipschitz property of the loss function and a standard contraction argument, i.e., the Talagrand contraction lemma (Ledoux & Talagrand, 2013), we have

$$\mathfrak{R}(\ell \circ B_r) \le L\, \mathfrak{R}(B_r) \le LB \sqrt{2 \log(2/\delta)}\; \alpha(n).$$

Note that for many stable algorithms $\alpha(n)$ is of order $O(1/n)$. For the notion of uniform stability, such bounds appear in Lugosi & Pawlak (1994); Bousquet & Elisseeff (2002); Wibisono et al. (2009); Hardt et al. (2015); Liu et al. (2017). As the examples below show, many of these learning algorithms even have uniform argument stability of order $O(1/n)$. In such cases the bound of Corollary 1 is essentially equivalent to the earlier results cited above. The bound is dominated by the term $M\sqrt{\log(1/\delta)/(2n)}$ coming from the bounded differences inequality. Fluctuations of order $O(n^{-1/2})$ are often inevitable, especially when $R(h_S)$ is not typically small. When a small risk is reasonable to expect, one may use more advanced concentration inequalities involving second-moment information, at the price of replacing the generalization error by the so-called deformed generalization error $R(h_S) - \frac{a}{a-1} R_S(h_S)$ for some $a > 1$. The next theorem derives such a bound, relying on techniques developed by Bartlett et al. (2005). This result improves essentially on earlier stability-based bounds.

Theorem 2. Assume that $\mathcal{B}$ is a separable Hilbert space. Suppose that the marginal distribution of the $X_i$ is such that $\|X_i\| \le B$ with probability one, for some $B > 0$, and that the loss function is bounded and Lipschitz, that is, $\ell(h, Z) \le M$ with probability one for some $M > 0$ and $|\ell(h, z) - \ell(h', z)| \le L\,|\langle h, x \rangle - \langle h', x \rangle|$ for all $z \in \mathcal{Z}$ and $h, h' \in \mathcal{H}$. Let $a > 1$. If a learning algorithm is $\alpha(n)$-uniformly argument stable, then, with probability at least $1 - 2\delta$,

$$R(h_S) - \frac{a}{a-1} R_S(h_S) \le 8LB\sqrt{2 \log(2/\delta)}\; \alpha(n) + (6a + 8)\, \frac{M \log(1/\delta)}{n}.$$

The proof of Theorem 2 relies on techniques developed by Bartlett et al. (2005). In particular, we make use of the following result.

Proposition 2 (Bartlett et al., 2005, Theorem 2.1). Let $F$ be a class of functions that map $\mathcal{X}$ into $[0, M]$. Assume that there is some $\rho > 0$ such that for every $f \in F$, $\mathrm{var}(f(X)) \le \rho$. Then, with probability at least $1 - \delta$,

$$\sup_{f \in F} \Big( \mathbb{E} f(X) - \frac{1}{n} \sum_{i=1}^n f(X_i) \Big) \le 4\, \mathfrak{R}(F) + \sqrt{\frac{2\rho \log(1/\delta)}{n}} + \frac{4M \log(1/\delta)}{3n}.$$

To prove the theorem, we also need the following auxiliary lemma. Define the class

$$G_r = \left\{ \frac{r}{\max\{r, \mathbb{E}\ell(h, Z)\}}\; \ell(h, \cdot) \;\middle|\; h \in B_r \right\}.$$

It is evident that $G_r \subseteq \{ \alpha\, \ell \circ h \mid h \in B_r,\ \alpha \in [0, 1] \}$. The following lemma is proven in Bartlett et al. (2005).

Lemma 2. Define

$$V_r = \sup_{g \in G_r} \Big( \mathbb{E} g(Z) - \frac{1}{n} \sum_{i=1}^n g(Z_i) \Big).$$

For any $r > 0$ and $a > 1$, if $V_r \le r/a$, then every $h \in B_r$ satisfies

$$\mathbb{E} \ell(h, Z) \le \frac{a}{a-1} \cdot \frac{1}{n} \sum_{i=1}^n \ell(h, Z_i) + V_r.$$

Now we are ready to prove Theorem 2.
Proof of Theorem 2. First, we introduce an inequality to build the connection between algorithmic stability and hypothesis complexity. According to Lemma 1, for any $a > 1$ and $\delta > 0$, with probability at least $1 - \delta$, we have

$$R(h_S) - \frac{a}{a-1} R_S(h_S) \le \sup_{h \in B_r} \Big( R(h) - \frac{a}{a-1} R_S(h) \Big). \tag{2}$$

Second, we upper bound the term $\sup_{h \in B_r} \big( R(h) - \frac{a}{a-1} R_S(h) \big)$ with high probability. It is easy to check that for any $g \in G_r$, $\mathbb{E} g(Z) \le r$ and $g(Z) \in [0, M]$. Then

$$\mathrm{var}(g(Z)) \le \mathbb{E} (g(Z))^2 \le M\, \mathbb{E} g(Z) \le Mr.$$

Applying Proposition 2 with $\rho = Mr$, with probability at least $1 - \delta$,

$$V_r \le 4\,\mathfrak{R}(G_r) + \sqrt{\frac{2Mr \log(1/\delta)}{n}} + \frac{4M \log(1/\delta)}{3n}.$$

Since $\sqrt{2Mr\log(1/\delta)/n} \le r/(2a) + aM\log(1/\delta)/n$, this means that there exists an

$$r^* \le \frac{2Ma^2 \log(1/\delta)}{n} + 8a\, \mathfrak{R}(G_{r^*}) + \frac{8a M \log(1/\delta)}{3n}$$

such that $V_{r^*} \le r^*/a$ holds. According to Lemma 2, for any $h \in B_{r^*}$, with probability at least $1 - \delta$, we have

$$\mathbb{E} \ell(h, Z) \le \frac{a}{a-1} \cdot \frac{1}{n}\sum_{i=1}^n \ell(h, Z_i) + V_{r^*} \le \frac{a}{a-1} \cdot \frac{1}{n}\sum_{i=1}^n \ell(h, Z_i) + \frac{2Ma \log(1/\delta)}{n} + 8\,\mathfrak{R}(G_{r^*}) + \frac{8M \log(1/\delta)}{3n}.$$

It is easy to verify that $G_r \subseteq \{ \alpha\,\ell \circ h \mid h \in B_r,\ \alpha \in [0,1] \} \subseteq \mathrm{conv}(\{0\} \cup \ell \circ B_r)$. By elementary properties of the Rademacher complexity (see, e.g., Bartlett & Mendelson, 2003), $\mathcal{H}' \subseteq \mathcal{H}$ implies $\mathfrak{R}(\mathcal{H}') \le \mathfrak{R}(\mathcal{H})$, and taking convex hulls does not increase the Rademacher complexity. Then, with probability at least $1 - \delta$, we have

$$\mathbb{E}\ell(h, Z) - \frac{a}{a-1} \cdot \frac{1}{n}\sum_{i=1}^n \ell(h, Z_i) \le \frac{2Ma\log(1/\delta)}{n} + 8\,\mathfrak{R}(\ell \circ B_r) + \frac{8M\log(1/\delta)}{3n}.$$

The proof of Theorem 2 is completed by combining the above inequality with inequality (2), the Talagrand contraction lemma, and Theorem 1.

In the next section, we specialize the above results to some learning algorithms by proving their uniform argument stability.

## 4. Applications

Various learning algorithms have been proved to possess some kind of stability. We refer the reader to Devroye & Wagner (1979); Lugosi & Pawlak (1994); Bousquet & Elisseeff (2002); Zhang (2003); Wibisono et al. (2009); Hardt et al. (2015); Liu et al. (2017) for such examples, including stochastic gradient descent methods, empirical risk minimization, and non-parametric learning algorithms such as $k$-nearest neighbor rules and kernel regression.

### 4.1. Empirical Risk Minimization

Regularized empirical risk minimization is known to be uniformly stable (Bousquet & Elisseeff, 2002). Here we consider regularized empirical risk minimization (RERM) algorithms of the following form. The objective function of RERM is

$$R_{S,\lambda}(h) = \frac{1}{n} \sum_{i=1}^n \ell(h, Z_i) + \lambda N(h),$$

where $N : h \in \mathcal{H} \mapsto N(h) \in \mathbb{R}_+$ is a convex penalty function. Its expected counterpart is defined as

$$R_\lambda(h) = \mathbb{E}\, \ell(h, Z) + \lambda N(h).$$

Bousquet & Elisseeff (2002) proved that $\ell_2$-regularized learning algorithms are $\beta(n)$-uniformly stable. Wibisono et al. (2009) extended this result and studied sufficient conditions on the penalty term $N(h)$ that ensure uniform $\beta(n)$-stability. As we now show, both of their proof methods are applicable to the analysis of uniform argument stability. By exploiting their results, we show that stable RERM algorithms have strong generalization properties.

Theorem 3. Assume that $\mathcal{B}$ is a separable Hilbert space. Suppose that the marginal distribution of the $X_i$ is such that $\|X_i\| \le B$ with probability one, for some $B > 0$, and that the loss function is convex in $h$, bounded by $M$, and $L$-Lipschitz. Suppose that for some constants $C$ and $\xi > 1$ the penalty function $N(h)$ satisfies

$$N(h_S) + N(h_{S^i}) - 2N\Big( \frac{h_S + h_{S^i}}{2} \Big) \ge C \,\|h_S - h_{S^i}\|^\xi. \tag{3}$$

Then, for any $\delta > 0$ and $a > 1$, if $h_S$ is the output of RERM, with probability at least $1 - 2\delta$, we have

$$R(h_S) - \frac{a}{a-1} R_S(h_S) \le 8LB \sqrt{2 \log(2/\delta)} \left( \frac{2LB}{C \lambda n} \right)^{\frac{1}{\xi - 1}} + (6a + 8)\, \frac{M \log(1/\delta)}{n}.$$

Specifically, when $N(h) = \|h\|^2$, condition (3) holds with $\xi = 2$ and $C = 1/2$.
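Before turning to the proof, a small sketch of the $\ell_2$ case. It first checks numerically that $N(h) = \|h\|^2$ satisfies condition (3) with $\xi = 2$ and $C = 1/2$ (for this penalty, $N(h) + N(h') - 2N((h+h')/2) = \|h - h'\|^2/2$ holds exactly), and then solves a ridge-regularized RERM instance in closed form. The squared loss and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)

# Check condition (3) for N(h) = ||h||^2: the identity
#   N(h) + N(h') - 2 * N((h + h') / 2) = ||h - h'||^2 / 2
# holds exactly, giving xi = 2 and C = 1/2.
h, hp = rng.normal(size=5), rng.normal(size=5)
N = lambda v: v @ v
lhs = N(h) + N(hp) - 2 * N((h + hp) / 2)
assert np.isclose(lhs, 0.5 * np.linalg.norm(h - hp) ** 2)

# A minimal RERM instance: squared loss with the l2 penalty (ridge), i.e.
#   R_{S,lam}(h) = (1/n) sum_i (<h, x_i> - y_i)^2 + lam * ||h||^2,
# minimized in closed form.
n, d, lam = 100, 5, 1.0
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
h_S = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
print("RERM output h_S:", h_S)
```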
Proof. The proof of Theorem 3 relies on the following result implied by Wibisono et al. (2009).

Proposition 3. Assume the conditions of Theorem 3. Then the RERM learning algorithm is $\alpha(n)$-uniformly argument stable with

$$\alpha(n) \le \left( \frac{2LB}{C\lambda n} \right)^{\frac{1}{\xi-1}},$$

and hence $\beta(n)$-uniformly stable with $\beta(n) = LB\,\alpha(n)$. Specifically, when $N(h) = \|h\|_p^p$ and $1 < p \le 2$, condition (3) on the penalty function holds with $\xi = 2$ and a constant $C$ depending only on $p$, where $\|h\|_p^p = \sum_r |h_r|^p$ and $r$ indexes the coordinates.

Theorem 3 follows by combining Theorem 2 and Proposition 3.

### 4.2. Stochastic Gradient Descent

Stochastic gradient descent (SGD) is one of the most widely used optimization methods in machine learning. Hardt et al. (2015) showed that parametric models trained by SGD are uniformly stable. Their results apply to both convex and non-convex learning problems and provide insights into why SGD performs well in practice, in particular for deep learning algorithms. Their results rest on the assumptions that the loss function is both Lipschitz and smooth.

To avoid the technicalities of defining derivatives in general Hilbert spaces, in this section we assume that $\mathcal{B} = \mathcal{X} = \mathbb{R}^d$, the $d$-dimensional Euclidean space.

Definition 5 (Smooth). A differentiable loss function $\ell(h, \cdot)$ is $s$-smooth if for all $h, h' \in \mathcal{H}$, we have

$$\| \nabla_h \ell(h, \cdot) - \nabla_{h'} \ell(h', \cdot) \| \le s \,\|h - h'\|,$$

where $\nabla_x f(x)$ denotes the derivative of $f(x)$ with respect to $x$ and $s > 0$.

Definition 6 (Strongly Convex). A differentiable loss function $\ell(h, \cdot)$ is $\gamma$-strongly convex with respect to $\|\cdot\|$ if for all $h, h' \in \mathcal{H}$, we have

$$\big( \nabla_h \ell(h, \cdot) - \nabla_{h'} \ell(h', \cdot) \big)^\top (h - h') \ge \gamma \,\|h - h'\|^2,$$

where $\gamma > 0$.

Theorem 2 applies to SGD when the loss function $\ell(h, x)$ is $L$-Lipschitz and $s$-smooth and $h$ acts linearly on $x$. Note that our definition of $L$-Lipschitzness requires the loss function to be Lipschitz in the linear form $\langle h, x \rangle$.

Theorem 4. Let the stochastic gradient update rule be

$$h_{t+1} = h_t - \alpha_t \nabla_h \ell(h_t, Z_{i_t}),$$

where $\alpha_t > 0$ is the learning rate and $i_t$ is the index of the example chosen for the $t$-th update. Let $h_T$ and $h^i_T$ denote the outputs of SGD run on the samples $S$ and $S^i$, respectively. Assume that $\|X\| \le B$ with probability one. Suppose that the loss function is $L$-Lipschitz, $s$-smooth, and upper bounded by $M$, and that SGD is run with a monotonically non-increasing step size $\alpha_t \le c/t$, where $c$ is a universal constant, for $T$ steps. Then, for any $\delta > 0$ and $a > 1$, with probability at least $1 - 2\delta$, we have

$$R(h_T) - \frac{a}{a-1} R_S(h_T) \le 8LB\sqrt{2\log(2/\delta)}\; \frac{1 + 1/(sc)}{n - 1}\, (2cBL)^{\frac{1}{sc+1}}\, T^{\frac{sc}{sc+1}} + (6a+8)\,\frac{M\log(1/\delta)}{n}.$$

When the loss function $\ell$ is convex, $L$-Lipschitz, $s$-smooth, and upper bounded by $M$, suppose that SGD is run with step sizes $\alpha_t \le 2/s$ for $T$ steps. Then, for any $\delta > 0$ and $a > 1$, with probability at least $1 - 2\delta$,

$$R(h_T) - \frac{a}{a-1} R_S(h_T) \le 8LB\sqrt{2\log(2/\delta)}\; \frac{2BL}{n} \sum_{t=1}^T \alpha_t + (6a+8)\,\frac{M\log(1/\delta)}{n}.$$

Moreover, when the loss function $\ell$ is $\gamma$-strongly convex, $s$-smooth, and upper bounded by $M$, let the stochastic gradient update be given by

$$h_{t+1} = \Pi_\Omega\big( h_t - \alpha_t \nabla_h \ell(h_t, Z_{i_t}) \big),$$

where $\Omega$ is a compact, convex set over which we wish to optimize and $\Pi_\Omega(\cdot)$ is the projection $\Pi_\Omega(f) = \arg\min_{h \in \Omega} \|h - f\|$. If the loss function is furthermore $L$-Lipschitz over the set $\Omega$ and the projected SGD is run with a constant step size $\alpha \le 1/s$ for $T$ steps, then, for any $\delta > 0$ and $a > 1$, with probability at least $1 - 2\delta$, the projected SGD satisfies

$$R(h_T) - \frac{a}{a-1} R_S(h_T) \le \frac{16 L^2 B^2}{\gamma n} \sqrt{2\log(2/\delta)} + (6a+8)\,\frac{M\log(1/\delta)}{n}.$$

Note that any $\ell_2$-regularized convex loss function is strongly convex. Bousquet & Elisseeff (2002) studied the stability of batch methods; when the loss function is strongly convex, the stability of SGD is consistent with their result.
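The argument stability driving Theorem 4 can be observed directly: run SGD twice with the same index sequence $i_1, \ldots, i_T$, once on $S$ and once on a perturbed sample $S^i$, and measure $\|h_T - h^i_T\|$. A minimal sketch with a smooth, convex squared loss (not globally Lipschitz, but effectively so on this bounded synthetic data) and step size $\alpha_t = c/t$; all data and constants below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, T, c = 200, 5, 1000, 0.1
X, y = rng.normal(size=(n, d)), rng.normal(size=n)   # sample S

def grad(h, x, yi):
    # gradient in h of the convex, smooth squared loss (<h, x> - y)^2
    return 2 * (x @ h - yi) * x

def sgd(X, y, idx):
    h = np.zeros(d)
    for t, i in enumerate(idx, start=1):
        h -= (c / t) * grad(h, X[i], y[i])   # step size alpha_t = c / t
    return h

idx = rng.integers(n, size=T)    # shared index sequence i_1, ..., i_T
h_T = sgd(X, y, idx)

Xi, yi = X.copy(), y.copy()
Xi[0], yi[0] = rng.normal(size=d), rng.normal()      # S^1: replace the first example
h_Ti = sgd(Xi, yi, idx)

print("||h_T - h^1_T|| =", np.linalg.norm(h_T - h_Ti))   # shrinks as n grows
```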
While the above result only applies to $L$-Lipschitz loss functions as defined in Definition 3, it does explain some generalization properties of layer-wise training of neural networks by stochastic gradient descent. In this once-common training scheme (see, e.g., Bengio et al., 2007), one freezes the parameters of the network before/after a certain layer and performs SGD on that single layer. It is easy to see that, as long as the activation function and the loss function of the network are Lipschitz-continuous in their inputs, the overall loss satisfies the continuity conditions of Theorem 4. This implies that the parameters in each layer may generalize well in a certain sense if SGD is employed with early stopping.

The proof of Theorem 4 follows immediately from Theorem 2, combined with the following result implied by Hardt et al. (2015) (a collection of the results of Theorems 3.8, 3.9, and 3.12 therein).

Proposition 4. Let the stochastic gradient update be given by $h_{t+1} = h_t - \alpha_t \nabla_h \ell(h_t, Z_{i_t})$, where $\alpha_t > 0$ is the learning rate and $i_t$ is the index of the example chosen for the $t$-th update. Let $h_T$ and $h^i_T$ denote the outputs of SGD run on the samples $S$ and $S^i$, respectively.

When the loss function is $L$-Lipschitz and $s$-smooth, and SGD is run with a monotonically non-increasing step size $\alpha_t \le c/t$, where $c$ is a universal constant, for $T$ steps, then

$$\|h_T - h^i_T\| \le \frac{1 + 1/(sc)}{n-1}\, (2cBL)^{\frac{1}{sc+1}}\, T^{\frac{sc}{sc+1}}.$$

When the loss function $\ell$ is convex, $L$-Lipschitz, and $s$-smooth, and SGD is run with step sizes $\alpha_t \le 2/s$ for $T$ steps, then

$$\|h_T - h^i_T\| \le \frac{2BL}{n} \sum_{t=1}^T \alpha_t.$$

Moreover, when the loss function $\ell$ is $\gamma$-strongly convex and $s$-smooth, let the stochastic gradient update be given by $h_{t+1} = \Pi_\Omega(h_t - \alpha_t \nabla_h \ell(h_t, Z_{i_t}))$, where $\Omega$ is a compact, convex set over which we wish to optimize and $\Pi_\Omega(f) = \arg\min_{h \in \Omega} \|h - f\|$. If the loss function is $L$-Lipschitz over the set $\Omega$ and the projected SGD is run with a constant step size $\alpha \le 1/s$ for $T$ steps, then the projected SGD satisfies

$$\|h_T - h^i_T\| \le \frac{2BL}{\gamma n}.$$

## 5. Conclusion

We introduced the concepts of uniform argument stability and the algorithmic hypothesis class, defined as the class of hypotheses that are likely to be output by the learning algorithm. We proposed a general probabilistic framework that exploits local estimates of the complexity of the hypothesis class to obtain fast convergence rates for stable learning algorithms. Specifically, we defined the algorithmic hypothesis class by observing that the output of a stable learning algorithm concentrates around $\mathbb{E} h_S$. The Rademacher complexity of the algorithmic hypothesis class then converges at the same rate as the uniform argument stability, which in a Hilbert space is of order $O(1/n)$ for various learning algorithms, such as empirical risk minimization and stochastic gradient descent. We derived fast convergence rates of order $O(1/n)$ for their deformed generalization errors. Unlike previously published guarantees of a similar flavor, our bounds hold with high probability rather than only in expectation.

Our study leaves some open problems and allows several possible extensions. First, the algorithmic hypothesis class defined in this study depends mainly on properties of the learning algorithm and only a little on the data distribution. It would be interesting to investigate ways of defining an algorithmic hypothesis class that take both the algorithmic properties and the data distribution into account.
Second, it would be interesting to explore whether algorithmic properties other than stability could also result in a small algorithmic hypothesis class.

## Acknowledgments

Liu and Tao were partially supported by Australian Research Council Projects FT-130101457, DP-140102164, and LP-150100671. Lugosi was partially supported by the Spanish Ministry of Economy and Competitiveness, Grant MTM2015-67304-P, and FEDER, EU. Neu was partially supported by the UPFellows Fellowship (Marie Curie COFUND program 600387).

## References

Bartlett, Peter L. and Mendelson, Shahar. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2003.

Bartlett, Peter L., Bousquet, Olivier, and Mendelson, Shahar. Local Rademacher complexities. Annals of Statistics, 33(4):1497-1537, 2005.

Bengio, Yoshua, Lamblin, Pascal, Popovici, Dan, and Larochelle, Hugo. Greedy layer-wise training of deep networks. In NIPS, 2007.

Bonnans, J. Frédéric and Shapiro, Alexander. Perturbation Analysis of Optimization Problems. Springer Science & Business Media, 2013.

Boucheron, Stéphane, Lugosi, Gábor, and Massart, Pascal. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Bousquet, Olivier and Elisseeff, André. Stability and generalization. Journal of Machine Learning Research, 2:499-526, 2002.

Devroye, Luc and Wagner, Terry J. Distribution-free inequalities for the deleted and holdout error estimates. IEEE Transactions on Information Theory, 25(2):202-207, 1979.

Hardt, Moritz, Recht, Benjamin, and Singer, Yoram. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

Ledoux, Michel and Talagrand, Michel. Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.

Liu, Tongliang, Tao, Dacheng, Song, Mingli, and Maybank, Stephen J. Algorithm-dependent generalization bounds for multi-task learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(2):227-241, February 2017.

Lugosi, Gábor and Pawlak, Miroslaw. On the posterior-probability estimate of the error rate of nonparametric classification rules. IEEE Transactions on Information Theory, 40(2):475-481, 1994.

Mukherjee, Sayan, Niyogi, Partha, Poggio, Tomaso, and Rifkin, Ryan. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25:161-193, 2006.

Pinelis, Iosif. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, 22(4):1679-1706, 1994.

Pisier, Gilles. Martingales in Banach spaces (in connection with type and cotype). IHP course notes, 2011.

Rakhlin, Alexander and Sridharan, Karthik. On equivalence of martingale tail bounds and deterministic regret inequalities. arXiv preprint arXiv:1510.03925, 2015.

Shalev-Shwartz, Shai, Shamir, Ohad, Srebro, Nathan, and Sridharan, Karthik. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11:2635-2670, 2010.

Wibisono, Andre, Rosasco, Lorenzo, and Poggio, Tomaso. Sufficient conditions for uniform stability of regularization algorithms. Technical Report MIT-CSAIL-TR-2009-060, 2009.

Zhang, Tong. Leave-one-out bounds for kernel methods. Neural Computation, 15(6):1397-1437, 2003.