# Concentration inequalities under sub-Gaussian and sub-exponential conditions

Andreas Maurer
Istituto Italiano di Tecnologia
am@andreas-maurer.eu

Massimiliano Pontil
Istituto Italiano di Tecnologia & University College London
massimiliano.pontil@iit.it

Abstract

We prove analogues of the popular bounded difference inequality (also called McDiarmid's inequality) for functions of independent random variables under sub-Gaussian and sub-exponential conditions. Applied to vector-valued concentration and the method of Rademacher complexities, these inequalities allow an easy extension of uniform convergence results for PCA and linear regression to the case of potentially unbounded input and output variables.

1 Introduction

The popular bounded difference inequality [11] has become a standard tool in the analysis of algorithms. It bounds the deviation probability of a function of independent random variables from its mean in terms of the sum of conditional ranges, and it cannot be applied when these ranges are infinite. This hampers the utility of the inequality in certain situations. It may happen that the conditional ranges are infinite, but that the conditional versions, the random variables obtained by fixing all but one of the arguments of the function, have light tails with exponential decay. In this case we might still expect exponential concentration, but the bounded difference inequality is of no help.

Vershynin's book [14] gives general Hoeffding- and Bernstein-type inequalities for sums of independent sub-Gaussian or sub-exponential random variables. In situations where the bounded difference inequality is used, one would like to have analogous bounds for general functions. In this work we use the entropy method ([8], [2], [3]) to extend these inequalities from sums to general functions of independent variables whose centered conditional versions are sub-Gaussian or sub-exponential, respectively. These concentration inequalities, Theorems 3, 4 and 5, are stated in Section 3 below. Theorems 4 and 5, which apply to the heavier-tailed sub-exponential distributions, are our principal contributions. Theorem 3 for the sub-Gaussian case has less novelty, but it is included to complete the picture, and because its proof provides a good demonstration of the entropy method.

For the purpose of illustration we apply these results to some standard problems in learning theory: vector-valued concentration, the generalization of PCA and the method of Rademacher complexities. Over the last twenty years the latter method ([1], [5]) has been successfully used to prove generalization bounds in a variety of situations. The Rademacher complexity itself does not necessitate boundedness, but, when losses and data distributions are unbounded, the use of the bounded difference inequality can only be circumvented with considerable effort. Using our bounds the extension is immediate. We also show how an inequality of Kontorovich [6], which describes concentration on products of sub-Gaussian metric probability spaces and has applications to algorithmic stability, can be extended to the sub-exponential case.

Related work. Several works contain results very similar to Theorem 3, which refers to the sub-Gaussian case. The closest to it is Theorem 3 in [12], which gives essentially the same learning bounds for sub-Gaussian distributions. Theorem 1 in [6] is also somewhat similar, but specializes to metric probability spaces.
Somewhat akin is the work in [7]. For the sub-exponential case, addressed by Theorems 4 and 5, we have not found comparable results in the literature. There has been a lot of work to establish generalization in unbounded situations ([12], [4], [6]), or the astounding results in [13], but we are unaware of an equally simple extension of the method of Rademacher complexities to sub-exponential distributions as the one given below.

2 Preliminaries

2.1 Notation and conventions

We use upper-case letters for random variables and vectors of random variables and lower-case letters for scalars and vectors of scalars. In the sequel $X = (X_1, \dots, X_n)$ is a vector of independent random variables with values in a space $\mathcal{X}$, the vector $X' = (X'_1, \dots, X'_n)$ is iid to $X$, and $f$ is a function $f : \mathcal{X}^n \to \mathbb{R}$. We are interested in concentration of the random variable $f(X)$ about its expectation, and require some special notation to describe the fluctuations of $f$ in its $k$-th variable $X_k$ when the other variables $(x_i : i \ne k)$ are given.

Definition 1 If $f : \mathcal{X}^n \to \mathbb{R}$, $x = (x_1, \dots, x_n) \in \mathcal{X}^n$ and $X = (X_1, \dots, X_n)$ is a random vector with independent components in $\mathcal{X}^n$, then the $k$-th centered conditional version of $f$ is the random variable
$$f_k(X)(x) = f(x_1, \dots, x_{k-1}, X_k, x_{k+1}, \dots, x_n) - \mathbb{E}\big[f(x_1, \dots, x_{k-1}, X'_k, x_{k+1}, \dots, x_n)\big].$$

Then $f_k(X)$ is a random-variable-valued function $f_k(X) : x \in \mathcal{X}^n \mapsto f_k(X)(x)$, which does not depend on the $k$-th coordinate of $x$. If $\|\cdot\|_a$ is any given norm on random variables, then $\|f_k(X)\|_a(x) := \|f_k(X)(x)\|_a$ defines a non-negative real-valued function $\|f_k(X)\|_a$ on $\mathcal{X}^n$. Thus $\|f_k(X)\|_a(X)$ is also a random variable, of which $\big\|\|f_k(X)\|_a\big\|_\infty$ is the essential supremum. If $X'$ is iid to $X$, then $\|f_k(X)\|_a$ is the same function as $\|f_k(X')\|_a$ and $\|f_k(X)\|_a(X')$ is iid to $\|f_k(X)\|_a(X)$. Note that
$$f_k(X)(X) = f(X) - \mathbb{E}\big[f(X) \mid X_1, \dots, X_{k-1}, X_{k+1}, \dots, X_n\big].$$
Also, if $f(x) = \sum_{i=1}^n x_i$, then $f_k(X)(x) = X_k - \mathbb{E}[X_k]$ is independent of $x$.

If $H$ is a Hilbert space, then the Hilbert space of Hilbert-Schmidt operators $HS(H)$ is the set of bounded operators $T$ on $H$ satisfying $\|T\|_{HS} = \sqrt{\sum_{ij} \langle T e_i, e_j \rangle_H^2} < \infty$, with inner product $\langle T, S \rangle_{HS} = \sum_{ij} \langle T e_i, e_j \rangle_H \langle S e_i, e_j \rangle_H$, where $(e_i)$ is an orthonormal basis. For $x \in H$ the operator $Q_x \in HS(H)$ is defined by $Q_x y = \langle y, x \rangle x$, and one verifies that $\|Q_x\|_{HS} = \|x\|_H^2$.

2.2 Sub-Gaussian and sub-exponential norms

It follows from Propositions 2.7.1 and 2.5.2 in [14] that we can equivalently redefine the usual sub-Gaussian and sub-exponential norms $\psi_2$ and $\psi_1$ for any real random variable $Z$ as
$$\|Z\|_{\psi_2} = \sup_{p \ge 1} \frac{\|Z\|_p}{\sqrt{p}} \quad \text{and} \quad \|Z\|_{\psi_1} = \sup_{p \ge 1} \frac{\|Z\|_p}{p}, \qquad (1)$$
where the $L_p$-norms are defined as $\|Z\|_p = (\mathbb{E}[|Z|^p])^{1/p}$. It also follows from the above-mentioned propositions that for every centered sub-Gaussian random variable $Z$ we have, for all $\beta \in \mathbb{R}$,
$$\mathbb{E}\big[e^{\beta Z}\big] \le e^{4e\beta^2 \|Z\|_{\psi_2}^2}. \qquad (2)$$

Sub-exponential variables ($\|Z\|_{\psi_1} < \infty$) have heavier tails than sub-Gaussian variables and include the exponential, chi-squared and Poisson distributions. Products and squares of sub-Gaussian variables are sub-exponential, in particular
$$\|Z^2\|_{\psi_1} = \sup_{p \ge 1} \frac{\|Z^2\|_p}{p} = 2 \sup_{p \ge 1} \frac{\|Z\|_{2p}^2}{2p} \le 2\|Z\|_{\psi_2}^2$$
(we would have $\|Z^2\|_{\psi_1} = \|Z\|_{\psi_2}^2$ if the norms were defined as in [14]). All sub-Gaussian and bounded variables are sub-exponential. For bounded variables we have $\|Z\|_{\psi_1} \le \|Z\|_{\psi_2} \le \|Z\|_\infty$, but for concentrated variables the sub-Gaussian and sub-exponential norms can be much smaller. The arithmetic mean of $N$ iid bounded variables has uniform norm $O(1)$ and sub-Gaussian norm $O(N^{-1/2})$, and the square of the mean has sub-exponential norm $O(N^{-1})$ (see [14]). The inequalities of the next section can therefore be applied successfully to $n$ such variables, even when the bounded difference inequality gives only trivial results, for example when $n < \ln(1/\delta)$, where $\delta$ is the confidence parameter.
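The moment characterization (1) suggests a simple way to estimate these norms numerically. The following sketch (our own illustration, not part of the paper's toolkit) approximates $\|Z\|_{\psi_2}$ and $\|Z\|_{\psi_1}$ by taking the supremum in (1) over a finite grid of moments estimated by Monte Carlo; the function names and the grid are our choices, and a finite grid can only underestimate the supremum.

```python
import numpy as np

def lp_norm(samples, p):
    """Monte Carlo estimate of ||Z||_p = (E|Z|^p)^(1/p)."""
    return np.mean(np.abs(samples) ** p) ** (1.0 / p)

def psi_norm(samples, alpha, p_grid=range(1, 21)):
    """Rough estimate of ||Z||_{psi_alpha} = sup_{p>=1} ||Z||_p / p^(1/alpha)
    (alpha=2: sub-Gaussian, alpha=1: sub-exponential), with the sup taken
    over a finite grid of p only."""
    return max(lp_norm(samples, p) / p ** (1.0 / alpha) for p in p_grid)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, m = 100, 20000
    # arithmetic mean of N iid Rademacher variables, m Monte Carlo draws
    means = rng.choice([-1.0, 1.0], size=(m, N)).mean(axis=1)
    print("empirical max |mean| (the uniform norm is 1):", np.abs(means).max())
    print("sub-Gaussian norm of the mean, roughly N^(-1/2):", psi_norm(means, alpha=2))
    print("sub-exponential norm of its square, roughly N^(-1):", psi_norm(means ** 2, alpha=1))
```

Run on the mean of $N$ Rademacher variables, the printed values illustrate the $O(1)$, $O(N^{-1/2})$ and $O(N^{-1})$ scalings quoted above.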
For strongly concentrated variables we have the following lemma (with proof in the supplement).

Lemma 2 Suppose the random variable $X$ satisfies $\mathbb{E}[X] = 0$, $|X| \le 1$ a.s. and $\Pr\{|X| > \epsilon\} \le \epsilon$ for some $\epsilon > 0$. Then for all $p \ge 1$, $\|X\|_p \le 2\epsilon^{1/p}$ and $\|X\|_{\psi_1} \le 2\,(e \ln(1/\epsilon))^{-1}$.

In a nearly deterministic situation, with $\epsilon = e^{-d}$, we have $\|X\|_{\psi_1} \le O(1/d)$, and a simple union bound of the sub-exponential inequalities allows uniform estimation of $e^d$ such variables with a sample size of order $d$.

In several applications we will require a sub-Gaussian or sub-exponential bound on the norm of a random vector. This may seem quite restrictive. If $X = \sum_i Z_i$, then in general we can only say $\big\|\|X\|\big\|_{\psi_1} \le \sum_i \big\|\|Z_i\|\big\|_{\psi_1}$, so if $\mathcal{X} = \mathbb{R}^d$ with basis $(e_i)$ then our most general estimate is $\big\|\|X\|\big\|_{\psi_1} \le \sum_{i=1}^d \|\langle e_i, X \rangle\|_{\psi_1}$, which has poor dimension dependence. But in many situations in machine learning one can assume that $X$ is a sum, $X = Z_{\mathrm{signal}} + Z_{\mathrm{noise}}$, where $Z_{\mathrm{signal}}$ is bounded and the perturbing component $Z_{\mathrm{noise}}$ is of small sub-exponential norm, albeit potentially unbounded.

3 Main results

Our first result assumes sub-Gaussian versions $f_k(X)$. It is an unbounded analogue of the popular bounded difference inequality, which is sometimes also called McDiarmid's inequality ([3], [11]).

Theorem 3 Let $f : \mathcal{X}^n \to \mathbb{R}$ and let $X = (X_1, \dots, X_n)$ be a vector of independent random variables with values in a space $\mathcal{X}$. Then for any $t > 0$ we have
$$\Pr\{f(X) - \mathbb{E}[f(X')] > t\} \le \exp\left(\frac{-t^2}{64e \big\|\sum_k \|f_k(X)\|_{\psi_2}^2\big\|_\infty}\right).$$

If $f$ is a sum of sub-Gaussian variables this reduces to the general Hoeffding inequality, Theorem 2.6.2 in [14]. On the other hand, if the $f_k(X)$ are a.s. bounded, $|f_k(X)(x)| \le r_k(x)$, then also $\|f_k(X)\|_{\psi_2}(x) \le r_k(x)$ and we recover the bounded difference inequality (Theorem 6.5 in [3]) up to a constant factor. A similar result to Theorem 3 is given with better constants in [6], although in specialized and slightly weaker forms, where the essential supremum is inside the sum in the denominator of the exponent.

The next two results are our principal contributions and apply to functions with sub-exponential conditional versions.

Theorem 4 With $f$ and $X$ as in Theorem 3, for any $t > 0$,
$$\Pr\{f(X) - \mathbb{E}[f(X')] > t\} \le \exp\left(\frac{-t^2}{4e^2 \big\|\sum_k \|f_k(X)\|_{\psi_1}^2\big\|_\infty + 2e \max_k \big\|\|f_k(X)\|_{\psi_1}\big\|_\infty\, t}\right).$$

The bound exhibits a sub-Gaussian tail governed by the variance-proxy $\big\|\sum_k \|f_k(X)\|_{\psi_1}^2\big\|_\infty$ for small deviations, and a sub-exponential tail governed by the scale-proxy $\max_k \big\|\|f_k(X)\|_{\psi_1}\big\|_\infty$ for large deviations. If $f$ is a sum we recover the inequality in [14], Theorem 2.8.1.

In Theorem 4 both the variance-proxy and the scale-proxy depend on the sub-exponential norms $\|\cdot\|_{\psi_1}$. A well-known two-tailed bound for sums of bounded variables, Bernstein's inequality [11], has the variance-proxy depending on $\|\cdot\|_2$ and the scale-proxy on $\|\cdot\|_\infty$. When $\|\cdot\|_2 \ll \|\cdot\|_\infty$ this leads to tighter bounds whenever the inequality is operating in the sub-Gaussian regime, which often happens for large sample sizes. The next result allows a similar use whenever $\|\cdot\|_{2p} \ll q\|\cdot\|_{\psi_1}$ for conjugate exponents $p$ and $q$.

Theorem 5 With $f$ and $X$ as above, let $p, q \in (1, \infty)$ satisfy $p^{-1} + q^{-1} = 1$. Then for any $t > 0$,
$$\Pr\{f(X) - \mathbb{E}[f(X')] > t\} \le \exp\left(\frac{-t^2}{4 \big\|\sum_k \|f_k(X)\|_{2p}^2\big\|_\infty + 2eq \max_k \big\|\|f_k(X)\|_{\psi_1}\big\|_\infty\, t}\right).$$

We cannot let $p \to 1$ to recover the behaviour of Bernstein's inequality in the sub-Gaussian regime, because this would drive the scale-proxy to infinity. But already $p = q = 2$ can give substantial improvements over Theorem 4, if the distributions of the $f_k(X)$ are very concentrated. This inequality appears to be new even when applied to sums. A proof is given in the supplement, where we also show that the $q$ in the scale-proxy can be replaced by $\sqrt{q}$ if the sub-exponential norm is replaced by the sub-Gaussian norm.
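To see how these bounds are used in the applications below, it helps to have the inversion from tail probability to deviation level in explicit form. The following small helper evaluates the tail bound of Theorem 4 as reconstructed above and solves it for $t$ at confidence $\delta$; this is exactly the step of "equating the probability to $\delta$ and solving for $t$" used in Section 4. The function names are ours and the constants follow our reconstruction of the theorem.

```python
import math

E = math.e

def theorem4_tail(t, var_proxy, scale_proxy):
    """Upper bound on Pr{f(X) - E f(X') > t} from Theorem 4 (as reconstructed above),
    with var_proxy = || sum_k ||f_k(X)||_{psi1}^2 ||_inf and
    scale_proxy = max_k || ||f_k(X)||_{psi1} ||_inf."""
    return math.exp(-t ** 2 / (4 * E ** 2 * var_proxy + 2 * E * scale_proxy * t))

def theorem4_deviation(delta, var_proxy, scale_proxy):
    """Smallest t with theorem4_tail(t, ...) <= delta, obtained from the quadratic
    t^2 = (4 e^2 V + 2 e M t) ln(1/delta), together with the relaxed form
    2e sqrt(V ln(1/delta)) + 2e M ln(1/delta) used in the applications."""
    L = math.log(1.0 / delta)
    exact = E * scale_proxy * L + math.sqrt((E * scale_proxy * L) ** 2 + 4 * E ** 2 * var_proxy * L)
    relaxed = 2 * E * math.sqrt(var_proxy * L) + 2 * E * scale_proxy * L
    return exact, relaxed

# Example: n = 1000 coordinates, each conditional version with psi_1-norm 0.1.
print(theorem4_deviation(0.01, var_proxy=1000 * 0.1 ** 2, scale_proxy=0.1))
```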
We conclude this section with a centering lemma, which will be useful in applications. The proof is given in the supplement.

Lemma 6 Let $X, X'$ be iid with values in $\mathcal{X}$, let $\varphi : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be measurable and let $\alpha \in \{1, 2\}$. Then (i) $\|\mathbb{E}[\varphi(X, X') \mid X]\|_{\psi_\alpha} \le \|\varphi(X, X')\|_{\psi_\alpha}$; (ii) if $\mathcal{X} = \mathbb{R}$ then $\|X - \mathbb{E}[X]\|_{\psi_\alpha} \le 2\|X\|_{\psi_\alpha}$.

One consequence of this lemma is that we could equally well work with uncentered conditional versions, if we adjust the constants by an additional factor of 2.

4 Applications

To illustrate the use of these inequalities we give applications to vector-valued concentration and to different methods to prove generalization bounds. We concentrate mainly on applications of the more novel Theorems 4 and 5. The reader can obtain corresponding applications of the sub-Gaussian inequality by following the same pattern.

4.1 Vector-valued concentration

We begin with concentration of vectors in a normed space $(\mathcal{X}, \|\cdot\|)$.

Proposition 7 Suppose the $X_i$ are independent random variables with values in a normed space $(\mathcal{X}, \|\cdot\|)$ such that $\big\|\|X_i\|\big\|_{\psi_1} < \infty$, and let $\delta > 0$.

(i) With probability at least $1 - \delta$,
$$\Big\|\sum_k X_k\Big\| - \mathbb{E}\Big\|\sum_k X_k\Big\| \le 4e \sqrt{\sum_k \big\|\|X_k\|\big\|_{\psi_1}^2 \ln(1/\delta)} + 4e \max_k \big\|\|X_k\|\big\|_{\psi_1} \ln(1/\delta).$$
The inequality is two-sided, that is, the two terms on the left-hand side may be interchanged.

(ii) If $\mathcal{X}$ is a Hilbert space, the $X_i$ are iid and $n \ge \ln(1/\delta) \ge \ln 2$, then with probability at least $1 - \delta$,
$$\Big\|\frac{1}{n}\sum_i X_i - \mathbb{E}[X_1]\Big\| \le 8e \big\|\|X_1\|\big\|_{\psi_1} \sqrt{\frac{\ln(1/\delta)}{n}}. \qquad (3)$$

(iii) If $\mathcal{X}$ is a Hilbert space, the $X_i$ are iid, $0 < \delta \le 1/2$ and $p, q \in (1, \infty)$ are conjugate exponents, then with probability at least $1 - \delta$,
$$\Big\|\frac{1}{n}\sum_i X_i - \mathbb{E}[X_1]\Big\| \le 2 \big\|\|X_1 - \mathbb{E}[X_1]\|\big\|_{2p} \sqrt{\frac{\ln(1/\delta)}{n}} + 4eq \big\|\|X_1\|\big\|_{\psi_1} \frac{\ln(1/\delta)}{n}.$$

The purpose of the simple inequality (3) is to give a compact expression when it is possible to restrict to the sub-Gaussian regime with the assumption $n \ge \ln(1/\delta)$. This is often possible in applications. Part (iii) gives better estimation bounds when the lower-order moments $\big\|\|X_1 - \mathbb{E}[X_1]\|\big\|_{2p}$ are small.

Proof. (i) We look at the function $f(x) = \|\sum_i x_i\|$. Then
$$|f_k(X)(x)| = \Big|\Big\|\sum_{i \ne k} x_i + X_k\Big\| - \mathbb{E}\Big\|\sum_{i \ne k} x_i + X'_k\Big\|\Big| \le \mathbb{E}\big[\|X_k - X'_k\| \mid X_k\big].$$
Observe that the bound on $f_k(X)(x)$ (not $f_k(X)(x)$ itself) is independent of $x$. Using Lemma 6 we get
$$\|f_k(X)(x)\|_{\psi_1} \le \big\|\|X_k - X'_k\|\big\|_{\psi_1} \le 2 \big\|\|X_k\|\big\|_{\psi_1},$$
and the first conclusion follows from Theorem 4 by equating the probability to $\delta$ and solving for $t$. The proofs of (ii) and (iii) follow a similar pattern and are given in the supplement.
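As a quick sanity check, one can compare the right-hand side of Proposition 7 (i), with the constants of our reconstruction, against the empirical deviation of $\|\sum_k X_k\|$ in a simulation. The sketch below uses bounded uniform "signal" plus unbounded Laplace "noise" coordinates, in the spirit of the decomposition discussed in Section 2, and estimates the $\psi_1$-norms from samples with the finite-grid device used earlier; all function names and parameter choices are our own.

```python
import numpy as np

E, DELTA = np.e, 0.05

def psi1_norm(z, p_max=20):
    """Crude estimate of ||Z||_{psi_1} = sup_p ||Z||_p / p from samples z (finite p-grid)."""
    return max(np.mean(np.abs(z) ** p) ** (1.0 / p) / p for p in range(1, p_max + 1))

def prop7i_bound(nu, delta):
    """Right-hand side of Proposition 7 (i) as reconstructed above,
    with nu[k] an estimate of || ||X_k|| ||_{psi_1}."""
    nu, L = np.asarray(nu), np.log(1.0 / delta)
    return 4 * E * np.sqrt(np.sum(nu ** 2) * L) + 4 * E * np.max(nu) * L

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, trials = 30, 5, 20000
    # bounded signal plus unbounded sub-exponential noise
    X = rng.uniform(-1, 1, (trials, n, d)) + rng.laplace(scale=0.2, size=(trials, n, d))
    norms = np.linalg.norm(X, axis=2)                    # ||X_k|| for every trial and k
    nu = [psi1_norm(norms[:, k]) for k in range(n)]
    dev = np.linalg.norm(X.sum(axis=1), axis=1)          # ||sum_k X_k|| per trial
    print("empirical 95% quantile of ||sum X_k|| - E||sum X_k||:",
          np.quantile(dev, 1 - DELTA) - dev.mean())
    print("Proposition 7 (i) bound:", prop7i_bound(nu, DELTA))
```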
4.2 A uniform bound for PCA

With the results of the previous section it is very easy to obtain a uniform bound for principal subspace selection (often called PCA, for principal component analysis) with sub-Gaussian data. In PCA we look for a projection onto a $d$-dimensional subspace which most faithfully represents the data. Let $H$ be a Hilbert space, let the $X_i$ be iid with values in $H$ and let $\mathcal{P}_d$ be the set of $d$-dimensional orthogonal projection operators on $H$. For $x \in H$ and $P \in \mathcal{P}_d$ the reconstruction error is $\ell(P, x) := \|Px - x\|_H^2$. We give a bound on the estimation difference between the expected and the empirical reconstruction error, uniform over projections in $\mathcal{P}_d$.

Theorem 8 With $X = (X_1, \dots, X_n)$ iid and $n \ge \ln(1/\delta) \ge \ln 2$ we have, with probability at least $1 - \delta$,
$$\sup_{P \in \mathcal{P}_d} \frac{1}{n}\sum_i \big(\mathbb{E}[\ell(P, X_1)] - \ell(P, X_i)\big) \le 16e\,(\sqrt{d} + 1)\,\big\|\|X_1\|_H\big\|_{\psi_2}^2 \sqrt{\frac{\ln(2/\delta)}{n}}.$$

Proof. It is convenient to work in the space of Hilbert-Schmidt operators $HS(H)$, where we can write $\ell(P, x) = \|Q_x\|_{HS} - \langle P, Q_x \rangle_{HS}$. Then
$$\sup_{P \in \mathcal{P}_d} \frac{1}{n}\sum_i \big(\mathbb{E}[\ell(P, X_1)] - \ell(P, X_i)\big) \le \sup_{P \in \mathcal{P}_d} \Big\langle P, \frac{1}{n}\sum_i \big(Q_{X_i} - \mathbb{E}[Q_{X_i}]\big)\Big\rangle_{HS} + \frac{1}{n}\sum_i \big(\mathbb{E}\|Q_{X_1}\|_{HS} - \|Q_{X_i}\|_{HS}\big).$$
Since for $P \in \mathcal{P}_d$ we have $\|P\|_{HS} = \sqrt{d}$, we can use Cauchy-Schwarz and Proposition 7 (ii) to bound the first term above, with probability at least $1 - \delta/2$, as
$$\sup_{P \in \mathcal{P}_d}\Big\langle P, \frac{1}{n}\sum_i \big(Q_{X_i} - \mathbb{E}[Q_{X_1}]\big)\Big\rangle_{HS} \le \sqrt{d}\,\Big\|\frac{1}{n}\sum_i \big(Q_{X_i} - \mathbb{E}[Q_{X_1}]\big)\Big\|_{HS} \le 8e \sqrt{d}\,\big\|\|Q_{X_1}\|_{HS}\big\|_{\psi_1} \sqrt{\frac{\ln(2/\delta)}{n}}.$$
The remaining term is bounded by applying the same result to the random vectors $\|Q_{X_i}\|_{HS}$ in the Hilbert space $\mathbb{R}$ (note that this just involves a sum of sub-exponential variables and could already be handled with Theorem 2.8.1 in [14]). The result follows from combining both bounds in a union bound and noting that $\big\|\|Q_{X_1}\|_{HS}\big\|_{\psi_1} = \big\|\|X_1\|_H^2\big\|_{\psi_1} \le 2 \big\|\|X_1\|_H\big\|_{\psi_2}^2$.

4.3 Generalization with Rademacher complexities

Suppose that $\mathcal{H}$ is a class of functions $h : \mathcal{X} \to \mathbb{R}$. We seek a high-probability bound on the supremum deviation, which is the random variable
$$f(X) = \sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \big(h(X_i) - \mathbb{E}[h(X'_i)]\big).$$
The now classical method of Rademacher complexities ([1], [5]) writes $f(X)$ as the sum
$$f(X) = \big(f(X) - \mathbb{E}[f(X')]\big) + \mathbb{E}[f(X')] \qquad (4)$$
and bounds the two terms separately. The first term is bounded using a concentration inequality, and the second term $\mathbb{E}[f(X')]$ is bounded by symmetrization. If the $\epsilon_i$ are independent Rademacher variables, uniformly distributed on $\{-1, 1\}$, then
$$\mathbb{E}[f(X)] \le \mathbb{E}\Big[\mathbb{E}\Big[2\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_i \epsilon_i h(X_i)\,\Big|\,X\Big]\Big] =: \mathbb{E}[\mathcal{R}(\mathcal{H}, X)].$$
Further bounds on this quantity depend on the class in question, but they do not necessarily require the $h(X_i)$ to be bounded random variables, Lipschitz properties being more relevant. For the first term $f(X) - \mathbb{E}[f(X')]$, however, the classical approach uses the bounded difference inequality, which requires boundedness. We now show that boundedness can be replaced by sub-exponential distributions for uniformly Lipschitz function classes.

Theorem 9 Let $X = (X_1, \dots, X_n)$ be iid random variables with values in a Banach space $(\mathcal{X}, \|\cdot\|)$ and let $\mathcal{H}$ be a class of functions $h : \mathcal{X} \to \mathbb{R}$ such that $|h(x) - h(y)| \le L\|x - y\|$ for all $h \in \mathcal{H}$ and $x, y \in \mathcal{X}$. If $n \ge \ln(1/\delta)$ then with probability at least $1 - \delta$,
$$\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_i \big(h(X_i) - \mathbb{E}[h(X'_i)]\big) \le \mathbb{E}[\mathcal{R}(\mathcal{H}, X)] + 16e\,L\,\big\|\|X_1\|\big\|_{\psi_1} \sqrt{\frac{\ln(1/\delta)}{n}}.$$

Proof. The vector space $B = \{g : \mathcal{H} \to \mathbb{R} : \sup_{h \in \mathcal{H}} |g(h)| < \infty\}$ becomes a normed space with norm $\|g\|_B = \sup_{h \in \mathcal{H}} |g(h)|$. For each $X_i$ define $\hat{X}_i \in B$ by $\hat{X}_i(h) = \frac{1}{n}\big(h(X_i) - \mathbb{E}[h(X'_i)]\big)$. Then the $\hat{X}_i$ are zero-mean random variables in $B$ and $f(X) \le \|\sum_i \hat{X}_i\|_B$. Also, with Lemma 6 and the iid assumption, for $\alpha \in \{1, 2\}$,
$$\big\|\|\hat{X}_i\|_B\big\|_{\psi_\alpha} = \Big\|\sup_{h \in \mathcal{H}} \frac{1}{n}\big|\mathbb{E}[h(X_i) - h(X'_i) \mid X_i]\big|\Big\|_{\psi_\alpha} \le \frac{L}{n}\big\|\mathbb{E}[\|X_i - X'_i\| \mid X_i]\big\|_{\psi_\alpha} \le \frac{2L}{n}\big\|\|X_1\|\big\|_{\psi_\alpha},$$
and from Proposition 7 (ii) we get, with probability at least $1 - \delta$,
$$f(X) - \mathbb{E}[f(X')] \le 16e\,L\,\big\|\|X_1\|\big\|_{\psi_1} \sqrt{\frac{\ln(1/\delta)}{n}}.$$
The result follows from (4).

Remarks. 1. A possible candidate for $\mathcal{H}$ would be a ball of radius $L$ in the dual space $\mathcal{X}^*$, composed with Lipschitz functions, like the hinge loss. 2. If, instead of using Proposition 7, one directly considers the centered conditional versions of $f(X)$, the constants above can be improved at the expense of a slightly more complicated proof. 3. A corresponding sub-Gaussian result can be obtained along the same lines by using Theorem 3 instead of Theorem 4. Such a result has been given in [12], Theorem 3, using a sub-Gaussian condition which involves the supremum over the function class. The sub-exponential bound above is new as far as we know, and in the relevant regime $n > \ln(1/\delta)$ it improves over the sub-Gaussian case, since $\big\|\|X_1\|\big\|_{\psi_1} \le \big\|\|X_1\|\big\|_{\psi_2}$.
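Before turning to the concrete linear-regression case below, it may help to see the symmetrization term in an explicit example. For the linear class $\{x \mapsto \langle w, x \rangle : \|w\| \le L\}$ on a Hilbert space the conditional Rademacher average has the closed form $\sup_{\|w\| \le L} \frac{1}{n}\sum_i \epsilon_i \langle w, x_i \rangle = \frac{L}{n}\|\sum_i \epsilon_i x_i\|$, which the sketch below estimates by Monte Carlo over the signs. This is our own illustration; it omits the factor 2 carried by our reconstruction of the definition of $\mathcal{R}(\mathcal{H}, X)$, and composing with a Lipschitz loss (as in Remark 1) would require an additional contraction step.

```python
import numpy as np

def empirical_rademacher_linear(x, L, num_draws=2000, rng=None):
    """Monte Carlo estimate of E_eps[ sup_{||w||<=L} (1/n) sum_i eps_i <w, x_i> | X ]
    = E_eps[ (L/n) || sum_i eps_i x_i || ] for a sample x of shape (n, dim)."""
    rng = np.random.default_rng() if rng is None else rng
    n = x.shape[0]
    eps = rng.choice([-1.0, 1.0], size=(num_draws, n))   # Rademacher signs
    sums = eps @ x                                        # each row is sum_i eps_i x_i
    return L / n * np.linalg.norm(sums, axis=1).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((200, 10)) + rng.laplace(scale=0.2, size=(200, 10))
    print("conditional Rademacher average (factor 2 omitted):",
          empirical_rademacher_linear(x, L=1.0, rng=rng))
```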
As a concrete case consider linear regression with potentially unbounded data. Let $\mathcal{X} = H \times \mathbb{R}$, where $H$ is a Hilbert space with inner product $\langle \cdot, \cdot \rangle$ and norm $\|\cdot\|_H$, and let $X_1$ and $Z_1$ be sub-exponential random variables in $H$ and $\mathbb{R}$ respectively. The pair $(X_1, Z_1)$ represents the joint occurrence of input vectors $X_1$ and real outputs $Z_1$. On $\mathcal{X}$ we consider the class $\mathcal{H}$ of functions
$$\mathcal{H} = \big\{(x, z) \mapsto h(x, z) = \ell(\langle w, x \rangle - z) : \|w\|_H \le L\big\},$$
where $\ell$ is a 1-Lipschitz loss function, like the absolute error or the Huber loss.

Corollary 10 Let $\mathcal{X}$ and $\mathcal{H}$ be as above and let $(X, Z) = ((X_1, Z_1), \dots, (X_n, Z_n))$ be an iid sample of random variables in $\mathcal{X}$. Then for $\delta > 0$ and $n \ge \ln(1/\delta)$, with probability at least $1 - \delta$,
$$\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_i \big(h(X_i, Z_i) - \mathbb{E}[h(X_i, Z_i)]\big) \le \frac{8}{\sqrt{n}} \Big(L \big\|\|X_1\|_H\big\|_{\psi_1} + \|Z_1\|_{\psi_1}\Big)\Big(1 + 2e\sqrt{\ln(1/\delta)}\Big).$$

The proof of this corollary is given in the supplement.

4.4 Unbounded metric spaces and algorithmic stability

We use Theorem 4 to extend a method of Kontorovich [6] from sub-Gaussian to sub-exponential distributions. If $(\mathcal{X}, d, \mu)$ is a metric probability space and $X, X' \sim \mu$ are iid random variables with values in $\mathcal{X}$, Kontorovich defines the sub-Gaussian diameter of $(\mathcal{X}, d)$ as the optimal sub-Gaussian parameter of the random variable $\epsilon\, d(X, X')$, where $\epsilon$ is a Rademacher variable. The Rademacher variable is needed in [6] to work with centered random variables, which gives better constants. In our case we work with norms, and we can more simply define the sub-Gaussian ($\alpha = 2$) and sub-exponential ($\alpha = 1$) diameters as
$$\Delta_\alpha(\mathcal{X}, d, \mu) = \|d(X, X')\|_{\psi_\alpha} \quad \text{for } \alpha \in \{1, 2\} \text{ and independent } X, X' \sim \mu.$$
Then Theorem 4 implies the following result, the easy proof of which is given in the supplement.

Theorem 11 For $1 \le i \le n$ let $X_i$ be independent random variables distributed as $\mu_i$ in $\mathcal{X}$, let $X = (X_1, \dots, X_n)$, let $X'$ be iid to $X$, and let $f : \mathcal{X}^n \to \mathbb{R}$ have Lipschitz constant $L$ with respect to the metric $\rho$ on $\mathcal{X}^n$ defined by $\rho(x, y) = \sum_i d(x_i, y_i)$. Then for $t > 0$,
$$\Pr\{f(X) - \mathbb{E}[f(X')] > t\} \le \exp\left(\frac{-t^2}{4e^2 L^2 \sum_i \Delta_1(\mathcal{X}, d, \mu_i)^2 + 2eL \max_i \Delta_1(\mathcal{X}, d, \mu_i)\, t}\right).$$

This is the sub-exponential counterpart to Theorem 1 of [6], a version of which could have been derived using Theorem 3 in place of Theorem 4. Our result can be substituted in the same way to establish generalization via the notion of total Lipschitz stability, just as in [6]. We also note that Theorem 4 of the latter work gives bounds for other Orlicz norms $\|\cdot\|_{\psi_p}$, but it requires $p > 1$, and the bounds deteriorate as $p \to 1$.
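As with the earlier inequalities, Theorem 11 is easy to evaluate numerically once the sub-exponential diameters are known or estimated. The sketch below estimates $\Delta_1(\mathcal{X}, d, \mu)$ from a sample by pairing two independent halves and using the moment characterization (1), and then evaluates the tail bound of Theorem 11 with the constants of our reconstruction (including the Lipschitz factors); all names and parameter choices are ours, and the finite moment grid only gives a rough estimate of the diameter.

```python
import numpy as np

def subexp_diameter(samples, metric, p_max=20):
    """Estimate Delta_1(X, d, mu) = || d(X, X') ||_{psi_1} for X, X' iid ~ mu,
    pairing two independent halves of the sample and taking sup_p ||.||_p / p
    over a finite grid of p."""
    m = len(samples) // 2
    d = np.array([metric(a, b) for a, b in zip(samples[:m], samples[m:2 * m])])
    return max(np.mean(d ** p) ** (1.0 / p) / p for p in range(1, p_max + 1))

def theorem11_tail(t, L, diameters):
    """Tail bound of Theorem 11 as reconstructed above, for an L-Lipschitz f with
    respect to rho(x, y) = sum_i d(x_i, y_i) and per-coordinate diameters Delta_1."""
    diameters, e = np.asarray(diameters), np.e
    return np.exp(-t ** 2 / (4 * e ** 2 * L ** 2 * np.sum(diameters ** 2)
                             + 2 * e * L * np.max(diameters) * t))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = rng.exponential(scale=1.0, size=10000)   # a sub-exponential law on R
    delta1 = subexp_diameter(samples, metric=lambda a, b: abs(a - b))
    print("estimated Delta_1:", delta1)
    print("tail bound at t = 5 with n = 100, L = 1:",
          theorem11_tail(5.0, L=1.0, diameters=[delta1] * 100))
```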
5 Proofs of Theorems 3 and 4

We first collect some necessary tools. Central to the entropy method is the entropy $S(Y)$ of a real-valued random variable $Y$, defined as
$$S(Y) = \mathbb{E}_Y[Y] - \ln \mathbb{E}\big[e^Y\big],$$
where the tilted expectation $\mathbb{E}_Y$ is defined as $\mathbb{E}_Y[Z] = \mathbb{E}[Z e^Y] / \mathbb{E}[e^Y]$. Using Theorem 1 in [10], the logarithm of the moment generating function can be expressed in terms of the entropy as
$$\ln \mathbb{E}\big[e^{\beta(Y - \mathbb{E}[Y])}\big] = \beta \int_0^\beta \frac{S(\gamma Y)}{\gamma^2}\, d\gamma. \qquad (5)$$
If $f : \mathcal{X}^n \to \mathbb{R}$ and $X$ and the $f_k$ are as in the introduction, then the conditional entropy is the function $S_{f,k} : \mathcal{X}^n \to \mathbb{R}$ defined by $S_{f,k}(x) = S(f_k(X)(x))$ for $x \in \mathcal{X}^n$. At the heart of the method is the sub-additivity of entropy (Theorem 6 in [10] or Theorem 4.22 in [3]),
$$S(f(X)) \le \mathbb{E}_{f(X)}\Big[\sum_{k=1}^n S_{f,k}(X)\Big]. \qquad (6)$$

The following lemma gives a bound on the entropy of a sub-Gaussian random variable.

Lemma 12 For any centered random variable $Y$ we have (i) $S(Y) \le \ln \mathbb{E}[e^{2Y}]$. (ii) If $Y$ is sub-Gaussian and $\beta$ is real, then $S(\beta Y) \le 16e\beta^2 \|Y\|_{\psi_2}^2$.

Proof. We have
$$S(Y) = \mathbb{E}_Y[Y] - \ln \mathbb{E}\big[e^Y\big] \le \ln \mathbb{E}\big[e^{2Y}\big] - 2\ln \mathbb{E}\big[e^Y\big] \le \ln \mathbb{E}\big[e^{2Y}\big].$$
The first inequality follows from Jensen's inequality by concavity of the logarithm, the second by convexity of the exponential function, since $Y$ is centered. This gives (i). For (ii) replace $Y$ by $\beta Y$ and use (2) to get $S(\beta Y) \le \ln \mathbb{E}[e^{2\beta Y}] \le 16e\beta^2 \|Y\|_{\psi_2}^2$.

Proof of Theorem 3. For any $x \in \mathcal{X}^n$ and $\gamma \in \mathbb{R}$, part (ii) of the previous lemma gives
$$S_{\gamma f, k}(x) = S(\gamma f_k(X)(x)) \le 16e\gamma^2 \|f_k(X)(x)\|_{\psi_2}^2.$$
By subadditivity of entropy (6) this gives
$$S(\gamma f(X)) \le 16e\gamma^2\, \mathbb{E}_{\gamma f(X)}\Big[\sum_k \|f_k(X)\|_{\psi_2}^2(X)\Big] \le 16e\gamma^2 \Big\|\sum_k \|f_k(X)\|_{\psi_2}^2\Big\|_\infty.$$
Using Markov's inequality and (5) this gives, for any $\beta > 0$,
$$\Pr\{f(X) - \mathbb{E}[f(X')] > t\} \le \exp\Big(\beta \int_0^\beta \frac{S(\gamma f(X))}{\gamma^2}\, d\gamma - \beta t\Big) \le \exp\Big(16e\beta^2 \Big\|\sum_k \|f_k(X)\|_{\psi_2}^2\Big\|_\infty - \beta t\Big).$$
Minimization in $\beta$ concludes the proof.

Lemma 12 (i) and the preceding proof provide a general template to convert many exponential tail-bounds for sums into analogous bounds for general functions. For sums $\sum X_i$ one typically has a bound on $\ln \mathbb{E}[e^{\beta X_i}]$. Lemma 12 then provides an analogous bound on the entropy of the conditional versions of a general function, and subadditivity of entropy and (5) complete the conversion, albeit with a deterioration of constants. Using part (v) of Proposition 2.7.1 in [14] this method would lead to an easy proof of Theorem 4, in a form exactly like Theorem 2.8.1 in [14]. Here we will use a slightly different method, which gives better constants and will also provide the proof of Theorem 5.

For the proof of Theorem 4 we use the following fluctuation representation of entropy (Theorem 3 in [10]):
$$S(Y) = \int_0^1 \int_t^1 \mathbb{E}_{sY}\big[(Y - \mathbb{E}_{sY}[Y])^2\big]\, ds\, dt. \qquad (7)$$
We use this to bound the entropy of a centered sub-exponential random variable.

Lemma 13 If $\|Y\|_{\psi_1} < 1/e$ and $\mathbb{E}[Y] = 0$, then
$$S(Y) \le \frac{e^2 \|Y\|_{\psi_1}^2}{(1 - e\|Y\|_{\psi_1})^2}.$$

Proof. Let $s \in [0, 1]$. Then
$$\mathbb{E}_{sY}\big[(Y - \mathbb{E}_{sY}[Y])^2\big] \le \mathbb{E}_{sY}\big[Y^2\big] = \frac{\mathbb{E}\big[Y^2 e^{sY}\big]}{\mathbb{E}\big[e^{sY}\big]} \le \mathbb{E}\big[Y^2 e^{sY}\big].$$
The first inequality follows from the variational property of the variance, the second from Jensen's inequality, since $\mathbb{E}[Y] = 0$. Expanding the exponential we get
$$\mathbb{E}\big[Y^2 e^{sY}\big] \le \sum_{m=0}^\infty \frac{s^m}{m!}\, \mathbb{E}\big[|Y|^{m+2}\big].$$
The interchange of expectation and summation is justified by absolute convergence of the sum as follows:
$$\sum_{m=0}^\infty \frac{s^m}{m!}\, \mathbb{E}\big[|Y|^{m+2}\big] \le \sum_{m=0}^\infty \frac{s^m}{m!}\, \|Y\|_{\psi_1}^{m+2} (m+2)^{m+2} \le e^2 \|Y\|_{\psi_1}^2 \sum_{m=0}^\infty (m+2)(m+1)\big(se\|Y\|_{\psi_1}\big)^m < \infty.$$
The first inequality follows from the definition of $\|Y\|_{\psi_1}$, and the second from Stirling's approximation $(m+2)^{m+2} \le (m+2)!\, e^{m+2}$. Absolute convergence is ensured since $se\|Y\|_{\psi_1} < 1$. Using
$$\int_0^1 \int_t^1 s^m\, ds\, dt = \frac{1}{m+2},$$
the fluctuation representation (7) and the above inequalities give
$$S(Y) = \int_0^1 \int_t^1 \mathbb{E}_{sY}\big[(Y - \mathbb{E}_{sY}[Y])^2\big]\, ds\, dt \le e^2 \|Y\|_{\psi_1}^2 \sum_{m=0}^\infty (m+1)\big(e\|Y\|_{\psi_1}\big)^m = \frac{e^2 \|Y\|_{\psi_1}^2}{(1 - e\|Y\|_{\psi_1})^2}.$$

We also need the following lemma (Lemma 12 in [9]).

Lemma 14 Let $C$ and $b$ denote two positive real numbers and let $t > 0$. Then
$$\inf_{\beta \in [0, 1/b)}\Big(\frac{C\beta^2}{1 - b\beta} - \beta t\Big) \le \frac{-t^2}{2(2C + bt)}. \qquad (8)$$

Proof of Theorem 4. We abbreviate $M := \max_k \big\|\|f_k(X)\|_{\psi_1}\big\|_\infty$ and let $0 < \gamma \le \beta < (eM)^{-1}$. Then for any $x \in \mathcal{X}^n$ and $k \in \{1, \dots, n\}$ we have $\|\gamma f_k(X)(x)\|_{\psi_1} < \|f_k(X)(x)\|_{\psi_1} / (eM) \le 1/e$ by the definition of $M$. We can therefore apply the previous lemma to $\gamma f_k(X)(x)$. It gives, for almost all $x$,
$$S_{\gamma f, k}(x) = S(\gamma f_k(X)(x)) \le \frac{e^2 \|\gamma f_k(X)(x)\|_{\psi_1}^2}{\big(1 - e\|\gamma f_k(X)(x)\|_{\psi_1}\big)^2} \le \frac{\gamma^2 e^2 \|f_k(X)(x)\|_{\psi_1}^2}{(1 - \gamma e M)^2}.$$
Subadditivity of entropy (6) then yields the total entropy bound
$$S(\gamma f(X)) \le \mathbb{E}_{\gamma f(X)}\Big[\sum_k S_{\gamma f, k}(X)\Big] \le \frac{\gamma^2 e^2\, \mathbb{E}_{\gamma f(X)}\big[\sum_k \|f_k(X)\|_{\psi_1}^2(X)\big]}{(1 - \gamma e M)^2} \le \frac{\gamma^2 e^2 \big\|\sum_k \|f_k(X)\|_{\psi_1}^2\big\|_\infty}{(1 - \gamma e M)^2}. \qquad (9)$$
Together with (5) this gives
$$\ln \mathbb{E}\big[e^{\beta(f - \mathbb{E}f)}\big] = \beta \int_0^\beta \frac{S(\gamma f(X))}{\gamma^2}\, d\gamma \le \frac{e^2 \beta^2 \big\|\sum_k \|f_k(X)\|_{\psi_1}^2\big\|_\infty}{1 - \beta e M},$$
and the concentration inequality then follows from Markov's inequality and Lemma 14.

6 Conclusion

In this paper we presented an extension of Hoeffding- and Bernstein-type inequalities for sums of independent sub-Gaussian and sub-exponential random variables to general functions, and illustrated these inequalities with applications to statistical learning theory. We hope that future work will reveal other interesting applications of these inequalities.
References

[1] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.
[2] S. Boucheron, G. Lugosi, P. Massart, et al. Concentration inequalities using the entropy method. The Annals of Probability, 31(3):1583-1614, 2003.
[3] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities. Oxford University Press, 2013.
[4] C. Cortes, S. Greenberg, and M. Mohri. Relative deviation learning bounds and generalization with unbounded loss functions. arXiv preprint arXiv:1310.5796, 2013.
[5] V. Koltchinskii and D. Panchenko. Rademacher processes and bounding the risk of function learning. In E. Giné, D. Mason, and J. Wellner, editors, High Dimensional Probability II, pages 443-459. 2000.
[6] A. Kontorovich. Concentration in unbounded metric spaces and algorithmic stability. In International Conference on Machine Learning, pages 28-36. PMLR, 2014.
[7] S. Kutin. Extensions to McDiarmid's inequality when differences are bounded with high probability. Dept. Comput. Sci., Univ. Chicago, Chicago, IL, USA, Tech. Rep. TR-2002-04, 2002.
[8] M. Ledoux. The Concentration of Measure Phenomenon. Number 89. American Mathematical Society, 2001.
[9] A. Maurer. Concentration inequalities for functions of independent variables. Random Structures & Algorithms, 29(2):121-138, 2006.
[10] A. Maurer. Thermodynamics and concentration. Bernoulli, 18(2):434-454, 2012.
[11] C. McDiarmid. Concentration. In Probabilistic Methods for Algorithmic Discrete Mathematics, pages 195-248. Springer, Berlin, 1998.
[12] R. Meir and T. Zhang. Generalization error bounds for Bayesian mixture algorithms. Journal of Machine Learning Research, 4:839-860, 2003.
[13] S. Mendelson. Learning without concentration. In Conference on Learning Theory, pages 25-39. PMLR, 2014.
[14] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.