# Entropy-Based Concentration Inequalities for Dependent Variables

Liva Ralaivola (LIVA.RALAIVOLA@LIF.UNIV-MRS.FR), QARMA, LIF, CNRS, Aix-Marseille University, F-13288 Marseille cedex 9, France
Massih-Reza Amini (MASSIH-REZA.AMINI@IMAG.FR), AMA, LIG, CNRS, University Grenoble Alpes, Centre Equation 4, BP 53, F-38041 Grenoble Cedex 9, France

**Abstract.** We provide new concentration inequalities for functions of dependent variables. The work extends that of Janson (2004), which proposes concentration inequalities using a combination of the Laplace transform and the idea of fractional graph coloring, as well as many works that derive concentration inequalities using the entropy method (see, e.g., Boucheron et al., 2003). We give inequalities for fractionally sub-additive and fractionally self-bounding functions. Along the way, we prove a new Talagrand concentration inequality for fractionally sub-additive functions of dependent variables. The results allow us to envision the derivation of generalization bounds for various applications where dependent variables naturally appear, such as bipartite ranking.

## 1. Introduction

We present new concentration inequalities for specific functions of possibly dependent random variables. The approach that we advocate is based on the entropy method and the idea of breaking up the dependencies between random variables through a graph coloring approach. Having these results at hand allows us to envision the study of the generalization properties of predictors trained on interdependent data for which a suitable dependency structure exists. As discussed by Amini & Usunier (2015), this structure may be naturally related to the dependency graph of the data, or it may be obtained a posteriori from a transformation that reduces a general learning problem to a simpler one, e.g., the reduction of multiclass classification problems to binary classification problems.

**Related Works.** Learning with interdependent data is a topic that has received considerable interest over the past few years; from a theoretical point of view, it ultimately pertains to the availability of concentration inequalities designed to account for the dependencies at hand. Among the prominent works that address this problem is a series of contributions on learning from mixing processes, where the dependencies within a sequence of random variables decrease over time (Yu, 1994; Karandikar & Vidyasagar, 2002; Kontorovich & Ramanan, 2008; Mohri & Rostamizadeh, 2008; 2009; Samson, 2000; Steinwart & Christmann, 2010). Another line of research within this field is based on the idea of graph coloring, which divides a graph into independent sets and considers the subsets of independent random variables deduced from the graph. By combining the idea of graph coloring with the Laplace transform, Janson (2004) proposed Hoeffding-like concentration inequalities for sums of dependent random variables. Usunier et al. (2006) later extended this result to provide a generalization of the bounded differences inequality of McDiarmid (1989) to the case of interdependent random variables.
This extension paved the way for the definition of the fractional Rademacher complexity, which generalizes the idea of Rademacher complexity and allows one to derive generalization bounds for scenarios where the training set is made of dependent data. The chromatic PAC-Bayes bound proposed by Ralaivola et al. (2009; 2010) is another instance of a generalization bound that builds upon the coloring principle; London et al. (2014) later provided another PAC-Bayesian result for dependent inputs. One important issue that has not been explored in these studies, however, is the use of second-order (i.e., variance) information: such information is pivotal to obtain generalization bounds with fast learning rates, as outlined for instance in (Boucheron et al., 2005). To this aim, we here consider the entropy method (Boucheron et al., 2003), a central technique to obtain concentration inequalities for certain types of functions (namely, sub-additive and self-bounding functions), which is at the core of a proof of the well-known Talagrand concentration inequality for empirical processes (Bousquet, 2002; Ledoux, 1996; Massart, 2000). This inequality in turn makes it possible to derive generalization bounds based on local Rademacher complexities (Bartlett et al., 2005; Koltchinskii, 2006), which may induce fast convergence rates. To the best of our knowledge, the question of pairing the entropy method with the coloring approach has not yet been studied, and we propose to address it in this paper.

**Contributions.** The main theoretical results of the present paper are of three kinds. First, we show that, thanks to the idea of fractional coloring, it is possible to extend concentration results for certain types of sub-additive and self-bounding functions, namely fractionally sub-additive and fractionally self-bounding functions, to the case of dependent variables; the new Bernstein-type concentration inequalities we propose reduce to the usual concentration inequalities when the random variables at hand are independent. Second, thanks to the derived concentration inequalities, we introduce the notion of local fractional Rademacher complexity. Finally, we show how these technical results can be instantiated for the learning scenario of bipartite ranking.

**Organization of the paper.** Section 2 states the general problem we are interested in. In Section 3, we give the formal definition of our framework and make explicit the progression of our analysis over the paper. Section 4 presents new entropy-based concentration inequalities which allow us to extend several inequalities proposed for empirical processes to the case of dependent variables, and, in Section 5, we prove a generalization bound for bipartite ranking.

## 2. Statement of the Problem

Many learning problems deal with interdependent training data. The study of the consistency of the ERM principle in this case requires the availability of concentration inequalities tailored to handle general functions of dependent random variables. A common example is the reduction of learning problems to the classification of pairs of examples, as in bipartite ranking or in multiclass classification with the all-pairs approach (Amini & Usunier, 2015).
The former problem deals with the search for a scoring function over a class $F = \{f : \mathcal{X} \to \mathbb{R}\}$ of real-valued functions, using a training set $S = \{(T_i, Y_i)\}_{i=1}^n$ whose observations $(T_i, Y_i)$ are assumed to be independently and identically distributed according to some distribution $D$, in such a way that

$$P_{(T,Y),(T',Y')\sim D}\big((Y - Y')(f(T) - f(T')) < 0\big)$$

is as small as possible. Without loss of generality, we will preferably phrase the problem as that of controlling

$$P_{T^+\sim D^+,\, T^-\sim D^-}\big(f(T^+) < f(T^-)\big), \qquad (1)$$

where $D^+$ (resp. $D^-$) is the conditional distribution of the positive or relevant (resp. negative or irrelevant) examples. To this end, it is natural to consider an empirical risk $\hat{R}_\ell(f, S)$ on $S$, related to the AUC and defined as

$$\hat{R}_\ell(f, S) := \frac{1}{n_+ n_-} \sum_{i: Y_i = +1}\ \sum_{j: Y_j = -1} \ell(f, T_i, T_j),$$

where $n_+$ (resp. $n_-$) is the number of positive (resp. negative) examples of $S$ and $\ell$ is, for instance, the misranking loss $\ell(f, T_i, T_j) = \mathbb{1}_{f(T_i) \le f(T_j)}$. Even though the examples of $S$ are independent, the $n_+ n_-$ pairs $(T_i, T_j)$ entering this sum are not: two pairs that share an example are dependent. Controlling the deviation of $\hat{R}_\ell(f, S)$ from its expectation therefore calls for concentration results for functions of dependent random variables, which is precisely what the present paper provides.
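To make the pairwise reduction concrete, here is a minimal sketch in Python (the function name and the toy data are ours, not the paper's) that computes the empirical risk above with the misranking loss by comparing all positive/negative score pairs at once:

```python
import numpy as np

def pairwise_empirical_risk(scores, labels):
    """Empirical AUC-type risk: the fraction of positive/negative pairs
    (T_i, T_j), with Y_i = +1 and Y_j = -1, that the scorer mis-orders,
    i.e. such that f(T_i) <= f(T_j)."""
    pos = scores[labels == 1]    # scores f(T_i) of the n_+ positive examples
    neg = scores[labels == -1]   # scores f(T_j) of the n_- negative examples
    # broadcast to form all n_+ * n_- pairwise comparisons
    misordered = pos[:, None] <= neg[None, :]
    return misordered.mean()

# toy usage: a scorer that tends to rank positives above negatives
rng = np.random.default_rng(0)
labels = rng.choice([1, -1], size=200)
scores = labels + rng.normal(scale=1.0, size=200)
print(pairwise_empirical_risk(scores, labels))  # small for a good scorer
```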
## 3. Framework and Preliminaries

### 3.1. Notation

Throughout, $X_{[N]} := (X_1, \ldots, X_N)$ denotes a sequence of $N$ random variables taking values in a space $\mathcal{X}$, $f : \mathcal{X}^N \to \mathbb{R}$ is a measurable function, and $Z := f(X_{[N]})$. Given an independent copy $X'_n$ of $X_n$, we write

$$X^{(n)}_{[N]} := (X_1, \ldots, X_{n-1}, X'_n, X_{n+1}, \ldots, X_N) \qquad (5)$$

for the sequence obtained from $X_{[N]}$ by replacing $X_n$ with $X'_n$, and $Z^{(n)} := f(X^{(n)}_{[N]})$; the same notation applies to subsequences $X_C$ for $C \subseteq [N]$. The log-Laplace transform of $Z - \mathbb{E}[Z]$ is $G(\lambda) := \log \mathbb{E}[\exp(\lambda(Z - \mathbb{E}[Z]))]$, and the functions $\psi$ and $\varphi$ are defined by

$$\psi(x) := e^x - x - 1, \qquad (9)$$
$$\varphi(x) := (1 + x)\log(1 + x) - x. \qquad (10)$$

### 3.2. Sub-Additive and Self-Bounding Functions

Following Bousquet (2003), $f$ is said to be sub-additive if there exist functions $f_n : \mathcal{X}^{N-1} \to \mathbb{R}$, $n \in [N]$, such that, with $Z_n := f_n(X_1, \ldots, X_{n-1}, X_{n+1}, \ldots, X_N)$, $\sum_{n=1}^N (Z - Z_n) \le Z$ almost surely. A function $f$ is $(a, b)$-self-bounding if there exist functions $f_n : \mathcal{X}^{N-1} \to \mathbb{R}$ such that $0 \le Z - Z_n \le 1$ and $\sum_{n=1}^N (Z - Z_n) \le aZ + b$ almost surely.

The following result, due to Bousquet (2002), bounds the Laplace transform of sub-additive functions of independent random variables; it is the independent-case template for the results of Section 4.

**Theorem 1 (Bousquet, 2002).** Let $f : \mathcal{X}^N \to \mathbb{R}$ be a sub-additive function and assume $X_{[N]}$ is a sequence of independent random variables. Assume that $(Y_n)_{n\in[N]}$ is a sequence of real-valued $\sigma(X_{[N]})$-measurable random variables such that, for all $n \in [N]$, $P(Y_n \le Z - Z^{(n)} \le 1) = 1$ and $P(\mathbb{E}_n[Y_n] \ge 0) = 1$, where $\mathbb{E}_n$ denotes the expectation with respect to the $\sigma$-algebra generated by $X_{[N]\setminus\{n\}}$; that there exists $\sigma^2 \in \mathbb{R}$ such that $\sum_{n=1}^N \mathbb{E}_n[Y_n^2] \le \sigma^2$; and that there exists $b > 0$ such that, for all $n \in [N]$, $P(Y_n \le b) = 1$. Then, for all $\lambda \ge 0$,

$$G(\lambda) \le \psi(\lambda)\, v, \qquad (11)$$

where $v := (1 + b)\mathbb{E}[Z] + \sigma^2$.

We recall the following result, due to McDiarmid & Reed (2006), which provides a bound on the Laplace transform of self-bounding functions. Note that similar results for weakly self-bounding functions, a slightly weaker notion than self-bounding functions, are given by Maurer (2006), and that this set of results was refined (with better constants) by Boucheron et al. (2009). We refer to the result of McDiarmid & Reed (2006) because it implies upper and lower tail bounds in a more compact way, but the extensions to the dependent case that we propose also apply to the more recent versions of these results.

**Theorem 2 (McDiarmid & Reed, 2006).** Let $f : \mathcal{X}^N \to \mathbb{R}$ be an $(a, b)$-self-bounding function and assume $X_{[N]}$ is a sequence of independent random variables. Let $\mu := \mathbb{E}[Z]$. The following holds:

$$G(\lambda) \le (a\mu + b)\,\psi(\lambda), \qquad \lambda \le 0, \qquad (12)$$
$$G(\lambda) \le \frac{(a\mu + b)\,\lambda^2}{2(1 - a\lambda)}, \qquad 0 \le \lambda < 1/a. \qquad (13)$$

### 3.3. Dependency Graph

Specific graphs, namely dependency graphs, are also at the core of the present study: a graph $G = (V, E)$ is made of a finite set $V$ of vertices and a set $E \subseteq V \times V$ of edges that connect vertices. We will make intensive use of the following notion of an exact proper fractional cover of a graph.

**Definition 3 (Exact proper fractional cover of $G$).** Let $G = (V, E)$ be a graph. $C = \{(C_j, \omega_j)\}_{j\in[J]}$, for some positive integer $J$, with $C_j \subseteq V$ and $\omega_j \in [0, 1]$, is an exact proper fractional cover of $G$ if:

1. it is proper: for all $j$, $C_j$ is an independent set, i.e., there are no edges between the vertices of $C_j$;
2. it is an exact fractional cover of $G$: $\forall v \in V$, $\sum_{j: v \in C_j} \omega_j = 1$.

The weight $W(C)$ of $C$ is given by $W(C) := \sum_{j\in[J]} \omega_j$, and the minimum weight $\chi_f(G) = \min_{C\in\mathcal{K}(G)} W(C)$ over the set $\mathcal{K}(G)$ of all exact proper fractional covers of $G$ is the fractional chromatic number of $G$.

Note that, as observed by Janson (2004, Lemma 3.2), we may restrict ourselves to working with exact fractional covers, which require $\forall v \in V$, $\sum_{j: v\in C_j} \omega_j = 1$ instead of the weaker condition $\sum_{j: v\in C_j} \omega_j \ge 1$ imposed on (non-exact) fractional covers, without loss of generality, since any fractional cover induces an exact fractional cover. From now on, "fractional cover" must be understood as "exact proper fractional cover".

The reader who is not familiar with fractional covers may regard them as a generalization of graph coloring, where the question is to assign the smallest number of colors to the vertices of a graph so that no two connected vertices share the same color. When a graph is colored this way, the vertices that share a color necessarily form an independent set, and the coloring may be thought of as an exact proper fractional cover in which $C_j$ collects the vertices of color $j$ and every $\omega_j$ equals 1. Given some graph $G$, the smallest number of colors needed, $\chi(G)$, is its chromatic number, and the following holds: $\chi_f(G) \le \chi(G)$.

**Definition 4 (Dependency Graph).** Let $X_{[N]} := (X_1, \ldots, X_N)$ be a sequence of random variables. The dependency graph $G_X := (V_X, E_X)$ associated with $X_{[N]}$ is such that i) $V_X = [N]$, and ii) $(i, j) \in E_X$ if and only if $X_i$ and $X_j$ are dependent random variables.

Note that other notions of dependency graphs can be envisioned (see Janson, 2004); the present one suffices for our purpose. As we shall see, computing an exact proper fractional cover $\{(C_j, \omega_j)\}_{j\in[J]}$ of $G_X$ allows one to decompose $X_{[N]}$ into sets $X_{C_j}$ of independent variables. This is what makes it possible for the usual concentration inequalities for independent variables to carry over to the dependent case.
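As an illustration of Definitions 3 and 4, the following sketch (plain Python; the helper names are ours) greedily colors a toy dependency graph and turns the coloring into an exact proper fractional cover with unit weights, exactly as described above; the weight of the resulting cover upper-bounds the fractional chromatic number:

```python
def greedy_coloring(n_vertices, edges):
    """Assign colors greedily so that no two adjacent vertices share one.
    Returns a dict mapping vertex -> color index (a proper coloring)."""
    adj = {v: set() for v in range(n_vertices)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    color = {}
    for v in range(n_vertices):
        used = {color[u] for u in adj[v] if u in color}
        color[v] = next(c for c in range(n_vertices) if c not in used)
    return color

def cover_from_coloring(color):
    """Turn a proper coloring into an exact proper fractional cover
    {(C_j, w_j)}: each color class C_j is an independent set, w_j = 1, and
    every vertex belongs to exactly one class, so sum_{j: v in C_j} w_j = 1."""
    classes = {}
    for v, c in color.items():
        classes.setdefault(c, []).append(v)
    return [(sorted(block), 1.0) for block in classes.values()]

# toy dependency graph: X_i and X_{i+1} dependent (a path on 6 vertices),
# as happens with a sliding window of overlapping observations
edges = [(i, i + 1) for i in range(5)]
cover = cover_from_coloring(greedy_coloring(6, edges))
print(cover)  # two independent sets of weight 1 each: W(C) = 2 = chi(G) here
```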
## 4. New Concentration Inequalities

As stated above, we build upon the works on the entropy method (see Boucheron et al. (2003) for a somewhat exhaustive overview of the method) and on that of Janson (2004) to provide new concentration inequalities for functions of dependent random variables.

### 4.1. Fractionally Colorable Functions

We aim at establishing concentration results for functions that are more complex than sums of dependent random variables. To this end, we introduce the notions of fractionally colorable, fractionally sub-additive, and fractionally self-bounding functions; they refine the definitions given in the previous section.

**Definition 5 (Fractionally colorable function).** Let $G = ([N], E)$ be a graph. A function $f : \mathcal{X}^N \to \mathbb{R}$ is fractionally colorable with respect to $G$ if there exists a decomposition $D_G(f) = \{(f_j, C_j, \omega_j)\}_{j\in[J]}$ of $J$ triplets such that:

1. $C = \{(C_j, \omega_j)\}_{j\in[J]}$ is an exact proper fractional cover of $G$;
2. for all $j$, $f_j : \mathcal{X}^{|C_j|} \to \mathbb{R}$ is a function of $|C_j|$ variables, and $f$ decomposes as
$$\forall x_{[N]} \in \mathcal{X}^N, \quad f(x_{[N]}) = \sum_{j\in[J]} \omega_j f_j(x_{C_j}). \qquad (14)$$

The decomposition $D_G(f)$ of $f$ is optimal if the weight of the cover $C = \{(C_j, \omega_j)\}_{j\in[J]}$ is the smallest over all decompositions of $f$; in that case, the chromatic decomposition number $\chi_f$ of $f$ is the weight of such an optimal decomposition.

In the sequel, and without loss of generality, we will always consider optimal decompositions of fractionally colorable functions. Also, we will assume that the graph $G$ at hand is the dependency graph of the sequence $X_{[N]}$ under study. For a fractionally colorable function $f$, we recall and introduce notation: as before, $Z := f(X_{[N]})$ and $Z^{(n)} := f(X^{(n)}_{[N]})$, and, for $j \in [J]$ and $n \in C_j$,

$$Z_j := f_j(X_{C_j}), \qquad Z^{(n)}_j := f_j(X^{(n)}_{C_j}),$$

where $X^{(n)}_{C_j}$ is defined as in equation (5). Hence,

$$Z = \sum_{j\in[J]} \omega_j Z_j. \qquad (15)$$

Let $\Pi_J$ be the family of discrete probability distributions over $J$-sets:

$$\Pi_J := \Big\{(p_1, \ldots, p_J) : \sum_{j=1}^J p_j = 1 \text{ and } p_j > 0,\ \forall j\Big\}. \qquad (16)$$

We then have the following central lemma.

**Lemma 1 (Central Lemma).** If $f$ is fractionally colorable, then, for all $(p_1, \ldots, p_J) \in \Pi_J$ and all $\lambda \in \mathbb{R}$,

$$G(\lambda) \le \log \sum_{j\in[J]} p_j \exp\Big(G_j\Big(\frac{\lambda\omega_j}{p_j}\Big)\Big), \qquad (17)$$

where

$$G_j(\lambda) := \log \mathbb{E}_{C_j}\big[\exp\big(\lambda(Z_j - \mathbb{E}_{C_j}[Z_j])\big)\big] \qquad (18)$$

and $\mathbb{E}_{C_j}$ denotes the expectation with respect to $X_{C_j}$.

*Proof.* Using (15),

$$G(\lambda) = \log \mathbb{E}\Big[\exp\Big(\lambda \sum_{j\in[J]} \omega_j \big(Z_j - \mathbb{E}_{C_j}[Z_j]\big)\Big)\Big] = \log \mathbb{E}\Big[\exp\Big(\sum_{j\in[J]} p_j\, \frac{\lambda\omega_j}{p_j}\big(Z_j - \mathbb{E}_{C_j}[Z_j]\big)\Big)\Big] \le \log \sum_{j\in[J]} p_j\, \mathbb{E}_{C_j}\Big[\exp\Big(\frac{\lambda\omega_j}{p_j}\big(Z_j - \mathbb{E}_{C_j}[Z_j]\big)\Big)\Big],$$

by convexity of $x \mapsto e^x$ and Jensen's inequality; the right-hand side is exactly $\log \sum_j p_j \exp(G_j(\lambda\omega_j/p_j))$. $\square$

Remark that the functions $G_j$ are the counterparts of $G$ for the random variables $Z_j$, which are defined with respect to the sets $C_j$ of independent variables.

### 4.2. Concentration of Fractionally Sub-Additive Functions

**Definition 6 (Fractionally Sub-Additive Function).** Let $G = ([N], E)$ be some graph. A function $f : \mathcal{X}^N \to \mathbb{R}$ is fractionally sub-additive if it is fractionally colorable with respect to $G$ with decomposition $D_G(f) = \{(f_j, C_j, \omega_j)\}_{j\in[J]}$ and each $f_j$ is sub-additive.

**Proposition 1.** Suppose the following assumptions hold: $f$ is fractionally sub-additive with decomposition $D_G(f) = \{(f_j, C_j, \omega_j)\}_{j\in[J]}$, and, for all $j \in [J]$:

- $(Y_{j,n})_{n\in C_j}$ is a sequence of real-valued $\sigma(X_{C_j})$-measurable random variables such that, for all $n \in C_j$, $P(Y_{j,n} \le Z_j - Z^{(n)}_j \le 1) = 1$ and $P(\mathbb{E}_{j,n}[Y_{j,n}] \ge 0) = 1$, where $\mathbb{E}_{j,n}$ denotes the expectation with respect to the $\sigma$-algebra generated by $X_{C_j\setminus\{n\}}$;
- there exists $\sigma_j^2 \in \mathbb{R}$ such that $\sum_{n\in C_j} \mathbb{E}_{j,n}[Y_{j,n}^2] \le \sigma_j^2$;
- there exists a positive $b_j \in \mathbb{R}$ such that, for all $n \in C_j$, $P(Y_{j,n} \le b_j) = 1$;
- $v_j \in \mathbb{R}$ denotes the real $v_j := (1 + b_j)\mathbb{E}[Z_j] + \sigma_j^2$.

Then, for all $\lambda \ge 0$ and all $(p_1, \ldots, p_J) \in \Pi_J$,

$$G(\lambda) \le \log \sum_{j\in[J]} p_j \exp\Big(v_j\,\psi\Big(\frac{\lambda\omega_j}{p_j}\Big)\Big), \qquad (19)$$

where $\psi$ is defined as in (9).

*Proof.* Let $(p_1, \ldots, p_J) \in \Pi_J$ and $\lambda > 0$ (the claim is trivially true for $\lambda = 0$). By Lemma 1 and then Theorem 1 applied to each $Z_j$ (the variables of $X_{C_j}$ are independent),

$$G(\lambda) \le \log \sum_{j\in[J]} p_j \exp\Big(G_j\Big(\frac{\lambda\omega_j}{p_j}\Big)\Big) \le \log \sum_{j\in[J]} p_j \exp\Big(v_j\,\psi\Big(\frac{\lambda\omega_j}{p_j}\Big)\Big). \qquad \square$$

The following result extends Bennett's inequality, as presented by Bousquet (2002), to the dependent case.

**Theorem 3 (Bennett's Inequality for Dependent Variables).** Suppose the assumptions of Proposition 1 hold with $b_1 = \ldots = b_J =: b$, and define the constants

$$\sigma^2 := \sum_{j\in[J]} \omega_j\sigma_j^2, \qquad v := (1 + b)\mathbb{E}[Z] + \sigma^2, \qquad c := 25\chi_f/16.$$

The following results hold: for all $t \ge 0$,

$$P(Z \ge \mathbb{E}[Z] + t) \le \exp\Big(-\frac{v}{\chi_f}\,\varphi\Big(\frac{4t}{5v}\Big)\Big), \qquad (20)$$

where $\varphi$ is defined as in (10), and, for all $t \ge 0$,

$$P(Z \ge \mathbb{E}[Z] + t) \le \exp\Big(-\frac{t^2}{2c(v + t/3)}\Big). \qquad (21)$$

*Proof.* We start from Proposition 1 and follow the steps of Janson (2004, Theorem 3.4). Set

$$W := \sum_{j\in[J]} \omega_j, \qquad (22)$$
$$U := \sum_{j\in[J]} \omega_j \max\big(1, v_j^{1/2} W^{1/2} v^{-1/2}\big), \qquad (23)$$
$$p_j := \omega_j \max\big(1, v_j^{1/2} W^{1/2} v^{-1/2}\big)/U. \qquad (24)$$

With these choices, each summand of the sum in (19) satisfies $v_j\,\psi(\lambda\omega_j/p_j) \le v\,\psi(\lambda U)/W$; note that $\omega_j/p_j = U/m_j$ with $m_j := \max(1, v_j^{1/2}W^{1/2}v^{-1/2})$. Indeed, if $v_j^{1/2}W^{1/2}v^{-1/2} \le 1$, then $p_j = \omega_j/U$, $v_j \le v/W$, and $v_j\,\psi(\lambda U) \le v\,\psi(\lambda U)/W$. Otherwise (i.e., if $v_j^{1/2}W^{1/2}v^{-1/2} > 1$), $p_j = \omega_j v_j^{1/2}W^{1/2}v^{-1/2}/U$ and

$$v_j\,\psi\Big(\lambda U\,\frac{v^{1/2}}{v_j^{1/2}W^{1/2}}\Big) \le v_j\,\frac{v}{v_j W}\,\psi(\lambda U) = \frac{v}{W}\,\psi(\lambda U),$$

where the inequality comes from a property of $\psi$ recalled in Proposition 5 (Appendix A). This bound on the summands and the fact that $x \mapsto e^x$ is increasing give

$$G(\lambda) \le \log \sum_{j\in[J]} p_j \exp\Big(\frac{v}{W}\,\psi(\lambda U)\Big) = \frac{v}{W}\,\psi(\lambda U).$$

Using Markov's inequality (cf. Theorem 6, Appendix A),

$$P(Z \ge \mathbb{E}[Z] + t) = P\big(\exp(\lambda(Z - \mathbb{E}[Z])) \ge \exp(\lambda t)\big) \le \exp\Big(\frac{v}{W}\,\psi(\lambda U) - \lambda t\Big).$$

The upper bound of this inequality is minimized for $\lambda = \ln(1 + tW/(vU))/U$; plugging in this value yields

$$P(Z \ge \mathbb{E}[Z] + t) \le \exp\Big(-\frac{v}{W}\,\varphi\Big(\frac{tW}{Uv}\Big)\Big).$$

Now, using the fact that $\forall x \in \mathbb{R}$, $x \le 1 + x^2/4$, and observing that $\sum_j \omega_j v_j = \sum_j \omega_j\big((1+b)\mathbb{E}[Z_j] + \sigma_j^2\big) = (1+b)\mathbb{E}[Z] + \sigma^2 = v$, we get

$$U \le \sum_{j\in[J]} \omega_j\Big(1 + \frac{v_j W}{4v}\Big) = W + \frac{W}{4v}\sum_{j\in[J]}\omega_j v_j = \frac{5W}{4}.$$

Since $x \mapsto \varphi(t/x)$ is decreasing for $t > 0$ and $Uv/W \le 5v/4$, we readily have the upper bound

$$P(Z \ge \mathbb{E}[Z] + t) \le \exp\Big(-\frac{v}{W}\,\varphi\Big(\frac{4t}{5v}\Big)\Big).$$

As said before, we consider only optimal decompositions of $f$, so the total weight $W$ may be replaced by the chromatic decomposition number $\chi_f$, which gives (20). Inequality (21) is deduced from (20) and the fact that $\forall x \ge 0$, $\varphi(x) \ge x^2/(2(1 + x/3))$. $\square$
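For intuition, here is a small numerical sketch in Python of the two tails of Theorem 3 as reconstructed above (the function names are ours); it shows how the guarantee degrades as the fractional chromatic number grows, and that χf = 1 recovers an independent-case Bennett-type bound up to the 4/5 and 25/16 slack introduced by the proof:

```python
import math

def phi(x):
    """phi(x) = (1 + x) log(1 + x) - x, as in Equation (10)."""
    return (1 + x) * math.log(1 + x) - x

def bennett_tail(t, v, chi_f):
    """Bennett-type upper tail of Theorem 3, Equation (20):
    P(Z >= E[Z] + t) <= exp(-(v / chi_f) * phi(4t / (5v)))."""
    return math.exp(-(v / chi_f) * phi(4 * t / (5 * v)))

def bernstein_tail(t, v, chi_f):
    """Bernstein-type relaxation, Equation (21), with c = 25 chi_f / 16:
    P(Z >= E[Z] + t) <= exp(-t^2 / (2c (v + t/3)))."""
    c = 25 * chi_f / 16
    return math.exp(-t * t / (2 * c * (v + t / 3)))

for chi_f in (1.0, 2.0, 4.0):   # heavier dependence -> weaker bound
    print(chi_f, bennett_tail(2.0, 1.0, chi_f), bernstein_tail(2.0, 1.0, chi_f))
```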
This, in turn, gives the following Talagrand-type inequality for empirical processes in the dependent case.

**Theorem 4.** Let $F$ be a set of functions from $\mathcal{X}$ to $\mathbb{R}$; assume that all functions in $F$ are measurable, square-integrable, and satisfy $\mathbb{E}[f(X_n)] = 0$ for all $n \in [N]$, as well as $\sup_{f\in F}\|f\|_\infty \le 1$. Assume that $C = \{(C_j, \omega_j)\}_{j\in[J]}$ is a cover of the dependency graph of $X_{[N]}$ and let $\chi_f := \sum_j \omega_j$. Let us define

$$Z := \sum_{j\in[J]} \omega_j \sup_{f\in F}\ \sum_{n\in C_j} f(X_n),$$

let $\sigma_j$ be such that $\sigma_j^2 \ge \sum_{n\in C_j} \sup_{f\in F} \mathbb{E}\big[f^2(X_n)\big]$, and let $v := \sum_j \omega_j\sigma_j^2 + 2\mathbb{E}[Z]$. For any $t \ge 0$,

$$P(Z \ge \mathbb{E}[Z] + t) \le \exp\Big(-\frac{v}{\chi_f}\,\varphi\Big(\frac{4t}{5v}\Big)\Big).$$

Also, with $c := 25\chi_f/16$,

$$P(Z \ge \mathbb{E}[Z] + t) \le \exp\Big(-\frac{t^2}{2c(v + t/3)}\Big).$$

*Proof (sketch).* The proof is similar to that of Bousquet (2003, Theorem 7.3). It hinges on the fact that, by Lemma 2 (Appendix A), the $Z_j$'s are indeed sub-additive functions, and on studying the random variables $Y_{j,n}$ defined for $n \in C_j$ as $Y_{j,n} := f^n_j(X_n)$, where $f^n_j \in F$ is such that $\sum_{k\in C_j\setminus\{n\}} f^n_j(X_k) = \sup_{f\in F} \sum_{k\in C_j\setminus\{n\}} f(X_k)$; the assumption $\sup_{f\in F}\|f\|_\infty \le 1$ then yields $b = 1$ in the definition of $v$. $\square$

We may now introduce the local fractional Rademacher complexity which, combined with the previous inequality, is useful to derive generalization bounds (see Section 5).

**Definition 7.** The local fractional Rademacher complexity $\mathcal{R}(F, r)$ is defined as

$$\mathcal{R}(F, r) := \frac{2}{N} \sum_{j\in[J]} \omega_j\, \mathbb{E}_{X_{C_j},\,\xi} \sup_{f\in F:\, Vf \le r}\ \sum_{n\in C_j} \xi_n f(X_n), \qquad (27)$$

where $Vf$ denotes the variance of $f$ and $\xi = (\xi_1, \ldots, \xi_N)$ is a sequence of $N$ independent Rademacher variables: $P(\xi_n = 1) = P(\xi_n = -1) = 1/2$. This is a generalization of the fractional Rademacher complexity of Usunier et al. (2006). The following holds.

**Proposition 2.** For all $r > 0$,

$$\mathbb{E}\,\sum_{j\in[J]} \omega_j \sup_{f\in F:\, Vf\le r}\ \sum_{n\in C_j} \big[\mathbb{E} f(X_n) - f(X_n)\big] \le N\mathcal{R}(F, r).$$

*Proof.* A simple symmetrization argument, carefully used in combination with the fractional decomposition of $f$, gives the result. $\square$
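When $F$ is finite (or has been finitely discretized), the quantity of Definition 7 lends itself to a simple Monte Carlo approximation. The sketch below (Python; the names, and the plug-in use of sample values in place of the expectation over $X_{C_j}$, are our assumptions rather than a protocol from the paper) draws Rademacher signs and averages the per-block suprema of Equation (27):

```python
import numpy as np

def local_fractional_rademacher(F_values, cover, variances, r, n_draws=2000, seed=0):
    """Monte Carlo sketch of the local fractional Rademacher complexity
    R(F, r) = (2/N) sum_j w_j E_xi sup_{f: Vf <= r} sum_{n in C_j} xi_n f(X_n),
    for a finite class given by F_values[k, n] = f_k(X_n). `variances[k]`
    is (an estimate of) Vf_k; the expectation over the X_n is replaced by
    the observed sample, as in a data-dependent quantity."""
    rng = np.random.default_rng(seed)
    N = F_values.shape[1]
    local = F_values[variances <= r]          # the local class {f : Vf <= r}
    total = 0.0
    for block, w in cover:
        xi = rng.choice([-1.0, 1.0], size=(n_draws, len(block)))
        # one supremum over the (finite) local class per Rademacher draw
        sups = (xi @ local[:, block].T).max(axis=1)
        total += w * sups.mean()
    return 2.0 * total / N

# toy usage with the two-block cover of a path on 6 vertices
F_values = np.array([[0.1] * 6, [0.5] * 6, [-0.2] * 6])  # 3 functions
variances = np.array([0.01, 0.25, 0.04])                 # their variances
cover = [([0, 2, 4], 1.0), ([1, 3, 5], 1.0)]
print(local_fractional_rademacher(F_values, cover, variances, r=0.05))
```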
### 4.3. Concentration of Fractionally Self-Bounding Functions

We now provide concentration inequalities for a generalization of self-bounding functions, namely fractionally self-bounding functions. Such results may be of use for problems where self-bounding functions naturally appear (Boucheron et al., 2009; McDiarmid & Reed, 2006).

**Definition 8 (Fractionally Self-Bounding Function).** Let $G = ([N], E)$ be some graph. A function $f : \mathcal{X}^N \to \mathbb{R}$ with decomposition $D_G(f) = \{(f_j, C_j, \omega_j)\}_{j\in[J]}$ is $(\{a_j\}_j, \{b_j\}_j)$-fractionally self-bounding if each $f_j$ is $(a_j, b_j)$-self-bounding.

**Proposition 3.** Let $G_X$ be the dependency graph associated with $X_{[N]}$, and let $f : \mathcal{X}^N \to \mathbb{R}$ be $(\{a_j\}_j, \{b_j\}_j)$-fractionally self-bounding. With $\mu_j := \mathbb{E}_{C_j}[Z_j]$, the following holds for all $(p_1, \ldots, p_J) \in \Pi_J$:

for all $\lambda \le 0$,
$$G(\lambda) \le \log \sum_{j\in[J]} p_j \exp\Big((a_j\mu_j + b_j)\,\psi\Big(\frac{\lambda\omega_j}{p_j}\Big)\Big); \qquad (28)$$

for all $0 \le \lambda < \min_j\big(p_j/(a_j\omega_j)\big)$,
$$G(\lambda) \le \log \sum_{j\in[J]} p_j \exp\Big(\frac{(a_j\mu_j + b_j)(\lambda\omega_j/p_j)^2}{2(1 - \lambda a_j\omega_j/p_j)}\Big). \qquad (29)$$

*Proof.* A combination of Lemma 1 and Theorem 2. $\square$

This proposition entails the following concentration inequalities for the upper and lower tails of $Z$.

**Theorem 5.** Let $f : \mathcal{X}^N \to \mathbb{R}$ be $(\{a_j\}_j, \{b_j\}_j)$-fractionally self-bounding, with decomposition $D_{G_X}(f) = \{(f_j, C_j, \omega_j)\}_{j\in[J]}$. Define

$$\gamma_j := a_j\mu_j + b_j, \qquad \gamma := \sum_{j\in[J]} \omega_j\gamma_j, \qquad \chi_f := \sum_{j\in[J]} \omega_j.$$

The following results hold: for all $t > 0$,

$$P(Z \le \mathbb{E}[Z] - t) \le \exp\Big(-\frac{\gamma}{\chi_f}\,\varphi\Big(\frac{4t}{5\gamma}\Big)\Big), \qquad (30)$$
$$P(Z \ge \mathbb{E}[Z] + t) \le \exp\Big(-\frac{\gamma}{\chi_f}\,\varphi\Big(\frac{t}{\rho\gamma}\Big)\Big), \qquad (31)$$

where $\rho := \frac{1}{\chi_f}\sum_{j\in[J]} \omega_j a_j + \frac{1}{4\gamma}\sum_{j\in[J]} \omega_j\frac{\gamma_j}{a_j}$. If $a := a_1 = \ldots = a_J$, then $\rho$ simplifies to $\rho = a + \frac{1}{4a}$, and (31) takes the more convenient form

$$P(Z \ge \mathbb{E}[Z] + t) \le \exp\Big(-\frac{\gamma}{\chi_f}\,\varphi\Big(\frac{4at}{(4a^2 + 1)\gamma}\Big)\Big), \qquad (32)$$

which is similar to Equation (30) for $a = 1$.

*Proof.* The proof of Equation (30) follows exactly the same steps as the proof of Theorem 3, after noticing that its starting point, Equation (28), is similar to Equation (19), the starting point of the latter, with $-\lambda$ in place of $\lambda$ and $\gamma_j$ in place of $v_j$. The proof of Equation (31) hinges on the following choices:

$$U := \sum_{j\in[J]} a_j\omega_j \max\big(1, a_j^{-1}\gamma_j^{1/2}\chi_f^{1/2}\gamma^{-1/2}\big), \qquad p_j := a_j\omega_j \max\big(1, a_j^{-1}\gamma_j^{1/2}\chi_f^{1/2}\gamma^{-1/2}\big)/U,$$

which allow the argument of the exponential in the right-hand side of Equation (29) to be rewritten, for all $j \in [J]$ and with $m_j := \max(1, a_j^{-1}\gamma_j^{1/2}\chi_f^{1/2}\gamma^{-1/2})$, as

$$\theta_j(\lambda) := \frac{\gamma_j(\lambda\omega_j/p_j)^2}{2(1 - \lambda a_j\omega_j/p_j)} = \frac{\gamma_j}{a_j^2 m_j^2}\cdot\frac{(\lambda U)^2}{2(1 - \lambda U/m_j)}.$$

It is easy to verify that, for all $\lambda \ge 0$,

$$\theta_j(\lambda) \le \frac{\gamma}{\chi_f}\cdot\frac{(\lambda U)^2}{2(1 - \lambda U)}.$$

Indeed, if $a_j^{-1}\gamma_j^{1/2}\chi_f^{1/2}\gamma^{-1/2} \le 1$, then $m_j = 1$, $\gamma_j/a_j^2 \le \gamma/\chi_f$, and $\theta_j(\lambda) = \frac{\gamma_j}{a_j^2}\cdot\frac{(\lambda U)^2}{2(1 - \lambda U)} \le \frac{\gamma}{\chi_f}\cdot\frac{(\lambda U)^2}{2(1 - \lambda U)}$. And if $a_j^{-1}\gamma_j^{1/2}\chi_f^{1/2}\gamma^{-1/2} > 1$, then $\gamma_j/(a_j^2 m_j^2) = \gamma/\chi_f$ and

$$\theta_j(\lambda) = \frac{\gamma}{\chi_f}\cdot\frac{(\lambda U)^2}{2(1 - \lambda U/m_j)} \le \frac{\gamma}{\chi_f}\cdot\frac{(\lambda U)^2}{2(1 - \lambda U)},$$

where the upper-bounding is possible because $\lambda \ge 0$ and $m_j > 1$. Therefore, from the bound (29), it comes

$$G(\lambda) = \log \mathbb{E}\big[\exp(\lambda(Z - \mathbb{E}[Z]))\big] \le \frac{\gamma}{\chi_f}\cdot\frac{(\lambda U)^2}{2(1 - \lambda U)}.$$

Using Markov's inequality, and the fact that $p_j/(a_j\omega_j) \ge 1/U$ for all $j \in [J]$, we get

$$P(Z \ge \mathbb{E}[Z] + t) \le \exp\Big(-\sup_{0\le\lambda<1/U}\Big[\lambda t - \frac{\gamma}{\chi_f}\cdot\frac{(\lambda U)^2}{2(1 - \lambda U)}\Big]\Big),$$

where Lemma 3 (Appendix A) gives the value of the supremum. Using the inequality $\forall x \in \mathbb{R}$, $x \le 1 + x^2/4$ once again, we may bound $U$ as

$$U \le \sum_{j\in[J]} \omega_j a_j + \frac{\chi_f}{4\gamma}\sum_{j\in[J]} \omega_j\frac{\gamma_j}{a_j} = \chi_f\,\rho,$$

and conclude with the fact that $x \mapsto \varphi(t/x)$ is decreasing for $t > 0$. $\square$

## 5. Induced Generalization Bounds for Bipartite Ranking

In this section, we show how the concentration inequalities established in the previous section can be used to derive generalization bounds for predictors trained on interdependent data. More precisely, we take advantage of the concentration inequality given by Theorem 3 to provide a generalization bound for the problem of bipartite ranking, which can be reduced to the classification of pairs of examples (see Section 2). To do so, our proof rests on the notion of local fractional Rademacher complexity, a generalization of both the local Rademacher complexity (Bartlett et al., 2005; Koltchinskii, 2006) and the fractional Rademacher complexity (Usunier et al., 2006).

Write $X_{ij} := (T^+_i, T^-_j)$ for the positive/negative pairs of examples. If we assume that $n_+ \le n_-$, it is easy to see that $C := \{(C_j, \omega_j := 1)\}_{j\in[n_-]}$ with

$$C_j := \big\{\big(i, ((j + i - 2) \bmod n_-) + 1\big) : i = 1, \ldots, n_+\big\}$$

is an exact cover of the dependency graph of $(X_{ij})_{ij}$: within each $C_j$, no two pairs share an example. The chromatic number of this cover is therefore $\chi_f = \sum_j \omega_j = n_-$, and $\hat{R}_\ell(f, S)$ decomposes as

$$\hat{R}_\ell(f, S) = \frac{1}{N} \sum_{j\in[n_-]}\ \sum_{(k,l)\in C_j} \ell(f, X_{kl}), \qquad (33)$$

where, abusing notation, $\ell(f, X_{kl}) := \ell(f, T^+_k, T^-_l)$ and $N := n_+ n_-$. This is a colorable function with respect to the sequence $X := (X_{kl})_{kl}$, and the tools we have developed will help us derive a generalization bound for $f$.
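The circular cover above is straightforward to implement; the following sketch (Python; the function name is ours) enumerates the blocks $C_j$ and checks, on a toy example, that every pair $(i, j)$ appears in exactly one block, so the cover is exact, proper, and of weight $n_-$:

```python
def circular_pair_cover(n_pos, n_neg):
    """Exact proper fractional cover of the dependency graph of the pairs
    X_ij = (T_i^+, T_j^-), following the circular construction of Section 5
    (which assumes n_pos <= n_neg):
        C_j = {(i, ((j + i - 2) mod n_neg) + 1) : i = 1..n_pos},  w_j = 1.
    Within a block, positive and negative indices are all distinct, so no
    two of its n_pos pairs share an example: each block is an independent
    set, and the weight of the cover is n_neg."""
    assert n_pos <= n_neg
    return [
        ([(i, ((j + i - 2) % n_neg) + 1) for i in range(1, n_pos + 1)], 1.0)
        for j in range(1, n_neg + 1)
    ]

# toy usage: n_+ = 2, n_- = 3 gives 3 blocks of 2 independent pairs each
cover = circular_pair_cover(2, 3)
for block, w in cover:
    print(w, block)
seen = [pair for block, _ in cover for pair in block]
assert len(seen) == len(set(seen)) == 2 * 3  # every pair occurs exactly once
```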
Given a family of functions $F$ and $r > 0$, we define the parameterized family $F_{\ell,r}$ given by

$$F_{\ell,r} := \big\{f : f \in F,\ V_{X_{1,1}}\,\ell(f, X_{1,1}) \le r\big\},$$

where $V$ denotes the variance (recall that all the $X_{kl}$ are identically distributed). Now define the function $\Phi$ as

$$\Phi(X, r) := N \sup_{f\in F_{\ell,r}} \big[\mathbb{E}_{X'}[\hat{R}_\ell(f, X')] - \hat{R}_\ell(f, X)\big],$$

where $X'$ is a copy of $X$ and where we have used the notation $\mathbb{E}_{X'}[\hat{R}_\ell(f, X')]$ for $\mathbb{E}_S[\hat{R}_\ell(f, S)]$ to make explicit the dependence on the sequence of dependent variables $X$. It is easy to see that

$$\Phi(X, r) \le \sum_{j\in[n_-]} \sup_{f\in F_{\ell,r}}\ \sum_{(k,l)\in C_j} \big[\mathbb{E}_{X'_{kl}}[\ell(f, X'_{kl})] - \ell(f, X_{kl})\big] =: Z. \qquad (34)$$

When $\ell$ takes values in the interval $[0, 1]$, Theorem 4 readily applies to upper-bound the right-hand side of (34). Therefore, for $t > 0$, the following holds with probability at least $1 - e^{-t}$:

$$\Phi(X, r) \le \mathbb{E}[Z] + \sqrt{2cvt} + \frac{2ct}{3},$$

where $c = 25\chi_f/16 = 25n_-/16$ and $v \le Nr + 2\mathbb{E}[Z]$. Using $\sqrt{ab} \le ua + b/(4u)$ for all $u > 0$, we further have, for all $\alpha > 0$,

$$\Phi(X, r) \le (1 + \alpha)\mathbb{E}[Z] + \sqrt{2cNrt} + \Big(\frac{2}{3} + \frac{1}{\alpha}\Big)ct,$$

and, using Proposition 2, we get the following proposition.

**Proposition 4.** With probability at least $1 - e^{-t}$, for all $\alpha > 0$,

$$\Phi(X, r) \le (1 + \alpha)N\mathcal{R}(F_{\ell,r}, r) + \sqrt{2cNrt} + \Big(\frac{2}{3} + \frac{1}{\alpha}\Big)ct,$$

or, using $\chi_f = n_-$ and $N = n_+n_-$: with probability at least $1 - e^{-t}$, for all $f \in F_{\ell,r}$ and all $\alpha > 0$,

$$\mathbb{E}_S[\hat{R}_\ell(f, S)] - \hat{R}_\ell(f, S) \le (1 + \alpha)\,\mathcal{R}(F_{\ell,r}, r) + \frac{5}{4}\sqrt{\frac{2rt}{n_+}} + \frac{25}{16}\Big(\frac{2}{3} + \frac{1}{\alpha}\Big)\frac{t}{n_+}.$$

As is common with generalization bounds for bipartite ranking, the convergence rate is governed by the least represented class, i.e., the positive class here. Note that this result is only the starting point of a wealth of results that may be obtained from the concentration inequalities studied here. In particular, it might be possible to study how arguments based on star hulls and sub-root functions may help get fast-rate-like results akin to those of Clémençon et al. (2008).

## 6. Conclusion

We have proposed new concentration inequalities for functions of dependent variables. From these, we derived a new Talagrand concentration inequality for fractionally sub-additive functions, as well as concentration results for fractionally self-bounding functions of dependent variables. An instance of a generalization bound based on the local fractional Rademacher complexity for bipartite ranking exemplifies the usefulness of our concentration results.

**Acknowledgments.** This work is partially supported by the French GIP ANR under contract ANR GRETA 12-BS02-004-01 "Greediness: theory and algorithms" and by the LabEx PERSYVAL-Lab ANR-11-LABX-0025.

## A. Technical Results

**Theorem 6 (Markov's Inequality).** Let $X$ be a nonnegative random variable. For all $a > 0$,
$$P(X \ge a) \le \frac{\mathbb{E}[X]}{a}.$$

**Proposition 5 (Lemma A.3 of Bousquet (2003)).** Let $g_\lambda(x) := \psi(\lambda x)/x^2$. If $\lambda \ge 0$, then $g_\lambda$ is non-decreasing on $\mathbb{R}$; if $\lambda \le 0$, then $g_\lambda$ is non-increasing on $\mathbb{R}$.

**Lemma 2 (Lemma C.1 of Bousquet (2003)).** Let $F$ be a set of functions and let $Z := \sup_{f\in F} \sum_{k=1}^n f(X_k)$. Then, defining $Z_k := \sup_{f\in F} \sum_{i\ne k} f(X_i)$, $Z$ is sub-additive. The same is true if $Z := \sup_{f\in F} \big|\sum_{k=1}^n f(X_k)\big|$ and $Z_k := \sup_{f\in F} \big|\sum_{i\ne k} f(X_i)\big|$.

**Lemma 3 (Lemma 11 of Boucheron et al. (2003)).** Let $C$ and $a$ denote two positive real numbers. Then, for every $t > 0$,

$$\sup_{\lambda\in[0,1/a)}\Big[\lambda t - \frac{C\lambda^2}{2(1 - a\lambda)}\Big] = \frac{C}{a^2}\,h\Big(\frac{at}{C}\Big), \quad \text{where } h(x) := 1 + x - \sqrt{1 + 2x},$$

and the supremum is attained at $\lambda = \frac{1}{a}\big(1 - (1 + 2at/C)^{-1/2}\big)$.

## References

Agarwal, S. and Niyogi, P. Stability and Generalization of Bipartite Ranking Algorithms. In COLT, pp. 32–47, 2005.

Amini, M.-R. and Usunier, N. Learning with Partially Labeled and Interdependent Data. Springer, 2015.

Bartlett, P., Bousquet, O., and Mendelson, S. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.

Boucheron, S., Lugosi, G., and Massart, P. Concentration inequalities using the entropy method. The Annals of Probability, 31(3):1583–1614, 2003.

Boucheron, S., Bousquet, O., and Lugosi, G. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.

Boucheron, S., Lugosi, G., and Massart, P. On concentration of self-bounding functions. Electronic Journal of Probability, 14(64):1884–1899, 2009.

Bousquet, O. A Bennett Concentration Inequality and its Application to Suprema of Empirical Processes. CRAS, Série I, 334:495–500, 2002.

Bousquet, O. Concentration Inequalities for Sub-Additive Functions Using the Entropy Method. In Giné, E., Houdré, C., and Nualart, D. (eds.), Stochastic Inequalities and Applications, volume 56 of Progress in Probability, pp. 213–247. Birkhäuser Basel, 2003.

Clémençon, S., Lugosi, G., and Vayatis, N. Ranking and empirical minimization of U-statistics. The Annals of Statistics, 36(2):844–874, 2008.

Janson, S.
Large Deviations for Sums of Partly Dependent Random Variables. Random Structures and Algorithms, 24(3):234–248, 2004.

Karandikar, R. L. and Vidyasagar, M. Rates of uniform convergence of empirical means with mixing processes. Statistics and Probability Letters, 58(3):297–307, 2002.

Koltchinskii, V. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.

Kontorovich, L. and Ramanan, K. Concentration inequalities for dependent random variables via the martingale method. The Annals of Probability, 36(6):2126–2158, 2008.

Ledoux, M. Talagrand deviation inequalities for product measures. ESAIM: Probability and Statistics, 1:63–87, 1996.

London, B., Huang, B., Taskar, B., and Getoor, L. PAC-Bayesian Collective Stability. In Proc. of the 17th Int. Conf. on Artificial Intelligence and Statistics (AISTATS '14), 2014.

Massart, P. About the constants in Talagrand's concentration inequalities for empirical processes. The Annals of Probability, 28(2):863–884, 2000.

Maurer, A. Concentration inequalities for functions of independent variables. Random Structures and Algorithms, 29:121–138, 2006.

McDiarmid, C. On the method of bounded differences. Surveys in Combinatorics, pp. 148–188, 1989.

McDiarmid, C. and Reed, B. Concentration for Self-bounding Functions and an Inequality of Talagrand. Random Structures and Algorithms, 29(4):549–557, 2006.

Mohri, M. and Rostamizadeh, A. Stability Bounds for Non-i.i.d. Processes. In Adv. in Neural Information Processing Systems 20, pp. 1025–1032, 2008.

Mohri, M. and Rostamizadeh, A. Rademacher Complexity Bounds for Non-I.I.D. Processes. In Adv. in Neural Information Processing Systems 21, pp. 1097–1104, 2009.

Ralaivola, L., Szafranski, M., and Stempfel, G. Chromatic PAC-Bayes Bounds for Non-IID Data. In AISTATS '09: JMLR Workshop and Conference Proceedings, volume 5, pp. 416–423, 2009.

Ralaivola, L., Szafranski, M., and Stempfel, G. Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes. Journal of Machine Learning Research, pp. 1–30, 2010.

Samson, P.-M. Concentration of measure inequalities for Markov chains and Φ-mixing processes. The Annals of Probability, 28(1):416–461, 2000.

Steinwart, I. and Christmann, A. Fast learning from non-i.i.d. observations. In Adv. in Neural Information Processing Systems 22, pp. 1768–1776, 2010.

Usunier, N., Amini, M.-R., and Gallinari, P. Generalization Error Bounds for Classifiers Trained with Interdependent Data. In Adv. in Neural Information Processing Systems 18, pp. 1369–1376, 2006.

Yu, B. Rates of Convergence for Empirical Processes of Stationary Mixing Sequences. The Annals of Probability, 22(1):94–116, 1994.