# Optimistic Bounds for Multi-output Prediction

Henry W. J. Reeve¹, Ata Kabán¹

¹ School of Computer Science, University of Birmingham. Correspondence to: Henry W. J. Reeve.

*Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).*

We investigate the challenge of multi-output learning, where the goal is to learn a vector-valued function based on a supervised data set. This includes a range of important problems in Machine Learning including multi-target regression, multi-class classification and multi-label classification. We begin our analysis by introducing the self-bounding Lipschitz condition for multi-output loss functions, which interpolates continuously between a classical Lipschitz condition and a multi-dimensional analogue of a smoothness condition. We then show that the self-bounding Lipschitz condition gives rise to optimistic bounds for multi-output learning, which attain the minimax optimal rate up to logarithmic factors. The proof exploits local Rademacher complexity combined with a powerful minoration inequality due to Srebro, Sridharan and Tewari. As an application we derive a state-of-the-art generalisation bound for multi-class gradient boosting.

## 1. Introduction

Multi-output prediction represents an important class of problems that includes multi-class classification (Crammer & Singer, 2001), multi-label learning (Tsoumakas & Katakis, 2007; Zhang & Zhou, 2013), multi-target regression (Borchani et al., 2015), label distribution learning (Geng, 2016), structured regression (Cortes et al., 2016) and others, with a wide range of practical applications (Xu et al., 2019). Our objective is to provide a general framework for establishing guarantees for multi-output prediction problems.

A fundamental challenge in the statistical learning theory of multi-output prediction is to obtain bounds that allow for (i) a favourable convergence rate with the sample size, and (ii) a favourable dependence of the risk on the dimensionality of the output space. Whilst modern applications of multi-output prediction deal with increasingly large data sets, they also include problems where the target dimensionality is increasingly large. For example, the number of categories in multi-label learning is often of the order of tens of thousands, an emergent problem referred to as extreme classification (Agrawal et al., 2013; Babbar & Schölkopf, 2017; Bhatia et al., 2015; Jain et al., 2019).

Formally, the task of multi-output prediction is to learn a vector-valued function from a labelled training set. A common tool in the theoretical analysis of this problem has been a vector-valued extension of Talagrand's contraction inequality for Lipschitz losses (Ledoux & Talagrand, 2013). Both Maurer (2016) and Cortes et al. (2016) established vector-contraction inequalities for Rademacher complexity that gave rise to learning guarantees for multi-output prediction problems with a linear dependence on the dimensionality of the output space. More recently, Lei et al. (2019) provided more refined vector-contraction inequalities for both Gaussian and Rademacher complexity. This approach leads to a highly favourable sub-linear dependence on the output dimensionality, which can even be logarithmic, depending on the degree of regularisation. These structural results lead to a slow convergence rate of order $O(n^{-1/2})$.
Guermeur (2017) and Musayeva et al. (2019) explore an alternative approach based on covering numbers. Chzhen et al. (2017) derived a bound for multi-label classification based on Rademacher complexities. Each of these bounds gives rise to a favourable dependence on the dimensionality of the output space, but with a slow rate of order $O(n^{-1/2})$.

Local Rademacher complexities provide a crucial tool in establishing faster rates of convergence (Bousquet, 2002; Bartlett et al., 2005; Koltchinskii et al., 2006; Lei et al., 2016). By leveraging local Rademacher complexities, Liu et al. (2019) have derived guarantees for multi-class learning with function classes that are linear in an RKHS, building on their previous margin-based guarantees (Lei et al., 2015; Li et al., 2019). This gives rise to fast rates under suitable spectral conditions. Fast rates of convergence have also been derived by Xu et al. (2016) for multi-label classification with linear function spaces. On the other hand, Chzhen (2019) derived fast rates of convergence by exploiting an analogue of the margin assumption.

In this paper we establish generalisation bounds for multi-output prediction which yield fast rates whenever the empirical error is small. We address this problem by generalising to vector-valued functions a smoothness-based approach due to Srebro et al. (2010). A key advantage of our approach is that it allows us to accommodate a wide variety of multi-output loss functions, and hypothesis classes, making our analytic framework applicable to a variety of learning tasks. Below we summarise the contributions of this paper:

- We give a contraction inequality for the local Rademacher complexity of vector-valued functions (Proposition 1). The main ingredient is a self-bounding Lipschitz condition for multi-output loss functions that holds for several widely used examples.
- We leverage our localised contraction inequality to give a general upper bound for multi-output learning (Theorem 1), which exhibits fast rates whenever the empirical error is small.
- We demonstrate a concrete use of our general result by deriving from it a state-of-the-art bound for ensembles of multi-output decision trees (Theorem 7).

Furthermore, the obtained rates for multi-output learning are minimax optimal up to logarithmic factors. The corresponding lower bounds can be found in the full version (Reeve & Kabán, 2020).

### 1.1. Problem Setting

We shall consider multi-output prediction problems in supervised learning. Suppose we have a measurable space $\mathcal X$, a label space $\mathcal Y$ and an output space $\mathcal V$. We shall assume that there is an unknown probability distribution $P$ over random variables $(X, Y)$, taking values in $\mathcal X \times \mathcal Y$. The performance is quantified through a loss function $L : \mathcal V \times \mathcal Y \to \mathbb R$. Let $\mathcal M(\mathcal X, \mathcal V)$ denote the set of measurable functions $f : \mathcal X \to \mathcal V$. The goal of the learner is to obtain $f \in \mathcal M(\mathcal X, \mathcal V)$ such that the corresponding risk

$$\mathcal E_L(f, P) := \mathbb E_{(X,Y)\sim P}\left[L(f(X), Y)\right]$$

is as low as possible. The learner selects $f \in \mathcal M(\mathcal X, \mathcal V)$ based upon a sample $\mathcal D := \{(X_i, Y_i)\}_{i\in[n]}$, where the $(X_i, Y_i)$ are independent copies of $(X, Y)$. We let $\hat{\mathcal E}_L(f, \mathcal D) := n^{-1}\sum_{i\in[n]} L(f(X_i), Y_i)$ denote the empirical risk. When the distribution $P$ and the sample $\mathcal D$ are clear from context we shall write $\mathcal E_L(f)$ in place of $\mathcal E_L(f, P)$ and $\hat{\mathcal E}_L(f)$ in place of $\hat{\mathcal E}_L(f, \mathcal D)$.

We consider multi-output prediction problems in which $\mathcal V \subseteq \mathbb R^q$. We let $\|\cdot\|_\infty$ denote the max norm on $\mathbb R^q$ and for a positive integer $m \in \mathbb N$ we let $[m] := \{1, \ldots, m\}$.
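To make the notation concrete, the following minimal Python sketch (illustrative only, not part of the paper) computes the empirical risk $\hat{\mathcal E}_L(f, \mathcal D)$ of a vector-valued predictor under a simple max-norm loss; the predictor, the loss and all names here are hypothetical placeholders.

```python
import numpy as np

def max_norm_loss(u, y):
    """An illustrative multi-output loss L(u, y) = ||u - y||_inf on V = Y = R^q."""
    return float(np.max(np.abs(u - y)))

def empirical_risk(f, data, loss):
    """hat{E}_L(f, D) = (1/n) * sum_i L(f(X_i), Y_i)."""
    return float(np.mean([loss(f(x), y) for x, y in data]))

# Toy usage with a linear vector-valued predictor f : R^d -> R^q.
rng = np.random.default_rng(0)
d, q, n = 5, 3, 200
W = rng.normal(size=(q, d)) / np.sqrt(d)
f = lambda x: W @ x
X = rng.normal(size=(n, d))
Y = X @ W.T + 0.1 * rng.normal(size=(n, q))   # noisy targets around f(X)
print(empirical_risk(f, list(zip(X, Y)), max_norm_loss))
```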
## 2. The Self-bounding Lipschitz Condition

We introduce the following self-bounding Lipschitz condition for multi-output loss functions.

**Definition 1 (Self-bounding Lipschitz condition).** A loss function $L : \mathcal V \times \mathcal Y \to \mathbb R$ is said to be $(\lambda, \theta)$-self-bounding Lipschitz for $\lambda, \theta \ge 0$ if for all $y \in \mathcal Y$ and $u, v \in \mathcal V$,

$$|L(u, y) - L(v, y)| \le \lambda\, \max\{L(u, y), L(v, y)\}^{\theta}\, \|u - v\|_\infty.$$

This condition interpolates continuously between a classical Lipschitz condition (when $\theta = 0$) and a multi-dimensional analogue of a smoothness condition (when $\theta = 1/2$), and will be the main assumption that we use to obtain our results.

Our motivation for introducing Definition 1 is as follows. Firstly, in recent work of Lei et al. (2019) the classical Lipschitz condition with respect to the $\ell_\infty$ norm has been utilised to derive multi-class bounds with a favourable dependence upon the number of classes $q$. The role of the $\ell_\infty$ norm is crucial since it prevents the deviations in the loss function from accumulating as the output dimension $q$ grows. Our goal is to give a general framework which simultaneously achieves a favourable dependence upon $n$.

Secondly, Srebro et al. (2010) introduced a second-order smoothness condition on the loss function. This condition corresponds to the special case whereby $q = 1$ and $\theta = 1/2$. Srebro et al. (2010) showed that this smoothness condition gives rise to an optimistic bound having a fast rate $O(n^{-1})$ in the realisable case. The self-bounding Lipschitz condition provides a multi-dimensional analogue of this condition when $\theta = 1/2$, intended to yield a favourable dependence on the number of samples $n$. The results established in Sections 3 and 5 show that this is indeed the case, while we also obtain favourable dependence on the number of classes $q$.

Finally, by considering the range of exponents $\theta \in [0, 1/2]$ we exhibit convergence rates ranging from slow $O(n^{-1/2})$ to fast $O(n^{-1})$ in the realisable case. This is reminiscent of the celebrated Tsybakov margin condition (Mammen & Tsybakov, 1999), which interpolates between slow and fast rates in the parametric classification setting. Crucially, however, whilst the Tsybakov margin condition (Mammen & Tsybakov, 1999) is a condition on the underlying distribution, which cannot be verified in practice, the self-bounding Lipschitz condition is a property of a loss function which may be verified analytically by the learner.

### 2.1. Verifying the Self-bounding Lipschitz Condition

We start by giving a collection of results which can be used to verify that a given loss function satisfies the self-bounding Lipschitz condition. The following lemma is proved in Supplementary Appendix A.

**Lemma 1.** Take any $\lambda > 0$, $\theta \in [0, 1/2]$. Suppose that $L : \mathcal V \times \mathcal Y \to [0, \infty)$ is a loss function such that for any $u \in \mathcal V$, $y \in \mathcal Y$, there exists a non-negative differentiable function $\phi_{u,y} : \mathbb R \to [0, \infty)$ satisfying

1. $\phi_{u,y}(0) = L(u, y)$;
2. for all $t > 0$, $\sup_{v : \|u - v\|_\infty \le t}\{L(v, y)\} \le \phi_{u,y}(t)$;
3. the derivative $\phi'_{u,y}(t)$ is non-negative on $[0, \infty)$;
4. for all $t_0, t_1 \in \mathbb R$, $|\phi'_{u,y}(t_1) - \phi'_{u,y}(t_0)| \le \left(\frac{\lambda}{2}\right)^{\frac{1}{1-\theta}} |t_1 - t_0|^{\frac{\theta}{1-\theta}}$.

Then $L : \mathcal V \times \mathcal Y \to [0, \infty)$ is $(\lambda, \theta)$-self-bounding Lipschitz.

The following lemma shows that clipping preserves this condition.

**Lemma 2.** Suppose that $L : \mathcal V \times \mathcal Y \to [0, \infty)$ is a $(\lambda, \theta)$-self-bounding Lipschitz loss function with $\lambda > 0$, $\theta \in [0, 1]$. Then for any $b > 0$ the clipped loss $\bar L : \mathcal V \times \mathcal Y \to [0, b]$ defined by $\bar L(u, y) = \min\{L(u, y), b\}$ is also $(\lambda, \theta)$-self-bounding Lipschitz.

Finally, we note the following monotonicity property, which follows straightforwardly from the definition.
**Lemma 3.** Suppose that $L : \mathcal V \times \mathcal Y \to [0, b]$ is a bounded $(\lambda, \theta)$-self-bounding Lipschitz loss function with $\lambda > 0$, $\theta \in [0, 1]$. Then given any $\tilde\theta \le \theta$, the loss $L$ is also $(\tilde\lambda, \tilde\theta)$-self-bounding Lipschitz with $\tilde\lambda = \lambda\, b^{\theta - \tilde\theta}$.

### 2.2. Examples

We now demonstrate several examples of multi-output loss functions that satisfy our self-bounding Lipschitz condition. In each of the examples below we show that the self-bounding Lipschitz condition is satisfied by applying our sufficient condition (Lemma 1). Detailed proofs are given in Supplementary Appendix A.

#### 2.2.1. Multi-class Losses

We begin with the canonical multi-output prediction problem of multi-class classification, in which $\mathcal Y = [q]$ and $\mathcal V = \mathbb R^q$. A popular loss function for the theoretical analysis of multi-class learning is the margin loss (Crammer & Singer, 2001). The smoothed analogue of the margin loss was introduced by Srebro et al. (2010) in the one-dimensional setting, and by Li et al. (2018) in the multi-class setting.

**Example 1 (Smooth margin losses).** Given $\mathcal Y = [q]$ we define the margin function $m : \mathcal V \times \mathcal Y \to \mathbb R$ by $m(u, y) := u_y - \max_{j\in[q]\setminus\{y\}}\{u_j\}$. The zero-one loss $L_{0,1} : \mathcal V \times \mathcal Y \to [0, 1]$ is defined by $L_{0,1}(u, y) = \mathbf 1\{m(u, y) \le 0\}$. Whilst natural, the zero-one loss has the drawback of being discontinuous, which presents an obstacle for deriving guarantees. For each $\rho > 0$, the corresponding margin loss $L_\rho : \mathcal V \times \mathcal Y \to [0, 1]$ is defined by $L_\rho(u, y) = \mathbf 1\{m(u, y) \le \rho\}$. The margin loss $L_\rho$ is also discontinuous. However, we may define a smooth margin loss $\tilde L_\rho : \mathcal V \times \mathcal Y \to [0, 1]$ by

$$\tilde L_\rho(u, y) := \begin{cases} 1 & \text{if } m(u, y) \le 0,\\[2pt] 2\left(\frac{m(u,y)}{\rho}\right)^3 - 3\left(\frac{m(u,y)}{\rho}\right)^2 + 1 & \text{if } m(u, y) \in [0, \rho],\\[2pt] 0 & \text{if } m(u, y) \ge \rho.\end{cases}$$

By applying Lemma 1 we can show that $\tilde L_\rho$ is $(\lambda, \theta)$-self-bounding Lipschitz with $\lambda = 4\sqrt 6\,\rho^{-1}$ and $\theta = 1/2$. Moreover, the smooth margin loss satisfies $L_{0,1}(u, y) \le \tilde L_\rho(u, y) \le L_\rho(u, y)$ for all $(u, y) \in \mathcal V \times \mathcal Y$.

The margin loss plays a central role in learning theory and continues to receive significant attention in the analysis of multi-class prediction (Guermeur, 2017; Li et al., 2018; Musayeva et al., 2019), so it is fortuitous that our self-bounding Lipschitz condition incorporates the smooth margin loss. More importantly, however, the self-bounding Lipschitz condition applies to a variety of other loss functions which have received less attention in statistical learning theory. One of the most widely used loss functions in practical applications is the multinomial logistic loss, also known as the softmax loss.

**Example 2 (Multinomial logistic loss).** Given $\mathcal Y = [q]$, the multinomial logistic loss $L : \mathcal V \times \mathcal Y \to [0, \infty)$ is defined by

$$L(u, y) = \log\left(\sum_{j\in[q]} \exp(u_j - u_y)\right),$$

where $u = (u_j)_{j\in[q]}$ and $y \in [q]$. For each $(u, y) \in \mathcal V \times [q]$ let $A_{u,y} = \sum_{j\in[q]\setminus\{y\}} \exp(u_j - u_y)$ and define $\phi_{u,y}(t) = \log(1 + A_{u,y}\exp(2t))$. By applying Lemma 1 with $\phi_{u,y}$ we can show that the multinomial logistic loss $L$ is $(\lambda, \theta)$-self-bounding Lipschitz with $\lambda = 2$ and $\theta = 1/2$.

Recently, Lei et al. (2019) pointed out that the multinomial logistic loss is 2-Lipschitz with respect to the $\ell_\infty$-norm (equivalently, $(2, 0)$-self-bounding Lipschitz). This gives rise to a slow rate of order $O(n^{-1/2})$. The fact that the multinomial logistic loss is also $(2, 1/2)$-self-bounding can be used to derive more favourable guarantees, as we shall see in Section 3.
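The self-bounding Lipschitz inequality of Definition 1 can be spot-checked numerically for a candidate pair $(\lambda, \theta)$. The minimal sketch below (not from the paper; purely a sanity check under randomly sampled score vectors) evaluates the ratio of the two sides of the inequality for the multinomial logistic loss of Example 2 with $\lambda = 2$, $\theta = 1/2$; the observed ratio should never exceed 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def multinomial_logistic_loss(u, y):
    """L(u, y) = log(sum_j exp(u_j - u_y)), computed stably (Example 2)."""
    z = u - u[y]
    m = z.max()
    return float(m + np.log(np.exp(z - m).sum()))

lam, theta, q = 2.0, 0.5, 6
worst = 0.0
for _ in range(20000):
    y = int(rng.integers(q))
    u = rng.normal(scale=3.0, size=q)
    v = u + rng.normal(scale=0.5, size=q)
    lu, lv = multinomial_logistic_loss(u, y), multinomial_logistic_loss(v, y)
    lhs = abs(lu - lv)
    rhs = lam * max(lu, lv) ** theta * np.max(np.abs(u - v))
    if rhs > 0:
        worst = max(worst, lhs / rhs)
# Should stay <= 1 if the (2, 1/2)-self-bounding Lipschitz property holds.
print("max observed |L(u,y) - L(v,y)| / (lam * max(L)^theta * ||u-v||_inf):", worst)
```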
#### 2.2.2. Multi-label Losses

In multi-label prediction, instances may be simultaneously assigned to several categories. We have $\mathcal Y \subseteq \{0, 1\}^q$, where $q$ is the total number of possible classes. Whilst $q$ is often very large, the total number of simultaneous labels is typically much smaller. Hence, we consider the set of $k$-sparse binary vectors $\mathcal S(k) = \{(y_j)_{j\in[q]} \in \{0,1\}^q : \sum_{j\in[q]} y_j \le k\}$, where $k \in [q]$. We consider the pick-all-labels loss (Menon et al., 2019; Reddi et al., 2019).

**Example 3 (Pick-all-labels).** Given $\mathcal Y = \mathcal S(k)$, the pick-all-labels loss $L : \mathcal V \times \mathcal Y \to [0, \infty)$ is defined by

$$L(u, y) = \sum_{l\in[q]} y_l \log\left(\sum_{j\in[q]} \exp(u_j - u_l)\right),$$

where $u = (u_j)_{j\in[q]} \in \mathcal V$ and $y = (y_j)_{j\in[q]} \in \mathcal Y$. For each $(u, y) \in \mathcal V \times \mathcal Y$ and $l \in [q]$ we define $A_{u,l} = \sum_{j\in[q]\setminus\{l\}} \exp(u_j - u_l)$ and let $\phi_{u,y}(t) := \sum_{l\in[q]} y_l \log(1 + A_{u,l}\exp(2t))$. By applying Lemma 1 with $\phi_{u,y}$ we can show that $L$ is $(\lambda, \theta)$-self-bounding Lipschitz with $\lambda = 2\sqrt k$ and $\theta = 1/2$.

Crucially, the constant $\lambda$ for the pick-all-labels family of losses is a function of the sparsity $k$, rather than the total number of labels. This means that our approach is applicable to multi-label problems with tens of thousands of labels, as long as the label vectors are $k$-sparse.

#### 2.2.3. Losses for Multi-target Regression

We now return to the problem of multi-target regression, in which $\mathcal Y = \mathbb R^q$ (Borchani et al., 2015).

**Example 4 (Sup-norm losses).** Given $\kappa, \gamma \in [1, 2]$ we can define a loss function $L : \mathcal V \times \mathcal Y \to \mathbb R$ for multi-target regression by setting $L(u, y) = \kappa\,\|u - y\|_\infty^\gamma$. By applying Lemma 1 with $\phi_{u,y}(t) = \kappa\,(\|u - y\|_\infty + t)^\gamma$ we can see that $L$ is $(\lambda, \theta)$-self-bounding Lipschitz with $\lambda = (8\kappa)^{1-\theta}$ and $\theta = (\gamma - 1)/\gamma$. This yields examples of $(\lambda, \theta)$-self-bounding Lipschitz loss functions for all $\lambda > 0$ and $\theta \in [0, 1/2]$.

With these examples in mind we are ready to present our results.
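As a brief illustration of Example 3 (again a toy sketch, not the paper's code), the pick-all-labels loss only sums over the $k$ active labels of a $k$-sparse label vector, which is why its self-bounding Lipschitz constant $\lambda = 2\sqrt k$ depends on the sparsity rather than on $q$; the example below evaluates it with $q = 10{,}000$ and $k = 5$.

```python
import numpy as np

def pick_all_labels_loss(u, y):
    """Example 3: L(u, y) = sum_{l : y_l = 1} log(sum_j exp(u_j - u_l)) for y in {0,1}^q."""
    total = 0.0
    for l in np.flatnonzero(y):      # only the k active labels contribute
        z = u - u[l]
        m = z.max()
        total += m + np.log(np.exp(z - m).sum())
    return float(total)

rng = np.random.default_rng(1)
q, k = 10_000, 5
u = rng.normal(size=q)
y = np.zeros(q, dtype=int)
y[rng.choice(q, size=k, replace=False)] = 1   # a k-sparse label vector in S(k)
print(pick_all_labels_loss(u, y))
```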
## 3. Main Results

In this section we give a general upper bound for multi-output prediction problems under the self-bounding Lipschitz condition. A key tool for proving this result will be a contraction inequality for the local Rademacher complexity of vector-valued functions, given in Section 4.1, which may also be of independent interest. First, we recall the concept of Rademacher complexity.

**Definition 2 (Rademacher complexity).** Let $\mathcal Z$ be a measurable space and consider a function class $\mathcal G \subseteq \mathcal M(\mathcal Z, \mathbb R)$. Given a sequence $z = (z_i)_{i\in[n]} \in \mathcal Z^n$ we define the empirical Rademacher complexity of $\mathcal G$ with respect to $z$ by¹

$$\hat{\mathcal R}_z(\mathcal G) := \sup_{\tilde{\mathcal G}\subseteq\mathcal G : |\tilde{\mathcal G}| < \infty}\; \mathbb E_\sigma\left[\sup_{g\in\tilde{\mathcal G}} \frac{1}{n}\sum_{i\in[n]} \sigma_i\, g(z_i)\right],$$

where the expectation is taken over sequences of independent Rademacher random variables $\sigma = (\sigma_i)_{i\in[n]} \in \{-1, +1\}^n$. For each $n \in \mathbb N$, the worst-case Rademacher complexity of $\mathcal G$ is defined by $\mathcal R_n(\mathcal G) := \sup_{z\in\mathcal Z^n} \hat{\mathcal R}_z(\mathcal G)$.

¹ Taking the supremum over finite subsets $\tilde{\mathcal G} \subseteq \mathcal G$ is required to ensure that the function within the expectation is measurable (Talagrand, 2014). This technicality can typically be overlooked.

The Rademacher complexity is defined in the context of real-valued functions. However, in this work we deal with multi-output prediction, so we shall focus on function classes $\mathcal F \subseteq \mathcal M(\mathcal X, \mathbb R^q)$. In order to utilise the theory of Rademacher complexity in this context we shall transform function classes $\mathcal F \subseteq \mathcal M(\mathcal X, \mathbb R^q)$ into the projected function classes $\Pi\circ\mathcal F \subseteq \mathcal M(\mathcal X \times [q], \mathbb R)$ as follows. Firstly, for each $j \in [q]$ we define $\pi_j : \mathbb R^q \to \mathbb R$ to be the projection onto the $j$-th coordinate. We then define, for each $f \in \mathcal M(\mathcal X, \mathbb R^q)$, the function $\Pi\circ f : \mathcal X \times [q] \to \mathbb R$ by $(\Pi\circ f)(x, j) = \pi_j(f(x))$. Finally, given $\mathcal F \subseteq \mathcal M(\mathcal X, \mathbb R^q)$ we let $\Pi\circ\mathcal F := \{\Pi\circ f : f \in \mathcal F\} \subseteq \mathcal M(\mathcal X \times [q], \mathbb R)$.

Our central result is the following relative bound.

**Theorem 1.** Suppose we have a class of multi-output functions $\mathcal F \subseteq \mathcal M(\mathcal X, [-\beta, \beta]^q)$, and a $(\lambda, \theta)$-self-bounding Lipschitz loss function $L : \mathcal V \times \mathcal Y \to [0, b]$ for some $\beta, b \ge 1$, $\lambda > 0$, $\theta \in [0, 1/2]$. Take $\delta \in (0, 1)$, $n \in \mathbb N$ and let

$$\Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal F) := \left(\lambda\left(\sqrt q\,\log^{3/2}(e\beta nq)\,\mathcal R_{nq}(\Pi\circ\mathcal F) + \frac{1}{\sqrt n}\right)\right)^{\frac{1}{1-\theta}} + \frac{\log(1/\delta) + \log(\log n)}{n}.$$

There exist numerical constants $C_0, C_1 > 0$ such that given an i.i.d. sample $\mathcal D$ the following holds with probability at least $1 - \delta$ for all $f \in \mathcal F$,

$$\mathcal E_L(f) \le \hat{\mathcal E}_L(f) + C_0\left(\sqrt{\hat{\mathcal E}_L(f)\,\Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal F)} + \Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal F)\right).$$

Moreover, if $f^* \in \operatorname{argmin}_{f\in\mathcal F}\{\mathcal E_L(f)\}$ minimises the risk and $\hat f \in \operatorname{argmin}_{f\in\mathcal F}\{\hat{\mathcal E}_L(f)\}$ minimises the empirical risk, then with probability at least $1 - \delta$,

$$\mathcal E_L(\hat f) \le \mathcal E_L(f^*) + C_1\left(\sqrt{\mathcal E_L(f^*)\,\Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal F)} + \Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal F)\right).$$

The proof of Theorem 1 is given in Section 4.2. It builds on a local contraction inequality (Proposition 1, Section 4.1), combined with techniques from Bousquet (2002). Theorem 1 gives an upper bound for the generalisation gap $\mathcal E_L(f) - \hat{\mathcal E}_L(f)$, framed in terms of a complexity term $\Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal F)$, which depends upon both the Rademacher complexity of the projected function class $\mathcal R_{nq}(\Pi\circ\mathcal F)$ and the self-bounding Lipschitz parameters $\lambda, \theta$. When the empirical error is small in relation to the complexity term ($\hat{\mathcal E}_L(f) \le \Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal F)$), the generalisation gap is of order $\Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal F)$. In less favourable circumstances we recover a bound of order $\sqrt{\Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal F)}$.

### 3.1. Comparison with State of the Art

In this section we compare our main result (Theorem 1) with a closely related guarantee due to Lei et al. (2019). Observe that a loss function $L$ is $\lambda$-Lipschitz (with respect to $\|\cdot\|_\infty$) precisely when it is $(\lambda, \theta)$-self-bounding Lipschitz with $\theta = 0$.

**Theorem 2 (Lei et al., 2019).** Suppose we have a class of multi-output functions $\mathcal F \subseteq \mathcal M(\mathcal X, [-\beta, \beta]^q)$, and a $\lambda$-Lipschitz loss function $L : \mathcal V \times \mathcal Y \to [0, b]$ for some $\beta, b \ge 1$ and $\lambda > 0$. Take $\delta \in (0, 1)$, $n \in \mathbb N$ and let

$$J^{\lambda}_{n,q,\delta}(\mathcal F) := \lambda\left(\sqrt q\,\log^{3/2}(e\beta nq)\,\mathcal R_{nq}(\Pi\circ\mathcal F) + \frac{1}{\sqrt n}\right).$$

There exist numerical constants $C_2, C_3 > 0$ such that given an i.i.d. sample $\mathcal D$ the following holds with probability at least $1 - \delta$ for all $f \in \mathcal F$,

$$\mathcal E_L(f) \le \hat{\mathcal E}_L(f) + C_2\, J^{\lambda}_{n,q,\delta}(\mathcal F) + b\sqrt{\frac{\log(1/\delta)}{2n}}.$$

Moreover, if $f^* \in \operatorname{argmin}_{f\in\mathcal F}\{\mathcal E_L(f)\}$ minimises the risk and $\hat f \in \operatorname{argmin}_{f\in\mathcal F}\{\hat{\mathcal E}_L(f)\}$ minimises the empirical risk, then with probability at least $1 - \delta$,

$$\mathcal E_L(\hat f) \le \mathcal E_L(f^*) + C_3\, J^{\lambda}_{n,q,\delta}(\mathcal F) + 2b\sqrt{\frac{\log(2/\delta)}{2n}}.$$

Theorem 2 is a mild generalisation of Theorem 6 from Lei et al. (2019), originally formulated for multi-class classification with $\mathcal F$ an RKHS. For completeness we show that Theorem 2 follows from Proposition 1 in Supplementary Appendix B. Note that by the monotonicity property (Lemma 3) any loss function $L : \mathcal V \times \mathcal Y \to [0, b]$ which is $(\lambda, \theta)$-self-bounding Lipschitz is also $\lambda b^\theta$-Lipschitz, so the additive bound in Theorem 2 also applies.

To gain a deeper intuition for the bound in Theorem 1 we compare it with the bound in Theorem 2. Let us suppose that $\mathcal R_{nq}(\Pi\circ\mathcal F) = O((nq)^{-1/2})$ (for a concrete example where this is the case see Section 5). We then have $\Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal F) = O(n^{-\frac{1}{2(1-\theta)}})$. For large values of $\hat{\mathcal E}_L(f)$, Theorem 1 gives a bound on the generalisation gap $\mathcal E_L(f) - \hat{\mathcal E}_L(f)$ of order $O(n^{-\frac{1}{4(1-\theta)}})$, which is slower than the rate achieved by Theorem 2 whenever $\theta < 1/2$. However, when $\hat{\mathcal E}_L(f)$ is small ($\hat{\mathcal E}_L(f) = O(n^{-\frac{1}{2(1-\theta)}})$), Theorem 1 gives rise to a bound of order $O(n^{-\frac{1}{2(1-\theta)}})$, yielding faster rates than can be obtained through the standard Lipschitz condition alone whenever $\theta > 0$. Finally, note that if the loss $L$ is $(\lambda, \theta)$-self-bounding Lipschitz with $\theta = 1/2$ then the rates given by Theorem 1 always either match or outperform the rates given by Theorem 2. Moreover, $\theta = 1/2$ occurs for several practical examples discussed in Section 2.2, including the multinomial logistic loss.
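To make this comparison concrete, the short sketch below (illustrative only; the function names `gamma` and `lei_term` are mine, and all constants are dropped) plugs the assumption $\mathcal R_{nq}(\Pi\circ\mathcal F) = (nq)^{-1/2}$ into the complexity terms of Theorems 1 and 2 with $\theta = 1/2$, and prints the resulting generalisation-gap scalings in the optimistic regime of near-zero empirical risk.

```python
import numpy as np

def gamma(n, q, lam=2.0, theta=0.5, beta=1.0, delta=0.05):
    """Complexity term of Theorem 1 with R_nq(Pi o F) = (nq)^(-1/2) plugged in; constants ignored."""
    rad = 1.0 / np.sqrt(n * q)
    main = lam * (np.sqrt(q) * np.log(np.e * beta * n * q) ** 1.5 * rad + 1.0 / np.sqrt(n))
    return main ** (1.0 / (1.0 - theta)) + (np.log(1.0 / delta) + np.log(np.log(n))) / n

def lei_term(n, q, lam=2.0, beta=1.0):
    """Complexity term J of Theorem 2 under the same assumption on R_nq(Pi o F)."""
    rad = 1.0 / np.sqrt(n * q)
    return lam * (np.sqrt(q) * np.log(np.e * beta * n * q) ** 1.5 * rad + 1.0 / np.sqrt(n))

q = 10
for n in [10**3, 10**4, 10**5, 10**6]:
    g = gamma(n, q)                    # theta = 1/2: roughly O(1/n) up to logarithmic factors
    emp = 0.0                          # optimistic regime: near-zero empirical risk
    thm1_gap = np.sqrt(emp * g) + g    # Theorem 1 generalisation gap (constants dropped)
    print(f"n={n:>7}  Theorem 1 gap ~ {thm1_gap:.2e}   Theorem 2 gap ~ {lei_term(n, q):.2e}")
```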
## 4. Proofs of Main Results

We now turn to stating and proving the key ingredient of our main result, Proposition 1.

### 4.1. A Contraction Inequality for the Local Rademacher Complexity of Vector-valued Function Classes

First we introduce some additional notation. Suppose $f \in \mathcal M(\mathcal X, \mathcal V)$. Given a loss function $L : \mathcal V \times \mathcal Y \to \mathbb R$ we define $L\circ f : \mathcal X \times \mathcal Y \to \mathbb R$ by $(L\circ f)(x, y) = L(f(x), y)$. We extend this definition to function classes $\mathcal F \subseteq \mathcal M(\mathcal X, \mathcal V)$ by $L\circ\mathcal F = \{L\circ f : f \in \mathcal F\}$. Moreover, for each $z \in (\mathcal X \times \mathcal Y)^n$ and $r > 0$, we define the subset $\mathcal F|^r_z := \{f \in \mathcal F : \hat{\mathcal E}_L(f, z) \le r\}$. Intuitively, the local Rademacher complexity allows us to zoom in on the neighbourhood of the empirical risk minimiser. This is the subset that matters in practice and it is typically much smaller than the full $\Pi\circ\mathcal F$.

**Proposition 1.** Suppose we have a class of multi-output functions $\mathcal F \subseteq \mathcal M(\mathcal X, [-\beta, \beta]^q)$, where $\beta \ge 1$. Given a $(\lambda, \theta)$-self-bounding Lipschitz loss function $L : \mathcal V \times \mathcal Y \to [0, R]$, where $\lambda > 0$, $\theta \in [0, 1/2]$, and $z \in (\mathcal X \times \mathcal Y)^n$, $r > 0$, we have the following bound,

$$\hat{\mathcal R}_z\!\left(L\circ\mathcal F|^r_z\right) \le \lambda r^\theta\, 2^9\left(\sqrt q\,\log^{3/2}(e\beta nq)\,\mathcal R_{nq}(\Pi\circ\mathcal F) + n^{-1/2}\right).$$

The proof of Proposition 1, given later in this section, relies upon covering numbers.

**Definition 3 (Covering numbers).** Let $(\mathcal M, \rho)$ be a semimetric space. Given a set $\mathcal A \subseteq \mathcal M$ and an $\epsilon > 0$, a subset $\tilde{\mathcal A} \subseteq \mathcal A$ is said to be a (proper) $\epsilon$-cover of $\mathcal A$ if, for all $a \in \mathcal A$, there exists some $\tilde a \in \tilde{\mathcal A}$ with $\rho(a, \tilde a) \le \epsilon$. We let $N(\epsilon, \mathcal A, \rho)$ denote the minimal cardinality of an $\epsilon$-cover of $\mathcal A$.

We shall consider covering numbers for two classes of data-dependent semimetric spaces. Let $\mathcal Z$ be a measurable space and take $\mathcal G \subseteq \mathcal M(\mathcal Z, \mathbb R)$. For each $n \in \mathbb N$ and each sequence $z = (z_i)_{i\in[n]} \in \mathcal Z^n$ we define a pair of semimetrics $\rho_{z,2}$ and $\rho_{z,\infty}$ by

$$\rho_{z,2}(g_0, g_1) := \left(\frac{1}{n}\sum_{i\in[n]} (g_0(z_i) - g_1(z_i))^2\right)^{1/2}, \qquad \rho_{z,\infty}(g_0, g_1) := \max_{i\in[n]}\{|g_0(z_i) - g_1(z_i)|\},$$

where $g_0, g_1 \in \mathcal G$.
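As a small illustration of Definition 3 (a toy sketch, not part of the paper's argument), the greedy procedure below constructs a proper $\epsilon$-cover of a finite collection of function-evaluation vectors under the $\rho_{z,\infty}$ semimetric; its size is an upper bound on $N(\epsilon, \cdot, \rho_{z,\infty})$, since a greedy cover need not be minimal.

```python
import numpy as np

def greedy_proper_cover(values, eps):
    """Greedily build a proper eps-cover of the rows of `values` under
    rho_{z,inf}(g0, g1) = max_i |g0(z_i) - g1(z_i)|.
    Each row is the vector (g(z_1), ..., g(z_n)) of one function g evaluated on the sample.
    Returns the indices of the chosen cover elements."""
    cover = []
    for idx, v in enumerate(values):
        if not any(np.max(np.abs(v - values[c])) <= eps for c in cover):
            cover.append(idx)
    return cover

# Toy usage: a finite class of randomly drawn bounded functions evaluated on n points.
rng = np.random.default_rng(0)
n_points, n_funcs = 20, 500
values = rng.uniform(-1.0, 1.0, size=(n_funcs, n_points))
for eps in [1.5, 1.0, 0.75, 0.5]:
    print(f"eps={eps:<5} greedy cover size = {len(greedy_proper_cover(values, eps))}")
```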
The first stage of the proof of Proposition 1 will be the following lemma, which bounds the covering number of $L\circ\mathcal F|^r_z$ in terms of an associated covering number for $\Pi\circ\mathcal F$.

**Lemma 4.** Suppose that $\mathcal F \subseteq \mathcal M(\mathcal X, \mathbb R^q)$ and $L$ is $(\lambda, \theta)$-self-bounding Lipschitz with $\theta \in [0, 1/2]$. Take $L : \mathcal V \times \mathcal Y \to [0, b]$, $z = \{(x_i, y_i)\}_{i\in[n]} \in (\mathcal X \times \mathcal Y)^n$, $r > 0$ and define $w = \{(x_i, j)\}_{(i,j)\in[n]\times[q]} \in (\mathcal X \times [q])^{nq}$. Given any $f_0, f_1 \in \mathcal F|^r_z$ we have

$$\rho_{z,2}(L\circ f_0, L\circ f_1) \le 2^\theta \lambda r^\theta\, \rho_{w,\infty}(\Pi\circ f_0, \Pi\circ f_1).$$

Moreover, for any $\epsilon > 0$,

$$N\!\left(2^{1+\theta}\lambda r^\theta \epsilon,\; L\circ\mathcal F|^r_z,\; \rho_{z,2}\right) \le N\!\left(\epsilon,\; \Pi\circ\mathcal F,\; \rho_{w,\infty}\right).$$

*Proof of Lemma 4.* To prove the first part of the lemma we take $f_0, f_1 \in \mathcal F|^r_z$ and let $\zeta = \rho_{w,\infty}(\Pi\circ f_0, \Pi\circ f_1)$. It follows from the construction of $w$ that $|\pi_j(f_0(x_i)) - \pi_j(f_1(x_i))| \le \zeta$ for each $(i, j) \in [n]\times[q]$, so $\|f_0(x_i) - f_1(x_i)\|_\infty \le \zeta$ for each $i \in [n]$. Furthermore, by the self-bounding Lipschitz condition we deduce that for each $i \in [n]$,

$$|L(f_0(x_i), y_i) - L(f_1(x_i), y_i)| \le \lambda\,\max\{L(f_0(x_i), y_i), L(f_1(x_i), y_i)\}^\theta\, \|f_0(x_i) - f_1(x_i)\|_\infty \le \lambda\,\max\{L(f_0(x_i), y_i), L(f_1(x_i), y_i)\}^\theta\, \zeta.$$

Hence, by Jensen's inequality we have

$$\rho_{z,2}(L\circ f_0, L\circ f_1)^2 = \frac{1}{n}\sum_{i\in[n]}\left(L(f_0(x_i), y_i) - L(f_1(x_i), y_i)\right)^2 \le (\lambda\zeta)^2\,\frac{1}{n}\sum_{i\in[n]} \max\{L(f_0(x_i), y_i), L(f_1(x_i), y_i)\}^{2\theta} \le (\lambda\zeta)^2\left(\frac{1}{n}\sum_{i\in[n]} \max\{L(f_0(x_i), y_i), L(f_1(x_i), y_i)\}\right)^{2\theta} \le (\lambda\zeta)^2\left(\hat{\mathcal E}_L(f_0, z) + \hat{\mathcal E}_L(f_1, z)\right)^{2\theta} \le (\lambda\zeta)^2 (2r)^{2\theta},$$

where we use the fact that $\theta \in [0, 1/2]$ and $\max\{\hat{\mathcal E}_L(f_0, z), \hat{\mathcal E}_L(f_1, z)\} \le r$. Thus, $\rho_{z,2}(L\circ f_0, L\circ f_1) \le 2^\theta\lambda r^\theta\,\zeta = 2^\theta\lambda r^\theta\,\rho_{w,\infty}(\Pi\circ f_0, \Pi\circ f_1)$. This completes the proof of the first part of the lemma.

To prove the second part of the lemma we note that since $\Pi\circ\mathcal F|^r_z \subseteq \Pi\circ\mathcal F$ we have²

$$N\!\left(2\epsilon, \Pi\circ\mathcal F|^r_z, \rho_{w,\infty}\right) \le N\!\left(\epsilon, \Pi\circ\mathcal F, \rho_{w,\infty}\right),$$

so we may choose $f_1, \ldots, f_m \in \mathcal F|^r_z$ with $m \le N(\epsilon, \Pi\circ\mathcal F, \rho_{w,\infty})$ such that $\Pi\circ f_1, \ldots, \Pi\circ f_m$ forms a $2\epsilon$-cover of $\Pi\circ\mathcal F|^r_z$ with respect to the $\rho_{w,\infty}$ metric. To complete the proof it suffices to show that $L\circ f_1, \ldots, L\circ f_m$ is a $2^{1+\theta}\lambda r^\theta\epsilon$-cover of $L\circ\mathcal F|^r_z$ with respect to the $\rho_{z,2}$ metric.

² The factor of 2 is required as we are using proper covers, which are subsets of the set being covered (see Definition 3).

Take any $g \in L\circ\mathcal F|^r_z$, so $g = L\circ f$ for some $f \in \mathcal F|^r_z$. Since $\Pi\circ f_1, \ldots, \Pi\circ f_m$ forms a $2\epsilon$-cover of $\Pi\circ\mathcal F|^r_z$ we may choose $l \in [m]$ so that $\rho_{w,\infty}(\Pi\circ f_l, \Pi\circ f) \le 2\epsilon$. By the first part of the lemma we deduce that $\rho_{z,2}(L\circ f_l, g) = \rho_{z,2}(L\circ f_l, L\circ f) \le 2^{1+\theta}\lambda r^\theta\epsilon$. Since this holds for all $g \in L\circ\mathcal F|^r_z$, we see that $L\circ f_1, \ldots, L\circ f_m$ is a $2^{1+\theta}\lambda r^\theta\epsilon$-cover of $L\circ\mathcal F|^r_z$, which completes the proof of the lemma. $\square$

To prove Proposition 1, we shall also utilise two technical results to move from covering numbers to Rademacher complexity and back. First, we shall use the following powerful result from Srebro et al. (2010), which gives an upper bound for worst-case covering numbers in terms of the worst-case Rademacher complexity.

**Theorem 3 (Srebro et al., 2010).** Given a measurable space $\mathcal Z$ and a function class $\mathcal G \subseteq \mathcal M(\mathcal Z, [-\beta, \beta])$, for any $\epsilon > 2\mathcal R_n(\mathcal G)$ and any $z \in \mathcal Z^n$,

$$\log N(\epsilon, \mathcal G, \rho_{z,\infty}) \le \frac{4n\,(\mathcal R_n(\mathcal G))^2}{\epsilon^2}\,\log\!\left(\frac{2e\beta n}{\epsilon}\right).$$

We can view this result as an analogue of Sudakov's minoration inequality for $\ell_\infty$ covers, rather than $\ell_2$ covers. Secondly, we shall use Dudley's inequality (Dudley, 1967), which allows us to bound Rademacher complexities in terms of covering numbers. We shall use the following variant due to Guermeur (2017), as it yields more favourable constants.

**Theorem 4 (Guermeur, 2017).** Suppose we have a measurable space $\mathcal Z$, a function class $\mathcal G \subseteq \mathcal M(\mathcal Z, \mathbb R)$ and a sequence $z \in \mathcal Z^n$. For any decreasing sequence $(\epsilon_k)_{k=0}^\infty$ with $\lim_{k\to\infty}\epsilon_k = 0$ and $\epsilon_0 \ge \sup_{g_0, g_1\in\mathcal G}\rho_{z,2}(g_0, g_1)$, the following inequality holds for all $K \in \mathbb N$,

$$\hat{\mathcal R}_z(\mathcal G) \le \epsilon_K + \sum_{k=1}^{K}(\epsilon_k + \epsilon_{k-1})\sqrt{\frac{\log N(\epsilon_k, \mathcal G, \rho_{z,2})}{n}}.$$

We are now ready to complete the proof of our local Rademacher complexity inequality.

*Proof of Proposition 1.* Take $z = \{(x_i, y_i)\}_{i\in[n]} \in (\mathcal X \times \mathcal Y)^n$ and $r > 0$ and define $w = \{(x_i, j)\}_{(i,j)\in[n]\times[q]} \in (\mathcal X \times [q])^{nq}$. By Lemma 4 combined with Theorem 3 applied to $\Pi\circ\mathcal F$ we see that for each $\xi > 2\mathcal R_{nq}(\Pi\circ\mathcal F)$ we have

$$\log N\!\left(2^{1+\theta}\lambda r^\theta\xi, L\circ\mathcal F|^r_z, \rho_{z,2}\right) \le \log N(\xi, \Pi\circ\mathcal F, \rho_{w,\infty}) \le \frac{4nq\,(\mathcal R_{nq}(\Pi\circ\mathcal F))^2}{\xi^2}\log\!\left(\frac{2e\beta nq}{\xi}\right). \tag{1}$$

Moreover, given any $g_0 = L\circ f_0$, $g_1 = L\circ f_1 \in L\circ\mathcal F|^r_z$, we have $\rho_{w,\infty}(\Pi\circ f_0, \Pi\circ f_1) \le 2\beta$, so by the first part of Lemma 4 we have $\rho_{z,2}(g_0, g_1) \le 2^{1+\theta}\lambda r^\theta\beta$. Now construct $(\epsilon_k)_{k=0}^\infty$ by $\epsilon_k = 2^{1+\theta}\lambda r^\theta\beta\, 2^{-k}$ and choose

$$K = \left\lfloor \log_2\!\left(\beta\,\min\!\left\{(2\mathcal R_{nq}(\Pi\circ\mathcal F))^{-1},\, 8\sqrt n\right\}\right)\right\rfloor.$$

Hence, $\sup_{g_0, g_1 \in L\circ\mathcal F|^r_z}\rho_{z,2}(g_0, g_1) \le \epsilon_0$ and $\beta 2^{-K-1} \le \max\{2\mathcal R_{nq}(\Pi\circ\mathcal F), (8\sqrt n)^{-1}\} < \beta 2^{-K}$. Furthermore, for $k \le K$, letting $\xi_k = \beta 2^{-k}$, we have $\epsilon_k = 2^{1+\theta}\lambda r^\theta\xi_k$ and $\xi_k > \max\{2\mathcal R_{nq}(\Pi\circ\mathcal F), (8\sqrt n)^{-1}\}$, so by eq. (1),

$$\log N(\epsilon_k, L\circ\mathcal F|^r_z, \rho_{z,2}) \le \frac{4nq\,(\mathcal R_{nq}(\Pi\circ\mathcal F))^2}{\xi_k^2}\log\!\left(\frac{2e\beta nq}{\xi_k}\right) \le \frac{\left(2^{1+\theta}\lambda r^\theta\mathcal R_{nq}(\Pi\circ\mathcal F)\right)^2 4nq}{\epsilon_k^2}\log\!\left(e\beta(nq)^{3/2}\right) \le \frac{\left(2^{1+\theta}\lambda r^\theta\mathcal R_{nq}(\Pi\circ\mathcal F)\right)^2 6nq}{\epsilon_k^2}\log(e\beta nq).$$

Note also that by construction $K \le 4\log(e\beta nq)$. By Theorem 4 and $\epsilon_{k-1} = 2\epsilon_k$ we deduce that

$$\hat{\mathcal R}_z(L\circ\mathcal F|^r_z) \le \sum_{k=1}^{K}(\epsilon_k + \epsilon_{k-1})\sqrt{\frac{\log N(\epsilon_k, L\circ\mathcal F|^r_z, \rho_{z,2})}{n}} + \epsilon_K \le 6K\cdot 2^{1+\theta}\lambda r^\theta\,\mathcal R_{nq}(\Pi\circ\mathcal F)\sqrt{6q\log(e\beta nq)} + \epsilon_K \le 2^8\sqrt q\,\lambda r^\theta\,\mathcal R_{nq}(\Pi\circ\mathcal F)\log^{3/2}(e\beta nq) + \epsilon_K \le \lambda r^\theta\, 2^9\left(\sqrt q\,\log^{3/2}(e\beta nq)\,\mathcal R_{nq}(\Pi\circ\mathcal F) + n^{-1/2}\right).$$

This completes the proof of the proposition. $\square$
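Proposition 1 controls the Rademacher complexity of the localised class $L\circ\mathcal F|^r_z$. The toy Monte Carlo sketch below (illustrative only; the finite class, loss and all names are my own construction, not the paper's) estimates this quantity for a small finite class of random linear multi-class predictors under the multinomial logistic loss, and shows how the estimate shrinks as the localisation radius $r$ decreases.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic_loss_matrix(scores, ys):
    """losses[m, i] = multinomial logistic loss of candidate m on example i.
    scores: (M, n, q) array of score vectors, ys: (n,) integer labels."""
    n = scores.shape[1]
    z = scores - scores[:, np.arange(n), ys][:, :, None]   # u_j - u_y
    mx = z.max(axis=2, keepdims=True)
    return mx[..., 0] + np.log(np.exp(z - mx).sum(axis=2))

def local_rademacher(losses, r, n_mc=2000):
    """Monte Carlo estimate of hat{R}_z(L o F|_z^r) for a finite class given its loss matrix."""
    keep = losses[losses.mean(axis=1) <= r]                # the localised subclass F|_z^r
    if keep.shape[0] == 0:
        return 0.0
    n = keep.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))        # Rademacher vectors
    return float(np.mean(np.max(sigma @ keep.T / n, axis=1)))

# Toy setup: a small finite class of random linear multi-class predictors.
n, d, q, M = 60, 5, 4, 200
X, Y = rng.normal(size=(n, d)), rng.integers(q, size=n)
W = 0.3 * rng.normal(size=(M, q, d))
losses = logistic_loss_matrix(np.einsum('mqd,nd->mnq', W, X), Y)
risks = losses.mean(axis=1)
for frac in [1.0, 0.5, 0.2, 0.05]:
    r = float(np.quantile(risks, frac))
    print(f"r = {r:.3f}  (keeps {int(np.sum(risks <= r))} candidates)  "
          f"estimated local Rademacher complexity = {local_rademacher(losses, r):.4f}")
```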
### 4.2. Proof of Theorem 1

To complete the proof of Theorem 1 we combine Proposition 1 with some results due to Bousquet (2002).

**Theorem 5.** Suppose we have a measurable space $\mathcal Z$ and a function class $\mathcal G \subseteq \mathcal M(\mathcal Z, [0, b])$. For each $z \in \mathcal Z^n$ and $g \in \mathcal G$ we let $\hat{\mathcal E}_z(g) = n^{-1}\sum_{i\in[n]} g(z_i)$. Suppose we have a function $\varphi_n : [0, \infty) \to [0, \infty)$ which is non-negative, non-decreasing, not identically zero, and such that $\varphi_n(r)/\sqrt r$ is non-increasing. Suppose further that for all $z \in \mathcal Z^n$ and $r > 0$, $\hat{\mathcal R}_z(\{g \in \mathcal G : \hat{\mathcal E}_z(g) \le r\}) \le \varphi_n(r)$. Let $\hat r_n$ be the largest solution of the equation $\varphi_n(r) = r$. Suppose that $Z$ is a random variable with distribution $P$, where $P$ is a distribution on $\mathcal Z$, let $\mathcal E(g) := \mathbb E_{Z\sim P}[g(Z)]$, and let $\mathcal D = \{Z_i\}_{i\in[n]} \in \mathcal Z^n$ be an i.i.d. sample, where each $Z_i \sim P$ is an independent copy of $Z$. For any $\delta \in (0, 1)$, the following holds with probability at least $1 - \delta$, for all $g \in \mathcal G$:

$$\mathcal E(g) \le \hat{\mathcal E}_{\mathcal D}(g) + 90(\hat r_n + r_0) + 4\sqrt{\hat{\mathcal E}_{\mathcal D}(g)\,(\hat r_n + r_0)},$$

where $r_0 = b\,(\log(1/\delta) + 6\log\log n)/n$.

*Proof.* The following result is given in the penultimate line of the proof of Theorem 6.1 of Bousquet (2002): with probability at least $1 - \delta$, for all $g \in \mathcal G$,

$$\mathcal E(g) \le \hat{\mathcal E}_{\mathcal D}(g) + 45\hat r_n + \sqrt{8\hat r_n\,\mathcal E(g)} + \sqrt{4 r_0\,\mathcal E(g)} + 20 r_0.$$

So,

$$\mathcal E(g) \le \hat{\mathcal E}_{\mathcal D}(g) + 45\hat r_n + 20 r_0 + 4\sqrt{(\hat r_n + r_0)\,\mathcal E(g)}.$$

We also need the following inequality (Lemma 5.11, Bousquet (2002)): suppose that $t, B, C > 0$ satisfy $t \le B\sqrt t + C$; then $t \le B^2 + C + B\sqrt C$. Applying this with $B = 4\sqrt{\hat r_n + r_0}$ and $C = \hat{\mathcal E}_{\mathcal D}(g) + 45\hat r_n + 20 r_0$ we have

$$\mathcal E(g) \le 16(\hat r_n + r_0) + \left(\hat{\mathcal E}_{\mathcal D}(g) + 45\hat r_n + 20 r_0\right) + 4\sqrt{(\hat r_n + r_0)\left(\hat{\mathcal E}_{\mathcal D}(g) + 45\hat r_n + 20 r_0\right)} \le \hat{\mathcal E}_{\mathcal D}(g) + 90(\hat r_n + r_0) + 4\sqrt{\hat{\mathcal E}_{\mathcal D}(g)(\hat r_n + r_0)},$$

which completes the proof. $\square$

Theorem 5 is a uniform upper bound in terms of the empirical risk. We can deduce a performance bound on the empirical risk minimiser by combining it with Bernstein's inequality (see Theorem 2.10 of Boucheron et al. (2013)).

**Theorem 6 (Bernstein, 1924).** Let $W_1, \ldots, W_n \in [0, b]$ be bounded independent random variables with mean $\mu = \mathbb E[W_i]$. Then with probability at least $1 - \delta$ we have

$$\frac{1}{n}\sum_{i\in[n]} W_i \le \mu + \sqrt{\frac{2\mu b\log(1/\delta)}{n}} + \frac{b\log(1/\delta)}{3n} \le 2\mu + \frac{3b\log(1/\delta)}{n}.$$

**Corollary 1.** Suppose that the assumptions of Theorem 5 hold and choose $g^* \in \operatorname{argmin}_{g\in\mathcal G}\{\mathcal E(g)\}$. Given $z \in \mathcal Z^n$ we choose $\hat g_z \in \operatorname{argmin}_{g\in\mathcal G}\{\hat{\mathcal E}_z(g)\}$. For any $\delta \in (0, 1)$, the following holds with probability at least $1 - 2\delta$,

$$\mathcal E(\hat g_{\mathcal D}) \le \mathcal E(g^*) + 9\sqrt{\mathcal E(g^*)\,(\hat r_n + r_0)} + 100\,(\hat r_n + r_0).$$

We can now complete the proof of Theorem 1.

*Proof of Theorem 1.* First let $\mathcal G = L\circ\mathcal F = \{(x, y) \mapsto L(f(x), y) : f \in \mathcal F\}$. Note that for $g = L\circ f$ with $f \in \mathcal F$ and $Z = (X, Y) \sim P$ we have $\mathcal E(g) = \mathcal E_L(f, P)$, and given $z \in (\mathcal X\times\mathcal Y)^n$ we have $\hat{\mathcal E}_z(g) = \hat{\mathcal E}_L(f)$. Note also that under this correspondence $L\circ\mathcal F|^r_z = \{g \in \mathcal G : \hat{\mathcal E}_z(g) \le r\}$. Now define $\varphi_n : [0, \infty) \to [0, \infty)$ by

$$\varphi_n(r) = \lambda r^\theta\, 2^9\left(\sqrt q\,\log^{3/2}(e\beta nq)\,\mathcal R_{nq}(\Pi\circ\mathcal F) + n^{-1/2}\right).$$

By Proposition 1, for each $z \in (\mathcal X\times\mathcal Y)^n$, $\hat{\mathcal R}_z(\{g \in \mathcal G : \hat{\mathcal E}_z(g) \le r\}) = \hat{\mathcal R}_z(L\circ\mathcal F|^r_z) \le \varphi_n(r)$. Observe that $\varphi_n$ is non-negative, non-decreasing and $\varphi_n(r)/\sqrt r$ is non-increasing, since $\theta \in [0, 1/2]$. So it remains to solve the fixed point equation $\varphi_n(r) = r$, which gives

$$\hat r_n := \left(\lambda\, 2^9\left(\sqrt q\,\log^{3/2}(e\beta nq)\,\mathcal R_{nq}(\Pi\circ\mathcal F) + n^{-1/2}\right)\right)^{\frac{1}{1-\theta}}.$$

Hence, the two bounds in Theorem 1 follow from Theorem 5 and Corollary 1, respectively. $\square$
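The fixed point $\hat r_n$ has a closed form because $\varphi_n(r) = C r^\theta$ for a constant $C$ not depending on $r$. The minimal sketch below (illustrative only) checks the closed form $\hat r_n = C^{1/(1-\theta)}$ against a direct fixed-point iteration of $r \mapsto C r^\theta$.

```python
import numpy as np

def fixed_point_closed_form(C, theta):
    """Largest solution of C * r**theta = r, namely r_hat = C**(1/(1-theta)) for theta in [0, 1/2].
    In the proof of Theorem 1, C = lambda * 2**9 * (sqrt(q) * log^{3/2}(e*beta*n*q) * R + 1/sqrt(n))."""
    return C ** (1.0 / (1.0 - theta))

def fixed_point_numeric(C, theta, iters=200):
    """Solve r = C * r**theta by fixed-point iteration (a contraction in log-space for theta < 1)."""
    r = 1.0
    for _ in range(iters):
        r = C * r ** theta
    return r

for C, theta in [(0.05, 0.5), (0.2, 0.25), (0.5, 0.0)]:
    closed, numeric = fixed_point_closed_form(C, theta), fixed_point_numeric(C, theta)
    print(f"C={C}, theta={theta}:  closed form {closed:.6f}   iteration {numeric:.6f}")
```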
## 5. An application to ensembles

In this section we highlight applications of our general multi-output learning framework. Specifically, here we consider ensembles of decision trees (Schapire & Freund, 2013), as they represent an effective and widely used tool in practice (Chen & Guestrin, 2016). Throughout this section we shall assume that $\mathcal X = \mathbb R^d$.

We consider sets of decision tree functions $\mathcal H^1_{p,\tau} \subseteq \mathcal M(\mathcal X, [-1, 1]^q)$ constructed as follows. Let $\mathcal T_{p,d}$ be the set of decision trees $t : \mathbb R^d \to [p]$ with $p$ leaves, where each internal node performs a binary split along a single feature. Let $\mathcal H_{p,\tau} \subseteq \mathcal M(\mathcal X, \mathbb R^q)$ be the set of all functions of the form $h(x) = (w_{t(x),j})_{j\in[q]}$, where $t \in \mathcal T_{p,d}$ is a decision tree and $w = (w_{l,j})_{(l,j)\in[p]\times[q]} \in \mathbb R^{pq}$ satisfies the $\ell_1$ constraint $\|w_l\|_1 = \sum_{j\in[q]} |w_{lj}| \le \tau$ for each leaf $l \in [p]$. Finally, let $\mathcal H^1_{p,\tau} := \mathcal H_{p,\tau} \cap \mathcal M(\mathcal X, [-1, 1]^q)$. We now give a bound for convex combinations of such decision trees.

**Theorem 7.** Suppose we have $\beta, b \ge 1$, $\lambda > 0$, $\theta \in [0, 1/2]$ and a $(\lambda, \theta)$-self-bounding Lipschitz loss function $L : \mathcal V \times \mathcal Y \to [0, b]$. Given $\delta \in (0, 1)$, $n \in \mathbb N$, we define for each $\alpha = (\alpha_t)_{t\in[T]}$, $\tau = (\tau_t)_{t\in[T]} \in (0, \infty)^T$,

$$C_{n,\delta}(\alpha, \tau) := \left(\frac{\lambda\sqrt p\,\log^2(3npqd\beta)}{\sqrt n}\left(\sum_{t\in[T]}\alpha_t\tau_t + 1\right)\right)^{\frac{1}{1-\theta}} + \frac{\log(1/\delta) + \log(\log n)}{n}.$$

There exists a numerical constant $C_0$ such that given an i.i.d. sample $\mathcal D$ the following holds with probability at least $1 - \delta$, for all ensembles $f = \sum_{t\in[T]}\alpha_t h_t$ with $\sum_{t\in[T]}\alpha_t \le \beta$ and $h_t \in \mathcal H^1_{p,\tau_t}$,

$$\mathcal E_L(f) \le \hat{\mathcal E}_L(f) + C_0\left(\sqrt{\hat{\mathcal E}_L(f)\,C_{n,\delta}(\alpha, \tau)} + C_{n,\delta}(\alpha, \tau)\right).$$

Before giving the proof, we highlight several interesting features of this result:

- First and foremost, Theorem 7 gives guarantees for ensembles of decision trees with respect to a wide variety of losses, including the multinomial logistic loss for multi-class classification and the one-versus-all loss for multi-label classification, as well as implying margin-based guarantees (see Section 2.2).
- Theorem 7 has a favourable dependency upon the number of examples whenever $\hat{\mathcal E}_L(f)$ is sufficiently small, as is often the case for large ensembles of decision trees. For example, if we are using the multinomial logistic loss and $\hat{\mathcal E}_L(f) \approx 0$, then Theorem 7 gives rise to a fast rate of $O(n^{-1})$.
- Theorem 7 has only logarithmic dependency upon the dimensionality of the output space $q$. This contrasts starkly with previous guarantees for multi-class learning with ensembles of decision trees (Kuznetsov et al., 2014; 2015), which are linear with respect to the number of classes $q$.

### 5.1. Proof of Theorem 7

The proof of Theorem 7 is a consequence of Theorem 1 combined with the following lemma.

**Lemma 5.** For all $m \in \mathbb N$ and $z \in (\mathcal X \times [q])^m$ we have

$$\hat{\mathcal R}_z(\Pi\circ\mathcal H_{p,\tau}) \le 2\tau\sqrt{\frac{p\log(2\max\{p\,d\,m,\, q\})}{m}}.$$

We begin by counting the number of possible partitions that can be made by a decision tree in $\mathcal T_{p,d}$ on a given sequence of points. Given a sequence $x = (x_i)_{i\in[m]} \in \mathcal X^m$ we let $\mathcal T_{p,d}(x) := \{(t(x_i))_{i\in[m]} : t \in \mathcal T_{p,d}\} \subseteq [p]^m$.

**Lemma 6.** For all $m \in \mathbb N$ and $x \in \mathcal X^m$, we have $|\mathcal T_{p,d}(x)| \le (p-1)!\,(d\,(m+1))^{p-1}$.

*Proof.* By induction, it suffices to show that $|\mathcal T_{p+1,d}(x)| \le |\mathcal T_{p,d}(x)|\, p\, d\,(m+1)$. Now observe that each element of $\mathcal T_{p+1,d}(x)$ may be constructed by taking an element of $\mathcal T_{p,d}(x)$ and then making a choice of one of $p$ existing leaf nodes to partition, one of $d$ dimensions to split upon, and one of at most $m+1$ possible split points. $\square$

We complete the proof of Lemma 5 as follows.

*Proof of Lemma 5.* For a given $\tau > 0$ we let $\Lambda_\tau := \{(a_j)_{j\in[q]} : \sum_{j\in[q]} |a_j| \le \tau\}$. Let $\{e^{(j)}\}_{j\in[q]} \subseteq \mathbb R^q$ be the canonical orthonormal basis. Let $\Lambda^{\mathrm{ex}}_1 \subseteq \Lambda_1$ denote the subset of extreme points of $\Lambda_1$, so $\Lambda^{\mathrm{ex}}_1 = \{u\, e^{(j)} : u \in \{-1, +1\} \text{ and } j \in [q]\}$. Note that $|\Lambda^{\mathrm{ex}}_1| = 2q$. Fix $z = (z_i)_{i\in[m]} \in (\mathcal X \times [q])^m$ with each $z_i = (x_i, j_i)$ and let $x = (x_i)_{i\in[m]} \in \mathcal X^m$. Then

$$\hat{\mathcal R}_z(\Pi\circ\mathcal H_{p,\tau}) = \mathbb E_\sigma\left[\sup_{h\in\mathcal H_{p,\tau}}\frac1m\sum_{i\in[m]}\sigma_i\,(\Pi\circ h)(x_i, j_i)\right] = \mathbb E_\sigma\left[\sup_{h\in\mathcal H_{p,\tau}}\frac1m\sum_{i\in[m]}\sigma_i\,\pi_{j_i}(h(x_i))\right] = \mathbb E_\sigma\left[\sup_{t\in\mathcal T_{p,d}}\sup_{w\in(\Lambda_\tau)^p}\frac1m\sum_{i\in[m]}\sigma_i\,w_{t(x_i),j_i}\right] = \mathbb E_\sigma\left[\sup_{(l_i)_{i\in[m]}\in\mathcal T_{p,d}(x)}\frac1m\sup_{w\in(\Lambda_\tau)^p}\sum_{i\in[m]}\sigma_i\,w_{l_i,j_i}\right].$$

Observe that for $w \in (\Lambda_\tau)^p$ and $(l_i) \in [p]^m$,

$$\sum_{i\in[m]}\sigma_i w_{l_i,j_i} = \sum_{r\in[p]}\sum_{s\in[q]} w_{r,s}\sum_{i: l_i=r,\, j_i=s}\sigma_i \le \sum_{r\in[p]}\sum_{s\in[q]}|w_{r,s}|\left|\sum_{i: l_i=r,\, j_i=s}\sigma_i\right| \le \tau\sum_{r\in[p]}\max_{s\in[q]}\left|\sum_{i: l_i=r,\, j_i=s}\sigma_i\right| \le \tau\sum_{r\in[p]}\sup_{(u_{r,s})_{s\in[q]}\in\Lambda^{\mathrm{ex}}_1}\sum_{s\in[q]} u_{r,s}\sum_{i: l_i=r,\, j_i=s}\sigma_i = \tau\sup_{(u_{r,s})\in(\Lambda^{\mathrm{ex}}_1)^p}\sum_{i\in[m]}\sigma_i\, u_{l_i,j_i}.$$

Plugging this bound into the above yields

$$\hat{\mathcal R}_z(\Pi\circ\mathcal H_{p,\tau}) \le \frac{\tau}{m}\,\mathbb E_\sigma\left[\sup_{(l_i)_{i\in[m]}\in\mathcal T_{p,d}(x)}\;\sup_{(u_{r,s})\in(\Lambda^{\mathrm{ex}}_1)^p}\sum_{i\in[m]}\sigma_i\, u_{l_i,j_i}\right] \le \frac{\tau}{m}\sup_{(l_i)_{i\in[m]}\in\mathcal T_{p,d}(x),\,(u_{r,s})\in(\Lambda^{\mathrm{ex}}_1)^p}\sqrt{\sum_{i\in[m]} u_{l_i,j_i}^2}\;\sqrt{2\log\left(|\mathcal T_{p,d}(x)|\,|\Lambda^{\mathrm{ex}}_1|^p\right)} \tag{2}$$

$$\le \frac{\tau}{\sqrt m}\sqrt{2\left((p-1)\log(p\,d\,m) + p\log(2q)\right)} \le 2\tau\sqrt{\frac{p\log(2\max\{p\,d\,m,\, q\})}{m}},$$

where (2) follows from Massart's lemma (e.g. Theorem 3.3 of Mohri et al. (2012)) and the penultimate inequality follows from Lemma 6. $\square$
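For concreteness, the toy sketch below (illustrative only; not the paper's experimental code) instantiates the class $\mathcal H_{p,\tau}$: an axis-aligned tree $t : \mathbb R^d \to [p]$ routes each input to a leaf, a leaf-weight matrix $w \in \mathbb R^{p\times q}$ with rows satisfying $\|w_l\|_1 \le \tau$ supplies the $q$-dimensional output, and a convex combination $f = \sum_t \alpha_t h_t$ forms the ensemble covered by Theorem 7. With $\tau \le 1$ the outputs lie in $[-1, 1]^q$, so each $h_t \in \mathcal H^1_{p,\tau}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_tree(d, p, lo=-3.0, hi=3.0):
    """A random axis-aligned decision tree t : R^d -> {0, ..., p-1} with p leaves,
    built by recursively splitting on a random feature at a random threshold."""
    counter = iter(range(p))

    def build(num_leaves):
        if num_leaves == 1:
            return ('leaf', next(counter))
        j = int(rng.integers(d))                     # feature to split on
        s = float(rng.uniform(lo, hi))               # split point
        left = int(rng.integers(1, num_leaves))      # leaves assigned to the left subtree
        return ('split', j, s, build(left), build(num_leaves - left))

    root = build(p)

    def apply(node, x):
        if node[0] == 'leaf':
            return node[1]
        _, j, s, left, right = node
        return apply(left, x) if x[j] <= s else apply(right, x)

    return lambda x: apply(root, x)

def tree_predictor(tree, w):
    """h(x) = (w[t(x), j])_{j in [q]}: the leaf reached by x selects a row of the weight matrix."""
    return lambda x: w[tree(x)]

d, q, p, tau, T = 4, 3, 5, 1.0, 10
trees, weights = [], []
for _ in range(T):
    trees.append(random_tree(d, p))
    w = rng.normal(size=(p, q))
    w *= tau / np.maximum(np.abs(w).sum(axis=1, keepdims=True), tau)   # enforce ||w_l||_1 <= tau
    weights.append(w)
alpha = rng.dirichlet(np.ones(T))          # convex combination, sum_t alpha_t = 1 <= beta
ensemble = lambda x: sum(a * tree_predictor(t, w)(x) for a, t, w in zip(alpha, trees, weights))
print(ensemble(rng.normal(size=d)))        # a q-dimensional prediction
```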
We can now deduce Theorem 7 from Theorem 1 with the help of a re-weighting argument along with the convexity property of Rademacher complexities.

*Proof of Theorem 7.* Take $\zeta > 0$ and let

$$\mathcal F := \left\{\sum_{t\in[T]}\alpha_t h_t \,:\, h_t \in \mathcal H^1_{p,\tau_t},\ \alpha_t \ge 0,\ \sum_{t\in[T]}\alpha_t\tau_t \le \zeta\ \text{ and }\ \sum_{t\in[T]}\alpha_t \le \beta\right\}.$$

Observe that $\mathcal F \subseteq \operatorname{conv}(\mathcal H_{p,\zeta})$. Indeed, given $f = \sum_{t\in[T]}\alpha_t h_t$ with $h_t \in \mathcal H^1_{p,\tau_t}$ and $\sum_{t\in[T]}\alpha_t\tau_t \le \zeta$, we can rewrite $f = \sum_{t\in[T]}(\alpha_t\tau_t\zeta^{-1})\,(\zeta\tau_t^{-1} h_t)$, with $\sum_{t\in[T]}(\alpha_t\tau_t\zeta^{-1}) \le 1$ and, for each $t \in [T]$, $\zeta\tau_t^{-1} h_t \in \mathcal H_{p,\zeta}$. Thus, $\Pi\circ\mathcal F \subseteq \Pi\circ\operatorname{conv}(\mathcal H_{p,\zeta}) = \operatorname{conv}(\Pi\circ\mathcal H_{p,\zeta})$. Hence, by the convexity property of Rademacher complexities (Boucheron et al., 2005, Theorem 3.3, eq. (5)), combined with Lemma 5, for any $z \in (\mathcal X\times[q])^{nq}$ we have

$$\hat{\mathcal R}_z(\Pi\circ\mathcal F) \le \hat{\mathcal R}_z(\operatorname{conv}(\Pi\circ\mathcal H_{p,\zeta})) = \hat{\mathcal R}_z(\Pi\circ\mathcal H_{p,\zeta}) \le 2\zeta\sqrt{\frac{p\log(2\max\{pd(nq),\, q\})}{nq}} \le 2\zeta\sqrt{\frac{p\log(2pdnq)}{nq}}.$$

Taking a supremum over all $z \in (\mathcal X\times[q])^{nq}$ we have $\mathcal R_{nq}(\Pi\circ\mathcal F) \le 2\zeta\sqrt{p\log(2pdnq)\,(nq)^{-1}}$. Note also that $\mathcal F \subseteq \mathcal M(\mathcal X, [-\beta, \beta]^q)$, since each $f \in \mathcal F$ is of the form $f = \sum_{t\in[T]}\alpha_t h_t$ with $h_t \in \mathcal H^1_{p,\tau_t} \subseteq \mathcal M(\mathcal X, [-1, +1]^q)$ and $\sum_{t\in[T]}\alpha_t \le \beta$. Thus, plugging the bound on $\mathcal R_{nq}(\Pi\circ\mathcal F)$ into Theorem 1 yields the bound in Theorem 7. $\square$

## 6. Conclusions

We presented a theoretical analysis of multi-output learning, based on a self-bounding Lipschitz condition. Under this condition, we obtained favourable dependence on both the sample size and the output dimension. The main analytic tool is a new contraction inequality for the local Rademacher complexity of vector-valued function classes with a self-bounding Lipschitz loss, which may be of independent interest. Theorem 1 can be applied to any multi-output prediction problem where one can obtain an upper bound on the Rademacher complexity $\mathcal R_{nq}(\Pi\circ\mathcal F)$. We demonstrated this by applying our approach to ensembles of decision trees, yielding state-of-the-art results.

## Acknowledgement

This work is funded by EPSRC (grant EP/P004245/1) and partially by the Turing Institute (grant EP/N510129/1).

## References

Agrawal, R., Gupta, A., Prabhu, Y., and Varma, M. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd International Conference on World Wide Web, pp. 13–24, 2013.

Babbar, R. and Schölkopf, B. DiSMEC: Distributed sparse machines for extreme multi-label classification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 721–729, 2017.

Bartlett, P. L., Bousquet, O., and Mendelson, S. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.

Bernstein, S. On a modification of Chebyshev's inequality and of the error formula of Laplace. Ann. Sci. Inst. Sav. Ukraine, Sect. Math, 1(4):38–49, 1924.

Bhatia, K., Jain, H., Kar, P., Varma, M., and Jain, P. Sparse local embeddings for extreme multi-label classification. In Advances in Neural Information Processing Systems, pp. 730–738, 2015.

Borchani, H., Varando, G., Bielza, C., and Larrañaga, P. A survey on multi-output regression. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(5):216–233, 2015.

Boucheron, S., Bousquet, O., and Lugosi, G. Theory of classification: A survey of some recent advances, 2005.

Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Bousquet, O. Concentration inequalities and empirical processes theory applied to the analysis of learning algorithms. PhD thesis, 2002.
Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, 2016.

Chzhen, E. Classification of sparse binary vectors. arXiv preprint arXiv:1903.11867, 2019.

Chzhen, E., Denis, C., Hebiri, M., and Salmon, J. On the benefits of output sparsity for multi-label classification. arXiv preprint arXiv:1703.04697, 2017.

Cortes, C., Kuznetsov, V., Mohri, M., and Yang, S. Structured prediction theory based on factor graph complexity. In International Conference on Machine Learning, pp. 2522–2530, 2016.

Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(Dec):265–292, 2001.

Dudley, R. M. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.

Geng, X. Label distribution learning. IEEE Transactions on Knowledge and Data Engineering, 28(7):1734–1748, 2016.

Guermeur, Y. Lp-norm Sauer–Shelah lemma for margin multi-category classifiers. Journal of Computer and System Sciences, 89:450–473, 2017.

Jain, H., Balasubramanian, V., Chunduri, B., and Varma, M. Slice: Scalable linear extreme classifiers trained on 100 million labels for related searches. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 528–536, 2019.

Koltchinskii, V. et al. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.

Kuznetsov, V., Mohri, M., and Syed, U. Multi-class deep boosting. In Advances in Neural Information Processing Systems, pp. 2501–2509, 2014.

Kuznetsov, V., Mohri, M., and Syed, U. Rademacher complexity margin bounds for learning with a large number of classes. In ICML Workshop on Extreme Classification: Learning with a Very Large Number of Labels, 2015.

Ledoux, M. and Talagrand, M. Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.

Lei, Y., Dogan, U., Binder, A., and Kloft, M. Multi-class SVMs: From tighter data-dependent generalization bounds to novel algorithms. In Advances in Neural Information Processing Systems, pp. 2035–2043, 2015.

Lei, Y., Ding, L., and Bi, Y. Local Rademacher complexity bounds based on covering numbers. Neurocomputing, 218:320–330, 2016.

Lei, Y., Dogan, U., Zhou, D.-X., and Kloft, M. Data-dependent generalization bounds for multi-class classification. IEEE Transactions on Information Theory, 65(5):2995–3021, 2019.

Li, J., Liu, Y., Yin, R., Zhang, H., Ding, L., and Wang, W. Multi-class learning: From theory to algorithm. In Advances in Neural Information Processing Systems 31, pp. 1586–1595. Curran Associates, Inc., 2018.

Li, J., Liu, Y., Yin, R., and Wang, W. Multi-class learning using unlabeled samples: Theory and algorithm. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 2880–2886. AAAI Press, 2019.

Liu, Y., Li, J., Ding, L., Liu, X., and Wang, W. Learning vector-valued functions with local Rademacher complexity and unlabeled data, 2019.

Mammen, E. and Tsybakov, A. B. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999. doi: 10.1214/aos/1017939240.

Maurer, A. A vector-contraction inequality for Rademacher complexities. In International Conference on Algorithmic Learning Theory, pp. 3–17. Springer, 2016.
Menon, A. K., Rawat, A. S., Reddi, S., and Kumar, S. Multilabel reductions: What is my loss optimising? In Advances in Neural Information Processing Systems, pp. 10599–10610, 2019.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, 2012.

Musayeva, K., Lauer, F., and Guermeur, Y. Rademacher complexity and generalization performance of multi-category margin classifiers. Neurocomputing, pp. 6–15, 2019.

Reddi, S. J., Kale, S., Yu, F., Holtmann-Rice, D., Chen, J., and Kumar, S. Stochastic negative mining for learning with large output spaces. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1940–1949, 2019.

Reeve, H. W. and Kabán, A. Optimistic bounds for multi-output prediction, 2020.

Schapire, R. E. and Freund, Y. Boosting: Foundations and Algorithms. Kybernetes, 2013.

Srebro, N., Sridharan, K., and Tewari, A. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, pp. 2199–2207, 2010.

Talagrand, M. Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems, volume 60. Springer Science & Business Media, 2014.

Tsoumakas, G. and Katakis, I. Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM), 3(3):1–13, 2007.

Xu, C., Liu, T., Tao, D., and Xu, C. Local Rademacher complexity for multi-label learning. IEEE Transactions on Image Processing, 25(3):1495–1507, 2016.

Xu, D., Shi, Y., Tsang, I. W., Ong, Y.-S., Gong, C., and Shen, X. Survey on multi-output learning. IEEE Transactions on Neural Networks and Learning Systems, 2019.

Zhang, M.-L. and Zhou, Z.-H. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819–1837, 2013.