# Optimistic Bounds for Multi-output Prediction

Henry W. J. Reeve¹, Ata Kabán¹

¹ School of Computer Science, University of Birmingham. Correspondence to: Henry W. J. Reeve.

*Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).*

We investigate the challenge of multi-output learning, where the goal is to learn a vector-valued function based on a supervised data set. This includes a range of important problems in Machine Learning including multi-target regression, multi-class classification and multi-label classification. We begin our analysis by introducing the self-bounding Lipschitz condition for multi-output loss functions, which interpolates continuously between a classical Lipschitz condition and a multi-dimensional analogue of a smoothness condition. We then show that the self-bounding Lipschitz condition gives rise to optimistic bounds for multi-output learning, which attain the minimax optimal rate up to logarithmic factors. The proof exploits local Rademacher complexity combined with a powerful minoration inequality due to Srebro, Sridharan and Tewari. As an application we derive a state-of-the-art generalisation bound for multi-class gradient boosting.

## 1. Introduction

Multi-output prediction represents an important class of problems that includes multi-class classification (Crammer & Singer, 2001), multi-label learning (Tsoumakas & Katakis, 2007; Zhang & Zhou, 2013), multi-target regression (Borchani et al., 2015), label distribution learning (Geng, 2016), structured regression (Cortes et al., 2016) and others, with a wide range of practical applications (Xu et al., 2019). Our objective is to provide a general framework for establishing guarantees for multi-output prediction problems.

A fundamental challenge in the statistical learning theory of multi-output prediction is to obtain bounds that allow for (i) a favourable convergence rate with the sample size, and (ii) a favourable dependence of the risk on the dimensionality of the output space. Whilst modern applications of multi-output prediction deal with increasingly large data sets, they also include problems where the target dimensionality is increasingly large. For example, the number of categories in multi-label learning is often of the order of tens of thousands, an emergent problem referred to as extreme classification (Agrawal et al., 2013; Babbar & Schölkopf, 2017; Bhatia et al., 2015; Jain et al., 2019).

Formally, the task of multi-output prediction is to learn a vector-valued function from a labelled training set. A common tool in the theoretical analysis of this problem has been a vector-valued extension of Talagrand's contraction inequality for Lipschitz losses (Ledoux & Talagrand, 2013). Both Maurer (2016) and Cortes et al. (2016) established vector-contraction inequalities for Rademacher complexity that gave rise to learning guarantees for multi-output prediction problems with a linear dependence on the dimensionality of the output space. More recently, Lei et al. (2019) provided more refined vector-contraction inequalities for both Gaussian and Rademacher complexity. This approach leads to a highly favourable sub-linear dependence on the output dimensionality, which can even be logarithmic, depending on the degree of regularisation. These structural results lead to a slow convergence rate of order $O(n^{-1/2})$.
Guermeur (2017) and Musayeva et al. (2019) explore an alternative approach based on covering numbers. Chzhen et al. (2017) derived a bound for multi-label classification based on Rademacher complexities. Each of these bounds gives rise to a favourable dependence on the dimensionality of the output space, but with a slow rate of order $O(n^{-1/2})$.

Local Rademacher complexities provide a crucial tool in establishing faster rates of convergence (Bousquet, 2002; Bartlett et al., 2005; Koltchinskii et al., 2006; Lei et al., 2016). By leveraging local Rademacher complexities, Liu et al. (2019) have derived guarantees for multi-class learning with function classes that are linear in an RKHS, building on their previous margin-based guarantees (Lei et al., 2015; Li et al., 2019). This gives rise to fast rates under suitable spectral conditions. Fast rates of convergence have also been derived by Xu et al. (2016) for multi-label classification with linear function spaces. On the other hand, Chzhen (2019) derived fast rates of convergence by exploiting an analogue of the margin assumption.

In this paper we establish generalisation bounds for multi-output prediction which yield fast rates whenever the empirical error is small. We address this problem by generalising to vector-valued functions a smoothness-based approach due to Srebro et al. (2010). A key advantage of our approach is that it allows us to accommodate a wide variety of multi-output loss functions, and hypothesis classes, making our analytic framework applicable to a variety of learning tasks. Below we summarise the contributions of this paper:

- We give a contraction inequality for the local Rademacher complexity of vector-valued functions (Proposition 1). The main ingredient is a self-bounding Lipschitz condition for multi-output loss functions that holds for several widely used examples.
- We leverage our localised contraction inequality to give a general upper bound for multi-output learning (Theorem 1), which exhibits fast rates whenever the empirical error is small.
- We demonstrate a concrete use of our general result by deriving from it a state-of-the-art bound for ensembles of multi-output decision trees (Theorem 7).

Furthermore, the obtained rates for multi-output learning are minimax optimal up to logarithmic factors. The corresponding lower bounds can be found in the full version (Reeve & Kabán, 2020).

### 1.1. Problem Setting

We shall consider multi-output prediction problems in supervised learning. Suppose we have a measurable space $\mathcal X$, a label space $\mathcal Y$ and an output space $\mathcal V$. We shall assume that there is an unknown probability distribution $P$ over random variables $(X, Y)$, taking values in $\mathcal X \times \mathcal Y$. The performance is quantified through a loss function $L : \mathcal V \times \mathcal Y \to \mathbb R$. Let $\mathcal M(\mathcal X, \mathcal V)$ denote the set of measurable functions $f : \mathcal X \to \mathcal V$. The goal of the learner is to obtain $f \in \mathcal M(\mathcal X, \mathcal V)$ such that the corresponding risk

$$\mathcal E_L(f, P) := \mathbb E_{(X,Y)\sim P}\left[L(f(X), Y)\right]$$

is as low as possible. The learner selects $f \in \mathcal M(\mathcal X, \mathcal V)$ based upon a sample $\mathcal D := \{(X_i, Y_i)\}_{i\in[n]}$, where the $(X_i, Y_i)$ are independent copies of $(X, Y)$. We let $\hat{\mathcal E}_L(f, \mathcal D) := n^{-1}\sum_{i\in[n]} L(f(X_i), Y_i)$ denote the empirical risk. When the distribution $P$ and the sample $\mathcal D$ are clear from context we shall write $\mathcal E_L(f)$ in place of $\mathcal E_L(f, P)$ and $\hat{\mathcal E}_L(f)$ in place of $\hat{\mathcal E}_L(f, \mathcal D)$.

We consider multi-output prediction problems in which $\mathcal V \subseteq \mathbb R^q$. We let $\|\cdot\|_\infty$ denote the max norm on $\mathbb R^q$ and for a positive integer $m \in \mathbb N$ we let $[m] := \{1, \ldots, m\}$.
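To make the notation concrete, the following minimal Python sketch (illustrative only, not part of the paper) computes the empirical risk $\hat{\mathcal E}_L(f, \mathcal D)$ of a vector-valued predictor under a simple max-norm loss; the predictor, the loss and all names here are hypothetical placeholders.

```python
import numpy as np

def max_norm_loss(u, y):
    """An illustrative multi-output loss L(u, y) = ||u - y||_inf on V = Y = R^q."""
    return float(np.max(np.abs(u - y)))

def empirical_risk(f, data, loss):
    """hat{E}_L(f, D) = (1/n) * sum_i L(f(X_i), Y_i)."""
    return float(np.mean([loss(f(x), y) for x, y in data]))

# Toy usage with a linear vector-valued predictor f : R^d -> R^q.
rng = np.random.default_rng(0)
d, q, n = 5, 3, 200
W = rng.normal(size=(q, d)) / np.sqrt(d)
f = lambda x: W @ x
X = rng.normal(size=(n, d))
Y = X @ W.T + 0.1 * rng.normal(size=(n, q))   # noisy targets around f(X)
print(empirical_risk(f, list(zip(X, Y)), max_norm_loss))
```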
## 2. The Self-bounding Lipschitz Condition

We introduce the following self-bounding Lipschitz condition for multi-output loss functions.

**Definition 1 (Self-bounding Lipschitz condition).** A loss function $L : \mathcal V \times \mathcal Y \to \mathbb R$ is said to be $(\lambda, \theta)$-self-bounding Lipschitz for $\lambda, \theta \ge 0$ if for all $y \in \mathcal Y$ and $u, v \in \mathcal V$,

$$|L(u, y) - L(v, y)| \le \lambda\, \max\{L(u, y), L(v, y)\}^{\theta}\, \|u - v\|_\infty.$$

This condition interpolates continuously between a classical Lipschitz condition (when $\theta = 0$) and a multi-dimensional analogue of a smoothness condition (when $\theta = 1/2$), and will be the main assumption that we use to obtain our results.

Our motivation for introducing Definition 1 is as follows. Firstly, in recent work of Lei et al. (2019) the classical Lipschitz condition with respect to the $\ell_\infty$ norm has been utilised to derive multi-class bounds with a favourable dependence upon the number of classes $q$. The role of the $\ell_\infty$ norm is crucial since it prevents the deviations in the loss function from accumulating as the output dimension $q$ grows. Our goal is to give a general framework which simultaneously achieves a favourable dependence upon $n$.

Secondly, Srebro et al. (2010) introduced a second-order smoothness condition on the loss function. This condition corresponds to the special case whereby $q = 1$ and $\theta = 1/2$. Srebro et al. (2010) showed that this smoothness condition gives rise to an optimistic bound having a fast rate $O(n^{-1})$ in the realisable case. The self-bounding Lipschitz condition provides a multi-dimensional analogue of this condition when $\theta = 1/2$, intended to yield a favourable dependence on the number of samples $n$. The results established in Sections 3 and 5 show that this is indeed the case, while we also obtain favourable dependence on the number of classes $q$.

Finally, by considering the range of exponents $\theta \in [0, 1/2]$ we exhibit convergence rates ranging from slow $O(n^{-1/2})$ to fast $O(n^{-1})$ in the realisable case. This is reminiscent of the celebrated Tsybakov margin condition (Mammen & Tsybakov, 1999), which interpolates between slow and fast rates in the parametric classification setting. Crucially, however, whilst the Tsybakov margin condition (Mammen & Tsybakov, 1999) is a condition on the underlying distribution, which cannot be verified in practice, the self-bounding Lipschitz condition is a property of a loss function which may be verified analytically by the learner.

### 2.1. Verifying the Self-bounding Lipschitz Condition

We start by giving a collection of results which can be used to verify that a given loss function satisfies the self-bounding Lipschitz condition. The following lemma is proved in Supplementary Appendix A.

**Lemma 1.** Take any $\lambda > 0$, $\theta \in [0, 1/2]$. Suppose that $L : \mathcal V \times \mathcal Y \to [0, \infty)$ is a loss function such that for any $u \in \mathcal V$, $y \in \mathcal Y$, there exists a non-negative differentiable function $\phi_{u,y} : \mathbb R \to [0, \infty)$ satisfying

1. $\phi_{u,y}(0) = L(u, y)$;
2. for all $t > 0$, $\sup_{v : \|u - v\|_\infty \le t}\{L(v, y)\} \le \phi_{u,y}(t)$;
3. the derivative $\phi'_{u,y}(t)$ is non-negative on $[0, \infty)$;
4. for all $t_0, t_1 \in \mathbb R$, $|\phi'_{u,y}(t_1) - \phi'_{u,y}(t_0)| \le \left(\frac{\lambda}{2}\right)^{\frac{1}{1-\theta}} |t_1 - t_0|^{\frac{\theta}{1-\theta}}$.

Then $L : \mathcal V \times \mathcal Y \to [0, \infty)$ is $(\lambda, \theta)$-self-bounding Lipschitz.

The following lemma shows that clipping preserves this condition.

**Lemma 2.** Suppose that $L : \mathcal V \times \mathcal Y \to [0, \infty)$ is a $(\lambda, \theta)$-self-bounding Lipschitz loss function with $\lambda > 0$, $\theta \in [0, 1]$. Then for any $b > 0$ the clipped loss $\bar L : \mathcal V \times \mathcal Y \to [0, b]$ defined by $\bar L(u, y) = \min\{L(u, y), b\}$ is also $(\lambda, \theta)$-self-bounding Lipschitz.

Finally, we note the following monotonicity property, which follows straightforwardly from the definition.
**Lemma 3.** Suppose that $L : \mathcal V \times \mathcal Y \to [0, b]$ is a bounded $(\lambda, \theta)$-self-bounding Lipschitz loss function with $\lambda > 0$, $\theta \in [0, 1]$. Then given any $\tilde\theta \le \theta$, the loss $L$ is also $(\tilde\lambda, \tilde\theta)$-self-bounding Lipschitz with $\tilde\lambda = \lambda\, b^{\theta - \tilde\theta}$.

### 2.2. Examples

We now demonstrate several examples of multi-output loss functions that satisfy our self-bounding Lipschitz condition. In each of the examples below we show that the self-bounding Lipschitz condition is satisfied by applying our sufficient condition (Lemma 1). Detailed proofs are given in Supplementary Appendix A.

#### 2.2.1. Multi-class Losses

We begin with the canonical multi-output prediction problem of multi-class classification, in which $\mathcal Y = [q]$ and $\mathcal V = \mathbb R^q$. A popular loss function for the theoretical analysis of multi-class learning is the margin loss (Crammer & Singer, 2001). The smoothed analogue of the margin loss was introduced by Srebro et al. (2010) in the one-dimensional setting, and by Li et al. (2018) in the multi-class setting.

**Example 1 (Smooth margin losses).** Given $\mathcal Y = [q]$ we define the margin function $m : \mathcal V \times \mathcal Y \to \mathbb R$ by $m(u, y) := u_y - \max_{j\in[q]\setminus\{y\}}\{u_j\}$. The zero-one loss $L_{0,1} : \mathcal V \times \mathcal Y \to [0, 1]$ is defined by $L_{0,1}(u, y) = \mathbf 1\{m(u, y) \le 0\}$. Whilst natural, the zero-one loss has the drawback of being discontinuous, which presents an obstacle for deriving guarantees. For each $\rho > 0$, the corresponding margin loss $L_\rho : \mathcal V \times \mathcal Y \to [0, 1]$ is defined by $L_\rho(u, y) = \mathbf 1\{m(u, y) \le \rho\}$. The margin loss $L_\rho$ is also discontinuous. However, we may define a smooth margin loss $\tilde L_\rho : \mathcal V \times \mathcal Y \to [0, 1]$ by

$$\tilde L_\rho(u, y) := \begin{cases} 1 & \text{if } m(u, y) \le 0,\\[2pt] 2\left(\frac{m(u,y)}{\rho}\right)^3 - 3\left(\frac{m(u,y)}{\rho}\right)^2 + 1 & \text{if } m(u, y) \in [0, \rho],\\[2pt] 0 & \text{if } m(u, y) \ge \rho.\end{cases}$$

By applying Lemma 1 we can show that $\tilde L_\rho$ is $(\lambda, \theta)$-self-bounding Lipschitz with $\lambda = 4\sqrt 6\,\rho^{-1}$ and $\theta = 1/2$. Moreover, the smooth margin loss satisfies $L_{0,1}(u, y) \le \tilde L_\rho(u, y) \le L_\rho(u, y)$ for all $(u, y) \in \mathcal V \times \mathcal Y$.

The margin loss plays a central role in learning theory and continues to receive significant attention in the analysis of multi-class prediction (Guermeur, 2017; Li et al., 2018; Musayeva et al., 2019), so it is fortuitous that our self-bounding Lipschitz condition incorporates the smooth margin loss. More importantly, however, the self-bounding Lipschitz condition applies to a variety of other loss functions which have received less attention in statistical learning theory. One of the most widely used loss functions in practical applications is the multinomial logistic loss, also known as the softmax loss.

**Example 2 (Multinomial logistic loss).** Given $\mathcal Y = [q]$, the multinomial logistic loss $L : \mathcal V \times \mathcal Y \to [0, \infty)$ is defined by

$$L(u, y) = \log\left(\sum_{j\in[q]} \exp(u_j - u_y)\right),$$

where $u = (u_j)_{j\in[q]}$ and $y \in [q]$. For each $(u, y) \in \mathcal V \times [q]$ let $A_{u,y} = \sum_{j\in[q]\setminus\{y\}} \exp(u_j - u_y)$ and define $\phi_{u,y}(t) = \log(1 + A_{u,y}\exp(2t))$. By applying Lemma 1 with $\phi_{u,y}$ we can show that the multinomial logistic loss $L$ is $(\lambda, \theta)$-self-bounding Lipschitz with $\lambda = 2$ and $\theta = 1/2$.

Recently, Lei et al. (2019) pointed out that the multinomial logistic loss is 2-Lipschitz with respect to the $\ell_\infty$-norm (equivalently, $(2, 0)$-self-bounding Lipschitz). This gives rise to a slow rate of order $O(n^{-1/2})$. The fact that the multinomial logistic loss is also $(2, 1/2)$-self-bounding can be used to derive more favourable guarantees, as we shall see in Section 3.
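The self-bounding Lipschitz inequality of Definition 1 can be spot-checked numerically for a candidate pair $(\lambda, \theta)$. The minimal sketch below (not from the paper; purely a sanity check under randomly sampled score vectors) evaluates the ratio of the two sides of the inequality for the multinomial logistic loss of Example 2 with $\lambda = 2$, $\theta = 1/2$; the observed ratio should never exceed 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def multinomial_logistic_loss(u, y):
    """L(u, y) = log(sum_j exp(u_j - u_y)), computed stably (Example 2)."""
    z = u - u[y]
    m = z.max()
    return float(m + np.log(np.exp(z - m).sum()))

lam, theta, q = 2.0, 0.5, 6
worst = 0.0
for _ in range(20000):
    y = int(rng.integers(q))
    u = rng.normal(scale=3.0, size=q)
    v = u + rng.normal(scale=0.5, size=q)
    lu, lv = multinomial_logistic_loss(u, y), multinomial_logistic_loss(v, y)
    lhs = abs(lu - lv)
    rhs = lam * max(lu, lv) ** theta * np.max(np.abs(u - v))
    if rhs > 0:
        worst = max(worst, lhs / rhs)
# Should stay <= 1 if the (2, 1/2)-self-bounding Lipschitz property holds.
print("max observed |L(u,y) - L(v,y)| / (lam * max(L)^theta * ||u-v||_inf):", worst)
```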
#### 2.2.2. Multi-label Losses

In multi-label prediction, instances may be simultaneously assigned to several categories. We have $\mathcal Y \subseteq \{0, 1\}^q$, where $q$ is the total number of possible classes. Whilst $q$ is often very large, the total number of simultaneous labels is typically much smaller. Hence, we consider the set of $k$-sparse binary vectors $\mathcal S(k) = \{(y_j)_{j\in[q]} \in \{0,1\}^q : \sum_{j\in[q]} y_j \le k\}$, where $k \in [q]$. We consider the pick-all-labels loss (Menon et al., 2019; Reddi et al., 2019).

**Example 3 (Pick-all-labels).** Given $\mathcal Y = \mathcal S(k)$, the pick-all-labels loss $L : \mathcal V \times \mathcal Y \to [0, \infty)$ is defined by

$$L(u, y) = \sum_{l\in[q]} y_l \log\left(\sum_{j\in[q]} \exp(u_j - u_l)\right),$$

where $u = (u_j)_{j\in[q]} \in \mathcal V$ and $y = (y_j)_{j\in[q]} \in \mathcal Y$. For each $(u, y) \in \mathcal V \times \mathcal Y$ and $l \in [q]$ we define $A_{u,l} = \sum_{j\in[q]\setminus\{l\}} \exp(u_j - u_l)$ and let $\phi_{u,y}(t) := \sum_{l\in[q]} y_l \log(1 + A_{u,l}\exp(2t))$. By applying Lemma 1 with $\phi_{u,y}$ we can show that $L$ is $(\lambda, \theta)$-self-bounding Lipschitz with $\lambda = 2\sqrt k$ and $\theta = 1/2$.

Crucially, the constant $\lambda$ for the pick-all-labels family of losses is a function of the sparsity $k$, rather than the total number of labels. This means that our approach is applicable to multi-label problems with tens of thousands of labels, as long as the label vectors are $k$-sparse.

#### 2.2.3. Losses for Multi-target Regression

We now return to the problem of multi-target regression, in which $\mathcal Y = \mathbb R^q$ (Borchani et al., 2015).

**Example 4 (Sup-norm losses).** Given $\kappa, \gamma \in [1, 2]$ we can define a loss function $L : \mathcal V \times \mathcal Y \to \mathbb R$ for multi-target regression by setting $L(u, y) = \kappa\,\|u - y\|_\infty^\gamma$. By applying Lemma 1 with $\phi_{u,y}(t) = \kappa\,(\|u - y\|_\infty + t)^\gamma$ we can see that $L$ is $(\lambda, \theta)$-self-bounding Lipschitz with $\lambda = (8\kappa)^{1-\theta}$ and $\theta = (\gamma - 1)/\gamma$. This yields examples of $(\lambda, \theta)$-self-bounding Lipschitz loss functions for all $\lambda > 0$ and $\theta \in [0, 1/2]$.

With these examples in mind we are ready to present our results.
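As a brief illustration of Example 3 (again a toy sketch, not the paper's code), the pick-all-labels loss only sums over the $k$ active labels of a $k$-sparse label vector, which is why its self-bounding Lipschitz constant $\lambda = 2\sqrt k$ depends on the sparsity rather than on $q$; the example below evaluates it with $q = 10{,}000$ and $k = 5$.

```python
import numpy as np

def pick_all_labels_loss(u, y):
    """Example 3: L(u, y) = sum_{l : y_l = 1} log(sum_j exp(u_j - u_l)) for y in {0,1}^q."""
    total = 0.0
    for l in np.flatnonzero(y):      # only the k active labels contribute
        z = u - u[l]
        m = z.max()
        total += m + np.log(np.exp(z - m).sum())
    return float(total)

rng = np.random.default_rng(1)
q, k = 10_000, 5
u = rng.normal(size=q)
y = np.zeros(q, dtype=int)
y[rng.choice(q, size=k, replace=False)] = 1   # a k-sparse label vector in S(k)
print(pick_all_labels_loss(u, y))
```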
## 3. Main Results

In this section we give a general upper bound for multi-output prediction problems under the self-bounding Lipschitz condition. A key tool for proving this result will be a contraction inequality for the local Rademacher complexity of vector-valued functions, given in Section 4.1, which may also be of independent interest. First, we recall the concept of Rademacher complexity.

**Definition 2 (Rademacher complexity).** Let $\mathcal Z$ be a measurable space and consider a function class $\mathcal G \subseteq \mathcal M(\mathcal Z, \mathbb R)$. Given a sequence $z = (z_i)_{i\in[n]} \in \mathcal Z^n$ we define the empirical Rademacher complexity of $\mathcal G$ with respect to $z$ by¹

$$\hat{\mathcal R}_z(\mathcal G) := \sup_{\tilde{\mathcal G}\subseteq\mathcal G : |\tilde{\mathcal G}| < \infty}\; \mathbb E_\sigma\left[\sup_{g\in\tilde{\mathcal G}} \frac{1}{n}\sum_{i\in[n]} \sigma_i\, g(z_i)\right],$$

where the expectation is taken over sequences of independent Rademacher random variables $\sigma = (\sigma_i)_{i\in[n]} \in \{-1, +1\}^n$. For each $n \in \mathbb N$, the worst-case Rademacher complexity of $\mathcal G$ is defined by $\mathcal R_n(\mathcal G) := \sup_{z\in\mathcal Z^n} \hat{\mathcal R}_z(\mathcal G)$.

¹ Taking the supremum over finite subsets $\tilde{\mathcal G} \subseteq \mathcal G$ is required to ensure that the function within the expectation is measurable (Talagrand, 2014). This technicality can typically be overlooked.

The Rademacher complexity is defined in the context of real-valued functions. However, in this work we deal with multi-output prediction, so we shall focus on function classes $\mathcal F \subseteq \mathcal M(\mathcal X, \mathbb R^q)$. In order to utilise the theory of Rademacher complexity in this context we shall transform function classes $\mathcal F \subseteq \mathcal M(\mathcal X, \mathbb R^q)$ into the projected function classes $\Pi\circ\mathcal F \subseteq \mathcal M(\mathcal X \times [q], \mathbb R)$ as follows. Firstly, for each $j \in [q]$ we define $\pi_j : \mathbb R^q \to \mathbb R$ to be the projection onto the $j$-th coordinate. We then define, for each $f \in \mathcal M(\mathcal X, \mathbb R^q)$, the function $\Pi\circ f : \mathcal X \times [q] \to \mathbb R$ by $(\Pi\circ f)(x, j) = \pi_j(f(x))$. Finally, given $\mathcal F \subseteq \mathcal M(\mathcal X, \mathbb R^q)$ we let $\Pi\circ\mathcal F := \{\Pi\circ f : f \in \mathcal F\} \subseteq \mathcal M(\mathcal X \times [q], \mathbb R)$.

Our central result is the following relative bound.

**Theorem 1.** Suppose we have a class of multi-output functions $\mathcal F \subseteq \mathcal M(\mathcal X, [-\beta, \beta]^q)$, and a $(\lambda, \theta)$-self-bounding Lipschitz loss function $L : \mathcal V \times \mathcal Y \to [0, b]$ for some $\beta, b \ge 1$, $\lambda > 0$, $\theta \in [0, 1/2]$. Take $\delta \in (0, 1)$, $n \in \mathbb N$ and let

$$\Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal F) := \left(\lambda\left(\sqrt q\,\log^{3/2}(e\beta nq)\,\mathcal R_{nq}(\Pi\circ\mathcal F) + \frac{1}{\sqrt n}\right)\right)^{\frac{1}{1-\theta}} + \frac{\log(1/\delta) + \log(\log n)}{n}.$$

There exist numerical constants $C_0, C_1 > 0$ such that given an i.i.d. sample $\mathcal D$ the following holds with probability at least $1 - \delta$ for all $f \in \mathcal F$,

$$\mathcal E_L(f) \le \hat{\mathcal E}_L(f) + C_0\left(\sqrt{\hat{\mathcal E}_L(f)\,\Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal F)} + \Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal F)\right).$$

Moreover, if $f^* \in \operatorname{argmin}_{f\in\mathcal F}\{\mathcal E_L(f)\}$ minimises the risk and $\hat f \in \operatorname{argmin}_{f\in\mathcal F}\{\hat{\mathcal E}_L(f)\}$ minimises the empirical risk, then with probability at least $1 - \delta$,

$$\mathcal E_L(\hat f) \le \mathcal E_L(f^*) + C_1\left(\sqrt{\mathcal E_L(f^*)\,\Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal F)} + \Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal F)\right).$$

The proof of Theorem 1 is given in Section 4.2. It builds on a local contraction inequality (Proposition 1, Section 4.1), combined with techniques from Bousquet (2002). Theorem 1 gives an upper bound for the generalisation gap $\mathcal E_L(f) - \hat{\mathcal E}_L(f)$, framed in terms of a complexity term $\Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal F)$, which depends upon both the Rademacher complexity of the projected function class $\mathcal R_{nq}(\Pi\circ\mathcal F)$ and the self-bounding Lipschitz parameters $\lambda, \theta$. When the empirical error is small in relation to the complexity term ($\hat{\mathcal E}_L(f) \le \Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal F)$), the generalisation gap is of order $\Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal F)$. In less favourable circumstances we recover a bound of order $\sqrt{\Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal F)}$.

### 3.1. Comparison with State of the Art

In this section we compare our main result (Theorem 1) with a closely related guarantee due to Lei et al. (2019). Observe that a loss function $L$ is $\lambda$-Lipschitz (with respect to $\|\cdot\|_\infty$) precisely when it is $(\lambda, \theta)$-self-bounding Lipschitz with $\theta = 0$.

**Theorem 2 (Lei et al., 2019).** Suppose we have a class of multi-output functions $\mathcal F \subseteq \mathcal M(\mathcal X, [-\beta, \beta]^q)$, and a $\lambda$-Lipschitz loss function $L : \mathcal V \times \mathcal Y \to [0, b]$ for some $\beta, b \ge 1$ and $\lambda > 0$. Take $\delta \in (0, 1)$, $n \in \mathbb N$ and let

$$J^{\lambda}_{n,q,\delta}(\mathcal F) := \lambda\left(\sqrt q\,\log^{3/2}(e\beta nq)\,\mathcal R_{nq}(\Pi\circ\mathcal F) + \frac{1}{\sqrt n}\right).$$

There exist numerical constants $C_2, C_3 > 0$ such that given an i.i.d. sample $\mathcal D$ the following holds with probability at least $1 - \delta$ for all $f \in \mathcal F$,

$$\mathcal E_L(f) \le \hat{\mathcal E}_L(f) + C_2\, J^{\lambda}_{n,q,\delta}(\mathcal F) + b\sqrt{\frac{\log(1/\delta)}{2n}}.$$

Moreover, if $f^* \in \operatorname{argmin}_{f\in\mathcal F}\{\mathcal E_L(f)\}$ minimises the risk and $\hat f \in \operatorname{argmin}_{f\in\mathcal F}\{\hat{\mathcal E}_L(f)\}$ minimises the empirical risk, then with probability at least $1 - \delta$,

$$\mathcal E_L(\hat f) \le \mathcal E_L(f^*) + C_3\, J^{\lambda}_{n,q,\delta}(\mathcal F) + 2b\sqrt{\frac{\log(2/\delta)}{2n}}.$$

Theorem 2 is a mild generalisation of Theorem 6 from Lei et al. (2019), originally formulated for multi-class classification with $\mathcal F$ an RKHS. For completeness we show that Theorem 2 follows from Proposition 1 in Supplementary Appendix B. Note that by the monotonicity property (Lemma 3) any loss function $L : \mathcal V \times \mathcal Y \to [0, b]$ which is $(\lambda, \theta)$-self-bounding Lipschitz is also $\lambda b^\theta$-Lipschitz, so the additive bound in Theorem 2 also applies.

To gain a deeper intuition for the bound in Theorem 1 we compare it with the bound in Theorem 2. Let us suppose that $\mathcal R_{nq}(\Pi\circ\mathcal F) = O((nq)^{-1/2})$ (for a concrete example where this is the case see Section 5). We then have $\Gamma^{\lambda,\theta}_{n,q,\delta}(\mathcal F) = O(n^{-\frac{1}{2(1-\theta)}})$. For large values of $\hat{\mathcal E}_L(f)$, Theorem 1 gives a bound on the generalisation gap $\mathcal E_L(f) - \hat{\mathcal E}_L(f)$ of order $O(n^{-\frac{1}{4(1-\theta)}})$, which is slower than the rate achieved by Theorem 2 whenever $\theta < 1/2$. However, when $\hat{\mathcal E}_L(f)$ is small ($\hat{\mathcal E}_L(f) = O(n^{-\frac{1}{2(1-\theta)}})$), Theorem 1 gives rise to a bound of order $O(n^{-\frac{1}{2(1-\theta)}})$, yielding faster rates than can be obtained through the standard Lipschitz condition alone whenever $\theta > 0$. Finally, note that if the loss $L$ is $(\lambda, \theta)$-self-bounding Lipschitz with $\theta = 1/2$ then the rates given by Theorem 1 always either match or outperform the rates given by Theorem 2. Moreover, $\theta = 1/2$ occurs for several practical examples discussed in Section 2.2, including the multinomial logistic loss.
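To make this comparison concrete, the short sketch below (illustrative only; the function names `gamma` and `lei_term` are mine, and all constants are dropped) plugs the assumption $\mathcal R_{nq}(\Pi\circ\mathcal F) = (nq)^{-1/2}$ into the complexity terms of Theorems 1 and 2 with $\theta = 1/2$, and prints the resulting generalisation-gap scalings in the optimistic regime of near-zero empirical risk.

```python
import numpy as np

def gamma(n, q, lam=2.0, theta=0.5, beta=1.0, delta=0.05):
    """Complexity term of Theorem 1 with R_nq(Pi o F) = (nq)^(-1/2) plugged in; constants ignored."""
    rad = 1.0 / np.sqrt(n * q)
    main = lam * (np.sqrt(q) * np.log(np.e * beta * n * q) ** 1.5 * rad + 1.0 / np.sqrt(n))
    return main ** (1.0 / (1.0 - theta)) + (np.log(1.0 / delta) + np.log(np.log(n))) / n

def lei_term(n, q, lam=2.0, beta=1.0):
    """Complexity term J of Theorem 2 under the same assumption on R_nq(Pi o F)."""
    rad = 1.0 / np.sqrt(n * q)
    return lam * (np.sqrt(q) * np.log(np.e * beta * n * q) ** 1.5 * rad + 1.0 / np.sqrt(n))

q = 10
for n in [10**3, 10**4, 10**5, 10**6]:
    g = gamma(n, q)                    # theta = 1/2: roughly O(1/n) up to logarithmic factors
    emp = 0.0                          # optimistic regime: near-zero empirical risk
    thm1_gap = np.sqrt(emp * g) + g    # Theorem 1 generalisation gap (constants dropped)
    print(f"n={n:>7}  Theorem 1 gap ~ {thm1_gap:.2e}   Theorem 2 gap ~ {lei_term(n, q):.2e}")
```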
## 4. Proofs of Main Results

We now turn to stating and proving the key ingredient of our main result, Proposition 1.

### 4.1. A Contraction Inequality for the Local Rademacher Complexity of Vector-valued Function Classes

First we introduce some additional notation. Suppose $f \in \mathcal M(\mathcal X, \mathcal V)$. Given a loss function $L : \mathcal V \times \mathcal Y \to \mathbb R$ we define $L\circ f : \mathcal X \times \mathcal Y \to \mathbb R$ by $(L\circ f)(x, y) = L(f(x), y)$. We extend this definition to function classes $\mathcal F \subseteq \mathcal M(\mathcal X, \mathcal V)$ by $L\circ\mathcal F = \{L\circ f : f \in \mathcal F\}$. Moreover, for each $z \in (\mathcal X \times \mathcal Y)^n$ and $r > 0$, we define the subset $\mathcal F|^r_z := \{f \in \mathcal F : \hat{\mathcal E}_L(f, z) \le r\}$. Intuitively, the local Rademacher complexity allows us to zoom in on the neighbourhood of the empirical risk minimiser. This is the subset that matters in practice and it is typically much smaller than the full $\Pi\circ\mathcal F$.

**Proposition 1.** Suppose we have a class of multi-output functions $\mathcal F \subseteq \mathcal M(\mathcal X, [-\beta, \beta]^q)$, where $\beta \ge 1$. Given a $(\lambda, \theta)$-self-bounding Lipschitz loss function $L : \mathcal V \times \mathcal Y \to [0, R]$, where $\lambda > 0$, $\theta \in [0, 1/2]$, and $z \in (\mathcal X \times \mathcal Y)^n$, $r > 0$, we have the following bound,

$$\hat{\mathcal R}_z\!\left(L\circ\mathcal F|^r_z\right) \le \lambda r^\theta\, 2^9\left(\sqrt q\,\log^{3/2}(e\beta nq)\,\mathcal R_{nq}(\Pi\circ\mathcal F) + n^{-1/2}\right).$$

The proof of Proposition 1, given later in this section, relies upon covering numbers.

**Definition 3 (Covering numbers).** Let $(\mathcal M, \rho)$ be a semimetric space. Given a set $\mathcal A \subseteq \mathcal M$ and an $\epsilon > 0$, a subset $\tilde{\mathcal A} \subseteq \mathcal A$ is said to be a (proper) $\epsilon$-cover of $\mathcal A$ if, for all $a \in \mathcal A$, there exists some $\tilde a \in \tilde{\mathcal A}$ with $\rho(a, \tilde a) \le \epsilon$. We let $N(\epsilon, \mathcal A, \rho)$ denote the minimal cardinality of an $\epsilon$-cover of $\mathcal A$.

We shall consider covering numbers for two classes of data-dependent semimetric spaces. Let $\mathcal Z$ be a measurable space and take $\mathcal G \subseteq \mathcal M(\mathcal Z, \mathbb R)$. For each $n \in \mathbb N$ and each sequence $z = (z_i)_{i\in[n]} \in \mathcal Z^n$ we define a pair of semimetrics $\rho_{z,2}$ and $\rho_{z,\infty}$ by

$$\rho_{z,2}(g_0, g_1) := \left(\frac{1}{n}\sum_{i\in[n]} (g_0(z_i) - g_1(z_i))^2\right)^{1/2}, \qquad \rho_{z,\infty}(g_0, g_1) := \max_{i\in[n]}\{|g_0(z_i) - g_1(z_i)|\},$$

where $g_0, g_1 \in \mathcal G$.
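As a small illustration of Definition 3 (a toy sketch, not part of the paper's argument), the greedy procedure below constructs a proper $\epsilon$-cover of a finite collection of function-evaluation vectors under the $\rho_{z,\infty}$ semimetric; its size is an upper bound on $N(\epsilon, \cdot, \rho_{z,\infty})$, since a greedy cover need not be minimal.

```python
import numpy as np

def greedy_proper_cover(values, eps):
    """Greedily build a proper eps-cover of the rows of `values` under
    rho_{z,inf}(g0, g1) = max_i |g0(z_i) - g1(z_i)|.
    Each row is the vector (g(z_1), ..., g(z_n)) of one function g evaluated on the sample.
    Returns the indices of the chosen cover elements."""
    cover = []
    for idx, v in enumerate(values):
        if not any(np.max(np.abs(v - values[c])) <= eps for c in cover):
            cover.append(idx)
    return cover

# Toy usage: a finite class of randomly drawn bounded functions evaluated on n points.
rng = np.random.default_rng(0)
n_points, n_funcs = 20, 500
values = rng.uniform(-1.0, 1.0, size=(n_funcs, n_points))
for eps in [1.5, 1.0, 0.75, 0.5]:
    print(f"eps={eps:<5} greedy cover size = {len(greedy_proper_cover(values, eps))}")
```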
The first stage of the proof of Proposition 1 will be the following lemma, which bounds the covering number of $L\circ\mathcal F|^r_z$ in terms of an associated covering number for $\Pi\circ\mathcal F$.

**Lemma 4.** Suppose that $\mathcal F \subseteq \mathcal M(\mathcal X, \mathbb R^q)$ and $L$ is $(\lambda, \theta)$-self-bounding Lipschitz with $\theta \in [0, 1/2]$. Take $L : \mathcal V \times \mathcal Y \to [0, b]$, $z = \{(x_i, y_i)\}_{i\in[n]} \in (\mathcal X \times \mathcal Y)^n$, $r > 0$ and define $w = \{(x_i, j)\}_{(i,j)\in[n]\times[q]} \in (\mathcal X \times [q])^{nq}$. Given any $f_0, f_1 \in \mathcal F|^r_z$ we have

$$\rho_{z,2}(L\circ f_0, L\circ f_1) \le 2^\theta \lambda r^\theta\, \rho_{w,\infty}(\Pi\circ f_0, \Pi\circ f_1).$$

Moreover, for any $\epsilon > 0$,

$$N\!\left(2^{1+\theta}\lambda r^\theta \epsilon,\; L\circ\mathcal F|^r_z,\; \rho_{z,2}\right) \le N\!\left(\epsilon,\; \Pi\circ\mathcal F,\; \rho_{w,\infty}\right).$$

*Proof of Lemma 4.* To prove the first part of the lemma we take $f_0, f_1 \in \mathcal F|^r_z$ and let $\zeta = \rho_{w,\infty}(\Pi\circ f_0, \Pi\circ f_1)$. It follows from the construction of $w$ that $|\pi_j(f_0(x_i)) - \pi_j(f_1(x_i))| \le \zeta$ for each $(i, j) \in [n]\times[q]$, so $\|f_0(x_i) - f_1(x_i)\|_\infty \le \zeta$ for each $i \in [n]$. Furthermore, by the self-bounding Lipschitz condition we deduce that for each $i \in [n]$,

$$|L(f_0(x_i), y_i) - L(f_1(x_i), y_i)| \le \lambda\,\max\{L(f_0(x_i), y_i), L(f_1(x_i), y_i)\}^\theta\, \|f_0(x_i) - f_1(x_i)\|_\infty \le \lambda\,\max\{L(f_0(x_i), y_i), L(f_1(x_i), y_i)\}^\theta\, \zeta.$$

Hence, by Jensen's inequality we have

$$\rho_{z,2}(L\circ f_0, L\circ f_1)^2 = \frac{1}{n}\sum_{i\in[n]}\left(L(f_0(x_i), y_i) - L(f_1(x_i), y_i)\right)^2 \le (\lambda\zeta)^2\,\frac{1}{n}\sum_{i\in[n]} \max\{L(f_0(x_i), y_i), L(f_1(x_i), y_i)\}^{2\theta} \le (\lambda\zeta)^2\left(\frac{1}{n}\sum_{i\in[n]} \max\{L(f_0(x_i), y_i), L(f_1(x_i), y_i)\}\right)^{2\theta} \le (\lambda\zeta)^2\left(\hat{\mathcal E}_L(f_0, z) + \hat{\mathcal E}_L(f_1, z)\right)^{2\theta} \le (\lambda\zeta)^2 (2r)^{2\theta},$$

where we use the fact that $\theta \in [0, 1/2]$ and $\max\{\hat{\mathcal E}_L(f_0, z), \hat{\mathcal E}_L(f_1, z)\} \le r$. Thus, $\rho_{z,2}(L\circ f_0, L\circ f_1) \le 2^\theta\lambda r^\theta\,\zeta = 2^\theta\lambda r^\theta\,\rho_{w,\infty}(\Pi\circ f_0, \Pi\circ f_1)$. This completes the proof of the first part of the lemma.

To prove the second part of the lemma we note that since $\Pi\circ\mathcal F|^r_z \subseteq \Pi\circ\mathcal F$ we have²

$$N\!\left(2\epsilon, \Pi\circ\mathcal F|^r_z, \rho_{w,\infty}\right) \le N\!\left(\epsilon, \Pi\circ\mathcal F, \rho_{w,\infty}\right),$$

so we may choose $f_1, \ldots, f_m \in \mathcal F|^r_z$ with $m \le N(\epsilon, \Pi\circ\mathcal F, \rho_{w,\infty})$ such that $\Pi\circ f_1, \ldots, \Pi\circ f_m$ forms a $2\epsilon$-cover of $\Pi\circ\mathcal F|^r_z$ with respect to the $\rho_{w,\infty}$ metric. To complete the proof it suffices to show that $L\circ f_1, \ldots, L\circ f_m$ is a $2^{1+\theta}\lambda r^\theta\epsilon$-cover of $L\circ\mathcal F|^r_z$ with respect to the $\rho_{z,2}$ metric.

² The factor of 2 is required as we are using proper covers, which are subsets of the set being covered (see Definition 3).

Take any $g \in L\circ\mathcal F|^r_z$, so $g = L\circ f$ for some $f \in \mathcal F|^r_z$. Since $\Pi\circ f_1, \ldots, \Pi\circ f_m$ forms a $2\epsilon$-cover of $\Pi\circ\mathcal F|^r_z$ we may choose $l \in [m]$ so that $\rho_{w,\infty}(\Pi\circ f_l, \Pi\circ f) \le 2\epsilon$. By the first part of the lemma we deduce that $\rho_{z,2}(L\circ f_l, g) = \rho_{z,2}(L\circ f_l, L\circ f) \le 2^{1+\theta}\lambda r^\theta\epsilon$. Since this holds for all $g \in L\circ\mathcal F|^r_z$, we see that $L\circ f_1, \ldots, L\circ f_m$ is a $2^{1+\theta}\lambda r^\theta\epsilon$-cover of $L\circ\mathcal F|^r_z$, which completes the proof of the lemma. $\square$

To prove Proposition 1, we shall also utilise two technical results to move from covering numbers to Rademacher complexity and back. First, we shall use the following powerful result from Srebro et al. (2010), which gives an upper bound for worst-case covering numbers in terms of the worst-case Rademacher complexity.

**Theorem 3 (Srebro et al., 2010).** Given a measurable space $\mathcal Z$ and a function class $\mathcal G \subseteq \mathcal M(\mathcal Z, [-\beta, \beta])$, for any $\epsilon > 2\mathcal R_n(\mathcal G)$ and any $z \in \mathcal Z^n$,

$$\log N(\epsilon, \mathcal G, \rho_{z,\infty}) \le \frac{4n\,(\mathcal R_n(\mathcal G))^2}{\epsilon^2}\,\log\!\left(\frac{2e\beta n}{\epsilon}\right).$$

We can view this result as an analogue of Sudakov's minoration inequality for $\ell_\infty$ covers, rather than $\ell_2$ covers. Secondly, we shall use Dudley's inequality (Dudley, 1967), which allows us to bound Rademacher complexities in terms of covering numbers. We shall use the following variant due to Guermeur (2017), as it yields more favourable constants.

**Theorem 4 (Guermeur, 2017).** Suppose we have a measurable space $\mathcal Z$, a function class $\mathcal G \subseteq \mathcal M(\mathcal Z, \mathbb R)$ and a sequence $z \in \mathcal Z^n$. For any decreasing sequence $(\epsilon_k)_{k=0}^\infty$ with $\lim_{k\to\infty}\epsilon_k = 0$ and $\epsilon_0 \ge \sup_{g_0, g_1\in\mathcal G}\rho_{z,2}(g_0, g_1)$, the following inequality holds for all $K \in \mathbb N$,

$$\hat{\mathcal R}_z(\mathcal G) \le \epsilon_K + \sum_{k=1}^{K}(\epsilon_k + \epsilon_{k-1})\sqrt{\frac{\log N(\epsilon_k, \mathcal G, \rho_{z,2})}{n}}.$$

We are now ready to complete the proof of our local Rademacher complexity inequality.

*Proof of Proposition 1.* Take $z = \{(x_i, y_i)\}_{i\in[n]} \in (\mathcal X \times \mathcal Y)^n$ and $r > 0$ and define $w = \{(x_i, j)\}_{(i,j)\in[n]\times[q]} \in (\mathcal X \times [q])^{nq}$. By Lemma 4 combined with Theorem 3 applied to $\Pi\circ\mathcal F$ we see that for each $\xi > 2\mathcal R_{nq}(\Pi\circ\mathcal F)$ we have

$$\log N\!\left(2^{1+\theta}\lambda r^\theta\xi, L\circ\mathcal F|^r_z, \rho_{z,2}\right) \le \log N(\xi, \Pi\circ\mathcal F, \rho_{w,\infty}) \le \frac{4nq\,(\mathcal R_{nq}(\Pi\circ\mathcal F))^2}{\xi^2}\log\!\left(\frac{2e\beta nq}{\xi}\right). \tag{1}$$

Moreover, given any $g_0 = L\circ f_0$, $g_1 = L\circ f_1 \in L\circ\mathcal F|^r_z$, we have $\rho_{w,\infty}(\Pi\circ f_0, \Pi\circ f_1) \le 2\beta$, so by the first part of Lemma 4 we have $\rho_{z,2}(g_0, g_1) \le 2^{1+\theta}\lambda r^\theta\beta$. Now construct $(\epsilon_k)_{k=0}^\infty$ by $\epsilon_k = 2^{1+\theta}\lambda r^\theta\beta\, 2^{-k}$ and choose

$$K = \left\lfloor \log_2\!\left(\beta\,\min\!\left\{(2\mathcal R_{nq}(\Pi\circ\mathcal F))^{-1},\, 8\sqrt n\right\}\right)\right\rfloor.$$

Hence, $\sup_{g_0, g_1 \in L\circ\mathcal F|^r_z}\rho_{z,2}(g_0, g_1) \le \epsilon_0$ and $\beta 2^{-K-1} \le \max\{2\mathcal R_{nq}(\Pi\circ\mathcal F), (8\sqrt n)^{-1}\} < \beta 2^{-K}$. Furthermore, for $k \le K$, letting $\xi_k = \beta 2^{-k}$, we have $\epsilon_k = 2^{1+\theta}\lambda r^\theta\xi_k$ and $\xi_k > \max\{2\mathcal R_{nq}(\Pi\circ\mathcal F), (8\sqrt n)^{-1}\}$, so by eq. (1),

$$\log N(\epsilon_k, L\circ\mathcal F|^r_z, \rho_{z,2}) \le \frac{4nq\,(\mathcal R_{nq}(\Pi\circ\mathcal F))^2}{\xi_k^2}\log\!\left(\frac{2e\beta nq}{\xi_k}\right) \le \frac{\left(2^{1+\theta}\lambda r^\theta\mathcal R_{nq}(\Pi\circ\mathcal F)\right)^2 4nq}{\epsilon_k^2}\log\!\left(e\beta(nq)^{3/2}\right) \le \frac{\left(2^{1+\theta}\lambda r^\theta\mathcal R_{nq}(\Pi\circ\mathcal F)\right)^2 6nq}{\epsilon_k^2}\log(e\beta nq).$$

Note also that by construction $K \le 4\log(e\beta nq)$. By Theorem 4 and $\epsilon_{k-1} = 2\epsilon_k$ we deduce that

$$\hat{\mathcal R}_z(L\circ\mathcal F|^r_z) \le \sum_{k=1}^{K}(\epsilon_k + \epsilon_{k-1})\sqrt{\frac{\log N(\epsilon_k, L\circ\mathcal F|^r_z, \rho_{z,2})}{n}} + \epsilon_K \le 6K\cdot 2^{1+\theta}\lambda r^\theta\,\mathcal R_{nq}(\Pi\circ\mathcal F)\sqrt{6q\log(e\beta nq)} + \epsilon_K \le 2^8\sqrt q\,\lambda r^\theta\,\mathcal R_{nq}(\Pi\circ\mathcal F)\log^{3/2}(e\beta nq) + \epsilon_K \le \lambda r^\theta\, 2^9\left(\sqrt q\,\log^{3/2}(e\beta nq)\,\mathcal R_{nq}(\Pi\circ\mathcal F) + n^{-1/2}\right).$$

This completes the proof of the proposition. $\square$
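Proposition 1 controls the Rademacher complexity of the localised class $L\circ\mathcal F|^r_z$. The toy Monte Carlo sketch below (illustrative only; the finite class, loss and all names are my own construction, not the paper's) estimates this quantity for a small finite class of random linear multi-class predictors under the multinomial logistic loss, and shows how the estimate shrinks as the localisation radius $r$ decreases.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic_loss_matrix(scores, ys):
    """losses[m, i] = multinomial logistic loss of candidate m on example i.
    scores: (M, n, q) array of score vectors, ys: (n,) integer labels."""
    n = scores.shape[1]
    z = scores - scores[:, np.arange(n), ys][:, :, None]   # u_j - u_y
    mx = z.max(axis=2, keepdims=True)
    return mx[..., 0] + np.log(np.exp(z - mx).sum(axis=2))

def local_rademacher(losses, r, n_mc=2000):
    """Monte Carlo estimate of hat{R}_z(L o F|_z^r) for a finite class given its loss matrix."""
    keep = losses[losses.mean(axis=1) <= r]                # the localised subclass F|_z^r
    if keep.shape[0] == 0:
        return 0.0
    n = keep.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))        # Rademacher vectors
    return float(np.mean(np.max(sigma @ keep.T / n, axis=1)))

# Toy setup: a small finite class of random linear multi-class predictors.
n, d, q, M = 60, 5, 4, 200
X, Y = rng.normal(size=(n, d)), rng.integers(q, size=n)
W = 0.3 * rng.normal(size=(M, q, d))
losses = logistic_loss_matrix(np.einsum('mqd,nd->mnq', W, X), Y)
risks = losses.mean(axis=1)
for frac in [1.0, 0.5, 0.2, 0.05]:
    r = float(np.quantile(risks, frac))
    print(f"r = {r:.3f}  (keeps {int(np.sum(risks <= r))} candidates)  "
          f"estimated local Rademacher complexity = {local_rademacher(losses, r):.4f}")
```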
### 4.2. Proof of Theorem 1

To complete the proof of Theorem 1 we combine Proposition 1 with some results due to Bousquet (2002).

**Theorem 5.** Suppose we have a measurable space $\mathcal Z$ and a function class $\mathcal G \subseteq \mathcal M(\mathcal Z, [0, b])$. For each $z \in \mathcal Z^n$ and $g \in \mathcal G$ we let $\hat{\mathcal E}_z(g) = n^{-1}\sum_{i\in[n]} g(z_i)$. Suppose we have a function $\varphi_n : [0, \infty) \to [0, \infty)$ which is non-negative, non-decreasing, not identically zero, and such that $\varphi_n(r)/\sqrt r$ is non-increasing. Suppose further that for all $z \in \mathcal Z^n$ and $r > 0$, $\hat{\mathcal R}_z(\{g \in \mathcal G : \hat{\mathcal E}_z(g) \le r\}) \le \varphi_n(r)$. Let $\hat r_n$ be the largest solution of the equation $\varphi_n(r) = r$. Suppose that $Z$ is a random variable with distribution $P$, where $P$ is a distribution on $\mathcal Z$, let $\mathcal E(g) := \mathbb E_{Z\sim P}[g(Z)]$, and let $\mathcal D = \{Z_i\}_{i\in[n]} \in \mathcal Z^n$ be an i.i.d. sample, where each $Z_i \sim P$ is an independent copy of $Z$. For any $\delta \in (0, 1)$, the following holds with probability at least $1 - \delta$, for all $g \in \mathcal G$:

$$\mathcal E(g) \le \hat{\mathcal E}_{\mathcal D}(g) + 90(\hat r_n + r_0) + 4\sqrt{\hat{\mathcal E}_{\mathcal D}(g)\,(\hat r_n + r_0)},$$

where $r_0 = b\,(\log(1/\delta) + 6\log\log n)/n$.

*Proof.* The following result is given in the penultimate line of the proof of Theorem 6.1 of Bousquet (2002): with probability at least $1 - \delta$, for all $g \in \mathcal G$,

$$\mathcal E(g) \le \hat{\mathcal E}_{\mathcal D}(g) + 45\hat r_n + \sqrt{8\hat r_n\,\mathcal E(g)} + \sqrt{4 r_0\,\mathcal E(g)} + 20 r_0.$$

So,

$$\mathcal E(g) \le \hat{\mathcal E}_{\mathcal D}(g) + 45\hat r_n + 20 r_0 + 4\sqrt{(\hat r_n + r_0)\,\mathcal E(g)}.$$

We also need the following inequality (Lemma 5.11, Bousquet (2002)): suppose that $t, B, C > 0$ satisfy $t \le B\sqrt t + C$; then $t \le B^2 + C + B\sqrt C$. Applying this with $B = 4\sqrt{\hat r_n + r_0}$ and $C = \hat{\mathcal E}_{\mathcal D}(g) + 45\hat r_n + 20 r_0$ we have

$$\mathcal E(g) \le 16(\hat r_n + r_0) + \left(\hat{\mathcal E}_{\mathcal D}(g) + 45\hat r_n + 20 r_0\right) + 4\sqrt{(\hat r_n + r_0)\left(\hat{\mathcal E}_{\mathcal D}(g) + 45\hat r_n + 20 r_0\right)} \le \hat{\mathcal E}_{\mathcal D}(g) + 90(\hat r_n + r_0) + 4\sqrt{\hat{\mathcal E}_{\mathcal D}(g)(\hat r_n + r_0)},$$

which completes the proof. $\square$

Theorem 5 is a uniform upper bound in terms of the empirical risk. We can deduce a performance bound on the empirical risk minimiser by combining it with Bernstein's inequality (see Theorem 2.10 of Boucheron et al. (2013)).

**Theorem 6 (Bernstein, 1924).** Let $W_1, \ldots, W_n \in [0, b]$ be bounded independent random variables with mean $\mu = \mathbb E[W_i]$. Then with probability at least $1 - \delta$ we have

$$\frac{1}{n}\sum_{i\in[n]} W_i \le \mu + \sqrt{\frac{2\mu b\log(1/\delta)}{n}} + \frac{b\log(1/\delta)}{3n} \le 2\mu + \frac{3b\log(1/\delta)}{n}.$$

**Corollary 1.** Suppose that the assumptions of Theorem 5 hold and choose $g^* \in \operatorname{argmin}_{g\in\mathcal G}\{\mathcal E(g)\}$. Given $z \in \mathcal Z^n$ we choose $\hat g_z \in \operatorname{argmin}_{g\in\mathcal G}\{\hat{\mathcal E}_z(g)\}$. For any $\delta \in (0, 1)$, the following holds with probability at least $1 - 2\delta$,

$$\mathcal E(\hat g_{\mathcal D}) \le \mathcal E(g^*) + 9\sqrt{\mathcal E(g^*)\,(\hat r_n + r_0)} + 100\,(\hat r_n + r_0).$$

We can now complete the proof of Theorem 1.

*Proof of Theorem 1.* First let $\mathcal G = L\circ\mathcal F = \{(x, y) \mapsto L(f(x), y) : f \in \mathcal F\}$. Note that for $g = L\circ f$ with $f \in \mathcal F$ and $Z = (X, Y) \sim P$ we have $\mathcal E(g) = \mathcal E_L(f, P)$, and given $z \in (\mathcal X\times\mathcal Y)^n$ we have $\hat{\mathcal E}_z(g) = \hat{\mathcal E}_L(f)$. Note also that under this correspondence $L\circ\mathcal F|^r_z = \{g \in \mathcal G : \hat{\mathcal E}_z(g) \le r\}$. Now define $\varphi_n : [0, \infty) \to [0, \infty)$ by

$$\varphi_n(r) = \lambda r^\theta\, 2^9\left(\sqrt q\,\log^{3/2}(e\beta nq)\,\mathcal R_{nq}(\Pi\circ\mathcal F) + n^{-1/2}\right).$$

By Proposition 1, for each $z \in (\mathcal X\times\mathcal Y)^n$, $\hat{\mathcal R}_z(\{g \in \mathcal G : \hat{\mathcal E}_z(g) \le r\}) = \hat{\mathcal R}_z(L\circ\mathcal F|^r_z) \le \varphi_n(r)$. Observe that $\varphi_n$ is non-negative, non-decreasing and $\varphi_n(r)/\sqrt r$ is non-increasing, since $\theta \in [0, 1/2]$. So it remains to solve the fixed point equation $\varphi_n(r) = r$, which gives

$$\hat r_n := \left(\lambda\, 2^9\left(\sqrt q\,\log^{3/2}(e\beta nq)\,\mathcal R_{nq}(\Pi\circ\mathcal F) + n^{-1/2}\right)\right)^{\frac{1}{1-\theta}}.$$

Hence, the two bounds in Theorem 1 follow from Theorem 5 and Corollary 1, respectively. $\square$
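The fixed point $\hat r_n$ has a closed form because $\varphi_n(r) = C r^\theta$ for a constant $C$ not depending on $r$. The minimal sketch below (illustrative only) checks the closed form $\hat r_n = C^{1/(1-\theta)}$ against a direct fixed-point iteration of $r \mapsto C r^\theta$.

```python
import numpy as np

def fixed_point_closed_form(C, theta):
    """Largest solution of C * r**theta = r, namely r_hat = C**(1/(1-theta)) for theta in [0, 1/2].
    In the proof of Theorem 1, C = lambda * 2**9 * (sqrt(q) * log^{3/2}(e*beta*n*q) * R + 1/sqrt(n))."""
    return C ** (1.0 / (1.0 - theta))

def fixed_point_numeric(C, theta, iters=200):
    """Solve r = C * r**theta by fixed-point iteration (a contraction in log-space for theta < 1)."""
    r = 1.0
    for _ in range(iters):
        r = C * r ** theta
    return r

for C, theta in [(0.05, 0.5), (0.2, 0.25), (0.5, 0.0)]:
    closed, numeric = fixed_point_closed_form(C, theta), fixed_point_numeric(C, theta)
    print(f"C={C}, theta={theta}:  closed form {closed:.6f}   iteration {numeric:.6f}")
```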
## 5. An application to ensembles

In this section we highlight applications of our general multi-output learning framework. Specifically, here we consider ensembles of decision trees (Schapire & Freund, 2013), as they represent an effective and widely used tool in practice (Chen & Guestrin, 2016). Throughout this section we shall assume that $\mathcal X = \mathbb R^d$.

We consider sets of decision tree functions $\mathcal H^1_{p,\tau} \subseteq \mathcal M(\mathcal X, [-1, 1]^q)$ constructed as follows. Let $\mathcal T_{p,d}$ be the set of decision trees $t : \mathbb R^d \to [p]$ with $p$ leaves, where each internal node performs a binary split along a single feature. Let $\mathcal H_{p,\tau} \subseteq \mathcal M(\mathcal X, \mathbb R^q)$ be the set of all functions of the form $h(x) = (w_{t(x),j})_{j\in[q]}$, where $t \in \mathcal T_{p,d}$ is a decision tree and $w = (w_{l,j})_{(l,j)\in[p]\times[q]} \in \mathbb R^{pq}$ satisfies the $\ell_1$ constraint $\|w_l\|_1 = \sum_{j\in[q]} |w_{lj}| \le \tau$ for each leaf $l \in [p]$. Finally, let $\mathcal H^1_{p,\tau} := \mathcal H_{p,\tau} \cap \mathcal M(\mathcal X, [-1, 1]^q)$. We now give a bound for convex combinations of such decision trees.

**Theorem 7.** Suppose we have $\beta, b \ge 1$, $\lambda > 0$, $\theta \in [0, 1/2]$ and a $(\lambda, \theta)$-self-bounding Lipschitz loss function $L : \mathcal V \times \mathcal Y \to [0, b]$. Given $\delta \in (0, 1)$, $n \in \mathbb N$, we define for each $\alpha = (\alpha_t)_{t\in[T]}$, $\tau = (\tau_t)_{t\in[T]} \in (0, \infty)^T$,

$$C_{n,\delta}(\alpha, \tau) := \left(\frac{\lambda\sqrt p\,\log^2(3npqd\beta)}{\sqrt n}\left(\sum_{t\in[T]}\alpha_t\tau_t + 1\right)\right)^{\frac{1}{1-\theta}} + \frac{\log(1/\delta) + \log(\log n)}{n}.$$

There exists a numerical constant $C_0$ such that given an i.i.d. sample $\mathcal D$ the following holds with probability at least $1 - \delta$, for all ensembles $f = \sum_{t\in[T]}\alpha_t h_t$ with $\sum_{t\in[T]}\alpha_t \le \beta$ and $h_t \in \mathcal H^1_{p,\tau_t}$,

$$\mathcal E_L(f) \le \hat{\mathcal E}_L(f) + C_0\left(\sqrt{\hat{\mathcal E}_L(f)\,C_{n,\delta}(\alpha, \tau)} + C_{n,\delta}(\alpha, \tau)\right).$$

Before giving the proof, we highlight several interesting features of this result:

- First and foremost, Theorem 7 gives guarantees for ensembles of decision trees with respect to a wide variety of losses, including the multinomial logistic loss for multi-class classification and the one-versus-all loss for multi-label classification, as well as implying margin-based guarantees (see Section 2.2).
- Theorem 7 has a favourable dependency upon the number of examples whenever $\hat{\mathcal E}_L(f)$ is sufficiently small, as is often the case for large ensembles of decision trees. For example, if we are using the multinomial logistic loss and $\hat{\mathcal E}_L(f) \approx 0$, then Theorem 7 gives rise to a fast rate of $O(n^{-1})$.
- Theorem 7 has only logarithmic dependency upon the dimensionality of the output space $q$. This contrasts starkly with previous guarantees for multi-class learning with ensembles of decision trees (Kuznetsov et al., 2014; 2015), which are linear with respect to the number of classes $q$.

### 5.1. Proof of Theorem 7

The proof of Theorem 7 is a consequence of Theorem 1 combined with the following lemma.

**Lemma 5.** For all $m \in \mathbb N$ and $z \in (\mathcal X \times [q])^m$ we have

$$\hat{\mathcal R}_z(\Pi\circ\mathcal H_{p,\tau}) \le 2\tau\sqrt{\frac{p\log(2\max\{p\,d\,m,\, q\})}{m}}.$$

We begin by counting the number of possible partitions that can be made by a decision tree in $\mathcal T_{p,d}$ on a given sequence of points. Given a sequence $x = (x_i)_{i\in[m]} \in \mathcal X^m$ we let $\mathcal T_{p,d}(x) := \{(t(x_i))_{i\in[m]} : t \in \mathcal T_{p,d}\} \subseteq [p]^m$.

**Lemma 6.** For all $m \in \mathbb N$ and $x \in \mathcal X^m$, we have $|\mathcal T_{p,d}(x)| \le (p-1)!\,(d\,(m+1))^{p-1}$.

*Proof.* By induction, it suffices to show that $|\mathcal T_{p+1,d}(x)| \le |\mathcal T_{p,d}(x)|\, p\, d\,(m+1)$. Now observe that each element of $\mathcal T_{p+1,d}(x)$ may be constructed by taking an element of $\mathcal T_{p,d}(x)$ and then making a choice of one of $p$ existing leaf nodes to partition, one of $d$ dimensions to split upon, and one of at most $m+1$ possible split points. $\square$

We complete the proof of Lemma 5 as follows.

*Proof of Lemma 5.* For a given $\tau > 0$ we let $\Lambda_\tau := \{(a_j)_{j\in[q]} : \sum_{j\in[q]} |a_j| \le \tau\}$. Let $\{e^{(j)}\}_{j\in[q]} \subseteq \mathbb R^q$ be the canonical orthonormal basis. Let $\Lambda^{\mathrm{ex}}_1 \subseteq \Lambda_1$ denote the subset of extreme points of $\Lambda_1$, so $\Lambda^{\mathrm{ex}}_1 = \{u\, e^{(j)} : u \in \{-1, +1\} \text{ and } j \in [q]\}$. Note that $|\Lambda^{\mathrm{ex}}_1| = 2q$. Fix $z = (z_i)_{i\in[m]} \in (\mathcal X \times [q])^m$ with each $z_i = (x_i, j_i)$ and let $x = (x_i)_{i\in[m]} \in \mathcal X^m$. Then

$$\hat{\mathcal R}_z(\Pi\circ\mathcal H_{p,\tau}) = \mathbb E_\sigma\left[\sup_{h\in\mathcal H_{p,\tau}}\frac1m\sum_{i\in[m]}\sigma_i\,(\Pi\circ h)(x_i, j_i)\right] = \mathbb E_\sigma\left[\sup_{h\in\mathcal H_{p,\tau}}\frac1m\sum_{i\in[m]}\sigma_i\,\pi_{j_i}(h(x_i))\right] = \mathbb E_\sigma\left[\sup_{t\in\mathcal T_{p,d}}\sup_{w\in(\Lambda_\tau)^p}\frac1m\sum_{i\in[m]}\sigma_i\,w_{t(x_i),j_i}\right] = \mathbb E_\sigma\left[\sup_{(l_i)_{i\in[m]}\in\mathcal T_{p,d}(x)}\frac1m\sup_{w\in(\Lambda_\tau)^p}\sum_{i\in[m]}\sigma_i\,w_{l_i,j_i}\right].$$

Observe that for $w \in (\Lambda_\tau)^p$ and $(l_i) \in [p]^m$,

$$\sum_{i\in[m]}\sigma_i w_{l_i,j_i} = \sum_{r\in[p]}\sum_{s\in[q]} w_{r,s}\sum_{i: l_i=r,\, j_i=s}\sigma_i \le \sum_{r\in[p]}\sum_{s\in[q]}|w_{r,s}|\left|\sum_{i: l_i=r,\, j_i=s}\sigma_i\right| \le \tau\sum_{r\in[p]}\max_{s\in[q]}\left|\sum_{i: l_i=r,\, j_i=s}\sigma_i\right| \le \tau\sum_{r\in[p]}\sup_{(u_{r,s})_{s\in[q]}\in\Lambda^{\mathrm{ex}}_1}\sum_{s\in[q]} u_{r,s}\sum_{i: l_i=r,\, j_i=s}\sigma_i = \tau\sup_{(u_{r,s})\in(\Lambda^{\mathrm{ex}}_1)^p}\sum_{i\in[m]}\sigma_i\, u_{l_i,j_i}.$$

Plugging this bound into the above yields

$$\hat{\mathcal R}_z(\Pi\circ\mathcal H_{p,\tau}) \le \frac{\tau}{m}\,\mathbb E_\sigma\left[\sup_{(l_i)_{i\in[m]}\in\mathcal T_{p,d}(x)}\;\sup_{(u_{r,s})\in(\Lambda^{\mathrm{ex}}_1)^p}\sum_{i\in[m]}\sigma_i\, u_{l_i,j_i}\right] \le \frac{\tau}{m}\sup_{(l_i)_{i\in[m]}\in\mathcal T_{p,d}(x),\,(u_{r,s})\in(\Lambda^{\mathrm{ex}}_1)^p}\sqrt{\sum_{i\in[m]} u_{l_i,j_i}^2}\;\sqrt{2\log\left(|\mathcal T_{p,d}(x)|\,|\Lambda^{\mathrm{ex}}_1|^p\right)} \tag{2}$$

$$\le \frac{\tau}{\sqrt m}\sqrt{2\left((p-1)\log(p\,d\,m) + p\log(2q)\right)} \le 2\tau\sqrt{\frac{p\log(2\max\{p\,d\,m,\, q\})}{m}},$$

where (2) follows from Massart's lemma (e.g. Theorem 3.3 of Mohri et al. (2012)) and the penultimate inequality follows from Lemma 6. $\square$
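For concreteness, the toy sketch below (illustrative only; not the paper's experimental code) instantiates the class $\mathcal H_{p,\tau}$: an axis-aligned tree $t : \mathbb R^d \to [p]$ routes each input to a leaf, a leaf-weight matrix $w \in \mathbb R^{p\times q}$ with rows satisfying $\|w_l\|_1 \le \tau$ supplies the $q$-dimensional output, and a convex combination $f = \sum_t \alpha_t h_t$ forms the ensemble covered by Theorem 7. With $\tau \le 1$ the outputs lie in $[-1, 1]^q$, so each $h_t \in \mathcal H^1_{p,\tau}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_tree(d, p, lo=-3.0, hi=3.0):
    """A random axis-aligned decision tree t : R^d -> {0, ..., p-1} with p leaves,
    built by recursively splitting on a random feature at a random threshold."""
    counter = iter(range(p))

    def build(num_leaves):
        if num_leaves == 1:
            return ('leaf', next(counter))
        j = int(rng.integers(d))                     # feature to split on
        s = float(rng.uniform(lo, hi))               # split point
        left = int(rng.integers(1, num_leaves))      # leaves assigned to the left subtree
        return ('split', j, s, build(left), build(num_leaves - left))

    root = build(p)

    def apply(node, x):
        if node[0] == 'leaf':
            return node[1]
        _, j, s, left, right = node
        return apply(left, x) if x[j] <= s else apply(right, x)

    return lambda x: apply(root, x)

def tree_predictor(tree, w):
    """h(x) = (w[t(x), j])_{j in [q]}: the leaf reached by x selects a row of the weight matrix."""
    return lambda x: w[tree(x)]

d, q, p, tau, T = 4, 3, 5, 1.0, 10
trees, weights = [], []
for _ in range(T):
    trees.append(random_tree(d, p))
    w = rng.normal(size=(p, q))
    w *= tau / np.maximum(np.abs(w).sum(axis=1, keepdims=True), tau)   # enforce ||w_l||_1 <= tau
    weights.append(w)
alpha = rng.dirichlet(np.ones(T))          # convex combination, sum_t alpha_t = 1 <= beta
ensemble = lambda x: sum(a * tree_predictor(t, w)(x) for a, t, w in zip(alpha, trees, weights))
print(ensemble(rng.normal(size=d)))        # a q-dimensional prediction
```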
We can now deduce Theorem 7 from Theorem 1 with the help of a re-weighting argument along with the convexity property of Rademacher complexities.

*Proof of Theorem 7.* Take $\zeta > 0$ and let

$$\mathcal F := \left\{\sum_{t\in[T]}\alpha_t h_t \,:\, h_t \in \mathcal H^1_{p,\tau_t},\ \alpha_t \ge 0,\ \sum_{t\in[T]}\alpha_t\tau_t \le \zeta\ \text{ and }\ \sum_{t\in[T]}\alpha_t \le \beta\right\}.$$

Observe that $\mathcal F \subseteq \operatorname{conv}(\mathcal H_{p,\zeta})$. Indeed, given $f = \sum_{t\in[T]}\alpha_t h_t$ with $h_t \in \mathcal H^1_{p,\tau_t}$ and $\sum_{t\in[T]}\alpha_t\tau_t \le \zeta$, we can rewrite $f = \sum_{t\in[T]}(\alpha_t\tau_t\zeta^{-1})\,(\zeta\tau_t^{-1} h_t)$, with $\sum_{t\in[T]}(\alpha_t\tau_t\zeta^{-1}) \le 1$ and, for each $t \in [T]$, $\zeta\tau_t^{-1} h_t \in \mathcal H_{p,\zeta}$. Thus, $\Pi\circ\mathcal F \subseteq \Pi\circ\operatorname{conv}(\mathcal H_{p,\zeta}) = \operatorname{conv}(\Pi\circ\mathcal H_{p,\zeta})$. Hence, by the convexity property of Rademacher complexities (Boucheron et al., 2005, Theorem 3.3, eq. (5)), combined with Lemma 5, for any $z \in (\mathcal X\times[q])^{nq}$ we have

$$\hat{\mathcal R}_z(\Pi\circ\mathcal F) \le \hat{\mathcal R}_z(\operatorname{conv}(\Pi\circ\mathcal H_{p,\zeta})) = \hat{\mathcal R}_z(\Pi\circ\mathcal H_{p,\zeta}) \le 2\zeta\sqrt{\frac{p\log(2\max\{pd(nq),\, q\})}{nq}} \le 2\zeta\sqrt{\frac{p\log(2pdnq)}{nq}}.$$

Taking a supremum over all $z \in (\mathcal X\times[q])^{nq}$ we have $\mathcal R_{nq}(\Pi\circ\mathcal F) \le 2\zeta\sqrt{p\log(2pdnq)\,(nq)^{-1}}$. Note also that $\mathcal F \subseteq \mathcal M(\mathcal X, [-\beta, \beta]^q)$, since each $f \in \mathcal F$ is of the form $f = \sum_{t\in[T]}\alpha_t h_t$ with $h_t \in \mathcal H^1_{p,\tau_t} \subseteq \mathcal M(\mathcal X, [-1, +1]^q)$ and $\sum_{t\in[T]}\alpha_t \le \beta$. Thus, plugging the bound on $\mathcal R_{nq}(\Pi\circ\mathcal F)$ into Theorem 1 yields the bound in Theorem 7. $\square$

## 6. Conclusions

We presented a theoretical analysis of multi-output learning, based on a self-bounding Lipschitz condition. Under this condition, we obtained favourable dependence on both the sample size and the output dimension. The main analytic tool is a new contraction inequality for the local Rademacher complexity of vector-valued function classes with a self-bounding Lipschitz loss, which may be of independent interest. Theorem 1 can be applied to any multi-output prediction problem where one can obtain an upper bound on the Rademacher complexity $\mathcal R_{nq}(\Pi\circ\mathcal F)$. We demonstrated this by applying our approach to ensembles of decision trees, yielding state-of-the-art results.

## Acknowledgement

This work is funded by EPSRC (grant EP/P004245/1) and partially by the Turing Institute (grant EP/N510129/1).

## References

Agrawal, R., Gupta, A., Prabhu, Y., and Varma, M. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd International Conference on World Wide Web, pp. 13–24, 2013.

Babbar, R. and Schölkopf, B. DiSMEC: Distributed sparse machines for extreme multi-label classification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 721–729, 2017.

Bartlett, P. L., Bousquet, O., and Mendelson, S. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.

Bernstein, S. On a modification of Chebyshev's inequality and of the error formula of Laplace. Ann. Sci. Inst. Sav. Ukraine, Sect. Math, 1(4):38–49, 1924.

Bhatia, K., Jain, H., Kar, P., Varma, M., and Jain, P. Sparse local embeddings for extreme multi-label classification. In Advances in Neural Information Processing Systems, pp. 730–738, 2015.

Borchani, H., Varando, G., Bielza, C., and Larrañaga, P. A survey on multi-output regression. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(5):216–233, 2015.

Boucheron, S., Bousquet, O., and Lugosi, G. Theory of classification: A survey of some recent advances, 2005.

Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Bousquet, O. Concentration inequalities and empirical processes theory applied to the analysis of learning algorithms. PhD thesis, 2002.
Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, 2016.

Chzhen, E. Classification of sparse binary vectors. arXiv preprint arXiv:1903.11867, 2019.

Chzhen, E., Denis, C., Hebiri, M., and Salmon, J. On the benefits of output sparsity for multi-label classification. arXiv preprint arXiv:1703.04697, 2017.

Cortes, C., Kuznetsov, V., Mohri, M., and Yang, S. Structured prediction theory based on factor graph complexity. In International Conference on Machine Learning, pp. 2522–2530, 2016.

Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(Dec):265–292, 2001.

Dudley, R. M. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.

Geng, X. Label distribution learning. IEEE Transactions on Knowledge and Data Engineering, 28(7):1734–1748, 2016.

Guermeur, Y. Lp-norm Sauer–Shelah lemma for margin multi-category classifiers. Journal of Computer and System Sciences, 89:450–473, 2017.

Jain, H., Balasubramanian, V., Chunduri, B., and Varma, M. Slice: Scalable linear extreme classifiers trained on 100 million labels for related searches. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 528–536, 2019.

Koltchinskii, V. et al. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.

Kuznetsov, V., Mohri, M., and Syed, U. Multi-class deep boosting. In Advances in Neural Information Processing Systems, pp. 2501–2509, 2014.

Kuznetsov, V., Mohri, M., and Syed, U. Rademacher complexity margin bounds for learning with a large number of classes. In ICML Workshop on Extreme Classification: Learning with a Very Large Number of Labels, 2015.

Ledoux, M. and Talagrand, M. Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.

Lei, Y., Dogan, U., Binder, A., and Kloft, M. Multi-class SVMs: From tighter data-dependent generalization bounds to novel algorithms. In Advances in Neural Information Processing Systems, pp. 2035–2043, 2015.

Lei, Y., Ding, L., and Bi, Y. Local Rademacher complexity bounds based on covering numbers. Neurocomputing, 218:320–330, 2016.

Lei, Y., Dogan, U., Zhou, D.-X., and Kloft, M. Data-dependent generalization bounds for multi-class classification. IEEE Transactions on Information Theory, 65(5):2995–3021, 2019.

Li, J., Liu, Y., Yin, R., Zhang, H., Ding, L., and Wang, W. Multi-class learning: From theory to algorithm. In Advances in Neural Information Processing Systems 31, pp. 1586–1595. Curran Associates, Inc., 2018.

Li, J., Liu, Y., Yin, R., and Wang, W. Multi-class learning using unlabeled samples: Theory and algorithm. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 2880–2886. AAAI Press, 2019.

Liu, Y., Li, J., Ding, L., Liu, X., and Wang, W. Learning vector-valued functions with local Rademacher complexity and unlabeled data, 2019.

Mammen, E. and Tsybakov, A. B. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999. doi: 10.1214/aos/1017939240.

Maurer, A. A vector-contraction inequality for Rademacher complexities. In International Conference on Algorithmic Learning Theory, pp. 3–17. Springer, 2016.
Menon, A. K., Rawat, A. S., Reddi, S., and Kumar, S. Multilabel reductions: What is my loss optimising? In Advances in Neural Information Processing Systems, pp. 10599–10610, 2019.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, 2012.

Musayeva, K., Lauer, F., and Guermeur, Y. Rademacher complexity and generalization performance of multi-category margin classifiers. Neurocomputing, pp. 6–15, 2019.

Reddi, S. J., Kale, S., Yu, F., Holtmann-Rice, D., Chen, J., and Kumar, S. Stochastic negative mining for learning with large output spaces. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1940–1949, 2019.

Reeve, H. W. and Kabán, A. Optimistic bounds for multi-output prediction, 2020.

Schapire, R. E. and Freund, Y. Boosting: Foundations and Algorithms. Kybernetes, 2013.

Srebro, N., Sridharan, K., and Tewari, A. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, pp. 2199–2207, 2010.

Talagrand, M. Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems, volume 60. Springer Science & Business Media, 2014.

Tsoumakas, G. and Katakis, I. Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM), 3(3):1–13, 2007.

Xu, C., Liu, T., Tao, D., and Xu, C. Local Rademacher complexity for multi-label learning. IEEE Transactions on Image Processing, 25(3):1495–1507, 2016.

Xu, D., Shi, Y., Tsang, I. W., Ong, Y.-S., Gong, C., and Shen, X. Survey on multi-output learning. IEEE Transactions on Neural Networks and Learning Systems, 2019.

Zhang, M.-L. and Zhou, Z.-H. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819–1837, 2013.