# Fine-grained Generalization Analysis of Vector-valued Learning

Liang Wu (1), Antoine Ledent (2), Yunwen Lei (3, 2) and Marius Kloft (2)

(1) Center of Statistical Research, School of Statistics, Southwestern University of Finance and Economics, Chengdu 611130, China
(2) Department of Computer Science, TU Kaiserslautern, 67653 Kaiserslautern, Germany
(3) School of Computer Science, University of Birmingham, Birmingham B15 2TT, United Kingdom
wuliang@swufe.edu.cn, ledent@cs.uni-kl.de, y.lei@bham.ac.uk, kloft@cs.uni-kl.de

Abstract

Many fundamental machine learning tasks can be formulated as a problem of learning with vector-valued functions, where we learn multiple scalar-valued functions together. Although there is some generalization analysis on different specific algorithms under the empirical risk minimization principle, a unifying analysis of vector-valued learning under a regularization framework is still lacking. In this paper, we initiate the generalization analysis of regularized vector-valued learning algorithms by presenting bounds with a mild dependency on the output dimension and a fast rate on the sample size. Our discussion relaxes the existing assumptions on the restrictive constraint of hypothesis spaces, the smoothness of loss functions and a low-noise condition. To understand the interaction between optimization and learning, we further use our results to derive the first generalization bounds for stochastic gradient descent with vector-valued functions. We apply our general results to multi-class classification and multi-label classification, which yield the first bounds with a logarithmic dependency on the output dimension for extreme multi-label classification with the Frobenius regularization. As a byproduct, we derive a Rademacher complexity bound for loss function classes defined in terms of a general strongly convex function.
Introduction

In machine learning, we often encounter learning tasks involving vector-valued prediction functions (Alvarez, Rosasco, and Lawrence 2012; Xu et al. 2019). As examples consider the following. In multi-class classification (MCC), we aim to assign each instance to a single label class (Mohri, Rostamizadeh, and Talwalkar 2012; Crammer and Singer 2002). In multi-label classification (MLC), each instance may be annotated with one or multiple class labels (Zhou et al. 2012; Yu et al. 2014). In both cases, we build one scalar-valued prediction function per class. Together these functions form a vector-valued predictor with the output dimension being the total number of label classes. These two important learning tasks have found wide applications in real systems, where they are used, for instance, for image and video annotation, face classification, and query/keyword suggestion (Yu et al. 2014). Another popular learning setting where we encounter vector-valued prediction functions is multi-task learning (MTL). Here we build, for each task, a distinct predictor. The benefit of learning these predictors together is that of exploiting a shared hidden structure in these tasks (Zhang and Yang 2017; Maurer and Pontil 2016; Yousefi et al. 2018; Argyriou et al. 2007; Ciliberto et al. 2017).

(Footnote: LW and YL acknowledge support by the National Natural Science Foundation of China (Grant Nos. 61903309, 61806091, 11771012) and the Fundamental Research Funds for the Central Universities (JBK1806002). YL also acknowledges support by the Alexander von Humboldt Foundation. MK acknowledges support by the Carl-Zeiss Foundation, by the German Research Foundation (DFG) award KL 2698/2-1, and by the Federal Ministry of Science and Education (BMBF) awards 01IS18051A and 031B0770E. The corresponding author is Yunwen Lei. Copyright (c) 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.)
The more these tasks are related, the larger the shared structure and the benefit of learning them together. We refer to problems of learning a vector-valued prediction function such as the above ones as vector-valued learning problems (Micchelli and Pontil 2005; Liu et al. 2019; Xu et al. 2019; Alvarez, Rosasco, and Lawrence 2012; Lei et al. 2015a). An important measure of the quality of vector-valued learning models is their generalization performance, i.e., their ability to generalize their empirical behavior on training examples to unseen test data. As a central topic in statistical learning theory, generalization analysis has received a lot of attention. Other than providing an intuitive understanding of how different parameters affect the learning performance, generalization analysis is also effective in designing novel learning machines (Cortes, Kloft, and Mohri 2013). Unlike in traditional binary classification problems, a distinguished property of vector-valued learning is that the output dimension plays an important role in the analysis (Reddi et al. 2019). This is especially the case for problems with a huge output dimension, which are becoming more and more ubiquitous in the big data era. One such example is eXtreme Classification (XC; Bengio et al. 2019). Here we deal with multi-class and multi-label problems involving an extremely large total number of potential class labels (Jain et al. 2019). Generalization bounds with an emphasis on the output dimension were developed for specific vector-valued learning problems such as MCC (Zhang 2004; Lei et al. 2015b; Guermeur 2017; Li et al. 2018; Musayeva, Lauer, and Guermeur 2019) or MLC (Yu et al. 2014; Liu et al. 2018; Shen et al. 2018; Xu et al. 2016; Khandagale, Xiao, and Babbar 2020). Recently, Maurer and Pontil (2016) and Li, Liu, and Wang (2019) initiated the study of the general framework of vector-valued learning.

(Published at the Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI-21.)
These studies exhibit the following limitations. First, they exploit the Lipschitz continuity of loss functions with respect to (w.r.t.) the Euclidean norm, while typical loss functions occurring in vector-valued learning problems are Lipschitz continuous w.r.t. the ℓ∞-norm (Lei et al. 2019), with a Lipschitz constant comparable to that for the Euclidean norm (note the ℓ∞-norm can be significantly smaller than the Euclidean norm). This mismatch between generalization analysis and Lipschitz continuity induces, for standard Euclidean regularization, a square-root dependency on the number of components (that is, the number of classes or tasks in MCC/MLC and MTL, respectively) (Maurer and Pontil 2016; Li, Liu, and Wang 2019). Second, there is the following conceptual mismatch between algorithms and theory. While the theory is developed for empirical risk minimization (ERM) in a constrained space (Maurer and Pontil 2016; Li, Liu, and Wang 2019; Reeve and Kaban 2020), in practice, regularization schemes are used, which are oftentimes easier to solve (Yu et al. 2014; Lei et al. 2015b; Li et al. 2018; Li, Liu, and Wang 2019). Third, the existing generalization analysis fails to take into account the computational properties of the algorithm, which are important to understand the interaction between optimization and learning. Lastly, most existing studies are limited to the classic regime of slow rates (Ω(n^{-1/2}), where n is the sample size). In order to achieve so-called fast rates (Bartlett, Bousquet, and Mendelson 2005), they require restrictive assumptions, such as the smoothness of the loss function (Reeve and Kaban 2020), a capacity assumption on the hypothesis space, or the existence of a model with vanishing errors. In this paper, we address all the above issues. Our contributions are as follows. 1.
We show the first generalization bounds for general vector-valued learning using a regularization framework, thus removing the gap between the algorithm that is analyzed in theory and the one that is used in practice.

2. Not only do we analyze a more realistic model, we also dramatically improve the best known dependency of bounds for general vector-valued learning (in either framework). For instance, for standard Frobenius regularization, we drop the dependency on the number of components in the model from √c to log c.

3. In MLC, the components are the classes. Thus our result establishes guarantees that scale logarithmically in the number of classes. This is remarkable because the previously best result for MLC scaled as the square root of the number of classes. Thus, for the first time, we establish non-vacuous theory for extreme multi-label classification (and, as mentioned in Point 1, our analyzed algorithm is more realistic). Note that MLC is by far the most common scenario in XC.

4. Our results apply also to the fast-rate regime, that is, the learning scenario where bounds enjoy a fast decay in n. In this regime, our analysis not only improves the best known rates in c (see Point 2) over previous work, but also lifts the assumptions employed therein on the smoothness of the loss function and the hypothesis space.

Related Work

Here we survey the related work on vector-valued learning. There is a large body of work on the generalization analysis of MCC (Lei et al. 2019; Thrampoulidis, Oymak, and Soltanolkotabi 2020), based on various capacity measures: e.g., covering numbers (Zhang 2004; Lei et al. 2019), the fat-shattering dimension (Guermeur 2017), and (local) Rademacher complexities (Mohri, Rostamizadeh, and Talwalkar 2012; Lei et al. 2015b; Maurer 2016; Cortes et al. 2016; Li et al. 2018; Maximov, Amini, and Harchaoui 2018; Musayeva, Lauer, and Guermeur 2018, 2019; Deshmukh et al. 2019).
Unlike in binary classification problems, the number of classes plays here an important role in the generalization performance. Until recently, the coupling among the class components, while exploited by most practical multi-class algorithms (Crammer and Singer 2002), was ignored by generalization bounds. As a result, they exhibited at least a linear dependency on the number of classes (Mohri, Rostamizadeh, and Talwalkar 2012). Subsequently, several works aimed to improve this dependency through structural results on Gaussian and Rademacher complexities that exploit the coupling among classes, first achieving a square-root (Guermeur 2017; Maurer 2016; Lei et al. 2015b) and later on a logarithmic dependency (Lei et al. 2019). For the low-noise regime, Li et al. (2018) show a fast-rate generalization bound based on local Rademacher complexities. There is far less work on the theoretical analysis of MLC. The consistency of MLC with different loss functions was studied in Gao and Zhou (2011). For decomposable loss functions involving the specific least-squares loss, generalization bounds based on the Rademacher complexity were derived in Yu et al. (2014). The best known dependency on the output dimension is square-root and was shown in Liu et al. (2018) and Wu and Zhu (2020). Xu et al. (2016) bounded the local Rademacher complexity of MLC, which motivated the authors to study a novel MLC algorithm based on the tail sum of singular values of the predictors. Generalization analysis for general vector-valued learning algorithms was initiated by Maurer and Pontil (2016) and put forward by Li, Liu, and Wang (2019). However, their analysis implies generalization bounds with a square-root dependency on the output dimension for reasonable regularizers such as those based on the Frobenius norm. This is because they work with the Lipschitz continuity w.r.t. the Euclidean norm. In the present work, we consider the infinity norm instead.
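The gap between the two norms can be checked numerically. The following minimal sketch (our own illustration, not from the paper) verifies the elementary inequalities ‖t‖∞ ≤ ‖t‖₂ ≤ √c ‖t‖∞ that make the choice of norm matter; the helper name `norm_gap` is hypothetical.

```python
import numpy as np

def norm_gap(t):
    """Return (||t||_inf, ||t||_2) for a score vector t in R^c."""
    return np.abs(t).max(), np.linalg.norm(t)

# For any t in R^c we have ||t||_inf <= ||t||_2 <= sqrt(c) * ||t||_inf,
# so a Lipschitz analysis w.r.t. the Euclidean norm can hide a factor
# of up to sqrt(c) that an infinity-norm analysis avoids.
rng = np.random.default_rng(0)
for c in (10, 100, 10_000):
    t = rng.normal(size=c)
    inf_norm, two_norm = norm_gap(t)
    assert inf_norm <= two_norm <= np.sqrt(c) * inf_norm
```

For i.i.d. Gaussian scores the ratio ‖t‖₂/‖t‖∞ indeed grows with c, which is why a c-independent ℓ∞-Lipschitz constant is the stronger and more useful property.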
Notice that the existing results for MCC, MLC or general vector-valued learning algorithms are mainly established for ERM in a constrained hypothesis space (Maurer and Pontil 2016; Li, Liu, and Wang 2019; Reeve and Kaban 2020). In comparison, there is scarce work on vector-valued learning based on regularization, although in practice this is the most widely used scheme (Crammer and Singer 2002; Lei et al. 2015b; Yu et al. 2014).

Problem Formulation and Results

Problem Formulation

We describe here the framework of vector-valued learning with regularization. Let ρ be a probability measure defined over a sample space Z = X × Y, where X ⊆ R^d is an input space and Y is an output space (d is the input dimension). Let S = {z_1, ..., z_n} ∈ Z^n be the training dataset drawn independently from ρ. For vector-valued learning, we aim to build a vector-valued predictor h : X → R^c, i.e., h = (h_1, ..., h_c) with h_j : X → R. We denote by c the output dimension. We consider non-parametric learning in a reproducing kernel Hilbert space H_K associated with a Mercer kernel K : X × X → R. Let φ : X → H_K be the corresponding feature map, i.e., K(x, x') = ⟨φ(x), φ(x')⟩ for all x, x' ∈ X, with ⟨·,·⟩ being the inner product. Consider the hypothesis space W = {w = (w_1, ..., w_c) ∈ H_K^c}. We consider vector-valued predictors of the form h_w(x) = (h_1^w(x), ..., h_c^w(x)), where w ∈ W and h_j^w(x) = ⟨w_j, φ(x)⟩. Note that if φ(x) = x, then the model h_w can be represented by a matrix w ∈ R^{d×c} with w_j being the j-th column. The performance of h_w on a single example z is measured by a loss function ℓ : W × Z → R_+. An effective approach to building a model is to learn with regularization, where we build an objective function F_S : W → R_+ by

    F_S(w) = (1/n) Σ_{i=1}^n ℓ(w; z_i) + r(w).    (1)

The first term characterizes the empirical behavior of the model w on the training examples S, while the second term r : W → R_+ is a regularizer.
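In the linear case φ(x) = x, the objective (1) can be sketched in a few lines. The following is our own illustration under that assumption; the names `objective` and `frobenius_reg` are hypothetical, and the per-example loss is passed in as a callable.

```python
import numpy as np

def objective(W, X, Y, loss, regularizer):
    """Sketch of Eq. (1): F_S(w) = (1/n) * sum_i loss(w; z_i) + r(w),
    for the linear case phi(x) = x, so the predictions are h_w(x) = x @ W."""
    scores = X @ W                                   # (n, c) matrix of h_w(x_i)
    empirical = np.mean([loss(s, y) for s, y in zip(scores, Y)])
    return empirical + regularizer(W)

def frobenius_reg(W, sigma=1.0):
    """r(w) = (sigma/2) * sum_j ||w_j||_2^2, the Frobenius regularizer."""
    return 0.5 * sigma * np.sum(W ** 2)
```

Any of the concrete losses discussed in the Applications section (multi-class hinge, multinomial logistic, subset loss, ranking loss) can be plugged in as the `loss` argument.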
The predictor is then obtained by minimizing the objective function over the hypothesis space (regularized risk minimization, RRM), i.e., w_S = arg min_{w ∈ W} F_S(w). The generalization behavior of a model on a test example can be measured by the regularized risk defined by F(w) = E_Z[ℓ(w; Z)] + r(w), where E_Z denotes the expectation w.r.t. Z. As we will show in the applications, this framework of vector-valued learning covers important learning tasks such as MCC and MLC by considering specific instantiations of the output space, loss functions and regularizers. For any k ∈ N, let [k] = {1, ..., k}. To study high-probability bounds for excess regularized risks of w_S, we introduce some necessary assumptions. The first assumption is the Lipschitz continuity of loss functions.

Assumption 1 (Lipschitz continuity). We assume ℓ satisfies a Lipschitz continuity w.r.t. the infinity norm as follows:

    |ℓ(w; z) − ℓ(w'; z)| ≤ L ‖h_w(x) − h_{w'}(x)‖_∞,    (2)

where L > 0 and ‖t‖_∞ = max_{j ∈ [c]} |t_j| for t = (t_1, ..., t_c).

A notable property is that we consider the Lipschitz continuity w.r.t. the infinity norm instead of the Euclidean norm used in the literature (Li et al. 2018; Li, Liu, and Wang 2019). Although these norms are equivalent, the involved Lipschitz constants can differ by up to a factor of √c, which plays an important role in the generalization behavior if the output dimension is large. Therefore, the L-Lipschitz condition w.r.t. the infinity norm is much stronger than that w.r.t. the Euclidean norm. Fortunately, popular loss functions in MLC and MCC actually satisfy this more restrictive condition of Lipschitz continuity w.r.t. the infinity norm, with a Lipschitz constant independent of c. This is why we can exploit the restrictive infinity-norm assumption to develop a bound with a better dependency on c, which is an advantage over the analysis w.r.t. the Euclidean norm (Li, Liu, and Wang 2019).
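Assumption 1 can be probed empirically for a concrete margin-based loss. The sketch below (our own illustration, not the paper's code; `margin_loss` is a hypothetical name) checks on random score vectors that the multi-class hinge changes by at most 2‖s − s'‖∞, regardless of the output dimension c.

```python
import numpy as np

def margin_loss(s, y):
    """Multi-class hinge, as in the multi-class SVM of the Applications
    section: max_{y' != y} max(0, 1 - (s_y - s_{y'}))."""
    others = np.delete(s, y)
    return max(0.0, 1.0 - (s[y] - others.max()))

# Randomized check of Assumption 1: the loss changes by at most
# L * ||s - s'||_inf with L = 2, no matter how large c is.
rng = np.random.default_rng(1)
for c in (5, 50, 5000):
    for _ in range(50):
        s, s2 = rng.normal(size=c), rng.normal(size=c)
        gap = abs(margin_loss(s, 0) - margin_loss(s2, 0))
        assert gap <= 2.0 * np.abs(s - s2).max() + 1e-9
```

The constant 2 arises because the map s ↦ max_{y'≠y} s_{y'} − s_y is a sum of two 1-Lipschitz terms w.r.t. ‖·‖∞.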
Our second assumption is the (strong) convexity of loss functions and regularizers (Sridharan, Shalev-Shwartz, and Srebro 2009; Kakade, Shalev-Shwartz, and Tewari 2012).

Assumption 2. We assume ℓ is convex w.r.t. the first argument. We also assume r is σ-strongly convex w.r.t. a norm ‖·‖, i.e., for all w, w' ∈ W,

    r(w) ≥ r(w') + ⟨w − w', r'(w')⟩ + (σ/2) ‖w − w'‖²,

where r'(w') denotes a subgradient of r at w'.

A popular strongly convex regularizer is r(w) = (σ/2) ‖w‖²_{2,p}, where ‖w‖_{2,p} = (Σ_{j=1}^c ‖w_j‖_2^p)^{1/p} is the ℓ_{2,p}-norm (p ≥ 1). Here ‖·‖_2 denotes the norm in H_K induced by ⟨·,·⟩. It is known that this regularizer is σ(p − 1)-strongly convex w.r.t. ‖·‖_{2,p} for p ∈ (1, 2] (Kakade, Shalev-Shwartz, and Tewari 2012). Another popular regularizer is r(w) = (σ/2) ‖w‖²_{S_p} for w ∈ R^{d×c}, where ‖w‖_{S_p} is the Schatten-p norm (Kakade, Shalev-Shwartz, and Tewari 2012). This regularizer is (p − 1)σ-strongly convex w.r.t. ‖·‖_{S_p} for p ∈ (1, 2]. Furthermore, we always assume ‖w‖ ≥ ‖w‖_{2,∞} for all w ∈ W. This is a very mild assumption and is satisfied by all the norms considered in this paper. Finally, we assume sup_{x ∈ X} ‖φ(x)‖_2 ≤ κ.

Main Results

We now present our results. We use the Rademacher complexity to measure the capacity of hypothesis spaces.

Definition 1 (Rademacher complexity). Let H be a class of real-valued functions defined over a space Z and S = {z_i}_{i=1}^n ∈ Z^n. The empirical Rademacher complexity of H with respect to S is defined as

    R_S(H) = E_ε [ sup_{h ∈ H} (1/n) Σ_{i=1}^n ε_i h(z_i) ],

where ε_1, ..., ε_n are independent Rademacher variables, i.e., each ε_i takes the values +1 and −1 with equal probability.

Our first result is an upper bound on the Rademacher complexity of loss function classes. This result is general in the sense that we define hypothesis spaces in terms of a general strongly convex function τ(w) = F(w) − F(w*), where w* = arg min_{w ∈ W} F(w). For any Λ > 0, we denote

    W_Λ = {w ∈ W : F(w) − F(w*) ≤ Λ}    (3)

and the associated class of loss functions

    F_Λ = {w ↦ ℓ(w; z) : w ∈ W_Λ}.    (4)

Theorem 1 (Rademacher complexity bound). Let Assumptions 1 and 2 hold.
Then there exists a constant C_1 independent of n, c, L and Λ such that

    R_S(F_Λ) ≤ C_1 L B̃ log²(nc) √( 2Λ / (nσ) ),

where B̃ = sup_{(x,j)} ‖φ_j(x)‖_* and ‖·‖_* is the dual norm of ‖·‖ in W. Here for any x ∈ X and j ∈ [c] we use the notation

    φ_j(x) := (0, ..., 0, φ(x), 0, ..., 0),

where φ(x) appears in the j-th component, preceded by j − 1 zero components and followed by c − j zero components.

With different loss functions and regularizers, Theorem 1 immediately implies specific Rademacher complexity bounds for different vector-valued learning problems. It is worth mentioning that the upper bound enjoys only a logarithmic dependency on the output dimension (we treat Λ as a constant, which is problem-dependent and should be tuned by cross-validation in practice). Therefore, this result is particularly useful for large-scale learning problems. This mild dependency is derived by exploiting the Lipschitz continuity of loss functions w.r.t. ‖·‖_∞ and a structural result to capture this Lipschitz continuity. The proof of Theorem 1 is given in the arXiv version.

Remark 1. We now compare this result with the existing Rademacher complexity bounds for loss function classes in vector-valued learning problems. Initially, the Rademacher complexity bounds (Lemma 8.1 in Mohri, Rostamizadeh, and Talwalkar (2012)) failed to capture the coupling of different components reflected by the constraint (e.g., ‖w‖ ≤ Λ for some norm ‖·‖) and therefore presented a crude dependency on the output dimension. These results were improved in Lei et al. (2015b) and Maurer (2016) by exploiting the Lipschitz continuity of loss functions w.r.t. the ℓ2-norm. Very recently, the Lipschitz continuity w.r.t. the ℓ∞-norm was also considered in the literature (Lei et al. 2019; Foster and Rakhlin 2019). However, the analysis in Foster and Rakhlin (2019) failed to exploit the coupling among components, while the analysis in Lei et al. (2019) only exploited this coupling enforced by some specific norms (a case-by-case investigation is required). For example, they require Khintchine inequalities for vectors and matrices to handle the (2, p)-norm and the Schatten-p norm, respectively.
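For intuition about Definition 1, the empirical Rademacher complexity of a finite function class can be estimated by Monte Carlo over the signs. This is our own illustrative sketch (the function name `empirical_rademacher` is hypothetical), not part of the paper's analysis.

```python
import numpy as np

def empirical_rademacher(values, n_draws=4000, seed=0):
    """Monte-Carlo estimate of Definition 1,
    R_S(H) = E_eps [ sup_{h in H} (1/n) sum_i eps_i h(z_i) ],
    for a finite class given as an (m, n) array of function values
    (m functions evaluated on the n points of S)."""
    rng = np.random.default_rng(seed)
    m, n = values.shape
    estimate = 0.0
    for _ in range(n_draws):
        eps = rng.choice((-1.0, 1.0), size=n)      # Rademacher signs
        estimate += np.max(values @ eps) / n       # sup over the m functions
    return estimate / n_draws
```

For example, for the symmetric class {h, −h} with h ≡ 1 on n = 100 points, the estimate is close to E|Σ_i ε_i|/n ≈ √(2/(πn)) ≈ 0.08, while a singleton class has complexity 0.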
In comparison, Theorem 1 provides a more general result where w is constrained via a strongly convex function. A nice property is that Theorem 1 treats the (2, p)-norm and the Schatten-p norm in exactly the same way, and does not need case-by-case discussions. Moreover, Theorem 1 also applies to other regularization schemes, e.g., learning with an entropic regularizer on a probability simplex.

We now present upper bounds on the excess regularized risk for vector-valued learning. We use a variant of the big-O notation, Õ, to hide logarithmic factors. The proof is given in the arXiv version.

Theorem 2 (Regularized risk bound for RRM). Let Assumptions 1 and 2 hold and let δ ∈ (0, 1). With probability at least 1 − δ, Eq. (5) holds uniformly for all w ∈ W:

    F(w) − F(w*) = max{ F_S(w) − F_S(w*), 1/(nσ) } + Õ( √( (F(w) − F(w*)) (log(1/δ) + B̃² L²) / (nσ) ) ).    (5)

In particular, the following inequality holds with probability at least 1 − δ:

    F(w_S) − F(w*) = Õ( (log(1/δ) + B̃² L²) / (nσ) ).    (6)

Eq. (5) applies uniformly to all w ∈ W, while Eq. (6) applies to the RRM scheme. Both of these upper bounds admit a very mild dependency on the output dimension. Unlike ERM, which generally implies slow rates O(1/√n) (Lei et al. 2019), we establish fast rates of the order Õ(1/n) for RRM by leveraging the strong convexity of the objective function. A notable property of this result is its generality. It requires only the Lipschitz continuity of loss functions and the strong convexity of objective functions to develop meaningful generalization bounds for any vector-valued learning method. It does not impose a capacity assumption on hypothesis spaces or a low-noise condition to obtain fast generalization bounds, as was required in existing treatments (Li et al. 2018; Li, Liu, and Wang 2019).

Remark 2. Generalization error bounds with an implicit dependency on the output dimension were recently derived for ERM with vector-valued functions (Li, Liu, and Wang 2019). The discussion there imposes a constraint in terms of either ‖w‖_{2,1} or ‖w‖_{S_1} on the hypothesis space.
In comparison, we obtain a logarithmic dependency by considering a regularizer involving the much milder norm ‖·‖_{2,2} (note that ‖w‖_{2,1} can be as large as √c ‖w‖_{2,2}). Indeed, the analysis in Li, Liu, and Wang (2019) would imply a generalization bound with a square-root dependency on c if the constraint ‖w‖_{2,1} ≤ Λ there were replaced by ‖w‖_{2,2} ≤ Λ. We achieve this improvement by exploiting the Lipschitz continuity of loss functions w.r.t. the infinity norm instead of w.r.t. the Euclidean norm. Although Lipschitz continuity w.r.t. the infinity norm is a much stronger assumption than that w.r.t. the Euclidean norm, it is satisfied by popular loss functions with a Lipschitz constant independent of c (cf. the section on Applications). Our analysis is able to fully exploit this stronger assumption and therefore obtains a tighter dependency on c. The use of strong convexity allows for a tighter dependency on n.

Theorem 2 establishes excess risk bounds for w_S under RRM, which is non-constructive in the sense that it does not tell us how the model w_S is built from the data. In the following theorem, we consider a constructive algorithm, namely SGD, which is a very popular optimization algorithm, especially useful for large-scale data analysis due to its simplicity and efficiency. To study its excess risk bounds, we need to consider both statistical and computational properties, as well as the trade-off realized by early stopping. We give the proof in the arXiv version.

Theorem 3 (Regularized risk bound for SGD). Assume ℓ(w; z) takes the form ℓ(w; z) = ψ(h_w(x), y), with ψ : R^c × Y → R a convex function w.r.t. the first argument. Let Assumption 1 hold and r(w) = (σ/2) ‖w‖²_{2,2}. Let f(w; z) = ℓ(w; z) + r(w). Let w_1 = 0 ∈ W and let {w_t}_{t ∈ N} be the sequence produced by SGD, i.e.,

    w_{t+1} = w_t − η_t f'(w_t; z_{i_t}),    (7)

where η_t = 1/(tσ) and {i_t}_t are independently drawn from the uniform distribution over [n].
Then for any δ ∈ (0, 1), with probability at least 1 − δ the following holds:

    F(w_T) − F(w*) = Õ( log(1/δ) / (min{n, T} σ) ).    (8)

If T is of the order of n, then with probability at least 1 − δ,

    F(w_T) − F(w*) = Õ( log(1/δ) / (nσ) ).    (9)

| Bound | Lipschitz continuity | Additional assumption | Method | Reference |
|---|---|---|---|---|
| c/√n | ℓ1-norm | No | ERM | Kuznetsov, Mohri, and Syed (2014) |
| √(c/n) | ℓ2-norm | No | ERM | Lei et al. (2015b) |
| c log³(n)/n | ℓ2-norm | smoothness and low noise | ERM | Li et al. (2018) |
| log²(nc)/√n | ℓ∞-norm | No | ERM | Lei et al. (2019) |
| c/n | ℓ2-norm | decay of singular values | ERM | Li, Liu, and Wang (2019) |
| log³(nc)/n | ℓ∞-norm | strong convexity | RRM/SGD | This work |

Table 1: Generalization bounds for MCC in a Frobenius learning framework (either a constraint ‖W‖_F ≤ 1 or a regularizer ‖W‖²_F).

Remark 3. According to Theorem 3, there is no need to run more iterations once T is of the order of n, since further computation would not further improve the generalization performance. In other words, a computationally efficient method is to early-stop SGD after a constant number of passes over the data. To the best of our knowledge, this is the first result on the generalization analysis of SGD for vector-valued learning algorithms.

Applications

In this section, we apply our general results to specific vector-valued learning tasks, including MCC and MLC. Let ℓ : R → R be convex, decreasing and (L/2)-Lipschitz continuous. Popular choices of ℓ include the hinge loss ℓ(t) = max{0, 1 − t} and the logistic loss ℓ(t) = log(1 + exp(−t)). We only consider applications to RRM here; our results directly apply to SGD as well.

Multi-class Classification

MCC is a classic learning problem which aims to assign a single class label to each training example, i.e., Y = [c]. In this problem setting, the j-th component of a model h : X → R^c measures the likelihood of the output label being j. The prediction is made by taking the component with the largest value as the predicted label, i.e., x ↦ arg max_{j ∈ [c]} h_j(x).
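The SGD scheme of Theorem 3, update (7) with step size η_t = 1/(tσ), can be sketched for the linear case φ(x) = x. The following is a minimal illustration (the name `sgd_rrm` is hypothetical), instantiated with the multinomial logistic loss that also appears in the examples and experiments below.

```python
import numpy as np

def sgd_rrm(X, y, c, sigma=0.1, T=2000, seed=0):
    """Sketch of Theorem 3's scheme: SGD on f(w; z) = l(w; z) + (sigma/2)||w||_{2,2}^2
    with update (7), w_{t+1} = w_t - eta_t f'(w_t; z_{i_t}) and eta_t = 1/(t*sigma).
    Here l is the multinomial logistic loss l(w; z) = log sum_j exp(<w_j - w_y, x>)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    W = np.zeros((d, c))                            # w_1 = 0
    for t in range(1, T + 1):
        i = rng.integers(n)                         # i_t uniform over [n]
        s = X[i] @ W
        p = np.exp(s - s.max())
        p /= p.sum()                                # softmax(s), gradient of logsumexp
        grad_s = p - np.eye(c)[y[i]]                # d l / d s
        W -= (1.0 / (t * sigma)) * (np.outer(X[i], grad_s) + sigma * W)
    return W
```

Per Remark 3, `T` would in practice be set to a constant number of passes over the data rather than run to convergence.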
We apply our result to three classical MCC models: the multi-class SVM, multinomial logistic regression and the top-k SVM. In Table 1, we compare generalization bounds for MCC in a Frobenius learning framework, i.e., with either a constraint ‖W‖_F ≤ 1 or a regularizer ‖W‖²_F.

Example 1 (Multi-class SVM). Let Y = [c]. Consider the objective function F_S in (1) with

    ℓ(w; z) = max_{y' ∈ [c] : y' ≠ y} ℓ( ⟨w_y − w_{y'}, φ(x)⟩ )

and r(w) = (σ/2) Σ_{j ∈ [c]} ‖w_j‖²_2. This is a margin-based loss function, and min_{y' ∈ [c] : y' ≠ y} ⟨w_y − w_{y'}, φ(x)⟩ is called the margin of h_w at z = (x, y). It is clear that the loss function is designed to favor a model with a large margin. This objective function recovers the multi-class SVM in Crammer and Singer (2002) by taking ℓ as the hinge loss. To apply Theorem 2, it suffices to verify Assumptions 1 and 2. It was shown that this loss is L-Lipschitz continuous w.r.t. ‖·‖_∞ (Lei et al. 2019). The convexity of the loss follows from the convexity of ℓ and the linearity of the hypothesis. The σ-strong convexity of r w.r.t. ‖·‖_{2,2} is known from the literature (Kakade, Shalev-Shwartz, and Tewari 2012). The dual norm is ‖·‖_{2,2}, and therefore B̃ ≤ κ. Therefore, we can apply Theorem 2 to develop the generalization bound F(w_S) − F(w*) = Õ(log(1/δ)/(nσ)) with high probability.

Example 2 (Multinomial logistic regression). Let Y = [c]. Consider the loss function

    ℓ(w; z) = log Σ_{j ∈ [c]} exp( ⟨w_j − w_y, φ(x)⟩ ).

Consider the objective function F_S in (1) with the above loss function and the regularizer r(w) = (σ/2) Σ_{j ∈ [c]} ‖w_j‖²_2. This learning scheme recovers the popular multinomial logistic regression. It was shown that the above loss function is 2-Lipschitz continuous w.r.t. ‖·‖_∞ (Lei et al. 2019). Assumption 2 also holds. Therefore, we can apply Theorem 2 to develop the generalization bound F(w_S) − F(w*) = Õ(log(1/δ)/(nσ)) with probability at least 1 − δ.

Example 3 (Top-k SVM). Let Y = [c].
Consider the loss

    ℓ(w; z) = max{ 0, (1/k) Σ_{j ∈ [k]} ( I_{y ≠ 1} + ⟨w_1 − w_y, φ(x)⟩, ..., I_{y ≠ c} + ⟨w_c − w_y, φ(x)⟩ )_{{j}} },

where I is the indicator function and, for any t ∈ R^c, the notation {·} denotes a permutation such that {j} is the index of the j-th largest score, i.e., t_{{1}} ≥ t_{{2}} ≥ ... ≥ t_{{c}}. If we use the above loss function and the regularizer r(w) = (σ/2) Σ_{j ∈ [c]} ‖w_j‖²_2 in (1), we recover the top-k SVM, which is useful to tackle ambiguity in class labels (Lapin, Hein, and Schiele 2015). It was shown that the above loss function is 2-Lipschitz continuous w.r.t. ‖·‖_∞ (Lei et al. 2019). Assumption 2 also holds. Therefore, we can apply Theorem 2 to develop the generalization bound F(w_S) − F(w*) = Õ(log(1/δ)/(nσ)) with probability at least 1 − δ.

Remark 4. The existing generalization analysis of MCC considers ERM in a constrained hypothesis space (Lei et al. 2019). However, RRM is more popular in MCC. For example, the multi-class SVM, multinomial logistic regression and top-k SVM in the above three examples were proposed in a regularization setting (Lapin, Hein, and Schiele 2015; Crammer and Singer 2002). Furthermore, the existing discussions generally imply a slow generalization bound O(1/√n) (Lei et al. 2015b, 2019) or require restrictive assumptions such as a low-noise condition and capacity assumptions to achieve a fast generalization bound O(1/n) (Li, Liu, and Wang 2019). In comparison, our discussion implies a fast bound O(1/n) without these assumptions.

Multi-label Classification

For MLC, each training example can be associated with one or more class labels (Yu et al. 2014; Zhang and Zhou 2013; Xu, Liu, and Geng 2020; Dembczyński et al. 2012; Wu and Zhu 2020). This can be realized by setting Y = {−1, +1}^c, i.e., each y = (y(1), ..., y(c)) is a binary vector where y(j) = 1 if the j-th label is relevant and y(j) = −1 if the j-th label is irrelevant.
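Returning briefly to the top-k SVM of Example 3: one common form of the top-k hinge loss (there are several variants in Lapin, Hein, and Schiele (2015); this sketch, with hypothetical name `topk_hinge`, shows just one) averages the k largest margin violations.

```python
import numpy as np

def topk_hinge(s, y, k=1):
    """One form of the top-k hinge loss (cf. Example 3): clip at zero the
    average of the k largest entries of a_j = 1[j != y] + s_j - s_y.
    For k = 1 this reduces to the multi-class SVM hinge of Example 1."""
    a = (np.arange(len(s)) != y).astype(float) + s - s[y]
    return max(0.0, np.sort(a)[-k:].mean())
```

For instance, with scores s = (0, 2, 0) and true label y = 1 (0-indexed as class 0), the largest violation is 3, so the k = 1 loss is 3, while k = 2 averages in the next violation of 1, giving 2.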
A popular approach to MLC is to learn a vector-valued function h : X → R^c and predict the output ŷ for x according to its sign, i.e., ŷ(j) = sgn(h_j(x)), where sgn(a) denotes the sign of a ∈ R. There are various performance measures to quantify the performance of a multi-label predictor, including the subset accuracy and the ranking loss. In the arXiv version, we compare generalization bounds for MLC in a Frobenius learning framework, i.e., with either a constraint ‖W‖_F ≤ 1 or a regularizer ‖W‖²_F.

Example 4 (Learning with subset loss). Let Y = {−1, +1}^c. Consider the loss function

    ℓ(w; z) = max_{j ∈ [c]} ℓ( y(j) ⟨w_j, φ(x)⟩ ).    (10)

This loss function is called the subset loss (Zhang and Zhou 2013). Note that ℓ(y(j) ⟨w_j, φ(x)⟩) is a standard loss if we consider the prediction of the j-th label as a standard binary classification problem. This loss function thus encourages us to predict all labels correctly. As we will show in Proposition 4, the subset loss is L-Lipschitz continuous w.r.t. ‖·‖_∞. If we consider the objective function (1) with the subset loss and the regularizer r(w) = (σ/2) Σ_{j ∈ [c]} ‖w_j‖²_2, then we can apply Theorem 2 to derive, with probability at least 1 − δ, that F(w_S) − F(w*) = Õ(log(1/δ)/(nσ)).

Example 5 (Learning with ranking loss). Let Y = {−1, +1}^c. For each y ∈ {−1, +1}^c we denote by y+ = {j ∈ [c] : y(j) = +1} and y− = {j ∈ [c] : y(j) = −1} the sets of relevant and irrelevant labels, respectively. Consider the following ranking loss (Zhang and Zhou 2013):

    ℓ(w; z) = (1 / (|y+| |y−|)) Σ_{j+ ∈ y+} Σ_{j− ∈ y−} ℓ( ⟨w_{j+} − w_{j−}, φ(x)⟩ ),    (11)

where |A| denotes the cardinality of a set A. Intuitively, the ranking loss encourages predictors with larger function values for relevant labels than for irrelevant labels. As we will show in Proposition 4, the ranking loss is L-Lipschitz continuous w.r.t. ‖·‖_∞. If we consider the objective function (1) with the ranking loss and the regularizer r(w) = (σ/2) Σ_{j ∈ [c]} ‖w_j‖²_2, then we can apply Theorem 2 to show, with probability at least 1 − δ, that F(w_S) − F(w*) = Õ(log(1/δ)/(nσ)).

Remark 5.
Generalization bounds of the order O(1/√n) were established for MLC with the decomposable loss ℓ(w; z) = (1/c) Σ_{j ∈ [c]} ( y(j) − ⟨w_j, x⟩ )² under ERM in a constrained space {w ∈ R^{d×c} : ‖w‖_{S_1} ≤ Λ} (Yu et al. 2014). Although this bound has no dependency on c, it requires a constraint in terms of ‖·‖_{S_1}, which can be as large as √c ‖·‖_{2,2} (i.e., a constraint ‖w‖_{2,2} ≤ Λ only yields the constraint ‖w‖_{S_1} ≤ √c Λ). In comparison, we consider a regularization scheme involving the much milder norm ‖·‖_{2,2}. Also, there is a gap between the theoretical analysis and the algorithm design: the generalization analysis there is for ERM, while the algorithm is designed based on a regularization scheme (Yu et al. 2014). Furthermore, our analysis for regularization implies a fast rate O(1/n). The proof of Proposition 4 is given in the arXiv version.

Proposition 4. 1. If ℓ is L-Lipschitz continuous, then the subset loss (10) is L-Lipschitz continuous w.r.t. ‖·‖_∞. 2. If ℓ is (L/2)-Lipschitz continuous, then the ranking loss (11) is L-Lipschitz continuous w.r.t. ‖·‖_∞.

Experimental Verification

In this section, we present experimental results to verify our theoretical analysis. We consider a specific vector-valued learning problem, multinomial logistic regression, where the aim is to predict a class label for each training example. We apply SGD (7) to solve the optimization problem with the objective function

    F_S(w) = (1/n) Σ_{i=1}^n log Σ_{j ∈ [c]} exp( ⟨w_j − w_{y_i}, x_i⟩ ) + λ ‖w‖²_{2,2},    (12)

where w = (w_1, ..., w_c) ∈ R^{d×c} and λ = 0.01. We set the initial point w_1 = 0 and the step size η_t = 1/(λt + 1). We consider four real-world datasets available from the LIBSVM homepage (Chang and Lin 2011), whose details are summarized in the arXiv version. We repeat each experiment 50 times and report the average as well as the standard deviation of the results. We call F_S and F the training error and testing error, respectively. We consider two experiments. In the first experiment, we aim to verify the generalization bound in (8).
We randomly use 80% of the data for training and reserve the remaining 20% for testing. We plot the testing errors $F(\mathbf{w}_t)$ of the SGD sequence versus the number of passes, i.e., $t/n$, in Figure 1. It is clear that the testing error decreases initially as we run more SGD iterations. After a sufficient number of iterations, the generalization performance of the SGD iterates no longer improves. This is consistent with the generalization bound in (8), which shows that the bound is dominated by the effect of the sample size if $T\asymp n$.

Figure 1: Testing errors versus the number of passes (panel (b): CIFAR10). Figure 2: Training and testing errors versus the training data size (panel (b): CIFAR10). Figure 3: $F(\mathbf{w}_T)-F_S(\mathbf{w}_T)$ versus the training data size (panel (b): CIFAR10).

For the second experiment, our aim is to check whether the dependency of the generalization bound (9) on the sample size is really captured in practice. For each dataset, we randomly select different numbers of examples for training and reserve the remainder for testing. For each fixed number of training examples, we train a model by solving (12) with SGD and stop after 5 passes over the data. We compute both the training error $F_S(\mathbf{w}_T)$ and the testing error $F(\mathbf{w}_T)$ of the output model $\mathbf{w}_T$ (the last iterate). In this way we obtain a training error and a testing error for each considered sample size. We then plot the behavior of these errors versus the number of training examples in Figure 2. According to Figure 2, the training errors increase with the number of training examples. This phenomenon is due to the increasing difficulty of the optimization problem as the training set grows: fitting a model on 10,000 examples is significantly harder than fitting one on 100 examples. It is also clear that the generalization behavior improves as we have more training examples. This matches the generalization bound (9) well.
In Figure 3 we further plot the difference between the testing error and the training error, $F(\mathbf{w}_T)-F_S(\mathbf{w}_T)$, versus the sample size.

Conclusions

We present a unifying generalization analysis for regularized learning with vector-valued functions. Our generalization bounds admit a mild dependency on the output dimension and decay fast w.r.t. the sample size. For instance, for MLC they improve the best known dependency on the number of classes from $O(\sqrt{c})$ to $O(\log c)$, making them suitable for extreme classification. Furthermore, our analysis relaxes existing restrictive assumptions such as a low-noise condition or a smoothness requirement on loss functions. We develop the first generalization analysis of SGD for learning vector-valued functions, which is important for understanding both the statistical properties of models and the convergence of the algorithm. We present applications to specific learning machines and conduct experiments to verify our theory. It would be interesting to extend our analysis to distributed learning (Lin and Zhou 2018; Hu, Wu, and Zhou 2020). It would also be interesting to study non-strongly-convex regularizers such as ℓ1 regularizers to learn sparse models (Guo et al. 2017).

References

Alvarez, M. A.; Rosasco, L.; and Lawrence, N. D. 2012. Kernels for Vector-Valued Functions: A Review. Foundations and Trends in Machine Learning 4(3): 195–266. Argyriou, A.; Pontil, M.; Ying, Y.; and Micchelli, C. 2007. A spectral regularization framework for multi-task structure learning. Advances in Neural Information Processing Systems 20: 25–32. Bartlett, P.; Bousquet, O.; and Mendelson, S. 2005. Local Rademacher complexities. Annals of Statistics 33(4): 1497–1537. Bengio, S.; Dembczynski, K.; Joachims, T.; Kloft, M.; and Varma, M. 2019. Extreme classification (Dagstuhl seminar 18291). In Dagstuhl Reports, volume 8. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. Chang, C.-C.; and Lin, C.-J. 2011. LIBSVM: a library for support vector machines.
ACM Transactions on Intelligent Systems and Technology 2(3): 27. Ciliberto, C.; Rudi, A.; Rosasco, L.; and Pontil, M. 2017. Consistent multitask learning with nonlinear output relations. In Advances in Neural Information Processing Systems, 1986–1996. Cortes, C.; Kloft, M.; and Mohri, M. 2013. Learning kernels using local Rademacher complexity. In Advances in Neural Information Processing Systems, 2760–2768. Cortes, C.; Kuznetsov, V.; Mohri, M.; and Yang, S. 2016. Structured Prediction Theory Based on Factor Graph Complexity. In Advances in Neural Information Processing Systems, 2514–2522. Crammer, K.; and Singer, Y. 2002. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research 2: 265–292. Dembczyński, K.; Waegeman, W.; Cheng, W.; and Hüllermeier, E. 2012. On label dependence and loss minimization in multi-label classification. Machine Learning 88(1-2): 5–45. Deshmukh, A. A.; Lei, Y.; Sharma, S.; Dogan, U.; Cutler, J. W.; and Scott, C. 2019. A Generalization Error Bound for Multi-class Domain Generalization. arXiv preprint arXiv:1905.10392. Foster, D. J.; and Rakhlin, A. 2019. ℓ∞ Vector Contraction for Rademacher Complexity. arXiv preprint arXiv:1911.06468. Gao, W.; and Zhou, Z.-H. 2011. On the Consistency of Multi-Label Learning. In Conference on Learning Theory, volume 19, 341–358. Guermeur, Y. 2017. Lp-norm Sauer–Shelah lemma for margin multi-category classifiers. Journal of Computer and System Sciences 89: 450–473. Guo, Z.-C.; Xiang, D.-H.; Guo, X.; and Zhou, D.-X. 2017. Thresholded spectral algorithms for sparse approximations. Analysis and Applications 15(03): 433–455. Hu, T.; Wu, Q.; and Zhou, D.-X. 2020. Distributed kernel gradient descent algorithm for minimum error entropy principle. Applied and Computational Harmonic Analysis 49(1): 229–256. Jain, H.; Balasubramanian, V.; Chunduri, B.; and Varma, M. 2019.
Slice: Scalable linear extreme classifiers trained on 100 million labels for related searches. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 528–536. Kakade, S. M.; Shalev-Shwartz, S.; and Tewari, A. 2012. Regularization techniques for learning with matrices. Journal of Machine Learning Research 13: 1865–1890. Khandagale, S.; Xiao, H.; and Babbar, R. 2020. Bonsai: diverse and shallow trees for extreme multi-label classification. Machine Learning 109(11): 2099–2119. Kuznetsov, V.; Mohri, M.; and Syed, U. 2014. Multi-Class Deep Boosting. In Advances in Neural Information Processing Systems, 2501–2509. Lapin, M.; Hein, M.; and Schiele, B. 2015. Top-k Multiclass SVM. In Advances in Neural Information Processing Systems, 325–333. Lei, Y.; Binder, A.; Dogan, U.; and Kloft, M. 2015a. Theory and algorithms for the localized setting of learning kernels. In Feature Extraction: Modern Questions and Challenges, 173–195. PMLR. Lei, Y.; Dogan, U.; Binder, A.; and Kloft, M. 2015b. Multi-class SVMs: From Tighter Data-Dependent Generalization Bounds to Novel Algorithms. In Advances in Neural Information Processing Systems, 2026–2034. Lei, Y.; Dogan, U.; Zhou, D.-X.; and Kloft, M. 2019. Data-dependent Generalization Bounds for Multi-class Classification. IEEE Transactions on Information Theory 65(5): 2995–3021. Li, J.; Liu, Y.; and Wang, W. 2019. Learning Vector-valued Functions with Local Rademacher Complexity. arXiv preprint arXiv:1909.04883. Li, J.; Liu, Y.; Yin, R.; Zhang, H.; Ding, L.; and Wang, W. 2018. Multi-class learning: from theory to algorithm. In Advances in Neural Information Processing Systems, 1586–1595. Lin, S.-B.; and Zhou, D.-X. 2018. Distributed kernel-based gradient descent algorithms. Constructive Approximation 47(2): 249–276. Liu, C.; Zhao, P.; Huang, S.-J.; Jiang, Y.; and Zhou, Z.-H. 2018. Dual set multi-label learning. In Thirty-Second AAAI Conference on Artificial Intelligence. Liu, W.; Xu, D.; Tsang, I.
W.; and Zhang, W. 2019. Metric Learning for Multi-Output Tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence 41(2): 408–422. Maurer, A. 2016. A vector-contraction inequality for Rademacher complexities. In International Conference on Algorithmic Learning Theory, 3–17. Maurer, A.; and Pontil, M. 2016. Bounds for vector-valued function estimation. arXiv preprint arXiv:1606.01487. Maximov, Y.; Amini, M.-R.; and Harchaoui, Z. 2018. Rademacher complexity bounds for a penalized multi-class semi-supervised algorithm. Journal of Artificial Intelligence Research 61: 761–786. Micchelli, C. A.; and Pontil, M. 2005. On learning vector-valued functions. Neural Computation 17(1): 177–204. Mohri, M.; Rostamizadeh, A.; and Talwalkar, A. 2012. Foundations of Machine Learning. MIT Press. Musayeva, K.; Lauer, F.; and Guermeur, Y. 2019. Rademacher complexity and generalization performance of multi-category margin classifiers. Neurocomputing 342: 6–15. Musayeva, K.; Lauer, F.; and Guermeur, Y. 2018. A sharper bound on the Rademacher complexity of margin multi-category classifiers. In ESANN. Reddi, S. J.; Kale, S.; Yu, F.; Holtmann-Rice, D.; Chen, J.; and Kumar, S. 2019. Stochastic negative mining for learning with large output spaces. In International Conference on Artificial Intelligence and Statistics, 1940–1949. Reeve, H. W.; and Kaban, A. 2020. Optimistic Bounds for Multi-output Learning. In International Conference on Machine Learning, 8030–8040. Shen, X.; Liu, W.; Tsang, I. W.; Sun, Q.-S.; and Ong, Y.-S. 2018. Compact multi-label learning. In Thirty-Second AAAI Conference on Artificial Intelligence. Sridharan, K.; Shalev-Shwartz, S.; and Srebro, N. 2009. Fast rates for regularized objectives. In Advances in Neural Information Processing Systems, 1545–1552. Thrampoulidis, C.; Oymak, S.; and Soltanolkotabi, M. 2020. Theoretical Insights Into Multiclass Classification: A High-dimensional Asymptotic View. In Advances in Neural Information Processing Systems 33.
Wu, G.; and Zhu, J. 2020. Multi-label classification: do Hamming loss and subset accuracy really conflict with each other? In Advances in Neural Information Processing Systems 33. Xu, C.; Liu, T.; Tao, D.; and Xu, C. 2016. Local Rademacher complexity for multi-label learning. IEEE Transactions on Image Processing 25(3): 1495–1507. Xu, D.; Shi, Y.; Tsang, I. W.; Ong, Y.-S.; Gong, C.; and Shen, X. 2019. Survey on multi-output learning. IEEE Transactions on Neural Networks and Learning Systems. Xu, N.; Liu, Y.-P.; and Geng, X. 2020. Partial Multi-Label Learning with Label Distribution. In AAAI, 6510–6517. Yousefi, N.; Lei, Y.; Kloft, M.; Mollaghasemi, M.; and Anagnostopoulos, G. 2018. Local Rademacher complexity-based learning guarantees for multi-task learning. Journal of Machine Learning Research 19: 1385–1431. Yu, H.-F.; Jain, P.; Kar, P.; and Dhillon, I. 2014. Large-scale multi-label learning with missing labels. In International Conference on Machine Learning, 593–601. Zhang, M.-L.; and Zhou, Z.-H. 2013. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 26(8): 1819–1837. Zhang, T. 2004. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research 5: 1225–1251. Zhang, Y.; and Yang, Q. 2017. A survey on multi-task learning. arXiv preprint arXiv:1707.08114. Zhou, Z.-H.; Zhang, M.-L.; Huang, S.-J.; and Li, Y.-F. 2012. Multi-instance multi-label learning. Artificial Intelligence 176(1): 2291–2320.