# Generalization Analysis for Multi-Label Learning

Yi-Fan Zhang 1 2, Min-Ling Zhang 2 3

1School of Cyber Science and Engineering, Southeast University, Nanjing 210096, China. 2Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China. 3School of Computer Science and Engineering, Southeast University, Nanjing 210096, China. Correspondence to: Min-Ling Zhang.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

Despite great advances in algorithms for multi-label learning, research on the theoretical analysis of generalization is still at an early stage. Some recent theoretical results have investigated the generalization performance of multi-label learning under several evaluation metrics; however, how to reduce the dependency on the number of labels, explicitly introduce label correlations, and quantitatively analyze the impact of various inductive biases in the generalization analysis of multi-label learning remains a crucial and open problem. In an attempt to fill this gap in the generalization theory of multi-label learning, we develop several novel vector-contraction inequalities, which exploit the Lipschitz continuity of loss functions, and derive generalization bounds with a weaker dependency on the number of labels than the state of the art in the case of decoupling the relationship among different components, which serve as theoretical guarantees for the generalization of multi-label learning. In addition, we derive the generalization bound for Macro-Averaged AUC and analyze its relationship with class-imbalance. The mild bounds without strong assumptions explain the good generalization ability of multi-label learning with first-order label correlations and high-order label correlations induced by norm regularizers.

1. Introduction

Multi-label learning is one of the most studied and practically important machine learning paradigms, in which each object is represented by a single instance while being associated with a set of labels instead of a single label. The goal of multi-label learning is to learn a hypothesis which can predict the proper set of labels for an unseen instance. It has made important advances in text categorization (Schapire & Singer, 2000; Rubin et al., 2012), multimedia content annotation (Boutell et al., 2004; Cabral et al., 2011), bioinformatics (Barutçuoglu et al., 2006; Cesa-Bianchi et al., 2012) and other fields (Yu et al., 2005). Although multi-label learning has achieved impressive empirical advances across a wide range of tasks (Zhang & Zhou, 2014), the problem of understanding multi-label learning theoretically remains relatively under-explored. The generalization ability of learning machines, i.e., the performance on unseen data of models trained on a given dataset, is an important question of theoretical research in machine learning, and multi-label learning is no exception. Explaining why multi-label models generalize well is an important open problem in the multi-label learning community. Uniform convergence is a powerful tool in learning theory for understanding the generalization ability of learners, and it has also been used in the generalization analysis of multi-label learning (Yu et al., 2014; Xu et al., 2016; Wu & Zhu, 2020; Wu et al., 2021b;a).
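To make the quantities discussed below concrete, the following minimal sketch (our illustration, not code from the paper) builds a hypothetical multi-label dataset, fits a linear vector-valued scorer f(x) = W^T x, thresholds it into label sets, and compares the empirical risk with a held-out proxy for the expected risk — the gap that the generalization bounds in this paper aim to control. All names and the data-generating process are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 500, 20, 5          # examples, input dimension, number of labels

# Hypothetical data-generating process: a ground-truth linear scorer plus
# noise; the j-th label is +1 if the j-th score is positive, else -1.
W_true = rng.normal(size=(d, c))
def sample(m):
    X = rng.normal(size=(m, d))
    Y = np.where(X @ W_true + 0.5 * rng.normal(size=(m, c)) > 0, 1, -1)
    return X, Y

X_train, Y_train = sample(n)
X_test, Y_test = sample(10 * n)   # large held-out set as a proxy for the expected risk

# A deliberately simple learned scorer f(x) = W^T x fitted by least squares on
# the +/-1 targets; the classifier h(x) thresholds each component of f(x) at zero.
W = np.linalg.lstsq(X_train, Y_train, rcond=None)[0]

def hamming_risk(X, Y, W):
    """Fraction of (instance, label) pairs predicted with the wrong sign."""
    pred = np.where(X @ W > 0, 1, -1)
    return np.mean(pred != Y)

train_risk = hamming_risk(X_train, Y_train, W)
test_risk = hamming_risk(X_test, Y_test, W)
print(f"empirical risk     : {train_risk:.3f}")
print(f"test (proxy) risk  : {test_risk:.3f}")
print(f"generalization gap : {test_risk - train_risk:.3f}")
```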
However, progress on the generalization analysis of multi-label learning remains scarce. A satisfactory and complete study of the generalization analysis for multi-label learning should cover three aspects: 1) the reduction of the dependency of the generalization bounds on the number of labels, 2) the explicit introduction of label correlations into the generalization analysis, and 3) the impact of various inductive biases on the generalization performance. First of all, the generalization analysis of multi-label learning is more difficult than that of traditional supervised learning owing to the difference in problem settings. In particular, the vector-valued output of multi-label learning renders typical theoretical results inapplicable to multi-label learning (Maurer, 2016; Wu & Zhu, 2020). Hence, how to reduce the dependency of the generalization bounds on the number of labels is a critical problem. Secondly, taking label correlations into account can often effectively improve the generalization performance of multi-label learning (Zhang & Zhou, 2014), so it is necessary to explicitly and formally introduce label correlations into the generalization analysis. Finally, the impressive empirical success of multi-label learning algorithms motivates us to further investigate the inductive biases induced by these algorithms, e.g., the inductive bias induced by various norm regularizers (Huang et al., 2016) or label-specific features (Zhang & Wu, 2015; Hang & Zhang, 2022; Jia et al., 2023). Therefore, theoretical research on generalization can promote a better understanding of multi-label learning.

In this paper, we derive novel and tighter bounds based on the Rademacher complexity for multi-label learning. Specifically, for ℓ2 Lipschitz loss, we improve the basic bound with a linear dependency on the number of labels to one that is independent of the number of labels, which decouples the relationship among different components. For ℓ∞ Lipschitz loss, we improve the bound with a square-root dependency to one that is independent of the number of labels, which also decouples the relationship among different components. These bounds are tighter than the state of the art. We also give several tight bounds for the coupling case to study the impact of different types of label correlations on the generalization analysis. Finally, we derive the generalization bound for Macro-Averaged AUC based on the label-based ranking multi-label Rademacher complexity, and analyze the relationship between Macro-Averaged AUC and class-imbalance. Our generalization bounds reduce the dependency on the number of labels and account for different types of label correlations. Major contributions of the paper include:

- We prove several novel vector-contraction inequalities for the generalization analysis of multi-label learning, which exploit the Lipschitz continuity of the loss function with respect to the ℓ2 and ℓ∞ norms and decouple the relationship among different components.

- We derive generalization bounds for general function classes with a weaker dependency on the number of labels than the state of the art, which provide general theoretical guarantees for multi-label learning with different types of label correlations.

- We introduce the label-based ranking multi-label Rademacher complexity and analyze the relationship between Macro-Averaged AUC and class-imbalance according to the generalization bound.

We structure our work as follows.
We first introduce an overview of the problem setting for multi-label learning, the definitions of the related evaluation metrics and complexity measures in Section 2. We then present our main results in Sections 3, where we develop several novel vectorcontraction inequalities and derive bounds with a weaker dependency on the number of labels than the state of the art for ℓ2 and ℓ Lipschitz loss, and study the impact of different types of label correlations. In Section 4, we derive the bound for Macro-Averaged AUC and analyze its relationship with class-imbalance. In Section 5, we provide a comparison of our theoretical results with the related works. Finally, we give a conclusion of our work in Section 6. 2. Preliminaries In this section, we first present the problem setting for multilabel learning. Secondly, we give the definitions of commonly used surrogate losses. Finally, we introduce the related complexity measures involved in the main results. 2.1. Multi-Label Learning In the context of multi-label learning, given a dataset D = {(x1, Y1) , . . . , (xn, Yn)} with n examples which are identically and independently distributed (i.i.d.) from a probability distribution P on X Y, where X Rd denotes the d-dimensional input space and Y = {y1, . . . , yc} denotes the label space with c class labels, xi X, Yi Y. Let [n] := {1, . . . , n} for any natural number n and i [n]. Let Y = { 1, +1}c, i.e., each y = (y1, . . . , yc) is a binary vector and yj = 1 (yj = 1) denotes that the j-th label is relevant (irrelevant), j [c]. The task of multi-label learning is to learn a multi-label classifier h H : X 7 { 1, +1}c which assigns each instance with a set of relevant labels. A common strategy is to learn a vector-valued function f = (f1, . . . , fc) : X 7 Rc and derive the classifier by a thresholding function which dichotomizes the label space into relevant and irrelevant label sets. We consider the prediction function for each label of the general form fj(x) = wj, φ(x) , where φ represents a nonlinear mapping. We define a vector-valued function class of the multi-label learning as follows: F = {x 7 f(x) :f(x) = (f1(x), . . . , fc(x)), fj(x) = wj, φ(x) , x X, j [c] w = (w1, . . . , wc) Rd c, α(w) Λ, β(φ(x)) A, Λ > 0, A > 0}, (1) where α represents a functional that constrains weights, β represents a functional that constrains nonlinear mappings. For any function f : X 7 Rc, the quality of a prediction on a single example (x, y) is measured by a loss function L : Rc { 1, +1}c 7 R+. The goal is to learn a hypothesis f F with good generalization performance from the dataset D by optimizing the loss L. The generalization performance is measured by the expected risk: R(f) = E(x,y) P [L(f(x), y)]. We denote the empirical risk w.r.t. the training dataset D as b RD(f) = 1 n Pn i=1 L(f(xi), yi). In addition, we denote the optimal risk as R = inff F R(f) and denote the minimizer of the empirical risk as ˆ f = arg minf F b RD(f). Generalization Analysis for Multi-Label Learning The above definitions apply to Hamming loss, Subset loss and Ranking loss. For Macro-Averaged AUC, it involves the pairwise loss, so we additionally define the corresponding risks. Maximizing Macro-Averaged AUC is equivalent to minimizing the following empirical risk w.r.t. Macro-Averaged AUC: b RD(f) (2) 1 X+ j X j X ℓ0/1 (fj(xi) fj(x i)) , where X+ j = {xi | yj = +1, i [n]} (X j = {x i | yj = 1, i [n]}) corresponds to the set of test instances that are relevant (irrelevant) to the j-th label. The expected risk w.r.t. 
Macro-Averaged AUC is defined as R(f) = ED[ b RD(f)]. However, the above mentioned loss is typically the 0 1 loss, which is hard to handle in practice. Hence, one usually consider its surrogate losses. 2.2. Related Evaluation Metrics A number of evaluation metrics are proposed to measure the generalization performance of different approaches for multi-label learning. Here we focus on commonly used evaluation metrics, i.e., Hamming loss, Subset loss, Ranking loss and Macro-Averaged AUC, and their surrogate losses are defined as follows: Hamming Loss: LH(f(x), y) = 1 c Pc j=1 ℓ(yjfj(x)) , where the base convex surrogate loss ℓcan be various popular forms, such as the hinge, logistic and exponential loss. Subset Loss: LS(f(x), y) = maxj [c] {ℓ(yjfj(x))} . Ranking Loss: LR(f(x), y) = 1 |Y +| |Y | q Y ℓ(fp(x) fq(x)) , where Y + (Y ) denotes the relevant (irrelevant) label index set induced by y, and | | denotes the cardinality of a set. The surrogate loss for Macro-Averaged AUC: LM(f(xi, x i), y) = 1 j=1 ℓ(fj(xi) fj(x i)) , (3) where xi (x i) corresponds to the instances that are relevant (irrelevant) to the j-th label. 2.3. Related Complexity Measures Here we use the Rademacher complexity to perform generalization analysis for multi-label learning. Definition 2.1 (Rademacher complexity). Let G be a class of real-valued functions mapping from X to R. Let D = {x1, . . . , xn} be a set with n i.i.d. samples. The empirical Rademacher complexity over G is defined by ˆℜD(G) = Eϵ i=1 ϵig (xi) where ϵ1, . . . , ϵn are i.i.d. Rademacher random variables. In addition, we define the worst-case Rademacher complexity as ℜn(G) = sup D X n ˆℜD(G). In multi-label learning, F is a class of vector-valued functions, which makes traditional Rademacher complexity analysis methods invalid. A common practice is to use the multilabel Rademacher complexity to bound the Rademacher complexity of a loss function space associated with the vector-valued function class F according to the vectorcontraction inequality in (Maurer, 2016). Definition 2.2 (Multi-label Rademacher complexity). Let F be a class of vector-valued functions mapping from X to Rc. Let D = {x1, . . . , xn} be a set with n i.i.d. samples. The empirical multi-label Rademacher complexity over F is defined by ˆℜD(F) = Eϵ j=1 ϵijfj (xi) where each ϵij is an independent doubly indexed Rademacher random variable, and fj (xi) is the j-th component of f (xi). Here we use the covering number to bound the Rademacher complexity for multi-label learning. The covering number can be bounded by the fat-shattering dimension (Srebro et al., 2010; Lei et al., 2019; Zhang & Zhang, 2023): Definition 2.3 (Covering number). Let F be a class of real-valued functions mapping from X to R. Let D = {x1, . . . , xn} be a set with n i.i.d. samples. For any ϵ > 0, the empirical ℓ2 (or ℓ ) norm covering number N2(ϵ, F, D) (or N (ϵ, F, D)) w.r.t. D is defined as the minimal number m of a collection of vectors v1, . . . , vm Rn such that (vj i is the i-th component of the vector vj) v u u t 1 i=1 (f (xi) vj i)2 ϵ. or max i=1,...,n f (xi) vj i ϵ In this case, we call v1, . . . , vm an (ϵ, ℓ2)-cover (or (ϵ, ℓ )-cover) of F with respect to D. We also denote N2(ϵ, F, n) = sup D N2(ϵ, F, D) (or N (ϵ, F, n) = sup D N (ϵ, F, D)). Definition 2.4 (Fat-shattering dimension). Let F be a class of real-valued functions mapping from X to R. 
We Generalization Analysis for Multi-Label Learning define the fat-shattering dimension fatϵ(F) at scale ϵ > 0 as the largest p N such that there exist p points x1, . . . , xp X and witnesses s1, . . . , sp R satisfying: for any δ1, . . . , δp { 1, +1} there exists f F with δi (f(xi) si) ϵ 2, i = 1, . . . , p. 3. Generalization Bounds Based on the Rademacher Complexity In this section, we derive several novel bounds under the assumption that the loss is Lipschitz continuous w.r.t. ℓ2 norm or ℓ norm. For ℓ2 Lipschitz loss, we first introduce a basic bound with a linear dependency on c, then we develop a novel vector-contraction inequality and improve the bound with no dependency on c, which is tighter than the state of the art. For ℓ Lipschitz loss, we first introduce a basic bound with a square-root dependency on c, then we develop a novel vector-contraction inequality and improve the bound with no dependency on c, which is tighter than the state of the art in the decoupling case. We also give several tight bounds for the coupling case to study the impact of different types of label correlations on the generalization analysis. The theoretical results in this section hold for surrogate Hamming loss, surrogate Subset loss and surrogate Ranking loss. The detailed proofs of the theoretical results in this paper are provided in the appendix. 3.1. Generalization Bounds for ℓ2 Lipschitz Loss We first introduce the assumptions used, and we show that for general function classes, Lipschitz continuity of the loss function w.r.t. the ℓ2 norm combined with the multi-label Rademacher complexity yields basic generalization bounds with a linear dependency on the number of labels, which exploits the typical vector-contraction inequality (Maurer, 2016). Second, we develop a novel vector-contraction inequality and derive a tighter bound with no dependency on the number of labels for ℓ2 Lipschitz loss. Assumption 3.1. Assume that the loss function and the components of the vector-valued function are bounded: L( , ) M, |fj( )| B for j [c] where M > 0 and B > 0 are constants. Assumption 3.2. Assume that the loss function is µLipschitz continuous w.r.t. the ℓ2 norm. Assumption 3.1 is a relatively mild assumption. In fact, when we consider the function class (1) for multi-label learning, we often use the assumptions wj 2 Λ, φj is ρLipschitz w.r.t. the ℓ2 norm and xi 2 A for any j [c], i [n] to replace the boundedness of the components of the vector-valued function. The following Proposition 3.3 also illustrates that Assumption 3.2 is very mild. Proposition 3.3 (Lemma 1 in (Wu & Zhu, 2020)). Assume that the base loss function ℓdefined in Subsection 2.2 is µ-Lipschitz continuous, then the surrogate Hamming Loss is µ c-Lipschitz w.r.t. the ℓ2 norm, the surrogate Subset Loss is µ-Lipschitz w.r.t. the ℓ2 norm, and the surrogate Ranking Loss is µ-Lipschitz w.r.t. the ℓ2 norm. 3.1.1. A BASIC BOUND FOR ℓ2 LIPSCHITZ LOSS Using the ℓ2 Lipschitz continuity of loss and the multi-label Rademacher complexity, we have the following theorem: Theorem 3.4. Let F be a vector-valued function class of the multi-label learning defined by (1). Let Assumptions 3.1 and 3.2 hold. Given a dataset D of size n. Then, for any 0 < δ < 1, with probability at least 1 δ, the following holds for any f F: R(f) b RD(f) + 2 2µc B n + 3M Proof Sketch. 
We first use the multi-label Rademacher complexity to bound the Rademacher complexity of the loss function space associated with the vector-valued function class F according to the typical vector-contraction inequality (Maurer, 2016), and then complete the proof with the Mc Diarmid s inequality and the symmetrization technique. Remark 3.5. The above bound with a linear dependency on the number of labels indicates that good generalization performance will be obtained when the number of examples ( n) is larger than the number of labels (c), but in practice, it is often encountered in the case of an extremely large number of labels, that is, extreme multilabel learning (Yu et al., 2014; Prabhu & Varma, 2014; Yen et al., 2016; Liu & Shen, 2019). At this time, the number of labels will probably be more than the number of examples, so the bound in Theorem 3.4 will not be able to provide theoretical guarantees, thus prompting us to develop the bound that is tighter on the number of labels. Our analysis in Theorem 3.4 implies the following inequality: Eϵ h supf F 1 n Pn i=1 Pc j=1 ϵijfj (xi) i c maxj Eϵ h supfj Fj 1 n Pn i=1 ϵijfj (xi) i , which shows that decoupling the relationship among different components (since the maximization over j [c] is outside of the expectation operator) will lead to bounds with a linear dependency on c for general function classes. When considering kernel function classes, the dependency of the bounds on c can be improved to square-root (Maurer, 2016; Wu & Zhu, 2020; Wu et al., 2021a). Such improvements essentially come from preserving the coupling among different components reflected by the constraint, i.e., w Λ. As a comparison, when wj 2 Λ for any j [c], if we consider the group norm 2,2, we have w 2,2 cΛ, Generalization Analysis for Multi-Label Learning which means that these improved bounds still suffer from a linear dependency on the number of labels. Hence, the improvement in the preservation of coupling by a factor of c benefits from replacing Λ with cΛ in the constraint to some extent. This reveals that in order to improve the existing bounds, we need to improve the dependency on the number of labels in the decoupling case. 3.1.2. TIGHTER BOUNDS FOR ℓ2 LIPSCHITZ LOSS We develop a novel vector-contraction inequality for ℓ2 Lipschitz loss, which decouples the relationship among different components and guarantees that the derived generalization bounds are tighter than the state of the art. We show that the Rademacher complexity of the loss function space associated with F can be bounded by the worstcase Rademacher complexity of the projection function class P(F). We first define a function class P consisting of projection operators pj : Rc 7 R for any j [c] which project the c-dimensional vector onto the j-th coordinate. Then, we have P(F) = {(j, x) 7 pj(f(x)) : pj(f(x)) = fj(x), f F, (j, x) [c] X}. With the above definitions, we develop the following vector-contraction inequality: Lemma 3.6. Let F be a vector-valued function class of the multi-label learning defined by (1). Let Assumptions 3.1 and 3.2 hold. Given a dataset D of size n. Then, we have ˆℜD(F) 48µ ceℜnc(P(F)) 1 + log 1 2 (nc) log M n where eℜnc(P(F)) is the worst-case Rademacher complexity of the projection function class. Proof Sketch. First, the Rademacher complexity of the loss function space associated with F can be bounded by the empirical ℓ2 norm covering number with the refined Dudley s entropy integral inequality. 
Second, according to the Lipschitz continuity w.r.t the ℓ2 norm, the empirical ℓ2 norm covering number of F can be bounded by the empirical ℓ2 norm covering number of P(F). Third, the empirical ℓ2 norm covering number of P(F) can be bounded by using Sudakov s minoration (Wainwright, 2019), which bounds the ℓ2 norm covering number of a function class by the expectation of a Gaussian process indexed by the function class, and the expectation of the Gaussian process can be bounded by the worst-case Rademacher complexity of the projection function class P(F). Hence, the problem is transferred to the estimation of the worst-case Rademacher complexity. Finally, we estimate the lower bound of the worstcase Rademacher complexity of P(F), and then combined with the above steps, the Rademacher complexity of the loss function space associated with F can be bounded. With the vector-contraction inequality above, we can derive the following tight bound for ℓ2 Lipschitz loss: Theorem 3.7. Let F be a vector-valued function class of the multi-label learning defined by (1). Let Assumptions 3.2 and 3.1 hold. Given a dataset D of size n. Then, for any 0 < δ < 1, with probability at least 1 δ, the following holds for any f F: R(f) b RD(f) + 3M 96Bµ 1 + log 1 2 (nc) log M n Proof Sketch. We first upper bound the worst-case Rademacher complexity eℜnc(P(F)), and then combined with Lemma 3.6, the desired bound can be derived. Remark 3.8. Although Lemma 3.6 shows a factor of c, the term eℜnc(P(F)) B nc, which makes the Rademacher complexity of the loss function space associated with F (i.e., ˆℜD(F)) actually independent on c, and results in a tighter bound than the O(c/ n) bound in Theorem 3.4 with a faster convergence rate e O(1/ n). Lemma 3.6 decouples the relationship among different components (since the supremum over j [c] is outside of the expectation operator by the definition of P(F)). Hence, the bound with no dependency on c in Theorem 3.7 is clearly tighter than the state of the art with square-root dependency on c in (Lei et al., 2015; Maurer, 2016; Wu & Zhu, 2020; Wu et al., 2021a) for ℓ2 Lipschitz loss assumption, where the analyzes all preserve the coupling among different components, not to mention, as discussed in Remark 3.5, the constraint on preserving the coupling ( w Λ) directly implies the improvement by a factor of c. In fact, decoupling the relationship or preserving the coupling among different components corresponds to different types of label correlations in multi-label learning (Zhang & Zhou, 2014). The former corresponds to first-order label correlations (which tackle multi-label learning problem by decomposing it into a number of independent binary classification problems, i.e., ignorance of label correlations), and the latter corresponds to highorder label correlations (which tackle multi-label learning problem by exploiting high-order relationships among labels) induced by norm regularizers. The assumption of the coupling case that holds for many learning scenarios (e.g., multi-class learning) does not hold in multi-label learning methods with first-order label correlations. This means that we need to develop new vector-contraction inequalities to handle the assumption of the decoupling case. The projection function class is used to help handle the generalization analysis in the decoupling case. 
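Remark 3.8 identifies the decoupled setting with first-order label correlations, in which each label is learned independently under its own constraint. As a concrete instance (a minimal sketch under assumed data and training choices, not an algorithm from the paper), the code below trains c independent per-label logistic classifiers with the per-label constraint ||w_j||_2 <= Λ enforced by projection — the Binary Relevance strategy referred to in the next paragraph — rather than a joint constraint on the whole weight matrix w.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, c, Lam = 300, 15, 4, 5.0    # Lam plays the role of the per-label bound Λ

X = rng.normal(size=(n, d))
W_true = rng.normal(size=(d, c))
Y = np.where(X @ W_true > 0, 1, -1)          # labels in {-1, +1}^c

def fit_one_label(X, y, Lam, steps=500, lr=0.1):
    """Logistic regression for a single label under ||w||_2 <= Lam, enforced by
    projecting after each gradient step; the label is learned completely
    independently of the others (first-order label correlations)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        # gradient of the average logistic loss log(1 + exp(-y x.w))
        grad = -(X * (y * (1.0 / (1.0 + np.exp(margins))))[:, None]).mean(axis=0)
        w -= lr * grad
        norm = np.linalg.norm(w)
        if norm > Lam:                        # projection onto the l2 ball of radius Lam
            w *= Lam / norm
    return w

# Binary Relevance: one independent classifier per label.
W_hat = np.column_stack([fit_one_label(X, Y[:, j], Lam) for j in range(c)])

pred = np.where(X @ W_hat > 0, 1, -1)
print("per-label weight norms :", np.round(np.linalg.norm(W_hat, axis=0), 2))
print("training Hamming loss  :", np.mean(pred != Y))
```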
The above bounds provide theoretical guarantees for first-order label correlations methods, e.g., Binary Relevance methods (Boutell et al., 2004; Zhang & Zhou, 2014; Zhang & Wu, 2015; Hang & Zhang, 2022). Furthermore, according to the Proposition 3.3, we Generalization Analysis for Multi-Label Learning can obtain a e O(1/ nc) bound for ℓ2 Lipschitz surrogate Hamming Loss in the decoupling case. 3.2. Generalization Bounds for ℓ Lipschitz Loss We first introduce the assumption of Lipschitz continuity w.r.t. the ℓ norm, and we derive a basic bound with a square-root dependency on c for general function classes by refining Theorem 1 in (Foster & Rakhlin, 2019). Second, we develop a novel vector-contraction inequality and derive a tighter bound with no dependency on c for ℓ Lipschitz loss in the decoupling case, up to logarithmic terms, and we also give several bounds with no dependency on c for ℓ Lipschitz loss in the coupling case. Assumption 3.9. Assume that the loss function L is µLipschitz continuous w.r.t. the ℓ norm, that is: L(f(x), ) L(f (x), ) µ f(x) f (x) , where µ > 0, t = maxj [c] |tj| for t = (t1, . . . , tc). In fact, the commonly used loss functions in multi-label learning actually satisfy the Lipschitz continuity w.r.t. the ℓ norm, and it has been considered in some literature (Lei et al., 2019; Wu et al., 2021b). The following Proposition 3.10 further illustrates that Assumption 3.9 is very mild. Proposition 3.10. Assume that the base loss ℓdefined in Subsection 2.2 is µ-Lipschitz continuous, then the surrogate Hamming Loss is µ-Lipschitz w.r.t. the ℓ norm, the surrogate Subset Loss is µ-Lipschitz w.r.t. the ℓ norm, and the surrogate Ranking Loss is 2µ-Lipschitz w.r.t. the ℓ norm. 3.2.1. A BASIC BOUND FOR ℓ LIPSCHITZ LOSS Using the Lipschitz continuity w.r.t. the ℓ norm, we first show that the Rademacher complexity of the loss function space associated with F can be bounded by the worst-case Rademacher complexity of the restriction of the function class along each coordinate with timing a factor of e O( c). Then, we derive the basic bound with a square-root dependency on the number of labels in the decoupling case. Lemma 3.11. Let F be a vector-valued function class of the multi-label learning defined by (1). Let Assumptions 3.1 and 3.9 hold. Given a dataset D of size n. Then, for any 0 < η < 1, 0 < a < 1, we have 2Eµ a c max j eℜn (Fj) 2n B ) log( M where eℜn (Fj) is the worst-case Rademacher complexity, Fj is the restriction of the function class along the j-th coordinate, fj Fj and E > 0 is an absolute constant. Proof Sketch. We obtain the lower bound of the worst-case Rademacher complexity eℜn (Fj) through the Khintchine Kahane inequality, combined with the refined Theorem 1 in (Foster & Rakhlin, 2019), the Rademacher complexity of the loss function space associated with F can be bounded. With the vector-contraction inequality above, we can derive the following basic bound for ℓ Lipschitz loss: Theorem 3.12. Let F be a vector-valued function class of the multi-label learning defined by (1). Let Assumptions 3.1 and 3.9 hold. Given a dataset D of size n. Then, for any 0 < δ, η, a < 1, there exists an absolute constant E > 0 such that with probability at least 1 δ, the following holds for any f F: R(f) b RD(f) + 3M 2n B ) log( M Proof Sketch. We first upper bound the worst-case Rademacher complexity eℜn (Fj), and then combined with Lemma 3.11, the desired bound can be derived. Remark 3.13. 
Lemma 3.11 is not a direct application of Theorem 1 in (Foster & Rakhlin, 2019) which absorbs all terms independent of c and n into an unspecified numerical constant and yields a logarithmic term of order O(log 2 ( n)). We refine the proof of Theorem 1 in (Foster & Rakhlin, 2019), bound the relevant constant terms and imporve the order of the logarithmic term to O(log 2 ( n) log(p n c )). The term maxj eℜn (Fj) B n means that Lemma 3.11 decouples the relationship among different components, which results in a tighter bound than the O(c/ n) bound in Theorem 3.4 with a faster convergence rate e O( p 3.2.2. TIGHTER BOUNDS FOR ℓ LIPSCHITZ LOSS We first give a corollary with no dependency on c, which preserves the coupling among different components. Corollary 3.14. Let F be a vector-valued function class of the multi-label learning defined by (1). Let Assumptions 3.1 and 3.9 hold. Given a dataset D of size n. Then, for any 0 < δ < 1, with probability at least 1 δ, the following holds for any f F: 1.) If α(w) := w , β(φ(x)) := supx X,j [c] φj(x) , and is the dual norm of , we have: R(f) b RD(f) + 36µΛA log 2n 3 2 c) n + 3M Generalization Analysis for Multi-Label Learning 2.) If α(w) := w Sp, β(φ(x)) := maxi [n] φ(xi) 2, when 1 p 2, we have: R(f) b RD(f) + 36µΛA log 2n 3 2 c) n + 3M when p > 2, we have: R(f) b RD(f) + 3M 36µΛA min{c, d} 1 2 1 2n 3 2 c) n . The Schatten-p norm is defined as the ℓp norm of the singular value vector of a matrix, i.e., w Sp = σ(w) p, where the singular values are sorted in non-increasing order. For any x X and j [c], the notation φj(x) is defined by φj(x) := (0, . . . , 0 | {z } j 1 , φ(x), 0, . . . , 0 | {z } c j ) Rd c. Proof Sketch. We first show that the Rademacher complexity of the loss function space associated with F can be bounded by the worst-case Rademacher complexity of e F, which refines Theorem 5 in (Lei et al., 2019) (i.e., ˆℜD(F) 16 log 2µ ceℜnc( e F)(1 + log 3 2 2 ΛAn c eℜnc( e F))), where e F := n v 7 w, v : w, v Rd c, α(w) Λ, β(v) A, v e S o and e S = { φj(xi) : j [c], i [n]}. Then, we upper bound eℜnc( e F) for different norm regularizers. Combining these results, the desired bounds can be derived. Remark 3.15. Corollary 3.14 for preserving the coupling case has less novelty, which has the same order e O(1/ n) as the results in (Lei et al., 2019; Wu et al., 2021b), since these results all use the same vector-contraction inequality, i.e., Theorem 5 in (Lei et al., 2019). However, here we are more concerned with investigating the impact of the label correlation induced by the norm regularizer on the generalization analysis. The bounds here involve a constraint on the overall weight w, which consider that the components share some constraint properties with each other, that is, the label correlation induced by the norm regularizer. The above bounds involve label correlations induced by the general norm regularizer (case 1) and the Schatten-p norm regularizer (case 2), respectively. Trace norm regularizer, corresponding to Schatten-p norm regularizer with p = 1, is a common practice to consider the label correlation in multi-label learning, which imposes a low-rank constraint on the spectrum of w. The bounds here also explain the good generalization ability of multi-label learning with the label correlation induced by the trace norm regularizer. Then, we develop a novel vector-contraction inequality, which guarantees that the derived bounds are tighter than the state of the art in the decoupling case. Lemma 3.16. 
Let F be a vector-valued function class of the multi-label learning defined by (1). Let Assumptions 3.1 and 3.9 hold. Given a dataset D of size n. Then, for any 0 < η < 1, 0 < a < 1, E > 0, we have Eµ a ceℜnc(P(F)) 1 + log 5B nc) log( M n where eℜnc(P(F)) is the worst-case Rademacher complexity of the projection function class. Proof Sketch. The overall proof idea is similar to Lemma 3.6, but there are two obvious differences. First of all, the covering numbers involved in the proof here are all ℓ norm covering numbers instead of ℓ2 norm covering numbers, which makes the proof techniques involved completely different. Secondly, in the third step, instead of using Sudakov s minoration inequality, we use the fat-shattering dimension to bound the empirical ℓ norm covering number of P(F), and the fat-shattering dimension can be bounded by the worst-case Rademacher complexity of P(F). Theorem 3.17. Let F be a vector-valued function class of the multi-label learning defined by (1). Let Assumptions 3.1 and 3.9 hold. Given a dataset D of size n. Then, for any 0 < δ, η, a < 1, there exists an absolute constant E > 0 such that with probability at least 1 δ, the following holds for any f F: R(f) b RD(f) + 3M EµB a 1 n 1 + log 5B nc) log( M n Proof Sketch. We first upper bound eℜnc(P(F)), then combined with Lemma 3.16, the bound is immediate. Remark 3.18. Lemma 3.16 decouples the relationship among different components, and the term eℜnc(P(F)) B nc, which makes the Rademacher complexity ˆℜD(F) actually independent on c, and results in a tighter bound than the e O( p c/n) bound in Theorem 3.12 with a faster convergence rate e O(1/ n). How to develop novel vectorcontraction inequalities that can induce e O(1/ n) bounds and deal with the assumption of the decoupling case are the two most critical difficulties in deriving tighter bounds. The introduction of the projection function class plays an important role in solving these two difficulties. It improves the vector-contraction inequalities by a factor of c and handles the assumption of the decoupling case indirectly. Theorem 3.17 improves the bounds from e O( p c/n) to e O(1/ n) for Generalization Analysis for Multi-Label Learning ℓ Lipschitz Loss in the decoupling case, and explains the good generalization ability of multi-label learning with first-order label correlations. 4. Generalization Analysis for Macro-Averaged AUC Macro-Averaged AUC is a typical label-based ranking metric (Zhang & Zhou, 2014). Here we analyze the relationship between Macro-Averaged AUC and class-imbalance according to the generalization bound. We use Lemma 3.11 for the generalization analysis, since improving the dependency on the number of labels is not our main concern here. Rademacher complexity has proved to be a powerful datadependent measure of hypothesis space complexity (Bartlett & Mendelson, 2002; Koltchinskii & Panchenko, 2002). However, a sequence of pairs of i.i.d. individual observation in (2) is no longer independent, which makes standard techniques in the i.i.d case for traditional Rademacher complexity inapplicable for Macro-Averaged AUC. We convert the non-sum-of-i.i.d pairwise function to a sum-of-i.i.d form by using permutations in U-process (Cl emenc on et al., 2008). We denote X+ j and X j in (2) as sj and tj, rj = min{sj, tj}, r0 = minj [c]{rj}, and sj + tj = n for any j [c]. Then, we have the following definition: Definition 4.1. Let F be a class of vector-valued functions mapping from X to Rc and ℓbe the base loss defined in (3). Let D = {x1, . . . 
, xn} be a set with n i.i.d. samples. The empirical label-based ranking multi-label Rademacher complexity of a loss function space associated with the vector-valued function class F is defined by ˆℜD(F) = Eϵ i=1 ϵijℓ(fj(xi) fj(x i)) where each ϵij is an independent doubly indexed Rademacher random variable, and fj (xi) is the j-th component of f (xi). We refer to the expectation ℜ(F) = ED[ˆℜD(F)] as the label-based ranking multi-label Rademacher complexity of F. With this definition, we then derive the bound as follows: Theorem 4.2. Let F be a vector-valued function class of the multi-label learning defined by (1) and the loss be Macro Averaged AUC. Let Assumptions 3.1 and 3.9 hold. Given a dataset D of size n. Then, for any 0 < δ < 1, with probability at least 1 δ, the following holds for any f F: R( ˆ f ) R = e O( p Proof Sketch. First, by using the U-process technique, we bound E[R( ˆ f )] R with the label-based ranking multilabel Rademacher complexity. Second, combining with Lemma 3.11 and bounding the worst-case Rademacher complexity, we can upper bound the label-based ranking multi-label Rademacher complexity. Finally, with the Mc Diarmid s inequality, the desired bound can be derived. Remark 4.3. Theorem 4.2 shows that when class-imbalance occurs, r0 will be smaller than n 2 . When class-imbalance is more serious, r0 will be smaller, which will lead to a looser bound for Macro-Averaged AUC. This means that when class-imbalance becomes more and more serious, if the learned classifier cannot handle the problem of class imbalance well, then its performance on Macro-Averaged AUC will be worse. Wu et al. (2023) also obtained similar conclusions, but the methods used were completely different. Wu et al. (2023) transformed the macro-averaged maximization problem in multi-label learning into the problem of learning multiple tasks with graph-dependent examples-which is hard to verify in practice, then proposed a new Mc Diarmidtype inequality to develop O( 1 n) bound in the balanced case. Our method is simpler and can also yield bounds with no dependency on c by combining Lemma 3.16. 5. Comparison with Related Work The generalization analysis of multi-label learning originated from (Dembczynski et al., 2010), which performed regret analysis on Hamming and Subset loss, and derived the relationship between the expectations of Hamming and Subset loss. Dembczynski et al. (2012) performed regret analysis on Ranking loss. These analyses laid the foundation for research in (Wu & Zhu, 2020; Wu et al., 2021a). Theorem 3.4 shows that when using the typical vectorcontraction inequality (Maurer, 2016) (i.e., ℓ2 norm Lipschitz loss) and the multi-label Rademacher complexity, one can only derive bounds of order O(c/ n) for general function classes in the decoupling case. Wu & Zhu (2020); Wu et al. (2021a) also obtained similar results for surrogate Hamming, Subset and Ranking losses, which mainly exploited the relationship between losses. Wu & Zhu (2020); Wu et al. (2021a) also showed that for kernel function classes, the order of the bounds for some losses can be improved to O( p c/n) when preserving the coupling. Lei et al. (2015) first derived a O( p c/n) bound for multi-class SVM with ℓp norm regularized kernel function classes under the assumption of ℓ2 Lipschitz loss. Li et al. (2018) used the local Rademacher complexity to derive a O(log2 c/n) bound for multi-class classification with ℓp norm regularized kernel function classes under the assumption of ℓ2 Lipschitz and smooth loss. 
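To keep the comparison easy to scan, the display below collects the orders of the bounds discussed in this section and in Section 3; it only restates rates already given in the text, with Õ hiding logarithmic factors and all constants suppressed.

```latex
\begin{align*}
&\text{Theorem 3.4 (general classes, $\ell_2$-Lipschitz, decoupled):} && O\!\left(c/\sqrt{n}\right)\\
&\text{Kernel classes, coupled (Lei et al., 2015; Wu \& Zhu, 2020; Wu et al., 2021a):} && O\!\left(\sqrt{c/n}\right)\\
&\text{Theorem 3.7 (general classes, $\ell_2$-Lipschitz, decoupled):} && \widetilde{O}\!\left(1/\sqrt{n}\right)\\
&\text{Theorem 3.12 (general classes, $\ell_\infty$-Lipschitz, decoupled):} && \widetilde{O}\!\left(\sqrt{c/n}\right)\\
&\text{Coupled, $\ell_\infty$-Lipschitz (Lei et al., 2019):} && \widetilde{O}\!\left(1/\sqrt{n}\right)\\
&\text{Theorem 3.17 (general classes, $\ell_\infty$-Lipschitz, decoupled):} && \widetilde{O}\!\left(1/\sqrt{n}\right)
\end{align*}
```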
Theorem 3.7 improves the bounds to O(1/√n) for ℓ2 Lipschitz loss even in the decoupling case. Theorem 3.12 shows that when using ℓ∞ norm Lipschitz loss, one can derive bounds of order Õ(√(c/n)) for general function classes. Liu et al. (2018) also obtained a bound of order O(√(c/n)) for dual-set multi-label learning with margin loss and kernel function classes. Lei et al. (2019) and Wu et al. (2021b) improved the dependency on c in the coupling case, where Lei et al. (2019) derived an Õ(1/√n) bound for multi-class classification with norm-regularized function classes under ℓ∞ Lipschitz loss, and Wu et al. (2021b) derived an O(log³(nc)/(nσ)) bound for vector-valued learning with norm-regularized kernel function classes under the assumption of ℓ∞ Lipschitz and σ-strongly convex loss. Here Theorem 3.17 shows an Õ(1/√n) bound for general function classes with ℓ∞ Lipschitz loss in the decoupling case. Yu et al. (2014) obtained an O(1/√n) bound for trace norm regularized linear function classes with the decomposable loss. In addition, Xu et al. (2016) used the local Rademacher complexity to derive an Õ(1/n) bound for trace norm regularized linear function classes under the assumption that the singular values of w decay exponentially. Compared with these works, we obtain tighter bounds with the state-of-the-art dependency on c for general function classes under mild assumptions, in both the decoupling and coupling cases.

6. Discussion

The main goal of our capacity-based generalization bounds is to provide general and efficient theoretical guarantees for empirically successful multi-label learning methods, especially regarding the dependency on the number of labels. The generalization differences among multi-label learning models or algorithms are mainly reflected in two aspects. On the one hand, they are reflected in the Lipschitz constant of the loss function: as shown in Propositions 3.3 and 3.10, different loss functions yield different values of µ in our bounds. On the other hand, they are reflected in the nonlinear mappings corresponding to the specific models used. In fact, when we analyze the generalization of specific models, the constraint on nonlinear mappings β(φ(x)) ≤ A is refined to ∥φ(x)∥ ≤ A in our analysis, and we further have ∥φ(x)∥ ≤ ρ∥x∥ (where ρ is the Lipschitz constant of the nonlinear mapping) to take into account the characteristics of different models. The generalization differences are thus further reflected in the corresponding Lipschitz constants ρ; the difference between the Lipschitz constants of deep models and of shallow models is particularly pronounced (Bartlett et al., 2017; Golowich et al., 2018; Bartlett et al., 2019; Zhang & Liao, 2020; Ledent et al., 2021; Zhang & Zhang, 2023). In order to provide general theoretical guarantees for multi-label learning, here we only make the most general assumptions (e.g., on the nonlinear mappings) and do not specify particular models, so the generalization differences among multi-label learning models or algorithms are not shown explicitly. If a specific model or algorithm were specified or refined, the resulting bounds would lose their generality and would no longer provide theoretical guarantees for all or most multi-label learning models or algorithms.
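The role of the constant ρ discussed above can be made concrete with a small numerical sketch (our illustration; the two-layer ReLU feature map below is a hypothetical stand-in for φ, not a model analyzed in the paper). It compares a sampled estimate of the Lipschitz constant of φ with the product of spectral norms of its weight matrices, which upper-bounds ρ for 1-Lipschitz activations such as ReLU.

```python
import numpy as np

rng = np.random.default_rng(2)
d, h, k = 10, 64, 32          # input dim, hidden width, feature dim

# Hypothetical two-layer ReLU feature map phi(x) = W2 relu(W1 x).
W1 = rng.normal(size=(h, d)) / np.sqrt(d)
W2 = rng.normal(size=(k, h)) / np.sqrt(h)

def phi(X):
    return np.maximum(X @ W1.T, 0.0) @ W2.T

# Sampled estimate of the Lipschitz constant:
#   max ||phi(x) - phi(x')||_2 / ||x - x'||_2 over random nearby pairs.
X = rng.normal(size=(2000, d))
Xp = X + 1e-3 * rng.normal(size=X.shape)
num = np.linalg.norm(phi(X) - phi(Xp), axis=1)
den = np.linalg.norm(X - Xp, axis=1)
rho_hat = (num / den).max()

# Product of spectral norms: a valid upper bound on rho for ReLU activations.
rho_upper = np.linalg.norm(W1, 2) * np.linalg.norm(W2, 2)

print(f"sampled estimate of rho   : {rho_hat:.3f}")
print(f"spectral-norm upper bound : {rho_upper:.3f}")
```

Replacing this map with a different (e.g., deeper or shallower) one changes only ρ in the analysis, which is how model-specific characteristics enter the bounds of Section 3 without affecting their general form.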
The analysis of lower bounds would greatly promote our theoretical understanding of multi-label learning and is the most powerful criterion for testing whether a given upper bound is the tightest. However, effective lower bounds remain relatively under-explored. We believe that such a lower bound Ω(1/√n) is still not tight enough. The main evidence is as follows: if we regard the classification problem corresponding to each label as a task, then multi-label learning can be regarded as multi-task learning. Since tasks share some constraint or generative properties with each other, the typical bound for multi-task learning is O(1/√(nt)), where t is the number of tasks. Hence, if we consider that the components share some constraint properties or information with each other, i.e., the label correlations, then appropriate label correlations will improve the generalization performance of multi-label learning. This means that some constraint properties shared between labels (i.e., label correlations) will facilitate the joint learning of multiple labels. In such a case, we should have the lower bound Ω(1/√(nc)). The Õ(1/√(nc)) bound for ℓ2 Lipschitz surrogate Hamming loss in the decoupling case (as discussed in Remark 3.8) provides evidence for this analysis. Hence, although our bounds are tighter than the state of the art, the above analysis motivates us to develop new theories to investigate the lower bound and break through the theoretical limitations of current methodologies.

7. Conclusion

In this paper, we propose several novel vector-contraction inequalities, derive bounds with a weaker dependency on c, and study the impact of different label correlations on the generalization analysis. In addition, with the label-based ranking multi-label Rademacher complexity, we derive the bound for Macro-Averaged AUC and analyze the relationship between Macro-Averaged AUC and class-imbalance. In future work, we will extend our bounds to more general settings (especially to label correlations induced by label-specific features and other norm regularizers), derive tighter bounds for multi-label learning with a faster convergence rate with respect to the number of observations, and further design efficient algorithms to construct multi-label models with good generalization performance.

Acknowledgements

The authors wish to thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Science Foundation of China (62225602), the Fundamental Research Funds for the Central Universities (2242024K30035), and the Big Data Computing Center of Southeast University.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

Bartlett, P. L., Foster, D. J., and Telgarsky, M. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30 (NIPS 2017):6240–6249, 2017.

Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019.

Barutçuoglu, Z., Schapire, R. E., and Troyanskaya, O. G.
Hierarchical multi-label prediction of gene function. Bioinformatics, 22(7):830 836, 2006. Boutell, M. R., Luo, J., Shen, X., and Brown, C. M. Learning multi-label scene classification. Pattern Recognition, 37(9):1757 1771, 2004. Cabral, R. S., la Torre, F. D., Costeira, J. P., and Bernardino, A. Matrix completion for multi-label image classification. Advances in Neural Information Processing Systems, 24 (NIPS 2011):190 198, 2011. Cesa-Bianchi, N., Re, M., and Valentini, G. Synergy of multi-label hierarchical ensembles, data fusion, and costsensitive methods for gene functional inference. Machine Learning, 88(1-2):209 241, 2012. Cl emenc on, S., Lugosi, G., and Vayatis, N. Ranking and empirical minimization of u-statistics. The Annals of Statistics, 36(2):844 874, 2008. Dembczynski, K., Waegeman, W., Cheng, W., and H ullermeier, E. Regret analysis for performance metrics in multi-label classification: The case of hamming and subset zero-one loss. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, volume 6321, pp. 280 295, 2010. Dembczynski, K., Kotlowski, W., and H ullermeier, E. Consistent multilabel ranking through univariate losses. ar Xiv:1206.6401, 2012. Foster, D. J. and Rakhlin, A. ℓ vector contraction for rademacher complexity. ar Xiv:1911.06468v1, 2019. Golowich, N., Rakhlin, A., and Shamir, O. Size-independent sample complexity of neural networks. International Conference on Computational Learning Theory, 75(COLT 2018):297 299, 2018. Hang, J.-Y. and Zhang, M.-L. Collaborative learning of label semantics and deep label-specific features for multi-label classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):9860 9871, 2022. Huang, J., Li, G., Huang, Q., and Wu, X. Learning labelspecific features and class-dependent labels for multilabel classification. IEEE Transactions on Knowledge and Data Engineering, 28(12):3309 3323, 2016. Jia, B., Liu, J., Hang, J., and Zhang, M. Learning labelspecific features for decomposition-based multi-class classification. Frontiers of Computer Science, 17(6): 176348, 2023. Koltchinskii, V. and Panchenko, D. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1):1 50, 2002. Ledent, A., Mustafa, W., Lei, Y., and Kloft, M. Norm-based generalisation bounds for deep multi-class convolutional neural networks. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, volume 35, pp. 8279 8287, 2021. Lei, Y., Dogan, U., Binder, A., and Kloft, M. Multi-class svms: From tighter data-dependent generalization bounds to novel algorithms. In Advances in Neural Information Processing Systems, volume 28, pp. 2035 2043, 2015. Lei, Y., Dogan, U., Zhou, D., and Kloft, M. Data-dependent generalization bounds for multi-class classification. IEEE Transactions on Information Theory, 65(5):2995 3021, 2019. Li, J., Liu, Y., Yin, R., Zhang, H., Ding, L., and Wang, W. Multi-class learning: From theory to algorithm. Advances in Neural Information Processing Systems, 31(Neur IPS 2018):1593 1602, 2018. Liu, C., Zhao, P., Huang, S., Jiang, Y., and Zhou, Z. Dual set multi-label learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, number AAAI 2018, pp. 3635 3642, 2018. Liu, W. and Shen, X. Sparse extreme multi-label learning with oracle property. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pp. 4032 4041, 2019. 
Generalization Analysis for Multi-Label Learning Lust-Piquard, F. and Pisier, G. Non commutative khintchine and paley inequalities. Arkiv f or matematik, 29:241 260, 1991. Maurer, A. A vector-contraction inequality for rademacher complexities. In Proceedings of the 27th International Conference on Algorithmic Learning Theory, volume 9925, pp. 3 17, 2016. Prabhu, Y. and Varma, M. Fastxml: a fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, number KDD 2014, pp. 263 272, 2014. Rubin, T. N., Chambers, A., Smyth, P., and Steyvers, M. Statistical topic models for multi-label document classification. Machine Learning, 88(1-2):157 208, 2012. Rudelson, M. and Vershynin, R. Combinatorics of random processes and sections of convex bodies. Annals of Mathematics, pp. 603 648, 2006. Schapire, R. E. and Singer, Y. Boostexter: A boostingbased system for text categorization. Machine Learning, 39(2/3):135 168, 2000. Srebro, N., Sridharan, K., and Tewari, A. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, volume 23, pp. 2199 2207, 2010. Wainwright, M. J. High-dimensional statistics: A nonasymptotic viewpoint, volume 48. Cambridge university press, 2019. Wu, G. and Zhu, J. Multi-label classification: do hamming loss and subset accuracy really conflict with each other? Advances in Neural Information Processing Systems, 33 (Neur IPS 2020), 2020. Wu, G., Li, C., Xu, K., and Zhu, J. Rethinking and reweighting the univariate losses for multi-label ranking: Consistency and generalization. Advances in Neural Information Processing Systems, 34(Neur IPS 2021):14332 14344, 2021a. Wu, G., Li, C., and Yin, Y. Towards understanding generalization of macro-auc in multi-label learning. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pp. 37540 37570, 2023. Wu, L., Ledent, A., Lei, Y., and Kloft, M. Fine-grained generalization analysis of vector-valued learning. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, number AAAI 2021, pp. 10338 10346, 2021b. Xu, C., Liu, T., Tao, D., and Xu, C. Local rademacher complexity for multi-label learning. IEEE Transactions on Image Processing, 25(3):1495 1507, 2016. Yen, I. E., Huang, X., Ravikumar, P., Zhong, K., and Dhillon, I. S. Pd-sparse : A primal and dual sparse approach to extreme multiclass and multilabel classification. In Proceedings of the 33nd International Conference on Machine Learning, volume 48, pp. 3069 3077, 2016. Yu, H., Jain, P., Kar, P., and Dhillon, I. S. Large-scale multilabel learning with missing labels. In Proceedings of the 31th International Conference on Machine Learning, volume 32, pp. 593 601, 2014. Yu, K., Yu, S., and Tresp, V. Multi-label informed latent semantic indexing. In Baeza-Yates, R. A., Ziviani, N., Marchionini, G., Moffat, A., and Tait, J. (eds.), Proceedings of the 28th Annual International Conference on Research and Development in Information Retrieval, number SIGIR 2005, pp. 258 265, 2005. Zhang, M.-L. and Wu, L. Lift: Multi-label learning with label-specific features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(1):107 120, 2015. Zhang, M.-L. and Zhou, Z.-H. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819 1837, 2014. Zhang, Y. and Liao, S. A kernel perspective for the decision boundary of deep neural networks. 
In Proceedings of the 32nd IEEE International Conference on Tools with Artificial Intelligence, number ICTAI 2020, pp. 653 660, 2020. Zhang, Y. and Zhang, M. Nearly-tight bounds for deep kernel learning. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pp. 41861 41879, 2023. Generalization Analysis for Multi-Label Learning A. Appendix Outline In the appendix, we give the detailed proofs of those theoretical results which we only provide proof sketches in the main paper. Our main proofs include: The basic generalization bound for ℓ2 Lipschitz loss (Theorem 3.4). The novel vector-contraction inequality for for ℓ2 Lipschitz loss (Lemma 3.6). The generalization bound with a square-root dependency on c for ℓ2 Lipschitz loss (Theorem 3.7). The ℓ Lipschitz continuity of the commonly used loss for multi-label learning (Proposition 3.10). The novel vector-contraction inequality for ℓ Lipschitz loss with a square-root dependency on c (Lemma 3.11). The generalization bound with a square-root dependency on c for ℓ Lipschitz loss (Theorem 3.12). The generalization bounds with no dependency on c for ℓ Lipschitz loss in the coupling case (Corollary 3.14). The novel vector-contraction inequality for ℓ Lipschitz loss with no dependency on c (Lemma 3.16). The generalization bound with no dependency on c for ℓ Lipschitz loss in the decoupling case (Theorem 3.17). The generalization bound for Macro-Averaged AUC w.r.t. ℓ Lipschitz loss (Theorem 4.2). B. Preliminaries First, we define the loss function space as follows: L = {L(f(x), y) : f F}, where F is the vector-valued function class (1) of multi-label learning defined in the main paper: F = {x 7 f(x) :f(x) = (f1(x), . . . , fc(x)), fj(x) = wj, φ(x) , w = (w1, . . . , wc) Rd c, α(w) Λ, β(φ(x)) A, x X, j [c]}, For any training dataset D = {(xi, yi) : i [n]}, let D = {(xi, yi) : i [n]} be the training dataset with only one sample different from D, where the k-th sample is replaced by (x k, y k). Let Φ(D) = supf F[E(x,y) P [L(f(x), y)] 1 n Pn i=1 L(f(xi), yi)] = supf F[R(f) b RD(f)], then Φ (D ) Φ(D) = sup f F [R(f) b RD (f)] sup f F [R(f) b RD(f)] sup f F [ b RD(f) b RD (f)] [L(f(xk), yk) L(f(x k), y k)] n According to Mc Diarmid s inequality, for any 0 < δ < 1, with probability at least 1 δ 2 over the training dataset D, the following holds: Φ(D) ED[Φ(D)] + M Generalization Analysis for Multi-Label Learning Then, we will estimate the upper bound of ED[Φ(D)]. h ED h b RD (f) b RD(f) ii# h b RD (f) b RD(f) i# i=1 L(f(x i), y i) L(f(xi), yi) i=1 ϵi (L(f(x i), y i)) L(f(xi), yi)) i=1 ϵi L(f(x i), y i) i=1 ϵi L(f(xi), yi) i=1 ϵi L(f(xi), yi) Then apply Mc Diarmid s inequality to Eϵ supf F 1 n Pn i=1 ϵi L(f(xi), yi) , we have i=1 ϵi L(f(xi), yi) i=1 ϵi L(f(xi), yi) ℜ(L) ˆℜD(L) + M Combining with (4), (5) and (6), then R(f) b RD(f) + 2ˆℜD(L) + 3M C. Generalization Bounds for ℓ2 Lipschitz Loss C.1. Proof of Theorem 3.4 We first introduce the following lemmas: Lemma C.1 (Corollary 1 in (Maurer, 2016)). Let F be a vector-valued function class. Given a dataset D of size n. Assume that the loss function is µ-Lipschitz continuous with respect to the ℓ2 norm. Then i=1 ϵi L(f(xi)) j=1 ϵijfj (xi) where each ϵij is an independent doubly indexed Rademacher random variable, and fj (xi) is the j-th component of f (xi). Lemma C.2 (Khintchine-Kahane inequality (Lust-Piquard & Pisier, 1991)). Let v1, . . . , vn H, where H is a Hilbert space with being the associated p-th norm. Let ϵ1, . . . , ϵn be a sequence of independent Rademacher variables. 
Then, Generalization Analysis for Multi-Label Learning for any p 1 there holds i=1 vi 2 # 1 i=1 vi 2 # 1 i=1 vi 2 # 1 According to Lemma C.1, we then have j=1 ϵijfj (xi) 2µc max j Eϵ i=1 ϵijfj (xi) 2µc max j sup fj Fj i=1 (fj (xi))2 ! 1 2 (Use Lemma C.2) 2µc B n . (Use Assumption 3.1 in the main paper) Combining with (7), then R(f) b RD(f) + 2 2µc B n + 3M C.2. Proof of Lemma 3.6 Proof Sketch: First, the Rademacher complexity of the loss function space associated with F can be bounded by the empirical ℓ2 norm covering number with the refined Dudley s entropy integral inequality. Second, according to the Lipschitz continuity w.r.t the ℓ2 norm, the empirical ℓ2 norm covering number of F can be bounded by the empirical ℓ2 norm covering number of P(F). Third, the empirical ℓ2 norm covering number of P(F) can be bounded by using Sudakov s minoration (Wainwright, 2019), which bounds the ℓ2 norm covering number of a function class by the expectation of a Gaussian process indexed by the function class, and the expectation of the Gaussian process can be bounded by the worst-case Rademacher complexity of the projection function class P(F). Hence, the problem is transferred to the estimation of the worst-case Rademacher complexity. Finally, we estimate the lower bound of the worst-case Rademacher complexity of P(F), and then combined with the above steps, the Rademacher complexity of the loss function space associated with F can be bounded. We first introduce the following lemmas: Lemma C.3 (Sudakov s minoration (Wainwright, 2019)). Let {Zf, f F} be a zero-mean Gaussian process. Then sup ϵ>0 ϵ 2 log M2(ϵ, F, D) sup ϵ>0 ϵ 2 log N2(ϵ, F, D). Lemma C.4 (Relationship between Rademacher and Gaussian complexity (Wainwright, 2019)). Let G be a class of real-valued functions mapping from X to R. Let D = {x1, . . . , xn} be a set with n i.i.d. samples. Then r 2 π ˆℜD(G) ˆGD(G) 2ˆℜD(G) p where ˆGD(G) = Eσ supg G 1 n Pn i=1 σig (xi) , and σ1, . . . , σn are i.i.d. random variables obeying the normal distribution. Generalization Analysis for Multi-Label Learning The following lemma is a refined result of Proposition 22 in (Ledent et al., 2021), where we replace the function class taking values in [0, 1] with the b-bounded function class, the refinement is obvious. Lemma C.5 (Refined Dudley s entropy integral inequality). Let F be a real-valued function class with f b, f F, b > 0, and assume that 0 F. Let S be a finite sample of size n. For any 2 p , we have the following relationship between the Rademacher complexity ˆℜS(F) and the covering number Np(ϵ, F, S). ˆℜS(F) inf α>0 log Np(ϵ, F, S)dϵ Step 1: We first derive the relationship between the empirical ℓ2 norm covering number N2(ϵ, L, D) and the empirical ℓ2 norm covering number N2(ϵ, P(F), [c] D). For the dataset D = {(x1, y1), . . . , (xn, yn)} with n i.i.d. examples: v u u t 1 L(f(xi), yi) L(f (xi), yi) 2 i=1 µ2 f(xi) f (xi) 2 2 (Use Assumption 3.2) fj (xi) f j (xi) 2 pj(f(xi)) pj(f (xi)) 2. (The definition of the projection function class P(F)) Then, according to the definition of the empirical ℓ2 covering number, we have that an empirical ℓ2 cover of P(F) at radius ϵ/µ c is also an empirical ℓ2 cover of the loss function space associated with F at radius ϵ, and we can conclude that: N2 (ϵ, L, D) N2 ϵ µ c, P(F), [c] D . 
Step 2: We show that the empirical $\ell_2$ norm covering number of $P(\mathcal{F})$ can be bounded by using Sudakov's minoration (Wainwright, 2019), which bounds the $\ell_2$ norm covering number of a function class by the expectation of a Gaussian process indexed by the function class; the expectation of the Gaussian process can in turn be bounded by the worst-case Rademacher complexity of the projection function class $P(\mathcal{F})$.

Let $\mathcal{G}$ be a class of real-valued functions and let $D = \{x_1, \ldots, x_n\}$ be a set of $n$ i.i.d. samples. We consider the Gaussian process $Z_g = \frac{1}{\sqrt{n}}\sum_{i=1}^n\sigma_i g(x_i)$, whose canonical metric is the empirical $\ell_2$ metric, so that $\mathbb{E}\sup_{g\in\mathcal{G}} Z_g = \sqrt{n}\,\hat{G}_D(\mathcal{G})$. Then, combining Lemma C.3 and Lemma C.4, we can get
$$\sqrt{\log N_2(\epsilon, \mathcal{G}, D)} \le \frac{4}{\epsilon}\hat{\mathfrak{R}}_D(\mathcal{G})\sqrt{n\log n}.$$
Hence, for the projection function class $P(\mathcal{F})$ on $[c]\cdot D$, a sample of size $nc$, we have
$$\sqrt{\log N_2(\epsilon, P(\mathcal{F}), [c]\cdot D)} \le \frac{4}{\epsilon}\hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F}))\sqrt{nc\log(nc)}. \quad (9)$$

Step 3: According to Assumption 3.1 in the main paper, we can obtain the lower bound of the worst-case Rademacher complexity $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))$ by the Khintchine-Kahane inequality with $p = 1$:
$$\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) = \sup_{[c]\cdot D\in([c]\times\mathcal{X})^{nc}}\hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F})) = \sup_{[c]\cdot D\in([c]\times\mathcal{X})^{nc}}\frac{1}{nc}\mathbb{E}_\epsilon\sup_{p_j(f(x_i))\in P(\mathcal{F})}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}p_j(f(x_i))$$
$$= \sup_{D\in\mathcal{X}^n}\frac{1}{nc}\mathbb{E}_\epsilon\sup_{f\in\mathcal{F}}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}f_j(x_i) \ge \sup_{D\in\mathcal{X}^n}\frac{1}{\sqrt{2}}\frac{1}{nc}\sup_{f\in\mathcal{F}}\Big[\sum_{i=1}^n\sum_{j=1}^c(f_j(x_i))^2\Big]^{\frac{1}{2}}. \quad \text{(Use Lemma C.2 with } p = 1)$$
Since $|f_j(\cdot)| \le B$, we set $\sup_{D\in\mathcal{X}^n}\frac{1}{nc}\sup_{f\in\mathcal{F}}\big[\sum_{i=1}^n\sum_{j=1}^c(f_j(x_i))^2\big]^{\frac{1}{2}} = \frac{B}{\sqrt{nc}}$. So
$$\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) \ge \frac{B}{\sqrt{2nc}}. \quad (10)$$

Then, according to Lemma C.5 and combined with the above steps, we have
$$\hat{\mathfrak{R}}_D(\mathcal{L}) \le \inf_{\alpha>0}\Big(4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^{M}\sqrt{\log N_2(\epsilon, \mathcal{L}, D)}\,d\epsilon\Big) \le \inf_{\alpha>0}\Big(4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^{M}\sqrt{\log N_2\Big(\frac{\epsilon}{\mu\sqrt{c}}, P(\mathcal{F}), [c]\cdot D\Big)}\,d\epsilon\Big) \quad \text{(Use inequality (8))}$$
$$\le \inf_{\alpha>0}\Big(4\alpha + 48c\mu\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))\log^{\frac{1}{2}}(nc)\int_\alpha^{M}\frac{d\epsilon}{\epsilon}\Big) \quad \text{(Use inequality (9) and the definition of the worst-case Rademacher complexity)}$$
$$\le 48c\mu\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) + 48c\mu\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))\log^{\frac{1}{2}}(nc)\log\frac{M}{12c\mu\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))} \quad \big(\text{Choose } \alpha = 12c\mu\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))\big)$$
$$\le 48c\mu\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))\Big(1 + \log^{\frac{1}{2}}(nc)\log\frac{\sqrt{2n}M}{12\sqrt{c}\,\mu B}\Big). \quad \text{(Use inequality (10))}$$

C.3. Proof of Theorem 3.7

We upper bound the worst-case Rademacher complexity $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))$ as follows:
$$\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) = \sup_{[c]\cdot D}\hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F})) = \sup_{[c]\cdot D}\frac{1}{nc}\mathbb{E}_\epsilon\sup_{p_j(f(x_i))\in P(\mathcal{F})}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}p_j(f(x_i)) = \sup_{D}\frac{1}{nc}\mathbb{E}_\epsilon\sup_{f\in\mathcal{F}}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}f_j(x_i)$$
$$\le \sup_{D}\frac{1}{nc}\sup_{f\in\mathcal{F}}\Big[\sum_{i=1}^n\sum_{j=1}^c(f_j(x_i))^2\Big]^{\frac{1}{2}} \quad \text{(Use Lemma C.2)} \qquad \le \frac{B}{\sqrt{nc}}. \quad \text{(Use Assumption 3.1 in the main paper)} \quad (11)$$
Then, we have
$$\hat{\mathfrak{R}}_D(\mathcal{L}) \le 48c\mu\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))\Big(1 + \log^{\frac{1}{2}}(nc)\log\frac{\sqrt{2n}M}{12\sqrt{c}\,\mu B}\Big) \le \frac{48\sqrt{c}\,B\mu\Big(1 + \log^{\frac{1}{2}}(nc)\log\frac{\sqrt{2n}M}{12\sqrt{c}\,\mu B}\Big)}{\sqrt{n}}.$$
Combining with (7), then
$$R(f) \le \hat{R}_D(f) + \frac{96\sqrt{c}\,B\mu\Big(1 + \log^{\frac{1}{2}}(nc)\log\frac{\sqrt{2n}M}{12\sqrt{c}\,\mu B}\Big)}{\sqrt{n}} + 3M\sqrt{\frac{\log(2/\delta)}{2n}}.$$

D. Generalization Bounds for $\ell_\infty$ Lipschitz Loss

D.1. Proof of Proposition 3.10

We first prove that the surrogate Hamming Loss is $\mu$-Lipschitz continuous with respect to the $\ell_\infty$ norm:
$$\big|L_H(f(x), y) - L_H(f'(x), y)\big| = \Big|\frac{1}{c}\sum_{j=1}^c\ell(y_jf_j(x)) - \frac{1}{c}\sum_{j=1}^c\ell(y_jf'_j(x))\Big| \le \frac{1}{c}\sum_{j=1}^c\big|\ell(y_jf_j(x)) - \ell(y_jf'_j(x))\big| \le \frac{1}{c}\sum_{j=1}^c\mu\big|f_j(x) - f'_j(x)\big|$$
$$\le \frac{\mu c}{c}\max_{j\in[c]}\big|f_j(x) - f'_j(x)\big| = \mu\big\|f(x) - f'(x)\big\|_\infty.$$

Then, with the elementary inequality $|\max\{a_1, \ldots, a_c\} - \max\{b_1, \ldots, b_c\}| \le \max\{|a_1 - b_1|, \ldots, |a_c - b_c|\}$, we prove that the surrogate Subset Loss is $\mu$-Lipschitz continuous with respect to the $\ell_\infty$ norm:
$$\big|L_S(f(x), y) - L_S(f'(x), y)\big| = \Big|\max_{j\in[c]}\ell(y_jf_j(x)) - \max_{j\in[c]}\ell(y_jf'_j(x))\Big| \le \max_{j\in[c]}\big|\ell(y_jf_j(x)) - \ell(y_jf'_j(x))\big| \le \mu\max_{j\in[c]}\big|f_j(x) - f'_j(x)\big| = \mu\big\|f(x) - f'(x)\big\|_\infty.$$

Finally, we prove that the surrogate Ranking Loss is $2\mu$-Lipschitz continuous with respect to the $\ell_\infty$ norm:
$$\big|L_R(f(x), y) - L_R(f'(x), y)\big| = \Big|\frac{1}{|Y^+||Y^-|}\sum_{p\in Y^+}\sum_{q\in Y^-}\big[\ell(f_p(x) - f_q(x)) - \ell(f'_p(x) - f'_q(x))\big]\Big| \le \max_{p\in Y^+, q\in Y^-}\big|\ell(f_p(x) - f_q(x)) - \ell(f'_p(x) - f'_q(x))\big|$$
$$\le \mu\max_{p\in Y^+, q\in Y^-}\big|(f_p(x) - f_q(x)) - (f'_p(x) - f'_q(x))\big| \le \mu\Big(\max_{p\in Y^+}\big|f_p(x) - f'_p(x)\big| + \max_{q\in Y^-}\big|f_q(x) - f'_q(x)\big|\Big) \le 2\mu\max_{j\in[c]}\big|f_j(x) - f'_j(x)\big| = 2\mu\big\|f(x) - f'(x)\big\|_\infty.$$
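The Lipschitz constants established in Proposition 3.10 can be probed empirically. The toy check below uses the logistic base loss $\ell(t) = \log(1 + e^{-t})$ (so $\mu = 1$) and random score vectors; none of this is the paper's code. It estimates the worst observed ratio $|L(f(x), y) - L(f'(x), y)| / \|f(x) - f'(x)\|_\infty$ for the surrogate Hamming and Ranking losses, which should stay below $1$ and $2$, respectively.

```python
import numpy as np

rng = np.random.default_rng(2)
c = 6
y = rng.choice([-1.0, 1.0], size=c)
y[0], y[1] = 1.0, -1.0                      # ensure Y+ and Y- are non-empty

base = lambda t: np.log1p(np.exp(-t))       # 1-Lipschitz logistic surrogate

def hamming(f):
    return np.mean(base(y * f))

def ranking(f):
    pos, neg = f[y > 0], f[y < 0]
    return np.mean([base(p - q) for p in pos for q in neg])

worst_h, worst_r = 0.0, 0.0
for _ in range(20_000):
    f1 = rng.normal(size=c)
    f2 = f1 + rng.normal(scale=0.1, size=c)
    gap = np.max(np.abs(f1 - f2))           # ||f(x) - f'(x)||_inf
    worst_h = max(worst_h, abs(hamming(f1) - hamming(f2)) / gap)
    worst_r = max(worst_r, abs(ranking(f1) - ranking(f2)) / gap)

print(f"observed Hamming ratio {worst_h:.3f} (bound 1), "
      f"ranking ratio {worst_r:.3f} (bound 2)")
```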
D.2. Proof of Lemma 3.11

We first introduce the following vector-contraction inequality from (Foster & Rakhlin, 2019):

Lemma D.1 (Theorem 1 in (Foster & Rakhlin, 2019)). Let $\mathcal{F} \subseteq \{f : \mathcal{X} \to \mathbb{R}^c\}$, and let $\varphi : \mathbb{R}^c \to \mathbb{R}$ be $L$-Lipschitz with respect to the $\ell_\infty$ norm. For any $1 > \eta > 0$, there exists a constant $C > 0$ such that if $\max\{|\varphi(f(x))|, \|f(x)\|_\infty\} \le \beta$, then
$$\hat{\mathfrak{R}}_D(\varphi\circ\mathcal{F}) \le C L\sqrt{c}\,\max_{j\in[c]}\widetilde{\mathfrak{R}}_n(\mathcal{F}_j)\,\log^{\frac{3}{2}+\eta}\Big(\frac{\beta n}{\max_{j\in[c]}\widetilde{\mathfrak{R}}_n(\mathcal{F}_j)}\Big),$$
where $\hat{\mathfrak{R}}_D(\varphi\circ\mathcal{F}) = \mathbb{E}_\epsilon\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^n\epsilon_i\varphi(f(x_i))$ and $\widetilde{\mathfrak{R}}_n(\mathcal{F}_j) = \sup_{D\in\mathcal{X}^n}\hat{\mathfrak{R}}_D(\mathcal{F}_j)$.

Next we refine the proof of Lemma D.1. First, we introduce the covering number estimate in Theorem 1 of (Foster & Rakhlin, 2019):
$$\log N_\infty(\epsilon, \varphi\circ\mathcal{F}, D) \le \max_{j\in[c]} E c L r_j\log^{1+\eta}\frac{e^{2+\eta}n}{r_j\epsilon_0}, \quad \text{where } r_j = 8n\Big(\frac{\widetilde{\mathfrak{R}}_n(\mathcal{F}_j)}{a\epsilon_0}\Big)^2,\ E > 0,\ 0 < a < 1,\ \epsilon_0 = \frac{\epsilon}{L}.$$
Then, by direct calculation and proper scaling, we have
$$\log N_2(\epsilon, \varphi\circ\mathcal{F}, D) \le \max_{j\in[c]} E c\,r_j\log^{1+\eta}\frac{e^{2+\eta}n}{r_j\epsilon_0} \le \max_{j\in[c]} 8Ecn\Big(\frac{\widetilde{\mathfrak{R}}_n(\mathcal{F}_j)}{a\epsilon_0}\Big)^2\log^{1+\eta}\frac{e^{2+\eta}a^2\epsilon_0}{8\big(\widetilde{\mathfrak{R}}_n(\mathcal{F}_j)\big)^2}$$
$$\le \max_{j\in[c]} 32Ecn\Big(\frac{\widetilde{\mathfrak{R}}_n(\mathcal{F}_j)}{a\epsilon_0}\Big)^2\log^{1+\eta}\frac{2}{\widetilde{\mathfrak{R}}_n(\mathcal{F}_j)} \quad \big(\text{Use the inequality } 2\widetilde{\mathfrak{R}}_n(\mathcal{F}_j) < \epsilon_0 < 1\big)$$
$$\le \max_{j\in[c]} 32Ecn\Big(\frac{\widetilde{\mathfrak{R}}_n(\mathcal{F}_j)}{a\epsilon_0}\Big)^2\log^{1+\eta}\frac{2\sqrt{2n}}{B}. \quad \Big(\text{Use the similar technique to the proof of Lemma 3.6: the lower bound } \widetilde{\mathfrak{R}}_n(\mathcal{F}_j) \ge \tfrac{B}{\sqrt{2n}}\Big) \quad (12)$$

According to Lemma C.5 and combined with (12), we have
$$\hat{\mathfrak{R}}_D(\varphi\circ\mathcal{F}) \le \inf_{\alpha>0}\Big(4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^{M}\sqrt{\log N_2(\epsilon, \varphi\circ\mathcal{F}, D)}\,d\epsilon\Big) \le \inf_{\alpha>0}\Big(4\alpha + \frac{48\sqrt{2Ec}\,L}{a}\max_{j\in[c]}\widetilde{\mathfrak{R}}_n(\mathcal{F}_j)\log^{\frac{1+\eta}{2}}\Big(\frac{2\sqrt{2n}}{B}\Big)\int_\alpha^{M}\frac{d\epsilon}{\epsilon}\Big) \quad \text{(Use inequality (12))}$$
$$\le \frac{48\sqrt{2Ec}\,L}{a}\max_{j\in[c]}\widetilde{\mathfrak{R}}_n(\mathcal{F}_j) + \frac{48\sqrt{2Ec}\,L}{a}\max_{j\in[c]}\widetilde{\mathfrak{R}}_n(\mathcal{F}_j)\log^{\frac{1+\eta}{2}}\Big(\frac{2\sqrt{2n}}{B}\Big)\log\frac{aM}{12\sqrt{2Ec}\,L\max_{j\in[c]}\widetilde{\mathfrak{R}}_n(\mathcal{F}_j)} \quad \Big(\text{Choose } \alpha = \frac{12\sqrt{2Ec}\,L}{a}\max_{j\in[c]}\widetilde{\mathfrak{R}}_n(\mathcal{F}_j)\Big)$$
$$\le \frac{48\sqrt{2Ec}\,L}{a}\max_{j\in[c]}\widetilde{\mathfrak{R}}_n(\mathcal{F}_j)\Big(1 + \log^{\frac{1+\eta}{2}}\Big(\frac{2\sqrt{2n}}{B}\Big)\log\frac{\sqrt{2n}M}{12\sqrt{2Ec}\,LB}\Big). \quad \Big(\text{Use } 0 < a < 1 \text{ and } \widetilde{\mathfrak{R}}_n(\mathcal{F}_j) \ge \tfrac{B}{\sqrt{2n}}\Big)$$
Finally, combined with our problem setting ($L = \mu$), we have
$$\hat{\mathfrak{R}}_D(\mathcal{L}) \le \frac{48\sqrt{2E}\,\mu}{a}\sqrt{c}\max_{j\in[c]}\widetilde{\mathfrak{R}}_n(\mathcal{F}_j)\Big(1 + \log^{\frac{1+\eta}{2}}\Big(\frac{2\sqrt{2n}}{B}\Big)\log\frac{\sqrt{2n}M}{12\sqrt{2Ec}\,\mu B}\Big).$$

D.3. Proof of Theorem 3.12

Using the similar technique to the proof of Theorem 3.4, $\widetilde{\mathfrak{R}}_n(\mathcal{F}_j)$ can be upper bounded by $\frac{B}{\sqrt{n}}$; we then have
$$\hat{\mathfrak{R}}_D(\mathcal{L}) \le \frac{48\sqrt{2E}\,\mu B\sqrt{c}}{a\sqrt{n}}\Big(1 + \log^{\frac{1+\eta}{2}}\Big(\frac{2\sqrt{2n}}{B}\Big)\log\frac{\sqrt{2n}M}{12\sqrt{2Ec}\,\mu B}\Big).$$
Combining with (7), then
$$R(f) \le \hat{R}_D(f) + \frac{96\sqrt{2E}\,\mu B\sqrt{c}}{a\sqrt{n}}\Big(1 + \log^{\frac{1+\eta}{2}}\Big(\frac{2\sqrt{2n}}{B}\Big)\log\frac{\sqrt{2n}M}{12\sqrt{2Ec}\,\mu B}\Big) + 3M\sqrt{\frac{\log(2/\delta)}{2n}}.$$

D.4. Proof of Corollary 3.14

We first introduce the following lemmas:

Lemma D.2 (Theorem 5 in (Lei et al., 2019)). Let $\mathcal{F}$ be the vector-valued function class of multi-label learning defined by (1), and let Assumptions 3.1 and 3.9 hold. Given a dataset $D$ of size $n$, suppose that the loss function is $L$-Lipschitz continuous w.r.t. the $\ell_\infty$ norm. Then $\hat{\mathfrak{R}}_D(\mathcal{L})$ can be bounded by
$$\hat{\mathfrak{R}}_D(\mathcal{L}) \le 16L\sqrt{c\log 2}\,\widetilde{\mathfrak{R}}_{nc}(\tilde{\mathcal{F}})\Big(1 + \log\big(2n^{\frac{3}{2}}c\big)\Big).$$
Since $\log\big(2n^{\frac{3}{2}}c\big) \ge 1$, we have
$$\hat{\mathfrak{R}}_D(\mathcal{L}) \le 32L\sqrt{c\log 2}\,\widetilde{\mathfrak{R}}_{nc}(\tilde{\mathcal{F}})\log\big(2n^{\frac{3}{2}}c\big) \le 18L\sqrt{c}\,\widetilde{\mathfrak{R}}_{nc}(\tilde{\mathcal{F}})\log\big(2n^{\frac{3}{2}}c\big).$$
Using the similar technique to the proof of Theorem 3.4, $\widetilde{\mathfrak{R}}_{nc}(\tilde{\mathcal{F}})$ can be upper bounded by $\frac{\Lambda A}{\sqrt{nc}}$. Then, combining with (7) and our problem setting, we have
$$R(f) \le \hat{R}_D(f) + \frac{36\mu\Lambda A\log\big(2n^{\frac{3}{2}}c\big)}{\sqrt{n}} + 3M\sqrt{\frac{\log(2/\delta)}{2n}}.$$

Lemma D.3 (Corollary 10 in (Lei et al., 2019)). Let $\mathcal{F}$ be the vector-valued function class defined by (1), where $\alpha(w) := \|w\|_{S_p}$, $\beta(\varphi(x)) := \max_{i\in[n]}\|x_i\|_2$, i.e., $\varphi$ is the identity map, and $p \ge 1$. Assume that the loss function is $L$-Lipschitz continuous w.r.t. the $\ell_\infty$ norm. Then, if $1 \le p \le 2$, for any $0 < \delta < 1$, with probability at least $1 - \delta$ we have
$$\hat{\mathfrak{R}}_D(\mathcal{L}) \le \frac{16L\sqrt{\log 2}\,\Lambda\max_{i\in[n]}\|x_i\|_2}{\sqrt{n}}\Big(1 + \log\big(2n^{\frac{3}{2}}c\big)\Big) \le \frac{18L\Lambda\max_{i\in[n]}\|x_i\|_2\log\big(2n^{\frac{3}{2}}c\big)}{\sqrt{n}};$$
if $p > 2$, we have
$$\hat{\mathfrak{R}}_D(\mathcal{L}) \le \frac{16L\sqrt{\log 2}\,\Lambda\max_{i\in[n]}\|x_i\|_2\min\{c, d\}^{\frac{1}{2}-\frac{1}{p}}}{\sqrt{n}}\Big(1 + \log\big(2n^{\frac{3}{2}}c\big)\Big) \le \frac{18L\Lambda\max_{i\in[n]}\|x_i\|_2\min\{c, d\}^{\frac{1}{2}-\frac{1}{p}}\log\big(2n^{\frac{3}{2}}c\big)}{\sqrt{n}},$$
where we scale the bounds using $\log\big(2n^{\frac{3}{2}}c\big) \ge 1$.

This lemma holds for the class of linear functions and is logarithmically dependent on the number of labels. We adjust the constraint on the nonlinear mapping such that our bound holds for the general class of nonlinear functions. Then, combining with (7) and our problem setting, when $1 \le p \le 2$ we have
$$R(f) \le \hat{R}_D(f) + \frac{36\mu\Lambda A\log\big(2n^{\frac{3}{2}}c\big)}{\sqrt{n}} + 3M\sqrt{\frac{\log(2/\delta)}{2n}};$$
when $p > 2$, we have
$$R(f) \le \hat{R}_D(f) + \frac{36\mu\Lambda A\min\{c, d\}^{\frac{1}{2}-\frac{1}{p}}\log\big(2n^{\frac{3}{2}}c\big)}{\sqrt{n}} + 3M\sqrt{\frac{\log(2/\delta)}{2n}}.$$
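Lemma D.3 instantiates the regularizer as the Schatten-$p$ norm $\alpha(w) = \|w\|_{S_p}$ of the weight matrix, i.e., the $\ell_p$ norm of its singular values. The small sketch below is only an illustration of how such a constraint can be evaluated; the matrix and the values of $p$ are hypothetical.

```python
import numpy as np

def schatten_norm(W, p):
    """Schatten-p norm: the l_p norm of the singular values of W."""
    s = np.linalg.svd(W, compute_uv=False)
    return float(np.sum(s ** p) ** (1.0 / p))

rng = np.random.default_rng(3)
W = rng.normal(size=(10, 5))                 # d = 10 features, c = 5 labels (toy)
for p in (1, 2, 4):
    print(f"p = {p}:  ||W||_Sp = {schatten_norm(W, p):.3f}")
# p = 1 is the trace norm, which promotes a low-rank W, i.e. strongly coupled
# label predictors; p = 2 recovers the Frobenius norm.
```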
D.5. Proof of Lemma 3.16

Proof Sketch: First, the Rademacher complexity of the loss function space associated with $\mathcal{F}$ can be bounded by the empirical $\ell_\infty$ norm covering number via the refined Dudley entropy integral inequality. Second, according to the Lipschitz continuity w.r.t. the $\ell_\infty$ norm, the empirical $\ell_\infty$ norm covering number of the loss class can be bounded by the empirical $\ell_\infty$ norm covering number of $P(\mathcal{F})$. Third, the empirical $\ell_\infty$ norm covering number of $P(\mathcal{F})$ can be bounded by the fat-shattering dimension, and the fat-shattering dimension can be bounded by the worst-case Rademacher complexity of $P(\mathcal{F})$. Hence, the problem is transferred to the estimation of the worst-case Rademacher complexity. Finally, we estimate the lower bound of the worst-case Rademacher complexity of $P(\mathcal{F})$; combined with the above steps, the Rademacher complexity of the loss function space associated with $\mathcal{F}$ can then be bounded.

We first introduce the following lemmas:

Lemma D.4 (Lemma A.2 in (Srebro et al., 2010)). For any function class $\mathcal{F}$, any finite sample $S$ of size $n$, and any $\epsilon > 2\hat{\mathfrak{R}}_S(\mathcal{F})$, we have
$$\mathrm{fat}_\epsilon(\mathcal{F}) \le \frac{16n\hat{\mathfrak{R}}_S^2(\mathcal{F})}{\epsilon^2}.$$

Lemma D.5 (Theorem 4.4 in (Rudelson & Vershynin, 2006)). For any function class $\mathcal{F}$, any finite sample $S$ of size $n$, and any $\eta \in (0, 1)$, there exist constants $0 < a < 1$ and $E > 0$ such that for all $\epsilon \in (0, 1)$,
$$\log N_\infty(\epsilon, \mathcal{F}, S) \le E v\log(en/(v\epsilon))\log^\eta(en/v), \quad \text{where } v = \mathrm{fat}_{a\epsilon}(\mathcal{F}).$$

Step 1: We first derive the relationship between the empirical $\ell_\infty$ norm covering number $N_\infty(\epsilon, \mathcal{L}, D)$ and the empirical $\ell_\infty$ norm covering number $N_\infty(\epsilon, P(\mathcal{F}), [c]\cdot D)$. For the dataset $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with $n$ i.i.d. examples and any $f, f' \in \mathcal{F}$,
$$\max_i\big|L(f(x_i), y_i) - L(f'(x_i), y_i)\big| \le \mu\max_i\big\|f(x_i) - f'(x_i)\big\|_\infty \quad \text{(Use Assumption 3.9)}$$
$$= \mu\max_i\max_j\big|f_j(x_i) - f'_j(x_i)\big| = \mu\max_i\max_j\big|p_j(f(x_i)) - p_j(f'(x_i))\big|. \quad \text{(The definition of the projection function class } P(\mathcal{F}))$$
Then, according to the definition of the empirical $\ell_\infty$ covering number, an empirical $\ell_\infty$ cover of $P(\mathcal{F})$ at radius $\epsilon/\mu$ is also an empirical $\ell_\infty$ cover of the loss function space associated with $\mathcal{F}$ at radius $\epsilon$, and we can conclude that
$$N_\infty(\epsilon, \mathcal{L}, D) \le N_\infty\Big(\frac{\epsilon}{\mu}, P(\mathcal{F}), [c]\cdot D\Big). \quad (13)$$

Step 2: We show that the empirical $\ell_\infty$ norm covering number of $P(\mathcal{F})$ can be bounded by the fat-shattering dimension, which in turn can be bounded by the worst-case Rademacher complexity of $P(\mathcal{F})$. According to Lemma D.4, for any $\epsilon > 2\hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F}))$, we have
$$\mathrm{fat}_\epsilon(P(\mathcal{F})) \le \frac{16nc\,\hat{\mathfrak{R}}^2_{[c]\cdot D}(P(\mathcal{F}))}{\epsilon^2}.$$
Then, combining with Lemma D.5, for any $\eta \in (0, 1)$ there exist constants $0 < a < 1$ and $E > 0$ such that for all $\epsilon \in (0, 1)$,
$$\log N_\infty(\epsilon, P(\mathcal{F}), [c]\cdot D) \le E v\log(enc/(v\epsilon))\log^\eta(enc/v),$$
where $v = \mathrm{fat}_{a\epsilon}(P(\mathcal{F})) \le \frac{16nc\,\hat{\mathfrak{R}}^2_{[c]\cdot D}(P(\mathcal{F}))}{a^2\epsilon^2}$, and we set $d = \frac{16nc\,\hat{\mathfrak{R}}^2_{[c]\cdot D}(P(\mathcal{F}))}{a^2\epsilon^2}$. Hence,
$$\log N_\infty(\epsilon, P(\mathcal{F}), [c]\cdot D) \le E v\log\frac{enc}{v\epsilon}\log^\eta\frac{enc}{v} \le E d\log\frac{e^{2+\eta}nc}{d\epsilon}\log^\eta\frac{e^{2+\eta}nc}{d} \le E d\log^{1+\eta}\frac{e^{2+\eta}nc}{d\epsilon}$$
$$\le \frac{16Enc\,\hat{\mathfrak{R}}^2_{[c]\cdot D}(P(\mathcal{F}))}{a^2\epsilon^2}\log^{1+\eta}\frac{1.2}{\hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F}))}, \quad \big(\text{Use the inequality } 2\hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F})) < \epsilon < 1\big) \quad (14)$$
where we use that, for any $s, t > 0$, the function $x \mapsto x\log\frac{s}{x}$ is non-decreasing as long as $s > t > e^{1+\eta}x$, and that $v \le nc$, in the second inequality; the third and fourth inequalities are obtained by direct calculation and proper scaling.
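The derivation of (14) relies on the elementary fact that, for a fixed $s > 0$, the map $x \mapsto x\log\frac{s}{x}$ is non-decreasing on $(0, s/e]$, which is the regime guaranteed by the stated condition. A quick numerical confirmation (the grid and the values of $s$ are arbitrary):

```python
import numpy as np

# Check that x * log(s / x) is non-decreasing for x in (0, s/e].
for s in (1.0, 10.0, 1e3):
    x = np.linspace(1e-6, s / np.e, 10_000)
    g = x * np.log(s / x)
    assert np.all(np.diff(g) >= -1e-12), "monotonicity violated"
print("x * log(s/x) is non-decreasing on (0, s/e] for the tested values of s")
```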
Then, we have
$$\log N_\infty(\epsilon, P(\mathcal{F}), [c]\cdot D) \le \frac{16Enc\,\hat{\mathfrak{R}}^2_{[c]\cdot D}(P(\mathcal{F}))}{a^2\epsilon^2}\log^{1+\eta}\frac{1.2}{\hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F}))} \le \frac{64Enc\,\hat{\mathfrak{R}}^2_{[c]\cdot D}(P(\mathcal{F}))}{a^2\epsilon^2}\log^{1+\eta}\frac{1.2}{\hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F}))}$$
$$\le \frac{64Enc\,\hat{\mathfrak{R}}^2_{[c]\cdot D}(P(\mathcal{F}))}{a^2\epsilon^2}\log^{1+\eta}\frac{8\sqrt{nc}}{5B} \quad \text{(Use inequality (10))} \qquad \le \frac{64Enc\,\widetilde{\mathfrak{R}}^2_{nc}(P(\mathcal{F}))}{a^2\epsilon^2}\log^{1+\eta}\frac{8\sqrt{nc}}{5B}. \quad \text{(The definition of the worst-case Rademacher complexity)} \quad (15)$$

Step 3: According to Lemma C.5 and combined with the above steps, we have
$$\hat{\mathfrak{R}}_D(\mathcal{L}) \le \inf_{\alpha>0}\Big(4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^{M}\sqrt{\log N_\infty(\epsilon, \mathcal{L}, D)}\,d\epsilon\Big) \le \inf_{\alpha>0}\Big(4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^{M}\sqrt{\log N_\infty\Big(\frac{\epsilon}{\mu}, P(\mathcal{F}), [c]\cdot D\Big)}\,d\epsilon\Big) \quad \text{(Use inequality (13))}$$
$$\le \inf_{\alpha>0}\Big(4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^{M}\sqrt{\frac{64E\mu^2nc\,\widetilde{\mathfrak{R}}^2_{nc}(P(\mathcal{F}))}{a^2\epsilon^2}\log^{1+\eta}\frac{8\sqrt{nc}}{5B}}\,d\epsilon\Big) \quad \text{(Use inequality (15))}$$
$$\le \frac{96\sqrt{Ec}\,\mu}{a}\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) + \frac{96\sqrt{Ec}\,\mu}{a}\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))\log^{\frac{1+\eta}{2}}\Big(\frac{8\sqrt{nc}}{5B}\Big)\log\frac{aM}{24\sqrt{Ec}\,\mu\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))} \quad \Big(\text{Choose } \alpha = \frac{24\sqrt{Ec}\,\mu}{a}\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))\Big)$$
$$\le \frac{96\sqrt{Ec}\,\mu}{a}\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))\Big(1 + \log^{\frac{1+\eta}{2}}\Big(\frac{8\sqrt{nc}}{5B}\Big)\log\frac{\sqrt{2n}M}{24\sqrt{E}\,\mu B}\Big). \quad \text{(Use } 0 < a < 1 \text{ and inequality (10))}$$

D.6. Proof of Theorem 3.17

According to inequality (11), we have $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) \le \frac{B}{\sqrt{nc}}$; then
$$\hat{\mathfrak{R}}_D(\mathcal{L}) \le \frac{96\sqrt{E}\,\mu B}{a}\frac{1}{\sqrt{n}}\Big(1 + \log^{\frac{1+\eta}{2}}\Big(\frac{8\sqrt{nc}}{5B}\Big)\log\frac{\sqrt{2n}M}{24\sqrt{E}\,\mu B}\Big).$$
Combining with (7), then
$$R(f) \le \hat{R}_D(f) + \frac{192\sqrt{E}\,\mu B}{a}\frac{1}{\sqrt{n}}\Big(1 + \log^{\frac{1+\eta}{2}}\Big(\frac{8\sqrt{nc}}{5B}\Big)\log\frac{\sqrt{2n}M}{24\sqrt{E}\,\mu B}\Big) + 3M\sqrt{\frac{\log(2/\delta)}{2n}}.$$

E. Generalization Analysis for Macro-Averaged AUC

E.1. Proof of Theorem 4.2

We first prove the following lemma:

Lemma E.1. Let $q_\tau : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ be real-valued functions indexed by $\tau \in \mathcal{T}$, where $\mathcal{T}$ is some set. If $x_1, \ldots, x_s$ and $x'_1, \ldots, x'_t$ are i.i.d. and $r = \min\{s, t\}$, then for any convex non-decreasing function $\psi$,
$$\mathbb{E}\,\psi\Big(\sup_{\tau\in\mathcal{T}}\frac{1}{st}\sum_{i=1}^s\sum_{j=1}^t q_\tau(x_i, x'_j)\Big) \le \mathbb{E}\,\psi\Big(\sup_{\tau\in\mathcal{T}}\frac{1}{r}\sum_{i=1}^r q_\tau(x_i, x'_i)\Big).$$

Proof. The proof of this lemma is inspired by (Clémençon et al., 2008). The two-sample average can be written as an expectation, over a random selection $\pi$ of index pairs, of averages of $r$ independent terms, so that
$$\mathbb{E}\,\psi\Big(\sup_\tau\frac{1}{st}\sum_{i=1}^s\sum_{j=1}^t q_\tau(x_i, x'_j)\Big) = \mathbb{E}\,\psi\Big(\sup_\tau\mathbb{E}_\pi\Big[\frac{1}{r}\sum_{i=1}^r q_\tau(x_{\pi(i)}, x'_{\pi(i)})\Big]\Big) \le \mathbb{E}\,\psi\Big(\mathbb{E}_\pi\Big[\sup_\tau\frac{1}{r}\sum_{i=1}^r q_\tau(x_{\pi(i)}, x'_{\pi(i)})\Big]\Big) \quad (\psi \text{ is non-decreasing})$$
$$\le \mathbb{E}\,\mathbb{E}_\pi\Big[\psi\Big(\sup_\tau\frac{1}{r}\sum_{i=1}^r q_\tau(x_{\pi(i)}, x'_{\pi(i)})\Big)\Big] \quad \text{(Jensen's inequality)} \qquad = \mathbb{E}\,\psi\Big(\sup_\tau\frac{1}{r}\sum_{i=1}^r q_\tau(x_i, x'_i)\Big).$$

With this lemma, we first prove that
$$\mathbb{E}\big[R(\hat{f})\big] - R^* \le \frac{384\sqrt{2E}\,\mu B}{a}\max_{j\in[c]}\frac{1}{\sqrt{\min\{s_j, t_j\}}}\Big(1 + \log^{\frac{1+\eta}{2}}\Big(\frac{2\sqrt{2n}}{B}\Big)\log\frac{\sqrt{2n}M}{12\sqrt{2Ec}\,\mu B}\Big),$$
where $s_j$ and $t_j$ denote the numbers of positive and negative instances of the $j$-th label. Indeed,
$$\mathbb{E}\big[R(\hat{f})\big] - R^* = \mathbb{E}\big[R(\hat{f}) - \hat{R}_D(\hat{f}) + \hat{R}_D(\hat{f}) - R^*\big] = \mathbb{E}\big[R(\hat{f}) - \hat{R}_D(\hat{f})\big] + \mathbb{E}\big[\hat{R}_D(\hat{f}) - R^*\big] \le \mathbb{E}\sup_{f\in\mathcal{F}}\big(R(f) - \hat{R}_D(f)\big) + \mathbb{E}\sup_{f\in\mathcal{F}}\big(\hat{R}_D(f) - R(f)\big)$$
$$\le 2\,\mathbb{E}\sup_{f\in\mathcal{F}}\big(R(f) - \hat{R}_D(f)\big) \le 2\,\mathbb{E}\sup_{f\in\mathcal{F}}\Big[R(f) - \frac{1}{c}\sum_{j=1}^c\frac{1}{r_j}\sum_{i=1}^{r_j}\ell\big(f_j(x_i) - f_j(x'_i)\big)\Big], \quad \text{(Use Lemma E.1)}$$
where, for each label $j$, $x_i$ and $x'_i$ range over the positive and negative instances of the $j$-th label and $r_j = \min\{s_j, t_j\}$. Let $\bar{D}$ be an independent copy of $D = \{x_1, \ldots, x_n\}$. Then, by the standard symmetrization technique and Jensen's inequality, similar to the Preliminaries in Section B, the above quantity can be bounded by
$$2\,\mathbb{E}_{D,\bar{D},\epsilon}\sup_{f\in\mathcal{F}}\frac{1}{c}\sum_{j=1}^c\frac{1}{r_j}\sum_{i=1}^{r_j}\epsilon_{ij}\Big[\ell\big(f_j(\bar{x}_i) - f_j(\bar{x}'_i)\big) - \ell\big(f_j(x_i) - f_j(x'_i)\big)\Big] = 4\,\mathbb{E}_{D,\epsilon}\sup_{f\in\mathcal{F}}\frac{1}{c}\sum_{j=1}^c\frac{1}{r_j}\sum_{i=1}^{r_j}\epsilon_{ij}\,\ell\big(f_j(x_i) - f_j(x'_i)\big).$$
Applying Lemma 3.11 (the surrogate ranking-type loss is $2\mu$-Lipschitz w.r.t. the $\ell_\infty$ norm by Proposition 3.10), then Lemma C.2 together with the boundedness $|f_j(\cdot)| \le B$, yields the stated bound.

Similarly, we can derive that
$$R(\hat{f}) - R^* = R(\hat{f}) - \hat{R}_D(\hat{f}) + \hat{R}_D(\hat{f}) - R^* \le \sup_{f\in\mathcal{F}}\big[R(f) - \hat{R}_D(f)\big] + \sup_{f\in\mathcal{F}}\big[\hat{R}_D(f) - R(f)\big] \le 2\sup_{f\in\mathcal{F}}\big[R(f) - \hat{R}_D(f)\big].$$
Let $D' = \{x_1, \ldots, x'_k, \ldots, x_n\}$ be the training dataset with only one sample different from $D$, where the $k$-th sample is replaced by $x'_k$. Then
$$\sup_{f\in\mathcal{F}}\big[R(f) - \hat{R}_{D'}(f)\big] - \sup_{f\in\mathcal{F}}\big[R(f) - \hat{R}_D(f)\big] \le \sup_{f\in\mathcal{F}}\big[\hat{R}_D(f) - \hat{R}_{D'}(f)\big] \le \sup_{f\in\mathcal{F}}\frac{1}{c}\sum_{j=1}^c\frac{1}{|X_j^+||X_j^-|}\sum_{x_i}\Big(\ell\big(f_j(x_k) - f_j(x_i)\big) - \ell\big(f_j(x'_k) - f_j(x_i)\big)\Big),$$
which gives the bounded-difference condition required by McDiarmid's inequality. According to McDiarmid's inequality, for any $0 < \delta < 1$, with probability at least $1 - \delta$ over the training dataset $D$, $R(\hat{f}) - R^*$ is bounded by
$$\frac{384\sqrt{2E}\,\mu B}{a}\max_{j\in[c]}\frac{1}{\sqrt{\min\{s_j, t_j\}}}\Big(1 + \log^{\frac{1+\eta}{2}}\Big(\frac{2\sqrt{2n}}{B}\Big)\log\frac{\sqrt{2n}M}{12\sqrt{2Ec}\,\mu B}\Big)$$
plus the corresponding deviation term. Note that when we analyze class-imbalance, we focus on whether $s_j$ and $t_j$ are seriously imbalanced for a fixed number $n$ of examples. Hence, the order of the bound is $\widetilde{O}\big(\sqrt{\cdot}\,\big)$
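To make the class-imbalance quantities $s_j$, $t_j$ and $\min\{s_j, t_j\}$ concrete, the following sketch uses random scores and labels with deliberately different label frequencies; the per-label pairwise statistic is a plausible empirical form of Macro-Averaged AUC rather than necessarily the paper's exact definition. It computes the Macro-Averaged AUC and reports the effective per-label pair budget $\min\{s_j, t_j\}$ discussed above.

```python
import numpy as np

rng = np.random.default_rng(4)
n, c = 200, 4
scores = rng.normal(size=(n, c))                       # f_j(x_i)
labels = (rng.random(size=(n, c)) < [0.5, 0.2, 0.05, 0.9]).astype(int)

def macro_auc(scores, labels):
    aucs = []
    for j in range(labels.shape[1]):
        pos = scores[labels[:, j] == 1, j]
        neg = scores[labels[:, j] == 0, j]
        if len(pos) == 0 or len(neg) == 0:
            continue                                   # label j is degenerate
        # fraction of correctly ranked (positive, negative) pairs for label j
        aucs.append(np.mean(pos[:, None] > neg[None, :]))
    return float(np.mean(aucs))

s = labels.sum(axis=0)                                 # s_j: positives per label
t = n - s                                              # t_j: negatives per label
print("Macro-Averaged AUC:", round(macro_auc(scores, labels), 3))
print("min{s_j, t_j} per label:", np.minimum(s, t))
# Heavily imbalanced labels have a small min{s_j, t_j}, the quantity the
# class-imbalance discussion above focuses on.
```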