# Generalization Analysis for Label-Specific Representation Learning

Yi-Fan Zhang1,3, Min-Ling Zhang2,3
1 School of Cyber Science and Engineering, Southeast University, Nanjing 210096, China
2 School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
3 Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, China
{yfzh, zhangml}@seu.edu.cn

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Label-specific representation learning (LSRL), i.e., constructing the representation with specific discriminative properties for each class label, is an effective strategy to improve the performance of multi-label learning. However, the generalization analysis of LSRL is still in its infancy. The existing theoretical bounds for multi-label learning, which preserve the coupling among different components, are invalid for LSRL. To overcome this challenge and fill the gap in the generalization theory of LSRL, we develop a novel vector-contraction inequality and derive a generalization bound for the general function class of LSRL with a weaker dependency on the number of labels than the state of the art. In addition, we derive generalization bounds for typical LSRL methods, and these theoretical results reveal the impact of different label-specific representations on the generalization analysis. The mild bounds without strong assumptions explain the good generalization ability of LSRL.

1 Introduction

Multi-label learning has received continued attention in the machine learning community due to its widespread encounters in real-world applications, where each object is represented by a single instance associated with multiple class labels [Zhang and Zhou, 2014, Liu et al., 2022]. The goal of multi-label learning is to model real-world objects with multiple semantics. It has made important advances in multimedia content annotation [Cabral et al., 2011, You et al., 2020], text categorization [Rubin et al., 2012, Xun et al., 2020], bioinformatics [Cesa-Bianchi et al., 2012, Chen et al., 2017] and other fields [Yu et al., 2005].

The most straightforward strategy, which exploits an identical representation of an instance for dealing with multi-label data, is perhaps suboptimal because it fails to take into account that each class label may possess its own discriminative properties [Zhang and Wu, 2015, Huang et al., 2016, Hang and Zhang, 2022]. In recent years, label-specific representation learning (LSRL) [Zhang and Wu, 2015] has been proposed to facilitate the discrimination of each class label by tailoring its own representations. Due to its ability to model distinct characteristics for each class label, LSRL has become an effective strategy to improve the performance of multi-label learning [Huang et al., 2015, 2016, 2018, Yu and Zhang, 2022, Hang and Zhang, 2022, Hang et al., 2022].

Although LSRL has achieved impressive empirical advances in multi-label learning, the problem of understanding LSRL theoretically remains completely under-explored. In recent years, explaining why multi-label models generalize well has become an important open problem in the multi-label learning community. The empirical success of LSRL makes the generalization analysis of LSRL an important problem in multi-label learning.
However, existing theoretical results and analysis methods are not applicable to LSRL, which leads to a serious lack of progress in its generalization analysis. A satisfactory and complete study of the generalization analysis for LSRL should address two aspects: 1) the dependency of the generalization bounds on the number of labels (i.e., $c$) should be weaker than square-root, and 2) the relationship among components should be decoupled in the generalization analysis.

First, existing bounds with a linear or square-root dependency on $c$ are difficult to reconcile with the empirical success of multi-label learning. For example, bounds with a linear dependency on $c$ are vacuous (i.e., no longer less than 1) for commonly used multi-label datasets such as CAL500, corel5k, rcv1-s1, Corel16k-s1, delicious, iaprtc12, espgame, etc., since $c$ is larger than $\sqrt{n}$ ($n$ being the number of examples) for these datasets. Bounds with a linear or square-root dependency on $c$ are vacuous for commonly used extreme multi-label datasets such as Wiki10, Amazon-670K, etc., since $c$ is larger than $n$. The above failure of existing bounds for multi-label learning also applies to LSRL. This suggests that only bounds with a weaker dependency on $c$ can provide effective theoretical guarantees. Second, existing theoretical results preserve the coupling among different components [Lei et al., 2015, Wu and Zhu, 2020, Wu et al., 2021a,b]. However, LSRL decomposes the multi-label learning problem into multiple binary classification problems, which means that the relationship among different components needs to be decoupled in the generalization analysis. Hence, we need to develop new analysis methods for LSRL. As a matter of fact, theoretical research on LSRL can also promote a better understanding of multi-label learning.

In this paper, we derive novel and tighter bounds based on the Rademacher complexity for LSRL. Specifically, we develop a novel vector-contraction inequality under the assumption that the loss function is Lipschitz continuous w.r.t. the $\ell_\infty$ norm, and then derive a bound for general function classes of LSRL with no dependency on $c$, up to logarithmic terms, which is tighter than the state of the art. In addition, we also analyze the bounds for several typical LSRL methods, and we show that the construction method of label-specific representations affects the generalization bound. Our bounds for LSRL improve the dependency on $c$. Major contributions of the paper include:

- We develop a novel vector-contraction inequality for $\ell_\infty$ norm Lipschitz continuous losses, which overcomes the limitations of existing theoretical results and provides a theoretical tool for the generalization analysis of LSRL.
- We derive bounds for general function classes of LSRL with a weaker dependency on $c$ than the state of the art, which provides a general theoretical guarantee for LSRL.
- We derive bounds for typical LSRL methods, which reveal the impact of different label-specific representations on the generalization analysis. The theoretical techniques and results on k-means clustering, Lasso, and DNNs involved here may be of independent interest.

2 Related Work

2.1 Multi-Label Learning

Multi-label learning is one of the most studied and important machine learning paradigms in practice [Zhang and Zhou, 2014, Liu et al., 2022]. To cope with the challenge that the output space is exponential in size with respect to the number of class labels, modeling label correlations is adopted as a feasible strategy to facilitate the learning process.
Generally speaking, existing methods can be roughly grouped into three major categories based on the order of label correlations being considered, namely first-order methods [Boutell et al., 2004, Zhang and Zhou, 2007, Zhang et al., 2018] which tackle multi-label learning problem by decomposing it into a number of independent binary classification problems, second-order methods [Elisseeff and Weston, 2001, Zhu et al., 2018, Sun and Zhang, 2021] which tackle multi-label learning problem by considering pairwise relationships between labels, and high-order methods [Ji et al., 2010, Xu and Guo, 2021] which tackle multi-label learning problem by exploiting high-order relationships among labels. Recent years, benefiting from the good generalization performance of deep learning, deep methods such as recurrent neural networks [Yazici et al., 2020], graph neural networks[Chen et al., 2020], and embedding models [Bai et al., 2020, Dahiya et al., 2021] have been explored to model label correlations. As a complement to the exploitation of label correlations, label-specific representation learning (LSRL) have been proven to be another effective strategy to improve multi-label learning by manipulating the input space. Existing methods can be roughly grouped into three major categories based on the construction method of label-specific representations, i.e., prototype-based label-specific representation transformation methods [Zhang and Wu, 2015, Zhang et al., 2015, Weng et al., 2018, Guo et al., 2019, Zhang and Li, 2021] which generate label-specific representations by treating the prototypes of each class label as the transformation bases, label-specific feature selection methods [Huang et al., 2015, 2016, 2018, Weng et al., 2020, Yu and Zhang, 2022] which generate label-specific representations by retaining a feature subset as the most pertinent features for each class label, and deep label-specific representation methods [Hang and Zhang, 2022, Hang et al., 2022] which learn label-specific representations in an end-to-end manner by exploiting deep models. In particular, k-means clustering-based LSRL, i.e., LIFT [Zhang and Wu, 2015], is the representative method of prototype-based label-specific representation transformation methods, which constructs label-specific representations by querying the distances between the original inputs and the cluster centers for each class label. Lasso-based LSRL, i.e., LLSF [Huang et al., 2016], is the representative method of label-specific feature selection methods, which presents a Lasso-based framework with the constraint of pairwise label correlations for each class label. DNN-based LSRL, i.e., CLIF [Hang and Zhang, 2022], is the representative method of deep label-specific representation methods, which proposes to learn label semantics and label-specific representations in a collaborative way. In this paper, we will analyze the generalization bounds of these three representative LSRL methods. 2.2 Generalization Bounds for Multi-Label Learning Dembczynski et al. [2010] derived the relationship between the expectations of Hamming and Subset loss based on the regret analysis on Hamming and Subset loss, and Dembczynski et al. [2012] further performed regret analysis on Ranking loss, which provided preliminary theoretical insights for understanding Hamming, Subset and Ranking loss. With the typical vector-contraction inequality [Maurer, 2016] (i.e., assume that the loss function is Lipschitz continuous w.r.t. 
the $\ell_2$ norm), one can obtain generalization bounds of order $O(c/\sqrt{n})$ for multi-label learning. Wu and Zhu [2020], Wu et al. [2021a] showed that the order of the generalization bounds for Subset loss, Hamming loss and reweighted convex surrogate univariate loss can be improved to $O(\sqrt{c/n})$, which preserved the coupling among different components and exploited the relationship between loss functions. Liu et al. [2018] also obtained a generalization bound of order $O(\sqrt{c/n})$ for dual set multi-label learning, which was analyzed under the margin loss and kernel function classes. Wu et al. [2021b] derived a generalization bound of order $O(\log(nc)/(n\sigma))$ for norm-regularized kernel function classes under the assumptions that the loss function is Lipschitz continuous w.r.t. the $\ell_\infty$ norm and the regularizer is $\sigma$-strongly convex with respect to some norm. Yu et al. [2014] obtained a generalization bound of order $O(1/\sqrt{n})$ for trace-norm-regularized linear function classes with Decomposable loss. Xu et al. [2016] used the local Rademacher complexity to derive a generalization bound of order $\widetilde{O}(1/n)$ for trace-norm-regularized linear function classes under the assumption that the singular values of the weight matrix decay exponentially. These theoretical results all preserved the coupling among different components. In addition, Wu et al. [2023] obtained $O(1/\sqrt{n})$ bounds for Macro-Averaged AUC and gave thorough discussions about their relationship with label-wise class imbalance, which transformed the macro-averaged maximization problem in multi-label learning into the problem of learning multiple tasks with graph-dependent examples. Here we obtain $\widetilde{O}(1/\sqrt{n})$ bounds with the state-of-the-art dependency on the number of labels for LSRL function classes under the assumption that the loss function is Lipschitz continuous w.r.t. the $\ell_\infty$ norm.

3 Preliminaries

Let $[n] := \{1, \ldots, n\}$ for any natural number $n$. In the context of multi-label learning, we are given a dataset $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with $n$ examples drawn identically and independently (i.i.d.) from a probability distribution $P$ on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ denotes the $d$-dimensional input space and $\mathcal{Y} = \{-1, +1\}^c$ denotes the label space with $c$ class labels, $x \in \mathcal{X}$, $y \in \mathcal{Y}$, i.e., each $y = (y_1, \ldots, y_c)$ is a binary vector and $y_j = +1$ ($y_j = -1$) denotes that the $j$-th label is relevant (irrelevant), $j \in [c]$. The task of multi-label learning is to learn a multi-label classifier $h \in \mathcal{H} : \mathcal{X} \mapsto \{-1, +1\}^c$ which assigns each instance a set of relevant labels. A common strategy is to learn a vector-valued function $f = (f_1, \ldots, f_c) : \mathcal{X} \mapsto \mathbb{R}^c$ and derive the classifier by a thresholding function which divides the label space into relevant and irrelevant label sets. For any vector-valued function $f : \mathcal{X} \mapsto \mathbb{R}^c$, the prediction quality on an example $(x, y)$ is measured by a loss function $\ell : \mathbb{R}^c \times \{-1, +1\}^c \mapsto \mathbb{R}_+$. The goal of learning is to find a hypothesis $f \in \mathcal{F}$ with good generalization performance from the dataset $D$ by optimizing the loss $\ell$. The generalization performance is measured by the expected risk $R(f) = \mathbb{E}_{(x,y) \sim P}[\ell(f(x), y)]$. We denote the empirical risk w.r.t. the training dataset $D$ as $\widehat{R}_D(f) = \frac{1}{n}\sum_{i=1}^{n}\ell(f(x_i), y_i)$. We denote the optimal risk as $R^* = \inf_{f \in \mathcal{F}} R(f)$, the minimizer of the optimal risk as $f^* = \arg\min_{f \in \mathcal{F}} R(f)$, and the minimizer of the empirical risk as $\hat{f} = \arg\min_{f \in \mathcal{F}} \widehat{R}_D(f)$. In addition, we define the loss function space as $\mathcal{L} = \{\ell(f(x), y) : f \in \mathcal{F}\}$, where $\mathcal{F}$ is the vector-valued function class.
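To make the notation above concrete, the following minimal Python sketch computes the empirical risk $\widehat{R}_D(f)$ and the thresholded multi-label prediction for a generic vector-valued scorer. The linear scorer, the logistic base loss, the threshold at zero, and all function names are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

def predict_labels(f_scores, threshold=0.0):
    """Derive the multi-label prediction h(x) in {-1,+1}^c by thresholding f(x)."""
    return np.where(f_scores >= threshold, 1, -1)

def logistic_base_loss(margin):
    """Illustrative base surrogate loss ell_b(t) = log(1 + exp(-t))."""
    return np.log1p(np.exp(-margin))

def empirical_risk(f, X, Y):
    """R_hat_D(f) = (1/n) * sum_i ell(f(x_i), y_i), here with the decomposable surrogate."""
    scores = f(X)                                      # shape (n, c)
    margins = Y * scores                               # entries y_{ij} * f_j(x_i)
    losses = logistic_base_loss(margins).sum(axis=1)   # per-example decomposable loss
    return losses.mean()

# toy usage: a random linear scorer f(x) = W x on synthetic data
rng = np.random.default_rng(0)
n, d, c = 100, 20, 5
X = rng.normal(size=(n, d))
W = rng.normal(size=(c, d))
Y = np.where(rng.normal(size=(n, c)) > 0, 1, -1)
f = lambda X: X @ W.T
print("empirical risk:", empirical_risk(f, X, Y))
print("predicted labels for x_1:", predict_labels(f(X)[0]))
```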
3.1 Label-Specific Representation Learning

Label-specific representation learning aims to construct a representation with specific discriminative properties for each class label to facilitate its discrimination process. The basic idea of LSRL is to decompose the multi-label learning problem into $c$ binary classification problems, i.e., decoupling the relationship among different components, where each binary classification problem corresponds to a possible label in the label space. We consider a prediction function for each label of the general form $f_j(x) = \langle w_j, \zeta_j(\phi_j(x))\rangle$, where the inner nonlinear mapping $\phi_j$ corresponds to the nonlinear transformation induced by the construction method of the label-specific representation, while the outer nonlinear mapping $\zeta_j$ refers to the nonlinear mapping corresponding to the classifier learned on the generated label-specific representation. We define a vector-valued function class of LSRL as follows:
$$\mathcal{F} = \big\{x \mapsto f(x) : f(x) = (f_1(x), \ldots, f_c(x)),\ f_j(x) = h_j(\phi_j(x)) = w_j^\top \zeta_j(\phi_j(x)),\ w = (w_1, \ldots, w_c) \in \mathbb{R}^{d \times c},\ \alpha(w) \le \Lambda,\ \beta(\zeta_j(\cdot)) \le A,\ \gamma(\phi_j(\cdot)) \le C,\ x \in \mathcal{X},\ j \in [c],\ \Lambda, A, C > 0\big\}, \tag{1}$$
where $\alpha$ represents a functional that constrains the weights, $\beta$ represents a functional that constrains the nonlinear mappings $\zeta_j$, and $\gamma$ represents a functional that constrains the nonlinear mappings $\phi_j$.

3.2 Related Evaluation Metrics

A number of evaluation metrics have been proposed to measure the generalization performance of different multi-label learning methods. Here we focus on commonly used evaluation metrics, i.e., Hamming Loss, Subset Loss, Ranking Loss and Decomposable Loss. However, the above-mentioned loss functions are typically 0/1 losses, which are hard to handle in optimization. Hence, one usually considers their surrogate losses, defined as follows:

Hamming Loss: $\ell_H(f(x), y) = \frac{1}{c}\sum_{j=1}^{c}\ell_b(y_j f_j(x))$, where the base convex surrogate loss $\ell_b$ can take various popular forms, such as the hinge loss, the logistic loss, the exponential loss and the squared loss.

Subset Loss: $\ell_S(f(x), y) = \max_{j \in [c]}\{\ell_b(y_j f_j(x))\}$.

Ranking Loss: $\ell_R(f(x), y) = \frac{1}{|Y^+||Y^-|}\sum_{p \in Y^+}\sum_{q \in Y^-}\ell_b(f_p(x) - f_q(x))$, where $Y^+$ ($Y^-$) denotes the relevant (irrelevant) label index set induced by $y$, and $|\cdot|$ denotes the cardinality of a set.

Decomposable Loss: $\ell_D(f(x), y) = \sum_{j=1}^{c}\ell_b(y_j f_j(x))$.

3.3 Related Complexity Measures

Here we introduce the complexity measures involved in the theoretical analysis. The Rademacher complexity is used to perform generalization analysis for LSRL.

Definition 1 (Rademacher complexity). Let $\mathcal{G}$ be a class of real-valued functions mapping from $\mathcal{X}$ to $\mathbb{R}$. Let $D = \{x_1, \ldots, x_n\}$ be a set with $n$ i.i.d. samples. The empirical Rademacher complexity of $\mathcal{G}$ is defined by $\widehat{\mathfrak{R}}_D(\mathcal{G}) = \mathbb{E}_\epsilon\big[\sup_{g \in \mathcal{G}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_i g(x_i)\big]$, where $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. Rademacher random variables, and we refer to the expectation $\mathfrak{R}(\mathcal{G}) = \mathbb{E}_D[\widehat{\mathfrak{R}}_D(\mathcal{G})]$ as the Rademacher complexity of $\mathcal{G}$. In addition, we define the worst-case Rademacher complexity as $\widetilde{\mathfrak{R}}_n(\mathcal{G}) = \sup_{D \in \mathcal{X}^n}\widehat{\mathfrak{R}}_D(\mathcal{G})$, and its expectation is denoted as $\widetilde{\mathfrak{R}}(\mathcal{G}) = \mathbb{E}_D[\widetilde{\mathfrak{R}}_n(\mathcal{G})]$.

In multi-label learning, $f \in \mathcal{F}$ is a vector-valued function, which makes traditional Rademacher complexity analysis methods invalid. Hence, we need to convert the Rademacher complexity of the loss function space associated with the vector-valued function class $\mathcal{F}$ into the Rademacher complexity of a tractable scalar-valued function class. The Rademacher complexity can be bounded by other scale-sensitive complexity measures, such as the covering number and fat-shattering dimension [Srebro et al., 2010, Zhang and Zhang, 2023]. The relevant definitions are provided in the appendix.
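As a concrete illustration of Subsection 3.2, the sketch below evaluates the four surrogate losses for a single example, using the hinge function as the base loss $\ell_b$. The choice of base loss and the function names are illustrative assumptions only.

```python
import numpy as np

def hinge(t):
    """Illustrative 1-Lipschitz base loss ell_b(t) = max(0, 1 - t)."""
    return np.maximum(0.0, 1.0 - t)

def hamming_loss(scores, y, base=hinge):
    """ell_H = (1/c) * sum_j ell_b(y_j f_j(x))."""
    return base(y * scores).mean()

def subset_loss(scores, y, base=hinge):
    """ell_S = max_j ell_b(y_j f_j(x))."""
    return base(y * scores).max()

def ranking_loss(scores, y, base=hinge):
    """ell_R = (1/(|Y+||Y-|)) * sum over relevant p and irrelevant q of ell_b(f_p - f_q)."""
    pos, neg = scores[y == 1], scores[y == -1]
    if len(pos) == 0 or len(neg) == 0:
        return 0.0
    return base(pos[:, None] - neg[None, :]).mean()

def decomposable_loss(scores, y, base=hinge):
    """ell_D = sum_j ell_b(y_j f_j(x))."""
    return base(y * scores).sum()

scores = np.array([0.8, -0.3, 1.5, -2.0])   # f(x) for c = 4 labels
y = np.array([1, -1, 1, -1])                # relevant / irrelevant labels
for loss in (hamming_loss, subset_loss, ranking_loss, decomposable_loss):
    print(loss.__name__, float(loss(scores, y)))
```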
4 General Bounds for LSRL

In this section, we first introduce the assumptions used. Then, we develop a novel vector-contraction inequality for the Rademacher complexity of the loss function space associated with the vector-valued function class $\mathcal{F}$. Finally, with the novel vector-contraction inequality, we derive the generalization bound for general function classes of LSRL with no dependency on the number of labels, up to logarithmic terms, which is tighter than the state of the art. The detailed proofs of the theoretical results in this paper are provided in the appendix.

Assumption 1. Assume that the input features, the loss function and the components of the vector-valued function are bounded: $\|x_i\|_2 \le R$, $\ell(\cdot, \cdot) \le M$, $|f_j(\cdot)| \le B$ for $i \in [n]$, $j \in [c]$, where $R > 0$, $M > 0$ and $B > 0$ are constants.

Assumption 2. Assume that the loss function $\ell$ is $\rho$-Lipschitz continuous w.r.t. the $\ell_\infty$ norm, that is, $|\ell(f(x), \cdot) - \ell(f'(x), \cdot)| \le \rho\,\|f(x) - f'(x)\|_\infty$, where $\rho > 0$ and $\|t\|_\infty = \max_{j \in [c]}|t_j|$ for $t = (t_1, \ldots, t_c)$.

Assumptions 1 and 2 are mild. For Assumption 1, normalization of input features is a common data preprocessing operation. When we consider the function class (1), we often use the assumptions $\|w_j\|_2 \le \Lambda$ and $\|\zeta_j(\cdot)\|_2 \le A$ for any $j \in [c]$ to replace the boundedness of the components of the vector-valued function, i.e., $B := \Lambda A$. For Assumption 2, Lipschitz continuity w.r.t. the $\ell_\infty$ norm has been considered in the literature [Foster and Rakhlin, 2019, Lei et al., 2019, Wu et al., 2021b]. The following Proposition 1 further shows that the commonly used loss functions in multi-label learning actually satisfy Assumption 2.

Proposition 1. Assume that the base loss $\ell_b$ defined in Subsection 3.2 is $\mu$-Lipschitz continuous. Then the surrogate Hamming Loss is $\mu$-Lipschitz w.r.t. the $\ell_\infty$ norm, the surrogate Subset Loss is $\mu$-Lipschitz w.r.t. the $\ell_\infty$ norm, the surrogate Ranking Loss is $2\mu$-Lipschitz w.r.t. the $\ell_\infty$ norm, and the surrogate Decomposable Loss is $c\mu$-Lipschitz w.r.t. the $\ell_\infty$ norm.

We define a function class $P$ consisting of projection operators $p_j : \mathbb{R}^c \mapsto \mathbb{R}$ for any $j \in [c]$ which project a $c$-dimensional vector onto its $j$-th coordinate. Then, we have $P(\mathcal{F}) = \{(j, x) \mapsto p_j(f(x)) : p_j(f(x)) = f_j(x),\ f \in \mathcal{F},\ (j, x) \in [c] \times \mathcal{X}\}$. The projection function class decouples the relationship among different components. With the assumption of $\ell_\infty$ norm Lipschitz loss and the above definitions, we show that the Rademacher complexity of the loss function space associated with $\mathcal{F}$ can be bounded by the worst-case Rademacher complexity of the projection function class $P(\mathcal{F})$. We develop the following novel vector-contraction inequality:

Lemma 1. Let $\mathcal{F}$ be a vector-valued function class of LSRL defined by (1). Let Assumptions 1 and 2 hold. Given a dataset $D$ of size $n$. Then, we have
$$\widehat{\mathfrak{R}}_D(\mathcal{L}) \le 12\sqrt{2}\,\rho\,\sqrt{c}\;\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))\Big(1 + \log^{\frac{1}{2}}\big(8e^2n^3c^3\big)\log\big(M\sqrt{n}\big)\Big),$$
where $\widehat{\mathfrak{R}}_D(\mathcal{L}) = \mathbb{E}_\epsilon\big[\sup_{\ell \in \mathcal{L}, f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\ell(f(x_i), y_i)\big]$ is the empirical Rademacher complexity of the loss function space associated with $\mathcal{F}$, and $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))$ is the worst-case Rademacher complexity of the projection function class.

Proof Sketch. First, the Rademacher complexity of the loss function space associated with $\mathcal{F}$ can be bounded by the empirical $\ell_\infty$ norm covering number with the refined Dudley entropy integral inequality. Second, according to the Lipschitz continuity w.r.t. the $\ell_\infty$ norm, the empirical $\ell_\infty$ norm covering number of $\mathcal{F}$ can be bounded by that of $P(\mathcal{F})$.
Third, the empirical $\ell_\infty$ norm covering number of $P(\mathcal{F})$ can be bounded by the fat-shattering dimension, and the fat-shattering dimension can be bounded by the worst-case Rademacher complexity of $P(\mathcal{F})$. Hence, the problem is transferred to the estimation of the worst-case Rademacher complexity. Finally, we estimate the lower bound of the worst-case Rademacher complexity of $P(\mathcal{F})$; combined with the above steps, the Rademacher complexity of the loss function space associated with $\mathcal{F}$ can be bounded.

With the vector-contraction inequality above, we can derive the following tight bound for LSRL:

Theorem 1. Let $\mathcal{F}$ be a vector-valued function class of LSRL defined by (1). Let Assumptions 1 and 2 hold. Given a dataset $D$ of size $n$. Then, for any $0 < \delta < 1$, with probability at least $1 - \delta$, the following holds for any $f \in \mathcal{F}$:
$$R(f) \le \widehat{R}_D(f) + \frac{24\sqrt{2}\,\rho\Lambda A\Big(1 + \log^{\frac{1}{2}}\big(8e^2n^3c^3\big)\log\big(M\sqrt{n}\big)\Big)}{\sqrt{n}} + 3M\sqrt{\frac{\log\frac{2}{\delta}}{2n}}.$$

Proof Sketch. We first upper bound the worst-case Rademacher complexity $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))$, and then, combined with Lemma 1, the desired bound can be derived.

Remark 1. Although Lemma 1 contains a factor of $\sqrt{c}$, the term $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) \le \Lambda A/\sqrt{nc}$, which makes the Rademacher complexity of the loss function space associated with $\mathcal{F}$ (i.e., $\widehat{\mathfrak{R}}_D(\mathcal{L})$) actually independent of $c$, up to logarithmic terms, and results in a tighter bound than the existing $O(c/\sqrt{n})$ and $O(\sqrt{c/n})$ bounds, with a faster convergence rate $\widetilde{O}(1/\sqrt{n})$. The bound in Theorem 1 with no dependency on $c$ can provide a general theoretical guarantee for LSRL, even for extreme multi-label learning where the number of labels far exceeds the number of examples [Yu et al., 2014, Prabhu and Varma, 2014, Yen et al., 2016, Liu and Shen, 2019], since it is easy to see that $\log c$ is much smaller than $n$.

The main challenge of the generalization analysis for LSRL is that existing theoretical results and existing generalization analysis methods for multi-label learning are not applicable to LSRL. Specifically, existing theoretical bounds often rely on the typical vector-contraction inequality [Maurer, 2016] for $\ell_2$ Lipschitz losses:
$$\mathbb{E}_\epsilon\Big[\sup_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_i\ell(f(x_i), y_i)\Big] \le \sqrt{2}\,\mu\,\mathbb{E}_\epsilon\Big[\sup_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{c}\epsilon_{ij}f_j(x_i)\Big],$$
which leads to bounds with a linear dependency on $c$ for multi-label learning. Lei et al. [2015], Wu and Zhu [2020], Wu et al. [2021a] improve the dependency of the bounds on $c$ to square-root, which preserves the coupling among different components reflected by the constraint $\|w\| \le \Lambda$. Lei et al. [2019] also improves the bounds of multi-class classification to be independent of $c$ (up to logarithmic terms) for $\ell_\infty$ Lipschitz losses by preserving the coupling among different components, and Wu et al. [2021b] further generalizes these results to multi-label learning. However, LSRL decomposes the multi-label learning problem into $c$ binary classification problems, which means that the relationship among different components needs to be decoupled in the generalization analysis. Hence, how to develop novel vector-contraction inequalities that can induce $\widetilde{O}(1/\sqrt{n})$ bounds and how to deal with decoupling the relationship among different components are the two most critical difficulties in deriving tighter bounds for LSRL. The introduction of the projection function class plays an important role in solving these two difficulties. It improves the vector-contraction inequalities by a factor of $\sqrt{c}$ and decouples the relationship among different components (which is also reflected by the constraint $\|w_j\| \le \Lambda$ for any $j \in [c]$).
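The complexity measure appearing in Lemma 1 is the (worst-case) Rademacher complexity of the projection class $P(\mathcal{F})$. The short Monte Carlo sketch below estimates its empirical counterpart on a given sample for a toy linear LSRL class with $\|w_j\|_2 \le \Lambda$ and identity mappings $\zeta_j, \phi_j$. The sampling-based estimator, the normalization over the $nc$ index pairs, and the closed-form inner supremum are illustrative assumptions, not part of the paper's analysis.

```python
import numpy as np

def empirical_rademacher_projection(X, c, Lam, n_trials=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of P(F)
    over the nc index pairs (j, x_i), for the linear class f_j(x) = <w_j, x>
    with ||w_j||_2 <= Lam (zeta_j and phi_j taken as identities).

    For fixed Rademacher signs eps_{ij}, the supremum over each w_j has the
    closed form Lam * || sum_i eps_{ij} x_i ||_2 (Cauchy-Schwarz), so the
    expectation over eps is estimated by averaging this expression."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    vals = []
    for _ in range(n_trials):
        eps = rng.choice([-1.0, 1.0], size=(n, c))               # signs eps_{ij}
        sup_per_label = Lam * np.linalg.norm(eps.T @ X, axis=1)  # sup over each w_j
        vals.append(sup_per_label.sum() / (n * c))
    return float(np.mean(vals))

# toy usage on synthetic inputs normalized so that ||x_i||_2 <= R = 1
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
print(empirical_rademacher_projection(X, c=8, Lam=1.0))
```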
Our tighter $\widetilde{O}(1/\sqrt{n})$ bound in Theorem 1 with no dependency on $c$ (up to logarithmic terms) overcomes the limitations of existing theoretical results for multi-label learning and can provide a general theoretical guarantee for LSRL. The differences in the generalization bounds of different LSRL methods are mainly reflected in two aspects. On the one hand, the Lipschitz constant of the loss function: as we proved in Proposition 1, the Lipschitz constants $\rho$ corresponding to different loss functions vary. On the other hand, the nonlinear mappings induced by different LSRL methods. In fact, when we analyze the generalization of LSRL, we will further have $\|\zeta(\cdot)\|_2 \le A := \kappa R$ ($\kappa$ is the Lipschitz constant of the nonlinear mapping $\zeta(\cdot)$) to take into account the differences and characteristics of different LSRL methods. We provide a detailed analysis of $A$ for typical LSRL methods in the next section.

5 Generalization Bounds for Typical LSRL Methods

In this section, we analyze the generalization bounds for several typical LSRL methods, i.e., the k-means clustering-based [Zhang and Wu, 2015], Lasso-based [Huang et al., 2016] and DNN-based [Hang and Zhang, 2022] LSRL methods. We show that different construction methods of label-specific representations lead to significant differences in the constant $A$ of the generalization bound in Theorem 1. For each LSRL method, we first give a brief introduction, then give its formal definition corresponding to the class of LSRL defined in (1), and finally derive the generalization bound.

5.1 Generalization Bounds for k-Means Clustering-Based LSRL Method

As a seminal work, the k-means clustering-based LSRL method, i.e., LIFT [Zhang and Wu, 2015], uses k-means clustering to construct label-specific representations that effectively capture the specific characteristics of each label. Specifically, first, for each label, k-means clustering is used to divide the training instances into $K$ clusters, and the centers of the $K$ clusters are obtained, denoted as $c_k^j$ for the $j$-th label, $k \in [K]$, $j \in [c]$. Then, these $K$ centers are used to construct the label-specific representation, i.e., in the vector-valued function class of LSRL defined by (1), $\phi_j(x) = \big[d(x, c_1^j), \ldots, d(x, c_K^j)\big]$ with $d(x, c_k^j) = \|x - c_k^j\|_2$. Finally, a family of $c$ classifiers $f_j$ with $\kappa$-Lipschitz nonlinear mappings is induced based on the generated label-specific representations.

Next, we formally define the process of k-means clustering. Here we follow some of the settings and definitions in [Li and Liu, 2021]. Assume that $V : \mathcal{X}^2 \to \mathbb{R}_+$ is a pairwise distance-based function used to measure the dissimilarity between pairs of observations, and $Z = [Z_1, \ldots, Z_K]$ is a collection of $K$ partition functions $Z_k : \mathcal{X}^2 \to \mathbb{R}_+$ for $k \in [K]$. The clustering framework can be cast as the problem of minimizing the criterion
$$\widehat{R}_D(V, Z) = \frac{1}{n(n-1)}\sum_{i \ne j}\sum_{k=1}^{K}V(x_i, x_j)Z_k(x_i, x_j)$$
over all possible functions $V$ and $Z_k$ for $k \in [K]$. In k-means clustering, we have $V(x_i, x_j) = \|x_i - x_j\|_2^2$ and $Z_k(x_i, x_j) = \mathbb{I}\big[(x_i, x_j) \in C_k^2\big]$ with $c_k = \frac{1}{|C_k|}\sum_{x \in C_k}x$, where $C_1, \ldots, C_K$ are the partitions of the feature space $\mathcal{X}$. Let $g_k(x_i, x_j) = V(x_i, x_j)Z_k(x_i, x_j)$, let $g = (g_1, \ldots, g_K)$ be a vector-valued function, and let $\ell_{clu}(g(x, x')) = \sum_{k=1}^{K}g_k(x, x')$; then $\widehat{R}_D(V, Z)$ can be written as
$$\widehat{R}_D(V, Z) = \frac{1}{n(n-1)}\sum_{i \ne j}\ell_{clu}(g(x_i, x_j)) = \frac{1}{n(n-1)}\sum_{i \ne j}\sum_{k=1}^{K}g_k(x_i, x_j).$$
We then define a vector-valued function class of k-means clustering as follows:
$$\mathcal{G} = \big\{(x, x') \mapsto g(x, x') : g(x, x') = (g_1(x, x'), \ldots, g_K(x, x')),\ g_k(x, x') = \|x - x'\|_2^2\,\mathbb{I}\big[(x, x') \in C_k^2\big],\ x, x' \in \mathcal{X},\ k \in [K]\big\},$$
where $|g_k(\cdot, \cdot)| \le G$ for $k \in [K]$ and $G > 0$ is a constant. We denote the function class of k-means clustering corresponding to the $j$-th label as $\mathcal{G}_j$. With the above definitions, we can derive the tight bound for the k-means clustering-based LSRL method:

Theorem 2. Let $\mathcal{F}$ be a vector-valued function class of k-means clustering-based LSRL defined by (1). Let Assumptions 1 and 2 hold. Given a dataset $D$ of size $n$. Then, for any $0 < \delta < 1$, with probability at least $1 - \delta$, the following holds for any $f \in \mathcal{F}$:
R(f) b RD(f) + 3M KR 1 + log 1 2 (8e2n3c3) log M n 1 + log 1 2 (e2n3c3) log M nc 1 + log 1 2 (8e2n3c3) log M n

Remark 2. There are three key points in the generalization analysis of the k-means clustering-based LSRL method. 1) The k-means clustering-based LSRL method is two-stage: the centers of the clusters are generated by k-means clustering in the first stage, and then these centers are exploited to generate the label-specific representations on which a multi-label classifier is learned in the second stage. Hence we cannot formally express these two stages in a closed-form expression through a composite function (the $K$ centers are generated by an $\arg\min$). Furthermore, the $K$ centers generated in the first stage are actually used as fixed parameters rather than inputs in the second stage. Hence, in order to fully account for the capacity of the model corresponding to the first stage, it is reasonable to define the whole function class as the sum $\mathcal{F} + \ell_{clu} \circ \mathcal{G}$ of the function classes corresponding to the methods of these two stages. Then, combined with Lemma 1, the generalization analysis is transformed into bounding the complexity of the projection function class $P(\mathcal{F} + \ell_{clu} \circ \mathcal{G})$. The introduction of the class $\mathcal{G}$ induces an additional increase in complexity, i.e., the last term in Theorem 2. 2) The generalization analysis of the k-means clustering-based LSRL method involves the generalization analysis of k-means clustering. However, since the k-means clustering framework involves pairwise functions, a sequence of pairs built from i.i.d. individual observations is no longer independent, which makes the standard techniques of the i.i.d. case for the traditional Rademacher complexity inapplicable to k-means clustering. We convert the non-sum-of-i.i.d. pairwise function to a sum-of-i.i.d. form by using permutations in the U-process [Clémençon et al., 2008]. We show that the empirical Rademacher complexity of the loss function space associated with the vector-valued function class $\mathcal{G}$ can be bounded by $\widehat{\mathfrak{R}}_D(\ell_{clu} \circ \mathcal{G}) := \mathbb{E}_\epsilon\big[\sup_{g \in \mathcal{G}}\frac{1}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor}\epsilon_i\,\ell_{clu}(g(x_i, x_{i+\lfloor n/2\rfloor}))\big]$. 3) In order to derive tight bounds for k-means clustering, we develop a novel vector-contraction inequality that can induce bounds with a square-root dependency on the number of clusters. The theoretical techniques and results involved here may be of independent interest. The generalization bound of the k-means clustering-based LSRL method is tighter than the state of the art, with a faster convergence rate $\widetilde{O}(\sqrt{K/n})$, and is independent of the number of labels. Since the lower bound for clustering is $\Omega(\sqrt{K/n})$ [Bartlett et al., 1998], our bound is (nearly) optimal, up to logarithmic terms, even from the perspective of clustering. The constant $A$ of the generalization bound in Theorem 1 corresponds to $2\kappa\sqrt{K}R$ here.
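To connect Theorem 2 to the construction it analyzes, the sketch below builds LIFT-style label-specific representations: for each label, k-means centers are computed and $\phi_j(x)$ is the vector of distances to those centers. Using scikit-learn's KMeans and clustering only each label's positive instances are simplifying assumptions for illustration (LIFT itself forms clusters from the positive and negative instances of each label separately).

```python
import numpy as np
from sklearn.cluster import KMeans

def lift_representations(X, Y, K, seed=0):
    """For each label j, cluster its positive instances into K groups and map every
    instance x to phi_j(x) = [ ||x - c^j_1||_2, ..., ||x - c^j_K||_2 ],
    the distances to that label's cluster centers (cf. Section 5.1)."""
    c = Y.shape[1]
    reps = []
    for j in range(c):
        pos = X[Y[:, j] == 1]
        data = pos if len(pos) >= K else X          # fall back if too few positives
        km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(data)
        centers = km.cluster_centers_               # c^j_1, ..., c^j_K
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        reps.append(dists)                          # shape (n, K): row i is phi_j(x_i)
    return reps

# toy usage: 3 labels, each with its own (n, K) representation
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
Y = np.where(rng.normal(size=(60, 3)) > 0, 1, -1)
reps = lift_representations(X, Y, K=4)
print(len(reps), reps[0].shape)
```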
5.2 Generalization Bounds for Lasso-Based LSRL Method

The Lasso-based LSRL method, i.e., LLSF [Huang et al., 2016], assumes that the label-specific representation of each label should have sparsity and sharing properties. For sparsity, LLSF uses a Lasso as the model corresponding to each label. The sharing property is achieved by considering that two strongly correlated labels will share more features with each other than two uncorrelated or weakly correlated labels, so that the corresponding weights will be similar, i.e., their inner product will be large. Formally, since each label corresponds to a Lasso, this means that in the class of LSRL defined by (1), the base loss $\ell_b$ is the squared loss, the nonlinear mappings $\zeta_j(\cdot)$ and $\phi_j(\cdot)$ are both identity transformations for any $j \in [c]$, and the constraint $\alpha(w)$ is $\|w_j\|_1 \le \Lambda$ for any $j \in [c]$. For the $j$-th label, the sharing property is reflected by the additionally introduced constraint $\sum_{i=1}^{c}(1 - s_{ji})\,w_j^\top w_i \le \tau$, where $s_{ji}$ is the cosine similarity between labels $y_j$ and $y_i$; we refer to it as the sharing constraint. The loss function used by the Lasso-based LSRL method is the Decomposable loss.

Since the squared loss is not Lipschitz continuous, the theoretical results on the Lipschitz continuity of the loss functions in Proposition 1 cannot be applied to the Lasso-based LSRL method. To overcome this challenge, we introduce the notion of a pseudo-Lipschitz function, which is also used in the theoretical analysis of approximate message passing algorithms [Bayati and Montanari, 2011].

Definition 2. For $k \ge 1$, we say that a function $f : \mathbb{R} \to \mathbb{R}$ is pseudo-Lipschitz of order $k$ if there exists a constant $L > 0$ such that the following inequality holds for all $x, y \in \mathbb{R}$: $|f(x) - f(y)| \le L\big(1 + |x|^{k-1} + |y|^{k-1}\big)|x - y|$.

Note that any pseudo-Lipschitz function of order 1 is Lipschitz continuous. The following proposition shows that the Decomposable loss is still Lipschitz continuous if the base loss is the squared loss.

Proposition 2. The squared loss is 1-pseudo-Lipschitz of order 2, and the surrogate Decomposable Loss is $(3 + 2B)c$-Lipschitz w.r.t. the $\ell_\infty$ norm if the base loss $\ell_b$ is the squared loss.

With the above definitions, we can derive the generalization bound for the Lasso-based LSRL method:

Theorem 3. Let $\mathcal{F}$ be a vector-valued function class of Lasso-based LSRL defined by (1). Let Assumptions 1 and 2 hold. Given a dataset $D$ of size $n$. Then, for any $0 < \delta < 1$, with probability at least $1 - \delta$, the following holds for any $f \in \mathcal{F}$:
$$R(f) \le \widehat{R}_D(f) + \frac{24\sqrt{2}\,(3 + 2B)c\Lambda R\Big(1 + \log^{\frac{1}{2}}\big(8e^2n^3c^3\big)\log\big(M\sqrt{n}\big)\Big)}{\sqrt{n}} + 3M\sqrt{\frac{\log\frac{2}{\delta}}{2n}}.$$

Remark 3. The complexity of the LLSF function class can be bounded by the complexity of the LSRL function class in which each label corresponds to a Lasso, since the introduction of the sharing constraint in the LLSF function class reduces the complexity of the function class compared with the LSRL function class in which each label corresponds to a Lasso. Hence, the complexity analysis of the LLSF function class can be converted into upper bounding the Rademacher complexity of the LSRL function class in which each label corresponds to a Lasso. The constant $A$ of the generalization bound in Theorem 1 corresponds to $R$ here, and the value of $\rho$ is $(3 + 2B)c$, which induces the $\widetilde{O}(c/\sqrt{n})$ bound here. If other loss functions are used, e.g., Hamming, Subset or Ranking loss, instead of the Decomposable loss, the dependency of the bounds for the Lasso-based LSRL method on $c$ can be improved from linear to independent, up to logarithmic terms.
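The bound in Theorem 3 is stated for the per-label Lasso view of LLSF. The sketch below fits one $\ell_1$-regularized least-squares model per label and inspects which features each label retains; fitting the $c$ problems independently (i.e., dropping the sharing constraint that couples the weight vectors) and the particular regularization strength are deliberate simplifications for illustration, not the LLSF algorithm itself.

```python
import numpy as np
from sklearn.linear_model import Lasso

def per_label_lasso(X, Y, alpha=0.1):
    """Fit one Lasso per label on targets in {-1,+1}; the squared base loss summed
    over labels corresponds to the Decomposable surrogate analyzed in Theorem 3."""
    c = Y.shape[1]
    W = np.zeros((c, X.shape[1]))
    for j in range(c):
        W[j] = Lasso(alpha=alpha, max_iter=10000).fit(X, Y[:, j]).coef_
    return W

# toy usage: sparse ground-truth weights, thresholded linear labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
W_true = np.where(rng.random((4, 30)) < 0.2, rng.normal(size=(4, 30)), 0.0)
Y = np.where(X @ W_true.T + 0.1 * rng.normal(size=(200, 4)) > 0, 1, -1)
W = per_label_lasso(X, Y)
for j, w in enumerate(W):
    # the label-specific features are the nonzero coordinates of w_j
    print(f"label {j}: {np.count_nonzero(w)} selected features")
```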
5.3 Generalization Bounds for DNN-Based LSRL Method

The DNN-based LSRL method, i.e., CLIF [Hang and Zhang, 2022], exploits the powerful representation learning capability of deep neural networks (DNNs) to learn label-specific representations in an end-to-end manner. Since the construction of label-specific representations involves graph convolutional networks (GCNs), we first introduce the relevant definitions for GCNs. Let $G = \{V, E\}$ be a given undirected graph, where $V = \{x_1, x_2, \ldots, x_n\}$ is the set of nodes with size $|V| = n$ and $E$ is the set of edges. Let $A$ and $D$ be the adjacency matrix and the diagonal degree matrix, respectively, where $D_{ii} = \sum_{j=1}^{n}A_{ij}$. Let $\hat{A} = (D + I_n)^{-\frac{1}{2}}(A + I_n)(D + I_n)^{-\frac{1}{2}}$ denote the normalized adjacency matrix with self-connections, where $I_n$ is the identity matrix. The feature propagation process of a two-layer GCN is $\sigma(\hat{A}\sigma(\hat{A}XW_1)W_2)$, where $W_1$ and $W_2$ are parameter matrices, $X$ is the node feature matrix, and the $i$-th row $X_i$ is the node feature $x_i$.

Specifically, for the DNN-based LSRL method, first, a graph (here called the label graph) is constructed over the label space and a GCN is used to generate the label embeddings, i.e., the label embeddings can be denoted by $\psi(Y) = \sigma_{\mathrm{ReLU}}(\hat{A}\,\sigma_{\mathrm{ReLU}}(\hat{A}YW_1)W_2)$, where $Y$ is the node feature matrix of the label graph with size $c$, the nodes are also bounded by $R$, and $\sigma_{\mathrm{ReLU}}$ is the ReLU activation. Second, the label embedding of the $j$-th label is decoded into an importance vector by a one-layer fully-connected neural network, i.e., $\sigma_{\mathrm{sig}}(W_3\psi(Y)_j)$, where $\sigma_{\mathrm{sig}}$ is the sigmoid activation, and the input feature is mapped into a latent representation through a one-layer fully-connected neural network, i.e., $\sigma_{\mathrm{ReLU}}(W_4x)$. Third, for the $j$-th label, the Hadamard product of the importance vector and the latent representation is defined as the pertinent representation, and then the label-specific representation for the $j$-th label is obtained by feeding the pertinent representation into another one-layer fully-connected neural network. Hence, the label-specific representation for the $j$-th label is $\phi_j(x) = \sigma_{\mathrm{ReLU}}\{W_5[\sigma_{\mathrm{ReLU}}(W_4x) \odot \sigma_{\mathrm{sig}}(W_3\psi(Y)_j)]\}$. Finally, the $j$-th model is implemented by a fully-connected layer, i.e., $f_j(x) = \sigma_{\mathrm{sig}}(w_j^\top\phi_j(x))$.

In the generalization analysis of the above deep neural network, we introduce the following assumption, which is common in the generalization analysis for DNNs [Bartlett et al., 2017, Golowich et al., 2018, Zhang and Zhang, 2023, Tang and Liu, 2023].

Assumption 3. Assume that the parameter matrices in the DNN-based LSRL method are bounded, i.e., $\|W_i\| \le D$ for $i \in [5]$, where $D > 0$ is a constant.

With the above definitions, we can derive the generalization bound for the DNN-based LSRL method:

Theorem 4. Let $\mathcal{F}$ be a vector-valued function class of DNN-based LSRL defined by (1). Let Assumptions 1, 2, and 3 hold. Given a dataset $D$ of size $n$. Then, for any $0 < \delta < 1$, with probability at least $1 - \delta$, the following holds for any $f \in \mathcal{F}$:
$$R(f) \le \widehat{R}_D(f) + \frac{6\sqrt{2}\,\rho\Lambda D^5R^2(g_{\max} + 1)\Big(1 + \log^{\frac{1}{2}}\big(8e^2n^3c^3\big)\log\big(M\sqrt{n}\big)\Big)}{\sqrt{n}} + 3M\sqrt{\frac{\log\frac{2}{\delta}}{2n}},$$
where $g_{\max}$ is the maximum node degree of the label graph.

Remark 4. The term $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) \le \Lambda D^5R^2(g_{\max} + 1)/(4\sqrt{nc})$, which makes the Rademacher complexity $\widehat{\mathfrak{R}}_D(\mathcal{L})$ actually independent of $c$, up to logarithmic terms, and results in a tight bound for the DNN-based LSRL method with a faster convergence rate $\widetilde{O}(1/\sqrt{n})$. The constant $A$ of the generalization bound in Theorem 1 corresponds to $D^5R^2(g_{\max} + 1)/4$ here.
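To make the architecture behind Theorem 4 concrete, the following numpy sketch runs the forward pass described above with randomly initialized weights: a two-layer GCN over a toy label graph produces label embeddings $\psi(Y)$, which gate a shared latent representation to yield $\phi_j(x)$ and the per-label score $f_j(x)$. The dimensions, the chain-shaped label graph, and the random weights are illustrative assumptions, not CLIF's trained model.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def normalized_adjacency(A):
    """A_hat = (D + I)^(-1/2) (A + I) (D + I)^(-1/2), as in Section 5.3."""
    n = A.shape[0]
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1) + 1.0))
    return d_inv_sqrt @ (A + np.eye(n)) @ d_inv_sqrt

def clif_forward(x, Yfeat, A, W1, W2, W3, W4, W5, w):
    A_hat = normalized_adjacency(A)
    psi = relu(A_hat @ relu(A_hat @ Yfeat @ W1) @ W2)       # label embeddings psi(Y)
    latent = relu(W4 @ x)                                    # shared latent representation
    scores = []
    for j in range(Yfeat.shape[0]):
        importance = sigmoid(W3 @ psi[j])                    # importance vector for label j
        phi_j = relu(W5 @ (latent * importance))             # label-specific representation
        scores.append(sigmoid(w[j] @ phi_j))                 # f_j(x)
    return np.array(scores)

# toy usage: d = 16 input features, c = 5 labels, chain-shaped label graph
rng = np.random.default_rng(0)
d, c, e, m, p = 16, 5, 8, 12, 10
A = np.diag(np.ones(c - 1), 1) + np.diag(np.ones(c - 1), -1)
Yfeat = rng.normal(size=(c, e))                              # label node features
W1, W2 = rng.normal(size=(e, e)), rng.normal(size=(e, e))
W3, W4, W5 = rng.normal(size=(m, e)), rng.normal(size=(m, d)), rng.normal(size=(p, m))
w = rng.normal(size=(c, p))
print(clif_forward(rng.normal(size=d), Yfeat, A, W1, W2, W3, W4, W5, w))
```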
For deep GCNs, the increase in depth means that the generalization performance will deteriorate, which is consistent with empirical performance and guides us not to design too many layers of GCN. In addition, the bound is linearly dependent on the maximum node degree of the label graph, which suggests that when the performance of the model is always unsatisfactory, we can check whether the maximum node degree is large and consider using some techniques to remove some edges, e.g., Drop Edge [Rong et al., 2020], to alleviate the over-fitting problem. We will further explore more network structures to learn more effective label-specific representations, e.g., hypernetworks [Galanti and Wolf, 2020, Chen et al., 2023, Shen et al., 2023], deep kernel networks [Zhang and Liao, 2020, Zhang and Zhang, 2023], and provide generalization analysis for the corresponding DNN-based LSRL methods. 6 Discussion Compared with existing methods considering label correlations, which mainly focus on the processing of the label space by exploiting or modeling relationships between labels, LSRL mainly focuses on the operation on the input space and implicitly considers label correlations in the process of constructing label-specific representations. For example, in the construction of label-specific representations in LLSF and CLIF, the label correlation information is embedded into the label-specific representations in the input space by introducing the sharing constraint and using a GCN over the label graph to generate label embeddings, respectively. LSRL methods are more effective since the label correlation information is considered in the construction of label-specific representations. Our theoretical results explain why LSRL is an effective strategy to improve the generalization performance of multi-label learning. On the one hand, existing results can improve the dependency of the bound on c from linear to square-root by preserving the coupling among different components, which corresponds to high-order label correlations induced by norm regularizers. However, the improvement in the preservation of coupling by a factor of c benefits from replacing cΛ with Λ in the constraint to some extent, and preserving the coupling corresponds to the stricter assumption [Zhang and Zhang, 2024]. Our results for LSRL decouple the relationship among different components, and the bounds with a weaker dependency on c are tighter than the existing results that preserve the coupling, which also explains why LSRL methods outperform the multi-label methods that consider high-order label correlations induced by norm regularizers. On the other hand, based on our results, we can find that LSRL methods substantially increase the data processing, i.e., the process of constructing label-specific representations. From the perspective of model capacity, compared with traditional multi-label methods, since the introduction of construction methods of label-specific representations, the capacity of the model is significantly increased, especially if deep learning methods are used to generate label-specific representations, which improves the representation ability of the model. Or more intuitively, LSRL means an increase in model capacity and stronger representation ability, which makes it easier to find the hypotheses with better generalization in the function class. 
The vector-contraction inequality and the theoretical tools developed here are applicable to the theoretical analysis of other problem settings, such as multi-class classification, or more general vectorvalued learning problem. For multi-class classification, multi-class margin-based loss, multinomial logistic loss, Top-k hinge loss, etc. are all ℓ Lipschitz [Lei et al., 2019]. For multi-label learning, the surrogate loss for Macro-Averaged AUC is also ℓ Lipschitz [Zhang and Zhang, 2024]. 7 Conclusion In this paper, we propose a novel vector-contraction inequality for ℓ norm Lipschitz continuous loss, and derive bounds for general function classes of LSRL with a weaker dependency on c than the state of the art. In addition, we analyze the bounds for several typical LSRL methods, and study the impact of different label-specific representations on the generalization analysis. In future work, we will extend our bounds to more LSRL methods, and derive tighter bounds for LSRL with a faster convergence rate w.r.t. the number of examples, and further design efficient models and algorithms to construct label-specific representations with good generalization performance. Acknowledgements The authors wish to thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Science Foundation of China (62225602) and the Big Data Computing Center of Southeast University. Junwen Bai, Shufeng Kong, and Carla P. Gomes. Disentangled variational autoencoder based multilabel classification with covariance-aware multivariate probit model. In Christian Bessiere, editor, Proceedings of the 29th International Joint Conference on Artificial Intelligence, number IJCAI 2020, pages 4313 4321, 2020. Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463 482, 2002. Peter L. Bartlett, Tamás Linder, and Gábor Lugosi. The minimax distortion redundancy in empirical quantizer design. IEEE Transactions on Information Theory, 44(5):1802 1813, 1998. Peter L. Bartlett, Dylan J. Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30(NIPS 2017):6240 6249, 2017. Mohsen Bayati and Andrea Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57(2):764 785, 2011. Matthew R. Boutell, Jiebo Luo, Xipeng Shen, and Christopher M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757 1771, 2004. Ricardo Silveira Cabral, Fernando De la Torre, João Paulo Costeira, and Alexandre Bernardino. Matrix completion for multi-label image classification. Advances in Neural Information Processing Systems, 24(NIPS 2011):190 198, 2011. Nicolò Cesa-Bianchi, Matteo Re, and Giorgio Valentini. Synergy of multi-label hierarchical ensembles, data fusion, and cost-sensitive methods for gene functional inference. Machine Learning, 88 (1-2):209 241, 2012. Di Chen, Yexiang Xue, Daniel Fink, Shuo Chen, and Carla P. Gomes. Deep multi-species embedding. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, number IJCAI 2017, pages 3639 3646. ijcai.org, 2017. Sirui Chen, Yuan Wang, Zijing Wen, Zhiyu Li, Changshuo Zhang, Xiao Zhang, Quan Lin, Cheng Zhu, and Jun Xu. Controllable multi-objective re-ranking with policy hypernetworks. 
In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, number KDD 2023, pages 3855 3864, 2023. Tianshui Chen, Liang Lin, Riquan Chen, Xiaolu Hui, and Hefeng Wu. Knowledge-guided multi-label few-shot learning for general image recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1371 1384, 2020. Stéphan Clémençon, Gábor Lugosi, and Nicolas Vayatis. Ranking and empirical minimization of u-statistics. The Annals of Statistics, 36(2):844 874, 2008. Kunal Dahiya, Ananye Agarwal, Deepak Saini, K Gururaj, Jian Jiao, Amit Singh, Sumeet Agarwal, Purushottam Kar, and Manik Varma. Siamesexml: Siamese networks meet extreme classifiers with 100m labels. In Proceedings of the 38th International Conference on Machine Learning, pages 2330 2340, 2021. Krzysztof Dembczynski, Willem Waegeman, Weiwei Cheng, and Eyke Hüllermeier. Regret analysis for performance metrics in multi-label classification: The case of hamming and subset zero-one loss. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, volume 6321, pages 280 295, 2010. Krzysztof Dembczynski, Wojciech Kotlowski, and Eyke Hüllermeier. Consistent multilabel ranking through univariate losses. ar Xiv:1206.6401, 2012. André Elisseeff and Jason Weston. A kernel method for multi-labelled classification. In Advances in Neural Information Processing Systems, number NIPS 2001, pages 681 687, 2001. Dylan J. Foster and Alexander Rakhlin. ℓ vector contraction for rademacher complexity. ar Xiv:1911.06468v1, 2019. Tomer Galanti and Lior Wolf. On the modularity of hypernetworks. In Advances in Neural Information Processing Systems, volume 33, pages 10409 10419, 2020. Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. International Conference on Computational Learning Theory, 75(COLT 2018): 297 299, 2018. Yumeng Guo, Fulai Chung, Guozheng Li, Jiancong Wang, and James C. Gee. Leveraging labelspecific discriminant mapping features for multi-label learning. ACM Transactions on Knowledge Discovery from Data, 13(2):24:1 24:23, 2019. Jun-Yi Hang and Min-Ling Zhang. Collaborative learning of label semantics and deep label-specific features for multi-label classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):9860 9871, 2022. Jun-Yi Hang, Min-Ling Zhang, Yanghe Feng, and Xiaocheng Song. End-to-end probabilistic labelspecific feature learning for multi-label classification. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, number AAAI 2022, pages 6847 6855, 2022. Jun Huang, Guorong Li, Qingming Huang, and Xindong Wu. Learning label specific features for multi-label classification. In Charu C. Aggarwal, Zhi-Hua Zhou, Alexander Tuzhilin, Hui Xiong, and Xindong Wu, editors, Proceedings of the 15th IEEE International Conference on Data Mining, number ICDM 2015, pages 181 190, 2015. Jun Huang, Guorong Li, Qingming Huang, and Xindong Wu. Learning label-specific features and class-dependent labels for multi-label classification. IEEE Transactions on Knowledge and Data Engineering, 28(12):3309 3323, 2016. Jun Huang, Guorong Li, Qingming Huang, and Xindong Wu. Joint feature selection and classification for multilabel learning. IEEE Transactions on Cybernetics, 48(3):876 889, 2018. Shuiwang Ji, Lei Tang, Shipeng Yu, and Jieping Ye. A shared-subspace learning framework for multi-label classification. ACM Transactions on Knowledge Discovery from Data, 4(2):1 29, 2010. 
Antoine Ledent, Waleed Mustafa, Yunwen Lei, and Marius Kloft. Norm-based generalisation bounds for deep multi-class convolutional neural networks. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, volume 35, pages 8279 8287, 2021. Yunwen Lei, Ürün Dogan, Alexander Binder, and Marius Kloft. Multi-class svms: From tighter data-dependent generalization bounds to novel algorithms. In Advances in Neural Information Processing Systems, volume 28, pages 2035 2043, 2015. Yunwen Lei, Ürün Dogan, Ding-Xuan Zhou, and Marius Kloft. Data-dependent generalization bounds for multi-class classification. IEEE Transactions on Information Theory, 65(5):2995 3021, 2019. Shaojie Li and Yong Liu. Sharper generalization bounds for clustering. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 6392 6402, 2021. Chong Liu, Peng Zhao, Sheng-Jun Huang, Yuan Jiang, and Zhi-Hua Zhou. Dual set multi-label learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, number AAAI 2018, pages 3635 3642, 2018. Weiwei Liu and Xiaobo Shen. Sparse extreme multi-label learning with oracle property. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 4032 4041, 2019. Weiwei Liu, Haobo Wang, Xiaobo Shen, and Ivor W. Tsang. The emerging trends of multi-label learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7955 7974, 2022. Françoise Lust-Piquard and Gilles Pisier. Non commutative khintchine and paley inequalities. Arkiv för matematik, 29:241 260, 1991. Andreas Maurer. A vector-contraction inequality for rademacher complexities. In Proceedings of the 27th International Conference on Algorithmic Learning Theory, volume 9925, pages 3 17, 2016. Yashoteja Prabhu and Manik Varma. Fastxml: a fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, number KDD 2014, pages 263 272, 2014. Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. Dropedge: Towards deep graph convolutional networks on node classification. In Proceedings of the 8th International Conference on Learning Representations, number ICLR 2020, 2020. Timothy N. Rubin, America Chambers, Padhraic Smyth, and Mark Steyvers. Statistical topic models for multi-label document classification. Machine Learning, 88(1-2):157 208, 2012. Chenglei Shen, Xiao Zhang, Wei Wei, and Jun Xu. Hyperbandit: Contextual bandit with hypernewtork for time-varying user preferences in streaming recommendation. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 2239 2248, 2023. Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, volume 23, pages 2199 2207, 2010. Yan-Ping Sun and Min-Ling Zhang. Compositional metric learning for multi-label classification. Frontiers of Computer Science, 15(5):155320, 2021. Huayi Tang and Yong Liu. Towards understanding the generalization of graph neural networks. ar Xiv:2305.08048v1, 2023. Wei Weng, Yaojin Lin, Shunxiang Wu, Yuwen Li, and Yun Kang. Multi-label learning based on label-specific features and local pairwise label correlation. Neurocomputing, 273:385 394, 2018. Wei Weng, Yan-Nan Chen, Chin-Ling Chen, Shunxiang Wu, and Jinghua Liu. Non-sparse label specific features selection for multi-label classification. Neurocomputing, 377:85 94, 2020. 
Guoqiang Wu and Jun Zhu. Multi-label classification: do hamming loss and subset accuracy really conflict with each other? Advances in Neural Information Processing Systems, 33(Neur IPS 2020), 2020. Guoqiang Wu, Chongxuan Li, Kun Xu, and Jun Zhu. Rethinking and reweighting the univariate losses for multi-label ranking: Consistency and generalization. Advances in Neural Information Processing Systems, 34(Neur IPS 2021):14332 14344, 2021a. Guoqiang Wu, Chongxuan Li, and Yilong Yin. Towards understanding generalization of macro-auc in multi-label learning. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 37540 37570, 2023. Liang Wu, Antoine Ledent, Yunwen Lei, and Marius Kloft. Fine-grained generalization analysis of vector-valued learning. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, number AAAI 2021, pages 10338 10346, 2021b. Chang Xu, Tongliang Liu, Dacheng Tao, and Chao Xu. Local rademacher complexity for multi-label learning. IEEE Transactions on Image Processing, 25(3):1495 1507, 2016. Miao Xu and Lan-Zhe Guo. Learning from group supervision: the impact of supervision deficiency on multi-label learning. Science China Information Sciences, 64(3), 2021. Guangxu Xun, Kishlay Jha, Jianhui Sun, and Aidong Zhang. Correlation networks for extreme multi-label text classification. In Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, number KDD 2020, pages 1074 1082, 2020. Vacit Oguz Yazici, Abel Gonzalez-Garcia, Arnau Ramisa, Bartlomiej Twardowski, and Joost van de Weijer. Orderless recurrent models for multi-label classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, number CVPR 2020, pages 13440 13449, 2020. Ian En-Hsu Yen, Xiangru Huang, Pradeep Ravikumar, Kai Zhong, and Inderjit S. Dhillon. Pdsparse : A primal and dual sparse approach to extreme multiclass and multilabel classification. In Proceedings of the 33nd International Conference on Machine Learning, volume 48, pages 3069 3077, 2016. Renchun You, Zhiyao Guo, Lei Cui, Xiang Long, Yingze Bao, and Shilei Wen. Cross-modality attention with semantic graph embedding for multi-label classification. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, number AAAI 2020, pages 12709 12716, 2020. Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit S. Dhillon. Large-scale multi-label learning with missing labels. In Proceedings of the 31th International Conference on Machine Learning, volume 32, pages 593 601, 2014. Kai Yu, Shipeng Yu, and Volker Tresp. Multi-label informed latent semantic indexing. In Ricardo A. Baeza-Yates, Nivio Ziviani, Gary Marchionini, Alistair Moffat, and John Tait, editors, Proceedings of the 28th Annual International Conference on Research and Development in Information Retrieval, number SIGIR 2005, pages 258 265, 2005. Ze-Bang Yu and Min-Ling Zhang. Multi-label classification with label-specific feature generation: A wrapped approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9): 5199 5210, 2022. Chunyu Zhang and Zhanshan Li. Multi-label learning with label-specific features via weighting and label entropy guided clustering ensemble. Neurocomputing, 419:59 69, 2021. Jujie Zhang, Min Fang, and Xiao Li. Multi-label learning with discriminative features for each label. Neurocomputing, 154:305 316, 2015. Min-Ling Zhang and Lei Wu. Lift: Multi-label learning with label-specific features. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(1):107 120, 2015. Min-Ling Zhang and Zhi-Hua Zhou. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038 2048, 2007. Min-Ling Zhang and Zhi-Hua Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819 1837, 2014. Min-Ling Zhang, Yu-Kun Li, Xu-Ying Liu, and Xin Geng. Binary relevance for multi-label learning: an overview. Frontiers of Computer Science, 12(2):191 202, 2018. Yi-Fan Zhang and Shizhong Liao. A kernel perspective for the decision boundary of deep neural networks. In Proceedings of the 32nd IEEE International Conference on Tools with Artificial Intelligence, number ICTAI 2020, pages 653 660, 2020. Yi-Fan Zhang and Min-Ling Zhang. Nearly-tight bounds for deep kernel learning. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 41861 41879, 2023. Yi-Fan Zhang and Min-Ling Zhang. Generalization analysis for multi-label learning. In Proceedings of the 41st International Conference on Machine Learning, number ICML 2024, 2024. Yue Zhu, James T. Kwok, and Zhi-Hua Zhou. Multi-label learning with global and local label correlation. IEEE Transactions on Knowledge and Data Engineering, 30(6):1081 1094, 2018. A.1 Appendix Outline In the appendix, we give the detailed proofs of those theoretical results in the main paper. Our main proofs include: The ℓ Lipschitz continuity of the commonly used losses for LSRL (Proposition 1). The novel vector-contraction inequality for ℓ Lipschitz loss (Lemma 1). The generalization bound of the general LSRL class with no dependency on c (Theorem 1). The generalization bound for k-means clustering-based LSRL method (Theorem 2). The (pseudo-) Lipschitz continuity of the squared and Decomposable loss (Proposition 2). The generalization bound for Lasso-based LSRL method (Theorem 3). The generalization bound for DNN-based LSRL method (Theorem 4). A.2 Preliminaries A.2.1 Definitions of the corresponding complexity measures Definition 3 (ℓ norm covering number). Let F be a class of real-valued functions mapping from X to R. Let D = {x1, . . . , xn} be a set with n i.i.d. samples. For any ϵ > 0, the empirical ℓ norm covering number N (ϵ, F, D) w.r.t. D is defined as the minimal number m of a collection of vectors v1, . . . , vm Rn such that maxi [n] f (xi) vj i ϵ (vj i is the i-th component of the vector vj). In this case, we call v1, . . . , vm an (ϵ, ℓ )-cover of F with respect to D. We also define N (ϵ, F, n) = sup D N (ϵ, F, D). Definition 4 (Fat-shattering dimension). Let F be a class of real-valued functions mapping from X to R. We define the fat-shattering dimension fatϵ(F) at scale ϵ > 0 as the largest p N such that there exist p points x1, . . . , xp X and witnesses s1, . . . , sp R satisfying: for any δ1, . . . , δp { 1, +1} there exists f F with δi (f(xi) si) ϵ, i = 1, . . . , p. A.2.2 The Bound for the loss function space For any training dataset D = {(xi, yi) : i [n]}, let D = {(xi, yi) : i [n]} be the training dataset with only one sample different from D, where the k-th sample is replaced by (x k, y k). 
A.2.2 The Bound for the loss function space

For any training dataset $D = \{(x_i, y_i) : i \in [n]\}$, let $D' = \{(x'_i, y'_i) : i \in [n]\}$ be the training dataset with only one sample different from $D$, where the $k$-th sample is replaced by $(x'_k, y'_k)$.

Let $\Phi(D) = \sup_{f \in \mathcal{F}} \big[ \mathbb{E}_{(x,y) \sim P}[\ell(f(x), y)] - \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i) \big] = \sup_{f \in \mathcal{F}} [R(f) - \hat{R}_D(f)]$. Then
$$\Phi(D') - \Phi(D) = \sup_{f \in \mathcal{F}} [R(f) - \hat{R}_{D'}(f)] - \sup_{f \in \mathcal{F}} [R(f) - \hat{R}_D(f)] \le \sup_{f \in \mathcal{F}} [\hat{R}_D(f) - \hat{R}_{D'}(f)] = \sup_{f \in \mathcal{F}} \frac{\ell(f(x_k), y_k) - \ell(f(x'_k), y'_k)}{n} \le \frac{M}{n}.$$

According to McDiarmid's inequality, for any $0 < \delta < 1$, with probability at least $1 - \frac{\delta}{2}$ over the training dataset $D$, the following holds:
$$\Phi(D) \le \mathbb{E}_D[\Phi(D)] + M \sqrt{\frac{\log(2/\delta)}{2n}}. \tag{2}$$

Then, we estimate the upper bound of $\mathbb{E}_D[\Phi(D)]$, where $D'$ now denotes an independent (ghost) copy of $D$ and $\epsilon_1, \dots, \epsilon_n$ are independent Rademacher variables:
$$\begin{aligned}
\mathbb{E}_D[\Phi(D)] &= \mathbb{E}_D \Big[ \sup_{f \in \mathcal{F}} \mathbb{E}_{D'} \big[ \hat{R}_{D'}(f) - \hat{R}_D(f) \big] \Big] \le \mathbb{E}_{D, D'} \Big[ \sup_{f \in \mathcal{F}} \big[ \hat{R}_{D'}(f) - \hat{R}_D(f) \big] \Big] \\
&= \mathbb{E}_{D, D'} \Big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \big( \ell(f(x'_i), y'_i) - \ell(f(x_i), y_i) \big) \Big] = \mathbb{E}_{D, D', \epsilon} \Big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \epsilon_i \big( \ell(f(x'_i), y'_i) - \ell(f(x_i), y_i) \big) \Big] \\
&\le \mathbb{E}_{D', \epsilon} \Big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \epsilon_i \,\ell(f(x'_i), y'_i) \Big] + \mathbb{E}_{D, \epsilon} \Big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (-\epsilon_i)\, \ell(f(x_i), y_i) \Big] \\
&= 2\, \mathbb{E}_{D, \epsilon} \Big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \epsilon_i\, \ell(f(x_i), y_i) \Big] = 2\,\mathfrak{R}(\mathcal{L}).
\end{aligned} \tag{3}$$

Then, applying McDiarmid's inequality to $\mathbb{E}_\epsilon \big[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \epsilon_i \,\ell(f(x_i), y_i) \big]$, with probability at least $1 - \frac{\delta}{2}$ we have
$$\mathfrak{R}(\mathcal{L}) \le \hat{\mathfrak{R}}_D(\mathcal{L}) + M \sqrt{\frac{\log(2/\delta)}{2n}}. \tag{4}$$

Combining (2), (3) and (4), with probability at least $1 - \delta$,
$$R(f) \le \hat{R}_D(f) + 2\, \hat{\mathfrak{R}}_D(\mathcal{L}) + 3M \sqrt{\frac{\log(2/\delta)}{2n}}. \tag{5}$$
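To make the roles of the three terms in (5) concrete, the following sketch evaluates the right-hand side for a toy bounded loss class, with a crude Monte Carlo estimate standing in for $\hat{\mathfrak{R}}_D(\mathcal{L})$. The clipped hinge loss, the norm-bounded linear scores, and all names are assumptions of this example, not anything prescribed by the paper.

```python
import numpy as np

# Illustrative sketch: evaluate a bound of the form (5),
#   R(f) <= R_hat(f) + 2 * Rademacher(L) + 3 * M * sqrt(log(2/delta) / (2n)),
# using a Monte Carlo stand-in for the empirical Rademacher complexity of the
# loss class over a toy bounded class (an assumption of this example).
rng = np.random.default_rng(1)
n, d, M, delta = 200, 10, 1.0, 0.05
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = rng.choice([-1.0, 1.0], size=n)

def loss(w):                               # clipped hinge loss, bounded by M
    return np.minimum(np.maximum(1.0 - y * (X @ w), 0.0), M)

def mc_rademacher(n_eps=200, n_w=500):
    """Monte Carlo estimate of E_eps sup_w (1/n) sum_i eps_i * loss_i(w),
    where the sup is approximated over randomly sampled unit-norm w."""
    W = rng.normal(size=(n_w, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)   # ||w||_2 <= 1
    L = np.stack([loss(w) for w in W])              # (n_w, n) loss matrix
    eps = rng.choice([-1.0, 1.0], size=(n_eps, n))
    return np.mean(np.max(L @ eps.T / n, axis=0))   # sup over the sampled w

w_hat = np.zeros(d)                                 # any fixed predictor
empirical_risk = loss(w_hat).mean()
rad = mc_rademacher()
bound = empirical_risk + 2 * rad + 3 * M * np.sqrt(np.log(2 / delta) / (2 * n))
print(f"empirical risk {empirical_risk:.3f}, MC Rademacher {rad:.3f}, bound {bound:.3f}")
```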
A.3 General Bounds for LSRL

A.3.1 Proof of Proposition 1

We first prove that the surrogate Hamming loss is $\mu$-Lipschitz continuous with respect to the $\ell_\infty$ norm:
$$\begin{aligned}
\big| \ell_H(f(x), y) - \ell_H(f'(x), y) \big| &= \Big| \frac{1}{c} \sum_{j=1}^c \ell_b(y_j f_j(x)) - \frac{1}{c} \sum_{j=1}^c \ell_b(y_j f'_j(x)) \Big| \le \frac{1}{c} \sum_{j=1}^c \big| \ell_b(y_j f_j(x)) - \ell_b(y_j f'_j(x)) \big| \\
&\le \frac{1}{c} \sum_{j=1}^c \mu\, \big| f_j(x) - f'_j(x) \big| \le \frac{1}{c}\,\mu\, c \max_{j \in [c]} \big| f_j(x) - f'_j(x) \big| = \mu\, \| f(x) - f'(x) \|_\infty.
\end{aligned}$$

Second, with the elementary inequality
$$\big| \max\{a_1, \dots, a_c\} - \max\{b_1, \dots, b_c\} \big| \le \max\{ |a_1 - b_1|, \dots, |a_c - b_c| \},$$
we prove that the surrogate Subset loss is $\mu$-Lipschitz continuous with respect to the $\ell_\infty$ norm:
$$\begin{aligned}
\big| \ell_S(f(x), y) - \ell_S(f'(x), y) \big| &= \Big| \max_{j \in [c]} \ell_b(y_j f_j(x)) - \max_{j \in [c]} \ell_b(y_j f'_j(x)) \Big| \le \max_{j \in [c]} \big| \ell_b(y_j f_j(x)) - \ell_b(y_j f'_j(x)) \big| \\
&\le \mu \max_{j \in [c]} \big| f_j(x) - f'_j(x) \big| = \mu\, \| f(x) - f'(x) \|_\infty.
\end{aligned}$$

Third, we prove that the surrogate Ranking loss is $2\mu$-Lipschitz continuous with respect to the $\ell_\infty$ norm:
$$\begin{aligned}
\big| \ell_R(f(x), y) - \ell_R(f'(x), y) \big| &= \Big| \frac{1}{|Y^+||Y^-|} \sum_{p \in Y^+, q \in Y^-} \big[ \ell_b(f_p(x) - f_q(x)) - \ell_b(f'_p(x) - f'_q(x)) \big] \Big| \\
&\le \max_{p \in Y^+, q \in Y^-} \big| \ell_b(f_p(x) - f_q(x)) - \ell_b(f'_p(x) - f'_q(x)) \big| \le \mu \max_{p \in Y^+, q \in Y^-} \big| (f_p(x) - f_q(x)) - (f'_p(x) - f'_q(x)) \big| \\
&\le \mu \Big( \max_{p \in Y^+} \big| f_p(x) - f'_p(x) \big| + \max_{q \in Y^-} \big| f_q(x) - f'_q(x) \big| \Big) \le 2\mu \max_{j \in [c]} \big| f_j(x) - f'_j(x) \big| = 2\mu\, \| f(x) - f'(x) \|_\infty.
\end{aligned}$$

Finally, we prove that the surrogate Decomposable loss is $c\mu$-Lipschitz continuous with respect to the $\ell_\infty$ norm:
$$\begin{aligned}
\big| \ell_D(f(x), y) - \ell_D(f'(x), y) \big| &= \Big| \sum_{j=1}^c \ell_b(y_j f_j(x)) - \sum_{j=1}^c \ell_b(y_j f'_j(x)) \Big| \le \sum_{j=1}^c \big| \ell_b(y_j f_j(x)) - \ell_b(y_j f'_j(x)) \big| \\
&\le \sum_{j=1}^c \mu\, \big| f_j(x) - f'_j(x) \big| \le c\mu \max_{j \in [c]} \big| f_j(x) - f'_j(x) \big| = c\mu\, \| f(x) - f'(x) \|_\infty.
\end{aligned}$$
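The Lipschitz constants established in Proposition 1 can be checked numerically. The following sketch uses the hinge function as the base loss (so $\mu = 1$) and verifies on random score vectors that the surrogate Hamming and Subset losses change by at most $\|f - f'\|_\infty$ and the surrogate Ranking loss by at most $2\|f - f'\|_\infty$; the base loss and the sampling scheme are assumptions of this example.

```python
import numpy as np

# Illustrative numerical check (assumption: hinge base loss, mu = 1) of the
# l_inf Lipschitz constants proved in Proposition 1: mu for the surrogate
# Hamming and Subset losses, 2*mu for the surrogate Ranking loss.
rng = np.random.default_rng(2)
c = 8
base = lambda t: np.maximum(0.0, 1.0 - t)          # hinge, 1-Lipschitz

def hamming(f, y):  return base(y * f).mean()
def subset(f, y):   return base(y * f).max()
def ranking(f, y):
    pos, neg = f[y > 0], f[y < 0]
    return base(pos[:, None] - neg[None, :]).mean()

ratios = {"hamming": [], "subset": [], "ranking": []}
for _ in range(20000):
    y = np.where(rng.random(c) < 0.5, 1.0, -1.0)
    if not ((y > 0).any() and (y < 0).any()):
        continue                                    # ranking loss needs both signs
    f, g = rng.normal(size=c), rng.normal(size=c)
    gap = np.max(np.abs(f - g))                     # ||f - g||_inf
    for name, fn in (("hamming", hamming), ("subset", subset), ("ranking", ranking)):
        ratios[name].append(abs(fn(f, y) - fn(g, y)) / gap)

for name, r in ratios.items():
    print(f"{name:8s}: max |loss(f)-loss(g)| / ||f-g||_inf ~ {max(r):.3f}")
# Expected (up to sampling): hamming, subset <= 1; ranking <= 2.
```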
A.3.2 Proof of Lemma 1

Proof Sketch: First, the Rademacher complexity of the loss function space associated with $\mathcal{F}$ can be bounded by the empirical $\ell_\infty$ norm covering number via the refined Dudley entropy integral inequality. Second, by the Lipschitz continuity w.r.t. the $\ell_\infty$ norm, the empirical $\ell_\infty$ norm covering number of the loss class can be bounded by the empirical $\ell_\infty$ norm covering number of $P(\mathcal{F})$. Third, the empirical $\ell_\infty$ norm covering number of $P(\mathcal{F})$ can be bounded by the fat-shattering dimension, and the fat-shattering dimension can be bounded by the worst-case Rademacher complexity of $P(\mathcal{F})$. Hence, the problem is transferred to the estimation of the worst-case Rademacher complexity. Finally, we estimate the lower bound of the worst-case Rademacher complexity of $P(\mathcal{F})$ and, combining the above steps, bound the Rademacher complexity of the loss function space associated with $\mathcal{F}$.

We first introduce the following lemmas:

Lemma 2 (Khintchine-Kahane inequality [Lust-Piquard and Pisier, 1991]). Let $v_1, \dots, v_n \in H$, where $H$ is a Hilbert space with $\|\cdot\|$ the associated norm, and let $\epsilon_1, \dots, \epsilon_n$ be a sequence of independent Rademacher variables. Then, for any $1 \le p \le 2$, there holds
$$\frac{1}{\sqrt{2}}\Big[\sum_{i=1}^n \|v_i\|^2\Big]^{\frac12} \le \Big[\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n \epsilon_i v_i\Big\|^p\Big]^{\frac1p} \le \Big[\sum_{i=1}^n \|v_i\|^2\Big]^{\frac12}.$$

Lemma 3 (Lemma A.2 in [Srebro et al., 2010]). For any function class $\mathcal{F}$, any $S$ with a finite sample of size $n$ and any $\epsilon > \hat{\mathfrak{R}}_S(\mathcal{F})$, we have
$$\mathrm{fat}_\epsilon(\mathcal{F}) \le \frac{4n\, \hat{\mathfrak{R}}_S^2(\mathcal{F})}{\epsilon^2}.$$

Lemma 4 ([Srebro et al., 2010, Lei et al., 2019]). If any function in the class $\mathcal{F}$ takes values in $[-B, B]$, then for any $S$ with a finite sample of size $n$ and any $\epsilon > 0$ with $\mathrm{fat}_\epsilon(\mathcal{F}) < n$, we have
$$\log \mathcal{N}_\infty(\epsilon, \mathcal{F}, S) \le \mathrm{fat}_\epsilon(\mathcal{F}) \log \frac{2eBn}{\epsilon}.$$

The following lemma is a refined result of Proposition 22 in [Ledent et al., 2021], where we replace the function class taking values in $[0, 1]$ with a $B$-bounded function class; the refinement is straightforward.

Lemma 5 (Refined Dudley entropy integral inequality). Let $\mathcal{F}$ be a real-valued function class with $\|f\|_\infty \le B$ for all $f \in \mathcal{F}$, $B > 0$, and assume that $0 \in \mathcal{F}$. Let $S$ be a finite sample of size $n$. For any $2 \le p \le \infty$, we have the following relationship between the Rademacher complexity $\hat{\mathfrak{R}}_S(\mathcal{F})$ and the covering number $\mathcal{N}_p(\epsilon, \mathcal{F}, S)$:
$$\hat{\mathfrak{R}}_S(\mathcal{F}) \le \inf_{\alpha > 0}\Big(4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^{B}\sqrt{\log \mathcal{N}_p(\epsilon, \mathcal{F}, S)}\,d\epsilon\Big).$$

Step 1: We first derive the relationship between the empirical $\ell_\infty$ norm covering number $\mathcal{N}_\infty(\epsilon, \mathcal{L}, D)$ and the empirical $\ell_\infty$ norm covering number $\mathcal{N}_\infty(\epsilon, P(\mathcal{F}), [c]\cdot D)$. For the dataset $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$ with $n$ i.i.d. examples,
$$\begin{aligned}
\max_i \big| \ell(f(x_i), y_i) - \ell(f'(x_i), y_i) \big| &\le \rho \max_i \| f(x_i) - f'(x_i) \|_\infty &&\text{(use Assumption 2)}\\
&= \rho \max_i \max_j \big| f_j(x_i) - f'_j(x_i) \big| = \rho \max_i \max_j \big| p_j(f(x_i)) - p_j(f'(x_i)) \big|. &&\text{(definition of the projection function class } P(\mathcal{F})\text{)}
\end{aligned}$$
Then, by the definition of the empirical $\ell_\infty$ covering number, an empirical $\ell_\infty$ cover of $P(\mathcal{F})$ at radius $\epsilon/\rho$ is also an empirical $\ell_\infty$ cover of the loss function space associated with $\mathcal{F}$ at radius $\epsilon$, and we conclude that
$$\mathcal{N}_\infty(\epsilon, \mathcal{L}, D) \le \mathcal{N}_\infty\Big(\frac{\epsilon}{\rho}, P(\mathcal{F}), [c]\cdot D\Big). \tag{6}$$

Step 2: We show that the empirical $\ell_\infty$ norm covering number of $P(\mathcal{F})$ can be bounded by the fat-shattering dimension, and the fat-shattering dimension can be bounded by the worst-case Rademacher complexity of $P(\mathcal{F})$. According to Lemma 3, for any $\epsilon > \hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F}))$, we have
$$\mathrm{fat}_\epsilon(P(\mathcal{F})) \le \frac{4nc\, \hat{\mathfrak{R}}^2_{[c]\cdot D}(P(\mathcal{F}))}{\epsilon^2}.$$
Then, combining with Lemma 4, we have
$$\log \mathcal{N}_\infty(\epsilon, P(\mathcal{F}), [c]\cdot D) \le \frac{4nc\, \hat{\mathfrak{R}}^2_{[c]\cdot D}(P(\mathcal{F}))}{\epsilon^2}\log\frac{2eBnc}{\epsilon}.$$
Using the inequality $\epsilon > \hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F}))$, we have
$$\log \mathcal{N}_\infty(\epsilon, P(\mathcal{F}), [c]\cdot D) \le \frac{4nc\, \hat{\mathfrak{R}}^2_{[c]\cdot D}(P(\mathcal{F}))}{\epsilon^2}\log\frac{2eBnc}{\hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F}))}. \tag{7}$$

Step 3: According to Assumption 1 in the main paper, we obtain the lower bound of the worst-case Rademacher complexity $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))$ by the Khintchine-Kahane inequality with $p = 1$:
$$\begin{aligned}
\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) &= \sup_{[c]\cdot D \subseteq [c]\times\mathcal{X}^n}\hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F})) = \sup_{[c]\cdot D}\mathbb{E}_\epsilon\sup_{p_j(f(x_i))\in P(\mathcal{F})}\frac{1}{nc}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,p_j(f(x_i)) = \sup_{[c]\cdot D}\mathbb{E}_\epsilon\sup_{f\in\mathcal{F}}\frac{1}{nc}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,f_j(x_i)\\
&= \sup_{\|\zeta_j(\varphi_j(x_i))\|_2\le A:\,i\in[n],j\in[c]}\mathbb{E}_\epsilon\sup_{w}\frac{1}{nc}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,\langle w_j, \zeta_j(\varphi_j(x_i))\rangle\\
&\ge \sup_{\|\zeta_j(\varphi_j(x_i))\|_2\le A}\frac{\Lambda}{nc}\,\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,\zeta_j(\varphi_j(x_i))\Big\| \ge \sup_{\|\zeta_j(\varphi_j(x_i))\|_2\le A}\frac{\Lambda}{\sqrt{2}\,nc}\Big[\sum_{i=1}^n\sum_{j=1}^c\|\zeta_j(\varphi_j(x_i))\|^2\Big]^{\frac12}. &&\text{(use Lemma 2)}
\end{aligned}$$
Since $\|\zeta_j(\varphi_j(x_i))\|_2 \le A$, we take $\|\zeta_j(\varphi_j(x_i))\|_2 = A$ for all $i\in[n]$, $j\in[c]$, so that the supremum above equals $A\sqrt{nc}$, and hence
$$\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) \ge \frac{\Lambda A}{\sqrt{2nc}}. \tag{8}$$

Step 4: According to Lemma 5 and combining the above steps, we have
$$\begin{aligned}
\hat{\mathfrak{R}}_D(\mathcal{L}) &\le \inf_{\alpha>0}\Big(4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^{M}\sqrt{\log\mathcal{N}_\infty(\epsilon, \mathcal{L}, D)}\,d\epsilon\Big) \le \inf_{\alpha>0}\Big(4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^{M}\sqrt{\log\mathcal{N}_\infty\Big(\frac{\epsilon}{\rho}, P(\mathcal{F}), [c]\cdot D\Big)}\,d\epsilon\Big) &&\text{(use inequality (6))}\\
&\le \inf_{\alpha>0}\Big(4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^{M}\sqrt{\frac{4nc\rho^2\hat{\mathfrak{R}}^2_{[c]\cdot D}(P(\mathcal{F}))}{\epsilon^2}\log\frac{2eBnc}{\hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F}))}}\,d\epsilon\Big) &&\text{(use inequality (7))}\\
&\le \inf_{\alpha>0}\Big(4\alpha + \frac{12}{\sqrt{n}}\int_\alpha^{M}\frac{1}{\epsilon}\sqrt{4nc\rho^2\widetilde{\mathfrak{R}}^2_{nc}(P(\mathcal{F}))\log\big(2\sqrt{2}\,e\,n^{\frac32}c^{\frac32}\big)}\,d\epsilon\Big) &&\text{(use inequality (8), the definition of the worst-case Rademacher complexity, and } B\le\Lambda A\text{)}\\
&= \inf_{\alpha>0}\Big(4\alpha + 12\sqrt{2}\,\rho\sqrt{c}\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))\log^{\frac12}(8e^2n^3c^3)\int_\alpha^{M}\frac{d\epsilon}{\epsilon}\Big)\\
&\le 12\sqrt{2}\,\rho\sqrt{c}\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) + 12\sqrt{2}\,\rho\sqrt{c}\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))\log^{\frac12}(8e^2n^3c^3)\log\frac{M}{3\sqrt{2}\,\rho\sqrt{c}\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))} &&\Big(\text{choose }\alpha = 3\sqrt{2}\,\rho\sqrt{c}\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))\Big)\\
&\le 12\sqrt{2}\,\rho\sqrt{c}\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))\Big(1 + \log^{\frac12}(8e^2n^3c^3)\log\frac{M\sqrt{n}}{\rho B}\Big). &&\text{(use inequality (8))}
\end{aligned}$$

A.3.3 Proof of Theorem 1

We upper bound the worst-case Rademacher complexity $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))$ as follows:
$$\begin{aligned}
\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) &= \sup_{[c]\cdot D}\hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F})) = \sup_{[c]\cdot D}\mathbb{E}_\epsilon\sup_{p_j(f(x_i))\in P(\mathcal{F})}\frac{1}{nc}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,p_j(f(x_i)) = \sup_{[c]\cdot D}\mathbb{E}_\epsilon\sup_{f\in\mathcal{F}}\frac{1}{nc}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,f_j(x_i)\\
&= \sup_{\|\zeta_j(\varphi_j(x_i))\|_2\le A:\,i\in[n],j\in[c]}\mathbb{E}_\epsilon\sup_{w}\frac{1}{nc}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,\langle w_j, \zeta_j(\varphi_j(x_i))\rangle \le \sup_{\|\zeta_j(\varphi_j(x_i))\|_2\le A}\frac{\Lambda}{nc}\,\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,\zeta_j(\varphi_j(x_i))\Big\|\\
&\le \sup_{\|\zeta_j(\varphi_j(x_i))\|_2\le A}\frac{\Lambda}{nc}\Big[\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,\zeta_j(\varphi_j(x_i))\Big\|^2\Big]^{\frac12} &&\text{(Jensen's inequality)}\\
&\le \sup_{\|\zeta_j(\varphi_j(x_i))\|_2\le A}\frac{\Lambda}{nc}\Big[\sum_{i=1}^n\sum_{j=1}^c\|\zeta_j(\varphi_j(x_i))\|^2\Big]^{\frac12} \le \frac{\Lambda A}{\sqrt{nc}}. &&\text{(use Lemma 2)}
\end{aligned}$$
Then, we have
$$\hat{\mathfrak{R}}_D(\mathcal{L}) \le 12\sqrt{2}\,\rho\sqrt{c}\,\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))\Big(1 + \log^{\frac12}(8e^2n^3c^3)\log\frac{M\sqrt{n}}{\rho B}\Big) \le \frac{12\sqrt{2}\,\rho\Lambda A}{\sqrt{n}}\Big(1 + \log^{\frac12}(8e^2n^3c^3)\log\frac{M\sqrt{n}}{\rho B}\Big).$$
Combining with (5), with probability at least $1-\delta$,
$$R(f) \le \hat{R}_D(f) + \frac{24\sqrt{2}\,\rho\Lambda A}{\sqrt{n}}\Big(1 + \log^{\frac12}(8e^2n^3c^3)\log\frac{M\sqrt{n}}{\rho B}\Big) + 3M\sqrt{\frac{\log(2/\delta)}{2n}}.$$
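As a sanity check on the scaling that drives Theorem 1, the following sketch Monte Carlo estimates the Rademacher complexity of a toy linear class with label-specific representations and compares it with the $\Lambda A/\sqrt{nc}$ upper bound used above. The jointly norm-bounded constraint $\sum_j \|w_j\|_2^2 \le \Lambda^2$, the Gaussian toy representations, and all names are assumptions made for this illustration rather than the paper's exact function class.

```python
import numpy as np

# Illustrative Monte Carlo check of the scaling used in Theorem 1: for a toy
# class f_j(x) = <w_j, z_j(x)> with sum_j ||w_j||_2^2 <= Lambda^2 (an assumed
# joint constraint) and ||z_j(x)||_2 <= A, the Rademacher complexity over the
# augmented sample of size n*c should be at most Lambda * A / sqrt(n*c).
rng = np.random.default_rng(3)
Lambda, A, d, n = 1.0, 1.0, 20, 100

def mc_rademacher_nc(c, n_eps=300):
    # fixed label-specific representations z_j(x_i), each with norm exactly A
    Z = rng.normal(size=(n, c, d))
    Z *= A / np.linalg.norm(Z, axis=2, keepdims=True)
    vals = []
    for _ in range(n_eps):
        eps = rng.choice([-1.0, 1.0], size=(n, c))
        # sup over the jointly bounded w of (1/nc) sum_ij eps_ij <w_j, z_j(x_i)>
        # equals (Lambda/nc) * sqrt( sum_j || sum_i eps_ij z_j(x_i) ||^2 )
        per_label = np.linalg.norm((eps[:, :, None] * Z).sum(axis=0), axis=1)
        vals.append(Lambda * np.sqrt((per_label ** 2).sum()) / (n * c))
    return np.mean(vals)

for c in (5, 20, 80):
    est = mc_rademacher_nc(c)
    print(f"c = {c:3d}: MC estimate {est:.4f}  vs  Lambda*A/sqrt(nc) = {Lambda*A/np.sqrt(n*c):.4f}")
```

The estimates stay below the $\Lambda A/\sqrt{nc}$ reference for every $c$, illustrating why the complexity term in Theorem 1 carries no polynomial dependence on the number of labels.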
A.4 Generalization Bounds for Typical LSRL Methods

A.4.1 Proof of Theorem 2

The generalization analysis of LIFT involves the generalization analysis of k-means clustering. According to the definitions of k-means clustering in Subsection 5.1, the empirical Rademacher complexity of the loss function space associated with the vector-valued function class $\mathcal{G}$ is
$$\hat{\mathfrak{R}}_D(\mathcal{L}_{clu}\circ\mathcal{G}) = \mathbb{E}_\epsilon\Big[\sup_{g\in\mathcal{G}}\frac{1}{n(n-1)}\sum_{i,j=1,\,i\ne j}^{n}\epsilon_i\,\ell_{clu}\big(g(x_i,x_j)\big)\Big].$$
Rademacher complexity has proved to be a powerful data-dependent measure of hypothesis space complexity. However, since the k-means clustering framework involves pairwise functions, a sequence of pairs built from i.i.d. individual observations is no longer independent, which makes the standard i.i.d. techniques for traditional Rademacher complexity inapplicable to k-means clustering. We convert the non-sum-of-i.i.d. pairwise function into a sum-of-i.i.d. form by using permutations in the U-process [Clémençon et al., 2008]. We first prove the following lemma:

Lemma 6. Let $q_\tau: \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ be real-valued functions indexed by $\tau\in T$, where $T$ is some set. If $x_1,\dots,x_s$ and $x'_1,\dots,x'_t$ are i.i.d. and $r=\min\{s,t\}$, then for any convex non-decreasing function $\psi$,
$$\mathbb{E}\,\psi\Big(\sup_{\tau\in T}\frac{1}{st}\sum_{i=1}^{s}\sum_{j=1}^{t}q_\tau(x_i,x'_j)\Big)\le\mathbb{E}\,\psi\Big(\sup_{\tau\in T}\frac1r\sum_{i=1}^{r}q_\tau(x_i,x'_i)\Big).$$

Proof. The proof of this lemma is inspired by [Clémençon et al., 2008]. The two-sample mean can be written as an average, over random index selections $\pi$, of averages of $r$ pairs built from distinct indices, so that
$$\begin{aligned}
\mathbb{E}\,\psi\Big(\sup_{\tau\in T}\frac{1}{st}\sum_{i=1}^{s}\sum_{j=1}^{t}q_\tau(x_i,x'_j)\Big)
&=\mathbb{E}\,\psi\Big(\sup_{\tau\in T}\mathbb{E}_\pi\Big[\frac1r\sum_{i=1}^{r}q_\tau\big(x_{\pi(i)},x'_{\pi(i)}\big)\Big]\Big)\\
&\le\mathbb{E}\,\psi\Big(\mathbb{E}_\pi\Big[\sup_{\tau\in T}\frac1r\sum_{i=1}^{r}q_\tau\big(x_{\pi(i)},x'_{\pi(i)}\big)\Big]\Big) &&(\psi\text{ is non-decreasing})\\
&\le\mathbb{E}\,\mathbb{E}_\pi\Big[\psi\Big(\sup_{\tau\in T}\frac1r\sum_{i=1}^{r}q_\tau\big(x_{\pi(i)},x'_{\pi(i)}\big)\Big)\Big] &&\text{(Jensen's inequality)}\\
&=\mathbb{E}\,\psi\Big(\sup_{\tau\in T}\frac1r\sum_{i=1}^{r}q_\tau(x_i,x'_i)\Big),
\end{aligned}$$
where the last equality holds because, for each fixed index selection $\pi$, $\big(x_{\pi(i)},x'_{\pi(i)}\big)_{i=1}^{r}$ has the same distribution as $\big(x_i,x'_i\big)_{i=1}^{r}$.

According to Lemma 6, $\hat{\mathfrak{R}}_D(\mathcal{L}_{clu}\circ\mathcal{G})$ can be bounded by
$$\hat{\mathfrak{R}}'_D(\mathcal{L}_{clu}\circ\mathcal{G}) := \mathbb{E}_\epsilon\Big[\sup_{g\in\mathcal{G}}\frac{1}{\lfloor n/2\rfloor}\sum_{i=1}^{\lfloor n/2\rfloor}\epsilon_i\,\ell_{clu}\big(g(x_i,x_{i+\lfloor n/2\rfloor})\big)\Big],$$
where each $\epsilon_i$ is an independent Rademacher random variable. In order to obtain tighter generalization bounds for k-means clustering, we develop the following novel vector-contraction inequality:

Lemma 7. Let $\mathcal{G}$ be the vector-valued function class of k-means clustering defined in Subsection 5.1, and let $D$ be a dataset of size $n$. Then,
$$\hat{\mathfrak{R}}_D(\mathcal{L}_{clu}\circ\mathcal{G}) \le 12\sqrt{2K}\max_k\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)\Big(1+\log^{\frac12}(e^2n^3)\log\frac{M}{\max_k\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)}\Big), \tag{9}$$
where $\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)$ is the worst-case Rademacher complexity and $\mathcal{G}_k$ is the restriction of the function class along the $k$-th coordinate, $g_k\in\mathcal{G}_k$.

Proof. We first introduce the following lemma:

Lemma 8 (Lemma 1 in [Foster and Rakhlin, 2019]). Let $\mathcal{F}\subseteq\{f:\mathcal{X}\to\mathbb{R}^K\}$, and let $\phi_1,\dots,\phi_n$ each be $L$-Lipschitz with respect to the $\ell_\infty$ norm. For any $D$ with a finite sample of size $n$ and any $\epsilon>0$, we have
$$\log\mathcal{N}_2(\epsilon,\phi\circ\mathcal{F},D)\le K\max_k\log\mathcal{N}_\infty\Big(\frac\epsilon L,\mathcal{F}_k,D\Big), \tag{10}$$
where $\mathcal{F}_k$ is the restriction of the function class along the $k$-th coordinate, $f_k\in\mathcal{F}_k$, $k\in[K]$.

The empirical $\ell_\infty$ norm covering number of $\mathcal{G}_k$ can be bounded by the fat-shattering dimension, and the fat-shattering dimension can be bounded by the worst-case Rademacher complexity of $\mathcal{G}_k$. Writing $D'=\{(x_i,x_{i+n/2})\}_{i=1}^{n/2}$ and combining the above steps, we have
$$\begin{aligned}
\hat{\mathfrak{R}}_D(\mathcal{L}_{clu}\circ\mathcal{G}) &\le \hat{\mathfrak{R}}'_D(\mathcal{L}_{clu}\circ\mathcal{G}) \le \inf_{\alpha>0}\Big(4\alpha+\frac{12}{\sqrt{n/2}}\int_\alpha^{M}\sqrt{\log\mathcal{N}_2(\epsilon,\mathcal{L}_{clu}\circ\mathcal{G},D')}\,d\epsilon\Big) &&\text{(use Lemma 5)}\\
&\le \inf_{\alpha>0}\Big(4\alpha+\frac{12}{\sqrt{n/2}}\int_\alpha^{M}\sqrt{K\max_k\log\mathcal{N}_\infty(\epsilon,\mathcal{G}_k,D')}\,d\epsilon\Big) &&\text{(use (10); } \ell_{clu}\text{ is 1-Lipschitz w.r.t. the }\ell_\infty\text{ norm for k-means clustering)}\\
&\le \inf_{\alpha>0}\Big(4\alpha+\frac{12}{\sqrt{n/2}}\int_\alpha^{M}\sqrt{K\max_k\mathrm{fat}_\epsilon(\mathcal{G}_k)\log\frac{2eB'n}{2\epsilon}}\,d\epsilon\Big) &&\text{(use Lemma 4)}\\
&\le \inf_{\alpha>0}\Big(4\alpha+\frac{12}{\sqrt{n/2}}\int_\alpha^{M}\sqrt{K\max_k\frac{4(n/2)\,\hat{\mathfrak{R}}^2_{D'}(\mathcal{G}_k)}{\epsilon^2}\log\frac{2eB'(n/2)}{\hat{\mathfrak{R}}_{D'}(\mathcal{G}_k)}}\,d\epsilon\Big) &&\text{(use Lemma 3)}\\
&\le \inf_{\alpha>0}\Big(4\alpha+\frac{12}{\sqrt{n/2}}\int_\alpha^{M}\frac1\epsilon\sqrt{2nK\max_k\widetilde{\mathfrak{R}}^2_{\frac n2}(\mathcal{G}_k)\log\big(e\,n^{\frac32}\big)}\,d\epsilon\Big) &&\text{(definition of the worst-case Rademacher complexity; lower bound of }\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)\text{ as in (8))}\\
&= \inf_{\alpha>0}\Big(4\alpha+12\sqrt{2K}\max_k\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)\log^{\frac12}(e^2n^3)\int_\alpha^{M}\frac{d\epsilon}\epsilon\Big)\\
&\le 12\sqrt{2K}\max_k\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)+12\sqrt{2K}\max_k\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)\log^{\frac12}(e^2n^3)\log\frac{M}{3\sqrt{2K}\max_k\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)} &&\Big(\text{choose }\alpha=3\sqrt{2K}\max_k\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)\Big)\\
&\le 12\sqrt{2K}\max_k\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)\Big(1+\log^{\frac12}(e^2n^3)\log\frac{M}{\max_k\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)}\Big).
\end{aligned}$$

Since the k-means clustering-based LSRL method is two-stage (k-means clustering generates the label-specific representations in the first stage, and the second stage is conventional multi-label learning), the corresponding whole function class is $\mathcal{H}=\mathcal{F}+\mathcal{L}_{clu}\circ\mathcal{G}$. Since $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{H}))\le\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))+\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{L}_{clu}\circ\mathcal{G}))$ [Bartlett and Mendelson, 2002], together with Lemma 1 we have
$$\hat{\mathfrak{R}}_D(\mathcal{L}\circ\mathcal{H})\le 12\sqrt2\,\rho\sqrt c\,\big(\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))+\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{L}_{clu}\circ\mathcal{G}))\big)\Big(1+\log^{\frac12}(8e^2n^3c^3)\log\frac{M\sqrt n}{\rho B}\Big).$$
According to Lemma 7, the worst-case Rademacher complexity of the loss function space associated with $P(\mathcal{G})$ can be bounded by the worst-case Rademacher complexity of the restriction of the function class along each coordinate, $P(\mathcal{G}_k)$. Hence, for the k-means clustering-based LSRL method, Lemma 7 involves the upper and lower bounds of $\widetilde{\mathfrak{R}}_{\frac{nc}2}(P(\mathcal{G}_k))$.

We first obtain the lower bound of $\widetilde{\mathfrak{R}}_{\frac{nc}2}(P(\mathcal{G}_k))$ by the Khintchine-Kahane inequality with $p=1$. Writing each $g_k^j\in\mathcal{G}_k^j$ as a product $V(x_i,x_{i+\frac n2})\,Z_k^j(x_i,x_{i+\frac n2})$ with $V(\cdot,\cdot)\le 4R^2$ and $|Z_k^j(\cdot,\cdot)|\le 1$, we have
$$\begin{aligned}
\widetilde{\mathfrak{R}}_{\frac{nc}2}(P(\mathcal{G}_k)) &= \sup_{[c]\cdot D'}\hat{\mathfrak{R}}_{[c]\cdot D'}(P(\mathcal{G}_k)) = \sup_{[c]\cdot D'}\mathbb{E}_\epsilon\sup_{p_j(g_k(x_i,x_{i+\frac n2}))\in P(\mathcal{G}_k)}\frac{2}{nc}\sum_{i=1}^{n/2}\sum_{j=1}^{c}\epsilon_{ij}\,p_j\big(g_k(x_i,x_{i+\frac n2})\big)\\
&= \sup_{[c]\cdot D'}\mathbb{E}_\epsilon\sup_{g_k^j\in\mathcal{G}_k^j}\frac{2}{nc}\sum_{i=1}^{n/2}\sum_{j=1}^{c}\epsilon_{ij}\,V(x_i,x_{i+\frac n2})\,Z_k^j(x_i,x_{i+\frac n2})\\
&\ge \frac{8R^2}{nc}\sup_{Z_k^j}\mathbb{E}_\epsilon\Big|\sum_{i=1}^{n/2}\sum_{j=1}^{c}\epsilon_{ij}\,Z_k^j(x_i,x_{i+\frac n2})\Big| \ge \frac{8R^2}{\sqrt2\,nc}\sup_{Z_k^j}\Big[\sum_{i=1}^{n/2}\sum_{j=1}^{c}\big(Z_k^j(x_i,x_{i+\frac n2})\big)^2\Big]^{\frac12}. &&\text{(use Lemma 2)}
\end{aligned}$$
Since $|Z_k^j(x_i,x_{i+\frac n2})|\le 1$, we take $|Z_k^j(x_i,x_{i+\frac n2})|=1$ for all $i\in[n/2]$, $j\in[c]$, so that
$$\widetilde{\mathfrak{R}}_{\frac{nc}2}(P(\mathcal{G}_k)) \ge \frac{8R^2}{\sqrt2\,nc}\sqrt{\frac{nc}2} = \frac{4R^2}{\sqrt{nc}}.$$
We then upper bound the worst-case Rademacher complexity $\widetilde{\mathfrak{R}}_{\frac{nc}2}(P(\mathcal{G}_k))$ as follows:
$$\begin{aligned}
\widetilde{\mathfrak{R}}_{\frac{nc}2}(P(\mathcal{G}_k)) &= \sup_{[c]\cdot D'}\mathbb{E}_\epsilon\sup_{g_k^j\in\mathcal{G}_k^j}\frac{2}{nc}\sum_{i=1}^{n/2}\sum_{j=1}^{c}\epsilon_{ij}\,V(x_i,x_{i+\frac n2})\,Z_k^j(x_i,x_{i+\frac n2})\\
&\le \frac{8R^2}{nc}\sup_{Z_k^j}\mathbb{E}_\epsilon\Big|\sum_{i=1}^{n/2}\sum_{j=1}^{c}\epsilon_{ij}\,Z_k^j(x_i,x_{i+\frac n2})\Big| \le \frac{8R^2}{nc}\sup_{Z_k^j}\Big[\mathbb{E}_\epsilon\Big(\sum_{i=1}^{n/2}\sum_{j=1}^{c}\epsilon_{ij}\,Z_k^j(x_i,x_{i+\frac n2})\Big)^2\Big]^{\frac12} &&\text{(Jensen's inequality)}\\
&= \frac{8R^2}{nc}\sup_{Z_k^j}\Big[\sum_{i=1}^{n/2}\sum_{j=1}^{c}\big(Z_k^j(x_i,x_{i+\frac n2})\big)^2\Big]^{\frac12} \le \frac{8R^2}{nc}\sqrt{\frac{nc}2} = \frac{4\sqrt2\,R^2}{\sqrt{nc}}. &&\text{(use Lemma 2)}
\end{aligned}$$
Replacing the lower bound of $\widetilde{\mathfrak{R}}_{\frac n2}(\mathcal{G}_k)$ in the proof of Lemma 7 with the lower bound of $\widetilde{\mathfrak{R}}_{\frac{nc}2}(P(\mathcal{G}_k))$, we have
$$\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{L}_{clu}\circ\mathcal{G})) \le 12\sqrt{2K}\max_k\widetilde{\mathfrak{R}}_{\frac{nc}2}(P(\mathcal{G}_k))\Big(1+\log^{\frac12}(e^2n^3c^3)\log\frac{M\sqrt{nc}}{4R^2}\Big).$$
Hence, combining with the upper bound of $\widetilde{\mathfrak{R}}_{\frac{nc}2}(P(\mathcal{G}_k))$, we have
$$\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{L}_{clu}\circ\mathcal{G})) \le \frac{96\sqrt K\,R^2}{\sqrt{nc}}\Big(1+\log^{\frac12}(e^2n^3c^3)\log\frac{M\sqrt{nc}}{4R^2}\Big).$$
Using a similar technique to the proof of the inequality above, the upper bound of $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))$ is $\frac{2\Lambda\sqrt K R}{\sqrt{nc}}$, since the label-specific representation satisfies
$$\|\varphi_j(x)\| = \sqrt{\sum_{k=1}^{K}\big(d(x,c_k^j)\big)^2} \le \sqrt K\max_k\big\|x-c_k^j\big\| \le \sqrt K\max_k\big(\|x\|+\|c_k^j\|\big) \le 2\sqrt K R.$$
Finally, combining the above inequalities with (5), with probability at least $1-\delta$ we have
$$R(f) \le \hat R_D(f) + 3M\sqrt{\frac{\log(2/\delta)}{2n}} + \frac{24\sqrt2\,\rho\sqrt K}{\sqrt n}\Big[2\Lambda R + 96R^2\Big(1+\log^{\frac12}(e^2n^3c^3)\log\frac{M\sqrt{nc}}{4R^2}\Big)\Big]\Big(1+\log^{\frac12}(8e^2n^3c^3)\log\frac{M\sqrt n}{\rho B}\Big).$$
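For concreteness, the following sketch builds label-specific representations by clustering the positive and negative instances of each label and mapping every instance to its distances to the resulting centers, in the spirit of the k-means clustering-based construction (LIFT [Zhang and Wu, 2015]) analyzed in Theorem 2. The ratio parameter, the use of scikit-learn's KMeans, and the fallback choices are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np
from sklearn.cluster import KMeans

def lift_representations(X, Y, ratio=0.1, random_state=0):
    """Minimal LIFT-style sketch (illustrative): for each label j, cluster its
    positive and negative instances separately and map every instance to its
    distances to the resulting centers, yielding a label-specific phi_j(x)."""
    n, c = X.shape[0], Y.shape[1]
    reps = []
    for j in range(c):
        pos, neg = X[Y[:, j] == 1], X[Y[:, j] != 1]
        # clusters per part: a fraction of the smaller part, at least one
        K = max(1, int(np.ceil(ratio * min(len(pos), len(neg)))))
        centers = []
        for part in (pos, neg):
            if len(part) >= K:
                km = KMeans(n_clusters=K, n_init=10, random_state=random_state).fit(part)
                centers.append(km.cluster_centers_)
        centers = np.vstack(centers) if centers else X[:1]
        # phi_j(x): vector of Euclidean distances to the label-specific centers
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        reps.append(dists)
    return reps  # list of c arrays, one (n, K_j)-shaped representation per label

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
Y = (rng.random((60, 3)) < 0.4).astype(int)
reps = lift_representations(X, Y)
print([r.shape for r in reps])
```

The distance-to-center form of $\varphi_j(x)$ is exactly what the bound above exploits through $\|\varphi_j(x)\| \le 2\sqrt{K}R$.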
A.4.2 Proof of Proposition 2

First, we prove that the squared loss is 1 pseudo-Lipschitz of order 2:
$$\begin{aligned}
\big|\ell_b(y_jf_j(x)) - \ell_b(y_jf'_j(x))\big| &= \big|(y_j - f_j(x))^2 - (y_j - f'_j(x))^2\big| = \big|(y_j - f_j(x)) + (y_j - f'_j(x))\big|\cdot\big|(y_j - f_j(x)) - (y_j - f'_j(x))\big| \\
&\le \big(|y_j - f_j(x)| + |y_j - f'_j(x)|\big)\,\big|(y_j - f_j(x)) - (y_j - f'_j(x))\big| \\
&\le \big(1 + |y_j - f_j(x)| + |y_j - f'_j(x)|\big)\,\big|(y_j - f_j(x)) - (y_j - f'_j(x))\big|.
\end{aligned}$$
According to the definition of the pseudo-Lipschitz function, the squared loss is 1 pseudo-Lipschitz of order 2. Then, we have
$$\begin{aligned}
\big|\ell_b(y_jf_j(x)) - \ell_b(y_jf'_j(x))\big| &\le \big(1 + |y_j - f_j(x)| + |y_j - f'_j(x)|\big)\,\big|(y_j - f_j(x)) - (y_j - f'_j(x))\big| \\
&\le \big(1 + 2|y_j| + |f_j(x)| + |f'_j(x)|\big)\,\big|f_j(x) - f'_j(x)\big| \le (3 + 2B)\,\big|f_j(x) - f'_j(x)\big|.
\end{aligned}$$
Second, we prove that the surrogate Decomposable loss is $(3+2B)c$-Lipschitz continuous with respect to the $\ell_\infty$ norm if the base loss $\ell_b$ is the squared loss:
$$\begin{aligned}
\big|\ell_D(f(x), y) - \ell_D(f'(x), y)\big| &= \Big|\sum_{j=1}^c \ell_b(y_jf_j(x)) - \sum_{j=1}^c \ell_b(y_jf'_j(x))\Big| \le \sum_{j=1}^c \big|\ell_b(y_jf_j(x)) - \ell_b(y_jf'_j(x))\big| \\
&\le \sum_{j=1}^c (3+2B)\,\big|f_j(x) - f'_j(x)\big| \le c(3+2B)\max_{j\in[c]}\big|f_j(x)-f'_j(x)\big| = (3+2B)c\,\|f(x)-f'(x)\|_\infty.
\end{aligned}$$

A.4.3 Proof of Theorem 3

Compared with the function class of LSRL where each label corresponds to a Lasso, the function class of LLSF introduces an additional sharing constraint. In fact, the introduction of the sharing constraint reduces the complexity of the function class compared with the original class, so the complexity of the function class of LLSF can be bounded by the complexity of the function class of LSRL where each label corresponds to a Lasso. Hence, the complexity analysis of the function class of LLSF can be converted into bounding the Rademacher complexity of the LSRL function class where each label corresponds to a Lasso. According to the definition, this class means that, in the class of LSRL defined by (1), the base loss $\ell_b$ is the squared loss, the nonlinear mappings $\zeta_j(\cdot)$ and $\varphi_j(\cdot)$ are both identity transformations for any $j\in[c]$, and the constraint $\alpha(w)$ is $\|w_j\|_1\le\Lambda$ for any $j\in[c]$. The proof process is similar to Lemma 1 and Theorem 1, but the upper bound of the worst-case Rademacher complexity $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))$ here differs from Theorem 1. According to the above definitions, we have
$$\begin{aligned}
\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) &= \sup_{[c]\cdot D}\hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F})) = \sup_{[c]\cdot D}\mathbb{E}_\epsilon\sup_{p_j(f(x_i))\in P(\mathcal{F})}\frac{1}{nc}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,p_j(f(x_i)) = \sup_{[c]\cdot D}\mathbb{E}_\epsilon\sup_{f\in\mathcal{F}}\frac{1}{nc}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,f_j(x_i)\\
&= \sup_{\|x^j_i\|_2\le R:\,i\in[n],j\in[c]}\mathbb{E}_\epsilon\sup_{\|w_j\|_1\le\Lambda}\frac{1}{nc}\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,\langle w_j, x^j_i\rangle \le \sup_{\|x^j_i\|_2\le R}\frac{\Lambda}{nc}\,\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,x^j_i\Big\|_\infty &&\text{(H\"older's inequality)}\\
&\le \sup_{\|x^j_i\|_2\le R}\frac{\Lambda}{nc}\,\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,x^j_i\Big\|_2 \le \sup_{\|x^j_i\|_2\le R}\frac{\Lambda}{nc}\Big[\mathbb{E}_\epsilon\Big\|\sum_{i=1}^n\sum_{j=1}^c\epsilon_{ij}\,x^j_i\Big\|_2^2\Big]^{\frac12} &&\text{(Jensen's inequality)}\\
&= \sup_{\|x^j_i\|_2\le R}\frac{\Lambda}{nc}\Big[\sum_{i=1}^n\sum_{j=1}^c\big\|x^j_i\big\|_2^2\Big]^{\frac12} \le \frac{\Lambda R}{\sqrt{nc}}. &&\text{(use Lemma 2)}
\end{aligned} \tag{11}$$
Combining Lemma 1 with inequalities (11) and (5), and $\rho = (3+2B)c$, with probability at least $1-\delta$ we have
$$R(f) \le \hat R_D(f) + \frac{24\sqrt2\,(3+2B)c\,\Lambda R}{\sqrt n}\Big(1+\log^{\frac12}(8e^2n^3c^3)\log\frac{M\sqrt n}{\rho B}\Big) + 3M\sqrt{\frac{\log(2/\delta)}{2n}}.$$
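As an illustration of the decoupled Lasso class used to bound LLSF in Theorem 3, the following sketch fits one $\ell_1$-regularized linear model per label. The regularization strength and the toy data are arbitrary assumptions, and LLSF's label-correlation sharing term is deliberately omitted here.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_lsrl(X, Y, alpha=0.05):
    """Minimal sketch (illustrative): learn one l1-regularized linear model per
    label, so each label selects its own sparse, label-specific features.
    This is the decoupled class used to bound LLSF in Theorem 3; the actual
    LLSF objective additionally couples labels through a sharing term."""
    n, c = Y.shape
    W = np.zeros((X.shape[1], c))
    for j in range(c):
        model = Lasso(alpha=alpha, max_iter=5000)
        model.fit(X, Y[:, j])
        W[:, j] = model.coef_
    return W  # column j = sparse label-specific weights for label j

# toy usage: inspect how many features each label keeps
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 30))
Y = (X[:, :4] @ rng.normal(size=(4, 5)) + 0.1 * rng.normal(size=(120, 5)) > 0).astype(float)
W = lasso_lsrl(X, Y)
print("selected features per label:", (np.abs(W) > 1e-8).sum(axis=0))
```

The per-label $\ell_1$ constraint $\|w_j\|_1 \le \Lambda$ is exactly the quantity that enters the bound through Hölder's inequality in (11).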
A.4.4 Proof of Theorem 4

First, we upper bound the worst-case Rademacher complexity $\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F}))$ for the DNN-based LSRL method. With the definitions in the main paper, $f_j(x)=\sigma_{sig}\big(w_j^\top\varphi_j(x)\big)$ with $\|w_j\|\le\Lambda$, and the label-specific representation is $\varphi_j(x_i)=\sigma_{ReLU}\big\{W_5\big[\sigma_{ReLU}(W_4x_i)\odot\sigma_{sig}(W_3\psi(Y)_j)\big]\big\}$, where each weight matrix satisfies $\|W_l\|\le D$ and $\|x_i\|\le R$. We have
$$\begin{aligned}
\widetilde{\mathfrak{R}}_{nc}(P(\mathcal{F})) &= \sup_{[c]\cdot D}\hat{\mathfrak{R}}_{[c]\cdot D}(P(\mathcal{F})) = \sup_{[c]\cdot D}\mathbb{E}_\epsilon\sup_{p_j(f(x_i))\in P(\mathcal{F})}\frac{1}{nc}\sum_{i=1}^{n}\sum_{j=1}^{c}\epsilon_{ij}\,p_j(f(x_i)) = \sup_{[c]\cdot D}\mathbb{E}_\epsilon\sup\frac{1}{nc}\sum_{i=1}^{n}\sum_{j=1}^{c}\epsilon_{ij}\,\sigma_{sig}\big(w_j^\top\varphi_j(x_i)\big)\\
&\le \frac12\sup_{[c]\cdot D}\mathbb{E}_\epsilon\sup_{\|w_j\|\le\Lambda,\,\varphi_j}\frac{1}{nc}\sum_{i=1}^{n}\sum_{j=1}^{c}\epsilon_{ij}\,w_j^\top\varphi_j(x_i) &&\text{(the Lipschitz constant of the sigmoid activation is bounded by $\tfrac14$)}\\
&\le \frac\Lambda2\sup_{[c]\cdot D}\mathbb{E}_\epsilon\sup_{\varphi_j}\frac{1}{nc}\Big\|\sum_{i=1}^{n}\sum_{j=1}^{c}\epsilon_{ij}\,\sigma_{ReLU}\big\{W_5\big[\sigma_{ReLU}(W_4x_i)\odot\sigma_{sig}(W_3\psi(Y)_j)\big]\big\}\Big\|\\
&\le \Lambda D\,\sup_j\big\|\sigma_{sig}(W_3\psi(Y)_j)\big\|\,\sup_{\|x_i\|\le R}\mathbb{E}_\epsilon\sup_{\|W_4\|\le D}\frac{1}{nc}\Big\|\sum_{i=1}^{n}\sum_{j=1}^{c}\epsilon_{ij}\,W_4x_i\Big\| &&\text{(the ReLU activation is 1-Lipschitz; $\|W_5\|\le D$)}\\
&\le \Lambda D^2\,\sup_j\big\|\sigma_{sig}(W_3\psi(Y)_j)\big\|\,\sup_{\|x_i\|\le R}\frac{1}{nc}\Big[\mathbb{E}_\epsilon\Big\|\sum_{i=1}^{n}\sum_{j=1}^{c}\epsilon_{ij}\,x_i\Big\|^2\Big]^{\frac12} &&\text{(Jensen's inequality)}\\
&\le \Lambda D^2\,\sup_j\big\|\sigma_{sig}(W_3\psi(Y)_j)\big\|\,\frac{R}{\sqrt{nc}}. \tag{12}
\end{aligned}$$
Then, we bound $\sup_j\|\sigma_{sig}(W_3\psi(Y)_j)\|$, where $\psi(Y)=\sigma_{ReLU}\big(\tilde A\,\sigma_{ReLU}(\tilde A\,Y W_1)W_2\big)$ is the label-embedding branch and $\tilde A$ is the normalized adjacency matrix (with self-loops) of the label graph:
$$\begin{aligned}
\sup_j\big\|\sigma_{sig}(W_3\psi(Y)_j)\big\| &\le \frac14\sup_{\|W_3\|\le D}\sup_j\big\|W_3\,\psi(Y)_j\big\| \le \frac D4\sup_j\big\|\psi(Y)_j\big\| &&\text{(sigmoid is $\tfrac14$-Lipschitz; Cauchy-Schwarz inequality)}\\
&= \frac D4\sup_j\sup_{\|W_2\|\le D}\Big\|\sigma_{ReLU}\Big(\sum_i\tilde A_{ji}\,\sigma_{ReLU}\Big(\sum_k\tilde A_{ik}Y_kW_1\Big)W_2\Big)\Big\| \le \frac{D^2}4\sup_j\sum_i\tilde A_{ji}\,\Big\|\sigma_{ReLU}\Big(\sum_k\tilde A_{ik}Y_kW_1\Big)\Big\|\\
&\le \frac{D^3}4\sup_j\sum_i\tilde A_{ji}\,\Big\|\sum_k\tilde A_{ik}Y_k\Big\| &&\text{(Cauchy-Schwarz inequality)}\\
&\le \frac{D^3}4\Big(\max_j\sum_i\tilde A_{ji}\Big)^2\max_k\|Y_k\| \le \frac{D^3R\,(g_{\max}+1)}4. \tag{13}
\end{aligned}$$
We denote by $N(j)$ the index set of the one-hop neighbors of the $j$-th node, by $g_j$ the degree of node $j$, and by $g_{\max}$ the maximum node degree. Then, the row sums of $\tilde A$ are bounded as
$$\sum_i\tilde A_{ji} = \sum_{i\in N(j)\cup\{j\}}\frac{1}{\sqrt{g_j+1}\sqrt{g_i+1}} \le \frac{1}{\sqrt{g_j+1}}\Big(1+\sum_{i\in N(j)}1\Big) = \sqrt{g_j+1} \le \sqrt{g_{\max}+1}, \tag{14}$$
which is used in the last step of (13).

Combining Lemma 1 with inequalities (12), (13), (14) and (5), with probability at least $1-\delta$ we have
$$R(f)\le \hat R_D(f)+\frac{6\sqrt2\,\rho\Lambda D^5R^2(g_{\max}+1)}{\sqrt n}\Big(1+\log^{\frac12}(8e^2n^3c^3)\log\frac{M\sqrt n}{\rho B}\Big)+3M\sqrt{\frac{\log(2/\delta)}{2n}}.$$

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The paper's contributions and scope are reflected accurately by the main claims made in the abstract and introduction. Please see the Abstract and Introduction sections.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: This is a purely theoretical work which improves existing theoretical results. The assumptions involved in the theoretical results are explained in detail and are satisfied in the relevant settings. Please see the Conclusion section.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: This is a purely theoretical work which improves existing theoretical results. The assumptions involved in the theoretical results are explained in detail and are satisfied in the relevant settings. In the appendix, we give the detailed proofs of those theoretical results in the main paper.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [NA]
Justification: This is a purely theoretical work. This paper does not include experiments.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [NA]
Justification: This is a purely theoretical work. This paper does not include experiments requiring code.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: This is a purely theoretical work. This paper does not include experiments.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: This is a purely theoretical work. This paper does not include experiments.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: This is a purely theoretical work. This paper does not include experiments.
9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: This is theoretical research, and the research conducted in the paper conforms with the NeurIPS Code of Ethics.
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: This is theoretical research that does not have a potential negative societal impact.
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: This is a purely theoretical work. This paper poses no such risks.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: This is a purely theoretical work.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: This is a purely theoretical work.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: This is a purely theoretical work. This paper does not involve crowdsourcing nor research with human subjects.
15.
Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: This is a purely theoretical work. This paper does not involve crowdsourcing nor research with human subjects.